In the previous part of this series, we got to the point where we can read the rows of a file into an SSIS job. Now, we're going on with filtering these rows and extracting the most common words and phrases from them.
Filtering out unnecessary content from data
The next step is to do some data cleaning, because my lyrics files contain some unnecessary content that can distort the analysis. Just a few examples:
Original Liedtext
[Chorus:]
[Repeat chorus, 2nd verse, chorus]
The problem with these words is that they can get among the most common words, even though they don't belong to the real content, the lyrics. What we need is to throw out all the rows that contain unnecessary terms like “Liedtext” or “Chorus”. The solution for this is filtering, which can be done with the SSIS component Conditional Split (under Common): the rows that fulfill a condition go to a specified output, while the others have to take another way.
So, pull a Conditional Split element onto the Data Flow under the Lyrics file source component, and name it Filter chorus and liedtext out. After connecting the Lyrics file source to the Conditional Split component, the output rows of the Flat File Source become the input rows of the Conditional Split.
(Bonus: the data flow can be formatted to look good. Just select the whole data flow, click on the Format menu in Visual Studio, choose Width from the Make Same Size menu, and then select Centers under the Align menu. Then you will get a layout like the one in the picture below.)
After double-clicking on the Conditional Split, the window shown on the next picture opens, where you can configure it. In the table, you can add the component's different outputs and the conditions that input rows must fulfill to be written to each output. In the Condition field, you have to write an SSIS expression. Luckily, the boxes above the table help by providing the functions, operators, variables and the input columns, which you can drag and drop into the expression in the Condition field. Additionally, there is a Description box that shows information about the chosen function.
In our example, we only need two outputs: one for the rows that don't contain the words “Chorus” and “Liedtext”, and one for the others, which contain at least one of them. For the former group, add a new output to the table by typing its name into the Output Name column; let it be Text. The rows of the latter group will be the exceptions to the condition of the Text output, so they can go to the default output. Name the default output “Chorus and liedtext” at the bottom of the window.
Now, the only task remaining for the Conditional Split is to express the condition of the Text output. Here, the FINDSTRING function will help us; its description can be read in the Description box on the next picture. It searches for a given occurrence of a string parameter in the given text and returns its location within the text. So, the plan is to search for the first occurrence of each critical word in the text of the row and check that the function didn't find any of them, i.e. that their location is 0 (if the searched word starts at the first character, the function returns 1 as the location). Unfortunately, the function is case-sensitive, so we have to search for both the uppercase- and lowercase-initial versions of the critical words to filter them out. The full condition is this (“liedtext” didn't occur in my files, so I don't check it):
(FINDSTRING(Text,"chorus",1) == 0) && (FINDSTRING(Text,"Chorus",1) == 0) && (FINDSTRING(Text,"Liedtext",1) == 0)
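To make the condition's behavior concrete, here is a minimal Python sketch (not SSIS code, just an illustration with made-up sample rows) that mimics FINDSTRING's 1-based, case-sensitive matching and the filter logic above:

```python
# Hypothetical Python sketch mimicking the SSIS FINDSTRING function
# and the Conditional Split condition of the Text output.

def findstring(text: str, word: str, occurrence: int) -> int:
    """Return the 1-based position of the given occurrence of word in text,
    or 0 if not found (case-sensitive), like SSIS FINDSTRING."""
    pos = -1
    for _ in range(occurrence):
        pos = text.find(word, pos + 1)
        if pos == -1:
            return 0
    return pos + 1  # SSIS positions are 1-based


def goes_to_text_output(row: str) -> bool:
    # Same logic as the Conditional Split condition above
    return (findstring(row, "chorus", 1) == 0
            and findstring(row, "Chorus", 1) == 0
            and findstring(row, "Liedtext", 1) == 0)


rows = ["Original Liedtext", "[Chorus:]", "We are the hammer of the gods"]
print([r for r in rows if goes_to_text_output(r)])
```

Only the third sample row survives the filter, just as only the real lyric lines reach the Text output.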
Anyway, we could have written these outputs the other way around: the output containing the rows with critical words could have been the one defined in the table, where we would have checked whether the FINDSTRING functions return positive values, and we would have replaced the AND (&&) operators with OR (||) operators. In that case, the rows that fit our needs would have gone to the default output.
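For illustration, the condition of that reversed output would look like this (same FINDSTRING calls as before, but checking for positive locations and joined with OR):

(FINDSTRING(Text,"chorus",1) > 0) || (FINDSTRING(Text,"Chorus",1) > 0) || (FINDSTRING(Text,"Liedtext",1) > 0)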
The next task will be the peak of our data flow: analyzing the input text and returning the list of the most common words in it, along with their frequency points.
For this job, the suitable SSIS component is Term Extraction. Microsoft uses the word “term” to cover both words and phrases. The description of Term Extraction can be read on the next picture. As it states, it searches only for English terms with the help of an English dictionary; therefore, I recommended providing English-only texts to the flow at the beginning of the previous article. According to the description, the output consists of two columns: terms and their scores.
Now, drag-and-drop a Term Extraction component (from Other Transforms) to the design surface under the Conditional Split component, and name it Term extraction. Then, pull the arrow from the Conditional Split to the Term extraction, and then, you should see this window below:
This means that you can choose from two outputs, both coming from the Conditional Split component that you set up previously. Select Text, because it carries the cleaned rows we need, so they will be the inputs of the Term Extraction component.
Now, double-click on the Term Extraction component, and the window on the next screenshot comes up. Here, on the Term Extraction tab, you have to choose the column whose values you want to analyze with this component. So, check the only column, Text (this is not the same as the output chosen in the picture above, but the column coming from that output). At the bottom of the window, you can see the names of the two output columns, which you can change if you like.
Go to the Exclusion tab. As you can see here, you can work with exclusion terms, also known as stop-words. An exclusion term is a word or phrase that you want the analysis to skip, so that it doesn't appear in the results.
If we want to use exclusion terms, then we have to connect to a database, and choose a column from a database table to get the exclusion terms. For simplicity, we don’t do it.
We could have put the filtered “Chorus” and “Liedtext” words into a database table of stop-words, but I thought it would be overkill, and I wanted to show filtering in an SSIS job.
At the Advanced tab, you can see several settings you can play with to affect the results. At Term type, you can choose whether only nouns, only noun phrases, or both should be included in the results. Phrases consist of multiple words, and noun phrases can include adjectives or numbers, too.
Under Score type, you can select the scoring method of the terms: it can be Frequency or TFIDF. The former measures the occurrence number of the actual term, while the latter has the formula below:
TFIDF of a Term T = (Frequency of T) * log( (#rows in Input) / (#rows having T) ) 
This means that TFIDF also takes the density of a given term in the text into consideration. The more the occurrences of a term are concentrated in relatively few rows, the higher its TFIDF score will be. The TFIDF score of a term is 0 if every row contains it, since then the argument of the logarithm is 1.
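To see the formula in action, here is a small Python sketch (not the component's internal code, just the formula above applied to a made-up four-row input):

```python
# Hypothetical sketch of the TFIDF formula above on a tiny made-up input.
import math

rows = [
    "warriors ride into the battle",
    "warriors of the world",
    "ride the wind",
    "hail and kill the foes",
]

def tfidf(term: str, rows: list[str]) -> float:
    frequency = sum(r.split().count(term) for r in rows)  # total occurrences
    rows_having = sum(term in r.split() for r in rows)    # rows containing the term
    return frequency * math.log(len(rows) / rows_having)

print(round(tfidf("warriors", rows), 3))  # in 2 of 4 rows: 2 * log(2) ≈ 1.386
print(tfidf("the", rows))                 # in every row: log(1) = 0, so 0.0
```

Note how “the” gets a score of 0 exactly because it occurs in every row, as described above.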
Under Parameters, we can set a threshold for the frequency score and the maximum length of the extracted terms. For example, if you change the frequency threshold to 3 and have Frequency set as the score type, then a term won't appear in the results if it occurs fewer than 3 times in the input text. With the default “Maximum length of term” setting, a word or phrase consisting of more than 12 characters won't be in the results either.
At Options, you can enable case-sensitive term extraction if you want, but in our case, it's better to analyze the words in a case-insensitive way.
After clicking OK, the data flow should look like the one on the picture below. Before continuing to build the data flow, let's make clear what deficiencies Term Extraction has.
The first one is that under Term type, we can't choose verbs or verb phrases, so they won't be in the results, though these could have been interesting, too. For example, I would be curious how often Manowar uses the verbs „kill” or „die”, but I would exclude too common verbs like “have” or “be”.
The other disadvantage of the term extractor is that it only works on English texts. This is because the term extraction algorithm relies heavily on its internal English dictionary and English grammar to tag the found words with parts of speech, taking English plural forms into consideration, and it throws out any word it doesn't recognize as an English noun, adjective or number. If we ran this algorithm on a non-English text, it would throw out everything. So, if we wanted a similar component to analyze, for example, a Hungarian text, we would have to change the algorithm to take e.g. Hungarian suffixes and plural forms into consideration, and we would have to create a Hungarian dictionary.
If we used a simpler solution, like counting every word of a text and listing them with their occurrence numbers, then a lot of irrelevant words would be there. So, in this case, we would have to create a list of all stop-words or all accepted words, which would be an incredibly big and hard job.
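Such a naive counting approach could be sketched in a few lines of Python (an illustration with a made-up line of text and a deliberately tiny stop-word list; a real one would need to be far longer, which is exactly the problem):

```python
# Hypothetical sketch of the naive word-counting approach described above.
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "to", "in"}  # would need to be much bigger

text = "war gods ride to war and the gods of war ride on"
words = [w for w in text.lower().split() if w not in STOP_WORDS]
print(Counter(words).most_common(3))
```

Even here, a filler word like “on” slips through because it isn't in our small stop-word list, which shows why Term Extraction's built-in dictionary saves a lot of manual work.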
In this article, we filtered the unnecessary rows out of the analyzable content, and then set up the Term Extraction component to find the most common terms of the text and calculate their frequency points. The next article will be about writing errors and results to files.