Have you always wanted to know which words or phrases are most commonly used by your favorite singer or band? Or are you curious about the same question about your favorite book? Or would you analyze articles or any texts from this point of view? In this blogpost series, I’ll show you how to answer these questions by building a Microsoft SQL Server Integration Services (SSIS) job for this purpose, while I'll lead you through some basics of it.
My motivation to this SSIS job was that I noticed that maybe no other heavy metal bands repeat themselves so much in their lyrics than the band Manowar! Their lyrics are a bit like they would stand of the same 20-25 words in another combinations. So, it made me curious, which words or phrases do they use most. Therefore, this article series focuses on analyzing lyrics.
My SSIS solution for this task can be found on my GitHub site here: https://github.com/cherryBomba58/most-common-words-in-texts. In this repository, four packages can be found, their names are Part 1-4. In this three-part article series, I will only describe the installation of SSIS and the creation of the first package, named Part 1. I will write about the other packages in a later article series.
In this part of the series, I’ll show you how to install SSIS, how to start building an SSIS job and how to read content of a file. So, let’s see it, step by step!
Installing SSIS, new solution
If you want to use SSIS, and you have Visual Studio 2017, then you have to install SQL Server Data Tools (SSDT), and in the installer, you have to choose SQL Server Integration Services. However, if you have Visual Studio 2019, then you don’t have to install SSDT, but the SQL Server Integration Services Projects extension from the marketplace.  At the time of writing, the extension is only in Preview status, and Microsoft does not recommend using it for production. It can be downloaded from here: https://marketplace.visualstudio.com/items?itemName=SSIS.SqlServerIntegrationServicesProjects. For creating my project, I used Visual Studio 2019 and the SSIS Projects extension, so my screenshots are from there, too.
After installing SSIS to Visual Studio, you can create new SSIS projects now. Search for the Integration Services Project template and select it.
Name the new project MostCommonWordsFinder and place it where you want.
After creating the solution, the screen looks like this in Visual Studio.
As you can see in the Solution Explorer on the right side, there is a Package.dtsx file, which will contain our SSIS job. You can create more packages under a project. On the left side, the SSIS Toolbox can be found, which contains the components you can drag-and-drop on the middle of the screen, the design interface of the packages.
Now, open File Explorer and let’s create a data directory into the root of the solution (where the solution file can be found). This will contain all the files we will work with in the SSIS jobs. The analyzable texts will be here, too, under the texts directory.
For the task, I had lyrics of Manowar songs on my computer in txt format, but only from three of their albums: Kings of Metal, Warriors of the World and Gods of War. The lyrics of these albums are very typical of the band, so they are a good point to start. I copied the txt files into the data/texts path, but for legal reasons, I didn’t upload them to my GitHub repository, so you can only see an empty data/texts directory on it (OK, .gitkeep is there to keep the directory in the version control).
However, the SSIS job will work on every kind of free-form texts. So, you don’t have to get lyrics from Manowar, they can be from another artist(s), too, but an article or a book or even a random text can be also a good content to analyze, but in all cases, I strongly recommend the texts to be in English (I will write about this in the next article). So, make some txt files, that contain the texts to be analyzed, and copy them to the data/texts directory.
Analyzing lyrics of one song
The first SSIS job we will make will only analyze the content of one specific file from the data/texts directory.
First, rename Package.dtsx to Part1.dtsx. As you can see on the design interface (on the next picture), there are some tabs, but for us, Control Flow and Data Flow tabs are the most important. The difference between them is that under Control Flow, you can decide the structural form of your job, where the data flows and other tasks will be the nodes, so it is the high-level process. Meanwhile, under the Data Flow tab, you can plan the low-level data flows, where the reading, transforming and writing of data will happen.
Next, drag-and-drop a Data Flow Task from the SSIS Toolbox to the Control Flow, and rename it to Lyrics analyzing.
Then, double-click on this Data Flow Task, or go to the Data Flow tab, and you’ll get the design surface of the Data Flow Task. As you can see in the picture below, here, the SSIS toolbox contains other elements than by the Control Flow. This is because on Control Flow, only task or structural (e.g. foreach) components can be used, while on Data Flow, data sources, data destinations and transformations are the appropriate elements.
Reading data from a file
As the first step of building our data flow, we will read the analyzable content from a file. For this, drag-and-drop a Flat File Source (it may be under Other Sources in the SSIS Toolbox) to the Data Flow design interface, and name it Lyrics file source. This component is responsible for reading data from a simple file like txt or csv files. Then, after double-clicking on the Flat File Source, we can set it on the window you can see on the next screenshot. As the warning box tells us on this editor, we have to create a new flat file connection manager to connect to the simple file we want to read from, so click on the New… button.
On the Flat File Connection Manager Editor, firstly, give a meaningful name to the manager we are creating, so let’s name it One source file manager. Then, choose the file that contains the lyrics of the Hand of Doom Manowar song for example. Its path will be in the File name field. Then, set the Locale to the language of the file (English US), and the Code Page to the character coding of the file (UTF-8). Take the check out of the “Column names in the first data row” setting, because the lyrics file doesn’t contain column names.
Leave the default delimited format, because the file has no fix width, and it’s not ragged right (every column has fix width, except the last one ), but its width is flexible. The text qualifier remains none, because the texts in the file are not surrounded by special characters like apostrophe or quotation marks, that could define each literal.  The header row delimiter remains the default, because I don’t have header in the file, and therefore the number of the skippable header rows is 0.
After stepping to the Columns page, we can see what the connection manager extracted from the given file. As you can see on the next screenshot, it could read the full text, and it has one column as we expected, but the name of this column is Column 0, which is not a meaningful name.
In this page, you could set the row and column delimiter characters, if the format of the read text is not what you wanted. However, SSIS finds out in most cases, which characters are the row and column separators, for example, it recognizes that a typical csv file has semicolons as column separators.
No change needed in this page, because the rows are well separated, and the comma does not interfere, because it does not separate the rows into more columns.
We can refine the settings of the column under the Advanced page. SSIS helps you, because it finds out how many columns are in the file because of the separators, and it recognizes their types and settings, too, but these are only recommendations, we can change them. Rename the column from Column 0 to Text, set its DataType from string to Unicode string, and set the output column width from 50 to 100 (or the maximum width you expect from the rows of your file). I extended the width because the file reading can cut the rows, and in the result rows, the maximum number of the characters is only the set width. I expect the rows being maximum 100 characters long. The Unicode string setting is only a conformation to the UTF-8 coding.
On the next page, Preview, you can verify your rows and columns. If you find everything right, then you can click OK.
After that, your Flat File Connection Manager is created. We are in the Flat File Source Editor again.
Under the Columns page, you can verify what will happen to each column. Here, you could change the name of the output column name, because this page is for mapping the columns of the read input file to the output columns of the component, but for the simplicity, we don’t change here anything.
Under the page Error Output, we can change the behavior of the component, when there is a failure in it. One option is, that the whole component can fail, and then the job will stop with error. Or, it can ignore the failure, and the job can carry on running. Or, it can redirect the row, for example, to write the faulty rows to another file. Leave the default Fail component setting, so the job will fail by the failure of the component.
By clicking OK, we have a Flat File Source in the Data Flow Task, so we can read the rows of the chosen file, and these will travel to the next component of the Data Flow, or another name, an ETL (extract, transform, load) flow.
In this article, we installed SSIS and started to build an SSIS job, where we can read the analyzable content from a simple file. In the next article, I will write about filtering the rows of the content and extracting the most common words of it.