Wikidata on Azure - Part 2: Loading Wikidata dump into Data Lake Store

Towards analytics

In the first part of this series we copied the Wikidata dump to Azure storage (specifically, Blob Storage). Now we are ready to feed this data to various data processing services of Azure. Our first target will be Azure Data Lake Analytics - as the name suggests, it lets us run analytics-type workloads.

Data Lake Analytics can work with different data sources, among others Blob Storage. This sounds quite compelling, as our data currently resides in Blob Storage. However, there is another option, Azure Data Lake Store - an HDFS-compatible storage optimized for big data analytics workloads. Moreover, every Data Lake Analytics (ADLA) account must be connected to a Data Lake Store (ADLS), so if we are going to use ADLA, we will get ADLS anyway.

If you need extra input for deciding on the data source, there is a detailed comparison between ADLS and Blob Storage. As we want to run analytical queries, the row Analytics Workload Performance, which reads

Optimized performance for parallel analytics workloads. High Throughput and IOPS

for ADLS and

Not optimized for analytics workloads

for Blob Storage, clearly suggests that one should choose ADLS. However, if you use Blob Storage only to seed ADLS, e.g., with ADLA ETL tools, you will be fine with Blob Storage.

So, let's copy our data to ADLS.
Data Factory copy plan

Prerequisites

For this part, you will need

  • an Azure subscription - preferably the one that you created in the previous part of the guide. We will use the Blob Storage with the transferred Wikidata dump as well as the Data Factory.

Transfer Wikidata data dump to Azure Data Lake Store - using Azure Data Factory

After the first part, we already know a tool to move data - the Azure Data Factory (ADF). The Copy Data Tool of ADF can use Blob Storage as the source and ADLS as the destination.

Alternatively, one could use the AdlCopy tool for this copy.
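A third, programmatic route is to script the copy with the Azure SDKs. The sketch below is illustrative only: it routes the file through local disk, which is impractical for a dump of this size, uses placeholder account, container, and folder names, and assumes the ADLS account and the service principal credentials that we will set up later in this post.

```python
from azure.storage.blob import BlockBlobService          # azure-storage-blob 2.x
from azure.datalake.store import core, lib, multithread  # azure-datalake-store

# Download the compressed dump from Blob Storage to local disk
# (placeholder account, container, and blob names).
blob = BlockBlobService(account_name="<storage-account>", account_key="<storage-key>")
blob.get_blob_to_path("<container>", "latest-all.json.bz2", "latest-all.json.bz2")

# Authenticate against ADLS as the service principal created later in this post
# and upload the file into a hypothetical /wikidata folder.
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<application-id>",
                 client_secret="<authentication-key>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")
multithread.ADLUploader(adls,
                        rpath="/wikidata/latest-all.json.bz2",
                        lpath="latest-all.json.bz2",
                        overwrite=True)
```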

Provisioning Azure Data Lake Analytics and Azure Data Lake Store resources

We do not strictly need ADLA for this part, but it is very convenient to create ADLA and ADLS in one go. Furthermore, we will need ADLA for the next part anyway.

While creating ADLA, you will have to create ADLS (or point to an existing one). The creation process is very straightforward, but there are simple guides in the Azure documentation: for ADLA here and for ADLS here. While you are at it, create a new empty folder in ADLS for storing the Wikidata dump. You can do this on Azure Portal by following this guide.
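If you prefer not to click through the portal, the folder can also be created from Python with the azure-datalake-store package. This is only a sketch: the store name and folder name are placeholders, and the username/password sign-in shown here works only for accounts without multi-factor authentication.

```python
from azure.datalake.store import core, lib

# Sign in with your own AD user (placeholders; MFA-free account assumed).
token = lib.auth(tenant_id="<tenant-id>",
                 username="<user@yourtenant.onmicrosoft.com>",
                 password="<password>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

# Create the empty folder that will hold the Wikidata dump
# (hypothetical name, reused in the later sketches).
adls.mkdir("/wikidata")
print(adls.ls("/"))   # the new folder should show up here
```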

Granting ADF access to ADLS

Access to files stored in ADLS is governed by Azure Active Directory (AD). When a service wants to read or write a file in ADLS, it has to do so as an Active Directory service user. First, you have to create this service user, then grant it access to the empty folder you created earlier. Just follow this guide from steps 1 through 3. Make sure that you

  • save the tenant ID, the authentication key and the application ID in step 2
  • assign write and execute permissions to the empty ADLS folder
  • add this permission as "An access permission entry and a default permission entry"

In this example, the application is called datalakesvc and these are its permissions:
Data Lake Store - Folder permissions

You might wonder why we created an application instead of a user. What does an application even mean in the context of Azure AD? Well, there is an article on this topic in the Azure docs; I will just quote one sentence:

When you register an Azure AD application in the Azure portal, two objects are created in your Azure AD tenant: an application object, and a service principal object.

So, when you create your AD application, you create a new service user at the same time.

In order for ADF to play nicely with ADLS, we have to do one more thing: assign the Reader role on the ADLS account to the service principal. You can do this on the Access Control tab of the ADLS resource page.
Data Lake Store - Access Control
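Before moving on, it is worth checking that the service principal can actually write into the folder. Below is a minimal sketch using the azure-datalake-store package; the tenant ID, application ID, authentication key, store name, and the /wikidata folder are the placeholders used earlier. Listing the folder would additionally require read permission, which we did not grant, so the check only writes and removes a small test file.

```python
from azure.datalake.store import core, lib

# Authenticate as the service principal (values saved in step 2 of the guide).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<application-id>",
                 client_secret="<authentication-key>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

# Write and remove a tiny test file: this exercises the write and execute
# permissions assigned to the (hypothetical) /wikidata folder.
with adls.open("/wikidata/_permission_check.txt", "wb") as f:
    f.write(b"ok")
adls.rm("/wikidata/_permission_check.txt")
print("Service principal can write to the folder.")
```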

There are multiple ways to connect ADF with ADLS, you can read up on them here.

Now we are fully prepared to fire up the Copy Data Tool once again.

Using the Copy Data tool

As you are already familiar with the Copy Data Tool, we will jump right to the key steps. For reference, there is a similar guide on the Azure docs site.

On the first page of the wizard, you can name your task, add a description, and set the schedule. For the schedule, choose Run once now.

Set task properties in Copy Data Tool

In the Source step, select the blob storage from the existing connections tab.

Select Source in Copy Data Tool Confirm Source in Copy Data Tool

If you do not see the blob storage on the existing connections tab, you can add a new connection like in the destination step in the first part of this series.

Next, configure the source:

  • Browse the blob storage for the dump file.
  • Check binary copy as we do not want any additional parsing or processing.
  • Select the compression type according to the compression used by the dump file. Wikidata dumps are currently offered in GZip or BZip2 compressed flavors.

This last setting causes the dump to be decompressed during the copy. For further information on configuring (de)compression on the source and/or destination side, refer to this doc.
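To make it concrete, the sketch below shows what that setting amounts to conceptually: the copy streams the compressed dump through a GZip or BZip2 decompressor before writing it out. File names are placeholders; the real work, of course, happens inside Data Factory.

```python
import bz2
import gzip
import shutil

def decompress(src: str, dst: str) -> None:
    """Stream-decompress a .gz or .bz2 file to dst without loading it into memory."""
    opener = bz2.open if src.endswith(".bz2") else gzip.open
    with opener(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=16 * 1024 * 1024)

# Placeholder file names for the Wikidata JSON dump.
decompress("latest-all.json.bz2", "latest-all.json")
```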

Select source file in Copy Data Tool Configure source file in Copy Data Tool

Since binary copy is selected, you are taken straight to the Destination step. Select the Data Lake Store option on the Connect to a data store tab, then configure it as follows:

  • Network Environment - Public Network in Azure Environment
  • Data Lake selection method - From Azure subscriptions
  • Azure subscription - select the subscription where your ADLS is
  • Storage account name - select the ADLS resource
  • Authentication type - Service Principal
  • Tenant - paste the tenant ID that you saved in the granting access step
  • Service principal id - paste the application ID
  • Service principal key - paste the authentication key

You cannot move forward until the credentials of the service principal and its access to the ADLS are verified.

Set Destination in Copy Data Tool - step 1

On the next page:

  • Folder path - choose the empty folder created earlier in the ADLS
  • Compression type - we do not want to compress at the destination side

Set Destination in Copy Data Tool - step 2

For the Settings step, you can leave the options as-is. After reviewing the settings on the Summary page, you can start the task.

Check what we have done

Do not hold your breath while this task runs. For the dump dated 2018-01-22, it took nearly half a day and resulted in a 421 GB file. Anyway, sooner or later, the task should end successfully.
You can then check the target folder - hopefully not empty anymore - and look for the decompressed dump file.

Check decompressed file in ADLS

You can even peek into the file by clicking the "..." in the last cell (highlighted above) and selecting Preview.

Preview file in ADLS
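The same peek can be done programmatically. Here is a minimal sketch with the azure-datalake-store package, assuming an identity that has read permission on the file and the placeholder names used earlier. The Wikidata JSON dump is a single large array with one entity per line, so the first bytes should be an opening bracket followed by the first entity.

```python
from azure.datalake.store import core, lib

# Authenticate with an identity that can read the file (placeholders).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<application-id>",
                 client_secret="<authentication-key>")
adls = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

# Hypothetical path of the decompressed dump inside the target folder.
with adls.open("/wikidata/latest-all.json", "rb") as f:
    print(f.read(500).decode("utf-8", errors="replace"))
```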

Next time

At this point, we are ready to do some analytic processing. Data Lake Analytics can process the dump file as-is and answer analytic queries. However, life is not as simple as it should be - so there will be issues. But worry not, we will solve them, and in the end we will get to query away happily. See for yourself in the next part.

Stay tuned!

Acknowledgements

Thanks to moczard for validating the process.
