Wikidata on Azure - Part 1: Seeding from Wikidata data dump

Preface

This series is a guide to hosting the Wikidata knowledge base in the Microsoft Azure cloud. Wikidata is a Wikimedia project that aims to be a collaboratively edited, structured database for the factual data behind Wikipedia content. Thanks to this structured nature, the collaboration is not limited to humans aided by some markup-crunching bots: more sophisticated data processing tools can also be involved to query, enrich, validate, and transform the knowledge.
wikidata logo

There are lots of ways that humans and machines can interact with Wikidata: humans can view and edit data through various web interfaces, programs can query the public query endpoint and edit via the bot interface, there are regular data dumps, and you can even host your own query server. And these are far from all the possibilities.

In this series, we are building a cloud ecosystem where you can host your own Wikidata playground. The ecosystem is basically a set of Azure resources configured and connected to serve diverse types of workloads, like OLTP, analytical queries, or full-text search.

Target audience

This series is for you if

  • as a researcher or data scientist, you want a solid, scalable, and high-performance backend for analyzing or mining Wikidata data
  • as a developer, you want a self-hosted Wikidata service for developing a Wikidata application
  • you just want to learn more about Azure's data processing options

Prerequisites

Throughout the series, you will need an Azure subscription.

Transfer the Wikidata data dump to Azure - using Azure Data Factory

Wikidata data dumps are generated regularly, and you can download them in various formats. We will work with the json dump, as it has a stable format and comes closest to Wikidata's internal storage format: no additional mappings are involved in its generation. For further info about the specifics of the formats, please refer to the documentation.
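
As a quick aside on what that format looks like: the json dump is a single JSON array with one entity object per line, which makes it easy to stream. Below is a minimal sketch for peeking at it in Python, assuming you have (a slice of) the bz2 dump at hand locally; the file name is a placeholder.

```python
import bz2
import json

# Hypothetical local path to (a slice of) the bz2 json dump.
DUMP_PATH = "latest-all.json.bz2"

# The dump is one big JSON array with one entity object per line,
# so it can be streamed line by line without loading it all into memory.
with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the opening/closing brackets of the array
        entity = json.loads(line)
        print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
        break  # peek at the first entity only
```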

There are multiple tools in the Azure portal and in Visual Studio for uploading files to Azure. So we could simply download the dump to our dev box and then upload it. However, the json dumps weigh in at 20-30 GB (depending on the compression algorithm); consider the time and bandwidth wasted compared to a direct transfer into the cloud.
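
To get a feel for the numbers before committing to anything, you can ask the dump server for the file size without downloading a byte. Here is a small sketch; the exact dump URL is an assumption, so check the dumps site for the current file names.

```python
import requests

# Assumed dump URL - check https://dumps.wikimedia.org/wikidatawiki/entities/
# for the file matching the format and compression you picked.
DUMP_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

# A HEAD request fetches only the headers, so no dump bytes are transferred.
response = requests.head(DUMP_URL, allow_redirects=True)
response.raise_for_status()

size_bytes = int(response.headers.get("Content-Length", 0))
print(f"Dump size: {size_bytes / 1024 ** 3:.1f} GiB")
```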

Enter Azure Data Factory (ADF),

a cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines.

Let's concentrate on the movement part. With ADF we can move data between Azure services, or between Azure services and services outside Azure, such as external HTTP services like the website serving the Wikidata data dumps. On the destination side, there are various storage services in Azure. We will not go into their details here, just pick the simple, go-to solution for storing files: blob storage.

So, we plan to use ADF to download the data dump from the Wikidata web server and transfer it to Azure Blob Storage. Let's dive in!
Data Factory copy plan

Creating Azure Blob Storage and Azure Data Factory v2

First, we have to provision the following Azure resources:

  • Blob Storage
  • Data Factory v2 resource

There is a nice quickstart tutorial that will guide you through; just follow it until the Start the Copy Data tool step. While initializing your new storage account, you do not need to save your access key or upload a file, just create a container (folder) named something like wikidata-json-dump.
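
If you prefer scripting over clicking, the same two resources can also be provisioned with the Azure management SDKs for Python. The following is only a rough sketch: the subscription, resource group, names, and region are placeholders, it assumes an existing resource group, and method names can differ slightly between package versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholders - substitute your own values.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "wikidata-rg"
LOCATION = "westeurope"
STORAGE_ACCOUNT = "wikidatadumpstore"
FACTORY_NAME = "wikidata-adf"

credential = DefaultAzureCredential()

# Create the storage account and the container that will hold the dump.
storage_client = StorageManagementClient(credential, SUBSCRIPTION_ID)
storage_client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    STORAGE_ACCOUNT,
    {
        "location": LOCATION,
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
    },
).result()
storage_client.blob_containers.create(
    RESOURCE_GROUP, STORAGE_ACCOUNT, "wikidata-json-dump", {}
)

# Create the Data Factory v2 resource.
adf_client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)
adf_client.factories.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, Factory(location=LOCATION)
)
```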

Using the Copy Data tool

The Copy Data tool offers a step-by-step wizard interface for assembling copy pipelines without writing any code. To start the wizard, first select the Author & Monitor option, which takes you to the Azure Data Factory web UI, then on the welcome page, select the Copy Data option.
Launch Data Factory UI
Launch Copy Tool

On the first page of the wizard, you can name your task, add a description, and set a schedule. For the schedule, choose Run once now.

Set task properties in Copy Data Tool

In the Source step, select the HTTP option on the Connect to a data store tab, then configure the connection with the URL of the json dump and make sure the binary copy option is selected, so the compressed file is transferred as-is.

You do not have to change any setting in the Advanced Settings section.
Set Source in Copy Data Tool

Since binary copy is selected, you get right to the Destination step. Select the Azure Blob Storage option on the Connect to a data store tab, then configure it as follows:

  • Network selection method - Public Network in Azure Environment
  • Account selection method - From Azure subscriptions
  • Azure subscription - select the subscription where your blob storage is
  • Storage account name - select the blob storage resource

Set Destination in Copy Data Tool - step 1

On the next page:

  • Folder path - choose the folder created earlier in the blob storage
  • Compression type - None (dump is already compressed)

Set Destination in Copy Data Tool - step 2

For the Settings step, you can leave the options as-is. Finally, on the Summary page, you can review the task.
Summary in Copy Data Tool

If you want to make changes, this is the last step where you can go back with the Previous button. If everything looks good, choose Next. This will prepare and start the job.
Deployment in Copy Data Tool
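
For completeness: what the wizard assembles behind the scenes is a regular ADF pipeline with a single copy activity, and roughly the same thing can be authored in code with the Data Factory Python SDK. The sketch below is an approximation, not a transcript of what the wizard generates; the URL, resource names, and some model class names are assumptions and may differ between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureBlobStorageLinkedService, BlobSink, CopyActivity,
    DatasetReference, DatasetResource, HttpDataset, HttpLinkedService,
    HttpSource, LinkedServiceReference, LinkedServiceResource,
    PipelineResource, SecureString,
)

# Placeholders - reuse the names from the provisioning step.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "wikidata-rg"
FACTORY_NAME = "wikidata-adf"
DUMP_BASE_URL = "https://dumps.wikimedia.org/"                    # assumption
DUMP_RELATIVE_URL = "wikidatawiki/entities/latest-all.json.bz2"   # assumption
BLOB_CONNECTION_STRING = "<storage-connection-string>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Linked services: the HTTP source (the dump server) and the blob destination.
adf.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "WikidataDumpServer",
    LinkedServiceResource(properties=HttpLinkedService(
        url=DUMP_BASE_URL, authentication_type="Anonymous")))
adf.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "DumpBlobStorage",
    LinkedServiceResource(properties=AzureBlobStorageLinkedService(
        connection_string=SecureString(value=BLOB_CONNECTION_STRING))))

# Datasets: the file on the dump server and the target folder in the container.
adf.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "WikidataJsonDumpFile",
    DatasetResource(properties=HttpDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="WikidataDumpServer"),
        relative_url=DUMP_RELATIVE_URL)))
adf.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "WikidataJsonDumpBlob",
    DatasetResource(properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="DumpBlobStorage"),
        folder_path="wikidata-json-dump")))

# A single copy activity moves the bytes from the HTTP source to the blob sink.
copy = CopyActivity(
    name="CopyDumpToBlob",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="WikidataJsonDumpFile")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="WikidataJsonDumpBlob")],
    source=HttpSource(),
    sink=BlobSink())
adf.pipelines.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "CopyWikidataDump",
    PipelineResource(activities=[copy]))

# Trigger the same "run once now" behavior as the wizard.
run = adf.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, "CopyWikidataDump", parameters={})
print("Started pipeline run:", run.run_id)
```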

Monitoring Data Factory jobs

At the bottom of the Deployment page, there is a button that takes you to the Monitor tool page. You can also switch to this page by selecting it on the welcome page.
Data Factory Monitor tool shortcut

As the name suggests, this tool lets you monitor your running jobs. There are a couple of interesting properties you can examine, like job status, duration, and error info.

Data Factory monitoring job
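
The same information can be pulled programmatically. Here is a hedged sketch that queries the recent pipeline runs through the Python SDK; the resource names are the placeholders used in the earlier snippets.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "wikidata-rg"
FACTORY_NAME = "wikidata-adf"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# List the pipeline runs of the last day and print the properties the
# Monitor page shows: pipeline name, status, start and end time.
filters = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
)
runs = adf.pipeline_runs.query_by_factory(RESOURCE_GROUP, FACTORY_NAME, filters)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_start, run.run_end)
```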

For a more detailed description of the monitor tool, please refer to the quickstart tutorial.

Check what we have done

A couple of hours later, the job should end successfully. Yaay!
Data Factory monitoring - job success

You can check that the blob has been downloaded by navigating to the blob storage blade, then selecting Containers.
Blob storage - check file

Note that multiple levels of subfolders were generated, corresponding to the path part of the HTTP URL.
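
You can run the same check from code with the Storage SDK for Python; here is a small sketch, with the connection string and container name as placeholders.

```python
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<storage-connection-string>"
CONTAINER = "wikidata-json-dump"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER)

# The blob names contain the path part of the HTTP URL as virtual folders.
for blob in container.list_blobs():
    print(f"{blob.name}  ({blob.size / 1024 ** 3:.1f} GiB)")
```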

Next time

We now have the data dump stored in Azure, so whenever we want to use this data in various Azure services, we can fetch it from blob storage instead of downloading it from an external website every time.

Our next step is to prepare the data for analytical querying in Azure Data Lake Analytics. To do this, we will copy the data into Azure Data Lake Store and perform a trivial processing step along the way: decompression.

Stay tuned!

Acknowledgements

Thanks to moczard for validating the process.