This series serves as a guide to hosting a knowledge base called Wikidata in Microsoft Azure Cloud. Wikidata is a Wikimedia project aiming to be a collaboratively edited, structured database for the factual data part of the Wikipedia content. Because the data is structured, collaboration is not limited to humans aided by some markup-crunching bots: more sophisticated data processing tools can also be involved to query, enrich, validate, and transform the knowledge.
There are lots of ways that humans and machines can interact with Wikidata: humans can view and edit data through various web interfaces, and programs can query through the public query endpoint and edit via the bot interface. There are also regular data dumps, or you can host your own query server. And these by no means cover all the possibilities.
In this series, we are building a cloud ecosystem where you can host your Wikidata playground. The ecosystem is basically a set of Azure resources, configured and connected to serve diverse types of workloads, such as OLTP, analytical queries, or full-text search.
This series is for you, if
- as a researcher or data scientist, you want a solid, scalable and high-performance backend for analyzing or mining Wikidata data
- as a developer, you want your self-hosted Wikidata service to develop a Wikidata application
- you would just like to learn more about Azure's data processing options
Throughout the series, you will need
- an Azure subscription. If you don't have one, you can sign up for a free trial. Researchers can apply for a research grant.
Transfer Wikidata data dump to Azure - using Azure Data Factory
Wikidata data dumps are generated regularly, and you can download them in various formats. We will work with the JSON dump because it has a stable format and comes closest to Wikidata's storage format: no additional mappings are involved in the generation process. For further details on the formats, please refer to the documentation.
There are multiple tools in the Azure portal and in Visual Studio for uploading files to the Azure Cloud, so we could simply download the dump to our dev box and then upload it. However, JSON dumps are 20-30 GB in size (depending on the compression algorithm). Consider the time and bandwidth wasted by that detour compared to a direct transfer.
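To put a rough number on that detour, here is a back-of-the-envelope sketch. The 30 GB size comes from the range quoted above; the link speeds (100 Mbit/s down, 20 Mbit/s up) and the function names are purely illustrative assumptions, and protocol overhead is ignored.

```python
def transfer_seconds(size_gb: float, mbit_per_s: float) -> float:
    """Idealized time to move size_gb (decimal gigabytes) over a link
    running at mbit_per_s megabits per second, ignoring all overhead."""
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return size_megabits / mbit_per_s


def via_dev_box_seconds(size_gb: float, down_mbit_per_s: float,
                        up_mbit_per_s: float) -> float:
    """Download to the dev box first, then upload to Azure:
    the same bytes cross the local connection twice."""
    return (transfer_seconds(size_gb, down_mbit_per_s)
            + transfer_seconds(size_gb, up_mbit_per_s))


if __name__ == "__main__":
    # Hypothetical home/office connection: 100 Mbit/s down, 20 Mbit/s up.
    total = via_dev_box_seconds(30, down_mbit_per_s=100, up_mbit_per_s=20)
    print(f"Detour through the dev box: {total / 3600:.1f} hours")
```

Under these assumed speeds the upload leg dominates, which is exactly the part a direct server-to-cloud transfer avoids.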
Enter Azure Data Factory (ADF),
a cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines.
Let's concentrate on the movement part. With ADF we can move data between Azure services, or between Azure services and services outside Azure, such as external HTTP services like the Wikidata data dump website. On the destination side, there are various storage services in Azure. We will not go into their details; we just pick the simple, go-to solution for storing files: the blob storage.
So, we plan to use ADF to download the data dump from the Wikidata web server and transfer it to Azure Blob Storage. Let's dive in!
Creating Azure Blob Storage and Azure Data Factory v2
First, we have to provision the following Azure resources:
- Blob Storage
- Data Factory v2 resource
There is a nice quickstart tutorial that will guide you through this; just follow it until the step Start the Copy Data tool. While initializing your new storage account, you do not need to save your access key or upload a file; just create a container (folder) named wikidata-json-dump or something similar.
Using the Copy Data tool
The Copy Data tool offers a step-by-step wizard interface to assemble copy pipelines without writing any code. To start the wizard, first select the Author & Monitor option, which takes you to the Azure Data Factory web UI, then select the Copy Data option on the welcome page.
On the first page of the wizard, you can name your task, add a description, and set a schedule. For the schedule, select Run once now.
In the Source step, select the HTTP option on the Connect to a data store tab, then configure it as follows:
- DataSourceType - Cloud Data Source
- URL - full URL pointing to the compressed JSON dump file, e.g., https://dumps.wikimedia.org/wikidatawiki/entities/20180122/wikidata-20180122-all.json.bz2
- Check binary copy as we do not want any additional parsing or processing
- Server Certificate Validation - Enable
- Authentication type - Anonymous
You do not have to change anything in the Advanced Settings section.
Because binary copy is selected, you go straight to the Destination step. Select the Azure Blob Storage option on the Connect to a data store tab, then configure it as follows:
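If you repeat this for a newer dump, the URL field is the only value that changes. The helper below is a minimal sketch that rebuilds the URL from a dump date. The pattern is inferred solely from the example URL above, the function name is hypothetical, and a dump is not guaranteed to exist for every date or compression suffix, so check the dump website before pasting the result into the wizard.

```python
from datetime import date

# Root taken from the example URL in the list above.
DUMP_ROOT = "https://dumps.wikimedia.org/wikidatawiki/entities"


def dump_url(dump_date: date, compression: str = "bz2") -> str:
    """Build the URL of the full JSON dump for a given date.

    Assumption: every dump follows the same naming pattern as the
    20180122 example. The server is not contacted to verify this.
    """
    stamp = dump_date.strftime("%Y%m%d")
    return f"{DUMP_ROOT}/{stamp}/wikidata-{stamp}-all.json.{compression}"


if __name__ == "__main__":
    print(dump_url(date(2018, 1, 22)))
```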
- Network selection method - Public Network in Azure Environment
- Account selection method - From Azure subscriptions
- Azure subscription - select the subscription where your blob storage is
- Storage account name - select the blob storage resource
On the next page:
- Folder path - choose the folder created earlier in the blob storage
- Compression type - None (dump is already compressed)
For the Settings step, you can leave the options as-is. Finally, on the Summary page, you can review the task.
If you want to make changes, this is the last point where you can step back with the Previous button. If everything looks good, choose Next. This will prepare and start the job.
Monitoring Data Factory jobs
At the bottom of the Deployment page, there is a button that takes you to the Monitor tool page. You can also switch to this page by selecting it on the welcome page.
As the name suggests, this tool lets you monitor your running jobs. You can examine a couple of interesting properties, such as job status, duration, and error info.
For a more detailed description of the monitor tool, please refer to the quickstart tutorial.
Check what we have done
A couple of hours later, the job should end successfully. Yaay!
You can check that the blob has been downloaded by navigating to the blob storage blade and selecting Containers.
Note that multiple levels of subfolders are generated, corresponding to the path part of the HTTP URL.
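To know where to look inside the container, you can derive the expected nesting from the source URL. This is a minimal sketch that assumes the copy mirrors the full URL path under the folder you chose, as described above; the function name is hypothetical and nothing here talks to Azure.

```python
from urllib.parse import urlparse


def expected_blob_path(source_url: str) -> str:
    """Return the path part of the URL without the leading slash.

    Assumption: the blob ends up under this path, relative to the
    folder selected in the Destination step.
    """
    return urlparse(source_url).path.lstrip("/")


if __name__ == "__main__":
    url = ("https://dumps.wikimedia.org/wikidatawiki/entities/"
           "20180122/wikidata-20180122-all.json.bz2")
    path = expected_blob_path(url)
    # Every segment except the last is a folder level; the last is the file.
    *folders, file_name = path.split("/")
    print("folders:", folders)
    print("file:", file_name)
```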
We have the data dump stored in Azure, so if we want this data utilized in various Azure services, we can fetch it from blob storage instead of downloading it every time from an external website.
Our next step is to prepare the data for analytic querying in Azure Data Lake Analytics. To do this, we will copy the data to Azure Data Lake Store and perform a trivial processing step, decompression, along the way.
Thanks to moczard for validating the process.