Collecting and Combining Messy Public Data
Do you need to collect and process some messy public data? Email me: email@example.com
This type of project has three main considerations:
1 - Data Procurement: How do we get the data?
2 - ETL: How do we combine and transform the data?
3 - Export & Analysis: How do we store the data so it is most useful downstream?
1- Data Procurement
We need to figure out whether we can download files directly, and whether we need to scrape any information. How often will we need to get this data? Will the collection frequency require us to adjust our pattern of collection (e.g. throttling requests to avoid getting blocked)?
In this case the data was published quarterly, so a small scraper that runs once every three months could extract the links from the publisher's page and download the files.
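The link-extraction step can be sketched with the standard library alone. This is a minimal illustration, not the actual scraper: the index URL, file extensions, and sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FileLinkExtractor(HTMLParser):
    """Collect absolute URLs of data files linked from an index page."""

    def __init__(self, base_url, extensions=(".csv", ".xlsx")):
        super().__init__()
        self.base_url = base_url
        self.extensions = extensions
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Keep only links that look like downloadable data files.
        if href.lower().endswith(self.extensions):
            self.links.append(urljoin(self.base_url, href))

# In production, the HTML would come from an HTTP GET of the publisher's
# index page, scheduled once a quarter; here it is inlined for illustration.
html = '<a href="/files/q1.csv">Q1 data</a> <a href="/about.html">About</a>'
parser = FileLinkExtractor("https://data.example.gov")
parser.feed(html)
print(parser.links)  # → ['https://data.example.gov/files/q1.csv']
```

Filtering by extension keeps the scraper robust to unrelated navigation links on the same page.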
2- ETL: Combining & Transforming
Data may be partitioned across different files and formats, and those files may not share a regular schema. In this step we need to figure out how to combine the data, how to clean it and assign types to the columns, and how to merge it into something usable. If the data is relatively small (<100 GB), you can write scripts in Python with pandas. If the data is big (>100 GB), you should consider other options, such as loading it into HDFS and using PySpark (or something like Apache Beam) to process it.
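The clean-and-merge pattern in pandas might look like the following sketch. The column names and values are invented to stand in for two quarterly files with inconsistent schemas.

```python
import pandas as pd

# Hypothetical quarterly extracts with inconsistent column names and
# string-typed numeric columns.
q1 = pd.DataFrame({"ID": ["001", "002"], "Amount": ["1,200", "950"]})
q2 = pd.DataFrame({"id": ["003"], "amount": ["2,100"]})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Normalise column names, then assign explicit types.
    df = df.rename(columns=str.lower)
    df["amount"] = df["amount"].str.replace(",", "", regex=False).astype(int)
    return df

# Clean each partition, then merge into one usable frame.
merged = pd.concat([clean(q1), clean(q2)], ignore_index=True)
print(merged["amount"].tolist())  # → [1200, 950, 2100]
```

Keeping the cleaning logic in one function per schema quirk makes it easy to reuse across partitions and to test in isolation.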
The customer expressed a preference for notebooks, so I wrote a set of notebooks, scheduled on Google Cloud, that would download, process, and merge the data: each read dataframes from disk, cleaned them and assigned types, then merged them.
The choice of final storage depends on how the data will be used (for example, whether it needs updates or is write-once/read-many), on performance and cost, and on what is convenient for the customer. Usually a data warehouse or database is ideal, but sometimes more elaborate solutions are required.
Here we decided to store the final merged datasets in Google BigQuery: the data needed few updates and mostly just had to sit in queryable form for the long periods between collections.
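Loading a merged CSV into BigQuery can be done with the `bq` command-line tool; this is a sketch only, with a hypothetical dataset, table, and bucket path.

```shell
# Load a merged CSV from Cloud Storage into BigQuery, letting BigQuery
# infer the schema. Dataset, table, and bucket names are placeholders.
bq load \
  --source_format=CSV \
  --autodetect \
  mydataset.merged_quarterly \
  gs://my-bucket/merged/latest.csv
```

For stable schemas, replacing `--autodetect` with an explicit schema definition avoids surprises when a quarter's file is slightly malformed.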