Collecting and Combining Messy Public Data


Do you need to collect and process some messy public data? Email me:

Technologies: Python, Vertex AI + Jupyter, Pandas, Google BigQuery, Google Cloud Storage

This type of project has three main considerations:

1 - Data Procurement: How do we get the data?
2 - ETL: How do we combine and transform the data?
3 - Export & Analysis: How do we store the data so it can be used effectively?

1- Data Procurement

We need to figure out whether we can download files directly, and whether we need to scrape any information. How often will we need to get this data? Will the collection frequency require us to alter our access pattern (e.g. to avoid getting blocked)?

Here the data is published quarterly, so a small scraper that runs once every three months can extract the links from the listing page and download the files.
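A minimal sketch of that link-extraction step, using only the Python standard library. The page URL and the .csv extension are illustrative assumptions, not details from the actual project:

```python
# Sketch: collect downloadable file links from a quarterly listing page.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect hrefs that point at downloadable data files."""

    def __init__(self, base_url: str, suffix: str = ".csv"):
        super().__init__()
        self.base_url = base_url
        self.suffix = suffix
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.endswith(self.suffix):
            # Resolve relative links against the listing page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

In production this would fetch the listing page with urllib.request on a quarterly schedule and then download each extracted link.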


2- ETL: Combining & Transforming

Data may be partitioned across different files and formats, and those files may not have consistent schemas. In this step, we need to figure out how to combine the data, how to clean the columns and assign types to them, and then merge everything into something usable. If the data is relatively small (<100GB), you can write scripts in Python with pandas. If the data is big (>100GB), you should consider other options, such as loading the data into HDFS and processing it with PySpark (or something like Apache Beam).

The customer expressed a preference for notebook formats, so I wrote a set of notebooks we could schedule on Google Cloud to download, process, and merge the data. The notebooks ran a set of functions that cleaned the data, assigned column types, and merged the dataframes read from disk.


3- Exporting

The choice of final storage depends on how the data will be used (whether it needs updates or is write-once/read-only, for example), on performance and cost, and on convenience for the customer. Usually a data warehouse or database is ideal, but sometimes more elaborate solutions are required.

Here we decided to store the final merged datasets in Google BigQuery: the data needed few updates, but it did need to remain queryable for the long stretches between collections.
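The load itself might look something like the sketch below, using the google-cloud-bigquery client. The table-naming convention, dataset name, and CSV settings are assumptions for illustration, and the code presumes credentials are already configured:

```python
# Sketch: load a merged quarterly CSV into a BigQuery table.


def quarter_table_id(project: str, dataset: str, year: int, quarter: int) -> str:
    # Hypothetical naming convention, e.g. "my-project.public_data.filings_2023q4".
    return f"{project}.{dataset}.filings_{year}q{quarter}"


def load_quarter(csv_path: str, table_id: str) -> None:
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,       # skip the header row
        autodetect=True,           # let BigQuery infer the schema
        # Replace the table on each run; the data is write-once between quarters.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    with open(csv_path, "rb") as f:
        job = client.load_table_from_file(f, table_id, job_config=job_config)
    job.result()  # block until the load job finishes
```

WRITE_TRUNCATE fits the write-once pattern here; a dataset that received incremental updates would use WRITE_APPEND or a MERGE statement instead.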