Redshift to Mongo Data Pipeline Rewrite
Do you need help with a data pipeline rewrite? Email me: email@example.com
To help a customer fix their data pipelines it is important to ask three questions:
1 - What does the customer want?
2 - How to diagnose the barrier preventing this?
3 - How to design a solution to overcome this barrier?
1 - Find what the customer wants
Often a customer might not have the ability or time to understand why their application isn't working. That's ok-- it's my job to help them figure that out.
Here the issue proposed by the client was a series of crashing python scripts and cryptic error reports caused by excessive use of RAM. Of course, if your code pulls batched data from a database but still uses too much memory it's often an indication that some part of the code is not working right. The customer wanted this problem resolved so as not to receive email crash reports from failed data scripts.
Ok, clear target goals: get the scripts running without crashes, and no more error reports.
I used profilers (e.g. scalene, pympl, etc) to understand where and when the script exceeded its RAM limits. Sometimes our database cursor would allocate memory that was getting fragmented. Our cursor was pulling in too much data because it was improperly configured. Occasionally these batches would explode in size. This first problem had a straightforward solution.
I found a more serious set of issues when I investigated a script intended to move and keep up-to-date a quarter billion records. Problems here involved lengthy runtimes from a complicated SQL query getting killed by the remote database, excessive RAM usage on a remote database getting OOM-reaped, and large amounts of filtering happening on our RAM-bound server, as well as a slow query speed due to the complexity of a generated query.
To fix the first problem I wrote code to estimate the batch sizes based on sampling then pulled batch sizes as a low fraction of total RAM. I also refactored the code to avoid data copies. Most processing done with pandas, for example, could be done natively, or have its RAM explicitly released. This worked for most scripts and solved most problems.
To fix the second set of problems I simplified complex SQL to solve several problems at once. I made SQL run faster (which lowered chances of an error), I made the SQL filter (which lowered the amount of data we needed to export) and I implemented a custom partitioning and error-handling mechanism to break up the large query in many sections, keep track of our progress on each section and gracefully handle kill signals from the external database and restart the queries later, with no progress lost.
In the end, the customer was quite satisfied with the reliability of this solution.