Petabyte-Scale Archive System
The first project goal was to save 100 terabytes of data from a failing NoSQL data store. The second was to design a new system that could meet the needs of the customer, a large academic research lab. I rescued the data, designed an archive system, and migrated the customer to it.
First, I bought us time by buffering streaming data on the servers where it was collected. During this period I thought about the constraints the customer was facing. They needed to handle an insert load and growth of 1 TB a month or more. Keeping costs low was a major constraint: they could not afford a large managed database deployment. Few researchers queried the datastore, but they did need to be able to run many filters on the data. This usage pattern was well suited to an archive model, so I designed an archive system using rsync, autossh, bzip, and a large disk allocation on the university's high-performance computing (HPC) cluster. You can see my system on the left of the pictured diagram; the old system is on the right. We repurposed several database machines as transfer nodes so that moving data to and from the archive was fast. I had to negotiate some special privileges with internal stakeholders at the HPC department, but we found a way to accommodate everyone.
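To make the buffer-compress-transfer idea concrete, here is a minimal sketch of what one nightly batch step could look like. It is an illustration, not the code that ran in production: the function names, the buffer directory, and the destination string are hypothetical, and it assumes bzip2 compression through Python's standard bz2 module and a flat buffer directory.

```python
import bz2
import shutil
from pathlib import Path
from typing import List


def compress_file(path: Path) -> Path:
    """Compress one buffered file with bzip2 and delete the uncompressed copy."""
    target = path.with_name(path.name + ".bz2")
    with open(path, "rb") as src, bz2.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks, so large files are fine
    path.unlink()
    return target


def build_rsync_command(buffer_dir: Path, destination: str) -> List[str]:
    """Build an rsync call that ships finished .bz2 files to the archive.

    `destination` is a hypothetical rsync target such as "host:/path/".
    """
    return [
        "rsync",
        "-a",                     # archive mode: keep times and permissions
        "--partial",              # keep partial files so a retry can resume
        "--remove-source-files",  # free buffer space after a successful copy
        "--include=*.bz2",        # only ship compressed files ...
        "--exclude=*",            # ... and skip everything else
        f"{buffer_dir}/",         # trailing slash: copy contents, not the dir
        destination,
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True) from a nightly cron job. Building the command separately from running it keeps the step easy to inspect and test without touching the network.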
To support researchers who did want query capacity, we got them HPC allocations so they could build local databases. I also wrote a set of Python tools that made it easier to build databases, handle corrupt data, and filter out unwanted data before later steps.
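As a rough idea of what such a tool can look like, the sketch below reads a bzip2-compressed file, drops lines that fail to parse, and keeps only the records a researcher asks for. The record format is an assumption made for the example (one JSON object per line), and the function names are hypothetical rather than taken from the actual toolset.

```python
import bz2
import json
from typing import Callable, Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed records, skipping blank, corrupt, or non-object lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # corrupt line: drop it and keep going
        if isinstance(record, dict):
            yield record


def filter_archive(path: str, keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Stream records from one compressed archive file that satisfy `keep`."""
    # errors="replace" keeps damaged bytes from aborting the whole read
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for record in read_records(handle):
            if keep(record):
                yield record
```

Because both functions are generators, a file is never loaded into memory in full, and the filtered records can be fed straight into whatever local database the researcher is building.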
The trial period went very well: the system handled the insert volume and the nightly batches of streamed data, and the customer soon wanted to migrate all of their data to it.
Do you have an ambitious project? Email me and let us work together: firstname.lastname@example.org