Don't load if you can stream

The amount of RAM in computers grows quite fast, but, as with hard disk space, no matter how much you have, your applications or services will always require (or benefit from) more.

When you handle big datasets (between a few hundred megabytes and a few gigabytes) you have to be very careful with how you process the data, because it is easy to create a point of failure due to out-of-memory errors. Specifically, you have to check that your code does not load datasets fully into memory if they can be big.
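As a minimal sketch of the difference (not the actual CartoDB code), the two hypothetical helpers below copy a file either by loading it whole or by reading it in chunks. The function names and the 1 MiB chunk size are assumptions chosen for illustration.

```ruby
CHUNK_SIZE = 1024 * 1024 # 1 MiB per read; an arbitrary illustrative value

# Loads the whole source into memory before writing it out,
# so peak memory grows with the size of the file.
def copy_fully_loaded(source_path, target_path)
  data = File.binread(source_path)
  File.binwrite(target_path, data)
end

# Copies in fixed-size chunks, so only one chunk is held at a time.
# Returns the number of bytes written.
def copy_streamed(source_path, target_path, chunk_size = CHUNK_SIZE)
  bytes = 0
  File.open(source_path, 'rb') do |source|
    File.open(target_path, 'wb') do |target|
      # IO#read with a length returns nil at end of file, which ends the loop.
      while (chunk = source.read(chunk_size))
        bytes += target.write(chunk)
      end
    end
  end
  bytes
end
```

Ruby's standard library also offers IO.copy_stream, which performs the same kind of chunked copy without the explicit loop.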

A real world scenario: CartoDB's Import API and a growing list of customers who upload datasets near or above the 1GB threshold. Monit was killing the worker jobs of those big uploads so we had to fix it.

First, diagnostics: at what points was the code fully loading the file?

- It wasn't upon importing the data into PostgreSQL, because that's done via command line and from ogr2ogr.

- It could be Rails, because its documentation only includes a basic "dump uploaded contents" example and never mentions that the uploaded file has already been saved to disk, at the location specified by Pathname.new(params[:file]) (or :filename).

- It could be the Typhoeus Ruby gem, because we have a Downloader class that fetches contents from URLs and writes them into a file. We were dumping the full response_body in a single write, while Typhoeus allows streaming the response in chunks.

- It could also be the AWS S3 Ruby SDK, because we upload the import files there so that workers can fetch them no matter which server they are spawned on. In this case the documentation is great, and streaming a file into an S3 object is a one-liner.
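For the download case, the chunked approach could look roughly like the sketch below. It is not the real Downloader class: download_to_file is a hypothetical helper, and it only assumes that the request object accepts an on_body callback and starts the transfer with run, as Typhoeus::Request does. The URL and paths in the usage comment are placeholders.

```ruby
# Streams a response body to disk one chunk at a time.
# `request` is expected to behave like Typhoeus::Request: it takes an
# on_body callback and performs the transfer when #run is called.
# Returns the number of bytes written.
def download_to_file(request, target_path)
  bytes = 0
  File.open(target_path, 'wb') do |file|
    # Each chunk is written as soon as it arrives, so the full body
    # is never accumulated in a single string.
    request.on_body do |chunk|
      bytes += file.write(chunk)
    end
    request.run
  end
  bytes
end

# Usage with the real gem (needs the typhoeus gem installed):
#   require 'typhoeus'
#   request = Typhoeus::Request.new('https://example.com/big-dataset.csv')
#   download_to_file(request, '/tmp/big-dataset.csv')
```

Passing the request in, instead of building it inside the helper, keeps the sketch independent of the gem and easy to exercise with a stand-in object.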

All 3 "could be" suspects were actual failure points, so I applied the fixes and job done. Now I have to spend some time (and bandwidth) uploading some multi-GB datasets to benchmark and find where the new limits of the platform are :)
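For the Rails upload, one way to avoid reading the contents into a string is to copy the temporary file that already exists on disk. This is only a sketch under assumptions: persist_upload and target_dir are hypothetical names, and the upload object is assumed to respond to path and original_filename, as ActionDispatch::Http::UploadedFile does.

```ruby
require 'fileutils'

# Copies an uploaded file to its destination without loading it into
# a Ruby string. `upload` is assumed to respond to #path and
# #original_filename (hypothetical stand-in for a Rails uploaded file).
# Returns the path of the stored copy.
def persist_upload(upload, target_dir)
  FileUtils.mkdir_p(target_dir)
  # File.basename drops any directory components sent by the client.
  target_path = File.join(target_dir, File.basename(upload.original_filename))
  # IO.copy_stream accepts file names and copies the data in chunks.
  IO.copy_stream(upload.path, target_path)
  target_path
end
```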

Bonus point: when uploading the file using Rails, anybody who hasn't set AWS S3 credentials on their CartoDB installation gets a different code execution path, in which the file is indeed loaded and written once. That's acceptable, but when I deactivated my credentials and tested that path, I noticed that even after the HTTP request had been fully processed, the Ruby process on my Linux machine was still using around 1GB of RAM, suspiciously close to the size of the file I had uploaded.

After some debugging I found I had to set the variable holding the file data to nil (filedata = nil) so that MRI 1.9.3's garbage collector would recognize it as destroyed and I could regain my GB of RAM when the request ended. It's funny and sad at the same time that you move away from unmanaged languages only to end up needing the same resource management techniques.

If you want, you can check all the changes I made to the Ruby code in this pull request.


Posted by Kartones on 2014-05-14