On my current project we deal with quite a few CSV imports. While writing a basic CSV import is reasonably straightforward, we ran into performance issues when we started scaling the solution. Basically we could import 3,000 users in around 30 minutes (the domain model being rather complicated), but we needed to scale this solution to import 180,000 users every night. It was obvious that we needed to put some effort into the performance of our solution.
We did some quick testing and it became apparent that we could significantly improve performance by pre-loading some data and adding caching, but this would only yield a 4x or 5x boost – and we really needed something closer to 30x.
One of the best ways of loading CSV data into your database is with MySQL’s LOAD DATA INFILE statement – it will basically take an entire CSV file and dump it into a table in your database. You usually load the file into a temporary table and then use SQL queries to update your database. Unfortunately we were using PostgreSQL, so this option was not available to us.
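To make the MySQL route concrete, here is a hedged sketch of the two steps described above – load the file into a staging table, then update from it. The file path, table, and column names are made up for illustration, and since we were on PostgreSQL this is the path we could not take:

```ruby
# Step 1: bulk-load the raw CSV into a staging table (MySQL syntax).
load_sql = <<~SQL
  LOAD DATA INFILE '/tmp/users.csv'
  INTO TABLE users_staging
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  IGNORE 1 LINES
  (external_ref, email, name)
SQL

# Step 2: update the real table from the staging table in one statement.
update_sql = <<~SQL
  UPDATE users u
  JOIN users_staging s ON s.external_ref = u.external_ref
  SET u.email = s.email, u.name = s.name
SQL
```

Both statements run entirely inside the database, which is what makes this approach so fast – no per-row round trips from the application.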
Manual SQL Statements
So, taking some inspiration from this post, we decided to bypass ActiveRecord and manually generate the SQL statements. This solution worked quite well in our situation, because we focused on making updates very fast – we could keep using ActiveRecord for creating new records (which in our case would execute a number of ugly callbacks), but generate the SQL for update statements manually.
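A minimal sketch of what generating an update statement by hand can look like. The table and column names, and the naive quoting helper, are illustrative assumptions rather than our exact production code – real code should quote values through the database connection:

```ruby
# Build a raw UPDATE statement for one row, bypassing ActiveRecord.
# The quoting here is deliberately simple, for illustration only.
def update_sql_for(table, id, attrs)
  quote = ->(v) { "'#{v.to_s.gsub("'", "''")}'" } # escape single quotes
  assignments = attrs.map { |col, val| "#{col} = #{quote.call(val)}" }.join(", ")
  "UPDATE #{table} SET #{assignments} WHERE id = #{Integer(id)};"
end

sql = update_sql_for("users", 42, name: "Jane O'Brien", email: "jane@example.com")
# In a Rails app the statement would then be run with something like
# ActiveRecord::Base.connection.execute(sql), ideally batched so that
# many updates go to the database in a single round trip.
```

Skipping ActiveRecord instantiation, validations, and callbacks for each row is where the bulk of the speed-up comes from.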
This process is unfortunately very error-prone, and even though we had good tests backing our implementation, we still had to do quite a bit of trial and error before getting everything running smoothly.
Use Hashing for Eliminating Unnecessary Updates
While our manual SQL was getting us very impressive performance, we were still generating update statements for every single row, even though the data was unchanged most of the time. We wanted to ignore rows where the data in our system and the external system were unchanged since the previous import.
To do this we used hashing – we added a csv_hash column to the database table and stored a hash of the raw CSV row. We can then quickly ignore rows by comparing the hash of the incoming CSV row to the hash stored in the database – if they match, the row is unchanged and we simply ignore it.
To speed things up even further, we load each user's external reference along with the CSV hash into memory – we can then skip matching rows without needing to hit the database again.
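The combination of the two ideas can be sketched as follows. The csv_hash column name comes from the text; the external_ref key, the sample data, and the choice of MD5 as the digest are illustrative assumptions:

```ruby
require "digest"

# Preloaded lookup of external reference => stored csv_hash.
# In Rails this might come from User.pluck(:external_ref, :csv_hash).to_h
existing = {
  "ext-1" => Digest::MD5.hexdigest("ext-1,alice@example.com,Alice"),
}

rows = [
  "ext-1,alice@example.com,Alice", # unchanged since the previous import
  "ext-2,bob@example.com,Bobby",   # new or changed
]

# Keep only rows whose hash differs from (or is missing from) the cache –
# unchanged rows are skipped without a single database query.
to_update = rows.reject do |line|
  ref = line.split(",", 2).first
  existing[ref] == Digest::MD5.hexdigest(line)
end
# to_update now contains only the "ext-2" row
```

Any stable digest would work here; the important property is that an identical input row always produces an identical hash, so a match proves the row needs no update.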
As I mentioned we focused on making updates very performant, since this made sense for our application.
Overall we managed to move from importing 3,000 users in 30 minutes to 180,000 users in 3 minutes. There were certain validations we had to handle manually – for example, we require that users have a unique email (something which is not enforced in the external system), so instead of relying on the Rails uniqueness validation we added a unique index to the database table.
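For reference, pushing the uniqueness check down to the database is a one-line migration. This is a hypothetical sketch (table and migration names assumed) rather than our exact migration:

```ruby
# A duplicate email now fails at insert time with
# ActiveRecord::RecordNotUnique, instead of requiring the extra
# per-row SELECT that the Rails uniqueness validation performs.
class AddUniqueIndexToUsersEmail < ActiveRecord::Migration[7.0]
  def change
    add_index :users, :email, unique: true
  end
end
```

As a bonus, the database-level constraint also closes the race condition that the application-level validation alone cannot prevent.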