Sometimes we want to insert multiple rows into a table.
1 2 3 4 5 6
For convenience, we often also want to use an ORM that tools like SQLAlchemy provide.
1 2 3 4 5 6 7 8 9
Now, suppose there are 400,000 martial artists to insert into the table. With the ORM portion of SQLAlchemy, this may take amount of time that is nowhere competitive with directly writing SQL due to its Unit of Work pattern. This is not surprising since the point of using (the ORM) SQLAlchemy is more on the engineer productivity side than the performance. The good news is that, starting SQLAlchemy v.1.0, it introduces Bulk Operations at the expense of some of its other features:
1 2 3 4 5 6 7
Now, according to the benchmark test, this will give a significant (e.g. factor of 3) reduction in the insert operation latency. But my case was a bit different in that I had to insert into five related tables. Recently I had a chance to play with Yahoo! Front Page Today User Click Log Dataset, of ~45M rows. For repeating query tasks, I happened to had to build multiple related tables where some tables needed primary keys in other tables to populate. The data model looked like:
1 2 3 4 5 6 7 8 9
For instance, to insert a row into UserVisitEvents, I needed three primary keys (User, Shortlist, Article) populated ahead of the time. By default,
bulk_save_objects does not allow you to evaluate the primary keys of inserted objects. It does have a flag that you can set as
bulk_save_objects(objects, return_defaults=True), but source code suggests this in essence reverts to sequential inserts.
1 2 3 4 5 6 7 8 9 10 11 12
Generally It seemed there is no quick easy way around with bulk inserts to multiple related tables. Of course, this is not an issue intrinsic to SQLAlchemy and it has never been intended for a task like bulk inserts. That said, I was curious how
bulk_save_objects(..., return_defaults=True) would perform against my data. Interestingly, I observed that the insert performance worsening over time likely due to overhead, initially at roughly 3s for 10,000 inserts (for the five tables) and after around 30th batch (300K inserts), the latency spiked up to 30s! This did not make sense as each batch operation is independent. This was in contrast to the sequential way giving 10s consistently. 5-minute cProfilng did not reveal too much insight. I mulled over breaking the relatedness of the tables initially by self-generating primary keys, but I decided to cop out for now. After all, this is just a one-off task and the dataset is not too large anyway. I left the job running overnight and woke up to find everything as needed.