Importing data and handling conflicts in Ruby on Rails applications

By Zach Dennis on 19 Aug 2016

The other week we took a look at importing data quickly into Ruby on Rails applications and saw how activerecord-import can speed up importing large sets of data by 13x to nearly 40x with just a few lines of code. One thing that post didn't cover was how to handle conflicts in our data: should we ignore the duplicates, update the duplicates, or let the import fail noisily?

In this post we'll take a look at updating the duplicates.

A Simple Example

Let's say that we have an authors database with the following schema:
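The post's original schema listing isn't shown here, so this migration is a sketch; the column details beyond id, key, and name are assumptions, but they match the fields the rest of the post relies on:

```ruby
# Sketch of the authors table: an id primary key, a globally unique
# key column, and a name column. Details beyond these are assumed.
class CreateAuthors < ActiveRecord::Migration[5.0]
  def change
    create_table :authors do |t|
      t.string :key,  null: false  # globally unique identifier for each author
      t.string :name, null: false
      t.timestamps
    end

    # Enforce uniqueness at the database level
    add_index :authors, :key, unique: true
  end
end
```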

We want to pull in an updated authors feed that will update the author names in our system. We may have misspellings, use initials where an author prefers their name expanded (or vice versa), or perhaps an author has changed their name. Whatever the case, we don't own the authors data, so we want the upstream dataset to be the source of truth for up-to-date author information.

In the above schema we are using the key field as the globally unique identifier for each author.

For the simplicity of this example, our authors data already has a key that matches a corresponding key in the upstream data feed we're going to import.

Running the import

For the time being let's assume that the only piece of author information we're interested in is the name. We can utilize the :on_duplicate_key_update option of activerecord-import's import method to specify that we want to update name when a duplicate is found:
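A call along these lines does the job (the feed parsing is sketched here; `feed_entries` is a hypothetical stand-in for however the upstream feed is loaded):

```ruby
# Build Author instances from the upstream feed.
# feed_entries is a placeholder for the parsed upstream data.
authors = feed_entries.map do |entry|
  Author.new(key: entry[:key], name: entry[:name])
end

# Insert new authors in bulk; when a row with the same key already
# exists, update its name column instead of failing.
Author.import authors, validate: false, on_duplicate_key_update: [:name]
```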

The above code will efficiently INSERT new records into the database or update the name column when a duplicate is found.

How the above import works (MySQL)

If the data being inserted would cause a duplicate value in a UNIQUE index or PRIMARY KEY then MySQL will perform an UPDATE on the existing row. It will update the columns based on the list of columns you provide to the :on_duplicate_key_update option.

This relies on the underlying INSERT ... ON DUPLICATE KEY UPDATE functionality provided by MySQL.

How the above import works (PostgreSQL)

If the data being inserted would cause a duplicate value on the PRIMARY KEY field then PostgreSQL will perform an UPDATE on the existing row. It will update the columns based on the list of columns you provide to the :on_duplicate_key_update option.

The above example would fail in PostgreSQL if key weren't the primary key on the authors table, even if key had a UNIQUE index or a UNIQUE constraint.

This relies on the underlying INSERT ... ON CONFLICT functionality provided by PostgreSQL (only available in 9.5 and higher).

Specifying how to detect duplicates with PostgreSQL

With PostgreSQL activerecord-import supports passing a hash to :on_duplicate_key_update. The available options are:

  • columns — the array of columns to be updated when there is a duplicate
  • conflict_target — the column(s) or index expression that PostgreSQL can use to infer a unique index for detecting duplicates. Don't pass the name of the index here; pass the names of the columns that make up the index. Use this or constraint_name, but not both.
  • constraint_name — the name of the CONSTRAINT to use to detect duplicates. Unlike conflict_target, do not pass in the column name(s) that make(s) up the constraint; instead pass in the name of the constraint. Use this or conflict_target, but not both.

In the earlier example we added a UNIQUE index on authors.key. If the primary key on the table was the id field we would use the following import call to ensure our author names got updated:
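That call would look something like this (a sketch, assuming `authors` holds the records built from the feed):

```ruby
# id is the primary key, so point PostgreSQL at the unique index on
# the key column instead. conflict_target takes the column names that
# make up the index, not the index's name.
Author.import authors,
  validate: false,
  on_duplicate_key_update: {
    conflict_target: [:key],
    columns: [:name]
  }
```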

To see how constraint_name is used let's remove the unique index on the authors.key and add an actual CONSTRAINT:
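A migration along these lines would make that change (the constraint name `authors_key_unique` is an assumption for illustration):

```ruby
class SwapAuthorsKeyIndexForConstraint < ActiveRecord::Migration[5.0]
  def up
    remove_index :authors, :key
    # ActiveRecord has no helper for adding UNIQUE constraints,
    # so we fall back to raw SQL.
    execute <<-SQL
      ALTER TABLE authors
        ADD CONSTRAINT authors_key_unique UNIQUE (key);
    SQL
  end

  def down
    execute "ALTER TABLE authors DROP CONSTRAINT authors_key_unique;"
    add_index :authors, :key, unique: true
  end
end
```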

Here's the updated call to import:
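With the constraint in place, the import passes the constraint's name rather than its columns (assuming the constraint was named `authors_key_unique`):

```ruby
# constraint_name takes the name of the constraint itself,
# not the columns it covers.
Author.import authors,
  validate: false,
  on_duplicate_key_update: {
    constraint_name: :authors_key_unique,
    columns: [:name]
  }
```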

In case you're wondering, ActiveRecord doesn't provide any methods for creating actual PostgreSQL database constraints, so that is why the above schema change executed raw SQL.

How about SQLite3?

SQLite3 doesn't provide an equivalent upsert implementation. The closest thing it currently supports is INSERT OR REPLACE. Rather than updating individual columns, it replaces the entire existing row with a new row.

activerecord-import currently doesn't provide any support for this SQLite3 feature.

Related bits

Why validate: false in the above examples?

Let's say that the author model looked like this:
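Something like this, with a uniqueness validation on key:

```ruby
class Author < ActiveRecord::Base
  validates :key, uniqueness: true
end
```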

The above uniqueness validation will run a query every time valid? is called on an author instance. Unfortunately, activerecord-import is not able to batch validate. Instead, it will try to validate each author instance individually. That will cause 10,000 SELECT ... FROM authors WHERE key = ... queries to hit the database before the import.

In the context of the above example we didn't need to run this validation, since we're relying on a database-level UNIQUE index/constraint to trigger the update. Because of this we were able to run the import without validations.

If we had kept validations turned on, nothing bad would have happened; the import would just have been a wee bit slower.

Use database level indexes/constraints for uniqueness

The validation helpers provided by Rails are useful, but they're not enough for fighting the war against duplicate data.

On its own a validates :column, uniqueness: true line in an ActiveRecord model won't actually ensure that you have unique data, nor will it be enough to utilize the :on_duplicate_key_update option in activerecord-import.

Summary

We saw previously how activerecord-import's on_duplicate_key_update option can be used to tell the database what columns to update when it finds a duplicate.

Since MySQL and PostgreSQL support different call semantics this resulted in a few variations on how on_duplicate_key_update can be used. For MySQL, it's a simple collection of columns to update, whereas with PostgreSQL it could be that or it could be a Hash of :columns and either :conflict_target or :constraint_name.

Now that we know how to leverage the speed and efficiency of activerecord-import with new data as well as updating existing data, we'll take a look next at ignoring unique key and constraint violations.

If there are any particular topics you'd like to see covered, post to GitHub or send us a message.

Happy coding!

Image credit: Thomas Quine

About Mutually Human

Mutually Human is a custom software design and development consultancy specializing in mobile and web-based products and services. We help our clients design, develop and bring to market innovative products and services based on insightful research and strategy aligned with business objectives. We’ve helped Fortune 500 companies, state governments, and startups.