Darwinweb

Free Form Manual Eager Loading

August 11, 2007     

A recent discussion on rails-core raised performance concerns when eager-loading multiple associations at once. The core issue here is that joining gives you the cartesian product against whatever rows are already selected. So if you are eager-loading N, O, and P associations on model M you get M*N*O*P rows instead of the optimal M(N+O+P) rows.

To achieve this optimal number you need to unroll the single query into 3 separate queries. Of course this is only worth doing if the numbers of rows are big enough to offset the overhead of the additional queries.

Under some circumstances this could be a make-or-break performance bottleneck. The trouble is ActiveRecord eager-loading doesn’t support this type of unrolling. There would be numerous ways to work around this:

  1. Cache the problematic query. Not always possible.
  2. Do the queries separately and use the results separately. Adds various types of overhead.
  3. Hack into Rails internals. Could get messy and brittle.
  4. Eager-load only one association. Might work or might not depending on the specifics.

None of these solutions are particularly elegant, but the problem actually runs deeper. The core issue is the inflexibility of eager loading.

Instantiating Multiple Types of Models Manually

Eager loading is a killer feature of ActiveRecord. Solving the N+1 queries problem is pretty easy, it’s one of the most basic uses of the LEFT JOIN in SQL. It would be easy to write an efficient query using find_by_sql.
The problem is that you just end up with dumb data fields tacked onto your base object.

The vast benefit of ActiveRecord is the type mapping, and all the richness of the model objects. There is no straightforward way to instantiate an object from rows you manually selected using find_by_sql. Insantiating an ActiveRecord object is always done internally by Rails. This is why eager loading is all the more critical, because it provides one and only one way to instantiate multiple models with one query.

Eager loading as it stands is a pretty slick piece of software. It does a whole lot in a relatively small amount of code. So what are the weaknesses?

No Respect for :select

Eager loading writes an explicit select statement with unambiguous names for all rows that are easily mapped back to the various included models. This means any :select clause in your find will be ignored. This creates a class of performance problems when you are fetching far more data than you need. In practice this isn’t often a problem, but for large TEXT or BLOB columns you may achieve a sizable performance improvement by selecting only what you need.

Opaque References

Sometimes it’s useful to refer to a JOINed table in :conditions. With eager-loading the columns are named in a predictable but volatile fashion. Although you can sometimes get away with referencing the original table and column name, it only works if the table exists only once in the query. And just because you aren’t referencing it in :joins doesn’t mean it’s not there twice. Certain association scoping will add a second reference to the same table on an INNER JOIN.

Guessing the column name works, but is extremely brittle if anything in the query changes. A good test suite will save you here (isn’t that always true?), but it just smells bad.

I’m not saying that we necessarily need access to these column aliases though. Ultimately SQL is a messy beast, and I’m not sure how far we can take the eager-loading API through the :includes option. Tacking on additional methods or complex hashes to deal with this runs the risk of complicating the simple cases.

Manual Eager Loading Options

What I think we need is a supported way to do eager loading with explicit :select and :joins options. I’m interested enough in this to actually write some code, the problem is figuring out what the interface would be like. I’m thinking it would most likely be a method (or an find option) that allowed the user to specify which :selected fields belonged to which association. Should it always be an explicit list of fields? Or could we utilize some naming convention to sweeten the syntax? Could we hook into some parts of the eager-loading naming scheme to further lighten the verbosity?

Additionally it would be nice to be able to have an option unroll queries in a single find, or provide a way to merge the results into a single array. I think the latter goal can be accomplished with a minimally invasive plugin by explicitly setting the association proxy target, but it might need some deeper integration to handle edge cases.

This is definitely a tough nut to crack, so I’m soliciting any and all ideas the community has, especially concerning how the interface should work and any use cases people have.

Steven A Bristol says…
August 11, 2007 at 9:30PM

My first thought is that this should be really easy, like just make a

:fast_eager=>true

option which will then run

:include=>[:a,:b]

as two queries instead of one. Maybe it will even look for and array in the select & conditions like:

:select=>[‘select for join with a’, ‘select for join with b’]

These are my first thoughts.

Mark says…
August 13, 2007 at 4:09PM

Regarding :select + :include, a patch has already been submitted. I’ve been using it for months, seems to work well. You do have to check that’s it’s being engaged when you use it. It seems to ignore the :select invocation when certain id fields are not requested in the select block. Here is the patch