Optimizing and Simplifying Limited Eager Loading in ActiveRecord

September 13, 2007     

Imagine an ActiveRecord class like this:

class Article < ActiveRecord::Base
  has_many :comments, :images, :tags, :links

Now consider the following find statement:

Article.find(:all, :include => [:comments, :tags, :links], :limit => 10)

Ignore for a moment the performance issues with eager loading multiple associations, that’s a topic for another post.

The key bit here is that we are using both :limit and :include. I refer to this as limited eager-loading. This kind statement uses a very specific bit of ActiveRecord that constructs two separate queries. The first query (heretofore known as the pre-query) uses the form:

SELECT DISTINCT articles.id ... LIMIT 10

To fetch exactly 10 article ids. Then the actual query is performed by dropping the limit and adding a condition something like this:

SELECT ... WHERE articles.id IN (1,2,3,4,5,6,7,8,9,10)

This is necessary because LIMIT applies to rows, and if there is more than one comment, tag or link, then there will be more than one row per article, which means we would get less than 10 articles back.

Performance Problem

In the current ActiveRecord implementation there is a potential performance issue that can be fixed. We already know that loading multiple associations has a performance issue because it is loading the cartesian product of all the tables which results in polynomial rather than linear row generation. What I mean by that is say we have 100 articles, each with 5 comments, 5 tags, and 5 links. Eager-loading SQL grabs them all in one query which results in 12500 rows (100 * 5 * 5 * 5) being fetched when in fact there are only 1600 actual objects (100 + 100 * (5 + 5 + 5)). I’m working on a fundamental solution to that problem, but for now I have a more pressing issues.

Large Base Table Numbers

In cases where associations are small in numbers, this polynomial problem isn’t a big issue and we can safely eager-load several associations (assuming small numbers of comments in a blog is bad, but I digress). However the number you need to watch even more than associations is the quantity of base table rows. Associations by their nature are limited, but if you don’t put conditions on the base table you’re going to be starting with a very large number. Say you have a frequently updated tumblog with 15,000 articles (a pretty modest number by common database standards). All of a sudden the number of rows jumps to 187,500.

But you’d rarely want to output 15,000 rows anyway, typically you’ll be putting a LIMIT on the query, and in that case, only the pre-query will be slow. The main query will LIMIT the base number before JOINing and so you only end up fetching 1250 rows (if the limit is 10).

Slashing the Numbers

The astute among you may have already asked, “Why is the pre-query doing all those joins just to fetch 10 ids?”

By dropping the eager JOINs we go back to considering only 15,000 rows, plus no overhead of JOINing at all. The reason for the JOINs is so that you can add :conditions against the association tables. There are significant use-cases for that, but I think there are better ways of meeting those needs.

Limitations of Conditions on Eager-Loaded Tables

I believe that :conditions referring to eager-loaded tables are confusing and limiting. ActiveRecord (as all ORMs) is dangerous without knowledge of SQL, but this is one case in particular where you have to pay special attention to what the SQL is doing.

Conflated Effects

The main source of confusion is that referring to an LEFT JOINed column restricts both the base table and the association rows. For instance:

@articles = Article.find(:all, 
    :include => :tags, 
    :conditions => "tags.name = 'foo'")

The author of such a statement most likely is looking for any articles tagged with foo. That works as expected, but what they probably weren’t expecting is that articles[0].tags doesn’t contain all the tags! It only contains the tag foo. In fact this very bug occurred in Mephisto.

What you actually want is either of the following statements, take your pick (maybe with a benchmark):

@articles = Article.find(:all, 
  :include => :tags, 
  :joins => "LEFT JOIN tags AS custom_tags 
               ON custom_tags.article_id = articles.id", 
  :conditions => "custom_tags.name = 'foo'")
@articles = Article.find(:all, 
  :include => :tags, 
  :joins => "INNER JOIN tags AS custom_tags 
               ON custom_tags.article_id = articles.id 
               AND custom_tags.name = 'foo'")

Some people find this distasteful. Sven Fuchs in the above thread thought that the initial form should behave like these second forms. It’s a nice thought, but it directly conflicts with the generated SQL. It would require whole new modules to be added to ActiveRecord to parse the conditions which would be bug-prone, slow, and opaque. I guess it just goes to show the parity mismatch between OOP and SQL. ORM is more art than science—when in doubt keep it simple.

Now if for some reason you really wanted just tags named foo then you would be better served by creating a custom association like so:

class Article < ActiveRecord::Base
  has_many :foo_tags, :class_name => 'Tag', :conditions => "name = 'foo'"

So that’s how you separate the two desirable effects. It’s not perfect, but I think it’s the best we can hope for without some radical re-conceptualization of ActiveRecord. At the least I think a whole new find option with similar complexity to :include would be required, but I think designing an appropriate interface would be a fool’s errand anyway.

How much can we reasonably abstract away the complexities of SQL? Another problematic query is:

@articles = Article.find(:all, 
    :include => :tags, 
    :conditions => "tags.name = 'foo' AND tags.name = 'bar'")

At first glance you might think this fetches articles tagged with both foo and bar, but of course it will always return an empty set because on any given row tags.name can only have a single value. What you’d be looking for instead is:

@articles = Article.find(:all, 
    :include => :tags, 
    :joins => "INNER JOIN tags AS foo_tags
                 ON foo_tags.article_id = articles.id
                 AND foo_tags.name = 'foo'
               INNER JOIN tags AS bar_tags
                 ON bar_tags.article_id = articles.id
                 AND bar_tags.name = 'bar'")

The Solution

Everything I just discussed applies to all :includes + :conditions, but I’m afraid our only defense against the general case is education. What I’m concerned with is the aforementioned case where :limit is used, because it creates an especially egregious performance problem—the case where you :limit is where you have a lot of rows, and the pre-query is the only part that has a particularly high number of base rows since the actual query is restricted by ids (restricting by ids is particularly speedy in MySQL).

I’ve conceived of two solutions to this problem. The first removes eager-joining (not explicit joins) from limited eager-loading pre-queries. This is a simple one-line omission from ActiveRecord. Unfortunately this is a destructive change which breaks a small handful of tests, but I think this is a big win overall in terms of speeding things up while simplifying the code base. We need somewhat of a consensus to make a change like this, so please speak up on the mailing list if what I’ve outlined here resonates.

I’ve also written a non-destructive patch which solves my immediate performance concerns by collecting the names of tables that are referenced in :conditions and then only eagerly joins the necessary tables (in the pre-query). I dislike this solution because it adds a small amount of bloat and complexity, but I have tested it and believe it will introduce no new bugs—I know there is an issue with quoted table names (mysql eg. `table`), but that problem already existed with the table matching regexp which was already there.

Now, just to be perfectly clear, both of these patches affect only the pre-query. You can still use the :conditions to affect the final association rows that are loaded, if that is indeed what you want to do. If there is no :limit then nothing is affected at all.

Please refer to:

Sven Fuchs says…
September 13, 2007 at 3:30PM

I see that your patch is headed in a different direction and performance is a high-ranking criterion of course.

But performance aside, in this particular usecase (listing tagged blog articles) it would perfectly make sense to simply omit the condition from the second SQL query that is generated – no error-prone, bloated parsing engine needed. Wouldn’t it?

For this:

@articles = Article.find(:all, 
  :include => :tags, 
  :conditions => "tags.name = 'foo'")

ActiveRecord would generate these queries:

SELECT DISTINCT contents.id FROM contents
LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND taggings.taggable_type = 'Content')
LEFT OUTER JOIN tags ON tags.id = taggings.tag_id
WHERE tags.name IN ('tag')
ORDER BY contents.published_at DESC 

SELECT * FROM contents
LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND taggings.taggable_type = 'Content')
LEFT OUTER JOIN tags ON tags.id = taggings.tag_id
WHERE tags.name IN ('tag') AND contents.id IN ('1', '2', '3')
ORDER BY contents.published_at DESC

What we would want in this particular usecase would be a second query like this:

SELECT * FROM contents
LEFT OUTER JOIN taggings ON (taggings.taggable_id = contents.id AND taggings.taggable_type = 'Content')
LEFT OUTER JOIN tags ON tags.id = taggings.tag_id
WHERE contents.id IN ('1', '2', '3')
ORDER BY contents.published_at DESC


Gabe da Silveira says…
September 13, 2007 at 3:53PM

The problem is that sometimes you do want to add conditions that apply to joined tables. Sure you could hack around that limitation by adding :conditions to the association as I demonstrated above, but that does nothing for manual :joins.

Now that I think about it, I don’t think any kind of fancy parsing would solve the problem either, because how can you guess the intention of the developer? There would have to be some supplemental specification of which conditions to drop… messy…

Sven Fuchs says…
September 13, 2007 at 5:06PM

Sure, there are two ways to interpret the syntax of that ActiveRecord call and there are two usecases (the condition being applied or not being applied to the last query). AR enables one usecase through a slick interface and requires to spell out SQL for the other one.

My initial question was: why has it initially been designed this and not the other way around? I.e. why doesn’t AR by default just omit the condition from the second query? That’s a question of design or usability as opposed to technical reasons IMO. AR decides for one of these sides. Given this is a good thing – why does it decide this way?

Since I’ve been watching these discussions and asking people I’ve never seen somebody expecting the current behaviour offhand – but I’ve seen quite some reactions like “of course it should be this way!” (like just recently in the core mailinglist).

So to me this question still remains a mystery :)

jeremy says…
September 14, 2007 at 4:48AM

disregarding performance, what if the orm could use subselects to get this all in one statement:

select * from articles
left join tags on tags.article_id = articles.id
where articles.id in (select id from articles 
left join tags on tags.article_id = articles.id 
where tags.name = 'foo')
Gabe da Silveira says…
September 14, 2007 at 8:29AM

@Sven – Well I wrote extensively about that in the original thread on the mailing list. But there are two reasons:

  1. There would be no way to add conditions affecting explicitly joined tables.
  2. It would be inconsistent because the second query is the “main” query. It’s conceptually equivalent to the single query that is generated in all other cases (ie. any find without both a :limit and :include option). So the second query must behave the same way as a regular query, and you can not just drop :conditions from a regular query…

Even if they decided to do this, it still doesn’t solve the bug in a general sense, because not every query uses both :include and :limit, so the confusion would still be there. Again, this is just a parity mismatch between ORM and the relational model. There’s no quick fix, people just have to understand what is going on under the hood.

@jeremy – I have tended to avoid subselects in my career, so I don’t have any notion of whether that would be more efficient or what the tradeoffs might be. Sounds like it could be promising as a separate idea though.

Sven Fuchs says…
September 14, 2007 at 3:04PM

Gabe, I still think that we’re talking at cross-purposes … but I guess it would take some more extensive demonstration – and that’s just not the topic of your discussion here. So, sorry for all the noise :)

Gabe da Silveira says…
September 15, 2007 at 1:30AM

The main discussion is occurring on the mailing list, so I’m not too worried about noise here.

But basically Sven, it boils down to this:

I see how your proposed solution would solve this one edge case, but my direct question to you why bother when every other query still suffers from the same bug?