Optimizing and Simplifying Limited Eager Loading in ActiveRecord
Imagine an ActiveRecord class like this:
class Article < ActiveRecord::Base has_many :comments, :images, :tags, :links end
Now consider the following find statement:
Article.find(:all, :include => [:comments, :tags, :links], :limit => 10)
Ignore for a moment the performance issues with eager loading multiple associations, that’s a topic for another post.
The key bit here is that we are using both
:include. I refer to this as limited eager-loading. This kind statement uses a very specific bit of ActiveRecord that constructs two separate queries. The first query (heretofore known as the pre-query) uses the form:
SELECT DISTINCT articles.id ... LIMIT 10
To fetch exactly 10 article ids. Then the actual query is performed by dropping the limit and adding a condition something like this:
SELECT ... WHERE articles.id IN (1,2,3,4,5,6,7,8,9,10)
This is necessary because
LIMIT applies to rows, and if there is more than one comment, tag or link, then there will be more than one row per article, which means we would get less than 10 articles back.
In the current ActiveRecord implementation there is a potential performance issue that can be fixed. We already know that loading multiple associations has a performance issue because it is loading the cartesian product of all the tables which results in polynomial rather than linear row generation. What I mean by that is say we have 100 articles, each with 5 comments, 5 tags, and 5 links. Eager-loading SQL grabs them all in one query which results in
12500 rows (100 * 5 * 5 * 5) being fetched when in fact there are only
1600 actual objects (100 + 100 * (5 + 5 + 5)). I’m working on a fundamental solution to that problem, but for now I have a more pressing issues.
Large Base Table Numbers
In cases where associations are small in numbers, this polynomial problem isn’t a big issue and we can safely eager-load several associations (assuming small numbers of comments in a blog is bad, but I digress). However the number you need to watch even more than associations is the quantity of base table rows. Associations by their nature are limited, but if you don’t put conditions on the base table you’re going to be starting with a very large number. Say you have a frequently updated tumblog with 15,000 articles (a pretty modest number by common database standards). All of a sudden the number of rows jumps to 187,500.
But you’d rarely want to output 15,000 rows anyway, typically you’ll be putting a LIMIT on the query, and in that case, only the pre-query will be slow. The main query will LIMIT the base number before JOINing and so you only end up fetching 1250 rows (if the limit is 10).
Slashing the Numbers
The astute among you may have already asked, “Why is the pre-query doing all those joins just to fetch 10 ids?”
By dropping the eager JOINs we go back to considering only 15,000 rows, plus no overhead of JOINing at all. The reason for the JOINs is so that you can add
:conditions against the association tables. There are significant use-cases for that, but I think there are better ways of meeting those needs.
Limitations of Conditions on Eager-Loaded Tables
I believe that
:conditions referring to eager-loaded tables are confusing and limiting. ActiveRecord (as all ORMs) is dangerous without knowledge of SQL, but this is one case in particular where you have to pay special attention to what the SQL is doing.
The main source of confusion is that referring to an LEFT JOINed column restricts both the base table and the association rows. For instance:
@articles = Article.find(:all, :include => :tags, :conditions => "tags.name = 'foo'")
The author of such a statement most likely is looking for any articles tagged with
foo. That works as expected, but what they probably weren’t expecting is that
articles.tags doesn’t contain all the tags! It only contains the tag
foo. In fact this very bug occurred in Mephisto.
What you actually want is either of the following statements, take your pick (maybe with a benchmark):
@articles = Article.find(:all, :include => :tags, :joins => "LEFT JOIN tags AS custom_tags ON custom_tags.article_id = articles.id", :conditions => "custom_tags.name = 'foo'") @articles = Article.find(:all, :include => :tags, :joins => "INNER JOIN tags AS custom_tags ON custom_tags.article_id = articles.id AND custom_tags.name = 'foo'")
Some people find this distasteful. Sven Fuchs in the above thread thought that the initial form should behave like these second forms. It’s a nice thought, but it directly conflicts with the generated SQL. It would require whole new modules to be added to ActiveRecord to parse the conditions which would be bug-prone, slow, and opaque. I guess it just goes to show the parity mismatch between OOP and SQL. ORM is more art than science—when in doubt keep it simple.
Now if for some reason you really wanted just tags named
foo then you would be better served by creating a custom association like so:
class Article < ActiveRecord::Base has_many :foo_tags, :class_name => 'Tag', :conditions => "name = 'foo'" end
So that’s how you separate the two desirable effects. It’s not perfect, but I think it’s the best we can hope for without some radical re-conceptualization of ActiveRecord. At the least I think a whole new find option with similar complexity to :include would be required, but I think designing an appropriate interface would be a fool’s errand anyway.
How much can we reasonably abstract away the complexities of SQL? Another problematic query is:
@articles = Article.find(:all, :include => :tags, :conditions => "tags.name = 'foo' AND tags.name = 'bar'")
At first glance you might think this fetches articles tagged with both
bar, but of course it will always return an empty set because on any given row
tags.name can only have a single value. What you’d be looking for instead is:
@articles = Article.find(:all, :include => :tags, :joins => "INNER JOIN tags AS foo_tags ON foo_tags.article_id = articles.id AND foo_tags.name = 'foo' INNER JOIN tags AS bar_tags ON bar_tags.article_id = articles.id AND bar_tags.name = 'bar'")
Everything I just discussed applies to all
:conditions, but I’m afraid our only defense against the general case is education. What I’m concerned with is the aforementioned case where
:limit is used, because it creates an especially egregious performance problem—the case where you :limit is where you have a lot of rows, and the pre-query is the only part that has a particularly high number of base rows since the actual query is restricted by ids (restricting by ids is particularly speedy in MySQL).
I’ve conceived of two solutions to this problem. The first removes eager-joining (not explicit joins) from limited eager-loading pre-queries. This is a simple one-line omission from ActiveRecord. Unfortunately this is a destructive change which breaks a small handful of tests, but I think this is a big win overall in terms of speeding things up while simplifying the code base. We need somewhat of a consensus to make a change like this, so please speak up on the mailing list if what I’ve outlined here resonates.
I’ve also written a non-destructive patch which solves my immediate performance concerns by collecting the names of tables that are referenced in
:conditions and then only eagerly joins the necessary tables (in the pre-query). I dislike this solution because it adds a small amount of bloat and complexity, but I have tested it and believe it will introduce no new bugs—I know there is an issue with quoted table names (mysql eg. `table`), but that problem already existed with the table matching regexp which was already there.
Now, just to be perfectly clear, both of these patches affect only the pre-query. You can still use the
:conditions to affect the final association rows that are loaded, if that is indeed what you want to do. If there is no
:limit then nothing is affected at all.
Please refer to: