Darwinweb

In Defense of ORMs

June 16, 2011     

Laurie Voss just posted an article that ORM is an anti-pattern in which he calls out ActiveRecord (although nothing about ActiveRecord is specifically cited) as doing more harm than good. Some of the flaws he mentions are real issues, but he hand-wavingly dismisses the benefits without any real analysis. Let’s examine the claims:

Some ORM layers will tell you that they “eliminate the need for SQL”. This is a promise I have yet to see delivered.

Which ORM says this? This is just a strawman set up by painting ORM projects as making ridiculous claims. There are quotes around the words, but if you plug them into Google, nothing on the first page is a claim from an ORM, they’re all about eliminating [MS] SQL Server or SQL experts.

Others will more realistically claim that they reduce the need to write SQL but allow you to use it when you need it. For simple models, and early in a project, this is definitely a benefit: you will get up and running faster with ORM, no doubt about it. However, you will be running in the wrong direction.

Now he brings it back to reality and admits you can get up and running faster with ORM, but then claims you’re running in the wrong direction, full stop. There’s no reasoning why you are running just in the wrong direction, just a blind assertion.

Code Generation” is also mentioned as a benefit, but without going into much detail. “Efficiency is ‘good enough’”, is mentioned here as well but seems out of place; it’s a known tradeoff, not something to be touted as a benefit.

Now getting to the meat of the article.

Inadequate abstraction

The most obvious problem with ORM as an abstraction is that it does not adequately abstract away the implementation details.

This is because there is a fundamental parity mismatch between SQL and imperative code. It’s certainly valid that ORMs are one of the leakiest abstractions imaginable, and therefore they are poor abstractions on the continuum of abstraction quality. That’s a fundamental challenge of using SQL regardless of whether you are using an ORM or not. Writing your SQL by hand does nothing to address that, you still have to convert back and forth between imperative code and data structures to SQL and flat rows of data. The challenge of doing this well and creating a nice interface is the whole reason there are so many different ORMs and why they vary so widely in form and function. If you reject ORMs you’re really just saying that you have a better way of doing it, at least for the project at hand, which may very well be true, or you may wind up digging your very own anti-pattern.

The whole point of an abstraction is that it is supposed to simplify. An abstraction of SQL that requires you to understand SQL anyway is doubling the amount you need to learn: first you need to learn what the SQL you’re trying to run is, then you have to learn the API to get your ORM to write it for you.

The premise of ORM is not that it precludes the need to understand SQL, anyone telling you otherwise is selling snake oil. You’re not learning “twice as much”, the ORM is there to provide a standard way to get your data in and out of an SQL database. It certainly is quicker to learn the primitive SQL driver functions, but all the time and more you saved by not learning the ORM will be spent figuring out how to construct SQL queries and map row data to objects. Again, you might create a more optimized solution for your project, but just as likely you will run into dead-ends and require refactorings because you don’t have the experience of thousands of developer hours with widely varying use-cases behind your homegrown solution.

A defender of ORM will say that this is not true of every project, that not everyone needs to do complicated joins, that ORM is an “80/20” solution, where 80% of users need only 20% of the features of SQL, and that ORM can handle those. All I can say is that in my fifteen years of developing database-backed web applications that has not been true for me. Only at the very beginning of a project can you get away with no joins or naive joins. After that, you need to tune and consolidate queries. Even if 80% of users need only 30% of the features of SQL, then 100% of users have to break your abstraction to get the job done.

Well, in my 27 years of programming (that’s right, I first sat down in front of an AppleSoft BASIC prompt when I was 6), I’ve found that on 34% of projects, we need 58% of the features of 66% of the RDBMS engines available.

But seriously, if you work with ActiveRecord for any amount of time you realize that assertion about the percentage of SQL feature coverage is meaningless. There is no such percentage. I’m hard pressed to think of a feature of SQL that ActiveRecord does not support. It’s true that there are some higher level methods (associations for instance) that do more with less code, but even those can take additional options that tweak the SQL in a variety of ways. Even when you drop down to the more basic building block methods (#select, #where, #group, etc) you still gain the benefits of the ORM in that you can chain them with each other and higher level methods and not have to worry about munging your own SQL strings.

Incorrect abstraction

If your project really does not need any relational data features, then ORM will work perfectly for you, but then you have a different problem: you’re using the wrong datastore.

This is continuing with the straw man that ORMs don’t do joins, but they do. The subsequent line of thought follows directly from this fallacy. Then he comes to:

On the the other hand, if your data is relational, then your object mapping will eventually break down.

This is true in one sense. No ORM can make all possible relations be first class citizens in the library. At some point you have to drop to a lower level. Depending what it is you may need to drop down to the raw data, or maybe an intermediate level. In any case, this is not invalidating the other thousands of queries that for which your app happily makes use of the high-level ORM features. The two can very happily co-exist.

Death by a thousand queries

This leads naturally to another problem of ORM: inefficiency. When you fetch an object, which of its properties (columns in the table) do you need? ORM can’t know, so it gets all of them (or it requires you to say, breaking the abstraction).

So the solution is to force yourself to specify the columns every time without exception? Again, baby right out with the bathwater; it’s no harder to specify in an ORM than it is in raw SQL. Also you may end up shooting yourself in the foot because sometimes RDBMS software has optimizations around SELECT *, and therefore you hurt yourself by removing columns unless you really are shaving a lot of data.

Initially this is not a problem, but when you are fetching a thousand records at a time, fetching 30 columns when you only need 3 becomes a pernicious source of inefficiency.

It’s not pernicious. You just add .select('col1,col2,col3') after you profiled the action and realized this was the bottleneck. Or you could trace the usage of every query, optimize the SELECT clause for each and every case, and endlessly refactor as your application and interface grows and changes. You could also write the whole thing in binary and I bet it would be pretty easy to squeeze out twice the performance in 100x the development time.

Many ORM layers are also notably bad at deducing joins, and will fall back to dozens of individual queries for related objects.

On the other hand if you are joining several relations to a base table, this is exactly what you want to avoid cartesian product explosion. Contrary to the implication, ORM developers do profile and optimize their code. You can often do a better for a specific case, but it’s far from being necessary every time.

The problem, I have discovered with experience, is that there is seldom a single “magic bullet” query that needs to be optimized: the death of database-backed applications is not the efficiency of any one query, but the number of queries. ORM’s lack of context-sensitivity means that it cannot consolidate queries, and must fall back on caching and other mechanisms to attempt to compensate.

The use of “magic bullet” here frames the idea of selective optimization in a negative light, because all programmers know there are no magic bullets. Unfortunately it’s a poor use of metaphor because a bullet refers to a solution not a problem. A more apt platitude would be that premature optimization is always bad. There is more likely more than one problem query in a large application, but there are also likely hundreds of inconsequential boilerplate queries needing no more fine-tuning than an index or two.

In any case, it would be disengenuous to say that ORMs don’t come with performance overhead, but it’s a measured tradeoff. There’s no cliff-dropping performance hit as indicated by the last vaguely damning sentence about using caching to “compensate”, whatever that means.

Solutions: Use Objects

If your data is objects, stop using a relational database. The programming world is currently awash with key-value stores that will allow you to hold elegant, self-contained data structures in huge quantities and access them at lightning speed.

Okay this is a bit offtopic for a rant like this, but I can’t let it slide. Your data “is” not either objects or relational. The choice of data structures and data stores is an abstract representation of concepts that can be interpretted many different ways. It really bothers me whenever someone offhandedly says that there are many types of data that aren’t relational. Quite frankly, I think that’s horseshit. Let me go on record that I think that there is no useful data that can not be made more useful through solid relational modeling as opposed to some object DB dump.

With a relational model you have a far more powerful querying paradigm than you have on any object data store. By its nature, a relational model is meant to slice and dice data in unforeseen ways. Of course, this flexibility comes with a price. The reason for all these new high performance data stores is to commit to a less-flexible data model that allows you to break out of the performance limitations of a strict ACID RDBMS.

The reason people start with a relational database is because it’s the most flexible way for your data to sit. You can pivot your startup to completely different purposes using the same data and all you need to do is change a few indexes around to support your new query structure. When you use some kind of document or object-oriented store, you are making a much firmer assertion about exactly how this data will be used, and what kind of queries and reports you intend to run from it.

The fact is that most applications will never outgrow the capability of a single moderately powered relational DB server. If and when you do, you will have scaling issues that don’t magically go away just because you chose the high-performance datastore dujour. You’ll have a lot of architecting and optimization to do regardless of the path you take eliminating bottlenecks, so I think it’s entirely justified to slap in an SQL database on day one with the realization that you may have to change it later.

Solutions: Use SQL in the Model

However, remember that the job of your model layer is not to represent objects but to answer questions. Provide an API that answers the questions your application has, as simply and efficiently as possible.

I would add changing state to this, but fundamentally I agree with the philosophy. Given the fact that interacting with a relational database can be messy, I think it’s very smart if you can keep the logic entirely contained in the model, and therefore I could accept that keeping ActiveRecord primitives out of your controllers and views might be a good idea. In practice I only work this way for more complex types of “questions” that I’m asking the model, and Rails sort of blurs the lines by defining a bunch of methods, which could theoretically be replaced if you ripped out ActiveRecord and replaced your backend with something completely different.

OO is itself an abstraction, a beautiful and hugely flexible one, but relational data is one of its boundaries, and pretending objects can do something they can’t is the fundamental, root problem in all ORM.

This closing statement implies a false belief about the design criteria and intention of ORM systems. It seems like a case of misattributing the beliefs of newbies who wish SQL didn’t exist to ORM developers who clearly have a deep understanding of SQL. In reality ORMs are mostly designed for experts. I haven’t used a lot of other ORMs extensively, so I can’t comment on the design criteria or optimal usage of something like SQLAlchemy, but for ActiveRecord and probably a lot of ORMs there are some clear wins:

  • Automatically map database types to the equivalent programming language types and vice-versa. If you’re managing your own SQL then you’re going to waste a lot of time figuring this out. It’s one place where sensible defaults go a long way. Why debug the edge cases yourself?
  • Greatly reduce error-prone SQL string munging. Assuming you have your pants on straight and don’t open up any injection holes, you still have to have complex string manipulation logic to compile conditions and joins, or else a bunch of very similar but different queries that are difficult to read and hard to refactor.
  • Composability of queries. With a proper relational algebra engine, you can programmatically build up a query by chaining clause methods together. Doing so allows you to compose many complex queries and keep the code in the logical place without the design being driven by how you are organizing the SQL itself. An example would be if you have the concept of Authors who might have a “new” flag, and Posts that might have a “rant” flag. It’s clear that the definition of those flags belongs in the respective models, but what if you want to query both at once? In the case where you manually craft your SQL you have to put it somewhere, maybe in one model or the other, or somewhere else. You also have to decide if you are going to munge a bunch of conditions together in one query or split them out. Throw something like an advanced search form in the app and this becomes a minefield of niggling details.
  • Simplify the simple stuff. There are tons of boilerplate queries in any sizable app. An ORM gives you a concise way to call them and get a standard default mapping of fields to attributes.
  • Facilitate higher level patterns. The concepts of belongs_to and has_many in ActiveRecord are a simple encapsulation of very common patterns that come up again and again in relational databases. Through a series of simple options it allows you specify and execute a wide variety of queries for counting, instantiating, or loading multiple collections. They can be chained with the lower level functions, and always output correct and straightforward SQL. Sure they can’t do everything, but they certainly are capable of handling the majority of queries in a typical app quite elegantly.

The bottom line is that the concept of an ORM is a catchall for a range of projects that implement common patterns for using SQL in an object-oriented system. An ORM is nothing more than a codified set of patterns. Because of the nature of the relational model, there are many many different approaches one can take. To dismiss ORMs entirely is to dismiss SQL as well. You might claim that you don’t use an ORM in your code, but in effect you are just implementing your own idea of an ORM. If you are just peppering queries around your model then it might be a very lightweight ORM, but the fact is you are still Mapping Objects to Relations and vice-versa.

Christoph says…
June 17, 2011 at 1:36AM

There is no ‘like’ button here? :)

Just wanted to say I fully agree, seems as if someone completely missed the point…
Know your tools and when to use the right one. Besides that, the discussion (of the initial post) could be endless as beyond every layer of abstraction there is another one, and another one, and another one ….

Mike says…
July 29, 2011 at 1:56AM

Really enjoyed this article Gabe. I think you’ve nailed a lot of the hot issues around ORMs. Actually, the more I think about it the more I feel that the ORM is one of the biggest components of Ruby on Rails that makes people not sure it’s worth it. Every time the topic, albeit over discussed, comes up around scaling a rails app the first thing people attack is the ORM. Unfortunate.

Great read.

Adrien Lamothe says…
November 23, 2011 at 9:35AM

Very well spoken, pragmatic refutation.

It would be interesting to try figuring out how something like Postgres heuristic query optimization impacts Activerecord performance.

Speaking of fat models vs. fat controllers (i.e. “I think it’s very smart if you can keep the logic entirely contained in the model, and therefore I could accept that keeping ActiveRecord primitives out of your controllers and views might be a good idea.”), I’ve heard it is better from an overall performance standpoint to put the business logic in the controllers. One person told me about a huge performance improvement when he moved business logic out of his models and into his controllers (although perhaps the original code was poorly written and got optimized when moved.) Personally, I haven’t experienced any problems working with Activerecord objects inside my controllers, they are just another set of tinker toys to play with. I prefer coding complex transactions between models inside a controller. If you place a bunch of code inside a model, doesn’t that get instantiated even when you are simply loading some rows with a simple query to pass off to a view?

Thanks for sharing your insight. Great blog.

Gabe da Silveira says…
November 24, 2011 at 1:13AM

Thanks for the comment, I cherry-picked it out of my daily spam run :)

I don’t know the effect of fat models on object instantiation time, but in general I would assume the effect is negligible. Instance methods for both controllers and models are loaded only once. Ruby is relatively slow, but it generally it does take some real work to make a noticeable performance difference (instantiating many AR objects is a possible culprit). In any case though, I would definitely profile the code in question before and after states before drawing any conclusions. Simply assuming that controller code runs faster than model code based on one occurrence without further investigation strikes me as willful naiveté.

Adrien Lamothe says…
December 6, 2011 at 8:46AM

I did a test and sure enough the fat controller version was significantly faster than the fat model. Created a new branch of one of my apps and moved a fairly complex function from the controller into the model, then ran both branches on my development machine several times, observing how long it took to process the controller. After each test the database was restored to the previous state. The application, using Rails 3.1 with Ruby 1.9.3, is a crime game called Quick Heist and the “story” I tested was a player quitting a crew while the crew is performing a heist. When a player quits a crew, data in several tables changes. The method quit_crew is in the crew controller; in the fat controller version all the code needed to change the tables, etc. is in crew/quit_crew. In the fat model version, the quit_crew method in the controller instead calls a quit_crew method in the User model, which contains the identical code, the only difference being user.columnname is changed to self.columnname.

Here are the times in milliseconds:

Fat Controller version: 330, 329, 334, 333, 325

Fat Model version: 425, 386, 463, 388, 391

While the difference is not huge, it is still significant. In a web app, all those extra cycles add up and have to be paid for. It is interesting to see that the fat controller times have much tighter variance.

I did a similar comparison last year on one of my other apps, under Rails 2.3, with similar results. Didn’t record the results of that test.

Gabriel da Silveira says…
December 6, 2011 at 9:21AM

Are you running it in production mode or development? Because reloading a model is going to be much slower just due to ActiveRecord overhead compared to reloading a controller, but it only loads once in production. And for your fat controller are you just writing raw SQL against the adapter or doing something else?

Adrien Lamothe says…
December 7, 2011 at 9:10AM

Those tests were done in development mode. I just ran the tests in production mode. The fat controller version still is significantly faster, though it outperforms the fat model version by a narrower margin.

The quit_crew method code in both the fat controller and fat model versions is identical and doesn’t use any raw sql, it only uses ActiveRecord.

The test results in production mode (in ms):

Fat Controller version: 120, 83, 90, 97, 82

Fat Model version: 178, 150, 138, 139, 126

I wouldn’t expect the run mode to affect the comparison in terms of which technique is faster, given the code is identical and only uses ActiveRecord.

In these tests, the crew size is only 5 members and overall there are few rows in the tables. It will be interesting to try the tests again if there are a large quantity of rows in the database (i.e. lots of people are playing the game.) I may try adding some dummy crew members and re-testing to see if more rows changes the performance margins.

Quite a few people have asserted to me that fat models are faster. When questioned, one of those people admitted he uses fat models because they are easier to test and the tests run faster, so I suspect people are confusing testing speed with application performance.

One person on a mailing list reported a speed increase of 3000 times faster when moving his application logic from the models to the controllers, but I suspect some of that increase was due to code refactoring/optimization during the move. Then again, perhaps large amounts of data favor fat controllers by wider margins.

Wonder why the code runs faster in the controller than within an ActiveRecord model?

Gabe da Silveira says…
December 7, 2011 at 5:48PM

I think you need to deeper profiling to understand where the time is being spent before drawing any conclusions such as fat controllers being faster. By and large code is code, and a line of code does not “run faster” in one context in another (barring side effects like GC). If it’s slower by many milliseconds most likely a lot more code is running somewhere and you need to find out why.

Adrien Lamothe says…
December 8, 2011 at 11:54AM

It actually is logically transitive. A block of code gets moved from a controller to a new method in a model, the same controller calls the new method on that model instead of executing the code itself, and when the controller is called it runs significantly slower. Whatever else may be happening was happening before and if for some reason moving the code into the model causes something new to happen that is irrelevant to the fact performance is hampered.

Gabriel da Silveira says…
December 9, 2011 at 12:04AM

But unless you know why it may just be a peculiarity of your code. If you don’t have a deeper profile to understand where this time is being spent, concluding that fat controller is universally faster is cargo-cult logic.