# Background

You have a background worker responsible for generating multiple analyses as CSV-formattable data and compiling these as an Excel spreadsheet with cross-linking between sheets.

These analyses are performed by legacy modules that are unfinished and only lightly tested but also “just work", so these are best treated as black boxes and left unchanged as much as possible.

In total this spreadsheet file currently comprises 52 sheets with the following approximate dimensions:

• 1 sheet: 50 rows x 10 columns, each cross-linked to one of:
• 50 sheets: each 0-50 rows x 0-100 columns
• 1 sheet: 1000 rows x 10 columns

A new analysis is requested that aggregates these data more granularly, producing a significantly larger additional sheet: 300,000 rows x 10 columns.

This job takes prohibitively long to complete locally, about 25 minutes, and doesn’t complete at all when deployed.

The task now is to productionize this work to make our stakeholders happy.

# Optimizations

## Refine requirements

At a high level, what this new analyis surfaces is performance metrics for standardized-test-takers across a certain subset of assessments administered by a certain subset of programs. We aggregate across programs and per-assessment, while also grouping by question category and subcategory. There are approximately 50 programs administering 1000 assessments, each comprising questions across 200-500 categories and subcategories. The ensuing combinatorial explosion is what produces the 300K rows.

The first bit of easily won economy comes from probing into the customer’s needs: What’s the purpose of this analysis? We want to get a sense of the contours of student performance across programs. As it happens, there are no assessments that are administered by all programs, but there’s also a longish tail of assessments that are administered by only one or a small subset of programs. Data from these assessments is less valuable for our needs, so we can start with dropping those by introducing a threshold number of programs-in-common that assessments have to have in order to be included in the analysis.

This filtering can be performed with a GROUP BY-HAVING subquery on the join table between programs and assessments when querying for the latter (in this context, aliased as “admin exam”):

# app/models/program.rb

# Return all admin exams associated with the receiving colletion of Programs.
# If the intersection size threshold used_by_at_least is provided, filter
# the result set to those Admin Exams used by at least the provided number of
# programs.
break collection if used_by_at_least.blank?

used_by_at_least.to_i.abs,
)
end

end


Trial and error shows the first large drop in included assessments (from around ~700 to ~300) comes when this value is 15:

INTERSECTION_THRESHOLD = ENV.fetch("INTERSECTION_THRESHOLD", 15).to_i


## Profile for and eliminate N+1 queries

Inspection of the database query logger shows more low-hanging fruit for optimization: a host of “N+1” queries from the aforementioned legacy modules.

These are easily profiled for using Bullet:

Bullet.profile do
end


## Minimize ActiveRecord objects, except when you shouldn’t

ActiveRecord objects are memory hogs and as a rule of thumb it’s beneficial to minimize their initialization wherever it can be costlessly avoided. For example if all you need is an integer or even just some JSON, it’s wasteful to initialize AR objects as an intermediate step, just to invoke, say, record.id or record.to_json.

With best-practice-decay1, this heuristic produces overuse of #pluck, which not only doesn’t avoid instantiating ActiveRecord objects (the docs are misleading on this, as pointed out by one commenter at the link), its use as a filtering step in queries (i.e., when replacing a subquery) can also make queries underperform for a variety of reasons:

There are several considerations when writing a query using the IN operator that can have an effect on performance.

First, IN clauses are generally internally rewritten by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to: (COL = 'a') OR (COL = 'b') OR (COL = 'c'). The execution plan for both queries will likely be equivalent assuming that you have an index on col.

Second, when using either IN or OR with a variable number of arguments, you are causing the database to have to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate - you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.

Third, many databases have a limit on the complexity of queries they can execute - one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of value to an IN clause - it can definitely happen. In which case the database will simply cancel the query request.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimization do not get applied - MSDN has a decent introduction to optimizing queries for parallelism. Generally though, queries that use the UNION ALL operator are trivially parrallelizable in most databases - and are preferred to logical connectives (like OR and IN) when possible.

A note on the above: even though Rails uses bind variables, it’s worth emphasizing that any variability in the number of elements in the generated SQL array will preclude caching of the execution plan:

SELECT "categories"."id", "categories"."name"
FROM "categories"
WHERE "categories"."id" IN ($1,$2, $3,$4, \$5, ...)  [["id", 5808], ["id", 141], ["id", 143], ["id", 168], ...]


## Profile memory consumption, eliminate OpenStructs

This is the part of the optimization sequence where you’ve exhausted the low-hanging fruit and made performance gains but still aren’t seeing enough improvement, so you actually have to engage your brain and measure things.

At this points jobs are completing when deployed as well as locally, but they still take too long because of excess memory consumption:

A dyno requires memory in excess of its quota. If this error occurs, the dyno will page to swap space to continue running, which may cause degraded process performance. [source]

So you profile memory and notice an unexpected recurring character across the profiler report:

retained memory by gem
-----------------------------------
212691288  ostruct
13123073  msgpack-1.4.2

retained memory by file
-----------------------------------
212691288  /usr/local/lib/ruby/3.0.0/ostruct.rb
13123073  /bundle/ruby/3.0.0/gems/msgpack-1.4.2/lib/msgpack/factory.rb
3135621  /bundle/ruby/3.0.0/gems/i18n-1.8.11/lib/i18n/core_ext/hash.rb

retained memory by location
-----------------------------------
160806744  /usr/local/lib/ruby/3.0.0/ostruct.rb:214
38433080  /usr/local/lib/ruby/3.0.0/ostruct.rb:215
13451424  /usr/local/lib/ruby/3.0.0/ostruct.rb:146
13123073  /bundle/ruby/3.0.0/gems/msgpack-1.4.2/lib/msgpack/factory.rb:70
2614656  /bundle/ruby/3.0.0/gems/i18n-1.8.11/lib/i18n/core_ext/hash.rb:52

retained memory by class
-----------------------------------
122640368  Class
76946720  Proc
16700336  String
14529824  Hash
3572928  Array
3202720  OpenStruct
227600  ActiveModel::Attribute::WithCastValue
112568  RubyVM::InstructionSequence

retained objects by class
-----------------------------------
961834  Proc
400050  String
81902  Hash
80173  Class
80068  OpenStruct
4205  Array
2845  ActiveModel::Attribute::WithCastValue
1579  Symbol


OpenStruct is consuming memory hugely out of proportion to the utility it’s providing. Why? It turns out OpenStruct can leak memory because of its use of define_method to define accessors dynamically. This in addition to the various caveats detailed in its documentation.

The inimitable tenderlove, Ruby genius and world-class punster who I once made cackle with a well-timed pun at a work dinner (tbh probably my crowning achievement in a decade-long career, if it can be called tht, in tech), has an old deep-dive where he cautions about the underlying problem:

Caution

Defining methods with define_method is faster, consumes less memory, and depending on your application isn’t significantly slower than using a class_eval defined method. So what is the down side?

Closures

The main down side is that define_method creates a closure. The closure could hold references to large objects, and those large objects will never be garbage collected. For example:

    class Foo
x = "X" * 1024000 # Not GC'd
define_method("foo") { }
end

class Bar
x = "X" * 1024000 # Is GC'd
class_eval("def foo; end")
end


The closure could access the local variable x in the Foo class, so that variable cannot be garbage collected.

This is a familiar problem I’ve dealt with in JavaScript and Objective-C, but this is the first I’ve seen it in Ruby, in particular with OpenStruct.

OpenStruct essentially just provides some syntatic sugar for readability here, so getting rid of it is a big win at negligible cost. Even the class’s documentation is itself not a fan:

For all these reasons, consider not using OpenStruct at all.

# Results

The optimizations detailed above reduce the running time for the background job in question by a factor of 10, from 25 minutes to around 2.

1. As a stylized fact: over time, consumer goods follow a trajectory down-market as efficiencies are found in their production (read: quality declines) and new markets for them are created (read: new buyers with lower purchasing power are found). This holds for information goods as well, and takes hold of any “best practice” once it becomes a Best Practice™ and begins to be bought and sold in the content economy. ↩︎