Memorymaxxing on Heroku
Background
You have a background worker responsible for generating multiple analyses as CSV-formattable data and compiling these as an Excel spreadsheet with cross-linking between sheets.
These analyses are performed by legacy modules that are unfinished and only lightly tested but that also "just work", so they are best treated as black boxes and left unchanged as much as possible.
In total this spreadsheet file currently comprises 52 sheets with the following approximate dimensions:
- 1 sheet: 50 rows x 10 columns, each row cross-linked to one of the following:
  - 50 sheets: each 0-50 rows x 0-100 columns
- 1 sheet: 1000 rows x 10 columns
It currently takes about 3 minutes to generate the complete spreadsheet.
A new analysis is requested that aggregates these data more granularly, producing a significantly larger additional sheet: 300,000 rows x 10 columns.
This job takes prohibitively long to complete locally, about 25 minutes, and doesn’t complete at all when deployed.
The task now is to productionize this work to make our stakeholders happy.
Optimizations
Refine requirements
At a high level, what this new analysis surfaces is performance metrics for standardized-test-takers across a certain subset of assessments administered by a certain subset of programs. We aggregate across programs and per-assessment, while also grouping by question category and subcategory. There are approximately 50 programs administering 1000 assessments, each comprising questions across 200-500 categories and subcategories. The ensuing combinatorial explosion is what produces the 300K rows.
The first bit of easily won economy comes from probing into the customer’s needs: What’s the purpose of this analysis? We want to get a sense of the contours of student performance across programs. As it happens, there are no assessments that are administered by all programs, but there’s also a longish tail of assessments that are administered by only one or a small subset of programs. Data from these assessments is less valuable for our needs, so we can start with dropping those by introducing a threshold number of programs-in-common that assessments have to have in order to be included in the analysis.
This filtering can be performed with a GROUP BY-HAVING subquery on the join table between programs and assessments when querying for the latter (in this context, aliased as "admin exam"):
    # app/models/program.rb
    # Return all admin exams associated with the receiving collection of Programs.
    # If the intersection size threshold `used_by_at_least` is provided, filter
    # the result set to those Admin Exams used by at least the provided number of
    # programs.
    def self.admin_exams(used_by_at_least: nil)
      join_records = joins(:admin_exams_programs).then do |collection|
        break collection if used_by_at_least.blank?

        collection.group(:admin_exam_id).having(
          'COUNT(admin_exams_programs.program_id) >= ?',
          used_by_at_least.to_i.abs,
        )
      end

      AdminExam.where(id: join_records.select(:admin_exam_id).distinct)
    end
Trial and error shows the first large drop in included assessments (from ~700 to ~300) comes when this value is 15:
    INTERSECTION_THRESHOLD = ENV.fetch("INTERSECTION_THRESHOLD", 15).to_i
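For context, a minimal sketch of how a caller might apply the threshold; the way the job selects its programs is an assumption here, and only Program.admin_exams and INTERSECTION_THRESHOLD come from the snippets above:

    # Hypothetical call site inside the report job. The class method above is
    # delegated to the receiving relation, so the threshold applies within
    # whatever set of programs the job is scoped to.
    programs    = Program.all # stand-in for the job's real program selection
    admin_exams = programs.admin_exams(used_by_at_least: INTERSECTION_THRESHOLD)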
Profile for and eliminate N+1 queries
Inspection of the database query logger shows more low-hanging fruit for optimization: a host of “N+1” queries from the aforementioned legacy modules.
These are easily profiled with Bullet:
    Bullet.profile do
      SpreadsheetGenerator.new(data).sheet
    end
and remedied by eager-loading associations as recommended by the profiler.
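As a hedged illustration (the categories association and exam_ids are stand-ins, not the real legacy models), the remedy usually amounts to swapping a lazy association walk for an eager load:

    # Before: one query for the exams, then one more query per exam for its
    # categories -- the N+1 shape Bullet flags.
    AdminExam.where(id: exam_ids).each do |exam|
      exam.categories.map(&:name)
    end

    # After: eager-load the association so the categories arrive in a single
    # additional query up front.
    AdminExam.where(id: exam_ids).includes(:categories).each do |exam|
      exam.categories.map(&:name)
    end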
Minimize ActiveRecord objects, except when you shouldn’t
ActiveRecord objects are memory hogs, and as a rule of thumb it's beneficial to minimize their initialization wherever it can be costlessly avoided. For example, if all you need is an integer or even just some JSON, it's wasteful to initialize AR objects as an intermediate step just to invoke, say, record.id or record.to_json.
With best-practice-decay¹, this heuristic produces overuse of #pluck, which not only doesn't avoid instantiating ActiveRecord objects (the docs are misleading on this, as pointed out by one commenter at the link), but whose use as a filtering step in queries (i.e., when replacing a subquery) can also make queries underperform for a variety of reasons:
There are several considerations when writing a query using the IN operator that can have an effect on performance.
First, IN clauses are generally internally rewritten by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to: (COL = 'a') OR (COL = 'b') OR (COL = 'c'). The execution plan for both queries will likely be equivalent assuming that you have an index on col.

Second, when using either IN or OR with a variable number of arguments, you are causing the database to have to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate - you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.

Third, many databases have a limit on the complexity of queries they can execute - one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of values to an IN clause - it can definitely happen. In which case the database will simply cancel the query request.

Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimizations do not get applied - MSDN has a decent introduction to optimizing queries for parallelism. Generally though, queries that use the UNION ALL operator are trivially parallelizable in most databases - and are preferred to logical connectives (like OR and IN) when possible.
A note on the above: even though Rails uses bind variables, it's worth emphasizing that any variability in the number of elements in the generated IN list will preclude caching of the execution plan:
SELECT "categories"."id", "categories"."name"
FROM "categories"
WHERE "categories"."id" IN ($1, $2, $3, $4, $5, ...) [["id", 5808], ["id", 141], ["id", 143], ["id", 168], ...]
Profile memory consumption, eliminate OpenStructs
This is the part of the optimization sequence where you’ve exhausted the low-hanging fruit and made performance gains but still aren’t seeing enough improvement, so you actually have to engage your brain and measure things.
At this point, jobs are completing when deployed as well as locally, but they still take too long because of excess memory consumption:
A dyno requires memory in excess of its quota. If this error occurs, the dyno will page to swap space to continue running, which may cause degraded process performance. [source]
So you profile memory and notice an unexpected recurring character across the profiler report:
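The report below has the shape of memory_profiler output; assuming that's the tool in use (the post doesn't name it), a minimal sketch of producing it around the generation step looks like this:

    require "memory_profiler"

    report = MemoryProfiler.report do
      SpreadsheetGenerator.new(data).sheet
    end

    report.pretty_print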
retained memory by gem
-----------------------------------
212691288 ostruct
13123073 msgpack-1.4.2
retained memory by file
-----------------------------------
212691288 /usr/local/lib/ruby/3.0.0/ostruct.rb
13123073 /bundle/ruby/3.0.0/gems/msgpack-1.4.2/lib/msgpack/factory.rb
3135621 /bundle/ruby/3.0.0/gems/i18n-1.8.11/lib/i18n/core_ext/hash.rb
retained memory by location
-----------------------------------
160806744 /usr/local/lib/ruby/3.0.0/ostruct.rb:214
38433080 /usr/local/lib/ruby/3.0.0/ostruct.rb:215
13451424 /usr/local/lib/ruby/3.0.0/ostruct.rb:146
13123073 /bundle/ruby/3.0.0/gems/msgpack-1.4.2/lib/msgpack/factory.rb:70
2614656 /bundle/ruby/3.0.0/gems/i18n-1.8.11/lib/i18n/core_ext/hash.rb:52
retained memory by class
-----------------------------------
122640368 Class
76946720 Proc
16700336 String
14529824 Hash
3572928 Array
3202720 OpenStruct
227600 ActiveModel::Attribute::WithCastValue
112568 RubyVM::InstructionSequence
retained objects by class
-----------------------------------
961834 Proc
400050 String
81902 Hash
80173 Class
80068 OpenStruct
4205 Array
2845 ActiveModel::Attribute::WithCastValue
1579 Symbol
OpenStruct is consuming memory hugely out of proportion to the utility it's providing. Why? It turns out OpenStruct can leak memory because of its use of define_method to define accessors dynamically. This is in addition to the various caveats detailed in its documentation.
The inimitable tenderlove, Ruby genius and world-class punster who I once made cackle with a well-timed pun at a work dinner (tbh probably my crowning achievement in a decade-long career, if it can be called that, in tech), has an old deep-dive where he cautions about the underlying problem:
Caution
Defining methods with define_method is faster, consumes less memory, and depending on your application isn’t significantly slower than using a class_eval defined method. So what is the down side?
Closures
The main down side is that define_method creates a closure. The closure could hold references to large objects, and those large objects will never be garbage collected. For example:
    class Foo
      x = "X" * 1024000 # Not GC'd
      define_method("foo") { }
    end

    class Bar
      x = "X" * 1024000 # Is GC'd
      class_eval("def foo; end")
    end
The closure could access the local variable x in the Foo class, so that variable cannot be garbage collected.
This is a familiar problem I've dealt with in JavaScript and Objective-C, but this is the first time I've seen it in Ruby, in particular with OpenStruct.
OpenStruct essentially just provides some syntactic sugar for readability here, so getting rid of it is a big win at negligible cost. Even the class's documentation is itself not a fan:
For all these reasons, consider not using OpenStruct at all.
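A minimal sketch of the swap (the row fields here are illustrative, not the real sheet schema):

    require "ostruct"

    # Before: one OpenStruct per row, with accessors defined dynamically per
    # instance -- consistent with the pile of Class and Proc objects the
    # profiler reports as retained.
    row = OpenStruct.new(program: "Program A", exam: "Exam 1", mean_score: 0.82)

    # After: a Struct defined once; instances are compact and their accessors
    # are ordinary methods with nothing closed over.
    Row = Struct.new(:program, :exam, :mean_score, keyword_init: true)
    row = Row.new(program: "Program A", exam: "Exam 1", mean_score: 0.82)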
Results
The optimizations detailed above reduce the running time for the background job in question by a factor of 10, from 25 minutes to around 2.
¹ A special case of alpha decay.