Memorymaxxing on Heroku

28 Jan 2023 - Houston, Texas

Background

You have a background worker responsible for generating multiple analyses as CSV-formattable data and compiling these as an Excel spreadsheet with cross-linking between sheets.

These analyses are performed by legacy modules that are unfinished and only lightly tested but also “just work", so these are best treated as black boxes and left unchanged as much as possible.

In total this spreadsheet file currently comprises 52 sheets with the following approximate dimensions:

1 sheet: 50 rows x 10 columns, each cross-linked to one of:
50 sheets: each 0-50 rows x 0-100 columns
1 sheet: 1000 rows x 10 columns

It currently takes about 3 minutes to generate the complete spreadsheet.

A new analysis is requested that aggregates these data more granularly, producing a significantly larger additional sheet: 300,000 rows x 10 columns.

This job takes prohibitively long to complete locally, about 25 minutes, and doesn’t complete at all when deployed.

The task now is to productionize this work to make our stakeholders happy.

Optimizations

Refine requirements

At a high level, what this new analyis surfaces is performance metrics for standardized-test-takers across a certain subset of assessments administered by a certain subset of programs. We aggregate across programs and per-assessment, while also grouping by question category and subcategory. There are approximately 50 programs administering 1000 assessments, each comprising questions across 200-500 categories and subcategories. The ensuing combinatorial explosion is what produces the 300K rows.

The first bit of easily won economy comes from probing into the customer’s needs: What’s the purpose of this analysis? We want to get a sense of the contours of student performance across programs. As it happens, there are no assessments that are administered by all programs, but there’s also a longish tail of assessments that are administered by only one or a small subset of programs. Data from these assessments is less valuable for our needs, so we can start with dropping those by introducing a threshold number of programs-in-common that assessments have to have in order to be included in the analysis.

This filtering can be performed with a GROUP BY-HAVING subquery on the join table between programs and assessments when querying for the latter (in this context, aliased as “admin exam”):

# app/models/program.rb

# Return all admin exams associated with the receiving colletion of Programs.
# If the intersection size threshold `used_by_at_least` is provided, filter
# the result set to those Admin Exams used by at least the provided number of
# programs.
def self.admin_exams(used_by_at_least: nil)
  join_records = joins(:admin_exams_programs).then do |collection|
    break collection if used_by_at_least.blank?

    collection.group(:admin_exam_id).having(
      'COUNT(admin_exams_programs.program_id) >= ?',
      used_by_at_least.to_i.abs,
    )
  end

  AdminExam.where(id: join_records.select(:admin_exam_id).distinct)
end

Trial and error shows the first large drop in included assessments (from around ~700 to ~300) comes when this value is 15:

INTERSECTION_THRESHOLD = ENV.fetch("INTERSECTION_THRESHOLD", 15).to_i

Profile for and eliminate N+1 queries

Inspection of the database query logger shows more low-hanging fruit for optimization: a host of “N+1” queries from the aforementioned legacy modules.

These are easily profiled for using Bullet:

Bullet.profile do
  SpreadsheetGenerator.new(data).sheet
end

and remedied by eager-loading associations as recommended by the profiler.

Minimize ActiveRecord objects, except when you shouldn’t

ActiveRecord objects are memory hogs and as a rule of thumb it’s beneficial to minimize their initialization wherever it can be costlessly avoided. For example if all you need is an integer or even just some JSON, it’s wasteful to initialize AR objects as an intermediate step, just to invoke, say, record.id or record.to_json.

With best-practice-decay¹, this heuristic produces overuse of #pluck, which not only doesn’t avoid instantiating ActiveRecord objects (the docs are misleading on this, as pointed out by one commenter at the link), its use as a filtering step in queries (i.e., when replacing a subquery) can also make queries underperform for a variety of reasons:

There are several considerations when writing a query using the IN operator that can have an effect on performance.
First, IN clauses are generally internally rewritten by most databases to use the OR logical connective. So col IN ('a','b','c') is rewritten to: (COL = 'a') OR (COL = 'b') OR (COL = 'c'). The execution plan for both queries will likely be equivalent assuming that you have an index on col.
Second, when using either IN or OR with a variable number of arguments, you are causing the database to have to re-parse the query and rebuild an execution plan each time the arguments change. Building the execution plan for a query can be an expensive step. Most databases cache the execution plans for the queries they run using the EXACT query text as a key. If you execute a similar query but with different argument values in the predicate - you will most likely cause the database to spend a significant amount of time parsing and building execution plans. This is why bind variables are strongly recommended as a way to ensure optimal query performance.
Third, many databases have a limit on the complexity of queries they can execute - one of those limits is the number of logical connectives that can be included in the predicate. In your case, a few dozen values are unlikely to reach the built-in limit of the database, but if you expect to pass hundreds or thousands of value to an IN clause - it can definitely happen. In which case the database will simply cancel the query request.
Fourth, queries that include IN and OR in the predicate cannot always be optimally rewritten in a parallel environment. There are various cases where parallel server optimization do not get applied - MSDN has a decent introduction to optimizing queries for parallelism. Generally though, queries that use the UNION ALL operator are trivially parrallelizable in most databases - and are preferred to logical connectives (like OR and IN) when possible.

A note on the above: even though Rails uses bind variables, it’s worth emphasizing that any variability in the number of elements in the generated SQL array will preclude caching of the execution plan:

SELECT "categories"."id", "categories"."name"
FROM "categories"
WHERE "categories"."id" IN ($1, $2, $3, $4, $5, ...)  [["id", 5808], ["id", 141], ["id", 143], ["id", 168], ...]

Profile memory consumption, eliminate `OpenStructs`

This is the part of the optimization sequence where you’ve exhausted the low-hanging fruit and made performance gains but still aren’t seeing enough improvement, so you actually have to engage your brain and measure things.

At this points jobs are completing when deployed as well as locally, but they still take too long because of excess memory consumption:

A dyno requires memory in excess of its quota. If this error occurs, the dyno will page to swap space to continue running, which may cause degraded process performance. [source]

So you profile memory and notice an unexpected recurring character across the profiler report:

retained memory by gem
-----------------------------------
 212691288  ostruct
  13123073  msgpack-1.4.2

retained memory by file
-----------------------------------
 212691288  /usr/local/lib/ruby/3.0.0/ostruct.rb
  13123073  /bundle/ruby/3.0.0/gems/msgpack-1.4.2/lib/msgpack/factory.rb
   3135621  /bundle/ruby/3.0.0/gems/i18n-1.8.11/lib/i18n/core_ext/hash.rb

retained memory by location
-----------------------------------
 160806744  /usr/local/lib/ruby/3.0.0/ostruct.rb:214
  38433080  /usr/local/lib/ruby/3.0.0/ostruct.rb:215
  13451424  /usr/local/lib/ruby/3.0.0/ostruct.rb:146
  13123073  /bundle/ruby/3.0.0/gems/msgpack-1.4.2/lib/msgpack/factory.rb:70
   2614656  /bundle/ruby/3.0.0/gems/i18n-1.8.11/lib/i18n/core_ext/hash.rb:52

retained memory by class
-----------------------------------
 122640368  Class
  76946720  Proc
  16700336  String
  14529824  Hash
   3572928  Array
   3202720  OpenStruct
    227600  ActiveModel::Attribute::WithCastValue
    112568  RubyVM::InstructionSequence

retained objects by class
-----------------------------------
    961834  Proc
    400050  String
     81902  Hash
     80173  Class
     80068  OpenStruct
      4205  Array
      2845  ActiveModel::Attribute::WithCastValue
      1579  Symbol

OpenStruct is consuming memory hugely out of proportion to the utility it’s providing. Why? It turns out OpenStruct can leak memory because of its use of define_method to define accessors dynamically. This in addition to the various caveats detailed in its documentation.

The inimitable tenderlove, Ruby genius and world-class punster who I once made cackle with a well-timed pun at a work dinner (tbh probably my crowning achievement in a decade-long career, if it can be called that, in tech), has an old deep-dive where he cautions about the underlying problem:

Caution
Defining methods with define_method is faster, consumes less memory, and depending on your application isn’t significantly slower than using a class_eval defined method. So what is the down side?
Closures
The main down side is that define_method creates a closure. The closure could hold references to large objects, and those large objects will never be garbage collected. For example:

    class Foo
      x = "X" * 1024000 # Not GC'd
      define_method("foo") { }
    end

    class Bar
      x = "X" * 1024000 # Is GC'd
      class_eval("def foo; end")
    end

The closure could access the local variable x in the Foo class, so that variable cannot be garbage collected.

This is a familiar problem I’ve dealt with in JavaScript and Objective-C, but this is the first I’ve seen it in Ruby, in particular with OpenStruct.

OpenStruct essentially just provides some syntatic sugar for readability here, so getting rid of it is a big win at negligible cost. Even the class’s documentation is itself not a fan:

For all these reasons, consider not using OpenStruct at all.

Results

The optimizations detailed above reduce the running time for the background job in question by a factor of 10, from 25 minutes to around 2.

A special case of alpha decay. ↩︎