Troubleshooting Ruby Enumerable Memory Usage
Previously, I discussed an implementation within an ETL system where I removed a lot of complexity from my code by relying on Ruby’s Enumerable module. While using the default implementation of Enumerable#each_slice is simplest from a readability and maintenance standpoint, I found that the #each_slice method can use a significant amount of memory under certain conditions. In this post, I will walk through my debugging process and show a solution to the memory issues that I discovered. By overriding the #each_slice method in my class, I was able to optimize for memory conservation.
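To see why #each_slice is even a suspect, note that every slice it yields is a fully materialized Array. A quick illustration from irb:

(1..10).each_slice(4).to_a
# => [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]

With a slice size of 100,000, each yielded array holds 100,000 CSV line strings in memory at once.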
The Enumerable Approach
To provide context for this investigation, I have two classes:

EventsCSV - represents a large CSV of records.
FileSplitter - splits the large CSV into smaller, more manageable files.

In practice, the smaller files are uploaded to a remote file repository and are each consumed in parallel via background jobs; those details are outside the scope of this post.
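For illustration, here is the kind of input the code expects. The column names are made up, but the structural assumption is real: the first line is a header row, followed by one record per line.

id,event,timestamp
1,signup,2015-06-01T00:00:00Z
2,click,2015-06-01T00:00:01Z
...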
class EventsCSV
  include Enumerable

  attr_reader :headers

  def initialize(file_path)
    @file_path = file_path
    @file = File.new(file_path, 'r')
    @headers = @file.gets # consume the header row up front
  end

  def close
    @file.close
  end

  # Yields one line at a time, so Enumerable methods stream from disk.
  def each
    while line = @file.gets
      yield line
    end
  end
end
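Because EventsCSV includes Enumerable and defines #each, it inherits the entire Enumerable API. A quick hypothetical session (the file name and contents are assumptions for illustration):

csv = EventsCSV.new('events.csv')
csv.headers  # => "id,event,timestamp\n"
csv.first(3) # => the first three data rows, streamed lazily from disk
csv.close

The FileSplitter below builds on exactly this interface.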
class FileSplitter
  BATCH_LIMIT = 100_000

  def split_large_csv_file(file_path)
    events_csv = EventsCSV.new(file_path)

    events_csv
      .each_slice(BATCH_LIMIT)
      .each_with_index
      .map do |batch, i|
        "#{file_path}_#{i}".tap do |split_file_path|
          write_batch_to_file(split_file_path, events_csv.headers, batch)
        end
      end
  end

  def write_batch_to_file(file_path, headers, batch)
    File.open(file_path, 'w') do |csv|
      csv << headers
      batch.each { |row| csv << row }
    end
  end
end
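For a concrete picture of the intended flow, here is a hypothetical invocation (the file name is an assumption):

splitter = FileSplitter.new
splitter.split_large_csv_file('big_csv.csv')
# => ["big_csv.csv_0", "big_csv.csv_1", ...]
# one output file per 100,000-row batch, each with the header line prepended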
So, in production, the CSV files that are processed can contain multiple gigabytes of data. Shortly after releasing this code, I observed that the worker processes splitting large CSV files were running out of memory. The result was failed Resque jobs due to Resque::DirtyExit errors.
Debugging
Initially, I assumed that I was not releasing the memory from each separate slice of data as I iterated through the CSV file in batches. In order to try to reveal the leak, I included the very helpful get_process_mem gem. After pulling the EventsCSV and FileSplitter into an easily executable, standalone program, I began introspecting the code’s memory usage. Using a CSV with around 4 million records, I decided to execute the following script.
# profile.rb
require 'get_process_mem'
require 'time'
require_relative 'events_csv'
require_relative 'file_splitter'

start = Time.now
file_path = 'big_csv.csv'

file_splitter = FileSplitter.new
split_files = file_splitter.split_large_csv_file(file_path)

puts "Finished in #{(Time.now - start).round(3)} seconds"
Even though my main goal was to decrease the memory usage of my algorithm, I also wanted to be aware of any significant performance penalties introduced as the code consumed less memory, so I decided to print the script’s execution time as well.
So, using the get_process_mem gem, I was able to log the memory usage of the Ruby script like so:
class FileSplitter
  BATCH_LIMIT = 100_000

  def split_large_csv_file(file_path)
    events_csv = EventsCSV.new(file_path)
    print_memory_usage("BEFORE each_slice")

    events_csv
      .each_slice(BATCH_LIMIT)
      .each_with_index
      .map do |batch, i|
        print_memory_usage("BEFORE batch #{i}")
        "#{file_path}_#{i}".tap do |split_file_path|
          write_batch_to_file(split_file_path, events_csv.headers, batch)
          print_memory_usage("AFTER batch #{i}")
        end
      end
  end

  def write_batch_to_file(file_path, headers, batch)
    File.open(file_path, 'w') do |csv|
      csv << headers
      batch.each { |row| csv << row }
    end
  end

  def print_memory_usage(prefix = '')
    mb = GetProcessMem.new.mb
    puts "#{prefix} - MEMORY USAGE(MB): #{mb.round}"
  end
end
Alright, so running ruby profile.rb produced the following output in the console:
BEFORE each_slice - MEMORY USAGE(MB): 17
BEFORE batch 0 - MEMORY USAGE(MB): 38
AFTER batch 0 - MEMORY USAGE(MB): 41
BEFORE batch 1 - MEMORY USAGE(MB): 60
AFTER batch 1 - MEMORY USAGE(MB): 64
BEFORE batch 2 - MEMORY USAGE(MB): 82
AFTER batch 2 - MEMORY USAGE(MB): 86
BEFORE batch 3 - MEMORY USAGE(MB): 106
AFTER batch 3 - MEMORY USAGE(MB): 106
BEFORE batch 4 - MEMORY USAGE(MB): 109
AFTER batch 4 - MEMORY USAGE(MB): 109
BEFORE batch 5 - MEMORY USAGE(MB): 110
AFTER batch 5 - MEMORY USAGE(MB): 110
BEFORE batch 6 - MEMORY USAGE(MB): 111
AFTER batch 6 - MEMORY USAGE(MB): 111
BEFORE batch 7 - MEMORY USAGE(MB): 127
AFTER batch 7 - MEMORY USAGE(MB): 130
BEFORE batch 8 - MEMORY USAGE(MB): 151
AFTER batch 8 - MEMORY USAGE(MB): 151
BEFORE batch 9 - MEMORY USAGE(MB): 161
AFTER batch 9 - MEMORY USAGE(MB): 152
...
BEFORE batch 40 - MEMORY USAGE(MB): 158
AFTER batch 40 - MEMORY USAGE(MB): 158
BEFORE batch 41 - MEMORY USAGE(MB): 166
AFTER batch 41 - MEMORY USAGE(MB): 166
BEFORE batch 42 - MEMORY USAGE(MB): 168
AFTER batch 42 - MEMORY USAGE(MB): 168
Finished in 6.361 seconds
In production, my containers have 512 MB of available memory, so given that this barebones script consumes 168 MB of memory, it’s reasonable that a container that also runs a full Rails app could exhaust its available resources. Also, notice that the memory usage jumps significantly after we start calling #each_slice. Furthermore, the memory usage eventually plateaus, so I felt it safe to assume, at this point, that there wasn’t a memory leak in the code; the implementation did not appear to be retaining references to each separate slice of data. If that were the problem, I would have expected memory usage to keep climbing indefinitely in a more-or-less linear manner. To verify that #each_slice was using a lot of memory, I tried reducing the BATCH_LIMIT from 100,000 to 10,000.
BEFORE each_slice - MEMORY USAGE(MB): 17
BEFORE batch 0 - MEMORY USAGE(MB): 19
AFTER batch 0 - MEMORY USAGE(MB): 19
BEFORE batch 1 - MEMORY USAGE(MB): 21
AFTER batch 1 - MEMORY USAGE(MB): 21
BEFORE batch 2 - MEMORY USAGE(MB): 23
AFTER batch 2 - MEMORY USAGE(MB): 23
BEFORE batch 3 - MEMORY USAGE(MB): 25
AFTER batch 3 - MEMORY USAGE(MB): 25
BEFORE batch 4 - MEMORY USAGE(MB): 26
AFTER batch 4 - MEMORY USAGE(MB): 26
...
BEFORE batch 426 - MEMORY USAGE(MB): 40
AFTER batch 426 - MEMORY USAGE(MB): 40
BEFORE batch 427 - MEMORY USAGE(MB): 40
AFTER batch 427 - MEMORY USAGE(MB): 40
BEFORE batch 428 - MEMORY USAGE(MB): 40
AFTER batch 428 - MEMORY USAGE(MB): 40
Finished in 10.934 seconds
So when using smaller batches, the program consumes less memory. To me, that indicated that the implementation of Enumerable#each_slice consumes memory in proportion to the size of the slices. That’s not entirely surprising, but then I started to wonder if there was an alternative #each_slice implementation that would consume less memory with large batches.
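For intuition about why memory scales with the slice size: semantically, #each_slice behaves like the sketch below. This is only a conceptual model (MRI implements the real method in C), but it shows the crux: n elements are buffered into an Array before each yield.

module ConceptualEachSlice
  # A conceptual model of Enumerable#each_slice, not MRI's implementation.
  def each_slice_sketch(n)
    buffer = []
    each do |element|
      buffer << element           # every row is retained until the slice fills
      if buffer.size == n
        yield buffer              # a fully materialized n-element Array
        buffer = []
      end
    end
    yield buffer unless buffer.empty? # the final, possibly short, slice
  end
end

With a slice size of 100,000, every yielded slice is an array of 100,000 line strings, which is consistent with the roughly 20 MB jumps between batches in the first profiling run.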
The Fix
So I wanted to try overriding Enumerable#each_slice with my own implementation. But what exactly did I want to return as a slice? Well, I knew how I was going to use my #each_slice implementation in practice; I knew that, in my FileSplitter class, I was going to call:

events_csv
  .each_slice(BATCH_LIMIT)
  .each_with_index
  .map do |batch, i|
    ...
  end
So, it looks like whatever I return from #each_slice should also include Enumerable so that #each_with_index and #map will work as expected.

class EventsCSV
  def each_slice(n)
    # Something Enumerable
  end
end
So I went ahead and created a placeholder class just to wrap my head around how this was going to work. I figured that this class would at least need a reference to the EventsCSV and the slice size in order to work.
class EventsCSV::SliceEnumerator
  include Enumerable

  def initialize(events_csv, slice_size)
    @events_csv = events_csv
    @slice_size = slice_size
  end

  def each
    # yields a slice, I guess?
  end
end
Okay, so using this model, what is a slice exactly? I knew that I did not want to build an array of the given slice size; that is why Enumerable#each_slice consumes so much memory. Rather, I wanted to yield one line of the file at a time and to stop after reading a given number of lines. So something like:
class EventsCSV::SliceEnumerator
  include Enumerable

  def initialize(events_csv, slice_size)
    @events_csv = events_csv
    @slice_size = slice_size
  end

  def each
    # Keep handing out slices until the underlying file is exhausted.
    yield Slice.new(@events_csv, @slice_size) while @events_csv.any?
  end
end

class EventsCSV::Slice
  include Enumerable

  def initialize(events_csv, size)
    @events_csv = events_csv
    @size = size
  end

  def each
    # Stream rows from the shared file handle, stopping after @size rows.
    @events_csv.each_with_index do |row, i|
      yield row
      break if i + 1 == @size
    end
  end
end
A small aside: I also had to implement EventsCSV#any?. Because EventsCSV#each calls @file.gets, calling Enumerable#any? will actually read a line from the file; Enumerable#any? proved to be destructive. (For example, calling events_csv.any? before this fix would silently consume the next data row.) So we also have:
class EventsCSV
  # EventsCSV stuff...

  def any?
    !@file.eof?
  end
end
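One piece these snippets leave implicit is the override itself. Presumably it just delegates to the new enumerator, along these lines (a minimal sketch of the wiring):

class EventsCSV
  def each_slice(n)
    SliceEnumerator.new(self, n)
  end
end

Since SliceEnumerator includes Enumerable, the each_slice(BATCH_LIMIT).each_with_index.map chain in FileSplitter works unchanged. One caveat: every Slice reads from the same underlying file handle, so slices must be consumed in order and exactly once; enumerating a slice twice, or skipping one, would not yield the rows you might expect.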
So after setting the BATCH_LIMIT back to 100,000, the results of the profiling script were:
BEFORE each_slice - MEMORY USAGE(MB): 17
BEFORE batch 0 - MEMORY USAGE(MB): 17
AFTER batch 0 - MEMORY USAGE(MB): 19
BEFORE batch 1 - MEMORY USAGE(MB): 19
AFTER batch 1 - MEMORY USAGE(MB): 20
BEFORE batch 2 - MEMORY USAGE(MB): 20
AFTER batch 2 - MEMORY USAGE(MB): 20
...
BEFORE batch 40 - MEMORY USAGE(MB): 20
AFTER batch 40 - MEMORY USAGE(MB): 20
BEFORE batch 41 - MEMORY USAGE(MB): 20
AFTER batch 41 - MEMORY USAGE(MB): 20
BEFORE batch 42 - MEMORY USAGE(MB): 20
AFTER batch 42 - MEMORY USAGE(MB): 20
Finished in 4.652 seconds
Perfect! That’s exactly what I was looking to accomplish: memory usage now stays flat at around 20 MB even with 100,000-row batches, and, as a bonus, the run finished faster than the original (4.652 seconds versus 6.361 seconds).