An Object-oriented Moment
Not too long ago, I had a rare epiphany about how to reimplement a procedural algorithm with what I believe to be a cleaner, object-oriented approach. In the past, I have often struggled to see these opportunities. The first programs I ever wrote were not in an object-oriented language, and I wonder how much that has hindered my ability to design software in a consistently object-oriented manner. Nevertheless, given this latest discovery, I may have made a small breakthrough.
The Problem
At Stitch Fix, my team often finds itself in situations that call for processing data in batches of a certain maximum size. Sometimes our batch sizes are dictated by third-party API limits, and sometimes we need batching to avoid long-running background jobs. We, as an engineering organization, consider it a best practice to avoid long-running background jobs; among other benefits, this helps prevent lost work, or large amounts of work that need to be retried, when background worker processes shut down unexpectedly.
The System
Our email service provider (ESP) provides CSV files describing events for all of our emails throughout the day. These files can often be several gigabytes of data, especially on days when we send out large promotional campaigns. We have an internal service that downloads those files from our ESP every day and saves the events to a database so that our customer experience team can see a history of the emails a client has received.
The downloading and recording of events is done in Resque background jobs. In order to keep the background jobs speedy (ideally under 5 minutes of runtime), we experimented and determined that we could save batches of 100,000 email events at a time.
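To make that concrete, here is a rough sketch of what a one-batch-per-job Resque worker could look like. The class and method names here (SaveEmailEventsJob, S3EventFile, EmailEvent) are hypothetical, not our actual production code; the point is just the shape of one short job per batch file.

class SaveEmailEventsJob
  @queue = :email_events

  # s3_key identifies one of the smaller batch CSV files in S3.
  def self.perform(s3_key)
    S3EventFile.download(s3_key) do |csv_path|
      EmailEvent.import_from_csv(csv_path)
    end
  end
end

# One job is enqueued per batch file, keeping each job's runtime short:
# Resque.enqueue(SaveEmailEventsJob, 'events_2016_01_01.csv_1')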
Alright, so, how do we create batches of 100,000 email events? After we download the large events file from our ESP, we iterate through it, and for every 100,000 rows in the CSV we create a separate, smaller CSV file in an AWS S3 bucket. Each background job then saves the events from one of these smaller S3 files to the database.
The Algorithm
The Procedural Approach
So, originally, I implemented a method to split large files, and this first attempt was extremely procedural.
class FileSplitter
  BATCH_LIMIT = 100_000

  def split_large_csv_file(file_path)
    i = 0
    file_index = 1
    File.open(file_path, 'r') do |f|
      headers = f.gets
      csv = File.new("#{file_path}_#{file_index}", 'w')
      csv << headers
      while line = f.gets
        csv << line
        i += 1
        # Every BATCH_LIMIT rows, close the current file and open the next one.
        if i % BATCH_LIMIT == 0
          csv.close
          file_index += 1
          csv = File.new("#{file_path}_#{file_index}", 'w')
          csv << headers
        end
      end
      if i % BATCH_LIMIT != 0
        csv.close
      else
        # The loop just opened a fresh file containing only headers;
        # close it and drop it from the count.
        csv.close
        file_index -= 1
      end
    end
    # Return an array of file paths representing the smaller CSVs
    (1..file_index).map { |n| "#{file_path}_#{n}" }
  end
end
Yikes. I don’t think there is anything wrong with procedural programming, but in this case I could actually see the edge cases in the code; the implementation looks like an off-by-one error just waiting to happen. To me, the complexity of the algorithm felt far too exposed in the method.
My first thought was to break this method into smaller methods, but I realized that would probably mean either creating more blocks to encapsulate opening and closing files, or splitting the responsibility for opening and closing files across multiple methods. I didn’t like the sound of either of those solutions. In a rare moment of, I guess, inspiration (although that sounds like I am giving myself too much credit), I paused to rethink my entire approach.
It occurred to me to ask: how does Ruby, as a language, or perhaps more accurately its standard library, handle batching data? Well, it turns out that Ruby’s Enumerable module provides the each_slice method. Okay, so all I needed was to implement a class to represent one of those large CSV files, and that class needs to include the Enumerable module? That sounded way too simple, but it was easy enough to try out.
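If you haven’t seen it before, each_slice chunks an enumerable into arrays of at most n elements, with the final slice holding whatever is left over; called without a block, it returns an Enumerator, which will matter for the chaining below.

(1..7).each_slice(3).to_a
# => [[1, 2, 3], [4, 5, 6], [7]]

(1..7).each_slice(3).class
# => Enumerator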
class EventsCSV
  include Enumerable

  attr_reader :headers

  def initialize(file_path)
    @file_path = file_path
    @file = File.new(file_path, 'r')
    # Consume the first line up front so that #each only yields data rows.
    @headers = @file.gets
  end

  def close
    @file.close
  end

  # Implementing #each is all Enumerable requires of us; every other
  # Enumerable method (including each_slice) is built on top of it.
  def each
    while line = @file.gets
      yield line
    end
  end
end
Immediately, I started to see some benefits. This EventsCSV class could also encapsulate the concept of the headers in the CSV file. We can hide that complexity right in the constructor of the class; that works for me.
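To sanity-check the idea, here is roughly how the class behaves against a small, hypothetical events file (the file name, header line, and batch size here are just for illustration):

events_csv = EventsCSV.new('events.csv')
events_csv.headers
# => "email,event,timestamp\n"

# Because EventsCSV includes Enumerable, each_slice comes along for free.
events_csv.each_slice(2) do |batch|
  puts batch.length # at most 2 rows per batch
end
events_csv.close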
So, using the EventsCSV class, let’s look at our split_large_csv_file method.
class FileSplitter
  BATCH_LIMIT = 100_000

  def split_large_csv_file(file_path)
    events_csv = EventsCSV.new(file_path)
    file_index = 1
    events_csv.each_slice(BATCH_LIMIT) do |batch|
      File.open("#{file_path}_#{file_index}", 'w') do |csv|
        csv << events_csv.headers
        batch.each { |row| csv << row }
      end
      file_index += 1
    end
    # Return an array of file paths representing the smaller CSVs.
    # file_index has been incremented past the last file written,
    # so stop the range one short of it.
    (1...file_index).map { |i| "#{file_path}_#{i}" }
  end
end
Okay, we are getting there. There is still some low-hanging fruit here: we can lean on the Enumerable module even more.
class FileSplitter
  BATCH_LIMIT = 100_000

  def split_large_csv_file(file_path)
    events_csv = EventsCSV.new(file_path)
    events_csv.each_slice(BATCH_LIMIT)
              .each_with_index
              .map do |batch, i|
      # each_with_index counts from 0; add 1 to keep the one-based
      # file naming used above.
      split_file_path = "#{file_path}_#{i + 1}"
      File.open(split_file_path, 'w') do |csv|
        csv << events_csv.headers
        batch.each { |row| csv << row }
      end
      split_file_path
    end
  end
end
One more small change: extract the file I/O into a separate method.
class FileSplitter
  BATCH_LIMIT = 100_000

  def split_large_csv_file(file_path)
    events_csv = EventsCSV.new(file_path)
    events_csv.each_slice(BATCH_LIMIT)
              .each_with_index
              .map do |batch, i|
      "#{file_path}_#{i + 1}".tap do |split_file_path|
        write_batch_to_file(split_file_path, events_csv.headers, batch)
      end
    end
  end

  def write_batch_to_file(file_path, headers, batch)
    File.open(file_path, 'w') do |csv|
      csv << headers
      batch.each { |row| csv << row }
    end
  end
end
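Putting it all together, the call site stays the same as in the first version; the file name below is hypothetical:

paths = FileSplitter.new.split_large_csv_file('events_2016_01_01.csv')
# => ["events_2016_01_01.csv_1", "events_2016_01_01.csv_2", ...]
# Each of these paths can then be handed off to its own background job.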
Conclusions
As I said above, I was blown away by how a shift in approach could lead to a drastically cleaner solution. The OO approach in this case relies on the Ruby Enumerable module to deal with the edge-case complexities of batching data, so we don’t really have to worry about testing those edge cases, either. As long as we have implemented #each correctly in our EventsCSV class, we can have high confidence that #each_slice will also work correctly.
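For example, a minimal sketch of such a test (written here with Minitest and a Tempfile, purely for illustration) might look like this:

require 'minitest/autorun'
require 'tempfile'

class EventsCSVTest < Minitest::Test
  def test_each_yields_only_data_rows
    file = Tempfile.new('events')
    file.write("header1,header2\nrow1a,row1b\nrow2a,row2b\n")
    file.close

    events_csv = EventsCSV.new(file.path)
    assert_equal "header1,header2\n", events_csv.headers
    # Enumerable#to_a is built on #each.
    assert_equal ["row1a,row1b\n", "row2a,row2b\n"], events_csv.to_a
  ensure
    events_csv.close if events_csv
    file.unlink
  end
end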
Next Up
As nicely as this solution worked out from a code standpoint, there were some unforeseen memory issues when we released the code to production. I share how I debugged and ultimately resolved those issues here.