The other day, I needed to quickly analyse a data set that came in the form of a large CSV file. I wanted to collect the entries of one particular column of that table, grouped by a key found in another column.
A simplified version of the table could look like this:
Key | Value1 | Interesting_Value | Other_Value |
foo | 17 | 23.5 | X |
bar | 21 | 1.75 | Q |
foo | 42 | 12.6 | B |
baz | 27 | 17.8 | F |
bar | 49 | 47.2 | K |
I was aiming for something like this:
result = {
foo: [23.5, 12.6],
bar: [1.75, 47.2],
baz: [17.8],
}
Iterating over the rows is easy, and getting at the columns is no problem either: the CSV gem is well documented and supports both.
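As a rough sketch of that row and column access: the snippet below parses a small inline, comma-separated stand-in for the file (the variable names csv_text, table and pairs are made up for illustration) and pulls the two columns of interest out of each row.

```ruby
require 'csv'

# Inline stand-in for the CSV file (assumed to be comma-separated with a header row).
csv_text = <<~CSV
  Key,Value1,Interesting_Value,Other_Value
  foo,17,23.5,X
  bar,21,1.75,Q
  foo,42,12.6,B
CSV

# headers: true makes each row addressable by column name.
table = CSV.parse(csv_text, headers: true)

# Collect [key, value] pairs; without converters, every field is a String.
pairs = table.map { |row| [row['Key'], row['Interesting_Value']] }
p pairs
```

Note that the values come back as Strings unless converters are configured.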
A nice way to accumulate data is Enumerable#each_with_object. Since I wanted the result to be grouped by a key value, I’d pass a Hash as the initial argument.
Step 1: each_with_object({})
However, since I planned to append values under varying keys, the default value needed to be an Array, not the default of nil.
Step 2: each_with_object(Hash.new([]))
This, however, returns the very same Array object whenever a key isn’t found, but I wanted a new empty Array each time:
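To make the sharing visible, here is a minimal sketch (the names shared, foo_default and bar_default are only for illustration): every lookup of a missing key hands back the one Array that was passed to Hash.new.

```ruby
# Hash.new(obj) returns the very same object for every missing key.
shared = Hash.new([])

shared[:foo] << 23.5   # mutates the single default Array; nothing is stored under :foo
shared[:bar] << 1.75   # mutates that same Array again

foo_default = shared[:foo]
bar_default = shared[:bar]
same_object = foo_default.equal?(bar_default)

p shared        # the Hash itself has no keys
p foo_default   # both values landed in the one shared Array
p same_object
```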
Step 3: each_with_object(Hash.new { [] })
This executes the block every time a default value is needed (i.e. whenever the given key isn’t yet in the Hash).
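A quick way to check that claim, sketched with illustrative names (fresh, first, second): two lookups of the same missing key should yield two different Array objects.

```ruby
# With a block, the default is computed anew on each lookup of a missing key.
fresh = Hash.new { [] }

first  = fresh[:foo]
second = fresh[:foo]

# equal? compares object identity, not contents.
distinct = !first.equal?(second)

p first
p second
p distinct
```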
The next step is to append the value found in a row to the (potentially new and empty) Array for the given key.
I thought it would work this way:
data_table.each_with_object( Hash.new { [] }) do |row, acc|
acc[row['Key']] << row['Interesting_Value']
end
But no, the result of this code is an empty Hash! It takes the <<= operator to make it work, as shown in this snippet of a pry session:
[2] pry(main)> data_table = CSV.read 'table.csv', headers: true
=> #<CSV::Table mode:col_or_row row_count:6>
[3] pry(main)> data_table.each_with_object( Hash.new { [] }) do |row, acc|
[3] pry(main)* acc[row['Key']] <<= row['Interesting_Value']
[3] pry(main)* end
=> {"foo"=>["23.5", "12.6"], "bar"=>["1.75", "47.2"], "baz"=>["17.8"]}
It seems to me that the Hash lookup with the default block { [] } returns a new Array, and the append operator << does in fact append the passed object to that Array, but the result never ends up as a (new) value for the given Hash key. In contrast, the <<= operator does assign the result of the append operation: acc[key] <<= value is shorthand for acc[key] = acc[key] << value.
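The following sketch puts the variants side by side. It uses a hypothetical rows array of [key, value] pairs instead of the CSV table so that it runs without a file. The third variant is not what I used above, but a common alternative: the default block stores the new Array under the key itself, so a plain << is enough.

```ruby
# Stand-in rows for the CSV data (hypothetical, so the sketch runs without a file).
rows = [
  ['foo', '23.5'], ['bar', '1.75'], ['foo', '12.6'], ['baz', '17.8'], ['bar', '47.2']
]

# 1. Plain <<: each lookup builds a temporary Array that is never stored.
lost = rows.each_with_object(Hash.new { [] }) do |(key, value), acc|
  acc[key] << value
end

# 2. <<= spelled out: look up (or build) the Array, append, then assign it back.
explicit = rows.each_with_object(Hash.new { [] }) do |(key, value), acc|
  acc[key] = acc[key] << value
end

# 3. Alternative: the default block stores the new Array, so plain << suffices.
stored = rows.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(key, value), acc|
  acc[key] << value
end

p lost
p explicit
p stored
```

One thing to keep in mind with the third variant: merely reading a missing key also creates an entry for it, which may or may not be what you want.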