HOWTO: Remove Byte-order Mark with Ruby and Iconv
19 October 2009

I'm working on a small project that loads a UTF-16LE (16-bit Unicode, little-endian) CSV file, converts it to UTF-8 with iconv, then parses the values with FasterCSV. Everything was working fine except for looking up the first column of data by its column header. For example, given this data:

First Name Last Name Email
Jimbo Jones jimbo.jones@example.com

I could access column 2 (Last Name) as either row.field("Last Name") or row.field(1). However, if I tried to access the first column using row.field("First Name"), it would return nil. row.field(0), on the other hand, would return the proper value.

Hmmmm.

After some sleuthing, I examined the raw content of the string:

(rdb:1) p row.headers.first.unpack('C*')
[239, 187, 191, 70, 105, 114, 115, 116, 32, 78, 97, 109, 101]

Ah, ha! The first three bytes (239, 187, 191, or EF BB BF in hex) are the UTF-8 encoding of the byte-order mark, or BOM (U+FEFF). The BOM sits at the start of the exported file and survives the conversion to UTF-8, and Ruby does not strip it when reading the file, so it is passed along in the input stream. FasterCSV then keeps those bytes as part of the first header, so the key is really the BOM followed by "First Name", and a lookup by plain "First Name" returns nil.
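To make the mismatch concrete, here is a minimal sketch that compares the raw bytes from the debugger output with the bytes of a plain "First Name" string. The helper strip_utf8_bom_bytes and the constant UTF8_BOM_BYTES are hypothetical names made up for this illustration; they are not part of FasterCSV or Iconv.

```ruby
# UTF-8 encoding of the byte-order mark (U+FEFF).
# Hypothetical constant, used only for this illustration.
UTF8_BOM_BYTES = [0xEF, 0xBB, 0xBF]

# Hypothetical helper: drops a leading UTF-8 BOM from an array of byte values.
# Working on byte values keeps the comparison independent of string encodings.
def strip_utf8_bom_bytes(bytes)
  bytes[0, 3] == UTF8_BOM_BYTES ? bytes[3..-1] : bytes
end

# The header bytes exactly as printed in the debugger session above.
header_bytes = [239, 187, 191, 70, 105, 114, 115, 116, 32, 78, 97, 109, 101]
plain_bytes  = "First Name".unpack('C*')

# The BOM makes the parsed header differ from the key we look up.
puts header_bytes == plain_bytes
# Once the three BOM bytes are dropped, the two are identical.
puts strip_utf8_bom_bytes(header_bytes) == plain_bytes
```

The first comparison is false and the second is true, which is why the lookup by index worked while the lookup by name did not.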

I modified my file conversion code as follows:

require 'iconv'

def convert_to_utf8
  # Data files are exported as little-endian UTF-16. We need to parse them as UTF-8.
  # Read in binary mode and let the block close the file handle.
  contents = File.open(@file_name, 'rb') { |f| f.read }
  begin
    # Iconv.iconv returns an array of converted strings, so join them into one.
    converted = Iconv.iconv('UTF-8', 'UTF-16LE', contents).join
    # Strip the BOM (byte-order mark), but only at the very start of the input.
    converted.sub!(/\A\xEF\xBB\xBF/, '')
    File.open(@file_name, 'wb') { |f| f.write(converted) }
  rescue Iconv::Failure
    puts $!.inspect
  end
end
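
Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, so on a newer Ruby the same idea can be sketched with the built-in String#encode instead. This is not the code I used above, just a rough equivalent under that assumption; the helper name utf16le_to_utf8 and the constant UTF8_BOM are made up for the example.

```ruby
# Assumes Ruby 1.9 or later, where String#encode is built in.
# U+FEFF is the byte-order mark; in UTF-8 it is the bytes EF BB BF.
UTF8_BOM = "\uFEFF"

# Hypothetical helper: takes raw UTF-16LE bytes and returns UTF-8 text
# with a leading BOM removed, if one is present.
def utf16le_to_utf8(raw)
  # Tag a copy of the raw bytes as UTF-16LE, then transcode to UTF-8.
  text = raw.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  # The BOM is a single character after transcoding, so drop one character.
  text.start_with?(UTF8_BOM) ? text[1..-1] : text
end

# Build sample input: a BOM followed by a header, as raw UTF-16LE bytes.
sample = (UTF8_BOM + "First Name").encode(Encoding::UTF_16LE)
sample.force_encoding(Encoding::BINARY)

puts utf16le_to_utf8(sample).inspect
```

Invalid input raises an Encoding error from encode rather than Iconv::Failure, so the rescue clause would need to change accordingly.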

And all is well in the world.

Want to talk about this a bit more? Send a tweet to @cgansen or email me at cgansen@gmail.com.