HOWTO: Remove Byte-order Mark with Ruby and Iconv
Monday, October 19th, 2009
I’m working on a small project that involves loading a UTF-16LE (16-bit Unicode, little-endian) CSV file, converting it to UTF-8 (the more common Unicode encoding) with iconv, then parsing the values with FasterCSV. Everything was working fine except for looking up the first column of data by its column header value. For example, given this data:
| First Name | Last Name | Email |
|---|---|---|
| Jimbo | Jones | jimbo.jones@example.com |
I could access column 2 (Last Name) as either row.field("Last Name") or row.field(1). However, if I tried to access the first column using row.field("First Name"), it would return nil. row.field(0), on the other hand, would return the proper value.
Hmmmm.
After some sleuthing, I examined the raw content of the string:
    (rdb:1) p row.headers.first.unpack('C*')
    [239, 187, 191, 70, 105, 114, 115, 116, 32, 78, 97, 109, 101]
Ah, ha! The first three bytes (239, 187, 191, or EF BB BF in hex) are the UTF-8 encoding of the byte-order mark, or BOM. Ruby, for whatever reason, does not strip it when reading a file as input, so it’s passed along in the input stream. When loading a file with FasterCSV, it’ll keep those bytes at the front of the first header name, causing lookups by that name to return nil.
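To see why the lookup fails, here is a minimal sketch that reproduces the symptom without FasterCSV. It assumes a reasonably modern Ruby (2.0 or later, for `String#b`), and the `starts_with_bom?` helper and the plain hash standing in for a CSV row are hypothetical, not part of the original code:

```ruby
# The UTF-8 encoding of the BOM (U+FEFF) as raw bytes: 239, 187, 191.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: true if the string begins with the UTF-8 BOM bytes.
def starts_with_bom?(str)
  str.b.start_with?(UTF8_BOM)
end

# What the parser ends up with as the first header name.
header = "\u{FEFF}First Name"

p header.unpack('C*').first(3)   # [239, 187, 191]
p header == "First Name"         # false

# A plain hash standing in for a parsed row keyed by header name.
row = { header => "Jimbo", "Last Name" => "Jones" }
p row["First Name"]              # nil
p row["Last Name"]               # "Jones"
```

The BOM is invisible when printed, which is why the header looks correct on screen while still failing an equality check.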
I modified my file conversion code as follows:
    require 'iconv'

    def convert_to_utf8
      # Data files are exported as Little Endian UTF-16. We need to parse as UTF-8
      contents = File.open(@file_name, 'rb') { |f| f.read }
      begin
        converted = Iconv.iconv('UTF-8', 'UTF-16LE', contents)
        # strip the BOM (byte order mark) from the first line of input
        converted.first.sub!("\xEF\xBB\xBF", '')
        # write the converted text back over the original file
        File.open(@file_name, 'wb') { |f| f.write(converted.join) }
      rescue Iconv::Failure
        puts $!.inspect
      end
    end
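For readers on a Ruby without the iconv library, the same idea can be sketched with `String#encode`. This is an alternative under stated assumptions, not the code used above: it assumes Ruby 2.0 or later, and `utf16le_to_utf8` is a hypothetical name. It works on an in-memory string rather than rewriting a file:

```ruby
# Hypothetical helper: convert raw UTF-16LE bytes to UTF-8 and drop a leading BOM.
def utf16le_to_utf8(bytes)
  # Label the raw bytes as UTF-16LE, then transcode to UTF-8.
  text = bytes.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  # Remove the BOM (U+FEFF) only if it is the very first character.
  text.sub(/\A\u{FEFF}/, '')
end

# Build a small UTF-16LE sample that starts with a BOM (bytes FF FE).
sample = "\u{FEFF}First Name,Last Name\n".encode(Encoding::UTF_16LE).b

utf8 = utf16le_to_utf8(sample)
p utf8.unpack('C*').first(5)   # [70, 105, 114, 115, 116]
```

Anchoring the pattern with `\A` means only a BOM at the very start of the data is removed, leaving any U+FEFF characters later in the text alone.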
And all is well in the world.
