Posts Tagged ‘ruby’

Handling Nested CDATA With Builder

Tuesday, September 21st, 2010

As noted by our associates at Atomic Object, XML doesn’t allow for nested <![CDATA[…]]> sections: the first ]]> ends the section, wherever it appears. In the course of rewriting some pieces of code, I developed the following Builder workaround, which lets our application export valid XML by splitting each embedded ]]> across two adjacent CDATA sections. When the output is read back in via our Nokogiri-based parser, the adjacent sections are concatenated automagically, and the end result is clean, valid XML.

Fix code:

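require 'builder'
# alias_method_chain comes from ActiveSupport
require 'active_support/core_ext/module/aliasing'

module Builder
  class XmlMarkup < XmlBase

    # Split every "]]>" in the text across two adjacent CDATA sections,
    # so the terminator never appears inside a single section.
    def cdata_with_escaping!(text)
      # gsub (not gsub!) so the caller's string is left untouched
      cdata_without_escaping!(text.to_s.gsub(']]>', ']]]]><![CDATA[>'))
    end
    # cdata! now calls cdata_with_escaping!; the original stays
    # available as cdata_without_escaping!
    alias_method_chain 'cdata!', 'escaping'

  end
end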

Sample output:

>> xml = Builder::XmlMarkup.new
>> xml.cdata!("<![CDATA[Foo bar sna]]>")
>> xml.target!
=> "<![CDATA[<![CDATA[Foo bar sna]]]]><![CDATA[>]]>"  # valid XML!
>> xml = Builder::XmlMarkup.new  # fresh builder, so the target starts empty
>> xml.cdata_without_escaping!("<![CDATA[Foo bar sna]]>")
>> xml.target!
=> "<![CDATA[<![CDATA[Foo bar sna]]>]]>" # invalid XML!

Sample parsing with Nokogiri:

>> doc = Nokogiri::XML("<baz><![CDATA[<![CDATA[Foo bar sna]]]]><![CDATA[>]]></baz>")
=> #<Nokogiri::XML::Document:0x825aff3c name="document" children=[#<Nokogiri::XML::Element:0x825afc1c name="baz" children=[#<Nokogiri::XML::CDATA:0x825af99c "<![CDATA[Foo bar sna]]>">]>]>
>> doc.css('baz').first.content
=> "<![CDATA[Foo bar sna]]>"
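
For readers without Builder or ActiveSupport at hand, the splitting trick itself can be sketched in plain Ruby. The helper names cdata_section and cdata_text below are hypothetical, not part of Builder or Nokogiri, and cdata_text is only a rough stand-in for what the parser does when it joins adjacent sections:

```ruby
# Hypothetical helper (not part of Builder): wrap text in a CDATA section,
# splitting any "]]>" terminator across two adjacent sections.
def cdata_section(text)
  escaped = text.to_s.gsub(']]>', ']]]]><![CDATA[>')
  "<![CDATA[#{escaped}]]>"
end

# Hypothetical helper: strip the CDATA delimiters from a run of adjacent
# sections and join the pieces, roughly what the parser does on the way in.
def cdata_text(xml)
  xml.scan(/<!\[CDATA\[(.*?)\]\]>/m).flatten.join
end

original = "<![CDATA[Foo bar sna]]>"
wrapped  = cdata_section(original)
# wrapped is "<![CDATA[<![CDATA[Foo bar sna]]]]><![CDATA[>]]>"
restored = cdata_text(wrapped)
# restored == original
```

The first section ends with "]]" and the second starts with ">", so no single section ever contains the full terminator.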

Where to find old versions of Ruby

Monday, May 3rd, 2010

This post is as much for my own reference as it is for frustrated folks trying to find non-1.9 versions of Ruby for old Rails apps.

Ruby 1.8.6 p399: ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.6-p399.tar.gz (and in zip and bz2)

Ruby 1.8.7 p249: ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.7-p249.tar.gz (and in zip and bz2)

And of course, browse the entire FTP archive for everything going back to 1.8.0.

HOWTO: Remove Byte-order Mark with Ruby and Iconv

Monday, October 19th, 2009

I’m working on a small project that involves loading a UTF-16LE (16-bit Unicode, little-endian) CSV file, converting it to UTF-8 with iconv, then parsing the values with FasterCSV. Everything was working fine except for looking up the first column of data by its column header. For example, given the data:

First Name | Last Name | Email
Jimbo      | Jones     | jimbo.jones@example.com

I could access column 2 (Last Name) as either row.field("Last Name") or row.field(1). However, if I tried to access the first column using row.field("First Name"), it would return nil. row.field(0), on the other hand, would return the proper value.

Hmmmm.

After some sleuthing, I examined the raw content of the string:

(rdb:1) p row.headers.first.unpack('C*')
[239, 187, 191, 70, 105, 114, 115, 116, 32, 78, 97, 109, 101]

Ah, ha! The first three bytes (239, 187, 191, or EF BB BF in hex) are the UTF-8 encoding of the byte-order mark, or BOM. Ruby does not strip it when reading a file as input, and it survives the conversion to UTF-8, so it’s passed along in the input stream. When FasterCSV loads the file, it keeps those bytes in the first header’s key name, so lookups by "First Name" return nil.
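
The mismatch can be reproduced without FasterCSV. This is a minimal sketch for Ruby 1.9 or later, using the byte values printed above; the plain Hash is only a stand-in for the CSV row, not how FasterCSV stores its fields:

```ruby
# Rebuild the first header from the bytes shown in the debugger output.
header_bytes = [239, 187, 191, 70, 105, 114, 115, 116, 32, 78, 97, 109, 101]
header = header_bytes.pack('C*').force_encoding('UTF-8')

# The header looks like "First Name" when printed, but starts with U+FEFF.
has_bom = header.start_with?("\u{FEFF}")   # true
same    = (header == "First Name")         # false

# A Hash keyed by the raw header (stand-in for the CSV row) misses the lookup.
row = { header => "Jimbo" }
missing = row["First Name"]                # nil

# Removing the leading BOM makes the key match again.
clean = header.sub(/\A\uFEFF/, '')
fixed = { clean => "Jimbo" }["First Name"] # "Jimbo"
```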

I modified my file conversion code as follows:

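  require 'iconv'

  def convert_to_utf8
    # Data files are exported as Little Endian UTF-16. We need to parse as UTF-8
    # Read in binary mode; the block form closes the file for us
    contents = File.open(@file_name, 'rb') { |f| f.read }
    begin
      # Iconv.conv returns a single String (Iconv.iconv returns an Array)
      converted = Iconv.conv('UTF-8', 'UTF-16LE', contents)
      # strip the BOM (byte order mark) from the first line of input
      converted.sub!("\xEF\xBB\xBF", '')
      # overwrite the original file; the block form flushes and closes it
      File.open(@file_name, 'wb') { |output| output.write(converted) }
    rescue Iconv::Failure
      puts $!.inspect
    end
  end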

And all is well in the world.
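
On newer Rubies the same conversion can probably be done without Iconv, which was removed from the standard library in Ruby 2.0. A minimal sketch, assuming the input really is UTF-16LE; the helper name utf16le_to_utf8 is made up for illustration:

```ruby
# Hypothetical helper: convert raw UTF-16LE bytes to a UTF-8 String
# and drop a leading byte-order mark (U+FEFF) if one is present.
def utf16le_to_utf8(bytes)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  utf8.sub(/\A\uFEFF/, '')
end

# Stand-in for the raw file contents, e.g. File.binread(file_name)
raw = "\u{FEFF}First Name".encode(Encoding::UTF_16LE).b

header = utf16le_to_utf8(raw)
# header == "First Name", encoded as UTF-8
```

Anchoring the pattern with \A removes only a mark at the very start of the data and leaves any later U+FEFF characters alone.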