Elegant XML Parsing in Ruby

I blogged in more detail about this before. But I saw a Stack Overflow question about this, so I feel I need to keep up the public service announcements :)

If you are parsing big XML files, you have to stream them with something like SAX. Does your code look like this?

    def start_element tag, attributes
      case tag
      when 'name'
        @attribute = 'name'
      when 'titles'
        @attribute = 'titles'
      end
    end

    def end_element tag
      @model.save! if tag == 'name'
    end

    def characters text
      @model[@attribute] = text
    end

Please stop doing this!

SAX Machine was created to provide an elegant API for parsing XML.

    class AtomEntry
      include SAXMachine
      element :title
      # the :as argument makes this available through atom_entry.author instead of .name
      element :name, :as => :author
      element "feedburner:origLink", :as => :url
      element :summary
      element :content
      element :published
    end

This is declarative coding, which is what we should be striving for no matter what programming tools we are using. The problem is that the official sax-machine gem loads everything into memory. The reason is that it uses neither a callback (yield) nor fibers to hand each element to your code as soon as it is parsed.

At yap.TV, we have been using my fork of sax-machine for over a year now on gigantic XML files. This fork uses Ruby’s fibers to transfer control from the XML parsing back to your code. This is similar in concept to how yield works, but you don’t have to pass in a block right away. Instead, you get back an Enumerator that parses on demand.

    class Atom
      include SAXMachine
      element :title
      elements :entry, :lazy => true, :as => :entries, :class => AtomEntry
    end

    feed = Atom.parse(xml_file_handle, :lazy => true)
    feed.entries # => #<Enumerator: #<Enumerator::Generator:0x00000004c41ea0>:each> 
    feed.entries.each do |entry|
      # every time the block is called, the next entry is parsed: no memory blow-up!
      # This is probably where you save the entry to a database
    end
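
To see why an Enumerator gives you parsing on demand, here is a minimal sketch in plain Ruby. It is not the fork's actual code: `FakePushParser` is a hypothetical stand-in for a SAX parser that drives its own loop and calls back once per entry. Wrapping that loop in `Enumerator.new` means nothing runs until a consumer asks for an entry, and when you pull entries with `next`, Ruby runs the loop inside a fiber and suspends it after each one.

```ruby
# Hypothetical push-style parser (an assumption for illustration only).
# Like a SAX parser, it owns the loop and calls back for every entry.
class FakePushParser
  attr_reader :parsed_count

  def initialize(titles)
    @titles = titles
    @parsed_count = 0
  end

  # Yields one "entry" as soon as it has been "parsed".
  def parse
    @titles.each do |title|
      @parsed_count += 1
      yield({ title: title })
    end
  end
end

parser = FakePushParser.new(%w[first second third])

# Nothing is parsed here: the block only runs when an entry is requested.
entries = Enumerator.new do |yielder|
  parser.parse { |entry| yielder << entry }
end

# `next` resumes the parsing loop just long enough to produce one entry,
# then suspends it again.
first = entries.next
puts first[:title]        # prints "first"
puts parser.parsed_count  # prints 1, the other entries are still unparsed
```

Because the parser is paused between entries, only one entry needs to be alive at a time, no matter how large the input is.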

The only extra thing you need to do is add a :lazy option to your top-level elements. Also check out ezkl’s fork. He is now taking over maintenance of sax-machine in the official repo and plans to pull in my changes.

This solves the memory part of performance. However, if this is not fast enough for you, I suggest switching the library to use ox. From the ox documentation:

Unlike Nokogiri and LibXML, Ox can be tuned to use only the SAX callbacks that are of interest to the caller

The current SAX parser pulls everything out of the XML and brings it into Ruby land. Avoiding this could make parsing an order of magnitude faster, depending on how sparsely you use the XML.
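
The idea behind that saving can be sketched in plain Ruby. This is not Ox's API or implementation: `SelectiveDispatcher` and `TitleCounter` are hypothetical names, and `build` merely stands in for the cost of copying parser data into a Ruby string. The dispatcher checks once which callbacks the handler defines and skips all work for events nobody is listening to.

```ruby
# Hypothetical event dispatcher standing in for a SAX parser (not Ox's API).
class SelectiveDispatcher
  attr_reader :strings_built

  def initialize(handler)
    @handler = handler
    # Decide once, up front, which callbacks the handler cares about.
    @wants_start = handler.respond_to?(:start_element)
    @wants_text  = handler.respond_to?(:text)
    @strings_built = 0
  end

  # events is an array of [type, raw_value] pairs.
  def run(events)
    events.each do |type, raw|
      case type
      when :start_element
        @handler.start_element(build(raw)) if @wants_start
      when :text
        @handler.text(build(raw)) if @wants_text
      end
    end
  end

  private

  # Stands in for the cost of bringing parser data into Ruby land.
  def build(raw)
    @strings_built += 1
    raw.dup
  end
end

# This handler only defines start_element, so text events cost nothing.
class TitleCounter
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_element(name)
    @count += 1 if name == "title"
  end
end

events = [
  [:start_element, "entry"],
  [:text, "lots of text nobody reads"],
  [:start_element, "title"],
  [:text, "Hello"]
]

handler = TitleCounter.new
dispatcher = SelectiveDispatcher.new(handler)
dispatcher.run(events)

puts handler.count             # prints 1
puts dispatcher.strings_built  # prints 2, the two text events were skipped
```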

For our usage at yap (importing to a database) we of course found that database access became a big bottleneck. We added some indexes and caching, and made some queries use the raw driver rather than go through an ORM.

sax-machine is a great example of how Ruby’s advanced features of fibers and meta-programming can be used. Please use it to write declarative code!
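
As a rough illustration of the metaprogramming side, here is a minimal sketch of how a declarative `element` macro could be built. This is not sax-machine's real implementation: the `TinyElements` module, its `element_map`, and the `assign` method are all hypothetical names invented for this example.

```ruby
# Hypothetical mini-DSL (an assumption for illustration, not sax-machine).
module TinyElements
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Maps XML tag names to Ruby attribute names for this class.
    def element_map
      @element_map ||= {}
    end

    # Declares a tag of interest; :as renames the generated accessor.
    def element(tag, as: tag)
      element_map[tag.to_s] = as.to_sym
      attr_accessor as
    end
  end

  # A SAX handler would call this once it has read a tag's text.
  # Tags that were never declared are silently ignored.
  def assign(tag, text)
    attribute = self.class.element_map[tag.to_s]
    public_send("#{attribute}=", text) if attribute
  end
end

class Entry
  include TinyElements
  element :title
  element :name, as: :author
end

entry = Entry.new
entry.assign("name", "Ada")
entry.assign("title", "Hello")
entry.assign("ignored", "never stored")

puts entry.author  # prints "Ada"
puts entry.title   # prints "Hello"
```

The class body only states which tags matter and what to call them; the tag-to-attribute bookkeeping that the hand-written callbacks had to do lives in one place.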

Published: June 08 2012
