Elegant XML Parsing in Ruby
I have blogged about this in more detail before, but I saw a Stack Overflow question on the topic, so I feel I need to keep up the public service announcements :)
If you are parsing big XML files, you have to stream them using something like SAX. Does your code look like this?
def start_element tag, attributes
  case tag
  when 'name'
    @attribute = 'name'
  when 'titles'
    @attribute = 'name'
  end
end

def end_element tag
  @model.save! if tag == 'name'
end

def characters text
  @model[@attribute] = text
end
Please stop doing this!
SAX Machine (the sax-machine gem) was created to give XML parsing an elegant API:
class AtomEntry
  include SAXMachine
  element :title
  # the :as argument makes this available through atom_entry.author instead of .name
  element :name, :as => :author
  element "feedburner:origLink", :as => :url
  element :summary
  element :content
  element :published
end
This is declarative coding, which is what we should be striving for no matter what programming tools we are using. The problem is that the official sax-machine gem loads everything into memory. That is because it does not use a callback (yield) or fibers to hand each element to your code as soon as it has been parsed.
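If you are wondering how a declaration like element :name, :as => :author can work at all, here is a minimal sketch of the metaprogramming idea. It is not sax-machine's actual implementation: TinyDeclarative, element_map and assign are hypothetical names made up for this illustration, and the parser that would call assign is left out.

```ruby
# Hypothetical module for illustration only, not sax-machine's real code.
module TinyDeclarative
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Maps an XML tag name to the attribute it should be stored in.
    def element_map
      @element_map ||= {}
    end

    # Class-level macro: records the mapping and defines an accessor.
    def element(tag, options = {})
      attribute = options.fetch(:as, tag)
      element_map[tag.to_s] = attribute.to_sym
      attr_accessor attribute
    end
  end

  # A parser (not shown) would call this once it has read a tag's text.
  def assign(tag, text)
    attribute = self.class.element_map[tag.to_s]
    public_send("#{attribute}=", text) if attribute
  end
end

class Entry
  include TinyDeclarative
  element :title
  element :name, :as => :author
end

entry = Entry.new
entry.assign('title', 'Hello')
entry.assign('name', 'Ada')
entry.assign('ignored', 'x') # no declaration for this tag, so it is skipped
```

The class body only states which tags matter and what they should be called. All the case/when bookkeeping from the first example lives in one place instead of in every handler you write.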
At yap.TV, we have been using my fork of sax-machine for over a year now on gigantic XML files. This fork uses Ruby’s fibers to transfer control from the XML parsing back to your code. This is similar in concept to how yield works, but you don’t have to pass in a block right away. Instead you get back an Enumerator that parses on demand.
class Atom
  include SAXMachine
  element :title
  elements :entry, :lazy => true, :as => :entries, :class => AtomEntry
end

feed = Atom.parse(xml_file_handle, :lazy => true)
feed.entries # => #<Enumerator: #<Enumerator::Generator:0x00000004c41ea0>:each>
feed.entries.each do |entry|
  # every time the block is called, the next entry is parsed: no memory blow-up!
  # This is probably where you save the entry to a database
end
The only extra thing you need to do is add the :lazy option to your top-level elements. Also check out ezkl’s fork. He is now taking over maintenance of sax-machine in the official repo and is planning to pull in my changes.
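To make the fiber trick less magical, here is a rough sketch of the general technique using only core Ruby. It is not the code from my fork: push_parse is a hypothetical stand-in for a SAX parser (it just treats each line as one entry), and lazy_entries and log are names invented for this example.

```ruby
require 'stringio'

# Hypothetical stand-in for a push parser: it drives the loop and calls
# the block once per "entry". A real SAX parser fires callbacks instead.
def push_parse(io)
  io.each_line do |line|
    yield line.strip
  end
end

# Wraps the push parser in a Fiber so the caller decides when parsing advances.
def lazy_entries(io, log)
  fiber = Fiber.new do
    push_parse(io) do |entry|
      log << "parsed #{entry}"
      Fiber.yield entry # pause the parser and hand the entry to the caller
    end
    nil # the final resume returns nil, which signals end of input
  end

  # Single-pass enumerator: each resume runs the parser up to the next entry.
  Enumerator.new do |yielder|
    while (entry = fiber.resume)
      yielder << entry
    end
  end
end

log = []
entries = lazy_entries(StringIO.new("a\nb\nc\n"), log)
parsed_before = log.size       # 0, nothing has been parsed yet
first_two = entries.first(2)   # parsing runs only far enough to produce two entries
```

After first(2), the log only contains the first two entries. The third one stays unread in the IO until somebody asks for it, which is why memory stays flat no matter how big the file is.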
This solves the memory part of performance. However, if this is not fast enough for you, I suggest switching the library to use ox. From the ox documentation:
Unlike Nokogiri and LibXML, Ox can be tuned to use only the SAX callbacks that are of interest to the caller
The current SAX parser will pull everything out of the XML and bring it into Ruby land. Avoiding this could make the parsing an order of magnitude faster, depending on how little of the XML you actually use.
For our usage at yap (importing to a database) we of course found that database access became a big bottleneck. We added some indexes and caching, and made some queries use the raw driver rather than go through an ORM.
sax-machine is a great example of how Ruby’s advanced features, fibers and metaprogramming, can be put to use. Please use it to write declarative code!