
What Is A Good Xml Stream Parser For Python?

Are there any XML parsers for Python that can parse file streams? My XML files are too big to fit in memory, so I need to parse the stream. Ideally I wouldn't have to have root acc

Solution 1:

Here's a good answer about using xml.etree.ElementTree.iterparse on huge XML files. lxml provides the same method. The key to stream parsing with iterparse is to manually clear and remove nodes you have already processed; otherwise the tree keeps growing and you will eventually run out of memory.
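The clearing step described above can be sketched roughly like this. It is only an illustration: the function name iter_entries and the in-memory sample are made up for the example, and the element names are assumed to match the sample XML used later in this answer.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source):
    """Yield one dict per <entry>, freeing each element once it is processed."""
    # Ask for 'start' events too, so the first event hands us the root element.
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == 'entry':
            yield {
                'a': elem.findtext('a'),
                'b': elem.find('b').get('foo'),
            }
            # Detach already processed children from the root; without this
            # the tree keeps every entry and memory grows with the file.
            root.clear()


# Hypothetical in-memory sample; a real caller would pass a file name
# or an open binary file instead.
sample = io.BytesIO(
    b"<?xml version='1.0' encoding='utf-8'?>"
    b"<root>"
    b"<entry><a>value 0</a><b foo='bar' /></entry>"
    b"<entry><a>value 1</a><b foo='baz' /></entry>"
    b"</root>"
)
entries = list(iter_entries(sample))
```

Clearing the root rather than only the current element also drops the empty entry elements that would otherwise pile up under it, which matters when the file has millions of entries.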

Another option is xml.sax. The official manual is too formal for me and lacks examples, so it needs some clarification here. The default parser module, xml.sax.expatreader, implements the incremental parsing interface xml.sax.xmlreader.IncrementalParser. That is to say, xml.sax.make_parser() provides a suitable stream parser.

For instance, given an XML stream like:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <entry><a>value 0</a><b foo='bar' /></entry>
  <entry><a>value 1</a><b foo='baz' /></entry>
  <entry><a>value 2</a><b foo='quz' /></entry>
  ...
</root>

it can be handled in the following way:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.sax


class StreamHandler(xml.sax.handler.ContentHandler):

  lastEntry = None
  lastName  = None

  def startElement(self, name, attrs):
    self.lastName = name
    if name == 'entry':
      self.lastEntry = {}
    elif name != 'root':
      self.lastEntry[name] = {'attrs': attrs, 'content': ''}

  def endElement(self, name):
    if name == 'entry':
      print({
        'a' : self.lastEntry['a']['content'],
        'b' : self.lastEntry['b']['attrs'].getValue('foo')
      })
      self.lastEntry = None
    elif name == 'root':
      # signal the caller that the document is complete
      raise StopIteration

  def characters(self, content):
    if self.lastEntry:
      self.lastEntry[self.lastName]['content'] += content


if __name__ == '__main__':
  # use default ``xml.sax.expatreader``
  parser = xml.sax.make_parser()
  parser.setContentHandler(StreamHandler())

  # feed the parser with small chunks to simulate a stream
  with open('data.xml') as f:
    while True:
      buffer = f.read(16)
      if not buffer:
        # end of file reached before </root> was seen
        break
      try:
        parser.feed(buffer)
      except StopIteration:
        break

  # if you can provide a file-like object it's as simple as
  with open('data.xml') as f:
    try:
      parser.parse(f)
    except StopIteration:
      # raised by the handler at </root>
      pass

Solution 2:

Are you looking for xml.sax? It's right in the standard library.

Solution 3:

Use xml.etree.cElementTree. It's much faster than xml.etree.ElementTree. Neither of them is broken. Your files are broken (see my answer to your other question).
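One caveat if you try this on a current interpreter: cElementTree was deprecated in Python 3.3, where xml.etree.ElementTree started using the C accelerator automatically, and the module was removed in Python 3.9. A minimal sketch of an import that works either way (the alias ET and the tiny sample document are just conventions for this example):

```python
# Prefer the C implementation where it still exists as a separate module,
# and fall back to the plain module name on newer Pythons.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Either module exposes the same API, including iterparse.
root = ET.fromstring("<root><entry /></root>")
entry_count = len(root.findall('entry'))
```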
