Parser
The xml4h parser is a simple wrapper around the parser provided by an underlying XML library implementation.
Parse function
To parse XML documents with xml4h, pass the xml4h.parse() function an XML document in one of three forms:
A file-like object:
>>> import xml4h
>>> xml_file = open('tests/data/monty_python_films.xml', 'rb')
>>> doc = xml4h.parse(xml_file)
>>> doc.MontyPythonFilms
<xml4h.nodes.Element: "MontyPythonFilms">
A file path string:
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> doc.root['source']
'http://en.wikipedia.org/wiki/Monty_Python'
A string containing literal XML content:
>>> xml_file = open('tests/data/monty_python_films.xml', 'rb')
>>> xml_text = xml_file.read()
>>> doc = xml4h.parse(xml_text)
>>> len(doc.find('Film'))
7
Note
The parse() function distinguishes between a file path string and an XML text string by looking for a < character in the value.
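The check described in the note can be sketched as follows. This is a minimal illustration only: looks_like_xml_text is a hypothetical helper name, not part of the xml4h API, and the library's real implementation may differ in its details.

```python
def looks_like_xml_text(value):
    """Guess whether a string holds literal XML rather than a file path.

    Hypothetical helper for illustration; not an xml4h function.
    The heuristic is simply: XML text contains a '<' character,
    while an ordinary file path is assumed not to.
    """
    if isinstance(value, bytes):
        return b'<' in value
    return '<' in value


path_guess = looks_like_xml_text('tests/data/monty_python_films.xml')
text_guess = looks_like_xml_text('<MontyPythonFilms/>')
bytes_guess = looks_like_xml_text(b'<MontyPythonFilms/>')
```

One consequence of a heuristic like this is that a file path containing a < character would be treated as XML text, so such paths are better opened explicitly and passed as file-like objects.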
Stripping of Whitespace Nodes
By default, the parse() function ignores whitespace nodes in the XML document. More accurately, it does extra work to remove these nodes after the document has been parsed by the underlying XML library.
Whitespace nodes are rarely interesting, since they are usually the result of XML content that has been serialized with extra whitespace to make it more readable to humans.
However, if you need to keep these nodes, or if you want to avoid the extra processing overhead when parsing large documents, you can disable this feature by passing the ignore_whitespace_text_nodes=False keyword argument:
>>> # Strip whitespace nodes from document
>>> doc = xml4h.parse('tests/data/monty_python_films.xml')
>>> # No excess text nodes (XML doc lists 7 films)
>>> len(doc.MontyPythonFilms.children)
7
>>> doc.MontyPythonFilms.children[0]
<xml4h.nodes.Element: "Film">
>>> # Don't strip whitespace nodes
>>> doc = xml4h.parse('tests/data/monty_python_films.xml',
... ignore_whitespace_text_nodes=False)
>>> # An extra text node is present
>>> len(doc.MontyPythonFilms.children)
8
>>> doc.MontyPythonFilms.children[0]
<xml4h.nodes.Text: "#text">
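To make the post-parse stripping step concrete, here is a rough sketch of the general idea using only the standard library's xml.dom.minidom. It is not xml4h's own code: strip_whitespace_text_nodes is a hypothetical helper, and the small inline document is made up for the example.

```python
from xml.dom import Node, minidom


def strip_whitespace_text_nodes(node):
    """Recursively remove text nodes that contain only whitespace.

    Hypothetical helper for illustration; not an xml4h function.
    """
    # Iterate over a copy, because removeChild mutates childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_text_nodes(child)
    return node


# Pretty-printed XML: the newlines and indentation become text nodes.
xml_text = "<Films>\n  <Film/>\n  <Film/>\n</Films>"

raw_doc = minidom.parseString(xml_text)
raw_count = len(raw_doc.documentElement.childNodes)

stripped_doc = strip_whitespace_text_nodes(minidom.parseString(xml_text))
stripped_count = len(stripped_doc.documentElement.childNodes)
```

Here raw_count includes the whitespace text nodes around each Film element, while stripped_count counts only the two elements. The extra tree walk is the kind of overhead that disabling whitespace stripping avoids on large documents.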