##Working with XML files in Python

Based on original materials by Greg Wilson and Andrew Walker
XML is becoming the standard way to store everything from web pages to astronomical data, and there is a bewildering variety of tools for dealing with it. In this tutorial we will look at how to process and modify XML. We will cover only the basics, but if you are interested in more, here is some recommended reading:
A bit of history (to help you understand the context). Between 1969 and 1986, Standard Generalized Markup Language (SGML) emerged. It was developed by Charles Goldfarb and others at IBM as a way of adding information to medical and legal documents so that computers could process them. It had a very complex specification (over 500 pages). In 1989, Tim Berners-Lee created HyperText Markup Language (HTML) for the World Wide Web. It was much (much) simpler than SGML. Anyone could write it, so everyone did.
The problem was that HTML had a small, fixed set of tags, and everyone wanted to add new ones. The solution was to create a standard way to define a set of tags and the relationships between them. The first version of XML was standardized in 1998. It is a set of rules for defining markup languages: much more complex than HTML, but still simpler than SGML. A new version of HTML called XHTML was also defined; it is like HTML, but obeys all XML rules. There is still a lot of non-XML-compliant HTML out there.
A basic XML document contains elements and text. The full specification also allows for external entity references, processing instructions, and other fun.
Elements are shown using tags:
- Must be enclosed in angle brackets <>
####Document Structure
<X>...<Y>...</Y></X> is legal, but <X>...<Y>...</X></Y> is not.

####Text
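Literal < and > characters in text are escaped as &lt; and &gt;. These are entity references, which are written in the form &name;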
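
The nesting and escaping rules above can be checked with Python's built-in parser. The following is a minimal sketch, assuming Python 3 and `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative and not part of the original material.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text):
    """Return True if the parser accepts `text` as XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested tags parse; overlapping tags do not.
nested_ok = is_well_formed("<X>a<Y>b</Y></X>")
nested_bad = is_well_formed("<X>a<Y>b</X></Y>")
print(nested_ok, nested_bad)

# A raw "<" inside text is rejected, so the text has to be escaped first.
raw = "1 < 2 & 3 > 2"
escaped = escape(raw)   # replaces &, < and > with entity references
print(escaped)

# The parser turns the entity references back into the original characters.
element = ET.fromstring("<note>" + escaped + "</note>")
print(element.text == raw)
```

Escaping with a library function rather than by hand also takes care of `&`, which has to be escaped for the same reason.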
####Attributes
<h1 align="center">A Centered Heading</h1>
<p class="disclaimer">This planet provided as-is.</p>
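<p align="left" align="right">...</p> is illegal: an element may carry each attribute only once.
Attribute values must be quoted: <p align=center>...<p> was accepted in old-style HTML, but modern XML parsers will reject it.

####Attributes Vs. Elements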
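
As a small illustration of the difference between attributes and elements, the same piece of information can be stored either way. The sketch below assumes Python 3 and `xml.etree.ElementTree`; the planet snippets are hypothetical examples written for this illustration, and the code only shows how each form is read, not which one to prefer.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets: the same fact stored in two different ways.
as_attribute = ET.fromstring('<planet name="Mercury"/>')
as_element = ET.fromstring('<planet><name>Mercury</name></planet>')

# An attribute is looked up by name in the element's attrib dictionary.
name_from_attribute = as_attribute.attrib['name']

# A child element is located with find(), and its content is read from .text.
name_from_element = as_element.find('name').text
print(name_from_attribute, name_from_element)

# Child elements can repeat and can carry attributes of their own.
with_moons = ET.fromstring(
    '<planet name="Mars"><moon name="Phobos"/><moon name="Deimos"/></planet>'
)
moon_names = [moon.attrib['name'] for moon in with_moons.findall('moon')]
print(moon_names)

# An attribute, by contrast, may appear only once per element.
try:
    ET.fromstring('<p align="left" align="right">...</p>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```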
###Element Tree library in Python
Python's standard library includes the Element Tree library (xml.etree.ElementTree, typically shortened to ET), which provides easy ways to manipulate XML documents. ET is also widely used, so learning it will help you work with code written by other developers.
#####XML Tree Example

XML documents have a hierarchical structure, so a natural way to represent them is as a tree.
<root>
<first>element</first>
<second attr="value">element</second>
<third-element/>
</root>
Figure 21.5: An XML Tree
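
To make the tree structure concrete, the sketch below parses the example document from a string and lists each element indented by its depth. It assumes Python 3; the helper `show` is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

document = """<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>"""

def show(element, depth=0, lines=None):
    """Collect one line per element, indented by depth (illustrative helper)."""
    if lines is None:
        lines = []
    lines.append("  " * depth + element.tag)
    for child in element:   # iterating over an element yields its children
        show(child, depth + 1, lines)
    return lines

tree_lines = show(ET.fromstring(document))
print("\n".join(tree_lines))
```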
#####Creating a Tree using ElementTree and extracting element object 'root'
Example: file "planets.xml"
<?xml version="1.0" encoding="utf-8"?>
<galaxy name="Solar System">
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Mars"><period units="days">56.97</period>
<moon name="Phobos"> </moon>
<moon name="Deimos"> </moon>
</planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</galaxy>
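import xml.etree.ElementTree as etree

# Parse the file into an ElementTree and get its top-level element
tree = etree.parse('planets.xml')
root = tree.getroot()
print(root.attrib)  # attributes of the <galaxy> element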
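
When the XML is already held in a string rather than a file, `fromstring` can be used instead of `parse`. This is a minimal sketch, assuming Python 3; the variable names and the shortened inline copy of the planets data are introduced here so that the example does not depend on planets.xml being on disk.

```python
import xml.etree.ElementTree as ET

# A shortened, inline copy of the planets data (assumption for this sketch).
planets_xml = """<galaxy name="Solar System">
  <planet name="Mercury"><period units="days">87.97</period></planet>
  <planet name="Venus"><period units="days">224.7</period></planet>
</galaxy>"""

galaxy = ET.fromstring(planets_xml)   # returns the root Element, not a tree
print(galaxy.tag, galaxy.attrib)

# Wrap the element in an ElementTree when tree-level methods are needed,
# for example to write the document back to a file.
galaxy_tree = ET.ElementTree(galaxy)
print(galaxy_tree.getroot() is galaxy)
```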
Listing all of root's children:
for child in root:
    print("tag=", child.tag, " attrib=", child.attrib)
Elements behave like lists of their children, so we can access the children using list indexes:
print(root[0].tag, root[0].attrib)
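
Other list operations work on elements too. The sketch below assumes Python 3 and uses a cut-down inline copy of the planets data (the variable `galaxy` and the trimmed content are assumptions made for this example).

```python
import xml.etree.ElementTree as ET

galaxy = ET.fromstring(
    '<galaxy name="Solar System">'
    '<planet name="Mercury"/>'
    '<planet name="Mars"><moon name="Phobos"/><moon name="Deimos"/></planet>'
    '<planet name="Venus"/>'
    '</galaxy>'
)

number_of_planets = len(galaxy)                      # number of direct children
last_planet = galaxy[-1].attrib['name']              # negative indexes work as for lists
first_two = [p.attrib['name'] for p in galaxy[:2]]   # so does slicing
second_moon = galaxy[1][1].attrib['name']            # indexes can be chained to go deeper

print(number_of_planets, last_planet, first_two, second_moon)
```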
In ElementTree, an element's attributes are stored in a dictionary:
print(child.attrib['name'])
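
Because `attrib` is a dictionary, asking for an attribute that is not there raises `KeyError`. The sketch below, assuming Python 3, shows the element's `get` method as a more forgiving alternative; the `mass` attribute is a hypothetical name chosen because it is absent from the data.

```python
import xml.etree.ElementTree as ET

planet = ET.fromstring(
    '<planet name="Earth"><period units="days">365.26</period></planet>'
)

# Indexing attrib raises KeyError when the attribute is missing ...
try:
    planet.attrib['mass']
    missing_raises = False
except KeyError:
    missing_raises = True

# ... while get() returns None, or a default of our choosing.
mass = planet.get('mass')
mass_or_default = planet.get('mass', 'unknown')
name = planet.get('name')

# keys() and items() list the attributes that are actually present.
period = planet[0]
print(name, mass, mass_or_default, period.keys(), period.items())
```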
####Finding particular elements:
We now know how to get hold of every element in the XML document by recursively visiting the children with a for loop. The ET library also comes with methods for iterating over a whole subtree directly. The iter method does exactly that: it visits the element it is called on and then all the elements below it, in document order (depth-first). The iter method is available for both ElementTree and Element objects.
Using iter for ElementTree:
for element in tree.iter():
    print(element.tag, element.attrib)
Using iter for Element:
for element in root[0].iter():
    print(element.tag, element.attrib)
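
To see what `iter` saves us from writing, the sketch below compares it with a hand-written recursive walk. It assumes Python 3; the generator `walk` and the trimmed inline data are hypothetical and exist only for this comparison.

```python
import xml.etree.ElementTree as ET

galaxy = ET.fromstring(
    '<galaxy>'
    '<planet name="Mars"><period/><moon name="Phobos"/><moon name="Deimos"/></planet>'
    '<planet name="Venus"><period/></planet>'
    '</galaxy>'
)

def walk(element):
    """Yield element, then everything below it, parent before children."""
    yield element
    for child in element:
        yield from walk(child)

by_hand = [e.tag for e in walk(galaxy)]
by_iter = [e.tag for e in galaxy.iter()]
print(by_iter)
print(by_hand == by_iter)
```

Both lists start with the element the walk began from, which shows that `iter` includes the starting element itself.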
We can use iter to find particular elements:
for element in root.iter(tag='period'):
    print(element.attrib)
    print(element.text)
Or:
for element in root.iter(tag='moon'):
    print(element.attrib)
    print(element.text)
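
Filtered iteration can also be nested to collect values into ordinary Python data structures. This is a minimal sketch, assuming Python 3 and a cut-down inline copy of the data in which Mars is given only its moons; the dictionary names are illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down inline data (assumption: Mars carries only its moons here).
galaxy = ET.fromstring(
    '<galaxy name="Solar System">'
    '<planet name="Mercury"><period units="days">87.97</period></planet>'
    '<planet name="Mars"><moon name="Phobos"> </moon><moon name="Deimos"> </moon></planet>'
    '<planet name="Earth"><period units="days">365.26</period></planet>'
    '</galaxy>'
)

periods = {}   # planet name -> orbital period as a float
moons = {}     # planet name -> list of moon names

for planet in galaxy.iter('planet'):
    name = planet.get('name')
    for period in planet.iter('period'):
        periods[name] = float(period.text)   # .text is a string, so convert it
    moons[name] = [moon.get('name') for moon in planet.iter('moon')]

print(periods)
print(moons)
```

Calling `iter` on each planet, rather than on the whole document, keeps every period and moon associated with the planet it belongs to.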