##Working with XML files in Python

Based on original materials by Greg Wilson and Andrew Walker
XML is becoming the standard way to store everything from web pages to astronomical data, and there is a bewildering variety of tools for dealing with it. In this tutorial we will look at how to process and modify XML. We will cover only the basics, but if you are interested in going further, here is some recommended reading:
A bit of history (to help you understand the context). Between 1969 and 1986, Standard Generalized Markup Language (SGML) emerged. It was developed by Charles Goldfarb and others at IBM as a way of adding information to medical and legal documents so that computers could process them. It had a very complex specification (over 500 pages). In 1989, Tim Berners-Lee created HyperText Markup Language (HTML) for the World Wide Web. It was much (much) simpler than SGML. Anyone could write it, so everyone did.
The problem was that HTML had a small, fixed set of tags, and everyone wanted to add new ones. The solution was to create a standard way to define a set of tags and the relationships between them. The first version of XML was standardized in 1998. It is a set of rules for defining markup languages: much more complex than HTML, but still simpler than SGML. A new version of HTML called XHTML was also defined; it is like HTML, but obeys all XML rules. There is still a lot of non-XML-compliant HTML out there.
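Elements are shown using tags:

- Tags must be enclosed in angle brackets.
- Elements can carry attributes, written as name="value" pairs inside the opening tag:

        <h1 align="center">A Centered Heading</h1>
        <p class="disclaimer">This planet provided as-is.</p>

- An attribute name may appear only once in an element, so `<p align="left" align="right">...</p>` is illegal.
- Attribute values must be quoted. Old-style HTML tolerated `<p align=center>...<p>`, but modern parsers will reject it.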
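
The rules above can be checked against a real parser. The following is a minimal sketch, not part of the original material: the helper `is_well_formed` and the `samples` dictionary are hypothetical names, and the unquoted example is given a proper closing tag so that the missing quotes are the only fault in it.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample snippets, one per rule discussed above.
samples = {
    "quoted, unique attribute": '<p class="disclaimer">This planet provided as-is.</p>',
    "duplicate attribute": '<p align="left" align="right">...</p>',
    "unquoted value": '<p align=center>...</p>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "accepted" if ok else "rejected")
```

Only the first snippet should be accepted; the other two raise `ET.ParseError`, which the helper turns into `False`.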
Attributes Vs. Elements
###ElementTree library in Python

Python's standard library includes the ElementTree library (xml.etree.ElementTree, typically shortened to ET). ElementTree provides easy ways to manipulate XML documents. It is also widely used, so learning it will help you work with code written by other developers.
#####XML Tree Example

XML documents have a hierarchical structure, so a natural way to represent them is as a tree.

    <root>
      <first>element</first>
      <second attr="value">element</second>
      <third-element/>
    </root>

Figure 21.5: An XML Tree
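
To make the tree shape concrete, here is a small sketch that parses the example document from a string and prints one line per element, indented by depth. The function name `outline` is hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<root>
  <first>element</first>
  <second attr="value">element</second>
  <third-element/>
</root>"""

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:          # iterating an Element yields its children
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(xml_text)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the printout corresponds to one level of nesting in the document.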
#####Creating a Tree using ElementTree and extracting element object 'root'
Example: file "planets.xml"

    <?xml version="1.0" encoding="utf-8"?>
    <galaxy name="Solar System">
      <planet name="Mercury"><period units="days">87.97</period></planet>
      <planet name="Mars"><period units="days">56.97</period>
        <moon name="Phobos"> </moon>
        <moon name="Deimos"> </moon>
      </planet>
      <planet name="Venus"><period units="days">224.7</period></planet>
      <planet name="Earth"><period units="days">365.26</period></planet>
    </galaxy>

Parsing the file and printing the attributes of the root element:

    import xml.etree.ElementTree as etree

    tree = etree.parse('planets.xml')   # read and parse the whole file
    root = tree.getroot()               # the top-level <galaxy> element
    print(root.attrib)

Listing all of root's children:

    for child in root:
        print("tag=", child.tag, " attrib=", child.attrib)

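An element behaves like a list of its children, so we can access the children using list indexes:

    # The root element itself
    print(root.tag, root.attrib)
    # Its first child, reached by list index
    print(root[0].tag, root[0].attrib)
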
In ElementTree the attributes are stored as dictionaries:
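
A minimal sketch of what this means in practice, using a single planet parsed from a string rather than from planets.xml. The `source` attribute is a made-up name, used only to show what happens when an attribute is missing.

```python
import xml.etree.ElementTree as ET

planet = ET.fromstring(
    '<planet name="Mercury"><period units="days">87.97</period></planet>'
)
period = planet[0]   # first (and only) child of the planet element

print(planet.attrib)                     # the whole attribute dictionary
print(planet.attrib["name"])             # look up one attribute by key
print(period.get("units"))               # Element.get reads a single attribute
print(period.get("source", "unknown"))   # default returned for a missing attribute
```

Indexing `attrib` directly raises `KeyError` for a missing attribute, so `get` with a default is the gentler choice when an attribute is optional.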
####Finding particular elements:
We now know how to get hold of all the elements in an XML document by recursively extracting the children with a for loop. But the ET library comes with methods that iterate over every element below a given element for us. The iter method does exactly that: it visits the element itself and then all of its descendants in document order (a depth-first traversal). The iter method is available for both ElementTree and Element objects.
Using iter for ElementTree:

    for element in tree.iter():
        print(element.tag, element.attrib)

Using iter for Element:

    for element in root.iter():
        print(element.tag, element.attrib)

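
To see that iter really does the recursion for us, the sketch below compares it with a hand-written recursive traversal. It assumes a trimmed-down stand-in for planets.xml embedded as a string, and the generator name `walk` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for planets.xml (an assumption for this sketch).
PLANETS = """<galaxy name="Solar System">
  <planet name="Mercury"><period units="days">87.97</period></planet>
  <planet name="Mars">
    <moon name="Phobos"> </moon>
    <moon name="Deimos"> </moon>
  </planet>
</galaxy>"""

def walk(element):
    """Yield element, then every descendant, parents before children."""
    yield element
    for child in element:
        yield from walk(child)

root = ET.fromstring(PLANETS)
recursive_tags = [e.tag for e in walk(root)]
iter_tags = [e.tag for e in root.iter()]

print(iter_tags)
print(recursive_tags == iter_tags)
```

Both traversals visit the same elements in the same order, starting with the root itself.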
We can use iter to find particular elements by tag:

    for element in root.iter(tag='period'):
        print(element.attrib)
        print(element.text)

Or:

    for element in root.iter(tag='moon'):
        print(element.attrib)
        print(element.text)

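
Printing is handy for exploring, but usually we want the values in Python data structures. The sketch below is one possible way to do that with iter alone. It again assumes a trimmed-down stand-in for planets.xml embedded as a string, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for planets.xml (an assumption for this sketch).
PLANETS = """<galaxy name="Solar System">
  <planet name="Mercury"><period units="days">87.97</period></planet>
  <planet name="Mars">
    <moon name="Phobos"> </moon>
    <moon name="Deimos"> </moon>
  </planet>
  <planet name="Venus"><period units="days">224.7</period></planet>
  <planet name="Earth"><period units="days">365.26</period></planet>
</galaxy>"""

root = ET.fromstring(PLANETS)

# element.text is a string (or None), so convert it before doing arithmetic.
periods = [float(p.text) for p in root.iter("period")]

# iter also works on a sub-element: search only below each planet.
moons = {
    planet.get("name"): [m.get("name") for m in planet.iter("moon")]
    for planet in root.iter("planet")
}

print(periods)
print(moons)
```

Calling iter on each planet rather than on the root is what keeps every moon attached to its own planet.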