"""A simple way to read lists of fields from flat XML records.

Many XML formats are very simple: all the fields are needed, there is
no tree hierarchy, all the text inside the tags is used, and the
text is short enough to fit easily in memory. SAX is pretty
good for this but it's still somewhat complicated to use. DOM is
designed to handle tree structures, so it is a bit too much for a
simple flat data structure.

This module implements a new, simpler API, which I'll call LAX. It
only works well when the elements are small and non-hierarchical. LAX
has three callbacks.

start() -- the first method called

element(tag, attrs, text, startpos, endpos) -- called once for each
    element, after the element has been fully read (i.e., when
    endElement would be called). 'tag' is the element name, 'attrs'
    is the attribute object that would be passed to startElement, and
    'text' is all the text between the two tags, that is, the
    concatenation of all the characters() calls. 'startpos' and
    'endpos' are the offsets, counted in characters of text seen so
    far, at which the element starts and ends.

end() -- the last method called (unless there was an error)

LAX.LAX is a content handler which converts the SAX events to
LAX events. Here is an example use:

>>> from Martel import Word, Whitespace, Group, Integer, Rep1, AnyEol
>>> format = Rep1(Group("line", Word("name") + Whitespace() +
...                             Integer("age")) + AnyEol())
>>> parser = format.make_parser()
>>>
>>> from Martel import LAX
>>> class PrintFields(LAX.LAX):
...     def element(self, tag, attrs, text, startpos, endpos):
...         print tag, "has", repr(text)
...
>>> parser.setContentHandler(PrintFields())
>>> text = "Maggie 3\nPorter 1\n"
>>> parser.parseString(text)
name has 'Maggie'
age has '3'
line has 'Maggie 3'
name has 'Porter'
age has '1'
line has 'Porter 1'
>>>
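
The conversion from SAX events to LAX callbacks can also be sketched
without Martel. What follows is a minimal Python 3 illustration under
stated assumptions, not this module's implementation: MiniLAX,
CollectFields and the sample XML are hypothetical names, and only the
standard xml.sax calls are used. It keeps a stack of open elements,
appends every characters() chunk to each open element, and calls
element() when the matching endElement arrives.

```python
import xml.sax
from xml.sax import handler


class MiniLAX(handler.ContentHandler):
    # Hypothetical stand-in for LAX.LAX: turns SAX events into
    # start(), element(...) and end() callbacks.

    def startDocument(self):
        self._open = []   # stack of (tag, attrs, text chunks, startpos)
        self._pos = 0     # number of text characters seen so far
        self.start()

    def startElement(self, tag, attrs):
        self._open.append((tag, attrs, [], self._pos))

    def characters(self, s):
        self._pos += len(s)
        # Text belongs to every element that is currently open.
        for term in self._open:
            term[2].append(s)

    def endElement(self, tag):
        _, attrs, chunks, startpos = self._open.pop()
        self.element(tag, attrs, "".join(chunks), startpos, self._pos)

    def endDocument(self):
        self.end()

    # The three callbacks; subclasses override the ones they need.
    def start(self):
        pass

    def element(self, tag, attrs, text, startpos, endpos):
        pass

    def end(self):
        pass


class CollectFields(MiniLAX):
    def start(self):
        self.seen = []

    def element(self, tag, attrs, text, startpos, endpos):
        self.seen.append((tag, text, startpos, endpos))


collector = CollectFields()
xml.sax.parseString(b"<line><name>Maggie</name> <age>3</age></line>",
                    collector)
for tag, text, startpos, endpos in collector.seen:
    print(tag, "has", repr(text))
```

Inner elements are reported before the element that encloses them,
because element() fires only once the closing tag has been read.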

Callbacks take some getting used to. Many people prefer an iterative
solution which returns all of the fields of a given record at one
time. The default implementation of LAX.LAX helps in this case.
The 'start' method initializes an attribute named 'groups', which
is a dictionary. When the 'element' method is called, the information
is added to groups; the key is the element name and the value is the
list of text strings. It's a list because the same field name may
occur multiple times.

If you need the element attributes as well as the name, use the
LAX.LAXAttrs class, which stores a list of 2-tuples (text, attrs)
instead of just the text.

For example:

>>> iterator = format.make_iterator("line")
>>> for record in iterator.iterateString(text, LAX.LAX()):
...     print record.groups["name"][0], "is", record.groups["age"][0]
...
Maggie is 3
Porter is 1
>>>
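
The accumulation step behind 'groups' is small enough to show on its
own. This is a hedged Python 3 sketch, not the module's code: the
helpers collect_groups and collect_groups_with_attrs are hypothetical,
and plain (tag, attrs, text) triples stand in for real parser events.

```python
def collect_groups(events):
    # Map each element name to the list of its text strings,
    # in the order the elements were completed.
    groups = {}
    for tag, attrs, text in events:
        groups.setdefault(tag, []).append(text)
    return groups


def collect_groups_with_attrs(events):
    # Same idea, but keep (text, attrs) 2-tuples, as described
    # for LAX.LAXAttrs.
    groups = {}
    for tag, attrs, text in events:
        groups.setdefault(tag, []).append((text, attrs))
    return groups


# Hypothetical events for one record with a repeated field.
events = [
    ("name", {}, "Maggie"),
    ("alias", {"lang": "en"}, "Mags"),
    ("alias", {"lang": "fr"}, "Margot"),
]

groups = collect_groups(events)
print(groups["name"][0])
print(groups["alias"])
```

Because every value is a list, a field that occurs once is still read
with an index, as in groups["name"][0].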

If you only want a few fields, you can pass the list to the constructor,
as in:

>>> lax = LAX.LAX(["name", "sequence"])
>>>
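
Restricting capture to a few fields can be pictured as follows. This
is a minimal Python 3 sketch under assumptions: FieldFilter and the
sample record are hypothetical, and the real LAX.LAX constructor may
behave differently in detail. Only elements whose tag is in the
requested list are pushed on the capture stack, so text outside them
is never stored.

```python
import xml.sax
from xml.sax import handler


class FieldFilter(handler.ContentHandler):
    # Hypothetical sketch: collect text only for the requested tags.

    def __init__(self, fields):
        handler.ContentHandler.__init__(self)
        self.fields = set(fields)
        self.groups = {}
        self._open = []   # open elements whose text is being captured

    def startElement(self, tag, attrs):
        if tag in self.fields:
            self._open.append((tag, []))

    def characters(self, s):
        for _, chunks in self._open:
            chunks.append(s)

    def endElement(self, tag):
        # Only a captured tag can close the innermost captured element.
        if self._open and self._open[-1][0] == tag:
            _, chunks = self._open.pop()
            self.groups.setdefault(tag, []).append("".join(chunks))


record = (b"<rec><name>Maggie</name><age>3</age>"
          b"<sequence>ACGT</sequence></rec>")
lax = FieldFilter(["name", "sequence"])
xml.sax.parseString(record, lax)
print(sorted(lax.groups))
```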

"""

import string
from xml.sax import handler




class LAX(handler.ContentHandler, dict):

        if name == "document":
            return self
        raise AttributeError(name)



        self.__capture = []
        self.__expect = None
        self.__pos = 0
        self.start()



        if tag in self.__fields:
            self.__capture.append( (tag, attrs, [], self.__pos) )
            self.__expect = tag

        self.__pos += len(s)
        for term in self.__capture:
            term[2].append(s)

        if tag == self.__expect:
            cap, attrs, text_items, start = self.__capture.pop()
            self.element(tag, attrs, string.join(text_items, ""),
                         start, self.__pos)
            if self.__capture:
                self.__expect = self.__capture[-1][0]
            else:
                self.__expect = None

    def element(self, tag, attrs, text, startpos, endpos):

        if self.__capture:
            missing = []
            for term in self.__capture:
                missing.append(term[0])
            raise TypeError("Looking for endElements for %s" % \
                            string.join(missing, ","))
        self.end()




    def element(self, tag, attrs, text, startpos, endpos):


    def __init__(self, text, attrs, startpos, endpos):
        self.text = text
        self.attrs = attrs
        self.startpos = startpos
        self.endpos = endpos

    def element(self, tag, attrs, text, startpos, endpos):
