# Copyright 2008 by Michiel de Hoon.  All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.

"""Parser for XML results returned by NCBI's Entrez Utilities. This
parser is used by the read() function in Bio.Entrez, and is not intended
to be used directly.
"""

# The question is how to represent an XML file as Python objects. Some
# XML files returned by NCBI look like lists, others look like dictionaries,
# and others look like a mix of lists and dictionaries.
#
# My approach is to classify each possible element in the XML as a plain
# string, an integer, a list, a dictionary, or a structure. The latter is a
# dictionary where the same key can occur multiple times; in Python, it is
# represented as a dictionary where that key occurs once, pointing to a list
# of values found in the XML file.
#
# The parser then goes through the XML and creates the appropriate Python
# object for each element. The different levels encountered in the XML are
# preserved on the Python side. So a subelement of a subelement of an element
# is a value in a dictionary that is stored in a list which is a value in
# some other dictionary (or a value in a list which itself belongs to a list
# which is a value in a dictionary, and so on). Attributes encountered in
# the XML are stored as a dictionary in a member .attributes of each element,
# and the tag name is saved in a member .tag.
#
# To decide which kind of Python object corresponds to each element in the
# XML, the parser analyzes the DTD referred to at the top of (almost) every
# XML file returned by the Entrez Utilities. This is preferred over a
# hand-written solution, since the number of DTDs is rather large and their
# contents may change over time. About half the code in this parser deals
# with parsing the DTD, and the other half with the XML itself.


import os.path
from xml.parsers import expat

# The following five classes are used to add a member .attributes to integers,
# strings, Unicode strings, lists, and dictionaries, respectively.
# (Their definitions are collapsed in the listing; the base classes below are
# inferred from this comment and from how the classes are used.)

class IntegerElement(int): pass

class StringElement(str): pass

class UnicodeElement(unicode): pass

class ListElement(list): pass

class DictionaryElement(dict): pass

# A StructureElement is like a dictionary, but some of its keys can have
# multiple values associated with it. These values are stored in a list
# under each key.
class StructureElement(dict):
    def __init__(self, keys):
        dict.__init__(self)
        for key in keys:
            dict.__setitem__(self, key, [])
        self.listkeys = keys
    # Collapsed in the listing; reconstructed from the comment above.
    def __setitem__(self, key, value):
        if key in self.listkeys:
            self[key].append(value)
        else:
            dict.__setitem__(self, key, value)

# The class name and the name of run() are collapsed in the listing; they
# are taken from Bio.Entrez.Parser.
class DataHandler:
    def __init__(self, dtd_dir):
        self.stack = []
        self.errors = []
        self.integers = []
        self.strings = []
        self.lists = []
        self.dictionaries = []
        self.structures = {}
        self.items = []
        self.dtd_dir = dtd_dir

    def run(self, handle):
        """Set up the parser and let it parse the XML results"""
        self.parser = expat.ParserCreate()
        self.parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
        self.parser.StartElementHandler = self.startElement
        self.parser.EndElementHandler = self.endElement
        self.parser.CharacterDataHandler = self.characters
        self.parser.ExternalEntityRefHandler = self.external_entity_ref_handler
        self.parser.ParseFile(handle)
        self.parser = None
        return self.object

    def startElement(self, name, attrs):
        self.content = ""
        if name in self.lists:
            object = ListElement()
        elif name in self.dictionaries:
            object = DictionaryElement()
        elif name in self.structures:
            object = StructureElement(self.structures[name])
        elif name in self.items: # Only appears in ESummary
            name = str(attrs["Name"]) # convert from Unicode
            del attrs["Name"]
            itemtype = str(attrs["Type"]) # convert from Unicode
            del attrs["Type"]
            if itemtype=="Structure":
                object = DictionaryElement()
            elif name in ("ArticleIds", "History"):
                object = StructureElement(["pubmed", "medline"])
            elif itemtype=="List":
                object = ListElement()
            else:
                object = StringElement()
            object.itemname = name
            object.itemtype = itemtype
        elif name in self.strings + self.errors + self.integers:
            self.attributes = attrs
            return
        else:
            # Element not found in DTD; this will not be stored in the record
            object = ""
        if object!="":
            object.tag = name
            if attrs:
                object.attributes = dict(attrs)
            if len(self.stack)!=0:
                current = self.stack[-1]
                try:
                    current.append(object)
                except AttributeError:
                    current[name] = object
        self.stack.append(object)

    def endElement(self, name):
        value = self.content
        if name in self.errors:
            if value=="":
                return
            else:
                raise RuntimeError(value)
        elif name in self.integers:
            value = IntegerElement(value)
        elif name in self.strings:
            # Convert Unicode strings to plain strings if possible
            try:
                value = StringElement(value)
            except UnicodeEncodeError:
                value = UnicodeElement(value)
        elif name in self.items:
            self.object = self.stack.pop()
            if self.object.itemtype in ("List", "Structure"):
                return
            elif self.object.itemtype=="Integer":
                value = IntegerElement(value)
            else:
                # Convert Unicode strings to plain strings if possible
                try:
                    value = StringElement(value)
                except UnicodeEncodeError:
                    value = UnicodeElement(value)
            name = self.object.itemname
        else:
            self.object = self.stack.pop()
            return
        value.tag = name
        if self.attributes:
            value.attributes = dict(self.attributes)
            del self.attributes
        current = self.stack[-1]
        try:
            current.append(value)
        except AttributeError:
            current[name] = value

    # Collapsed in the listing; reconstructed from how self.content is used
    # in startElement and endElement.
    def characters(self, content):
        self.content += content

    def elementDecl(self, name, model):
        """This callback function is called for each element declaration:
        <!ELEMENT       name          (...)>
        encountered in a DTD. The purpose of this function is to determine
        whether this element should be regarded as a string, integer, list,
        dictionary, structure, or error."""
        if name.upper()=="ERROR":
            self.errors.append(name)
            return
        if name=='Item' and model==(expat.model.XML_CTYPE_MIXED,
                                    expat.model.XML_CQUANT_REP,
                                    None, ((expat.model.XML_CTYPE_NAME,
                                            expat.model.XML_CQUANT_NONE,
                                            'Item',
                                            ()
                                           ),
                                          )
                                   ):
            # Special case. As far as I can tell, this only occurs in the
            # eSummary DTD.
            self.items.append(name)
            return
        # First, remove ignorable parentheses around declarations
        while (model[0] in (expat.model.XML_CTYPE_SEQ,
                            expat.model.XML_CTYPE_CHOICE)
               and model[1] in (expat.model.XML_CQUANT_NONE,
                                expat.model.XML_CQUANT_OPT)
               and len(model[3])==1):
            model = model[3][0]
        # PCDATA declarations correspond to strings
        if model[0] in (expat.model.XML_CTYPE_MIXED,
                        expat.model.XML_CTYPE_EMPTY):
            self.strings.append(name)
            return
        # List-type elements
        if (model[0] in (expat.model.XML_CTYPE_CHOICE,
                         expat.model.XML_CTYPE_SEQ) and
            model[1] in (expat.model.XML_CQUANT_PLUS,
                         expat.model.XML_CQUANT_REP)):
            self.lists.append(name)
            return
        # This is the tricky case. Check which keys can occur multiple
        # times. If only one key is possible, and it can occur multiple
        # times, then this is a list. If more than one key is possible,
        # but none of them can occur multiple times, then this is a
        # dictionary. Otherwise, this is a structure.
        # In 'single' and 'multiple', we keep track of which keys can occur
        # only once, and which can occur multiple times.
        single = []
        multiple = []
        # The 'count' function is called recursively to make sure all the
        # children in this model are counted. Error keys are ignored;
        # they raise an exception in Python.
        def count(model):
            quantifier, name, children = model[1:]
            if name is None:
                if quantifier in (expat.model.XML_CQUANT_PLUS,
                                  expat.model.XML_CQUANT_REP):
                    for child in children:
                        multiple.append(child[2])
                else:
                    for child in children:
                        count(child)
            elif name.upper()!="ERROR":
                if quantifier in (expat.model.XML_CQUANT_NONE,
                                  expat.model.XML_CQUANT_OPT):
                    single.append(name)
                elif quantifier in (expat.model.XML_CQUANT_PLUS,
                                    expat.model.XML_CQUANT_REP):
                    multiple.append(name)
        count(model)
        if len(single)==0 and len(multiple)==1:
            self.lists.append(name)
        elif len(multiple)==0:
            self.dictionaries.append(name)
        else:
            self.structures.update({name: multiple})

    def external_entity_ref_handler(self, context, base, systemId, publicId):
        """The purpose of this function is to load the DTD locally, instead
        of downloading it from the URL specified in the XML. Using the local
        DTD results in much faster parsing. If the DTD is not found locally,
        we try to download it. In practice, this may fail though, if the XML
        relies on many interrelated DTDs. If new DTDs appear, putting them in
        Bio/Entrez/DTDs will allow the parser to see them."""
        location, filename = os.path.split(systemId)
        path = os.path.join(self.dtd_dir, filename)
        try:
            handle = open(path)
        except IOError:
            message = """\
Unable to load DTD file %s.

Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez.
Though most of NCBI's DTD files are included in the Biopython distribution,
sometimes you may find that a particular DTD file is missing. In such a
case, you can download the DTD file from NCBI and install it manually.

Usually, you can find missing DTD files at either
    http://www.ncbi.nlm.nih.gov/dtd/
or
    http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/
If you cannot find %s there, you may also try to search
for it with a search engine such as Google.

Please save %s in the directory
%s
in order for Bio.Entrez to find it.
Alternatively, you can save %s in the directory
Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.

Please also inform the Biopython developers by sending an email to
biopython-dev@biopython.org to inform us about this missing DTD, so that we
can include it with the next release of Biopython.
""" % (filename, filename, filename, self.dtd_dir, filename)
            raise RuntimeError(message)

        parser = self.parser.ExternalEntityParserCreate(context)
        parser.ElementDeclHandler = self.elementDecl
        parser.ParseFile(handle)
        return 1
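The `elementDecl` callback receives each DTD content model from expat as a nested tuple `(type, quantifier, name, children)`. As a rough illustration, and not part of the Biopython source, the Python 3 sketch below feeds a tiny made-up document with an internal DTD to expat and records those tuples. The element names `Result` and `Id` and the helper `collect_models` are hypothetical.

```python
from xml.parsers import expat

# A made-up document with an internal DTD subset (hypothetical element names).
DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE Result [
<!ELEMENT Result (Id*)>
<!ELEMENT Id (#PCDATA)>
]>
<Result><Id>1</Id><Id>2</Id></Result>"""


def collect_models(document):
    """Return a dict mapping element names to expat content-model tuples."""
    models = {}

    def element_decl(name, model):
        # model is (type, quantifier, name, children), the same shape that
        # the elementDecl method in the listing inspects.
        models[name] = model

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = element_decl
    parser.Parse(document, True)
    return models


models = collect_models(DOCUMENT)
for name, model in sorted(models.items()):
    print(name, model)
```

Under the rules in `elementDecl`, `Id` would be treated as a string because its model type is `expat.model.XML_CTYPE_MIXED`, while `Result`, whose only child may repeat, should end up classified as a list.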
Generated by Epydoc 3.0.1 on Sun May 3 15:51:44 2009 | http://epydoc.sourceforge.net |