Source Code for Package Bio.Prosite

# Copyright 1999 by Jeffrey Chang.  All rights reserved.
# Copyright 2000 by Jeffrey Chang.  All rights reserved.
# Revisions Copyright 2007 by Peter Cock.  All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.
"""Module for working with Prosite files from ExPASy (OBSOLETE).

Most of the functionality in this module has moved to Bio.ExPASy.Prosite;
please see

Bio.ExPASy.Prosite.read          To read a Prosite file containing one entry.
Bio.ExPASy.Prosite.parse         Iterates over entries in a Prosite file.
Bio.ExPASy.Prosite.Record        Holds Prosite data.

For the following, please see the new module Bio.ExPASy.ScanProsite:

scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
_extract_pattern_hits Extract Prosite patterns from a web page.
PatternHit            Holds data from a hit against a Prosite pattern.

The other functions and classes in Bio.Prosite (including
Bio.Prosite.index_file and Bio.Prosite.Dictionary) are considered deprecated,
and were not moved to Bio.ExPASy.Prosite. If you use this functionality,
please contact the Biopython developers at biopython-dev@biopython.org to
avoid permanent removal of this module from Biopython.


This module provides code to work with the prosite.dat file from
Prosite:
http://www.expasy.ch/prosite/

Tested with:
Release 15.0, July 1998
Release 16.0, July 1999
Release 17.0, Dec 2001
Release 19.0, Mar 2006


Functions:
parse                 Iterates over entries in a Prosite file.
scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
index_file            Index a Prosite file for a Dictionary.
_extract_record       Extract Prosite data from a web page.
_extract_pattern_hits Extract Prosite patterns from a web page.


Classes:
Record                Holds Prosite data.
PatternHit            Holds data from a hit against a Prosite pattern.
Dictionary            Accesses a Prosite file using a dictionary interface.
RecordParser          Parses a Prosite record into a Record object.

_Scanner              Scans Prosite-formatted data.
_RecordConsumer       Consumes Prosite data to a Record object.

"""
from types import *
import re
import sgmllib
from Bio import File
from Bio import Index
from Bio.ParserSupport import *


# There is probably a cleaner way to write the read/parse functions
# if we don't use the "parser = RecordParser(); parser.parse(handle)"
# approach. Leaving that for the next revision of Bio.Prosite.
def parse(handle):
    import cStringIO
    parser = RecordParser()
    text = ""
    for line in handle:
        text += line
        if line[:2] == '//':
            handle = cStringIO.StringIO(text)
            record = parser.parse(handle)
            text = ""
            if not record:
                # Then this was the copyright notice
                continue
            yield record

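The record-splitting loop in parse above accumulates lines until a '//' terminator, then hands the accumulated text to the record parser. That technique can be sketched on its own in modern Python (the `split_records` helper name is illustrative, not part of Biopython):

```python
import io

def split_records(handle):
    """Yield the raw text of each Prosite-style record.

    A record is every line up to and including its '//' terminator
    line, mirroring the accumulation loop in parse().
    """
    text = ""
    for line in handle:
        text += line
        if line[:2] == "//":
            yield text
            text = ""

data = "ID ONE; PATTERN.\nPA A-x-B.\n//\nID TWO; MATRIX.\n//\n"
chunks = list(split_records(io.StringIO(data)))
# chunks[0] holds the first record's text, terminator included
```

Splitting the stream first and parsing each chunk separately is what lets parse() behave as a generator over an arbitrarily large file.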
def read(handle):
    parser = RecordParser()
    try:
        record = parser.parse(handle)
    except ValueError, error:
        if error.message == "There doesn't appear to be a record":
            raise ValueError("No Prosite record found")
        else:
            raise error
    # We should have reached the end of the record by now
    remainder = handle.read()
    if remainder:
        raise ValueError("More than one Prosite record found")
    return record

class Record:
    """Holds information from a Prosite record.

    Members:
    name           ID of the record.  e.g. ADH_ZINC
    type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
    accession      e.g. PS00387
    created        Date the entry was created.  (MMM-YYYY)
    data_update    Date the 'primary' data was last updated.
    info_update    Date data other than 'primary' data was last updated.
    pdoc           ID of the PROSITE DOCumentation.

    description    Free-format description.
    pattern        The PROSITE pattern.  See docs.
    matrix         List of strings that describes a matrix entry.
    rules          List of rule definitions (from RU lines).  (strings)
    prorules       List of prorules (from PR lines).  (strings)

    NUMERICAL RESULTS
    nr_sp_release  SwissProt release.
    nr_sp_seqs     Number of seqs in that release of Swiss-Prot.  (int)
    nr_total       Number of hits in Swiss-Prot.  tuple of (hits, seqs)
    nr_positive    True positives.  tuple of (hits, seqs)
    nr_unknown     Could be positives.  tuple of (hits, seqs)
    nr_false_pos   False positives.  tuple of (hits, seqs)
    nr_false_neg   False negatives.  (int)
    nr_partial     False negatives, because they are fragments.  (int)

    COMMENTS
    cc_taxo_range  Taxonomic range.  See docs for format
    cc_max_repeat  Maximum number of repetitions in a protein
    cc_site        Interesting site.  list of tuples (pattern pos, desc.)
    cc_skip_flag   Can this entry be ignored?
    cc_matrix_type
    cc_scaling_db
    cc_author
    cc_ft_key
    cc_ft_desc
    cc_version     version number (introduced in release 19.0)

    DATA BANK REFERENCES - The following are all
                           lists of tuples (swiss-prot accession,
                                            swiss-prot name)
    dr_positive
    dr_false_neg
    dr_false_pos
    dr_potential   Potential hits, but fingerprint region not yet available.
    dr_unknown     Could possibly belong

    pdb_structs    List of PDB entries.

    """
    def __init__(self):
        self.name = ''
        self.type = ''
        self.accession = ''
        self.created = ''
        self.data_update = ''
        self.info_update = ''
        self.pdoc = ''

        self.description = ''
        self.pattern = ''
        self.matrix = []
        self.rules = []
        self.prorules = []
        self.postprocessing = []

        self.nr_sp_release = ''
        self.nr_sp_seqs = ''
        self.nr_total = (None, None)
        self.nr_positive = (None, None)
        self.nr_unknown = (None, None)
        self.nr_false_pos = (None, None)
        self.nr_false_neg = None
        self.nr_partial = None

        self.cc_taxo_range = ''
        self.cc_max_repeat = ''
        self.cc_site = []
        self.cc_skip_flag = ''

        self.dr_positive = []
        self.dr_false_neg = []
        self.dr_false_pos = []
        self.dr_potential = []
        self.dr_unknown = []

        self.pdb_structs = []

class PatternHit:
    """Holds information from a hit against a Prosite pattern.

    Members:
    name           ID of the record.  e.g. ADH_ZINC
    accession      e.g. PS00387
    pdoc           ID of the PROSITE DOCumentation.
    description    Free-format description.
    matches        List of tuples (start, end, sequence) where
                   start and end are indexes of the match, and sequence is
                   the sequence matched.

    """
    def __init__(self):
        self.name = None
        self.accession = None
        self.pdoc = None
        self.description = None
        self.matches = []

    def __str__(self):
        lines = []
        lines.append("%s %s %s" % (self.accession, self.pdoc, self.name))
        lines.append(self.description)
        lines.append('')
        if len(self.matches) > 1:
            lines.append("Number of matches: %s" % len(self.matches))
        for i in range(len(self.matches)):
            start, end, seq = self.matches[i]
            range_str = "%d-%d" % (start, end)
            if len(self.matches) > 1:
                lines.append("%7d %10s %s" % (i+1, range_str, seq))
            else:
                lines.append("%7s %10s %s" % (' ', range_str, seq))
        return "\n".join(lines)


class Dictionary:
    """Accesses a Prosite file using a dictionary interface.

    """
    __filename_key = '__filename'

    def __init__(self, indexname, parser=None):
        """__init__(self, indexname, parser=None)

        Open a Prosite Dictionary.  indexname is the name of the
        index for the dictionary.  The index should have been created
        using the index_file function.  parser is an optional Parser
        object to change the results into another form.  If set to None,
        then the raw contents of the file will be returned.

        """
        self._index = Index.Index(indexname)
        self._handle = open(self._index[Dictionary.__filename_key])
        self._parser = parser

    def __len__(self):
        return len(self._index)

    def __getitem__(self, key):
        start, len = self._index[key]
        self._handle.seek(start)
        data = self._handle.read(len)
        if self._parser is not None:
            return self._parser.parse(File.StringHandle(data))
        return data

    def __getattr__(self, name):
        return getattr(self._index, name)

class RecordParser(AbstractParser):
    """Parses Prosite data into a Record object.

    """
    def __init__(self):
        self._scanner = _Scanner()
        self._consumer = _RecordConsumer()

    def parse(self, handle):
        self._scanner.feed(handle, self._consumer)
        return self._consumer.data

class _Scanner:
    """Scans Prosite-formatted data.

    Tested with:
    Release 15.0, July 1998

    """
    def feed(self, handle, consumer):
        """feed(self, handle, consumer)

        Feed in Prosite data for scanning.  handle is a file-like
        object that contains prosite data.  consumer is a
        Consumer object that will receive events as the report is scanned.

        """
        if isinstance(handle, File.UndoHandle):
            uhandle = handle
        else:
            uhandle = File.UndoHandle(handle)

        consumer.finished = False
        while not consumer.finished:
            line = uhandle.peekline()
            if not line:
                break
            elif is_blank_line(line):
                # Skip blank lines between records
                uhandle.readline()
                continue
            elif line[:2] == 'ID':
                self._scan_record(uhandle, consumer)
            elif line[:2] == 'CC':
                self._scan_copyrights(uhandle, consumer)
            else:
                raise ValueError("There doesn't appear to be a record")

    def _scan_copyrights(self, uhandle, consumer):
        consumer.start_copyrights()
        self._scan_line('CC', uhandle, consumer.copyright, any_number=1)
        self._scan_terminator(uhandle, consumer)
        consumer.end_copyrights()

    def _scan_record(self, uhandle, consumer):
        consumer.start_record()
        for fn in self._scan_fns:
            fn(self, uhandle, consumer)

            # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before
            # the 3D lines, instead of the other way around.
            # Thus, I'll give the 3D lines another chance after the DO
            # lines are finished.
            if fn is self._scan_do.im_func:
                self._scan_3d(uhandle, consumer)
        consumer.end_record()

    def _scan_line(self, line_type, uhandle, event_fn,
                   exactly_one=None, one_or_more=None, any_number=None,
                   up_to_one=None):
        # Callers must set exactly one of exactly_one, one_or_more, or
        # any_number to a true value.  I do not explicitly check to
        # make sure this function is called correctly.

        # This does not guarantee any parameter safety, but I
        # like the readability.  The other strategy I tried was to have
        # parameters min_lines, max_lines.

        if exactly_one or one_or_more:
            read_and_call(uhandle, event_fn, start=line_type)
        if one_or_more or any_number:
            while 1:
                if not attempt_read_and_call(uhandle, event_fn,
                                             start=line_type):
                    break
        if up_to_one:
            attempt_read_and_call(uhandle, event_fn, start=line_type)

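The exactly_one / one_or_more / any_number / up_to_one flags in _scan_line encode simple line-count cardinalities. A standalone sketch of the same dispatch, operating on a plain list of lines instead of Biopython's UndoHandle machinery (the `scan_line` name here is illustrative):

```python
def scan_line(lines, line_type, collect,
              exactly_one=False, one_or_more=False,
              any_number=False, up_to_one=False):
    """Consume lines starting with line_type from the front of `lines`.

    Mirrors _scan_line's flags: exactly_one/one_or_more require the
    first line to match; one_or_more/any_number then keep consuming
    matches; up_to_one takes at most one optional match.
    """
    if exactly_one or one_or_more:
        if not (lines and lines[0].startswith(line_type)):
            raise ValueError("expected a %r line" % line_type)
        collect(lines.pop(0))
    if one_or_more or any_number:
        while lines and lines[0].startswith(line_type):
            collect(lines.pop(0))
    if up_to_one:
        if lines and lines[0].startswith(line_type):
            collect(lines.pop(0))

seen = []
lines = ["PA   A-B.", "PA   C-D.", "NR   /TOTAL=1(1);"]
scan_line(lines, "PA", seen.append, any_number=True)
# seen now holds both PA lines; the NR line is left for the next scanner
```

Because any_number accepts zero matches, optional line types simply fall through to the next scan function without raising.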
    def _scan_id(self, uhandle, consumer):
        self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)

    def _scan_ac(self, uhandle, consumer):
        self._scan_line('AC', uhandle, consumer.accession, exactly_one=1)

    def _scan_dt(self, uhandle, consumer):
        self._scan_line('DT', uhandle, consumer.date, exactly_one=1)

    def _scan_de(self, uhandle, consumer):
        self._scan_line('DE', uhandle, consumer.description, exactly_one=1)

    def _scan_pa(self, uhandle, consumer):
        self._scan_line('PA', uhandle, consumer.pattern, any_number=1)

    def _scan_ma(self, uhandle, consumer):
        self._scan_line('MA', uhandle, consumer.matrix, any_number=1)
##        # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15
##        # contain a CC line buried within an 'MA' line.  Need to check
##        # for that.
##        while 1:
##            if not attempt_read_and_call(uhandle, consumer.matrix,
##                                         start='MA'):
##                line1 = uhandle.readline()
##                line2 = uhandle.readline()
##                uhandle.saveline(line2)
##                uhandle.saveline(line1)
##                if line1[:2] == 'CC' and line2[:2] == 'MA':
##                    read_and_call(uhandle, consumer.comment, start='CC')
##                else:
##                    break

    def _scan_pp(self, uhandle, consumer):
        # New PP line, PostProcessing, just after the MA line
        self._scan_line('PP', uhandle, consumer.postprocessing, any_number=1)

    def _scan_ru(self, uhandle, consumer):
        self._scan_line('RU', uhandle, consumer.rule, any_number=1)

    def _scan_nr(self, uhandle, consumer):
        self._scan_line('NR', uhandle, consumer.numerical_results,
                        any_number=1)

    def _scan_cc(self, uhandle, consumer):
        self._scan_line('CC', uhandle, consumer.comment, any_number=1)

    def _scan_dr(self, uhandle, consumer):
        self._scan_line('DR', uhandle, consumer.database_reference,
                        any_number=1)

    def _scan_3d(self, uhandle, consumer):
        self._scan_line('3D', uhandle, consumer.pdb_reference,
                        any_number=1)

    def _scan_pr(self, uhandle, consumer):
        # New PR line, ProRule, between 3D and DO lines
        self._scan_line('PR', uhandle, consumer.prorule, any_number=1)

    def _scan_do(self, uhandle, consumer):
        self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)

    def _scan_terminator(self, uhandle, consumer):
        self._scan_line('//', uhandle, consumer.terminator, exactly_one=1)

    # This is a list of scan functions in the order expected in the file.
    # The function definitions define how many times each line type is
    # expected (or if it is optional):
    _scan_fns = [
        _scan_id,
        _scan_ac,
        _scan_dt,
        _scan_de,
        _scan_pa,
        _scan_ma,
        _scan_pp,
        _scan_ru,
        _scan_nr,
        _scan_cc,

        # This is a really dirty hack, and should be fixed properly at
        # some point.  ZN2_CY6_FUNGAL_2, DNAJ_2 in Rel 15 and PS50309
        # in Rel 17 have lines out of order.  Thus, I have to rescan
        # these, which decreases performance.
        _scan_ma,
        _scan_nr,
        _scan_cc,

        _scan_dr,
        _scan_3d,
        _scan_pr,
        _scan_do,
        _scan_terminator,
    ]

class _RecordConsumer(AbstractConsumer):
    """Consumer that converts a Prosite record to a Record object.

    Members:
    data    Record with Prosite data.

    """
    def __init__(self):
        self.data = None

    def start_record(self):
        self.data = Record()

    def end_record(self):
        self._clean_record(self.data)

    def identification(self, line):
        cols = line.split()
        if len(cols) != 3:
            raise ValueError("I don't understand identification line\n%s"
                             % line)
        self.data.name = self._chomp(cols[1])  # don't want ';'
        self.data.type = self._chomp(cols[2])  # don't want '.'

    def accession(self, line):
        cols = line.split()
        if len(cols) != 2:
            raise ValueError("I don't understand accession line\n%s" % line)
        self.data.accession = self._chomp(cols[1])

    def date(self, line):
        uprline = line.upper()
        cols = uprline.split()

        # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE'
        if cols[2] != '(CREATED);' or \
           cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \
           cols[7][:4] != '(INF' or cols[8] != 'UPDATE).':
            raise ValueError("I don't understand date line\n%s" % line)

        self.data.created = cols[1]
        self.data.data_update = cols[3]
        self.data.info_update = cols[6]

    def description(self, line):
        self.data.description = self._clean(line)

    def pattern(self, line):
        self.data.pattern = self.data.pattern + self._clean(line)

    def matrix(self, line):
        self.data.matrix.append(self._clean(line))

    def postprocessing(self, line):
        postprocessing = self._clean(line).split(";")
        self.data.postprocessing.extend(postprocessing)

    def rule(self, line):
        self.data.rules.append(self._clean(line))

    def numerical_results(self, line):
        cols = self._clean(line).split(";")
        for col in cols:
            if not col:
                continue
            qual, data = [word.lstrip() for word in col.split("=")]
            if qual == '/RELEASE':
                release, seqs = data.split(",")
                self.data.nr_sp_release = release
                self.data.nr_sp_seqs = int(seqs)
            elif qual == '/FALSE_NEG':
                self.data.nr_false_neg = int(data)
            elif qual == '/PARTIAL':
                self.data.nr_partial = int(data)
            elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']:
                m = re.match(r'(\d+)\((\d+)\)', data)
                if not m:
                    raise Exception("Broken data %s in comment line\n%s"
                                    % (repr(data), line))
                hits = tuple(map(int, m.groups()))
                if qual == "/TOTAL":
                    self.data.nr_total = hits
                elif qual == "/POSITIVE":
                    self.data.nr_positive = hits
                elif qual == "/UNKNOWN":
                    self.data.nr_unknown = hits
                elif qual == "/FALSE_POS":
                    self.data.nr_false_pos = hits
            else:
                raise ValueError("Unknown qual %s in comment line\n%s"
                                 % (repr(qual), line))

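The '123(45)' hits/sequences pairs on NR lines are picked apart with a small regular expression. The same parsing in isolation (the `parse_nr_counts` helper name is illustrative):

```python
import re

def parse_nr_counts(data):
    """Parse an NR value like '3541(3540)' into a (hits, seqs) tuple,
    using the same regular expression as numerical_results above."""
    m = re.match(r'(\d+)\((\d+)\)', data)
    if not m:
        raise ValueError("Broken data %r" % data)
    return tuple(map(int, m.groups()))

hits = parse_nr_counts("3541(3540)")
# hits == (3541, 3540)
```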
    def comment(self, line):
        # Expect CC lines like this:
        #   CC   /TAXO-RANGE=??EPV; /MAX-REPEAT=2;
        # Can (normally) split on ";" and then on "="
        cols = self._clean(line).split(";")
        for col in cols:
            if not col or col[:17] == 'Automatic scaling':
                # DNAJ_2 in Release 15 has a non-standard comment line:
                #   CC   Automatic scaling using reversed database
                # Throw it away.  (Should I keep it?)
                continue
            if col.count("=") == 0:
                # Missing qualifier!  Can we recover gracefully?
                # For example, from Bug 2403, in PS50293 have:
                #   CC /AUTHOR=K_Hofmann; N_Hulo
                continue
            qual, data = [word.lstrip() for word in col.split("=")]
            if qual == '/TAXO-RANGE':
                self.data.cc_taxo_range = data
            elif qual == '/MAX-REPEAT':
                self.data.cc_max_repeat = data
            elif qual == '/SITE':
                pos, desc = data.split(",")
                self.data.cc_site.append((int(pos), desc))
            elif qual == '/SKIP-FLAG':
                self.data.cc_skip_flag = data
            elif qual == '/MATRIX_TYPE':
                self.data.cc_matrix_type = data
            elif qual == '/SCALING_DB':
                self.data.cc_scaling_db = data
            elif qual == '/AUTHOR':
                self.data.cc_author = data
            elif qual == '/FT_KEY':
                self.data.cc_ft_key = data
            elif qual == '/FT_DESC':
                self.data.cc_ft_desc = data
            elif qual == '/VERSION':
                self.data.cc_version = data
            else:
                raise ValueError("Unknown qual %s in comment line\n%s"
                                 % (repr(qual), line))

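The split-on-';' then split-on-'=' convention used by comment() can be condensed into a dict-building sketch (the real consumer dispatches each qualifier to a named attribute instead; `parse_qualifiers` is a hypothetical helper):

```python
def parse_qualifiers(value):
    """Turn CC-style content such as '/TAXO-RANGE=??EPV; /MAX-REPEAT=2;'
    into a {qualifier: value} dict, skipping empty or malformed fields."""
    quals = {}
    for col in value.split(";"):
        col = col.strip()
        if not col or "=" not in col:
            # e.g. the stray 'N_Hulo' fragment from Bug 2403 is skipped,
            # just as comment() skips fields without an '=' sign
            continue
        qual, _, data = col.partition("=")
        quals[qual.strip()] = data.strip()
    return quals

parsed = parse_qualifiers("/TAXO-RANGE=??EPV; /MAX-REPEAT=2;")
# parsed == {"/TAXO-RANGE": "??EPV", "/MAX-REPEAT": "2"}
```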
    def database_reference(self, line):
        refs = self._clean(line).split(";")
        for ref in refs:
            if not ref:
                continue
            acc, name, type = [word.strip() for word in ref.split(",")]
            if type == 'T':
                self.data.dr_positive.append((acc, name))
            elif type == 'F':
                self.data.dr_false_pos.append((acc, name))
            elif type == 'N':
                self.data.dr_false_neg.append((acc, name))
            elif type == 'P':
                self.data.dr_potential.append((acc, name))
            elif type == '?':
                self.data.dr_unknown.append((acc, name))
            else:
                raise ValueError("I don't understand type flag %s" % type)

    def pdb_reference(self, line):
        cols = line.split()
        for id in cols[1:]:  # get all but the '3D' col
            self.data.pdb_structs.append(self._chomp(id))

    def prorule(self, line):
        # Assume that each PR line can contain multiple ";" separated rules
        rules = self._clean(line).split(";")
        self.data.prorules.extend(rules)

    def documentation(self, line):
        self.data.pdoc = self._chomp(self._clean(line))

    def terminator(self, line):
        self.finished = True

    def _chomp(self, word, to_chomp='.,;'):
        # Remove the punctuation at the end of a word.
        if word[-1] in to_chomp:
            return word[:-1]
        return word

    def _clean(self, line, rstrip=1):
        # Clean up a line.
        if rstrip:
            return line[5:].rstrip()
        return line[5:]

def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
    """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
    list of PatternHit's

    Search a sequence for occurrences of Prosite patterns.  You can
    specify either a sequence in seq or a SwissProt/trEMBL ID or accession
    in id.  Only one of those should be given.  If exclude_frequent
    is true, then the patterns with the high probability of occurring
    will be excluded.

    """
    from Bio import ExPASy
    if (seq and id) or not (seq or id):
        raise ValueError("Please specify either a sequence or an id")
    handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
    return _extract_pattern_hits(handle)

def _extract_pattern_hits(handle):
    """_extract_pattern_hits(handle) -> list of PatternHit's

    Extract hits from a web page.  Raises a ValueError if there
    was an error in the query.

    """
    class parser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.hits = []
            self.broken_message = 'Some error occurred'
            self._in_pre = 0
            self._current_hit = None
            self._last_found = None   # Save state of parsing

        def handle_data(self, data):
            if data.find('try again') >= 0:
                self.broken_message = data
                return
            elif data == 'illegal':
                self.broken_message = 'Sequence contains illegal characters'
                return
            if not self._in_pre:
                return
            elif not data.strip():
                return
            if self._last_found is None and data[:4] == 'PDOC':
                self._current_hit.pdoc = data
                self._last_found = 'pdoc'
            elif self._last_found == 'pdoc':
                if data[:2] != 'PS':
                    raise ValueError("Expected accession but got:\n%s" % data)
                self._current_hit.accession = data
                self._last_found = 'accession'
            elif self._last_found == 'accession':
                self._current_hit.name = data
                self._last_found = 'name'
            elif self._last_found == 'name':
                self._current_hit.description = data
                self._last_found = 'description'
            elif self._last_found == 'description':
                m = re.findall(r'(\d+)-(\d+) (\w+)', data)
                for start, end, seq in m:
                    self._current_hit.matches.append(
                        (int(start), int(end), seq))

        def do_hr(self, attrs):
            # <HR> inside a <PRE> section means a new hit.
            if self._in_pre:
                self._current_hit = PatternHit()
                self.hits.append(self._current_hit)
                self._last_found = None

        def start_pre(self, attrs):
            self._in_pre = 1
            self.broken_message = None   # Probably not broken

        def end_pre(self):
            self._in_pre = 0

    p = parser()
    p.feed(handle.read())
    if p.broken_message:
        raise ValueError(p.broken_message)
    return p.hits

def index_file(filename, indexname, rec2key=None):
    """index_file(filename, indexname, rec2key=None)

    Index a Prosite file.  filename is the name of the file.
    indexname is the name of the dictionary.  rec2key is an
    optional callback that takes a Record and generates a unique key
    (e.g. the accession number) for the record.  If not specified,
    the id name will be used.

    """
    import os
    if not os.path.exists(filename):
        raise ValueError("%s does not exist" % filename)

    index = Index.Index(indexname, truncate=1)
    index[Dictionary._Dictionary__filename_key] = filename

    handle = open(filename)
    records = parse(handle)
    end = 0L
    for record in records:
        start = end
        end = long(handle.tell())
        length = end - start

        if rec2key is not None:
            key = rec2key(record)
        else:
            key = record.name

        if not key:
            raise KeyError("empty key was produced")
        elif key in index:
            raise KeyError("duplicate key %s found" % key)

        index[key] = start, length
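index_file stores each entry as a (start, length) pair of file offsets so Dictionary.__getitem__ can later seek and read it back. The offset bookkeeping can be sketched without the Index machinery (Python 3, character offsets, hypothetical `index_offsets` helper; the CC-only copyright block case is omitted):

```python
import io

def index_offsets(handle):
    """Map each record's ID to a (start, length) span within the stream,
    where a record runs up to and including its '//' terminator line."""
    index = {}
    pos = 0       # offset where the current record started
    text = ""
    key = None
    for line in handle:
        if key is None and line.startswith("ID"):
            key = line.split()[1].rstrip(";")   # e.g. 'ADH_ZINC;' -> 'ADH_ZINC'
        text += line
        if line.startswith("//"):
            index[key] = (pos, len(text))
            pos += len(text)
            text = ""
            key = None
    return index

data = "ID ONE; PATTERN.\n//\nID TWO; MATRIX.\n//\n"
spans = index_offsets(io.StringIO(data))
# spans["TWO"] starts exactly where spans["ONE"] ends
```

Retrieval is then a seek to start followed by a read of length, which is exactly what Dictionary.__getitem__ does against the indexed file.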