
Source Code for Package Bio.Prosite

# Copyright 1999 by Jeffrey Chang.  All rights reserved.
# Copyright 2000 by Jeffrey Chang.  All rights reserved.
# Revisions Copyright 2007 by Peter Cock.  All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.
"""Module for working with Prosite files from ExPASy (OBSOLETE).

Most of the functionality in this module has moved to Bio.ExPASy.Prosite;
please see

Bio.ExPASy.Prosite.read          To read a Prosite file containing one entry.
Bio.ExPASy.Prosite.parse         Iterates over entries in a Prosite file.
Bio.ExPASy.Prosite.Record        Holds Prosite data.

For the following, please see the new module Bio.ExPASy.ScanProsite:

scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
_extract_pattern_hits Extract Prosite patterns from a web page.
PatternHit            Holds data from a hit against a Prosite pattern.

The other functions and classes in Bio.Prosite (including
Bio.Prosite.index_file and Bio.Prosite.Dictionary) are considered deprecated,
and were not moved to Bio.ExPASy.Prosite. If you use this functionality,
please contact the Biopython developers at biopython-dev@biopython.org to
avoid permanent removal of this module from Biopython.


This module provides code to work with the prosite dat file from
Prosite.
http://www.expasy.ch/prosite/

Tested with:
Release 15.0, July 1998
Release 16.0, July 1999
Release 17.0, Dec 2001
Release 19.0, Mar 2006


Functions:
parse                 Iterates over entries in a Prosite file.
scan_sequence_expasy  Scan a sequence for occurrences of Prosite patterns.
index_file            Index a Prosite file for a Dictionary.
_extract_record       Extract Prosite data from a web page.
_extract_pattern_hits Extract Prosite patterns from a web page.


Classes:
Record                Holds Prosite data.
PatternHit            Holds data from a hit against a Prosite pattern.
Dictionary            Accesses a Prosite file using a dictionary interface.
RecordParser          Parses a Prosite record into a Record object.
Iterator              Iterates over entries in a Prosite file; DEPRECATED.

_Scanner              Scans Prosite-formatted data.
_RecordConsumer       Consumes Prosite data to a Record object.

"""
from types import *
import re
import sgmllib
from Bio import File
from Bio import Index
from Bio.ParserSupport import *


# There is probably a cleaner way to write the read/parse functions
# if we don't use the "parser = RecordParser(); parser.parse(handle)"
# approach. Leaving that for the next revision of Bio.Prosite.
def parse(handle):
    import cStringIO
    parser = RecordParser()
    text = ""
    for line in handle:
        text += line
        if line[:2] == '//':
            handle = cStringIO.StringIO(text)
            record = parser.parse(handle)
            text = ""
            if not record:  # Then this was the copyright notice
                continue
            yield record

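# A minimal usage sketch (not part of the original module): iterate over the
# records in a local copy of the Prosite data file.  The filename
# "prosite.dat" is only an illustrative assumption; the replacement function
# Bio.ExPASy.Prosite.parse is used the same way.
#
#     handle = open("prosite.dat")
#     for record in parse(handle):
#         print record.accession, record.name, record.type
#     handle.close()
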
def read(handle):
    parser = RecordParser()
    try:
        record = parser.parse(handle)
    except ValueError, error:
        if error.message == "There doesn't appear to be a record":
            raise ValueError("No Prosite record found")
        else:
            raise error
    # We should have reached the end of the record by now
    remainder = handle.read()
    if remainder:
        raise ValueError("More than one Prosite record found")
    return record

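# Usage sketch (an assumption for illustration, not from the original source):
# read() expects a handle holding exactly one Prosite entry, e.g. a single
# record saved to a hypothetical file "one_record.dat".
#
#     handle = open("one_record.dat")
#     record = read(handle)
#     print record.accession, record.description
#     handle.close()
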
class Record:
    """Holds information from a Prosite record.

    Members:
    name           ID of the record.  e.g. ADH_ZINC
    type           Type of entry.  e.g. PATTERN, MATRIX, or RULE
    accession      e.g. PS00387
    created        Date the entry was created.  (MMM-YYYY)
    data_update    Date the 'primary' data was last updated.
    info_update    Date data other than 'primary' data was last updated.
    pdoc           ID of the PROSITE DOCumentation.

    description    Free-format description.
    pattern        The PROSITE pattern.  See docs.
    matrix         List of strings that describes a matrix entry.
    rules          List of rule definitions (from RU lines).  (strings)
    prorules       List of prorules (from PR lines).  (strings)

    NUMERICAL RESULTS
    nr_sp_release  SwissProt release.
    nr_sp_seqs     Number of seqs in that release of Swiss-Prot.  (int)
    nr_total       Number of hits in Swiss-Prot.  tuple of (hits, seqs)
    nr_positive    True positives.  tuple of (hits, seqs)
    nr_unknown     Could be positives.  tuple of (hits, seqs)
    nr_false_pos   False positives.  tuple of (hits, seqs)
    nr_false_neg   False negatives.  (int)
    nr_partial     False negatives, because they are fragments.  (int)

    COMMENTS
    cc_taxo_range  Taxonomic range.  See docs for format
    cc_max_repeat  Maximum number of repetitions in a protein
    cc_site        Interesting site.  list of tuples (pattern pos, desc.)
    cc_skip_flag   Can this entry be ignored?
    cc_matrix_type
    cc_scaling_db
    cc_author
    cc_ft_key
    cc_ft_desc
    cc_version     version number (introduced in release 19.0)

    DATA BANK REFERENCES - The following are all
                   lists of tuples (swiss-prot accession, swiss-prot name)
    dr_positive
    dr_false_neg
    dr_false_pos
    dr_potential   Potential hits, but fingerprint region not yet available.
    dr_unknown     Could possibly belong

    pdb_structs    List of PDB entries.

    """
    def __init__(self):
        self.name = ''
        self.type = ''
        self.accession = ''
        self.created = ''
        self.data_update = ''
        self.info_update = ''
        self.pdoc = ''

        self.description = ''
        self.pattern = ''
        self.matrix = []
        self.rules = []
        self.prorules = []
        self.postprocessing = []

        self.nr_sp_release = ''
        self.nr_sp_seqs = ''
        self.nr_total = (None, None)
        self.nr_positive = (None, None)
        self.nr_unknown = (None, None)
        self.nr_false_pos = (None, None)
        self.nr_false_neg = None
        self.nr_partial = None

        self.cc_taxo_range = ''
        self.cc_max_repeat = ''
        self.cc_site = []
        self.cc_skip_flag = ''

        self.dr_positive = []
        self.dr_false_neg = []
        self.dr_false_pos = []
        self.dr_potential = []
        self.dr_unknown = []

        self.pdb_structs = []

class PatternHit:
    """Holds information from a hit against a Prosite pattern.

    Members:
    name           ID of the record.  e.g. ADH_ZINC
    accession      e.g. PS00387
    pdoc           ID of the PROSITE DOCumentation.
    description    Free-format description.
    matches        List of tuples (start, end, sequence) where
                   start and end are indexes of the match, and sequence is
                   the sequence matched.

    """
    def __init__(self):
        self.name = None
        self.accession = None
        self.pdoc = None
        self.description = None
        self.matches = []

    def __str__(self):
        lines = []
        lines.append("%s %s %s" % (self.accession, self.pdoc, self.name))
        lines.append(self.description)
        lines.append('')
        if len(self.matches) > 1:
            lines.append("Number of matches: %s" % len(self.matches))
        for i in range(len(self.matches)):
            start, end, seq = self.matches[i]
            range_str = "%d-%d" % (start, end)
            if len(self.matches) > 1:
                lines.append("%7d %10s %s" % (i+1, range_str, seq))
            else:
                lines.append("%7s %10s %s" % (' ', range_str, seq))
        return "\n".join(lines)

class Iterator:
    """Returns one record at a time from a Prosite file.

    Methods:
    next   Return the next record from the stream, or None.

    """
    def __init__(self, handle, parser=None):
        """__init__(self, handle, parser=None)

        Create a new iterator.  handle is a file-like object.  parser
        is an optional Parser object to change the results into another form.
        If set to None, then the raw contents of the file will be returned.

        """
        import warnings
        warnings.warn("Bio.Prosite.Iterator is deprecated; we recommend using the function Bio.Prosite.parse instead. Please contact the Biopython developers at biopython-dev@biopython.org if you cannot use Bio.Prosite.parse instead of Bio.Prosite.Iterator.",
                      DeprecationWarning)
        if type(handle) is not FileType and type(handle) is not InstanceType:
            raise ValueError("I expected a file handle or file-like object")
        self._uhandle = File.UndoHandle(handle)
        self._parser = parser

    def next(self):
        """next(self) -> object

        Return the next Prosite record from the file.  If no more records,
        return None.

        """
        # Skip the copyright info, if it's the first record.
        line = self._uhandle.peekline()
        if line[:2] == 'CC':
            while 1:
                line = self._uhandle.readline()
                if not line:
                    break
                if line[:2] == '//':
                    break
                if line[:2] != 'CC':
                    raise ValueError("Oops, where's the copyright?")

        lines = []
        while 1:
            line = self._uhandle.readline()
            if not line:
                break
            lines.append(line)
            if line[:2] == '//':
                break

        if not lines:
            return None

        data = "".join(lines)
        if self._parser is not None:
            return self._parser.parse(File.StringHandle(data))
        return data

    def __iter__(self):
        return iter(self.next, None)

class Dictionary:
    """Accesses a Prosite file using a dictionary interface.

    """
    __filename_key = '__filename'

    def __init__(self, indexname, parser=None):
        """__init__(self, indexname, parser=None)

        Open a Prosite Dictionary.  indexname is the name of the
        index for the dictionary.  The index should have been created
        using the index_file function.  parser is an optional Parser
        object to change the results into another form.  If set to None,
        then the raw contents of the file will be returned.

        """
        self._index = Index.Index(indexname)
        self._handle = open(self._index[Dictionary.__filename_key])
        self._parser = parser

    def __len__(self):
        return len(self._index)

    def __getitem__(self, key):
        start, len = self._index[key]
        self._handle.seek(start)
        data = self._handle.read(len)
        if self._parser is not None:
            return self._parser.parse(File.StringHandle(data))
        return data

    def __getattr__(self, name):
        return getattr(self._index, name)

class RecordParser(AbstractParser):
    """Parses Prosite data into a Record object.

    """
    def __init__(self):
        self._scanner = _Scanner()
        self._consumer = _RecordConsumer()

    def parse(self, handle):
        self._scanner.feed(handle, self._consumer)
        return self._consumer.data

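# Sketch of the lower-level "parser = RecordParser(); parser.parse(handle)"
# pattern mentioned in the comment near the top of this module.  This is what
# read() does internally; the filename is a hypothetical example.
#
#     parser = RecordParser()
#     record = parser.parse(open("one_record.dat"))
#     print record.name, record.pattern
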
class _Scanner:
    """Scans Prosite-formatted data.

    Tested with:
    Release 15.0, July 1998

    """
    def feed(self, handle, consumer):
        """feed(self, handle, consumer)

        Feed in Prosite data for scanning.  handle is a file-like
        object that contains prosite data.  consumer is a
        Consumer object that will receive events as the report is scanned.

        """
        if isinstance(handle, File.UndoHandle):
            uhandle = handle
        else:
            uhandle = File.UndoHandle(handle)

        consumer.finished = False
        while not consumer.finished:
            line = uhandle.peekline()
            if not line:
                break
            elif is_blank_line(line):
                # Skip blank lines between records
                uhandle.readline()
                continue
            elif line[:2] == 'ID':
                self._scan_record(uhandle, consumer)
            elif line[:2] == 'CC':
                self._scan_copyrights(uhandle, consumer)
            else:
                raise ValueError("There doesn't appear to be a record")

    def _scan_copyrights(self, uhandle, consumer):
        consumer.start_copyrights()
        self._scan_line('CC', uhandle, consumer.copyright, any_number=1)
        self._scan_terminator(uhandle, consumer)
        consumer.end_copyrights()

    def _scan_record(self, uhandle, consumer):
        consumer.start_record()
        for fn in self._scan_fns:
            fn(self, uhandle, consumer)

            # In Release 15.0, C_TYPE_LECTIN_1 has the DO line before
            # the 3D lines, instead of the other way around.
            # Thus, I'll give the 3D lines another chance after the DO lines
            # are finished.
            if fn is self._scan_do.im_func:
                self._scan_3d(uhandle, consumer)
        consumer.end_record()

    def _scan_line(self, line_type, uhandle, event_fn,
                   exactly_one=None, one_or_more=None, any_number=None,
                   up_to_one=None):
        # Callers must set exactly one of exactly_one, one_or_more, or
        # any_number to a true value.  I do not explicitly check to
        # make sure this function is called correctly.

        # This does not guarantee any parameter safety, but I
        # like the readability.  The other strategy I tried was to have
        # parameters min_lines, max_lines.

        if exactly_one or one_or_more:
            read_and_call(uhandle, event_fn, start=line_type)
        if one_or_more or any_number:
            while 1:
                if not attempt_read_and_call(uhandle, event_fn,
                                             start=line_type):
                    break
        if up_to_one:
            attempt_read_and_call(uhandle, event_fn, start=line_type)

    def _scan_id(self, uhandle, consumer):
        self._scan_line('ID', uhandle, consumer.identification, exactly_one=1)

    def _scan_ac(self, uhandle, consumer):
        self._scan_line('AC', uhandle, consumer.accession, exactly_one=1)

    def _scan_dt(self, uhandle, consumer):
        self._scan_line('DT', uhandle, consumer.date, exactly_one=1)

    def _scan_de(self, uhandle, consumer):
        self._scan_line('DE', uhandle, consumer.description, exactly_one=1)

    def _scan_pa(self, uhandle, consumer):
        self._scan_line('PA', uhandle, consumer.pattern, any_number=1)

    def _scan_ma(self, uhandle, consumer):
        self._scan_line('MA', uhandle, consumer.matrix, any_number=1)
##        # ZN2_CY6_FUNGAL_2, DNAJ_2 in Release 15
##        # contain a CC line buried within an 'MA' line.  Need to check
##        # for that.
##        while 1:
##            if not attempt_read_and_call(uhandle, consumer.matrix, start='MA'):
##                line1 = uhandle.readline()
##                line2 = uhandle.readline()
##                uhandle.saveline(line2)
##                uhandle.saveline(line1)
##                if line1[:2] == 'CC' and line2[:2] == 'MA':
##                    read_and_call(uhandle, consumer.comment, start='CC')
##                else:
##                    break

    def _scan_pp(self, uhandle, consumer):
        # New PP line, PostProcessing, just after the MA line
        self._scan_line('PP', uhandle, consumer.postprocessing, any_number=1)

    def _scan_ru(self, uhandle, consumer):
        self._scan_line('RU', uhandle, consumer.rule, any_number=1)

    def _scan_nr(self, uhandle, consumer):
        self._scan_line('NR', uhandle, consumer.numerical_results,
                        any_number=1)

    def _scan_cc(self, uhandle, consumer):
        self._scan_line('CC', uhandle, consumer.comment, any_number=1)

    def _scan_dr(self, uhandle, consumer):
        self._scan_line('DR', uhandle, consumer.database_reference,
                        any_number=1)

    def _scan_3d(self, uhandle, consumer):
        self._scan_line('3D', uhandle, consumer.pdb_reference,
                        any_number=1)

    def _scan_pr(self, uhandle, consumer):
        # New PR line, ProRule, between 3D and DO lines
        self._scan_line('PR', uhandle, consumer.prorule, any_number=1)

    def _scan_do(self, uhandle, consumer):
        self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)

    def _scan_terminator(self, uhandle, consumer):
        self._scan_line('//', uhandle, consumer.terminator, exactly_one=1)

    # This is a list of scan functions in the order expected in the file.
    # The function definitions define how many times each line type is
    # expected (or whether it is optional):
    _scan_fns = [
        _scan_id,
        _scan_ac,
        _scan_dt,
        _scan_de,
        _scan_pa,
        _scan_ma,
        _scan_pp,
        _scan_ru,
        _scan_nr,
        _scan_cc,

        # This is a really dirty hack, and should be fixed properly at
        # some point.  ZN2_CY6_FUNGAL_2, DNAJ_2 in Rel 15 and PS50309
        # in Rel 17 have lines out of order.  Thus, I have to rescan
        # these, which decreases performance.
        _scan_ma,
        _scan_nr,
        _scan_cc,

        _scan_dr,
        _scan_3d,
        _scan_pr,
        _scan_do,
        _scan_terminator
    ]

class _RecordConsumer(AbstractConsumer):
    """Consumer that converts a Prosite record to a Record object.

    Members:
    data    Record with Prosite data.

    """
    def __init__(self):
        self.data = None

    def start_record(self):
        self.data = Record()

    def end_record(self):
        self._clean_record(self.data)

    def identification(self, line):
        cols = line.split()
        if len(cols) != 3:
            raise ValueError("I don't understand identification line\n%s" \
                             % line)
        self.data.name = self._chomp(cols[1])    # don't want ';'
        self.data.type = self._chomp(cols[2])    # don't want '.'

    def accession(self, line):
        cols = line.split()
        if len(cols) != 2:
            raise ValueError("I don't understand accession line\n%s" % line)
        self.data.accession = self._chomp(cols[1])

    def date(self, line):
        uprline = line.upper()
        cols = uprline.split()

        # Release 15.0 contains both 'INFO UPDATE' and 'INF UPDATE'
        if cols[2] != '(CREATED);' or \
           cols[4] != '(DATA' or cols[5] != 'UPDATE);' or \
           cols[7][:4] != '(INF' or cols[8] != 'UPDATE).':
            raise ValueError("I don't understand date line\n%s" % line)

        self.data.created = cols[1]
        self.data.data_update = cols[3]
        self.data.info_update = cols[6]

    def description(self, line):
        self.data.description = self._clean(line)

    def pattern(self, line):
        self.data.pattern = self.data.pattern + self._clean(line)

    def matrix(self, line):
        self.data.matrix.append(self._clean(line))

    def postprocessing(self, line):
        postprocessing = self._clean(line).split(";")
        self.data.postprocessing.extend(postprocessing)

    def rule(self, line):
        self.data.rules.append(self._clean(line))

    def numerical_results(self, line):
        cols = self._clean(line).split(";")
        for col in cols:
            if not col:
                continue
            qual, data = [word.lstrip() for word in col.split("=")]
            if qual == '/RELEASE':
                release, seqs = data.split(",")
                self.data.nr_sp_release = release
                self.data.nr_sp_seqs = int(seqs)
            elif qual == '/FALSE_NEG':
                self.data.nr_false_neg = int(data)
            elif qual == '/PARTIAL':
                self.data.nr_partial = int(data)
            elif qual in ['/TOTAL', '/POSITIVE', '/UNKNOWN', '/FALSE_POS']:
                m = re.match(r'(\d+)\((\d+)\)', data)
                if not m:
                    raise Exception("Broken data %s in comment line\n%s" \
                                    % (repr(data), line))
                hits = tuple(map(int, m.groups()))
                if(qual == "/TOTAL"):
                    self.data.nr_total = hits
                elif(qual == "/POSITIVE"):
                    self.data.nr_positive = hits
                elif(qual == "/UNKNOWN"):
                    self.data.nr_unknown = hits
                elif(qual == "/FALSE_POS"):
                    self.data.nr_false_pos = hits
            else:
                raise ValueError("Unknown qual %s in comment line\n%s" \
                                 % (repr(qual), line))

    def comment(self, line):
        # Expect CC lines like this:
        # CC   /TAXO-RANGE=??EPV; /MAX-REPEAT=2;
        # Can (normally) split on ";" and then on "="
        cols = self._clean(line).split(";")
        for col in cols:
            if not col or col[:17] == 'Automatic scaling':
                # DNAJ_2 in Release 15 has a non-standard comment line:
                # CC   Automatic scaling using reversed database
                # Throw it away.  (Should I keep it?)
                continue
            if col.count("=") == 0 :
                # Missing qualifier!  Can we recover gracefully?
                # For example, from Bug 2403, in PS50293 have:
                # CC /AUTHOR=K_Hofmann; N_Hulo
                continue
            qual, data = [word.lstrip() for word in col.split("=")]
            if qual == '/TAXO-RANGE':
                self.data.cc_taxo_range = data
            elif qual == '/MAX-REPEAT':
                self.data.cc_max_repeat = data
            elif qual == '/SITE':
                pos, desc = data.split(",")
                self.data.cc_site.append((int(pos), desc))
            elif qual == '/SKIP-FLAG':
                self.data.cc_skip_flag = data
            elif qual == '/MATRIX_TYPE':
                self.data.cc_matrix_type = data
            elif qual == '/SCALING_DB':
                self.data.cc_scaling_db = data
            elif qual == '/AUTHOR':
                self.data.cc_author = data
            elif qual == '/FT_KEY':
                self.data.cc_ft_key = data
            elif qual == '/FT_DESC':
                self.data.cc_ft_desc = data
            elif qual == '/VERSION':
                self.data.cc_version = data
            else:
                raise ValueError("Unknown qual %s in comment line\n%s" \
                                 % (repr(qual), line))

    def database_reference(self, line):
        refs = self._clean(line).split(";")
        for ref in refs:
            if not ref:
                continue
            acc, name, type = [word.strip() for word in ref.split(",")]
            if type == 'T':
                self.data.dr_positive.append((acc, name))
            elif type == 'F':
                self.data.dr_false_pos.append((acc, name))
            elif type == 'N':
                self.data.dr_false_neg.append((acc, name))
            elif type == 'P':
                self.data.dr_potential.append((acc, name))
            elif type == '?':
                self.data.dr_unknown.append((acc, name))
            else:
                raise ValueError("I don't understand type flag %s" % type)

    def pdb_reference(self, line):
        cols = line.split()
        for id in cols[1:]:  # get all but the '3D' col
            self.data.pdb_structs.append(self._chomp(id))

    def prorule(self, line):
        # Assume that each PR line can contain multiple ";" separated rules
        rules = self._clean(line).split(";")
        self.data.prorules.extend(rules)

    def documentation(self, line):
        self.data.pdoc = self._chomp(self._clean(line))

    def terminator(self, line):
        self.finished = True

    def _chomp(self, word, to_chomp='.,;'):
        # Remove the punctuation at the end of a word.
        if word[-1] in to_chomp:
            return word[:-1]
        return word

    def _clean(self, line, rstrip=1):
        # Clean up a line.
        if rstrip:
            return line[5:].rstrip()
        return line[5:]

def scan_sequence_expasy(seq=None, id=None, exclude_frequent=None):
    """scan_sequence_expasy(seq=None, id=None, exclude_frequent=None) ->
    list of PatternHit's

    Search a sequence for occurrences of Prosite patterns.  You can
    specify either a sequence in seq or a SwissProt/trEMBL ID or accession
    in id.  Only one of those should be given.  If exclude_frequent
    is true, then patterns with a high probability of occurring
    will be excluded.

    """
    from Bio import ExPASy
    if (seq and id) or not (seq or id):
        raise ValueError("Please specify either a sequence or an id")
    handle = ExPASy.scanprosite1(seq, id, exclude_frequent)
    return _extract_pattern_hits(handle)

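# Usage sketch (assumes network access to ExPASy; the sequence shown is an
# arbitrary example, not taken from the original documentation):
#
#     hits = scan_sequence_expasy(seq="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
#     for hit in hits:
#         print hit.accession, hit.name
#         for start, end, matched in hit.matches:
#             print "  %d-%d %s" % (start, end, matched)
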
def _extract_pattern_hits(handle):
    """_extract_pattern_hits(handle) -> list of PatternHit's

    Extract hits from a web page.  Raises a ValueError if there
    was an error in the query.

    """
    class parser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.hits = []
            self.broken_message = 'Some error occurred'
            self._in_pre = 0
            self._current_hit = None
            self._last_found = None   # Save state of parsing
        def handle_data(self, data):
            if data.find('try again') >= 0:
                self.broken_message = data
                return
            elif data == 'illegal':
                self.broken_message = 'Sequence contains illegal characters'
                return
            if not self._in_pre:
                return
            elif not data.strip():
                return
            if self._last_found is None and data[:4] == 'PDOC':
                self._current_hit.pdoc = data
                self._last_found = 'pdoc'
            elif self._last_found == 'pdoc':
                if data[:2] != 'PS':
                    raise ValueError("Expected accession but got:\n%s" % data)
                self._current_hit.accession = data
                self._last_found = 'accession'
            elif self._last_found == 'accession':
                self._current_hit.name = data
                self._last_found = 'name'
            elif self._last_found == 'name':
                self._current_hit.description = data
                self._last_found = 'description'
            elif self._last_found == 'description':
                m = re.findall(r'(\d+)-(\d+) (\w+)', data)
                for start, end, seq in m:
                    self._current_hit.matches.append(
                        (int(start), int(end), seq))

        def do_hr(self, attrs):
            # <HR> inside a <PRE> section means a new hit.
            if self._in_pre:
                self._current_hit = PatternHit()
                self.hits.append(self._current_hit)
                self._last_found = None
        def start_pre(self, attrs):
            self._in_pre = 1
            self.broken_message = None   # Probably not broken
        def end_pre(self):
            self._in_pre = 0
    p = parser()
    p.feed(handle.read())
    if p.broken_message:
        raise ValueError(p.broken_message)
    return p.hits


def index_file(filename, indexname, rec2key=None):
    """index_file(filename, indexname, rec2key=None)

    Index a Prosite file.  filename is the name of the file.
    indexname is the name of the dictionary.  rec2key is an
    optional callback that takes a Record and generates a unique key
    (e.g. the accession number) for the record.  If not specified,
    the id name will be used.

    """
    import os
    if not os.path.exists(filename):
        raise ValueError("%s does not exist" % filename)

    index = Index.Index(indexname, truncate=1)
    index[Dictionary._Dictionary__filename_key] = filename

    handle = open(filename)
    records = parse(handle)
    end = 0L
    for record in records:
        start = end
        end = long(handle.tell())
        length = end - start

        if rec2key is not None:
            key = rec2key(record)
        else:
            key = record.name

        if not key:
            raise KeyError("empty key was produced")
        elif key in index:
            raise KeyError("duplicate key %s found" % key)

        index[key] = start, length

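# Sketch of building and querying an index (file names are illustrative
# assumptions; "ADH_ZINC" is the example ID used in the Record docstring):
#
#     index_file("prosite.dat", "prosite.idx")
#     d = Dictionary("prosite.idx", parser=RecordParser())
#     record = d["ADH_ZINC"]
#     print record.accession, record.pdoc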