Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

  1  # Copyright 2006-2009 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5  # 
  6  #Nice link: 
  7  # http://www.ebi.ac.uk/help/formats_frame.html 
  8   
  9  """Sequence input/output as SeqRecord objects. 
 10   
 11  Bio.SeqIO is also documented at U{http://biopython.org/wiki/SeqIO} and by 
 12  a whole chapter in our tutorial: 
 13   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html} 
 14   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}   
 15   
 16  Input 
 17  ===== 
 18  The main function is Bio.SeqIO.parse(...) which takes an input file handle, 
 19  and format string.  This returns an iterator giving SeqRecord objects: 
 20   
 21      >>> from Bio import SeqIO 
 22      >>> handle = open("Fasta/f002", "rU") 
 23      >>> for record in SeqIO.parse(handle, "fasta") : 
 24      ...     print record.id, len(record) 
 25      gi|1348912|gb|G26680|G26680 633 
 26      gi|1348917|gb|G26685|G26685 413 
 27      gi|1592936|gb|G29385|G29385 471 
 28      >>> handle.close() 
 29   
 30  Note that the parse() function will all invoke the relevant parser for the 
 31  format with its default settings.  You may want more control, in which case 
 32  you need to create a format specific sequence iterator directly. 
 33   
 34  Input - Single Records 
 35  ====================== 
 36  If you expect your file to contain one-and-only-one record, then we provide 
 37  the following 'helper' function which will return a single SeqRecord, or 
 38  raise an exception if there are no records or more than one record: 
 39   
 40      >>> from Bio import SeqIO 
 41      >>> handle = open("Fasta/f001", "rU") 
 42      >>> record = SeqIO.read(handle, "fasta") 
 43      >>> handle.close() 
 44      >>> print record.id, len(record) 
 45      gi|3318709|pdb|1A91| 79 
 46   
 47  This style is useful when you expect a single record only (and would 
 48  consider multiple records an error).  For example, when dealing with GenBank 
 49  files for bacterial genomes or chromosomes, there is normally only a single 
 50  record.  Alternatively, use this with a handle when download a single record 
 51  from the internet. 
 52   
 53  However, if you just want the first record from a file containing multiple 
 54  record, use the iterator's next() method: 
 55   
 56      >>> from Bio import SeqIO 
 57      >>> handle = open("Fasta/f002", "rU") 
 58      >>> record = SeqIO.parse(handle, "fasta").next() 
 59      >>> handle.close() 
 60      >>> print record.id, len(record) 
 61      gi|1348912|gb|G26680|G26680 633 
 62   
 63  The above code will work as long as the file contains at least one record. 
 64  Note that if there is more than one record, the remaining records will be 
 65  silently ignored. 
 66   
 67   
 68  Input - Multiple Records 
 69  ======================== 
 70  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
 71  using a sequence iterator can save you a lot of memory (RAM).  There is 
 72  less benefit for interlaced file formats (e.g. most multiple alignment file 
 73  formats).  However, an iterator only lets you access the records one by one. 
 74   
 75  If you want random access to the records by number, turn this into a list: 
 76   
 77      >>> from Bio import SeqIO 
 78      >>> handle = open("Fasta/f002", "rU") 
 79      >>> records = list(SeqIO.parse(handle, "fasta")) 
 80      >>> handle.close() 
 81      >>> print records[1].id 
 82      gi|1348917|gb|G26685|G26685 
 83   
 84  If you want random access to the records by a key such as the record id, 
 85  turn the iterator into a dictionary: 
 86   
 87      >>> from Bio import SeqIO 
 88      >>> handle = open("Fasta/f002", "rU") 
 89      >>> record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta")) 
 90      >>> handle.close() 
 91      >>> len(record_dict) 
 92      3 
 93      >>> print len(record_dict["gi|1348917|gb|G26685|G26685"]) 
 94      413 
 95   
 96  However, using list() or the to_dict() function will load all the records 
 97  into memory at once, and therefore is not possible on very large files. 
 98  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
 99  providing dictionary like access to any record. For example, 
100   
101      >>> from Bio import SeqIO 
102      >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
103      >>> len(record_dict) 
104      3 
105      >>> print len(record_dict["gi|1348917|gb|G26685|G26685"]) 
106      413 
107   
108  Many but not all of the supported input file formats can be indexed like 
109  this. For example "ace", "embl", "fasta", "fastq", "genbank", "ig", "phd", 
110  "pir", "tab" and "qual" work, but alignment formats like "phylip", "clustalw" 
111  and "nexus" will not. 
112   
113  Input - Alignments 
114  ================== 
115  You can read in alignment files as Alignment objects using Bio.AlignIO. 
116  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
117  you a SeqRecord for each row of each alignment: 
118   
119      >>> from Bio import SeqIO 
120      >>> handle = open("Clustalw/hedgehog.aln", "rU") 
121      >>> for record in SeqIO.parse(handle, "clustal") : 
122      ...     print record.id, len(record) 
123      gi|167877390|gb|EDS40773.1| 447 
124      gi|167234445|ref|NP_001107837. 447 
125      gi|74100009|gb|AAZ99217.1| 447 
126      gi|13990994|dbj|BAA33523.2| 447 
127      gi|56122354|gb|AAV74328.1| 447 
128      >>> handle.close() 
129   
130  Output 
131  ====== 
132  Use the function Bio.SeqIO.write(...), which takes a complete set of 
133  SeqRecord objects (either as a list, or an iterator), an output file handle 
134  and of course the file format:: 
135   
136      from Bio import SeqIO 
137      records = ... 
138      handle = open("example.faa", "w") 
139      SeqIO.write(records, handle, "fasta") 
140      handle.close() 
141   
142  In general, you are expected to call this function once (with all your 
143  records) and then close the file handle. 
144   
145  Output - Advanced 
146  ================= 
147  The effect of calling write() multiple times on a single file will vary 
148  depending on the file format, and is best avoided unless you have a strong 
149  reason to do so. 
150   
151  Trying this for certain alignment formats (e.g. phylip, clustal, stockholm) 
152  would have the effect of concatenating several multiple sequence alignments 
153  together.  Such files are created by the PHYLIP suite of programs for 
154  bootstrap analysis. 
155   
156  For sequential files formats (e.g. fasta, genbank) each "record block" holds 
157  a single sequence.  For these files it would probably be safe to call 
158  write() multiple times. 
159   
160  Conversion 
161  ========== 
162  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
163  file format conversions. Additionally, it may use file format specific 
164  optimisations so this should be the fastest way too. 
165   
166  In general however, you can combine the Bio.SeqIO.parse(...) function with the 
167  Bio.SeqIO.write(...) function for sequence file conversion. Using generator 
168  expressions provides a memory efficient way to perform filtering or other 
169  extra operations as part of the process. 
170   
171  File Formats 
172  ============ 
173  When specifying the file format, use lowercase strings.  The same format 
174  names are also used in Bio.AlignIO and include the following: 
175   
176   - ace     - Reads the contig sequences from an ACE assembly file. 
177   - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
178   - fasta   - The generic sequence file format where each record starts with 
179               an identifer line starting with a ">" character, followed by 
180               lines of sequence. 
181   - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
182               sequence quality values (with an ASCII offset of 33). 
183   - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
184   - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
185               encodes Solexa quality scores (not PHRED quality scores) with an 
186               ASCII offset of 64. 
187   - fastq-illumina - Solexa/Illumnia 1.3+ variant of the FASTQ format which 
188               encodes PHRED quality scores with an ASCII offset of 64 (not 33). 
189   - genbank - The GenBank or GenPept flat file format. 
190   - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
191   - ig      - The IntelliGenetics file format, apparently the same as the 
192               MASE alignment format. 
193   - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
194   - pir     - A "FASTA like" format introduced by the National Biomedical 
195               Research Foundation (NBRF) for the Protein Information Resource 
196               (PIR) database, now part of UniProt. 
197   - swiss   - Plain text Swiss-Prot aka UniProt format. 
198   - tab     - Simple two column tab separated sequence files, where each 
199               line holds a record's identifier and sequence. For example, 
200               this is used as by Aligent's eArray software when saving 
201               microarray probes in a minimal tab delimited text file. 
202   - qual    - A "FASTA like" format holding PHRED quality values from 
203               sequencing DNA, but no actual sequences (usually provided 
204               in separate FASTA files). 
205   
206  Note that while Bio.SeqIO can read all the above file formats, it cannot 
207  write to all of them. 
208   
209  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
210  "phlip" and "stockholm", which gives you access to the individual sequences 
211  making up each alignment as SeqRecords. 
212  """ 
213  __docformat__ = "epytext en" #not just plaintext 
214   
215  #TODO 
216  # - define policy on reading aligned sequences with gaps in 
217  #   (e.g. - and . characters) including how the alphabet interacts 
218  # 
219  # - Can we build the to_alignment(...) functionality 
220  #   into the generic Alignment class instead? 
221  # 
222  # - How best to handle unique/non unique record.id when writing. 
223  #   For most file formats reading such files is fine; The stockholm 
224  #   parser would fail. 
225  # 
226  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
227  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format  
228   
229  """ 
230  FAO BioPython Developers 
231  ======================== 
232  The way I envision this SeqIO system working as that for any sequence file 
233  format we have an iterator that returns SeqRecord objects. 
234   
235  This also applies to interlaced fileformats (like clustal - although that 
236  is now handled via Bio.AlignIO instead) where the file cannot be read record 
237  by record.  You should still return an iterator, even if the implementation 
238  could just as easily return a list. 
239   
240  These file format specific sequence iterators may be implemented as: 
241  * Classes which take a handle for __init__ and provide the __iter__ method 
242  * Functions that take a handle, and return an iterator object 
243  * Generator functions that take a handle, and yield SeqRecord objects 
244   
245  It is then trivial to turn this iterator into a list of SeqRecord objects, 
246  an in memory dictionary, or a multiple sequence alignment object. 
247   
248  For building the dictionary by default the id propery of each SeqRecord is 
249  used as the key.  You should always populate the id property, and it should 
250  be unique in most cases. For some file formats the accession number is a good 
251  choice.  If the file itself contains ambiguous identifiers, don't try and 
252  dis-ambiguate them - return them as is. 
253   
254  When adding a new file format, please use the same lower case format name 
255  as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
256   
257  See also http://biopython.org/wiki/SeqIO_dev 
258   
259  --Peter 
260  """ 
261   
262  import os 
263  from Bio.Seq import Seq 
264  from Bio.SeqRecord import SeqRecord 
265  from Bio.Align.Generic import Alignment 
266  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
267   
268  import AceIO 
269  import FastaIO 
270  import IgIO #IntelliGenetics or MASE format 
271  import InsdcIO #EMBL and GenBank 
272  import PhdIO 
273  import PirIO 
274  import SwissIO 
275  import TabIO 
276  import QualityIO #FastQ and qual files 
277   
278   
279  #Convention for format names is "mainname-subtype" in lower case. 
280  #Please use the same names as BioPerl or EMBOSS where possible. 
281  # 
282  #Note that this simple system copes with defining 
283  #multiple possible iterators for a given format/extension 
284  #with the -subtype suffix 
285  # 
286  #Most alignment file formats will be handled via Bio.AlignIO 
287   
288  _FormatToIterator ={"fasta" : FastaIO.FastaIterator, 
289                      "gb" : InsdcIO.GenBankIterator, 
290                      "genbank" : InsdcIO.GenBankIterator, 
291                      "genbank-cds" : InsdcIO.GenBankCdsFeatureIterator, 
292                      "embl" : InsdcIO.EmblIterator, 
293                      "embl-cds" : InsdcIO.EmblCdsFeatureIterator, 
294                      "ig" : IgIO.IgIterator, 
295                      "swiss" : SwissIO.SwissIterator, 
296                      "phd" : PhdIO.PhdIterator, 
297                      "ace" : AceIO.AceIterator, 
298                      "tab" : TabIO.TabIterator, 
299                      "pir" : PirIO.PirIterator, 
300                      "fastq" : QualityIO.FastqPhredIterator, 
301                      "fastq-sanger" : QualityIO.FastqPhredIterator, 
302                      "fastq-solexa" : QualityIO.FastqSolexaIterator, 
303                      "fastq-illumina" : QualityIO.FastqIlluminaIterator, 
304                      "qual" : QualityIO.QualPhredIterator, 
305                      } 
306   
307  _FormatToWriter ={"fasta" : FastaIO.FastaWriter, 
308                    "gb" : InsdcIO.GenBankWriter, 
309                    "genbank" : InsdcIO.GenBankWriter, 
310                    "tab" : TabIO.TabWriter, 
311                    "fastq" : QualityIO.FastqPhredWriter, 
312                    "fastq-sanger" : QualityIO.FastqPhredWriter, 
313                    "fastq-solexa" : QualityIO.FastqSolexaWriter, 
314                    "fastq-illumina" : QualityIO.FastqIlluminaWriter, 
315                    "phd" : PhdIO.PhdWriter, 
316                    "qual" : QualityIO.QualPhredWriter, 
317                    } 
318   
319 -def write(sequences, handle, format) :
320 """Write complete set of sequences to a file. 321 322 - sequences - A list (or iterator) of SeqRecord objects. 323 - handle - File handle object to write to. 324 - format - lower case string describing the file format to write. 325 326 You should close the handle after calling this function. 327 328 Returns the number of records written (as an integer). 329 """ 330 from Bio import AlignIO 331 332 #Try and give helpful error messages: 333 if isinstance(handle, basestring) : 334 raise TypeError("Need a file handle, not a string (i.e. not a filename)") 335 if not isinstance(format, basestring) : 336 raise TypeError("Need a string for the file format (lower case)") 337 if not format : 338 raise ValueError("Format required (lower case string)") 339 if format != format.lower() : 340 raise ValueError("Format string '%s' should be lower case" % format) 341 if isinstance(sequences,SeqRecord): 342 raise ValueError("Use a SeqRecord list/iterator, not just a single SeqRecord") 343 344 #Map the file format to a writer class 345 if format in _FormatToWriter : 346 writer_class = _FormatToWriter[format] 347 count = writer_class(handle).write_file(sequences) 348 elif format in AlignIO._FormatToWriter : 349 #Try and turn all the records into a single alignment, 350 #and write that using Bio.AlignIO 351 alignment = to_alignment(sequences) 352 alignment_count = AlignIO.write([alignment], handle, format) 353 assert alignment_count == 1, "Internal error - the underlying writer " \ 354 + " should have returned 1, not %s" % repr(alignment_count) 355 count = len(alignment.get_all_seqs()) 356 del alignment_count, alignment 357 elif format in _FormatToIterator or format in AlignIO._FormatToIterator : 358 raise ValueError("Reading format '%s' is supported, but not writing" \ 359 % format) 360 else : 361 raise ValueError("Unknown format '%s'" % format) 362 363 assert isinstance(count, int), "Internal error - the underlying %s " \ 364 "writer should have returned the record count, not %s" \ 365 % (format, repr(count)) 366 return count
367
368 -def parse(handle, format, alphabet=None) :
369 r"""Turns a sequence file into an iterator returning SeqRecords. 370 371 - handle - handle to the file. 372 - format - lower case string describing the file format. 373 - alphabet - optional Alphabet object, useful when the sequence type 374 cannot be automatically inferred from the file itself 375 (e.g. format="fasta" or "tab") 376 377 Typical usage, opening a file to read in, and looping over the record(s): 378 379 >>> from Bio import SeqIO 380 >>> filename = "Nucleic/sweetpea.nu" 381 >>> for record in SeqIO.parse(open(filename,"rU"), "fasta") : 382 ... print "ID", record.id 383 ... print "Sequence length", len(record) 384 ... print "Sequence alphabet", record.seq.alphabet 385 ID gi|3176602|gb|U78617.1|LOU78617 386 Sequence length 309 387 Sequence alphabet SingleLetterAlphabet() 388 389 For file formats like FASTA where the alphabet cannot be determined, it 390 may be useful to specify the alphabet explicitly: 391 392 >>> from Bio import SeqIO 393 >>> from Bio.Alphabet import generic_dna 394 >>> filename = "Nucleic/sweetpea.nu" 395 >>> for record in SeqIO.parse(open(filename,"rU"), "fasta", generic_dna) : 396 ... print "ID", record.id 397 ... print "Sequence length", len(record) 398 ... print "Sequence alphabet", record.seq.alphabet 399 ID gi|3176602|gb|U78617.1|LOU78617 400 Sequence length 309 401 Sequence alphabet DNAAlphabet() 402 403 If you have a string 'data' containing the file contents, you must 404 first turn this into a handle in order to parse it: 405 406 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 407 >>> from Bio import SeqIO 408 >>> from StringIO import StringIO 409 >>> for record in SeqIO.parse(StringIO(data), "fasta") : 410 ... print record.id, record.seq 411 Alpha ACCGGATGTA 412 Beta AGGCTCGGTTA 413 414 Use the Bio.SeqIO.read(handle, format) function when you expect a single 415 record only. 416 """ 417 #NOTE - The above docstring has some raw \n characters needed 418 #for the StringIO example, hense the whole docstring is in raw 419 #string mode (see the leading r before the opening quote). 420 from Bio import AlignIO 421 422 #Try and give helpful error messages: 423 if isinstance(handle, basestring) : 424 raise TypeError("Need a file handle, not a string (i.e. not a filename)") 425 if not isinstance(format, basestring) : 426 raise TypeError("Need a string for the file format (lower case)") 427 if not format : 428 raise ValueError("Format required (lower case string)") 429 if format != format.lower() : 430 raise ValueError("Format string '%s' should be lower case" % format) 431 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \ 432 isinstance(alphabet, AlphabetEncoder)) : 433 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 434 435 #Map the file format to a sequence iterator: 436 if format in _FormatToIterator : 437 iterator_generator = _FormatToIterator[format] 438 if alphabet is None : 439 return iterator_generator(handle) 440 try : 441 return iterator_generator(handle, alphabet=alphabet) 442 except : 443 return _force_alphabet(iterator_generator(handle), alphabet) 444 elif format in AlignIO._FormatToIterator : 445 #Use Bio.AlignIO to read in the alignments 446 #TODO - Can this helper function can be replaced with a generator expression, 447 #or something from itertools? 448 return _iterate_via_AlignIO(handle, format, alphabet) 449 else : 450 raise ValueError("Unknown format '%s'" % format)
451 452 #This is a generator function
453 -def _iterate_via_AlignIO(handle, format, alphabet) :
454 """Iterate over all records in several alignments (PRIVATE).""" 455 from Bio import AlignIO 456 for align in AlignIO.parse(handle, format, alphabet=alphabet) : 457 for record in align : 458 yield record
459
460 -def _force_alphabet(record_iterator, alphabet) :
461 """Iterate over records, over-riding the alphabet (PRIVATE).""" 462 #Assume the alphabet argument has been pre-validated 463 given_base_class = _get_base_alphabet(alphabet).__class__ 464 for record in record_iterator : 465 if isinstance(_get_base_alphabet(record.seq.alphabet), 466 given_base_class) : 467 record.seq.alphabet = alphabet 468 yield record 469 else : 470 raise ValueError("Specified alphabet %s clashes with "\ 471 "that determined from the file, %s" \ 472 % (repr(alphabet), repr(record.seq.alphabet)))
473
474 -def read(handle, format, alphabet=None) :
475 """Turns a sequence file into a single SeqRecord. 476 477 - handle - handle to the file. 478 - format - string describing the file format. 479 - alphabet - optional Alphabet object, useful when the sequence type 480 cannot be automatically inferred from the file itself 481 (e.g. format="fasta" or "tab") 482 483 This function is for use parsing sequence files containing 484 exactly one record. For example, reading a GenBank file: 485 486 >>> from Bio import SeqIO 487 >>> record = SeqIO.read(open("GenBank/arab1.gb", "rU"), "genbank") 488 >>> print "ID", record.id 489 ID AC007323.5 490 >>> print "Sequence length", len(record) 491 Sequence length 86436 492 >>> print "Sequence alphabet", record.seq.alphabet 493 Sequence alphabet IUPACAmbiguousDNA() 494 495 If the handle contains no records, or more than one record, 496 an exception is raised. For example: 497 498 >>> from Bio import SeqIO 499 >>> record = SeqIO.read(open("GenBank/cor6_6.gb", "rU"), "genbank") 500 Traceback (most recent call last): 501 ... 502 ValueError: More than one record found in handle 503 504 If however you want the first record from a file containing 505 multiple records this function would raise an exception (as 506 shown in the example above). Instead use: 507 508 >>> from Bio import SeqIO 509 >>> record = SeqIO.parse(open("GenBank/cor6_6.gb", "rU"), "genbank").next() 510 >>> print "First record's ID", record.id 511 First record's ID X55053.1 512 513 Use the Bio.SeqIO.parse(handle, format) function if you want 514 to read multiple records from the handle. 515 """ 516 iterator = parse(handle, format, alphabet) 517 try : 518 first = iterator.next() 519 except StopIteration : 520 first = None 521 if first is None : 522 raise ValueError("No records found in handle") 523 try : 524 second = iterator.next() 525 except StopIteration : 526 second = None 527 if second is not None : 528 raise ValueError("More than one record found in handle") 529 return first
530
531 -def to_dict(sequences, key_function=None) :
532 """Turns a sequence iterator or list into a dictionary. 533 534 - sequences - An iterator that returns SeqRecord objects, 535 or simply a list of SeqRecord objects. 536 - key_function - Optional callback function which when given a 537 SeqRecord should return a unique key for the dictionary. 538 539 e.g. key_function = lambda rec : rec.name 540 or, key_function = lambda rec : rec.description.split()[0] 541 542 If key_function is ommitted then record.id is used, on the assumption 543 that the records objects returned are SeqRecords with a unique id. 544 545 If there are duplicate keys, an error is raised. 546 547 Example usage, defaulting to using the record.id as key: 548 549 >>> from Bio import SeqIO 550 >>> handle = open("GenBank/cor6_6.gb", "rU") 551 >>> format = "genbank" 552 >>> id_dict = SeqIO.to_dict(SeqIO.parse(handle, format)) 553 >>> print sorted(id_dict.keys()) 554 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 555 >>> print id_dict["L31939.1"].description 556 Brassica rapa (clone bif72) kin mRNA, complete cds. 557 558 A more complex example, using the key_function argument in order to 559 use a sequence checksum as the dictionary key: 560 561 >>> from Bio import SeqIO 562 >>> from Bio.SeqUtils.CheckSum import seguid 563 >>> handle = open("GenBank/cor6_6.gb", "rU") 564 >>> format = "genbank" 565 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(handle, format), 566 ... key_function = lambda rec : seguid(rec.seq)) 567 >>> for key, record in sorted(seguid_dict.iteritems()) : 568 ... print key, record.id 569 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 570 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 571 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 572 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 573 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 574 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 575 576 This approach is not suitable for very large sets of sequences, as all 577 the SeqRecord objects are held in memory. Instead, consider using the 578 Bio.SeqIO.index() function (if it supports your particular file format). 579 """ 580 if key_function is None : 581 key_function = lambda rec : rec.id 582 583 d = dict() 584 for record in sequences : 585 key = key_function(record) 586 if key in d : 587 raise ValueError("Duplicate key '%s'" % key) 588 d[key] = record 589 return d
590
591 -def index(filename, format, alphabet=None, key_function=None) :
592 """Indexes a sequence file and returns a dictionary like object. 593 594 - filename - string giving name of file to be indexed 595 - format - lower case string describing the file format 596 - alphabet - optional Alphabet object, useful when the sequence type 597 cannot be automatically inferred from the file itself 598 (e.g. format="fasta" or "tab") 599 - key_function - Optional callback function which when given a 600 SeqRecord identifier string should return a unique 601 key for the dictionary. 602 603 This indexing function will return a dictionary like object, giving the 604 SeqRecord objects as values: 605 606 >>> from Bio import SeqIO 607 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 608 >>> len(records) 609 3 610 >>> sorted(records.keys()) 611 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 612 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta") 613 >EAS54_6_R1_2_1_540_792 614 TTGGCAGGCCAAGGCCGATGGATCA 615 <BLANKLINE> 616 >>> "EAS54_6_R1_2_1_540_792" in records 617 True 618 >>> print records.get("Missing", None) 619 None 620 621 Note that this psuedo dictionary will not support all the methods of a 622 true Python dictionary, for example values() is not defined since this 623 would require loading all of the records into memory at once. 624 625 When you call the index function, it will scan through the file, noting 626 the location of each record. When you access a particular record via the 627 dictionary methods, the code will jump to the appropriate part of the 628 file and then parse that section into a SeqRecord. 629 630 Note that not all the input formats supported by Bio.SeqIO can be used 631 with this index function. It is designed to work only with sequential 632 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 633 interlaced file format (e.g. alignment formats such as "clustal"). 634 635 For small files, it may be more efficient to use an in memory Python 636 dictionary, e.g. 637 638 >>> from Bio import SeqIO 639 >>> records = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fastq"), "fastq")) 640 >>> len(records) 641 3 642 >>> sorted(records.keys()) 643 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 644 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta") 645 >EAS54_6_R1_2_1_540_792 646 TTGGCAGGCCAAGGCCGATGGATCA 647 <BLANKLINE> 648 649 As with the to_dict() function, by default the id string of each record 650 is used as the key. You can specify a callback function to transform 651 this (the record identifier string) into your prefered key. For example: 652 653 >>> from Bio import SeqIO 654 >>> def make_tuple(identifier) : 655 ... parts = identifier.split("_") 656 ... return int(parts[-2]), int(parts[-1]) 657 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 658 ... key_function=make_tuple) 659 >>> len(records) 660 3 661 >>> sorted(records.keys()) 662 [(413, 324), (443, 348), (540, 792)] 663 >>> print records[(540, 792)].format("fasta") 664 >EAS54_6_R1_2_1_540_792 665 TTGGCAGGCCAAGGCCGATGGATCA 666 <BLANKLINE> 667 >>> (540, 792) in records 668 True 669 >>> "EAS54_6_R1_2_1_540_792" in records 670 False 671 >>> print records.get("Missing", None) 672 None 673 674 Another common use case would be indexing an NCBI style FASTA file, 675 where you might want to extract the GI number from the FASTA identifer 676 to use as the dictionary key. 677 678 Notice that unlike the to_dict() function, here the key_function does 679 not get given the full SeqRecord to use to generate the key. Doing so 680 would impose a severe performance penalty as it would require the file 681 to be completely parsed while building the index. Right now this is 682 usually avoided. 683 """ 684 #Try and give helpful error messages: 685 if not isinstance(filename, basestring) : 686 raise TypeError("Need a filename (not a handle)") 687 if not isinstance(format, basestring) : 688 raise TypeError("Need a string for the file format (lower case)") 689 if not format : 690 raise ValueError("Format required (lower case string)") 691 if format != format.lower() : 692 raise ValueError("Format string '%s' should be lower case" % format) 693 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \ 694 isinstance(alphabet, AlphabetEncoder)) : 695 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 696 697 #Map the file format to a sequence iterator: 698 import _index #Lazy import 699 try : 700 indexer = _index._FormatToIndexedDict[format] 701 except KeyError : 702 raise ValueError("Unsupported format '%s'" % format) 703 return indexer(filename, alphabet, key_function)
704
705 -def to_alignment(sequences, alphabet=None, strict=True) :
706 """Returns a multiple sequence alignment (OBSOLETE). 707 708 - sequences -An iterator that returns SeqRecord objects, 709 or simply a list of SeqRecord objects. All 710 the record sequences must be the same length. 711 - alphabet - Optional alphabet. Stongly recommended. 712 - strict - Optional, defaults to True. Should error checking 713 be done? 714 715 Using this function is now discouraged. Rather doing this: 716 717 >>> from Bio import SeqIO 718 >>> handle = open("Clustalw/protein.aln") 719 >>> alignment = SeqIO.to_alignment(SeqIO.parse(handle, "clustal")) 720 >>> handle.close() 721 722 You are now encouraged to use Bio.AlignIO instead, e.g. 723 724 >>> from Bio import AlignIO 725 >>> handle = open("Clustalw/protein.aln") 726 >>> alignment = AlignIO.read(handle, "clustal") 727 >>> handle.close() 728 """ 729 #TODO - Move this functionality into the Alignment class instead? 730 from Bio.Alphabet import generic_alphabet 731 from Bio.Alphabet import _consensus_alphabet 732 if alphabet is None : 733 sequences = list(sequences) 734 alphabet = _consensus_alphabet([rec.seq.alphabet for rec in sequences \ 735 if rec.seq is not None]) 736 737 if not (isinstance(alphabet, Alphabet) or isinstance(alphabet, AlphabetEncoder)) : 738 raise ValueError("Invalid alphabet") 739 740 alignment_length = None 741 alignment = Alignment(alphabet) 742 for record in sequences : 743 if strict : 744 if alignment_length is None : 745 alignment_length = len(record.seq) 746 elif alignment_length != len(record.seq) : 747 raise ValueError("Sequences must all be the same length") 748 749 assert isinstance(record.seq.alphabet, Alphabet) \ 750 or isinstance(record.seq.alphabet, AlphabetEncoder), \ 751 "Sequence does not have a valid alphabet" 752 753 #TODO - Move this alphabet comparison code into the Alphabet module/class? 754 #TODO - Is a normal alphabet "ungapped" by default, or does it just mean 755 #undecided? 756 if isinstance(record.seq.alphabet, Alphabet) \ 757 and isinstance(alphabet, Alphabet) : 758 #Comparing two non-gapped alphabets 759 if not isinstance(record.seq.alphabet, alphabet.__class__) : 760 raise ValueError("Incompatible sequence alphabet " \ 761 + "%s for %s alignment" \ 762 % (record.seq.alphabet, alphabet)) 763 elif isinstance(record.seq.alphabet, AlphabetEncoder) \ 764 and isinstance(alphabet, Alphabet) : 765 raise ValueError("Sequence has a gapped alphabet, alignment does not") 766 elif isinstance(record.seq.alphabet, Alphabet) \ 767 and isinstance(alphabet, Gapped) : 768 #Sequence isn't gapped, alignment is. 769 if not isinstance(record.seq.alphabet, alphabet.alphabet.__class__) : 770 raise ValueError("Incompatible sequence alphabet " \ 771 + "%s for %s alignment" \ 772 % (record.seq.alphabet, alphabet)) 773 else : 774 #Comparing two gapped alphabets 775 if not isinstance(record.seq.alphabet, alphabet.__class__) : 776 raise ValueError("Incompatible sequence alphabet " \ 777 + "%s for %s alignment" \ 778 % (record.seq.alphabet, alphabet)) 779 if record.seq.alphabet.gap_char != alphabet.gap_char : 780 raise ValueError("Sequence gap characters != alignment gap char") 781 #ToDo, additional checks on the specified alignment... 782 #Should we look at the alphabet.contains() method? 783 if record.seq is None : 784 raise TypeError("SeqRecord (id=%s) has None for its sequence." % record.id) 785 786 #This is abusing the "private" records list, 787 #we should really have a method like add_sequence 788 #but which takes SeqRecord objects. See also Bug 1944 789 alignment._records.append(record) 790 return alignment
791
792 -def convert(in_file, in_format, out_file, out_format, alphabet=None) :
793 """Convert between two sequence file formats, return number of records. 794 795 - in_file - an input handle or filename 796 - in_format - input file format, lower case string 797 - out_file - an output handle or filename 798 - out_format - output file format, lower case string 799 - alphabet - optional alphabet to assume 800 801 NOTE - If you provide an output filename, it will be opened which will 802 overwrite any existing file without warning. This may happen if even 803 the conversion is aborted (e.g. an invalid out_format name is given). 804 805 For example, going from a filename to a handle: 806 807 >>> from Bio import SeqIO 808 >>> from StringIO import StringIO 809 >>> handle = StringIO("") 810 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 811 3 812 >>> print handle.getvalue() 813 >EAS54_6_R1_2_1_413_324 814 CCCTTCTTGTCTTCAGCGTTTCTCC 815 >EAS54_6_R1_2_1_540_792 816 TTGGCAGGCCAAGGCCGATGGATCA 817 >EAS54_6_R1_2_1_443_348 818 GTTGCTTCTGGCGTGGGTGGGGGGG 819 <BLANKLINE> 820 """ 821 #TODO - Add optimised versions of important conversions 822 #For now just off load the work to SeqIO parse/write 823 if isinstance(in_file, basestring) : 824 in_handle = open(in_file, "rU") 825 in_close = True 826 else : 827 in_handle = in_file 828 in_close = False 829 #Don't open the output file until we've checked the input is OK? 830 if isinstance(out_file, basestring) : 831 out_handle = open(out_file, "w") 832 out_close = True 833 else : 834 out_handle = out_file 835 out_close = False 836 #This will check the arguments and issue error messages, 837 #after we have opened the file which is a shame. 838 from _convert import _handle_convert #Lazy import 839 count = _handle_convert(in_handle, in_format, 840 out_handle, out_format, 841 alphabet) 842 #Must now close any handles we opened 843 if in_close : in_handle.close() 844 if out_close : out_handle.close() 845 return count
846
847 -def _test():
848 """Run the Bio.SeqIO module's doctests. 849 850 This will try and locate the unit tests directory, and run the doctests 851 from there in order that the relative paths used in the examples work. 852 """ 853 import doctest 854 import os 855 if os.path.isdir(os.path.join("..","..","Tests")) : 856 print "Runing doctests..." 857 cur_dir = os.path.abspath(os.curdir) 858 os.chdir(os.path.join("..","..","Tests")) 859 doctest.testmod() 860 os.chdir(cur_dir) 861 del cur_dir 862 print "Done" 863 elif os.path.isdir(os.path.join("Tests", "Fasta")) : 864 print "Runing doctests..." 865 cur_dir = os.path.abspath(os.curdir) 866 os.chdir(os.path.join("Tests")) 867 doctest.testmod() 868 os.chdir(cur_dir) 869 del cur_dir 870 print "Done"
871 872 if __name__ == "__main__": 873 #Run the doctests 874 _test() 875