
Source Code for Package Bio.SeqIO

# Copyright 2006-2008 by Peter Cock.  All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.
#
#Nice link:
# http://www.ebi.ac.uk/help/formats_frame.html

"""Sequence input/output as SeqRecord objects.

The Bio.SeqIO module is also documented by a whole chapter in the Biopython
tutorial, and by the wiki http://biopython.org/wiki/SeqIO on the website.
The approach is designed to be similar to the BioPerl SeqIO design.

Input
=====
The main function is Bio.SeqIO.parse(...) which takes an input file handle
and a format string.  This returns an iterator giving SeqRecord objects.

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    for record in SeqIO.parse(handle, "fasta") :
        print record
    handle.close()

Note that the parse() function will invoke the relevant parser for the
format with its default settings.  You may want more control, in which case
you need to create a format-specific sequence iterator directly.

For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records
using a sequence iterator can save you a lot of memory (RAM).  There is
less benefit for interlaced file formats (e.g. most multiple alignment file
formats).  However, an iterator only lets you access the records one by one.

If you want random access to the records by number, turn this into a list:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    records = list(SeqIO.parse(handle, "fasta"))
    handle.close()
    print records[0]

If you want random access to the records by a key such as the record id,
turn the iterator into a dictionary:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
    handle.close()
    print record_dict["gi:12345678"]

If you expect your file to contain one-and-only-one record, then we provide
the following 'helper' function which will return a single SeqRecord, or
raise an exception if there are no records or more than one record:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    print record

This style is useful when you expect a single record only (and would
consider multiple records an error).  For example, when dealing with GenBank
files for bacterial genomes or chromosomes, there is normally only a single
record.  Alternatively, use this with a handle when downloading a single
record from the internet.

However, if you just want the first record from a file containing multiple
records, use the iterator's next() method:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record = SeqIO.parse(handle, "fasta").next()
    handle.close()
    print record

The above code will work as long as the file contains at least one record.
Note that if there is more than one record, the remaining records will be
silently ignored.

Input - Alignments
==================
You can read in alignment files as Alignment objects using Bio.AlignIO.
Alternatively, reading in an alignment file format via Bio.SeqIO will give
you a SeqRecord for each row of each alignment.

Output
======
Use the function Bio.SeqIO.write(...), which takes a complete set of
SeqRecord objects (either as a list, or an iterator), an output file handle
and of course the file format.

    from Bio import SeqIO
    records = ...
    handle = open("example.faa", "w")
    SeqIO.write(records, handle, "fasta")
    handle.close()

In general, you are expected to call this function once (with all your
records) and then close the file handle.

Output - Advanced
=================
The effect of calling write() multiple times on a single file will vary
depending on the file format, and is best avoided unless you have a strong
reason to do so.

Trying this for certain alignment formats (e.g. phylip, clustal, stockholm)
would have the effect of concatenating several multiple sequence alignments
together.  Such files are created by the PHYLIP suite of programs for
bootstrap analysis.

For sequential file formats (e.g. fasta, genbank) each "record block" holds
a single sequence.  For these files it would probably be safe to call
write() multiple times.

File Formats
============
When specifying formats, use lowercase strings.
"""

#TODO
# - define policy on reading aligned sequences with gaps in
#   (e.g. - and . characters) including how the alphabet interacts
#
# - Can we build the to_alignment(...) functionality
#   into the generic Alignment class instead?
#
# - How best to handle unique/non unique record.id when writing.
#   For most file formats reading such files is fine; The stockholm
#   parser would fail.
#
# - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf)
#   http://www.bioperl.org/wiki/MSF_multiple_alignment_format

"""
FAO BioPython Developers
========================
The way I envision this SeqIO system working is that for any sequence file
format we have an iterator that returns SeqRecord objects.

This also applies to interlaced file formats (like clustal) where the file
cannot be read record by record.  You should still return an iterator!

These file format specific sequence iterators may be implemented as:
* Classes which take a handle for __init__ and provide the __iter__ method
* Functions that take a handle, and return an iterator object
* Generator functions that take a handle, and yield SeqRecord objects

It is then trivial to turn this iterator into a list of SeqRecord objects,
an in memory dictionary, or a multiple sequence alignment object.

For building the dictionary by default the id property of each SeqRecord is
used as the key.  You should always populate the id property, and it should
be unique. For some file formats the accession number is a good choice.

When adding a new file format, please use the same lower case format name
as BioPerl, or if they have not defined one, try the names used by EMBOSS.

See also http://biopython.org/wiki/SeqIO_dev

--Peter
"""

import os
from StringIO import StringIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Generic import Alignment

import AceIO
import FastaIO
import IgIO #IntelliGenetics or MASE format
import InsdcIO #EMBL and GenBank
import PhdIO
import SwissIO

#Convention for format names is "mainname-subtype" in lower case.
#Please use the same names as BioPerl where possible.
#
#Note that this simple system copes with defining
#multiple possible iterators for a given format/extension
#with the -subtype suffix
#
#Most alignment file formats will be handled via Bio.AlignIO

_FormatToIterator ={"fasta" : FastaIO.FastaIterator,
                    "genbank" : InsdcIO.GenBankIterator,
                    "genbank-cds" : InsdcIO.GenBankCdsFeatureIterator,
                    "embl" : InsdcIO.EmblIterator,
                    "embl-cds" : InsdcIO.EmblCdsFeatureIterator,
                    "ig" : IgIO.IgIterator,
                    "swiss" : SwissIO.SwissIterator,
                    "phd" : PhdIO.PhdIterator,
                    "ace" : AceIO.AceIterator,
                    }

_FormatToWriter ={"fasta" : FastaIO.FastaWriter,
                  }

def write(sequences, handle, format) :
    """Write complete set of sequences to a file.

    sequences - A list (or iterator) of SeqRecord objects
    handle    - File handle object to write to
    format    - What format to use.

    You should close the handle after calling this function.

    There is no return value.
    """
    from Bio import AlignIO

    #Try and give helpful error messages:
    if isinstance(handle, basestring) :
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    if not isinstance(format, basestring) :
        raise TypeError("Need a string for the file format (lower case)")
    if not format :
        raise ValueError("Format required (lower case string)")
    if format != format.lower() :
        raise ValueError("Format string '%s' should be lower case" % format)
    if isinstance(sequences,SeqRecord):
        raise ValueError("Use a SeqRecord list/iterator, not just a single SeqRecord")

    #Map the file format to a writer class
    if format in _FormatToWriter :
        writer_class = _FormatToWriter[format]
        writer_class(handle).write_file(sequences)
        #Don't close the file, as that would prevent things like
        #creating concatenated phylip files for bootstrapping.
    elif format in AlignIO._FormatToIterator :
        #Try and turn all the records into a single alignment,
        #and write that using Bio.AlignIO
        AlignIO.write([to_alignment(sequences)], handle, format)
    else :
        raise ValueError("Unknown format '%s'" % format)

    return
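
The validate-then-dispatch pattern used by write() can be tried without Biopython installed. The following is only a sketch under stated assumptions: it is written for Python 3 (the listing itself is Python 2), and `ToyFastaWriter`, `_TOY_WRITERS` and `toy_write` are hypothetical stand-ins, not part of Bio.SeqIO.

```python
import io


class ToyFastaWriter(object):
    """Hypothetical stand-in for a writer class (not the real FastaIO.FastaWriter)."""

    def __init__(self, handle):
        self.handle = handle

    def write_file(self, records):
        # Each toy record is an (identifier, sequence) tuple.
        for identifier, sequence in records:
            self.handle.write(">%s\n%s\n" % (identifier, sequence))


# Format name -> writer class, playing the role of _FormatToWriter.
_TOY_WRITERS = {"fasta": ToyFastaWriter}


def toy_write(records, handle, format):
    """Validate the arguments, then dispatch to the writer for this format."""
    if isinstance(handle, str):
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    if not isinstance(format, str):
        raise TypeError("Need a string for the file format (lower case)")
    if not format:
        raise ValueError("Format required (lower case string)")
    if format != format.lower():
        raise ValueError("Format string '%s' should be lower case" % format)
    if format not in _TOY_WRITERS:
        raise ValueError("Unknown format '%s'" % format)
    _TOY_WRITERS[format](handle).write_file(records)
    # The handle is deliberately left open; the caller decides when to close it.


buffer = io.StringIO()
toy_write([("seq1", "ACGT"), ("seq2", "GGCC")], buffer, "fasta")
fasta_text = buffer.getvalue()
```

Keeping the format table as a plain dictionary means a new format needs only one new entry, which appears to be the motivation for the `_FormatToWriter` mapping in the listing.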

def parse(handle, format) :
    """Turns a sequence file into an iterator returning SeqRecords.

    handle   - handle to the file.
    format   - string describing the file format.

    If you have the file name in a string 'filename', use:

        from Bio import SeqIO
        my_iterator = SeqIO.parse(open(filename,"rU"), format)

    If you have a string 'data' containing the file contents, use:

        from Bio import SeqIO
        from StringIO import StringIO
        my_iterator = SeqIO.parse(StringIO(data), format)

    Note that the file will be parsed with default settings,
    which may result in a generic alphabet or other non-ideal
    settings.  For more control, you must use the format specific
    iterator directly...

    Use the Bio.SeqIO.read(handle, format) function when you expect
    a single record only.
    """
    from Bio import AlignIO

    #Try and give helpful error messages:
    if isinstance(handle, basestring) :
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    if not isinstance(format, basestring) :
        raise TypeError("Need a string for the file format (lower case)")
    if not format :
        raise ValueError("Format required (lower case string)")
    if format != format.lower() :
        raise ValueError("Format string '%s' should be lower case" % format)

    #Map the file format to a sequence iterator:
    if format in _FormatToIterator :
        iterator_generator = _FormatToIterator[format]
        return iterator_generator(handle)
    elif format in AlignIO._FormatToIterator :
        #Use Bio.AlignIO to read in the alignments
        return _iterate_via_AlignIO(handle, format)
    else :
        raise ValueError("Unknown format '%s'" % format)
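
To illustrate what "a generator function that takes a handle and yields records" can look like, and how a string is wrapped in a handle before parsing, here is a minimal sketch. It assumes Python 3 and a much simplified FASTA reader; `ToyRecord`, `toy_fasta_iterator` and `_make_record` are hypothetical names and do not reproduce FastaIO.FastaIterator.

```python
import io
from collections import namedtuple

# Hypothetical minimal record type; the real module yields SeqRecord objects.
ToyRecord = namedtuple("ToyRecord", ["id", "description", "seq"])


def _make_record(title, chunks):
    """Build a ToyRecord, using the first word of the title as the id."""
    words = title.split()
    identifier = words[0] if words else ""
    return ToyRecord(identifier, title, "".join(chunks))


def toy_fasta_iterator(handle):
    """Yield one ToyRecord per '>' block, reading the handle line by line."""
    title = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                # A new title line closes the previous record.
                yield _make_record(title, chunks)
            title = line[1:]
            chunks = []
        elif title is not None and line:
            # Sequence lines may contain spaces between blocks of residues.
            chunks.append(line.replace(" ", ""))
    if title is not None:
        yield _make_record(title, chunks)


# A string is turned into a handle with StringIO before parsing.
data = ">alpha first example\nACGT\nTTGA\n>beta\nGG CC\n"
records = list(toy_fasta_iterator(io.StringIO(data)))
```

Because the function is a generator, only one record is held in memory at a time until the caller chooses to build a list, which is the memory saving the module docstring describes for non-interlaced formats.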

#This is a generator function
def _iterate_via_AlignIO(handle, format) :
    """Private function to iterate over all records in several alignments."""
    from Bio import AlignIO
    for align in AlignIO.parse(handle, format) :
        for record in align :
            yield record
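
The nested loop above flattens "alignments of rows" into a single stream of rows. A small sketch of the same idea, assuming Python 3 and using plain lists of strings as hypothetical stand-ins for alignments and records:

```python
from itertools import chain


def iterate_rows(alignments):
    """Yield every row of every alignment, in order, without building a list."""
    for alignment in alignments:
        for row in alignment:
            yield row


# Two toy "alignments", each a list of row labels.
toy_alignments = [["row1", "row2"], ["row3"]]
rows = list(iterate_rows(toy_alignments))

# The standard library offers the same flattening behaviour.
rows_via_chain = list(chain.from_iterable(toy_alignments))
```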

def read(handle, format) :
    """Turns a sequence file into a single SeqRecord.

    handle   - handle to the file.
    format   - string describing the file format.

    If the handle contains no records, or more than one record,
    an exception is raised.  For example, using a GenBank file
    containing one record:

        from Bio import SeqIO
        record = SeqIO.read(open("example.gbk"), "genbank")

    If however you want the first record from a file containing
    multiple records, this function would raise an exception.
    Instead use:

        from Bio import SeqIO
        record = SeqIO.parse(open("example.gbk"), "genbank").next()

    Use the Bio.SeqIO.parse(handle, format) function if you want
    to read multiple records from the handle.
    """
    iterator = parse(handle, format)
    try :
        first = iterator.next()
    except StopIteration :
        first = None
    if first is None :
        raise ValueError("No records found in handle")
    try :
        second = iterator.next()
    except StopIteration :
        second = None
    if second is not None :
        raise ValueError("More than one record found in handle")
    return first
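
The "exactly one record" check in read() works for any iterator. The sketch below assumes Python 3 and uses a hypothetical helper name, `read_single`. Unlike the listing, it uses a private sentinel object rather than None, so an iterator that legitimately yields None is not mistaken for an empty one; that is a design variation, not the module's behaviour.

```python
def read_single(records):
    """Return the only item from an iterable, or raise ValueError otherwise."""
    iterator = iter(records)
    missing = object()  # unique marker meaning "the iterator was exhausted"
    first = next(iterator, missing)
    if first is missing:
        raise ValueError("No records found in handle")
    if next(iterator, missing) is not missing:
        raise ValueError("More than one record found in handle")
    return first


only = read_single(iter(["record A"]))
```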

def to_dict(sequences, key_function=None) :
    """Turns a sequence iterator or list into a dictionary.

    sequences  - An iterator that returns SeqRecord objects,
                 or simply a list of SeqRecord objects.
    key_function - Optional function which when given a SeqRecord
                   returns a unique string for the dictionary key.

    e.g. key_function = lambda rec : rec.name
    or,  key_function = lambda rec : rec.description.split()[0]

    If key_function is omitted then record.id is used, on the
    assumption that the record objects returned are SeqRecords
    with a unique id field.

    If there are duplicate keys, an error is raised.

    Example usage:

        from Bio import SeqIO
        filename = "example.fasta"
        d = SeqIO.to_dict(SeqIO.parse(open(filename, "rU"), "fasta"),
            key_function = lambda rec : rec.description.split()[0])
        print len(d)
        print d.keys()[0:10]
        key = d.keys()[0]
        print d[key]
    """
    if key_function is None :
        key_function = lambda rec : rec.id

    d = dict()
    for record in sequences :
        key = key_function(record)
        if key in d :
            raise ValueError("Duplicate key '%s'" % key)
        d[key] = record
    return d
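
The effect of key_function and of the duplicate-key check can be seen with a few toy records. This is a sketch assuming Python 3; `ToyRecord` and `toy_to_dict` are hypothetical stand-ins for SeqRecord and to_dict, and the identifiers are made up for illustration.

```python
from collections import namedtuple

# Hypothetical minimal record type with just the fields used as keys here.
ToyRecord = namedtuple("ToyRecord", ["id", "description"])


def toy_to_dict(records, key_function=None):
    """Index records by key_function(record), refusing duplicate keys."""
    if key_function is None:
        key_function = lambda rec: rec.id
    result = {}
    for record in records:
        key = key_function(record)
        if key in result:
            raise ValueError("Duplicate key '%s'" % key)
        result[key] = record
    return result


records = [ToyRecord("gi|1", "alpha kinase"), ToyRecord("gi|2", "beta kinase")]

# Default: keyed on the record id.
by_id = toy_to_dict(records)

# Custom key: first word of the description.
by_first_word = toy_to_dict(
    records, key_function=lambda rec: rec.description.split()[0])
```

Keying the same records on the second word of the description ("kinase" for both) would raise ValueError, which is why the key function should return something unique per record.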

def to_alignment(sequences, alphabet=None, strict=True) :
    """Returns a multiple sequence alignment (OBSOLETE).

    sequences -An iterator that returns SeqRecord objects,
               or simply a list of SeqRecord objects.
               All the record sequences must be the same length.
    alphabet - Optional alphabet.  Strongly recommended.
    strict   - Optional, defaults to True.  Should error checking
               be done?

    Using this function is now discouraged.  Rather than doing this:

        from Bio import SeqIO
        alignment = SeqIO.to_alignment(SeqIO.parse(handle, format))

    You are now encouraged to use Bio.AlignIO instead, e.g.

        from Bio import AlignIO
        alignment = AlignIO.read(handle, format)
    """
    #TODO - Move this functionality into the Alignment class instead?
    from Bio.Alphabet import Alphabet, Gapped, generic_alphabet
    if alphabet is None :
        alphabet = Gapped(generic_alphabet)

    if not (isinstance(alphabet, Alphabet) or isinstance(alphabet, Gapped)) :
        raise ValueError("Invalid alignment alphabet")

    alignment_length = None
    alignment = Alignment(alphabet)
    for record in sequences :
        if strict :
            if alignment_length is None :
                alignment_length = len(record.seq)
            elif alignment_length != len(record.seq) :
                raise ValueError("Sequences must all be the same length")

            assert isinstance(record.seq.alphabet, Alphabet) \
            or isinstance(record.seq.alphabet, Gapped), \
                "Sequence does not have a valid alphabet"

            #TODO - Move this alphabet comparison code into the Alphabet module/class?
            #TODO - Is a normal alphabet "ungapped" by default, or does it just mean
            #undecided?
            if isinstance(record.seq.alphabet, Alphabet) \
            and isinstance(alphabet, Alphabet) :
                #Comparing two non-gapped alphabets
                if not isinstance(record.seq.alphabet, alphabet.__class__) :
                    raise ValueError("Incompatible sequence alphabet " \
                                     + "%s for %s alignment" \
                                     % (record.seq.alphabet, alphabet))
            elif isinstance(record.seq.alphabet, Gapped) \
            and isinstance(alphabet, Alphabet) :
                raise ValueError("Sequence has a gapped alphabet, alignment does not")
            elif isinstance(record.seq.alphabet, Alphabet) \
            and isinstance(alphabet, Gapped) :
                #Sequence isn't gapped, alignment is.
                if not isinstance(record.seq.alphabet, alphabet.alphabet.__class__) :
                    raise ValueError("Incompatible sequence alphabet " \
                                     + "%s for %s alignment" \
                                     % (record.seq.alphabet, alphabet))
            else :
                #Comparing two gapped alphabets
                if not isinstance(record.seq.alphabet, alphabet.__class__) :
                    raise ValueError("Incompatible sequence alphabet " \
                                     + "%s for %s alignment" \
                                     % (record.seq.alphabet, alphabet))
                if record.seq.alphabet.gap_char != alphabet.gap_char :
                    raise ValueError("Sequence gap characters <> alignment gap char")
            #ToDo, additional checks on the specified alignment...
            #Should we look at the alphabet.contains() method?

        #This is abusing the "private" records list,
        #we should really have a method like add_sequence
        #but which takes SeqRecord objects.  See also Bug 1944
        alignment._records.append(record)
    return alignment
451 452 if __name__ == "__main__" : 453 #Run some tests... 454 from Bio.Alphabet import generic_nucleotide 455 from sets import Set 456 457 # Fasta file with unusual layout, from here: 458 # http://virgil.ruc.dk/kurser/Sekvens/Treedraw.htm 459 faa_example = \ 460 """>V_Harveyi_PATH 461 mknwikvava aialsaatvq aatevkvgms gryfpftfvk qdklqgfevd mwdeigkrnd 462 ykieyvtanf sglfglletg ridtisnqit mtdarkakyl fadpyvvdga qitvrkgnds 463 iqgvedlagk tvavnlgsnf eqllrdydkd gkiniktydt giehdvalgr adafimdrls 464 alelikktgl plqlagepfe tiqnawpfvd nekgrklqae vnkalaemra dgtvekisvk 465 wfgaditk 466 >B_subtilis_YXEM 467 mkmkkwtvlv vaallavlsa cgngnssske ddnvlhvgat gqsypfayke ngkltgfdve 468 vmeavakkid mkldwkllef sglmgelqtg kldtisnqva vtderketyn ftkpyayagt 469 qivvkkdntd iksvddlkgk tvaavlgsnh aknleskdpd kkiniktyet qegtlkdvay 470 grvdayvnsr tvliaqikkt glplklagdp ivyeqvafpf akddahdklr kkvnkaldel 471 rkdgtlkkls ekyfneditv eqkh 472 >FLIY_ECOLI 473 mklahlgrqa lmgvmavalv agmsvksfad egllnkvker gtllvglegt yppfsfqgdd 474 gkltgfevef aqqlakhlgv easlkptkwd gmlasldskr idvvinqvti sderkkkydf 475 stpytisgiq alvkkgnegt iktaddlkgk kvgvglgtny eewlrqnvqg vdvrtydddp 476 tkyqdlrvgr idailvdrla aldlvkktnd tlavtgeafs rqesgvalrk gnedllkavn 477 daiaemqkdg tlqalsekwf gadvtk 478 >Deinococcus_radiodurans 479 mkksllslkl sgllvpsvla lslsacssps stlnqgtlki amegtyppft skneqgelvg 480 fdvdiakava qklnlkpefv ltewsgilag lqankydviv nqvgitperq nsigfsqpya 481 ysrpeiivak nntfnpqsla dlkgkrvgst lgsnyekqli dtgdikivty pgapeiladl 482 vagridaayn drlvvnyiin dqklpvrgag qigdaapvgi alkkgnsalk dqidkaltem 483 rsdgtfekis qkwfgqdvgq p 484 >B_subtilis_GlnH_homo_YCKK 485 mkkallalfm vvsiaalaac gagndnqskd nakdgdlwas ikkkgvltvg tegtyepfty 486 hdkdtdkltg ydveviteva krlglkvdfk etqwgsmfag lnskrfdvva nqvgktdred 487 kydfsdkytt sravvvtkkd nndikseadv kgktsaqslt snynklatna gakvegvegm 488 aqalqmiqqa rvdmtyndkl avlnylktsg nknvkiafet gepqstyftf rkgsgevvdq 489 vnkalkemke dgtlskiskk wfgedvsk 490 >YA80_HAEIN 491 mkkllfttal ltgaiafstf shageiadrv ektktllvgt egtyapftfh 
dksgkltgfd 492 vevirkvaek lglkvefket qwdamyagln akrfdvianq tnpsperlkk ysfttpynys 493 ggvivtkssd nsiksfedlk grksaqsats nwgkdakaag aqilvvdgla qslelikqgr 494 aeatindkla vldyfkqhpn sglkiaydrg dktptafafl qgedalitkf nqvlealrqd 495 gtlkqisiew fgyditq 496 >E_coli_GlnH 497 mksvlkvsla altlafavss haadkklvva tdtafvpfef kqgdkyvgfd vdlwaaiake 498 lkldyelkpm dfsgiipalq tknvdlalag ititderkka idfsdgyyks gllvmvkann 499 ndvksvkdld gkvvavksgt gsvdyakani ktkdlrqfpn idnaymelgt nradavlhdt 500 pnilyfikta gngqfkavgd sleaqqygia fpkgsdelrd kvngalktlr engtyneiyk 501 kwfgtepk 502 >HISJ_E_COLI 503 mkklvlslsl vlafssataa faaipqniri gtdptyapfe sknsqgelvg fdidlakelc 504 krintqctfv enpldalips lkakkidaim sslsitekrq qeiaftdkly aadsrlvvak 505 nsdiqptves lkgkrvgvlq gttqetfgne hwapkgieiv syqgqdniys dltagridaa 506 fqdevaaseg flkqpvgkdy kfggpsvkde klfgvgtgmg lrkednelre alnkafaemr 507 adgtyeklak kyfdfdvygg""" 508 509 # This alignment was created from the fasta example given above 510 aln_example = \ 511 """CLUSTAL X (1.83) multiple sequence alignment 512 513 514 V_Harveyi_PATH --MKNWIKVAVAAIA--LSAA------------------TVQAATEVKVG 515 B_subtilis_YXEM MKMKKWTVLVVAALLAVLSACG------------NGNSSSKEDDNVLHVG 516 B_subtilis_GlnH_homo_YCKK MKKALLALFMVVSIAALAACGAGNDNQSKDNAKDGDLWASIKKKGVLTVG 517 YA80_HAEIN MKKLLFTTALLTGAIAFSTF-----------SHAGEIADRVEKTKTLLVG 518 FLIY_ECOLI MKLAHLGRQALMGVMAVALVAG---MSVKSFADEG-LLNKVKERGTLLVG 519 E_coli_GlnH --MKSVLKVSLAALTLAFAVS------------------SHAADKKLVVA 520 Deinococcus_radiodurans -MKKSLLSLKLSGLLVPSVLALS--------LSACSSPSSTLNQGTLKIA 521 HISJ_E_COLI MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG 522 : . : :. 
523 524 V_Harveyi_PATH MSGRYFPFTFVKQ--DKLQGFEVDMWDEIGKRNDYKIEYVTANFSGLFGL 525 B_subtilis_YXEM ATGQSYPFAYKEN--GKLTGFDVEVMEAVAKKIDMKLDWKLLEFSGLMGE 526 B_subtilis_GlnH_homo_YCKK TEGTYEPFTYHDKDTDKLTGYDVEVITEVAKRLGLKVDFKETQWGSMFAG 527 YA80_HAEIN TEGTYAPFTFHDK-SGKLTGFDVEVIRKVAEKLGLKVEFKETQWDAMYAG 528 FLIY_ECOLI LEGTYPPFSFQGD-DGKLTGFEVEFAQQLAKHLGVEASLKPTKWDGMLAS 529 E_coli_GlnH TDTAFVPFEFKQG--DKYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPA 530 Deinococcus_radiodurans MEGTYPPFTSKNE-QGELVGFDVDIAKAVAQKLNLKPEFVLTEWSGILAG 531 HISJ_E_COLI TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS 532 ** .: *::::. : :. . ..: 533 534 V_Harveyi_PATH LETGRIDTISNQITMTDARKAKYLFADPYVVDG-AQITVRKGNDSIQGVE 535 B_subtilis_YXEM LQTGKLDTISNQVAVTDERKETYNFTKPYAYAG-TQIVVKKDNTDIKSVD 536 B_subtilis_GlnH_homo_YCKK LNSKRFDVVANQVG-KTDREDKYDFSDKYTTSR-AVVVTKKDNNDIKSEA 537 YA80_HAEIN LNAKRFDVIANQTNPSPERLKKYSFTTPYNYSG-GVIVTKSSDNSIKSFE 538 FLIY_ECOLI LDSKRIDVVINQVTISDERKKKYDFSTPYTISGIQALVKKGNEGTIKTAD 539 E_coli_GlnH LQTKNVDLALAGITITDERKKAIDFSDGYYKSG-LLVMVKANNNDVKSVK 540 Deinococcus_radiodurans LQANKYDVIVNQVGITPERQNSIGFSQPYAYSRPEIIVAKNNTFNPQSLA 541 HISJ_E_COLI LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE 542 *.: . * . * *: : : . 543 544 V_Harveyi_PATH DLAGKTVAVNLGSNFEQLLRDYDKDGKINIKTYDT--GIEHDVALGRADA 545 B_subtilis_YXEM DLKGKTVAAVLGSNHAKNLESKDPDKKINIKTYETQEGTLKDVAYGRVDA 546 B_subtilis_GlnH_homo_YCKK DVKGKTSAQSLTSNYNKLATN----AGAKVEGVEGMAQALQMIQQARVDM 547 YA80_HAEIN DLKGRKSAQSATSNWGKDAKA----AGAQILVVDGLAQSLELIKQGRAEA 548 FLIY_ECOLI DLKGKKVGVGLGTNYEEWLRQNV--QGVDVRTYDDDPTKYQDLRVGRIDA 549 E_coli_GlnH DLDGKVVAVKSGTGSVDYAKAN--IKTKDLRQFPNIDNAYMELGTNRADA 550 Deinococcus_radiodurans DLKGKRVGSTLGSNYEKQLIDTG---DIKIVTYPGAPEILADLVAGRIDA 551 HISJ_E_COLI SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA 552 .: *: . 
: .: : * : 553 554 V_Harveyi_PATH FIMDRLSALE-LIKKT-GLPLQLAGEPFETI-----QNAWPFVDNEKGRK 555 B_subtilis_YXEM YVNSRTVLIA-QIKKT-GLPLKLAGDPIVYE-----QVAFPFAKDDAHDK 556 B_subtilis_GlnH_homo_YCKK TYNDKLAVLN-YLKTSGNKNVKIAFETGEPQ-----STYFTFRKGS--GE 557 YA80_HAEIN TINDKLAVLD-YFKQHPNSGLKIAYDRGDKT-----PTAFAFLQGE--DA 558 FLIY_ECOLI ILVDRLAALD-LVKKT-NDTLAVTGEAFSRQ-----ESGVALRKGN--ED 559 E_coli_GlnH VLHDTPNILY-FIKTAGNGQFKAVGDSLEAQ-----QYGIAFPKGS--DE 560 Deinococcus_radiodurans AYNDRLVVNY-IINDQ-KLPVRGAGQIGDAA-----PVGIALKKGN--SA 561 HISJ_E_COLI AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE 562 . .: : . . 563 564 V_Harveyi_PATH LQAEVNKALAEMRADGTVEKISVKWFGADITK---- 565 B_subtilis_YXEM LRKKVNKALDELRKDGTLKKLSEKYFNEDITVEQKH 566 B_subtilis_GlnH_homo_YCKK VVDQVNKALKEMKEDGTLSKISKKWFGEDVSK---- 567 YA80_HAEIN LITKFNQVLEALRQDGTLKQISIEWFGYDITQ---- 568 FLIY_ECOLI LLKAVNDAIAEMQKDGTLQALSEKWFGADVTK---- 569 E_coli_GlnH LRDKVNGALKTLRENGTYNEIYKKWFGTEPK----- 570 Deinococcus_radiodurans LKDQIDKALTEMRSDGTFEKISQKWFGQDVGQP--- 571 HISJ_E_COLI LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG--- 572 : .: .: :: :** . : ::*. : 573 """ 574 575 # This is the clustal example (above) but output in phylip format, 576 # with truncated names. 
Note there is an ambiguity here: two 577 # different sequences both called "B_subtilis", originally 578 # "B_subtilis_YXEM" and "B_subtilis_GlnH_homo_YCKK" 579 phy_example = \ 580 """ 8 286 581 V_Harveyi_ --MKNWIKVA VAAIA--LSA A--------- ---------T VQAATEVKVG 582 B_subtilis MKMKKWTVLV VAALLAVLSA CG-------- ----NGNSSS KEDDNVLHVG 583 B_subtilis MKKALLALFM VVSIAALAAC GAGNDNQSKD NAKDGDLWAS IKKKGVLTVG 584 YA80_HAEIN MKKLLFTTAL LTGAIAFSTF ---------- -SHAGEIADR VEKTKTLLVG 585 FLIY_ECOLI MKLAHLGRQA LMGVMAVALV AG---MSVKS FADEG-LLNK VKERGTLLVG 586 E_coli_Gln --MKSVLKVS LAALTLAFAV S--------- ---------S HAADKKLVVA 587 Deinococcu -MKKSLLSLK LSGLLVPSVL ALS------- -LSACSSPSS TLNQGTLKIA 588 HISJ_E_COL MKKLVLSLSL VLAFSSATAA F--------- ---------- AAIPQNIRIG 589 590 MSGRYFPFTF VKQ--DKLQG FEVDMWDEIG KRNDYKIEYV TANFSGLFGL 591 ATGQSYPFAY KEN--GKLTG FDVEVMEAVA KKIDMKLDWK LLEFSGLMGE 592 TEGTYEPFTY HDKDTDKLTG YDVEVITEVA KRLGLKVDFK ETQWGSMFAG 593 TEGTYAPFTF HDK-SGKLTG FDVEVIRKVA EKLGLKVEFK ETQWDAMYAG 594 LEGTYPPFSF QGD-DGKLTG FEVEFAQQLA KHLGVEASLK PTKWDGMLAS 595 TDTAFVPFEF KQG--DKYVG FDVDLWAAIA KELKLDYELK PMDFSGIIPA 596 MEGTYPPFTS KNE-QGELVG FDVDIAKAVA QKLNLKPEFV LTEWSGILAG 597 TDPTYAPFES KNS-QGELVG FDIDLAKELC KRINTQCTFV ENPLDALIPS 598 599 LETGRIDTIS NQITMTDARK AKYLFADPYV VDG-AQITVR KGNDSIQGVE 600 LQTGKLDTIS NQVAVTDERK ETYNFTKPYA YAG-TQIVVK KDNTDIKSVD 601 LNSKRFDVVA NQVG-KTDRE DKYDFSDKYT TSR-AVVVTK KDNNDIKSEA 602 LNAKRFDVIA NQTNPSPERL KKYSFTTPYN YSG-GVIVTK SSDNSIKSFE 603 LDSKRIDVVI NQVTISDERK KKYDFSTPYT ISGIQALVKK GNEGTIKTAD 604 LQTKNVDLAL AGITITDERK KAIDFSDGYY KSG-LLVMVK ANNNDVKSVK 605 LQANKYDVIV NQVGITPERQ NSIGFSQPYA YSRPEIIVAK NNTFNPQSLA 606 LKAKKIDAIM SSLSITEKRQ QEIAFTDKLY AADSRLVVAK NSDIQP-TVE 607 608 DLAGKTVAVN LGSNFEQLLR DYDKDGKINI KTYDT--GIE HDVALGRADA 609 DLKGKTVAAV LGSNHAKNLE SKDPDKKINI KTYETQEGTL KDVAYGRVDA 610 DVKGKTSAQS LTSNYNKLAT N----AGAKV EGVEGMAQAL QMIQQARVDM 611 DLKGRKSAQS ATSNWGKDAK A----AGAQI LVVDGLAQSL ELIKQGRAEA 612 DLKGKKVGVG LGTNYEEWLR QNV--QGVDV RTYDDDPTKY 
QDLRVGRIDA 613 DLDGKVVAVK SGTGSVDYAK AN--IKTKDL RQFPNIDNAY MELGTNRADA 614 DLKGKRVGST LGSNYEKQLI DTG---DIKI VTYPGAPEIL ADLVAGRIDA 615 SLKGKRVGVL QGTTQETFGN EHWAPKGIEI VSYQGQDNIY SDLTAGRIDA 616 617 FIMDRLSALE -LIKKT-GLP LQLAGEPFET I-----QNAW PFVDNEKGRK 618 YVNSRTVLIA -QIKKT-GLP LKLAGDPIVY E-----QVAF PFAKDDAHDK 619 TYNDKLAVLN -YLKTSGNKN VKIAFETGEP Q-----STYF TFRKGS--GE 620 TINDKLAVLD -YFKQHPNSG LKIAYDRGDK T-----PTAF AFLQGE--DA 621 ILVDRLAALD -LVKKT-NDT LAVTGEAFSR Q-----ESGV ALRKGN--ED 622 VLHDTPNILY -FIKTAGNGQ FKAVGDSLEA Q-----QYGI AFPKGS--DE 623 AYNDRLVVNY -IINDQ-KLP VRGAGQIGDA A-----PVGI ALKKGN--SA 624 AFQDEVAASE GFLKQPVGKD YKFGGPSVKD EKLFGVGTGM GLRKED--NE 625 626 LQAEVNKALA EMRADGTVEK ISVKWFGADI TK---- 627 LRKKVNKALD ELRKDGTLKK LSEKYFNEDI TVEQKH 628 VVDQVNKALK EMKEDGTLSK ISKKWFGEDV SK---- 629 LITKFNQVLE ALRQDGTLKQ ISIEWFGYDI TQ---- 630 LLKAVNDAIA EMQKDGTLQA LSEKWFGADV TK---- 631 LRDKVNGALK TLRENGTYNE IYKKWFGTEP K----- 632 LKDQIDKALT EMRSDGTFEK ISQKWFGQDV GQP--- 633 LREALNKAFA EMRADGTYEK LAKKYFDFDV YGG--- 634 """ 635 # This is the clustal example (above) but output in phylip format, 636 nxs_example = \ 637 """#NEXUS 638 BEGIN DATA; 639 dimensions ntax=8 nchar=286; 640 format missing=? 
641 symbols="ABCDEFGHIKLMNPQRSTUVWXYZ" 642 interleave datatype=PROTEIN gap= -; 643 644 matrix 645 V_Harveyi_PATH --MKNWIKVAVAAIA--LSAA------------------TVQAATEVKVG 646 B_subtilis_YXEM MKMKKWTVLVVAALLAVLSACG------------NGNSSSKEDDNVLHVG 647 B_subtilis_GlnH_homo_YCKK MKKALLALFMVVSIAALAACGAGNDNQSKDNAKDGDLWASIKKKGVLTVG 648 YA80_HAEIN MKKLLFTTALLTGAIAFSTF-----------SHAGEIADRVEKTKTLLVG 649 FLIY_ECOLI MKLAHLGRQALMGVMAVALVAG---MSVKSFADEG-LLNKVKERGTLLVG 650 E_coli_GlnH --MKSVLKVSLAALTLAFAVS------------------SHAADKKLVVA 651 Deinococcus_radiodurans -MKKSLLSLKLSGLLVPSVLALS--------LSACSSPSSTLNQGTLKIA 652 HISJ_E_COLI MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG 653 654 V_Harveyi_PATH MSGRYFPFTFVKQ--DKLQGFEVDMWDEIGKRNDYKIEYVTANFSGLFGL 655 B_subtilis_YXEM ATGQSYPFAYKEN--GKLTGFDVEVMEAVAKKIDMKLDWKLLEFSGLMGE 656 B_subtilis_GlnH_homo_YCKK TEGTYEPFTYHDKDTDKLTGYDVEVITEVAKRLGLKVDFKETQWGSMFAG 657 YA80_HAEIN TEGTYAPFTFHDK-SGKLTGFDVEVIRKVAEKLGLKVEFKETQWDAMYAG 658 FLIY_ECOLI LEGTYPPFSFQGD-DGKLTGFEVEFAQQLAKHLGVEASLKPTKWDGMLAS 659 E_coli_GlnH TDTAFVPFEFKQG--DKYVGFDVDLWAAIAKELKLDYELKPMDFSGIIPA 660 Deinococcus_radiodurans MEGTYPPFTSKNE-QGELVGFDVDIAKAVAQKLNLKPEFVLTEWSGILAG 661 HISJ_E_COLI TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS 662 663 V_Harveyi_PATH LETGRIDTISNQITMTDARKAKYLFADPYVVDG-AQITVRKGNDSIQGVE 664 B_subtilis_YXEM LQTGKLDTISNQVAVTDERKETYNFTKPYAYAG-TQIVVKKDNTDIKSVD 665 B_subtilis_GlnH_homo_YCKK LNSKRFDVVANQVG-KTDREDKYDFSDKYTTSR-AVVVTKKDNNDIKSEA 666 YA80_HAEIN LNAKRFDVIANQTNPSPERLKKYSFTTPYNYSG-GVIVTKSSDNSIKSFE 667 FLIY_ECOLI LDSKRIDVVINQVTISDERKKKYDFSTPYTISGIQALVKKGNEGTIKTAD 668 E_coli_GlnH LQTKNVDLALAGITITDERKKAIDFSDGYYKSG-LLVMVKANNNDVKSVK 669 Deinococcus_radiodurans LQANKYDVIVNQVGITPERQNSIGFSQPYAYSRPEIIVAKNNTFNPQSLA 670 HISJ_E_COLI LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE 671 672 V_Harveyi_PATH DLAGKTVAVNLGSNFEQLLRDYDKDGKINIKTYDT--GIEHDVALGRADA 673 B_subtilis_YXEM DLKGKTVAAVLGSNHAKNLESKDPDKKINIKTYETQEGTLKDVAYGRVDA 674 B_subtilis_GlnH_homo_YCKK 
DVKGKTSAQSLTSNYNKLATN----AGAKVEGVEGMAQALQMIQQARVDM 675 YA80_HAEIN DLKGRKSAQSATSNWGKDAKA----AGAQILVVDGLAQSLELIKQGRAEA 676 FLIY_ECOLI DLKGKKVGVGLGTNYEEWLRQNV--QGVDVRTYDDDPTKYQDLRVGRIDA 677 E_coli_GlnH DLDGKVVAVKSGTGSVDYAKAN--IKTKDLRQFPNIDNAYMELGTNRADA 678 Deinococcus_radiodurans DLKGKRVGSTLGSNYEKQLIDTG---DIKIVTYPGAPEILADLVAGRIDA 679 HISJ_E_COLI SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA 680 681 V_Harveyi_PATH FIMDRLSALE-LIKKT-GLPLQLAGEPFETI-----QNAWPFVDNEKGRK 682 B_subtilis_YXEM YVNSRTVLIA-QIKKT-GLPLKLAGDPIVYE-----QVAFPFAKDDAHDK 683 B_subtilis_GlnH_homo_YCKK TYNDKLAVLN-YLKTSGNKNVKIAFETGEPQ-----STYFTFRKGS--GE 684 YA80_HAEIN TINDKLAVLD-YFKQHPNSGLKIAYDRGDKT-----PTAFAFLQGE--DA 685 FLIY_ECOLI ILVDRLAALD-LVKKT-NDTLAVTGEAFSRQ-----ESGVALRKGN--ED 686 E_coli_GlnH VLHDTPNILY-FIKTAGNGQFKAVGDSLEAQ-----QYGIAFPKGS--DE 687 Deinococcus_radiodurans AYNDRLVVNY-IINDQ-KLPVRGAGQIGDAA-----PVGIALKKGN--SA 688 HISJ_E_COLI AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE 689 690 V_Harveyi_PATH LQAEVNKALAEMRADGTVEKISVKWFGADITK---- 691 B_subtilis_YXEM LRKKVNKALDELRKDGTLKKLSEKYFNEDITVEQKH 692 B_subtilis_GlnH_homo_YCKK VVDQVNKALKEMKEDGTLSKISKKWFGEDVSK---- 693 YA80_HAEIN LITKFNQVLEALRQDGTLKQISIEWFGYDITQ---- 694 FLIY_ECOLI LLKAVNDAIAEMQKDGTLQALSEKWFGADVTK---- 695 E_coli_GlnH LRDKVNGALKTLRENGTYNEIYKKWFGTEPK----- 696 Deinococcus_radiodurans LKDQIDKALTEMRSDGTFEKISQKWFGQDVGQP--- 697 HISJ_E_COLI LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG--- 698 ; 699 end; 700 """ 701 702 # This example uses DNA, from here: 703 # http://www.molecularevolution.org/resources/fileformats/ 704 nxs_example2 = \ 705 """#NEXUS 706 707 Begin data; 708 Dimensions ntax=10 nchar=705; 709 Format datatype=dna interleave=yes gap=- missing=?; 710 Matrix 711 Cow ATGGCATATCCCATACAACTAGGATTCCAAGATGCAACATCACCAATCATAGAAGAACTA 712 Carp ATGGCACACCCAACGCAACTAGGTTTCAAGGACGCGGCCATACCCGTTATAGAGGAACTT 713 Chicken ATGGCCAACCACTCCCAACTAGGCTTTCAAGACGCCTCATCCCCCATCATAGAAGAGCTC 714 Human ATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCTATCATAGAAGAGCTT 715 
Loach ATGGCACATCCCACACAATTAGGATTCCAAGACGCGGCCTCACCCGTAATAGAAGAACTT 716 Mouse ATGGCCTACCCATTCCAACTTGGTCTACAAGACGCCACATCCCCTATTATAGAAGAGCTA 717 Rat ATGGCTTACCCATTTCAACTTGGCTTACAAGACGCTACATCACCTATCATAGAAGAACTT 718 Seal ATGGCATACCCCCTACAAATAGGCCTACAAGATGCAACCTCTCCCATTATAGAGGAGTTA 719 Whale ATGGCATATCCATTCCAACTAGGTTTCCAAGATGCAGCATCACCCATCATAGAAGAGCTC 720 Frog ATGGCACACCCATCACAATTAGGTTTTCAAGACGCAGCCTCTCCAATTATAGAAGAATTA 721 722 Cow CTTCACTTTCATGACCACACGCTAATAATTGTCTTCTTAATTAGCTCATTAGTACTTTAC 723 Carp CTTCACTTCCACGACCACGCATTAATAATTGTGCTCCTAATTAGCACTTTAGTTTTATAT 724 Chicken GTTGAATTCCACGACCACGCCCTGATAGTCGCACTAGCAATTTGCAGCTTAGTACTCTAC 725 Human ATCACCTTTCATGATCACGCCCTCATAATCATTTTCCTTATCTGCTTCCTAGTCCTGTAT 726 Loach CTTCACTTCCATGACCATGCCCTAATAATTGTATTTTTGATTAGCGCCCTAGTACTTTAT 727 Mouse ATAAATTTCCATGATCACACACTAATAATTGTTTTCCTAATTAGCTCCTTAGTCCTCTAT 728 Rat ACAAACTTTCATGACCACACCCTAATAATTGTATTCCTCATCAGCTCCCTAGTACTTTAT 729 Seal CTACACTTCCATGACCACACATTAATAATTGTGTTCCTAATTAGCTCATTAGTACTCTAC 730 Whale CTACACTTTCACGATCATACACTAATAATCGTTTTTCTAATTAGCTCTTTAGTTCTCTAC 731 Frog CTTCACTTCCACGACCATACCCTCATAGCCGTTTTTCTTATTAGTACGCTAGTTCTTTAC 732 733 Cow ATTATTTCACTAATACTAACGACAAAGCTGACCCATACAAGCACGATAGATGCACAAGAA 734 Carp ATTATTACTGCAATGGTATCAACTAAACTTACTAATAAATATATTCTAGACTCCCAAGAA 735 Chicken CTTCTAACTCTTATACTTATAGAAAAACTATCA---TCAAACACCGTAGATGCCCAAGAA 736 Human GCCCTTTTCCTAACACTCACAACAAAACTAACTAATACTAACATCTCAGACGCTCAGGAA 737 Loach GTTATTATTACAACCGTCTCAACAAAACTCACTAACATATATATTTTGGACTCACAAGAA 738 Mouse ATCATCTCGCTAATATTAACAACAAAACTAACACATACAAGCACAATAGATGCACAAGAA 739 Rat ATTATTTCACTAATACTAACAACAAAACTAACACACACAAGCACAATAGACGCCCAAGAA 740 Seal ATTATCTCACTTATACTAACCACGAAACTCACCCACACAAGTACAATAGACGCACAAGAA 741 Whale ATTATTACCCTAATGCTTACAACCAAATTAACACATACTAGTACAATAGACGCCCAAGAA 742 Frog ATTATTACTATTATAATAACTACTAAACTAACTAATACAAACCTAATGGACGCACAAGAG 743 744 Cow GTAGAGACAATCTGAACCATTCTGCCCGCCATCATCTTAATTCTAATTGCTCTTCCTTCT 745 Carp ATCGAAATCGTATGAACCATTCTACCAGCCGTCATTTTAGTACTAATCGCCCTGCCCTCC 746 Chicken 
GTTGAACTAATCTGAACCATCCTACCCGCTATTGTCCTAGTCCTGCTTGCCCTCCCCTCC 747 Human ATAGAAACCGTCTGAACTATCCTGCCCGCCATCATCCTAGTCCTCATCGCCCTCCCATCC 748 Loach ATTGAAATCGTATGAACTGTGCTCCCTGCCCTAATCCTCATTTTAATCGCCCTCCCCTCA 749 Mouse GTTGAAACCATTTGAACTATTCTACCAGCTGTAATCCTTATCATAATTGCTCTCCCCTCT 750 Rat GTAGAAACAATTTGAACAATTCTCCCAGCTGTCATTCTTATTCTAATTGCCCTTCCCTCC 751 Seal GTGGAAACGGTGTGAACGATCCTACCCGCTATCATTTTAATTCTCATTGCCCTACCATCA 752 Whale GTAGAAACTGTCTGAACTATCCTCCCAGCCATTATCTTAATTTTAATTGCCTTGCCTTCA 753 Frog ATCGAAATAGTGTGAACTATTATACCAGCTATTAGCCTCATCATAATTGCCCTTCCATCC 754 755 Cow TTACGAATTCTATACATAATAGATGAAATCAATAACCCATCTCTTACAGTAAAAACCATA 756 Carp CTACGCATCCTGTACCTTATAGACGAAATTAACGACCCTCACCTGACAATTAAAGCAATA 757 Chicken CTCCAAATCCTCTACATAATAGACGAAATCGACGAACCTGATCTCACCCTAAAAGCCATC 758 Human CTACGCATCCTTTACATAACAGACGAGGTCAACGATCCCTCCCTTACCATCAAATCAATT 759 Loach CTACGAATTCTATATCTTATAGACGAGATTAATGACCCCCACCTAACAATTAAGGCCATG 760 Mouse CTACGCATTCTATATATAATAGACGAAATCAACAACCCCGTATTAACCGTTAAAACCATA 761 Rat CTACGAATTCTATACATAATAGACGAGATTAATAACCCAGTTCTAACAGTAAAAACTATA 762 Seal TTACGAATCCTCTACATAATGGACGAGATCAATAACCCTTCCTTGACCGTAAAAACTATA 763 Whale TTACGGATCCTTTACATAATAGACGAAGTCAATAACCCCTCCCTCACTGTAAAAACAATA 764 Frog CTTCGTATCCTATATTTAATAGATGAAGTTAATGATCCACACTTAACAATTAAAGCAATC 765 766 Cow GGACATCAGTGATACTGAAGCTATGAGTATACAGATTATGAGGACTTAAGCTTCGACTCC 767 Carp GGACACCAATGATACTGAAGTTACGAGTATACAGACTATGAAAATCTAGGATTCGACTCC 768 Chicken GGACACCAATGATACTGAACCTATGAATACACAGACTTCAAGGACCTCTCATTTGACTCC 769 Human GGCCACCAATGGTACTGAACCTACGAGTACACCGACTACGGCGGACTAATCTTCAACTCC 770 Loach GGGCACCAATGATACTGAAGCTACGAGTATACTGATTATGAAAACTTAAGTTTTGACTCC 771 Mouse GGGCACCAATGATACTGAAGCTACGAATATACTGACTATGAAGACCTATGCTTTGATTCA 772 Rat GGACACCAATGATACTGAAGCTATGAATATACTGACTATGAAGACCTATGCTTTGACTCC 773 Seal GGACATCAGTGATACTGAAGCTATGAGTACACAGACTACGAAGACCTGAACTTTGACTCA 774 Whale GGTCACCAATGATATTGAAGCTATGAGTATACCGACTACGAAGACCTAAGCTTCGACTCC 775 Frog GGCCACCAATGATACTGAAGCTACGAATATACTAACTATGAGGATCTCTCATTTGACTCT 776 777 Cow 
TACATAATTCCAACATCAGAATTAAAGCCAGGGGAGCTACGACTATTAGAAGTCGATAAT 778 Carp TATATAGTACCAACCCAAGACCTTGCCCCCGGACAATTCCGACTTCTGGAAACAGACCAC 779 Chicken TACATAACCCCAACAACAGACCTCCCCCTAGGCCACTTCCGCCTACTAGAAGTCGACCAT 780 Human TACATACTTCCCCCATTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAAT 781 Loach TACATAATCCCCACCCAGGACCTAACCCCTGGACAATTCCGGCTACTAGAGACAGACCAC 782 Mouse TATATAATCCCAACAAACGACCTAAAACCTGGTGAACTACGACTGCTAGAAGTTGATAAC 783 Rat TACATAATCCCAACCAATGACCTAAAACCAGGTGAACTTCGTCTATTAGAAGTTGATAAT 784 Seal TATATGATCCCCACACAAGAACTAAAGCCCGGAGAACTACGACTGCTAGAAGTAGACAAT 785 Whale TATATAATCCCAACATCAGACCTAAAGCCAGGAGAACTACGATTATTAGAAGTAGATAAC 786 Frog TATATAATTCCAACTAATGACCTTACCCCTGGACAATTCCGGCTGCTAGAAGTTGATAAT 787 788 Cow CGAGTTGTACTACCAATAGAAATAACAATCCGAATGTTAGTCTCCTCTGAAGACGTATTA 789 Carp CGAATAGTTGTTCCAATAGAATCCCCAGTCCGTGTCCTAGTATCTGCTGAAGACGTGCTA 790 Chicken CGCATTGTAATCCCCATAGAATCCCCCATTCGAGTAATCATCACCGCTGATGACGTCCTC 791 Human CGAGTAGTACTCCCGATTGAAGCCCCCATTCGTATAATAATTACATCACAAGACGTCTTG 792 Loach CGAATGGTTGTTCCCATAGAATCCCCTATTCGCATTCTTGTTTCCGCCGAAGATGTACTA 793 Mouse CGAGTCGTTCTGCCAATAGAACTTCCAATCCGTATATTAATTTCATCTGAAGACGTCCTC 794 Rat CGGGTAGTCTTACCAATAGAACTTCCAATTCGTATACTAATCTCATCCGAAGACGTCCTG 795 Seal CGAGTAGTCCTCCCAATAGAAATAACAATCCGCATACTAATCTCATCAGAAGATGTACTC 796 Whale CGAGTTGTCTTACCTATAGAAATAACAATCCGAATATTAGTCTCATCAGAAGACGTACTC 797 Frog CGAATAGTAGTCCCAATAGAATCTCCAACCCGACTTTTAGTTACAGCCGAAGACGTCCTC 798 799 Cow CACTCATGAGCTGTGCCCTCTCTAGGACTAAAAACAGACGCAATCCCAGGCCGTCTAAAC 800 Carp CATTCTTGAGCTGTTCCATCCCTTGGCGTAAAAATGGACGCAGTCCCAGGACGACTAAAT 801 Chicken CACTCATGAGCCGTACCCGCCCTCGGGGTAAAAACAGACGCAATCCCTGGACGACTAAAT 802 Human CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC 803 Loach CACTCCTGGGCCCTTCCAGCCATGGGGGTAAAGATAGACGCGGTCCCAGGACGCCTTAAC 804 Mouse CACTCATGAGCAGTCCCCTCCCTAGGACTTAAAACTGATGCCATCCCAGGCCGACTAAAT 805 Rat CACTCATGAGCCATCCCTTCACTAGGGTTAAAAACCGACGCAATCCCCGGCCGCCTAAAC 806 Seal CACTCATGAGCCGTACCGTCCCTAGGACTAAAAACTGATGCTATCCCAGGACGACTAAAC 807 Whale 
CACTCATGGGCCGTACCCTCCTTGGGCCTAAAAACAGATGCAATCCCAGGACGCCTAAAC 808 Frog CACTCGTGAGCTGTACCCTCCTTGGGTGTCAAAACAGATGCAATCCCAGGACGACTTCAT 809 810 Cow CAAACAACCCTTATATCGTCCCGTCCAGGCTTATATTACGGTCAATGCTCAGAAATTTGC 811 Carp CAAGCCGCCTTTATTGCCTCACGCCCAGGGGTCTTTTACGGACAATGCTCTGAAATTTGT 812 Chicken CAAACCTCCTTCATCACCACTCGACCAGGAGTGTTTTACGGACAATGCTCAGAAATCTGC 813 Human CAAACCACTTTCACCGCTACACGACCGGGGGTATACTACGGTCAATGCTCTGAAATCTGT 814 Loach CAAACCGCCTTTATTGCCTCCCGCCCCGGGGTATTCTATGGGCAATGCTCAGAAATCTGT 815 Mouse CAAGCAACAGTAACATCAAACCGACCAGGGTTATTCTATGGCCAATGCTCTGAAATTTGT 816 Rat CAAGCTACAGTCACATCAAACCGACCAGGTCTATTCTATGGCCAATGCTCTGAAATTTGC 817 Seal CAAACAACCCTAATAACCATACGACCAGGACTGTACTACGGTCAATGCTCAGAAATCTGT 818 Whale CAAACAACCTTAATATCAACACGACCAGGCCTATTTTATGGACAATGCTCAGAGATCTGC 819 Frog CAAACATCATTTATTGCTACTCGTCCGGGAGTATTTTACGGACAATGTTCAGAAATTTGC 820 821 Cow GGGTCAAACCACAGTTTCATACCCATTGTCCTTGAGTTAGTCCCACTAAAGTACTTTGAA 822 Carp GGAGCTAATCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCTCTCGAACACTTCGAA 823 Chicken GGAGCTAACCACAGCTACATACCCATTGTAGTAGAGTCTACCCCCCTAAAACACTTTGAA 824 Human GGAGCAAACCACAGTTTCATGCCCATCGTCCTAGAATTAATTCCCCTAAAAATCTTTGAA 825 Loach GGAGCAAACCACAGCTTTATACCCATCGTAGTAGAAGCGGTCCCACTATCTCACTTCGAA 826 Mouse GGATCTAACCATAGCTTTATGCCCATTGTCCTAGAAATGGTTCCACTAAAATATTTCGAA 827 Rat GGCTCAAATCACAGCTTCATACCCATTGTACTAGAAATAGTGCCTCTAAAATATTTCGAA 828 Seal GGTTCAAACCACAGCTTCATACCTATTGTCCTCGAATTGGTCCCACTATCCCACTTCGAG 829 Whale GGCTCAAACCACAGTTTCATACCAATTGTCCTAGAACTAGTACCCCTAGAAGTCTTTGAA 830 Frog GGAGCAAACCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCGCTAACCGACTTTGAA 831 832 Cow AAATGATCTGCGTCAATATTA---------------------TAA 833 Carp AACTGATCCTCATTAATACTAGAAGACGCCTCGCTAGGAAGCTAA 834 Chicken GCCTGATCCTCACTA------------------CTGTCATCTTAA 835 Human ATA---------------------GGGCCCGTATTTACCCTATAG 836 Loach AACTGGTCCACCCTTATACTAAAAGACGCCTCACTAGGAAGCTAA 837 Mouse AACTGATCTGCTTCAATAATT---------------------TAA 838 Rat AACTGATCAGCTTCTATAATT---------------------TAA 839 Seal AAATGATCTACCTCAATGCTT---------------------TAA 
840 Whale AAATGATCTGTATCAATACTA---------------------TAA 841 Frog AACTGATCTTCATCAATACTA---GAAGCATCACTA------AGA 842 ; 843 End; 844 """ 845 846 # This example uses amino acids, from here: 847 # http://www.molecularevolution.org/resources/fileformats/ 848 nxs_example3 = \ 849 """#NEXUS 850 851 Begin data; 852 Dimensions ntax=10 nchar=234; 853 Format datatype=protein gap=- interleave; 854 Matrix 855 Cow MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE 856 Carp MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQE 857 Chicken MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLS-SNTVDAQE 858 Human MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQE 859 Loach MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQE 860 Mouse MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE 861 Rat MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE 862 Seal MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQE 863 Whale MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQE 864 Frog MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQE 865 866 Cow VETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDS 867 Carp IEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDS 868 Chicken VELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDS 869 Human METVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNS 870 Loach IEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDS 871 Mouse VETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDS 872 Rat VETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDS 873 Seal VETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDS 874 Whale VETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDS 875 Frog IEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDS 876 877 Cow YMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN 878 Carp YMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLN 879 Chicken 
YMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLN 880 Human YMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLN 881 Loach YMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLN 882 Mouse YMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLN 883 Rat YMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLN 884 Seal YMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLN 885 Whale YMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLN 886 Frog YMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLH 887 888 Cow QTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML------- 889 Carp QAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS 890 Chicken QTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSL------LSS 891 Human QTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEM-------GPVFTL 892 Loach QTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS 893 Mouse QATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI------- 894 Rat QATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI------- 895 Seal QTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML------- 896 Whale QTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML------- 897 Frog QTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSML-EASL-- 898 ; 899 End; 900 """ 901 902 # This example with its slightly odd (partial) annotation is from here: 903 # http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html 904 sth_example = \ 905 """# STOCKHOLM 1.0 906 #=GF ID CBS 907 #=GF AC PF00571 908 #=GF DE CBS domain 909 #=GF AU Bateman A 910 #=GF CC CBS domains are small intracellular modules mostly found 911 #=GF CC in 2 or four copies within a protein. 
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA  999887756453524252..55152525....36463774777
O83071/259-312          MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS   ________________*__________________________
#=GR_O31699/88-139_IN   ____________1______________2__________0____
//
"""

# Interlaced example from BioPerl documentation.  Also note the blank line.
# http://www.bioperl.org/wiki/Stockholm_multiple_alignment_format
sth_example2 = \
"""# STOCKHOLM 1.0
#=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>..
AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU
#=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..--
AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU
#=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----

#=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>...............
944 AP001509.1 CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 945 #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 946 AE007476.1 UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 947 #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 948 //""" 949 950 # Sample GenBank record from here: 951 # http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html 952 gbk_example = \ 953 """LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999 954 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p 955 (AXL2) and Rev7p (REV7) genes, complete cds. 956 ACCESSION U49845 957 VERSION U49845.1 GI:1293613 958 KEYWORDS . 959 SOURCE Saccharomyces cerevisiae (baker's yeast) 960 ORGANISM Saccharomyces cerevisiae 961 Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; 962 Saccharomycetales; Saccharomycetaceae; Saccharomyces. 963 REFERENCE 1 (bases 1 to 5028) 964 AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W. 965 TITLE Cloning and sequence of REV7, a gene whose function is required for 966 DNA damage-induced mutagenesis in Saccharomyces cerevisiae 967 JOURNAL Yeast 10 (11), 1503-1509 (1994) 968 PUBMED 7871890 969 REFERENCE 2 (bases 1 to 5028) 970 AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M. 971 TITLE Selection of axial growth sites in yeast requires Axl2p, a novel 972 plasma membrane glycoprotein 973 JOURNAL Genes Dev. 10 (7), 777-793 (1996) 974 PUBMED 8846915 975 REFERENCE 3 (bases 1 to 5028) 976 AUTHORS Roemer,T. 
977 TITLE Direct Submission 978 JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New 979 Haven, CT, USA 980 FEATURES Location/Qualifiers 981 source 1..5028 982 /organism="Saccharomyces cerevisiae" 983 /db_xref="taxon:4932" 984 /chromosome="IX" 985 /map="9" 986 CDS <1..206 987 /codon_start=3 988 /product="TCP1-beta" 989 /protein_id="AAA98665.1" 990 /db_xref="GI:1293614" 991 /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA 992 AEVLLRVDNIIRARPRTANRQHM" 993 gene 687..3158 994 /gene="AXL2" 995 CDS 687..3158 996 /gene="AXL2" 997 /note="plasma membrane glycoprotein" 998 /codon_start=1 999 /function="required for axial budding pattern of S. 1000 cerevisiae" 1001 /product="Axl2p" 1002 /protein_id="AAA98666.1" 1003 /db_xref="GI:1293615" 1004 /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF 1005 TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN 1006 VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE 1007 VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE 1008 TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV 1009 YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG 1010 DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ 1011 DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA 1012 NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA 1013 CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN 1014 NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ 1015 SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS 1016 YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK 1017 HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL 1018 VDFSNKSNVNVGQVKDIHGRIPEML" 1019 gene complement(3300..4037) 1020 /gene="REV7" 1021 CDS complement(3300..4037) 1022 /gene="REV7" 1023 /codon_start=1 1024 /product="Rev7p" 1025 /protein_id="AAA98667.1" 1026 /db_xref="GI:1293616" 1027 /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ 1028 
FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD 1029 KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR 1030 RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK 1031 LISGDDKILNGVYSQYEEGESIFGSLF" 1032 ORIGIN 1033 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg 1034 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct 1035 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa 1036 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg 1037 241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa 1038 301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa 1039 361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat 1040 421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga 1041 481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc 1042 541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga 1043 601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta 1044 661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag 1045 721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa 1046 781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata 1047 841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga 1048 901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac 1049 961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg 1050 1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc 1051 1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa 1052 1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca 1053 1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac 1054 1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa 1055 1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag 1056 1381 gattttctgc cgttgaggta 
gaattcgaat tagtcatcgg ggctcaccag ttaactacct 1057 1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac 1058 1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa 1059 1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc 1060 1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata 1061 1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca 1062 1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc 1063 1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc 1064 1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca 1065 1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc 1066 1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg 1067 2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt 1068 2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc 1069 2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg 1070 2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca 1071 2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata 1072 2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg 1073 2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga 1074 2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt 1075 2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat 1076 2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt 1077 2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc 1078 2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag 1079 2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta 1080 2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa 1081 2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact 1082 2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc 
aacttcatct tctgacgatt 1083 3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa 1084 3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag 1085 3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct 1086 3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt 1087 3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact 1088 3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa 1089 3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg 1090 3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt 1091 3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc 1092 3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca 1093 3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc 1094 3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc 1095 3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat 1096 3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa 1097 3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga 1098 3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat 1099 3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc 1100 4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc 1101 4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa 1102 4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg 1103 4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc 1104 4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt 1105 4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg 1106 4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg 1107 4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt 1108 4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt 
1109 4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat 1110 4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc 1111 4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct 1112 4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta 1113 4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac 1114 4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct 1115 4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct 1116 4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc 1117 //""" 1118 1119 # GenBank format protein (aka GenPept) file from: 1120 # http://www.molecularevolution.org/resources/fileformats/ 1121 gbk_example2 = \ 1122 """LOCUS AAD51968 143 aa linear BCT 21-AUG-2001 1123 DEFINITION transcriptional regulator RovA [Yersinia enterocolitica]. 1124 ACCESSION AAD51968 1125 VERSION AAD51968.1 GI:5805369 1126 DBSOURCE locus AF171097 accession AF171097.1 1127 KEYWORDS . 1128 SOURCE Yersinia enterocolitica 1129 ORGANISM Yersinia enterocolitica 1130 Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; 1131 Enterobacteriaceae; Yersinia. 1132 REFERENCE 1 (residues 1 to 143) 1133 AUTHORS Revell,P.A. and Miller,V.L. 1134 TITLE A chromosomally encoded regulator is required for expression of the 1135 Yersinia enterocolitica inv gene and for virulence 1136 JOURNAL Mol. Microbiol. 35 (3), 677-685 (2000) 1137 MEDLINE 20138369 1138 PUBMED 10672189 1139 REFERENCE 2 (residues 1 to 143) 1140 AUTHORS Revell,P.A. and Miller,V.L. 1141 TITLE Direct Submission 1142 JOURNAL Submitted (22-JUL-1999) Molecular Microbiology, Washington 1143 University School of Medicine, Campus Box 8230, 660 South Euclid, 1144 St. Louis, MO 63110, USA 1145 COMMENT Method: conceptual translation. 
1146 FEATURES Location/Qualifiers 1147 source 1..143 1148 /organism="Yersinia enterocolitica" 1149 /mol_type="unassigned DNA" 1150 /strain="JB580v" 1151 /serotype="O:8" 1152 /db_xref="taxon:630" 1153 Protein 1..143 1154 /product="transcriptional regulator RovA" 1155 /name="regulates inv expression" 1156 CDS 1..143 1157 /gene="rovA" 1158 /coded_by="AF171097.1:380..811" 1159 /note="regulator of virulence" 1160 /transl_table=11 1161 ORIGIN 1162 1 mestlgsdla rlvrvwrali dhrlkplelt qthwvtlhni nrlppeqsqi qlakaigieq 1163 61 pslvrtldql eekglitrht candrrakri klteqsspii eqvdgvicst rkeilggisp 1164 121 deiellsgli dklerniiql qsk 1165 //""" 1166 1167 1168 swiss_example = \ 1169 """ID 104K_THEAN Reviewed; 893 AA. 1170 AC Q4U9M9; 1171 DT 18-APR-2006, integrated into UniProtKB/Swiss-Prot. 1172 DT 05-JUL-2005, sequence version 1. 1173 DT 31-OCT-2006, entry version 8. 1174 DE 104 kDa microneme-rhoptry antigen precursor (p104). 1175 GN ORFNames=TA08425; 1176 OS Theileria annulata. 1177 OC Eukaryota; Alveolata; Apicomplexa; Piroplasmida; Theileriidae; 1178 OC Theileria. 1179 OX NCBI_TaxID=5874; 1180 RN [1] 1181 RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. 
1182 RC STRAIN=Ankara; 1183 RX PubMed=15994557; DOI=10.1126/science.1110418; 1184 RA Pain A., Renauld H., Berriman M., Murphy L., Yeats C.A., Weir W., 1185 RA Kerhornou A., Aslett M., Bishop R., Bouchier C., Cochet M., 1186 RA Coulson R.M.R., Cronin A., de Villiers E.P., Fraser A., Fosker N., 1187 RA Gardner M., Goble A., Griffiths-Jones S., Harris D.E., Katzer F., 1188 RA Larke N., Lord A., Maser P., McKellar S., Mooney P., Morton F., 1189 RA Nene V., O'Neil S., Price C., Quail M.A., Rabbinowitsch E., 1190 RA Rawlings N.D., Rutter S., Saunders D., Seeger K., Shah T., Squares R., 1191 RA Squares S., Tivey A., Walker A.R., Woodward J., Dobbelaere D.A.E., 1192 RA Langsley G., Rajandream M.A., McKeever D., Shiels B., Tait A., 1193 RA Barrell B.G., Hall N.; 1194 RT "Genome of the host-cell transforming parasite Theileria annulata 1195 RT compared with T. parva."; 1196 RL Science 309:131-133(2005). 1197 CC -!- SUBCELLULAR LOCATION: Cell membrane; lipid-anchor; GPI-anchor 1198 CC (Potential). In microneme/rhoptry complexes (By similarity). 1199 DR EMBL; CR940353; CAI76474.1; -; Genomic_DNA. 1200 DR InterPro; IPR007480; DUF529. 1201 DR Pfam; PF04385; FAINT; 4. 1202 KW Complete proteome; GPI-anchor; Lipoprotein; Membrane; Repeat; Signal; 1203 KW Sporozoite. 1204 FT SIGNAL 1 19 Potential. 1205 FT CHAIN 20 873 104 kDa microneme-rhoptry antigen. 1206 FT /FTId=PRO_0000232680. 1207 FT PROPEP 874 893 Removed in mature form (Potential). 1208 FT /FTId=PRO_0000232681. 1209 FT COMPBIAS 215 220 Poly-Leu. 1210 FT COMPBIAS 486 683 Lys-rich. 1211 FT COMPBIAS 854 859 Poly-Arg. 1212 FT LIPID 873 873 GPI-anchor amidated aspartate 1213 FT (Potential). 
1214 SQ SEQUENCE 893 AA; 101921 MW; 2F67CEB3B02E7AC1 CRC64; 1215 MKFLVLLFNI LCLFPILGAD ELVMSPIPTT DVQPKVTFDI NSEVSSGPLY LNPVEMAGVK 1216 YLQLQRQPGV QVHKVVEGDI VIWENEEMPL YTCAIVTQNE VPYMAYVELL EDPDLIFFLK 1217 EGDQWAPIPE DQYLARLQQL RQQIHTESFF SLNLSFQHEN YKYEMVSSFQ HSIKMVVFTP 1218 KNGHICKMVY DKNIRIFKAL YNEYVTSVIG FFRGLKLLLL NIFVIDDRGM IGNKYFQLLD 1219 DKYAPISVQG YVATIPKLKD FAEPYHPIIL DISDIDYVNF YLGDATYHDP GFKIVPKTPQ 1220 CITKVVDGNE VIYESSNPSV ECVYKVTYYD KKNESMLRLD LNHSPPSYTS YYAKREGVWV 1221 TSTYIDLEEK IEELQDHRST ELDVMFMSDK DLNVVPLTNG NLEYFMVTPK PHRDIIIVFD 1222 GSEVLWYYEG LENHLVCTWI YVTEGAPRLV HLRVKDRIPQ NTDIYMVKFG EYWVRISKTQ 1223 YTQEIKKLIK KSKKKLPSIE EEDSDKHGGP PKGPEPPTGP GHSSSESKEH EDSKESKEPK 1224 EHGSPKETKE GEVTKKPGPA KEHKPSKIPV YTKRPEFPKK SKSPKRPESP KSPKRPVSPQ 1225 RPVSPKSPKR PESLDIPKSP KRPESPKSPK RPVSPQRPVS PRRPESPKSP KSPKSPKSPK 1226 VPFDPKFKEK LYDSYLDKAA KTKETVTLPP VLPTDESFTH TPIGEPTAEQ PDDIEPIEES 1227 VFIKETGILT EEVKTEDIHS ETGEPEEPKR PDSPTKHSPK PTGTHPSMPK KRRRSDGLAL 1228 STTDLESEAG RILRDPTGKI VTMKRSKSFD DLTTVREKEH MGAEIRKIVV DDDGTEADDE 1229 DTHPSKEKHL STVRRRRPRP KKSSKSSKPR KPDSAFVPSI IFIFLVSLIV GIL 1230 // 1231 ID 104K_THEPA Reviewed; 924 AA. 1232 AC P15711; Q4N2B5; 1233 DT 01-APR-1990, integrated into UniProtKB/Swiss-Prot. 1234 DT 01-APR-1990, sequence version 1. 1235 DT 31-OCT-2006, entry version 31. 1236 DE 104 kDa microneme-rhoptry antigen precursor (p104). 1237 GN OrderedLocusNames=TP04_0437; 1238 OS Theileria parva. 1239 OC Eukaryota; Alveolata; Apicomplexa; Piroplasmida; Theileriidae; 1240 OC Theileria. 1241 OX NCBI_TaxID=5875; 1242 RN [1] 1243 RP NUCLEOTIDE SEQUENCE [GENOMIC DNA]. 1244 RC STRAIN=Muguga; 1245 RX MEDLINE=90158697; PubMed=1689460; DOI=10.1016/0166-6851(90)90007-9; 1246 RA Iams K.P., Young J.R., Nene V., Desai J., Webster P., Ole-Moiyoi O.K., 1247 RA Musoke A.J.; 1248 RT "Characterisation of the gene encoding a 104-kilodalton microneme- 1249 RT rhoptry protein of Theileria parva."; 1250 RL Mol. Biochem. Parasitol. 39:47-60(1990). 
1251 RN [2] 1252 RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA]. 1253 RC STRAIN=Muguga; 1254 RX PubMed=15994558; DOI=10.1126/science.1110439; 1255 RA Gardner M.J., Bishop R., Shah T., de Villiers E.P., Carlton J.M., 1256 RA Hall N., Ren Q., Paulsen I.T., Pain A., Berriman M., Wilson R.J.M., 1257 RA Sato S., Ralph S.A., Mann D.J., Xiong Z., Shallom S.J., Weidman J., 1258 RA Jiang L., Lynn J., Weaver B., Shoaibi A., Domingo A.R., Wasawo D., 1259 RA Crabtree J., Wortman J.R., Haas B., Angiuoli S.V., Creasy T.H., Lu C., 1260 RA Suh B., Silva J.C., Utterback T.R., Feldblyum T.V., Pertea M., 1261 RA Allen J., Nierman W.C., Taracha E.L.N., Salzberg S.L., White O.R., 1262 RA Fitzhugh H.A., Morzaria S., Venter J.C., Fraser C.M., Nene V.; 1263 RT "Genome sequence of Theileria parva, a bovine pathogen that transforms 1264 RT lymphocytes."; 1265 RL Science 309:134-137(2005). 1266 CC -!- SUBCELLULAR LOCATION: Cell membrane; lipid-anchor; GPI-anchor 1267 CC (Potential). In microneme/rhoptry complexes. 1268 CC -!- DEVELOPMENTAL STAGE: Sporozoite antigen. 1269 DR EMBL; M29954; AAA18217.1; -; Unassigned_DNA. 1270 DR EMBL; AAGK01000004; EAN31789.1; -; Genomic_DNA. 1271 DR PIR; A44945; A44945. 1272 DR InterPro; IPR007480; DUF529. 1273 DR Pfam; PF04385; FAINT; 4. 1274 KW Complete proteome; GPI-anchor; Lipoprotein; Membrane; Repeat; Signal; 1275 KW Sporozoite. 1276 FT SIGNAL 1 19 Potential. 1277 FT CHAIN 20 904 104 kDa microneme-rhoptry antigen. 1278 FT /FTId=PRO_0000046081. 1279 FT PROPEP 905 924 Removed in mature form (Potential). 1280 FT /FTId=PRO_0000232679. 1281 FT COMPBIAS 508 753 Pro-rich. 1282 FT COMPBIAS 880 883 Poly-Arg. 1283 FT LIPID 904 904 GPI-anchor amidated aspartate 1284 FT (Potential). 
1285 SQ SEQUENCE 924 AA; 103626 MW; 289B4B554A61870E CRC64; 1286 MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL 1287 QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG 1288 DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN 1289 GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK 1290 YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI 1291 TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT 1292 THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS 1293 EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT 1294 QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS 1295 SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR 1296 PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD 1297 DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK 1298 DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR 1299 SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL 1300 TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP 1301 KKPDSAYIPS ILAILVVSLI VGIL 1302 // 1303 ID 108_SOLLC Reviewed; 102 AA. 1304 AC Q43495; 1305 DT 15-JUL-1999, integrated into UniProtKB/Swiss-Prot. 1306 DT 01-NOV-1996, sequence version 1. 1307 DT 31-OCT-2006, entry version 37. 1308 DE Protein 108 precursor. 1309 OS Solanum lycopersicum (Tomato) (Lycopersicon esculentum). 1310 OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; 1311 OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; 1312 OC asterids; lamiids; Solanales; Solanaceae; Solanum; Lycopersicon. 1313 OX NCBI_TaxID=4081; 1314 RN [1] 1315 RP NUCLEOTIDE SEQUENCE [MRNA]. 1316 RC STRAIN=cv. 
VF36; TISSUE=Anther;
RX   MEDLINE=94143497; PubMed=8310077; DOI=10.1104/pp.101.4.1413;
RA   Chen R., Smith A.G.;
RT   "Nucleotide sequence of a stamen- and tapetum-specific gene from
RT   Lycopersicon esculentum.";
RL   Plant Physiol. 101:1413-1413(1993).
CC   -!- TISSUE SPECIFICITY: Stamen- and tapetum-specific.
CC   -!- SIMILARITY: Belongs to the A9/FIL1 family.
DR   EMBL; Z14088; CAA78466.1; -; mRNA.
DR   PIR; S26409; S26409.
DR   InterPro; IPR013770; LPT_helical.
DR   InterPro; IPR003612; LTP/seed_store/tryp_amyl_inhib.
DR   Pfam; PF00234; Tryp_alpha_amyl; 1.
DR   SMART; SM00499; AAI; 1.
KW   Signal.
FT   SIGNAL        1     30       Potential.
FT   CHAIN        31    102       Protein 108.
FT                                /FTId=PRO_0000000238.
FT   DISULFID     41     77       By similarity.
FT   DISULFID     51     66       By similarity.
FT   DISULFID     67     92       By similarity.
FT   DISULFID     79     99       By similarity.
SQ   SEQUENCE   102 AA;  10576 MW;  CFBAA1231C3A5E92 CRC64;
     MASVKSSSSS SSSSFISLLL LILLVIVLQS QVIECQPQQS CTASLTGLNV CAPFLVPGSP
     TASTECCNAV QSINHDCMCN TMRIAAQIPA QCNLPPLSCS AN
//
"""

print "#########################################################"
print "# Sequence Input Tests                                  #"
print "#########################################################"

#ToDo - Check alphabet, or at least DNA/amino acid, for those
#       file types that specify it (e.g.
#       Nexus, GenBank)
tests = [
    (aln_example, "clustal", 8, "HISJ_E_COLI",
     "MKKLVLSLSLVLAFSSATAAF-------------------AAIPQNIRIG" + \
     "TDPTYAPFESKNS-QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPS" + \
     "LKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAKNSDIQP-TVE" + \
     "SLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGQDNIYSDLTAGRIDA" + \
     "AFQDEVAASEGFLKQPVGKDYKFGGPSVKDEKLFGVGTGMGLRKED--NE" + \
     "LREALNKAFAEMRADGTYEKLAKKYFDFDVYGG---", True),
    (phy_example, "phylip", 8, "HISJ_E_COL", None, False),
    (nxs_example, "nexus", 8, "HISJ_E_COLI", None, True),
    (nxs_example2, "nexus", 10, "Frog",
     "ATGGCACACCCATCACAATTAGGTTTTCAAGACGCAGCCTCTCCAATTATAGAAGAATTA" + \
     "CTTCACTTCCACGACCATACCCTCATAGCCGTTTTTCTTATTAGTACGCTAGTTCTTTAC" + \
     "ATTATTACTATTATAATAACTACTAAACTAACTAATACAAACCTAATGGACGCACAAGAG" + \
     "ATCGAAATAGTGTGAACTATTATACCAGCTATTAGCCTCATCATAATTGCCCTTCCATCC" + \
     "CTTCGTATCCTATATTTAATAGATGAAGTTAATGATCCACACTTAACAATTAAAGCAATC" + \
     "GGCCACCAATGATACTGAAGCTACGAATATACTAACTATGAGGATCTCTCATTTGACTCT" + \
     "TATATAATTCCAACTAATGACCTTACCCCTGGACAATTCCGGCTGCTAGAAGTTGATAAT" + \
     "CGAATAGTAGTCCCAATAGAATCTCCAACCCGACTTTTAGTTACAGCCGAAGACGTCCTC" + \
     "CACTCGTGAGCTGTACCCTCCTTGGGTGTCAAAACAGATGCAATCCCAGGACGACTTCAT" + \
     "CAAACATCATTTATTGCTACTCGTCCGGGAGTATTTTACGGACAATGTTCAGAAATTTGC" + \
     "GGAGCAAACCACAGCTTTATACCAATTGTAGTTGAAGCAGTACCGCTAACCGACTTTGAA" + \
     "AACTGATCTTCATCAATACTA---GAAGCATCACTA------AGA", True),
    (nxs_example3, "nexus", 10, "Frog",
     'MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQE' + \
     'IEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDS' + \
     'YMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLH' + \
     'QTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSML-EASL--', True),
    (faa_example, "fasta", 8, "HISJ_E_COLI",
     'mkklvlslslvlafssataafaaipqnirigtdptyapfesknsqgelvgfdidlakelc' + \
     'krintqctfvenpldalipslkakkidaimsslsitekrqqeiaftdklyaadsrlvvak' + \
     'nsdiqptveslkgkrvgvlqgttqetfgnehwapkgieivsyqgqdniysdltagridaa' + \
     'fqdevaasegflkqpvgkdykfggpsvkdeklfgvgtgmglrkednelrealnkafaemr' + \
     'adgtyeklakkyfdfdvygg', True),
    (sth_example, "stockholm", 5, "O31699/88-139",
     'EVMLTDIPRLHINDPIMK--GFGMVINN------GFVCVENDE', True),
    (sth_example2, "stockholm", 2, "AE007476.1",
     'AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU' + \
     'UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU', True),
    (gbk_example, "genbank", 1, "U49845.1", None, True),
    (gbk_example2,"genbank", 1, 'AAD51968.1',
     "MESTLGSDLARLVRVWRALIDHRLKPLELTQTHWVTLHNINRLPPEQSQIQLAKAIGIEQ" + \
     "PSLVRTLDQLEEKGLITRHTCANDRRAKRIKLTEQSSPIIEQVDGVICSTRKEILGGISP" + \
     "DEIELLSGLIDKLERNIIQLQSK", True),
    (gbk_example, "genbank-cds", 3, "AAA98667.1",
     'MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQFVPINRHPALIDYIEE' + \
     'LILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVDKDDQIITETEVFDEFRSS' + \
     'LNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNRRVDSLEEKAEIERDSNWVKC' + \
     'QEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEKLISGDDKILNGVYSQYEEGESI' + \
     'FGSLF', True),
    (swiss_example,"swiss", 3, "Q43495",
     "MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP" + \
     "TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN", True),
    ]

for (data, format, rec_count, last_id, last_seq, dict_check) in tests:

    print "%s file with %i records" % (format, rec_count)

    print "Bio.SeqIO.parse(handle)"

    #Basic check, turning the iterator into a list...
    #This uses "for x in iterator" internally.
    iterator = parse(StringIO(data), format=format)
    as_list = list(iterator)
    assert len(as_list) == rec_count, \
        "Expected %i records, found %i" \
        % (rec_count, len(as_list))
    assert as_list[-1].id == last_id, \
        "Expected '%s' as last record ID, found '%s'" \
        % (last_id, as_list[-1].id)
    if last_seq :
        assert as_list[-1].seq.tostring() == last_seq

    #Test iteration including use of the next() method and "for x in iterator"
    iterator = parse(StringIO(data), format=format)
    count = 1
    record = iterator.next()
    assert record is not None
    assert str(record.__class__) == "Bio.SeqRecord.SeqRecord"
    #print record
    for record in iterator :
        assert record.id == as_list[count].id
        assert record.seq.tostring() == as_list[count].seq.tostring()
        count = count + 1
    assert count == rec_count
    assert record is not None
    assert record.id == last_id

    #Test iteration using just next() method
    iterator = parse(StringIO(data), format=format)
    count = 0
    while True :
        try :
            record = iterator.next()
        except StopIteration :
            break
        if record is None : break
        assert record.id == as_list[count].id
        assert record.seq.tostring() == as_list[count].seq.tostring()
        count=count+1
    assert count == rec_count

    print "parse(...)"
    iterator = parse(StringIO(data), format=format)
    for (i, record) in enumerate(iterator) :
        assert record.id == as_list[i].id
        assert record.seq.tostring() == as_list[i].seq.tostring()
    assert i+1 == rec_count

    print "parse(handle to empty file)"
    iterator = parse(StringIO(""), format=format)
    assert len(list(iterator))==0

    if dict_check :
        print "to_dict(parse(...))"
        seq_dict = to_dict(parse(StringIO(data), format=format))
        assert Set(seq_dict.keys()) == Set([r.id for r in as_list])
        assert last_id in seq_dict
        assert seq_dict[last_id].seq.tostring() == as_list[-1].seq.tostring()

    if len(Set([len(r.seq) for r in as_list]))==1 :
        #All the sequences in the example are the same length,
        #so it makes sense to try turning this file into an alignment.
        print "to_alignment(parse(handle))"
        alignment = to_alignment(parse(handle = StringIO(data), format=format))
        assert len(alignment._records)==rec_count
        assert alignment.get_alignment_length() == len(as_list[0].seq)
        for i in range(0, rec_count) :
            assert as_list[i].id == alignment._records[i].id
            assert as_list[i].id == alignment.get_all_seqs()[i].id
            assert as_list[i].seq.tostring() == alignment._records[i].seq.tostring()
            assert as_list[i].seq.tostring() == alignment.get_all_seqs()[i].seq.tostring()

    print "read(...)"
    if rec_count == 1 :
        record = read(StringIO(data), format)
        assert isinstance(record, SeqRecord)
    else :
        try :
            record = read(StringIO(data), format)
            assert False, "Should have failed"
        except ValueError :
            #Expected to fail
            pass

    print

print "Checking phy <-> aln examples agree using list(parse(...))"
#Only compare the first 10 characters of the record.id as they
#are truncated in the phylip file.  Cannot use to_dict(parse(...))
#on the phylip file as there is a repeated id.
aln_list = list(parse(StringIO(aln_example), format="clustal"))
phy_list = list(parse(StringIO(phy_example), format="phylip"))
assert len(aln_list) == len(phy_list)
assert Set([r.id[0:10] for r in aln_list]) == Set([r.id for r in phy_list])
for i in range(0, len(aln_list)) :
    assert aln_list[i].id[0:10] == phy_list[i].id
    assert aln_list[i].seq.tostring() == phy_list[i].seq.tostring()

print "Checking nxs <-> aln examples agree using parse"
#Only compare the first 10 characters of the record.id as they
#are truncated in the phylip file.
#Cannot use to_dict(parse(...))
#on the phylip file as there is a repeated id.
aln_iter = parse(StringIO(aln_example), format="clustal")
nxs_iter = parse(StringIO(nxs_example), format="nexus")
while True :
    try :
        aln_record = aln_iter.next()
    except StopIteration :
        aln_record = None
    try :
        nxs_record = nxs_iter.next()
    except StopIteration :
        nxs_record = None
    if aln_record is None or nxs_record is None :
        assert aln_record is None
        assert nxs_record is None
        break
    assert aln_record.id == nxs_record.id
    assert aln_record.seq.tostring() == nxs_record.seq.tostring()

print "Checking faa <-> aln examples agree using to_dict(parse(...))"
#In my examples, aln_example is an alignment of faa_example
aln_dict = to_dict(parse(StringIO(aln_example), format="clustal"))
faa_dict = to_dict(parse(StringIO(faa_example), format="fasta"))

ids = Set(aln_dict.keys())
assert ids == Set(faa_dict.keys())

for id in ids :
    #The aln file contains gaps as "-", and this fasta file does not
    assert aln_dict[id].seq.tostring().upper().replace("-","") == \
           faa_dict[id].seq.tostring().upper()

print
print "#########################################################"
print "# Sequence Output Tests #"
print "#########################################################"
print

general_output_formats = _FormatToWriter.keys()
alignment_formats = ["phylip","stockholm","clustal"]
for (in_data, in_format, rec_count, last_id, last_seq, unique_ids) in tests:
    if unique_ids :
        in_list = list(parse(StringIO(in_data), format=in_format))
        seq_lengths = [len(r.seq) for r in in_list]
        output_formats = general_output_formats[:]
        if min(seq_lengths)==max(seq_lengths) :
            output_formats.extend(alignment_formats)
            print "Checking conversion from %s (including to alignment formats)" % in_format
        else :
            print "Checking conversion from %s (excluding alignment formats)" % in_format
        for out_format in output_formats :
            print "Converting %s iterator -> %s" % (in_format, out_format)
            output = open("temp.txt","w")
            iterator = parse(StringIO(in_data), format=in_format)
            #I am using an iterator here deliberately, as some format
            #writers (e.g. phylip and stockholm) will have to cope with
            #this and get the record count.

            try :
                write(iterator, output, out_format)
            except ValueError, e:
                print "FAILED: %s" % str(e)
                #Try next format instead...
                continue

            output.close()

            print "Checking %s <-> %s" % (in_format, out_format)
            out_list = list(parse(open("temp.txt","rU"), format=out_format))

            assert rec_count == len(out_list)
            if last_seq :
                assert last_seq == out_list[-1].seq.tostring()
            if out_format=="phylip" :
                assert last_id[0:10] == out_list[-1].id
            else :
                assert last_id == out_list[-1].id

            for i in range(0, rec_count) :
                #Compare each record in turn, not just the last one
                assert in_list[i].seq.tostring() == out_list[i].seq.tostring()
                if out_format=="phylip" :
                    assert in_list[i].id[0:10] == out_list[i].id
                else :
                    assert in_list[i].id == out_list[i].id
        print

print "#########################################################"
print "# SeqIO Tests finished #"
print "#########################################################"