1
2
3
4
5
6
7
8
9 """Bio.SeqIO support for the "fastq" and "qual" file formats.
10
11 Note that you are expected to use this code via the Bio.SeqIO interface, as
12 shown below.
13
14 The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
15 to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
16 90). Rather than using a single FASTQ file, often paired FASTA and QUAL files
17 are used containing the sequence and the quality information separately.
18
19 The PHRED software reads DNA sequencing trace files, calls bases, and
20 assigns a quality value between 0 and 90 to each called base using a logged
21 transformation of the error probability, Q = -10 log10( Pe ), for example::
22
23 Pe = 0.0, Q = 0
24 Pe = 0.1, Q = 10
25 Pe = 0.01, Q = 20
26 ...
27 Pe = 0.00000001, Q = 80
28 Pe = 0.000000001, Q = 90
29
30 In the QUAL format these quality values are held as space separated text in
31 a FASTA like file format. In the FASTQ format, each quality values is encoded
32 with a single ASCI character using chr(Q+33), meaning zero maps to the
33 character "!" and for example 80 maps to "q". The sequences and quality are
34 then stored in pairs in a FASTA like format.
35
36 Unfortunately there is no official document describing the FASTQ file format,
37 and worse, several related but different variants exist. Reasonable
38 documentation exists at: http://maq.sourceforge.net/fastq.shtml
39
40 Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
41 be negative or easily exceed 90. PHRED scores and Solexa scores are NOT
42 interchangeable (but a reasonable mapping can be achieved between them).
43 Confusingly Solexa produces a FASTQ like file but using their own score
44 mapping instead.
45
46 Also note that Roche 454 sequencers can output files in the QUAL format, and
47 thankfully they use PHREP style scores like Sanger. To extract QUAL files from
48 a Roche 454 SFF binary file, use the Roche off instrument command line tool
49 "sffinfo" with the -q or -qual argument. You can extract a matching FASTA file
50 using the -s or -seq argument instead.
51
52 You are expected to use this module via the Bio.SeqIO functions, with the
53 following format names:
54 - "fastq" means Sanger style FASTQ files using PHRED scores.
55 - "fastq-solexa" means Solexa/Illumina style FASTQ files.
56 - "qual" means simple quality files using PHRED scores.
57
58 For example, consider the following short FASTQ file (extracted from a real
59 NCBI dataset)::
60
61 @EAS54_6_R1_2_1_413_324
62 CCCTTCTTGTCTTCAGCGTTTCTCC
63 +
64 ;;3;;;;;;;;;;;;7;;;;;;;88
65 @EAS54_6_R1_2_1_540_792
66 TTGGCAGGCCAAGGCCGATGGATCA
67 +
68 ;;;;;;;;;;;7;;;;;-;;;3;83
69 @EAS54_6_R1_2_1_443_348
70 GTTGCTTCTGGCGTGGGTGGGGGGG
71 +
72 ;;;;;;;;;;;9;7;;.7;393333
73
74 This contains three reads of length 25. From the read length these were
75 probably originally from an early Solexa/Illumina sequencer but NCBI have
76 followed the Sanger FASTQ convention and this actually uses PHRED style
77 qualities. This means we can parse this file using Bio.SeqIO using "fastq"
78 as the format name:
79
80 >>> from Bio import SeqIO
81 >>> for record in SeqIO.parse(open("Quality/example.fastq"), "fastq") :
82 ... print record.id, record.seq
83 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
84 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
85 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
86
87 The qualities are held as a list of integers in each record's annotation:
88
89 >>> print record
90 ID: EAS54_6_R1_2_1_443_348
91 Name: EAS54_6_R1_2_1_443_348
92 Description: EAS54_6_R1_2_1_443_348
93 Number of features: 0
94 Per letter annotation for: phred_quality
95 Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
96 >>> print record.letter_annotations["phred_quality"]
97 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
98
99 You can use the SeqRecord format method you can show this in the QUAL format:
100
101 >>> print record.format("qual")
102 >EAS54_6_R1_2_1_443_348
103 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
104 24 18 18 18 18
105 <BLANKLINE>
106
107 Or go back to the FASTQ format,
108
109 >>> print record.format("fastq")
110 @EAS54_6_R1_2_1_443_348
111 GTTGCTTCTGGCGTGGGTGGGGGGG
112 +
113 ;;;;;;;;;;;9;7;;.7;393333
114 <BLANKLINE>
115
116 You can also get Biopython to convert the scores and show a Solexa style
117 FASTQ file:
118
119 >>> print record.format("fastq-solexa")
120 @EAS54_6_R1_2_1_443_348
121 GTTGCTTCTGGCGTGGGTGGGGGGG
122 +
123 ZZZZZZZZZZZXZVZZMVZRXRRRR
124 <BLANKLINE>
125
126 If you wanted to trim your sequences (perhaps to remove low quality regions,
127 or to remove a primer sequence), try slicing the SeqRecord objects. e.g.
128
129 >>> sub_rec = record[5:15]
130 >>> print sub_rec
131 ID: EAS54_6_R1_2_1_443_348
132 Name: EAS54_6_R1_2_1_443_348
133 Description: EAS54_6_R1_2_1_443_348
134 Number of features: 0
135 Per letter annotation for: phred_quality
136 Seq('TTCTGGCGTG', SingleLetterAlphabet())
137 >>> print sub_rec.letter_annotations["phred_quality"]
138 [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
139 >>> print sub_rec.format("fastq")
140 @EAS54_6_R1_2_1_443_348
141 TTCTGGCGTG
142 +
143 ;;;;;;9;7;
144 <BLANKLINE>
145
146 If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
147
148 >>> from Bio import SeqIO
149 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
150 >>> out_handle = open("Quality/temp.qual", "w")
151 >>> SeqIO.write(record_iterator, out_handle, "qual")
152 3
153 >>> out_handle.close()
154
155 You can of course read in a QUAL file, such as the one we just created:
156
157 >>> from Bio import SeqIO
158 >>> for record in SeqIO.parse(open("Quality/temp.qual"), "qual") :
159 ... print record.id, record.seq
160 EAS54_6_R1_2_1_413_324 ?????????????????????????
161 EAS54_6_R1_2_1_540_792 ?????????????????????????
162 EAS54_6_R1_2_1_443_348 ?????????????????????????
163
164 Notice that QUAL files don't have a proper sequence present! But the quality
165 information is there:
166
167 >>> print record
168 ID: EAS54_6_R1_2_1_443_348
169 Name: EAS54_6_R1_2_1_443_348
170 Description: EAS54_6_R1_2_1_443_348
171 Number of features: 0
172 Per letter annotation for: phred_quality
173 UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
174 >>> print record.letter_annotations["phred_quality"]
175 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
176
177 Just to keep things tidy, if you are following this example yourself, you can
178 delete this temporary file now:
179
180 >>> import os
181 >>> os.remove("Quality/temp.qual")
182
183 Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
184 files. Because the Bio.SeqIO system is designed for reading single files, you
185 would have to read the two in separately and then combine the data. However,
186 since this is such a common thing to want to do, there is a helper iterator
187 defined in this module that does this for you - PairedFastaQualIterator.
188
189 Alternatively, if you have enough RAM to hold all the records in memory at once,
190 then a simple dictionary approach would work:
191
192 >>> from Bio import SeqIO
193 >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta"))
194 >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual") :
195 ... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
196
197 You can then access any record by its key, and get both the sequence and the
198 quality scores.
199
200 >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq")
201 @EAS54_6_R1_2_1_540_792
202 TTGGCAGGCCAAGGCCGATGGATCA
203 +
204 ;;;;;;;;;;;7;;;;;-;;;3;83
205 <BLANKLINE>
206
207 It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
208 using ("fastq" for the Sanger standard using PHRED values, or "fastq-solexa"
209 for the Solexa/Illumina variant), as this cannot be detected reliably
210 automatically.
211 """
212 __docformat__ = "epytext en"
213
214
215
216 from Bio.Alphabet import single_letter_alphabet
217 from Bio.Seq import Seq, UnknownSeq
218 from Bio.SeqRecord import SeqRecord
219 from Interfaces import SequentialSequenceWriter
220 from math import log
221
222
223
224 SANGER_SCORE_OFFSET = 33
225 SOLEXA_SCORE_OFFSET = 64
226
228 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
229
230 This will return a floating point number, it is up to you to round this to
231 the nearest integer if appropriate. e.g.
232
233 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
234 80.00
235 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
236 50.00
237 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
238 19.96
239 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
240 9.54
241 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
242 -5.87
243 """
244 return 10*log(10**(phred_quality/10.0) - 1, 10)
245
247 """Convert a Solexa quality (which can be negative) to a PHRED quality.
248
249 This will return a floating point number, it is up to you to round this to
250 the nearest integer if appropriate. e.g.
251
252 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
253 80.00
254 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
255 20.04
256 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
257 10.41
258 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
259 3.01
260 >>> print "%0.2f" % round(phred_quality_from_solexa(-10),2)
261 0.41
262 """
263 return 10*log(10**(solexa_quality/10.0) + 1, 10)
264
266 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
267
268 If there are no PHRED qualities, but there are Solexa qualities, those are
269 used instead after conversion.
270 """
271 try :
272 return record.letter_annotations["phred_quality"]
273 except KeyError :
274 pass
275 try :
276 return [phred_quality_from_solexa(q) for \
277 q in record.letter_annotations["solexa_quality"]]
278 except KeyError :
279 raise ValueError("No suitable quality scores found in letter_annotations "
280 "of SeqRecord (id=%s)." % record.id)
281
283 """Extract Solexa qualities from a SeqRecord's letter_annotations (PRIVATE).
284
285 If there are no Solexa qualities, but there are PHRED qualities, those are
286 used instead after conversion.
287 """
288 try :
289 return record.letter_annotations["solexa_quality"]
290 except KeyError :
291 pass
292 try :
293 return [solexa_quality_from_phred(q) for \
294 q in record.letter_annotations["phred_quality"]]
295 except KeyError :
296 raise ValueError("No suitable quality scores found in letter_annotation "
297 "of SeqRecord (id=%s)." % record.id)
298
299
300
302 """Iterate over Fastq records as string tuples (not as SeqRecord objects).
303
304 This code does not try to interpret the quality string numerically. It
305 just returns tuples of the title, sequence and quality as strings. For
306 the sequence and quality, any whitespace (such as new lines) is removed.
307
308 Our SeqRecord based FASTQ iterators call this function internally, and then
309 turn the strings into a SeqRecord objects, mapping the quality string into
310 a list of numerical scores. If you want to do a custom quality mapping,
311 then you might consider calling this function directly.
312
313 For parsing FASTQ files, the title string from the "@" line at the start
314 of each record can optionally be omitted on the "+" lines. If it is
315 repeated, it must be identical.
316
317 The sequence string and the quality string can optionally be split over
318 multiple lines, although several sources discourage this. In comparison,
319 for the FASTA file format line breaks between 60 and 80 characters are
320 the norm.
321
322 WARNING - Because the "@" character can appear in the quality string,
323 this can cause problems as this is also the marker for the start of
324 a new sequence. In fact, the "+" sign can also appear as well. Some
325 sources recommended having no line breaks in the quality to avoid this,
326 but even that is not enough, consider this example::
327
328 @071113_EAS56_0053:1:1:998:236
329 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
330 +071113_EAS56_0053:1:1:998:236
331 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
332 @071113_EAS56_0053:1:1:182:712
333 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
334 +
335 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
336 @071113_EAS56_0053:1:1:153:10
337 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
338 +
339 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
340 @071113_EAS56_0053:1:3:990:501
341 TGGGAGGTTTTATGTGGA
342 AAGCAGCAATGTACAAGA
343 +
344 IIIIIII.IIIIII1@44
345 @-7.%<&+/$/%4(++(%
346
347 This is four PHRED encoded FASTQ entries originally from an NCBI source
348 (given the read length of 36, these are probably Solexa Illumna reads where
349 the quality has been mapped onto the PHRED values).
350
351 This example has been edited to illustrate some of the nasty things allowed
352 in the FASTQ format. Firstly, on the "+" lines most but not all of the
353 (redundant) identifiers are ommited. In real files it is likely that all or
354 none of these extra identifiers will be present.
355
356 Secondly, while the first three sequences have been shown without line
357 breaks, the last has been split over multiple lines. In real files any line
358 breaks are likely to be consistent.
359
360 Thirdly, some of the quality string lines start with an "@" character. For
361 the second record this is unavoidable. However for the fourth sequence this
362 only happens because its quality string is split over two lines. A naive
363 parser could wrongly treat any line starting with an "@" as the beginning of
364 a new sequence! This code copes with this possible ambiguity by keeping track
365 of the length of the sequence which gives the expected length of the quality
366 string.
367
368 Using this tricky example file as input, this short bit of code demonstrates
369 what this parsing function would return:
370
371 >>> handle = open("Quality/tricky.fastq", "rU")
372 >>> for (title, sequence, quality) in FastqGeneralIterator(handle) :
373 ... print title
374 ... print sequence, quality
375 071113_EAS56_0053:1:1:998:236
376 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
377 071113_EAS56_0053:1:1:182:712
378 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
379 071113_EAS56_0053:1:1:153:10
380 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
381 071113_EAS56_0053:1:3:990:501
382 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
383 >>> handle.close()
384
385 Finally we note that some sources state that the quality string should
386 start with "!" (which using the PHRED mapping means the first letter always
387 has a quality score of zero). This rather restrictive rule is not widely
388 observed, so is therefore ignored here. One plus point about this "!" rule
389 is that (provided there are no line breaks in the quality sequence) it
390 would prevent the above problem with the "@" character.
391 """
392
393 while True :
394 line = handle.readline()
395 if line == "" : return
396 if line[0] == "@" :
397 break
398
399 while True :
400 if line[0]!="@" :
401 raise ValueError("Records in Fastq files should start with '@' character")
402 title_line = line[1:].rstrip()
403
404 seq_lines = []
405 line = handle.readline()
406 while True:
407 if not line :
408 raise ValueError("End of file without quality information.")
409 if line[0] == "+":
410
411 if line[1:].rstrip() and line[1:].rstrip() != title_line :
412 raise ValueError("Sequence and quality captions differ.")
413 break
414 seq_lines.extend(line.split())
415 line = handle.readline()
416
417 seq_string = "".join(seq_lines)
418 del seq_lines
419
420 quality_lines = []
421 line = handle.readline()
422 while True:
423 if not line : break
424 if line[0] == "@":
425
426
427
428
429 if len("".join(quality_lines)) >= len(seq_string) :
430
431
432 break
433
434
435 quality_lines.extend(line.split())
436 line = handle.readline()
437
438 quality_string = "".join(quality_lines)
439 del quality_lines
440
441 if len(seq_string) != len(quality_string) :
442 raise ValueError("Lengths of sequence and quality values differs "
443 " for %s (%i and %i)." \
444 % (title_line, len(seq_string), len(quality_string)))
445
446
447 yield (title_line, seq_string, quality_string)
448 if not line : return
449 assert False, "Should not reach this line"
450
451
453 """Generator function to iterate over FASTQ records (as SeqRecord objects).
454
455 - handle - input file
456 - alphabet - optional alphabet
457 - title2ids - A function that, when given the title line from the FASTQ
458 file (without the beginning >), will return the id, name and
459 description (in that order) for the record as a tuple of
460 strings. If this is not given, then the entire title line
461 will be used as the description, and the first word as the
462 id and name.
463
464 Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
465
466 For each sequence in a (Sanger style) FASTQ file there is a matching string
467 encoding the PHRED qualities (integers between 0 and about 90) using ASCII
468 values with an offset of 33.
469
470 For example, consider a file containing three short reads::
471
472 @EAS54_6_R1_2_1_413_324
473 CCCTTCTTGTCTTCAGCGTTTCTCC
474 +
475 ;;3;;;;;;;;;;;;7;;;;;;;88
476 @EAS54_6_R1_2_1_540_792
477 TTGGCAGGCCAAGGCCGATGGATCA
478 +
479 ;;;;;;;;;;;7;;;;;-;;;3;83
480 @EAS54_6_R1_2_1_443_348
481 GTTGCTTCTGGCGTGGGTGGGGGGG
482 +
483 ;;;;;;;;;;;9;7;;.7;393333
484
485 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
486 string encoding the PHRED qualities using a ASCI values with an offset of
487 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
488
489 Using this module directly you might run:
490
491 >>> handle = open("Quality/example.fastq", "rU")
492 >>> for record in FastqPhredIterator(handle) :
493 ... print record.id, record.seq
494 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
495 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
496 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
497 >>> handle.close()
498
499 Typically however, you would call this via Bio.SeqIO instead with "fastq" as
500 the format:
501
502 >>> from Bio import SeqIO
503 >>> handle = open("Quality/example.fastq", "rU")
504 >>> for record in SeqIO.parse(handle, "fastq") :
505 ... print record.id, record.seq
506 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
507 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
508 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
509 >>> handle.close()
510
511 If you want to look at the qualities, they are record in each record's
512 per-letter-annotation dictionary as a simple list of integers:
513
514 >>> print record.letter_annotations["phred_quality"]
515 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
516 """
517 for title_line, seq_string, quality_string in FastqGeneralIterator(handle) :
518 if title2ids :
519 id, name, descr = title2ids(title_line)
520 else :
521 descr = title_line
522 id = descr.split()[0]
523 name = id
524 record = SeqRecord(Seq(seq_string, alphabet),
525 id=id, name=name, description=descr)
526
527 assert SANGER_SCORE_OFFSET == ord("!")
528
529
530
531
532
533
534
535
536
537 qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string]
538 if qualities :
539 if min(qualities) < 0 or max(qualities) > 90 :
540 raise ValueError("Quality score outside 0 to 90 found - these are perhaps "
541 "in a Solexa/Illumina format, not the Sanger FASTQ format "
542 "which uses PHRED scores.")
543 record.letter_annotations["phred_quality"] = qualities
544 yield record
545
546
548 """Parsing the Solexa/Illumina FASTQ like files (which differ in the quality mapping).
549
550 The optional arguments are the same as those for the FastqPhredIterator.
551
552 For each sequence in Solexa/Illumina FASTQ files there is a matching string
553 encoding the Solexa integer qualities using ASCII values with an offset
554 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython
555 will NOT perform any automatic conversion when loading.
556
557 For example, consider a file containing these five records::
558
559 @SLXA-B3_649_FC8437_R1_1_1_610_79
560 GATGTGCAATACCTTTGTAGAGGAA
561 +SLXA-B3_649_FC8437_R1_1_1_610_79
562 YYYYYYYYYYYYYYYYYYWYWYYSU
563 @SLXA-B3_649_FC8437_R1_1_1_397_389
564 GGTTTGAGAAAGAGAAATGAGATAA
565 +SLXA-B3_649_FC8437_R1_1_1_397_389
566 YYYYYYYYYWYYYYWWYYYWYWYWW
567 @SLXA-B3_649_FC8437_R1_1_1_850_123
568 GAGGGTGTTGATCATGATGATGGCG
569 +SLXA-B3_649_FC8437_R1_1_1_850_123
570 YYYYYYYYYYYYYWYYWYYSYYYSY
571 @SLXA-B3_649_FC8437_R1_1_1_362_549
572 GGAAACAAAGTTTTTCTCAACATAG
573 +SLXA-B3_649_FC8437_R1_1_1_362_549
574 YYYYYYYYYYYYYYYYYYWWWWYWY
575 @SLXA-B3_649_FC8437_R1_1_1_183_714
576 GTATTATTTAATGGCATACACTCAA
577 +SLXA-B3_649_FC8437_R1_1_1_183_714
578 YYYYYYYYYYWYYYYWYWWUWWWQQ
579
580 Using this module directly you might run:
581
582 >>> handle = open("Quality/solexa_example.fastq", "rU")
583 >>> for record in FastqSolexaIterator(handle) :
584 ... print record.id, record.seq
585 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
586 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
587 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
588 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
589 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
590 >>> handle.close()
591
592 Typically however, you would call this via Bio.SeqIO instead with "fastq" as
593 the format:
594
595 >>> from Bio import SeqIO
596 >>> handle = open("Quality/solexa_example.fastq", "rU")
597 >>> for record in SeqIO.parse(handle, "fastq-solexa") :
598 ... print record.id, record.seq
599 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
600 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
601 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
602 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
603 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
604 >>> handle.close()
605
606 If you want to look at the qualities, they are recorded in each record's
607 per-letter-annotation dictionary as a simple list of integers:
608
609 >>> print record.letter_annotations["solexa_quality"]
610 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
611
612 These scores aren't very good, but they are high enough that they map
613 almost exactly onto PHRED scores:
614
615 >>> print "%0.2f" % phred_quality_from_solexa(25)
616 25.01
617
618 Let's look at another example read which is even worse, where there are
619 more noticeable differences between the Solexa and PHRED scores::
620
621 @slxa_0013_1_0001_24
622 ACAAAAATCACAAGCATTCTTATACACC
623 +slxa_0013_1_0001_24
624 ??????????????????:??<?<-6%.
625
626 Again, you would typically use Bio.SeqIO to read this file in (rather than
627 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will
628 contain thousands of reads, so you would normally use Bio.SeqIO.parse()
629 as shown above. This example has only as one entry, so instead we can
630 use the Bio.SeqIO.read() function:
631
632 >>> from Bio import SeqIO
633 >>> handle = open("Quality/solexa.fastq", "rU")
634 >>> record = SeqIO.read(handle, "fastq-solexa")
635 >>> handle.close()
636 >>> print record.id, record.seq
637 slxa_0013_1_0001_24 ACAAAAATCACAAGCATTCTTATACACC
638 >>> print record.letter_annotations["solexa_quality"]
639 [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -6, -1, -1, -4, -1, -4, -19, -10, -27, -18]
640
641 These quality scores are so low that when converted from the Solexa scheme
642 into PHRED scores they look quite different:
643
644 >>> print "%0.2f" % phred_quality_from_solexa(-1)
645 2.54
646
647 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
648 method to output the record(s):
649
650 >>> print record.format("fastq-solexa")
651 @slxa_0013_1_0001_24
652 ACAAAAATCACAAGCATTCTTATACACC
653 +
654 ??????????????????:??<?<-6%.
655 <BLANKLINE>
656
657 Note this output is slightly different from the input file as Biopython
658 has left out the optional repetition of the sequence identifier on the "+"
659 line. If you want the to use PHRED scores, use "fastq" or "qual" as the
660 output format instead, and Biopython will do the conversion for you:
661
662 >>> print record.format("fastq")
663 @slxa_0013_1_0001_24
664 ACAAAAATCACAAGCATTCTTATACACC
665 +
666 $$$$$$$$$$$$$$$$$$"$$"$"!!!!
667 <BLANKLINE>
668
669 >>> print record.format("qual")
670 >slxa_0013_1_0001_24
671 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 1 0 0 0 0
672 <BLANKLINE>
673 """
674 for title_line, seq_string, quality_string in FastqGeneralIterator(handle) :
675 if title2ids :
676 id, name, descr = title_line
677 else :
678 descr = title_line
679 id = descr.split()[0]
680 name = id
681 record = SeqRecord(Seq(seq_string, alphabet),
682 id=id, name=name, description=descr)
683 qualities = [ord(letter)-SOLEXA_SCORE_OFFSET for letter in quality_string]
684
685 record.letter_annotations["solexa_quality"] = qualities
686 yield record
687
689 """For QUAL files which include PHRED quality scores, but no sequence.
690
691 For example, consider this short QUAL file::
692
693 >EAS54_6_R1_2_1_413_324
694 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
695 26 26 26 23 23
696 >EAS54_6_R1_2_1_540_792
697 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
698 26 18 26 23 18
699 >EAS54_6_R1_2_1_443_348
700 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
701 24 18 18 18 18
702
703 Using this module directly you might run:
704
705 >>> handle = open("Quality/example.qual", "rU")
706 >>> for record in QualPhredIterator(handle) :
707 ... print record.id, record.seq
708 EAS54_6_R1_2_1_413_324 ?????????????????????????
709 EAS54_6_R1_2_1_540_792 ?????????????????????????
710 EAS54_6_R1_2_1_443_348 ?????????????????????????
711 >>> handle.close()
712
713 Typically however, you would call this via Bio.SeqIO instead with "qual"
714 as the format:
715
716 >>> from Bio import SeqIO
717 >>> handle = open("Quality/example.qual", "rU")
718 >>> for record in SeqIO.parse(handle, "qual") :
719 ... print record.id, record.seq
720 EAS54_6_R1_2_1_413_324 ?????????????????????????
721 EAS54_6_R1_2_1_540_792 ?????????????????????????
722 EAS54_6_R1_2_1_443_348 ?????????????????????????
723 >>> handle.close()
724
725 Becase QUAL files don't contain the sequence string itself, the seq
726 property is set to an UnknownSeq object. As no alphabet was given, this
727 has defaulted to a generic single letter alphabet and the character "?"
728 used.
729
730 By specifying a nucleotide alphabet, "N" is used instead:
731
732 >>> from Bio import SeqIO
733 >>> from Bio.Alphabet import generic_dna
734 >>> handle = open("Quality/example.qual", "rU")
735 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna) :
736 ... print record.id, record.seq
737 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
738 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
739 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
740 >>> handle.close()
741
742 However, the quality scores themselves are available as a list of integers
743 in each record's per-letter-annotation:
744
745 >>> print record.letter_annotations["phred_quality"]
746 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
747
748 You can still slice one of these SeqRecord objects with an UnknownSeq:
749
750 >>> sub_record = record[5:10]
751 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
752 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
753 """
754
755 while True :
756 line = handle.readline()
757 if line == "" : return
758 if line[0] == ">" :
759 break
760
761 while True :
762 if line[0]!=">" :
763 raise ValueError("Records in Fasta files should start with '>' character")
764 if title2ids :
765 id, name, descr = title2ids(line[1:].rstrip())
766 else :
767 descr = line[1:].rstrip()
768 id = descr.split()[0]
769 name = id
770
771 qualities = []
772 line = handle.readline()
773 while True:
774 if not line : break
775 if line[0] == ">": break
776 qualities.extend([int(word) for word in line.split()])
777 line = handle.readline()
778
779 if qualities :
780 if min(qualities) < 0 or max(qualities) > 90 :
781 raise ValueError(("Quality score range for %s is %i to %i, outside the " \
782 +"expected 0 to 90. Perhaps these are Solexa/Illumina " \
783 +"scores, and not PHRED scores?") \
784 % (id, min(qualities), max(qualities)))
785
786
787 record = SeqRecord(UnknownSeq(len(qualities), alphabet),
788 id = id, name = name, description = descr)
789 record.letter_annotations["phred_quality"] = qualities
790 yield record
791
792 if not line : return
793 assert False, "Should not reach this line"
794
796 """Class to write FASTQ format files (using PHRED quality scores).
797
798 Although you can use this class directly, you are strongly encouraged
799 to use the Bio.SeqIO.write() function instead. For example, this code
800 reads in a FASTQ (PHRED) file and re-saves it as another FASTQ (PHRED)
801 file:
802
803 >>> from Bio import SeqIO
804 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
805 >>> out_handle = open("Quality/temp.fastq", "w")
806 >>> SeqIO.write(record_iterator, out_handle, "fastq")
807 3
808 >>> out_handle.close()
809
810 You might want to do this if the original file included extra line breaks,
811 which while valid may not be supported by all tools. The output file from
812 Biopython will have each sequence on a single line, and each quality
813 string on a single line (which is considered desirable for maximum
814 compatibility).
815
816 In this next example, a Solexa FASTQ file is converted into a standard
817 Sanger style FASTQ file using PHRED qualities:
818
819 >>> from Bio import SeqIO
820 >>> record_iterator = SeqIO.parse(open("Quality/solexa.fastq"), "fastq-solexa")
821 >>> out_handle = open("Quality/temp.fastq", "w")
822 >>> SeqIO.write(record_iterator, out_handle, "fastq")
823 1
824 >>> out_handle.close()
825
826 This code is also called if you use the .format("fastq") method of a
827 SeqRecord.
828
829 P.S. To avoid cluttering up your working directory, you can delete this
830 temporary file now:
831
832 >>> import os
833 >>> os.remove("Quality/temp.fastq")
834
835 """
855
857 """Class to write QUAL format files (using PHRED quality scores).
858
859 Although you can use this class directly, you are strongly encouraged
860 to use the Bio.SeqIO.write() function instead. For example, this code
861 reads in a FASTQ file and saves the quality scores into a QUAL file:
862
863 >>> from Bio import SeqIO
864 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
865 >>> out_handle = open("Quality/temp.qual", "w")
866 >>> SeqIO.write(record_iterator, out_handle, "qual")
867 3
868 >>> out_handle.close()
869
870 This code is also called if you use the .format("qual") method of a
871 SeqRecord.
872
873 P.S. Don't forget to clean up the temp file if you don't need it anymore:
874
875 >>> import os
876 >>> os.remove("Quality/temp.qual")
877 """
878 - def __init__(self, handle, wrap=60, record2title=None):
879 """Create a QUAL writer.
880
881 Arguments:
882 - handle - Handle to an output file, e.g. as returned
883 by open(filename, "w")
884 - wrap - Optional line length used to wrap sequence lines.
885 Defaults to wrapping the sequence at 60 characters
886 Use zero (or None) for no wrapping, giving a single
887 long line for the sequence.
888 - record2title - Optional function to return the text to be
889 used for the title line of each record. By default
890 a combination of the record.id and record.description
891 is used. If the record.description starts with the
892 record.id, then just the record.description is used.
893
894 The record2title argument is present for consistency with the
895 Bio.SeqIO.FastaIO writer class.
896 """
897 SequentialSequenceWriter.__init__(self, handle)
898
899 self.wrap = None
900 if wrap :
901 if wrap < 1 :
902 raise ValueError
903 self.wrap = wrap
904 self.record2title = record2title
905
943
945 """Class to write FASTQ format files (using Solexa quality scores).
946
947 Although you can use this class directly, you are strongly encouraged
948 to use the Bio.SeqIO.write() function instead. For example, this code
949 reads in a FASTQ file and re-saves it as another FASTQ file:
950
951 >>> from Bio import SeqIO
952 >>> record_iterator = SeqIO.parse(open("Quality/solexa.fastq"), "fastq-solexa")
953 >>> out_handle = open("Quality/temp.fastq", "w")
954 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa")
955 1
956 >>> out_handle.close()
957
958 You might want to do this if the original file included extra line
959 breaks, which (while valid) may not be supported by all tools. The
960 output file from Biopython will have each sequence on a single line, and
961 each quality string on a single line (which is considered desirable for
962 maximum compatibility).
963
964 This code is also called if you use the .format("fastq-solexa") method of
965 a SeqRecord.
966
967 P.S. Don't forget to delete the temp file if you don't need it anymore:
968
969 >>> import os
970 >>> os.remove("Quality/temp.fastq")
971 """
989
991 """Iterate over matched FASTA and QUAL files as SeqRecord objects.
992
993 For example, consider this short QUAL file::
994
995 >EAS54_6_R1_2_1_413_324
996 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
997 26 26 26 23 23
998 >EAS54_6_R1_2_1_540_792
999 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1000 26 18 26 23 18
1001 >EAS54_6_R1_2_1_443_348
1002 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1003 24 18 18 18 18
1004
1005 And a matching FASTA file::
1006
1007 >EAS54_6_R1_2_1_413_324
1008 CCCTTCTTGTCTTCAGCGTTTCTCC
1009 >EAS54_6_R1_2_1_540_792
1010 TTGGCAGGCCAAGGCCGATGGATCA
1011 >EAS54_6_R1_2_1_443_348
1012 GTTGCTTCTGGCGTGGGTGGGGGGG
1013
1014 You can parse these separately using Bio.SeqIO with the "qual" and
1015 "fasta" formats, but then you'll get a group of SeqRecord objects with
1016 no sequence, and a matching group with the sequence but not the
1017 qualities. Because it only deals with one input file handle, Bio.SeqIO
1018 can't be used to read the two files together - but this function can!
1019 For example,
1020
1021 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1022 ... open("Quality/example.qual", "rU"))
1023 >>> for record in rec_iter :
1024 ... print record.id, record.seq
1025 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1026 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1027 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1028
1029 As with the FASTQ or QUAL parsers, if you want to look at the qualities,
1030 they are in each record's per-letter-annotation dictionary as a simple
1031 list of integers:
1032
1033 >>> print record.letter_annotations["phred_quality"]
1034 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1035
1036 If you have access to data as a FASTQ format file, using that directly
1037 would be simpler and more straight forward. Note that you can easily use
1038 this function to convert paired FASTA and QUAL files into FASTQ files:
1039
1040 >>> from Bio import SeqIO
1041 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1042 ... open("Quality/example.qual", "rU"))
1043 >>> out_handle = open("Quality/temp.fastq", "w")
1044 >>> SeqIO.write(rec_iter, out_handle, "fastq")
1045 3
1046 >>> out_handle.close()
1047
1048 And don't forget to clean up the temp file if you don't need it anymore:
1049
1050 >>> import os
1051 >>> os.remove("Quality/temp.fastq")
1052 """
1053 from Bio.SeqIO.FastaIO import FastaIterator
1054 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \
1055 title2ids=title2ids)
1056 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \
1057 title2ids=title2ids)
1058
1059
1060
1061 while True :
1062 try :
1063 f_rec = fasta_iter.next()
1064 except StopIteration :
1065 f_rec = None
1066 try :
1067 q_rec = qual_iter.next()
1068 except StopIteration :
1069 q_rec = None
1070 if f_rec is None and q_rec is None :
1071
1072 break
1073 if f_rec is None :
1074 raise ValueError("FASTA file has more entries than the QUAL file.")
1075 if q_rec is None :
1076 raise ValueError("QUAL file has more entries than the FASTA file.")
1077 if f_rec.id != q_rec.id :
1078 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \
1079 % (f_rec.id, q_rec.id))
1080 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]) :
1081 raise ValueError("Sequence length and number of quality scores disagree for %s" \
1082 % f_rec.id)
1083
1084 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"]
1085 yield f_rec
1086
1087
1088
1090 """Run the Bio.SeqIO module's doctests.
1091
1092 This will try and locate the unit tests directory, and run the doctests
1093 from there in order that the relative paths used in the examples work.
1094 """
1095 import doctest
1096 import os
1097 if os.path.isdir(os.path.join("..","..","Tests")) :
1098 print "Runing doctests..."
1099 cur_dir = os.path.abspath(os.curdir)
1100 os.chdir(os.path.join("..","..","Tests"))
1101 assert os.path.isfile("Quality/example.fastq")
1102 assert os.path.isfile("Quality/example.fasta")
1103 assert os.path.isfile("Quality/example.qual")
1104 assert os.path.isfile("Quality/tricky.fastq")
1105 assert os.path.isfile("Quality/solexa.fastq")
1106 doctest.testmod()
1107 os.chdir(cur_dir)
1108 del cur_dir
1109 print "Done"
1110
1111 if __name__ == "__main__" :
1112 _test()
1113