skbio.io.format.fastq
)¶The FASTQ file format (fastq
) stores biological (e.g., nucleotide)
sequences and their quality scores in a simple plain text format that is both
human-readable and easy to parse. The file format was invented by Jim Mullikin
at the Wellcome Trust Sanger Institute but wasn’t given a formal definition,
though it has informally become a standard file format for storing
high-throughput sequence data. More information about the format and its
variants can be found in 1 and 2.
Conceptually, a FASTQ file is similar to paired FASTA and QUAL files in that it stores both biological sequences and their quality scores. FASTQ differs from FASTA/QUAL because the quality scores are stored in the same file as the biological sequence data.
An example FASTQ-formatted file containing two DNA sequences and their quality scores:
@seq1 description 1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
+
````Y^T]`]c^cabcacc`^Lb^ccYT\T\Y\WF
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
]KZ[PY]_[YY^```ac^\\`bT``c`\aT``bbb
Has Sniffer: Yes
Reader |
Writer |
Object Class |
---|---|---|
Yes |
Yes |
generator of |
Yes |
Yes |
|
Yes |
Yes |
|
Yes |
Yes |
|
Yes |
Yes |
|
Yes |
Yes |
A FASTQ file contains one or more biological sequences and their corresponding quality scores stored sequentially as records. Each record consists of four sections:
Sequence header line consisting of a sequence identifier (ID) and description (both optional)
Biological sequence data (typically stored using the standard IUPAC lexicon), optionally split over multiple lines
Quality header line separating sequence data from quality scores (optionally repeating the ID and description from the sequence header line)
Quality scores as printable ASCII characters, optionally split over multiple lines. Decoding of quality scores will depend on the specified FASTQ variant (see below for more details)
For the complete FASTQ format specification, see 1. scikit-bio’s FASTQ implementation follows the format specification described in this excellent publication, including validating the implementation against the FASTQ example files provided in the publication’s supplementary data.
Note
IDs and descriptions will be parsed from sequence header lines in
exactly the same way as FASTA headers (skbio.io.format.fasta
). IDs,
descriptions, and quality scores are also stored on, and written from,
sequence objects in the same way as with FASTA.
Note
Blank or whitespace-only lines are only allowed at the beginning of the file, between FASTQ records, or at the end of the file. A blank or whitespace-only line after the header line, within the sequence, or within quality scores will raise an error.
scikit-bio will ignore leading and trailing whitespace characters on each line while reading.
Note
Validation may be performed depending on the type of object the data is being read into. This behavior matches that of FASTA files.
Note
scikit-bio will write FASTQ files in a normalized format, with each record section on a single line. Thus, each record will be composed of exactly four lines. The quality header line won’t have the sequence ID and description repeated.
Note
lowercase functionality is supported the same as with FASTA.
FASTQ associates quality scores with sequence data, with each quality score encoded as a single printable ASCII character. In scikit-bio, all quality scores are decoded as Phred quality scores. This is the most common quality score metric, though there are others (e.g., Solexa quality scores). Unfortunately, different sequencers have different ways of encoding quality scores as ASCII characters, notably Sanger and Illumina. Below is a table highlighting the different encoding variants supported by scikit-bio, as well as listing the equivalent variant names used in the Open Bioinformatics Foundation (OBF) 3 projects (e.g., Biopython, BioPerl, etc.).
Variant |
ASCII Range |
Offset |
Quality Range |
Notes |
---|---|---|---|---|
sanger |
33 to 126 |
33 |
0 to 93 |
Equivalent to OBF’s fastq-sanger. |
illumina1.3 |
64 to 126 |
64 |
0 to 62 |
Equivalent to OBF’s fastq-illumina. Use this if your data was generated using Illumina 1.3-1.7 software. |
illumina1.8 |
33 to 95 |
33 |
0 to 62 |
Equivalent to sanger but with 0 to 62 quality score range check. Use this if your data was generated using Illumina 1.8 software or later. |
solexa |
59 to 126 |
64 |
-5 to 62 |
Not currently implemented. |
Note
When writing, Phred quality scores will be truncated to the maximum value in the variant’s range and a warning will be issued. This is consistent with the OBF projects.
When reading, an error will be raised if a decoded quality score is outside the variant’s range.
The following parameters are available to all FASTQ format readers and writers:
variant
: A string indicating the quality score variant used to
decode/encode Phred quality scores. Must be one of sanger
,
illumina1.3
, illumina1.8
, or solexa
. This parameter is preferred
over phred_offset
because additional quality score range checks and
conversions can be performed. It is also more explicit.
phred_offset
: An integer indicating the ASCII code offset used to
decode/encode Phred quality scores. Must be in the range [33, 126]
. All
decoded scores will be assumed to be Phred scores (i.e., no additional
conversions are performed). Prefer using variant
over this parameter
whenever possible.
Note
You must provide variant
or phred_offset
when reading or
writing a FASTQ file. variant
and phred_offset
cannot both be
provided at the same time.
The following additional parameters are the same as in FASTA format
(skbio.io.format.fasta
):
constructor
: see constructor
parameter in FASTA format
seq_num
: see seq_num
parameter in FASTA format
id_whitespace_replacement
: see id_whitespace_replacement
parameter in
FASTA format
description_newline_replacement
: see description_newline_replacement
parameter in FASTA format
lowercase
: see lowercase
parameter in FASTA format
Examples
Suppose we have the following FASTQ file with two DNA sequences:
@seq1 description 1
AACACCAAACTTCTCCACC
ACGTGAGCTACAAAAG
+seq1 description 1
''''Y^T]']C^CABCACC
`^LB^CCYT\T\Y\WF
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
]KZ[PY]_[YY^'''AC^\\'BT''C'\AT''BBB
Note that the first sequence and its quality scores are split across multiple lines, while the second sequence and its quality scores are each on a single line. Also note that the first sequence has a duplicate ID and description on the quality header line, while the second sequence does not.
Let’s define this file in-memory as a StringIO
, though this could be a real
file path, file handle, or anything that’s supported by scikit-bio’s I/O
registry in practice:
>>> from io import StringIO
>>> fs = '\n'.join([
... r"@seq1 description 1",
... r"AACACCAAACTTCTCCACC",
... r"ACGTGAGCTACAAAAG",
... r"+seq1 description 1",
... r"''''Y^T]']C^CABCACC",
... r"'^LB^CCYT\T\Y\WF",
... r"@seq2 description 2",
... r"TATGTATATATAACATATACATATATACATACATA",
... r"+",
... r"]KZ[PY]_[YY^'''AC^\\'BT''C'\AT''BBB"])
>>> fh = StringIO(fs)
To load the sequences into a TabularMSA
, we run:
>>> from skbio import TabularMSA, DNA
>>> msa = TabularMSA.read(fh, constructor=DNA, variant='sanger')
>>> msa
TabularMSA[DNA]
-----------------------------------
Stats:
sequence count: 2
position count: 35
-----------------------------------
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
TATGTATATATAACATATACATATATACATACATA
Note that quality scores are decoded from Sanger. To load the second sequence
as DNA
:
>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> seq = DNA.read(fh, variant='sanger', seq_num=2)
>>> seq
DNA
----------------------------------------
Metadata:
'description': 'description 2'
'id': 'seq2'
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 35
has gaps: False
has degenerates: False
has definites: True
GC-content: 14.29%
----------------------------------------
0 TATGTATATA TAACATATAC ATATATACAT ACATA
To write our TabularMSA
to a FASTQ file with quality scores encoded using
the illumina1.3
variant:
>>> new_fh = StringIO()
>>> print(msa.write(new_fh, format='fastq', variant='illumina1.3').getvalue())
@seq1 description 1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
+
FFFFx}s|F|b}b`ab`bbF}ka}bbxs{s{x{ve
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
|jyzox|~zxx}FFF`b}{{FasFFbF{`sFFaaa
>>> new_fh.close()
Note that the file has been written in normalized format: sequence and quality scores each only occur on a single line and the sequence header line is not repeated in the quality header line. Note also that the quality scores are different because they have been encoded using a different variant.
References
Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010) 38 (6): 1767-1771. first published online December 16, 2009. doi:10.1093/nar/gkp1137 http://nar.oxfordjournals.org/content/38/6/1767