skbio.io.format.gff3
GFF3 (Generic Feature Format version 3) is a standard file format for describing features for biological sequences. It contains lines of text, each consisting of 9 tab-delimited columns [1].
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | Yes | Sequence |
Yes | Yes | DNA |
Yes | Yes | IntervalMetadata |
Yes | Yes | generator of tuple (seq_id of str type, IntervalMetadata) |
State: Experimental as of 0.5.1.
The first line of the file is a comment that identifies the format and version. This is followed by a series of data lines. Each data line corresponds to an annotation and consists of 9 columns: SEQID, SOURCE, TYPE, START, END, SCORE, STRAND, PHASE, and ATTR.
Column 9 (ATTR) is a list of feature attributes in the format “tag=value”. Multiple “tag=value” pairs are delimited by semicolons. Multiple values of the same tag are separated by commas. The following tags have predefined meanings: ID, Name, Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term, and Is_circular.
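The column layout and the “tag=value” syntax described above can be sketched in plain Python. This is only an illustration, not the parser that scikit-bio uses: the names GFF3_COLUMNS, parse_attributes, and parse_data_line are hypothetical, and the percent-decoding step is an assumption taken from the GFF3 specification rather than from this page.

```python
from urllib.parse import unquote

# Column names as listed above (hypothetical constant, not part of scikit-bio).
GFF3_COLUMNS = ("SEQID", "SOURCE", "TYPE", "START", "END",
                "SCORE", "STRAND", "PHASE", "ATTR")


def parse_attributes(attr):
    """Parse a column 9 string into a dict mapping tag -> list of values."""
    parsed = {}
    for pair in attr.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas before decoding so an escaped comma (%2C) survives.
        parsed[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return parsed


def parse_data_line(line):
    """Split one GFF3 data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["ATTR"] = parse_attributes(record["ATTR"])
    return record


record = parse_data_line(
    "seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1,gen2;Note=first exon")
print(record["TYPE"], record["START"], record["END"])
print(record["ATTR"])
```

Every value is kept as a list so that single-valued and multi-valued tags are handled the same way. The column fields are left as strings, since “.” is a valid placeholder in several columns.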
The meaning and format of these columns and attributes are explained in detail in the format specification [1]. They are read in as the vocabulary defined in the GenBank parser (skbio.io.format.genbank).
The IntervalMetadata GFF3 reader requires one parameter: seq_id. It reads the annotation with the specified sequence ID from the GFF3 file into an IntervalMetadata object.
The DNA and Sequence GFF3 readers require an integer parameter, seq_num. It specifies which GFF3 record to read from a GFF3 file that contains annotations for multiple sequences.
skip_subregion is a boolean parameter used by all the GFF3 writers. It controls whether the non-contiguous sub-regions of a feature annotation are written as separate lines. For example, if an IntervalMetadata object holds one interval feature for a gene with two exons, the writer emits one line when skip_subregion is True and three lines (one for the gene and one for each exon) when skip_subregion is False. The default is True.
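The effect of skip_subregion on the number of lines written can be mimicked with a small stand-alone sketch. It is not scikit-bio's writer: rows_for_feature and its sub_type argument are hypothetical, and the coordinates are plain GFF3-style START/END values chosen for illustration.

```python
def rows_for_feature(feature_type, sub_regions, skip_subregion=True,
                     sub_type="exon"):
    """Return the (type, start, end) rows a writer would emit for one feature.

    sub_regions is a list of (start, end) pairs, one per contiguous piece.
    """
    # The parent row always spans from the first start to the last end.
    start = min(s for s, _ in sub_regions)
    end = max(e for _, e in sub_regions)
    rows = [(feature_type, start, end)]
    # Sub-region rows are added only when they are not skipped and the
    # feature really is split into more than one piece.
    if not skip_subregion and len(sub_regions) > 1:
        rows.extend((sub_type, s, e) for s, e in sub_regions)
    return rows


gene = [(10, 30), (50, 90)]
print(rows_for_feature("gene", gene))
# [('gene', 10, 90)]
print(rows_for_feature("gene", gene, skip_subregion=False))
# [('gene', 10, 90), ('exon', 10, 30), ('exon', 50, 90)]
```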
In addition, the IntervalMetadata GFF3 writer needs a seq_id parameter. It specifies the sequence ID (column 1 in the GFF3 file) that the annotation belongs to.
Examples
Let’s create a file stream with the following data in GFF3 format:
>>> from skbio import Sequence, DNA
>>> gff_str = """
... ##gff-version 3
... seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1
... seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1
... seq_1\t.\texon\t50\t90\t.\t+\t.\tParent=gen1
... seq_2\t.\tgene\t80\t96\t.\t-\t.\tID=gen2
... ##FASTA
... >seq_1
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... >seq_2
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... """
>>> import io
>>> from skbio.metadata import IntervalMetadata
>>> from skbio.io import read
>>> gff = io.StringIO(gff_str)
We can read it into IntervalMetadata. Each line will be read into an interval feature in the IntervalMetadata object:
>>> im = read(gff, format='gff3', into=IntervalMetadata,
... seq_id='seq_1')
>>> im
3 interval features
-------------------
Interval(interval_metadata=<4604421736>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'phase': 0, 'strand': '+', 'source': '.', 'score': '.', 'ID': 'gen1'})
Interval(interval_metadata=<4604421736>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
Interval(interval_metadata=<4604421736>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
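Note that the bounds printed above differ from the START and END columns of the file: the gene at 10..90 appears as (9, 90). This is consistent with converting GFF3's 1-based, inclusive coordinates into 0-based, end-exclusive bounds. A minimal sketch of that arithmetic, using hypothetical helper names that are not part of scikit-bio:

```python
def gff3_to_bounds(start, end):
    """Convert 1-based, inclusive START/END to 0-based, end-exclusive bounds."""
    return (start - 1, end)


def bounds_to_gff3(bounds):
    """Convert 0-based, end-exclusive bounds back to 1-based, inclusive START/END."""
    lower, upper = bounds
    return (lower + 1, upper)


print(gff3_to_bounds(10, 90))    # (9, 90), matching the gene above
print(bounds_to_gff3((49, 90)))  # (50, 90), matching the second exon line
```

With this convention, the length of a feature is simply the upper bound minus the lower bound.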
We can write the IntervalMetadata object back to a GFF3 file:
>>> with io.StringIO() as fh:
... print(im.write(fh, format='gff3', seq_id='seq_1').getvalue())
##gff-version 3
seq_1 . gene 10 90 . + 0 ID=gen1
seq_1 . exon 10 30 . + . Parent=gen1
seq_1 . exon 50 90 . + . Parent=gen1
If the GFF3 file does not have the sequence ID, it will return an empty object:
>>> gff = io.StringIO(gff_str)
>>> im = read(gff, format='gff3', into=IntervalMetadata,
... seq_id='foo')
>>> im
0 interval features
-------------------
We can also read the GFF3 file into a generator:
>>> gff = io.StringIO(gff_str)
>>> gen = read(gff, format='gff3')
>>> for im in gen:
... print(im[0]) # the seq id
... print(im[1]) # the interval metadata on this seq
seq_1
3 interval features
-------------------
Interval(interval_metadata=<4603377592>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'ID': 'gen1', 'source': '.', 'score': '.', 'strand': '+', 'phase': 0})
Interval(interval_metadata=<4603377592>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
Interval(interval_metadata=<4603377592>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
seq_2
1 interval feature
------------------
Interval(interval_metadata=<4603378712>, bounds=[(79, 96)], fuzzy=[(False, False)], metadata={'strand': '-', 'type': 'gene', 'ID': 'gen2', 'source': '.', 'score': '.'})
For a GFF3 file with sequences, we can read it into Sequence or DNA:
>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=Sequence, seq_num=1)
>>> seq
Sequence
--------------------------------------------------------------------
Metadata:
'description': ''
'id': 'seq_1'
Interval metadata:
3 interval features
Stats:
length: 100
--------------------------------------------------------------------
0 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=DNA, seq_num=2)
>>> seq
DNA
--------------------------------------------------------------------
Metadata:
'description': ''
'id': 'seq_2'
Interval metadata:
1 interval feature
Stats:
length: 120
has gaps: False
has degenerates: False
has definites: True
GC-content: 50.00%
--------------------------------------------------------------------
0 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
References