skbio.sequence.distance.kmer_distance

skbio.sequence.distance.kmer_distance(seq1, seq2, k, overlap=True)

Compute the kmer distance between a pair of sequences.

State: Experimental as of 0.5.0.

The kmer distance between two sequences is the fraction of distinct kmers, out of all kmers observed in either sequence, that occur in only one of the two sequences.
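The definition above can be illustrated with a minimal pure-Python sketch. It works on plain strings rather than Sequence objects, and it assumes the distance is computed on sets of distinct kmers, as the description and the Notes suggest. The names kmer_set and kmer_distance_sketch are hypothetical helpers written for this illustration; they are not part of scikit-bio and are not presented as its implementation.

```python
import math


def kmer_set(seq, k, overlap=True):
    """Return the set of distinct kmers of length k in the string seq."""
    if k < 1:
        raise ValueError("k must be greater than 0.")
    # Overlapping kmers start at every position; non-overlapping kmers
    # start at every k-th position.
    step = 1 if overlap else k
    return {seq[i:i + k] for i in range(0, len(seq) - k + 1, step)}


def kmer_distance_sketch(seq1, seq2, k, overlap=True):
    """Fraction of distinct kmers found in only one of the two strings."""
    kmers1 = kmer_set(seq1, k, overlap)
    kmers2 = kmer_set(seq2, k, overlap)
    all_kmers = kmers1 | kmers2
    if not all_kmers:
        # Neither string is long enough to contain a kmer of length k.
        return math.nan
    shared = kmers1 & kmers2
    return (len(all_kmers) - len(shared)) / len(all_kmers)


# 13 distinct 3-mers in total, 1 shared ('GAT'): 12 / 13
d_overlap = kmer_distance_sketch("ATCGGCGAT", "GCAGATGTG", 3)

# Non-overlapping 3-mers: {ATC, GGC, GAT} vs {GCA, GAT, GTG}: 4 / 5
d_no_overlap = kmer_distance_sketch("ATCGGCGAT", "GCAGATGTG", 3, overlap=False)
```

Under these assumptions the quantity is the Jaccard distance between the two kmer sets, so it ranges from 0.0 (identical kmer sets) to 1.0 (no shared kmers).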

Parameters
  • seq1, seq2 (Sequence) – Sequences to compute kmer distance between.

  • k (int) – The kmer length.

  • overlap (bool, optional) – Whether the kmers should overlap. Defaults to True.

Returns

The kmer distance between seq1 and seq2.

Return type

float

Raises
  • ValueError – If k is less than 1.

  • TypeError – If seq1 or seq2 is not a Sequence instance.

  • TypeError – If seq1 and seq2 are not the same type.

Notes

kmer counts are not incorporated in this distance metric: a kmer that occurs several times in a sequence is treated the same as one that occurs once.

np.nan will be returned if there are no kmers defined for the sequences.
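Both notes can be checked with a small pure-Python sketch on plain strings. It assumes the metric compares sets of distinct overlapping kmers; distinct_kmers and set_distance are hypothetical helpers for this illustration, not scikit-bio functions, and math.nan stands in for np.nan.

```python
import math


def distinct_kmers(seq, k):
    """Set of distinct overlapping kmers of length k in the string seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def set_distance(kmers1, kmers2):
    """Fraction of kmers in the union that belong to only one set."""
    union = kmers1 | kmers2
    if not union:
        # No kmers are defined for either input.
        return math.nan
    return len(kmers1 ^ kmers2) / len(union)


# 'AAAAAA' contains the 3-mer 'AAA' four times and 'AAA' contains it once,
# but both reduce to the same set {'AAA'}, so counts do not matter.
repeated = set_distance(distinct_kmers("AAAAAA", 3), distinct_kmers("AAA", 3))

# Both strings are shorter than k, so no kmers exist and the result is nan.
too_short = set_distance(distinct_kmers("AT", 3), distinct_kmers("GC", 3))
```

Here repeated evaluates to 0.0 and too_short to nan. Because nan does not compare equal to itself, a check such as math.isnan (or np.isnan) is needed to detect the second case.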

Examples

>>> from skbio import Sequence
>>> from skbio.sequence.distance import kmer_distance
>>> seq1 = Sequence('ATCGGCGAT')
>>> seq2 = Sequence('GCAGATGTG')
>>> kmer_distance(seq1, seq2, 3)
0.9230769230...