TabularMSA.conservation(metric='inverse_shannon_uncertainty', degenerate_mode='error', gap_mode='nan')
Apply metric to compute conservation for all alignment positions.
State: Experimental as of 0.4.1.
Parameters
metric ({'inverse_shannon_uncertainty'}, optional) – Metric that should be applied for computing conservation. Resulting values should be larger when a position is more conserved.
degenerate_mode ({'nan', 'error'}, optional) – Mode for handling positions with degenerate characters. If "nan", positions with degenerate characters will be assigned a conservation score of np.nan. If "error", an error will be raised if one or more degenerate characters are present.
gap_mode ({'nan', 'ignore', 'error', 'include'}, optional) – Mode for handling positions with gap characters. If "nan", positions with gaps will be assigned a conservation score of np.nan. If "ignore", positions with gaps will be filtered to remove gaps before metric is applied. If "error", an error will be raised if one or more gap characters are present. If "include", conservation will be computed on alignment positions with gaps included. In this case, it is up to the metric to ensure that gaps are handled as they should be or to raise an error if gaps are not supported by that metric.
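The four gap_mode behaviors described above can be sketched for a single alignment column as follows. This is a minimal illustration and not the library's implementation: score_column, majority_fraction, GAP_CHARS, and DEFAULT_GAP are hypothetical names introduced here, the gap alphabet is an assumption, and the toy metric stands in for a real conservation metric.

```python
import math
from collections import Counter

GAP_CHARS = {"-", "."}  # assumed gap alphabet for this sketch
DEFAULT_GAP = "-"       # assumed stand-in for the default gap character
VALID_GAP_MODES = {"nan", "ignore", "error", "include"}


def majority_fraction(column):
    """Toy metric: fraction of the column taken by its most common symbol."""
    return max(Counter(column).values()) / len(column)


def score_column(column, metric, gap_mode="nan"):
    """Apply `metric` to one alignment column under the given gap_mode."""
    if gap_mode not in VALID_GAP_MODES:
        raise ValueError(f"Unknown gap_mode: {gap_mode!r}")
    if not any(ch in GAP_CHARS for ch in column):
        return metric(column)  # no gaps, so the mode does not matter
    if gap_mode == "nan":
        return math.nan
    if gap_mode == "error":
        raise ValueError("Gap characters are present in this position.")
    if gap_mode == "ignore":
        # Drop gaps, then score what is left of the column.
        return metric("".join(ch for ch in column if ch not in GAP_CHARS))
    # "include": recode every gap character to one symbol, then score.
    recoded = "".join(DEFAULT_GAP if ch in GAP_CHARS else ch for ch in column)
    return metric(recoded)


print(score_column("AA-A", majority_fraction, gap_mode="ignore"))   # 1.0
print(score_column("AA.A", majority_fraction, gap_mode="include"))  # 0.75
```

Note that the sketch raises per column, whereas the method documented here raises if gaps occur anywhere in the alignment when gap_mode is "error".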
Returns
np.array of floats – Values resulting from the application of metric to each position in the alignment.
Raises
ValueError – If an unknown metric, degenerate_mode, or gap_mode is provided.
ValueError – If any degenerate characters are present in the alignment when degenerate_mode is "error".
ValueError – If any gaps are present in the alignment when gap_mode is "error".
Notes
Users should be careful interpreting results when gap_mode = "include", as the results may be misleading. For example, as pointed out in [1], a protein alignment position composed of 90% gaps and 10% tryptophans would score as more highly conserved than a position composed of alanine and glycine in equal frequencies with the "inverse_shannon_uncertainty" metric.
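The example above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions, not the library's code: the helper name is hypothetical, and the logarithm base of 21 (20 amino acids plus one gap symbol) is an assumption chosen for illustration. The ordering of the two scores does not depend on the base.

```python
import math


def inverse_shannon_uncertainty(freqs, base):
    """One minus Shannon's uncertainty for a list of relative frequencies."""
    uncertainty = -sum(p * math.log(p, base) for p in freqs if p > 0)
    return 1.0 - uncertainty


BASE = 21  # assumption: 20 amino acids plus one gap symbol

gappy = inverse_shannon_uncertainty([0.9, 0.1], BASE)  # 90% gaps, 10% Trp
mixed = inverse_shannon_uncertainty([0.5, 0.5], BASE)  # 50% Ala, 50% Gly

print(f"gappy position: {gappy:.3f}")  # about 0.893
print(f"Ala/Gly position: {mixed:.3f}")  # about 0.772
```

The mostly-gap position scores higher because the metric sees only how uneven the symbol frequencies are, not whether the dominant symbol is a gap.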
gap_mode = "include" will result in all gap characters being recoded to TabularMSA.dtype.default_gap_char. Because no conservation metrics that we are aware of treat different gap characters differently (e.g., none of the metrics described in [1]), they are all treated the same within this method.
The inverse_shannon_uncertainty metric is simply one minus Shannon's uncertainty metric. This method uses the inverse of Shannon's uncertainty so that larger values imply higher conservation. Shannon's uncertainty is also referred to as Shannon's entropy, but when making computations from symbols, as is done here, "uncertainty" is the preferred term [2].
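Written out for a single alignment position with relative symbol frequencies p_i, the metric takes the form below. The logarithm base b is not stated in this section; the sketch assumes it is chosen so that the uncertainty falls between 0 and 1 (for example, b equal to the number of possible symbols).

```latex
H = -\sum_{i} p_i \log_b p_i, \qquad \text{conservation} = 1 - H
```

A position containing a single symbol has H = 0 and therefore a conservation of 1, the largest value.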
References
[1] Valdar WS. Scoring residue conservation. Proteins. (2002)
[2] Schneider T. Pitfalls in information theory (website, ca. 2015). https://schneider.ncifcrf.gov/glossary.html#Shannon_entropy