skbio.stats.composition.ancom(table, grouping, alpha=0.05, tau=0.02, theta=0.1, multiple_comparisons_correction='holm-bonferroni', significance_test=None, percentiles=(0.0, 25.0, 50.0, 75.0, 100.0))
Performs a differential abundance test using ANCOM.
State: Experimental as of 0.4.1.
This is done by calculating pairwise log ratios between all features and performing a significance test to determine if there is a significant difference in feature ratios with respect to the variable of interest.
In an experiment with only two treatments, this tests the following hypothesis for feature \(i\):
\[H_{0i}: \mathbb{E}[\ln(u_i^{(1)})] = \mathbb{E}[\ln(u_i^{(2)})]\]
where \(u_i^{(1)}\) is the mean abundance for feature \(i\) in the first group and \(u_i^{(2)}\) is the mean abundance for feature \(i\) in the second group.
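The pairwise log-ratio idea described above can be sketched as follows. This is a simplified illustration, not the library's implementation: it skips the multiple comparison correction and the tau/theta cutoff, and the helper name `pairwise_logratio_pvalues` and the random toy data are assumptions made for the example.

```python
import numpy as np
from scipy.stats import f_oneway


def pairwise_logratio_pvalues(counts, labels):
    """Return a (features x features) matrix of p-values.

    counts: 2D array of strictly positive values, samples x features.
    labels: 1D array of group labels, one per sample.
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    log_counts = np.log(counts)
    n_features = counts.shape[1]
    groups = np.unique(labels)
    pvals = np.ones((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            # log(x_i / x_j) for every sample
            ratio = log_counts[:, i] - log_counts[:, j]
            # Test whether the log ratio differs between groups
            _, p = f_oneway(*[ratio[labels == g] for g in groups])
            pvals[i, j] = pvals[j, i] = p
    return pvals


# Hypothetical toy data: 8 samples, 4 features, two groups of 4 samples
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(8, 4))
labels = np.array(["a"] * 4 + ["b"] * 4)

pvals = pairwise_logratio_pvalues(counts, labels)
# Uncorrected W: number of other features each feature differs against
W = (pvals < 0.05).sum(axis=1)
print(W)
```

Working on log counts turns each ratio into a simple difference, which is why strictly positive input is required.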
Parameters
table (pd.DataFrame) – A 2D matrix of strictly positive values (i.e. counts or proportions) where the rows correspond to samples and the columns correspond to features.
grouping (pd.Series) – Vector indicating the assignment of samples to groups. For example, these could be strings or integers denoting which group a sample belongs to. Its length must equal the number of samples in table. The index must be the same on table and grouping but need not be in the same order.
alpha (float, optional) – Significance level for each of the statistical tests. This can be anywhere between 0 and 1 exclusive.
tau (float, optional) – A constant used to determine an appropriate cutoff. A value close to zero indicates a conservative cutoff. This can be anywhere between 0 and 1 exclusive.
theta (float, optional) – Lower bound for the proportion for the W-statistic. If all W-statistics are lower than theta, then no features will be detected to be differentially significant. This can be anywhere between 0 and 1 exclusive.
multiple_comparisons_correction ({None, 'holm-bonferroni'}, optional) – The multiple comparison correction procedure to run. If None, no multiple comparison correction is run. If 'holm-bonferroni' is specified, the Holm-Bonferroni procedure [1] is run.
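For readers unfamiliar with the Holm-Bonferroni step-down procedure, a minimal generic sketch is given here. It illustrates the textbook procedure only and is not taken from the library's internals; the function name `holm_reject` is hypothetical.

```python
def holm_reject(pvalues, alpha=0.05):
    """Return a list of booleans: True where the null is rejected."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda k: pvalues[k])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1)
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down: stop at the first non-rejection
            break
    return reject


decisions = holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05)
print(decisions)  # [True, False, False, True]
```

The thresholds in this example are 0.0125, 0.0167 and 0.025, so the procedure stops at the third-smallest p-value (0.03).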
significance_test (function, optional) – A statistical significance function to test for significance between classes. This function must be able to accept at least two 1D array_like arguments of floats and return a test statistic and a p-value. By default scipy.stats.f_oneway is used.
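A custom function only needs to honour the contract described above: take one 1D array per group and return a statistic and a p-value. The sketch below wraps Welch's t-test for the two-group case. The wrapper name `welch_ttest` and the toy values are made up for illustration, and the call to ancom in the final comment is shown as an assumed usage rather than executed.

```python
import numpy as np
from scipy.stats import ttest_ind


def welch_ttest(group_a, group_b):
    """Two-group Welch t-test returning (statistic, p-value)."""
    result = ttest_ind(group_a, group_b, equal_var=False)
    return result.statistic, result.pvalue


# Made-up log-ratio values for two groups
a = np.array([0.10, 0.25, 0.18, 0.30])
b = np.array([0.95, 1.10, 0.88, 1.02])

stat, pvalue = welch_ttest(a, b)
print(stat, pvalue)

# Assumed usage with ancom (not run here):
# ancom(table, grouping, significance_test=welch_ttest)
```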
percentiles (iterable of floats, optional) – Percentile abundances to return for each feature in each group. By default, will return the minimum, 25th percentile, median, 75th percentile, and maximum abundances for each feature in each group.
Returns
pd.DataFrame – A table of features, their W-statistics and whether the null hypothesis is rejected.
"W" is the W-statistic, or number of features that a single feature is tested to be significantly different against.
"Reject null hypothesis" indicates if feature is differentially abundant across groups (True) or not (False).
pd.DataFrame – A table of features and their percentile abundances in each group. If percentiles is empty, this will be an empty pd.DataFrame. The rows in this object will be features, and the columns will be a multi-index where the first index is the percentile, and the second index is the group.
See also
multiplicative_replacement(), scipy.stats.ttest_ind(), scipy.stats.f_oneway(), scipy.stats.wilcoxon(), scipy.stats.kruskal()
Notes
The developers of this method recommend the following significance tests ([2], Supplementary File 1, top of page 11): if there are 2 groups, use the standard parametric t-test (scipy.stats.ttest_ind) or the non-parametric Wilcoxon rank-sum test. Note that scipy.stats.wilcoxon implements the paired signed-rank test; in SciPy the rank-sum test is provided by scipy.stats.ranksums (or the equivalent Mann-Whitney U test, scipy.stats.mannwhitneyu). If there are more than 2 groups, use parametric one-way ANOVA (scipy.stats.f_oneway) or non-parametric Kruskal-Wallis (scipy.stats.kruskal). Because one-way ANOVA is equivalent to the standard t-test when the number of groups is two, we default to scipy.stats.f_oneway here, which can be used when there are two or more groups. Users should refer to the documentation of these tests in SciPy to understand the assumptions made by each test.
This method cannot handle any zero counts as input, since the logarithm of zero cannot be computed. While this is an unsolved problem, many studies, including [2], have shown promising results by adding pseudocounts to all values in the matrix. In [2], a pseudocount of 0.001 was used, though the authors note that a pseudocount of 1.0 may also be useful. Zero counts can also be addressed using the multiplicative_replacement method.
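As a minimal sketch of the pseudocount approach mentioned above, the snippet below adds a constant to every entry so that the logarithm is defined everywhere. The toy matrix is made up, and the choice of 1.0 is only one of the values discussed; it does not show multiplicative_replacement.

```python
import numpy as np

# Made-up count matrix containing zeros (samples x features)
counts = np.array([[12, 0, 10],
                   [0, 11, 12],
                   [5, 11, 0]], dtype=float)

pseudocount = 1.0  # 0.001 is the other value mentioned above
adjusted = counts + pseudocount

# All entries are now strictly positive, so the log is finite
log_adjusted = np.log(adjusted)
print(log_adjusted)
```

The pseudocount is added to every value, not only to the zeros, so relative differences between small counts shift more than those between large counts.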
References
[1] Holm, S. “A simple sequentially rejective multiple test procedure”. Scandinavian Journal of Statistics (1979), 6.
[2] Mandal et al. “Analysis of composition of microbiomes: a novel method for studying microbial composition”, Microbial Ecology in Health & Disease, (2015), 26.
Examples
First import all of the necessary modules:
>>> from skbio.stats.composition import ancom
>>> import pandas as pd
Now let’s load in a DataFrame with 6 samples and 7 features (e.g., these may be bacterial OTUs):
>>> table = pd.DataFrame([[12, 11, 10, 10, 10, 10, 10],
...                       [9, 11, 12, 10, 10, 10, 10],
...                       [1, 11, 10, 11, 10, 5, 9],
...                       [22, 21, 9, 10, 10, 10, 10],
...                       [20, 22, 10, 10, 13, 10, 10],
...                       [23, 21, 14, 10, 10, 10, 10]],
...                      index=['s1', 's2', 's3', 's4', 's5', 's6'],
...                      columns=['b1', 'b2', 'b3', 'b4', 'b5', 'b6',
...                               'b7'])
Then create a grouping vector. In this example, there is a treatment group and a placebo group.
>>> grouping = pd.Series(['treatment', 'treatment', 'treatment',
...                       'placebo', 'placebo', 'placebo'],
...                      index=['s1', 's2', 's3', 's4', 's5', 's6'])
Now run ancom to determine if there are any features that are significantly different in abundance between the treatment and the placebo groups. The first DataFrame that is returned contains the ANCOM test results, and the second contains the percentile abundance data for each feature in each group.
>>> ancom_df, percentile_df = ancom(table, grouping)
>>> ancom_df['W']
b1 0
b2 4
b3 0
b4 1
b5 1
b6 0
b7 1
Name: W, dtype: int64
The W-statistic is the number of features that a single feature is tested to be significantly different against. In this scenario, b2 was detected to have significantly different abundances compared to four of the other features. To summarize the results from the W-statistic, let’s take a look at the results from the hypothesis test. The Reject null hypothesis column in the table indicates whether the null hypothesis was rejected, and that a feature was therefore observed to be differentially abundant across the groups.
>>> ancom_df['Reject null hypothesis']
b1 False
b2 True
b3 False
b4 False
b5 False
b6 False
b7 False
Name: Reject null hypothesis, dtype: bool
From this we can conclude that only b2 was significantly different in abundance between the treatment and the placebo. We still don’t know, for example, in which group b2 was more abundant. We therefore may next be interested in comparing the abundance of b2 across the two groups. We can do that using the second DataFrame that was returned. Here we compare the median (50th percentile) abundance of b2 in the treatment and placebo groups:
>>> percentile_df[50.0].loc['b2']
Group
placebo 21.0
treatment 11.0
Name: b2, dtype: float64
We can also look at a full five-number summary for b2 in the treatment and placebo groups:
>>> percentile_df.loc['b2']
Percentile Group
0.0 placebo 21.0
25.0 placebo 21.0
50.0 placebo 21.0
75.0 placebo 21.5
100.0 placebo 22.0
0.0 treatment 11.0
25.0 treatment 11.0
50.0 treatment 11.0
75.0 treatment 11.0
100.0 treatment 11.0
Name: b2, dtype: float64
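The five-number summary above can be cross-checked by hand from the b2 column of the input table. The sketch below uses NumPy's default linear interpolation. Whether ancom computes percentiles in exactly this way internally is an assumption, although the values agree with the output shown.

```python
import numpy as np

# b2 abundances taken from the example table
b2_treatment = np.array([11, 11, 11])  # samples s1, s2, s3
b2_placebo = np.array([21, 22, 21])    # samples s4, s5, s6

percentiles = (0.0, 25.0, 50.0, 75.0, 100.0)

treatment_summary = np.percentile(b2_treatment, percentiles)
placebo_summary = np.percentile(b2_placebo, percentiles)

print(treatment_summary)  # 11.0 at every percentile
print(placebo_summary)    # 21.0, 21.0, 21.0, 21.5, 22.0
```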
Taken together, these data tell us that b2 is present in significantly higher abundance in the placebo group samples than in the treatment group samples.