Computes p-values using statistical tests, optionally with correction for multiple hypotheses. The supported tests are: t-test, Wilcoxon rank sum and signed rank tests, chi-squared test, Fisher's exact test, F-test for variance, Shapiro-Wilk normality test, Kolmogorov-Smirnov test, and correlation tests (Pearson and Spearman). Tests can be paired or non-paired and two-sided or one-sided; not every combination applies to every test. If the data contain missing values, only non-missing values are used.
Several tests are computed in parallel. Tests can be computed for each row (when byRow=true) or for various column combinations (when byRow=false). If multiple hypothesis correction is enabled (using the correction parameter), corrected p-values are computed in addition to the raw p-values, and both are written to the pvalues output. Identifiers whose p-value is less than the given threshold are written to the idlist output. If multiple hypothesis correction is enabled, the threshold comparison uses corrected p-values; otherwise, raw p-values are used.
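The correction and thresholding step described above can be sketched in plain Python. This is only an illustration, not the component's R implementation: the function names are hypothetical, and only the "none", "fdr" (Benjamini-Hochberg) and "bonferroni" settings are covered.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_ids(ids, raw, threshold=0.05, correction="none"):
    """Return IDs whose p-value (corrected, if enabled) is below threshold."""
    if correction == "fdr":
        compared = benjamini_hochberg(raw)
    elif correction == "bonferroni":
        compared = bonferroni(raw)
    elif correction == "none":
        compared = raw
    else:
        raise ValueError("correction not covered by this sketch: " + correction)
    return [name for name, p in zip(ids, compared) if p < threshold]


ids = ["g1", "g2", "g3", "g4"]    # hypothetical row identifiers
raw = [0.01, 0.04, 0.03, 0.20]    # hypothetical raw p-values
print(select_ids(ids, raw))                    # compared on raw p-values
print(select_ids(ids, raw, correction="fdr"))  # compared on corrected p-values
```

With these made-up values, three identifiers pass the threshold on raw p-values, but only one remains once the comparison is made on corrected p-values.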
Each test needs one or two numeric vectors that are obtained from the input. The first vector is called target and the second reference. Some tests use only target (e.g., Shapiro-Wilk). Some can use reference if present (e.g., t-test and Wilcoxon rank sum); otherwise, the expected mean is supplied with the mean parameter. If reference is not used, referenceColumns should be empty.
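To illustrate the case where only a target vector and an expected mean are given, here is a minimal sketch of the one-sample t statistic. The function name and data are hypothetical, and the sketch stops at the statistic: converting it to a p-value requires the t distribution, which the component obtains from R.

```python
import math
import statistics


def one_sample_t(target, mu=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean(target) == mu."""
    n = len(target)
    standard_error = statistics.stdev(target) / math.sqrt(n)
    t = (statistics.mean(target) - mu) / standard_error
    return t, n - 1


# Hypothetical target vector tested against an expected mean of 2.
t, df = one_sample_t([1.0, 2.0, 3.0, 4.0, 5.0], mu=2.0)
print(t, df)
```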
When byRow=true, targetColumns and referenceColumns give two lists of columns that are used for testing on each row; they can have different lengths, except when paired=true or with correlation tests. Column lists can also be obtained from the groups input if present. Identifiers for idlist are taken from the first column of the input matrix. The matrix2 input must not be present.
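The row-wise selection can be sketched as follows. The matrix layout (a header list plus row lists, with None for a missing value) and the function name are assumptions made for illustration. Only the non-paired case is shown, since dropping a missing value from one vector would break the pairing.

```python
def row_vectors(header, rows, target_columns, reference_columns):
    """Yield (row_id, target, reference) for each row of the matrix.

    The first column is taken as the row identifier; None marks a missing value.
    """
    index = {name: i for i, name in enumerate(header)}
    for row in rows:
        target = [row[index[c]] for c in target_columns]
        reference = [row[index[c]] for c in reference_columns]
        # Keep only non-missing values (non-paired case).
        target = [v for v in target if v is not None]
        reference = [v for v in reference if v is not None]
        yield row[0], target, reference


# Hypothetical input: three target samples and two reference samples.
header = ["Gene", "S1", "S2", "S3", "S4", "S5"]
rows = [
    ["geneA", 1.0, 1.2, None, 3.1, 2.9],
    ["geneB", 0.5, 0.4, 0.6, 0.5, None],
]
vectors = list(row_vectors(header, rows, ["S1", "S2", "S3"], ["S4", "S5"]))
for row_id, target, reference in vectors:
    print(row_id, target, reference)
```

Each yielded pair of vectors would then be passed to the selected test, with the row identifier used in the outputs.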
When byRow=false, targetColumns and referenceColumns must have equal length, denoted N. If reference is not used, referenceColumns is empty and targetColumns has N entries. N tests are then performed: on iteration i, target values are taken from the i'th column of targetColumns and reference values from the i'th column of referenceColumns. Identifiers for idlist have the format T_R, where T is a target column and R is a reference column. If two matrices are supplied, target values are taken from the first and reference values from the second. If the matrices have the same number of columns, targetColumns and referenceColumns can both be set to * to test the i'th column of matrix against the i'th column of matrix2.
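The column pairing and the T_R identifiers can be sketched like this. Representing each matrix as a dict from column name to values is an assumption of the sketch, as are the function names. The case without reference columns is left out.

```python
def expand(columns, matrix):
    """Turn a comma-separated column list (or "*") into a list of names."""
    if columns == "*":
        return list(matrix)  # all columns, in matrix order
    return [name.strip() for name in columns.split(",")]


def column_tests(matrix, target_columns, reference_columns, matrix2=None):
    """Return a list of (test_id, target_values, reference_values)."""
    # Reference values come from matrix2 when it is supplied.
    reference_source = matrix2 if matrix2 is not None else matrix
    targets = expand(target_columns, matrix)
    references = expand(reference_columns, reference_source)
    if len(targets) != len(references):
        raise ValueError("targetColumns and referenceColumns must have equal length")
    return [(t + "_" + r, matrix[t], reference_source[r])
            for t, r in zip(targets, references)]


# Hypothetical matrices with two columns each.
matrix = {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}
matrix2 = {"X": [1.5, 2.5, 3.5], "Y": [4.5, 5.5, 6.5]}

two_matrix_tests = column_tests(matrix, "*", "*", matrix2)
single_matrix_tests = column_tests(matrix, "A", "B")
print([test_id for test_id, _, _ in two_matrix_tests])
print([test_id for test_id, _, _ in single_matrix_tests])
```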
Correlation tests compute r and compare it to r=0; they return p-values and not correlation coefficients. Chi-squared and Fisher's exact test use contingency matrices for testing. These matrices are created from linear input vectors in column-major order so that [1 2 3 4] is considered as column vectors [1 2] and [3 4]. The number of rows in contingency matrices is set with contingencyRows.
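The column-major construction of contingency matrices can be sketched as below. The function name is hypothetical, and rejecting vectors whose length is not a multiple of contingencyRows is an assumption of this sketch rather than documented behavior.

```python
def contingency_table(values, contingency_rows=2):
    """Reshape a linear vector into a table (list of rows), filled column by column."""
    if len(values) % contingency_rows != 0:
        # Assumption of this sketch: incomplete columns are rejected.
        raise ValueError("vector length must be a multiple of contingency_rows")
    n_cols = len(values) // contingency_rows
    return [[values[c * contingency_rows + r] for c in range(n_cols)]
            for r in range(contingency_rows)]


table = contingency_table([1, 2, 3, 4])
print(table)  # [[1, 3], [2, 4]]: the columns are [1 2] and [3 4]

wide = contingency_table(list(range(12)), contingency_rows=3)
print(len(wide), len(wide[0]))  # 3 4
```

The second call mirrors the contingencyRows=3 example: a vector of length 12 gives a table with 3 rows and 4 columns.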
Version | 1.1 |
---|---|
Bundle | tools |
Categories | Analysis |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi) |
Requires | R ; MASS (R-package) |
Source files | component.xml StatisticalTest.r |
Name | Type | Mandatory | Description |
---|---|---|---|
matrix | Matrix | Mandatory | Input matrix, containing values to be tested. |
matrix2 | Matrix | Optional | Secondary input matrix, containing the values to be tested. Only used when byRow=false. |
groups | SampleGroupTable | Optional | If given, must contain the groups named by the parameters targetColumns and referenceColumns. Both groups must have at least two members. The members must be present in the input matrices. |
Name | Type | Description |
---|---|---|
pvalues | CSV | Contains all computed p-values, including those over the threshold. The first column contains IDs associated to tests (e.g., gene names). The second column (named by pvalueColumn) contains raw p-values. If multiple hypothesis correction is enabled, the third column (named by correctedColumn) contains corrected p-values. |
idlist | SetList | Contains the IDs of those tests whose p-value is below the threshold. The name of the sole ID set is given with the parameter outputSet. |
Name | Type | Default | Description |
---|---|---|---|
byRow | boolean | true | If true, there is one test for each row. If false, there are tests for various column combinations. |
contingencyRows | int | 2 | For categorical variable independence tests (chi-squared, Fisher), this gives the number of rows in contingency tables. The number of columns is deduced from data. For example, if contingencyRows=3 and target vector has length 12, the table has 4 columns. |
correctedColumn | string | "PValueCorrected" | Output column name in "pvalues" for corrected p-values. Only used if multiple hypothesis correction has been enabled. |
correction | string | "none" | Type of multiple hypothesis correction. One of "none", "fdr" (Benjamini-Hochberg, 1995), "robustfdr" (Pounds-Cheng 2006), "BY" (FDR, Benjamini-Yekutieli, 2001), "holm", "bonferroni". The value "none" disables correction. |
mean | float | 0 | Expected mean of the reference distribution (mu) when the reference group is not supplied. Used for the t-test and the Wilcoxon test. |
outputSet | string | "statistical" | The name of the output set in the idlist output. |
paired | boolean | false | Indicates whether the test is paired so that i'th element of target vector corresponds to i'th element of reference vector. Ignored for F-test. |
prefixColumn | string | "" | If non-empty, include a column in the pvalues output that contains the outputSet on every row. This parameter gives the name of that column. The column is the first column in pvalues. This prefix column is useful when several separate statistical tests are made and the results are combined into one file. If prefixColumn is empty, do not include the prefix column. |
pvalueColumn | string | "PValue" | Output column name in "pvalues" for raw p-values. |
referenceColumns | string | (no default) | Comma-separated list of reference columns, or, if the groups input is present, the name of a sample group. The special value * means all columns of matrix2 if matrix2 is supplied; otherwise, all columns of matrix are used. May be empty if references are not used. |
sided | string | "twosided" | Sidedness of the test. One of "twosided", "greater" (target greater than reference) or "less" (target smaller than reference). |
targetColumns | string | (no default) | Comma-separated list of target columns, or, if the groups input is present, the name of a sample group. The special value * means all columns of matrix. |
test | string | "t-test" | Type of the statistical test. One of "t-test", "wilcoxon", "chi-squared", "fisher", "F-test", "shapiro" (Shapiro-Wilk), "ks" (Kolmogorov-Smirnov), "cor-pearson", "cor-spearman". Variants of the tests are set with other parameters (sided, paired). |
threshold | float | 0.05 | P-value threshold for inclusion in the idlist output. Must be between 0 and 1 inclusive. If multiple hypothesis correction has been enabled, the threshold is for corrected p-values. |
Test case | Parameters | IN matrix | IN matrix2 | IN groups | OUT pvalues | OUT idlist |
---|---|---|---|---|---|---|
case1 | properties: targetColumns=Group1, … | matrix | (missing) | groups | pvalues | idlist |
case10_ks | properties: test=ks, … | matrix | (missing) | (missing) | pvalues | idlist |
case11_shapiro | properties: test=shapiro, … | matrix | (missing) | (missing) | pvalues | idlist |
case12_t_mu | properties: targetColumns=Group1, … | matrix | (missing) | groups | pvalues | idlist |
case13_col_wilcox_mu | properties: byRow=false, … | matrix | (missing) | (missing) | pvalues | idlist |
case14_chisq | properties: test=chi-squared, … | matrix | (missing) | (missing) | pvalues | idlist |
case15_col_fisher | properties: byRow=false, … | matrix | (missing) | (missing) | pvalues | idlist |
case2_paired_greater | properties: test=t-test, … | matrix | (missing) | (missing) | pvalues | idlist |
case3_fdr | properties: correction=fdr, … | matrix | (missing) | groups | pvalues | idlist |
case4_prefix | properties: correction=fdr, … | matrix | (missing) | groups | pvalues | idlist |
case5_col_t | properties: byRow=false, … | matrix | (missing) | (missing) | pvalues | idlist |
case6_wilcox | properties: test=wilcoxon, … | matrix | (missing) | groups | pvalues | idlist |
case7_col_wilcox | properties: byRow=false, … | matrix | (missing) | (missing) | pvalues | idlist |
case8_col_F | properties: byRow=false, … | matrix | matrix2 | (missing) | pvalues | idlist |
case9_cor | properties: test=cor-pearson, … | matrix | (missing) | (missing) | pvalues | idlist |