
StatisticalTest

Computes p-values using statistical tests, optionally with correction for multiple hypotheses. Tests supported are: t-test, Wilcoxon rank sum and signed rank tests, Chi-squared, Fisher's exact test, F-test for variance, Shapiro-Wilk normality test, Kolmogorov-Smirnov and correlation tests (Pearson and Spearman). Tests can be paired/non-paired and two-sided/one-sided; not all combinations are relevant for some tests. If the data contain missing values, only non-missing values are used.

Several tests are computed in parallel. Tests can be computed for each row (when byRow=true) or for various column combinations (when byRow=false). If multiple hypothesis correction is enabled (using the correction parameter), corrected p-values are computed in addition to raw p-values. Both raw and corrected p-values are written to the pvalues output. Identifiers whose p-value is less than the given threshold are written to the idlist output. If multiple hypothesis correction is enabled, the threshold comparison uses corrected p-values; otherwise, raw p-values are used.
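The threshold rule above can be illustrated with a minimal Python sketch. It is not the component's R implementation: the function names are hypothetical, and only the "fdr" (Benjamini-Hochberg) and "bonferroni" corrections are covered.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, keeping a running minimum so the
    # adjusted values stay monotone.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_ids(ids, pvalues, threshold=0.05, correction="none"):
    """Return the IDs whose (possibly corrected) p-value is below threshold."""
    if correction == "none":
        used = list(pvalues)  # raw p-values are compared
    elif correction == "fdr":
        used = benjamini_hochberg(pvalues)
    elif correction == "bonferroni":
        used = bonferroni(pvalues)
    else:
        raise ValueError("correction not covered by this sketch: " + correction)
    return [test_id for test_id, p in zip(ids, used) if p < threshold]


ids = ["g1", "g2", "g3", "g4"]
raw = [0.01, 0.04, 0.03, 0.5]
print(select_ids(ids, raw))                    # compared on raw p-values
print(select_ids(ids, raw, correction="fdr"))  # compared on corrected p-values
```

With correction enabled, fewer identifiers generally pass the same threshold, because the corrected p-values are never smaller than the raw ones.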

Each test needs one or two numeric vectors that are obtained from the input. The first vector is called the target and the second the reference. Some tests use only the target (e.g., Shapiro-Wilk). Some can use a reference if present (e.g., t-test and Wilcoxon rank sum); otherwise, the expected mean is supplied with the mean parameter. If a reference is not used, referenceColumns should be empty.

When byRow=true, targetColumns and referenceColumns give two lists of columns that are used for testing on each row; they can have different lengths, except when paired=true or with correlation tests. Column lists can also be obtained from the groups input if present. Identifiers for idlist are taken from the first column of the input matrix. The matrix2 input must not be present.
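As an illustration of the byRow=true case, the following Python sketch extracts the identifier, target vector and reference vector for one row and drops missing values. The function names and the representation of missing values (None or NaN) are assumptions made for the example. It covers only the unpaired case, since dropping values independently from each vector would break the pairing.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_vectors(header, row, target_columns, reference_columns):
    """Return (id, target values, reference values) for one matrix row."""
    by_name = dict(zip(header, row))

    def pick(columns):
        # Only non-missing values are used for testing.
        return [by_name[c] for c in columns if not is_missing(by_name[c])]

    # The identifier is taken from the first column of the matrix.
    return row[0], pick(target_columns), pick(reference_columns)


header = ["ID", "S1", "S2", "S3", "S6", "S7"]
row = ["gene1", 1.0, None, 2.5, 4.0, float("nan")]
print(row_vectors(header, row, ["S1", "S2", "S3"], ["S6", "S7"]))
```

The two column lists here have different lengths, which is allowed when the test is neither paired nor a correlation test.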

When byRow=false, targetColumns and referenceColumns must have equal length, denoted N. If the reference is not used, referenceColumns is empty and targetColumns has N entries. N tests are then performed so that, on each iteration i, target values are taken from the i'th column of targetColumns and reference values from the i'th column of referenceColumns. Identifiers for idlist have the format T_R, where T is a target column and R is a reference column. If two matrices are supplied, target values are taken from the first and reference values from the second. If the matrices have the same number of columns, targetColumns and referenceColumns can be set to * to test the i'th column of matrix against the i'th column of matrix2.
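The column pairing for byRow=false can be sketched in Python as follows. The function name is hypothetical, and the sketch covers only the case where reference columns are given, because the identifier format for reference-free tests is not stated here.

```python
def column_test_plan(target_columns, reference_columns):
    """Pair the i'th target column with the i'th reference column.

    Returns a list of (identifier, target column, reference column) tuples,
    with identifiers in the T_R format.
    """
    if len(target_columns) != len(reference_columns):
        raise ValueError("targetColumns and referenceColumns must have equal length")
    return [
        (target + "_" + reference, target, reference)
        for target, reference in zip(target_columns, reference_columns)
    ]


# Column lists taken from the case5_col_t test case.
plan = column_test_plan(["S1", "S1", "S3"], ["S2", "S3", "S4"])
for test_id, target, reference in plan:
    print(test_id, target, reference)
```

The same column may appear more than once in either list, as S1 does above, so one column can be tested against several others.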

Correlation tests compute r and compare it to r=0; they return p-values and not correlation coefficients. Chi-squared and Fisher's exact test use contingency matrices for testing. These matrices are created from linear input vectors in column-major order so that [1 2 3 4] is considered as column vectors [1 2] and [3 4]. The number of rows in contingency matrices is set with contingencyRows.
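The column-major construction of contingency matrices can be shown with a short Python sketch. The function name is hypothetical, and raising an error when the vector length is not a multiple of the row count is an assumption of the sketch, not documented behavior of the component.

```python
def contingency_table(values, contingency_rows=2):
    """Reshape a linear vector into a table (list of rows), column-major."""
    if len(values) % contingency_rows != 0:
        # Assumption: incomplete tables are rejected in this sketch.
        raise ValueError("vector length must be a multiple of contingencyRows")
    n_columns = len(values) // contingency_rows
    # Element (r, c) of the table is values[c * contingency_rows + r].
    return [
        [values[c * contingency_rows + r] for c in range(n_columns)]
        for r in range(contingency_rows)
    ]


table = contingency_table([1, 2, 3, 4], contingency_rows=2)
print(table)  # rows of the table; its columns are [1, 2] and [3, 4]
```

The number of columns follows from the data: a vector of length 12 with contingency_rows=3 gives a table with 4 columns.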

Version 1.1
Bundle tools
Categories Analysis
Authors Kristian Ovaska (kristian.ovaska@helsinki.fi)
Requires R ; MASS (R-package)
Source files component.xml StatisticalTest.r
Usage Example with default values

Inputs

Name Type Mandatory Description
matrix Matrix Mandatory Input matrix, containing values to be tested.
matrix2 Matrix Optional Secondary input matrix, containing the values to be tested. Only used when byRow=false.
groups SampleGroupTable Optional If given, must contain the groups named by the parameters targetColumns and referenceColumns. Both groups must have at least two members. The members must be present in the input matrices.

Outputs

Name Type Description
pvalues CSV Contains all computed p-values, including those over the threshold. The first column contains IDs associated with tests (e.g., gene names). The second column (named by pvalueColumn) contains raw p-values. If multiple hypothesis correction is enabled, the third column (named by correctedColumn) contains corrected p-values.
idlist SetList Contains the IDs of those tests whose p-value is below the threshold. The name of the sole ID set is given with the parameter outputSet.

Parameters

Name Type Default Description
byRow boolean true If true, there is one test for each row. If false, there are tests for various column combinations.
contingencyRows int 2 For categorical variable independence tests (chi-squared, Fisher), this gives the number of rows in contingency tables. The number of columns is deduced from data. For example, if contingencyRows=3 and target vector has length 12, the table has 4 columns.
correctedColumn string "PValueCorrected" Output column name in "pvalues" for corrected p-values. Only used if multiple hypothesis correction has been enabled.
correction string "none" Type of multiple hypothesis correction. One of "none", "fdr" (Benjamini-Hochberg, 1995), "robustfdr" (Pounds-Cheng 2006), "BY" (FDR, Benjamini-Yekutieli, 2001), "holm", "bonferroni". The value "none" disables correction.
mean float 0 Expected mean of the reference distribution (mu) when the reference group is not supplied. Used for the t-test and Wilcoxon tests.
outputSet string "statistical" The name of the output set in the idlist output.
paired boolean false Indicates whether the test is paired so that i'th element of target vector corresponds to i'th element of reference vector. Ignored for F-test.
prefixColumn string "" If non-empty, include a column in the pvalues output that contains the outputSet on every row. This parameter gives the name of that column. The column is the first column in pvalues. This prefix column is useful when several separate statistical tests are made and the results are combined into one file. If prefixColumn is empty, do not include the prefix column.
pvalueColumn string "PValue" Output column name in "pvalues" for raw p-values.
referenceColumns string (no default) Comma-separated list of reference columns, or, if the groups input is present, the name of a sample group. The special value * means all columns of matrix2 if matrix2 is supplied; otherwise, all columns of matrix are used. May be empty if references are not used.
sided string "twosided" Sidedness of the test. One of "twosided", "greater" (target greater than reference) or "less" (target smaller than reference).
targetColumns string (no default) Comma-separated list of target columns, or, if the groups input is present, the name of a sample group. The special value * means all columns of matrix.
test string "t-test" Type of the statistical test. One of "t-test", "wilcoxon", "chi-squared", "fisher", "F-test", "shapiro" (Shapiro-Wilk), "ks" (Kolmogorov-Smirnov), "cor-pearson", "cor-spearman". Variants of the tests are set with other parameters (sided, paired).
threshold float 0.05 P-value threshold for inclusion in the idlist output. Must be between 0 and 1 inclusive. If multiple hypothesis correction has been enabled, the threshold is for corrected p-values.
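The effect of the column-naming parameters on the pvalues output can be sketched in Python. This is an illustration only: the function name and the name of the ID column ("ID") are assumptions, since the ID column name is not specified here.

```python
def pvalues_table(ids, raw, corrected=None,
                  pvalue_column="PValue",
                  corrected_column="PValueCorrected",
                  prefix_column="",
                  output_set="statistical",
                  id_column="ID"):  # id_column is a hypothetical name
    """Build the pvalues output as a list of rows, header first."""
    header = [id_column, pvalue_column]
    if corrected is not None:
        header.append(corrected_column)  # present only with correction enabled
    if prefix_column:
        header.insert(0, prefix_column)  # the prefix column comes first
    rows = [header]
    for i, test_id in enumerate(ids):
        row = [test_id, raw[i]]
        if corrected is not None:
            row.append(corrected[i])
        if prefix_column:
            row.insert(0, output_set)  # the set name is repeated on every row
        rows.append(row)
    return rows


# Parameter values taken from the case4_prefix test case; p-values are made up.
for row in pvalues_table(["g1", "g2"], [0.01, 0.2], corrected=[0.02, 0.2],
                         pvalue_column="myPvalueColumn",
                         prefix_column="PrefixSetID", output_set="mySet"):
    print(row)
```

Repeating the set name on every row is what makes it possible to concatenate the outputs of several separate tests into one file and still tell them apart.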

Test cases

Test case Parameters IN matrix IN matrix2 IN groups OUT pvalues OUT idlist
case1 properties matrix (missing) groups pvalues idlist

targetColumns=Group1,
referenceColumns=Group2,
outputSet=mySet

case10_ks properties matrix (missing) (missing) pvalues idlist

test=ks,
targetColumns=S1,S3,S4,S5,
referenceColumns=S6,S7,S8,S9,S10,
sided=greater,
threshold=0.5,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case11_shapiro properties matrix (missing) (missing) pvalues idlist

test=shapiro,
targetColumns=S1,S3,S4,S5,
referenceColumns=,
threshold=0.5,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case12_t_mu properties matrix (missing) groups pvalues idlist

targetColumns=Group1,
referenceColumns=,
mean=2,
threshold=0.5,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case13_col_wilcox_mu properties matrix (missing) (missing) pvalues idlist

byRow=false,
test=wilcoxon,
targetColumns=S1,S3,S4,
referenceColumns=,
sided=less,
mean=2,
threshold=0.1,
pvalueColumn=myPvalueColumn,
outputSet=mySet

case14_chisq properties matrix (missing) (missing) pvalues idlist

test=chi-squared,
targetColumns=S1,S2,S3,S4,S5,S6,
referenceColumns=,
contingencyRows=3,
threshold=0.1,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case15_col_fisher properties matrix (missing) (missing) pvalues idlist

byRow=false,
test=fisher,
targetColumns=S1,S2,S4,S5,
referenceColumns=,
contingencyRows=3,
threshold=0.01,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case2_paired_greater properties matrix (missing) (missing) pvalues idlist

test=t-test,
targetColumns=S1,S2,S4,
referenceColumns=S7,S8,S10,
sided=greater,
paired=true,
pvalueColumn=myPvalueColumn,
outputSet=mySet,
threshold=0.1

case3_fdr properties matrix (missing) groups pvalues idlist

correction=fdr,
targetColumns=Group1,
referenceColumns=Group2,
pvalueColumn=myPvalueColumn,
correctedColumn=PValueCor,
threshold=0.9

case4_prefix properties matrix (missing) groups pvalues idlist

correction=fdr,
targetColumns=Group1,
referenceColumns=Group2,
outputSet=mySet,
pvalueColumn=myPvalueColumn,
threshold=0.9,
prefixColumn=PrefixSetID

case5_col_t properties matrix (missing) (missing) pvalues idlist

byRow=false,
targetColumns=S1,S1,S3,
referenceColumns=S2,S3,S4,
sided=less,
correction=bonferroni,
threshold=0.5,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case6_wilcox properties matrix (missing) groups pvalues idlist

test=wilcoxon,
targetColumns=Group1,
referenceColumns=Group2,
correction=fdr,
threshold=0.6,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case7_col_wilcox properties matrix (missing) (missing) pvalues idlist

byRow=false,
test=wilcoxon,
targetColumns=S1,S1,S3,
referenceColumns=S2,S3,S4,
sided=less,
correction=bonferroni,
threshold=0.5,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case8_col_F properties matrix matrix2 (missing) pvalues idlist

byRow=false,
test=F-test,
targetColumns=*,
referenceColumns=*,
sided=greater,
correction=bonferroni,
threshold=0.5,
outputSet=mySet,
pvalueColumn=myPvalueColumn

case9_cor properties matrix (missing) (missing) pvalues idlist

test=cor-pearson,
targetColumns=S1,S3,S4,S5,
referenceColumns=S6,S7,S8,S10,
threshold=0.5,
outputSet=mySet,
pvalueColumn=myPvalueColumn


Generated 2019-02-08 07:42:19 by Anduril 2.0.0