Calculates a score matrix for a dichotomously classified set of samples. Each sample is represented by its class (0 or 1, e.g. cancer or not) and by a gene expression ranking of size G. Each row of the ranks input holds the expression ranks of one gene across the samples. The only requirement for ranks is that they be comparable; lower ranks are interpreted as more highly expressed. The output is a weighted vote over ordered candidate gene pairs: a matrix A whose element A(i, j) corresponds to gene pair (i, j). The magnitude of the score is higher for gene pairs whose relative expression order is consistent among the samples of one class and reversed in the other class. If the ordering seems independent of sample class, the score is close to zero. If the ordering differs by sample class, the score approaches 1 or -1, depending on whether the ordered gene pair indicates class 0 or class 1, respectively. Based on the article "Merging microarray data from separate breast cancer studies provides a robust prognostic test" by Lei Xu, Aik Choon Tan, Raimond L Winslow and Donald Geman.
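The scoring idea described above can be illustrated with a minimal sketch. It is not the code in rank_score.py: the function name `pair_scores`, the list-of-lists layout, and the exact formula (frequency of "gene i ranks below gene j" in class 0 minus the same frequency in class 1) are assumptions chosen to match the sign convention in the description. It also assumes no missing values and at least one sample per class.

```python
def pair_scores(ranks, classes):
    """Hypothetical sketch: ranks[g][s] is the rank of gene g in sample s,
    classes[s] is the class (0 or 1) of sample s."""
    n_genes = len(ranks)
    # Sample indices belonging to each class.
    samples = {c: [s for s, cls in enumerate(classes) if cls == c] for c in (0, 1)}
    scores = [[0.0] * n_genes for _ in range(n_genes)]
    for i in range(n_genes):
        for j in range(n_genes):
            if i == j:
                continue
            freq = {}
            for c in (0, 1):
                # Count samples of class c in which gene i ranks below gene j.
                hits = sum(1 for s in samples[c] if ranks[i][s] < ranks[j][s])
                freq[c] = hits / len(samples[c])
            # Assumed convention: positive indicates class 0, negative class 1.
            scores[i][j] = freq[0] - freq[1]
    return scores


# Toy data: 3 genes (rows) x 4 samples (columns); the first two samples are class 0.
ranks = [
    [1, 1, 3, 3],
    [2, 2, 1, 2],
    [3, 3, 2, 1],
]
classes = [0, 0, 1, 1]
scores = pair_scores(ranks, classes)
```

In this toy data gene 0 ranks below gene 1 in every class 0 sample and in no class 1 sample, so `scores[0][1]` is 1.0 and the reversed pair `scores[1][0]` is -1.0.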
Version | 1.0 |
---|---|
Bundle | microarray |
Categories | |
Authors | Lauri Lyly (lauri.lyly@helsinki.fi) |
Issue tracker | View/Report issues |
Source files | component.xml rank_score.py |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
inClasses | CSV | Mandatory | Sample classes. The method supports only two classes. Each row gives a class, either 0 or 1. |
ranks | CSV | Mandatory | Gene ranks by sample. The first column is a list of gene IDs. The remaining columns are interpreted as samples. Each row of these columns holds the ranks of one gene. A lower rank is interpreted as more expressed, though this only affects the sign of the result. |
genes | CSV | Optional | Genes of interest, or all genes if not specified. The names must exactly match those in the first column of the ranks CSV file. |
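As an illustration of the ranks layout and the genes filter, the following sketch reads a small table and keeps only the genes of interest. The helper name `load_ranks`, the tab delimiter, the header row, and the `NA` token for missing values are assumptions made for this example and are not taken from the component itself.

```python
import csv
import io


def load_ranks(handle, genes_of_interest=None, delimiter="\t"):
    """Hypothetical reader: first column is the gene ID, remaining columns are samples."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]
    ranks = {}
    for row in reader:
        gene, values = row[0], row[1:]
        # Gene names must match the first column exactly to be kept.
        if genes_of_interest is not None and gene not in genes_of_interest:
            continue
        # Missing measurements (assumed to be written as "NA") become None.
        ranks[gene] = [None if v == "NA" else float(v) for v in values]
    return samples, ranks


text = "gene\ts1\ts2\ts3\nG1\t1\t2\tNA\nG2\t2\t1\t1\nG3\t3\t3\t2\n"
samples, ranks = load_ranks(io.StringIO(text), genes_of_interest={"G1", "G3"})
```

Here `ranks` contains only G1 and G3, and the missing value for G1 in sample s3 is kept as `None` rather than dropped.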
Name | Type | Description |
---|---|---|
scores | CSV | Score matrix |
outClasses | CSV | Class names in the order that corresponds to the sign of the scores matrix |
Name | Type | Default | Description |
---|---|---|---|
count_na | boolean | true | The score function uses a normalization constant, the number of samples of a given class, to convert the number of specifically ordered gene pairs into a probability measure. There are two basic options: count all samples (true), or count only those samples that actually have a value other than NA for one gene or the other (false). This determines whether genes with only a few measurements can ever get a high score: if the NA samples are counted for a gene pair, the normalization constant is always large and the probability is therefore close to zero. On the other hand, if only one class has samples for a specific gene, that could tilt the results in favor of that class unless this parameter is set to true. The default is true because it is hard to argue that genes without measurements should be considered important if there is, for example, just one "positive" measurement for them. |
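The effect of count_na on the normalization constant can be seen in a small sketch for one gene pair within one class. The function name `ordered_pair_frequency` is hypothetical, `None` stands in for NA, and the sketch assumes a sample counts as measured only when both genes have a value; the component may treat partially measured samples differently.

```python
def ordered_pair_frequency(rank_i, rank_j, count_na=True):
    """Fraction of one class's samples in which gene i ranks below gene j.

    rank_i and rank_j are per-sample ranks; None marks a missing value.
    """
    hits = 0
    measured = 0
    for a, b in zip(rank_i, rank_j):
        if a is None or b is None:
            continue  # the ordering cannot be decided without both values
        measured += 1
        if a < b:
            hits += 1
    # count_na=True normalizes by all samples, count_na=False by measured ones.
    denom = len(rank_i) if count_na else measured
    return hits / denom if denom else 0.0


# Gene i was measured in only one of four samples.
rank_i = [1, None, None, None]
rank_j = [2, 5, None, 3]
with_na = ordered_pair_frequency(rank_i, rank_j, count_na=True)
without_na = ordered_pair_frequency(rank_i, rank_j, count_na=False)
```

With a single usable measurement, `with_na` is 0.25 while `without_na` is 1.0, which shows how a sparsely measured gene can reach the maximum frequency when NA samples are left out of the normalization.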
Test case | Parameters | IN inClasses | IN ranks | IN genes | OUT scores | OUT outClasses |
---|---|---|---|---|---|---|
case1 | (missing) | inClasses | ranks | (missing) | (missing) | (missing) |
case2 | (missing) | inClasses | ranks | genes | (missing) | (missing) |
case3 | properties: count_na=false | inClasses | ranks | (missing) | (missing) | (missing) |
case4 | properties: count_na=false | inClasses | ranks | genes | (missing) | (missing) |