Imports data from microarray text files such as Agilent CSV files. The input is a directory of CSV files, each of which represents a two-channel or one-channel microarray. Three types of data can be extracted: numeric matrices containing channel or log-ratio values, probe annotations and sample-specific annotations. This component does not perform normalization steps such as background correction. Binary files cannot be processed by this component.
The outputs green, green2, red, and red2 are matrices, each containing values from one column in each CSV file. The source columns for these matrices are named with the parameter channelColumns. In the output matrices, each column represents one microarray (i.e., one input CSV file) and each row represents a probe. Interpretation of these matrices depends on the columns selected. The default settings, geared towards Agilent two-channel arrays, extract normalized values into "green" and "red" and raw values into "green2" and "red2". However, by modifying channelColumns, it is also possible, for example, to extract raw foreground values into "green" and "red" and background values into "green2" and "red2". For one-channel arrays, "red" and "red2" are not used.
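The mapping from channelColumns to the four matrix outputs can be sketched as follows. This is a minimal Python illustration and not the component's Java implementation; the function name `parse_channel_columns` is hypothetical. It follows the documented rule that empty values may be omitted, so "col1" is treated the same as "col1,,,". Tolerating extra trailing empty entries is a choice made for this sketch only.

```python
# Order of outputs as documented for channelColumns.
OUTPUTS = ("green", "green2", "red", "red2")


def parse_channel_columns(spec):
    """Map each matrix output to its source column name, or None if unused."""
    names = [name.strip() for name in spec.split(",")]
    # Sketch-level choice: extra entries are accepted only if they are empty.
    if any(names[len(OUTPUTS):]):
        raise ValueError("at most four column names are expected")
    names = names[:len(OUTPUTS)]
    # Pad omitted trailing values with empty strings.
    names += [""] * (len(OUTPUTS) - len(names))
    return {output: (name or None) for output, name in zip(OUTPUTS, names)}


default_spec = "gProcessedSignal,gMedianSignal,rProcessedSignal,rMedianSignal"
default_mapping = parse_channel_columns(default_spec)
one_channel_mapping = parse_channel_columns("gProcessedSignal")
```

With the default value, "green" and "red" map to the processed signal columns and "green2" and "red2" to the median signal columns; with a single name, only "green" is populated.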
If combineProbes=true, probes that have multiple copies on the microarray are combined using the median so that the output contains a unique value for each probe. Probe IDs are taken from the column named by idColumn. If combineProbes=false, probe IDs are renamed to internal unique IDs that can be mapped to original probe IDs using the first two columns of probeAnnotation.
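The two modes can be illustrated with a small Python sketch. It is not the component's own code: the function names and the internal ID format (`probe1`, `probe2`, ...) are hypothetical, and `statistics.median` averages the two middle values for an even number of copies, which may or may not match the component's behavior.

```python
from collections import defaultdict
from statistics import median


def combine_probes(probe_ids, values):
    """combineProbes=true: one median value per original probe ID."""
    grouped = defaultdict(list)
    for probe_id, value in zip(probe_ids, values):
        grouped[probe_id].append(value)
    return {probe_id: median(copies) for probe_id, copies in grouped.items()}


def assign_internal_ids(probe_ids):
    """combineProbes=false: keep every copy under a unique internal ID.

    Returns (internal ID, original probe ID) pairs, mirroring the first two
    columns of the probeAnnotation output. The ID format is an assumption.
    """
    return [(f"probe{i}", probe_id) for i, probe_id in enumerate(probe_ids, start=1)]


ids = ["A", "B", "A", "A"]
signal = [1.0, 5.0, 3.0, 2.0]
combined = combine_probes(ids, signal)
mapping = assign_internal_ids(ids)
```

Here probe "A" has three copies and receives their median, while the second function keeps all four rows and records which original probe each one came from.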
Input rows can be filtered using the filter parameter. This makes it possible to remove control probes and low-quality probes from the output. The default values are conservative in order to remove possible false positives; if the results are independently validated or the experiment setup includes dye-swap, less strict filtering may be applied.
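To make the exclusion semantics concrete, the default filter expression and the less strict example from the parameter description are restated below as Python predicates. This is only a sketch: the component evaluates the expression string itself, whereas here each row is assumed to be a dict of already-parsed integer flags, and the function names are hypothetical.

```python
def default_filter(row):
    """Return True if the row matches the default filter and is excluded."""
    return (
        row["ControlType"] != 0
        or row["gIsSaturated"] == 1 or row["rIsSaturated"] == 1
        or row["gIsWellAboveBG"] == 0 or row["rIsWellAboveBG"] == 0
        or row["gIsFeatPopnOL"] == 1 or row["rIsFeatPopnOL"] == 1
        or row["gIsBGPopnOL"] == 1 or row["rIsBGPopnOL"] == 1
    )


def relaxed_filter(row):
    """Less strict example: control probes, or saturated on both channels."""
    return row["ControlType"] != 0 or (
        row["gIsSaturated"] == 1 and row["rIsSaturated"] == 1
    )


def apply_filter(rows, is_excluded):
    """Keep only the rows that do NOT match the filter."""
    return [row for row in rows if not is_excluded(row)]


good = {"ControlType": 0, "gIsSaturated": 0, "rIsSaturated": 0,
        "gIsWellAboveBG": 1, "rIsWellAboveBG": 1,
        "gIsFeatPopnOL": 0, "rIsFeatPopnOL": 0,
        "gIsBGPopnOL": 0, "rIsBGPopnOL": 0}
control = dict(good, ControlType=1)
green_saturated = dict(good, gIsSaturated=1)
rows = [good, control, green_saturated]
```

Note that matching rows are removed, not kept: a probe saturated only on the green channel is dropped by the default filter but survives the relaxed one.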
Probe annotations are columns from input CSV files that do not depend on the sample being processed. Common annotations include gene name, textual description and nucleotide sequence. Sample annotations are columns from input CSV files that may depend on the sample and the channel. Sample annotations are defined using three parameters: sampleAnnotation gives the output column names, sampleAnnotationChannel1 gives the input column names for channel 1 and sampleAnnotationChannel2 gives those for channel 2 (if present).
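How the three sample annotation parameters line up can be sketched in Python as below. The function names are hypothetical, and the column names used in the example (`NumPix`, `gNumPix`, `rNumPix`) are illustrative placeholders and not values taken from this documentation. The sketch only checks the documented constraints: each channel list must have the same length as sampleAnnotation, and an empty channel 2 list means a one-channel array.

```python
def split_list(spec):
    """Split a comma-separated parameter value; an empty string gives []."""
    return [item.strip() for item in spec.split(",") if item.strip()]


def plan_sample_annotation(sample_annotation, channel1, channel2=""):
    """Pair each output column with its input column, per channel."""
    outputs = split_list(sample_annotation)
    plan = {}
    for channel, spec in (("channel1", channel1), ("channel2", channel2)):
        inputs = split_list(spec)
        if channel == "channel2" and not inputs:
            continue  # one-channel array: channel 2 is not processed
        if len(inputs) != len(outputs):
            raise ValueError(f"{channel} list must match sampleAnnotation length")
        plan[channel] = list(zip(outputs, inputs))
    return plan


# Illustrative column names only.
two_channel = plan_sample_annotation("NumPix", "gNumPix", "rNumPix")
one_channel = plan_sample_annotation("NumPix", "gNumPix")
```

The output column name is shared by both channels, while the input column that supplies its value differs per channel.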
Version | 1.0.1 |
---|---|
Bundle | microarray |
Categories | Data Import Agilent |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi) |
Issue tracker | View/Report issues |
Requires | commons-math3-3.2.jar (jar) ; commons-primitives-1.0.jar (jar) ; jep-2.4.1.jar (jar) |
Source files | component.xml AgilentReader.java |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
agilent | AgilentDirectory | Mandatory | Agilent source file directory. |
sampleNames | CSV | Mandatory | Sample definitions. The table contains the columns GreenSampleID (sample ID for the sample on green channel), GreenDescription (human-readable description for the sample), RedSampleID, RedDescription, Filename (key; relative to the Agilent source directory). |
Name | Type | Description |
---|---|---|
green | LogMatrix | Green channel, primary values. Source column is the first element of channelColumns. |
green2 | LogMatrix | Green channel, secondary values. Source column is the second element of channelColumns. Depending on the column selected, these may be raw (unprocessed) values or background values. |
red | LogMatrix | Red channel, primary values. Source column is the third element of channelColumns. |
red2 | LogMatrix | Red channel, secondary values. Source column is the fourth element of channelColumns. Depending on the column selected, these may be raw (unprocessed) values or background values. |
probeAnnotation | AnnotationTable | Probe annotations that are not sample-dependent. If combineProbes=false, the first two columns are InternalProbe and original probe ID (column name given with idColumn); all probes having the same value in the second column represent duplicates of the same probe. If combineProbes=true, the first column is the unique probe ID column. The rest of the columns are specified using probeAnnotation. |
sampleAnnotation | CSV | Sample-dependent annotations. The first three columns ("SampleID", probe-id, "Index") uniquely identify the row. The rest of the columns are specified by the parameter sampleAnnotation. |
groups | SampleGroupTable | Sample group table that is generated based on sampleNames. All groups have the type sample. |
Name | Type | Default | Description |
---|---|---|---|
channelColumns | string | "gProcessedSignal,gMedianSignal,rProcessedSignal,rMedianSignal" | Column names for matrix extraction, in the order green, green2, red, red2. Empty values may be omitted, so "col1" is the same as "col1,,,". The default values, for Agilent two-channel arrays, extract preprocessed values into "green" and "red" and raw values into "green2" and "red2". |
combineProbes | boolean | true | If true, duplicate probes (having the same sequence) are combined into one using median. If false, duplicate probes are present in the output, with unique internal names. |
filter | string | "ControlType!=0 || gIsSaturated==1 || rIsSaturated==1 || gIsWellAboveBG==0 || rIsWellAboveBG==0 || gIsFeatPopnOL==1 || rIsFeatPopnOL==1 || gIsBGPopnOL==1 || rIsBGPopnOL==1" | Rows in source files matching this Boolean expression are excluded from the result. The expression can refer to any cell value of the current row using column names. Boolean and arithmetic operators and parentheses as defined in Java are available. For example, "ControlType!=0 || (gIsSaturated==1 && rIsSaturated==1)" removes probes that are either control probes or are saturated on both green and red channels; ControlType, gIsSaturated and rIsSaturated must be valid columns in input files. |
idColumn | string | "ProbeName" | Column name in input CSV files that gives the probe ID. Features having the same probe ID are assumed to be copies of the same probe. |
probeAnnotation | string | "GeneName,Description,Row,Col" | Comma-separated list of column names in input CSV files that contain probe annotation. These are extracted to the probeAnnotation output. |
sampleAnnotation | string | "" | Comma-separated list of sample annotation columns for output. These column names appear in the sampleAnnotation output. These columns are not queried in input CSV files; rather, sampleAnnotationChannel1 and sampleAnnotationChannel2 define the column names in input files and this parameter gives the corresponding output column names. |
sampleAnnotationChannel1 | string | "" | Comma-separated list of sample annotation columns for channel 1 in the input files. The value for channel 1 is extracted from these columns. The list must have equal length to the sampleAnnotation list. |
sampleAnnotationChannel2 | string | "" | Comma-separated list of sample annotation columns for channel 2 in the input files. The value for channel 2 is extracted from these columns. The list must have equal length to the sampleAnnotation list. If empty, it is assumed that the array has one channel and annotations for channel 2 are not processed. |
startPattern | string | "\"?FEATURES\"?\t" | Regular expression that identifies the start of content in input CSV files. This allows content at the beginning of files to be skipped. The pattern is matched against the start of each line. The matching line must be a header that contains column names. |
useColumnNameMatch | boolean | false | If true, startPattern is not used to find the start of the actual expression data; instead, the line that contains all the channel column names (from channelColumns) is taken as the start of the data. |
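The two ways of locating the header line, startPattern and useColumnNameMatch, can be sketched in Python as follows. This is an illustration under stated assumptions and not the component's implementation: the function name is hypothetical, fields are assumed to be tab-separated and optionally quoted (suggested by the default pattern ending in a tab), and the first matching line is taken as the header.

```python
import re

# The default startPattern with the parameter table's string escaping removed.
DEFAULT_START_PATTERN = r'"?FEATURES"?\t'


def find_header_index(lines, start_pattern=DEFAULT_START_PATTERN,
                      channel_columns=None):
    """Return the index of the header line, or None if no line qualifies.

    If channel_columns is given, emulate useColumnNameMatch=true: pick the
    line whose fields include every non-empty channel column name.
    """
    if channel_columns is not None:
        wanted = {name for name in channel_columns if name}
        if not wanted:
            return None
        for index, line in enumerate(lines):
            fields = {field.strip('"') for field in line.rstrip("\r\n").split("\t")}
            if wanted <= fields:
                return index
        return None
    regex = re.compile(start_pattern)
    for index, line in enumerate(lines):
        if regex.match(line):  # match() anchors at the start of the line
            return index
    return None


lines = [
    "TYPE\ttext\ttext\n",
    "FEPARAMS\tProtocol_Name\tScan_Date\n",
    "FEATURES\tProbeName\tgProcessedSignal\trProcessedSignal\n",
    "DATA\tA_1\t1.0\t2.0\n",
]
by_pattern = find_header_index(lines)
by_columns = find_header_index(
    lines, channel_columns=["gProcessedSignal", "", "rProcessedSignal", ""])
```

Both strategies select the same line in this small example; the column-name variant is useful when the header line does not start with a recognizable keyword.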
Test case | Parameters | IN agilent | IN sampleNames | OUT green | OUT green2 | OUT red | OUT red2 | OUT probeAnnotation | OUT sampleAnnotation | OUT groups |
---|---|---|---|---|---|---|---|---|---|---|
case1 | properties: probeAnnotation=GeneName,Description,Row,Col,ControlType, | agilent | sampleNames | green | green2 | red | red2 | probeAnnotation | sampleAnnotation | groups |
case2 | properties: probeAnnotation=GeneName,Description,Row,Col,ControlType, | agilent | sampleNames | green | green2 | red | red2 | probeAnnotation | sampleAnnotation | groups |
case3_filter | properties: probeAnnotation=GeneName,Description,Row,Col,ControlType, | agilent | sampleNames | green | green2 | red | red2 | probeAnnotation | sampleAnnotation | groups |
case4_onematrix | properties: channelColumns=,,rProcessedSignal,, | agilent | sampleNames | green | green2 | red | red2 | probeAnnotation | sampleAnnotation | groups |
case5_nocombine | properties: probeAnnotation=GeneName,Description,Row,Col,ControlType, | agilent | sampleNames | green | green2 | red | red2 | probeAnnotation | sampleAnnotation | groups |