Identifies gene/protein IDs using fuzzy keyword matching between gene aliases and descriptions. Keyword matching is based on the Czekanowski-Dice distance between pairs of keyword sets, taking into account the information content of the keywords.
To illustrate, consider mapping gene names into database identifiers. The query consists of gene names, which may include non-official names, and textual descriptions. The reference annotation contains database IDs, official gene names, gene aliases and textual descriptions for the whole genome (or a subset of interest), against which query items are compared. Each query column is compared against each annotation column in turn and a distance metric is computed. If the distance is below the matchDistance threshold, the match is reported. Also reported are keywords that uniquely identify query items in the annotation file.
Method: CSV items in the annotation and query files are tokenized using regular expressions, producing keywords. Keywords are converted to lower case, and their order does not matter. From the annotation file, the normalized information content (NIC; range 0..1) of each keyword is computed as log(k/N)/log(1/N), where k is the frequency of the keyword and N is the total number of annotation items that have non-empty content in the current annotation column. If a query keyword is not present in the annotation file, its NIC is 1.
The distance (range 0..1) between query row i and annotation row j is the minimum of the pairwise column distances on the row. In each pairwise comparison of two keyword sets, we compute the symmetric difference D, the union U and the intersection I. Keywords are weighted by NIC, so more informative keywords have greater weight. The distance is defined as SUM(D)/(SUM(U)+SUM(I)), where SUM is the sum of NICs in the set.
CPU performance: O(RA*RQ*CA*CQ), where RA, RQ, CA and CQ are the number of rows (R) and selected columns (C) in the annotation (A) and query (Q) files.
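The NIC weighting and the distance formula described above can be illustrated with a minimal Java sketch for a single annotation column. This is not the code of KeywordMatcher.java: the class name `KeywordDistanceSketch`, the method names and the toy gene data are all hypothetical. The sketch assumes that k counts the items containing a keyword (each item is treated as a set), that N > 1, and that two empty keyword sets get distance 1; the source does not state these details.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KeywordDistanceSketch {

    /** NIC = log(k/N) / log(1/N) for each keyword in one annotation column (assumes N > 1). */
    static Map<String, Double> computeNic(List<Set<String>> items) {
        int n = items.size();
        // Assumption: k is the number of items that contain the keyword.
        Map<String, Integer> freq = new HashMap<>();
        for (Set<String> item : items) {
            for (String kw : item) {
                freq.merge(kw, 1, Integer::sum);
            }
        }
        Map<String, Double> nic = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            nic.put(e.getKey(), Math.log((double) e.getValue() / n) / Math.log(1.0 / n));
        }
        return nic;
    }

    /** NIC-weighted distance SUM(D) / (SUM(U) + SUM(I)) between two keyword sets. */
    static double distance(Set<String> a, Set<String> b, Map<String, Double> nic) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        double sumD = 0.0, sumU = 0.0, sumI = 0.0;
        for (String kw : union) {
            // Keywords missing from the annotation get NIC = 1.
            double w = nic.getOrDefault(kw, 1.0);
            sumU += w;
            if (a.contains(kw) && b.contains(kw)) {
                sumI += w;   // keyword is in the intersection
            } else {
                sumD += w;   // keyword is in the symmetric difference
            }
        }
        double denom = sumU + sumI;
        // Assumption: with no weighted keywords at all, report the maximum distance.
        return denom == 0.0 ? 1.0 : sumD / denom;
    }

    public static void main(String[] args) {
        // Hypothetical annotation column with four items.
        List<Set<String>> annotation = List.of(
            Set.of("tp53", "tumor", "protein"),
            Set.of("brca1", "tumor", "protein"),
            Set.of("myc", "protein"),
            Set.of("egfr", "receptor"));
        Map<String, Double> nic = computeNic(annotation);
        Set<String> query = Set.of("tp53", "tumor");
        for (Set<String> row : annotation) {
            System.out.printf("%s -> %.4f%n", row, distance(query, row, nic));
        }
    }
}
```

In this toy data, "tumor" occurs in 2 of 4 items, so its NIC is log(2/4)/log(1/4) = 0.5, while "protein" (3 of 4 items) has a NIC of about 0.21. The query {tp53, tumor} differs from the first item only by the low-weight keyword "protein", which gives a distance of about 0.065, below the default matchDistance of 0.2. Because SUM(U)+SUM(I) equals the sum of the two sets' own totals, the formula is the NIC-weighted form of the Czekanowski-Dice distance.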
Version | 1.0 |
---|---|
Bundle | microarray |
Categories | Convert |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi) |
Issue tracker | View/Report issues |
Source files | component.xml KeywordMatcher.java |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
annotation | CSV | Mandatory | Annotation for a set of genes or other entities. One column is a key column that is reported in the output and others are annotation columns. |
query | CSV | Optional | Attributes of query genes or other entities. These are matched against the annotation file. The query may be omitted, in which case statistics on the annotation file are produced as output. |
Name | Type | Description |
---|---|---|
mapping | CSV | Match results. For each query row, there is one output row. The columns are QueryID, TargetID (comma-separated list of target IDs), TargetDistance (distances corresponding to each target match ID) and Keywords (list of keywords that uniquely identify the item). |
Name | Type | Default | Description |
---|---|---|---|
annotationColumns | string | "*" | Comma-separated list of column names in the annotation file that are used for keyword matching. The special value * includes all columns, including the key column. |
annotationKeyColumn | string | "" | Column name in the annotation file that contains target identifiers. If empty, the first column is used. |
matchDistance | float | 0.2 | Distance threshold below which two rows are considered to represent the same gene or other entity. Must be between 0 and 1. |
maxMatches | int | 1 | Maximum number of matches that are reported for each query row. Setting this to 1 improves performance. |
pruneKeywordIC | float | 0.1 | Keywords whose information content is below this threshold are removed to improve performance. Effectively, their NIC=0. Setting this too high lowers accuracy. |
queryColumns | string | "*" | Comma-separated list of column names in the query file that are used for keyword matching. The special value * includes all columns, including the key column. |
queryKeyColumn | string | "" | Column name in the query file that contains query identifiers. If empty, the first column is used. |
removePattern | string | "[()\[\]\{\};]" | Java regular expression that is applied to CSV cells before splitting into tokens. Matching portions are replaced with a space character. |
tokenizePattern | string | "[, ]+" | Java regular expression that is used to split CSV cells into tokens. |
trimPattern | string | "[:._-]+" | Java regular expression that is applied to tokens (keywords) to remove leading and trailing portions of the token. For example, if trimPattern="[_.]+", the token "__abc_def__." is transformed into "abc_def". |
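How removePattern, tokenizePattern and trimPattern might combine can be sketched in Java as follows. This is an illustration under assumptions, not the component's implementation: the class `TokenizeSketch` and its `tokenize` method are hypothetical, and the order of steps (remove, split, trim, lower-case) and the way trimming is anchored to the token ends are inferred from the parameter descriptions.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class TokenizeSketch {
    // Default parameter values from the table above, written as Java string literals.
    static final String REMOVE = "[()\\[\\]\\{\\};]";
    static final String TOKENIZE = "[, ]+";
    static final String TRIM = "[:._-]+";

    /** Turns one CSV cell into a set of lower-case keywords. */
    static Set<String> tokenize(String cell, String remove, String tokenize, String trim) {
        // Assumption: the trim pattern is applied only at the start and end of a token.
        String trimRegex = "^(?:" + trim + ")|(?:" + trim + ")$";
        Set<String> keywords = new LinkedHashSet<>();
        // Portions matching removePattern are replaced with a space before splitting.
        for (String token : cell.replaceAll(remove, " ").split(tokenize)) {
            String kw = token.replaceAll(trimRegex, "").toLowerCase(Locale.ROOT);
            if (!kw.isEmpty()) {
                keywords.add(kw);   // tokens that become empty are dropped
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // prints [tp53, tumor, protein, p53, li-fraumeni]
        System.out.println(tokenize("TP53 (tumor protein p53); Li-Fraumeni_", REMOVE, TOKENIZE, TRIM));
        // prints [abc_def]
        System.out.println(tokenize("__abc_def__.", REMOVE, TOKENIZE, "[_.]+"));
    }
}
```

The second call reproduces the trimPattern example from the table: only the leading and trailing runs of "_" and "." are removed, while the inner underscore in "abc_def" is kept.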
Test case | Parameters | IN annotation | IN query | OUT mapping |
---|---|---|---|---|
case1 | properties: matchDistance = 0.5 | annotation | query | mapping |
case2_noquery | properties: maxMatches = 1 | annotation | (missing) | mapping |