Retrieves database records from NCBI Entrez, including gene annotation, PubMed references and sequences. Records are retrieved using Entrez E-Utilities. Please read E-Utilities User Requirements before running large queries.
Entrez is a collection of databases. All databases supported by E-Utilities can be queried (parameter "db"). Several query modes, selected with parameter "mode", are supported. Output format depends on the mode and the database. For modes other than "search", querying is done using numeric NCBI identifiers, which are separate for each database.
mode="summary" fetches attributes for a single database (using ESummary). Annotation columns in the output are unique for each database; see partial list.
mode="search" fetches records matching search terms (using ESearch). Queries are textual search terms in the format documented in Entrez and PubMed. For example, "CDK7[TIAB] 'cell cycle' 1999:2001[DP]" (with db=pubmed) fetches the IDs of PubMed records that contain the phrases "cell cycle" and "CDK7" and were published between 1999 and 2001. In the output, the ID column (outColumn) contains the original query, "Result" contains the NCBI identifiers matching the query and "QueryTranslation" shows how Entrez interpreted the query. Result identifiers can be annotated further with a second call using mode=summary.
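As an illustration of what a search request looks like at the E-Utilities level, the following Python sketch builds an ESearch URL for the example query above. It is not the component's own code: the function name `build_esearch_url` is hypothetical, and only the URL is constructed, no request is sent.

```python
from urllib.parse import urlencode

# Default base URL, as given by the baseURL parameter.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(term, db="pubmed", base_url=BASE_URL):
    """Return an ESearch URL for one textual search term (hypothetical helper)."""
    # urlencode escapes brackets, quotes and spaces in the search term.
    query = urlencode({"db": db, "term": term})
    return base_url + "esearch.fcgi?" + query


url = build_esearch_url("CDK7[TIAB] 'cell cycle' 1999:2001[DP]")
print(url)
```

Each search term needs its own request, which is consistent with batchSize being forced to 1 in this mode.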
mode="link" fetches relationships between Entrez databases, such as proteins associated with query genes (using ELink with cmd=neighbor). In the output, each queried link type (link name) has its own column, containing NCBI identifiers from the target database.
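A link query can be pictured as an ELink URL carrying "dbfrom", "db", "cmd" and, optionally, "linkname". The Python sketch below is an assumption about how such a URL could be assembled, not the component's implementation. The helper name `build_elink_url`, the example IDs and the link name `gene_protein` are illustrative. Each ID is passed as its own `id` parameter, which makes ELink return a separate link set for each query ID.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_elink_url(ids, dbfrom, db, linkname="", base_url=BASE_URL):
    """Return an ELink (cmd=neighbor) URL for a list of numeric IDs (hypothetical helper)."""
    params = [("dbfrom", dbfrom), ("db", db), ("cmd", "neighbor")]
    if linkname:
        # Restricts the query to the named link types.
        params.append(("linkname", linkname))
    # One id parameter per key keeps the results separate per query ID.
    params.extend(("id", key) for key in ids)
    return base_url + "elink.fcgi?" + urlencode(params)


print(build_elink_url(["1022", "672"], dbfrom="gene", db="protein",
                      linkname="gene_protein"))
```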
mode="linkout" fetches links to external WWW resources for the query records using LinkOut (ELink with cmd=llink). These resources include full-text links for PubMed articles and information WWW sites for genes. In the output, each link occupies one CSV row. Annotation columns are "URL" (the link); "LinkName", "SubjectType", "Category", "Attribute" (link attributes); and "ProviderName", "ProviderID", "ProviderAbbr" (external provider info).
mode="sequence" fetches genomic/protein sequences (using EFetch with rettype=fasta). In the output, "TSeq_sequence" contains the sequence and other columns contain sequence annotation. The sequences can be converted to FASTA format using CSV2FASTA.
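For readers who want to see what the CSV-to-FASTA step amounts to, here is a minimal Python sketch. It is not CSV2FASTA itself. It assumes the rows have already been read into dictionaries, that the ID column is called `id` (a hypothetical name; the real name comes from outColumn or the input key column), and that the ID alone is used as the FASTA header.

```python
def rows_to_fasta(rows, id_column, seq_column="TSeq_sequence", width=60):
    """Format rows as FASTA text, wrapping sequence lines at `width` characters."""
    lines = []
    for row in rows:
        sequence = row[seq_column].strip()
        if not sequence:
            continue  # skip rows for which no sequence was returned
        lines.append(">" + row[id_column])
        # Slice the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


rows = [{"id": "example1", "TSeq_sequence": "ACGTACGTAC"}]
print(rows_to_fasta(rows, id_column="id", width=4), end="")
```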
Notes on large queries (> 500 records). If you are not in a hurry, increase queryDelay to lessen the load on the Entrez servers. Use the @execute="once" AndurilScript annotation to avoid re-running database fetching when applicable. If you frequently do large queries, provide the userEmail parameter. Always use the eutils site instead of the main NCBI site.
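The interplay of batchSize and queryDelay can be sketched as follows. This is an illustrative Python outline under the assumption that keys are split into consecutive batches and that the delay is applied only between batches; the function names and the `fetch` callback are hypothetical, and the component's actual scheduling may differ.

```python
import time


def iter_batches(keys, batch_size=300):
    """Yield consecutive slices of at most batch_size keys."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


def run_batches(keys, fetch, batch_size=300, query_delay=0.5, sleep=time.sleep):
    """Call fetch(batch) for each batch, pausing query_delay seconds between batches."""
    results = []
    for index, batch in enumerate(iter_batches(keys, batch_size)):
        if index > 0:
            # No delay before the first batch, so a single batch never waits.
            sleep(query_delay)
        results.extend(fetch(batch))
    return results


# Stand-in fetch function that only reports the size of each batch.
sizes = run_batches(list(range(700)), lambda batch: [len(batch)], query_delay=0.0)
print(sizes)
```

With 700 keys and the default batch size of 300, this issues three requests of 300, 300 and 100 keys.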
Version | 1.0 |
---|---|
Bundle | microarray |
Categories | Annotation |
Authors | Kristian Ovaska (kristian.ovaska@helsinki.fi) |
Issue tracker | View/Report issues |
Source files | component.xml EntrezAnnotator.java XMLParserBase.java LinkXMLParser.java LinkOutXMLParser.java SearchXMLParser.java SequenceXMLParser.java SummaryXMLParser.java |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
in | CSV | Mandatory | Input file containing query keys. The keys are NCBI numeric IDs, except for mode=search, where keys are textual search terms. The column name is given with parameter keyColumn. Comma-separated cell values are automatically split and a query is done for each of them (also for mode=search). |
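The splitting of comma-separated cell values can be illustrated with a short Python sketch. Whether the component trims whitespace or drops empty entries is not stated here, so both behaviours are assumptions of this example, and `expand_keys` is a hypothetical name.

```python
def expand_keys(cells):
    """Split comma-separated cell values into a flat list of query keys."""
    keys = []
    for cell in cells:
        for part in cell.split(","):
            part = part.strip()  # assumption: surrounding whitespace is ignored
            if part:             # assumption: empty entries are skipped
                keys.append(part)
    return keys


print(expand_keys(["1022,672", "7157"]))
```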
Name | Type | Description |
---|---|---|
out | CSV | Annotation results. The first column is the ID column and the rest are annotation columns. |
Name | Type | Default | Description |
---|---|---|---|
attributes | string | "*" | Comma-separated list of attribute columns that are included in the output. The special value * includes all columns provided by Entrez. When mode=link, this value restricts the link types (names) that are queried (URL parameter "linkname"). |
baseURL | string | "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/" | Base URL for the Entrez service. |
batchSize | int | 300 | Maximum number of query keys that are fetched in one HTTP request. This should always be less than 500. When individual query keys may return large result sets (such as finding all SNPs of query genes), this should be lowered. For mode=search, batchSize is automatically set to 1. |
db | string | "gene" | Target Entrez database (URL parameter "db"). For mode=link, this is the database into which query identifiers are mapped. |
dbfrom | string | "" | Only used with mode=link; must be given a non-empty value in that mode. Gives the source database name from which query identifiers are taken (URL parameter "dbfrom"). |
keyColumn | string | "" | Column name in the query input that contains query keys. If empty, the first column is used. |
mode | string | "summary" | Query mode. Legal values are "summary", "search", "link", "linkout" and "sequence". |
outColumn | string | "" | Name of the ID column in the output file. If empty, the name of the input key column is used. |
queryDelay | float | 0.5 | Delay in seconds between batch queries (HTTP requests). This lessens the load on Entrez servers. Only used when there are several batches. |
tool | string | "Anduril" | Name of the query program. This is sent along with the HTTP request to the Entrez site (URL parameter "tool"). If empty, the URL parameter is omitted. |
userEmail | string | "" | Email address of the user running the component. This is sent along with the HTTP request to the Entrez site (URL parameter "email"). If empty, the URL parameter is omitted. Provide your email address to Entrez if you do large queries. |
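To show how these parameters map onto URL parameters, the Python sketch below assembles an ESummary URL from a batch of IDs. It is an assumption about the request layout, not the component's source: the helper name `build_esummary_url` is hypothetical, and the only behaviour taken from the table is that "tool" and "email" are left out when empty.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esummary_url(ids, db="gene", tool="Anduril", user_email="",
                       base_url=BASE_URL):
    """Return an ESummary URL for one batch of numeric IDs (hypothetical helper)."""
    params = {"db": db, "id": ",".join(ids)}
    # Empty tool or userEmail values are omitted from the URL.
    if tool:
        params["tool"] = tool
    if user_email:
        params["email"] = user_email
    return base_url + "esummary.fcgi?" + urlencode(params)


print(build_esummary_url(["1022", "672"]))
print(build_esummary_url(["1022"], user_email="user@example.org"))
```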
Test case | Parameters | IN in | OUT out |
---|---|---|---|
case01_summary_gene | attributes=Name,Chromosome,TaxID | in | out |
case02_link | mode=link, | in | out |
case03_sequence | mode=sequence, | in | out |
case04_search | mode=search, | in | out |
case05_linkout | mode=linkout, | in | out |
case06_emptyOut | attributes=Summary | in | (missing) |