EntrezAnnotator

Retrieves database records from NCBI Entrez, including gene annotation, PubMed references and sequences. Records are retrieved with the Entrez E-Utilities. Read the E-Utilities User Requirements before running large queries.

Entrez is a collection of databases. All databases supported by E-Utilities can be queried (parameter "db"). Several query modes, selected with the parameter "mode", are supported. The output format depends on the mode and the database. For modes other than "search", querying is done using numeric NCBI identifiers, which are specific to each database.

mode="summary" fetches record attributes from a single database (using ESummary). Annotation columns in the output are specific to each database; see partial list.

mode="search" fetches records matching search terms (using ESearch). Querying is done using textual search terms, in the format documented for Entrez and PubMed. For example, "CDK7[TIAB] 'cell cycle' 1999:2001[DP]" (with db=pubmed) fetches the IDs of PubMed records that mention CDK7 in the title or abstract, contain the phrase cell cycle, and were published between 1999 and 2001. In the output, the ID column (outColumn) contains the original query, "Result" contains the NCBI identifiers matching the query and "QueryTranslation" shows how Entrez interpreted the query. Result identifiers can be further annotated with a second call having mode=summary.
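As a rough illustration of what a search request looks like at the URL level, the following Python sketch builds an ESearch URL for the example query. It is not the component's own code: the function name esearch_url is hypothetical, and the base URL is simply the default value of the baseURL parameter.

```python
from urllib.parse import urlencode

# Default value of the component's baseURL parameter.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, db="pubmed"):
    """Build an ESearch request URL for one textual search term.

    Hypothetical helper for illustration; the component may construct
    its requests differently.
    """
    # urlencode percent-encodes brackets, quotes and colons in the term
    # and turns spaces into '+'.
    return BASE_URL + "esearch.fcgi?" + urlencode({"db": db, "term": term})


url = esearch_url("CDK7[TIAB] 'cell cycle' 1999:2001[DP]")
print(url)
```

The search term must be URL-encoded because it contains spaces, brackets and quotes; urlencode takes care of that here.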

mode="link" fetches relationships between Entrez databases, such as proteins associated with query genes (using ELink with cmd=neighbor). In the output, each queried link type (link name) has its own column, containing NCBI identifiers of the target database.

mode="linkout" fetches links to external WWW resources for the query records using LinkOut (ELink with cmd=llink). These resources include full-text links for PubMed articles and information WWW sites for genes. In the output, each link occupies one CSV row. Annotation columns are "URL" (the link); "LinkName", "SubjectType", "Category", "Attribute" (link attributes); and "ProviderName", "ProviderID", "ProviderAbbr" (external provider info).
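The two ELink-based modes above differ mainly in the cmd URL parameter. The sketch below builds both kinds of request URL in Python. The helper name elink_url and the example identifiers are hypothetical, and joining several IDs with commas in one "id" parameter is an assumption about how keys are sent, not a description of the component's implementation.

```python
from urllib.parse import urlencode

# Default value of the component's baseURL parameter.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(ids, dbfrom, db=None, cmd="neighbor", linkname=None):
    """Build an ELink request URL (hypothetical helper).

    cmd="neighbor" corresponds to mode=link, cmd="llink" to mode=linkout.
    """
    params = {
        "dbfrom": dbfrom,
        "cmd": cmd,
        # Assumption: all IDs of a batch go into a single "id" parameter.
        "id": ",".join(str(i) for i in ids),
    }
    if db:
        params["db"] = db
    if linkname:
        params["linkname"] = linkname
    return BASE_URL + "elink.fcgi?" + urlencode(params)


# mode=link: map a gene ID to protein IDs through the gene_protein link name.
link_url = elink_url(["1022"], dbfrom="gene", db="protein", linkname="gene_protein")

# mode=linkout: external links for a PubMed record (illustrative ID).
linkout_url = elink_url(["12345678"], dbfrom="pubmed", cmd="llink")

print(link_url)
print(linkout_url)
```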

mode="sequence" fetches genomic/protein sequences (using EFetch with rettype=fasta). In the output, "TSeq_sequence" contains the sequence and other columns contain sequence annotation. The sequences can be converted to FASTA format using CSV2FASTA.
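To make the relationship between the output columns and FASTA concrete, here is a minimal Python sketch that turns rows of sequence output into FASTA text. It is not the CSV2FASTA implementation. The function rows_to_fasta is hypothetical, using "TSeq_accver" as the header line is an assumption, and the example row contains made-up data.

```python
def rows_to_fasta(rows, id_key="TSeq_accver", seq_key="TSeq_sequence", width=60):
    """Format rows (dicts keyed by column name) as FASTA text.

    Hypothetical helper: each record becomes a '>' header line followed
    by the sequence wrapped to `width` characters per line.
    """
    lines = []
    for row in rows:
        lines.append(">" + row[id_key])
        seq = row[seq_key]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


# Made-up row standing in for one line of the component's output.
example_rows = [{"TSeq_accver": "EXAMPLE.1", "TSeq_sequence": "ACGTACGTAC"}]
print(rows_to_fasta(example_rows, width=4), end="")
```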

Notes on large queries (> 500 records). If you are not in a hurry, increase queryDelay to lessen the load on the Entrez servers. Use the @execute="once" AndurilScript annotation to avoid re-running the database fetch when applicable. If you frequently run large queries, provide the userEmail parameter. Always use the eutils site instead of the main NCBI site.

Version 1.0
Bundle microarray
Categories Annotation
Authors Kristian Ovaska (kristian.ovaska@helsinki.fi)
Issue tracker View/Report issues
Source files component.xml EntrezAnnotator.java XMLParserBase.java LinkXMLParser.java LinkOutXMLParser.java SearchXMLParser.java SequenceXMLParser.java SummaryXMLParser.java
Usage Example with default values

Inputs

Name Type Mandatory Description
in CSV Mandatory Input file containing query keys. The keys are NCBI numeric IDs, except for mode=search, when keys are textual search terms. The column name is given with parameter keyColumn. Comma-separated cell values are automatically split and the query is done for each of them (also for mode=search).
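The splitting of comma-separated cell values can be pictured with a short Python sketch. The function expand_keys is hypothetical, and trimming whitespace and skipping empty parts are assumptions rather than documented behavior of the component.

```python
def expand_keys(cells):
    """Split comma-separated cell values into individual query keys.

    Hypothetical helper. Whitespace trimming and dropping of empty
    parts are assumptions made for this sketch.
    """
    keys = []
    for cell in cells:
        for part in cell.split(","):
            part = part.strip()
            if part:
                keys.append(part)
    return keys


# Three cells from a key column; the first holds two keys.
print(expand_keys(["1017, 1018", "1019", ""]))
```

One consequence for mode=search is that a search term containing a comma would be split into several separate queries.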

Outputs

Name Type Description
out CSV Annotation results. The first column is the ID column and the rest are annotation columns.

Parameters

Name Type Default Description
attributes string "*" Comma-separated list of attribute columns that are included in the output. The special value * includes all columns provided by Entrez. When mode=link, this value restricts the link types (names) that are queried (URL parameter "linkname").
baseURL string "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/" Base URL for the Entrez service.
batchSize int 300 Maximum number of query keys that are fetched in one HTTP request. This should always be less than 500. When individual query keys may return large result sets (such as finding all SNPs of query genes), this should be lowered. For mode=search, batchSize is automatically set to 1.
db string "gene" Target Entrez database (URL parameter "db"). For mode=link, this is the database into which query identifiers are mapped.
dbfrom string "" Only used with mode=link; must be given a non-empty value in that mode. Gives the name of the source database from which query identifiers are taken. URL parameter "dbfrom".
keyColumn string "" Column name in the query input that contains query keys. If empty, the first column is used.
mode string "summary" Query mode. Legal values are "summary", "search", "link", "linkout" and "sequence".
outColumn string "" Name of the ID column in the output file. If empty, the name of the input key column is used.
queryDelay float 0.5 Delay in seconds between batch queries (HTTP requests). This lessens the load on Entrez servers. Only used when there are several batches.
tool string "Anduril" Name of the query program. This is sent along with the HTTP request to the Entrez site (URL parameter "tool"). If empty, the URL parameter is omitted.
userEmail string "" Email address of the user running the component. This is sent along with the HTTP request to the Entrez site (URL parameter "email"). If empty, the URL parameter is omitted. Provide your email address to Entrez if you run large queries.
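How batchSize and queryDelay interact can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the component's Java implementation: the names batches and run_batches are hypothetical, and fetch stands in for whatever performs one HTTP request per batch.

```python
import time


def batches(keys, batch_size=300):
    """Yield consecutive slices of at most batch_size keys."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


def run_batches(keys, fetch, batch_size=300, query_delay=0.5, sleep=time.sleep):
    """Call fetch(batch) once per batch, pausing between batches.

    Hypothetical helper. The delay is applied only between batches,
    so a query that fits into a single batch never sleeps.
    """
    results = []
    for index, batch in enumerate(batches(keys, batch_size)):
        if index > 0:
            sleep(query_delay)
        results.append(fetch(batch))
    return results


# Usage with a stand-in fetch function that only reports the batch size.
sizes = run_batches(list(range(700)), fetch=len, query_delay=0.0)
print(sizes)
```

With 700 keys and the default batchSize of 300, this results in three requests; lowering batchSize trades more requests for smaller result sets per request.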

Test cases

Test case Parameters IN (in) OUT (out)
case01_summary_gene properties: attributes=Name,Chromosome,TaxID in out
case02_link properties: mode=link, dbfrom=gene, db=protein, attributes=gene_protein in out
case03_sequence properties: mode=sequence, db=nucleotide, attributes=TSeq_seqtype,TSeq_accver,TSeq_taxid,TSeq_length,TSeq_sequence in out
case04_search properties: mode=search, db=pubmed in out
case05_linkout properties: mode=linkout, dbfrom=pubmed in out
case06_emptyOut properties: attributes=Summary in (missing)

Generated 2019-02-08 07:42:09 by Anduril 2.0.0