
DMML

Identifies differentially methylated sites in tumor samples with varying, unknown tumor cell fraction using a maximum-likelihood method. For the details about the method and validation, please refer to our publication on the matter.

Version 1.0
Bundle sequencing
Categories DNA Methylation
Authors Juha Koiranen (juha.koiranen@helsinki.fi)
Issue tracker View/Report issues
Requires Python ; dmml ; installer (bash)
Source files component.xml dmml.py
Usage Example with default values

Inputs

Name Type Mandatory Description
input_files Array<BinaryFile> Mandatory List of tumor sample files for DMML. At least two inputs are required: left and right samples. Additional files are treated as controls. Maximum total number of input files is 32.
sample_compositions CSV Mandatory A single column file describing sample compositions (via bitmask).
index_file CSV Optional List of genomic sites. If supplied, they are used in outputs (probs, pvalues, map estimates, expected values) instead of indexing starting from number 1.
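The sample_compositions input is described only as a single column of bitmasks, and the exact encoding is not documented here. As a rough illustration of how a bitmask can be unpacked, the sketch below assumes that each row holds a non-negative integer and that bit i being set marks the presence of component i. Both the encoding and the helper name decode_bitmask are assumptions made for this example, not part of DMML.

```python
from typing import List


def decode_bitmask(mask: int, width: int = 32) -> List[int]:
    """Return the 0-based positions of the bits set in mask.

    Assumed encoding (not confirmed by the DMML documentation):
    bit i set means that component i is present in the sample.
    """
    if mask < 0 or mask >= (1 << width):
        raise ValueError("mask does not fit in %d bits" % width)
    return [bit for bit in range(width) if (mask >> bit) & 1]


# Hypothetical single-column content, one bitmask per sample.
column = ["1", "3", "5"]
decoded = [decode_bitmask(int(value)) for value in column]
```

Under this assumption, the value 5 (binary 101) decodes to components 0 and 2.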

Outputs

Name Type Description
params BinaryFile Output containing the negative log-likelihood, the estimated error parameter, and the purity of each tumor sample. The format can be set to binary or ASCII (tab-separated values with a header).
probs BinaryFile Output containing the joint distributions of the latent methylation patterns. The format can be set to binary or ASCII (tab-separated values with a header).
pvalues BinaryFile Output containing the p-values of the two-tailed hypothesis tests that the methylation patterns are equal between each pair of tumor samples at a specific site. The format can be set to binary or ASCII (tab-separated values with a header).
map_estimates CSV Output containing the MAP estimates for the methylation patterns.
expected_values CSV Expected values for the methylation patterns.
diff_meth CSV Output for differential methylation.
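When an output is written in ASCII mode, it is described as tab-separated values with a header. A minimal sketch of loading such a file with Python's standard csv module follows. The column names and values in the example are made up for illustration, since the actual headers are not listed in this document, and read_ascii_output is a hypothetical helper name.

```python
import csv
import io


def read_ascii_output(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


# Made-up example content; the real column names may differ.
example = "site\tvalue\n1\t0.25\n2\t0.75\n"
rows = read_ascii_output(io.StringIO(example))
values = [float(row["value"]) for row in rows]
```

For a real file, the StringIO object would be replaced by a handle opened with open(path, newline="").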

Parameters

Name Type Default Description
diffmeth_format string "a" Whether differential methylation should be outputted in ASCII ("a") or not at all (empty string). Default is ASCII.
ev_format string "a" Whether expected values should be outputted in ASCII ("a") or not at all (empty string). Default is ASCII.
map_format string "a" Whether MAP estimates should be outputted in ASCII ("a") or not at all (empty string). Default is ASCII.
max_iters int 25 Specifies the maximum number of expectation maximization iterations per restart (default: 25). A higher number of iterations is recommended for very large problems or for more accurate parameter estimates. For genome-scale data, more than 1000 iterations rarely brings further benefit.
min_delta float 1.49e-8 Specifies the minimum improvement in the objective that is not considered a stall (default: 1.49e-8). This value can be increased for an earlier halt (faster optimization) at the expense of lower accuracy.
order int 1 Specifies the order of local comethylation modeling (default: 1). A higher order is expected to give more accurate estimates, but it increases the computational effort exponentially. Typical values are 1 to 3, but up to 10 might be feasible for small problems.
params_format string "a" Indicates whether parameters should be outputted in binary ("b") or ASCII ("a") or not at all (empty string). Default is "a".
probs_format string "b" Whether joint distributions should be outputted in binary ("b") or ASCII ("a") or not at all (empty string). Default is "b".
purity_left float 0.5 Sets the initial tumor purity estimates p of the first two samples (default: 0.5) and optionally their prior precision pp (default: 0). Purity is the fraction of cancer cells of interest; the remainder is assumed to consist of normal cells.
purity_right float 0.5 Sets the initial tumor purity estimates p of the first two samples (default: 0.5) and optionally their prior precision pp (default: 0). Purity is the fraction of cancer cells of interest; the remainder is assumed to consist of normal cells.
pvalues_format string "b" Whether p-values should be outputted in binary ("b") or ASCII ("a") or not at all (empty string). Default is "b".
resets int 10 Specifies the number of quasi-Monte Carlo resets (default: 10). A large number of resets helps the algorithm explore the whole state space and avoid getting stuck at a local minimum. When many samples are compared simultaneously, a larger number of resets might be needed.
sites string "" Specifies a subset of the sites to be used in the analysis (1-based, inclusive). The subset is given as a string with the syntax "a:b", where a is the first site (default: 1) and b is the last site (default: automatic). The default (empty string) is to model sites from 1 up to the largest site index encountered in any of the input files. This might be inconvenient if multiple comparisons covering different regions of the genome are made; in that case the upper bound can be overridden by specifying b. Alternatively, subsets of the genome can be analyzed separately by specifying both a and b. Examples: "1:300", "3"
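The "a:b" syntax of the sites parameter can be illustrated with a small parser. This is only a sketch of the documented syntax, not the component's own code. It assumes that a bare number such as "3" sets the start a and leaves b automatic, which the description does not state explicitly. The function name parse_sites is hypothetical.

```python
from typing import Optional, Tuple


def parse_sites(spec: str) -> Tuple[int, Optional[int]]:
    """Parse a sites string "a:b" into (a, b).

    a defaults to 1; b is None when it should be chosen automatically.
    Assumption: a bare number without ":" is interpreted as a.
    """
    spec = spec.strip()
    if not spec:
        return 1, None
    start_text, _, end_text = spec.partition(":")
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else None
    if start < 1:
        raise ValueError("site indices are 1-based")
    if end is not None and end < start:
        raise ValueError("b must not be smaller than a")
    return start, end
```

With this reading, "1:300" selects sites 1 through 300 inclusive, and the empty string leaves both bounds at their defaults.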

Test cases

Test case Parameters IN input_files IN sample_compositions IN index_file OUT params OUT probs OUT pvalues OUT map_estimates OUT expected_values OUT diff_meth
case1_basic_test properties input_files sample_compositions (missing) params probs pvalues map_estimates expected_values diff_meth

params_format = a,
probs_format = b,
pvalues_format = a,
purity_left = 0.5,
purity_right = 0.5,
min_delta = 1.49e-08,
resets = 10,
max_iters = 25,
order = 1

case2_binary_input properties input_files sample_compositions (missing) params probs pvalues map_estimates expected_values diff_meth

params_format = b,
probs_format = b,
pvalues_format = b,
purity_left = 0.5,
purity_right = 0.5,
min_delta = 1.49e-08,
resets = 10,
max_iters = 25,
order = 1

case3_index_file properties input_files sample_compositions index_file params probs pvalues map_estimates expected_values diff_meth

params_format = a,
probs_format = b,
pvalues_format = a,
purity_left = 0.5,
purity_right = 0.5,
min_delta = 1.49e-08,
resets = 10,
max_iters = 25,
order = 1

case4_replicated properties input_files sample_compositions (missing) params probs pvalues map_estimates expected_values diff_meth

params_format = a,
probs_format = b,
pvalues_format = a,
purity_left = 0.5,
purity_right = 0.5,
min_delta = 1.49e-08,
resets = 10,
max_iters = 25,
order = 1
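All test cases above use resets = 10, max_iters = 25 and min_delta = 1.49e-08. How these three settings might interact can be sketched with a generic restart loop. The code below is not DMML's optimizer: a toy one-dimensional update stands in for the expectation maximization step, the starting points come from a van der Corput sequence as one simple example of a quasi-Monte Carlo sequence, and all function names are hypothetical.

```python
def van_der_corput(n: int, base: int = 2) -> float:
    """Return the n-th element of the van der Corput low-discrepancy sequence."""
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        value += digit / denom
    return value


def optimize(objective, step, resets=10, max_iters=25, min_delta=1.49e-8):
    """Minimize objective by iterating step from several starting points.

    Each restart runs at most max_iters updates and halts early when the
    improvement in the objective falls below min_delta (a stall).
    """
    best_x, best_value = None, float("inf")
    for restart in range(1, resets + 1):
        x = van_der_corput(restart)  # quasi-random start in (0, 1)
        value = objective(x)
        for _ in range(max_iters):
            candidate = step(x)
            candidate_value = objective(candidate)
            if value - candidate_value < min_delta:
                break  # stalled: improvement too small to continue
            x, value = candidate, candidate_value
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


# Toy problem standing in for the real likelihood: minimum at x = 0.3.
def toy_objective(x):
    return (x - 0.3) ** 2


def toy_step(x):
    return 0.5 * x + 0.15  # halves the distance to 0.3 on every call


best_x, best_value = optimize(toy_objective, toy_step)
```

In this sketch, raising min_delta ends each restart sooner, while raising resets tries more starting points. This mirrors the trade-offs described for the parameters, under the stated assumptions.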


Generated 2019-02-08 07:42:12 by Anduril 2.0.0