Identifies differentially methylated sites in tumor samples with varying, unknown tumor cell fraction using a maximum-likelihood method. For the details about the method and validation, please refer to our publication on the matter.
Version | 1.0 |
---|---|
Bundle | sequencing |
Categories | DNA Methylation |
Authors | Juha Koiranen (juha.koiranen@helsinki.fi) |
Issue tracker | View/Report issues |
Requires | Python ; dmml ; installer (bash) |
Source files | component.xml dmml.py |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
input_files | Array<BinaryFile> | Mandatory | List of tumor sample files for DMML. At least two inputs are required: left and right samples. Additional files are treated as controls. Maximum total number of input files is 32. |
sample_compositions | CSV | Mandatory | A single column file describing sample compositions (via bitmask). |
index_file | CSV | Optional | List of genomic sites. If supplied, they are used in outputs (probs, pvalues, map estimates, expected values) instead of indexing starting from number 1. |
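
The constraints on input_files can be checked before the component is run. The following is a minimal sketch, not part of the component: the helper name `split_inputs` is hypothetical, and it assumes that the first file in the array is the left sample and the second is the right sample, which the table does not state explicitly.

```python
MAX_INPUTS = 32  # maximum total number of input files, from the table above


def split_inputs(paths):
    """Split a list of input files into (left, right, controls).

    Assumption: the first entry is the left sample and the second is the
    right sample; any remaining entries are treated as controls.
    """
    if len(paths) < 2:
        raise ValueError("at least two input files (left and right) are required")
    if len(paths) > MAX_INPUTS:
        raise ValueError(f"at most {MAX_INPUTS} input files are allowed")
    left, right, *controls = paths
    return left, right, controls


left, right, controls = split_inputs(["tumor_a.bin", "tumor_b.bin", "normal.bin"])
print(left, right, controls)
```
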
Name | Type | Description |
---|---|---|
params | BinaryFile | Output containing the negative log-likelihood, the estimated error parameter, and the purity of each tumor sample. The format can be set to binary or ASCII (tab-separated values with a header). |
probs | BinaryFile | Output containing the joint distributions of the latent methylation patterns. The format can be set to binary or ASCII (tab-separated values with a header). |
pvalues | BinaryFile | Output containing the p-values of the two-tailed hypothesis tests that the methylation patterns are equal between each pair of tumor samples at a specific site. The format can be set to binary or ASCII (tab-separated values with a header). |
map_estimates | CSV | Output containing the MAP estimates for the methylation patterns. |
expected_values | CSV | Expected values for the methylation patterns. |
diff_meth | CSV | Output for differential methylation. |
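
When an output is written in ASCII, it is described above as tab-separated values with a header, so it can be read with the Python standard library. The sketch below is illustrative only: the function name `read_ascii_output` and the column names in the sample data are hypothetical, since the actual column names are not listed here.

```python
import csv
import io


def read_ascii_output(handle):
    """Read a tab-separated output with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical column names and values, for illustration only.
sample = io.StringIO("site\tvalue\n1\t0.25\n2\t0.75\n")
rows = read_ascii_output(sample)
print(rows[0]["value"])  # values are read as strings; convert as needed
```

For a real file, pass an open file handle instead, for example `open(path, newline="")`.
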
Name | Type | Default | Description |
---|---|---|---|
diffmeth_format | string | "a" | Whether differential methylation should be output in ASCII ("a") or not at all (empty string). Default is "a". |
ev_format | string | "a" | Whether expected values should be output in ASCII ("a") or not at all (empty string). Default is "a". |
map_format | string | "a" | Whether MAP estimates should be output in ASCII ("a") or not at all (empty string). Default is "a". |
max_iters | int | 25 | Maximum number of expectation-maximization iterations per restart (default: 25). A higher number of iterations is recommended for very large problems or for more accurate parameter estimates. For genome-scale data, more than 1000 iterations rarely yields further benefit. |
min_delta | float | 1.49e-8 | Minimum improvement in the objective that is not considered a stall (default: 1.49e-8). This value can be increased to halt earlier (faster optimization) at the expense of accuracy. |
order | int | 1 | Order of local comethylation modeling (default: 1). A higher order is expected to give more accurate estimates but increases the computational effort exponentially. Typical values are 1 to 3, but up to 10 might be feasible for small problems. |
params_format | string | "a" | Whether parameters should be output in binary ("b"), in ASCII ("a"), or not at all (empty string). Default is "a". |
probs_format | string | "b" | Whether joint distributions should be output in binary ("b"), in ASCII ("a"), or not at all (empty string). Default is "b". |
purity_left | float | 0.5 | Sets the initial tumor purity estimate p of the left sample (default: 0.5) and optionally its prior precision pp (default: 0). Purity is the fraction of cancer cells of interest; the remainder is assumed to be explained by normal cells. |
purity_right | float | 0.5 | Sets the initial tumor purity estimate p of the right sample (default: 0.5) and optionally its prior precision pp (default: 0). Purity is the fraction of cancer cells of interest; the remainder is assumed to be explained by normal cells. |
pvalues_format | string | "b" | Whether p-values should be output in binary ("b"), in ASCII ("a"), or not at all (empty string). Default is "b". |
resets | int | 10 | Number of quasi-Monte Carlo resets (default: 10). A large number of resets helps the algorithm explore the whole state space and avoid getting stuck at a local minimum. When many samples are compared simultaneously, a larger number of resets might be needed. |
sites | string | "" | Specifies the subset of sites used in the analysis (1-based, inclusive). The subset is given as a string with the syntax "a:b", where a is the first site (default: 1) and b is the last site (default: automatic). With the default (empty string), sites from 1 up to the largest site index encountered in any of the input files are modeled. This can be inconvenient when multiple comparisons covering different regions of the genome are made; in that case, specify b explicitly. Alternatively, subsets of the genome can be analyzed separately by specifying both a and b. Examples: "1:300", "3" |
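
The "a:b" syntax of the sites parameter can be illustrated with a small parser. This is a sketch under stated assumptions, not the component's own parsing code: the function name `parse_sites` is hypothetical, an end of `None` stands for the automatic default, and a bare number such as "3" is assumed to set only the start a (the description does not say how a value without a colon is interpreted).

```python
def parse_sites(spec):
    """Parse a sites string into (start, end); end is None when automatic."""
    spec = spec.strip()
    if not spec:
        return 1, None  # default: from site 1 up to the largest index seen
    if ":" in spec:
        a, b = spec.split(":", 1)
    else:
        # Assumption: a bare number sets the start and leaves the end automatic.
        a, b = spec, ""
    start = int(a) if a.strip() else 1
    end = int(b) if b.strip() else None
    if start < 1:
        raise ValueError("site indices are 1-based")
    if end is not None and end < start:
        raise ValueError("end of range must not precede its start")
    return start, end


print(parse_sites("1:300"))
print(parse_sites(""))
```
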
Test case | Parameters | IN input_files | IN sample_compositions | IN index_file | OUT params | OUT probs | OUT pvalues | OUT map_estimates | OUT expected_values | OUT diff_meth |
---|---|---|---|---|---|---|---|---|---|---|
case1_basic_test | properties: params_format = a, | input_files | sample_compositions | (missing) | params | probs | pvalues | map_estimates | expected_values | diff_meth |
case2_binary_input | properties: params_format = b, | input_files | sample_compositions | (missing) | params | probs | pvalues | map_estimates | expected_values | diff_meth |
case3_index_file | properties: params_format = a, | input_files | sample_compositions | index_file | params | probs | pvalues | map_estimates | expected_values | diff_meth |
case4_replicated | properties: params_format = a, | input_files | sample_compositions | (missing) | params | probs | pvalues | map_estimates | expected_values | diff_meth |