This component performs dimensionality reduction using an R wrapper around the C++ implementation of Barnes-Hut-SNE described at http://lvdmaaten.github.io/tsne/. Remember to cite the original author's paper whenever you use this component.
Version | 1.1 |
---|---|
Bundle | tools |
Categories | Multivariate Statistics |
Authors | Julia Casado (julia.casado@helsinki.fi) |
Issue tracker | View/Report issues |
Requires | R ; installer (bash) |
Source files | component.xml tsne.R |
Usage | Example with default values |
Name | Type | Mandatory | Description |
---|---|---|---|
in | CSV | Mandatory | A numeric matrix on which dimensionality reduction will be applied. Rows represent datapoints, e.g. cells/patients, and columns represent the dimensions, e.g. features or markers, that we want to reduce. |
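
The input layout described above (one row per datapoint, one numeric column per feature) can be checked before running the component. The following is a minimal sketch, not part of the component itself: the helper names `load_numeric_matrix` and `duplicate_rows`, the comma delimiter, and the presence of a header row are all assumptions made for illustration. It also flags exact duplicate rows, which may be useful if duplicates are handled upstream rather than through `check_duplicates`.

```python
import csv
import io


def load_numeric_matrix(text, delimiter=","):
    """Parse CSV text with a header row into (header, rows of floats).

    Hypothetical helper: the delimiter and header row are assumptions.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = []
    for line_no, record in enumerate(reader, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} values, got {len(record)}"
            )
        # float() raises ValueError if a cell is not numeric
        rows.append([float(value) for value in record])
    return header, rows


def duplicate_rows(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    return duplicates


# Three datapoints (rows) measured on two features (columns).
sample = "marker_a,marker_b\n1.0,2.0\n3.5,4.5\n1.0,2.0\n"
header, rows = load_numeric_matrix(sample)
print(header, len(rows), duplicate_rows(rows))
```

In this sample the third datapoint repeats the first, so it is the only row reported as a duplicate.
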
Name | Type | Description |
---|---|---|
out | CSV | A numeric matrix with two or three columns, depending on the dims parameter. |
entropy | CSV | A matrix of entropy estimates in natural-base units (nats) for each sample. The column "orig" is an estimate of entropy in the original space, while "lots" is an estimate of the Kullback-Leibler divergence from the output to the input space. |
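
To make the units of the entropy output concrete, the sketch below computes Shannon entropy and Kullback-Leibler divergence in nats (natural logarithm) for small discrete distributions. It only illustrates the textbook definitions; how the component estimates these quantities internally is not described here, and the function names are hypothetical.

```python
import math


def entropy_nats(p):
    """Shannon entropy H(P) = -sum(p_i * ln p_i), in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence_nats(p, q):
    """KL(P || Q) = sum(p_i * ln(p_i / q_i)), in nats.

    Requires q_i > 0 wherever p_i > 0.
    """
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy_nats(uniform))                # equals ln(4) for four equal outcomes
print(entropy_nats(skewed))                 # lower than the uniform case
print(kl_divergence_nats(skewed, uniform))  # positive, since the distributions differ
```

A divergence of zero means the two distributions are identical; larger values indicate a larger mismatch.
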
Name | Type | Default | Description |
---|---|---|---|
check_duplicates | boolean | false | Check the input for duplicate datapoints. For big files this check takes a long time, so it is better to remove duplicates in an upstream component. |
cost_tol | float | 1.48e-8 | Tolerance for cost function stall. |
dims | int | 2 | Output dimensionality. Possible values are 2 or 3 because the original method was developed for visualization purposes and not thoroughly tested for larger dimensionality. |
entropy_fast | boolean | true | Use fast and cheap entropy approximation. |
initial_dims | int | 50 | Number of dimensions in a preliminary step of dimensionality reduction using PCA. Only read if parameter pca is true. |
is_distance | boolean | false | Indicates whether the input is a distance matrix. At the time this component was created, the underlying documentation warned that this is an experimental feature. Use at your own risk. |
max_iter | int | 1000 | Number of iterations. |
pca | boolean | false | Recommended for big files (over 5000 datapoints and 100 features). If true, a basic PCA is run first to reduce the dimensions. May result in poor performance for small datasets. |
perplexity | int | 30 | A measure of information that, in this case, can be interpreted as the number of nearest neighbors k used in many manifold learners. If a plot of the output shows most of the points clustered into a single ball, the perplexity was too high. The appropriate value depends on the size and structure of the data. |
seed | int | -1 | Seed number to make test cases reproducible. If null, the system generates one every time. |
theta | float | 0 | Speed/accuracy trade-off. A higher theta means a shorter running time and less accurate results. Change only if the dataset is very large. |
verbose | boolean | true | Log the t-SNE process to the terminal. |
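
The reading of perplexity as an effective number of neighbors can be illustrated with the standard t-SNE formulation, in which perplexity is the exponential of the Shannon entropy (in nats) of a point's Gaussian neighbor distribution, and the kernel width is tuned per point until the target perplexity is reached. The sketch below is an illustration under those assumptions, not the component's code; the function names and the bisection settings are hypothetical.

```python
import math


def conditional_probabilities(sq_distances, beta):
    """Gaussian similarities p_j proportional to exp(-beta * d_j^2).

    beta = 1 / (2 * sigma^2); the minimum distance is subtracted for
    numerical stability, which does not change the normalized result.
    """
    d_min = min(sq_distances)
    weights = [math.exp(-beta * (d - d_min)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(p):
    """Perplexity = exp(H), with H the Shannon entropy in nats."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return math.exp(h)


def find_beta(sq_distances, target, tol=1e-5, max_steps=200):
    """Bisection on beta until the perplexity matches the target."""
    if not 1.0 < target < len(sq_distances):
        raise ValueError("target must lie strictly between 1 and the neighbor count")
    lo, hi = 0.0, 1.0
    # Grow the upper bound until the kernel is narrow enough.
    while perplexity(conditional_probabilities(sq_distances, hi)) > target and hi < 1e12:
        hi *= 2.0
    for _ in range(max_steps):
        mid = (lo + hi) / 2.0
        value = perplexity(conditional_probabilities(sq_distances, mid))
        if abs(value - target) < tol:
            return mid
        if value > target:
            lo = mid  # distribution too flat: narrow the kernel
        else:
            hi = mid  # distribution too peaked: widen the kernel
    return (lo + hi) / 2.0


# A uniform distribution over k neighbors has perplexity k.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Squared distances from one point to five neighbors (made-up values).
sq_distances = [1.0, 4.0, 9.0, 16.0, 25.0]
beta = find_beta(sq_distances, target=3.0)
print(perplexity(conditional_probabilities(sq_distances, beta)))
```

Because a uniform distribution over k neighbors has perplexity exactly k, a higher perplexity setting makes each point weigh more of its neighbors, which is consistent with the advice to lower it when the output collapses into a single ball.
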
Test case | Parameters | IN in | OUT out | OUT entropy |
---|---|---|---|---|
case1 | properties: perplexity=3, | in | (missing) | (missing) |