
OptimalClustering

The component takes the output of the MMClustering component, in which each data sample has been clustered with several different cluster numbers, and determines the optimal cluster number for each sample. The optimal number of clusters for each sample is determined independently of the other samples. The parameters method and metric determine how the optimal number of clusters is chosen.

Version 1.0
Bundle flowand
Categories FlowCytometry
Authors Anna-Maria Lahesmaa-Korpinen (anna-maria.lahesmaa@helsinki.fi), Erkka Valo (erkka.valo@helsinki.fi)
Requires R ; fpc (R-package)
Source files component.xml OptimalClustering.r
Usage Example with default values

Inputs

Name Type Mandatory Description
clusters CSVList Mandatory A directory containing clustering results for one or more samples. Each sample should have multiple results, one for each number of clusters that was tried. One column in each CSV file should contain the cluster membership of each row.
clustStat CSV Mandatory One row corresponds to the clustering result for one sample with a specific number of clusters. There should be columns for the corresponding file name in clusters, the number of clusters used in the clustering, the original file name, and the values of BIC, AIC, SWR and ICL.
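
The relationship between clustStat rows and the per-sample choice can be illustrated with a minimal Python sketch. It is not the component's R implementation. The column names (file, nClusters, origFile) and all values are hypothetical, since the actual clustStat header is not listed here, and only the 'min' and 'max' methods are covered.

```python
import csv
import io

# Hypothetical clustStat content: column names and values are made up
# for illustration and are not taken from the component.
CLUST_STAT = """file,nClusters,origFile,BIC,AIC,SWR,ICL
s1_k2.csv,2,s1.csv,1200.5,1180.2,0.81,1210.0
s1_k3.csv,3,s1.csv,1150.1,1120.7,0.42,1165.3
s1_k4.csv,4,s1.csv,1160.9,1121.0,0.55,1180.8
s2_k2.csv,2,s2.csv,900.0,880.4,0.30,915.2
s2_k3.csv,3,s2.csv,910.3,882.1,0.47,930.6
"""


def pick_optimal(rows, metric="SWR", method="min"):
    """Return {original file name: row with the best metric value}.

    Each sample (original file) is handled independently of the others.
    """
    if method not in ("min", "max"):
        raise ValueError("this sketch only covers 'min' and 'max'")
    choose = min if method == "min" else max

    # Group the candidate clusterings by the sample they belong to.
    by_sample = {}
    for row in rows:
        by_sample.setdefault(row["origFile"], []).append(row)

    return {
        sample: choose(candidates, key=lambda r: float(r[metric]))
        for sample, candidates in by_sample.items()
    }


rows = list(csv.DictReader(io.StringIO(CLUST_STAT)))
best = pick_optimal(rows, metric="SWR", method="min")
for sample, row in sorted(best.items()):
    print(sample, row["file"], row["nClusters"])
```

In this sketch the selected row's file column identifies which clustering result would be copied to the clusters output for that sample.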

Outputs

Name Type Description
clusters CSVList The optimal clustering results for each sample are copied to output.
report Latex Report containing a plot for the optimal cluster number metrics as a function of the cluster number.

Parameters

Name Type Default Description
clusterClustCol string "cluster" The name of the column in the clusters input files that holds the cluster number of each row.
method string "min" Method used to choose the optimal clustering given the metric. Possible values are 'min', 'max' and 'changepoint'. 'min' and 'max' choose the clustering results with the minimum and maximum value of the metric respectively. 'changepoint' fits two linear models to the data to detect the changepoint.
metric string "SWR" Metric used for choosing the optimal number of clusters for each sample. Possible values are SWR (Scale-free Weighted Ratio), AID (Average Intercluster Distance), IIR (Average Intracluster Distance / Average Intercluster Distance), AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion) or ICL (Integrated Completed Likelihood).
nSample int 1000 The number of data points to sample from each clustering result to calculate AID and IIR. If a clustering result has nSample or fewer data points, all of them are used. The calculation can be very slow for large values of nSample.
seed int 123456 Random seed. Used to make the sampling of the data reproducible.
useAIDAndIIR boolean true If true, the AID and IIR metrics are calculated for the different clustering results. This can be time consuming.
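
The 'changepoint' method is described only as fitting two linear models to the data. One common way to do that is sketched below in Python: try every interior cluster number as a breakpoint, fit a least-squares line to each side, and keep the breakpoint with the smallest total residual error. This is an assumption about the general technique, not the component's actual R code. The function names and the example values are hypothetical.

```python
def _line_rss(xs, ys):
    """Residual sum of squares of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def changepoint(ks, values):
    """Return the cluster number at which two fitted lines give the least error.

    ks must be sorted in ascending order. The candidate breakpoint is shared
    by both segments, so each segment always has at least two points.
    """
    if len(ks) < 3:
        raise ValueError("need at least three cluster numbers")
    best_k, best_rss = None, float("inf")
    for i in range(1, len(ks) - 1):
        rss = _line_rss(ks[:i + 1], values[:i + 1]) + _line_rss(ks[i:], values[i:])
        if rss < best_rss:
            best_k, best_rss = ks[i], rss
    return best_k


# Made-up metric values: a steep decrease up to 5 clusters, then a plateau.
ks = [2, 3, 4, 5, 6, 7, 8]
bic = [1000.0, 800.0, 600.0, 400.0, 390.0, 380.0, 370.0]
print(changepoint(ks, bic))
```

Unlike 'min' and 'max', this kind of selection looks at the shape of the metric curve, so it can pick the "elbow" even when the metric keeps improving slowly as more clusters are added.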

Generated 2019-02-08 07:42:08 by Anduril 2.0.0