How to perform consensus clustering in R

2025-07-04 update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report

Today I would like to share the key points of how to perform consensus clustering in R. The content is detailed and the logic is clear. Many readers may not be very familiar with this topic, so this article is offered for reference. I hope you get something out of it; let's take a look.

Consensus clustering of gene expression data with the ConsensusClusterPlus package

Consensus clustering is an unsupervised clustering method that is commonly used to classify cancer subtypes. Samples are divided into several subtypes based on different omics data sets, so that new disease subtypes can be discovered or different subtypes compared. Articles of this kind generally analyze gene expression (microarray or RNA-seq data), methylation, or other data to select the optimal number of clusters; run differential expression analysis between the cluster groups to obtain differentially expressed genes (DEGs); carry the DEGs through GO, pathway, PPI, and similar analyses; and then examine the relationship with survival, differences in immune cell abundance, and so on.

The basic idea of consensus clustering is this: if new data sets are formed by drawing samples from the original data set, so that each new data set contains different samples from each of the underlying subclasses, then clustering a new data set should give results close to those on the original data set, both in the number of clusters and in which samples fall into each cluster. Therefore, the more stable a clustering is under sampling variation, the more confident we can be that it reflects a real subclass structure. Resampling perturbs the original data set; cluster analysis is run on each resampled data set, and the results of the many runs are then evaluated together to give a consensus. In short, consensus clustering uses resampling to check whether a clustering is reasonable. Its main purpose is to evaluate the stability of the clustering, and it can be used to choose the best number of clusters K.
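The resampling idea can be illustrated with a small base R sketch. This is a toy illustration and not the ConsensusClusterPlus implementation: the function name consensus_matrix, the toy data, and the use of Euclidean distance with average-linkage hierarchical clustering are all assumptions made for the example. For each pair of samples it records how often the two were clustered together, out of the resamples in which both were drawn.

```r
# Toy data: 2 features x 20 samples (columns), two well-separated groups.
set.seed(1)
d <- cbind(matrix(rnorm(20, mean = 0, sd = 0.3), nrow = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), nrow = 2))
colnames(d) <- paste0("s", seq_len(ncol(d)))

# Hypothetical helper (not part of any package): consensus matrix for a given k.
consensus_matrix <- function(d, k, reps = 50, p_item = 0.8) {
  n <- ncol(d)
  together <- matrix(0, n, n)  # times a pair fell into the same cluster
  sampled  <- matrix(0, n, n)  # times a pair was drawn in the same resample
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, size = floor(p_item * n)))
    # Cluster the resampled columns: average linkage on Euclidean distances.
    hc <- hclust(dist(t(d[, idx, drop = FALSE])), method = "average")
    cl <- cutree(hc, k = k)
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  m <- together / sampled  # NaN for a pair that was never drawn together
  dimnames(m) <- list(colnames(d), colnames(d))
  m
}

cm <- consensus_matrix(d, k = 2)
round(cm[c(1, 2, 11, 12), c(1, 2, 11, 12)], 2)
```

Entries close to 1 or 0 indicate a stable clustering at that k; many intermediate values suggest that k does not match the structure of the data.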

Compared with consensus clustering, other clustering methods have the following shortcomings:

Some, such as hierarchical clustering, cannot provide "objective" criteria and boundaries for the number of clusters.

Some, such as K-means clustering, need the number of clusters to be given in advance, and there is no uniform standard for comparing the results obtained with different numbers of clusters.

The rationality and reliability of the clustering results cannot be verified.

Consensus clustering in R

ConsensusClusterPlus implements Consensus Clustering in R.

# Install the package

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ConsensusClusterPlus")

Main methods:

(1) The ConsensusClusterPlus function performs consensus clustering.

ConsensusClusterPlus(d=NULL, maxK=3, reps=10, pItem=0.8, pFeature=1, clusterAlg="hc", title="untitled_consensus_cluster", innerLinkage="average", finalLinkage="average", distance="pearson", ml=NULL, tmyPal=NULL, seed=NULL, plot=NULL, writeTable=FALSE, weightsItem=NULL, weightsFeature=NULL, verbose=F, corUse="everything")

Common parameters:

d

The data matrix to be clustered, with samples in columns and features in rows; for example, a gene expression matrix.

maxK

The maximum number of clusters to evaluate; must be an integer.

reps

The number of resampling iterations.

pItem

The proportion of samples drawn in each resampling. For example, pItem=0.8 means that 80% of the samples are drawn each time; stable and reliable subgroups are found over the repeated draws.

pFeature

The proportion of features drawn in each resampling.

clusterAlg

The clustering algorithm: "hc" for hierarchical clustering, "pam" for the PAM (Partitioning Around Medoids) algorithm, "km" for the K-means algorithm, or a custom function.

title

The path of the directory for the generated files.

distance

The distance measure: pearson, spearman, euclidean, binary, maximum, canberra, or minkowski.

tmyPal

The colors used for the consensus matrix; white to blue by default.

seed

The random seed.

plot

When not set, figures are only shown on screen; it can also be set to 'pdf', 'png', or 'pngBMP' to write them to files.

writeTable

If TRUE, write the consensus matrix, ICL, and log to CSV files.

weightsItem

The sampling weights of the samples.

weightsFeature

The sampling weights of the features.

verbose

If TRUE, print progress information to the screen.

corUse

How missing values are handled:

all.obs: assume there is no missing data; an error is raised if missing data is encountered

everything: when missing data is encountered, the correlation coefficient is set to missing

complete.obs: listwise deletion

pairwise.complete.obs: pairwise deletion
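The missing-value options can be tried out directly with base R's cor(), on the assumption that corUse carries the same meaning as the use argument of cor(). The vectors x and y below are made-up values for illustration only.

```r
# Two short vectors; x has one missing value.
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 9, 10)

# everything: the missing value propagates, so the result is NA.
r_everything <- cor(x, y, use = "everything")

# complete.obs: the incomplete pair is dropped before computing.
r_complete <- cor(x, y, use = "complete.obs")

# pairwise.complete.obs: with only two vectors this matches complete.obs.
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# all.obs: missing data raises an error, caught here for display.
r_all <- tryCatch(cor(x, y, use = "all.obs"),
                  error = function(e) "error")

print(list(everything = r_everything, complete.obs = r_complete,
           pairwise.complete.obs = r_pairwise, all.obs = r_all))
```

The difference between complete.obs and pairwise.complete.obs only shows up with a matrix of three or more columns, where pairwise deletion keeps every pair of values that is complete for the two columns being compared.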

(2) calcICL function:

Usage:

calcICL(res, title="untitled_consensus_cluster", plot=NULL, writeTable=FALSE)

Parameters:

res

The result returned by ConsensusClusterPlus.

title

The path of the directory for the generated files.

plot

When not set, figures are only shown on screen; it can also be set to 'pdf', 'png', or 'pngBMP' to write them to files.

writeTable

If TRUE, write the output to CSV files.

Data analysis

First, collect the data for cluster analysis, such as mRNA expression microarray results or immunohistochemical staining intensities. The input data should be a matrix. Here the ALL gene expression data are used as an example.

# Use the ALL sample data, 128 samples in total
library(ALL)
data(ALL)
d=exprs(ALL)
d[1:5,1:5]  # 12,625 probes in total
#            01005    01010    03002    04006    0400
# 1000_at   7.597323 7.479445 7.567593 7.384684 7.90531
# 1001_at   5.046194 4.932537 4.799294 4.922627 4.84456
# 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.41692
# 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.68799
# 1004_at   5.925780 5.893209 6.170245 5.615210
# Filter: keep the genes or probes that vary most across samples for cluster analysis
mads=apply(d,1,mad)  # median absolute deviation (MAD) of each gene
d=d[rev(order(mads))[1:5000],]  # keep the most variable probes (5,000, following the package vignette)
# Normalize: use sweep to subtract each row's median
d = sweep(d,1, apply(d,1,median,na.rm=T))
# The d matrix can also be normalized with DESeq's normalization, depending on the situation.
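The filtering and centering steps can be checked on a small synthetic matrix without installing any data package. This is a minimal sketch: the matrix, its size, and the cutoff of 20 features are made-up values for illustration, not the ALL data.

```r
set.seed(42)
# Synthetic stand-in for an expression matrix: 100 features x 12 samples.
d <- matrix(rnorm(100 * 12, mean = 7), nrow = 100,
            dimnames = list(paste0("g", 1:100), paste0("s", 1:12)))
# Make the first 10 features much more variable across samples.
d[1:10, ] <- d[1:10, ] + matrix(rnorm(10 * 12, sd = 5), nrow = 10)

# Median absolute deviation of each feature (row).
mads <- apply(d, 1, mad)

# Keep the 20 features with the largest MAD.
top <- d[order(mads, decreasing = TRUE)[1:20], ]

# Center each row on its median; sweep() subtracts by default.
centered <- sweep(top, 1, apply(top, 1, median, na.rm = TRUE))

dim(centered)
range(apply(centered, 1, median))
```

After centering, every row median is zero up to floating-point error, which is a quick way to confirm that sweep() was applied along rows and not columns.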

Cluster analysis.

title="F:/ConsensusClusterPlus"  # set the output path for the figures
results
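The source breaks off after results, so the original call is not available. As a hedged sketch of what a run might look like, the example below calls ConsensusClusterPlus and calcICL with the parameters described earlier. The synthetic matrix, the temporary output directory, and the values maxK=4, reps=50, and seed=123 are assumptions chosen for illustration, and the calls only run if the package is installed.

```r
set.seed(7)
# Synthetic stand-in for the filtered, centered matrix: 50 features x 20
# samples, where the first 10 samples are shifted in the first 25 features.
d <- matrix(rnorm(50 * 20), nrow = 50,
            dimnames = list(paste0("g", 1:50), paste0("s", 1:20)))
d[1:25, 1:10] <- d[1:25, 1:10] + 3

if (requireNamespace("ConsensusClusterPlus", quietly = TRUE)) {
  library(ConsensusClusterPlus)
  # Assumed output location; the article uses "F:/ConsensusClusterPlus".
  title <- file.path(tempdir(), "ConsensusClusterPlus")
  results <- ConsensusClusterPlus(d, maxK = 4, reps = 50, pItem = 0.8,
                                  pFeature = 1, title = title,
                                  clusterAlg = "hc", distance = "pearson",
                                  seed = 123, plot = "pdf")
  # Cluster assignment of each sample for k = 2.
  print(results[[2]]$consensusClass)
  # Cluster-consensus and item-consensus summaries.
  icl <- calcICL(results, title = title, plot = "pdf")
  print(head(icl$clusterConsensus))
}
```

The result is a list indexed by k, so results[[3]] holds the consensus matrix, tree, and class assignments for three clusters, and so on up to maxK.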
