What if there are not enough normal samples in TCGA database?

2025-03-29 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many inexperienced people do not know what to do when the TCGA database does not contain enough normal samples. This article summarizes the causes of the problem and the possible solutions, and I hope it helps you solve it.

Although the cancer I want to mine has data in the TCGA database, there are too few normal samples (adjacent normal tissue or blood). A differential analysis would then face an imbalance in sample numbers, which raises the question of whether normal-tissue transcriptome sequencing data from the GTEx database can be included.

GTEx: the Genotype-Tissue Expression (GTEx) project was first proposed in 2013, when hundreds of scientists published an article in Nature Genetics that introduced the project for the first time and established the Genotype-Tissue Expression research consortium. GTEx has catalogued gene expression in > 9000 samples across 53 tissues from 544 healthy individuals.

TCGA: The Cancer Genome Atlas (https://cancergenome.nih.gov/) is a cancer research project established by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It collates many kinds of cancer-related data and has quantified gene expression levels in > 12000 samples from > 33 cancer types.

In fact, there is no simple answer to whether the TCGA and GTEx databases can be integrated, or how to combine them. The statistics behind this are somewhat complex and go beyond the batch effect alone. An article published in Sci Data, "Unifying cancer and normal RNA sequencing data from different sources", explains in detail the inherent differences in transcriptome data between the TCGA and GTEx databases:

Sequencing platform and chemistry, personnel, details in the analysis pipeline, etc.
Gene expression range: 4-10 (log2 of normalized_count) for TCGA, and 0-4 (log2 of RPKM) for GTEx
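The two ranges above are in different units, so they cannot be compared directly. As a minimal sketch of what putting values on a common unit means, the snippet below rescales one sample's RPKM values to TPM (each value divided by the sample's RPKM sum, times one million) and applies a log2(x + 1) transform. The gene names and numbers are hypothetical, and this step only changes the unit; it does not remove the platform and pipeline differences listed above.

```python
import math


def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million (TPM)."""
    total = sum(rpkm.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


def log2_plus_one(values):
    """Apply log2(x + 1), a common transform for compressing expression values."""
    return {gene: math.log2(value + 1.0) for gene, value in values.items()}


# Hypothetical RPKM values for a single sample (illustrative numbers only).
sample_rpkm = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}

sample_tpm = rpkm_to_tpm(sample_rpkm)
sample_log_tpm = log2_plus_one(sample_tpm)
```

After the conversion every sample sums to the same total, which is what makes TPM values easier to compare between samples than RPKM values.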

All the code is shared in: GitHub (https://github.com/mskcc/RNAseqDB).

Unifying the TCGA and GTEx quantification pipelines

An article published in Scientific Reports (SR) on 17 February 2020, "Variability in estimated gene expression among commonly used RNA-seq pipelines", compared how common RNA-seq data analysis pipelines affect the quantified expression matrix:

We compared gene expression values from common samples (4800 tumor samples from TCGA and 1890 normal-tissue samples from GTEx) processed by the pipelines to understand how gene expression quantification is impacted by differences in data processing.

TCGA and GTEx are two very large programs with RNA-seq data. TCGA covers 33 cancers and more than 10,000 samples, while GTEx has nearly 10,000 samples from more than 50 tissues of more than 500 donors. Their respective initiators process RNA-seq data differently, and several newer pipelines try to unify the RNA-seq analysis results of the two databases. The five better-known pipelines include:

TOPMed pipeline (https://github.com/broadinstitute/gtex-pipeline)
recount2 pipeline (https://jhubiostatistics.shinyapps.io/recount/)

The authors applied these five pipelines to TCGA and GTEx and obtained 10 different combinations of data:

GDC: GDC-Xena/Toil, GDC-Piccolo, GDC-Recount2, GDC-MSKCC and GDC-MSKCC Batch
GTEx: GTEx-Xena/Toil, GTEx-Recount2, GTEx-MSKCC and GTEx-MSKCC Batch

They made a very thorough comparison and published all of the code at: https://github.com/sonali-bioc/UncertaintyRNA
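One simple way to see how much two pipelines disagree on the same sample is to correlate their expression values gene by gene. The sketch below is only an illustration of that idea, not the comparison performed in the paper; the function name `pearson` and the numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / (sd_x * sd_y)


# Hypothetical log-scale expression of the same four genes in the same sample,
# as quantified by two different pipelines.
pipeline_a = [1.0, 2.0, 3.0, 4.0]
pipeline_b = [2.0, 4.0, 6.0, 8.5]

r = pearson(pipeline_a, pipeline_b)
```

A high correlation alone does not mean the values are interchangeable: the second pipeline here reports roughly twice the value of the first, which would still distort a fold-change computed across pipelines.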

Literature on the integration of TCGA and GTEx databases by five common transcriptome quantitative processes

A lot!

Many simple data-mining studies also integrate the TCGA and GTEx databases, for example this article published in the Bioinformatics and Genomics section of PeerJ: "Identification of four hub genes associated with adrenocortical carcinoma progression by WGCNA".

First, the authors downloaded the TPM expression matrices for the TCGA and GTEx databases:

Gene transcripts per million (TPM) data were downloaded from the UCSC Xena database, which included ACC (The Cancer Genome Atlas, n = 77) and normal samples (Genotype Tissue Expression, n = 128).

The differential expression analysis then proceeded as follows:

Of the 60498 genes in each sample, we removed genes with a mean TPM ≤ 2.5 (> 1 is a common cutoff for determining if an isoform is expressed or not) in the cancer and normal samples and thus retained 13987 genes.
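The filtering step quoted above can be sketched as follows. This is a minimal illustration that assumes the mean is taken across all samples pooled together (the quote does not say whether the cutoff was applied per group); the function name and the toy matrix are hypothetical.

```python
from statistics import mean


def filter_by_mean_tpm(expression, cutoff=2.5):
    """Keep genes whose mean TPM across all samples is strictly above the cutoff."""
    return {
        gene: values
        for gene, values in expression.items()
        if mean(values) > cutoff
    }


# Hypothetical matrix: gene -> TPM values across samples (illustrative only).
expression = {
    "GENE_A": [0.0, 1.0, 2.0],  # mean 1.0, removed
    "GENE_B": [5.0, 7.0, 9.0],  # mean 7.0, kept
    "GENE_C": [2.5, 2.5, 2.5],  # mean 2.5, removed because the rule is "<= 2.5"
}

kept = filter_by_mean_tpm(expression)
```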

For those genes in the samples that showed significant changes, we used analysis of variance (ANOVA) in R to determine the variance in genes between the two groups. ANOVA is a collection of statistical models useful for DEG analysis.
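The paper ran ANOVA in R. As a language-neutral sketch of what that test computes for a single gene, the snippet below calculates the one-way ANOVA F statistic from the tumor and normal values using only the Python standard library. The function name and the numbers are hypothetical, and no p-value is computed here: that requires the F distribution, which a statistics library such as SciPy provides through `scipy.stats.f_oneway`.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    all_values = [value for group in groups for value in group]
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = mean(all_values)
    group_means = [mean(group) for group in groups]
    # Variation explained by group membership.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Residual variation inside each group.
    ss_within = sum(
        (value - m) ** 2 for group, m in zip(groups, group_means) for value in group
    )
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical log-scale expression of one gene (illustrative numbers only).
tumor = [1.0, 2.0, 3.0]
normal = [4.0, 5.0, 6.0]

f_stat = anova_f([tumor, normal])
```

With only two groups, the F statistic equals the square of the equal-variance two-sample t statistic, so for a tumor versus normal comparison ANOVA and the classical t-test give the same p-value.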

We obtained 2953 significant DEGs (Table S2) in ACC with a p < 0.001 and |log2 (fold-change)| > 1 cutoff.

The differential analysis yields 1181 up-regulated and 1772 down-regulated genes.
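The cutoffs quoted above (p < 0.001 and |log2 fold-change| > 1) translate into a simple selection rule. The sketch below assumes a positive log2 fold-change means higher expression in tumor; the function name, gene names, and values are hypothetical.

```python
def classify_degs(results, p_cutoff=0.001, lfc_cutoff=1.0):
    """Split genes into up- and down-regulated lists using p and log2FC cutoffs.

    `results` maps gene -> (log2 fold-change, p-value).
    """
    up, down = [], []
    for gene, (log2fc, p_value) in results.items():
        if p_value < p_cutoff and abs(log2fc) > lfc_cutoff:
            if log2fc > 0:
                up.append(gene)
            else:
                down.append(gene)
    return up, down


# Hypothetical per-gene statistics (illustrative numbers only).
results = {
    "GENE_UP": (2.3, 1e-5),      # passes both cutoffs, higher in tumor
    "GENE_DOWN": (-1.8, 2e-4),   # passes both cutoffs, lower in tumor
    "GENE_SMALL": (0.4, 1e-6),   # significant but the change is too small
    "GENE_NOISY": (3.0, 0.05),   # large change but not significant
}

up_genes, down_genes = classify_degs(results)
```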

As you can see, the authors work with TPM by default, a normalized form of transcriptome sequencing expression data that they treat as comparable across platforms and databases, so they do not consider the batch effect and simply use the simplest and crudest ANOVA test!

What if it is methylation data?

As we all know, the TCGA database is currently the most comprehensive database of cancer patient data, including:

DNA Sequencing
miRNA Sequencing
Protein Expression array
mRNA Sequencing
Total RNA Sequencing
Array-based Expression
DNA Methylation
Copy Number array

Well-known oncology research institutions have their own TCGA database exploration tools, such as:

Broad Institute FireBrowse portal, The Broad Institute
cBioPortal for Cancer Genomics, Memorial Sloan-Kettering Cancer Center

For transcript-level expression data, the best choice is of course to integrate the TCGA and GTEx databases. But for methylation data, is there a very large cohort comparable to the GTEx database?

I have not come across one yet. In an article I shared earlier about an excellent diagnostic model, the authors downloaded TCGA's colorectal cancer methylation site signal matrix:

Tissue DNA methylation data were obtained from the TCGA (TCGA, TCGA-COAD, and TCGA-READ).

And the methylation signal value of normal blood as control:

Whole-blood DNA methylation profiles from healthy donors were generated in an aging study (GSE40279)

These two cohorts were used to identify methylation sites specific to colorectal cancer, and a differential analysis was performed to identify the top 1000 methylation markers.
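The article does not describe how the top markers were ranked, so the following is only a sketch under a stated assumption: sites are ranked by the absolute difference in mean beta value between tumor tissue and control blood. The function name, probe IDs, and values are hypothetical, and a real analysis would also apply a statistical test with multiple-testing correction.

```python
from statistics import mean


def top_markers(tumor_beta, control_beta, n=1000):
    """Rank shared CpG sites by absolute mean beta difference and keep the top n.

    Assumption for illustration: ranking by |mean(tumor) - mean(control)|.
    """
    shared = sorted(tumor_beta.keys() & control_beta.keys())
    diffs = {
        cpg: mean(tumor_beta[cpg]) - mean(control_beta[cpg]) for cpg in shared
    }
    ranked = sorted(shared, key=lambda cpg: abs(diffs[cpg]), reverse=True)
    return ranked[:n]


# Hypothetical beta values (0 = unmethylated, 1 = fully methylated).
tumor_beta = {
    "cg_a": [0.9, 0.8],
    "cg_b": [0.5, 0.5],
    "cg_c": [0.2, 0.3],
}
control_beta = {
    "cg_a": [0.1, 0.2],
    "cg_b": [0.4, 0.6],
    "cg_c": [0.6, 0.7],
}

markers = top_markers(tumor_beta, control_beta, n=2)
```

Because the control here is blood rather than matched normal tissue, sites selected this way can reflect tissue-of-origin differences as well as cancer-specific changes.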

It is reasonable to speculate that no methylation data from normal human tissue was available to them, so they chose the methylation signal values of normal human blood as the control.

After reading the above, you should have a clearer idea of what to do when the TCGA database does not contain enough normal samples.
