2025-01-14 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces how to use the StringTie tool. It is fairly detailed and should serve as a useful reference.
For transcriptome data, the most basic analysis is the quantification of genes and transcripts. Quantification means determining the expression level of a gene or transcript, and there are several ways to measure it.
The most direct way is to count the reads that map to a gene or transcript and take that number as the expression level. This measure is called the raw count.
TPM builds on the raw count by normalizing for exon length. For each gene, the raw count is divided by the gene length (the sum of its exon lengths, in kilobases) to give a length-normalized expression value. The TPM of a gene is its relative abundance among these normalized values, scaled to one million:
TPM_i = (count_i / length_i) / sum_j (count_j / length_j) * 10^6
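The TPM calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name tpm, the gene names, and the counts and lengths are made up for the example, and lengths are assumed to be given in bases.

```python
def tpm(counts, lengths_bp):
    """Compute TPM from raw counts and summed exon lengths (in bases)."""
    # Step 1: length normalization, giving reads per kilobase of exon.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    # Step 2: relative abundance among all genes, scaled to one million.
    total = sum(rpk.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value / total * 1e6 for gene, value in rpk.items()}


# Hypothetical example data (not from any real sample).
counts = {"geneA": 100, "geneB": 300, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
values = tpm(counts, lengths)
```

Although geneB has three times as many reads as geneA, it is also three times as long, so the two end up with the same TPM. The TPM values of all genes in a sample always sum to one million.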
RPKM/FPKM also builds on the raw count, normalizing for two factors: sequencing depth and exon length. The raw count is first divided by the total number of mapped reads to give a relative abundance, and this is then divided by the gene length (the sum of its exon lengths) to give the RPKM value. In sequencing, each insert is called a fragment; in paired-end sequencing, one fragment yields two reads.
The only difference between RPKM and FPKM is what is counted. RPKM counts reads, whereas FPKM counts fragments. For single-end sequencing the number of fragments equals the number of reads; for paired-end sequencing the number of reads is twice the number of fragments. In FPKM, when both reads of a pair align to the genome they are counted only once, because the two reads come from the same fragment.
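Pay attention to the units in the formula: the total number of mapped reads is in millions (M) and the gene length is in kilobases (k).
RPKM (or FPKM) = count / (total mapped reads in M * gene length in kb)
Here count is the number of reads for RPKM and the number of fragments for FPKM.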
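The unit handling above is easy to get wrong, so here is a minimal Python sketch of it. The function name per_kilobase_million and all numbers are hypothetical. The same function gives RPKM when fed read counts and FPKM when fed fragment counts.

```python
def per_kilobase_million(count, gene_length_bp, total_mapped):
    """RPKM if count/total_mapped are reads, FPKM if they are fragments."""
    per_million = total_mapped / 1e6      # sequencing depth in millions (M)
    length_kb = gene_length_bp / 1000.0   # gene length in kilobases (k)
    return count / (per_million * length_kb)


# Hypothetical paired-end library in which every fragment yields two mapped reads.
fpkm_value = per_kilobase_million(500, 2000, 20_000_000)    # counting fragments
rpkm_value = per_kilobase_million(1000, 2000, 40_000_000)   # counting reads
```

In this idealized case, where both mates of every pair map, the read count and the library size both double, so RPKM and FPKM agree. They diverge when only one mate of some pairs is mapped.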
Many programs can perform this quantification. This article focuses on StringTie.
In early transcriptome analysis, the classic strategy was tophat + cufflinks + cuffdiff. This pipeline produces quantification results based on FPKM values and then performs differential analysis. However, as sequencing data volumes grew and analysis methods developed, several problems with this strategy became apparent.
First, tophat is very slow. Compared with newer aligners it runs at a tortoise's pace: for the same amount of data, hisat/star take only about half an hour, whereas tophat2 needs at least 5 to 6 hours. Second, differential results based on FPKM values agree less well with experimental validation such as qPCR.
To keep up with developments in sequencing and analysis, the original development team upgraded the whole pipeline, replacing tophat with hisat and cufflinks + cuffdiff with stringTie + ballgown.
StringTie can be seen as an upgraded version of cufflinks. It serves the same purpose, with the following two main functions
Transcript assembly
Quantitative analysis
It runs faster than cufflinks. The official website of the software is
https://ccb.jhu.edu/software/stringtie/index.shtml
StringTie takes a sorted BAM file as input. Common uses are as follows
1. Quantify known transcripts
For model organisms such as human and mouse, you usually only need to quantify known transcripts, as follows
stringtie -p 10 \
  -G hg19.gtf \
  -o output.gtf \
  -b ballgown_out_dir \
  -e \
  align.sorted.bam
The -G parameter specifies the GTF annotation file of the reference genome, and -o specifies the output file, which is also in GTF format. The -b parameter specifies the output directory for the ballgown tables, which makes downstream differential analysis with ballgown easier. The -e parameter tells the software to output quantification results for known transcripts only.
In the GTF output file, the following three expression measures are given for each transcript
Coverage
TPM
FPKM
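To pull these three values out of the output GTF, a small parser is enough. The sketch below assumes that transcript records carry them as the attributes cov, FPKM and TPM in the ninth column, which is how StringTie normally writes them. The function name parse_transcript_line and the example record are invented for illustration.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_transcript_line(line):
    """Return expression values for a transcript record, or None otherwise."""
    fields = line.rstrip("\n").split("\t")
    # Skip comment lines, malformed lines, and non-transcript features (e.g. exon).
    if len(fields) != 9 or fields[2] != "transcript":
        return None
    attrs = dict(ATTR_RE.findall(fields[8]))
    return {
        "transcript_id": attrs.get("transcript_id"),
        "cov": float(attrs["cov"]),
        "FPKM": float(attrs["FPKM"]),
        "TPM": float(attrs["TPM"]),
    }


# Hypothetical record; the IDs and values are made up.
example = (
    "chr1\tStringTie\ttranscript\t1000\t5000\t1000\t+\t.\t"
    'gene_id "G1"; transcript_id "T1"; cov "12.5"; FPKM "3.2"; TPM "4.8";'
)
record = parse_transcript_line(example)
```

Applying the function to every line of output.gtf and keeping the non-None results gives one record per transcript.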
2. Assemble transcripts
For a single sample assembly, the usage is as follows
stringtie align.sorted.bam -o assembly.gtf -p 20 -G hg19.gtf
The assembled transcripts also come with quantification results. By default, newly assembled transcripts and genes are named with the prefix STRG followed by a number, as in the following example
gene_id "STRG.1"; transcript_id "STRG.1.1";
After each sample has been assembled, the assembly results of all samples are merged to produce a non-redundant set of transcripts, as follows
stringtie --merge \
  -o assembly.gtf \
  -p 20 \
  -G hg19.gtf \
  sampleA.gtf sampleB.gtf
In the merged non-redundant transcript set, genes and transcripts are named with the prefix MSTRG followed by a number, as in the following example
gene_id "MSTRG.2"; transcript_id "MSTRG.2.2";
Strictly speaking, StringTie itself only reports expression at the transcript level, as TPM and FPKM values. To obtain raw counts, the developers provide a script, prepDE.py, that calculates raw count expression values, as follows
python prepDE.py \
  -i sample_list.txt \
  -g gene_count_matrix.csv \
  -t transcript_count_matrix.csv
The input file, sample_list.txt, has two columns separated by a tab (\t): the first column is the sample name and the second is the path to that sample's quantification GTF file. For example
sampleA	A.stringtie.gtf
sampleB	B.stringtie.gtf
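With many samples it is easier to generate this file than to type it. A minimal sketch in Python follows; the helper name write_sample_list is hypothetical, and the sample names and paths are simply the ones from the example above. The file is written to a temporary directory here so that the sketch has no side effects.

```python
import os
import tempfile


def write_sample_list(samples, path):
    """Write one 'name<TAB>gtf_path' line per sample."""
    with open(path, "w") as handle:
        for name, gtf_path in samples:
            handle.write(f"{name}\t{gtf_path}\n")


samples = [("sampleA", "A.stringtie.gtf"), ("sampleB", "B.stringtie.gtf")]
out_path = os.path.join(tempfile.mkdtemp(), "sample_list.txt")
write_sample_list(samples, out_path)

# Read the file back to show exactly what was written.
with open(out_path) as handle:
    written = handle.read()
```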
Raw count expression values are output at both the gene and the transcript level.
For quantification, StringTie has the advantage of running fast. Most conveniently, it provides results for all three measures: raw count, FPKM and TPM.
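These counts are estimates derived from coverage, not direct tallies of reads. According to the StringTie manual, the relation is reads_per_transcript = coverage * transcript_len / read_len. The sketch below illustrates that relation. The function name is hypothetical, while the default read length of 75 and the rounding up are assumptions about how prepDE.py behaves, not a copy of its code.

```python
import math


def estimated_read_count(coverage, transcript_len, read_len=75):
    """Estimate a transcript's read count from its per-base coverage."""
    # coverage * transcript_len is the total number of aligned bases;
    # dividing by the read length converts bases to reads.
    # Rounding up is an assumption made for this sketch.
    return int(math.ceil(coverage * transcript_len / read_len))


# Hypothetical transcript: 1500 bases long with an average coverage of 10x.
count = estimated_read_count(10.0, 1500)
```

If the reads are not 75 bases long, the actual read length should be passed in, otherwise the estimates are scaled incorrectly.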