What is the stringTie tool used for

2025-01-14 Update From: SLTechnology News&Howtos

This article introduces the main uses of stringTie, a tool for transcript assembly and quantification of transcriptome data.

For transcriptome data, the most basic analysis is the quantification of genes and transcripts. Quantification means determining the expression level of a gene or transcript, and there are several ways to measure it.

The most direct way is to count the reads that map to a gene or transcript and use that number as its expression level. This measure is called the raw count.

TPM is obtained from the raw count by normalizing for exon length. For each gene, the raw count is divided by the length of the gene (the sum of its exon lengths) to give a length-normalized expression value. The TPM value of a gene is its relative abundance computed from these normalized values, that is, its share of the total scaled to one million. The formula is as follows, where N_i is the raw count of gene i and L_i is its length in kb.

TPM_i = 10^6 × (N_i / L_i) / Σ_j (N_j / L_j)
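
The TPM calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name tpm and the gene names, counts and lengths are made-up toy values, not output from any real tool.

```python
def tpm(counts, lengths_bp):
    """Convert raw counts to TPM.

    counts:     dict mapping gene -> raw read count
    lengths_bp: dict mapping gene -> summed exon length in base pairs
    """
    # Step 1: length-normalized expression (reads per kb of exon).
    rate = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    total = sum(rate.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    # Step 2: relative abundance, scaled so that all values sum to 1e6.
    return {g: r / total * 1e6 for g, r in rate.items()}


# Hypothetical toy data for three genes.
counts = {"geneA": 100, "geneB": 200, "geneC": 300}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}
tpm_values = tpm(counts, lengths_bp)

for gene, value in tpm_values.items():
    print(gene, round(value, 1))
```

In this toy data geneB has twice the reads of geneA but is also twice as long, so the two end up with the same TPM value.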

RPKM/FPKM is obtained from the raw count by normalizing for two factors: sequencing depth and exon length. The raw count is first divided by the total number of mapped reads to give a relative abundance, and then divided by the length of the gene (the sum of its exon lengths) to give the RPKM value. In sequencing, each insert is called a fragment, and with paired-end sequencing one fragment yields two reads.

The only difference between RPKM and FPKM is how the raw count is computed. RPKM counts reads, while FPKM counts fragments. For single-end sequencing, the number of fragments equals the number of reads; for paired-end sequencing, the number of reads is twice the number of fragments. For FPKM, when both reads of a pair align to the genome they are counted only once, because the two reads come from the same fragment.

The formula is as follows; pay attention to the units. N_i is the raw count of gene i (reads for RPKM, fragments for FPKM), T is the total number of mapped reads (or fragments) in millions (M), and L_i is the gene length in kilobases (k).

RPKM_i = N_i / (L_i × T)
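
As a rough sketch of the unit handling, the same function can serve for both RPKM and FPKM, depending on whether it is given read counts or fragment counts. The function name and all numbers below are hypothetical toy values for a paired-end library in which some fragments have only one mapped mate.

```python
def per_kb_per_million(count, length_bp, total_mapped):
    """Count normalized by gene length (kb) and sequencing depth (millions)."""
    length_kb = length_bp / 1e3
    depth_millions = total_mapped / 1e6
    return count / (length_kb * depth_millions)


# Toy gene of 2000 bp (summed exon length).
# 400 fragments hit the gene; 350 have both mates mapped and 50 only one,
# which gives 350 * 2 + 50 = 750 mapped reads.
gene_length_bp = 2000

# RPKM: count reads, divide by total mapped reads.
rpkm_value = per_kb_per_million(750, gene_length_bp, 38_000_000)

# FPKM: count fragments, divide by total mapped fragments.
fpkm_value = per_kb_per_million(400, gene_length_bp, 20_000_000)

print(round(rpkm_value, 3), round(fpkm_value, 3))
```

The two values differ here only because not every fragment contributes two mapped reads; if every fragment did, the factor of two would cancel between numerator and denominator.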

Many programs can perform quantification. This article focuses on stringTie.

In early transcriptome data analysis, the classic strategy was tophat + cufflinks + cuffdiff. This pipeline reports quantification results as FPKM values and then performs differential expression analysis. However, as sequencing data volumes have grown and analysis methods have developed, this strategy has shown several problems.

First, tophat is very slow compared with newer aligners: for the same amount of data, hisat or star takes only about half an hour, while tophat2 needs at least 5 to 6 hours. Second, differential expression results based on FPKM values agree poorly with experimental validation such as qPCR.

To keep up with new trends in sequencing and analysis, the original development team upgraded the whole pipeline, replacing tophat with hisat and cufflinks + cuffdiff with stringTie + ballgown.

StringTie can be seen as an upgraded version of cufflinks. It serves the same purpose as cufflinks and has two main functions:

Transcript assembly

Quantitative analysis

It runs faster than cufflinks. The official website of the software is:

https://ccb.jhu.edu/software/stringtie/index.shtml

The input to stringTie is a sorted BAM file. Common uses are as follows.

1. Quantify known transcripts

For model organisms such as human and mouse, you usually only need to quantify known transcripts, as follows:

stringtie -p 10 \
  -G hg19.gtf \
  -o output.gtf \
  -b ballgown_out_dir \
  -e \
  align.sorted.bam

The -p parameter sets the number of threads. The -G parameter specifies the GTF annotation file of the reference genome; -o specifies the output file, which is also in GTF format; -b specifies the output directory for ballgown, which makes downstream differential analysis with ballgown easier; and -e tells the software to output quantification results for known transcripts only.
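
When several samples are quantified, the same command is usually repeated once per sample. The sketch below only builds the argument lists from the flags explained above and does not run anything. The helper name quant_command, the sample names, the BAM paths, the output file naming and the ballgown/<sample> directory layout are all assumptions made for illustration.

```python
def quant_command(bam, annotation_gtf, sample, threads=10):
    """Build the stringtie argument list for quantifying known transcripts."""
    return [
        "stringtie",
        "-p", str(threads),               # number of threads
        "-G", annotation_gtf,             # reference annotation (GTF)
        "-o", f"{sample}.stringtie.gtf",  # quantified GTF (assumed naming)
        "-b", f"ballgown/{sample}",       # ballgown directory (assumed layout)
        "-e",                             # known transcripts only
        bam,                              # sorted BAM input
    ]


# Hypothetical samples and their sorted BAM files.
bams = {"sampleA": "A.sorted.bam", "sampleB": "B.sorted.bam"}
commands = [quant_command(bam, "hg19.gtf", name) for name, bam in bams.items()]

for cmd in commands:
    print(" ".join(cmd))
```

On a machine where stringtie is installed, each list could be passed to subprocess.run(cmd, check=True).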

In the GTF output file, the following three expression measures are given for each transcript:

Coverage

TPM

FPKM
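
These three values are stored as attributes (cov, FPKM and TPM) in the last column of each transcript line of the output GTF. A minimal parsing sketch follows; the function name and the example line are invented for illustration, and real output lines may carry additional attributes.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_transcript_line(line):
    """Return the expression values of a transcript line, or None otherwise."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "transcript":
        return None  # comments, exon lines and malformed lines are skipped
    attrs = dict(ATTR_RE.findall(fields[8]))
    return {
        "transcript_id": attrs.get("transcript_id"),
        "cov": float(attrs["cov"]),
        "FPKM": float(attrs["FPKM"]),
        "TPM": float(attrs["TPM"]),
    }


# Hypothetical transcript line, built with explicit tabs.
example = "\t".join([
    "chr1", "StringTie", "transcript", "1000", "5000", "1000", "+", ".",
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "2.5"; FPKM "1.25"; TPM "3.5";',
])
record = parse_transcript_line(example)
print(record)
```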

2. Transcript assembly

To assemble transcripts for a single sample, the usage is as follows:

stringtie align.sorted.bam -o assembly.gtf -p 20 -G hg19.gtf

The assembled transcripts also carry quantification results. Newly assembled transcripts and genes are named by default with the prefix STRG followed by a number, as in the following example:

gene_id "STRG.1" transcript_id "STRG.1.1"

After each sample has been assembled, the transcript assemblies of all samples are merged into a non-redundant transcript set, as follows:

stringtie --merge \
  -o assembly.gtf \
  -p 20 \
  -G hg19.gtf \
  sampleA.gtf sampleB.gtf

In the merged non-redundant transcript set, genes and transcripts are named with the prefix MSTRG followed by a number, as in the following example:

gene_id "MSTRG.2" transcript_id "MSTRG.2.2"
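
The assemble-then-merge workflow can be sketched as a list of commands: one assembly per sample, followed by a single merge over all per-sample GTF files. As before, this only builds argument lists using the flags shown in this section. The helper names, the sample names, the BAM paths and the output name merged.gtf are hypothetical.

```python
def assemble_command(bam, annotation_gtf, out_gtf, threads=20):
    """Argument list for assembling the transcripts of one sample."""
    return ["stringtie", bam, "-o", out_gtf,
            "-p", str(threads), "-G", annotation_gtf]


def merge_command(sample_gtfs, annotation_gtf, out_gtf, threads=20):
    """Argument list for merging per-sample assemblies into one set."""
    return ["stringtie", "--merge", "-o", out_gtf,
            "-p", str(threads), "-G", annotation_gtf, *sample_gtfs]


# Hypothetical samples and their sorted BAM files.
bams = {"sampleA": "A.sorted.bam", "sampleB": "B.sorted.bam"}
per_sample_gtf = {name: f"{name}.gtf" for name in bams}

# Step 1: one assembly per sample. Step 2: one merge over all assemblies.
steps = [assemble_command(bam, "hg19.gtf", per_sample_gtf[name])
         for name, bam in bams.items()]
steps.append(merge_command(list(per_sample_gtf.values()), "hg19.gtf", "merged.gtf"))

for cmd in steps:
    print(" ".join(cmd))
```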

Strictly speaking, stringTie only reports expression at the transcript level, as TPM and FPKM values. To obtain raw counts, the developers provide a prepDE.py script that computes them, used as follows:

python prepDE.py \
  -i sample_list.txt \
  -g gene_count_matrix.csv \
  -t transcript_count_matrix.csv

The input file is sample_list.txt, which has two tab-separated columns: the first is the sample name and the second is the path to that sample's quantification GTF file. For example:

sampleA	A.stringtie.gtf
sampleB	B.stringtie.gtf

Raw counts are output at both the gene and transcript levels.
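
The two-column sample list described above is easy to generate from a script rather than by hand. The sketch below writes it to a temporary directory; the helper name write_sample_list and the sample names and paths are placeholders.

```python
import os
import tempfile


def write_sample_list(samples, path):
    """Write one 'sample<TAB>gtf_path' line per sample."""
    with open(path, "w") as handle:
        for name, gtf_path in samples:
            handle.write(f"{name}\t{gtf_path}\n")


# Hypothetical samples and their quantification GTF files.
samples = [("sampleA", "A.stringtie.gtf"), ("sampleB", "B.stringtie.gtf")]
list_path = os.path.join(tempfile.mkdtemp(), "sample_list.txt")
write_sample_list(samples, list_path)

with open(list_path) as handle:
    print(handle.read(), end="")
```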

For quantification, speed is one advantage of stringTie; its greatest convenience is that it provides results in all three measures: raw count, FPKM and TPM.
