How to Perform Quantitative Analysis with htseq-count


2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Many newcomers are unsure how to run a quantitative analysis with htseq-count. This article explains the process in detail for anyone who needs it.

Like featureCounts, htseq-count is a tool for raw read-count quantification. It is written in Python and is part of the HTSeq package.

Because HTSeq is a Python package, it is convenient to install with pip:

pip install HTSeq

HTSeq provides many functions for processing NGS data; htseq-count is just one of its modules, used for quantitative analysis.

Like featureCounts, htseq-count distinguishes features from meta-features. For transcriptome data, a feature is an exon, and a meta-feature can be a gene or a transcript.
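The relationship between features and meta-features can be illustrated with a small Python sketch that groups exon records from a GTF file by an attribute such as gene_id. This is not how HTSeq itself is implemented; the function name group_features and the three-line GTF excerpt are made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical GTF excerpt (tab-separated, 9 columns), invented for illustration.
GTF_TEXT = (
    'chr1\tdemo\tgene\t100\t500\t.\t+\t.\tgene_id "gene_A";\n'
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "gene_A"; transcript_id "tx_A1";\n'
    'chr1\tdemo\texon\t400\t500\t.\t+\t.\tgene_id "gene_A"; transcript_id "tx_A1";\n'
)


def group_features(gtf_text, feature_type="exon", id_attribute="gene_id"):
    """Group records of one feature type by the value of one attribute.

    feature_type selects rows by column 3 (the feature);
    id_attribute picks the attribute that names the meta-feature.
    """
    pattern = re.compile(r'(?:^|;\s*)' + re.escape(id_attribute) + r' "([^"]+)"')
    groups = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != feature_type:
            continue  # keep only rows of the requested feature type
        match = pattern.search(fields[8])
        if match:
            record = (fields[0], int(fields[3]), int(fields[4]), fields[6])
            groups[match.group(1)].append(record)
    return dict(groups)


exons_by_gene = group_features(GTF_TEXT)
exons_by_transcript = group_features(GTF_TEXT, id_attribute="transcript_id")
```

Grouping by gene_id yields gene-level meta-features, while grouping the same exon rows by transcript_id yields transcript-level ones.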

Two files are required for quantitative analysis:

Alignment file in BAM/SAM format

Genome annotation file in GTF format

For paired-end data, the BAM file must be sorted, either by read name or by position, as declared with the -r parameter.

Because of limited read length and sequence homology within the genome, a read may align to multiple genes, and genes themselves can overlap. To handle these special cases, htseq-count has three built-in modes:

union

intersection-strict

intersection-nonempty

Specify the mode with the --mode parameter; the default is union. The three modes apply different criteria when deciding whether a read belongs to a feature, as shown in the schematic diagram of the counting modes.
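The decision rule behind the three modes can be sketched in a few lines of Python. This is a simplified illustration following the description in the HTSeq documentation, not the HTSeq source code: the function name assign_read and the input format (one set of overlapping feature names per read position) are assumptions made for the example.

```python
def assign_read(position_feature_sets, mode="union"):
    """Return the feature a read is assigned to, or a special category.

    position_feature_sets: list with one collection of feature names
    per aligned position of the read (empty if nothing overlaps there).
    """
    sets = [set(s) for s in position_feature_sets]
    if mode == "union":
        # every feature that overlaps any position of the read
        candidates = set().union(*sets)
    elif mode == "intersection-strict":
        # only features that cover every position of the read
        candidates = set.intersection(*sets) if sets else set()
    elif mode == "intersection-nonempty":
        # like strict, but positions with no feature are ignored
        non_empty = [s for s in sets if s]
        candidates = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError("unknown mode: %r" % mode)

    if not candidates:
        return "__no_feature"
    if len(candidates) > 1:
        return "__ambiguous"
    return next(iter(candidates))


# A read that only partly overlaps gene_A (last position has no feature).
partial = [{"gene_A"}, {"gene_A"}, set()]
# A read that starts in gene_A alone and ends where gene_A and gene_B overlap.
mixed = [{"gene_A"}, {"gene_A", "gene_B"}]

partial_results = {m: assign_read(partial, m)
                   for m in ("union", "intersection-strict", "intersection-nonempty")}
mixed_results = {m: assign_read(mixed, m)
                 for m in ("union", "intersection-strict", "intersection-nonempty")}
```

In this sketch the partially overlapping read is assigned to gene_A in union and intersection-nonempty mode but becomes no_feature in intersection-strict mode, while the read touching both genes is ambiguous only in union mode.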

A BAM file contains both aligned and unaligned reads, and only aligned reads are used for counting. By default, htseq-count also filters reads by mapping quality: the default threshold is 10, meaning that reads with a mapping quality below 10 are skipped and only reads with a mapping quality of at least 10 are counted. The threshold can be changed with the -a parameter.
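The filtering step can be pictured with a minimal Python sketch that inspects the FLAG and MAPQ columns of a SAM record. The function name classify_sam_line and the example records are hypothetical; the category labels follow the special counter names that htseq-count reports, and the real tool performs further checks that are left out here.

```python
def classify_sam_line(sam_line, min_mapq=10):
    """Decide whether one SAM record would pass the basic filters.

    min_mapq plays the role of the -a threshold: records with a
    mapping quality lower than this value are skipped.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & 0x4:          # 0x4 bit set means the read is unmapped
        return "__not_aligned"
    if mapq < min_mapq:
        return "__too_low_aQual"
    return "countable"


# Made-up SAM records with the 11 mandatory columns.
good = "r1\t0\tchr1\t100\t10\t50M\t*\t0\t0\t*\t*"
low = "r2\t0\tchr1\t100\t9\t50M\t*\t0\t0\t*\t*"
unmapped = "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*"

labels = [classify_sam_line(rec) for rec in (good, low, unmapped)]
```

Note that a read with a mapping quality of exactly 10 passes the default threshold in this sketch, because only values lower than the threshold are skipped.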

When a read clearly belongs to a feature, as in the first case of the schematic diagram, where the read lies entirely within gene_A, the count of that feature is increased by 1. When a read clearly does not belong to any feature, it is classed as no_feature. In the second case of the schematic diagram, for example, only part of the read overlaps gene_A, so in intersection-strict mode the read is judged not to belong to gene_A and is counted as no_feature.

When it is unclear which feature a read belongs to, usually because the read lies in a region where two genes overlap, as in the sixth and seventh cases of the schematic diagram, the read is marked as ambiguous.

When a read aligns to more than one location in the genome, it is marked as alignment_not_unique.

When counting the reads that belong to a gene, pay attention to how reads marked ambiguous or alignment_not_unique are handled. This is controlled by the --nonunique parameter, which takes one of the following two values:

none

all

With the default value, none, both kinds of reads are ignored and are not counted for any feature; with all, the count of every corresponding feature is increased by 1.
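The effect of the two values can be sketched in Python as a function that returns which features receive a count. The function name features_to_increment is hypothetical, and the sketch only models the feature counts described above, not the tool's special counters.

```python
from collections import Counter


def features_to_increment(candidate_features, nonunique="none"):
    """Return the features whose count goes up by 1 for one read."""
    features = sorted(set(candidate_features))
    if len(features) <= 1:
        return features      # unique assignment, or no feature at all
    if nonunique == "none":
        return []            # read with several candidates is not counted
    if nonunique == "all":
        return features      # every candidate feature gets +1
    raise ValueError("unsupported value in this sketch: %r" % nonunique)


counts_none = Counter()
counts_all = Counter()
# Three hypothetical reads: two unique to gene_A, one shared by gene_A and gene_B.
for candidates in (["gene_A"], ["gene_A"], ["gene_A", "gene_B"]):
    counts_none.update(features_to_increment(candidates, "none"))
    counts_all.update(features_to_increment(candidates, "all"))
```

With all, the shared read raises the totals of both genes, so the sum of the gene counts can exceed the number of reads.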

Besides --mode and --nonunique, pay attention to the --stranded parameter, which specifies the library type. The default, yes, indicates a strand-specific library; no indicates a non-strand-specific library. For a non-strand-specific library, only the alignment position matters when deciding whether a read belongs to a gene. For a strand-specific library, the strand the read aligns to must also match the strand of the gene, and only reads that match are counted.
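The strand check amounts to a simple comparison, sketched below in Python. The function name strand_compatible is hypothetical, and the sketch covers only single-end reads and the two values discussed here.

```python
def strand_compatible(read_strand, feature_strand, stranded="yes"):
    """Return True if a read may be counted for a feature, judged by strand only."""
    if stranded == "no":
        return True                           # position alone decides
    if stranded == "yes":
        return read_strand == feature_strand  # strands must agree
    raise ValueError("value not covered by this sketch: %r" % stranded)


same_strand = strand_compatible("+", "+", "yes")
opposite_strand = strand_compatible("-", "+", "yes")
opposite_unstranded = strand_compatible("-", "+", "no")
```

Running a non-strand-specific library with the default yes would discard roughly the reads that happen to align to the opposite strand, which is why the parameter deserves attention.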

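Once you understand these three parameters, you can use htseq-count correctly. For non-strand-specific data, typical usage is as follows. The counts are written to standard output, so the command redirects them to a file; the -o parameter is for a SAM file annotated with each read's assignment, not for the counts table.

htseq-count \
  -f bam \
  -r name \
  -s no \
  -a 10 \
  -t exon \
  -i gene_id \
  -m union \
  --nonunique=none \
  align.sorted.bam \
  hg19.gtf > htseq.count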
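The resulting counts table is a tab-separated text file, which can be loaded with a short Python sketch like the one below. It assumes a single input alignment file (two columns per line) and that special counters are prefixed with two underscores; the function name read_counts and the example numbers are invented for illustration.

```python
import io


def read_counts(handle):
    """Split a two-column counts table into gene counts and special counters."""
    gene_counts, special = {}, {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        name, value = line.split("\t")
        # special counters such as __no_feature are kept apart from genes
        target = special if name.startswith("__") else gene_counts
        target[name] = int(value)
    return gene_counts, special


# Made-up table content standing in for a real counts file.
example = io.StringIO(
    "gene_A\t12\n"
    "gene_B\t0\n"
    "__no_feature\t3\n"
    "__ambiguous\t1\n"
)
gene_counts, special = read_counts(example)
total_assigned = sum(gene_counts.values())
```

For a real file, the same function can be called on an opened file handle in place of the in-memory example.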

In terms of speed, featureCounts is many times faster than htseq-count. featureCounts supports not only gene- and transcript-level quantification but also quantification of individual features such as exons. For these reasons, featureCounts is the recommended tool for quantification.
