How to use the genome alignment tool hisat2

2025-04-18 Update From: SLTechnology News&Howtos shulou


This article explains in detail how to use the genome alignment tool hisat2: what read mapping is, what a mapping tool has to take into account, and how to build an index and run an alignment.

Because the read length of a sequencer is limited, the DNA has to be fragmented during library construction, so each sequenced read covers only a small part of the genome. To determine where a read comes from, you align it back to the reference genome, a step called mapping.

When mapping, you need to consider the following factors:

1. Consumption of hardware resources

Generally speaking, the larger the genome, the more memory it takes up. For large genomes, such as the human genome, optimizing memory consumption is critical.

2. Speed

As sequencing gets cheaper and analyses dig deeper into the data, the volume of sequencing data keeps growing, so aligning a very large number of reads has to be fast.

3. Accuracy

Because of SNPs, indels and sequencing errors, a read can differ from the corresponding genome sequence by a few bases, so the mapping algorithm must tolerate mismatches and gaps. In addition, a short read may be similar to several locations on the genome, so one read can align to multiple places. Paired-end sequencing helps to resolve this to some extent: the two reads of a pair come from the same DNA fragment, so their positions on the genome cannot be too far apart. This alone does not resolve every ambiguous alignment, so the algorithm also has to score the candidate locations and report how reliable each alignment is.

4. RNA

For transcriptome data, splicing in eukaryotes means that a cDNA fragment may not be contiguous on the genome: an intron may lie in the middle of it. When aligning transcriptome data, the aligner therefore has to be able to align a read across splice junctions.
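The accuracy point above (mismatch tolerance and reads that fit several locations) can be illustrated with a toy mapper. This is only a brute-force sketch on a made-up sequence. It is not how hisat2 works internally (hisat2 uses an FM index), and the function name `find_hits` is hypothetical.

```python
def find_hits(reference, read, max_mismatches=1):
    """Return (position, mismatches) for every ungapped placement of the
    read on the reference that has at most max_mismatches mismatches."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Made-up reference with a near-duplicate of the read at position 0
# and an exact copy at position 17.
reference = "ACGTACGTTTGACCAGTACGTACGAT"
read = "ACGTACGA"
hits = find_hits(reference, read, max_mismatches=1)
print(hits)  # [(0, 1), (17, 0)]
```

The read fits two locations, one of them with a single mismatch, which is why an aligner needs a scoring scheme to rank candidate locations.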

There are many mapping tools, such as BWA, HISAT and STAR. HISAT is one of the fastest and is the successor to TopHat. It uses an improved FM index algorithm and needs only about 4.3 GB of memory for the human genome. It supports both DNA and RNA data. The official website of the software is

http://ccb.jhu.edu/software/hisat2/index.shtml

The latest version is hisat2. It is installed as follows

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
unzip hisat2-2.1.0-Linux_x86_64.zip

Download and decompress it.

Before aligning, you need to build an index of the reference genome. The basic usage is as follows

hisat2-build -p 20 hg19.fa hg19

For transcriptome data, you can extract splice site and exon information from a GTF file and pass it in when building the index, as follows

hisat2_extract_splice_sites.py hg19.gtf > hg19.ss
hisat2_extract_exons.py hg19.gtf > hg19.exon
hisat2-build -p 20 --ss hg19.ss --exon hg19.exon hg19.fa hg19

Hisat2 supports input files in multiple formats. There are two common formats

Fasta

Fastq

The -f parameter indicates that the input is in fasta format, and the -q parameter indicates that the input is in fastq format. Input files may be gzip compressed. Fastq is the default input format.
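If you are not sure which format a file is in, the first character usually tells you: fasta records start with ">" and fastq records start with "@". The helper below is a minimal sketch built on that assumption. The name `guess_format_flag` is hypothetical and not part of hisat2.

```python
import gzip


def guess_format_flag(path):
    """Return '-f' for fasta or '-q' for fastq, judged by the first character.

    Files whose name ends in .gz are opened with gzip.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first = handle.read(1)
    if first == ">":
        return "-f"
    if first == "@":
        return "-q"
    raise ValueError(f"cannot tell fasta from fastq: {path}")
```

For example, `guess_format_flag("reads.fq.gz")` would return `"-q"` for a gzip compressed fastq file.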

For single-end data, -U specifies the input file. For paired-end data, -1 and -2 specify the R1 and R2 input files, respectively.

When a read is aligned to a location on the genome, we call that an alignment. The software scores every alignment. An alignment that passes the filtering criteria is called a valid alignment, and only valid alignments are output.

As in blast, each alignment is scored. Hisat2 scores an alignment on the following aspects

1. Mismatch penalty

The penalty for mismatched bases is set with the --mp parameter. Its value is two comma-separated numbers: the first is the maximum penalty and the second is the minimum penalty.

2. Gap penalty on the read

A gap penalty has two parts: a gap open penalty and a gap extension penalty. The gap penalty on the read is set with the --rdg parameter. Its value is two comma-separated numbers: the first is the gap open penalty and the second is the gap extension penalty.

3. Gap penalty on the reference

The gap penalty on the reference is set with the --rfg parameter. Its value is two comma-separated numbers: the first is the gap open penalty and the second is the gap extension penalty.

Once these penalties have been applied, each alignment has a score, and a threshold then decides whether that score is good enough for the alignment to count as valid.
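The way the penalties add up can be sketched as follows. The default values (6,2 for --mp and 5,3 for --rdg and --rfg) and the two formulas are taken from the hisat2 manual as I understand it, not from this article, so treat them as assumptions. The sketch also ignores other penalties hisat2 applies, such as those for soft clipping, N bases and splice sites. All function names are hypothetical.

```python
import math


def mismatch_penalty(phred, mx=6, mn=2):
    """Quality-aware mismatch penalty: MN + floor((MX - MN) * min(Q, 40) / 40)."""
    return mn + math.floor((mx - mn) * min(phred, 40) / 40)


def gap_penalty(length, gap_open=5, gap_extend=3):
    """Penalty for one gap of the given length: open + length * extend."""
    return gap_open + length * gap_extend


def alignment_score(mismatch_quals, read_gaps=(), ref_gaps=()):
    """Sum all penalties and return the alignment score (0 is a perfect match).

    mismatch_quals: Phred quality of each mismatched base
    read_gaps / ref_gaps: lengths of the gaps on the read / on the reference
    """
    penalty = sum(mismatch_penalty(q) for q in mismatch_quals)
    penalty += sum(gap_penalty(n) for n in read_gaps)
    penalty += sum(gap_penalty(n) for n in ref_gaps)
    return -penalty


# Two mismatches (Phred 40 and Phred 10) plus one read gap of length 2:
# 6 + 3 + (5 + 2 * 3) = 20, so the score is -20.
score = alignment_score(mismatch_quals=[40, 10], read_gaps=[2])
print(score)  # -20
```

A mismatch at a high-quality base costs more than one at a low-quality base, which is why --mp takes a maximum and a minimum value.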

Hisat2 sets the threshold with the --score-min parameter, which is given as a function of the read length. The default value is L,0,-0.2, and the corresponding function is

f(x) = 0 - 0.2 * x

Here x is the read length, so the score threshold follows from the length of the read. An alignment whose score is at or above the threshold is a valid alignment and can be output. L stands for a linear function. Other function types are also supported, such as constant (C), square root (S) and natural logarithm (G). Refer to the official documentation for details.
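As a worked example of the threshold function, the sketch below evaluates a --score-min style specification for a given read length. It assumes the specification has the form "type,constant,coefficient" with types C, L, S and G as described in the manual. The name `min_score` is hypothetical.

```python
import math


def min_score(spec, read_length):
    """Evaluate a --score-min style spec such as 'L,0,-0.2' for one read length."""
    parts = spec.split(",")
    kind = parts[0]
    constant = float(parts[1])
    # The coefficient is irrelevant for a constant function.
    coefficient = float(parts[2]) if len(parts) > 2 else 0.0
    if kind == "C":
        return constant
    if kind == "L":
        return constant + coefficient * read_length
    if kind == "S":
        return constant + coefficient * math.sqrt(read_length)
    if kind == "G":
        return constant + coefficient * math.log(read_length)
    raise ValueError(f"unknown function type: {kind}")


# With the default L,0,-0.2 the threshold scales with read length.
for length in (50, 100, 150):
    print(length, min_score("L,0,-0.2", length))
```

With the default setting, a 100 bp read needs a score of about -20 or higher, while a 150 bp read is allowed to go down to about -30.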

A read may have more than one valid alignment. Hisat2 does not report all of them, only up to N, where N is set with the -k parameter and defaults to 5.
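The combined effect of the score threshold and -k can be modelled as a filter followed by a cap. This is a simplification for illustration only: the real search is heuristic and does not enumerate every candidate before choosing, and the function name and the candidate list are made up.

```python
def select_alignments(scored, min_score, k=5):
    """Keep alignments with score >= min_score, best first, at most k of them.

    scored: list of (position, score) tuples for one read
    """
    valid = [item for item in scored if item[1] >= min_score]
    valid.sort(key=lambda item: item[1], reverse=True)
    return valid[:k]


# Four made-up candidate locations for one 100 bp read (threshold -20).
candidates = [(100, -6), (5200, -25), (9000, 0), (12000, -18)]
reported = select_alignments(candidates, min_score=-20, k=2)
print(reported)  # [(9000, 0), (100, -6)]
```

The candidate at position 5200 is dropped because its score is below the threshold, and the one at 12000 is valid but not reported because only two alignments are allowed.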

The output is in SAM format and is written to standard output (the screen) by default. The -S parameter specifies an output file.
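To give an idea of what a SAM record looks like, the sketch below parses one alignment line into its main fields. The record is made up for illustration. The field order follows the SAM specification, and AS:i (alignment score) and NH:i (number of reported alignments) are optional tags that aligners such as hisat2 commonly write. The function name is hypothetical.

```python
def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "qname": fields[0],      # read name
        "flag": int(fields[1]),  # bitwise flag
        "rname": fields[2],      # reference sequence name
        "pos": int(fields[3]),   # 1-based leftmost position
        "mapq": int(fields[4]),  # mapping quality
        "cigar": fields[5],
        "tags": {},
    }
    # Optional fields have the form NAME:TYPE:VALUE and start at column 12.
    for tag in fields[11:]:
        name, value_type, value = tag.split(":", 2)
        record["tags"][name] = int(value) if value_type == "i" else value
    return record


# Made-up record: a 50 bp read aligned to chr1 with one mismatch.
line = "\t".join([
    "read1", "0", "chr1", "10468", "60", "50M", "*", "0", "0",
    "A" * 50, "I" * 50, "AS:i:-6", "NH:i:1",
])
record = parse_sam_line(line)
is_unmapped = bool(record["flag"] & 4)  # flag bit 0x4 marks an unmapped read
print(record["rname"], record["pos"], record["tags"]["AS"], is_unmapped)
```

For real analyses a dedicated tool or library for SAM/BAM files is a better choice than hand-written parsing.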

Usually, the default parameters are sufficient. Single-end data is aligned as follows

hisat2 -x hg19 -p 20 -U reads.fq -S align.sam

Paired-end data is aligned as follows

hisat2 -x hg19 -p 20 -1 R1.fq -2 R2.fq -S align.sam
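When the alignment is run from a script, the two invocations can be assembled by one small helper. The sketch below only builds the argument list from the options covered in this article (-x, -p, -U, -1, -2, -S) and does not run anything. The function name and its parameters are hypothetical.

```python
def build_hisat2_command(index, output_sam, threads=1,
                         single=None, mate1=None, mate2=None):
    """Build the hisat2 argument list for single-end or paired-end input."""
    cmd = ["hisat2", "-x", index, "-p", str(threads)]
    if single is not None and mate1 is None and mate2 is None:
        cmd += ["-U", single]
    elif single is None and mate1 is not None and mate2 is not None:
        cmd += ["-1", mate1, "-2", mate2]
    else:
        raise ValueError("give either single, or both mate1 and mate2")
    cmd += ["-S", output_sam]
    return cmd


single_cmd = build_hisat2_command("hg19", "align.sam", threads=20,
                                  single="reads.fq")
paired_cmd = build_hisat2_command("hg19", "align.sam", threads=20,
                                  mate1="R1.fq", mate2="R2.fq")
print(" ".join(single_cmd))
print(" ".join(paired_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where hisat2 is installed and the hg19 index has been built.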
