How to understand SAM/BAM file format

Updated 2025-03-26. From: SLTechnology News&Howtos (shulou)

The SAM/BAM file format can be confusing for people who are new to it. This article walks through the format field by field so that you can read these files with confidence.

This article focuses on the meaning of the alignment section of a SAM file. Each line in the alignment section is a tab-separated (\t) record with 11 mandatory columns. The meaning of each column is as follows.

1. Column1

The first column is QNAME, the name of the input sequence, usually the identifier of the read.

2. Column2

The second column is FLAG. The format defines the following flags in advance. Each flag is represented by a number and describes one property of the alignment.

1 means that the read comes from paired-end (PE) sequencing.

2 means that the read and its mate are properly aligned as a pair, as judged by the aligner. It does not mean that the read matches the reference exactly.

4 means that the read is not mapped to the reference sequence.

8 means that the mate of this read is not mapped to the reference sequence. For example, if this read is R1, its corresponding R2 is unmapped.

16 means that the read aligns to the reverse (negative) strand of the reference sequence.

32 means that the mate of this read aligns to the reverse (negative) strand of the reference sequence.

64 means that this read is R1, the first read of the pair.

128 means that this read is R2, the second read of the pair.

256 means that this alignment is a secondary alignment. A read may align to multiple positions on the reference sequence; the aligner designates one of these alignments as the primary alignment and reports the others as secondary alignments.

512 means that the read failed quality control (QC) checks.

1024 means that the read is a PCR or optical duplicate.

2048 means that the alignment is a supplementary alignment. The alignment of a read is usually a single full-length alignment to one region of the genome. A chimeric read, however, aligns to two or more different regions of the genome, producing several alignments. One of them is the representative alignment and the others are called supplementary alignments.

The flag values above are all powers of 2. A property of such a set is that any combination of them has a unique sum, so a FLAG value can always be decomposed into its individual flags. For example, 65 can only be composed of 1 and 64, which means that the read is from paired-end sequencing and is read 1.
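The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any library: the function name `decode_flag` and the short labels in `SAM_FLAGS` are made up for this example, and the meanings of the bits follow the SAM specification.

```python
# Bit values and illustrative short names for the SAM FLAG bits.
# The names are labels chosen for this sketch, not official identifiers.
SAM_FLAGS = {
    1: "paired",
    2: "proper_pair",
    4: "unmapped",
    8: "mate_unmapped",
    16: "reverse_strand",
    32: "mate_reverse_strand",
    64: "read1",
    128: "read2",
    256: "secondary",
    512: "qc_fail",
    1024: "duplicate",
    2048: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    # A bitwise AND is non-zero only when that particular bit is set.
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(65))  # ['paired', 'read1']
print(decode_flag(99))  # ['paired', 'proper_pair', 'mate_reverse_strand', 'read1']
```

Because every flag occupies its own bit, testing each bit independently is enough; no search over combinations is needed.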

3. Column3

The third column is RNAME, the name of the reference sequence the read aligns to, usually the name of a chromosome.

4. Column4

The fourth column is POS, the 1-based leftmost position on the chromosome where the read alignment starts.

5. Column5

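The fifth column is MAPQ, the mapping quality. It is a Phred-scaled estimate of the probability that the mapping position is wrong; higher values indicate a more reliable mapping.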
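As a rough illustration of what a MAPQ value means: the SAM specification defines it as -10 * log10 of the probability that the mapping position is wrong, with 255 meaning that the value is unavailable. The sketch below inverts that formula. The function name is hypothetical, and individual aligners assign MAPQ in their own ways, so the result should be treated as a guide rather than an exact probability.

```python
def mapq_to_error_probability(mapq):
    """Convert a MAPQ value to the implied probability of a wrong mapping.

    Returns None for 255, which the SAM specification reserves for
    "mapping quality not available".
    """
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


# MAPQ 20 corresponds to roughly a 1 in 100 chance of a wrong position,
# MAPQ 30 to roughly 1 in 1000.
for q in (0, 20, 30, 255):
    print(q, mapq_to_error_probability(q))
```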

6. Column6

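The sixth column is CIGAR, a compact description of how the read aligns to the reference. The following characters are used to describe the alignment.

M stands for an alignment match, which may be an exact match or a mismatch. I means an insertion relative to the reference (bases present in the read but not in the genome), D means a deletion relative to the reference (bases present in the genome but missing from the read), N means a skipped region on the reference, and S means soft clipping: bases at the end of the read that are not part of the alignment but are still kept in SEQ.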

An example alignment is shown below, with "-" marking a gap.

Ref:   AAG-CGCTATAGAA
Query: AAGTCGCT--AG

Reading along the query sequence: the first three bases match, represented by 3M; then one base is present in the query but not in the reference, an insertion represented by 1I; then four bases match, represented by 4M; then two reference bases are missing from the query, a deletion of 2 bp relative to the genome, represented by 2D; finally two bases match, represented by 2M.

To sum up, the CIGAR string for this alignment is 3M1I4M2D2M.
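A CIGAR string like this one can be split into (length, operation) pairs with a regular expression. The sketch below is illustrative only: the function names are made up, and the operators H, P, = and X are not covered in this article but are included because the SAM specification defines them. It also counts how many bases the alignment consumes on the query and on the reference.

```python
import re

# One CIGAR element is a run length followed by a single operator letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that advance along the read and along the reference.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operator) tuples."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def consumed_lengths(cigar):
    """Return (query_bases, reference_bases) covered by the alignment."""
    query_len = 0
    ref_len = 0
    for length, op in parse_cigar(cigar):
        if op in QUERY_OPS:
            query_len += length
        if op in REF_OPS:
            ref_len += length
    return query_len, ref_len


print(parse_cigar("3M1I4M2D2M"))
# [(3, 'M'), (1, 'I'), (4, 'M'), (2, 'D'), (2, 'M')]
print(consumed_lengths("3M1I4M2D2M"))
# (10, 11)
```

For the example alignment, the 10 query bases are the full read AAGTCGCTAG, while the alignment spans 11 bases of the reference because the deletion consumes reference bases but not read bases.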

7. Column7

The seventh column is RNEXT, the name of the reference sequence to which the mate (the next read in the template) is aligned. For paired-end data this is the chromosome of the other read in the pair. The value is "=" when it is the same as RNAME and "*" when the information is unavailable, for example for single-end reads.

8. Column8

The eighth column is PNEXT, the 1-based alignment position of the mate (the next read in the template). The value is 0 when the information is unavailable.

9. Column9

The ninth column is TLEN, the observed template length, commonly called the insert size. It is inferred from the alignment positions of the two reads in a pair and is 0 when the information is unavailable.

10. Column10

The tenth column is SEQ, the read sequence, usually the sequence from the FASTQ file (stored reverse-complemented when the read aligns to the reverse strand).

11. Column11

The eleventh column is QUAL, the base qualities of the read, usually the quality string from the FASTQ file.
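Putting the 11 columns together, a single alignment line can be split on tabs and converted into named fields. The following is a minimal sketch: `SamRecord` and `parse_sam_line` are hypothetical names, the example line is invented for illustration (it reuses the CIGAR example from column 6), and the numeric columns are converted to integers as the SAM specification defines them. For real data, a dedicated library is usually a better choice than hand-written parsing.

```python
from collections import namedtuple

# One field per mandatory column, plus a list for anything after column 11.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual extra",
)


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        extra=fields[11:],  # fields beyond column 11, kept as raw strings
    )


# A made-up record for illustration only.
example_line = "read1\t0\tchr1\t100\t60\t3M1I4M2D2M\t*\t0\t0\tAAGTCGCTAG\tIIIIIIIIII"
record = parse_sam_line(example_line)
print(record.qname, record.rname, record.pos, record.cigar)
```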

In addition to the 11 columns above, a record may carry optional fields, each written as TAG:TYPE:VALUE. The type is a single letter: A for a single character, Z for a string, i for an integer, and f for a floating-point number.

For example, the NH tag records the number of reported alignments for the read, written as follows

NH:i:2

This indicates that the read aligns to two locations in the genome. For a detailed explanation of all tags, please refer to the following link

https://samtools.github.io/hts-specs/SAMtags.pdf
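An optional field in the TAG:TYPE:VALUE form can be split on its first two colons and converted according to its type letter. This is a small sketch with a hypothetical function name; it handles only the four types listed above and returns any other type as an unconverted string.

```python
def parse_tag(field):
    """Split 'TAG:TYPE:VALUE' and convert the value for types i and f."""
    # Limit the split to two colons so string values may contain ':'.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    # A (single character), Z (string), and any other type stay as text.
    return tag, value


print(parse_tag("NH:i:2"))  # ('NH', 2)
```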
