What are SAM and BAM files

2025-03-26 Update From: SLTechnology News&Howtos


This article explains in detail what SAM and BAM files are. It is shared here as a practical reference.

When sequenced FASTQ reads are mapped to a genome, the result is a file with the extension .sam or .bam. SAM stands for Sequence Alignment/Map format. BAM is the binary version of SAM, that is, the same content stored in compressed form. So what does a SAM file look like? Here is a brief explanation.

Introduction to SAM format

A SAM file consists of a header section and alignment records. The header holds comment-like metadata lines, each starting with @. It is optional, so it is not covered here. The important part is the alignment records, such as these:

E00514:173:H3C3JCCXY:4:1124:12398:67234 337 Chr00 32904 0 150M Chr09 33498107 0 TCAATTTCACTTGAAGCTTACTTGTAGTTTCAGGCTTGGTCAAGCGCGATACAAACCATGTAGTAGGAGTCCTCCAAGTCGCCAAGCTAGGGGATCTGCTGAAAGAGGTGACAGACAAGGTAAGCAATCAGAGCTCTAAGCAATCAGTCC iiiiiiiiii... XO:i:0 XG:i:0 NM:i:1 MD:Z:136C13 YT:Z:UU NH:i:8 CC:Z:Chr10 CP:i:18604313 HI:i:4 RG:Z:J36CK1

E00514:173:H3C3JCCXY:4:1124:12398:67234 369 Chr00 32904 0 150M Chr17 31040767 0 TCAATTTCACTTGAAGCTTACTTGTAGTTTCAGGCTTGGTCAAGCGCGATACAAACCATGTAGTAGGAGTCCTCCAAGTCGCCAAGCTAGGGGATCTGCTGAAAGAGGTGACAGACAAGGTAAGCAATCAGAGCTCTAAGCAATCAGTCC iiiiiiiiii... ... YT:Z:UU NH:i:8 CC:Z:Chr10 CP:i:18604313 HI:i:6 RG:Z:J36CK1

E00514:173:H3C3JCCXY:4:1212:19025:24532 409 Chr00 33538 0 150M * 0 0 GATTCCAAGTGCTGACTGATTGCTCTCTTTCTCCTTGTCTTGCAGGTAAGAACAAGGCCAAAGGAAAAGACAGGGAAAAAACATGAAATGAGATACTCTTGCTTTTAACCCTGATGATATGAGATATTCTTGCTCTAGTATAGCTTGTTT i`e`eiiiii... ...

(Each QUAL string has one character per base, 150 in total, and is abbreviated here with "...". A standalone "..." marks optional fields that are missing from this listing.)

Fields (columns) are separated by tabs. The meaning of each field is shown in the following figure:

Where:

1. QNAME: the name of the read

2. FLAG: a number that describes the alignment result. Each value has its own meaning, listed as follows:

Explanation:

1: the read comes from paired-end (PE) sequencing
2: the read is mapped in a proper pair, that is, both mates are aligned with the orientation and distance the aligner expects
4: the read is not mapped to the reference sequence
8: the mate of this read is not aligned to the reference sequence. For example, if this read is R1, its corresponding R2 is not aligned
16: the read is aligned to the reverse (negative) strand of the reference sequence
32: the mate of this read is aligned to the reverse (negative) strand of the reference sequence
64: the read is the first read of the pair (R1, read1)
128: the read is the second read of the pair (R2, read2)
256: the alignment is not the primary alignment. A read may align to multiple positions of the reference sequence; only one of them is the primary alignment and the others are secondary
512: the read failed quality control (QC) checks (not commonly used)
1024: the read is a PCR duplicate (not commonly used)
2048: the alignment is a supplementary alignment, that is, one part of a chimeric (split) alignment (not commonly used)

The FLAG can also be a combination of the above values (that is, their sum). For example, a FLAG of 83 (64 + 16 + 2 + 1) indicates a read from paired-end sequencing that is the first read of its pair (R1), is mapped in a proper pair, and is aligned to the reverse strand of the reference sequence.
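Because a FLAG is a sum of distinct powers of two, it can be split back into its parts with bitwise tests. The following is a minimal Python sketch of that idea, not part of samtools; the short bit names follow the style printed by `samtools flags`, and the function name `decode_flag` is chosen here for illustration.

```python
# Each FLAG bit and a short name for it (names follow the `samtools flags` style).
FLAG_BITS = [
    (1, "PAIRED"),
    (2, "PROPER_PAIR"),
    (4, "UNMAP"),
    (8, "MUNMAP"),
    (16, "REVERSE"),
    (32, "MREVERSE"),
    (64, "READ1"),
    (128, "READ2"),
    (256, "SECONDARY"),
    (512, "QCFAIL"),
    (1024, "DUP"),
    (2048, "SUPPLEMENTARY"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    # 83 is the example from the text; 337, 369 and 409 are the FLAG
    # values of the three example records shown earlier.
    for value in (83, 337, 369, 409):
        print(value, hex(value), ",".join(decode_flag(value)))
```

Decoding 337, for instance, shows that the first example record is a secondary alignment of a read 1 on the reverse strand.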

3. RNAME: the name of the reference sequence the read is aligned to, such as a chromosome of the genome. If the read is not aligned, it is shown as *.

4. POS: the leftmost position of the alignment on the reference sequence, counted from 1, or 0 if the read is not aligned

5. MAPQ: mapping quality (the larger the number, the more specific the alignment)

6. CIGAR: a string with the details of the alignment, recording matches or mismatches (M), insertions (I), deletions (D), spliced gaps (N), and clipped read ends (S, H). For example, 150M means 150 aligned bases.

7. RNEXT: in paired-end sequencing, the name of the reference sequence to which the next read (the mate) is aligned, denoted by "*" if there is none, or "=" if it is the same reference sequence as this read's.

8. PNEXT: the position on the reference sequence to which the next read (the mate) is aligned. If there is none, it is 0.

9. TLEN: the length of the sequenced template (fragment)

10. SEQ: the read sequence

11. QUAL: the base quality string of the read, one character per base

12. Optional fields: written as TAG:TYPE:VALUE, where TAG is a two-character code and each TAG represents one kind of information. A TAG can appear only once per line. TYPE is the type of the corresponding value, such as A (character), i (integer), f (float), Z (string), H (hex byte array), or B (numeric array).
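To make the field layout concrete, here is a small Python sketch that splits one alignment line into the 11 mandatory fields, the optional TAG:TYPE:VALUE fields, and the individual CIGAR operations. It is an illustration under simple assumptions (only the `i` tag type is converted to a number, and the sample record is made up for the demo), not a replacement for a real SAM library.

```python
import re

MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5M1I4M' into (length, operation) pairs."""
    if cigar == "*":
        return []
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def parse_record(line):
    """Parse one tab-separated SAM alignment line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(MANDATORY):
        raise ValueError("a SAM record needs at least 11 fields")
    record = {}
    for name, value in zip(MANDATORY, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for item in cols[len(MANDATORY):]:
        tag, value_type, value = item.split(":", 2)
        # Simplification: only integer tags are converted, the rest stay text.
        tags[tag] = int(value) if value_type == "i" else value
    record["TAGS"] = tags
    record["CIGAR_OPS"] = parse_cigar(record["CIGAR"])
    return record


if __name__ == "__main__":
    # Hypothetical short record, invented for this demo.
    line = "\t".join(["read1", "99", "chr1", "100", "42", "5M1I4M", "=",
                      "250", "160", "ACGTACGTAC", "IIIIIIIIII",
                      "NM:i:1", "RG:Z:sample1"])
    record = parse_record(line)
    print(record["QNAME"], record["RNAME"], record["POS"])
    print(record["CIGAR_OPS"])
    print(record["TAGS"])
```

Splitting on tabs rather than on arbitrary whitespace matters because optional string values may themselves contain spaces.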

Common bam/sam file processing

SAM files are usually very large, so to save storage space they are converted to the binary BAM format. SAM/BAM files can be processed with dedicated software such as samtools, including format conversion, sorting, indexing, and so on.

1. Reading BAM files

BAM files are binary and cannot be viewed directly, but they can be read with samtools:

samtools view xxx.bam
samtools view xxx.bam | less -S
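As a side note on why a BAM file cannot be read as text: BAM is stored in a gzip-compatible block-compressed form, and its decompressed content begins with the four bytes `BAM\1`. The Python sketch below uses only the standard `gzip` module to check for that signature. The helper name `looks_like_bam` and the demo file names are made up for illustration, and the demo writes an ordinary gzip stream as a stand-in for a real BAM file.

```python
import gzip
import os
import tempfile


def looks_like_bam(path):
    """Return True if the file is gzip-compatible and starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compatible, for example a plain-text SAM file.
        return False


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    fake_bam = os.path.join(workdir, "demo.bam")
    plain_sam = os.path.join(workdir, "demo.sam")
    # Stand-in for a BAM file: a gzip stream whose payload starts with the magic bytes.
    with open(fake_bam, "wb") as out:
        out.write(gzip.compress(b"BAM\x01" + b"\x00" * 8))
    with open(plain_sam, "w") as out:
        out.write("@HD\tVN:1.6\n")
    print(looks_like_bam(fake_bam), looks_like_bam(plain_sam))
```

For actually reading the records, `samtools view` as shown above remains the practical route.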

2. SAM/BAM conversion

samtools view -h xxx.bam > xxx.sam
samtools view -b -S xxx.sam > xxx.bam

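3. Sorting BAM files

samtools sort xxx.bam outputPrefix

This is the syntax of older samtools releases. Newer releases take the output file through the -o option, for example: samtools sort -o xxx.sorted.bam xxx.bam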

4. Indexing a BAM file

samtools index xxx.bam

5. Evaluating the mapping results

After mapping, the quality of the results can be evaluated with samtools:

samtools idxstats xxx.bam

The BAM file must be sorted and indexed before this step. The results look like this:

chr1 195471971 6112404 0
chr10 130694993 3933316 0
chr11 122082543 6550325 0
chr12 120129022 3876527 0
chr13 120421639 5511799 0
chr14 124902244 3949332 0
chr15 104043685 3872649 0

The first column is the chromosome name, the second column is the sequence length, the third column is the number of mapped reads, and the fourth column is the number of unmapped reads.
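The four-column output is easy to post-process. As a rough sketch, the Python below parses idxstats-style text, sums the mapped reads, and computes mapped reads per megabase for each chromosome. The numbers are the ones shown above, while the function names and the per-megabase measure are choices made here for illustration.

```python
# idxstats-style text: name, length, mapped reads, unmapped reads (tab-separated).
IDXSTATS = """\
chr1\t195471971\t6112404\t0
chr10\t130694993\t3933316\t0
chr11\t122082543\t6550325\t0
chr12\t120129022\t3876527\t0
chr13\t120421639\t5511799\t0
chr14\t124902244\t3949332\t0
chr15\t104043685\t3872649\t0
"""


def parse_idxstats(text):
    """Turn idxstats text into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def reads_per_megabase(rows):
    """Mapped reads per million bases; rows with length 0 are skipped."""
    return {name: mapped / (length / 1e6)
            for name, length, mapped, _ in rows if length > 0}


if __name__ == "__main__":
    rows = parse_idxstats(IDXSTATS)
    print("total mapped reads:", sum(mapped for _, _, mapped, _ in rows))
    for name, density in reads_per_megabase(rows).items():
        print(f"{name}\t{density:.0f} reads/Mb")
```

Normalizing by length makes chromosomes of different sizes comparable, which the raw read counts alone do not.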

6. Flag statistics

Count the alignment FLAG information in the BAM file and output summary statistics:

samtools flagstat xxx.bam

total: total number of reads analyzed (all lines of the BAM file)
mapped: number of reads that aligned (gives the overall alignment rate)
paired in sequencing: total number of reads that belong to a pair
read1: number of reads that are read 1
read2: number of reads that are read 2
properly paired: number of reads mapped in a proper pair
with itself and mate mapped: number of reads for which both the read and its mate aligned
singletons: number of reads that aligned while their mate did not

All counts are in reads, so a pair of reads counts as two.
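Each of these categories is just a test on FLAG bits. The Python sketch below tallies them from a list of FLAG values. It is a simplified illustration that treats every record alike; the exact rules samtools applies (for example to secondary and supplementary records) may differ between versions, and the sample FLAG values are made up.

```python
from collections import Counter

PAIRED, PROPER_PAIR, UNMAP, MUNMAP = 1, 2, 4, 8
READ1, READ2 = 64, 128


def tally_flags(flags):
    """Count flagstat-like categories from an iterable of SAM FLAG values."""
    counts = Counter()
    for flag in flags:
        counts["total"] += 1
        mapped = not (flag & UNMAP)
        if mapped:
            counts["mapped"] += 1
        if flag & PAIRED:
            counts["paired in sequencing"] += 1
            if flag & READ1:
                counts["read1"] += 1
            if flag & READ2:
                counts["read2"] += 1
            if mapped and flag & PROPER_PAIR:
                counts["properly paired"] += 1
            if mapped and not (flag & MUNMAP):
                counts["with itself and mate mapped"] += 1
            if mapped and flag & MUNMAP:
                counts["singletons"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical FLAG values: one proper pair (99, 147), one pair with
    # only read 1 mapped (73, 133), and one pair with neither mapped (77, 141).
    for key, value in tally_flags([99, 147, 73, 133, 77, 141]).items():
        print(f"{key}: {value}")
```

In practice the FLAG values would come from the second column of `samtools view` output.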

You can also quickly look up the meaning of a FLAG value with the following command:

$ samtools flags 141
0x8d 141 PAIRED,UNMAP,MUNMAP,READ2

# PAIRED means the read comes from paired-end sequencing; its value is 1
# UNMAP means the read is not mapped to the reference sequence; its value is 4
# MUNMAP means the mate of the read is not aligned to the reference sequence; its value is 8
# READ2 means the read is the R2 read; its value is 128
# The sum of these values is 141

7. Merge BAM files

Merge multiple sorted sequence files into one file

samtools merge -n out.bam in1.bam in2.bam in3.bam...

The -n option tells samtools that the input files are sorted by read name. Leave it out when the inputs are sorted by coordinate.
