Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to use samtools

2025-04-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/01 Report--

Most people do not understand the knowledge of this article "how to use samtools", so the editor summarizes the following content, detailed content, clear steps, and has a certain reference value. I hope you can gain something after reading this article. Let's take a look at this "how to use samtools" article.

First of all, let's take a look at what commands samtools has.

$samtoolsProgram: samtools (Tools for alignments in the SAM format) Version: 1.9 (using htslib 1.9) Usage: samtools [options] Commands:-- Indexing dict create a sequence dictionary file faidx index/extract FASTA fqidx index/extract FASTQ index index alignment-- Editing calmd recalculate MD/NM tags and'= 'bases fixmate fixmate information reheader replace BAM Header targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags markdup mark duplicates-File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact Fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA-Statistics bedcov read depth per BED region depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck)-Viewing flags explain BAM flags tview text alignment viewer view SAMBAMCRAM conversion depad convert padded BAM to unpadded BAM

As we can see from above, roughly I have five types of command blocks: Indexing,Editing,File operations,Statistics,Viewing. Let's take a look at a few commonly used commands.

1.view

The main functions of the view command are: interchange the sam file with the bam file; then perform various operations on the bam file, such as sorting (sort) and extraction of data (these operations are performed on the bam file, so when the input is sam file, this operation cannot be carried out); finally, the sorted or extracted data is output to bam or sam (default) format.

Advantages of bam files: bam files are binary files and occupy less disk space than sam text files; the operation speed of using bam binary files is fast.

In the view command, the input (- t or-T) and output (- h) to the sam file header (sequence ID) are controlled by separate parameters.

Usage: samtools view [options] | [region1 [...]]

Some of the parameters of the following view command

If region is not added by default, all region.options is output:

-b output BAM # this parameter sets the output BAM format. By default, the output is in SAM format. By default, h print header for the SAM output # exports sam format files without header. This parameter is set to export sam files with header information-H print SAM header only (no alignments) # only the header file of the output file-S input is SAM # input is a BAM file by default, if the input is a SAM file, it is best to add this parameter Otherwise, sometimes the wrong report will be made. -u uncompressed BAM output (force-b) # the use of this parameter requires the-b parameter, which saves time, but requires more disk space. -c print only the count of matching records# outputs only matching statistical records-L FILE only include reads overlapping this BED FILE [null] # includes only the name of the reads-o FILE output file name [stdout] # output file where overlap exists with the bed file-F INT only include reads with none of the FLAGS in INT present [0] # filter flag, and only outputs the lowest quality value of the specified flag value sequence-q INT only include reads with mapping quality > = INT [0] # alignment It is generally believed that 20 is a unique comparison, and you can use the above-bF parameter to extract a specific comparison result-@ Number of additional threads to use [0] # refers to the number of threads used

Here are a few examples. If you want more intuitive results, you can use the test data that comes with the software in the example folder.

# convert sam file to bam file samtools view-bS abc.sam > abc.bam# BAM to SAMsamtools view-h-o out.sam out.bam# extract alignment to reference sequence samtools view-bF 4 abc.bam > abc.F.bam# extract the alignment result of both reads in paired reads to reference sequence Samtools view-bF 12 abc.bam > abc.F12.bam# can extract the alignment result without alignment to the reference sequence samtools view-bf 4 abc.bam > abc.f.bam# to extract the alignment result from the bam file to caffold1. And save it to the sam file format samtools view abc.bam scaffold1 > scaffold1.sam# extraction scaffold1 to compare the comparison results to the 30k to 100k regions samtools view abc.bam scaffold1:30000-100000$ gt According to fasta file, scaffold1_30k-100k.sam# adds header to sam or bam file samtools view-T genome.fasta-h scaffold1.sam > scaffold1.h.sam2. Sort

Sort sorts the bam files.

Usage: samtools sort [option]-o-n Sort by read name# sets the sort method to sort by the ID of short reads. By default, it is sorted by the order of the sequence in the fasta file (that is, header) and by the point where the sequence is from left to right. -m INT Set maximum memory per thread; suffix K/M/G recognized [768m] # sets the maximum memory for each thread, which can be K/M/G. The default is 768m. When dealing with big data, if there is enough memory, set a larger value to save time. -t TAG Sort by value of TAG. Uses position as secondary index (or read name if-n is set) # sort by tag value-o FILE Write final output to FILE rather than standard output # output to a file with the file name added

Example:

# tmp.bam sorts by sequence position and outputs the results to tmp.sort.bam samtools sort-n tmp.bam-o tmp.sort.bam samtools view tmp.sort.bam3.merge and cat

Merge merges multiple sort bam files into a single bam file. The fused files are already sort if they are not needed. The cat command does not need to sort the bam file.

Usage: samtools merge [- nurlf] [- h inh.sam] [- b] [...] Options:-n Input files are sorted by read name# input file is through sort-n-t TAG Input files are sorted by TAG value# input file is through sort-t-r Attach RG tag (inferred from file names) # add RG tag-u Uncompressed BAM output # output uncompressed bam-f Overwrite the output BAM if exist# to overwrite existing bam-1 Compress level compression-l INT Compression level From 0 to 9 [- 1] # specify compression multiple-R STR Merge file in the specified region STR [all]-h FILE Copy the header in FILE to [in1.bam] $samtools catUsage: samtools cat [options] [...] Samtools cat [options] [...] Options:-b FILE list of input BAM/CRAM file names, one per line-h FILE copy the header from FILE [default is 1st input file]-o FILE output BAM/CRAM4.index

The sorted sequence is indexed and output as a bai file for fast random processing. In many cases, especially when you need to display alignment sequences, bai files are essential, such as the tview command that follows.

Usage: samtools index [out.index] samtools index abc.sort.bam5. Faidx

Index the fasta file, and the resulting index file ends with the .fai suffix. This command can also quickly extract a sequence from a fasta file according to the index file.

Usage: samtools faidx [[...]] samtools faidx [[...]]

Such as indexing genome files

Samtools faidx genome.fasta# generates the index file genome.fasta.fai, which is a text file divided into five columns. The first column is the name of the subsequence; the second column is the length of the subsequence; I think "the third column is where the sequence is located" because the number grows from top to bottom, and the last number is the size of the genome.fasta file; columns 4 and 5 mean something. Therefore, through this file, we can locate the location of the subsequence on the disk of the fasta file, and quickly call out the subsequence directly. # because of the index file, the subsequence samtools faidx genome.fasta scffold_10 > scaffold_10.fasta6 in fasta format can be quickly extracted from the genome using the following command. Tview

Tview can visually show how reads compares genomes, similar to genomic browsers. Generally speaking, IGV is better for visualization, but tview is not recommended.

Usage: samtools tview [ref.fasta]

When the reference genome is given, the sequence of the reference genome is displayed in the first row, otherwise, the first row is all represented by N.

Press g, and you are prompted to reach a certain point in the genome. The example "scaffold_10:1000" indicates reaching the 1000th base point of scaffold 10.

Use H (left) J (top) K (bottom) L (right) to move the display interface. Uppercase letters move fast and lowercase letters move slowly.

Use the space Jian to move quickly to the left (similar to L) and use the Backspace key to move quickly to the left (similar to H).

Ctrl+H moves 1kb base distance to the left; Ctrl+L moves 1kb base distance to the right

Color can be used to label the comparison mass, base mass, nucleotides and so on. The base mass or comparison mass of 3040 is expressed in white.

20: 30 yellow; 10: 20 green; 0: 10 blue.

Use the period'.' Toggle the display of bases and dots; use r to switch the display of read name, etc.

There are many other instructions for use, press specifically? Key to check.

7. Flagstat

The comparison results of BAM files are given, and the comparison statistical results are output. There are no parameters other than the-@ parameter that specifies the thread

Usage: samtools flagstat [options] samtools flagstat tmp.bam 20000 + 0 in total (QC-passed reads + QC-failed reads) # Total number of reads 0 + 0 secondary0 + 0 supplementary0 + 0 duplicates18995 + 0 mapped (94.98%: Nsupplementary0 A) # overall matching rate of reads 20000 + 0 paired in sequencing# how many reads belong to the number of reads in paired reads10000 + 0 read1# reads1 10000 + 0 read2# reads2 the number of reads 18332 + 0 properly paired (91.66%: Nash A) ) # number and proportion of perfectly matched reads: aligned to the same reference sequence And the distance between the two reads accords with the threshold of 0 + 0 with itself and mate mapped# paired reads, in which both of them are compared to the number of reads on the reference sequence 579 + 0 singletons (2.90%: reads A) # the number of reads matched to the reference sequence alone, and the sum of the previous one is the total number of reads on the match. The number of reads in 0 + 0 with mate mapped to a different chr# paired reads compared to two different reference sequences is 0 + 0 with mate mapped to a different chr (mapQ > = 5) # same as the previous one, except that the number of reads with alignment quality > = 5 is 8. Depth

The sequencing depth of each base site was obtained and output to the standard output. The imported bam file must be samtools index first

Usage: samtools depth [- r reg] [- Q baseQthres] [- Q mapQthres] [- b in.bed] [...]-r region# followed by chromosome number (region)-an output all positions (including zero depth) # enter sequences at all locations, including-q base quality threshold [0] # base quality threshold-Q mapping quality threshold [0] # alignment quality threshold

For example:

Samtools depth tmp.index.bam > tmp.depth.bam9. Other useful commands

Reheader replaces the header of the bam file

$samtools reheader

Idxstats counts a table with 4 columns, namely "sequence name, sequence length, reads number on alignment, unmapped reads number". The fourth column should be the number of reads that one end of the paired reads can match to that scaffold and the other end does not match to any scaffolds.

Samtools idxstats 10. Convert bam files to fastq files

Sometimes, we need to extract the reads that is aligned to a reference sequence and analyze it in a small range to facilitate debug and so on. At this point, you need to convert the bam or sam file to fastq format.

The website provides a program for converting bam to fastq: http://www.hudsonalpha.org/gsl/information/software/bam2fastq

Wget http://www.hudsonalpha.org/gsl/static/software/bam2fastq-1.1.0.tgztar zxf bam2fastq-1.1.0.tgzcd bam2fastq-1.1.0make./bam2fastq 11. Mpileup

Samtools also has a very important command mpileup, which used to be pileup. This command is used to generate bcf files, and then use bcftools for SNP and Indel analysis. Bcftools is the software included with samtool and can be found in the installation folder of samtools.

Usage:

Usage: samtools mpileup [- EBug] [- C capQcoef] [- r reg] [- f in.fa] [- l list] [- M capMapQ] [- Q minBaseQ] [- Q minMapQ] in.bam [in2.bam [...]]

The most commonly used parameters are 2:

-f to enter fasta reference sequences with index files

-g output to bcf format. The usage and the simplest examples are as follows

Samtools mpileup-f genome.fasta abc.bam > abc.txtsamtools mpileup-gSDf genome.fasta abc.bam > abc.bcfsamtools mpileup-guSDf genome.fasta abc.bam | bcftools view-cvNg-> abc.vcf

When mpileup does not use the-u or-g arguments, a binary bcf file is not generated, but a text file (output to standard output) is generated. The text file counts the alignment of each base site in the reference sequence; each line of the file represents the alignment result of a base site in the reference sequence. For example:

Scaffold_1 2841 A 11,...,.... BHIGDGIJ?FFscaffold_1 2842 C 12, $,... ^ I. CFGEGEGGCFF+scaffold_1 2843 G 11,... FDDDDCD?DD+scaffold_1 2844 G 11,...,. FA?AAAA

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report