2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains what MarkDuplicates does and why it is needed.
Mark Duplicates is one of the most important steps in data preprocessing. As the name suggests, it marks duplicate reads. How do duplicates arise, and why mark them?
Duplicates arise in two ways.
PCR duplicates
PCR amplifies each template into multiple copies, and the copies derived from the same template are PCR duplicates of one another.
Optical duplicates
The basic unit of an Illumina sequencer is the flowcell, on which the sequencing reactions take place. The high cluster density of the flowcell greatly increases sequencing throughput, but it also means that the same fragment is occasionally read more than once, for example when a single cluster is called as two adjacent clusters. The percentage is very low, but it still needs to be taken into account.
The GATK team has published statistics on PCR duplicates and optical duplicates. The proportion of PCR duplicates rises as the amount of sequencing increases, whereas optical duplicates occur randomly, are always present, and make up a fairly stable proportion that fluctuates within a limited range, as expected of a systematic error.
Duplicates are marked for the benefit of downstream SNP analysis. SNP calling can be understood, in simple terms, as a probability problem. Consider the following two cases:
Case A
100 reads cover a site where the reference base is A; 99 of them show A and 1 shows C.
Case B
100 reads cover a site where the reference base is T; 54 of them show T and 46 show C.
In both cases two different bases were observed. Does that mean two SNP sites were detected?
No. In case A the C base accounts for only 1% of the reads, which most likely reflects a sequencing error rather than a SNP. Case B can only be treated as a candidate SNP on the basis of the read distribution; other factors must be analyzed before it can be confirmed. Either way, read counts are central to SNP detection.
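The arithmetic behind the two cases can be sketched in a few lines of Python. This is only an illustration: the function names and the MIN_ALT_FRACTION cutoff are hypothetical and are not how GATK calls variants, which relies on a statistical model rather than a fixed threshold.

```python
def allele_fractions(counts):
    """Return each base's share of the reads covering one site."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

# Read counts taken from case A and case B in the text.
case_a = allele_fractions({"A": 99, "C": 1})
case_b = allele_fractions({"T": 54, "C": 46})

# Hypothetical cutoff, chosen only for this illustration.
MIN_ALT_FRACTION = 0.2

def is_candidate(fractions, ref_base):
    """True if any non-reference base reaches the illustrative cutoff."""
    return any(f >= MIN_ALT_FRACTION
               for base, f in fractions.items() if base != ref_base)

print(case_a["C"], is_candidate(case_a, "A"))
print(case_b["C"], is_candidate(case_b, "T"))
```

With these counts, C makes up 1% of the reads in case A and 46% in case B, so only case B passes the illustrative cutoff.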
The reads counted here must be valid reads, that is, reads representing molecules actually present in the sample. A set of duplicates should be counted only once. MarkDuplicates tags the duplicates, and downstream programs then recognize them automatically by that tag.
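A small sketch of why counting duplicates only once matters. The reads below are invented for illustration: one read carrying a C is assumed to have been amplified into four copies, three of which are tagged as duplicates.

```python
from collections import Counter

# Each tuple is (observed_base, is_duplicate). Invented data for illustration.
reads = [("T", False)] * 6 + [("C", False)] + [("C", True)] * 3

# Counting every read lets the amplified copies inflate the C count.
raw = Counter(base for base, _ in reads)

# Counting only reads not tagged as duplicates gives the valid read counts.
valid = Counter(base for base, dup in reads if not dup)

print("raw:", dict(raw))
print("valid:", dict(valid))
```

In the raw counts C appears in 4 of 10 reads, which looks like a plausible variant; among the valid reads it appears in only 1 of 7.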
A read can be judged a duplicate by either of two criteria:
The sequences are identical.
The reads align to the same start position on the genome.
Treating identical sequences as duplicates is mostly safe. Homologous regions and genomic repeats can produce identical reads from different molecules, but this is rare enough to be ignored in practice. Sequence identity alone is not a reliable criterion, however. Because of sequencing errors, copies of the same molecule may differ at a few bases, and quality trimming may remove a different number of low-quality bases from the end of each copy. The final decision is therefore based on the alignment to the genome: reads that align to the same start position are considered duplicates.
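The position-based criterion can be sketched as grouping reads by an alignment key. This is a simplified illustration, not the MarkDuplicates implementation: the read records are hypothetical, and the key used here (chromosome, start, strand) is an assumption that leaves out details such as soft-clipping and mate positions, which the real tool also considers.

```python
from collections import defaultdict

# Hypothetical minimal read records: (name, chromosome, start, strand).
reads = [
    ("r1", "chr1", 1000, "+"),
    ("r2", "chr1", 1000, "+"),
    ("r3", "chr1", 1000, "-"),
    ("r4", "chr1", 1042, "+"),
    ("r5", "chr1", 1000, "+"),
]

def duplicate_sets(reads):
    """Group reads by alignment key and return groups with more than one read."""
    groups = defaultdict(list)
    for name, chrom, start, strand in reads:
        groups[(chrom, start, strand)].append(name)
    return [names for names in groups.values() if len(names) > 1]

print(duplicate_sets(reads))
```

Here r1, r2 and r5 share a chromosome, start position and strand, so they form one duplicate set; r3 and r4 stand alone.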
The GATK4 command for marking duplicates is as follows:
soft/gatk-4.0.4.0/gatk MarkDuplicates -I input.bam -M metrc.csv -O marked.bam
In the output BAM file, duplicates are marked through the FLAG field in the second column. The FLAG value is the sum of several bit flags, and the 1024 bit marks a read as a duplicate:
samtools flags 1024
0x400 1024 DUP
In the generated BAM file, the FLAG value therefore tells you whether a read is a duplicate.
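Checking the duplicate bit amounts to a bitwise AND against 1024 (0x400). A minimal sketch follows; the function name and the example FLAG values are chosen only for illustration.

```python
DUP = 0x400  # 1024, the duplicate bit in the SAM FLAG field

def is_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUP)

print(is_duplicate(1107))  # 1107 = 1024 + 83, so the duplicate bit is set
print(is_duplicate(83))    # same flags without the duplicate bit
```

Because the FLAG is a sum of bit flags, a duplicate read rarely has the value 1024 exactly, so a bitwise test is needed rather than an equality check.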
Because the FLAG already records which reads are duplicates, this is sufficient for downstream GATK analysis. Sometimes duplicates are removed outright instead. In that case, one read from each set of duplicates, the one with the largest sum of base quality values, is kept as the representative.
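Selecting the representative read can be sketched as taking the maximum over summed base qualities. The read names and quality values below are invented, and the plain sum is a simplification; the actual scoring in MarkDuplicates may differ in detail.

```python
def pick_representative(duplicate_set):
    """duplicate_set maps read name -> list of per-base Phred quality values."""
    return max(duplicate_set, key=lambda name: sum(duplicate_set[name]))

# Invented quality values for one set of duplicates.
dups = {
    "r1": [30, 30, 20, 10],
    "r2": [35, 35, 30, 30],
    "r5": [30, 2, 2, 2],
}

best = pick_representative(dups)
flagged = sorted(name for name in dups if name != best)

print("keep:", best)
print("mark as duplicate:", flagged)
```

In this example r2 has the largest quality sum (130), so it is kept and r1 and r5 are the reads that would be marked or removed.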
That covers the role of MarkDuplicates. How it behaves on a specific dataset still needs to be verified in practice.