Shulou (Shulou.com), SLTechnology News&Howtos. Updated 2025-04-06.
This article explains in detail how to compare VCF files.
If the two VCF files we want to compare were called against different reference genome versions, we first need to convert one of them to the other version with a tool such as CrossMap, and then compare them with the Concordance command of SnpSift. CrossMap depends on pyBigWig, which can be installed with conda; CrossMap itself is then installed with pip. The commands are as follows:
conda create -n py3 python=3.6
conda activate py3
conda install -c bioconda pyBigWig
pip3 install CrossMap
The command to convert the reference genome version is as follows:
# Requires the chain file hg19ToHg38.over.chain.gz and the reference genome Homo_sapiens_assembly38.fasta
python ~/miniconda3/envs/py3/bin/CrossMap.py \
vcf ~/data/liftover/hg19ToHg38.over.chain.gz test.snp.hg19.vcf \
~/data/Homo_sapiens_assembly38.fasta test.snp.hg38.vcf
Both the SNP and the indel VCF files can be converted this way. The resulting files are listed below:
1.3M Jul 8 05:16 test.indel.hg38.vcf
23K Jul 8 05:16 test.indel.hg38.vcf.unmap
1003K Jun 19 11:10 test.indel.vcf
13M Jul 8 05:18 test.snp.hg38.vcf
245K Jul 8 05:18 test.snp.hg38.vcf.unmap
13M Jun 19 18:29 test.snp.vcf
The conversion success rate is very high: the .unmap files, which hold the records that could not be converted, are small compared with the converted files. Some records always fail, because some genome segments were modified between the two reference versions.
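The success rate can be put into numbers by counting records rather than comparing file sizes. The following is a minimal sketch, not part of the original workflow: `count_records` is a hypothetical helper, the toy files stand in for `test.snp.hg38.vcf` and `test.snp.hg38.vcf.unmap`, and it assumes the .unmap file holds one failed record per line (header lines, if present, are skipped).

```shell
# Hypothetical helper: count non-header lines (records) in a VCF-like file.
count_records() {
    grep -vc '^#' "$1"
}

# Toy stand-ins for test.snp.hg38.vcf and test.snp.hg38.vcf.unmap
workdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\nchr2\t50\t.\tG\tA\n' > "$workdir/lifted.vcf"
printf 'chr1\t300\t.\tT\tC\n' > "$workdir/lifted.vcf.unmap"

mapped=$(count_records "$workdir/lifted.vcf")
unmapped=$(count_records "$workdir/lifted.vcf.unmap")

# Report the share of records that could not be converted
awk -v m="$mapped" -v u="$unmapped" \
    'BEGIN { printf "mapped=%d unmapped=%d (%.1f%% unmapped)\n", m, u, 100 * u / (m + u) }'
```

On real data, replace the two toy paths with the converted file and its .unmap companion.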
Interestingly, our VCF file was sorted strictly by genome coordinate before the conversion, but afterwards some of the coordinates are out of order. This is easy to understand, because different versions of the reference genome of the same species necessarily differ. For example:
chr1 119955031 . G A
chr1 148483282 rs7513869 C T
chr1 144995248 rs6600697 A G
chr1 144995236 rs6600696 A C
chr1 144995050 rs1884147 C T
chr1 144995033 rs1884146 A G
In other words, the move from hg19 to hg38 did not only extend the assembly with new sequence; it also corrected some segments that had previously been assembled in the wrong order.
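To see how many records ended up out of order, each position can be compared with the previous one on the same chromosome. This is a rough sketch under stated assumptions: `find_unsorted` is a hypothetical helper name, the input is tab-separated, and the toy records are taken from the excerpt above.

```shell
# Hypothetical helper: print records whose position is lower than the
# previous record on the same chromosome.
find_unsorted() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        $1 == chrom && $2 + 0 < prev + 0 {              # position went backwards
            print $1 "\t" $2 " follows " prev
        }
        { chrom = $1; prev = $2 }                       # remember the last record
    ' "$1"
}

# Toy input built from the first four records shown above
demo=$(mktemp)
printf 'chr1\t119955031\t.\tG\tA\nchr1\t148483282\trs7513869\tC\tT\nchr1\t144995248\trs6600697\tA\tG\nchr1\t144995236\trs6600696\tA\tC\n' > "$demo"

find_unsorted "$demo"
```

An empty result means the file is already sorted by position within each chromosome block.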
VCF files with out-of-order coordinates cause problems in many downstream analyses, so the following code filters and re-sorts the file.
input=test.snps.VQSR.vcf
# Keep PASS variants with depth > 20, drop contigs with "_" in the name,
# sort the records while leaving header lines on top, and drop 1/2 genotypes
cat $input | java -jar ~/biosoft/snpEff/SnpSift.jar filter "(DP > 20) & (FILTER = 'PASS')" | \
perl -alne '{print unless $F[0]=~/_/}' | \
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | \
grep -v '1/2' > test.filter.sort.vcf
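The header-preserving sort step in the pipeline above can be tried on a toy file to see what ordering it produces. The sketch below is an illustration, not the original command: `sort_vcf` is a hypothetical helper that reads a file twice instead of a stream, and it fixes the locale with `LC_ALL=C` so the result is reproducible.

```shell
# Hypothetical helper: print header lines first, then records sorted by
# chromosome name (as text) and position (as a number).
sort_vcf() {
    grep '^#' "$1"
    grep -v '^#' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n
}

# Toy VCF with records deliberately out of order
toy=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr2\t50\t.\tG\tA\nchr1\t200\t.\tC\tT\nchr10\t5\t.\tA\tC\nchr1\t100\t.\tA\tG\n' > "$toy"

sort_vcf "$toy"
```

Because the chromosome column is compared as text, chr10 sorts before chr2. Tools that expect records in the contig order of the reference may need a different sort.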
# Check how the variants are distributed across chromosomes:
cat test.filter.sort.vcf | grep -v '^#' | cut -f 1 | sort | uniq -c
# The clean VCF file can then be annotated
java -jar ~/biosoft/snpEff/snpEff.jar GRCh38.86 \
test.filter.sort.vcf > test.filter.sort.eff.vcf
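Once both files are on the same reference version and sorted, a quick site-level overlap can serve as a sanity check before running SnpSift Concordance. The following sketch is an assumption-laden illustration, not a substitute for that command: `site_keys` is a hypothetical helper, the two toy files are invented, and genotypes are ignored entirely.

```shell
# Hypothetical helper: emit one CHROM:POS:REF:ALT key per record, sorted.
site_keys() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# Two invented toy VCFs on the same reference version
dir=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\nchr2\t50\t.\tG\tA\n' > "$dir/a.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr2\t50\t.\tG\tC\n' > "$dir/b.vcf"

site_keys "$dir/a.vcf" > "$dir/a.keys"
site_keys "$dir/b.vcf" > "$dir/b.keys"

# comm needs sorted input; -12 keeps shared lines, -23 and -13 keep unique ones
shared=$(LC_ALL=C comm -12 "$dir/a.keys" "$dir/b.keys" | wc -l | tr -d ' ')
only_a=$(LC_ALL=C comm -23 "$dir/a.keys" "$dir/b.keys" | wc -l | tr -d ' ')
only_b=$(LC_ALL=C comm -13 "$dir/a.keys" "$dir/b.keys" | wc -l | tr -d ' ')
echo "shared=$shared only_a=$only_a only_b=$only_b"
```

A site counts as shared only if chromosome, position, REF and ALT all match, so the chr2 record above is reported as unique to each file.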