What are the basics of the VCF file format, and how can VCF files be edited? 02/09 Update SLTechnology News&Howtos

2026-02-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces the basics of the VCF file format and some common ways of processing VCF files. Many people run into questions about both in day-to-day work, so the editor consulted a range of material and put together a simple, practical set of operations that should help answer them. Follow along below.

1. Basics of the VCF file:

Introduction to the VCF file:

Anyone who has worked on DNA resequencing, population genetics and evolution, BSA, GWAS, or similar projects will have encountered VCF files. A VCF file records the variants (mainly SNPs and InDels) found at every variable position in the genomes of all samples. Almost all downstream analyses are based on this file, such as phylogenetic tree analysis, population structure analysis, PCA, and GWAS association analysis.

It is therefore important to understand the VCF file format and what each recorded value means. VCF files are plain text files and can be opened on Windows with a text editor such as EditPlus. Because VCF files are often very large (commonly more than 1 GB), opening one directly on Windows can consume a lot of memory and make the editor freeze. PilotEdit is recommended here if you want to open one smoothly.

Here is a partial example of a typical VCF file:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF  ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G    A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T    A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A    G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T    .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

A VCF file begins with file-level comment (meta-information) lines. These start with ## and are followed by keywords such as FILTER, INFO, and FORMAT.

For example: lines beginning with ##FILTER describe the abbreviations used in column 7 (FILTER) of the VCF file, such as q10, which means "Quality below 10"; lines beginning with ##INFO describe the abbreviations used in column 8 (INFO), such as AF, which stands for Allele Frequency; lines beginning with ##FORMAT describe the abbreviations used in column 9 (FORMAT); and other lines carry general information, such as the file version, "fileformat=VCFv4.0".
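As a rough illustration of how these comment lines can be read by a program, here is a minimal Python sketch. It assumes the header lines follow the ##KEY=<ID=...,Description="..."> form used by the VCF specification; the function name parse_meta_line and the three sample lines are chosen for illustration only and are not part of any library.

```python
import re

# Structured meta lines look like ##INFO=<ID=AF,...>; plain ones like ##fileformat=VCFv4.0
META_RE = re.compile(r'^##(?P<kind>\w+)=<(?P<body>.*)>$')
# key=value pairs inside the angle brackets; a quoted value may contain commas
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return (kind, dict) for a structured ## line, or (key, value) for a plain one."""
    line = line.rstrip("\n")
    match = META_RE.match(line)
    if match:
        fields = {key: value.strip('"')
                  for key, value in PAIR_RE.findall(match.group("body"))}
        return match.group("kind"), fields
    key, _, value = line[2:].partition("=")
    return key, value


# Illustrative header lines (assumed to be in specification form)
header = [
    '##fileformat=VCFv4.0',
    '##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">',
    '##FILTER=<ID=q10,Description="Quality below 10">',
]
parsed = [parse_meta_line(line) for line in header]
for kind, value in parsed:
    print(kind, value)
```

The quoted-value branch of the pattern is what keeps a description such as "dbSNP membership, build 129" in one piece instead of splitting it at the comma.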

Description of the meaning of each column of VCF

Columns are separated by tabs. The first 9 columns are fixed; the sample columns begin at column 10, and there can be any number of them.

#CHROM

POS

ID

REF

ALT

QUAL

FILTER

INFO

FORMAT

The following columns are all sample genotype information columns.
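To make the column layout concrete, here is a small Python sketch that splits one data line into the nine fixed columns and the sample columns. The function name split_record is hypothetical, and the data line is built from values of the kind shown in the example file, joined with tabs.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT"]


def split_record(line):
    """Split a tab-separated data line into (fixed columns dict, sample columns list)."""
    parts = line.rstrip("\n").split("\t")
    fixed = dict(zip(FIXED_COLUMNS, parts[:9]))
    samples = parts[9:]  # column 10 onward, one entry per sample
    return fixed, samples


# Illustrative data line with two sample columns
line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5", "GT:GQ:DP", "0|0:48:1", "1|0:48:8",
])
fixed, samples = split_record(line)
print(fixed["CHROM"], fixed["POS"], fixed["REF"], fixed["ALT"], len(samples))
```

All values are kept as strings here; converting POS or QUAL to numbers is left to the caller.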

The details are as follows

CHROM: the chromosome name or number.

POS: the position of the variant on the chromosome.

ID: the dbSNP identifier of the SNP/InDel, usually beginning with rs. Generally, only the human genome has dbSNP identifiers.

REF: the base(s) in the reference genome; must be A, C, G, T, or N, all uppercase.

ALT: the variant base(s); must be A, C, G, T, or N, all uppercase, with multiple alleles separated by commas. "." means that no alternative allele is recorded at this site.

QUAL: the quality score of the variant call; the higher the value, the more reliable the call.

FILTER: the filter status of the site. The variants in a VCF file usually go through quality control to remove low-quality sites. If a site passes the filter criteria, this column is marked "PASS", indicating that the site is of high quality. After tagging, other tools can be used to select the records marked "PASS" for downstream analysis. If no filtering has been applied, the missing value "." is used instead.

INFO: an additional-information column, usually holding annotations in the form key=value, separated by ";". For example, DP=18 indicates that the sequencing depth at the site is 18X, and AF=0.1 indicates that the allele frequency is 0.1.

FORMAT: describes the fields in the sample columns from column 10 onward, as a list of abbreviations separated by ":". Different variant callers may emit different fields; taking GATK output as an example:

Column 10 and every column after it are sample genotype columns. The values in each are separated by ":" and correspond one-to-one to the fields listed in the FORMAT column.

GT: the genotype, with the two alleles usually separated by "/" or "|". "|" means the genotype is phased, that is, for a heterozygote it is known which allele comes from which chromosome. 0 represents the reference base, 1 represents the first allele in ALT (multiple alleles are separated by ","), 2 represents the second allele in ALT, and so on. For example, if REF is A and ALT is G,T, then genotype 0/1 is the heterozygote AG, 1/1 is a homozygous GG SNP, and 1/2 is the heterozygote GT. "./." indicates a missing genotype.

AD: the number of reads supporting each of the two alleles, separated by ",", that is, the depth of each allele.

DP: the total sequencing depth of the sample at the variant site, that is, the sum of the two AD values.

PL: the normalized likelihoods of the possible genotypes, usually three numbers separated by ",", in the order AA, AB, BB, where A is REF and B is ALT (that is, 0/0, 0/1, and 1/1). After normalization, the smaller the number, the more likely the genotype, so the genotype with the smallest value is taken as the most likely genotype of the sample.

GQ: the genotype quality, derived from PL; the higher the value, the more reliable the genotype. Because the smallest PL value after normalization is 0, GQ is taken as the second-smallest PL value, capped at 99, since GATK treats larger values as carrying no extra meaning. A second-smallest value above 99 generally means the genotype call is very reliable; only when it is below 99 is there reason to question the genotype.
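The fields described above can be pulled apart with a few lines of Python. This is only a sketch under stated assumptions: the helper names (parse_info, parse_sample, decode_genotype, gq_from_pl) are hypothetical, the sample values are made up for illustration, and the GQ rule is simply the "second-smallest PL, capped at 99" description given in the text.

```python
import re


def parse_info(info):
    """Turn 'DP=18;AF=0.1;DB' into {'DP': '18', 'AF': '0.1', 'DB': True}."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # a key without '=' is a flag
    return result


def parse_sample(format_field, sample_field):
    """Pair the FORMAT abbreviations with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def decode_genotype(gt, ref, alts):
    """Translate index genotypes such as '0/1' into bases, e.g. 'A/G'."""
    alleles = [ref] + alts  # index 0 is REF, 1 and up are the ALT alleles
    separator = "|" if "|" in gt else "/"
    bases = ["." if index == "." else alleles[int(index)]
             for index in re.split(r"[/|]", gt)]
    return separator.join(bases)


def gq_from_pl(pl):
    """Second-smallest PL value, capped at 99, as described in the text."""
    values = sorted(int(v) for v in pl.split(","))
    return min(values[1], 99)


# Hypothetical GATK-style sample column for a site with REF=A, ALT=G,T
info = parse_info("DP=18;AF=0.1;DB")
sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:12,9:21:99:255,0,320")
print(decode_genotype(sample["GT"], "A", ["G", "T"]))
print(gq_from_pl(sample["PL"]))
```

With REF A and ALT G,T, the genotypes 0/1, 1/1, and 1/2 decode to A/G, G/G, and G/T, matching the example in the GT description.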

At this point, you should have a preliminary understanding of the VCF file.

2. Introduction to VCF file processing tools:

Filtering SNP/InDel records under specific conditions is a common requirement in custom analyses, but VCF files are usually large and are best not processed on Windows. So how can VCF files be filtered and processed? A very useful VCF processing tool on Linux is vcftools; its manual is at http://vcftools.sourceforge.net/man_latest.html. The tool is small, fast, and feature-rich. It can filter variants by site, position, and other criteria; it can compare the variants in two VCF files; and it can split VCF files, convert formats, apply quality filters, and more. It is highly recommended. Here are some of the command options I have used at work.

Filter by variant type

A VCF file may contain both SNP and InDel variants, and vcftools can quickly separate the two.

How to use it:

To filter out InDels and keep only SNPs, use the option --remove-indels.

Execute the following command:

vcftools --remove-indels --recode --recode-INFO-all --vcf raw.vcf --stdout > raw.snp.vcf

To filter out SNPs and keep only InDels, use the option --keep-only-indels.

Execute the following command:

vcftools --keep-only-indels --recode --recode-INFO-all --vcf raw.vcf --stdout > raw.indel.vcf

This gives two VCF files, one containing only SNPs and one containing only InDels.
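For readers who want to see what the split amounts to, here is a minimal Python sketch. It assumes that a site counts as an InDel when any ALT allele differs in length from REF; this is a simplification for illustration, not vcftools' actual code, and it does not handle symbolic alleles such as <DEL>. The function names and the records are hypothetical.

```python
def is_indel(ref, alt_field):
    """True if any ALT allele has a different length from REF."""
    alts = [allele for allele in alt_field.split(",") if allele != "."]
    return any(len(allele) != len(ref) for allele in alts)


def split_by_type(lines):
    """Return (snp_lines, indel_lines); header lines are copied to both lists."""
    snps, indels = [], []
    for line in lines:
        if line.startswith("#"):
            snps.append(line)
            indels.append(line)
            continue
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]
        (indels if is_indel(ref, alt) else snps).append(line)
    return snps, indels


# Hypothetical records, reduced to the first five columns
lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "20\t14370\t.\tG\tA",
    "20\t1110696\t.\tA\tG,T",
    "20\t1234567\t.\tGTCT\tG,GTACT",
]
snps, indels = split_by_type(lines)
print(len(snps) - 1, "SNP sites;", len(indels) - 1, "InDel sites")
```

A site whose ALT is "." has no alternative allele, so under this rule it ends up in the SNP list.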

Select variant sites in a specified region

vcftools can also pick out the variants in a given region of the genome.

How to use it:

vcftools --vcf Variants.snp.unknown_multianno.vcf --chr A03 --from-bp 577700 --to-bp 607700 --out out_prefix --recode --recode-INFO-all

Here is an explanation of the parameters:

--vcf: followed by the VCF file

--chr: followed by the chromosome on which the region is located

--from-bp: followed by the start position of the region

--to-bp: followed by the end position of the region

--out: prefix of the output file

--recode: without this option, no VCF output is written
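The same region selection can be sketched in Python. This assumes both bounds are inclusive and that header lines are always kept; the function names and the positions are illustrative only, with the chromosome and bounds taken from the command above.

```python
def in_region(line, chrom, start, end):
    """True if the record lies on chrom with start <= POS <= end (inclusive)."""
    fields = line.split("\t", 2)
    return fields[0] == chrom and start <= int(fields[1]) <= end


def filter_region(lines, chrom, start, end):
    """Keep header lines and the records that fall inside the region."""
    return [line for line in lines
            if line.startswith("#") or in_region(line, chrom, start, end)]


# Hypothetical records, reduced to the first three columns
lines = [
    "#CHROM\tPOS\tID",
    "A03\t577699\t.",
    "A03\t577700\t.",
    "A03\t607700\t.",
    "A03\t607701\t.",
    "A04\t600000\t.",
]
kept = filter_region(lines, "A03", 577700, 607700)
print(len(kept) - 1, "records kept")
```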

Filter variant sites by missing rate

Many SNPs in a VCF file are missing in some samples, that is, the genotype is "./.". If the missing rate is high, the SNP site cannot be used in many analyses and needs to be removed. The option used here is --max-missing.

How to use it:

vcftools --vcf snp.vcf --recode --recode-INFO-all --stdout --max-missing 1 > snp.new.vcf

--max-missing takes a value between 0 and 1: 1 means no missing data is allowed, and 0 means a site may be missing in every sample.
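Note that the threshold works on the proportion of samples that do have data, which is easy to get backwards. The Python sketch below shows one reading of it: a site is kept when its fraction of non-missing genotypes is at least the threshold. It assumes GT is the first field of each sample column and treats any genotype containing "." as missing; the function names and the record are hypothetical, and vcftools' own counting may differ in detail.

```python
def call_rate(sample_fields):
    """Fraction of samples whose genotype (first ':' field) is not missing."""
    genotypes = [field.split(":")[0] for field in sample_fields]
    called = sum(1 for gt in genotypes if "." not in gt)
    return called / len(genotypes)


def passes_max_missing(line, threshold):
    """Keep the site if the fraction of called genotypes is >= threshold."""
    samples = line.rstrip("\n").split("\t")[9:]
    return call_rate(samples) >= threshold


# Hypothetical record with four samples, one of them missing
line = "\t".join([
    "chr01", "194921", ".", "A", "G", "50", "PASS", ".", "GT",
    "0/1", "./.", "1/1", "0/0",
])
print(call_rate(line.split("\t")[9:]))
print(passes_max_missing(line, 0.5), passes_max_missing(line, 1))
```

With a threshold of 1 the site above is dropped because one of the four genotypes is missing; with 0 every site is kept.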

Calculate the SNP missing rate

vcftools has two options for calculating missing rates in a VCF file.

They are:

--missing-indv: generates a file with the suffix ".imiss" that reports the missingness of each sample.

--missing-site: generates a file with the suffix ".lmiss" that reports the missingness of each SNP site.

How to use it:

vcftools --vcf snp.vcf --missing-site

Running the above command generates an out.lmiss file in the current directory in the following format:

CHR   POS    N_DATA N_GENOTYPE_FILTERED N_MISS F_MISS
chr01 194921 988    0                   368    0.37247
chr01 384714 988    0                   204    0.206478
chr01 384719 988    0                   202    0.204453
chr01 518438 988    0                   488    0.493927
chr01 518473 988    0                   452    0.45749
chr01 518579 988    0                   418    0.423077
chr01 518635 988    0                   428    0.433198
chr01 680786 988    0                   346    0.350202
chr01 680834 988    0                   412    0.417004

The first two columns give the position of the SNP, the third is the total number of alleles, the fifth is the number of missing alleles, and the last is the missing rate.

vcftools --vcf snp.vcf --missing-indv

Running the above command generates an out.imiss file in the current directory in the following format:

INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
1    8747   0                    3632   0.415228
10   8747   0                    1264   0.144507
102  8747   0                    2016   0.230479
105  8747   0                    6322   0.722762
106  8747   0                    2365   0.270378
107  8747   0                    4376   0.500286
108  8747   0                    5682   0.649594
109  8747   0                    1877   0.214588
11   8747   0                    1039   0.118784

The first column is the sample name, the second is the total number of SNPs, the fourth is the number of missing genotypes, and the last is the missing rate.
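These report files are plain whitespace-separated tables, so they are easy to post-process. The Python sketch below reads three rows in the .imiss layout (values copied from the output above), checks that F_MISS equals N_MISS divided by N_DATA, and lists the samples above a missing-rate cut-off. The function name and the 0.5 cut-off are arbitrary choices for illustration.

```python
def read_missing_table(text):
    """Parse whitespace-separated .lmiss/.imiss text into a list of dicts."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


IMISS = """\
INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
1 8747 0 3632 0.415228
10 8747 0 1264 0.144507
105 8747 0 6322 0.722762
"""

rows = read_missing_table(IMISS)

# Samples missing at more than half of the sites (0.5 is an arbitrary cut-off)
high_missing = [row["INDV"] for row in rows if float(row["F_MISS"]) > 0.5]

# F_MISS should equal N_MISS / N_DATA up to rounding
consistent = all(
    abs(int(row["N_MISS"]) / int(row["N_DATA"]) - float(row["F_MISS"])) < 1e-5
    for row in rows
)
print(high_missing, consistent)
```

Samples flagged this way are candidates for removal before downstream analysis.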

Randomly select a specified number of samples

vcftools can randomly select a specified number of samples from a VCF file, using the option --max-indv.

How to use it:

To take 5 random samples, execute the following command:

vcftools --vcf snp.vcf --max-indv 5 --remove-indels --recode --out outfilename

That concludes this look at the basics of the VCF format and how to process VCF files. Theory and practice together are the best way to learn, so go and try it!
