How to convert a GFF file to a GTF file

2025-01-17 Update From: SLTechnology News&Howtos (shulou)

This article explains how to convert a GFF file into a GTF file, looks at what the ready-made tools produce, and describes what a hand-written conversion script needs to keep.

Both the GTF and GFF3 formats can store the structural information of genes and transcripts. In real analyses you often need to convert between the two. For example, NCBI only provides annotation downloads in GFF format, and we need to convert them to GTF before using them.

To accomplish this, you can write your own script or use a ready-made tool. Below we look at the usage and characteristics of each tool, testing with a GFF file from NCBI. The link is as follows

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gff.gz

1. gffread

gffread is a tool from the Cufflinks development team for reading GFF files. It can convert a GFF file to a GTF file.

gffread -T GCF_000001405.38_GRCh38.p12_genomic.gff -o hg38.gtf

An example line from the generated GTF is as follows

NC_000001.11	BestRefSeq	exon	11874	12227	.	+	.	transcript_id "rna0"; gene_id "gene0"; gene_name "DDX11L1"

The GTF file generated by gffread contains only two feature types, exon and CDS, and the ninth column carries only the transcript_id, gene_id and gene_name attributes. More importantly, these IDs (rna0, gene0) are arbitrary internal identifiers with no meaning of their own. What we usually want are the Entrez ID of the gene and the RefSeq accession of the transcript.
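To check which attributes a converter actually wrote, the ninth column can be split into key/value pairs. The following is a minimal sketch, assuming the usual GTF layout of tab-separated columns and attributes written as key "value"; the function name parse_gtf_attributes is a hypothetical helper and not part of gffread.

```python
import re

# Matches attribute pairs such as: gene_id "gene0";
_ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_attributes(column9):
    """Return the key/value pairs of a GTF attribute column as a dict."""
    return dict(_ATTR_RE.findall(column9))


# The example line from the gffread output, with tab-separated columns.
line = (
    "NC_000001.11\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\t"
    'transcript_id "rna0"; gene_id "gene0"; gene_name "DDX11L1"'
)
fields = line.split("\t")
attrs = parse_gtf_attributes(fields[8])
print(fields[2], sorted(attrs))
```

Here the attribute keys come back as gene_id, gene_name and transcript_id, which matches the description above.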

2. UCSC

UCSC uses the genePred format to store the structural information of genes and transcripts. With the UCSC command-line utilities, we can convert GFF to GTF by going through genePred as an intermediate format. The usage is as follows

gff3ToGenePred GCF_000001405.38_GRCh38.p12_genomic.gff hg38.GenePred

genePredToGtf file hg38.GenePred hg38.gtf

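First convert to genePred format using gff3ToGenePred, and then convert to GTF format using genePredToGtf. The first argument of genePredToGtf is normally a database name; passing file tells it to read the genePred data from a file instead.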

An example line from the generated GTF is as follows

NC_012920.1	hg38.GenePred	transcript	15956	16023	.	-	.	gene_id "gene60958"; transcript_id "rna171196"; gene_name "gene60958"

The UCSC tools output more feature types than gffread, including the following

1. exon

2. CDS

3. start_codon

4. stop_codon

5. transcript

Although more feature types are reported, the attributes have the same problem as with gffread. In addition, the gene_name attribute carries no useful value, because it simply repeats the gene ID (gene60958 in the example above).
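A quick way to compare what the tools emit is to count the values in the third column of each output file. This is a small sketch, assuming tab-separated GTF input; count_feature_types is a hypothetical helper name, and the demo records are shortened stand-ins rather than real output.

```python
from collections import Counter


def count_feature_types(lines):
    """Count how often each feature type (column 3) occurs in GTF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


# Shortened demo records; a real run would pass an open file handle instead,
# for example: count_feature_types(open("hg38.gtf"))
demo = [
    '# demo\n',
    'chrA\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'chrA\tsrc\texon\t1\t40\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'chrA\tsrc\texon\t60\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
]
feature_counts = count_feature_types(demo)
print(dict(feature_counts))
```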

There are also many scripts written by other people online, but most of them have problems of one kind or another. The best solution is to write the conversion yourself. First, we need to decide which feature types we need in the GTF file.

In practice, exon intervals are enough to distinguish different transcripts, and quantification only needs the positions of the exons. So if you write your own conversion script, you only need to keep the exon records.
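To illustrate why exon records are sufficient, the sketch below groups exon rows by transcript_id and sums their lengths, which is the kind of per-transcript information a quantification step relies on. It assumes tab-separated GTF lines with 1-based inclusive coordinates; the function name and the demo rows are hypothetical (only the first exon is taken from the gffread example earlier, the other rows are made up).

```python
import re
from collections import defaultdict

_ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def exon_lengths_by_transcript(gtf_lines):
    """Sum exon lengths per transcript_id, ignoring every non-exon row."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        transcript_id = dict(_ATTR_RE.findall(fields[8])).get("transcript_id")
        if transcript_id is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths[transcript_id] += end - start + 1  # coordinates are inclusive
    return dict(lengths)


demo = [
    'NC_000001.11\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\ttranscript_id "rna0"; gene_id "gene0";',
    # The rows below are invented for the demo.
    'NC_000001.11\tBestRefSeq\texon\t12613\t12721\t.\t+\t.\ttranscript_id "rna0"; gene_id "gene0";',
    'NC_000001.11\tBestRefSeq\tCDS\t12613\t12700\t.\t+\t0\ttranscript_id "rna0"; gene_id "gene0";',
    'NC_000001.11\tBestRefSeq\texon\t500\t599\t.\t-\t.\ttranscript_id "rnaX"; gene_id "geneX";',
]
lengths = exon_lengths_by_transcript(demo)
print(lengths)
```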

The other question is which attributes the ninth column should provide. In my experience, only the following six attributes are needed

gene_id

gene_name

transcript_id

transcript_name

gene_type

transcript_type

gene_id stores the gene ID from whichever database you use, such as the NCBI Entrez ID or the Ensembl gene ID. It can also simply be the same as gene_name. The gene_name attribute stores the gene symbol, which appears in publications more often than the ID.

transcript_id and transcript_name hold the ID and name of the transcript, which can be a RefSeq ID or an Ensembl transcript ID, and are used to distinguish different transcripts.

gene_type and transcript_type give the type of the gene and of the transcript, such as protein_coding, lncRNA or rRNA. In an analysis we usually select a subset of transcripts by type, for example analyzing only the protein-coding transcripts.

These six attributes cover almost every scenario. For files from different databases, you only need to write a script that extracts this information.
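The approach described above can be sketched in Python as follows. This is a minimal, non-authoritative sketch and not a complete converter. It assumes NCBI-style GFF3 conventions: gene rows carry ID, Name, gene_biotype and a Dbxref entry of the form GeneID:..., transcript rows carry ID, Parent, Name and transcript_id, exon rows point to their transcript through Parent, and parents appear before their children. All function names are hypothetical, the demo records are hand-written in that style, and transcript_type is simply taken from the GFF3 feature type of the transcript row.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def entrez_id(gene_attrs):
    """Return the GeneID entry of a Dbxref attribute, or None if absent."""
    for xref in gene_attrs.get("Dbxref", "").split(","):
        if xref.startswith("GeneID:"):
            return xref.split(":", 1)[1]
    return None


def gff3_to_gtf(lines):
    """Yield exon-only GTF lines carrying the six attributes discussed above."""
    genes = {}        # gene ID -> attributes of the gene row
    transcripts = {}  # transcript ID -> (attributes, feature type, gene ID)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        feature = fields[2]
        attrs = parse_gff3_attributes(fields[8])
        parent = attrs.get("Parent")
        if feature != "exon":
            if parent in genes and "ID" in attrs:
                transcripts[attrs["ID"]] = (attrs, feature, parent)
            elif parent is None and "ID" in attrs:
                genes[attrs["ID"]] = attrs  # top-level record, e.g. a gene
            continue
        for tx_id in (parent or "").split(","):
            if tx_id not in transcripts:
                continue  # exons attached directly to a gene are skipped here
            tx_attrs, tx_type, gene_id = transcripts[tx_id]
            gene_attrs = genes[gene_id]
            pairs = [
                ("gene_id", entrez_id(gene_attrs) or gene_id),
                ("gene_name", gene_attrs.get("Name", gene_id)),
                ("transcript_id", tx_attrs.get("transcript_id", tx_id)),
                ("transcript_name", tx_attrs.get("Name", tx_id)),
                ("gene_type", gene_attrs.get("gene_biotype", "unknown")),
                ("transcript_type", tx_type),
            ]
            column9 = " ".join(f'{key} "{value}";' for key, value in pairs)
            yield "\t".join(fields[:8] + [column9])


# Hand-written demo records in NCBI style; the values are illustrative.
demo_gff3 = [
    "##gff-version 3",
    "NC_000001.11\tBestRefSeq\tgene\t11874\t14409\t.\t+\t.\t"
    "ID=gene0;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;"
    "gene_biotype=transcribed_pseudogene",
    "NC_000001.11\tBestRefSeq\ttranscript\t11874\t14409\t.\t+\t.\t"
    "ID=rna0;Parent=gene0;Name=NR_046018.2;transcript_id=NR_046018.2",
    "NC_000001.11\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tID=id1;Parent=rna0",
]
gtf_lines = list(gff3_to_gtf(demo_gff3))
for gtf_line in gtf_lines:
    print(gtf_line)
```

For the real download, the same generator could be fed from gzip.open(path, "rt") and its output written to a file. Files from other databases would need different attribute names in the lookup step, which is the part a custom script has to adapt.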
