How to download the GTF file of a genome from UCSC

2025-04-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to download the GTF file of a genome from UCSC and walks through each approach in detail.

There are two ways to download a genome's GTF file from UCSC: through the Table Browser, or through the FTP service.

1. Table Browser

The Table Browser is a query and download portal that supports several output formats; downloading GTF files is only one of its functions. The URL is as follows

http://genome.ucsc.edu/cgi-bin/hgTables

The three drop-down menus in the first row (clade, genome, and assembly) determine the species and version. Clade groups species into the following categories

Mammal

Vertebrate

Deuterostome

Insect

Nematode

Viruses

Other

The clade list also shows that UCSC mainly hosts animal genomes. For plant genome files, look to other sources such as NCBI or Ensembl. Choosing a clade narrows the species list: genome selects the species, and assembly selects the genome version.

Group selects the category of annotation data. The following categories are available

Mapping and Sequencing

Genes and Gene Predictions

Phenotype and Literature

mRNA and EST

Expression

Regulation

Comparative Genomics

Variation

Repeats

All Tracks

All Tables

A GTF file holds the structural information of genes and transcripts, so select the second group, Genes and Gene Predictions. Track selects the corresponding database and version, usually NCBI RefSeq.

Table selects the data set. For NCBI RefSeq, the following options are provided

RefSeq All

RefSeq Curated

RefSeq Predicted

UCSC RefSeq

RefSeq All contains every transcript in RefSeq. RefSeq Curated contains reviewed, high-confidence records, whose accessions start with NM_, NR_, or YP_. RefSeq Predicted contains predicted records, whose accessions start with XM_ or XR_. UCSC RefSeq contains the transcripts whose accessions start with NM_ or NR_. UCSC RefSeq is usually a reasonable choice.

Region selects the region to download: the entire genome or only part of a chromosome.

Output format selects the output file format. The following two are commonly used

GTF (limited)

BED

Output file specifies the name of the output file. If no name is given, the result is displayed in the browser. When downloading data for the entire genome, it is recommended to enter an output file name. File type returned selects the format of the returned file; a compressed file can be requested.

With these few selections you can download a GTF file. However, a GTF file obtained this way is limited: the gene_id field only repeats the transcript ID, as the following example shows

chr1 hg38_refGene exon 11106531 11107500 0.000000 - . gene_id "NM_004958"; transcript_id "NM_004958"

The gene name that corresponds to each transcript is important information. To get it, download the annotation from the FTP server instead.
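One way to check whether a downloaded file has this limitation is to count the records whose gene_id differs from their transcript_id. The following is a minimal sketch, not part of the original workflow: the function name count_distinct_gene_ids is made up for illustration, and the sample record is the example line above rebuilt with tab separators, since real GTF files are tab-separated.

```shell
# One record in the style of the Table Browser's "GTF (limited)" output.
# printf is used so that the fields are separated by real tab characters.
limited_line=$(printf 'chr1\thg38_refGene\texon\t11106531\t11107500\t0.000000\t-\t.\tgene_id "NM_004958"; transcript_id "NM_004958"')

# Hypothetical helper: read GTF on stdin and print how many records
# have a gene_id that is different from their transcript_id.
count_distinct_gene_ids() {
    awk -F '\t' '
        {
            gene = ""; tx = ""
            # Pull the quoted value out of column 9 for each attribute.
            if (match($9, /gene_id "[^"]*"/))
                gene = substr($9, RSTART + 9, RLENGTH - 10)
            if (match($9, /transcript_id "[^"]*"/))
                tx = substr($9, RSTART + 15, RLENGTH - 16)
            if (gene != tx) differing++
        }
        END { print differing + 0 }
    '
}

distinct=$(printf '%s\n' "$limited_line" | count_distinct_gene_ids)
echo "records with gene_id != transcript_id: $distinct"
```

On a real file, the same function can be fed with the file's contents on stdin; a result of zero suggests that gene_id carries no information beyond the transcript accession.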

2. FTP

UCSC's FTP service provides annotation files for each species. The address for hg38 is as follows.

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

However, the FTP site does not directly provide files in BED12 or GTF format, because these formats carry redundant information and produce relatively large files. To save disk space, UCSC uses the genePred format, in which each line describes one transcript with little redundancy. For details, see the official documentation.

https://genome.ucsc.edu/FAQ/FAQformat.html#format9

The file that corresponds to the UCSC RefSeq track is refGene.txt.gz. It must be converted to GTF format with the conversion tool provided by UCSC.

genePredToGtf converts genePred format to GTF format. It is used as follows
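To illustrate why genePred is more compact, the sketch below expands one genePred record into one row per exon, which is roughly what a GTF file has to store. The record is invented for illustration (NM_EXAMPLE is not a real accession), the column order assumes the basic ten-column genePred layout described in the UCSC format documentation, and the function name expand_exons is hypothetical.

```shell
# A made-up genePred record with three exons. Columns (tab-separated):
# name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd,
# exonCount, exonStarts, exonEnds
genepred_line=$(printf 'NM_EXAMPLE\tchr1\t+\t100\t900\t150\t850\t3\t100,400,700,\t200,500,900,')

# Hypothetical helper: print one row per exon as
# chrom, start, end, strand, transcript name.
expand_exons() {
    awk -F '\t' -v OFS='\t' '
        {
            split($9, starts, ",")
            split($10, ends, ",")
            # genePred starts are 0-based; GTF starts are 1-based,
            # so 1 is added to each start and the end is kept as is.
            for (i = 1; i <= $8; i++)
                print $2, starts[i] + 1, ends[i], $3, $1
        }
    '
}

exon_rows=$(printf '%s\n' "$genepred_line" | expand_exons)
exon_count=$(printf '%s\n' "$exon_rows" | wc -l | tr -d ' ')
printf '%s\n' "$exon_rows"
echo "exon rows: $exon_count"
```

A single transcript line becomes three exon rows here, each repeating the chromosome, strand, and transcript name. A full GTF also adds CDS and codon records, so the growth is larger in practice.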

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
gunzip refGene.txt.gz
cut -f 2- refGene.txt | genePredToGtf file stdin -source=hg38_Ref hg38.gtf

The first column of refGene.txt (bin) is not part of the genePred format. Once cut -f 2- removes it, the remaining columns are in genePred format. The resulting file looks like this

chr20 hg38_Ref exon 63865228 63865384 . + . gene_id "TPD52L2"; transcript_id "NM_003288"; exon_number "1"; exon_id "NM_003288.1"; gene_name "TPD52L2"

As you can see, gene_id now holds the gene name, and a gene_name attribute is present. One shortcoming remains: compared with the GTF files from NCBI and Ensembl, the one derived from UCSC lacks gene_biotype, so the gene type cannot be determined from the file.
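The same steps can be wrapped in a small function that takes the assembly name as an argument. This is only a sketch: the function names refgene_url and refgene_to_gtf are made up for illustration, it assumes that wget and genePredToGtf are on the PATH, and it assumes that other assemblies follow the same goldenPath directory layout as hg38 and provide a refGene table, which should be checked for each assembly.

```shell
# Hypothetical helper: build the refGene download URL for an assembly.
# Assumes the hg38 directory layout also applies to the given assembly.
refgene_url() {
    printf 'http://hgdownload.soe.ucsc.edu/goldenPath/%s/database/refGene.txt.gz\n' "$1"
}

# Hypothetical helper: download refGene for an assembly and convert it to GTF.
# Writes <assembly>.refGene.txt.gz and <assembly>.gtf in the current directory.
refgene_to_gtf() {
    db=$1
    wget -O "${db}.refGene.txt.gz" "$(refgene_url "$db")" || return 1
    # Decompress to stdout, drop the leading bin column, convert to GTF.
    gunzip -c "${db}.refGene.txt.gz" \
        | cut -f 2- \
        | genePredToGtf file stdin -source="${db}_Ref" "${db}.gtf"
}

# Building the URL needs no network access.
url=$(refgene_url hg38)
echo "$url"

# To run the full download and conversion (needs network access):
# refgene_to_gtf hg38
```

Using gunzip -c keeps the compressed download on disk, so the conversion can be repeated without fetching the file again.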
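With gene names available, a transcript-to-gene table can be pulled out of the GTF file. The sketch below also adds a coarse class based on the RefSeq accession prefix (NM_ for mRNA, NR_ for non-coding RNA). That class is an assumption made here as a rough stand-in and is not equivalent to gene_biotype. The function name tx_to_gene is hypothetical, and the sample record follows the example above, rebuilt with tab separators.

```shell
# One record in the style of the genePredToGtf output shown above.
gtf_line=$(printf 'chr20\thg38_Ref\texon\t63865228\t63865384\t.\t+\t.\tgene_id "TPD52L2"; transcript_id "NM_003288"; exon_number "1"; exon_id "NM_003288.1"; gene_name "TPD52L2"')

# Hypothetical helper: read GTF on stdin and print one line per transcript:
# transcript ID, gene name, coarse class from the accession prefix.
tx_to_gene() {
    awk -F '\t' -v OFS='\t' '
        {
            tx = ""; gene = ""
            if (match($9, /transcript_id "[^"]*"/))
                tx = substr($9, RSTART + 15, RLENGTH - 16)
            if (match($9, /gene_name "[^"]*"/))
                gene = substr($9, RSTART + 11, RLENGTH - 12)
            # Coarse class only; this is not a replacement for gene_biotype.
            if (tx ~ /^NM_/)      class = "mRNA"
            else if (tx ~ /^NR_/) class = "non-coding RNA"
            else                  class = "unknown"
            # Print each transcript once, even though it spans many records.
            if (tx != "" && !seen[tx]++) print tx, gene, class
        }
    '
}

mapping=$(printf '%s\n' "$gtf_line" | tx_to_gene)
printf '%s\n' "$mapping"
```

On a real file, the function can be fed with the contents of hg38.gtf on stdin to produce the full table.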

That covers both ways of downloading the GTF file of a genome from UCSC.
