2025-04-01 Update. From: SLTechnology News&Howtos (shulou)
Shulou(Shulou.com)06/01 Report--
This article explains how to download a genome GTF file from UCSC, with the aim of giving a simple and workable method to anyone facing this problem.
There are two ways to download a genome GTF file from UCSC: through the Table Browser, or through the FTP service.
1. Table Browser
The Table Browser is a search and download portal that supports many output formats; downloading GTF files is only one of its functions. The URL is as follows
http://genome.ucsc.edu/cgi-bin/hgTables
The three menus in the first row (clade, genome, assembly) determine the species and genome version. Clade groups species by classification and includes the following types
Mammal
Vertebrate
Deuterostome
Insect
Nematode
Viruses
Other
These categories show that UCSC mainly provides animal genomes; to download plant genome files, you will need to turn to NCBI or Ensembl. The clade category helps you find a species quickly. Genome selects the species name, and assembly selects the genome version.
Group selects the type of annotation to retrieve and provides the following types
Mapping and Sequencing
Genes and Gene Predictions
Phenotype and Literature
mRNA and EST
Expression
Regulation
Comparative Genomics
Variation
Repeats
All Tracks
All Tables
A GTF file holds the structural information of genes and transcripts, so select the second group, Genes and Gene Predictions. Track selects the corresponding database and version, usually NCBI RefSeq.
Table selects the data set. For NCBI RefSeq, the following options are provided
RefSeq All
RefSeq Curated
RefSeq Predicted
UCSC RefSeq
RefSeq All contains all transcripts in RefSeq. RefSeq Curated contains reviewed records with high confidence, whose accessions start with NM, NR, or YP. RefSeq Predicted contains predicted records, whose accessions start with XM or XR. UCSC RefSeq contains all transcripts whose accessions start with NM or NR. UCSC RefSeq is usually a reasonable choice.
Region selects the region to download: the entire genome or only part of a chromosome.
Output format selects the output file format. The two most commonly used are
GTF (limited)
BED
Output file specifies the name of the output file. If it is left empty, the result is displayed in the browser by default. When downloading data for the entire genome, it is recommended to enter an output file name. File type returned selects the format of the returned file and supports compressed output.
With these few selections, you can download a GTF file. However, a GTF file downloaded this way is limited: gene_id contains only the transcript ID, as shown in the following example
chr1 hg38_refGene exon 11106531 11107500 0.000000 - . gene_id "NM_004958"; transcript_id "NM_004958"
The gene name that corresponds to each transcript is important information. To obtain it, download the annotation through the FTP server instead.
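The limitation described above can be checked on any downloaded file. The following is a minimal sketch, assuming a tab-separated GTF with attributes in column 9; the temporary file and the variable names are illustrative, and the single sample line is modeled on the example in this article rather than taken from a real download.

```shell
# Illustrative sample: one line in the Table Browser "GTF (limited)" style (tab-separated).
sample_gtf=$(mktemp)
printf 'chr1\thg38_refGene\texon\t11106531\t11107500\t0.000000\t-\t.\tgene_id "NM_004958"; transcript_id "NM_004958";\n' > "$sample_gtf"

# Count lines whose gene_id value is identical to the transcript_id value.
same_id_count=$(awk -F'\t' '
{
    gene = ""; tx = ""
    # Pull the quoted value that follows each attribute key in column 9.
    if (match($9, /gene_id "[^"]*"/))       gene = substr($9, RSTART + 9,  RLENGTH - 10)
    if (match($9, /transcript_id "[^"]*"/)) tx   = substr($9, RSTART + 15, RLENGTH - 16)
    if (gene != "" && gene == tx) n++
}
END { print n + 0 }' "$sample_gtf")

echo "lines with gene_id equal to transcript_id: $same_id_count"
```

If the count equals the number of lines in the file, the file carries no separate gene identifier and a transcript-to-gene mapping has to come from another source.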
2. FTP
UCSC's FTP service provides annotation files for each species. The address for hg38 is as follows.
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
However, files in bed12 and gtf formats are not provided directly there, because these formats carry redundant information and the files would be relatively large. To save disk space, UCSC uses the genePred format, in which each line describes one transcript and there is less redundancy. For more information, see the official documentation.
https://genome.ucsc.edu/FAQ/FAQformat.html#format9
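To make the one-line-per-transcript layout concrete, here is a small sketch that reads a single genePred record and summarizes it. It assumes the basic ten-column genePred layout (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds); the record itself, including the name TX_DEMO and all coordinates, is made up for illustration.

```shell
# Made-up genePred record (tab-separated), columns:
# name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds
genepred_line=$(printf 'TX_DEMO\tchr1\t+\t1000\t5000\t1200\t4800\t3\t1000,2000,4000,\t1500,2500,5000,')

# One line carries the whole transcript; the exon lists are comma-separated.
summary=$(printf '%s\n' "$genepred_line" | awk -F'\t' '
{
    split($9, starts, ",")
    split($10, ends, ",")
    total = 0
    # Sum exon lengths over the exonCount entries in the two lists.
    for (i = 1; i <= $8; i++) total += ends[i] - starts[i]
    print $1, $2, $3, "exons=" $8, "exon_bases=" total
}')

echo "$summary"
```

A GTF file would need one line per exon (and more for CDS features) to express the same transcript, which is where the redundancy mentioned above comes from.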
The file that corresponds to the UCSC RefSeq track is refGene.txt.gz. It has to be converted to gtf format with the format conversion tool provided by UCSC.
genePredToGtf is the tool that converts genePred format to gtf format. It is used as follows
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
gunzip refGene.txt.gz
cut -f 2- refGene.txt | genePredToGtf file stdin -source=hg38_Ref hg38.gtf
The first column of refGene.txt (bin) is redundant; once it is removed, the rest of the file is in genePred format. The resulting file looks like this
chr20 hg38_Ref exon 63865228 63865384 . + . gene_id "TPD52L2"; transcript_id "NM_003288"; exon_number "1"; exon_id "NM_003288.1"; gene_name "TPD52L2"
As you can see, gene_id now holds the gene name, and a gene_name attribute is present as well. One shortcoming remains: compared with the GTF files from NCBI and Ensembl, the one produced from UCSC data lacks gene_biotype, so the gene type cannot be determined from it.
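Once gene names are present, a transcript-to-gene table can be pulled out of the converted file, and the absence of gene_biotype can be confirmed. The sketch below assumes a tab-separated GTF with the attribute keys shown in the sample line of this article; the temporary file stands in for hg38.gtf, and the variable names are illustrative.

```shell
# Illustrative sample: one line in the style produced by genePredToGtf (tab-separated).
sample_gtf=$(mktemp)
printf 'chr20\thg38_Ref\texon\t63865228\t63865384\t.\t+\t.\tgene_id "TPD52L2"; transcript_id "NM_003288"; exon_number "1"; exon_id "NM_003288.1"; gene_name "TPD52L2";\n' > "$sample_gtf"

# Build a de-duplicated transcript_id <TAB> gene_name table.
tx2gene=$(awk -F'\t' '
{
    tx = ""; gene = ""
    if (match($9, /transcript_id "[^"]*"/)) tx   = substr($9, RSTART + 15, RLENGTH - 16)
    if (match($9, /gene_name "[^"]*"/))     gene = substr($9, RSTART + 11, RLENGTH - 12)
    if (tx != "" && gene != "") print tx "\t" gene
}' "$sample_gtf" | sort -u)

# Count lines that mention gene_biotype; grep exits non-zero when there are none.
biotype_lines=$(grep -c 'gene_biotype' "$sample_gtf" || true)

echo "$tx2gene"
echo "lines with gene_biotype: $biotype_lines"
```

To run this on the real output, replace "$sample_gtf" with hg38.gtf. If gene types are needed, they would have to be joined in from another annotation source by gene name.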
That covers how to download a genome GTF file from UCSC. I hope the above is of some help.