How to download data in bulk from NCBI

Updated 2025-07-11. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how to download data from NCBI in bulk. The approach is quite practical, so it is shared here for reference.

Batch searching and downloading sequences from NCBI

Script code:

    import os
    import argparse
    from Bio import Entrez
    from Bio import SeqIO

    '''
    database:
    ['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
     'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
     'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
     'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
     'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
     'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
    '''

    parser = argparse.ArgumentParser(description='This script is used to fasta from ncbi')
    parser.add_argument('-t', '--term', help='input search term: https://www.ncbi.nlm.nih.gov/books/NBK3837/#_EntrezHelp_Entrez_Searching_Options_', required=True)
    parser.add_argument('-d', '--database', help='Please input database to search nucleotide or protein default nucleotide', default='nucleotide', required=False)
    parser.add_argument('-r', '--rettype', help='return type fasta or gb default gb', default='gb', required=False)
    parser.add_argument('-o', '--out_dir', help='Please input out_put directory path', default=os.getcwd(), required=False)
    parser.add_argument('-n', '--name', help='Please specify the output, seq', default='seq', required=False)
    args = parser.parse_args()

    # Create the output directory if it does not exist yet
    if not os.path.exists(args.out_dir):
        os.mkdir(args.out_dir)
    dout = os.path.abspath(args.out_dir)

    # The output file is <out_dir>/<name>.<rettype>
    output_handle = open(dout + '/' + args.name + '.%s' % args.rettype, "w")

    Entrez.email = "huangls@biomics.com.cn"  # Always tell NCBI who you are
    # handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
    # print(handle.read())

    # Search for matching records; idtype="acc" returns accession numbers.
    # Note: esearch returns only the first 20 matches unless retmax is set.
    handle = Entrez.esearch(db=args.database, term=args.term, idtype="acc")
    record = Entrez.read(handle)

    # Fetch each record in turn and append it to the output file
    for i in record['IdList']:
        print(i)
        handle = Entrez.efetch(db=args.database, id=i, rettype=args.rettype, retmode="text")
        # print(handle.read())
        record = SeqIO.read(handle, args.rettype)
        SeqIO.write(record, output_handle, args.rettype)

    output_handle.close()
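One thing to keep in mind for genuinely large downloads: the E-utilities `esearch` call returns only the first 20 matching IDs unless `retmax` is given, and `efetch` accepts a comma-separated list of IDs, so records can be requested in batches rather than one at a time. The following is a minimal sketch of that idea and is not part of the original script. The helper names `chunked` and `fetch_in_batches`, the batch size of 200, the `retmax` value of 10000, the pause length and the placeholder e-mail address are all assumptions chosen for illustration.

```python
import time


def chunked(ids, size):
    """Split a list of IDs into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_in_batches(term, database="nucleotide", rettype="gb",
                     batch_size=200, retmax=10000, pause=0.4):
    """Yield the raw text of each efetch batch for a search term.

    Requires network access. Biopython is imported lazily so that
    `chunked` can be used (and tested) without it being installed.
    """
    from Bio import Entrez  # third-party package: biopython

    Entrez.email = "you@example.org"  # placeholder: use your own address
    handle = Entrez.esearch(db=database, term=term, idtype="acc",
                            retmax=retmax)
    id_list = Entrez.read(handle)["IdList"]
    handle.close()

    for batch in chunked(id_list, batch_size):
        # efetch accepts several IDs joined by commas
        handle = Entrez.efetch(db=database, id=",".join(batch),
                               rettype=rettype, retmode="text")
        yield handle.read()
        handle.close()
        time.sleep(pause)  # be polite: keep the request rate low
```

Each yielded string holds several concatenated records, so it would be written to the output file as-is or parsed with `SeqIO.parse` rather than `SeqIO.read`.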

Help documentation:

    python /share/work/huangls/piplines/01.script/search_NCBI.py -h
    usage: search_NCBI.py [-h] -t TERM [-d DATABASE] [-r RETTYPE] [-o OUT_DIR]
                          [-n NAME]

    This script is used to fasta from ncbi

    optional arguments:
      -h, --help            show this help message and exit
      -t TERM, --term TERM  input search term: https://www.ncbi.nlm.nih.gov/books/NBK3837/#_EntrezHelp_Entrez_Searching_Options_
      -d DATABASE, --database DATABASE
                            Please input database to search nucleotide or protein
                            default nucleotide
      -r RETTYPE, --rettype RETTYPE
                            return type fasta or gb default gb
      -o OUT_DIR, --out_dir OUT_DIR
                            Please input out_put directory path
      -n NAME, --name NAME  Please specify the output, seq

Instructions for use:

Let's take a look at an example:

    python search_NCBI.py -t "Polygonatum[Organism] AND chloroplast AND PsaA" -d protein -r fasta -n psaA

This command searches the NCBI protein database for PsaA protein sequences from Polygonatum chloroplasts and downloads them in FASTA format.

-t: followed by the search term, enclosed in double quotation marks. Boolean operators and index fields make the search more precise. Boolean operators provide a way to write a precise query that produces a well-defined result set. There are three of them: AND, OR, and NOT. They work as follows:

AND must be written in uppercase. OR and NOT do not strictly require it, but writing all three operators in uppercase is recommended.

Boolean operators are processed from left to right. For example:

promoters OR response elements NOT human AND mammals

This searches for promoters or response elements in mammals other than humans. Use parentheses to change the order of evaluation. For example:

promoters OR response elements NOT (human OR mouse) AND mammals

This searches for promoters or response elements in mammals other than humans and mice.
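The left-to-right rule can be illustrated with ordinary Python sets standing in for result sets. This is only a toy model under stated assumptions: the record numbers are made up, "response elements" is treated as a single term, and `evaluate_left_to_right` is a hypothetical helper, not how Entrez is implemented.

```python
def evaluate_left_to_right(first, steps):
    """Apply (operator, operand_set) steps to `first` strictly left to right."""
    result = set(first)
    for operator, operand in steps:
        if operator == "AND":
            result &= operand      # keep records present in both sets
        elif operator == "OR":
            result |= operand      # keep records present in either set
        elif operator == "NOT":
            result -= operand      # drop records present in the operand
        else:
            raise ValueError("unknown operator: %s" % operator)
    return result


# Made-up record IDs matching each individual term
promoters = {1, 2, 3}
response_elements = {3, 4, 5}
human = {2, 4}
mouse = {5}
mammals = {1, 2, 4, 5, 6}

# promoters OR response elements NOT human AND mammals
no_parentheses = evaluate_left_to_right(
    promoters,
    [("OR", response_elements), ("NOT", human), ("AND", mammals)],
)

# promoters OR response elements NOT (human OR mouse) AND mammals
# The parenthesised part is evaluated first, then used as one operand.
with_parentheses = evaluate_left_to_right(
    promoters,
    [("OR", response_elements), ("NOT", human | mouse), ("AND", mammals)],
)

print(sorted(no_parentheses))    # [1, 5]
print(sorted(with_parentheses))  # [1]
```

Record 5 (a mouse record in this toy data) survives the first query but is excluded once mouse is added inside the parentheses.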

The content in "[]" is an index field, chosen with the index builder, that tells NCBI how to interpret the preceding search term. For example, [Organism] in the example indicates that Polygonatum is the name of an organism. Other fields work the same way.

Searches over a range of values are also possible, for example by sequence length or publication date.
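As a rough sketch of how such search terms can be assembled before passing them to -t, the helpers below build field-tagged terms and range terms of the form low:high[field]. The helper names are hypothetical, and the example assumes the field tags [SLEN] (sequence length) and [PDAT] (publication date) used by the NCBI sequence databases; the length and date values are arbitrary.

```python
def tagged(term, field):
    """Attach an index field to a term, e.g. Polygonatum[Organism]."""
    return "%s[%s]" % (term, field)


def ranged(low, high, field):
    """Build a range term of the form low:high[field]."""
    return "%s:%s[%s]" % (low, high, field)


def join_terms(operator, terms):
    """Join terms with one uppercase Boolean operator."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return (" %s " % operator).join(terms)


query = join_terms("AND", [
    tagged("Polygonatum", "Organism"),
    "chloroplast",
    "PsaA",
    ranged(500, 1000, "SLEN"),                   # sequence length range
    ranged("2015/01/01", "2020/12/31", "PDAT"),  # publication date range
])
print(query)
```

The resulting string can be passed unchanged as the value of -t, enclosed in double quotation marks on the command line.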

-d: followed by the database to search, nucleotide or protein. The default is nucleotide.

-r: followed by the output format, fasta or gb (GenBank). The default is gb.

-o: followed by the output directory.

-n: followed by the output file name prefix.

Extracting sequences from GenBank files

Next, here is a recommended Python program. Given a list of gene names, it extracts the genome sequence, the CDS and protein sequences of those genes, and the gene locations from GenBank files, and stores them in *.gb.genome.fa, *.gb.cds.fa, *.gb.pep.fa and *.gb.cds_location.txt files respectively.

Script code:

    import os
    import glob
    import argparse
    from Bio import SeqIO

    parser = argparse.ArgumentParser(description='This script was used to get fa from genbank file; *.faa=pep file; *.ffn=cds file; *fna=genome fa file')
    parser.add_argument('-i', '--idlist', help='Please gene name list file', required=True)
    parser.add_argument('-m', '--in_dir', help='Please input complete in_put directory path. Default cwd', default=os.getcwd(), required=False)
    parser.add_argument('-o', '--out_dir', help='Please input complete out_put directory path. Default cwd', default=os.getcwd(), required=False)
    args = parser.parse_args()

    din = os.path.abspath(args.in_dir)
    # Create the output directory if it does not exist yet
    if not os.path.exists(args.out_dir):
        os.mkdir(args.out_dir)
    dout = os.path.abspath(args.out_dir)

    # Read the gene names to extract, one name per line
    gene = {}
    with open(os.path.abspath(args.idlist), "r") as id_handle:
        for line in id_handle:
            line = line.strip()
            if line:
                gene[line] = line

    # Process every GenBank file (*gb) in the input directory
    genbank = glob.glob(din + "/*gb")
    for gdkfile in genbank:
        name = os.path.basename(gdkfile)
        input_handle = open(gdkfile, "r")
        genePEP = open(dout + '/' + name + ".pep.fa", "w")
        geneCDS = open(dout + '/' + name + ".cds.fa", "w")
        gene_handle = open(dout + '/' + name + ".genome.fa", "w")
        cds_locat_handle = open(dout + '/' + name + ".cds_location.txt", "w")

        for seq_record in SeqIO.parse(input_handle, "genbank"):
            print("Dealing with GenBank record %s" % seq_record.id)
            # Whole record sequence
            gene_handle.write(">%s %s\n%s\n" % (seq_record.id, seq_record.description, seq_record.seq))
            for seq_feature in seq_record.features:
                if seq_feature.type != "CDS":
                    continue
                # Skip CDS features without a /gene qualifier or not in the list
                gene_name = seq_feature.qualifiers.get('gene', [None])[0]
                if gene_name not in gene:
                    continue
                assert len(seq_feature.qualifiers['translation']) == 1
                geneSeq = seq_feature.extract(seq_record.seq)
                genePEP.write(">%s\n%s\n" % (gene_name, seq_feature.qualifiers['translation'][0]))
                geneCDS.write(">%s\n%s\n" % (gene_name, geneSeq))
                cds_locat_handle.write(">%s location %s\n" % (gene_name, seq_feature.location))

        input_handle.close()
        genePEP.close()
        geneCDS.close()
        gene_handle.close()
        cds_locat_handle.close()
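The central step in the script is `seq_feature.extract(seq_record.seq)`, which cuts the feature out of the record sequence and reverse-complements it when the feature lies on the minus strand. The dependency-free sketch below illustrates that idea for a single, unspliced location only. It is a simplified illustration, not Biopython's implementation; the function names are hypothetical, and coordinates are assumed to be 0-based and end-exclusive, as in Python slicing.

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_feature(genome, start, end, strand):
    """Cut genome[start:end] and orient it according to the strand.

    strand is +1 for the forward strand and -1 for the reverse strand.
    """
    piece = genome[start:end]
    if strand == -1:
        return reverse_complement(piece)
    return piece


genome = "TTATGAAATAGCC"
print(extract_feature(genome, 2, 11, 1))   # ATGAAATAG
print(extract_feature(genome, 0, 5, -1))   # CATAA
```

Real CDS features can consist of several joined segments, which is one reason the script relies on Biopython's `extract` rather than plain slicing.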

Help documentation:

    python /share/work/wangq/script/genbank/genbank.py
    usage: get_data_NCBI.py -i IDLIST -o OUT_DIR -m IN_DIR

    optional arguments:
      -i IDLIST, --idlist IDLIST
                            Please gene name list file
      -m IN_DIR, --in_dir IN_DIR
                            Please input complete in_put directory path
      -o OUT_DIR, --out_dir OUT_DIR
                            Please input complete out_put directory path

    example:
    python /share/work/wangq/script/genbank/genbank.py -i id.txt -m /share/nas1/wangq/work/NCBI_download -o /share/nas1/wangq/work/NCBI_download

Note: the argument after -m is a directory, which may contain multiple GenBank files; the program reads them in a batch. -i is followed by a file listing the gene names to extract, one per line, in the following format:

    rpl2
    psbA
    ndhD
    ndhF
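A small sketch of how such a list can be read and matched against a feature's gene qualifier is given here. The helper names `read_gene_names` and `wanted` are hypothetical, and the qualifier dictionaries are hand-written stand-ins for what a GenBank parser would provide. Note that the comparison is case-sensitive, so names in the list have to be spelled exactly as in the GenBank file.

```python
import io


def read_gene_names(handle):
    """Read one gene name per line, ignoring blank lines and whitespace."""
    return {line.strip() for line in handle if line.strip()}


def wanted(qualifiers, names):
    """True if the feature's first 'gene' qualifier is in `names`."""
    genes = qualifiers.get("gene", [])
    return bool(genes) and genes[0] in names


# An in-memory file standing in for id.txt
id_file = io.StringIO("rpl2\npsbA\n\nndhD\nndhF\n")
names = read_gene_names(id_file)

print(wanted({"gene": ["psbA"]}, names))       # True
print(wanted({"gene": ["PSBA"]}, names))       # False: case-sensitive
print(wanted({"product": ["unknown"]}, names))  # False: no gene qualifier
```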

GenBank to GFF3

The last script, bp_genbank2gff3.pl, generates GFF3 files from GenBank files. It is provided by BioPerl and can be used directly once BioPerl is installed and configured. Usage is simple: run bp_genbank2gff3.pl followed by the GenBank file name(s):

    bp_genbank2gff3.pl filename(s)
