How to split the NCBI protein database into sub-databases

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shares a method for splitting the NCBI protein database (nr) into sub-databases, one per GenBank taxonomic division. Most people are probably not very familiar with the procedure, so the commands and the script are written up here for reference.

Splitting the NCBI protein database by division

1. Download data:

# wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
# wget -c https://ftp.ncbi.nlm.nih.gov/genbank/livelists/gi2acc_mapping/gi2acc_lmdb.db.gz
# wget -c https://ftp.ncbi.nlm.nih.gov/genbank/livelists/gi2acc_mapping/gi2accession.py
# wget -c https://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
# wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

# Quick download method
/share/work/biosoft/aspera/latest/cli/bin/ascp -v -k 1 -T -l 400m -i ~/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/genbank/livelists/gi2acc_mapping/gi2acc_lmdb.db.gz ./
/share/work/biosoft/aspera/latest/cli/bin/ascp -v -k 1 -T -l 400m -i ~/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/pub/taxonomy/accession2taxid/prot.accession2taxid.gz ./
/share/work/biosoft/aspera/latest/cli/bin/ascp -v -k 1 -T -l 400m -i ~/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/blast/db/FASTA/nr.gz ./
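
The split in step 2 reads division.dmp and nodes.dmp, which are packed inside taxdump.tar.gz. A minimal sketch for unpacking just those two files, assuming they sit at the top level of the archive; the function name extract_taxonomy_tables is made up for this illustration:

```shell
# Extract only the two taxonomy tables needed for the split.
# $1 = path to taxdump.tar.gz
# $2 = output directory (defaults to the current directory)
# Assumption: division.dmp and nodes.dmp are top-level members of the archive.
extract_taxonomy_tables() {
    archive=$1
    outdir=${2:-.}
    mkdir -p "$outdir" &&
    tar -xzf "$archive" -C "$outdir" division.dmp nodes.dmp
}

# Example call (not run here):
#   extract_taxonomy_tables taxdump.tar.gz .
```

Naming the members on the tar command line avoids unpacking the other taxonomy tables, which this workflow does not use.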

2. Write a script that splits the complete nr protein sequence library into one file per division, using the protein IDs and taxonomy information in the prot.accession2taxid.gz file. The divisions are:

0 | BCT | Bacteria | |
1 | INV | Invertebrates | |
2 | MAM | Mammals | |
3 | PHG | Phages | |
4 | PLN | Plants and Fungi | |
5 | PRI | Primates | |
6 | ROD | Rodents | |
7 | SYN | Synthetic and Chimeric | |
8 | UNA | Unassigned | No species nodes should inherit this division assignment |
9 | VRL | Viruses | |
10 | VRT | Vertebrates | |
11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |
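
Each row of nodes.dmp carries a division id in its fifth field, which is how a taxid is tied to one of the division codes listed here. A rough sketch of that lookup with awk, assuming the usual taxdump layout in which fields are separated by a tab, a pipe and a tab; taxid_to_division is a hypothetical helper, not part of the author's Perl script:

```shell
# Print "taxid<TAB>division code" for every node whose division id is known.
# $1 = division.dmp, $2 = nodes.dmp
taxid_to_division() {
    awk -F'|' '
        # Strip the tabs and spaces that surround each pipe-separated field.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        # First file (division.dmp): remember division id -> three-letter code.
        NR == FNR { code[trim($1)] = trim($2); next }

        # Second file (nodes.dmp): field 1 is tax_id, field 5 is division id.
        {
            d = trim($5)
            if (d in code) print trim($1) "\t" code[d]
        }
    ' "$1" "$2"
}

# Example call (not run here):
#   taxid_to_division division.dmp nodes.dmp > taxid2division.tsv
```

The resulting two-column table is small compared with prot.accession2taxid.gz, so it is a convenient place to check the mapping before starting the long split.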

The code is as follows:

perl /share/work/huangls/piplines/01.script/split_taxid_ncbiv2.pl division.dmp nodes.dmp prot.accession2taxid.gz nr.gz ./

# ls *fa | while read a; do b=${a%%.fa}; echo mv "$a" "${b}_nr.fa"; done

/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in INV_nr.fa -dbtype prot -title INV_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in PLN_nr.fa -dbtype prot -title PLN_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in MAM_nr.fa -dbtype prot -title MAM_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in PHG_nr.fa -dbtype prot -title PHG_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in PRI_nr.fa -dbtype prot -title PRI_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in ROD_nr.fa -dbtype prot -title ROD_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in SYN_nr.fa -dbtype prot -title SYN_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in UNA_nr.fa -dbtype prot -title UNA_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in VRL_nr.fa -dbtype prot -title VRL_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in VRT_nr.fa -dbtype prot -title VRT_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in ENV_nr.fa -dbtype prot -title ENV_nr -parse_seqids
/share/work/biosoft/blast/ncbi-blast-2.6.0+/bin/makeblastdb -in BCT_nr.fa -dbtype prot -title BCT_nr -parse_seqids
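
The makeblastdb commands for the individual divisions differ only in the division code, so the same work can be driven by a loop. A sketch under the assumption that makeblastdb is on PATH and that the split files are named CODE_nr.fa; the MAKEBLASTDB and DRY_RUN variables and the function name build_division_dbs are introduced here for illustration only:

```shell
# Path to makeblastdb; the default assumes it is on PATH (override as needed).
MAKEBLASTDB=${MAKEBLASTDB:-makeblastdb}

# Build one protein BLAST database per *_nr.fa file in directory $1.
# With DRY_RUN=1 the commands are only printed, not executed.
build_division_dbs() {
    dir=${1:-.}
    for fa in "$dir"/*_nr.fa; do
        [ -e "$fa" ] || continue          # skip the literal pattern when nothing matches
        name=$(basename "$fa" .fa)        # e.g. INV_nr
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$MAKEBLASTDB -in $fa -dbtype prot -title $name -parse_seqids"
        else
            "$MAKEBLASTDB" -in "$fa" -dbtype prot -title "$name" -parse_seqids
        fi
    done
}

# Example call (not run here):
#   build_division_dbs .
```

Setting DRY_RUN=1 before the call prints the commands, which makes it easy to check them against the explicit list before the real run.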

The Perl code follows. It needs more than 320 GB of memory and takes about 3 days to run.

split_taxid_ncbiv2.pl

#The gi_taxid_nucl.dmp is about 160MB and contains two columns: the nucleotide's gi and taxid.
#The gi_taxid_prot.dmp is about 17 MB and contains two columns: the protein's gi and taxid.
#Divisions file (division.dmp):
#    division id      -- taxonomy database division id
#    division cde     -- GenBank division code (three characters)
#    division name    -- e.g. BCT, PLN, VRT, MAM, PRI...
#    comments
#0 | BCT | Bacteria | |
#1 | INV | Invertebrates | |
#2 | MAM | Mammals | |
#3 | PHG | Phages | |
#4 | PLN | Plants and Fungi | |
#5 | PRI | Primates | |
#6 | ROD | Rodents | |
#7 | SYN | Synthetic and Chimeric | |
#8 | UNA | Unassigned | No species nodes should inherit this division assignment |
#9 | VRL | Viruses | |
#10 | VRT | Vertebrates | |
#11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |
#nodes.dmp file consists of taxonomy nodes. The description for each node includes the following
#fields:
#    tax_id                             -- node id in GenBank taxonomy database
#    parent tax_id                      -- parent node id in GenBank taxonomy database
#    rank                               -- rank of this node (superkingdom, kingdom, ...)
#    embl code                          -- locus-name prefix; not unique
#    division id                        -- see division.dmp file
#    inherited div flag (1 or 0)        -- 1 if node inherits division from parent
#    genetic code id                    -- see gencode.dmp file
#    inherited GC flag (1 or 0)         -- 1 if node inherits genetic code from parent
#    mitochondrial genetic code id      -- see gencode.dmp file
#    inherited MGC flag (1 or 0)        -- 1 if node inherits mitochondrial gencode from parent
#    GenBank hidden flag (1 or 0)       -- 1 if name is suppressed in GenBank entry lineage
#    hidden subtree root flag (1 or 0)  -- 1 if this subtree has no sequence data yet
#    comments                           -- free-text comments and citations

die "perl $0" unless (@ARGV==5);
use Math::BigFloat;
use Bio::SeqIO;
use Bio::Seq;
use Data::Dumper;
use PerlIO::gzip;
use FileHandle;
use Cwd qw(abs_path getcwd);

if ($ARGV[3] =~ /gz$/) {
    open $Fa, "    # the source text breaks off inside this statement
# ... part of the script is missing from the source here ...
    "$od/$division2name{$i}.fa");    # surviving end of the statement that opens a division's output file
    my $out = Bio::SeqIO->new(-fh => $FO, -format => 'Fasta');
    $fout{$i} = $out;    # one FASTA writer per division id
}
print "$ARGV[0] readed\n";
# print Dumper(\%fout);
# print Dumper(\%division2name);

if ($ARGV[1] =~ /gz$/) {
    open IN, "    # the source text breaks off inside this statement
# ... part of the script is missing from the source here ...
    "$out" or die $!;    # surviving end of an open() statement
my $out_nr = Bio::SeqIO->new(-fh => $GZ, -format => 'fasta');

# Walk through nr and write every sequence to the file of its division.
while (my $seq = $in->next_seq()) {
    my $id = $seq->id;
    my $sequence = $seq->seq;
    my $desc = $seq->desc;
    # my ($gi) = ($id =~ /gi\|(\d+)\|ref\|/);
    if (exists $prot2gi{$id}) {
        my $s = Bio::Seq->new(-seq => $sequence, -id => "gi|$prot2gi{$id}|ref|$id|", -desc => $desc);
        $out_nr->write_seq($s);
        my $gi = $prot2gi{$id};
        if (exists($gi2taxid{$gi}) and exists($taxid2division{$gi2taxid{$gi}})) {
            $fout{$taxid2division{$gi2taxid{$gi}}}->write_seq($s);
        } else {
            print "unknown tax for gi: $gi\n";
        }
    } else {
        print "unknown prot id: $id\n";
    }
}

$out_nr->close();
$in->close();
for my $i (keys %fout) {
    $fout{$i}->close();
}

That covers how to split the NCBI protein database into sub-databases. Thank you for reading!
