How to split the NR and NT databases with Perl

2025-04-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how to use Perl to split the NCBI NR and NT databases into smaller databases, one per taxonomy division.

Splitting the NR and NT databases for fast local BLAST homology annotation

To annotate genes, we run BLAST homology searches against the NCBI NR and NT databases. A transcriptome project without a reference genome typically assembles more than 100,000 unigenes, and searching all of them against the complete databases wastes a lot of time. Instead, the databases can be split according to NCBI's taxonomy divisions. Download the following files, then use the Perl script below to split the NR or NT database into smaller databases:

wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

The Perl script follows. Because it reads all of the taxonomy mapping files into memory, it consumes a large amount of memory; make sure that more than 100 GB of memory is available.

# The gi_taxid_nucl.dmp is about 160 MB and contains two columns: the nucleotide's gi and taxid.
# The gi_taxid_prot.dmp is about 17 MB and contains two columns: the protein's gi and taxid.
#
# Divisions file (division.dmp):
#   division id   -- taxonomy database division id
#   division cde  -- GenBank division code (three characters)
#   division name -- e.g. BCT, PLN, VRT, MAM, PRI...
#   comments
#
#   0  | BCT | Bacteria | |
#   1  | INV | Invertebrates | |
#   2  | MAM | Mammals | |
#   3  | PHG | Phages | |
#   4  | PLN | Plants and Fungi | |
#   5  | PRI | Primates | |
#   6  | ROD | Rodents | |
#   7  | SYN | Synthetic and Chimeric | |
#   8  | UNA | Unassigned | No species nodes should inherit this division assignment |
#   9  | VRL | Viruses | |
#   10 | VRT | Vertebrates | |
#   11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |
#
# nodes.dmp file consists of taxonomy nodes. The description for each node includes the following fields:
#   tax_id                        -- node id in GenBank taxonomy database
#   parent tax_id                 -- parent node id in GenBank taxonomy database
#   rank                          -- rank of this node (superkingdom, kingdom, ...)
#   embl code                     -- locus-name prefix; not unique
#   division id                   -- see division.dmp file
#   inherited div flag (1 or 0)   -- 1 if node inherits division from parent
#   genetic code id               -- see gencode.dmp file
#   inherited GC flag (1 or 0)    -- 1 if node inherits genetic code from parent
#   mitochondrial genetic code id -- see gencode.dmp file
#   inherited MGC flag (1 or 0)   -- 1 if node inherits mitochondrial gencode from parent
#   GenBank hidden flag (1 or 0)  -- 1 if name is suppressed in GenBank entry lineage
#   hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet
#   comments                      -- free-text comments and citations

use strict;
use warnings;
use Bio::SeqIO;
use PerlIO::gzip;
use Cwd qw(abs_path);

# The argument names in the usage message and the lines marked "reconstructed"
# were lost from the published listing; they are inferred from how the rest of
# the script uses @ARGV, %division2name, %fout, $in and $od.
die "perl $0 <division.dmp> <nodes.dmp> <gi_taxid.dmp> <fasta[.gz]> <outdir>\n" unless (@ARGV == 5);

# Open the FASTA input, decompressing on the fly if it ends in "gz" (reconstructed).
my $FA;
if ($ARGV[3] =~ /gz$/) {
    open $FA, "<:gzip", $ARGV[3] or die "$!";
} else {
    open $FA, "<", $ARGV[3] or die "$!";
}
my $in = Bio::SeqIO->new(-fh => $FA, -format => 'Fasta');

# Output directory (reconstructed; assumed to be the fifth argument).
mkdir $ARGV[4] unless -d $ARGV[4];
my $od = abs_path($ARGV[4]);

# division.dmp: division id -> label used for the output file name.
# Assumption: the three-letter division code (column 2) is used as the label.
my %division2name = ();
open my $DIV, "<", $ARGV[0] or die "$!";
while (<$DIV>) {
    chomp;
    my @tmp = split(/\t\|\t/);
    $division2name{$tmp[0]} = $tmp[1];
}
close($DIV);

# One FASTA writer per division.
my %fout = ();
for my $i (keys %division2name) {
    open my $FO, ">", "$od/$division2name{$i}.fa" or die "$!";
    my $out = Bio::SeqIO->new(-fh => $FO, -format => 'Fasta');
    $fout{$i} = $out;
}
print "$ARGV[0] read\n";

# nodes.dmp: taxid -> division id (fifth column).
# Splitting on the literal "\t|\t" delimiter keeps empty fields, such as a
# missing embl code, from shifting the columns.
my %taxid2division = ();
open my $NODES, "<", $ARGV[1] or die "$!";
while (<$NODES>) {
    chomp;
    my @tmp = split(/\t\|\t/);
    $taxid2division{$tmp[0]} = $tmp[4];
}
close($NODES);
print "$ARGV[1] read\n";

# gi_taxid_nucl.dmp or gi_taxid_prot.dmp: gi -> taxid (tab-separated).
my %gi2taxid = ();
open my $GI, "<", $ARGV[2] or die "$!";
while (<$GI>) {
    chomp;
    my @tmp = split(/\t/);
    $gi2taxid{$tmp[0]} = $tmp[1];
}
close($GI);
print "$ARGV[2] read\n";

# Write each sequence to the file of its division.
while (my $seq = $in->next_seq()) {
    my $id = $seq->id;
    # The published pattern appears to have required "|ref|" after the gi;
    # it is relaxed here so that non-RefSeq ids (gb, emb, ...) also match.
    my ($gi) = ($id =~ /gi\|(\d+)\|/);
    if (defined $gi and exists($gi2taxid{$gi}) and exists($taxid2division{$gi2taxid{$gi}})) {
        $fout{$taxid2division{$gi2taxid{$gi}}}->write_seq($seq);
    } else {
        print "unknown gi: ", (defined $gi ? $gi : $id), "\n";
    }
}

for my $i (keys %fout) {
    $fout{$i}->close();
}
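The script resolves each sequence in two steps, gi to taxid and then taxid to division. The following is a minimal sketch of that chain on toy data, not the author's code: the records are made-up values rather than real NCBI entries, the helper names dmp_fields and division_for_id are hypothetical, and Bio::SeqIO is left out so the block runs with core Perl only.

```perl
use strict;
use warnings;

# Toy stand-ins for division.dmp, nodes.dmp and gi_taxid_*.dmp (made-up values).
my @division_lines = (
    "0\t|\tBCT\t|\tBacteria\t|\t\t|",
    "4\t|\tPLN\t|\tPlants and Fungi\t|\t\t|",
);
my @node_lines = (
    "562\t|\t561\t|\tspecies\t|\tEC\t|\t0\t|\t1\t|",   # embl code present
    "3702\t|\t3701\t|\tspecies\t|\t\t|\t4\t|\t1\t|",   # embl code empty
);
my @gi_lines = ("111\t562", "222\t3702");

# Split one taxdump row on the literal "\t|\t" delimiter so that empty
# fields keep their position (hypothetical helper).
sub dmp_fields {
    my ($line) = @_;
    $line =~ s/\t\|$//;                    # drop the row terminator
    return split /\t\|\t/, $line, -1;
}

# Build the three in-memory lookup tables.
my (%division_code, %taxid2division, %gi2taxid);
for (@division_lines) { my @f = dmp_fields($_); $division_code{$f[0]}  = $f[1]; }
for (@node_lines)     { my @f = dmp_fields($_); $taxid2division{$f[0]} = $f[4]; }
for (@gi_lines)       { my ($gi, $taxid) = split /\t/; $gi2taxid{$gi} = $taxid; }

# Resolve a FASTA sequence id to a division code, or undef if any step fails
# (hypothetical helper).
sub division_for_id {
    my ($id) = @_;
    return undef unless $id =~ /gi\|(\d+)\|/;
    my $taxid = $gi2taxid{$1};
    return undef unless defined $taxid && exists $taxid2division{$taxid};
    return $division_code{ $taxid2division{$taxid} };
}

for my $id ('gi|111|ref|NP_000001.1|', 'gi|222|gb|AAA00001.1|', 'gi|999|ref|XP_000001.1|') {
    my $div = division_for_id($id);
    printf "%s -> %s\n", $id, defined $div ? $div : 'unknown';
}
```

The second toy node has an empty embl code, which is why the rows are split on the exact "\t|\t" delimiter: a whitespace-based pattern would merge the empty field with its neighbours and shift the division id out of the fifth column.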
