How to use OrthoMCL to find homologous genes

2025-04-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use OrthoMCL to find homologous genes and is intended as a practical reference for anyone who needs to run the pipeline.

OrthoMCL introduction

OrthoMCL (http://orthomcl.org/orthomcl/, version 2.0) is one of the most widely used programs for finding orthologs and paralogs.

According to the tutorial on the official website, a complete run takes more than ten steps, but most of them are carried out by programs shipped with the package, so the procedure is easy to follow. The rest of this article walks through the use of OrthoMCL in detail, taking protein sequences as the example.

Prerequisites

The environment that OrthoMCL needs on Linux is, in brief:

System: Unix

BLAST: NCBI BLAST is recommended

Database: Oracle or MySQL; MySQL is used below

Hardware: 4 GB of memory, 100 GB of disk space

The MCL program

Software installation

(1) Install MySQL

OrthoMCL stores its intermediate data in a database. You do not need to know much about databases: it is enough to be able to install one and run a few simple SQL statements, and the programs do the more complex work themselves. The installation process itself is not covered here.

(2) Install mcl

Download it from http://www.micans.org/mcl/src/mcl-latest.tar.gz, which always points to the latest version. Unpack the archive and, inside the extracted directory, run:

./configure
make
make install

Note that messages such as

make[]: Nothing to be done for '*'
make[]: Leaving directory '**'

are not make errors. It is best to run the make step with root permission, that is, to add sudo in front of it (for example, sudo make install).

(3) Install OrthoMCL:

Download link: http://orthomcl.org/common/downloads/software/v2.0/orthomclSoftware-v2.0.9.tar.gz

Use the following command to extract:

tar -xf orthomclSoftware-v2.0.9.tar.gz

The extracted folder contains four subfolders: bin, config, doc and lib. You can add the bin directory to the PATH environment variable to make later steps easier:

cd ~
echo "export PATH=$PATH:/home/wangq/.../orthomclSoftware-v2.0.9/bin" >> .bashrc
source .bashrc

Next, create a working directory, either under the OrthoMCL software folder or anywhere else. Here we follow the official documentation and call it my_orthomcl_dir. Copy doc/OrthoMCLEngine/Main/orthomcl.config.template into my_orthomcl_dir with the following commands (paths are relative to the extracted orthomclSoftware-v2.0.9 folder):

mkdir my_orthomcl_dir
cp doc/OrthoMCLEngine/Main/orthomcl.config.template my_orthomcl_dir/orthomcl.config   # copy and rename to orthomcl.config

Modify orthomcl.config:

# this config assumes a mysql database named 'orthomcl'.  adjust according
# to your situation.
# database vendor: mysql here; change it to oracle if you use Oracle
dbVendor=mysql
# connect to the orthomcl database in mysql
dbConnectString=dbi:mysql:orthomcl
# database user name
dbLogin=wangq
# password for that user
dbPassword=123
# the tables below are generated as intermediate results
similarSequencesTable=SimilarSequences
orthologTable=Ortholog
inParalogTable=InParalog
coOrthologTable=CoOrtholog
interTaxonMatchView=InterTaxonMatch
# percent-match (coverage) cutoff; 50% is used here, choose the value that suits you
percentMatchCutoff=50
# exponent of the BLAST E-value cutoff; the default can stay if the hits were already filtered by E-value in BLAST
evalueExponentCutoff=-5
oracleIndexTblSpc=NONE
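Because a missing key in orthomcl.config only shows up when a later step fails, it can help to check the file first. The following is a minimal sketch, not part of OrthoMCL: the function name missing_config_keys is made up, and the list of expected keys is simply the one shown in the config above.

```shell
# Hypothetical helper: read a config on stdin and print every expected key
# that has no key=value line. The key list is taken from the config above.
missing_config_keys() {
  awk -F= '
    BEGIN {
      keys = "dbVendor dbConnectString dbLogin dbPassword "
      keys = keys "similarSequencesTable orthologTable inParalogTable "
      keys = keys "coOrthologTable interTaxonMatchView "
      keys = keys "percentMatchCutoff evalueExponentCutoff oracleIndexTblSpc"
      n = split(keys, want, " ")
    }
    /^[ \t]*#/ { next }                       # skip comment lines
    NF >= 2 { key = $1; gsub(/[ \t]/, "", key); have[key] = 1 }
    END { for (i = 1; i <= n; i++) if (!(want[i] in have)) print want[i] }
  '
}

# Demo config with two keys left out on purpose.
missing_config_keys <<'EOF'
dbVendor=mysql
dbConnectString=dbi:mysql:orthomcl
dbLogin=wangq
similarSequencesTable=SimilarSequences
orthologTable=Ortholog
inParalogTable=InParalog
coOrthologTable=CoOrtholog
interTaxonMatchView=InterTaxonMatch
percentMatchCutoff=50
evalueExponentCutoff=-5
EOF
```

For the demo input this prints dbPassword and oracleIndexTblSpc. On a real file it would be called as missing_config_keys < orthomcl.config.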

Step-by-step procedure

The run consists of creating the database, converting the sequence format, filtering, alignment, parsing the results and clustering. Each step is described below:

(1) Create the database and its tables

This step uses the config file prepared above to set up MySQL and create some empty tables in the database. Note: before running it, create a new database in MySQL (for example with create database orthomcl); all the data in later steps go into this database.

mysql -u root -p   # log in as root first to create a database named orthomcl
mysql> create database orthomcl;
mysql> grant all on orthomcl.* to 'wangq'@'%';   # give user wangq all privileges on the orthomcl database; % stands for any host
orthomclInstallSchema orthomcl.config mysql.log species   # create the tables as configured in orthomcl.config; mysql.log is the log file (optional) and species is a suffix added to every table name (optional)

(2) Format the OrthoMCL input files

Homologous genes are searched for using the protein sequences of all genes from several species; the data may come from transcriptome assemblies or from database downloads. This step rewrites your pep files into the FASTA format that OrthoMCL requires:

>taxoncode|unique_protein_id
MFAXGETHFD.

Here taxoncode is the species code, generally 3-4 letters, and unique_protein_id is the protein id; the two are separated by |.

For example:

>Dha|CAG25565
MFAXGETHFD.
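A quick way to catch formatting mistakes before the later steps is to list the headers that do not match this pattern. The sketch below is not part of OrthoMCL; the function name noncompliant_headers is made up, and it assumes a taxon code of 3 or 4 letters or digits followed by | and an id without spaces.

```shell
# Hypothetical check: read FASTA on stdin and print every header line that
# does not look like ">taxoncode|protein_id" (3-4 character taxon code).
noncompliant_headers() {
  awk '/^>/ && $0 !~ /^>[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]?[|][^| \t]+$/ { print }'
}

# Demo: the first header is compliant, the other two are not.
noncompliant_headers <<'EOF'
>Dha|CAG25565
MFAXGETHFD
>LongTaxonName|CAG25565
MFAXGETHFD
>Dha CAG25565
MFAXGETHFD
EOF
```

The demo prints the second and third headers. An empty result on a real file, for example noncompliant_headers < Dha.fasta, suggests the headers follow the expected shape.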

The orthomclAdjustFasta program converts FASTA sequence files to the standard OrthoMCL format. Before converting, create a folder named compliantFasta in the my_orthomcl_dir directory and run the program inside it:

mkdir compliantFasta
cd compliantFasta
orthomclAdjustFasta hsa ../Homo_sapiens.NCBI36.53.pep.all.fa 1   # hsa is the species code; the .fa file is the sequence file, stored in my_orthomcl_dir; 1 is the field of the definition line that holds the protein id, which is written after the species code and |

The command above writes hsa.fasta into the compliantFasta directory. Convert the proteome of every species in the same way so that compliantFasta holds one file per species, such as hsa.fasta, Dha.fasta, Ali.fasta and Kla.fasta.
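To make clear what the conversion does to each definition line, here is a rough awk imitation of it. It is only an illustration under assumptions (header fields are split on spaces, tabs or |; the helper name adjust_headers is hypothetical) and is not a replacement for orthomclAdjustFasta.

```shell
# Hypothetical sketch: $1 = taxon code, $2 = 1-based field holding the id.
# Reads FASTA on stdin and rewrites each header to ">taxon|id".
adjust_headers() {
  awk -v taxon="$1" -v field="$2" '
    /^>/ {
      line = $0
      sub(/^>[ \t]*/, "", line)          # drop ">" and any leading blanks
      split(line, f, /[ \t|]+/)          # fields separated by blanks or |
      print ">" taxon "|" f[field]
      next
    }
    { print }                            # sequence lines pass through unchanged
  '
}

# Demo with a made-up description after the id.
adjust_headers Dha 1 <<'EOF'
>CAG25565 some description text
MFAXGETHFD
EOF
```

The demo prints the header >Dha|CAG25565 followed by the unchanged sequence line.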

(3) Filter the sequences

Use the orthomclFilterFasta command to filter the sequences in the compliantFasta folder. The rule recommended by OrthoMCL is a minimum protein length of 10 and a maximum stop codon percentage of 20%. Run from the my_orthomcl_dir directory, the command generates goodProteins.fasta and poorProteins.fasta there; goodProteins.fasta contains the proteins from all species in the compliantFasta folder that passed the filter.

orthomclFilterFasta compliantFasta/ 10 20
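The two thresholds can be pictured with a small awk sketch that labels each protein as good or poor. This illustrates the rule only and makes its own assumptions: stop codons are taken to be written as *, the percentage is computed over the full sequence length, and the function name classify_proteins is hypothetical. The real program may count things differently.

```shell
# Hypothetical sketch: $1 = minimum length, $2 = maximum percent of stops.
# Reads FASTA on stdin and prints "id<TAB>good" or "id<TAB>poor".
classify_proteins() {
  awk -v min_len="$1" -v max_stop="$2" '
    function report(   len, stops, tmp) {
      if (id == "") return               # nothing collected yet
      len = length(seq)
      tmp = seq
      stops = gsub(/[*]/, "", tmp)       # gsub returns the number of "*" removed
      if (len >= min_len && 100 * stops <= max_stop * len) print id "\tgood"
      else print id "\tpoor"
    }
    /^>/ { report(); id = substr($0, 2); seq = ""; next }
    { seq = seq $0 }                     # join multi-line sequences
    END { report() }
  '
}

# Demo with made-up sequences: long enough, too short, too many stops.
classify_proteins 10 20 <<'EOF'
>hsa|P1
MKVLAAGIVGLLLA
>hsa|P2
MKV
>hsa|P3
MK*LA*GI*G
EOF
```

With these inputs hsa|P1 is labelled good, while hsa|P2 (length 3) and hsa|P3 (3 stops in 10 residues, 30%) are labelled poor.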

(4) BLAST alignment

Build a BLAST database from goodProteins.fasta and align the same file against it. Because of the large amount of data the alignment can take a long time; one or two days is normal, so please be patient.

makeblastdb -in goodProteins.fasta -dbtype prot -title orthomcl -parse_seqids -out orthomcl -logfile orthomcl.log   # use goodProteins.fasta as the sequence file to create a BLAST database named orthomcl
blastp -db orthomcl -query goodProteins.fasta -seg yes -out orthomcl.blastout -evalue 1e-5 -outfmt 7 -num_threads 24   # align goodProteins.fasta against the orthomcl database; the result file is orthomcl.blastout (the all-against-all search finds the homologous genes among the target species)

(5) Process the BLAST results

orthomclBlastParser orthomcl.blastout compliantFasta > similarSequences.txt

# orthomclBlastParser reads the BLAST result file from step (4) together with the files in the compliantFasta folder and writes similarSequences.txt, which lists the similar sequence pairs. Columns 1 to 8 of the output are: query_id, subject_id, query_taxon, subject_taxon, evalue_mant, evalue_exp, percent_ident, percent_match.
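As a sanity check on the eight-column layout, the taxon columns (3 and 4) can be used to count how many hits stay within a species and how many cross species. This is a minimal sketch with made-up rows; the function name summarize_hits is hypothetical, and it assumes whitespace-separated columns in the order listed above.

```shell
# Hypothetical helper: read similarSequences-style rows on stdin and count
# hits whose query and subject taxon (columns 3 and 4) are equal or differ.
summarize_hits() {
  awk '
    $3 == $4 { within++; next }
    { between++ }
    END { printf "within-species=%d between-species=%d\n", within, between }
  '
}

# Demo rows (invented values, same column order as described above).
summarize_hits <<'EOF'
hsa|P1 hsa|P2 hsa hsa 1 -50 80 90
hsa|P1 dha|Q1 hsa dha 3.2 -40 65 88
dha|Q1 hsa|P1 dha hsa 3.2 -40 65 88
EOF
```

The demo prints within-species=1 between-species=2. On real data it would be run as summarize_hits < similarSequences.txt.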

(6) Load the similar sequences into the MySQL database

orthomclLoadBlast orthomcl.config similarSequences.txt   # import the data into the database

(7) Find protein pairs

orthomclPairs orthomcl.config orthomcl_pairs.log cleanup=no   # this command works on the tables in the database

(8) Export the data from the MySQL database

orthomclDumpPairsFiles orthomcl.config

This command generates an mclInput file and a pairs folder under my_orthomcl_dir; the pairs folder contains coorthologs.txt, inparalogs.txt and orthologs.txt.

Steps (6), (7) and (8) are all database operations. It does not matter if you do not understand the details; just run them as shown.

(9) Cluster the pairs with mcl

mcl mclInput --abc -I 1.5 -o mclOutput

(10) Extract the mcl result and generate the groups.txt file

orthomclMclToGroups Fungi 1 < mclOutput > groups.txt   # generates groups.txt; the homologous groups are numbered from Fungi1 upwards

This completes the OrthoMCL run. The resulting groups.txt is the final result file and can be used for further analyses, such as extracting single-copy orthologous genes: a group is a set of single-copy orthologs if it contains every species studied and each species contributes exactly one gene.
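The single-copy criterion can be sketched in a few lines of awk. This is an illustration rather than an OrthoMCL tool: the function name single_copy_groups is made up, and it assumes each line of groups.txt has the form "GroupID: taxon|id taxon|id ...".

```shell
# Hypothetical helper: $1 = number of species studied. Reads groups.txt-style
# lines on stdin and prints the IDs of groups in which every species
# contributes exactly one gene.
single_copy_groups() {
  awk -v n="$1" '
    {
      split("", seen)                    # reset per-group species counts
      genes = 0; ok = 1
      for (i = 2; i <= NF; i++) {
        split($i, part, "|")             # part[1] is the taxon code
        if (++seen[part[1]] > 1) { ok = 0; break }
        genes++
      }
      # all genes come from distinct species, so genes == number of species
      if (ok && genes == n) { sub(/:$/, "", $1); print $1 }
    }
  '
}

# Demo with made-up groups for three species (hsa, dha, ali).
single_copy_groups 3 <<'EOF'
Fungi1: hsa|A1 hsa|A2 dha|B1 ali|C1
Fungi2: hsa|A3 dha|B2 ali|C2
Fungi3: hsa|A4 dha|B3
EOF
```

Only Fungi2 is printed: Fungi1 has two hsa genes and Fungi3 lacks ali. On real output the call would be single_copy_groups N < groups.txt, with N set to the number of species in the analysis.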
