
What's the use of CPAT software?

2025-01-19 Update From: SLTechnology News&Howtos


This article explains what the CPAT software is used for and how to run it.

As high-throughput sequencing has been applied to lncRNA research, more and more lncRNAs have been discovered. For transcriptome sequencing data, the first step after assembling the transcripts is to distinguish protein-coding RNA from non-coding RNA.

Many tools address this problem, and they fall into the following two categories.

Alignment-based

Alignment-free

The first category relies on sequence alignment and is better at identifying conserved protein-coding genes; it includes software such as CPC and PhyloCSF. The second category needs no alignment and instead separates coding from non-coding transcripts by their sequence features; it includes CNCI, CPAT, PLEK and others.

lncRNAs are poorly conserved across species, and the chromosomal locations of some lncRNAs overlap with protein-coding genes, so alignment-based methods are prone to misclassification. Alignment-based software also runs relatively slowly, so tools in the second category tend to perform better overall.

This article mainly introduces the use of CPAT. The website is as follows.

http://lilab.research.bcm.edu/cpat/

Deciding whether a transcript is coding or noncoding is essentially a binary classification problem, so the developers of CPAT chose to solve it with logistic regression. The software builds a logistic regression model on the following four features to distinguish coding from noncoding transcripts.

Open reading frame size

Open reading frame coverage

Fickett TESTCODE statistic

Hexamer usage bias

The first two features are defined on the open reading frame (ORF): the first is the size of the ORF, and the second is the proportion of the total transcript length that the ORF covers. The third feature is based on the base composition and codon distribution of the sequence, and the fourth is based on the frequency of hexamers in the sequence.
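The first two features can be illustrated with a short Python sketch. This is not CPAT's own code: the function names are hypothetical, only the forward strand is scanned, and an ORF is assumed to run from the first ATG through the next in-frame stop codon (stop codon included in the length).

```python
# Illustrative sketch only; not CPAT's implementation.
# Assumptions: forward strand only, ORF = first ATG through the next
# in-frame stop codon, and the stop codon counts toward the ORF length.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Return the length (nt) of the longest ATG..stop ORF in the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_features(seq):
    """Return (ORF size, ORF coverage) for one transcript sequence."""
    size = longest_orf_length(seq)
    coverage = size / len(seq) if seq else 0.0
    return size, coverage


# Toy 19-nt transcript with one ORF: ATG AAA TTT GGG TAA
size, coverage = orf_features("CCATGAAATTTGGGTAACC")
print(size, round(coverage, 3))
```

For the toy transcript, the ORF spans 15 of the 19 nucleotides, so the coverage is 15/19.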

The CPAT authors first evaluated how each of these four features is distributed in coding and noncoding transcripts.

Coding and noncoding transcripts form two distinct peaks, which indicates that the two classes differ in all four features.
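Because each feature separates the two classes, a logistic regression model can combine them into a single coding probability. The sketch below shows only the general form of such a model. The intercept and weights are hypothetical values chosen for illustration; CPAT's actual coefficients are stored in its prebuilt model files and are not reproduced here.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT CPAT's values.
HYPOTHETICAL_INTERCEPT = -4.0
HYPOTHETICAL_WEIGHTS = {
    "orf_size": 0.005,     # ORF length in nucleotides
    "orf_coverage": 2.0,   # ORF length / transcript length
    "fickett": 1.5,        # Fickett TESTCODE statistic
    "hexamer": 3.0,        # hexamer usage bias score
}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def coding_probability(features):
    """Combine the four features linearly and map the result to a probability in (0, 1)."""
    z = HYPOTHETICAL_INTERCEPT + sum(
        weight * features[name] for name, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    return sigmoid(z)


# Made-up feature values for two transcripts.
long_orf = {"orf_size": 1500, "orf_coverage": 0.8, "fickett": 1.1, "hexamer": 0.4}
short_orf = {"orf_size": 90, "orf_coverage": 0.1, "fickett": 0.6, "hexamer": -0.3}
print(coding_probability(long_orf) > coding_probability(short_orf))
```

A transcript is then called coding when its probability exceeds a chosen cutoff; the appropriate cutoff depends on the trained model and the species.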

The CPAT authors also compared the performance of different software using ROC curves.

In that comparison, CPAT and CPC performed best. CPAT is written in Python and is easy to install. The command is as follows

pip install CPAT

The software can be run locally as well as online.

1. Online version

The URL of the online version is as follows

http://lilab.research.bcm.edu/cpat/

You can paste sequences in FASTA format directly or upload a file in BED format. For BED input, you also need to specify the corresponding genome version.

2. Local version

The local version can also be used in two ways. With a BED file as input, the usage is as follows

cpat.py -r /database/hg19.fa \
    -g mRNA_hg19.bed \
    -d dat/Human_logitModel.RData \
    -x dat/Human_Hexamer.tsv \
    -o output.txt

With a FASTA file as input, the usage is as follows

cpat.py -g transcript.fa \
    -d dat/Human_logitModel.RData \
    -x dat/Human_Hexamer.tsv \
    -o output.txt

The files passed to the -d and -x parameters are the prebuilt model files shipped with the software; they are located in its installation directory. The -r parameter (the reference genome) is only needed for BED input, because BED records hold coordinates rather than sequences.

In the output, the last column shows the protein-coding call for each transcript: yes means the transcript is protein-coding, and no means it is noncoding.
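A result table of this kind can be summarized with a few lines of Python. The sketch below assumes a tab-separated file with one header line whose last column holds yes or no, as described above; the column names in the example are hypothetical, and the real output layout may differ between CPAT versions.

```python
import csv
import io


def count_coding_labels(lines):
    """Count yes/no labels in the last column of a tab-separated table with a header row."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header line
    counts = {"yes": 0, "no": 0}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        label = row[-1].strip().lower()
        if label in counts:
            counts[label] += 1
    return counts


# Hypothetical three-transcript table standing in for a real result file.
example = io.StringIO(
    "transcript_id\tcoding_label\n"
    "tx1\tyes\n"
    "tx2\tno\n"
    "tx3\tyes\n"
)
print(count_coding_labels(example))  # {'yes': 2, 'no': 1}
```

To process a real file, pass an open file handle instead of the in-memory example, for instance `open("output.txt", newline="")`.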

