How to realize Fusion Gene Operation in SOAPfuse

2025-07-01 Update From: SLTechnology News&Howtos

This article explains how to run fusion gene detection with SOAPfuse, which many people may not be familiar with. The workflow needs four things prepared in advance: a reference database, a sample list, a sample sequence directory, and a configuration file. Each is covered below.

1. Reference database

Like other fusion detection software, SOAPfuse needs a reference database built for the species being analyzed. The source directory of the installation provides a script that builds this database. The usage is as follows:

    perl SOAPfuse-S00-Generate_SOAPfuse_database.pl \
        -wg hg19.fa \
        -gtf Homo_sapiens.GRCh37.85.chr.gtf \
        -cbd cytoBand.txt.gz \
        -gf HGNC_Gene_Family.tsv \
        -sd /software/SOAPfuse-v1.27 \
        -dd /database/hg19 \
        -rft chr.gtp

The -wg parameter is the fasta file of the genome, -gtf is the gene annotation file in gtf format, -cbd is the cytoband file downloaded from UCSC, -gf is the gene family file downloaded from HGNC, -sd is the installation directory of the software, -dd is the directory in which the database is written, and -rft is a file that maps the chromosome names used in the gtf file to the chromosome names used in the fasta file.

For the files that need to be downloaded from public databases, the help message of the script gives detailed instructions, so they are not repeated here. The rft file contains two columns separated by a tab (\t). An example is as follows:

    1	chr1
    2	chr2

The first column is the chromosome name used in the gtf file, and the second column is the corresponding chromosome name in the fasta file.
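The mapping file can be generated rather than typed by hand. The following is a minimal sketch, assuming the gtf file uses bare names (1, 2, ..., X, Y) and the fasta file uses the same names with a chr prefix; the function name build_rft_lines is hypothetical and not part of SOAPfuse. Names that do not follow this simple rule (for example a mitochondrial chromosome named differently in the two files) would need to be added separately.

```python
def build_rft_lines(gtf_names, prefix="chr"):
    """Return 'gtf_name<TAB>fasta_name' lines for the rft file.

    Assumption: every fasta name is just the gtf name with `prefix` prepended.
    """
    return [f"{name}\t{prefix}{name}" for name in gtf_names]


# Autosomes 1..22 plus the sex chromosomes, as named in the gtf file (assumed).
gtf_names = [str(i) for i in range(1, 23)] + ["X", "Y"]
rft_lines = build_rft_lines(gtf_names)

# To write the file passed to -rft:
# with open("chr.gtp", "w") as handle:
#     handle.write("\n".join(rft_lines) + "\n")
```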

2. Sample list

SOAPfuse reads the sample information from a sample.list file. The file has four columns separated by tabs (\t): the first column is the sample name, the second is the lane ID, the third is the run ID, and the fourth is the read length. Each sample needs a lane ID and a run ID because one sample may be sequenced on several lanes. Data from multiple lanes of the same sample need to be merged.
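Before starting a run it can help to check that every row of sample.list really has four tab-separated fields. Below is a minimal sketch of such a check; the SampleRow type and parse_sample_list function are hypothetical helpers, not part of SOAPfuse, and the column names simply follow the description above.

```python
from typing import NamedTuple


class SampleRow(NamedTuple):
    sample: str
    lane_id: str
    run_id: str
    read_length: int


def parse_sample_list(text):
    """Parse the contents of a sample.list file into SampleRow tuples."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 tab-separated columns, got {len(fields)}"
            )
        sample, lane_id, run_id, read_length = fields
        # int() raises ValueError if the read length is not a number.
        rows.append(SampleRow(sample, lane_id, run_id, int(read_length)))
    return rows


example = "A1\tLib-A1\tRun-A1\t150\nA2\tLib-A2\tRun-A2\t150\n"
rows = parse_sample_list(example)
```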

In practice, we usually have only one pair of R1 and R2 files per sample, so the lane ID and run ID can be chosen arbitrarily. Here is a practical example with 6 samples in total.

    A1	Lib-A1	Run-A1	150
    A2	Lib-A2	Run-A2	150
    A3	Lib-A3	Run-A3	150
    B1	Lib-B1	Run-B1	150
    B2	Lib-B2	Run-B2	150
    B3	Lib-B3	Run-B3	150

3. Sample sequence directory

The sample.list file only provides information such as the sample name, but the analysis also needs the path of the sequencing data for each sample. SOAPfuse finds the data through a fixed directory structure.

The sequencing data of all samples sit under one top-level directory, called WHOLE_SEQ-DATA_DIR. Under this directory, each sample has its own subdirectory, whose name must match the sample name in the sample.list file. Each sample directory contains one subdirectory per lane ID, and each lane ID directory contains the raw data of the sample, with file names prefixed by the run ID.

The sequencing data must be gzip-compressed, and both fasta and fastq formats are supported. Each file name must start with the corresponding run ID, and the two ends of paired-end data are distinguished by _1 and _2. The suffix only needs to be the same for all samples, and the specific suffix can be set in the configuration file.
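The layout described above can be turned into a small path builder, which is handy for checking that the files are where SOAPfuse expects them. This is a sketch under one assumption that the text does not state explicitly: that the file name is the run ID, then _1 or _2, then a dot, then the suffix (for example Run-A1_1.fq.gz). The function expected_read_files is a hypothetical helper.

```python
from pathlib import PurePosixPath


def expected_read_files(seq_root, sample, lane_id, run_id, postfix="fq.gz"):
    """Return the two paired-end file paths for one row of sample.list.

    Layout: <seq_root>/<sample>/<lane_id>/<run_id>_<1|2>.<postfix>
    (the dot before the postfix is an assumption).
    """
    lane_dir = PurePosixPath(seq_root) / sample / lane_id
    return [lane_dir / f"{run_id}_{mate}.{postfix}" for mate in (1, 2)]


paths = expected_read_files("raw_data", "A1", "Lib-A1", "Run-A1")
for path in paths:
    print(path)
```

On a real system, each returned path could be wrapped in pathlib.Path and tested with exists() before launching the analysis.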

4. Config

The config directory of the installation contains a template configuration file called config.txt. It needs to be edited, mainly the following entries:

    DB_db_dir = /software/SOAPfuse-v1.27/db/
    PG_pg_dir = /software/SOAPfuse-v1.27/source/bin
    PS_ps_dir = /software/SOAPfuse-v1.27/source
    PA_all_fq_postfix = fq.gz

DB_db_dir is the directory of the database built in the first step. PG_pg_dir and PS_ps_dir only need to be changed to the source/bin and source directories of the actual SOAPfuse installation. PA_all_fq_postfix is the suffix of the sequencing data file names. The default is fq.gz.
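When the same template has to be adapted on several machines, the edit can be scripted. The following is a minimal sketch, assuming that each entry in config.txt is a single line of the form KEY = value, as in the excerpt above; set_config_values is a hypothetical helper, and the two-line template here is a stand-in, not the real config.txt.

```python
import re


def set_config_values(config_text, updates):
    """Return config_text with the given KEY = value lines replaced.

    Raises KeyError if a requested key does not occur in the template.
    """
    remaining = dict(updates)
    out = []
    for line in config_text.splitlines():
        match = re.match(r"^\s*([A-Za-z0-9_]+)\s*=", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            out.append(f"{key} = {remaining.pop(key)}")
        else:
            out.append(line)  # keep comments and other entries untouched
    if remaining:
        raise KeyError(f"keys not found in template: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Stand-in template for illustration only.
template = "DB_db_dir = /path/to/db\nPA_all_fq_postfix = fq.gz\n"
patched = set_config_values(template, {"DB_db_dir": "/software/SOAPfuse-v1.27/db/"})
```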

Once the four items above are ready, the analysis can be run as follows:

    perl SOAPfuse-RUN.pl \
        -c config.txt \
        -fd raw_data \
        -l sample.list \
        -o out_dir

-c specifies the configuration file, -fd specifies the directory where the sequencing data is stored, -l specifies the sample.list file, and -o specifies the output directory of the results.
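If the run is launched from a pipeline script instead of the shell, the same command can be assembled as an argument list. This is only a sketch: soapfuse_run_command is a hypothetical wrapper, it uses just the four options described above, and it assumes SOAPfuse-RUN.pl is reachable at the given script path.

```python
import shlex


def soapfuse_run_command(config, seq_dir, sample_list, out_dir,
                         script="SOAPfuse-RUN.pl"):
    """Build the argument list for the SOAPfuse run described in the text."""
    return [
        "perl", script,
        "-c", config,        # configuration file
        "-fd", seq_dir,      # top-level sequencing data directory
        "-l", sample_list,   # sample.list file
        "-o", out_dir,       # output directory
    ]


cmd = soapfuse_run_command("config.txt", "raw_data", "sample.list", "out_dir")
print(shlex.join(cmd))

# To actually execute it:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.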

That covers the four pieces SOAPfuse needs for fusion gene detection: the reference database, the sample list, the sample sequence directory, and the configuration file.
