Understanding STAR, an Alignment Tool for Transcriptome Data

2025-04-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many newcomers are unsure how STAR, the alignment tool for transcriptome data, works. This article explains its basic usage in detail.

STAR is an aligner designed specifically for RNA-seq data. It is very fast, and its biggest advantage is high sensitivity. GATK recommends aligning with STAR before downstream SNP analysis. The source code is hosted on GitHub at the following address:

https://github.com/alexdobin/STAR

The installation process is as follows

wget https://github.com/alexdobin/STAR/archive/2.6.1b.tar.gz
tar xzvf 2.6.1b.tar.gz

After extraction, a precompiled STAR executable is available in the bin/Linux_x86_64_static directory. Unlike tools such as HISAT, STAR bundles all of its functions into a single program and selects the task to perform with the --runMode option.

1. Construction of Genome Index

Before aligning reads, you need to build an index of the genome. The corresponding runMode is genomeGenerate. The basic usage is as follows:

STAR --runMode genomeGenerate \
--runThreadN 20 \
--genomeFastaFiles hg19.fasta \
--genomeDir hg19_STAR_db \
--sjdbGTFfile hg19.gtf \
--sjdbOverhang 149

Building the index requires the genome FASTA file and a GTF annotation file, given with --genomeFastaFiles and --sjdbGTFfile respectively. The output directory given with --genomeDir must be created beforehand and must be writable, because STAR generates many files in it. --runThreadN sets the number of threads. --sjdbOverhang defaults to 100; in practice the ideal value is max(read_length) - 1.
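The max(read_length) - 1 rule can be computed directly from the reads. The following is a minimal sketch, assuming an uncompressed FASTQ file with the standard four lines per record; the function names and the file name reads.fq are illustrative and not part of STAR.

```shell
# Print the longest read length in an uncompressed FASTQ file.
# Assumes the standard 4-line record layout (sequence on line 2 of each record).
max_read_length() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

# Print a suggested --sjdbOverhang value: max(read_length) - 1.
suggest_sjdb_overhang() {
  echo $(( $(max_read_length "$1") - 1 ))
}

# Example (hypothetical file name):
#   suggest_sjdb_overhang reads.fq
```

For reads of up to 150 bp this prints 149, the value used in the command above.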

Index building also accepts files of known intron intervals (splice junctions). These files are given with --sjdbFileChrStartEnd, and multiple files are listed separated by spaces. Files in this format are produced by STAR alignment runs and are typically used in 2-pass alignment mode.

The STAR documentation recommends using the primary_assembly version of the genome FASTA, which should not contain alt_scaffold sequences or patches. For human, the NCBI link is as follows:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_assembly_structure/Primary_Assembly/

The link to Ensembl is as follows

ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

2. Running the Alignment

STAR accepts input files in FASTA or FASTQ format. If the sequence files are compressed, use the --readFilesCommand parameter to specify the command that decompresses them to standard output. For gzip-compressed files, there are two ways to write this:

--readFilesCommand zcat
--readFilesCommand gunzip -c
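When a pipeline handles both compressed and uncompressed inputs, the option can be chosen from the file name. This is a minimal sketch, assuming that gzip-compressed files end in .gz and bzip2-compressed files end in .bz2; the helper name read_files_command is hypothetical.

```shell
# Print the --readFilesCommand arguments suited to an input file.
# Assumption: the compression type is reflected in the file extension.
read_files_command() {
  case "$1" in
    *.gz)  echo "--readFilesCommand zcat" ;;
    *.bz2) echo "--readFilesCommand bunzip2 -c" ;;
    *)     echo "" ;;   # uncompressed input: no option needed
  esac
}

# Example (the unquoted substitution is intentional, so the words split into arguments):
#   STAR --genomeDir hg19_STAR_db --readFilesIn reads.fq.gz $(read_files_command reads.fq.gz)
```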

When the alignment finishes, STAR writes many output files, which fall into four categories:

Log files

SAM file

BAM file

Splice junction file

Each output file has a predefined name. To tell samples apart when several are processed at once, set a prefix for the output files with --outFileNamePrefix. The first three file types are self-explanatory. The splice junction file contains the intron intervals inferred from the alignments; its default name is SJ.out.tab.
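SJ.out.tab is a tab-separated table with one junction per line, so it is easy to filter with awk. As a minimal sketch, the function below keeps junctions that are absent from the annotation (column 6 equal to 0) and supported by at least a chosen number of uniquely mapped reads (column 7). The column meanings follow the STAR manual; the function name and the default threshold of 3 are illustrative assumptions.

```shell
# Print unannotated junctions supported by at least MIN uniquely mapped reads.
# SJ.out.tab columns used here: 1 chromosome, 2 intron start, 3 intron end,
# 6 annotation flag (0 = unannotated, 1 = annotated), 7 uniquely mapped read count.
# Usage: novel_junctions SJ_FILE [MIN]
novel_junctions() (
  sj_file=$1
  min_reads=${2:-3}   # hypothetical default threshold
  awk -F '\t' -v min="$min_reads" '$6 == 0 && $7 >= min' "$sj_file"
)

# Example (file name follows the prefix convention described above):
#   novel_junctions sampleASJ.out.tab 5
```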

By default the alignments are written in SAM format. To save disk space and simplify downstream analysis, request BAM output with the --outSAMtype parameter. It takes two values: the first sets the file type (SAM or BAM), and the second sets the sort order (Unsorted or SortedByCoordinate), for example:

--outSAMtype BAM SortedByCoordinate

This setting produces a coordinate-sorted BAM file.

The basic usage for single-end data is as follows:

STAR \
--runThreadN 20 \
--genomeDir hg19_STAR_db \
--readFilesIn reads.fq \
--sjdbGTFfile hg19.gtf \
--sjdbOverhang 149 \
--outFileNamePrefix sampleA \
--outSAMtype BAM SortedByCoordinate

The basic usage for paired-end data is as follows:

STAR \
--runThreadN 20 \
--genomeDir hg19_STAR_db \
--readFilesIn r1.fq.gz r2.fq.gz \
--readFilesCommand zcat \
--sjdbGTFfile hg19.gtf \
--sjdbOverhang 149 \
--outFileNamePrefix sampleA \
--outSAMtype BAM SortedByCoordinate
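Once a run like the ones above finishes, the summary log Log.final.out (written with the chosen prefix, for example sampleALog.final.out) reports the mapping statistics. The sketch below pulls out the uniquely mapped percentage. It assumes the usual "label | value" layout of that file, and the function name is hypothetical.

```shell
# Print the "Uniquely mapped reads %" value from a STAR Log.final.out file.
# Assumes each statistic is written as "<label> | <value>" on its own line.
uniquely_mapped_pct() {
  awk -F '|' '/Uniquely mapped reads %/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Example:
#   uniquely_mapped_pct sampleALog.final.out
```

A low percentage is often a first hint that the wrong genome, annotation, or read orientation was used.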

The commands above perform only basic alignment. The STAR authors recommend 2-pass mode, in which the reads are aligned twice. There are two ways to do this:

Multi-sample 2-pass

The first pass is run exactly as shown above, and each sample produces an intron interval file, SJ.out.tab. Before the second pass, the genome index is rebuilt with the SJ.out.tab files of all samples added, and every sample is then realigned against the new index. Because this pools junction information across samples, sensitivity is higher; the drawback is that the procedure is more cumbersome.
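The index rebuild between the two passes can be sketched as a small shell function. This is a sketch under stated assumptions rather than a definitive pipeline: the function name and the new index directory are hypothetical, hg19.fasta and hg19.gtf are the files used earlier in this article, and the junction files are passed as separate arguments to --sjdbFileChrStartEnd.

```shell
# Rebuild the genome index, adding junctions discovered in the first pass.
# Usage: rebuild_index_with_junctions NEW_INDEX_DIR SJ_FILE...
rebuild_index_with_junctions() (
  new_index=$1
  shift
  mkdir -p "$new_index"   # the index directory must exist before STAR runs
  STAR --runMode genomeGenerate \
    --runThreadN 20 \
    --genomeFastaFiles hg19.fasta \
    --genomeDir "$new_index" \
    --sjdbGTFfile hg19.gtf \
    --sjdbOverhang 149 \
    --sjdbFileChrStartEnd "$@"
)

# Example (hypothetical sample names):
#   rebuild_index_with_junctions hg19_STAR_db_pass2 sampleASJ.out.tab sampleBSJ.out.tab
```

The second pass then repeats the alignment command for each sample with --genomeDir pointing at the new directory.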

Per-sample 2-pass

For a single sample, add the --twopassMode Basic parameter to the alignment command. STAR then runs both passes automatically: it inserts the junctions from the first pass's SJ.out.tab into the index and realigns the reads. This is simple to run and well suited to 2-pass alignment of an individual sample.
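Put together with the paired-end options shown earlier, a per-sample 2-pass run might look like the sketch below. The wrapper name align_two_pass is hypothetical; the thread count and index directory simply reuse the values from the earlier commands.

```shell
# Paired-end alignment using STAR's built-in two-pass mode.
# Usage: align_two_pass PREFIX R1.fq.gz R2.fq.gz
align_two_pass() {
  STAR \
    --runThreadN 20 \
    --genomeDir hg19_STAR_db \
    --readFilesIn "$2" "$3" \
    --readFilesCommand zcat \
    --twopassMode Basic \
    --outFileNamePrefix "$1" \
    --outSAMtype BAM SortedByCoordinate
}

# Example:
#   align_two_pass sampleA r1.fq.gz r2.fq.gz
```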
