How to remove adapter sequences with cutadapt

2025-01-27 Update From: SLTechnology News&Howtos shulou

Shulou (Shulou.com) 06/01

This article explains how to remove adapter sequences with cutadapt, and also covers its quality and length filtering options.

In NGS data analysis, the first step is quality control, which includes removing adapter sequences and low-quality sequences. During library construction, adapter sequences are added to both ends of the insert so that it can be sequenced on the instrument. When the read length exceeds the insert length, the read runs into the adapter.

The adapter is an artificial sequence, and we are interested only in the sequence of the insert, so the adapter must be removed first. Two factors need to be considered when removing adapter sequences:

Because of sequencing errors, the adapter as read may differ from the true adapter sequence by a few bases, so mismatches must be tolerated when removing the adapter.

Because the insert length varies within a range, and adapters flank the insert at both ends, a read may contain only part of the adapter sequence.

cutadapt is a quality-control tool for NGS data. It removes both 5' and 3' adapters, and it can also trim low-quality bases and discard reads that are too short.

The software is written in Python and is easy to install with pip:

pip install cutadapt

1. Removal of 3' adapter sequences

For a 3' adapter, the following cases can occur

In the figure, the green part is the adapter sequence and the gray part is the sequence that cutadapt removes. Whether the read contains only part of the adapter or the complete adapter, cutadapt removes the 3' adapter together with any sequence that follows it.

The usage is as follows

cutadapt -a AACCGGTT -o output.fastq input.fastq

For paired-end data, which is now the mainstream, adapters appear only at the 3' end of the reads: the 3' adapter may appear at the 3' end of R1, and the reverse complement of the 5' adapter may appear at the 3' end of R2.

Note that no adapter appears at the 5' end of either R1 or R2, because sequencing starts directly at the insert. For paired-end data, you only need to remove the 3' adapter from the R1 reads and from the R2 reads.
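The adapter expected in R2 is described above as a reverse complement. A minimal Python sketch of that operation follows; the function name and the example sequence are illustrative assumptions, and only plain A/C/G/T/N letters are handled.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the resulting string.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AAACCG"))  # CGGTTT
```

The result is the sequence you would pass as the adapter when trimming the R2 reads, assuming you started from the 5' adapter written in its own orientation.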

2. Removal of 5' adapter sequences

cutadapt also supports removing 5' adapter sequences. Although a 5' adapter does not appear in the sequencing reads, the concept of an adapter can be extended here, for example to PCR primer sequences. In some sequencing strategies, the target fragment is first amplified by PCR and the library is built afterwards. This usage comes in handy if you want to remove the PCR primer at the 5' end of the insert.

For a 5' adapter, the following cases can occur

In the figure, the green part is the adapter sequence and the gray part is the sequence that cutadapt removes. Whether the read contains only part of the adapter or the complete adapter, cutadapt removes the 5' adapter together with any sequence that precedes it.

The usage is as follows

cutadapt -g AACCGGTT -o output.fastq input.fastq

When searching for adapters, cutadapt also provides an anchored mode, in which the adapter is removed only if it occurs in full at the very start of the read (5' adapter) or at the very end of the read (3' adapter).

The anchored 3' adapter is written as follows

cutadapt -a AACCGGTT$ -o output.fastq input.fastq

The anchored 5' adapter is written as follows

cutadapt -g ^AACCGGTT -o output.fastq input.fastq
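The idea behind anchoring can be illustrated with a small Python sketch. This is a simplification and not cutadapt's implementation: it uses exact string matching only, whereas cutadapt still applies its error tolerance in anchored mode. The function names are hypothetical.

```python
def trim_anchored_5prime(read: str, adapter: str) -> str:
    """Remove the adapter only if the read starts with the full adapter."""
    if adapter and read.startswith(adapter):
        return read[len(adapter):]
    return read


def trim_anchored_3prime(read: str, adapter: str) -> str:
    """Remove the adapter only if the read ends with the full adapter."""
    if adapter and read.endswith(adapter):
        return read[:-len(adapter)]
    return read


# Full adapter at the start: trimmed.
print(trim_anchored_5prime("AACCGGTTACGT", "AACCGGTT"))
# Adapter shifted by one base: left untouched in anchored mode.
print(trim_anchored_5prime("GAACCGGTTACGT", "AACCGGTT"))
# Only part of the adapter at the end: left untouched in anchored mode.
print(trim_anchored_3prime("ACGTAACCGG", "AACCGGTT"))
```

The contrast with the regular mode is that a partial adapter, or an adapter that is not flush with the end of the read, is not trimmed.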

When searching for adapters, cutadapt has two default behaviors

1. Mismatches, insertions and deletions are allowed by default

Assume the adapter sequence is ADAPTER and consider the following three cases:

ADABTER (one mismatch), ADAPTR (one deletion), ADAPPTER (one insertion)

cutadapt treats each of them as the adapter sequence and removes it. Use the -e parameter to set the maximum error rate; the default is -e 0.1. For example, if the adapter is 21 bases long, the allowed number of errors is 21 * 0.1 = 2.1, which is rounded down to 2. Use the --no-indels parameter to disallow insertions and deletions.
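The arithmetic above can be checked with a short Python sketch. The helper names are hypothetical, and the edit-distance function is a textbook Levenshtein distance used here only to confirm that each variant is one edit away from ADAPTER; it is not cutadapt's alignment code. The sketch covers a full-length adapter match only.

```python
def allowed_errors(adapter_length: int, max_error_rate: float = 0.1) -> int:
    """Error budget for a full-length match: length * rate, rounded down."""
    return int(adapter_length * max_error_rate)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


print(allowed_errors(21))  # 2
for variant in ("ADABTER", "ADAPTR", "ADAPPTER"):
    print(variant, edit_distance("ADAPTER", variant))  # each is 1 edit away
```

Note that the 7-letter toy adapter gets a budget of `allowed_errors(7)`, which is 0 at the default rate, so the three variants illustrate the kinds of error rather than what a 7-base adapter would tolerate at -e 0.1. A higher rate such as 0.2 would allow one error at that length.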

2. Partial matching is allowed by default

cutadapt allows partial matches by default. For example, if the adapter is ADAPTER and a read ends in GATGCTAD, the trailing AD matches the first two letters of the adapter and would be cut off, even though it may be genuine insert sequence that matches only by chance. To prevent this kind of false trimming, cutadapt by default requires at least 3 bases of overlap between the read and the adapter before it treats the match as an adapter and removes it. This threshold can be set with the --overlap parameter.
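The effect of the minimum overlap can be sketched in Python. This is an illustration under simplifying assumptions, not cutadapt's algorithm: it matches exactly (no errors allowed), handles only a regular 3' adapter, and the function name is hypothetical.

```python
def trim_3prime_exact(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Trim a 3' adapter that occurs in full inside the read or as a
    prefix at the read's end, requiring at least min_overlap bases."""
    for start in range(len(read)):
        tail = read[start:]
        overlap = min(len(tail), len(adapter))
        if overlap < min_overlap:
            break  # the remaining tails are even shorter
        if tail[:overlap] == adapter[:overlap]:
            return read[:start]  # drop the adapter and everything after it
    return read


adapter = "ADAPTER"
# Two matching letters at the end: kept with the default overlap of 3.
print(trim_3prime_exact("ATCGATGCTAD", adapter))
# Same read with the threshold lowered to 2: the trailing AD is cut off.
print(trim_3prime_exact("ATCGATGCTAD", adapter, min_overlap=2))
# Four matching letters at the end: trimmed with the default overlap.
print(trim_3prime_exact("ATCGATGCTADAP", adapter))
```

Raising the threshold trades sensitivity for safety: fewer chance matches are trimmed, but very short adapter remnants at the read end are left in place.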

cutadapt also supports trimming based on quality, used as follows

cutadapt -q 10 -o output.fastq input.fastq

Low-quality bases usually appear at the 3' end of reads. The command above trims low-quality bases from the 3' end with a quality threshold of 10. The calculation works as follows. Assume the quality values of a read are

42, 40, 26, 27, 8, 7, 11, 4, 2, 3

With a threshold of -q 10, first subtract 10 from every value

32, 30, 16, 17, -2, -3, 1, -6, -8, -7

Then compute the partial sums starting from the 3' end, which gives the following values

(70), (38), 8, -8, -25, -23, -20, -21, -15, -7

The values in parentheses do not actually need to be computed, because the summation stops as soon as the sum becomes greater than zero (here at 8). -25 is the minimum, so the read is cut at that position: the first 4 bases are kept, and the bases from that position onward are treated as low quality and removed.
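The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of the procedure as described in this article (subtract the threshold, sum from the 3' end, cut at the minimum); the function name is hypothetical and the code is not taken from cutadapt.

```python
def quality_trim_index(qualities, cutoff):
    """Return how many bases to keep from the 5' end of the read."""
    running = 0                 # partial sum of (quality - cutoff) from the 3' end
    lowest = 0                  # smallest partial sum seen so far
    keep = len(qualities)       # by default keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:
            break               # sum became positive: stop early
        if running < lowest:
            lowest = running
            keep = i            # cut here: bases i and onward are removed
    return keep


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(quals, 10)
print(keep)          # 4
print(quals[:keep])  # [42, 40, 26, 27]
```

A read whose last base is already above the threshold is returned untouched, and a read in which every base is below the threshold is trimmed to length 0.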

cutadapt can also filter reads by length. The -m parameter sets the minimum length, and reads shorter than it are discarded; the -M parameter sets the maximum length, and reads longer than it are discarded.
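The length filter amounts to a simple range check, sketched below in Python. The function and parameter names are hypothetical; `min_len` and `max_len` stand in for the values given to -m and -M, and a missing maximum means no upper limit.

```python
def passes_length_filter(read: str, min_len: int = 0, max_len=None) -> bool:
    """True if the read is kept: min_len <= len(read) <= max_len."""
    if len(read) < min_len:
        return False            # too short
    if max_len is not None and len(read) > max_len:
        return False            # too long
    return True


reads = ["ACGT", "ACGTACGTAC", "ACGTACGTACGTACGTACGT"]
kept = [r for r in reads if passes_length_filter(r, min_len=5, max_len=15)]
print(kept)  # ['ACGTACGTAC']
```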
