seqtrim
Trim sequences based on specified criterion
Syntax
seqtrim(fastqFile)
seqtrim(fastqFile,Name,Value)
[outFiles,nSeqTrimmed,nSeqUntrimmed] = seqtrim(___)
Description
seqtrim(fastqFile) trims the sequences in fastqFile and saves the trimmed sequences in new FASTQ files. By default, the trimmed sequences are saved under file names with the suffix '_trimmed' appended.
If you do not specify a trimming criterion, the function trims sequences using the default criterion, 'MaxNumberLowQualityBases', with its default threshold.
seqtrim(fastqFile,Name,Value) uses additional options specified by one or more Name,Value pair arguments.
[outFiles,nSeqTrimmed,nSeqUntrimmed] = seqtrim(___) returns a cell array outFiles with the names of the output files. nSeqTrimmed and nSeqUntrimmed are the numbers of sequences that were trimmed and left untrimmed in each input file, respectively.
Examples
Trim sequences in a FASTQ file
Trim each sequence when the number of bases with quality below 20 is greater than 3 within a sliding window of size 25.
[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MaxNumberLowQualityBases', ...
    'Threshold',[3 20],'WindowSize',25);
Check the number of sequences that were trimmed.
nt
nt = 36
Check the number of sequences that were untrimmed.
unt
unt = 14
Trim the first 10 bases of each sequence.
[outFile,nt] = seqtrim('SRR005164_1_50.fastq','Method','Termini', ...
    'Threshold',[10 0]);
Trim the last 5 bases of each sequence.
[outFile,nt] = seqtrim('SRR005164_1_50.fastq','Method','Termini', ...
    'Threshold',[0 5]);
Trim each sequence after position 50, keeping bases 1 through 50.
[outFile,nt] = seqtrim('SRR005164_1_50.fastq','Method','BasePositions', ...
    'Threshold',[1 50]);
Trim each sequence when the running average base quality becomes less than 20.
[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MeanQuality', ...
    'Threshold',20)
Trim each sequence when the percentage of bases with quality below 10 is more than 15.
[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MaxPercentLowQualityBases', ...
    'Threshold',[15 10])
Input Arguments
fastqFile
— Names of FASTQ files with sequence and quality information
character vector | string | string vector | cell array of character vectors
Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.
Example: 'SRR005164_1_50.fastq'
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Method','MaxNumberLowQualityBases','Threshold',[3 20] specifies to trim each sequence when the number of bases with quality below 20 is greater than 3.
Method
— Criterion to trim sequences
'MaxNumberLowQualityBases' (default) | 'MaxPercentLowQualityBases' | 'MeanQuality' | 'BasePositions' | 'Termini'
Criterion to trim sequences, specified as one of the following options. Specify only one trimming criterion per function call.
- 'MaxNumberLowQualityBases' – applies a maximum threshold on the number of low-quality bases allowed before trimming a sequence starting at the 5' end.
- 'MaxPercentLowQualityBases' – applies a maximum threshold on the percentage of low-quality bases allowed before trimming a sequence starting at the 5' end.
- 'MeanQuality' – applies a minimum threshold on the running average base quality allowed before trimming a sequence starting at the 5' end.
- 'BasePositions' – trims each sequence according to the base positions (first base and last base) starting at the 5' end.
- 'Termini' – trims each sequence from either the 5' or 3' end or from both ends.
Use this name-value pair argument together with 'Threshold' to specify the appropriate threshold value. The form of the 'Threshold' value depends on the trimming criterion. See the 'Threshold' option for the default values.
Note
Sequences that are empty after trimming are saved in the output files as empty sequences. To remove empty sequences from files, use the seqfilter function with the 'MinLength' option set to 1.
Threshold
— Threshold value for trimming criterion
scalar | vector
Threshold value for the trimming criterion, specified as a scalar or vector. Use this name-value pair to define the threshold value for the trimming criterion specified by 'Method'.
Depending on the trimming criterion, the corresponding value for 'Threshold' can be a scalar or two-element vector. If you do not specify 'Threshold', then the function uses the default threshold value of the corresponding method. For each trimming criterion, the function uses the encoding format of the base quality specified by the 'Encoding' name-value pair argument.
'Method' | 'Threshold' | Default 'Threshold' value |
---|---|---|
'MaxNumberLowQualityBases' | Two-element vector [V1 V2]. V1 is a nonnegative integer that specifies the maximum number of low-quality bases allowed before trimming. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. | [0 10] |
'MaxPercentLowQualityBases' | Two-element vector [V1 V2]. V1 is a scalar between 0 and 100 that specifies the maximum percentage of low-quality bases allowed before trimming. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. | [0 10] |
'MeanQuality' | Positive scalar that specifies the minimum threshold on the running average base quality allowed before trimming a sequence starting at the 5' end. | 0 |
'BasePositions' | Two-element vector [V1 V2]. V1 and V2 are the positions of the first and last bases to keep, counted from the 5' end. To trim only the 5' end, set V2 to Inf. To trim only the 3' end, set V1 to 1. | [1 Inf], that is, each sequence is left untrimmed. |
'Termini' | Two-element vector [V1 V2]. V1 is the number of bases to trim from the 5' end and V2 is the number of bases to trim from the 3' end. To trim V1 bases at the 5' end only, set V2 to 0. To trim V2 bases at the 3' end only, set V1 to 0. | [0 0], that is, each sequence is left untrimmed. |
WindowSize
— Size of sliding window to apply trimming criterion to sequence
Inf (default) | positive integer
Size of the sliding window to apply the trimming criterion to a sequence, specified as a positive integer. The size of the window corresponds to the number of bases that the function uses at one time to apply the criterion. Any given sequence is trimmed before the first base of the window that violates the given criterion.
The sliding window can be applied to the following methods: 'MaxNumberLowQualityBases', 'MaxPercentLowQualityBases', and 'MeanQuality'.
Note
Sequences shorter than the size of the window are saved in the output file as empty sequences. To remove empty sequences from files, use the seqfilter function with the 'MinLength' option set to 1.
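The window rule described above can be pictured with a few lines of plain MATLAB. This is only a sketch of the documented behavior for 'MaxNumberLowQualityBases', not the code that seqtrim runs: the quality vector q and the variable names are made up for illustration, and the scan order (5' to 3', one base at a time) is an assumption.

```matlab
% Illustrative sketch only; not the seqtrim implementation.
% q is a hypothetical vector of decoded base qualities for one read.
q = [30 32 31 8 9 7 30 31 5 6 4 3];
windowSize = 4;     % plays the role of 'WindowSize'
maxLowQual = 2;     % V1: maximum number of low-quality bases allowed
minQuality = 10;    % V2: bases with quality below this are low quality

if numel(q) < windowSize
    keepUpTo = 0;                  % reads shorter than the window end up empty
else
    keepUpTo = numel(q);           % keep everything unless a window violates the rule
    for first = 1:(numel(q) - windowSize + 1)
        window = q(first:first + windowSize - 1);
        if sum(window < minQuality) > maxLowQual
            keepUpTo = first - 1;  % trim before the first base of this window
            break
        end
    end
end
trimmedQ = q(1:keepUpTo);
disp(trimmedQ)
```

In this made-up read, the window starting at base 3 is the first one with more than two low-quality bases, so only bases 1 and 2 are kept.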
Encoding
— Base quality encoding format
'Illumina18' (default) | 'Sanger' | 'Solexa' | 'Illumina13' | 'Illumina15'
Base quality encoding format, specified as a character vector or string.
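As background on why the encoding matters, the same quality score is written with different characters under different FASTQ conventions. The sketch below decodes two hypothetical quality strings by hand, using the commonly documented ASCII offsets (33 for 'Sanger' and 'Illumina18', 64 for 'Illumina13' and 'Illumina15'). It does not show how seqtrim decodes qualities internally, and 'Solexa' is left out because it uses a different score scale.

```matlab
% Hypothetical quality strings; the variable names are for illustration only.
qualPhred33 = 'II5+#';    % as written in a 'Sanger' or 'Illumina18' file
qualPhred64 = 'hhTJB';    % the same scores as written in an 'Illumina13' file

scoresPhred33 = double(qualPhred33) - 33;   % subtract the ASCII offset of 33
scoresPhred64 = double(qualPhred64) - 64;   % subtract the ASCII offset of 64

disp(scoresPhred33)
disp(scoresPhred64)
```

Both strings decode to the scores 40, 40, 20, 10, and 2, so a quality threshold such as 20 is only meaningful when 'Encoding' matches the file.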
OutputDir
— Relative or absolute path to output file directory
character vector | string
Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.
Example: 'OutputDir','F:\results'
OutputSuffix
— Suffix to use in output file name
'_trimmed' (default) | character vector | string
Suffix to use in the output file name, specified as a character vector or string. It is inserted after the input file name and before the file extension. The default is '_trimmed'.
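A short sketch of how 'OutputDir' and 'OutputSuffix' combine, using the sample file from the examples above. The folder name seqtrim_demo and the suffix '_clip10' are arbitrary choices for illustration, and 'OverWrite' is set to true only so that the snippet can be rerun.

```matlab
% Write the trimmed file to a scratch folder with a custom suffix.
outDir = fullfile(tempdir, 'seqtrim_demo');   % hypothetical output folder
if ~exist(outDir, 'dir')
    mkdir(outDir);
end

outFile = seqtrim('SRR005164_1_50.fastq', 'Method', 'Termini', ...
    'Threshold', [10 0], 'OutputDir', outDir, ...
    'OutputSuffix', '_clip10', 'OverWrite', true);

% outFile is a cell array; show the name of the file that was written.
disp(outFile{1})
```

Following the naming rule described above, the output file name should be SRR005164_1_50_clip10.fastq.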
UseParallel
— Boolean indicating whether to perform computation in parallel
false (default) | true
Boolean indicating whether to perform computation in parallel, specified as true or false.
For parallel computing, you must have Parallel Computing Toolbox™. If a parallel pool does not exist, one is created automatically when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in serial mode.
Note
There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.
During parallel computations, the work is divided by files, not by sequences, meaning that, for a single large file, running in parallel does not make a difference.
Example: 'UseParallel',true
OverWrite
— Flag to overwrite existing files
false or 0 (default) | true or 1
Flag to overwrite existing files, specified as a numeric or logical 1 (true) or 0 (false).
When the value is false and a file matching one of the output file names already exists, the function generates an error.
Data Types: double | logical
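The following sketch shows one way to handle the error described above when an output file already exists. The suffix '_ow_demo' and the variable names are illustrative only, and the text of the error message is not documented here, so the snippet just displays whatever message is raised.

```matlab
% Shared arguments; the suffix is a hypothetical name for this demo.
args = {'SRR005164_1_50.fastq', 'Method', 'Termini', 'Threshold', [0 5], ...
        'OutputDir', tempdir, 'OutputSuffix', '_ow_demo'};

seqtrim(args{:}, 'OverWrite', true);   % create (or replace) the output file

try
    seqtrim(args{:});                  % 'OverWrite' defaults to false, so this should error
    errored = false;
catch err
    errored = true;
    disp(err.message)
end

seqtrim(args{:}, 'OverWrite', true);   % explicitly allow replacing the existing file
```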
Output Arguments
outFiles
— Output file names
cell array of character vectors
Output file names, returned as a cell array of character vectors.
nSeqTrimmed
— Number of sequences trimmed from each input file
scalar | vector
Number of sequences trimmed from each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqTrimmed corresponds to the order of the input files.
nSeqUntrimmed
— Number of sequences untrimmed from each input file
scalar | vector
Number of sequences left untrimmed in each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqUntrimmed corresponds to the order of the input files.
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To run in parallel, set 'UseParallel' to true.
For more information, see the 'UseParallel' name-value pair argument.
Version History
Introduced in R2016b