seqsplit
Split sequences into separate files based on barcodes
Description
seqsplit(fastqFile,barcodeFile) splits sequences in fastqFile according to the barcodes in barcodeFile and saves the sequences in separate files. By default, the output file name consists of the input file name followed by the output suffix ('_split') and the barcode identifier. When WriteUnmatched is true, sequences that do not match any provided barcodes, or that match multiple barcodes ambiguously, are saved in a file with the suffix '_unmatched' instead of the barcode identifier.
seqsplit(___,Name,Value) uses additional options specified by one or more Name,Value pair arguments.
Examples
Split sequences into separate files based on barcodes
Create a tab-delimited file with barcode IDs and barcode sequences.
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
    'Delimiter', '\t', 'WriteVariableNames', false);
Split sequences into separate output files based on the barcode sequences. By default, the function assumes that the barcode is located at the 5' end of each sequence, and no mismatches are allowed during barcode matching.
[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt');
Check the number of sequences in each output file after splitting.
N
N = 3×1
2
1
1
Allow up to two mismatches during the barcode matching.
[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt', ...
    'MaxMismatches', 2, 'OutputSuffix', '_MM2_split');
N
N = 3×1
5
9
5
Input Arguments
fastqFile
— Names of FASTQ files with sequence and quality information
character vector | string | string vector | cell array of character vectors
Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.
Example: 'SRR005164_1_50.fastq'
barcodeFile
— Name of barcode file with barcode information
character vector | string
Name of the barcode file, specified as a character vector or string. The file must be tab-delimited and contain barcode IDs and barcode sequences. Each ID must be followed by a barcode sequence, and all barcode sequences must have the same length.
Example: 'barcodeExample.txt'
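The format requirements above can be checked before calling seqsplit. The following is a minimal sketch that uses only core MATLAB functions. It assumes a header-less, two-column, tab-delimited file like the one created in the Examples section. The variable names ids, seqs, and lens are illustrative and not part of seqsplit.

```matlab
% Create a small tab-delimited barcode file (same content as in the Examples section).
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
    'Delimiter', '\t', 'WriteVariableNames', false);

% Read the file back as text columns; it has no header row.
T = readtable('barcodeExample.txt', 'FileType', 'text', ...
    'Delimiter', '\t', 'ReadVariableNames', false);
ids  = T{:, 1};   % barcode IDs (illustrative variable name)
seqs = T{:, 2};   % barcode sequences (illustrative variable name)

% Every barcode sequence must have the same length.
lens = cellfun(@length, seqs);
assert(all(lens == lens(1)), 'Barcode sequences must all have the same length.');
```

A check like this fails early with a clear message if the barcode file is malformed, rather than relying on whatever error the splitting step might report.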
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'MaxMismatches',2 specifies to allow up to 2 mismatches during barcode matching.
MaxMismatches
— Maximum number of mismatches allowed during barcode matching
0 (default) | nonnegative integer
Maximum number of mismatches allowed during barcode matching, specified as a nonnegative integer. The default is 0, that is, no mismatches are allowed.
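To illustrate what a mismatch budget means, the sketch below counts position-wise differences between a barcode and the leading bases of a read. It is not the seqsplit implementation: it assumes that mismatches are counted as substitutions at corresponding positions, and the read sequence and variable names are made up for illustration.

```matlab
barcode = 'AGATT';            % barcode sequence, as listed in the barcode file
read    = 'AGCTTGGTACCA';     % hypothetical read with the barcode expected at the 5' end
maxMismatches = 1;            % plays the role of the MaxMismatches value

% Compare the barcode with the first numel(barcode) bases of the read.
prefix = read(1:numel(barcode));
nMismatch = sum(prefix ~= barcode);   % number of differing positions

% The read is assigned to this barcode only if it stays within the budget.
isMatch = nMismatch <= maxMismatches;
fprintf('%d mismatch(es); match = %d\n', nMismatch, isMatch);
```

Here the prefix differs from the barcode in one position, so the read is accepted with a budget of 1 but would be rejected with the default of 0.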
BarcodeFormat
— Type of barcode to match
5 (default) | 3
Type of barcode to match, specified as 3 or 5. A value of 5 corresponds to the barcode located at the 5' end of each sequence, and 3 corresponds to the 3' end.
Example:
RemoveBarcode
— Whether to remove the barcode
true (default) | false
Whether to remove the barcode and corresponding quality information from the matched sequences, specified as true or false. The default is true.
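The effect of the barcode location and of barcode removal can be sketched on a single read. The code below is an illustration under stated assumptions, not the function's internal logic: the sequence, the quality string, and all variable names are hypothetical.

```matlab
seq  = 'AAAACGGTACCATTG';     % hypothetical read
qual = 'IIIIIHHHHGGGFFF';     % hypothetical quality string, same length as seq
L = 5;                        % barcode length
barcodeFormat = 5;            % 5: barcode at the 5' end, 3: barcode at the 3' end

% Select the positions where the barcode is expected.
if barcodeFormat == 5
    idx = 1:L;                                % leading bases
else
    idx = numel(seq)-L+1:numel(seq);          % trailing bases
end
barcodeInRead = seq(idx);

% Removing the barcode drops the same positions from the sequence and the quality string.
keep = true(1, numel(seq));
keep(idx) = false;
trimmedSeq  = seq(keep);
trimmedQual = qual(keep);
```

Trimming both strings with the same index keeps the sequence and its quality scores aligned, which the FASTQ format requires.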
WriteUnmatched
— Whether to save unmatched sequences
false (default) | true
Whether to save unmatched sequences and corresponding quality information in a separate output file, specified as true or false. The output file name has the suffix '_unmatched' instead of the barcode ID.
OutputDir
— Relative or absolute path to output file directory
character vector | string
Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.
Example: 'OutputDir','F:\results'
OutputSuffix
— Suffix to use in output file name
'_split' (default) | character vector | string
Suffix to use in the output file name, specified as a character vector or string. It is inserted after the input file name and before the barcode ID. The default is '_split'.
UseParallel
— Whether to perform computation in parallel
false (default) | true
Whether to perform computation in parallel, specified as true or false.
For parallel computing, you must have Parallel Computing Toolbox™. If a parallel pool does not exist, one is created automatically when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in serial mode.
Note
There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.
Example: 'UseParallel',true
Output Arguments
outFiles
— Output file names
cell array of character vectors
Output file names, returned as a cell array of character vectors.
By default, the name of each output file consists of the input file name followed by the output suffix ('_split') and the barcode identifier.
N
— Numbers of sequences saved in each output file
scalar | vector
Numbers of sequences saved in each output file, returned as a scalar or an n-by-1 vector, where n is the number of output files. If there are multiple output files, the order within N corresponds to the order of the output files.
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To run in parallel, set 'UseParallel' to true.
For more information, see the 'UseParallel' name-value pair argument.
Version History
Introduced in R2016b