BioRead

Contain sequence reads and their quality data

Description

The BioRead object contains sequencing read data, including sequence headers, nucleotide sequences, and quality scores.

Create a BioRead object from NGS (next-generation sequencing) data stored in a FASTQ- or SAM-formatted file. Each element in the object has a sequence, a header, and a quality score associated with it. Use the object properties and functions to explore, access, filter, and manipulate all of the data or a subset of it. If your reads are already mapped to a reference sequence and you need to access alignment records, use BioMap instead.

Creation

Description

bioreadObj = BioRead creates an empty BioRead object bioreadObj.

bioreadObj = BioRead(File) creates a BioRead object from File, a FASTQ- or SAM-formatted file. The data remains in the source file after the object is created. You can access the data through the object properties, but you cannot modify any property except Name.

bioreadObj = BioRead(S) creates a BioRead object from S, a MATLAB® structure containing the fields Header, Sequence, and Quality. The data from S remains in memory, and you can modify the properties of the object.

bioreadObj = BioRead(Seqs) creates a BioRead object from Seqs, a cell array of character vectors or string vector containing nucleotide sequences.

bioreadObj = BioRead(Seqs,Quals) creates a BioRead object from Seqs and sets the Quality property of the object to Quals, a cell array of character vectors or string vector containing the ASCII representation of per-base quality scores for each read.

bioreadObj = BioRead(Seqs,Quals,Headers) also sets the Header property of the object to Headers, a cell array of character vectors or string vector containing the header text for each read.

bioreadObj = BioRead(___,Name,Value) specifies options using one or more name-value pair arguments in addition to the input arguments in previous syntaxes. For instance, br = BioRead('SRR005164_1_50.fastq','InMemory',true) loads the data into memory instead of leaving it in the source file.
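
As a minimal sketch of the cell-array syntaxes above, the following builds a small in-memory object. The sequences and quality strings are the sample values used elsewhere on this page. The header text, the object name, and the variable names are placeholders chosen for illustration.

```matlab
% Two short reads with matching per-base quality strings.
seqs    = {'TATCTG'; 'ATCTAC'};
quals   = {'<<:<<<'; '<<<7<:'};
headers = {'read1'; 'read2'};   % hypothetical header text

% Positional syntax: sequences, then qualities, then headers.
br = BioRead(seqs, quals, headers);

% The object was not created from a file, so its data is in memory
% and its properties can be edited.
br.Name = 'demo';               % hypothetical object name
br.Header(2) = {'read2.1'};
```

Because the first input is not a file, the 'InMemory' option would have no effect here.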

Input Arguments

File

Name of FASTQ- or SAM-formatted file, specified as a character vector or string.

The BioRead object accesses data using an auxiliary index file. The index file must have the same name as the source file with an .idx extension appended (for example, SRR005164_1_50.fastq.idx). If no index file exists in the folder that contains the source file, the BioRead function creates the index file in that folder.

Note

Because the data remains in the source file, do not delete the source file or the auxiliary index file.

Example: 'ex1.sam'

Data Types: char

S

Sequence information, specified as a structure. S must contain the fields Header, Sequence, and Quality. For instance, the fastqread and samread functions return such a structure.

Example: S

Data Types: struct
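
A minimal sketch of the structure-based route, assuming the sample file SRR005164_1_50.fastq used in the Examples section is on the MATLAB path. The variable names are illustrative.

```matlab
% Read all entries of the FASTQ file into a structure array with the
% fields Header, Sequence, and Quality.
S = fastqread('SRR005164_1_50.fastq');

% Build the object from the structure. The data is held in memory,
% so the object properties can be modified.
br = BioRead(S);
```

This route reads the whole file into memory first, so it is best suited to files that fit comfortably in memory.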

Seqs

Nucleotide sequences, specified as a cell array of character vectors or string vector.

Data Types: cell

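Quals

Sequence quality information, specified as a cell array of character vectors or string vector. Each element is the ASCII representation of the per-base quality scores for one read.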

Data Types: cell

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: br = BioRead('SRR005164_1_50.fastq','InMemory',true) loads the data into memory instead of leaving it in the source file.

InMemory

Boolean indicator to keep data in memory, specified as the comma-separated pair consisting of 'InMemory' and true or false.

When you create a BioRead object from a file, the object does not load the data in memory, but leaves it in the source file and accesses it using an index file to make the process more memory efficient. You cannot modify the object properties if you do not load the data in memory.

If the first input is not a file, this name-value pair argument is ignored, and the data is automatically placed in memory.

Example: 'InMemory',true

Data Types: logical

IndexDir

Path to the folder where the index file exists or is created, specified as the comma-separated pair consisting of 'IndexDir' and a character vector or string.

Example: 'IndexDir','C:\data\'

Data Types: char
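
A short sketch of how 'IndexDir' might be used to keep the index file apart from the source file, for example when the source folder is not writable. Using tempdir as the target folder is an assumption made for illustration. Any existing writable folder should work.

```matlab
% Hypothetical choice: store (or look for) the index file in the
% system temporary folder instead of next to the source file.
idxFolder = tempdir;

% The read data stays in the source file and is accessed through
% the index file in idxFolder.
br = BioRead('SRR005164_1_50.fastq', 'IndexDir', idxFolder);
```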

Properties

Header

Header information of reads, specified as a cell array of character vectors. Each element is the header text for one read. The number and order of elements in Header match those in the Sequence property, unless Header is an empty cell array.

Data Types: cell

Name

Object name, specified as a character vector or string.

Example: 'seqdata'

Data Types: char

NSeqs

Number of reads in the object, specified as a nonnegative integer. An empty BioRead object contains zero reads.

Example: 20000

Data Types: double

Quality

Per-base quality scores for all reads, specified as a cell array of character vectors. Each element is an ASCII representation of the per-base quality scores for one read. The number and order of elements in Quality match those in Sequence, unless Quality is an empty cell array.

Example: {'<<:<<<','<<<7<:'}

Data Types: cell

Sequence

Nucleotide sequences (reads), specified as a cell array of character vectors.

Example: {'TATCTG','ATCTAC'}

Data Types: cell

Object Functions

combine          Combine two objects
get              Retrieve property of object
getHeader        Retrieve sequence headers from object
getQuality       Retrieve sequence quality information from object
getSequence      Retrieve sequences from object
getSubsequence   Retrieve partial sequences from object
getSubset        Retrieve subset of elements from object
set              Set property of object
setHeader        Update header information of reads
setQuality       Update quality information
setSequence      Update read sequences
setSubsequence   Update partial sequences
setSubset        Update elements of object
write            Write contents of BioRead or BioMap object to file
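
The following sketch shows how a few of these functions might be combined on a small in-memory object. The first two reads reuse the sample values from the Properties section. The third read, the header text, and all variable names are made up for illustration.

```matlab
% Build a three-read object in memory (third read is hypothetical).
br = BioRead({'TATCTG'; 'ATCTAC'; 'GGCATT'}, ...
             {'<<:<<<'; '<<<7<:'; '<<<<<<'}, ...
             {'r1'; 'r2'; 'r3'});

% Extract a new object that contains only reads 1 and 2.
firstTwo = getSubset(br, [1 2]);

% Retrieve the sequences of reads 1 and 3 as a cell array.
seqs = getSequence(br, [1 3]);

% Retrieve all headers of the two-read subset.
hdrs = getHeader(firstTwo);

% Recombine the subset with a one-read object holding read 3.
both = combine(firstTwo, getSubset(br, 3));
```

getSubset returns an object while getSequence and getHeader return the underlying data, so the choice depends on whether later steps need a BioRead object or plain cell arrays.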

Examples

Create a BioRead object from sequencing read data saved in a FASTQ-formatted file.

br = BioRead('SRR005164_1_50.fastq')
br = 
  BioRead with properties:

     Quality: [50x1 File indexed property]
    Sequence: [50x1 File indexed property]
      Header: [50x1 File indexed property]
       NSeqs: 50
        Name: ''


By default, when creating a BioRead object from a file, the function also creates an index file if one does not already exist. This example uses an existing index file created and saved in:

fullfile(matlabroot,'toolbox','bioinfo','bioinfodata','SRR005164_1_50.fastq.idx')

The data remains in the source file, and the object accesses the data using the index file, which makes the process more memory efficient. However, you cannot edit the object properties, except the Name property.

To edit the properties, set 'InMemory' to true.

brEdit = BioRead('SRR005164_1_50.fastq','InMemory',true);
brEdit.Header(1) = {'SR1'};
brEdit.Header(1)
ans = 1x1 cell array
    {'SR1'}

If you create the object from a MATLAB structure or a cell array of nucleotide sequences, the sequence data is always kept in memory, and the InMemory option is ignored.

For instance, generate MATLAB variables containing synthetic sequences and quality scores.

seqs = {randseq(10);randseq(15);randseq(20)};
quals = {repmat('!',1,10); repmat('%',1,15);repmat('&',1,20)};
headers = {'H1';'H2';'H3'};

Create a structure using these variables.

structData = struct('Header',headers,'Sequence',seqs,'Quality',quals);

Create a BioRead object from the structure.

brStruct = BioRead(structData);

You can edit the properties of the object because the data remains in memory.

brStruct.Header(1) = {'H1.1'};
brStruct.Header(1)
ans = 1x1 cell array
    {'H1.1'}

Introduced in R2010a