seqpdist
Calculate pairwise distance between sequences
Description
Examples
Measure Distance Between Amino Acid Sequences
Read amino acid alignment data into a structure.
seqs = fastaread("pf00002.fa");
For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.
dist = seqpdist(seqs,Method="alignment-score", ...
    Indels="pairwise-delete", ...
    ScoringMatrix="pam250");
Force the realignment of each sequence pair ignoring the provided multiple alignment.
dist = seqpdist(seqs,Method="alignment-score", ...
    Indels="pairwise-delete", ...
    ScoringMatrix="pam250", ...
    PairwiseAlignment=1);
Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.
dist = seqpdist(seqs,Method="jukes-cantor", ...
    Indels="score", ...
    ScoringMatrix="pam250", ...
    PairwiseAlignment=true);
Input Arguments
Seqs
— Nucleotide or amino acid sequences
cell array of character vectors | vector of strings | matrix of characters | vector of structures
Nucleotide or amino acid sequences, specified as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures.
You can specify:
- Cell array of character vectors or vector of strings containing nucleotide or amino acid sequences.
- Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence.
- Vector of structures containing a Sequence field.
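The three input forms can be sketched as follows. The sequences are made-up data, and Alphabet="NT" is set explicitly because the default alphabet is amino acid. This is an illustrative sketch that assumes Bioinformatics Toolbox is available, not an excerpt from the shipped examples.

```matlab
% Made-up nucleotide sequences of equal length, so seqpdist treats them as aligned.
asCell   = {'ACGTACGT', 'ACGTACGA', 'AGGTACGA'};
asChar   = ['ACGTACGT'; 'ACGTACGA'; 'AGGTACGA'];   % one sequence per row
asStruct = struct('Sequence', asCell);             % 1-by-3 structure array

% Same data, three input forms.
dCell   = seqpdist(asCell,   Alphabet="NT");
dChar   = seqpdist(asChar,   Alphabet="NT");
dStruct = seqpdist(asStruct, Alphabet="NT");
```

Each call should return one distance per sequence pair, so all three outputs have three elements.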
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: D = seqpdist(Seqs,Method="p-distance") calculates the pairwise distance using the p-distance method.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: D = seqpdist(Seqs,"Method","p-distance")
Method
— Method to calculate pairwise distances
"Jukes-Cantor" (default) | character vector | string scalar | function handle
Method to calculate pairwise distances, specified as one of the character vectors or string scalars in the following tables, or as a function handle.
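Methods for Nucleotides and Amino Acids
Method | Description |
---|---|
p-distance | Proportion of sites at which the two sequences are different: d = p |
Jukes-Cantor (default) | Maximum likelihood estimate of the number of substitutions between two sequences, where p is the p-distance. For nucleotides: d = -3/4 * log(1 - p * 4/3). For amino acids: d = -19/20 * log(1 - p * 20/19). |
alignment-score | Distance d between two sequences 1 and 2, computed from the pairwise alignment score between them (score12) and the score of each sequence aligned with itself (score11, score22): d = (1-score12/score11) * (1-score12/score22). If score12 is greater than a self-alignment score, then d = 0. |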
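As a rough illustration of the first two methods, the sketch below computes the p-distance and the standard Jukes-Cantor correction by hand for two made-up, already aligned sequences. It shows the arithmetic only and is not the internal seqpdist code. The correction is d = -(k-1)/k * log(1 - k/(k-1) * p), with k = 4 for nucleotides and k = 20 for amino acids.

```matlab
% Two aligned nucleotide sequences of equal length (illustrative data).
s1 = 'ACGTACGTAC';
s2 = 'ACGTTCGTAA';

% p-distance: proportion of sites at which the sequences differ.
p = sum(s1 ~= s2) / numel(s1);

% Jukes-Cantor correction for a 4-letter (nucleotide) alphabet.
dNT = -3/4 * log(1 - 4/3 * p);

% Jukes-Cantor correction for a 20-letter (amino acid) alphabet,
% applied to the same p purely for comparison.
dAA = -19/20 * log(1 - 20/19 * p);
```

Both corrected distances are larger than p because the correction accounts for multiple substitutions at the same site.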
Methods with No Scoring of Gaps (Nucleotides Only)
Method | Description |
---|---|
Tajima-Nei | Maximum likelihood estimate considering the background nucleotide frequencies. You can supply the frequencies with the OptArgs argument. |
Kimura | Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution. |
Tamura | Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. You can supply the GC content with the OptArgs argument. |
Hasegawa | Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. You can supply the frequencies with the OptArgs argument. |
Nei-Tamura | Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. You can supply the frequencies with the OptArgs argument. |
Methods with No Scoring of Gaps (Amino Acids Only)
Method | Description |
---|---|
Poisson | Assumes that the number of amino acid substitutions at each site has a Poisson distribution. |
Gamma | Assumes that the number of amino acid substitutions at each site has a Gamma distribution with shape parameter a. You can set a with the OptArgs argument. If you do not set a, seqpdist uses a default value. |
You can also specify a user-defined distance function using @, for example, @distfun. The distance function must have the form:
function D = distfun(S1, S2, OptionalArgs)
The distfun function takes the following arguments:
- S1, S2: Two sequences of the same length (nucleotide or amino acid).
- OptionalArgs: Optional problem-dependent arguments.
The distfun function returns a scalar that represents the distance between S1 and S2.
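A minimal sketch of a user-defined distance function follows. The name mismatchFraction and the sequences are hypothetical. The handle accepts and ignores any optional arguments through varargin, which is an assumption made so that the sketch works whether or not extra arguments are passed.

```matlab
% Hypothetical distance function: fraction of mismatched positions.
% Extra (optional) arguments are accepted through varargin and ignored.
mismatchFraction = @(s1, s2, varargin) sum(s1 ~= s2) / numel(s1);

% Direct call on two aligned sequences of the same length.
d = mismatchFraction('ACGTACGT', 'ACGTACGA');

% Pass the handle to seqpdist through the Method argument.
seqs = {'ACGTACGT', 'ACGTACGA', 'AGGTACGA'};
D = seqpdist(seqs, Method=mismatchFraction, Alphabet="NT");
```

The function returns a scalar for each pair, so D again contains one value per sequence pair.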
Data Types: char | string | function_handle
Indels
— How to treat sites with gaps
"score" (default) | "pairwise-del" | "complete-del"
How to treat sites with gaps, specified as one of these strings or character vectors.
- score: Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.
- pairwise-del: For every pairwise comparison, ignores the sites with gaps.
- complete-del: Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provide a multiple alignment as the input Seqs.
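The difference between the two deletion options can be sketched by hand on a small, made-up alignment. The code below only illustrates which columns each option keeps when computing a p-distance. It is not the seqpdist implementation, and the variable names are hypothetical.

```matlab
% Illustrative 3-sequence alignment; '-' marks a gap.
aln = ['AC-TGA'; 'ACGT-A'; 'ATGTGA'];
isGap = (aln == '-');

% Pairwise deletion for sequences 1 and 3: drop only the columns
% where either of these two sequences has a gap.
keep13 = ~(isGap(1,:) | isGap(3,:));
p13Pairwise = sum(aln(1,keep13) ~= aln(3,keep13)) / sum(keep13);

% Complete deletion: drop every column that has a gap in any sequence.
keepAll = ~any(isGap, 1);
p13Complete = sum(aln(1,keepAll) ~= aln(3,keepAll)) / sum(keepAll);
```

Here pairwise deletion compares five columns and complete deletion compares four, so the same pair gets a different proportion of differing sites under each option.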
Data Types: char | string
OptArgs
— Optional arguments for distance method
numeric scalar | four-element numeric vector
Optional arguments for the distance method, specified as a numeric scalar or a four-element numeric vector.
Methods with No Scoring of Gaps (Nucleotides Only)
Method | OptArgs Value |
---|---|
Tajima-Nei | Background nucleotide frequencies, specified as a 4-element numeric vector of the form [gA gC gG gT]. |
Tamura | Proportion of GC content, specified as a number in the range [0, 1]. |
Hasegawa | Background nucleotide frequencies, specified as a 4-element numeric vector of the form [gA gC gG gT]. |
Nei-Tamura | Background nucleotide frequencies, specified as a 4-element numeric vector of the form [gA gC gG gT]. |
Methods with No Scoring of Gaps (Amino Acids Only)
Method | OptArgs Value |
---|---|
Gamma | Shape parameter a of the Gamma distribution, specified as a numeric scalar. |
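The calls below sketch how OptArgs might be passed for a vector-valued and two scalar-valued cases. The sequences, the uniform frequencies, the GC proportion of 0.5, and the shape parameter 2 are all assumed values chosen for illustration.

```matlab
% Made-up nucleotide sequences of equal length (treated as already aligned).
nt = {'ACGTACGTACGT', 'ACGTACGAACGT', 'AGGTACGAACGT'};

% Tajima-Nei with assumed uniform background frequencies [gA gC gG gT].
dTN = seqpdist(nt, Method="Tajima-Nei", Alphabet="NT", ...
    OptArgs=[0.25 0.25 0.25 0.25]);

% Tamura with an assumed GC proportion of 0.5.
dT = seqpdist(nt, Method="Tamura", Alphabet="NT", OptArgs=0.5);

% Made-up amino acid sequences; Gamma method with an assumed shape parameter a = 2.
aa = {'MKTAYIAKQR', 'MKTAHIAKQR', 'MKSAYIGKQR'};
dG = seqpdist(aa, Method="Gamma", Alphabet="AA", OptArgs=2);
```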
PairwiseAlignment
— Global pairwise alignment
true or 1 | false or 0
Controls the global pairwise alignment of input sequences, specified as a numeric or logical true (1) or false (0). When true, the seqpdist function performs global pairwise alignment using the nwalign function, while ignoring the multiple alignment of the input sequences (if any).
The default value depends on the length of the sequences:
- true: When the input sequences do not all have the same length.
- false: When all input sequences have the same length.
Tip
If your input sequences are the same length, then seqpdist assumes they are aligned. If they are not aligned, do one of the following:
- Align the sequences before passing them to seqpdist, for example, using the multialign function.
- Set PairwiseAlignment to true when using seqpdist.
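The default rule can be mirrored explicitly, as in this sketch. The sequences are made up and the variable needsAlignment is hypothetical. Because the lengths differ, the sketch requests realignment, which is what the documented default would also do.

```matlab
% Made-up nucleotide sequences with different lengths (not aligned).
seqs = {'ACGTACGTAC', 'ACGTTACGTAC', 'ACGACGTAC'};

% Mirror the documented default: realign only when the lengths differ.
lens = cellfun(@numel, seqs);
needsAlignment = ~all(lens == lens(1));

D = seqpdist(seqs, Alphabet="NT", PairwiseAlignment=needsAlignment);
```

For equal-length sequences that are not actually aligned, set PairwiseAlignment=true directly instead of relying on the length check.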
UseParallel
— Use parallel computation
false or 0 (default) | true or 1
Use parallel computation to calculate the distance, specified as a numeric or logical false (0) or true (1).
- If true, and Parallel Computing Toolbox™ is installed, then computation occurs using parfor-loops.
  - If a parpool is open, then the computation uses the open parpool and occurs in parallel.
  - If no parpool is open, but automatic creation is enabled in the Parallel Preferences, then the default pool is opened automatically and computation occurs in parallel.
  - If no parpool is open and automatic creation is disabled, then computation uses parfor-loops in serial mode.
- If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode.
- If false, then the computation uses for-loops in serial mode.
SquareForm
— Return output as square matrix
false or 0 (default) | true or 1
Return the output D as a square matrix, specified as a numeric or logical false (0) or true (1).
When true, the seqpdist function converts the output into a square matrix such that D(I,J) denotes the distance between the Ith and Jth sequences. The square matrix is symmetric and has a zero diagonal. Setting SquareForm to true is the same as using the squareform function in Statistics and Machine Learning Toolbox™.
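The conversion can be sketched in plain MATLAB without any toolbox. The distances below are hypothetical values chosen so that each one encodes its pair, for example 32 for the pair (3,2). They are listed in the default output order (2,1),(3,1),(4,1),(3,2),(4,2),(4,3).

```matlab
M = 4;                          % assumed number of sequences
D = [21 31 41 32 42 43];        % hypothetical distances in the default order

% Fill the lower triangle column by column, which matches the default order.
S = zeros(M);
S(tril(true(M), -1)) = D;

% Mirror the lower triangle to obtain the symmetric matrix with zero diagonal.
S = S + S.';
```

After this, S(3,2) and S(2,3) both hold the distance between the second and third sequences.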
Alphabet
— Type of sequence
"AA" (default) | "NT"
Type of sequence, specified as "AA" (amino acid) or "NT" (nucleotide).
Data Types: char | string
ScoringMatrix
— Scoring matrix
character vector | string scalar | numeric matrix
Scoring matrix to use for the global pairwise alignment, specified as a character vector, string scalar, or numeric matrix.
You can specify a scoring matrix name. Valid choices are:
- "BLOSUM50" (default for amino acid sequences)
- "NUC44" (default for nucleotide sequences). This choice is not supported for amino acid sequences.
- "BLOSUM62"
- "BLOSUM30" increasing by 5 up to "BLOSUM90"
- "BLOSUM100"
- "PAM10" increasing by 10 up to "PAM500"
- "DAYHOFF"
- "GONNET"
Note
The above scoring matrices, provided with the software, also include a scale factor that converts the units of the output score to bits. You can also specify the Scale name-value argument to specify an additional scale factor to convert the output score from bits to another unit.
You can also specify a numeric matrix, such as the one returned by the blosum, pam, dayhoff, gonnet, or nuc44 function.
Note
- If you use a scoring matrix that you created or that was created by one of these scoring matrix functions, the matrix does not include a scale factor. The output score is returned in the same units as the scoring matrix. You can use the Scale name-value argument to specify a scale factor to convert the output score to another unit.
- If you need to compile seqpdist into a standalone application or software component using MATLAB® Compiler™, use a matrix instead of a character vector or string for ScoringMatrix.
You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.
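As a hedged sketch of the numeric-matrix form, the code below builds a BLOSUM62 matrix with the blosum function and passes it together with the scale factor reported in the function's second output. The sequences are made up, and using info.Scale as the Scale value is an illustrative choice rather than a requirement.

```matlab
% Made-up amino acid sequences of equal length (treated as already aligned).
seqs = {'MKTAYIAKQR', 'MKTAHIAKQR', 'MKSAYIGKQR'};

% Build the scoring matrix; the second output carries the matrix information,
% including its scale factor.
[B62, info] = blosum(62);

% Pass the numeric matrix instead of a matrix name.
D = seqpdist(seqs, Method="alignment-score", ...
    ScoringMatrix=B62, Scale=info.Scale);
```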
Scale
— Scale factor applied to output distance
1 (default) | positive number
Scale factor applied to the output distance, specified as a positive number. By default, there is no scaling or change in the units of the output distance. If the scoring matrix information also provides a scale factor, then both are used.
Use this argument to control the units of the output distance.
You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.
GapOpen
— Penalty for opening gap
8 (default) | positive integer
Penalty for opening a gap in the alignment, specified as a positive integer.
You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.
ExtendGap
— Penalty for extending gap
positive integer
Penalty for extending a gap in the alignment, specified as a positive integer. The default is equal to GapOpen.
You can specify this argument only when the Method argument is "alignment-score" or the PairwiseAlignment argument is true.
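The two gap penalties can be set together when the sequences are realigned pairwise, as in this sketch. The sequences and the penalty values (10 to open a gap, 1 to extend it) are assumptions chosen only to show an extension penalty that differs from the opening penalty.

```matlab
% Made-up nucleotide sequences with different lengths.
seqs = {'ACGTACGTAC', 'ACGTTACGTAC', 'ACGACGTAC'};

% Realign each pair with assumed penalties: 10 to open a gap, 1 to extend it.
D = seqpdist(seqs, Alphabet="NT", PairwiseAlignment=true, ...
    GapOpen=10, ExtendGap=1);
```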
Output Arguments
D
— Biological distance between all pairs of sequences
numeric row vector | numeric matrix
Biological distance between all pairs of sequences stored in the M elements of Seqs, returned as a numeric row vector or a numeric matrix.
By default, D is a 1-by-(M*(M-1)/2) row vector. The elements are arranged in the order ((2,1),(3,1),...,(M,1),(3,2),...,(M,2),...,(M,M-1)). This is the lower-left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J).
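The index formula can be checked against the documented element order with a short loop. This sketch needs no toolbox. The helper name pairIndex and the choice M = 5 are hypothetical.

```matlab
% Closed-form position of pair (I,J), with I > J, in the default row vector.
pairIndex = @(I, J, M) (J-1).*(M - J/2) + I - J;

M = 5;             % assumed number of sequences
k = 0;             % running position in the documented order
matches = true;    % stays true while the formula agrees with the order
for J = 1:M-1
    for I = J+1:M
        k = k + 1;
        matches = matches && (pairIndex(I, J, M) == k);
    end
end
numPairs = M*(M-1)/2;
```

The loop walks the pairs in the order (2,1),(3,1),...,(M,M-1) and confirms that the formula returns the same running position for every pair.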
When you specify the SquareForm name-value argument as true, D is the full M-by-M distance matrix.
Data Types: double
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To run in parallel, set UseParallel to true.
For more information, see the UseParallel name-value argument.
Version History
Introduced before R2006a
See Also
fastaread | dnds | dndsml | multialign | nwalign | phytree | seqlinkage | pdist