text2speech

Synthesize speech from text

Since R2022b

    Description

    [speech,fs] = text2speech(text) synthesizes a speech signal from the provided text using a HiFi-GAN/Tacotron2 pretrained model.

    Note

    Using the HiFi-GAN/Tacotron2 pretrained model requires Deep Learning Toolbox™ and Audio Toolbox™ Interface for SpeechBrain and Torchaudio Libraries. You can download this support package from the Add-On Explorer. For more information, see Get and Manage Add-Ons.

    [speech,fs] = text2speech(text,Client=clientObj) synthesizes a speech signal using the pretrained deep learning model or third-party speech service specified by clientObj.

    Note

    To use third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    [speech,fs,rawOutput] = text2speech(___) also returns the unprocessed server output from the third-party speech service.

    Examples

    Call text2speech with a string to synthesize a speech signal using the HiFi-GAN/Tacotron2 pretrained model. This model requires Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, the function provides a link to the Add-On Explorer, where you can download and install the support package.

    [speech,fs] = text2speech("hello world");

    Listen to the synthesized speech.

    sound(speech,fs)
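
    To keep the result, you can also write the synthesized speech to an audio file. This sketch uses the standard audiowrite function; the file name is an example only.

    audiowrite("helloWorld.wav",speech,fs)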

    Create a speechClient object that uses the HiFi-GAN/Tacotron2 pretrained model. Set ExecutionEnvironment to "gpu" to use the GPU when running the model.

    hifiganSpeechClient = speechClient("hifigan",ExecutionEnvironment="gpu");

    Call text2speech on a string of text with the HiFi-GAN/Tacotron2 speechClient object to synthesize the speech signal.

    [x,fs] = text2speech("hello world",Client=hifiganSpeechClient);

    Listen to the synthesized speech.

    sound(x,fs)
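
    The three-output syntax additionally returns the unprocessed server output when you use a third-party speech service. The following sketch is illustrative only: it assumes you have downloaded the extended Audio Toolbox functionality from File Exchange and completed the service-specific setup (such as authentication) described in its tutorial, and it uses the IBM® client shown in the clientObj example.

    % Requires the File Exchange extension and service credentials (setup not shown).
    ibmSpeechClient = speechClient("IBM");
    [x,fs,rawOutput] = text2speech("hello world",Client=ibmSpeechClient);
    sound(x,fs)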

    Input Arguments

    text — Text to synthesize into speech, specified as a string or character array.

    Example: "Hello world"

    Data Types: char | string

    clientObj — Client object, specified as an object returned by speechClient. The object is an interface to a pretrained model or to a third-party speech service. By default, text2speech uses a HiFi-GAN/Tacotron2 client object.

    You cannot use text2speech with a speechClient object that interfaces with the wav2vec 2.0 or Emformer pretrained models.

    Using the HiFi-GAN/Tacotron2 model requires Deep Learning Toolbox and Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, calling speechClient with "hifigan" provides a link to the Add-On Explorer, where you can download and install the support package.

    To use the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Example: speechClient("IBM")

    Output Arguments

    speech — Synthesized speech signal, returned as a column vector (single channel).

    Data Types: double

    fs — Sample rate of the speech signal in Hz, returned as a positive double. The sample rate depends on the third-party service and the server options set through clientObj. See the documentation for the specific speech service for more information.

    Data Types: double

    rawOutput — Unprocessed server output, returned as a matlab.net.http.ResponseMessage object containing the HTTP response from the third-party speech service. If the third-party speech service is Amazon®, text2speech returns the server output as a structure.

    Limitations

    The HiFi-GAN/Tacotron2 model cannot synthesize speech signals longer than approximately 10 seconds.
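
    If you need longer output, one possible workaround (a sketch, not part of the documented interface) is to split the text into sentences, synthesize each sentence separately, and concatenate the results, assuming each sentence stays under the limit.

    longText = "First sentence. Second sentence. Third sentence.";
    sentences = strtrim(split(longText,"."));   % naive sentence split
    speech = [];
    for k = 1:numel(sentences)
        if sentences(k) ~= ""
            [segment,fs] = text2speech(sentences(k));
            speech = [speech; segment];   % concatenate column vectors
        end
    end
    sound(speech,fs)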

    Version History

    Introduced in R2022b