Speech Modeling, Prediction, and Synthesis

PURPOSE

In this laboratory assignment, you will learn how to generate digitally synthesized speech by using a difference equation model for digital speech. Using MATLAB, you can easily implement digital filters defined by difference equations (for real-time speech synthesis, the digital filters are implemented on a dedicated DSP board). Using your model, you will explore the quality of speech you can digitally synthesize and the associated storage and transmission requirements. Using the Fourier transform, you will determine the bandwidth required for speech transmission, explore how speech frequency content changes with time, and compare the spectra of true and synthesized speech.

OBJECTIVES

By the end of this assignment, you should be able to

Time Domain

  1. Use difference equations to model and synthesize speech.
  2. Use MATLAB to simulate the time-domain response of difference equation models for DT LTI systems.
  3. Compute transmission and storage requirements for speech technologies.

Transform Domain

  4. Identify formant frequencies from sampled speech data and compute formant frequencies from linear prediction filter coefficients.
  5. Determine the bandwidth and minimum sampling rate required for speech data.
  6. Analyze effects of digital filtering using frequency-domain techniques and create inverse filters.
  7. Compare spectra and signal-to-noise ratios (SNRs) for digital speech transmitted by quantized error signals and by quantized speech signals.

LABORATORY PREPARATION

Problems:

Question 1. Assume that you have a digitized speech signal that was sampled at 8 kHz. If the speech is broken down into 20 ms blocks, how many samples NS are there per block? If one second of a recorded speech signal is in a MATLAB vector, how many 20 ms blocks, NBLKS, are there (assuming no overlap of blocks)? What is a one-line MATLAB command that will extract the nth 20 ms segment of speech, where n = 1, ..., NBLKS?
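For reference, the arithmetic and the one-line extraction can be checked with a short sketch (assuming the one-second recording sits in a vector named s, which is an illustrative name):

    fs    = 8000;                        % sampling rate in Hz
    Tblk  = 0.020;                       % block length in seconds
    NS    = fs*Tblk;                     % samples per 20 ms block: 160
    NBLKS = floor(length(s)/NS);         % full, non-overlapping blocks in one second: 50
    n     = 3;                           % example block index, n = 1, ..., NBLKS
    sn    = s((n-1)*NS + 1 : n*NS);      % one-line extraction of the nth 20 ms segment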

Question 2. Suppose you wish to use as input to your speech model a train of equally spaced Discrete-Time (DT) unit pulses, and would like the pitch to be 200 Hz. If the speech is assumed to be sampled at 8 kHz, how many DT samples are there per period?

Question 3. Look up the description of the MATLAB command filter. Determine how you need to define the vectors a and b, used as input to filter, in terms of αi and G to create difference equations that will allow you to perform the following operations:

  1. Give you s*[n] as output when s[n] is the input (linear prediction);
  2. Give you e[n] as output when s[n] is input (prediction error);
  3. Give you s’[n] as output when e[n] is input (synthesis).

In each case, how can you ensure that you use the correct initial conditions for each successive speech block, defined as the final values from the previous block, when using filter? Assume that you are in a loop that generates one 20 ms block at a time.
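For reference, one possible way to set this up (a sketch, not the required solution; the names alpha, G, s_blk, and zi_err are illustrative, with alpha = [α1 ... αp] taken as a row vector of predictor coefficients for the current block):

    b_lp  = [0, alpha];      a_lp  = 1;             % s*[n] from s[n]  (linear prediction, FIR)
    b_err = [1, -alpha];     a_err = 1;             % e[n]  from s[n]  (prediction error, FIR)
    b_syn = G;               a_syn = [1, -alpha];   % s'[n] from x[n]  (synthesis, IIR); use b_syn = 1 when e[n] is the input

    % filter's optional fourth input and second output handle the initial conditions:
    % pass back in the final conditions zf returned for the previous block.
    [e_blk, zf_err] = filter(b_err, a_err, s_blk, zi_err);
    zi_err = zf_err;                                % becomes zi for the next block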

BACKGROUND

You work for a company that is developing a digital telephone answering machine for home computers. The system will sample data from a telephone line, detect rings, pick up the phone, speak a greeting, and record a message. The greeting will be a text message you type in, and the system will synthesize your speech from samples of your speech which will be recorded and analyzed when the system is installed. In addition, it compresses the recorded phone messages before saving them to the hard drive so as to save space.

You are in charge of developing the compression and synthesizer portions of the system. As a first step, you attempt to model your own speech and determine how much compression is possible.

Speech Fundamentals

Physically, Continuous-Time (CT) speech is produced when air from your lungs excites your vocal tract system. Sampling and quantizing CT speech results in digital speech. In telecommunications, speech is digitized by sampling at 8 kHz, using 8 bits per sample. The vocal tract behaves as a resonant cavity, so that the signal emanating from your mouth is a weighted sum of delayed versions of the original vocal signal plus the excitation. We can model speech as a linear difference equation; the weights on the delayed signal versions are the coefficients of the model. Different sounds can be produced by using different inputs to and coefficients of this model.

Different types of speech sounds can be roughly categorized as either voiced or unvoiced, where the category is determined by the type of input used to produce the sound. Voiced sounds are produced by using a periodic sequence of pulses as input; the fundamental period of this sequence determines the resulting pitch. Vowels are voiced sounds; if you say "aah," you can feel the vibrations at the top of your vocal tract. Unvoiced sounds are produced by using random white noise as input (alone it sounds like static). These sounds generally are produced more by turbulent air flow in the mouth, such as "sh."

Discrete-Time Speech Models

A mathematical difference equation model for the vocal tract can be developed as follows. Since each successive DT speech sample is very closely related to previous samples, the value of the current speech sample can be estimated as a linear combination of previous samples:

s*[n] = Σ_{i=1}^{p} α_i s[n−i]

s*[n] is the estimate of the speech signal s[n] for the nth sample. The error between the estimate and the original signal is

e[n] = s[n] − s*[n]

Prediction Model

Combining the two equations above yields a difference equation model of the prediction process for speech:

s[n] − Σ_{i=1}^{p} α_i s[n−i] = e[n]

This prediction model is used in telecommunications to increase the number of voice signals that can be transmitted over a channel. If the coefficients αi are known at both the transmitting and receiving ends, then only the error needs to be transmitted and the speech signal can be reconstructed at the receiving end using the difference equation above. At the transmitting end s[n] is the prediction filter input and e[n] is the filter output. It turns out that sending a sampled error signal can result in substantial channel bandwidth savings; this idea is explored further in this laboratory assignment.
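As a rough illustration of this transmit-the-error idea (a sketch only; alpha, s_blk, and the quantizer step size q are illustrative names, and a crude uniform quantizer stands in for whatever quantizer a real system would use):

    e_blk  = filter([1, -alpha], 1, s_blk);    % prediction error computed at the transmitter
    eq_blk = q*round(e_blk/q);                 % quantize the error with step size q before transmission
    sr_blk = filter(1, [1, -alpha], eq_blk);   % receiver rebuilds the speech from the quantized error
    snr_dB = 10*log10(sum(s_blk.^2) / sum((s_blk - sr_blk).^2));   % reconstruction SNR in dB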

Synthesis Model

We can modify this same basic speech prediction model for use in speech synthesis. If our goal is to create a signal s’[n] that mimics the original sampled speech segment s[n], then we can replace the error e[n] by an input signal x[n] multiplied by a gain G. Using the same form as the difference equation model for prediction results in the following difference equation model for speech synthesis:

s’[n] = Σ_{i=1}^{p} α_i s’[n−i] + G x[n]

If Gx[n] = e[n], then the synthesized speech s’[n] should exactly match the original sampled speech segment s[n]; in this case the process is called reconstruction rather than synthesis.

Typically the coefficients αi change every 10-20 msec as the vocal tract changes to produce different sounds. In synthesis, you apply a sequence of excitations to the model that has coefficients appropriate for that time interval to


an equivalent amount of information, 4 bits x 8 kHz + 16 bits x 10 coefficients x 100 ten-ms chunks per second = 48,000 bits per second – 76% of the previous rate. If only 1 quantization bit is necessary for the error signal, 24,000 bits per second are necessary – 37% of the previous rate. Using this technique, two people can have conversations in the same space as one person, a tremendous savings.
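The bit-rate bookkeeping above can be reproduced in a few lines (a sketch; the 64,000 bit/s figure assumes the 8-bit, 8 kHz telephone digitization described earlier as the uncompressed reference):

    fs = 8000;                              % sampling rate in Hz
    raw_rate = 8*fs                         % 8-bit samples: 64,000 bits per second
    lp_rate  = 4*fs + 16*10*100             % 4-bit error + ten 16-bit coefficients per 10 ms block: 48,000
    min_rate = 1*fs + 16*10*100             % 1-bit error: 24,000 bits per second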

Both quantizing the sampled speech, as in the first paragraph, and quantizing the prediction error, as above, add distortion to the reconstructed speech. In some of the lab experiments, you will observe the differences between original speech segments and their reconstructed versions resulting from using different numbers of quantization levels. Representing the coefficients with 16 data bits also introduces some quantization error, which can lead to poor quality reproduction on the receiving end. This property is explored in the analysis questions.

In the computerized answering machine, the number of bits to be stored directly relates to the amount of space required. Since the library of sounds necessary to reproduce the answering machine greetings can become very large very quickly, having a good compression technique will allow more message flexibility and require less memory.

Windowing a Data Stream

As discussed above, the αi coefficients change every 10-20 ms. For every 10 ms block of speech a new set of α coefficients must be calculated from the sampled speech data. The process of extracting a 10 ms block of speech from the entire segment is called windowing.

The simplest type of windowing involves taking the speech samples in the current 10 ms segment as data. This operation is mathematically equivalent to multiplying the entire signal by a rectangular function having a value of 1 in the region of interest and 0 everywhere else, just as when you multiply a signal by a difference of time-shifted unit step functions. This window function is called a rectangular window. At the edges of the data region, there is a sharp transition from signal to nothing, which can cause problems in analysis.

A better way to window the data sequence is to multiply by a function that has a smooth transition from one end to the other. The most common function that does this is called the Hamming window, which can be calculated using MATLAB for any length by the function hamming.

To understand why the Hamming window is preferred to the simpler rectangular window, it is instructive to look at the impact of windowing in the frequency domain. Since windowing a signal is a multiplication operation in the time domain, it corresponds to convolving the Fourier transform of the window function with the frequency spectrum of the speech segment. If the window transform approximates an impulse in frequency, then this convolution operation yields a frequency spectrum identical to the original speech spectrum. However, the less the window transform is like an impulse, the more windowing distorts the original speech signal spectrum.
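One way to see this in MATLAB (a sketch, assuming an 8 kHz rate, 160-sample blocks, and the Signal Processing Toolbox hamming function):

    NS = 160;                                  % one 20 ms block at 8 kHz
    wrect = ones(NS, 1);                       % rectangular window
    wham  = hamming(NS);                       % Hamming window
    Nfft  = 4096;
    Wrect = abs(fft(wrect, Nfft));             % magnitude of the window transforms
    Wham  = abs(fft(wham,  Nfft));
    f = (0:Nfft-1)/Nfft*8000;                  % frequency axis in Hz
    plot(f, 20*log10(Wrect/max(Wrect) + eps), f, 20*log10(Wham/max(Wham) + eps));
    xlim([0 1000]); xlabel('Hz'); ylabel('dB') % compare main-lobe width and sidelobe level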

Implementing Difference Equation Models in MATLAB

The filter command in MATLAB can be used to compute the response of a difference equation model to specified input signals and initial conditions. Using difference equations to operate on an input signal is called filtering. The initial conditions of the various delay elements play an important part in the output of a difference equation. For example, computing y[n] recursively using the difference equation y[n] = x[n] + x[n−1] − y[n−1] requires knowledge of y[n−1] at each time. The initial value of y[n−1] is the value of the output prior to application of a new input; different initial values result in different initial responses to the input. Often, when using difference equations for continual filtering of a constantly applied input, the initial conditions are assumed to be zero, as the response to these initial conditions will only affect the filter response at system start-up.

For this experiment, however, the initial conditions are very important. Since speech is a continuous phenomenon but we are breaking it into 20 ms chunks, we would like the filter output at a 20 ms boundary to be consistent with the values from the previous block. Otherwise, there will be "pops" in the output, caused by the errors in the initial conditions. The last output samples in the previous 20 ms segment should be used as initial conditions for the current 20 ms segment.
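A minimal sketch of that bookkeeping, using the fourth input and second output of filter (the names NS, NBLKS, b, a, and s follow the preparation questions and are otherwise illustrative; b and a are held fixed here for simplicity, but the zi/zf hand-off is the same when the coefficients change every block):

    zi = zeros(max(length(a), length(b)) - 1, 1);  % zero state before the very first block
    y  = zeros(size(s));
    for n = 1:NBLKS
        idx = (n-1)*NS + 1 : n*NS;                 % samples belonging to the nth 20 ms block
        [y(idx), zf] = filter(b, a, s(idx), zi);   % filter one block and capture its final conditions
        zi = zf;                                   % hand the state to the next block; no "pops"
    end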

LABORATORY EXPERIMENT

Voiced Speech Models

Your objective is to generate a model for your speech, and then to try synthesizing your speech. As a first pass, you want to try to synthesize a purely voiced segment of speech.

Problem 1. Record yourself saying "We were away a year ago." Sample at 8000 samples/s and store the resulting signal in a MATLAB vector. To save time, you may want to process only a portion of this sentence when you are troubleshooting.
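If you want to do the recording from within MATLAB, one possible approach uses the built-in audiorecorder object (a sketch; the three-second duration is only an example):

    fs  = 8000;                              % required sampling rate
    rec = audiorecorder(fs, 16, 1);          % 8 kHz, 16 bits per sample, one channel
    disp('Recording: say "We were away a year ago."');
    recordblocking(rec, 3);                  % record for three seconds
    s = getaudiodata(rec);                   % sampled speech as a column vector
    sound(s, fs);                            % play it back to check the take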

Problem 2. You have been provided with a MATLAB function file P_4.m. Given your original sampled speech vector, an excitation vector, and the number of full 20 ms blocks in your speech segment, P_4.m generates the prediction error, predicted speech, speech reconstructed using the prediction error as input, and speech synthesized using the excitation vector as input. Information regarding how to use this function is included as comments in the file. Variable definitions can be found by typing help P_4.

You need to edit P_4.m and add the commands to create the appropriate a and b vectors, segmented original speech, excitation vectors, and filter command to generate (a) e[n], (b) synthesized speech s’[n], and (c) speech synthesized using x[n] as input. The use of the filter command is illustrated in the code by generating s*[n] from s[n]. You may wish to review your answers to Questions 1 and 3 prior to coding.

Problem 3. Generate a DT signal that is a periodic sequence of DT unit impulses. Pick your period to generate a pitch somewhere in the range of 50-300 Hz. Use your answer to Question 2 for guidance.
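One way to build such an excitation (a sketch, assuming the 8 kHz rate and, for example, the 200 Hz pitch from Question 2; N is just the excitation length you need):

    fs = 8000;
    f0 = 200;                    % chosen pitch, anywhere in 50-300 Hz
    P  = round(fs/f0);           % samples per pitch period (40 at 200 Hz)
    N  = fs;                     % e.g. one second of excitation
    x  = zeros(1, N);
    x(1:P:end) = 1;              % DT unit impulses every P samples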

Problem 4. Use P_4.m to generate synthesized speech using the impulse train. Look at the resulting plots of, and listen to, the signals generated by P_4.m.

Additional discussion for your report/presentation: How do the synthesized speech, predicted speech, and error signal compare to your original sampled speech? Consider both the perceived quality of the sound and the visual similarities and differences for the time- and frequency-domain signal representations (you need plots!). Be sure to identify what parameters you selected, e.g., pitch. What is the impact of these parameters? What happens if you assume zero initial conditions for each segment? (Try this and discuss your observations.)

Improvements on the Voiced Model

Some ideas to consider and possible ways to improve your synthesized speech are described below. Use these suggestions to further explore speech synthesis using difference equations.

Problem 5. Try different pitches for your voiced excitation in Problem 3. You may want to try to estimate a reasonable pitch period from your original speech vector by looking for the significant periodicity present in the time- or frequency-domain plots of each speech segment.

Problem 6. Try synthesizing a different segment of speech, such as "Sally sells sea shells by the sea shore," using the same approach as above. Is it intelligible? Does it retain the same perceptual characteristics as the original speech segment? Pay particular attention to the "sh" sound.

Problem 7. Try using an unvoiced excitation vector as input for the speech segment in Problem 6 by using the MATLAB command randn to generate white Gaussian noise having zero mean and unit variance (the default for randn). How does using this input instead of periodic impulses impact the perceived quality of the synthesized speech segments for the sentences from Problems 1 and 6?

Problem 8. Using what you've learned in Problems 5 through 7, try creating an excitation vector which uses both unvoiced and variable-pitch voiced inputs to create more realistic sounding speech. Try to do this for an arbitrary speech segment, as well as for the sentences above. You may want to analyze e[n] to determine whether voiced or unvoiced excitation is appropriate for synthesizing a given segment of speech. For segments where e[n] looks more random, use an unvoiced