



The Elements of Discrete Signal Analysis and the interactive internet programming with java, is a very helpful series of lecture slides, which made programming an easy task. The major points in this laboratory assignment are: Speech Modeling, Prediction, Synthesis, Time Domain, Transform Domain, Fourier Transform, Speech Transmission, Determine Bandwidth, Signal to Noise Ratios, Overlap of Blocks, Matlab Command Filter
In this laboratory assignment, you will learn how to generate digitally synthesized speech by using a difference equation model for digital speech. Using MATLAB, you can easily implement digital filters defined by difference equations (for real-time speech synthesis, the digital filters are implemented on a dedicated DSP board). Using your model, you will explore the quality of speech you can digitally synthesize and the associated storage and transmission requirements. Using the Fourier transform, you will determine the bandwidth required for speech transmission, explore how speech frequency content changes with time, and compare the spectra of true and synthesized speech.
OBJECTIVES
By the end of this assignment, you should be able to
Time Domain
LABORATORY PREPARATION
Problems:
Question 1. Assume that you have a digitized speech signal that was sampled at 8 kHz. If the speech is broken down into 20 ms blocks, how many samples NS are there per block? If one second of a recorded speech signal is in a MATLAB vector, how many 20 ms blocks, NBLKS, are there (assuming no overlap of blocks)? What is a one-line MATLAB command that will extract the nth 20 ms segment of speech, where n = 1, ..., NBLKS?
Question 2. Suppose you wish to use as input to your speech model a train of equally spaced Discrete-Time (DT) unit pulses, and would like the pitch to be 200 Hz. If the speech is assumed to be sampled at 8 kHz, how many DT samples are there per period?
Question 3. Look up the description of the MATLAB command filter. Determine how you need to define the vectors a and b, used as input to filter, in terms of αi and G to create difference equations that will allow you to perform the following operations:
You work for a company that is developing a digital telephone answering machine for home computers. The system will sample data from a telephone line, detect rings, pick up the phone, speak a greeting, and record a message. The greeting will be a text message you type in, and the system will synthesize your speech from samples of your speech which will be recorded and analyzed when the system is installed. In addition, it compresses the recorded phone messages before saving to the hard drive so as to save space.
You are in charge of developing the compression and synthesizer portions of the system. As a first step, you attempt to model your own speech and determine how much compression is possible.
Speech Fundamentals
Physically, Continuous-Time (CT) speech is produced when air from your lungs excites your vocal tract system. Sampling and quantizing CT speech results in digital speech. In telecommunications, speech is digitized by sampling at 8 kHz, using 8 bits per sample. The vocal tract behaves as a resonant cavity so that the signal emanating from your mouth is a weighted sum of delayed versions of the original vocal signal plus the excitations. We can model speech as a linear difference equation; the weights on the delayed signal versions are the coefficients of the model. Different sounds can be produced by using different inputs to and coefficients of this model.
Different types of speech sounds can be roughly categorized as either voiced or unvoiced, where the category is determined by the type of input used to produce the sound. Voiced sounds are produced by using a periodic sequence of pulses as input; the fundamental period of this sequence determines the resulting pitch. Vowels are voiced sounds; if you say "aah," you can feel the vibrations at the top of your vocal tract. Unvoiced sounds are produced by using random white noise as input (alone it sounds like static). These sounds generally are produced more by turbulent air flow in the mouth, such as "sh."
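The two excitation types described above can be sketched in MATLAB as follows. This is a minimal illustration, not part of the provided lab code: the 8 kHz sampling rate comes from the text, while the 100 Hz pitch, the one-second duration, and the variable names fs, f0, N, and P are assumptions chosen only for the example.

```matlab
fs = 8000;                 % sampling rate in Hz (from the text)
f0 = 100;                  % example pitch in Hz (assumed value)
N  = fs;                   % one second of excitation (assumed length)

P = round(fs / f0);        % pitch period in samples
voiced = zeros(N, 1);
voiced(1:P:N) = 1;         % periodic train of DT unit pulses

unvoiced = randn(N, 1);    % zero-mean, unit-variance white Gaussian noise

% soundsc(voiced, fs); soundsc(unvoiced, fs);   % uncomment to listen
```

Played on their own, the pulse train should sound like a buzz at the chosen pitch and the noise like static; neither resembles speech until it is passed through the vocal tract model.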
Discrete-Time Speech Models
A mathematical difference equation model for the vocal tract can be developed as follows. Since each successive DT speech sample is very closely related to previous samples, the value of the current speech sample can be estimated as a linear combination of previous samples.
$$ s^*[n] = \sum_{i=1}^{p} \alpha_i \, s[n-i] $$
s*[n] is the estimate of the speech signal s[n] for the nth sample. The error between the estimate and the original signal is
e[n] = s[n] - s*[n]
Prediction Model
Combining the two equations above yields a difference equation model of the prediction process for speech:
$$ e[n] = s[n] - \sum_{i=1}^{p} \alpha_i \, s[n-i] $$
This prediction model is used in telecommunications to increase the number of voice signals that can be transmitted over a channel. If the coefficients αi are known at both the transmitting and receiving ends, then only the error needs to be transmitted and the speech signal can be reconstructed at the receiving end using the difference equation above. At the transmitting end s[n] is the prediction filter input and e[n] is the filter output. It turns out that sending a sampled error signal can result in substantial channel bandwidth savings; this idea is explored further in this laboratory assignment.
Synthesis Model
We can modify this same basic speech prediction model for use in speech synthesis. If our goal is to create a signal s’[n] that mimics the original sampled speech segment s[n], then we can replace the error e[n] by an input signal x[n] multiplied by a gain G. Using the same form as the difference equation model for prediction results in the following difference equation model for speech synthesis:
$$ s'[n] = \sum_{i=1}^{p} \alpha_i \, s'[n-i] + G \, x[n] $$
If Gx[n] = e[n], then the synthesized speech s’[n] should exactly match the original sampled speech segment s[n]; in this case the process is called reconstruction rather than synthesis.
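The reconstruction property can be checked numerically with a short sketch. It assumes the usual sign convention, in which the prediction error is s[n] minus a weighted sum of past samples; the two coefficient values in alpha and the random stand-in for a speech block are hypothetical, chosen only so that the all-pole filter is stable.

```matlab
alpha = [1.3 -0.6];              % hypothetical predictor coefficients (p = 2)
s = randn(160, 1);               % stand-in for one 20 ms block at 8 kHz

% Prediction error: e[n] = s[n] - sum_i alpha_i * s[n-i]   (FIR filter)
e = filter([1 -alpha], 1, s);

% Reconstruction: s'[n] = sum_i alpha_i * s'[n-i] + e[n]   (all-pole filter)
s_rec = filter(1, [1 -alpha], e);

max_err = max(abs(s - s_rec));   % expected to be at round-off level
```

Both calls start from zero initial conditions, so the second filter is the exact inverse of the first and s_rec should agree with s up to floating-point round-off. Replacing e with G times some other excitation x turns the same second call into synthesis.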
Typically the coefficients αi change every 10-20 msec as the vocal tract changes to produce different sounds. In synthesis, you apply a sequence of excitations to the model that has coefficients appropriate for that time interval to
an equivalent amount of information, 4 bits x 8 kHz + 16 bits x 10 coefficients x 100 ten-ms chunks per second = 48,000 bits per second, which is 75% of the previous rate. If only 1 quantization bit is necessary for the error signal, 24,000 bits per second are necessary, or 37.5% of the previous rate. Using this technique, two people can have conversations in the same space as one person, a tremendous savings.
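The bit-rate arithmetic can be reproduced in a few lines. This sketch assumes that the "previous rate" is direct transmission at 8 bits per sample and 8 kHz, as stated in the Speech Fundamentals discussion; the variable names are illustrative only.

```matlab
fs        = 8000;    % samples per second
n_coef    = 10;      % coefficients per block
bits_coef = 16;      % bits per coefficient
blocks    = 100;     % 10 ms blocks per second

rate_direct = 8 * fs;                                  % 8-bit samples sent directly
rate_4bit   = 4 * fs + bits_coef * n_coef * blocks;    % 4-bit error plus coefficients
rate_1bit   = 1 * fs + bits_coef * n_coef * blocks;    % 1-bit error plus coefficients

ratio_4bit = rate_4bit / rate_direct;                  % fraction of the direct rate
ratio_1bit = rate_1bit / rate_direct;
```

Under these assumptions the coefficients cost a fixed 16,000 bits per second, so the savings come entirely from using fewer bits for the error signal.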
Both quantizing the sampled speech, as in the first paragraph, and quantizing the prediction error, as above, add distortion to the reconstructed speech. In some of the lab experiments, you will observe the differences between original speech segments and their reconstructed versions resulting from using different numbers of quantization levels. Representing the coefficients with 16 data bits also introduces some quantization error, which can lead to poor quality reproduction on the receiving end. This property is explored in the analysis questions.
In the computerized answering machine, the number of bits to be stored directly relates to the amount of space required. Since the library of sounds necessary to reproduce the answering machine greetings can become very large very quickly, having a good compression technique will allow more message flexibility and require less memory.
Windowing a Data Stream
As discussed above, the αi coefficients change every 10-20 ms. For every 10 ms block of speech a new set of α coefficients must be calculated from the sampled speech data. The process of extracting a 10 ms block of speech from the entire segment is called windowing.
The simplest type of windowing involves taking the speech samples in the current 10 ms segment as data. This operation is mathematically equivalent to multiplying the entire signal by a rectangular function having a value of 1 in the region of interest and 0 everywhere else, just as when you multiply a signal by a difference of time-shifted unit step functions. This window function is called a rectangular window. At the edges of the data region, there is a sharp transition from signal to nothing, which can cause problems in analysis.
A better way to window the data sequence is to multiply by a function that has a smooth transition from one end to the other. The most common function that does this is called the Hamming window, which can be calculated using MATLAB for any length by the function hamming.
To understand why the Hamming window is preferred to the simpler rectangular window, it is instructive to look at the impact of windowing in the frequency domain. Since windowing a signal is a multiplication operation in the time domain, it corresponds to convolving the Fourier transform of the window function with the frequency spectrum of the speech segment. If the window transform approximates an impulse in frequency, then this convolution operation yields a frequency spectrum identical to the original speech spectrum. However, the less the window transform is like an impulse, the more windowing distorts the original speech signal spectrum.
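One way to see this effect is to window the same block both ways and compare the spectra. The sketch below is illustrative only: it uses a 440 Hz test tone as a stand-in for speech, and it writes out the Hamming window formula explicitly (the same values the hamming function returns for its default symmetric window) so that no toolbox is needed.

```matlab
fs = 8000;
Ns = 80;                                   % 10 ms block at 8 kHz
n  = (0:Ns-1)';
x  = cos(2*pi*440*n/fs);                   % test tone standing in for speech

w_rect = ones(Ns, 1);                      % rectangular window
w_hamm = 0.54 - 0.46*cos(2*pi*n/(Ns-1));   % Hamming window, as in hamming(Ns)

Nfft   = 1024;                             % zero-pad for a smooth spectrum
X_rect = abs(fft(x .* w_rect, Nfft));
X_hamm = abs(fft(x .* w_hamm, Nfft));

f = (0:Nfft/2-1)' * fs / Nfft;             % frequency axis in Hz
plot(f, 20*log10(X_rect(1:Nfft/2)), f, 20*log10(X_hamm(1:Nfft/2)));
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('Rectangular', 'Hamming');
```

The Hamming curve should show a somewhat wider peak around 440 Hz but much lower sidelobes away from it, which is the trade-off that usually makes it the better choice for analyzing short speech blocks.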
Implementing Difference Equation Models in MATLAB
The filter command in MATLAB can be used to compute the response of a difference equation model to specified input signals and initial conditions. Using difference equations to operate on an input signal is called filtering. The initial conditions of the various delay elements play an important part in the output of a difference equation. For example, computing y[n] recursively using the difference equation y[n] = x[n] + x[n-1] - y[n-1] requires knowledge of y[n-1] at each time. The initial value of y[n-1] is the value of the output prior to application of a new input; different initial values result in different initial responses to the input. Often, when using difference equations for continual filtering of a constantly applied input, the initial conditions are assumed to be zero, as the response to these initial conditions will only affect the filter response at system start-up.
For this experiment, however, the initial conditions are very important. Since speech is a continuous phenomenon but we are breaking it into 20 ms chunks, we would like the filter output at a 20 ms boundary to be consistent with the values from the previous block. Otherwise, there will be "pops" - caused by the errors in the initial conditions - in the output. The last output samples in the previous 20 ms segment should be used as initial conditions for the current 20 ms segment.
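The effect of carrying conditions across a block boundary can be sketched with the four-argument form of filter, which accepts an initial state and returns the final state. This is a simplified illustration, not the provided lab code: the coefficients are hypothetical and, unlike in the lab, they are held fixed across both blocks.

```matlab
b = 1;  a = [1 -1.3 0.6];        % hypothetical all-pole synthesis filter
Ns = 160;                        % 20 ms block at 8 kHz
x = randn(2*Ns, 1);              % excitation for two consecutive blocks

y_full = filter(b, a, x);        % reference: filter the whole signal at once

zi = zeros(max(length(a), length(b)) - 1, 1);   % zero state before block 1
y_blk = zeros(size(x));
for k = 1:2
    idx = (k-1)*Ns + (1:Ns);
    [y_blk(idx), zi] = filter(b, a, x(idx), zi);   % carry state into next block
end

max_diff = max(abs(y_full - y_blk));   % round-off level when state is carried
```

Resetting zi to zeros inside the loop would make the second block start from rest and produce a discontinuity at the boundary. Note that zi is the internal state of filter rather than a list of past output samples; when the coefficients change from block to block, the Signal Processing Toolbox function filtic can build a suitable state vector from the last outputs of the previous block.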
LABORATORY EXPERIMENT
Voiced Speech Models
Your objective is to generate a model for your speech, and then to try synthesizing your speech. As a first pass, you want to try to synthesize a purely voiced segment of speech.
Problem 1. Record yourself saying "We were away a year ago." Sample at 8000 samples/s and store the resulting signal in a MATLAB vector. To save time, you may want to process only a portion of this sentence when you are troubleshooting.
Problem 2. You have been provided with a MATLAB function file P_4.m. Given your original sampled speech vector, an excitation vector, and the number of full 20 ms blocks in your speech segment, P_4.m generates the prediction error, predicted speech, speech reconstructed using the prediction error as input, and speech synthesized using the excitation vector as input. Information regarding how to use this function is included as comments in the file. Variable definitions can be found by typing help P_4.
You need to edit P_4.m and add in the commands to create the appropriate a and b vectors, segmented original speech, excitation vectors, and filter command to generate (a) e[n], (b) synthesized speech s’[n], and (c) speech synthesized using x[n] as input. The use of the filter command is illustrated in the code by generating s*[n] from s[n]. You may wish to review your answers to Questions 1 and 3 prior to coding.
Problem 3. Generate a DT signal that is a periodic sequence of DT unit impulses. Pick your period to generate a pitch somewhere in the range of 50-300 Hz. Use your answer to Question 2 for guidance.
Problem 4. Use P_4.m to generate synthesized speech using the impulse train. Look at the resulting plots of the signals generated by P_4.m, and listen to them.
Additional discussion for your report/presentation: How do the synthesized speech, predicted speech and error signal compare to your original sampled speech? Consider both the perceived quality of the sound and the visual similarities and differences for the time- and frequency-domain signal representations (You need plots!!). Be sure to identify what parameters you selected, e.g. pitch. What is the impact of these parameters? What happens if you assume zero initial conditions for each segment? (Try this and discuss your observations.)
Improvements on the Voiced Model
Some ideas to consider and possible ways to improve your synthesized speech are described below. Use these suggestions to further explore speech synthesis using difference equations.
Problem 5. Try different pitches for your voiced excitation in Problem 3. You may want to try to estimate a reasonable pitch period from your original speech vector by looking for the significant periodicity present in the time- or frequency-domain plots of each speech segment.
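One common way to look for periodicity in a block, offered here only as a suggestion and not as the method the assignment prescribes, is to find the strongest peak of the block's autocorrelation within the lag range that corresponds to 50-300 Hz. The sketch below runs on a synthetic voiced block with a known 125 Hz pitch; the filter coefficients and all variable names are hypothetical.

```matlab
fs = 8000;
N  = 160;                                  % one 20 ms block
x  = zeros(N, 1);
x(1:64:end) = 1;                           % impulse train, period 64 samples (125 Hz)
x  = filter(1, [1 -1.3 0.6], x);           % shape it with a hypothetical vocal tract

r = conv(x, flipud(x));                    % autocorrelation, lags -(N-1)..(N-1)
r = r(N:end);                              % keep lags 0..N-1, so r(k+1) is lag k

lag_min = round(fs / 300);                 % shortest period considered (300 Hz)
lag_max = min(round(fs / 50), N - 1);      % longest period that fits in the block
[~, k]  = max(r(lag_min+1 : lag_max+1));
P_est   = lag_min + k - 1;                 % estimated pitch period in samples
f0_est  = fs / P_est;                      % estimated pitch in Hz
```

On real speech the peak is less clean than in this synthetic case, so the estimate is worth checking against the time-domain plot of the same block, and it is only meaningful for voiced segments.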
Problem 6. Try synthesizing a different segment of speech, such as "Sally sells sea shells by the sea shore," using the same approach as above. Is it intelligible? Does it retain the same perceptual characteristics as the original speech segment? Pay particular attention to the "sh" sound.
Problem 7. Try using an unvoiced excitation vector as input for the speech segment in Problem 6 by using the MATLAB command randn to generate white Gaussian noise having zero mean and unit variance (the default for randn). How does using this input instead of periodic impulses impact the perceived quality of the synthesized speech segments for the sentences from Problems 1 and 6?
Problem 8. Using what you've learned in Problems 5 through 7, try creating an excitation vector which uses both unvoiced and variable-pitch voiced inputs to create more realistic sounding speech. Try to do this for an arbitrary speech segment, as well as for the sentences above. You may want to analyze e[n] to determine whether voiced or unvoiced excitation is appropriate for synthesizing a given segment of speech. For segments where e[n] looks more random, use an unvoiced