CELP
Report on Code-Book Excited Linear Prediction
MAJOR PROJECT - I
FINAL SUBMISSION REPORT
(Year 2012)
DSP TOOLS IN WIRELESS COMMUNICATION
SUBMITTED TO:
Mr. Hemant Kumar Meena
Presented by:
Piyush Virmani (9102259)
Palash Relan (9102262)
CERTIFICATE
This is to certify that the work titled “DSP Tools in Wireless Communication”
submitted by “Piyush Virmani & Palash Relan” in partial fulfilment for the award
of degree B.TECH of Jaypee Institute of Information Technology University,
Noida has been carried out under my supervision. This work has not been
submitted partially or wholly to any other University or Institute for the award of
this or any other degree or diploma.
Signature of Supervisor ………………………………………………………...
Name of Supervisor ……………………..…………………………………
Designation ……………………..…………………………………………
Date ……………………..………………………………………
ACKNOWLEDGEMENT
We are highly obliged to our project supervisor, Mr. Hemant Kumar
Meena, for assigning this work of study on the topic DSP Tools in
Wireless Communication which has helped us to develop
understanding of Speech processing. We are grateful to him for all his
time, assistance and guidance which motivated us to work on this topic
and without which our major project would have not seen its end. We
are also thankful to the external examiners Mr. R.K. Dubey and Mr. V.K.
Dwivedi, who helped us build a better understanding of the matter.
Date: …………………..
Name of Students: Piyush Virmani (09102259)
Palash Relan (09102262)
CONTENTS
1. Certificate
2. Acknowledgement
3. Contents
4. Abstract
i. Wireless Communication for Voice Transmission
ii. Digital Speech Processing
5. Application of Digital Speech Processing
i. Speech Coding
ii. Text to Speech Synthesis
iii. Speech Recognition and Pattern Matching
iv. Other Applications
6. Human Speech
7. Properties of Speech
8. Speech Analysis
i. Short Term Energy
ii. Short Term Zero Crossing
iii. Short Term Autocorrelation Function
9. General Encoding of Arbitrary Waveforms
i. Types of Vocoders
ii. Vocoder Quality Measurement
10. Linear Predictive Analysis
i. Introduction
ii. LPC Model
iii. LPC Analysis
i. Input Speech
ii. Pitch Period Estimation
iii. Vocal Tract Filter
iv. Voiced/Unvoiced Determination
v. Levinson-Durbin Algorithm
iv. LPC Synthesis/Decoding
v. Transmission of Parameters
vi. Applications of LPC
11. Full LPC Model and Implementation
i. LPC Encoder Model
ii. LPC Decoder Model
iii. MATLAB Implementation
12. Discussion and Conclusion
13. References
Abstract
Wireless Communication for Voice Transmission
Wireless communications operators see phenomenal growth in consumer demand for high
quality and low cost services. Since the physical spectrum for wireless services is limited,
operators and equipment suppliers continually find ways to optimise bandwidth efficiency.
Digital communications technology provides an efficiency advantage over analog wireless
communications; multiplexing and filtering is easier, components are cheaper, encryption is
more secure and network management is easier. Additionally, digital technology provides more
value added services to customers (security, text and voice messages together, etc.).
Today wireless communication is primarily voice. The operator meets the increasing need
for services by combining digital technology and special encoding techniques for voice. These
encoders ("vocoders") take advantage of predictable elements in human speech. Several low
data rate encoders are described here with an assessment of their subjective quality.
Test methods to determine voice quality are necessarily subjective.
The most efficient vocoders have acceptable quality levels and have data rates between 2
and 8 kbit/s. Higher data rate encoders (8-13 kbit/s) have improved quality while 32 kbit/s coders
have excellent quality (but use more network resources). The operator must engineer the proper
balance between cost, quality and available resources to provide the optimum solution to the
customer.
Digital Speech Processing
Since even before the time of Alexander Graham Bell’s revolutionary invention, engineers and
scientists have studied the phenomenon of speech communication with an eye on creating more
efficient and effective systems of human-to-human and human-to-machine communication.
Starting in the 1960s, digital signal processing (DSP) assumed a central role in speech studies,
and today DSP is the key to realizing the fruits of the knowledge that has been gained through
decades of research. Concomitant advances in integrated circuit technology and computer
architecture have aligned to create a technological environment with virtually limitless
opportunities for innovation in speech communication applications.
In this project, we highlight the central role of DSP techniques in modern speech communication
research and applications.
Applications of Digital Speech Processing
The first step in most applications of digital speech processing is to convert the acoustic
waveform to a sequence of numbers. Most modern A-to-D converters operate by sampling at a
very high rate, applying a digital lowpass filter with cutoff set to preserve a prescribed
bandwidth, and then reducing the sampling rate to the desired sampling rate, which can be as low
as twice the cutoff frequency of the sharp-cutoff digital filter. This discrete-time representation is
the starting point for most applications.
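The oversample-filter-decimate pipeline described above can be sketched in a few lines. The code below is an illustrative approximation (a windowed-sinc lowpass and integer downsampling; the tap count and tone frequency are arbitrary choices, not any particular converter's design):

```python
import numpy as np

def design_lowpass(num_taps, cutoff):
    """Windowed-sinc FIR lowpass; cutoff is a fraction of the sampling rate."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = 2 * cutoff * np.sinc(2 * cutoff * n)   # ideal lowpass impulse response
    h *= np.hamming(num_taps)                  # window to tame the ripple
    return h / h.sum()                         # unity gain at DC

def reduce_rate(x, factor, num_taps=101):
    """Lowpass to the new Nyquist band, then keep every factor-th sample."""
    h = design_lowpass(num_taps, cutoff=0.5 / factor)
    return np.convolve(x, h, mode="same")[::factor]

# One second of a 1 kHz tone captured at 48 kHz, reduced to 8 kHz
fs = 48000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
y = reduce_rate(x, factor=6)
print(len(y))   # 8000 samples: one second at the new 8 kHz rate
```

The filter's cutoff sits at the new Nyquist frequency, so the 1 kHz tone passes through essentially unchanged while anything that would alias is suppressed before the rate reduction.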
Speech Coding
Perhaps the most widespread applications of digital speech processing technology occur in the
areas of digital transmission and storage of speech signals. In these areas the centrality of the
digital representation is obvious, since the goal is to compress the digital waveform
representation of speech into a lower bit-rate representation. It is common to refer to this activity
as “speech coding” or “speech compression.”
Speech coders enable a broad range of applications including narrowband and broadband wired
telephony, cellular communications, voice over internet protocol (VoIP) (which utilizes the
internet as a real-time communications medium), secure voice for privacy and encryption (for
national security applications), extremely narrowband communications channels (such as
battlefield applications using high frequency (HF) radio), and for storage of speech for telephone
answering machines, interactive voice response (IVR) systems, and pre-recorded messages.
Speech coders often utilize many aspects of both the speech production and speech perception
processes, and hence may not be useful for more general audio signals such as music. Coders
that are based on incorporating only aspects of sound perception generally do not achieve as
much compression as those based on speech production, but they are more general and can be
used for all types of audio signals. These coders are widely deployed in MP3 and AAC players
and for audio in digital television systems.
Text-to-Speech Synthesis
For many years, scientists and engineers have studied the speech production process with the
goal of building a system that can start with text and produce speech automatically. In a sense, a
text-to-speech synthesizer such as the one depicted in the figure is a digital simulation of the entire upper
part of the speech chain diagram.
Text to Speech Synthesis Block Diagram
The input to the system is ordinary text such as an email message or an article from a newspaper
or magazine. The first block in the text-to-speech synthesis system, labelled linguistic rules, has
the job of converting the printed text input into a set of sounds that the machine must synthesize.
The conversion from text to sounds involves a set of linguistic rules that must determine the
appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.)
so that the resulting synthetic speech will express the words and intent of the text message in
what passes for a natural voice that can be decoded accurately by human speech perception.
Once the proper pronunciation of the text has been determined, the role of the synthesis
algorithm is to create the appropriate sound sequence to represent the text message in the form of
speech. In essence, the synthesis algorithm must simulate the action of the vocal tract system in
creating the sounds of speech.
Speech Recognition and Other Pattern Matching Problems
Another large class of digital speech processing applications is concerned with the automatic
extraction of information from the speech signal. Most such systems involve some sort of pattern
matching. The figure shows a block diagram of a generic approach to pattern matching problems
in speech processing. Such problems include the following: speech recognition, where the object
is to extract the message from the speech signal; speaker recognition, where the goal is to
identify who is speaking; speaker verification, where the goal is to verify a speaker’s claimed
identity from analysis of their speech signal; word spotting, which involves monitoring a speech
signal for the occurrence of specified words or phrases; and automatic indexing of speech
recordings based on recognition (or spotting) of spoken keywords.
The first block in the pattern matching system converts the analog speech waveform to digital
form using an A-to-D converter. The feature analysis module converts the sampled speech signal
to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are
also used to derive the feature vectors. The final block in the system, namely the pattern
matching block, dynamically time aligns the set of feature vectors representing the speech signal
with a concatenated set of stored patterns, and chooses the identity associated with the pattern
which is the closest match to the time-aligned set of feature vectors of the speech signal. The
symbolic output consists of a set of recognized words, in the case of speech recognition, or the
identity of the best matching talker, in the case of speaker recognition, or a decision as to
whether to accept or reject the identity claim of a speaker in the case of speaker verification.
Speech Recognition Block Diagram
The major areas where such a system finds applications include command and control of
computer software, voice dictation to create letters, memos, and other documents, natural
language voice dialogues with machines to enable help desks and call centres, and for agent
services such as calendar entry and update, address list modification and entry, etc.
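The dynamic time alignment step described above is classically implemented with dynamic time warping (DTW). The following is a minimal sketch, not from the report: the one-dimensional "feature" sequences and the template labels are made-up toy values.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (rows are frames). Classic O(len(a) * len(b)) dynamic program."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(features, templates):
    """Return the label of the stored template closest under DTW."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# A slowed-down utterance should still match its template after time alignment
template_yes = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
template_no = np.array([[2.0], [2.0], [0.0], [0.0]])
utterance = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [1.0], [0.0]])
templates = {"yes": template_yes, "no": template_no}
print(recognize(utterance, templates))   # "yes"
```

Because the dynamic program may repeat a template frame (vertical or horizontal moves), the stretched utterance aligns to the "yes" template with zero cost even though the two sequences have different lengths.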
Other Speech Applications
Human Speech
The fundamental purpose of speech is communication, i.e., the transmission of messages.
According to Shannon’s information theory , a message represented as a sequence of discrete
symbols can be quantified by its information content in bits, and the rate of transmission of
information is measured in bits/second (bps). In speech production, as well as in many human-
engineered electronic communication systems, the information to be transmitted is encoded in
the form of a continuously varying (analog) waveform that can be transmitted, recorded,
manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental
analog form of the message is an acoustic waveform, which we call the speech signal. Speech
signals, as illustrated in Figure 1.1, can be converted to an electrical waveform by a microphone,
further manipulated by both analog and digital signal processing, and then converted back to
acoustic form by a loudspeaker, a telephone handset or headphone, as desired. This form of
speech processing is, of course, the basis for Bell’s telephone invention as well as today’s
multitude of devices for recording, transmitting, and manipulating speech and audio signals.
Properties of Speech
The two types of speech sounds, voiced and unvoiced, produce different sounds and spectra due
to their differences in sound formation. With voiced speech, air pressure from the lungs forces
normally closed vocal cords to open and vibrate. The vibration frequencies (pitch) range from
about 50 to 400 Hz (depending on the person’s age and sex) and form resonances in the vocal
tract at odd harmonics. These resonance peaks are called formants and can be seen in the voiced
speech figures below.
Voiced Speech Sample
Power Spectral Density, Voiced Speech
Unvoiced sounds, called fricatives (e.g., s, f, sh) are formed by forcing air through an opening
(hence the term, derived from the word “friction”). Fricatives do not vibrate the vocal cords and
therefore do not produce as much periodicity as seen in the formant structure in voiced speech;
unvoiced sounds appear more noise-like (see figures 3 and 4 below). Time domain samples lose
periodicity and the power spectral density does not display the clear resonant peaks that are
found in voiced sounds.
Unvoiced Speech Sample
Power Spectral Density, Unvoiced Speech
The spectrum for speech (combined voiced and unvoiced sounds) has a total bandwidth of
approximately 7000 Hz with an average energy at about 3000 Hz. The auditory canal optimizes
speech detection by acting as a resonant cavity at this average frequency. Note that the power of
speech spectra and the periodic nature of formants drastically diminish above 3500 Hz.
Speech encoding algorithms can be less complex than general encoding by concentrating
(through filters) on this region. Furthermore, since line quality telecommunications employ
filters that pass frequencies up to only 3000-4000 Hz, high frequencies produced by fricatives
are removed. A caller will often have to spell or otherwise distinguish these sounds to be
understood (e.g., “F as in Frank”).
Schematic Model of Vocal Tract System
Speech Analysis
Since our goal is to extract parameters of the model by analysis of the speech signal, it is
common to assume structures (or representations) for both the excitation generator and the linear system.
One such model uses a more detailed representation of the excitation in terms of separate source
generators for voiced and unvoiced speech as shown in the figure.
In this model the unvoiced excitation is assumed to be a random noise sequence, and the voiced
excitation is assumed to be a periodic impulse train with impulses spaced by the pitch period (P0)
rounded to the nearest sample. The pulses needed to model the glottal flow waveform during
voiced speech are assumed to be combined (by convolution) with the impulse response of the
linear system, which is assumed to be slowly-time-varying (changing every 50–100 ms or so).
By this we mean that over the timescale of phonemes, the impulse response, frequency response,
and system function of the system remains relatively constant. For example over time intervals
of tens of milliseconds, the system can be described by the convolution expression

  s[n] = Σ_m hˆn[m] e[n − m]

where the subscript ˆn denotes the time index pointing to the block of samples of the entire speech
signal s[n] wherein the impulse response hˆn[m] applies. We use n for the time index within that
interval, and m is the index of summation in the convolution sum.
To simplify analysis, it is often assumed that the system is an all-pole system with system
function of the form:

  H(z) = G / ( 1 − Σ_{k=1}^{p} a_k z^{−k} )
Although the linear system is assumed to model the composite spectrum effects of radiation,
vocal tract tube, and glottal excitation pulse shape (for voiced speech only) over a short time
interval, the linear system in the model is commonly referred to as simply the “vocal tract”
system and the corresponding impulse response is called the “vocal tract impulse response.” For
all-pole linear systems, as represented by the equation, the input and output are related by a
difference equation of the form:

  s[n] = Σ_{k=1}^{p} a_k s[n − k] + G e[n]
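To make the all-pole model concrete, the difference equation can be run directly, sample by sample. This is an illustrative sketch only: the two-pole resonator coefficients and the impulse-train excitation below are made-up stand-ins, not values from the report.

```python
import numpy as np

def allpole_synthesize(a, excitation, gain=1.0):
    """Run the all-pole difference equation
        s[n] = sum_k a[k] * s[n-k] + G * e[n]
    one sample at a time."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s

# Voiced excitation: impulse train with an 80-sample pitch period (100 Hz at 8 kHz)
e = np.zeros(400)
e[::80] = 1.0
# A stable two-pole resonator standing in for the vocal tract
a = np.array([1.3, -0.9])
s = allpole_synthesize(a, e)
print(s[:3])   # 1.0, 1.3, 0.79
```

Each impulse excites a damped oscillation from the resonator, so the output is a periodic train of decaying ringing pulses, a crude but recognizable model of voiced speech.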
Short-Time Energy and Zero-Crossing Rate
Two basic short-time analysis functions useful for speech signals are the short-time energy and
the short-time zero-crossing rate. These functions are simple to compute, and they are useful for
estimating properties of the excitation function in the model.
The short-time energy is defined as:

  Eˆn = Σ_m ( x[m] w[ˆn − m] )²
Similarly, the short-time zero-crossing rate is defined as the weighted average of the number of
times the speech signal changes sign within the time window. Representing this operator in terms
of linear filtering leads to:

  Zˆn = Σ_m (1/2) | sgn(x[m]) − sgn(x[m − 1]) | w[ˆn − m]
The short-time energy and short-time zero-crossing rate are important because they abstract
valuable information about the speech signal, and they are simple to compute. The short-time
energy is an indication of the amplitude of the signal in the interval around time ˆn. From our
model, we expect unvoiced regions to have lower short-time energy than voiced regions.
Similarly, the short-time zero-crossing rate is a crude frequency analyzer. Voiced signals have a
high frequency (HF) fall off due to the lowpass nature of the glottal pulses, while unvoiced
sounds have much more HF energy. Thus, the short-time energy and short-time zero-crossing
rate can be the basis for an algorithm for making a decision as to whether the speech signal is
voiced or unvoiced at a particular time.
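A minimal sketch of these two short-time measures, applied to a synthetic signal (a tone standing in for voiced speech and noise for unvoiced; the window and hop sizes are arbitrary choices, not values from the report):

```python
import numpy as np

def frames(x, win_len, hop):
    return [x[i:i + win_len] for i in range(0, len(x) - win_len + 1, hop)]

def short_time_energy(x, win_len=240, hop=80):
    """Sum of squares of the windowed samples, frame by frame."""
    w = np.hamming(win_len)
    return np.array([np.sum((f * w) ** 2) for f in frames(x, win_len, hop)])

def short_time_zcr(x, win_len=240, hop=80):
    """Fraction of sample pairs in each frame where the signal changes sign."""
    return np.array([np.mean(np.abs(np.diff(np.sign(f))) / 2)
                     for f in frames(x, win_len, hop)])

# Half a second of a 150 Hz tone ("voiced") followed by noise ("unvoiced")
fs = 8000
t = np.arange(fs // 2) / fs
rng = np.random.default_rng(0)
x = np.concatenate([np.sin(2 * np.pi * 150 * t),
                    0.2 * rng.standard_normal(fs // 2)])

e, z = short_time_energy(x), short_time_zcr(x)
mid = len(e) // 2
print(e[:mid].mean() > e[mid:].mean())   # True: voiced region is louder
print(z[:mid].mean() < z[mid:].mean())   # True: noise crosses zero more often
```

Thresholding these two curves jointly (high energy and low zero-crossing rate suggests voiced; the reverse suggests unvoiced) is exactly the kind of decision rule the text describes.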
Short-Time Autocorrelation Function (STACF)
The autocorrelation function is often used as a means of detecting periodicity in signals, and it is
also the basis for many spectrum analysis methods. This makes it a useful tool for short-time
speech analysis. The STACF is defined as the deterministic autocorrelation function of the
sequence xˆn[m] = x[m]w[ˆn − m] that is selected by the window shifted to time ˆn, i.e.,

  φˆn[l] = Σ_m xˆn[m] xˆn[m + l]
Voiced and Unvoiced Segments of speech and their corresponding Autocorrelation
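As an illustrative sketch, the short-time autocorrelation can be computed for one windowed segment and used to locate a pitch-related peak. The tone frequency, window position, and window length below are arbitrary test values, not values from the report.

```python
import numpy as np

def stacf(x, n_hat, win_len=320, max_lag=200):
    """Deterministic autocorrelation of the windowed segment picked out at n_hat."""
    w = np.hamming(win_len)
    seg = x[n_hat:n_hat + win_len] * w
    return np.array([np.sum(seg[:win_len - l] * seg[l:]) for l in range(max_lag)])

# 100 Hz tone at 8 kHz: expect a strong autocorrelation peak near lag 80
fs = 8000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)
r = stacf(x, n_hat=1000)
peak = 40 + int(np.argmax(r[40:]))   # ignore the lag-0 neighbourhood
print(peak)
```

For a periodic segment the autocorrelation peaks at multiples of the pitch period; the window taper makes the envelope decay with lag, so the first peak (near lag 80 here) is the strongest.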
General Encoding of Arbitrary Waveforms
Waveform encoders typically use Time Domain or Frequency Domain coding and attempt to
accurately reproduce the original signal. These general encoders do not assume any previous
knowledge about the signal. The decoder output waveform is very similar to the signal input to
the coder. Examples of these general encoders include Uniform Binary Coding for music
Compact Disks and Pulse Code Modulation for telecommunications.
Pulse Code Modulation (PCM) is a general encoder used in standard voice grade circuits.
PCM encodes into eight-bit words Pulse Amplitude Modulated (PAM) signals that have
been sampled at the Nyquist rate for the voice channel (8000 samples per second, or twice the
channel bandwidth). The PCM signal therefore requires a 64 kb/s transmission channel.
However, this is not feasible over communication channels where bandwidth is a premium. It is
also inefficient when the communication is primarily voice, which exhibits a certain amount of
predictability, as seen in the periodic structure of formants. The increasing use of limited
transmission media, such as radio and satellite links, and of limited voice storage resources requires
more efficient coding methods. Special encoders have been designed that assume the input
signal is voice only. These vocoders use speech production models to reproduce only the
intelligible quality of the original signal waveform.
The most popular vocoders used in digital communications are presented below.
Types of Voice Encoders
Linear Predictive Coder (LPC)
Regular Pulse Excited (RPE) Coder
Code-Book Excited Linear Prediction (CELP) Coder
Vocoder Quality Measurements
There are several points on which to rate vocoder quality:
Cost/complexity
Voice quality
Data rate
Transparency for non-voice signals
Tolerance of transmission errors
Effects of tandem encodings
Coding formats
Signal processing requirements
It is suggested that the most important quality measures are voice quality, data rate,
communication delay and coding algorithm complexity. While all of these can easily be
measured and analysed, voice quality remains subjective.
Linear Predictive Analysis
Proposal
Linear predictive coding (LPC) is defined as a digital method for encoding an analog signal in
which a particular value is predicted by a linear function of the past values of the signal. It was
first proposed as a method for encoding human speech by the United States Department of
Defense in Federal Standard 1015, published in 1984. Human speech is produced in the vocal
tract which can be approximated as a variable diameter tube. The linear predictive coding (LPC)
model is based on a mathematical approximation of the vocal tract represented by this tube of a
varying diameter. At a particular time, t, the speech sample s(t) is represented as a linear sum of
the p previous samples. The most important aspect of LPC is the linear predictive filter which
allows the value of the next sample to be determined by a linear combination of previous
samples.
Under normal circumstances, speech is sampled at 8000 samples/second with 8 bits used to
represent each sample. This provides a rate of 64000 bits/second. Linear predictive coding
reduces this to 2400 bits/second. At this reduced rate the speech has a distinctive synthetic sound
and there is a noticeable loss of quality. However, the speech is still audible and it can still be
easily understood. Since there is information loss in linear predictive coding, it is a lossy form of
compression.
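The prediction described above, each sample modeled as a linear sum of the p previous samples, can be sketched with the autocorrelation method and the Levinson-Durbin recursion (listed later in the report's contents). The AR(2) test signal and its coefficients below are made-up illustration values, not data from the report.

```python
import numpy as np

def autocorr(x, p):
    """Autocorrelation values r[0..p] of the signal x."""
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for predictor coefficients a[1..p]
    such that s[n] is approximated by sum_k a[k] * s[n-k]."""
    a = np.zeros(p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k                 # prediction error shrinks each order
    return a[1:], err

# Made-up AR(2) test signal: s[n] = 1.3 s[n-1] - 0.9 s[n-2] + e[n]
rng = np.random.default_rng(1)
e = rng.standard_normal(8000)
s = np.zeros(8000)
for n in range(8000):
    s[n] = e[n]
    if n >= 1:
        s[n] += 1.3 * s[n - 1]
    if n >= 2:
        s[n] -= 0.9 * s[n - 2]

a, _ = levinson_durbin(autocorr(s, 2), 2)
print(np.round(a, 2))   # recovered coefficients, close to [1.3, -0.9]
```

The recursion recovers the generating coefficients from the signal alone; in a vocoder, it is these few coefficients per frame, rather than the samples themselves, that are quantized and transmitted.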
Introduction
There exist many different types of speech compression that make use of a variety of different
techniques. However, most methods of speech compression exploit the fact that speech
production occurs through slow anatomical movements and that the speech produced has a
limited frequency range. The frequency of human speech production ranges from around 300 Hz
to 3400 Hz. Speech compression is often referred to as speech coding which is defined as a
method for reducing the amount of information needed to represent a speech signal. Most forms
of speech coding are usually based on a lossy algorithm. Lossy algorithms are considered
acceptable when encoding speech because the loss of quality is often undetectable to the human
ear. There are many other characteristics about speech production that can be exploited by
speech coding algorithms. One fact that is often used is that periods of silence take up more than
50% of conversations. An easy way to save bandwidth and reduce the amount of information
needed to represent the speech signal is to not transmit the silence. Another fact about speech
production that can be taken advantage of is that mechanically there is a high correlation
between adjacent samples of speech. Most forms of speech compression are achieved by
modelling the process of speech production as a linear digital filter. The digital filter and its slow
changing parameters are usually encoded to achieve compression from the speech signal.
Linear Predictive Coding (LPC) is one of the methods of compression that models the process
of speech production. Specifically, LPC models this process as a linear sum of earlier samples
using a digital filter driven by an excitation signal. An alternate explanation is that linear
prediction filters attempt to predict future values of the input signal based on past signals. LPC
“models speech as an autoregressive process, and sends the parameters of the process as opposed
to sending the speech itself”.
All vocoders, including LPC vocoders, have four main attributes: bit rate, delay, complexity,
and quality. Any voice coder, regardless of the algorithm it uses, will have to make trade-offs
between these different attributes. The first attribute of vocoders, the bit rate, is used to
determine the degree of compression that a vocoder achieves. Uncompressed speech is usually
transmitted at 64 kb/s using 8 bits/sample and a rate of 8 kHz for sampling. Any bit rate below
64 kb/s is considered compression.
The linear predictive coder transmits speech at a bit rate of 2.4 kb/s, an excellent rate of
compression. Delay is another important attribute for vocoders that are involved with the
transmission of an encoded speech signal. Vocoders which are involved with the storage of the
compressed speech, as opposed to transmission, are not as concerned with delay. The general delay
standard for transmitted speech conversations is that any delay that is greater than 300 ms is
considered unacceptable. The third attribute of voice coders is the complexity of the algorithm
used. The complexity affects both the cost and the power consumption of the vocoder. Linear
predictive coding, because of its high compression rate, is very complex and involves executing
millions of instructions per second.
The general algorithm for linear predictive coding involves an analysis or encoding part and
a synthesis or decoding part. In the encoding, LPC takes the speech signal in blocks or frames of
speech and determines the input signal and the coefficients of the filter that will be capable of
reproducing the current block of speech. This information is quantized and transmitted. In the
decoding, LPC rebuilds the filter based on the coefficients received. The filter can be thought of
as a tube which, when given an input signal, attempts to output speech. Additional information
about the original speech signal is used by the decoder to determine the input or excitation signal
that is sent to the filter for synthesis.
LPC Model
The particular source-filter model used in LPC is known as the Linear predictive coding model.
It has two key components: analysis or encoding and synthesis or decoding. The analysis part of
LPC involves examining the speech signal and breaking it down into segments or blocks. Each
segment is then examined further to find the answers to several key questions:
Is the segment voiced or unvoiced?
What is the pitch of the segment?
What parameters are needed to build a filter that models the vocal tract for the current
segment?
LPC analysis is usually conducted by a sender who answers these questions and usually
transmits these answers onto a receiver. The receiver performs LPC synthesis by using the
answers received to build a filter that when provided the correct input source will be able to
accurately reproduce the original speech signal.
Essentially, LPC synthesis tries to imitate human speech production. The figure demonstrates
what parts of the receiver correspond to which parts of the human anatomy. This diagram is for a
general voice or speech coder and is not specific to linear predictive coding. All voice coders
tend to model two things: excitation and articulation. Excitation is the type of sound that is
passed into the filter or vocal tract and articulation is the transformation of the excitation signal
into speech.
LPC Analysis/Encoding
Input speech
The input signal is sampled at a rate of 8000 samples per second. This input signal is then broken
up into segments or blocks which are each analysed and transmitted to the receiver. The 8000
samples in each second of speech signal are broken into 180 sample segments. This means that
each segment represents 22.5 milliseconds of the input speech signal.
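The segmentation arithmetic above can be checked with a short sketch:

```python
import numpy as np

FRAME_LEN = 180   # samples per segment (per the report)
FS = 8000         # sampling rate in samples per second

def segment(signal):
    """Split the input into consecutive 180-sample frames, dropping any tail."""
    n_frames = len(signal) // FRAME_LEN
    return signal[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

one_second = np.zeros(FS)
print(segment(one_second).shape)   # (44, 180): 44 whole frames per second
print(1000 * FRAME_LEN / FS)       # 22.5 ms per frame
```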
Voiced/Unvoiced Determination
According to LPC-10 standards, before a speech segment is determined as being voiced or
unvoiced it is first passed through a low-pass filter with a bandwidth of 1 kHz. Determining if a
segment is voiced or unvoiced is important because voiced sounds have a different waveform
than unvoiced sounds. The differences between the two waveforms create a need for the use of two
different input signals for the LPC filter in the synthesis or decoding. One input signal is for
voiced sounds and the other is for unvoiced. The LPC encoder notifies the decoder if a signal
segment is voiced or unvoiced by sending a single bit.
Recall that voiced sounds are usually vowels and can be considered similar to periodic,
pulse-like waveforms. These sounds have high average energy levels, which means that they
have very large amplitudes. Voiced sounds also have distinct resonant or formant frequencies.
Pitch Period Estimation
Determining if a segment is a voiced or unvoiced sound is not all of the information that is
needed by the LPC decoder to accurately reproduce a speech signal. In order to produce an input
signal for the LPC filter the decoder also needs another attribute of the current speech segment
known as the pitch period. The period for any wave, including speech signals, can be defined as
the time required for one wave cycle to completely pass a fixed position. For speech signals, the
pitch period can be thought of as the period of the vocal cord vibration that occurs during the
production of voiced speech. Therefore, the pitch period is only needed for the decoding of
voiced segments and is not required for unvoiced segments since they are produced by turbulent
air flow not vocal cord vibrations.
It is very computationally intensive to determine the pitch period for a given segment of
speech. There are several different types of algorithms that could be used. One type of algorithm
takes advantage of the fact that the autocorrelation of a periodic function, Rxx(k), will have a
maximum when k is equivalent to the pitch period. These algorithms usually detect a maximum
value by checking the autocorrelation value against a threshold value. One problem with
algorithms that use autocorrelation is that the validity of their results is susceptible to
interference as a result of other resonances in the vocal tract. When interference occurs the
algorithm cannot guarantee accurate results. Another problem with autocorrelation algorithms
occurs because voiced speech is not entirely periodic. This means that the maximum will be
lower than it should be for a true periodic signal.
Instead of autocorrelation, LPC uses an algorithm called the average magnitude difference
function (AMDF), which is defined as

    AMDF(P) = (1/N) * sum_{i=1..N} | y_i - y_(i-P) |

Since the pitch period, P, for humans lies in a limited range, the AMDF is evaluated only for
the possible pitch period values. LPC therefore assumes that the pitch period is between 2.5
and 19.5 milliseconds; if the signal is sampled at a rate of 8000 samples/second, then
20 < P < 160.
For voiced segments we can consider the set of speech samples for the current segment, {yn},
as a periodic sequence with period Po. This means that samples that are Po apart should have
similar values and that the AMDF function will have a minimum at Po, that is when P is equal to
the pitch period.
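The AMDF search over the allowed period range can be sketched as follows, again as an illustrative NumPy version rather than the report's MATLAB; the 100 Hz test tone (true period 80 samples at 8 kHz) is our own choice:

```python
import numpy as np

def amdf(y, P):
    """Average magnitude difference at candidate period P (in samples)."""
    n = len(y) - P
    return np.mean(np.abs(y[P:P + n] - y[:n]))

def amdf_pitch(y, fs):
    """Search the assumed pitch range, 2.5-19.5 ms (lags 20-156 at
    8 kHz), for the lag that minimises the AMDF."""
    lo, hi = int(round(fs * 2.5e-3)), int(round(fs * 19.5e-3))
    vals = [amdf(y, P) for P in range(lo, hi + 1)]
    return lo + int(np.argmin(vals))

fs = 8000
t = np.arange(480) / fs
y = np.sin(2 * np.pi * 100 * t)   # 100 Hz tone: true period = 80 samples
print(amdf_pitch(y, fs))          # minimum found at P = 80
```

For a truly periodic input the AMDF at the correct lag is (near) zero; for voiced speech it is merely a deep minimum, which is exactly the depth property the next paragraph uses for the voiced/unvoiced decision.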
An advantage of the AMDF is that it can also be used to determine whether a segment is voiced
or unvoiced. When the AMDF is applied to an unvoiced signal, the difference between the
minimum and the average value is very small compared to voiced signals, and this difference
can be used to make the voiced/unvoiced determination. For unvoiced segments the AMDF
still has a minimum when P equals the pitch period, but any additional minima lie very close
to the average value; that is, these minima are not very deep.
[Figure: AMDF of a voiced segment (deep minima) and an unvoiced segment (shallow minima)]
Vocal Tract Filter
The filter that is used by the decoder to recreate the original input signal is created based on
a set of coefficients. These coefficients are extracted from the original signal during encoding
and are transmitted to the receiver for use in decoding. Each speech segment has different filter
coefficients or parameters that it uses to recreate the original sound. Not only are the parameters
themselves different from segment to segment, but the number of parameters differs from voiced
to unvoiced segments: voiced segments use 10 parameters to build the filter, while unvoiced
sounds use only 4 parameters.
A filter with n parameters is referred to as an nth order filter. In order to find the filter coefficients
that best match the current segment being analysed, the encoder attempts to minimize the mean
squared error, expressed as:

    E[e_n^2] = E[ ( y_n - sum_{i=1..M} a_i * y_(n-i) )^2 ]
where {yn} is the set of speech samples for the current segment and {ai} is the set of coefficients.
In order to provide the most accurate coefficients, {ai} is chosen to minimize the average value
of en for all samples in the segment.
The first step in minimizing the average mean squared error is to take the derivative, which
produces a set of M equations. In order to solve for the filter coefficients, E[y_(n-i) y_(n-j)]
has to be estimated. There are two approaches to this estimation: autocorrelation and
autocovariance. Although there are versions of LPC that use both approaches, autocorrelation
is the approach explained in this paper.
Autocorrelation requires that several initial assumptions be made about the set or sequence of
speech samples, {yn}, in the current segment. First, it requires that {yn} be stationary and second,
it requires that the {yn} sequence is zero outside of the current segment. In autocorrelation, each
E[y_(n-i) y_(n-j)] is converted into an autocorrelation function of the form Ryy(|i-j|). The
estimate of the autocorrelation function Ryy(k) can be expressed as:

    Ryy(k) = (1/N) * sum_{n=k+1..N} y_n * y_(n-k)

Using Ryy(k), the M equations obtained by taking the derivative of the mean squared error can
be written in matrix form as RA = P, where A contains the filter coefficients.
In order to determine the contents of A, the filter coefficients, the equation A = R^-1 P must be
solved. This equation cannot be solved without first computing R^-1. That computation is easy
once one notices that R is symmetric and, more importantly, that all of its diagonals consist of
the same element. A matrix of this type is called a Toeplitz matrix and can be inverted
efficiently.
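The normal equations can be formed and solved directly to see this structure. Below is a NumPy sketch (Python rather than the report's MATLAB); the AR(2) test signal with known coefficients 0.75 and -0.5 is our own, used only to check that the solver recovers them:

```python
import numpy as np

def autocorr(y, M):
    """Biased autocorrelation estimates Ryy(0..M)."""
    N = len(y)
    return np.array([np.dot(y[:N - k], y[k:]) for k in range(M + 1)]) / N

def solve_lpc(y, M):
    """Form the Toeplitz system R a = p from autocorrelation estimates
    and solve for the order-M predictor coefficients."""
    r = autocorr(y, M)
    # R[i, j] = Ryy(|i - j|): symmetric, constant along each diagonal.
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    p = r[1:M + 1]
    return np.linalg.solve(R, p)

# Synthetic AR(2) signal: y[n] = 0.75*y[n-1] - 0.5*y[n-2] + e[n].
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
y = np.zeros_like(e)
for n in range(2, len(e)):
    y[n] = 0.75 * y[n - 1] - 0.5 * y[n - 2] + e[n]
print(np.round(solve_lpc(y, 2), 2))  # close to [0.75, -0.5]
```

A general dense solve costs O(M^3); the Toeplitz structure is what lets the Levinson-Durbin recursion of the next paragraph do the same job in O(M^2).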
The Levinson-Durbin (L-D) algorithm is a recursive algorithm that is considered very
computationally efficient, since it takes advantage of the properties of R when determining the
filter coefficients. In this algorithm the coefficients of a jth order filter are denoted with a
superscript, {ai(j)}, and the average mean squared error of a jth order filter is denoted Ej
instead of E[en2]. When applied to an Mth order filter, the L-D algorithm computes all filters
of order less than M as intermediate results; that is, it determines all filters of order N, where
N = 1, ..., M-1.
During the process of computing the filter coefficients {ai} a set of coefficients, {ki}, called
reflection coefficients or partial correlation coefficients (PARCOR) are generated. These
coefficients are used to solve potential problems in transmitting the filter coefficients. The
quantization of the filter coefficients for transmission can create a major problem since errors in
the filter coefficients can lead to instability in the vocal tract filter and create an inaccurate output
signal. This potential problem is averted by quantizing and transmitting the reflection
coefficients that are generated by the Levinson-Durbin algorithm. These coefficients can be used
to rebuild the set of filter coefficients {ai} and can guarantee a stable filter if their magnitude is
strictly less than one.
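The recursion can be written compactly. This NumPy sketch (Python rather than the report's MATLAB, and using the prediction convention y_hat[n] = sum a_i*y[n-i] rather than the 1, -a polynomial form of the listing later in this report) returns both the predictor and the reflection coefficients; the small test autocorrelation sequence is our own:

```python
import numpy as np

def levinson_durbin(r, M):
    """Solve the order-M normal equations recursively, returning the
    predictor coefficients a[1..M], reflection coefficients k[1..M],
    and the final prediction error E."""
    a = np.zeros(M + 1)   # a[0] is unused; a[i] holds coefficient a_i
    k = np.zeros(M + 1)
    E = r[0]              # error of the order-0 predictor
    for j in range(1, M + 1):
        acc = r[j] - np.dot(a[1:j], r[j - 1:0:-1])
        k[j] = acc / E
        a_prev = a.copy()
        a[j] = k[j]
        a[1:j] = a_prev[1:j] - k[j] * a_prev[j - 1:0:-1]
        E *= 1.0 - k[j] ** 2   # the error shrinks at every order
    return a[1:], k[1:], E

a, refl, E = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, refl, E)
```

Note that the error update E *= 1 - k^2 makes the stability condition visible: as long as every |k| < 1, the error stays positive and the resulting filter is stable, which is exactly why the reflection coefficients are the safer quantities to quantize and transmit.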
Transmitting the Parameters
In an uncompressed form, speech is usually transmitted at 64,000 bits/second using 8
bits/sample and a sampling rate of 8 kHz. LPC reduces this rate to 2,400 bits/second by
breaking the speech into segments and then sending, for each segment, the voiced/unvoiced
information, the pitch period, and the coefficients of the filter that represents the vocal tract.
The input signal used by the filter on the receiver end is determined by the classification of the
speech segment as voiced or unvoiced and by the pitch period of the segment. The encoder sends
a single bit to indicate whether the current segment is voiced or unvoiced. The pitch period is
quantized using a log-companded quantizer to one of 60 possible values, so 6 bits are required
to represent the pitch period.
If the segment contains voiced speech, then a 10th order filter is used, so 11 values are needed:
10 reflection coefficients and the gain. If the segment contains unvoiced speech, then a 4th
order filter is used, so 5 values are needed: 4 reflection coefficients and the gain. The reflection
coefficients are denoted kn, where 1 <= n <= 10 for voiced speech filters and 1 <= n <= 4 for
unvoiced filters.
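The frame timing follows from the figures above by simple arithmetic: 54 bits per frame (as stated in the decoding section below) transmitted at 2400 bits/second. This quick plain-Python check derives the implied frame duration:

```python
# Figures stated in the text: 54 bits per frame at 2400 bits/second,
# speech sampled at 8 kHz (uncompressed: 64,000 bits/second).
bits_per_frame = 54
bit_rate = 2400
sample_rate = 8000

frame_ms = 1000 * bits_per_frame / bit_rate                   # ms per frame
samples_per_frame = sample_rate * bits_per_frame // bit_rate  # samples/frame
compression = 64000 / bit_rate                                # vs. PCM
print(frame_ms, samples_per_frame, round(compression, 1))
# 22.5 ms frames -> 180 samples at 8 kHz, about 26.7x smaller than PCM
```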
LPC Synthesis/Decoding
The process of decoding a sequence of speech segments is the reverse of the encoding
process. Each segment is decoded individually, and the sequence of reproduced sound segments
is joined together to represent the entire input speech signal. The decoding or synthesis of a
speech segment is based on the 54 bits of information transmitted from the encoder.
The speech signal is declared voiced or unvoiced based on the voiced/unvoiced determination
bit. The decoder needs to know what type of signal the segment contains in order to determine
what type of excitation signal will be given to the LPC filter. Unlike other speech compression
algorithms such as CELP, which have a codebook of possible excitation signals, LPC has only
two possible signals.
For voiced segments a pulse is used as the excitation signal. This pulse consists of 40 samples
and is stored locally by the decoder. A pulse is defined as “...an isolated disturbance, that travels
through an otherwise undisturbed medium” [10]. For unvoiced segments, white noise produced
by a pseudorandom number generator is used as the input to the filter.
The pitch period for voiced segments is then used to determine whether the 40-sample pulse
needs to be truncated or extended. If the pulse needs to be extended, it is padded with zeros,
consistent with the definition of a pulse travelling through an undisturbed medium. The
voiced/unvoiced determination and the pitch period are the only things needed to produce the
excitation signal.
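The two possible excitation signals can be sketched in a few lines. This is an illustrative NumPy version (Python rather than the report's MATLAB); it uses a simple impulse train for voiced frames, as the report's f_SYN_V listing does, rather than a stored 40-sample pulse:

```python
import numpy as np

def excitation(frame_len, voiced, pitch_period, rng=None):
    """Build the decoder input: an impulse train at the pitch period
    for voiced frames, white noise for unvoiced frames."""
    if voiced:
        e = np.zeros(frame_len)
        e[::pitch_period] = 1.0   # one pulse per pitch period, zeros between
        return e
    if rng is None:
        rng = np.random.default_rng()
    return rng.standard_normal(frame_len)

e = excitation(180, True, 60)
print(np.nonzero(e)[0])   # pulses every 60 samples: 0, 60, 120
```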
Each segment of speech has a different LPC filter that is eventually produced using the
reflection coefficients and the gain that are received from the encoder. 10 reflection coefficients
are used for voiced segment filters and 4 reflection coefficients are used for unvoiced segments.
These reflection coefficients are used to generate the vocal tract coefficients or parameters which
are used to create the filter.
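Rebuilding the predictor coefficients from the received reflection coefficients is the step-up recursion, the inverse of what Levinson-Durbin does on the encoder side. A NumPy sketch (Python rather than the report's MATLAB, using the y_hat[n] = sum a_i*y[n-i] convention; the test values are our own):

```python
import numpy as np

def reflection_to_lpc(k):
    """Step-up recursion: rebuild direct-form predictor coefficients
    a[1..M] from reflection coefficients k[1..M]."""
    a = np.zeros(0)
    for j, kj in enumerate(k, start=1):
        a_new = np.empty(j)
        # Order-(j-1) coefficients are updated, then kj is appended.
        a_new[:j - 1] = a - kj * a[::-1]
        a_new[j - 1] = kj
        a = a_new
    return a

print(reflection_to_lpc([0.5, 0.2]))   # prints [0.4 0.2]
```

Provided every |k| < 1, the rebuilt all-pole filter is guaranteed stable, which is the property the encoder relied on when it chose to transmit the reflection coefficients.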
The final step of decoding a segment of speech is to pass the excitement signal through the
filter to produce the synthesized speech signal.
LPC Applications
In general, the most common use of speech compression is in standard telephone systems. In
fact, much of the technology used in speech compression was developed by the phone
companies.
Linear predictive coding finds application mainly in secure telephony because of its low bit
rate. Secure telephone systems require a low bit rate since speech is first digitized, then
encrypted and transmitted. The primary goal of these systems is to decrease the bit rate as
much as possible while maintaining a level of speech quality that is understandable.
Other standards, such as the digital cellular standard and the international telephone network
standard, have higher quality requirements and therefore need a higher bit rate. In these
standards, understanding the speech is not good enough; the listener must also be able to
recognize the speech as belonging to the original source.
A second area in which linear predictive coding has been used is text-to-speech synthesis, in
which speech has to be generated from text. Since LPC synthesis involves generating speech
from a model of the vocal tract, it is well suited to generating speech from text.
Further applications of LPC and other speech compression schemes are voice mail systems,
telephone answering machines, and multimedia applications. Most multimedia applications,
unlike telephone applications, involve one-way communication and involve storing the data. An
example of a multimedia application that would involve speech is an application that allows
voice annotations about a text document to be saved with the document. The method of speech
compression used in multimedia applications depends on the desired speech quality and the
limitations of storage space for the application. Linear Predictive Coding provides a favourable
method of speech compression for multimedia applications since its low bit rate gives the
smallest storage requirement.
Full LPC Model and Implementation
MATLAB Implementation
Main.m
%MAIN BODY
clear all;
clc;
disp('wavfile');
%INPUT
inpfilenm = 'sample1';
[x, fs] = wavread(inpfilenm); %use audioread in newer MATLAB releases
%LENGTH (IN SEC) OF INPUT WAVEFILE,
t=length(x)./fs;
sprintf('Processing the wavefile "%s"', inpfilenm)
sprintf('The wavefile is %3.2f seconds long', t)
%THE ALGORITHM STARTS HERE,
M=10; %prediction order
[aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M); %pitch_plot is pitch periods
synth_speech = f_DECODER (aCoeff, pitch_plot, voiced, gain);
%RESULTS
beep;
disp('Press a key to play the original sound!');
pause;
soundsc(x, fs);
disp('Press a key to play the LPC compressed sound!');
pause;
soundsc(synth_speech, fs);
figure;
subplot(2,1,1), plot(x); title(['Original signal = "', inpfilenm, '"']);
subplot(2,1,2), plot(synth_speech); title(['synthesized speech of "', inpfilenm, '" using LPC algo']);
f_ENCODER.m
function [aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M);
if (nargin<3), M = 10; end %prediction order defaults to 10
b=1;
fsize = 30e-3; %frame size
frame_length = round(fs .* fsize);
N= frame_length - 1;
%VOICED/UNVOICED and PITCH; [independent of frame segmentation]
[voiced, pitch_plot] = f_VOICED (x, fs, fsize);
%FRAME SEGMENTATION for aCoeff and GAIN;
for b=1 : frame_length : (length(x) - frame_length),
y1=x(b:b+N);
y = filter([1 -.9378], 1, y1); %pre-emphasis filtering
%aCoeff [LEVINSON-DURBIN METHOD];
[a, tcount_of_aCoeff, e] = func_lev_durb (y, M);
aCoeff(b: (b + tcount_of_aCoeff - 1)) = a;
%GAIN;
pitch_plot_b = pitch_plot(b); %pitch period
voiced_b = voiced(b);
gain(b) = f_GAIN (e, voiced_b, pitch_plot_b);
end
func_lev_durb.m
%function of levinsonDurbin
function [aCoeff, tcount_of_aCoeff, e] = func_lev_durb (y, M);
if (nargin<2), M = 10; end
sk=0;
a = zeros(M+1, M+1); %(M+1)x(M+1) coefficient table
z=xcorr(y);
%finding array of R[l]
R=z( ( (length(z)+1) ./2 ) : length(z));
s=1;
J(1)=R(1);
%GETTING OTHER PARAMETERS OF PREDICTOR OF ORDER "(s-1)":
for s=2:M+1,
sk=0;
for i=2:(s-1),
sk=sk + a(i,(s-1)).*R(s-i+1);
end
k(s)=(R(s) + sk)./J(s-1);
J(s)=J(s-1).*(1-(k(s)).^2);
a(s,s)= -k(s);
a(1,s)=1;
for i=2:(s-1),
a(i,s)=a(i,(s-1)) - k(s).*a((s-i+1),(s-1));
end
end
aCoeff=a((1:s),s)';
tcount_of_aCoeff = length(aCoeff);
est_y = filter([0 -aCoeff(2:end)],1,y);
e = y - est_y;
f_VOICED.m
%function_main of voiced/unvoiced detection
function [voiced, pitch_plot] = f_VOICED(x, fs, fsize);
f=1;
b=1;
frame_length = round(fs .* fsize);
N= frame_length - 1;
%FRAME SEGMENTATION:
for b=1 : frame_length : (length(x) - frame_length),
y1=x(b:b+N);
y = filter([1 -.9378], 1, y1); %pre-emphasis filter
msf(b:(b + N)) = func_vd_msf (y);
zc(b:(b + N)) = func_vd_zc (y);
pitch_plot(b:(b + N)) = func_pitch (y,fs);
end
thresh_msf = (( (sum(msf)./length(msf)) - min(msf)) .* (0.67) ) + min(msf);
voiced_msf = msf > thresh_msf; %=1,0
thresh_zc = (( ( sum(zc)./length(zc) ) - min(zc) ) .* (1.5) ) + min(zc);
voiced_zc = zc < thresh_zc;
thresh_pitch = (( (sum(pitch_plot)./length(pitch_plot)) - min(pitch_plot)) .* (0.5) ) + min(pitch_plot);
voiced_pitch = pitch_plot > thresh_pitch;
for b=1:(length(x) - frame_length),
if voiced_msf(b) .* voiced_pitch(b) .* voiced_zc(b) == 1,
% if voiced_msf(b) + voiced_pitch(b) > 1,
voiced(b) = 1;
else
voiced(b) = 0;
end
end
voiced;
pitch_plot;
func_pitch.m
function pitch_period = func_pitch (y,fs)
clear pitch_period;
period_min = round (fs .* 2e-3);
period_max = round (fs .* 20e-3);
R=xcorr(y);
[R_max , R_mid]=max(R);
pitch_per_range = R ( R_mid + period_min : R_mid + period_max );
[R_max, R_mid] = max(pitch_per_range);
pitch_period = R_mid + period_min;
func_vd_msf.m
function m_s_f = func_vd_msf (y)
clear m_s_f;
[B,A] = butter(9,.33,'low'); %.5 or .33?
y1 = filter(B,A,y);
m_s_f=sum(abs(y1));
func_vd_zc.m
function ZC = func_vd_zc (y)
ZC=0;
for n=1:length(y),
if n+1>length(y)
break
end
ZC=ZC + (1./2) .* abs(sign(y(n+1))-sign(y(n)));
end
ZC;
f_GAIN.m
%function for calc gain per frame
function [gain_b, power_b] = f_GAIN (e, voiced_b, pitch_plot_b);
if voiced_b == 0,
denom = length(e);
power_b = sum(e (1:denom) .^2) ./ denom;
gain_b = sqrt( power_b );
else
denom = ( floor( length(e)./pitch_plot_b ) .* pitch_plot_b );
power_b = sum( e (1:denom) .^2 ) ./ denom;
gain_b = sqrt( pitch_plot_b .* power_b );
end
power_b;
gain_b;
f_DECODER.m
%DECODER PORTION
function synth_speech = f_DECODER (aCoeff, pitch_plot, voiced, gain);
frame_length=1;
for i=2:length(gain)
if gain(i) == 0,
frame_length = frame_length + 1;
else break;
end
end
%decoding starts here,
for b=1 : frame_length : (length(gain)),
if voiced(b) == 1, %voiced frame
pitch_plot_b = pitch_plot(b);
syn_y1 = f_SYN_V (aCoeff, gain, frame_length, pitch_plot_b, b);
else syn_y1 = f_SYN_UV (aCoeff, gain, frame_length, b); %unvoiced frame
end
synth_speech(b:b+frame_length-1) = syn_y1;
end
f_SYN_V.m
%a function of f_DECODER
function syn_y1 = f_SYN_V (aCoeff, gain, frame_length, pitch_plot_b, b);
%creating pulsetrain;
for f=1:frame_length
if f./pitch_plot_b == floor(f./pitch_plot_b)
ptrain(f) = 1;
else ptrain (f) = 0;
end
end
syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], ptrain);
syn_y1 = syn_y2 .* gain(b);
f_SYN_UV.m
%a function of f_DECODER
function syn_y1 = f_SYN_UV (aCoeff, gain, frame_length, b);
wn = randn(1, frame_length);
syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], wn);
syn_y1 = syn_y2 .* gain(b);
Discussion and Conclusion
Linear Predictive Coding is an analysis/synthesis technique for lossy speech compression that
attempts to model the human production of sound instead of transmitting an estimate of the
sound wave. Linear predictive coding achieves a bit rate of 2400 bits/second, which makes it
ideal for use in secure telephone systems. Secure telephone systems are more concerned that the
content and meaning of speech, rather than the quality of speech, be preserved. The trade-off for
LPC's low bit rate is that it has some difficulty with certain sounds and produces speech that
sounds synthetic.
Linear predictive coding encoders break a sound signal into segments and then send
information on each segment to the decoder. The encoder sends information on whether the
segment is voiced or unvoiced, along with the pitch period for voiced segments, which is used
to create an excitation signal in the decoder. The encoder also sends information about the
vocal tract, which is used to build a filter on the decoder side; given the excitation signal as
input, this filter can reproduce the original speech.
References
[1] Lawrence R. Rabiner and Ronald W. Schafer . Introduction to Digital Speech Processing
Vol. 1, Nos. 1–2 (2007) 1–194
[2] V. Hardman and O. Hodson. Internet/Mbone Audio (2000) 5-7.
[3] Scott C. Douglas. Introduction to Adaptive Filters, Digital Signal Processing Handbook
(1999) 7-12.
[4] Poor, H. V., Looney, C. G., Marks II, R. J., Verdú, S., Thomas, J. A., Cover, T. M.
Information Theory. The Electrical Engineering Handbook (2000) 56-57.
[5] R. Sproat and J. Olive. Text-to-Speech Synthesis, Digital Signal Processing Handbook
(1999) 9-11.
[6] Richard C. Dorf et al. Broadcasting (2000) 44-47.
[7] Richard V. Cox. Speech Coding (1999) 5-8.
[8] Randy Goldberg and Lance Riek. A Practical Handbook of Speech Coders (1999)
Chapter 2:1-28, Chapter 4: 1-14, Chapter 9: 1-9, Chapter 10:1-18.
[9] Mark Nelson and Jean-Loup Gailly. Speech Compression, The Data Compression Book
(1995) 289-319.
[10] Khalid Sayood. Introduction to Data Compression (2000) 497-509.
[11] Richard Wolfson, Jay Pasachoff. Physics for Scientists and Engineers (1995) 376-377.