
MAJOR PROJECT - I

FINAL SUBMISSION REPORT

(Year 2012)

DSP TOOLS IN WIRELESS COMMUNICATION

SUBMITTED TO:

Mr. Hemant Kumar Meena

Presented by:

Piyush Virmani (9102259)

Palash Relan (9102262)


CERTIFICATE

This is to certify that the work titled “DSP Tools in Wireless Communication”, submitted by “Piyush Virmani & Palash Relan” in partial fulfilment for the award of the degree of B.Tech. of Jaypee Institute of Information Technology University, Noida, has been carried out under my supervision. This work has not been submitted partially or wholly to any other University or Institute for the award of this or any other degree or diploma.

Signature of Supervisor ………………………………………………………...

Name of Supervisor ……………………..…………………………………

Designation ……………………..…………………………………………

Date ……………………..………………………………………


ACKNOWLEDGEMENT

We are highly obliged to our project supervisor, Mr. Hemant Kumar Meena, for assigning this work of study on the topic DSP Tools in Wireless Communication, which has helped us to develop an understanding of speech processing. We are grateful to him for all his time, assistance and guidance, which motivated us to work on this topic and without which our major project would not have seen its end. We are also thankful to the external examiners, Mr. R.K. Dubey and Mr. V.K. Dwivedi, who helped us build a better understanding of the matter.

Date: …………………..

Name of Students: Piyush Virmani (09102259)

Palash Relan (09102262)


CONTENTS

1. Certificate

2. Acknowledgement

3. Contents

4. Abstract

i. Wireless Communication for Voice Transmission

ii. Digital Speech Processing

5. Application of Digital Speech Processing

i. Speech Coding

ii. Text to Speech Synthesis

iii. Speech Recognition and Pattern Matching

iv. Other Applications

6. Human Speech

7. Properties of Speech

8. Speech Analysis

i. Short Term Energy

ii. Short Term Zero Crossing

iii. Short Term Autocorrelation Function

9. General Encoding of Arbitrary Waveforms

i. Types of Vocoders

ii. Vocoder Quality Measurement

10. Linear Predictive Analysis

i. Introduction

ii. LPC Model

iii. LPC Analysis

i. Input Speech

ii. Pitch Period Estimation


iii. Vocal Tract Filter

iv. Voiced/Unvoiced Determination

v. Levinson-Durbin Algorithm

iv. LPC Synthesis/Decoding

v. Transmission of Parameters

vi. Applications of LPC

11. Full LPC Model and Implementation

i. LPC Encoder Model

ii. LPC Decoder Model

iii. MATLAB Implementation

12. Discussion and Conclusion

13. References


Abstract

Wireless Communication for Voice Transmission

Wireless communications operators see phenomenal growth in consumer demand for high

quality and low cost services. Since the physical spectrum for wireless services is limited,

operators and equipment suppliers continually find ways to optimise bandwidth efficiency.

Digital communications technology provides an efficiency advantage over analog wireless communications: multiplexing and filtering are easier, components are cheaper, encryption is more secure and network management is simpler. Additionally, digital technology provides more value-added services to customers (security, text and voice messages together, etc.).

Today wireless communication is primarily voice. The operator meets the increasing need

for services by combining digital technology and special encoding techniques for voice. These

encoders ("vocoders") take advantage of predictable elements in human speech. Several low

data rate encoders are described here with an assessment of their subjective quality.

Test methods to determine voice quality are necessarily subjective.

The most efficient vocoders have acceptable quality levels and data rates between 2 and 8 kbit/s. Higher data rate encoders (8-13 kbit/s) have improved quality, while 32 kbit/s coders have excellent quality (but use more network resources). The operator must engineer the proper balance between cost, quality and available resources to provide the optimum solution to the customer.

Digital Speech Processing

Since even before the time of Alexander Graham Bell’s revolutionary invention, engineers and

scientists have studied the phenomenon of speech communication with an eye on creating more

efficient and effective systems of human-to-human and human-to-machine communication.

Starting in the 1960s, digital signal processing (DSP) assumed a central role in speech studies,

and today DSP is the key to realizing the fruits of the knowledge that has been gained through

decades of research. Concomitant advances in integrated circuit technology and computer

architecture have aligned to create a technological environment with virtually limitless

opportunities for innovation in speech communication applications.

In this project, we highlight the central role of DSP techniques in modern speech communication

research and applications.


Applications of Digital Speech Processing

The first step in most applications of digital speech processing is to convert the acoustic

waveform to a sequence of numbers. Most modern A-to-D converters operate by sampling at a

very high rate, applying a digital lowpass filter with cutoff set to preserve a prescribed

bandwidth, and then reducing the sampling rate to the desired sampling rate, which can be as low

as twice the cutoff frequency of the sharp-cutoff digital filter. This discrete-time representation is

the starting point for most applications.
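As a minimal sketch of this rate-conversion step (assuming, purely for illustration, a capture at 48 kHz that is reduced to the 8 kHz rate used later in this report), MATLAB's resample applies the anti-aliasing lowpass filter and the rate reduction in a single call:

% Sketch: reduce a 48 kHz capture to an 8 kHz speech-processing rate.
fs_high = 48000;                    % oversampled A-to-D rate (assumed)
fs_low  = 8000;                     % desired sampling rate
t  = (0:fs_high-1)'/fs_high;        % one second of time samples
x  = sin(2*pi*440*t);               % stand-in for a recorded waveform
x8 = resample(x, fs_low, fs_high);  % lowpass filter + reduce to 8000 samples/s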

Speech Coding

Perhaps the most widespread applications of digital speech processing technology occur in the areas of digital transmission and storage of speech signals. In these areas the centrality of the digital representation is obvious, since the goal is to compress the digital waveform representation of speech into a lower bit-rate representation. It is common to refer to this activity as “speech coding” or “speech compression.”

Speech coders enable a broad range of applications including narrowband and broadband wired

telephony, cellular communications, voice over internet protocol (VoIP) (which utilizes the

internet as a real-time communications medium), secure voice for privacy and encryption (for

national security applications), extremely narrowband communications channels (such as

battlefield applications using high frequency (HF) radio), and for storage of speech for telephone

answering machines, interactive voice response (IVR) systems, and pre-recorded messages.

Speech coders often utilize many aspects of both the speech production and speech perception

processes, and hence may not be useful for more general audio signals such as music. Coders

that are based on incorporating only aspects of sound perception generally do not achieve as

much compression as those based on speech production, but they are more general and can be

used for all types of audio signals. These coders are widely deployed in MP3 and AAC players

and for audio in digital television systems.


Text-to-Speech Synthesis

For many years, scientists and engineers have studied the speech production process with the

goal of building a system that can start with text and produce speech automatically. In a sense, a text-to-speech synthesizer such as the one depicted in the figure is a digital simulation of the entire upper part of the speech chain diagram.

Text to Speech Synthesis Block Diagram

The input to the system is ordinary text such as an email message or an article from a newspaper

or magazine. The first block in the text-to-speech synthesis system, labelled linguistic rules, has

the job of converting the printed text input into a set of sounds that the machine must synthesize.

The conversion from text to sounds involves a set of linguistic rules that must determine the

appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.)

so that the resulting synthetic speech will express the words and intent of the text message in

what passes for a natural voice that can be decoded accurately by human speech perception.

Once the proper pronunciation of the text has been determined, the role of the synthesis

algorithm is to create the appropriate sound sequence to represent the text message in the form of

speech. In essence, the synthesis algorithm must simulate the action of the vocal tract system in

creating the sounds of speech.

Speech Recognition and Other Pattern Matching Problems

Another large class of digital speech processing applications is concerned with the automatic

extraction of information from the speech signal. Most such systems involve some sort of pattern

matching. The figure shows a block diagram of a generic approach to pattern matching problems

in speech processing. Such problems include the following: speech recognition, where the object

is to extract the message from the speech signal; speaker recognition, where the goal is to

identify who is speaking; speaker verification, where the goal is to verify a speaker’s claimed

identity from analysis of their speech signal; word spotting, which involves monitoring a speech

signal for the occurrence of specified words or phrases; and automatic indexing of speech

recordings based on recognition (or spotting) of spoken keywords.


The first block in the pattern matching system converts the analog speech waveform to digital

form using an A-to-D converter. The feature analysis module converts the sampled speech signal

to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are

also used to derive the feature vectors. The final block in the system, namely the pattern

matching block, dynamically time aligns the set of feature vectors representing the speech signal

with a concatenated set of stored patterns, and chooses the identity associated with the pattern

which is the closest match to the time-aligned set of feature vectors of the speech signal. The

symbolic output consists of a set of recognized words, in the case of speech recognition, or the

identity of the best matching talker, in the case of speaker recognition, or a decision as to

whether to accept or reject the identity claim of a speaker in the case of speaker verification.

Speech Recognition Block Diagram
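A compact MATLAB sketch of this dynamic time alignment step is given below (a generic dynamic time warping recurrence; the feature matrices, their dimensions and the Euclidean local distance are illustrative assumptions, not details taken from this report):

% Sketch: dynamic time warping between an input feature sequence and one stored
% pattern; each column is one feature vector. Both sequences are placeholders.
X = randn(12, 40);                      % features of the input utterance
Y = randn(12, 55);                      % features of a stored reference pattern
nx = size(X, 2); ny = size(Y, 2);
D  = inf(nx+1, ny+1);                   % accumulated distance matrix
D(1,1) = 0;
for i = 1:nx
    for j = 1:ny
        cost = norm(X(:,i) - Y(:,j));   % local distance between two frames
        D(i+1,j+1) = cost + min([D(i,j+1), D(i+1,j), D(i,j)]);
    end
end
dist = D(nx+1, ny+1);                   % alignment score; the recognizer would pick
                                        % the stored pattern with the lowest score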

The major areas where such a system finds applications include command and control of

computer software, voice dictation to create letters, memos, and other documents, natural

language voice dialogues with machines to enable help desks and call centres, and for agent

services such as calendar entry and update, address list modification and entry, etc.

Other Speech Applications


Human Speech

The fundamental purpose of speech is communication, i.e., the transmission of messages.

According to Shannon’s information theory , a message represented as a sequence of discrete

symbols can be quantified by its information content in bits, and the rate of transmission of

information is measured in bits/second (bps). In speech production, as well as in many human-

engineered electronic communication systems, the information to be transmitted is encoded in

the form of a continuously varying (analog) waveform that can be transmitted, recorded,

manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental

analog form of the message is an acoustic waveform, which we call the speech signal. Speech

signals, as illustrated in Figure 1.1, can be converted to an electrical waveform by a microphone,

further manipulated by both analog and digital signal processing, and then converted back to

acoustic form by a loudspeaker, a telephone handset or headphone, as desired. This form of

speech processing is, of course, the basis for Bell’s telephone invention as well as today’s

multitude of devices for recording, transmitting, and manipulating speech and audio signals.


Properties of Speech

The two types of speech sounds, voiced and unvoiced, produce different sounds and spectra due

to their differences in sound formation. With voiced speech, air pressure from the lungs forces normally closed vocal cords to open and vibrate. The vibrational frequencies (pitch) vary from about 50 to 400 Hz (depending on the person’s age and sex) and form resonances in the vocal tract at odd harmonics. These resonance peaks are called formants and can be seen in the voiced speech figures below.

Voiced Speech Sample

Power Spectral Density, Voiced Speech


Unvoiced sounds, called fricatives (e.g., s, f, sh) are formed by forcing air through an opening

(hence the term, derived from the word “friction”). Fricatives do not vibrate the vocal cords and

therefore do not produce as much periodicity as seen in the formant structure in voiced speech;

unvoiced sounds appear more noise-like (see figures 3 and 4 below). Time domain samples lose

periodicity and the power spectral density does not display the clear resonant peaks that are

found in voiced sounds.

Unvoiced Speech Sample

Power Spectral Density, Unvoiced Speech


The spectrum for speech (combined voiced and unvoiced sounds) has a total bandwidth of

approximately 7000 Hz with an average energy at about 3000 Hz. The auditory canal optimizes

speech detection by acting as a resonant cavity at this average frequency. Note that the power of

speech spectra and the periodic nature of formants drastically diminish above 3500 Hz.

Speech encoding algorithms can be less complex than general encoding by concentrating

(through filters) on this region. Furthermore, since line quality telecommunications employ

filters that pass frequencies up to only 3000-4000 Hz, high frequencies produced by fricatives

are removed. A caller will often have to spell or otherwise distinguish these sounds to be

understood (e.g., “F as in Frank”).

Schematic Model of Vocal Tract System


Speech Analysis

Our goal is to extract the parameters of the model by analysis of the speech signal, so it is common to assume structures (or representations) for both the excitation generator and the linear system. One such model uses a more detailed representation of the excitation in terms of separate source generators for voiced and unvoiced speech, as shown in the figure.

In this model the unvoiced excitation is assumed to be a random noise sequence, and the voiced

excitation is assumed to be a periodic impulse train with impulses spaced by the pitch period (P0)

rounded to the nearest sample. The pulses needed to model the glottal flow waveform during

voiced speech are assumed to be combined (by convolution) with the impulse response of the

linear system, which is assumed to be slowly-time-varying (changing every 50–100 ms or so).

By this we mean that over the timescale of phonemes, the impulse response, frequency response, and system function of the system remain relatively constant. For example, over time intervals of tens of milliseconds the system can be described by the convolution expression

s[n] = Σm hn̂[m] e[n − m],

where the subscript n̂ denotes the time index pointing to the block of samples of the entire speech signal s[n] wherein the impulse response hn̂[m] applies. We use n for the time index within that interval, and m is the index of summation in the convolution sum.

To simplify analysis, it is often assumed that the system is an all-pole system with system function of the form

H(z) = G / (1 − a1 z^-1 − a2 z^-2 − ... − ap z^-p).


Although the linear system is assumed to model the composite spectrum effects of radiation, vocal tract tube, and glottal excitation pulse shape (for voiced speech only) over a short time interval, the linear system in the model is commonly referred to as simply the “vocal tract” system and the corresponding impulse response is called the “vocal tract impulse response.” For all-pole linear systems, as represented by the equation above, the input and output are related by a difference equation of the form

s[n] = a1 s[n−1] + a2 s[n−2] + ... + ap s[n−p] + G e[n].
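This recursion maps directly onto MATLAB's filter function; a minimal sketch follows (the order, coefficients, gain and excitation are illustrative placeholders, not values taken from the model):

% Sketch of all-pole synthesis: s[n] = a1*s[n-1] + ... + ap*s[n-p] + G*e[n]
p = 10;                     % model order (assumed for illustration)
a = 0.05*ones(1, p);        % placeholder predictor coefficients (kept small for stability)
G = 1;                      % gain
e = randn(1, 200);          % stand-in excitation (white noise)
s = filter(G, [1 -a], e);   % denominator [1 -a1 ... -ap] implements the recursion above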

Short-Time Energy and Zero-Crossing Rate

Two basic short-time analysis functions useful for speech signals are the short-time energy and the short-time zero-crossing rate. These functions are simple to compute, and they are useful for estimating properties of the excitation function in the model.

The short-time energy is defined as

En̂ = Σm ( x[m] w[n̂ − m] )^2,

where w[n] is the analysis window. Similarly, the short-time zero-crossing rate is defined as the weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to

Zn̂ = Σm (1/2) | sgn(x[m]) − sgn(x[m − 1]) | w[n̂ − m].

The short-time energy and short-time zero-crossing rate are important because they abstract valuable information about the speech signal, and they are simple to compute. The short-time energy is an indication of the amplitude of the signal in the interval around time n̂. From our model, we expect unvoiced regions to have lower short-time energy than voiced regions.

Similarly, the short-time zero-crossing rate is a crude frequency analyzer. Voiced signals have a

high frequency (HF) fall off due to the lowpass nature of the glottal pulses, while unvoiced

sounds have much more HF energy. Thus, the short-time energy and short-time zero-crossing

rate can be the basis for an algorithm for making a decision as to whether the speech signal is

voiced or unvoiced at a particular time.
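A short MATLAB sketch of both measures and of this crude voiced/unvoiced decision follows (the test signal, the 30 ms frame length and the mean-based thresholds are illustrative assumptions, not values prescribed by the model):

% Sketch: frame-wise short-time energy and zero-crossing rate of a signal x.
fs = 8000;
x  = randn(fs, 1);                         % stand-in for one second of speech
N  = round(0.030 * fs);                    % 30 ms analysis frame
nFrames = floor(length(x)/N);
E = zeros(1, nFrames);                     % short-time energy per frame
Z = zeros(1, nFrames);                     % zero-crossing count per frame
for k = 1:nFrames
    seg  = x((k-1)*N+1 : k*N);
    E(k) = sum(seg.^2);                    % energy of the frame
    Z(k) = sum(abs(diff(sign(seg)))) / 2;  % number of sign changes in the frame
end
% Crude decision: voiced frames tend to have high energy and few zero crossings.
voiced = (E > mean(E)) & (Z < mean(Z));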


Short-Time Autocorrelation Function (STACF)

The autocorrelation function is often used as a means of detecting periodicity in signals, and it is

also the basis for many spectrum analysis methods. This makes it a useful tool for short-time

speech analysis. The STACF is defined as the deterministic autocorrelation function of the sequence xn̂[m] = x[m] w[n̂ − m] that is selected by the window shifted to time n̂, i.e.,

φn̂[ℓ] = Σm xn̂[m] xn̂[m + ℓ].

Voiced and Unvoiced Segments of speech and their corresponding Autocorrelation
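A minimal MATLAB sketch of the STACF for one frame is shown below (the frame and the Hamming window are placeholders; a strong peak at a non-zero lag indicates a periodic, i.e. voiced, frame):

% Sketch: short-time autocorrelation of one windowed frame.
fs  = 8000;
x   = randn(fs, 1);            % stand-in for one second of speech
N   = round(0.030 * fs);       % 30 ms frame
seg = x(1:N) .* hamming(N);    % windowed segment x_n[m]
phi = xcorr(seg);              % autocorrelation for lags -(N-1)..(N-1)
phi = phi(N:end);              % keep non-negative lags; a peak away from lag 0
                               % suggests a voiced frame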


General Encoding of Arbitrary Waveforms

Waveform encoders typically use Time Domain or Frequency Domain coding and attempt to

accurately reproduce the original signal. These general encoders do not assume any previous

knowledge about the signal. The decoder output waveform is very similar to the signal input to

the coder. Examples of these general encoders include Uniform Binary Coding for music

Compact Disks and Pulse Code Modulation for telecommunications.

Pulse Code Modulation (PCM) is a general encoder used in standard voice grade circuits. PCM encodes into eight-bit words Pulse Amplitude Modulated (PAM) signals that have been sampled at the Nyquist rate for the voice channel (8000 samples per second, or twice the channel bandwidth). The PCM signal therefore requires a 64 kbit/s transmission channel. However, this is not feasible over communication channels where bandwidth is at a premium. It is also inefficient when the communication is primarily voice, which exhibits a certain amount of predictability, as seen in the periodic structure from formants. The increasing use of limited transmission media such as radio and satellite links and of limited voice storage resources requires more efficient coding methods. Special encoders have been designed that assume the input signal is voice only. These vocoders use speech production models to reproduce only the intelligible quality of the original signal waveform.

The most popular vocoders used in digital communications are presented below.

Types of Voice Encoders

Linear Predictive Coder (LPC)

Regular Pulse Excited (RPE) Coder

Code-book Excited Linear Prediction (CELP) Coder

Vocoder Quality Measurements

There are several points on which to rate vocoder quality:

Cost/complexity

Voice quality

Data rate

Transparency for non-voice signals

Tolerance of transmission errors

Effects of tandem encodings

Coding formats

Signal processing requirements

It is suggested that the most important quality measures are voice quality, data rate, communication delay and coding algorithm complexity. While all of these can easily be measured and analysed, voice quality remains subjective.


Linear Predictive Analysis

Proposal

Linear predictive coding (LPC) is defined as a digital method for encoding an analog signal in which a particular value is predicted by a linear function of the past values of the signal. It was first proposed as a method for encoding human speech by the United States Department of Defence in Federal Standard 1015, published in 1984. Human speech is produced in the vocal tract, which can be approximated as a variable diameter tube. The linear predictive coding (LPC) model is based on a mathematical approximation of the vocal tract represented by this tube of varying diameter. At a particular time, t, the speech sample s(t) is represented as a linear sum of the p previous samples. The most important aspect of LPC is the linear predictive filter, which allows the value of the next sample to be determined by a linear combination of previous samples.

Under normal circumstances, speech is sampled at 8000 samples/second with 8 bits used to

represent each sample. This provides a rate of 64000 bits/second. Linear predictive coding

reduces this to 2400 bits/second. At this reduced rate the speech has a distinctive synthetic sound

and there is a noticeable loss of quality. However, the speech is still audible and it can still be

easily understood. Since there is information loss in linear predictive coding, it is a lossy form of

compression.

Introduction

There exist many different types of speech compression that make use of a variety of different

techniques. However, most methods of speech compression exploit the fact that speech

production occurs through slow anatomical movements and that the speech produced has a

limited frequency range. The frequency of human speech production ranges from around 300 Hz

to 3400 Hz. Speech compression is often referred to as speech coding which is defined as a

method for reducing the amount of information needed to represent a speech signal. Most forms

of speech coding are usually based on a lossy algorithm. Lossy algorithms are considered

acceptable when encoding speech because the loss of quality is often undetectable to the human

ear. There are many other characteristics about speech production that can be exploited by

speech coding algorithms. One fact that is often used is that periods of silence take up greater than

50% of conversations. An easy way to save bandwidth and reduce the amount of information

needed to represent the speech signal is to not transmit the silence. Another fact about speech

production that can be taken advantage of is that mechanically there is a high correlation

between adjacent samples of speech. Most forms of speech compression are achieved by

modelling the process of speech production as a linear digital filter. The digital filter and its slow

changing parameters are usually encoded to achieve compression from the speech signal.

Linear Predictive Coding (LPC) is one of the methods of compression that models the process

of speech production. Specifically, LPC models this process as a linear sum of earlier samples

using a digital filter driven by an excitation signal. An alternate explanation is that linear

prediction filters attempt to predict future values of the input signal based on past signals. LPC

“models speech as an autoregressive process, and sends the parameters of the process as opposed

to sending the speech itself”.


All vocoders, including LPC vocoders, have four main attributes: bit rate, delay, complexity,

quality. Any voice coder, regardless of the algorithm it uses, will have to make trade-offs

between these different attributes. The first attribute of vocoders, the bit rate, is used to

determine the degree of compression that a vocoder achieves. Uncompressed speech is usually

transmitted at 64 kb/s using 8 bits/sample and a rate of 8 kHz for sampling. Any bit rate below

64 kb/s is considered compression.

The linear predictive coder transmits speech at a bit rate of 2.4 kb/s, an excellent rate of

compression. Delay is another important attribute for vocoders that are involved with the

transmission of an encoded speech signal. Vocoders which are involved with the storage of the

compressed speech, as opposed to transmission, are not as concerned with delay. The general delay

standard for transmitted speech conversations is that any delay that is greater than 300 ms is

considered unacceptable. The third attribute of voice coders is the complexity of the algorithm

used. The complexity affects both the cost and the power of the vocoder. Linear predictive coding, because of its high compression rate, is very complex and involves executing millions of

instructions per second.

The general algorithm for linear predictive coding involves an analysis or encoding part and

a synthesis or decoding part. In the encoding, LPC takes the speech signal in blocks or frames of

speech and determines the input signal and the coefficients of the filter that will be capable of

reproducing the current block of speech. This information is quantized and transmitted. In the

decoding, LPC rebuilds the filter based on the coefficients received. The filter can be thought of

as a tube which, when given an input signal, attempts to output speech. Additional information

about the original speech signal is used by the decoder to determine the input or excitation signal

that is sent to the filter for synthesis.

LPC Model

The particular source-filter model used in LPC is known as the Linear predictive coding model.

It has two key components: analysis or encoding and synthesis or decoding. The analysis part of

LPC involves examining the speech signal and breaking it down into segments or blocks. Each

segment is then examined further to find the answers to several key questions:

Is the segment voiced or unvoiced?

What is the pitch of the segment?

What parameters are needed to build a filter that models the vocal tract for the current

segment?

LPC analysis is usually conducted by a sender who answers these questions and usually

transmits these answers onto a receiver. The receiver performs LPC synthesis by using the

answers received to build a filter that when provided the correct input source will be able to

accurately reproduce the original speech signal.


Essentially, LPC synthesis tries to imitate human speech production. The figure demonstrates what

parts of the receiver correspond to what parts in the human anatomy. This diagram is for a

general voice or speech coder and is not specific to linear predictive coding. All voice coders

tend to model two things: excitation and articulation. Excitation is the type of sound that is

passed into the filter or vocal tract and articulation is the transformation of the excitation signal

into speech.

LPC Analysis/Encoding

Input speech

The input signal is sampled at a rate of 8000 samples per second. This input signal is then broken

up into segments or blocks which are each analysed and transmitted to the receiver. The 8000

samples in each second of speech signal are broken into 180 sample segments. This means that

each segment represents 22.5 milliseconds of the input speech signal.
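A small sketch of this segmentation step in MATLAB (the input vector is a placeholder; 180 samples at 8000 samples per second gives the 22.5 ms quoted above):

% Sketch: split an 8 kHz signal into 180-sample (22.5 ms) segments.
fs = 8000;
x  = randn(2*fs, 1);                       % stand-in for two seconds of speech
N  = 180;                                  % samples per segment
nSeg = floor(length(x)/N);
segments = reshape(x(1:nSeg*N), N, nSeg);  % one segment per column
seg_ms   = 1000 * N / fs;                  % duration of each segment = 22.5 ms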

Voiced/Unvoiced Determination

According to LPC-10 standards, before a speech segment is determined as being voiced or

unvoiced it is first passed through a low-pass filter with a bandwidth of 1 kHz. Determining if a

segment is voiced or unvoiced is important because voiced sounds have a different waveform than unvoiced sounds. The differences in the two waveforms create a need for the use of two


different input signals for the LPC filter in the synthesis or decoding. One input signal is for

voiced sounds and the other is for unvoiced. The LPC encoder notifies the decoder if a signal

segment is voiced or unvoiced by sending a single bit.

Recall that voiced sounds are usually vowels and can be considered as a pulse that is similar

to periodic waveforms. These sounds have high average energy levels which means that they

have very large amplitudes. Voiced sounds also have distinct resonant or formant frequencies.

Pitch Period Estimation

Determining if a segment is a voiced or unvoiced sound is not all of the information that is

needed by the LPC decoder to accurately reproduce a speech signal. In order to produce an input

signal for the LPC filter the decoder also needs another attribute of the current speech segment

known as the pitch period. The period for any wave, including speech signals, can be defined as

the time required for one wave cycle to completely pass a fixed position. For speech signals, the

pitch period can be thought of as the period of the vocal cord vibration that occurs during the

production of voiced speech. Therefore, the pitch period is only needed for the decoding of

voiced segments and is not required for unvoiced segments since they are produced by turbulent

air flow not vocal cord vibrations.

It is very computationally intensive to determine the pitch period for a given segment of

speech. There are several different types of algorithms that could be used. One type of algorithm

takes advantage of the fact that the autocorrelation of a periodic function, Rxx(k), will have a

maximum when k is equivalent to the pitch period. These algorithms usually detect a maximum

value by checking the autocorrelation value against a threshold value. One problem with

algorithms that use autocorrelation is that the validity of their results is susceptible to

interference as a result of other resonances in the vocal tract. When interference occurs the

algorithm cannot guarantee accurate results. Another problem with autocorrelation algorithms

occurs because voiced speech is not entirely periodic. This means that the maximum will be

lower than it should be for a true periodic signal.

LPC does not use an autocorrelation-based algorithm; instead it uses an algorithm called the average magnitude difference function (AMDF), which is defined as

AMDF(P) = (1/N) Σn | yn − yn−P |,

where the sum runs over the N samples of the current segment.

Since the pitch period, P, for humans is limited, the AMDF is evaluated for a limited range of the

possible pitch period values. Therefore, in LPC there is an assumption that the pitch period is

between 2.5 and 19.5 milliseconds. If the signal is sampled at a rate of 8000 samples/second then

20 < P < 160.

For voiced segments we can consider the set of speech samples for the current segment, {yn},

as a periodic sequence with period Po. This means that samples that are Po apart should have

similar values and that the AMDF function will have a minimum at Po, that is when P is equal to

the pitch period.


An advantage of the AMDF function is that it can be used to determine if a sample is voiced

or unvoiced. When the AMDF function is applied to an unvoiced signal, the difference between

the minimum and the average values is very small compared to voiced signals. This difference

can be used to make the voiced and unvoiced determination. For unvoiced segments the AMDF function will also have minima; however, these minima will be very close to the average value, which means that they will not be very deep.

AMDF for Voiced and Unvoiced Segments
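A compact MATLAB sketch of AMDF-based pitch estimation over one segment (the segment itself is a placeholder; the 20-160 sample search range and the 8000 samples/second rate follow the text above):

% Sketch: AMDF pitch estimation for one voiced segment at fs = 8000 Hz.
fs   = 8000;
n    = 0:179;                     % one 180-sample segment
seg  = sin(2*pi*70*n/fs);         % stand-in voiced segment (about 70 Hz pitch)
Pmin = 20;                        % lower limit on the pitch period (2.5 ms)
Pmax = 160;                       % upper limit on the pitch period
amdf = inf(1, Pmax);
for P = Pmin:Pmax
    amdf(P) = mean(abs(seg(P+1:end) - seg(1:end-P)));  % average |y(n) - y(n-P)|
end
[~, pitch_period] = min(amdf);    % deepest minimum gives the pitch period
                                  % (about 8000/70 = 114 samples here)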

Vocal Tract Filter

The filter that is used by the decoder to recreate the original input signal is created based on

a set of coefficients. These coefficients are extracted from the original signal during encoding

and are transmitted to the receiver for use in decoding. Each speech segment has different filter

coefficients or parameters that it uses to recreate the original sound. Not only are the parameters

themselves different from segment to segment, but the number of parameters differ from voiced

to unvoiced segment. Voiced segments use 10 parameters to build the filter while unvoiced

sounds use only 4 parameters.

A filter with n parameters is referred to as an nth order filter. In order to find the filter coefficients that best match the current segment being analysed, the encoder attempts to minimize the mean squared error, which is expressed as

E[en^2]   with   en = yn − (a1 yn−1 + a2 yn−2 + ... + aM yn−M),


where {yn} is the set of speech samples for the current segment and {ai} is the set of coefficients.

In order to provide the most accurate coefficients, {ai} is chosen to minimize the average value

of en for all samples in the segment.

The first step in minimizing the average mean squared error is to take the derivative. Taking the derivative produces a set of M equations. In order to solve for the filter coefficients, E[yn−i yn−j] has to be estimated. There are two approaches that can be used for this estimation: autocorrelation and autocovariance. Although there are versions of LPC that use both approaches, autocorrelation is the approach that will be explained in this report.

Autocorrelation requires that several initial assumptions be made about the set or sequence of

speech samples, {yn}, in the current segment. First, it requires that {yn} be stationary and second,

it requires that the {yn} sequence is zero outside of the current segment. In autocorrelation, each E[yn−i yn−j] is converted into an autocorrelation function of the form Ryy(| i−j |). The estimation of an autocorrelation function Ryy(k) can be expressed as

Ryy(k) = (1/N) Σn yn yn+k,

where the sum runs over the N samples of the current segment.

Using Ryy(k), the M equations that were acquired from taking the derivative of the mean

squared error can be written in matrix form RA = P where A contains the filter coefficients.


In order to determine the contents of A, the filter coefficients, the equation A = R^-1 P must be solved. This equation cannot be solved without first computing R^-1. This is an easy computation if one notices that R is symmetric and, more importantly, that all diagonals consist of the same element. This type of matrix is called a Toeplitz matrix and can be easily inverted.
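A small MATLAB sketch of this step (the segment y is a placeholder; toeplitz builds R from the autocorrelation values and the backslash operator then solves RA = P):

% Sketch: solve the LPC normal equations R*A = P for a 10th order filter.
M = 10;
y = randn(180, 1);              % stand-in speech segment (one column)
r = xcorr(y, M, 'biased');      % autocorrelation estimates for lags -M..M
r = r(M+1:end);                 % keep lags 0..M, i.e. Ryy(0)..Ryy(M)
R = toeplitz(r(1:M));           % M-by-M symmetric Toeplitz matrix
P = r(2:M+1);                   % right-hand side Ryy(1)..Ryy(M)
A = R \ P;                      % predictor (filter) coefficients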

The Levinson-Durbin (L-D) algorithm is a recursive algorithm that is considered very computationally efficient since it takes advantage of the properties of R when determining the filter coefficients. In this algorithm the coefficients are denoted with a superscript, {ai^(j)} for a jth order filter, and the average mean squared error of a jth order filter is denoted Ej instead of E[en^2]. When applied to an Mth order filter, the L-D algorithm computes all filters of order less than M; that is, it determines all order-N filters where N = 1, ..., M−1.

During the process of computing the filter coefficients {ai} a set of coefficients, {ki}, called

reflection coefficients or partial correlation coefficients (PARCOR) are generated. These

coefficients are used to solve potential problems in transmitting the filter coefficients. The

quantization of the filter coefficients for transmission can create a major problem since errors in

the filter coefficients can lead to instability in the vocal tract filter and create an inaccurate output

signal. This potential problem is averted by quantizing and transmitting the reflection

coefficients that are generated by the Levinson-Durbin algorithm. These coefficients can be used

to rebuild the set of filter coefficients {ai} and can guarantee a stable filter if their magnitude is

strictly less than one.
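MATLAB's built-in levinson routine carries out this recursion and also returns the reflection coefficients; a short sketch (the segment below is a placeholder):

% Sketch: Levinson-Durbin recursion using MATLAB's built-in levinson.
M = 10;
y = randn(180, 1);                    % stand-in speech segment
r = xcorr(y, M, 'biased');            % autocorrelation for lags -M..M
[a, Ep, k] = levinson(r(M+1:end), M); % a  = prediction error filter [1 a2 ... aM+1]
                                      % Ep = final prediction error power
                                      % k  = reflection (PARCOR) coefficients
stable = all(abs(k) < 1);             % |k| < 1 guarantees a stable synthesis filter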


Transmitting the Parameters

In an uncompressed form, speech is usually transmitted at 64,000 bits/second using 8

bits/sample and a rate of 8 kHz for sampling. LPC reduces this rate to 2,400 bits/second by

breaking the speech into segments and then sending the voiced/unvoiced information, the pitch

period, and the coefficients for the filter that represents the vocal tract for each segment.

The input signal used by the filter on the receiver end is determined by the classification of the

speech segment as voiced or unvoiced and by the pitch period of the segment. The encoder sends

a single bit to tell if the current segment is voiced or unvoiced. The pitch period is quantized

using a log-companded quantizer to one of 60 possible values. 6 bits are required to represent the

pitch period.

If the segment contains voiced speech then a 10th order filter is used. This means that 11 values are needed: 10 reflection coefficients and the gain. If the segment contains unvoiced speech then a 4th order filter is used. This means that 5 values are needed: 4 reflection coefficients and the gain. The reflection coefficients are denoted kn, where 1 ≤ n ≤ 10 for voiced speech filters and 1 ≤ n ≤ 4 for unvoiced filters.

LPC Synthesis/Decoding

The process of decoding a sequence of speech segments is the reverse of the encoding

process. Each segment is decoded individually and the sequence of reproduced sound segments

is joined together to represent the entire input speech signal. The decoding or synthesis of a

speech segment is based on the 54 bits of information that are transmitted from the encoder.
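As a quick consistency check using only figures already quoted in this report: each segment covers 180/8000 = 0.0225 seconds, so 54 bits per segment corresponds to 54/0.0225 = 2400 bits/second, the LPC bit rate given earlier.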

The speech signal is declared voiced or unvoiced based on the voiced/unvoiced determination

bit. The decoder needs to know what type of signal the segment contains in order to determine

what type of excitation signal will be given to the LPC filter. Unlike other speech compression algorithms like CELP, which have a codebook of possible excitation signals, LPC only has two possible signals.

For voiced segments a pulse is used as the excitation signal. This pulse consists of 40 samples

and is locally stored by the decoder. A pulse is defined as “...an isolated disturbance, that travels

through an otherwise undisturbed medium” [10]. For unvoiced segments white noise produced

by a pseudorandom number generator is used as the input for the filter.

The pitch period for voiced segments is then used to determine whether the 40 sample pulse

needs to be truncated or extended. If the pulse needs to be extended it is padded with zeros since

the definition of a pulse said that it travels through an undisturbed medium. This combination of voiced/unvoiced determination and pitch period is all that is needed to produce the excitation signal.

Each segment of speech has a different LPC filter that is eventually produced using the

reflection coefficients and the gain that are received from the encoder. 10 reflection coefficients

are used for voiced segment filters and 4 reflection coefficients are used for unvoiced segments.

These reflection coefficients are used to generate the vocal tract coefficients or parameters which

are used to create the filter.


The final step of decoding a segment of speech is to pass the excitation signal through the

filter to produce the synthesized speech signal.

LPC Applications

In general, the most common usage for speech compression is in standard telephone systems.

In fact, a lot of the technology used in speech compression was developed by the phone

companies.

Linear predictive coding only has application in the area of secure telephony because of its low

bit rate. Secure telephone systems require a low bit rate since speech is first digitized, then

encrypted and transmitted. These systems have a primary goal of decreasing the bit rate as much

as possible while maintaining a level of speech quality that is understandable.

Other standards such as the digital cellular standard and the international telephone network

standard have higher quality standards and therefore require a higher bit rate. In these standards,

understanding the speech is not good enough; the listener must also be able to recognize the

speech as belonging to the original source.

A second area that linear predictive coding has been used is in Text-to-Speech synthesis. In

this type of synthesis the speech has to be generated from text. Since LPC synthesis involves the

generation of speech based on a model of the vocal tract, it provides a perfect method for

generating speech from text.

Further applications of LPC and other speech compression schemes are voice mail systems,

telephone answering machines, and multimedia applications. Most multimedia applications,

unlike telephone applications, involve one-way communication and involve storing the data. An

example of a multimedia application that would involve speech is an application that allows

voice annotations about a text document to be saved with the document. The method of speech

compression used in multimedia applications depends on the desired speech quality and the

limitations of storage space for the application. Linear Predictive Coding provides a favourable

method of speech compression for multimedia applications since it provides the smallest storage

space as a result of its low bit rate.


Full LPC Model and Implementation


MATLAB Implementation

Main.m

%MAIN BODY

clear all;

clc;

disp('wavfile');

%INPUT

inpfilenm = 'sample1';

[x, fs] =wavread(inpfilenm);

%LENGTH (IN SEC) OF INPUT WAVEFILE,

t=length(x)./fs;

sprintf('Processing the wavefile "%s"', inpfilenm)

sprintf('The wavefile is %3.2f seconds long', t)

%THE ALGORITHM STARTS HERE,

M=10; %prediction order

[aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M); %pitch_plot holds the pitch periods

synth_speech = f_DECODER (aCoeff, pitch_plot, voiced, gain);

%RESULTS

beep;

disp('Press a key to play the original sound!');

pause;

soundsc(x, fs);

disp('Press a key to play the LPC compressed sound!');

pause;

soundsc(synth_speech, fs);

figure;

subplot(2,1,1), plot(x); title(['Original signal = "', inpfilenm, '"']);

subplot(2,1,2), plot(synth_speech); title(['synthesized speech of "', inpfilenm, '" using LPC algo']);


f_ENCODER.m

function [aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M);

if (nargin < 3), M = 10; end %default prediction order if not supplied

b=1;

fsize = 30e-3; %frame size

frame_length = round(fs .* fsize);

N= frame_length - 1;

%VOICED/UNVOICED and PITCH; [independent of frame segmentation]

[voiced, pitch_plot] = f_VOICED (x, fs, fsize);

%FRAME SEGMENTATION for aCoeff and GAIN;

for b=1 : frame_length : (length(x) - frame_length),

y1=x(b:b+N);

y = filter([1 -.9378], 1, y1); %pre-emphasis filtering

%aCoeff [LEVINSON-DURBIN METHOD];

[a, tcount_of_aCoeff, e] = func_lev_durb (y, M);

aCoeff(b: (b + tcount_of_aCoeff - 1)) = a;

%GAIN;

pitch_plot_b = pitch_plot(b); %pitch period

voiced_b = voiced(b);

gain(b) = f_GAIN (e, voiced_b, pitch_plot_b);

end

func_lev_durb.m

%function of levinsonDurbin

function [aCoeff, tcount_of_aCoeff, e] = func_lev_durb (y, M);

if (nargin<2), M = 10; end

sk=0;

a=[zeros(M+1);zeros(M+1)];

z=xcorr(y);


%finding array of R[l]

R=z( ( (length(z)+1) ./2 ) : length(z));

s=1;

J(1)=R(1);

%GETTING OTHER PARAMETERS OF PREDICTOR OF ORDER "(s-1)":

for s=2:M+1,

sk=0;

for i=2:(s-1),

sk=sk + a(i,(s-1)).*R(s-i+1);

end

k(s)=(R(s) + sk)./J(s-1);

J(s)=J(s-1).*(1-(k(s)).^2);

a(s,s)= -k(s);

a(1,s)=1;

for i=2:(s-1),

a(i,s)=a(i,(s-1)) - k(s).*a((s-i+1),(s-1));

end

end

aCoeff=a((1:s),s)';

tcount_of_aCoeff = length(aCoeff);

est_y = filter([0 -aCoeff(2:end)],1,y);

e = y - est_y;

f_VOICED.m

%function_main of voiced/unvoiced detection

function [voiced, pitch_plot] = f_VOICED(x, fs, fsize);

f=1;

b=1;

frame_length = round(fs .* fsize);

N= frame_length - 1;

%FRAME SEGMENTATION:

for b=1 : frame_length : (length(x) - frame_length),

y1=x(b:b+N);

y = filter([1 -.9378], 1, y1); %pre-emphasis filter

msf(b:(b + N)) = func_vd_msf (y);


zc(b:(b + N)) = func_vd_zc (y);

pitch_plot(b:(b + N)) = func_pitch (y,fs);

end

thresh_msf = (( (sum(msf)./length(msf)) - min(msf)) .* (0.67) ) + min(msf);

voiced_msf = msf > thresh_msf; %=1,0

thresh_zc = (( ( sum(zc)./length(zc) ) - min(zc) ) .* (1.5) ) + min(zc);

voiced_zc = zc < thresh_zc;

thresh_pitch = (( (sum(pitch_plot)./length(pitch_plot)) - min(pitch_plot)) .* (0.5) ) + min(pitch_plot);

voiced_pitch = pitch_plot > thresh_pitch;

for b=1:(length(x) - frame_length),

if voiced_msf(b) .* voiced_pitch(b) .* voiced_zc(b) == 1,

% if voiced_msf(b) + voiced_pitch(b) > 1,

voiced(b) = 1;

else

voiced(b) = 0;

end

end

voiced;

pitch_plot;

func_pitch.m

function pitch_period = func_pitch (y,fs)

clear pitch_period;

period_min = round (fs .* 2e-3);

period_max = round (fs .* 20e-3);

R=xcorr(y);

[R_max , R_mid]=max(R);

pitch_per_range = R ( R_mid + period_min : R_mid + period_max );

[R_max, R_mid] = max(pitch_per_range);

pitch_period = R_mid + period_min;


func_vd_msf.m

function m_s_f = func_vd_msf (y)

clear m_s_f;

[B,A] = butter(9,.33,'low'); %.5 or .33?

y1 = filter(B,A,y);

m_s_f=sum(abs(y1));

func_vd_zc.m

function ZC = func_vd_zc (y)

ZC=0;

for n=1:length(y),

if n+1>length(y)

break

end

ZC=ZC + (1./2) .* abs(sign(y(n+1))-sign(y(n)));

end

ZC;

f_GAIN.m

%function for calc gain per frame

function [gain_b, power_b] = f_GAIN (e, voiced_b, pitch_plot_b);

if voiced_b == 0,

denom = length(e);

power_b = sum(e (1:denom) .^2) ./ denom;

gain_b = sqrt( power_b );

else

denom = ( floor( length(e)./pitch_plot_b ) .* pitch_plot_b );

power_b = sum( e (1:denom) .^2 ) ./ denom;

gain_b = sqrt( pitch_plot_b .* power_b );

end


power_b;

gain_b;

f_DECODER.m

%DECODER PORTION

function synth_speech = f_DECODER (aCoeff, pitch_plot, voiced, gain);

frame_length=1;

for i=2:length(gain)

if gain(i) == 0,

frame_length = frame_length + 1;

else break;

end

end

%decoding starts here,

for b=1 : frame_length : (length(gain)),

if voiced(b) == 1, %voiced frame

pitch_plot_b = pitch_plot(b);

syn_y1 = f_SYN_V (aCoeff, gain, frame_length, pitch_plot_b, b);

else syn_y1 = f_SYN_UV (aCoeff, gain, frame_length, b); %unvoiced frame

end

synth_speech(b:b+frame_length-1) = syn_y1;

end

f_SYN_V.m

%a function of f_DECODER

function syn_y1 = f_SYN_V (aCoeff, gain, frame_length, pitch_plot_b, b);

%creating pulsetrain;

for f=1:frame_length

if f./pitch_plot_b == floor(f./pitch_plot_b)

ptrain(f) = 1;

else ptrain (f) = 0;

end

end


syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], ptrain);

syn_y1 = syn_y2 .* gain(b);

f_SYN_UV.m

%a function of f_DECODER

function syn_y1 = f_SYN_UV (aCoeff, gain, frame_length, b);

wn = randn(1, frame_length);

syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], wn);

syn_y1 = syn_y2 .* gain(b);


Discussion and Conclusion

Linear Predictive Coding is an analysis/synthesis technique for lossy speech compression that attempts to model the human production of sound instead of transmitting an estimate of the sound wave. Linear predictive coding achieves a bit rate of 2400 bits/second, which makes it ideal for use in secure telephone systems. Secure telephone systems are more concerned that the content and meaning of speech, rather than the quality of speech, be preserved. The trade-off for LPC’s low bit rate is that it does have some difficulty with certain sounds and it produces speech that sounds synthetic.

Linear predictive coding encoders break up a sound signal into different segments and then

send information on each segment to the decoder. The encoder sends information on whether the segment is voiced or unvoiced and the pitch period for voiced segments, which is used to create an excitation signal in the decoder. The encoder also sends information about the vocal tract, which is used to build a filter on the decoder side which, when given the excitation signal as input, can reproduce the original speech.


References

[1] Lawrence R. Rabiner and Ronald W. Schafer. Introduction to Digital Speech Processing

Vol. 1, Nos. 1–2 (2007) 1–194

[2] V. Hardman and O. Hodson. Internet/Mbone Audio (2000) 5-7.

[3] Scott C. Douglas. Introduction to Adaptive Filters, Digital Signal Processing Handbook

(1999) 7-12.

[4] Poor, H. V., Looney, C. G., Marks II, R. J., Verdú, S., Thomas, J. A., Cover, T. M.

Information Theory. The Electrical Engineering Handbook (2000) 56-57.

[5] R. Sproat, and J. Olive. Text-to-Speech Synthesis, Digital Signal Processing Handbook

(1999) 9-11.

[6] Richard C. Dorf et al. Broadcasting (2000) 44-47.

[7] Richard V. Cox. Speech Coding (1999) 5-8.

[8] Randy Goldberg and Lance Riek. A Practical Handbook of Speech Coders (1999)

Chapter 2:1-28, Chapter 4: 1-14, Chapter 9: 1-9, Chapter 10:1-18.

[9] Mark Nelson and Jean-Loup Gailly. Speech Compression, The Data Compression Book

(1995) 289-319.

[10] Khalid Sayood. Introduction to Data Compression (2000) 497-509.

[11] Richard Wolfson, Jay Pasachoff. Physics for Scientists and Engineers (1995) 376-377.