SPEECH PROCESSING IN GSM SYSTEMS


Page 1

SPEECH PROCESSING IN GSM SYSTEMS

Page 2

This lesson includes the following topics:
• Human speech characteristics.
• Digital conversion and PCM.
• GSM speech compression algorithm.
• Speech transmission through the GSM network.

Page 3

Let's start by understanding what happens in our throat when we are speaking. The human voice is generated by pushing air from our lungs through the vocal tract and out of the mouth, as shown in the drawing. The specific voiced sound is generated by the vibration (opening and closing) of the vocal cords, and the vibration rate of the vocal cords determines the pitch of the voice. Women and young children tend to have a high pitch (fast vibration), while adult males tend to have a low pitch (slow vibration).

Human Voice

Page 4

There are two types of speech sounds: voiced and unvoiced. They produce different sounds and spectra because they are formed in different ways.

Properties of Speech

Page 5

With voiced speech, air pressure from the lungs forces the normally closed vocal cords to open and vibrate. The vibration frequencies (pitch) vary from about 50 to 400 Hz (depending on the person's age and sex) and form resonances in the vocal tract at odd harmonics. These resonance peaks are called formants.

Properties of Speech

Page 6

Unvoiced sounds, called fricatives (such as s, f, and sh), are formed by forcing air through an opening (hence the term, derived from the word "friction"). Fricatives do not vibrate the vocal cords and therefore do not produce the periodicity seen in the formant structure of voiced speech; unvoiced sounds appear more noise-like. Time-domain samples lose periodicity, and the power spectral density does not display the clear resonant peaks found in voiced sounds.

The spectrum for speech (combined voiced and unvoiced sounds) has a total bandwidth of approximately 7000 Hz with an average energy at about 3000 Hz.

The auditory canal optimizes speech detection by acting as a resonant cavity at this average frequency. Note that the power of speech spectra and the periodic nature of formants drastically diminish above 3500 Hz.

Speech encoding algorithms can be less complex than general encoding by concentrating (through filters) on this region. Furthermore, since telecommunications employ filters that pass frequencies up to only 3000-4000 Hz, high frequencies produced by fricatives are removed. A caller will often have to spell or otherwise distinguish these sounds to be understood (e.g. the F in Frank).

Properties of Speech

Page 7

Pulse Code Modulation (PCM) is a method of converting analog speech into digital signals. It is used not only in telecom networks, but also for digital audio in computers and various compact disc formats, and it is standard for digital video.

Generally, the process of transforming signals from analog into digital form is called Analog-to-Digital (A/D) conversion, and the opposite process is called Digital-to-Analog (D/A) conversion.

Voice Encoding

Page 8

Why is A/D necessary?

Because information in analog form cannot be processed by digital computers, it is necessary to convert it into digital form. Besides, digital data can be transported robustly over long distances, unlike analog data, and can be interleaved with other digital data (the multiplexing we learned about earlier), so various combinations of transmission channels can be used.

But how can analog signals, which are continuous in time and amplitude, be converted into discrete-in-time digital signals?

This transformation is based on the Sampling Theorem, first formulated in 1928 by Harry Nyquist and formally proved by Claude E. Shannon in 1949.

Voice Encoding

Page 9

Sampling is the process of converting a continuous analog signal into a numeric sequence that is a function of discrete time. The Sampling Theorem states that for band-limited signals sampled at a rate of at least twice the signal bandwidth, the resulting samples involve no loss of information and can therefore be used to reconstruct the original signal with arbitrarily good fidelity.

This is a very important theorem that represents the foundation of the entire modern digital signal processing technology; our world would be completely different without it!

In essence, the theorem says that under well-defined conditions, a continuous-time signal can be fully represented by discrete samples of it.

The Sampling Theorem
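As a worked illustration (not part of the original lesson), the short Python sketch below checks the Nyquist condition for the standard telephone speech band; the 3,400 Hz band edge and the 8,000 Hz PCM sampling rate are taken from the following pages.

# Python sketch (illustrative only): checking the sampling theorem condition
# for a band-limited telephone speech signal.

def nyquist_rate(bandwidth_hz: float) -> float:
    # Minimum sampling rate required by the sampling theorem.
    return 2.0 * bandwidth_hz

speech_band_top_hz = 3400.0   # upper edge of the standard speech band
fs_pcm_hz = 8000.0            # the PCM sampling rate used in telephony

print(nyquist_rate(speech_band_top_hz))               # 6800.0 Hz, the theoretical minimum
print(fs_pcm_hz >= nyquist_rate(speech_band_top_hz))  # True: 8 kHz sampling is sufficient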

Page 10

After studying a few basic definitions, let's return to PCM, which is our objective. The block diagram of a PCM converter is provided below:

PCM Block Diagram

Page 11

The first step in converting the signal from analog to digital is to filter out the higher-frequency components using a Low Pass Filter (LPF). This is an electronic circuit that allows frequency components below the cutoff frequency to pass while attenuating higher-frequency components.

The LPF is necessary in order to comply with the first condition of the sampling theorem, namely to have a band limited signal.

In the case of human speech, the signal bandwidth can be a few thousand hertz depending on the speaker, but most of its energy lies between roughly 200-300 Hz and 2,700-2,800 Hz. For this reason, the telecom standards defined 300 Hz to 3,400 Hz as the standard speech band.

The Low Pass Filter

Page 12

There are many ways to implement an LPF. A very simple filter consisting of a resistor and a capacitor is shown in the drawing. Its attenuation curve is also provided below.

The Low Pass Filter
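As a hedged illustration (not from the lesson), the Python sketch below computes the -3 dB cutoff of the simple RC filter mentioned above using f_c = 1/(2πRC); the component values are assumptions chosen to land near the speech-band edge.

import math

def rc_cutoff_hz(r_ohms: float, c_farads: float) -> float:
    # -3 dB cutoff frequency of a first-order RC low-pass filter.
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

# Illustrative component values only (not specified in the lesson):
print(round(rc_cutoff_hz(4700.0, 10e-9)))   # ~3386 Hz with R = 4.7 kOhm, C = 10 nF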

Page 13

The second step in converting an analog voice signal to a digital signal is to sample the filtered input signal at a constant sampling frequency. According to the sampling theorem, the sampling frequency should be at least double the highest signal frequency; this minimum rate is also called the Nyquist rate.

A sampling rate higher than the Nyquist rate is called over-sampling. In this case, part of the information generated through the sampling is redundant, so we are loading the transmission channel with useless information.

A sampling rate lower than the Nyquist rate is called under-sampling. This generates a spectrum distortion called aliasing, resulting in loss of information.

The standard sampling frequency selected for PCM is 8,000 samples per second. Sampling is performed by an electronic circuit called sample-and-hold. The result of the sampling is a series of pulses whose amplitudes match those of the original signal at the sampling instants. Such a signal is shown in the figure.

Sampling
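A small Python sketch (not from the lesson) of ideal sampling at the 8,000 samples-per-second PCM rate; the 1 kHz test tone is an assumed example signal inside the speech band.

import math

FS_HZ = 8000        # PCM sampling rate
TONE_HZ = 1000      # assumed test tone inside the 300-3,400 Hz band

def sample_tone(duration_s: float) -> list:
    # Amplitudes of a unit sine wave taken at the sampling instants n / FS_HZ.
    n_samples = int(duration_s * FS_HZ)
    return [math.sin(2.0 * math.pi * TONE_HZ * n / FS_HZ) for n in range(n_samples)]

samples = sample_tone(0.001)              # one millisecond of speech-band tone
print(len(samples))                       # 8 samples
print([round(s, 3) for s in samples])     # 0.0, 0.707, 1.0, 0.707, 0.0, -0.707, ...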

Page 14

Quantization is the process of converting the analog sample amplitudes (heights) from continuous to discrete values, as shown in the drawing. The difference between two consecutive quantization levels is the quantization interval - the amplitude value corresponding to one bit. The drawing represents a 4-bit conversion, equivalent to a total of 16 levels (2^4 = 16). Let's take sample A: its amplitude lies between levels 11 and 12. Quantization means that its value will be rounded to the nearer of these two levels, in our case level 11.

Quantization
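The rounding described above can be sketched in a few lines of Python (illustrative only); it maps a normalized sample to the nearest of the 16 levels of a 4-bit quantizer, as in the drawing.

N_BITS = 4
N_LEVELS = 2 ** N_BITS          # 16 quantization levels

def quantize(sample: float, full_scale: float = 1.0) -> int:
    # Round a sample in [0, full_scale) to the nearest quantization level.
    step = full_scale / N_LEVELS              # the quantization interval
    level = round(sample / step)
    return min(max(level, 0), N_LEVELS - 1)   # clamp to the valid range

# A sample lying between levels 11 and 12 is rounded to the nearer one:
print(quantize(0.71))   # 0.71 / 0.0625 = 11.36 -> level 11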

Page 15

It is clear that taking smaller steps will decrease the quantization interval  and our approximation of the sample’s value will be more accurate. This will obviously require more bits to express the amplitude of the sample.

Therefore, high-quality A/D and D/A units, which are also more expensive, may use 24 bits per sample (16.7 million levels) or even more. As a compromise between quality and cost, a resolution of 8 bits per sample was defined for PCM.

Quantization

Page 16

We can now calculate the bit rate for one PCM channel by multiplying the sampling rate by the number of bits per sample: 8,000 samples/s × 8 bits = 64 Kbps.

The difference between the real value of a sample and the quantized value translates into noise at the D/A output called quantization noise. As the distance between quantization steps decreases (the number of bits per sample increases) the noise also decreases as the error is smaller.

In our discussion up to now, we assumed all quantization intervals were equal. Uniform quantization uses equal quantization levels throughout the entire dynamic range (the ratio between the highest and lowest signal amplitudes) of an input analog signal. Because quantization noise does not depend on the signal's amplitude, the signal-to-quantization-noise ratio - the S/N we already know - is lower for low-level signals.

Since most voice signals generated are low level, providing better voice quality at higher signal levels is a very inefficient way of digitizing voice signals. To improve voice quality at lower signal levels, uniform quantization (uniform PCM) is replaced by a non-uniform quantization process.

Quantization
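The arithmetic above can be summarized in a short Python sketch (not from the lesson). The S/N figure uses the standard textbook approximation of roughly 6.02 dB per bit plus 1.76 dB for a full-scale sine wave, which is an outside assumption rather than a number given on the slides.

def pcm_bit_rate_bps(fs_hz: int, bits_per_sample: int) -> int:
    # One PCM channel: sampling rate times bits per sample.
    return fs_hz * bits_per_sample

def uniform_quantization_snr_db(bits_per_sample: int) -> float:
    # Standard approximation for a full-scale sine wave (assumed, not from the lesson).
    return 6.02 * bits_per_sample + 1.76

print(pcm_bit_rate_bps(8000, 8))          # 64000 bit/s, the 64 Kbps PCM channel
print(uniform_quantization_snr_db(8))     # ~49.9 dB with 8 bits per sample
print(uniform_quantization_snr_db(13))    # ~80.0 dB with 13 bits per sample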

Page 17

The term companding is created by combining two terms, compressing and expanding, into one word. Companding refers to the process of compressing an analog signal at the source and then expanding it back to its original size when it reaches its destination.

As a result of companding, the quantization intervals become unequal. The purpose of companding is to correct the lower S/N at low signal levels by allocating larger quantization intervals to higher signal amplitudes.

How is companding performed?

On the A/D side, the input analog signal samples are compressed using a logarithmic amplifier and then quantized using uniform quantization. The compression increases as the sample amplitude increases.

In other words, the larger samples (corresponding to higher amplitudes) are compressed more than the smaller samples. This causes the quantization noise to increase as the sample (signal) amplitude increases. A logarithmic increase in quantization noise throughout the dynamic range of the input signal keeps the S/N constant across this dynamic range. At the receiver, the decoded signal is expanded using an amplifier with characteristics inverse to those of the input logarithmic amplifier.

There are two ITU-T standards for companding called A-law and µ-law.

Companding
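To make the idea concrete, the Python sketch below implements the ideal continuous compression curves behind the two ITU-T laws (µ = 255, A = 87.6). Real G.711 codecs use segmented 8-bit approximations of these curves; this is only an illustrative sketch of the logarithmic compression itself.

import math

MU = 255.0      # µ-law constant (North America, Japan)
A = 87.6        # A-law constant (rest of the world)

def mu_law_compress(x: float) -> float:
    # µ-law compression of a sample normalized to [-1, 1].
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def a_law_compress(x: float) -> float:
    # A-law compression of a sample normalized to [-1, 1].
    ax = abs(x)
    if ax < 1.0 / A:
        y = A * ax / (1.0 + math.log(A))
    else:
        y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))
    return math.copysign(y, x)

# Small samples are boosted strongly, large samples only a little:
for x in (0.01, 0.1, 1.0):
    print(x, round(mu_law_compress(x), 3), round(a_law_compress(x), 3))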

Page 18

Bell Labs developed the µ-law method of logarithmic quantization used in North America and Japan. µ-law (pronounced 'mew-law') tends to have lower idle-channel noise than A-law.

The compressed maximum signal amplitude is divided into 16 equal segments, 8 positive and 8 negative.

The µ-law Companding

Page 19

Each segment includes 16 equal quantization levels, indicated on the right-hand side of the drawing. The first bit indicates the sign of the sample (1 for positive and 0 for negative), the next 3 bits give the segment number, and the last 4 bits give the quantization level within the segment.

On the left-hand side are the amplitude values.

We can see the compression by observing that the first segment covers 32 amplitude levels (0 to 31) while the last one covers 4,096, which means a compression of 128 times.

The µ-law Companding
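A short Python sketch (illustrative only) of the 8-bit codeword layout described above: one sign bit, three segment bits, and four level bits. The bit-inversion details of real G.711 µ-law encoders are left out.

def pack_mu_law(sign: int, segment: int, level: int) -> int:
    # Build the 8-bit codeword: [ sign | segment (3 bits) | level (4 bits) ].
    assert sign in (0, 1) and 0 <= segment <= 7 and 0 <= level <= 15
    return (sign << 7) | (segment << 4) | level

def unpack_mu_law(code: int):
    # Recover (sign, segment, level) from the 8-bit codeword.
    return (code >> 7) & 1, (code >> 4) & 7, code & 15

code = pack_mu_law(sign=1, segment=3, level=11)
print(format(code, "08b"), unpack_mu_law(code))   # 10111011 (1, 3, 11)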

Page 20

The ITU (International Telecommunication Union) modified the quantization method in the G.711 specification to A-law, which is used throughout the rest of the world.

The division into segments is different here. This different segment allocation is the reason why A-law and µ-law differ slightly in dynamic range and in low-level signal-to-noise performance, as compared on the next page.

The A-law Companding

Page 21

The main advantages and drawbacks of the two compression algorithms are:

A-law provides a greater dynamic range than µ-law. Dynamic range is the ratio in decibels between strongest and weakest signals.

µ-law provides better signal-to-distortion performance for low level signals than A-law leading to higher signal fidelity at low levels.

A-law requires 13 bits for a uniform PCM equivalent, while µ-law requires 14 bits. The uniform PCM equivalent is the number of bits necessary to represent the compressed signal using uniform quantization intervals.

An international connection will always use the A-law. For example, a trans-Atlantic link connecting a µ-law country (US or Canada) with an A-law country (any country in Europe) by definition will use the A-law.

The µ-law to A-law conversion is the responsibility of the µ-law country (the North American country).

Differences Between A-law and µ-law

Page 22

This last block of the PCM block diagram converts the 8-bit digital signal into the PCM waveform. An example of a shorter PCM, with 4 bits per sample, is shown here.

PCM Waveform Generation

Page 23

The block diagram below presents the main parts of the speech coding process in the MS. The first two blocks implement an A-law PCM conversion with linear quantization (without the companding), which requires - as was already learned - 13 bits per sample. The bit rate of the resulting stream is 8,000 × 13 = 104 Kbps. The third block implements the GSM compression algorithm, called Regular Pulse Excited - Long Term Prediction (RPE-LTP), which reduces the digital speech rate to 13 Kbps.

GSM Speech Coding
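The bit-rate arithmetic above, as a Python sketch (illustrative only):

SAMPLE_RATE_HZ = 8000      # samples per second
UNIFORM_BITS = 13          # bits per sample before compression
FRAME_MS = 20              # RPE-LTP frame length
BITS_PER_FRAME = 260       # RPE-LTP output per frame

uniform_rate_bps = SAMPLE_RATE_HZ * UNIFORM_BITS       # 104000 bit/s = 104 Kbps
rpe_ltp_rate_bps = BITS_PER_FRAME * 1000 // FRAME_MS   # 13000 bit/s = 13 Kbps

print(uniform_rate_bps, rpe_ltp_rate_bps)              # 104000 13000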

Page 24

In the opposite direction, PCM-coded speech at 64 Kbps is received from the MSC. First, the 8-bit A-law signal is converted into a 13-bit uniformly quantized signal having the already known 104 Kbps rate.

Then the same RPE-LTP speech compression block as on the MS side reduces the bit rate to 13 Kbps. All this happens inside the TRAU, which - as we know - is part of the BSC.

GSM Speech Coding

Page 25

The Regular Pulse Excited - Long Term Prediction (RPE-LTP) speech encoder of GSM is the result of intense development work. LPC stands for Linear Prediction Coefficients - a set of parameters derived from a model of the human vocal tract. The GSM group studied several speech coding algorithms on the basis of subjective speech quality and complexity (which is related to cost, processing delay, and power consumption once implemented) before choosing a Regular Pulse Excited Linear Predictive Coder (RPE-LPC) with a Long Term Predictor loop. Speech is divided into 20-millisecond blocks of 104 Kbit/s × 20 ms = 2,080 bits each. Each block is encoded as 260 bits, giving a total output bit rate of 260 bits / 20 ms = 13 Kbps. The 260 output bits are divided into 36 LPC bits, 36 LTP bits, and 188 RPE bits.

GSM Speech Coding
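The frame arithmetic quoted above, checked in a few lines of Python (illustrative only):

FRAME_MS = 20
INPUT_RATE_KBPS = 104                      # 13-bit uniform PCM at 8 kHz
LPC_BITS, LTP_BITS, RPE_BITS = 36, 36, 188

bits_in = INPUT_RATE_KBPS * FRAME_MS             # 2080 bits enter the encoder per frame
bits_out = LPC_BITS + LTP_BITS + RPE_BITS        # 260 bits leave it
rate_out_kbps = bits_out / FRAME_MS              # 13.0 Kbps

print(bits_in, bits_out, rate_out_kbps)          # 2080 260 13.0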

Page 26

The speech signal transmitted over the GSM radio interface must be protected from errors. GSM uses convolutional encoding and block interleaving to achieve this protection.

We know that the speech encoder produces a block of 260 bits per 20ms.

From subjective testing, it was found that some bits of this block were more important for perceived speech quality than others.

Class 1a: 50 bits - most sensitive to bit errors
Class 1b: 132 bits - moderately sensitive to bit errors
Class 2: 78 bits - least sensitive to bit errors

GSM Speech Coding

Page 27

The Class 1a bits have a 3-bit Cyclic Redundancy Code (CRC) added for error detection, creating together a block of 53 bits. If an error is detected, the frame is judged too damaged to be comprehensible and is discarded. It is replaced by a slightly attenuated version of the previous correctly received frame.

The 132 Class 1b bits together with the above 53 bits and a 4 bit tail sequence (a total of 189 bits) are input into a 1/2 rate convolutional encoder of constraint length 4. Each input bit is encoded as two output bits, based on a combination of the previous 4 input bits.

The convolutional encoder thus outputs: 2 × 189 =  378 bits

The 78 unprotected Class 2 bits are added to the above 378 bits for a total of 456 bits per 20 ms, giving an output bit rate of 456/20 = 22.8 Kbps - the full-rate voice channel of GSM.

GSM Channel Coding
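The channel coding arithmetic above, as a Python sketch (illustrative only):

CLASS_1A, CLASS_1B, CLASS_2 = 50, 132, 78    # the 260 speech bits per 20 ms frame
CRC_BITS, TAIL_BITS = 3, 4

protected_in = CLASS_1A + CRC_BITS + CLASS_1B + TAIL_BITS   # 189 bits into the 1/2-rate encoder
encoded = 2 * protected_in                                  # 378 bits out of the encoder
total_bits = encoded + CLASS_2                              # 456 bits per 20 ms
rate_kbps = total_bits / 20                                 # 22.8 Kbps full-rate channel

print(protected_in, encoded, total_bits, rate_kbps)         # 189 378 456 22.8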

Page 28

In order to protect the speech signal against burst errors (strings of errors) generated by radio signal fading (the slow fluctuation in received signal level), each sequence is interleaved. The aim is to divide an error burst between two blocks of bits, so that each is only slightly affected instead of one being severely affected.

Note the position of the data from the two channels.

The same data structure applies for all eight 114 bit timeslot bursts.

Interleaving
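A simplified Python sketch of the idea (not the exact GSM mapping): the 456 coded bits of one 20 ms block are spread over eight 57-bit half-bursts, so a burst error on the radio path damages only a fraction of any one speech block. The simple "bit i goes to half-burst i mod 8" rule below is an illustrative assumption.

def interleave(block):
    # Distribute 456 coded bits over 8 half-bursts of 57 bits each.
    assert len(block) == 456
    half_bursts = [[] for _ in range(8)]
    for i, bit in enumerate(block):
        half_bursts[i % 8].append(bit)
    return half_bursts

coded_block = [i % 2 for i in range(456)]             # dummy coded speech block
print([len(hb) for hb in interleave(coded_block)])    # [57, 57, 57, 57, 57, 57, 57, 57]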

Page 29

Below, a different way of presenting the 8 consecutive bursts of the same channel is shown.

Each burst includes two groups of 57 data bits each, originating from two different 20 ms speech blocks: the blue one and the red one.

Interleaving

Page 30

Summary of coding and interleaving

Page 31

END OF PART 2