Audio Intro

Posted on 04-Apr-2018


7/31/2019 Audio Intro

    Audio

    Theory and Characteristics

EE1432 Pengolahan Sinyal Multimedia (Multimedia Signal Processing)

    Endang Widjiati [email protected]

    Bidang Studi Telekomunikasi Multimedia (Multimedia Telecommunications Study Program)

    Jurusan Teknik Elektro (Department of Electrical Engineering)

    Fakultas Teknologi Industri (Faculty of Industrial Technology)

    Institut Teknologi Sepuluh Nopember


    Introduction

Sound within the human hearing range is called audio, and waves in this frequency range are called acoustic signals. Speech is an acoustic signal produced by humans.

Typical audio signal classes: telephone speech, wideband speech, and wideband audio. The differences lie in bandwidth, dynamic range, and listener expectations of quality.

Some important concepts:
    - sampling the analog signal in the time dimension
    - quantizing the analog signal in the amplitude dimension
    - the Nyquist theorem
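These three concepts can be put together in a short sketch (a hypothetical illustration, not from the slides): sample a sine tone, uniformly quantize the samples, and refuse sampling rates that violate the Nyquist criterion.

```python
import numpy as np

def sample_and_quantize(freq_hz, fs_hz, n_bits, duration_s=0.01):
    """Sample a sine tone (time dimension), then uniformly quantize
    the samples (amplitude dimension)."""
    if fs_hz <= 2 * freq_hz:
        # Nyquist theorem: the sampling rate must exceed twice the
        # highest signal frequency, or the tone aliases
        raise ValueError("sampling rate violates the Nyquist criterion")
    t = np.arange(0, duration_s, 1.0 / fs_hz)   # sampling instants
    x = 0.9 * np.sin(2 * np.pi * freq_hz * t)   # sampled "analog" signal
    step = 2.0 / 2 ** n_bits                    # quantizer step over [-1, 1)
    xq = np.round(x / step) * step              # uniform quantization
    return x, xq

x, xq = sample_and_quantize(freq_hz=1000, fs_hz=8000, n_bits=8)
print(np.abs(x - xq).max() <= (2.0 / 2 ** 8) / 2)   # error bounded by step/2: True
```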


    Introduction

Multimedia systems typically make use of sound only within the frequency range of human hearing, sampled at rates usually between 8 kHz and 48 kHz. The amplitude of a sound wave is the property heard as loudness.

    The frequency range is divided into:

        Infrasound                        0 Hz – 20 Hz
        Human hearing frequency range    20 Hz – 20 kHz
        Ultrasound                       20 kHz – 1 GHz
        Hypersound                        1 GHz – 10 THz


    Introduction

SNR: the ratio of the power of the correct signal to the power of the noise; it measures the quality of the signal and is usually expressed in decibels (dB).

    The levels of sound we hear are described in dB, as a ratio to the quietest sound we are able to hear.

    Other concepts: SQNR and segmental SNR

    Magnitudes of common sounds, in decibels:

        Threshold of hearing        0
        Rustle of leaves           10
        Very quiet room            20
        Average room               40
        Conversation               60
        Busy street                70
        Loud radio                 80
        Train through station      90
        Riveter                   100
        Threshold of discomfort   120
        Threshold of pain         140
        Damage to eardrum         160
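The SNR definition above is a one-liner in code. This is a minimal sketch (the 440 Hz tone and noise level are made-up illustration values, not from the slides):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    p_sig = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_sig / p_noise)

rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(48_000) / 48_000)  # 1 s of a 440 Hz tone
noise = 0.01 * rng.standard_normal(tone.size)                # weak background noise
print(round(snr_db(tone, noise), 1))   # roughly 37 dB for these levels
```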


    Introduction

Audio coding achieves its compression without making assumptions about the nature of the audio source. The coder exploits the perceptual limitations of the human auditory system.

    Much of the compression results from the removal of perceptually irrelevant parts of the audio signal. Removing such parts results in inaudible distortion, so the coder can compress any signal meant to be heard by the human ear.


    Introduction

Audio formats

    Popular audio file formats: .au (Unix workstations), .aiff (Mac), .wav (PC, DEC workstations)

    Audio quality vs. data rate:

        Quality     Sample rate  Bits per   Mono/        Data rate (uncompressed)  Frequency band
                    [kHz]        sample     Stereo       [KBytes/sec]              [Hz]
        Telephone     8           8         Mono             8                     200–3,400
        AM radio     11.025       8         Mono            11.0                   100–5,500
        FM radio     22.05       16         Stereo          88.2                    20–11,000
        CD           44.1        16         Stereo         176.4                    20–20,000
        DAT          48          16         Stereo         192.0                    20–20,000
        DVD audio   192 (max)    24 (max)   up to 6 ch   1,200.0 (max)               0–96,000 (max)
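The uncompressed data rates in the table follow directly from sample rate × bits per sample × number of channels; a quick sketch to verify a few rows:

```python
def data_rate_kbytes_per_sec(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in kilobytes per second."""
    return sample_rate_hz * bits_per_sample * channels / 8 / 1000

print(data_rate_kbytes_per_sec(44_100, 16, 2))   # CD: 176.4
print(data_rate_kbytes_per_sec(8_000, 8, 1))     # Telephone: 8.0
print(data_rate_kbytes_per_sec(48_000, 16, 2))   # DAT: 192.0
```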


    MIDI

Control panel: controls functions that are not directly concerned with notes and durations, e.g. setting the volume

    Auxiliary controllers: control the notes played on the keyboard. Two common variables are pitch bend and modulation

    Memory: stores patches for the sound generators and settings on the control panel

    MIDI messages

    Transmit information between MIDI devices and determine the types of musical events that can be passed from device to device

    A MIDI message consists of a status byte (the first byte of any message, which describes the kind of message) and data bytes (the following bytes)
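The status/data distinction is encoded in the top bit of each byte: a status byte has its MSB set, data bytes do not, and for channel messages the status byte's low nibble carries the channel. A minimal parsing sketch:

```python
def parse_midi_message(msg):
    """Split a raw MIDI channel message into its parts.
    The status byte has its top bit set; data bytes have it clear."""
    status, data = msg[0], list(msg[1:])
    if not (status & 0x80):
        raise ValueError("first byte must be a status byte (MSB set)")
    if any(b & 0x80 for b in data):
        raise ValueError("data bytes must have their MSB clear")
    kind = status & 0xF0       # e.g. 0x90 = Note On, 0x80 = Note Off
    channel = status & 0x0F    # channel messages carry a channel number (0-15)
    return kind, channel, data

kind, channel, data = parse_midi_message(bytes([0x90, 60, 100]))  # Note On, middle C, velocity 100
print(hex(kind), channel, data)   # 0x90 0 [60, 100]
```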


    MIDI

Classification of MIDI messages

    Channel messages: messages that are transmitted on individual channels rather than globally to all devices in the MIDI network

    Channel voice messages: instruct the receiving instrument to assign particular sounds to its voices; turn notes on and off; alter the sound of the currently active note or notes. E.g. note on, note off, control change, etc.

    Channel mode messages: determine the way a receiving MIDI device responds to channel voice messages. They set the MIDI channel receiving modes for different MIDI devices, stop spurious notes from playing, and affect local control of a device. E.g. local control, all notes off, omni mode off, etc.


    MIDI

System messages: carry information that is not channel specific, such as timing signals for synchronization, positioning information in pre-recorded MIDI sequences, and detailed setup information for the destination device.

    System real-time messages: messages related to synchronization, e.g. system reset, timing clock (MIDI clock), etc.

    System common messages: commands that prepare sequencers and synthesizers to play a song, e.g. song select, tune request, etc.

    System exclusive messages: messages related to things that cannot be standardized, plus additions to the original MIDI specification. A system exclusive message is a stream of bytes that starts with a system-exclusive status byte, in which the manufacturer is specified, and ends with an end-of-exclusive message.


    MIDI

General MIDI

    Requirements for General MIDI compatibility:
    - support all 16 channels
    - each channel can play a different instrument/program (multitimbral)
    - each channel can play many voices (polyphony)
    - a minimum of 24 fully dynamically allocated voices

    MIDI + Instrument Patch Map + Percussion Key Map: a piece of MIDI music sounds the same anywhere it is played
    - the instrument patch map is a standard program list consisting of 128 patch types
    - the percussion map specifies 47 percussion sounds
    - key-based percussion is always transmitted on MIDI channel 10


Psychoacoustic model

    Threshold in quiet

    Put a person in a quiet room. Raise the level of a 1 kHz tone until it is just barely audible, then vary the frequency and plot.

    The threshold levels are frequency dependent. The human ear is most sensitive at 2–4 kHz.


Psychoacoustic model

    Frequency masking

    Play a 1 kHz tone (the masking tone) at a fixed level (60 dB). Play a test tone at a nearby frequency (e.g. 1.1 kHz) and raise its level until it is just distinguishable. Vary the frequency of the test tone and plot the threshold at which it becomes audible.


Psychoacoustic model

    The threshold for the test tone is much larger than the threshold in quiet near the masking frequency.

    Repeating the experiment for various masking-tone frequencies yields:

    Critical bands: the widths of the masking bands for different masking tones are different, increasing with the frequency of the masking tone: about 100 Hz for masking frequencies below 500 Hz, growing larger and larger above 500 Hz.
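One widely used closed-form approximation to the critical bandwidth is Zwicker and Terhardt's formula (not given on the slides, added here as an illustration); it reproduces the ~100 Hz width at low masking frequencies and the growth above 500 Hz:

```python
def critical_bandwidth_hz(f_hz):
    """Zwicker & Terhardt approximation to the critical bandwidth
    around a masking tone at frequency f_hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

# bandwidth is ~100 Hz at low frequencies and grows with frequency
for f in (100, 500, 1000, 4000):
    print(f, round(critical_bandwidth_hz(f)))
```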


Psychoacoustic model

    Temporal masking

    If we hear a loud sound that then stops, it takes a little while until we can hear a soft tone nearby.

    Play a 1 kHz masking tone at 60 dB plus a test tone at 1.1 kHz at 40 dB. The test tone can't be heard (it's masked). Stop the masking tone, then stop the test tone after a short delay. Adjust the delay to the shortest time at which the test tone can be heard (e.g. 5 ms). Repeat with different levels of the test tone and plot.


Psychoacoustic model

    Temporal masking

    Try other frequencies for the test tone (keeping the masking-tone duration constant). The total effect of temporal masking:


Psychoacoustic model

    Perceptual audio coding

    Quantization:

    The maximum quantization error for a uniform quantizer with step size Q is Q/2.

    The quantization noise introduced by removing 1 bit per sample (i.e. increasing the step size by a factor of 2) is about 6 dB.

    Subband coding:

    Decompose a signal into separate frequency bands using a filter bank.

    Quantize samples in different bands with accuracy proportional to perceptual sensitivity.
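The 6 dB-per-bit rule can be checked numerically; a small sketch (the uniform test signal is an illustrative choice, not from the slides) quantizes the same signal at several bit depths and shows the SNR climbing by about 6 dB per extra bit:

```python
import numpy as np

def quantization_snr_db(x, n_bits):
    """SNR after uniform quantization of x, assumed to lie in [-1, 1)."""
    step = 2.0 / 2 ** n_bits
    err = x - np.round(x / step) * step      # quantization error
    return 10 * np.log10(np.mean(x ** 2) / np.mean(err ** 2))

rng = np.random.default_rng(1)
x = rng.uniform(-0.99, 0.99, 200_000)        # full-scale test signal
snrs = [quantization_snr_db(x, b) for b in (8, 9, 10)]
print([round(s, 1) for s in snrs])           # successive values are about 6 dB apart
```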


Psychoacoustic model

    Perceptual audio coding

    The quantization step size for each frequency band is set so that the quantization noise is just below the masking level, which is determined by taking all three masking effects into account.


    MPEG

MPEG (Moving Picture Experts Group): an ISO standard for high-fidelity compression of digital audio.

    The MPEG audio coder achieves its compression without making assumptions about the nature of the audio source. It exploits the perceptual limitations of the human auditory system.

    MPEG-1 standard: defines coding standards for both audio and video, and how to packetize the coded audio and video bits to provide time synchronization.

    Total rate: 1.5 Mbps for audio and video

    Video (352×240 pels/frame, 30 frames/s): 30 Mbps raw, coded at about 1.2 Mbps

    Audio (2 channels, 48 Ksamples/s, 16 bits/sample): 2 × 768 kbps raw


    MPEG

MPEG-2: for better-quality audio and video (720×480 pels/frame)

    Supports one or two audio channels in one of four modes:

    Monophonic mode: a single audio channel

    Dual-monophonic mode: two independent audio channels (similar to stereo)

    Stereo mode: stereo channels with bits shared between the channels, but no joint-stereo coding

    Joint-stereo mode: takes advantage of correlations between the stereo channels, of the irrelevancy of the phase difference between channels, or both


    MPEG

    MPEG-1 Audio coding block diagram:


    MPEG

    MPEG layers

MPEG defines 3 layers for audio. The basic model is the same, but codec complexity increases with each layer.

    The input sequence is separated into 32 frequency bands. Each subband filter produces 1 output sample for every 32 input samples.

    Layer 1 processes 12 samples at a time in each subband; Layers 2 and 3 process 36 samples at a time.
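The resulting frame sizes follow from that arithmetic (32 subbands × 12 or 36 subband samples); a tiny sketch:

```python
SUBBANDS = 32

def frame_samples(layer):
    """PCM samples per MPEG-1 audio frame: each of the 32 subband filters
    emits 1 output per 32 inputs; Layer 1 groups 12 subband samples per
    frame, Layers 2 and 3 group 36."""
    per_band = 12 if layer == 1 else 36
    return SUBBANDS * per_band

print(frame_samples(1))   # 384
print(frame_samples(3))   # 1152
```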


    MPEG

    Subband filtering and framing:


    MPEG

    Basic steps in algorithm:

Use convolution filters to divide the audio signal into frequency subbands that approximate the 32 critical bands (sub-band filtering).

    Determine the amount of masking for each band based on its frequency (threshold in quiet) and the energy of its neighboring bands (frequency masking); this is called the psychoacoustic model.

    If the energy in a band is below the masking threshold, don't encode it.

    Otherwise, determine the number of bits needed to represent the coefficients in that band such that the noise introduced by quantization stays below the masking effect (recall that removing 1 bit of quantization introduces about 6 dB of noise).


    MPEG

    Basic steps in algorithm:

Format the bitstream: insert proper headers, code the side information (e.g. quantization scale factors for the different bands), and finally code the quantized coefficient indices, generally using variable-length encoding such as Huffman coding.


    MPEG

    Example:

Assume that the levels of 16 of the 32 bands are:

        Band        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
        Level (dB)  0   8  12  10   6   2  10  60  35  20  15   2   3   5   3   1

    Assume that the 60 dB level of the 8th band gives a masking of 12 dB in the 7th band and 15 dB in the 9th.

    The level in the 7th band is 10 dB (< 12 dB), so it is masked and need not be sent. The level in the 9th band is 35 dB (> 15 dB), so send it; we can encode it with up to 2 bits (= 12 dB) of quantization error. If the original sample is represented with 8 bits, we can reduce it to 6 bits.
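The masking decision in the example above can be sketched as a small allocation routine (the function name and the 6 dB-per-bit rounding are illustrative choices, not from the slides):

```python
def allocate(levels_db, masking_db, bits_full=8, db_per_bit=6):
    """For each band: skip it if its level is below the masking threshold,
    otherwise drop only as many bits as keep the quantization noise
    (~6 dB per dropped bit) below the masking level."""
    out = []
    for level, mask in zip(levels_db, masking_db):
        if level <= mask:
            out.append(0)                          # inaudible: don't encode
        else:
            removable = int(mask // db_per_bit)    # each dropped bit adds ~6 dB of noise
            out.append(max(bits_full - removable, 1))
    return out

# Bands 7, 8, 9 of the slide example: band 7 is masked at 12 dB and
# band 9 at 15 dB by the 60 dB tone in band 8
print(allocate([10, 60, 35], [12, 0, 15]))   # [0, 8, 6]
```

Band 7 is skipped, band 8 keeps all 8 bits, and band 9 drops 2 bits (12 dB of noise, still under its 15 dB masking level), matching the slide's reduction from 8 to 6 bits.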


    MPEG

MPEG-1 audio layers: performance comparison

    MPEG defines 3 audio layers. The basic model is the same (as described thus far), but coding efficiency increases with each layer, at the expense of codec complexity.

    Quality scale: 5 = perfect, 4 = just noticeable, ..., 1 = very annoying

    Raw data rate per audio channel: 48 Ksamples/s × 16 bits/sample = 768 kbps

        Layer    Target bit rate  Ratio  Quality @ 64 kbps  Quality @ 128 kbps
        Layer 1  192 kbps          4:1   --                 --
        Layer 2  128 kbps          6:1   2.1 to 2.6         4+
        Layer 3   64 kbps         12:1   3.6 to 3.8         4+
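The compression ratios in the table are just the raw per-channel rate divided by each layer's target bit rate; a quick check:

```python
RAW_KBPS = 48 * 16   # 48 Ksamples/s × 16 bits/sample = 768 kbps per channel

for layer, target_kbps in ((1, 192), (2, 128), (3, 64)):
    print(f"Layer {layer}: {RAW_KBPS // target_kbps}:1")
# Layer 1: 4:1
# Layer 2: 6:1
# Layer 3: 12:1
```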


    MPEG

At the time MPEG-1 audio was developed (finalized in 1992), Layer 3 was considered too complex to be practically useful. Today, however, Layer 3 is the most widely deployed audio coding method (known as MP3), because it provides good quality at an acceptable bit rate, and also because the code for Layer 3 was distributed freely.


    MPEG

Technical differences between the audio layers:

    The input sequence is separated into 32 frequency bands. Each subband is divided into frames; a frame contains 384 samples, 12 from each subband.

    Layer 1: DCT-type filter with one frame and equal frequency spread per band. The psychoacoustic model uses only frequency masking.

    Layer 2: uses three frames in the filter (before, current, and next, a total of 1152 samples). This models a little of the temporal masking.

    Layer 3 (MP3): a better critical-band filter is used (non-equal frequencies), the psychoacoustic model includes temporal masking effects, stereo redundancy is taken into account, and a Huffman coder is used.


    MPEG

    MPEG-4

A newer standard, which became international in early 1999, that takes into account that a growing part of information is read, seen, and heard in interactive ways.

    It supports new forms of communication, in particular Internet, multimedia, and mobile communications.

    MPEG-4 represents an audiovisual scene as a composition of (potentially meaningful) objects and supports the evolving ways in which audiovisual material is produced, delivered, and consumed.

    E.g. computer-generated content becomes part of the production of an audiovisual scene. In addition, interaction with objects in the scene is possible.

    The future: MPEG-7 & MPEG-21


    References

Z.N. Li and M.S. Drew, Fundamentals of Multimedia, Pearson Prentice Hall, 2004

    S. Furui, Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, Inc., 1989

    R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications & Applications, Prentice Hall PTR, 1995

    B. Gold and N. Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, Inc., 2000

    D. Pan, A Tutorial on MPEG/Audio Compression, IEEE Multimedia, pp. 60-74, summer issue, 1995

    P. Noll, Digital Audio for Multimedia, Proc. Signal Processing for Multimedia, NATO Advanced Audio Institute, 1999


    References

T. Painter and A. Spanias, Perceptual Coding of Digital Audio, Proc. of the IEEE, vol. 88, no. 4, April 2000

    Audio Compression, http://www.cs.sfu.ca/undergrad/CourseMaterials/CMPT479/material/notes/Chap4/Chap4.3/Chap4.3.html

    Multimedia Data Representation, http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap3/Chap3.1/Chap3.1.html

    ISO, Overview of the MPEG-4 Standard, http://www.chiariglione.org/mpeg/standards/mpeg-4/mpeg-4.html