Audio Intro

Posted on 04-Apr-2018


7/31/2019 Audio Intro

    Audio

    Theory and Characteristics

EE1432 Pengolahan Sinyal Multimedia (Multimedia Signal Processing)

    Endang Widjiati [email protected]

    Bidang Studi Telekomunikasi Multimedia (Multimedia Telecommunications Study Program)

    Jurusan Teknik Elektro (Department of Electrical Engineering)

    Fakultas Teknologi Industri (Faculty of Industrial Technology)

    Institut Teknologi Sepuluh Nopember


    Introduction

Sound within the human hearing range is called audio, and waves in this frequency range are called acoustic signals. Speech is an acoustic signal produced by humans.

Typical audio signal classes: telephone speech, wideband speech, and wideband audio. The differences lie in bandwidth, dynamic range, and listener expectations of quality.

Some important concepts:
    - sampling the analog signal in the time dimension
    - quantizing the analog signal in the amplitude dimension
    - the Nyquist theorem
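These three concepts can be put together in a short sketch (a hypothetical illustration, not from the slides): sample a sine tone, uniformly quantize the samples, and refuse sampling rates that violate the Nyquist criterion.

```python
import numpy as np

def sample_and_quantize(freq_hz, fs_hz, n_bits, duration_s=0.01):
    """Sample a sine tone (time dimension), then uniformly quantize
    the samples (amplitude dimension)."""
    if fs_hz <= 2 * freq_hz:
        # Nyquist theorem: the sampling rate must exceed twice the
        # highest signal frequency, or the tone aliases
        raise ValueError("sampling rate violates the Nyquist criterion")
    t = np.arange(0, duration_s, 1.0 / fs_hz)   # sampling instants
    x = 0.9 * np.sin(2 * np.pi * freq_hz * t)   # sampled "analog" signal
    step = 2.0 / 2 ** n_bits                    # quantizer step over [-1, 1)
    xq = np.round(x / step) * step              # uniform quantization
    return x, xq

x, xq = sample_and_quantize(freq_hz=1000, fs_hz=8000, n_bits=8)
print(np.abs(x - xq).max() <= (2.0 / 2 ** 8) / 2)   # error bounded by step/2: True
```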


    Introduction

Multimedia systems typically make use of sound only within the frequency range of human hearing, sampled at rates usually between 8 kHz and 48 kHz. The amplitude of a sound wave is the property heard as loudness.

    The frequency range is divided into:

        Infrasound                        0 Hz – 20 Hz
        Human hearing frequency range    20 Hz – 20 kHz
        Ultrasound                       20 kHz – 1 GHz
        Hypersound                        1 GHz – 10 THz


    Introduction

SNR: the ratio of the power of the correct signal to the power of the noise; it measures the quality of the signal and is usually expressed in decibels (dB).

    The levels of sound we hear are described in dB, as a ratio to the quietest sound we are able to hear.

    Other concepts: SQNR and segmental SNR

    Magnitudes of common sounds, in decibels:

        Threshold of hearing        0
        Rustle of leaves           10
        Very quiet room            20
        Average room               40
        Conversation               60
        Busy street                70
        Loud radio                 80
        Train through station      90
        Riveter                   100
        Threshold of discomfort   120
        Threshold of pain         140
        Damage to eardrum         160
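The SNR definition above is a one-liner in code. This is a minimal sketch (the 440 Hz tone and noise level are made-up illustration values, not from the slides):

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    p_sig = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_sig / p_noise)

rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(48_000) / 48_000)  # 1 s of a 440 Hz tone
noise = 0.01 * rng.standard_normal(tone.size)                # weak background noise
print(round(snr_db(tone, noise), 1))   # roughly 37 dB for these levels
```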


    Introduction

Audio coding achieves its compression without making assumptions about the nature of the audio source. The coder exploits the perceptual limitations of the human auditory system.

    Much of the compression results from the removal of perceptually irrelevant parts of the audio signal. Removing such parts results in inaudible distortion, so the coder can compress any signal meant to be heard by the human ear.


    Introduction

Audio formats

    Popular audio file formats: .au (Unix workstations), .aiff (Mac), .wav (PC, DEC workstations)

    Audio quality vs. data rate:

        Quality     Sample rate  Bits per   Mono/        Data rate (uncompressed)  Frequency band
                    [kHz]        sample     Stereo       [KBytes/sec]              [Hz]
        Telephone     8           8         Mono             8                     200–3,400
        AM radio     11.025       8         Mono            11.0                   100–5,500
        FM radio     22.05       16         Stereo          88.2                    20–11,000
        CD           44.1        16         Stereo         176.4                    20–20,000
        DAT          48          16         Stereo         192.0                    20–20,000
        DVD audio   192 (max)    24 (max)   up to 6 ch   1,200.0 (max)               0–96,000 (max)
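The uncompressed data rates in the table follow directly from sample rate × bits per sample × number of channels; a quick sketch to verify a few rows:

```python
def data_rate_kbytes_per_sec(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in kilobytes per second."""
    return sample_rate_hz * bits_per_sample * channels / 8 / 1000

print(data_rate_kbytes_per_sec(44_100, 16, 2))   # CD: 176.4
print(data_rate_kbytes_per_sec(8_000, 8, 1))     # Telephone: 8.0
print(data_rate_kbytes_per_sec(48_000, 16, 2))   # DAT: 192.0
```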


    MIDI

Control panel: controls functions that are not directly concerned with notes and durations, e.g. setting the volume

    Auxiliary controllers: control the notes played on the keyboard. Two common variables are pitch bend and modulation

    Memory: stores patches for the sound generators and settings on the control panel

    MIDI messages

    Transmit information between MIDI devices and determine the types of musical events that can be passed from device to device

    A MIDI message consists of a status byte (the first byte of any message, which describes the kind of message) and data bytes (the following bytes)
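The status/data distinction is encoded in the top bit of each byte: a status byte has its MSB set, data bytes do not, and for channel messages the status byte's low nibble carries the channel. A minimal parsing sketch:

```python
def parse_midi_message(msg):
    """Split a raw MIDI channel message into its parts.
    The status byte has its top bit set; data bytes have it clear."""
    status, data = msg[0], list(msg[1:])
    if not (status & 0x80):
        raise ValueError("first byte must be a status byte (MSB set)")
    if any(b & 0x80 for b in data):
        raise ValueError("data bytes must have their MSB clear")
    kind = status & 0xF0       # e.g. 0x90 = Note On, 0x80 = Note Off
    channel = status & 0x0F    # channel messages carry a channel number (0-15)
    return kind, channel, data

kind, channel, data = parse_midi_message(bytes([0x90, 60, 100]))  # Note On, middle C, velocity 100
print(hex(kind), channel, data)   # 0x90 0 [60, 100]
```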


    MIDI

Classification of MIDI messages

    Channel messages: messages that are transmitted on individual channels rather than globally to all devices in the MIDI network

    Channel voice messages: instruct the receiving instrument to assign particular sounds to its voices; turn notes on and off; alter the sound of the currently active note or notes. E.g. note on, note off, control change, etc.

    Channel mode messages: determine the way a receiving MIDI device responds to channel voice messages. They set the MIDI channel receiving modes for different MIDI devices, stop spurious notes from playing, and affect local control of a device. E.g. local control, all notes off, omni mode off, etc.


    MIDI

System messages: carry information that is not channel specific, such as timing signals for synchronization, positioning information in pre-recorded MIDI sequences, and detailed setup information for the destination device.

    System real-time messages: messages related to synchronization, e.g. system reset, timing clock (MIDI clock), etc.

    System common messages: commands that prepare sequencers and synthesizers to play a song, e.g. song select, tune request, etc.

    System exclusive messages: messages related to things that cannot be standardized, plus additions to the original MIDI specification. A system exclusive message is a stream of bytes that starts with a system-exclusive status byte, in which the manufacturer is specified, and ends with an end-of-exclusive message.


    MIDI

General MIDI

    Requirements for General MIDI compatibility:
    - support all 16 channels
    - each channel can play a different instrument/program (multitimbral)
    - each channel can play many voices (polyphony)
    - a minimum of 24 fully dynamically allocated voices

    MIDI + Instrument Patch Map + Percussion Key Map: a piece of MIDI music sounds the same anywhere it is played
    - the instrument patch map is a standard program list consisting of 128 patch types
    - the percussion map specifies 47 percussion sounds
    - key-based percussion is always transmitted on MIDI channel 10


Psychoacoustic model

    Threshold in quiet

    Put a person in a quiet room. Raise the level of a 1 kHz tone until it is just barely audible, then vary the frequency and plot.

    The threshold levels are frequency dependent. The human ear is most sensitive at 2–4 kHz.


Psychoacoustic model

    Frequency masking

    Play a 1 kHz tone (the masking tone) at a fixed level (60 dB). Play a test tone at a nearby frequency (e.g. 1.1 kHz) and raise its level until it is just distinguishable. Vary the frequency of the test tone and plot the threshold at which it becomes audible.


Psychoacoustic model

    The threshold for the test tone is much larger than the threshold in quiet near the masking frequency.

    Repeating the experiment for various masking-tone frequencies yields:

    Critical bands: the widths of the masking bands for different masking tones are different, increasing with the frequency of the masking tone: about 100 Hz for masking frequencies below 500 Hz, growing larger and larger above 500 Hz.
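One widely used closed-form approximation to the critical bandwidth is Zwicker and Terhardt's formula (not given on the slides, added here as an illustration); it reproduces the ~100 Hz width at low masking frequencies and the growth above 500 Hz:

```python
def critical_bandwidth_hz(f_hz):
    """Zwicker & Terhardt approximation to the critical bandwidth
    around a masking tone at frequency f_hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

# bandwidth is ~100 Hz at low frequencies and grows with frequency
for f in (100, 500, 1000, 4000):
    print(f, round(critical_bandwidth_hz(f)))
```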


Psychoacoustic model

    Temporal masking

    If we hear a loud sound that then stops, it takes a little while until we can hear a soft tone nearby.

    Play a 1 kHz masking tone at 60 dB plus a test tone at 1.1 kHz at 40 dB. The test tone can't be heard (it's masked). Stop the masking tone, then stop the test tone after a short delay. Adjust the delay to the shortest time at which the test tone can be heard (e.g. 5 ms). Repeat with different levels of the test tone and plot.


Psychoacoustic model

    Temporal masking

    Try other frequencies for the test tone (keeping the masking-tone duration constant). The total effect of temporal masking:


Psychoacoustic model

    Perceptual audio coding

    Quantization:

    The maximum quantization error for a uniform quantizer with step size Q is Q/2.

    The quantization noise introduced by removing 1 bit per sample (i.e. increasing the step size by a factor of 2) is about 6 dB.

    Subband coding:

    Decompose a signal into separate frequency bands using a filter bank.

    Quantize samples in different bands with accuracy proportional to perceptual sensitivity.
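The 6 dB-per-bit rule can be checked numerically; a small sketch (the uniform test signal is an illustrative choice, not from the slides) quantizes the same signal at several bit depths and shows the SNR climbing by about 6 dB per extra bit:

```python
import numpy as np

def quantization_snr_db(x, n_bits):
    """SNR after uniform quantization of x, assumed to lie in [-1, 1)."""
    step = 2.0 / 2 ** n_bits
    err = x - np.round(x / step) * step      # quantization error
    return 10 * np.log10(np.mean(x ** 2) / np.mean(err ** 2))

rng = np.random.default_rng(1)
x = rng.uniform(-0.99, 0.99, 200_000)        # full-scale test signal
snrs = [quantization_snr_db(x, b) for b in (8, 9, 10)]
print([round(s, 1) for s in snrs])           # successive values are about 6 dB apart
```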


Psychoacoustic model

    Perceptual audio coding

    The quantization step size for each frequency band is set so that the quantization noise is just below the masking level, which is determined by taking all three masking effects into account.


    MPEG

MPEG (Moving Picture Experts Group): an ISO standard for high-fidelity compression of digital audio.

    The MPEG audio coder achieves its compression without making assumptions about the nature of the audio source. It exploits the perceptual limitations of the human auditory system.

    MPEG-1 standard: defines coding standards for both audio and video, and how to packetize the coded audio and video bits to provide time synchronization.

    Total rate: 1.5 Mbps for audio and video

    Video (352×240 pels/frame, 30 frames/s): 30 Mbps raw, coded at about 1.2 Mbps

    Audio (2 channels, 48 Ksamples/s, 16 bits/sample): 2 × 768 kbps raw


    MPEG

MPEG-2: for better-quality audio and video (720×480 pels/frame)

    Supports one or two audio channels in one of four modes:

    Monophonic mode: a single audio channel

    Dual-monophonic mode: two independent audio channels (similar to stereo)

    Stereo mode: stereo channels with bits shared between the channels, but no joint-stereo coding

    Joint-stereo mode: takes advantage of correlations between the stereo channels, of the irrelevancy of the phase difference between channels, or both


    MPEG

    MPEG-1 Audio coding block diagram:


    MPEG

    MPEG layers

MPEG defines 3 layers for audio. The basic model is the same, but codec complexity increases with each layer.

    The input sequence is separated into 32 frequency bands. Each subband filter produces 1 output sample for every 32 input samples.

    Layer 1 processes 12 samples at a time in each subband; Layers 2 and 3 process 36 samples at a time.
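The resulting frame sizes follow from that arithmetic (32 subbands × 12 or 36 subband samples); a tiny sketch:

```python
SUBBANDS = 32

def frame_samples(layer):
    """PCM samples per MPEG-1 audio frame: each of the 32 subband filters
    emits 1 output per 32 inputs; Layer 1 groups 12 subband samples per
    frame, Layers 2 and 3 group 36."""
    per_band = 12 if layer == 1 else 36
    return SUBBANDS * per_band

print(frame_samples(1))   # 384
print(frame_samples(3))   # 1152
```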


    MPEG

    Subband filtering and framing:


    MPEG

    Basic steps in algorithm:

Use convolution filters to divide the audio signal into frequency subbands that approximate the 32 critical bands (sub-band filtering).

    Determine the amount of masking for each band based on its frequency (threshold in quiet) and the energy of its neighboring bands (frequency masking); this is called the psychoacoustic model.

    If the energy in a band is below the masking threshold, don't encode it.

    Otherwise, determine the number of bits needed to represent the coefficients in that band such that the noise introduced by quantization stays below the masking effect (recall that removing 1 bit of quantization introduces about 6 dB of noise).


    MPEG

    Basic steps in algorithm:

Format the bitstream: insert proper headers, code the side information (e.g. quantization scale factors for the different bands), and finally code the quantized coefficient indices, generally using variable-length encoding such as Huffman coding.


    MPEG

    Example:

Assume that the levels of 16 of the 32 bands are:

        Band        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
        Level (dB)  0   8  12  10   6   2  10  60  35  20  15   2   3   5   3   1

    Assume that the 60 dB level of the 8th band gives a masking of 12 dB in the 7th band and 15 dB in the 9th.

    The level in the 7th band is 10 dB (< 12 dB), so it is masked and need not be sent. The level in the 9th band is 35 dB (> 15 dB), so send it; we can encode it with up to 2 bits (= 12 dB) of quantization error. If the original sample is represented with 8 bits, we can reduce it to 6 bits.
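The masking decision in the example above can be sketched as a small allocation routine (the function name and the 6 dB-per-bit rounding are illustrative choices, not from the slides):

```python
def allocate(levels_db, masking_db, bits_full=8, db_per_bit=6):
    """For each band: skip it if its level is below the masking threshold,
    otherwise drop only as many bits as keep the quantization noise
    (~6 dB per dropped bit) below the masking level."""
    out = []
    for level, mask in zip(levels_db, masking_db):
        if level <= mask:
            out.append(0)                          # inaudible: don't encode
        else:
            removable = int(mask // db_per_bit)    # each dropped bit adds ~6 dB of noise
            out.append(max(bits_full - removable, 1))
    return out

# Bands 7, 8, 9 of the slide example: band 7 is masked at 12 dB and
# band 9 at 15 dB by the 60 dB tone in band 8
print(allocate([10, 60, 35], [12, 0, 15]))   # [0, 8, 6]
```

Band 7 is skipped, band 8 keeps all 8 bits, and band 9 drops 2 bits (12 dB of noise, still under its 15 dB masking level), matching the slide's reduction from 8 to 6 bits.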


    MPEG

MPEG-1 audio layers: performance comparison

    MPEG defines 3 audio layers. The basic model is the same (as described thus far), but coding efficiency increases with each layer, at the expense of codec complexity.

    Quality scale: 5 = perfect, 4 = just noticeable, ..., 1 = very annoying

    Raw data rate per audio channel: 48 Ksamples/s × 16 bits/sample = 768 kbps

        Layer    Target bit rate  Ratio  Quality @ 64 kbps  Quality @ 128 kbps
        Layer 1  192 kbps          4:1   --                 --
        Layer 2  128 kbps          6:1   2.1 to 2.6         4+
        Layer 3   64 kbps         12:1   3.6 to 3.8         4+
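The compression ratios in the table are just the raw per-channel rate divided by each layer's target bit rate; a quick check:

```python
RAW_KBPS = 48 * 16   # 48 Ksamples/s × 16 bits/sample = 768 kbps per channel

for layer, target_kbps in ((1, 192), (2, 128), (3, 64)):
    print(f"Layer {layer}: {RAW_KBPS // target_kbps}:1")
# Layer 1: 4:1
# Layer 2: 6:1
# Layer 3: 12:1
```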


    MPEG

At the time MPEG-1 audio was developed (finalized in 1992), Layer 3 was considered too complex to be practically useful. Today, however, Layer 3 is the most widely deployed audio coding method (known as MP3), because it provides good quality at an acceptable bit rate, and also because the code for Layer 3 was distributed freely.


    MPEG

Technical differences between the audio layers:

    The input sequence is separated into 32 frequency bands. Each subband is divided into frames; a frame contains 384 samples, 12 from each subband.

    Layer 1: DCT-type filter with one frame and equal frequency spread per band. The psychoacoustic model uses only frequency masking.

    Layer 2: uses three frames in the filter (before, current, and next, a total of 1152 samples). This models a little of the temporal masking.

    Layer 3 (MP3): a better critical-band filter is used (non-equal frequencies), the psychoacoustic model includes temporal masking effects, stereo redundancy is taken into account, and a Huffman coder is used.


    MPEG

    MPEG-4

A newer standard, which became international in early 1999, that takes into account that a growing part of information is read, seen, and heard in interactive ways.

    It supports new forms of communication, in particular Internet, multimedia, and mobile communications.

    MPEG-4 represents an audiovisual scene as a composition of (potentially meaningful) objects and supports the evolving ways in which audiovisual material is produced, delivered, and consumed.

    E.g. computer-generated content becomes part of the production of an audiovisual scene. In addition, interaction with objects in the scene is possible.

    The future: MPEG-7 & MPEG-21


    References

Z.N. Li and M.S. Drew, Fundamentals of Multimedia, Pearson Prentice Hall, 2004

    S. Furui, Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, Inc., 1989

    R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications & Applications, Prentice Hall PTR, 1995

    B. Gold and N. Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, Inc., 2000

    D. Pan, A Tutorial on MPEG/Audio Compression, IEEE Multimedia, pp. 60-74, summer issue, 1995

    P. Noll, Digital Audio for Multimedia, Proc. Signal Processing for Multimedia, NATO Advanced Audio Institute, 1999


    References

T. Painter and A. Spanias, Perceptual Coding of Digital Audio, Proc. of the IEEE, vol. 88, no. 4, April 2000

    Audio Compression, http://www.cs.sfu.ca/undergrad/CourseMaterials/CMPT479/material/notes/Chap4/Chap4.3/Chap4.3.html

    Multimedia Data Representation, http://www.cs.sfu.ca/CourseCentral/365/li/material/notes/Chap3/Chap3.1/Chap3.1.html

    ISO, Overview of the MPEG-4 Standard, http://www.chiariglione.org/mpeg/standards/mpeg-4/mpeg-4.html