
COMP18112 1 VoIP 17/2/2013

University of Manchester School of Computer Science

Comp18112: Foundations of Distributed Computing 2013 Voice over Internet Protocol (VoIP) (An application of Distributed Systems)

Barry Cheetham

1. Introduction

A Distributed System (DS) may be described as a hardware and software system with more than one processing or storage element, concurrent processes, or multiple programs running collaboratively. A personal computer (PC) with video card, sound card and other processes running collaboratively in parallel is therefore a distributed system. The processes are interconnected by wires (the ‘interconnect’ consists of wires) and the processes may be synchronised to a common LOCAL clock. In practice, the processes are often not all synchronised to a single local clock, and there may be several clocks.

The ‘Internet’ and ‘email’ are examples of a DS where the functionality is split into parts that run simultaneously on multiple computers communicating over ‘packetised’ network links. Varying delays and unpredictable failures occur and the possibility of synchronising all processes to a common clock is not assumed. Communications and synchronisation are vital aspects of distributed systems. Where computer networks provide ‘the interconnect’, delays and unpredictable failure must be anticipated, and ways must be found of co-ordinating actions that may be occurring tens of metres or thousands of miles apart.

Email transmits and receives streams of text and is therefore a form of ‘stream oriented communications’. Email uses ‘asynchronous transmission mode’ which means that there are essentially no delay constraints. If an email arrives many seconds, many minutes or even many hours late it is normally accepted and still useful. We do not normally reject email messages just because they arrive late. However, email messages must be reliable in that there must be absolutely no errors. The naming and location of users is needed, as described in earlier lectures. Computer networks are ideal for EMAIL.

Voice over IP (VoIP) telephony generates streams of voice samples, normally two streams, so it is also ‘stream oriented’. It employs ‘synchronous transmission mode’ which means that a maximum allowed delay is imposed by the application. Samples that arrive beyond that maximum allowed delay become useless (because of the delay) and must be discarded. VoIP telephony imposes ‘quality of service’ (QoS) demands on the interconnect and is therefore often described as a ‘QoS’ application and also a ‘real time’ application. The same considerations apply to audio/visual telephony and ‘video-conferencing’.

This section on VoIP telephony exemplifies the problems that occur when distributed systems are required to process and distribute information that is ‘real time’ and sensitive to excessive delay. All telephony will soon be VoIP, despite the fact that IP is not really ideal for this application.

2. Voice

Voice, like music, is sound. Sound is variation in air pressure which travels as a wave from speaker to listener at approximately 340 metres per second or about 1220 km/hour. Voice sounds are produced either by vocal cords vibrating as air is forced through them by the lungs of a human speaker or by ‘turbulent’ air flow as air is forced through the glottis and the mouth. When it reaches a human listener, the wave causes an ear-drum to vibrate in sympathy to produce vibration patterns that are processed by the brain. A microphone converts the same pressure variation to voltage variation and this copy or ‘analogue’ of the pressure variation may then be conveyed along wires. A sound waveform is a graph of voltage (representing air pressure) against time as illustrated below:

[Figure: sound waveform, voltage (Volts) against time t]

This is similar to the type of waveform produced by vibrating vocal cords, i.e. ‘voiced speech’ which would be heard as a vowel (A, E, I, O, U, etc.). It is not a sine-wave. It is the sum of many sine-waves of frequency f, 2f, 3f, 4f, 5f, etc., where ‘f’ is known as the ‘fundamental frequency’ in Hertz (Hz or cycles/second). If f = 500 Hz, we would have sine waves of frequency 500 Hz, 1 kHz, 1.5 kHz, 2 kHz, 2.5 kHz, and so on. We may ask how many sine-waves there are in total, or how high in frequency they go. How many could we actually hear?

3. Bandwidth

It is generally assumed that humans can hear up to 20 kHz. Recorded music on CDs has bandwidth 50 Hz to 20 kHz, and in principle, speech has the same bandwidth. ‘Telephone quality (narrow-band)’ speech is ‘filtered’ to the range 300 Hz to 3.4 kHz with some loss of naturalness but not intelligibility. ‘Wideband speech’ filtered to the range 50 Hz to 7.2 kHz sounds better. Bandwidth means ‘frequency range’, not bit-rate!

4. Plain old fashioned telephone system (POTS)

Originally, analogue transmission was used for all telephone speech. Wires carried speech voltage waveforms, and were connected manually, by switch-board operators, to form a ‘circuit’ between two users. The use of analogue transmission meant that the waveforms travelled along the wires with negligible delay, i.e. approximately 1 ms per 100 miles (1 ms = 1/1000 second). The circuit thus set up remained connected until the end of the call. This was circuit switching and it was ‘connection orientated’.

Voice digitisation and ‘exchange to exchange’ digital transmission began in the 1960s, and nowadays, most telephone speech is transmitted digitally. However, the concept of circuit switching with low delay has remained in telephony to the present day. Analogue transmission is still used for the ‘last mile’ (or the ‘first mile’) between local telephone exchanges and homes.


5. Digitisation of speech & music

The voltage waveforms produced by speech or music are digitised by taking regular samples and converting them to binary numbers of an appropriate ‘word-length’. A famous theorem known as the ‘Sampling Theorem’ tells us that if the signal bandwidth is 0 to B Hz, we must sample at more than 2B samples/second (Hz) to obtain a faithful representation of it. Music (50 Hz to 20 kHz) is sampled at 44.1 kHz with 16 bits per sample (stereo) to obtain CD quality recordings. Narrow-band telephone speech (300 Hz to 3.4 kHz) is often sampled at 8 kHz with 8 bits per sample to obtain 64 kb/s log-PCM (ITU-G711). This is the basis of ‘pulse code modulation (PCM)’ with ‘logarithmic compression’. The logarithmic compression means that small samples are digitised more accurately than larger ones. ITU-G711 is a very famous ‘standard’ for speech digitisation and it is universally used in wired telephony and VoIP.

6. Other standards for speech digitisation

The 64 kb/s required by ITU-G711 is too high for mobile telephony and often for VoIP. Other ITU standards exist, for example, G726 (32 kb/s), G728 (16 kb/s), G729 (8 kb/s) and G723.1 (5.3 kb/s). ‘Speech compression and decompression’ is applied by distributed processing. The compression is ‘lossy’ (not like ‘zip’ or ‘rar’). Mobile phones use a 9.6 kb/s speech digitisation standard. VoIP often uses the same ITU standards.

7. PC to PC voice link

Assume 2 PCs are linked by an ideal connection allowing voice samples to be sent in either direction. Assume that on each PC, an A to D converter (on a sound card) samples speech from a microphone to provide a single 16-bit sample when requested by the CPU. The CPU requests samples at intervals of 1/8000 second, and compresses and sends off each sample via the connection. Both CPUs do this simultaneously. Each CPU receives samples, at intervals of 1/8000 second, from the other side. It decompresses them and sends them directly to the D to A converter to produce sound.

This is simple, but probably not viable because of the use of the CPU at either end to control the timing of the sampling and D to A conversion processes. It is impractical because the CPUs have many parallel tasks to perform and cannot be relied upon to be available at any specific point in time. At the precise time instant when a new sample arrives, the CPU might be busy doing something else. This applies to normal operating systems such as Linux, Windows, etc.
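The bit-rates quoted in Section 5 follow from simple arithmetic, and the effect of logarithmic compression can be seen with a small numerical experiment. The sketch below is illustrative only. It assumes samples scaled to the range -1 to 1 and the continuous mu-law curve with mu = 255, whereas ITU-G711 itself uses a piecewise-linear approximation. All function names are made up for this example.

```python
import math

MU = 255  # assumed compression parameter (the value used by mu-law G711)


def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bit-rate (bits per second) of uncompressed PCM."""
    return sample_rate_hz * bits_per_sample * channels


def mu_law_compress(x):
    """Logarithmic compression of a sample x in the range -1..1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantise(x, bits=8):
    """Round x in -1..1 to the nearest of 2**bits - 1 evenly spaced levels."""
    steps = 2 ** (bits - 1) - 1
    return round(x * steps) / steps


def log_pcm(x, bits=8):
    """Compress, quantise uniformly, then expand again."""
    return mu_law_expand(quantise(mu_law_compress(x), bits))


print(pcm_bit_rate(8000, 8))                # telephone speech, bits per second
print(pcm_bit_rate(44100, 16, channels=2))  # CD stereo, bits per second
for x in (0.01, 0.9):
    # sample, error with plain 8-bit quantisation, error with log-PCM
    print(x, abs(quantise(x) - x), abs(log_pcm(x) - x))
```

With these assumptions the small sample (0.01) is reproduced far more accurately by the logarithmic scheme than by plain 8-bit quantisation, at the cost of a larger error on the big sample (0.9).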


8. Buffers

In practice, sound cards control their own sampling rates using independent crystal controlled clocks. The sound processing is therefore distributed between intelligence on the sound card and the CPU. Sound cards must have buffers to store sections of their inputs and outputs. A buffer is an array or block of storage. On a sound card, an ‘output’ buffer would be filled periodically by the CPU and emptied at a regular rate of say 8000 samples per second as required to feed the digital to analogue converter. This is like a ‘leaky bucket’, with a hole in the bottom, being filled periodically by turning on and off a water tap.

Also an ‘input buffer’ would be filled at a regular rate by the analogue to digital converter and emptied periodically by the CPU. This is like a bucket being filled by a constantly dripping water source and being regularly emptied by the CPU.

[Figure: the output buffer as a leaky bucket on the sound card, emptying into the D-A converter and speaker]

• CPU supplies ‘water’ & controls the tap.
• Bucket with a hole in bottom (on sound card)
• Empties into D-A converter & speaker
• Regular: say 8000 drops/s
• Being a little late or early not critical now.
• Tap turned on when CPU is available.
• Turn off when busy
• Bucket must not empty or overflow

[Figure: the input buffer as a bucket filling with ‘water’ from the mic & A-D converter at a regular rate, say 8000 drops/s, and emptied to the CPU. The CPU turns the tap on when it can receive water, and off otherwise.]

In both cases, the CPU being a little late or early to fill or empty a bucket is not so critical now. Actually the output buffer must never be allowed to become completely empty otherwise the D to A converter would have no samples to convert and would therefore fail catastrophically. Also the


input and output buffers must never become so full that they overflow and bits are lost. Imagine a ‘low water mark’ and a ‘high water mark’ drawn on the water bucket.

9. Problems caused by lack of synchronization

Now consider the accuracy of the sampling rates of the sound cards at either end of a voice communication session. The sampling rates are controlled independently by crystals accurate to 0.01%. A sampling rate that is intended to be 8000 Hz could actually be 8001 Hz or 7999 Hz. We would never hear the difference when listening to speech with such a slight inaccuracy in the sampling rate. So why do we have to worry about it? What problems could be caused by this inaccuracy?

If one card samples at 8001 Hz and the other at 7999 Hz, the host with the slower clock receives two extra samples per second. The extra samples gradually fill up the output buffer and will eventually cause it to overflow. After ten minutes, 1200 extra samples will have been received that cannot be output to the D to A converter. Also, as the buffer fills up there will be increasing delay which becomes 300 ms after 20 minutes and is likely to be unacceptable for speech telephony. The host with the faster clock will be receiving fewer samples than it is sending to its D to A converter and will eventually run out of samples (buffer underflow).

This is a fundamental problem with distributed systems when ‘real time processing’ is required. It is solvable by monitoring buffer levels and ‘intelligently’ discarding single samples, or creating extra ones now and then to keep buffer levels between their low and high water marks. Discarding or duplicating a single sample during periods of quiet speech is likely to be imperceptible whereas doing this for a whole block of samples is likely to be very annoying.

A similar problem occurs with ‘atomic clocks’ because the Earth’s rotation is slowing down. About 300 million years ago, there were (according to Tanenbaum DS p. 245) 400 days per year. ‘Universal co-ordinated time’ (UTC not UCT) counts ‘ticks’ from highly accurate atomic clocks, but to accommodate the slowing down, it needs to add approximately 1 ms every 13 hours. In practice a ‘leap second’ is added about every 18 months. ‘Network time protocol’ sends time to your PC, accurate to 1 to 50 ms, but it has to be quite clever to achieve this accuracy.

10. Voice connection via network

Replacing the ideal connection by a network link introduces new problems. Computer networks convey data in ‘packets’. Delays and imperfections in the links must be expected. The person to be contacted may be far away, and we may not even know where. We need a way of setting up and maintaining communications with acceptable voice quality and round trip delay. These are the problems of VoIP. Firstly, let’s revisit the concept of protocol layers.
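The drift figures quoted in Section 9 can be checked with a few lines of arithmetic, and the ‘water mark’ remedy can be tried out in a toy simulation. The sketch below is only an illustration under stated assumptions: the buffer is updated once per second, the starting level and the low and high water marks (160, 80 and 240 samples) are invented values, and the function names are hypothetical.

```python
def extra_samples(sender_hz, receiver_hz, seconds):
    """Samples that pile up at a receiver whose clock is slower than the sender's."""
    return (sender_hz - receiver_hz) * seconds


def delay_ms(samples, nominal_hz=8000):
    """Playout delay represented by a backlog of buffered samples."""
    return 1000 * samples / nominal_hz


def run_buffer(seconds, arrive_hz, play_hz, start=160, low=80, high=240):
    """Track an output buffer second by second, nudging it one sample at a time."""
    level = start
    dropped = duplicated = 0
    for _ in range(seconds):
        level += arrive_hz - play_hz   # net change over one second
        while level > high:            # above the high water mark: discard one sample
            level -= 1
            dropped += 1
        while level < low:             # below the low water mark: repeat one sample
            level += 1
            duplicated += 1
    return level, dropped, duplicated


def utc_correction_ms(hours):
    """Correction accumulated at roughly 1 ms every 13 hours."""
    return hours / 13


print(extra_samples(8001, 7999, 10 * 60))            # backlog after ten minutes
print(delay_ms(extra_samples(8001, 7999, 20 * 60)))  # added delay after 20 minutes, in ms
print(run_buffer(600, 8001, 7999))                   # slower host: samples are discarded
print(run_buffer(600, 7999, 8001))                   # faster host: samples are duplicated
print(utc_correction_ms(1.5 * 365.25 * 24))          # about 18 months, in ms
```

The last line gives roughly 1000 ms, which is consistent with a leap second being needed about every 18 months.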


11. TCP/IP Protocol Layers

The transmission and receiving of data over networks is arranged in layers, the principle being to separate the complexity involved into independent units that are easier to design and understand. Assume that an application on one host wishes to convey data to a corresponding application on another host that may be many miles away. The application uses software provided by a ‘lower layer’ (within the operating system) which sends messages to a corresponding ‘lower layer’ on the other machine according to an agreed method or protocol. The agreement includes the structure or ‘format’ of the packet that will convey the data. The agreed format consists of a ‘header’ and the data itself. Instead of communicating directly with the other host, the ‘lower layer’ uses software in even lower layers each with its own agreed protocol and packet format.

The principle is best illustrated by looking at the most widely used structure of layers which is known as the TCP/IP layered model. This may be thought of as having five layers, as shown below, but the lower two layers are often combined into one and described as the ‘host to network’ layer.

[Figure: the five TCP/IP layers (Application, Transport, Network (IP), Data Link, Physical) on Host 1 and Host 2, with the Data Link and Phy layers together labelled ‘H to N’. The hosts communicate through routers, which contain only the IP layer, DLL and Phy layer.]

12. The ‘7-layer’ OSI reference model

This is an alternative definition of network layers seen in many textbooks.


[Figure: the seven OSI layers: 7) Application Layer, 6) Presentation Layer, 5) Session Layer, 4) Transport Layer, 3) Network Layer, 2) Data Link Layer, 1) Physical Layer]

TCP/IP compared with OSI Reference Model

TCP/IP Reference Model               OSI Reference Model    Layer
Application Layer                    Application Layer      7
                                     Presentation Layer     6
                                     Session Layer          5
Transport Layer                      Transport Layer        4
Internet (IP) Layer                  Network Layer          3
‘Host-to-Network’ or ‘Link’ Layer    Data Link Layer        2
                                     Physical Layer         1

13. Examples of protocols in each layer

Application layer: http, POP3, SMTP, DHCP, DNS, IMAP4, TELNET, FTP, SIP.
Transport layer: TCP, UDP, RTP, RTCP
Network layer: IP (Versions 4 & 6)
Data-link layer: Ethernet, IEEE802.11, etc.
Physical layer: Ethernet for wired LANs, IEEE802.11 for wireless LANs, PPP for modems, RS232, etc.


Looking at the packet formats defined by the TCP/IP model, moving downwards from the application, each layer adds an extra header to what came before, as shown. The DLL layer adds a trailer as well. To study this stack of layers, packet formats and protocols, instead of starting at the top or the bottom, it is convenient to start with the most important and fundamental layer which is the network (or IP) layer.

14. TCP/IP network layer

TCP/IP network layer protocols deal with addressing and routing of ‘IP packets’ (often referred to as datagrams) which have the following agreed format.

The IP header contains the following information:

Vers: IP version number (4 or 6)
IHL: Header length (in 32 bit words)
Type: can choose between fast delivery & reliability (look this up if you are interested)
Length: Overall datagram length
ID & Frag: Allows long packets to be broken up into smaller ones (fragmentation) and the fragments to be identified and recombined. (Don’t worry about this for now)
TTL: ‘Time to live’ (8 bits) see later
Prot: Specifies transport layer process required (TCP, UDP, etc)
Check: A ‘Check-sum’ for the header (16 bits) see later
Source and destination IP addresses (32 bits each)
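The field list above can be made concrete by packing and unpacking a 20-byte header with Python's standard `struct` module. This is a minimal sketch for IP version 4 only. It assumes the usual field widths (4 bits each for Vers and IHL, 8 for Type, 16 each for Length, ID and Frag), leaves the checksum as zero, and uses example addresses and helper names (`build_header`, `parse_header`) that are purely hypothetical.

```python
import socket
import struct

# Vers+IHL, Type, Length, ID, Frag, TTL, Prot, Check, Source, Dest (network byte order)
IP_HEADER = struct.Struct("!BBHHHBBH4s4s")


def build_header(src, dst, payload_len, ttl=64, prot=17, ident=0):
    """Pack a 20-byte IPv4 header; prot=17 is UDP and prot=6 would be TCP."""
    vers_ihl = (4 << 4) | 5  # version 4, header length 5 x 32-bit words
    return IP_HEADER.pack(vers_ihl, 0, 20 + payload_len, ident, 0,
                          ttl, prot, 0,  # checksum left as 0 in this sketch
                          socket.inet_aton(src), socket.inet_aton(dst))


def parse_header(raw):
    """Unpack the first 20 bytes of a datagram into named fields."""
    (vers_ihl, type_, length, ident, frag,
     ttl, prot, check, src, dst) = IP_HEADER.unpack(raw[:20])
    return {
        "Vers": vers_ihl >> 4, "IHL": vers_ihl & 0x0F, "Type": type_,
        "Length": length, "ID": ident, "Frag": frag,
        "TTL": ttl, "Prot": prot, "Check": check,
        "Source": socket.inet_ntoa(src), "Dest": socket.inet_ntoa(dst),
    }


header = build_header("192.0.2.1", "198.51.100.7", payload_len=160)
print(len(header), parse_header(header))
```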

[Figure: IP packet format: IP Header (20 bytes) followed by IP Data (variable length). Header fields in order: Vers, IHL, Type, Length, ID, Frag, TTL, Prot, Check, Source, Dest.]

[Figure: headers added by each layer. Applic Layer: Message (AL). Transport: header T added. Network (IP): header N added. DLL: header D and trailer D added. Phy Layer: P added.]

Note, first of all, that IP addresses originate at this layer (unsurprisingly). The host’s IP address (source) and the destination IP address are placed in each IP packet. The ‘TTL’ and ‘Check’ fields


deserve some brief explanation, but the other fields are either self-explanatory or need not be studied at this stage. The ‘time-to-live’ (TTL) is an 8 bit binary number which is decremented by one each time the datagram is read by a router. If this number ever reaches zero, the datagram is discarded. This ‘time to live’ mechanism is widely used in distributed systems, and here it will eliminate the possibility of a datagram being passed endlessly among a number of routers due to some error in the routing procedure.

Traditional Internet Protocol (IP) provides ‘connectionless’ communication where IP datagrams are conveyed independently by routers towards their destination IP addresses. IP is the fundamental interconnect mechanism of the Internet and many private networks. The service is ‘unreliable’ as there are no guarantees about correct delivery. Datagrams may be delayed, damaged, lost or arrive out of order. Different routes may be taken to the same destination. The IP packet has no sequence numbers, no time-stamp and there are no port numbers in the IP header. These items must be provided by higher layers to allow transmission problems to be detected and corrected, and to identify applications.

15. ‘Check-sum’

A ‘check-sum’ is a number of extra bits included in a bit-stream to allow the receiver to detect bit-errors that may have occurred during transmission. The term checksum sometimes means ‘cyclic redundancy check’ (CRC), and it could also mean “the number of 1’s” in a bit-stream. The simplest example of a ‘check-sum’ is the ‘parity-bit’ that is often placed at the end of a binary number to make the total number of ones even (parity = 0) or odd (parity = 1). If the parity is made ‘even’ at a transmitter and is found to be ‘odd’ at a receiver, there have clearly been an odd number of bit errors; an even number of bit errors would not be detected. A similar conclusion may be drawn if the parity is made ‘odd’ at the transmitter and found to be ‘even’ at the receiver. The parity of any sequence of bits, b1, b2, …, bN say, is just the ‘exclusive or’ of all the bits; i.e. b1 ⊕ b2 ⊕ b3 ⊕ … ⊕ bN.

In IP packet headers, a 16-bit checksum is used and is defined as the ‘one’s complement inverse of the one’s complement sum of all 16-bit words’. What on earth could this mean? Consider three 8-bit numbers: 10110101, 10101010 and 00110011. The normal sum of these 8-bit numbers is the nine-bit number 110010010. This is the 8-bit number 10010010 with a ‘carry out’ bit of 1. Add in the ‘carry out’ to obtain 10010011. Then produce the ‘one’s complement inverse’ by inverting all the bits to obtain 01101100. This is the required ‘8-bit’ checksum. The same idea is used for the IP header check-sum except we have 16 bit words. The significance of ‘one’s complement arithmetic’ can be disregarded here.

The important point is that if we apply the same checksum procedure at the receiver as we have just performed at the transmitter and get a different answer, we know there has been a transmission error. If we get the same answer, we cannot be sure the data is correct as many combinations of 16-bit numbers can produce the same sum. But we have some confidence that it might be correct.

The 16-bit checksum allows errors in the IP header to be detected at routers and at the receiver. If bit-errors occur, the datagram is discarded. Errors are surprisingly rare in wired networks but they do occur. Bit-errors occur frequently in wireless networks. Note that the header is actually changed by each router when it decrements the ‘time to live’ counter. Other changes to the header can also be made by routers. If the header changes, its check-sum must be recalculated. Note that the IP packet format has no checksum for bit-error checking of the payload; only the header.
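The worked 8-bit example and the parity rule above can be reproduced in a few lines. This is a minimal sketch rather than production code; the function names are invented for the illustration, and the word width is a parameter so that the same routine covers the 8-bit example and the 16-bit words of an IP header.

```python
def ones_complement_checksum(words, bits=16):
    """One's complement inverse of the one's complement sum of the words."""
    mask = (1 << bits) - 1
    total = 0
    for word in words:
        total += word
        total = (total & mask) + (total >> bits)  # add any carry-out back in
    return ~total & mask


def parity(bit_list):
    """Exclusive-or of all the bits: 1 if the number of ones is odd."""
    p = 0
    for b in bit_list:
        p ^= b
    return p


sent = [0b10110101, 0b10101010, 0b00110011]
check = ones_complement_checksum(sent, bits=8)
print(format(check, "08b"))  # the 8-bit checksum from the worked example

# A single bit-error changes the checksum computed at the receiver.
damaged = [sent[0] ^ 0b00000100, sent[1], sent[2]]
print(ones_complement_checksum(damaged, bits=8) == check)

# Swapping two words leaves the sum unchanged, so this change goes unnoticed.
swapped = [sent[1], sent[0], sent[2]]
print(ones_complement_checksum(swapped, bits=8) == check)
```

The last comparison illustrates why a matching checksum gives some confidence but no guarantee.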


A cyclic redundancy check (CRC), often also called a checksum, is more powerful. To illustrate the concept suppose we need to transmit the decimal number 139. Divide it by 7 in integer arithmetic and express the remainder in binary. We obtain 19 with remainder 6 or ‘110’. Use ‘110’ as the check-bits. The same division is done at the receiver, and if we get a different remainder, we know that a bit-error has occurred. Note that exactly 3 check bits are always produced when we divide by 7. The ‘generator’ number 7 is agreed in advance and carefully chosen. Again, not all combinations of bit-errors are detectable by this method. Any combination that adds or subtracts a multiple of 7 is not detected. In practice, CRCs use a much higher number than 7 and don’t divide in ‘normal’ decimal arithmetic, but in ‘exclusive or’ arithmetic. But the idea is similar. A parity check is a 1-bit CRC capable of detecting the occurrence of an odd number of bit-errors. Sixteen and 32 bit CRCs are commonly used.

16. Data-link Layer (DLL)

The DLL has its own agreed packet formats and protocols for sending and receiving DLL packets using the services of the ‘physical (Phy) layer’ below. The Phy layer sends voltage pulses representing 1’s and 0’s along wires or across wireless connections and here is the source of bit-errors that make links unreliable. The ‘data link’ layer has the responsibility for detecting, and where possible correcting, bit-errors, and also for ‘medium access control’ (MAC) when connection channels are shared among many users. Ethernet can share one wire between many users, using an elegant ‘carrier sensing multiple access’ (CSMA) mechanism implemented within the data-link layer. Medium access control (MAC) requires collision detection, collision avoidance and randomised back-off times when collisions occur (as explained by Steve in a previous lecture).
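Returning briefly to the CRC idea from Section 15, both the ‘divide by 7’ illustration and division in ‘exclusive or’ arithmetic can be sketched as follows. The function names and the 4-bit generator ‘1011’ are assumptions chosen for the illustration; real 16 and 32 bit CRCs use longer agreed generators.

```python
def remainder_check(value, generator=7):
    """Check bits for the decimal illustration: the remainder after division."""
    return value % generator


def crc_remainder(data_bits, generator_bits):
    """Remainder of modulo-2 (exclusive-or) long division, as a bit string."""
    n = len(generator_bits) - 1               # number of check bits produced
    work = [int(b) for b in data_bits] + [0] * n
    gen = [int(b) for b in generator_bits]
    for i in range(len(data_bits)):
        if work[i]:                           # 'subtract' the generator with XOR
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-n:])


print(format(remainder_check(139), "03b"))        # check bits for 139
print(remainder_check(139 + 7))                   # same remainder: error missed
print(remainder_check(139 ^ 1))                   # different remainder: error seen
print(crc_remainder("11010011101100", "1011"))    # 3 check bits
print(crc_remainder("1011", "11"))                # generator '11' gives the parity bit
```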
The IEEE802.11 wi-fi data-link layer has a MAC mechanism which is very similar to that used by Ethernet, for sharing the capacity of a single wireless channel. The Ethernet (IEEE802.3) DLL packet format is as follows:

Field     MacAddr1   MacAddr2   Len/Type   IP data    CRC32
Length    48 bits    48 bits    16 bits    Variable   32 bits

This format has 48-bit source & destination MAC (Phy) addresses, the length or type of packet and a 32 bit cyclic redundancy check (CRC32) which is a sort of ‘checksum’ allowing bit-errors to be detected. The IEEE802.11 DLL format for wi-fi is similar except that it has 4 MAC addresses. In both cases, the data can include extra bits for ‘forward error correction’ (FEC).

17. Physical layer

The Phy layer sends voltage pulses representing 1’s & 0’s by wire or radio. A ‘preamble’ consisting of seven bytes of ‘10101010’ is first sent to allow the receiver to synchronise to the transmission and then a ‘start of frame’ (SOF) code: 10101011 is sent to allow the start of the payload data to be identified. The data then follows as illustrated below:

Preamble | SOF | DLL data

When communicating over wire, Ethernet uses ‘Manchester Coding’ as illustrated below, instead of straightforward rectangular pulses.

[Figure: Manchester-coded waveform, volts against time t, for the bit sequence 1 0 0 0 0 1]

A ‘1’ is represented by a positive followed by a negative pulse whereas a zero is represented by a negative followed by a positive voltage pulse. This

really was invented here at Manchester University. What do you consider to be the advantages and disadvantages of this type of signalling in comparison to straightforward rectangular pulses?

18. Transport layer

Looking upwards from the network layer now, the ‘transport layer’ has protocols which use the IP layer below to achieve packetised data transfer in a way which is suitable for particular ‘application layer’ protocols. The most important are:

- TCP: transmission control protocol
- UDP: user datagram protocol

Two others, strongly related to UDP but adapted to real time applications, such as VoIP, are:

- RTP: Real-time transport protocol
- RTCP: RTP control protocol

(RTP and RTCP are sometimes considered to be in the application layer)

19. Transport layer protocol: TCP

TCP makes use of IP to provide connection-oriented ‘reliable’ transmission. This protocol is suited to data which cannot tolerate any bit-errors but can tolerate some delay. TCP introduces ‘port numbers’ for distinguishing data streams, ‘sequence numbers’ and an over-all ‘check-sum’ within its 20 byte header. Reliability is achieved by a mechanism for acknowledging correct receipt and re-transmitting packets when necessary. Since this incurs delay and increases congestion, TCP is not ideally suited to VoIP. The format of a TCP packet (or ‘segment’) is as follows:

Field         Bits        Meaning
Port1         16          Source port
Port2         16          Destination port
SEQ           32          Sequence no.
ACK           32          Ack no.
Length        10          Header length
Flags         6           Urg, Ack, Psh, Rst, Syn, Fin
WS            16          Window size (for flow control)
Check         16          Checksum over all
Etc.          variable    Other stuff
AL payload    variable    Application layer data

The fields from Port1 to Etc. make up the TCP header.

TCP is a two-way protocol capable of efficiently conveying data in both directions between two hosts. The acknowledgement mechanism is quite clever in that each TCP packet may be both an acknowledgement packet and a data carrying packet. Host1 can send data to Host2, then Host2 can acknowledge the data and at the same time send some data back to Host1. Setting a one-bit


flag, called the Ack-flag, to 1 rather than 0 makes the TCP packet an acknowledgement packet which may, or may not, carry data. There are other one-bit flags, including the ‘Syn flag’ which is set to 1 to request (Ack=0) or acknowledge and accept (Ack=1) the setting up of a TCP connection. A Fin flag is set to 1 to request the release of a TCP connection. Other TCP fields are as follows:

SEQ: the 32-bit index of the first byte in this packet (TCP numbers every byte)

ACK: 32-bit index for acknowledging receipt of one or more packets. When the one-bit ‘Ack-flag’ is set to make the packet an acknowledgement packet as well as a data packet, the TCP packet is sending the following message: “I acknowledge receipt of bytes up to this ACK index, so please send the next packet”. Note that the ACK index and the Ack-flag are not the same thing.

WS: ‘Window-Size’ specifies how many bytes the host is willing to accept before sending an acknowledgement packet. This is ‘flow control’. If WS were set for just 1 packet, the other host would have to wait for an acknowledgement before sending another packet. This ‘one packet at a time’ acknowledgement scheme would work, but it would be slower than necessary. It is better to make WS larger, so that many packets are acknowledged at once.

19.1 Port numbers

Port numbers are introduced by transport layer protocols such as TCP. They are ‘addresses’ which identify protocols & applications. They allow the receiver to distinguish between different data streams. Examples are:

Port    Protocol    Application
21      FTP         File transfer
23      Telnet      Remote log-in
25      SMTP        Send email
80      HTTP        WWW
110     POP-3       Receive email
143     IMAP        Receive email
443     HTTPS       Secure web

19.2. Three types of addresses

We have seen 3 types of addresses now. Remember that:

IP addresses (32 bits) are global.
MAC (Phy) addresses (48 bits) are local.
PORT numbers (16-bit addresses) identify applications.

Consider the analogy shown below.


You wish to send me a message, and it is written using an App such as a word-processor. The message is put in a green (Transport Layer) envelope with a port address written on it (to identify the type of App that will be required to open it). The green envelope (with its pink contents, the message) is then put into a yellow (Network Layer) envelope with my IP address written on the front. You may have to consult a DNS server to find my IP address. Now put the yellow envelope into a blue (Data Link Layer) envelope and write on it the MAC address of your router.

The blue DLL envelope is opened by your router to find my IP address, and your router then decides which intermediate router it should send the blue envelope to in order to send it on its way to me. It writes the MAC address of the intermediate router on the blue envelope. The intermediate router opens the blue envelope, finds my IP address and re-addresses the blue envelope to another intermediate router & so on from router to router until it reaches the router on my local area network (LAN).

When it reaches the router on my LAN, it is re-addressed and sent to my MAC address. I can then open the blue and yellow envelopes, read the port number on the green envelope, and send the pink message inside to the right App for reading it.
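The envelope analogy maps naturally onto nested data structures. The following sketch is only a toy model of the analogy, not of any real protocol stack: the class names follow the envelope colours, and the addresses, port number and ‘App’ are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Green:          # Transport Layer envelope
    port: int
    message: str


@dataclass
class Yellow:         # Network Layer envelope
    dest_ip: str
    contents: Green


@dataclass
class Blue:           # Data Link Layer envelope
    dest_mac: str
    contents: Yellow


def send(message, port, dest_ip, my_router_mac):
    """Wrap the message in green, yellow and blue envelopes."""
    return Blue(my_router_mac, Yellow(dest_ip, Green(port, message)))


def forward(blue, next_hop_mac):
    """A router opens the blue envelope and re-addresses a new one; the yellow envelope is untouched."""
    return Blue(next_hop_mac, blue.contents)


def receive(blue, apps):
    """Open the blue and yellow envelopes and hand the message to the App chosen by port."""
    green = blue.contents.contents
    return apps[green.port](green.message)


envelope = send("hello", 25, "192.0.2.9", "mac-of-your-router")
for hop in ("mac-of-router-2", "mac-of-my-router", "my-mac"):
    envelope = forward(envelope, hop)

apps = {25: lambda text: "mail app reads: " + text}
print(envelope.dest_mac, envelope.contents.dest_ip)
print(receive(envelope, apps))
```

Note that only the MAC address on the blue envelope changes from hop to hop; the IP address and port number travel unchanged.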

19.3. How does TCP achieve ‘reliable’ & ‘connection orientated’ transmission? A short answer is that reliability is achieved by a mechanism for acknowledging correct receipt of packets (sometimes called segments) and retransmitting them when necessary. Connectivity is achieved by requiring users to exchange ‘state’ information, such as sequence numbers, which is remembered at both ends of a connection until the connection is terminated. Let’s now fill in a bit of detail.

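[Figure: the envelope analogy. Your message is placed in nested envelopes carrying the port no., my IP address and the MAC address of your router. It travels from your LAN through routers (R) to my LAN, is received at my MAC address, and is opened with the port number’s application.]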


19.4. TCP communication mechanism

Each byte in a TCP payload is numbered, so the header has a 32 bit ‘sequence number’ which actually specifies the sequence number of the first byte in the payload. There is also a 32 bit ‘acknowledgement number’ whose purpose we shall now illustrate by considering an exchange of TCP packets between two hosts. Host1 sends a packet with the index of the first payload byte as its sequence number N say, and then sets a timer. If Host2 receives the packet, it sends a return packet (with or without data in its payload) with an ‘ACK’ flag set to say it is an acknowledgement. It puts a binary number, M+1 say, in its 32-bit ‘Acknowledgement number’ to say: “I’ve got bytes up to and including byte M”. It also puts a 32-bit sequence number, K say, for the first data byte of its payload (assuming it is returning some data) into the header. TCP is a 2-way communication protocol, designed for sending data in both directions between two users. Host1 waits for acknowledgement from Host2. If the acknowledgement is not received in time, it resends the packet. (Remember: packets may be lost or duplicated). This mechanism will work fine, but it will be slow if Host1 has to wait quite a long time for each acknowledgment from Host2. We can speed up the communication by allowing Host1 to send several datagrams in succession without waiting for acknowledgements for each one. Host2 could then be allowed to acknowledge several datagrams at once. The procedure above caters for this already with its use of the 32-bit acknowledgement number. Host2 just has to acknowledge receipt of all up to a certain byte number which could cover several packets. Clever! TCP also does ‘congestion control’ and ‘flow control’. For example, a host can say “Please don’t send me any more packets for a while!” Detail omitted for now.
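The way one acknowledgement number can cover several packets may be sketched as follows. The function name `cumulative_ack` and the byte numbers are assumptions for illustration only; real TCP also handles sequence number wrap-around, timers and selective acknowledgements, which are ignored here.

```python
def cumulative_ack(next_expected: int, segments: list[tuple[int, int]]) -> int:
    """Return the acknowledgement number Host2 would send.

    next_expected is the index of the first byte Host2 is still waiting for.
    Each segment is (sequence number of its first byte, number of bytes).
    The result is M+1, where M is the last byte received with no gap before it.
    """
    for seq, length in sorted(segments):
        if seq > next_expected:
            break  # a gap: the bytes before seq have not arrived yet
        next_expected = max(next_expected, seq + length)
    return next_expected


if __name__ == "__main__":
    # Three 100-byte packets arrive, but the one starting at byte 1200 is lost.
    arrived = [(1000, 100), (1100, 100), (1300, 100)]
    print(cumulative_ack(1000, arrived))                  # acknowledges up to byte 1199
    # After Host1 times out and resends the missing packet:
    print(cumulative_ack(1000, arrived + [(1200, 100)]))  # acknowledges up to byte 1399
```

A single acknowledgement therefore confirms every packet up to the first gap, and a duplicated packet simply leaves the acknowledgement number unchanged.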

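[Figure: exchange of TCP packets between Host 1 and Host 2, with time running down the page]

Connect (with data):
  Host 1 to Host 2: Syn=1, Ack=0, SEQ=N
  Host 2 to Host 1: Syn=1, Ack=1, SEQ=K, ACK=N+1
  Host 1 to Host 2: Syn=0, Ack=1, SEQ=N+1, ACK=K+1

Synchronised (exchanging data), not necessarily packet by packet:
  Host 2 to Host 1: Syn=0, Ack=1, SEQ=K+1, ACK=N+2
  Etc.

Release:
  ‘Fin’ and ‘Fin Ack’ packets are exchanged, after which each host releases the connection.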

Host2 may not want to release until he is sure Host1 has released. Has his ‘Fin Ack’ been received? ‘Two army’ problem.


The ‘fin’ mechanism for tearing down is straightforward, but what happens if the ‘Fin’ acknowledgement is lost in the network? The problem of catering for this eventuality is often illustrated by the ‘Two Army’ Problem (see the ‘Computer Networks ed4’ textbook by A Tanenbaum (pub 2003)).

19.5. ‘Two army’ problem

A blue army has two equal divisions camped at either side of a valley occupied by a white army. The blue army has superior numbers overall, but must send messengers ‘unreliably’ across the valley to decide when to attack the white army. (The messengers may be captured, poor chaps). If only half the blue army attacks, because the other half has not received an appropriate message or acknowledgement, it will be defeated. How can each half be sure the other half has received a message to attack? How can each host be sure it’s time to shut down (attack the connection) when messages and acknowledgements cannot be guaranteed? You may like to think about this.

20. Transport layer protocol: UDP

UDP is simpler than TCP, connectionless and ‘unreliable’. It is a ‘fire and forget’ protocol. It simply encapsulates the following UDP datagram within the payload of an IP datagram:

Field             Size
Source port no.   16 bits
Dest port no.     16 bits
Length            16 bits
Check             16 bits
Data              Variable

UDP is widely used for applications which do not need or cannot wait for ‘acknowledgements’ and retransmissions, or where the increased congestion caused by such things would be unacceptable. The ‘unreliability’ of UDP is not such a problem for VoIP because voice is not quite as sensitive as data to bit-errors and ‘lost’ packets. An occasional bit-error might hardly be noticed in a voice stream. Also, the loss of a whole packet can be concealed, to a degree, by filling in the gap with a waveform segment that looks and sounds approximately right, perhaps because it resembles the previous segment. This is packet loss concealment (PLC), and is possible because speech is, to a degree, predictable, especially when people speak slowly. The UDP datagram header also has port numbers like TCP. There is also a length indication for the whole UDP packet, and a 16-bit checksum over the whole packet (rarely used in wired links in practice).

21. Real Time Transport Protocol (RTP)

UDP still has some problems for VoIP which are remedied by the ‘real time transport protocol’ (RTP). Since UDP datagrams may be lost, damaged or re-ordered, the receiver must know when this happens. RTP was designed to allow this information to be determined. Given a block of speech samples to be sent, RTP adds a ‘time-stamp’, a ‘sequence number’, and some other things to this data and places the resulting bit-stream in the payload of a UDP packet. The UDP packet is then encapsulated inside an IP datagram as usual, and the IP datagram is eventually transmitted across the IP network. At the receiver, the ‘sequence numbers’ introduced by RTP allow the application to re-order any datagrams received out of sequence, and to recognise the need for ‘packet-loss concealment’ when a datagram is found to be missing. Duplicate packets can also be recognised and eliminated by the same mechanism.

The time-stamp is for the first sample in the RTP packet and specifies the number of ‘clock ticks’ from the start of the conversation as observed by the transmitter. It is not an actual time read from our PC clock and there is no attempt to synchronise clocks at transmitter and receiver. Absolute time-stamps have no meaning; only differences are important. The number of ticks/second is equal to the sampling frequency which may be, say, 8000 Hz (with G711). Therefore there will be 8000 ticks/second for VoIP telephony with G711 (A-law PCM). The receiver cannot measure ‘one-way’ delay from time-stamps. The RTP ‘time-stamp’ allows different voice and video streams to be synchronised since we will know exactly when each packet was generated according to the transmitter’s clock. Note that the receiver’s clock will probably not correspond exactly to the transmitter’s clock, but this does not matter as it is only the synchronisation of data from the transmitter that is needed.

Remember that in distributed systems, clocks at different locations will usually show slightly different times. If they did try to become synchronised by accessing one single clock at a common reference point, the geographical distance between the reference point and each location would be different. Therefore the delay in receiving a synchronising signal from the common reference point would be different for each location. So each location would set its clock slightly differently and the synchronisation would not work.

The RTP SEQuence number just numbers each RTP packet: 1,2,3,… So the receiver will know if one or more packets are lost, or if packets are received out of order.
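What the receiver can do with RTP sequence numbers might be sketched as below. The function name `sort_out_packets` and the tiny payloads are hypothetical, and the sketch ignores the wrap-around of real sequence numbers for simplicity.

```python
def sort_out_packets(arrivals, first_seq, last_seq):
    """Re-order packets, discard duplicates and list the missing ones.

    arrivals is a list of (sequence number, payload) in order of arrival.
    Returns (ordered packets, missing sequence numbers, duplicate count).
    """
    by_seq = {}
    duplicates = 0
    for seq, payload in arrivals:
        if seq in by_seq:
            duplicates += 1  # we already have this packet: discard the copy
        else:
            by_seq[seq] = payload
    ordered = [(seq, by_seq[seq]) for seq in sorted(by_seq)]
    missing = [seq for seq in range(first_seq, last_seq + 1) if seq not in by_seq]
    return ordered, missing, duplicates


if __name__ == "__main__":
    # Packet 3 overtakes packet 2, packet 3 is duplicated and packet 4 is lost.
    arrivals = [(1, b"a"), (3, b"c"), (2, b"b"), (3, b"c"), (5, b"e")]
    ordered, missing, duplicates = sort_out_packets(arrivals, 1, 5)
    print([seq for seq, _ in ordered], missing, duplicates)
```

The list of missing sequence numbers is exactly what tells the application where packet-loss concealment is needed.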
Even when it is not feasible to acknowledge every single datagram or packet, it is useful for a transmitter to know how many of its datagrams are getting through the network and with what delay variation. RTP’s ‘sister’ protocol, RTCP (real time transport control protocol), conveys this information at the expense of generating a small number of extra datagrams. It is described below, specified in detail in RFC1889, and summarised in the Tanenbaum textbook.

22. Real Time Transport Control Protocol (RTCP)

RTP cannot acknowledge every packet. It is ‘fire & forget’ like IP and UDP. It is useful for the transmitter to know how many of its packets are getting through and with what delay variation. RTCP is ‘sister’ to RTP (why not brother?). It sends feedback reports to the transmitter periodically, specifying:
- Average round trip delay
- Percentage of lost packets
- Average or worst case jitter introduced by the network
- Other measurements
These are ‘quality of service’ (QoS) measurements made over the short period of time (typically about five seconds) between reports.

[Figure: RTP packet carried in the UDP payload, following the 64-bit UDP header]

Field                Size
Info                 8 bits
P-Type               8 bits
SEQ                  16 bits
TimeStamp            32 bits
Source Id            32 bits
Etc.                 Variable
Video/audio stream   Variable
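As a rough illustration of the RTP fields described in Section 21, the sketch below packs a 12-byte RTP header in front of 20 ms of G711 speech. It assumes the standard RTP fixed header layout (version 2, no padding, extension or contributing sources) and payload type 8 for G711 A-law; the function names and the source identifier value are invented for the example.

```python
import struct

RTP_VERSION = 2
PT_G711_ALAW = 8          # static RTP payload type for G711 A-law
SAMPLES_PER_PACKET = 160  # 20 ms of speech at 8000 ticks/second


def build_rtp_packet(seq: int, timestamp: int, source_id: int, payload: bytes) -> bytes:
    """Prefix the payload with a minimal 12-byte RTP header."""
    info = RTP_VERSION << 6        # no padding, no extension, no extra sources
    p_type = PT_G711_ALAW & 0x7F   # marker bit left clear
    header = struct.pack("!BBHII", info, p_type,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, source_id & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet: bytes) -> dict:
    """Recover the header fields that the receiver needs."""
    info, p_type, seq, timestamp, source_id = struct.unpack("!BBHII", packet[:12])
    return {"version": info >> 6, "p_type": p_type & 0x7F,
            "seq": seq, "timestamp": timestamp, "source_id": source_id}


if __name__ == "__main__":
    speech = bytes(SAMPLES_PER_PACKET)  # stand-in for 160 A-law samples
    for n in range(3):
        packet = build_rtp_packet(n + 1, n * SAMPLES_PER_PACKET, 0x1234, speech)
        print(len(packet), parse_rtp_header(packet))
```

The time-stamp advances by 160 ticks per packet because each packet carries 160 samples, whereas the sequence number advances by one.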


Average round trip delay is the time it would take for a packet of information to reach its destination, be instantaneously copied into a returning packet, and be received by the sender. If a VoIP user says ‘hello’, ‘round trip delay’ determines the amount of time the user must wait to hear the other person say ‘hello’, if the other person responds as soon as the first ‘hello’ is heard. Round trip delay is easily measured because the ‘time of sending’ and the ‘time of receiving’ are both recorded with reference to the same clock, i.e. the transmitter’s clock. ‘End to end’ or ‘one way’ delay is difficult to measure because of the need to accurately calibrate clocks at each end of a link. Jitter means ‘delay variation’ or the variation in ‘one-way delay’. If we can’t measure ‘one-way delay’, how can we hope to measure variation in ‘one-way delay’ or jitter? Actually it’s easy, by examining the time-stamps within RTP or RTCP packets.

23. Measuring timing jitter

Assume we receive two packets with consecutive sequence numbers (this is important) and their time-stamps, converted from ‘clock ticks’ to seconds, are respectively 1034 and 1034.006. Assume a timer at the receiver reads 99 seconds when the first packet arrives. If the receive-time for the second packet is 99.006 seconds, the delay variation, or jitter, is clearly zero. Now assume the second packet is received at 99.010 seconds. If the one-way delay for the first packet is D, the delay for the 2nd packet is D + 0.004. The delay variation is therefore 0.004 seconds and we have done this calculation without knowing the one-way delay D. In general, if the time-stamps are $S_1$ and $S_2$ and the receive times are $R_1$ and $R_2$ respectively, the delay variation is $|(R_2 - S_2) - (R_1 - S_1)| = |(R_2 - R_1) - (S_2 - S_1)|$ seconds. This is the absolute value of the ‘receive-time difference’ minus the ‘time-stamp difference’. RTCP will average over the number of packets received between reports or take the worst case time variation over this period.
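The jitter calculation can be checked with a short sketch. The function names `delay_variation` and `jitter_summary` are made up for this illustration, and the plain average and worst case are only one simple way of summarising the values between reports.

```python
def delay_variation(s1: float, r1: float, s2: float, r2: float) -> float:
    """Delay variation in seconds between two consecutive packets:
    |receive-time difference minus time-stamp difference|."""
    return abs((r2 - r1) - (s2 - s1))


def jitter_summary(packets):
    """Average and worst-case delay variation over a reporting period.

    packets is a list of (time-stamp, receive time) pairs in seconds,
    for packets with consecutive sequence numbers.
    """
    variations = [delay_variation(s1, r1, s2, r2)
                  for (s1, r1), (s2, r2) in zip(packets, packets[1:])]
    if not variations:
        return 0.0, 0.0
    return sum(variations) / len(variations), max(variations)


if __name__ == "__main__":
    # The two cases from the text: the second packet arrives on time, then 4 ms late.
    print(round(delay_variation(1034, 99, 1034.006, 99.006), 6))
    print(round(delay_variation(1034, 99, 1034.006, 99.010), 6))
```

Note that the transmitter’s and receiver’s clocks never need to agree, since only differences taken on the same clock are used.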
[ref Stallings]. Measurements of jitter can be useful in predicting the future behaviour of the network link since jitter tends to increase when the network starts to become congested.

24. Reminder of a few key points

UDP introduces port numbers, a check-sum and little else to IP. UDP, like IP, is not connection oriented and not ‘reliable’. RTP introduces a sequence number, a time-stamp and other information into the UDP payload, but is still neither ‘reliable’ nor ‘connection oriented’. IP, UDP and RTP are all ‘fire and forget’ protocols. RTP is useful for real time QoS applications and has a sister protocol, RTCP, for measuring and reporting the current QoS being obtained from a network link. TCP is more complicated and powerful, and is said to be ‘connection orientated’ even over traditional IP which is not connection orientated. With TCP, connection is made at the transport layer and the connectivity and reliability of TCP are at the expense of additional delay and network usage. Any of these protocols may be used over a ‘Virtual Circuit’ (see later) to gain the benefits of improved QoS. Let’s ask a direct question about TCP:

24.1. Connectivity at the network layer

TCP provides reliability and connectivity at the ‘transport layer’ at the expense of latency. VoIP would like ‘circuit switched’ connectivity as provided by traditional telephony, but not at the cost of additional latency. What is needed is connectivity at the network layer. Traditional IP provides a ‘connectionless’ service where IP datagrams are conveyed independently by routers towards their


destination IP addresses. They may take different routes & arrive out of order. However, the demand for connectivity based network layer service led to:
- Asynchronous transfer mode (ATM) networks
- Multi-protocol label switching (MPLS)
- Enhancements to IP
Network layer connectivity is different from transport layer connectivity as provided by TCP. To provide network layer connectivity, routing is modified to establish and maintain fixed paths. The network links are still not ‘reliable’. However, disordering of datagrams is eliminated and delay variation (jitter) is reduced. The choice of routing can try to minimise delay.

25. Virtual circuits (VC)

A connection orientated network link is a ‘virtual circuit’. Such a circuit may be established by sending ‘set-up’ packets and causing routers to remember the route taken. With MPLS, an extra header (a ‘label’) is added to each IP packet to specify the route. VCs must be set up, maintained while needed and then torn down. This is not a ‘fire & forget’ mechanism as used by traditional IP and UDP. It can offer improved ‘quality of service’ while not being totally ‘reliable’. VCs have clear advantages in providing QoS for VoIP. But this is at a cost as summarised by the comparison below.

26. Comparison of traditional IP network with a VC

Issue                      Trad IP                               Virtual Circuit
Set up/tear down           Not needed                            Needed
Addressing                 Each packet contains full addresses   Only VC number needed
Router memory states       None for connections                  Each VC stored
Routing                    Independent/variable                  Fixed
Effect of router failing   Not catastrophic                      Many VCs may fail
QoS                        Not good                              Much better
Congestion control         Difficult                             Much easier

27. Quality of Service Requirements of Common Applications

Remember that the quality of service (QoS) of a network is defined in terms of:
- Degree of unreliability (no. of damaged/lost packets)
- Delay (latency)
- Jitter (variation of delay)
- Bandwidth
Consider the QoS requirements of a number of applications:


Application    Reliability   Delay          Jitter         Bandwidth
E-mail         high          Can be high    Can be high    Not high
Web access     high          Not too high   Not too high   Medium
Streamed MM    lower         Can be high    Can be high    Medium
VoIP Teleph    lower         Must be low    Must be low    Low
VideoConf      lower         Must be low    Must be low    High

28. Setting up and maintaining a VoIP session

B and A register names and IP addresses with proxy servers. To call A, B sends ‘invite’ with A’s name to B’s server. B’s server looks up the address of A’s server, then sends ‘invite’ to it. A’s server sends ‘invite’ to A. Hence B invites A via the proxy servers, which look up A’s IP address. A can accept B’s invitation by returning a message, and B must then acknowledge this acceptance. Now A and B may start to interchange RTP/RTCP packets. They could do this via the proxy server, as in the second laboratory exercise, ‘Instant Messaging’. But this is not a good idea, especially if A and B are in the same building and the server is miles away. Instead, establish a direct RTP/RTCP link between B and A. This is a ‘peer-to-peer’ connection. A and B negotiate the bit-rate for the speech coding by monitoring network congestion. They do this by examining the contents of each RTCP packet. If the ‘round trip’ delay, jitter and the number of lost packets start to increase, this is a strong indication of congestion building up, and a good reason for changing to a speech coder with more compression.
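The decision to change speech coder might be sketched as follows. This is only one possible rule, not part of SIP or RTCP: the `RtcpReport` class, the function names and the ‘all three measurements rising’ test are assumptions for illustration, and the coder bit-rates are those listed in the speech digitisation table later in these notes.

```python
from dataclasses import dataclass

# Speech coders ordered from least to most compression (bit-rates in b/s).
CODERS = [("G711", 64000), ("G726", 32000), ("G728", 16000), ("G729", 8000)]


@dataclass
class RtcpReport:
    round_trip_delay_s: float
    jitter_s: float
    lost_fraction: float


def congestion_building(previous: RtcpReport, current: RtcpReport) -> bool:
    """True if delay, jitter and packet loss have all increased since the last report."""
    return (current.round_trip_delay_s > previous.round_trip_delay_s
            and current.jitter_s > previous.jitter_s
            and current.lost_fraction > previous.lost_fraction)


def next_coder(index: int, previous: RtcpReport, current: RtcpReport) -> int:
    """Move one step towards more compression when congestion seems to be building."""
    if congestion_building(previous, current) and index < len(CODERS) - 1:
        return index + 1
    return index


if __name__ == "__main__":
    before = RtcpReport(0.10, 0.005, 0.01)
    after = RtcpReport(0.15, 0.010, 0.03)
    print(CODERS[next_coder(0, before, after)])
```

A practical system would probably smooth the measurements over several reports, and would also move back to a higher bit-rate coder once congestion eased.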

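[Figure: setting up a call between Barry and Alvaro. Barry’s ‘Invite’ goes to his proxy server, which consults the location server (‘Look’, ‘Reply’) and passes the ‘Invite’ to Alvaro’s proxy server, which passes it on to Alvaro. These are multiple client-server (C/S) connections, and all messages could be sent by TCP.]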


29. Proxy and location servers

The ‘location server’ is just like a ‘domain name system’ (DNS) server as covered in Steve’s first lecture. In general, a ‘proxy server’ (PS) is intermediate software that acts as both server and client to make requests on behalf of other clients. Proxy servers are useful for routing, and as ‘gate-keepers’ for enforcing policy such as admission control (i.e. determining who has permission to make VoIP calls), preventing congestion and having efficient access to the location server. Since UDP datagrams are sometimes blocked by firewalls, a PS may use ‘tunnelling’ by encapsulating them within TCP packets to get them through. It could also provide convenient access to a ‘gateway’ between the IP network and the public switched telephone network (PSTN). Encryption is not widely used in VoIP yet, but it is not difficult to apply to the voice stream with encryption keys agreed in advance via the proxy server. Without encryption it is very easy for an ‘eavesdropper’ to intercept a VoIP packet stream and listen in on a call.

30. ‘Session Initiation Protocol’ (SIP)

SIP is an application layer protocol defined by the IETF (Internet Engineering Task Force) for setting up, maintaining and tearing down VoIP calls. SIP uses Ports 5060 & 5061. Addresses are URLs such as sip:[email protected]. SIP negotiates suitable voice compression, and other things. ‘H323’ is a similar protocol, but older and more complex. There are currently more than 100 different VoIP systems and most use SIP. ‘Win Live Messenger’ is based on SIP. SKYPE is not, but does similar things. SIP messages (Connect, Accept, Ack, etc) may be conveniently sent by TCP and conform to a client/server mechanism. VoIP voice-streams are best communicated by RTP/RTCP (based on UDP) and conform to a peer-to-peer (P2P) mechanism.

31. Quality of Service in VoIP

A buffer at each receiver allows for variation in delay. This is often called a ‘jitter’ buffer.
[Figure: accepting a call between Barry and Alvaro. ‘Accept’ is returned via the two proxy servers, ‘Ack’ is sent back the same way, and the RTP/RTCP packets then travel directly between the two users (P2P).]

Remember that ‘jitter’ is variation in one-way delay. Assume, for example, the one-way delay varies between 0.02 & 0.12 seconds. A jitter-buffer of size 0.1 seconds (i.e. 800 bytes with 64 kb/s G711 speech) may be employed to avoid ‘data under-flow’, i.e. running out of data while waiting for a delayed packet. The receiver need not know the actual one-way delay, which is fortunate since a global clock with exact time at either end would be required, and this would be difficult to achieve with sufficient accuracy. A one-way delay variation from 0.05 to 0.15 seconds would give exactly the same jitter. A packet arriving more than 0.1 s later than expected may then be too late to avoid the buffer becoming empty (underflow) and would then be considered ‘lost’. With wired networks, most ‘lost’ packets are just late. Packets are rarely undelivered or delivered with bit-errors. This is not true for wireless/mobile networks.

32. Jitter-buffer size

The receiver’s jitter-buffer introduces delay, and affects the number of packets ‘lost’ due to excessive delay. Increasing the buffer size decreases the number of ‘lost’ packets in a wired VoIP link at the expense of increasing delay. A round trip delay of greater than about 300 ms makes conversation difficult, and this fact limits the amount of delay that can be introduced at each receiver.

Illustration: Assume that over a link between Manchester and New York, the ‘one way’ propagation delay, caused largely by the time it takes light to travel along an optical fibre, is 30 ms, and that each RTP packet is filled with 20 ms of speech (i.e. 160 bytes with G711 coding). This ‘packetisation delay’ of 20 ms plus the ‘propagation delay’ of 30 ms introduces a round trip delay of 2 × 50 ms, i.e. 100 ms. If we can allow a 300 ms round trip delay in total, we still have capacity for a 100 ms buffer at each end of the two-way link. This allows us to have a jitter-buffer or reservoir of 800 bytes or 5 packets at either end of the link. Packets delayed by less than 150 ms will be OK but any packet delayed by more than 150 ms must be considered ‘lost’ as it is too late to be used.
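The arithmetic of the illustration can be reproduced with a small sketch. The function name `jitter_buffer_budget` is made up here, and the calculation assumes G711 at 64 kb/s (8 bytes per millisecond) with the same buffer size at both ends of the link.

```python
BYTES_PER_MS_G711 = 8  # 64 kb/s is 8000 bytes per second


def jitter_buffer_budget(propagation_ms: int, packet_ms: int, round_trip_limit_ms: int) -> dict:
    """Share the spare round trip delay equally between the two jitter-buffers."""
    one_way_ms = propagation_ms + packet_ms          # propagation plus packetisation
    buffer_ms = (round_trip_limit_ms - 2 * one_way_ms) // 2
    if buffer_ms < 0:
        raise ValueError("the delay limit is exceeded even with no jitter-buffer")
    return {
        "buffer_ms": buffer_ms,
        "buffer_bytes": buffer_ms * BYTES_PER_MS_G711,
        "buffer_packets": buffer_ms // packet_ms,
        "lost_if_later_than_ms": one_way_ms + buffer_ms,
    }


if __name__ == "__main__":
    # Manchester to New York: 30 ms propagation, 20 ms packets, 300 ms limit.
    print(jitter_buffer_budget(30, 20, 300))
```

With these figures the sketch gives a 100 ms buffer of 800 bytes (5 packets) at each end and a 150 ms lateness limit, matching the illustration.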
Note: Light travels at about 200,000 km/s through fibre, which is a bit slower than the speed of light through open space (300,000 km/s). I am assuming it is 6000 km to New York (a one way flight takes about 7 hours, i.e. about 25,000 seconds). The speed of sound in air is about 0.3 km/s, one million times slower than light in open space. If you want to estimate the minimum network delay between two places, think how long it would take you to fly there, then divide by one million.

33. A question for you

Having set up such a VoIP link between Manchester & New York, how could you improve it if RTCP reports zero lost packets at the New York end and 10% lost packets at the Manchester end?

Answer: Increase the jitter-buffer length at the Manchester end to decrease the number of lost packets at the expense of increasing the round trip delay. Then decrease the jitter-buffer length at the New York end to decrease the delay at the expense of increasing lost packets. There would then be no discernible change in delay, as only the ‘round trip’ delay affects conversation. Voice quality improves at Manchester and gets a bit worse at New York. Ideally, we could achieve the same packet loss rate at either end; say about 5% packet loss. With 5% packet loss, speech quality can be made reasonable by employing ‘packet loss concealment’ (see later).

34. Speech digitisation standards

The ITU G711 speech coding standard is widely used for VoIP and lower bit-rate coders are available. Many issues are summarised by the table below:


Coding   Quality   Proc delay   Processing complexity   Memory of errors   b/s
Raw      v high    none         none                    none               128k
G711     high      none         low                     none               64k
G726     high      low          medium                  some               32k
G728     high      low          high                    high               16k
G729     high      high         high                    high               8k
G723.1   lower     high         high                    high               5.3k

‘Raw’ means the 16-bit samples obtained from a standard sound card set to a sampling rate of 8 kHz. Such a high bit-rate is fine for storage, but rarely used for communications including VoIP. The high processing delays with G729 and G723.1 arise from the need for the coder to have large ‘buffers’ of speech samples to identify ‘patterns’ or ‘predictability’. Bit-rate saving is achieved by identifying speech characteristics that may be predicted, and therefore need not be transmitted. However, if any bit-errors occur during transmission the predictions will be wrong at the receiver so future speech samples will be affected. Then the effect of bit-errors will be remembered for some time in the future. A single bit error may cause distortion to 20 ms (160 samples) of speech (or more) instead of just affecting one speech sample. The higher bit-rate techniques are much simpler and bit-errors will have no or little effect on future samples. It may be observed that, in general, the lower the bit-rate, the higher the processing delay, the greater the complexity, the longer the memory of bit-errors and the worse the speech quality. 35. Mobile IP: Voice over wi-fi The application of VoIP to battery powered wireless enabled mobile devices including mobile phones has received much commercial interest. Since wireless LANs support IP, UDP and TCP, they can support voice over wi-fi (VoWiFi). RTP and H323 or SIP protocols will work on wi-fi connected devices. The QoS of wireless (radio) networks is quite different from that of wired networks in that bit-errors are more frequent, bandwidth is more limited and congestion is more likely. Each packet is more expensive to send. (‘Every bit is sacred, every bit is great.’). Therefore wireless LANs employ ‘forward error correction’ (FEC) which means that a considerable number of extra bits are included in the packet to allow some bit-errors to be not only detected, but actually corrected. 
There will be more about this in next year’s Mobile Computing course. Let’s just illustrate one of the interesting issues with VoWiFi, i.e. the efficiency of transmitting VoIP packets by wireless. Consider the structure and length of wi-fi packets at the physical layer.


Wireless transmission needs a header for every packet which takes approximately 0.2 ms in addition to the payload. This is to allow the radio receiver to synchronise to the packet before the actual data is received. There is a similar header in Ethernet transmissions, but it is much shorter. At 11 Mb/s as provided by the wi-fi standard, 1000 bytes of text, with additional bits for FEC, takes about 2 ms to transmit. So the overhead of having to send such a long header is about 10%, which is not too bad. If we now send 100 bytes of VoIP speech at the same speed, this takes about 0.2 ms, and the phy layer overhead now becomes 100%. Finally, if we compress the speech using one of the ITU techniques (G729) the payload becomes just 30 bytes, and the framing overhead then approaches 300%. Voice over wi-fi begins to appear rather inefficient. Increasing the packet size is not an option with interactive VoIP because of the extra delay this would cause. If retransmissions become necessary because of collisions & radio noise, the inefficiency becomes even worse. Remember that each 2-way VoIP conversation involves about 100 packets per second, i.e. about 50 in each direction.

[Figure: wi-fi packets at the physical layer. Each begins with a pre-amble + header lasting 0.2 ms, followed by a payload lasting 2 ms (1000 bytes of text), 0.2 ms (100 bytes, i.e. 12.5 ms of G711 speech) or 0.06 ms (30 bytes, i.e. 30 ms of G729 compressed speech).]

36. Packet loss concealment (PLC) strategies

PLC uses predictability in speech waveforms which enables guesses about what is likely to come next. It can work well for voiced sounds (vowels) which are quasi-periodic as illustrated earlier:

[Figure: quasi-periodic speech waveform segment, Volts against time t, with the lost part shown dashed]


Referring to the waveform segment above, if the dashed part is lost, the similarity with what came before makes it possible to produce a reasonable replacement. An obvious idea is just to repeat the previous frame when a packet is lost. This works perfectly in the example and is much better than ‘zero stuffing’, i.e. just replacing the missing packet by zero valued samples to produce a period of silence. Simple packet repetition will not always work so well since the wave-shape and its periodicity will be changing. Any discontinuity as the reconstructed frame joins onto the next frame will produce a nasty ‘click’. We need something a bit more sophisticated to deal with the changing nature of the speech waveform and to smooth over the joins. Fortunately this is provided by a well known algorithm published as Appendix I to the ITU G711 standard. There are even better PLC techniques, but they are more complicated.

37. Some finer points of real time

‘Why not forget about the missing packet and go on with the next one?’ ‘Just miss it out?’ This might just be acceptable for a very occasional defect when listening to recorded music, though you may get a pretty loud ‘click’ in your ears. If there is a video track as well, you may lose lip-sync. Such an approach may be disastrous for VoIP telephony because of the ‘real time’ requirement. The jitter-buffer will be short of one packet, which degrades its effectiveness in dealing with jitter. If other packets are late or lost (which becomes increasingly likely) and we just disregard them also, our buffer will get shorter and shorter and will then surely become empty and underflow. The system must then refill the buffer, output a period of silence, and then restart the sound output process. If this happens more than once, the user will get pretty fed up with the VoIP system. In principle the number of samples sent must be the number of samples received, and if this becomes not true, we must re-create the samples we lose.
Otherwise the receiver’s jitter-buffer runs out and generates a ‘click’. Also we must expect small differences in sampling rates at transmitter and receiver as mentioned earlier, and this can be dealt with by recreating or discarding an odd sample every second or so. If we do not do this, buffers will become empty or over-fill over a period of time with bad consequences. 38. IntServ and Diffserv These are techniques for the reservation of transmission capacity and/or assigning priority for certain traffic or types of network traffic. VoIP is widely used over wired networks with reserved capacity. IETF’s IntServ architecture was proposed for giving guaranteed QoS to particular traffic streams by reserving link capacity between routers. IETF’s DiffServ architecture was proposed for allowing certain categories of traffic, such as VoIP, to be prioritised to make it more likely (though not certain) to achieve a desired QoS. 39. Some possible misconceptions about distributed computing Several textbooks and other articles list a number of fallacies or misconceptions about the nature of distributed computing and their interconnection mechanisms. Hopefully this section has shown these to be misconceptions and explored the consequences of having to deal with the true situation.


The misconceptions are that interconnections are:

- Reliable
- Secure
- Homogeneous (i.e. having the same type of links throughout)
- Unchanging in topology
- So fast that latency (delay) is negligible
- Essentially unlimited in bandwidth
- Available at zero transport cost (e.g. VoIP is free)
- Set up with just one administrator
- Linking devices with fully synchronised clocks (i.e. each having a universal measure of time)

Many of these misconceptions are revealed as such by the VoIP application. 40. Transparency Finally, we mention the property of ‘transparency’ which is an ideal for distributed systems and their applications. For VoIP, this property means that the distributed application should:

Accommodate differences in data representation (e.g. differences in speech coders, sampling rates, differences due to errors, VoIP implementations, SIP/H323 compatibility, VoIP/PSTN differences, etc.)

Not worry about where resources are located (Use of location servers, employment of tunnelling where necessary, Compatibility with PSTN and mobile phones etc.)

Allow resources to move while in use (Use voice over wifi, though handover issues have not been discussed here, etc)

Allow resources to be replicated. (Proxy servers, multiple location servers, etc.)

Allow processing to be distributed among hosts (Speech processing is distributed and not exactly synchronised)

Be robust to failure of components: (i.e. one failure should not be catastrophic) (Rely on the nature of IP networks, deal with QoS imperfections and if a VoIP host fails, buy another one)

41. Conclusions and learning outcomes

1. Interactive VoIP illustrates a DS where real time processing is done simultaneously in two or more locations, communication takes place over unreliable networks & timing mechanisms are not & cannot be exactly synchronised.

2. It requires regular transmission of packets with ‘round trip’ latency limits. 3. Bit-rate compression is negotiated according to need to reduce bit-rate. 4. Network communications use protocols defined in layers. 5. Applications such as VoIP use transport layer protocols TCP & UDP. 6. TCP establishes connections, introduces port numbers and many other features, and

achieves ‘reliability’ at the expense of additional delay & network usage. 7. UDP introduces port numbers, a check-sum and little else to IP.

COMP18112 26 VoIP 17/2/2013

8. RTP introduces sequence numbers, time-stamp & other info into the UPD payload, but is still not ‘reliable’.

9. IP, UDP and RTP are all ‘fire and forget’ protocols.
10. RTP & RTCP are useful for conveying real time data in QoS applications such as VoIP.
11. Any of these protocols may be used over Virtual Circuits to gain benefits of improved QoS.
12. Interactive VoIP requires regular transmission of packets with ‘round trip’ latency limits.
13. ‘End to end’ delay is not easily measurable and not needed, as only ‘round trip’ delay is discernible.
14. SIP is widely used for setting up, maintaining & ending calls.
15. TCP is appropriate for the client-server SIP messages, whereas fire & forget protocols RTP and RTCP are appropriate for the peer-to-peer real time voice streams.
16. There are no QoS guarantees with RTP and RTCP: jitter causes late packets to appear lost and bit-errors may damage packets.
17. VoWLAN or VoWiFi works in principle but is less efficient than VoIP over wired links.
18. Wired & wireless networks have different QoS.
19. Packet loss concealment (PLC) is used to provide correct sounding speech packets when the real packets are lost, or too late to be used.
20. IntServ & DiffServ are network mechanisms for improving QoS to some users.
21. Many fundamental issues are illustrated by VoIP.
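The points above about RTP adding a sequence number and time-stamp to the UDP payload can be illustrated with a minimal Python sketch. It assumes the fixed 12-byte RTP header layout with no padding, extension or CSRC entries, and payload type 0 for G.711 mu-law. The function names, the SSRC value and the all-zero speech frame are hypothetical, chosen only for illustration, and sequence number wrap-around is deliberately ignored.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0  # static payload type for G.711 mu-law audio

def pack_rtp(seq, timestamp, ssrc, payload, payload_type=PT_PCMU):
    """Prefix `payload` with a minimal 12-byte RTP header."""
    byte0 = RTP_VERSION << 6      # version 2, no padding, no extension, no CSRCs
    byte1 = payload_type & 0x7F   # marker bit left clear
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload

def unpack_rtp(packet):
    """Return (seq, timestamp, ssrc, payload) from a packet built by pack_rtp."""
    _, _, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return seq, timestamp, ssrc, packet[12:]

def missing_sequence_numbers(received):
    """List sequence numbers absent from `received` (wrap-around ignored)."""
    seen = set(received)
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]

# One 20 ms G.711 frame at 8 kHz is 160 one-byte samples (illustrative silence).
frame = bytes(160)
packet = pack_rtp(seq=7, timestamp=7 * 160, ssrc=0x1234ABCD, payload=frame)
```

The resulting bytes would be handed to a UDP socket as an ordinary datagram. Nothing here retransmits anything: the receiver can only use the sequence numbers to notice gaps, which is why RTP remains ‘fire and forget’ and why packet loss concealment is needed.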

42. Problems & discussion points

1. Since IP was designed primarily for data, why is it now being used for VoIP?
2. Why is VoIP telephony more demanding than streaming audio as used for sound clips and Internet radio?
3. What are the advantages & disadvantages of employing speech bit-rate compression in (i) VoIP over wired links and (ii) VoIP over WiFi?
4. Why are mobile VoIP over WiFi devices traditionally power inefficient?
5. Two mobile VoIP devices with different speech sampling rates, 8000 & 8010 Hz, are communicating 20 ms (G711) packets over a WLAN. How could you avoid the accumulation of delay and distortion due to buffer under-flows and over-flows?
6. Can VoIP and data co-exist on a WLAN? What problems can occur & what solutions have been proposed?
7. Why can’t the time-stamp within RTP packets be used to estimate one-way delay?
8. Explain why it would be difficult to have a precisely synchronized measure of exact time across all components of a distributed system. Why can’t we just listen to the ‘pips’ on the one o’clock news on the radio and all set our computer clocks at that point in time?
9. While thinking about the ‘pips’, if you have a DAB radio why are they always a little late?
10. Is it true that IP addresses are global, MAC (Phy) addresses are local and PORT numbers (addresses) identify applications? [Answer: yes]
11. How is connectivity achieved at network layer?
12. How is connectivity achieved at transport layer?
13. Why do we need both a transport & a network layer?
14. How will (a) TCP (b) UDP deal with duplicated packets?
15. Why does each IP packet have a ‘time to live’?
16. Since routers change IP headers, does the checksum have to change?
17. Can you relate the ‘two army’ problem to a different scenario?
18. Add an 8-bit check-sum to: 11101010 11010001 10000001


19. An end-to-end VoIP over WiFi system uses 50 ms G711 speech frames with zero stuffing PLC. It delivers intelligible but fairly poor quality speech, with round trip delay of 200 ms, and frequently crashes due to network congestion. How could you improve its performance? What steps would be appropriate if it were a wired link?
20. Devise a method for avoiding buffer overflow or underflow when two users, Alvaro and Barry, are communicating speech samples over a perfect link but their independent ‘sound card’ clocks produce sampling rates of 7999 Hz and 8001 Hz respectively, instead of 8 kHz.
21. What problems occur when the buffers are (a) too small and (b) too large?
22. What size of buffer would you consider to be about right?
23. What would happen if Alvaro’s buffer were bigger than Barry’s?
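As a starting point for the sampling-rate problems above (5 and 20), the following Python sketch only quantifies how quickly mismatched clocks make a receive buffer drift. It does not devise the method the problems ask for. It assumes a perfect link and a hypothetical buffer slack of 160 samples (one 20 ms frame at a nominal 8 kHz); the function names are illustrative.

```python
def drift_samples_per_second(sender_hz, receiver_hz):
    """Net samples gained (+) or lost (-) per second by the receiver's buffer.

    The sender's clock fills the buffer and the receiver's clock empties it,
    so a positive result leads to overflow and a negative one to underflow.
    """
    return sender_hz - receiver_hz

def seconds_until_slip(sender_hz, receiver_hz, slack_samples):
    """Time until the buffer has drifted by `slack_samples` samples."""
    drift = abs(drift_samples_per_second(sender_hz, receiver_hz))
    if drift == 0:
        return float("inf")  # matched clocks never slip
    return slack_samples / drift

# Barry (8001 Hz) sending to Alvaro (7999 Hz): Alvaro's buffer slowly fills.
alvaro_drift = drift_samples_per_second(8001, 7999)
# Alvaro (7999 Hz) sending to Barry (8001 Hz): Barry's buffer slowly drains.
barry_drift = drift_samples_per_second(7999, 8001)

slip_problem_20 = seconds_until_slip(8001, 7999, slack_samples=160)
slip_problem_5 = seconds_until_slip(8010, 8000, slack_samples=160)
```

Under these assumptions a difference of only 2 samples per second consumes a whole 20 ms frame of slack in 80 seconds, and the 10 samples per second difference of problem 5 does so in 16 seconds, so the drift cannot simply be ignored on a long call.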