![Page 1: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/1.jpg)
Arrays and Strings
CSCI 2720, University of Georgia, Spring 2007
![Page 2: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/2.jpg)
The Array ADT
Stores a sequence of consecutively numbered objects
Each object can be accessed (selected) using its index
![Page 3: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/3.jpg)
More formally …
Given integers l and u with u >= l-1, the interval l..u is defined to be the set of integers i such that l <= i <= u.
An array is a function from any interval (the index set of the array) to a set of objects or elements (the value set of the array).
![Page 4: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/4.jpg)
Formally, continued …
If X is an array and i is a member of its index set, we write X[i] to denote the value of X at i.
The members of the range of X are known as the elements of X.
![Page 5: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/5.jpg)
The Array ADT
Access(X,i)
Length(X)
Assign(X,i,v)
Initialize(X,v)
Iterate(X,F)
![Page 6: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/6.jpg)
Access(X,i)
Return X[i]
![Page 7: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/7.jpg)
Length(X)
Return u - l + 1, the number of elements in the interval l..u (the index set of X)
![Page 8: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/8.jpg)
Assign(X,i,v)
Replace array X with a function whose value on i is v (and whose value on all other arguments is unchanged).
We also write this as: X[i] <- v
![Page 9: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/9.jpg)
Initialize(X,v)
Assign v to every element of array X
![Page 10: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/10.jpg)
Iterate(X,F)
Apply F to each element of array X in order, from smallest index to largest index. F is an action on a single array element.
for i = l to u do
    F(X[i])
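The five operations can be sketched in Python as a small class. This is only an illustration under stated assumptions: the class name `IntervalArray`, the method names, and the choice of a Python list as backing storage are not from the slides.

```python
from typing import Any, Callable, List


class IntervalArray:
    """Array over the index interval l..u (hypothetical name, for illustration)."""

    def __init__(self, l: int, u: int, v: Any = None) -> None:
        if u < l - 1:
            raise ValueError("need u >= l - 1")
        self.l = l
        self.u = u
        # Backing storage is 0-based; the element with index i lives in cell i - l.
        self._cells: List[Any] = [v] * (u - l + 1)

    def _check(self, i: int) -> None:
        if not (self.l <= i <= self.u):
            raise IndexError(f"{i} is outside {self.l}..{self.u}")

    def access(self, i: int) -> Any:
        """Access(X,i): return X[i]."""
        self._check(i)
        return self._cells[i - self.l]

    def length(self) -> int:
        """Length(X): u - l + 1."""
        return self.u - self.l + 1

    def assign(self, i: int, v: Any) -> None:
        """Assign(X,i,v): X[i] <- v, all other elements unchanged."""
        self._check(i)
        self._cells[i - self.l] = v

    def initialize(self, v: Any) -> None:
        """Initialize(X,v): assign v to every element."""
        for k in range(len(self._cells)):
            self._cells[k] = v

    def iterate(self, f: Callable[[Any], None]) -> None:
        """Iterate(X,F): apply F from the smallest index to the largest."""
        for i in range(self.l, self.u + 1):
            f(self.access(i))
```

Note that `IntervalArray(1, 0)` is allowed and has length 0, matching the condition u >= l-1 in the formal definition.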
![Page 11: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/11.jpg)
String
A special type of array
If Σ is any finite set, then a string over Σ is an array whose value set is Σ and whose index set is 0..n-1 for some non-negative n
The set Σ is called an alphabet
Each element of Σ is called a character
Σ often consists of the Roman alphabet, plus digits, the space, and common punctuation marks
![Page 12: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/12.jpg)
Strings
If w is a string, then Length(w) = n
Also written |w|
If w = TREE, then w is a string of length 4, with w[0] = T, w[1] = R
The null string is the string whose domain is the empty interval
It has no elements
Written λ
![Page 13: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/13.jpg)
String-specific operations
Substring(w,i,m)
Concat(w1,w2)
![Page 14: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/14.jpg)
Substring(w,i,m)
w is a string; i, m integers
Returns the string of length m containing the portion of w that starts at i
Formally: returns a string w’ with indices 0 .. m-1 such that w’[k] = w[i+k] for each k satisfying 0 <= k < m
only applies if 0 <= i < |w| and 0 <= m <= |w| - i
otherwise, returns the null string λ
![Page 15: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/15.jpg)
Substring …
Example: w = SNICKERING
Substring(w,2,3) returns ICK
Substring(w,3,0) returns the null string λ
Substring(w,10,3) returns the null string λ
Prefix: each Substring(w,0,j) for 0 <= j <= |w| is a prefix of w
Suffix: each Substring(w,j,|w| - j) for 0 <= j <= |w| is a suffix of w
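A minimal Python sketch of Substring, with prefixes and suffixes derived from it. It assumes the operation applies when 0 <= i < |w| and 0 <= m <= |w| - i and returns the null string (here the empty Python string) otherwise; the function names are illustrative.

```python
from typing import List


def substring(w: str, i: int, m: int) -> str:
    """Return the m characters of w starting at index i, or "" if out of range."""
    if 0 <= i < len(w) and 0 <= m <= len(w) - i:
        return w[i:i + m]
    return ""  # the null string


def prefixes(w: str) -> List[str]:
    """All Substring(w, 0, j) for 0 <= j <= |w|."""
    return [substring(w, 0, j) for j in range(len(w) + 1)]


def suffixes(w: str) -> List[str]:
    """All Substring(w, j, |w| - j) for 0 <= j <= |w|."""
    return [substring(w, j, len(w) - j) for j in range(len(w) + 1)]
```

For j = |w| the start index is out of range, so the result is the null string, which is also the expected zero-length suffix.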
![Page 16: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/16.jpg)
Concat(w1,w2)
returns a string of length |w1| + |w2| whose characters are the characters of w1 followed by those of w2
Concat(w, λ) = Concat(λ, w) = w, where λ is the null string
Example: w1 = BIRD, w2 = DOG
Concat(w1,w2) = BIRDDOG
Concat(w2,w1) = DOGBIRD
![Page 17: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/17.jpg)
Tables vs. Arrays
Table = physical organization of memory into sequential cells
Array = an abstract data type, with specific operations
Arrays frequently implemented using tables, but may be implemented in other ways
![Page 18: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/18.jpg)
Multi-dimensional arrays
a function whose range is any set V and whose domain is the Cartesian product of any number of intervals
the Cartesian product of intervals I1, I2, … Id, written as I1 x I2 x … x Id, is the set of all d-tuples <i1, i2, … id> such that ik ∈ Ik for each k.
![Page 19: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/19.jpg)
Multi-D arrays
if C is a multidimensional array and if i =<i1, i2, … id> then C[i1, i2, … id] is the value of C at i
The dimension of a multi-D array is the number of intervals whose Cartesian product makes up the index set
The size of the kth dimension of such an array is the number of elements in Ik
![Page 20: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/20.jpg)
Contiguous Representation of Arrays: Why Computer Scientists start counting at 0
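Store elements in a table:

| address | x | x+4 | x+8 | x+12 | x+16 | x+20 |
| --- | --- | --- | --- | --- | --- | --- |
| element | x[0] | x[1] | x[2] | x[3] | x[4] | x[5] |
| value | 17 | 43 | 87 | 94 | 101 | 143 |

Each element begins at x + 4i, where
x = starting address of the array
4 = sizeof(element)
i = index of element of interest (counting from 0)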
![Page 21: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/21.jpg)
More generally
if X is the address of the first cell in memory of an array with indices l..u, and if each element has size L, then the element with index i is stored at address X + L * (i - l)
the element can be retrieved in constant time
![Page 22: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/22.jpg)
When iterating through the array
can save a few operations by doing “pointer arithmetic”
just add L to current address to get next element
don’t have to subtract, multiply, add
still linear in the number of elements, but with a smaller constant factor
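The address calculation and the pointer-arithmetic shortcut can be sketched as follows. Addresses are modeled as plain integers, the function names are illustrative, and the formula assumes the element with index i sits (i - l) cells after the start.

```python
from typing import Iterator


def element_address(base: int, elem_size: int, i: int, lower: int = 0) -> int:
    """Address of the element with index i: one subtract, one multiply, one add."""
    return base + elem_size * (i - lower)


def walk_addresses(base: int, elem_size: int, n: int) -> Iterator[int]:
    """Visit n consecutive elements using only one addition per step."""
    addr = base
    for _ in range(n):
        yield addr
        addr += elem_size  # "pointer arithmetic": step to the next cell
```

With `lower=0` the subtraction disappears, which is one reason 0-based indexing is convenient.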
![Page 23: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/23.jpg)
Where’s the needed info stored?
Could store L, l, and u at the starting address of X … but would need to adjust the formula to calculate the location of individual cells.
If language is strongly typed, some or all of L, l, and u may be part of the definition of X and stored elsewhere
C/C++: L is part of the typing info, l is assumed to be 0, u is not stored (programmer needs to keep track)
![Page 24: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/24.jpg)
Where’s the needed info stored?
Can use a sentinel value after the last element of the array
C/C++: we do this with strings. Store a ‘\0’ at the end
means that you need to iterate through to find Length, so Length is no longer O(1)
![Page 25: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/25.jpg)
What if the elements have different lengths?
allot Max (the largest element size) to all elements
wasted space
can still access in O(1) time
store pointers to elements
pointers require memory
need 2 accesses (calculate location of pointer, then follow it), but still O(1)
pointer to the element with index i is at X + P * (i - l), where P is the size of a pointer
easy to swap even large or complex elements … just swap their pointers
![Page 26: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/26.jpg)
2D arrays
can also represent in contiguous memory … but do we keep rows together or do we keep columns together??
Example: array with logical ordering
A B C D
E F G H
I J K L
![Page 27: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/27.jpg)
Row major v. column-major
Row-major: ABCDEFGHIJKL
Column-major: AEIBFJCGKDHL
![Page 28: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/28.jpg)
Where are 2D elements stored?
Row-major: R[i,j] stored at: R + L * (NPR * (i-1) + (j-1)), where
R is starting address of the array
L is the size of each element
NPR is the number of elements per row
i is the row number (counting from 1)
j is the column number (counting from 1)
![Page 29: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/29.jpg)
Where are 2D elements stored?
Column-major: C[i,j] stored at: C + L * (NPC * (j-1) + (i-1)), where
C is starting address of the array
L is the size of each element
NPC is the number of elements per column
i is the row number (counting from 1)
j is the column number (counting from 1)
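As a check on the two formulas, this sketch lays out the 3 by 4 example array (A through L) in a flat memory using 1-based row and column numbers. The function names, a base address of 0, and an element size of 1 are assumptions made for the illustration.

```python
from typing import List


def row_major_address(base: int, size: int, per_row: int, i: int, j: int) -> int:
    """Address of element [i, j] (1-based) when rows are kept together."""
    return base + size * (per_row * (i - 1) + (j - 1))


def column_major_address(base: int, size: int, per_col: int, i: int, j: int) -> int:
    """Address of element [i, j] (1-based) when columns are kept together."""
    return base + size * (per_col * (j - 1) + (i - 1))


def lay_out(grid: List[str], row_major: bool) -> str:
    """Place every element of grid at its computed address and read memory back."""
    rows, cols = len(grid), len(grid[0])
    memory = [""] * (rows * cols)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if row_major:
                addr = row_major_address(0, 1, cols, i, j)
            else:
                addr = column_major_address(0, 1, rows, i, j)
            memory[addr] = grid[i - 1][j - 1]
    return "".join(memory)


GRID = ["ABCD", "EFGH", "IJKL"]
```

Calling `lay_out(GRID, True)` and `lay_out(GRID, False)` should reproduce the row-major and column-major orderings from the earlier slide.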
![Page 30: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/30.jpg)
Multi-dimensional arrays
![Page 31: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/31.jpg)
Constant-time initialization
procedure Initialize(ptr M, value v)
// Initialize each element of M to v
    Count(M) <- 0
    Default(M) <- v
function Valid(int i, ptr M): boolean
// Return true if M[i] has been modified since the last Initialize
    return (0 <= When(M)[i] < Count(M)) and (Which(M)[When(M)[i]] == i)
![Page 32: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/32.jpg)
Constant time initialization
function Access(int i, ptr M): value
// Return M[i]
    if Valid(i, M) then
        return Data(M)[i]
    else
        return Default(M)
procedure Assign(ptr M, int i, value v)
// Set M[i] <- v
    if not Valid(i, M) then
        When(M)[i] <- Count(M)
        Which(M)[Count(M)] <- i
        Count(M) <- Count(M) + 1
    Data(M)[i] <- v
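The pseudocode above can be tried out in Python. One caveat: Python cannot hand out uninitialized memory, so this sketch fills the three tables with arbitrary junk to simulate it, which makes the constructor itself O(n). The point of the illustration is that `initialize` is O(1) and that stale junk never passes the validity check. The class name `LazyArray` and the `seed` parameter are assumptions.

```python
import random
from typing import Any


class LazyArray:
    """Array of n cells (indices 0..n-1) with constant-time Initialize."""

    def __init__(self, n: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Junk contents stand in for uninitialized memory.
        self.data = [rng.randrange(-5, n + 5) for _ in range(n)]
        self.when = [rng.randrange(-5, n + 5) for _ in range(n)]
        self.which = [rng.randrange(-5, n + 5) for _ in range(n)]
        self.count = 0
        self.default: Any = None

    def initialize(self, v: Any) -> None:
        # O(1): forget every earlier assignment and record the default.
        self.count = 0
        self.default = v

    def valid(self, i: int) -> bool:
        # when[i] must point into the used part of which, and which must point back.
        t = self.when[i]
        return 0 <= t < self.count and self.which[t] == i

    def access(self, i: int) -> Any:
        return self.data[i] if self.valid(i) else self.default

    def assign(self, i: int, v: Any) -> None:
        if not self.valid(i):
            self.when[i] = self.count
            self.which[self.count] = i
            self.count += 1
        self.data[i] = v
```

The two-way check is what makes junk harmless: a stale `when[i]` can only be accepted if the slot of `which` it names was written since the last initialize and names i in return.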
![Page 33: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/33.jpg)
But requires 3x memory …
Which(M)
When(M)
Data(M)
![Page 34: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/34.jpg)
Sparse Arrays
Definitions
List Representations
Hierarchical Tables
Arrays with Special Shapes
![Page 35: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/35.jpg)
Sparse Arrays
some arrays contain only a few elements … wouldn’t it be more efficient to store only the non-null values?
same idea when only a few values differ from the majority
some arrays have a special shape … upper triangular matrix, symmetric matrix
sparse array : an array in which only a small fraction of the elements are significant in some way
null element: doesn’t need to be stored; is either actually null, or well-known, or easily calculated
![Page 36: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/36.jpg)
List representations
![Page 37: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/37.jpg)
Hierarchical tables
![Page 38: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/38.jpg)
Upper-triangular matrix
![Page 39: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/39.jpg)
Representation of Strings
Background
Huffman Encoding
Lempel-Ziv Encoding
![Page 40: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/40.jpg)
Representing Strings
How much space do we need?
Assume we represent every character.
How many bits to represent each character?
Depends on |Σ|
![Page 41: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/41.jpg)
Bits to encode a character
Two character alphabet {A,B}
one bit per character: 0 = A, 1 = B
Four character alphabet {A,B,C,D}
two bits per character: 00 = A, 01 = B, 10 = C, 11 = D
Six character alphabet {A,B,C,D,E,F}
three bits per character: 000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused
![Page 42: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/42.jpg)
More generally
The bit sequence representing a character is called the encoding of the character.
There are 2^n different bit sequences of length n, so ceil(lg |Σ|) bits are required to represent each character in Σ
if we use the same number of bits for each character, then the length of the encoding of a word w is |w| * ceil(lg |Σ|)
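The fixed-length bit count is easy to check numerically. This sketch uses integer arithmetic instead of a floating-point logarithm; the function names are illustrative.

```python
def bits_per_char(alphabet_size: int) -> int:
    """ceil(lg(alphabet_size)) for alphabet_size >= 1, computed with integers."""
    if alphabet_size < 1:
        raise ValueError("alphabet must be non-empty")
    # (n - 1).bit_length() is the smallest b with 2**b >= n.
    return (alphabet_size - 1).bit_length()


def fixed_length_bits(w: str, alphabet_size: int) -> int:
    """Length in bits of w when every character uses the same number of bits."""
    return len(w) * bits_per_char(alphabet_size)
```

For the three alphabets on the previous slide (sizes 2, 4, and 6) this gives 1, 2, and 3 bits per character.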
![Page 43: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/43.jpg)
Can we do better??
If Σ is very small, might use run-length encoding
![Page 44: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/44.jpg)
What if …
the string we encode doesn’t use all the letters in the alphabet?
then only ceil(log2(|set_of_characters_used|)) bits per character are needed
But then also need to store / transmit the mapping from encodings to characters
… and the set of characters used is typically close to the size of the alphabet
![Page 45: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/45.jpg)
Huffman Encoding:
Still assumes encoding on a per-character basis
Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings
requires assigning longer codes to rarely used characters
Problem: when decoding, need to know how many bits to read off for each character.
Solution: choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.
![Page 46: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/46.jpg)
A Huffman Encoding Tree
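[Figure: a Huffman encoding tree with leaves A, T, R, N, and E. Nodes are labeled with frequency totals (21 at the root), and the two branches below each internal node are labeled 0 and 1.]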
![Page 47: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/47.jpg)
[Figure: the same encoding tree, with the code for each character read off along the path from the root to its leaf:]
A 000
T 001
R 010
N 011
E 1
![Page 48: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/48.jpg)
Weighted path length
A 000
T 001
R 010
N 011
E 1
Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
= (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9)
= 9 + 6 + 9 + 12 + 9 = 45
Claim (proof in text) : no other encoding can result in a shorter weighted path length
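The weighted path length above can be recomputed in a few lines. The frequencies (A 3, T 2, R 3, N 4, E 9) are read off the arithmetic on this slide; the helper names are illustrative.

```python
from typing import Dict

CODES = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
FREQS = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}


def weighted_path_length(codes: Dict[str, str], freqs: Dict[str, int]) -> int:
    """Sum over all characters of (code length) * (frequency)."""
    return sum(len(codes[c]) * freqs[c] for c in codes)


def is_prefix_free(codes: Dict[str, str]) -> bool:
    """True if no codeword is a prefix of a different codeword."""
    words = list(codes.values())
    return not any(
        a != b and b.startswith(a) for a in words for b in words
    )
```

For comparison, a fixed-length code for five characters needs 3 bits each, or 3 * 21 = 63 bits for the same 21 characters.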
![Page 49: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/49.jpg)
Building the Huffman Tree
[Figure: four free nodes, each labeled with a character and its frequency: A 3, T 4, R 4, E 5]
![Page 50: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/50.jpg)
Building the Huffman Tree
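[Figure: the two lowest-frequency nodes, A 3 and T 4, become children of a new node with frequency 7; R 4 and E 5 are still free]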
![Page 51: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/51.jpg)
Building the Huffman Tree
[Figure: the free nodes are now R 4, E 5, and the node 7, whose children are A 3 and T 4]
![Page 52: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/52.jpg)
Building the Huffman Tree
[Figure: R 4 and E 5 become children of a new node with frequency 9; the free nodes are now 7 and 9]
![Page 53: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/53.jpg)
Building the Huffman Tree
[Figure: two subtrees remain: node 7 with children A 3 and T 4, and node 9 with children R 4 and E 5]
![Page 54: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/54.jpg)
Building the Huffman Tree
[Figure: nodes 7 and 9 become children of a new root with frequency 16]
![Page 55: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/55.jpg)
Building the Huffman Tree
[Figure: the finished tree with root 16; branches are labeled 0 and 1, giving the codes A = 00, T = 01, R = 10, E = 11]
![Page 56: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/56.jpg)
Taking a step back …
Why do we need compression?
rate of creation of image and video data
image data from digital camera
today 1k by 1.5k is common = 1.5 Mbytes
need 2k by 3k to equal 35mm slide = 6 Mbytes
video at even a low resolution of 512 by 512 and 3 bytes per pixel, 30 frames/second
![Page 57: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/57.jpg)
Compression basics
video data rate
23.6 Mbytes/second
2 hours of video = 169 gigabytes
MPEG-1 compresses
23.6 Mbytes down to 187 kbytes per second
169 gigabytes down to 1.3 gigabytes
compression is essential for both storage and transmission of data
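The figures on this slide follow from the frame parameters given on the previous one. A short worked calculation, assuming decimal units (1 Mbyte = 10^6 bytes, 1 gigabyte = 10^9 bytes); the variable names are illustrative:

```python
WIDTH, HEIGHT = 512, 512      # pixels per frame
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 30
SECONDS = 2 * 60 * 60         # two hours

raw_rate = WIDTH * HEIGHT * BYTES_PER_PIXEL * FRAMES_PER_SECOND  # bytes/second
raw_total = raw_rate * SECONDS                                   # bytes

mpeg1_rate = 187_000          # bytes/second, the figure quoted on the slide
mpeg1_total = mpeg1_rate * SECONDS

print(f"raw:    {raw_rate / 1e6:.1f} Mbytes/s, {raw_total / 1e9:.1f} gigabytes")
print(f"MPEG-1: {mpeg1_rate / 1e3:.0f} kbytes/s, {mpeg1_total / 1e9:.2f} gigabytes")
print(f"ratio:  about {raw_rate / mpeg1_rate:.0f} to 1")
```

The uncompressed total comes out just under 170 gigabytes, which the slide truncates to 169.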
![Page 58: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/58.jpg)
Compression basics
compression is very widely used
JPEG, GIF for single images
MPEG-1, 2, 3, 4 for video sequences
zip for computer data
MP3 for sound
based on two fundamental principles: spatial coherence and temporal coherence
similarity with spatial neighbor
similarity with temporal neighbor
![Page 59: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/59.jpg)
Basics of compression
character = basic data unit in the input stream; represents byte, bit, etc.
strings = sequences of characters
encoding = compression
decoding = decompression
codeword = data elements used to represent input characters or character strings
codetable = list of codewords
![Page 60: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/60.jpg)
Codeword
encoder/compressor
takes characters/strings as input and uses the codetable to decide which codewords to produce
decoder/decompressor
takes codewords as input and uses the same codetable to decide which characters/strings to produce
![Page 61: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/61.jpg)
Codetable
clearly the encoder must pass the encoded data to the decoder as a series of codewords
also must pass the codetable
the codetable can be passed explicitly or implicitly
that is, we either
pass it across
agree on it beforehand (hard wired)
recreate it from the codewords (clever!)
![Page 62: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/62.jpg)
Basic definitions
compression ratio = size of original data / size of compressed data
basically, the higher the compression ratio the better
lossless compression
output data is exactly the same as input data
essential for encoding computer-processed data
lossy compression
output data is not the same as input data
acceptable for data that is only viewed or heard
![Page 63: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/63.jpg)
Lossless versus lossy
human visual system less sensitive to high frequency losses and to losses in color
lossy compression acceptable for visual data
degree of loss is usually a parameter of the compression algorithm
tradeoff: loss versus compression
higher compression => more loss
lower compression => less loss
![Page 64: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/64.jpg)
Symmetric versus asymmetric
symmetric
encoding time == decoding time
essential for real-time applications (e.g., video or audio on demand)
asymmetric
encoding time >> decoding time
ok for write-once, read-many situations
![Page 65: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/65.jpg)
Entropy encoding
compression that does not take into account what is being compressed
normally is also lossless encoding
most common types of entropy encoding:
run length encoding
Huffman encoding
modified Huffman (fax…)
Lempel Ziv
![Page 66: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/66.jpg)
Source encoding
takes into account type of data (e.g., visual)
normally is lossy but can also be lossless
most common types in use:
JPEG, GIF = single images
MPEG = sequence of images (video)
MP3 = sound sequence
often uses entropy encoding as a sub-routine
![Page 67: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/67.jpg)
Run length encoding
one of simplest and earliest types of compression
take account of repeating data (called runs)
runs are represented by a count along with the original data
e.g. AAAABB => 4A2B
do you run length encode a single character?
no, use a special prefix character to represent start of runs
![Page 68: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/68.jpg)
Run length encoding
runs are represented as <prefix char><repeat count><run char>
prefix char itself becomes <prefix char>1<prefix char>
want a prefix char that is not too common
an example of early use is the MacPaint file format
run length encoding is lossless and has fixed length codewords
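A minimal sketch of the prefix-character scheme on bytes. Several details are assumptions rather than anything stated on the slides: the prefix byte value, a one-byte repeat count (so runs are capped at 255), and the rule that only runs of 4 or more are worth encoding. A run of prefix bytes is always encoded, which generalizes the "prefix, 1, prefix" rule above.

```python
PREFIX = 0x1B    # assumed prefix byte; ideally one that is rare in the data
MIN_RUN = 4      # assumed threshold: shorter runs are copied literally
MAX_COUNT = 255  # the repeat count is stored in a single byte


def rle_encode(data: bytes, prefix: int = PREFIX) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        ch = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == ch and run < MAX_COUNT:
            run += 1
        if ch == prefix or run >= MIN_RUN:
            out += bytes([prefix, run, ch])  # 3-byte codeword
        else:
            out += bytes([ch]) * run         # literal copy
        i += run
    return bytes(out)


def rle_decode(code: bytes, prefix: int = PREFIX) -> bytes:
    out = bytearray()
    i = 0
    while i < len(code):
        if code[i] == prefix:
            count, ch = code[i + 1], code[i + 2]
            out += bytes([ch]) * count
            i += 3
        else:
            out.append(code[i])
            i += 1
    return bytes(out)
```

With `#` as the prefix, `AAAABB` becomes `#`, the count byte 4, `A`, then a literal `BB`. Note that a run of exactly three non-prefix characters would not shrink when encoded, which is why short runs are left alone.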
![Page 69: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/69.jpg)
MacPaint File Format
![Page 70: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/70.jpg)
Run length encoding
works best for images with solid background
good example of such an image is a cartoon
does not work as well for natural images
does not work well for English text
however, is almost always a part of a larger compression system
![Page 71: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/71.jpg)
Huffman encoding
assume we know the frequency of each character in the input stream
then encode each character as a variable length bit string, with shorter strings for more frequent characters
variable length codewords are used; early example is Morse code
Huffman produced an algorithm for assigning codewords optimally
![Page 72: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/72.jpg)
Huffman encoding
input = probabilities of occurrence of each input character (frequencies of occurrence)
output is a binary tree
each leaf node is an input character
each branch is a zero or one bit
codeword for a leaf is the concatenation of bits for the path from the root to the leaf
codeword is a variable length bit string
a very good compression ratio (optimal)?
![Page 73: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/73.jpg)
Huffman encoding
Basic algorithm
Mark all characters as free tree nodes
While there is more than one free node
    Take two nodes with lowest freq. of occurrence
    Create a new tree node with these nodes as children and with freq. equal to the sum of their freqs.
    Remove the two children from the free node list.
    Add the new parent to the free node list
Last remaining free node is the root of the binary tree used for encoding/decoding
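The basic algorithm can be sketched with a priority queue as the free node list. The choice of `heapq`, the tie-breaking counter (ties go to whichever node was created first), and the convention of labeling the first child 0 and the second child 1 are assumptions; other tie-breaking rules give different but equally short codes.

```python
import heapq
from itertools import count
from typing import Dict


def huffman_codes(freqs: Dict[str, int]) -> Dict[str, str]:
    """Build a Huffman tree from character frequencies and return the codes."""
    if not freqs:
        return {}
    order = count()  # tie-breaker so the heap never has to compare tree nodes
    # Free node list: (frequency, creation order, node). A leaf is a character;
    # an internal node is a (left, right) tuple.
    free = [(f, next(order), ch) for ch, f in freqs.items()]
    heapq.heapify(free)
    if len(free) == 1:
        return {free[0][2]: "0"}  # a lone character still needs one bit
    while len(free) > 1:
        f1, _, left = heapq.heappop(free)   # two nodes with lowest frequency
        f2, _, right = heapq.heappop(free)
        heapq.heappush(free, (f1 + f2, next(order), (left, right)))
    root = free[0][2]

    codes: Dict[str, str] = {}

    def walk(node, path: str) -> None:
        if isinstance(node, tuple):
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:
            codes[node] = path

    walk(root, "")
    return codes
```

For example, `huffman_codes({"A": 3, "T": 4, "R": 4, "E": 5})` merges A with T, then R with E, then the two subtrees, and every character ends up with a 2-bit code.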
![Page 74: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/74.jpg)
Huffman example
a series of colors in an 8 by 8 screen
colors are red, green, cyan, blue, magenta, yellow, and black
sequence is
rkkkkkkk
gggmcbrr
kkkrrkkk
bbbmybbr
kkrrrrgg
gggggggr
kkbcccrr
grrrrgrr
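Counting the characters in this sequence gives the frequencies that the Huffman algorithm starts from. The sketch below also computes the total encoded length without building the tree, using the fact that the weighted path length equals the sum of the frequencies of all merged (internal) nodes. It assumes k stands for black; the names are illustrative.

```python
import heapq
from collections import Counter
from typing import Iterable

ROWS = [
    "rkkkkkkk", "gggmcbrr", "kkkrrkkk", "bbbmybbr",
    "kkrrrrgg", "gggggggr", "kkbcccrr", "grrrrgrr",
]
FREQS = Counter("".join(ROWS))


def huffman_total_bits(weights: Iterable[int]) -> int:
    """Total bits of a Huffman encoding: the sum of all merged node weights."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


# Seven colors need 3 bits each with a fixed-length code.
FIXED_BITS = sum(FREQS.values()) * 3
HUFFMAN_BITS = huffman_total_bits(FREQS.values())
```

Here `FIXED_BITS` is 192 and `HUFFMAN_BITS` is 152, not counting whatever is needed to transmit the code table.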
![Page 75: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/75.jpg)
Huffman example
![Page 76: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/76.jpg)
Huffman example
![Page 77: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/77.jpg)
Huffman example
![Page 78: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/78.jpg)
Huffman example
![Page 79: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/79.jpg)
Fixed versus variable length codewords
run length codewords are fixed length
Huffman codewords are variable length
shorter codewords for more frequent characters
variable length codes that are decoded bit by bit need the prefix property
one code cannot be the prefix of another
binary tree structure guarantees that this is the case (a leaf node is a leaf node!)
![Page 80: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/80.jpg)
Huffman encoding
advantages
maximum compression ratio assuming correct probabilities of occurrence
easy to implement and fast
disadvantages
need two passes for both encoder and decoder
one to create the frequency distribution
one to encode/decode the data
can avoid this by sending tree (takes time) or by having unchanging frequencies
![Page 81: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/81.jpg)
Modified Huffman encoding
if we know frequency of occurrences, then Huffman works very well
consider case of a fax; mostly long white spaces with short bursts of black
do the following
run length encode each string of bits on a line
Huffman encode these run length codewords
use a predefined frequency distribution
combination run length, then Huffman
![Page 82: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/82.jpg)
Beyond Huffman Coding …
1977 – Lempel & Ziv, Israeli information theorists, develop a dictionary-based compression method (LZ77)
1978 – they develop another dictionary-based compression method (LZ78)
![Page 83: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/83.jpg)
The LZ family
LZ77
LZR
LZSS
LZB
LZH (used by zip and unzip)
LZ78
LZW (Unix compress)
LZC (Unix compress)
LZT
LZMW
LZJ
LZFG
![Page 84: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/84.jpg)
Overview of LZ family
To demonstrate: take a simple alphabet containing only two letters, a and b, and create a sample stream of text
![Page 85: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/85.jpg)
LZ family overview
Rule: Separate this stream of characters into pieces of text so that each piece is the shortest string of characters that we have not seen so far.
![Page 86: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/86.jpg)
Sender : The Compressor
Before compression, the pieces of text from the breaking-down process are indexed from 1 to n:
![Page 87: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/87.jpg)
Indices are used to number the pieces of data.
The empty string (start of text) has index 0.
The piece indexed by 1 is a. Thus a, together with the initial (empty) string, must be numbered 0a.
String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a.
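A minimal sketch of this numbering step, assuming the stream has already been split into pieces. The list of pieces below is a hypothetical sample (chosen so that piece 1 is a and piece 2 is aa, as in the text), and `number_pieces` is an illustrative name.

```python
def number_pieces(pieces):
    """Rename each piece as <index of its prefix><new last character>.

    Assumes every piece is an earlier piece (or the empty string)
    followed by one new character.
    """
    index = {"": 0}               # the empty string (start of text) has index 0
    codes = []
    for i, piece in enumerate(pieces, start=1):
        prefix, last = piece[:-1], piece[-1]
        codes.append(f"{index[prefix]}{last}")
        index[piece] = i          # this piece can now serve as a prefix
    return codes

pieces = ["a", "aa", "b", "ab", "bb", "aaa", "ba", "aaaa", "aab", "aabb"]
print(number_pieces(pieces))
# ['0a', '1a', '0b', '1b', '3b', '2a', '3a', '6a', '2b', '9b']
```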
![Page 88: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/88.jpg)
The process of renaming pieces of text starts to pay off.
Small integers replace what were once long strings of characters.
We can now throw away our old stream of text and send the encoded information to the receiver.
![Page 89: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/89.jpg)
Bit Representation of Coded Information
Now, want to calculate the number of bits needed
each chunk is an integer and a letter
number of bits depends on size of table permitted in the dictionary
every character will occupy 8 bits because it will be represented in US ASCII format
![Page 90: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/90.jpg)
Compression good?
in a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.
example: 12 bits to transmit the code 2b instead of 24 bits (8 + 8 + 8) needed for the actual text aab.
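The 12-bit figure can be reproduced with a short calculation. The slide does not say how the index is sized, so this sketch assumes a fixed-width index just wide enough for every entry of the dictionary (here 11 entries, indices 0 to 10, which needs 4 bits) plus 8 bits for the character. The function name and the dictionary size are assumptions for illustration.

```python
def bits_per_chunk(dictionary_size, char_bits=8):
    """Bits for one (index, character) chunk with a fixed-width index.

    `dictionary_size` counts every index value, including 0 for the
    empty string.
    """
    index_bits = max(1, (dictionary_size - 1).bit_length())
    return index_bits + char_bits

chunk = bits_per_chunk(11)        # indices 0..10 fit in 4 bits
print(chunk)                      # 12 bits for a code such as 2b
print(3 * 8)                      # 24 bits for the plain text aab
print(10 * chunk, 24 * 8)         # 120 bits coded vs 192 bits for a 24-character text
```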
![Page 91: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/91.jpg)
Receiver: The Decompressor (Implementation)
receiver knows exactly where boundaries are, so no problem in reconstructing the stream of text.
Preferable to decompress the file in one pass; otherwise, we will encounter a problem with temporary storage.
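A one-pass decompressor can be sketched as follows, assuming each received chunk is an (index, character) pair as described for the compressor. The chunk list is a hypothetical sample and `decompress` is an illustrative name.

```python
def decompress(chunks):
    """Rebuild the text from (index, character) chunks in a single pass."""
    table = [""]                  # index 0 is the empty string
    out = []
    for index, ch in chunks:
        piece = table[index] + ch # earlier piece plus the new character
        table.append(piece)       # the receiver rebuilds the same dictionary
        out.append(piece)
    return "".join(out)

chunks = [(0, "a"), (1, "a"), (0, "b"), (1, "b"), (3, "b"),
          (2, "a"), (3, "a"), (6, "a"), (2, "b"), (9, "b")]
print(decompress(chunks))
# aaababbbaaabaaaaaaabaabb
```

Because every index refers only to a piece that has already been received, the receiver never has to look ahead or buffer undecoded input.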
![Page 92: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/92.jpg)
Lempel-Ziv applet
See http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/#JavaApplet
![Page 93: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/93.jpg)
Lempel-Ziv-Welch (LZW)
previous methods worked only on characters
LZW works by encoding strings
some strings are replaced by a single codeword
for now assume codeword is fixed (12 bits)
for 8 bit characters, first 256 (or less) entries in table are reserved for the characters
rest of table (codes 256-4095) represents strings
![Page 94: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/94.jpg)
LZW compression
trick is that string-to-codeword mapping is created dynamically by the encoder
also recreated dynamically by the decoder
need not pass the code table between the two
is a lossless compression algorithm
degree of compression hard to predict
depends on data, but gets better as codeword table contains more strings
![Page 95: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/95.jpg)
LZW encoder
Initialize table with single character strings
STRING = first input character
WHILE not end of input stream
    CHARACTER = next input character
    IF STRING + CHARACTER is in the string table
        STRING = STRING + CHARACTER
    ELSE
        Output the code for STRING
        Add STRING + CHARACTER to the string table
        STRING = CHARACTER
END WHILE
Output code for STRING
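The pseudocode above translates almost line for line into Python. This is a sketch under two assumptions: input characters have codes below 256, and the table is allowed to grow without the 12-bit limit. The name `lzw_encode` is chosen for illustration.

```python
def lzw_encode(data):
    """LZW-encode a string into a list of integer codes."""
    # Initialize table with single character strings (codes 0..255).
    table = {chr(i): i for i in range(256)}
    if not data:
        return []
    string = data[0]
    output = []
    for character in data[1:]:
        if string + character in table:
            string += character                   # keep extending the match
        else:
            output.append(table[string])          # output the code for STRING
            table[string + character] = len(table)  # next free code
            string = character
    output.append(table[string])                  # flush the final match
    return output

print(lzw_encode("BABAABAAA"))
# [66, 65, 256, 257, 65, 260]
```

Nine characters are sent as six codes; the table entries BA (256), AB (257), BAA (258), ABA (259) and AA (260) are created along the way.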
![Page 96: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/96.jpg)
Demonstrations
Another animated LZ algorithm … http://www.data-compression.com/lempelziv.html
![Page 97: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/97.jpg)
LZW encoder example
compress the string BABAABAAA
![Page 98: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/98.jpg)
LZW decoder
![Page 99: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/99.jpg)
Lempel-Ziv compression
a lossless compression algorithm
All encodings have the same length
But may represent more than one character
Uses a “dictionary” approach – keeps track of characters and character strings already encountered
![Page 100: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/100.jpg)
LZW decoder example
decompress the string <66><65><256><257><65><260>
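The decoder's steps are not spelled out in this text, so the following is a sketch of the usual LZW decoding procedure, with the same assumptions as for the encoder (codes 0 to 255 are single characters, no table size limit). The name `lzw_decode` is illustrative. The code sequence above exercises the one tricky case: code 260 arrives before the decoder has created entry 260.

```python
def lzw_decode(codes):
    """Decode a list of LZW codes back into a string."""
    table = {i: chr(i) for i in range(256)}
    if not codes:
        return ""
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code not in the table yet: it must be the entry the encoder
            # just created, i.e. previous string + its own first character.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid code {code}")
        output.append(entry)
        table[len(table)] = previous + entry[0]   # same entry the encoder added
        previous = entry
    return "".join(output)

print(lzw_decode([66, 65, 256, 257, 65, 260]))
# BABAABAAA
```

The decoder's table always lags the encoder's by one entry, which is why the special case is needed and why the table never has to be transmitted.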
![Page 101: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/101.jpg)
LZW Issues
compression better as the code table grows
what happens when all 4096 locations in string table are used?
A number of options, but encoder and decoder must agree to do the same thing
do not add any more entries to table (as is)
clear codeword table and start again
clear codeword table and start again with larger table/longer codewords (GIF format)
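The first option (stop adding entries once the table is full) can be sketched as below. The parameter `max_codes` and both function names are assumptions for illustration. The point is that encoder and decoder apply the same cap, so their tables stay in step.

```python
def lzw_encode_capped(data, max_codes=4096):
    """LZW encoder that stops adding entries once the table is full."""
    table = {chr(i): i for i in range(256)}
    if not data:
        return []
    string, output = data[0], []
    for character in data[1:]:
        if string + character in table:
            string += character
        else:
            output.append(table[string])
            if len(table) < max_codes:            # table full: leave it as is
                table[string + character] = len(table)
            string = character
    output.append(table[string])
    return output

def lzw_decode_capped(codes, max_codes=4096):
    """Matching decoder; must use the same max_codes as the encoder."""
    table = {i: chr(i) for i in range(256)}
    if not codes:
        return ""
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        entry = table[code] if code in table else previous + previous[0]
        output.append(entry)
        if len(table) < max_codes:                # same rule as the encoder
            table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(output)

codes = lzw_encode_capped("BABAABAAA", max_codes=258)
print(codes)                                      # [66, 65, 256, 257, 65, 65, 65]
print(lzw_decode_capped(codes, max_codes=258))    # BABAABAAA
```

With only two string entries allowed, the output is one code longer than in the uncapped case, which illustrates why compression improves as the table grows.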
![Page 102: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/102.jpg)
LZW advantages/disadvantages
advantages
simple, fast and good compression
can do compression in one pass
dynamic codeword table built for each file
decompression recreates the codeword table so it does not need to be passed
disadvantages
not the optimum compression ratio
actual compression hard to predict
![Page 103: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/103.jpg)
Entropy methods
all previous methods are lossless and entropy based
lossless methods are essential for computer data (zip, gzip, etc.)
combination of run length encoding/Huffman is a standard tool
lossless methods are also often used as a subroutine by lossy methods (JPEG, MPEG)
![Page 104: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/104.jpg)
![Page 105: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/105.jpg)
String Searching
Background
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm
Fingerprinting and the Karp-Rabin algorithm
![Page 106: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/106.jpg)
![Page 107: Arrays and Strings CSCI 2720 University of Georgia Spring 2007](https://reader036.vdocuments.us/reader036/viewer/2022081516/56649e9a5503460f94b9cc6c/html5/thumbnails/107.jpg)