Post on 02-Sep-2021
EE 372 Introductory Lecture
4 April 2007
Quantization, Compression, and Classification: Extracting discrete information from a continuous world
Robert M. Gray
Information Systems Laboratory, Dept of Electrical Engineering
Stanford, CA 94305
rmgray@stanford.edu
Underlying research partially supported by the National Science Foundation, Norsk Electro-Optikk, and Hewlett Packard Laboratories.
Quantization 1
Introduction
[Figure: continuous world → encoder E → discrete representations (011100100 · · ·, 13, “Picture of a privateer”) → decoder D]
How well can you do it? What do you mean by “well”?
How do you do it?
A dictionary definition of quantization: the division of a quantity into a discrete number of small parts, often assumed to be integral multiples of a common quantity.
In general, the mapping of a continuous quantity into a discrete quantity.
Converting a continuous quantity into a discrete quantity causes loss of accuracy and information.
Goal: minimize the loss
but need to define “loss” first . . .
and there are other constraints.
Quantization
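Old example: Round off real numbers to nearest integer: estimate densities by histograms [Sheppard (1898)]
In general, quantizer $q$ of a space $A$ (e.g., $\mathbb{R}^k$, $L_2([0,1]^2)$) =
• An encoder $\mathcal{E} : A \to \mathcal{I}$, $\mathcal{I}$ an index set, e.g., nonnegative integers. ⇔ partition $\mathcal{S} = \{S_i;\ i \in \mathcal{I}\}$: $S_i = \{x : \mathcal{E}(x) = i\}$.
• A decoder $\mathcal{D} : \mathcal{I} \to \mathcal{C}$ ⇔ codebook $\mathcal{C} = \mathcal{D}(\mathcal{E}(A))$.
[Figure: X → Encoder $\mathcal{E}$ → $i = \mathcal{E}(X)$ → Decoder $\mathcal{D}$ → $\mathcal{D}(\mathcal{E}(X))$]
⇒ $q = (\mathcal{E}, \mathcal{D}) = (\mathcal{S}, \mathcal{C}) = \{S_i, \mathcal{D}(i);\ i \in \mathcal{I}\}$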
Examples
Scalar quantizer:
[Figure: the real line x divided into partition cells S0, S1, S2, S3, S4, with codebook points D(0), D(1), D(2), D(3), D(4) marked.]
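As a minimal sketch of the encoder/decoder pair for a scalar quantizer: the threshold and codebook values below are illustrative assumptions, not values from the lecture, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical 5-cell scalar quantizer on the real line (values are assumptions).
thresholds = [-1.5, -0.5, 0.5, 1.5]      # boundaries between cells S0..S4
codebook = [-2.0, -1.0, 0.0, 1.0, 2.0]   # reproduction values D(0)..D(4)

def encode(x):
    """Encoder E: return the index i of the cell S_i that contains x."""
    return bisect_right(thresholds, x)

def decode(i):
    """Decoder D: return the codebook entry for index i."""
    return codebook[i]

def quantize(x):
    """Overall quantizer q(x) = D(E(x))."""
    return decode(encode(x))
```

Here the partition is stored implicitly through its cell boundaries, so encoding is a binary search rather than a comparison against every codeword.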
Two dimensional example: Centroidal Voronoi diagram (nearest neighbor partition, Euclidean centroids). E.g., complex numbers, nearest mail boxes, sensors, repeaters, fish fortresses . . .
[Figure: territories of the male Tilapia mossambica [G. W. Barlow, Hexagonal Territories, Animal Behavior, Volume 22, 1974]]
Three dimensions: Voronoi partition around spheres
http://www-math.mit.edu/dryfluids/gallery/
Performance: Distortion
Quality of a quantizer measured by the goodness of the resulting reproduction in comparison to the original.
Assume a distortion measure d(x, y) which measures penalty/cost if an input x results in an output y = D(E(x)).
Small (large) distortion ⇔ good (bad) quality
Performance measured by average distortion: assume X is a random object (variable, vector, process, field) described by probability distribution $P_X$: pixel intensities, samples, features, fields, DCT or wavelet transform coefficients, moments, . . .
E.g., if $A = \mathbb{R}^k$, $P_X$ ⇔ pdf $f$, or empirical distribution $P_{\mathcal{L}}$ from training/learning set $\mathcal{L} = \{x_l;\ l = 1, 2, \ldots, |\mathcal{L}|\}$
Average Distortion:
$$D(q) = D(\mathcal{E}, \mathcal{D}) = E[d(X, \mathcal{D}(\mathcal{E}(X)))] = \int d(x, \mathcal{D}(\mathcal{E}(x)))\, dP_X(x)$$
Examples: $x, y \in \mathbb{R}^k$
• Mean squared error (MSE): $d(x, y) = \|x - y\|^2 = \sum_{i=1}^{k} |x_i - y_i|^2$
• Input or output weighted quadratic, $B_x$ positive definite matrix: $d(x, y) = (x - y)^t B_x (x - y)$ or $(x - y)^t B_y (x - y)$
Used in perceptual coding, statistical classification.
• Functions of norms: $d(x, y) = \rho(\|x - y\|_p)$, where the $\ell_p$ norm for $p \ge 1$ is
$$\|x - y\|_p = \Big(\sum_{i=0}^{k-1} |x_i - y_i|^p\Big)^{1/p}$$
and $\rho$ is convex.
Mostly of mathematical interest, limited demonstrated usefulness in applications.
• Vector X might be a pmf; distortion could be a distribution distance, relative entropy (Kullback-Leibler distance), etc.
• Bayes distortion, “task-driven” quantization.
[Figure: Y → SENSOR $P_{X|Y}$ → X → ENCODER $\mathcal{E}$ → $\mathcal{E}(X) = i$ → DECODERS $\kappa$ and $\mathcal{D}$ → $\hat{Y} = \kappa(i)$, $\hat{X} = \mathcal{D}(i)$]
e.g., X = Y + noise: remote or noisy source coding
“denoise” by compression
Overall distortion: Bayes risk $C(y, \hat{y})$. Define quantizer distortion
$$d(x, \kappa(i)) = E[C(Y, \kappa(i)) \mid X = x] \Rightarrow E[d(X, \kappa(\mathcal{E}(X)))] = E[C(Y, \kappa(\mathcal{E}(X)))]$$
Y discrete ⇒ classification, detection
Y continuous ⇒ estimation, regression
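The squared-error and weighted quadratic distortion measures listed above, and the average distortion taken over an empirical training set, can be sketched as follows. The function names are hypothetical, vectors are plain Python lists, and the matrix B is assumed to be given as a list of rows.

```python
def squared_error(x, y):
    """MSE distortion d(x, y) = sum_i |x_i - y_i|^2 for equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def weighted_quadratic(x, y, B):
    """d(x, y) = (x - y)^t B (x - y), with B given as a list of rows."""
    e = [a - b for a, b in zip(x, y)]
    k = len(e)
    return sum(e[i] * B[i][j] * e[j] for i in range(k) for j in range(k))

def average_distortion(training_set, quantizer, d=squared_error):
    """Empirical D(q): mean of d(x, q(x)) over the training set."""
    return sum(d(x, quantizer(x)) for x in training_set) / len(training_set)
```

For example, `average_distortion(data, lambda x: [round(c) for c in x])` would estimate the MSE of rounding each coordinate to the nearest integer, the "old example" of quantization.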
Performance: Rate
Want average distortion small, doable with big codebooks and small cells, but there is usually a cost (instantaneous rate) $r(i)$ of choosing index $i$, with average rate $R(q) = E[r(\mathcal{E}(X))]$, e.g.,
• $r(i) = \ln |\mathcal{C}|$: fixed-rate, classic quantization
• $r(i) = \ell(i)$: variable-rate quantization, length function $\ell$ satisfies Kraft inequality $\sum_i e^{-\ell(i)} \le 1$
• $r(i) = -\ln p(i)$, $p(i) = \Pr(\mathcal{E}(X) = i)$: Shannon codelengths, $R(q) = H(\mathcal{E}(X)) = -\sum_i p(i) \ln p(i)$, entropy coding
• $r(i) = (1 - \eta)\ell(i) + \eta \ln |\mathcal{C}|$, $\eta \in [0, 1]$: combined constraints. Proposed by Zador (1982) to unify separate cases.
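The rate definitions above can be evaluated directly. The following is a small sketch with hypothetical function names, with all rates in nats to match the natural logarithms used on the slide.

```python
import math
from collections import Counter

def fixed_rate(codebook_size):
    """r(i) = ln |C| nats for every index."""
    return math.log(codebook_size)

def kraft_sum(lengths):
    """Sum of e^{-l(i)}; a valid length function (in nats) gives a value <= 1."""
    return sum(math.exp(-l) for l in lengths)

def entropy_rate(indices):
    """Empirical H(E(X)) = -sum_i p(i) ln p(i) from a list of encoder outputs."""
    n = len(indices)
    return -sum((c / n) * math.log(c / n) for c in Counter(indices).values())

def combined_rate(length, codebook_size, eta):
    """r(i) = (1 - eta) * l(i) + eta * ln |C| for eta in [0, 1]."""
    return (1 - eta) * length + eta * math.log(codebook_size)
```

With four equally likely indices the entropy rate equals the fixed rate ln 4, and the Shannon codelengths l(i) = ln 4 meet the Kraft inequality with equality.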
Quantizer Model
$\ell$ best thought of as part of quantizer, a design choice.
Alternative view: codeword weighting $w(i) = e^{-\ell(i)}$, $w \Leftrightarrow \ell$
$\ell(i) = \infty \Leftrightarrow w(i) = 0$; infinite cost, never used.
$N(w) = |\{i : w(i) > 0\}| = |\{i : \ell(i) \text{ finite}\}|$: codebook size
Kraft inequality on $\ell$ ⇔ $w$ a sub-pmf: $w(i) \ge 0$, $\sum_i w(i) \le 1$
$$r(i) = (1 - \eta) \ln \frac{1}{w(i)} + \eta \ln N(w)$$
⇒ $p$ must be absolutely continuous wrt $w$ for finite rate.
Quantizer model is $q = (\mathcal{E}, \mathcal{D}, \ell) = (\mathcal{S}, \mathcal{C}, w)$
Major Issues
• If the distributions are known,
– what is the optimal tradeoff of distortion $D(\mathcal{E}, \mathcal{D})$ vs. rate $R(\mathcal{E}, w)$? (Want both to be small!)
– how to design good codes?
• If the distributions are not known, how to use training/learning data to estimate tradeoffs and design good codes? (clustering, statistical or machine learning)
Applications
• Communications & signal processing: A/D conversion, data compression. Compression required for efficient transmission
– send more data in available bandwidth
– send the same data in less bandwidth
– more users on same bandwidth
and storage
Graphic courtesy of Jim Storer.
• Statistical clustering: grouping bird songs, designing Mao suits, grouping gene features, taxonomy.
• Placing points in space: mailboxes, wireless sensors
• Image and video classification/segmentation
• Speech recognition and speaker identification
• Numerical integration and optimal quadrature rules, e.g., how best to choose $\{S_i, \mathcal{D}(i)\}$ for $\int g(x) f(x)\, dx \approx \sum_i P_X(S_i)\, g(\mathcal{D}(i))$?
• Approximating continuous probabilistic models by discrete models. Simulating continuous processes.
• Fitting and choosing Gauss mixture models for observed data.
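To make the numerical integration item concrete, here is a small sketch of the quantizer-based quadrature rule, the sum over i of P_X(S_i) g(D(i)). The source (uniform on [0, 1]), the uniform 10-cell partition with midpoint codewords, and the integrand g(x) = x^2 are all illustrative assumptions.

```python
def quantizer_quadrature(g, cell_probs, codebook):
    """Approximate the integral of g(x) f(x) dx by sum_i P_X(S_i) * g(D(i))."""
    return sum(p * g(y) for p, y in zip(cell_probs, codebook))

# Assumed setup: X uniform on [0, 1], N equal cells, codewords at cell midpoints.
N = 10
cell_probs = [1.0 / N] * N
codebook = [(i + 0.5) / N for i in range(N)]

approx = quantizer_quadrature(lambda x: x * x, cell_probs, codebook)
exact = 1.0 / 3.0  # integral of x^2 over [0, 1]
print(approx, exact)
```

For this uniform case the rule reduces to the midpoint rule; a non-uniform density f would call for a partition and codebook matched to f, which is the design question the bullet raises.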
Distortion-rate optimization: q = (E ,D, w)
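$D(q) = E[d(X, \mathcal{D}(\mathcal{E}(X)))]$ vs. $R(q) = E[r(\mathcal{E}(X))]$
Minimize $D(q)$ for $R(q) \le R$: $\delta(R) = \inf_{q : R(q) \le R} D(q)$
Minimize $R(q)$ for $D(q) \le D$: $r(D) = \inf_{q : D(q) \le D} R(q)$
Lagrangian approach: $\rho_\lambda(x, i) = d(x, \mathcal{D}(i)) + \lambda r(i)$, $\lambda \ge 0$
Minimize Lagrangian distortion
$$\rho(f, \lambda, \eta, q) \triangleq E\rho_\lambda(X, \mathcal{E}(X)) = D(q) + \lambda R(q) = E\,d(X, \mathcal{D}(\mathcal{E}(X))) + \lambda\left[(1 - \eta) E\,\ell(\mathcal{E}(X)) + \eta \ln N(\ell)\right]$$
$$\rho(f, \lambda, \eta) = \inf_q \rho(f, \lambda, \eta, q)$$
Traditional cases: fixed-rate $\eta = 1$, variable-rate $\eta = 0$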
Three Theories of Quantization
• Rate-Distortion Theory: Shannon distortion-rate function $D(R) \le \delta(R)$, achievable in asymptopia of large dimension $k$ and fixed rate $R$. [Shannon (1949, 1959), Gallager (1978)]
• Nonasymptotic (Exact) results: Necessary conditions for optimal codes ⇒ iterative design algorithms ⇔ statistical clustering [Steinhaus (1956), Lloyd (1957)]
• High rate theory: Optimal performance in asymptopia of fixed dimension $k$ and large rate $R$. [Bennett (1948), Lloyd (1957), Zador (1963), Gersho (1979), Bucklew, Wise (1982)]
Lloyd Optimality Properties: $q = (\mathcal{E}, \mathcal{D}, \ell)$
Any component can be optimized for the others.
Encoder: $\mathcal{E}(x) = \arg\min_i \left(d(x, \mathcal{D}(i)) + \lambda r(i)\right)$ (minimum distortion)
Decoder: $\mathcal{D}(i) = \arg\min_y E(d(X, y) \mid \mathcal{E}(X) = i)$ (Lloyd centroid)
Codebook size: remove $i$ if $\Pr(\mathcal{E}(X) = i) = 0$ (prune)
Length function: $\ell(i) = -\ln P_X(\mathcal{E}(X) = i)$ (Shannon codelength)
Iterate steps ⇒ Lloyd clustering algorithm (rediscovered as k-means, grouped coordinate descent, alternating optimization, principal points)
If fixed length, squared error ⇒ Centroidal Voronoi partition
centroid $= \mathcal{D}(i) = E(X \mid \mathcal{E}(X) = i)$ (MMSE estimate)
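For the fixed-rate, squared-error special case, the encoder, decoder and pruning properties above reduce to the familiar k-means iteration. A minimal sketch on a training set of tuples follows; the function names and the fixed iteration count are assumptions made for illustration.

```python
def nearest(x, codebook):
    """Minimum-distortion encoder: index of the codeword closest to x."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))

def centroid(points):
    """Lloyd centroid under squared error: the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def lloyd(training, codebook, iterations=20):
    """Alternate the encoder and decoder steps; empty cells are pruned."""
    for _ in range(iterations):
        cells = {}
        for x in training:
            cells.setdefault(nearest(x, codebook), []).append(x)
        # Only indices that received training points survive (zero-probability pruning).
        codebook = [centroid(cells[i]) for i in sorted(cells)]
    return codebook
```

Each pass cannot increase the empirical average distortion, since the nearest-neighbor encoder is optimal for the current codebook and the centroid is optimal for the current cell.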
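Lloyd conditions ⇒
• Encoder partition $\mathcal{S}$ ⇒ optimal decoder $\mathcal{D} = \mathcal{D}(\mathcal{S})$ and length function $\ell = \ell(\mathcal{S})$
• Decoder $\mathcal{D}$ + length function $\ell$ ⇒ optimal encoder partition $\mathcal{S} = \mathcal{S}(\mathcal{D}, \ell)$
Equivalent optimization problems:
$$\rho(f, \lambda, \eta) = \inf_{\mathcal{S}, \mathcal{D}, w} \rho(f, \lambda, \eta, \mathcal{S}, \mathcal{D}, w) = \inf_{\mathcal{S}} \rho(f, \lambda, \eta, \mathcal{S}) = \inf_{\mathcal{D}, w} \rho(f, \lambda, \eta, \mathcal{D}, w)$$
Good quantizer characterized by either partition $\mathcal{S}$ or by weighted decoder $(\mathcal{D}, w)$. Suggests another Lloyd-style optimality condition.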
Subcodes and Supercodes, Pruning and Growing
A partition $\mathcal{S}'$ is a subpartition of a partition $\mathcal{S}$ if every atom of $\mathcal{S}'$ is a union of atoms of $\mathcal{S}$: $\mathcal{S}$ refines $\mathcal{S}'$, or $\mathcal{S}' \subset \mathcal{S}$.
A quantizer $q'$ determined by $\mathcal{S}'$ is a partition subcode of the quantizer $q$ determined by $\mathcal{S}$ if $\mathcal{S}'$ is a subpartition of $\mathcal{S}$. $q$ is a partition supercode of $q'$.
A weighted codebook $(\mathcal{D}', w')$ is a codebook subcode of a weighted codebook $(\mathcal{D}, w)$ if $\{\mathcal{D}'(i),\ i : w'(i) > 0\} \subset \{\mathcal{D}(i),\ i : w(i) > 0\}$ and $w'_i = \alpha w_i$, $\alpha \in (0, 1)$. Conversely, $(\mathcal{D}, w)$ is a codebook supercode of $(\mathcal{D}', w')$.
Generally the smaller code has larger distortion and smaller rate.
Provides an additional Lloyd-style necessary condition:
Codebook size: If a quantizer $q$ is optimal there can be no subcode or supercode $q'$ for which $D(q') + \lambda R(q') < D(q) + \lambda R(q)$. (pruning/growing)
Generalization of zero-probability cell pruning.
Can test likely subcodes/supercodes for possible improvement.
Can also use to increase/decrease λ and grow or prune code.
Combining everything ⇒ Lloyd clustering algorithm
A descent algorithm, so distortion converges.
Step 0: Initialization. Given initial $(\mathcal{D}_0, w_0)$. Compute $\rho_0 = \rho(\mathcal{S}(\mathcal{D}_0, w_0), \mathcal{D}_0, w_0)$. Set $m = 1$.
Step 1: Partition improvement. Given $(\mathcal{D}_{m-1}, w_{m-1})$, form an optimum partition $\mathcal{S}_m = \mathcal{S}(\mathcal{D}_{m-1}, w_{m-1})$.
Step 2: Weighted codebook improvement. Given the partition $\mathcal{S}_m$, form an optimum weighted codebook $(\mathcal{D}_m, w_m) = (\mathcal{D}(\mathcal{S}_m), w(\mathcal{S}_m))$. Compute $\rho_m = \rho(\mathcal{S}_m, \mathcal{D}_m, w_m)$.
Step 3: Test. Test $\rho_{m-1} - \rho_m$. If small enough, go to Step 4. Else set $m = m + 1$, go to Step 1.
Step 4: Grow/Prune. Test sub/super codes for possible improvement for fixed $\lambda$ or for changed $\lambda$. Quit or go to Step 1.
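Steps 0 through 3 can be sketched for the variable-rate case (eta = 0) with scalar inputs and squared error. This is an illustration under stated assumptions rather than the lecture's implementation: the function name `ecvq`, the tolerance, and the iteration cap are hypothetical. Step 4 is reduced to dropping indices that receive no training points.

```python
import math

def ecvq(training, codebook, weights, lam, tol=1e-9, max_iter=100):
    """Scalar sketch of Steps 0-3 with eta = 0 and squared error.

    codebook: reproduction values D(i); weights: w(i) > 0 with sum <= 1.
    Returns (codebook, weights, rho) where rho = D(q) + lam * R(q).
    """
    prev = float("inf")
    rho = prev
    n = len(training)
    for _ in range(max_iter):
        # Step 1: optimum partition for the current weighted codebook,
        # using cost d(x, D(j)) + lam * l(j) with l(j) = -ln w(j).
        cells = {}
        for x in training:
            i = min(range(len(codebook)),
                    key=lambda j: (x - codebook[j]) ** 2 - lam * math.log(weights[j]))
            cells.setdefault(i, []).append(x)
        # Step 2: centroids and Shannon weights; unused indices are pruned.
        codebook = [sum(c) / len(c) for c in cells.values()]
        weights = [len(c) / n for c in cells.values()]
        dist = sum((x - y) ** 2 for y, c in zip(codebook, cells.values()) for x in c) / n
        rate = -sum(w * math.log(w) for w in weights)
        rho = dist + lam * rate
        # Step 3: stop when the improvement is small enough.
        if prev - rho <= tol:
            break
        prev = rho
    return codebook, weights, rho
```

Raising `lam` penalizes rate more heavily, so rarely used codewords tend to lose their cells and disappear, which is the pruning behavior described above.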
Bayes quantization
Using Bayes distortion for statistical classification or regression with a rate constraint, e.g., with Hamming Bayes cost, $E[C(Y, \kappa(\mathcal{E}(X)))] = P_e$ and
$$\kappa(i) = \arg\max_y \Pr(Y = y \mid \mathcal{E}(X) = i)$$
$$\mathcal{E}(x) = \arg\max_i \left[\Pr(Y = \kappa(i) \mid X = x) - \lambda r(i)\right]$$
Need to estimate $P_{Y|X}$ via model or binning/quantization of X
Can add constraints, e.g., constrain encoder to have low-complexity tree structure. (CART/BFOS/TSVQ)
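The two Hamming-cost rules above can be sketched with empirical frequencies standing in for the unknown probabilities. The function names and data layout are assumptions, and the posterior Pr(Y = y | X = x) is assumed to be supplied by some separate estimator.

```python
from collections import Counter, defaultdict

def bayes_decoder(indices, labels):
    """kappa(i): most frequent label among training points encoded to index i."""
    counts = defaultdict(Counter)
    for i, y in zip(indices, labels):
        counts[i][y] += 1
    return {i: c.most_common(1)[0][0] for i, c in counts.items()}

def bayes_encoder(posterior, kappa, rate, lam):
    """E(x) = argmax_i [Pr(Y = kappa(i) | X = x) - lam * r(i)].

    posterior: dict label -> estimated Pr(Y = label | X = x) for one input x.
    rate: dict index -> r(i).
    """
    return max(kappa, key=lambda i: posterior.get(kappa[i], 0.0) - lam * rate[i])
```

With `lam = 0` the encoder simply picks an index whose label is the most probable one; a positive `lam` can trade some classification accuracy for a cheaper index.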
High Rate (Resolution) Theory
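Traditional form (Zador, Gersho, Bucklew, Wise): $f$, MSE
$$\lim_{R \to \infty} e^{\frac{2}{k}R}\, \delta(R) = \begin{cases} a_k \|f\|_{k/(k+2)} & \text{fixed-rate } (R = \ln N) \\ b_k e^{\frac{2}{k}h(f)} & \text{variable-rate } (R = H) \end{cases}$$
where
$$h(f) = -\int f(x) \ln f(x)\, dx, \qquad \|f\|_{k/(k+2)} = \left(\int f(x)^{\frac{k}{k+2}}\, dx\right)^{\frac{k+2}{k}}$$
$a_k \ge b_k$ are Zador constants (depend on $k$ and $d$, not $f$!)
$$a_1 = b_1 = \frac{1}{12}, \quad a_2 = \frac{5}{18\sqrt{3}}, \quad b_2 = ?, \quad a_k, b_k = ? \text{ for } k \ge 3$$
$$\lim_{k \to \infty} a_k = \lim_{k \to \infty} b_k = \frac{1}{2\pi e}$$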
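A quick numerical check of the k = 1 fixed-rate statement, under the assumption that X is uniform on [0, 1] (so that the norm of f is 1) and the quantizer is uniform with N cells and midpoint codewords. The function name and the Riemann-sum resolution are illustrative choices.

```python
import math

def uniform_quantizer_mse(N, samples_per_cell=1000):
    """Riemann-sum estimate of E[(X - q(X))^2] for X uniform on [0, 1)."""
    M = N * samples_per_cell
    total = 0.0
    for j in range(M):
        x = (j + 0.5) / M                 # midpoint of the j-th fine grid interval
        i = min(int(x * N), N - 1)        # encoder: index of the cell containing x
        y = (i + 0.5) / N                 # decoder: cell midpoint
        total += (x - y) ** 2
    return total / M

for N in (2, 8, 32):
    R = math.log(N)                       # fixed rate in nats
    print(N, math.exp(2 * R) * uniform_quantizer_mse(N))
```

Each printed product should be close to 1/12, in line with a_1 = 1/12; for this particular source the value does not depend on N because the uniform quantizer is already matched to the uniform density.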
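Lagrangian form:
Variable-rate:
$$\lim_{\lambda \to 0} \left[\inf_q \left(\frac{\rho(f, \lambda, 0, q)}{\lambda}\right) + \frac{k}{2} \ln \lambda\right] = \theta_k + h(f)$$
$$\theta_k \triangleq \inf_{\lambda > 0} \left[\inf_q \left(\frac{\rho(u, \lambda, 0, q)}{\lambda}\right) + \frac{k}{2} \ln \lambda\right] = \frac{k}{2} \ln \frac{2e b_k}{k}$$
Fixed-rate:
$$\lim_{\lambda \to 0} \left[\inf_q \left(\frac{\rho(f, \lambda, 1, q)}{\lambda}\right) + \frac{k}{2} \ln \lambda\right] = \psi_k + \ln \|f\|_{k/(k+2)}^{k/2}$$
$$\psi_k \triangleq \inf_{\lambda > 0} \left[\inf_q \left(\frac{\rho(u, \lambda, 1, q)}{\lambda}\right) + \frac{k}{2} \ln \lambda\right] = \frac{k}{2} \ln \frac{2e a_k}{k}$$
Both have the form: optimum for uniform $u$ + function of $f$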
Proofs
Rigorous proofs of traditional cases painfully tedious, but follow Zador’s original approach.
Heuristic proofs developed by Gersho based on assumptions of
• existence of asymptotically optimal quantizer point density function $\Lambda(x)$
• existence of asymptotically optimal quantizer partition cell shape
Then hand-waving approximations of integrals relate $N(q)$, $D_f(q)$, $H_f(q)$ ⇒
$$D_f(q) \approx c_k E_f\left[\left(\frac{1}{N(q)\Lambda(X)}\right)^{2/k}\right]$$
$$H_f(q(X)) \approx h(X) - E\left[\ln\left(\frac{1}{N(q)\Lambda(X)}\right)\right] = \ln N(q) - \underbrace{\int f(x) \ln \frac{f(x)}{\Lambda(x)}\, dx}_{H(f\|\Lambda)\ \text{relative entropy}}$$
Optimizing using Hölder’s inequality or Jensen’s inequality yields classic fixed-rate and variable-rate results.
Can also apply to combined-constraint case to get conjectured solution for general $\eta$.
There appear to be connections with rigorous and heuristic proofs, which may help insight and simplify proofs.
Gersho’s conjectures and approximations have many yet unproved but commonly believed implications, e.g., $a_k = b_k$ for all $k$, the best fixed-rate high rate codes have maximum entropy, asymptotically optimal quantizer point density functions exist.
Henceforth focus on variable-rate case, $\eta = 0$.
Mismatch distortion
High-rate results ⇒ for small $\lambda$ optimal quantizer satisfies
$$\frac{\rho(f, \lambda, 0, q)}{\lambda} + \frac{k}{2} \ln \lambda \approx \theta_k + h(f)$$
Worst case: Performance depends on $f$ only through $h(f)$ ⇒ worst case source given constraints is max $h$ source, e.g., given mean $m$ and covariance $K$, worst case is Gaussian
Mismatch: Design for pdf $g$, but apply it to $f$ for small $\lambda$
$$\frac{\rho(f, \lambda, 0, q)}{\lambda} + \frac{k}{2} \ln \lambda \approx \underbrace{h(f) + \theta_k}_{\text{optimal for } f} + \underbrace{H(f\|g)}_{\text{mismatch}}$$
Put together: If $g$ Gaussian with same mean, covariance as $f$ then
$$\frac{\rho(f, \lambda, 0, q)}{\lambda} - \frac{\rho(g, \lambda, 0, q)}{\lambda} \approx 0$$
⇒ the performance of the code designed for $g$ asymptotically equals that when the code is applied to $f$, a robust code in information theoretic sense! (≈ “generalization guarantees” in machine learning)
Robust and worst-case code ⇒ $H(f\|g)$ measures mismatch distance from $f$ to Gaussian $g$ with same second-order moments, i.e., for a pdf $f$, how much performance is lost from optimal for $f$ if use optimal code for $g$ on $f$?
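The step behind the robustness claim can be written out as a short worked derivation. It relies on the standard fact that $\ln g(x)$ is quadratic in $x$ when $g$ is Gaussian, so its expectation depends only on the mean and covariance, which $f$ and $g$ share by assumption.

```latex
\begin{aligned}
H(f\|g) &= \int f(x)\ln\frac{f(x)}{g(x)}\,dx
         = -h(f) - \int f(x)\ln g(x)\,dx \\
        &= -h(f) - \int g(x)\ln g(x)\,dx
         = h(g) - h(f), \\
\theta_k + h(f) + H(f\|g) &= \theta_k + h(g).
\end{aligned}
```

So the mismatched performance on $f$ matches, to this approximation, the optimal performance on the Gaussian $g$ itself, and the loss relative to a code designed for $f$ is exactly $h(g) - h(f)$.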
Problem with single worst case: too conservative
Gaussian can be a very bad fit, very suboptimal, e.g., speech.
Divide & conquer: composite code
Alternative idea: fit a different Gauss source to distinct groups of inputs instead of the entire collection of inputs. Robust codes for local behavior.
Partition input space $\mathbb{R}^k$ into $\mathcal{S} = \{S_m;\ m = 1, \ldots, M\}$, assign a Gaussian component $g_m = \mathcal{N}(\mu_m, K_m)$ to each $S_m$, & construct an optimal quantizer $q_m = (\mathcal{E}_m, \mathcal{D}_m, \ell_m)$ for each $g_m$:
$$\frac{\rho(f, \lambda, 0, q)}{\lambda} + \frac{k}{2} \ln \lambda \approx \underbrace{\theta_k + h(f)}_{\text{optimal for } f} + \underbrace{\sum_m p_m H(f_m\|g_m)}_{\text{mismatch distortion}}$$
where $f_m(x) = f(x)/p_m$, $x \in S_m$, $p_m = \int_{S_m} f(x)\, dx$.
What is the best (smallest) we can make $\sum_m p_m H(f_m\|g_m)$ over all partitions $\mathcal{S} = \{S_m;\ m \in \mathcal{I}\}$ and collections $\mathcal{G} = \{g_m, p_m;\ m \in \mathcal{I}\}$?
⇔ Quantize $\mathbb{R}^k$ into space of Gauss models using quantizer mismatch distortion
$$d_{QM}(x, m) = -\ln p_m + \frac{1}{2} \ln |K_m| + \frac{1}{2} (x - \mu_m)^t K_m^{-1} (x - \mu_m)$$
Similar to MDI distortion, Itakura-Saito distortion, log likelihood distortion for Gaussians, Stein’s spectral distortion, . . .
Codebook $\{(\mu_m, K_m, p_m)\}$ ↔ Gauss mixture model.
Can use Lloyd algorithm to optimize: Gauss mixture vector quantization (GMVQ)
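A scalar (k = 1) sketch of one Lloyd iteration under the quantizer mismatch distortion, where the covariance K_m reduces to a variance. The function names, the model layout as (p_m, mu_m, var_m) triples, and the small variance floor are assumptions for illustration; the general case would need a matrix determinant and inverse.

```python
import math

def d_qm(x, p, mu, var):
    """Scalar case of d_QM: -ln p + 0.5 ln var + 0.5 (x - mu)^2 / var."""
    return -math.log(p) + 0.5 * math.log(var) + 0.5 * (x - mu) ** 2 / var

def gmvq_step(training, model, var_floor=1e-6):
    """One Lloyd iteration; model is a list of (p_m, mu_m, var_m) triples."""
    # Encoder step: assign each point to the Gauss component of least distortion.
    cells = {}
    for x in training:
        m = min(range(len(model)), key=lambda j: d_qm(x, *model[j]))
        cells.setdefault(m, []).append(x)
    # Decoder step: cell probability, sample mean and sample variance per cell.
    new_model = []
    for pts in cells.values():
        mu = sum(pts) / len(pts)
        var = sum((x - mu) ** 2 for x in pts) / len(pts)
        # var_floor is an assumed guard against single-point cells (ln 0).
        new_model.append((len(pts) / len(training), mu, max(var, var_floor)))
    return new_model
```

Repeating `gmvq_step` until the model stops changing yields a codebook of (p_m, mu_m, var_m) entries that can be read as a Gauss mixture model, as the slide notes.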
• GMVQ converts a training sequence into a GM model
• Derivation was based on high-rate analysis of quantizing the original source using a collection of codes designed for local Gaussian models. Minimizing the overall MSE ⇔ GMVQ
• Alternative to Baum-Welch/EM algorithm for GM design.
Advantages over EM:
• Lower complexity (about 1/2).
• Faster convergence.
• Incorporates compression ⇒ can use low complexity search, e.g., TSVQ, useful for large dimensions, large datasets, networks.
Examples
Aerial images: man-made vs. natural. White: man-made, gray: natural
Original 8 bpp gray-scale images, hand-labeled classified images.
GMVQ with probability of error 12.23%
Content-Addressable Databases/Image Retrieval
8 × 8 blocks, 9 images train for each of 50 classes, 500 image database (cross-validated)
Precision (fraction of retrieved images that are relevant) ≈ recall (fraction of relevant images that are retrieved) ≈ .94.
[Figure: query examples]
Texture Classification
Examples from Brodatz database
Compared well with Gauss mixture random field methods.
North Sea Gas Pipeline Data
[Figure: three images (normal pipeline image, field joint, longitudinal weld): which is which???]
• GMVQ: best codebook wins
• MAP-GMVQ: plug GMVQ-produced GMM into MAP
• MAP-EM: plug EM-produced GMM into MAP
• 1-NN
• MART (boosted classification tree)
• Regularized QDA (Gaussian class models)
• MAP-ECVQ (GM class models)
Method      Recall                    Precision                 Accuracy
            S       V       W         S       V       W
MART        0.9608  0.9000  0.8718    0.9545  0.9000  0.8947    0.9387
Reg. QDA    0.9869  1.0000  0.9487    0.9869  0.9091  1.0000    0.9811
1-NN        0.9281  0.7000  0.8462    0.9221  0.8750  1.0000    0.8915
MAP-ECVQ    0.9737  0.9000  0.9437    0.9739  0.9000  0.9487    0.9623
MAP-EM      0.9739  0.9000  0.9487    0.9739  0.9000  0.9487    0.9623
MAP-GMVQ    0.9935  0.8500  0.9487    0.9682  1.0000  0.9737    0.9717
GMVQ        0.9673  0.8000  0.9487    0.9737  0.7619  0.9487    0.9481

Class  Subclasses
S      Normal, Osmosis Blisters, Black Lines, Small Black Corrosion Dots, Grinder Marks, MFL Marks, Corrosion Blisters, Single Dots
V      Longitudinal Welds
W      Weld Cavity, Field Joint

Recall = Pr(declare as class i | class i)
Precision = Pr(class i | declare as class i)
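The recall and precision definitions above can be computed from a confusion matrix as in the following sketch. The function name is hypothetical, and it assumes every class occurs and is declared at least once so that no denominator is zero.

```python
def recall_precision_accuracy(confusion):
    """confusion[i][j] = number of class-i items declared as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Recall: Pr(declare as class i | class i), a row-wise ratio.
    recall = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
    # Precision: Pr(class i | declare as class i), a column-wise ratio.
    precision = [confusion[i][i] / sum(confusion[r][i] for r in range(n))
                 for i in range(n)]
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    return recall, precision, accuracy
```

For a made-up two-class matrix `[[8, 2], [1, 9]]` (not the pipeline data), this gives recalls 0.8 and 0.9, precisions 8/9 and 9/11, and accuracy 0.85.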
Implementation is simple, convergence is fast, classification performance is good.
Closing Introductory Thoughts
• Quantization ideas useful for A/D conversion, quantizer error analysis, data compression, statistical classification, modeling, and estimation with a rate constraint.
• Lloyd-designed GMVQ alternative to EM for Gauss mixture design in statistical classification and clustering.
• Gersho’s approach provides intuitive, but nonrigorous, proofs of high rate results based on conjectured geometry of optimal cell shapes.
New tools may help develop rigorous proof reflecting simple heuristics & means of proving/disproving implications of Gersho’s conjecture.
• Basic tools include information theory, signal processing, optimization.