on the virtues of the cumulative distribution · no. of comparisons needed for sorting • there...
TRANSCRIPT
![Page 1: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/1.jpg)
On the virtues of the cumulative distribution
function
1
![Page 2: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/2.jpg)
matlab code
2
![Page 3: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/3.jpg)
CDFs and sorting
• To Create a CDF you need to sort
• Can knowing the CDF help you sort?
3
![Page 4: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/4.jpg)
No. of comparisons needed for sorting
• There are n! permutations
• Identifying the correct permutation requires log(n!) comparisons.
• log(n!) ~ n log n
• Merge-sort achieves time O(n log n)
4
![Page 5: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/5.jpg)
Efficient sorting for uniform distribution
• n Elements drawn IID from uniform distribution over [0,1]
• A = array[1:n] of lists
• given x insert it into list at A[floor(x n)]
• Lists would rarely have > 1 element
• time: O(n)
5
![Page 6: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/6.jpg)
Efficient sorting using the CDF
x
A1
n
6
![Page 7: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/7.jpg)
Efficient sorting for an IID source
• Source is IID but distribution is unknown
• Sort first m instances to estimate CDF
• Use CDF to sort rest of data.
• How large should m be?
7
![Page 8: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/8.jpg)
Stability of empirical CDF
250 instances 1000 instances
8
![Page 9: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/9.jpg)
The difference between two empirical CDFs
% input: samples d1,d2n = length(d1) % assuming length(d1)==length(d2)vals = [d1; d2];labels = [repmat(1,n,1); repmat(-1,n,1)];
[s,i] = sort(vals); % sort labels by s_labels = labels(i); % increasing vals
d = cumsum(s_labels);plot(0:length(d),[0;d]);
9
![Page 10: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/10.jpg)
Typical cumulative difference
10
![Page 11: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/11.jpg)
Typical cumulative difference
11
![Page 12: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/12.jpg)
Maximal divergence of a returning random walk
12
![Page 13: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/13.jpg)
The Glivenko-Cantelli Theorem
= empirical CDF using n random instancesFn(x)= CDFF (x)
! = error tolerance
Pxn
1
!
maxx
|Fn(x) ! F (x)| " !
"
# ae!bn!
2
Elegant proof.Devroye, Gyorfi, Lugosi / A probabilistic Theory of Pattern Recognition, page 192Best Constants: a=b=2, Complex proof. MASSART, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. of Probability 18, 1269-1293.
xn
1 = !x1, x2, . . . , xn" = sample
13
![Page 14: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/14.jpg)
Empirical test of the Glivenko-Cantelli Theorem
250 instances 1000 instances
14
![Page 15: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/15.jpg)
The Kolmogorov-Smirnoff Test
the maximal difference between the two empirical CDFs
15
![Page 16: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/16.jpg)
KS for hue distributions
16
![Page 17: On the virtues of the cumulative distribution · No. of comparisons needed for sorting • There are n! permutations • Identifying the correct permutation requires log(n!) comparisons](https://reader031.vdocuments.us/reader031/viewer/2022022511/5ae17e137f8b9ac0428ecff3/html5/thumbnails/17.jpg)
Conclusions
• Computing CDFs ~ Sorting.
• CDFs are very stable for all distributions
• Proof: Glivenko Cantelli - related to properties of random walks.
• KS test is a powerful test for n>1000
• When there is systematic variation - KS might be too sensitive.
• For colors of fruit - we need to estimate parameters of distribution - Use distribution model.
17