something for almost nothing: advances in sublinear time algorithms
DESCRIPTION
Something for almost nothing: Advances in sublinear time algorithms. Ronitt Rubinfeld MIT and Tel Aviv U. Algorithms for REALLY big data. No time. What can we hope to do without viewing most of the data?. Small world phenomenon. The social network is a graph: “ node’’ is a person - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/1.jpg)
Something for almost nothing: Advances in sublinear time algorithms
Ronitt RubinfeldMIT and Tel Aviv U.
![Page 2: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/2.jpg)
Algorithms for REALLY big data
![Page 3: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/3.jpg)
No time
What can we hope to do without viewing most of the data?
![Page 4: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/4.jpg)
Small world phenomenon
• The social network is a graph:– “node’’ is a person– “edge’’ between people that
know each other• “6 degrees of separation’’
• Are all pairs of people connected by path of distance at most 6?
![Page 5: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/5.jpg)
Vast data
• Impossible to access all of it
• Accessible data is too enormous to be viewed by a single individual
• Once accessed, data can change
![Page 6: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/6.jpg)
The Gold Standard
• Linear time algorithms:– for inputs encoded by n
bits/words, allow cn time steps (constant c)
• Inadequate…
![Page 7: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/7.jpg)
What can we hope to do without viewing most of the data?
• Can’t answer “for all” or “exactly” type statements:– are all individuals connected by at most 6 degrees of
separation? – exactly how many individuals on earth are left-handed?
• Compromise?– is there a large group of individuals connected by at most 6
degrees of separation?– approximately how many individuals on earth are left-handed?
![Page 8: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/8.jpg)
What types of approximation?
![Page 9: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/9.jpg)
Property Testing• Does the input object have crucial properties?
• Example Properties:
– Clusterability, – Small diameter graph, – Close to a codeword, – Linear or low degree polynomial function,– Increasing order– Lots and lots more…
![Page 10: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/10.jpg)
“In the ballpark” vs. “out of the ballpark” tests
• Property testing: Distinguish inputs that have specific property from those that are far from having that property
• Benefits:– Can often answer such questions much faster– May be the natural question to ask
• When some “noise” always present• When data constantly changing • Gives fast sanity check to rule out very “bad” inputs (i.e., restaurant bills)• Model selection problem in machine learning
![Page 11: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/11.jpg)
Examples
• Can test if a function is a homomorphism in CONSTANT TIME [Blum Luby R.]
• Can test if the social network has 6 degrees of separation in CONSTANT TIME [Parnas Ron]
![Page 12: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/12.jpg)
• Find characterization of property that is• Efficiently (locally) testable• Robust -
• objects that have the property satisfy characterization, • and objects far from having the property are unlikely to
PASS
Constructing a property tester:
Usually the bigger
challenge
![Page 13: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/13.jpg)
• A “bad” testing characterization: ∀x,y f(x)+f(y) = f(x+y)• Another bad characterization: For most x f(x)+f(1) = f(x+1) • Good characterization: For most x,y f(x)+f(y) = f(x+y)
Example: Homomorphism property of functions
![Page 14: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/14.jpg)
• A “bad” testing characterization: For every node, all other nodes within distance 6.
• Another bad one: For most nodes, all other nodes within distance 6.
• Good characterization: For most nodes, there are many other nodes within distance 6.
Example: 6 degrees of separation
![Page 15: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/15.jpg)
Many more properties studied!
• Graphs, functions, point sets, strings, …
• Amazing characterizations of problems testable in graph and function testing models!
![Page 16: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/16.jpg)
Properties of functions• Linearity and low total degree polynomials
[Blum Luby R.] [Bellare Coppersmith Hastad Kiwi Sudan] [R. Sudan] [Arora Safra] [Arora Lund Motwani Sudan Szegedy] [Arora Sudan] ...
• Functions definable by functional equations – trigonometric, elliptic functions [R.]
• All locally characterized affine invariant function classes! [Bhattacharyya Fischer Hatami Hatami Lovett]
• Groups, Fields [Ergun Kannan Kumar R. Viswanathan]
• Monotonicity [EKKRV] [Goldreich Goldwasser Lehman Ron Samorodnitsky] [Dodis Goldreich Lehman Ron Raskhodnikova Samorodnitsky][Fischer Lehman Newman Raskhodnikova R. Samorodnitsky] [Chakrabarty Seshadri]…
• Convexity, submodularity [Parnas Ron R.] [Fattal Ron] [Seshadri Vondrak]…
• Low complexity functions, Juntas [Parnas Ron Samorodnitsky] [Fischer Kindler Ron Safra Samorodnitsky]
[Diakonikolas Lee Matulef Onak R. Servedio Wan]…
![Page 17: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/17.jpg)
• Linear and low degree polynomials, trigonometric functions [Ergun Kumar Rubinfeld] [Kiwi Magniez Santha] [Magniez]…
Some properties testable even with approximation errors!
![Page 18: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/18.jpg)
Properties of sparse and general graphs
• Easily testable dense and hyperfinite graph properties are completely characterized! [Alon Fischer Newman Shapira][Newman Sohler]
• General Sparse graphs: bipartiteness, connectivity, diameter, colorability, rapid mixing, triangle free,… [Goldreich Ron] [Parnas Ron] [Batu Fortnow R. Smith White] [Kaufman Krivelevich Ron] [Alon Kaufman Krivelevich Ron]…
• Tools: Szemeredi regularity lemma, random walks, local search, simulate greedy, borrow from parallel algorithms
![Page 19: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/19.jpg)
Some other combinatorial properties
• Set properties – equality, distinctness,... • String properties – edit distance, compressibility,…• Metric properties – metrics, clustering, convex hulls,
embeddability... • Membership in low complexity languages –
regular languages, constant width branching programs, context-free languages, regular tree languages…
• Codes – BCH and dual-BCH codes, Generalized Reed-Muller,…
![Page 20: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/20.jpg)
• Yes! e.g., [Buhrman Fortnow Newman Rohrig] [Friedl Ivanyos Santha] [Friedl Santha Magniez Sen]
Can quantum property testers do better?
![Page 21: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/21.jpg)
• May only query labels of points from a sample
• Useful for model selection problem
Active property testing[Balcan Blais Blum Yang]
![Page 22: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/22.jpg)
“Traditional” approximation
• Output number close to value of the optimal solution (not enough time to construct a solution)
• Some examples:– Minimum spanning tree, – vertex cover, – max cut, – positive linear program, – edit distance, …
![Page 23: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/23.jpg)
• Given graph G(V,E), a vertex cover (VC) C is a subset of V such that it “touches” every edge.
• What is minimum size of a vertex cover?• NP-complete• Poly time multiplicative 2-approximation based on
relationship of VC and maximal matching
Example: Vertex Cover
![Page 24: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/24.jpg)
Vertex Cover and Maximal Matching
• Maximal Matching: – M E is a matching if no node in in more than one edge. – M is a maximal matching if adding any edge violates the
matching property
• Note: nodes matched by M form a pretty good Vertex Cover!– Vertex cover must include at least one node in each edge of
M– Union of nodes of M form a vertex cover– So |M| ≤ VC ≤ 2 |M|
![Page 25: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/25.jpg)
“Classical” approximation examples
• Can get CONSTANT TIME approximation for vertex cover on sparse graphs! – Output y which is at most 2∙ OPT + ϵn
How? • Oracle reduction framework [Parnas Ron]
– Construct “oracle” that tells you if node u in 2-approx vertex cover
– Use oracle + standard sampling to estimate size of cover
But how do you implement the oracle?
![Page 26: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/26.jpg)
Implementing the oracle – two approaches:
• Sequentially simulate computations of a local distributed algorithm [Parnas Ron]
• Figure out what greedy maximal matching algorithm would do on u [Nguyen Onak]
![Page 27: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/27.jpg)
Greedy algorithm for maximal matching
• Algorithm:
– For every edge (u,v)• If neither of u or v matched
–Add (u,v) to M– Output M
• Why is M maximal?– If e not in M then either u or v already matched by earlier
edge
![Page 28: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/28.jpg)
Implementing the Oracle via Greedy
• To decide if edge e in matching:– Must know if adjacent edges that come before e
in the ordering are in the matching– Do not need to know anything about edges
coming after
• Arbitrary edge order can have long dependency chains!
Odd or even steps from beginning?
1 2 4 8 25 36 47 88 89 110 111 112 113
![Page 29: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/29.jpg)
Breaking long dependency chains[Nguyen Onak]
• Assign random ordering to edges– Greedy works under any ordering– Important fact: random order has short
dependency chains
![Page 30: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/30.jpg)
Better Complexity for VC
• Always recurse on least ranked edge first• Heuristic suggested by [Nguyen Onak]
• Yields time poly in degree [Yoshida Yamamoto Ito]
• Additional ideas yield query complexity nearly linear in average degree for general graphs [Onak Ron Rosen R.]
![Page 31: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/31.jpg)
Further work
• More complicated arguments for maximum matching, set cover, positive LP… [Parnas Ron + Kuhn Moscibroda Wattenhofer] [Nguyen Onak] [Yoshida Yamamoto Ito]
• Even better results for hyperfinite graphs [Hassidim Kelner Nguyen Onak][Newman Sohler]
– e.g., planar
Can dependence be made poly in average
degree?
![Page 32: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/32.jpg)
Large inputsLarge outputs
![Page 33: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/33.jpg)
When we don’t need to see all the output…
do we need to see all the input?
![Page 34: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/34.jpg)
Locally (list-)decodable codes [Sudan-Trevisan-Vadhan, Katz-Trevisan,…]
Large message
Encoding of large message
What is the ith bit?
Input
Outputof DecodingAlgorithm
Can design codes so that few
queries needed to compute answer!!!
![Page 35: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/35.jpg)
Local decompression algorithms[Chandar Shah Wornell][Sadakane Grossi] [Gonzalez Navarro] [Ferragina Venturini] [Kreft Navarro] [Billie Landau Raman Sadakane Satti Weimann] [Dutta Levi Ron R.]…
Large data
Compression of large data
What is the ith bit?
Input
Outputof DecompressionAlgorithm
design compression scheme so that
compress well and few queries needed to compute answer
![Page 36: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/36.jpg)
Local Computation Algorithms (LCA)[R. Tamir Vardi Xie]
Input x
Output y
LCA
j xj
i1,i2,.. yi1, yi2
, …
![Page 37: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/37.jpg)
• Optimization problems [R. Tamir Vardi Xie] [Alon R. Vardi Xie]• Maximal independent set• Broadcast radio scheduling• Hypergraph coloring• K-CNF SAT
• Techniques as in “oracle reduction framework”, but oracle requirements are more stringent
What more can be done in this model?
![Page 38: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/38.jpg)
• Phase 1: Compute a partial solution which decides most nodes• simulate local algorithms• can also simulate greedy
• Phase 2: Only small components left … brute force works!• analysis similar to Beck/Alon• can also analyze via Galton-Watson
Two-phase plan:
![Page 39: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/39.jpg)
More:
• Local algorithms for• Ranking webpages• Graph partitioning[Andersen Borgs Chayes Hopcroft Mirrokni Teng] [Andersen Chung Lang] [Spielman Teng] [Borgs Brautbar Chayes Lucier]
• Property preserving data reconstruction [Ailon Chazelle Comandur Liu] [Comandur Saks] [Jha Raskhodnikova][Campagna Guo R.] [Awasthi Jha Molinaro Raskhodnikova] …
![Page 40: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/40.jpg)
LCAs and the cloud
Input x
Output y
LCALCA
LCALCALCA
![Page 41: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/41.jpg)
No samplesWhat if data only accessible via random
samples?
![Page 42: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/42.jpg)
Distributions
![Page 43: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/43.jpg)
Play the lottery?
![Page 44: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/44.jpg)
Is the lottery unfair?
• From Hitlotto.com: Lottery experts agree, past number histories can be the key to predicting future winners.
![Page 45: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/45.jpg)
True Story!
• Polish lottery Multilotek – Choose “uniformly” at random distinct 20 numbers
out of 1 to 80. – Initial machine biased
• e.g., probability of 50-59 too small
• Past results: http://serwis.lotto.pl:8080/archiwum/wyniki_wszystkie.php?id_gra=2
![Page 46: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/46.jpg)
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
![Page 47: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/47.jpg)
New Jersey Pick 3,4 Lottery
• New Jersey Pick k ( =3,4) Lottery.– Pick k digits in order. – 10k possible values. – Assume lottery draws iid
• Data:– Pick 3 - 8522 results from 5/22/75 to 10/15/00
• 2-test gives 42% confidence– Pick 4 - 6544 results from 9/1/77 to 10/15/00.
• fewer results than possible values
• 2-test gives no confidence
![Page 48: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/48.jpg)
Distributions on BIG domains
• Given samples of a distribution, need to know, e.g.,• entropy• number of distinct elements• “shape” (monotone, bimodal,…)• closeness to uniform, Gaussian, Zipfian…
• No assumptions on shape of distribution• i.e., smoothness, monotonicity, Normal distribution,…
• Considered in statistics, information theory, machine learning, databases, algorithms, physics, biology,…
![Page 49: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/49.jpg)
Key Question
• How many samples do you need in terms of domain size?• Do you need to estimate the probabilities of each
domain item?• Can sample complexity be sublinear in size of the
domain?
Rules out standard statistical techniques, learning distribution
![Page 50: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/50.jpg)
Our Aim:
Algorithms with sublinear sample complexity
![Page 51: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/51.jpg)
Similarities of distributions
Are p and q close or far?– p is given via samples– q is either
• known to the tester (e.g. uniform)• given via samples
![Page 52: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/52.jpg)
Is p uniform?
• Theorem: ([Goldreich Ron] [Batu Fortnow R. Smith White] [Paninski]) Sample complexity of distinguishing from is
• Nearly same complexity to test if p is any known distribution [Batu Fischer Fortnow Kumar R. White]: “Testing identity”
p
Test
samples
Pass/Fail? ||𝑝−𝑈||1=Σ|𝑝𝑖−1𝑛|
![Page 53: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/53.jpg)
Upper bound for L2 distance [Goldreich Ron]
• L2 distance:
• ||p-U||22 = (S pi -1/n)2
= Spi2 - 2Spi
/n + S1/n2 ========
= Spi2 - 1/n
• Estimate collision probability to estimate L2 distance from uniform
![Page 54: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/54.jpg)
Testing uniformity [GR, Batu et. al.]
• Upper bound: Estimate collision probability and use known relation between between L1 and L2 norms– Issues:
• Collision probability of uniform is 1/n • Use O(sqrt(n)) samples via recycling
– Comment: [P] uses different estimator
• Easy lower bound: (n½)– Can get (n½/2) [P]
![Page 55: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/55.jpg)
Back to the lottery…
plenty of samples!
![Page 56: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/56.jpg)
Is p uniform?
• Theorem: ([Goldreich Ron][Batu Fortnow R. Smith White] [Paninski]) Sample complexity of distinguishing p=U from |p-U|1> is (n1/2)
• Nearly same complexity to test if p is any known distribution [Batu Fischer Fortnow Kumar R. White]: “Testing identity”
p
Test
samples
Pass/Fail?
![Page 57: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/57.jpg)
Testing identity via testing uniformity on subdomains:
• (Relabel domain so that q monotone)• Partition domain into O(log n) groups, so
that each group almost “flat” -- • differ by <(1+e) multiplicative factor• q close to uniform over each group
• Test:– Test that p close to uniform over each group– Test that p assigns approximately correct total
weights to each group
q (known)
![Page 58: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/58.jpg)
Transactions of 20-30 yr olds Transactions of 30-40 yr olds
Testing closeness of two distributions:
trend change?
![Page 59: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/59.jpg)
Testing closeness
Theorem: ([BFRSW] [P. Valiant]) Sample complexity of distinguishing
p=q from ||p-q||1 >e
is (n2/3)
p
Test
Pass/Fail?
q
~
![Page 60: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/60.jpg)
Why so different?
• Collision statistics are all that matter• Collisions on “heavy” elements can hide collision
statistics of rest of the domain• Construct pairs of distributions where heavy
elements are identical, but “light” elements are either identical or very different
![Page 61: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/61.jpg)
Output ||p-q||1 ±e
need (n/log n) samples [G. Valiant P. Valiant]
Additively estimate distance?
![Page 62: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/62.jpg)
Collisions tell all
• Algorithms:• Algorithms use collisions to determine “wrong” behavior
• E.g., too many collisions implies far from uniform [GR,BFSRW]
• Use Linear Programming to determine if there is a distribution with the right collision probabilities and the right property [G. Valiant P. Valiant]
• Lower bounds:• For symmetric properties, collision statistics are only relevant
information [BFRSW] (see also [Orlitsky Santhanam Zhang] [Orlitsky Santhanam Viswanthan Zhang])
• Need new analysis tools since not independent• Central limit theorem for generalized multinomial distributions [G.
Valiant P. Valiant]
![Page 63: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/63.jpg)
Information theoretic quantities
• Entropy• Support size
![Page 64: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/64.jpg)
Information in neural spike trails
• Each application of stimuli gives sample of signal (spike trail)
• Entropy of (discretized) signal indicates which neurons respond to stimuli
Neural signals
time
[Strong, Koberle, de Ruyter van Steveninck, Bialek ’98]
![Page 65: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/65.jpg)
Compressibility of data
![Page 66: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/66.jpg)
Can we get multiplicative approximations?
• In general, no….– 0 entropy distributions are hard to distinguish
• What if entropy is bigger?– Can g-multiplicatively approximate the entropy with Õ(n1/2)
samples (when entropy >2g/e) [Batu Dasgupta R. Kumar]– requires (n1/2) [Valiant]– better bounds when support size is small [Brautbar
Samorodnitsky]– Similar bounds for estimating support size [Raskhodikova Ron
R. Smith] [Raskhodnikova Ron Shpilka Smith]
![Page 67: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/67.jpg)
Testing Independence:Shopping patterns:
Independent of zip code?
![Page 68: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/68.jpg)
Independence of pairs• p is joint distribution on pairs
<a,b> from [n] x [m] (wlog n≥m)
• Marginal distributions p1 ,p2
• p independent if p = p1 x p2 , that is p(a,b)=(p1)a(p2)b for all a,b
12
34
56
78
910 11
12 13 1415 16
12
34
56
78
910
1112
1314
1516
![Page 69: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/69.jpg)
Theorem: [Batu Fischer Fortnow Kumar R. White]
There exists an algorithm for testing independence with sample complexity O(n2/3m1/3poly(log n)) s.t.
– If p=p1 x p2, it outputs PASS
– If ||p-q||1> e for any independent q, it outputs FAIL
• Ω(n2/3m1/3) samples required [Levi Ron R.]
![Page 70: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/70.jpg)
More properties:
• Limited Independence: [Alon Andoni Kaufman Matulef R. Xie] [Haviv Langberg]
• K-flat distributions [Levi Indyk R.]
• K-modal distributions [Daskalakis Diakonikolas Servedio]
• Poisson Binomial Distributions [Daskalakis Diakonikolas Servedio]
• Monotonicity over general posets [Batu Kumar R.] [Bhattacharyya Fischer R. P. Valiant]
• Properties of multiple distributions [Levi Ron R.]
![Page 71: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/71.jpg)
Many other properties to consider!
• Higher dimensional flat distributions• Mixtures of k Gaussians• “Junta”-distributions• …
![Page 72: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/72.jpg)
Getting past the lower bounds• Special distributions
– e.g, uniform on a subset, monotone• Other query models
– Queries to probabilities of elements• Other distance measures [Guha McGregor
Venkatasubramanian]• Competitive classification/closeness testing --
compare to best symmetric test [Acharya Das Jafarpour Orlitsky Pan Suresh] [Acharya Das Jafarpour Orlitsky Pan]
![Page 73: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/73.jpg)
More open directions
• Other properties?• Non-iid samples?
![Page 74: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/74.jpg)
• For many problems, we need a lot less time and samples than one might think!
• Many cool ideas and techniques have been developed
• Lots more to do!
Conclusion:
![Page 75: Something for almost nothing: Advances in sublinear time algorithms](https://reader037.vdocuments.us/reader037/viewer/2022110103/5681435f550346895dafdb75/html5/thumbnails/75.jpg)
Thank you!