
Learning Mixtures of Structured Distributions over Discrete Domains

Xiaorui Sun, Columbia University

Joint work with Siu-On Chan (UC Berkeley), Ilias Diakonikolas (U. Edinburgh), and Rocco Servedio (Columbia University)

Density Estimation

• PAC-type learning model
• A set 𝒞 of possible target distributions over [n] = {1, …, n}
• Learner
– Knows the set 𝒞 but does not know the target distribution p ∈ 𝒞
– Independently draws a few samples from p
– Outputs (a succinct description of) a distribution h which is ε-close to p
• Total variation distance d_TV(p, h) = ½ Σᵢ |p(i) − h(i)| is the standard measure of closeness in statistics
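To make the success criterion concrete, here is a minimal Python sketch of total variation distance; representing a distribution over [n] as a length-n probability vector is a convention of these sketches, not anything from the talk.

```python
import numpy as np

def total_variation(p, h):
    """d_TV(p, h) = (1/2) * sum_i |p(i) - h(i)| between two
    distributions over [n], each a length-n probability vector."""
    p, h = np.asarray(p, dtype=float), np.asarray(h, dtype=float)
    return 0.5 * float(np.abs(p - h).sum())
```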

Learn a structured distribution

• If 𝒞 = {all distributions over [n]}, Ω(n/ε²) samples are required

• Much better sample complexity is possible for structured distributions
– Poisson binomial distributions [DDS12a]: Õ(1/ε³) samples
– Monotone / k-modal distributions [Bir87, DDS12b]: O(log(n)/ε³) samples / Õ(k log(n)/ε³) samples

This work: learn mixtures of structured distributions

• Learn a mixture of k distributions?
– A set 𝒞 of distributions over [n]
– Target distribution p is a mixture of k distributions from 𝒞
– i.e. p = μ₁p₁ + … + μ_k p_k, such that μᵢ ≥ 0, Σᵢ μᵢ = 1, and each pᵢ ∈ 𝒞

• Our result: learn mixtures for several classes of structured distributions
– Sample complexity close to optimal
– Efficient running time
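A short sketch of what "the target is a mixture" means operationally: to draw from p = Σᵢ μᵢ pᵢ, pick component i with probability μᵢ, then draw from pᵢ. The function name and vector encoding are illustrative.

```python
import numpy as np

def sample_mixture(components, weights, m, seed=None):
    """Draw m samples from p = sum_i weights[i] * components[i].
    Each component is a length-n probability vector over {0, ..., n-1}."""
    rng = np.random.default_rng(seed)
    # Choose a component for each draw, then sample within it.
    which = rng.choice(len(components), size=m, p=weights)
    return np.array([rng.choice(len(components[i]), p=components[i])
                     for i in which])
```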

Our results: learning mixtures of log-concave distributions

• Log-concave distribution p over [n]
– The support of p is an interval
– p(i)² ≥ p(i−1) p(i+1) for all 1 < i < n

[figure: a log-concave distribution over [1, n]]
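A hedged sketch of this definition as a predicate (the contiguous-support requirement is part of the standard definition; the numerical tolerance is an implementation detail):

```python
import numpy as np

def is_log_concave(p):
    """True if p has interval support and p(i)^2 >= p(i-1)*p(i+1)
    at every interior point of [n]."""
    p = np.asarray(p, dtype=float)
    support = np.flatnonzero(p > 0)
    if support.size and np.any(np.diff(support) > 1):
        return False                      # support is not an interval
    return bool(np.all(p[1:-1] ** 2 >= p[:-2] * p[2:] - 1e-15))
```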

Our results: log-concave

• Algorithm to learn a mixture of k log-concave distributions
– Sample complexity: k · Õ(1/ε⁴) (independent of n)
– Running time: Õ(k log(n)/ε⁴) bit operations

• Lower bound: Ω(k/ε^(5/2)) samples

Our results: mixtures of unimodal distributions

• Unimodal distribution p over [n]
– There exists a mode m ∈ [n] s.t. p is non-decreasing on [1, m] and non-increasing on [m, n]

[figure: a unimodal distribution over [1, n]]
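The mode condition can be checked directly: in a unimodal vector, every strict increase must occur before every strict decrease. A minimal sketch:

```python
import numpy as np

def is_unimodal(p):
    """True if some mode m makes p non-decreasing on [1, m] and
    non-increasing on [m, n]."""
    d = np.diff(np.asarray(p, dtype=float))
    rising, falling = np.flatnonzero(d > 0), np.flatnonzero(d < 0)
    return (rising.size == 0 or falling.size == 0
            or rising.max() < falling.min())
```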

Our results: mixtures of unimodal distributions

• A mixture of 2 unimodal distributions may have Ω(n) modes

• Algorithm to learn a mixture of k unimodal distributions
– Sample complexity: O(k log(n)/ε⁴) samples
– Running time: Õ(k log(n)/ε⁴) bit operations

• Lower bound: Ω(k log(n)/ε³) samples

Our results: mixtures of MHR distributions

• Monotone hazard rate (MHR) distribution p over [n]
– Hazard rate of p: H(i) = p(i) / Σ_{j≥i} p(j)
– H(i) = +∞ if Σ_{j≥i} p(j) = 0
– MHR distribution: H is a non-decreasing function over [n]

[figure: an MHR distribution over [1, n]]
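A sketch of the hazard rate computation from the definition above; the +inf convention for an empty tail follows the slide's definition:

```python
import numpy as np

def hazard_rate(p):
    """H(i) = p(i) / sum_{j >= i} p(j); +inf where the tail is 0.
    p is MHR iff the returned array is non-decreasing."""
    p = np.asarray(p, dtype=float)
    tail = np.cumsum(p[::-1])[::-1]       # tail[i] = sum_{j >= i} p(j)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(tail > 0, p / tail, np.inf)
```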

Our results: mixtures of MHR distributions

• Algorithm to learn a mixture of k MHR distributions
– Sample complexity: O(k log(n/ε)/ε⁴)
– Running time: Õ(k log(n)/ε⁴) bit operations

• Lower bound: Ω(k log(n)/ε³) samples

Compare with parameter estimation

• Parameter estimation [KMV10, MV10]
– Learn a mixture of k Gaussians
– Independently draw a few samples from the mixture
– Estimate the parameters of each Gaussian component accurately

• The number of samples inherently depends exponentially on k, even for a mixture of k one-dimensional normal distributions [MV10]

Compare with parameter estimation

• Parameter estimation needs at least exp(Ω(k)) samples to learn a mixture of k binomial distributions
– Similar to the lower bound in [MV10]

• Density estimation also applies to nonparametric distribution classes
– E.g. log-concave, unimodal, MHR

• Density estimation learns a mixture of k binomial distributions over [n] using k · Õ(1/ε⁴) samples
– A binomial distribution is log-concave

Outline

• Learning algorithm based on decomposition

• Structural results for log-concave, unimodal, MHR distributions

Flat decomposition

• Key definition: a distribution p is (ε, t)-flat if there exists a partition 𝒥 of [n] into t intervals such that d_TV(p, p̄_𝒥) ≤ ε
– Such a 𝒥 is an (ε, t)-flat decomposition for p

• The flattening p̄_𝒥 is obtained by "flattening" p within each interval
– p̄_𝒥(j) = p(I)/|I| for each j ∈ I, I ∈ 𝒥

Flat decomposition

[figure: a distribution over [1, n] and its flattening over a partition into intervals]
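A direct sketch of the flattening operation: given interval breakpoints, spread each interval's mass uniformly. The breakpoint encoding (half-open intervals over {0, ..., n-1}) is a convention of these sketches:

```python
import numpy as np

def flatten(p, breakpoints):
    """Flattening pbar of p w.r.t. the partition into intervals
    [b_0, b_1), ..., [b_{t-1}, b_t) with b_0 = 0 and b_t = n:
    pbar(j) = p(I) / |I| for every j in interval I."""
    p = np.asarray(p, dtype=float)
    pbar = np.empty_like(p)
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        pbar[lo:hi] = p[lo:hi].sum() / (hi - lo)
    return pbar
```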

Learn -flat distributions

• Main general Thm: Let 𝒞 = {all (ε, t)-flat distributions over [n]}. There is an algorithm which draws O(t/ε³) samples from p ∈ 𝒞 and outputs a hypothesis h such that d_TV(p, h) ≤ O(ε).

• Running time is linear in the number of samples

Easier problem: known decomposition

• Given
– Samples from an (ε, t)-flat distribution p
– An (ε, t)-flat decomposition 𝒥 for p

• Idea: estimate the probability mass of every interval in 𝒥 and spread it uniformly within the interval (as in the sketch below)

• O(t/ε²) samples are enough
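A minimal sketch of the known-decomposition learner: empirically estimate each interval's mass, then flatten. Names and the breakpoint encoding match the earlier sketches (our conventions):

```python
import numpy as np

def learn_with_known_partition(samples, breakpoints, n):
    """Estimate the mass of each interval [lo, hi) from samples over
    {0, ..., n-1} and spread it uniformly, yielding the hypothesis h."""
    samples = np.asarray(samples)
    h, m = np.zeros(n), len(samples)
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        mass = np.count_nonzero((samples >= lo) & (samples < hi)) / m
        h[lo:hi] = mass / (hi - lo)
    return h
```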

Real problem: unknown decomposition

• Only given samples from an (ε, t)-flat distribution p

• There exists an (ε, t)-flat decomposition ℒ for p, but it is unknown

• A useful fact [DDS+13]: if ℒ is an (ε, t)-flat decomposition of p and 𝒥 is a "refinement" of ℒ, then 𝒥 is an (O(ε), |𝒥|)-flat decomposition of p
– So knowing any refinement of ℒ is good enough

Unknown flat decomposition (cont)

• Idea: partition [n] into intervals, each with small probability mass; call the resulting partition 𝒦
– Achieved by sampling from p

[figure: sampled partition 𝒦 and unknown decomposition ℒ over [1, n]]

Unknown flat decomposition (cont)

• There exists an (unknown) partition 𝒥
– A refinement of both 𝒦 and ℒ
– With at most |𝒦| + t intervals

[figure: 𝒥 refines both the sampled partition 𝒦 and the unknown decomposition ℒ over [1, n]]

Unknown flat decomposition (cont)

• Such a 𝒥 exists
– A refinement of both 𝒦 and ℒ
– With at most |𝒦| + t intervals
– An (O(ε), |𝒥|)-flat decomposition for p

[figure: the common refinement 𝒥 over [1, n]]

Unknown flat decomposition (cont)

• Compare the flattenings p̄_𝒥 and p̄_𝒦
– They differ only on the (at most t) intervals of 𝒦 that 𝒥 splits

[figure: p̄_𝒥 versus p̄_𝒦 over [1, n]; they differ only where 𝒥 subdivides 𝒦]

Unknown flat decomposition (cont)

• If the probability mass of every interval of 𝒦 is at most ε/t, then d_TV(p̄_𝒦, p̄_𝒥) ≤ O(ε)

• Partition [n] into O(t/ε) intervals, each with probability mass at most ε/t (see the sketch below)
– O(t/ε³) samples are enough
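One way to build such a light partition is a greedy left-to-right sweep over the empirical distribution. This sketch is our own construction with untuned thresholds; it closes an interval whenever adding the next point would exceed the mass bound:

```python
import numpy as np

def light_partition(samples, n, mass_bound):
    """Greedily partition {0, ..., n-1} into intervals of empirical mass
    <= mass_bound (heavy singletons get their own interval).
    Returns breakpoints [0, ..., n]."""
    emp = np.bincount(np.asarray(samples), minlength=n) / len(samples)
    breakpoints, acc = [0], 0.0
    for i in range(n):
        if acc + emp[i] > mass_bound and i > breakpoints[-1]:
            breakpoints.append(i)        # close the interval before i
            acc = 0.0
        acc += emp[i]
    breakpoints.append(n)
    return breakpoints
```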

Learn -flat distributions

• Main general Thm (recap): Let 𝒞 = {all (ε, t)-flat distributions over [n]}. There is an algorithm which draws O(t/ε³) samples from p ∈ 𝒞 and outputs a hypothesis h such that d_TV(p, h) ≤ O(ε)

Learn mixture of distributions

• Lem: A mixture of k (ε, t)-flat distributions has an (ε, kt)-flat decomposition
– Tight for interesting distribution classes

• Thm (Learn mixture): Let p be a mixture of k (ε, t)-flat distributions. There is an algorithm which draws O(kt/ε³) samples and outputs a hypothesis h s.t. d_TV(p, h) ≤ O(ε)
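Putting the pieces together, a hedged end-to-end sketch of the mixture learner, reusing light_partition and learn_with_known_partition from the sketches above; the mass bound ε/(kt) is illustrative, not the tuned constant from the analysis:

```python
def learn_flat_mixture(samples, n, k, t, eps):
    """A mixture of k (eps, t)-flat distributions is (eps, k*t)-flat,
    so: (1) build a partition into intervals of small empirical mass,
    (2) output the flattened empirical distribution over it."""
    breakpoints = light_partition(samples, n, mass_bound=eps / (k * t))
    return learn_with_known_partition(samples, breakpoints, n)
```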

First application: learning a mixture of log-concave distributions

• Recall the definition
– The support of p is an interval
– p(i)² ≥ p(i−1) p(i+1) for all 1 < i < n

• Lem: Every log-concave distribution is (ε, O(log(1/ε)/ε))-flat

• Learn a mixture of k log-concave distributions with k · Õ(1/ε⁴) samples

Second application: learning a mixture of unimodal distributions

• Lem: Every unimodal distribution is (ε, O(log(n)/ε))-flat [Bir87, DDS+13]

• Learn a mixture of k unimodal distributions with O(k log(n)/ε⁴) samples

Third application: learning a mixture of MHR distributions

• Monotone hazard rate distribution (recall)
– Hazard rate of p: H(i) = p(i) / Σ_{j≥i} p(j)
– H(i) = +∞ if Σ_{j≥i} p(j) = 0
– MHR: H is a non-decreasing function over [n]

• Lem: Every MHR distribution is (ε, O(log(n/ε)/ε))-flat

• Learn a mixture of k MHR distributions with O(k log(n/ε)/ε⁴) samples

Conclusion and further directions

• Flat decomposition is a useful way to study mixtures of structured distributions

• Extend to higher dimensions?

• Efficient algorithms with optimal sample complexity?

Distribution    Sample complexity       Lower bound
Log-concave     k · Õ(1/ε⁴)             Ω(k/ε^(5/2))
Unimodal        O(k log(n)/ε⁴)          Ω(k log(n)/ε³)
MHR             O(k log(n/ε)/ε⁴)        Ω(k log(n)/ε³)

Thank you!