Statistical Inference from Dependent Observations
Constantinos Daskalakis
EECS and CSAIL, MIT
Supervised Learning Setting
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples
• 𝑥𝑖: feature (aka covariate) vector of example 𝑖
• 𝑦𝑖: label (aka response) of example 𝑖 (e.g. images labeled “dog”, …, “cat”)
• Assumption: (𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛) ∼iid 𝐷, where 𝐷 is unknown
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• e.g. ℋ = {⟨𝜃, ⋅⟩ : ‖𝜃‖2 ≤ 1}
• e.g. ℋ = {some class of Deep Neural Networks}
• generally, ℎ𝜃 might be randomized
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Goal: select 𝜃 to minimize expected loss: min𝜃 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ𝜃(𝑥), 𝑦)
Goal 2: in the realizable setting (i.e. when, under 𝐷, 𝑦 ∼ ℎ𝜃∗(𝑥)), estimate 𝜃∗
E.g. Linear Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: 𝑦𝑖 = 𝜃T𝑥𝑖 + 𝜀𝑖, with 𝜀𝑖 i.i.d. Gaussian noise
Goal: Infer 𝜃
E.g. 2: Logistic Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: Pr[𝑦𝑖 = 1] = 1/(1 + exp(−2𝜃T𝑥𝑖))
Goal: Infer 𝜃
Maximum Likelihood Estimator (MLE)
In the standard linear and logistic regression settings, under mild assumptions about the design matrix 𝑋, whose rows are the covariate vectors, MLE is strongly consistent.
The MLE 𝜃̂ satisfies: ‖𝜃̂ − 𝜃‖2 = 𝑂(√(𝑑/𝑛))
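This rate is easy to probe numerically. Below is a minimal sketch (not from the talk; the sample sizes, seed, and step size are illustrative assumptions) that fits the ±1 logistic MLE by gradient ascent on the concave log-likelihood and watches the error shrink as 𝑛 grows:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_mle(X, y, iters=3000, lr=0.5):
    """Gradient ascent on the concave logistic log-likelihood,
    with labels in {±1} and Pr[y_i = 1 | x_i] = 1/(1 + exp(-2 theta^T x_i))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ theta)                  # y_i * theta^T x_i
        weights = sigmoid(-2.0 * margins)          # probability of the opposite label
        theta += lr * 2.0 * (X * (y * weights)[:, None]).mean(axis=0)
    return theta

rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d) / np.sqrt(d)

errs = []
for n in (500, 8000):                              # 16x more samples
    X = rng.normal(size=(n, d))
    y = np.where(rng.random(n) < sigmoid(2.0 * X @ theta_star), 1.0, -1.0)
    errs.append(np.linalg.norm(fit_logistic_mle(X, y) - theta_star))

print(errs)                                        # error shrinks roughly like sqrt(d/n)
```
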
Beyond Realizable Setting: Learnability
• Recall:
• Assumption: (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛) ∼iid 𝐷, where 𝐷 is unknown
• Goal: choose ℎ ∈ ℋ to minimize the expected loss Loss𝐷(ℎ) = 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ(𝑥), 𝑦) under some loss function ℓ, i.e. minℎ∈ℋ Loss𝐷(ℎ)
• Let ℎ̂ ∈ ℋ be the Empirical Risk Minimizer, i.e. ℎ̂ ∈ argminℎ∈ℋ (1/𝑛) Σ𝑖 ℓ(ℎ(𝑥𝑖), 𝑦𝑖)
• Then:
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + 𝑂(√(𝑉𝐶(ℋ)/𝑛)), for Boolean ℋ, 0-1 loss ℓ
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + 𝑂(𝜆 ℛ𝑛(ℋ)), for general ℋ, 𝜆-Lipschitz ℓ
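The ERM guarantee can be illustrated with a toy sketch (hypothetical setup, not from the talk): a Boolean class of 1-D threshold classifiers, which has VC dimension 1, under 0-1 loss with label noise:

```python
import numpy as np

# ERM over the class of threshold classifiers h_t(x) = sign(x - t),
# 0-1 loss, agnostic setting (10% label noise).
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sign(x - 0.3)                        # best-in-class threshold is 0.3
flip = rng.random(n) < 0.1
y[flip] *= -1.0                             # noisy labels

thresholds = np.linspace(-1.0, 1.0, 201)    # candidate hypotheses
emp_risk = np.array([(np.sign(x - t) != y).mean() for t in thresholds])
t_hat = thresholds[int(np.argmin(emp_risk))]
print(t_hat, emp_risk.min())
```

With this many i.i.d. samples the empirical minimizer lands near the best-in-class threshold, in line with the 𝑂(√(𝑉𝐶(ℋ)/𝑛)) excess-risk bound.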
Supervised Learning Setting on Steroids
But what do these results really mean?
Broader Perspective
❑ ML methods commonly operate under stringent assumptions
o train set: independent and identically distributed (i.i.d.) samples
o test set: i.i.d. samples from same distribution as training
❑ Goal: relax standard assumptions to accommodate two important challenges
o (i) censored/truncated samples and (ii) dependent samples
o Censoring/truncation ⇐ systematic missing of data
⇒ train set ≠ test set
o Data Dependencies ⇐ peer effects, spatial, temporal dependencies
⇒ no apparent source for independence
Today’s Topic: Statistical Inference from Dependent Observations
independent → dependent
Observations 𝑥𝑖 , 𝑦𝑖 𝑖 are commonly collected on some spatial domain, some temporal domain, or on a social network
As such, they are not independent, but intricately dependent
Why dependent?
e.g. spin glasses: neighboring particles influence each other
[figure: lattice of ± spins]
e.g. social networks: decisions/opinions of nodes are influenced by the decisions of their neighbors (peer effects)
Statistical Physics and Machine Learning
• Spin Systems [Ising’25]
• MRFs, Bayesian Networks, Boltzmann Machines
• Probability Theory
• MCMC
• Machine Learning
• Computer Vision
• Game Theory
• Computational Biology
• Causality
• …
Peer Effects on Social Networks
Several studies of peer effects, in applications such as:
• criminal networks [Glaeser et al.’96]
• welfare participation [Bertrand et al.’00]
• school achievement [Sacerdote’01]
• participation in retirement plans [Duflo-Saez’03]
• obesity [Trogdon et al.’08, Christakis-Fowler’13]
AddHealth Dataset:
• National Longitudinal Study of Adolescent Health
• national study of students in grades 7-12
• friendship networks, personal and school life, age, gender, race, socio-economic background, health, …
Microeconomics: behavior/opinion dynamics, e.g. [Schelling’78], [Ellison’93], [Young’93, ’01], [Montanari-Saberi’10], …
Econometrics: disentangling individual from network effects [Manski’93], [Bramoulle-Djebbari-Fortin’09], [Li-Levina-Zhu’16], …
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Standard Linear Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: 𝑦𝑖 = 𝜃T𝑥𝑖 + 𝜀𝑖, with 𝜀𝑖 i.i.d. Gaussian noise
Goal: Infer 𝜃
Standard Logistic Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: Pr[𝑦𝑖 = 1] = 1/(1 + exp(−2𝜃T𝑥𝑖))
Goal: Infer 𝜃
Maximum Likelihood Estimator
In the standard linear and logistic regression settings, under mild assumptions about the design matrix 𝑋, whose rows are the covariate vectors, MLE is strongly consistent.
The MLE 𝜃̂ satisfies: ‖𝜃̂ − 𝜃‖2 = 𝑂(√(𝑑/𝑛))
Standard Linear Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: 𝑦𝑖 = 𝜃T𝑥𝑖 + 𝜀𝑖, with 𝜀𝑖 i.i.d. Gaussian noise
Goal: Infer 𝜃
Linear Regression w/ Dependent Samples
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows, conditioned on 𝑦−𝑖:
Parameters:
• coefficient vector 𝜃, inverse temperature 𝛽 (unknown)
• interaction matrix 𝐴 (known)
Goal: Infer 𝜃, 𝛽
Standard Logistic Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: Pr[𝑦𝑖 = 1] = 1/(1 + exp(−2𝜃T𝑥𝑖))
Goal: Infer 𝜃
Logistic Regression w/ Dependent Samples (today’s focus)
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows, conditioned on 𝑦−𝑖: Pr[𝑦𝑖 = 1 ∣ 𝑦−𝑖] = 1/(1 + exp(−2(𝜃T𝑥𝑖 + 𝛽𝐴𝑖T𝑦)))
Parameters:
• coefficient vector 𝜃, inverse temperature 𝛽 (unknown)
• interaction matrix 𝐴 (known)
Goal: Infer 𝜃, 𝛽
Logistic Regression w/ Dependent Samples (today’s focus)
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: 𝑦1, …, 𝑦𝑛 are sampled jointly according to the measure
Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍
i.e. an Ising model w/ inverse temperature 𝛽 and external fields 𝜃T𝑥𝑖
Parameters:
• coefficient vector 𝜃, inverse temperature 𝛽 (unknown)
• interaction matrix 𝐴 (known)
Goal: Infer 𝜃, 𝛽
For 𝛽 = 0, this is equivalent to logistic regression.
Challenge: one sample, and the likelihood contains 𝑍, i.e. lack of LLN, hard to compute MLE
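A sketch of how one might draw a sample from this measure, assuming the deck’s conditional form Pr[𝑦𝑖 = 1 ∣ 𝑦−𝑖] = 1/(1 + exp(−2(𝜃T𝑥𝑖 + 𝛽𝐴𝑖T𝑦))) and a hypothetical cycle-graph interaction matrix; Gibbs sampling sidesteps 𝑍 entirely:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gibbs_sample(X, theta, beta, A, sweeps, rng):
    """Gibbs sampler for Pr[y = s] ∝ exp(sum_i (theta^T x_i) s_i + beta s^T A s).
    Each spin is resampled from its conditional (A has zero diagonal):
    Pr[y_i = 1 | y_-i] = sigmoid(2 * (theta^T x_i + beta * A_i^T y))."""
    n = X.shape[0]
    fields = X @ theta                              # external field at each node
    y = rng.choice(np.array([-1.0, 1.0]), size=n)
    for _ in range(sweeps):
        for i in range(n):
            local = fields[i] + beta * (A[i] @ y)
            y[i] = 1.0 if rng.random() < sigmoid(2.0 * local) else -1.0
    return y

# Hypothetical instance: cycle graph, weak positive interactions
rng = np.random.default_rng(2)
n, d = 60, 3
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 0.5
theta = np.array([0.5, -0.2, 0.1])
X = rng.normal(size=(n, d))
y = gibbs_sample(X, theta, beta=0.2, A=A, sweeps=200, rng=rng)
print(y.mean())
```
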
Main Result for Logistic Regression from Dependent Samples
N.B. We show similar results for linear regression with peer effects.
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
MPLE instead of MLE
Problem: Given Ԧ𝑥 = (𝑥1, …, 𝑥𝑛) and Ԧ𝑦 = (𝑦1, …, 𝑦𝑛) ∈ {±1}𝑛 sampled as
Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍,
infer 𝜃, 𝛽.
• Likelihood involves 𝑍 = 𝑍𝜃,𝛽 (the partition function), so is non-trivial to compute.
• [Besag’75, …, Chatterjee’07] study the maximum pseudolikelihood estimator (MPLE):
𝑃𝐿(𝛽, 𝜃) ≔ Π𝑖 Pr𝛽,𝜃,𝐴[𝑦𝑖 ∣ 𝑦−𝑖]^(1/𝑛) = Π𝑖 ( 1 / (1 + exp(−2𝑦𝑖(𝜃T𝑥𝑖 + 𝛽𝐴𝑖T𝑦))) )^(1/𝑛)
• log 𝑃𝐿 does not contain 𝑍 and is concave. Is the MPLE consistent?
• [Chatterjee’07]: yes, when 𝛽 > 0, 𝜃 = 0
• [BM’18, GM’18]: yes, when 𝛽 > 0, 𝜃 ≠ 0 ∈ ℝ, 𝑥𝑖 = 1 for all 𝑖
• General case?
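A minimal sketch of computing the MPLE (assuming the conditional form above; the path-graph 𝐴, sample size, and learning rate are illustrative, not from the talk): gradient ascent on the concave log-pseudolikelihood. For simplicity the data here are drawn with true 𝛽 = 0, where exact sampling is easy, so the MPLE should recover 𝜃 and report 𝛽 ≈ 0:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mple(X, y, A, iters=3000, lr=0.3):
    """Gradient ascent on the concave log-pseudolikelihood
    log PL(theta, beta) = (1/n) sum_i log sigmoid(2 y_i (theta^T x_i + beta A_i^T y)),
    with A the known interaction matrix (zero diagonal)."""
    theta, beta = np.zeros(X.shape[1]), 0.0
    neigh = A @ y                                   # A_i^T y, fixed given the sample
    for _ in range(iters):
        m = y * (X @ theta + beta * neigh)
        w = 2.0 * y * sigmoid(-2.0 * m)             # shared weight in both gradients
        theta += lr * (X * w[:, None]).mean(axis=0)
        beta += lr * (w * neigh).mean()
    return theta, beta

rng = np.random.default_rng(3)
n, d = 2000, 4
X = rng.normal(size=(n, d))
theta_star = np.array([0.6, -0.4, 0.2, 0.0])
# For easy exact sampling, draw y independently (true beta = 0):
y = np.where(rng.random(n) < sigmoid(2.0 * X @ theta_star), 1.0, -1.0)
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 0.25                # hypothetical path-graph interactions

theta_hat, beta_hat = mple(X, y, A)
print(np.round(theta_hat, 2), round(beta_hat, 3))
```
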
![Page 27: Statistical Inference from Dependent Observations · Pr UԦ=𝜎=exp(σ 𝜃T 𝜎 +𝛽𝜎T𝐴𝜎) 𝑍 Infer 𝜃,𝛽 • Concentration: gradient of log-pseudolikelihood at](https://reader036.vdocuments.us/reader036/viewer/2022070121/5fcc67867164973f2206cfdb/html5/thumbnails/27.jpg)
Analysis
Problem: Given Ԧ𝑥 = (𝑥1, …, 𝑥𝑛) and Ԧ𝑦 = (𝑦1, …, 𝑦𝑛) ∈ {±1}𝑛 sampled as
Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍,
infer 𝜃, 𝛽 via the pseudolikelihood 𝑃𝐿(𝛽, 𝜃) ≔ Π𝑖 Pr𝛽,𝜃,𝐴[𝑦𝑖 ∣ 𝑦−𝑖]^(1/𝑛).
![Page 28: Statistical Inference from Dependent Observations · Pr UԦ=𝜎=exp(σ 𝜃T 𝜎 +𝛽𝜎T𝐴𝜎) 𝑍 Infer 𝜃,𝛽 • Concentration: gradient of log-pseudolikelihood at](https://reader036.vdocuments.us/reader036/viewer/2022070121/5fcc67867164973f2206cfdb/html5/thumbnails/28.jpg)
Analysis
Problem: Given Ԧ𝑥 = (𝑥1, …, 𝑥𝑛) and Ԧ𝑦 = (𝑦1, …, 𝑦𝑛) ∈ {±1}𝑛 sampled as Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍, infer 𝜃, 𝛽 via 𝑃𝐿(𝛽, 𝜃) ≔ Π𝑖 Pr𝛽,𝜃,𝐴[𝑦𝑖 ∣ 𝑦−𝑖]^(1/𝑛).
• Concentration: gradient of the log-pseudolikelihood at the truth should be small;
show ‖𝛻 log 𝑃𝐿(𝜃0, 𝛽0)‖2² is 𝑂(𝑑/𝑛) with probability at least 99%
• Anti-concentration: constant lower bound on min𝜃,𝛽 𝜆𝑚𝑖𝑛(−𝛻² log 𝑃𝐿(𝜃, 𝛽)) w.pr. ≥ 99%
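Both quantities can be probed numerically on a toy instance (hypothetical path-graph 𝐴; data drawn at 𝛽0 = 0 so exact sampling is easy, a simplification of the dependent setting):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
n, d = 2000, 3
X = rng.normal(size=(n, d))
theta0, beta0 = np.array([0.5, -0.3, 0.2]), 0.0
y = np.where(rng.random(n) < sigmoid(2.0 * X @ theta0), 1.0, -1.0)
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 0.25

neigh = A @ y
G = np.hstack([X, neigh[:, None]])      # per-sample directions for (theta, beta)
m = y * (X @ theta0 + beta0 * neigh)

# Concentration: gradient of log PL at the truth; its squared norm should be O(d/n)
grad = ((2.0 * y * sigmoid(-2.0 * m))[:, None] * G).mean(axis=0)
print(np.linalg.norm(grad) ** 2)

# Anti-concentration: -Hessian of log PL = (1/n) sum_i w_i g_i g_i^T with w_i in (0, 1];
# its smallest eigenvalue should be bounded away from zero
w = 4.0 * sigmoid(2.0 * m) * sigmoid(-2.0 * m)
H = (G * w[:, None]).T @ G / n
print(np.linalg.eigvalsh(H).min())
```
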
![Page 29: Statistical Inference from Dependent Observations · Pr UԦ=𝜎=exp(σ 𝜃T 𝜎 +𝛽𝜎T𝐴𝜎) 𝑍 Infer 𝜃,𝛽 • Concentration: gradient of log-pseudolikelihood at](https://reader036.vdocuments.us/reader036/viewer/2022070121/5fcc67867164973f2206cfdb/html5/thumbnails/29.jpg)
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Supervised Learning
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples (e.g. images labeled “dog”, …, “cat”)
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Assumption: (𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛) ∼iid 𝐷, where 𝐷 is unknown
Goal: select 𝜃 to minimize expected loss: min𝜃 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ𝜃(𝑥), 𝑦)
Goal 2: in the realizable setting (i.e. when, under 𝐷, 𝑦 ∼ ℎ𝜃∗(𝑥)), estimate 𝜃∗
Supervised Learning w/ Dependent Observations
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples (e.g. images labeled “dog”, …, “cat”)
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Assumption: (𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛) ∼iid 𝐷, where 𝐷 is unknown
Goal: select 𝜃 to minimize expected loss: min𝜃 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ𝜃(𝑥), 𝑦)
Goal 2: in the realizable setting (i.e. when, under 𝐷, 𝑦 ∼ ℎ𝜃∗(𝑥)), estimate 𝜃∗
Supervised Learning w/ Dependent Observations
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples (e.g. images labeled “dog”, …, “cat”)
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Assumption′: a joint distribution 𝐷 samples the training examples and an unknown test sample jointly, i.e.
• {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, together with (𝑥, 𝑦), drawn jointly from 𝐷
Goal: select 𝜃 to minimize: min𝜃 𝔼 ℓ(ℎ𝜃(𝑥), 𝑦)
Main result
Learnability is possible when joint distribution 𝐷 over training set and test set satisfies Dobrushin’s condition.
Dobrushin’s condition
• Given a joint probability vector ҧ𝑍 = (𝑍1, …, 𝑍𝑚), define the influence of variable 𝒊 on variable 𝒋:
• “worst effect of 𝑍𝑖 on the conditional distribution of 𝑍𝑗 given 𝑍−𝑖−𝑗”
• Inf𝑖→𝑗 = sup over 𝑧𝑖, 𝑧𝑖′, 𝑧−𝑖−𝑗 of 𝑑𝑇𝑉( Pr[𝑍𝑗 ∣ 𝑧𝑖, 𝑧−𝑖−𝑗], Pr[𝑍𝑗 ∣ 𝑧𝑖′, 𝑧−𝑖−𝑗] )
• Dobrushin’s condition: “all nodes have limited total influence exerted on them”
• For all 𝑖: Σ𝑗≠𝑖 Inf𝑗→𝑖 < 1.
Implies: concentration of measure; fast mixing of the Gibbs sampler.
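For the Ising measure from Part I, a standard bound gives Inf𝑗→𝑖 ≤ tanh(𝛽|𝐴𝑖𝑗|): flipping 𝑦𝑗 shifts node 𝑖’s local field by 2𝛽|𝐴𝑖𝑗|, and the external field only reduces the total-variation effect. So the condition can be checked from 𝛽 and 𝐴 alone; a sketch (the 4-regular graph is a hypothetical example):

```python
import numpy as np

def dobrushin_coefficient(beta, A):
    """Upper bound on max_i sum_{j != i} Inf_{j->i} for the Ising measure:
    each influence is at most tanh(beta * |A_ij|)."""
    infl = np.tanh(beta * np.abs(A))
    np.fill_diagonal(infl, 0.0)
    return infl.sum(axis=1).max()

# Hypothetical example: a 4-regular graph (cycle plus distance-2 edges), edge weight 1
n = 100
A = np.zeros((n, n))
for i in range(n):
    for step in (1, 2):
        A[i, (i + step) % n] = A[(i + step) % n, i] = 1.0

print(dobrushin_coefficient(0.2, A))    # 4*tanh(0.2) ≈ 0.79 < 1: condition holds
print(dobrushin_coefficient(0.5, A))    # 4*tanh(0.5) ≈ 1.85 ≥ 1: bound fails
```
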
Learnability under Dobrushin’s condition
• Suppose:
• (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛), (𝑋, 𝑌) ∼ 𝐷, where 𝐷 is a joint distribution
• 𝐷 satisfies Dobrushin’s condition
• 𝐷 has the same marginals
• Define Loss𝐷(ℎ) = 𝔼 ℓ(ℎ(𝑥), 𝑦)
• There exists a learning algorithm which, given (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛), outputs ℎ̂ ∈ ℋ such that
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + ෨𝑂(√(𝑉𝐶(ℋ)/𝑛)), for Boolean ℋ, 0-1 loss ℓ
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + ෨𝑂(𝜆 𝒢𝑛(ℋ)), for general ℋ, 𝜆-Lipschitz ℓ
(need stronger than Dobrushin’s condition for general ℋ)
Essentially the same bounds as in the i.i.d. setting.
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Goals of Broader Line of Work
Goal: relax standard assumptions to accommodate two important challenges
o (i) censored/truncated samples and (ii) dependent samples
o Censoring/truncation ⇐ systematically missing data ⇒ train set ≠ test set
o Data dependencies ⇐ peer effects, spatial, temporal dependencies ⇒ no apparent source for independence
o Regression on a network (linear and logistic, √(𝑑/𝑛) rates):
Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas: Regression from Dependent Observations. In the 51st Annual ACM Symposium on the Theory of Computing (STOC’19).
o Statistical Learning Theory Framework (learnability and generalization bounds):
Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti: Generalization and Learning under Dobrushin’s Condition. In the 32nd Annual Conference on Learning Theory (COLT’19).
Statistical Estimation from Dependent Observations
Thank you!