Statistical Inference from Dependent Observations
Constantinos Daskalakis
EECS and CSAIL, MIT
Supervised Learning Setting
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples
• 𝑥𝑖: feature (aka covariate) vector of example 𝑖
• 𝑦𝑖: label (aka response) of example 𝑖 (e.g. images labeled “dog”, …, “cat”)
• Assumption: (𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛) ∼iid 𝐷, where 𝐷 is unknown
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• e.g. ℋ = {⟨𝜃, ⋅⟩ : ‖𝜃‖2 ≤ 1}
• e.g. ℋ = {some class of Deep Neural Networks}
• generally, ℎ𝜃 might be randomized
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Goal: select 𝜃 to minimize expected loss: min𝜃 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ𝜃(𝑥), 𝑦)
Goal 2: in the realizable setting (i.e. when, under 𝐷, 𝑦 ∼ ℎ𝜃∗(𝑥)), estimate 𝜃∗
E.g. Linear Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: 𝑦𝑖 = 𝜃T𝑥𝑖 + 𝜀𝑖, with 𝜀𝑖 i.i.d. Gaussian noise
Goal: Infer 𝜃
E.g. 2: Logistic Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: Pr[𝑦𝑖 = 1] = 1/(1 + exp(−2𝜃T𝑥𝑖))
Goal: Infer 𝜃
Maximum Likelihood Estimator (MLE)
In the standard linear and logistic regression settings, under mild assumptions about the design matrix 𝑋, whose rows are the covariate vectors, MLE is strongly consistent.
The MLE 𝜃̂ satisfies: ‖𝜃̂ − 𝜃‖2 = 𝑂(√(𝑑/𝑛))
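This rate is easy to probe numerically. Below is a minimal sketch (not from the talk; the sample sizes, seed, and step size are illustrative assumptions) that fits the ±1 logistic MLE by gradient ascent on the concave log-likelihood and watches the error shrink as 𝑛 grows:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_mle(X, y, iters=3000, lr=0.5):
    """Gradient ascent on the concave logistic log-likelihood,
    with labels in {±1} and Pr[y_i = 1 | x_i] = 1/(1 + exp(-2 theta^T x_i))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ theta)                  # y_i * theta^T x_i
        weights = sigmoid(-2.0 * margins)          # probability of the opposite label
        theta += lr * 2.0 * (X * (y * weights)[:, None]).mean(axis=0)
    return theta

rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d) / np.sqrt(d)

errs = []
for n in (500, 8000):                              # 16x more samples
    X = rng.normal(size=(n, d))
    y = np.where(rng.random(n) < sigmoid(2.0 * X @ theta_star), 1.0, -1.0)
    errs.append(np.linalg.norm(fit_logistic_mle(X, y) - theta_star))

print(errs)                                        # error shrinks roughly like sqrt(d/n)
```
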
Beyond Realizable Setting: Learnability
• Recall:
• Assumption: (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛) ∼iid 𝐷, where 𝐷 is unknown
• Goal: choose ℎ ∈ ℋ to minimize the expected loss Loss𝐷(ℎ) = 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ(𝑥), 𝑦) under some loss function ℓ, i.e. minℎ∈ℋ Loss𝐷(ℎ)
• Let ℎ̂ ∈ ℋ be the Empirical Risk Minimizer, i.e. ℎ̂ ∈ argminℎ∈ℋ (1/𝑛) Σ𝑖 ℓ(ℎ(𝑥𝑖), 𝑦𝑖)
• Then:
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + 𝑂(√(𝑉𝐶(ℋ)/𝑛)), for Boolean ℋ, 0-1 loss ℓ
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + 𝑂(𝜆 ℛ𝑛(ℋ)), for general ℋ, 𝜆-Lipschitz ℓ
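The ERM guarantee can be illustrated with a toy sketch (hypothetical setup, not from the talk): a Boolean class of 1-D threshold classifiers, which has VC dimension 1, under 0-1 loss with label noise:

```python
import numpy as np

# ERM over the class of threshold classifiers h_t(x) = sign(x - t),
# 0-1 loss, agnostic setting (10% label noise).
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sign(x - 0.3)                        # best-in-class threshold is 0.3
flip = rng.random(n) < 0.1
y[flip] *= -1.0                             # noisy labels

thresholds = np.linspace(-1.0, 1.0, 201)    # candidate hypotheses
emp_risk = np.array([(np.sign(x - t) != y).mean() for t in thresholds])
t_hat = thresholds[int(np.argmin(emp_risk))]
print(t_hat, emp_risk.min())
```

With this many i.i.d. samples the empirical minimizer lands near the best-in-class threshold, in line with the 𝑂(√(𝑉𝐶(ℋ)/𝑛)) excess-risk bound.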
Supervised Learning Setting on Steroids
But what do these results really mean?
Broader Perspective
❑ ML methods commonly operate under stringent assumptions
o train set: independent and identically distributed (i.i.d.) samples
o test set: i.i.d. samples from same distribution as training
❑ Goal: relax standard assumptions to accommodate two important challenges
o (i) censored/truncated samples and (ii) dependent samples
o Censoring/truncation ⇐ systematic missing of data
⇒ train set ≠ test set
o Data Dependencies ⇐ peer effects, spatial, temporal dependencies
⇒ no apparent source for independence
Today’s Topic: Statistical Inference from Dependent Observations
independent → dependent
Observations 𝑥𝑖 , 𝑦𝑖 𝑖 are commonly collected on some spatial domain, some temporal domain, or on a social network
As such, they are not independent, but intricately dependent
Why dependent?
e.g. spin glasses: neighboring particles influence each other
[figure: lattice of ± spins]
e.g. social networks: decisions/opinions of nodes are influenced by the decisions of their neighbors (peer effects)
Statistical Physics and Machine Learning
• Spin Systems [Ising’25]
• MRFs, Bayesian Networks, Boltzmann Machines
• Probability Theory
• MCMC
• Machine Learning
• Computer Vision
• Game Theory
• Computational Biology
• Causality
• …
Peer Effects on Social Networks
Several studies of peer effects, in applications such as:
• criminal networks [Glaeser et al.’96]
• welfare participation [Bertrand et al.’00]
• school achievement [Sacerdote’01]
• participation in retirement plans [Duflo-Saez’03]
• obesity [Trogdon et al.’08, Christakis-Fowler’13]
AddHealth Dataset:
• National Longitudinal Study of Adolescent Health
• national study of students in grades 7-12
• friendship networks, personal and school life, age, gender, race, socio-economic background, health, …
Microeconomics: behavior/opinion dynamics, e.g. [Schelling’78], [Ellison’93], [Young’93, ’01], [Montanari-Saberi’10], …
Econometrics: disentangling individual from network effects [Manski’93], [Bramoulle-Djebbari-Fortin’09], [Li-Levina-Zhu’16], …
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Standard Linear Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: 𝑦𝑖 = 𝜃T𝑥𝑖 + 𝜀𝑖, with 𝜀𝑖 i.i.d. Gaussian noise
Goal: Infer 𝜃
Standard Logistic Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: Pr[𝑦𝑖 = 1] = 1/(1 + exp(−2𝜃T𝑥𝑖))
Goal: Infer 𝜃
Maximum Likelihood Estimator
In the standard linear and logistic regression settings, under mild assumptions about the design matrix 𝑋, whose rows are the covariate vectors, MLE is strongly consistent.
The MLE 𝜃̂ satisfies: ‖𝜃̂ − 𝜃‖2 = 𝑂(√(𝑑/𝑛))
Standard Linear Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: 𝑦𝑖 = 𝜃T𝑥𝑖 + 𝜀𝑖, with 𝜀𝑖 i.i.d. Gaussian noise
Goal: Infer 𝜃
Linear Regression w/ Dependent Samples
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and responses 𝑦𝑖 ∈ ℝ, 𝑖 = 1, …, 𝑛
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows, conditioned on 𝑦−𝑖:
Parameters:
• coefficient vector 𝜃, inverse temperature 𝛽 (unknown)
• interaction matrix 𝐴 (known)
Goal: Infer 𝜃, 𝛽
Standard Logistic Regression
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows: Pr[𝑦𝑖 = 1] = 1/(1 + exp(−2𝜃T𝑥𝑖))
Goal: Infer 𝜃
Logistic Regression w/ Dependent Samples (today’s focus)
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: for all 𝑖, 𝑦𝑖 sampled as follows, conditioned on 𝑦−𝑖: Pr[𝑦𝑖 = 1 ∣ 𝑦−𝑖] = 1/(1 + exp(−2(𝜃T𝑥𝑖 + 𝛽𝐴𝑖T𝑦)))
Parameters:
• coefficient vector 𝜃, inverse temperature 𝛽 (unknown)
• interaction matrix 𝐴 (known)
Goal: Infer 𝜃, 𝛽
Logistic Regression w/ Dependent Samples (today’s focus)
Input: 𝑛 feature vectors 𝑥𝑖 ∈ ℝ𝑑 and binary responses 𝑦𝑖 ∈ {±1}
Assumption: 𝑦1, …, 𝑦𝑛 are sampled jointly according to the measure
Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍
i.e. an Ising model w/ inverse temperature 𝛽 and external fields 𝜃T𝑥𝑖
Parameters:
• coefficient vector 𝜃, inverse temperature 𝛽 (unknown)
• interaction matrix 𝐴 (known)
Goal: Infer 𝜃, 𝛽
For 𝛽 = 0, this is equivalent to logistic regression.
Challenge: one sample, and the likelihood contains 𝑍, i.e. lack of LLN, hard to compute MLE
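A sketch of how one might draw a sample from this measure, assuming the deck’s conditional form Pr[𝑦𝑖 = 1 ∣ 𝑦−𝑖] = 1/(1 + exp(−2(𝜃T𝑥𝑖 + 𝛽𝐴𝑖T𝑦))) and a hypothetical cycle-graph interaction matrix; Gibbs sampling sidesteps 𝑍 entirely:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gibbs_sample(X, theta, beta, A, sweeps, rng):
    """Gibbs sampler for Pr[y = s] ∝ exp(sum_i (theta^T x_i) s_i + beta s^T A s).
    Each spin is resampled from its conditional (A has zero diagonal):
    Pr[y_i = 1 | y_-i] = sigmoid(2 * (theta^T x_i + beta * A_i^T y))."""
    n = X.shape[0]
    fields = X @ theta                              # external field at each node
    y = rng.choice(np.array([-1.0, 1.0]), size=n)
    for _ in range(sweeps):
        for i in range(n):
            local = fields[i] + beta * (A[i] @ y)
            y[i] = 1.0 if rng.random() < sigmoid(2.0 * local) else -1.0
    return y

# Hypothetical instance: cycle graph, weak positive interactions
rng = np.random.default_rng(2)
n, d = 60, 3
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 0.5
theta = np.array([0.5, -0.2, 0.1])
X = rng.normal(size=(n, d))
y = gibbs_sample(X, theta, beta=0.2, A=A, sweeps=200, rng=rng)
print(y.mean())
```
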
Main Result for Logistic Regression from Dependent Samples
N.B. We show similar results for linear regression with peer effects.
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
MPLE instead of MLE
Problem: Given Ԧ𝑥 = (𝑥1, …, 𝑥𝑛) and Ԧ𝑦 = (𝑦1, …, 𝑦𝑛) ∈ {±1}𝑛 sampled as
Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍,
infer 𝜃, 𝛽.
• Likelihood involves 𝑍 = 𝑍𝜃,𝛽 (the partition function), so is non-trivial to compute.
• [Besag’75, …, Chatterjee’07] study the maximum pseudolikelihood estimator (MPLE):
𝑃𝐿(𝛽, 𝜃) ≔ Π𝑖 Pr𝛽,𝜃,𝐴[𝑦𝑖 ∣ 𝑦−𝑖]^(1/𝑛) = Π𝑖 ( 1 / (1 + exp(−2𝑦𝑖(𝜃T𝑥𝑖 + 𝛽𝐴𝑖T𝑦))) )^(1/𝑛)
• log 𝑃𝐿 does not contain 𝑍 and is concave. Is the MPLE consistent?
• [Chatterjee’07]: yes, when 𝛽 > 0, 𝜃 = 0
• [BM’18, GM’18]: yes, when 𝛽 > 0, 𝜃 ≠ 0 ∈ ℝ, 𝑥𝑖 = 1 for all 𝑖
• General case?
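A minimal sketch of computing the MPLE (assuming the conditional form above; the path-graph 𝐴, sample size, and learning rate are illustrative, not from the talk): gradient ascent on the concave log-pseudolikelihood. For simplicity the data here are drawn with true 𝛽 = 0, where exact sampling is easy, so the MPLE should recover 𝜃 and report 𝛽 ≈ 0:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mple(X, y, A, iters=3000, lr=0.3):
    """Gradient ascent on the concave log-pseudolikelihood
    log PL(theta, beta) = (1/n) sum_i log sigmoid(2 y_i (theta^T x_i + beta A_i^T y)),
    with A the known interaction matrix (zero diagonal)."""
    theta, beta = np.zeros(X.shape[1]), 0.0
    neigh = A @ y                                   # A_i^T y, fixed given the sample
    for _ in range(iters):
        m = y * (X @ theta + beta * neigh)
        w = 2.0 * y * sigmoid(-2.0 * m)             # shared weight in both gradients
        theta += lr * (X * w[:, None]).mean(axis=0)
        beta += lr * (w * neigh).mean()
    return theta, beta

rng = np.random.default_rng(3)
n, d = 2000, 4
X = rng.normal(size=(n, d))
theta_star = np.array([0.6, -0.4, 0.2, 0.0])
# For easy exact sampling, draw y independently (true beta = 0):
y = np.where(rng.random(n) < sigmoid(2.0 * X @ theta_star), 1.0, -1.0)
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 0.25                # hypothetical path-graph interactions

theta_hat, beta_hat = mple(X, y, A)
print(np.round(theta_hat, 2), round(beta_hat, 3))
```
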
![Page 27: Statistical Inference from Dependent Observations · Pr UԦ=𝜎=exp(σ 𝜃T 𝜎 +𝛽𝜎T𝐴𝜎) 𝑍 Infer 𝜃,𝛽 • Concentration: gradient of log-pseudolikelihood at](https://reader036.vdocuments.us/reader036/viewer/2022070121/5fcc67867164973f2206cfdb/html5/thumbnails/27.jpg)
Analysis
Problem: Given Ԧ𝑥 = (𝑥1, …, 𝑥𝑛) and Ԧ𝑦 = (𝑦1, …, 𝑦𝑛) ∈ {±1}𝑛 sampled as
Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍,
infer 𝜃, 𝛽 via the pseudolikelihood 𝑃𝐿(𝛽, 𝜃) ≔ Π𝑖 Pr𝛽,𝜃,𝐴[𝑦𝑖 ∣ 𝑦−𝑖]^(1/𝑛).
![Page 28: Statistical Inference from Dependent Observations · Pr UԦ=𝜎=exp(σ 𝜃T 𝜎 +𝛽𝜎T𝐴𝜎) 𝑍 Infer 𝜃,𝛽 • Concentration: gradient of log-pseudolikelihood at](https://reader036.vdocuments.us/reader036/viewer/2022070121/5fcc67867164973f2206cfdb/html5/thumbnails/28.jpg)
Analysis
Problem: Given Ԧ𝑥 = (𝑥1, …, 𝑥𝑛) and Ԧ𝑦 = (𝑦1, …, 𝑦𝑛) ∈ {±1}𝑛 sampled as Pr[ Ԧ𝑦 = 𝜎 ] = exp(Σ𝑖 𝜃T𝑥𝑖 𝜎𝑖 + 𝛽 𝜎T𝐴𝜎) / 𝑍, infer 𝜃, 𝛽 via 𝑃𝐿(𝛽, 𝜃) ≔ Π𝑖 Pr𝛽,𝜃,𝐴[𝑦𝑖 ∣ 𝑦−𝑖]^(1/𝑛).
• Concentration: gradient of the log-pseudolikelihood at the truth should be small;
show ‖𝛻 log 𝑃𝐿(𝜃0, 𝛽0)‖2² is 𝑂(𝑑/𝑛) with probability at least 99%
• Anti-concentration: constant lower bound on min𝜃,𝛽 𝜆𝑚𝑖𝑛(−𝛻² log 𝑃𝐿(𝜃, 𝛽)) w.pr. ≥ 99%
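Both quantities can be probed numerically on a toy instance (hypothetical path-graph 𝐴; data drawn at 𝛽0 = 0 so exact sampling is easy, a simplification of the dependent setting):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
n, d = 2000, 3
X = rng.normal(size=(n, d))
theta0, beta0 = np.array([0.5, -0.3, 0.2]), 0.0
y = np.where(rng.random(n) < sigmoid(2.0 * X @ theta0), 1.0, -1.0)
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 0.25

neigh = A @ y
G = np.hstack([X, neigh[:, None]])      # per-sample directions for (theta, beta)
m = y * (X @ theta0 + beta0 * neigh)

# Concentration: gradient of log PL at the truth; its squared norm should be O(d/n)
grad = ((2.0 * y * sigmoid(-2.0 * m))[:, None] * G).mean(axis=0)
print(np.linalg.norm(grad) ** 2)

# Anti-concentration: -Hessian of log PL = (1/n) sum_i w_i g_i g_i^T with w_i in (0, 1];
# its smallest eigenvalue should be bounded away from zero
w = 4.0 * sigmoid(2.0 * m) * sigmoid(-2.0 * m)
H = (G * w[:, None]).T @ G / n
print(np.linalg.eigvalsh(H).min())
```
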
![Page 29: Statistical Inference from Dependent Observations · Pr UԦ=𝜎=exp(σ 𝜃T 𝜎 +𝛽𝜎T𝐴𝜎) 𝑍 Infer 𝜃,𝛽 • Concentration: gradient of log-pseudolikelihood at](https://reader036.vdocuments.us/reader036/viewer/2022070121/5fcc67867164973f2206cfdb/html5/thumbnails/29.jpg)
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Supervised Learning
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples (e.g. images labeled “dog”, …, “cat”)
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Assumption: (𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛) ∼iid 𝐷, where 𝐷 is unknown
Goal: select 𝜃 to minimize expected loss: min𝜃 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ𝜃(𝑥), 𝑦)
Goal 2: in the realizable setting (i.e. when, under 𝐷, 𝑦 ∼ ℎ𝜃∗(𝑥)), estimate 𝜃∗
Supervised Learning w/ Dependent Observations
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples (e.g. images labeled “dog”, …, “cat”)
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Assumption: (𝑥1, 𝑦1), …, (𝑥𝑛, 𝑦𝑛) ∼iid 𝐷, where 𝐷 is unknown
Goal: select 𝜃 to minimize expected loss: min𝜃 𝔼(𝑥,𝑦)∼𝐷 ℓ(ℎ𝜃(𝑥), 𝑦)
Goal 2: in the realizable setting (i.e. when, under 𝐷, 𝑦 ∼ ℎ𝜃∗(𝑥)), estimate 𝜃∗
Supervised Learning w/ Dependent Observations
Given:
• Training set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, of examples (e.g. images labeled “dog”, …, “cat”)
• Hypothesis class ℋ = {ℎ𝜃 : 𝜃 ∈ Θ} of (randomized) responses
• Loss function ℓ: 𝒴 × 𝒴 → ℝ
Assumption′: a joint distribution 𝐷 samples the training examples and an unknown test sample jointly, i.e.
• {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑛, together with (𝑥, 𝑦), drawn jointly from 𝐷
Goal: select 𝜃 to minimize: min𝜃 𝔼 ℓ(ℎ𝜃(𝑥), 𝑦)
Main result
Learnability is possible when joint distribution 𝐷 over training set and test set satisfies Dobrushin’s condition.
Dobrushin’s condition
• Given a joint probability vector ҧ𝑍 = (𝑍1, …, 𝑍𝑚), define the influence of variable 𝒊 on variable 𝒋:
• “worst effect of 𝑍𝑖 on the conditional distribution of 𝑍𝑗 given 𝑍−𝑖−𝑗”
• Inf𝑖→𝑗 = sup over 𝑧𝑖, 𝑧𝑖′, 𝑧−𝑖−𝑗 of 𝑑𝑇𝑉( Pr[𝑍𝑗 ∣ 𝑧𝑖, 𝑧−𝑖−𝑗], Pr[𝑍𝑗 ∣ 𝑧𝑖′, 𝑧−𝑖−𝑗] )
• Dobrushin’s condition: “all nodes have limited total influence exerted on them”
• For all 𝑖: Σ𝑗≠𝑖 Inf𝑗→𝑖 < 1.
Implies: concentration of measure; fast mixing of the Gibbs sampler.
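For the Ising measure from Part I, a standard bound gives Inf𝑗→𝑖 ≤ tanh(𝛽|𝐴𝑖𝑗|): flipping 𝑦𝑗 shifts node 𝑖’s local field by 2𝛽|𝐴𝑖𝑗|, and the external field only reduces the total-variation effect. So the condition can be checked from 𝛽 and 𝐴 alone; a sketch (the 4-regular graph is a hypothetical example):

```python
import numpy as np

def dobrushin_coefficient(beta, A):
    """Upper bound on max_i sum_{j != i} Inf_{j->i} for the Ising measure:
    each influence is at most tanh(beta * |A_ij|)."""
    infl = np.tanh(beta * np.abs(A))
    np.fill_diagonal(infl, 0.0)
    return infl.sum(axis=1).max()

# Hypothetical example: a 4-regular graph (cycle plus distance-2 edges), edge weight 1
n = 100
A = np.zeros((n, n))
for i in range(n):
    for step in (1, 2):
        A[i, (i + step) % n] = A[(i + step) % n, i] = 1.0

print(dobrushin_coefficient(0.2, A))    # 4*tanh(0.2) ≈ 0.79 < 1: condition holds
print(dobrushin_coefficient(0.5, A))    # 4*tanh(0.5) ≈ 1.85 ≥ 1: bound fails
```
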
Learnability under Dobrushin’s condition
• Suppose:
• (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛), (𝑋, 𝑌) ∼ 𝐷, where 𝐷 is a joint distribution
• 𝐷 satisfies Dobrushin’s condition
• 𝐷 has the same marginals
• Define Loss𝐷(ℎ) = 𝔼 ℓ(ℎ(𝑥), 𝑦)
• There exists a learning algorithm which, given (𝑋1, 𝑌1), …, (𝑋𝑛, 𝑌𝑛), outputs ℎ̂ ∈ ℋ such that
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + ෨𝑂(√(𝑉𝐶(ℋ)/𝑛)), for Boolean ℋ, 0-1 loss ℓ
• Loss𝐷(ℎ̂) ≤ infℎ∈ℋ Loss𝐷(ℎ) + ෨𝑂(𝜆 𝒢𝑛(ℋ)), for general ℋ, 𝜆-Lipschitz ℓ
(need stronger than Dobrushin’s condition for general ℋ)
Essentially the same bounds as in the i.i.d. setting.
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Goals of Broader Line of Work
Goal: relax standard assumptions to accommodate two important challenges
o (i) censored/truncated samples and (ii) dependent samples
o Censoring/truncation ⇐ systematically missing data ⇒ train set ≠ test set
o Data dependencies ⇐ peer effects, spatial, temporal dependencies ⇒ no apparent source for independence
o Regression on a network (linear and logistic, √(𝑑/𝑛) rates):
Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas: Regression from Dependent Observations. In the 51st Annual ACM Symposium on the Theory of Computing (STOC’19).
o Statistical Learning Theory Framework (learnability and generalization bounds):
Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti: Generalization and Learning under Dobrushin’s Condition. In the 32nd Annual Conference on Learning Theory (COLT’19).
Statistical Estimation from Dependent Observations
Thank you!