
Statistical Inference from Dependent Observations

Constantinos Daskalakis, EECS and CSAIL, MIT

Supervised Learning Setting

Given:

• Training set $\{(x_i, y_i)\}_{i=1}^n$ of examples
• $x_i$: feature (aka covariate) vector of example $i$
• $y_i$: label (aka response) of example $i$
• Assumption: $(x_i, y_i)_{i=1}^n \sim_{\text{i.i.d.}} D$, where $D$ is unknown
• Hypothesis class $\mathcal{H} = \{h_\theta \mid \theta \in \Theta\}$ of (randomized) responses
  • e.g. $\mathcal{H} = \{\langle \theta, \cdot \rangle \mid \|\theta\|_2 \le 1\}$
  • e.g.2 $\mathcal{H} = \{\text{some class of Deep Neural Networks}\}$
  • generally, $h_\theta$ might be randomized
• Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

Goal: Select $\theta$ to minimize the expected loss: $\min_\theta \mathbb{E}_{(x,y) \sim D}\, \ell(h_\theta(x), y)$

Goal 2: In the realizable setting (i.e. when, under $D$, $y \sim h_{\theta^*}(x)$), estimate $\theta^*$

[Figure: example images with labels "dog", …, "cat"]

E.g. Linear Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \dots, n$
Assumption: for all $i$, $y_i$ is sampled as $y_i = \theta^\top x_i + \varepsilon_i$, with i.i.d. Gaussian noise $\varepsilon_i$

Goal: Infer $\theta$

E.g.2 Logistic Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled as $\Pr[y_i = 1 \mid x_i] = \frac{1}{1 + \exp(-2\, \theta^\top x_i)}$

Goal: Infer $\theta$
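The standard logistic model can be fit by maximizing the (concave) log-likelihood. Below is a minimal Python/NumPy sketch, assuming the $\pm 1$-label sampling rule $\Pr[y_i = 1 \mid x_i] = 1/(1 + \exp(-2\,\theta^\top x_i))$ written above (the $\beta = 0$ special case of the dependent model later in the talk); the synthetic data and the choice of optimizer are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 2000, 5
theta_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))

# Sample y_i in {+1, -1} with Pr[y_i = 1 | x_i] = 1 / (1 + exp(-2 theta^T x_i))
p_plus = 1.0 / (1.0 + np.exp(-2.0 * X @ theta_true))
y = np.where(rng.random(n) < p_plus, 1.0, -1.0)

def neg_log_likelihood(theta):
    # -(1/n) sum_i log Pr[y_i | x_i] = (1/n) sum_i log(1 + exp(-2 y_i theta^T x_i))
    return np.mean(np.logaddexp(0.0, -2.0 * y * (X @ theta)))

theta_hat = minimize(neg_log_likelihood, np.zeros(d)).x  # MLE
print("||theta_hat - theta||_2 =", np.linalg.norm(theta_hat - theta_true))
```

With i.i.d. samples, the printed error shrinks roughly like $\sqrt{d/n}$, in line with the MLE rate quoted on the next slide.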

Maximum Likelihood Estimator (MLE)

In the standard linear and logistic regression settings, under mild assumptions about the design matrix $X$, whose rows are the covariate vectors, the MLE is strongly consistent.

The MLE $\hat\theta$ satisfies: $\|\hat\theta - \theta\|_2 = O\big(\sqrt{d/n}\big)$

• Recall:
  • Assumption: $(X_1, Y_1), \dots, (X_n, Y_n) \sim_{\text{i.i.d.}} D$, where $D$ is unknown
  • Goal: choose $h \in \mathcal{H}$ to minimize the expected loss under some loss function $\ell$, i.e. $\min_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim D}\, \ell(h(x), y)$

• Let $\hat h \in \mathcal{H}$ be the Empirical Risk Minimizer, i.e. $\hat h \in \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_i \ell(h(x_i), y_i)$

• Then:

  $\text{Loss}_D(\hat h) \le \inf_{h \in \mathcal{H}} \text{Loss}_D(h) + O\big(\sqrt{VC(\mathcal{H})/n}\big)$, for Boolean $\mathcal{H}$ and 0-1 loss $\ell$

  $\text{Loss}_D(\hat h) \le \inf_{h \in \mathcal{H}} \text{Loss}_D(h) + O\big(\lambda\, \mathcal{R}_n(\mathcal{H})\big)$, for general $\mathcal{H}$ and $\lambda$-Lipschitz $\ell$
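As a toy, concrete illustration of empirical risk minimization and the bounds above, here is a minimal Python sketch using a finite class of 1-d threshold classifiers with 0-1 loss; the data-generating process and hypothesis class are illustrative assumptions, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy i.i.d. data: label = sign(x - 0.3), flipped with probability 0.1
def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(x > 0.3, 1, -1)
    return x, np.where(rng.random(n) < 0.1, -y, y)

x_train, y_train = sample(200)
x_test, y_test = sample(100_000)

def zero_one_loss(t, x, y):
    # 0-1 loss of the threshold classifier h_t(x) = sign(x - t)
    return np.mean(np.where(x > t, 1, -1) != y)

# Empirical Risk Minimizer over the finite class {h_t : t on a grid}
thresholds = np.linspace(-1.0, 1.0, 201)
t_hat = min(thresholds, key=lambda t: zero_one_loss(t, x_train, y_train))

print("ERM threshold:", t_hat)
print("train 0-1 loss:", zero_one_loss(t_hat, x_train, y_train))
print("test  0-1 loss:", zero_one_loss(t_hat, x_test, y_test))  # close to the best-in-class 0.1
```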

Beyond Realizable Setting: Learnability


Supervised Learning Setting on Steroids

But what do these results really mean?

Broader Perspective

① ML methods commonly operate under stringent assumptions
  o train set: independent and identically distributed (i.i.d.) samples
  o test set: i.i.d. samples from same distribution as training

② Goal: relax standard assumptions to accommodate two important challenges
  o (i) censored/truncated samples and (ii) dependent samples
  o Censoring/truncation ⇐ systematic missing of data ⇒ train set ≠ test set
  o Data Dependencies ⇐ peer effects, spatial, temporal dependencies ⇒ no apparent source for independence

Today's Topic: Statistical Inference from Dependent Observations

[Figure: independent vs. dependent observations]

Observations $(x_i, y_i)_i$ are commonly collected on some spatial domain, some temporal domain, or on a social network.

As such, they are not independent, but intricately dependent.

Why dependent?

e.g. spin glasses: neighboring particles influence each other

[Figure: grid of interacting + and − spins]

e.g. social networks: decisions/opinions of nodes are influenced by the decisions of their neighbors (peer effects)

Statistical Physics and Machine Learning

• Spin Systems [Ising'25]
• MRFs, Bayesian Networks, Boltzmann Machines
• Probability Theory
• MCMC
• Machine Learning
• Computer Vision
• Game Theory
• Computational Biology
• Causality
• …


Peer Effects on Social Networks

Several studies of peer effects, in applications such as:
• criminal networks [Glaeser et al.'96]
• welfare participation [Bertrand et al.'00]
• school achievement [Sacerdote'01]
• participation in retirement plans [Duflo-Saez'03]
• obesity [Trogdon et al.'08, Christakis-Fowler'13]

AddHealth Dataset:
• National Longitudinal Study of Adolescent Health
• national study of students in grades 7-12
• friendship networks, personal and school life, age, gender, race, socio-economic background, health, …

Microeconomics:
• behavior/opinion dynamics, e.g. [Schelling'78], [Ellison'93], [Young'93, '01], [Montanari-Saberi'10], …

Econometrics:
• disentangling individual from network effects [Manski'93], [Bramoulle-Djebbari-Fortin'09], [Li-Levina-Zhu'16], …

Menu

• Motivation
• Part I: Regression w/ dependent observations
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions


Standard Linear Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \dots, n$
Assumption: for all $i$, $y_i$ is sampled as $y_i = \theta^\top x_i + \varepsilon_i$, with i.i.d. Gaussian noise $\varepsilon_i$

Goal: Infer $\theta$

Standard Logistic Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled as $\Pr[y_i = 1 \mid x_i] = \frac{1}{1 + \exp(-2\, \theta^\top x_i)}$

Goal: Infer $\theta$

Maximum Likelihood Estimator

In the standard linear and logistic regression settings, under mild assumptions about the design matrix $X$, whose rows are the covariate vectors, the MLE is strongly consistent.

The MLE $\hat\theta$ satisfies: $\|\hat\theta - \theta\|_2 = O\big(\sqrt{d/n}\big)$

Standard Linear Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \dots, n$
Assumption: for all $i$, $y_i$ is sampled as $y_i = \theta^\top x_i + \varepsilon_i$, with i.i.d. Gaussian noise $\varepsilon_i$

Goal: Infer $\theta$

Linear Regression w/ Dependent Samples

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \dots, n$
Assumption: for all $i$, $y_i$ is sampled, conditioning on $y_{-i}$, from a linear model with peer effects

Parameters:
• coefficient vector $\theta$, inverse temperature $\beta$ (unknown)
• interaction matrix $A$ (known)

Goal: Infer $\theta$, $\beta$

Standard Logistic Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled as $\Pr[y_i = 1 \mid x_i] = \frac{1}{1 + \exp(-2\, \theta^\top x_i)}$

Goal: Infer $\theta$

Logistic Regression w/ Dependent Samples (today's focus)

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled, conditioning on $y_{-i}$, as $\Pr[y_i = 1 \mid y_{-i}] = \frac{1}{1 + \exp\left(-2(\theta^\top x_i + \beta A_i^\top y)\right)}$

Parameters:
• coefficient vector $\theta$, inverse temperature $\beta$ (unknown)
• interaction matrix $A$ (known)

Goal: Infer $\theta$, $\beta$

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: $y_1, \dots, y_n$ are sampled jointly according to the measure $\Pr[\vec y = \sigma] = \frac{\exp\left(\sum_i \theta^\top x_i \sigma_i + \beta\, \sigma^\top A \sigma\right)}{Z}$

Parameters:
• coefficient vector $\theta$, inverse temperature $\beta$ (unknown)
• interaction matrix $A$ (known)

Goal: Infer $\theta$, $\beta$

This is an Ising model w/ inverse temperature $\beta$ and external fields $\theta^\top x_i$.
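To make the joint measure concrete, here is a minimal Python sketch that draws one approximate sample from it by Gibbs sampling, repeatedly resampling each coordinate from the conditional $\Pr[y_i = 1 \mid y_{-i}] = 1/(1 + \exp(-2(\theta^\top x_i + \beta A_i^\top y)))$ used on the pseudolikelihood slides below; the toy interaction matrix (a cycle) and parameter values are illustrative assumptions, and $A$ is taken symmetric with zero diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_sample(X, theta, beta, A, n_sweeps=2000):
    """Approximate sample y in {+1,-1}^n from the Ising model with external fields
    theta^T x_i, by resampling each coordinate from
    Pr[y_i = 1 | y_{-i}] = 1 / (1 + exp(-2 (theta^T x_i + beta * A_i^T y)))."""
    n = X.shape[0]
    h = X @ theta                        # external field at each node
    y = rng.choice([-1.0, 1.0], size=n)  # arbitrary initial configuration
    for _ in range(n_sweeps):
        for i in range(n):
            field = h[i] + beta * (A[i] @ y)   # A_ii = 0, so y_i does not enter
            y[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1.0
    return y

# Toy instance: covariates in R^3, interactions on a cycle of n = 30 nodes
n, d = 30, 3
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 0.5
X = rng.normal(size=(n, d))
y = gibbs_sample(X, theta=rng.normal(size=d), beta=0.3, A=A)
print(y)
```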

Logistic Regression w/ Dependent Samples (today's focus)

For $\beta = 0$, this is equivalent to standard logistic regression.

Challenge: we observe a single sample from the joint measure, and the likelihood contains $Z$; i.e., there is no obvious law of large numbers, and the MLE is hard to compute.

Main Result for Logistic Regression from Dependent Samples

(Informally: the maximum pseudolikelihood estimator, described next, consistently recovers $(\theta, \beta)$; see the STOC'19 reference in the conclusions for the precise statement and rates.)

N.B. We show similar results for linear regression with peer effects.

Menu

• Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions


MPLE instead of MLE

• The likelihood involves $Z = Z_{\theta,\beta}$ (the partition function), so it is non-trivial to compute.

• [Besag'75, …, Chatterjee'07] study the maximum pseudolikelihood estimator (MPLE).

• The log-pseudolikelihood does not contain $Z$ and is concave. Is the MPLE consistent?
  • [Chatterjee'07]: yes, when $\beta > 0$, $\theta = 0$
  • [BM'18, GM'18]: yes, when $\beta > 0$, $\theta \ne 0 \in \mathbb{R}$, $x_i = 1$ for all $i$
  • General case?

Problem: Given
• $\vec x = (x_1, \dots, x_n)$ and
• $\vec y = (y_1, \dots, y_n) \in \{\pm 1\}^n$ sampled as
$$\Pr[\vec y = \sigma] = \frac{\exp\left(\sum_i \theta^\top x_i \sigma_i + \beta\, \sigma^\top A \sigma\right)}{Z},$$
infer $\theta, \beta$.

$$PL(\beta, \theta) := \prod_i \Pr_{\beta,\theta,A}[y_i \mid y_{-i}]^{1/n} = \prod_i \left(\frac{1}{1 + \exp\left(-2\, y_i (\theta^\top x_i + \beta A_i^\top y)\right)}\right)^{1/n}$$
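The log-pseudolikelihood above translates directly into code. A minimal sketch (assuming $A$ has zero diagonal, and $y$ is the observed $\pm 1$ response vector):

```python
import numpy as np

def log_pseudolikelihood(theta, beta, X, y, A):
    """log PL(beta, theta) = (1/n) * sum_i log Pr_{beta,theta,A}[y_i | y_{-i}], with
    Pr[y_i | y_{-i}] = 1 / (1 + exp(-2 y_i (theta^T x_i + beta * A_i^T y)))."""
    margins = 2.0 * y * (X @ theta + beta * (A @ y))
    return -np.mean(np.logaddexp(0.0, -margins))  # stable form of log 1/(1+exp(-m))
```

Note that no partition function $Z$ appears and the expression is concave in $(\theta, \beta)$, so the MPLE can be computed with any off-the-shelf convex optimizer.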

Analysis


• Concentration: the gradient of the log-pseudolikelihood at the truth should be small;
  • show that $\|\nabla \log PL(\theta_0, \beta_0)\|_2^2$ is $O(d/n)$ with probability at least 99%

• Anti-concentration: a constant lower bound on $\min_{\theta,\beta} \lambda_{\min}\left(-\nabla^2 \log PL(\theta, \beta)\right)$, with probability $\ge$ 99%
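Both quantities in the analysis can be written in closed form for the log-pseudolikelihood above. The sketch below computes the gradient and Hessian of $\log PL$ with respect to the stacked parameters $(\theta, \beta)$, so one can numerically inspect $\|\nabla \log PL(\theta_0, \beta_0)\|_2^2$ and $\lambda_{\min}(-\nabla^2 \log PL)$ on synthetic data; it is an illustration of the two quantities, not the proof's argument.

```python
import numpy as np

def grad_hess_log_pl(theta, beta, X, y, A):
    """Gradient and Hessian of log PL(theta, beta) = (1/n) sum_i log Pr[y_i | y_{-i}],
    where Pr[y_i | y_{-i}] = 1 / (1 + exp(-2 y_i m_i)) and m_i = theta^T x_i + beta * A_i^T y.
    Parameters are stacked as (theta_1, ..., theta_d, beta)."""
    n = X.shape[0]
    m = X @ theta + beta * (A @ y)
    s = 1.0 / (1.0 + np.exp(2.0 * y * m))     # s_i = sigmoid(-2 y_i m_i)
    G = np.column_stack([X, A @ y])           # row i = dm_i / d(theta, beta)
    grad = (2.0 * y * s) @ G / n              # (1/n) sum_i 2 y_i s_i g_i
    hess = -((4.0 * s * (1.0 - s)) * G.T) @ G / n   # -(1/n) sum_i 4 s_i (1 - s_i) g_i g_i^T
    return grad, hess

# e.g., on data (X, y, A) drawn with the Gibbs sketch earlier, at the true (theta_0, beta_0):
#   g, H = grad_hess_log_pl(theta_0, beta_0, X, y, A)
#   print(np.sum(g**2))                  # concentration: should be on the order of d/n
#   print(np.linalg.eigvalsh(-H).min())  # anti-concentration: bounded away from 0
```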


Menu

• Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions


Supervised Learning

Given:

• Training set $\{(x_i, y_i)\}_{i=1}^n$ of examples
• Hypothesis class $\mathcal{H} = \{h_\theta \mid \theta \in \Theta\}$ of (randomized) responses
• Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

Assumption: $(x_i, y_i)_{i=1}^n \sim_{\text{i.i.d.}} D$, where $D$ is unknown

Goal: Select $\theta$ to minimize the expected loss: $\min_\theta \mathbb{E}_{(x,y) \sim D}\, \ell(h_\theta(x), y)$

Goal 2: In the realizable setting (i.e. when, under $D$, $y \sim h_{\theta^*}(x)$), estimate $\theta^*$

[Figure: example images with labels "dog", …, "cat"]


Supervised Learning w/ Dependent Observations

Given:

• Training set $\{(x_i, y_i)\}_{i=1}^n$ of examples
• Hypothesis class $\mathcal{H} = \{h_\theta \mid \theta \in \Theta\}$ of (randomized) responses
• Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

Assumption': a joint distribution $D$ samples the training examples and the unknown test sample jointly, i.e.
• $\{(x_i, y_i)\}_{i=1}^n \cup \{(x, y)\} \sim D$

Goal: select $\theta$ to minimize $\min_\theta \mathbb{E}\, \ell(h_\theta(x), y)$


Main result

Learnability is possible when the joint distribution $D$ over the training set and test sample satisfies Dobrushin's condition.

Dobrushin's condition

• Given a random vector $\bar p = (p_1, \dots, p_m)$ with some joint distribution, define the influence of variable $i$ on variable $j$ as the "worst effect of $p_i$ on the conditional distribution of $p_j$ given $p_{-i-j}$":

  $\mathrm{Inf}_{i \to j} = \sup_{z_i,\, z_i',\, z_{-i-j}} d_{TV}\left(\Pr[p_j \mid z_i, z_{-i-j}],\ \Pr[p_j \mid z_i', z_{-i-j}]\right)$

• Dobrushin's condition ("all nodes have limited total influence exerted on them"): for all $i$, $\sum_{j \ne i} \mathrm{Inf}_{j \to i} < 1$.

Implies: concentration of measure; fast mixing of the Gibbs sampler
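For small discrete models, the influences and Dobrushin's condition can be checked directly by brute force. Below is a sketch for $\pm 1$-valued variables: it enumerates configurations to compute each $\mathrm{Inf}_{j \to i}$ exactly and then checks whether every total incoming influence is below 1; the example joint (a small Ising model on a 4-cycle) is an illustrative assumption.

```python
import itertools
import numpy as np

def dobrushin_influences(prob, m):
    """inf[j, i] = sup over (z_j, z_j', z_rest) of the TV distance between the conditional
    laws of z_i given (z_j, z_rest) and (z_j', z_rest), for a joint over {+1,-1}^m with
    (possibly unnormalized) probability function prob(z). For a binary z_i the TV distance
    is just |Pr[z_i = +1 | .] - Pr[z_i = +1 | .]|."""
    inf = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            rest = [k for k in range(m) if k != i and k != j]
            worst = 0.0
            for z_rest in itertools.product([-1.0, 1.0], repeat=len(rest)):
                def p_plus(zj):
                    z = np.empty(m)
                    z[rest] = z_rest
                    z[j] = zj
                    z[i] = 1.0
                    num = prob(z)
                    z[i] = -1.0
                    return num / (num + prob(z))
                worst = max(worst, abs(p_plus(+1.0) - p_plus(-1.0)))
            inf[j, i] = worst
    return inf

# Example: Ising model with Pr[z] proportional to exp(beta * z^T A z) on a 4-cycle
m, beta = 4, 0.2
A = np.zeros((m, m))
for k in range(m):
    A[k, (k + 1) % m] = A[(k + 1) % m, k] = 1.0
inf = dobrushin_influences(lambda z: np.exp(beta * z @ A @ z), m)
print("total influence on each node:", inf.sum(axis=0))
print("Dobrushin's condition holds:", bool(np.all(inf.sum(axis=0) < 1.0)))
```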

Learnability under Dobrushin's condition

• Suppose:
  • $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y) \sim D$, where $D$ is a joint distribution
  • $D$ satisfies Dobrushin's condition
  • $D$ has the same marginals

• Define $\text{Loss}_D(h) = \mathbb{E}\, \ell(h(x), y)$

• There exists a learning algorithm which, given $(X_1, Y_1), \dots, (X_n, Y_n)$, outputs $\hat h \in \mathcal{H}$ such that

$\text{Loss}_D(\hat h) \le \inf_{h \in \mathcal{H}} \text{Loss}_D(h) + \tilde O\big(\sqrt{VC(\mathcal{H})/n}\big)$, for Boolean $\mathcal{H}$ and 0-1 loss $\ell$

$\text{Loss}_D(\hat h) \le \inf_{h \in \mathcal{H}} \text{Loss}_D(h) + \tilde O\big(\lambda\, \mathcal{G}_n(\mathcal{H})\big)$, for general $\mathcal{H}$ and $\lambda$-Lipschitz $\ell$

(a condition stronger than Dobrushin's is needed for general $\mathcal{H}$)

These are essentially the same bounds as in the i.i.d. setting.

Menu

• Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions


Goal: relax standard assumptions to accommodate two important challenges
  o (i) censored/truncated samples and (ii) dependent samples
  o Censoring/truncation ⇐ systematic missing of data ⇒ train set ≠ test set
  o Data Dependencies ⇐ peer effects, spatial, temporal dependencies ⇒ no apparent source for independence

Goals of Broader Line of Work

o Regression on a network (linear and logistic, $d/n$ rates):
  Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas: Regression from Dependent Observations. In the 51st Annual ACM Symposium on the Theory of Computing (STOC'19).

o Statistical Learning Theory framework (learnability and generalization bounds):
  Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti: Generalization and Learning under Dobrushin's Condition. In the 32nd Annual Conference on Learning Theory (COLT'19).

Statistical Estimation from Dependent Observations

Thank you!
