ece6980 an algorithmic and information theoretic toolbox for … · 2016-12-21 · information...

Post on 09-Jul-2020


ECE6980

An Algorithmic and Information Theoretic

Toolbox for Massive Data

Instructor: Jayadev Acharya
Email: acharya@cornell.edu
Lectures: TuTh 1.25-2.40, 203 Phillips
Office Hours: MoTh 3-4, 304 Rhodes
Website: http://people.csail.mit.edu/jayadev/ece6980

Logistics

Grading
• Scribe a lecture: 10%
  • Encouraged to fill in the details, provide examples
• Assignments: 30-60%
  • 2-3 assignments
  • Typeset?
• Project report and presentation: 40-60%
  • Read a new related paper
  • Present a summary in your own words
  • Can choose from a list
• Interruptions: 5%

Lectures
• Lectures primarily on the board
• Derive things (mostly from scratch)

Course overview
• Lot of interest in data science
  • Number of courses on offer
  • Many aspects can be covered
• This course:
  • Core primitives
  • Efficient algorithms
  • Fundamental limits
  • Mostly theoretical, encourage implementation

Prerequisites
• Undergraduate probability/random processes
• Basic combinatorics

• What is the variance of a random variable?
• What is a binomial distribution?
• When are two random variables independent?

What you should learn?
• Fast algorithms for statistical problems
  • Learning discrete distributions
  • Finite sample hypothesis testing
• How to prove information theoretic lower bounds

Probabilistic thinking

Classical
• Small domain $D$
• $n$ large, $D$ small

Modern
• Large domain $D$
• $n$ small, $D$ large

Old questions, new issues

Domain:

$n = 1000$ tosses

• Asymptotic analysis
• Computation not crucial

New challenges

One human genome

Domain:

Resources

Samples
• How much data needed?
• Inference when data is scarce

Computation
• How does run-time scale with data and domain size?
• Even quadratic might be prohibitive

Other resources
• Storage: Not enough space to store all data
• Communication: Distributed data across servers

Goals

For statistical inference
• Design efficient algorithms
• Understand fundamental limits

[Diagram labels: Information Theory, Machine Learning, Statistics, Algorithms]

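Distribution learning

A simple setting
• Support set $\mathcal{X}$
• Distribution $p: \mathcal{X} \to \mathbb{R}_{\ge 0}$, such that $\sum_{x \in \mathcal{X}} p(x) = 1$
• Samples $x^n = x_1 x_2 \ldots x_n$ drawn from $p$
• Output a distribution $q(x^n)$ after observing $x^n$

Toss a coin: HTTTHTTH

Throw a die: 31344536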

What is a good estimator
• Would like $q$ to be close to $p$
• $L(p, q)$: Loss for estimating $p$ with $q$
  • Total variation distance, KL divergence, …
• Find an estimator with small $L$
• For a given loss function, how many samples needed?
• Empirical estimators:
  • $q(H) = \frac{3}{8}$ (3 heads in the 8 tosses HTTTHTTH)
  • Analyze the performance of empirical estimators
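The empirical estimator and the two losses named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names (`empirical_distribution`, `total_variation`, `kl_divergence`) and the fair coin used as the reference distribution are assumptions made here, not part of the lecture.

```python
from collections import Counter
from math import log

def empirical_distribution(samples):
    """Return q(x) = (number of times x appears) / n for each observed symbol x."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; infinite if q gives zero mass where p does not."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return float("inf")
        total += px * log(px / qx)
    return total

# Coin sequence from the slide: 3 heads and 5 tails in 8 tosses.
q_coin = empirical_distribution("HTTTHTTH")
fair_coin = {"H": 0.5, "T": 0.5}  # assumed reference distribution (hypothetical)
tv = total_variation(fair_coin, q_coin)
```

Here `q_coin["H"]` is 3/8, and `tv` measures how far the empirical distribution is from the assumed fair coin. The KL divergence of the empirical distribution from any reference becomes infinite as soon as the reference puts mass on a symbol that was never observed, which is one reason the choice of loss matters.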

Learning
• Given samples from a Gaussian distribution $N(\mu, \sigma^2)$
• Learn with a Gaussian distribution
  • Relatively simple
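One way to see why the single-Gaussian case is relatively simple: the sample mean and sample variance already give a fitted Gaussian. The sketch below assumes the maximum-likelihood variance (dividing by $n$); the function name `fit_gaussian` and the data values are hypothetical and chosen only for illustration.

```python
from statistics import fmean

def fit_gaussian(samples):
    """Fit N(mu, sigma^2) by the sample mean and the maximum-likelihood variance."""
    n = len(samples)
    mu_hat = fmean(samples)
    # Maximum-likelihood variance divides by n (the unbiased version divides by n - 1).
    var_hat = sum((x - mu_hat) ** 2 for x in samples) / n
    return mu_hat, var_hat

# Hypothetical measurements, not data from the lecture.
mu_hat, var_hat = fit_gaussian([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```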

Learning

• Ratio of breadth to height of 1000 crabs by W. Weldon
• Not normally distributed, more than one species?
• Karl Pearson: Mixtures of Gaussians (much harder!!)

Distribution testing

Polish Multilotek:
• Picks 20 numbers between 1,…,80

Is it fair?

Testing uniformity

Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
(Figure by Onak, Price, Rubinfeld)

Testing uniformity (contd)

A simple setting
• $|\mathcal{X}| = k$
• $u$: uniform distribution over $\mathcal{X}$
• $x^n$: $n$ samples from a distribution $p$

Question: Is $p = u$ OR $\|p - u\|_1 \ge \varepsilon$?

How many samples do we need? Take a guess…
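The slide leaves the sample complexity as a question. One standard idea from the literature, not necessarily the one developed in this course, is to count collisions: the probability that two independent samples coincide is $\|p\|_2^2$, which equals $1/k$ when $p = u$ and is at least $(1 + \varepsilon^2)/k$ when $\|p - u\|_1 \ge \varepsilon$. The sketch below compares the observed collision rate to the midpoint threshold $(1 + \varepsilon^2/2)/k$. The threshold choice and the function names are assumptions made for this illustration, and the code says nothing about how large $n$ must be for the test to be reliable.

```python
from collections import Counter

def collision_statistic(samples):
    """Fraction of unordered sample pairs that are equal (estimates ||p||_2^2)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to form a pair")
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (n * (n - 1) / 2)

def looks_uniform(samples, k, eps):
    """Accept 'p = u' when the collision rate is below the midpoint threshold."""
    threshold = (1 + eps ** 2 / 2) / k  # assumed midpoint between 1/k and (1 + eps^2)/k
    return collision_statistic(samples) <= threshold

# Toy usage with hypothetical samples over a domain of size k = 4.
rate = collision_statistic([1, 1, 2, 2])             # 2 colliding pairs out of 6
verdict_spread = looks_uniform([0, 1, 2, 3], 4, 0.5)  # no collisions at all
verdict_point = looks_uniform([0, 0, 0, 0], 4, 0.5)   # every pair collides
```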

A simple classification problem

$X^n$: a sports article
$Y^n$: a religious article

$Z$: one word

Q: Is $Z$ more likely to appear in sports or religion?

Necessarily assign $Z$ to where it appears more often?
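For concreteness, the rule that this slide calls into question (assign the word to the article in which it appears more often) can be written as follows. This is a sketch of the naive rule only, not a recommended classifier; the function name `naive_label` and the toy word lists are hypothetical.

```python
def naive_label(word, sports_words, religion_words):
    """Assign `word` to the article in which its empirical frequency is higher."""
    freq_sports = sports_words.count(word) / len(sports_words)
    freq_religion = religion_words.count(word) / len(religion_words)
    if freq_sports == freq_religion:
        return "tie"
    return "sports" if freq_sports > freq_religion else "religion"

# Hypothetical toy articles, given as lists of words.
sports = ["goal", "team", "goal"]
religion = ["faith", "goal", "prayer", "hope"]
label_seen = naive_label("goal", sports, religion)      # 2/3 versus 1/4
label_unseen = naive_label("season", sports, religion)  # appears in neither article
```

A word that appears in neither article gets frequency zero on both sides, so the naive rule cannot decide. That is one situation where the slide's question has bite.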

Property estimation

Predicting new elements

...

How many new species?

Applications of estimating the unseen

Corbet collected butterflies in Malaya for one year

Frequency    1    2    3    4    5    6    7   ..
Species    118   74   44   24   29   22   20   ..

# of seen species = 118 + 74 + 44 + 24 + . . .

# of new species in the next year?

# of words in a book..

How many new species if he goes for one more year?
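One classical estimator for this question, which the slide does not name, is the Good-Toulmin estimator: if $\Phi_i$ is the number of species seen exactly $i$ times, the number of new species expected in $t$ times more sampling (for $t \le 1$) is estimated by $\sum_{i \ge 1} (-1)^{i+1} t^i \Phi_i$. The sketch below applies it to the seven frequencies shown in the table. Because the table is truncated ("..") the result is only a partial sum over the visible columns, not an estimate for the full data set; the function name `good_toulmin` is an assumption.

```python
def good_toulmin(prevalences, t=1.0):
    """Good-Toulmin estimate of the number of new species in t times more sampling.

    prevalences[i] is the number of species seen exactly i + 1 times.
    The alternating series is only well behaved for t <= 1.
    """
    return sum((-1) ** i * t ** (i + 1) * phi for i, phi in enumerate(prevalences))

# Only the first seven columns of the table are available; the rest is elided.
visible_prevalences = [118, 74, 44, 24, 29, 22, 20]
seen_so_far = sum(visible_prevalences)              # species counted in visible columns
partial_estimate = good_toulmin(visible_prevalences)  # 118 - 74 + 44 - 24 + 29 - 22 + 20
```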

Entropy estimation

Measuring Randomness in Data

Estimating randomness of the observed data:

Neural signal processing

Feature selection for machine learning

Image Registration

Approach: Estimate the “entropy” of the generating distribution

Shannon entropy $H(p) \stackrel{\text{def}}{=} \sum_x -p_x \log p_x$

How much randomness in neural spikes?

How to estimate entropy from observations?

Entropy estimation
• $|\mathcal{X}| = k$
• $x^n$: $n$ samples from a distribution $p$

Question: Estimate $H(p)$
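A natural baseline for this question is the plug-in estimator: compute the empirical distribution and evaluate the entropy formula on it. The sketch below is only that baseline, not necessarily the estimator studied in the course. The function name `plugin_entropy` and the choice of base-2 logarithms (entropy in bits) are assumptions, since the slide does not fix the base.

```python
from collections import Counter
from math import log2

def plugin_entropy(samples):
    """Plug-in entropy estimate in bits: the entropy of the empirical distribution."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

h_balanced = plugin_entropy("HTHT")      # two symbols, equally frequent
h_constant = plugin_entropy("HHHH")      # a single symbol, no randomness observed
h_coin = plugin_entropy("HTTTHTTH")      # 3 heads and 5 tails
```

Symbols that never appear in the sample contribute nothing to the plug-in value, so when $n$ is small relative to $k$ the estimate tends to fall below $H(p)$.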

Resource constraints

Data too big to be stored in a single machine
Lot of recent interest

[Diagram labels: sample stream, limited memory, decision]

[Diagram labels: distributed data, limited communication]
