ece6980 an algorithmic and information theoretic toolbox for … · 2016-12-21 · information...
TRANSCRIPT
![Page 1: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate](https://reader033.vdocuments.us/reader033/viewer/2022042418/5f349ce57e25c5749530c736/html5/thumbnails/1.jpg)
ECE6980
An Algorithmic and Information Theoretic
Toolbox for Massive Data
Logistics
Instructor: Jayadev Acharya
Email: [email protected]
TuTh 1.25-2.40, 203 Phillips
Office Hours: MoTh 3-4, 304 Rhodes
Website: http://people.csail.mit.edu/jayadev/ece6980
Grading
• Scribe a lecture: 10%
  • Encouraged to fill in the details, provide examples
• Assignments: 30-60%
  • 2-3 assignments
  • Typeset?
• Project report and presentation: 40-60%
  • Read a new related paper
  • Present a summary in your own words
  • Can choose from a list
• Interruptions: 5%
Lectures
• Lectures primarily on the board
• Derive things (mostly from scratch)
Course overview
• Lot of interest in data science
  • Number of courses on offer
  • Many aspects can be covered
• This course:
  • Core primitives
  • Efficient algorithms
  • Fundamental limits
  • Mostly theoretical, encourage implementation
Prerequisites
• Undergraduate probability/random processes
• Basic combinatorics
• What is the variance of a random variable?
• What is a binomial distribution?
• When are two random variables independent?
What you should learn?
• Fast algorithms for statistical problems
  • Learning discrete distributions
  • Finite sample hypothesis testing
• How to prove information theoretic lower bounds
Probabilistic thinking
Old questions, new issues
Classical
• Small domain $D$: $n$ large, $D$ small
• Domain: [figure not preserved], $n = 1000$ tosses
• Asymptotic analysis
• Computation not crucial
Modern
• Large domain $D$: $n$ small, $D$ large
• Domain: [figure not preserved], one human genome
• New challenges
Resources
Samples
• How much data needed?
• Inference when data is scarce
Computation
• How does run-time scale with data and domain size?
• Even quadratic might be prohibitive
Other resources
• Storage: Not enough space to store all data
• Communication: Distributed data across servers
Goals
For statistical inference
• Design efficient algorithms
• Understand fundamental limits
(Diagram labels: INFORMATION THEORY, MACHINE LEARNING, STATISTICS, ALGORITHMS)
Distribution learning
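A simple setting
• Support set $\mathcal{X}$
• Distribution $p: \mathcal{X} \to \mathbb{R}_{\ge 0}$, such that $\sum_{x \in \mathcal{X}} p(x) = 1$
• Samples $x^n = x_1 x_2 \ldots x_n$ drawn from $p$
• Output a distribution $q(x^n)$ after observing $x^n$
Toss a coin: HTTTHTTH
Throw a die: 31344536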
What is a good estimator
• Would like $q$ to be close to $p$
• $L(p, q)$: Loss for estimating $p$ with $q$
  • Total variation distance, KL divergence, …
• Find an estimator with small $L$
• For a given loss function, how many samples needed?
• Empirical estimators:
  • $q(H) = \frac{3}{8}$
• Analyze the performance of empirical estimators
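As a minimal sketch of the empirical estimator and one of the losses named above, the Python below builds the empirical distribution from the coin tosses HTTTHTTH and measures its total variation distance to a reference distribution. The function names and the choice of a fair coin as the reference are assumptions made for illustration; they are not from the slides.

```python
from collections import Counter


def empirical_distribution(samples):
    """Return q(x) = (number of times x appears) / n for each observed x."""
    n = len(samples)
    counts = Counter(samples)
    return {x: c / n for x, c in counts.items()}


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Coin tosses from the earlier slide.
q = empirical_distribution("HTTTHTTH")

# Hypothetical reference distribution (a fair coin), assumed for illustration.
fair = {"H": 0.5, "T": 0.5}

print(q["H"])                    # fraction of heads among the 8 tosses
print(total_variation(q, fair))  # loss of q measured against the fair coin
```

In a real analysis the loss is measured against the unknown $p$, so it cannot be computed from the samples alone; the course question is how large $n$ must be for this loss to be small.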
Learning
• Given samples from a Gaussian distribution $N(\mu, \sigma^2)$
• Learn with a Gaussian distribution
  • Relatively simple
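One way to see why the single-Gaussian case is relatively simple: the maximum likelihood fit is just the sample mean and the (1/n) sample variance. The sketch below assumes that estimator; the slides do not say which estimator the course uses, and `fit_gaussian` is a hypothetical helper name.

```python
from statistics import fmean, pvariance


def fit_gaussian(samples):
    """Maximum likelihood estimates (mu_hat, var_hat) for N(mu, sigma^2).

    var_hat is the population variance (divides by n, not n - 1).
    """
    mu_hat = fmean(samples)
    var_hat = pvariance(samples, mu_hat)
    return mu_hat, var_hat


# Made-up data, for illustration only.
mu_hat, var_hat = fit_gaussian([1.0, 2.0, 3.0, 4.0])
print(mu_hat, var_hat)
```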
Learning
Ratio of breadth to height of 1000 crabs by W. Weldon
Not normally distributed, more than one species?
Karl Pearson: Mixtures of Gaussians (much harder!!)
Distribution testing
Testing uniformity
Polish Multilotek:
• Picks 20 numbers between 1,…,80
Is it fair?
Testing uniformity (contd)
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
(Figure by Onak, Price, Rubinfeld)
A simple setting
• $|\mathcal{X}| = k$
• $u$: uniform distribution over $\mathcal{X}$
• $x^n$: $n$ samples from a distribution $p$
Question: Is $p = u$ OR $\|p - u\|_1 \ge \varepsilon$?
How many samples do we need? Take a guess…
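The slide leaves the sample count as a guess, and the sketch below does not answer it. It only illustrates one standard way to build such a test, using collisions: under the uniform distribution two independent samples agree with probability $1/k$, while any $p$ that is $\varepsilon$-far from $u$ in $\ell_1$ has collision probability at least $(1 + \varepsilon^2)/k$. The threshold halfway between the two and the function names are assumptions for illustration, not the course's algorithm.

```python
from collections import Counter
from math import comb


def collision_statistic(samples):
    """Fraction of sample pairs (i < j) that land on the same symbol."""
    n = len(samples)
    counts = Counter(samples)
    collisions = sum(comb(c, 2) for c in counts.values())
    return collisions / comb(n, 2)


def looks_uniform(samples, k, eps):
    """Accept 'p = u' when the collision rate is below a midpoint threshold.

    Assumed threshold: (1 + eps^2 / 2) / k, halfway between the collision
    probability 1/k of the uniform distribution and the lower bound
    (1 + eps^2)/k that holds for every eps-far distribution.
    """
    threshold = (1 + eps ** 2 / 2) / k
    return collision_statistic(samples) <= threshold


# Toy inputs over a domain of size k = 4, for illustration only.
balanced = [0, 1, 2, 3] * 5   # every symbol appears equally often
constant = [0] * 20           # all mass on one symbol

print(looks_uniform(balanced, k=4, eps=0.5))
print(looks_uniform(constant, k=4, eps=0.5))
```

With too few samples the collision rate is noisy and the decision is unreliable, which is exactly why the number of samples needed is the interesting question.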
A simple classification problem
$X^n$: a sports article
$Y^n$: a religious article
$Z$: one word
Q: Is $Z$ more likely to appear in sports or religion?
Necessarily assign $Z$ to where it appears more often?
Property estimation
Predicting new elements
How many new species?
Corbet collected butterflies in Malaya for one year

Frequency    1    2    3    4    5    6    7   ..
Species    118   74   44   24   29   22   20   ..

# of seen species = 118 + 74 + 44 + 24 + . . .
How many new species if he goes for one more year?
Other applications of estimating the unseen: # of words in a book..
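The slide does not name an estimator for this question. One classical choice is the Good-Toulmin estimator, which predicts the number of new species in a further period $t$ times as long as the first one by $-\sum_{i \ge 1} (-t)^i \Phi_i$, where $\Phi_i$ is the number of species seen exactly $i$ times. The sketch below assumes that estimator and uses only the seven counts visible in the table, so its output is illustrative rather than the historical estimate.

```python
def good_toulmin(prevalences, t=1.0):
    """Good-Toulmin estimate of the number of new species.

    prevalences[i - 1] is the number of species seen exactly i times.
    t is the length of the new collection period relative to the old one.
    """
    return -sum((-t) ** i * phi for i, phi in enumerate(prevalences, start=1))


# First seven entries of the frequency table; the table continues ("..").
phi = [118, 74, 44, 24, 29, 22, 20]

# One more year of collecting corresponds to t = 1.
print(good_toulmin(phi, t=1.0))
```

For $t = 1$ the estimate is the alternating sum $\Phi_1 - \Phi_2 + \Phi_3 - \cdots$, so truncating the table changes the answer.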
Entropy estimation
Measuring Randomness in Data
Estimating randomness of the observed data:
• Neural signal processing
• Feature selection for machine learning
• Image registration
Approach: Estimate the “entropy” of the generating distribution
Shannon entropy $H(p) \stackrel{\text{def}}{=} \sum_x -p_x \log p_x$
How much randomness in neural spikes?
How to estimate entropy from observations?
Entropy estimation
• $|\mathcal{X}| = k$
• $x^n$: $n$ samples from a distribution $p$
Question: Estimate $H(p)$
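The most direct baseline is the plug-in estimator: compute the empirical distribution of the samples and evaluate the Shannon entropy formula on it. The sketch below assumes that baseline and logarithms to base 2 (entropy in bits); the slides fix neither. The plug-in estimate tends to underestimate $H(p)$ when $n$ is small relative to $k$, which is part of what makes the question interesting.

```python
from collections import Counter
from math import log2


def plugin_entropy(samples):
    """Plug-in (empirical) Shannon entropy of the samples, in bits."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Toy sequences, for illustration only.
print(plugin_entropy("HTHT"))      # two symbols, equally frequent
print(plugin_entropy("HTTTHTTH"))  # coin tosses from the earlier slide
```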
Resource constraints
Data too big to be stored in a single machine
Lot of recent interest
(Figure labels: sample stream, limited memory, decision; distributed data, limited communication)