![Page 1: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/1.jpg)
Girish Nathan, Misha Bilenko
Microsoft Azure Machine Learning
How to Work with Large Datasets to Build Predictive Models
![Page 2: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/2.jpg)
Agenda
1. How to Work with Large Datasets
• Sample Dataset: NYC Taxi
• HDInsight (Hadoop on Azure)
• iPython notebook and HDInsight
2. Building Predictive Models
• Azure ML Studio
• Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight
![Page 3: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/3.jpg)
Sample Data: NYC Taxi
• One-year log of NYC taxi rides
• 60GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip (driver id, times, locations) and fare (fare, tip, tolls)
• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, iPython, Azure ML Studio
![Page 4: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/4.jpg)
• 100% Apache Hadoop as an Azure service
• Can deploy on Windows or Linux
• Provides Map-Reduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code
    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;
HDInsight: Hadoop on Azure
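The Hive query on this slide is ordinary SQL aggregation, so its behavior can be tried locally before running it on a cluster. The following is a minimal sketch that assumes Python's built-in `sqlite3` module as a stand-in for Hive; the rows are made up for illustration, and `Your_Table` and `Col1` are the placeholder names from the slide.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (toy data, not from the talk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Your_Table (Col1 TEXT)")
conn.executemany(
    "INSERT INTO Your_Table (Col1) VALUES (?)",
    [("b",), ("a",), ("b",), ("c",), ("a",), ("b",)],
)

# Same query as on the slide: the ten most frequent values of Col1.
top_values = conn.execute(
    "SELECT Col1, COUNT(*) AS Count_Col1 FROM Your_Table "
    "GROUP BY Col1 ORDER BY Count_Col1 DESC LIMIT 10"
).fetchall()
conn.close()

print(top_values)  # [('b', 3), ('a', 2), ('c', 1)]
```

On HDInsight the same statement would run as a distributed job over data in Azure blobs rather than in a single process.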
![Page 5: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/5.jpg)
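• Web-based Python REPL environment
• Combines authoring, execution, visualization
• Can author and execute HDInsight Hive queries
• Sample query (Python code snippet):

    import json
    import urllib2  # Python 2 standard library

    # Methods of the notebook's query helper class (class definition not shown on the slide).
    def submit_hive_query(self):
        # POST the Hive parameters to the cluster URL and keep the returned job id.
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']

    def query(self, queryString):
        self.submit_hive_query()

Example query string: SELECT * FROM sample_table LIMIT 10;
iPython Notebook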
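The slide's snippet omits how the request is assembled. One plausible shape, sketched in Python 3, assumes the cluster exposes Hive through the WebHCat (Templeton) REST endpoint with its `user.name`, `execute`, and `statusdir` parameters and HTTP basic authentication. The cluster URL, credentials, status directory, and both function names are hypothetical, and the request is only built here, not sent.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_hive_request(cluster_url, user, password, query, status_dir="/hive_status"):
    """Build (but do not send) a POST request that would submit a Hive query."""
    params = urllib.parse.urlencode({
        "user.name": user,
        "execute": query,          # the Hive query string
        "statusdir": status_dir,   # hypothetical output location for job status
    }).encode("utf-8")
    # Assumed endpoint path: WebHCat's Hive resource.
    request = urllib.request.Request(cluster_url + "/templeton/v1/hive", data=params)
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


def parse_job_id(response_body):
    """Extract the job id from the JSON body returned on submission."""
    return json.loads(response_body)["id"]
```

Passing the built request to `urllib.request.urlopen` and the response body to `parse_job_id` would mirror the `data['id']` step in the slide's snippet.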
![Page 6: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/6.jpg)
• Fully managed cloud service
• Browser-based authoring of dataflow
• Best-in-class machine learning algorithms
• Support for R/Python/SQL
• Collaborative data science
• Quickly deploy models as web services/REST APIs
• Publish to a gallery for collaboration with community
What is Azure ML Studio?
![Page 7: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/7.jpg)
(Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning, Microsoft Research
Learning with Counts, a.k.a. Dracula
![Page 8: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/8.jpg)
adid = 1010054353
adText = K2 ski sale!
adURL = www.k2.com/sale
userid = 0xb49129827048dd9b
IP = 131.107.65.14
query = powder skis
qCategories = {skiing, outdoor gear}
$\#\text{users} \sim 10^{9}$, $\#\text{queries} \sim 10^{9+}$, $\#\text{ads} \sim 10^{7}$, $\#(\text{ad} \times \text{query}) \sim 10^{10+}$
• Information retrieval
  • Advertising, recommending, search: item, page/query, user
• Transaction classification
  • Payment fraud: transaction, product, user
  • Email spam: message, sender, recipient
  • Intrusion detection: session, system, user
  • IoT: device, location
Large-scale learning in multi-entity domains
![Page 9: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/9.jpg)
adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid: 0xb49129827048dd9b
IP: 131.107.65.14
query: powder skis
qCategories: {skiing, outdoor gear}
• Problem: representing high-cardinality attributes as features
  • Scalable: to billions of attribute values
  • Efficient: predictions/sec
  • Flexible: for a variety of downstream learners
  • Adaptive: to distribution change
• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization
Large-scale learning in multi-entity domains
![Page 10: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/10.jpg)
• Features are transforms of conditional statistics (per-label counts)
  Feature vector: $[\,N^{+},\; N^{-},\; \log N^{+} - \log N^{-},\; \text{IsBackoff}\,]$
• $\log N^{+} - \log N^{-} = \log(N^{+}/N^{-})$: log-odds / Naïve Bayes estimate
• $N^{+}$, $N^{-}$: indicators of confidence of the naïve estimate
• IsBackoff (IsFromRest): indicator of back-off vs. “real count”
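[Figure: count features are looked up for each attribute value of the example: IP 131.107.65.14, ad domain k2.com, query "powder skis", and the pair (powder skis, k2.com).]

| IP | N+ | N- |
| --- | --- | --- |
| 173.194.33.9 | 46964 | 993424 |
| 87.250.251.11 | 31 | 843 |
| 131.107.65.14 | 12 | 430 |
| … | … | … |
| REST | 745623 | 13964931 |
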
Learning with Counts
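The feature transform on this slide can be sketched in a few lines of Python. The counts are copied from the IP table; reading its two numeric columns as N+ and N- (in that order) is an assumption, and the function name `count_features` is hypothetical.

```python
import math

# Per-value counts (N+, N-) copied from the IP table on the slide.
ip_counts = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
}
rest_counts = (745623, 13964931)  # the REST (back-off) row


def count_features(value, table, rest):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = rest if is_backoff else table[value]
    # Assumes both counts are positive, so the logarithms are defined.
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, int(is_backoff)]


seen = count_features("131.107.65.14", ip_counts, rest_counts)
unseen = count_features("10.0.0.1", ip_counts, rest_counts)
```

Here `seen` carries the value's own counts with IsBackoff = 0, while `unseen` falls back to the REST row with IsBackoff = 1, so the downstream learner can tell a real count from a back-off estimate.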
![Page 11: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/11.jpg)
• Features are transforms of conditional counts: [N+, N-, log(N+)-log(N-), IsBackoff]
• Scalable: “head” in memory + tail in backoff; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts
Learning with Counts
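The count-min sketch mentioned as a scalable alternative can be illustrated as follows. This is a minimal sketch, not the implementation used in the talk; the class name, the default `width` and `depth`, and the choice of SHA-256 as the hash are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter using `depth` hashed rows of `width` cells each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One cell per row; the row index salts the hash so each row hashes differently.
        for row in range(self.depth):
            digest = hashlib.sha256("{}:{}".format(row, key).encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, amount=1):
        for row, col in self._cells(key):
            self.rows[row][col] += amount

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum is the tightest upper bound.
        return min(self.rows[row][col] for row, col in self._cells(key))


sketch = CountMinSketch()
for _ in range(3):
    sketch.add("powder skis")
sketch.add("facebook")
```

Memory stays fixed at `width × depth` cells regardless of how many distinct values are seen, and estimates never undercount. Keeping one sketch per label would give approximate N+ and N- for any value.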
![Page 12: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/12.jpg)
Aggregate per-label counts for different bin functions
• Standard MapReduce
• Bin function: any projection
• Backoff options: “tail bin”, hashing, hierarchical (shrinkage)
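| IP | N+ | N- |
| --- | --- | --- |
| 173.194.33.9 | 46964 | 993424 |
| 87.250.251.11 | 31 | 843 |
| 131.253.13.32 | 12 | 430 |
| … | … | … |
| REST | 745623 | 13964931 |

| query | N+ | N- |
| --- | --- | --- |
| facebook | 281912 | 7957321 |
| dozen roses | 32791 | 640964 |
| … | … | … |
| REST | 6321789 | 43477252 |

| Query × AdId | N+ | N- |
| --- | --- | --- |
| facebook, ad1 | 54546 | 978964 |
| facebook, ad2 | 232343 | 8431467 |
| dozen roses, ad3 | 12973 | 430982 |
| … | … | … |
| REST | 4419312 | 52754683 |

| IP[2] | N+ | N- |
| --- | --- | --- |
| 173.194.*.* | 46964 | 993424 |
| 87.250.*.* | 6341 | 91356 |
| 131.253.*.* | 75126 | 430826 |
| … | … | … |

[Figure: counting runs along a time axis up to $T_{now}$.]

Learning with Counts: aggregation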
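The aggregation step can be sketched as a single-process map and reduce in Python. The bin function reproduces the IP[2] projection from the table, and the "tail bin" back-off folds rare bins into a REST row. The function names, the `min_total` threshold, and the toy records are assumptions for illustration.

```python
from collections import defaultdict


def ip_prefix(ip, parts=2):
    """Bin function: keep the first `parts` octets, e.g. 173.194.33.9 -> 173.194.*.*"""
    octets = ip.split(".")
    return ".".join(octets[:parts] + ["*"] * (len(octets) - parts))


def map_phase(records, bin_fn):
    # Emit (bin, label) pairs; label is 1 for positive, 0 for negative.
    for value, label in records:
        yield bin_fn(value), label


def reduce_phase(pairs, min_total=2):
    # Sum per-label counts for each bin.
    counts = defaultdict(lambda: [0, 0])  # bin -> [N+, N-]
    for key, label in pairs:
        counts[key][0 if label else 1] += 1
    # "Tail bin" back-off: fold bins with fewer than min_total events into REST.
    table, rest = {}, [0, 0]
    for key, (n_pos, n_neg) in counts.items():
        if n_pos + n_neg >= min_total:
            table[key] = (n_pos, n_neg)
        else:
            rest[0] += n_pos
            rest[1] += n_neg
    table["REST"] = tuple(rest)
    return table


records = [
    ("173.194.33.9", 1), ("173.194.7.7", 0), ("87.250.251.11", 0),
    ("131.253.13.32", 0), ("131.253.13.40", 1), ("131.253.1.1", 0),
]
table = reduce_phase(map_phase(records, ip_prefix))
```

Swapping `ip_prefix` for another projection (the identity, a hash, a pair of attributes) yields the other tables on this slide without changing the map or reduce code.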
![Page 13: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/13.jpg)
[Figure: the IP, query, and Query × AdId count tables, accumulated along the time axis up to $T_{now}$ (counting), yield aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff); these and the original numeric features are used to train the predictor.]
Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected
Learning with Counts: combiner training
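Assembling a training row for the combiner might look like the sketch below: count features from every table are concatenated, and the original numeric features are injected at the end. The function names, the `hour_of_day` feature, and the query "dozen tulips" are hypothetical; the counts come from the tables in this deck. The learner itself is not shown.

```python
import math


def lookup(table, value):
    """Return [N+, N-, ln N+ - ln N-, IsBackoff], using the REST row for unseen values."""
    is_backoff = value not in table
    n_pos, n_neg = table["REST"] if is_backoff else table[value]
    return [n_pos, n_neg, math.log(n_pos) - math.log(n_neg), int(is_backoff)]


def build_row(example, tables, numeric_keys):
    """Concatenate count features for every counted attribute with the numeric features."""
    row = []
    for attribute, table in tables.items():
        row.extend(lookup(table, example[attribute]))
    row.extend(example[key] for key in numeric_keys)
    return row


tables = {
    "ip": {"131.253.13.32": (12, 430), "REST": (745623, 13964931)},
    "query": {"facebook": (281912, 7957321), "REST": (6321789, 43477252)},
}
example = {"ip": "131.253.13.32", "query": "dozen tulips", "hour_of_day": 14}
row = build_row(example, tables, numeric_keys=["hour_of_day"])
```

Each counted attribute contributes four columns no matter how many distinct values it has, which is what keeps the input to the non-linear model low-dimensional.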
![Page 14: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/14.jpg)
[Figure: the IP, query, and URL × Country count tables keep accumulating along the time axis from $T_{train}$ to $T_{now}$ (counting); the aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff) and the original numeric features feed the trained predictor.]

| URL × Country | N+ | N- |
| --- | --- | --- |
| url1, US | 54546 | 978964 |
| url2, CA | 232343 | 8431467 |
| url3, FR | 12973 | 430982 |
| … | … | … |
| REST | 4419312 | 52754683 |

• Counts are updated continuously
• Combiner re-training is infrequent
Prediction with counts
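The split between continuously updated counts and an infrequently retrained combiner can be illustrated with a toy Python sketch. The logistic model below is only a stand-in for the non-linear combiner, its weight and bias are hypothetical values frozen at training time, and the add-one smoothing is an assumption made to keep the logarithm finite.

```python
import math


class CountTable:
    """Per-value [N+, N-] counts that keep updating after the combiner is trained."""

    def __init__(self):
        self.counts = {}

    def update(self, value, label):
        pair = self.counts.setdefault(value, [0, 0])
        pair[0 if label else 1] += 1

    def log_odds(self, value):
        # Add-one smoothing (an assumption of this sketch) keeps the log finite.
        n_pos, n_neg = self.counts.get(value, (0, 0))
        return math.log(n_pos + 1) - math.log(n_neg + 1)


def predict(table, value, weight, bias):
    """Frozen combiner: a logistic model over the log-odds feature."""
    score = weight * table.log_odds(value) + bias
    return 1.0 / (1.0 + math.exp(-score))


table = CountTable()
before = predict(table, "url1", weight=1.0, bias=0.0)  # no evidence yet
for _ in range(3):
    table.update("url1", 1)                            # three new positive events
after = predict(table, "url1", weight=1.0, bias=0.0)   # same weights, fresher counts
```

The prediction moves from 0.5 to 0.8 even though the combiner's parameters never change, because only the count table was updated.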
![Page 15: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/15.jpg)
• State-of-the-art accuracy
• Good fit for map-reduce
• Modular (vs. monolithic)
  • Learner can be tuned/monitored/replaced in isolation
• Monitorable, debuggable (this is HUGE in practice!)
  • Temporal changes easy to monitor
  • Easy emergency recovery (remove bot attacks, etc.)
  • Decomposable predictions
  • Error debugging (which feature can we blame…)
What is great about Learning with Counts?
![Page 16: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/16.jpg)
Learning with Counts: in Azure ML
![Page 17: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/17.jpg)
• HDInsight: large data storage and map-reduce processing
• Azure ML: cloud ML and analytics accessible anywhere
• Learning with Counts: intuitive, flexible large-scale ML solution
Putting it all together
![Page 18: Girish Nathan Misha Bilenko Microsoft Azure Machine Learning How to Work with Large Datasets to Build Predictive Models](https://reader038.vdocuments.us/reader038/viewer/2022110206/56649d045503460f949d76c1/html5/thumbnails/18.jpg)
Thanks for your time
Useful Links:
http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
Need Azure ML for teaching in the classroom? - Contact the speakers
Other questions? - Contact the speakers
Speakers:
Misha Bilenko: [email protected]
Girish Nathan: [email protected]