Introduction to Microsoft Azure Machine Learning

Girish Nathan, Misha Bilenko · Microsoft Azure Machine Learning · How to Work with Large Datasets to Build Predictive Models

Post on 23-Apr-2018

Page 1: Introduction to Microsoft Azure Machine Learning … · PPT file · Web view · 2018-04-09 · Microsoft Azure Machine Learning. ... drag-and-drop modules ... For every example, we

Girish Nathan, Misha Bilenko

Microsoft Azure Machine Learning

How to Work with Large Datasets to Build Predictive Models

Agenda

1. How to Work with Large Datasets
   • Sample Dataset: NYC Taxi
   • HDInsight (Hadoop on Azure)
   • IPython notebook and HDInsight

2. Building Predictive Models
   • Azure ML Studio
   • Learning with Counts

3. Putting it all together: Learning with Counts and HDInsight

Sample Data: NYC Taxi

• One-year log of NYC taxi rides
• 60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip (driver id, times, locations) and fare (fare, tip, tolls)
• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, IPython, Azure ML Studio

HDInsight: Hadoop on Azure

• 100% Apache Hadoop as an Azure service
• Can deploy on Windows or Linux
• Provides Map-Reduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code

    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;
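To make the Hive example on this slide concrete, here is a minimal Python sketch of what that query computes: group rows by a column, count each group, and keep the ten most frequent values. The function name `top_values` and the sample rows are illustrative assumptions, not part of the deck; on a cluster, Hive runs this as a distributed job rather than in local memory.

```python
from collections import Counter


def top_values(rows, column, limit=10):
    """Local equivalent of:
    SELECT column, COUNT(*) AS cnt FROM rows
    GROUP BY column ORDER BY cnt DESC LIMIT limit
    """
    counts = Counter(row[column] for row in rows)
    return counts.most_common(limit)


# Hypothetical rows standing in for Your_Table.
rows = [
    {"Col1": "a"}, {"Col1": "b"}, {"Col1": "a"},
    {"Col1": "c"}, {"Col1": "a"}, {"Col1": "b"},
]
result = top_values(rows, "Col1")
print(result)
```

The same group-and-count pattern is what the later "Learning with Counts" slides rely on, only with one count per label.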

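IPython Notebook

• Web-based Python REPL environment
• Combines authoring, execution, visualization
• Can author and execute HDInsight Hive queries
• Sample query (Python 2 code snippet; these are methods of a client class whose other members are not shown on the slide):

    import json
    import urllib2

    def submit_hive_query(self):
        response = urllib2.urlopen(self.url, self.hiveParams)  # passing data makes this a POST
        data = json.load(response)
        self.hiveJobID = data['id']  # job id taken from the JSON response

    def query(self, queryString):
        # The slide does not show how queryString reaches self.hiveParams.
        self.submit_hive_query()

  Example query string: SELECT * FROM sample_table LIMIT 10;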

What is Azure ML Studio?

• Fully managed cloud service
• Browser-based authoring of dataflow
• Best-in-class machine learning algorithms
• Support for R/Python/SQL
• Collaborative data science
• Quickly deploy models as web services/REST APIs
• Publish to a gallery for collaboration with community

Learning with Counts, a.k.a. Dracula
(Distributed Robust Algorithm for CoUnt-based LeArning)

Misha Bilenko
Microsoft Azure Machine Learning, Microsoft Research

Large-scale learning in multi-entity domains

adid = 1010054353, adText = K2 ski sale!, adURL = www.k2.com/sale
Userid = 0xb49129827048dd9b, IP = 131.107.65.14
Query = powder skis, QCategories = {skiing, outdoor gear}

$\#\text{users} \sim 10^{9}$, $\#\text{queries} \sim 10^{9+}$, $\#\text{ads} \sim 10^{7}$, $\#(\text{ad} \times \text{query}) \sim 10^{10+}$

• Information retrieval
  • Advertising, recommending, search: item, page/query, user
• Transaction classification
  • Payment fraud: transaction, product, user
  • Email spam: message, sender, recipient
  • Intrusion detection: session, system, user
  • IoT: device, location

Large-scale learning in multi-entity domains

adid: 1010054353, adText: Fall ski sale!, adURL: www.k2.com/sale
userid: 0xb49129827048dd9b, IP: 131.107.65.14
query: powder skis, qCategories: {skiing, outdoor gear}

• Problem: representing high-cardinality attributes as features
  • Scalable: to billions of attribute values
  • Efficient: predictions/sec
  • Flexible: for a variety of downstream learners
  • Adaptive: to distribution change
• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization

Learning with Counts

• Features are transforms of conditional statistics (per-label counts):
  feature vector per attribute value = [N+, N-, log(N+)-log(N-), IsBackoff]
• log(N+)-log(N-) = log(N+/N-): log-odds / naïve Bayes estimate
• N+, N-: indicators of confidence of the naïve estimate
• IsBackoff (IsFromRest): indicator of back-off vs. “real count”

[Figure: count features are looked up for each attribute value of the example: 131.107.65.14; k2.com; powder skis; and the pair (powder skis, k2.com)]

IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.107.65.14    12        430
…                …         …
REST             745623    13964931
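The feature transform on this slide can be sketched in a few lines of Python. The counts are copied from the slide's IP table; treating the first number as N+ and the second as N- is an assumption, and the function name `count_features` is hypothetical. The sketch assumes all counts are positive; zero counts would need smoothing before taking logarithms.

```python
import math

# Per-label counts for the IP attribute, copied from the slide's table.
# Assumption: the first number is N+ (positive label), the second is N-.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
}
REST = (745623, 13964931)  # shared back-off bin for values without their own row


def count_features(value, table=IP_COUNTS, rest=REST):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = rest if is_backoff else table[value]
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, int(is_backoff)]


known = count_features("131.107.65.14")   # value with its own row
unseen = count_features("10.0.0.1")       # falls back to the REST bin
print(known, unseen)
```

Whatever the cardinality of the attribute, each value maps to the same four numbers, which is what keeps the representation low-dimensional.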

Learning with Counts

• Features are transforms of conditional counts:
  feature vector per attribute value = [N+, N-, log(N+)-log(N-), IsBackoff]
• Scalable: “head” in memory + tail in backoff; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts

[Figure: count features are looked up for each attribute value of the example: 131.107.65.14; k2.com; powder skis; and the pair (powder skis, k2.com)]

IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.107.65.14    12        430
…                …         …
REST             745623    13964931
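The slide names a count-min sketch as an alternative to keeping the "head" in memory. Below is a minimal, generic count-min sketch in Python to show the idea; it is not the implementation used in Azure ML, and the class name, the `width` and `depth` defaults, and the SHA-256 based hashing are all assumptions of this sketch. In a counts setting one would presumably keep one sketch per label (one for N+, one for N-).

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter: estimates never fall below the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Derive one hash per row by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += amount

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.rows[row][self._index(key, row)] for row in range(self.depth))


positives = CountMinSketch()
for query in ["powder skis"] * 3 + ["facebook"] * 5:
    positives.add(query)
print(positives.estimate("powder skis"), positives.estimate("facebook"))
```

Memory stays at `width * depth` counters regardless of how many distinct attribute values arrive, at the price of occasional overestimates for colliding keys.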

Learning with Counts: aggregation

• Aggregate per-label counts for different bin functions
• Standard MapReduce
• Bin function: any projection
• Backoff options: “tail bin”, hashing, hierarchical (shrinkage)

[Figure: timeline; counting runs over the data up to T_now]

IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.253.13.32    12        430
…                …         …
REST             745623    13964931

query            N+        N-
facebook         281912    7957321
dozen roses      32791     640964
…                …         …
REST             6321789   43477252

Query × AdId       N+        N-
facebook, ad1      54546     978964
facebook, ad2      232343    8431467
dozen roses, ad3   12973     430982
…                  …         …
REST               4419312   52754683

IP[2]            N+        N-
173.194.*.*      46964     993424
87.250.*.*       6341      91356
131.253.*.*      75126     430826
…                …         …
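As a rough illustration of the aggregation step, the Python sketch below counts labels per bin for an arbitrary bin function and folds infrequent bins into a "tail bin" (the REST row). It runs in a single process; in MapReduce terms, the map step would emit (bin, label) pairs and the reduce step would sum them. The names `aggregate_counts`, `ip_prefix`, the `min_total` threshold, and the sample events are all hypothetical.

```python
from collections import defaultdict


def aggregate_counts(examples, bin_fn, min_total=2):
    """Build {bin: (N+, N-)} from (attribute_value, label) pairs, label in {0, 1}.

    Bins seen fewer than min_total times are merged into the "REST" tail bin.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [N+, N-]
    for value, label in examples:
        counts[bin_fn(value)][0 if label == 1 else 1] += 1

    table, rest = {}, [0, 0]
    for key, (n_pos, n_neg) in counts.items():
        if n_pos + n_neg >= min_total:
            table[key] = (n_pos, n_neg)
        else:
            rest[0] += n_pos
            rest[1] += n_neg
    table["REST"] = tuple(rest)
    return table


def ip_prefix(ip, octets=2):
    """Bin function in the spirit of the IP[2] table: keep the leading octets."""
    parts = ip.split(".")
    return ".".join(parts[:octets] + ["*"] * (4 - octets))


# Hypothetical labeled events: (IP, label).
examples = [
    ("173.194.33.9", 1), ("173.194.33.9", 0), ("173.194.40.2", 0),
    ("87.250.251.11", 0), ("131.253.13.32", 1),
]
by_ip = aggregate_counts(examples, bin_fn=lambda ip: ip)
by_prefix = aggregate_counts(examples, bin_fn=ip_prefix)
print(by_ip)
print(by_prefix)
```

Coarser bin functions such as `ip_prefix` give rare values a more informative row than the global REST bin, which is the intuition behind the hierarchical back-off option.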

Learning with Counts: combiner training

Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected

[Figure: timeline up to T_now showing the counting and predictor-training phases. The predictor's inputs are the aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, …) and the original numeric features.]

IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.253.13.32    12        430
…                …         …
REST             745623    13964931

query            N+        N-
facebook         281912    7957321
dozen roses      32791     640964
…                …         …
REST             6321789   43477252

Query × AdId       N+        N-
facebook, ad1      54546     978964
facebook, ad2      232343    8431467
dozen roses, ad3   12973     430982
…                  …         …
REST               4419312   52754683

Prediction with counts

• Counts are updated continuously
• Combiner re-training infrequent

[Figure: timeline with T_train and T_now; counting continues up to T_now. The predictor's inputs are the aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, …) and the original numeric features.]

IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.253.13.32    12        430
…                …         …
REST             745623    13964931

query            N+        N-
facebook         281912    7957321
dozen roses      32791     640964
…                …         …
REST             6321789   43477252

URL × Country    N+        N-
url1, US         54546     978964
url2, CA         232343    8431467
url3, FR         12973     430982
…                …         …
REST             4419312   52754683
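The split between continuously updated counts and an infrequently retrained combiner can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `CountTable` class, the +1 smoothing, the back-off counts, and especially `frozen_combiner`, which is a fixed logistic function standing in for whatever non-linear model was trained at T_train.

```python
import math


class CountTable:
    """Per-label counts for one attribute, updated as new labeled events arrive."""

    def __init__(self, rest=(1, 1)):
        self.counts = {}
        self.rest = list(rest)  # back-off counts used for unseen values

    def update(self, value, label):
        entry = self.counts.setdefault(value, [0, 0])
        entry[0 if label == 1 else 1] += 1

    def features(self, value):
        is_backoff = value not in self.counts
        n_pos, n_neg = self.rest if is_backoff else self.counts[value]
        # +1 smoothing keeps the logarithms finite when a count is zero.
        log_odds = math.log(n_pos + 1) - math.log(n_neg + 1)
        return [n_pos, n_neg, log_odds, int(is_backoff)]


def frozen_combiner(features):
    """Stand-in for a model trained at T_train; its parameters never change here."""
    return 1.0 / (1.0 + math.exp(-features[2]))


table = CountTable()
before = frozen_combiner(table.features("url1"))  # unseen value: back-off counts
for label in (1, 1, 1, 0):
    table.update("url1", label)                   # counts keep moving after T_train
after = frozen_combiner(table.features("url1"))   # same combiner, fresher counts
print(before, after)
```

The prediction for `url1` changes even though the combiner was never retrained, because only the count tables absorbed the new events.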

What is great about learning with counts?

• State-of-the-art accuracy
• Good fit for map-reduce
• Modular (vs. monolithic)
• Learner can be tuned/monitored/replaced in isolation
• Monitorable, debuggable (this is HUGE in practice!)
• Temporal changes easy to monitor
• Easy emergency recovery (remove bot attacks, etc.)
• Decomposable predictions
• Error debugging (which feature can we blame…)

Learning with Counts : in Azure ML

Putting it all together

• HDInsight: large data storage and map-reduce processing
• Azure ML: cloud ML and analytics accessible anywhere
• Learning with Counts: intuitive, flexible large-scale ML solution

Thanks for your time

Useful links:
• http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
• http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML

Need Azure ML for classroom teaching? Contact the speakers.

Other questions? Contact the speakers.

Speakers:
• Misha Bilenko: [email protected]
• Girish Nathan: [email protected]