Data Science with Windows Azure - A Brief Introduction
TRANSCRIPT
![Page 1: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/1.jpg)
Adnan Masood, PhD
Systems Architect / Software Engineer
adnan.masood@owasp.org (http://blog.adnanmasood.com)
GitHub (github.com/adnanmasood), Twitter (@adnanmasood).
Presented at Microsoft Data Science Group – Tampa Analytics Professionals
http://www.meetup.com/Analytics-Professionals-of-Tampa/events/228796343/
Data Science with Windows Azure
![Page 2: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/2.jpg)
About the Speaker
Adnan Masood, Ph.D., is a developer, software architect, and researcher who specializes in machine learning and Bayesian belief networks. Before joining PDS Health care and GDC (a leading prepaid financial technology institution), he enjoyed life as a principal engineer at a start-up and worked for a leading UK-based nonprofit organization as a solutions architect.
A strong believer in the development community, Adnan is an active member of the Open Web Application Security Project (OWASP), an organization dedicated to software security. In the .NET community, he is a cofounder and president of the Pasadena .NET Developers group, which he has been successfully leading for 8 years. He led a number of successful enterprise solutions and consulted for several Fortune 500 company projects.
Adnan devotes himself to his own continual, practical education. He holds certifications in big data, machine learning, and systems architecture from Massachusetts Institute of Technology; an Application Security certification from Stanford University; an SOA Smarts certification from Carnegie Mellon University; and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified Solutions Developer, and Sun Certified Java Developer.
![Page 3: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/3.jpg)
Key Takeaways from this Talk
Understand what Microsoft offers for data science in Windows Azure (or how to write MapReduce jobs in C#).
Diagrams are Courtesy of Microsoft Corporation
![Page 4: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/4.jpg)
![Page 5: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/5.jpg)
![Page 6: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/6.jpg)
![Page 7: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/7.jpg)
![Page 8: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/8.jpg)
![Page 9: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/9.jpg)
![Page 10: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/10.jpg)
6/16/2015
![Page 11: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/11.jpg)
![Page 12: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/12.jpg)
What is Hadoop?
At Google, MapReduce operations run on a special file system called the Google File System (GFS), which is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and others, working from Google's published description of GFS, built an open source counterpart called the Hadoop Distributed File System (HDFS); much of its development took place at Yahoo!.
The software framework that supports HDFS, MapReduce and other related entities is called the Hadoop project, or simply Hadoop.
It is open source and distributed by Apache.
![Page 13: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/13.jpg)
MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
![Page 14: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/14.jpg)
Classes of problems “mapreducable”
Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
Google uses it for word count, AdWords, PageRank, and indexing data.
Simple algorithms such as grep, text indexing, reverse indexing
Bayesian classification: data mining domain
Facebook uses it for various operations: demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating extraterrestrial objects.
Expected to play a critical role in semantic web and in web 3.0
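Reverse (inverted) indexing is a compact example of a problem that fits this model. The sketch below is a single-process Python illustration, not code from the talk: the names `map_doc`, `reduce_postings` and `build_index`, and the sample document IDs, are hypothetical, and the in-memory grouping stands in for the shuffle a real cluster would perform.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (word, doc_id) once for every distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_postings(word, doc_ids):
    """Reduce: collapse the document IDs seen for one word into a sorted list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run map, group by key (the 'shuffle'), then reduce, all in one process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, found_in in map_doc(doc_id, text):
            grouped[word].append(found_in)
    return dict(reduce_postings(word, ids) for word, ids in grouped.items())


index = build_index({"d1": "Hadoop stores data", "d2": "Spark reads Hadoop data"})
print(index["hadoop"])  # ['d1', 'd2']
print(index["spark"])   # ['d2']
```

Each word becomes a key, so the postings list for any one word can be built independently of all the others, which is what makes the problem "mapreducable".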
![Page 15: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/15.jpg)
Apache Spark
Apache Spark is an open source cluster computing framework originally developed in the AMPLab at UC Berkeley.
Spark's in-memory processing can run certain applications up to 100 times faster than Hadoop MapReduce.
Spark is well suited for machine learning algorithms.
Spark requires a cluster manager and a distributed storage system.
Spark supports Hadoop YARN.
![Page 16: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/16.jpg)
How Hadoop Operates
![Page 17: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/17.jpg)
Example: counting the number of occurrences for each word in a collection of documents
The input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1,w2, . . . ,wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs:
(w1, 1), (w2, 1), . . . , (wn, 1)
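As a minimal sketch of the Map task just described, the Python function below splits one document into words and emits a (word, 1) pair for each. The name `map_document` and the whitespace-based tokenization are assumptions made for illustration; a real job would hand its pairs to the Hadoop runtime rather than return a list.

```python
def map_document(document):
    """Emit one (word, 1) pair per word, in the order the words appear."""
    return [(word, 1) for word in document.split()]


pairs = map_document("the quick brown the")
print(pairs)  # [('the', 1), ('quick', 1), ('brown', 1), ('the', 1)]
```

Note that a repeated word produces repeated pairs; adding them up is left to the Reduce step.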
![Page 18: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/18.jpg)
Key Players in Hadoop World
Hortonworks
Cloudera
MapR
![Page 19: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/19.jpg)
Hortonworks
Hortonworks is a business computer software company based in Palo Alto, California.
Hortonworks supports and develops the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers.
It is a sponsor of the Apache Software Foundation.
Founded in June 2011 by Yahoo and Benchmark Capital as an independent company, it went public in December 2014.
Companies that have collaborated with Hortonworks:
Microsoft, in October 2011, to bring Hadoop to Azure and Windows Server
Informatica, in November 2011, to develop HParser
Teradata, in February 2012, to develop the Aster data system
SAP AG, which announced in September 2012 that it would resell the Hortonworks distribution
![Page 20: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/20.jpg)
About Cloudera
Cloudera is “The commercial Hadoop company”
Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
Provides consulting and training services for Hadoop users
Staff includes several committers to Hadoop projects
![Page 21: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/21.jpg)
HaaS (Hadoop as a Service) example
Amazon Web Services (AWS): Amazon Elastic MapReduce (EMR) provides a Hadoop-based platform for data analysis, with S3 as the storage system and EC2 as the compute system.
Microsoft HDInsight, Cloudera CDH3, IBM InfoSphere BigInsights, EMC Greenplum HD and Windows Azure HDInsight Service are the primary HaaS services by global IT giants.
![Page 22: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/22.jpg)
What is MapReduce Used For?
In research:
Analyzing Wikipedia conflicts (PARC)
Natural language processing (CMU)
Climate simulation (Washington)
Bioinformatics (Maryland)
Particle physics (Nebraska)
<Your application here>
![Page 23: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/23.jpg)
Example: Word Count
def mapper(line):
    # Emit a (word, 1) pair for each word in the line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts gathered for one word.
    yield key, sum(values)
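To see how the mapper and reducer fit together, the sketch below simulates the three phases (map, shuffle, reduce) in a single Python process. The helper `run_word_count` and its in-memory grouping are assumptions for illustration only; on a cluster, Hadoop performs the shuffle and runs many mapper and reducer tasks in parallel.

```python
from collections import defaultdict


def mapper(line):
    # Emit a (word, 1) pair for each word in the line.
    for word in line.split():
        yield word, 1


def reducer(key, values):
    # Sum the counts gathered for one word.
    yield key, sum(values)


def run_word_count(lines):
    # Shuffle: group every value emitted by the mappers under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce: call the reducer once per key, in sorted key order.
    counts = {}
    for key in sorted(groups):
        for out_key, total in reducer(key, groups[key]):
            counts[out_key] = total
    return counts


print(run_word_count(["to be or not", "to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The reducer never sees the raw lines, only a key and the list of values grouped under it, which is why reducers for different words can run on different nodes.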
![Page 24: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/24.jpg)
Key Cloud Solution Providers for Hadoop as a Service
• Windows Azure
• AWS
• Google
![Page 25: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/25.jpg)
Windows Azure
Enterprise-level on-demand capacity builder
Fabric of cycles and storage available on request for a cost
You have to use the Azure API to work with the infrastructure offered by Microsoft
Significant features: web role, worker role, blob storage, table and drive storage
![Page 26: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/26.jpg)
Amazon EC2
EC2 provides an API for instantiating computing instances with any of the supported operating systems.
Excellent distribution, load balancing, cloud monitoring tools
![Page 27: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/27.jpg)
Google App Engine
Google App Engine offers reliability, availability and scalability on par with Google’s own applications
![Page 28: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/28.jpg)
MapReduce Engine
MapReduce requires a distributed file system and an engine that can distribute, coordinate, monitor and gather the results.
Hadoop provides that engine through HDFS (the file system we discussed earlier) and the JobTracker + TaskTracker system.
The JobTracker is simply a scheduler.
A TaskTracker is assigned Map or Reduce tasks (or other operations); the tasks run on the same node as the TaskTracker, and each task runs in its own JVM on that node.
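The division of labor between the JobTracker and the TaskTrackers can be pictured with a toy scheduler. The Python sketch below is only a conceptual model under simplifying assumptions: the class names mirror the Hadoop roles, but the round-robin assignment, the method names, and the use of plain function calls in place of separate JVMs are all hypothetical and are not Hadoop's API.

```python
from itertools import cycle


class TaskTracker:
    """Runs the tasks assigned to one node and records which ones finished."""

    def __init__(self, node):
        self.node = node
        self.completed = []

    def run(self, task_name, func, data):
        result = func(data)  # stands in for launching the task in its own JVM
        self.completed.append(task_name)
        return result


class JobTracker:
    """Scheduler: hands each task to a TaskTracker and gathers the results."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, func, splits):
        results = []
        # Round-robin assignment (an assumption): one task per input split.
        for i, (tracker, split) in enumerate(zip(cycle(self.trackers), splits)):
            results.append(tracker.run(f"task-{i}", func, split))
        return results


trackers = [TaskTracker("node-a"), TaskTracker("node-b")]
job_tracker = JobTracker(trackers)
lengths = job_tracker.run_job(len, [[1, 2, 3], [4], [5, 6]])
print(lengths)                # [3, 1, 2]
print(trackers[0].completed)  # ['task-0', 'task-2']
```

The JobTracker here does no data processing itself; it only decides which tracker runs which task and collects what comes back.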
![Page 29: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/29.jpg)
Building a Custom MapReduce Job in .NET
A .NET map-reduce program comprises a number of parts:
Job definition
Mapper, Reducer, and Combiner classes
Input data
Job executor
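The parts of such a program can be mirrored in a small sketch. The Python below is an analogy only, not the .NET SDK: `JobDefinition`, `execute`, and the per-split combiner call are hypothetical names chosen to show where each part fits, in particular how a combiner pre-aggregates each mapper's output before the shuffle.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


def mapper(line):
    # Emit a (word, 1) pair for each word in the line.
    for word in line.split():
        yield word, 1


def combiner(key, values):
    # Local pre-aggregation of one mapper's output, before the shuffle.
    return key, sum(values)


def reducer(key, values):
    # Final aggregation across the output of all mappers.
    return key, sum(values)


@dataclass
class JobDefinition:
    """Job definition: names the functions that make up the job."""
    mapper: Callable
    combiner: Callable
    reducer: Callable


def group(pairs):
    """Group values by key, as the shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def execute(job, splits):
    """Job executor: each split stands for the input of one mapper task."""
    shuffled = []
    for split in splits:
        mapped = [pair for line in split for pair in job.mapper(line)]
        shuffled.extend(job.combiner(k, vs) for k, vs in group(mapped).items())
    return dict(job.reducer(k, vs) for k, vs in group(shuffled).items())


job = JobDefinition(mapper, combiner, reducer)
result = execute(job, [["a b a", "a"], ["b b"]])  # input data: two splits
print(result)  # {'a': 3, 'b': 3}
```

With the combiner in place, the first split sends two pairs to the shuffle, (a, 3) and (b, 1), instead of four, which is the reason combiners are used to cut network traffic.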
![Page 30: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/30.jpg)
![Page 31: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/31.jpg)
![Page 32: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/32.jpg)
![Page 33: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/33.jpg)
![Page 34: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/34.jpg)
![Page 35: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/35.jpg)
![Page 36: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/36.jpg)
![Page 37: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/37.jpg)
![Page 38: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/38.jpg)
![Page 39: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/39.jpg)
References & Further Reading
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce/
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
https://azure.microsoft.com/en-us/services/machine-learning/
![Page 40: Data science with Windows Azure - A Brief Introduction](https://reader036.vdocuments.us/reader036/viewer/2022062401/58f06af81a28ab77078b45d9/html5/thumbnails/40.jpg)
Questions