Microsoft's Implementation of Big Data
Gunder Bitén
IT Architect
Knowit Stockholm
30 years as IT Consultant
5 years with Azure
+46 72 553 94 81
https://www.linkedin.com/in/gunderbiten
Agenda
• Big Data Basics
• Microsoft Cloud Offer
• HDInsight Cluster creation
• HBase RDP
• HBase processing from C#
• Query Jobs
• Machine Learning – Predictive Analysis
• Internet of Things – Event Hubs & Storm Cluster
Big Data
Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and
decision making.
Gartner
Microsoft Cloud Offer
• Office 365 including SharePoint Online & Dynamics CRM Online
• Virtual Machines
• Websites
• Service Bus
• Virtual Networks
• AD
• HDInsight
• Machine Learning
• Event Hub
• More coming next week
DEMO Azure Portal
Apache Hadoop (named after a toy elephant belonging to creator Doug Cutting's son)
• Apache Open Source Project
• Java
• De facto standard
• MapReduce (widely used until now)
• Programming model and implementation for processing and generating large data sets
• Map() procedure that performs filtering and sorting
• Reduce() procedure that performs a summary operation
• Used by:
• Facebook & Google
• Microsoft Azure & Amazon EC2
• Almost all of Fortune 100 enterprises
• The foundation for numerous other Apache projects
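The Map() and Reduce() steps above can be sketched as a word count in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical, and the shuffle step stands in for the grouping that the framework would do between the two phases on a real cluster.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce(): summarise all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data big insight", "Big clusters"]))
```

On a cluster, many map tasks and reduce tasks would run in parallel on different nodes; the sketch only shows the shape of the programming model.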
Apache Hadoop
• Open Source Community
• We Consume Code
• We Contribute Code
• Core Code Same Across Distributions
Hortonworks
• Microsoft Partner – HDP for Windows
• Heavy Contributors to Open Source Hadoop
• Trusted by Community
HDInsight
• HDInsight Service, HDInsight Server Built on Hortonworks Platform
• Additional Functionality
HDInsight
Cluster Types
• Hadoop – Work directly with unstructured data in OS files
• HBase – NoSQL database that allows online transactional processing of big data
• Includes larger Zookeeper instances that cost money
• Offers the same response time whether a table holds 100 MB or 100 PB of data
• Storm – System for processing streams of data (in preview as of mid-October)
• Works well with Event Hub in the Service Bus
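One way to picture why key-based reads in HBase stay fast as a table grows: rows are kept sorted by row key, so a read goes to a key rather than scanning the data. The sketch below is a toy in-memory model under that assumption. ToyTable and its methods are hypothetical names for illustration only and are not the HBase client API.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory stand-in for a table with sorted row keys."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Store one cell; new row keys are inserted in sorted position."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Read one row directly by its key, without touching other rows."""
        return self._rows.get(row_key, {})

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, located by binary search."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


if __name__ == "__main__":
    table = ToyTable()
    table.put("user#002", "info:name", "Bo")
    table.put("user#001", "info:name", "Al")
    print(table.get("user#001"))
    print(table.scan("user#001", "user#003"))
```

The practical consequence is that row key design matters: reads that name a key or a key range are cheap, while questions that cannot be expressed through the key need a different tool, such as a query job.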
DEMO Cluster creation, table actions from both the cluster command line and Visual Studio
Predictive Analytics – Psychohistory Come True
• Science Fiction novelist Isaac Asimov’s first book in the Foundation Trilogy was published in 1951
• The whole story is based upon the creation of the science Psychohistory by Hari Seldon
• Psychohistory depends on the idea that, while one cannot foresee the actions of a particular individual, the laws of statistics as applied to large groups of people could predict the general flow of future events
• Today Psychohistory is reality but with another name – Predictive Analytics
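The statistical idea behind this can be sketched with a small simulation: each simulated person makes an unpredictable yes/no choice, yet the share of "yes" answers in a large group lands close to the underlying probability. The function names, the probability 0.3 and the seed are assumptions chosen for illustration; this is not a predictive analytics model.

```python
import random


def simulate_choices(n_people, p_yes, seed=42):
    """Draw one independent yes/no choice per person with probability p_yes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.random() < p_yes for _ in range(n_people)]


def observed_rate(choices):
    """Share of people who chose yes."""
    return sum(choices) / len(choices)


if __name__ == "__main__":
    for n_people in (10, 1_000, 100_000):
        rate = observed_rate(simulate_choices(n_people, p_yes=0.3))
        print(f"{n_people:>7} people: observed yes-rate {rate:.3f}")
```

No single entry in the list can be known in advance, but the rate for the largest group should sit near 0.3, which is the kind of regularity that predictive models exploit.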
Internet of Things – Event Hub & Storm Cluster
• Streams – unbounded sequences of tuples
• Spouts – sources of streams in a computation (e.g. a Twitter API)
• Bolts – process input streams (tuples) and produce output streams
• Nimbus node – master node, similar to the Hadoop JobTracker
• ZooKeeper nodes – coordinate the Storm cluster
• Supervisor nodes – start and stop workers according to signals from Nimbus
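The spout and bolt roles above can be sketched with Python generators. This is a single-process illustration and not the Storm API: the names sentence_spout, split_bolt, count_bolt and run_topology are hypothetical, the input is a finite list rather than an unbounded stream, and the Nimbus, ZooKeeper and Supervisor nodes are not modelled.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream; emits one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count so far)."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and collect the output tuples."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    for output_tuple in run_topology(["event hub", "Event stream"]):
        print(output_tuple)
```

In a real topology the spout could read from an Event Hub, and each bolt would run as many parallel tasks spread across the worker nodes.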
The world of NoSQL
• MapReduce on the way out?
• Spark SQL
• Impala
• Hive – getting faster and better (MapReduce internally)
• SQL look-alikes in other Azure products
Why Azure and not HDP for Windows?
• Compute capacity is expensive – spin up and remove on demand
• Complex infrastructure – use human resources for analytics, not infrastructure
• Get access to everything else in Azure
• Start now – not in six months
• Let Microsoft worry about the 99.9% SLA
From Hype to Commodity
• On Gartner's Top 10 Strategic Technology Trends three years ago:
1. Internet of Things
2. Next Generation Analytics
3. Big Data
4. Cloud Computing
• Two years ago I started to hope Microsoft would implement 1, 2 and 3 in 4
• In Azure today
• Event Hub & Storm
• Machine Learning
• HDInsight
• Hype or Commodity?
References
• http://azure.microsoft.com - Azure
• http://azure.microsoft.com/en-us/services/hdinsight/ - HDInsight
• http://hadoop.apache.org/ - Hadoop
• http://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/ - Analytics Platform
• http://aka.ms/IntroHDInsight/PDF - Introducing Microsoft Azure HDInsight E-Book