Microsoft's Hadoop Story
DESCRIPTION
Presentation at the Seattle Hadoop Meetup 1/23 about Microsoft's Hadoop Story.
TRANSCRIPT
Hadoop and Microsoft.
Michael Rys | Principal Program Manager @SQLServerMike
Session Objectives
• What is Big Data?
• How it fits into the Windows and Windows Azure environments
• How do I program against it in the Microsoft environment?
What is Big Data?
• Traditionally:
  • Physics experiments, sensor data, satellite data, …
• Now:
  • Operational logs
  • Customer behavior
  • Social interactions online
  • …
• From terabytes in the 1990s, through petabytes today, to zettabytes in the future
Big Data.
VOLUME (Size)
VARIETY (Structure)
VELOCITY (Speed)
Advanced Analytics
Live Data Feed
Social Analytics
How do I optimize my services based on patterns of weather, traffic, etc.?
What’s the social sentiment of my product?
How do I better predict future outcomes?
New Questions.
Hadoop is for Big Data.
What is Hadoop (v1)?
• Processing platform for Big Data
• Uses the “Map-Reduce” processing paradigm
• Characteristics:
  • Highly scalable (scaled out)
  • Commodity HW-based
  • Open Source
=> Very low acquisition and storage costs
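The “Map-Reduce” paradigm named above can be sketched in a few lines of plain JavaScript. This is only an in-memory illustration of the map, shuffle, and reduce phases, not Hadoop code: the `mapReduce` helper, the sensor ids, and the readings are all hypothetical and chosen for this example.

```javascript
// In-memory sketch of the map-reduce paradigm (illustration only, not Hadoop).
// mapReduce, the sensor ids, and the readings are hypothetical names/data.

function mapReduce(records, mapFn, reduceFn) {
  const groups = new Map();
  for (const record of records) {
    // Map phase: each record emits zero or more [key, value] pairs.
    for (const [key, value] of mapFn(record)) {
      // Shuffle phase: collect all values emitted under the same key.
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  // Reduce phase: collapse each key's values into a single result.
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

const readings = [
  "sensorA,21.5",
  "sensorB,19.0",
  "sensorA,23.1",
  "sensorB,18.2",
];

// Maximum temperature per sensor.
const maxBySensor = mapReduce(
  readings,
  (line) => {
    const [id, temp] = line.split(",");
    return [[id, parseFloat(temp)]];
  },
  (key, values) => Math.max(...values)
);

console.log(maxBySensor);
```

On a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data across the network; the sketch keeps everything in one process to show only the data flow.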
Hadoop Data Flow
Hadoop Data Analytics
Hadoop Capabilities
Machine Learning
Graph Processing
Distributed Compute
Extract Load Transform
Predictive Analysis
HDInsight Ecosystem
• Distributed Storage (HDFS)
• Distributed Processing (Map Reduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / Sqoop / REST)
• Business Intelligence (Excel, PowerView, …)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Pipeline / workflow (Oozie)
• Log file aggregation (Flume)
• PDW
• World’s Data (Azure Data Marketplace)
• AD, System Center
• Windows Azure Storage
Data Knowledge Action
HDInsight
HDFS on Azure: Tale of two File Systems
• HDFS API
• DFS (1 Data Node per Worker Role) and Compute Cluster
  • NameNode
  • Data Nodes
• Azure Storage Vault (ASV)
  • Front end
  • Partition Layer
  • Stream Layer
  • Containers on Azure Blob Storage
.Net Map/Reduce Support
• Install NuGet
• Use NuGet to install the Microsoft .Net MapReduce API for Hadoop
• Provide an implementation of a HadoopJob
• Execute the job via either
  – MRLib\MRRunner.exe -dll ConsoleAppHadoopJob.exe
  or
  – HadoopJobExecutor.ExecuteJob<HadoopJobClass>();
• Collect your result on HDFS
JavaScript Map/Reduce Support
• Provide a map and reduce function variable in a JS file
• Use the JavaScript console with
  • runJS('/user/myself/MRjob.js', '/path/to/inputfile', '/path/to/output/dir')
• Collect your result on HDFS
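A JS file with `map` and `reduce` function variables might look like the word-count sketch below. The slides do not show the function signatures, so the `(key, value, context)` parameters, `context.write`, and the `hasNext()`/`next()` iterator are assumptions here. The `runLocally` harness is purely hypothetical and exists only so the file can be tried under Node.js without a cluster.

```javascript
// Word-count sketch in the shape of an MRjob.js file.
// ASSUMPTION: the (key, value, context) signatures, context.write, and the
// hasNext()/next() value iterator are not confirmed by the slides.

var map = function (key, value, context) {
  // Split the input line on non-letters and emit (word, 1) for each word.
  var words = value.split(/[^a-zA-Z]+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  // Sum all counts emitted for this word.
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local harness (not part of any Hadoop API): it imitates the
// shuffle step in memory so map and reduce can be exercised on a few lines.
function runLocally(lines) {
  var grouped = {};
  var mapContext = {
    write: function (k, v) {
      if (!Object.prototype.hasOwnProperty.call(grouped, k)) grouped[k] = [];
      grouped[k].push(v);
    }
  };
  lines.forEach(function (line, offset) {
    map(offset, line, mapContext);
  });

  var output = {};
  var reduceContext = {
    write: function (k, v) { output[k] = v; }
  };
  Object.keys(grouped).forEach(function (k) {
    var i = 0;
    var iterator = {
      hasNext: function () { return i < grouped[k].length; },
      next: function () { return grouped[k][i++]; }
    };
    reduce(k, iterator, reduceContext);
  });
  return output;
}

var counts = runLocally(["Hadoop on Azure", "hadoop on Windows"]);
console.log(counts);
```

When submitting to a cluster, only the `map` and `reduce` variables would be needed; the harness and the sample input should be left out of the uploaded file.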
Invoking HiveQL Queries
• Run queries in the Hadoop Command Shell after invoking hive
• Through the web console
• Programmatically through ODBC
• Coming soon: LINQ to Hive!
Polybase – Enhancing PDW query engine
• Unstructured data (Hadoop): Social Apps, Sensor & RFID, Mobile Apps, Web Apps
• Structured data (PDW V2): Traditional schema-based DW applications
• Enhanced PDW query engine: Regular T-SQL in, Results out
• Data Scientists, BI Users, DB Admins
Microsoft Hadoop Vision
Microsoft Business Intelligence (BI)
• Hive ODBC Connectivity
• BI Tools for Big Data
Better on Windows and Azure
• Active Directory
• System Center
• .Net Programmability
Microsoft Data Connectivity
• SQL Server / SQL Parallel Data Warehouse
• Azure Storage / Azure Data Market
Collaborate with and Contribute to OSS
• Collaborate with Hortonworks
• Provide improvements and Windows support back to OSS
Getting started
• On prem: http://www.microsoft.com/bigdata/
  • Single node cluster (onebox) install
  • C:\hadoop
  • Starts local services
  • Can start/stop them with start-onebox.cmd/stop-onebox.cmd
  • Comes with:
    • Hadoop command line (shell)
    • Hadoop Status for name node and map-reduce cluster
    • HDInsight Dashboard
• On Windows Azure: http://HadoopOnAzure.com/
  • 3 node cluster running as a service in Azure
  • Can be used for 5 days
  • Provides samples and HDInsight Dashboard
• TAP Program
Related Content and Links
http://www.microsoft.com/bigdata
http://www.hadooponazure.com
Nuget: http://nuget.codeplex.com/
LinqPad: http://www.linqpad.net/
Linq to Hive (see http://hadoopsdk.codeplex.com)
Find Me Later At…
Twitter: @SQLServerMike
ACM SIGOPS Paper: Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency (Calder et al.)
http://blogs.msdn.com/b/windowsazure/archive/2012/11/02/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx
Developing Big Data Analytics Applications with JavaScript and .NET for Windows Azure and Windows: http://channel9.msdn.com/Events/Build/2012/3-038