big data 101
NEW ENGLAND SQL SERVER
Paresh Motiwala
BIG DATA 101
SERIOUSLY, THIS IS JUST 101
BY PARESH MOTIWALA, PREPARED FOR NESQL
PARESH MOTIWALA, PMP ®
• [email protected]
• http://www.linkedin.com/in/pareshmotiwala
• @pareshmotiwala
• www.circlesofgrowth.com
BIG DATA 101
• Who should attend
  • DBAs
  • CIOs
  • Marketing peeps
  • Developers
  • Big Data Enthusiasts
• Who should not attend
BIG DATA 101
• Agenda for the day:
  • Sources
  • Privacy concerns
  • Storing: Hadoop
  • Processing: MapReduce
  • Presentation
  • Summary
BIG DATA 101
SO WHY SHOULD I CARE ABOUT THIS?
Data is the new Electricity (Satya Nadella, Spring 2016) https://www.microsoft.com/en-us/sql-server/data-driven
Companies generate data, distribute, meter, and use it
Where is data stored?
• Current: SQL Server, Oracle, Teradata, DB2, Netezza
• Open source databases: Cassandra, MySQL, MongoDB
• Unstructured: Hadoop, Spark, data lakes
What type of data is stored?
• Traditional: rows and columns
• Big Data explosion: images, streaming data, internet-connected devices (IoT), machine data
BIG DATA IS DRIVING TRANSFORMATIVE CHANGES
• Data characteristics
  • Traditional: relational data with highly modeled schema
  • Big Data: all data with schema agility
• Costs
  • Traditional: specialized HW
  • Big Data: commodity HW
• Culture
  • Traditional: operational reporting, focus on rear-view analysis
  • Big Data: experimentation leading to intelligent action, with machine learning, graph, A/B testing
BIG DATA 101
• Sources
  • Cell Phones
  • Social Media
  • Credit Cards
  • GPSs
  • Bread Crumbs
BIG DATA 101
• 5 Vs of Big Data
  • Volume
  • Variety
  • Velocity
  • Veracity
  • Value
BIG DATA 101
• Desired Properties:
  • Robustness and Fault Tolerance
  • Low Latency
  • Scalability
  • Generalization
  • Extensibility
  • Ad hoc Queries
  • Minimal Maintenance
  • Debuggability
BIG DATA 101
• Flow (cycle diagram)
  • Collection
  • Pre-processing
  • Hygiene
  • Analysis
  • Visualization
  • Intervention
OVER 90% OF TODAY’S DATA WAS CREATED IN THE PAST 2 YEARS
BIG DATA 101
• 5 Rs of Data Quality
  • Relevancy
  • Recency
  • Range
  • Robustness
  • Reliability
• Ephemeral vs. Durable
• Refresh of Data
BIG DATA 101
• Privacy of Data
  • If I collect the data, is it mine?
  • Ownership vs. Rights
  • Share Answers, not Data
  • OpAl (http://www.trust.mit.edu/projects/)
  • Enigma
  • Let them know
    • Why you are collecting
    • What you are collecting
  • FIPPs: Fair Information Practice Principles
    • Individual Control
    • Transparency
    • Respect for Context
    • Security
    • Access and Accuracy
    • Focused Collection
  • FERPA: Family Educational Rights and Privacy Act
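The "share answers, not data" idea can be illustrated with a small sketch: callers ask a question and receive an aggregate, while the raw records never leave the holder. Everything here is an assumption made for illustration (the `AnswerService` class, the sample records, and the `MIN_GROUP_SIZE` threshold); it is not how OpAl or Enigma are implemented.

```python
from statistics import mean

# Assumed threshold, not from the slides: refuse to answer for tiny groups,
# because an "average" over one or two people reveals individual values.
MIN_GROUP_SIZE = 3

# Hypothetical sample records held by the data owner.
SAMPLE_RECORDS = [
    {"city": "Boston", "spend": 10.0},
    {"city": "Boston", "spend": 20.0},
    {"city": "Boston", "spend": 30.0},
    {"city": "Salem", "spend": 99.0},
]


class AnswerService:
    """Holds raw records privately and only returns aggregate answers."""

    def __init__(self, records):
        self._records = list(records)  # raw data is never returned to callers

    def average_spend(self, city):
        """Return the average spend for a city, or None if the group is too small."""
        rows = [r["spend"] for r in self._records if r["city"] == city]
        if len(rows) < MIN_GROUP_SIZE:
            return None
        return round(mean(rows), 2)


if __name__ == "__main__":
    service = AnswerService(SAMPLE_RECORDS)
    print(service.average_spend("Boston"))
    print(service.average_spend("Salem"))
```

The design choice worth noting is that the only public method returns a number or nothing; there is no method that hands back a record.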
BIG DATA 101
WHAT IS A DATA LAKE? (COURTESY: JAMES SERRA)
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
• A place to store unlimited amounts of data in any format inexpensively, especially for archive purposes
• Allows collection of data that you may or may not use later: “just in case”
• A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”
• Complements the EDW and can be seen as a data source for the EDW, capturing all data but only passing relevant data to the EDW
• Frees up expensive EDW resources (storage and processing), especially for data refinement
• Allows data exploration to be performed without waiting for the EDW team to model and load the data (quick user access)
• Some processing is better done with Hadoop tools than with ETL tools like SSIS
• Easily scalable
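A minimal sketch of “schema on read”, assuming raw events are stored as JSON lines: nothing is validated or reshaped when the data lands, and a schema (here just a tuple of field names) is applied only when a query runs. The sample events and the function names `read_with_schema` and `gps_positions` are hypothetical.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
# Records have different shapes because no schema was enforced on write.
RAW_LINES = [
    '{"device": "gps-1", "lat": 42.36, "lon": -71.06}',
    '{"device": "phone-7", "text": "hello", "ts": "2016-05-01T10:00:00"}',
    '{"device": "gps-1", "lat": 42.37, "lon": -71.05, "speed": 31}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at query time: keep only records that have every field."""
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            # Project the record down to the requested fields.
            yield {name: record[name] for name in fields}


def gps_positions(raw_lines):
    """One possible 'query': the GPS view of the same raw data."""
    return list(read_with_schema(raw_lines, ("device", "lat", "lon")))


if __name__ == "__main__":
    for row in gps_positions(RAW_LINES):
        print(row)
```

A different question tomorrow (say, text messages by timestamp) would reuse the same raw lines with a different field list, which is the “just in time” property the slide describes.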
BIG DATA 101
THE “DATA LAKE” USES A BOTTOM-UP APPROACH
• Ingest all data regardless of requirements (sources include devices)
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop
  • Interactive queries
  • Batch queries
  • Machine learning
  • Data warehouse
  • Real-time analytics
Courtesy: James Serra
BIG DATA 101
• MapReduce
  • Map: Sends Queries
  • Reduce: Collects Results
  • Job Tracker
  • Task Tracker
  • YARN
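A single-process sketch of the MapReduce pattern, using the classic word-count example. In the usual description, the map step emits key/value pairs from each input record, the framework groups values by key, and the reduce step collapses each group into one result. This sketch does not model the Job Tracker, Task Tracker, or YARN, which schedule that work across a cluster; the function names are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is just data", "data is the new electricity"]
    print(word_count(sample))
```

On a real cluster each `map_phase` call could run on a different node next to its slice of the data, which is why the map and reduce functions are written without shared state.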
Azure Services
• Near real-time data analytics pipeline using Azure Stream Analytics
• Big data analytics pipeline using Azure Data Lake
• Interactive analytics and predictive pipeline using Azure Data Factory
Base Architecture: Big Data Advanced Analytics Pipeline
[Architecture diagram; labels recovered from the slide]
• Pipeline stages: Data Sources → Ingest → Prepare (normalize, clean, etc.) → Analyze (stat analysis, ML, etc.) → Publish (for programmatic consumption, BI/visualization) → Consume (alerts, operational stats, insights)
• The diagram separates Data in Motion from Data at Rest
• Data sources: telemetry, sensor readings, device health data stream, real-time readings and operational data, operational logs, local DB sensor readings, local DB logs, customer MIS, on-prem data, historic laser data (one-time drop), fault and maintenance data (one-time drop)
• Services: Event Hub, Stream Analytics (real-time analytics), Machine Learning (anomaly detection), Azure Data Factory (scheduled hourly transfer), Azure Storage Blob, HDI custom ETL (aggregate / partition), Azure Data Lake Storage, Azure Data Lake Analytics (big data processing), Azure SQL (predictions), legacy store (replaced by Azure SQL)
• Outputs: Power BI dashboard; live / real-time data stats, anomalies and aggregates; dashboard of predictions / alerts; dashboard of operational stats
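The stages named in the base architecture (ingest, prepare, analyze, publish) can be sketched as four small functions chained together. This is a local, in-memory illustration only: it does not call Event Hub, Stream Analytics, or any other Azure service, and the z-score rule used for the "analyze" step is an assumed stand-in for whatever anomaly detection the real pipeline uses. The sample readings are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical sensor readings as they might arrive, including one bad value.
SAMPLE_READINGS = ["20.1", "19.9", "20.0", "bad", "20.2", "19.8", "95.0"]


def ingest(raw_readings):
    """Ingest: accept readings as they arrive (here, a plain list of strings)."""
    return list(raw_readings)


def prepare(readings):
    """Prepare: normalize to floats and drop values that cannot be parsed."""
    cleaned = []
    for value in readings:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            continue
    return cleaned


def analyze(values, threshold=2.0):
    """Analyze: flag values more than `threshold` standard deviations from the mean."""
    if not values:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def publish(anomalies):
    """Publish: shape the result for a dashboard or alerting consumer."""
    return {"anomaly_count": len(anomalies), "anomalies": anomalies}


def run_pipeline(raw_readings):
    """Chain the stages in the order the diagram lists them."""
    return publish(analyze(prepare(ingest(raw_readings))))


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_READINGS))
```

Keeping each stage as its own function mirrors the diagram: any one stage could be swapped for a managed service without changing the others.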
VISION FOR BIG DATA AND DATA WAREHOUSING
• Comprehensive, Connected, Choice
• Azure Data Factory + Federated Query
• Data Warehouse
  • Cloud (Microsoft Azure): VMs, SQL DW
  • On-Premises (Microsoft SQL Server): APS, SQL Server
• “Big Data”
  • Cloud (Microsoft Azure): VMs, HDInsight, Data Lake
  • On-Premises: HDP, APS
• Your data, your workload, your business, your way
BIG DATA 101
• Presentation
  • R
  • Python
  • Power BI
  • Power BI Desktop
BIG DATA 101
• Someday Big Data will just become data
BIG DATA 101
• Summary:
  • Sources
  • Privacy concerns
  • Storing: Hadoop
  • Processing: MapReduce
  • Presentation
BIG DATA 101 - CONCLUSION
• SQL Server is the best relational database
• The world is much bigger than any one relational database
• What is your company’s data strategy?
• What is your company’s cloud strategy?
• Learn adjacent technologies that will make you valuable: Power BI? Hadoop? NoSQL?
BIG DATA 101
• BIBLIOGRAPHY
  • http://www.datasciencecentral.com/
  • https://www.youtube.com/playlist?list=PLt-0mOCwxJ6B_OxTlpevxJNAa7GfCLd3l
  • https://www.dezyre.com/article/hadoop-components-and-architecture-big-data-and-hadoop-training/114
  • MIT Big Data Analytics Course
  • Data Lake presentation by James Serra
  • Future of Data…..(or something like that) by George Walters
BIBLIOGRAPHY - BIG DATA 101
• Ignite (IT Pros): https://myignite.microsoft.com/videos
• Channel9 (Developers): https://channel9.msdn.com/
• Microsoft Virtual Academy (Both): http://mva.microsoft.com
• Technet Virtual Labs (Hands-on!): https://technet.microsoft.com/en-us/virtuallabs/default
• Free Azure for 1 month: https://azure.microsoft.com/en-us/free/
• Free HDInsight (Hadoop as a service) for a week: https://azure.microsoft.com/en-us/services/hdinsight/information-request/
• MSDN? Link that to Azure for monthly Azure money.
• GitHub: https://github.com/