best practices: implementing dataops with a data science platform
TRANSCRIPT
Learn more at datascience.com | Empower Your Data Scientists
November 7, 2017
Best Practices:
Implementing DataOps with a Data Science Platform
Agenda
• Evolving data science landscape
• Data growth and impacts
• Defining DataOps
• DataOps vs. DevOps
• Best practices in applying DataOps
• Q&A
Crystal Valentine, VP Technology Strategy, MapR
William Merchan, CSO, DataScience.com
EVOLVING LANDSCAPE
DOING DATA SCIENCE HAS GROWN IN COMPLEXITY
[Diagram labels, grouped by area]
Environments: Windows, OSX, Cloud, On Prem, Laptops, Remote, Security, AWS, Google, Azure
Notebooks: Jupyter, R Studio, Zeppelin
Languages: Python, Scala, R, SAS
Tools, Libraries
Sharing & Collaboration (?): Results, Models, Chat, Email, .ppt, Code, Shared Drives
Deployments: Monitoring, Support, Logging Style A, Logging Style B, Tools, PMML, Flask
Lineage and Repeatability (?)
Data: Data Lake, Database, Data Inventory, Spark, Pig, Hive, Tools, ETL, Cron
Users
DATA SCIENCE TRENDS: GROWING TEAMS & OPEN SOURCE AS THE NEW STANDARD
2017: 2,350,000 data science and analytics job listings*
*Source: Kaggle 2017 data science trend report, Burning Glass Quant Crunch Report, Microsoft Revolutions Blog 2017
DATA SCIENCE PLATFORMS ARE AN EMERGING CATEGORY BRINGING TOGETHER ESSENTIAL ELEMENTS FOR DATA SCIENCE SCALING
[Diagram labels]
• CLOUD PROVIDERS
• ETL & DATA ENGINEERING
• VERTICAL APPLICATIONS
• BI & VISUALIZATION TOOLS
• SECURITY
• INFRASTRUCTURE
• LIBRARIES
• TOOLS
• DATA PLATFORMS
• DATA SCIENCE PLATFORMS
DATA GROWTH
DATA IS THE LEVERAGE POINT FOR COMPETITIVE ADVANTAGE
DATA VOLUMES GROWING FASTER THAN MOORE’S LAW
Source: McKinsey Global Institute
Data volume:
• 1987: 3 exabytes of data
• 2010: 1.2 zettabytes of data
• 2020: 44 zettabytes of data
Data diversity: Emails, Call Detail Records, Clickstream, CSV, Documents, PDF, Billing Data, Metadata, JSON, Network Data, Mobile Data, XML, Product Catalog, Medical Records, Text Files, Video, Text Messages, Merchant Listings, Sensor Data, Server Logs, Set Top Box, Social Media, Audio
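The comparison with Moore's law can be checked with a little arithmetic. The sketch below uses only the 2010 and 2020 volumes quoted on this slide (the 2020 figure was a projection at the time of the talk) and assumes the common reading of Moore's law as a doubling every two years; the function names are illustrative, not from the presentation. Over those ten years the quoted volumes grow about 36.7-fold, against 32-fold for a two-year doubling.

```python
import math


def annual_growth_rate(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1.0 / years) - 1.0


def doubling_time_years(rate):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1.0 + rate)


# Volumes from the slide, in zettabytes.
ZB_2010 = 1.2
ZB_2020 = 44.0

data_rate = annual_growth_rate(ZB_2010, ZB_2020, 10)

# Assumption: Moore's law taken as "doubles every two years".
moore_rate = 2 ** (1 / 2) - 1

print(f"data growth per year:  {data_rate:.1%}")
print(f"Moore's law per year:  {moore_rate:.1%}")
print(f"data doubling time:    {doubling_time_years(data_rate):.2f} years")
```

The margin is narrow under this assumption, so the headline depends on which formulation of Moore's law and which time span are used.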
THE VALUE OF DATA
[Figure: two charts of $ against data size, each showing Value, Cost, and Net Value curves with a point marked OPT. Panels: Legacy Value Model; Next-Gen Value Model.]
WE HAVE PASSED AN INFLECTION POINT
INVESTMENT IN NEXT-GEN VS. LEGACY TECHNOLOGIES FOR DATA
[Chart: legacy technology investment and next-gen technology investment, in $ (millions), against the total $ growth of the IT market]
90% of data is on next-gen technology by 2020
Next-gen consists of cloud, big data, software and hardware related expenses
Source: IDC, Gartner; Analysis & Estimates: MapR
DATAOPS
DATAOPS: AN AGILE METHODOLOGY FOR DATA-DRIVEN ORGANIZATIONS
Axioms:
1. Data is central to disruptive enterprise applications
   a. Lightweight, stateless functions do not represent the majority of workloads
2. Data science and machine learning are an important paradigm
   a. Scientists become active users -- no longer just application developers
   b. Iterative workflow with different data usage patterns
3. Data volumes continue to grow
4. Moving data is a performance bottleneck
DataOps Goals:
• Continuous model deployment
• Promote repeatability
• Promote productivity -- focus on core competencies
• Promote agility
• Promote self-service
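Axiom 4 on this slide can be made concrete with a back-of-the-envelope estimate. The sketch below is not from the presentation: the one-petabyte dataset, the 10 Gbit/s link, and the `efficiency` parameter are all assumed values chosen only to illustrate the scale involved.

```python
def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move `size_bytes` over a link.

    `efficiency` is the fraction of nominal bandwidth actually achieved.
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


# Assumed example values, not figures from the talk.
PETABYTE = 10**15          # bytes
TEN_GBIT = 10 * 10**9      # bits per second

ideal = transfer_hours(PETABYTE, TEN_GBIT)
realistic = transfer_hours(PETABYTE, TEN_GBIT, efficiency=0.5)

print(f"1 PB over 10 Gbit/s, ideal:          {ideal:.0f} hours")
print(f"1 PB over 10 Gbit/s, 50% efficiency: {realistic:.0f} hours")
```

Under these assumptions a single copy takes on the order of days, which is the kind of cost that running compute next to the data is meant to avoid.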
COMPARING DEVOPS AND DATAOPS: WHAT’S DIFFERENT OR THE SAME?
[Diagram: DevOps and DataOps compared by the roles involved. Roles shown: Developers & Architects, Data Engineers, Data Scientists, Security & Governance, Operations.]
CONTINUOUS MODEL DEPLOYMENT
Stages:
1) Data Engineering
2) Model Development
3) Model Management
4) Model Deployment
5) Model Monitoring & Rescoring
Key Building Blocks for Agility:
1) Unified data platform
2) Data governance
3) Self-service data and compute access
4) Multitenancy and resource management
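The five stages on this slide can be sketched as one loop. This is a minimal illustration under stated assumptions, not the presenters' implementation: the toy model simply predicts the training mean, "rescoring" is interpreted here as retraining when monitored error exceeds a tolerance, and every name (`Registry`, `run_cycle`, and so on) is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Model:
    version: int
    prediction: float  # toy model: always predicts the training mean


def engineer(raw):
    """Data engineering: drop missing values."""
    return [x for x in raw if x is not None]


def develop(data, version):
    """Model development: fit the toy model."""
    return Model(version=version, prediction=mean(data))


class Registry:
    """Model management and deployment: keep every version, mark one live."""

    def __init__(self):
        self.models = {}
        self.live = None

    def register(self, model):
        self.models[model.version] = model

    def deploy(self, version):
        self.live = self.models[version]


def needs_retraining(model, fresh, tolerance):
    """Model monitoring: True when mean absolute error exceeds tolerance."""
    error = mean(abs(x - model.prediction) for x in fresh)
    return error > tolerance


def run_cycle(registry, raw, tolerance=1.0):
    """One pass through all stages; retrains only when monitoring says so."""
    data = engineer(raw)
    if registry.live is None or needs_retraining(registry.live, data, tolerance):
        version = len(registry.models) + 1
        registry.register(develop(data, version))
        registry.deploy(version)
    return registry.live


registry = Registry()
first = run_cycle(registry, [1.0, None, 3.0])   # no live model yet: train
second = run_cycle(registry, [2.0, 2.5])        # error within tolerance: keep
third = run_cycle(registry, [10.0, 12.0])       # data has drifted: retrain
print(first.version, second.version, third.version)
```

Keeping every registered version, rather than overwriting the live model, is what allows a rollback if a newly deployed model misbehaves.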
BEST PRACTICES
INDUSTRY LEADING DATA SCIENCE ORGANIZATIONS ADOPTING DATAOPS
• Versioning
• Platform approach
• Team makeup and organization
• Self service
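The slide names versioning without saying how it is done. One possible way to make a model run repeatable, offered here only as an assumption, is to fingerprint the data, parameters, and code version together so that identical inputs always map to the same identifier. The `fingerprint` function and its arguments are hypothetical.

```python
import hashlib
import json


def fingerprint(data_rows, params, code_version):
    """Return a SHA-256 hex digest identifying one model run.

    Keys are sorted before hashing so dict ordering does not matter.
    """
    payload = json.dumps(
        {"data": data_rows, "params": params, "code": code_version},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


rows = [[1, 2.5], [2, 3.5]]

run_a = fingerprint(rows, {"depth": 3, "rate": 0.1}, "v1")
run_b = fingerprint(rows, {"rate": 0.1, "depth": 3}, "v1")  # same run, keys reordered
run_c = fingerprint(rows, {"depth": 4, "rate": 0.1}, "v1")  # one parameter changed

print(run_a == run_b)
print(run_a == run_c)
```

Hashing the full dataset is only practical for small inputs; for large data a stored snapshot identifier could stand in for `data_rows`.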
DataOps Platform Checklist
• Unified platform for all data -- historical and real-time production
• Multitenancy and resource utilization
• Single security and access model for governance and self-service access
• Enterprise-grade for mission-critical applications and open source tools
• Run compute on the data platform -- leverage data locality
Thank you!
NEW DATAOPS APPROACH FOR DATA SCIENCE TEAMS
DataOps