Engineering Patterns for Implementing Data Science Models on Big Data Platforms
![Page 1: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/1.jpg)
Engineering Patterns for Implementing Data Science Models on Big Data Platforms
Hisham Arafat
Digital Transformation Lead Consultant, Solutions Architect, Technology Strategist & Researcher
Riyadh, KSA – 31 January 2017
![Page 2: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/2.jpg)
http://www.visualcapitalist.com/what-happens-internet-minute-2016/
Big Data…Practical Definition!
• Big Data is the challenge, not the solution
• Big Data technologies address that challenge
• Practically:
  • Massive Streams
  • Unstructured
  • Complex Processing
![Page 3: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/3.jpg)
Let’s Have a Use Case…Social Marketing
![Page 4: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/4.jpg)
Social Marketing…Looks Simple!
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What are people saying about our new brand “LemaTea”?
![Page 5: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/5.jpg)
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
![Page 6: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/6.jpg)
It’s NOT as Easy as It Looks!
![Page 7: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/7.jpg)
Not Only Building an Appropriate Model, but Also Designing a Solution…
Engineering Factors
![Page 8: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/8.jpg)
• Interfacing with sources: REST APIs, source HTML,… (text is assumed)
• Parsing to extract: queries, Regular Expressions,…
• Crawling frequency: every 1 minute, 1 hour, on event,…
• Document structure: post, post + comments, #, Reach, Retweets,…
• Metadata: time, date, source, tags, authoritativeness,…
• Transformations: canonicalization, weights, tokenization,…
- Size: average size of 2 KB / doc
- Initial load: 1.5B docs
- Frequency: every 5 minutes
- Throughput: 2 KB * 60,000 docs = 120 MB / load
- Grows per day ~ 34 GB
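The sizing figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 MB = 1,000 KB), which is what the slide's rounding suggests; the constant and function names are illustrative, not from the talk.

```python
# Figures taken from the slide.
DOC_SIZE_KB = 2          # average document size
DOCS_PER_LOAD = 60_000   # documents fetched per load
LOAD_INTERVAL_MIN = 5    # one load every 5 minutes

# Assumption: decimal units, as the slide's rounding implies.
KB_PER_MB = 1_000
MB_PER_GB = 1_000


def load_size_mb() -> float:
    """Size of a single ingestion load in MB."""
    return DOC_SIZE_KB * DOCS_PER_LOAD / KB_PER_MB


def loads_per_day() -> int:
    """Number of loads in 24 hours at the configured interval."""
    return 24 * 60 // LOAD_INTERVAL_MIN


def docs_per_day() -> int:
    """Documents ingested per day."""
    return DOCS_PER_LOAD * loads_per_day()


def daily_growth_gb() -> float:
    """Raw storage growth per day in GB."""
    return load_size_mb() * loads_per_day() / MB_PER_GB


print(load_size_mb(), loads_per_day(), docs_per_day(), daily_growth_gb())
```

With a 5-minute interval there are 288 loads per day, which gives about 17 million documents and roughly 34 GB of raw growth per day.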
Engineering Factors
![Page 9: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/9.jpg)
• Input format: text, encoded text,…
• Document representation: bag of words, ontology…
• Corpus structures: indexes, inverted indexes,…
• Corpus metrics: doc frequency, inverse doc frequency,…
• Preprocessing: annotation, tagging,…
• File structure: tables, text files, files-day,…
- No of docs: 1.5B + 17M / day
- Processing window: 60K per 3 mins
- Processing rate: 20K docs per min
- Final doc size = 2 KB * 5 ~ 10 KB
- Scan rate: 20K * 10 KB per min ~ 200 MB/min
- Many overheads need to be added
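To make the corpus structures and metrics concrete, here is a small sketch of a reverse (inverted) index and a document-frequency table built over a toy corpus. The three documents and the tokenizer rule are hypothetical; a real pipeline would add the canonicalization, tagging, and annotation steps listed above.

```python
import re
from collections import defaultdict

# Assumption: tokens are lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def document_frequency(index):
    """Number of documents each term appears in."""
    return {term: len(doc_ids) for term, doc_ids in index.items()}


# Hypothetical toy corpus, for illustration only.
docs = {
    1: "LemaTea has a sweet taste",
    2: "Not a fan of the LemaTea taste",
    3: "Sweet deals on green tea today",
}

index = build_inverted_index(docs)
df = document_frequency(index)
print(sorted(index["lematea"]), df["sweet"], df["tea"])
```

The index lets a search touch only the documents that contain a query term, instead of scanning the whole corpus.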
Engineering Factors
![Page 10: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/10.jpg)
• Dimensionality reduction: stemming, lemmatization, noise words…
• Type of applications: search/retrieval, sentiment analysis…
• Modeling methods: classifiers, topic modeling, relevance…
• Model efficiency: confusion matrix, precision, recall…
• Overheads: intermediate processing, pre-aggregation,…
• File structure: tables, text files, files-day,…
- No of docs: 1.5B + 17M / day
- Search for “LemaTea sweet taste”
- No of tf to calculate ~ 1.5B * 3 ~ 4.5B
- No of idf to calculate ~ 1.5B
- Total calculations for 1 search ~ 6B
- Consider daily growth
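The search above can be illustrated with a basic tf-idf relevance score. This is a brute-force sketch over a hypothetical four-document corpus, not the model used in the talk: it visits every document for every query term, which is the access pattern behind the calculation counts listed above. Here idf is computed once per query term as log(N / df); other weighting variants exist.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumption: whitespace tokenization on lowercased text."""
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score every document against the query with a plain tf-idf sum."""
    n_docs = len(docs)
    counts_by_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    query_terms = tokenize(query)

    # idf per query term; terms absent from the corpus contribute nothing.
    idf = {}
    for term in query_terms:
        df = sum(1 for counts in counts_by_doc.values() if term in counts)
        idf[term] = math.log(n_docs / df) if df else 0.0

    scores = {}
    for doc_id, counts in counts_by_doc.items():
        total = sum(counts.values())
        if total == 0:
            scores[doc_id] = 0.0
            continue
        # tf is the term count normalized by document length.
        scores[doc_id] = sum((counts[t] / total) * idf[t] for t in query_terms)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = {
    1: "lematea sweet taste is great",
    2: "lematea is too bitter",
    3: "coffee tastes better",
    4: "sweet tea at lunch",
}

scores = tf_idf_scores("LemaTea sweet taste", docs)
best = max(scores, key=scores.get)
print(best, scores)
```

Document 1 matches all three query terms and ranks first; document 3 matches none and scores zero.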
Engineering Factors
![Page 11: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/11.jpg)
• File structure: tables, text files, files-day,…
• File formats: HDFS, Parquet, Avro…
• Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
• Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML…
• Data ingestion: Spring XD, Flume, Sqoop, Google Cloud Dataflow, Kafka/Streaming…
• Ingestion pattern: real-time, micro batches,…
- Overall storage
- Processing capacity per node
- No of nodes
- Tables: Hive, HBase, Greenplum
- Individual files: Spark, Flink
- Files-day: Hadoop HDFS
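Overall storage and node count follow from the document counts and sizes given earlier. The sketch below is a rough estimate only: the replication factor of 3 and the 20 TB of usable capacity per node are assumptions chosen for illustration, not figures from the talk, and the daily document count is derived from 60,000 docs per 5-minute load.

```python
import math

# From the earlier slides.
INITIAL_DOCS = 1_500_000_000
DOCS_PER_DAY = 17_280_000      # 60,000 docs x 288 loads per day
FINAL_DOC_KB = 10              # document size after processing

KB_PER_TB = 1_000 ** 3         # assumption: decimal units


def raw_storage_tb(days):
    """Unreplicated corpus size in TB after the given number of days."""
    docs = INITIAL_DOCS + DOCS_PER_DAY * days
    return docs * FINAL_DOC_KB / KB_PER_TB


def nodes_needed(days, replication=3, usable_tb_per_node=20.0):
    """Data nodes required for storage alone.

    replication and usable_tb_per_node are hypothetical planning inputs.
    """
    total_tb = raw_storage_tb(days) * replication
    return math.ceil(total_tb / usable_tb_per_node)


print(raw_storage_tb(0), nodes_needed(0), nodes_needed(365))
```

Under these assumptions the initial load needs 3 nodes and one year of growth needs 12; processing capacity per node and the overheads mentioned earlier would push the real number higher.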
Engineering Factors
![Page 12: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/12.jpg)
• Workload: no of requests, request size,…
• Application performance: response time, concurrent requests…
• Applications interfacing: REST APIs, native, messaging,…
• Application implementation: integration, model scoring,…
• Security model: application level, platform level,…
- For 3 search terms ~ 6B calculations
- For 5 search terms ~ 9B calculations
- For 10 concurrent requests ~ 75B
- Resource queuing / prioritization
- Search options like date range
- Access control model
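The workload figures follow one simple rule: one tf calculation per document per search term, plus one idf pass over the corpus. The sketch below reproduces them. The slide does not say how the 10 concurrent requests are split, so the mix of five 3-term and five 5-term searches is an assumption that happens to give the same total.

```python
DOCS = 1_500_000_000  # initial corpus size from the earlier slides


def calculations_per_search(n_terms, n_docs=DOCS):
    """tf for every (document, term) pair plus one idf pass over the corpus."""
    tf_calculations = n_docs * n_terms
    idf_calculations = n_docs
    return tf_calculations + idf_calculations


def concurrent_load(term_counts, n_docs=DOCS):
    """Total calculations for a set of searches running at the same time."""
    return sum(calculations_per_search(n, n_docs) for n in term_counts)


# Assumed mix: five 3-term searches and five 5-term searches.
mix = [3] * 5 + [5] * 5
print(calculations_per_search(3), calculations_per_search(5), concurrent_load(mix))
```

Because the cost grows linearly with both corpus size and query length, search options such as a date range cut the work by shrinking the number of documents scanned.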
Engineering Factors
![Page 13: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/13.jpg)
Ongoing Process…Growing Requirements
What if?
• New sources are included
• Wider parsing criteria
• Advanced modeling: POS, Word Co-occurrence, Co-referencing, Named Entity, Relationship Extraction,…
• Better response time is needed
• More frequent ingestion
Dynamic Platform
Ingestion
Corpus Processing
Model Processing
Requests Processing
• Larger number of docs
• Increased processing requirements
• Platform expansion
• Overall architecture reconsidered
![Page 14: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/14.jpg)
Some Building Blocks
![Page 15: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/15.jpg)
What is a Data Science Model?
• Type & format of input data
• Data ingestion
• Transformations and feature engineering
• Modeling methods and algorithms
• Model evaluation and scoring
• Application implementation considerations
• In-Memory vs. In-Database
![Page 16: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/16.jpg)
Key Challenges for Data Science Models
Volume → Growth → Scale out Performance
Stationary → Streams → Data Flow Engines
Batches → Real-time → Event Processing
Structured → Unstructured → Complex Formats
Insights → Responsive → Perspective / Deep Models
![Page 17: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/17.jpg)
Traditional Data Management Systems
• Shared I/O
• Shared Processing
• Limited Scalability
• Service Bottlenecks
• High Cost Factor
[Diagram labels: Database Cluster, Database Service, Shared Buffers, Data Files, I/O, Network]
![Page 18: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/18.jpg)
Abstraction of Big Data Platforms
• Parallel Processing
• Shared Nothing
• Linear Scalability
• Distributed Services
• Lower Cost Factor
[Diagram labels: Master Nodes (Metadata, Metadata Standby), Data Nodes 1, 2, 3, …, n (User data / Replicas), I/O, Network Interconnect, Direct access to user data]
![Page 19: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/19.jpg)
In a Nutshell
Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/
• Very huge.
• Overlaps.
• Overloading.
• You need to start with a use case to be able to get your solutions well engineered.
![Page 20: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/20.jpg)
Engineered Systems
• Packaged: Hortonworks – Pivotal – Cloudera
• Appliances: EMC DCA – Dell DSSD – Dell VxRack
• Cloud offerings: Azure – AWS – IBM – Google Cloud
![Page 21: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/21.jpg)
Engineering Patterns in Implementation
![Page 22: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/22.jpg)
Lambda Architecture…Social Marketing
• Generic, scalable and fault-tolerant data processing architecture.
• Keeps a master immutable dataset while serving low latency requests.
• Aims at providing linear scalability.
Source: http://lambda-architecture.net/
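The idea can be shown in miniature. The sketch below is a toy, single-process illustration of the three Lambda layers applied to counting brand mentions; the class and method names are hypothetical. A real deployment would put the master dataset on distributed storage and run the batch and speed layers on separate engines.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture: immutable master log, batch view, speed view."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, brand):
        """New data is appended to the master log and fed to the speed layer."""
        self.master.append(brand)
        self.speed_view[brand] += 1

    def run_batch(self):
        """Batch layer: rebuild the view from the whole master dataset."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()  # the batch view now covers these records

    def query(self, brand):
        """Serving layer: merge the batch view with the real-time delta."""
        return self.batch_view[brand] + self.speed_view[brand]


mentions = LambdaCounter()
for brand in ["LemaTea", "OtherTea", "LemaTea"]:
    mentions.ingest(brand)
mentions.run_batch()
mentions.ingest("LemaTea")
print(mentions.query("LemaTea"))
```

Queries stay correct between batch runs because the speed view only holds what the batch view has not seen yet, and the master log is never modified, so any view can be rebuilt from it.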
![Page 23: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/23.jpg)
Social Marketing…Revisited
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What are people saying about our new brand “LemaTea”?
![Page 24: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/24.jpg)
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
![Page 25: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/25.jpg)
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
![Page 26: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/26.jpg)
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Sequence Files
![Page 27: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/27.jpg)
Apache Spark / MLlib
• In-memory distributed processing
• Scala, Python, Java and R
• Resilient Distributed Dataset (RDD)
• MLlib – Machine Learning Algorithms
• SQL and Data Frames / Pipelines
• Streaming
• Big Graph analytics
Spark Cluster Mesos HDFS/YARN
![Page 28: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/28.jpg)
Apache Spark
• Supports different types of cluster managers and data stores
  • HDFS / YARN, Mesos, Amazon S3, Standalone, HBase, Cassandra…
• Interactive vs Application Mode
• Memory Optimization
Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
![Page 29: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/29.jpg)
Apache Spark
![Page 30: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/30.jpg)
Apache Spark MLlib
![Page 31: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/31.jpg)
Apache Spark…The Big Picture
Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
![Page 32: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/32.jpg)
Greenplum / MADlib
• Massively Parallel Processing
• Shared Nothing
• Table distribution
  • By Key
  • By Round Robin
• Massively Parallel Data Loading
• Integration with Hadoop
• Native MapReduce
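The two table distribution policies can be illustrated outside the database. In this sketch, `zlib.crc32` stands in for the database's internal hash function, which is an assumption made for illustration; the row layout and segment count are hypothetical too.

```python
import zlib


def distribute_by_key(rows, key, n_segments):
    """Hash distribution: rows with the same key value land on the same segment."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        # crc32 is a stand-in hash, not the one the database uses.
        hashed = zlib.crc32(str(row[key]).encode("utf-8"))
        segments[hashed % n_segments].append(row)
    return segments


def distribute_round_robin(rows, n_segments):
    """Round-robin distribution: rows are dealt out evenly, ignoring content."""
    segments = [[] for _ in range(n_segments)]
    for position, row in enumerate(rows):
        segments[position % n_segments].append(row)
    return segments


# Hypothetical rows: ten posts written by four users.
rows = [{"user": f"u{i % 4}", "post": i} for i in range(10)]

by_key = distribute_by_key(rows, "user", 3)
round_robin = distribute_round_robin(rows, 3)
print([len(s) for s in by_key], [len(s) for s in round_robin])
```

Distribution by key keeps related rows together, so joins and aggregations on that key avoid moving data between segments, but a skewed key gives uneven segments. Round robin guarantees balance at the cost of that locality.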
![Page 33: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/33.jpg)
Apache MADlib
![Page 34: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/34.jpg)
Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing
Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
![Page 35: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/35.jpg)
Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing
Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
![Page 36: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/36.jpg)
Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing
Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
![Page 37: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/37.jpg)
Take Aways
• Data science is not just the algorithms; it includes an end-to-end solution.
• The implementation should consider engineering factors and quantify them so appropriate components can be selected.
• The Big Data technology landscape is huge and growing; start with a solid use case to identify potential components.
• Abstraction of a specific technology will enable you to put your hands on its pros and cons.
• Be creative in solution design and technology selection, case by case.
• Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.
![Page 38: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/38.jpg)
Q & A
![Page 39: Engineering patterns for implementing data science models on big data platforms](https://reader031.vdocuments.us/reader031/viewer/2022030309/58f2d74b1a28abdd238b4567/html5/thumbnails/39.jpg)
Email: [email protected]: hichawy
LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230
Thank You