Synthetic Data Generation for Realistic Analytics Examples and Testing
Ronald J. Nowling
Red Hat, Inc. [email protected]
http://rnowling.github.io/
Who Am I?
• Software Engineer at Red Hat
• Data Science Team, Emerging Technologies
  – Evaluate open-source Big Data space
  – Ensure software works for Red Hat customers
  – Promote data science internally through consulting projects
• Apache BigTop PMC
Synthetic Data
• No licensing, privacy, or intellectual property concerns
• Scalable: Laptops to Clusters!
• More reliable than external data sets
• Enable more realistic example applications
• Enable more comprehensive testing than wordcount and TeraSort
Data Transformation and Summarization Pipeline
[Diagram: three Raw Daily Page Views inputs, each passing through Transform Raw Text, Parse, Clean & Validate, and Summarize stages; an Accounts input; an Aggregate stage; outputs DailyActivity and CumulativeActivity.]
Timings
• Data set
  – 1000s of files
  – 100s of GBs compressed (gzip)
• Conversion from .tsv.gz to Parquet: ~45 min
• Compute aggregations on Parquet data and write out: ~2 min
Synthetic Data
• Sensitive Data
  – Real data on cluster for scalability testing and validation
  – Synthetic data for local development and testing
• Smaller data sets for checking calculations
  – Total aggregation results require re-running old pipeline
  – Extra burden on operations team
  – Delay for development team
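Validation with Synthetic Data
[Diagram: the DataGenerator produces Accounts and Raw Daily Page Views along with Expected Daily Activity and Expected Cumulative Activity. The Transformation and Summarization Pipeline produces Daily Activity and Cumulative Activity, and the ValidationScript checks them against the expected outputs.]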
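The validation loop in this diagram can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code from the talk: the names `generate`, `pipeline`, and `validate`, the `(day, account_id)` row layout, and the view-count metric are all hypothetical. The idea it illustrates is that the generator knows the right answer because it created the data, so the expected outputs come for free.

```python
import random
from collections import Counter


def generate(seed, n_accounts=5, n_days=3, max_views=4):
    """Hypothetical generator: returns pipeline inputs plus expected outputs."""
    rng = random.Random(seed)
    accounts = [f"acct{i}" for i in range(n_accounts)]
    raw_views = []                   # one (day, account_id) row per page view
    expected_daily = {}              # (day, account_id) -> view count
    expected_cumulative = Counter()  # account_id -> total view count
    for day in range(n_days):
        for acct in accounts:
            n = rng.randint(0, max_views)
            raw_views.extend((day, acct) for _ in range(n))
            if n:
                expected_daily[(day, acct)] = n
                expected_cumulative[acct] += n
    rng.shuffle(raw_views)  # the pipeline must not depend on row order
    return accounts, raw_views, expected_daily, dict(expected_cumulative)


def pipeline(accounts, raw_views):
    """Stand-in for the pipeline under test (hypothetical)."""
    known = set(accounts)
    daily = Counter((day, acct) for day, acct in raw_views if acct in known)
    cumulative = Counter()
    for (_day, acct), n in daily.items():
        cumulative[acct] += n
    return dict(daily), dict(cumulative)


def validate(seed):
    """Compare actual pipeline outputs with the generator's expected outputs."""
    accounts, raw_views, exp_daily, exp_cumulative = generate(seed)
    daily, cumulative = pipeline(accounts, raw_views)
    return daily == exp_daily and cumulative == exp_cumulative
```

Calling `validate` with a handful of seeds gives a cheap regression check that runs on a laptop, without touching the sensitive data on the cluster.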
Issues Tackled
• Error in account validation introduced while refactoring code
• Usage of the correct join types
• Validation of date-time operations
• Correct Output Formats
Gzipped Files
• Gzip doesn’t support random access – entire file needs to be decompressed sequentially
• Large files – multiple gigabytes uncompressed
• Too many files read in parallel -> long GC pauses or OOM errors
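Because gzip has to be decompressed from the start, the memory-friendly way to read such a file is as a stream. The sketch below uses Python's standard `gzip` module; the function name `count_records` and the one-record-per-line assumption are illustrative only. Memory use is bounded by the longest line rather than by the multi-gigabyte uncompressed size.

```python
import gzip


def count_records(path):
    """Stream a gzipped text file line by line instead of loading it whole."""
    n = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for _line in handle:  # decompression happens incrementally
            n += 1
    return n
```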
(Quirky) TSV Files
• Tab-separated, no quoting
• Escaped tabs and newlines within records
  – E.g., \\n or \\t
• Improperly escaped tabs and newlines
  – E.g., \\\t vs \\\\t
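A single-pass parser is one way to cope with escapes like these. The sketch below is an assumption-laden illustration, not the parser from the talk: it assumes that inside a field a backslash followed by `t` means a tab, a backslash followed by `n` means a newline, and a doubled backslash means one backslash, while a literal tab separates fields. The name `split_fields` and the handling of unknown escapes are hypothetical choices.

```python
# Assumed escape convention (hypothetical): maps the character after a backslash.
ESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}


def split_fields(line):
    """Split one record on literal tabs, decoding backslash escapes in one pass."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\t":
            # literal tab: end of the current field
            fields.append("".join(current))
            current = []
        elif ch == "\\":
            nxt = next(chars, "")
            if nxt in ESCAPES:
                current.append(ESCAPES[nxt])
            elif nxt == "\t":
                # backslash right before a real separator: keep it, end the field
                current.append("\\")
                fields.append("".join(current))
                current = []
            else:
                # unknown escape or trailing backslash: keep the characters as-is
                current.append("\\" + nxt)
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields
```

Consuming the characters with an iterator matters for correctness as well as speed: a plain string replace of backslash-t with a tab would misread an escaped backslash that happens to be followed by the letter `t`.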
Solutions
• Convert to Parquet as quickly as possible
• Use fewer cores per node
  – More RAM / task (partition)
• 2-phase grouping algorithm
  – Group within partition
  – Group partition ends using shuffle
  – Union
• Optimized string operations
  – Use iterators instead of concatenation and replace
  – Custom CSV parser implementation
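One possible reading of the 2-phase grouping bullet is sketched below in plain Python, with lists standing in for partitions. It assumes the goal is to group runs of consecutive items that share a key, where a run may straddle a partition boundary; the function names and the `(key, value)` item layout are hypothetical, and the slides do not confirm that this is the exact algorithm used. In Spark, phase 1 would roughly correspond to a per-partition operation and phase 2 to shuffling only the small set of boundary groups.

```python
from itertools import groupby


def group_within(partition):
    """Phase 1: group consecutive items with the same key inside one partition."""
    return [
        (key, [value for _, value in items])
        for key, items in groupby(partition, key=lambda kv: kv[0])
    ]


def two_phase_group(partitions):
    """Group runs of equal keys, merging runs that cross partition boundaries."""
    interior = []  # groups that cannot touch a partition boundary
    boundary = []  # per partition: its first and last group (or its only group)
    for part in partitions:
        groups = group_within(part)
        if not groups:
            continue
        interior.extend(groups[1:-1])
        boundary.append(groups[:1] + groups[-1:] if len(groups) > 1 else groups)

    # Phase 2: only the first group of a partition may continue the
    # last group of the previous partition.
    merged = []
    for ends in boundary:
        for position, (key, values) in enumerate(ends):
            if position == 0 and merged and merged[-1][0] == key:
                merged[-1][1].extend(values)
            else:
                merged.append((key, list(values)))

    # Union of untouched interior groups and merged boundary groups.
    return interior + merged
```

The design choice worth noting is that only the boundary groups take part in the second phase, so most of the data never needs to move between partitions.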
Apache BigTop BigPetStore Blueprints
• Problem domain: Transactions for a fictional chain of pet stores
• BigPetStore Data Generator simulates customer purchasing behavior to generate realistic transaction data
• Blueprints for big data ecosystem
  – Hadoop: MapReduce / Pig / Hive / Mahout
  – Spark
  – Flink (in progress)
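To give a flavor of what simulating purchasing behavior can mean, here is a toy generator in Python. It is emphatically not the BigPetStore model: the product list, the per-customer purchase rate, and the exponential gaps between purchases are all assumptions made for illustration. It does show two properties that make synthetic data useful for testing, namely a fixed seed for reproducibility and a size that scales with the parameters.

```python
import random

PRODUCTS = ["dog food", "cat litter", "fish flakes"]  # hypothetical catalog


def generate_transactions(n_customers, days, seed=0):
    """Return (time_in_days, customer_id, product) tuples sorted by time."""
    rng = random.Random(seed)
    transactions = []
    for customer_id in range(n_customers):
        # Assumed behavior: each customer has their own average gap between purchases.
        mean_gap = rng.uniform(2.0, 10.0)
        t = rng.expovariate(1.0 / mean_gap)
        while t < days:
            transactions.append((t, customer_id, rng.choice(PRODUCTS)))
            t += rng.expovariate(1.0 / mean_gap)
    transactions.sort()
    return transactions
```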
BigPetStore
[Diagram, built up over several slides, with the components HCFS, Core (RDDs), Spark SQL, and MLLib.]
Team Cluster
• ~10 nodes
• 40 cores, 400GB RAM per node
Potential Issues
• Infrastructure
• Storage
• Software Installation
• Software Upgrades
• Spark Configuration Tuning
• User Management
Real Stories
• Creating a new user
  – User Gluster permissions incorrect
• Cluster upgrade
  – Spark upgrade didn’t take because of issue with Ansible role configuration
  – Wiped out our spark.conf
  – master / mesos settings wrong
• Gluster mount points disappeared on reboot
  – Not set in fstab
k8petstore
[Diagram: Users, Public IP Proxy, Web Application, a Redis Master with three Redis Slaves, and three BPS DataGenerators.]
Use Cases
• Configuration
• Scalability
• Fault Tolerance
k8petstore
• OpenContrail networking solution demo [1]
• Kubernetes JuJu Charm documentation example [2]
• Kubernetes v1.0 launch talk at OSCON [3]

[1] https://pedrormarques.wordpress.com/2015/04/24/kubernetes-and-opencontrail/
[2] http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html
[3] http://www.oscon.com/open-source-2015/public/schedule/detail/45281
APACHE BIGTOP DATA GENERATORS
BigPetStore
BigTop Weatherman
BigTop Bazaar
Vision
• Encourage synthetic data generation for testing and realistic examples
• Serve as a resource for the larger Apache and open source communities
• Emphasis on
  – Flexibility
  – Scalability
  – Realism
• We look forward to collaborating and getting folks involved!
Resources
http://bigtop.apache.org/
http://github.com/apache/bigtop
http://rnowling.github.io/
Conclusion
• Synthetic data generators and blueprints are useful!
• Case studies:
  – Data Processing Pipelines
  – Cluster Deployment
  – Kubernetes
• BigPetStore and BigTop Data Generators efforts in Apache BigTop
• Open invitation to get involved and collaborate
QUESTIONS