Building Custom Big Data Integrations
![Page 2: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/2.jpg)
Agenda
Ingest, Data Drift and StreamSets
Short Demo
Building a custom integration
Real-world integration: Salesforce Wave Analytics
![Page 3: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/3.jpg)
Traditional and Big Data Founders
Company Background
Top tier Investors
Momentum to Date
Strategic Partners
● Launched 2014; exited stealth 9/15
● ~30 employees
● Double-digit enterprise customers
● 10,000 downloads
![Page 4: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/4.jpg)
Past ETL ETL
Emerging Ingest Analyze
Data Sources Data Stores Data Consumers
Market Trends
![Page 5: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/5.jpg)
Problem: Data Drift
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift
![Page 6: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/6.jpg)
Delayed and False Insights
Solving Data Drift
Tools
Applications
Data Sources Data Stores Data Consumers
Poor Data Quality
Data Drift
Custom code
Fixed-schema
![Page 7: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/7.jpg)
Trusted Insights
Data KPIs
Solving Data Drift
Tools
Applications
Data Sources Data Stores Data Consumers
Data Drift
Intent-Driven
Drift-Handling
![Page 8: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/8.jpg)
Demo
Let’s build a simple pipeline to answer a real question:
What’s the biggest city lot in San Francisco?
![Page 9: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/9.jpg)
Customizing StreamSets
Currently 25 standard StreamSets destinations, covering a wide variety of target systems, from flat files to S3 to Kafka
But… there’s always some system not on the list
Solution: DIY!
![Page 10: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/10.jpg)
Create Your Own Destination
Five Step Process:
○ Create template from Maven archetype
○ Add logging
○ Create a record buffer
○ Add configuration parameters
○ Send data to external system
bit.ly/sdc-dest
Your System Here!
![Page 11: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/11.jpg)
Create Template from Archetype
mvn archetype:generate -DarchetypeGroupId=com.streamsets -DarchetypeArtifactId=streamsets-datacollector-stage-lib-tutorial -DarchetypeVersion=1.3.0.0 -DinteractiveMode=true
![Page 12: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/12.jpg)
Add Logging
Not 100% necessary, but VERY helpful
StreamSets uses SLF4J
$ tail -f streamsets-datacollector-1.3.0.0/log/sdc.log
![Page 13: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/13.jpg)
Create a Record Buffer
Leverage existing code where possible!
StreamSets includes generators for CSV, JSON, Avro, Protocol Buffers etc.
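To make the idea of a record buffer concrete, here is a minimal sketch that does not use the StreamSets API at all. The class `CsvRecordBuffer` and its methods are hypothetical names invented for illustration, and the hand-rolled CSV escaping only stands in for the generators mentioned above; in a real stage you would reuse the existing generators instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, SDK-free buffer: collects rows, then renders them as CSV text. */
final class CsvRecordBuffer {
    private final List<List<String>> rows = new ArrayList<>();

    /** Stores a defensive copy of one row of field values. */
    void add(List<String> fields) {
        rows.add(new ArrayList<>(fields));
    }

    int size() {
        return rows.size();
    }

    /** Renders all buffered rows as CSV, one row per line, and empties the buffer. */
    String drain() {
        StringBuilder out = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    out.append(',');
                }
                out.append(escape(row.get(i)));
            }
            out.append('\n');
        }
        rows.clear();
        return out.toString();
    }

    /** Quotes a field that contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
}
```

The buffer is drained in one call so that the payload sent to the external system always corresponds to a whole number of records.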
![Page 14: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/14.jpg)
Configuration
Separate configuration and code
DON’T PUT CREDENTIALS IN CODE!!!
DON’T PUT CREDENTIALS IN CODE!!!
Make your users’ and your lives easier!
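One way to keep credentials out of code, sketched independently of the StreamSets API: read them from an external properties source and fail fast when one is missing. The class `DestinationConfig` and the property keys `destination.url`, `destination.username` and `destination.password` are assumptions made up for this example, not names from the tutorial.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

/** Hypothetical holder for settings that live outside the source code. */
final class DestinationConfig {
    final String url;
    final String username;
    final String password;

    private DestinationConfig(String url, String username, String password) {
        this.url = url;
        this.username = username;
        this.password = password;
    }

    /** Loads settings from any Reader (a file, a test string, ...). */
    static DestinationConfig load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return new DestinationConfig(
                require(props, "destination.url"),
                require(props, "destination.username"),
                require(props, "destination.password"));
    }

    /** Returns the trimmed value, or throws if the key is absent or blank. */
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing required setting: " + key);
        }
        return value.trim();
    }
}
```

Failing at load time, rather than at the first write, means a misconfigured pipeline is rejected before any data is buffered.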
![Page 15: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/15.jpg)
Send Data to the External System
Don’t forget security policy!
streamsets-datacollector/etc/sdc-security.policy
grant codebase "file://${sdc.dist.dir}/user-libs/sampletest/-" {
  permission java.net.SocketPermission "requestb.in", "connect, resolve";
};
![Page 16: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/16.jpg)
A Real Custom Destination
Salesforce Wave Analytics
● Adapt to batch processing model
○ Configure wait time before ‘closing’ a batch
● External Data API
○ Create new dataset
○ Write to dataset
○ Close dataset on timeout
○ Trigger dataflow execution
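The "wait time before closing a batch" idea can be sketched as follows. This is not the Wave Analytics destination's actual code: `TimedBatcher` is a hypothetical class, records are plain strings, and the sketch assumes the wait is measured as idle time since the last write, which the slide does not specify. The caller passes in the current time so the logic can be exercised without sleeping.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical batcher: closes the open batch once no write has arrived for maxWaitMillis. */
final class TimedBatcher {
    private final long maxWaitMillis;
    private final List<String> pending = new ArrayList<>();
    private long lastWriteMillis;

    TimedBatcher(long maxWaitMillis) {
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Appends a record to the open batch and restarts the idle timer. */
    void write(String record, long nowMillis) {
        pending.add(record);
        lastWriteMillis = nowMillis;
    }

    /**
     * Returns the records of the batch if it has been idle long enough,
     * otherwise an empty list. A returned batch is removed from the buffer,
     * so the caller can upload it and trigger downstream processing.
     */
    List<String> closeIfIdle(long nowMillis) {
        boolean idleLongEnough = nowMillis - lastWriteMillis >= maxWaitMillis;
        if (pending.isEmpty() || !idleLongEnough) {
            return new ArrayList<>();
        }
        List<String> closed = new ArrayList<>(pending);
        pending.clear();
        return closed;
    }
}
```

In production the timestamps would typically come from `System.currentTimeMillis()`, with `closeIfIdle` called periodically.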
![Page 17: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/17.jpg)
Conclusion
StreamSets Data Collector makes simple tasks easy, complex tasks possible
Use ‘off the shelf’ stages for simple tasks
Leverage script processors (Jython, JavaScript, Groovy) for more complex work
Build custom stages for ultimate performance, flexibility
![Page 18: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/18.jpg)
Thank You!
![Page 19: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/19.jpg)
Structure Drift
Data structures and formats evolve and change unexpectedly
Implication: Data Loss, Data Squandering
Delimited Data
107.3.137.195 fe80::21b:21ff:fe83:90fa
Attribute Format Changes
{ "first": "jon", "last": "smith", "email": "[email protected]", "add1": "123 Washington", "add2": "", "city": "Tucson", "state": "AZ", "zip": "85756" }
{ "first": "jane", "last": "smith", "email": "[email protected]", "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield", "state": "VA", "zip": "24435-1001", "phone": "401-555-1212" }
Data Structure Evolution
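The two records above differ in two ways: the second gains a `phone` field and its `zip` switches to the ZIP+4 form. A minimal sketch of how such drift could be detected, assuming records are already parsed into maps; `StructureDrift` and its methods are hypothetical helpers, not part of StreamSets.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helpers for spotting structure drift between two parsed records. */
final class StructureDrift {
    private StructureDrift() {
    }

    /** Field names present in {@code current} but absent from {@code baseline}, sorted. */
    static Set<String> addedFields(Map<String, String> baseline, Map<String, String> current) {
        Set<String> added = new TreeSet<>(current.keySet());
        added.removeAll(baseline.keySet());
        return added;
    }

    /** True only for the original five-digit ZIP format, e.g. "85756". */
    static boolean isFiveDigitZip(String zip) {
        return zip.matches("\\d{5}");
    }
}
```

A pipeline could route records with unexpected fields or formats to an error stream rather than dropping them, which avoids the data loss named above.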
![Page 20: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/20.jpg)
Semantic Drift
Data semantics change with evolving applications
Implication: Data Corrosion, Data Loss
24122-52172 00-24122-52172
Account Number Expansion
M134: user {jsmith} read access granted {ac:24122-52172}
M134: user {jsmith} read access granted {ca.ac:24122-52172}
Namespace Qualification
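A parser written against the first log line would silently miss the second. As a sketch, the extraction below tolerates both forms shown above: an optional namespace qualifier before `ac:` and an optional two-digit prefix on the account number. The class `AccountParser` and the exact pattern are assumptions derived only from the sample lines on this slide.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor that survives the two semantic changes shown on the slide. */
final class AccountParser {
    // {ac:24122-52172}, {ca.ac:24122-52172} and {ac:00-24122-52172} all match.
    private static final Pattern ACCOUNT =
            Pattern.compile("\\{(?:[a-z]+\\.)?ac:((?:\\d{2}-)?\\d{5}-\\d{5})\\}");

    private AccountParser() {
    }

    /** Returns the account number found in a log line, if there is one. */
    static Optional<String> extract(String logLine) {
        Matcher m = ACCOUNT.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Returning an empty `Optional` rather than a default value makes unparseable lines visible to the caller instead of corroding the data downstream.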
……
…,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,…
……
Outlier / Anomaly Detection
![Page 21: Building Custom Big Data Integrations](https://reader035.vdocuments.us/reader035/viewer/2022062400/587641801a28ab68098b85e5/html5/thumbnails/21.jpg)
Infrastructure Drift
Physical and Logical Infrastructure changes rapidly
Implication: Poor Agility, Operational Downtime
Data Center 1 Data Center 2 Data Center n
3rd Party Service Provider
App a App k App q
Cloud
Infrastructure