Peeling the Onion: How Data Abstractions Help Build Big Data Apps
Andreas Neumann @caskoid
November 2016
Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.
cask.co4
The Case for Abstractions
Abstraction is a mental process we use when trying to discern what is essential or relevant to a problem.
Tom G. Palmer
Common Abstractions in Computing
- Programming Languages
  - Assembler > C > C++ > Java > Scala > ?
  - Memory management, Concurrency, Closures, …
- Web App Servers
  - CGI-bin > Servlets > JAX-RS
  - Connection Pools, Security, …
- Relational Databases
  - Primitive types > Semi-structured > ORM
  - Transactions, rollback, isolation
Abstractions in Hadoop
- MapReduce
  - Input/OutputFormat provides some kind of abstraction
  - Intermediate data (mapper output) must be Writable
- HBase
  - Row/column keys and values are byte[]
  - Client must implement encoding of higher-level types
- Transactions: Isolation, Consistency
- Existing data abstractions for Hadoop
  - Apache Hive, Apache Phoenix, …
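To illustrate the HBase point about byte[] keys and values, here is a minimal Python sketch of the encoding work a client ends up doing by hand. The function names and the (user id, timestamp) key layout are assumptions made for this example; they are not part of HBase or any other library.

```python
import struct
from typing import Tuple


def encode_row_key(user_id: int, timestamp_ms: int) -> bytes:
    """Pack a (user_id, timestamp) pair into a fixed-width 16-byte key.

    Big-endian unsigned integers compare bytewise in numeric order,
    which is what a byte-ordered store needs for range scans.
    """
    return struct.pack(">QQ", user_id, timestamp_ms)


def decode_row_key(key: bytes) -> Tuple[int, int]:
    """Inverse of encode_row_key."""
    user_id, timestamp_ms = struct.unpack(">QQ", key)
    return user_id, timestamp_ms


def encode_value(text: str) -> bytes:
    """Strings must be turned into bytes explicitly."""
    return text.encode("utf-8")


def decode_value(raw: bytes) -> str:
    return raw.decode("utf-8")


key = encode_row_key(42, 1_478_000_000_000)
```

Every application that talks to the raw store has to repeat (and agree on) this kind of code, which is the gap a storage format abstraction is meant to close.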
Layers of Abstractions
[Figure: onion diagram of abstraction layers, labeled: engine, storage format, schema, isolation, consistency, access patterns, encapsulation, integrations, data sharing, capability injection]
Storage Engine Abstraction
- Storage Engine
  - Physical Storage Medium
  - Lowest level of the abstraction stack
- Benefits
  - Application code not “polluted” with low-level storage APIs
  - Portability across storage engines
  - Portability across different versions of the storage engine
  - Testability in environments with different storage engines
  - Reusability of code
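A small sketch of what the storage engine abstraction buys the application, in Python. `KeyValueStore`, `InMemoryStore` and `record_visit` are hypothetical names invented for this example; a real deployment would supply another implementation of the same interface backed by the actual engine.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """Hypothetical engine-neutral interface the application codes against."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...


class InMemoryStore(KeyValueStore):
    """Stand-in engine, useful for unit tests without a cluster."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


def record_visit(store: KeyValueStore, user: str) -> int:
    """Application logic: it never touches an engine-specific API."""
    key = user.encode("utf-8")
    raw = store.get(key)
    count = int(raw.decode("utf-8")) + 1 if raw is not None else 1
    store.put(key, str(count).encode("utf-8"))
    return count
```

Because `record_visit` depends only on the interface, the same code can be tested against `InMemoryStore` and run against a different engine or engine version.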
Storage Format Abstraction
- Representation of data in the storage engine
  - Serialization of data types to native storage format
  - Mapping complex types to storage format (ORM)
  - Schema representation
  - Provided partially by some storage engines (SQL)
- Benefits
  - Application is not concerned with serialization/deserialization
  - Schema evolution
  - Enforces correct schema and representation
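A minimal sketch of schema enforcement and schema evolution in Python. The schema-as-dict representation, the JSON encoding and the `defaults` mechanism are assumptions chosen to keep the example short; production systems typically use a dedicated serialization format.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"user": str, "followers": int}


def serialize(record: Dict[str, Any], schema: Dict[str, type]) -> bytes:
    """Validate the record against the schema, then encode it."""
    if set(record) != set(schema):
        raise ValueError("record fields do not match schema")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"field {field!r} must be {expected.__name__}")
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(raw: bytes, schema: Dict[str, type],
                defaults: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Decode a record; fields added to the schema later get defaults."""
    record = json.loads(raw.decode("utf-8"))
    result = {}
    for field in schema:
        if field in record:
            result[field] = record[field]
        elif defaults is not None and field in defaults:
            result[field] = defaults[field]  # schema evolution
        else:
            raise ValueError(f"missing field {field!r} and no default")
    return result
```

Data written under the old schema can then be read under a newer one, for example with `deserialize(raw, {**SCHEMA, "verified": bool}, defaults={"verified": False})`.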
Consistency Abstractions
- Strong vs. Eventual Consistency
- Transactional (ACID) consistency
  - Protect data from concurrent modification
  - Isolation / visibility guarantees
  - Optimistic Concurrency Control: Handling conflicts
- Benefits
  - Application code not concerned with consistency
  - “Framework level correctness”
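Optimistic concurrency control can be sketched in a few lines of Python. This is a simplified, single-process illustration under assumed names (`VersionedStore`, `ConflictError`, `increment`); it is not how any particular transaction system is implemented.

```python
from typing import Any, Dict, Tuple


class ConflictError(Exception):
    """Raised when a key changed after the transaction read it."""


class VersionedStore:
    def __init__(self) -> None:
        # key -> (version, value); version 0 means "never written"
        self._data: Dict[str, Tuple[int, Any]] = {}

    def read(self, key: str) -> Tuple[int, Any]:
        return self._data.get(key, (0, None))

    def commit(self, writes: Dict[str, Any],
               read_versions: Dict[str, int]) -> None:
        # Validate first: every written key must still be at the
        # version the transaction saw when it read the key.
        for key in writes:
            current_version, _ = self.read(key)
            if current_version != read_versions.get(key, 0):
                raise ConflictError(key)
        # No conflicts: apply all writes and bump versions.
        for key, value in writes.items():
            current_version, _ = self.read(key)
            self._data[key] = (current_version + 1, value)


def increment(store: VersionedStore, key: str, retries: int = 3) -> None:
    """The retry loop lives at the framework level, not in app code."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.commit({key: (value or 0) + 1}, {key: version})
            return
        except ConflictError:
            continue
    raise ConflictError(key)
```

The application only calls `increment`; conflict detection and retries are handled below it, which is the "framework level correctness" named in the benefits.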
Data Sharing Abstractions
- Sharing/Reusing data across programming paradigms
  - Write with Spark Streaming, query with SQL
  - Share data between batch (MapReduce) and realtime streaming
  - Data as a Service (DaaS)
- Benefits
  - No data silos
  - Less redundancy in data access
Data Access Pattern Abstractions
- Encapsulation of common data access patterns
- Examples:
  - Indexed Table
  - TimeSeries
  - Cube
- Benefits
  - Cleaner application code
  - Enforcement of best practices
  - Avoid data corruption
  - Separation of concerns/responsibilities
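As an illustration of the Indexed Table pattern, the following Python sketch keeps a secondary index in step with the primary data on every write. The class and method names are assumptions for this example and do not reproduce any specific dataset API; rows are assumed to always contain the indexed column.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Set


class IndexedTable:
    """Table with one secondary index, maintained on every put."""

    def __init__(self, index_column: str) -> None:
        self._index_column = index_column
        self._rows: Dict[str, Dict[str, Any]] = {}
        self._index: Dict[Any, Set[str]] = defaultdict(set)

    def put(self, row_key: str, row: Dict[str, Any]) -> None:
        old = self._rows.get(row_key)
        if old is not None:
            # Remove the stale index entry so the index never
            # points at a row under its previous value.
            self._index[old[self._index_column]].discard(row_key)
        self._rows[row_key] = dict(row)
        self._index[row[self._index_column]].add(row_key)

    def get(self, row_key: str) -> Optional[Dict[str, Any]]:
        return self._rows.get(row_key)

    def read_by_index(self, value: Any) -> List[Dict[str, Any]]:
        """Return matching rows, ordered by row key."""
        return [self._rows[k] for k in sorted(self._index.get(value, ()))]
```

Application code that wrote to the data table and the index table separately could forget the stale-entry cleanup; encapsulating the pattern is what avoids that kind of corruption.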
Capability Injection
- Framework level Enterprise capabilities
  - Metrics
  - Meta Data
  - Lineage, Access Audit Trail, Usage stats
  - Access Control
- Benefits
  - Operational Capabilities solved at the framework level
  - Compliance, Governance
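One way to picture capability injection is a wrapper that the framework places around a dataset, so that metrics and an audit trail are recorded without any change to application logic. The Python sketch below is a toy under assumed names (`InstrumentedDataset`, the `accessor` argument, a plain dict as the underlying dataset).

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


class InstrumentedDataset:
    """Wraps a dict-like dataset and records metrics and audit entries."""

    def __init__(self, name: str, inner: Dict[Any, Any],
                 metrics: Counter,
                 audit_log: List[Tuple[str, str, str]]) -> None:
        self._name = name
        self._inner = inner
        self._metrics = metrics
        self._audit_log = audit_log

    def put(self, key: Any, value: Any, accessor: str) -> None:
        self._record("write", accessor)
        self._inner[key] = value

    def get(self, key: Any, accessor: str) -> Any:
        self._record("read", accessor)
        return self._inner.get(key)

    def _record(self, operation: str, accessor: str) -> None:
        # Metric name and audit tuple layout are illustrative choices.
        self._metrics[f"{self._name}.{operation}"] += 1
        self._audit_log.append((accessor, operation, self._name))
```

The wrapper is created once, when the dataset is handed to the program, so the capability is injected at the framework level rather than coded into each application.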
The Cost of Abstraction
First you learn the value of abstraction, then you learn the cost of abstraction, then you're ready to engineer.
Kent Beck
Clean Cut Abstractions
[Figure: layer diagram labeled: engine, storage format, consistency, encapsulation, data sharing, capability injection]
What Makes a Good Abstraction
- Minimal Overhead
  - Injection happens once
  - Not in critical path / inner loop
- Not more code
  - Separation from app code
  - Reusability
- Storage Optimization
  - May not expose all the knobs and dials of the storage engine
  - Allow bypassing the abstraction when necessary
• Application Development and Management
• Provides Data and Programming Abstractions
• Provides Integrations
• Data-As-A-Service
• Empower developers
• Simple Access to Powerful Tech
• WYSIWYG Data Pipelines: Streaming, Batch
• Ingestion, Transformation, Blending (complex joins) and Lookup
• Machine Learning, Aggregation and Reporting
• Connectors for varied sources and sinks
• Easy way to catalog application and pipeline level metadata
• Search across technical, business and operational metadata
• Track Lineage and Provenance
• Data Quality Measure
• Integration with other MDM systems
Data Abstractions In Practice
- Use Case:
  - Ingest from Twitter into a Dataset
  - Run MapReduce over the Dataset to compute frequent #hashtags
  - Service to retrieve the top #hashtags
  - See the lineage for this Dataset
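The hashtag computation in this use case follows the classic map/reduce shape. The sketch below shows that shape in plain Python on an in-memory list; it is not the demo's actual code, and the function names, the regular expression and the lower-casing of tags are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

HASHTAG = re.compile(r"#\w+")


def map_tweet(text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (hashtag, 1) for every hashtag in a tweet."""
    for tag in HASHTAG.findall(text):
        yield tag.lower(), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the counts per hashtag."""
    counts: Counter = Counter()
    for tag, n in pairs:
        counts[tag] += n
    return counts


def top_hashtags(counts: Counter, k: int) -> List[Tuple[str, int]]:
    """What the service would return: top k, ties broken by tag name."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


tweets = ["Loving #Hadoop and #spark", "#hadoop rocks", "no tags here"]
counts = reduce_counts(pair for t in tweets for pair in map_tweet(t))
```

With the sample tweets above, `top_hashtags(counts, 1)` yields `[("#hadoop", 2)]`.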
Thank you
@CaskData
github.com/caskdata/cdap
github.com/caskdata/hydrator-plugins
Questions?