Architecture styles and deployment on Hadoop

Architectural Patterns and Best Practices: #BigData #Hadoop. Srividhya Balasubramaniam, Data and Information Management Consultant

Upload: anu-ravindranath

Post on 24-Jan-2017

TRANSCRIPT

Page 1: Architectures styles and deployment on the hadoop

Architectural Patterns and Best Practices: #BigData #Hadoop

Srividhya Balasubramaniam, Data and Information Management Consultant

Page 2: Architectures styles and deployment on the hadoop

Ice Breaker

120 Sec

Shhhhh!

Page 3: Architectures styles and deployment on the hadoop

Agenda

• Why are enterprises rethinking their data strategy?
• Modernizing Enterprise Data Warehouses
• Architectural Patterns and Design Considerations
• Best Practices

Analytics Architecture

Application Architecture

Platform Architecture

Page 4: Architectures styles and deployment on the hadoop

“Because we have been doing stuff this way for ages!” is not the norm.

Re-Think!

Page 5: Architectures styles and deployment on the hadoop

Drivers of Change

What Has Not Changed

DATA QUALITY AND GOVERNANCE

INFORMATION SECURITY

METADATA MANAGEMENT

DATA SOURCES

DATA STORE

DATA ACCESS

ORCHESTRATION AND SCHEDULING

Page 6: Architectures styles and deployment on the hadoop

Challenges? Velocity, Variety and Volume

Page 7: Architectures styles and deployment on the hadoop

What is the right tool?

How should I use the tool?

Reference architecture?

What language and tool should I learn?

Why? Why? Why? Why?

What is data modelling like in Hadoop?

Buy or build?

Page 8: Architectures styles and deployment on the hadoop

Core Design Principles

• What Business Problem is being Solved?
• Define Tool Selection Criteria
• Decouple processing, store and systems
• Hybrid Architecture
• Leverage Batch and Stream
• Scalable, Reliable, Fit for Purpose, Secure
• Available, Very Low Admin Cost
• Supportable and Operations Monitoring
• Best Design is cheap

Page 9: Architectures styles and deployment on the hadoop

Typical Data Pipeline

Data Source Ingest
• RDBMS
• SEARCH
• FILES/API
• MESSAGING
• IOT/STREAM

Store Raw
• DATABASE
• SEARCH DOCUMENTS
• DIST FILE STORAGE
• QUEUE
• STREAM STORE

Process for Analysis
• BATCH
• INTERACTIVE
• STREAMING
• MESSAGING
• MACHINE LEARNING

Store
• Key Value
• Graph
• Document
• Queue
• MPP

Insights
• Analytical Models
• Visualization
• Self Service BI

Storage of Messaging and Streaming: Criteria

1. How Distributed Services are managed
2. Guaranteed Ordering
3. Data Delivery
4. Data Retention Period
5. Availability
6. Scalability
7. Throughput
8. Parallel Clients
9. Object Size
10. Stream Map Reduce
11. Cost

E.g. Apache Kafka
• Guaranteed Ordering, Parallel Clients and Stream MR
• Configurable Data Retention, Availability, Object Size
• Low cost but more admin
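The ordering and retention criteria above can be made concrete with a small sketch. This is not Kafka's API: `PartitionedLog` and its methods are hypothetical names for a toy in-memory model. It assumes ordering is guaranteed per partition (as it is in Kafka) and, for simplicity, expresses retention as a record count rather than a time or size limit.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only topic (hypothetical)."""

    def __init__(self, num_partitions, retention=None):
        self.num_partitions = num_partitions
        # Assumption: retention is a max record count per partition (None keeps all).
        self.retention = retention
        self.partitions = defaultdict(list)

    def partition_for(self, key):
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key, value):
        partition = self.partition_for(key)
        log = self.partitions[partition]
        log.append((key, value))
        if self.retention is not None and len(log) > self.retention:
            del log[0]  # drop the oldest record once the limit is exceeded
        return partition

    def read(self, partition):
        # Records come back in append order within one partition.
        return list(self.partitions[partition])
```

Because every record for a key lands in one partition, a consumer reading that partition sees the key's records in the order they were written; ordering across different partitions is not promised, which is what lets parallel clients read independently.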

Page 10: Architectures styles and deployment on the hadoop

Typical Data Pipeline

Databases: What DB Export to Choose

1. File Size
2. Network Bandwidth
3. Partitioning
4. Bulk Loading
5. CDC and Delta Data Transfers
6. Native connectors and specific connectors for Distribution

Adaptors and GoldenGate etc.
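As an illustration of item 5, here is a minimal sketch of a query-based delta transfer using a high-watermark column. It is an assumption-laden example, not how log-based CDC tools such as GoldenGate work: the `customers` table, the `updated_at` column and both function names are hypothetical, and SQLite stands in for the source RDBMS.

```python
import sqlite3


def extract_delta(conn, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only when something was actually extracted.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


def demo_source():
    """Build a hypothetical in-memory source table for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "alice", 100), (2, "bob", 105), (3, "carol", 110)],
    )
    return conn
```

The first call with a watermark of 0 behaves like a full load; later calls move only the rows whose `updated_at` is newer than the stored watermark. Note that this approach cannot see deletes, which is one reason log-based CDC is often preferred.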

Page 11: Architectures styles and deployment on the hadoop

Typical Data Pipeline

Data Storage – Distributed Files: Criteria

1. Average Latency
2. Typical Data Stored
3. Typical Item Size
4. Request Rate
5. Storage Cost per GB / timeframe
6. Durability
7. Availability
8. Native support for toolsets
9. Active community and open source

Enterprise Distributions Selection

Cloudera, Hortonworks, MapR

Page 12: Architectures styles and deployment on the hadoop

Typical Data Pipeline

Data Storage Selection Criteria

Data Structure: Fixed, Key Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish etc.
Data Temperature: Hot, Warm, Cold
TCO: Low

Elastic Cache

Page 13: Architectures styles and deployment on the hadoop

Typical Data Pipeline

Data Storage Selection Criteria

Cache, NoSQL, SQL, Search

1. Average Latency (ms, sec, min, hours)
2. Typical Volume Stored (GB, TB, PB)
3. Typical Item Size (B, KB, TB, PB)
4. Query Request Rate (High to Very Low)
5. Storage and Maintenance Cost (High – Low)
6. Durability (Low – Very High)
7. Availability (High – Very High)

Data Structure: Fixed, Key Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish etc.
Data Temperature: Hot, Warm, Cold
TCO: Low
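One way to read the selection axes above is as inputs to a simple decision rule. The sketch below is purely illustrative: the function name `suggest_store` and every rule in it are assumptions made for this example, not ratings taken from the slide, and a real choice would also weigh the numbered criteria (latency, volume, cost, durability, availability).

```python
def suggest_store(structure, access_pattern, temperature):
    """Map data structure, access pattern and temperature to a store class.

    The rules are hypothetical examples of how the three axes might combine.
    """
    structure, access_pattern, temperature = (
        s.strip().lower() for s in (structure, access_pattern, temperature)
    )
    if temperature == "cold":
        # Rarely read data: favour the cheapest storage per GB.
        return "distributed file storage"
    if access_pattern == "search":
        return "search"
    if temperature == "hot" and structure == "key value":
        # Small, frequently read items benefit from in-memory lookup.
        return "cache"
    if structure == "fixed" and access_pattern == "structured":
        return "sql"
    return "nosql"
```

For example, `suggest_store("Key Value", "Hierarchical", "Hot")` returns `"cache"` under these assumed rules, while the same data marked `"Cold"` falls through to distributed file storage.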

Page 14: Architectures styles and deployment on the hadoop

Typical Data Pipeline

BATCH, INTERACTIVE, STREAMING, MESSAGING

Machine Learning

Spark ML, EMR etc.

Criteria

1. Programming Language Support
2. Availability
3. Speed
4. Scale
5. Query Latency
6. Data Volume
7. Storage Support
8. SQL?

Temperature of Data

Page 15: Architectures styles and deployment on the hadoop

Typical Data Pipeline

Buy Vs Build ETL Decision?

Page 16: Architectures styles and deployment on the hadoop

Typical Data Pipeline

Create Analytical Application

Make Insights Available Via API

Analysis and Visualization

Zeppelin, Hue etc.

Publish to Queue

Page 17: Architectures styles and deployment on the hadoop

Data Modelling in Hadoop & Architectural Patterns

Page 18: Architectures styles and deployment on the hadoop

Not only ER and Dimension Models (NoERDM)

Data Storage Format

• Text
• Sequence
• Avro
• Parquet
• RC/ORC

Know the strengths and weaknesses of each format in terms of:
• Supporting Distributions
• Processing requirements – Write, partial read, full read
• Schema Evolution
• Extract Requirements
• Storage Requirements – How big are your files?
• How important is file splittability?
• Does block compression matter?
• Does the file format support indexing?
• How easy is it to parse?
• Does it support column stats?
• Failure behavior for various file formats
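The "partial read" consideration is easier to see with a toy comparison of row-oriented and column-oriented layouts (Text, Sequence and Avro store records row by row; Parquet and RC/ORC are columnar). The sketch below uses plain Python structures rather than any real file format, and the function names are hypothetical. It assumes every row has the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(rows, field):
    """Sum one field from a row layout, counting every value that is read."""
    touched = 0
    total = 0
    for row in rows:
        touched += len(row)  # a row layout reads the whole record
        total += row[field]
    return total, touched


def sum_from_columns(columns, field):
    """Sum one field from a column layout, reading only that column."""
    values = columns[field]
    return sum(values), len(values)
```

With four records of three fields each, summing one field touches twelve values in the row layout and four in the column layout. The gap widens with wider tables, which is why columnar formats suit analytical queries that read a few columns, while row formats suit full-record writes and reads.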

Page 19: Architectures styles and deployment on the hadoop

Not only ER and Dimension Models (NoERDM)

Compression

Codecs
• ZLIB
• LZO
• LZF
• Snappy
• Gzip
• Bzip

Considerations
• How much does the size reduce?
• How fast can it compress and decompress?
• How can I split my compressed files? File splittability makes use of parallelism.

Compression types
• Uncompressed
• Record compressed
• Block compressed

We trade I/O Loads for CPU Loads
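The first two considerations (size reduction and speed) can be measured directly. The sketch below compares only the codecs that ship with the Python standard library (`zlib`, `gzip`, `bz2`); Snappy, LZO and LZF need third-party packages and are left out. The function name `compare_codecs` is hypothetical, and timings on a small sample are only indicative.

```python
import bz2
import gzip
import time
import zlib


def compare_codecs(data):
    """Return compression ratio and compress time per codec for one payload."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check: compression must be lossless.
        if decompress(packed) != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {
            "ratio": len(packed) / len(data),  # smaller means better compression
            "seconds": elapsed,                # CPU cost paid to save I/O
        }
    return results
```

Running it on a representative sample of your own data shows the trade-off stated on the slide: a lower ratio means less I/O, usually bought with more CPU time.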

Page 20: Architectures styles and deployment on the hadoop

Other Practices

1. Structure and organize your repository
   a. Standard directory structure
   b. Access quota controls
   c. Stage area conventions

2. Location of HDFS files
   a. Directory structure should simplify the assignment of permissions to be granted.
   b. E.g. /user, /etl, /tmp, /data, /app, /metadata

3. Partitioning, bucketing and denormalization
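Items 2 and 3 can be sketched as two small helpers: one that builds a Hive-style partition directory (`key=value` path segments) under `/data`, and one that assigns a record to a bucket by hashing its key. The function names, the `clicks` dataset and the year/month/day scheme are assumptions for illustration, and CRC32 is a stand-in for whatever hash function the processing engine actually uses for bucketing.

```python
import zlib
from datetime import date


def partition_path(base, dataset, day):
    """Build a date-partitioned directory path using key=value segments."""
    return (
        f"{base}/{dataset}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    )


def bucket_for(key, num_buckets):
    """Assign a key to one of num_buckets buckets with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets
```

For example, `partition_path("/data", "clicks", date(2017, 1, 24))` gives `/data/clicks/year=2017/month=01/day=24`. A query filtered on one day then reads a single directory, and permissions can be granted per dataset directory rather than per file.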

Page 21: Architectures styles and deployment on the hadoop

Data Lake / Reservoir / Refinery

Exploratory Data Analysis

Application Level Analytics

Batch and Stream Analytics – Lambda Architecture

Enterprise Data Pipeline

Architectural Patterns
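Of the patterns listed, the Lambda Architecture is the one most easily shown in code: a batch layer periodically recomputes a view from the full master dataset, a speed layer keeps an incremental view of events that arrived since the last batch run, and queries merge the two. The sketch below is a minimal in-memory illustration; the page-view counting use case and all function names are hypothetical.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer: recompute page counts from the full master dataset."""
    return Counter(event["page"] for event in master_events)


def update_speed_view(speed_view, event):
    """Speed layer: incrementally count an event not yet seen by the batch run."""
    speed_view[event["page"]] += 1


def query(batch_view, speed_view, page):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)
```

After each batch run the speed view is reset, since its events are now covered by the batch view. The cost of the pattern is maintaining the same logic in two code paths.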

Page 22: Architectures styles and deployment on the hadoop

Thank You!

Questions?