Big Data, Big Problem?
Post on 21-Jan-2018
Scalable Big Data Architecture
Big Data Big Problem?
PRESENTATION BY :
MOHAMMAD HASAN FARAZMAND
OCTOBER 2016
M.H.FARAZMAND@GMAIL.COM
We Will Review…
Identifying Big Data Symptoms
Size Matters
Typical Business Use Case
Understanding the Big Data Project’s Ecosystem
Hadoop Distribution
Data Acquisition
Processing Language
Machine Learning
NoSQL Stores
Foundation of long-term Big Data Architecture
Architecture Overview
Log Ingestion Application
Learning Application
Processing Engine
Search Engine
This presentation has been prepared based on the first chapter of
Scalable Big Data Architecture by
Bahaaldine Azarmi
Identifying Big Data Symptoms
Data management is more complex than it has been before!
Big Data is everywhere, on everyone’s mind
When should I think about employing Big Data?
Am I ready?
What should I start with?
Different needs:
The volume of data you handle
Variety of data structure
Scalability issue
Reduce the cost of data processing
Size Matters
Two main areas: Size + Volume
Handle new data structures with flexible, schemaless technology
Big Data is also about extracting added-value information
Near-real-time processing with a distributed architecture
Execute complex queries with a NoSQL store
Value
Typical Business Use Case
Analyzing application logs, web access logs, server logs, DB logs, and social network data
Customer Behavior Analytics: used on e-commerce websites
Sentiment Analysis: the image and reputation of companies as perceived across social networks
CRM Onboarding: combine online data sources with offline data sources for better and more accurate customer segmentation (profile-customized offers)
Prediction: learning from data, the main Big Data trend of the past two years
For example, in the telecommunications industry:
1) Issue or event prediction based on router log
2) Product catalog selection
3) Pricing depending on user’s global behavior
Understanding the Big Data Project’s Ecosystem
Choosing …
Hadoop distribution
Distributed file system
SQL-Like processing language
Machine learning language
Scheduler
Message-oriented middleware
NoSQL data store
Data visualization
Hadoop Distribution
Two choices:
Download the projects you need separately
Use one of the most popular Hadoop distributions
Cloudera CDH
1. Impala: real-time, parallelized, SQL-based engine that searches for
data in HDFS and HBase.
2. Cloudera Manager: Cloudera’s console to manage and
deploy Hadoop components.
3. Hue: console for user interaction with data and scripts.
Hortonworks HDP
Hadoop Distributed File System
HDFS
Key features:
Distribution
High Availability
Fault Tolerance
Tuning
Security
Load Balancing
High Throughput Access
Automatic replication across the cluster data nodes
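Automatic replication can be pictured with a toy model. This is only a sketch and not HDFS code: the block size, replication factor, node names, and round-robin placement are all assumptions chosen for illustration (real HDFS placement is decided by the NameNode and is rack-aware).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [nodes[(start + k) % len(nodes)] for k in range(replication)]
    return placement


def surviving_blocks(placement: dict[int, list[str]], failed_node: str) -> list[int]:
    """Blocks that still have at least one replica after one node fails."""
    return [b for b, holders in placement.items()
            if any(node != failed_node for node in holders)]


# Tiny file and tiny block size so the result is easy to inspect.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
readable = surviving_blocks(placement, failed_node="dn1")
```

Because every block lives on three nodes, losing any single node leaves all blocks readable, which is the fault-tolerance property listed above.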
Data Acquisition
Large log files, streamed data, ETL processing outcomes, online
unstructured data, offline structured data, etc.
Apache Flume
Reliable, highly available, simple, flexible, intuitive programming
model based on streaming data flows.
Composed of “Sources”, “Channels”, and “Sinks”
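The source, channel, and sink roles can be sketched in a few lines. This is plain Python, not Flume’s API or configuration format: the `Channel` class, the `source` and `sink` functions, and the sample log lines are all hypothetical names used only to show how the three parts hand events to each other.

```python
from collections import deque


class Channel:
    """Buffer that decouples the source from the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def source(lines, channel):
    """Source: turn raw lines into events and put them on the channel."""
    for line in lines:
        channel.put({"body": line.rstrip("\n")})


def sink(channel, store):
    """Sink: drain the channel into the destination store."""
    while (event := channel.take()) is not None:
        store.append(event["body"])


channel = Channel()
destination = []  # stands in for HDFS or another target
source(["GET /index.html 200\n", "GET /missing 404\n"], channel)
sink(channel, destination)
```

The channel is what lets the source keep accepting data while the sink is slow or temporarily unavailable.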
Apache Sqoop
Transfers bulk data between structured data stores and HDFS.
Imports data from an external relational database into HDFS, HBase, or Hive.
Exports data from a Hadoop cluster to a relational database or data
warehouse.
Processing Language
MapReduce was the main processing framework in the first generation of Hadoop clusters.
It groups sibling data together (Map) and then aggregates the data according to a specified aggregation operation (Reduce).
Now that YARN (Yet Another Resource Negotiator) has been implemented, processing on Hadoop is no longer tied to MapReduce alone.
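The Map and Reduce steps can be illustrated with the usual word-count example. This is a single-process sketch in plain Python, not Hadoop code; the function names and the explicit `shuffle` step are assumptions made to keep the phases visible.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group, here with a sum."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data big problem", "big value"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the framework performs the grouping between them.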
Batch Processing with Hive
Hive brings users the simplicity and power of querying data
from HDFS in a SQL-like way.
Hive is not a near-real-time or real-time processing language. It suits
long-running batch jobs with a low priority.
The main drawback of using another language rather than native
MapReduce is performance.
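To show what querying “in a SQL-like way” buys you, here is a small sketch that uses Python’s built-in `sqlite3` as a stand-in. It is not Hive and does not read from HDFS; the `access_log` table and its rows are invented for the example. The point is only that one declarative query replaces hand-written map and reduce functions.

```python
import sqlite3

# In-memory database standing in for a table that Hive would map onto files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (url TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("/", 200), ("/cart", 200), ("/missing", 404)],
)

# One aggregate query: count hits per HTTP status code.
rows = conn.execute(
    "SELECT status, COUNT(*) AS hits FROM access_log "
    "GROUP BY status ORDER BY status"
).fetchall()
conn.close()
```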
Stream Processing with Spark Streaming
Extension of Spark.
Leverages Spark’s distributed data processing framework and treats
streaming computation as a series of micro-batch computations.
Spark Streaming lets you write a processing job as you would do for
batch processing in Java, Scala, or Python.
Foundation of a strong fault-tolerant and high-performance system.
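Spark Streaming processes a stream as a sequence of small batches, which is why a streaming job can look like a batch job. The sketch below imitates that idea in plain Python without Spark; the `micro_batches` helper, the two-second interval, and the sample events are assumptions for illustration.

```python
from collections import Counter


def micro_batches(events, interval):
    """Group time-ordered (timestamp, value) events into windows of `interval` seconds.

    A window with no events yields an empty batch.
    """
    batch, window_end = [], None
    for ts, value in events:
        if window_end is None:
            # Align the first window to a multiple of the interval.
            window_end = ts - ts % interval + interval
        while ts >= window_end:
            yield batch
            batch, window_end = [], window_end + interval
        batch.append(value)
    if batch:
        yield batch


# (timestamp in seconds, page viewed)
events = [(0, "a"), (1, "b"), (2, "a"), (5, "c")]

# The same batch-style logic (a count) is applied to every micro-batch.
batch_counts = [dict(Counter(batch)) for batch in micro_batches(events, interval=2)]
```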
Message-Oriented Middleware with Apache Kafka
A persistent, high-throughput messaging system.
Kafka serves as a pivot point in our architecture, mainly to receive data
and push it into Spark Streaming.
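The “pivot point” role rests on one idea: messages are kept in an append-only log, and each consumer reads from its own position. The following is a minimal in-memory sketch of that idea, not the Kafka client API; `TopicLog`, `publish`, `poll`, and the consumer names are hypothetical.

```python
class TopicLog:
    """Append-only message log with an independent read offset per consumer."""

    def __init__(self):
        self._messages = []
        self._offsets = {}

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread messages for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for click in ["click:1", "click:2", "click:3"]:
    log.publish(click)

first = log.poll("spark-streaming", max_messages=2)   # reads the first two
archived = log.poll("archiver")                        # a second consumer sees everything
second = log.poll("spark-streaming")                   # resumes where it left off
```

Because messages stay in the log after being read, several downstream applications can consume the same data at their own pace.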
Machine Learning
Spark MLlib enables machine learning for Spark.
Composed of various algorithms that go from basic statistics, logistic
regression, k-means clustering, and Gaussian mixtures to singular
value decomposition and multinomial naive Bayes.
Train on your data and build prediction models with a few lines of code
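As an illustration of one algorithm from that list, here is k-means clustering on one-dimensional data in plain Python. It does not use Spark MLlib; the `kmeans_1d` and `predict` functions, the starting centroids, and the basket values are assumptions picked to keep the example small.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def predict(point, centroids):
    """Return the index of the cluster whose centroid is closest to the point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))


# Hypothetical basket values from an e-commerce site: small and large spenders.
basket_values = [10, 12, 11, 95, 100, 105]
centroids = kmeans_1d(basket_values, centroids=[0.0, 50.0])
segment = predict(80, centroids)
```

A library such as MLlib wraps the same train-then-predict pattern and distributes the work across the cluster.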
NoSQL Stores
Fundamental pieces of the data architecture.
Scalability and Resiliency, and thus High Availability.
Ingest a very large amount of data.
Couchbase
Document-oriented NoSQL database that is easily scalable,
provides a flexible model, and delivers consistently high performance.
ElasticSearch
Scalable distributed indexing engine with search features.
Based on Apache Lucene and enables real-time data analytics
and full-text search in your architecture.
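Full-text search in Lucene-based engines relies on an inverted index, a map from each term to the documents that contain it. The sketch below builds one in plain Python; it is not the ElasticSearch API, and the whitespace tokenizer, the AND-only `search` function, and the sample documents are simplifying assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map every lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Big data architecture",
    2: "Scalable data store",
    3: "Scalable big data architecture",
}
index = build_index(docs)
hits = search(index, "scalable data")
```

Looking terms up in the index avoids scanning every document at query time, which is what makes search fast on large data sets.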
ELK platform
ElasticSearch is part of the ELK platform.
ElasticSearch + Logstash + Kibana
Provide the best end-to-end platform for collecting, storing, and
visualizing data.
Logstash lets you collect data from many kinds of sources
ElasticSearch indexes the data in a distributed, scalable, and
resilient system.
Kibana is a customizable user interface in which you can build a
simple to complex dashboard to explore and visualize data indexed
by ElasticSearch.
Foundation of a Long-Term Big Data Architecture
Log Ingestion Application
Consume application logs such as web access logs.
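The first step in consuming a web access log is turning each line into a structured record. Here is a minimal sketch, assuming the logs follow the Common Log Format; the regular expression, the `parse_line` helper, and the sample line are illustrative and would need adjusting for other formats.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for a log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '127.0.0.1 - - [10/Oct/2016:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_line(sample)
```

Records in this shape can then be published to the messaging layer or indexed for search.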
Learning Application
Receives a stream of data and builds predictions to optimize our
recommendation engine.
Processing Engine
Heart of the architecture
Search Engine
The search engine leverages the data processed by the processing
engine and exposes a dedicated RESTful API that will be used for
analytic purposes.
Summary
We have seen all the components that make up our architecture
Good Luck