Big Data/Hadoop Option Analysis
TRANSCRIPT
Zafar Ali
BIG DATA Option Analysis
02/05/2023
IDB Solutions LTD
BACKGROUND
“The idea of data creating business value is not new; however, the effective use of data is becoming the basis of competition.”
Enterprises help clients derive insights from information in order to make better, smarter, real-time, fact-based decisions: it is this demand for depth of knowledge that has fuelled the growth of big data tools and platforms.
What is BIG DATA? With the advent of smart devices, social media and new technologies, the amount of data produced by these devices and technologies is astronomical. Big data comprises conventional/structured data (EDW, RDBMS) as well as other, unstructured sources such as sensors, social media (Twitter, Facebook, LinkedIn) and logs, analysed to reveal patterns, trends, KPIs, dashboards etc.
BIG DATA FOUR V’S
• Big data comprises conventional and unconventional sources and is typically characterised by the 4 Vs
• Volume: the amount of data being created is vast compared to traditional data sources like RDBMS/EDW
• Variety: data comes from different sources and is created by machines, sensors, logs, humans etc
• Velocity: data is generated extremely fast; it is typically processed in real time but can also be ingested in batch form
• Veracity: big data is sourced from many different places, so you need to test the veracity/quality of the data
BIG DATA VENDORS
Big data technologies differ from traditional data sources and require different toolsets and technologies to manage and process structured, semi-structured and unstructured data. Below are a few players in the big data world.
TYPICAL BIG DATA PROCESSING
To harness the power of big data, enterprises require an infrastructure that can manage and process huge volumes of structured and unstructured data, both in real time and in batch, keeping data protection, privacy and security at its heart. Typical big data processing will look like the below.
NEXT GENERATION ARCHITECTURE
Enterprises’ next generation releases will run traditional EDW/RDBMS and big data solutions hand in hand, as neither alone can fulfil all demands and needs.
Traditional EDW
- Store business critical data
- Integrate existing data sources
- Integration with existing reporting/MI solutions
Big Data
• Leverage new data sources e.g. P6 project docs, social media discussion about projects
• Parallel processing of unstructured data e.g. assets’ sensor data, geolocation etc
NEXT GENERATION ARCHITECTURE INTEGRATION
Hadoop is an open source framework based on the MapReduce algorithm, where data is processed in parallel on different CPU nodes. Hadoop offers excellent integration with existing AH applications (AIM, PIM), ETL (Talend) and reporting tools (TIBCO Spotfire, TIBCO Jaspersoft).
Existing infrastructure:
1- Reporting: existing MI/reporting and EDW tools are easy to integrate with big data
2- ETL/ELT: Apache Hadoop, HDP 2.0 and Cloudera offer integration with Talend and with existing PL/SQL, UNIX cron jobs etc
3- Applications: P6, ERP and SAP APIs can be easily integrated with Hadoop’s infrastructure
Reference: http://hortonworks.com/wp-content/uploads/2013/10/Build-A-Modern-Data-Architecture.pdf
NEXT GENERATION ARCHITECTURE - HADOOP
Hadoop runs applications using the open source MapReduce algorithm, where data is processed in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
The Hadoop framework includes the following four modules:
Hadoop Common: Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data; a low cost, flexible data reservoir. Hive, on the other hand, is used for SQL access to structured and semi-structured data.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Key Hadoop distributions are Cloudera CDH, Greenplum, MapR, Hortonworks HDP 1.0+ etc.
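The map/reduce model described above can be illustrated with a minimal, framework-free Python sketch of the classic word-count pattern (a simulation of the three phases, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big insight", "data at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data over the network; the logical flow, however, is exactly this.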
NEXT GENERATION ARCHITECTURE – HADOOP EVOLUTION
• Hadoop was originally created using Google MapReduce, BigTable and Google File System (GFS) concepts
• Over time the Hadoop ecosystem has evolved to add functionality such as Hive (query), Pig (scripting), workflow and scheduling (Oozie), a non-relational DB (HBase), data/log ingestion (Flume, Sqoop), and management and monitoring (Ambari, ZooKeeper)
• HCatalog enhances interoperability between HDFS, Hive and Pig
NEXT GENERATION ARCHITECTURE – HDP/CLOUDERA/OTHER VENDORS
HDP 2.0+:
Hortonworks Data Platform (HDP 2.0) integrates Apache Hadoop into a modern data architecture. This enables enterprises to capture, store and process vast quantities of data in a cost efficient and scalable manner. HDP 2.0 offers excellent gateways and APIs to integrate with existing applications and the EDW.
Cloudera/CDH:
Cloudera is another open source big data platform distribution based on Apache Hadoop. CDH offers all key components out of the box. CDH also offers Hue, which provides developers a web based utility to execute jobs and check progress.
Other big data vendors are listed at the following link: http://www.bigdatavendors.com/top.php
Basic HDP 2.0 Architecture
Cloudera Basic Architecture
NEXT GENERATION ARCHITECTURE – KAFKA
Kafka is a streaming platform with three key capabilities:
• It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
• It lets you store streams of records in a fault-tolerant way.
• It lets you process streams of records as they occur.
What use in construction/P6? Various types of hardware could use Kafka for processing real time data:
• Live stream of asset geolocation
• Application tracking
• Real-time processing of application error logs
• Building real-time streaming applications that transform or react to the streams of data
More information on Kafka is available at the following:
https://kafka.apache.org/intro.html
http://hortonworks.com/apache/kafka/#section_1
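The publish/subscribe model above can be sketched in plain Python: an in-memory stand-in for a Kafka topic, purely illustrative (a real deployment would use a client library against a running broker):

```python
from collections import defaultdict

class Topic:
    """Toy, in-memory stand-in for a Kafka topic: an append-only log
    that multiple consumers read independently via their own offsets."""
    def __init__(self):
        self.log = []                    # in real Kafka, fault-tolerant storage
        self.offsets = defaultdict(int)  # per-consumer read position

    def publish(self, record):
        self.log.append(record)

    def poll(self, consumer):
        """Return the records this consumer has not yet seen."""
        start = self.offsets[consumer]
        self.offsets[consumer] = len(self.log)
        return self.log[start:]

# Hypothetical construction use case: live asset geolocation stream
geo = Topic()
geo.publish({"asset": "switchgear-7", "lat": 51.51, "lon": -0.13})
geo.publish({"asset": "cable-12", "lat": 51.53, "lon": -0.09})

# Two independent consumers each see the full stream
print(geo.poll("dashboard"))  # both records
print(geo.poll("alerting"))   # both records, independently of the dashboard
print(geo.poll("dashboard"))  # [] - nothing new since last poll
```

The key property mirrored here is that consuming a record does not remove it: each subscriber tracks its own position in the stored stream.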
NEXT GENERATION ARCHITECTURE – R/PYTHON/SAS
R, SAS and Python are programming languages and software environments for statistical computing and graphics; R is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R is typically applied at the raw source data, EDW or query store layer.
Any product currently feeding data into an app for data science and statistical analysis (linear and non-linear modelling, classical statistical tests, time series etc) can be easily integrated with HDP or Cloudera. HDP 2.0+ and Cloudera both offer their own version of R for statistical analysis, although similar capability is available in the Hadoop core system in the form of MapReduce. Other options that could be explored under this hood are Pig, Spark, Python etc.
NEXT GENERATION ARCHITECTURE – FLUME
Apache Flume is the standard way to transport log files from source through to target:
• The initial use case was web server log files, but it can transport any file from A to B
• It does not do data transformation, but can send to multiple targets/target types
• Mechanisms and checks ensure successful transport of entries; Flume has a concept of “agents”, “sinks” and “channels”
• Agents collect and forward log data
• Sinks store it in the final destination
• Channels store log data en route
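The agent/channel/sink pipeline described in the bullets can be sketched in a few lines of Python (a toy model of the roles, not the Flume API; the log lines and the list standing in for HDFS are invented):

```python
class Channel:
    """Buffers events between agent and sink (stores log data en route)."""
    def __init__(self):
        self.buffer = []

    def put(self, event):
        self.buffer.append(event)

    def take_all(self):
        events, self.buffer = self.buffer, []
        return events

class Sink:
    """Stores events in their final destination (a list standing in for HDFS)."""
    def __init__(self):
        self.store = []

    def drain(self, channel):
        self.store.extend(channel.take_all())

class Agent:
    """Collects log lines from a source and forwards them into a channel."""
    def __init__(self, channel):
        self.channel = channel

    def collect(self, line):
        self.channel.put(line)

channel = Channel()
sink = Sink()
agent = Agent(channel)

for line in ["GET /p6/projects 200", "POST /assets 500"]:
    agent.collect(line)
sink.drain(channel)
print(sink.store)  # both log lines delivered, channel now empty
```

Note that, as the bullets say, nothing is transformed on the way: the agent moves entries, the channel buffers them, and the sink persists them.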
More information on Flume is available at the following:
https://flume.apache.org
http://hortonworks.com/apache/flume/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/understanding_flume.html
http://www.cloudera.com/products/apache-hadoop/apache-flume.html
Kafka and Flume in action
NEXT GENERATION ARCHITECTURE - SOURCE
Data sources for big data can be categorised into three main forms:
• Structured data: relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.
Unstructured data: such data normally lands in HDFS (Hive)
• Sensor data collected from hardware
• Geolocation data from hardware
• Server logs
• Documents related to projects e.g. TP500, Gates files, RIIO code classification, EES etc
• Social media discussion about projects e.g. LPT (London Power Tunnels) has a high presence on Twitter, BBC, Facebook, YouTube etc
• Physical location of assets e.g. switchgear, cables etc
• Survey data about projects
Structured/semi-structured data: such data is normally loaded into the traditional EDW, either through existing ETL or using big data tooling e.g. CSV, API, P6, ERP, SAP etc
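To make the semi-structured category concrete, XML and JSON records can both be flattened into tabular rows before loading; this uses only the Python standard library, and the asset records are hypothetical:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical semi-structured asset records from two different feeds
xml_record = "<asset><id>SW-7</id><type>switchgear</type></asset>"
json_record = '{"id": "CB-12", "type": "cable"}'

# Parse the XML element tree into a flat dict of tag -> text
root = ET.fromstring(xml_record)
asset_from_xml = {child.tag: child.text for child in root}

# JSON parses directly into a dict
asset_from_json = json.loads(json_record)

rows = [asset_from_xml, asset_from_json]
print(rows[0]["type"])  # switchgear
print(rows[1]["id"])    # CB-12
```

The point is that semi-structured data carries its own schema in-band, so the same parsing step works whether the target is an EDW staging table or HDFS.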
NEXT GENERATION ARCHITECTURE - ETL
Talend/ODI/Informatica provide an excellent framework for running Hadoop ETL jobs with the major Hadoop distributions and existing infrastructure:
• ETL/ELT pushes data/transformation down to Hadoop (Cloudera, Hortonworks)
• Hive, Sqoop and Flume provide native drivers to push data into Hadoop/HDFS or HBase
• Data loading is typically in “raw form”:
  • Files, events
  • Semi-structured formats like JSON, XML
• High volume and high velocity are the reasons for using big data instead of an RDBMS
• Data quality / error handling
• Metadata driven
• Loading of data into big data platforms could be:
  • Real time processing
  • Batch processing
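A toy extract-transform-load pass, in plain Python, shows the shape of the batch path above (the CSV content, field names and dict "warehouse" are made up for illustration; a real job would run in Talend or on the cluster):

```python
import csv
import io

# Extract: hypothetical raw CSV export (e.g. from P6/ERP), kept in "raw form"
raw = io.StringIO("project,hours\nLPT,120\nTP500,\nLPT,30\n")

def extract(source):
    return list(csv.DictReader(source))

def transform(rows):
    """Basic data-quality step: drop rows with missing hours, cast types."""
    return [
        {"project": r["project"], "hours": int(r["hours"])}
        for r in rows
        if r["hours"]
    ]

def load(rows, target):
    """Load: aggregate into the target, here a dict keyed by project."""
    for r in rows:
        target[r["project"]] = target.get(r["project"], 0) + r["hours"]
    return target

warehouse = load(transform(extract(raw)), {})
print(warehouse)  # {'LPT': 150} - the TP500 row was rejected for missing hours
```

In an ELT variant the same transform would instead be pushed down to run inside Hadoop, with the raw rows landed first.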
NEXT GENERATION ARCHITECTURE - SPARK
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Spark runs on Hadoop, Mesos, standalone, or in the cloud: you can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access diverse data sources including HDFS, Cassandra, HBase, Hive, Tachyon, S3 and any Hadoop data source.
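Spark's core programming model, lazily chained transformations that only execute when an action consumes them, can be mimicked in miniature with Python generators (no cluster involved; a real job would use the pyspark API over HDFS or S3 data):

```python
# Lazy "transformations": nothing runs until an "action" consumes the chain
lines = ["asset sensor reading", "sensor error", "asset online"]

words = (w for line in lines for w in line.split())   # like flatMap
sensor_words = (w for w in words if w == "sensor")    # like filter
ones = (1 for _ in sensor_words)                      # like map

# "Action": summing pulls data through the whole pipeline in one pass
count = sum(ones)
print(count)  # 2
```

The analogy is loose (Spark additionally partitions the data across nodes and can cache intermediate results in memory), but the lazy pipeline is the part that distinguishes it from plain MapReduce.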
Spark and Hadoop are both big data frameworks, but there are stark differences between them; refer to the links below to understand what each framework provides.
Reference:
http://spark.apache.org
http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
NEXT GENERATION ARCHITECTURE – NoSQL
NoSQL refers to non-relational, or at least non-SQL, database solutions such as HBase (also part of the Hadoop ecosystem), Cassandra, MongoDB, Riak and CouchDB.
There are, after all, in excess of 100 NoSQL databases, as the DB-Engines database popularity ranking shows. The three most popular NoSQL options for Hadoop are Cassandra, MongoDB and HBase.
NoSQL databases are gaining popularity. AH could incorporate BI/analytics/reporting using NoSQL, which would mean end users/clients do not have to write SQL to get the desired dataset. An in-depth CTO review is required before making a final decision on NoSQL, though it offers some stark advantages over an RDBMS for big data analytics. My personal suggestion would be the coexistence of both NoSQL and RDBMS in the big data landscape.
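To illustrate the document model that several of these stores share, here is a minimal query over schemaless records in plain Python (a stand-in for something like a MongoDB find; the asset fields are invented):

```python
# Schemaless "documents": records need not share the same fields
assets = [
    {"id": "SW-7", "type": "switchgear", "site": "LPT"},
    {"id": "CB-12", "type": "cable", "length_m": 340},
    {"id": "SW-9", "type": "switchgear", "site": "TP500"},
]

def find(collection, **criteria):
    """Match documents whose fields equal the given criteria - no SQL involved."""
    return [
        doc for doc in collection
        if all(doc.get(k) == v for k, v in criteria.items())
    ]

print([d["id"] for d in find(assets, type="switchgear")])  # ['SW-7', 'SW-9']
print(find(assets, site="LPT")[0]["id"])                   # SW-7
```

This is the sense in which NoSQL spares the end user from writing SQL: queries are expressed against document fields directly, and records with extra or missing fields (like the cable's `length_m`) coexist in one collection.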
Big Data Distributor Option Analysis - Summary Assessment

| Option | Cost (indicative estimate) | Deployment | Strategic Fit | Windows Compatibility | Ease of Use | Licenses | Overall |
|---|---|---|---|---|---|---|---|
| Cloudera | No clear cost available online | Cloudera offers cloud, on-premise and sandbox VM options | Cloudera does not support the needs of an EDW in the longer run and sees Hadoop as an enterprise data hub; this contradicts the AH requirement to integrate existing infrastructure | Cloudera can be deployed on Windows OS | Cloudera has proprietary management software (Cloudera Manager), the SQL query interface Impala, and Cloudera Search for easy, real-time access | Cloudera has a commercial licence; it also allows use of its open source projects free of cost, but that package does not include the Cloudera Manager management suite or any other proprietary software | |
| Hortonworks | No clear cost available online | HDP only offers cloud based services | Hortonworks sees the EDW as an integral part of the Hadoop ecosystem and has a strong tie with Teradata | HDP is available as a native component on Windows Server | Hortonworks is open source, but the chances of installation errors through the command prompt are high compared to Cloudera | Hortonworks has no proprietary software; it uses Ambari for management, Stinger for handling queries, and Apache Solr for data searches | |