Big Data/Hadoop Option Analysis
TRANSCRIPT
Zafar Ali
BIG DATA Option Analysis
02/05/2023
IDB Solutions LTD
BACKGROUND
“The idea of data creating business value is not new; however, the effective use of data is becoming the basis of competition.”
Enterprises help clients derive insights from information in order to make better, smarter, real-time, fact-based decisions: it is this demand for depth of knowledge that has fuelled the growth of big data tools and platforms.
What is BIG DATA? With the advent of smart devices, social media and new technologies, the amount of data produced by these devices and technologies is astronomical. Big data comprises conventional/structured data (EDW, RDBMS) as well as other, unstructured sources such as sensors, social media (Twitter, Facebook, LinkedIn) and logs, analysed to reveal patterns, trends, KPIs, dashboards etc.
BIG DATA FOUR V’S
• Big data comprises conventional and unconventional sources and is typically characterised by the 4 Vs
• Volume: the amount of data being created is vast compared to traditional data sources like RDBMS/EDW
• Variety: data comes from different sources and is created by machines, sensors, logs, humans etc
• Velocity: data is generated extremely fast; it is typically processed in real time but can also be ingested in batch form
• Veracity: big data is sourced from many different places, so you need to test the veracity/quality of the data
BIG DATA VENDORS
Big data technologies differ from traditional data sources and require different toolsets and technologies to manage and process structured, semi-structured and unstructured data. Below are a few players in the big data world.
TYPICAL BIG DATA PROCESSING
To harness the power of big data, enterprises require an infrastructure that can manage and process huge volumes of structured and unstructured data, both in real time and in batch, keeping data protection, privacy and security at its heart. Typical big data processing will look like the below.
NEXT GENERATION ARCHITECTURE
Enterprises’ next generation releases will run traditional EDW/RDBMS and big data solutions hand in hand, as neither alone can fulfil all demands and needs.
Traditional EDW
- Store business critical data
- Integrate existing data sources
- Integration with existing reporting/MI solutions
Big Data
• Leverage new data sources e.g. P6 project docs, social media discussion about projects
• Parallel processing of unstructured data e.g. assets’ sensor data, geolocation etc
NEXT GENERATION ARCHITECTURE INTEGRATION
Hadoop is an open source framework based on the MapReduce algorithm, where data is processed in parallel on different CPU nodes. Hadoop offers excellent integration with existing AH applications (AIM, PIM), ETL (Talend) and reporting tools (TIBCO Spotfire, TIBCO Jaspersoft).
Existing infrastructure:
1- Reporting: existing MI/reporting and EDW tools are easy to integrate with big data
2- ETL/ELT: Apache Hadoop, HDP 2.0 and Cloudera offer integration with Talend and with existing PL/SQL, UNIX cron jobs etc
3- Applications: P6, ERP and SAP APIs can be easily integrated with Hadoop’s infrastructure
Reference: http://hortonworks.com/wp-content/uploads/2013/10/Build-A-Modern-Data-Architecture.pdf
NEXT GENERATION ARCHITECTURE - HADOOP
Hadoop runs applications using the open source MapReduce algorithm, where data is processed in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
The Hadoop framework includes the following four modules:
Hadoop Common: Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data; a low cost, flexible data reservoir. Hive, on the other hand, is used for SQL access to structured and semi-structured data.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Key Hadoop distributions are Cloudera CDH, Greenplum, MapR, Hortonworks HDP 1.0+ etc.
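The map/reduce model described above can be illustrated with a minimal, framework-free Python sketch of the classic word-count pattern (a simulation of the three phases, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big insight", "data at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data over the network; the logical flow, however, is exactly this.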
NEXT GENERATION ARCHITECTURE – HADOOP EVOLUTION
• Hadoop was originally created using Google MapReduce, BigTable and Google File System (GFS) concepts
• Over time the Hadoop ecosystem has evolved to add functionality such as Hive (query), Pig (scripting), workflow and scheduling (Oozie), a non-relational DB (HBase), data/log ingestion (Flume, Sqoop), and management and monitoring (Ambari, ZooKeeper)
• HCatalog enhances interoperability between HDFS, Hive and Pig
NEXT GENERATION ARCHITECTURE – HDP/CLOUDERA/OTHER VENDORS
HDP 2.0+:
Hortonworks Data Platform (HDP 2.0) integrates Apache Hadoop into a modern data architecture. This enables enterprises to capture, store and process vast quantities of data in a cost efficient and scalable manner. HDP 2.0 offers excellent gateways and APIs to integrate with existing applications and the EDW.
Cloudera/CDH:
Cloudera is another open source big data platform distribution based on Apache Hadoop. CDH offers all key components out of the box. CDH also offers Hue, which provides developers a web based utility to execute jobs and check progress.
Other big data vendors are listed at the following link: http://www.bigdatavendors.com/top.php
Basic HDP 2.0 Architecture
Cloudera Basic Architecture
NEXT GENERATION ARCHITECTURE – KAFKA
Kafka is a streaming platform with three key capabilities:
• It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
• It lets you store streams of records in a fault-tolerant way.
• It lets you process streams of records as they occur.
What use in construction/P6? Various types of hardware could use Kafka for processing real time data:
• Live stream of asset geolocation
• Application tracking
• Real-time processing of application error logs
• Building real-time streaming applications that transform or react to the streams of data
More information on Kafka is available at the following:
https://kafka.apache.org/intro.html
http://hortonworks.com/apache/kafka/#section_1
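The publish/subscribe model above can be sketched in plain Python: an in-memory stand-in for a Kafka topic, purely illustrative (a real deployment would use a client library against a running broker):

```python
from collections import defaultdict

class Topic:
    """Toy, in-memory stand-in for a Kafka topic: an append-only log
    that multiple consumers read independently via their own offsets."""
    def __init__(self):
        self.log = []                    # in real Kafka, fault-tolerant storage
        self.offsets = defaultdict(int)  # per-consumer read position

    def publish(self, record):
        self.log.append(record)

    def poll(self, consumer):
        """Return the records this consumer has not yet seen."""
        start = self.offsets[consumer]
        self.offsets[consumer] = len(self.log)
        return self.log[start:]

# Hypothetical construction use case: live asset geolocation stream
geo = Topic()
geo.publish({"asset": "switchgear-7", "lat": 51.51, "lon": -0.13})
geo.publish({"asset": "cable-12", "lat": 51.53, "lon": -0.09})

# Two independent consumers each see the full stream
print(geo.poll("dashboard"))  # both records
print(geo.poll("alerting"))   # both records, independently of the dashboard
print(geo.poll("dashboard"))  # [] - nothing new since last poll
```

The key property mirrored here is that consuming a record does not remove it: each subscriber tracks its own position in the stored stream.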
NEXT GENERATION ARCHITECTURE – R/PYTHON/SAS
R, SAS and Python are programming languages and software environments for statistical computing and graphics; R is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R is typically applied at the raw source data, EDW or query store layer.
Any product currently feeding data into an app for data science and statistical analysis (linear and non-linear modelling, classical statistical tests, time series etc) can be easily integrated with HDP or Cloudera. HDP 2.0+ and Cloudera both offer their own version of R for statistical analysis, although similar capability is available in the Hadoop core system in the form of MapReduce. Other options that could be explored under this hood are Pig, Spark, Python etc.
NEXT GENERATION ARCHITECTURE – FLUME
Apache Flume is the standard way to transport log files from source through to target:
• The initial use case was web server log files, but it can transport any file from A to B
• It does not do data transformation, but can send to multiple targets/target types
• Mechanisms and checks ensure successful transport of entries; Flume has a concept of “agents”, “sinks” and “channels”
• Agents collect and forward log data
• Sinks store it in the final destination
• Channels store log data en route
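The agent/channel/sink pipeline described in the bullets can be sketched in a few lines of Python (a toy model of the roles, not the Flume API; the log lines and the list standing in for HDFS are invented):

```python
class Channel:
    """Buffers events between agent and sink (stores log data en route)."""
    def __init__(self):
        self.buffer = []

    def put(self, event):
        self.buffer.append(event)

    def take_all(self):
        events, self.buffer = self.buffer, []
        return events

class Sink:
    """Stores events in their final destination (a list standing in for HDFS)."""
    def __init__(self):
        self.store = []

    def drain(self, channel):
        self.store.extend(channel.take_all())

class Agent:
    """Collects log lines from a source and forwards them into a channel."""
    def __init__(self, channel):
        self.channel = channel

    def collect(self, line):
        self.channel.put(line)

channel = Channel()
sink = Sink()
agent = Agent(channel)

for line in ["GET /p6/projects 200", "POST /assets 500"]:
    agent.collect(line)
sink.drain(channel)
print(sink.store)  # both log lines delivered, channel now empty
```

Note that, as the bullets say, nothing is transformed on the way: the agent moves entries, the channel buffers them, and the sink persists them.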
More information on Flume is available at the following:
https://flume.apache.org
http://hortonworks.com/apache/flume/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/understanding_flume.html
http://www.cloudera.com/products/apache-hadoop/apache-flume.html
Kafka and Flume in action
NEXT GENERATION ARCHITECTURE - SOURCE
Data sources for big data can be categorised into three main forms:
• Structured data: relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.
Unstructured data: such data normally lands in HDFS (Hive)
• Sensor data collected from hardware
• Geolocation data from hardware
• Server logs
• Documents related to projects e.g. TP500, Gates files, RIIO code classification, EES etc
• Social media discussion about projects e.g. LPT (London Power Tunnels) has a high presence on Twitter, BBC, Facebook, YouTube etc
• Physical location of assets e.g. switchgear, cables etc
• Survey data about projects
Structured/semi-structured data: such data is normally loaded into the traditional EDW, either through existing ETL or using big data tooling e.g. CSV, API, P6, ERP, SAP etc
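To make the semi-structured category concrete, XML and JSON records can both be flattened into tabular rows before loading; this uses only the Python standard library, and the asset records are hypothetical:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical semi-structured asset records from two different feeds
xml_record = "<asset><id>SW-7</id><type>switchgear</type></asset>"
json_record = '{"id": "CB-12", "type": "cable"}'

# Parse the XML element tree into a flat dict of tag -> text
root = ET.fromstring(xml_record)
asset_from_xml = {child.tag: child.text for child in root}

# JSON parses directly into a dict
asset_from_json = json.loads(json_record)

rows = [asset_from_xml, asset_from_json]
print(rows[0]["type"])  # switchgear
print(rows[1]["id"])    # CB-12
```

The point is that semi-structured data carries its own schema in-band, so the same parsing step works whether the target is an EDW staging table or HDFS.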
NEXT GENERATION ARCHITECTURE - ETL
Talend/ODI/Informatica provide an excellent framework for running Hadoop ETL jobs with the major Hadoop distributions and existing infrastructure:
• ETL/ELT pushes data/transformation down to Hadoop (Cloudera, Hortonworks)
• Hive, Sqoop and Flume provide native drivers to push data into Hadoop/HDFS or HBase
• Data loading is typically in “raw form”:
  • Files, events
  • Semi-structured formats like JSON, XML
• High volume and high velocity are the reasons for using big data instead of an RDBMS
• Data quality / error handling
• Metadata driven
• Loading of data into big data platforms could be:
  • Real time processing
  • Batch processing
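A toy extract-transform-load pass, in plain Python, shows the shape of the batch path above (the CSV content, field names and dict "warehouse" are made up for illustration; a real job would run in Talend or on the cluster):

```python
import csv
import io

# Extract: hypothetical raw CSV export (e.g. from P6/ERP), kept in "raw form"
raw = io.StringIO("project,hours\nLPT,120\nTP500,\nLPT,30\n")

def extract(source):
    return list(csv.DictReader(source))

def transform(rows):
    """Basic data-quality step: drop rows with missing hours, cast types."""
    return [
        {"project": r["project"], "hours": int(r["hours"])}
        for r in rows
        if r["hours"]
    ]

def load(rows, target):
    """Load: aggregate into the target, here a dict keyed by project."""
    for r in rows:
        target[r["project"]] = target.get(r["project"], 0) + r["hours"]
    return target

warehouse = load(transform(extract(raw)), {})
print(warehouse)  # {'LPT': 150} - the TP500 row was rejected for missing hours
```

In an ELT variant the same transform would instead be pushed down to run inside Hadoop, with the raw rows landed first.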
NEXT GENERATION ARCHITECTURE - SPARK
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Spark runs on Hadoop, Mesos, standalone, or in the cloud: you can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access diverse data sources including HDFS, Cassandra, HBase, Hive, Tachyon, S3 and any Hadoop data source.
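Spark's core programming model, lazily chained transformations that only execute when an action consumes them, can be mimicked in miniature with Python generators (no cluster involved; a real job would use the pyspark API over HDFS or S3 data):

```python
# Lazy "transformations": nothing runs until an "action" consumes the chain
lines = ["asset sensor reading", "sensor error", "asset online"]

words = (w for line in lines for w in line.split())   # like flatMap
sensor_words = (w for w in words if w == "sensor")    # like filter
ones = (1 for _ in sensor_words)                      # like map

# "Action": summing pulls data through the whole pipeline in one pass
count = sum(ones)
print(count)  # 2
```

The analogy is loose (Spark additionally partitions the data across nodes and can cache intermediate results in memory), but the lazy pipeline is the part that distinguishes it from plain MapReduce.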
Spark and Hadoop are both big data frameworks, but there are stark differences between them; refer to the links below to understand what each framework provides.
Reference:
http://spark.apache.org
http://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
NEXT GENERATION ARCHITECTURE – NoSQL
NoSQL refers to non-relational, or at least non-SQL, database solutions such as HBase (also part of the Hadoop ecosystem), Cassandra, MongoDB, Riak and CouchDB.
There are, after all, in excess of 100 NoSQL databases, as the DB-Engines database popularity ranking shows. The three most popular NoSQL options for Hadoop are Cassandra, MongoDB and HBase.
NoSQL databases are gaining popularity. AH could incorporate BI/analytics/reporting using NoSQL, which would mean end users/clients do not have to write SQL to get the desired dataset. An in-depth CTO review is required before making a final decision on NoSQL, though it offers some stark advantages over an RDBMS for big data analytics. My personal suggestion would be the coexistence of both NoSQL and RDBMS in the big data landscape.
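To illustrate the document model that several of these stores share, here is a minimal query over schemaless records in plain Python (a stand-in for something like a MongoDB find; the asset fields are invented):

```python
# Schemaless "documents": records need not share the same fields
assets = [
    {"id": "SW-7", "type": "switchgear", "site": "LPT"},
    {"id": "CB-12", "type": "cable", "length_m": 340},
    {"id": "SW-9", "type": "switchgear", "site": "TP500"},
]

def find(collection, **criteria):
    """Match documents whose fields equal the given criteria - no SQL involved."""
    return [
        doc for doc in collection
        if all(doc.get(k) == v for k, v in criteria.items())
    ]

print([d["id"] for d in find(assets, type="switchgear")])  # ['SW-7', 'SW-9']
print(find(assets, site="LPT")[0]["id"])                   # SW-7
```

This is the sense in which NoSQL spares the end user from writing SQL: queries are expressed against document fields directly, and records with extra or missing fields (like the cable's `length_m`) coexist in one collection.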
Big Data Distributor Option Analysis - Summary Assessment

| Option | Cost (indicative estimate) | Deployment | Strategic Fit | Windows Compatibility | Ease of Use | Licenses | Overall |
|---|---|---|---|---|---|---|---|
| Cloudera | No clear cost available online | Cloudera offers cloud, on-premise and sandbox VM options | Cloudera does not support the needs of an EDW in the longer run and sees Hadoop as an enterprise data hub; this contradicts the AH requirement to integrate existing infrastructure | Cloudera can be deployed on Windows OS | Cloudera has proprietary management software (Cloudera Manager), the SQL query interface Impala, and Cloudera Search for easy, real-time access | Cloudera has a commercial licence; it also allows use of its open source projects free of cost, but that package does not include the Cloudera Manager management suite or any other proprietary software | |
| Hortonworks | No clear cost available online | HDP only offers cloud based services | Hortonworks sees the EDW as an integral part of the Hadoop ecosystem and has a strong tie with Teradata | HDP is available as a native component on Windows Server | Hortonworks is open source, but the chances of installation errors through the command prompt are high compared to Cloudera | Hortonworks has no proprietary software; it uses Ambari for management, Stinger for handling queries, and Apache Solr for data searches | |