VIII Session - SQL Bahia 03/03/2018 Big Data with Hadoop: Impala, Hive, and Spark Diógenes Pires


VIII Session - SQL Bahia 03/03/2018

Big Data with Hadoop: Impala, Hive, and Spark

Diógenes Pires




Internet Live

http://www.internetlivestats.com/


Introduction


Big Data is not Bitcoin


Sources for Big Data

• Data warehouse
• RDBMS
• Web server log files
• Social media content
• Business reports
• Customer emails to the company
• Macroeconomic indicators
• Satisfaction surveys
• IoT
• CRM
• …


Definitions

Business intelligence (BI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

Business analytics is comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states. Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user.

Gartner


Other Concepts

• Cognitive Computing

• Data Discovery

• Data Lake

• Data Science

• Machine Learning

• Self-service BI

• Fast Data


Landscape


Google File System (GFS or GoogleFS)

Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. A new version of Google File System code named Colossus was released in 2010.

Wikipedia

• 2003 GFS

• 2004 MapReduce

• 2006 Bigtable


Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Apache Hadoop.


Apache Hadoop

The project includes these modules:

• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.


Other Hadoop Projects


Hadoop Architecture


Processing

https://entendendoti.blogspot.com.br/2011/05/tipos-de-processamento.html

Types of Processing

• Batch processing: information is collected or received, stored, and then processed.
• Online processing: information is processed at the same time as it is recorded, so results stay up to date.
• Real-time processing: information is processed the moment it is recorded, and the result immediately triggers subsequent processing. Examples: autopilot, GPS.
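
The difference between the batch and online styles listed above can be sketched in a few lines of Python. This is only an illustration: `process`, `batch_processing`, and `online_processing` are hypothetical names that do not come from the talk, and doubling a number stands in for real work.

```python
from typing import Callable, Iterable, List


def process(record: int) -> int:
    """Hypothetical per-record work: here it simply doubles the value."""
    return record * 2


def batch_processing(records: Iterable[int]) -> List[int]:
    """Collect and store every record first, then process the stored set in one pass."""
    stored = list(records)               # collect / store
    return [process(r) for r in stored]  # process afterwards, all at once


def online_processing(records: Iterable[int], on_result: Callable[[int], None]) -> None:
    """Process each record as soon as it is recorded and hand the result on immediately."""
    for record in records:
        on_result(process(record))


incoming = [1, 2, 3]

batch_results = batch_processing(incoming)

online_results: List[int] = []
online_processing(incoming, online_results.append)
```

Both calls produce the same values; what differs is when each result becomes available. In a real-time system the `on_result` callback would be the place where the follow-up action is triggered.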


Batch Processing


Example


MapReduce

A programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

IBM.
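
To make the MapReduce paradigm concrete, here is a minimal single-process sketch of the classic word count in plain Python. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions chosen for illustration. On a cluster, the map and reduce calls would run in parallel on many servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key so each key reaches one reducer."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data with hadoop", "hadoop and spark", "big data"])
```

Because each `map_phase` call sees only one line and each `reduce_phase` call sees only one key, the work can be split across machines without the pieces needing to coordinate, which is where the scalability comes from.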


MapReduce


Apache Hive

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

hive.apache.org
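
The statement in Hive's description that structure can be projected onto data already in storage can be illustrated with a small plain-Python sketch. This is not Hive or HiveQL: `RAW_LINES`, `SCHEMA`, and `project` are hypothetical names, and the tab-delimited records are made-up sample data. The point is only that the stored text stays untouched while a schema is applied when it is read.

```python
from typing import Callable, Dict, List, Tuple

# Raw data "already in storage": untyped, tab-delimited text lines (made-up sample).
RAW_LINES = ["1\talice\t34.5", "2\tbob\t12.0"]

# A schema is a list of (column name, type converter) pairs, defined separately from the data.
Schema = List[Tuple[str, Callable[[str], object]]]
SCHEMA: Schema = [("id", int), ("name", str), ("amount", float)]


def project(line: str, schema: Schema, sep: str = "\t") -> Dict[str, object]:
    """Apply the schema to one raw line at read time, returning a typed row."""
    fields = line.split(sep)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = [project(line, SCHEMA) for line in RAW_LINES]

# Once rows are typed, a query-like aggregate becomes possible.
total = sum(row["amount"] for row in rows)
```

The same raw lines could be read with a different schema later without rewriting the stored files.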


Hive Architecture


Online Processing


Example


Cloudera Impala

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Cloudera Impala query UI in Hue) as Apache Hive.

Cloudera.


Impala Architecture


Real Time Processing


Example


Apache Spark

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

spark.apache.org


Apache Spark
