UNDERSTANDING BIG DATA: THINK BIG - MEHMET BURAK AKGÜN

Upload: burak-akguen

Post on 08-Jul-2015

DESCRIPTION

Presentation by MEHMET BURAK AKGÜN for the Enterprise Information Integration lecture. Topic: Understanding Big Data

TRANSCRIPT

Page 1: Big Data

UNDERSTANDING BIG DATA: THINK BIG

MEHMET BURAK AKGÜN

Page 2: Big Data

First of All, What Is Big Data?

• Big Data is collected in unstructured form from different sources such as social media posts, network logs, blogs, photos, videos, log files, etc.

• Big Data is information that could not be analyzed before because of its enormous size and/or diversity.

Page 3: Big Data

1) Why Big Data? How did we get to this point?
2) What are the components of Big Data?
3) How will it be integrated with existing systems?
4) Are there any Big Data platforms / applications?

Page 4: Big Data

• Almost all data scientists believe that companies gain very important advantages from the ability to analyze information collected from all sources.

• For example, according to analysts, retailers that analyze their data can improve their operating profit margin by 60 percent. Similar rates also apply to the public sector.

• According to research, US healthcare can save 300 million dollars a year by using data science.

• According to estimates, by 2020 about 50 billion devices, from household appliances to cars and phones, will be producing data and silently communicating with each other.

• For companies that want to make forward-looking decisions, it is critical that the predictions drawn from this «Big Data» are correct.

Page 5: Big Data

● Energy companies, using smart grids and meters, store and process usage data relating to individual subscribers.

● Banks have become 24/7 branches thanks to the information they collect and store about their customers: the Internet branch recognizes the user, knows on which days the user logs in, arranges the main page menu accordingly for the most efficient use, and offers reminders, customizable interfaces, and a rich, fast, and convenient branch experience.

● Hospitals store data in their databases in order to provide effective medical services.

● All of this information is stored as "Big Data".

Page 6: Big Data

DATA = PRODUCT

Page 7: Big Data

Rather than using conventional methods, Google created technologies for its own requirements and kept improving them successfully. Google keeps billions of web pages on the Google File System, uses Bigtable as its database, and uses MapReduce to process Big Data. See http://www.google.com/about/datacenters

Google and Amazon publish academic articles about their work. Developers inspired by these articles, such as Doug Cutting, created similar technologies as open source. The best-known examples are Apache projects such as Lucene, Solr, Hadoop, and HBase. Each of these projects can handle Big Data successfully.

Second-generation companies such as Facebook, Twitter, and LinkedIn go a step further by publishing the Big Data projects they develop as open source. Cassandra, Hive, Pig, Voldemort, Storm, and IndexTank are examples.

Page 8: Big Data

So how will Big Data be stored? How will it be processed? How will it be analyzed?

Page 9: Big Data

4 «components» for being Big

• Volume: 12+ terabytes of Tweets created daily.
• Velocity: 5+ million trade events per second.
• Variety: 100's of different types of data.
• Veracity: only 1 in 3 decision makers trust their information.

Page 10: Big Data
Page 11: Big Data

• Expected to reach 3 billion users. As of August 2015, 12 TB/day of «log data» is produced. Expected to reach 500+ million users by the end of 2014. 160 million users are online.

• 100 million active users. 12+ TB of tweet data per day!

Social Networks and Social Work

Page 12: Big Data

Google processes 24 petabytes of data every day

4.6 billion mobile phones exist

2 billion Internet users; annual traffic in 2014 equals 667 exabytes

Social Networks and Social Work

Page 13: Big Data

Users are not just people! ..

Page 14: Big Data

Each engine produces 10 TB of data every 30 minutes.
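To get a feel for that figure, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope conversion; it assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify.

```python
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 GB = 10**9 bytes.
TB = 10**12
GB = 10**9

bytes_per_window = 10 * TB   # 10 TB, as stated on the slide
window_seconds = 30 * 60     # 30 minutes, as stated on the slide

# Sustained rate one engine would have to deliver, in GB per second.
rate_gb_per_s = bytes_per_window / window_seconds / GB

# Two 30-minute windows fit into one hour.
tb_per_hour = bytes_per_window * 2 / TB

print(f"{rate_gb_per_s:.2f} GB/s, {tb_per_hour:.0f} TB/hour")
```

Under these assumptions a single engine sustains several gigabytes per second, which is why the next slides talk about machine-generated data outgrowing social media data.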

Page 15: Big Data

“Data generated by machines and sensors will exceed that generated by social media by at least a factor of 10.” *

Leon Katsnelson, Program Director, Big Data & Cloud Computing, IBM

Page 16: Big Data

By the way, we don't have enough space to store all this data!

Page 17: Big Data

"We appear to be a game company. In fact, we are a data analysis company." Ken Rudin, Zynga VP of Analytics

• Offers its games completely free.
• Earns revenue by selling virtual goods.
• Has 232 million monthly active users on average.
• 95% of players have never visited the shop!
• By using Big Data analysis, they disrupted the game world.

Page 18: Big Data

Four Entry Points of Big Data

• Unlock Big Data
• Simplify Your Warehouse
• Preprocess Raw Data
• Analyse Streaming Data

IBM Big Data Platform

• Systems Management, Application Development, Visualization & Discovery
• Accelerators
• Hadoop System, Stream Computing, Data Warehouse
• Information Integration & Governance

Analytic Applications

• BI / Reporting
• Exploration / Visualization
• Functional App
• Industry App
• Predictive Analytics
• Content Analytics

Page 19: Big Data

Applications for Big Data Analytics

• Homeland Security
• Finance
• Smarter Healthcare
• Multi-channel Sales
• Telecom
• Manufacturing
• Traffic Control
• Trading Analytics
• Fraud and Risk
• Log Analysis
• Search Quality
• Retail: Churn, NBO

Page 20: Big Data

"Companies that give importance to social media are winning"

The "Big Data: Great Opportunity" themed event, organized by Teradata in Turkey, brought together senior company executives in Istanbul. Hermann Wimmer, President of Teradata EMEA, said that analyzing the data obtained from social media in a short time helps companies that want to get ahead of their competitors and increase their profitability to make quick decisions.

Page 21: Big Data

A technological solution is preferred when it provides advantages

Page 22: Big Data
Page 23: Big Data

Big Data and Open Source Are Intertwined

• Contributions made to the open source community over the years
- Apache Hadoop and Jaql, Apache Derby, Apache Geronimo, Apache Jakarta
- Eclipse: founded by IBM
- Lucene: IBM Lucene Extension Library (ILEL)
- DRDA, XQuery, SQL, XML4J, XERCES, HTTP, Java, Linux...

• IBM software built on open source
– WebSphere: Apache
– Rational: Eclipse and Apache
– InfoSphere: Eclipse and Apache

• IBM's BigInsights (Hadoop) is 100% open source software.

Page 24: Big Data

Hadoop

• Open Source
• Distributed Computing
• Very Simple MapReduce
• Connected computers

Page 25: Big Data

Example Hadoop Installations

• eBay – 532 nodes, 532 x 8 cores = 4,256 cores
• Facebook – 1,100 nodes, 8,800 cores, 12 PB storage
• LinkedIn – 1,200 + 580 + 120 nodes
• Quantcast – 3,000 nodes, 3.5 PB, 1 PB+ of data daily

Page 26: Big Data

Hadoop

• Components:
– Data Storage: HDFS, a distributed disk file system
– Data Processing: MapReduce

Page 27: Big Data

HDFS

• Modeled on the Google File System
• Distributed disk file system
• File clustering (files are usually gigabytes in size or larger)
• Fault tolerant
• Replication
• HDFS is aware of the physical location of the data.

Page 28: Big Data

HDFS

• Accessible through the Hadoop shell, the Java API, or the web UI
• Four application nodes:
– NameNode – manages the metadata of the file system
– JobTracker – MapReduce
– TaskTracker – MapReduce
– DataNode – stores the data, in coordination with the NameNode
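The split between a NameNode that holds only metadata and DataNodes that hold the replicated blocks can be sketched in plain Python. This is a toy, in-memory illustration and not the Hadoop API: the class `ToyNameNode`, the 4-byte block size, and the round-robin placement are all assumptions made for the example (real HDFS uses far larger blocks and location-aware placement).

```python
from itertools import cycle


class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it tracks where blocks live."""

    def __init__(self, datanodes, block_size=4, replication=3):
        if replication > len(datanodes):
            raise ValueError("replication cannot exceed the number of DataNodes")
        # Each toy DataNode is just a dict: block_id -> bytes.
        self.datanodes = {name: {} for name in datanodes}
        self.block_size = block_size
        self.replication = replication
        # Metadata only: filename -> list of (block_id, [DataNode names]).
        self.metadata = {}
        self._ring = cycle(datanodes)

    def put(self, filename, data):
        """Split data into blocks and copy each block to several DataNodes."""
        blocks = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{filename}#{offset // self.block_size}"
            chunk = data[offset:offset + self.block_size]
            # Consecutive picks from the ring are distinct nodes.
            targets = [next(self._ring) for _ in range(self.replication)]
            for node in targets:
                self.datanodes[node][block_id] = chunk
            blocks.append((block_id, targets))
        self.metadata[filename] = blocks

    def get(self, filename, failed=()):
        """Reassemble a file, skipping DataNodes listed as failed."""
        out = b""
        for block_id, nodes in self.metadata[filename]:
            alive = [n for n in nodes if n not in failed]
            if not alive:
                raise IOError(f"all replicas of {block_id} are lost")
            out += self.datanodes[alive[0]][block_id]
        return out


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
namenode.put("log.txt", b"hello big data")
print(namenode.get("log.txt", failed={"dn1", "dn2"}))
```

With three replicas per block, the file can still be read after any two of the four toy DataNodes fail, which is the fault-tolerance idea behind replication.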

Page 29: Big Data

BIG DATA is not just HADOOP

• Manage & store huge volume of any data: Hadoop File System, MapReduce
• Manage streaming data: Stream Computing
• Analyze unstructured data: Text Analytics Engine
• Structure and control data: Data Warehousing
• Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
• Understand and navigate federated big data sources: Federated Discovery and Navigation

Page 30: Big Data

MapReduce

• Big Data processing model from Google
• Distributed computation model
• Processes data as key-value pairs
• Easy programming framework – to use it, you implement the map() and reduce() functions

Page 31: Big Data

Map: First Step

• Prepares the elements for the next processing step by matching each one with a key
– Data cleaning
– Simple calculations
– Splitting strings
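A map function of this kind can be sketched in plain Python. This is not Hadoop code: the function name `map_line` and the word-count task are assumptions chosen for illustration, and a real job would implement the framework's mapper interface instead.

```python
import re


def map_line(offset, line):
    """Map step: one (offset, line) input record -> list of (word, 1) pairs."""
    # Data cleaning: lowercase and replace punctuation with spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", line.lower())
    # Splitting strings: break the line into words.
    words = cleaned.split()
    # Simple calculation: attach a count of 1 to every word (the key).
    return [(word, 1) for word in words]


print(map_line(0, "Big data, BIG deal!"))
```

Each output pair carries the key that the framework later uses to bring all values for the same word together.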

Page 32: Big Data

Reduce: Last Step

• Takes the list of values for the same key, through iterators
– Filtering
– Combining
– Sampling

The data is thus reduced. The result is written to HDFS or HBase.
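A reduce function can be sketched the same way in plain Python. The name `reduce_counts` and the `min_total` threshold are hypothetical; the sketch simply returns its result, whereas a real job would write it to HDFS or HBase as the slide says.

```python
def reduce_counts(key, values, min_total=2):
    """Reduce step: (key, iterator of values) -> list of (key, total) pairs."""
    total = 0
    for value in values:   # the values for one key arrive through an iterator
        total += value     # combining: sum the partial counts
    if total < min_total:  # filtering: drop keys that are too rare
        return []
    return [(key, total)]


print(reduce_counts("big", iter([1, 1, 1])))
print(reduce_counts("deal", iter([1])))
```

The first call keeps the key with its combined count; the second returns an empty list because the key falls below the assumed threshold.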

Page 33: Big Data

Architecture: MapReduce

HDFS CLUSTER
• Input: records as (K1, V1) pairs
• Map (M): each mapper turns (K1, V1) records into (K2, V2) pairs
• GROUPING / SORTING: pairs are collected per key as (K2, Iterable<V2>)
• Reduce (R): each reducer turns (K2, Iterable<V2>) into (K3, V3) pairs
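The flow in this diagram, from (K1, V1) through (K2, V2) and grouping to (K3, V3), can be simulated on one machine in plain Python. This is a sketch of the data flow only: `run_mapreduce`, `tokenize`, and `total` are hypothetical names, and nothing here is distributed or part of Hadoop.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to (K1, V1) records, group by K2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):      # the M boxes in the diagram
            groups[k2].append(v2)
    results = []
    for k2 in sorted(groups):              # grouping and sorting by key
        # The R box: each reducer sees one key and an iterator over its values.
        results.extend(reducer(k2, iter(groups[k2])))
    return results


def tokenize(line_no, line):
    """Example mapper: (line number, text) -> (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def total(word, counts):
    """Example reducer: (word, iterator of counts) -> (word, sum)."""
    return [(word, sum(counts))]


records = [(0, "big data"), (1, "Big Hadoop")]
print(run_mapreduce(records, tokenize, total))
```

In a real cluster the mappers and reducers run on different machines and the grouping step moves data across the network; the single loop here only mirrors the order of the steps.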

Page 34: Big Data

Apache Pig

• Developed at Yahoo!

• Designed for easy analysis of large data sets

• Easier than writing MapReduce functions in Java
– Coded in Pig Latin
– Extensible

• Similar to other languages

• 10 lines of Pig code may be equivalent to hundreds of lines of Java

Page 35: Big Data

Running Pig!

• Grunt shell
• Java interface
• Eclipse and IntelliJ IDEA plugins

tweets = load '/today/tweets' as (user, mention, tweet);
twitters = group tweets by mention;
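What the two Pig statements above compute can be sketched in plain Python: load rows with three fields, then collect the full rows under each distinct `mention`. The function `group_by_mention` and the sample rows are made up for illustration; they stand in for the contents of `/today/tweets`, which the slide does not show.

```python
from collections import defaultdict


def group_by_mention(tweets):
    """Mimic `group tweets by mention`: mention -> list of full rows."""
    grouped = defaultdict(list)
    for user, mention, tweet in tweets:
        grouped[mention].append((user, mention, tweet))
    return dict(grouped)


# Made-up sample rows in the (user, mention, tweet) layout from the slide.
tweets = [
    ("alice", "bob", "@bob nice talk"),
    ("carol", "bob", "@bob thanks"),
    ("bob", "alice", "@alice see you"),
]
twitters = group_by_mention(tweets)
for mention, rows in twitters.items():
    print(mention, len(rows))
```

As in Pig, each group keeps the complete original rows, so later steps can count them or inspect any field.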

Page 36: Big Data

Apache Hive

• Data warehouse project for Hadoop
• Data summarization
• Ad hoc queries
• SQL-like language: HiveQL
– Allows custom mappers and reducers to be defined

Special note*: The Hive compiler translates HiveQL queries into MapReduce operations.

Page 37: Big Data

HBase

• We do not always need a relational database
• We need scalability
• Tables can be very large
• We want very fast access

A distributed, sorted, persistent key-value map
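The "sorted map" part of that description can be illustrated with a small Python class. This is a single-process, in-memory toy and not the HBase client API: the class `SortedMap`, its method names, and the `user#id` row-key format are assumptions for the example.

```python
import bisect


class SortedMap:
    """Toy key-value map that keeps its row keys in sorted order."""

    def __init__(self):
        self._keys = []    # row keys, always sorted
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while keeping order
        self._values[key] = value

    def get(self, key):
        """Random read of a single row; returns None if the key is absent."""
        return self._values.get(key)

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedMap()
table.put("user2#001", "b")
table.put("user1#002", "a2")
table.put("user1#001", "a1")
print(table.scan("user1#", "user1$"))
```

Because keys are kept sorted, all rows sharing a prefix sit next to each other, so a range scan over one user's rows touches only that slice instead of the whole table.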

Page 38: Big Data

HBase

• A clone of Google's Bigtable
• Works on HDFS
* fault tolerance
* scalability
* MapReduce input/output
• HBase = HDFS + random read/write

Page 39: Big Data

HBase: Where to Use

• Social media
• Recommender systems
• Search engines
• Intelligence and monitoring services
• Financial systems (fraud)

Page 40: Big Data

Apache Mahout

• Lets us go beyond simple analysis

• Classification: email spam, call center
• Clustering: finding new news
• Recommender systems

Page 41: Big Data

Apache Mahout

Ready-made algorithms:

• Classification:
– Logistic Regression
– Bayesian
– Random Forests

• Applications:
– Call forwarding
– Face recognition
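To show what a Bayesian classifier of the kind listed above does, here is a minimal naive Bayes sketch in plain Python, applied to the email spam case mentioned earlier. It is not Mahout code: the functions `train` and `classify`, the add-one smoothing choice, and the four training messages are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (label, text). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in samples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Pick the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training messages, for illustration only.
samples = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting at noon"),
    ("ham", "project status meeting"),
]
model = train(samples)
print(classify(model, "free money"))
```

The value of a library like Mahout is running this kind of counting and scoring over data sets far too large for one machine; the logic per word is the same.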

Page 42: Big Data

Apache Mahout

Ready-made algorithms:

• Recommendation:
– Distributed item-based
– Matrix Factorization
– Non-distributed item/user-based

• Applications:
– eCommerce: Amazon
– Movie recommendation: Netflix
– Music recommendation: LastFM
– Venue recommendation: Foursquare
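The idea behind item-based recommendation can be sketched with simple co-occurrence counts in plain Python. This is a non-distributed toy, not Mahout's implementation: the functions `cooccurrence` and `recommend`, the scoring by raw counts, and the sample data are all assumptions for the example.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(baskets):
    """Count, for each ordered pair of distinct items, the users who have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(baskets, user, top_n=2):
    """Suggest unseen items that co-occur most with the user's items."""
    counts = cooccurrence(baskets)
    seen = set(baskets[user])
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Made-up data: user -> items the user liked or bought.
baskets = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["B", "C", "D"],
    "u4": ["A"],
}
print(recommend(baskets, "u4"))
```

Here user `u4` only has item `A`, so the suggestions are the items that other users most often hold together with `A`.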

Page 43: Big Data

In summary: Some Big Data Applications

• Log Analytics (IT for IT)
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis
• IBM Watson

Page 44: Big Data

The IBM Big Data Platform

Page 45: Big Data

IBM Big Data Platform
IBM Big Data & Netezza Product Group

Hadoop
• InfoSphere BigInsights: Hadoop-based, low-latency analysis of diverse, high-volume data

Stream Computing
• InfoSphere Streams: low-latency analysis of streaming data

MPP Data Warehouse
• IBM Netezza High Capacity Appliance: queryable archive of structured data
• IBM Netezza 1000: BI + ad hoc analysis of structured data
• IBM Smart Analytics System: operational analytics on structured data
• IBM Informix TimeSeries: time-structured analytics
• IBM InfoSphere Warehouse: high-volume structured data analysis

Information Integration
• InfoSphere Information Server: high-volume data integration and transformation

Page 46: Big Data

Big Data Exploration: Value & Diagram

Data sources: File Systems, Relational Data, Content Management, Email, CRM, Supply Chain, ERP, RSS Feeds, Cloud, Custom Sources

Data Explorer

Applications / Users

Find, visualize & understand all big data to improve business knowledge

• Greater efficiencies in business processes

• New insights from combining and analyzing data types in new ways

• Develop new business models with resulting increased market presence and revenue

Page 47: Big Data

What does the IBM platform offer?

• Analysis of Diverse Data: analysis of data with mixed characteristics.
• Dynamic Data Analysis: high-volume streaming data, ad hoc analysis.
• Explore and Experiment: ad hoc analysis, discovery, and inspection of data.
• Manage and Plan: data rules, data integrity checking, and enforcement.
• Very High Volume Data Analysis: analysis of PB-scale data at an appropriate price/performance ratio.

Page 48: Big Data