Trend Micro
Cloud Computing Era
Three Major Trends to Change the World
Cloud Computing
Mobile
Big Data
What is Cloud Computing?
Provides scalable and elastic IT-related capabilities to users via as-a-service business models and Internet technologies
Essential Characteristics
Service Models
Deployment Models
National Institute of Standards and Technology (NIST) definition:
It’s About the Ecosystem
IaaS
PaaS
SaaS
Cloud Computing -> generates -> Big Data -> leads to -> Business Insights -> creates -> Competition, Innovation, Productivity
Structured, Semi-structured
Enterprise Data Warehouse
What is Big Data?
What is the problem?
• Getting the data to the processors becomes the bottleneck
• Quick calculation:
  – Typical disk data transfer rate: 75 MB/sec
  – Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
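That estimate can be checked with a few lines of arithmetic (a quick sketch using the slide's figures of 75 MB/sec and 100 GB, with 1 GB taken as 1,024 MB):

```python
# Time to stream 100 GB off a single disk at a typical 75 MB/sec
disk_rate_mb_per_sec = 75
data_mb = 100 * 1024            # 100 GB, with 1 GB = 1,024 MB

seconds = data_mb / disk_rate_mb_per_sec
minutes = seconds / 60
print(round(minutes, 1))        # ~22.8 minutes on a single disk
```

Hadoop's answer to this bottleneck is to spread the same data across many disks and read them in parallel.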
The Era of Big Data – Are You Ready?
Data for business commercial analysis
• 2011: multi-terabyte (TB)
• 2020: 35.2 ZB (1 ZB = 1 billion TB)
Who Needs It?
When to use Hadoop?
• Affordable Storage/Compute
• Unstructured or Semi-structured Data
• Resilient Auto Scalability
When to use an Enterprise Database?
• Ad-hoc Reporting (< 1 sec)
• Multi-step Transactions
• Lots of Inserts/Updates/Deletes
Hadoop!
• Apache Hadoop project
  – Inspired by Google's MapReduce and Google File System papers
• An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open Source Software + Commodity Hardware
  – IT cost reduction
© 2011 Cloudera, Inc. All Rights Reserved.
Hadoop Core
HDFS
MapReduce
HDFS
• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool
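The redundancy and scalability points above can be pictured with a toy model (plain Python, not the real HDFS API; the block size, node names, and round-robin policy are simplifications for the example):

```python
# Toy model of HDFS block placement: split a file into fixed-size
# blocks and replicate each block on REPLICATION distinct datanodes,
# so the loss of any single node never loses data.
BLOCK_SIZE = 64 * 1024 * 1024      # 64 MB, a classic HDFS default
REPLICATION = 3

def place_blocks(file_size, datanodes):
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # round-robin stand-in for HDFS's rack-aware placement policy
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

# A 200 MB file on a 4-node cluster -> 4 blocks, 3 replicas each
plan = place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(plan), plan[0])
```

Self-healing then amounts to noticing an under-replicated block and copying it to another datanode.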
MapReduce
• Two Phases of Functional Programming
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Java API
Word Count Example
• Map input – Key: byte offset, Value: line
• Map output – Key: word, Value: count
• Reduce output – Key: word, Value: sum of counts
Input:
0:  The cat sat on the mat
22: The aardvark sat on the sofa
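The two phases can be simulated in a few lines of plain Python (a conceptual sketch of the map/shuffle/reduce flow, not Hadoop itself), using the slide's two input lines:

```python
# Simulate MapReduce word count: the map phase emits (word, 1) per
# token, the shuffle groups pairs by key, and reduce sums each group.
from collections import defaultdict

lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]

grouped = defaultdict(list)
for line in lines:
    for word in line.split():            # map phase: emit (word, 1)
        grouped[word.lower()].append(1)  # shuffle: group values by key

counts = {word: sum(ones) for word, ones in grouped.items()}  # reduce phase
print(counts["the"], counts["sat"])      # 4 2
```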
The Hadoop Ecosystems
The Ecosystem is the System
• Hadoop has become the kernel of the distributed operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map
• MapReduce Runtime (Dist. Programming Framework)
• Hadoop Distributed File System (HDFS)
• ZooKeeper (Coordination)
• HBase (Column NoSQL DB)
• Sqoop/Flume (Data Integration)
• Oozie (Job Workflow & Scheduling)
• Pig/Hive (Analytical Language)
• Hue (Web Console)
• Mahout (Data Mining)
ZooKeeper – Coordination Framework
What is ZooKeeper?
• A centralized service for maintaining configuration information and providing distributed synchronization
• A set of tools to build distributed applications that can safely handle partial failures
• ZooKeeper was designed to store coordination data:
  – Status information
  – Configuration
  – Location information
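The kinds of coordination data listed above live in ZooKeeper's hierarchical namespace of "znodes". The toy stand-in below (not the real ZooKeeper client API; the paths and payload are invented for the example) shows the idea of small, path-addressed blobs:

```python
# Toy in-memory stand-in for ZooKeeper's znode tree. Real ZooKeeper adds
# replication, ordering guarantees, watches, and ephemeral nodes on top.
class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("parent znode must exist first")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

zk = ZNodeTree()
zk.create("/services")
zk.create("/services/log-receiver", b"host=10.0.0.7:9090 status=up")
print(zk.get("/services/log-receiver").decode())
```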
Flume / Sqoop – Data Integration Framework
What's the problem with data collection?
• Data collection is currently a priori and ad hoc
  – A priori: decide what you want to collect ahead of time
  – Ad hoc: each kind of data source goes through its own collection path
What is Flume (and how can it help?)
• A distributed data collection service
• Efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for collecting data of all formats
Flume Architecture
[Diagram: multiple log sources feed Flume Nodes, which aggregate events into HDFS]
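The pipeline shape can be sketched in a few lines (a toy illustration of the source-to-sink flow, not Flume's actual API; the log lines are made up):

```python
# Toy sketch of the Flume pipeline: each "node" drains a log source and
# forwards events to a sink (a list standing in for HDFS). A real Flume
# node would batch, buffer, and retry on failure for fault tolerance.
def flume_node(source_lines, sink):
    for line in source_lines:
        sink.append(line.strip())

hdfs_sink = []                       # stand-in for the HDFS sink
flume_node(["ERROR disk full\n", "INFO job done\n"], hdfs_sink)
flume_node(["WARN slow query\n"], hdfs_sink)
print(len(hdfs_sink))                # 3 events aggregated from 2 sources
```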
Flume Sources and Sinks
• Local Files
• HDFS
• Stdin, Stdout
• IRC
• IMAP
Sqoop
• Easy, parallel database import/export
• What can you do with it?
  – Import data from an RDBMS into HDFS
  – Export data from HDFS back into an RDBMS
Sqoop
RDBMS
Sqoop
HDFS
Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world \
    --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
Pig / Hive – Analytical Language
Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!
Hive – Developed by Facebook
• What is Hive?
– An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic-SQL : Select, From, Join, Group-By
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid
Hive
SQL -> Hive -> MapReduce
Pig
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• A simple way to write MapReduce programs
• Easy to understand, easy to debug
• Initiated by Yahoo!

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
Pig
Script -> Pig -> MapReduce
Hive vs. Pig
                    | Hive                                    | Pig
Language            | HiveQL (SQL-like)                       | Pig Latin, a scripting language
Schema              | Table definitions stored in a metastore | A schema is optionally defined at runtime
Programmatic Access | JDBC, ODBC                              | PigServer
WordCount Example
• Input:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits:
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
• The reduce just sums up the values:
  <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
WordCount Example In MapReduce

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;
WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
The Story So Far
[Diagram: Hive (SQL) and Pig (Script) compile to MapReduce (Java) jobs; Sqoop moves data between an RDBMS and HDFS; Flume streams logs into HDFS, which offers POSIX-like file access and a Java API]
HBase – Column NoSQL DB
Structured Data vs. Raw Data
HBase – Inspired by Google's BigTable
• Coordinated by ZooKeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
– PUT
– GET
– DELETE
– SCAN
HBase – Data Model
• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
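The data-model points above can be pictured with a toy table (plain Python, not the real HBase client API; the row keys, column family, and timestamps are made up for the example):

```python
# Toy sketch of HBase's model: rows kept sorted by row key, each cell
# keeping multiple timestamped versions; a Region is a contiguous
# [start-key, end-key) slice of that sorted row space.
import bisect

class ToyTable:
    def __init__(self):
        self.rows = {}   # row key -> {column: [(timestamp, value), ...]}

    def put(self, row, col, value, ts):
        self.rows.setdefault(row, {}).setdefault(col, []).append((ts, value))

    def get(self, row, col):
        return max(self.rows[row][col])[1]   # newest version wins

    def scan(self, start, end):
        keys = sorted(self.rows)             # rows sorted by row key
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return keys[lo:hi]

t = ToyTable()
t.put("row1", "mycf:col1", "val1", ts=1)
t.put("row1", "mycf:col1", "val9", ts=2)     # a newer version of the cell
t.put("row2", "mycf:col1", "val3", ts=1)
print(t.get("row1", "mycf:col1"))            # val9
print(t.scan("row1", "row2"))                # ['row1']
```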
HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'
Oozie – Job Workflow & Scheduling
What is Oozie?
• A Java web application
• Oozie is a workflow scheduler for Hadoop
• A "cron" for Hadoop
• Triggered by:
  – Time
  – Data availability
[Workflow diagram: Jobs 1–5 arranged as a dependency DAG]
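The "run each job once its upstream jobs finish" idea can be sketched as follows (illustrative only; Oozie workflows are actually defined as XML action graphs, and the dependencies below are hypothetical):

```python
# Toy sketch of workflow scheduling: run a DAG of jobs, launching each
# job only after every job it depends on has finished.
def run_workflow(deps):
    done, order = set(), []
    while len(done) < len(deps):
        for job, upstream in deps.items():
            if job not in done and all(u in done for u in upstream):
                order.append(job)   # "run" the job
                done.add(job)
    return order

# Hypothetical dependencies: job1 fans out, job4/job5 join downstream
order = run_workflow({"job1": [], "job2": ["job1"], "job3": ["job1"],
                      "job4": ["job2", "job3"], "job5": ["job3"]})
print(order)
```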
Oozie Features
• Component Independent
– MapReduce
– Hive
– Pig
– Sqoop
– Streaming
Mahout – Data Mining
What is Mahout?
• A machine-learning tool
• Distributed and scalable machine-learning algorithms on the Hadoop platform
• Makes building intelligent applications easier and faster
Mahout Use Cases
• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targeting
• Amazon: Personalization Platform
Hue – Developed by Cloudera
• Hadoop User Experience
• Apache-licensed open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI library
Hue
• HUE comes with a suite of applications
– File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files.
– Job Browser: View jobs, tasks, counters, logs, etc.
– Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beeswax UI
Use Case Example
• Predict what the user likes, based on
  – Their historical behavior
  – The aggregate behavior of people similar to them
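A minimal sketch of that idea is user-based collaborative filtering over sets of liked items (illustrative only; Mahout provides real recommender implementations that scale via MapReduce, and the users and items below are invented):

```python
# Score unseen items by how many of them are liked by users who overlap
# with the target user's own likes; recommend the highest-scoring item.
def recommend(target, likes):
    scores = {}
    for user, items in likes.items():
        if user == target:
            continue
        overlap = len(likes[target] & items)    # similarity to target
        for item in items - likes[target]:      # items target hasn't seen
            scores[item] = scores.get(item, 0) + overlap
    return max(scores, key=scores.get) if scores else None

# Hypothetical users and liked items
likes = {"alice": {"hadoop", "hive"},
         "bob":   {"hadoop", "hive", "pig"},
         "carol": {"flume"}}
print(recommend("alice", likes))   # pig
```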
Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop ecosystem
Recap – Hadoop Ecosystem
• MapReduce Runtime (Dist. Programming Framework)
• Hadoop Distributed File System (HDFS)
• ZooKeeper (Coordination)
• HBase (Column NoSQL DB)
• Sqoop/Flume (Data Integration)
• Oozie (Job Workflow & Scheduling)
• Pig/Hive (Analytical Language)
• Hue (Web Console)
• Mahout (Data Mining)
Trend Micro Smart Protection Network (SPN) Case Study
Collaboration in the underground
Network Threats Show Explosive Growth
New Unique Malware Discovered
1M unique malware samples every month
Threats on the network, such as virus variants, spam, and unknown download sources, escape detection by traditional security systems and continue to show explosive growth.
New Design Concept for Threat Intelligence
Web Crawler
Trend Micro Endpoint Protection
Trend Micro Mail Protection
Trend Micro Web Protection
Honeypot
CDN / xSP Human Intelligence
150M+ Worldwide Endpoints/Sensors
Challenges We Face
The concept is great, but 6 TB of data and 15B lines of logs are received daily.
It becomes the Big Data challenge!
Raw Data -> Information -> Threat Intelligence/Solution
• Volume: infinite
• Time: no delay
• Target: constantly changing threats
Issues to Address
SPN High Level Architecture
[Architecture diagram]
• Feedback inputs: SPN feedback, CDN logs, SPAM, HTTP POST feedback information
• Ingestion: L4 load balancing -> Log Receivers -> Message Bus -> Log Post Processing
• SPN infrastructure: Hadoop Distributed File System (HDFS), HBase, MapReduce, Adhoc-Query (Pig), Circus (Ambari)
• Applications: Email Reputation Service, Web Reputation Service, File Reputation Service
Trend Micro Big Data Processing Capacity
Daily amount of SPN data to be processed:
• 8.5 billion Web Reputation queries
• 3 billion Email Reputation queries
• 7 billion File Reputation queries
• 6 TB of worldwide raw logs processed
• 150 million end-point connections
Trend Micro: Web Reputation Services
[Filtering funnel]
• User Traffic / Honeypot: 8 billion URLs/day
• Akamai (CDN): 40% filtered -> 4.8 billion/day
• Rating Server for Known Threats: 82% filtered -> 860 million/day
• Unknown & Prefilter -> Page Download -> Threat Analysis: 99.98% filtered -> 25,000 malicious URLs/day
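The funnel percentages can be sanity-checked with quick arithmetic (the slide appears to round the 864M result down to 860M):

```python
# Reproduce the filtering funnel from the slide's percentages
daily_urls = 8e9
after_cdn = daily_urls * (1 - 0.40)       # past the Akamai CDN cache
after_rating = after_cdn * (1 - 0.82)     # past the known-threat rating server
print(round(after_cdn / 1e9, 1), round(after_rating / 1e6))  # 4.8 864
```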
Trend Micro Products / Technology
• CDN Cache, High-Throughput Web Service, Hadoop Cluster, Web Crawling, Machine Learning / Data Mining
• Technology + Process + Operation: block a malicious URL within 15 minutes of it going online!
utes
Big Data Cases
Google vs. Hadoop Ecosystem
Chubby vs. ZooKeeper
MapReduce vs. MapReduce
BigTable vs. HBase
GFS vs. HDFS
Ref: Google Cluster
Pioneer of Big Data Infrastructure – Google
HBase Use Cases @ Facebook
Messages
Facebook Insights
Self-service
Hashout
Operational Data Store
More Analytics/Hashout apps
Site Integrity
2010 2011 2012 2013
Social Graph Search Indexing Realtime Hive Updates Cross-system Tracing … and more
Flagship App: Facebook Messages
• Monthly data volume prior to launch:
  – 15B × 1,024 bytes = 14 TB
  – 120B × 100 bytes = 11 TB
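Those volume figures check out if 1 TB is taken as 2^40 bytes (a quick verification sketch):

```python
# Verify the slide's monthly-volume arithmetic (1 TB = 2**40 bytes)
TB = 2 ** 40
messages = 15e9 * 1024      # 15B items at ~1 KB each
chats = 120e9 * 100         # 120B items at ~100 bytes each
print(round(messages / TB), round(chats / TB))  # 14 11
```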
Facebook Messages Now
• Integrates Emails, Chats, SMS, and Messages
• Quick Stats
  – 11B+ messages/day
  – 90B+ data accesses; Peak: 1.5M ops/sec; ~55% Read, 45% Write
  – 20PB+ of total data, growing 400TB/month
Facebook Messages: Requirements
• Very high write volume
  – Previously, chat was not persisted to disk
• Ever-growing data sets (old data rarely gets accessed)
• Elasticity & automatic failover
• Strong consistency within a single data center
• Large scans / MapReduce support for migrations & schema conversions
• Bulk data import
Physical Multi-tenancy
• Real-time Ads Insights
  – Real-time analytics for social plugins on top of HBase
  – Publishers get real-time distribution/engagement metrics:
    • # of impressions, likes
    • Analytics by domain, URL, demographics, and time period
  – Uses HBase capabilities:
    • Efficient counters (single-RPC increments)
    • TTL for purging old data
  – Needs massive write throughput & low latencies:
    • Billions of URLs
    • Millions of counter increments per second
• Operational Data Store
Facebook Open Source Stack
• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs
Questions?
Thank you!