Hadoop & Zing


Page 1: Hadoop & Zing

PRESENTER: HUNGVV
W: http://me.zing.vn/hung.vo

E: [email protected]

2011-08

HADOOP & ZING

Page 2: Hadoop & Zing

AGENDA

1. Introduction to Hadoop, Hive
2. Using Hadoop in Zing
3. A case study: Log Collecting, Analyzing & Reporting System

Conclusion

Page 3: Hadoop & Zing

Hadoop & Zing

What
- A framework for large-scale data processing
- Inspired by Google's architecture: MapReduce and GFS
- A top-level Apache project – Hadoop is open source

Why
- Fault-tolerant hardware is expensive
- Hadoop is designed to run on cheap commodity hardware
- It automatically handles data replication and node failure
- It does the hard work – you can focus on processing data

Page 4: Hadoop & Zing

Data Flow into Hadoop

Web Servers → Scribe MidTier → Network Storage and Servers → Hadoop/Hive Warehouse → MySQL

Page 5: Hadoop & Zing

Hive – Data Warehouse

A system for managing and querying structured data built on top of Hadoop:
- Map-Reduce for execution
- HDFS for storage
- Metadata in an RDBMS

Key building principles:
- SQL as a familiar data warehousing tool
- Extensibility: types, functions, formats, scripts
- Scalability and performance
- An efficient SQL to Map-Reduce compiler (illustrated below)
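As a hedged illustration of that compiler (the table name reuses stats_login from the DML slide later in the deck; the exact plan Hive prints depends on the version), EXPLAIN shows the map-reduce job Hive generates for a simple aggregation:

EXPLAIN
SELECT app, COUNT(1)
FROM stats_login
WHERE dt='2011-08-01'
GROUP BY app;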

Page 6: Hadoop & Zing

Hive Architecture

[Architecture diagram: Hive components on top of Hadoop]
- Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL)
- Hive QL pipeline: Parser → Planner → Optimizer → Execution
- SerDe: CSV, Thrift, Regex
- UDF/UDAF: substr, sum, average
- FileFormats: TextFile, SequenceFile, RCFile
- User-defined Map-Reduce scripts
- Storage and execution layers: HDFS, Map-Reduce

Page 7: Hadoop & Zing

Hive DDL

DDL supports:
- Complex columns
- Partitions
- Buckets

Example:
CREATE TABLE stats_active_daily (
  username STRING,
  userid INT,
  last_login INT,
  num_login INT,
  num_longsession INT)
PARTITIONED BY (dt STRING, app STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
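A minimal sketch of the other two DDL features listed above, complex columns and buckets, using a hypothetical table and column names that do not come from the deck:

CREATE TABLE user_profile_example (
  userid INT,
  friends ARRAY<INT>,
  properties MAP<STRING, STRING>,
  address STRUCT<city:STRING, country:STRING>)
CLUSTERED BY (userid) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;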

Page 8: Hadoop & Zing

Hive DML

Data loading:
LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*'
OVERWRITE INTO TABLE stats_login PARTITION (dt='$YESTERDAY', app='${APP}');

Insert data into Hive tables:
INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}')
SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0))
FROM stats_login
WHERE dt='$YESTERDAY' AND app='${APP}'
GROUP BY username, userid;

Page 9: Hadoop & Zing

Hive Query Language

SQL-like features:
- WHERE
- GROUP BY
- Equi-join
- Subquery in the FROM clause
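A sketch only, combining these features (the join and column names are assumptions for illustration, reusing the stats_active_daily and user_information tables that appear elsewhere in the deck):

SELECT u.genderid, COUNT(1)
FROM (
  SELECT userid
  FROM stats_active_daily
  WHERE dt='2011-08-01' AND app='zingme'
) a
JOIN user_information u ON (a.userid = u.userid)
GROUP BY u.genderid;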

Page 10: Hadoop & Zing

Multi-table Group-By/Insert

FROM user_information

INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY')
SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid

INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY')
SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob)

INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY')
SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid

INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY')
SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid;

Page 11: Hadoop & Zing

File Formats

TextFile:
- Easy for other applications to write/read
- Gzipped text files are not splittable

SequenceFile:
- http://wiki.apache.org/hadoop/SequenceFile
- Only Hadoop can read it
- Supports splittable compression

RCFile: block-based columnar storage
- https://issues.apache.org/jira/browse/HIVE-352
- Uses the SequenceFile block format
- Columnar storage inside a block
- ~25% smaller compressed size
- On-par or better query performance, depending on the query
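A minimal sketch of how a file format is chosen in DDL, together with the standard Hive/Hadoop settings that make the output compressed (the table name is hypothetical):

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE stats_active_daily_rc (
  username STRING,
  userid INT)
PARTITIONED BY (dt STRING)
STORED AS RCFILE;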

Page 12: Hadoop & Zing

SerDe

Serialization/Deserialization – row formats:
- CSV (LazySimpleSerDe)
- Thrift (ThriftSerDe)
- Regex (RegexSerDe)
- Hive binary format (LazyBinarySerDe)

LazySimpleSerDe and LazyBinarySerDe:
- Deserialize a field only when it is needed
- Reuse objects across different rows
- Text and binary formats
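An illustrative sketch only of attaching a non-default SerDe in DDL, using the hive-contrib RegexSerDe; the table, regex, and jar path are assumptions, not from the deck:

-- jar path is an assumption
ADD JAR hive-contrib.jar;

CREATE TABLE request_log_regex (
  client_ip STRING,
  request_uri STRING,
  execution_time STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;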

Page 13: Hadoop & Zing

UDF/UDAF

Features:
- Use either Java or Hadoop objects (int, Integer, IntWritable)
- Overloading
- Variable-length arguments
- Partial aggregation for UDAF

Example UDF:
import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFExampleAdd extends UDF {
  public int evaluate(int a, int b) {
    return a + b;
  }
}
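A hedged usage sketch for the UDF above: ADD JAR and CREATE TEMPORARY FUNCTION are the standard Hive statements for registering a UDF, but the jar name and the query are assumptions:

-- jar name is hypothetical
ADD JAR udf-example.jar;
CREATE TEMPORARY FUNCTION example_add AS 'UDFExampleAdd';

SELECT example_add(num_login, num_longsession)
FROM stats_active_daily
WHERE dt='2011-08-01';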

Page 14: Hadoop & Zing

What do we use Hadoop for?

- Storing Zing Me core log data
- Storing Zing Me Game/App log data
- Storing backup data
- Processing/analyzing data with Hive
- Storing social data (feeds, comments, voting, chat messages, …) with HBase

Page 15: Hadoop & Zing

Data Usage

Statistics per day:
- ~300 GB of new data added
- ~800 GB of data scanned
- ~10,000 Hive jobs

Page 16: Hadoop & Zing

Where is the data stored?

Hadoop/Hive Warehouse:
- 90 TB of data
- 20 nodes, 16 cores per node
- 16 TB per node
- Replication factor = 2

Page 17: Hadoop & Zing

Log Collecting, Analyzing & Reporting

Need:
- A simple, high-performance framework for log collection
- Central, highly available & scalable storage
- An easy-to-use tool for data analysis (schema-based, SQL-like queries, …)
- A robust framework for developing reports

Page 18: Hadoop & Zing

Version 1 (RDBMS-style):
- Log data goes directly into a MySQL database (master)
- Data is transformed into another MySQL database (off-load)
- Statistics queries run and export their results into further MySQL tables

Performance problems:
- Slow log inserts, especially concurrent inserts
- Slow query times on large datasets

Log Collecting, Analyzing & Reporting

Page 19: Hadoop & Zing

Version 2 (Scribe, Hadoop & Hive):
- Fast logging
- Acceptable query time on large datasets
- Data replication
- Distributed calculation

Log Collecting, Analyzing & Reporting

Page 20: Hadoop & Zing

Components:
- Log Collector
- Log/Data Transformer
- Data Analyzer
- Web Reporter

Process:
- Define the logs
- Integrate logging (into the application)
- Analyze the logs/data
- Develop the reports

Log Collecting, Analyzing & Reporting

Page 21: Hadoop & Zing

Log Collector: Scribe
- A server for aggregating streaming log data
- Designed to scale to a very large number of nodes and to be robust to network and node failures
- Hierarchical stores
- A Thrift service using the non-blocking C++ server
- Thrift clients in C/C++, Java, PHP, …

Log Collecting, Analyzing & Reporting

Page 22: Hadoop & Zing

Log format (common)

Application-action log:
server_ip server_domain client_ip username actionid createdtime appdata execution_time

Request log:
server_ip request_domain request_uri request_time execution_time memory client_ip username application

Game action log:
time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata
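A hedged sketch of how one of these formats could map onto a Hive table; the table name, partitioning, and column types are assumptions, and the time field is renamed log_time here, but the columns follow the game action log fields above:

CREATE TABLE game_action_log (
  log_time INT,
  username STRING,
  actionid INT,
  gameid INT,
  goldgain INT,
  coingain INT,
  expgain INT,
  itemtype INT,
  itemid INT,
  userid_affect INT,
  appdata STRING)
PARTITIONED BY (dt STRING, app STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;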

Log Collecting, Analyzing & Reporting

Page 23: Hadoop & Zing

port=1463
max_msg_per_second=2000000
max_queue_size=10000000
new_thread_per_category=yes
num_thrift_server_threads=10
check_interval=3

# DEFAULT - write all other categories to /data/scribe_log
<store>
category=default
type=file
file_path=/data/scribe_log
base_filename=default_log
max_size=8000000000
add_newlines=1
rotate_period=hourly
#rotate_hour=0
rotate_minute=1
</store>

Log Collecting, Analyzing & Reporting

Page 24: Hadoop & Zing

Scribe – buffer store

<store>
category=default
type=buffer
target_write_size=20480
max_write_interval=1
buffer_send_rate=1
retry_interval=30
retry_interval_range=10

<primary>
type=network
remote_host=xxx.yyy.zzz.ttt
remote_port=1463
</primary>

<secondary>
type=file
fs_type=std
file_path=/tmp
base_filename=zmlog_backup
max_size=30000000
</secondary>
</store>

Log Collecting, Analyzing & Reporting

Page 25: Hadoop & Zing

Log/Data Transformer
- Helps import data from multiple types of sources into Hive
- Semi-automated

Log files to Hive:
LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE …

MySQL data to Hive:
- Extract data using SELECT … INTO OUTFILE …
- Import using LOAD DATA
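A minimal sketch of that MySQL-to-Hive path, under the assumption that user_information (mentioned on the multi-insert slide) lives in MySQL and is mirrored into a dt-partitioned Hive table; file paths and the partition value are illustrative:

-- MySQL: dump the table to a tab-separated file
SELECT userid, username, genderid, dob
INTO OUTFILE '/tmp/user_information.tsv'
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
FROM user_information;

-- Hive: load the exported file into the warehouse copy of the table
LOAD DATA LOCAL INPATH '/tmp/user_information.tsv'
OVERWRITE INTO TABLE user_information PARTITION (dt='2011-08-01');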

Log Collecting, Analyzing & Reporting

Page 26: Hadoop & Zing

Data Analyzer
- Calculations use the Hive query language (HQL): SQL-like
- Data partitioning and query optimization are very important for speed:
  - distributed data reading
  - queries optimized to read the data in one pass

Automation:
- hive --service cli -f hql_file
- Bash shell scripts, crontab

Export data and import it into MySQL for web reporting:
- Export with the Hadoop command line: hadoop fs -cat
- Import using LOAD DATA LOCAL INFILE … INTO TABLE …
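A hedged sketch of the import side only; the exported file path, the report table, and its delimiters are assumptions (the export itself would be the hadoop fs -cat command above redirected into that file):

-- MySQL: load the file produced by `hadoop fs -cat ... > /tmp/stats_active_daily.tsv`
LOAD DATA LOCAL INFILE '/tmp/stats_active_daily.tsv'
INTO TABLE report_active_daily
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';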

Log Collecting, Analyzing & Reporting

Page 27: Hadoop & Zing

Web Reporter
- PHP web application
- Modular
- Standard formats and templates
- JpGraph for charts

Log Collecting, Analyzing & Reporting

Page 28: Hadoop & Zing

Applications

Summarization:
- User/app indicators: active users, churn rate, logins, returns, …
- User demographics: age, gender, education, job, location, …
- User interactions / app actions

Data mining:
- Spam detection
- Application performance
- Ad-hoc analysis
- …

Log Collecting, Analyzing & Reporting

Page 29: Hadoop & Zing

THANK YOU!