data analysis @ daum | devon 2012

Post on 26-Jun-2015


Data Analysis @ Daum
Paul Kim (totworld@daumcorp.com)
DevOn 2012

Search Quality

Search Quality = Satisfaction / Cost

How?

Understanding Users with Logs

BIG DATA!

Data Analysis Process with Hadoop

HADOOP -> FEATURES -> TOOLS

Tools: SAS, R, WEKA, etc.

2 QUAD-CORES, 8GB RAM, 4TB HDD

4 QUAD-CORES, 16GB RAM, 4TB HDD

X 60 NODES

X 30 NODES

For example,

Secrets to cooking delicious ramyeon

Most-viewed posts

Mission: Reflect satisfying search experiences in the ranking

Target Data: Half-year search logs (about 40TB)

Features:
Query - Collection Relationship
Query - Document - Session Relationship
Session - Query Relationship
Session - Document Relationship

GROUP-BY JOB (x 4)
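Each of the four relationships listed under Features can be read as a count over a different grouping key. The following is a minimal in-memory sketch of that idea in Python, not the Hadoop jobs used in the talk. The record layout `(session_id, query, doc_id, collection)` and the sample values are assumptions for illustration; the slides do not show the real log schema.

```python
from collections import defaultdict

# Hypothetical click-log records: (session_id, query, doc_id, collection).
# Field order and values are illustrative, not the real Daum log schema.
CLICKS = [
    ("s1", "ramyeon recipe", "d1", "blog"),
    ("s1", "ramyeon recipe", "d2", "cafe"),
    ("s2", "ramyeon recipe", "d1", "blog"),
    ("s2", "kimchi stew", "d3", "blog"),
]


def group_count(records, key_fn):
    """Count records per key: the in-memory analogue of one group-by job."""
    counts = defaultdict(int)
    for rec in records:
        counts[key_fn(rec)] += 1
    return dict(counts)


# One group-by per relationship named on the slide.
query_collection = group_count(CLICKS, lambda r: (r[1], r[3]))
query_doc_session = group_count(CLICKS, lambda r: (r[1], r[2], r[0]))
session_query = group_count(CLICKS, lambda r: (r[0], r[1]))
session_doc = group_count(CLICKS, lambda r: (r[0], r[2]))
```

On a cluster, the key function would correspond to the map step and the counting to the reduce step, so the same four groupings could run as four independent jobs.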

Most-viewed posts

Modeling: Linear Regression with SAS
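The talk fits its model in SAS and does not show the model itself. As a rough illustration of what a linear regression step does, here is a single-feature ordinary least squares fit in plain Python. The feature name `click_rate`, the target name `satisfaction`, and the numbers are all hypothetical; the real model presumably uses several features derived from the group-by jobs.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: one feature value and one target per document.
click_rate = [0.1, 0.2, 0.4, 0.8]
satisfaction = [0.35, 0.40, 0.50, 0.70]

slope, intercept = fit_line(click_rate, satisfaction)


def predict(x):
    """Score a new document from its feature value."""
    return slope * x + intercept
```

The fitted coefficients are what a batch process would hand to the ranking engine; scoring a document is then a single multiply-add per feature.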

Batch Process

HADOOP -> FEATURES -> MODEL -> ENGINE

LESS THAN 2 HOURS

Bada Iyagi ("Sea Story")

SEARCH SPAM INDEX

Mission: Identify the impact of spam on search users

Data:
Search Log: Text with Delimiter
Post Filtered Documents: JSON Format
Operation Deleted Documents: XML Format

Task: Query - Session - Doc. 1 - Doc. 2 - Doc. 3 - Doc. 4

Click? Type? (Ham, Spam, OP Del.)

OUTER JOIN
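The outer join combines click information from the search log with the document type taken from the filter and operations data, so that every document ends up with both a "Click?" and a "Type?" value. Below is a minimal sketch of a full outer join on document ID in Python. The dictionaries, IDs, and the rule that an unlabeled document counts as ham are assumptions for illustration, not details given in the slides.

```python
# Hypothetical inputs, keyed by document ID.
clicked = {"d1": 3, "d2": 1}              # doc_id -> clicks seen in the search log
labels = {"d2": "spam", "d3": "op_del"}   # doc_id -> type from filter/ops data


def full_outer_join(left, right):
    """Return {key: (left_value_or_None, right_value_or_None)} over all keys."""
    return {k: (left.get(k), right.get(k)) for k in left.keys() | right.keys()}


joined = full_outer_join(clicked, labels)

# Assumption: a document with no label is ham; a document never clicked has 0 clicks.
rows = {
    doc: (clicks or 0, label or "ham")
    for doc, (clicks, label) in joined.items()
}
```

An inner join would drop `d1` (no label) and `d3` (no clicks); the outer join keeps both, which matters when the goal is to measure how often spam is shown and clicked relative to everything else.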

SEARCH SPAM INDEX

Result Sample

BLOG CLASSIFICATION

Mission: Cluster bad blogs through unsupervised learning

Data: 30 days of blog documents

Task: Blog - Document's Feature Analysis with Fixed Interval

BLOG CLASSIFICATION

Modeling: Kohonen's SOM (Self-Organizing Map) with R
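The talk trains its self-organizing map in R and does not show the configuration. To illustrate the technique itself, here is a small one-dimensional SOM in plain Python: for each input it finds the best matching unit (BMU) and pulls that unit and its map neighbors toward the input, with a learning rate and neighborhood radius that shrink over time. The map size, decay schedule, and two-dimensional toy vectors are assumptions; real blog feature vectors would have more dimensions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (squared distance)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D self-organizing map and return its list of weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        # Learning rate and neighborhood radius decay linearly per epoch.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for x in data:
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighborhood measured on the map grid, not in data space.
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical feature vectors forming two loose groups.
DATA = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(DATA)
cluster_of = [best_matching_unit(som, x) for x in DATA]
```

After training, each document is assigned to the unit that best matches it, and documents mapped to the same or adjacent units can be inspected together as a candidate cluster.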

WHAT ELSE?

Topic Analysis with PLSA

Query Chain Filtering

Reprocessing with Hadoop

In Conclusion,

ADVANTAGE OF HADOOP

ADVANTAGE
Low analysis cost!
No more sampling!
Low operation cost!
Programming-language independent
Various support tools

DISADVANTAGE
Conceptual change is needed.
Project under active development.
Version upgrade is not supported.

THANK YOU!
