hw09 terapot email archiving with hadoop

Next Revolution Toward Open Platform. Terapot: Massive Email Archiving with Hadoop & Friends (Commercial Hadoop Application). Jaesun Han, Founder & CEO of NexR, [email protected]

Upload: cloudera-inc

Post on 25-May-2015


Page 1: Hw09   Terapot  Email Archiving With Hadoop

Next Revolution Toward Open Platform

Terapot: Massive Email Archiving with Hadoop & Friends
- Commercial Hadoop Application

Jaesun Han
Founder & CEO of NexR
[email protected]

Page 2: Hw09   Terapot  Email Archiving With Hadoop

#2 About NexR

Offering Hadoop & Cloud Computing Platform and Services

Diagram labels:
- icube-cc (Compute)
- icube-sc (Storage)
- Hadoop
- Provisioning & Management
- Massive Data Storage & Processing Platform
- Cloud Computing Platform (Compatible with Amazon AWS)
- Academic Support Program
- Massive Email Archiving
- MapReduce Workflow
- Hadoop & Cloud Computing Services

Page 3: Hw09   Terapot  Email Archiving With Hadoop

#3 What is Email Archiving?

The Objectives of Email Archiving
- Regulatory compliance
- e-Discovery: Litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content

Page 4: Hw09   Terapot  Email Archiving With Hadoop

#4 The Architecture of Email Archiving

Diagram labels: Email Servers; Journaling; Crawling; Email Archiving Server; Indexing; Indexes; Archival Storage (email data); Search; Discovery; users (auditor, administrator, employee)

- Data Acquisition: Journaling, Mailbox Crawling
- Data Processing: Indexing, Filtering
- Data Access: Search, Discovery

Page 5: Hw09   Terapot  Email Archiving With Hadoop

#5 The Challenges of Email Archiving

Explosive growth of digital data
- About 6 times more data in 2010 (988 XB, i.e. exabytes) than in 2006
- 95% (939 XB) is unstructured data, including email
- Increases the cost and complexity of archiving
-> Requires scalable & low-cost archiving

Reinforcement of data retention regulation
- Retention, Disposal, e-Discovery, Security
- HIPAA (Healthcare) 21-23 yrs, SEC17 (Trading) 6 yrs, OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
-> Requires scalable archiving & fast discovery

Needs for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
-> Requires integration with intelligent systems

Page 6: Hw09   Terapot  Email Archiving With Hadoop

#6 New Requirements of Email Archiving

- High Scalability
- Low Cost
- High Performance
- Intelligence

Page 7: Hw09   Terapot  Email Archiving With Hadoop

#7 Terapot: When Hadoop Met Email Archiving…

Diagram labels: Email Servers; Distributed Crawling; Journaling Server (Journaling); Hadoop HDFS (Archiving); Hadoop MapReduce (Crawling, Indexing, etc.); Distributed Search & Discovery

Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery

Page 8: Hw09   Terapot  Email Archiving With Hadoop

#8 Features of Terapot

Distributed Massive Email Archiving
- High Scalability by Shared-Nothing Architecture
  - Thousands of servers, billions of emails
- Low Cost by Inexpensive Hardware
  - Entry servers under $5,000
- High Performance by Parallelism
  - Fast search under 1-2 seconds for each user account
  - Fast discovery in parallel with MapReduce
- Intelligence by Data Mining
  - Contact network analysis, content analysis, statistics

Support Both On-premise Version and Cloud (hosted) Version

Development with Various Open Source Software

Page 9: Hw09   Terapot  Email Archiving With Hadoop

#9 The Architecture of Terapot

Diagram labels:
- Email Sources: POP3 Server, HTTP/FTP/SFTP Server, Mail Server, NAS/NFS
- 4 key components: Batch processing, Real-Time Indexing, Searching, Analysis
- Batch processing: Crawling, Indexing, Merging
- Real-Time Indexing: Mail Server
- Searching: Search Gateway
- Analysis: ETL, Analyzer, Mining
- MR Workflow Manager
- Storage: HDFS (email), Local (index)
- Terapot Frontend; Terapot Clients
- Interfaces: SOAP, REST, JSON
- Hadoop MapReduce, Lucene, & Hive

Page 10: Hw09   Terapot  Email Archiving With Hadoop

#10 Batch Processing Component

Diagram labels: Email Sources; Crawling (MR); Indexing (MR); Merging; HDFS; Local file system; Search

- Crawling (MR): an archive file per user (sequence file)
- Indexing (MR): a temporary index file per user (Lucene index file)
- Merging: a merged index file (for backing up)
- Local file system: index shards (shard 0, shard 1), 3-copy replication

Archiving policies
- An archive file per user
- Several archive files per crawling (configured period)
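The three batch stages on this slide (crawling into one archive per user, indexing each user's archive, merging indexes into shards) can be illustrated with a small in-memory sketch. This is not Terapot's code: plain Python dictionaries stand in for sequence files and Lucene indexes, and the function names, the whitespace tokenizer, and the round-robin shard placement are all assumptions made for illustration.

```python
from collections import defaultdict

def archive_by_user(emails):
    """Crawling stage: group raw emails into one archive (a list) per user."""
    archives = defaultdict(list)
    for email in emails:
        archives[email["user"]].append(email)
    return dict(archives)

def index_user(archive):
    """Indexing stage: build a temporary inverted index (term -> email ids) for one user."""
    index = defaultdict(set)
    for email in archive:
        for term in email["body"].lower().split():
            index[term].add(email["id"])
    return index

def merge_into_shards(user_indexes, num_shards):
    """Merging stage: combine per-user indexes into shards keyed by (user, term)."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for position, user in enumerate(sorted(user_indexes)):
        shard = shards[position % num_shards]  # round-robin placement (assumption)
        for term, ids in user_indexes[user].items():
            shard[(user, term)] |= ids
    return shards

# Hypothetical sample data.
emails = [
    {"id": 1, "user": "alice", "body": "Quarterly report attached"},
    {"id": 2, "user": "bob", "body": "Report deadline moved"},
    {"id": 3, "user": "alice", "body": "Lunch on Friday"},
]
archives = archive_by_user(emails)
user_indexes = {user: index_user(archive) for user, archive in archives.items()}
shards = merge_into_shards(user_indexes, num_shards=2)
```

In a real deployment each stage would run as a MapReduce job keyed by user, so that one reducer writes one user's archive or index; the sketch only shows the data flow.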

Page 11: Hw09   Terapot  Email Archiving With Hadoop

#11 Real-Time Indexing Component

Diagram labels:
- Journaling Server
- Real-Time Indexing: Memory (Real-Time Index), Database
- Flushing; Forwarding
- Batch Processing Component: Crawling, Indexing, Archiving
- HDFS: archive, index
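One plausible reading of this slide is that journaled emails become searchable at once through an in-memory index, and are periodically flushed and forwarded to the batch component for durable archiving. The sketch below illustrates that reading only; the class name, the count-based flush threshold, and the `forward` callback are hypothetical, and the slide does not say when or how Terapot actually flushes.

```python
class RealTimeIndexer:
    """In-memory index over recently journaled emails (illustrative only)."""

    def __init__(self, flush_threshold, forward):
        self.flush_threshold = flush_threshold  # flush after this many emails (assumption)
        self.forward = forward                  # callback standing in for the batch component
        self.pending = []                       # emails not yet forwarded
        self.index = {}                         # term -> set of email ids

    def add(self, email):
        """Index an email in memory so it is searchable immediately."""
        self.pending.append(email)
        for term in email["body"].lower().split():
            self.index.setdefault(term, set()).add(email["id"])
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def search(self, term):
        """Return ids of not-yet-flushed emails containing the term."""
        return sorted(self.index.get(term.lower(), set()))

    def flush(self):
        """Forward pending emails to the batch side and reset the in-memory state."""
        if self.pending:
            self.forward(list(self.pending))
        self.pending.clear()
        self.index.clear()

# Hypothetical usage: a list collects the forwarded batches.
batches = []
indexer = RealTimeIndexer(flush_threshold=2, forward=batches.append)
indexer.add({"id": 1, "body": "Budget approved"})
hits_before_flush = indexer.search("budget")
indexer.add({"id": 2, "body": "Budget meeting"})  # reaches the threshold and flushes
hits_after_flush = indexer.search("budget")
```

After a flush, the forwarded emails would be served by the batch-built index shards instead, which is why the in-memory index can be cleared.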

Page 12: Hw09   Terapot  Email Archiving With Hadoop

#12 Search & Discovery Component

Diagram labels:
- Search Gateway: Locating index shards, Distributed Search
- ZooKeeper: Updating shard status, Assigning shards
- Search Nodes: copy index shards to local file system
- Real-Time Indexing Nodes
- HDFS: index shards
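The gateway's role on this slide (locate the shards, query them in parallel, combine the hits) is a scatter-gather pattern, sketched below. The classes are hypothetical, a plain dictionary stands in for the shard locations that the slide keeps in ZooKeeper, and threads stand in for remote calls to search nodes.

```python
from concurrent.futures import ThreadPoolExecutor

class SearchNode:
    """A node holding local copies of some index shards (illustrative only)."""

    def __init__(self, name, shards):
        self.name = name
        self.shards = shards  # shard id -> {term: [email ids]}

    def search(self, shard_id, term):
        return self.shards[shard_id].get(term, [])

class SearchGateway:
    """Fans a query out to every shard and merges the results."""

    def __init__(self, nodes, shard_locations):
        self.nodes = {node.name: node for node in nodes}
        self.shard_locations = shard_locations  # shard id -> node name (ZooKeeper stand-in)

    def search(self, term):
        with ThreadPoolExecutor() as pool:
            futures = [
                pool.submit(self.nodes[node_name].search, shard_id, term)
                for shard_id, node_name in self.shard_locations.items()
            ]
            hits = [email_id for future in futures for email_id in future.result()]
        return sorted(hits)

# Hypothetical two-node cluster with one shard each.
node_a = SearchNode("node-a", {0: {"report": [1, 4]}})
node_b = SearchNode("node-b", {1: {"report": [2], "lunch": [3]}})
gateway = SearchGateway([node_a, node_b], {0: "node-a", 1: "node-b"})
results = gateway.search("report")
```

Keeping the shard-to-node map outside the gateway is what lets shards be reassigned when a node fails without changing the query path.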

Page 13: Hw09   Terapot  Email Archiving With Hadoop

#13 Data Analysis Component

Diagram labels:
- HDFS: email archive files
- ETL (MR): Extract-Transform-Load, email archive files to Hive table
- Hive: Hive queries, executed as MR jobs
- Mining Engine
- analysis results database
- Reporter: generating reports
- Analyzer Web: reports

Analysis examples
- Personal contact network analysis
- Domain statistics
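The two analyses named on this slide reduce to simple aggregations over sender and recipient fields. The sketch below shows them in plain Python as an illustration of what the Hive queries might compute; the record layout, function names, and sample addresses are assumptions, not Terapot's schema.

```python
from collections import Counter

def contact_network(emails):
    """Count messages per (sender, recipient) pair: the edges of a contact graph."""
    edges = Counter()
    for email in emails:
        for recipient in email["to"]:
            edges[(email["from"], recipient)] += 1
    return edges

def domain_statistics(emails):
    """Count delivered messages per recipient domain."""
    domains = Counter()
    for email in emails:
        for recipient in email["to"]:
            domains[recipient.rsplit("@", 1)[-1].lower()] += 1
    return domains

# Hypothetical sample records.
emails = [
    {"from": "alice@example.com", "to": ["bob@example.com", "carol@example.org"]},
    {"from": "alice@example.com", "to": ["bob@example.com"]},
    {"from": "bob@example.com", "to": ["alice@example.com"]},
]
edges = contact_network(emails)
domains = domain_statistics(emails)
strongest_contact = edges.most_common(1)[0]
```

In Hive the same results would come from a GROUP BY over an exploded recipient column, which compiles to the MR jobs shown in the diagram.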

Page 14: Hw09   Terapot  Email Archiving With Hadoop

#14 Installation & Quantitative Analysis

Installation
- 2 master nodes (HA)
- 10 worker nodes (datanode, tasktracker, searcher, etc.)

Description                                    Qty
CPU: Intel Xeon Nehalem E5504 2.0GHz           2 (8 cores)
Memory: DDR3 2GB PC3-10600 Registered DIMM     9 (18GB)
HDD: 1TB 7200 RPM SATA2                        4 (4TB)

Quantitative Analysis

Assumptions
- 1000 employees
- 16 emails per day for each person
- 215 KB average email size (content 142 KB + attachment 73 KB)
- 1.25 GB per year for 1 employee

Storage
- Index size: about 80% of email
- Compression ratio: about 50%

Disk volume required for 1 year
- Email archive (HDFS): 1881 GB
- Indexes (HDFS + Local): 4559 GB
- Total: about 6.4 TB per year

40 TB may cover 6 years of archiving
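The headline figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only numbers stated on the slide, takes the yearly archive and index volumes as given rather than deriving them, and assumes decimal units (1 GB = 1,000,000 KB) and a 365-day year. That the 40 TB figure is 10 worker nodes times 4 TB each is an inference from the hardware list, not something the slide states.

```python
# Assumptions stated on the slide.
EMAILS_PER_DAY = 16
AVG_EMAIL_KB = 215
DAYS_PER_YEAR = 365          # assumption: every calendar day counts

# Yearly disk volumes, taken from the slide as given (not derived here).
ARCHIVE_GB_PER_YEAR = 1881
INDEX_GB_PER_YEAR = 4559

# Cluster capacity: inferred as worker nodes times disk per node.
WORKER_NODES = 10
TB_PER_NODE = 4

# Raw email volume per employee per year, in decimal GB.
per_employee_gb = EMAILS_PER_DAY * AVG_EMAIL_KB * DAYS_PER_YEAR / 1_000_000

# Total disk needed per year and how long the cluster lasts.
total_tb_per_year = (ARCHIVE_GB_PER_YEAR + INDEX_GB_PER_YEAR) / 1000
cluster_tb = WORKER_NODES * TB_PER_NODE
years_covered = cluster_tb / total_tb_per_year

print(f"{per_employee_gb:.2f} GB per employee per year")
print(f"{total_tb_per_year:.2f} TB per year, {years_covered:.1f} years on {cluster_tb} TB")
```

The results agree with the slide: roughly 1.25 GB per employee per year, about 6.4 TB per year in total, and a little over 6 years of archiving on 40 TB.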

Page 15: Hw09   Terapot  Email Archiving With Hadoop

#15 Demonstration

Page 16: Hw09   Terapot  Email Archiving With Hadoop

Hadoop & Cloud Computing Company

For more information
- www.nexrcorp.com
- www.terapot.com
- [email protected]
- @jaesun_han