[@indeedeng] logrepo: enabling data-driven decisions

Post on 21-Nov-2014


DESCRIPTION

Video available at: http://youtu.be/y0WC1cxLsfo At Indeed, our applications generate billions of log events each month across our seven data centers worldwide. These events store the user and test data that form the foundation for decision making at Indeed. We built a distributed event logging system, called Logrepo, to record, aggregate, and access these logs. In this talk, we'll examine the architecture of Logrepo and how it evolved to scale. Jeff Chien joined Indeed as a software engineer in 2008. He has worked on the jobsearch frontend and backend, advertiser, company data, and apply teams, and enjoys building scalable applications. Jason Koppe is a Systems Administrator who has been with Indeed since late 2008. He has worked on infrastructure automation, monitoring, application resiliency, incident response, and capacity planning.

TRANSCRIPT

Logrepo: Enabling Data-Driven Decisions

Jeff Chien
Software Engineer
Indeed Apply Team

Scale

More job searches worldwide than any other employment website.

● Over 100 million unique users
● Over 3 billion searches per month
● Over 24 million jobs
● Over 50 countries
● Over 28 languages

I help people get jobs.

1. Search

2. View job

3. Click “Apply Now”

4. Submit application

Job seeker flow using Indeed Apply

Knowing how users interact with our system

helps us make better products

(Chart: likelihood of applying to a job, comparing job seekers who have to upload a resume with those who already have an Indeed resume)

We Have Questions

● What percentage of applications use Indeed resumes?

● How many searches for “java” in “Austin”?

● How often are resumes edited?

● How long does it take to aggregate jobs?

How many applications … to jobs from CareerBuilder … by job seekers who searched for “java” in “Austin” … used an Indeed resume?

Is the percentage different on mobile compared to web?

How much has this changed in 2011 compared to 2014?

Complicated Questions

More Information

Better Decisions

More information

Need to log events

● job searches

● clicks

● applies

What to log

Client information - unique user identifier, user agent, IP address…

User behavior - clicks, alert signups…

Performance - backend request duration, memory usage...

A/B test groups - control and test groups

Better decisions

Use empirical data to make decisions

Not based on assumptions or the highest paid person's opinion!

Objective

Collect data on user actions and system performance from many different applications in multiple data centers

How we build systems

Simple

Fast

Resilient

Scalable

Simple

Easy interface

Reuse familiar technologies

Fast

No impact to runtime performance

Data available soon

Resilient

Does not lose data in spite of system or network failures

Scalable

Can handle large quantities of data

Requirements

● Powerful enough to express diverse data
● Store all data forever
● Events stored at least once
● Easy to add new data to logs
● Easy to access logs in bulk
● Time range based access

Non-Goals

Random access to individual events

Real time access to events

Complex data types

Logrepo: A distributed event logging system

Est. 2006

Logrepo stores log entries

Everything is a string

Key/value pairs

URL-encoded

Organic click log entry

uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&href=http%3A%2F%2Fwww.indeed.com%2Fjobs%3Fq%3D%26l%3DNewburgh%252C%2BNY%26start%3D20&agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%3B+rv%3A26.0%29+Gecko%2F20100101+Firefox%2F26.0&raddr=173.50.255.255&ckcnt=17&cksz=1033&ctk=18dtbc6960nk20vd&ctkRcv=1&&

URL-decoded organic click log entry

uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http://www.indeed.com/rc/clk&href=http://www.indeed.com/jobs?q=&l=Newburgh%2C+NY&start=20&agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0&...
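Because entries are just URL-encoded key/value pairs, they decode with standard URL tooling in any language; for example, a minimal sketch in Python:

```python
from urllib.parse import parse_qsl

# A shortened version of the organic click entry above
raw = ("uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9"
       "&jobId=500&onclick=1&avgCmpRtg=2.9"
       "&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk")

# parse_qsl URL-decodes each key and value into (key, value) pairs
entry = dict(parse_qsl(raw))
print(entry["type"], entry["url"])  # orgClk http://www.indeed.com/rc/clk
```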

Advantages

● Human-readable
● Arbitrary keys
● Low overhead to add new key/value pairs
● Self-describing
● Easy to parse in any language

Required log entry keys

Every log entry has uid and type

Type is an arbitrary string

uid=18dtbolr20nk23qh&type=orgClk&...

UID format

uid=18ducm8u50nk23qh&type=jobsearch&...

UID is always the first key

Unique

16 characters

Base 32 [0-9a-v]

uid=18ducm8u50nk23qh

Date = 2014-01-10, Time = 09:35:24.357
Server id = 1512
App instance id = 2
UID Version = 0
Random value = 3921

UID breakdown

UID generation

Unique IDs are unique

Random value avoids UID collisions

Random value is between 0 and 8191

Up to 8000 events per application instance per millisecond

UID format benefits

Contains useful metadata

Compact format reduces memory requirements

Easy to compare or sort events by time
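Since a UID's leading characters are a base-32 millisecond timestamp (9 characters covers the examples in this talk), extracting an event's time is a string slice plus an integer parse. A sketch in Python, not Indeed's actual client code:

```python
from datetime import datetime, timezone

def uid_timestamp_ms(uid: str) -> int:
    # The first 9 base-32 chars ([0-9a-v]) encode ms since the Unix epoch;
    # Python's int() accepts exactly this digit alphabet for base 32
    return int(uid[:9], 32)

ms = uid_timestamp_ms("18ducm8u50nk23qh")
print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc))
# 2014-01-10 15:35:24.357 UTC (09:35:24.357 US Central, as in the slide)
```

Because equal-length base-32 strings compare lexicographically in numeric order, sorting UIDs as plain strings also sorts events by time.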

Job seeker events

1. Search for jobs

2. Click on job

3. Apply to job

All events are part of the same flow

Parent-child relationships between events

Events can reference other events with &tk=18ducm8u50nk23qh...

Children know their parents

Parents don’t know their children

Extremely powerful model

Parent-child relationships between events

An organic click points to the search it occurred on

uid=18dtbnn3p0nk20g9&type=jobsearch&v=0&...

uid=18dtbolr20nk23qh&type=orgClk&v=0 &tk=18dtbnn3p0nk20g9&...

More jobsearch child events

Sponsored job clicks

Javascript errors

Job alert signups

And many more...

uid=18en3o3ov16r25rp&type=viewjob&...

Diagram: job view (18en3o3ov16r25rp) → load IndeedApply → user submission → post to employer

Job seeker views a job

uid=18en3o3s216ph6d5&type=loadJs&vjtk=18en3o3ov16r25rp&...

Diagram: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission → post to employer

Indeed Apply loads

uid=18en3qe0u16pi5ct&type=appSubmit&loadJsTk=18en3o3s216ph6d5&...

Diagram: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission (18en3qe0u16pi5ct) → post to employer

Prepare job application

POST /apply HTTP/1.1
Host: employer.com

{
  "applicant": {
    "name": "John Doe",
    "email": "jobseeker@gmail.com",
    "phone": "555-555-5555"
  },
  "jobTitle": "Software Engineer"
  ...

uid=18en3qe2r0nji3h6&type=postApp&appSubmitTk=18en3qe0u16pi5ct&...

Diagram: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission (18en3qe0u16pi5ct) → post to employer (18en3qe2r0nji3h6)

Submit job application

Javascript latency ping

At start of page load, browser executes js to ping Indeed

Server receives the ping and logs an event

Parent job search and child js latency ping

uid=18dqpc3lm16pi2an&type=jobsearch&...

uid=18dqpc3s516pi566&type=lat&tk=18dqpc3lm16pi2an

Latency = 1389247205253 - 1389247205046 = 207 ms

Approximates perceived latency to jobseeker

uid timestamp Jan 9, 2014 00:00:05.253

tk timestamp Jan 9, 2014 00:00:05.046

Subtracting UID timestamps yields duration
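The same subtraction works directly on the UID strings; a sketch, assuming the 9-character base-32 timestamp prefix seen in this talk's examples:

```python
def uid_ms(uid: str) -> int:
    # Leading 9 base-32 characters of a UID are ms since the Unix epoch
    return int(uid[:9], 32)

# Child 'lat' event minus its parent jobsearch, from the slide above
latency = uid_ms("18dqpc3s516pi566") - uid_ms("18dqpc3lm16pi2an")
print(latency, "ms")  # 207 ms
```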

West coast perceived latency in California vs. Washington

Writing log entries from apps

LogEntry entry = factory.createLogEntry("search");

entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...

entry.commit();

Creating a log entry

LogEntry entry = factory.createLogEntry("search");

Creates a log entry with UID and type set

UID timestamp tied to createLogEntry() call

Populating a log entry

entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...

Lists

Separate values with commas

String groups = "foo,bar,baz";

logEntry.setProperty("grps", groups);

// uid=...&grps=foo%2Cbar%2Cbaz&...

Lists of Tuples

Encapsulate each tuple in parentheses

Comma-separate elements within tuple

// Two jobs with (job id, score)
String jobs = "(123,1.0)(400,0.8)";

logEntry.setProperty("jobs", jobs);

// uid=...&jobs=%28123%2C1.0%29%28400%2C0.8%29&...
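Following those two conventions, building and URL-encoding a list-of-tuples value takes only string assembly; a sketch in Python (the variable names are illustrative):

```python
from urllib.parse import quote

# Two jobs as (job id, score) tuples, per the tuple convention above
jobs = [(123, 1.0), (400, 0.8)]
encoded = "".join(f"({job_id},{score})" for job_id, score in jobs)
print(encoded)                  # (123,1.0)(400,0.8)
print(quote(encoded, safe=""))  # %28123%2C1.0%29%28400%2C0.8%29
```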

Committing a log entry

After log entry is fully populated...

entry.commit();

Jason Koppe
System Administrator

I engineer systemsthat help people get jobs.

Before logrepo


log4j - Java logging framework

● Code - what
● Configuration - define what goes where
● Appender - where (file, smtp)

http://logging.apache.org/log4j/1.2/


Reusing log4j for logrepo

Redundancy from the start

Write to local disk (FileAppender)

Write to remote server #1 (? Appender)

Write to remote server #2 (? Appender)

Writing to a remote server

syslog
Protocol for transporting messages across an IP network

Est. 1980s

http://tools.ietf.org/html/rfc5424

Using log4j with syslog

Out-of-the-box, log4j only supported UDP syslog

UDP could result in data loss

Avoiding data loss

TCP guarantees data transfer

Use TCP!

SyslogTcpAppender
● created by Indeed
● TCP-enabled log4j syslog Appender
● buffers messages before transport

Resilient for short network and syslog server downtimes

Creating a reliable Appender

Choosing a syslog daemon

syslog-ng
syslog daemon which supports TCP

Est. 1998

http://www.balabit.com/network-security/syslog-ng

Redundancy with log4j

Write to local disk (FileAppender)

Write to remote server #1 (SyslogTcpAppender)

Write to remote server #2 (SyslogTcpAppender)

Redundancy over TCP

Each syslog-ng server

receives unsorted log entries

immediately flushes entries to files on disk called raw logs

Quick redundancy over TCP

Optimized for redundancy

raw logs are probably out-of-order

each app writes to syslog independently

Optimize for read access patterns

LogRepositoryBuilder (“Builder”)
● sort
● deduplicate
● compress

Builder architecture

Builder creates segment files

uid=15mt000000k1&type=orgClk&v=1&k=4...
uid=15mt000010k7&type=orgClk&v=1&k=3...
uid=15mt000020k8&type=orgClk&v=1&k=2...
uid=15mt000030ss&type=orgClk&v=1&k=9...

Repeated strings compress well

uid=15mt000000k1&type=orgClk&v=1&k=4...
uid=15mt000010k7&type=orgClk&v=1&k=3...
uid=15mt000020k8&type=orgClk&v=1&k=2...
uid=15mt000030ss&type=orgClk&v=1&k=9...

compresses by 85%
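The effect is easy to reproduce: sorted entries share long key prefixes, which gzip's dictionary coding exploits. A sketch with synthetic data (the 85% figure is from real segment files; this toy sample is even more repetitive):

```python
import gzip

# Synthetic segment file: sorted, highly repetitive entries
lines = "".join(
    f"uid=15mt{i:06d}0k1&type=orgClk&v=1&k={i % 10}\n" for i in range(1000)
)
data = lines.encode()
packed = gzip.compress(data)
print(f"compressed to {len(packed) / len(data):.0%} of original size")
```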

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

● orgClk - log entry type
● 15mt - 4-char UID prefix, base 32 (~9.3 hour time period)
● 0 - extends it to a 5-char UID prefix, base 32 (~17 minute time period)
● 4181 - unique number

Supports more than 1 segment file per type per 5-char UID prefix
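The time spans follow from the encoding: each base-32 character carries 5 bits, and fixing a prefix leaves the remaining timestamp characters free to vary. With a 9-character millisecond timestamp, a quick check:

```python
# Fixing 4 of 9 timestamp chars leaves 5 chars (25 bits) of ms range;
# fixing 5 chars leaves 4 chars (20 bits)
four_char_period_h = 32 ** 5 / 3_600_000   # ms per hour
five_char_period_min = 32 ** 4 / 60_000    # ms per minute

print(round(four_char_period_h, 1))    # 9.3 hours
print(round(five_char_period_min, 1))  # 17.5 minutes
```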

Multiple segment files

Keep Builder memory usage fixed

When Builder memory fills, it flushes to disk

Each flush creates files for 5-char UID prefix


Builder creates the archive

Redundancy


Ensure archive consistency

● Delayed Builder on second server
● Add new segment files for log entries missed by first Builder
● Causes multiple segment files for a 5-char UID prefix

Providing access to logrepo

LogRepositoryReader (“Reader”)
● simple request protocol
● reads from (multiple) segment files
● provides sorted stream of entries to TCP client as quickly as possible

Reader request protocol

1. Start time
2. End time
3. Logrepo type

Reader request using netcat

$ echo 1295905740000 1295913600000 orgClk \
  | nc 192.168.0.1 9999

● 1295905740000 - start time (ms since 1970-01-01, the start of Unix time)
● 1295913600000 - end time (ms since 1970-01-01)
● orgClk - logrepo type
● nc - send echo across a TCP session

Reader request using netcat

$ echo 1295905740000 1295913600000 orgClk \
  | nc 192.168.0.1 9999

uid=15mt00l710k3262q&type=orgClk&v=0&...

uid=15mt00l780k137d9&type=orgClk&v=0&...

...

uid=15mt7ggvj142h06k&type=orgClk&v=0&...

UID-sorted results

Reading entries from archive

1. Isolate to the type directory

1295905740000 1295913600000 orgClk

Reading entries from archive

2. Convert request timestamps to UID prefix

uidPrefixFromTime(1295905740000) = 15mt0

uidPrefixFromTime(1295913600000) = 15mt7

1295905740000 1295913600000 orgClk
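uidPrefixFromTime can be sketched as rendering the millisecond timestamp in base 32 and keeping the leading characters (a reimplementation for illustration, not Indeed's code):

```python
DIGITS = "0123456789abcdefghijklmnopqrstuv"  # base 32, as in UIDs

def uid_prefix_from_time(ms: int, length: int = 5) -> str:
    # Render ms as a fixed-width 9-digit base-32 string, keep the prefix
    chars = []
    for _ in range(9):
        chars.append(DIGITS[ms % 32])
        ms //= 32
    return "".join(reversed(chars))[:length]

print(uid_prefix_from_time(1295905740000))  # 15mt0
print(uid_prefix_from_time(1295913600000))  # 15mt7
```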

Reading entries from archive

3. Find segments matching first UID prefix

ls orgClk/15mt/0*
orgClk/15mt/0.log3094.seg.gz
orgClk/15mt/0.log4181.seg.gz

1295905740000 1295913600000 orgClk

15mt0

Reading entries from archive

4. Read sorted segments simultaneously, merge into a single sorted stream

/orgClk/15mt/0.log3094.seg.gz:
  uid=15mt000080g1i0j5&type=orgClk&...
  uid=15mt00l780k137d9&type=orgClk&...
/orgClk/15mt/0.log4181.seg.gz:
  uid=15mt00l710k3262q&type=orgClk&...
  uid=15mt00l790k1i2rs&type=orgClk&...

1295905740000 1295913600000 orgClk


Reading entries from archive

4. Read sorted segments simultaneously, merge into a single sorted stream

uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...

1295905740000 1295913600000 orgClk

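Since segment files are individually sorted, step 4 is a classic k-way merge; in Python it could be sketched with heapq.merge:

```python
import heapq

# Two already-sorted segment streams, as in the example above
seg_a = ["uid=15mt000080g1i0j5&type=orgClk",
         "uid=15mt00l780k137d9&type=orgClk"]
seg_b = ["uid=15mt00l710k3262q&type=orgClk",
         "uid=15mt00l790k1i2rs&type=orgClk"]

# heapq.merge lazily combines any number of sorted iterables into one
# sorted stream; UID string order is time order, so no parsing is needed
merged = list(heapq.merge(seg_a, seg_b))
for entry in merged:
    print(entry)
```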

Reading entries from archive

5. Only return log entries between timestamps

uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...

1295905740000 1295913600000 orgClk


Reading entries from archive

6. Read segments for each UID prefix, one prefix at a time

1295905740000 1295913600000 orgClk

15mt0, 15mt1, 15mt2, 15mt3, 15mt4, 15mt5, 15mt6, 15mt7

Reading entries from archive

7. Stop reading files when entry crosses request boundary

1295905740000 1295913600000 orgClk

The first years (2007 & 2008)

● Single datacenter
● App servers
● 2 logrepo servers
● syslog-ng
● Builder
● Reader

Growth

job seekers

Growth

job seekers

products

Growth

job seekers

products

datacenters

Growth

log entries

Multi-datacenter rationale

Latency

Redundancy

Multi-datacenter rationale

Job seekers

Logrepo in multiple datacenters

● Single datacenter
● Consumers
● Reader

● Every datacenter
● Applications producing log entries
● 2 syslog servers
● Builders (minimize Internet traffic)

Single datacenter archival

/orgClk/15mt/0.log4181.seg.gz

Multiple datacenter archival

/dc1/orgClk/15mt/0.log4181.seg.gz

● dc1 - datacenter
● orgClk - event type (orgClk means organic search result click)
● 15mt0 - 25-bit timestamp prefix, base 32 (~17-minute time period)
● 4181 - random number

Datacenter dirs avoid collisions

~$ ls */orgClk/15mt/0*
dc1/orgClk/15mt/0.log1481.seg.gz
dc3/orgClk/15mt/0.log1481.seg.gz

Same segment filename, different datacenters

Independent Builders

uid=18ducm8u50nk23qh

Date = 2014-01-10, Time = 09:35:24.357
Server id = 1512
App instance id = 2
UID Version = 0
Random value = 3921

UID breakdown

Using server ID for uniqueness

Each datacenter gets 256 server IDs

1. DC #1 uses 0 - 255
2. DC #2 uses 256 - 511
3. DC #3 uses 512 - 767
4. ...
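That allocation makes recovering the datacenter from a server ID a single division; a sketch (the function name is illustrative):

```python
def datacenter_for_server_id(server_id: int) -> int:
    # Each datacenter owns a contiguous block of 256 server IDs
    return server_id // 256 + 1

print(datacenter_for_server_id(0))    # DC #1
print(datacenter_for_server_id(300))  # DC #2
print(datacenter_for_server_id(512))  # DC #3
```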

The next years (2009 - 2011)

● Multiple datacenters
● 2 logrepo servers
● syslog-ng
● Builder

● Consumer datacenter
● Reader
● Consumers

More logentries

More consumers

Diverse requests

Single server disk bottleneck

Scaling logrepo reads

Bottleneck: single active Reader server

Goal: spread logrepo accesses across a cluster of servers

Read logrepo from HDFS

Hadoop Distributed File System (HDFS)

“a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.”

http://hadoop.apache.org/docs/stable1/hdfs_design.html

Using HDFS for logrepo access


Resilient logrepo in HDFS

Store each logentry on 3 servers

Push to HDFS quickly

Mirror every segment file into HDFS

Push to HDFS quickly

/dc1/orgClk/15mt/0.log4181.seg.gz

5-char UID prefix, base 32
~17-minute time period

500,000+ files per day

HDFS optimized for fewer files

Reducing the number of logrepo files in HDFS keeps us efficient

HDFSArchiver

Archive yesterday in HDFS

/dc1/orgClk/15mt/0.log4181.seg.gz

● orgClk - type
● 15mt - 20-bit timestamp prefix (~9.3 hour period)

2,500 files per day

Scaling logrepo in HDFS

500,000+ files per day

2,500 files per day

Logrepo: A distributed event logging system

Created @IndeedEng
● Application
● SyslogTcpAppender
● Builder
● Reader
● HDFSPusher
● HDFSReader
● HDFSArchiver

Open source
● log4j
● syslog-ng
● gzip
● rsync+ssh
● Hadoop

All time logrepo = 150 TB compressed

jobsearch event set

(word cloud of 100+ jobsearch log entry keys, e.g. ckcnt, cksz, ojc, ojclong, sjc, totcnt, tottime)

logrepo type set

(word cloud of 100+ event types, e.g. jobsearch, viewjob, mobsearch, indeedapply, jobalert, resumesearch)

Every day at Indeed

● Create 5 billion log entries

● App spends 0.03 ms to create each log entry

● Add 500 GB to the archive

● Add 1.5 TB to HDFS

● Consumers read from HDFS at 18.5 GB/s

● 100s of consumers request 1000 different logrepo types

Four types of consumers

Ad-hoc command line

Standard Java programs

Hadoop map/reduce

Real-time monitoring

$ echo 1388556000000 1388642400000 jobsearch \
  | nc logrepo 9999
uid=18d6666o916r15g3&type=jobsearch&q=VP+IT
uid=18d6666ob0mp27aa&type=jobsearch&q=Lab+Tech
uid=18d6666ob0nl15ce&type=jobsearch&q=daycare
uid=18d6666og0nk24rb&type=jobsearch&q=Chef+Upscale
...

Command line access

Reuses standard unix tools and patterns

$ echo 1388556000000 1388642400000 jobsearch \
  | nc logrepo 9999 \
  | egrep -o '&searchTime=[^&]+' \
  | egrep -o '[0-9]+' \
  | sort -r -n \
  | head

Slowest searches from log entries

Programmatic access is trivial

We have clients for

● java

● python

● php

● pig

A typical logrepo consumer (single machine)

Reads one primary log event type

Reads a dozen child events per primary

Total size of each event set = 10KB

A typical logrepo consumer (single machine)

Millions of events read per run

Thousands of consumers run each day

Tens of terabytes processed each day

Efficient Parsing

Important for single machine consumers

Log entry parsing too slow

Fast

Minimize memory usage

URL String Parsing (now available on github)

● 4x faster than String.split(...), generates 50% less garbage
● Parses 1 million log entries of size 0.5K each in 3 seconds

https://github.com/indeedeng

http://go.indeed.com/urlparsing

Hadoop clients

Reliable, scalable, distributed computing

Most new consumers use Hadoop

Read log entries directly from HDFS

Divide and conquer to scale

Monitoring

Want to monitor

● Business metrics

● Operational metrics

“Available soon” isn’t good enough

Datadog

Third party monitoring service

Stream metrics to Datadog HQ

Real-time dashboards

Datadog

miniEPL

'jobsearch.organic_clk': "SELECT COUNT(*), 'clicks' AS unit FROM orgClk",

'jobsearch.totTime': "SELECT int(totTime), 'ms' AS unit FROM jobsearch(totTime IS NOT NULL)",

'mobile.mobsearch.oji': "SELECT tupleCount(orgRes), 'results' AS unit FROM mobsearch",

Getting logs into Datadog

Data redundancy

Replaying events

Click charging

Replaying events

1. Job alert email sign up broke for logged in users
2. Got alert parameters + jobsearch uid from access logs
3. Got account id from jobsearch log entries
4. Recreated job alert sign ups

Click charging

1. Store sponsored click data in database
2. Log sponsored click data to logrepo
3. Verify logs match database
4. Charge for clicks
5. Profit!

What does logrepo enable?

Answering business and operational questions

Data-driven decisions

Average cover letter length inside US vs. outside US?

Mobile searches per hour inJP vs. UK?

Resume creation by country?

Email alert opens by email domain?

Percent of app downloads fromiOS, Android, Windows?

How quickly does a datacenter take on traffic after a failover?

Q & A

https://github.com/indeedeng

http://go.indeed.com/urlparsing

Next @IndeedEng Talk
Big Value from Big Data: Building Decision Trees at Scale

Andrew Hudson, Indeed CTO
February 26, 2014

http://engineering.indeed.com/talks
