[@indeedeng] logrepo: enabling data-driven decisions

go.indeed.com/IndeedEngTalks

Upload: indeedeng

Post on 21-Nov-2014


Category:

Technology



DESCRIPTION

Video available at: http://youtu.be/y0WC1cxLsfo

At Indeed our applications generate billions of log events each month across our seven data centers worldwide. These events store user and test data that form the foundation for decision making at Indeed. We built a distributed event logging system, called Logrepo, to record, aggregate, and access these logs. In this talk, we'll examine the architecture of Logrepo and how it evolved to scale.

Jeff Chien joined Indeed as a software engineer in 2008. He's worked on jobsearch frontend and backend, advertiser, company data, and apply teams and enjoys building scalable applications.

Jason Koppe is a Systems Administrator who has been with Indeed since late 2008. He's worked on infrastructure automation, monitoring, application resiliency, incident response and capacity planning.

TRANSCRIPT

Page 2: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Logrepo

Enabling Data-Driven Decisions

Page 3: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Jeff Chien

Software Engineer

Indeed Apply Team

Page 4: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Scale

More job searches worldwide than any other employment website.

● Over 100 million unique users
● Over 3 billion searches per month
● Over 24 million jobs
● Over 50 countries
● Over 28 languages

Page 5: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

I help people get jobs.

Page 11: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

1. Search

2. View job

3. Click “Apply Now”

4. Submit application

Job seeker flow using Indeed Apply

Page 12: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Knowing how users interact with our system

helps us make better products

Page 13: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Likelihood of applying to a job

[Chart comparing job seekers who have to upload a resume with those who have an Indeed Resume]

Page 14: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

We Have Questions

● What percentage of applications use Indeed resumes?

● How many searches for “java” in “Austin”?

● How often are resumes edited?

● How long does it take to aggregate jobs?

Page 15: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Complicated Questions

How many applications … to jobs from CareerBuilder … by job seekers who searched for “java” in “Austin” … used an Indeed resume?

Is the percentage different on mobile compared to web?

How much has this changed from 2011 to 2014?

Page 16: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

More Information

Better Decisions

Page 17: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

More information

Need to log events

● job searches

● clicks

● applies

Page 18: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

What to log

Client information - unique user identifier, user agent, ip address…

User behavior - clicks, alert signups…

Performance - backend request duration, memory usage...

A/B test groups - control and test groups

Page 19: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Better decisions

Use empirical data to make decisions

Not on assumptions or the highest-paid person’s opinion!

Page 20: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Objective

Collect data on user actions and system performance from many different applications in multiple data centers

Page 21: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

How we build systems

Simple

Fast

Resilient

Scalable

Page 22: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Simple

Easy interface

Reuse familiar technologies

Page 23: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Fast

No impact to runtime performance

Data available soon

Page 24: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Resilient

Does not lose data in spite of system or network failures

Page 25: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Scalable

Can handle large quantities of data

Page 31: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Requirements

Powerful enough to express diverse data

Store all data forever

Events stored at least once

Easy to add new data to logs

Easy to access logs in bulk

Time range based access

Page 32: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Non-Goals

Random access to individual events

Real time access to events

Complex data types

Page 33: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Logrepo

A distributed event logging system

Est. 2006

Page 34: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Logrepo stores log entries

Everything is a string

Key/value pairs

URL-encoded

Page 35: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Organic click log entry

uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&href=http%3A%2F%2Fwww.indeed.com%2Fjobs%3Fq%3D%26l%3DNewburgh%252C%2BNY%26start%3D20&agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%3B+rv%3A26.0%29+Gecko%2F20100101+Firefox%2F26.0&raddr=173.50.255.255&ckcnt=17&cksz=1033&ctk=18dtbc6960nk20vd&ctkRcv=1&&

Page 36: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

URL-decoded organic click log entry

uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http://www.indeed.com/rc/clk&href=http://www.indeed.com/jobs?q=&l=Newburgh%2C+NY&start=20&agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0&...

Page 42: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Advantages

Human-readable

Arbitrary keys

Low overhead to add new key/value pairs

Self-describing

Easy to parse in any language
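
As an illustration of how little code such parsing takes, here is a minimal Java sketch. The class name `LogEntryParser` is hypothetical and is not part of Indeed's libraries; the only assumption is the format shown on the slides (pairs joined by `&`, keys and values URL-encoded).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses one URL-encoded key/value log entry. */
final class LogEntryParser {
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps key order, uid first
        for (String pair : line.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate empty segments such as "&&"
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            fields.put(decode(key), decode(value));
        }
        return fields;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        Map<String, String> entry = parse(
            "uid=18dtbolr20nk23qh&type=orgClk&jobId=500&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&&");
        System.out.println(entry.get("type") + " " + entry.get("url"));
    }
}
```

Because every value is a string, callers convert numeric fields themselves (for example `Integer.parseInt(entry.get("jobId"))`).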

Page 43: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Required log entry keys

Every log entry has uid and type

Type is an arbitrary string

uid=18dtbolr20nk23qh&type=orgClk&...

Page 44: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

UID format

uid=18ducm8u50nk23qh&type=jobsearch&...

UID is always the first key

Unique

16 characters

Base 32 [0-9a-v]

Page 45: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

UID breakdown

uid=18ducm8u50nk23qh

Date = 2014-01-10

Time = 09:35:24.357

Server id = 1512

App instance id = 2

UID Version = 0

Random value = 3921
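
The breakdown can be reproduced with a few bit operations. The sketch below is an assumption, not Indeed's implementation: the field widths (45-bit millisecond timestamp, 16-bit server id, 4-bit app instance id, 2-bit version, 13-bit random value) are inferred from the example UID and its decoded values on this slide, and the class name `UidDecoder` is hypothetical.

```java
import java.time.Instant;

/** Hypothetical decoder; the bit layout is inferred from the slide's example UID. */
final class UidDecoder {
    final long timestampMillis; // first 9 characters = 45 bits, ms since the Unix epoch
    final int serverId;         // next 16 bits
    final int appInstanceId;    // next 4 bits
    final int version;          // next 2 bits
    final int random;           // last 13 bits (0..8191)

    UidDecoder(String uid) {
        if (uid.length() != 16) {
            throw new IllegalArgumentException("expected 16 base-32 characters: " + uid);
        }
        // Radix 32 in Long.parseLong uses the digits 0-9 and a-v.
        timestampMillis = Long.parseLong(uid.substring(0, 9), 32);
        long rest = Long.parseLong(uid.substring(9), 32); // remaining 35 bits
        serverId = (int) ((rest >>> 19) & 0xFFFF);
        appInstanceId = (int) ((rest >>> 15) & 0xF);
        version = (int) ((rest >>> 13) & 0x3);
        random = (int) (rest & 0x1FFF);
    }

    public static void main(String[] args) {
        UidDecoder uid = new UidDecoder("18ducm8u50nk23qh");
        // Printed in UTC: 2014-01-10T15:35:24.357Z
        System.out.println(Instant.ofEpochMilli(uid.timestampMillis));
        // 1512 2 0 3921
        System.out.println(uid.serverId + " " + uid.appInstanceId + " " + uid.version + " " + uid.random);
    }
}
```

The decoded instant is 15:35:24.357 UTC, which corresponds to the 09:35:24.357 on the slide if the slide uses US Central time (UTC-6).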

Page 46: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

UID generation

Unique IDs are unique

Random value avoids UID collisions

Random value is between 0 and 8191

Up to 8000 events per application instance per millisecond
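
A generator along these lines could look like the following sketch. It is an assumption rather than Indeed's code: the class name `UidGenerator` is hypothetical, and the field widths (45-bit timestamp, 16-bit server id, 4-bit app instance id, 2-bit version, 13-bit random value) are inferred from the example UID in this talk.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical generator; field widths are inferred from the talk's example UID. */
final class UidGenerator {
    /** Packs the fields into 16 base-32 characters (digits 0-9, a-v). */
    static String encode(long timestampMillis, int serverId, int appInstanceId, int version, int random) {
        long rest = ((long) serverId << 19) | ((long) appInstanceId << 15) | ((long) version << 13) | random;
        return pad(Long.toString(timestampMillis, 32), 9) + pad(Long.toString(rest, 32), 7);
    }

    /** Creates a UID for "now" with a random value between 0 and 8191. */
    static String next(int serverId, int appInstanceId) {
        int random = ThreadLocalRandom.current().nextInt(8192);
        return encode(System.currentTimeMillis(), serverId, appInstanceId, 0, random);
    }

    private static String pad(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(1389368124357L, 1512, 2, 0, 3921)); // 18ducm8u50nk23qh
        System.out.println(next(1512, 2));
    }
}
```

Because the timestamp occupies the leading characters and the base-32 alphabet sorts the same way in ASCII as numerically, sorting UIDs as plain strings sorts them by time.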

Page 47: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

UID format benefits

Contains useful metadata

Compact format reduces memory requirements

Easy to compare or sort events by time

Page 48: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Job seeker events

1. Search for jobs

2. Click on job

3. Apply to job

All events are part of the same flow

Page 49: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Parent-child relationships between events

Events can reference other events with &tk=18ducm8u50nk23qh...

Children know their parents

Parents don’t know their children

Extremely powerful model

Page 50: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Parent-child relationships between events

An organic click points to the search it occurred on

uid=18dtbnn3p0nk20g9&type=jobsearch&v=0&...

uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&...
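
Since only children carry the link, finding the children of a search means scanning child events and grouping them by their `tk` value. A minimal sketch of that grouping follows; the class and method names are hypothetical, and for brevity it reads raw (still URL-encoded) values, which is sufficient for UIDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups child event UIDs under the parent UID found in tk. */
final class ParentChildJoin {
    /** Returns the raw value for key, or null if the entry has no such key. */
    static String field(String entry, String key) {
        for (String pair : entry.split("&")) {
            if (pair.startsWith(key + "=")) {
                return pair.substring(key.length() + 1);
            }
        }
        return null;
    }

    static Map<String, List<String>> childrenByParent(List<String> entries) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (String entry : entries) {
            String parent = field(entry, "tk");
            if (parent != null) {
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(field(entry, "uid"));
            }
        }
        return children;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
            "uid=18dtbnn3p0nk20g9&type=jobsearch&v=0",
            "uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9");
        System.out.println(childrenByParent(entries)); // {18dtbnn3p0nk20g9=[18dtbolr20nk23qh]}
    }
}
```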

Page 51: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

More jobsearch child events

Sponsored job clicks

Javascript errors

Job alert signups

And many more...

Page 52: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Job seeker views a job

uid=18en3o3ov16r25rp&type=viewjob&...

1. job view: 18en3o3ov16r25rp

2. load IndeedApply

3. user submission

4. post to employer

Page 53: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Indeed Apply loads

uid=18en3o3s216ph6d5&type=loadJs&vjtk=18en3o3ov16r25rp&...

1. job view: 18en3o3ov16r25rp

2. load IndeedApply: 18en3o3s216ph6d5

3. user submission

4. post to employer

Page 54: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Prepare job application

uid=18en3qe0u16pi5ct&type=appSubmit&loadJsTk=18en3o3s216ph6d5&...

1. job view: 18en3o3ov16r25rp

2. load IndeedApply: 18en3o3s216ph6d5

3. user submission: 18en3qe0u16pi5ct

4. post to employer

Page 55: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Submit job application

POST /apply HTTPS/1.1
Host: employer.com

{
  "applicant": {
    "name": "John Doe",
    "email": "[email protected]",
    "phone": "555-555-5555",
  },
  "jobTitle": "Software Engineer"
  ...

uid=18en3qe2r0nji3h6&type=postApp&appSubmitTk=18en3qe0u16pi5ct&...

1. job view: 18en3o3ov16r25rp

2. load IndeedApply: 18en3o3s216ph6d5

3. user submission: 18en3qe0u16pi5ct

4. post to employer: 18en3qe2r0nji3h6

Page 56: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Javascript latency ping

At start of page load, browser executes js to ping Indeed

Server receives the ping and logs an event

Page 57: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Parent job search and child js latency ping

uid=18dqpc3lm16pi2an&type=jobsearch&...

uid=18dqpc3s516pi566&type=lat&tk=18dqpc3lm16pi2an

Page 58: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Subtracting UID timestamps yields duration

uid=18dqpc3s516pi566&type=lat&tk=18dqpc3lm16pi2an

uid timestamp: Jan 9, 2014 00:00:05.253

tk timestamp: Jan 9, 2014 00:00:05.046

Latency = 1389247205253 - 1389247205046 = 207 ms

Approximates perceived latency to jobseeker
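
The subtraction can be done directly on the two UIDs. This sketch assumes, based on the numbers on this slide, that the first 9 base-32 characters of a UID hold the millisecond timestamp; the class name `UidLatency` is hypothetical.

```java
/** Hypothetical helper: derives a duration from a parent UID (tk) and a child UID. */
final class UidLatency {
    /** Assumes the first 9 base-32 characters encode ms since the Unix epoch. */
    static long timestampMillis(String uid) {
        return Long.parseLong(uid.substring(0, 9), 32);
    }

    static long latencyMillis(String parentUid, String childUid) {
        return timestampMillis(childUid) - timestampMillis(parentUid);
    }

    public static void main(String[] args) {
        // tk (job search) and uid (latency ping) from the slide
        System.out.println(latencyMillis("18dqpc3lm16pi2an", "18dqpc3s516pi566")); // 207
    }
}
```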

Page 59: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

West coast perceived latency in California vs. Washington

Page 60: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Writing log entries from apps

LogEntry entry = factory.createLogEntry("search");

entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...

entry.commit();

Page 61: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Creating a log entry

LogEntry entry = factory.createLogEntry("search");

Creates a log entry with UID and type set

UID timestamp tied to createLogEntry() call

Page 62: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Populating a log entry

entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...

Page 63: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Lists

Separate values with commas

String groups = "foo,bar,baz";

logEntry.setProperty("grps", groups);

// uid=...&grps=foo%2Cbar%2Cbaz&...

Page 64: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Lists of Tuples

Encapsulate each tuple in parentheses

Comma-separate elements within tuple

// Two jobs with (job id, score)
String jobs = "(123,1.0)(400,0.8)";

logEntry.setProperty("jobs", jobs);

// uid=...&jobs=%28123%2C1.0%29%28400%2C0.8%29&...
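
The encoded forms in the comments above match what standard URL encoding produces. The sketch below shows this with `java.net.URLEncoder`; the class and method names are hypothetical, and whether Logrepo itself uses `URLEncoder` is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds a list-of-tuples value and URL-encodes it. */
final class TupleListEncoding {
    /** Formats parallel arrays as "(id,score)(id,score)...". */
    static String tuples(long[] jobIds, double[] scores) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < jobIds.length; i++) {
            sb.append('(').append(jobIds[i]).append(',').append(scores[i]).append(')');
        }
        return sb.toString();
    }

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String jobs = tuples(new long[] {123, 400}, new double[] {1.0, 0.8});
        System.out.println(jobs);                  // (123,1.0)(400,0.8)
        System.out.println(encode(jobs));          // %28123%2C1.0%29%28400%2C0.8%29
        System.out.println(encode("foo,bar,baz")); // foo%2Cbar%2Cbaz
    }
}
```

A consumer reverses the steps: URL-decode the value, then split on commas (lists) or on parentheses and commas (tuples).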

Page 65: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Committing a log entry

After log entry is fully populated...

entry.commit();

Page 66: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Jason Koppe

System Administrator

Page 67: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

I engineer systems that help people get jobs.

Page 68: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Before logrepo

Page 69: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Before logrepo

Page 70: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

log4j - Java logging framework

● Code - what
● Configuration - define what goes to where
● Appender - where (file, smtp)

http://logging.apache.org/log4j/1.2/

Page 71: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Before logrepo

Page 72: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reusing log4j for logrepo

Page 73: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Redundancy from the start

Write to local disk (FileAppender)

Write to remote server #1 (? Appender)

Write to remote server #2 (? Appender)

Page 74: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Writing to a remote server

syslog

Protocol for transporting messages across an IP network

Est. 1980s

http://tools.ietf.org/html/rfc5424

Page 75: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Using log4j with syslog

Out-of-the-box, log4j only supported UDP syslog

UDP could result in data loss

Page 76: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Avoiding data loss

TCP acknowledges and retransmits data, so delivery is reliable

Use TCP!

Page 77: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Creating a reliable Appender

SyslogTcpAppender
● created by Indeed
● TCP-enabled log4j syslog Appender
● buffers messages before transport

Resilient to short network and syslog server downtimes

Page 78: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Choosing a syslog daemon

syslog-ng

syslog daemon which supports TCP

Est. 1998

http://www.balabit.com/network-security/syslog-ng

Page 79: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Redundancy with log4j

Write to local disk (FileAppender)

Write to remote server #1 (SyslogTcpAppender)

Write to remote server #2 (SyslogTcpAppender)

Page 80: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Redundancy over TCP

Page 81: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Each syslog-ng server

receives unsorted log entries

immediately flushes entries to files on disk called raw logs

Page 82: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Quick redundancy over TCP

Page 83: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Optimized for redundancy

raw logs are probably out-of-order

each app writes to syslog independently

Page 84: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Optimize for read access patterns

LogRepositoryBuilder (“Builder”)
● sort
● deduplicate
● compress
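
The three Builder steps can be sketched in a few lines of Java. This is an illustration under assumptions, not the Builder's actual code: the class name `SegmentSketch` is hypothetical, entries are treated as whole lines, and a `TreeSet` stands in for whatever sorting and deduplication the real Builder performs.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of sort, deduplicate, compress for one segment. */
final class SegmentSketch {
    static byte[] buildSegment(Collection<String> rawEntries) {
        // Every line starts with "uid=" plus a fixed-width UID, so ordering whole
        // lines orders by UID; the set also drops exact duplicate lines.
        TreeSet<String> sorted = new TreeSet<>(rawEntries);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            for (String entry : sorted) {
                out.write(entry);
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<String> readSegment(byte[] segment) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(segment)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                entries.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        byte[] segment = buildSegment(Arrays.asList(
            "uid=15mt000020k8&type=orgClk&v=1&k=2",
            "uid=15mt000000k1&type=orgClk&v=1&k=4",
            "uid=15mt000020k8&type=orgClk&v=1&k=2"));
        System.out.println(readSegment(segment)); // two entries, sorted by UID
    }
}
```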

Page 85: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Builder architecture

Page 86: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Builder architecture

Page 87: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Builder architecture

Page 88: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Builder architecture

Page 89: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Builder creates segment files

uid=15mt000000k1&type=orgClk&v=1&k=4...
uid=15mt000010k7&type=orgClk&v=1&k=3...
uid=15mt000020k8&type=orgClk&v=1&k=2...
uid=15mt000030ss&type=orgClk&v=1&k=9...

Page 90: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Repeated strings compress well

uid=15mt000000k1&type=orgClk&v=1&k=4...
uid=15mt000010k7&type=orgClk&v=1&k=3...
uid=15mt000020k8&type=orgClk&v=1&k=2...
uid=15mt000030ss&type=orgClk&v=1&k=9...

compresses by 85%

Page 91: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

logentry type

Page 92: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

4-char UID prefix, base 32

Page 93: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

4-char UID prefix, base 32

~9.3 hour time period

Page 94: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

5-char UID prefix, base 32

Page 95: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

5-char UID prefix, base 32

~17 minute time period

Page 96: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

unique number

Page 97: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive directory structure

/orgClk/15mt/0.log4181.seg.gz

unique number

Supports more than 1 segment file per type per 5-char UID prefix
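
The path components and the two time periods follow from the UID itself. The sketch below assumes a 45-bit millisecond timestamp in the first 9 base-32 characters (5 bits each), which is consistent with the ~9.3 hour and ~17 minute figures on these slides; the class and method names are hypothetical.

```java
/** Hypothetical helper: maps a UID to its archive location and prefix time span. */
final class ArchiveLayout {
    /** Glob matching every segment file that could hold the given UID. */
    static String segmentGlob(String type, String uid) {
        return "/" + type + "/" + uid.substring(0, 4) + "/" + uid.charAt(4) + ".log*.seg.gz";
    }

    /** Milliseconds covered by a UID prefix of the given length (at most 9 characters). */
    static long prefixSpanMillis(int prefixChars) {
        return 1L << (5 * (9 - prefixChars));
    }

    public static void main(String[] args) {
        System.out.println(segmentGlob("orgClk", "15mt00l710k3262q")); // /orgClk/15mt/0.log*.seg.gz
        System.out.println(prefixSpanMillis(4) / 3600000.0 + " hours");  // about 9.3
        System.out.println(prefixSpanMillis(5) / 60000.0 + " minutes");  // about 17.5
    }
}
```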

Page 98: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Multiple segment files

Keep Builder memory usage fixed

When Builder memory fills, it flushes to disk

Each flush creates files for 5-char UID prefix

Page 101: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Builder creates the archive

Page 102: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Redundancy

Page 103: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Redundancy

Page 104: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Ensure archive consistency

● Delayed Builder on second server
● Add new segment files for log entries missed by first Builder
● Causes multiple segment files for a 5-char UID prefix

Page 105: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Providing access to logrepo

LogRepositoryReader (“Reader”)
● simple request protocol
● reads from (multiple) segment files
● provides sorted stream of entries to TCP client as quickly as possible

Page 106: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reader request protocol

1. Start time
2. End time
3. Logrepo type

Page 107: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reader request using netcat

$ echo 1295905740000 1295913600000 orgClk

start time (ms since 1970-01-01, the start of Unix time)

Page 108: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reader request using netcat

$ echo 1295905740000 1295913600000 orgClk

end time (ms since 1970-01-01)

Page 109: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reader request using netcat

$ echo 1295905740000 1295913600000 orgClk

logrepo type

Page 110: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reader request using netcat

$ echo 1295905740000 1295913600000 orgClk \

| nc 192.168.0.1 9999

send echo across a TCP session

Page 111: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reader request using netcat

$ echo 1295905740000 1295913600000 orgClk \

| nc 192.168.0.1 9999

uid=15mt00l710k3262q&type=orgClk&v=0&...

uid=15mt00l780k137d9&type=orgClk&v=0&...

...

uid=15mt7ggvj142h06k&type=orgClk&v=0&...

UID-sorted results

Page 112: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

1. Isolate to the type directory

1295905740000 1295913600000 orgClk

Page 113: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

2. Convert request timestamps to UID prefix

uidPrefixFromTime(1295905740000) = 15mt0

uidPrefixFromTime(1295913600000) = 15mt7

1295905740000 1295913600000 orgClk
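
One way to implement the conversion is sketched below. The slides name `uidPrefixFromTime` but do not show its body, so this version is an assumption: it drops the low 20 bits of the millisecond timestamp and writes the rest in base 32, which reproduces both values shown above. The class name and the `prefixesBetween` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the timestamp-to-prefix conversion. */
final class UidPrefix {
    /** 5-character base-32 prefix: the timestamp without its low 20 bits. */
    static String uidPrefixFromTime(long millis) {
        String digits = Long.toString(millis >>> 20, 32);
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < 5; i++) {
            sb.append('0'); // left-pad very small timestamps
        }
        return sb.append(digits).toString();
    }

    /** Every 5-character prefix from the start time through the end time. */
    static List<String> prefixesBetween(long startMillis, long endMillis) {
        List<String> prefixes = new ArrayList<>();
        for (long p = startMillis >>> 20; p <= endMillis >>> 20; p++) {
            prefixes.add(uidPrefixFromTime(p << 20));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        System.out.println(uidPrefixFromTime(1295905740000L)); // 15mt0
        System.out.println(uidPrefixFromTime(1295913600000L)); // 15mt7
        System.out.println(prefixesBetween(1295905740000L, 1295913600000L));
    }
}
```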

Page 114: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

3. Find segments matching first UID prefix

$ ls orgClk/15mt/0*
orgClk/15mt/0.log3094.seg.gz
orgClk/15mt/0.log4181.seg.gz

1295905740000 1295913600000 orgClk

15mt0

Page 115: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

4. Read sorted segments simultaneously, merge into a single sorted stream

/orgClk/15mt/0.log3094.seg.gz:
uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...

/orgClk/15mt/0.log4181.seg.gz:
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...

1295905740000 1295913600000 orgClk

Page 116: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

4. Read sorted segments simultaneously, merge into a single sorted stream

/orgClk/15mt/0.log3094.seg.gz:
uid=15mt000080g1i0j5&type=orgClk&... (1)
uid=15mt00l780k137d9&type=orgClk&... (3)

/orgClk/15mt/0.log4181.seg.gz:
uid=15mt00l710k3262q&type=orgClk&... (2)
uid=15mt00l790k1i2rs&type=orgClk&... (4)

1295905740000 1295913600000 orgClk

Page 117: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

4. Read sorted segments simultaneously, merge into a single sorted stream

uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...

1295905740000 1295913600000 orgClk

Page 118: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

5. Only return log entries between timestamps

uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...

1295905740000 1295913600000 orgClk

Page 119: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

6. Read segments for each UID prefix, one prefix at a time

1295905740000 1295913600000 orgClk

15mt0 15mt1 15mt2 15mt3 15mt4 15mt5 15mt6 15mt7

Page 120: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reading entries from archive

7. Stop reading files when entry crosses request boundary

1295905740000 1295913600000 orgClk
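
Steps 4, 5, and 7 amount to a k-way merge with a time filter. The sketch below is an illustration, not the Reader's code: the class name `SegmentMerge` is hypothetical, segments are modeled as in-memory lists, the timestamp is assumed to be the first 9 base-32 characters of the UID, and treating the end time as exclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merge sorted segments and keep entries in a time range. */
final class SegmentMerge {
    /** Each entry starts with "uid=" followed by the 16-character UID. */
    static long uidMillis(String entry) {
        return Long.parseLong(entry.substring(4, 13), 32);
    }

    static List<String> mergeInRange(List<List<String>> segments, long startMillis, long endMillis) {
        // Heap elements are {segment index, position}, ordered by the entry they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) ->
            segments.get(a[0]).get(a[1]).compareTo(segments.get(b[0]).get(b[1])));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[] {s, 0});
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            String entry = segments.get(head[0]).get(head[1]);
            long time = uidMillis(entry);
            if (time >= endMillis) {
                break; // stream is sorted, so nothing later can be in range
            }
            if (time >= startMillis) {
                result.add(entry);
            }
            if (head[1] + 1 < segments.get(head[0]).size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("uid=15mt000080g1i0j5&type=orgClk", "uid=15mt00l780k137d9&type=orgClk"),
            Arrays.asList("uid=15mt00l710k3262q&type=orgClk", "uid=15mt00l790k1i2rs&type=orgClk"));
        for (String entry : mergeInRange(segments, 1295905740000L, 1295913600000L)) {
            System.out.println(entry);
        }
    }
}
```

With the slide's data, the entry `15mt000080g1i0j5` is read but not returned, because its timestamp falls just before the requested start time.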

Page 123: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

The first years (2007 & 2008)

● Single datacenter
● App servers
● 2 logrepo servers
● syslog-ng
● Builder
● Reader

Page 126: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Growth

job seekers

products

datacenters

Page 127: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Growth

log entries

Page 128: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Multi-datacenter rationale

Latency

Redundancy

Page 129: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Multi-datacenter rationale

Job seekers

Page 130: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Logrepo in multiple datacenters

● Single datacenter
● Consumers
● Reader

● Every datacenter
● Applications producing logentries
● 2 syslog servers
● Builders (minimize Internet traffic)

Page 132: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Single datacenter archival

/dc1/orgClk/15mt/0.log4181.seg.gz

event type (orgClk means organic search result click)

25-bit timestamp prefix, base 32
~17-minute time period

random number

Page 133: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Multiple datacenter archival

/dc1/orgClk/15mt/0.log4181.seg.gz

event type (orgClk means organic search result click)

25-bit timestamp prefix, base 32
~17-minute time period

random number

datacenter

Page 134: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Datacenter dirs avoid collisions

~$ ls */orgClk/15mt/0*

dc1/orgClk/15mt/0.log1481.seg.gz

dc3/orgClk/15mt/0.log1481.seg.gz

Different datacenters

Page 135: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Datacenter dirs avoid collisions

~$ ls */orgClk/15mt/0*

dc1/orgClk/15mt/0.log1481.seg.gz

dc3/orgClk/15mt/0.log1481.seg.gz

Same segment filename

Independent Builders

Page 136: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

UID breakdown

uid=18ducm8u50nk23qh

Date = 2014-01-10

Time = 09:35:24.357

Server id = 1512

App instance id = 2

UID Version = 0

Random value = 3921

Page 138: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Using server ID for uniqueness

Each datacenter gets 256 server IDs

1. DC #1 uses 0 - 255
2. DC #2 uses 256 - 511
3. DC #3 uses 512 - 767
4. ...
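
Under this allocation, a server ID identifies its datacenter by integer division. The sketch below assumes the pattern of 256 IDs per datacenter continues past the three ranges listed; the class and method names are hypothetical.

```java
/** Hypothetical helper for the 256-IDs-per-datacenter allocation. */
final class ServerIds {
    static final int IDS_PER_DATACENTER = 256;

    /** 1-based datacenter number for a server ID (DC #1 owns 0-255). */
    static int datacenterNumber(int serverId) {
        return serverId / IDS_PER_DATACENTER + 1;
    }

    /** First server ID of a 1-based datacenter number. */
    static int firstServerId(int datacenterNumber) {
        return (datacenterNumber - 1) * IDS_PER_DATACENTER;
    }

    public static void main(String[] args) {
        System.out.println(datacenterNumber(255));  // 1
        System.out.println(datacenterNumber(256));  // 2
        System.out.println(datacenterNumber(1512)); // 6, for the server id in the UID breakdown
    }
}
```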

Page 139: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

The next years (2009 - 2011)

● Multiple datacenters
● 2 logrepo servers
● syslog-ng
● Builder

● Consumer datacenter
● Reader
● Consumers

Page 140: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

More logentries

More consumers

Page 141: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Diverse requests

Page 142: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Single server disk bottleneck

Page 143: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Scaling logrepo reads

Bottleneck: single active Reader server

Goal: spread logrepo accesses across a cluster of servers

Page 144: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Read logrepo from HDFS

Hadoop Distributed File System (HDFS)

“a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.”

http://hadoop.apache.org/docs/stable1/hdfs_design.html

Page 145: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Using HDFS for logrepo access

Page 146: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Using HDFS for logrepo access

Page 147: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Using HDFS for logrepo access

Page 148: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Resilient logrepo in HDFS

Store each logentry on 3 servers

Page 149: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Push to HDFS quickly

Mirror every segment file into HDFS

Page 150: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Push to HDFS quickly

/dc1/orgClk/15mt/0.log4181.seg.gz

5-char UID prefix, base 32
~17-minute time period

500,000+ files per day

Page 151: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

HDFS optimized for fewer files

Reducing the number of logrepo files in HDFS keeps us efficient

Page 152: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

HDFS optimized for fewer files

Reducing the number of logrepo files in HDFS keeps us efficient

HDFSArchiver

Page 153: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Archive yesterday in HDFS

/dc1/orgClk/15mt/0.log4181.seg.gz

20-bit timestamp prefix
~9.3 hour period

2,500 files per day

type

Page 154: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Scaling logrepo in HDFS

500,000+ files per day

2,500 files per day

Page 165: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Logrepo

A distributed event logging system

Created @IndeedEng
● Application
● SyslogTcpAppender
● Builder
● Reader
● HDFSPusher
● HDFSReader
● HDFSArchiver

Open source
● log4j
● syslog-ng
● gzip
● rsync+ssh
● Hadoop

Page 166: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

All time logrepo = 150 TB compressed

Page 167: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

jobsearch event set

abredistimeacmetimeaddltimeadscadsdelayadsibadscbadsiboostojcboostojibsjcbsjcwiabsjibsjindappliesbsjindappviewsbsjrevbsjwiackcntckszcountsctkagectkagedaysdayofweekdcpingtimedomTotalTimeds-mpo

dsmissdstimefeatempfjfreekwacfreekwarevfreesjcfreesjrevfrmtimegalatdelayiplatiplongjslatdelayjsvdelaykwackwacdelaykwaikwarevkwcntlacinsizelacsgsizelmstimempotimemprtimenavTotTimendxtime

ojcojclongojcshortojcwiaojiojindappliesojindappviewsojwiaoocscpageprcvdlatencyprimfollowcntprvwojiprvwojlatprvwojopentimeprvwojreqradscradsirecidlookupbudgetrectimeredirCountredirTimerelfollowcntrespTimereturnvisitrojc

rojirqcntrqlcntrqqcntrrsjcrrsjirrsjrevrsavailrsjcrsjirsusedrsviableserpsizesjcsjcdelaysjclongsjcntsjcshortsjcwiasjisjindappliessjindappviewssjrevsjwiasllatsllong

sqcsqisugtimesvjsvjnostarsvjstartadsctadsitimetimeofdaytotcnttotfollowcnttotrevtottimetsjctsjcwiatsjitsjindappliestsjindappviewstsjrevtsjwiaunqcntvpwacinsizewacsgsize

Page 168: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

acmepageacmereviewmodacmeserviceacmesessionadclickadcrequestadcrevadschanneladsclickadsenseclickadveadvtagghttpaggjiraaggjobaggjob_waldorfaggsherlockaggsourcehealthagstimingapiapijsvapisearcharchiveindexarchiveindex_shingled_testbincarclicksclickclickanalyticscobranddctmismatchdrawdupepairsdupepairs_minidupepairs_olddupepairsalldupepairsall_miniejcheckeremilyops

feedbridgeglobalnavgooglebot_organichomepageimpressionindeedapplyjhstjobalertjobalertorganicjobalertsearchjobalertsponsoredjobexpirationjobexpiration2jobexpiration3jobprocessedjobqueueblockjobsearchjssquerykeywordAdlocsvclucyindexermainmechanicalturkmindyopsmobhomepagemobilmobilemobileorganicmobilesponsoredmobrecjobsmobsearchmobviewjobmyindeedmyindfunnelmyindpagemyindrezcreatemyindsessionoldopsesjasx

organicorgmodelorgmodelsubsetorgmodelsubset90passportaccountpassportpagepassportsigninramsaccessrecjobsrecommendserviceresumedataresumesearchrexcontactsrexfunnelreximpressionrexsearchrezSrchSearchrezalertrezalertfunnelrezfunnelrezjserrrezsrchrequestrezviewsearchablejobsseosessionsjmodelsponsoredsysadappinfosysadapptimingtestndxtestndx1testndx2tmpusrsvccacheusrsvcrequestviewjobwebusersignin

Page 169: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Every day at Indeed

● Create 5 billion log entries

● App spends 0.03 ms to create each log entry

● Add 500 GB to the archive

● Add 1.5 TB to HDFS

● Consumers read from HDFS at 18.5 GB/s

● 100s of consumers request 1000 different logrepo types
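
As a rough sanity check, the daily figures above can be turned into per-entry numbers. This is only a back-of-the-envelope sketch: it treats the rounded slide values as exact and assumes decimal units (1 GB = 10^9 bytes); the variable names are illustrative.

```python
# Rounded daily figures from the slide, treated as exact for illustration.
ENTRIES_PER_DAY = 5_000_000_000
CREATE_COST_US = 30                      # 0.03 ms per entry, in microseconds
ARCHIVE_BYTES_PER_DAY = 500 * 10**9      # 500 GB, assuming decimal units
HDFS_BYTES_PER_DAY = 1_500 * 10**9       # 1.5 TB, assuming decimal units

# Total application time spent creating log entries, summed over all apps.
logging_seconds = ENTRIES_PER_DAY * CREATE_COST_US // 1_000_000
logging_hours = logging_seconds / 3600

# Average bytes added per log entry in each destination.
archive_bytes_per_entry = ARCHIVE_BYTES_PER_DAY // ENTRIES_PER_DAY
hdfs_bytes_per_entry = HDFS_BYTES_PER_DAY // ENTRIES_PER_DAY

print(f"logging time: {logging_seconds} s (~{logging_hours:.1f} h) per day")
print(f"archive: ~{archive_bytes_per_entry} bytes per entry")
print(f"HDFS: ~{hdfs_bytes_per_entry} bytes per entry")
```

Under these assumptions, all applications together spend about 150,000 seconds per day creating log entries, and each entry adds roughly 100 bytes to the archive and 300 bytes to HDFS.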

Page 170: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Four types of consumers

Ad-hoc command line

Standard Java programs

Hadoop map/reduce

Real-time monitoring

Page 171: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

$ echo 1388556000000 1388642400000 jobsearch | nc logrepo 9999

uid=18d6666o916r15g3&type=jobsearch&q=VP+IT
uid=18d6666ob0mp27aa&type=jobsearch&q=Lab+Tech
uid=18d6666ob0nl15ce&type=jobsearch&q=daycare
uid=18d6666og0nk24rb&type=jobsearch&q=Chef+Upscale
...

Command line access
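
The same exchange could be driven from a program. The sketch below is a guess at the protocol based only on the command shown on this slide: one request line containing a start time, an end time (both apparently epoch milliseconds), and a log type, answered with one log entry per line. The function names are hypothetical, and half-closing the socket after the request is an assumption that mimics `echo ... | nc`.

```python
import socket
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(dt):
    """Convert an aware datetime to integer milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def build_request(start, end, log_type):
    """Format the one-line request: '<start ms> <end ms> <type>'."""
    return f"{to_millis(start)} {to_millis(end)} {log_type}\n"


def read_entries(start, end, log_type, host="logrepo", port=9999):
    """Yield log entries, one per line, for the given time range and type."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(start, end, log_type).encode("ascii"))
        # Assumption: signal end of request, as `echo ... | nc` would at EOF.
        sock.shutdown(socket.SHUT_WR)
        with sock.makefile("r", encoding="utf-8") as lines:
            for line in lines:
                yield line.rstrip("\n")
```

Calling `build_request` with 2014-01-01 06:00 UTC and 2014-01-02 06:00 UTC reproduces the two timestamps used on the slide.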

Page 172: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Reuses standard unix tools and patterns

$ echo 1388556000000 1388642400000 jobsearch | nc logrepo 9999 | egrep -o '&searchTime=[^&]+' | egrep -o '[0-9]+' | sort -r -n | head

Slowest searches from log entries
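
For comparison, roughly the same "ten largest searchTime values" computation can be sketched in Python. The function name is hypothetical, and the entries are assumed to be the `key=value&key=value` lines shown on the previous slide.

```python
import heapq
import re

# Matches the numeric value of a searchTime parameter, like the egrep pair above.
SEARCH_TIME = re.compile(r"&searchTime=(\d+)")


def slowest_searches(entries, n=10):
    """Return the n largest searchTime values found in the entries, descending."""
    times = (
        int(match.group(1))
        for entry in entries
        for match in SEARCH_TIME.finditer(entry)
    )
    # nlargest keeps only n values in memory, unlike sorting the whole stream.
    return heapq.nlargest(n, times)
```

`heapq.nlargest` avoids sorting every value, which matters when the input is a day of log entries rather than a handful of lines.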

Page 173: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Programmatic access is trivial

We have clients for

● Java

● Python

● PHP

● Pig

Page 174: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

A typical logrepo consumer (single machine)

Reads one primary log event type

Reads a dozen child events per primary

Total size of each event set = 10KB

Page 175: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

A typical logrepo consumer (single machine)

Millions of events read per run

Thousands of consumers run each day

Tens of terabytes processed each day
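
These orders of magnitude are consistent with the 10 KB event-set size from the previous slide. A quick check, assuming the lower bound of each range (one million event sets per run, one thousand runs per day), that each event read corresponds to one event set, and decimal units:

```python
EVENT_SET_BYTES = 10 * 10**3   # 10 KB per event set (slide value, decimal units assumed)
EVENTS_PER_RUN = 1_000_000     # "millions": lower bound, assumed
RUNS_PER_DAY = 1_000           # "thousands": lower bound, assumed

bytes_per_run = EVENT_SET_BYTES * EVENTS_PER_RUN
bytes_per_day = bytes_per_run * RUNS_PER_DAY

print(bytes_per_run // 10**9, "GB per run")
print(bytes_per_day // 10**12, "TB per day")
```

Even at the lower bounds this gives about 10 GB per run and 10 TB per day, so "tens of terabytes" follows once runs read more than a million event sets.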

Page 176: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Efficient Parsing

Important for single machine consumers

Log entry parsing too slow

Fast

Minimize memory usage

Page 177: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

URL String Parsing (now available on GitHub)

4x faster than String.split(...), generates 50% less garbage

Parses 1 million log entries of size 0.5K each in 3 seconds

https://github.com/indeedeng

http://go.indeed.com/urlparsing
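
The general idea behind split-free parsing can be illustrated with a short sketch. This is not the API of Indeed's Java library, and it will not reproduce the 4x figure in Python; it only shows the technique of scanning a `key=value&key=value` entry by index and extracting a single value without building intermediate lists. The function name `get_param` is hypothetical.

```python
from urllib.parse import unquote_plus


def get_param(entry, key):
    """Return the decoded value of `key` in a 'k1=v1&k2=v2' entry, or None.

    Walks the string pair by pair with str.find instead of splitting it,
    so only the requested value is ever copied out.
    """
    needle = key + "="
    pos = 0
    length = len(entry)
    while pos < length:
        end = entry.find("&", pos)
        if end == -1:
            end = length
        # The pair starts at pos; compare the key in place.
        if entry.startswith(needle, pos):
            return unquote_plus(entry[pos + len(needle):end])
        pos = end + 1
    return None
```

For example, `get_param("uid=18d6666o916r15g3&type=jobsearch&q=VP+IT", "q")` returns `"VP IT"`, and a lookup for `"id"` does not falsely match `uid`.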

Page 181: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Hadoop clients

Reliable, scalable, distributed computing

Most new consumers use Hadoop

Read log entries directly from HDFS

Divide and conquer to scale
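
The divide-and-conquer pattern can be sketched without a cluster: each partition of log entries is counted independently (the map step), and the partial results are then merged (the reduce step). This is a single-process illustration only, not Hadoop code; the function names and the choice of counting searches per query are assumptions.

```python
from collections import Counter
from urllib.parse import parse_qs


def map_partition(entries):
    """Map step: count search queries within one partition of log entries."""
    counts = Counter()
    for entry in entries:
        for query in parse_qs(entry).get("q", []):
            counts[query] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge per-partition counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partitions = [
    ["uid=a&type=jobsearch&q=daycare", "uid=b&type=jobsearch&q=VP+IT"],
    ["uid=c&type=jobsearch&q=daycare"],
]
totals = reduce_counts(map_partition(p) for p in partitions)
```

Because each partition is processed on its own, adding machines lets more partitions be counted in parallel before the merge.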

Page 182: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Monitoring

Want to monitor

● Business metrics

● Operational metrics

“Available soon” isn’t good enough

Page 183: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Datadog

Third party monitoring service

Stream metrics to Datadog HQ

Real-time dashboards

Page 184: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Datadog

Page 185: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

miniEPL

'jobsearch.organic_clk': "SELECT COUNT(*), 'clicks' AS unit FROM orgClk",

'jobsearch.totTime': "SELECT int(totTime), 'ms' AS unit FROM jobsearch(totTime IS NOT NULL)",

'mobile.mobsearch.oji': "SELECT tupleCount(orgRes), 'results' AS unit FROM mobsearch",

Page 186: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Getting logs into Datadog

Page 187: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Data redundancy

Replaying events

Click charging

Page 191: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Replaying events

1. Job alert email sign up broke for logged in users

2. Got alert parameters + jobsearch uid from access logs

3. Got account id from jobsearch log entries

4. Recreated job alert sign ups
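
Steps 2 to 4 amount to a join on the jobsearch uid. A minimal sketch, assuming both log sources have already been parsed into dictionaries; the field names `uid`, `account_id`, and `alert_params` are hypothetical, not the real log fields.

```python
def rebuild_signups(access_rows, jobsearch_rows):
    """Join access-log rows to jobsearch rows on uid to recover account ids."""
    # Step 3: uid -> account id, taken from the jobsearch log entries.
    account_by_uid = {
        row["uid"]: row["account_id"]
        for row in jobsearch_rows
        if row.get("account_id")
    }
    # Steps 2 and 4: attach the account id to each alert found in the access logs.
    signups = []
    for row in access_rows:
        account_id = account_by_uid.get(row["uid"])
        if account_id is not None:
            signups.append({"account_id": account_id, "alert": row["alert_params"]})
    return signups
```

Rows whose uid has no matching account id are skipped rather than guessed at.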

Page 196: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Click charging

1. Store sponsored click data in database

2. Log sponsored click data to logrepo

3. Verify logs match database

4. Charge for clicks

5. Profit!
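
Step 3, verifying that the logs match the database, can be thought of as a set reconciliation. The sketch below assumes each sponsored click carries a unique identifier in both places; the function and key names are hypothetical, and the slides do not say how mismatches are actually handled.

```python
def reconcile(db_click_ids, logged_click_ids):
    """Compare click ids stored in the database with those logged to logrepo."""
    db = set(db_click_ids)
    logged = set(logged_click_ids)
    return {
        "chargeable": sorted(db & logged),         # present in both sources
        "missing_from_logs": sorted(db - logged),  # stored but never logged
        "missing_from_db": sorted(logged - db),    # logged but never stored
    }
```

In this sketch only clicks found in both sources are treated as chargeable; anything in the other two buckets would need investigation first.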

Page 197: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

What does logrepo enable?

Answering business and operational questions

Data-driven decisions

Page 198: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Average cover letter length inside US vs. outside US?

Page 199: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Mobile searches per hour in JP vs. UK?

Page 200: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Resume creation by country?

Page 201: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Email alert opens by email domain?

Page 202: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Percent of app downloads from iOS, Android, Windows?

Page 203: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

How quickly does a datacenter take on traffic after a failover?

Page 204: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Q & A

https://github.com/indeedeng

http://go.indeed.com/urlparsing

Page 205: [@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Next @IndeedEng Talk

Big Value from Big Data: Building Decision Trees at Scale

Andrew Hudson, Indeed CTO

February 26, 2014

http://engineering.indeed.com/talks