[@indeedeng] how to get a job 35 million times a day using rabbitmq

Post on 15-Jan-2015

DESCRIPTION

@IndeedEng March: Wednesday, March 27th

Video available: http://www.youtube.com/watch?v=MeRHetCMiHg

The goal of Indeed's aggregation engine is to find and retrieve every job in the world, as quickly and accurately as possible. As we described in our previous tech talk, we strive to build products that are simple, fast, comprehensive, and relevant. The world's most comprehensive job search site is fueled by the more than 35 million job postings we process every day, which we deliver to jobseekers within minutes of discovery.

Our original aggregation architecture was implemented using standard patterns. Our growth required levels of scalability, performance, and resilience that this architecture simply could not handle. In a case study of scaling for the web, we will discuss how we tackled this problem. We will cover the issues we saw with our original architecture, how we analyzed our options to guide a solution, how we used RabbitMQ as a key component in the new architecture, and the benchmarks we used to evaluate how successful we were.

Speaker Ketan Gangatirkar is the development manager responsible for Indeed's continuous deployment infrastructure as well as its aggregation system.

Speaker Cameron Davison is a software engineer on the aggregation team at Indeed and a graduate of UT Austin. He re-architected Indeed's aggregation pipeline using RabbitMQ to sustain high write volumes, and continues to improve products in the aggregation system to make it run more efficiently.

TRANSCRIPT

How to Get a Job 35 Million Times a Day Using RabbitMQ

Ketan Gangatirkar and Cameron Davison

One search. All jobs.

Aggregation gets jobs

Aggregation gets jobs so jobseekers get jobs

Aggregation != Spidering

Spiders see pages.

Aggregation sees jobs.

How spiders see job sites

[Diagram: a flat set of undifferentiated pages]

How Indeed sees job sites

[Diagram: a Start page leads through Navigation to Job Lists, and each Job List leads to individual Jobs]

Aggregation != Spidering

Job sites have structure

Job pages have semantics

Navigation is more than following links

Remember this

Aggevery

job

{
Url: http://www.applytracking.com/track.aspx/3VYzR
Title: Senior Erlang Engineer
Company: Machine Zone
Location: Palo Alto,CA,US, 94301
Source Type: Employer
Job Type: Full-time
...
Description: The Senior Erlang Engineer is an integral ...
...
Createdate: 2013-02-05 23:18:05
...
}

What's in a job

[Annotated job posting: title, company, location, salary, job type, description]

How we build products

simple

fast

comprehensive

relevant

Simple

Tough problems, simple solutions

Fast

Discover the jobs quickly

Get them to jobseekers in minutes

10% of jobseekers sort by date

Do you want only new jobs?

20% of jobseekers want only new jobs

Daily new job emails

Speed matters

Comprehensive

Get every job

Relevant

Semantic extraction

The job is still available

Ignore non-jobs

This is a hard problem

Flaky sites

Site redesigns

Javascript

Missing or bad information

Big N makes it even harder

Examine 38M jobs every day

Do this in minutes

[Diagram: Aggregation gathers jobs from employers, job boards, staffing firms, and recruiters; Search serves 100M jobseekers]

Strawman* architecture

[Diagram: six engines split across Datacenter A and Datacenter B, each paired with a job site, all writing to MySQL in the primary datacenter]

Limitations

N connections

N concurrent writers

High latency

[Diagram, repeated for each limitation: six engines across Datacenter A and Datacenter B, each paired with a job site and connected to MySQL in the primary datacenter]

Limitation: failure points

[Diagram: the strawman architecture with two failure points marked X]

Scaling Patterns

What has worked for us so far?

Service-Oriented Architecture

[Diagram: three engines call the Job Write Service, which writes to MySQL; the components span a remote datacenter and the primary datacenter]

see http://go.indeed.com/boxcar

Standard Service Interaction

Client Service Database

Our Interaction

Client Service Database

Does this do what we need?

● Lots of workers...

● Sending lots of results...

● Over a long distance...

● That need to get processed fast...

● Reliably?

Engine Failure

[Diagram: the service-oriented architecture with a failure marked X]

Engine failure fix: Buffer to disk

[Diagram: a disk added alongside each of the three engines]

Network Failure

[Diagram: the service-oriented architecture with a failure marked X]

Network failure fix: Disks solve that too

[Diagram: the same three disks, with the failure marked X]

Write Service Failure

[Diagram: the service-oriented architecture with a failure marked X]

Write Service Failure fix: Disks solve that too

[Diagram: the three disks, with the failure marked X]

Write Service Failure fix: Redundancy

[Diagram: three Job Write Service instances in front of MySQL, with a failure marked X]

Database Failure

[Diagram: the service-oriented architecture with a failure marked X]

Database Failure fix: Buffer to disk

[Diagram: a single disk added, with the failure marked X]

Our new architecture

[Diagram: three engines and three Job Write Service instances, each with its own disk, in front of MySQL, spanning the remote and primary datacenters]

We could build this...

... maybe someone already has

We should use a message queue

Cameron Davison

Aggregation Requirements

● Durable

● Multi-Data Center (latency)

● 38 million jobs a day

● 2KB average job size

    ○ 76 GB a day

● Target peaks of 1000 jobs / second

● Programming language agnostic
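
The figures in this list can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1,000,000 KB), which is what the 76 GB figure implies; the variable names are illustrative only.

```python
# Figures taken from the requirements slide.
JOBS_PER_DAY = 38_000_000
AVG_JOB_SIZE_KB = 2
PEAK_JOBS_PER_SECOND = 1000

SECONDS_PER_DAY = 24 * 60 * 60

# Daily volume in decimal gigabytes (assumption: 1 GB = 1,000,000 KB).
daily_volume_gb = JOBS_PER_DAY * AVG_JOB_SIZE_KB / 1_000_000

# Average arrival rate if jobs were spread evenly over the day.
average_jobs_per_second = JOBS_PER_DAY / SECONDS_PER_DAY

# Bandwidth needed to sustain the peak target, in decimal megabytes.
peak_mb_per_second = PEAK_JOBS_PER_SECOND * AVG_JOB_SIZE_KB / 1000

print(f"{daily_volume_gb:.0f} GB/day")
print(f"{average_jobs_per_second:.0f} jobs/s on average")
print(f"{peak_mb_per_second:.0f} MB/s at peak")
```

The average works out to roughly 440 jobs per second, so the 1000 jobs / second peak target leaves a little more than a factor of two for bursts.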

Selection

What we found

High Availability

Open Source/Free

Self-hosted

Performant

Out-of-the-box Experience

Advanced Message Queuing Protocol (AMQP)

● Open Standard

● Wire protocol

● Existing Clients in Multiple Languages

Concepts

● Confirmation and Ack

● At least once

● Asynchronous Confirms

● Persistent

● Clustering

Confirmation and Ack

[Diagram: the producer sends msg to MQ and receives a confirm; MQ delivers msg to the consumer and receives an ack; the steps are numbered 1 to 4]

At least once

[Diagram: MQ delivers a message to the consumer; the consumer sends an ack]

At most once

[Diagram: MQ delivers a message to the consumer with auto ack]
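
The difference between the two delivery modes shows up when a consumer dies mid-message. The following is a toy in-memory model, not the RabbitMQ or AMQP client API: `ToyQueue`, `run`, and the message names are hypothetical, and the requeue-to-the-front behaviour is a simplification.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a broker queue (not the RabbitMQ API)."""

    def __init__(self, messages, auto_ack):
        self.ready = deque(messages)
        self.auto_ack = auto_ack
        self.unacked = {}  # delivery tag -> message awaiting an ack
        self.next_tag = 1

    def deliver(self):
        """Hand the next message to a consumer, or return None if empty."""
        if not self.ready:
            return None
        message = self.ready.popleft()
        tag = self.next_tag
        self.next_tag += 1
        if not self.auto_ack:
            # Manual ack: keep the message until the consumer acks it.
            self.unacked[tag] = message
        return tag, message

    def ack(self, tag):
        self.unacked.pop(tag, None)

    def consumer_died(self):
        """Put delivered-but-unacked messages back at the front of the queue."""
        for tag in sorted(self.unacked, reverse=True):
            self.ready.appendleft(self.unacked[tag])
        self.unacked.clear()


def run(auto_ack):
    """Consume three messages; the first consumer crashes while handling 'b'."""
    queue = ToyQueue(["a", "b", "c"], auto_ack=auto_ack)
    processed = []

    tag, message = queue.deliver()  # 'a' is handled and acked normally
    processed.append(message)
    queue.ack(tag)

    queue.deliver()          # 'b' is delivered, then the consumer crashes
    queue.consumer_died()

    while (delivery := queue.deliver()) is not None:  # a new consumer takes over
        tag, message = delivery
        processed.append(message)
        queue.ack(tag)
    return processed


at_most_once = run(auto_ack=True)    # 'b' is lost
at_least_once = run(auto_ack=False)  # 'b' is redelivered
```

Had the first consumer finished its work on 'b' and crashed just before acking, 'b' would have been processed twice under manual acks, which is why at-least-once consumers generally need to be idempotent.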

Asynchronous Confirms

[Diagram: the producer streams messages 1 to 16 without waiting for each confirm; a confirm for message #6 arrives]
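
On the producer side, asynchronous confirms amount to keeping a table of in-flight sequence numbers and clearing entries as confirms arrive. The sketch below is a toy tracker rather than a client library: `ConfirmTracker` is a hypothetical name, and it assumes the "confirm #6" on the slide is cumulative (RabbitMQ's confirms carry a `multiple` flag for that case).

```python
class ConfirmTracker:
    """Tracks which published sequence numbers are still unconfirmed."""

    def __init__(self):
        self.next_seq = 1
        self.unconfirmed = {}  # sequence number -> message body

    def publish(self, body):
        """Record a message as in flight and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.unconfirmed[seq] = body
        return seq

    def on_confirm(self, seq, multiple=False):
        """Handle a confirm; multiple=True covers every number up to seq."""
        if multiple:
            confirmed = [s for s in self.unconfirmed if s <= seq]
        else:
            confirmed = [seq] if seq in self.unconfirmed else []
        for s in confirmed:
            del self.unconfirmed[s]
        return confirmed


tracker = ConfirmTracker()
for n in range(1, 17):                 # publish 16 messages without waiting
    tracker.publish(f"job-{n}")

tracker.on_confirm(6, multiple=True)   # assumed cumulative confirm for 1 to 6
still_in_flight = sorted(tracker.unconfirmed)
```

Anything left in `unconfirmed` when a connection drops is what the producer would need to resend.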

Persistent

[Diagram sequence: producer, MQ, and consumer; a failure (X) occurs partway through the sequence]
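
The idea behind persistent messages can be illustrated with a queue that journals to disk before accepting a message. This is a minimal sketch under that assumption; `ToyBroker` and its one-line-per-message journal are hypothetical and say nothing about RabbitMQ's actual storage format, and the sketch ignores consumption and acks.

```python
import json
import os
import tempfile


class ToyBroker:
    """Toy queue that journals persistent messages to a file."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.messages = []
        # On startup, recover whatever was journaled before the last crash.
        if os.path.exists(journal_path):
            with open(journal_path) as journal:
                self.messages = [json.loads(line) for line in journal]

    def publish(self, body, persistent):
        self.messages.append(body)
        if persistent:
            # Append to the journal and force it to disk before returning.
            with open(self.journal_path, "a") as journal:
                journal.write(json.dumps(body) + "\n")
                journal.flush()
                os.fsync(journal.fileno())


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "queue.journal")

    broker = ToyBroker(path)
    broker.publish("transient job", persistent=False)
    broker.publish("persistent job", persistent=True)

    del broker                           # simulate the broker process dying
    recovered = ToyBroker(path).messages
```

Only the persistent message survives the simulated restart; the extra disk write per message is the cost that shows up in the throughput figures later in the talk.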

Clustering

[Diagram: a producer publishing to a master/slave pair; the steps are numbered 1 to 4]

Testing

Test RabbitMQ

● Send millions of 2KB messages

● 20 producers and 20 consumers

● 1000 messages / second

● Simulate multiple failures

Test Consistency

[Diagram sequence: producers and consumers attached to two RabbitMQ nodes, one master and one slave; a node fails (X), leaving a single master, and the pair is later restored as master and slave]
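
One way to check consistency in a test like this is to give every message a unique ID and compare what was sent with what was received. The slides do not show how the check was actually done, so the following is only a sketch of that approach; `check_consistency` and the sample IDs are hypothetical.

```python
from collections import Counter


def check_consistency(sent_ids, received_ids):
    """Compare what producers sent with what consumers received."""
    received_counts = Counter(received_ids)
    lost = sorted(set(sent_ids) - set(received_counts))
    duplicated = sorted(i for i, count in received_counts.items() if count > 1)
    unexpected = sorted(set(received_counts) - set(sent_ids))
    return {"lost": lost, "duplicated": duplicated, "unexpected": unexpected}


sent = list(range(1, 11))
# Message 4 was redelivered after a simulated failover; nothing was lost.
received = [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
report = check_consistency(sent, received)
```

With at-least-once delivery, an empty `lost` list is the property that matters; entries in `duplicated` are expected after a failover and have to be tolerated downstream.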

RabbitMQ Clustering

[Diagram sequence: a master/slave pair going through repeated node failures (X) and recoveries]

Non-persistent

15990 messages / second (30 MB/s)

Persistent

2781 messages / second (5.5 MB/s)

Clustered and Persistent

1262 messages / second (2.5 MB/s)
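
These rates can be related back to the 2KB test messages and the 1000 messages / second target. The sketch below uses decimal megabytes, so its MB/s values only approximately match the rounded figures on the slides.

```python
MESSAGE_SIZE_KB = 2           # test message size from the slides
TARGET_PER_SECOND = 1000      # peak target from the requirements

# Measured rates from the slides, in messages per second.
measured = {
    "non-persistent": 15990,
    "persistent": 2781,
    "clustered and persistent": 1262,
}

for mode, rate in measured.items():
    mb_per_second = rate * MESSAGE_SIZE_KB / 1000   # decimal MB
    headroom = rate / TARGET_PER_SECOND
    print(f"{mode}: {mb_per_second:.1f} MB/s, {headroom:.2f}x the peak target")

# Cost of durability plus clustering relative to the fastest configuration.
slowdown = measured["non-persistent"] / measured["clustered and persistent"]
```

Durability and clustering together cost more than an order of magnitude in throughput, yet the slowest configuration still clears the peak target.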

Applying RabbitMQ

Unreliable High Latency Connections

[Diagram: three engines in the remote DC connected to the Job Write Service and MySQL in the primary DC]

Replaced with RabbitMQ

[Diagram sequence: RabbitMQ introduced between the engines in the remote DC and the Job Write Service in the primary DC]

Rabbit can talk to Rabbit

Shovel Plugin

[Diagram: Producer -> RabbitMQ 1 -> RabbitMQ 2 -> Consumer]
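
The shovel plugin consumes from a queue on one broker and republishes to another, so the source broker acts as a buffer whenever the link between them is down. The sketch below models only that store-and-forward behaviour with two in-memory queues; `ToyShovel`, `link_up`, and the queue names are hypothetical and are not the plugin's configuration or API.

```python
from collections import deque


class ToyShovel:
    """Moves messages from a source queue to a destination queue."""

    def __init__(self, source, destination):
        self.source = source
        self.destination = destination
        self.link_up = True

    def run_once(self):
        """Forward as many messages as possible; return how many moved."""
        moved = 0
        while self.source and self.link_up:
            message = self.source[0]          # peek, do not remove yet
            self.destination.append(message)  # publish to the far side
            self.source.popleft()             # remove only after the publish
            moved += 1
        return moved


remote_queue = deque()    # hypothetical broker next to the engines
primary_queue = deque()   # hypothetical broker next to the Job Write Service
shovel = ToyShovel(remote_queue, primary_queue)

shovel.link_up = False                     # the WAN link goes down
remote_queue.extend(["job-1", "job-2"])    # engines keep publishing locally
moved_while_down = shovel.run_once()

shovel.link_up = True                      # the link comes back
moved_after_recovery = shovel.run_once()
```

The engines never see the outage: they publish to a nearby broker at local latency, and the backlog drains once the link returns.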

Replaced with RabbitMQ

[Diagram sequence: three engines and three RabbitMQ instances in the remote DC, with a further RabbitMQ in the primary DC in front of the Job Write Service]

Parallelize Job Write Service

[Diagram: RabbitMQ distributes Job A, Job B, and Job C across three Job Write Service instances]
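
Several consumers reading from one queue is the usual competing-consumers pattern: each message goes to exactly one worker. The sketch below imitates it with Python's standard `queue` and `threading` modules rather than a broker; the `job_write_service` function and the `written` list are stand-ins for the real service and database.

```python
import queue
import threading

job_queue = queue.Queue()
for name in ["Job A", "Job B", "Job C"]:
    job_queue.put(name)

written = []                 # stands in for rows written to the database
written_lock = threading.Lock()


def job_write_service(worker_id):
    """Take jobs off the shared queue until it is empty."""
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return
        with written_lock:
            written.append((worker_id, job))
        job_queue.task_done()


workers = [threading.Thread(target=job_write_service, args=(i,)) for i in range(3)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

jobs_written = sorted(job for _, job in written)
```

Which worker handles which job is not deterministic, but every job is written exactly once, and adding workers needs no change on the producer side.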

Replaced with RabbitMQ

[Diagram sequence: as before, with an additional Job Write Service consuming from the RabbitMQ in the primary DC]

Message Flow

[Diagram sequence: a message travelling from an engine through the RabbitMQ instances to a Job Write Service in the primary DC]

Jobs/minute

Jobs/minute from one site

220,000 jobs in 6 hours

611 jobs / minute

Jobs/minute from one site

251,000 jobs in 20 minutes

12550 jobs / minute
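
The two rates follow directly from the job counts and durations. The snippet below reproduces them and derives the ratio between the two measurements; the slides do not label which measurement is which, so the variable names stay neutral.

```python
# Figures from the two "Jobs/minute from one site" slides.
first_rate = 220_000 / (6 * 60)    # 220,000 jobs in 6 hours
second_rate = 251_000 / 20         # 251,000 jobs in 20 minutes

# Ratio between the two measurements.
speedup = second_rate / first_rate

print(f"{first_rate:.0f} jobs/minute")
print(f"{second_rate:.0f} jobs/minute")
print(f"{speedup:.1f}x")
```

The second measurement is roughly twenty times the first.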

RabbitMQ

Horizontal Scale

[Diagram: multiple engines, RabbitMQ instances, and Job Write Service instances side by side]

Today 1000 messages / second

RabbitMQ 3

2486 messages / second (5 MB/s)

RabbitMQ Configuration

● Confirmations - Fire and Forget

● Persistent Messages - Durable

● Shoveling - Multi-Data Center

● Mirrored Queues in Cluster - High Reliability

Can we do more with RabbitMQ?

Aggregation Viewer

Real-time browser-based view of job stream

● Almost real-time

● Exclusive queue

● Transient messages

Aggregation Viewer Architecture

[Diagram: jobs flow from the Agg Jobs RabbitMQ cluster via Shovel* to the Agg Viewer RabbitMQ; the Agg Viewer subscribes and serves the browser over HTTP]

Resume Contacts Billing

Pay-per-contact: limited budget

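Resume Contacts Billing: Original Path

[Diagram: Resume Search, the log repo, and MySQL, across the Asia DC and US DC on either side of the Pacific]

see http://go.indeed.com/logrepo

Resume Contacts Billing: Fast Path

[Diagram: Resume Search, a RabbitMQ on each side of the Pacific, and MySQL, across the Asia DC and US DC; the log repo is still shown and one element is marked X]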

Company Page Edits

User-contributed content about companies

Company Page

Company Page Edits: Implementation

Writing data AND reading it back

Company Page Edits: Single Datacenter

[Diagram: Browser -> Web Server -> MySQL]

Company Page Serving

[Diagram: in the Asia datacenter, Browser -> Web Server, backed by an LSM Tree and Memcached]

see http://go.indeed.com/lsmtree

Company Page Edits

[Diagram: Browser -> Web Server -> RabbitMQ in a remote datacenter (Asia, EU, et cetera), then across the Pacific or Atlantic to RabbitMQ -> MySQL in the primary US datacenter; Memcached also shown]

Company Page Reads

[Diagram: MySQL -> LSM Tree Builder -> LSM Tree in the primary US datacenter, with LSM Tree copies in the Asia and EU datacenters (et cetera) across the Pacific and Atlantic; Memcached also shown]

Company Pages System

[Diagram: the edit and read paths combined: Browser -> Web Server -> RabbitMQ -> RabbitMQ -> MySQL -> LSM Tree Builder -> LSM Tree, with LSM Tree copies in the Asia and EU datacenters (et cetera)]

Other applications

Company Pages

Recap: The jobs must flow

● Durability

● High throughput

● Low latency

● Partition-tolerance

● Efficient use of the database

● Minimal points of failure
