
High Performance Cooperative Distributed Systems in Adtech

Stan Rosenberg

VP of Engineering, Forensiq

New York, NY

QCon, New York, June 26, 2019

Intro, Design, Implementation, Reliability, Lessons Learned, Summary

Prebid Throughput

GC Pauses

Failure happens all the time

Ken Arnold:

When you design distributed systems, you have to say, "Failure happens all the time."

Fallacies of Distributed Computing (Peter Deutsch):

The network is reliable.

Latency is zero.

Bandwidth is infinite.

Transport cost is zero.

Past Work

Present Work

Intro

Before, Ph.D., Computer Science; Stevens, Hoboken, 2011

Advisor: David A. Naumann

Dissertation Title: Region Logic: Local Reasoning for Java Programs and its Automation

Recently, building distributed platforms for startups

AppNexus (serving ads faster)

PlaceIQ (using location to serve ads)

VP of Engineering, Forensiq (fighting ad fraud)

Forensiq Overview

Comprehensive Fraud and Verification SaaS (MRC certified)

Display Verification (viewability measurements, impression blocking)

Performance Fraud (stolen attribution, fake action)

Online scoring via Prebid, Postbid and S2S APIs

Offline scoring via request log import and reputation lists

Fraud Examples

Call for Cooperation and Collaboration

Let’s improve data quality!

provide authentic source IP

server-side ad-stitching (e.g., AWS Elemental) hides source IP; triggers datacenter traffic

MRC notes, “data center traffic is determined to be a consistent source of non-human traffic”.

specify location type (OpenRTB 2.5) and source to strengthen spoofing detection

provide campaign/source (aggregate) metrics to help detect client-side JS blocking

Performance Requirements (Prebid API)

high-throughput – must scale above 1 mil. RPS

low-latency – response p99 < 10ms

Daily Bid Volume

$100 \times 10^{9} / 86400 \approx 1.16 \times 10^{6}$ (bid requests per second)

https://fixad.tech/wp-content/uploads/2019/02/4-appendix-on-market-saturation-of-the-systems.pdf

Common Concerns

                  high-throughput   low-latency
server backend          ✓                ✓
KV store                ✓                ✓
data ingest             ✓
ETL                     ✓
data pipelines          ✓

data pipelines

Ad Serving: enrichment, budget, attribution, reporting

Fraud Detection: enrichment, scoring, reporting

Guiding Principles

use NIO

use compare-and-swap instead of locks (affects out-of-order execution)

use spatial/temporal locality (prefetch, branch prediction)

minimize coupling and state; keep it simple

minimize GC pressure

warmup on startup to trigger JIT

measure everything with HdrHistogram

benchmark everything with JMH and wrk2
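
The compare-and-swap principle above can be illustrated with a small retry loop. This is only a sketch using the JDK's `AtomicLong`; the class name `MaxLatencyTracker` and the idea of tracking a maximum latency are assumptions made for illustration, not something taken from the talk.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: tracks the largest value seen so far without taking a lock.
final class MaxLatencyTracker {
    private final AtomicLong maxMicros = new AtomicLong(Long.MIN_VALUE);

    // CAS retry loop: give up as soon as the stored value is already >= micros,
    // otherwise retry only when another thread changed the value in between.
    void record(long micros) {
        long current = maxMicros.get();
        while (micros > current && !maxMicros.compareAndSet(current, micros)) {
            current = maxMicros.get();
        }
    }

    long max() {
        return maxMicros.get();
    }

    public static void main(String[] args) throws InterruptedException {
        MaxLatencyTracker tracker = new MaxLatencyTracker();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int offset = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    tracker.record(i * 4L + offset);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("max = " + tracker.max());
    }
}
```

No thread ever blocks here: a failed `compareAndSet` simply means another thread made progress, and the loop re-reads the value and tries again.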

Cloud is fast (enough)

modern hypervisor adds negligible overhead (< 5%)

consistent performance; “noisy neighbor” is a myth

networking – 2 Gbps per core; up to 32 Gbps per VM

partitions are infrequent; high inter-region throughput

local storage – NVMe SSDs; read: 300K IOPS, 2 GB/sec

cloud storage – high-throughput and high-availability

strongly consistent (GCS)

fast parallel uploads via compose (GCS)

Mechanical Sympathy

Understanding the Hardware Makes You a Better Developer

https://mechanical-sympathy.blogspot.com/

https://dzone.com/articles/mechanical-sympathy

https://groups.google.com/forum/#!forum/mechanical-sympathy

Latency

Little’s Law: $L = \lambda \times W$, whence throughput is $\propto \frac{1}{\text{latency}}$
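
As a worked example of Little's Law, the sketch below plugs in the Prebid API targets from the performance requirements (1 million RPS, 10 ms). Treating 10 ms as the mean latency $W$ is an assumption, since the requirement states it as a p99 bound; the class and method names are hypothetical.

```java
// Worked example of Little's Law, L = lambda * W.
final class LittlesLaw {
    // Mean number of requests in flight (L) for an arrival rate and a mean latency.
    static double inFlight(double requestsPerSecond, double meanLatencySeconds) {
        return requestsPerSecond * meanLatencySeconds;
    }

    // Throughput (lambda) sustained by a fixed number of in-flight requests.
    static double throughput(double inFlight, double meanLatencySeconds) {
        return inFlight / meanLatencySeconds;
    }

    public static void main(String[] args) {
        // 1 million RPS at an assumed 10 ms mean latency: about 10,000 requests in flight.
        System.out.println(inFlight(1_000_000, 0.010));
        // With 10,000 requests in flight, halving latency to 5 ms doubles the rate.
        System.out.println(throughput(10_000, 0.005));
    }
}
```

With concurrency held fixed, every millisecond removed from latency translates directly into throughput, which is why the two requirements are treated together.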

Know Your Data Structures

1000 references to main memory (e.g., linear scan of linked-list) is ≈ 100 micros; $\frac{1}{100} \times 10^{6} = 10{,}000$ reqs/second

1000 references to L2 cache is ≈ 7 micros; $\frac{1}{7} \times 10^{6} = 142{,}857$ reqs/second

linear search is slower than binary, right?

int cnt = 0;
for (int i = 0; i < n; i++)
    cnt += (arr[i] < key);
return cnt < n && arr[cnt] == key;
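
The counting search on this slide relies on C's implicit boolean-to-integer conversion. A Java rendering might look like the sketch below, assuming a sorted `int[]`; the class name `CountingSearch` is hypothetical, and whether the JIT compiles the ternary into branch-free code depends on the JVM and hardware, so measure before relying on it.

```java
import java.util.Arrays;

final class CountingSearch {
    // Counts the elements smaller than key. In a sorted array that count is the
    // index where key would sit, so one final comparison decides membership.
    static boolean contains(int[] sorted, int key) {
        int cnt = 0;
        for (int i = 0; i < sorted.length; i++) {
            cnt += (sorted[i] < key) ? 1 : 0; // Java has no bool-to-int conversion
        }
        return cnt < sorted.length && sorted[cnt] == key;
    }

    public static void main(String[] args) {
        int[] arr = {2, 3, 5, 7, 11, 13};
        for (int key = 0; key <= 14; key++) {
            boolean expected = Arrays.binarySearch(arr, key) >= 0;
            if (contains(arr, key) != expected) {
                throw new AssertionError("mismatch at key " + key);
            }
        }
        System.out.println("counting search agrees with binary search");
    }
}
```

The loop touches memory sequentially and has no data-dependent exit, which is friendly to the prefetcher and branch predictor; for small arrays that can outweigh binary search's lower comparison count.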

Disruptor Pattern–Fast Event Processing

Disruptor is like Java’s BlockingQueue but waaaaay faster!

RingBuffer

one compare-and-swap operation to drain the queue

pair of sequence numbers for fast atomic reads/writes

exploits speculative racing to eliminate locks

consumer message batching results in high throughput
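
To make the ring buffer and sequence-number ideas concrete, here is a minimal single-producer/single-consumer sketch built only on the JDK. It is not the LMAX Disruptor API and not the implementation used in the talk; the class `SpscRing` and its methods are hypothetical. With exactly one producer and one consumer no compare-and-swap is needed at all; a multi-producer variant would have to claim sequences with CAS.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical single-producer/single-consumer ring buffer of long values.
final class SpscRing {
    private final long[] slots;                        // pre-allocated storage
    private final int mask;                            // capacity - 1
    private final AtomicLong head = new AtomicLong();  // next sequence to read
    private final AtomicLong tail = new AtomicLong();  // next sequence to write

    SpscRing(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new long[capacity];
        mask = capacity - 1;
    }

    // Producer side: returns false when the buffer is full.
    boolean offer(long value) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = value;
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    // Consumer side: handles everything published so far as one batch.
    int drain(LongConsumer handler) {
        long h = head.get();
        long t = tail.get(); // volatile read: slots below t are visible
        for (long s = h; s < t; s++) {
            handler.accept(slots[(int) (s & mask)]);
        }
        head.set(t); // frees the drained slots for the producer
        return (int) (t - h);
    }

    public static void main(String[] args) throws InterruptedException {
        SpscRing ring = new SpscRing(64);
        Thread producer = new Thread(() -> {
            for (long v = 1; v <= 1000; v++) {
                while (!ring.offer(v)) {
                    Thread.yield(); // buffer full: wait for the consumer
                }
            }
        });
        producer.start();
        long[] sum = {0};
        int received = 0;
        while (received < 1000) {
            received += ring.drain(v -> sum[0] += v);
        }
        producer.join();
        System.out.println("sum = " + sum[0]);
    }
}
```

The consumer reads the producer's sequence once and then processes the whole range, which is the batching effect the slide credits for high throughput.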

Disruptor Pattern

RingBuffer is pre-allocated (data in Wrapper.message)

compact – sizeof(disruptor(524,288)) ≈ 14.5MB

Data Ingest & ETL

validate each request and apply (payload) limits

translate JSON to snappy-compressed Avro

use Disruptor to consume encoded Avro byte[]

append to Avro data file for current 5-min batch

upload to GCS (throttle to reduce GC pressure)
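
One way to pick the "current 5-min batch" is to floor each timestamp to the start of its window, as sketched below. The class `BatchWindow` and the file-naming scheme are assumptions for illustration only; the talk does not say how batch files are named.

```java
import java.time.Duration;
import java.time.Instant;

final class BatchWindow {
    static final Duration WINDOW = Duration.ofMinutes(5);

    // Start of the 5-minute window that contains the given instant.
    static Instant windowStart(Instant t) {
        long w = WINDOW.getSeconds();
        return Instant.ofEpochSecond(Math.floorDiv(t.getEpochSecond(), w) * w);
    }

    // Hypothetical naming scheme: one Avro data file per window.
    static String fileName(Instant t) {
        return "requests-" + windowStart(t).getEpochSecond() + ".avro";
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2019-06-26T14:07:42Z");
        System.out.println(windowStart(t)); // start of the window containing t
        System.out.println(fileName(t));
    }
}
```

Every request arriving in the same window maps to the same file, so the consumer only needs to roll over to a new file when `windowStart` changes.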

Avro & Snappy

16 cores, Skylake

java version "1.8.0_202"

@Threads(24), @BenchmarkMode(Mode.Throughput)

Benchmark           Score          Error          Units
encode              3741337.244    ± 81494.37     ops/s
encodeCompress      2699393.673    ± 40130.622    ops/s
decode              2925509.122    ± 37078.569    ops/s
decodeDecompress    2771921.410    ± 60483.905    ops/s

Also see zstd: https://facebook.github.io/zstd/

Data Ingest & ETL

early ETL cuts out many downstream inefficiencies

Avro’s performance is on par with Protobuf (also see below)

throttling uploads and downloads is a must to reduce GC

eliminate humongous objects (G1)

naive batching/parallel upload with compose works well

skip write-ahead log–deal with corrupted Avro blocks

Codegen makes Avro encoder 2x faster: https://github.com/RTBHOUSE/avro-fastserde
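
Throttling uploads can be as simple as capping how many run at once, so that only a bounded number of upload buffers is live on the heap at any moment. The sketch below uses a JDK `Semaphore`; the class `UploadThrottle`, the limit of 2, and the sleep standing in for an upload are all assumptions, not details from the talk.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class UploadThrottle {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    UploadThrottle(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    // Runs the task only while holding a permit, so at most maxConcurrent run at once.
    void upload(Runnable uploadTask) {
        permits.acquireUninterruptibly();
        try {
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            uploadTask.run();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    int peakInFlight() {
        return peak.get();
    }

    // Demo: many worker threads, few permits; returns the observed peak concurrency.
    static int demoPeak(int maxConcurrent, int threads, int tasks) {
        UploadThrottle throttle = new UploadThrottle(maxConcurrent);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> throttle.upload(() -> {
                try {
                    Thread.sleep(2); // stands in for a real upload
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return throttle.peakInFlight();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent uploads: " + demoPeak(2, 8, 32));
    }
}
```

The same wrapper applies to downloads. A rate-based limiter (bytes per second) is another option when the concern is allocation rate rather than the number of live buffers.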

KV Store–why not Aerospike?

Pros

founded in 2009 (AppNexus was first large deployment)

written in C (better resource management in theory)

uses Paxos for distributed consensus; heartbeats for node membership

supports migrations, rebalancing

supports cross-datacenter replication

Cons

no bulk loading

index can get large (RIPEMD is 20 bytes but metadata makes it 64 bytes)

log-structured filesystem (copy-on-write); runs compaction in background

global 32k bins limit (bins are like column qualifiers)

Low latency KV–Voldemort

founded in 2009 by LinkedIn (bulk loading main motivator)

written in Java

simple get/put API

uses consistent hashing (similar to Dynamo) to avoid hotspotting

bulk loading and readonly store

index is compact – uses only 8 bytes of md5(key)

index file is mlocked

(sort of) supports rebalancing
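
The consistent hashing and 8-byte md5(key) points can be sketched together as below. This is an illustration built on the JDK only, not Voldemort's actual code; the class `HashRing`, the virtual-node scheme, and the node names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // First 8 bytes of md5(key), read as a big-endian long.
    static long md5Prefix(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    // Each node is placed on the ring at several points to spread load.
    void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(md5Prefix(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    // Owner is the first ring point at or after the key's hash, wrapping around.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes on the ring");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(md5Prefix(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("node-a", 64);
        ring.addNode("node-b", 64);
        ring.addNode("node-c", 64);
        System.out.println("user:42 -> " + ring.nodeFor("user:42"));
        ring.removeNode("node-c");
        System.out.println("user:42 -> " + ring.nodeFor("user:42"));
    }
}
```

Removing a node only remaps the keys that node owned; every other key keeps its owner, which is what makes rebalancing tractable.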

Voldemort BuildAndPush

Voldemort Readonly Performance

Custom Voldemort

added BloomFilter (client-side to reduce RTT)

added Avro schema versioning

added Union datastore

TCP connection pooling is flawed

reloads create short-lived spikes (hard to pin index)

2GB limit per chunk (ByteBuffer 32bit signed addressing)

rewrite currently in progress to manage resources more efficiently,

rewrite Voldemort backend in C++

use UDP (potentially with Aeron)

use GCS instead of HDFS
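
The client-side Bloom filter mentioned above lets a client skip the network round trip for keys that are definitely absent. A minimal sketch follows, using a JDK `BitSet` and double hashing derived from `String.hashCode()`; the class `SimpleBloomFilter` and its hash mixing are assumptions, not the filter added to Voldemort.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SimpleBloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Double hashing: the i-th probe is h1 + i * h2, reduced into [0, size).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // odd second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 4);
        filter.add("203.0.113.7");
        // Only consult the remote store when the filter says "possibly present".
        System.out.println(filter.mightContain("203.0.113.7"));
    }
}
```

A filter never returns a false negative, so skipping the lookup on `false` is safe; false positives only cost an occasional unnecessary round trip.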

Putting Things Together

Tech. Debt

Top Two Diseases

Legacy and Tech. Debt are the top two diseases of any complex software development

avoid them at all costs

Google often rewrites legacy before it’s out of control; secondary effect,

way of transferring knowledge and ownership to newer team members

Henderson, Fergus. "Software engineering at Google." arXiv preprint arXiv:1702.01715 (2017).

Rapid Reliable Iteration

can’t iterate quickly without automated verification (i.e., tests)

invest time into test and benchmarking fixtures early (e.g., write emulators)

end-to-end (integration) tests, e.g., Selenium, are must-have

instrument with metrics and measure everything

use design by contract methodology with code reviews

Design by contract was coined by Bertrand Meyer in connection with Eiffel.
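
Java has no built-in contract syntax, so design by contract usually shows up as documented preconditions and postconditions that the code checks explicitly. The sketch below is one hedged illustration; the `Budget` class and its rules are hypothetical and not taken from the talk.

```java
final class Budget {
    private long remainingMicros;

    // Contract: requires initialMicros >= 0.
    Budget(long initialMicros) {
        if (initialMicros < 0) {
            throw new IllegalArgumentException("requires initialMicros >= 0");
        }
        this.remainingMicros = initialMicros;
    }

    // Contract: requires 0 < amount <= remaining();
    // ensures remaining() == old remaining() - amount.
    void debit(long amount) {
        if (amount <= 0 || amount > remainingMicros) {
            throw new IllegalArgumentException("requires 0 < amount <= remaining()");
        }
        long old = remainingMicros;
        remainingMicros -= amount;
        if (remainingMicros != old - amount || remainingMicros < 0) {
            throw new IllegalStateException("postcondition violated in debit");
        }
    }

    long remaining() {
        return remainingMicros;
    }

    public static void main(String[] args) {
        Budget budget = new Budget(1_000);
        budget.debit(250);
        System.out.println(budget.remaining());
    }
}
```

Writing the contract in the method comment gives code reviewers something concrete to check the implementation and its callers against.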

GCP Managed Infrastructure

distributed, highly available, strongly consistent file system (GCS)

global latency-based load balancing

zero-downtime rolling deploy

fast scaling up/down (new instances take < 90 sec. to boot)

BigQuery (bulk loading, Avro/Parquet, partitioned tables)

Bigtable (HBase on steroids)

syncs (LB logs to BigQuery, billing to BigQuery, etc.)

https://serverfault.com/questions/881698/random-failed-to-connect-to-backend-errors-on-gce-lb

Cloud Tech. is mostly mature

https://github.com/googleapis/cloud-bigtable-client/issues/1348

https://github.com/googleapis/google-cloud-java/issues/3531

https://github.com/googleapis/google-cloud-java/issues/3534

https://github.com/GoogleCloudPlatform/bigdata-interop/issues/106

https://github.com/GoogleCloudPlatform/bigdata-interop/issues/153

Trust but Verify

cost-effective infrastructure is doable but watch out...

GCP bait & switch product tactics

Stackdriver GLB logging (free until insanely expensive)

load-balancer user-defined headers (free until ...)

Cloud Armor (firewall for GLB) (free until ...)

Managed services are black boxes (with limited observability)

DNS delegation misconfiguration was $54k over 6 months (no metrics, logging or anomaly detection)

Dataproc transient failures (no useful logging to determine root cause)

Dataproc job non-deterministically “stuck” while committing output

Summary

Cloud and OSS are an extremely powerful combination

High throughput in Cloud is fairly easy through right design

Low latency in Cloud is achievable but takes significantly more effort

opportunity to build a managed low-latency KV store

storage-as-a-service is still emerging (programmable SSDs)

Fraud is here to stay; cooperation and collaboration with adtech is vital

Questions?

EOF
