Gobblin @ NerdWallet (Nov 2015)


Page 1: Gobblin @ NerdWallet (Nov 2015)

Gobblin @ NerdWallet

By Akshay Nanavati and Eric Ogren [email protected] [email protected]

Page 2: Gobblin @ NerdWallet (Nov 2015)

Agenda

● Introduction to NerdWallet
● Gobblin @ NerdWallet Today
● Initial Pain Points & Learnings
● Contributions (Present and Future)
● Future Use Cases & Requests


Page 3: Gobblin @ NerdWallet (Nov 2015)

What Is NerdWallet?

● Started in 2009. 275+ employees
● Highly profitable. Series A funding Feb 2015.
● We want to bring clarity to life’s financial decisions.


Page 4: Gobblin @ NerdWallet (Nov 2015)

NerdWallet Tech Stack

[Stack diagram: Front-End, Services Tier, Data Analytics, Data Systems & Platforms]

Page 5: Gobblin @ NerdWallet (Nov 2015)

Data Types @ NerdWallet

● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
  ○ Synced to Redshift periodically
● Consumer Identity Data (Postgres: medium reads, medium writes)
● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes)
● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
● External 3rd Party Analytics Data (Redshift: medium reads, batch import)


Page 6: Gobblin @ NerdWallet (Nov 2015)

Gobblin @ NW Today

● Running in standalone mode
● Ingests user tracking and operational log data
● Tracking Data:
  ○ ~10 Kafka topics - 1 per event & schema type
  ○ Hourly Gobblin jobs pull from Kafka and dump to a date-partitioned directory in S3
  ○ Events are already serialized as protobuf in each Kafka topic
  ○ Around 100 events/second
● Log Ingestion (Operational Data):
  ○ Extracts data from AWS logs sitting in S3
  ○ Parses log lines and serializes them to protobuf (see the converter sketch below)
  ○ Writes the serialized protobuf files back to S3 and eventually into Redshift
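To make the log-ingestion step concrete, here is a minimal sketch of a converter that parses a log line and emits protobuf bytes. This is an illustration, not our production code: AccessLogEntry stands in for a hypothetical protobuf-generated message class, and the regex covers only a simplified access-log format. The contract shown is Gobblin's gobblin.converter.Converter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;
import gobblin.converter.DataConversionException;
import gobblin.converter.SchemaConversionException;
import gobblin.converter.SingleRecordIterable;

// Illustrative converter: one raw log line in, one serialized protobuf record out.
public class NginxLogToProtobufConverter extends Converter<String, String, String, byte[]> {

  // Simplified access-log format: remote addr, timestamp, request line, status code.
  private static final Pattern LOG_LINE = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) .*$");

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    // The output schema is fixed by the protobuf definition, so pass through.
    return inputSchema;
  }

  @Override
  public Iterable<byte[]> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    Matcher m = LOG_LINE.matcher(inputRecord);
    if (!m.matches()) {
      throw new DataConversionException("Unparseable log line: " + inputRecord);
    }
    // AccessLogEntry is a hypothetical stand-in for a protobuf-generated class.
    AccessLogEntry entry = AccessLogEntry.newBuilder()
        .setRemoteAddr(m.group(1))
        .setTimestamp(m.group(2))
        .setRequest(m.group(3))
        .setStatus(Integer.parseInt(m.group(4)))
        .build();
    return new SingleRecordIterable<byte[]>(entry.toByteArray());
  }
}
```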


Page 7: Gobblin @ NerdWallet (Nov 2015)

Tracking Pipeline


Page 8: Gobblin @ NerdWallet (Nov 2015)

Learnings: Deploying Gobblin w/Internal Code

● Have a repo of internal Gobblin modules (this is where we compile everything)● Modified the build script to link the gobblin project to our gobblin-modules

project● Use jenkins to compile gobblin on the remote machine● Maintain a separate repository with .pull files that we can sync with our stage

and production environments
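For reference, a minimal sketch of what one of those .pull files could look like for the hourly Kafka-to-S3 tracking jobs. The broker address, topic, and output path are placeholders, and the exact key set is an assumption based on Gobblin's standard job-configuration properties, paired with the KafkaSimpleSource and SimpleDataWriter described on the next slide:

```properties
# Hypothetical hourly tracking-ingest job; broker, topic, and paths are placeholders.
job.name=TrackingEventIngest
job.group=NWTracking
job.schedule=0 0 * * * ?

# Pull raw bytes from Kafka; events are already protobuf-serialized.
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
kafka.brokers=kafka.internal:9092
topic.whitelist=tracking.pageview
bootstrap.with.offset=earliest

# Write the bytes through unchanged.
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=txt

data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=s3://nw-tracking/events
```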


Page 9: Gobblin @ NerdWallet (Nov 2015)

Current Contributions

● Simple Data Writer
  ○ class gobblin.writer.SimpleDataWriter
  ○ Writes binary records as bytes with no regard to encoding
  ○ Optionally prepends each record with its size, or terminates records with a char delimiter (e.g. \n for string data); see the framing sketch after this list
● Kafka Simple Extractor
  ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
  ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
  ○ Extracts binary data from Kafka as an array of bytes without any serde
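To make the writer's framing concrete, here is a minimal standalone sketch of the "size prefix or delimiter" idea (an illustration, not the contributed SimpleDataWriter source): each record is either preceded by its 4-byte length or followed by a single delimiter byte, so a downstream reader can split the stream back into records.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Illustrative framing logic: length-prefix framing OR delimiter framing.
public class FramedByteWriter {
  private final OutputStream out;
  private final boolean prependSize; // if true, write a 4-byte length before each record
  private final byte delimiter;      // otherwise, write this byte after each record (e.g. '\n')

  public FramedByteWriter(OutputStream out, boolean prependSize, byte delimiter) {
    this.out = out;
    this.prependSize = prependSize;
    this.delimiter = delimiter;
  }

  public void write(byte[] record) throws IOException {
    if (prependSize) {
      out.write(ByteBuffer.allocate(4).putInt(record.length).array());
      out.write(record);
    } else {
      out.write(record);
      out.write(delimiter);
    }
  }
}
```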


Page 10: Gobblin @ NerdWallet (Nov 2015)

Future Contributions

● Gobblin Dashboards
● S3 Source & Extractor
  ○ Given an S3 bucket, extract all files matching a regex (see the listing sketch after this list)
    ■ Leverages FileBasedExtractor
    ■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
● S3 Publisher
  ○ Publishes files to S3
  ○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone this is not an issue for us
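A minimal sketch, under assumed names, of the listing half of such an S3 source, using the AWS SDK for Java v1 that was current at the time: list every object in a bucket, keep the keys matching a caller-supplied regex, and follow the pagination marker until the listing is exhausted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3RegexLister {

  // Return all keys in the bucket whose full key matches the given regex.
  public static List<String> matchingKeys(String bucket, String regex) {
    AmazonS3 s3 = new AmazonS3Client(); // credentials resolved from the environment
    Pattern pattern = Pattern.compile(regex);
    List<String> keys = new ArrayList<String>();

    ListObjectsRequest request = new ListObjectsRequest().withBucketName(bucket);
    ObjectListing listing;
    do {
      listing = s3.listObjects(request);
      for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        if (pattern.matcher(summary.getKey()).matches()) {
          keys.add(summary.getKey());
        }
      }
      // Continue from where this page of results ended.
      request.setMarker(listing.getNextMarker());
    } while (listing.isTruncated());

    return keys;
  }
}
```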


Page 11: Gobblin @ NerdWallet (Nov 2015)

Future: Dashboards


Page 12: Gobblin @ NerdWallet (Nov 2015)

Gobblin @ NW Tomorrow

● More data types
  ○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
  ○ Offer data from our site: MySQL => S3 (batch and incremental)
  ○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
  ○ Salesforce Data
● Integration with Airflow DAGs
● Integration with data cleansing & entity matching frameworks


Page 13: Gobblin @ NerdWallet (Nov 2015)

Early Adoption Pain Points & Solutions

● Best practices for ingestion w/ transformation steps
● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
● Best practices around scheduler integration - Quartz (built-in) vs ETL schedulers
● Backwards-incompatible changes forced us to run migrations when upgrading versions
● No changelogs & tagged releases


Page 14: Gobblin @ NerdWallet (Nov 2015)

Things We Would Like to See/Add in the Future

● Abstract out Avro-specific code
● Best practices for scheduler integration (can contribute for Airflow)
● Clustering without requiring Hadoop & YARN
● Metadata support (job X produced files Y, Z)
● Release notes & tags :)
● The build & unit test process is very bloated
  ○ Hard to differentiate warnings/stack traces vs legitimate build issues
  ○ Opens ports, creates temporary dbs, etc., which makes it difficult to test on arbitrary machines (port collisions)


Page 15: Gobblin @ NerdWallet (Nov 2015)

Thanks! Questions??
