2015 nov 27_thug_paytm_rt_ingest_brief_final
TRANSCRIPT
What do we do at Paytm Toronto?
• Create Data Platforms and Data Products to be used by Paytm in the following areas:
  • Create a comprehensive data platform for archive, processing, and data-based decisions
  • Fraud Detection and Prevention
  • Analytical Scoring, Reporting, and other Data Science challenges
  • Building an advertising technology platform to generate revenue and increase customer engagement
  • Blockchain(!) Stay tuned…
The Data Platform at Paytm:
• Maquette
  • Continues to be the mainstay of RT Fraud Prevention
  • Provides a Rule DSL as well as a flexible data model
• Fabrica
  • Modularized feature creation framework (yes, a real framework)
  • In Spark and SparkSQL for now
  • Will move to Spark Streaming very soon
• Chilika
  • Hadoop and related technologies
  • Processing and storage muscle for most data platform tasks
  • Includes data export and BI tooling like AtScale, Tableau, and ES
Fabrica
• Modular framework for execution of feature creation, scoring, and export jobs (see the sketch after this list)
• Parallel job execution, optimized by caching targeted datasets
• Handles complex transformations and can automate new feature creation
• Easily consumes Machine Learning libraries, especially Spark MLlib
• Starts as a Spark batch job and moves to a Spark Streaming application
• Written in Spark Scala
• A DSL coming later
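As a rough illustration of the modular batch shape described above, here is a minimal sketch in Spark Scala. The FeatureModule trait, the OrderCountFeatures module, and the /data and /features paths are hypothetical stand-ins, not Fabrica's actual API.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.{DataFrame, SQLContext}

  // Hypothetical contract: each feature set is a small, composable unit.
  trait FeatureModule {
    def name: String
    def build(input: DataFrame): DataFrame
  }

  // Example module: per-customer order counts over the input dataset.
  object OrderCountFeatures extends FeatureModule {
    val name = "order_counts"
    def build(orders: DataFrame): DataFrame =
      orders.groupBy("customer_id").count().withColumnRenamed("count", "order_count")
  }

  object FeatureJob {
    def main(args: Array[String]): Unit = {
      val sc  = new SparkContext(new SparkConf().setAppName("fabrica-style-sketch"))
      val sql = new SQLContext(sc)

      // Cache the shared input once so independent modules can reuse it.
      val orders = sql.read.parquet("/data/orders").cache()

      // Modules are independent, so they could also be submitted in parallel.
      val modules: Seq[FeatureModule] = Seq(OrderCountFeatures)
      modules.foreach(m => m.build(orders).write.parquet(s"/features/${m.name}"))
    }
  }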
Maquette:
• Realtime rule engine for fraud detection
• All of our marketplace transactions are evaluated in realtime with concurrent evaluation of hundreds of fraud rules (a sketch of this pattern follows the list)
• Custom Scala/Akka application with a Cassandra datastore
• Can be used with our datastores, such as Titan, GraphIntelligence, HBase, etc.
• Interface for Rule and Threshold tuning
• Handles millions of txns per day at an average response time of 20ms
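The concurrent-evaluation bullet might look roughly like the plain-Scala sketch below. The real Maquette is an Akka application with rules written in its Rule DSL; Txn, FraudRule, and HighAmountRule here are made-up stand-ins.

  import scala.concurrent.{Await, ExecutionContext, Future}
  import scala.concurrent.duration._

  // Hypothetical shapes; real rules come from the Maquette Rule DSL.
  case class Txn(id: String, customerId: String, amountPaise: Long)
  case class RuleResult(rule: String, flagged: Boolean)

  trait FraudRule {
    def name: String
    def evaluate(txn: Txn): RuleResult
  }

  object HighAmountRule extends FraudRule {
    val name = "high_amount"
    def evaluate(txn: Txn) = RuleResult(name, txn.amountPaise > 50000000L) // flag orders above 5 lakh INR
  }

  object RuleEngine {
    // Evaluate every rule concurrently and collect the flags for this transaction.
    def evaluateAll(txn: Txn, rules: Seq[FraudRule])
                   (implicit ec: ExecutionContext): Future[Seq[RuleResult]] =
      Future.sequence(rules.map(r => Future(r.evaluate(txn))))
  }

  object Demo extends App {
    import ExecutionContext.Implicits.global
    val results = Await.result(
      RuleEngine.evaluateAll(Txn("t1", "c42", 75000000L), Seq(HighAmountRule)), 1.second)
    println(results.filter(_.flagged).map(_.rule))
  }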
A “traditional” batch ingest process to Hadoop
• A 24 hour cycle batch-driven process
• A reservoir of detailed data for the past 6 months for core tables, 2 years for some higher-level data, and a few months for the new data sets
• Hadoop (HDP) data processing tools primarily, specifically Hive
• Hive (SQL) transformations require so much additional logic for reprocessing, alerting, etc. that we have Python programs call them
• For event streams (aka Real-Time), we load into Kafka, then pull this event data off into an aggregated Avro file for archive in HDFS (sketched below)
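The Avro archive step in the last bullet might look roughly like this minimal sketch, assuming the events have already been pulled off Kafka and decoded into Avro GenericRecords; the helper name and HDFS path are made up for illustration.

  import org.apache.avro.Schema
  import org.apache.avro.file.DataFileWriter
  import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  object AvroArchiver {
    // Append a batch of decoded events into a single Avro container file on HDFS.
    def archive(events: Seq[GenericRecord], schema: Schema, hdfsPath: String): Unit = {
      val out    = FileSystem.get(new Configuration()).create(new Path(hdfsPath))
      val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
        .create(schema, out)
      try events.foreach(e => writer.append(e))
      finally writer.close() // close() also closes the underlying HDFS stream
    }
  }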
Lots of room for improvement…
• A 24 hour cycle batch-driven process means stale data for a lot of use cases
• The most important and most fragile pipeline is MySQL
• The MySQL instances rely on a chain of Master-Replica-Replica-Replica to get to Hadoop. This chain fails a lot
• The MySQL chain has a fixed schema from RDBMS to Hive
  • Assumptions that this schema is fixed are carried forward throughout our own processing pipeline
  • Changes to the schema result in a cascading failure
• Hive does not have a resilient and programmatic way of handling schema change
• Others have attempted to write custom Java Hive SerDes to correct data, but this puts too much logic in the wrong spot
• By using Hive for complicated transformations, we have forced unnecessary temporary tables, created hacky nanny scripts, and made it nearly impossible to compose complicated transformations
A word on impatience…
• The amount of signals and actions that a mobile phone user will generate is much higher than that of a web user, by virtue of their mobility
• Reducing the MTTF (Mean Time To Feature) from hours to minutes opens up an entirely new set of interactions with users:
  • More advertising inventory with hyperlocal (i.e. walk into a store) targeting, i.e. more revenue potential
  • Better fraud detection and risk assessment
  • More opportunities to construct a meaningful relationship with the customer through helpful engagement:
    • bill splitting
    • localized shopping reminders – “while you are here...”
    • experience planning (you were looking for a restaurant, so we suggest something you would like, plan your transit/train, and order via FoodPanda)
Our realtime approach: DFAI (Direct-From-App-Ingest)
• Requires our primary applications to implement an SDK that we provide
• The SDK is a wrapper around the Confluent.io SDKs with our schemas registered (sketched below)
• Schema management is done automatically with the Confluent.io schema repository using Apache Avro
• Avro schemas are flexible, unlike @#$^@!!! SQL schemas
• Avro is open source, and the schemas would still be manageable even if we moved away from using Confluent. Our data is safe for the long term.
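To make the SDK bullet concrete, here is a minimal sketch of an Avro-typed producer using the Confluent serializer and schema registry. The broker and registry URLs, the topic name, and the cut-down schema are illustrative assumptions, not our actual SDK.

  import java.util.Properties
  import org.apache.avro.Schema
  import org.apache.avro.generic.GenericData
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  object DfaiProducerSketch extends App {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")                    // placeholder broker
    props.put("schema.registry.url", "http://schema-registry:8081") // placeholder registry
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // The Confluent serializer registers/looks up the Avro schema in the registry automatically.
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")

    // Cut-down illustrative schema; the real order_invoice_value schema has more fields.
    val schema = new Schema.Parser().parse(
      """{"namespace":"com.paytm","name":"order_invoice_value","type":"record",
        |  "fields":[{"name":"tax_amount","type":["null","long"],"default":null},
        |            {"name":"created_at","type":"string"}]}""".stripMargin)

    val invoice = new GenericData.Record(schema)
    invoice.put("tax_amount", 1800L)
    invoice.put("created_at", "2015-11-27T12:00:00Z")

    val producer = new KafkaProducer[String, Object](props)
    producer.send(new ProducerRecord[String, Object]("order_invoice", "order-123", invoice))
    producer.close()
  }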
DFAI = Direct-From-App-Ingest
• Market-order
  • order/invoice: sales_order_invoice table
    • create
    • updateAddress
  • order/order: sales_order table
    • create
    • update
  • order/payment: sales_order_payment table
    • create
    • insertUpdateExisting
    • update
  • order/item: sales_order_item table
    • create
    • update
  • order/address: sales_order_address table
    • create
    • updateAddress
  • order/return_fulfillment: sales_order_return table
    • create
    • update
  • order/refund: sales_order_refund table
    • create
    • update
Order/invoice schema example:
{ "namespace" : "com.paytm", "name": "order_invoice_value", "type": "record", "fields": [ { "name": "tax_amount", "type": unitSchemas.nullLong}, { "name": "surchange_amount", "type": unitSchemas.nullLong.name}, { "name": "subtotal", "type":unitSchemas.nullLong.name}, { "name": "subtotal_incl_tax", "type":unitSchemas.nullLong.name}, { "name": "fulfillment_id", "type":unitSchemas.nullLong.name}, { "name": "created_at", "type": unitSchemas.str} ]}
Event Sourcing & Stream Processing
• Read this: http://www.confluent.io/blog/making-sense-of-stream-processing/
• Basic concept: treat the stream as an immutable set of events and then aggregate views specific to use cases
• We will eventually use Fabrica to stream and aggregate data in realtime
• For batch work, we post-process the create/update events to yield an aggregate state view, in other words the last state (a sketch of this follows the list)
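A minimal Spark SQL sketch of that last-state post-processing is below, assuming each event carries order_id and created_at columns; those column names are assumptions for illustration, not our actual schema.

  import org.apache.spark.sql.{DataFrame, functions => F}
  import org.apache.spark.sql.expressions.Window

  object LastStateView {
    // Collapse a stream of create/update events into "the last state":
    // keep only the most recent event per order_id.
    def lastState(events: DataFrame): DataFrame = {
      val latestFirst = Window.partitionBy("order_id").orderBy(F.col("created_at").desc)
      events
        .withColumn("rn", F.row_number().over(latestFirst))
        .filter(F.col("rn") === 1)
        .drop("rn")
    }
  }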
Thank you!
Adam [email protected]