Data Integration with Spark | Mar 18, 2015
2
ABOUT ME
Brian Schimpf
Palantirian since 2007
Director of Engineering
3
Founded out of PayPal in 2004
PALANTIR
Human-computer symbiosis, data integration
Government space, counter-terrorism
4
PALANTIR
THE PROBLEM: TRADER OVERSIGHT
6
1995: BARINGS BANK, €825M loss
2010: SOCIETE GENERALE, €5B loss
2012: UBS, £1.4B loss, £30M fine
2013: GOLDMAN SACHS, unauthorized $8B position, $120M loss
THE PROBLEM
7
LACK OF CONTEXT
Single-point alerting drowns analysts in noise and fails to capture complex patterns of behavior.
HIGH NOISE, LOW SIGNAL
The majority of incidents begin with a small breach that may not look very different from normal trading activity.
DATA SCALE & DIVERSITY
The data is both massive in scale and highly diverse, spanning structured and unstructured sources.
THE PROBLEM
OUR SOLUTION: TRADER OVERSIGHT
9
IMPROVE THE RISK MODEL
10
IMPROVE THE INTERFACE
11
DATA INTEGRATION
ANALYTICS
DECISIONS
PALANTIR IN PRACTICE
12
FOUNDRY
Developer tools for data
Variety of incoming data
Manage lots of transformations
Spark & open source
13
Sources: LOGS, JDBC, STREAMS
Dataset in HDFS/S3
SNAPSHOT – VER 1: /dataset/1/main.avro
UPDATE – VER 2: /dataset/2/main.avro
UPDATE – VER 3: /dataset/3/main.avro
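The paths on this slide suggest that each committed version of a dataset gets its own directory. A minimal Python sketch of that path convention follows. It assumes (the slide does not say so) that version 1 is a full snapshot and that later versions are updates read on top of it. `version_path` and `paths_to_read` are hypothetical helper names, not Foundry APIs.

```python
from typing import List


def version_path(dataset_root: str, version: int) -> str:
    """Return the file path for one dataset version, e.g. /dataset/2/main.avro."""
    if version < 1:
        raise ValueError("versions start at 1")
    return f"{dataset_root.rstrip('/')}/{version}/main.avro"


def paths_to_read(dataset_root: str, current_version: int) -> List[str]:
    """Assumed semantics: the snapshot (version 1) plus every update up to current."""
    return [version_path(dataset_root, v) for v in range(1, current_version + 1)]


# List the files a reader of version 3 would need under this assumption.
print(paths_to_read("/dataset", 3))
```

Because every version lives under its own path, earlier versions remain readable after an update is written.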
FOUNDRY
14
DATASET A
DATASET B
DATASET C
VIEW D: Spark Transform
VIEW E: Python Script
VIEW F: SparkSQL
SchemaRDD
pandas DataFrame
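Datasets feeding views, which may feed further views, form a dependency graph, and a graph like that implies a build order. The sketch below derives one with Python's standard-library `graphlib`. The slide names the nodes but not the exact wiring, so the edges here are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical wiring: each view maps to the inputs it reads.
inputs = {
    "VIEW D": {"DATASET A", "DATASET B"},
    "VIEW E": {"DATASET C"},
    "VIEW F": {"VIEW D", "VIEW E"},
}


def build_order(graph):
    """Return all nodes ordered so that every input precedes the views reading it."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(inputs)
print(order)
```

Any order that places a view after all of its inputs is valid, so the exact sequence printed may vary between equally correct answers.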
FOUNDRY
15
DATASET A, DATASET B, DATASET C
VIEW D, VIEW E, VIEW F
Every dataset and view: CURRENT VERSION 1
Views list their dependency versions after the arrow (CURRENT VERSION 1 -> ...):
DEP A VERSION = 1, DEP B VERSION = 1, DEP C VERSION = 1, DEP D VERSION = 1, DEP E VERSION = 1
FOUNDRY
16
DATASET A, DATASET B, DATASET C
VIEW D, VIEW E, VIEW F
One node: CURRENT VERSION 2; all others: CURRENT VERSION 1
Views list their dependency versions after the arrow (CURRENT VERSION 1 -> ...):
DEP A VERSION = 1, DEP B VERSION = 1, DEP C VERSION = 1, DEP D VERSION = 1, DEP E VERSION = 1
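Recording dependency versions next to each view's current version makes it possible to tell when a view is out of date. A minimal Python sketch of that comparison follows. The wiring of views to inputs, and the choice of which node moved to version 2, are assumptions; the slide shows neither unambiguously. `stale_views` is a hypothetical name.

```python
from typing import Dict, List

# Current version of every node, mirroring the "CURRENT VERSION" labels.
# Assumption: dataset B is the node that advanced to version 2.
current: Dict[str, int] = {"A": 1, "B": 2, "C": 1, "D": 1, "E": 1, "F": 1}

# Dependency versions each view recorded ("DEP X VERSION = n").
# Hypothetical wiring, for illustration only.
built_against: Dict[str, Dict[str, int]] = {
    "D": {"A": 1, "B": 1},
    "E": {"C": 1},
    "F": {"D": 1, "E": 1},
}


def stale_views(current: Dict[str, int],
                built_against: Dict[str, Dict[str, int]]) -> List[str]:
    """Return views whose recorded dependency versions differ from current ones."""
    stale = []
    for view, deps in built_against.items():
        if any(current[dep] != version for dep, version in deps.items()):
            stale.append(view)
    return sorted(stale)


print(stale_views(current, built_against))  # ['D']
```

In this sketch only view D is flagged. View F would become stale in turn once D is rebuilt and its own version advances.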
FOUNDRY
17
FOUNDRY
18
DATASET A
VIEW D
VIEW E
VIEW F DATASET B
DATASET C
Inspect the data - SchemaRDD
MERIDIAN
20
Take advantage of Spark improvements
Human/Computer Symbiosis
FUTURE
THANK YOU!
WE’RE RECRUITING! palantir.com/jobs
Questions? [email protected]