PyATL Meetup, Oct 8, 2015
Processing Big Data Predictive Analytics
PyATL Meetup 10/18/2015

Who Am I
• Roy Russo
• VP Engineering, Predikto

Why Am I Here?
& (Big Data) Predictive Analytics

Agenda
• What we do
• Problems we faced
• How we solved them
  • Rationale
  • What’s good. What’s not.

Who is Predikto?
• Atlanta-based
• Founded in 2012
• Funded
• Paying Customers
• Mechanical Engineers
• Big Data Architects
• Global 1000
… and we don’t suck.

What do we do?
• Actionable Predictive Maintenance
• Predictive Analytics
• Real-time health scoring
• Unified asset health view
• SaaS

HOW DOES PREDICTIVE ANALYTICS WORK?

Data Sources

How We Do It
[Diagram: data sources, the Predikto Enterprise Platform, and its output of time-based actionable predictions]
• TRAIN MGT SYSTEMS: INTELLITRAIN, GE RM&D, EMD, NYAB LEADER, MOTIVEPOWER, WABTEC
• EAM: SAP, INFOR, ORACLE, MAXIMO
• OTHER: BEACONS, WILD, WEATHER, TCIS, CUSTOM APPS, UMLER
• PREDIKTO ENTERPRISE PLATFORM: PREDIKTO INPUT APIS, DATA TRANSFORM. ENGINE, MAX MACHINE LEARNING ENGINE, PREDIKTO OUTPUT APIS & DASHBOARDS
• Output: TIME-BASED ACTIONABLE PREDICTIONS

The Pipeline
[Diagram: Predikto data Pipeline, “MAX”, Operational Integration]
• Stages: Data Aggregation/ETL, Machine Learning/Analytics, Outbound APIs/Integration
• Labels: ETL, Standard JSON, AutoDynamic Feature Engineering, AutoDynamic Feature Selection, SMS, Integration, UI Data Store

High-Level Requirements
Data Processing
• ETL on LARGE datasets
  • Fast
  • Commodity hardware
• Feature scoring/selection on LARGE datasets
• Scale horizontally
• Runs from Instruction-set
Data Querying
• Visualize LARGE datasets
  • Time-Series
  • Fast
  • Commodity hardware
• Scale horizontally
• Support dynamic querying
  • Differing Schema

Why Spark?
• ETL:
  • Shared Memory
  • Not Disk-Bound
  • Distributed workloads
• Feature scoring/selection on LARGE datasets
  • Same as above
• Scale horizontally
  • New node = more capacity
  • Spin up. Spin down.
• Runs from Instruction-set
  • DAGs
• Python Devs… PySpark

Implementing Spark
• Use DAGs
  • Directed Acyclic Graphs
  • Config-Driven Workflows
• Tune to Job
  • Workers / Core
  • Memory Tuning
  • CPU or RAM?
• Cons:
  • Steep learning curve
    • Dev & Ops
  • Documentation
  • Exception handling
  • Native Scala
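One way to read "Tune to Job": keep a small tuning profile per job and render it into `spark-submit` flags. This is a minimal sketch, not Predikto's setup. The job names, script name, and values are hypothetical; `spark.executor.memory` and `spark.executor.cores` are standard Spark properties, and `--conf key=value` is the standard `spark-submit` flag.

```python
# Hypothetical per-job tuning profiles; names and values are illustrative only.
JOB_PROFILES = {
    "etl_wide_join": {  # assumed to be RAM-bound
        "spark.executor.memory": "8g",
        "spark.executor.cores": "2",
    },
    "feature_scoring": {  # assumed to be CPU-bound
        "spark.executor.memory": "2g",
        "spark.executor.cores": "4",
    },
}


def spark_submit_args(job_name, script, profiles=JOB_PROFILES):
    """Build a spark-submit argument list for one job from its profile."""
    args = ["spark-submit"]
    # Sort keys so the generated command line is stable between runs.
    for key, value in sorted(profiles[job_name].items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(script)
    return args


print(" ".join(spark_submit_args("etl_wide_join", "etl.py")))
```

Keeping the profile next to the workflow config means the "CPU or RAM?" decision is recorded per job rather than set once for the whole cluster.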

Spark UI

More on DAGs

class comma_to_decimal(BaseDagTask):
    @staticmethod
    def run(sc, config, rdds, log):
        from shippable.dag_tasks._utils import safe_map

        out = []
        for rdd in rdds:
            if rdd is not None:
                # config['cols'] is a comma-separated list of column names
                cols = [str(x).strip() for x in config['cols'].split(',')]
                # Turn "1.234,56" into "1234.56", in the listed columns only
                rdd = safe_map(
                    rdd,
                    lambda x: {k: v if str(k) not in cols
                               else str(v).replace('.', '').replace(',', '.')
                               for k, v in x.items()},
                    log)
                out.append(rdd)
            else:
                out.append(None)
        return out, None

"datetimes_ones": {
    "after": "datatypes_ones",
    "type": "convert_datetimes",
},
"comma_convert": {
    "after": "datetimes_ones",
    "type": "comma_to_decimals",
},
"parentids_ones": {
    "after": "devids_ones",
    "type": "comma_convert",
},
"timestamps_ones": {
    "after": "parentids_ones",
    "type": "rename",
},
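How a config with "after" and "type" keys could drive a workflow, as a minimal sketch. It runs on plain lists of dicts instead of RDDs so it works without Spark. The step names, the `args` key, the task registry, and the runner are all hypothetical, not Predikto's framework. Ordering uses the standard-library `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def comma_to_decimal(rows, cols):
    """Rewrite '1.234,56' as '1234.56' in the named columns."""
    return [
        {k: str(v).replace(".", "").replace(",", ".") if k in cols else v
         for k, v in row.items()}
        for row in rows
    ]


def rename(rows, mapping):
    """Rename keys according to mapping, leaving other keys untouched."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


# Hypothetical registry: the "type" string in the config selects a task.
TASK_TYPES = {"comma_to_decimal": comma_to_decimal, "rename": rename}

# Hypothetical workflow; steps are deliberately listed out of order.
WORKFLOW = {
    "rename_cols": {"after": "fix_decimals", "type": "rename",
                    "args": {"mapping": {"cost": "cost_eur"}}},
    "fix_decimals": {"after": None, "type": "comma_to_decimal",
                     "args": {"cols": ["cost"]}},
}


def execution_order(workflow):
    """Order steps so each one runs after the step named in its 'after'."""
    graph = {name: ({step["after"]} if step["after"] else set())
             for name, step in workflow.items()}
    return list(TopologicalSorter(graph).static_order())


def run_workflow(workflow, rows):
    """Apply every step in dependency order, feeding each the previous output."""
    for name in execution_order(workflow):
        step = workflow[name]
        rows = TASK_TYPES[step["type"]](rows, **step["args"])
    return rows


result = run_workflow(WORKFLOW, [{"id": 7, "cost": "1.234,56"}])
print(result)
```

Because the order comes from the "after" links, steps can be added or reordered in the config without touching the runner. `TopologicalSorter` also raises `CycleError` if the config is not acyclic.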

Why Elasticsearch?
• Time-Series Data
  • Fast-Reads
• Fast writes with Bulk Inserts
  • Asynch
• Dynamic querying
  • Differing schema
  • Everything is indexed
• Visualization
  • GeoJSON
• Scale horizontally
  • New node = more capacity
• Spark-ES Connector
• REST API & Python lib
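"Fast writes with Bulk Inserts" refers to the `_bulk` endpoint, which takes newline-delimited JSON: one action line, then one document line, per record. A minimal sketch of building that body by hand. The index name, field names, and values are hypothetical, and older Elasticsearch versions also expect a type in the action line or the URL.

```python
import json


def bulk_body(index, docs, id_field=None):
    """Serialize docs into the newline-delimited body the _bulk endpoint expects."""
    lines = []
    for doc in docs:
        meta = {"_index": index}
        if id_field is not None:
            meta["_id"] = doc[id_field]
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(doc))              # document line
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical sensor readings.
readings = [
    {"asset": "loco-1", "date_epoch": 1, "temp": 71.5},
    {"asset": "loco-2", "date_epoch": 2, "temp": 68.0},
]
body = bulk_body("readings", readings)
print(body)
```

In practice the Python client's `elasticsearch.helpers.bulk` builds and sends these requests for you. The hand-rolled version only shows what goes over the wire.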

ES-Hadoop

from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=[es_uri])

# some_var must already be a JSON array string, since "terms" takes a list
query = '{"query": {"filtered": {"filter": {"terms": {"date_epoch": ' + some_var + '}}}}}'

es_conf = {
    "es.nodes": es_host_name,
    "es.port": es_host_port,
    "es.resource": es_index + '/' + es_mapping,
    "es.query": query,
}

# Read the matching documents into an RDD of (key, document) pairs
es_RDD = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

# Keep only the document, then apply reformat_location (helper not shown)
dicts = es_RDD.map(lambda x: x[1]).map(lambda x: reformat_location(x))
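The query string above is built by concatenation, which breaks if the variable is not already valid JSON. A minimal sketch of building the same query and connector settings from Python objects. The helper names, host, index, and values are hypothetical; the `es.*` keys are the ones used on the slide.

```python
import json


def terms_filter_query(field, values):
    """Return the slide's filtered/terms query as a JSON string."""
    body = {"query": {"filtered": {"filter": {"terms": {field: list(values)}}}}}
    return json.dumps(body)


def es_hadoop_conf(host, port, index, mapping, query):
    """Assemble the conf dict passed to sc.newAPIHadoopRDD(conf=...)."""
    return {
        "es.nodes": host,
        "es.port": str(port),  # Hadoop configuration values are strings
        "es.resource": index + "/" + mapping,
        "es.query": query,
    }


# Hypothetical values for illustration.
query = terms_filter_query("date_epoch", [1, 2])
conf = es_hadoop_conf("localhost", 9200, "readings", "sensor", query)
print(conf["es.query"])
```

The `filtered` query matches the Elasticsearch versions current at the time of the talk. Later releases dropped it in favour of a `bool` query with a `filter` clause, so the body would need adjusting there.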

Implementing Elasticsearch
• Tune
  • Memory-Bound
  • Large Drives (SSDs)
  • Front w/ Load Balancer
• Spark
• Schema:
  • Understand your customer’s data!
  • Typing is important
    • Timestamps, GeoHASH, Floats, etc…
• Cons:
  • Moderate learning curve
    • Query syntax
  • Security
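"Typing is important" in practice means declaring field types up front instead of relying on dynamic mapping. A minimal sketch of an explicit mapping as a Python dict, plus a check for fields that would fall through to dynamic typing. The field names and the helper are hypothetical; `date`, `geo_point`, and `float` are standard Elasticsearch field types.

```python
# Hypothetical mapping for sensor readings; the field names are assumptions.
SENSOR_MAPPING = {
    "properties": {
        "timestamp": {"type": "date"},
        "location": {"type": "geo_point"},  # accepts lat/lon or a geohash string
        "temperature": {"type": "float"},
    }
}


def untyped_fields(doc, mapping):
    """Return the keys of doc that have no explicit type in the mapping."""
    return sorted(set(doc) - set(mapping["properties"]))


# A customer document with one field the mapping does not cover.
doc = {"timestamp": 1444262400000, "location": "dn5bpsbw", "temperature": 71.5,
       "axle_count": "6"}
print(untyped_fields(doc, SENSOR_MAPPING))
```

Running a check like this over sample customer data before indexing surfaces fields such as a numeric value sent as a string, which dynamic mapping would otherwise index as text.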

Architecture

PREDICT | PREVENT | PERFORM
http://www.predikto.com/company/careers