what's next for big data? -- apache spark

WHAT’S NEXT FOR BIG DATA?

APACHE SPARK

WTH IS SPARK?

TUMRA - Big Data Week, May 2014

Spark is …

“One platform to rule them all” … and blurs the boundaries between SQL, machine learning, streams & graphs


Spark is …

… gaining momentum


Spark has …

… more contributors than Hadoop


Spark can …

Source: Databricks


Spark Stack

Source: Databricks

Hadoop (HDFS)


Why Spark?

-  Code reuse across batch, streaming and interactive applications
-  Easy API from Scala, Java & Python
-  In-memory data sharing
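The code-reuse point can be illustrated with a minimal sketch in plain Python. It deliberately does not use the Spark API; the function name `count_events_by_type` and the sample events are made up for illustration. The idea is that one piece of logic serves both a batch run and a stream of micro-batches.

```python
from collections import Counter
from typing import Iterable


def count_events_by_type(events: Iterable[dict]) -> Counter:
    """Shared logic: tally events by their 'type' field."""
    return Counter(event["type"] for event in events)


# Batch: run once over a complete (here tiny, made-up) archive.
archive = [{"type": "view"}, {"type": "click"}, {"type": "view"}]
batch_counts = count_events_by_type(archive)

# Streaming: apply the same function to each micro-batch and merge the results.
micro_batches = [archive[:2], archive[2:]]
stream_counts = Counter()
for micro_batch in micro_batches:
    stream_counts += count_events_by_type(micro_batch)

# Both paths reuse one function and arrive at the same totals.
print(batch_counts == stream_counts)
```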

FAAAAAAST!!!

Check out http://spark.apache.org


CASE STUDY: PERSONALISATION & MARKETING AUTOMATION


Our history with Spark

-  Early adopters; PoC in Dec ‘12
-  In production since March ‘13
-  Running on Amazon EC2
-  Ad-hoc analysis and reporting
-  Machine learning model building
-  Integrates with our real-time dashboards


Use Case: Personalisation


Use Case: Personalisation (cont’d)

-  Matching visitors to products
-  50% of visitors are ‘new’ and have no history to work with
-  Blend of pre-computation and real-time recommendations
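One way such a blend could work is sketched below. This is an assumption for illustration, not a description of TUMRA's system: the function `recommend`, its parameters and the sample product ids are all hypothetical. Known visitors start from recommendations computed offline; new visitors, who have no history, fall back to popular products, and anything already viewed in the current session is filtered out at request time.

```python
from typing import Dict, List


def recommend(
    visitor_id: str,
    precomputed: Dict[str, List[str]],  # offline results, keyed by visitor id
    session_views: List[str],           # products seen in the current session
    popular: List[str],                 # fallback ranking for visitors with no history
    k: int = 3,
) -> List[str]:
    """Return up to k product ids, preferring precomputed recommendations."""
    candidates = list(precomputed.get(visitor_id, [])) + list(popular)
    seen = set(session_views)
    recs: List[str] = []
    for product in candidates:
        if len(recs) >= k:
            break
        # Skip products already viewed this session and duplicates.
        if product not in seen and product not in recs:
            recs.append(product)
    return recs


precomputed = {"v1": ["a", "b"]}
popular = ["c", "a", "d", "e"]

returning_visitor = recommend("v1", precomputed, ["b"], popular)
new_visitor = recommend("new", precomputed, [], popular, k=2)
print(returning_visitor, new_visitor)
```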


Use Case: Marketing Automation

-  Collect user engagement data across websites and mobile apps
-  Increase subscription rates
-  Identify users at risk of churn
-  Automated personalised marketing


Data Volumes & Velocity

-  29M events per day
-  Peak rates ~800 events / second
-  All events streamed to Kafka
-  10B archived events in Amazon S3
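A quick back-of-the-envelope calculation puts these figures in context. The input numbers are from the slide; the "days of archive" figure assumes the daily volume has been constant, which is a simplification made only for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 29_000_000
peak_per_second = 800
archived_events = 10_000_000_000

# Average ingest rate implied by the daily volume.
avg_per_second = events_per_day / SECONDS_PER_DAY

# How far the peak sits above the average.
peak_to_avg = peak_per_second / avg_per_second

# Archive size expressed in days of traffic (assumes a constant daily volume).
days_of_archive = archived_events / events_per_day

print(f"average rate:    {avg_per_second:.0f} events/s")  # about 336
print(f"peak vs average: {peak_to_avg:.1f}x")             # about 2.4x
print(f"archive size:    {days_of_archive:.0f} days")     # about 345
```

So the system has to absorb bursts of roughly 2.4 times the average rate, and the S3 archive holds close to a year of traffic at the current volume.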


How we use Spark

Amazon S3 (HDFS interface)

Apache Kafka

Data Collection API (Akka) & Connectors


Spark gives us …

-  Unified platform for machine learning and graph analytics
-  Ability to experiment at huge scale
-  SQL interfaces to existing tools
-  Code reuse from data scientists to production workloads


WANT TO KNOW MORE?


http://spark.apache.org


Spark Summit 2014


Spark London Meetup


Commercial Support & Certification


THANK YOU

@tumra tumra.com

slideshare.net/tumra
