ibm - spark y como lo usamos en aspgems - …files.meetup.com/7770922/ibm - spark y como lo usamos...


Upload: vanthu

Post on 28-May-2018


Page 1

December 2010

Apache Spark and how we use it in our projects

Page 2

Who am I

CDO at ASPgems

Former President of Hispalinux (Spanish LUG)

Author of “La Pastilla Roja”, the first Spanish book about Free Software.

Page 3

Menu

A bit of history: Hadoop

What is Apache Spark

Kappa Architecture

Some real-world cases with Spark

Page 4

A bit of history

In the beginning

The Google papers

Birth of the Hadoop project

Page 5

The pre-Spark era

Google publishes some key papers, which inspire:

HDFS (from the Google File System paper)

MapReduce

HBase (from the Bigtable paper)

Page 6

The pre-Spark era

Hadoop is born

First free implementation

Yahoo is the sponsor

One of the most interesting and productive working environments of the 2000s grows up around it

Page 7

Batch-Hadoop (MR1)

Page 8

Batch-MapReduce
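
For readers who have not used it, the MapReduce model named on this slide can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce phases on a word count; the function names are hypothetical and no Hadoop API is involved.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    # Emit one (word, 1) pair per word in the input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values by key, as the framework does between map and reduce.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Collapse every value seen for one key into a single result.
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Run the three phases in sequence on an in-memory input.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling `word_count(["spark hadoop", "Spark kafka"])` counts "spark" twice and the other words once. Problems that do not decompose into independent map and reduce steps are the ones that fit this model poorly.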

Page 9

Batch-Cascading

Page 10

Things we don't quite like about Hadoop

Some things about Hadoop don't fully convince us.

Not every problem can be solved with MapReduce.

Hadoop has spawned a whole set of projects and technologies to work around that.

Page 11

https://www.tele-task.de/archive/lecture/overview/5721/

Page 12

Meanwhile …

Hadoop has turned 10

There have been many changes in both hardware and software

Page 13

https://www.tele-task.de/archive/lecture/overview/5721/

In memory Computing

Page 14

And suddenly, Apache Spark is born

It is a parallelized analytics engine.

Written from scratch.

Inheriting the best of the “Hadoop” experience.

Probably the most beautiful piece of engineering of recent years.

Page 15

What is Apache Spark

The first thing worth noting: it was designed from the start to make the most of memory.

Resilient and scalable by design.

Page 16

What is Apache Spark

Written in, designed around, and influenced by the Scala language.

Page 17

Page 18

Spark, the elevator pitch

Page 22

Spark Streaming

The key piece for treating information as a stream.

More on this later, with the Kappa Architecture.

Page 23

Spark Streaming

Page 24

Spark SQL

Any information that has a schema can be queried with SQL.

Page 25

A little context

On July 2, 2014, Jay Kreps coined the term Kappa Architecture in an article for O'Reilly Radar.

Page 26

Who is Jay Kreps

Jay has been involved in lots of projects:

Author of the essay:

The Log: What every software engineer should know about real-time data's unifying abstraction (12/16/2013)

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Page 27

Jay Kreps

Author of the book: I ♥ Logs

Page 28

Jay Kreps

Involved in projects such as:

Apache Kafka

Apache Samza

Voldemort

Azkaban

Ex-LinkedIn

Now co-founder and CEO of Confluent

Page 29

Lambda Architecture

It looks something like this:

https://www.mapr.com/developercentral/lambda-architecture

Page 30

Lambda Architecture

The batch layer provides the following functionality:

Managing the master dataset, an immutable, append-only set of raw data.

Pre-computing arbitrary query functions, called batch views.

https://www.mapr.com/developercentral/lambda-architecture

Page 31

Lambda Architecture

Serving layer

This layer indexes the batch views so that they can be queried ad hoc with low latency.

Speed layer

This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
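
A minimal sketch of how the three layers fit together, assuming the query is a page-view count. The event shape, the function names and the cutoff timestamp are hypothetical and not from the talk, and a real speed layer would update its view incrementally instead of recounting.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (timestamp, page): hypothetical event shape


def batch_view(master_dataset: List[Event], up_to: int) -> Dict[str, int]:
    # Batch layer: recompute the view from the immutable master dataset,
    # covering everything up to the cutoff of the last batch run.
    return dict(Counter(page for ts, page in master_dataset if ts <= up_to))


def speed_view(recent: List[Event], after: int) -> Dict[str, int]:
    # Speed layer: cover only the recent events the batch run has not seen.
    return dict(Counter(page for ts, page in recent if ts > after))


def query(page: str, batch: Dict[str, int], speed: Dict[str, int]) -> int:
    # Serving side: answer a query by merging the batch and real-time views.
    return batch.get(page, 0) + speed.get(page, 0)
```

Note that `batch_view` and `speed_view` encode the same counting logic twice; in a real deployment those two copies live in different frameworks and must be kept in agreement.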

Page 32

Lambda Architecture

Batch layer datasets can be stored in a distributed filesystem, while MapReduce can be used to create batch views that are fed to the serving layer.

The serving layer can be implemented using NoSQL technologies such as HBase, Apache Druid, etc.

Querying can be implemented with technologies such as Apache Drill or Impala.

The speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming.

https://www.mapr.com/developercentral/lambda-architecture

Page 33

Pros of Lambda Architecture

It retains the input data unchanged.

It encourages modeling data transformations as a series of data states derived from the original input.

The Lambda Architecture takes the problem of reprocessing data into account.

This happens all the time: the code changes, and you need to reprocess all the information. There are lots of reasons for it, and you will have to live with this.

Page 34

Cons of Lambda Architecture

Maintaining code that needs to produce the same result in two complex distributed systems is painful.

The code for MapReduce is very different from the code for Storm/Apache Spark.

It is not only about different code; it is also about debugging and interaction with other products (Hive, Oozie, Cascading, etc.).

In the end, it is a problem of different and diverging programming paradigms.

Page 35

So what is Kappa Architecture

Jay Kreps's proposal is quite simple:

Use Kafka (or another system) that lets you retain the full log of the data you need to reprocess.

When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.

Page 36

So what is Kappa Architecture

part II

When the second job has caught up, switch the application to read from the new table.

Stop the old version of the job, and delete the old output table.
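
The reprocessing steps above can be sketched in plain Python, with a list standing in for the retained Kafka log and dicts standing in for output tables. All names here are illustrative assumptions; no Kafka or Spark API is used.

```python
from typing import Callable, Dict, List

Log = List[dict]          # stand-in for a retained log (e.g. a Kafka topic)
Table = Dict[str, float]  # stand-in for an output table


def run_job(log: Log, transform: Callable[[dict], float]) -> Table:
    # A stream job that starts at the beginning of the retained log
    # and writes its aggregates into a fresh output table.
    table: Table = {}
    for record in log:
        table[record["key"]] = table.get(record["key"], 0.0) + transform(record)
    return table


def reprocess(log: Log, tables: Dict[str, Table], active: str,
              new_name: str, transform: Callable[[dict], float]) -> str:
    # Start a second instance of the job with the new code,
    # replaying the full log into a new output table.
    tables[new_name] = run_job(log, transform)
    # Once it has caught up, readers switch to the new table.
    old, active = active, new_name
    # Stop the old job and delete its output table.
    del tables[old]
    return active
```

Keeping the old table until the switch is what makes it possible to compare both versions before committing to the new one.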

Page 37

So what is Kappa Architecture

This architecture looks something like this:

Page 38

So what is Kappa Architecture

The first benefit is that you only need to reprocess when you change the code.

You can check whether the new version works correctly and, if not, revert to the old output table.

You can mirror a Kafka topic to HDFS so you are not limited to the Kafka retention configuration.

You have a single codebase to maintain, on a single framework.

Page 39

So what is Kappa Architecture

The real advantage is not about efficiency at all (you will need extra temporary storage while reprocessing, for example); it is that your team can develop, test, debug and operate its systems on top of a single processing framework.

Page 40

What is not Kappa Architecture

It is not a silver bullet that solves every Big Data problem.

It is not a prescribed list of technologies. You can implement it with your favorite frameworks.

It is not a rigid set of rules, but it helps keep complex projects simple.

Page 41

How we use Kappa Architecture

We start out with projects whose structure is as complex as LinkedIn's looked at an early stage. That is very common.

Page 42

How we use Kappa Architecture

Page 43

How we use Kappa Architecture

We try to refactor the data flows to fit a Kappa Architecture.

Page 44

How we use Kappa Architecture

Page 45

How we use Kappa Architecture

We use Kafka as our stream data platform.

Instead of Samza, we feel more comfortable with Spark Streaming.

At ASPgems we chose Apache Spark as our analytics engine, and not only for Spark Streaming.

Page 46

How we use Kappa Architecture

In the end, Kappa Architecture is a design pattern for us.

We use/clone this pattern in almost all our projects.

We have projects of every size, data volume and speed requirement, and they all fit the Kappa Architecture.

Page 47

Use Cases

Page 48

Telefónica - MSS

We use Kappa Architecture to calculate near-real-time KPIs and SLAs for the managed security system.

We simplify the flow of the input data.

Kafka is the streaming data platform.

As MPP we use Cassandra.

Page 49

IOT - OBD II

One of our clients installs on-board devices in its customers' cars.

We implement an API to get all the information in real time and inject it into Kafka.

The business rules are implemented in a CEP running on Apache Spark Streaming.

As MPP we use Elasticsearch.

Page 50

Insurance Company

We implement Kappa Architecture to process the click stream in real time and cluster users.

We show the content and offers that best fit each user.

Page 51

Energy Facility

We implement Kappa Architecture to process and predict energy consumption.

Our customer includes energy storage systems, and we get all the information about energy storage (ultracapacitors and batteries).

We process this information to calculate the effective lifetime of the components and their degradation.

Page 52

Questions

Page 53

December 2010

Thank you

Juantomás García

[email protected] @juantomas