apache big data europe 2015: selected talks

APACHE: BIG DATA EUROPE 2015Budapest, September 28-30, 2015

tech talk @ ferretAndrii Gakhov

SELECTED TALKS

Photos © Apache Big Data

BEING READY FOR APACHE KAFKAby Michael G. Noll, Confluent Inc.

http://www.slideshare.net/miguno/being-ready-for-apache-kafka-apache-big-data-europe-2015

http://www.slideshare.net/miguno/being-ready-for-apache-kafka-apache-big-data-europe-2015

Apache Kafka is a publish-subscribe messaging rethought as a distributed commit log.

Producer

Producer

Consumer

Consumer

Broker Broker Broker



ZooKeeper

Kafka Cluster

oldest newest

Prod

ucer

Custo

mer

Custo

mer

topic

ABOUT KAFKA FROM JAY KREPS• A consumer just maintains an “offset,” which is the log entry number

for the last record it has processed on each of these partitions. So, changing the consumer’s position to go back and reprocess data is as simple as restarting the job with a different offset. Adding a second consumer for the same data is just another reader pointing to a different position in the log.

• Kafka supports replication and fault-tolerance, runs on cheap, commodity hardware, and is glad to store many TBs of data per machine.

• LinkedIn keeps more than a petabyte of Kafka storage online, and a number of applications make good use of this long retention pattern for exactly this purpose.

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

USING KAFKA• DEB and RPM are available via Confluence Platform

(http://www.confluent.io/developer)• Recommended Python client: kafka-python

(https://github.com/mumrah/kafka-python)• Confluent Kafka-REST is available via Confluent

Platform• Monitoring is important: Host metrics (CPU, memory,

disk I/O and usage, network I/O), Kafka metrics (consumer lag, replication stats, message latency, GC), ZooKeeper metrics (requests latency, #outstanding requests)

http://www.confluent.io/developer

https://github.com/mumrah/kafka-python

NEW IN KAFKA 0.9.0• Copycat is a new framework for loading structured data into and

out of Kafka

• Kafka Streams is a library that supports basic operations (join/filter/map/…), windowing, schema and proper time modelling (event time vs. processing time)

• New unified consumer Java API

• ZooKeeper dependency is removed from clients

copycat copycat

$ cat < in.txt | grep “apache” | tr a-z A-Z > out.txt

Copycat Copycat

Kafka Kafka

Kafka Streams Kafka Streams

KAPPA ARCHITECTUREOUR EXPERIENCEby Juantomás García, ASPgems

http://events.linuxfoundation.org/sites/events/files/slides/ASPgems%20-%20Kappa%20Architecture.pdf

http://events.linuxfoundation.org/sites/events/files/slides/ASPgems%20-%20Kappa%20Architecture.pdf

LAMBDA ARCHITECTURE

https://www.mapr.com/developercentral/lambda-architecture

https://www.mapr.com/developercentral/lambda-architecture

LAMBDA ARCHITECTURE• Batch layer that provides the following functionality:

• managing the master dataset, an immutable, append-only set of raw data.

• pre-computing arbitrary query functions, called batch views. • Serving layer (NoSQL such as HBase, Apache Druid, etc.)

• This layer indexes the batch views so that they can be queried in ad hoc with low latency.

• Speed layer (Apache Storm, Spark Streaming, etc.)• This layer accommodates all requests that are subject to

low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.

LAMBDA ARCHITECTURE

• Retain the input data unchanged

• Take in account the problem of reprocessing data (the code change, and you need to reprocess)

• Maintain the code that need to produce the same result from two complex distributed system is painful

• Different and diverging programming paradigms

Pros Cons

KAPPA ARCHITECTURE• July 2, 2014 Jay Kreps from LinkedIn coined the term Kappa

Architecture• The proposal of Jay Kreps is simple:

• Use Kafka (or other system) that will let you retain the full log of the data you need to reprocess.

• When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.

• When the second job has caught up, switch the application to read from the new table.

• Stop the old version of the job, and delete the old output table.http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

KAPPA ARCHITECTURE

APPoutput table n

output table n+1

job version n

job version n+1input topic

Kafka Cluster Stream Processing Serving DB

LAMDA ARCHITECTURE

APPspeed table

batch table

processing job

processing job

input topic

Kafka Cluster Stream ProcessingServing DB

Batch Processing

• Need to reprocess only when you change the code.• Check if the new version is working OK and if not

reverse to the old output table.• You can mirror a Kafka topic to HDFS so you are

not limited to the Kafka retention configuration.• You have only a code to maintain with an unique

framework.• The real advantage is allowing your team to

develop, test, debug and operate their systems on top of a single processing framework.

KAPPA ARCHITECTURE

USE CASES: IOT - OBD II• One of clients install On Board Devices in the cars of

its customers.• ASPGems implements an API to got all the

information in real time and inject the information in Kafka.

• The business rules are implemented in a CEP (complex event processing) running into Apache Spark Streaming.

• As MPP (massively parallel processing) they use ElasticSearch.

CATCH THEM IN THE ACTFRAUD DETECTION IN REAL-TIME

by Seshika Fernando, WSO2

http://events.linuxfoundation.org/sites/events/files/slides/Fraud%20Detection%20in%20Real-time%20-%20Seshika%20Fernando.pdf

http://events.linuxfoundation.org/sites/events/files/slides/Fraud%20Detection%20in%20Real-time%20-%20Seshika%20Fernando.pdf

FRAUD: A TRILLION DOLLAR PROBLEM• Survey results

• $ 3.5 – 4 Trillion in Global Losses per year (5% of Global GDP)

• Payment Fraud Only

• Merchants are losing around $250B globally

• Cost of Fraud is around 0.68% of Revenue for Retailers (2014)

• Steep rise in Fraud in eCommerce (0.85% of Revenue) and mCommerce (1.36% of Revenue) with a movement of payments to newer channels

DomainKnowledge

BatchAnalytics

Real-TimeAnalytics

PredictiveAnalytics

InteractiveAnalytics

Fraud Detection ToolkitData Analytics Server

FRAUD SCORING• Use combinations of rules• Give weights to each rule• Derive a single number that reflects many fraud indicators• Use a threshold to reject transactions

• Example: Score = 0.001 * itemPrice + 0.1 * itemQuantity + 2.5 * isFreeEmail + 5 * riskyCountry + 8 * suspicousIPRange + 5 * suspicousUsername + 3 * highTransactionVelocity

LEARN FROM DATA

• Utilize Machine Learning Techniques to identify ‘unknown’ point anomalies (e.g. k-means clustering)

MARKOV MODELS FOR FRAUD DETECTION• Markov Models are stochastic models used to model

randomly changing systems

ClassifyEvents

Update Probability

Matrix

Compare Incoming

Sequences

ProbabilityMatrix

events alerts

MARKOV MODEL: CLASSIFICATIONExample: Each transaction is classified under the following three qualities and expressed as a 3 letter token, e.g., HNN• Amount spent: Low, Normal and High• Whether the transaction includes high price item:

Normal and High• Time elapsed since the last transaction: Large,

Normal and Small

MARKOV MODEL: PROBABILITYLNL LNH LNS LHL HHL …

LNL 0.97 0.54 0.2 0.09 0.07LNH 0.8 0.6 0.18 0.65 0.11LNS 0.07 0.83 0.95 0.15 0.12…

• Compare the probabilities of incoming transaction sequences with thresholds and flag fraud as appropriate

• Can use direct probabilities or more complex metrics (Miss Rate Metric, Miss Probability Metric, Entropy Reduction Metric, …)

• Update Markov Probability table with incoming transactions

DIG DEEPER• Access historical

data using• expressive

querying• easy filtering• useful

visualisations

• to isolate incidents and unearth connections

NLP STRUCTURED DATA INVESTIGATION ON NON-TEXTUAL DATA WITH MLLIB

by Casey Stella, Hortonworks

http://events.linuxfoundation.org/sites/events/files/slides/NLP_on_non_textual_data.pdf

http://events.linuxfoundation.org/sites/events/files/slides/NLP_on_non_textual_data.pdf

WORD2VEC• Word2Vec is a vectorization model created by Google that attempts

to learn relationships between words automatically given a large corpus of sentences.• Gives us a way to find similar words by finding near neighbors in

the vector space with cosine similarity.• Uses a neural network to learn vector representations.• Recent work by Pennington, Socher, and Manning shows that the

word2vec model is equivalent to weighting a word co-occurance matrix based on window distance and lowering the dimension by matrix factorization.

• Read more: http://radimrehurek.com/2014/12/making-sense-of-word2vec/

http://radimrehurek.com/2014/12/making-sense-of-word2vec/

CLINICAL DATA AS SENTENCES• Clinical encounters form a sort of sentence over time. For a

given encounter :• Vitals are measured (e.g. height, weight, BMI).• Labs are performed and results are recorded (e.g. blood tests).• Procedures are performed.• Diagnoses are made (e.g. Diabetes).• Drugs are prescribed.

• Each of these can be considered clinical “words” and the encounter forms a clinical “sentence”.

• Idea: We can use word2vec to investigate connections between these clinical concepts.

DEMO FOR KAGGLE COMPETION• Practice Fusion Diabetes Classification (https://

www.kaggle.com/c/pf2012-diabetes)• Given a de-identified data set of patient electronic

health records, build a model to determine who has a diabetes diagnosis, as defined by ICD9 codes

• There are a total of 9,948 patients in the training set and 4,979 patients in the test set.

• Ingested and preprocessed these records into197,340 clinical “sentences”

https://www.kaggle.com/c/pf2012-diabetes

SYNONIMS

• Sentence:• dx::042 rx::benzoyl_peroxide_topical rx::morphine

from pyspark.mllib.feature import Word2Vec

word2vec = Word2Vec() word2vec.setSeed(0) word2vec.setVectorSize(100) model = word2vec.fit(sentences)

def print_synonyms_filt(clinical_concept, model, prefix): synonyms = model.findSynonyms(clinical_concept, 10000) for word, cosine_distance in synonyms: if prefix is None or word.startswith(prefix): print "{}: {}".format(cosine_distance, word)

RESULTS EXAMPLE:ATHEROSCLEROSIS OF THE AORTA

• Hearing Loss¶• From an article from the Journal of Atherosclerosis in 2012:

• Sensorineural hearing loss seemed to be associated with vascular endothelial dysfunction and an increased cardiovascular risk

• Knee Joint Replacements • These procedures are common among those with osteoarthritis and there has

been a solid correlation between osteoarthritis and atherosclerosis in the literature.

print_synonyms_filt(‘dx::440.0’, model, None) 0.930721402168: dx: v12.71 -- Personal history of peptic ulcer disease 0.926115810871: dx: 533.40 -- Chronic or unspecified peptic ulcer of unspecified site with hemorrhage, without mention of obstruction 0.91034334898: dx: 153.6 -- Malignant neoplasm of ascending colon 0.90947073698: dx: 238.75 -- Myelodysplastic syndrome, unspecified 0.907130658627: dx: 389.10 -- Sensorineural hearing loss, unspecified 0.90490090847: dx: 428.30 -- Diastolic heart failure, unspecified 0.902494549751: dx: v43.65 -- Knee joint replacement

THANK YOU

apache big data europe 2015: selected talks

Technology