apache big data europe 2015: selected talks
APACHE: BIG DATA EUROPE 2015
Budapest, September 28-30, 2015
tech talk @ ferret
Andrii Gakhov
SELECTED TALKS
Photos © Apache Big Data
BEING READY FOR APACHE KAFKA
by Michael G. Noll, Confluent Inc.
http://www.slideshare.net/miguno/being-ready-for-apache-kafka-apache-big-data-europe-2015
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
[Diagram: producers write to, and consumers read from, a Kafka cluster of brokers coordinated by ZooKeeper. A topic is a log ordered from oldest to newest record: the producer appends at the newest end and each consumer reads from its own position.]
ABOUT KAFKA FROM JAY KREPS
• A consumer just maintains an “offset,” which is the log entry number for the last record it has processed on each of these partitions. So, changing the consumer’s position to go back and reprocess data is as simple as restarting the job with a different offset. Adding a second consumer for the same data is just another reader pointing to a different position in the log.
• Kafka supports replication and fault-tolerance, runs on cheap, commodity hardware, and is glad to store many TBs of data per machine.
• LinkedIn keeps more than a petabyte of Kafka storage online, and a number of applications make good use of this long retention pattern for exactly this purpose.
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
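The offset idea above can be sketched without a Kafka cluster. This is a toy in-memory model, not the Kafka client API: the `Log` and `Consumer` classes are hypothetical, and the offset here is stored as the next position to read rather than the last record processed.

```python
class Log:
    """Append-only list standing in for a single partition (toy model)."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record


class Consumer:
    """A reader whose only state is its position in the log."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset  # next position to read (an assumption of this sketch)

    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch


log = Log()
for event in ["a", "b", "c"]:
    log.append(event)

first = Consumer(log)
seen = first.poll()       # reads everything retained so far

first.offset = 1          # "restart the job with a different offset"
replayed = first.poll()   # reprocesses from position 1 onwards

second = Consumer(log, offset=2)  # a second reader at another position
tail = second.poll()
```

Rewinding or adding a reader never touches the log itself, which is why reprocessing is cheap in this model.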
USING KAFKA
• DEB and RPM packages are available via Confluent Platform (http://www.confluent.io/developer)
• Recommended Python client: kafka-python (https://github.com/mumrah/kafka-python)
• Confluent Kafka-REST is available via Confluent Platform
• Monitoring is important:
  • Host metrics: CPU, memory, disk I/O and usage, network I/O
  • Kafka metrics: consumer lag, replication stats, message latency, GC
  • ZooKeeper metrics: request latency, number of outstanding requests
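Of the Kafka metrics listed, consumer lag is the easiest to illustrate: per partition it is the distance between the newest offset in the log and the consumer's position. A minimal sketch, assuming the offsets have already been collected into plain dictionaries (the function name and the alert threshold are hypothetical, not part of any Kafka tool):

```python
def consumer_lag(log_end_offsets, consumer_offsets):
    """Return lag per partition: newest log offset minus consumer position."""
    return {
        partition: end - consumer_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


LAG_ALERT_THRESHOLD = 1000  # hypothetical value, tune per workload

log_end_offsets = {0: 100, 1: 250}   # partition -> newest offset
consumer_offsets = {0: 90, 1: 250}   # partition -> consumer position

lag = consumer_lag(log_end_offsets, consumer_offsets)
total_lag = sum(lag.values())
needs_alert = total_lag > LAG_ALERT_THRESHOLD
```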
NEW IN KAFKA 0.9.0
• Copycat is a new framework for loading structured data into and out of Kafka
• Kafka Streams is a library that supports basic operations (join/filter/map/…), windowing, schema and proper time modelling (event time vs. processing time)
• New unified consumer Java API
• ZooKeeper dependency is removed from clients
$ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt
[Diagram: Unix pipeline analogy. Reading in.txt and writing out.txt correspond to Copycat, the pipes to Kafka, and the grep and tr steps to Kafka Streams.]
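The same pipeline shape can be written with Python generators. This is only an illustration of the analogy, not the Copycat or Kafka Streams API; all function names are hypothetical.

```python
def source(lines):
    """Plays the role of `cat < in.txt`: brings records into the pipeline."""
    for line in lines:
        yield line


def grep(pattern, records):
    """Filter step: keep only records containing the pattern."""
    return (record for record in records if pattern in record)


def upper(records):
    """Map step: transform each record, like `tr a-z A-Z`."""
    return (record.upper() for record in records)


in_txt = ["apache kafka", "something else", "apache spark"]
out_txt = list(upper(grep("apache", source(in_txt))))
```

Each stage consumes a stream and produces a stream, so stages can be added or swapped without touching the others.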
KAPPA ARCHITECTURE
OUR EXPERIENCE
by Juantomás García, ASPgems
http://events.linuxfoundation.org/sites/events/files/slides/ASPgems%20-%20Kappa%20Architecture.pdf
LAMBDA ARCHITECTURE
https://www.mapr.com/developercentral/lambda-architecture
LAMBDA ARCHITECTURE
• Batch layer that provides the following functionality:
  • managing the master dataset, an immutable, append-only set of raw data.
  • pre-computing arbitrary query functions, called batch views.
• Serving layer (NoSQL such as HBase, Apache Druid, etc.)
  • This layer indexes the batch views so that they can be queried ad hoc with low latency.
• Speed layer (Apache Storm, Spark Streaming, etc.)
  • This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
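A minimal sketch of how the three layers fit together, assuming a page-view count as the query and a timestamp `horizon` marking how far the last batch run got (both are hypothetical choices for illustration):

```python
from collections import Counter


def batch_view(master_dataset, horizon):
    """Batch layer: precompute counts over all data up to the horizon."""
    return Counter(page for ts, page in master_dataset if ts <= horizon)


def speed_view(recent_events, horizon):
    """Speed layer: cover only the recent data the batch run has not seen."""
    return Counter(page for ts, page in recent_events if ts > horizon)


def query(page, batch, speed):
    """Serving side: merge both views to answer with low latency."""
    return batch[page] + speed[page]


# Immutable, append-only raw events: (timestamp, page)
events = [(1, "home"), (2, "docs"), (3, "home"), (4, "home")]
horizon = 2  # the last batch run processed events up to timestamp 2

batch = batch_view(events, horizon)
speed = speed_view(events, horizon)
home_views = query("home", batch, speed)
docs_views = query("docs", batch, speed)
```

Note that the counting logic exists twice, once per layer, which is the maintenance cost the talk goes on to criticize.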
LAMBDA ARCHITECTURE
Pros
• Retains the input data unchanged
• Takes into account the problem of reprocessing data (when the code changes, you need to reprocess)
Cons
• Maintaining code that needs to produce the same result in two complex distributed systems is painful
• Different and diverging programming paradigms
KAPPA ARCHITECTURE
• July 2, 2014: Jay Kreps from LinkedIn coined the term Kappa Architecture
• The proposal of Jay Kreps is simple:
  • Use Kafka (or other system) that will let you retain the full log of the data you need to reprocess.
  • When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.
  • When the second job has caught up, switch the application to read from the new table.
  • Stop the old version of the job, and delete the old output table.
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
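The four steps can be sketched with an in-memory list standing in for the retained Kafka log and dictionaries standing in for output tables. The job functions, table names, and the "caught up" check are all hypothetical simplifications.

```python
input_topic = [3, 5, 8, 13]  # retained log of input records


def job_v1(record):
    return record * 2


def job_v2(record):
    return record * 2 + 1  # the code changed, so output must be rebuilt


def run_job(job, log, from_offset=0):
    """Process the log from the given offset into a fresh output table."""
    return {
        offset: job(record)
        for offset, record in enumerate(log)
        if offset >= from_offset
    }


# Current state: version 1 is running and the app reads its table.
tables = {"output_v1": run_job(job_v1, input_topic)}
serving = "output_v1"

# Steps 1-2: start a second job from the beginning of the log,
# writing to a new output table.
tables["output_v2"] = run_job(job_v2, input_topic, from_offset=0)

# Step 3: once the new job has caught up, switch the application.
if len(tables["output_v2"]) == len(input_topic):
    serving = "output_v2"
    # Step 4: stop the old job and delete its output table.
    del tables["output_v1"]
```

Until the switch in step 3, the old table remains available, so a faulty new version can simply be discarded.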
KAPPA ARCHITECTURE
[Diagram: an input topic in the Kafka cluster feeds job version n and job version n+1 in the stream processing layer. The jobs write output table n and output table n+1 in the serving DB, which the app reads.]
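LAMBDA ARCHITECTURE
[Diagram: an input topic in the Kafka cluster feeds one processing job in the stream processing layer and one in the batch processing layer. The jobs write the speed table and the batch table in the serving DB, which the app reads.]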
• You need to reprocess only when you change the code.
• Check whether the new version works correctly and, if not, revert to the old output table.
• You can mirror a Kafka topic to HDFS, so you are not limited by the Kafka retention configuration.
• You have only one codebase to maintain, on a single framework.
• The real advantage is allowing your team to develop, test, debug and operate their systems on top of a single processing framework.
KAPPA ARCHITECTURE
USE CASES: IOT - OBD II
• One of ASPgems' clients installs on-board devices in its customers' cars.
• ASPgems implemented an API to get all the information in real time and inject it into Kafka.
• The business rules are implemented in a CEP (complex event processing) engine running on Apache Spark Streaming.
• For MPP (massively parallel processing) they use Elasticsearch.
CATCH THEM IN THE ACT
FRAUD DETECTION IN REAL-TIME
by Seshika Fernando, WSO2
http://events.linuxfoundation.org/sites/events/files/slides/Fraud%20Detection%20in%20Real-time%20-%20Seshika%20Fernando.pdf
FRAUD: A TRILLION DOLLAR PROBLEM
• Survey results
• $ 3.5 – 4 Trillion in Global Losses per year (5% of Global GDP)
• Payment Fraud Only
• Merchants are losing around $250B globally
• Cost of Fraud is around 0.68% of Revenue for Retailers (2014)
• Steep rise in Fraud in eCommerce (0.85% of Revenue) and mCommerce (1.36% of Revenue) with a movement of payments to newer channels
[Diagram: Fraud Detection Toolkit on the Data Analytics Server, combining Domain Knowledge, Batch Analytics, Real-Time Analytics, Predictive Analytics and Interactive Analytics.]
FRAUD SCORING
• Use combinations of rules
• Give weights to each rule
• Derive a single number that reflects many fraud indicators
• Use a threshold to reject transactions
• Example: Score = 0.001 * itemPrice + 0.1 * itemQuantity + 2.5 * isFreeEmail + 5 * riskyCountry + 8 * suspiciousIPRange + 5 * suspiciousUsername + 3 * highTransactionVelocity
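The example score translates directly into a weighted sum. The weights below are the ones from the slide; the rejection threshold and the dictionary-based transaction format are assumptions made for this sketch.

```python
# Rule weights taken from the example score on the slide.
WEIGHTS = {
    "itemPrice": 0.001,
    "itemQuantity": 0.1,
    "isFreeEmail": 2.5,
    "riskyCountry": 5,
    "suspiciousIPRange": 8,
    "suspiciousUsername": 5,
    "highTransactionVelocity": 3,
}

REJECT_THRESHOLD = 10.0  # hypothetical value, not from the talk


def fraud_score(transaction):
    """Combine all rule indicators into a single number."""
    return sum(
        weight * float(transaction.get(name, 0))
        for name, weight in WEIGHTS.items()
    )


def is_rejected(transaction):
    return fraud_score(transaction) >= REJECT_THRESHOLD


ordinary = {"itemPrice": 1000, "itemQuantity": 2, "isFreeEmail": 1}
risky = dict(ordinary, riskyCountry=1, suspiciousIPRange=1)

ordinary_score = fraud_score(ordinary)  # 1.0 + 0.2 + 2.5
risky_score = fraud_score(risky)        # adds 5 + 8
```

Boolean indicators are passed as 0 or 1, so each weight states how much a single triggered rule adds to the score.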
LEARN FROM DATA
• Utilize Machine Learning Techniques to identify ‘unknown’ point anomalies (e.g. k-means clustering)
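A minimal sketch of the k-means idea, in plain Python: cluster historical transactions, then treat the distance from a new transaction to its nearest centroid as an anomaly score. The two features, the sample data, the deterministic initialisation and the threshold are all hypothetical; a production system would use a library implementation and scaled features.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; starts from the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest cluster centre; large means unusual."""
    return min(math.dist(point, c) for c in centroids)


# Hypothetical history: (amount, item quantity)
history = [(10, 1), (100, 2), (12, 1), (98, 2), (11, 2), (102, 1)]
centroids = kmeans(history, k=2)

ANOMALY_THRESHOLD = 5.0  # hypothetical
typical = anomaly_score((11, 1), centroids)
outlier = anomaly_score((500, 1), centroids)
```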
MARKOV MODELS FOR FRAUD DETECTION
• Markov Models are stochastic models used to model randomly changing systems
[Diagram: incoming events pass through Classify Events, then Update Probability Matrix and Compare Incoming Sequences, both backed by the Probability Matrix. The comparison step emits alerts.]
MARKOV MODEL: CLASSIFICATION
Example: Each transaction is classified under the following three qualities and expressed as a 3-letter token, e.g., HNN
• Amount spent: Low, Normal and High
• Whether the transaction includes a high-price item: Normal and High
• Time elapsed since the last transaction: Large, Normal and Small
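The classification step might look like the following. The three qualities and their letters come from the slide; every numeric cut-off is a hypothetical value chosen for illustration.

```python
def classify(amount, max_item_price, seconds_since_last):
    """Turn one transaction into a 3-letter token such as 'HNN'."""
    # Amount spent: Low, Normal or High (cut-offs are assumptions)
    if amount < 20:
        amount_code = "L"
    elif amount > 500:
        amount_code = "H"
    else:
        amount_code = "N"

    # Includes a high-price item: Normal or High (cut-off is an assumption)
    item_code = "H" if max_item_price > 300 else "N"

    # Time since the last transaction: Large, Normal or Small
    if seconds_since_last < 60:
        elapsed_code = "S"
    elif seconds_since_last > 86400:
        elapsed_code = "L"
    else:
        elapsed_code = "N"

    return amount_code + item_code + elapsed_code


big_purchase = classify(amount=900, max_item_price=100, seconds_since_last=3600)
rapid_small = classify(amount=10, max_item_price=5, seconds_since_last=30)
```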
MARKOV MODEL: PROBABILITY
       LNL    LNH    LNS    LHL    HHL    …
LNL    0.97   0.54   0.2    0.09   0.07
LNH    0.8    0.6    0.18   0.65   0.11
LNS    0.07   0.83   0.95   0.15   0.12
…
• Compare the probabilities of incoming transaction sequences with thresholds and flag fraud as appropriate
• Can use direct probabilities or more complex metrics (Miss Rate Metric, Miss Probability Metric, Entropy Reduction Metric, …)
• Update Markov Probability table with incoming transactions
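A minimal sketch of the update-and-compare loop using direct probabilities (the more complex metrics named above are not implemented here). Transition counts are kept per token pair and normalised on demand; the threshold and the sample history are hypothetical.

```python
from collections import defaultdict


def new_counts():
    """Transition counts: counts[previous_token][current_token]."""
    return defaultdict(lambda: defaultdict(int))


def update_counts(counts, sequence):
    """Update the table with the transitions of one customer sequence."""
    for prev, curr in zip(sequence, sequence[1:]):
        counts[prev][curr] += 1


def transition_probability(counts, prev, curr):
    row = counts.get(prev, {})
    total = sum(row.values())
    return row.get(curr, 0) / total if total else 0.0


def sequence_probability(counts, sequence):
    """Direct probability: product of the transition probabilities."""
    probability = 1.0
    for prev, curr in zip(sequence, sequence[1:]):
        probability *= transition_probability(counts, prev, curr)
    return probability


FRAUD_THRESHOLD = 0.05  # hypothetical


def is_suspicious(counts, sequence):
    return sequence_probability(counts, sequence) < FRAUD_THRESHOLD


counts = new_counts()
update_counts(counts, ["LNL", "LNL", "LNL", "LNS", "LNL", "LNL"])

usual = sequence_probability(counts, ["LNL", "LNL", "LNS"])
unusual = sequence_probability(counts, ["LNL", "HHL"])
```

Because probabilities are derived from counts, rows of this table sum to 1 for every token that has been seen as a predecessor.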
DIG DEEPER
• Access historical data using
  • expressive querying
  • easy filtering
  • useful visualisations
• to isolate incidents and unearth connections
NLP STRUCTURED DATA INVESTIGATION ON NON-TEXTUAL DATA WITH MLLIB
by Casey Stella, Hortonworks
http://events.linuxfoundation.org/sites/events/files/slides/NLP_on_non_textual_data.pdf
WORD2VEC
• Word2Vec is a vectorization model created by Google that attempts to learn relationships between words automatically given a large corpus of sentences.
• Gives us a way to find similar words by finding near neighbors in the vector space with cosine similarity.
• Uses a neural network to learn vector representations.
• Recent work by Pennington, Socher, and Manning shows that the word2vec model is equivalent to weighting a word co-occurrence matrix based on window distance and lowering the dimension by matrix factorization.
• Read more: http://radimrehurek.com/2014/12/making-sense-of-word2vec/
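Finding near neighbours with cosine similarity can be sketched in a few lines. The two-dimensional vectors below are made up for illustration; real word2vec vectors are learned and typically have a hundred or more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, n=1):
    """Rank all other words by similarity to the given word."""
    ranked = sorted(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical toy vectors, not learned from data.
vectors = {
    "king": [1.0, 0.9],
    "queen": [0.9, 1.0],
    "apple": [-1.0, 0.2],
}

closest_to_king = nearest("king", vectors)
```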
CLINICAL DATA AS SENTENCES
• Clinical encounters form a sort of sentence over time. For a given encounter:
  • Vitals are measured (e.g. height, weight, BMI).
  • Labs are performed and results are recorded (e.g. blood tests).
  • Procedures are performed.
  • Diagnoses are made (e.g. Diabetes).
  • Drugs are prescribed.
• Each of these can be considered clinical “words” and the encounter forms a clinical “sentence”.
• Idea: We can use word2vec to investigate connections between these clinical concepts.
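A minimal sketch of turning one encounter into a clinical "sentence". The `dx::` and `rx::` prefixes follow the example sentence shown in the talk; the record layout, field names and normalisation rules are assumptions.

```python
def encounter_to_sentence(encounter):
    """Flatten one encounter into a list of prefixed clinical 'words'."""
    tokens = []
    for code in encounter.get("diagnoses", []):
        tokens.append("dx::" + code)
    for drug in encounter.get("prescriptions", []):
        # Normalise drug names into single tokens (an assumed convention).
        tokens.append("rx::" + drug.lower().replace(" ", "_"))
    return tokens


# Hypothetical record layout for a single encounter.
encounter = {
    "diagnoses": ["042"],
    "prescriptions": ["Benzoyl Peroxide Topical", "Morphine"],
}

sentence = encounter_to_sentence(encounter)
```

A corpus of such token lists is the input shape that word2vec training expects: one list of words per sentence.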
DEMO FOR KAGGLE COMPETITION
• Practice Fusion Diabetes Classification (https://www.kaggle.com/c/pf2012-diabetes)
• Given a de-identified data set of patient electronic health records, build a model to determine who has a diabetes diagnosis, as defined by ICD9 codes
• There are a total of 9,948 patients in the training set and 4,979 patients in the test set.
• Ingested and preprocessed these records into 197,340 clinical “sentences”
SYNONYMS
• Sentence:
  • dx::042 rx::benzoyl_peroxide_topical rx::morphine

from pyspark.mllib.feature import Word2Vec

# `sentences` is an RDD in which each element is a list of clinical "words"
word2vec = Word2Vec()
word2vec.setSeed(0)
word2vec.setVectorSize(100)
model = word2vec.fit(sentences)

def print_synonyms_filt(clinical_concept, model, prefix):
    # findSynonyms returns (word, cosine similarity) pairs, most similar first
    synonyms = model.findSynonyms(clinical_concept, 10000)
    for word, cosine_similarity in synonyms:
        # Keep every word, or only those with the given prefix (e.g. 'dx::')
        if prefix is None or word.startswith(prefix):
            print("{}: {}".format(cosine_similarity, word))
RESULTS EXAMPLE: ATHEROSCLEROSIS OF THE AORTA
• Hearing Loss
  • From an article from the Journal of Atherosclerosis in 2012:
    • Sensorineural hearing loss seemed to be associated with vascular endothelial dysfunction and an increased cardiovascular risk
• Knee Joint Replacements
  • These procedures are common among those with osteoarthritis and there has been a solid correlation between osteoarthritis and atherosclerosis in the literature.
print_synonyms_filt('dx::440.0', model, None)
0.930721402168: dx: v12.71 -- Personal history of peptic ulcer disease
0.926115810871: dx: 533.40 -- Chronic or unspecified peptic ulcer of unspecified site with hemorrhage, without mention of obstruction
0.91034334898: dx: 153.6 -- Malignant neoplasm of ascending colon
0.90947073698: dx: 238.75 -- Myelodysplastic syndrome, unspecified
0.907130658627: dx: 389.10 -- Sensorineural hearing loss, unspecified
0.90490090847: dx: 428.30 -- Diastolic heart failure, unspecified
0.902494549751: dx: v43.65 -- Knee joint replacement
THANK YOU