Big Data Deep Learning with Apache Apex (Native Hadoop)
2
• Apache Apex
• Deep Learning
• Deeplearning4j library
• Using deeplearning4j with Apex
• Architecture
• Demo screenshots
• Challenges
Agenda
3
• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native
○ No separate service to manage stream processing
○ Streaming Engine built into Application Master and Containers
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application
Apache Apex
4
• Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data.
• Deep learning reduces the need for manual feature engineering.
• It can work effectively on unlabeled data.
Deep Learning
5
• Deep learning uses deep neural networks to model the high level abstractions in data.
• Deep neural networks are neural networks with more than one hidden layer.
Deep Learning
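To make the "more than one hidden layer" definition concrete, here is a minimal sketch of a forward pass through a network with two hidden layers, in plain Java. The class name `TinyDeepNet`, the hand-picked weights, and the use of ReLU on every layer (including the output) are assumptions for illustration; this is not how DL4J builds or runs networks.

```java
public class TinyDeepNet {
    // One fully connected layer followed by ReLU: out[i] = max(0, b[i] + sum_j w[i][j] * x[j]).
    static double[] denseRelu(double[][] w, double[] b, double[] x) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = b[i];
            for (int j = 0; j < x.length; j++) {
                sum += w[i][j] * x[j];
            }
            out[i] = Math.max(0.0, sum);
        }
        return out;
    }

    // Feeds the input through every layer in order; weights[l] and biases[l] describe layer l.
    static double[] forward(double[][][] weights, double[][] biases, double[] input) {
        double[] activation = input;
        for (int l = 0; l < weights.length; l++) {
            activation = denseRelu(weights[l], biases[l], activation);
        }
        return activation;
    }

    public static void main(String[] args) {
        // Hypothetical network: 2 inputs -> hidden layer (2) -> hidden layer (2) -> 1 output.
        double[][][] weights = {
            {{1, 0}, {0, 1}},
            {{1, 1}, {1, -1}},
            {{1, 1}}
        };
        double[][] biases = {{0, 0}, {0, 0}, {0.5}};
        double[] output = forward(weights, biases, new double[] {1, 2});
        System.out.println(java.util.Arrays.toString(output));
    }
}
```

The two middle entries of `weights` are the hidden layers; a network with only one of them would be a shallow network by the definition above.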
6
• Object classification in photographs
• Image caption generation
• Automatic game playing
• Handwriting recognition
• Adding sound to silent movies
• Colorization of black and white images
Applications of Deep Learning
7
● An open source deep learning library (released under the Apache 2.0 license)
● DL4J is distributed
● Written for Java and Scala
● Integrated with Hadoop
● Skymind is its commercial support arm
● DL4J provides various neural network types, such as Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks for image processing, deep autoencoders, Restricted Boltzmann Machines, recurrent nets, and denoising autoencoders.
Deeplearning4j
8
Deeplearning4j
9
● Training deep learning models on a single processor is extremely slow.
● DL4J works with multi-CPU and multi-GPU systems.
● Integrating DL4J with Apex supports training and running deep learning models in distributed and stream processing environments.
Using deeplearning4j with Apex
10
• We achieve distributed training of neural networks using data parallelism.
• In data parallelism, every machine holds a complete copy of the model, and each machine trains on a different portion of the data.
Architecture
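As a rough illustration of the data-parallel split described on this slide, the sketch below divides a list of training samples into one shard per worker using round-robin assignment. The class name `DataSharding` and the round-robin policy are assumptions for illustration; the slides do not say how the Apex application actually distributes tuples to its training operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DataSharding {
    // Assigns sample i to worker (i % numWorkers), so every worker gets a different portion of the data.
    static <T> List<List<T>> shard(List<T> samples, int numWorkers) {
        if (numWorkers <= 0) {
            throw new IllegalArgumentException("numWorkers must be positive");
        }
        List<List<T>> shards = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            shards.add(new ArrayList<>());
        }
        for (int i = 0; i < samples.size(); i++) {
            shards.get(i % numWorkers).add(samples.get(i));
        }
        return shards;
    }

    public static void main(String[] args) {
        // Seven sample ids split across three hypothetical workers.
        List<Integer> sampleIds = Arrays.asList(0, 1, 2, 3, 4, 5, 6);
        List<List<Integer>> shards = shard(sampleIds, 3);
        for (int w = 0; w < shards.size(); w++) {
            System.out.println("worker " + w + " trains on " + shards.get(w));
        }
    }
}
```

Each worker would then train its own full copy of the model on its shard only.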
11
● We use a method called Parameter Averaging to combine and synchronize models trained on different machines.
Architecture
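The core arithmetic of parameter averaging can be sketched in a few lines: each worker reports its model parameters as a flat vector, and the combined model is the element-wise mean. The class name `ParameterAveraging` and the flat `double[]` representation are assumptions for illustration; this is not DL4J's or the Apex application's actual implementation.

```java
import java.util.Arrays;

public class ParameterAveraging {
    // Returns the element-wise mean of the parameter vectors reported by all workers.
    static double[] average(double[][] workerParams) {
        if (workerParams.length == 0) {
            throw new IllegalArgumentException("need at least one worker");
        }
        int n = workerParams[0].length;
        double[] avg = new double[n];
        for (double[] params : workerParams) {
            if (params.length != n) {
                throw new IllegalArgumentException("all workers must report the same number of parameters");
            }
            for (int i = 0; i < n; i++) {
                avg[i] += params[i];
            }
        }
        for (int i = 0; i < n; i++) {
            avg[i] /= workerParams.length;
        }
        return avg;
    }

    public static void main(String[] args) {
        // Two hypothetical workers, each holding a two-parameter model after local training.
        double[][] workerParams = {{1.0, 2.0}, {3.0, 4.0}};
        System.out.println(Arrays.toString(average(workerParams)));
    }
}
```

In a full training loop, the averaged vector would be sent back to every worker as the starting point for the next round of local training, which keeps the model copies synchronized.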
12
Apex Application DAG
13
● The Iris flower data set, or Fisher's Iris data set, is a multivariate data set.
● The data set consists of 150 samples in total, 50 from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
● Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.
Demo Dataset
Iris versicolor Iris setosa Iris virginica
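To show what one sample looks like as network input, the sketch below parses a single record into four numeric features and a one-hot species label. It assumes the common CSV layout of the public Iris file (four measurements followed by a species name such as `Iris-setosa`); the class name `IrisRecord` and the label ordering are hypothetical, and the slides do not say which input format the demo used.

```java
import java.util.Arrays;
import java.util.List;

public class IrisRecord {
    // Assumed species names and label order; index in this list is the class label.
    static final List<String> SPECIES =
        Arrays.asList("Iris-setosa", "Iris-versicolor", "Iris-virginica");

    final double[] features; // sepal length, sepal width, petal length, petal width (cm)
    final int label;

    IrisRecord(double[] features, int label) {
        this.features = features;
        this.label = label;
    }

    // Parses a line such as "5.1,3.5,1.4,0.2,Iris-setosa".
    static IrisRecord parse(String csvLine) {
        String[] parts = csvLine.trim().split(",");
        if (parts.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + csvLine);
        }
        double[] features = new double[4];
        for (int i = 0; i < 4; i++) {
            features[i] = Double.parseDouble(parts[i].trim());
        }
        int label = SPECIES.indexOf(parts[4].trim());
        if (label < 0) {
            throw new IllegalArgumentException("unknown species: " + parts[4]);
        }
        return new IrisRecord(features, label);
    }

    // Encodes the label as a vector with a 1.0 at the species index and 0.0 elsewhere.
    double[] oneHotLabel() {
        double[] encoded = new double[SPECIES.size()];
        encoded[label] = 1.0;
        return encoded;
    }

    public static void main(String[] args) {
        IrisRecord record = parse("5.1,3.5,1.4,0.2,Iris-setosa");
        System.out.println(Arrays.toString(record.features) + " -> " + Arrays.toString(record.oneHotLabel()));
    }
}
```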
14
DEMO
15
16
• We had to change the default packaging of the Apex application.
• We used the Maven Shade plugin to package the app.
• Certain components of ND4J are incompatible with Kryo serialization.
• We use the Java serializer for those components.
Challenges
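The fallback to Java serialization mentioned above can be illustrated with the JDK alone: an object that implements `Serializable` is written to bytes and read back. The `ModelParams` class and the `roundTrip` helper are hypothetical stand-ins for the ND4J components in question; how the application wires this into Apex is not shown in the slides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JavaSerializationRoundTrip {
    // Hypothetical stand-in for a component that must go through Java serialization.
    static class ModelParams implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] values;

        ModelParams(double[] values) {
            this.values = values;
        }
    }

    // Serializes the object to bytes and deserializes it again using standard Java serialization.
    static ModelParams roundTrip(ModelParams params) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(params);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (ModelParams) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Java serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        ModelParams copy = roundTrip(new ModelParams(new double[] {0.25, -1.5}));
        System.out.println(java.util.Arrays.toString(copy.values));
    }
}
```

Java serialization is usually slower and more verbose than Kryo, so limiting it to the components that need it, as the slide describes, is a reasonable trade-off.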
17
• Apache Apex - http://apex.apache.org/
• Subscribe to forums
○ Apex - http://apex.apache.org/community.html
○ DataTorrent - https://groups.google.com/forum/#!forum/dt-users
• Download - https://datatorrent.com/download/
• Twitter
○ @ApacheApex; Follow - https://twitter.com/apacheapex
○ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://meetup.com/topics/apache-apex
• Webinars - https://datatorrent.com/webinars/
• Videos - https://youtube.com/user/DataTorrent
• Slides - http://slideshare.net/DataTorrent/presentations
• Startup Accelerator - Free full featured enterprise product
○ https://datatorrent.com/product/startup-accelerator/
• Big Data Application Templates Hub – https://datatorrent.com/apphub
Resources
18
Thank You!