deep learning cases: text and image processing

Deep Learning Cases: Text and Image Processing

Grigory Sapunov

Founders & Developers: Deep Learning Unicorns
Moscow, 03.04.2016

[email protected]

“Simple” Image & Video Processing

Simple tasks: Classification and Detection

http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-detection.pdf

The detection task is harder than classification, but both are almost solved, and with better-than-human quality.



Case #1: IJCNN 2011
The German Traffic Sign Recognition Benchmark

● Classification, >40 classes
● >50,000 real-life images
● First Superhuman Visual Pattern Recognition
  ○ 2x better than humans
  ○ 3x better than the closest artificial competitor
  ○ 6x better than the best non-neural method

http://benchmark.ini.rub.de/index.php?section=gtsrb&subsection=results#

#   Method               Correct    Error
1   Committee of CNNs    99.46 %    0.54 %
2   Human Performance    98.84 %    1.16 %
3   Multi-Scale CNNs     98.31 %    1.69 %
4   Random Forests       96.14 %    3.86 %
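The "Nx better" multiples can be checked against the error column of the results table. The sketch below simply divides each method's error rate by that of the CNN committee; the helper name `error_ratio` is made up for this illustration. For the four rows shown here the last ratio comes out nearer 7 than 6, so the slide's "6x" presumably rounds down or refers to a different entry on the benchmark page.

```python
# Error rates (in percent) copied from the results table above.
error_pct = {
    "Committee of CNNs": 0.54,
    "Human Performance": 1.16,
    "Multi-Scale CNNs": 1.69,
    "Random Forests": 3.86,
}


def error_ratio(method, baseline="Committee of CNNs"):
    """Return how many times more errors `method` makes than `baseline`."""
    return error_pct[method] / error_pct[baseline]


# Ratio of every method's error rate to the CNN committee's error rate.
ratios = {name: round(error_ratio(name), 1) for name in error_pct}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the error rate of the CNN committee")
```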

http://people.idsia.ch/~juergen/superhumanpatternrecognition.html





Case #2: ILSVRC 2010-2015
Large Scale Visual Recognition Challenge (ILSVRC)

● Object detection (200 categories, ~0.5M images)
● Classification + localization (1000 categories, 1.2M images)

Case #2: ILSVRC 2010-2015

● Blue: Traditional CV
● Purple: Deep Learning
● Red: Human

Examples: Object Detection

Example: Face Detection + Emotion Classification

Example: Face Detection + Classification + Regression

Examples: Food Recognition

Examples: Computer Vision on the Road

Examples: Pedestrian Detection

Examples: Activity Recognition

Examples: Road Sign Recognition (on mobile!)

● NVidia Jetson TK1/TX1
  ○ 192/256 CUDA Cores
  ○ Quad-core ARM Cortex-A15 (32-bit) / Cortex-A57 (64-bit) CPU, 2/4 GB memory
● Raspberry Pi 3
  ○ 1.2 GHz 64-bit quad-core ARM Cortex-A53, 1 GB SDRAM, US$35
● Tablets, Smartphones
● Google Project Tango

Deep Learning goes mobile!

...even more mobile

http://www.digitaltrends.com/cool-tech/swiss-drone-ai-follows-trails/

This drone can automatically follow forest trails to track down lost hikers



...even homemade automobile

Meet the 26-Year-Old Hacker Who Built a Self-Driving Car... in His Garage
https://www.youtube.com/watch?v=KTrgRYa2wbI


More complex Image & Video Processing

https://www.youtube.com/watch?v=ZJMtDRbqH40 NYU Semantic Segmentation with a Convolutional Network (33 categories)

Semantic Segmentation


Caption Generation

http://arxiv.org/abs/1411.4555 “Show and Tell: A Neural Image Caption Generator”



Example: NeuralTalk and Walk

Ingredients:

● https://github.com/karpathy/neuraltalk2 Project for learning Multimodal Recurrent Neural Networks that describe images with sentences

● Webcam/notebook

Result:

● https://vimeo.com/146492001


More hacking: NeuralTalk and Walk

Product of the near future: DenseCap and ?

http://arxiv.org/abs/1511.07571 DenseCap: Fully Convolutional Localization Networks for Dense Captioning



Image Colorization

http://richzhang.github.io/colorization/



Visual Question Answering

https://avisingh599.github.io/deeplearning/visual-qa/



Reinforcement Learning

Controlling a simulated car from video input (2013)
http://people.idsia.ch/~juergen/gecco2013torcs.pdf
http://people.idsia.ch/~juergen/compressednetworksearch.html


Reinforcement Learning

Reinforcement Learning

Human-level control through deep reinforcement learning (2015)
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

Playing Atari with Deep Reinforcement Learning (2013)
http://arxiv.org/abs/1312.5602




Reinforcement Learning

Fun: Deep Dream

http://blogs.wsj.com/digits/2016/02/29/googles-computers-paint-like-van-gogh-and-the-art-sells-for-thousands/



More Fun: Neural Style

http://www.dailymail.co.uk/sciencetech/article-3214634/The-algorithm-learn-copy-artist-Neural-network-recreate-snaps-style-Van-Gogh-Picasso.html



More Fun: Neural Style

http://www.boredpanda.com/inceptionism-neural-network-deep-dream-art/



More Fun: Photo-realistic Synthesis

http://arxiv.org/abs/1601.04589 Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis



More Fun: Neural Doodle

http://arxiv.org/abs/1603.01768 Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks

(a) Original painting by Renoir, (b) semantic annotations, (c) desired layout, (d) generated output.



Text Processing / NLP

Deep Learning and NLP

Variety of tasks:

● Finding synonyms
● Fact extraction: people and company names, geography, prices, dates, product names, …
● Classification: genre and topic detection, positive/negative sentiment analysis, authorship detection, …
● Machine translation
● Search (written and spoken)
● Question answering
● Dialog systems
● Language modeling, part-of-speech tagging

https://code.google.com/archive/p/word2vec/

Example: Semantic Spaces (word2vec, GloVe)



http://nlp.stanford.edu/projects/glove/

Example: Semantic Spaces (word2vec, GloVe)



Encoding semantics

Using word2vec vectors instead of word indexes lets you deal with word meanings better (e.g. there is no need to enumerate all synonyms, because their vectors are already close to each other).
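To make "close to each other" concrete, here is a minimal sketch using cosine similarity. The three vectors are hypothetical 3-dimensional stand-ins invented for this example; real word2vec embeddings are learned from text and typically have hundreds of dimensions.

```python
import math

# Hypothetical toy "embeddings", made up for illustration only.
vectors = {
    "car":        [0.90, 0.10, 0.05],
    "automobile": [0.85, 0.15, 0.10],
    "banana":     [0.05, 0.20, 0.95],
}


def cosine(u, v):
    """Cosine similarity: near 1.0 for same direction, near 0.0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_synonyms = cosine(vectors["car"], vectors["automobile"])
sim_unrelated = cosine(vectors["car"], vectors["banana"])

print(f"car vs automobile: {sim_synonyms:.3f}")
print(f"car vs banana:     {sim_unrelated:.3f}")
```

With word indexes, "car" and "automobile" are just two unrelated integers; with vectors, their similarity can be measured directly.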

But the naive way of working with word2vec vectors, such as summing or averaging them, still gives you a “bag of words” model, in which the phrases “The man killed the tiger” and “The tiger killed the man” are indistinguishable.
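A small sketch of why this happens, assuming the "naive way" means averaging the word vectors of a sentence. The 2-dimensional vectors and the helper `average_vector` are hypothetical and exist only for this illustration.

```python
import math

# Hypothetical 2-dimensional word vectors, invented for illustration only.
vectors = {
    "the":    [0.125, 0.125],
    "man":    [1.0, 0.25],
    "killed": [0.25, 0.75],
    "tiger":  [0.75, 0.5],
}


def average_vector(sentence):
    """Represent a sentence as the mean of its word vectors (word order is lost)."""
    words = sentence.lower().split()
    dims = len(next(iter(vectors.values())))
    sums = [0.0] * dims
    for word in words:
        for i, x in enumerate(vectors[word]):
            sums[i] += x
    return [s / len(words) for s in sums]


a = average_vector("The man killed the tiger")
b = average_vector("The tiger killed the man")

# Averaging is order-independent, so both sentences map to the same point.
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(a, b, same)
```

Both sentences contain the same words, so any representation built only from a sum or mean of word vectors cannot tell them apart.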

We need models that pay attention to word order: paragraph2vec, sentence embeddings (using RNN/LSTM), even World2Vec (LeCun @CVPR2015).

Multi-modal learning

http://arxiv.org/abs/1411.2539 Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models



Example: More multi-modal learning

Case: Sentiment analysis

http://nlp.stanford.edu/sentiment/

Can capture complex cases where bag-of-words models fail.

“This movie was actually neither that funny, nor super witty.”
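The failure mode can be sketched with a naive lexicon counter standing in for a bag-of-words model. The mini lexicon and the function `bag_of_words_score` are assumptions made up for this example; this is not the Stanford sentiment model.

```python
import re

# Hypothetical mini sentiment lexicon, invented for this example.
LEXICON = {"funny": 1, "witty": 1, "boring": -1, "awful": -1}


def bag_of_words_score(text):
    """Sum per-word sentiment scores, ignoring word order and negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


sentence = "This movie was actually neither that funny, nor super witty."
score = bag_of_words_score(sentence)

# The counter sees "funny" and "witty" and reports a positive score,
# although "neither ... nor" makes the sentence negative.
print(score)
```

A model that composes meaning over the sentence structure can take the "neither ... nor" construction into account, which a word counter cannot.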



Case: Machine Translation

Sequence to Sequence Learning with Neural Networks, http://arxiv.org/abs/1409.3215


Case: Automated Speech Translation

Translating voice calls and video calls in 7 languages and instant messages in over 50.

https://www.skype.com/en/features/skype-translator/



Case: Baidu Automated Speech Recognition (ASR)

More Fun: MtG cards

http://www.escapistmagazine.com/articles/view/scienceandtech/14276-Magic-The-Gathering-Cards-Made-by-Artificial-Intelligence



Case: Question Answering

A Neural Network for Factoid Question Answering over Paragraphs, https://cs.umd.edu/~miyyer/qblearn/


Case: Dialogue Systems

A Neural Conversational Model, Oriol Vinyals, Quoc Le
http://arxiv.org/abs/1506.05869



What for: Conversational Commerce

https://medium.com/chris-messina/2016-will-be-the-year-of-conversational-commerce-1586e85e3991



What for: Conversational Commerce

Summary

Why is Deep Learning helpful? Or even a game-changer?

● Works on raw data (pixels, sound, text or chars), no need for feature engineering
  ○ Some features are really hard to develop (they require years of work from a group of experts)
  ○ Some features are patented (e.g. SIFT, SURF for images)
● Allows end-to-end learning (pixels to category, sound to sentence, English sentence to Chinese sentence, etc.)
  ○ No need to do segmentation, etc. (a lot of manual labor)

⇒ You can iterate faster (and get superior quality at the same time!)

Still some issues exist

● No dataset -- no deep learning
  A lot of data is available (and it is required for deep learning; otherwise simple models could do better)
  ○ But sometimes you have no dataset…
    ■ Nonetheless, some hacks are available: transfer learning, data augmentation, Mechanical Turk, …
● Requires a lot of computation
  No cluster or GPU machines -- much more time required
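Data augmentation, one of the workarounds listed above, can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: the "image" is a toy 4x4 grid of integers stored as nested lists, and the helper names (`horizontal_flip`, `random_crop`, `augment`) are made up here. Real pipelines usually rely on an image library and richer transformations.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, crop_size, n_variants, seed=0):
    """Produce several randomly cropped and possibly flipped variants of one image."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    variants = []
    for _ in range(n_variants):
        crop = random_crop(image, crop_size, rng)
        if rng.random() < 0.5:
            crop = horizontal_flip(crop)
        variants.append(crop)
    return variants


# Toy 4x4 "image" whose pixel values are just running indices.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
variants = augment(image, crop_size=3, n_variants=5)

for variant in variants:
    print(variant)
```

Each variant keeps the original label, so a single labelled image yields several training examples.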

So what to do next?

Universal Libraries and Frameworks

● Torch7 (http://torch.ch/)
● TensorFlow (https://www.tensorflow.org/)
● Theano (http://deeplearning.net/software/theano/)
  ○ Keras (http://keras.io/)
  ○ Lasagne (https://github.com/Lasagne/Lasagne)
  ○ blocks (https://github.com/mila-udem/blocks)
  ○ pylearn2 (https://github.com/lisa-lab/pylearn2)
● CNTK (http://www.cntk.ai/)
● Neon (http://neon.nervanasys.com/)
● Deeplearning4j (http://deeplearning4j.org/)
● Google Prediction API (https://cloud.google.com/prediction/)
● …
● http://deeplearning.net/software_links/


Libraries & Frameworks for image/video processing

● OpenCV (http://opencv.org/)
● Caffe (http://caffe.berkeleyvision.org/)
● Torch7 (http://torch.ch/)
● clarifai (http://clarif.ai/)
● Google Vision API (https://cloud.google.com/vision/)
● …
● + all universal libraries


Libraries & Frameworks for speech

● CNTK (http://www.cntk.ai/)
● KALDI (http://kaldi-asr.org/)
● Google Speech API (https://cloud.google.com/)
● Yandex SpeechKit (https://tech.yandex.ru/speechkit/)
● Baidu Speech API (http://www.baidu.com/)
● wit.ai (https://wit.ai/)
● …


Libraries & Frameworks for text processing

● Torch7 (http://torch.ch/)
● Theano/Keras/…
● TensorFlow (https://www.tensorflow.org/)
● MetaMind (https://www.metamind.io/)
● Google Translate API (https://cloud.google.com/translate/)
● …
● + all universal libraries


What to read and where to study?

- CS231n: Convolutional Neural Networks for Visual Recognition, Fei-Fei Li, Andrej Karpathy, Stanford (http://vision.stanford.edu/teaching/cs231n/index.html)
- CS224d: Deep Learning for Natural Language Processing, Richard Socher, Stanford (http://cs224d.stanford.edu/index.html)
- Neural Networks for Machine Learning, Geoffrey Hinton (https://www.coursera.org/course/neuralnets)
- Computer Vision course collection (http://eclass.cc/courselists/111_computer_vision_and_navigation)
- Deep learning course collection (http://eclass.cc/courselists/117_deep_learning)
- Book “Deep Learning”, Ian Goodfellow, Yoshua Bengio and Aaron Courville (http://www.deeplearningbook.org/)


What to read and where to study?

- Google+ Deep Learning community (https://plus.google.com/communities/112866381580457264725)
- VK Deep Learning community (http://vk.com/deeplearning)
- Quora (https://www.quora.com/topic/Deep-Learning)
- FB Deep Learning Moscow (https://www.facebook.com/groups/1505369016451458/)
- Twitter Deep Learning Hub (https://twitter.com/DeepLearningHub)
- NVidia blog (https://devblogs.nvidia.com/parallelforall/tag/deep-learning/)
- IEEE Spectrum blog (http://spectrum.ieee.org/blog/cars-that-think)
- http://deeplearning.net/
- Arxiv Sanity Preserver (http://www.arxiv-sanity.com/)
- ...


Whom to follow?

- Jürgen Schmidhuber (http://people.idsia.ch/~juergen/)
- Geoffrey E. Hinton (http://www.cs.toronto.edu/~hinton/)
- Google DeepMind (http://deepmind.com/)
- Yann LeCun (http://yann.lecun.com, https://www.facebook.com/yann.lecun)
- Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy, https://www.quora.com/profile/Yoshua-Bengio)
- Andrej Karpathy (http://karpathy.github.io/)
- Andrew Ng (http://www.andrewng.org/)
- ...


https://ru.linkedin.com/in/grigorysapunov [email protected]

Thanks!
