Deep Learning, Lecture 1. David Banks, Duke and SAMSI.


  • Deep Learning, Lecture 1

    David Banks, Duke and SAMSI

  • 0. Class Policies and Information

    The class website is www.stat.duke.edu/~banks.

    The class is jointly attended by students from Duke and NCSU, so our start time may adjust.

    We shall study applied deep learning, including NNs, CNNs, RNNs, and GANs. We shall do some theory, but there is no mathematical prerequisite.


  • About weekly, there will be an exercise to apply a deep network to a canonical dataset. You may work in teams of three. You will submit your trained network parameters and your error rates on the test data.

    By next week, self-select into teams and make sure you can access the canonical datasets. There is a sign-up sheet on our website and instructions on how to access the data.

    People need not know the same programming language/API. Popular ones are Python, PyTorch, Keras and TensorFlow.

  • Deep Learning (DL) has some science but is mostly art. So we shall use the class assignments to perform a designed experiment that studies how much benefit accrues from popular DL strategies.

    The factors in the designed experiment will be such things as which dataset is used, how deep a network is fit, what type of activation function is used, and various regularization methods.

    The results will be analyzed using an Analysis of Variance (ANOVA). This is a standard tool in statistics.
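    To make the analysis concrete, here is a minimal sketch of how the experiment's results might be analyzed with an ANOVA in Python; the file name and the factor/response column names are hypothetical placeholders, not part of the course materials.

```python
# Minimal sketch of the planned ANOVA, assuming a results table with one
# row per trained network. The file name and the column names
# ('dataset', 'depth', 'activation', 'regularization', 'test_error')
# are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

results = pd.read_csv("class_experiment_results.csv")  # hypothetical file

# Main-effects model; each design factor is treated as categorical.
model = ols(
    "test_error ~ C(dataset) + C(depth) + C(activation) + C(regularization)",
    data=results,
).fit()

# Standard ANOVA table: how much variation in test error each factor explains.
print(sm.stats.anova_lm(model, typ=2))
```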

  • 1.1 Why Do We Care?

    A large part of our future is data engineering, not data science. A data engineer uses data and algorithms to produce something useful. Examples include:
    • Google Maps
    • Google Translate
    • Autonomous vehicles
    • Recommender systems

    All of these applications rest on DL.

  • Tesla Autopilot Hardware v2+

    It uses NVIDIA Drive PX 2 hardware and a CNN with 8-camera input and an Inception 1 architecture.

  • In the United States, there are about 1.18 fatalities per 10^8 (100 million) human-driven miles.

    There have been four fatalities with level 2 systems (Tesla), one with a level 3 system (Uber), and none with level 4 or 5 systems. The number of miles driven by autonomous vehicles is on the order of 10^9.

    But there are still open questions: weather conditions, vulnerability to hacking, and ethical decisions.

  • Reasons why driverless cars are better:
    • Safer: better sensors, no distraction, coordination among all vehicles
    • Much better fuel economy: cars could be made out of canvas
    • Congestion: seven-fold more cars
    • Independence: seniors and children

    Challenges:
    • Difficult to train
    • Legal/regulatory proof of safety
    • Mixed-fleet transition period

  • GANs for Image Synthesis

    Generative Adversarial Networks (GANs) are pairs of deep NNs. One classifies images as true or fake, and the other generates images to fool the first. Game theory ensures the two systems will converge to an equilibrium (Ian Goodfellow et al., 2014).

    These are used to create amazing photorealistic images which are used in games and other applications.
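    As a concrete illustration of the adversarial game, the following is a minimal sketch of one GAN training step in PyTorch; the tiny fully connected generator and discriminator, and all sizes and learning rates, are placeholder assumptions rather than any particular published architecture.

```python
# Minimal sketch of one GAN training step in PyTorch. The tiny fully
# connected generator G and discriminator D, and all sizes and learning
# rates, are placeholder assumptions, not a published architecture.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784  # e.g., flattened 28 x 28 images

G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1))  # outputs a real/fake logit

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    n = real_batch.size(0)

    # Discriminator: label real images 1 and generated images 0.
    fake = G(torch.randn(n, latent_dim)).detach()
    d_loss = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator: try to make the discriminator label its output as real.
    g_loss = bce(D(G(torch.randn(n, latent_dim))), torch.ones(n, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```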

  • All of these are fake.

  • GANs can do more things than create photo-realistic images. They can:
    • build 3D models from 2D images
    • produce movies, by concatenating images
    • age images of faces to help find missing children
    • find optimal control inputs for nonlinear dynamical systems
    • generate images from text.

    As we shall see later, text generation is more difficult.

  • The GANs audience at NIPS 2017. The statistical community needs to take this seriously.

  • AlphaGo

    Non-GAN deep learners were the basis for AlphaGo's policy network and value network, which enabled it to defeat human champions. The ideas easily generalize to improve airline routing, personal investment, or infrastructure management. It uses tensor processing unit (TPU) chips developed by Google.

  • AlphaGo beat the top human player.

  • DeepMind’s AlphaZero

    Google used DL to train AlphaZero, a chess-playing algorithm. It was trained by self-play for 9 hours.

    Its opponent was Stockfish, which had undergone continual development from 2008 to 2017.

    In 2017, AlphaZero beat Stockfish as white 25 times, as black 3 times, and drew the other 72 games in the set.

  • DL in Computer Vision: Ten Applications

    • Image Classification
    • Image Classification with Localization
    • Object Detection
    • Object Segmentation
    • Image Style Transfer
    • Image Colorization
    • Image Reconstruction
    • Image Super-Resolution
    • Image Synthesis
    • Caricature Identification

    This course will look at several of these applications.

  • DL-generated caricatures of people.

  • ImageNet Challenge

    • AlexNet (2012): first CNN winner; 8 layers, 61 × 10^6 parameters
    • ZFNet (2013): 8 layers
    • VGGNet (2014): error 11.2% to 7.3%; 16 layers, 138 × 10^6 parameters
    • GoogLeNet (2014): 11.2% to 6.7%; 22 layers, 5 × 10^6 parameters
    • ResNet (2015): 6.7% to 3.57%; 152 layers
    • CUImage (2016): 3.57% to 2.99%; ensemble of 6 models
    • SENet (2017): 2.99% to 2.251%; the network adaptively adjusts the weighting of each feature map in the convolutional block

  • ImageNet Large Scale Visual Recognition Challenge

  • Other Recent DL Successes

    BERT (Bidirectional Encoder Representations from Transformers) is the leading natural language processor. Instead of reading sentences left-to-right or right-to-left, the Google AI Language model does both simultaneously. It outperforms competitors on Question Answering and Natural Language Inference challenges, among others.

    It is open source. It was trained on Wikipedia.

  • AdaNet is a DL system for ensemble modeling, which combines multiple Machine Learning (ML) predictions. It is based on TensorFlow and trains quickly.

    AutoAugment is a DL system that expands image training sets by rotating, reflecting, and shearing. It uses reinforcement learning to find good image transformation policies from the data itself. This addresses the problem that DL needs lots of training data. It shows improved performance on the standard challenge datasets.

  • Synthetic Data. One can distort data, forcing the NN to focus on essential image characteristics.
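    AutoAugment searches for the transformation policy automatically, but the underlying distortions are easy to apply by hand. Here is a minimal sketch using torchvision (an assumed choice of framework; any of the course tools would do), with illustrative rotation, flip, and shear parameters.

```python
# Hand-written image augmentation of the kind AutoAugment searches over:
# rotation, reflection, and shear. Uses torchvision transforms (an assumed
# choice of framework); the parameter values are illustrative.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=15),         # small random rotations
    T.RandomHorizontalFlip(p=0.5),        # reflections
    T.RandomAffine(degrees=0, shear=10),  # shearing
    T.ToTensor(),
])

# Applied on the fly while loading a training set, e.g. CIFAR-10:
# from torchvision.datasets import CIFAR10
# train_set = CIFAR10(root="data", train=True, download=True, transform=augment)
```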

  • OpenAI 5 & Dota 2Dota 2 as a testbed for the messiness and continuous natureof the real world: teamwork, long time horizons, and hiddeninformation.

  • Problems with Deep Learning

    • It takes a lot of data to train a deep NN.
    • Training can be slow and may get caught in a bad local minimum (it will get caught in some local minimum).
    • There is a large carbon footprint.
    • There is not much mathematical/statistical theory.
    • There are known pathologies.
    • They can be used for evil.

  • The DL Ecosystem

    There are seven components:
    • Software
    • Datasets
    • Architectures
    • Training
    • Applications
    • Theoretical properties
    • Testing

  • Software Environment

    Factors to consider:
    • Learning curve
    • Speed of development
    • Size and passion of community
    • Number of papers implemented in the framework
    • Likelihood of long-term growth and stability
    • Ecosystem of tooling

  • Deep Learning Frameworks

    Figure: Deep Learning Framework Power Scores, by Jeff Hale (http://bit.ly/2GBa3tU), ranking eleven frameworks.

  • | Name | Website | GitHub | License | Language | APIs | Rating |
    |------|---------|--------|---------|----------|------|--------|
    | TensorFlow | http://tensorflow.org | tensorflow/tensorflow | Apache-2.0 | C++, Python | Python, C++, Java, Go | 100 |
    | Keras | http://keras.io | fchollet/keras | MIT | Python | Python, R | 46.1 |
    | Caffe | http://caffe.berkeleyvision.org | BVLC/caffe | BSD | C++ | Python, MATLAB | 38.1 |
    | MXNet | http://mxnet.io | apache/incubator-mxnet | Apache-2.0 | C++ | Python, Scala, R, JavaScript, Julia, MATLAB, Go, C++, Perl | 34 |
    | Theano | http://deeplearning.net/software/theano | Theano/Theano | BSD | Python | Python | 19.3 |
    | CNTK | https://docs.microsoft.com/en-us/cognitive-toolkit | Microsoft/CNTK | MIT | C++ | Python, C++, C#, Java | 18.4 |
    | DeepLearning4J | https://deeplearning4j.org | deeplearning4j/deeplearning4j | Apache-2.0 | Java, Scala | Java, Scala, Clojure, Kotlin | 17.8 |
    | PaddlePaddle | http://www.paddlepaddle.org | baidu/paddle | Apache-2.0 | C++ | C++ | 16.3 |
    | PyTorch | http://pytorch.org | pytorch/pytorch | BSD | C++, Python | Python | 14.3 |

    GitHub metric ratings from Zacharias et al., arXiv:1803.04818v2.

  • Data Sets

    We shall use six well-known datasets in this course:
    • MNIST Handwritten Digits
    • MNIST Fashion
    • Cat and Non-Cat
    • Cifar-10
    • Street View House Numbers
    • IMDB Large Movie Reviews

    The first five are image data sets, but the last is text.

  • The MNIST Data

    This consists of a training set of 60,000 handwritten digits, and 10,000 test samples. The data have been size normalized and centered in a 28 × 28 pixel grid. It is a mixture of data from Census Bureau employees and high school students.

    This is an old dataset, collected in 1995. It is pretty easy to get good accuracy with simple NNs.
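    As a minimal sketch, the data can be loaded directly through Keras (an assumed choice of tooling; the other frameworks have similar loaders):

```python
# Minimal sketch of loading the MNIST digits through Keras (an assumed
# choice of tooling; the other frameworks have similar loaders).
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # rescale pixels to [0, 1]

print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)
```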

  • MNIST Fashion Data

    The MNIST Fashion data set contains 70,000 grayscale images of 10 types of fashion products. There are 7,000 images in each category. There are 60,000 images in the training set, and 10,000 in the test set.

    As with the digits, all images have been centered and grayscaled in a 28 × 28 pixel format.

  • The MNIST Fashion data is more challenging. It consists of images of ten types of wearables: tops, pants, pullovers, dresses, coats, sandals, shirts, sneakers, handbags, and ankle boots.

  • Cat and Non-Cat Data

    This data set comes from Andrew Ng’s Coursera class on Neural Networks and Deep Learning. The training data consist of 209 images labeled “cat” or “non-cat”. The test set consists of 50 images.

    These are color images, with an RGB scale. The images have been centered, standardized, and are square.

  • Cifar-10 Dataset

    This is a labeled subset of a collection of 80 million color images, standardized to a 32 × 32 pixel RGB grid format.

    There are 60,000 images, with 6,000 per each of ten categories: automobile, bird, truck, frog, deer, ship, horse, cat, airplane, and dog.

    There are 10,000 images in the test set.

  • Street View House Numbers

    There are ten classes, one for each digit. There are 73,257 digits for training, and 26,032 for testing. This is similar to the MNIST digits, but more challenging since the images have background distractors and varying degrees of resolution.

    The data come in two formats. We shall use the one in which images contain a single centered RGB digit in a 32 × 32 pixel grid format.

  • IMDb Movie Review Dataset

    These data are for binary sentiment classification, from a training set of 25,000 highly polarized movie reviews, with another 25,000 for testing. The text is coded as numbers.

    There are interesting text analyses that could be done, besides classifying a review as positive/negative.

  • Architectures

    The architecture refers to the connectivity structure among the nodes in a neural network. Different architectures favor different applications, and one of the challenges in DL is to find a good architecture.

    The architecture links the input data through a series of layers consisting of perceptron nodes, which then finally produce the output.

  • The Perceptron, the building block of neural networks.

  • Multilayer NN: $\rho(W_3 \cdot \rho(W_2 \cdot \rho(W_1 \cdot x_0)))$

  • The $x_1, \dots, x_n$ are covariates (e.g., pixel values for images, or words coded as numbers).

    The perceptron multiplies each covariate by a corresponding weight, $w_1, \dots, w_n$, and adds a constant $b$ to produce $z = w_1 x_1 + \cdots + w_n x_n + b$.

    The activation function $f(z)$ is a monotone function of $z$, shown here as a sigmoid function, a step function, a tanh function, and a ReLU function.

  • Sigmoid function: $f(z) = \dfrac{\exp(z)}{1 + \exp(z)}$

    Step function: $f(z) = \begin{cases} 0, & \text{if } z < 0 \\ a, & \text{if } z \ge 0 \end{cases}$ for $a > 0$

    Tanh function: $f(z) = \dfrac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$

    ReLU function: $f(z) = \begin{cases} 0, & z < 0 \\ z, & z \ge 0 \end{cases}$
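    A minimal NumPy sketch of these pieces: the four activation functions above, a single perceptron computing $z$ and $f(z)$, and the layer-by-layer composition from the earlier slide. All sizes and random values are illustrative assumptions.

```python
# NumPy sketch of the activation functions above, a single perceptron,
# and the multilayer composition rho(W3 . rho(W2 . rho(W1 . x0))) from the
# earlier slide. All sizes and random values are illustrative.
import numpy as np

def sigmoid(z):     return np.exp(z) / (1.0 + np.exp(z))
def step(z, a=1.0): return np.where(z < 0, 0.0, a)
def tanh(z):        return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
def relu(z):        return np.maximum(0.0, z)

def perceptron(x, w, b, f=sigmoid):
    # Weighted sum of the covariates plus a constant, then the activation.
    z = np.dot(w, x) + b
    return f(z)

def layer(x, W, b, f=relu):
    # A layer applies many perceptrons at once: each row of W is one weight vector.
    return f(W @ x + b)

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)                        # covariates x1, ..., x4
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

print(layer(layer(layer(x0, W1, b1), W2, b2), W3, b3))
```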

  • There are three main architectures we shall consider, each of which can be varied in many ways:
    • The (deep) feed-forward NN, in which outputs cannot be inserted as inputs to previous layers.
    • The recurrent NN (RNN), where the graph has cycles (good for time data, such as speech recognition).
    • The convolutional neural network (CNN), in which not all nodes in one layer are connected to each node in the subsequent layer.

    There are others, such as recursive NNs, which re-use weights.

  • Training a NN

    This is a significant challenge. For a large network, one must estimate millions of weights. The classical approach is backpropagation, usually implemented through coordinate gradient descent.

    Also, regularization often improves predictive accuracy. One kind of regularization uses weights that take only a small number of different values (quantization). Another kind does not use completely connected layers.

  • Training is an optimization problem. One wants to find weights $w_i$ that minimize a loss function comparing the predictions $\hat{Y}$ to the training outputs $Y$. A typical loss function is squared error,

    $L(Y, \hat{Y}) = \sum_{i=1}^{M} (Y_i - \hat{Y}_i)^2.$

    But there are many other choices, such as the sum of absolute deviations or differences of convex penalties (which need not be convex). In particular, the SCAD penalty has nice properties.
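    A minimal NumPy sketch of two of these loss functions, comparing predictions to training targets on made-up numbers:

```python
# NumPy sketch of two of the loss functions mentioned above, comparing
# predictions y_hat to training targets y (the numbers are made up).
import numpy as np

def squared_error(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def absolute_deviation(y, y_hat):
    return np.sum(np.abs(y - y_hat))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.8, 3.4])
print(squared_error(y, y_hat), absolute_deviation(y, y_hat))  # about 0.21 and 0.7
```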

  • The previous loss functions were relevant to predicting a numerical value $Y$ by $\hat{Y}$. If one has categorical data, one can minimize the Kullback-Leibler divergence. Let $p_i$ be the true probability of label $i$ and let $q_i$ be the probability predicted by the NN. Then one wants to minimize

    $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{K} p_i \ln \dfrac{p_i}{q_i}.$

    Again, there are alternative loss functions for categorical data.
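    A minimal NumPy sketch of this divergence for a single one-hot label; the small constant guarding against taking the log of zero is an implementation assumption.

```python
# NumPy sketch of the Kullback-Leibler divergence D_KL(P || Q) between the
# true label distribution p and the network's predicted distribution q.
# The small constant eps guarding against log(0) is an implementation choice.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([1.0, 0.0, 0.0])   # true one-hot label
q = np.array([0.7, 0.2, 0.1])   # probabilities predicted by the NN
print(kl_divergence(p, q))      # about 0.357
```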

  • Coordinate gradient descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration the algorithm chooses a coordinate (or set of coordinates) according to some rule (e.g., cycling, random, most change) and then minimizes the function in that direction.

    The NN training objective is non-convex, so there are typically many local minima, which all behave about the same.

  • This illustrates, in 2D, how coordinate gradient descent works: first it minimizes in the y direction, then the x direction, and it alternates.

    One could also pick a direction based upon a linear combination of x and y.
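    A minimal Python sketch of the alternating scheme just described, applied to a made-up convex quadratic where each coordinate minimization can be done exactly; for deep networks the objective is non-convex, so the answer can depend on the starting point.

```python
# Python sketch of alternating coordinate minimization on a made-up convex
# quadratic f(x, y), where each coordinate minimization has a closed form.
def f(x, y):
    return (x - 1.0) ** 2 + 2.0 * (y + 2.0) ** 2 + x * y

x, y = 5.0, 5.0                     # arbitrary starting point
for it in range(20):
    y = -2.0 - x / 4.0              # minimize over y holding x fixed
    x = 1.0 - y / 2.0               # minimize over x holding y fixed
    print(it, round(x, 4), round(y, 4), round(f(x, y), 6))
```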

  • Theory: The Universal Approximation Theorem

    Cybenko (1989) proved that a feed-forward NN with a single hidden layer having a finite number of perceptrons can approximate any continuous function on a compact subset of $\mathbb{R}^n$. He used a sigmoidal activation function, but a similar result was found by Hanin (2017) for ReLU activation functions.

    This is related to earlier work by Kolmogorov and Arnold, who solved Hilbert’s 13th problem.

  • The intuition for the Universal Approximation Theorem is simple.
    • A continuous function on a compact set can be approximated by a piecewise constant function.
    • To represent a piecewise constant function as a NN, for each region where the function is constant, use a NN as an indicator function for that region. Then build a final layer with a single node whose input is a linear combination that sums all the indicators, with weights equal to the constant value of the corresponding region.
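    A minimal NumPy sketch of this intuition: approximate a continuous function on an interval by a sum of step-function indicator units, one per region, weighted by the function's value there. The target function and the number of regions are illustrative assumptions.

```python
# NumPy sketch of the intuition above: approximate a continuous function on
# a compact interval by a sum of step-function "indicator" units, one per
# region, weighted by the function's value there. The target function and
# the number of regions are illustrative choices.
import numpy as np

def target(x):
    return np.sin(x)                 # any continuous function will do

def indicator(x, a, b):
    # Difference of two step functions: equals 1 on [a, b) and 0 elsewhere.
    return np.heaviside(x - a, 1.0) - np.heaviside(x - b, 1.0)

K = 50                                # number of regions / hidden units
edges = np.linspace(0.0, 2 * np.pi, K + 1)
mids = (edges[:-1] + edges[1:]) / 2.0
weights = target(mids)                # constant value used in each region

def network(x):
    # Final node: a linear combination that sums all the indicators.
    return sum(w * indicator(x, a, b)
               for w, a, b in zip(weights, edges[:-1], edges[1:]))

x = np.linspace(0.0, 2 * np.pi, 1000)
print(np.max(np.abs(network(x) - target(x))))  # error shrinks as K grows
```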

  • The width of a NN is the number of nodes in a layer (these need not be the same for all layers, but ignore that for now).

    Single-hidden-layer feed-forward networks require exponential width. Lu, Pu, Wang, Hu and Wang (2017) showed universal approximability for deep NNs with width bounded by n + 4 and ReLU activation functions.

    They also showed that the width must be at least n.
