deep learning by jskim
Deep Learning: History, Present, and Application to Public Health
김진섭
Genomic Epidemiology
September 10, 2014
김진섭 (Genomic Epidemiology) Deep Learning September 10, 2014 1 / 74
What is Deep Learning?
Contents
1 What is Deep Learning?
2 History
    Perceptron
    Multilayer Perceptron
    1st Breakthrough: Unsupervised Learning
    2nd Breakthrough: Supervised Learning
3 Apply to Public Health
    Epidemiology vs Machine Learning
    Deep Learning vs Other ML
    Hypothesis Testing vs Hypothesis Generating
4 Conclusion
Machine Learning
A branch of artificial intelligence that develops prediction models so that a computer can learn from data and make predictions.
Computer science + Statistics ??
Amazon, Google, Facebook..
Neural Network
Human brain VS Computer
3431 × 3324 = ??
Telling dogs from cats, speech recognition, character recognition
Sequential VS Parallel
Neuron & Artificial Neural Network(ANN)[19]
Figure. (A) Human neuron; (B) artificial neuron or hidden unit; (C) biological synapse; (D) ANN synapses.
http://www.nd.com/welcome/whatisnn.htm
Deep Neural Network (DNN) ≃ Deep Learning
Global IT companies focus on 'machine learning' http://www.dt.co.kr/contents.html?article_no=2014062002010960718002
The world is in an AI boom, a 6-trillion-dollar blue ocean, and Korea is missing out http://vip.mk.co.kr/news/view/21/20/1178659.html
MS cloud gets smarter with 'machine learning' http://www.bloter.net/archives/196341
Five major emerging technologies and 'deep learning' http://www.wikitree.co.kr/main/news_view.php?id=157174
Google's Manhattan Project in the age of AI http://weekly.chosun.com/client/news/viw.asp?nNewsNumb=002311100009&ctcd=C02
History
History Perceptron
Perceptron
Rosenblatt, 1958 [23].
$y = \phi\left(\sum_{i=1}^{n} w_i x_i + b\right)$ (1)
($b$: bias, $\phi$: activation function, e.g. logistic or tanh)
Figure. Concept of Perceptron [Honkela]
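Equation (1) can be written out in a few lines. The sketch below is only an illustration: the input, weights, and bias are hypothetical values, and `perceptron` and `logistic` are names introduced here, not taken from the slides.

```python
import math

def logistic(z):
    """Logistic activation, one of the choices for phi named on the slide."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b, phi=math.tanh):
    """Return y = phi(sum_i w_i * x_i + b) for a single input vector x."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical values chosen so the weighted sum is exactly zero.
y = perceptron([1.0, 2.0], w=[0.5, -0.25], b=0.0, phi=logistic)
```

With a weighted sum of zero, the logistic activation returns 0.5, the midpoint of its range.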
Low Performance
A single perceptron cannot even solve XOR [Hinton].
History Multilayer Perceptron
Multilayer Perceptron
Adding hidden layers solves the problem!!
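To see why one hidden layer is enough for XOR, here is a minimal sketch with hand-picked weights (an assumption for illustration, not weights learned by any algorithm): one hidden unit computes OR, another computes AND, and the output fires for "OR but not AND".

```python
def step(z):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """Two-layer perceptron with hand-set (hypothetical) weights."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR and not AND

table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```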
Learning Problem
More hidden layers → more weights..
1985: Error Backpropagation Algorithm [24]
Gradient descent methods, working backward from the output layer..
Gradient Descent Methods
There are too many weights..
Linear regression: Least squares, maximum likelihood: Exact calculation.
MLP: No exact method
Gradient Descent Algorithm[Han-Hsing]
(a) Large Gradient (b) Small Gradient
(c) Small Learning Rate (d) Large Learning Rate
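The effect of the learning rate in panels (c) and (d) can be reproduced on a toy problem. The sketch below assumes the objective $f(w) = w^2$ (gradient $2w$) and three hypothetical learning rates. None of these values come from the figure.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize f(w) = w**2 by repeating w <- w - lr * f'(w)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

small = gradient_descent(lr=0.01)  # each step shrinks w by only 2%: slow
good = gradient_descent(lr=0.3)    # each step shrinks w by 60%: fast
large = gradient_descent(lr=1.1)   # each step overshoots: |w| grows
```

For this objective each update multiplies $w$ by $(1 - 2\,\text{lr})$, so the iteration converges only when that factor has magnitude below 1.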
Example[Hinton]
A toy example to illustrate the iterative method
• Each day you get lunch at the cafeteria.
  – Your diet consists of fish, chips, and ketchup.
  – You get several portions of each.
• The cashier only tells you the total price of the meal.
  – After several days, you should be able to figure out the price of each portion.
• The iterative approach: Start with random guesses for the prices and then adjust them to get a better fit to the observed prices of whole meals.
Solving the equations iteratively
• Each meal price gives a linear constraint on the prices of the portions:
  $\text{price} = x_{\text{fish}} w_{\text{fish}} + x_{\text{chips}} w_{\text{chips}} + x_{\text{ketchup}} w_{\text{ketchup}}$
• The prices of the portions are like the weights of a linear neuron:
  $\mathbf{w} = (w_{\text{fish}}, w_{\text{chips}}, w_{\text{ketchup}})$
• We will start with guesses for the weights and then adjust the guesses slightly to give a better fit to the prices given by the cashier.
The true weights used by the cashier
Figure. A linear neuron. Inputs: 2 portions of fish, 5 portions of chips, 3 portions of ketchup. True weights (price per portion): 150, 50, 100. Price of meal = 850 = target.
A model of the cashier with arbitrary initial weights
Figure. A linear neuron with initial weights 50, 50, 50 and inputs of 2 portions of fish, 5 portions of chips, 3 portions of ketchup. Price of meal = 500.
• Residual error = 350
• The “delta-rule” for learning is: $\Delta w_i = \varepsilon\, x_i (t - y)$
• With a learning rate $\varepsilon$ of 1/35, the weight changes are +20, +50, +30.
• This gives new weights of 70, 100, 80.
  – Notice that the weight for chips got worse!
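The arithmetic of this single delta-rule step can be checked directly. The sketch below uses the numbers from the cashier example (portions 2, 5, 3; true prices 150, 50, 100; initial weights 50; learning rate 1/35). The variable names are introduced here, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

portions = [2, 5, 3]           # fish, chips, ketchup
true_prices = [150, 50, 100]   # the cashier's true weights
weights = [50, 50, 50]         # arbitrary initial guess
epsilon = Fraction(1, 35)      # learning rate

target = sum(x * w for x, w in zip(portions, true_prices))  # t
output = sum(x * w for x, w in zip(portions, weights))      # y
residual = target - output                                  # t - y

# Delta rule: delta_w_i = epsilon * x_i * (t - y)
delta = [epsilon * x * residual for x in portions]
new_weights = [w + d for w, d in zip(weights, delta)]
```

The chips weight moves from 50 to 100 even though its true value is 50, which is the "got worse" case noted above: one step improves the overall fit, not each weight individually.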
Deriving the delta rule
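• Define the error as the squared residuals summed over all training cases:
  $E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$
• Now differentiate to get error derivatives for weights:
  $\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = -\sum_n x_i^n (t^n - y^n)$
• The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases:
  $\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i} = \sum_n \varepsilon\, x_i^n (t^n - y^n)$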
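Repeating the batch delta rule over several meals recovers the portion prices. In the sketch below, the three meals, the learning rate of 0.01, and the 2000 epochs are hypothetical choices made so that the iteration is stable. Only the true prices (150, 50, 100) come from the example.

```python
def batch_delta_rule(X, t, w, eps, epochs):
    """Apply delta_w_i = eps * sum_n x_i^n (t^n - y^n) for a linear neuron."""
    for _ in range(epochs):
        y = [sum(xi * wi for xi, wi in zip(x, w)) for x in X]
        step = [
            eps * sum(x[i] * (tn - yn) for x, tn, yn in zip(X, t, y))
            for i in range(len(w))
        ]
        w = [wi + d for wi, d in zip(w, step)]
    return w

# Hypothetical meals (portions of fish, chips, ketchup) priced with 150, 50, 100.
meals = [[2, 5, 3], [3, 1, 2], [1, 2, 4]]
prices = [850, 700, 650]

learned = batch_delta_rule(meals, prices, w=[50.0, 50.0, 50.0],
                           eps=0.01, epochs=2000)
```

Because the three meals are linearly independent, the squared error has a single minimum at the true prices, and a small enough learning rate drives the weights there.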
Backpropagation Algorithm[Kim]
(e) Forward Propagation (f) Back Propagation
Limitations of MLP[Kim]
1 Vanishing gradient problem
2 Typically requires lots of labeled data
3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
4 Get stuck in local minima (?)
Vanishing Gradient[2]
Figure. Sigmoid functions
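A short calculation shows why sigmoid units make gradients vanish. The derivative of the logistic sigmoid is at most 0.25, so the product of ten such factors along a backpropagation path is already below one millionth. The sketch ignores the weight factors in the chain rule, which is a simplification made here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)         # the derivative peaks at z = 0
bound_10_layers = max_grad ** 10     # best case through 10 sigmoid layers
saturated = sigmoid_grad(10.0)       # far from zero the derivative is tiny
```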
Local Minima[Kim]
Figure. Global and Local Minima
History 1st Breakthrough: Unsupervised Learning
1st Breakthrough: Unsupervised Learning
2006: Restricted Boltzmann Machine, Deep Belief Network, Deep Boltzmann Machine [25, 13]..
Figure. Description of Unsupervised Learning [Kim]
Limitations of MLP[Kim]
1 Vanishing gradient problem
    Solved by bottom-up layerwise unsupervised pre-training
2 Typically requires lots of labeled data
3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
    Solved by using lots of unlabeled data
4 Get stuck in local minima (?)
    Unsupervised pre-training may help the network initialize with good parameters
Restricted Boltzmann Machine(RBM)
The lower the energy, the higher the probability.
$P(v, h) = \frac{1}{Z} \exp(-E(v, h))$
($Z$: normalizing constant)
Figure. Diagram of a Restricted Boltzmann Machine [Wikipedia]
Energy Function
$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j h_j w_{i,j} v_i = -a^T v - b^T h - h^T W v$
($a_i$: offset of visible variable, $b_j$: offset of hidden variable, $w_{i,j}$: weight between $v_i$ and $h_j$)
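The energy function and the resulting probability can be evaluated by brute force on a tiny RBM. The sketch below assumes binary units, 2 visible and 1 hidden, with hypothetical parameter values. Enumerating every state to obtain $Z$ is feasible only at this toy size.

```python
import math
from itertools import product

def rbm_energy(v, h, a, b, W):
    """E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij h_j w_ij v_i, W[i][j] = w_ij."""
    visible = sum(ai * vi for ai, vi in zip(a, v))
    hidden = sum(bj * hj for bj, hj in zip(b, h))
    pair = sum(h[j] * W[i][j] * v[i]
               for i in range(len(v)) for j in range(len(h)))
    return -visible - hidden - pair

# Hypothetical parameters for a 2-visible, 1-hidden RBM.
a, b, W = [0.5, 0.5], [1.0], [[2.0], [3.0]]

energy = rbm_energy([1, 0], [1], a, b, W)

# Z sums exp(-E) over every joint state (v, h).
states = [(v, h) for v in product([0, 1], repeat=2)
          for h in product([0, 1], repeat=1)]
Z = sum(math.exp(-rbm_energy(v, h, a, b, W)) for v, h in states)
probs = {s: math.exp(-rbm_energy(s[0], s[1], a, b, W)) / Z for s in states}
```

States with lower energy receive higher probability, and the probabilities sum to 1 by construction.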
Goal
Find the weights that maximize $P(v) = \sum_h P(v, h)$ for the observed $v$.
$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j h_j w_{i,j} v_i = -a^T v - b^T h - h^T W v$
That is, the aim is to increase the weight $w_{i,j}$ wherever $h_j$ and $v_i$ are on at the same time.
Synapses that are activated together become connected.
Hebb’s Law (Hebbian Learning Rule)
http://www.skewsme.com/behavior.htm
http://lesswrong.com/lw/71x/a_crash_course_in_the_neuroscience_of_human/l
Training RBM
Find the weights that maximize $P(v) = \sum_h P(v, h)$ for the observed $v$.
Gradient Ascent
$\log P(v) = \log\left(\sum_h \frac{\exp(-E(v, h))}{Z}\right)$
$= \log\left(\sum_h \exp(-E(v, h))\right) - \log Z$
$= \log\left(\sum_h \exp(-E(v, h))\right) - \log\left(\sum_{v, h} \exp(-E(v, h))\right)$
$\frac{\partial \log P(v)}{\partial \theta} = -\frac{1}{\sum_h \exp(-E(v, h))} \sum_h \exp(-E(v, h)) \frac{\partial E(v, h)}{\partial \theta} + \frac{1}{\sum_{v, h} \exp(-E(v, h))} \sum_{v, h} \exp(-E(v, h)) \frac{\partial E(v, h)}{\partial \theta}$
$= -\sum_h p(h \mid v) \frac{\partial E(v, h)}{\partial \theta} + \sum_{v, h} p(h, v) \frac{\partial E(v, h)}{\partial \theta}$
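$P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$
$P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v)$
$p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{m} w_{i,j} v_i\right)$
$p(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j\right)$
($\sigma$: activation function)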
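Because the units within a layer are conditionally independent, each layer can be sampled in one pass. The sketch below assumes $\sigma$ is the logistic sigmoid and binary units. The parameter values and function names are hypothetical. One such v → h → v' pass is the building block that sampling-based training repeats.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_h_given_v(v, b, W):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i), with W[i][j] = w_ij."""
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]

def p_v_given_h(h, a, W):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def sample_binary(probs, rng):
    """Draw each unit independently as 1 with its given probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_step(v, a, b, W, rng):
    """One alternating step: sample h from p(h|v), then v' from p(v|h)."""
    h = sample_binary(p_h_given_v(v, b, W), rng)
    v_new = sample_binary(p_v_given_h(h, a, W), rng)
    return v_new, h

# Hypothetical 3-visible, 2-hidden RBM.
a, b = [0.0, 0.0, 0.0], [0.0, 0.0]
W = [[4.0, -4.0], [4.0, -4.0], [0.0, 0.0]]
hidden_probs = p_h_given_v([1, 1, 0], b, W)
v_new, h = gibbs_step([1, 1, 0], a, b, W, random.Random(0))
```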
$\frac{\partial \log P(v)}{\partial \theta} = -\sum_h p(h \mid v) \frac{\partial E(v, h)}{\partial \theta} + \sum_{v, h} p(h, v) \frac{\partial E(v, h)}{\partial \theta}$
Solved by sampling with a modified Gibbs sampler.
Figure. Contrastive Divergence(CD-k)[7]
Deep Belief Network[11, 12, 1]
1 Multiple RBM
2 Phoneme → Word → Grammar, Sentence
3 Generation is also possible!!!
http://www.cs.toronto.edu/~hinton/adi/index.htm
History 2nd Breakthrough: Supervised Learning
2nd Breakthrough: Supervised Learning
1 Vanishing gradient problem
    Solved by a new non-linear activation: rectified linear unit (ReLU)
2 Typically requires lots of labeled data
    Solved by big data & crowd sourcing
3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
    Solved by a new regularization method: dropout, dropconnect, etc.
4 Get stuck in local minima (?)
Rectified Linear Unit (ReLU)
Figure. The proposed non-linearity, ReLU, and the standard neural network non-linearity, logistic [30]
Advantages
1 For any input greater than 0 the gradient is constant at 1, so the gradient never shrinks.
2 Training is easy.
3 Removes the need for pre-training [20, 8].
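The first advantage is easy to verify numerically. The sketch below multiplies the ReLU derivative across ten layers for an active unit. As a simplification for illustration, it ignores the weight factors in the chain rule.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Product of activation derivatives along a 10-layer path of active units.
chain = 1.0
for _ in range(10):
    chain *= relu_grad(2.0)
```

The product stays at 1 however many layers are stacked, whereas a unit with a non-positive input passes no gradient at all.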
DropOut & DropConnect
Ensemble Model
DropOut: switches off some of the hidden units [14].
DropConnect: switches off some of the connections into the hidden units [28].
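The difference between the two lies in what the random mask covers. The sketch below is a simplified training-time forward pass for one fully connected ReLU layer. The layer layout (`W` holds one row of weights per hidden unit) and the function names are assumptions made here, and the rescaling used at test time is left out.

```python
import random

def dropout_layer(x, W, keep_prob, rng):
    """DropOut: compute each hidden unit, then zero whole units at random."""
    out = []
    for row in W:  # one row of incoming weights per hidden unit
        unit = max(0.0, sum(w * xi for w, xi in zip(row, x)))
        out.append(unit if rng.random() < keep_prob else 0.0)
    return out

def dropconnect_layer(x, W, keep_prob, rng):
    """DropConnect: zero individual connections (weights) at random."""
    out = []
    for row in W:
        z = sum(w * xi for w, xi in zip(row, x) if rng.random() < keep_prob)
        out.append(max(0.0, z))
    return out

x = [1.0, 2.0]
W = [[0.5, 0.5], [1.0, -1.0], [2.0, 0.0]]  # hypothetical weights
rng = random.Random(0)
full = dropout_layer(x, W, keep_prob=1.0, rng=rng)       # nothing dropped
half_units = dropout_layer(x, W, keep_prob=0.5, rng=rng)
half_links = dropconnect_layer(x, W, keep_prob=0.5, rng=rng)
```

Each random mask defines a different thinned network, which is why both methods are described as ensemble models.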
Figure. Description of DropOut & DropConnect[Wan]
Figure. Using the MNIST dataset: a) ability of Dropout and DropConnect to prevent overfitting as the size of the 2 fully connected layers increases; b) varying the drop-rate in a 400-400 network shows near-optimal performance around p = 0.5 [28]
Local Minima Issue
High dimension and non-convex optimization
1 The values at the local minima are probably all similar.
2 Local minima ≃ global minima.
3 With so many dimensions, a point is unlikely to be a local minimum along every dimension at once.
ConvNets: today
Local minima are all similar, there are long plateaus, it can take long to break symmetries.
Optimization is not the real problem when:
– dataset is large
– units do not saturate too much
– normalization layer
Figure. Local minima in high-dimensional, non-convex optimization (axes: loss vs. parameter) [Ranzato]
Others: Convolutional Neural Network
Sparse connectivity & shared weights: well suited to 2-D data [documentation]
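Both properties show up in a plain 2-D convolution. In the sketch below, each output value depends only on a small patch of the image (sparse connectivity), and the same kernel is reused at every position (shared weights). The image and kernel values are hypothetical, and the operation is written in the unflipped, cross-correlation form commonly used in CNNs.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(kernel[a][b] * image[r + a][c + b]
             for a in range(kh) for b in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]   # 4 shared weights

feature_map = conv2d_valid(image, kernel)

# A fully connected layer with the same input and output sizes would
# need one weight per (input pixel, output unit) pair.
shared_weights = len(kernel) * len(kernel[0])
dense_weights = (len(image) * len(image[0])) * (len(feature_map) * len(feature_map[0]))
```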
http://parse.ele.tue.nl/education/cluster0
http://eblearn.sourceforge.net/old/demos/mnist/index.shtml
http://yann.lecun.com/exdb/lenet/
Deep Learning Summary!!!
1 Research on artificial neural networks began with the perceptron in the 1950s and advanced in the 1980s, when the Error Backpropagation Algorithm made it possible to train the multilayer perceptron.
2 Vanishing gradients, the shortage of labeled data, overfitting, and the local minima issue were not resolved well, so neural network research stagnated until the early 2000s.
3 From 2006, unsupervised learning methods based on the Boltzmann machine were developed: Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), Deep Boltzmann Machine (DBM), Convolutional Deep Belief Network, and others.
4 Unlabeled data could now be used for pre-training, which overcame the limitations of the multilayer perceptron listed above.
5 From 2010, the active use of big data made large amounts of labeled data available, and the discovery of the rectified linear unit (ReLU), DropOut, DropConnect, and others addressed the vanishing gradient problem and the overfitting issue, making purely supervised learning possible.
6 There is a consensus that the local minima issue matters little in high-dimensional non-convex optimization.
Apply to Public Health
Apply to Public Health Epidemiology vs Machine Learning
Objective of statistics
1 Expanding knowledge, causal inference
    The statistician Pearson: to prove Darwin's theory of evolution..
2 Decision making
    The statistician R. A. Fisher: choosing the best-performing fertilizer
Statistics in Epidemiology
Causal inference: what is the cause?
The most interpretable model wins. Inferring causal relationships.
Simple models are preferred.
The units of the independent variables also matter (kilometer vs meter, centering issue)
β, Odds Ratio (OR), Hazard Ratio (HR), p-value, AIC
Statistics in Machine Learning
Prediction: what will happen next?
The model with the best predictive power wins.
A complex model is fine, as long as it predicts well and efficiently.
Independent variables are transformed freely as needed (scale change).
Y, p, Cross-validation, Accuracy, ROC curve
Example: Logistic regression
A powerful statistical method for binomial data.
It holds a dominant position, especially in epidemiologic studies.
β → Odds Ratio (OR): easy to interpret.
But..
Logit function... the reason computation gets hard.
Heritability issue of binomial trait?? The logit function is the culprit..
The probit model can be an alternative.
Easy to compute. β is hard to interpret..
Logit VS Probit
Figure. Logit VS Probit
Logit: $\Pr(Y = 1 \mid X) = [1 + e^{-X'\beta}]^{-1}$
Probit: $\Pr(Y = 1 \mid X) = \Phi(X'\beta)$
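The two link functions can be compared directly. The sketch below evaluates both on the same linear predictor $X'\beta$, writing the standard normal CDF $\Phi$ with `math.erf`. The function names are introduced here.

```python
import math

def logit_prob(eta):
    """Pr(Y = 1 | X) under the logit model, with eta = X'beta."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit_prob(eta):
    """Pr(Y = 1 | X) under the probit model: the standard normal CDF at eta."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def odds_ratio(beta):
    """In the logit model, a one-unit change in X multiplies the odds by exp(beta)."""
    return math.exp(beta)

at_zero = (logit_prob(0.0), probit_prob(0.0))
at_one = (logit_prob(1.0), probit_prob(1.0))
```

Both curves pass through 0.5 at zero, but for the same linear predictor the probit curve rises faster, so the coefficients of the two models are on different scales.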
Example2: Cox proportional hazard model
The standard for analyzing censored data.
http://www.theriac.org/DeskReference/viewDocument.php?id=188
http://www.uni-kiel.de/psychologie/rexrepos/posts/survivalCoxPH.html
Assumptions
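$\ln \lambda(t) = \ln \lambda_0(t) + \beta_1 X_1 + \cdots + \beta_p X_p = \ln \lambda_0(t) + X\beta$
$\lambda(t) = \lambda_0(t)\, e^{\beta_1 X_1 + \cdots + \beta_p X_p} = \lambda_0(t)\, e^{X\beta}$
$S(t) = S_0(t)^{\exp(X\beta)} = \exp\left(-\Lambda_0(t)\, e^{X\beta}\right)$
$\Lambda(t) = \Lambda_0(t)\, e^{X\beta}$
$\frac{\lambda(t)}{\lambda_0(t)} = e^{X\beta}$
$e^{\beta}$: Hazard Ratio (HR) for a one-unit increase in the covariate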
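The proportional hazards relations can be checked numerically at a single time point. The sketch below assumes one binary covariate with a hypothetical coefficient $\beta = \ln 2$ and hypothetical baseline values. It does not estimate anything from data.

```python
import math

def hazard(baseline_hazard, x, beta):
    """lambda(t) = lambda_0(t) * exp(X beta) at one time point."""
    return baseline_hazard * math.exp(sum(xi * bi for xi, bi in zip(x, beta)))

def survival(baseline_survival, x, beta):
    """S(t) = S_0(t) ** exp(X beta)."""
    return baseline_survival ** math.exp(sum(xi * bi for xi, bi in zip(x, beta)))

beta = [math.log(2.0)]  # hypothetical coefficient: exposure doubles the hazard

hr = hazard(0.01, [1], beta) / hazard(0.01, [0], beta)  # ratio is free of lambda_0
s_exposed = survival(0.9, [1], beta)                    # 0.9 squared

# Same survival via the cumulative hazard: exp(-Lambda_0(t) * exp(X beta)).
cumulative_baseline = -math.log(0.9)
s_from_cumhaz = math.exp(-cumulative_baseline * math.exp(beta[0]))
```

The baseline hazard cancels in the ratio, which is why the hazard ratio can be interpreted without specifying $\lambda_0(t)$.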
Hazard Ratio
Easy to interpret, on par with the Odds Ratio.
But it relies on many assumptions.
The formula is complex, so computation is hard.
Conditional Logistic Regression..
There is no need to insist on Cox for prediction.
Alternatives
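$Y_i$: Time of event
Not censored
$p(y_i \mid \mu_i, \sigma^2) = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right\}$
Censored
$p(y_i \ge t_i \mid \mu_i, \sigma^2) = \int_{t_i}^{\infty} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right\} dy_i = \Phi\left(\frac{\mu_i - t_i}{\sigma}\right)$
Expressed simply with the normal CDF → easy to compute!!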
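The two cases combine into a single log-likelihood. In the sketch below, an observed event contributes the normal log-density and a censored observation contributes $\log \Phi((\mu - t)/\sigma)$. It assumes one shared $\mu$ and $\sigma$ for all subjects, which is a simplification of the per-subject $\mu_i$ above, and the data are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_normal_loglik(times, censored, mu, sigma):
    """Sum the log-density for observed events and log Phi((mu - t)/sigma) for censored ones."""
    total = 0.0
    for t, is_censored in zip(times, censored):
        if is_censored:
            total += math.log(normal_cdf((mu - t) / sigma))
        else:
            total += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                      - (t - mu) ** 2 / (2.0 * sigma ** 2))
    return total

# Hypothetical data: one event at t = 5 and one subject censored at t = 5.
loglik = censored_normal_loglik([5.0, 5.0], [False, True], mu=5.0, sigma=1.0)
```

No integral has to be evaluated numerically, since the censored term is a single call to the normal CDF.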
Example3: Correlation Structure
Should the correlation structure be considered?
1 Epidemiology: Important
    The s.e. of β changes. → The p-value changes.
2 Prediction model: Not important
    β itself does not change much. → Y, p hardly change.
    Correlation structure: unmeasured effect → what was not measured cannot be used when predicting on new data.
Figure. A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases [16]
Ted Chiang
Figure. Page image of Ted Chiang, "Catching crumbs from the table: In the face of metahuman science, humans have become metascientists." Nature, Vol. 405, 1 June 2000, p. 517. © 2000 Macmillan Magazines Ltd [4]
Human VS metahuman[4]
Ted Chiang: science fiction writer
The overwhelming knowledge-processing ability of metahumans (artificial intelligence).
Human science: reduced to interpreting what metahumans have discovered.
Translating metahuman papers is what human science amounts to..
Apply to Public Health Deep Learning vs Other ML
Deep Learning vs Other ML
Multiple Hidden Layer: High flexibility
Massive Parallel Computing
Programming language for GPU/parallel computing
CUDA(Compute Unified Device Architecture), OpenCL[21, 26]
Examples: Cat recognition
16,000 CPU cores
Recognizes cats from images alone (Unsupervised Learning)
Computing time reduced by using GPUs.
http://www.asiae.co.kr/news/view.htm?idxno=2012062708351993171
http://googleblog.blogspot.kr/2012/06/using-large-scale-brain-simulations-for.html
Paper[18, 5]
Building High-level Features Using Large Scale Unsupervised Learning
Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. [18]
Abstract
We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.

Deep learning with COTS HPC systems
Coates, A., Huval, B., Wang, T., Wu, D. J., and Ng, A. Y. (Stanford University Computer Science Dept.); Catanzaro, B. (NVIDIA Corporation). Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. [5]
Abstract
Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have relied on cloud-like computing infrastructure and thousands of CPU cores. In this paper, we present technical details and results from our own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Our system is able to train 1 billion parameter networks on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using just 16 machines. As this infrastructure is much more easily marshaled by others, the approach enables much wider-spread research with extremely large neural networks.
Apply to Public Health Hypothesis Testing vs Hypothesis Generating
Hypothesis Testing vs Hypothesis Generating
Figure. Hypothesis-testing and Hypothesis-generating paradigms[3]
Conclusion
Deep learning is the core of mobile health.
Mobile data: unstructured data such as images, audio, and text.
A parallel computing system needs to be built.
Prediction vs Inference
Understanding the concepts of machine learning
Hypothesis Generating
Paradigm shift: Causal inference → Big data & Prediction
Reference I
[1] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.
[2] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
[3] Biesecker, L. G. (2013). Hypothesis-generating research and predictive medicine. Genome research, 23(7):1051–1053.
[4] Chiang, T. (2000). Catching crumbs from the table. Nature, 405(6786):517–517.
[5] Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Ng, A. (2013). Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning, pages 1337–1345.
[documentation] DeepLearning documentation. Convolutional neural networks (LeNet). http://deeplearning.net/tutorial/lenet.html.
[7] Fischer, A. and Igel, C. (2012). An introduction to restricted Boltzmann machines. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 14–36. Springer.
[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP, volume 15, pages 315–323.
[Han-Hsing] Han-Hsing, T. [ml, python] gradient descent algorithm (revision 2). http://hhtucode.blogspot.kr/2013/04/ml-gradient-descent-algorithm.html.
[Hinton] Hinton, G. Coursera: Neural networks for machine learning. https://class.coursera.org/neuralnets-2012-001.
[11] Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
[12] Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5):5947.
[13] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
[14] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Reference II
[Honkela] Honkela, A. Multilayer perceptrons. https://www.hiit.fi/u/ahonkela/dippa/node41.html.
[16] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning. Springer.
[Kim] Kim, J. 2014 Summer School on Pattern Recognition and Machine Learning. http://prml.yonsei.ac.kr/.
[18] Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE.
[19] Maltarollo, V. G., Honorio, K. M., and da Silva, A. B. F. (2013). Applications of artificial neural networks in chemical problems.
[20] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
[21] Nvidia, C. (2007). Compute unified device architecture programming guide.
[Ranzato] Ranzato, M. Deep learning for vision: Tricks of the trade. www.cs.toronto.edu/~ranzato.
[23] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.
[24] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1985). Learning internal representations by error propagation. Technical report, DTIC Document.
[25] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory.
[26] Stone, J. E., Gohara, D., and Shi, G. (2010). OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66.
[Wan] Wan, L. Regularization of neural networks using dropconnect. http://cs.nyu.edu/~wanli/dropc/.
[28] Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066.
[Wikipedia] Wikipedia. Restricted Boltzmann machine. http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine.
[30] Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q. V., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., et al. (2013). On rectified linear units for speech processing. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 3517–3521. IEEE.