H2O World - Welcome to H2O World with Arno Candel

Post on 16-Apr-2017


Welcome to H2O World

Sri & H2O Team

Data Science is a Team Sport!

Culture Matters!

Open Source Breeds Courage!

Community Matters!

Every generation needs to make its own history!

Code is conversation with Customer!

Great Product Matters!

Accuracy with Speed and Scale

HDFS

S3

SQL

NoSQL

CLASSIFICATION, REGRESSION

FEATURE ENGINEERING

IN-MEMORY

MAP REDUCE / FORK JOIN

COLUMNAR COMPRESSION

DEEP LEARNING

PCA, GLM, COX

RANDOM FOREST / GBM ENSEMBLES

FAST MODELING ENGINE

STREAMING

NANO FAST JAVA SCORING ENGINES

MATRIX FACTORIZATION, CLUSTERING

MUNGING

What's New in H2O-3

H2O-3 vs H2O-2:
• Total rewrite of the core in Java: built for data scientists AND developers!
• Unique Flow GUI (Notebook and more)
• REST Schemas for self-describing API for all methods/algos
• New R client: cleaner, faster
• Sparkling Water: H2O is the Killer App on Spark
• Fully featured Python client (incl. Pipelines, scikit-learn look & feel)
• New expression parser & backend execution engine for R, Py, Flow
• New Algo: GLRM - Generalized Low Rank Modeling (unifies PCA, K-Means, Matrix Factorization, Imputation, etc.)
• New Solvers for GLM: Coordinate Descent and L-BFGS

continued…

What's New in H2O-3

Additional New Features:
• Grid Search for all Algorithms (R/Py/Flow)
• N-fold Cross-Validation for all Algorithms
• Early Stopping (check for convergence) for GBM/DRF/DL
• Stochastic GBM (row/col sampling)
• Distributions (Gaussian, Laplace, Poisson, Gamma, Tweedie) for GBM/DL
• Improved sparse data handling for DL
• Multi-node auto-tuning for DL
• Multinomial GLM
• Scalable Scatter Plots for numeric and categorical data
• Big-Big Joins ("distributed data.table") - in QA

…and many more!

Convergence-Based Early Stopping in H2O

Before: trains too long, but at least overwrite_with_best_model=true prevents overfitting (returns the model with lowest validation error)

Now: specify additional convergence criterion: E.g. stopping_rounds=5, stopping_metric="MSE", stopping_tolerance=1e-3, to stop as soon as the moving average (length 5) of the validation MSE does not improve by at least 0.1% for 5 consecutive scoring events
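The stopping rule described above can be sketched in plain Python. This is a minimal illustration of the rule as worded on the slide, not H2O's actual implementation: the function names (moving_averages, should_stop, first_stopping_event) are hypothetical, the metric is assumed to be "lower is better" (as with MSE), and the choice to compare against the best moving average seen before the last stopping_rounds scoring events is an assumption.

```python
def moving_averages(scores, window):
    """Simple moving averages of `scores` over `window` consecutive entries."""
    return [sum(scores[i - window:i]) / window
            for i in range(window, len(scores) + 1)]


def should_stop(scores, stopping_rounds=5, stopping_tolerance=1e-3):
    """Decide whether training should stop, given validation scores so far.

    `scores` holds one validation metric value per scoring event (lower is
    better). Assumption: we stop when none of the last `stopping_rounds`
    moving averages beats the best earlier moving average by a relative
    margin of at least `stopping_tolerance`.
    """
    # Need enough scoring events for a reference average plus
    # `stopping_rounds` later averages to compare against it.
    if stopping_rounds <= 0 or len(scores) < 2 * stopping_rounds:
        return False
    avgs = moving_averages(scores, stopping_rounds)
    reference = min(avgs[:-stopping_rounds])
    threshold = reference * (1.0 - stopping_tolerance)
    return all(avg > threshold for avg in avgs[-stopping_rounds:])


def first_stopping_event(scores, stopping_rounds=5, stopping_tolerance=1e-3):
    """Return the 1-based scoring event at which the rule first fires, or None."""
    for n in range(1, len(scores) + 1):
        if should_stop(scores[:n], stopping_rounds, stopping_tolerance):
            return n
    return None


if __name__ == "__main__":
    # Made-up validation MSE values: fast initial progress, then a plateau.
    history = [1.0, 0.5, 0.4] + [0.3] * 12
    print(first_stopping_event(history))
```

With stopping_tolerance=1e-3, a moving average must drop below 99.9% of the reference to count as an improvement, which matches the 0.1% figure on the slide.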

[Figure: training error and validation error vs. training time / epochs, with labels "Best Model" and overwrite_with_best_model=true]

Use Flow to inspect the model

Early stopping saves tons of time

Deep Learning with Higgs data

What do these stickers mean?

I have H2O Installed

I have Python installed

I have R installed

I have the H2O World data sets

Pick up stickers or get install help at the information booth
