Data Science 101 - presentation · 2019-09-15

Post on 29-May-2020

Data Science 101

Arik Pelkey, Pentaho Senior Director – Product Marketing, Hitachi Vantara

Scott Cooley, Pentaho Data Scientist, Hitachi Vantara

Agenda

This session will provide an introduction to data science fundamentals.

• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future

AI, Machine Learning, and Deep Learning

Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.

• AI: Getting machines to do what humans are good at
• Machine Learning: Feeding an algorithm data to learn and predict something
• Deep Learning: A type of machine learning

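Data Science: Solving Problems with Data

Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.

[Venn diagram of three overlapping circles]

• HACKING SKILLS: Computer science, data engineering and wrangling, coding
• MATH AND STATISTICS KNOWLEDGE: Understanding of the underlying assumptions; algorithms and numerical techniques to derive insights
• SUBSTANTIVE EXPERIENCE: Domain knowledge, business acumen, experience, value to the business

Overlaps:

• Hacking Skills + Math and Statistics Knowledge: Machine Learning
• Math and Statistics Knowledge + Substantive Experience: Traditional Research
• Hacking Skills + Substantive Experience: Danger Zone!
• All three: DATA SCIENCE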

What's all the fuss? This stuff was created many, many years ago.

• Bayes' Theorem: Thomas Bayes, mid-1700s
• Regression: Legendre and Gauss, early 1800s; Galton, late 1800s
• Neural Networks: McCulloch and Pitts, early 1940s

Think about All Our Data and Compute

https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/.

SKA - 2020 (Square Kilometer Array Telescope)

Will generate as much data in a day as the entire PLANET does in a year!

It is still GROWING!

Regression – Looking for a statistical relationship across variables that may give us an estimate of a particular outcome.

Classification – Similar to regression, but looking for separations in the data given predefined classes. (Supervised)

Clustering – No predefined classes; instead, trying to find groups or sets based upon the data at hand. (Unsupervised)

Anomaly Detection – Identification of outliers based upon expected ranges of data.
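As a rough illustration of these four task types, here is a minimal Python sketch that uses only the standard library. The function names and the toy numbers are assumptions made for this example, not part of the presentation, and each function is just one simple instance of its task type rather than a recommended algorithm.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y ~ slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def nearest_centroid(train, point):
    """Classification: pick the predefined class whose mean value is closest."""
    centroids = {label: mean(values) for label, values in train.items()}
    return min(centroids, key=lambda label: abs(centroids[label] - point))

def two_means(values, rounds=10):
    """Clustering: split unlabelled 1-D values into two groups (k-means, k=2)."""
    lo, hi = min(values), max(values)  # initial cluster centres
    left, right = list(values), []
    for _ in range(rounds):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not right:  # all values identical: nothing to split
            break
        lo, hi = mean(left), mean(right)
    return left, right

def outliers(values, z=2.0):
    """Anomaly detection: flag values more than z standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > z * sigma]

# Toy data, purely illustrative.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
label = nearest_centroid({"small": [1, 2, 3], "large": [10, 11, 12]}, 4)
groups = two_means([1, 2, 3, 10, 11, 12])
flagged = outliers([10, 11, 9, 10, 12, 10, 50])
```

Note the difference in inputs: `nearest_centroid` needs labelled examples (supervised), while `two_means` receives only raw values and has to discover the groups itself (unsupervised).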

Types of Machine Learning

Labelled vs Unlabelled

Let's say we want to classify houses by size.

Given Features or Feature Set, plus Label (Size):

Full Bath   Half Bath   Bedrooms   Home Age   Size (Label)
1           0           2          56         M
1           1           3          59         L
2           1           3          20         M
2           1           3          19         S

Unsupervised

SIZE is missing! We need to look for similarities in the data and group them into clusters.

Supervised Learning

Use the labels to build a model. The model is then used to classify the size of a new house based ONLY on the known feature set.
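To make the supervised case concrete, here is a minimal sketch that classifies a new house with a 1-nearest-neighbour rule over the four rows of the slide's table. The choice of nearest-neighbour, the name `classify_size`, and the new house's feature values are assumptions for illustration; the slides do not name an algorithm. It also assumes the labels M, L, M, S line up with the rows in the order they appear.

```python
from math import dist

# Rows from the slide's table: (full baths, half baths, bedrooms, home age) -> size.
# Assumption: the labels M, L, M, S match the rows in the order extracted.
TRAINING = [
    ((1, 0, 2, 56), "M"),
    ((1, 1, 3, 59), "L"),
    ((2, 1, 3, 20), "M"),
    ((2, 1, 3, 19), "S"),
]

def classify_size(features, training=TRAINING):
    """Return the label of the training house closest to `features`."""
    _, label = min(training, key=lambda row: dist(row[0], features))
    return label

# A hypothetical new house whose size is unknown.
predicted = classify_size((1, 0, 2, 50))
```

Because the features are not rescaled, home age dominates the distance here; a real model would normally normalize the features first and train on far more than four rows.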

More on Machine Learning

Machine Learning is a methodology to create a model based on sample data and use the model to make a prediction or strategy using a more algorithmic approach.

Supervised learning model:

• Historical records that contain square feet, number of bathrooms, zip code….
• Records that contain the price the house sold for
• Iterate the algorithm over the combined data to train the model
• Use the trained model to predict the outcome on new records
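The train-then-predict loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made-up numbers, the model is a single-feature linear fit (price from square feet), and gradient descent stands in for "iterate the algorithm"; the presentation does not prescribe any of these.

```python
# Hypothetical historical records: (square feet, sale price). Not from the slides.
RECORDS = [(1000, 200_000), (1500, 290_000), (2000, 410_000), (2500, 500_000)]

def train(records, epochs=5000, lr=0.1):
    """Fit price ~ w * sqft + b by iterating gradient descent over the records."""
    # Rescale both columns to at most 1.0 so one learning rate suits both parameters.
    x_scale = max(x for x, _ in records)
    y_scale = max(y for _, y in records)
    data = [(x / x_scale, y / y_scale) for x, y in records]
    w = b = 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in data) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in data) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    # Return a predictor that accepts and returns values in the original units.
    return lambda sqft: (w * (sqft / x_scale) + b) * y_scale

predict = train(RECORDS)      # train on the labelled historical data
estimate = predict(1800)      # predict the outcome for a new record
```

The sale price plays the role of the label, and the trained `predict` function only ever sees the features of a new record.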

The Data Science Process: Getting from Raw Data to Outcomes

The Data Science Workflow: created by Joe Blitzstein and Hanspeter Pfister for the Harvard Data Science course.

Formal Framework: CRISP-DM (Cross Industry Standard Process for Data Mining)

Specialist Traditional Data Science Team

• Data Scientist (DS) – Prepares data and engineers features. Most valuable skill: training models.
• Data Engineer (DE) – Data acquisition focus. Builds data pipelines. It is not uncommon to have a 5:1 ratio of DE:DS.
• Data Analyst (DA) – Assists the DS with data prep.
• Application Architect (AA) – Designs the complete solution; deploys and maintains models in production.

Mythical Creatures

Trends

• Automation
• Tools for Citizen Data Scientists
• Pre-trained models in the cloud

Hiring Guidance

Defining Success

• Easy for the tangible
  – Search order optimization
  – Recommendation engine or CTR
• Hard for others
  – Lead scoring
  – Attrition
• Try to measure direct outcomes
• Rarely a silver bullet
• Think ROI
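For the tangible cases, the measurement itself is simple arithmetic. The sketch below uses the standard definitions of click-through rate and return on investment; the function names and all numbers are hypothetical and only illustrate how a direct outcome might be compared before and after a model is introduced.

```python
def click_through_rate(clicks, impressions):
    """CTR = clicks / impressions, as a fraction between 0 and 1."""
    return clicks / impressions if impressions else 0.0

def roi(gain, cost):
    """ROI = (gain - cost) / cost."""
    return (gain - cost) / cost

# Hypothetical comparison: illustrative numbers only.
baseline = click_through_rate(clicks=120, impressions=10_000)
with_model = click_through_rate(clicks=150, impressions=10_000)
uplift = (with_model - baseline) / baseline   # relative improvement in CTR

project_roi = roi(gain=150_000, cost=100_000)
```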

Typical Data Science Project

[Process diagram: each step is tagged with the roles involved (DS, DE, DA, AA)]

• Understand business objectives
• ID and procure training data
• Prepare data and build new features
• Train model
• Deploy models
• Update models

Preventive Maintenance: Caterpillar Marine Asset Intelligence

[Architecture diagram; components shown:]

• Data sources: local equipment sensor and server data; fleet data via satellite; cross-department operations data (scheduling/ERP)
• Data Integration (appears twice in the diagram)
• Data Marts
• Business User (COO): reporting on operations and efficiency
• Dashboards and reports on machine performance (onboard and onshore)
• Data Scientist: data mining and predictive maintenance

The Future

• Scaling up/enabling more data scientists
• Model management
• Improved productivity
• Support for containerized applications

Pentaho ML Orchestration

• Makes data science teams more productive
• Broad support for open source libraries in various languages

Summary

• What is Data Science
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future

Next Steps

Want to learn more?

• Schedule a Meet the Expert
• Read Mark Hall's Machine Learning with Pentaho Blog
