Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. Large-Scale Data Analytics and Its Relationship to Simulation CMSE Frontiers in Data Science and Computing Workshop Michigan State University October 4, 2016 Rob Leland Vice President, Science & Technology Chief Technology Officer Sandia National Laboratories SAND2016-9893


Page 1: Large-Scale Data Analytics and Its Relationship to … · § Data Analytics = Discovering meaningful patterns in data § Large Scale = Requiring leading-edge processing and storage

Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Large-Scale Data Analytics and Its Relationship to Simulation

CMSE Frontiers in Data Science and Computing Workshop
Michigan State University
October 4, 2016

Rob Leland
Vice President, Science & Technology
Chief Technology Officer
Sandia National Laboratories

SAND2016-9893

Page 2:

Outline

§ Some necessary background
§ A charge from the National Strategic Computing Initiative
§ Answers to three key questions
  § Why is increasing coherence between simulation and analytics important?
  § What is really meant by “increasing coherence” between the two?
  § How might coherence be furthered in practice?
§ A unifying vision

Page 3:

Terms and context

§ Simulation
  § Computations to understand physical phenomena or conduct engineering
§ Large Scale Data Analytics (LSDA)
  § Data Analytics = Discovering meaningful patterns in data
  § Large Scale = Requiring leading-edge processing and storage capabilities
§ LSDA is increasing in importance
  § Pervasive
    § Commerce, finance, healthcare, science, engineering, national security, ...
  § Lasting societal significance
    § Internet search, genomics, climate modeling, Higgs particle, ...
§ LSDA is getting “harder”
  § Captured data growing exponentially with time
  § Individual analysis becoming more sophisticated
  § More people examining more data more frequently
  § Aggregate work growing much faster than Moore’s Law

[Figure: The Economist]

Page 4:

National Strategic Computing Initiative (NSCI)

Page 5:

NSCI Strategic Objectives

§ (1) Accelerating delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 100 times the performance of current 10 petaflop systems across a range of applications representing government needs.

§ (2) Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing.

§ (3) Establishing, over the next 15 years, a viable path forward for future HPC systems even after the limits of current semiconductor technology are reached (the "post-Moore's Law era").

§ (4) Increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.

§ (5) Developing an enduring public-private collaboration to ensure that the benefits of the research and development advances are, to the greatest extent, shared between the United States Government and industrial and academic sectors.
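
As a quick check on objective (1), the arithmetic behind "approximately 100 times the performance of current 10 petaflop systems" can be written out. This assumes the standard prefixes peta = 10^15 and exa = 10^18. The symbol P_target is introduced here only for illustration and does not appear in the slides.

```latex
\[
P_{\text{target}} \approx 100 \times 10\ \text{PFLOP/s}
  = 10^{2} \times 10^{16}\ \text{FLOP/s}
  = 10^{18}\ \text{FLOP/s}
  = 1\ \text{EFLOP/s}
\]
```

So the stated target is roughly one exaflop, which matches the "exascale" label in the objective.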

Page 6:

Q1: Why is increasing coherence between simulation and analytics important?

§ For simulation
  § HPC simulation must ride on some commodity curve
  § Larger market forces behind analytics
  § Can exploit commodity component technology from analytics
§ For analytics
  § Large Scale Data Analytics problems becoming ever more sophisticated
  § Requiring more coupled methods
  § Can exploit architectural lessons from HPC simulation
§ For both: Integration of simulation and analytics in the same workflow
  § Automation of analysis of data from simulation
  § Creation of synthetic data via simulation to augment analysis
  § Automated generation and testing of hypotheses
  § Exploration of new scientific and technical scenarios
  § ...

Mutual inspiration, technical synergy, and economies of scale in the creation, deployment, and use of HPC resources

Page 7:

A challenge because simulation and analytics differ in many respects…

Page 8:

Data structures describing simulation and analytics differ

Graphs from simulations may be irregular, but have more locality than those derived from analytics

Computational simulation of physical phenomena: climate modeling, car crash

Large Scale Data Analytics: Internet connectivity, yeast protein interactions

Figures from Leland et al. courtesy of Yelick, LBNL.

Page 9:

Computation and communication patterns differ

Black = time spent computing
Green = time spent communicating
White = time spent waiting for data to be communicated

Figure panels:
§ The Erdős-Rényi graph, a well-studied example in graph theory work.
§ A scale-free graph, an example more reflective of real-world networks.
§ The U.S. road map, which has spatial locality and is thus most similar of the three in structure to computational patterns that would arise in typical physical simulations.

Figure from Leland et al. courtesy of Johnson, PNNL.
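
The contrast between the three graphs can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration, not the method behind the figure: a square lattice stands in for the road map, a simple preferential-attachment process stands in for "a scale-free graph", and nodes are split naively into contiguous blocks of ids, as if each block lived on one processor. The "cut fraction" (share of edges that cross blocks) is a rough proxy for how much communication a computation on the graph would need.

```python
import random

def grid_graph(rows, cols):
    """Edges of a rows x cols lattice, a stand-in for a road-map-like mesh."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:
                edges.append((v, v + 1))      # neighbor to the right
            if r + 1 < rows:
                edges.append((v, v + cols))   # neighbor below
    return edges

def erdos_renyi_graph(n, m, rng):
    """G(n, m) variant: m distinct edges chosen uniformly at random."""
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return sorted(edges)

def preferential_attachment_graph(n, k, rng):
    """Each new node links to k existing nodes, chosen in proportion to degree."""
    # Start from a small clique so every node has nonzero degree.
    edges = [(u, v) for u in range(k + 1) for v in range(u + 1, k + 1)]
    endpoints = [x for e in edges for x in e]  # one entry per incident edge
    for v in range(k + 1, n):
        targets = set()
        while len(targets) < k:
            targets.add(rng.choice(endpoints))
        for u in targets:
            edges.append((u, v))
            endpoints.extend((u, v))
    return edges

def max_degree(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return max(deg)

def cut_fraction(n, edges, parts=4):
    """Fraction of edges whose endpoints fall in different contiguous id blocks."""
    block = n // parts
    crossing = sum(1 for u, v in edges if u // block != v // block)
    return crossing / len(edges)

rng = random.Random(0)
n = 32 * 32
graphs = {
    "grid (road-map-like)": grid_graph(32, 32),
    "Erdos-Renyi": erdos_renyi_graph(n, 1984, rng),
    "preferential attachment": preferential_attachment_graph(n, 2, rng),
}
for name, edges in graphs.items():
    print(f"{name:24s} max degree {max_degree(n, edges):3d}  "
          f"cut fraction {cut_fraction(n, edges):.2f}")
```

Under these assumptions the lattice keeps almost all edges inside a block, while the two random graphs send most edges across blocks, and the preferential-attachment graph additionally develops high-degree hubs. A real partitioner would do better than contiguous id blocks, so the numbers are indicative only.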

Page 10:

Memory performance demands differ

A key differentiator in the performance of simulation and analytics

Area of the circle = relative data intensiveness (i.e. total amount of unique data accessed over a fixed interval of instructions)

Standard benchmarks include:
• LINPACK (smallest data intensiveness; barely visible on graph)
• STREAM
• SPECfp
• SPECint

Figure group labels: Simulation, Analytics

Figure from Murphy & Kogge with adjustment to double radius of Linpack data point to make it visible.
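
The definition of data intensiveness above (unique data accessed over a fixed interval) can be made concrete with a short Python sketch. It rests on assumptions that are not from the slides: memory accesses are used as a proxy for instructions, data is counted in 64-byte blocks, and both traces are synthetic. The function name `unique_bytes_per_interval` is hypothetical, and the exact metric used by Murphy & Kogge may differ.

```python
import random

def unique_bytes_per_interval(addresses, interval, block_bytes=64):
    """Mean number of unique bytes touched per window of `interval` accesses.

    Addresses are grouped into blocks of `block_bytes` (a cache-line-sized
    granularity chosen for illustration). Only full windows are counted, so
    the trace must hold at least `interval` addresses.
    """
    totals = []
    for start in range(0, len(addresses) - interval + 1, interval):
        window = addresses[start:start + interval]
        blocks = {addr // block_bytes for addr in window}
        totals.append(len(blocks) * block_bytes)
    return sum(totals) / len(totals)

# Hypothetical traces: byte addresses of successive 8-byte loads.
# "dense" reuses a 512-element array; "sparse" scatters lookups over 1 GiB.
dense = [(i % 512) * 8 for i in range(4096)]
rng = random.Random(0)
sparse = [rng.randrange(1 << 27) * 8 for _ in range(4096)]

print("dense :", unique_bytes_per_interval(dense, 1024))
print("sparse:", unique_bytes_per_interval(sparse, 1024))
```

The dense trace touches the same small working set in every window, while the scattered trace touches a fresh block on nearly every access. That is the kind of gap the circle areas in the figure are meant to convey.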

Page 11:

Application code characteristics differ

Contrasting properties:

Application code property | Simulation | Analytics
Spatial locality | High | Low
Temporal locality | Moderate | Low
Memory footprint | Moderate | High
Computation type | May be floating-point dominated* | Integer intensive
Input-output orientation | Output dominated | Input dominated

*Increasingly, simulation work has become less floating-point dominated
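
To give the spatial and temporal locality rows a concrete reading, here is a minimal Python sketch of two illustrative scores. Both are assumptions for this example rather than the measures used in the talk: spatial locality is the share of consecutive accesses within 64 bytes of each other, and temporal locality is the share of accesses that repeat an address seen within the previous 256 accesses. The traces are synthetic: a three-point stencil sweep stands in for simulation and random table lookups stand in for analytics.

```python
import random
from collections import deque

def spatial_locality(addresses, max_stride=64):
    """Share of consecutive access pairs no more than max_stride bytes apart."""
    pairs = zip(addresses, addresses[1:])
    near = sum(1 for a, b in pairs if abs(b - a) <= max_stride)
    return near / (len(addresses) - 1)

def temporal_locality(addresses, window=256):
    """Share of accesses whose address appeared in the previous `window` accesses."""
    recent = deque(maxlen=window)   # oldest entries drop off automatically
    hits = 0
    for a in addresses:
        if a in recent:
            hits += 1
        recent.append(a)
    return hits / len(addresses)

def stencil_trace(n_elems=1000, sweeps=4, elem_bytes=8):
    """Addresses read by repeated three-point stencil sweeps over one array."""
    trace = []
    for _ in range(sweeps):
        for i in range(1, n_elems - 1):
            trace.extend(((i - 1) * elem_bytes, i * elem_bytes, (i + 1) * elem_bytes))
    return trace

def random_trace(length, rng, elem_bytes=8):
    """Addresses of uniformly random lookups in a large table."""
    return [rng.randrange(1 << 27) * elem_bytes for _ in range(length)]

stencil = stencil_trace()
lookups = random_trace(len(stencil), random.Random(0))
for name, trace in (("stencil", stencil), ("random lookups", lookups)):
    print(f"{name:15s} spatial {spatial_locality(trace):.2f}  "
          f"temporal {temporal_locality(trace):.2f}")
```

With these choices the stencil trace scores high on spatial locality and moderate on temporal locality, and the random lookups score near zero on both, which mirrors the High/Moderate versus Low/Low entries in the table. The thresholds (64 bytes, 256 accesses) are arbitrary and would change the exact scores.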

Page 12:

Q2: So what is meant by “increasing coherence” between simulation and analytics?

§ NOT one system ostensibly optimized for both simulation and analytics
§ Greater commonality in underlying componentry and design principles
§ Greater interoperability, allowing interleaving of both types of computations

…A more common hardware and software roadmap between simulation and analytics

Page 13:

And yet, there is hope…

Page 14:

Simulation and analytics are evolving to become more similar in their architectural needs

§ Current challenges for the LSDA community (…similar to HPC simulation)
  § Data movement
  § Power consumption
  § Memory/interconnect bandwidth
  § Scaling efficiency
§ Instruction mix for Sandia’s HPC engineering codes (…similar to LSDA)
  § Memory operations 40%
  § Integer operations 40%
  § Floating point 10%
  § Other 10%
§ Common design impacts of energy cost trends
  § Increased concurrency (processing threads, cores, memory depth)
  § Increased complexity and burden on
    § system software, languages, tools, runtime support, codes

Page 15:

Energy cost of moving data is becoming dominant

[Figure: Energy cost for various common operations. Axis labels: energy cost, in picojoules (pJ), per 64-bit floating-point operation; cost estimates for technology year.]

From Dan McMorrow, Technical Challenges of Exascale Computing, JSR-12-310, JASON, MITRE Corporation, April 2013.

Page 16:

Emerging architectural and system software synergies

Similar needs:

Architectural characteristic | Simulation | Analytics
Computation | Memory address generation dominated | Same
Primary memory | Low power, high bandwidth, semi-random access | Same
Secondary memory | Emerging technologies may offset cost, allowing much more memory | …require extremely large memory spaces
Storage | Integration of another layer of memory hierarchy to support checkpoint/restart | …to support out-of-core data set access
Interconnect technology | High bisection bandwidth (for relatively coarse-grained access) | …(for fine-grained access)
System software (node-level) | Low dependence on system services, increasingly adaptive, resource management for structured parallelism | …highly adaptive, resource management for unstructured parallelism
System software (system-level) | Increasingly irregular workflows | Irregular workflows

Page 17:

Q3: How might coherence be furthered in practice?

§ Making it an element of national strategy
  § Check via the NSCI
§ Building this into exascale computing efforts
  § Also a component of the NSCI
§ Communicating with and enlisting the technical communities concerned
  § This forum and similar events
§ Further developing the vision
  § Today’s dialogue session!

Page 18:

A unifying vision for simulation and analytics

From The Fourth Paradigm: Data-Intensive Scientific Discovery by Jim Gray

Data analysis complements theory, experiment, and computation

Page 19:

Acknowledgements

Page 20:

Additional references

§ The Economist, “Data, Data, Everywhere,” Feb 25th, 2010.

§ R. C. Murphy and P. M. Kogge, “On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications,” IEEE Transactions on Computers 56 (7, July 2007): 937–945.

§ R. Murphy, “Power Issues,” presentation to JASON 2012, June 2012.

§ Peter Kogge (editor) et al., ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. DARPA, 2008.

§ Dan McMorrow, Technical Challenges of Exascale Computing, JSR-12-310, JASON, MITRE Corporation, April 2013.

§ Tony Hey, Stewart Tansley, and Kristin Tolle (editors), The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.

§ Jim Gray, The Fourth Paradigm: Data-Intensive Scientific Discovery

Page 21:

Suggested questions for breakout dialogue

§ Why would increasing the coherence between the technology base used for simulation and that for analytics bring value in the context of your work?

§ What research and development would best support development of a more common component roadmap and design principles bridging simulation and analytics?

§ How would this research be best organized?

Page 22:

Supplementary Material

Page 23:

Graph matching example of data analytics

A key analytic primitive, used to find a specific instance of an abstract pattern of interest

From Coffman, Greenblatt, and Marcus, Graph-Based Technologies for Intelligence Analysis, Communications of the ACM, 47, March 2004.
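
Finding "a specific instance of an abstract pattern" can be sketched as labeled subgraph matching. The Python below is a minimal backtracking search written for illustration. It is not the method from the cited article, and the graph, labels, and function names (`undirected`, `find_match`) are all hypothetical. The pattern is "two people linked to the same account", and the search looks for concrete nodes that fit it.

```python
def undirected(edges):
    """Build an adjacency dict {node: set of neighbors} from undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def find_match(p_adj, p_label, g_adj, g_label):
    """Return one mapping pattern node -> graph node, or None if none exists.

    A mapping is valid when labels agree, no graph node is used twice, and
    every pattern edge lands on a graph edge (extra graph edges are allowed).
    """
    order = list(p_adj)  # pattern nodes in the order they will be assigned

    def extend(assign):
        if len(assign) == len(order):
            return dict(assign)
        p = order[len(assign)]
        used = set(assign.values())
        for g in g_adj:
            if g in used or g_label[g] != p_label[p]:
                continue
            # Already-placed pattern neighbors must map to graph neighbors of g.
            if all(assign[q] in g_adj[g] for q in p_adj[p] if q in assign):
                assign[p] = g
                found = extend(assign)
                if found is not None:
                    return found
                del assign[p]  # backtrack and try the next candidate
        return None

    return extend({})

# Abstract pattern: two people (A, B) who share one account (X).
pattern = undirected([("A", "X"), ("X", "B")])
pattern_label = {"A": "person", "B": "person", "X": "account"}

# Hypothetical data graph.
data = undirected([("alice", "acct1"), ("bob", "acct1"), ("carol", "acct2")])
data_label = {"alice": "person", "bob": "person", "carol": "person",
              "acct1": "account", "acct2": "account"}

print(find_match(pattern, pattern_label, data, data_label))
```

Backtracking like this is exponential in the worst case, and it chases neighbor sets scattered across memory. That irregular access is in line with the low locality the talk attributes to analytics workloads.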