Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework John Vert Architect Microsoft Corporation SVR17

Upload: butest

Post on 10-May-2015


TRANSCRIPT

Page 1: SVR17: Data-Intensive Computing on Windows HPC Server with the

Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework

John Vert
Architect
Microsoft Corporation

SVR17

Page 2

Moving Parts

> Windows HPC Server 2008 – cluster management, job scheduling

> Dryad – distributed execution engine: failure recovery, distribution, and scalability across very large partitioned datasets

> LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model

> PLINQ – multi-core parallelism across LINQ queries

> DryadLINQ – brings LINQ's ease of programming to Dryad

Page 3

Software Stack

[Figure: layered software stack, top to bottom]
> .NET Applications: Image Processing, Machine Learning, Graph Analysis, Data Mining
> DryadLINQ
> Dryad
> HPC Job Scheduler
> Windows HPC Server 2008 (repeated four times in the figure)

Page 4

Dryad

> Provides a general, flexible distributed execution layer
  > Dataflow graph as the computation model
  > Can be modified by runtime optimizations
  > Higher language layer supplies graph, vertex code, serialization code, hints for data locality

> Automatically handles distributed execution
  > Distributes code, routes data
  > Schedules processes on machines near data
  > Masks failures in cluster and network

Page 5

A Dryad Job
Directed acyclic graph (DAG)

[Figure: job graph with the following labeled parts]
> Inputs
> Processing vertices
> Channels (file, fifo, pipe)
> Outputs

Page 6

2-D Piping

Unix Pipes: 1-D
    grep | sed | sort | awk | perl

Dryad: 2-D
    grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50

(The superscripts give the number of parallel instances of each stage.)

Page 7

LINQ
Language Integrated Query

> Declarative extensions to C# and VB.NET for iterating over collections
  > In memory
  > Via data providers
  > SQL-like

> Broadly adoptable by developers
  > Easy to use
  > Reduces written code
  > Predictable results
  > Scalable experience
  > Deep tooling support

Page 8

PLINQ
Parallel Language Integrated Query

Value Proposition:
> Enable LINQ developers to take advantage of parallel hardware, with a basic understanding of data parallelism

> Declarative data parallelism (focus on the "what", not the "how")

> Alternative to LINQ-to-Objects
  > Same set of query operators + some extras
  > Default is IEnumerable<T> based

> Preview in Parallel Extensions to .NET Framework 3.5 CTP

> Shipping in .NET Framework 4.0 Beta 2

Page 9

DryadLINQ
LINQ to clusters

> Declarative programming style of LINQ for clusters

> Automatic parallelization
  > Parallel query plan exploits multi-node parallelism
  > PLINQ underneath exploits multi-core parallelism

> Integration with VS and .NET
  > Type safety, automatic serialization

> Query plan optimizations
  > Static optimization rules to optimize locality
  > Dynamic run-time optimizations

Page 10

DryadLINQ: From LINQ to Dryad

[Figure: LINQ query -> automatic query plan generation -> query plan (logs -> where -> select) -> distributed query execution by Dryad]

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

Page 11

A Simple LINQ Query

    IEnumerable<BabyInfo> babies = ...;

    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;

Page 12

A Simple PLINQ Query

    IEnumerable<BabyInfo> babies = ...;

    var results = from baby in babies.AsParallel()
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;
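The LINQ and PLINQ queries differ only in the call to AsParallel(). A minimal sketch that runs both against in-memory data follows. The BabyInfo class is hypothetical (the slides show only its Name, State, and Year members), and the sample rows and the method names RunSequential, RunParallel, and SampleData are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type: only Name, State and Year appear on the slides.
public class BabyInfo
{
    public string Name { get; set; } = "";
    public string State { get; set; } = "";
    public int Year { get; set; }
}

public static class BabyQueries
{
    // Sequential LINQ-to-Objects version of the slide's query.
    public static List<BabyInfo> RunSequential(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies
                      where baby.Name == queryName &&
                            baby.State == queryState &&
                            baby.Year >= yearStart &&
                            baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }

    // PLINQ version: the only change is AsParallel() on the source.
    public static List<BabyInfo> RunParallel(IEnumerable<BabyInfo> babies,
        string queryName, string queryState, int yearStart, int yearEnd)
    {
        var results = from baby in babies.AsParallel()
                      where baby.Name == queryName &&
                            baby.State == queryState &&
                            baby.Year >= yearStart &&
                            baby.Year <= yearEnd
                      orderby baby.Year ascending
                      select baby;
        return results.ToList();
    }

    // Made-up sample rows, for illustration only.
    public static List<BabyInfo> SampleData() => new List<BabyInfo>
    {
        new BabyInfo { Name = "Emma", State = "WA", Year = 2005 },
        new BabyInfo { Name = "Emma", State = "WA", Year = 2001 },
        new BabyInfo { Name = "Emma", State = "OR", Year = 2002 },
        new BabyInfo { Name = "Liam", State = "WA", Year = 2003 },
        new BabyInfo { Name = "Emma", State = "WA", Year = 2003 },
        new BabyInfo { Name = "Emma", State = "WA", Year = 1999 },
    };
}
```

Because the query ends with an orderby clause, both versions return the matching rows in ascending Year order.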

Page 13

A Simple DryadLINQ Query

    PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>("BabyInfo.pt");

    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;

Page 14

PartitionedTable<T>
Core data structure for DryadLINQ

> Scale-out, partitioned container for .NET objects

> Derives from IQueryable<T>, IEnumerable<T>
  > ToPartitionedTable() extension methods

> DryadLINQ operators consume and produce PartitionedTable<T>

> DryadLINQ generates code to serialize/deserialize your .NET objects

> Underlying storage can be partitioned file, partitioned SQL table, cluster filesystem

Page 15

Partitioned File
File-based container for PartitionedTable<T> metadata

XC\output\520a0fcf\Part
20
0,1855000,HPCMETAHN01
1,1630000,HPCA1CN13
2,1707500,HPCA1CN12
3,1828820,HPCA1CN22
4,1802140,HPCA1CN07
5,1741000,HPCA1CN08
6,1733980,HPCA1CN11
7,1762620,HPCA1CN06
8,1861300,HPCA1CN14
9,1807460,HPCA1CN17
10,1807560,HPCA1CN23
11,1768120,HPCA1CN20
12,1847220,HPCA1CN03
13,1729160,HPCA1CN16
14,1767500,HPCA1CN05
15,1781520,HPCA1CN04
16,1728480,HPCA1CN09
17,1802580,HPCA1CN18
18,1862380,HPCA1CN10
19,1762540,HPCA1CN21

\\HPCMETAHN01\XC\output\520a0fcf\Part.00000000

Page 16

Partitioned File
File-based container for PartitionedTable<T> metadata

XC\output\520a0fcf\Part
20
0,1855000,HPCMETAHN01
1,1630000,HPCA1CN13
2,1707500,HPCA1CN12
3,1828820,HPCA1CN22
4,1802140,HPCA1CN07
5,1741000,HPCA1CN08
6,1733980,HPCA1CN11
7,1762620,HPCA1CN06
8,1861300,HPCA1CN14
9,1807460,HPCA1CN17
10,1807560,HPCA1CN23
11,1768120,HPCA1CN20
12,1847220,HPCA1CN03
13,1729160,HPCA1CN16
14,1767500,HPCA1CN05
15,1781520,HPCA1CN04
16,1728480,HPCA1CN09
17,1802580,HPCA1CN18
18,1862380,HPCA1CN10
19,1762540,HPCA1CN21

\\HPCMETAHN01\XC\output\520a0fcf\Part.00000000
\\HPCA1CN13\XC\output\520a0fcf\Part.00000001
\\HPCA1CN12\XC\output\520a0fcf\Part.00000002
\\HPCA1CN22\XC\output\520a0fcf\Part.00000003
\\HPCA1CN07\XC\output\520a0fcf\Part.00000004
\\HPCA1CN08\XC\output\520a0fcf\Part.00000005
\\HPCA1CN11\XC\output\520a0fcf\Part.00000006
\\HPCA1CN06\XC\output\520a0fcf\Part.00000007
\\HPCA1CN14\XC\output\520a0fcf\Part.00000008
\\HPCA1CN17\XC\output\520a0fcf\Part.00000009
\\HPCA1CN23\XC\output\520a0fcf\Part.00000010
\\HPCA1CN20\XC\output\520a0fcf\Part.00000011
\\HPCA1CN03\XC\output\520a0fcf\Part.00000012
\\HPCA1CN16\XC\output\520a0fcf\Part.00000013
\\HPCA1CN05\XC\output\520a0fcf\Part.00000014
\\HPCA1CN04\XC\output\520a0fcf\Part.00000015
\\HPCA1CN09\XC\output\520a0fcf\Part.00000016
\\HPCA1CN18\XC\output\520a0fcf\Part.00000017
\\HPCA1CN10\XC\output\520a0fcf\Part.00000018
\\HPCA1CN21\XC\output\520a0fcf\Part.00000019
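The relationship between the metadata file and the per-partition UNC paths on this slide can be sketched in a few lines. This is a reading of the slide, not the DryadLINQ implementation. It assumes the first line is a path prefix, the second is the partition count, and each following line is "index,number,machine". The slide does not say what the middle number means, so calling it Size is a guess, and the names PartitionInfo and PartitionedFileSketch are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One line of partition metadata. "Size" is an assumed interpretation of the
// middle field; the slide does not label it.
public record PartitionInfo(int Index, long Size, string Machine);

public static class PartitionedFileSketch
{
    // Parses the layout shown on the slide: prefix, partition count,
    // then one "index,number,machine" line per partition.
    public static (string Prefix, List<PartitionInfo> Parts) Parse(string[] lines)
    {
        string prefix = lines[0].Trim();
        int count = int.Parse(lines[1]);
        var parts = lines.Skip(2).Take(count)
            .Select(line => line.Split(','))
            .Select(f => new PartitionInfo(int.Parse(f[0]), long.Parse(f[1]), f[2].Trim()))
            .ToList();
        return (prefix, parts);
    }

    // Builds the UNC path of one partition: \\machine\prefix.NNNNNNNN
    public static string UncPath(string prefix, PartitionInfo p) =>
        $@"\\{p.Machine}\{prefix}.{p.Index:D8}";
}
```

Applied to the second entry on the slide (1,1630000,HPCA1CN13), UncPath yields \\HPCA1CN13\XC\output\520a0fcf\Part.00000001, which matches the path list shown.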

Page 17

A typical data-intensive query

    var logs = PartitionedTable.Get<string>("weblogs.pt");

    var logentries = from line in logs
                     where !line.StartsWith("#")
                     select new LogEntry(line);

    var user = from access in logentries
               where access.user.EndsWith(@"\jvert")
               select access;

    var accesses = from access in user
                   group access by access.page into pages
                   select new UserPageCount("jvert", pages.Key, pages.Count());

    var htmAccesses = from access in accesses
                      where access.page.EndsWith(".htm")
                      orderby access.count descending
                      select access;

Go through logs and keep only lines that are not comments. Parse each line into a new LogEntry object.

Go through logentries and keep only entries that are accesses by jvert.

Group jvert accesses according to what page they correspond to. For each page, count the occurrences.

Sort the pages jvert has accessed according to access frequency.
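The same four-stage query can be tried locally with LINQ-to-Objects, which is how the deck suggests developing and debugging before submitting to a cluster. The sketch below assumes a log line of the form "<user> <page>". The LogEntry parsing and the UserPageCount layout are hypothetical, since the slides show only how those classes are used, and the class name LogDemo is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log record; the real logdemo.LogEntry is not shown on the slides.
public class LogEntry
{
    public string user;
    public string page;

    // Assumed line format: "<user> <page>", separated by a single space.
    public LogEntry(string line)
    {
        var fields = line.Split(' ');
        user = fields[0];
        page = fields[1];
    }
}

// Hypothetical result record matching the constructor call on the slide.
public class UserPageCount
{
    public string user;
    public string page;
    public int count;

    public UserPageCount(string user, string page, int count)
    {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

public static class LogDemo
{
    // Same four query stages as the slide, over an in-memory sequence.
    public static List<UserPageCount> HtmAccesses(IEnumerable<string> logs)
    {
        var logentries = from line in logs
                         where !line.StartsWith("#")
                         select new LogEntry(line);

        var user = from access in logentries
                   where access.user.EndsWith(@"\jvert")
                   select access;

        var accesses = from access in user
                       group access by access.page into pages
                       select new UserPageCount("jvert", pages.Key, pages.Count());

        var htmAccesses = from access in accesses
                          where access.page.EndsWith(".htm")
                          orderby access.count descending
                          select access;

        return htmAccesses.ToList();
    }
}
```

Note that the final stage both filters to pages ending in ".htm" and sorts by count, so non-HTML pages are dropped from the result.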

Page 18

Dryad Parallel DAG execution

    var logentries = from line in logs
                     where !line.StartsWith("#")
                     select new LogEntry(line);

    var user = from access in logentries
               where access.user.EndsWith(@"\jvert")
               select access;

    var accesses = from access in user
                   group access by access.page into pages
                   select new UserPageCount("jvert", pages.Key, pages.Count());

    var htmAccesses = from access in accesses
                      where access.page.EndsWith(".htm")
                      orderby access.count descending
                      select access;

[Figure: DAG stages labeled logs, logentries, user, accesses, htmAccesses, output]

Page 19

Query plan generation

> Separation of query from its execution context
  > Add all the loaded assemblies as resources
  > Eliminate references to local variables by partially evaluating all the expressions in the query
  > Distribute objects used by the query
  > Detect impure queries when possible

> Automatic code generation
  > Object serialization code for Dryad channels
  > Managed code for Dryad vertices

> Static query plan optimizations
  > Pipelining: composing multiple operators into one vertex
  > Minimize unnecessary data repartitions
  > Other standard DB optimizations

Page 20

DryadLINQ query plan

    Query 0 Output: file://\\hpcmetahn01\XC\output\b7e651a4-38b7-490c-8399-f63eaba7f29a.pt
    DryadLinq0.dll was built successfully.
    Input:
        [PartitionedTable: file://weblogs.pt]
    Super__1:
        Where(line => !(line.StartsWith(_)))
        Select(line => new logdemo.LogEntry(line))
        Where(access => access.user.EndsWith(_))
        DryadGroupBy(access => access.page,(k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
        DryadHashPartition(e => e.Key,e => e.Key)
    Super__12:
        DryadMerge()
        DryadGroupBy(e => e.Key,e => e.Value,(k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
        Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))

Page 21

XML representation
Generated by DryadLINQ and passed to Dryad

    <Query>
      <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
      <ClusterName>hpcmetahn01</ClusterName>
      ...
      <Resources>
        <Resource>wrappernativeinfo.dll</Resource>
        <Resource>DryadLinq0.dll</Resource>
        <Resource>System.Threading.dll</Resource>
        <Resource>logdemo.exe</Resource>
        <Resource>LinqToDryad.dll</Resource>
      </Resources>
      <QueryPlan>
        <Vertex>
          <UniqueId>0</UniqueId>
          <Type>InputTable</Type>
          <Name>weblogs.pt</Name>
          ...
        </Vertex>
        <Vertex>
          <UniqueId>1</UniqueId>
          <Type>Super</Type>
          <Name>Super__1</Name>
          ...
          <Children>
            <Child>
              <UniqueId>0</UniqueId>
            </Child>
          </Children>
        </Vertex>
        ...
      </QueryPlan>
    </Query>

Callouts on the slide:
> Resources: list of files to be shipped to the cluster
> QueryPlan: vertex definitions

Page 22

DryadLINQ generated code
Compiled at runtime, assembly passed to Dryad to implement vertices

    public sealed class DryadLinq__Vertex
    {
        public static int Super__1(string args)
        {
            < . . . >
            DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
            var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
            var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
            var source__4 = DryadLinqVertex.DryadWhere(dreader__3,
                line => (!(line.StartsWith(@"#"))), true);
            var source__5 = DryadLinqVertex.DryadSelect(source__4,
                line => new logdemo.LogEntry(line), true);
            var source__6 = DryadLinqVertex.DryadWhere(source__5,
                access => access.user.EndsWith(@"\jvert"), true);
            var source__7 = DryadLinqVertex.DryadGroupBy(source__6,
                access => access.page,
                (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()),
                null, true, true, false);
            DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
            DryadLinqLog.Add("Vertex Super__1 completed at {0}",
                DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
            return 0;
        }

        public static int Super__12(string args)
        {
            < . . . >
        }
    }

Page 23

DryadLINQ query operators

> Almost all the useful LINQ operators
  > Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate

> Operators introduced by DryadLINQ
  > HashPartition, RangePartition, Merge, Fork

> Dryad Apply
  > Operates on sequences rather than items

Page 24

MapReduce in DryadLINQ

    MapReduce(source,        // sequence of Ts
              mapper,        // T -> Ms
              keySelector,   // M -> K
              reducer)       // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;       // sequence of Rs
    }
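The slide gives MapReduce without type annotations. One way to make it concrete is a typed LINQ-to-Objects version with a word count as the usage example. This is an illustrative sketch over IEnumerable<T>, not the DryadLINQ signature, and the names LocalMapReduce and WordCount are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalMapReduce
{
    // Same three steps as the slide: SelectMany (map), GroupBy, SelectMany (reduce).
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        this IEnumerable<T> source,                      // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                  // T -> Ms
        Func<M, K> keySelector,                          // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer)   // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        return group.SelectMany(reducer);                // sequence of Rs
    }

    // Word count: map each line to its words, group by lower-cased word,
    // reduce each group to a (word, count) pair.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines.MapReduce(
                line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                word => word.ToLowerInvariant(),
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

For the input lines "a b a" and "B c", WordCount returns counts of 2 for "a", 2 for "b", and 1 for "c".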

Page 25

K-means in DryadLINQ

    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
    {
        return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
    }

    public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers)
    {
        return vectors.GroupBy(point => NearestCenter(point, centers))
                      .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
    }

    var vectors = PartitionedTable.Get<Vector>("vectors.pt");
    IQueryable<Vector> centers = vectors.Take(100);
    for (int i = 0; i < 10; i++)
    {
        centers = Step(vectors, centers);
    }
    centers.ToPartitionedTable<Vector>("centers.pt");

    public class Vector
    {
        public double[] entries;
        [Associative]
        public static Vector operator +(Vector v1, Vector v2) { … }
        public static Vector operator -(Vector v1, Vector v2) { … }
        public double Norm2() { … }
    }
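The Vector operators are elided on the slide, so the following is only one plausible way to fill them in for a local LINQ-to-Objects run. It assumes Norm2() is the Euclidean norm and adds a division operator, which the Step function needs but the slide does not show. The centers are held in a List so that GroupBy can key on stable object references. The class name KMeansLocal is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed implementation of the slide's Vector class (bodies are elided there).
public class Vector
{
    public double[] entries;

    public Vector(params double[] e) { entries = e; }

    public static Vector operator +(Vector a, Vector b) =>
        new Vector(a.entries.Zip(b.entries, (x, y) => x + y).ToArray());

    public static Vector operator -(Vector a, Vector b) =>
        new Vector(a.entries.Zip(b.entries, (x, y) => x - y).ToArray());

    // Not shown on the slide, but required by "sum / group.Count()".
    public static Vector operator /(Vector a, double d) =>
        new Vector(a.entries.Select(x => x / d).ToArray());

    // Assumption: Euclidean norm.
    public double Norm2() => Math.Sqrt(entries.Sum(x => x * x));
}

public static class KMeansLocal
{
    // Returns the center (by reference) closest to v.
    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers) =>
        centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);

    // One k-means iteration: assign each point to its nearest center,
    // then replace each center by the mean of its assigned points.
    public static List<Vector> Step(IEnumerable<Vector> vectors, List<Vector> centers) =>
        vectors.GroupBy(point => NearestCenter(point, centers))
               .Select(group => group.Aggregate((x, y) => x + y) / group.Count())
               .ToList();
}
```

With one-dimensional points 0, 1, 2, 10, 11, 12 and starting centers 0 and 10, a single Step moves the centers to 1 and 11.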

Page 26

Putting it all together
It’s LINQ all the way down

> Major League Baseball dataset
  > Pitch-by-pitch data for every MLB game since 2007
  > 47,909 pitch XML files (one for each pitcher appearance)
  > 6,127 player XML files (one for each player)

> Hash partition the input data files to distribute the work

> LINQ to XML to shred the data

> DryadLINQ to analyze dataset

Page 27

Load the dataset and partition
Define Pitch and Player classes

    void StagePitchData(string[] fileList, string PartitionedFile)
    {
        // partition the list of filenames across
        // 20 nodes of the cluster
        var pitches = fileList.ToPartitionedTable("filelist")
                              .HashPartition((x) => (x), 20)
                              .SelectMany((f) => XElement.Load(f).Elements("atbat"))
                              .SelectMany((a) => a.Elements("pitch")
                                  .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                                           (string)a.Attribute("batter"),
                                                           p)));
        pitches.ToPartitionedTable(PartitionedFile);
    }

    void StagePlayerData(string[] fileList, string PartitionedFile)
    {
        var players = fileList.Select((p) => new Player(XElement.Load(p)));
        players.ToPartitionedTable(PartitionedFile);
    }
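The LINQ to XML "shredding" step can be tried without a cluster by feeding already-loaded XElement documents through the same SelectMany chain. The Pitch class below is a hypothetical stand-in, since the slides show only its constructor call. The attribute name "start_speed" and the class name PitchShredder are assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical Pitch class; the real one is not shown on the slides.
public class Pitch
{
    public string Pitcher;
    public string Batter;
    public double StartSpeed;

    public Pitch(string pitcher, string batter, XElement p)
    {
        Pitcher = pitcher;
        Batter = batter;
        // Assumption: speed is stored in a "start_speed" attribute; 0 if missing.
        StartSpeed = (double?)p.Attribute("start_speed") ?? 0.0;
    }
}

public static class PitchShredder
{
    // Mirrors the slide's chain: documents -> atbat elements -> pitch elements -> Pitch.
    public static List<Pitch> Shred(IEnumerable<XElement> documents) =>
        documents.SelectMany(d => d.Elements("atbat"))
                 .SelectMany(a => a.Elements("pitch")
                     .Select(p => new Pitch((string)a.Attribute("pitcher"),
                                            (string)a.Attribute("batter"),
                                            p)))
                 .ToList();
}
```

Each at-bat contributes one Pitch per pitch element, with the pitcher and batter ids copied from the enclosing atbat element.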

Page 28

Analyze dataset with LINQ

    IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
    {
        return pitches.OrderByDescending((p) => p.StartSpeed)
                      .Take(count);
    }

Page 29

Supports LINQ Joins

    IQueryable<string> FindFastestPitchers(IQueryable<Pitch> pitches,
                                           IQueryable<Player> players,
                                           int count)
    {
        return pitches.OrderByDescending((p) => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            (o) => o.Pitcher,
                            (i) => i.Id,
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct();
    }
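Because the method is written against IQueryable<T>, it can be exercised locally by wrapping in-memory lists with AsQueryable(). The Pitch and Player classes below are minimal hypothetical stand-ins containing only the members the query touches, and the class name PitchQueries is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal hypothetical stand-ins for the slide's Pitch and Player classes.
public class Pitch
{
    public string Pitcher = "";
    public double StartSpeed;
}

public class Player
{
    public string Id = "";
    public string FirstName = "";
    public string LastName = "";
}

public static class PitchQueries
{
    // Same query as the slide: take the fastest pitches, join each to its
    // pitcher, and return the distinct pitcher names.
    public static IQueryable<string> FindFastestPitchers(
        IQueryable<Pitch> pitches, IQueryable<Player> players, int count)
    {
        return pitches.OrderByDescending(p => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            o => o.Pitcher,
                            i => i.Id,
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct();
    }
}
```

A caller would pass pitchList.AsQueryable() and playerList.AsQueryable(). The Distinct() call matters because one pitcher can own several of the fastest pitches.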

Page 30

DryadLINQ on HPC Server

> DryadLINQ program runs on client workstation
  > Develop, debug, run locally
  > When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server

> HPC Server allocates resources for the job and schedules a single task: the Dryad Job Manager (JM)

> The JM then schedules additional tasks to execute the vertices of the DryadLINQ query

> When the job completes, the client program picks up the output result and continues
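The point that nothing runs until the query is materialized has a local analogue in ordinary LINQ deferred execution. The sketch below is only that analogy: it uses an in-memory IQueryable<int>, with enumeration via ToList() standing in for the ToPartitionedTable() call described above. The names DeferredDemo, Evaluations, IsEven, and BuildQuery are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Counts how many times the predicate has actually been evaluated.
    public static int Evaluations;

    public static bool IsEven(int x)
    {
        Evaluations++;
        return x % 2 == 0;
    }

    // Building the query only records an expression tree; IsEven is not called here.
    public static IQueryable<int> BuildQuery() =>
        new[] { 1, 2, 3, 4 }.AsQueryable().Where(x => IsEven(x));
}
```

After BuildQuery() returns, Evaluations is still 0. Only enumerating the query (for example with ToList()) runs the predicate over the four elements.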

Page 31

Examples of DryadLINQ Applications

> Data mining
  > Analysis of service logs for network security
  > Analysis of Windows Watson/SQM data
  > Cluster monitoring and performance analysis

> Graph analysis
  > Accelerated Page-Rank computation
  > Road network shortest-path preprocessing

> Image processing
  > Image indexing
  > Decision tree training
  > Epitome computation

> Simulation
  > Light flow simulations for next-generation display research
  > Monte-Carlo simulations for mobile data

> eScience
  > Machine learning platform for health solutions
  > Astrophysics simulation

Page 32

Ongoing Work

> Advanced query optimizations
  > Combination of static analysis and annotations
  > Sampling execution of the query plan
  > Dynamic query optimization

> Incremental computation

> Real-time event processing

> Global scheduling
  > Dynamically allocate cluster resources between multiple concurrent DryadLINQ applications

> Scale-out partitioned storage
  > Pluggable storage providers

> DryadLINQ on Azure

> Better debugging, performance analysis, visualization, etc.

Page 33

Additional Resources

> Dryad and DryadLINQ
  > http://connect.microsoft.com/DryadLINQ
  > DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.

> PLINQ
  > Available in Parallel Extensions to .NET Framework 3.5 CTP
  > Available in .NET Framework 4.0 Beta 2
  > http://msdn.microsoft.com/en-us/concurrency/default.aspx
  > http://msdn.microsoft.com/en-us/magazine/cc163329.aspx

> Windows HPC Server 2008
  > http://www.microsoft.com/hpc

> Download it, try it, we want your feedback!

Page 34

Questions?

Page 35

YOUR FEEDBACK IS IMPORTANT TO US!

Please fill out session evaluation forms online at MicrosoftPDC.com

Page 36

Learn More On Channel 9

> Expand your PDC experience through Channel 9.

> Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses.

channel9.msdn.com/learn

Built by Developers for Developers….

Page 37

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
