The Value of Parallelism
16th Meeting
Course Name: Business Intelligence
Year: 2009


Upload: randall-gallagher

Post on 20-Jan-2016




Bina Nusantara University


Source of this Material

(2). Loshin, David (2003). Business Intelligence: The Savvy Manager’s Guide. Chapter 11.


The Business Case

Maintaining large amounts of transaction data is one thing, but integrating and subsequently transforming that data into an analytical environment (such as a data warehouse or any multidimensional analytical framework) requires a large amount of both storage space and processing capability. And unfortunately, the kinds of processing needed for BI applications cannot be scaled linearly.

In other words, with most BI processing, doubling the amount of data can dramatically increase the amount of processing required.
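To make the non-linear growth concrete, here is a minimal sketch that assumes a hypothetical matching step (for example, duplicate detection) in which every record is compared with every other record. The all-pairs assumption and the function name are chosen for illustration only; they are not taken from the source material.

```python
def pairwise_comparisons(n_records: int) -> int:
    """Number of distinct record pairs an all-pairs matching step must compare."""
    return n_records * (n_records - 1) // 2


# Assumed data sizes, chosen only to show the growth pattern.
for n in (1_000, 2_000, 4_000):
    print(n, pairwise_comparisons(n))
```

Under this assumption, each doubling of the record count roughly quadruples the number of comparisons.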


Parallelism and Granularity

Whenever we talk about parallelism, we need to assess the size of the problem as well as the way the problem can be decomposed into parallelizable units. The metric of the unit size with respect to concurrency is called granularity. Large problems that decompose into a relatively small number of large tasks have coarse granularity, whereas a decomposition into a large number of very small tasks has fine granularity. In this section we look at different kinds of parallelism, ranging from coarse-grained to fine-grained.

• Scalability
Scalability refers to the situation in which the speedup increases linearly as the number of resources is increased.

• Task Parallelism
Task parallelism (Figure 16-1) is a coarsely grained parallelism. In this case, a high-level process is decomposed into a collection of discrete tasks, each of which performs some set of operations and results in some output or side effect.
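Task parallelism can be sketched in a few lines of Python. The three loader functions below are hypothetical stand-ins for discrete tasks, and the use of a thread pool is an assumption made for illustration; this is one possible arrangement, not the method of the source text.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical discrete tasks; each one produces its own output.
def load_customers():
    return ["alice", "bob"]


def load_orders():
    return [("alice", 30), ("bob", 12), ("alice", 5)]


def load_products():
    return {"widget": 2.5}


def run_tasks(tasks):
    """Submit each independent task to the pool and collect results by task name."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {task.__name__: pool.submit(task) for task in tasks}
        return {name: future.result() for name, future in futures.items()}


results = run_tasks([load_customers, load_orders, load_products])
print(sorted(results))
```

In CPython, threads mainly help when the tasks wait on I/O; CPU-bound tasks would more likely be run in separate processes.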

Parallelism and Granularity (cont…)

• Pipeline Parallelism
Pipelining is an example of medium-grained parallelism, because the tasks are not fully separable (i.e., the completion of a single stage does not result in a finished product); however, the amount of work is large enough that it can be cordoned off and assigned as operational tasks (Figure 16-2).

Figure 16-1
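A pipeline can be sketched as one worker per stage, with queues carrying items from each stage to the next. The stage functions and the sentinel object are hypothetical choices for this illustration, and threads stand in for whatever processing units a real system would assign to each stage.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Run one thread per stage, connected by queues; results keep input order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stages: parse, then scale, then format.
print(run_pipeline(["1", "2", "3"], [int, lambda x: x * 10, str]))
```

While the last stage formats one item, the earlier stages can already be working on the items behind it, which is where the overlap comes from.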


Parallelism and Granularity (cont…)

Figure 16-2

• Data Parallelism
Data parallelism is a different kind of parallelism. Here, instead of identifying a collection of operational steps to be allocated to a process or task, the parallelism is related to both the flow and the structure of the information. For data parallelism, the goal is to scale the throughput of processing based on the ability to decompose the data set into concurrent processing streams, all performing the same set of operations.
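A minimal sketch of data parallelism: the data set is split into slices, and every stream applies the same operation to its own slice. The partitioning scheme, the summing operation, and the default of four streams are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts roughly equal slices."""
    return [records[i::n_parts] for i in range(n_parts)]


def total_amount(chunk):
    """The single operation that every stream applies to its own slice."""
    return sum(chunk)


def parallel_total(records, n_parts=4):
    """Apply total_amount to each slice concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(total_amount, partition(records, n_parts)))
    return sum(partials)


print(parallel_total(list(range(1, 101))))  # 5050
```

The final combining step runs once, after all streams finish; only the per-slice work is spread across the streams.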


Parallelism and Granularity (cont…)

• Vector Parallelism
Vector parallelism refers to an execution framework where a collection of objects is treated as an array, or vector, and the same set of operations is applied to all elements of the set. A vector parallel platform can be used to implement both pipeline and data parallel applications.

• Combinations
We can embed pipelined processing within coarsely grained tasks or even decompose a pipe stage into a set of concurrent processes. The value of each of these kinds of parallelism is bounded by the system’s ability to support the overhead for managing those different levels.

Parallel Processing System

In this section we look at some popular parallel processing architectures. Systems employing these architectures either are configured by system manufacturers (such as symmetric multiprocessor [SMP] or massively parallel processing [MPP] systems) or can be homebrewed by savvy technical personnel (such as by use of a network of workstations).

• Symmetric Multiprocessing
An SMP system is a hardware configuration that combines multiple processors within a single architecture. In an SMP system, multiple processes can be allocated to different CPUs within the system, which makes an SMP machine a good platform for coarse-grained parallelism.

• Massively Parallel Processing
An MPP system consists of a large number of small homogeneous processors interconnected via a high-speed network. The processors in an MPP machine are independent: they do not share memory, and typically each processor runs its own instance of an operating system, although there may be a systemic controller application hosted on a leader processor that instructs the individual processors in the MPP configuration on what tasks to perform.

Parallel Processing System (cont…)

• Network of Workstations
A network of workstations is a more loosely coupled version of an MPP system; the workstations are likely to be configured as individual machines that are connected via a network. The communication latencies (i.e., delays in exchanging data) are likely to be an order of magnitude greater in this kind of configuration, and it is also possible for the machines in the network to be heterogeneous (i.e., involving different kinds of systems).

• Hybrid Architectures
A hybrid architecture is one that combines or adapts one of the previously discussed systems. The use of a hybrid architecture may be very dependent on the specific application, because some systems may be better suited to the concurrency specifics associated with each application.

Dependence

The key to exploiting parallelism is the ability to analyze dependence constraints within any process. In this section we explore both control dependence and data dependence, as well as issues associated with analyzing the dependence constraints within a system that prevent an organization from exploiting parallelism.

• Control Dependence
Control refers to the logic that determines whether a particular task is performed. When a task depends on a condition set by another task, or on that other task's completion, it cannot be initiated until the condition is set or the other task completes. In this case, the first task is control dependent on the second task.

• Input/Output Data Dependence
A data dependence between two processing stages represents a situation where some information that is “touched” by one process is read or written by a subsequent process, and the order of execution of these processes requires that the first process execute before the second process. There are potentially four kinds of data dependencies.

Dependence (cont…)

Read After Read (RAR)
The second process’s read of data occurs after the first process’s read.

Read After Write (RAW)
The second process’s read of data must occur after the first process’s write of that data.

Write After Read (WAR)
The second process’s write of data must occur after the first process’s read of that data item.

Write After Write (WAW)
The second process’s write of data must occur after the first process’s write of that data item.

• Dependence Analysis
Dependence analysis is the process of examining an application’s use of data and control structure to identify true dependencies within the application. The goal is first to identify the dependence chain within the system and then to look for opportunities where independent tasks can be executed in parallel.
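The four kinds of data dependence and the goal of dependence analysis can be illustrated with a small sketch. It assumes each task declares the data items it reads and writes; the task names, the data item names, and the staging scheme are hypothetical. The sketch also assumes that RAR does not constrain execution order, since two reads of the same item do not interfere with each other.

```python
def classify(first, second):
    """Return the dependence kinds between two tasks executed in this order.

    Each task is a dict with 'reads' and 'writes' sets of data item names.
    """
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("RAW")
    if first["reads"] & second["writes"]:
        kinds.add("WAR")
    if first["writes"] & second["writes"]:
        kinds.add("WAW")
    if first["reads"] & second["reads"]:
        kinds.add("RAR")
    return kinds


def parallel_stages(tasks):
    """Group tasks (given in program order) into stages that can run concurrently.

    A task lands one stage after the latest earlier task it depends on.
    RAR is ignored here: an assumption that shared reads impose no ordering.
    """
    names = list(tasks)
    stage_of = {}
    for i, name in enumerate(names):
        level = 0
        for earlier in names[:i]:
            if classify(tasks[earlier], tasks[name]) - {"RAR"}:
                level = max(level, stage_of[earlier] + 1)
        stage_of[name] = level
    stages = [[] for _ in range(max(stage_of.values(), default=-1) + 1)]
    for name in names:
        stages[stage_of[name]].append(name)
    return stages


# Hypothetical tasks and data items, for illustration only.
tasks = {
    "extract_sales": {"reads": {"sales_src"}, "writes": {"sales_raw"}},
    "extract_customers": {"reads": {"cust_src"}, "writes": {"cust_raw"}},
    "join": {"reads": {"sales_raw", "cust_raw"}, "writes": {"fact"}},
    "report": {"reads": {"fact"}, "writes": {"summary"}},
}
print(parallel_stages(tasks))
```

Here the two extract tasks share no data and land in the same stage, while the join and the report each wait on a RAW dependence.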

Parallelism and Business Intelligence

At this point it makes sense to review the value of parallelism with respect to a number of the BI-related applications. In each of these applications, a significant speedup can be achieved by exploiting parallelism.

• Query Processing
Relational database management systems will most likely have taken advantage of internal query optimization as well as user-defined indexes to speed up queries. But even in that situation there are different ways that a query can be parallelized, and this will still result in speedup.

• Data Profiling
The column analysis component of data profiling is another good example of an application that can benefit from parallelism. Every column in a table is subject to a collection of analyses: frequency of values, value cardinality, distribution of value ranges, etc. But because the analysis of one column is distinct from that applied to all other columns, we can exploit parallelism by treating the set of analyses applied to each column as a separate task and then instantiating separate tasks to analyze a collection of columns simultaneously.
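The column-analysis case for data profiling can be sketched as one task per column. The table layout (a dict mapping column names to lists of values), the chosen statistics, and the function names are assumptions for illustration, and the sketch assumes every column is non-empty.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def profile_column(values):
    """Column analysis: value frequencies, cardinality, and value range."""
    freq = Counter(values)
    return {
        "cardinality": len(freq),
        "most_common": freq.most_common(1)[0],
        "min": min(values),
        "max": max(values),
    }


def profile_table(table):
    """Profile every column as its own task; table maps column name to a list of values."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(profile_column, col) for name, col in table.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical table, for illustration only.
table = {"region": ["east", "west", "east"], "amount": [10, 25, 10]}
profiles = profile_table(table)
print(profiles["region"])
```

Because no column's analysis reads another column's results, the tasks need no coordination beyond collecting their outputs.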

Parallelism and Business Intelligence (cont…)

• Extract, Transform, Load
The extract, transform, load (ETL) component of data warehouse population is actually well suited to all the kinds of parallelism we have discussed. The ETL process itself consists of a sequence of stages that can be configured as a pipeline, propagating the results of each stage to its successor, yielding some medium-grained parallelism.

Management Issue

In this section we discuss some of the management issues associated with the use of parallelism in a BI environment.

• Training and Technical Management Requirements
Significant technical training is required to understand both system management and how best to take advantage of parallel systems.

• Minimal Software Support
Not many off-the-shelf application software packages take advantage of parallel systems, although a number of RDBMS products do, and there is an increasing number of high-end ETL tools that exploit parallelism.

• Scalability Issues
Scalability is not just a function of the size of the data. Any use of parallelism in an analytical environment should be evaluated in the context of data size, the amount of query processing expected, the number of concurrent users, and the kinds and complexity of the analytical processing.

Management Issue (cont…)

• Need for Expertise
Ultimately, control and data dependence analysis are tasks that need to be incorporated into the systems analysis role when migrating to a parallel infrastructure, because the absence of valid dependence analysis will prevent proper exploitation of concurrent resources.
