The Value of Parallelism
16th Meeting

Course Name: Business Intelligence
Year: 2009

Bina Nusantara University


Source of this Material

(2) Loshin, David (2003). Business Intelligence: The Savvy Manager’s Guide. Chapter 11.

The Business Case

Maintaining large amounts of transaction data is one thing, but integrating and subsequently transforming that data into an analytical environment (such as a data warehouse or any multidimensional analytical framework) requires a large amount of both storage space and processing capability. And unfortunately, the kinds of processing needed for BI applications cannot be scaled linearly.

In other words, with most BI processing, doubling the amount of data can dramatically increase the amount of processing required.
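As a rough illustration of this nonlinear growth (an example chosen here, not taken from the source), consider a hypothetical BI step that compares every record with every other record, such as naive duplicate detection. The function name `pairwise_comparisons` is an assumption made for this sketch.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Each doubling of the record count roughly quadruples the work.
    for n in (1_000, 2_000, 4_000):
        print(n, pairwise_comparisons(n))
```

Under this assumption, doubling the data roughly quadruples the number of comparisons, which is the kind of growth the slide warns about.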


Whenever we talk about parallelism, we need to assess the size of the problem as well as the way the problem can be decomposed into parallelizable units. The metric of the unit size with respect to concurrency is called granularity. Large problems that decompose into a relatively small number of large tasks have coarse granularity, whereas a decomposition into a large number of very small tasks has fine granularity. In this section we look at different kinds of parallelism, ranging from coarse-grained to fine-grained.

• Scalability

Scalability refers to the situation in which speedup increases linearly as the number of resources increases.
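A minimal sketch of how this could be checked, using the standard definitions of speedup (serial time divided by parallel time) and efficiency (speedup divided by the number of processors). The function names and the timings are hypothetical, chosen only for illustration.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup: how many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Efficiency: speedup per worker. A value near 1.0 means linear scaling."""
    return speedup(t_serial, t_parallel) / workers


if __name__ == "__main__":
    # Hypothetical run times in seconds, keyed by number of processors.
    timings = {1: 120.0, 2: 60.0, 4: 30.0, 8: 20.0}
    for workers, seconds in timings.items():
        print(workers, speedup(timings[1], seconds),
              efficiency(timings[1], seconds, workers))
```

In these made-up numbers the system scales linearly up to 4 processors (efficiency 1.0) and falls short at 8 (efficiency 0.75).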

• Task Parallelism

Task parallelism (Figure 16-1) is a coarse-grained form of parallelism. A high-level process is decomposed into a collection of discrete tasks, each of which performs some set of operations and results in some output or side effect.
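A minimal sketch of task parallelism, assuming three hypothetical and mutually independent tasks over the same list of records. The task names and the record layout (`region`, `amount`) are assumptions made for this example. Threads are used for brevity; in standard CPython, CPU-bound tasks would normally need `ProcessPoolExecutor` to run truly concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def count_rows(rows):
    return len(rows)


def total_sales(rows):
    return sum(r["amount"] for r in rows)


def distinct_regions(rows):
    return sorted({r["region"] for r in rows})


def run_tasks(rows):
    """Submit each discrete task to its own worker and collect the outputs."""
    tasks = {
        "row_count": count_rows,
        "total_sales": total_sales,
        "regions": distinct_regions,
    }
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, rows) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    sample = [{"region": "east", "amount": 10}, {"region": "west", "amount": 5}]
    print(run_tasks(sample))
```

Each task is a different operation that yields its own output, which is what distinguishes this from data parallelism.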


Parallelism and Granularity

• Pipeline Parallelism

Pipelining is an example of medium-grained parallelism, because the tasks are not fully separable (i.e., the completion of a single stage does not result in a finished product); however, the amount of work is large enough that it can be cordoned off and assigned as operational tasks (Figure 16-2).
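A minimal sketch of a pipeline, assuming each stage runs in its own thread and hands its results to the next stage through a queue. The helper names (`stage`, `run_pipeline`) and the sentinel object are assumptions made for this example, not part of the source material.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to each item taken from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # tell the next stage that the stream has ended
            return
        outbox.put(fn(item))


def run_pipeline(items, fns):
    """Run one thread per stage; stage i feeds stage i + 1."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([" a ", "b "], [str.strip, str.upper]))
```

While the second stage works on one item, the first stage can already work on the next one. No single stage produces a finished product on its own.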


Parallelism and Granularity (cont…)

Figure 16-1: Task parallelism


Parallelism and Granularity (cont…)

Figure 16-2: Pipeline parallelism

• Data Parallelism

Data parallelism is a different kind of parallelism. Here, instead of identifying a collection of operational steps to be allocated to a process or task, the parallelism is related to both the flow and the structure of the information. For data parallelism, the goal is to scale the throughput of processing based on the ability to decompose the data set into concurrent processing streams, all performing the same set of operations.
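A minimal sketch of data parallelism, assuming the data set is a simple list that can be split into contiguous chunks. The function names and the sum-of-squares operation are hypothetical stand-ins for whatever operation is applied to every partition.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The same operation is applied to every chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, parts=4):
    """Process all chunks concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(sum_of_squares, partition(data, parts)))


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

Throughput scales with the number of streams only as far as the data can be partitioned and the partial results cheaply combined.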


Parallelism and Granularity (cont…)

• Vector Parallelism

Vector parallelism refers to an execution framework where a collection of objects is treated as an array, or vector, and the same set of operations is applied to all elements of the set. A vector parallel platform can be used to implement both pipeline and data parallel applications.

• Combinations

We can embed pipelined processing within coarsely grained tasks or even decompose a pipeline stage into a set of concurrent processes. The value of each of these kinds of parallelism is bounded by the system’s ability to support the overhead for managing those different levels.

In this section we look at some popular parallel processing architectures. Systems employing these architectures either are configured by system manufacturers (such as symmetric multiprocessor [SMP] or massively parallel processing [MPP] systems) or can be homebrewed by savvy technical personnel (such as by use of a network of workstations).

• Symmetric Multiprocessing

An SMP system is a hardware configuration that combines multiple processors within a single architecture. In an SMP system, multiple processes can be allocated to different CPUs within the system, which makes an SMP machine a good platform for coarse-grained parallelism.

• Massively Parallel Processing

An MPP system consists of a large number of small homogeneous processors interconnected via a high-speed network. The processors in an MPP machine are independent: they do not share memory, and each processor typically runs its own instance of an operating system, although a system controller application hosted on a leader processor may instruct the individual processors in the MPP configuration on what tasks to perform.


Parallel Processing System

• Network of Workstations

A network of workstations is a more loosely coupled version of an MPP system; the workstations are likely to be configured as individual machines that are connected via a network. The communication latencies (i.e., delays in exchanging data) are likely to be an order of magnitude greater in this kind of configuration, and it is also possible for the machines in the network to be heterogeneous (i.e., involving different kinds of systems).

• Hybrid Architectures

A hybrid architecture is one that combines or adapts the previously discussed systems. The use of a hybrid architecture may depend heavily on the specific application, because some systems may be better suited to the concurrency characteristics of a given application.


Parallel Processing System (cont…)

The key to exploiting parallelism is the ability to analyze dependence constraints within any process. In this section we explore both control dependence and data dependence, as well as issues associated with analyzing the dependence constraints within a system that prevent an organization from exploiting parallelism.

• Control Dependence

Control refers to the logic that determines whether a particular task is performed. When a task cannot be initiated until another task sets a condition or completes, the first task is control dependent on the second task.

• Input/Output Data Dependence

A data dependence between two processing stages represents a situation where some information that is “touched” by one process is read or written by a subsequent process, and the order of execution of these processes requires that the first process execute before the second process. There are potentially four kinds of data dependencies.


Dependence

Read After Read (RAR): the second process’s read of data occurs after the first process’s read.

Read After Write (RAW): the second process’s read of data must occur after the first process’s write of that data.

Write After Read (WAR): the second process’s write of data must occur after the first process’s read of that data item.

Write After Write (WAW): the second process’s write of data must occur after the first process’s write of that data item.

• Dependence Analysis

Dependence analysis is the process of examining an application’s use of data and control structure to identify true dependencies within the application. The goal is first to identify the dependence chain within the system and then to look for opportunities where independent tasks can be executed in parallel.
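A minimal sketch of dependence analysis, assuming each task is described by a hypothetical name plus the sets of data items it reads and writes, listed in original program order. The sketch conservatively treats RAW, WAR, and WAW as ordering constraints and treats RAR as imposing none. All task and data names are invented for illustration.

```python
def find_dependences(tasks):
    """tasks: list of (name, reads, writes) in original program order.

    Returns {later_task: {earlier_task: set of dependence kinds}}.
    """
    deps = {name: {} for name, _, _ in tasks}
    for j, (later, reads2, writes2) in enumerate(tasks):
        for earlier, reads1, writes1 in tasks[:j]:
            kinds = set()
            if writes1 & reads2:
                kinds.add("RAW")
            if reads1 & writes2:
                kinds.add("WAR")
            if writes1 & writes2:
                kinds.add("WAW")
            if kinds:  # two reads of the same item (RAR) add no constraint
                deps[later][earlier] = kinds
    return deps


def parallel_waves(tasks):
    """Group tasks into waves; tasks in the same wave are mutually independent."""
    deps = find_dependences(tasks)
    level = {}
    for name, _, _ in tasks:  # program order, so predecessors already have a level
        level[name] = 1 + max((level[p] for p in deps[name]), default=-1)
    waves = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for name, _, _ in tasks:
        waves[level[name]].append(name)
    return waves


if __name__ == "__main__":
    job = [
        ("load_orders", set(), {"orders"}),
        ("load_customers", set(), {"customers"}),
        ("join", {"orders", "customers"}, {"joined"}),
        ("count_orders", {"orders"}, {"order_count"}),
        ("report", {"joined", "order_count"}, {"report"}),
    ]
    for wave in parallel_waves(job):
        print(wave)
```

In this invented job, the two loads can run together, then the join and the count, then the report: the dependence chain has length three even though there are five tasks.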


Dependence (cont…)

At this point it makes sense to review the value of parallelism with respect to a number of the BI-related applications. In each of these applications, a significant speedup can be achieved by exploiting parallelism.

• Query Processing

Relational database management systems will most likely have taken advantage of internal query optimization as well as user-defined indexes to speed up queries. But even then, there are different ways that a query can be parallelized, and this will still result in speedup.

• Data Profiling

The column analysis component of data profiling is another good example of an application that can benefit from parallelism. Every column in a table is subject to a collection of analyses: frequency of values, value cardinality, distribution of value ranges, etc. But because the analysis of one column is distinct from that applied to all other columns, we can exploit parallelism by treating the set of analyses applied to each column as a separate task and then instantiating separate tasks to analyze a collection of columns simultaneously.
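A minimal sketch of parallel column analysis, assuming a table held in memory as a dictionary that maps hypothetical column names to lists of values, with `None` standing for a missing value. The function names and the particular statistics are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def profile_column(values):
    """Run the full set of analyses for one column."""
    present = [v for v in values if v is not None]
    freq = Counter(present)
    return {
        "frequencies": dict(freq),
        "cardinality": len(freq),
        "null_count": len(values) - len(present),
        "range": (min(present), max(present)) if present else None,
    }


def profile_table(columns):
    """columns: {column_name: list of values}. One task per column."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(profile_column, values)
            for name, values in columns.items()
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    table = {
        "region": ["east", "west", "east", None],
        "amount": [10, 5, 10, 7],
    }
    for name, stats in profile_table(table).items():
        print(name, stats)
```

No column's profile depends on any other column's, so the tasks need no coordination beyond collecting their results.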

Parallelism and Business Intelligence

• Extract, Transform, Load

The extract, transform, load (ETL) component of data warehouse population is actually well suited to all the kinds of parallelism we have discussed. The ETL process itself consists of a sequence of stages that can be configured as a pipeline, propagating the results of each stage to its successor, yielding some medium-grained parallelism.


Parallelism and Business Intelligence (cont…)

In this section we discuss some of the management issues associated with the use of parallelism in a BI environment.

• Training and Technical Management Requirements

Significant technical training is required to understand both system management and how best to take advantage of parallel systems.

• Minimal Software Support

Not many off-the-shelf application software packages take advantage of parallel systems, although a number of RDBMS products do, and there is an increasing number of high-end ETL tools that exploit parallelism.

• Scalability Issues

Scalability is not just a function of the size of the data. Any use of parallelism in an analytical environment should be evaluated in the context of data size, the amount of query processing expected, the number of concurrent users, and the kinds and complexity of the analytical processing.


Management Issue

• Need for Expertise

Ultimately, control and data dependence analysis are tasks that need to be incorporated into the systems analysis role when migrating to a parallel infrastructure, because the absence of valid dependence analysis will prevent proper exploitation of concurrent resources.


Management Issue (cont…)

End of Slide

