
Performance tests


Optimization of time-critical program parts

In order to improve the performance of the serial code, and as a consequence also the performance of the parallel version, we first selected a test dataset that allowed many program runs to be executed within a limited time frame. This test dataset comprised only 9 effective meioses and thus led to a very short run time, but it was expected to show run-time behaviour similar to that of our target dataset with 17 meioses. During these test runs, performance information was repeatedly collected and evaluated, and the program code was then modified to cut down the run time. For this purpose we used the performance analyzer toolset, an important part of the Sun Microsystems programming environment. With this toolset, those portions of the code that are expensive because they do not utilize the hardware efficiently can easily be identified. It quickly showed that the original program was spending more than 99% of its run time in two functions (peel and brute_force_analyze) which make up less than 1% of the program code. This is, of course, a fortunate situation, because changing only a small portion of the code can have a major effect on performance. There are well-known program optimization techniques that could be applied. In many cases these are applied automatically by an optimizing compiler, but there are cases where the code is too complicated or where not enough information is available at compile time. We therefore made the following manual code changes to GENEHUNTER-TWOLOCUS in order to improve its performance.

- Extraction of loop-invariant code

Most of the compute time is spent inside multiple nested for-loops. If some parts of the code in the loop body do not depend on the loop iteration, they can be moved out of the loop, saving a lot of computation time.

- Replacement of case constructs by bit manipulations

Modern processors use pipelines to speed up program execution. Several instructions are therefore in flight at any given time, each in a different pipeline stage. Any kind of conditional code, e.g. if statements or, as here, case constructs, disturbs this instruction flow. It proved profitable to replace the case constructs by bit operations, which helps to keep the instruction pipelines busy and thus increases the number of instructions executed per clock cycle.

- Loop interchange to improve loop unrolling

Modern compilers frequently use a technique called loop unrolling. The body of a given loop is replicated and the number of loop iterations is reduced correspondingly. Thus the loop overhead (caused by loop index calculation and jumping) is reduced, and the loop body contains more instructions, which can then be executed more efficiently. If the unrolled loop is too short, this technique does not pay off, and if the iteration count is not known at compile time, the compiler has difficulty making the right decision. The GENEHUNTER-TWOLOCUS code contained a time-consuming loop nest in which the inner loop count was equal to the number of children, which in families of western societies is usually quite small. A manual interchange of the loops considerably improved the effect of the loop unrolling performed by the compiler (see the sketch after this list).

- Subroutine inlining

Subroutines frequently called in loops cause a high call overhead. In many cases the compiler is able to integrate the called subroutine into the calling subroutine, eliminating this overhead completely. Furthermore, the code of the inlined subroutine then becomes part of the loop body and can be optimized further by the compiler. The code simplifications obtained by the methods listed above reduced the code complexity, which in turn facilitated the inlining by the compiler.
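As an illustration of the first and third items, the following minimal C sketch shows a generic loop nest before and after loop-invariant extraction and loop interchange. The names (f, trans, res, N_CHILDREN, N_VECTORS) are purely hypothetical and are not taken from the GENEHUNTER-TWOLOCUS source, where the actual loops are considerably more complex.

/* Hypothetical sketch of loop-invariant extraction and loop interchange. */
#include <stdio.h>

#define N_CHILDREN 3          /* inner trip count is small in real pedigrees */
#define N_VECTORS  1024       /* stands in for the many inheritance vectors  */

static double f(int v) { return 1.0 / (v + 1); }   /* does not depend on the child index */

int main(void)
{
    static double trans[N_CHILDREN][N_VECTORS];
    static double res[N_VECTORS];
    static double f_tab[N_VECTORS];
    int c, v;

    for (c = 0; c < N_CHILDREN; c++)
        for (v = 0; v < N_VECTORS; v++)
            trans[c][v] = 0.5;                     /* dummy data */

    /* Original form: the inner loop runs over the few children and
       re-evaluates f(v) for every child, although it does not depend on c. */
    for (v = 0; v < N_VECTORS; v++)
        for (c = 0; c < N_CHILDREN; c++)
            res[v] += f(v) * trans[c][v];

    /* Optimized form: f(v) is extracted from the loop nest and computed
       only once per v; the loops are interchanged so that the inner loop
       has the large, compile-time-known trip count and unrolls well. */
    for (v = 0; v < N_VECTORS; v++)
        f_tab[v] = f(v);
    for (c = 0; c < N_CHILDREN; c++)
        for (v = 0; v < N_VECTORS; v++)
            res[v] += f_tab[v] * trans[c][v];

    printf("res[0] = %g\n", res[0]);
    return 0;
}

In the optimized form the invariant value is no longer recomputed for every child, and the inner loop with the large trip count gives the compiler a much better target for unrolling.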

These modifications of only a small part of the serial code of GENEHUNTER-TWOLOCUS led to a considerable reduction of the run time of the whole program at no cost in additional hardware.

Parallelization of the two-locus extension

The computation time can be further reduced by parallelization of those parts which still remain time critical. It turns out that the two-locus extension is particularly well suited for this. It becomes apparent if equation (2) in the manuscript is rewritten as equation (3), an outer sum over the inheritance vectors of bracketed inner sums.

Obviously, the sums in the square brackets can be calculated independently from one another for each inheritance vector; that is, they can be computed in parallel.

In order to parallelize the calculation of these bracketed terms, we used the Message Passing Interface (MPI) library. The MPI library allows parallelization of programs for distributed memory systems as well as for shared memory systems. It is available free of charge for many operating systems, including Linux, and the C programming language is well supported.

In principle, when a program is to be run in parallel, the MPI runtime system starts the same program on as many processors as requested. So, in addition to the serial program logic, one has to distribute the work among the processors explicitly. We employed the following simple workflow:

One processor, frequently called the master or root processor, reads the marker and pedigree data from the corresponding files on disk and distributes the data to all other processors.

Preliminary steps such as the analysis of the family structure are done redundantly by every processor. This is feasible because the resulting data is required by every processor and the time needed is negligible compared to the time for the complete calculation. After that, the bracketed terms are calculated, where each processor works off a certain number of these terms, i.e., the terms for a certain number of inheritance vectors. The actual number of terms varies by at most one between different processors. This is achieved by a cyclic distribution scheme: let N be the number of inheritance vectors and size the number of processors. Then the first processor handles the inheritance vectors 0, size, 2·size, ..., the second processor handles the inheritance vectors 1, size+1, 2·size+1, ..., and so on. Subsequently, every processor performs a partial summation of the outer sum in (3), corresponding to the set of terms it has calculated. These partial sums are then added up to complete the outer sum in (3); a minimal sketch of this scheme is given below. No interprocessor communication is necessary during the calculation of the terms, and the computational effort is exactly the same for each of them. So, as long as every processor has to calculate the same number of terms, there is no load-balancing problem.
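A minimal sketch of this cyclic distribution and the final summation, using the C interface of MPI, is given below. The routine score_term() is a hypothetical placeholder for the per-inheritance-vector computation and is not part of the actual GENEHUNTER-TWOLOCUS code; the value of N is likewise only an example.

/* Sketch of the cyclic work distribution and the final reduction. */
#include <mpi.h>
#include <stdio.h>

static double score_term(long v)      /* dummy stand-in with uniform cost */
{
    return 1.0 / (v + 1);
}

int main(int argc, char **argv)
{
    const long N = 1L << 17;          /* example: number of inheritance vectors */
    int rank, size;
    long v;
    double partial = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Cyclic distribution: processor r handles vectors r, r+size, r+2*size, ... */
    for (v = rank; v < N; v += size)
        partial += score_term(v);

    /* Sum the partial results of the outer sum on the root processor. */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("outer sum = %g\n", total);

    MPI_Finalize();
    return 0;
}

With this scheme, the only communication apart from the initial data distribution is the single reduction at the end, which matches the observation above that no interprocessor communication is needed while the terms are being calculated.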

Since practically all of the computation time is spent on the calculation of these terms, it would not make sense, in the context of two trait loci, to parallelize other parts of the program. In particular, we did not parallelize the Fast Fourier Transform (FFT), which is used to combine the information from multiple markers. The performance analysis tools reported the time spent in the FFT as zero, because it was too small to appear among the displayed figures. This can be explained by the fact that GENEHUNTER version 1.3, on which GENEHUNTER-TWOLOCUS is built, spends roughly equal amounts of time on the following program parts: 1. calculation of inheritance probabilities at each marker, based only on the data for that marker; 2. calculation of multipoint inheritance probabilities using Hidden-Markov-Model marker-to-marker transitions, implemented with FFT methods; 3. calculation of the scoring function. With GENEHUNTER-TWOLOCUS, the number of evaluations of the scoring function increases by a factor of 2^(2n-f) compared with the single-locus version (in our case, from 2^17 to 2^34), whereas the computation time for the other parts is only doubled. This shows that the computation time of GENEHUNTER-TWOLOCUS is completely dominated by the calculation of the scoring function.

Even after optimization and parallelization, GENEHUNTER-TWOLOCUS may still run for hours or even days on larger pedigrees. To avoid losing the results in case a lengthy program run breaks down shortly before regular termination, we have implemented a restart mechanism in GENEHUNTER-TWOLOCUS. This enables the user to continue the analysis from a point shortly before the crash; a rough sketch of the general idea follows.
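The actual restart mechanism is specific to GENEHUNTER-TWOLOCUS and is not reproduced here. The following is only a rough, hypothetical sketch of the general idea, with invented names and file format: the index of the next term to be calculated and the partial sum accumulated so far are written to disk periodically, and an interrupted run reads them back and resumes the loop at that point.

/* Hypothetical checkpoint/restart sketch; not taken from the GENEHUNTER-TWOLOCUS source. */
#include <stdio.h>

/* Store the next loop index and the partial sum accumulated so far. */
static void save_checkpoint(const char *path, long next_v, double partial)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return;
    fprintf(f, "%ld %.17g\n", next_v, partial);
    fclose(f);
}

/* Return 1 and fill the output arguments if a checkpoint exists, else 0. */
static int load_checkpoint(const char *path, long *next_v, double *partial)
{
    FILE *f = fopen(path, "r");
    int ok = 0;
    if (f != NULL) {
        ok = (fscanf(f, "%ld %lf", next_v, partial) == 2);
        fclose(f);
    }
    return ok;
}

int main(void)
{
    const long N = 1000000;
    long v = 0;
    double partial = 0.0;

    /* Resume from the last checkpoint if the previous run was interrupted. */
    load_checkpoint("restart.dat", &v, &partial);

    for (; v < N; v++) {
        partial += 1.0 / (v + 1);        /* stands in for one score term      */
        if (v % 100000 == 0)             /* write a checkpoint periodically    */
            save_checkpoint("restart.dat", v + 1, partial);
    }
    printf("sum = %g\n", partial);
    return 0;
}

In a parallel run, one such checkpoint per processor would be needed, since every processor accumulates its own partial sum.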

Comparison of GENEHUNTER-TWOLOCUS, TLINKAGE and SUPERLINK

Besides GENEHUNTER-TWOLOCUS, there are other linkage programs available which can perform two-trait-locus linkage analysis. TLINKAGE is based on the Elston-Stewart algorithm. As a consequence, the pedigree size is not a limiting factor, but the calculation time and memory demands grow exponentially with the number of markers. SUPERLINK is a new program based on Bayesian networks. Within this framework, the sums over the possible genotypes and pedigree members are optimally arranged according to the data actually analyzed. This feature makes SUPERLINK a very flexible tool, yielding results with reasonable computation time and memory demands for a wide range of combinations of pedigree sizes and numbers of markers. The program MCLINK-2LOCUS also performs two-trait-locus linkage analysis; it employs Markov Chain Monte Carlo methods for approximate likelihood calculation. Here, we compare the different programs with regard to the timing results obtained for linkage analysis with two trait loci and several markers. Since we focus on programs which perform exact calculations, we only included the programs GENEHUNTER-TWOLOCUS, TLINKAGE and SUPERLINK (version 1.2) in the following tests.

The timing experiments have been chosen with the following questions in mind: performance of the programs when analyzing the complete data set of the hypercholesterolemia study, influence of the number of markers used in the analysis, influence of the pedigree size, influence of missing genotypes, and influence of the disease model. With respect to the latter, two trait models were taken into account: the multiplicative, binary model with all penetrances equal to 0 or 1, as shown in Al-Kateb et al. (see table 3 therein), and a soft model derived from the binary model by replacing 0 with 0.01 and 1 with 0.99.

When considering the test results, two different issues have to be taken into account. SUPERLINK employs a stochastic optimization procedure. This implies that the time needed to calculate the LOD score often changes from run to run, even when the experiment is repeated under exactly the same conditions. If different results were obtained for the same test, we report the minimum time; only when the times differed by more than a factor of two do we show both the minimum and the maximum time needed. The second issue is that GENEHUNTER-TWOLOCUS on the one hand and TLINKAGE / SUPERLINK on the other follow two different program strategies. With GENEHUNTER-TWOLOCUS, due to the nature of the Lander-Green algorithm, the score function has to be calculated only once for all inheritance-vector combinations, since it does not depend on the genetic positions of the trait loci. Because the evaluation of the score function consumes almost all of the time needed for the LOD- or NPL-score calculation, it makes virtually no difference whether the score is finally evaluated at one or at ten thousand different positions. SUPERLINK, on the other hand, needs approximately the same amount of time for every disease-locus position within one run. Accordingly, the times shown for GENEHUNTER-TWOLOCUS correspond to the calculation of the LOD score surface over the complete grid spanned by the two chromosomes, i.e., roughly ten thousand points. The times presented for SUPERLINK correspond to the evaluation of the LOD score at only one combination of disease-locus positions and therefore need to be multiplied by the number of combinations to be evaluated.

First, we compare the performance of SUPERLINK, TLINKAGE and GENEHUNTER-TWOLOCUS when analyzing the complete pedigree (shown in fig. 3) and varying the number of markers, using the binary as well as the soft disease model described above. The results are shown in table S1. The time GENEHUNTER-TWOLOCUS needs to accomplish the task is practically independent of the number of markers and of the assumed disease model. For a single combination of positions of the first and second trait locus, SUPERLINK and TLINKAGE are very fast, outperforming GENEHUNTER-TWOLOCUS by far. For SUPERLINK this holds even if a considerable number of markers is used. However, we did not manage to complete a run with TLINKAGE using more than two markers. Moreover, if the number of markers exceeds 14, the times for SUPERLINK become comparable to those of GENEHUNTER-TWOLOCUS (see table S1). This holds when only one processor is assumed for the calculation with GENEHUNTER-TWOLOCUS; if a large number of processors is used, the break-even point already occurs at a smaller number of markers. But if we want to employ the full number of markers available for the hypercholesterolemia study, the time for SUPERLINK to complete the LOD score calculation, even for a single disease-locus position, is prohibitively large. This particular job is only feasible with GENEHUNTER-TWOLOCUS. The computation times for TLINKAGE and SUPERLINK with the soft model show a slight tendency to be higher, which might be caused by the fact that the number of genotypes that can be excluded from the LOD score calculation due to incompatibility with the data is reduced with the soft model. The times shown for SUPERLINK runs with more than 14 markers are extrapolated from tests that had to be interrupted because of the infeasibly long expected calculation times. One also observes a more-than-linear rise of the computation time with the number of markers for SUPERLINK.

Since our pedigree contains an inbreeding loop, we performed a second experiment in which this loop was removed, resulting in a smaller pedigree with only the 6 children and their parents (i.e., 10 bits). The execution times for the soft as well as for the binary model are presented in table S2. One observes, in principle, the same characteristics as in the first experiment. The tendency of SUPERLINK to spend more time on the LOD score calculation under the soft model is more pronounced than in the first experiment. As far as the runs could be completed, TLINKAGE also shows this behavior, although to a lesser degree. Within this test we encountered the problem that SUPERLINK terminated with a segmentation violation error under the binary disease model when more than 20 markers were included. This happened despite the fact that 2GB of memory were available (i.e., 1GB main memory plus 1GB of swap disk space). Once again, whereas the calculation times for SUPERLINK have to be multiplied by the number of combinations of disease-locus positions to be analyzed, the times for GENEHUNTER-TWOLOCUS already include the calculations for the complete two-dimensional grid. Furthermore, the calculation times of GENEHUNTER-TWOLOCUS are independent of the number of markers and of the disease model; this can be explained by the fact that the scoring function, which completely dominates the overall computation time, does not depend on the number of markers.

In addition, we analyzed the behavior of the three programs with the reduced pedigree when the parents' genotypes are missing, varying the number of markers and the disease model as before. Unfortunately we were able to obtain only a small number of results (see table S3), since tests with SUPERLINK could not be completed for more than 6 markers due to segmentation violation errors. For those runs that could be completed, one observes that the times needed by SUPERLINK for a LOD score calculation are several orders of magnitude higher when genotypes are missing. TLINKAGE also runs longer in case of missing genotypes, whereas the computation time needed by GENEHUNTER-TWOLOCUS is the same as when all genotypes are available.

The memory requirements of SUPERLINK, TLINKAGE and GENEHUNTER-TWOLOCUS have also been examined in the three scenarios used for the performance tests. Before discussing the results, two issues have to be mentioned. The memory requirement of SUPERLINK was not entirely reproducible but varied when the same test case was repeated several times; this behaviour might be caused by the stochastic optimization procedure of SUPERLINK (version 1.2). We report here only the minimum requirements. For GENEHUNTER-TWOLOCUS, the memory requirement has been measured for each case with different numbers of processors. We report the memory usage per processor, so the total memory used is the memory shown in tables S4-S6 times the number of processors.

Table S4 shows the memory requirements for the complete pedigree with different numbers of markers. Even though SUPERLINK tends to need less memory than GENEHUNTER-TWOLOCUS, it is obvious that, for every test case, the memory demanded by both programs is within the same order of magnitude. Evidently, both programs use more memory if more markers are included in the analysis. It has to be taken into account that, with the reported amount of memory for one program run, GENEHUNTER-TWOLOCUS handles the calculation of LOD scores, NPL scores and marker information content for a grid of ten thousand points. The memory demands of GENEHUNTER-TWOLOCUS per processor are reduced with an increasing number of processors. This effect is most pronounced for large pedigrees.

The aforementioned points also hold for the reduced pedigree (table S5). However, here the memory requirement of GENEHUNTER-TWOLOCUS is within a range where it depends heavily on the particular computing environment. So instead of a reduction in memory per processor when using more processors, the required memory rises by a constant amount, which is due to the administrative overhead of parallel execution.

When examining the memory requirements of SUPERLINK for the reduced pedigree without parental genotypes (table S6), it becomes apparent that SUPERLINK needs more memory than in the case with genotyped parents. The memory demands of GENEHUNTER-TWOLOCUS are neither influenced by missing genotypes nor dependent on the disease model used.

Complementing the performance tests described above, we extended the reduced pedigree by four more children, resulting in an 18-bit pedigree. GENEHUNTER-TWOLOCUS finished the analysis of this pedigree within 33 hours using 64 processors. The memory requirement per processor decreases from 711MB for a one-processor run to 427MB per processor for a 24-processor run. This gives an indication of the pedigree size that is still manageable with our new version of GENEHUNTER-TWOLOCUS.

To summarize our results, SUPERLINK is very well suited for large pedigrees, even if a considerable number of markers is involved, and can thus handle some of the cases that are out of reach for the Elston-Stewart-based TLINKAGE and the Lander-Green-based GENEHUNTER-TWOLOCUS. Still, the computation time of SUPERLINK grows more than linearly with the number of markers, and the program does not perform as well if some of the individuals are untyped. Therefore, the data of the hypercholesterolemia study cannot be analyzed by SUPERLINK with the complete set of markers, as is possible with GENEHUNTER-TWOLOCUS. TLINKAGE can also handle large pedigrees, but only with very few markers. The new version of GENEHUNTER-TWOLOCUS presented here can analyze moderately large pedigrees, up to at least 18 bits, while handling large numbers of markers and disease-locus positions. The fact that GENEHUNTER-TWOLOCUS calculates complete LOD and NPL score surfaces at practically no additional cost is particularly useful in the context of a two-disease-locus analysis.

Table S1:

Execution times with GENEHUNTER-TWOLOCUS, SUPERLINK and TLINKAGE for the complete 17-bit pedigree

Number of markers | 1 | 2 | 3 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22
SUPERLINK / binary disease model | 0.005s | 0.75s | 1.1s | 10s | 78s | 420s | 720s | 18000s | 7.2·10^5 s* | 3.0·10^6 s* | 5.04·10^6 s* | 4.4·10^8 s* | 6.3·10^8 s*
SUPERLINK / soft disease model | 0.005s | 0.75s | 1.3s | 8s | 120s | 810s | 6000s | 61200s | 8.3·10^5 s* | 3.1·10^6 s* | 1.08·10^7 s* | 7.8·10^7 s* | 2.1·10^10 s*
TLINKAGE / binary disease model | 0.8s | 14400s | runs with 3 or more markers were not completed (see notes ** and ***)
TLINKAGE / soft disease model | 0.9s | 25200s | runs with 3 or more markers were not completed (see notes ** and ***)
GENEHUNTER-TWOLOCUS | 6.84·10^6 s with one processor; actually, 272 processors were used: 25128s (independent of the number of markers)

The times are reported for a 1GHz Pentium3 PC, except for the GENEHUNTER-TWOLOCUS program which was run on a cluster consisting of UltraSparc III (Cu) 900 MHz processors. We have found that the UltraSparc processors run approximately 1.4 times faster than the Pentium3 processor with our application.

*) These times have not been taken from the complete run. Only a certain percentage has been completed and the total time extrapolated.

**) The program was stopped after running for 300 hours without finishing the first LOD score calculation.

***) The program terminated with a segmentation violation error.

Table S2:

Execution times with GENEHUNTER-TWOLOCUS, SUPERLINK and TLINKAGE for the reduced (10-bit) pedigree

Number of markers | 1 | 2 | 3 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22

SUPERLINK / binary disease model