
1

Lab 2 Parallel processing using NIOS II processors

CEG 4131 Computer Architecture III

Miodrag Bolic

2

Overview

You will learn how to:

• Design multiprocessing systems that use shared memories

• Partition a sequential program so that it can be implemented on multiple processors

• Synchronize a multiprocessing system

• Time: 3 weeks

• Points: 115 (there is an optional task)

3

Overview

• Part 1

– Design a multiprocessing system by following the steps from the tutorial. Run and debug the program that comes with the tutorial.

• Part 2

– Use the same hardware designed in Part 1

– Develop a program for parallel matrix multiplication and run it on the multiprocessing system

– Compute the speedup of the program when it runs on a single processor and on a multiprocessing system
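Speedup here is the ratio of the two measured execution times. A minimal sketch, assuming both runs report clock cycles from the performance counter; the function name speedup and its parameters are hypothetical and not part of the lab code:

```c
#include <assert.h>
#include <stdint.h>

/* Speedup = sequential cycles / parallel cycles.
 * A value above 1.0 means the multiprocessor version is faster. */
double speedup(uint64_t sequential_cycles, uint64_t parallel_cycles)
{
    assert(parallel_cycles > 0); /* guard against division by zero */
    return (double)sequential_cycles / (double)parallel_cycles;
}
```

For example, speedup(2000, 1000) evaluates to 2.0.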

4

Part 1

• Copy the project C:\altera\kits\nios2\examples\vhdl\niosII_stratix_1s10\standard to your home directory

• Go through the steps of the tutorial “Creating Multiprocessor Nios II System tutorial”. You can download the tutorial from tt_nios2_multiprocessor_tutorial.pdf and a program from http://www.altera.com/literature/tt/hello_world_multi.c

• Modification: On page 30 of the tutorial, choose the NIOS II/s core for CPU3 instead of NIOS II/e. All three cores have to be NIOS II/s. Change the instruction cache size of all three cores to 4 KB.

• Before generating and compiling on page 36 of the tutorial, do the following:

– Add a performance counter in the same way as in Lab 1. Connect performance_counter only to the data master of CPU1.

– Add an on-chip memory block and configure it as shown on the next page. Connect the s1 port to cpu1/data_master and cpu2/data_master. Connect the s2 port to cpu3/data_master.

• Continue with the tutorial.


5

On-chip memory configuration

6

Task 1 – Demonstration and Questions

• Show to the TA that the program is working (20 points)

• Questions:

1. Describe the program in detail.

2. Why do we need a mutex?

3. If processor 1 gets a mutex for the memory message_buffer_ram, can processor 2 write to this memory before processor 1 releases the mutex?

4. Can processor 1 store two messages in the buffer?

7

Part 2

• In this part, the same hardware configuration will be used.

• You will design a program for parallel matrix multiplication.

• Problem:

There is an input/output module which receives data and stores it in matrices M1 and M2. We will simulate this module using the shared_memory module that we added in the first part of the lab. Our program multiplies these two matrices and stores the result C in the same module (memory).

8

Sequential solution

• Program the Altera chip using the same configuration from part 1.

• Modify the matrix_performance.c file so that matrices M1, M2 and C are transferred to the shared_memory. Do this step before activating the performance counter. Change the number of iterations in matrix multiplication from 100 to 1000.

• Change the C/C++ options in your project and syslib project from Debug to Release.

• Run the code and present the performance counter results and the matrix C obtained in iteration 1000.

• Demonstration: show the result to the TA.
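The transfer of M1, M2 and C into shared_memory described above could be sketched as follows. This is only an illustration under stated assumptions: the struct shared_matrices_t, the function load_shared and the layout of the three matrices are hypothetical, and matrix_performance.c may organize its data differently. On the board the pointer would refer to the base address of the shared_memory component; here any buffer of the right size works, so the sketch also compiles on a host machine.

```c
#include <assert.h>
#include <string.h>

#define N 10

/* Hypothetical layout of the three matrices inside the shared memory block.
 * With 4-byte ints this occupies 3 * 10 * 10 * 4 = 1200 bytes. */
typedef struct {
    int M1[N][N];
    int M2[N][N];
    int C[N][N];
} shared_matrices_t;

/* Copy the input matrices into the shared block and clear the result.
 * In the lab this would run before the performance counter is started. */
void load_shared(shared_matrices_t *shared, int M1[N][N], int M2[N][N])
{
    assert(shared != NULL);
    memcpy(shared->M1, M1, sizeof shared->M1);
    memcpy(shared->M2, M2, sizeof shared->M2);
    memset(shared->C, 0, sizeof shared->C);
}
```

The multiplication loop would then read shared->M1 and shared->M2 and write shared->C instead of local arrays.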

9

Parallel solution

• CPU1 will be used for synchronization and for I/O operations, while CPU2 and CPU3 are used for multiplication. CPU2 and CPU3 operate in single program multiple data (SPMD) mode. This means that they start the iterations at the same time and execute the same code, but on different data. After they finish the multiplication, they signal CPU1. The program will repeat the multiplication of the matrices 1000 times.

10

Parallel matrix multiplication

• CPU1 transfers M1 and M2 to the shared_memory.

• Algorithm

The sequential program is shown below. In the parallel implementation, CPU2 will execute the i loop from 0 to 4, and CPU3 will execute the i loop from 5 to 9. CPU2 and CPU3 will perform their operations at the same time.

for (i = 0; i <= 9; i++) {
    for (j = 0; j <= 9; j++) {
        C[i][j] = 0;
        for (k = 0; k <= 9; k++) {
            C[i][j] += M1[i][k] * M2[k][j];
        }
    }
}
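One way to split the i loop between the two worker CPUs is to pass the row range as parameters. The sketch below is an assumption about how this could be organized, not the lab's reference solution; the function name multiply_rows and its parameters are hypothetical.

```c
#include <assert.h>

#define N 10

/* Compute rows first_row..last_row (inclusive) of C = M1 * M2.
 * Only the listed rows of C are written; M1 and M2 are only read. */
void multiply_rows(int M1[N][N], int M2[N][N], int C[N][N],
                   int first_row, int last_row)
{
    int i, j, k;

    assert(0 <= first_row && first_row <= last_row && last_row < N);

    for (i = first_row; i <= last_row; i++) {
        for (j = 0; j < N; j++) {
            int sum = 0;
            for (k = 0; k < N; k++) {
                sum += M1[i][k] * M2[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```

With this split, CPU2 would call multiply_rows(M1, M2, C, 0, 4) and CPU3 would call multiply_rows(M1, M2, C, 5, 9), both running the same code on different rows.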

11

Synchronization

• Variables status_start and status_done will be shared variables used for synchronization. All three processors will access these variables using the mutex. They will be stored in the message_buffer_ram memory.

• It is extremely important that both CPU2 and CPU3 start matrix multiplication at the same time. This will not happen automatically since they are booted from the same memory. So, CPU1 has to ensure that both CPU2 and CPU3 start at the same time. The shared variable status_start will be used for that. CPU1 has to set this variable to 1, and CPU2 and CPU3 each have to increment it before they start matrix multiplication. When status_start is 3, CPU2 and CPU3 will start matrix multiplication and CPU1 will start measuring time using the performance_counter.

• At the beginning, CPU1 will set status_done to 1. After CPU2 and CPU3 finish 1000 iterations of 10x10 matrix multiplication, they each increment status_done. CPU1 periodically reads the variable status_done, and when it is 3, the program is over. CPU1 stops the performance_counter and prints the performance_counter result and matrix C from the 1000th iteration on the terminal.
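The counting scheme above can be sketched as a handful of small helper functions. Everything named here is an assumption for illustration: the struct sync_vars_t, the helper names, and the sync_lock/sync_unlock hooks are hypothetical. On the board the hooks would wrap the hardware mutex used in the tutorial program, and the struct would be placed in message_buffer_ram; here the hooks are empty stubs so the sketch compiles on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Shared synchronization variables (hypothetical layout). */
typedef struct {
    volatile uint32_t status_start;
    volatile uint32_t status_done;
} sync_vars_t;

/* Hypothetical lock hooks: empty stubs on the host. On the board they
 * would acquire and release the hardware mutex. */
void sync_lock(void)   { }
void sync_unlock(void) { }

/* CPU1: set both counters to 1 before the workers check in. */
void cpu1_init(sync_vars_t *s)
{
    assert(s != 0);
    sync_lock();
    s->status_start = 1;
    s->status_done = 1;
    sync_unlock();
}

/* CPU2/CPU3: announce readiness by incrementing status_start. */
void worker_ready(sync_vars_t *s)
{
    sync_lock();
    s->status_start++;
    sync_unlock();
}

/* CPU2/CPU3: report that all iterations are finished. */
void worker_done(sync_vars_t *s)
{
    sync_lock();
    s->status_done++;
    sync_unlock();
}

/* Any CPU: read one shared counter under the mutex. */
uint32_t sync_read(volatile uint32_t *counter)
{
    uint32_t value;
    sync_lock();
    value = *counter;
    sync_unlock();
    return value;
}

/* True once CPU1 and both workers have checked in (1 + 1 + 1 = 3). */
int all_started(sync_vars_t *s) { return sync_read(&s->status_start) == 3; }

/* True once both workers have finished (1 + 1 + 1 = 3). */
int all_done(sync_vars_t *s) { return sync_read(&s->status_done) == 3; }
```

In this sketch each worker would call worker_ready() and then poll all_started() before multiplying, while CPU1 polls all_started() to start the performance counter and all_done() to stop it.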

12

Task 2 - Questions

• What is the speedup if we compare the sequential and parallel implementations? Comment on the speedup result.

• Why can we design a program for matrix multiplication without using mutexes (except for synchronization)?

13

Demonstration (40)

• Send matrix C from the 1000th iteration of the matrix multiplication algorithm to the terminal through JTAG UART. Also send the number of clock cycles from the performance counter.

• Show this result to the TA. Explain to the TA how your parallel matrix multiplication program works and how you achieved synchronization. You will lose 10 points if the speedup is less than 1.

14

Optional part – Synchronization

• If our program emulates a real system, then CPU1 should synchronize both CPU2 and CPU3 after one iteration of 10x10 matrix multiplication and not after 1000 of them. So, in a real program, after each 10x10 matrix multiplication CPU1 will perform some operations on the computed matrix C and start a new iteration of 10x10 matrix multiplication if matrices M1 and M2 are ready.

• In this part of the lab, you will use the iteration_done variable to notify CPU1 that one iteration of 10x10 matrix multiplication is done. An additional shared variable is needed for the start of the next iteration. Let’s call it start_next_iteration.

• The program works as follows. At the beginning, CPU1 sets start_next_iteration. After an iteration of 10x10 multiplication starts, CPU2 and CPU3 reset this variable. After CPU2 and CPU3 are done with their part of the 10x10 matrix multiplication, they increment iteration_done and wait for start_next_iteration to be set. CPU1 checks whether iteration_done is equal to 3, and if it is, CPU1 sets start_next_iteration. The new iteration of 10x10 matrix multiplication can then start.
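The per-iteration handshake described above can be sketched with two shared variables. This is an illustration only: the struct iter_sync_t and the helper names are hypothetical, mutex locking around each access is omitted for brevity, and the sketch assumes that iteration_done starts at 1 and is reset to 1 by CPU1 at the start of each iteration, which the text does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the two shared variables. */
typedef struct {
    volatile uint32_t start_next_iteration; /* set by CPU1, reset by workers */
    volatile uint32_t iteration_done;       /* assumed: 1 = idle, 3 = both done */
} iter_sync_t;

/* CPU1: re-arm the done counter (assumption) and release the workers. */
void cpu1_start_iteration(iter_sync_t *s)
{
    assert(s != 0);
    s->iteration_done = 1;
    s->start_next_iteration = 1;
}

/* Worker: may the next iteration begin? */
int worker_may_start(const iter_sync_t *s)
{
    return s->start_next_iteration == 1;
}

/* Worker: reset the flag once the iteration has started. */
void worker_reset_start(iter_sync_t *s)
{
    s->start_next_iteration = 0;
}

/* Worker: report that its half of the multiplication is finished. */
void worker_iteration_done(iter_sync_t *s)
{
    s->iteration_done++;
}

/* CPU1: have both workers finished this iteration? */
int cpu1_iteration_complete(const iter_sync_t *s)
{
    return s->iteration_done == 3;
}
```

Because both workers share a single start flag in this sketch, a real implementation has to make sure that one worker does not reset start_next_iteration before the other has seen it set.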

15

Optional part – Demonstration and Questions

Question

• What is the speedup of this program?

Demonstration (10 optional points)

• Send the sum of the elements of matrix C of each iteration of the 10x10 matrix multiplication algorithm to the terminal through JTAG UART. Also send the number of clock cycles from the performance counter.

• Show this result to the TA. Explain to them how you achieved synchronization.

16

What to submit

Report contains the following (30 points):

• Title page

• Description of your system with the picture of SOPC Builder System Components

• Detailed description of your solution of the algorithm for parallel matrix multiplication and synchronization

• Answers to the questions from tasks 1-2

• Conclusions

• Page 17 of this document signed by the TA

• Soft copies of the report and source code of the programs for sequential and parallel multiplication with basic comments (*.c files) and Quartus II files *.sof and *.ptf (10 points).

• Optional: Description of the synchronization method and speedup for the optional part as a part of the report. Soft copy of the algorithm for matrix multiplication. (5 points)

17

Lab 2 – Signature page

Student name:

Student name:

                    Demonstrated (TA’s signature)   Performance_counter result - Time   Points

Part 1                                              /                                   ____/20

Part 2 sequential

Part 2 parallel                                                                         ____/40

Part 2 optional                                                                         ____/10

Total               /                               /                                   ____
