IXP Training Part 3 Programming Tips 2011.04.12


Page 1: IXP Training Part 3 Programming Tips 2011.04.12. NCKU CSIE CIAL Lab2 Outline Memory Absolute Instruction Selection Task Partition Memory Relative Reducing

IXP Training Part 3 Programming Tips

2011.04.12


NCKU CSIE CIAL Lab 2

Outline

- Memory Absolute
  - Instruction Selection
  - Task Partition
- Memory Relative
  - Reducing Overhead
    - Reduce the number of memory accesses
    - Reduce average access latency
  - Hiding Overhead


Memory Absolute Tips

- Instruction Selection
  - General Coding Skill
  - Use Hardware Instruction
- Task Partition
  - Multi-Processing
  - Context-Pipelining


Coding Skill

- Loop Unrolling
- Shift Operation
- Inline Function
  - __inline & __forceinline
- Branch Prediction
  - Branch Prediction Penalty
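
As a rough illustration of the first two items, the sketch below is plain C rather than Microengine C. The function names are made up for this example. It shows a power-of-two multiply and divide written as shifts, and an 8-element sum written both as a loop and unrolled by hand.

```c
#include <stdint.h>

/* Shift operation: multiply or divide an unsigned value by a power of two. */
static inline uint32_t times_eight(uint32_t x) { return x << 3; }
static inline uint32_t div_by_four(uint32_t x) { return x >> 2; }

/* Rolled loop: one compare-and-branch per element. */
uint32_t sum8_rolled(const uint32_t v[8])
{
    uint32_t sum = 0;
    for (unsigned i = 0; i < 8; i++)
        sum += v[i];
    return sum;
}

/* Unrolled by hand: same result, no loop branches. */
uint32_t sum8_unrolled(const uint32_t v[8])
{
    return v[0] + v[1] + v[2] + v[3] + v[4] + v[5] + v[6] + v[7];
}
```

Unrolling trades code size for fewer branches, so it is most attractive for short loops with a fixed trip count.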


Hardware Instruction

- POP_COUNT
- FFS
- Multiply
- CRC
- Hashing
- CAM


POP_COUNT--Brief

- Population Count
  - Reports the number of bits set in a 32-bit register
- Example:

    pop_count( 0x3121 ) = ?
    0x3121 = 0011 0001 0010 0001
    Result = 5


POP_COUNT--Naïve Implementation

    unsigned int pop_count_for(unsigned int x)
    {
        unsigned int y = 0;
        unsigned int i;

        /* Test each of the 32 bits in turn, lowest bit first */
        for (i = 0; i < 32; i++) {
            if ((x & 1) == 1)
                y++;
            x = x >> 1;
        }
        return y;
    }


POP_COUNT--Faster Implementation

    unsigned int pop_count_agg(unsigned int x)
    {
        x -= ((x >> 1) & 0x55555555);                      /* 2-bit partial sums */
        x = (((x >> 2) & 0x33333333) + (x & 0x33333333));  /* 4-bit partial sums */
        x = (((x >> 4) + x) & 0x0f0f0f0f);                 /* 8-bit partial sums */
        x += (x >> 8);
        x += (x >> 16);
        return (x & 0x0000003f);                           /* total fits in 6 bits */
    }

Reference http://aggregate.org/MAGIC/


POP_COUNT--Hardware Instruction

    unsigned int pop_count_hardware(unsigned int x)
    {
        return pop_count(x);
    }


POP_COUNT--Additional Information

Bitmap-RFC (Liu, TECS 2008)


FFS

- Finds the first (least significant) bit set in the data and returns its position
- Example:

    ffs( 0x3121 ) = 0    0011 0001 0010 0001
    ffs( 0x3120 ) = 5    0011 0001 0010 0000
    ffs( 0x3100 ) = 8    0011 0001 0000 0000
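
For comparison with the hardware instruction, here is a minimal software equivalent in plain C. The name `ffs_portable` and the choice to return -1 for a zero input are assumptions of this sketch, not behaviour documented in the slides.

```c
#include <stdint.h>

/* Returns the 0-based position of the least significant set bit,
 * or -1 when x is zero (a convention chosen for this sketch). */
int ffs_portable(uint32_t x)
{
    int pos = 0;

    if (x == 0)
        return -1;
    while ((x & 1u) == 0) {   /* shift right until bit 0 is set */
        x >>= 1;
        pos++;
    }
    return pos;
}
```

The loop runs up to 31 iterations in the worst case, which is the cost the single hardware instruction avoids.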


Multiply

- Specific multiply instructions
  - Multiply_24x8()
  - Multiply_16x16()
  - Multiply_32x32_hi()
  - Multiply_32x32_lo()
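
The variant names suggest the operand widths and, for the 32x32 case, which half of the 64-bit product is returned. The sketch below models that reading in plain C. The `mul_*` functions are hypothetical stand-ins, and their exact correspondence to the hardware intrinsics is an assumption.

```c
#include <stdint.h>

/* Low 32 bits of the 64-bit product of two 32-bit operands. */
uint32_t mul_32x32_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* High 32 bits of the same 64-bit product. */
uint32_t mul_32x32_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* 16-bit by 16-bit multiply: the product always fits in 32 bits. */
uint32_t mul_16x16(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}
```

Picking the narrowest variant that covers the operand range is the point of having several multiply forms.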


CRC

Example of CRC operation

    crc_write( 0x42424242 );
    crc_32_be( source_address, bytes_0_3 );
    crc_32_be( dest_address, bytes_0_3 );
    …
    Cache_index = crc_read();
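
The same seed, fold, read pattern can be sketched in portable C. This is only an illustration: it uses the common MSB-first CRC-32 polynomial 0x04C11DB7 and does not claim to match the microengine CRC unit bit for bit. The function names and the power-of-two table size are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0x04C11DB7u

/* Fold one byte into the running remainder, most significant bit first. */
uint32_t crc32_be_byte(uint32_t crc, uint8_t byte)
{
    crc ^= (uint32_t)byte << 24;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 0x80000000u) ? (crc << 1) ^ CRC32_POLY : (crc << 1);
    return crc;
}

/* Fold a 32-bit word, highest byte first. */
uint32_t crc32_be_word(uint32_t crc, uint32_t word)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        crc = crc32_be_byte(crc, (uint8_t)(word >> shift));
    return crc;
}

/* Hash an address pair into a table index.
 * table_size must be a power of two (assumption of this sketch). */
uint32_t flow_index(uint32_t src, uint32_t dst, uint32_t table_size)
{
    uint32_t crc = 0x42424242u;      /* seed value, as in the slide */
    crc = crc32_be_word(crc, src);
    crc = crc32_be_word(crc, dst);
    return crc & (table_size - 1);
}
```

In software each word costs 32 shift-and-test steps, which is why offloading the computation to the CRC unit pays off.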


Hash

- Hash_48()
- Hash_64()
- Hash_128()

Example:

    SIGNAL sig_hash;
    Hash_48(data_out, data_in, count, sig_done, &sig_hash);
    __wait_for_all(&sig_hash);


CAM--Brief

- Each ME has a CAM with 16 32-bit entries
- The CAM is private to its ME and is not visible to other MEs
- On a lookup, all entries are searched in parallel
- On a successful lookup, the index of the matching entry is returned
- Otherwise, the index of the entry to be replaced is returned


Content Addressable Memory--Structure

cam_lookup_t


CAM--Usage

    cam_lookup_t cam_result;
    cam_result = cam_lookup( data );
    if( cam_result.hit == 1 ) {
        // Hit: access entry cam_result.entry_num
        …
    }
    else {
        // Miss: cam_result.entry_num is the entry to replace
        ……
        cam_write( cam_result.entry_num, data, 15 );
    }
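
To make the lookup and write behaviour concrete, here is a small software model of a 16-entry CAM in plain C. The `cam_model_*` names and types are invented for this sketch. Least-recently-used replacement is an assumption about how the entry to be replaced is chosen, and the state argument of `cam_write` is left out.

```c
#include <stdint.h>

#define CAM_ENTRIES 16

typedef struct {
    unsigned hit;        /* 1 on a match, 0 on a miss */
    unsigned entry_num;  /* matching entry, or the entry to replace */
} cam_result_model_t;

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  valid[CAM_ENTRIES];
    uint32_t last_use[CAM_ENTRIES];  /* larger value = used more recently */
    uint32_t clock;
} cam_model_t;

void cam_model_clear(cam_model_t *cam)
{
    for (unsigned i = 0; i < CAM_ENTRIES; i++) {
        cam->tag[i] = 0;
        cam->valid[i] = 0;
        cam->last_use[i] = 0;
    }
    cam->clock = 0;
}

/* The hardware compares all entries in parallel; this model loops. */
cam_result_model_t cam_model_lookup(cam_model_t *cam, uint32_t tag)
{
    cam_result_model_t r = { 0, 0 };
    unsigned oldest = 0;

    for (unsigned i = 0; i < CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == tag) {
            cam->last_use[i] = ++cam->clock;  /* a hit refreshes the entry */
            r.hit = 1;
            r.entry_num = i;
            return r;
        }
        if (cam->last_use[i] < cam->last_use[oldest])
            oldest = i;
    }
    r.entry_num = oldest;  /* miss: report the least recently used entry */
    return r;
}

void cam_model_write(cam_model_t *cam, unsigned entry, uint32_t tag)
{
    cam->tag[entry] = tag;
    cam->valid[entry] = 1;
    cam->last_use[entry] = ++cam->clock;
}
```

A caller follows the same shape as the usage above: look up, and on a miss write the new tag into the returned entry.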


Task Partition

- Multi-Processing
  - More computing power
  - Easy to implement
- Context-Pipelining
  - More usable resources
  - Hard to balance


Memory Relative Tips--Reducing Overhead

- Reduce the number of memory accesses
  - Wide-Word Accesses
  - Result Caches
- Reduce average access latency
  - Multi-Level Memory Hierarchy
  - Data Cache


Wide-Word Accesses--Brief

- Access the needed data in batches
- Reduces the number of accesses required
- Useful when the data form a linked-list-like structure


Wide-Word Access--Example

    MEM_ADDR +0    ……
             +4    ……
             +8    ……
             +12   ……
             +16   ……
             +20   ……
             +24   ……
             +28   ……


Wide-Word Accesses--Usage (One Node per Access)

    __declspec(sram_read_reg) UINT32 A;
    SIGNAL sig_read;

    sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read );
    __wait_for_all( &sig_read );

    // Access A ......

Result: 8 accesses are needed


Wide-Word Accesses--Usage (Two Nodes per Access)

    __declspec(sram_read_reg) UINT32 A[2];
    SIGNAL sig_read;

    sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read );
    __wait_for_all( &sig_read );

    // Access A ......

Result: 4 accesses are needed


Wide-Word Accesses--Usage (Four Nodes per Access)

    __declspec(sram_read_reg) UINT32 A[4];
    SIGNAL sig_read;

    sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read );
    __wait_for_all( &sig_read );

    // Access A ......

Result: 2 accesses are needed


Wide-Word Accesses--Experiment

- Platform: IXP2800
- Total accesses: 8 LW (8 * 4 bytes)

    Case              Total Cycles    Average Cycles / LW
    1 LW * 8 times    1211            151.38
    2 LW * 4 times    725             90.63
    4 LW * 2 times    460             57.50
    8 LW * 1 time     387             48.38
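
The two derived quantities in the table can be written out as short C helpers. The function names are made up for this sketch. The number of read commands is the total longword count divided by the batch width, rounded up, and the average is the total cycle count divided by the 8 longwords fetched.

```c
/* Read commands needed to fetch total_lw longwords when each command
 * returns lw_per_access longwords (rounded up). */
unsigned accesses_needed(unsigned total_lw, unsigned lw_per_access)
{
    return (total_lw + lw_per_access - 1) / lw_per_access;
}

/* Average cycles spent per longword. */
double avg_cycles_per_lw(unsigned total_cycles, unsigned total_lw)
{
    return (double)total_cycles / total_lw;
}
```

For instance, `avg_cycles_per_lw(1211, 8)` gives 151.375, which the table rounds to 151.38. A single 8-LW read takes 387 cycles against 1211 for eight 1-LW reads, roughly one third.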


Wide-Word Accesses--Limitation

- Data must be contiguous
  - Suitable for linear search
  - Does not support random accesses
- The number of transfer registers is fixed
  - Each thread has 16 read / write registers
  - The Tx-Regs may be reserved by others


Result Cache--Brief

- Caches the results of the application
- If the same fields appear again, the cached result is returned
- Memory accesses are reduced on a cache hit
- Effectiveness depends on the temporal locality of the traffic
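
A minimal sketch of the idea in plain C, assuming a direct-mapped cache of 64 slots indexed by the low bits of the key. The slot count, the `result_cache_t` type, and the `slow_classify` placeholder (standing in for the real memory-bound lookup) are all assumptions made for illustration.

```c
#include <stdint.h>

#define RC_SLOTS 64   /* assumed size; must be a power of two */

typedef struct {
    uint32_t key[RC_SLOTS];
    uint32_t result[RC_SLOTS];
    uint8_t  valid[RC_SLOTS];
    unsigned hits;
    unsigned misses;
} result_cache_t;

/* Hypothetical stand-in for the expensive, memory-bound lookup. */
static uint32_t slow_classify(uint32_t key)
{
    return key * 2654435761u;
}

uint32_t cached_classify(result_cache_t *rc, uint32_t key)
{
    unsigned slot = key & (RC_SLOTS - 1);

    if (rc->valid[slot] && rc->key[slot] == key) {
        rc->hits++;                      /* hit: no slow lookup needed */
        return rc->result[slot];
    }
    rc->misses++;                        /* miss: compute and remember */
    rc->key[slot] = key;
    rc->result[slot] = slow_classify(key);
    rc->valid[slot] = 1;
    return rc->result[slot];
}
```

A direct-mapped layout needs no replacement policy, at the price of conflict misses when two active keys map to the same slot.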


Result Cache--IXP2400

No hardware cache is supported in IXP2400 ME

Not easy to implement set-associative cache

Replacement policy will also be an overhead


Result Cache--Design Consideration

- Shared or private cache?
- Size of the cache?
- Works with specific hardware?
- Miss penalty handling?


Result Cache--Experiment


Multi-Level Memory Hierarchy--Brief

- Reduces the average access latency
- The number of accesses remains unchanged
- If the data can fit in a faster memory, place it there


Multi-Level Memory Hierarchy--Data Placement

    Data characteristics        Placement
    Small, read-only            Hard code
    Small, needs updating       Local Memory
    Larger                      Scratchpad
    Largest                     SRAM


Multi-Level Memory Hierarchy--Packet Data Type

- Packet related data
  - Temporary data
  - Valid with a specific packet
  - Local Memory
- Flow related data
  - Related to a specific flow
  - Spatial locality
  - Wide-Word Access
- Application related data
  - Valid with a specific application
  - Temporal locality
  - Result Cache


Split-Cache (Z. Liu, IET-COM 2007)

- Two separate hardware caches: one for application data and one for flow data


Data Cache--Brief

- A hardware cache mechanism that caches data for packet processing
  - App-Cache
  - Flow-Cache
- However, it is not supported by the IXP2400


Data Cache--CAM + Local Memory

- CAM combined with Local Memory acts like a hardware cache
- However, the number of CAM entries is small
- Each CAM entry may work together with several Local Memory cache entries
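
One way to picture this arrangement is the plain C model below. The 16 tags play the role of the CAM, and each tag covers a block of several words held in an array that stands in for Local Memory. The block size of 4 words, the round-robin replacement, and the `backing_read` placeholder for SRAM are all assumptions of this sketch.

```c
#include <stdint.h>

#define TAGS 16            /* CAM entries per ME, per the slides */
#define WORDS_PER_TAG 4    /* assumed block size per CAM entry */

typedef struct {
    uint32_t tag[TAGS];
    uint8_t  valid[TAGS];
    uint32_t local_mem[TAGS][WORDS_PER_TAG];  /* models Local Memory */
    unsigned next_victim;                     /* round-robin replacement */
    unsigned misses;
} lm_cache_t;

/* Hypothetical backing store standing in for SRAM. */
static uint32_t backing_read(uint32_t word_addr)
{
    return word_addr ^ 0x5A5A5A5Au;
}

/* Read one word through the cache; word_addr is a word index. */
uint32_t lm_cache_read(lm_cache_t *c, uint32_t word_addr)
{
    uint32_t tag = word_addr / WORDS_PER_TAG;
    unsigned off = word_addr % WORDS_PER_TAG;
    unsigned e;

    for (e = 0; e < TAGS; e++)               /* CAM-style tag match */
        if (c->valid[e] && c->tag[e] == tag)
            return c->local_mem[e][off];

    /* Miss: refill the whole block into the victim entry. */
    c->misses++;
    e = c->next_victim;
    c->next_victim = (c->next_victim + 1) % TAGS;
    for (unsigned i = 0; i < WORDS_PER_TAG; i++)
        c->local_mem[e][i] = backing_read(tag * WORDS_PER_TAG + i);
    c->tag[e] = tag;
    c->valid[e] = 1;
    return c->local_mem[e][off];
}
```

Letting one tag cover a block stretches the 16 CAM entries over more cached words, and the block refill pairs naturally with a wide-word read.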


Memory Relative Tips--Hiding Overhead

- These techniques do not reduce the overhead itself; they overlap it with other work
  - Hardware Multi-Threading
  - Asynchronous Memory


Hardware Multi-Threading

- A thread swaps itself out and lets another thread execute while it accesses memory
- Each thread keeps its own set of registers, so no stack is needed for thread swapping
- Round-robin scheduling
- No thread preemption


Asynchronous Memory--Brief

- A thread is not blocked when it issues a memory request
- Thus, a thread can issue multiple memory requests at a time


Asynchronous Memory--Example (1 Issue)

    Read X
    __wait_for_all( &sig_x )
    Read Y
    __wait_for_all( &sig_y )

    // Use X and Y …


Asynchronous Memory--Example (2 Issue)

    Read X
    Read Y
    __wait_for_all( &sig_x, &sig_y )

    // Use X and Y …
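
The difference between the two examples can be captured by a toy latency model in plain C. It is a deliberate simplification: it ignores the cost of issuing each request and any queuing in the memory controller, and the function names are made up for this sketch. Waiting after every read costs the sum of the latencies, while issuing all reads first and waiting once costs roughly the longest one.

```c
/* 1 issue: each read is waited for before the next one starts. */
unsigned serial_cycles(const unsigned *latency, unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++)
        total += latency[i];
    return total;
}

/* n issues: all reads are in flight together, so the wait ends
 * when the slowest one completes. */
unsigned overlapped_cycles(const unsigned *latency, unsigned n)
{
    unsigned longest = 0;
    for (unsigned i = 0; i < n; i++)
        if (latency[i] > longest)
            longest = latency[i];
    return longest;
}
```

Real hardware gains less than this ideal model predicts. In the experiment table later in the slides, issuing two 1-LW reads at a time takes 716 cycles against 1211 with one at a time, which is a clear improvement but not a halving.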


Wide-Word Access + Multiple Issues

[Figure: eight longwords at MEM_ADDR+0 through MEM_ADDR+28]


Wide-Word Access + Multiple Issues (1 LW, 2 Issues)

[Figure: eight longwords at MEM_ADDR+0 through MEM_ADDR+28]


Wide-Word Access + Multiple Issues (2 LW, 2 Issues)

[Figure: eight longwords at MEM_ADDR+0 through MEM_ADDR+28]


Wide-Word Access + Multiple Issues (4 LW, 2 Issues)

[Figure: eight longwords at MEM_ADDR+0 through MEM_ADDR+28]


Wide-Word Access + Multiple Issues (Experiment)

    Scheme            Total Cycles    Average Cycles / LW
    1 LW * 1 Issue    1211            151.38
    2 LW * 1 Issue    725             90.63
    4 LW * 1 Issue    460             57.50
    8 LW * 1 Issue    387             48.38
    1 LW * 2 Issue    716             89.50
    2 LW * 2 Issue    445             55.63
    4 LW * 2 Issue    364             45.50
    1 LW * 4 Issue    396             49.50
    2 LW * 4 Issue    320             40.00
    1 LW * 8 Issue    318             39.75


Reference (1)

- Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, “Overcoming the memory wall in packet processing: hammers or ladders?”, ANCS 2005.
- Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, “High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor”, ACM TECS 2008.


Reference (2)

Z. Liu, K. Zheng, B. Liu, “Hybrid cache architecture for high-speed packet processing”, IET-COM 2007