IXP Training Part 3: Programming Tips
2011.04.12
NCKU CSIE CIAL Lab 2
Outline
- Memory Absolute
  - Instruction Selection
  - Task Partition
- Memory Relative
  - Reducing Overhead
    - Reduce the number of memory accesses
    - Reduce average access latency
  - Hiding Overhead

Memory Absolute Tips
- Instruction Selection
  - General Coding Skill
  - Use Hardware Instruction
- Task Partition
  - Multi-Processing
  - Context-Pipelining

Coding Skill
- Loop Unrolling
- Shift Operation
- Inline Function
  - __inline & __forceinline
- Branch Prediction
  - Branch Prediction Penalty
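
A minimal sketch of two of the coding skills listed above, loop unrolling and shift operations, written in portable C rather than Microengine C. The function names are illustrative and do not come from the slides, and the unrolled version assumes the element count is a multiple of four.

```c
#include <stdint.h>

/* Rolled loop: one counter update and one branch per element. */
uint32_t sum_rolled(const uint32_t *v, unsigned n)
{
    uint32_t s = 0;
    unsigned i;

    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unrolled by four: one counter update and one branch per four elements.
 * Assumption of this sketch: n is a multiple of 4. */
uint32_t sum_unrolled4(const uint32_t *v, unsigned n)
{
    uint32_t s = 0;
    unsigned i;

    for (i = 0; i < n; i += 4)
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    return s;
}

/* Shift in place of multiplying by a power of two (index * 4). */
uint32_t byte_offset_of_word(uint32_t index)
{
    return index << 2;
}

/* Shift in place of dividing an unsigned value by a power of two (x / 16). */
uint32_t bucket_of(uint32_t x)
{
    return x >> 4;
}
```

Both sums return the same value; the unrolled form simply spends fewer instructions on loop control, which is where the branch penalty is paid.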

Hardware Instruction
- POP_COUNT
- FFS
- Multiply
- CRC
- Hashing
- CAM

POP_COUNT--Brief
- Population count: reports the number of bits set in a 32-bit register
- Example: pop_count( 0x3121 ) = ?
  - 0x3121 = 0011 0001 0010 0001
  - Result = 5

POP_COUNT--Naïve Implementation
unsigned int pop_count_for(unsigned int x)
{
    unsigned int y = 0;
    unsigned int i;

    /* Test each of the 32 bits in turn, starting from the least significant. */
    for (i = 0; i < 32; i++) {
        if ((x & 1) == 1)
            y++;
        x = x >> 1;
    }
    return y;
}

POP_COUNT--Faster Implementation
unsigned int pop_count_agg(unsigned int x)
{
    /* Sum bits in parallel: 2-bit, then 4-bit, then 8-bit fields. */
    x -= ((x >> 1) & 0x55555555);
    x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
    x = (((x >> 4) + x) & 0x0f0f0f0f);
    /* Fold the four byte counts into the low byte. */
    x += (x >> 8);
    x += (x >> 16);
    return (x & 0x0000003f);
}
Reference: http://aggregate.org/MAGIC/

POP_COUNT--Hardware Instruction
unsigned int pop_count_hardware(unsigned int x)
{
    return pop_count(x);
}
POP_COUNT--Additional Information
Bitmap-RFC (Liu, TECS 2008)

FFS
- Finds the first (least significant) bit set in the data and returns its position
- Example:
  - ffs( 0x3121 ) = 0    0011 0001 0010 0001
  - ffs( 0x3120 ) = 5    0011 0001 0010 0000
  - ffs( 0x3100 ) = 8    0011 0001 0000 0000
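
The behaviour shown in the three examples can be modelled in portable C as a loop that shifts right until the lowest bit is set. This is only a software sketch for checking results off-target; `ffs_model` is a hypothetical name, and the value returned for an input of zero (-1 here) is an assumption, since the slides do not say what the hardware instruction does in that case.

```c
#include <stdint.h>

/* Software model of find-first-set as used in the examples:
 * returns the 0-based position of the least significant set bit.
 * Returning -1 for x == 0 is an assumption of this sketch. */
int ffs_model(uint32_t x)
{
    int pos = 0;

    if (x == 0)
        return -1;
    while ((x & 1u) == 0) {
        x >>= 1;
        pos++;
    }
    return pos;
}
```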

Multiply
- Specific multiply instructions
  - Multiply_24x8()
  - Multiply_16x16()
  - Multiply_32x32_hi()
  - Multiply_32x32_lo()

CRC
- Example of CRC operation
crc_write( 0x42424242 );
crc_32_be( source_address, bytes_0_3 );
crc_32_be( dest_address, bytes_0_3 );
…
Cache_index = crc_read();

Hash
- Hash_48()
- Hash_64()
- Hash_128()
- Example:
SIGNAL sig_hash;
Hash48(data_out, data_in, count, sig_done, &sig_hash);
__wait_for_all(&sig_hash);

CAM--Brief
- Each ME has 16 32-bit CAM entries
- The CAM is private to its ME; other MEs cannot access it
- A lookup searches all entries in parallel
- On a successful lookup, the index of the matched entry is returned
- Otherwise, the index of the entry to be replaced is returned
Content Addressable Memory--Structure
cam_lookup_t

CAM--Usage
cam_lookup_t cam_result;
cam_result = cam_lookup( data );
if( cam_result.hit == 1 ) {
    // Hit: access entry cam_result.entry_num
    …
}
else {
    // Miss: entry_num is the entry to be replaced
    ……
    cam_write( cam_result.entry_num, data, 15 );
}
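
To experiment with the lookup-then-write pattern off-target, the CAM can be approximated by a small software model. The sketch below is not the hardware intrinsic: the `cam_model_*` names and types are hypothetical, the state argument of `cam_write` is omitted, and choosing the least recently used entry on a miss is an assumption made here, since the slides only say that the index of the entry to be replaced is returned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAM_ENTRIES 16

typedef struct {
    unsigned hit;        /* 1 on a match, 0 on a miss */
    unsigned entry_num;  /* matched entry on a hit, entry to replace on a miss */
} cam_model_result_t;

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    unsigned valid[CAM_ENTRIES];
    unsigned long last_use[CAM_ENTRIES];  /* 0 means never used */
    unsigned long clock;
} cam_model_t;

void cam_model_clear(cam_model_t *cam)
{
    memset(cam, 0, sizeof *cam);
}

/* Hardware compares all entries in parallel; this model scans them. */
cam_model_result_t cam_model_lookup(cam_model_t *cam, uint32_t tag)
{
    cam_model_result_t r = {0, 0};
    unsigned i, lru = 0;

    cam->clock++;
    for (i = 0; i < CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == tag) {
            cam->last_use[i] = cam->clock;  /* a hit refreshes the entry */
            r.hit = 1;
            r.entry_num = i;
            return r;
        }
        if (cam->last_use[i] < cam->last_use[lru])
            lru = i;
    }
    r.entry_num = lru;  /* assumed policy: least recently used */
    return r;
}

void cam_model_write(cam_model_t *cam, unsigned entry, uint32_t tag)
{
    assert(entry < CAM_ENTRIES);
    cam->tag[entry] = tag;
    cam->valid[entry] = 1;
    cam->last_use[entry] = ++cam->clock;
}
```

Usage follows the slide: look up the key, use `entry_num` directly on a hit, and on a miss fetch the data from memory and write the key into the returned entry.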

Task Partition
- Multi-Processing
  - More computing power
  - Easy to implement
- Context-Pipelining
  - More usable resources
  - Hard to balance

Memory Relative Tips--Reducing Overhead
- Reduce the number of memory accesses
  - Wide-Word Accesses
  - Result Caches
- Reduce average access latency
  - Multi-Level Memory Hierarchy
  - Data Cache

Wide-Word Accesses--Brief
- Fetch the needed data in one batched access
- Reduces the number of accesses required
- Useful when the data form a linked-list-like structure

Wide-Word Accesses--Example
[Figure: eight longwords stored contiguously at MEM_ADDR+0, +4, +8, +12, +16, +20, +24 and +28]

Wide-Word Accesses--Usage (One Node per Access)
__declspec(sram_read_reg) UINT32 A;
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read );
__wait_for_all( &sig_read );
// Access A ......
----------------------------------------------
Result: 8 accesses are needed

Wide-Word Accesses--Usage (Two Nodes per Access)
__declspec(sram_read_reg) UINT32 A[2];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read );
__wait_for_all( &sig_read );
// Access A ......
----------------------------------------------
Result: 4 accesses are needed

Wide-Word Accesses--Usage (Four Nodes per Access)
__declspec(sram_read_reg) UINT32 A[4];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read );
__wait_for_all( &sig_read );
// Access A ......
----------------------------------------------
Result: 2 accesses are needed

Wide-Word Accesses--Experiment
- Platform: IXP2800
- Total accesses: 8 LW (8 * 4 bytes)

Case              Total Cycles    Average Cycles / LW
1 LW * 8 times    1211            151.38
2 LW * 4 times    725             90.63
4 LW * 2 times    460             57.50
8 LW * 1 time     387             48.38
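
The arithmetic behind these figures is simple enough to sketch in portable C: the number of accesses is the total number of longwords divided by the longwords moved per access, and the average column is the total cycle count divided by 8 LW. The helper names below are hypothetical; the cycle counts are the ones reported on this slide. With these numbers, fetching all 8 LW in one access is about 3.1 times faster than eight single-LW accesses (1211 / 387).

```c
/* Number of accesses needed to fetch total_lw longwords when each
 * access transfers lw_per_access longwords (rounded up). */
unsigned accesses_needed(unsigned total_lw, unsigned lw_per_access)
{
    return (total_lw + lw_per_access - 1) / lw_per_access;
}

/* Average cycles per longword, as in the last column of the table. */
double avg_cycles_per_lw(unsigned total_cycles, unsigned total_lw)
{
    return (double)total_cycles / (double)total_lw;
}

/* Speedup of one scheme relative to a baseline scheme. */
double speedup(unsigned baseline_cycles, unsigned cycles)
{
    return (double)baseline_cycles / (double)cycles;
}
```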

Wide-Word Accesses--Limitation
- Data must be contiguous
  - Suitable for linear search
  - Does not support random accesses
- The number of transfer registers is fixed
  - Each thread has 16 read / write registers
  - The transfer registers may be reserved by others

Result Cache--Brief
- Caches the results computed by the application
- If the same fields appear again, the cached result is returned
- Memory accesses are reduced on a cache hit
- Depends on the temporal locality of the traffic
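
A minimal sketch of the idea in portable C, assuming a direct-mapped cache keyed by a single 32-bit field. Everything here is illustrative: the `rc_*` names, the 64-slot size, the index function, and `slow_classify`, which stands in for the real lookup that would walk data structures in SRAM or DRAM.

```c
#include <stdint.h>
#include <string.h>

#define RC_SLOTS 64  /* hypothetical size; must be a power of two */

typedef struct {
    uint32_t key[RC_SLOTS];
    uint32_t result[RC_SLOTS];
    uint8_t  valid[RC_SLOTS];
    unsigned hits;
    unsigned misses;
} result_cache_t;

/* Stand-in for the expensive, memory-bound classification. */
static uint32_t slow_classify(uint32_t key)
{
    return key % 7u;
}

/* Fold the key down to a slot index. */
static unsigned rc_index(uint32_t key)
{
    return (key ^ (key >> 16)) & (RC_SLOTS - 1);
}

void rc_init(result_cache_t *c)
{
    memset(c, 0, sizeof *c);
}

uint32_t rc_classify(result_cache_t *c, uint32_t key)
{
    unsigned i = rc_index(key);

    if (c->valid[i] && c->key[i] == key) {
        c->hits++;                 /* hit: no memory lookup needed */
        return c->result[i];
    }
    c->misses++;                   /* miss: do the lookup, then remember it */
    c->key[i] = key;
    c->result[i] = slow_classify(key);
    c->valid[i] = 1;
    return c->result[i];
}
```

A direct-mapped layout is used because it needs no replacement policy: a new key simply overwrites whatever occupied its slot.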

Result Cache--IXP2400
- No hardware cache is supported in the IXP2400 ME
- A set-associative cache is not easy to implement
- The replacement policy also adds overhead

Result Cache--Design Consideration
- Shared or private cache?
- Size of the cache?
- Works with specific hardware?
- Miss penalty handling?
Result Cache--Experiment

Multi-Level Memory Hierarchy--Brief
- Reduces the average access latency
- The number of accesses remains unchanged
- If the data can fit in a faster memory, place it there

Multi-Level Memory Hierarchy--Data Placement
- Small and read-only: Hard Code
- Small but needs updating: Local Memory
- Larger: Scratchpad
- Largest: SRAM

Multi-Level Memory Hierarchy--Packet Data Type
- Packet-related data
  - Temporary data
  - Valid only for a specific packet
  - Local Memory
- Flow-related data
  - Related to a specific flow
  - Spatial locality
  - Wide-Word Access
- Application-related data
  - Valid for a specific application
  - Temporal locality
  - Result Cache

Split-Cache (Z. Liu, IET-COM 2007)
- Two separate hardware caches, one for application data and one for flow data

Data Cache--Brief
- A hardware cache mechanism that caches data for packet processing
  - App-Cache
  - Flow-Cache
- However, it is not supported by the IXP2400

Data Cache--CAM + Local Memory
- The CAM, working together with Local Memory, acts like a hardware cache
- However, the number of CAM entries is small
- Each CAM entry may work with several Local Memory cache entries

Memory Relative Tips--Hiding Overhead
- Does not actually reduce the overhead, but overlaps it with other work
  - Hardware Multi-Threading
  - Asynchronous Memory

Hardware Multi-Threading
- A thread swaps itself out and lets another thread execute while it accesses memory
- Each thread keeps its own set of registers, so no stack is needed for thread swapping
- Round-robin scheduling
  - No thread preemption

Asynchronous Memory--Brief
- A thread is not blocked when it issues a memory request
- Thus, a thread can issue multiple memory requests at a time

Asynchronous Memory--Example (1 Issue)
Read X
__wait_for_all ( &sig_x )
Read Y
__wait_for_all ( &sig_y )
// Use X and Y …

Asynchronous Memory--Example (2 Issue)
Read X
Read Y
__wait_for_all ( &sig_x, &sig_y )
// Use X and Y …

Wide-Word Access + Multiple Issues
[Figure: eight longwords stored contiguously at MEM_ADDR+0 through MEM_ADDR+28]

Wide-Word Access + Multiple Issues (1 LW, 2 Issue)
[Figure: the same memory layout, MEM_ADDR+0 through MEM_ADDR+28]

Wide-Word Access + Multiple Issues (2 LW, 2 Issue)
[Figure: the same memory layout, MEM_ADDR+0 through MEM_ADDR+28]

Wide-Word Access + Multiple Issues (4 LW, 2 Issue)
[Figure: the same memory layout, MEM_ADDR+0 through MEM_ADDR+28]

Wide-Word Access + Multiple Issues (Experiment)

Scheme            Total Cycles    Average Cycles / LW
1 LW * 1 Issue    1211            151.38
2 LW * 1 Issue    725             90.63
4 LW * 1 Issue    460             57.50
8 LW * 1 Issue    387             48.38
1 LW * 2 Issue    716             89.50
2 LW * 2 Issue    445             55.63
4 LW * 2 Issue    364             45.50
1 LW * 4 Issue    396             49.50
2 LW * 4 Issue    320             40.00
1 LW * 8 Issue    318             39.75

Reference (1)
- Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, “Overcoming the memory wall in packet processing: hammers or ladders?”, ANCS 2005.
- Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, “High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor”, ACM TECS 2008.
Reference (2)
Z. Liu, K. Zheng, B. Liu, “Hybrid cache architecture for high-speed packet processing”, IET-COM 2007