Introduction to Embedded Systems
![Page 1: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/1.jpg)
Introduction to Embedded Systems
Rabie A. Ramadan, [email protected]
http://www.rabieramadan.org/classes/2014/embedded/
3
![Page 2: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/2.jpg)
Memory
2
Memory Component Models
Cache Memory Mapping
![Page 3: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/3.jpg)
Memory Component Models
3
![Page 4: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/4.jpg)
Multiport memories
4
Larger memory structures can be built from memory blocks.
Memory Mapping is required
![Page 5: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/5.jpg)
Register Files
5
The size of the register file is fixed when the CPU is predesigned.
Register file size is a key parameter in CPU design that affects code performance and energy consumption as well as the area of the CPU.
If the register file is too small:
• The program must spill values to main memory: the value is written to main memory and later read back.
• Spills cost both time and energy, because main memory accesses are slower and more energy-intensive than register file accesses.
![Page 6: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/6.jpg)
Register Files
6
If the register file is too large, it consumes static energy and takes extra chip area that could be used for other purposes.
![Page 7: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/7.jpg)
Caches
7
When designing an embedded system, we need to pay extra attention to the relationship between the cache configuration and the programs that use it.
Too-small caches result in excessive main memory accesses; too-large caches consume excess static power.
Longer cache lines provide more prefetching bandwidth, which is useful in some algorithms but not others.
![Page 8: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/8.jpg)
Caches
8
Line size affects prefetching behavior:
• Programs that access successive memory locations can benefit from the prefetching induced by long cache lines.
• Long lines can also, in some cases, provide reuse for very small sets of locations.
Cache Memory Mapping is another issue
![Page 9: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/9.jpg)
Wolfe and Lam's Classification of Array Behavior
9
![Page 10: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/10.jpg)
Caches
10
Several groups have proposed configurable caches whose configuration can be changed at runtime.
Additional multiplexers and other logic allow a pool of memory cells to be used in several different cache configurations.
![Page 11: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/11.jpg)
Scratch Pad Memories
11
A cache is designed to move a relatively small amount of memory close to the processor.
Caches use hardwired algorithms to manage their contents:
• Hardware determines when values are added to or removed from the cache.
Scratch pad memory is located in parallel with the cache:
• The scratch pad does not include hardware to manage its contents.
![Page 12: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/12.jpg)
Scratch pad memory is part of the memory address space controlled by the processor.
The scratch pad is managed by software, not hardware:
• Provides predictable access time.
• Requires values to be allocated explicitly.
Programs use standard read/write instructions to access the scratch pad.
![Page 13: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/13.jpg)
Memory Maps
13
A memory map for a processor defines how addresses get mapped to hardware.
The total size of the address space is constrained by the address width of the processor.
• A 32-bit processor, for example, can address 2^32 locations, or 4 gigabytes (GB), assuming each address refers to one byte.
![Page 14: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/14.jpg)
The ARM Cortex-M3 architecture
14
Separates addresses used for program memory (labeled A) from those used for data memory (B and D).
The memories are accessed via separate buses, permitting instructions and data to be fetched simultaneously, which effectively doubles the memory bandwidth.
Such a separation of program memory from data memory is known as a Harvard architecture.
![Page 15: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/15.jpg)
The ARM Cortex-M3 architecture
15
Includes a number of on-chip peripherals (C):
• Devices that are accessed by the processor using some of the memory addresses.
• Timers, ADCs, UARTs, and other I/O devices.
Each of these devices occupies a few memory addresses by providing memory-mapped registers.
![Page 16: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/16.jpg)
16
![Page 17: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/17.jpg)
Memory Hierarchy
The idea:
• Hide the slower memory behind the fast memory.
• Cost and performance play major roles in selecting the memory.
![Page 18: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/18.jpg)
Hit vs. Miss
Hit
• The requested data resides in a given level of memory.
Miss
• The requested data is not found in the given level of memory.
Hit rate
• The percentage of memory accesses found in a given level of memory.
Miss rate
• The percentage of memory accesses not found in a given level of memory.
![Page 19: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/19.jpg)
Hit vs. Miss (Cont.)
Hit time
• The time required to access the requested information in a given level of memory.
Miss penalty
• The time required to process a miss, including:
• Replacing a block in an upper level of memory,
• The additional time to deliver the requested data to the processor.
![Page 20: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/20.jpg)
Miss Scenario
The processor sends a request to the cache for location X:
• If found: cache hit.
• If not: try the next level.
When the location is found, the whole block is loaded into the cache,
• hoping that the processor will access one of the neighboring locations next.
• One miss may lead to multiple hits: locality.
Can we compute the average access time based on this memory hierarchy?
![Page 21: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/21.jpg)
Average Access Time
Assume a memory hierarchy with three levels (L1, L2, and L3).
What is the average memory access time?
h1: hit rate at L1; (1 - h1): miss rate at L1; t1: L1 access time
h2: hit rate at L2; (1 - h2): miss rate at L2; t2: L2 access time
h3: hit rate at L3 = 100%; t3: L3 access time
![Page 22: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/22.jpg)
Cache Mapping Schemes
![Page 23: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/23.jpg)
Cache Mapping Schemes
Cache memory is smaller than the main memory.
Only a few blocks can be loaded into the cache at a time.
The cache does not use the same memory addresses.
Which block in the cache is equivalent to which block in the memory?
• The processor uses the Memory Management Unit (MMU) to convert the requested memory address to a cache address.
![Page 24: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/24.jpg)
Direct Mapping
Assigns cache mappings using a modular approach:
j = i mod n
where j is the cache block number, i is the memory block number, and n is the number of cache blocks.
![Page 25: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/25.jpg)
Example
Given M memory blocks to be mapped to 10 cache blocks, show the direct mapping scheme.
How do you know which block is currently in the cache?
![Page 26: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/26.jpg)
Direct Mapping (Cont.)
Bits in the main memory address are divided into three fields:
• Word: identifies the specific word in the block.
• Block: identifies a unique block in the cache.
• Tag: identifies which block from the main memory is currently in the cache.
![Page 27: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/27.jpg)
Example
Consider, for example, a main memory consisting of 4K blocks, a cache consisting of 128 blocks, and a block size of 16 words. Show the direct mapping and the main memory address format.
![Page 28: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/28.jpg)
Example (Cont.)
![Page 29: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/29.jpg)
Direct Mapping
Advantages
• Easy.
• Does not require any search technique to find a block in the cache.
• Replacement is straightforward.
Disadvantages
• Many blocks in main memory are mapped to the same cache block.
• Other cache blocks may remain empty.
• Poor cache utilization.
![Page 30: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/30.jpg)
Group Activity 1
Consider the case of a main memory consisting of 4K blocks, a cache consisting of 8 blocks, and a block size of 4 words. Show the direct mapping and the main memory address format.
![Page 31: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/31.jpg)
Group Activity 2
Given the following direct mapping chart, what are the cache and memory locations required by the following addresses: 31, 126, 3, 4, 20, 2?
![Page 32: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/32.jpg)
Fully Associative Mapping
Allows any memory block to be placed anywhere in the cache.
A search technique is required to find the block number in the tag field.
![Page 33: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/33.jpg)
Example
We have a main memory with 2^14 words, a cache with 16 blocks, and blocks of 8 words. How many tag and word field bits are needed?
The word field requires 3 bits.
The tag field requires 11 bits: 2^14 / 8 = 2^11 = 2048 blocks.
![Page 34: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/34.jpg)
Fully Associative Mapping
Advantages
• Flexibility.
• Better cache utilization.
Disadvantages
• Requires a tag search: an associative (parallel) search, which might require an extra hardware unit.
• Requires a replacement strategy if the cache is full.
• Expensive.
![Page 35: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/35.jpg)
N-way Set Associative Mapping
Combines direct and fully associative mapping. The cache is divided into sets of blocks; all sets are the same size.
Main memory blocks are mapped to a specific set based on:
s = i mod S
• s: the set to which block i is mapped.
• S: the total number of sets.
An incoming block may be placed in any cache block inside its set.
![Page 36: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/36.jpg)
N-way Set Associative Mapping
• Tag field: uniquely identifies the targeted block within the determined set.
• Word field: identifies the element (word) within the block that is requested by the processor.
• Set field: identifies the set.
![Page 37: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/37.jpg)
Group Activity
Compute the three parameters (Word, Set, and Tag) for a memory system having the following specifications:
• Size of the main memory: 4K blocks,
• Size of the cache: 128 blocks,
• Block size: 16 words.
Assume that the system uses 4-way set-associative mapping.
![Page 38: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/38.jpg)
Answer
![Page 39: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/39.jpg)
N-way Set Associative Mapping
Advantages
• Moderate utilization of the cache.
Disadvantages
• Still needs a tag search inside the set.
![Page 40: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/40.jpg)
If the cache is full and a block must be replaced, which one should be replaced?
![Page 41: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/41.jpg)
Cache Replacement Policies
Random
• Simple.
• Requires a random generator.
First In First Out (FIFO)
• Replace the block that has been in the cache the longest.
• Requires keeping track of block lifetimes.
Least Recently Used (LRU)
• Replace the block that has been used least recently.
• Requires keeping track of block history.
![Page 42: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/42.jpg)
Cache Replacement Policies (Cont.)
Most Recently Used (MRU)
• Replace the block that has been used most recently.
• Requires keeping track of block history.
Optimal
• Hypothetical: it must know the future.
![Page 43: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/43.jpg)
Example
Consider the case of a 4x8 two-dimensional array of numbers, A. Assume that each number in the array occupies one word and that the array elements are stored in column-major order in the main memory, from location 1000 to location 1031. The cache consists of eight blocks, each consisting of just two words. Assume also that, whenever needed, the LRU replacement policy is used. We would like to examine the changes in the cache if direct mapping is used as the following sequence of requests for the array elements is made by the processor:
![Page 44: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/44.jpg)
Array elements in the main memory
![Page 45: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/45.jpg)
![Page 46: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/46.jpg)
Conclusion
• 16 cache misses, not a single hit.
• 12 replacements.
• Only 4 cache blocks are used.
![Page 47: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/47.jpg)
Group Activity
Do the same for fully associative and 4-way set associative mappings.
![Page 48: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/48.jpg)
Memory Models
48
Stacks
• A stack is a region of memory that is dynamically allocated to the program in a last-in, first-out (LIFO) pattern.
• A stack pointer (typically a register) contains the memory address of the top of the stack.
Stacks are typically used to implement procedure calls.
![Page 49: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/49.jpg)
Memory Models-Stacks
49
In C, the compiler produces code that pushes onto the stack:
• the location of the instruction to execute upon returning from the procedure,
• the current value of some or all of the machine registers,
• the arguments to the procedure,
and then sets the program counter to the location of the procedure code.
Stack frame
• The data for a procedure that is pushed onto the stack.
When a procedure returns, the compiler's code pops its stack frame, retrieving the program location at which to resume execution.
![Page 50: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/50.jpg)
Memory Models-Stacks
50
It can be disastrous if the stack pointer is incremented beyond the memory allocated for the stack (stack overflow), as this results in overwriting memory that is being used for other purposes.
This becomes particularly difficult to prevent with recursive programs, where a procedure calls itself.
Embedded software designers often avoid recursion to circumvent this difficulty.
![Page 51: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/51.jpg)
Misuse or misunderstanding of the stack
When foo() is called, c refers to the return address; after the call returns, the stack frame is popped and c becomes the address of b, which causes an addressing problem.
![Page 52: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/52.jpg)
Memory Protection Units
52
A key issue in systems that support multiple simultaneous tasks is preventing one task from disrupting the execution of another.
Many processors provide memory protection in hardware. Tasks are assigned their own address space, and if a task attempts to access memory outside its own address space, a segmentation fault or other exception results.
This typically results in termination of the offending application.
![Page 53: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/53.jpg)
Memory Models- Dynamic Memory Allocation
53
General-purpose software applications often have indeterminate memory requirements, depending on parameters and/or user input.
To support such applications:
• computer scientists have developed dynamic memory allocation schemes,
• a program can at any time request that the operating system allocate additional memory.
The memory is allocated from a data structure known as a heap, which facilitates keeping track of which portions of memory are in use by which application.
![Page 54: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/54.jpg)
Memory Models- Dynamic Memory Allocation
54
Memory allocation occurs via an operating system call (such as malloc in C).
When the program no longer needs access to memory that has been so allocated, it deallocates the memory (by calling free in C).
It is possible for a program to inadvertently accumulate memory that is never freed. This is known as a memory leak.
For embedded applications, which typically must continue to execute for a long time, a leak can be disastrous: the program will eventually fail when physical memory is exhausted.
![Page 55: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/55.jpg)
Memory Models- Dynamic Memory Allocation
55
Memory fragmentation occurs when a program chaotically allocates and deallocates memory in varying sizes.
A fragmented memory has allocated and free chunks interspersed, and often the free chunks become too small to use. In this case, defragmentation is required.
Defragmentation and garbage collection are both very problematic for real-time systems: straightforward implementations require all other executing tasks to be stopped while the defragmentation or garbage collection is performed.
Implementations using such "stop the world" techniques can have substantial pause times, sometimes running for many milliseconds.
![Page 56: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/56.jpg)
Programs
56
![Page 57: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/57.jpg)
Topics
57
• Code compression
• Code generation and back-end compilation
• Memory-oriented software optimizations
![Page 58: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/58.jpg)
Code Compression
58
Memory is one of the key driving factors in embedded system design:
• a larger memory means increased chip area, more power dissipation, and higher cost;
• memory imposes constraints on the size of the application programs.
Code compression techniques address the problem by reducing the program size.
![Page 59: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/59.jpg)
Traditional Code Compression
59
• Compression is done off-line (prior to execution).
• The compressed program is loaded into the memory.
• Decompression is done during program execution (online).
![Page 60: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/60.jpg)
Dictionary-based Approach
60
Takes advantage of commonly occurring instruction sequences by using a dictionary.
The repeated occurrences are replaced by a codeword that points to the index of the dictionary entry that contains the pattern.
![Page 61: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/61.jpg)
Improved Dictionary-based Approach
61
Improves the dictionary-based compression technique by considering mismatches:
Step 1: Determine the instruction sequences that differ in a few bit positions (Hamming distance).
Step 2: Store that information in the compressed program.
Step 3: Update the dictionary (if necessary).
The compression ratio will depend on how many bit changes are considered during compression.
![Page 62: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/62.jpg)
Example
62
This example considers only a 1-bit change: the third pattern (from the top) in the original program differs from the first dictionary entry (index 0) in the sixth bit position (from the left).
The compression ratio for this example is 95%.
![Page 63: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/63.jpg)
CODE COMPRESSION USING BIT-MASKS
63
Your reading homework (link). A presentation is required; I will randomly select one of you to explain it next time.
![Page 64: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/64.jpg)
Memory Optimization Techniques
64
![Page 65: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/65.jpg)
PLATFORM-INDEPENDENT CODE TRANSFORMATIONS
65
Code Rewriting Techniques for Access Locality and Regularity
• Consisting of loop (and sometimes also data flow) transformations.
Should this algorithm be implemented directly?
![Page 66: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/66.jpg)
Code Rewriting Techniques for Access Locality and Regularity
66
This results in high storage and bandwidth requirements (assuming that N is large): the b[] signals have to be written to an off-chip background memory in the first loop and read back in the second loop.
![Page 67: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/67.jpg)
Code Rewriting Techniques for Access Locality and Regularity
67
Rewriting the code using a loop merging transformation gives the following: the b[] signals can be stored in registers up to the end of the accumulation, since they are consumed immediately after they have been produced.
In the overall algorithm, this reduces memory bandwidth requirements significantly.
![Page 68: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/68.jpg)
Code Rewriting Techniques to Improve Data reuse
68
It is important to optimize data transfers and storage to utilize the memory hierarchy efficiently.
The compiler literature has so far focused on improving data reuse by performing loop transformations.
Hierarchical data reuse copies are added to the code, exposing the different levels of reuse.
This depends on knowledge of the memory hierarchy levels and their sizes, and is still hard to implement as well as to understand.
![Page 69: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/69.jpg)
Code Rewriting Techniques to Improve Data reuse
69
Only part of each array is accessed in the internal loops; make those parts ready in buffers.
![Page 70: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/70.jpg)
Memory Estimation
70
One of the techniques is based on live elements (signals) and requires a dependency graph.
In computer science, a dependency graph is a directed graph representing the dependencies of several instructions on each other.
![Page 71: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/71.jpg)
Example
71
![Page 72: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/72.jpg)
Let's build the dependency graph
72
![Page 73: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/73.jpg)
73
![Page 74: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/74.jpg)
Dependences
74
Instruction dependency
• The operation performed by a stage depends on the operation(s) performed by other stage(s).
• E.g., a conditional branch: instruction I4 cannot be executed until the branch condition in I3 is evaluated and stored. The branch takes 3 units of time.
![Page 75: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/75.jpg)
Dependences
75
Data dependency
• A source operand of instruction Ii depends on the results of executing a preceding instruction Ij, i > j.
• E.g., Ii cannot proceed unless the results of Ij are saved.
![Page 76: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/76.jpg)
Data Dependency
• Write after write (WAW)
• Read after write (RAW)
• Write after read (WAR)
• Read after read does not cause a stall.
![Page 77: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/77.jpg)
Read after write
![Page 78: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/78.jpg)
Example
Consider the execution of the following sequence of instructions on a five-stage pipeline consisting of the IF, ID, OF, IE, and IS stages. Show all types of data dependency.
![Page 79: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/79.jpg)
Answer
![Page 80: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/80.jpg)
Memory Modeling
80
Based on the dependency and data flow graphs:
• All variables that need to be preserved over more than one control step are stored in registers.
• The number of registers assigned to variables is minimized, because the register count impacts the area of the resulting design.
![Page 81: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/81.jpg)
Register Allocation by Graph Coloring
81
The lifetime of each variable is computed first. A graph is then constructed whose nodes represent variables; the existence of an edge indicates that two lifetimes overlap, i.e., the variables cannot share the same register. A register can only be shared by variables with non-overlapping lifetimes.
Thus, the problem of minimizing the register count for a given set of variables and their lifetimes is equivalent to the graph coloring problem:
Assign colors to each node of the graph such that the total number of colors is minimum and no two adjacent nodes share the same color.