Introduction to Embedded Systems
![Page 1: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/1.jpg)
Introduction to Embedded Systems
Rabie A. Ramadan, [email protected]
http://www.rabieramadan.org/classes/2014/embedded/
3
![Page 2: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/2.jpg)
Memory
2
Memory Component Models
Cache Memory Mapping
![Page 3: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/3.jpg)
Memory Component Models
3
![Page 4: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/4.jpg)
Multiport memories
4
Larger memory structures can be built from memory blocks.
Memory Mapping is required
![Page 5: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/5.jpg)
Register Files
5
The size of the register file is fixed when the CPU is predesigned.
Register file size is a key parameter in CPU design that affects code performance and energy consumption as well as the area of the CPU.
If the register file is too small:
• The program must spill values to main memory: the value is written to main memory and later read back.
• Spills cost both time and energy, because main memory accesses are slower and more energy-intensive than register file accesses.
![Page 6: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/6.jpg)
Register Files
6
If the register file is too large, it consumes static energy and takes extra chip area that could be used for other purposes.
![Page 7: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/7.jpg)
Caches
7
When designing an embedded system, we need to pay extra attention to the relationship between the cache configuration and the programs that use it.
Too-small caches result in excessive main memory accesses; too-large caches consume excess static power.
Longer cache lines provide more prefetching bandwidth, which is useful in some algorithms but not others.
![Page 8: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/8.jpg)
Caches
8
Line size affects prefetching behavior:
• Programs that access successive memory locations can benefit from the prefetching induced by long cache lines.
• Long lines can also, in some cases, provide reuse for very small sets of locations.
Cache Memory Mapping is another issue
![Page 9: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/9.jpg)
Wolfe and Lam's Classification of Array Behavior
9
![Page 10: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/10.jpg)
Caches
10
Several groups have proposed configurable caches whose configuration can be changed at runtime.
Additional multiplexers and other logic allow a pool of memory cells to be used in several different cache configurations.
![Page 11: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/11.jpg)
Scratch Pad Memories
11
A cache is designed to move a relatively small amount of memory close to the processor.
Caches use hardwired algorithms to manage their contents:
• Hardware determines when values are added to or removed from the cache.
Scratch pad memory is located in parallel with the cache:
• The scratch pad does not include hardware to manage its contents.
![Page 12: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/12.jpg)
Scratch pad memory is part of the memory address space controlled by the processor.
The scratch pad is managed by software, not hardware:
• Provides predictable access time.
• Requires values to be allocated explicitly.
Programs use standard read/write instructions to access the scratch pad.
![Page 13: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/13.jpg)
Memory Maps
13
A memory map for a processor defines how addresses get mapped to hardware.
The total size of the address space is constrained by the address width of the processor.
• A 32-bit processor, for example, can address 2^32 locations, or 4 gigabytes (GB), assuming each address refers to one byte.
![Page 14: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/14.jpg)
The ARM Cortex-M3 architecture
14
Separates addresses used for program memory (labeled A) from those used for data memory (B and D).
The memories are accessed via separate buses, permitting instructions and data to be fetched simultaneously, which effectively doubles the memory bandwidth.
Such a separation of program memory from data memory is known as a Harvard architecture.
![Page 15: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/15.jpg)
The ARM Cortex-M3 architecture
15
Includes a number of on-chip peripherals (C):
• Devices that are accessed by the processor using some of the memory addresses.
• Timers, ADCs, UARTs, and other I/O devices.
Each of these devices occupies a few memory addresses by providing memory-mapped registers.
![Page 16: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/16.jpg)
16
![Page 17: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/17.jpg)
Memory Hierarchy
The idea:
• Hide the slower memory behind the fast memory.
• Cost and performance play major roles in selecting the memory.
![Page 18: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/18.jpg)
Hit vs. Miss
Hit
• The requested data resides in a given level of memory.
Miss
• The requested data is not found in the given level of memory.
Hit rate
• The percentage of memory accesses found in a given level of memory.
Miss rate
• The percentage of memory accesses not found in a given level of memory.
![Page 19: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/19.jpg)
Hit vs. Miss (Cont.)
Hit time
• The time required to access the requested information in a given level of memory.
Miss penalty
• The time required to process a miss, including:
• Replacing a block in an upper level of memory,
• The additional time to deliver the requested data to the processor.
![Page 20: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/20.jpg)
Miss Scenario
The processor sends a request to the cache for location X:
• If found: cache hit.
• If not: try the next level.
When the location is found, the whole block is loaded into the cache,
• hoping that the processor will access one of the neighboring locations next.
• One miss may lead to multiple hits: locality.
Can we compute the average access time based on this memory hierarchy?
![Page 21: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/21.jpg)
Average Access Time
Assume a memory hierarchy with three levels (L1, L2, and L3).
What is the average memory access time?
h1: hit rate at L1; (1 - h1): miss rate at L1; t1: L1 access time
h2: hit rate at L2; (1 - h2): miss rate at L2; t2: L2 access time
h3: hit rate at L3 = 100%; t3: L3 access time
![Page 22: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/22.jpg)
Cache Mapping Schemes
![Page 23: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/23.jpg)
Cache Mapping Schemes
Cache memory is smaller than the main memory.
Only a few blocks can be loaded into the cache at a time.
The cache does not use the same memory addresses.
Which block in the cache is equivalent to which block in the memory?
• The processor uses the Memory Management Unit (MMU) to convert the requested memory address to a cache address.
![Page 24: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/24.jpg)
Direct Mapping
Assigns cache mappings using a modular approach:
j = i mod n
where j is the cache block number, i is the memory block number, and n is the number of cache blocks.
![Page 25: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/25.jpg)
Example
Given M memory blocks to be mapped to 10 cache blocks, show the direct mapping scheme.
How do you know which block is currently in the cache?
![Page 26: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/26.jpg)
Direct Mapping (Cont.)
Bits in the main memory address are divided into three fields:
• Word: identifies the specific word in the block.
• Block: identifies a unique block in the cache.
• Tag: identifies which block from the main memory is currently in the cache.
![Page 27: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/27.jpg)
Example
Consider, for example, a main memory consisting of 4K blocks, a cache consisting of 128 blocks, and a block size of 16 words. Show the direct mapping and the main memory address format.
![Page 28: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/28.jpg)
Example (Cont.)
![Page 29: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/29.jpg)
Direct Mapping
Advantages
• Easy.
• Does not require any search technique to find a block in the cache.
• Replacement is straightforward.
Disadvantages
• Many blocks in main memory are mapped to the same cache block.
• Other cache blocks may remain empty.
• Poor cache utilization.
![Page 30: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/30.jpg)
Group Activity 1
Consider the case of a main memory consisting of 4K blocks, a cache consisting of 8 blocks, and a block size of 4 words. Show the direct mapping and the main memory address format.
![Page 31: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/31.jpg)
Group Activity 2
Given the following direct mapping chart, what are the cache and memory locations required by the following addresses: 31, 126, 3, 4, 20, 2?
![Page 32: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/32.jpg)
Fully Associative Mapping
Allows any memory block to be placed anywhere in the cache.
A search technique is required to find the block number in the tag field.
![Page 33: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/33.jpg)
Example
We have a main memory with 2^14 words, a cache with 16 blocks, and blocks of 8 words. How many tag and word field bits are needed?
The word field requires 3 bits.
The tag field requires 11 bits: 2^14 / 8 = 2^11 = 2048 blocks.
![Page 34: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/34.jpg)
Fully Associative Mapping
Advantages
• Flexibility.
• Better cache utilization.
Disadvantages
• Requires a tag search: an associative (parallel) search, which might require an extra hardware unit.
• Requires a replacement strategy if the cache is full.
• Expensive.
![Page 35: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/35.jpg)
N-way Set Associative Mapping
Combines direct and fully associative mapping. The cache is divided into sets of blocks; all sets are the same size.
Main memory blocks are mapped to a specific set based on:
s = i mod S
• s: the set to which block i is mapped.
• S: the total number of sets.
An incoming block may be placed in any cache block inside its set.
![Page 36: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/36.jpg)
N-way Set Associative Mapping
• Tag field: uniquely identifies the targeted block within the determined set.
• Word field: identifies the element (word) within the block that is requested by the processor.
• Set field: identifies the set.
![Page 37: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/37.jpg)
Group Activity
Compute the three parameters (Word, Set, and Tag) for a memory system having the following specifications:
• Size of the main memory: 4K blocks,
• Size of the cache: 128 blocks,
• Block size: 16 words.
Assume that the system uses 4-way set-associative mapping.
![Page 38: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/38.jpg)
Answer
![Page 39: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/39.jpg)
N-way Set Associative Mapping
Advantages
• Moderate utilization of the cache.
Disadvantages
• Still needs a tag search inside the set.
![Page 40: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/40.jpg)
If the cache is full and a block must be replaced, which one should be replaced?
![Page 41: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/41.jpg)
Cache Replacement Policies
Random
• Simple.
• Requires a random generator.
First In First Out (FIFO)
• Replace the block that has been in the cache the longest.
• Requires keeping track of block lifetimes.
Least Recently Used (LRU)
• Replace the block that has been used least recently.
• Requires keeping track of block history.
![Page 42: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/42.jpg)
Cache Replacement Policies (Cont.)
Most Recently Used (MRU)
• Replace the block that has been used most recently.
• Requires keeping track of block history.
Optimal
• Hypothetical: it must know the future.
![Page 43: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/43.jpg)
Example
Consider the case of a 4x8 two-dimensional array of numbers, A. Assume that each number in the array occupies one word and that the array elements are stored in column-major order in the main memory, from location 1000 to location 1031. The cache consists of eight blocks, each consisting of just two words. Assume also that, whenever needed, the LRU replacement policy is used. We would like to examine the changes in the cache if direct mapping is used as the following sequence of requests for the array elements is made by the processor:
![Page 44: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/44.jpg)
Array elements in the main memory
![Page 45: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/45.jpg)
![Page 46: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/46.jpg)
Conclusion
• 16 cache misses, not a single hit.
• 12 replacements.
• Only 4 cache blocks are used.
![Page 47: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/47.jpg)
Group Activity
Do the same for fully associative and 4-way set associative mappings.
![Page 48: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/48.jpg)
Memory Models
48
Stacks
• A stack is a region of memory that is dynamically allocated to the program in a last-in, first-out (LIFO) pattern.
• A stack pointer (typically a register) contains the memory address of the top of the stack.
Stacks are typically used to implement procedure calls.
![Page 49: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/49.jpg)
Memory Models-Stacks
49
In C, the compiler produces code that pushes onto the stack:
• the location of the instruction to execute upon returning from the procedure,
• the current value of some or all of the machine registers,
• the arguments to the procedure,
and then sets the program counter to the location of the procedure code.
Stack frame
• The data for a procedure that is pushed onto the stack.
When a procedure returns, the compiler's code pops its stack frame, retrieving the program location at which to resume execution.
![Page 50: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/50.jpg)
Memory Models-Stacks
50
It can be disastrous if the stack pointer is incremented beyond the memory allocated for the stack (stack overflow), as this results in overwriting memory that is being used for other purposes.
This becomes particularly difficult to prevent with recursive programs, where a procedure calls itself.
Embedded software designers often avoid recursion to circumvent this difficulty.
![Page 51: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/51.jpg)
Misuse or misunderstanding of the stack
When foo() is called, c refers to the return address; after the call returns, the stack frame is popped and c becomes the address of b, which causes an addressing problem.
![Page 52: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/52.jpg)
Memory Protection Units
52
A key issue in systems that support multiple simultaneous tasks is preventing one task from disrupting the execution of another.
Many processors provide memory protection in hardware. Tasks are assigned their own address space, and if a task attempts to access memory outside its own address space, a segmentation fault or other exception results.
This typically results in termination of the offending application.
![Page 53: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/53.jpg)
Memory Models- Dynamic Memory Allocation
53
General-purpose software applications often have indeterminate memory requirements, depending on parameters and/or user input.
To support such applications:
• computer scientists have developed dynamic memory allocation schemes,
• a program can at any time request that the operating system allocate additional memory.
The memory is allocated from a data structure known as a heap, which facilitates keeping track of which portions of memory are in use by which application.
![Page 54: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/54.jpg)
Memory Models- Dynamic Memory Allocation
54
Memory allocation occurs via an operating system call (such as malloc in C).
When the program no longer needs access to memory that has been so allocated, it deallocates the memory (by calling free in C).
It is possible for a program to inadvertently accumulate memory that is never freed. This is known as a memory leak.
For embedded applications, which typically must continue to execute for a long time, a leak can be disastrous: the program will eventually fail when physical memory is exhausted.
![Page 55: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/55.jpg)
Memory Models- Dynamic Memory Allocation
55
Memory fragmentation occurs when a program chaotically allocates and deallocates memory in varying sizes.
A fragmented memory has allocated and free chunks interspersed, and often the free chunks become too small to use. In this case, defragmentation is required.
Defragmentation and garbage collection are both very problematic for real-time systems: straightforward implementations require all other executing tasks to be stopped while the defragmentation or garbage collection is performed.
Implementations using such "stop the world" techniques can have substantial pause times, sometimes running for many milliseconds.
![Page 56: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/56.jpg)
Programs
56
![Page 57: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/57.jpg)
Topics
57
• Code compression
• Code generation and back-end compilation
• Memory-oriented software optimizations
![Page 58: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/58.jpg)
Code Compression
58
Memory is one of the key driving factors in embedded system design:
• a larger memory means increased chip area, more power dissipation, and higher cost;
• memory imposes constraints on the size of the application programs.
Code compression techniques address the problem by reducing the program size.
![Page 59: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/59.jpg)
Traditional Code Compression
59
• Compression is done off-line (prior to execution).
• The compressed program is loaded into the memory.
• Decompression is done during program execution (online).
![Page 60: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/60.jpg)
Dictionary-based Approach
60
Takes advantage of commonly occurring instruction sequences by using a dictionary.
The repeated occurrences are replaced by a codeword that points to the index of the dictionary entry that contains the pattern.
![Page 61: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/61.jpg)
Improved Dictionary-based Approach
61
Improves the dictionary-based compression technique by considering mismatches:
Step 1: Determine the instruction sequences that differ in a few bit positions (Hamming distance).
Step 2: Store that information in the compressed program.
Step 3: Update the dictionary (if necessary).
The compression ratio will depend on how many bit changes are considered during compression.
![Page 62: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/62.jpg)
Example
62
This example considers only a 1-bit change: the third pattern (from the top) in the original program differs from the first dictionary entry (index 0) in the sixth bit position (from the left).
The compression ratio for this example is 95%.
![Page 63: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/63.jpg)
CODE COMPRESSION USING BIT-MASKS
63
Your reading homework (link). A presentation is required; I will randomly select one of you to explain it next time.
![Page 64: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/64.jpg)
Memory Optimization Techniques
64
![Page 65: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/65.jpg)
PLATFORM-INDEPENDENT CODE TRANSFORMATIONS
65
Code Rewriting Techniques for Access Locality and Regularity
• Consisting of loop (and sometimes also data flow) transformations.
Should this algorithm be implemented directly?
![Page 66: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/66.jpg)
Code Rewriting Techniques for Access Locality and Regularity
66
This results in high storage and bandwidth requirements (assuming that N is large): the b[] signals have to be written to an off-chip background memory in the first loop and read back in the second loop.
![Page 67: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/67.jpg)
Code Rewriting Techniques for Access Locality and Regularity
67
Rewriting the code using a loop merging transformation gives the following: the b[] signals can be stored in registers up to the end of the accumulation, since they are consumed immediately after they have been produced.
In the overall algorithm, this reduces memory bandwidth requirements significantly.
![Page 68: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/68.jpg)
Code Rewriting Techniques to Improve Data reuse
68
It is important to optimize data transfers and storage to utilize the memory hierarchy efficiently.
The compiler literature has so far focused on improving data reuse by performing loop transformations.
Hierarchical data reuse copies are added to the code, exposing the different levels of reuse.
This depends on knowledge of the memory hierarchy levels and their sizes, and is still hard to implement as well as to understand.
![Page 69: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/69.jpg)
Code Rewriting Techniques to Improve Data reuse
69
Only part of each array is accessed in the internal loops; make those parts ready in buffers.
![Page 70: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/70.jpg)
Memory Estimation
70
One of the techniques is based on live elements (signals) and requires a dependency graph.
In computer science, a dependency graph is a directed graph representing the dependencies of several instructions on each other.
![Page 71: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/71.jpg)
Example
71
![Page 72: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/72.jpg)
Let's build the dependency graph
72
![Page 73: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/73.jpg)
73
![Page 74: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/74.jpg)
Dependences
74
Instruction dependency
• The operation performed by a stage depends on the operation(s) performed by other stage(s).
• E.g., a conditional branch: instruction I4 cannot be executed until the branch condition in I3 is evaluated and stored. The branch takes 3 units of time.
![Page 75: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/75.jpg)
Dependences
75
Data dependency
• A source operand of instruction Ii depends on the results of executing a preceding instruction Ij, i > j.
• E.g., Ii cannot proceed unless the results of Ij are saved.
![Page 76: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/76.jpg)
Data Dependency
• Write after write (WAW)
• Read after write (RAW)
• Write after read (WAR)
• Read after read does not cause a stall.
![Page 77: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/77.jpg)
Read after write
![Page 78: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/78.jpg)
Example
Consider the execution of the following sequence of instructions on a five-stage pipeline consisting of the IF, ID, OF, IE, and IS stages. Show all types of data dependency.
![Page 79: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/79.jpg)
Answer
![Page 80: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/80.jpg)
Memory Modeling
80
Based on the dependency and data flow graphs:
• All variables that need to be preserved over more than one control step are stored in registers.
• The number of registers assigned to variables is minimized, because the register count impacts the area of the resulting design.
![Page 81: Introduction to Embedded Systems](https://reader033.vdocuments.us/reader033/viewer/2022061612/56816387550346895dd47317/html5/thumbnails/81.jpg)
Register Allocation by Graph Coloring
81
The lifetime of each variable is computed first. A graph is then constructed whose nodes represent variables; the existence of an edge indicates that two lifetimes overlap, i.e., the variables cannot share the same register. A register can only be shared by variables with non-overlapping lifetimes.
Thus, the problem of minimizing the register count for a given set of variables and their lifetimes is equivalent to the graph coloring problem:
Assign colors to each node of the graph such that the total number of colors is minimum and no two adjacent nodes share the same color.