arm

Chapter 1

ARM: An Advanced Microcontroller

Microcontrollers are single-chip computers. In a single chip it combines a relatively simple

CPU, with supports, such as, timers, serial/parallel digital/analog input/output lines. Program

memory is generally included on-chip. Also, a typically small read/write memory (commonly

known as scratch-pad) is included in the chip. To extend program and data memory further,

proper interfacing facilities are provided.

While microprocessors are used in personal computers and other high-performance applica-

tions, microcontrollers are targeted towards small applications. The operating frequency may

be as low as 32KHz, though there exists many very high speed microcontrollers. The major

requirement of such small systems is the reduced power consumption (as noted in Chapter ??).

The next important issue is of course cost. Apart from the integration of various system com-

ponents into a single chip with reduced power consumption and cost, some of the important

features that embedded system designers look for in microcontrollers are the following.

1. Whether the highest available speed of the microcontroller is sufficient for the application

in hand.

2. The size of the chip, for example, 40-pin DIP (dual inline package), QFP (quad flat

package). This determines the size of the system and thus the device.

3. Amount of on-chip ROM/RAM space should be sufficient to hold program code. De-

pending upon the design constraints, external memory may or may not be utilized.

4. Cost of a single chip, as it is going to determine the cost of the overall system.

5. The development platform should be good enough so that development time is reduced.

1

2 CHAPTER 1. ARM: AN ADVANCED MICROCONTROLLER

It is also advisable to have on-chip debugging facility (through JTAG port) and debug

software.

6. Availability of the microcontroller chips is also another determining factor.

Several microcontrollers available in the market. The following are some of the important

ones 68HC11, 8051, ARM, Atmel (AVR8, AVR32) , Freescale (CF, S08), Hitachi (H8, Su-

perH), MIPS, NEC, PIC, PowerPC, TI MSP430, Toshiba TLCS-870, Zilog (eZ8, eZ80). In the

following, we look into the ARM processor architecture that has many nice features supporting

embedded computation, and is in particular, low power.

1.1 ARM microcontroller

ARM is a 32-bit RISC (Reduced Instruction Set Computer) processor architecture developed

by the ARM Corporation. It was previously known as Advanced RISC Machine, and prior

to that Acron RISC Machine. The ARM architecture is licensed to companies that want to

manufacture ARM based CPUs or system-on-a-chip products. This enables the licensee to

develop their own processors compliant with the ARM instruction set architecture. ARM pro-

cessors possess a unique combination of features that makes ARM the most popular embedded

architecture today.

1. ARM cores are very simple, compared to other general-purpose processors available in

the market. This implies that the ARM processors will need relatively lesser number of

transistors, leaving enough space on the die to realize other functionalities on the silicon.

2. The instruction set architecture and the pipeline design of ARM are aimed at minimizing

the energy consumption a critical requirement in mobile embedded systems. In spite

of being a 32-bit microcontroller, it is capable of running 16-bit instruction set, known

as THUMB. This helps to achieve greater code density and enhanced power saving.

3. While being small and low-power, ARM processors provide high performance.

4. ARM architecture is highly modular the only mandatory component of an ARM proces-

sor is the integer pipeline. All other components including caches, memory management

unit (MMU), floating point and other co-processors are optional. This gives a lot of

flexibility in building application specific ARM-based processors.

5. To assist the developer, the ARM core has a built-in JTAG debug port and on-chip

embedded ICE (In-Circuit Emulator) that allows programs to be downloaded and

fully debugged in-system.

1.2. A BRIEF HISTORY 3

1.2 A brief history

The first ARM processor was designed by the Acron Computers Limited of Cambridge, England

between 1983 and 1985. While looking for a new processor for the next generation desktop,

they found that the existing commercial microprocessors were not suitable for their purpose,

mainly because of the following two reasons.

These processors were slower than the existing memory parts.

Complex instruction set including instructions requiring hundreds of cycles to execute,

leading to high interrupt latencies.

Thus, the requirement of designing a new processor was felt. However, designing a complex

processor used to take many years even for large companies with expertise available for pro-

cessor design. The solution was found in the Berkeley RISC 1 project which had established

that it was possible to build a very simple processor with performance comparable to the most

advanced CISC processors of the time. Acron designed their first 26-bit Acron RISC Machine

(ARM) processor in 1985, based upon the Berkeley project. It used even lesser than 25,000

transistors and still performed similar to (or better than) Intel 80286 processor that came

out at about the same time. This architecture has later been referred to as ARM version

1 architecture. It was followed by the second processor in 1987. This ARM version 2 had

a coprocessor support. It was extended with on-chip cache in ARM 3 processor. The third

version of ARM architecture, developed in 1992 had features like 32-bit addressing, support

for Memory Management Unit (MMU) and 64-bit multiply-accumulate instructions. It was

implemented in ARM 6 and ARM 7 cores. Prior to this, in 1990, Apple took a decision to use

ARM processor in their Newton PDA. A joint venture called ARM (Advanced RISC Machines)

was launched. ARM entered into the embedded market with the release of these processors

and the Apple Newton PDA in 1992. The 4th generation ARM processor came out in 1996

with the special feature of THUMB 16-bit compressed instruction set. Though THUMB

is slightly less efficient compared to the regular 32-bit ARM instruction set, it takes 40% less

space. The most prominent representative of the 4th generation ARM is the ARM7TDMI core,

which is till now the most popular ARM product. It has been used in most Apple iPod players,

including the video iPod. Another popular implementation of ARMv4 core is the Intel Stron-

gARM processor. The 5th generation of ARM architecture introduced in 1999 have digital

signal processing capability and Java byte code extensions to the ARM instruction set. Intel

XScale processor is the most popular implementation of the 5th generation ARM core. It is

used in a number of embedded devices, network processors, smart phones and PDAs. ARMv6

architecture anounced in 2001 features improvements in many areas covering the memory sys-


tem, improved exception handling and better support for multiprocessing environments. It

also includes media instructions to support Single Instruction Multiple Data (SIMD) software

execution. THUMB-2, an improved THUMB instruction set defining a new set of 32-bit in-

structions that execute along-side 16-bit instructions in THUMB state, was introduced. It

provides better support for two separate address spaces, such that code executing in the non-

secure world cannot gain access to any address space marked as secured. The protection

provided by the technology is necessary for consumer privacy and extending a range of services

such as, mobile banking and multimedia entertainment, to widespread consumer adoption and

use. The next generation ARMv7 cores have been introduced in 2005. It comes with three

different processor profiles. The A profile is for sophisticated virtual memory based OS and

user applications. The R profile is for real-time systems, and the M profile is optimized

for microcontrollers and low-cost applications. The ARMv7A architecture has the option of

NEON technology designed to address the next generation high performance, media intense,

low-power mobile hand-held devices. It is a 64/128-bit hybrid SIMD architecture developed

by ARM to accelerate the performance of multimedia and signal processing applications. The

Vector Floating Point (VFP) coprocessor support is also an architectural option. It supports

single and double precision floating point arithmetic, and is fully IEEE 754 compliant with

suitable software library. Table 1.1 summerizes the discussion.

Table 1.1: ARM architecture summary

Version Year Features Implementation

v1 1985 The first commercial RISC (26-bit) ARM1

v2 1987 Coprocessor support ARM2, ARM3

v3 1992 32-bit, MMU, 64-bit MAC ARM6, ARM7

v4 1996 THUMB ARM7TDMI, ARM8, ARM9TDMI,StrongARM

v5 1999 DSP and Jazelle extensions ARM10, XScale

v6 2001 SIMD, THUMB-2, TrustZone, Multiprocessing ARM11, ARM11 MPCore

v7 2005 Three profiles, NEON, VFP ?

1.3 Structure of ARM7

Fig 1.1 shows a block diagram of ARM7 processor. The major components of ARM7 processor

are described in the following.

1. Instruction Pipeline and Read Data Register: It gets the content of memory location

pointed to by the address bus lines A[31 : 0], of Address Register. The external 32-bit

1.3. STRUCTURE OF ARM7 5

data-in lines DATA[31 : 0] puts the content into this register.

2. Instruction Decoder and Control Logic: It has a number of control inputs determining the

operation policy of the processor. Also, it outputs a number of control signals useful for

interfacing the processor with other peripherals. The various control signals are explained

later.

3. Address Register: It holds the address of the next instruction/data to be fetched. Address

bus A[31 : 0] originates from it. The input signal ALE determines the time upto which

the registers content will remain available on the A[31 : 0] lines. Content is available as

long as ALE remains low.

4. Address Incrementer: It increments the Address Registers value by an appropriate

amount to point to the next instruction/data.

5. Register Bank: It contains 31, 32-bit registers accessible in different modes of operation

of the processor (detailed later). It also contains 6 status registers, each of size 32 bits.

6. Booths Multiplier: It is used in the multiplication instructions.

7. Barrel Shifter: One of the opearnds of data processing instructions can be shifted by a

few bit positions. The barrel shifter located at the input of ALU performs this function.

8. ALU: A 32-bit ALU performs the arithmetic and logic functions.

9. Write Data Register: It holds the value to be written into the memory. The 32-bit value

is available in the DOUT [31 : 0]. The associated signals DBE and nENOUT have been

elaborated later.

1.3.1 Control and Status Signals in ARM7

We will next elaborate the control and status signals used in ARM7 processor. Fig 1.2 shows

a functional diagram of ARM7 CPU. The signals can be grouped as per their functionality.

That way, the signals can be grouped as,

Processor mode: includes the signals nM [4 : 0].

Memory interface: A[31 : 0], DATA[31 : 0], DOUT [31 : 0], nENOUT , nMREQ, SEQ,

nRW , nBW , LOCK.

Memory management interface: nTRANS, ABORT .


Address Register

Register Bank

ShifterBarrel

BoothsMultiplier

32bit ALU

Write Data Register Instruction Pipeline& Read Data Register

DOUT[31:0]DBE nENOUT DATA[31:0]

nEXECDATA32BIGENDPROG32MCLKnWAITnRWnBWnIRQnFIQnRESETABORTnOPCnTRANSnMREQSEQLOCKnCPICPACPBnM[4:0]

LogicControl

&Decoder Instruction

ALU Bus

A Bus

Incrementer Bus

AddressIncrementer

(6 status registers)(31 X 32bit registers)

A[31:0]

PC Bu

s

ALE

B Bus

Figure 1.1: ARM7 block diagram

1.3. STRUCTURE OF ARM7 7

Clock signals: MCLK and nWAIT .

Configuration signals: PROG32, DATA32, BIGEND.

Interrupts: nIRQ, nFIQ.

Bus control signals: ALE, DBE.

Power lines: V DD and V SS.

Special signals: nEXEC and nRESET .

Processor mode signals nM [4 : 0]

These status signals identify the mode of the processor. The output bits are the inverses of

the internal status bits indicating processor operation mode (detailed later).

Clock signals MCLK and nWAIT

MCLK is the master clock input to the ARM processor. It has two phases. In phase 1 MCLK

is low and in phase 2 it is high. The clock may be stretched in either phase to interface slower

devices. Alternatively, nWAIT can be used. ARM7 can be made to wait for an integer number

of MCLK cycles by holding nWAIT low. nWAIT is internally ANDed with MCLK and must

only change when MCLK is low.

Memory interface signals A[31 : 0], DATA[31 : 0], DOUT [31 : 0], nENOUT ,

nMREQ, SEQ, nRW , nBW , LOCK

A[31 : 0] are the address lines constituting the processor address bus. If ALE is high, addresses

become valid during the phase 2 of the previous instruction cycle. This address is used in phase

1 of the referenced cycle. The stable period may be controlled by ALE as discussed later.

DATA[31 : 0] is the input data bus. During read cycles (identified by nRW = 0), input must

be valid before the end of phase 2 of the transfer cycle. DOUT [31 : 0] is the output data bus.

During write cycles (nRW = 1), outout data becomes valid during phase 1 and remain so for

the entire phase 2 of the transfer cycle. nENOUT is a status signal which is activated (made

low) by the processor when DOUT contains a valid data to be written into the memory. This

nENOUT signal can be utilized to create a bidirectional bus with DATA for the memory.

nMREQ is a status signal indicating (when low) that the processor requires memory access


during the following cycle. The signal becomes valid during phase 1, remaining valid through

phase 2 of the cycle preceding that to which it refers.SEQ active-high signal indicates that

the address used in the following cycle is either the same as the last memory address, or is

4 greater (ie. the next word address). It becomes valid during phase 1 and remains valid

throughout phase 2 of the cycle before the one to which it refers. The two signals nMREQ

and SEQ together indicate burst activity one cycle advance. nRW is another status signal.

For a read cycle, the signal is low, for a write cycle, it is high. nBW is high for a word transfer

and low for a byte transfer.The signal LOCK is used for locked memory access. When LOCK

is high, the memory controller should not allow any other device to access memory till LOCK

becomes low. It is used in particular in swap instruction.

Memory management interface nTRANS, ABORT

nTRANS signal indicates when to translate the address. nTRANS = 0 indicates that pro-

cessor is in user mode and address translation should be turned on. The timing of the signal

may be modified by ALE as discussed later. The signal ABORT is an input to the processor.

This allows the memory system to tell the processor that the requested access is not allowed.

Configuration signals PROG32, DATA32, BIGEND

PROG32 is for 32-bit program configuration. When high, it makes the processor to fetch from

32-bit address space. When low, the processor fetches instructions from a 26-bit address space.

DATA32 is for 32-bit data configuration. When high, the proessor uses 32-bit address for data

fetch. When low, data is fetched from a 26-bit address space. BIGEND = 1 instructs the

processor to utilize big-endian convention. When low, little-endian format is assumed.

Interrupts nIRQ, nFIQ

nFIQ is an asynchronous interrupt to the processor. The signal is low-level sensitive and must

be held low till an action is received from the processor. This is responded the fastest. On the

other hand, nIRQ is of lower priority.

1.4. ARM PIPELINE 9

Bus control signals ALE, DBE

ALE is the address latch enable, an input signal to ARM7 to extend the stable address on

the address lines till end of phase 2 of MCLK. By default (ALE being 1), address changes

during phase 2 of MCLK to the value needed in the next cycle. Setting ALE to 0 also extends

the following signals nBW , nRW , LOCK, nOPC, nTRANS. DBE is data bus enable. It

facilitates data bus sharing for DMA mode of operation.

Special signals nEXEC and nRESET

nEXEC is high when the instruction in the execution unit is not being executed because, for

example, it has failed its condition code check. nRESET is low-level sensitive reset signal for

the processor.

1.4 ARM Pipeline

One important feature of ARM architecture is its pipelined organization. There are various

versions of this pipeline used in different ARM architectures. These are,

3-stage pipeline (ARM7TDMI and earlier).

5-stage pipeline (ARMS, ARM9TDMI).

6-stage pipeline (ARM10TDMI).

8-stage pipeline (ARM11).

1.4.1 3-stage pipeline

The 3-stage pipeline is the classical fetch-decode-execute pipeline. The first pipeline stage reads

an instruction from memory and increments the value in the instruction address register. The

value is also stored in the program counter (PC). The next stage decodes the instruction and

prepares control signals required to execute it on. The third stage does all the actual works:

reading operands from the register file, performing ALU operations by fetching or writing to

the memory the temporary data (if necessary), and finally writing back the modified register

values. In the case of data processing instructions, the result from ALU is directly written into

the register file, thus completing the execution stage of the instruction in one cycle, while for


MCLKnWAIT

DATA32

PROG32

nM[4:0]

A[31:0]

nENOUTnMREQSEQnRWnBWLOCK

nTRANSABORT

nOPCnCPICPACPB

BIGEND

nIRQnFIQ

nRESET

ALEDBE

VDDVSS

ARM7

ProcessorMode

Memory Interface

MemoryManagementInterface

Power

ControlsBus

Interrupts

Configuration

Clocks

InterfaceCoprocessor

DATA[31:0]

DOUT[31:0]nEXEC

Figure 1.2: ARM7 functional diagram

1.4. ARM PIPELINE 11

load/store type of instructions, the address computed by the ALU is placed on the address bus

and the actual memory access is performed during the second cycle of the execution stage.


The problem with 3-stage pipeline is the pipeline stall caused by every data transfer instruction

the next instruction cannot be fetched while memory is being read/written. To circumvent

the problem, in ARM9TDMI and later architectures, instruction and data memory have been

separated. To make the pipeline balanced the following measures have been taken.

The register read step is moved to the decode stage.

Execute stage is split into three stages. The first stage performs arithmetic computations,

the second stage performs memory access, while the third stage writes the result back to

the register file. It may be noted that the second stage will remain idle when executing

data processing instructions.

This modification balances the pipeline, reducing the CPI (average number of Clocks Per

Instruction). However, there is a new complicacy we need to forward data between pipeline

stages to resolve data dependencies between the stages without stalling the pipeline.


In ARM10 core, the instruction decode stage is split into two pipeline stages the decode

stage and the register stage. This creates a 6-stage pipeline. While the decode stage performs

the decoding operation, register stage reads the register to be used. The major advancement

introduced are in the width of the instruction and data buses, both of which are made 64-bit.

Thus, fetch stage can fetch two instructions simultaneously. A static branch predictor module

has been introduced. A separate adder has been introduced in the execution unit to take care

of multiply-accummulate imstructions.


Two major changes have been introduced in ARM11 cores creating an eight-stage pipeline.

Shift operation has been separated into a separate pipeline stage.


Both instructions and data accesses are distributed across two pipeline stages.

The execution unit is split into three different pipelines that can operate concurrently and

commit instructions out-of-order also.

1.5 Instruction Set Architecture (ISA)

ARM, in most respects, is a typical RISC architecture. However, several enhancements to it

has been introduced to improve the performance further. The RISC features present in ARM

are as follows.

Large uniform register file with 16 general purpose registers.

Load/store architecture. The instructions that process data operate only on registers

and are separate from instructions that access memory.

Simple addressing modes.

Uniform and fixed-length instruction fields. All ARM instructions are 32-bit long and

most of them have a regular three-operand encoding.

These features help in the implementation of pipelining in the ARM architecture. However,

in order to keep the architecture simple and improve performance, a number of other (non-

RISC) features have been introduced.

Each instruction controls the ALU and the shifter. Thus, making the instructions more

powerful.

Auto-increment and auto-decrement addressing modes have been incorporated. This

increments or decrements the value of an index register while a load or store operation

is in progress.

Multiple load/store instructions that allow to load or store upto 16 registers at once

have been introduced. While violating the one cycle per instruction principle, they

significantly speed up performance-critical operations, such as procedure invocation and

bulk data-transfers, and lead to more compact code.

Conditional execution of instructions has been introduced. In the machine code of an

instruction, opcode is preceded by a 4-bit condition code. For the instruction to execute,

1.5. INSTRUCTION SET ARCHITECTURE (ISA) 13

the condition stated must be met. This goes a long way to eliminate small branches in

the program code and thus eliminating stalls in the pipeline.

All these features have resulted in high performance, low code size, low power consumption,

and low silicon area in ARM.

1.5.1 Registers

The ARM ISA has 16 general-purpose registers, R0 R15, in the user mode. Out of this, reg-

ister R15 is the program counter which may also be manipulated as a general-purpose register.

Register R13 and R14 also have special functions; R13 is used as the stack pointer, though

this has only been defined as a programming convention. Unusually, the ARM instruction set

does not have PUSH and POP instructions, so stack handling is done via a set of instructions

that allow loading and storing multiple registers in a single operation. R14 has special signif-

icance and is called the link register. When a procedure call is made, the return address is

automatically placed into this register (and not in the stack, as usually done in other proces-

sors). A return from the procedure can thus be implemented by moving the content of R14

to R15. Another important register, the current program status register (CPSR) contains four

1-bit condition flags (namely, negative, zero, carry, and overflow) and four fields representing

the execution state of the processor. The I and F flags of CPSR enable normal and fast

interrupts respectively. The T field is used to switch between ARM and THUMB instruction

sets. The mode field selects one of the six execution modes as follows.

1. User mode is used to run the application code. Once in user mode, the CPSR cannot be

written to. Mode can only be changed when an exception is generated.

2. Fast interrupt processing mode (FIQ) supports high speed interrupt handling. Generally

it is used for a single critical interrupt source in a system.

3. Normal interrupt processing mode (IRQ) supports all other interrupt sources in a system.

4. Supervisor mode (SVC) is entered when the processor encounters a software interrupt

instruction. These are standard ways to invoke operating system services. Upon reset,

ARM enters into this mode.

5. Undefined instruction mode (UNDEF) is entered if the fetched opcode is not an ARM

instruction or a coprocessor instruction.

6. Abort mode is entered in response to memory fault, for example, an instruction or data

fetched from an invalid memory region.


The user registers R0 to R7 are common to all operating modes. However, FIQ mode has

its own R8 to R14 registers that replace the user registers when FIQ mode is entered. Similarly,

each of the other modes have their own R13 and R14 registers so that each mode has its own

stack pointer and link register. The CPSR is also common to all the modes. However, in each

of the exception modes, an additional register the saved program status register (SPSR) is

added. SPSR registers store a copy of the value of the CPSR register before an exception was

raised. Fig 1.3(a) shows all the user accessible registers while Fig 1.3(b) shows the structure

of the CPSR.

R0

R2R1

R3R4R5R6R7R8R9R10R11R12

R7R8R9R10R11R12

R0

R2R1

R3R4R5R6

R15 (PC)R14_svcR13_svc

CPSRSPSR_fiq

R13R14

R15 (PC)

R0

R2R1

R3R4R5R6

R15 (PC)

R7_fiqR8_fiqR9_fiqR10_fiqR11_fiqR12_fiqR13_fiqR14_fiq

R7R8R9R10R11R12

R0

R2R1

R3R4R5R6

R15 (PC)

R13_abtR14_abt

R7R8R9R10R11R12

R0

R2R1

R3R4R5R6

R15 (PC)

R13_irqR14_irq

R7R8R9R10R11R12

R0

R2R1

R3R4R5R6

R15 (PC)

R13_undR14_und

CPSR CPSR CPSR CPSR CPSRSPSR_svc SPSR_abt SPSR_irq SPSR_und

User FIQ Supervisor Abort IRQ Undefined

N Z C V I F T Mode

057831 30 29 28 27 6 4Unused

(a)

N: negativeZ: zeroC: carryV: overflow

I: IRQ enableF: FRQ enableT: THUMB instruction set

(b)Mode: 6 operating modes

Figure 1.3: (a) ARM registers in different modes (b) CPSR register content


Data Types

The ARM instruction set supports six different data types, namely, 8-bit signed and unsigned,

16-bit signed and unsigned, 32-bit signed and unsigned. The ARM processor instruction set

has been designed to support these data types in little- or big-endian format. However, most

of the ARM silicon implementations use the little-endian format.

In the following we give a brief overview of different types of ARM instructions. ARM has

got two instruction sets:

ARM:

Standard 32-bit instruction set

It consists of the following types of instructions as shown in Fig 1.4.

Data processing

Data transfer

Block transfer

Branching

Multiply

Conditional

Software interrupts

THUMB:

16-bit compressed form

Code density better than most CISC processors

Dynamic decompression in pipeline

1.5.2 Data processing instructions

The ARM architecture provides a range of arithmetic operations, such as, addition, subtraction,

multiplication etc., and a set of bit-wise logical operations. All these instructions take two 32-

bit operands and return a 32-bit result. The multiplication instruction can return a 32- or

64-bit value. All these operands and results can be specified independently. Out of the three,

the first operand and the result must be registers, while the second operand can be either a

register or an immediate value. If the second operand is a register, it can be shifted or rotated


ARM instruction set

Data processing

Block transfer

Multiply

Software interrupts

Data transfer

Branching

Conditional

Figure 1.4: ARM instruction set

before being sent to the ALU. On the other hand, for immediate operand, it must be a 32-

bit value. However, all 32-bit constants cannot be specified here. This is due to the limited

space available for operand specification inside the 32-bit instruction. An immediate operand

should be a 32-bit binary number where all the binary ones fall within a group of eight adjacent

bit positions on a 2-bit boundary. More formally, a valid immediate operand n satisfies the

following equation:

n = i ROR (2 r)

where i is a number between 0 and 255 (inclusive), r is between 0 and 15 (inclusive) and ROR

is the rotate right operation. Examples of such numbers are 255 (i = 255, r = 0), 256 (i = 1,

r = 12), hexadecimal number FF000000 (i = 255, r = 4), etc.

One interesting feature of the ARM architecture is that the modification of condition flags

by arithmetic instructions is optional. This adds flexibility to the programming in the sense

that flags do not need to be checked right after the instruction that set them, but can be done

later in the instruction stream, provided that other intermediate instructions do not change

the flags.

Some examples of data processing instructions are,

ADD R1, R2, R3; R1 = R2 + R3.

ADD R1, R2, R3, LSL #2; R1 = R2 + (R3 4).

ADDS R1, R2, R3, LSL #2; R1 = R2 + (R3 4) and set condition code flags.


1.5.3 Data transfer instructions

ARM supports two types of data transfer instrcutions: single register transfers and multiple

register transfers. Single register transfer instructions can be used to transfer 1, 2, or 4-bytes of

data between a register and a memory location. On the other hand, multiple register load/store

operations can be carried out via multiple register transfer instructions. The addressing mode

to be used for multiple register transfer is base-plus-offset addressing. Value in the base register

is added to the offset stored in a register or passed as an immediate value to form the memory

address. An auto-indexed addressing mode can also be used which writes the value of the

base register incremented by the offset back to the base register. It helps in accessing the

next memory location in the next instruction without wasting an additional instruction to

increment the register.Two different auto-indexed addressing modes are supported the pre-

indexed mode uses the new computed address for the load/store operation, and then updates

the base register with the new computed value. The post-indexed mode uses the unmodified

base register for the transfer and then updates the base register. Multiple register transfer

instructions also support auto-indexed addressing. Multiple register transfer instructions are

particularly useful while entering or exiting a procedure to pass parameters or return values.

The following points may be noted regarding the offset and the indexing.

The offset can be

An unsigned 12-bit immediate value (that is, 0 to 4095 bytes)

A register, optionally shifted by an immediate value

Either added or subtracted from the base register

Prefix the offset value or register with + (default) or -

Applied

before the transfer is made: Pre-indexed addressing (optionally auto-

incrementing the base register by postfixing the instruction with an !)

after the transfer is made: Post-indexed addressing, causing the base register to

be auto-incremented.

Some examples of data transfer instructions are,


LDR R0, [R8] load content of memory location pointed to by R8 into R0

LDR R0, [R1, -R2] load content of memory location pointed to by R1R2 into R0

LDR R0, [R1, #4] load content of memory location pointed to by R1 + 4 into R0

LDR R0, [R1, #4]! load content of memory location pointed to by R1 + 4 into R0,

R1 is also incremented by 4

LDR R0, [R1], #16 Loads R0 from memory location pointed to by R1, then adds

16 to R1

LDR Load word STR Store word

LDRH Load half word STRH Store half word

LDRSH Load signed half word STRSH Store signed half word

LDRB Load byte STRB Store byte

LDRSB Load signed byte STRSB Store signed byte

As discussed earlier, ARM supports both little-endian and big-endian formats for data

access. In the little-endian format, the least significant byte of a word is stored in bits 0-7 of

an addressed word, whereas, in the big-endian format, the least significant byte is stored in

bits 24-31. It may be noted that this has got significance only when the data is stored as words

and then accessed as bytes or halfwords. Fig 1.5 gives an example of the same.

10 20 30 40

10 20 30 40

31 24 23 16 15 8 7 0

31 24 23 16 15 8 7 0 31 24 23 16 15 8 7 0

31 24 23 16 15 8 7 031 24 23 16 15 8 7 0

40 30 20 10

4000 00 00 00 00 0010

STR R0, [R1]

Bigendian LittleendianLDRB R2, [R1]

R1 = 0x200

R0 = 0x10203040

R2=0x40R2= 0x10

Memory Memory

Figure 1.5: Big-endian vs. Little-endian


Block data transfer

The Load and Store multiple instructions (LDM/STM) allow between 1 and 16 registers to be

transferred to or from memory. The transferred registers can be either:

Any subset of the current bank of registers (default)

Any subset of the user mode bank of registers when in a previledged mode (postfix

instruction with a symbol).

Like single register transfer operations, in this case also the base register is used to determine

where memory access should occur. Auto-increment, decrement are also supported for the base

register. Lowest register number is always transferred to/from the lowest memory location

accessed. The block transfer instructions have got efficient utilization in:

implementing stack for saving and restoring context

moving large blocks of data around memory

1. Stack operation: Though traditionally a stack grows down in memory with the last

pushed value at the lowest address, ARM can be made to implement ascending stack

also, where the stack grows up through memory. The value of the stack pointer can

either:

Point to the last occupied address (Full stack), and thus needs pre-decrementing

before the push

Point to the next occupied address (Empty stack), and thus needs post-decrementing

after the push

The stack type to be used is given by the postfix to the instruction as follows.

STMFD/LDMFD: Full descending stack

STMFA/LDMFA: Full ascending stack

STMED/LDMED: Empty descending stack

STMEA/LDMEA: Empty ascending stack

For example, Fig 1.6 shows four stack configurations. It may be noted that the sequence

of registers within an instruction does not affect the order in which the registers are

saved. ARM follows the strategy that the lowest register number is always stored at the


r5Old SP r5

SP

Old SPr4r3r1r0

SP

Old SP

r5

r0r1

r4r3

SP

r4r3r1r0

r0r1r3r4r5

SP

Old SP

STMFD SP!, {R0,R1,R3R5} STMED SP!, {R0,R1,R3R5} STMFA SP!, {R0,R1,R3R5} STMEA SP!, {R0,R1,R3R5}

Figure 1.6: Example stack

lowest address. Thus, instead of writing the requence as R0, R1, R3R5 specifying it

as R1, R0, R5, R4, R3 or R5, R0R1, R3R4 have the same effect.

2. Moving large data block: We can best explain it with the help of an example. Sup-

pose, we have to copy a block of memory which is an exact multiple of 12 words long from

the location pointed to by register R12 to the location pointed to by R13. R14 points

to the end of the block to be copied. The following versions of LDM/STM instructions

can be utilized for this purpose.

STMIA/LDMIA: Increment after

STMIB/LDMIB: Increment before

STMDA/LDMDA: Decrement after

STMDB/LDMDB: Decrement before

The exact code to perform the task may be as follows.

; R12 points to the start of the source data

; R14 points to the end of the source data

; R13 points to the start of the destination data

loop LDMIA R12!, {R0-R11} ; load 48 bytes

STMIA R13!, {R0-R11} ; and store them

CMP R12, R14 ; check for the end

BNE loop ; and loop until done


1.5.4 Multiplication instructions

ARM provides several versions of multiplication. These are,

Integer multiplication (32-bit result)

Long integer multiplication (64-bit result)

Multiply accumulate instruction particularly useful in signal processing applications,

for example, digital filtering

Accordingly, there are several version of the multiplication instruction.

MUL Multiply 32-bit result

MULA Multiply accumulate 32-bit result

UMULL Unsigned multiply 64-bit result

UMLAL Unsigned multiply accumulate b 64-bit result

SMULL Signed multiply 64-bit result

SMLAL Signed multiply accumulate 64-bit result

For example,

MUL R0, R1, R2 R0 gets R1 R2

MULA R0, R1, R2, R3 R0 gets R1 R2 + R3

However, there are a few restrictions regarding source and destination.

1. Destination and first operand cannot be the same register.

2. PC (R15) cannot be used for mutltiplication.

ARM uses the Booths Algorithm to perform integer multiplication. In some variants of

ARM processors, multiplication proceeds as follows.

For each pair of bits, it takes 1 cycle. One more cycle is needed to

start the instruction. The multiplication continues till the source register has

some 1s left in it. Otherwise, the algorithm early-terminates. For ex-

ample, to multiply 18 (00000000000000000000000000010010 in binary) and -1

(11111111111111111111111111111111 in binary), if the source register content is 18, it

will take 4 cycles, whereas, if the source register holds -1, number of cycles needed is 17.

Generally, the language compilers take care of such issues.


In some other variants containing extended multiplication hardware, the following enhance-

ments have been performed.

An 8-bit Booths algorithm is used, which makes the multiplication faster (maximum

standard instruction is now 5 cycles).

Early termination method is improved in the sense that the multiplication completes

when all remaining bit sets contain either all zeros, or all ones.

64-bit results can be produced from 32-bit operands, this provides higher accuracy. A

pair of registers is used to hold the result.

1.5.5 Software Interrupt

The software interrupt (SWI) instruction forces the CPU into supervisor mode. Its format is,

SWI #n

The execution of the instruction causes an exception trap to the SWI hardware vector (forcing

the mode change and an associated state saving), thus causing the SWI exception handler to

be called. The handler can now analyze the value of n to determine which action to perform.

It should be noted that the processor completely ignores n, and it is the software interrupt

handler that may analyze the value of n to perform the desired job. This is very suitable for

an operating system to implement a set of privileged operations which applications running in

user mode can request. Such requests are commonly known as system calls. The value of n is

24-bits, thus providing facility for a maximum of 224 calls. Unlike many other processors (such

as, Intel processors), the hardware does not attempt to distinguish between these 224 cases. If

separate addresses are to be maintained for them, a total of 224 4 (address bus is 32 bits =

4 bytes) = 226 bytes of memory will be needed.

1.5.6 Conditional Execution

This is an interesting feature of ARM instruction set. While most of the existing architecures

allow only branches to be executed conditionally, ARM allows all instructions to be executed

conditionally. The most significant four bits of each instruction are reserved to hold the 16

condition codes. There are four flags N, Z, C, and V. The following are the interpretaions of

these four bit settings.

0000 : EQ (Equal)


0001 : NE (Not equal)

0010 : HS/CS (Unsigned higher or same)

0011 : LO/CC (Unsigned lower)

0100 : MI (Negative)

0101 : PL (Positive or zero)

0110 : VS (Overflow)

0111 : VC (No overflow)

1000 : HI (Unsigned higher)

1001 : LS (Set unsigned lower or same)

1010 : GE (Greater or equal)

1011 : LT (Lower)

1100 : GT (Greater)

1110 : AL (Always)

1111 : NV (Reserved)

We have already seen how to set the conditional flags in instruction execution. For example,

the use of mnemonic ADDS in place of ADD sets the flags, whereas, the basic ADD instruction

does not affect any of the flags. Now, to execute an instruction conditionally, we have to use

mnemonis like EQADD. For example,

EQADD R0, R1, R2 perform R0 = R1 + R2 only if zero flag is set.

EQADDS R1, R2, R3, LSL #2 perform R1 = R2 + (R3 4) only if zero flag is set,

and set condition code flags.

1.5.7 Branch instruction

In ARM processor, the following features of branch instructions can be noted.

1. All branches are relative to the program counter.

2. Jump is always within a limit of 32MB.

3. Conditional branches are formed by using the condition codes as discussed earlier.

4. Subroutine call instruction is also modeled as a variant of branch instruction.

There are two opcodes reserved for branching B (standard branch) and BL (branch with

link, current value of PC+4 is saved in link register R14). The BL version can be used to call

subroutine. The return from subroutine can be affected by copying the content of R14 back

to PC. The branch instruction structure is as shown in Fig 1.7.


2431 28 27 25 23 0

L1 0 1 Offset

Link bit, 0 = Branch, 1 = Branch with link

Condition

Figure 1.7: Branch instruction format

In the coding of a branch instruction, to get the offset, first a 26-bit difference is computed

between the branch instruction and the target. It is then right shifted by 2 bits (as the

instructions are always word-aligned, least significant two bits are always zeros), and stored

in the instruction encoding. This provides a branch range of 32MB. While executing the

branch, the processor shifts the offset left by two bits, sign extends it to 32-bits, and adds it

to the PC to get the branch target. The instruction pipeline is filled by getting instructions

from the target address, and the execution resumes.

Another very important class of branch instructions is the Branch exchange BX and

BLX. These are similar to B and BL instructions, however, it also performs the exchange of

instruction set between ARM instructions and THUMB instructions. This is the only way

to swap instruction sets.

1.5.8 Swap instruction

A careful look into the ARM instructions will reveal that a single instruction can read/write

at the most one memory location. The swap instriction is an exception to this. Swap is an

atomic operation in which a memory read is followed by a memory write which moves byte or

word between registers and memory. Its format is,

SWP Rd, Rm, [Rn]

SWPB Rd, Rm, [Rn]

The execution proceeds as follows (as depicted in Fig 1.8).

1. It is a two-cycle operation.

2. Content of memory location pointed to by register Rn is copied into a temorary space.

3. Content of register Rm is copied into the memory location.

4. Content of the temporary space is copied into the register Rd.

1.6. THUMB INSTRUCTIONS 25

Thus, to effect an interchange between the registers Rd and Rm, they should be made the

same register. In that case, content of the memory location is swapped with the register in

an atomic operation. This can be exploited to implement the semaphore operations of the

operating system.

Rn

Rm Rd

temp1

23

Memory

Figure 1.8: Swap instruction execution

1.5.9 Modifying Status Registers

The status registers (that is, CPSR and SPSRs) can only be modified indirectly. The in-

struction MSR moves content from CPSR/SPSR to selected general purpose register (GPR),

whereas, the MRS moves content of selected GPR to CPSR/SPSR. The instruction can only

be executed in the privileged modes.

1.6 THUMB instructions

As we have noted earlier, the ARM processor supports two different instruction sets ARM

and THUMB. While the ARM instruction set has got 32-bit instructions, THUMB instructions

are 16-bit in length. These instructions are stored in a compressed form. The instructions are

decompressed into ARM instructions and then executed by the processor. Fig 1.9 shows

the THUMB instruction processing. Fig 1.10 shows the THUMB instruction decompressor.

THUMB instruction set must always be entered by runing a BX/BLX instruction. These


THUMB code 16 bit Instruction

PipelineTHUMB

DecompressorARM

InstructionDecoder

Figure 1.9: THUMB instruction processing

Instruction pipeline

To B Operand Bus

Mux select high or low halfword

Various control signals

ARM instriction decoderimmediate fielddata in

select ARM orTHUMB stream Mux

Thumb decompressor

data in from memory

Figure 1.10: THUMB instruction decompressor

1.6. THUMB INSTRUCTIONS 27

instructions are similar to their ARM counterparts, with a few exceptions.

1. THUMB instructions are executed unconditionally, excepting the branch instructions.

2. THUMB instructions have unlimited access to registers R0-R7 and R13-R15. A reduced

number of instructions can access the full register set. As shown in Fig 1.11, only a few

instructions can access registers R8-R12.

3. The instructions look more like a conventional processors instructions. For example,

instructions like PUSH and POP are present for stack manipulation, though the final

implementation is via the ARM multibyte transfer instructions. It implements a de-

scending stack with the stack pointer hardwired to R13.

4. No MSR and MRS instructions.

5. The maximum number of SWI calls is restricted to 256.

6. On RESET and on raising of an exception, the processor always enters into the ARM

instruction set mode.

7. Similarities and differences between ARM and THUMB instruction sets have been de-

tailed in Table 1.2.

R0R1R2R3R4R5R6R7R8R9R10R11R12

R13 (SP)

R15 (PC) CPSRShaded registers have restricted access

R14 (Link)

Figure 1.11: THUMB programmers model


Table 1.2: Comparison between ARM and THUMB instruction setsSimilarities Differences

1. Load-store architecture 1. Most THUMB instructions are unconditional,while all ARM instructions are conditional

2. Support for 8-, 16- and 32-bit data types 2. Most THUMB instructions use a 2-address format,while most ARM instructions use a 3-address format

3. 32-bit unsegmented memory 3. THUMB instruction formats are less regular a result of denser encoding

4. THUMB has explicit shift opcodes, while ARMimplements shifts as operand modifiers

The basic advantage of THUMB comes in the form of higher code density than ARM code.

The THUMB code requires, on an average, 30% less space. If the memory is organized as

32-bit words, ARM code is 40% faster than THUMB. However, if the memory is organized as

16-bit words only, the THUMB code is found to be faster than ARM code by 45%. The most

important advantage of THUMB comes in the form of power savings. THUMB code uses upto

30% less power than ARM code. Excepting the most speed-critical embedded devices, the

cost of memory and power are much more critical than the execution speed of the processor.

This is why the THUMB instruction set may be the choice of many simple embedded system

designs. The choice between ARM and THUMB instruction sets can be made as follows.

1. For the best performance, 32-bit memory and ARM instruction set should be used.

2. For best cost and power efficiency, it is advisable to use 16-bit memory with THUMB

code.

3. In a typical embedded system

(a) ARM code should be used in 32-bit on-chip memory for small speed-critical routines.

(b) THUMB code should be used in 16-bit off-chip memory for large non-critical control

routines.

1.7 Exceptions in ARM

The exceptions can be raised in ARM from different situations. These can be split into three

different categories.

1. Exceptions can occur through execution of instructions. These include the software

interrupts, undefined instructions and memory abort.

1.8. PROGRAMMING EXAMPLES 29

2. Exceptions can also be caused as a side-effect of an instruction, like data fetch failure.

3. Other sources like RESET, FIQ, and IRQ interrupts.

For all these exceptions, the mechanism followed is similar. The processor switches to the

privileged mode, current value of PC(R15) + 4 is saved into the link register (R14) of the

privileged mode. Current CPSR is saved in the SPSR of the privileged mode. The IRQ

interrupts are disabled. If the exception raised is an FIQ, the FIQ interrupts are also disabled.

Next, the program counter is loaded with the exception vector address, and the execution of

the exception starts. Table 1.3 shows the vector table for different ARM exceptions. Two

Table 1.3: ARM7 vector table

Exception type Mode Meaning

Reset Supervisor 0x00000000Undefined instruction Undefined 0x00000004Software interrupt (SWI) Supervisor 0x00000008Prefetch abort (instruction fetch abort) Abort 0x0000000CData Abort (data access memory abort) Abort 0x00000010IRQ (interrupt) IRQ 0x00000018FIQ (fast interrupt) FIQ 0x0000001C

important issues are to be noted here.

1. The vector at location 0x00000014 is missing to ensure a backward compatibility.

2. The FIQ has been given the highest address, so that the interrupt service routine can

start from that address directly. This eliminates the necessity of another jump instruction

from the vector address to the interrupt service routine, thus saving the time needed to

start the routine after the fast interrupt has occurred.

1.8 Programming examples

In this section, we will look into a few programming examples of moderate complexity. The

idea is to get an understanding about how to use the ARM instructions in an assembly language

program.


1.8.1 Finiding the maximum of a set of numbers

Assume that the register R0 holds the start address of the data block (each data is 4 bytes

long) and R2 holds the number of elements in the block. The maximum of the elements will

be stored in register R1.

EOR R1, R1, R1 ;clear R1 to store the largest

CMP R2, #0

BEQ Over ;if block is empty, done

Loop

LDR R3, [R0] ;get the data

CMP R3, R1 ;do comparison

BCC Looptest ;skip if R1 is bigger

MOV R1, R3 ;else get the larger in R1

Looptest

ADD R0, R0, #4 ;increment pointer R0

SUBS R2, R2, #1 ;decrement number of elements left

BNE Loop ;if not done, loop

Over

;R1 holds the largest

1.8.2 Comparing two null terminated strings

In this example we assume that the two strings are pointed to by the registers R0 and R1

respectively, both the strings are null terminated (that is, the last byte is 0), and that the

result will be stored in register R2. If the two strings match exactly, the content of R2 will be

0, else it will be -1.

Loop

LDRB R3, [R0] ;get next character of string 1

LDRB R4, [R1] ;get next character of string 2

CMP R3, R4 ;compare

BNE Notsame ;if not same, strings do not match

CMP R3, #0 ;check if end of string reached

BEQ Same ;if equal, same

ADD R0, R0, #1 ;increment pointer to string 1

ADD R1, R1, #1 ;increment pointer to string 2

BAL Loop ;branch always to check next character

1.9. CONCLUSION 31

Notsame

MOV R2, #-1 ;mark not matched

BAL Over

Same

MOV R2, #0 ;mark matched

Over

;R2 holds the match

1.9 Conclusion

In this chapter, we have seen a broad overview of ARM microcontroller architecture. It has

many interesting architectural features combining both CISC and RISC philosophy. The fa-

cilities provided in FIQ handling reduces response time to critical interrupts. The controlled

modification of status bits by arithmetic instructions and conditional instruction execution are

unique to the processor. The large number of system calls that can be supported by SWI

instruction is definitely helpful for the OS designers. Avaialability of large number of registers

help in speedy context switch. The highly compact and low-power THUMB instructions make

the processor ideal choice for small, low-cost embedded system development.

Exercises

1.1 How does a typical microcontroller differ from a microprocessor? What are the typical

architecural blocks present in a microcontroller?

1.2 What are the factors guiding the choice of a microcontroller for an embedded application?

1.3 What is meant by application specific ARM processor? How does it help the embedded

system designer?

1.4 How does in-circuit emulation helps in system development?

1.5 Explain the internal structure of ARM processor.

1.6 How does the ALE signal of ARM differ from ALE of x86 family of processors?

1.7 Explain the functional diagram of ARM processor.

1.8 Enumerate the evolution of various pipelining structures in ARM.


1.9 Compare and contrast between RISC and CISC instructions. Explain how in ARM both

of the philosophies have been merged.

1.10 How many general and special-purpose registers are there in ARM? Explain their func-

tionality.

1.11 State the role of R13, R14, R15. How does R15 differ from program counter in general

CPU?

1.12 Enumerate various operating modes of ARM.

1.13 Mention different data types available in ARM. How does a big-endian type differ from

little-endian type?

1.14 Suppose the number 0x12345678 is stored at memory location 1000 in big-endian format.

If the processor now assumes the data to be in little-endian format, what will it get if it

reads (a) a byte, (b) a half-word, (c) a word from location 1000? In the previous example,

suppose that the data bus lines are connected to the memory in the reverse order, that

is bit-0 from processor is connected to data line 31 of memory, 1 to 30, and so on. What

value will be read in caeses (a), (b) and (c) noted above?

1.15 Enumerate the types of instructions in ARM.

1.16 Why is it the case that all possible 32-bit operands cannot be specified as immediate

operands in ARM data processing instructions? Out of the following numbers, which

can be represented: 56, 1000, 513, -1?

1.17 How to set condition flags in arithmetic operations in ARM? What advantage does this

preferential setting provide us?

1.18 Mention different ways in which the address of a memory operand can be specified in

ARM.

1.19 Why do you think that not providing separate stack-space in ARM may be beneficial?

Show the content of the stack after executing each of these instructions on a partially

filled stack STMFD SP!, {R0-R12}, STMED SP!, {R0-R12}, STMFA SP!, {R0-R12},

STMEA SP!, {R0-R12}.

1.20 What do you mean by early-termination? If the operands for multiplication are 12 and

55, which one should be used as the source operand?

1.21 State the differences between the ways in which software interrupts are handled in Intel

processors and in ARM.

1.9. CONCLUSION 33

1.22 Explain the conditional execution of statements in ARM. For the following piece of high-

level code, show the corresponding assembly-level code for ARM. Assume all the variables

are available in registers.

if a > b then

x = y + z

else

x = y z

1.23 What are the advantages of program counter relative jump over absolute jump? Why

do you think that the 24-bit offset in the branch instruction structure is sufficient

for 32MB branch range in ARM? Assuming that the current program counter is at

0x10000000, compute the 24-bit offset for the addresses 0x10000020 and 0x0FFFFFFC.

1.24 How does BX/BLX differ from B/BL?

1.25 What is meant by an atomic operation? How is the SWAP intruction different from others

from atomicity viewpoint? How does it help in implementing semaphores, explain with

suitable example.

1.26 What is THUMB? How does the THUMB instruction set differ from ARM instruction

set? Are the THUMB instructions executed directly?

1.27 Why are the FIQ interrupts can be handled faster than IRQ interrupts in ARM?

1.28 Write the following programs in the assembly language of ARM.

(a) Sort a given set of numbers.

(b) Find second highest of a set of numbers without actually sorting it.

(c) Concatenate two null terminated strings.

(d) Look for a specific substring within a null terminated string.

1.29 Perform a comparative study between the microcontrollers (i) ARM, (ii) 8051, (iii) PIC,

(iv) AVR.

arm

Documents

arm architecture

arm processors

arm instruction

arm corporation

arm microcontrollerarm

arm cores

arm processor architecture

single chip