

DAYANANDA SAGAR INSTITUTE OF TECHNOLOGY

(POLYTECHNIC)

BANGALORE

Affiliated to A.I.C.T.E.

NEW DELHI

COMMUNICATION AND ANALYSIS SKILL DEVELOPMENT PROGRAMME

REPORT ON

“MICROPROCESSOR”

Submitted in partial fulfillment of the requirements for the award of

5th Semester Diploma in Electronics and Communication

By

Amiyodyuti Ganguly (305EC10009)

DEPARTMENT OF ELECTRONICS AND COMMUNICATION


DAYANANDA SAGAR INSTITUTE OF TECHNOLOGY

SHAVIGE MALLESHWARA HILLS, KUMARASWAMY LAYOUT, BANASHANKARI

BANGALORE-560078

DEPARTMENT OF ELECTRONICS AND COMMUNICATION

CERTIFICATE

This is to certify that the CASP seminar entitled

“MICROPROCESSOR”

Has been completed successfully

Submitted By

Amiyodyuti Ganguly (305EC10009)

as prescribed by the Board of Technical Education

GOVERNMENT OF KARNATAKA

This is a bonafide work carried out by him at

DAYANANDA SAGAR INSTITUTE OF TECHNOLOGY

During the year 2014-2015 under the guidance of

Mrs. Muyeen Reshma

GUIDE: PRINCIPAL:


ACKNOWLEDGEMENT

This report would be incomplete if we did not take the opportunity to sincerely thank all the people who contributed to the successful completion of this project.

We gratefully thank our principal Prof. PREM KUMAR of Dayananda Sagar Institute of Technology for providing this opportunity to work on this project.

He has been a constant source of inspiration for us throughout this project.

We heartily thank them for their encouragement and support.

We would like to thank our internal guide Mrs. MUYEEN RESHMA for her valuable guidance and advice, which enabled the successful completion of this project.

We would also like to thank the Head of the Department and all other staff members for their moral support.


CONTENTS

1. INTRODUCTION
2. STRUCTURE
3. SPECIAL-PURPOSE DESIGN
4. EMBEDDED APPLICATIONS
5. HISTORY
6. CADC
7. TMS 1000
8. INTEL 4004
9. PICO/GENERAL INSTRUMENT
10. FOUR-PHASE SYSTEMS AL1
11. 8-BIT DESIGNS
12. 12-BIT DESIGNS
13. 16-BIT DESIGNS
14. 32-BIT DESIGNS
15. 64-BIT DESIGNS IN PC
16. MULTI-CORE DESIGNS
17. RISC
18. MICROARCHITECTURE
19. ARITHMETIC LOGIC UNIT
20. CENTRAL PROCESSING UNIT


ABSTRACT

Microprocessor

The Intel 4004, the first commercial microprocessor

A microprocessor incorporates the functions of a computer's central processing unit (CPU)

on a single integrated circuit (IC), or at most a few integrated circuits. All modern CPUs are

microprocessors making the micro- prefix redundant. The microprocessor is a multipurpose,

programmable device that accepts digital data as input, processes it according to

instructions stored in its memory, and provides results as output. It is an example

of sequential digital logic, as it has internal memory. Microprocessors operate on numbers

and symbols represented in the binary numeral system.

The integration of a whole CPU onto a single chip or on a few chips greatly reduced the cost

of processing power. The integrated circuit processor was produced in large numbers by

highly automated processes, so unit cost was low. Single-chip processors increase reliability

as there are many fewer electrical connections to fail. As microprocessor designs get faster,

the cost of manufacturing a chip (with smaller components built on a semiconductor chip the

same size) generally stays the same.

Before microprocessors, small computers had been implemented using racks of circuit boards with many medium- and small-scale integrated circuits. Microprocessors integrated this into one or a few large-scale ICs. Continued increases in microprocessor capacity have since rendered other forms of computers almost completely obsolete with one or more microprocessors used in everything from the smallest embedded systems and handheld devices to the largest mainframes and supercomputers.


STRUCTURE

A block diagram of the internal architecture of the Z80 microprocessor, showing the

arithmetic and logic section, register file, control logic section, and buffers to external

address and data lines

The internal arrangement of a microprocessor varies depending on the age of the design

and the intended purposes of the processor. The complexity of an integrated circuit is

bounded by physical limitations of the number of transistors that can be put onto one chip,

the number of package terminations that can connect the processor to other parts of the

system, the number of interconnections it is possible to make on the chip, and the heat that

the chip can dissipate. Advancing technology makes more complex and powerful chips

feasible to manufacture.

A minimal hypothetical microprocessor might only include an arithmetic logic unit (ALU) and

a control logic section. The ALU performs operations such as addition, subtraction, and

logical operations such as AND or OR. Each operation of the ALU sets one or more flags in a status

register, which indicate the results of the last operation (zero value, negative number,

overflow, or others). The control logic section retrieves instruction operation codes from

memory, and initiates the sequence of ALU operations required to carry out the

instruction. A single operation code might affect many individual data paths, registers, and

other elements of the processor.
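
As a rough illustration of how an ALU operation can set flags in a status register, the following C sketch (a hypothetical toy ALU written for this report, not the circuitry of any real processor) performs an 8-bit add, subtract, AND or OR and updates zero, negative and carry flags:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical status-register flag bits (illustrative only). */
    enum { FLAG_ZERO = 1 << 0, FLAG_NEGATIVE = 1 << 1, FLAG_CARRY = 1 << 2 };

    typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR } alu_op;

    /* Perform one 8-bit ALU operation and update the flags word. */
    static uint8_t alu(alu_op op, uint8_t a, uint8_t b, uint8_t *flags)
    {
        uint16_t wide;                                   /* keep the carry bit visible */
        switch (op) {
        case OP_ADD: wide = (uint16_t)a + b;                 break;
        case OP_SUB: wide = (uint16_t)a + (uint8_t)~b + 1;   break; /* two's complement */
        case OP_AND: wide = a & b;                           break;
        default:     wide = a | b;                           break;
        }
        uint8_t result = (uint8_t)wide;
        *flags = 0;
        if (result == 0)   *flags |= FLAG_ZERO;
        if (result & 0x80) *flags |= FLAG_NEGATIVE;
        if (wide & 0x100)  *flags |= FLAG_CARRY;
        return result;
    }

    int main(void)
    {
        uint8_t flags;
        uint8_t r = alu(OP_ADD, 200, 100, &flags);       /* overflows 8 bits */
        printf("result=%u carry=%d zero=%d\n",
               (unsigned)r, !!(flags & FLAG_CARRY), !!(flags & FLAG_ZERO));
        return 0;
    }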

As integrated circuit technology advanced, it was feasible to manufacture more and more

complex processors on a single chip. The size of data objects became larger; allowing more

transistors on a chip allowed word sizes to increase from 4 and 8-bit words up to today’s 64-

bit words. Additional features were added to the processor architecture; more on-chip

registers sped up programs, and complex instructions could be used to make more compact

programs. Floating-point arithmetic, for example, was often not available on 8-bit

microprocessors, but had to be carried out in software. Integration of the floating point

unit first as a separate integrated circuit and then as part of the same microprocessor chip,

sped up floating point calculations.

Occasionally, physical limitations of integrated circuits made such practices as a bit

slice approach necessary. Instead of processing all of a long word on one integrated circuit,

multiple circuits in parallel processed subsets of each data word. While this required extra

logic to handle, for example, carry and overflow within each slice, the result was a system

that could handle, say, 32-bit words using integrated circuits with a capacity for only four bits

each.
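
The bit-slice idea can be sketched in software: the following illustrative C function adds two 32-bit words four bits at a time, passing the carry from one 4-bit slice to the next, much as chained bit-slice ALU chips did in hardware (the function name and test values are invented for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 32-bit words using eight 4-bit "slices", propagating the carry
       between slices, as chained bit-slice ALUs did in hardware. */
    static uint32_t bitslice_add32(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0;
        unsigned carry = 0;
        for (int slice = 0; slice < 8; slice++) {        /* 8 slices x 4 bits */
            unsigned na = (a >> (4 * slice)) & 0xF;      /* nibble from a     */
            unsigned nb = (b >> (4 * slice)) & 0xF;      /* nibble from b     */
            unsigned s  = na + nb + carry;               /* one 4-bit ALU     */
            carry = s >> 4;                              /* carry out         */
            sum |= (uint32_t)(s & 0xF) << (4 * slice);
        }
        return sum;
    }

    int main(void)
    {
        printf("0x%x\n", (unsigned)bitslice_add32(0x0FFFFFFF, 1));  /* 0x10000000 */
        return 0;
    }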


With the ability to put large numbers of transistors on one chip, it becomes feasible to

integrate memory on the same die as the processor. This CPU cache has the advantage of

faster access than off-chip memory, and increases the processing speed of the system for

many applications. Processor clock frequency has increased more rapidly than external

memory speed, except in the recent past, so cache memory is necessary if the processor is

not to be delayed by slower external memory.

SPECIAL-PURPOSE DESIGN

A microprocessor is a general-purpose system. Several specialized processing devices have

followed from the technology. Microcontrollers integrate a microprocessor with peripheral

devices in embedded systems. A digital signal processor (DSP) is specialized for signal

processing. Graphics processing units may have no, limited, or general programming

facilities. For example, GPUs through the 1990s were mostly non-programmable and have

only recently gained limited facilities like programmable vertex shaders.

32-bit processors have more digital logic than narrower processors, so 32-bit (and wider)

processors produce more digital noise and have higher static consumption than narrower

processors. So 8-bit or 16-bit processors are better than 32-bit processors for system on a

chip and microcontrollers that require extremely low-power electronics, or are part of

a mixed-signal circuit with noise-sensitive on-chip analog electronics such as high-resolution analog-to-

digital converters, or both.

When manufactured on a similar process, 8-bit micros use less power when operating and

less power when sleeping than 32-bit micros.

However, some people say a 32-bit micro may use less average power than an 8-bit micro,

when the application requires certain operations, such as floating-point math, that take many

more clock cycles on an 8-bit micro than a 32-bit micro, and so the 8-bit micro spends more

time in high-power operating mode.

EMBEDDED APPLICATIONS

Thousands of items that were traditionally not computer-related include microprocessors.

These include large and small household appliances, cars (and their accessory equipment

units), car keys, tools and test instruments, toys, light switches/dimmers and electrical circuit

breakers, smoke alarms, battery packs, and hi-fi audio/visual components (from DVD

players to phonograph turntables). Such products as cellular telephones, DVD video systems

and HDTV broadcast systems fundamentally require consumer devices with powerful, low-

cost microprocessors. Increasingly stringent pollution control standards effectively require

automobile manufacturers to use microprocessor engine management systems, to allow

optimal control of emissions over widely varying operating conditions of an automobile. Non-

programmable controls would require complex, bulky, or costly implementation to achieve

the results possible with a microprocessor.

A microprocessor control program (embedded software) can be easily tailored to different

needs of a product line, allowing upgrades in performance with minimal redesign of the


product. Different features can be implemented in different models of a product line at

negligible production cost.

Microprocessor control of a system can provide control strategies that would be impractical

to implement using electromechanical controls or purpose-built electronic controls. For

example, an engine control system in an automobile can adjust ignition timing based on

engine speed, load on the engine, ambient temperature, and any observed tendency for

knocking—allowing an automobile to operate on a range of fuel grades.
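
Such a control strategy amounts to evaluating a control law over several sensor inputs. The sketch below is purely illustrative; the table values, bin thresholds and the function name ignition_advance are invented and do not come from any real engine controller:

    #include <stdio.h>

    /* Hypothetical ignition-advance map indexed by engine speed and load.
       Values (degrees before top dead centre) are invented for illustration. */
    static const double advance_map[3][3] = {
        /* load: low   mid   high */
        { 10.0, 12.0, 14.0 },   /* low rpm  */
        { 18.0, 20.0, 22.0 },   /* mid rpm  */
        { 26.0, 28.0, 30.0 },   /* high rpm */
    };

    static double ignition_advance(int rpm, double load, double ambient_c, int knock_detected)
    {
        int r = rpm < 2000 ? 0 : rpm < 4000 ? 1 : 2;     /* crude speed bin   */
        int l = load < 0.33 ? 0 : load < 0.66 ? 1 : 2;   /* crude load bin    */
        double adv = advance_map[r][l];
        if (ambient_c > 35.0) adv -= 1.0;                /* derate when hot   */
        if (knock_detected)   adv -= 3.0;                /* retard on knock   */
        return adv;
    }

    int main(void)
    {
        printf("advance = %.1f deg\n", ignition_advance(3200, 0.7, 25.0, 1));
        return 0;
    }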

HISTORY

The advent of low-cost computers on integrated circuits has transformed modern society.

General-purpose microprocessors in personal computers are used for computation, text editing,

multimedia display, and communication over the Internet. Many more microprocessors are part

of embedded systems, providing digital control over myriad objects from appliances to

automobiles to cellular phones and industrial process control.

The first use of the term "microprocessor" is attributed to Viatron Computer Systems, describing

the custom integrated circuit used in their System 21 small computer system announced in 1968.

Intel introduced its first 4-bit microprocessor 4004 in 1971 and its 8-bit microprocessor 8008 in

1972. During the 1960s, computer processors were constructed out of small and medium-scale

ICs—each containing from tens of transistors to a few hundred. These were placed and soldered

onto printed circuit boards, and often multiple boards were interconnected in a chassis. The large

number of discrete logic gates used more electrical power—and therefore produced more heat—

than a more integrated design with fewer ICs. The distance that signals had to travel between

ICs on the boards limited a computer's operating speed.

In the NASA Apollo space missions to the moon in the 1960s and 1970s, all onboard

computations for primary guidance, navigation and control were provided by a small custom

processor called "The Apollo Guidance Computer". It used wire wrap circuit boards whose

only logic elements were three-input NOR gates.

The first microprocessors emerged in the early 1970s and were used for electronic calculators,

using binary-coded decimal (BCD) arithmetic on 4-bit words. Other embedded uses of 4-bit and

8-bit microprocessors, such as terminals, printers, various kinds of automation etc., followed

soon after. Affordable 8-bit microprocessors with 16-bit addressing also led to the first general-

purpose microcomputers from the mid-1970s on.

Since the early 1970s, the increase in capacity of microprocessors has followed Moore's law; this

originally suggested that the number of components that can be fitted onto a chip doubles every

year. With present technology, it is actually every two years, and as such Moore later changed

the period to two years.


CADC

In 1968, Garrett AiResearch (which employed designers Ray Holt and Steve Geller) was

invited to produce a digital computer to compete with electromechanical systems then under

development for the main flight control computer in the US Navy's new F-14 Tomcat fighter.

The design was complete by 1970, and used a MOS-based chipset as the core CPU. The

design was significantly (approximately 20 times) smaller and much more reliable than the

mechanical systems it competed against, and was used in all of the early Tomcat models.

This system contained "a 20-bit, pipelined, parallel multi-microprocessor". The Navy refused

to allow publication of the design until 1997. For this reason the CADC, and

the MP944 chipset it used, are fairly unknown. Ray Holt graduated from California Polytechnic

University in 1968, and began his computer design career with the CADC. From its

inception, it was shrouded in secrecy until 1998 when at Holt's request, the US Navy allowed

the documents into the public domain. Since then people have debated whether this was the

first microprocessor. Holt has stated that no one has compared this microprocessor with

those that came later. According to Parab et al. (2007), "The scientific papers and literature

published around 1971 reveal that the MP944 digital processor used for the F-14 Tomcat

aircraft of the US Navy qualifies as the first microprocessor. Although interesting, it was not a

single-chip processor, as was not the Intel 4004 – they both were more like a set of parallel

building blocks you could use to make a general-purpose form. It contains a CPU, RAM,

ROM, and two other support chips like the Intel 4004. It was made from the same P-

channel technology, operated at military specifications and had larger chips -- an excellent

computer engineering design by any standards. Its design indicates a major advance over

Intel, and two year earlier. It actually worked and was flying in the F-14 when the Intel 4004

was announced. It indicates that today’s industry theme of converging DSP-

microcontroller architectures was started in 1971." This convergence of DSP and

microcontroller architectures is known as a digital signal controller.

TMS 1000

The Smithsonian Institution says TI engineers Gary Boone and Michael Cochran succeeded

in creating the first microcontroller (also called a microcomputer) and the first single-chip

CPU in 1971. The result of their work was the TMS 1000, which went commercial in 1974. TI

stressed the 4-bit TMS 1000 for use in pre-programmed embedded applications, introducing

a version called the TMS1802NC on September 17, 1971 that implemented a calculator on a

chip.

TI filed for a patent on the microprocessor. Gary Boone was awarded U.S. Patent

3,757,306 for the single-chip microprocessor architecture on September 4, 1973. In 1971

and again in 1976, Intel and TI entered into broad patent cross-licensing agreements, with

Intel paying royalties to TI for the microprocessor patent. A history of these events is

contained in court documentation from a legal dispute between Cyrix and Intel, with TI

as intervener and owner of the microprocessor patent.


A computer-on-a-chip combines the microprocessor core (CPU), memory, and I/O

(input/output) lines onto one chip. The computer-on-a-chip patent, called the "microcomputer

patent" at the time, U.S. Patent 4,074,351, was awarded to Gary Boone and Michael J.

Cochran of TI. Aside from this patent, the standard meaning of microcomputer is a computer

using one or more microprocessors as its CPU(s), while the concept defined in the patent is

more akin to a microcontroller.

INTEL 4004

The 4004 with cover removed (left) and as actually used (right)

The Intel 4004 is generally regarded as the first commercially available microprocessor, and

cost $60. The first known advertisement for the 4004 is dated November 15, 1971 and

appeared in Electronic News. The project that produced the 4004 originated in 1969,

when Busicom, a Japanese calculator manufacturer, asked Intel to build a chipset for high-

performance desktop calculators. Busicom’s original design called for a programmable chip

set consisting of seven different chips. Three of the chips were to make a special-purpose

CPU with its program stored in ROM and its data stored in shift register read-write

memory. Ted Hoff, the Intel engineer assigned to evaluate the project, believed the Busicom

design could be simplified by using dynamic RAM storage for data, rather than shift register

memory, and a more traditional general-purpose CPU architecture. Hoff came up with a

four–chip architectural proposal: a ROM chip for storing the programs, a dynamic RAM chip

for storing data, a simple I/O device and a 4-bit central processing unit (CPU). Although not

a chip designer, he felt the CPU could be integrated into a single chip, but as he lacked the

technical know-how the idea remained just a wish for the time being.

While the architecture and specifications of the MCS-4 came from the interaction of Hoff

with Stanley Mazor, a software engineer reporting to him, and with Busicom engineer

Masatoshi Shima, during 1969, Mazor and Hoff moved on to other projects. In April 1970,

Intel hired Italian-born engineer Federico Faggin as project leader, a move that ultimately

made the single-chip CPU final design a reality (Shima meanwhile designed the Busicom

calculator firmware and assisted Faggin during the first six months of the implementation).

Faggin, who originally developed the silicon gate technology (SGT) in 1968 at Fairchild

Semiconductor and designed the world’s first commercial integrated circuit using SGT, the

Fairchild 3708, had the correct background to lead the project into what would become the

first commercial general purpose microprocessor. Since SGT was his very own invention, it

in addition to his new methodology for random logic design made it possible to implement a

single-chip CPU with the proper speed, power dissipation and cost. The manager of Intel's

MOS Design Department was Leslie L. Vadász at the time of the MCS-4 development, but

Vadasz's attention was completely focused on the mainstream business of semiconductor

memories and he left the leadership and the management of the MCS-4 project to Faggin,


who was ultimately responsible for leading the 4004 project to its realization. Production

units of the 4004 were first delivered to Busicom in March 1971 and shipped to other

customers in late 1971.

PICO/GENERAL INSTRUMENT

The PICO1/GI250 chip introduced in 1971. This was designed by Pico Electronics

(Glenrothes, Scotland) and manufactured by General Instrument of Hicksville NY.

In 1971 Pico Electronics and General Instrument (GI) introduced their first collaboration in

ICs, a complete single chip calculator IC for the Monroe/Litton Royal Digital III calculator.

This chip could also arguably lay claim to be one of the first microprocessors or

microcontrollers having ROM, RAM and a RISC instruction set on-chip. The layout for the

four layers of the PMOS process was hand drawn at x500 scale on Mylar film, a significant

task at the time given the complexity of the chip.

Pico was a spinout by five GI design engineers whose vision was to create single chip

calculator ICs. They had significant previous design experience on multiple calculator

chipsets with both GI and Marconi-Elliott. The key team members had originally been tasked

by Elliott Automation to create an 8-bit computer in MOS and had helped establish a MOS

Research Laboratory in Glenrothes, Scotland in 1967.

Calculators were becoming the largest single market for semiconductors and Pico and GI

went on to have significant success in this burgeoning market. GI continued to innovate in

microprocessors and microcontrollers with products including the CP1600, IOB1680 and

PIC1650. In 1987 the GI Microelectronics business was spun out into the Microchip PIC

microcontroller business.

FOUR-PHASE SYSTEMS AL1

The Four-Phase Systems AL1 was an 8-bit bit slice chip containing eight registers and an

ALU. It was designed by Lee Boysel in 1969. At the time, it formed part of a nine-chip, 24-bit

CPU with three AL1s, but it was later called a microprocessor when, in response to 1990s

litigation by Texas Instruments, a demonstration system was constructed where a single AL1

formed part of a courtroom demonstration computer system, together with RAM, ROM, and

an input-output device.


8-BIT DESIGNS

The Intel 4004 was followed in 1972 by the Intel 8008, the world's first 8-bit microprocessor.

The 8008 was not, however, an extension of the 4004 design, but instead the culmination of

a separate design project at Intel, arising from a contract with Computer Terminals

Corporation, of San Antonio TX, for a chip for a terminal they were designing, the Datapoint

2200 — fundamental aspects of the design came not from Intel but from CTC. In 1968,

CTC's Vic Poor and Harry Pyle developed the original design for the instruction set and

operation of the processor. In 1969, CTC contracted two companies, Intel and Texas

Instruments, to make a single-chip implementation, known as the CTC 1201. In late 1970 or

early 1971, TI dropped out being unable to make a reliable part. In 1970, with Intel yet to

deliver the part, CTC opted to use their own implementation in the Datapoint 2200, using

traditional TTL logic instead (thus the first machine to run “8008 code” was not in fact a

microprocessor at all and was delivered a year earlier). Intel's version of the 1201

microprocessor arrived in late 1971, but was too late, slow, and required a number of

additional support chips. CTC had no interest in using it. CTC had originally contracted Intel

for the chip, and would have owed them $50,000 for their design work.[32] To avoid paying for

a chip they did not want (and could not use), CTC released Intel from their contract and

allowed them free use of the design. Intel marketed it as the 8008 in April, 1972, as the

world's first 8-bit microprocessor. It was the basis for the famous "Mark-8" computer kit

advertised in the magazine Radio-Electronics in 1974.

The 8008 was the precursor to the very successful Intel 8080 (1974), which offered much

improved performance over the 8008 and required fewer support chips, Zilog Z80 (1976),

and derivative Intel 8-bit processors. The competing Motorola 6800 was released August

1974 and the similar MOS Technology 6502 in 1975 (both designed largely by the same

people). The 6502 family rivaled the Z80 in popularity during the 1980s.

A low overall cost, small packaging, simple computer bus requirements, and sometimes the

integration of extra circuitry (e.g. the Z80's built-in memory refresh circuitry) allowed the

home computer "revolution" to accelerate sharply in the early 1980s. This delivered such

inexpensive machines as the Sinclair ZX-81, which sold for US$99. A variation of the 6502,

the MOS Technology 6510 was used in the Commodore 64 and yet another variant,

the 8502, powered the Commodore 128.

The Western Design Center, Inc. (WDC) introduced the CMOS 65C02 in 1982 and licensed

the design to several firms. It was used as the CPU in the Apple IIe and IIc personal

computers as well as in medical implantable grade pacemakers and defibrillators,

automotive, industrial and consumer devices. WDC pioneered the licensing of

microprocessor designs, later followed by ARM (32-bit) and other microprocessor intellectual

property (IP) providers in the 1990s.

Motorola introduced the MC6809 in 1978, an ambitious and well thought-through 8-bit

design which was source compatible with the 6800 and was implemented using purely hard-

wired logic. (Subsequent 16-bit microprocessors typically used microcode to some extent, as

CISC design requirements were getting too complex for pure hard-wired logic.)


Another early 8-bit microprocessor was the Signetics 2650, which enjoyed a brief surge of

interest due to its innovative and powerful instruction set architecture.

A seminal microprocessor in the world of spaceflight was RCA's RCA 1802 (aka CDP1802,

RCA COSMAC) (introduced in 1976), which was used on board the Galileo probe to Jupiter

(launched 1989, arrived 1995). RCA COSMAC was the first to implement CMOS technology.

The CDP1802 was used because it could be run at very low power, and because a variant

was available fabricated using a special production process, silicon on sapphire (SOS),

which provided much better protection against cosmic radiation and electrostatic discharge than that of

any other processor of the era. Thus, the SOS version of the 1802 was said to be the first

radiation-hardened microprocessor.

The RCA 1802 had what is called a static design, meaning that the clock frequency could be

made arbitrarily low, even to 0 Hz, a total stop condition. This let the spacecraft use minimum

electric power for long uneventful stretches of a voyage. Timers or sensors would awaken

the processor in time for important tasks, such as navigation updates, attitude control, data

acquisition, and radio communication. Current versions of the Western Design Center 65C02

and 65C816 have static cores, and thus retain data even when the clock is completely

halted.

12-BIT DESIGNS

The Intersil 6100 family consisted of a 12-bit microprocessor (the 6100) and a range of

peripheral support and memory ICs. The microprocessor recognized the DEC PDP-

8 minicomputer instruction set. As such it was sometimes referred to as the CMOS-PDP8.

Since it was also produced by Harris Corporation, it was also known as the Harris HM-6100.

By virtue of its CMOS technology and associated benefits, the 6100 was being incorporated

into some military designs until the early 1980s.

16-BIT DESIGNS

The first multi-chip 16-bit microprocessor was the National Semiconductor IMP-16, introduced in

early 1973. An 8-bit version of the chipset was introduced in 1974 as the IMP-8.

Other early multi-chip 16-bit microprocessors include one that Digital Equipment Corporation

(DEC) used in the LSI-11 OEM board set and the packaged PDP 11/03 minicomputer—and

the Fairchild Semiconductor MicroFlame 9440, both introduced in 1975–1976. In 1975, National

introduced the first 16-bit single-chip microprocessor, the National Semiconductor PACE, which

was later followed by an NMOS version, the INS8900.

Another early single-chip 16-bit microprocessor was TI's TMS 9900, which was also compatible

with their TI-990 line of minicomputers. The 9900 was used in the TI 990/4 minicomputer, the TI-

99/4A home computer, and the TM990 line of OEM microcomputer boards. The chip was

packaged in a large ceramic 64-pin DIP package, while most 8-bit microprocessors such as the

Intel 8080 used the more common, smaller, and less expensive plastic 40-pin DIP. A follow-on

chip, the TMS 9980, was designed to compete with the Intel 8080, had the full TI 990 16-bit

instruction set, used a plastic 40-pin package, moved data 8 bits at a time, but could only


address 16 KB. A third chip, the TMS 9995, was a new design. The family later expanded to

include the 99105 and 99110.

The Western Design Center (WDC) introduced the CMOS 65816 16-bit upgrade of the WDC

CMOS 65C02 in 1984. The 65816 16-bit microprocessor was the core of the Apple IIgs and later

the Super Nintendo Entertainment System, making it one of the most popular 16-bit designs of all

time.

Intel "upsized" their 8080 design into the 16-bit Intel 8086, the first member of the x86 family,

which powers most modern PC type computers. Intel introduced the 8086 as a cost-effective way

of porting software from the 8080 lines, and succeeded in winning much business on that

premise. The 8088, a version of the 8086 that used an 8-bit external data bus, was the

microprocessor in the first IBM PC. Intel then released the 80186 and 80188, the 80286 and, in

1985, the 32-bit 80386, cementing their PC market dominance with the processor family's

backwards compatibility. The 80186 and 80188 were essentially versions of the 8086 and 8088,

enhanced with some onboard peripherals and a few new instructions; they were not used in IBM-

compatible PCs because the built-in peripherals and their locations in the memory map were

incompatible with the IBM design. The 8086 and successors had an innovative but limited

method of memory segmentation, while the 80286 introduced a full-featured segmented memory

management unit (MMU). The 80386 introduced a flat 32-bit memory model with paged memory

management.
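
The 8086's real-mode segmentation reduces to a simple calculation: a 20-bit physical address is the 16-bit segment value shifted left four bits plus a 16-bit offset. The short C sketch below reproduces that arithmetic (the helper name phys_addr is invented for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode 8086 address translation: physical = segment * 16 + offset,
       giving a 20-bit (1 MB) address space from two 16-bit values. */
    static uint32_t phys_addr(uint16_t segment, uint16_t offset)
    {
        return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
    }

    int main(void)
    {
        /* Different segment:offset pairs can name the same physical byte. */
        printf("0x%05x\n", (unsigned)phys_addr(0xB800, 0x0000));   /* 0xb8000 */
        printf("0x%05x\n", (unsigned)phys_addr(0xB000, 0x8000));   /* 0xb8000 */
        return 0;
    }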

The Intel x86 processors up to and including the 80386 do not include floating-point units (FPUs).

Intel introduced the 8087, 80287, and 80387 math coprocessors to add hardware floating-point

and transcendental function capabilities to the 8086 through 80386 CPUs. The 8087 works with

the 8086/8088 and 80186/80188, the 80187 works with the 80186/80188, the 80287 works with

the 80286 and 80386, and the 80387 works with the 80386 (yielding better performance than the

80287). The combination of an x86 CPU and an x87 coprocessor forms a single multi-chip

microprocessor; the two chips are programmed as a unit using a single integrated instruction

set. Though the 8087 coprocessor is interfaced to the CPU through I/O ports in the CPU's

address space, this is transparent to the program, which does not need to know about or access

these I/O ports directly; the program accesses the coprocessor and its registers through normal

instruction op codes. Starting with the successor to the 80386, the 80486, the FPU was

integrated with the control unit, MMU, and integer ALU in a pipelined design on a single chip (in

the 80486DX version), or the FPU was eliminated entirely (in the 80486SX version). An

ostensible coprocessor for the 80486SX, the 80487 was actually a complete 80486DX that

disabled and replaced the coprocessor-less 80486SX that it was installed to upgrade.


32-BIT DESIGNS

Upper interconnect layers on an Intel 80486DX2 die

16-bit designs had only been on the market briefly when 32-bit implementations started to

appear.

The most significant of the 32-bit designs is the Motorola MC68000, introduced in 1979. The

68k, as it was widely known, had 32-bit registers in its programming model but used 16-bit

internal data paths, three 16-bit Arithmetic Logic Units, and a 16-bit external data bus (to

reduce pin count), and externally supported only 24-bit addresses (internally it worked with

full 32 bit addresses). In PC-based IBM-compatible mainframes the MC68000 internal

microcode was modified to emulate the 32-bit System/370 IBM mainframe.[36] Motorola

generally described it as a 16-bit processor, though it clearly has 32-bit capable architecture.

The combination of high performance, large (16 megabytes, or 2^24 bytes) memory space and

fairly low cost made it the most popular CPU design of its class. The Apple Lisa and

Macintosh designs made use of the 68000, as did a host of other designs in the mid-1980s,

including the Atari ST and Commodore Amiga.

The world's first single-chip fully 32-bit microprocessor, with 32-bit data paths, 32-bit buses,

and 32-bit addresses, was the AT&T Bell Labs BELLMAC-32A, with first samples in 1980,

and general production in 1982.[37][38] After the divestiture of AT&T in 1984, it was renamed

the WE 32000 (WE for Western Electric), and had two follow-on generations, the WE 32100

and WE 32200. These microprocessors were used in the AT&T 3B5 and 3B15

minicomputers; in the 3B2, the world's first desktop super microcomputer; in the

"Companion", the world's first 32-bit laptop computer; and in "Alexander", the world's first

book-sized super microcomputer, featuring ROM-pack memory cartridges similar to today's

gaming consoles. All these systems ran the UNIX System V operating system.

The first commercial, single chip, fully 32-bit microprocessor available on the market was

the HP FOCUS.

Intel's first 32-bit microprocessor was the IAPX 432, which was introduced in 1981 but was

not a commercial success. It had an advanced capability-based oriented architecture, but

poor performance compared to contemporary architectures such as Intel's own 80286

(introduced 1982), which was almost four times as fast on typical benchmark tests. However,


the results for the iAPX 432 were partly due to a rushed and therefore

suboptimal Ada compiler.

Motorola's success with the 68000 led to the MC68010, which added virtual memory

support. The MC68020, introduced in 1984, added full 32-bit data and address buses. The

68020 became hugely popular in the Unix super microcomputer market, and many small

companies (e.g., Altos, Charles River Data Systems, Cromemco) produced desktop-size

systems. The MC68030 was introduced next, improving upon the previous design by

integrating the MMU into the chip. The continued success led to the MC68040, which

included an FPU for better math performance. A 68050 failed to achieve its performance

goals and was not released, and the follow-up MC68060 was released into a market

saturated by much faster RISC designs. The 68k family faded from the desktop in the early

1990s.

Other large companies designed the 68020 and follow-ons into embedded equipment. At

one point, there were more 68020s in embedded equipment than there were Intel Pentiums

in PCs.[39] The ColdFire processor cores are derivatives of the venerable 68020.

During this time (early to mid-1980s), National Semiconductor introduced a very similar 16-

bit pin out, 32-bit internal microprocessor called the NS 16032 (later renamed 32016), the full

32-bit version named the NS 32032. Later, National Semiconductor produced the NS 32132,

which allowed two CPUs to reside on the same memory bus with built in arbitration. The

NS32016/32 outperformed the MC68000/10, but the NS32332—which arrived at

approximately the same time as the MC68020—did not have enough performance. The third

generation chip, the NS32532, was different. It had about double the performance of the

MC68030, which was released around the same time. The appearance of RISC processors

like the AM29000 and MC88000 (now both dead) influenced the architecture of the final

core, the NS32764. Technically advanced—with a superscalar RISC core, 64-bit bus, and

internally overclocked—it could still execute Series 32000 instructions through real-time

translation.

When National Semiconductor decided to leave the Unix market, the chip was redesigned

into the Swordfish Embedded processor with a set of on chip peripherals. The chip turned

out to be too expensive for the laser printer market and was killed. The design team went to

Intel and there designed the Pentium processor, which is very similar to the NS32764 core

internally. The big success of the Series 32000 was in the laser printer market, where the

NS32CG16 with microcoded BitBlt instructions had very good price/performance and was

adopted by large companies like Canon. By the mid-1980s, Sequent introduced the first

SMP server-class computer using the NS 32032. This was one of the design's few wins, and

it disappeared in the late 1980s. The MIPS R2000 (1984) and R3000 (1989) were highly

successful 32-bit RISC microprocessors. They were used in high-end workstations and

servers by SGI, among others. Other designs included the Zilog Z80000, which arrived too

late to market to stand a chance and disappeared quickly.

The ARM first appeared in 1985. This is a RISC processor design, which has since come to

dominate the 32-bit embedded systems processor space due in large part to its power

efficiency, its licensing model, and its wide selection of system development tools.


Semiconductor manufacturers generally license cores and integrate them into their

own system on a chip products; only a few such vendors are licensed to modify the ARM

cores. Most cell phones include an ARM processor, as do a wide variety of other

products. There are microcontroller-oriented ARM cores without virtual memory support, as

well as symmetric multiprocessor (SMP) applications processors with virtual memory.

In the late 1980s, "microprocessor wars" started killing off some of the

microprocessors. Apparently, with only one bigger design win, Sequent, the NS 32032 just

faded out of existence, and Sequent switched to Intel microprocessors.

From 1993 to 2003, the 32-bit x86 architectures became increasingly dominant in desktop,

laptop, and server markets and these microprocessors became faster and more capable.

Intel had licensed early versions of the architecture to other companies, but declined to

license the Pentium, so AMD and Cyrix built later versions of the architecture based on their

own designs. During this span, these processors increased in complexity (transistor count)

and capability (instructions/second) by at least three orders of magnitude. Intel's Pentium

line is probably the most famous and recognizable 32-bit processor model, at least with the

public at large.

64-BIT DESIGNS IN PC

While 64-bit microprocessor designs have been in use in several markets since the early

1990s (including the Nintendo 64 gaming console in 1996), the early 2000s saw the

introduction of 64-bit microprocessors targeted at the PC market.

With AMD's introduction of a 64-bit architecture backwards-compatible with x86, x86-

64 (also called AMD64), in September 2003, followed by Intel's near fully compatible 64-bit

extensions (first called IA-32e or EM64T, later renamed Intel 64), the 64-bit desktop era

began. Both versions can run 32-bit legacy applications without any performance penalty as

well as new 64-bit software. With operating systems Windows XP x64, Windows

Vista x64, Windows 7 x64, Linux, BSD, and Mac OS X that run 64-bit native, the software is

also geared to fully utilize the capabilities of such processors. The move to 64 bits is more

than just an increase in register size from the IA-32 as it also doubles the number of general-

purpose registers.

The move to 64 bits by PowerPC processors had been intended since the processors'

design in the early 90s and was not a major cause of incompatibility. Existing integer

registers are extended as are all related data pathways, but, as was the case with IA-32,

both floating point and vector units had been operating at or above 64 bits for several years.

Unlike what happened when IA-32 was extended to x86-64, no new general purpose

registers were added in 64-bit PowerPC, so any performance gained when using the 64-bit

mode for applications making no use of the larger address space is minimal.

In 2011, ARM introduced a new 64-bit ARM architecture.


MULTI-CORE DESIGNS

A different approach to improving a computer's performance is to add extra processors, as

in symmetric multiprocessing designs, which have been popular in servers and workstations

since the early 1990s. Keeping up with Moore's Law is becoming increasingly challenging as

chip-making technologies approach their physical limits. In response, microprocessor

manufacturers look for other ways to improve performance so they can maintain the

momentum of constant upgrades.

A multi-core processor is a single chip that contains more than one microprocessor core.

Each core can simultaneously execute processor instructions in parallel. This effectively

multiplies the processor's potential performance by the number of cores, if the software is

designed to take advantage of more than one processor core. Some components, such as

bus interface and cache, may be shared between cores. Because the cores are physically

close to each other, they can communicate with each other much faster than separate (off-

chip) processors in a multiprocessor system, which improves overall system performance.
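
Whether a multi-core chip actually speeds up a task depends on the software dividing its work across cores. The sketch below, an illustrative example using POSIX threads rather than any particular vendor's library, splits a summation between two worker threads that the operating system is free to schedule on separate cores:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000

    /* Each worker sums half of the range; the OS may place each thread on its own core. */
    struct job { long start, end; long long sum; };

    static void *worker(void *arg)
    {
        struct job *j = arg;
        j->sum = 0;
        for (long i = j->start; i < j->end; i++)
            j->sum += i;
        return NULL;
    }

    int main(void)
    {
        struct job jobs[2] = { { 0, N / 2, 0 }, { N / 2, N, 0 } };
        pthread_t tid[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&tid[i], NULL, worker, &jobs[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(tid[i], NULL);
        printf("total = %lld\n", jobs[0].sum + jobs[1].sum);   /* 499999500000 */
        return 0;
    }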

In 2005, AMD released the first native dual-core processor, the Athlon X2. Intel's Pentium D

had beaten the X2 to market by a few weeks, but it used two separate CPU dies and was

less efficient than AMD's native design. As of 2012, dual-core and quad-core processors are

widely used in home PCs and laptops, while quad, six, eight, ten, twelve, and sixteen-core

processors are common in the professional and enterprise markets with workstations and

servers.

Sun Microsystems has released the Niagara and Niagara 2 chips, both of which feature an

eight-core design. The Niagara 2 supports more threads and operates at 1.6 GHz.

High-end Intel Xeon processors that are on the LGA 771, LGA 1366, and LGA 2011 sockets

and high-end AMD Opteron processors that are on the C32 and G34 sockets are DP (dual

processor) capable, as well as the older Intel Core 2 Extreme QX9775 also used in an older

Mac Pro by Apple and the Intel Skulltrail motherboard. AMD's G34 motherboards can

support up to four CPUs and Intel's LGA 1567 motherboards can support up to eight CPUs.

Modern desktop computers do not support systems with multiple CPUs, but few applications

outside of the professional market can make good use of more than four cores. Both Intel

and AMD currently offer fast quad- and six-core desktop CPUs, making multi CPU systems

obsolete for many purposes. AMD also offers the first and currently the only eight core

desktop CPUs with the FX-8xxx line.

The desktop market has been in a transition towards quad-core CPUs since Intel's Core 2

Quads were released and now are common, although dual-core CPUs are still more

prevalent. Older or mobile computers are less likely than newer desktops to have more than

two cores. Not all software is optimized for multi-core CPUs, making fewer, more powerful

cores preferable. AMD offers CPUs with more cores for a given amount of money than

similarly priced Intel CPUs—but the AMD cores are somewhat slower, so the two trade

blows in different applications depending on how well-threaded the programs running are.


For example, Intel's cheapest Sandy Bridge quad-core CPUs often cost almost twice as

much as AMD's cheapest Athlon II, Phenom II, and FX quad-core CPUs but Intel has dual-

core CPUs in the same price ranges as AMD's cheaper quad core CPUs. In an application

that uses one or two threads, the Intel dual cores outperform AMD's similarly priced quad-

core CPUs—and if a program supports three or four threads the cheap AMD quad-core

CPUs outperform the similarly priced Intel dual-core CPUs.

Historically, AMD and Intel have switched places as the company with the fastest CPU

several times. Intel currently leads on the desktop side of the computer CPU market, with

their Sandy Bridge and Ivy Bridge series. In servers, AMD's new Opterons seem to have

superior performance for their price point. This means that AMD are currently more

competitive in low- to mid-end servers and workstations that more effectively use fewer

cores and threads.

RISC

In the mid-1980s to early 1990s, a crop of new high-performance reduced instruction set

computer (RISC) microprocessors appeared, influenced by discrete RISC-like CPU designs

such as the IBM 801 and others. RISC microprocessors were initially used in special-

purpose machines and UNIX workstations, but then gained wide acceptance in other roles.

In 1986, HP released its first system with a PA-RISC CPU. The first commercial RISC

microprocessor design was released in 1984 by MIPS Computer Systems, the 32-

bit R2000 (the R1000 was not released). In 1987, the 32-bit, then cache-less, ARM2-based

Acorn Archimedes (a non-Unix computer) became the first commercial success using

the ARM architecture, then known as Acorn RISC Machine (ARM); the first silicon, the ARM1, appeared in

1985. The R3000 made the design truly practical, and the R4000 introduced the world's first

commercially available 64-bit RISC microprocessor. Competing projects would result in the

IBM POWER and Sun SPARC architectures. Soon every major vendor was releasing a

RISC design, including the AT&T CRISP, AMD 29000, Intel i860 and Intel i960, Motorola

88000, DEC Alpha.

In the late 1990s, only two 64-bit RISC architectures were still produced in volume for non-

embedded applications: SPARC and Power ISA, but as ARM has become increasingly

powerful, in the early 2010s, it became the third RISC architecture in the general computing

segment.

MARKET STATISTICS

In 2003, about US$44 billion worth of microprocessors were manufactured and sold. Although

about half of that money was spent on CPUs used in desktop or laptop personal computers,

those account for only about 2% of all CPUs sold.

About 55% of all CPUs sold in the world are 8-bit microcontrollers, over two billion of which were

sold in 1997.

In 2002, less than 10% of all the CPUs sold in the world were 32-bit or more. Of all the 32-bit

CPUs sold, about 2% are used in desktop or laptop personal computers. Most microprocessors


are used in embedded control applications such as household appliances, automobiles, and

computer peripherals. Taken as a whole, the average price for a microprocessor, microcontroller,

or DSP is just over $6.

About ten billion CPUs were manufactured in 2008. About 98% of new CPUs produced each

year are embedded.

MICROARCHITECTURE

Intel Core microarchitecture

In electronics engineering and computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

Computer architecture is the combination of microarchitecture and instruction set design.

RELATION TO INSTRUCTION SET ARCHITECTURE

The ISA is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor, address and data formats among other things. The microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA.


Single bus organization microarchitecture

The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various micro architectural elements of the machine, which may be everything from single gates and registers, to complete arithmetic logic units (ALUs) and even larger elements. These diagrams generally separate the datapath (where data is placed) and the control path (which can be said to steer the data).

The person designing a system usually draws the specific microarchitecture as a kind of data flow diagram. Like a block diagram, the microarchitecture diagram shows micro architectural elements such as the arithmetic and logic unit and the register file as a single schematic symbol. Typically the diagram connects those elements with arrows and thick lines and thin lines to distinguish between three-state buses -- which require a three state buffer for each device that drives the bus; unidirectional buses -- always driven by a single source, such as the way the address bus on simpler computers is always driven by the memory address register; and individual control lines. Very simple computers have a single data bus organization -- they have a single three-state bus. The diagram of more complex computers usually shows multiple three-state buses, which help the machine do more operations simultaneously.

Each micro architectural element is in turn represented by a schematic describing the interconnections of logic gates used to implement it. Each logic gate is in turn represented by a circuit diagram describing the connections of the transistors used to implement it in some particular logic family. Machines with different microarchitectures may have the same instruction set architecture, and thus be capable of executing the same programs. New microarchitectures and/or circuitry solutions, along with advances in semiconductor manufacturing, are what allows newer generations of processors to achieve higher performance while using the same ISA.

In principle, a single microarchitecture could execute several different ISAs with only minor changes to the microcode.


ASPECTS OF MICROARCHITECTURE

Intel 80286 microarchitecture

The pipelined data path is the most commonly used data path design in microarchitecture today. This technique is used in most modern microprocessors, microcontrollers, and DSPs. The pipelined architecture allows multiple instructions to overlap in execution, much like an assembly line. The pipeline includes several different stages which are fundamental in microarchitecture designs. Some of these stages include instruction fetch, instruction decode, execute, and write back. Some architectures include other stages such as memory access. The design of pipelines is one of the central micro architectural tasks.

Execution units are also essential to microarchitecture. Execution units include arithmetic logic units (ALU), floating point units (FPU), load/store units, branch prediction, and SIMD. These units perform the operations or calculations of the processor. The choice of the number of execution units, their latency and throughput is a central micro architectural design task. The size, latency, throughput and connectivity of memories within the system are also micro architectural decisions.

System-level design decisions such as whether or not to include peripherals, such as memory controllers, can be considered part of the micro architectural design process. This includes decisions on the performance-level and connectivity of these peripherals.

Unlike architectural design, where achieving a specific performance level is the main goal, micro architectural design pays closer attention to other constraints. Since microarchitecture design decisions directly affect what goes into a system, attention must be paid to such issues as:

Chip area/cost Power consumption Logic complexity Ease of connectivity Manufacturability Ease of debugging Testability

MICROARCHITECTURAL CONCEPTS

Instruction cycle

In general, all CPUs, whether single-chip microprocessors or multi-chip implementations, run programs by performing the following steps:

1. Read an instruction and decode it
2. Find any associated data that is needed to process the instruction
3. Process the instruction
4. Write the results out

The instruction cycle is repeated continuously until the power is turned off.
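
These four steps can be written out directly as a loop. The sketch below runs a toy machine invented purely for illustration (four registers and a made-up 16-bit instruction format, not any real instruction set) through fetch, decode, execute and write-back:

    #include <stdint.h>
    #include <stdio.h>

    /* A toy machine: 4 registers, 16-bit instructions of the form
       [4-bit opcode][4-bit dst][4-bit src][4-bit unused]. Invented for illustration. */
    enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2, OP_PRINT = 3 };

    int main(void)
    {
        uint16_t program[] = {
            0x1005,   /* LOADI r0, 5  (immediate in the low 8 bits) */
            0x1103,   /* LOADI r1, 3  */
            0x2010,   /* ADD   r0, r1 */
            0x3000,   /* PRINT r0     */
            0x0000,   /* HALT         */
        };
        uint16_t regs[4] = { 0 };
        unsigned pc = 0;
        for (;;) {
            uint16_t inst = program[pc++];                        /* 1. fetch       */
            unsigned op  = inst >> 12;                            /* 2. decode      */
            unsigned dst = (inst >> 8) & 0xF;
            unsigned src = (inst >> 4) & 0xF;
            if (op == OP_HALT) break;
            else if (op == OP_LOADI) regs[dst] = inst & 0xFF;     /* 3. execute     */
            else if (op == OP_ADD)   regs[dst] += regs[src];      /* 4. write back  */
            else if (op == OP_PRINT) printf("r%u = %u\n", dst, (unsigned)regs[dst]);
        }
        return 0;
    }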

Increasing execution speed

Complicating this simple-looking series of steps is the fact that the memory hierarchy, which includes caching, main memory and non-volatile storage like hard disks (where the program instructions and data reside), has always been slower than the processor itself. Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. Over the years, a central goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. Initially, these techniques could only be implemented on expensive mainframes or supercomputers due to the amount of circuitry needed for these techniques. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip. See Moore's law.

Instruction set choice

Instruction sets have shifted over the years, from originally very simple to sometimes very complex (in various respects). In recent years, load-store architectures, VLIW and EPIC types have been in fashion. Architectures that are dealing with data parallelism include SIMD and Vectors. Some labels used to denote classes of CPU architectures are not particularly descriptive, especially so the CISC label; many early designs retroactively denoted "CISC" are in fact significantly simpler than modern RISC processors (in several respects).

However, the choice of instruction set architecture may greatly affect the complexity of implementing high performance devices. The prominent strategy, used to develop the first RISC processors, was to simplify instructions to a minimum of individual semantic complexity combined with high encoding regularity and simplicity. Such uniform instructions were easily fetched, decoded and executed in a pipelined fashion, and a simple strategy was used to reduce the number of logic levels in order to reach high operating frequencies; instruction cache-memories compensated for the higher operating frequency and inherently low code density while large register sets were used to factor out as much of the (slow) memory accesses as possible.

Instruction pipelining

One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline. Early processor designs would carry out all of the steps above for one instruction before moving onto the next. Large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution and so on.

Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was waiting for results. This would allow up to four instructions to be "in flight" at one time, making the processor look four times as fast. Although any one instruction takes just as long to complete (there are still four steps) the CPU as a whole "retires" instructions much faster.

RISC designs make pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making them take the same amount of time — one cycle. The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Due to the reduced complexity of the Classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise fit the core alone on a CISC design. This was the real reason that RISC was faster. Early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price.
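
The overlap can be made concrete with a small cycle-by-cycle trace. The illustrative C sketch below prints which of five instructions occupies each stage of an idealised four-stage pipeline on every clock cycle, assuming one instruction enters per cycle and no stalls occur:

    #include <stdio.h>

    /* Print which instruction (I1..I5) occupies each stage of an idealised
       4-stage pipeline (fetch, decode, execute, write-back) on every cycle,
       assuming one instruction enters per cycle and there are no stalls. */
    int main(void)
    {
        const char *stage[4] = { "FETCH", "DECODE", "EXECUTE", "WRITEBACK" };
        int n_instr = 5, n_stage = 4;
        for (int cycle = 0; cycle < n_instr + n_stage - 1; cycle++) {
            printf("cycle %d:", cycle + 1);
            for (int s = 0; s < n_stage; s++) {
                int instr = cycle - s;               /* instruction index in stage s */
                if (instr >= 0 && instr < n_instr)
                    printf("  %s=I%d", stage[s], instr + 1);
            }
            printf("\n");
        }
        /* 5 instructions finish in 8 cycles instead of 20 when run one at a time. */
        return 0;
    }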


Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX implementation (VAX 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are now pipelined, and micro coded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major micro architectural advances that have enabled processor performance to keep pace with the circuit technology on which they are based.

Cache

It was not long before improvements in chip manufacturing allowed for even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory, memory that can be accessed in a few cycles as opposed to many needed to "talk" to main memory. The CPU includes a cache controller which automates reading and writing from the cache; if the data is already in the cache it simply "appears", whereas if it is not, the processor is "stalled" while the cache controller reads it in.

RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total. This number grew over time, and typical CPUs now have at least 512 KB, while more powerful CPUs come with 1 or 2 or even 4, 6, 8 or 12 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more performance, due to reduced stalling.

Caches and pipelines were a perfect match for each other. Previously, it didn't make much sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time. This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory.
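
A simple way to see the effect of caching (this example is an illustration, not part of the original report): traversing a large matrix row by row touches memory that is already in the cache line just fetched, while traversing it column by column misses far more often, even though both loops do the same arithmetic.

#include <stdio.h>

#define N 1024
static double m[N][N];

/* Row-major traversal: consecutive accesses fall in the same
 * cache line, so most loads hit in the cache. */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: each access jumps N*8 bytes ahead,
 * so nearly every load can miss the cache and stall the CPU. */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_rows(), sum_cols());
    return 0;
}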

Branch prediction

One barrier to achieving higher performance through instruction-level parallelism stems from pipeline stalls and flushes due to branches. Normally, whether a conditional branch will be taken isn't known until late in the pipeline, as conditional branches depend on results coming from a register. From the time that the processor's instruction decoder has figured out that it has encountered a conditional branch instruction to the time that the deciding register value can be read out, the pipeline needs to be stalled for several cycles; or, if it is not stalled and the branch is taken, the pipeline needs to be flushed. As clock speeds increase, the depth of the pipeline increases with them, and some modern processors may have 20 stages or more. On average, every fifth instruction executed is a branch, so without any intervention that is a high amount of stalling.

Techniques such as branch prediction and speculative execution are used to lessen these branch penalties. Branch prediction is where the hardware makes educated guesses on whether a particular branch will be taken. In reality, one side of the branch will be taken much more often than the other. Modern designs have rather complex statistical prediction systems, which watch the results of past branches to predict the future with greater accuracy. The guess allows the hardware to prefetch instructions without waiting for the register read. Speculative execution is a further enhancement in which the code along the predicted path is not just prefetched but also executed before it is known whether the branch should be taken or not. This can yield better performance when the guess is good, with the risk of a huge penalty when the guess is bad, because the speculatively executed instructions need to be undone.
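
The classic hardware predictor is a table of 2-bit saturating counters indexed by the branch address; the minimal C sketch below (an illustration under that assumption, not a description of any particular CPU) shows the prediction and update rules.

#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counters: 0-1 predict "not taken", 2-3 predict "taken".
 * A counter must be wrong twice in a row before the prediction flips,
 * which tolerates the occasional odd iteration of a loop branch. */
#define TABLE_SIZE 1024
static uint8_t counters[TABLE_SIZE];   /* all start at 0 ("strongly not taken") */

static bool predict(uint32_t branch_pc) {
    return counters[branch_pc % TABLE_SIZE] >= 2;
}

static void update(uint32_t branch_pc, bool taken) {
    uint8_t *c = &counters[branch_pc % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    /* A loop branch that is taken 9 times and then falls through once: */
    for (int trip = 0; trip < 3; trip++)
        for (int i = 0; i < 10; i++)
            update(0x400100, /*taken=*/ i < 9);
    return predict(0x400100) ? 0 : 1;   /* predictor has learned "taken" */
}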

Superscalar

Even with all of the added complexity and gates needed to support the concepts outlined above, improvements in semiconductor manufacturing soon allowed even more logic gates to be used.


In the outline above, the processor processes parts of a single instruction at a time. Computer programs could be executed faster if multiple instructions were processed simultaneously. This is what superscalar processors achieve, by replicating functional units such as ALUs. The replication of functional units was only made possible when the die area of a single-issue processor no longer stretched the limits of what could be reliably manufactured. By the late 1980s, superscalar designs started to enter the marketplace.

In modern designs it is common to find two load units, one store unit (many instructions have no results to store), two or more integer math units, two or more floating-point units, and often a SIMD unit of some sort. The instruction-issue logic grows in complexity as it reads in a long list of instructions from memory and hands them off to whichever execution units are idle at that point. The results are then collected and re-ordered at the end.

Out-of-order execution

The addition of caches reduces the frequency or duration of stalls due to waiting for data to be fetched from the memory hierarchy, but does not get rid of these stalls entirely. In early designs a cache miss would force the cache controller to stall the processor and wait. Of course there may be some other instruction in the program whose data is available in the cache at that point. Out-of-order execution allows that ready instruction to be processed while an older instruction waits on the cache, then re-orders the results to make it appear that everything happened in the programmed order. This technique is also used to avoid other operand dependency stalls, such as an instruction awaiting a result from a long latency floating-point operation or other multi-cycle operations.

Register renaming

Register renaming refers to a technique used to avoid unnecessary serialized execution of program instructions caused by the reuse of the same registers by those instructions. Suppose we have two groups of instructions that use the same register. One group would normally have to finish with the register before the other could begin; but if the second group is assigned to a different, equivalent register, both groups of instructions can be executed in parallel.
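
For illustration only (the "register" variable names below are invented), the C fragment shows a false dependence on r1 disappearing once the second group's write is renamed to a spare register r2, leaving the two groups free to execute in parallel.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;

    /* Without renaming, both groups reuse the same register r1,
     * so the second group must wait for the first to finish with it:
     *   r1 = a + b;  x = r1 * 2;
     *   r1 = c + d;  y = r1 * 3;
     * Renaming gives the second write its own physical register (r2),
     * removing the false dependence so both groups can run in parallel: */
    int r1 = a + b;  int x = r1 * 2;   /* group 1 */
    int r2 = c + d;  int y = r2 * 3;   /* group 2, independent of group 1 */

    printf("%d %d\n", x, y);
    return 0;
}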

Multiprocessing and multithreading

Computer architects have become stymied by the growing mismatch in CPU operating frequencies and DRAM access times. None of the techniques that exploited instruction-level parallelism (ILP) within one program could make up for the long stalls that occurred when data had to be fetched from main memory. Additionally, the large transistor counts and high operating frequencies needed for the more advanced ILP techniques required power dissipation levels that could no longer be cheaply cooled. For these reasons, newer generations of computers have started to exploit higher levels of parallelism that exist outside of a single program or program thread.

This trend is sometimes known as throughput computing. This idea originated in the mainframe market where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with massive numbers of transactions. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues.

One technique for achieving this parallelism is multiprocessing: computer systems with multiple CPUs. Once reserved for high-end mainframes and supercomputers, small-scale (2-8 CPU) multiprocessor servers have become commonplace in the small-business market. For large corporations, large-scale (16-256 CPU) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s.

With further transistor-size reductions made available by advances in semiconductor technology, multi-core CPUs have appeared, in which multiple CPUs are implemented on the same silicon chip. They were initially used in chips targeting embedded markets, where simpler and smaller CPUs allowed multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed chip multiprocessor (CMP) chips containing two high-end desktop CPUs to be manufactured in volume. Some designs, such as Sun Microsystems' UltraSPARC T1, have reverted to simpler (scalar, in-order) cores in order to fit more processors on one piece of silicon.

Another technique that has become more popular recently is multithreading. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling while the data arrives, the processor switches to another program or program thread which is ready to execute. Though this does not speed up a particular program/thread, it increases the overall system throughput by reducing the time the CPU is idle.

Conceptually, multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.

A further enhancement is simultaneous multithreading. This technique allows superscalar CPUs to execute instructions from different programs/threads simultaneously in the same cycle.
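
As a rough sketch of the idea in the previous paragraphs (the C struct and field names are invented for illustration), a hardware thread only needs its architectural state replicated; switching threads is then just a matter of selecting which copy feeds the pipeline, which is why it can happen in a single cycle.

#include <stdint.h>

#define NUM_REGS   32
#define HW_THREADS  4

/* Per-thread architectural state that the hardware replicates. */
struct hw_context {
    uint32_t pc;                  /* program counter                */
    uint32_t regs[NUM_REGS];      /* register file                  */
    int      ready;               /* 0 while waiting on slow memory */
};

static struct hw_context ctx[HW_THREADS];

/* Pick the next ready thread; because every context is already held in
 * hardware, this "switch" costs one cycle, not a full OS context switch. */
static int next_thread(int current) {
    for (int i = 1; i <= HW_THREADS; i++) {
        int t = (current + i) % HW_THREADS;
        if (ctx[t].ready)
            return t;
    }
    return current;   /* nothing else ready: keep running (or stall) */
}

int main(void) {
    for (int t = 0; t < HW_THREADS; t++) ctx[t].ready = 1;
    ctx[0].ready = 0;             /* thread 0 stalls on a memory access */
    return next_thread(0);        /* hardware switches to thread 1      */
}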

ARITHMETIC LOGIC UNIT

In digital electronics, an arithmetic logic unit (ALU) is a digital circuit that performs integer arithmetic and logical operations. The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers. The processors found inside modern CPUs and graphics processing units (GPUs) accommodate very powerful and very complex ALUs; a single component may contain a number of ALUs.

Mathematician John von Neumann proposed the ALU concept in 1945, when he wrote a report on the foundations for a new computer called the EDVAC.

Arithmetic And Logic Unit schematic symbol

Cascadable 8-bit ALU Texas Instruments SN74AS888


NUMERICAL SYSTEM

An ALU must process numbers using the same format as the rest of the digital circuit. The format of modern processors is almost always the two's complement binary number representation. Early computers used a wide variety of number systems, including ones' complement, two's complement, sign-magnitude format, and even true decimal systems, with various representations of the digits.

The ones' complement and two's complement number systems allow subtraction to be accomplished by adding the negative of a number in a very simple way, which removes the need for specialized subtraction circuits; however, calculating the negative in two's complement requires adding a one to the low-order bit and propagating the carry. An alternative way to do two's complement subtraction of A−B is to present a one to the carry input of the adder and use ¬B rather than B as the second input. The arithmetic, logic and shift circuits introduced in previous sections can be combined into one ALU with common selection.
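
The trick described above can be checked directly in software; the C sketch below (an illustration, not circuitry from the report) computes A−B by feeding the adder ¬B and a carry-in of one, and compares the result with ordinary subtraction.

#include <stdio.h>
#include <stdint.h>

/* Two's complement subtraction built from an adder:
 * A - B  ==  A + (~B) + 1, where the "+ 1" is the adder's carry-in. */
static uint8_t subtract_via_adder(uint8_t a, uint8_t b) {
    uint8_t carry_in = 1;
    return (uint8_t)(a + (uint8_t)~b + carry_in);
}

int main(void) {
    uint8_t a = 0x2C, b = 0x1F;                  /* 44 - 31 */
    printf("%u\n", subtract_via_adder(a, b));    /* prints 13 */
    printf("%u\n", (uint8_t)(a - b));            /* same result */
    return 0;
}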

PRACTICAL OVERVIEW

Most of a processor's operations are performed by one or more ALUs. An ALU loads data from input registers. Then an external control unit tells the ALU what operation to perform on that data, and then the ALU stores its result into an output register. The control unit is responsible for moving the processed data between these registers, the ALU and memory.

Complex operations

Engineers can design an arithmetic logic unit to calculate most operations. The more complex the operation, the more expensive the ALU is, the more space it uses in the processor, and the more power it dissipates. Therefore, engineers compromise. They make the ALU powerful enough to make the processor fast, yet not so complex as to become prohibitive. For example, computing the square root of a number might use:

1. Calculation in a single clock: Design an extraordinarily complex ALU that calculates the square root of any number in a single step.

2. Calculation pipeline: Design a very complex ALU that calculates the square root of any number in several steps. The intermediate results go through a series of circuits arranged like a factory production line. The ALU can accept new numbers to calculate even before having finished the previous ones. The ALU can now produce numbers as fast as a single-clock ALU, although the results start to flow out of the ALU only after an initial delay.

3. Iterative calculation: Design a complex ALU that calculates the square root through several steps. This usually relies on control from a complex control unit with built-in microcode.

4. Co-processor: Design a simple ALU in the processor, and sell a separate specialized and costly processor that the customer can install just beside this one, and which implements one of the options above.


5. Software libraries: Tell the programmers that there is no co-processor and no emulation, so they will have to write their own algorithms to calculate square roots in software (a minimal sketch of such a routine follows this list).

6. Software emulation: Emulate the existence of the co-processor. Whenever a program attempts to perform the square root calculation, make the processor check if there is a co-processor present and use it if there is one; if there is not one, interrupt the processing of the program and invoke the operating system to perform the square root calculation through some software algorithm.

The options above go from the fastest and most expensive to the slowest and least expensive. Therefore, while even the simplest computer can calculate the most complicated formula, the simplest computers will usually take a long time doing so because of the several steps needed to calculate the formula.

Powerful processors like the Intel Core and AMD64 implement option #1 for several simple operations, #2 for the most common complex operations and #3 for the extremely complex operations.
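
As an illustration of options #3 and #5 above, the C fragment below (a software sketch, not any vendor's actual routine) computes a square root iteratively with Newton's method, converging in a handful of steps rather than a single clock.

#include <stdio.h>

/* Newton's method for sqrt(x): repeatedly average the guess with x/guess.
 * Each pass roughly doubles the number of correct digits, so a few
 * iterations suffice; a microcoded ALU would iterate in much the same way. */
static double my_sqrt(double x) {
    if (x <= 0.0) return 0.0;
    double guess = x;
    for (int i = 0; i < 20; i++) {
        double next = 0.5 * (guess + x / guess);
        if (next == guess) break;      /* converged */
        guess = next;
    }
    return guess;
}

int main(void) {
    printf("%f\n", my_sqrt(2.0));      /* about 1.414214 */
    printf("%f\n", my_sqrt(144.0));    /* 12.000000 */
    return 0;
}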

Inputs and outputs

The inputs to the ALU are the data to be operated on (called operands) and a code from the control unit indicating which operation to perform. Its output is the result of the computation. One thing designers must keep in mind is whether the ALU will operate on big-endian or little-endian numbers.

In many designs, the ALU also takes or generates a set of condition codes from or to a status register. These codes are used to indicate cases such as carry-in or carry-out, overflow, divide-by-zero, etc.

A floating-point unit also performs arithmetic operations between two values, but it does so for numbers in floating-point representation, which is much more complicated than the two's complement representation used in a typical ALU. In order to do these calculations, an FPU has several complex circuits built in, including some internal ALUs.

In modern practice, engineers typically refer to the ALU as the circuit that performs integer arithmetic operations (like two's complement and BCD). Circuits that calculate more complex formats like floating point, complex numbers, etc. usually receive a more specific name such as floating-point unit (FPU).
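
Bringing the pieces of this section together, the toy C model below (its operation codes and flag layout are invented for illustration) takes two operands and an operation code, returns a result, and raises carry and zero condition codes the way a status register would.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical operation codes and condition-code flags. */
enum alu_op { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR };
#define FLAG_ZERO  0x1
#define FLAG_CARRY 0x2

static uint8_t alu(enum alu_op op, uint8_t a, uint8_t b, uint8_t *flags) {
    uint16_t wide;                        /* extra bit to capture the carry */
    switch (op) {
    case ALU_ADD: wide = (uint16_t)a + b;                break;
    case ALU_SUB: wide = (uint16_t)a + (uint8_t)~b + 1;  break;  /* A + ~B + 1 */
    case ALU_AND: wide = a & b;                          break;
    default:      wide = a | b;                          break;
    }
    uint8_t result = (uint8_t)wide;
    *flags = 0;
    if (result == 0)   *flags |= FLAG_ZERO;
    if (wide & 0x100)  *flags |= FLAG_CARRY;
    return result;
}

int main(void) {
    uint8_t flags;
    uint8_t r = alu(ALU_SUB, 5, 5, &flags);
    printf("result=%u zero=%d carry=%d\n",
           r, !!(flags & FLAG_ZERO), !!(flags & FLAG_CARRY));
    return 0;
}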


CENTRAL PROCESSING UNIT

An Intel 80486DX2 CPU, as seen from above.

An Intel 80486DX2, as seen from below

A central processing unit (CPU) is the hardware within a computer that carries out the instructions of a computer program by performing the basic arithmetical, logical, and input/output operations of the system. The term has been in use in the computer industry at least since the early 1960s. The form, design, and implementation of CPUs have changed over the course of their history, but their fundamental operation remains much the same.

A computer can have more than one CPU; this is called multiprocessing. All modern CPUs are microprocessors, meaning they are contained on a single chip. Some integrated circuits (ICs) can contain multiple CPUs on a single chip; those ICs are called multi-core processors. An IC containing a CPU can also contain peripheral devices and other components of a computer system; this is called a system on a chip (SoC).

Two typical components of a CPU are the arithmetic logic unit (ALU), which performs arithmetic and logical operations, and the control unit (CU), which extracts instructions from memory and decodes and executes them, calling on the ALU when necessary.

Not all computational systems rely on a central processing unit. An array processor or vector processor has multiple parallel computing elements, with no one unit considered the "center". In the distributed computing model, problems are solved by a distributed interconnected set of processors.


TRANSISTOR AND INTEGRATED CIRCUIT CPU

CPU, core memory, and external bus interface of a DEC PDP-8/I, made of medium-scale integrated circuits.

The design complexity of CPUs increased as various technologies facilitated building smaller and more reliable electronic devices. The first such improvement came with the advent of the transistor. Transistorized CPUs during the 1950s and 1960s no longer had to be built out of bulky, unreliable, and fragile switching elements like vacuum tubes and electrical relays. With this improvement more complex and reliable CPUs were built onto one or several printed circuit boards containing discrete (individual) components.

During this period, a method of manufacturing many interconnected transistors in a compact space was developed. The integrated circuit (IC) allowed a large number of transistors to be manufactured on a single semiconductor-based die, or "chip". At first only very basic non-specialized digital circuits such as NOR gates were miniaturized into ICs. CPUs based upon these "building block" ICs are generally referred to as "small-scale integration" (SSI) devices. SSI ICs, such as the ones used in the Apollo guidance computer, usually contained up to a few score transistors. To build an entire CPU out of SSI ICs required thousands of individual chips, but still consumed much less space and power than earlier discrete transistor designs. As microelectronic technology advanced, an increasing number of transistors were placed on ICs, thus decreasing the quantity of individual ICs needed for a complete CPU. MSI and LSI (medium- and large-scale integration) ICs increased transistor counts to hundreds, and then thousands.

In 1964 IBM introduced its System/360 computer architecture, which was used in a series of computers that could run the same programs at different speeds and performance levels. This was significant at a time when most electronic computers were incompatible with one another, even those made by the same manufacturer. To facilitate this improvement, IBM used the concept of a microprogram (often called "microcode"), which still sees widespread use in modern CPUs. The System/360 architecture was so popular that it dominated the mainframe computer market for decades and left a legacy that is still continued by similar modern computers like the IBM zSeries. In the same year (1964), Digital Equipment Corporation (DEC) introduced another influential computer aimed at the scientific and research markets, the PDP-8. DEC would later introduce the extremely popular PDP-11 line that was originally built with SSI ICs but was eventually implemented with LSI components once these became practical. In stark contrast with its SSI and MSI predecessors, the first LSI implementation of the PDP-11 contained a CPU composed of only four LSI integrated circuits.

Transistor-based computers had several distinct advantages over their predecessors. Aside from facilitating increased reliability and lower power consumption, transistors also allowed CPUs to operate at much higher speeds because of the short switching time of a transistor in comparison to a tube or relay. Thanks to both the increased reliability and the dramatically increased speed of the switching elements (which were almost exclusively transistors by this time), CPU clock rates in the tens of megahertz were obtained during this period. Additionally, while discrete transistor and IC CPUs were in heavy usage, new high-performance designs like SIMD (single instruction, multiple data) processors began to appear. These early experimental designs later gave rise to the era of specialized supercomputers like those made by Cray Inc.

MICROPROCESSORS

Die of an Intel 80486DX2 microprocessor (actual size: 12×6.75 mm) in its packaging

Intel Core i5 CPU on a Vaio E series laptop motherboard (on the right, beneath the heat pipe).

In the 1970s the fundamental inventions by Federico Faggin (silicon-gate MOS ICs with self-aligned gates, along with his new random logic design methodology) changed the design and implementation of CPUs forever. Since the introduction of the first commercially available microprocessor (the Intel 4004) in 1971, and the first widely used microprocessor (the Intel 8080) in 1974, this class of CPUs has almost completely overtaken all other central processing unit implementation methods. Mainframe and minicomputer manufacturers of the time launched proprietary IC development programs to upgrade their older computer architectures, and eventually produced instruction-set-compatible microprocessors that were backward-compatible with their older hardware and software. Combined with the advent and eventual success of the ubiquitous personal computer, the term CPU is now applied almost exclusively to microprocessors. Several CPUs (denoted 'cores') can be combined in a single processing chip.

Previous generations of CPUs were implemented as discrete components and numerous small integrated circuits (ICs) on one or more circuit boards. Microprocessors, on the other hand, are CPUs manufactured on a very small number of ICs, usually just one. The overall smaller CPU size, as a result of being implemented on a single die, means faster switching time because of physical factors like decreased gate parasitic capacitance. This has allowed synchronous microprocessors to have clock rates ranging from tens of megahertz to several gigahertz. Additionally, as the ability to construct exceedingly small transistors on an IC has increased, the complexity and number of transistors in a single CPU have increased many fold. This widely observed trend is described by Moore's law, which has proven to be a fairly accurate predictor of the growth of CPU (and other IC) complexity.

While the complexity, size, construction and general form of CPUs have changed enormously since 1950, it is notable that the basic design and function have not changed much at all. Almost all common CPUs today can be very accurately described as von Neumann stored-program machines. As the aforementioned Moore's law continues to hold true, concerns have arisen about the limits of integrated circuit transistor technology. Extreme miniaturization of electronic gates is causing the effects of phenomena like electromigration and subthreshold leakage to become much more significant. These newer concerns are among the many factors causing researchers to investigate new methods of computing such as the quantum computer, as well as to expand the use of parallelism and other methods that extend the usefulness of the classical von Neumann model.

OPERATION

The fundamental operation of most CPUs, regardless of the physical form they take, is to execute a sequence of stored instructions called a program. The instructions are kept in some kind of computer memory. There are four steps that nearly all CPUs use in their operation: fetch, decode, execute, and write back.

Fetch

The first step, fetch, involves retrieving an instruction (which is represented by a number or sequence of numbers) from program memory. The instruction's location (address) in program memory is determined by a program counter (PC), which stores a number that identifies the address of the next instruction to be fetched. After an instruction is fetched, the PC is incremented by the length of the instruction in terms of memory units so that it will contain the address of the next instruction in the sequence. Often, the instruction to be fetched must be retrieved from relatively slow memory, causing the CPU to stall while waiting for the instruction to be returned. This issue is largely addressed in modern processors by caches and pipeline architectures (discussed earlier).

Decode

The instruction that the CPU fetches from memory is used to determine what the CPU is to do. In the decode step, the instruction is broken up into parts that have significance to other portions of the CPU. The way in which the numerical instruction value is interpreted is defined by the CPU's instruction set architecture (ISA). Often, one group of numbers in the instruction, called the opcode, indicates which operation to perform. The remaining parts of the number usually provide information required for that instruction, such as operands for an addition operation. Such operands may be given as a constant value (called an immediate value), or as a place to locate a value: a register or a memory address, as determined by some addressing mode. In older designs the portions of the CPU responsible for instruction decoding were unchangeable hardware devices. However, in more abstract and complicated CPUs and ISAs, a microprogram is often used to assist in translating instructions into various configuration signals for the CPU. This microprogram is sometimes rewritable so that it can be modified to change the way the CPU decodes instructions even after it has been manufactured.


Execute

After the fetch and decode steps, the execute step is performed. During this step, various portions of the CPU are connected so they can perform the desired operation. If, for instance, an addition operation was requested, the arithmetic logic unit (ALU) will be connected to a set of inputs and a set of outputs. The inputs provide the numbers to be added, and the outputs will contain the final sum. The ALU contains the circuitry to perform simple arithmetic and logical operations on the inputs (like addition and bitwise operations). If the addition operation produces a result too large for the CPU to handle, an arithmetic overflow flag in a flags register may also be set.

The final step, write back, simply "writes back" the results of the execute step to some form of memory. Very often the results are written to some internal CPU register for quick access by subsequent instructions. In other cases results may be written to slower, but cheaper and larger, main memory. Some types of instructions manipulate the program counter rather than directly produce result data. These are generally called "jumps" and facilitate behavior like loops, conditional program execution (through the use of a conditional jump), and functions in programs. Many instructions will also change the state of digits in a "flags" register. These flags can be used to influence how a program behaves, since they often indicate the outcome of various operations. For example, one type of "compare" instruction considers two values and sets a number in the flags register according to which one is greater. This flag could then be used by a later jump instruction to determine program flow.

After the execution of the instruction and the write back of the resulting data, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the program counter. If the completed instruction was a jump, the program counter will be modified to contain the address of the instruction that was jumped to, and program execution continues normally. In more complex CPUs than the one described here, multiple instructions can be fetched, decoded, and executed simultaneously. This section describes what is generally referred to as the "classic RISC pipeline", which in fact is quite common among the simple CPUs used in many electronic devices (often called microcontrollers). It largely ignores the important role of CPU cache, and therefore the memory-access stage of the pipeline.
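
The fetch, decode, execute and write-back cycle described above can be captured in a few lines; the toy machine below (its 8-bit instruction format and opcodes are invented purely for illustration) fetches the instruction at the PC, decodes an opcode and register fields, executes on a tiny register file, and writes the result back.

#include <stdio.h>
#include <stdint.h>

/* Toy instruction format (hypothetical): oooo ddss
 *   oooo = opcode (0 = halt, 1 = add, 2 = sub)
 *   dd   = destination/first source register, ss = second source register */
int main(void) {
    uint8_t program[] = { 0x16, 0x2B, 0x00 };   /* r1 += r2; r2 -= r3; halt */
    uint8_t regs[4]   = { 0, 5, 7, 3 };
    uint8_t pc = 0;

    for (;;) {
        uint8_t instr = program[pc++];          /* fetch, then bump the PC  */
        uint8_t op = instr >> 4;                /* decode: opcode field     */
        uint8_t rd = (instr >> 2) & 0x3;        /*         destination reg  */
        uint8_t rs = instr & 0x3;               /*         source reg       */
        if (op == 0) break;                     /* halt                     */
        uint8_t result = (op == 1) ? regs[rd] + regs[rs]
                                   : regs[rd] - regs[rs];   /* execute      */
        regs[rd] = result;                      /* write back               */
    }
    printf("r1=%u r2=%u\n", regs[1], regs[2]);  /* r1=12 r2=4 */
    return 0;
}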

PERFORMANCE

The performance or speed of a processor depends on, among many other factors, the clock rate (generally given in multiples of hertz) and the instructions per clock (IPC), which together are the factors for the instructions per second (IPS) that the CPU can perform. Many reported IPS values have represented "peak" execution rates on artificial instruction sequences with few branches, whereas realistic workloads consist of a mix of instructions and applications, some of which take longer to execute than others. The performance of the memory hierarchy also greatly affects processor performance, an issue barely considered in MIPS calculations. Because of these problems, various standardized tests (often called "benchmarks"), such as SPECint, have been developed to attempt to measure the real effective performance in commonly used applications.
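
As a quick worked example of the relation mentioned above (the figures are made up for illustration), instructions per second is simply the clock rate multiplied by the average number of instructions retired per clock:

#include <stdio.h>

int main(void) {
    /* Hypothetical figures: a 3.0 GHz clock retiring 2.5 instructions
     * per cycle on average gives 7.5 billion instructions per second. */
    double clock_hz = 3.0e9;
    double ipc      = 2.5;
    double ips      = clock_hz * ipc;
    printf("%.1f GIPS\n", ips / 1e9);   /* prints 7.5 GIPS */
    return 0;
}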

Processing performance of computers is increased by using multi-core processors, which essentially means plugging two or more individual processors (called cores in this sense) into one integrated circuit. Ideally, a dual-core processor would be nearly twice as powerful as a single-core processor. In practice, the performance gain is far smaller, only about 50%, due to imperfect software algorithms and implementations. Increasing the number of cores in a processor (i.e. dual-core, quad-core, etc.) increases the workload that can be handled. This means that the processor can now handle numerous asynchronous events, interrupts, etc. which can take a toll on the CPU (central processing unit) when it is overwhelmed. These cores can be thought of as different floors in a processing plant, with each floor handling a different task. Sometimes, these cores will handle the same tasks as cores adjacent to them if a single core is not enough to handle the information.