
    RISC vs. CISC: the Post-RISC Era

    A historical approach to the debate

    by Hannibal

Retrieved 07/28/2014 from http://archive.arstechnica.com/cpu/4q99/risc-cisc/rvc-1.html; article dated 10/1999.

    Framing the Debate

The majority of today's processors can't rightfully be called completely RISC or completely CISC. The two textbook architectures have evolved towards each other to such an extent that there's no longer a clear distinction between their respective approaches to increasing performance and efficiency. To be specific, chips that implement the x86 CISC ISA have come to look a lot like chips that implement various RISC ISAs; the instruction set architecture is the same, but under the hood it's a whole different ball game. But this hasn't been a one-way trend. Rather, the same goes for today's so-called RISC CPUs. They've added more instructions and more complexity to the point where they're every bit as complex as their CISC counterparts. Thus the "RISC vs. CISC" debate really exists only in the minds of marketing departments and platform advocates whose purpose in creating and perpetuating this fictitious conflict is to promote their pet product by means of name-calling and sloganeering.

At this point, I'd like to reference a statement made by David Ditzel, the chief architect of Sun's SPARC family and CEO of Transmeta.

    "Today [in RISC] we have large design teams and long designcycles," he said. "The performance story is also much lessclear now. The die sizes are no longer small. It just doesn'tseem to make as much sense." The result is the current cropof complex RISC chips. "Superscalar and out-of-orderexecution are the biggest problem areas that have impededperformance [leaps]," Ditzel said. "The MIPS R10,000 and HPPA-8000 seem much more complex to me than today'sstandard CISC architecture, which is the Pentium II. So where

    is the advantage of RISC, if the chips aren't as simpleanymore?"

    http://www.eet.com/story/OEG19981201S0003http://archive.arstechnica.com/cpu/4q99/risc-cisc/rvc-1.htmlmailto:[email protected]

This statement is important, and it sums up the current feeling among researchers. Instead of RISC or CISC CPUs, what we have now no longer fits in the old categories. Welcome to the post-RISC era. What follows is a completely revised and re-clarified thesis, which found its first expression here on Ars over a year ago, before Ditzel spoke his mind on the matter, and before I had the chance to exchange e-mail with so many thoughtful and informed readers.

    In this paper, I'll argue the following points:

1. RISC was not a specific technology as much as it was a design strategy that developed in reaction to a particular school of thought in computer design. It was a rebellion against prevailing norms--norms that no longer prevail in today's world, and that I'll talk about below.

    2. "CISC" was invented retroactively as a catch-all term for thetype of thinking against which RISC was a reaction.

3. We now live in a "post-RISC" world, where the terms RISC and CISC have lost their relevance (except to marketing departments and platform advocates). In a post-RISC world, each architecture and implementation must be judged on its own merits, and not in terms of a narrow, bipolar, compartmentalized worldview that tries to cram all designs into one of two "camps."

After charting the historical development of the RISC and CISC design strategies, and situating those philosophies in their proper historical/technological context, I'll discuss the idea of a post-RISC processor, and show how such processors don't fit neatly into the RISC and CISC categories.

    The historical approach

Perhaps the most common approach to comparing RISC and CISC is to list the features of each and place them side by side for comparison, discussing how each feature aids or hinders performance. This approach is fine if you're comparing two contemporary and competing pieces of technology, like OSs, video cards, specific CPUs, etc., but it fails when applied to


RISC and CISC. It fails because RISC and CISC are not so much technologies as they are design strategies--approaches to achieving a specific set of goals that were defined in relation to a particular set of problems. Or, to be a bit more abstract, we could also call them design philosophies, or ways of thinking about a set of problems and their solutions.

It's important to see these two design strategies as having developed out of a particular set of technological conditions that existed at a specific point in time. Each was an approach to designing machines that designers felt made the most efficient use of the technological resources then available. In formulating and applying these strategies, researchers took into account the limitations of the day's technology, limitations that don't necessarily exist today. Understanding what those limitations were and how computer architects worked within them is the key to understanding RISC and CISC. Thus, a true RISC vs. CISC comparison requires more than just feature lists, SPEC benchmarks, and sloganeering; it requires a historical context.

In order to understand the historical and technological context out of which RISC and CISC developed, it is first necessary to understand the state of the art in VLSI, storage/memory, and compilers in the late 70s and early 80s. These three technologies defined the technological environment in which researchers worked to build the fastest machines.

    Storage and memory

It's hard to overestimate the effects that the state of storage technology had on computer design in the 70s and 80s. In the 1970s, computers used magnetic core memory to store program code; core memory was not only expensive, it was agonizingly slow. After the introduction of RAM things got a bit better on the speed front, but this didn't address the cost part of the equation. To help you wrap your mind around the situation, consider the fact that in 1977, 1MB of DRAM cost about $5,000. By 1994, that price had dropped to under $6 (in 1977 dollars) [2]. In addition to the high price of RAM, secondary storage was expensive and slow, so paging large volumes of code into RAM from the secondary store impeded performance in a major way.

VLSI

The state of the art in Very Large Scale Integration (VLSI) yielded transistor densities that were low by today's standards. You just couldn't fit too much functionality onto one chip. Back in 1981, when Patterson and Séquin first proposed the RISC I project (RISC I later became the foundation for Sun's SPARC architecture), a million transistors on a single chip was a lot [1]. Because of the paucity of available transistor resources, the CISC machines of the day, like the VAX, had their various functional units split up across multiple chips. This was a problem, because the delay-power penalty on data transfers between chips limited performance. A single-chip implementation would have been ideal but, for reasons we'll get into in a moment, it wasn't feasible without a radical rethinking of current designs.

    The CISC solution

    The HLLCA and the software crisis

Both the sorry state of early compilers and the memory-induced constraints on code size caused some researchers in the late 60s and early 70s to predict a coming "software crisis." Hardware was getting cheaper, they argued, while software costs were spiraling out of control. A number of these researchers insisted that the only way to stave off impending doom was to shift the burden of complexity from the (increasingly expensive) software level to the (increasingly inexpensive) hardware level. If there was a common function or operation for which a programmer had to write out all the steps every time he or she used it, why not just implement that function in hardware and make everyone's life easier? After all, hardware was cheap (relatively speaking) and programmer time wasn't. This idea of moving complexity from the software realm to the hardware realm is the driving idea behind CISC, and almost everything that a true CISC machine does is aimed at this end.

Some researchers suggested that the way to make programmers' and compiler-writers' jobs easier was to "close the semantic gap" between statements in a high-level language and the corresponding statements in assembler. "Closing the semantic gap" was a fancy way of saying that system designers should make assembly code look more like C or PASCAL code. The most extreme proponents of this kind of thinking were calling for the move to a High-Level Language Computing Architecture (HLLCA).

The HLLCA was CISC taken to the extreme. Its primary motivation was to reduce overall system costs by making computers easy to program for. By simplifying the programmer's and compiler's jobs, it was thought that software costs could be brought under control. Here's a list of some of the most commonly stated reasons for promoting HLLCAs [5]:

Reduce the difficulty of writing compilers.
Reduce the total system cost.
Reduce software development costs.
Eliminate or drastically reduce system software.
Reduce the semantic gap between programming and machine languages.
Make programs written in a HLL run more efficiently.
Improve code compaction.
Ease debugging.

To summarize the above: if a complex statement in a HLL were to translate directly into exactly one instruction in assembler, then

Compilers would be much easier to write. This would save time and effort for software developers, thereby reducing software development costs.

Code would be more compact. This would save on RAM, thereby reducing the overall cost of the system hardware.

Code would be easier to debug. Again, this would save on software development and maintenance costs.

At this point in our discussion, it's important to note that I'm not asserting that the flurry of literature published on HLLCAs amounted to a "CISC movement" by some "CISC camp," in the way that there was a RISC movement led by the Berkeley, IBM, and Stanford groups. There never actually was any sort of "CISC movement" to speak of. In fact, the term "CISC" was invented only after the RISC movement had started. "CISC" eventually came to be a pejorative term meaning "anything not RISC." So I'm not equating the HLLCA literature with "CISC literature" produced by a "CISC movement"; rather, I'm using it to exemplify one of the main schools of thought in computer architecture at the time, a school of thought to which RISC was a reaction. We'll see more of that in a bit.


registers.) Our H compiler translates this code into assembler for the ARS-1 platform. The ARS-1 ISA has only two instructions:

MOVE [destination register, integer or source register]. This instruction takes a value, either an integer or the contents of another register, and places it in the destination register. So MOVE [D, 5] would place the number 5 in register D. MOVE [D, E] would take whatever number is stored in E and place it in D.

MUL [destination register, integer or multiplicand register]. This instruction takes the contents of the destination register and multiplies it by either an integer or the contents of the multiplicand register, and places the result in the destination register. So MUL [D, 70] would multiply the contents of D by 70 and place the result in D. MUL [D, E] would multiply the contents of D by the contents of E, and place the result in D.
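To make the semantics of these two instructions concrete, here is a minimal Python sketch of the ARS-1 model. The six registers and the MOVE/MUL behavior come from the description above; the Python rendering itself (the regs dictionary, the val() helper) is purely illustrative:

    # A toy model of the hypothetical ARS-1 ISA: six registers, two instructions.
    regs = {r: 0 for r in "ABCDEF"}

    def val(x):
        # An operand is either an integer or the name of a source register.
        return regs[x] if isinstance(x, str) else x

    def move(dst, src):
        # MOVE [dst, src]: place an integer or another register's contents in dst.
        regs[dst] = val(src)

    def mul(dst, src):
        # MUL [dst, src]: multiply dst's contents by the operand; result stays in dst.
        regs[dst] = regs[dst] * val(src)

    # The four-statement CUBE expansion shown in the table below:
    move("A", 20)
    move("B", "A")
    mul("B", "A")
    mul("B", "A")
    assert regs["B"] == 20 ** 3  # 8000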

Statements in H:

1. A = 20
2. B = CUBE (A)

Statements in ARS-1 Assembly:

1. MOVE [A, 20]
2. MOVE [B, A]
3. MUL [B, A]
4. MUL [B, A]

Notice how in the above example it takes four statements in ARS-1 assembly to do the work of two statements in H? This is because the ARS-1 computer has no instruction for taking the CUBE of a number; you have to use a MOVE and two MUL instructions to do it. So if you have an H program that uses the CUBE() function extensively, then when you compile it, the assembler program will be quite a bit larger than the H program. This is a problem, because the ARS-1 computer doesn't have much memory. In addition, it takes the compiler a long time to translate all those CUBE() statements into MOVE[] and MUL[] instructions. Finally, if a programmer decides to forget about H and just write code in ARS-1 assembler, he or she has more typing to do, and the fact that the code is so long makes it harder to debug.


One way to solve this problem would be to include a CUBE instruction in the next generation of the ARS architecture. So when the ARS-2 comes out, it has an instruction that looks as follows:

CUBE [destination register, multiplicand register]. This instruction takes the contents of the multiplicand register and cubes it. It then places the result in the destination register. So CUBE [D, E] takes whatever value is in E, cubes it, and places the result in D.

Statements in H:

1. A = 20
2. B = CUBE (A)

Statements in ARS-2 Assembly:

1. MOVE [A, 20]
2. CUBE [B, A]

So now there is a one-to-one correspondence between the statements in H and the statements in ARS-2 assembly. The "semantic gap" has been closed, and compiled code will be smaller: easier to generate, easier to store, and easier to debug. Of course, the ARS-2 computer still cubes numbers by multiplying them together repeatedly in hardware, but the programmer doesn't need to know that. All the programmer knows is that there is an instruction on the machine that will cube a number for him; how it happens he doesn't care. This is a good example of the fundamental CISC tenet of moving complexity from the software level to the hardware level.
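In terms of the sketch above, the ARS-2 change amounts to adding one more "instruction" whose internals the programmer never sees; the repeated multiplication has moved inside it. (Again, this is purely illustrative, and it reuses the regs and move() definitions from the previous sketch.)

    def cube(dst, src):
        # CUBE [dst, src]: the repeated multiplication now happens inside the
        # instruction itself -- "in hardware" -- rather than in the program.
        regs[dst] = regs[src] ** 3

    move("A", 20)
    cube("B", "A")
    assert regs["B"] == 8000  # same result, half the instructions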

    Complex addressing modes

Besides implementing all kinds of instructions that do elaborate things like cube numbers, copy strings, convert values to BCD, etc., there was another tactic that researchers used to reduce code size and complication: complex addressing modes. The picture below shows the storage scheme for a generic computer.

[Figure: the storage scheme for a generic computer, with main memory locations 1:1 through 6:4, six registers (A through F), and an execution unit (ALU).]

If you want to multiply two numbers, you would first load each operand from a location in main memory (locations 1:1 through 6:4) into one of the six registers (A, B, C, D, E, or F). Once the numbers are loaded into the registers, they can be multiplied by the execution unit (or ALU).

Since ARS-1 has a simple load/store addressing scheme, we would use the following code to multiply the contents of memory locations 2:3 and 5:2, and store the result in address 2:3.

1. MOVE [A, 2:3]
2. MOVE [B, 5:2]
3. MUL [A, B]
4. MOVE [2:3, A]

The above code spells out explicitly the steps that ARS-1 has to take to multiply the contents of the two memory locations together. It tells the computer to load the two registers with the contents of main memory, multiply the two numbers, and store the result back in main memory.

If we wanted to make the assembly less complicated and more compact, we could modify the ARS architecture so that when ARS-2 is released, the above operation can be done with only one instruction. To do this, we change the MUL instruction so that it can take two memory addresses as its operands. So the ARS-2 assembler for the memory-to-memory multiply operation would look like this:

1. MUL [2:3, 5:2]

Changing four instructions to one is a pretty big savings. Now, the ARS-2 still has to load the contents of the two memory locations into registers, multiply them, and write the result back out; there's no getting around all that. But all of those lower-level operations are done in hardware and are invisible to the programmer. So all that complicated work of shuffling memory and register contents around is hidden; the computer takes care of it behind the scenes. This is an example of a complex addressing mode: that one assembler instruction actually carries out a "complex" series of operations. Once again, this is an example of the CISC philosophy of moving functionality from software into hardware.
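Here's the same idea in miniature, as a sketch: the "row:column" addresses become keys into a dictionary standing in for main memory, and the hidden loads and store live inside the one instruction. The example contents (6 and 7) are made up:

    mem = {"2:3": 6, "5:2": 7}  # hypothetical contents of the two locations

    def mul_mem(dst_addr, src_addr):
        # ARS-2's memory-to-memory MUL: two loads, a multiply, and a store,
        # all carried out behind the scenes by this single instruction.
        a = mem[dst_addr]      # hidden LOAD into internal scratch storage
        b = mem[src_addr]      # hidden LOAD
        mem[dst_addr] = a * b  # hidden STORE

    mul_mem("2:3", "5:2")
    assert mem["2:3"] == 42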

    Microcode vs. direct execution

Microprogramming was one of the key breakthroughs that allowed system architects to implement complex instructions in hardware [6]. To understand what microprogramming is, it helps to first consider the alternative: direct execution. With direct execution, the machine fetches an instruction from memory and feeds it into a hardwired control unit. This control unit takes the instruction as its input and activates some circuitry that carries out the task. For instance, if the machine fetches a floating-point ADD and feeds it to the control unit, there's a circuit somewhere in there that kicks in and directs the execution units to make sure that all of the shifting, adding, and normalization gets done. Direct execution is actually pretty much what you'd expect to go on inside a computer if you didn't know about microcoding.

The main advantage of direct execution is that it's fast. There's no extra abstraction or translation going on; the machine is just decoding and executing the instructions right in hardware. The problem with it is that it can take up quite a bit of space. Think about it: if every instruction has to have some circuitry that executes it, then the more instructions you have, the more space the control unit will take up. This problem is compounded if some of the instructions are big and complex, and take a lot of work to execute. So directly executing instructions for a CISC machine just wasn't feasible with the limited transistor resources of the day.

Enter microprogramming. With microprogramming, it's almost like there's a mini-CPU on the CPU. The control unit is a microcode engine that executes microcode instructions. The CPU designer uses these microinstructions to write microprograms, which are stored in a special control memory. When a normal program instruction is fetched from memory and fed into the microcode engine, the microcode engine executes the proper microcode subroutine. This subroutine tells the various functional units what to do and how to do it.
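As a rough, self-contained sketch (no real machine's microcode looks like this), you can picture control memory as a table that maps each program-visible instruction to a stored subroutine of simple micro-ops:

    regs = {r: 0 for r in "ABCDEF"}

    def u_move(dst, src):
        # micro-op: copy an integer or a register's contents into dst
        regs[dst] = regs[src] if isinstance(src, str) else src

    def u_mul(dst, src):
        # micro-op: register-to-register multiply
        regs[dst] = regs[dst] * regs[src]

    def cube_microprogram(dst, src):
        # The stored microcode subroutine behind a CUBE macro-instruction.
        u_move(dst, src)
        u_mul(dst, src)
        u_mul(dst, src)

    # Control memory: macro-instruction opcode -> microprogram.
    control_memory = {"MOVE": u_move, "MUL": u_mul, "CUBE": cube_microprogram}

    def microcode_engine(opcode, dst, src):
        # The control unit looks up each fetched instruction and runs its microprogram.
        control_memory[opcode](dst, src)

    microcode_engine("MOVE", "A", 20)
    microcode_engine("CUBE", "B", "A")
    assert regs["B"] == 20 ** 3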

As you can probably guess, in the beginning microcode was a pretty slow way to do things, but because the ROM used for control memory was about 10 times faster than magnetic core-based main memory, the microcode engine could stay far enough ahead to offer decent performance [7]. As microcode technology evolved, it got faster and faster. (The microcode engines on current CPUs are about 95% as fast as direct execution [10].) Since microcode technology was getting better and better, it made more and more sense to just move functionality from (slower and more expensive) software to (faster and cheaper) hardware. So ISA instruction counts grew, and program instruction counts shrank.

As microprograms got bigger and bigger to accommodate the growing instruction sets, however, some serious problems started to emerge. To keep performance up, microcode had to be highly optimized with no inefficiencies, and it had to be extremely compact in order to keep memory costs down. And since microcode programs were now so large, it became much harder to test and debug the code. As a result, the microcode that shipped with machines was often buggy and had to be patched numerous times out in the field. It was the difficulties involved with using microcode for control that spurred Patterson and others to question whether implementing all of these complex, elaborate instructions in microcode was really the best use of limited transistor resources [11].

    The RISC solution


For reasons we won't get into here, the "software crisis" of the 60s and 70s never quite hit. By 1981, technology had changed, but architectures were still following the same old trend: move complexity from software to hardware. As I mentioned earlier, many CISC implementations were so complex that they spanned multiple chips. This situation was, for obvious reasons, not ideal. What was needed was a single-chip solution, one that would make optimal use of the scarce transistor resources available. However, if you were going to fit an entire CPU onto one chip, you had to throw some stuff overboard. To this end, there were studies done that were aimed at profiling actual running application code and seeing what types of situations occurred most often. The idea was to find out what the computer spent the most time working on, and optimize the architecture for that task. If there were tradeoffs to be made, they should be made in favor of speeding up what the computer spends the most time on, even if it means slowing down other, less commonly done tasks. This quantitative approach to computer design was summed up by Patterson in the famous dictum: make the common case fast.

As it turned out, making the common case fast meant reversing the trend that CISC had started: functionality and complexity had to move out of the hardware and back into the software. Compiler technology was getting better and memory was getting cheaper, so many of the concerns that drove designers towards more complex instruction sets were now, for the most part, unfounded. High-level language support could be better done in software, reasoned researchers; spending precious hardware resources on HLL support was wasteful. Those resources could be used in other places to enhance performance.

    Simple instructions and the return of direct execution

When RISC researchers went looking for excess functionality to throw overboard, the first thing to go was the microcode engine, and with the microcode engine went all those fancy instructions that allegedly made programmers' and compiler-writers' jobs so much easier. What Patterson and others had discovered was that hardly anyone was using the more exotic instructions. Compiler-writers certainly weren't using them; they were just too much of a pain to implement. When compiling code, compilers forsook the more complex instructions, opting instead to output groups of smaller instructions that did the same thing. What researchers learned from profiling applications is that a small percentage of an ISA's instructions were doing the majority of the work. Those rarely-used instructions could simply be eliminated without really losing any functionality. This idea of reducing the instruction set by getting rid of all but the most necessary instructions, and replacing more complex instructions with groups of smaller ones, is what gave rise to the term Reduced Instruction Set Computer. By including only a small, carefully chosen group of instructions on a machine, you could get rid of the microcode engine and move to the faster and more reliable direct execution control method.

Not only was the number of instructions reduced, but the size of each instruction was reduced as well [18]. It was decided that all RISC instructions were, whenever possible, to take only one cycle to complete. The reasoning behind this decision was based on a few observations. First, researchers realized that anything that could be done with microcode instructions could be done with small, fast assembly-language instructions. The memory that was being used to store microcode could just be used to store assembler, so the need for microcode would be obviated altogether. Therefore, many of the instructions on a RISC machine corresponded to microinstructions on a CISC machine. [12]

The second thing that drove the move to a one-cycle, uniform instruction format was the observation that pipelining is really only feasible to implement if you don't have to deal with instructions of varying degrees of complexity. Since pipelining allows you to execute multiple pieces of different instructions in parallel, a pipelined machine has a drastically lower average number of cycles per instruction (CPI). (For an in-depth discussion of pipelining, check out my K7 design preview at http://www.arstechnica.com/cpu/3q99/k7_theory/k7-one-1.html.) Lowering the average number of cycles that a machine's instructions take to execute is one very effective way to lower the overall time it takes to run a program.
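A back-of-the-envelope illustration of why pipelining lowers CPI, with idealized numbers (a hypothetical five-stage pipeline, no stalls or hazards):

    # Without pipelining, each instruction occupies all 5 stages serially;
    # with pipelining, one instruction completes per cycle once the pipe is full.
    n, stages = 1_000_000, 5
    unpipelined_cpi = (n * stages) / n        # = 5.0
    pipelined_cpi = (n + (stages - 1)) / n    # ~= 1.0
    print(unpipelined_cpi, pipelined_cpi)     # 5.0 1.000004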

    RISC and the performance equation

Our discussion of pipelining and its effect on CPI brings us back to a consideration of the performance equation:


time/program = (instructions/program) x (cycles/instruction) x (time/cycle)

RISC designers tried to reduce the time per program by decreasing the second term on the right-hand side, while allowing the first term to increase slightly. It was reasoned that the reduction in cycles-per-instruction achieved by reducing the instruction set and adding pipelining and other features (about which we'll talk more in a moment) would more than compensate for any increase in the number of instructions per program. As it turns out, this reasoning was right.
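To see the tradeoff numerically, here's a quick calculation with invented numbers (the instruction counts, CPIs, and the 10ns cycle time are hypothetical, chosen only to show the shape of the argument):

    # time/program = (instructions/program) x (cycles/instruction) x (time/cycle)
    cycle = 10e-9                          # 10 ns clock for both machines
    cisc_time = 10_000_000 * 4.0 * cycle   # fewer instructions, higher CPI -> 0.40 s
    risc_time = 13_000_000 * 1.3 * cycle   # ~30% more instructions, far lower CPI -> ~0.17 s
    print(cisc_time, risc_time)

Even though the hypothetical RISC-style program executes 30% more instructions, the lower CPI wins by better than a factor of two.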

    LOAD/STORE and registers

Besides pipelining, there were two key innovations that allowed RISC designers to both decrease CPI and keep code bloat to a minimum: the elimination of complex addressing modes and the increase in the number of architectural registers. In a RISC architecture, there are only register-to-register operations, and only LOADs and STOREs can access memory. Recall the ARS-1 and ARS-2 example architectures we looked at earlier.

In a LOAD/STORE architecture, an ARS-2 instruction like MUL [2:3, 5:2] couldn't exist. You would have to represent this instruction with two LOAD instructions (used to load operands from memory into the registers), one register-to-register MUL instruction (like MUL [A, B]), and a STORE instruction (used to write the result back to memory). You would think that having to use LOADs and STOREs instead of a single, memory-to-memory instruction would increase the instruction count so much that memory usage and performance would suffer. As it turns out, there are a few reasons why the code doesn't get as bloated as you might expect.
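Extending the earlier sketches, the LOAD/STORE decomposition of MUL [2:3, 5:2] might look like the following (again, the mnemonics and memory contents are hypothetical):

    regs = {r: 0 for r in "ABCDEF"}
    mem = {"2:3": 6, "5:2": 7}

    def load(dst, addr):
        regs[dst] = mem[addr]              # only LOADs read memory

    def store(addr, src):
        mem[addr] = regs[src]              # only STOREs write memory

    def mul(dst, src):
        regs[dst] = regs[dst] * regs[src]  # register-to-register only

    load("A", "2:3")    # LOAD the first operand
    load("B", "5:2")    # LOAD the second operand
    mul("A", "B")       # register-to-register MUL
    store("2:3", "A")   # STORE the result back
    assert mem["2:3"] == 42

Note that, unlike in the memory-to-memory version, the operands are still sitting in A and B afterwards, where the compiler can reuse them.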

The aforementioned profiles of HLL application code showed Patterson and his colleagues that local scalars are by far the most frequent operands to appear in a program; they found that over 80% of the scalar values in a program are local variables [13]. This meant that if they added multiple banks of registers to the architecture, they could keep those local scalars right there in the registers, and avoid having to LOAD them every time. Thus, whenever a subroutine is called, all the local scalars are loaded into a bank of registers and kept there as needed [18].

In contrast, my hypothetical ARS-2 machine uses microcoded operand-specifiers to carry out the loads and stores associated with memory-to-memory operations (much like the VAX). What this means is that whenever the ARS-2 encounters something like the MUL [2:3, 5:2] instruction, its microcode engine translates this MUL into a set of microinstructions that:

1. LOAD the contents of 2:3 into a register,
2. LOAD the contents of 5:2 into a register,
3. MUL the two registers, and
4. STORE the result back in 2:3.

This series of LOADs and STOREs takes multiple cycles, just like it would on a RISC machine. The only difference is that those cycles are charged to the MUL instruction itself, making the MUL a multicycle instruction. And after the MUL is over and the result is written to memory, the ARS-2's microcode program will write over the contents of the two registers it just used with whatever data is needed next, instead of keeping it around for reuse. This means that the ARS-2 is actually doing more LOADs and STOREs than a RISC machine would, because it can't split the memory accesses off from the MUL instruction and manage them intelligently.

Since those LOADs and STOREs are tied to the MUL instruction, the compiler can't shuffle them around and rearrange them for maximum efficiency. In contrast, the RISC's separation of LOADs and STOREs from other instructions allows the compiler to schedule an operation in the delay slot immediately after the LOAD. So while the machine is waiting a few cycles for the data to be loaded into the register, it can do something else instead of sitting idle. Many CISC machines, like the VAX, take advantage of this LOAD delay slot too, but this has to be done in microcode.

    The changed role of the compiler

As you can see from the above discussion, the compiler's role in managing memory accesses is quite different on a RISC machine than it is on a CISC one. Patterson notes:


    "RISC compilers try to keep operands in registers so thatsimple register-to-register instructions can be used. Traditionalcompilers, on the other hand, try to discover the idealaddressing mode and the shortest instruction format to addthe operands in memory. In general, the designers of RISC

    compilers prefer a register-to-register model of execution sothat compliers can keep operands that will be reused inregisters, rather than repeating a memory access of acalculation. They therefore use LOADs and STOREs to accessmemory so that operands are not implicitly discarded afterbeing fetched, as in the memory-to-memoryarchitecture." [16]

In a RISC architecture, the compiler's role is much more prominent. The success of RISC actively depends on intelligent, optimizing compilers that can take on the increased responsibilities that RISC gives them and put out optimal code. This act of shifting the burden of code optimization from the hardware to the compiler was one of the key advances of the RISC revolution. Since the hardware was now simpler, the software had to absorb some of the complexity by aggressively profiling the code and making judicious use of RISC's minimal instruction set and expanded register count. Thus RISC machines devoted their limited transistor resources to providing an environment in which code could be executed as quickly as possible, trusting that the compiler had made the code compact and optimal.

    RISC and CISC, Side by Side?

By now, it should be apparent that the acronyms "RISC" and "CISC" belie the fact that both design philosophies deal with much more than just the simplicity or complexity of an instruction set. In the table below, I summarize the information that I've presented so far, beginning with each philosophy's general strategy for increasing performance and keeping costs down. I hope you've seen enough by now to understand, however, that any approach that affects price will affect performance, and vice versa, so my division of RISC and CISC design strategies into "price" and "performance" is somewhat arbitrary and artificial. In fact, because the RISC and CISC design philosophies developed within a matrix defined by the price and performance of the technologies we've discussed (VLSI, compilers, memory/storage), the following summary of RISC and CISC strategies and features should be understood only as a set of heuristics for helping you organize and develop your own thoughts on the design decisions that CPU architects make, and not as hard-and-fast rules or definitions for determining exactly what does and does not constitute RISC and/or CISC [17].

CISC

Price/Performance Strategies:

Price: move complexity from software to hardware.
Performance: make tradeoffs in favor of decreased code size, at the expense of a higher CPI.

Design Decisions:

A large and varied instruction set that includes simple, fast instructions for performing basic tasks, as well as complex, multi-cycle instructions that correspond to statements in an HLL.
Support for HLLs is done in hardware.
Memory-to-memory addressing modes.
A microcode control unit.
Spend fewer transistors on registers.

RISC

Price/Performance Strategies:

Price: move complexity from hardware to software.
Performance: make tradeoffs in favor of a lower CPI, at the expense of increased code size.

Design Decisions:

Simple, single-cycle instructions that perform only basic functions. Assembler instructions correspond to microcode instructions on a CISC machine.
All HLL support is done in software.
Simple addressing modes that allow only LOAD and STORE to access memory. All operations are register-to-register.
A direct execution control unit.
Spend more transistors on multiple banks of registers.
Use pipelined execution to lower CPI.

    Post-RISC architectures and the current state of the art


A group at Michigan State University's Department of Computer Science published an excellent paper called Beyond RISC - The Post-RISC Architecture [16] (http://www.egr.msu.edu/~crs/papers/postrisc2/). In this paper, they argue that today's RISC processors have departed from their RISC roots to the point where they can no longer rightly be called RISC. (I'll be drawing on a number of ideas from this paper to make my points, so before writing me with corrections/questions/flames/etc., you should read their paper for a full explanation and defense of some of the following assertions.) The paper notes that since the first RISC designs started to appear in the 80s, transistor counts have risen and architects have taken advantage of the increased transistor resources in a number of ways:

additional registers
on-chip caches which are clocked as fast as the processor
additional functional units for superscalar execution
additional "non-RISC" (but fast) instructions
on-chip support for floating-point operations
increased pipeline depth
branch prediction

To the above list, we should add:

out-of-order execution (OOO)
on-chip support for SIMD operations

The first two items, additional registers and on-chip caches, seem to be very much in line with traditional RISC thinking. I'll therefore focus on a few of the other items in making my case.

As you can see from the above list, post-RISC architectures take many of the RISC features as their basis (simple instructions, a large number of registers, software support for HLLs, LOAD/STORE addressing) and either expand on them or add wholly new features. The reason most post-RISC architectures still get called "RISC" is because of those features that they still share (or seem to share) with the first RISC machines. It's interesting to note that the post-RISC architectures that get called CISC are so called only because of the ISA that's visible to the programmer; implementation is ignored almost entirely in the discussion.

Now let's look in more detail at the post-RISC features that were added to RISC foundations to produce today's CPUs.


    Superscalar execution

When the first RISC machines came out, Seymour Cray was the only one really doing superscalar execution. One could argue that since superscalar execution drastically lowers the average CPI, it's in keeping with the spirit of RISC. Indeed, superscalar execution seems to be such an essential and common component of all modern CPUs that it doesn't quite seem fair to single it out and call it "non-RISC." After all, whether RISC or CISC, modern processors use this technique. Now, reread that last sentence, because this is precisely the point. Superscalar execution is included in today's processors not because it's part of a design philosophy called "RISC," but because it enhances performance, and performance is all that matters. Concerning all of the items in the above feature list, the Michigan group notes,

    "Thus, the current generation of high performance processorsbear little resemblance to the processors which started theRISC revolution. Any instruction or feature which improves theoverall price/performance ratio is considered for inclusion."

Superscalar execution has added to the complexity of today's processors--especially the ones that use scoreboarding techniques and special algorithms to dynamically schedule parallel instruction execution on the fly (which is almost all of them but the Alpha). Recall the comment from Ditzel, which I quoted at the beginning of this paper, where he identifies superscalar execution as one of the complexities that are impeding performance gains in today's RISC machines.
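For a flavor of what that dynamic-scheduling hardware has to check, here is an idealized dual-issue test (the three-operand instruction format and register names are invented for illustration; real scoreboards track far more state than this):

    # Two instructions can issue together only if the second one neither
    # reads nor writes the register the first one writes (no RAW/WAW hazard).
    def can_dual_issue(i1, i2):
        # each instruction is (dest, src1, src2)
        return i1[0] not in i2

    print(can_dual_issue(("A", "B", "C"), ("D", "E", "F")))  # True: independent
    print(can_dual_issue(("A", "B", "C"), ("D", "A", "F")))  # False: needs A first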

    Branch prediction

Branch prediction is like superscalar execution in that it's one of those things that just wasn't around in '81. Branch prediction is a feature that adds complexity to the on-chip hardware, but it was included anyway because it has been shown to increase performance. Once again, what matters is performance and not principle.

    Additional instructions

Many would disagree that the addition of new instructions to an ISA is a "non-RISC" tendency. "They" insist that the number of instructions was never intended to be reduced, but rather it was only the individual instructions themselves which were to be reduced in cycle time and complexity. Invariably, the folks who protest this way are Mac users who know that the G3 has more instructions than the PII, yet they still want to insist that the G3 is a pure RISC chip (because RISC = Good) and the PII is a pure CISC chip (because CISC = Bad). The following quote from Patterson should put this to rest once and for all:

    "A new computer design philosophy evolved: Optimizingcompilers could be used to compile "normal" programminglanguages down to instructions that were as unencumbered asmicroinstructions in a large virtual address space, and to makethe instruction cycle time as fast as the technology wouldallow. These machines would have fewer instructionsareduced setand the remaining instructions would generallyexecute in one cyclereduced instructionshence the namereduced instruction set computers (RISCs)." [Patterson, RISCs,p. 11]

Current RISC architectures like the G3, MIPS, SPARC, etc., have what the Michigan group calls a FISC (Fast Instruction Set Computer) ISA. Any instructions, no matter how special-purpose and rarely used, are included if the cycle time can be kept down. Thus the number of instructions is not reduced in modern, post-RISC machines--only the cycle time.

    On-chip floating-point and vector processing units

In fact, with the addition of SIMD and floating-point execution units, sometimes the cycle time isn't even really "reduced." Not only do some SIMD and FP instructions take multiple cycles to complete, but neither the on-chip SIMD unit nor the on-chip FP unit was around in the first RISC machines. Like superscalar execution, this complex functionality was added not because it fit in with some overarching RISC design principles, but because it made the machine faster. And like superscalar execution, SIMD and FP units are now common on both "RISC" and "CISC" processors. Processors with these features, especially the SIMD unit, would be better termed "post-RISC" than "RISC."

I should also mention here that the addition of FP and especially SIMD units expands the instruction set greatly. One of the reasons the G4 has such a huge number of instructions is because the SIMD unit adds a whole raft of them.

    Out-of-order execution

OOO is one of the least RISC-like features that modern processors have; it directly contradicts the RISC philosophy of moving complexity from hardware to software. In a nutshell, a CPU with an OOO core uses hardware to profile and aggressively optimize code by rearranging instructions and executing them out of program order. This aggressive, on-the-fly optimization adds immense amounts of complexity to the architecture, increasing both pipeline depth and cycle time, and eating up transistor resources.

Not only does OOO add complexity to the CPU's hardware, but it simplifies the compiler's job. Instead of having the compiler reorder the code and check for dependencies, the CPU does it. This idea of shifting the burden of code optimization from the compiler to the hardware sounds like the exact opposite of an idea we've heard before. According to Ditzel, this is a step in the wrong direction [19].

    Conclusion

    The current technological context

Let's now briefly consider the current state of the three parameters that defined the technological matrix from which RISC arose, in light of the preceding discussion of post-RISC advances.

    Storage and Memory

Today's memory is fast and cheap; anyone who's installed a Microsoft program in recent times knows that many companies no longer consider code bloat an issue. Thus the concerns over code size that gave rise to CISC's large instruction sets have passed. Indeed, post-RISC CPUs have ever-growing instruction sets of unprecedented size and diversity, and no one thinks twice about the effect of this on memory usage. Memory usage has all but ceased to be a major issue in the design of an ISA; instead, memory is taken for granted.

    Compilers

Compiler research has come a long way in the past few years. In fact, it has come so far that next-generation architectures like Intel's IA-64 (which I discuss at http://www.arstechnica.com/cpu/1q99/ia-64-preview-1.html) depend wholly on the compiler to order instructions for maximum throughput; dynamic, OOO execution is absent from the Itanium. The next generation of architectures (IA-64, Transmeta, Sun's MAJC) will borrow a lot from VLIW designs. VLIW got a bad rap when it first came out, because compilers weren't up to the task of ferreting out dependencies and ordering instructions in packets for maximum ILP. Now, however, it has become feasible, so it's time for a fresh dose of the same medicine that RISC dished out almost 20 years ago: move complexity from hardware to software.

    VLSI

Transistor counts are extremely high, and they're getting even higher. The problem now is not how do we fit needed functionality on one piece of silicon, but what do we do with all these transistors. Stripping architectures down and throwing rarely-used functionality overboard is not a modern design strategy. In fact, designers are actively looking for things to integrate onto the die to make use of the wealth of transistor resources. They're asking not what they can throw out, but what they can include. Most of the post-RISC features are a direct result of the increase in transistor counts and the "throw it in if it increases performance" approach to design.

    The guilty parties

For most of the aforementioned post-RISC transgressions, the guilty parties include such "RISC" stalwarts as the MIPS, PPC, and UltraSPARC architectures. Just as importantly, the list also includes so-called "CISC" chips like AMD's Athlon and Intel's P6. To really illustrate the point, it would be necessary to take two example architectures and compare them to see just how similar they are. A comparison of the G4 and the Athlon would be most appropriate here, because both chips contain many of the same post-RISC features. The P6 and the Athlon are particularly interesting post-RISC processors, and they deserve to be treated in more detail. (This, however, is not the place to take an in-depth look at a modern CPU. I've written a technical article on the Athlon, at http://arstechnica.com/cpu/3q99/k7_theory/k7-one-1.html, that should serve to illustrate many of the points I've made here. I hope to start work soon on an in-depth look at the G4, comparing it throughout to the Athlon and P6.) Both the Athlon and the P6 run the CISC x86 ISA in what amounts to hardware emulation, but they translate the x86 instructions into smaller, RISC-like operations that are fed into a fully post-RISC core. Their cores have a number of RISC features (LOAD/STORE memory access, pipelined execution, reduced instructions, expanded register count via register renaming), to which are added all of the post-RISC features we've discussed. The Athlon muddies the waters even further in that it uses both direct execution and a microcode engine for instruction decoding. A crucial difference between the Athlon (and P6) and the G4 is that, as already noted, the Athlon must translate x86 instructions into smaller RISC ops.

In the end, I'm not calling the Athlon or P6 "RISC," but I'm not calling them "CISC" either. The same goes for the G3 and G4, in reverse. Indeed, in light of what we now know about the historical development of RISC and CISC, and the problems that each approach tried to solve, it should now be apparent that both terms are equally nonsensical when applied to the G3, G4, MIPS, P6, or K7. In today's technological climate, the problems are different, so the solutions are different. Current architectures are a hodge-podge of features that embody a variety of trends and design approaches, some RISC, some CISC, and some neither. In the post-RISC era, it no longer makes sense to divide the world into RISC and CISC camps. Whatever "RISC vs. CISC" debate once went on has long been over, and what must now follow is a more nuanced and far more interesting discussion that takes each platform--hardware and software, ISA and implementation--on its own merits.


The papers that I cite below are all available through the ACM's Digital Library (http://www.acm.org/dl/).

    Bibliography

[1] David A. Patterson and Carlo H. Séquin. "RISC I: A Reduced Instruction Set VLSI Computer." 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998, pp. 216-230.

[2] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1996, p. 9.

[3] David A. Patterson and Carlo H. Séquin. "Retrospective on RISC I." 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998, p. 25.

    [4] Hennessy and Patterson, p. 14

[5] David R. Ditzel and David A. Patterson. "Retrospective on HLLCA." 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998, p. 166.

[6] David A. Patterson. "Reduced Instruction Set Computers." Commun. ACM 28, 1 (Jan. 1985), p. 8.

It was the microcode engine, first developed by IBM in the 60s, that first enabled architects to abstract the instruction set from its actual implementation. With the IBM System/360, IBM introduced a standard instruction set architecture that would work across an entire line of machines; this was quite a novel idea at the time.

    [7] ibid.

    [10] From a lecture by Dr. Alexander Skavantzos of LSU in a1998 course on computer organization.

[11] Patterson, "Reduced Instruction Set Computers," p. 11.

    [12] ibid.

[13] Patterson and Séquin, "RISC I: A Reduced Instruction Set VLSI Computer," p. 217.


[14] ibid., p. 218.

[15] Patterson, "Reduced Instruction Set Computers," p. 11.

[16] I first came across this paper through a link on Paul Hsieh's page (http://www.azillionmonkeys.com/qed/cpuwar.html). There seem to be two different versions of this paper, one later than the other. Both contain slightly different (though not contradictory) material, so I'll be drawing on both of them in my discussion as I see fit.

[17] A question that often comes up in RISC vs. CISC discussions is: doesn't the term "RISC" properly apply only to an ISA, and not to a specific implementation of that ISA? Many will argue that it's inappropriate to call any machine a "RISC machine," because there are only RISC ISAs. I disagree. I think it should be apparent from the discussion we've had so far that RISC is about both ISA and implementation. Consider pipelining and direct execution--don't both of these fall under the heading of "implementation"? We could easily imagine a machine whose instruction set looks just like a RISC instruction set, but that isn't pipelined and uses microcode execution. Just as with price and performance, the line between ISA and implementation isn't always clear. What should be clear, though, is that the RISC design philosophy encompasses both realms.

    [18] Some have actually debated me on the above points,insisting that the number of instructions was never intended tobe reduced. See my discussion of post-RISC architectures for arebuttal.

    [19] See the comment by Ditzel that I opened with.
