A Perspective on the Future of Computer Architecture
TRANSCRIPT
11
Boris Babayan, Intel Fellow
October 2016
2
Agenda
• My background building real computers
• Challenges with today's superscalar computers
• Lessons and proposals for future computers
– Constrained designs: i.e., backwards compatible, with pragmatic compromises
– Lessons from the last several years at Intel
– Unconstrained designs: unlocking more performance potential
• Conclusions
3
My experience building real computers
• Carry Save Arithmetic
– In 1954 I developed "Carry Save Arithmetic" (for multiplication, division, and square root) as my student project, and presented it at a Russian conference in 1955
– This precedes the first Western publication of CSA, by M. Nadler in the Acta Technica journal (1956)
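The carry-save idea can be sketched in a few lines: three addends are compressed into two words (a carry-free sum and a shifted carry word) with no carry propagation, so a multiplier can keep compressing partial products and pay for only one slow carry-propagating add at the very end. This is an illustrative modern restatement, not the 1954 design:

```python
def carry_save_add(a, b, c):
    """Reduce three addends to two (sum, carry) with no carry propagation."""
    s = a ^ b ^ c                                 # bitwise sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1    # carries, shifted into place
    return s, carry                               # invariant: s + carry == a + b + c

def csa_reduce(partials):
    """Compress a list of partial products until two remain, then add once."""
    while len(partials) > 2:
        s, carry = carry_save_add(*partials[:3])
        partials = partials[3:] + [s, carry]
    return sum(partials)                          # single carry-propagating add
```

Each `carry_save_add` is constant-depth regardless of word width, which is why the final propagating add is the only step whose latency grows with operand size.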
• Chief architect of the Elbrus-1, Elbrus-2, and Elbrus-3 line of supercomputers
– My team built the Elbrus-line computers (1978–90), widely used in Russia, e.g., for the space program
– High-level programming language support was put in hardware (not just support of existing HLLs corrupted by outdated architecture) – still not implemented so far in other computers
– High-level language EL-76 for the Elbrus-line computers
– The Elbrus OS kernel had support for real high-level programming
• One of the first complete security solutions
– The Elbrus architecture, whose main goal is real HLL (EL-76) support, with the Elbrus OS kernel as a byproduct, fully solved the security problem, including the possibility to prove the correctness of user-level programs
4
My experience building real computers (continued)
• First industrial implementation of an out-of-order superscalar computer
– Elbrus-1 (implemented in 1978) was the first commercial implementation of an OoO superscalar in the world (a two-wide-issue computer)
– After the second generation of Elbrus computers in 1985, our team realized many weaknesses of the superscalar approach and started looking for a more robust solution to the parallel execution problem, which led us to VLIW
• Elbrus-3: a Very Long Instruction Word (VLIW) computer
– Successful implementation of a cluster-based VLIW architecture with fine-grained parallel execution (Elbrus-3, end of the 90s), probably the first in industry
• Hardware-assisted Binary Translation
– Proposed and first implemented Binary Translation (BT) technology for designing a new architecture built on radically new principles, yet binary compatible with the old ones (Elbrus-3, end of the 90s)
• Fine-grained parallel architecture
– Design and simulation of radically new principles of fine-grained parallel architecture, and extension of an HLL (like EL-76) and OS (like the Elbrus OS kernel) to support them
5
Challenges with today’s Superscalar Processors
6
Drawbacks of Superscalar Paradigm - 1
Drawbacks of the superscalar architecture:
– Program conversion is rather complicated (parallel -> sequential -> parallel)
– Superscalar architecture has a performance limit (regardless of available HW)
– Inability to properly use all available HW
– Even SMT mode cannot significantly improve efficiency (but decreases cache utilization efficiency instead)
– Rather complicated VECTOR HW and MULTI-THREAD programming have to be used to somehow compensate for this performance limit
– Today's high-level languages (HLLs) mirror the old and present-day architectures (linear data space, no explicit parallelism). As a result, the current architecture has corrupted all of today's HLLs
– The current organization of computations does not allow for good optimizations (it is necessary to have full information about both the algorithm to be executed and the hardware that will execute it)
– Non-universal architecture
7
Drawbacks of Superscalar Paradigm - 2
Memory and cache organization:
– The current architecture does not support object-oriented data memory
– This excludes the possibility of supporting truly secure computing and debugging facilities
– The cache organization of today's architecture hides its internal structure, preventing the compiler from doing good optimizations. This was done for compatibility with the simple linear memory organization of older computers
Superscalar architecture today is very close to an un-improvable state, including all the above-mentioned drawbacks
All the above-mentioned drawbacks have a single source: current architecture inherits, as its basic principles, the principles of ancient, early-days computing with its strong HW size constraints
8
Beginning of Computer Era (early 50s – mid 90s) - 1
Single execution unit era:
– The amount of available HW was the main constraint
– Single IP, single execution unit, linear memory of small size
– Performance was just the number of executed operations (fast memory vs. operation execution time)
– Binary programming was the most efficient method
– The programmer was responsible for all optimizations, as he knew both the algorithm and the available HW resources. HW was very simple at that time, so the programmer was able to fulfil this job very well
– The only reasonable HW improvement was to improve this single execution unit
9
Beginning of Computer Era (early 50s – mid 90s) - 2
• General results for the architecture of that period:
 This architecture was un-improvable under the corresponding constraints, because the main resource (the single execution unit) was itself un-improvable (carry-save and high-radix arithmetic), and every architecture had to include it
 This architecture was absolutely universal among programmable architectures, because any other architecture would have to include this single execution unit. No other architecture could work faster or use less HW. Using more HW (more execution units, for example) was not possible because of the main constraint on available HW
• Basic architecture decisions:
 Single Instruction Pointer ISA
 Simple linear memory organization
 No data type support in HW
 The input binary contains instructions on how to use resources, rather than a description of the algorithm
10
Superscalar Era (mid 90s – now) - 1
Constraints of the superscalar era:
– Significant progress in Si technology made more HW available (the HW constraint was removed) and execution faster, but memory remained slow
– Superscalar is still unable to use all the HW efficiently for a single job
– Implicit parallelization: the HW must convert a linear, single-IP execution flow into parallel form
– The original completion ordering has to be preserved, from parallel execution into consecutive retirement (compatibility with the preceding decisions)
– Simple linear memory organization, no support for data types
11
Superscalar Era (mid 90s – now) - 2
Outcome of this period:
 Sub-optimal functionality (semantics of data and operations)
– Without dynamic data type support in HW, it is impossible to implement real high-level programming and truly secure computing
 Sub-optimal performance
– The programmer doesn't know the details of the rather complicated HW and as a result is unable to fully control the optimizations made by the HW
– The compiler does not have all information about the algorithm being compiled (due to corrupted high-level languages); on the other side, the compiler is too far from the HW and is unable to fully utilize the HW and its internal structures (e.g., caches), which are hidden from the compiler
– Superscalar hardware is exposed via the ISA only (which inherits all the obsolete solutions); there is no way to provide the algorithm to this kind of HW, and all the HW machinery (BPU, renaming, cache organization, etc.) is designed to support compatibility, with limited performance improvement
12
New Post-Superscalar Architecture (what we call the "Best Possible" Computer System)
13
Algorithmically Oriented Post-Superscalar Era
Changing the angle of view:
– The algorithm of the program itself and its data dependencies are the real constraints on performance and power
– Move HW complexity into SW; free HW from code analysis and conversion to parallel form (closer to the algorithm representation)
– Move the design in the strongly opposite direction – from caring about resources to caring about algorithms
14
Constraints in Architecture are the Real Limiter
• Two designs will be considered:
 CONSTRAINED system
– New Architecture (NArch), constrained by compatibility with legacy binaries (x86, ARM, Power, etc.)
 UNCONSTRAINED system
– Advanced New Architecture (NArch+) without compatibility constraints (unconstrained), or more precisely – constrained only by the algorithm to be executed and by the HW resources of the processor
• All past designs have reached their constraints:
– Arithmetic, the early-days single-execution-unit architecture, superscalar, the functionality of high-level programming
• Therefore, to make the next step we should find some way to relax the constraints (for the first case of future architecture) or to remove them (for the second case)
15
Basic Approach for New Architecture Design
• Let's first design the best possible unconstrained architecture
• The constrained architecture is then just the unconstrained architecture limited by several mechanisms required for compatibility support
• So we will get the best possible unconstrained and constrained architectures!
• Three components must be fully investigated and designed to get the Best Possible Architecture:
 Language
 Compiler
 Hardware
16
New High-Level Programming Support
 The compiler should have full information about the algorithm being compiled
 The new programming language should be able to expose the details of the algorithm to the compiler and, eventually, to the HW
 The programmer should optimize only the algorithm, not its execution
 The new language should have the following main features:
– Ability to express the parallel, fine-grained structure of the algorithm in a perfectly clear manner, convenient for the programmer
– The right functionality (semantics) of its elements, including dynamic data types and capability support *)
– Ability to present exhaustive information about the algorithm
*) This feature was completely implemented in the EL-76 language used in several generations of Elbrus computers in Russia
17
Compiler
Role of the compiler:
– The compiler is responsible for all optimizations (not the HW)
– To do this it should be local to the model, which allows it to have all information about the model configuration
– It gets all information about the algorithm from the program text after a simple transformation into an intermediate "distributive", which is then compiled to different computer models
– No information is lost during compilation (full algorithm representation)
– The compiler can use some dynamic information from execution to be able to tune optimizations dynamically
 The structure of the HW elements should be appropriate for good optimizations controlled by the compiler
 A model-local compiler removes compatibility requirements from the HW, as the HW can be changed more freely if needed to satisfy some requirements (e.g., performance, power, market segments, etc.)
18
Process of Compilation
 The first-level compiler generates a distributive without any optimizations (a simple transformation from source code to a data-flow graph without information losses)
 The optimizing "real" compiler (distributive compiler, or D-compiler) is model dependent and generates optimized application code from the app distributive (using dynamic feedback for tuning)
[Diagram: application source code goes through first-level compilation (transformation) into an app distributive; on each machine, an optimizing D-compiler plus a system layer produce code for that HW model (HW model 1, HW model 2, …)]
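The two-stage scheme can be sketched as follows. This is a toy model under loose assumptions: `make_distributive` stands in for the first-level compiler (a lossless dependency graph, no optimizations), and `d_compile` for a model-dependent D-compiler that schedules the same distributive differently depending on the machine's issue width. The names and the greedy list-scheduling are illustrative, not from the talk:

```python
def make_distributive(ops):
    """First-level compilation: record ops and their data dependencies only."""
    return {name: set(deps) for name, deps in ops}

def d_compile(distributive, n_units):
    """Model-dependent stage: greedily schedule the (acyclic) graph onto a
    machine that can issue n_units operations per cycle."""
    done, schedule = set(), []
    pending = dict(distributive)
    while pending:
        ready = [op for op, deps in pending.items() if deps <= done]
        cycle = sorted(ready)[:n_units]   # machine width limits issue
        schedule.append(cycle)
        done |= set(cycle)
        for op in cycle:
            del pending[op]
    return schedule
```

The point of the sketch: the distributive is built once and carries the full dependency information, while each target model gets its own schedule from the same input.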
19
Requirements for New Architecture Hardware
• Hardware should not do any optimizations (e.g., BPU, prefetching), as it doesn't have any information about the algorithm being executed
• Release hardware from the necessity to analyze binaries and extract parallelism
• Hardware should only allocate resources according to compiler instructions
• Hardware should avoid "artificial binding" such as a single instruction pointer, vectors, cache lines, full virtual pages, etc.
• Hardware should give the compiler the possibility to change the HW configuration for better optimizations ("Lego Set" HW)
• Hardware should use object-oriented memory (as in the Elbrus computers)
20
NArch Architecture (constrained, compatible case)
• The semantics of legacy binaries cannot be changed, due to compatibility requirements
• The only possible relaxation is to change how these semantics get presented to the HW, in explicitly parallel form, for execution
• Release hardware from the necessity to analyze binaries and extract parallelism
• Let the software layer be responsible for finding available parallelism and for optimizations (via Binary Translation technology)
• Let the HW be responsible for optimal scheduling only (remove unneeded complexity from hardware and make it simpler) – as in the unconstrained case
• Binary Translation actually allows using all the mechanisms of the unconstrained architecture, with the addition of:
o Memory ordering rules and retirement
o Checkpoints for target context reconstruction and event processing
o A memory renaming technique for resolving memory conflicts in binaries, via a bigger register file and a special guard HW structure
• Unfortunately, for semantics compatibility reasons the constrained architecture cannot support security and aggressive procedure-level parallelization
21
Functionality (Semantics) of Basic Elements
22
Method of New Functionality Design
 In the constrained architecture, the functionality (semantics) of all its elements (data and operations) is strictly determined by compatibility requirements
 So let's first consider the unconstrained computer system and its elements, which were developed in accordance with the approach described above
 Note: all technologies and mechanisms are appropriate for both the constrained and the unconstrained systems
23
Primitive Data Types & Operations
 Primitive data types (HW keeps the type together with the value):
– Potential infinity (integer)
– Potential continuity (floating point)
– Predicates
– Enumerable types (e.g., character)
– Uninitialized data
– Data Descriptor and Functional Descriptor ("auxiliary" data types for technical operations)
 Primitive data types are dynamic data types
– The value is kept together with a tag
 Type safety approach
– All primitive operations check the types of their arguments
24
User Defined Data Types (Objects)
The "natural" requirements for the new architecture to support language-level functionality, consistent with the "abstract algorithm" idea:
1. Every procedure can generate a new data object and receive a reference to this new object
2. This procedure, using the received reference, can do everything possible with this new object (read data from the object, update its content, execute the object as a program, and delete the object)
3. No other procedure can access this object just after it has been generated, but this procedure can give a reference to the object, with all or a subset of the rights listed above, to anybody it knows (has a reference to)
4. Any procedure can generate a copy of a reference to any object it is aware of, with decreased rights
5. After the object has been deleted, nobody can access it (all existing references become invalid)
Creating data with an orientation on objects is an important step toward structuring data according to the semantics of the source algorithm
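Points 1–5 can be modeled in a few lines of software. This is only a sketch of the semantics (the `Obj`/`Ref` names and the `PermissionError` signaling are my own); in the described architecture these checks are done by hardware on tagged descriptors:

```python
class Obj:
    """A heap object; 'live' is cleared on deletion (point 5)."""
    def __init__(self):
        self.live, self.data = True, {}

class Ref:
    """A reference to an object, carrying an explicit set of rights."""
    def __init__(self, obj, rights):
        self.obj, self.rights = obj, frozenset(rights)
    def _check(self, right):
        if not self.obj.live:
            raise PermissionError("object deleted; reference is invalid")
        if right not in self.rights:
            raise PermissionError(f"reference lacks the '{right}' right")
    def read(self, key):
        self._check("read")
        return self.obj.data[key]
    def write(self, key, value):
        self._check("write")
        self.obj.data[key] = value
    def restrict(self, rights):
        # point 4: a copied reference may only decrease rights, never add
        return Ref(self.obj, self.rights & frozenset(rights))
    def delete(self):
        self._check("delete")
        self.obj.live = False  # every outstanding reference is now invalid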
25
Dangling Pointers and Memory Compaction
 To solve the dangling pointer problem (point 5) we must guarantee that after an object has been deleted, no one can access the memory it occupied
 The de-allocation procedure frees the physical memory, but not the virtual memory. So physical memory can be reused, but the virtual memory remains allocated
 The well-known classical solution is a garbage-collection algorithm, but it is inefficient as a solution to the dangling pointer problem
 When virtual memory gets close to its limit, the system starts compacting the virtual memory
 The compaction algorithm*):
– Each Data Descriptor is tagged, i.e., there is a special bit in registers and in memory which marks Data Descriptors
– The system identifies which Data Descriptors are useless (point to objects de-allocated in physical memory) and replaces them with uninitialized data, or just re-directs them to a non-existent memory page, thus releasing the virtual pages the descriptor had pointed to (according to the size of the object)
– The rest of the objects are moved to the vacant virtual memory, and each Data Descriptor's base address is replaced by the new virtual address
 This compaction can run as a background process
*) Note: this compaction algorithm was implemented in the Elbrus-1 and Elbrus-2 computers; it can be modified to make it more efficient
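The core of the pass can be sketched abstractly: slide the live objects down in virtual space, and rewrite every tagged descriptor either to its object's new base or to "uninitialized" if the object was de-allocated. A simplified single-pass model (descriptor tables and object maps here are plain dicts, and `None` stands for uninitialized data; the real algorithm scans tagged words in registers and memory):

```python
def compact(descriptors, objects):
    """descriptors: name -> virtual base; objects: base -> (size, live?).
    Returns rewritten descriptors and the compacted object map."""
    next_base, relocation, new_objects = 0, {}, {}
    for base in sorted(objects):          # slide live objects down in order
        size, live = objects[base]
        if live:
            relocation[base] = next_base
            new_objects[next_base] = (size, True)
            next_base += size
    new_descriptors = {}
    for name, base in descriptors.items():
        # dangling descriptor -> uninitialized; live -> new base address
        new_descriptors[name] = relocation.get(base)
    return new_descriptors, new_objects
```

Because every descriptor is identifiable by its tag bit, the pass never has to guess whether a word is a pointer, which is what makes running it as a background process safe.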
26
Procedures
 A procedure is the fundamental notion of HLLs. Every procedure has a reference to its code and context
 The procedure context consists of the code, global data, parameters/return data, and its local data
 A procedure can be called via a Functional Descriptor only (a tagged value)
[Diagram: a tagged Functional Descriptor contains the entry point address and a reference to the global context; the procedure's context comprises its global data, procedure code, and local data]
1. A procedure can create a Functional Descriptor (FD) with a special instruction, providing an entry point address and a Data Descriptor to some context as arguments; i.e., any procedure can define another procedure
2. The procedure that generated this FD can give it to anybody it has access to, and the new owner can also call the new procedure via the FD
3. A procedure that generates an FD includes references to the code and global data in this FD
4. A procedure that received the FD of the new procedure can call it and pass it some parameters (atomically)
5. The caller can receive some return data as a result of procedure execution. Data return is logically an atomic action
6. The called procedure can't use anything beyond the context provided to it via the Functional Descriptor and the parameters
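In software terms an FD behaves like a tagged closure: code plus a bound context, callable only through a checked call path. The sketch below is an analogy under that assumption, not the hardware format:

```python
class FD:
    """A tagged Functional Descriptor: entry point plus global context."""
    tag = "FD"
    def __init__(self, code, context):
        self.code, self.context = code, context

def call(fd, *params):
    """Calls go through an FD only; the callee sees nothing beyond the
    context bound into the FD and the parameters it was passed."""
    if getattr(fd, "tag", None) != "FD":
        raise TypeError("call target is not a Functional Descriptor")
    return fd.code(fd.context, *params)

# defining a counter procedure over a private context (point 1)
ctx = {"count": 0}
def bump(context, step):
    context["count"] += step
    return context["count"]
counter = FD(bump, ctx)
```

Calling `bump` directly (a raw code address, no FD tag) is rejected, which is the software analogue of the hardware refusing to transfer control through anything but a tagged FD.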
27
Capability Mechanism
 Only a system that provides type safety allows the correct implementation of the procedure mechanism. A procedure can be called via a Functional Descriptor only
 A procedure has access to its own context only. No other procedure can access this procedure's context unless it has been passed to that other procedure as a parameter
 This approach introduces very strong inter-procedure protection
 A Data Descriptor (DD) or Functional Descriptor (FD) is a capability, for the procedure that has the DD or FD in its context, to do something:
– A DD is a capability to access some object
– An FD is a capability to do something – to execute some procedure, which can modify global data in the called procedure: data that is not directly accessible by the caller
 Some operations that must work with the bit-level representations of special data types like DD and FD (the COMPACTION algorithm is a good example) sometimes need operation support in HW. All of these are also primitive operations; however, only a limited number of procedures should be able to use them
28
Full Solution of Security Problem
 The described approach does not need a privileged mode for system programming
– E.g., in Elbrus, all programs, including the OS, are written as "application" programs
 The capability approach is more powerful and more general than the privileged-mode approach (consistently implemented in Elbrus; no C-list, which is wrong)
 However, even this architecture cannot protect against mistakes in user programs. Probably the only possible remedy in this case is the possibility to prove the correctness of user and kernel programs
– A formal proof of functional correctness was done for the seL4 microkernel in 2009 by the NICTA group (National ICT Australia)
 Even in this case, only the suggested architecture can help to considerably simplify the proof of program correctness (for both kernel and applications)
29
Implementation of the Described Functionality
30
Object Oriented Memory (OOM) Structure
 Object-oriented memory was initially introduced in the Burroughs B5500 computer architecture, but was not implemented correctly
 All the basic principles were first carefully designed in Elbrus-1 (1972–78)
 Present-day memory and cache systems are corrupted by compatibility with the linear structure of old computers. This means that a future system should not use the traditional memory and cache organization, which excludes the compiler from applying efficient optimizations
 The OOM structure, even for the constrained architecture, can (according to preliminary estimations) decrease cache sizes by 2–3 times and nearly eliminate performance losses due to cache misses
 Object-oriented physical memory approach:
– The size of the physical memory allocated for an object is equal to the object size
– Each allocated object is also mapped into the virtual space with pages of fixed size
– Each new object in virtual space is allocated contiguously, starting from a new page (if the size of the object is smaller than the page size, the end of this page's virtual space is empty)
[Diagram: objects N and M occupy exactly their size in physical memory, while in virtual memory each starts on a fresh page, leaving the tail of the last page EMPTY]
31
Object Oriented Memory: Objects Naming Rules
 OOM uses virtual numbers of the objects instead of virtual memory addresses
 Virtual page numbers are allocated sequentially during each object generation
 There is a system register which keeps the next free object number, to be used for the next object generated
 We will sometimes use the expression "virtual address" when we mean "virtual number"
[Diagram: a virtual address consists of the object's virtual number N and an index; number N selects the object's virtual pages N(1), N(2), …, with the tail of the last page EMPTY; a system register holds the next object number, N+1]
32
Allocation of Objects and Sub-objects in Caches
 Unlike the TLB used in contemporary computers, the TLB in this OOM architecture translates a virtual address not into a physical memory address, but directly into the physical location, in some specific cache, where this piece of data resides
 In each specific cache, as well as in memory, the new architecture does not use cache lines (as superscalar does)
 The object's parts allocated at the cache levels are split into smaller parts, and all these parts belong to the same virtual page
 Each cache level could have its own small TLB
33
Generation of an Object
 A special HW instruction is used to generate an object (no SW library calls such as malloc, no OS system calls)
 The list of all occupied spaces is contained in the TLB, and the system maintains special lists for all free spaces. Each free-list maintains the free areas of a certain set of sizes (most likely powers of 2)
 For physical address allocation, the HW takes a physical address from one of the free-lists (the first free chunk from the corresponding list, found via a special HW register)
 The result of the instruction's execution is the corresponding Data Descriptor
[Diagram: the GENOBJ instruction takes an object type and object size, pops a chunk from the free-lists (size classes in powers of 2), and produces a Data Descriptor. Note: links are located inside the free memory chunks]
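A software sketch of the GENOBJ path, under the power-of-two free-list assumption stated above: round the requested size up to a size class, pop the first chunk from the matching list, and hand back a descriptor. Class and field names are illustrative:

```python
import math

class FreeLists:
    """Power-of-two free lists; GENOBJ pops the first chunk of the right size."""
    def __init__(self, chunks):
        self.lists = {}                  # size class -> list of base addresses
        for addr, size in chunks:
            self.lists.setdefault(size, []).append(addr)

    def genobj(self, obj_size):
        """Model of the GENOBJ instruction: returns a Data Descriptor."""
        size_class = 1 << max(0, math.ceil(math.log2(obj_size)))
        free = self.lists.get(size_class, [])
        if not free:
            raise MemoryError(f"no free chunk of size {size_class}")
        base = free.pop(0)               # first chunk from the matching list
        return {"tag": "DD", "base": base, "size": obj_size}
```

The descriptor records the exact object size, not the size class, so bounds checks remain tight even when the underlying chunk is rounded up.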
34
The Compiler Controls OOM Usage
 This memory/cache system organization allows the compiler to have strong control over the execution process
 The compiler is aware of all the program's semantic information and can perform more sophisticated optimizations
 The compiler can preload the needed data into a high-level cache, at first without assigning the more precious register memory, and move these data from cache to registers only at the last moment. But now even preloading directly into the registers could sometimes be a good alternative – now we have a big register file
 This cache organization allows accessing the first-level cache directly from an instruction by physical address, without using a virtual address and associative search
 To do this, the base register (BR) can support a special mode in which it holds pointers to the physical location in the first-level cache together with the virtual address
35
Explicitly Parallel Instruction Execution in NArch+
 In the NArch+ architecture all mutually independent executable objects can be executed in parallel with each other. This includes:
– Operations
– Chains of dependent operations inside scalar code and/or iterations of loop code
– Procedures
– Jobs
 NArch+ overcomes the difficulties and constraints of the Data Flow and single-IP approaches and excludes any "artificial binding" in HW (the program is a parallel graph)
 Two different approaches have been investigated in NArch+ for executing the program data graph: strands and streams (see next slides)
36
STRANDs Oriented Architecture
• Strands express parallelism via chains of (mainly) data-dependent operations (in a more natural way than, e.g., VLIW) and provide a new opportunity for presenting parallelism to OoO HW
• Simple instruction scheduling for parallel execution
– Need to look only at the oldest instruction in each strand (a much smaller and simpler RS)
• Strands also provide:
– A bigger effective instruction window
– Reduced register usage (via intra-strand accumulators)
– Wider instruction issue width (via clustering with register-to-register communication)
• Adding the ability to express parallelism in the uISA gives additional advantages, e.g., superior control over speculation and power, better HW utilization, and many more opportunities for optimizations and for resolving the memory latency issue
[Diagram: the original data graph is cut into strands, each with its own IP (IP1, IP2, IP3); a HW scheduler feeds them to parallel execution HW with a register file, possibly as two clusters (EXEC + RF each) joined by an interconnect]
37
Drawbacks of the STRANDs Architecture
 Strands are extracted from the program data graph by the compiler
 Each strand is executed by the HW in order, but out of order relative to other strands
 The HW allocates a set of resources for each active strand (called a WAY)
 The compiler creates a strand via a special FORK operation, which takes a free WAY for the strand's execution
 BUT the compiler has to be aware of the number of WAYs available in HW and schedule strands accordingly. Otherwise there could be a deadlock (e.g., no free WAY to spawn new strands, while the other strands are waiting for some result from this new strand)
 Having the strand (WAY) as a resource for the compiler potentially limits parallelism
[Diagram: FORK A and FORK B spawn strands A and B onto Way 0, Way 1, Way 2]
DL/CL Mechanism for Register/Predicate Reuse
 Definition Line (DL):
– A Definition Line L is a group of DL-instructions in different streams which form an explicit DL-front dividing the streams into intervals
– A DL-front crosses all live streams according to a possible timing analysis. Fronts are successive – they do not cross each other
 Check Line (CL):
– A Check Line (CL) is a group of CL-instructions suspending the execution of some streams until the specified DL-front has completely passed
– After that, the corresponding register/predicate resource can be safely reused
[Diagram: streams of instructions A–S laid out over time, crossed by successive DL-fronts (+DL); a CL-2 instruction suspends a stream until DL-front 2 has completely passed]
39
Intelligent Branch Processing
– Conventional: predict one branch path, discard everything when wrong
– New Architecture: speculate only when necessary, discard only the misspeculated work
– Increases performance
– Reduces energy wasted due to misspeculation
– According to our statistics, 80% of branches are not critical and can be executed without speculation
40
STREAMs Oriented Architecture: Streams and How They Get Created
• First, let's describe the simplest case, when the algorithm to be executed is scalar by its nature (an acyclic data-dependency graph) without conditional branches
• Let the total number of operations equal the number of available registers (single assignment, no register reuse)
• For this simple case:
– No decoding stage (each instruction is ready to be loaded into the corresponding execution unit; the compiler prepares the code)
– For each instruction in the graph, the compiler calculates a "Priority Value Number" (PVN). This number is the number of clocks from this instruction to the end of the graph along the longest path. The compiler will present the code as a number of sequences of dependent instructions – "streams"
– As the first instruction of a new stream, the compiler takes the instruction with the highest PVN not yet included in any other stream. For each next instruction in this stream, the compiler again selects the instruction with the highest PVN among those data dependent on the previous instruction. And so on, until the stream reaches either the end of the scalar code or runs into some other stream
[Diagram: a data-dependency graph decomposed into Stream 1, Stream 2, and Stream 3]
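The PVN computation and the greedy stream extraction described above can be written down directly. A sketch under the stated assumptions (acyclic graph, unit or known latencies; ties broken arbitrarily):

```python
def build_streams(graph, latency):
    """graph: op -> set of data-dependent successors; latency: op -> clocks.
    PVN(op) = longest-path clock count from op to the end of the graph."""
    pvn = {}
    def compute(op):
        if op not in pvn:
            succs = graph[op]
            pvn[op] = latency[op] + (max(compute(s) for s in succs) if succs else 0)
        return pvn[op]
    for op in graph:
        compute(op)

    assigned, streams = set(), []
    while len(assigned) < len(graph):
        # seed: the highest-PVN op not yet included in any stream
        op = max((o for o in graph if o not in assigned), key=lambda o: pvn[o])
        stream = []
        while op is not None and op not in assigned:
            stream.append(op)
            assigned.add(op)
            succs = list(graph[op])
            # follow the highest-PVN dependent op; stop on running into
            # an op already claimed by another stream
            op = max(succs, key=lambda s: pvn[s]) if succs else None
        streams.append(stream)
    return streams, pvn
```

The first stream traced this way is the critical path of the graph, which is why seeding by highest PVN matters: later streams are, by construction, off the critical path.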
41
Scalar Code Execution With STREAMs: Execution Engine (Workers)
 Register file:
– Each register has an EMPTY/FULL bit (EMPTY prevents reading the register when the value is not ready yet, and FULL prevents writing to the register when not all dependent instructions have consumed the value)
– Each register has an additional bit showing whether the operation generating the value for this register has already been sent to an execution unit (EU) or is in the Reservation Station (RS)
 The main scheduling and execution mechanisms for streams are "workers" (16 per cluster)
 How the workers work:
– Workers issue ready instructions to the RS/execution units (the arguments are FULL, or the predecessors are in the RS/EU)
– Each register has a list of streams waiting for the result in this register
– If a waiting stream becomes ready for execution (the value is ready), it is moved to the "waiting for a free worker" queue
– A free worker takes an instruction from the "waiting for workers" queue or from the Instruction Buffer
– If an argument of the next instruction in the stream is not ready yet, the worker stops executing this stream and puts it into the waiting queue for this argument (register)
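The worker discipline can be modeled in miniature. This is a deliberately simplified sketch: single assignment (so the FULL-bit write check is omitted), one worker active at a time, and instructions as `(dst, srcs, fn)` tuples; the real design has 16 workers per cluster and an RS:

```python
class Register:
    """A register with a FULL bit and a list of streams parked on it."""
    def __init__(self):
        self.full, self.value, self.waiters = False, None, []

def execute(streams, n_regs):
    """streams: per-stream lists of (dst_reg, src_regs, fn) instructions."""
    regs = [Register() for _ in range(n_regs)]
    ready = list(range(len(streams)))     # stream ids waiting for a worker
    pos = [0] * len(streams)              # next instruction in each stream
    while ready:
        sid = ready.pop(0)                # a free worker picks up a stream
        while pos[sid] < len(streams[sid]):
            dst, srcs, fn = streams[sid][pos[sid]]
            empty = next((r for r in srcs if not regs[r].full), None)
            if empty is not None:         # argument EMPTY: park the stream
                regs[empty].waiters.append(sid)
                break
            regs[dst].value = fn(*(regs[r].value for r in srcs))
            regs[dst].full = True         # result ready: set the FULL bit
            ready.extend(regs[dst].waiters)   # wake streams parked here
            regs[dst].waiters = []
            pos[sid] += 1
    return [r.value for r in regs]
```

A parked stream costs nothing until its register fills; the waiter list on each register is what turns the register file itself into the wake-up mechanism.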
42
NArch+: Scalar Code Execution – More Complex Case (Bigger Code)
 If the scalar code is big enough, the DL/CL technique is applied for register reuse, to guarantee correct dynamic execution of streams and optimal utilization of the Instruction Buffer
 When the code before CL(N) has been executed, it is necessary to preload the next part of the code, between CL(N) and CL(N+1). Similarly, when DL(N) is crossed, the whole code area above it can be freed
 The size of the code between CL(N) and CL(N+1) is no bigger than the size of the register file
 Execution time can be improved with the help of the Dynamic Feedback mechanism (both in HW and SW)
 If there are conditional branches in the code, the compiler uses speculative streams to handle these cases efficiently (predicated streams, and a GATE instruction to check the predicate value and kill one of the streams in case of wrong speculation)
 More details on speculation techniques (e.g., load/store speculation, efficient branch handling without branch prediction) would require more low-level micro-architecture details. Alas!
 This scalar technology is nearly the same for the constrained and unconstrained versions of the architecture
 This scalar code execution technique is a practical implementation of the Data Flow architecture
43
Summary: Strands vs. Streams
 Strands
– The mechanism of strand execution (one WAY per strand) is visible to the compiler, so the compiler has to track how many strands are to be executed by the HW at each moment, and the number is limited by the number of WAYs
– Cons: can lead to deadlock; limits parallelism due to explicit resource (WAY) scheduling by the compiler
[Diagram: original program graph cut into strands, dispatched by a HW scheduler into a fixed set of WAYs feeding parallel HW (EXEC + RF)]
 Streams
– The compiler can create any number of streams; the mechanism of stream execution is not visible to the compiler
– Pro: no deadlock; the HW executes the original graph – a natural data-flow execution mechanism
[Diagram: original program graph dispatched by a HW scheduler through workers and an RS into parallel HW (EXEC + RF)]
44
NArch+: Code with Loops
 Use loop iteration parallelism (both intra-iteration and inter-iteration) as fully as possible
 Loop iteration analysis performed by the compiler:
– Find instructions which are self-dependent across iterations
– Find the groups of instructions which, besides being self-dependent, are also mutually dependent across iterations ("rings" of data dependency)
– The rest of the instructions form sequences or graphs of dependent instructions (a number of "rows")
– The result of each row is either an output of the iteration (a STORE, for example), or is used by other row(s) or ring(s)
 Each "ring" and/or "row" loop produces data consumed by other small loops. Each producer can have a number of consumers. However, producer and consumer should be connected through a buffer, giving the producer the possibility to run ahead if the consumer is not yet ready to use the data
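The ring/row split is, in graph terms, a strongly-connected-component decomposition of the cross-iteration dependence graph: rings are the non-trivial (or self-looping) components, rows are the rest. A Kosaraju-style sketch, assuming `deps` maps each instruction to the set of instructions it depends on across iterations (the representation is mine, not from the talk):

```python
def rings_and_rows(deps):
    """deps: instr -> set of instrs it depends on across iterations.
    Rings = strongly connected groups (incl. self-dependent instrs);
    rows = the remaining acyclic instructions."""
    succ = {n: set() for n in deps}
    for n, ds in deps.items():
        for d in ds:
            succ[d].add(n)               # forward edges: producer -> consumer
    order, seen = [], set()
    def dfs(n, graph, out):
        seen.add(n)
        for m in graph[n]:
            if m not in seen:
                dfs(m, graph, out)
        out.append(n)
    for n in deps:                       # first pass: finishing order on succ
        if n not in seen:
            dfs(n, succ, order)
    seen.clear()
    rings, rows = [], []
    for n in reversed(order):            # second pass on the reversed graph
        if n not in seen:
            comp = []
            dfs(n, deps, comp)
            if len(comp) > 1 or n in deps[n]:
                rings.append(sorted(comp))   # cyclic: must run as a unit
            else:
                rows.extend(comp)            # acyclic: freely pipelinable
    return rings, rows
```

Rings bound the loop's recurrence-limited throughput, while rows can be pipelined across iterations, which is exactly why the compiler wants them separated before assigning producer/consumer buffers.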
45
Loops Handling in NArch+
Differences between NArch and NArch+ in loop implementation:
– NArch+ does not need to maintain compatibility with the single-IP approach; therefore, many different loops can be executed together (even a “single” loop can be executed out of order)
– NArch+ has a simple memory system without speculative buffers; therefore, in some cases (speculation only) other mechanisms and new HW support are needed

Types of loops handled by NArch+:
– RECURRENT loop (including WHILE loop)
– DO ALL (trip count known before the loop starts)
– DO ALL (trip count becomes known only during loop execution)
– Loop with a low-probability “maybe” dependence between iterations (through memory), including WHILE loops
– Loop with “maybe” data dependence within iterations
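To make the taxonomy concrete, here are minimal Python shapes of these loop classes. These are my own illustrations, not examples from the talk:

```python
# RECURRENT: each iteration depends on the previous one (a "ring").
def recurrent(a):
    s = 0
    for x in a:
        s = s * 2 + x
    return s

# DO ALL, trip count known up front: iterations are fully independent.
def do_all_known(a):
    return [x * x for x in a]

# DO ALL, trip count discovered only during execution (WHILE-style exit).
def do_all_unknown(a):
    out = []
    for x in a:
        if x < 0:              # exit condition found only at run time
            break
        out.append(x * x)
    return out

# "Maybe" dependence through memory: b[idx[i]] may alias b[i], so
# iterations can run in parallel only if the aliasing is checked.
def maybe_dep(b, idx):
    for i in range(len(idx)):
        b[idx[i]] += b[i]
    return b
```

Only the DO ALL forms are trivially parallel; the others need either the ring machinery or an aliasing check.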
46
Parallel Procedure Execution

• For the constrained architecture, a procedure can execute on a varying number of clusters, but no more than four.
• The compiler will try to inline as many called procedures as possible, so that the resulting procedure-level parallelism can be exploited in full.
• As usual for the constrained case, the caller waits for the end of the called procedure and works with the same resources.
• A call, like a return, is a logically atomic step; however, to increase performance, the DL/CL technology provides prolog and epilog regions where caller and callee work together without interfering with each other.
• In the unconstrained architecture, the new HLL allows parallel procedure execution, but again each procedure will use no more than four clusters.
• If a procedure has a DO ALL loop inside, that loop can use all available HW (many, up to all clusters on the chip; ~60 today).
47
All Basic Parts of Computer Technology and Their Current Status
48
NArch/IA Architecture (IA-compatible case study)

NArch/IA is an x86-compatible new micro-architecture based on the strands approach
– NArch strand: a sequence of (usually dependent, possibly including control flow) operations with its own IP; strands are executed out of order, in parallel
– BT parses IA binaries, extracts strands, and provides them to HW for scheduling and execution
– Multiple strands allow overlapping of memory accesses (thus hiding memory latency)

A fairly wide CPU due to scalable clustering
– One or two bi-clusters (up to 4 clusters and 24-instruction issue width; 16 strands per cluster)
– Clusters are tightly coupled (register-to-register communication and synchronization)

Very large sparse instruction window
– Much larger than in a conventional superscalar (~1K instructions)
– Branch resolution in the large window (no HW branch predictor)
– Memory disambiguation in the large window
– Smart retirement in the large window (no retirement for registers)

Binary Translation for IA compatibility and for enabling the NArch uarch
– Dynamic and static BT for maximum ST/MT performance and efficiency

Highly parameterized architecture (scalability)
– Variable number of clusters / strands per cluster
– Dynamically reconfigurable machine (ST/MT)

The result is higher performance and lower power at the same time
49
Advantages of the New Architecture: Compatible (constrained) case

• This approach can ensure full compatibility with some existing binaries (ARM, x86, POWER, RISC-V, etc.), or even with all of them on the same HW, with the help of Binary Translation
• Preliminary investigations allow us to make the following fairly reliable predictions:
– A compatible version (NArch) can reach the best possible, un-improvable performance, restricted only by the binary's semantic constraints (not by its sequential presentation) and by the amount of resources available in a specific model
– ~3x-4x ST performance @ unconstrained power vs. an OoO core
– ~2x ST performance @ iso-power
– Less than ~50% of the power @ iso-performance
– ~2x MT performance @ iso-power vs. an OoO core
50
Advantages of the New Architecture: Incompatible (unconstrained) case

• If we free the HW architecture from the requirement to maintain compatibility with old-style programming, then we can:
– Significantly simplify the architecture (e.g. 70-75% of the constrained architecture is the burden of maintaining compatibility with superscalar)
– Introduce explicit parallelism in programming languages to expose the algorithm structure to HW more easily
– Introduce security in HW (tagged architecture) and, eventually, get rid of viruses and make programming safe and reliable
– Get rid of the obsolete cache memory hierarchy (object-oriented memory)
– Eventually, increase performance significantly (up to 5x-7x or even more)
– Improve scalability and universality (new distributive, HW-model-oriented compiler)
– Build an absolutely un-improvable computer architecture
• As a result of the high universality of this architecture, we can hope that all special applications, like machine learning, computer vision, and graphics, will now be supported well with high performance
51
T H A N K Y O U !
Q & A
52
Intel Labs Joint Pathfinding
Backup Slides
53
Each TLB entry, besides helping to translate a virtual address into a physical data location, can also include some documentation of the referenced object: its size, its user data type (Object Type Name, OTN), and perhaps some other information
It also includes references to more detailed tables of the physical locations of all elements of this object in the cache(s)
Not every object has to be present in memory; some objects can be generated, for example, in the DCU (Level 1 cache) only
[Diagram: a TLB entry holds access rights, object number, object size, object type, sub-object information, and physical location; it forms a data descriptor for the object. The object number selects the TLB entry, and adding the index locates the object or sub-object in the DCU, MLC, LLC, or physical memory]
Object Oriented Memory: TLB Structure
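A minimal sketch of such a TLB entry in Python, using the field names from the slide; the exact layout, and the idea that translation performs a bounds check against the object size, are my assumptions:

```python
# Hypothetical model of an object-oriented TLB entry (field names from the
# slide; layout and bounds-check behavior are assumptions, not the real HW).
from dataclasses import dataclass, field

@dataclass
class TLBEntry:
    object_number: int      # which object this entry describes
    access_rights: str      # e.g. "r", "rw"
    size: int               # object size, enabling HW bounds checks
    object_type_name: str   # OTN: user-level data type of the object
    physical_location: int  # base physical address (if memory-resident)
    sub_objects: dict = field(default_factory=dict)  # element -> cache location

def translate(entry, index):
    """Translate (object, index) to a physical address, checking bounds."""
    if not (0 <= index < entry.size):
        raise IndexError("access outside object bounds")
    return entry.physical_location + index

e = TLBEntry(7, "rw", 16, "Matrix", 0x1000)
addr = translate(e, 3)      # 0x1000 + 3
```

Because the entry knows the object's size and type, every access can be checked, which is what makes the security properties claimed earlier enforceable in HW.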
54
Advantages of the Object Oriented Memory System

Unlike superscalar, the OOM memory/cache system is visible to the compiler: no uncontrollable physical pages, lines, or cache structure hidden from the compiler. This helps significantly to improve efficiency
The explicit object-oriented structure helps increase the efficiency of memory usage. All free memory is explicitly visible to the compiler and HW
The ability to access the first-level cache using physical addresses directly from instructions promises a huge increase in efficiency
Inexpensive memory allocation (without OS and library calls) also increases efficiency and makes the Operating System simpler to design
The eviction process is explicitly controlled by the compiler
The compiler has full knowledge of the cache structure, can make nearly all procedure-local data resident in the first-level cache, and can make them accessible by physical addresses; this will substantially decrease cache misses
Cache size will also be reduced
The compiler can control object and sub-object allocation and preloading
55
STREAMs-Oriented Architecture: Removing Drawbacks of the STRANDs Approach

Get the maximum parallelism available in the Program Data Graph and execute the graph itself
Chains of data-dependent operations are still presented to HW, but they are just hints (STREAMs), not a real resource
New mechanism of STREAM execution: WORKERs
No more deadlocks, since streams are not a static scheduling resource in the compiler (any number of streams); HW “workers” dynamically choose operations from the ready streams and dispatch them to the Reservation Station for execution
More details on the next slides…
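A rough software model of the worker mechanism (the names and the scheduling policy here are my assumptions; the real workers are HW): streams are ordered hints, and any worker may dispatch the head operation of any stream once its operands are ready.

```python
# Sketch: workers pick ready operations from the heads of any number of
# streams and dispatch them for execution; streams are hints, not resources.
from collections import deque

def execute(streams):
    """streams: list of deques of (name, dep_names, fn) operations."""
    done = {}                              # completed op name -> result
    while any(streams):
        progressed = False
        for s in streams:                  # each pass plays the workers' role
            if s and all(d in done for d in s[0][1]):    # head op ready?
                name, deps, fn = s.popleft()
                done[name] = fn(*[done[d] for d in deps])  # "dispatch to RS"
                progressed = True
        if not progressed:
            raise RuntimeError("no ready operation: cyclic dependence")
    return done

streams = [
    deque([("a", (), lambda: 2), ("c", ("a", "b"), lambda x, y: x + y)]),
    deque([("b", (), lambda: 3)]),
]
results = execute(streams)                 # "c" waits for "a" and "b"
```

Note that no stream ever blocks another permanently: a stream whose head is not ready is simply skipped, which is why the deadlock of the static strand/way assignment disappears.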
[Diagram: the numbered operations of a Program Data Graph are grouped into STREAMs; WORKERs pick ready operations from the stream heads and dispatch them to the Reservation Station, which issues them for execution]