A Perspective on the Future of Computer Architecture
TRANSCRIPT
11
Boris Babayan, Intel Fellow
October 2016
2
Agenda
• My background building real computers
• Challenges with today's superscalar computers
• Lessons and proposals for future computers
– Constrained designs: i.e., backwards compatible, with pragmatic compromises
– Lessons from the last several years at Intel
– Unconstrained designs: unlocking more performance potential
• Conclusions
3
My experience building real computers
• Carry Save Arithmetic
– In 1954 I developed "Carry Save Arithmetic" (for multiplication, division, and square root) as my student project, and presented it at a Russian conference in 1955
– This precedes the first Western publication of CSA, by M. Nadler in the Acta Technica journal (1956)
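The carry-save idea can be sketched in a few lines: three addends are compressed into two words (a carry-free sum and a shifted carry word) with no carry propagation, so a multiplier can keep compressing partial products and pay for only one slow carry-propagating add at the very end. This is an illustrative modern restatement, not the 1954 design:

```python
def carry_save_add(a, b, c):
    """Reduce three addends to two (sum, carry) with no carry propagation."""
    s = a ^ b ^ c                                 # bitwise sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1    # carries, shifted into place
    return s, carry                               # invariant: s + carry == a + b + c

def csa_reduce(partials):
    """Compress a list of partial products until two remain, then add once."""
    while len(partials) > 2:
        s, carry = carry_save_add(*partials[:3])
        partials = partials[3:] + [s, carry]
    return sum(partials)                          # single carry-propagating add
```

Each `carry_save_add` is constant-depth regardless of word width, which is why the final propagating add is the only step whose latency grows with operand size.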
• Chief architect of the Elbrus-1, Elbrus-2, and Elbrus-3 line of supercomputers
– My team built the Elbrus-line computers (1978–90), widely used in Russia, e.g., for the space program
– High-level programming language support was put in hardware (not just support of existing HLLs corrupted by outdated architecture) – still not implemented so far in other computers
– High-level language EL-76 for the Elbrus-line computers
– The Elbrus OS kernel had support for real high-level programming
• One of the first complete security solutions
– The Elbrus architecture, whose main goal is real HLL (EL-76) support, with the Elbrus OS kernel as a byproduct, fully solved the security problem, including the possibility to prove the correctness of user-level programs
4
My experience building real computers (continued)
• First industrial implementation of an out-of-order superscalar computer
– Elbrus-1 (implemented in 1978) was the first commercial implementation of an OoO superscalar in the world (a two-wide-issue computer)
– After the second generation of Elbrus computers in 1985, our team realized many weaknesses of the superscalar approach and started looking for a more robust solution to the parallel execution problem, which led us to VLIW
• Elbrus-3: a Very Long Instruction Word (VLIW) computer
– Successful implementation of a cluster-based VLIW architecture with fine-grained parallel execution (Elbrus-3, end of the 90s), probably the first in industry
• Hardware-assisted Binary Translation
– Proposed and first implemented Binary Translation (BT) technology for designing a new architecture built on radically new principles, yet binary compatible with the old ones (Elbrus-3, end of the 90s)
• Fine-grained parallel architecture
– Design and simulation of radically new principles of fine-grained parallel architecture, and extension of an HLL (like EL-76) and OS (like the Elbrus OS kernel) to support them
5
Challenges with today’s Superscalar Processors
6
Drawbacks of Superscalar Paradigm - 1
Drawbacks of the superscalar architecture:
– Program conversion is rather complicated (parallel -> sequential -> parallel)
– Superscalar architecture has a performance limit (regardless of available HW)
– Inability to properly use all available HW
– Even SMT mode cannot significantly improve efficiency (but decreases cache utilization efficiency instead)
– Rather complicated VECTOR HW and MULTI-THREAD programming have to be used to somehow compensate for this performance limit
– Today's high-level languages (HLLs) mirror the old and present-day architectures (linear data space, no explicit parallelism). As a result, the current architecture has corrupted all of today's HLLs
– The current organization of computations does not allow for good optimizations (it is necessary to have full information about both the algorithm to be executed and the hardware that will execute it)
– Non-universal architecture
7
Drawbacks of Superscalar Paradigm - 2
Memory and cache organization:
– The current architecture does not support object-oriented data memory
– This excludes the possibility of supporting truly secure computing and debugging facilities
– The cache organization of today's architecture hides its internal structure, preventing the compiler from doing good optimizations. This was done for compatibility with the simple linear memory organization of older computers
Superscalar architecture today is very close to an un-improvable state, including all the above-mentioned drawbacks
All the above-mentioned drawbacks have a single source: current architecture inherits, as its basic principles, the principles of ancient, early-days computing with its strong HW size constraints
8
Beginning of Computer Era (early 50s – mid 90s) - 1
Single execution unit era:
– The amount of available HW was the main constraint
– Single IP, single execution unit, linear memory of small size
– Performance was just the number of executed operations (fast memory vs. operation execution time)
– Binary programming was the most efficient method
– The programmer was responsible for all optimizations, as he knew both the algorithm and the available HW resources. HW was very simple at that time, so the programmer was able to fulfil this job very well
– The only reasonable HW improvement was to improve this single execution unit
9
Beginning of Computer Era (early 50s – mid 90s) - 2
• General results for the architecture of that period:
 This architecture was un-improvable under the corresponding constraints, because the main resource (the single execution unit) was itself un-improvable (carry-save and high-radix arithmetic), and every architecture had to include it
 This architecture was absolutely universal among programmable architectures, because any other architecture would have to include this single execution unit. No other architecture could work faster or use less HW. Using more HW (more execution units, for example) was not possible because of the main constraint on available HW
• Basic architecture decisions:
 Single Instruction Pointer ISA
 Simple linear memory organization
 No data type support in HW
 The input binary contains instructions on how to use resources, rather than a description of the algorithm
10
Superscalar Era (mid 90s – now) - 1
Constraints of the superscalar era:
– Significant progress in Si technology made more HW available (the HW constraint was removed) and execution faster, but memory remained slow
– Superscalar is still unable to use all the HW efficiently for a single job
– Implicit parallelization: the HW must convert a linear, single-IP execution flow into parallel form
– The original completion ordering has to be preserved, from parallel execution into consecutive retirement (compatibility with the preceding decisions)
– Simple linear memory organization, no support for data types
11
Superscalar Era (mid 90s – now) - 2
Outcome of this period:
 Sub-optimal functionality (semantics of data and operations)
– Without dynamic data type support in HW, it is impossible to implement real high-level programming and truly secure computing
 Sub-optimal performance
– The programmer doesn't know the details of the rather complicated HW and as a result is unable to fully control the optimizations made by the HW
– The compiler does not have all information about the algorithm being compiled (due to corrupted high-level languages); on the other side, the compiler is too far from the HW and is unable to fully utilize the HW and its internal structures (e.g., caches), which are hidden from the compiler
– Superscalar hardware is exposed via the ISA only (which inherits all the obsolete solutions); there is no way to provide the algorithm to this kind of HW, and all the HW machinery (BPU, renaming, cache organization, etc.) is designed to support compatibility, with limited performance improvement
12
New Post-Superscalar Architecture (what we call the "Best Possible" Computer System)
13
Algorithmically Oriented Post-Superscalar Era
Changing the angle of view:
– The algorithm of the program itself and its data dependencies are the real constraints on performance and power
– Move HW complexity into SW; free HW from code analysis and conversion to parallel form (closer to the algorithm representation)
– Move the design in the strongly opposite direction – from caring about resources to caring about algorithms
14
Constraints in Architecture are the Real Limiter
• Two designs will be considered:
 CONSTRAINED system
– New Architecture (NArch), constrained by compatibility with legacy binaries (x86, ARM, Power, etc.)
 UNCONSTRAINED system
– Advanced New Architecture (NArch+) without compatibility constraints (unconstrained), or more precisely – constrained only by the algorithm to be executed and by the HW resources of the processor
• All past designs have reached their constraints:
– Arithmetic, the early-days single-execution-unit architecture, superscalar, the functionality of high-level programming
• Therefore, to make the next step we should find some way to relax the constraints (for the first case of future architecture) or to remove them (for the second case)
15
Basic Approach for New Architecture Design
• Let's first design the best possible unconstrained architecture
• The constrained architecture is then just the unconstrained architecture limited by several mechanisms required for compatibility support
• So we will get the best possible unconstrained and constrained architectures!
• Three components must be fully investigated and designed to get the Best Possible Architecture:
 Language
 Compiler
 Hardware
16
New High-Level Programming Support
 The compiler should have full information about the algorithm being compiled
 The new programming language should be able to expose the details of the algorithm to the compiler and, eventually, to the HW
 The programmer should optimize only the algorithm, not its execution
 The new language should have the following main features:
– Ability to express the parallel, fine-grained structure of the algorithm in a perfectly clear manner, convenient for the programmer
– The right functionality (semantics) of its elements, including dynamic data types and capability support *)
– Ability to present exhaustive information about the algorithm
*) This feature was completely implemented in the EL-76 language used in several generations of Elbrus computers in Russia
17
Compiler
Role of the compiler:
– The compiler is responsible for all optimizations (not the HW)
– To do this it should be local to the model, which allows it to have all information about the model configuration
– It gets all information about the algorithm from the program text after a simple transformation into an intermediate "distributive", which is then compiled to different computer models
– No information is lost during compilation (full algorithm representation)
– The compiler can use some dynamic information from execution to be able to tune optimizations dynamically
 The structure of the HW elements should be appropriate for good optimizations controlled by the compiler
 A model-local compiler removes compatibility requirements from the HW, as the HW can be changed more freely if needed to satisfy some requirements (e.g., performance, power, market segments, etc.)
18
Process of Compilation
 The first-level compiler generates a distributive without any optimizations (a simple transformation from source code to a data-flow graph without information losses)
 The optimizing "real" compiler (distributive compiler, or D-compiler) is model dependent and generates optimized application code from the app distributive (using dynamic feedback for tuning)
[Diagram: application source code goes through first-level compilation (transformation) into an app distributive; on each machine, an optimizing D-compiler plus a system layer produce code for that HW model (HW model 1, HW model 2, …)]
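The two-stage scheme can be sketched as follows. This is a toy model under loose assumptions: `make_distributive` stands in for the first-level compiler (a lossless dependency graph, no optimizations), and `d_compile` for a model-dependent D-compiler that schedules the same distributive differently depending on the machine's issue width. The names and the greedy list-scheduling are illustrative, not from the talk:

```python
def make_distributive(ops):
    """First-level compilation: record ops and their data dependencies only."""
    return {name: set(deps) for name, deps in ops}

def d_compile(distributive, n_units):
    """Model-dependent stage: greedily schedule the (acyclic) graph onto a
    machine that can issue n_units operations per cycle."""
    done, schedule = set(), []
    pending = dict(distributive)
    while pending:
        ready = [op for op, deps in pending.items() if deps <= done]
        cycle = sorted(ready)[:n_units]   # machine width limits issue
        schedule.append(cycle)
        done |= set(cycle)
        for op in cycle:
            del pending[op]
    return schedule
```

The point of the sketch: the distributive is built once and carries the full dependency information, while each target model gets its own schedule from the same input.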
19
Requirements for New Architecture Hardware
• Hardware should not do any optimizations (e.g., BPU, prefetching), as it doesn't have any information about the algorithm being executed
• Release hardware from the necessity to analyze binaries and extract parallelism
• Hardware should only allocate resources according to compiler instructions
• Hardware should avoid "artificial binding" such as a single instruction pointer, vectors, cache lines, full virtual pages, etc.
• Hardware should give the compiler the possibility to change the HW configuration for better optimizations ("Lego Set" HW)
• Hardware should use object-oriented memory (as in the Elbrus computers)
20
NArch Architecture (constrained, compatible case)
• The semantics of legacy binaries cannot be changed, due to compatibility requirements
• The only possible relaxation is to change how these semantics get presented to the HW, in explicitly parallel form, for execution
• Release hardware from the necessity to analyze binaries and extract parallelism
• Let the software layer be responsible for finding available parallelism and for optimizations (via Binary Translation technology)
• Let the HW be responsible for optimal scheduling only (remove unneeded complexity from hardware and make it simpler) – as in the unconstrained case
• Binary Translation actually allows using all the mechanisms of the unconstrained architecture, with the addition of:
o Memory ordering rules and retirement
o Checkpoints for target context reconstruction and event processing
o A memory renaming technique for resolving memory conflicts in binaries, via a bigger register file and a special guard HW structure
• Unfortunately, for semantics compatibility reasons the constrained architecture cannot support security and aggressive procedure-level parallelization
21
Functionality (Semantics) of Basic Elements
22
Method of New Functionality Design
 In the constrained architecture, the functionality (semantics) of all its elements (data and operations) is strictly determined by compatibility requirements
 So let's first consider the unconstrained computer system and its elements, which were developed in accordance with the approach described above
 Note: all technologies and mechanisms are appropriate for both the constrained and the unconstrained systems
23
Primitive Data Types & Operations
 Primitive data types (HW keeps the type together with the value):
– Potential infinity (integer)
– Potential continuity (floating point)
– Predicates
– Enumerable types (e.g., character)
– Uninitialized data
– Data Descriptor and Functional Descriptor ("auxiliary" data types for technical operations)
 Primitive data types are dynamic data types
– The value is kept together with a tag
 Type safety approach
– All primitive operations check the types of their arguments
24
User Defined Data Types (Objects)
The "natural" requirements for the new architecture to support language-level functionality, consistent with the "abstract algorithm" idea:
1. Every procedure can generate a new data object and receive a reference to this new object
2. This procedure, using the received reference, can do everything possible with this new object (read data from the object, update its content, execute the object as a program, and delete the object)
3. No other procedure can access this object just after it has been generated, but this procedure can give a reference to the object, with all or a subset of the rights listed above, to anybody it knows (has a reference to)
4. Any procedure can generate a copy of a reference to any object it is aware of, with decreased rights
5. After the object has been deleted, nobody can access it (all existing references become invalid)
Creating data with an orientation on objects is an important step toward structuring data according to the semantics of the source algorithm
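Points 1–5 can be modeled in a few lines of software. This is only a sketch of the semantics (the `Obj`/`Ref` names and the `PermissionError` signaling are my own); in the described architecture these checks are done by hardware on tagged descriptors:

```python
class Obj:
    """A heap object; 'live' is cleared on deletion (point 5)."""
    def __init__(self):
        self.live, self.data = True, {}

class Ref:
    """A reference to an object, carrying an explicit set of rights."""
    def __init__(self, obj, rights):
        self.obj, self.rights = obj, frozenset(rights)
    def _check(self, right):
        if not self.obj.live:
            raise PermissionError("object deleted; reference is invalid")
        if right not in self.rights:
            raise PermissionError(f"reference lacks the '{right}' right")
    def read(self, key):
        self._check("read")
        return self.obj.data[key]
    def write(self, key, value):
        self._check("write")
        self.obj.data[key] = value
    def restrict(self, rights):
        # point 4: a copied reference may only decrease rights, never add
        return Ref(self.obj, self.rights & frozenset(rights))
    def delete(self):
        self._check("delete")
        self.obj.live = False  # every outstanding reference is now invalid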
25
Dangling Pointers and Memory Compaction
 To solve the dangling pointer problem (point 5) we must guarantee that after an object has been deleted, no one can access the memory it occupied
 The de-allocation procedure frees the physical memory, but not the virtual memory. So physical memory can be reused, but the virtual memory remains allocated
 The well-known classical solution is a garbage-collection algorithm, but it is inefficient as a solution to the dangling pointer problem
 When virtual memory gets close to its limit, the system starts compacting the virtual memory
 The compaction algorithm*):
– Each Data Descriptor is tagged, i.e., there is a special bit in registers and in memory which marks Data Descriptors
– The system identifies which Data Descriptors are useless (point to objects de-allocated in physical memory) and replaces them with uninitialized data, or just re-directs them to a non-existent memory page, thus releasing the virtual pages the descriptor had pointed to (according to the size of the object)
– The rest of the objects are moved to the vacant virtual memory, and each Data Descriptor's base address is replaced by the new virtual address
 This compaction can run as a background process
*) Note: this compaction algorithm was implemented in the Elbrus-1 and Elbrus-2 computers; it can be modified to make it more efficient
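The core of the pass can be sketched abstractly: slide the live objects down in virtual space, and rewrite every tagged descriptor either to its object's new base or to "uninitialized" if the object was de-allocated. A simplified single-pass model (descriptor tables and object maps here are plain dicts, and `None` stands for uninitialized data; the real algorithm scans tagged words in registers and memory):

```python
def compact(descriptors, objects):
    """descriptors: name -> virtual base; objects: base -> (size, live?).
    Returns rewritten descriptors and the compacted object map."""
    next_base, relocation, new_objects = 0, {}, {}
    for base in sorted(objects):          # slide live objects down in order
        size, live = objects[base]
        if live:
            relocation[base] = next_base
            new_objects[next_base] = (size, True)
            next_base += size
    new_descriptors = {}
    for name, base in descriptors.items():
        # dangling descriptor -> uninitialized; live -> new base address
        new_descriptors[name] = relocation.get(base)
    return new_descriptors, new_objects
```

Because every descriptor is identifiable by its tag bit, the pass never has to guess whether a word is a pointer, which is what makes running it as a background process safe.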
26
Procedures
 A procedure is the fundamental notion of HLLs. Every procedure has a reference to its code and context
 The procedure context consists of the code, global data, parameters/return data, and its local data
 A procedure can be called via a Functional Descriptor only (a tagged value)
[Diagram: a tagged Functional Descriptor contains the entry point address and a reference to the global context; the procedure's context comprises its global data, procedure code, and local data]
1. A procedure can create a Functional Descriptor (FD) with a special instruction, providing an entry point address and a Data Descriptor to some context as arguments; i.e., any procedure can define another procedure
2. The procedure that generated this FD can give it to anybody it has access to, and the new owner can also call the new procedure via the FD
3. A procedure that generates an FD includes references to the code and global data in this FD
4. A procedure that received the FD of the new procedure can call it and pass it some parameters (atomically)
5. The caller can receive some return data as a result of procedure execution. Data return is logically an atomic action
6. The called procedure can't use anything beyond the context provided to it via the Functional Descriptor and the parameters
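In software terms an FD behaves like a tagged closure: code plus a bound context, callable only through a checked call path. The sketch below is an analogy under that assumption, not the hardware format:

```python
class FD:
    """A tagged Functional Descriptor: entry point plus global context."""
    tag = "FD"
    def __init__(self, code, context):
        self.code, self.context = code, context

def call(fd, *params):
    """Calls go through an FD only; the callee sees nothing beyond the
    context bound into the FD and the parameters it was passed."""
    if getattr(fd, "tag", None) != "FD":
        raise TypeError("call target is not a Functional Descriptor")
    return fd.code(fd.context, *params)

# defining a counter procedure over a private context (point 1)
ctx = {"count": 0}
def bump(context, step):
    context["count"] += step
    return context["count"]
counter = FD(bump, ctx)
```

Calling `bump` directly (a raw code address, no FD tag) is rejected, which is the software analogue of the hardware refusing to transfer control through anything but a tagged FD.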
27
Capability Mechanism
 Only a system that provides type safety allows the correct implementation of the procedure mechanism. A procedure can be called via a Functional Descriptor only
 A procedure has access to its own context only. No other procedure can access this procedure's context unless it has been passed to that other procedure as a parameter
 This approach introduces very strong inter-procedure protection
 A Data Descriptor (DD) or Functional Descriptor (FD) is a capability, for the procedure that has the DD or FD in its context, to do something:
– A DD is a capability to access some object
– An FD is a capability to do something – to execute some procedure, which can modify global data in the called procedure: data that is not directly accessible by the caller
 Some operations that must work with the bit-level representations of special data types like DD and FD (the COMPACTION algorithm is a good example) sometimes need operation support in HW. All of these are also primitive operations; however, only a limited number of procedures should be able to use them
28
Full Solution of Security Problem
 The described approach does not need a privileged mode for system programming
– E.g., in Elbrus, all programs, including the OS, are written as "application" programs
 The capability approach is more powerful and more general than the privileged-mode approach (consistently implemented in Elbrus; no C-list, which is wrong)
 However, even this architecture cannot protect against mistakes in user programs. Probably the only possible remedy in this case is the possibility to prove the correctness of user and kernel programs
– A formal proof of functional correctness was done for the seL4 microkernel in 2009 by the NICTA group (National ICT Australia)
 Even in this case, only the suggested architecture can help to considerably simplify the proof of program correctness (for both kernel and applications)
29
Implementation of the Described Functionality
30
Object Oriented Memory (OOM) Structure
 Object-oriented memory was initially introduced in the Burroughs B5500 computer architecture, but was not implemented correctly
 All the basic principles were first carefully designed in Elbrus-1 (1972–78)
 Present-day memory and cache systems are corrupted by compatibility with the linear structure of old computers. This means that a future system should not use the traditional memory and cache organization, which excludes the compiler from applying efficient optimizations
 The OOM structure, even for the constrained architecture, can (according to preliminary estimations) decrease cache sizes by 2–3 times and nearly eliminate performance losses due to cache misses
 Object-oriented physical memory approach:
– The size of the physical memory allocated for an object is equal to the object size
– Each allocated object is also mapped into the virtual space with pages of fixed size
– Each new object in virtual space is allocated contiguously, starting from a new page (if the size of the object is smaller than the page size, the end of this page's virtual space is empty)
[Diagram: objects N and M occupy exactly their size in physical memory, while in virtual memory each starts on a fresh page, leaving the tail of the last page EMPTY]
31
Object Oriented Memory: Objects Naming Rules
 OOM uses virtual numbers of the objects instead of virtual memory addresses
 Virtual page numbers are allocated sequentially during each object generation
 There is a system register which keeps the next free object number, to be used for the next object generated
 We will sometimes use the expression "virtual address" when we mean "virtual number"
[Diagram: a virtual address consists of the object's virtual number N and an index; number N selects the object's virtual pages N(1), N(2), …, with the tail of the last page EMPTY; a system register holds the next object number, N+1]
32
Allocation of Objects and Sub-objects in Caches
 Unlike the TLB used in contemporary computers, the TLB in this OOM architecture translates a virtual address not into a physical memory address, but directly into the physical location, in some specific cache, where this piece of data resides
 In each specific cache, as well as in memory, the new architecture does not use cache lines (as superscalar does)
 The object's parts allocated at the cache levels are split into smaller parts, and all these parts belong to the same virtual page
 Each cache level could have its own small TLB
33
Generation of an Object
 A special HW instruction is used to generate an object (no SW library calls such as malloc, no OS system calls)
 The list of all occupied spaces is contained in the TLB, and the system maintains special lists for all free spaces. Each free-list maintains the free areas of a certain set of sizes (most likely powers of 2)
 For physical address allocation, the HW takes a physical address from one of the free-lists (the first free chunk from the corresponding list, found via a special HW register)
 The result of the instruction's execution is the corresponding Data Descriptor
[Diagram: the GENOBJ instruction takes an object type and object size, pops a chunk from the free-lists (size classes in powers of 2), and produces a Data Descriptor. Note: links are located inside the free memory chunks]
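A software sketch of the GENOBJ path, under the power-of-two free-list assumption stated above: round the requested size up to a size class, pop the first chunk from the matching list, and hand back a descriptor. Class and field names are illustrative:

```python
import math

class FreeLists:
    """Power-of-two free lists; GENOBJ pops the first chunk of the right size."""
    def __init__(self, chunks):
        self.lists = {}                  # size class -> list of base addresses
        for addr, size in chunks:
            self.lists.setdefault(size, []).append(addr)

    def genobj(self, obj_size):
        """Model of the GENOBJ instruction: returns a Data Descriptor."""
        size_class = 1 << max(0, math.ceil(math.log2(obj_size)))
        free = self.lists.get(size_class, [])
        if not free:
            raise MemoryError(f"no free chunk of size {size_class}")
        base = free.pop(0)               # first chunk from the matching list
        return {"tag": "DD", "base": base, "size": obj_size}
```

The descriptor records the exact object size, not the size class, so bounds checks remain tight even when the underlying chunk is rounded up.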
34
The Compiler Controls OOM Usage
 This memory/cache system organization allows the compiler to have strong control over the execution process
 The compiler is aware of all the program's semantic information and can perform more sophisticated optimizations
 The compiler can preload the needed data into a high-level cache, at first without assigning the more precious register memory, and move these data from cache to registers only at the last moment. But now even preloading directly into the registers could sometimes be a good alternative – now we have a big register file
 This cache organization allows accessing the first-level cache directly from an instruction by physical address, without using a virtual address and associative search
 To do this, the base register (BR) can support a special mode in which it holds pointers to the physical location in the first-level cache together with the virtual address
35
Explicitly Parallel Instruction Execution in NArch+
 In the NArch+ architecture all mutually independent executable objects can be executed in parallel with each other. This includes:
– Operations
– Chains of dependent operations inside scalar code and/or iterations of loop code
– Procedures
– Jobs
 NArch+ overcomes the difficulties and constraints of the Data Flow and single-IP approaches and excludes any "artificial binding" in HW (the program is a parallel graph)
 Two different approaches have been investigated in NArch+ for executing the program data graph: strands and streams (see next slides)
36
STRANDs Oriented Architecture
• Strands express parallelism via chains of (mainly) data-dependent operations (in a more natural way than, e.g., VLIW) and provide a new opportunity for presenting parallelism to OoO HW
• Simple instruction scheduling for parallel execution
– Need to look only at the oldest instruction in each strand (a much smaller and simpler RS)
• Strands also provide:
– A bigger effective instruction window
– Reduced register usage (via intra-strand accumulators)
– Wider instruction issue width (via clustering with register-to-register communication)
• Adding the ability to express parallelism in the uISA gives additional advantages, e.g., superior control over speculation and power, better HW utilization, and many more opportunities for optimizations and for resolving the memory latency issue
[Diagram: the original data graph is cut into strands, each with its own IP (IP1, IP2, IP3); a HW scheduler feeds them to parallel execution HW with a register file, possibly as two clusters (EXEC + RF each) joined by an interconnect]
37
Drawbacks of the STRANDs Architecture
 Strands are extracted from the program data graph by the compiler
 Each strand is executed by the HW in order, but out of order relative to other strands
 The HW allocates a set of resources for each active strand (called a WAY)
 The compiler creates a strand via a special FORK operation, which takes a free WAY for the strand's execution
 BUT the compiler has to be aware of the number of WAYs available in HW and schedule strands accordingly. Otherwise there could be a deadlock (e.g., no free WAY to spawn new strands, while the other strands are waiting for some result from this new strand)
 Having the strand (WAY) as a resource for the compiler potentially limits parallelism
[Diagram: FORK A and FORK B spawn strands A and B onto Way 0, Way 1, Way 2]
DL/CL Mechanism for Register/Predicate Reuse
 Definition Line (DL):
– A Definition Line L is a group of DL-instructions in different streams which form an explicit DL-front dividing the streams into intervals
– A DL-front crosses all live streams according to a possible timing analysis. Fronts are successive – they do not cross each other
 Check Line (CL):
– A Check Line (CL) is a group of CL-instructions suspending the execution of some streams until the specified DL-front has completely passed
– After that, the corresponding register/predicate resource can be safely reused
[Diagram: streams of instructions A–S laid out over time, crossed by successive DL-fronts (+DL); a CL-2 instruction suspends a stream until DL-front 2 has completely passed]
39
Intelligent Branch Processing
– Conventional: predict one branch path, discard everything when wrong
– New Architecture: speculate only when necessary, discard only the misspeculated work
– Increases performance
– Reduces energy wasted due to misspeculation
– According to our statistics, 80% of branches are not critical and can be executed without speculation
40
STREAMs Oriented Architecture: Streams and How They Get Created
• First, let's describe the simplest case, when the algorithm to be executed is scalar by its nature (an acyclic data-dependency graph) without conditional branches
• Let the total number of operations equal the number of available registers (single assignment, no register reuse)
• For this simple case:
– No decoding stage (each instruction is ready to be loaded into the corresponding execution unit; the compiler prepares the code)
– For each instruction in the graph, the compiler calculates a "Priority Value Number" (PVN). This number is the number of clocks from this instruction to the end of the graph along the longest path. The compiler will present the code as a number of sequences of dependent instructions – "streams"
– As the first instruction of a new stream, the compiler takes the instruction with the highest PVN not yet included in any other stream. For each next instruction in this stream, the compiler again selects the instruction with the highest PVN among those data dependent on the previous instruction. And so on, until the stream reaches either the end of the scalar code or runs into some other stream
[Diagram: a data-dependency graph decomposed into Stream 1, Stream 2, and Stream 3]
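The PVN computation and the greedy stream extraction described above can be written down directly. A sketch under the stated assumptions (acyclic graph, unit or known latencies; ties broken arbitrarily):

```python
def build_streams(graph, latency):
    """graph: op -> set of data-dependent successors; latency: op -> clocks.
    PVN(op) = longest-path clock count from op to the end of the graph."""
    pvn = {}
    def compute(op):
        if op not in pvn:
            succs = graph[op]
            pvn[op] = latency[op] + (max(compute(s) for s in succs) if succs else 0)
        return pvn[op]
    for op in graph:
        compute(op)

    assigned, streams = set(), []
    while len(assigned) < len(graph):
        # seed: the highest-PVN op not yet included in any stream
        op = max((o for o in graph if o not in assigned), key=lambda o: pvn[o])
        stream = []
        while op is not None and op not in assigned:
            stream.append(op)
            assigned.add(op)
            succs = list(graph[op])
            # follow the highest-PVN dependent op; stop on running into
            # an op already claimed by another stream
            op = max(succs, key=lambda s: pvn[s]) if succs else None
        streams.append(stream)
    return streams, pvn
```

The first stream traced this way is the critical path of the graph, which is why seeding by highest PVN matters: later streams are, by construction, off the critical path.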
41
Scalar Code Execution With STREAMs: Execution Engine (Workers)
 Register file:
– Each register has an EMPTY/FULL bit (EMPTY prevents reading the register when the value is not ready yet, and FULL prevents writing to the register when not all dependent instructions have consumed the value)
– Each register has an additional bit showing whether the operation generating the value for this register has already been sent to an execution unit (EU) or is in the Reservation Station (RS)
 The main scheduling and execution mechanisms for streams are "workers" (16 per cluster)
 How the workers work:
– Workers issue ready instructions to the RS/execution units (the arguments are FULL, or the predecessors are in the RS/EU)
– Each register has a list of streams waiting for the result in this register
– If a waiting stream becomes ready for execution (the value is ready), it is moved to the "waiting for a free worker" queue
– A free worker takes an instruction from the "waiting for workers" queue or from the Instruction Buffer
– If an argument of the next instruction in the stream is not ready yet, the worker stops executing this stream and puts it into the waiting queue for this argument (register)
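The worker discipline can be modeled in miniature. This is a deliberately simplified sketch: single assignment (so the FULL-bit write check is omitted), one worker active at a time, and instructions as `(dst, srcs, fn)` tuples; the real design has 16 workers per cluster and an RS:

```python
class Register:
    """A register with a FULL bit and a list of streams parked on it."""
    def __init__(self):
        self.full, self.value, self.waiters = False, None, []

def execute(streams, n_regs):
    """streams: per-stream lists of (dst_reg, src_regs, fn) instructions."""
    regs = [Register() for _ in range(n_regs)]
    ready = list(range(len(streams)))     # stream ids waiting for a worker
    pos = [0] * len(streams)              # next instruction in each stream
    while ready:
        sid = ready.pop(0)                # a free worker picks up a stream
        while pos[sid] < len(streams[sid]):
            dst, srcs, fn = streams[sid][pos[sid]]
            empty = next((r for r in srcs if not regs[r].full), None)
            if empty is not None:         # argument EMPTY: park the stream
                regs[empty].waiters.append(sid)
                break
            regs[dst].value = fn(*(regs[r].value for r in srcs))
            regs[dst].full = True         # result ready: set the FULL bit
            ready.extend(regs[dst].waiters)   # wake streams parked here
            regs[dst].waiters = []
            pos[sid] += 1
    return [r.value for r in regs]
```

A parked stream costs nothing until its register fills; the waiter list on each register is what turns the register file itself into the wake-up mechanism.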
42
NArch+: Scalar Code Execution – More Complex Case (Bigger Code)
 If the scalar code is big enough, the DL/CL technique is applied for register reuse, to guarantee correct dynamic execution of streams and optimal utilization of the Instruction Buffer
 When the code before CL(N) has been executed, it is necessary to preload the next part of the code, between CL(N) and CL(N+1). Similarly, when DL(N) is crossed, the whole code area above it can be freed
 The size of the code between CL(N) and CL(N+1) is no bigger than the size of the register file
 Execution time can be improved with the help of the Dynamic Feedback mechanism (both in HW and SW)
 If there are conditional branches in the code, the compiler uses speculative streams to handle these cases efficiently (predicated streams, and a GATE instruction to check the predicate value and kill one of the streams in case of wrong speculation)
 More details on speculation techniques (e.g., load/store speculation, efficient branch handling without branch prediction) would require more low-level micro-architecture details. Alas!
 This scalar technology is nearly the same for the constrained and unconstrained versions of the architecture
 This scalar code execution technique is a practical implementation of the Data Flow architecture
43
Summary: Strands vs. Streams
 Strands
– The mechanism of strand execution (one WAY per strand) is visible to the compiler, so the compiler has to track how many strands are to be executed by the HW at each moment, and the number is limited by the number of WAYs
– Cons: can lead to deadlock; limits parallelism due to explicit resource (WAY) scheduling by the compiler
[Diagram: original program graph cut into strands, dispatched by a HW scheduler into a fixed set of WAYs feeding parallel HW (EXEC + RF)]
 Streams
– The compiler can create any number of streams; the mechanism of stream execution is not visible to the compiler
– Pro: no deadlock; the HW executes the original graph – a natural data-flow execution mechanism
[Diagram: original program graph dispatched by a HW scheduler through workers and an RS into parallel HW (EXEC + RF)]
44
NArch+: Code with Loops
 Use loop iteration parallelism (both intra-iteration and inter-iteration) as fully as possible
 Loop iteration analysis performed by the compiler:
– Find instructions which are self-dependent across iterations
– Find the groups of instructions which, besides being self-dependent, are also mutually dependent across iterations ("rings" of data dependency)
– The rest of the instructions form sequences or graphs of dependent instructions (a number of "rows")
– The result of each row is either an output of the iteration (a STORE, for example), or is used by other row(s) or ring(s)
 Each "ring" and/or "row" loop produces data consumed by other small loops. Each producer can have a number of consumers. However, producer and consumer should be connected through a buffer, giving the producer the possibility to run ahead if the consumer is not yet ready to use the data
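The ring/row split is, in graph terms, a strongly-connected-component decomposition of the cross-iteration dependence graph: rings are the non-trivial (or self-looping) components, rows are the rest. A Kosaraju-style sketch, assuming `deps` maps each instruction to the set of instructions it depends on across iterations (the representation is mine, not from the talk):

```python
def rings_and_rows(deps):
    """deps: instr -> set of instrs it depends on across iterations.
    Rings = strongly connected groups (incl. self-dependent instrs);
    rows = the remaining acyclic instructions."""
    succ = {n: set() for n in deps}
    for n, ds in deps.items():
        for d in ds:
            succ[d].add(n)               # forward edges: producer -> consumer
    order, seen = [], set()
    def dfs(n, graph, out):
        seen.add(n)
        for m in graph[n]:
            if m not in seen:
                dfs(m, graph, out)
        out.append(n)
    for n in deps:                       # first pass: finishing order on succ
        if n not in seen:
            dfs(n, succ, order)
    seen.clear()
    rings, rows = [], []
    for n in reversed(order):            # second pass on the reversed graph
        if n not in seen:
            comp = []
            dfs(n, deps, comp)
            if len(comp) > 1 or n in deps[n]:
                rings.append(sorted(comp))   # cyclic: must run as a unit
            else:
                rows.extend(comp)            # acyclic: freely pipelinable
    return rings, rows
```

Rings bound the loop's recurrence-limited throughput, while rows can be pipelined across iterations, which is exactly why the compiler wants them separated before assigning producer/consumer buffers.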
45
Loops Handling in NArch+
Differences between NArch and NArch+ in loop implementation:
– NArch+ does not need to maintain compatibility with the single-IP approach; therefore, many different loops can be executed together (even a “single” loop can be executed out of order)
– NArch+ has a simple memory system without speculative buffers; therefore, in some cases (speculation only) other mechanisms and new HW support are needed

Types of loops handled by NArch+:
– RECURRENT loop (including WHILE loop)
– DO ALL (trip count known before the loop starts)
– DO ALL (trip count becomes known only during loop execution)
– Loop with a low-probability “maybe” dependence between iterations (through memory), including WHILE loops
– Loop with “maybe” data dependence within iterations
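To make the taxonomy concrete, here are minimal Python shapes of these loop classes. These are my own illustrations, not examples from the talk:

```python
# RECURRENT: each iteration depends on the previous one (a "ring").
def recurrent(a):
    s = 0
    for x in a:
        s = s * 2 + x
    return s

# DO ALL, trip count known up front: iterations are fully independent.
def do_all_known(a):
    return [x * x for x in a]

# DO ALL, trip count discovered only during execution (WHILE-style exit).
def do_all_unknown(a):
    out = []
    for x in a:
        if x < 0:              # exit condition found only at run time
            break
        out.append(x * x)
    return out

# "Maybe" dependence through memory: b[idx[i]] may alias b[i], so
# iterations can run in parallel only if the aliasing is checked.
def maybe_dep(b, idx):
    for i in range(len(idx)):
        b[idx[i]] += b[i]
    return b
```

Only the DO ALL forms are trivially parallel; the others need either the ring machinery or an aliasing check.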
46
Parallel Procedure Execution

• For the constrained architecture, a procedure can execute on a varying number of clusters, but no more than four.
• The compiler will try to inline as many called procedures as possible, so that the resulting procedure-level parallelism can be exploited in full.
• As usual for the constrained case, the caller waits for the end of the called procedure and works with the same resources.
• A call, like a return, is a logically atomic step; however, to increase performance, the DL/CL technology provides prolog and epilog regions where caller and callee work together without interfering with each other.
• In the unconstrained architecture, the new HLL allows parallel procedure execution, but again each procedure will use no more than four clusters.
• If a procedure has a DO ALL loop inside, that loop can use all available HW (many, up to all clusters on the chip; ~60 today).
47
All Basic Parts of Computer Technology and Their Current Status
48
NArch/IA Architecture (IA-compatible case study)

NArch/IA is an x86-compatible new micro-architecture based on the strands approach
– NArch strand: a sequence of (usually dependent, possibly including control flow) operations with its own IP; strands are executed out of order, in parallel
– BT parses IA binaries, extracts strands, and provides them to HW for scheduling and execution
– Multiple strands allow overlapping of memory accesses (thus hiding memory latency)

A fairly wide CPU due to scalable clustering
– One or two bi-clusters (up to 4 clusters and 24-instruction issue width; 16 strands per cluster)
– Clusters are tightly coupled (register-to-register communication and synchronization)

Very large sparse instruction window
– Much larger than in a conventional superscalar (~1K instructions)
– Branch resolution in the large window (no HW branch predictor)
– Memory disambiguation in the large window
– Smart retirement in the large window (no retirement for registers)

Binary Translation for IA compatibility and for enabling the NArch uarch
– Dynamic and static BT for maximum ST/MT performance and efficiency

Highly parameterized architecture (scalability)
– Variable number of clusters / strands per cluster
– Dynamically reconfigurable machine (ST/MT)

The result is higher performance and lower power at the same time
49
Advantages of the New Architecture: Compatible (constrained) case

• This approach can ensure full compatibility with some existing binaries (ARM, x86, POWER, RISC-V, etc.), or even with all of them on the same HW, with the help of Binary Translation
• Preliminary investigations allow us to make the following fairly reliable predictions:
– A compatible version (NArch) can reach the best possible, un-improvable performance, restricted only by the binary's semantic constraints (not by its sequential presentation) and by the amount of resources available in a specific model
– ~3x-4x ST performance @ unconstrained power vs. an OoO core
– ~2x ST performance @ iso-power
– Less than ~50% of the power @ iso-performance
– ~2x MT performance @ iso-power vs. an OoO core
50
Advantages of the New Architecture: Incompatible (unconstrained) case

• If we free the HW architecture from the requirement to maintain compatibility with old-style programming, then we can:
– Significantly simplify the architecture (e.g. 70-75% of the constrained architecture is the burden of maintaining compatibility with superscalar)
– Introduce explicit parallelism in programming languages to expose the algorithm structure to HW more easily
– Introduce security in HW (tagged architecture) and, eventually, get rid of viruses and make programming safe and reliable
– Get rid of the obsolete cache memory hierarchy (object-oriented memory)
– Eventually, increase performance significantly (up to 5x-7x or even more)
– Improve scalability and universality (new distributive, HW-model-oriented compiler)
– Build an absolutely un-improvable computer architecture
• As a result of the high universality of this architecture, we can hope that all special applications, like machine learning, computer vision, and graphics, will now be supported well with high performance
51
T H A N K Y O U !
Q & A
52
Intel Labs Joint Pathfinding
Backup Slides
53
Each TLB entry, besides helping to translate a virtual address into a physical data location, can also include some documentation of the referenced object: its size, its user data type (Object Type Name, OTN), and perhaps some other information
It also includes references to more detailed tables of the physical locations of all elements of this object in the cache(s)
Not every object has to be present in memory; some objects can be generated, for example, in the DCU (Level 1 cache) only
[Diagram: a TLB entry holds access rights, object number, object size, object type, sub-object information, and physical location; it forms a data descriptor for the object. The object number selects the TLB entry, and adding the index locates the object or sub-object in the DCU, MLC, LLC, or physical memory]
Object Oriented Memory: TLB Structure
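A minimal sketch of such a TLB entry in Python, using the field names from the slide; the exact layout, and the idea that translation performs a bounds check against the object size, are my assumptions:

```python
# Hypothetical model of an object-oriented TLB entry (field names from the
# slide; layout and bounds-check behavior are assumptions, not the real HW).
from dataclasses import dataclass, field

@dataclass
class TLBEntry:
    object_number: int      # which object this entry describes
    access_rights: str      # e.g. "r", "rw"
    size: int               # object size, enabling HW bounds checks
    object_type_name: str   # OTN: user-level data type of the object
    physical_location: int  # base physical address (if memory-resident)
    sub_objects: dict = field(default_factory=dict)  # element -> cache location

def translate(entry, index):
    """Translate (object, index) to a physical address, checking bounds."""
    if not (0 <= index < entry.size):
        raise IndexError("access outside object bounds")
    return entry.physical_location + index

e = TLBEntry(7, "rw", 16, "Matrix", 0x1000)
addr = translate(e, 3)      # 0x1000 + 3
```

Because the entry knows the object's size and type, every access can be checked, which is what makes the security properties claimed earlier enforceable in HW.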
54
Advantages of the Object Oriented Memory System

Unlike superscalar, the OOM memory/cache system is visible to the compiler: no uncontrollable physical pages, lines, or cache structure hidden from the compiler. This helps significantly to improve efficiency
The explicit object-oriented structure helps increase the efficiency of memory usage. All free memory is explicitly visible to the compiler and HW
The ability to access the first-level cache using physical addresses directly from instructions promises a huge increase in efficiency
Inexpensive memory allocation (without OS and library calls) also increases efficiency and makes the Operating System simpler to design
The eviction process is explicitly controlled by the compiler
The compiler has full knowledge of the cache structure, can make nearly all procedure-local data resident in the first-level cache, and can make them accessible by physical addresses; this will substantially decrease cache misses
Cache size will also be reduced
The compiler can control object and sub-object allocation and preloading
55
STREAMs-Oriented Architecture: Removing Drawbacks of the STRANDs Approach

Get the maximum parallelism available in the Program Data Graph and execute the graph itself
Chains of data-dependent operations are still presented to HW, but they are just hints (STREAMs), not a real resource
New mechanism of STREAM execution: WORKERs
No more deadlocks, since streams are not a static scheduling resource in the compiler (any number of streams); HW “workers” dynamically choose operations from the ready streams and dispatch them to the Reservation Station for execution
More details on the next slides…
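A rough software model of the worker mechanism (the names and the scheduling policy here are my assumptions; the real workers are HW): streams are ordered hints, and any worker may dispatch the head operation of any stream once its operands are ready.

```python
# Sketch: workers pick ready operations from the heads of any number of
# streams and dispatch them for execution; streams are hints, not resources.
from collections import deque

def execute(streams):
    """streams: list of deques of (name, dep_names, fn) operations."""
    done = {}                              # completed op name -> result
    while any(streams):
        progressed = False
        for s in streams:                  # each pass plays the workers' role
            if s and all(d in done for d in s[0][1]):    # head op ready?
                name, deps, fn = s.popleft()
                done[name] = fn(*[done[d] for d in deps])  # "dispatch to RS"
                progressed = True
        if not progressed:
            raise RuntimeError("no ready operation: cyclic dependence")
    return done

streams = [
    deque([("a", (), lambda: 2), ("c", ("a", "b"), lambda x, y: x + y)]),
    deque([("b", (), lambda: 3)]),
]
results = execute(streams)                 # "c" waits for "a" and "b"
```

Note that no stream ever blocks another permanently: a stream whose head is not ready is simply skipped, which is why the deadlock of the static strand/way assignment disappears.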
[Diagram: the numbered operations of a Program Data Graph are grouped into STREAMs; WORKERs pick ready operations from the stream heads and dispatch them to the Reservation Station, which issues them for execution]