

Reiner Hartenstein (keynote): Directions of Programming Research: Seeking a Needle in the Haystack? The 1st Brazilian-German Workshop on Micro and Nano Electronics (BGME’2010), Oct 6-8, 2010, Porto Alegre, RS, Brazil

Reiner Hartenstein, TU Kaiserslautern, Germany

http://hartenstein.de

Directions of Programming Research:

Seeking a Needle in the Haystack?

Reiner Hartenstein

IEEE fellow


German-Brazilian Year of Science - Technology and Innovation 2010/2011,

Brasil-Alemanha 2010/11: Ano da Ciência, Tecnologia e Inovação,

Deutsch-Brasilianisches Jahr der Wissenschaft, Technologie und Innovation 2010/2011

Karlsruhe Institute of Technology

member,


© 2008, 2010, http://hartenstein.de

Abstract (Preface)


The energy consumption of all computers worldwide will become unaffordable.

We need to reinvent computing.

An alternative programmable technology with massive potential for speed-up and energy savings was developed decades ago: Reconfigurable Computing (RC).

Progress in HPC (high-performance computing) is stalled by the parallelism wall and the power wall.

However, all this is handicapped by programmer productivity problems.


The Twin Wall Crisis:

Power & Performance

Worldwide two drastically disruptive developments:

µP industry changed strategy over to „manycore“

(away from faster clock speed)

Energy consumption of computing becoming unaffordable

The Programming Wall

The Power Wall


Outline

• The coming Shortage of Energy

• Energy Consumption of Computing

• The Programmability Crisis

• Rescue by Reconfigurable Computing ?

• The Reconfigurability Paradox

• We need to Reinvent Computing

• Reinventing Programmer Education

• Conclusions



No more cheap oil


Currently: >80 $

Tendency: growing


Oil crises: weekend driving ban (Germany)

1973, 1979/1980

(dependence on Middle East oil-producing countries)






Beyond Peak Oil


J. S. Gabrielli de Azevedo: Petrobras e o Novo Marco Regulatório; São Paulo, December 1, 2009


Cheap Oil Era reached its End

Rapidly growing energy prices (IEA: factor of 3) predicted.

50% reserves are under water. Off-shore Projects re-calculated.

IEA: “>six more Saudi Arabias for the demand predicted for 2030“

80% of crude oil is coming from declining fields.

Higher standards of living: China, India, Brazil, Mexico, and other newly industrialized countries.

Growing electricity consumption of computers: 10 more Saudi Arabias!

IEA estimates: demand will double by the year 2030.

China passes the U.S. in energy use [IEA]


Beyond Oil: Literature


US: ~3 $

… post petroleum …

… hundreds of books


Beyond oil:

Literature (2)





Computers everywhere






... Ecosystem: just one example



... Supercomputers ...



more ...


Business Information Systems without Computers

Lufthansa Reservation anno 1960

http://wiki.answers.com/Q/Why_are_computers_important_in_the_world


Banking without Computers


COMMputation

communication and computing infrastructures everywhere





Innovation-driven computing

[Andy Hopper]

• Simulation and modelling are important tools which will help predict global warming and its effects.


• Computing will play a key part in optimizing use of resources in the physical world.

• The amount of infrastructure making up the digital world is continuing to grow rapidly and starting to consume significant energy resources.

• To help generate momentum and achieve these goals, it is important that a coordinated set of challenging international projects are investigated.

• We are experiencing a shift to the digital world in our daily lives as witnessed by the wide scale adoption of the world wide web.

Green IT:

• Smart energy meters: housing, buildings, facilities

• Carpooling and public transport by info web sites

• Road traffic and transport logistics optimization

• Reduce travelling by telecommuting.


Some grand challenge examples for CPS

[Ed. Lee]

• Blackout-free electricity generation and distribution,

• Extreme-yield agriculture,

• Safe, rapid evacuation in response to natural or man-made disasters,

• Perpetual life assistants for busy, senior/disabled people,

• Location-independent access to world-class medicine,

• Near-zero automotive traffic fatalities, minimal injuries, and significantly reduced traffic congestion and delays,

• Reduce testing and integration time and costs of complex CPS systems (e.g. avionics) by 1 to 2 orders of magnitude,

• Energy-aware buildings and cities,

• Physical critical infrastructure that calls for preventive maintenance,

• Self-correcting cyber-physical systems for “one-off” applications,

• Disaster Response: Large-Scale Emergency Evacuation,

• Assistive Devices.


The World Economic Forum’s “Global Redesign Initiative”

Organizations like UN, UNESCO, GATT, G8, G20 are increasingly inept at fixing what ails the world:

• economic growth • climate protection • poverty eradication • conflict avoidance • human security • global vaccine protocol • global risk management • promotion of shared values • intelligent water management • smart energy production/distribution …

“Existing global institutions require extensive rewiring to confront contemporary challenges."

Wikinomics approach for agile world-wide mass collaboration without bureaucracy.

New paradigm to involve world citizens by global IT networks with graphic user interfaces: for citizen juries, polling, digital brainstorms, policy wikis, town hall meetings …


Growth of the Internet

The trends are illustrated by:

• expanding wireless internet,
• growing number of users,
• shipping electronic books,
• more cloud computing?
• larger e-mails,
• services integrating video and software,
• increasing popularity of games,
• massive use of video on demand,
• high-definition video and pay-TV,
• services by mobile phone companies,
• and many other services.

Internet service providers need to assess how much more bandwidth will be required.

In 2007, growth by a factor of 30 was predicted by the year 2030 if current trends continue.

Broadband connections in North America, Mexico, and Western Europe: 155 million by the end of 2007; 228 million predicted for after 2011.
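The factor-of-30 prediction can be translated into an implied annual growth rate. The following is a minimal sketch, assuming steady compound growth between 2007 and 2030; the compound-growth model and the function name `implied_annual_growth` are illustrative assumptions, not part of the talk.

```python
def implied_annual_growth(factor: float, start_year: int, end_year: int) -> float:
    """Return the annual growth rate g such that (1 + g) ** years == factor.

    Assumes steady compound growth, which is a simplification.
    """
    years = end_year - start_year
    if years <= 0:
        raise ValueError("end_year must be after start_year")
    return factor ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Factor and years as quoted on the slide.
    rate = implied_annual_growth(30.0, 2007, 2030)
    print(f"implied growth: {rate:.1%} per year")
```

Under this assumption, a factor of 30 over 23 years corresponds to a growth rate in the mid-teens of percent per year, which shows how a moderate-looking yearly increase compounds into a very large total.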


Power Consumption of Computers

[Albert Zomaya 2008]

Power consumption by the internet: x30 until 2030 if trends continue [G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8–11 Sep 2008]

at Dallas

[Randy Katz: IEEE Spectrum, Febr. 2009]

Energy cost may overtake IT equipment cost in the near future

„Google causes 2% of the world's electricity consumption“ (Google denied)

at Quincey

at Boardman

2009


Electricity Bill: a Key Issue

„The possibility of computer equipment power consumption spiraling out of control could have serious consequences for the overall affordability of computing.”

Patent for water-based data centers

Cost of a Google data center is determined by the monthly power bill [L. A. Barroso, Google]

Google going to sell electricity

• As early as 2005, Google’s electricity bill was higher than the value of its equipment.






How Societies Choose to Fail or Succeed

Collapse of our computing ecosystem ?

Unaffordability of von-Neumann-centric computing could jeopardize all facets of our global economy.

Manycore: failure could jeopardize both the IT industry and most sectors of the economy that depend on rapid improvement of IT. [Dave Patterson]

Several recent outages of cloud computing services.

Stuxnet worm: only propaganda trick?


without Cyber Infrastructure ?


homo computensis

homo Neanderthalensis?




The Trouble with Manycore

The growing core counts are racing ahead of programming paradigms and programmer productivity


- a challenge to CS education

to major extent also in mass markets

going to FPGA: for programmers a paradigm shift

in supercomputing

Chipmakers busy designing microprocessors that most programmers can’t program [David Patterson, IEEE Spectrum, July 2010]

doing so without any clear notion of how such devices would in general be programmed

They hope, someone will be able to figure out how to do that


Can we get it right this time?

The “parallel programming problem”: addressed for at least 25 years, in HPC.


Only a small number of specialized developers write parallel code.

Multicore becoming ubiquitous: some hope that “if you build it, they will come”

[T. Mattson, M. Wrinn: Parallel Programming: Can we PLEASE get it right this time? DAC 2008, Anaheim, CA, June 8-13, 2008]

A massive worldwide effort is required, taking many years, creating masses of jobs

We need to reinvent programmer education

We need to reinvent computing

„The proud era of von Neumann architecture passes into history.“

„Foundational change will disrupt traditional habits throughout the discipline ....“

Michael Wrinn, (keynote at SIGCSE2010): Suddenly, All Computing Is Parallel: Seizing Opportunity Amid the Clamor http://www.sigcse.org/sigcse2010/attendees/keynotes.php


Multicore is not new

•ACRI •Alliant •American Supercomputer •Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/ Stellar/Stardent

•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines

•Kendall Square Research •Key Computer Laboratories

Dead (Super)Computer Society [Gordon Bell, keynote, ISCA 2000]

•MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer •Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics

the single core sequential mind set was the winner

only 2 or 3 successes

most in 1985-1995

- mainly research






Amid the Clamor ?

bring parallel computing into mainstream of undergraduate education [Michael Wrinn]

current discussions: despairingly seeking a needle in a haystack.


We need a new Textbook


having an impact like Mead & Conway

"The book that changed everything“; Electronic Design News, Feb. 11, 2009




The Tail wagging the Dog

CPU: „Central Processing Unit“

„Central“: it controls (almost) everything

However, it needs accelerators


Twin paradigm systems

CPU: „Central Processing Unit“, controls (almost) everything

However, it needs accelerators:

• hardwired accelerators (ASIC): 3%
• reconfigurable accelerators (FPGA): 97%, the second paradigm

(design start ratio) [Dataquest, 2009]


A Clean Terminology, please

program source | result
Software | instruction streams
Flowware | data streams
Configware | datapath structures configured





Speed-up factors obtained by Software to Configware migration

Application | Speed-up factor

DSP and wireless:
FFT | 100
Reed-Solomon Decoding | 2400
Viterbi Decoding | 400
MAC | 1000

Bioinformatics:
molecular dynamics simulation | 88
BLAST | 52
protein identification | 40
Smith-Waterman pattern matching | 288

Astrophysics:
GRAPE | 20

Image processing, Pattern matching, Multimedia:
SPIHT wavelet-based image compression | 457
real-time face detection | 6000
video-rate stereo vision | 900
pattern recognition | 730

Further entries:
CT imaging | 3000
crypto | 1000
DES breaking | 28500
DNA seq. | 8723

(chart axis: Speedup-Factor, logarithmic, from 1 to 1,000,000)


Energy saving factors: ~10% of speedup

[Chart: the speed-up factors of the Software to Configware migration chart (FFT 100 … DES breaking 28500, logarithmic Speedup-Factor axis from 1 to 1,000,000), annotated with the power save factors obtained (FPGAs)]


x86 Clock Frequency

[Chart: x86 clock frequency, 1995 to 2005, logarithmic axis from 100 MHz to 10 GHz; Celeron, Pentium III, Pentium IV]

growing clock speed, growing power consumption

Celeron to Pentium IV: x20

Pentium I (1989) to Pentium IV: x60

1995 – 2005: speed-ups obsolete?

(migration papers: power save not reported before 2005)


Hitting 28nm, and beyond

Both de facto FPGA giants (Xilinx and Altera) are hitting 28nm at the end of 2010.

FPGAs are now capable of implementing entire SoCs. They have turned into a complex heterogeneous mix of coarse-grained elements and classical fine-grained LUTs.

2009: Intel ships 32nm
2010: foundries to ship 28nm
Intel will ship 22 nm in 2011, 16 nm in 2013

Xilinx partner TSMC, the world’s largest standalone fab, is almost the de facto fab for all FPGAs in the world. Altera is also well known for its long partnership with TSMC since the early 90s.


RC*: Demonstrating the intensive Impact

SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]

Application | Speed-up factor | Power savings | Cost savings | Size savings
DNA and Protein sequencing | 8723 | 779 | 22 | 253
DES breaking | 28514 | 3439 | 96 | 1116

much less equipment needed, massively saving energy

*) RC = Reconfigurable Computing
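The rule of thumb that energy saving factors are roughly 10% of the speed-up can be checked against the two rows reported by El-Ghazawi et al. The sketch below only divides the figures quoted in the table; the dictionary layout and the function name are illustrative assumptions.

```python
# Speed-up and power-saving factors as quoted in the table above
# (El-Ghazawi et al., IEEE Computer, Feb. 2008).
RESULTS = {
    "DNA and Protein sequencing": {"speedup": 8723, "power_saving": 779},
    "DES breaking": {"speedup": 28514, "power_saving": 3439},
}


def saving_to_speedup_ratio(speedup: float, power_saving: float) -> float:
    """Return the power-saving factor as a fraction of the speed-up factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return power_saving / speedup


if __name__ == "__main__":
    for name, row in RESULTS.items():
        ratio = saving_to_speedup_ratio(row["speedup"], row["power_saving"])
        print(f"{name}: power saving is {ratio:.1%} of the speed-up")
```

Both ratios land close to one tenth, which is consistent with the "~10% of speedup" estimate for these two applications; it says nothing about other workloads.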

Drastically less Equipment needed

For instance: a hangar full of racks replaced by a single rack (or ½ rack) without air conditioning





Hetero HPC

SGI® RASC™ Module (Version 1)
• Xilinx Virtex II-6000 FPGA
• 16MB QDR SRAM
• Rack-mountable
• Dual NUMAlink™ 4 ports
• Seamless direct attach to server's shared memory fabric

SGI® RASC™ RC100 Blade
• Dual Virtex 4 LX200 FPGAs
• 80MB QDR SRAM or 20GB DDR2 SDRAM
• Blade or rack-mountable form factor
• Dual NUMAlink™ 4 ports
• Seamless direct attach to server's shared memory fabric


Cray-XD1 Architecture

The Cray-XD1 allows the Opteron µP to access the FPGA's internal registers and its internal and external memory.

It provides several transfer modes between the µP and the FPGA (depending on the initiator):

• The µP can read from / write to the FPGA local memory space (i.e. internal registers, internal BRAMs, and external memory).
• The FPGA can read from / write to the µP local memory space.

However, the use of an HLL can disable some of these features.

The most bandwidth-efficient transfer mode is the write-only mode (the producer initiates the transfer), either burst (for large amounts of data) or non-burst.


The silver bullet

Reconfigurable Computing is really the silver bullet for massively saving energy


We have to develop a good rescue strategy

scene „Green Computing“ Reinvent Computing

predicted energy saving factor of about 3 orders of magnitude


Bizarre FPGA Synthesis Market

Paradox of Pursuit

Synplify Saves Synthesis - Again

by Kevin Morris

Start analyzing the perplexing paradox of the FPGA synthesis market and each link of the chain reveals a bizarre force vector that eventually doubles back onto itself into an unlikely equilibrium that miraculously has held stable for a full decade despite disruptive forces of epic proportions. Rube Goldberg couldn’t have designed a more elegant confluence of convoluted causal relationships.




The History of Computing

(1)

The 1st electrical computer, ready prototyped for mass production?

Guess: which year, which company?

http://www.sgi.com/pdfs/3863.pdf

http://www.sgi.com/pdfs/3920.pdf






The History of Computing

(2) Prototype 1884: Herman Hollerith

• the first reconfigurable computer: datastream-based (DPU)
• size: about 3 refrigerators
• 1890: used for the US population census
• The first Xilinx FPGA came 100 years later


Early LUT

60 years later: RAM available – e. g. ferrite core

non-volatile configuration memory

field-programmable:

manually

swapping plug boards

(motivation for von Neumann machine paradigm)


80 years later


much larger than 3 refrigerators

just for a few ballistic tables:

the „von Neumann“ paradigm

von Neumann syndrome


the tremendous inefficiency of computers causes immense electricity consumption because of the von Neumann Syndrome


All but ALU is overhead: x20 efficiency

(data cache)

x20 inefficiency: just one of several overhead layers

[R. Hameed et al.: Understanding Sources of Inefficiency in General-Purpose Chips; 37th ISCA, June 19-23, 2010, St. Malo, France]


massive overhead phenomena

overhead | von Neumann machine
instruction fetch | instruction stream
state address computation | instruction stream
data address computation | instruction stream
data meet PU + other overh. | instruction stream
i / o to / from off-chip RAM | instruction stream
Inter PU communication | instruction stream
message passing overhead | instruction stream
transactional memory overh. | instruction stream
multithreading overhead etc. | instruction stream

(chart annotations: proportionate to the number of processors; overproportionate to the number of processors)





von Neumann overhead vs. Reconfigurable Computing

overhead | von Neumann machine | datastream machine
instruction fetch | instruction stream | none*
state address computation | instruction stream | none*
data address computation | instruction stream | none*
data meet PU + other overh. | instruction stream | none*
i / o to / from off-chip RAM | instruction stream | none*
Inter PU communication | instruction stream | none*
message passing overhead | instruction stream | none*
transactional memory overh. | instruction stream | none*
multithreading overhead etc. | instruction stream | none*


Critique of the von Neumann Model

Critique of von Neumann is not new:

Dijkstra 1968: The Goto considered harmful
R. Hartenstein, G. Koch 1975: The universal Bus considered harmful
Backus 1978: Can programming be liberated from the von Neumann style
Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style
Peter G. Neumann 1985-2003: 216x “Inside Risks“, 18 years on the inside back cover of Comm. ACM
Brad Cox 1990: Planning the Software Industrial Revolution
L. Savain 2006: Why Software is bad

overhead piles up to code sizes of astronomic dimensions

“von Neumann Syndrome”: C.V. Ramamoorthy, UC Berkeley

Nathan’s Law: Software is a gas. It expands to fill all its containers ... [Nathan Myhrvold]

Wirth‘s Law: “software is slowing faster than hardware is accelerating“


The transition from machine level to higher level languages led to the biggest productivity gain ever made

It‘s alarming that today‘s megabytes of code are compiled from languages at low abstraction levels (C, C++,Java)

The wrong Direction: by Herd Instinct ?

[Fred Brooks]


Java is a religion – not a language [Yale Patt]

Bud Lawson‘s Dilbert


Burroughs B5000/5500: language-friendly stack machine

IBM 360/370 & Intel x86: highly complex instruction sets

MULTICS (GE, Honeywell): well manageable (impl. in PL/1)

UNIX: complexity problems, compatibility problems

Pascal killed by C, coming as an infection, along with UNIX

unnecessary complexity

inside

Widening the Semantic Gap

[Harold „Bud“ Lawson]

„portable assembler language“


Scientific Revolutions

Newton's 1st Law (inertia): „people do not change direction“

scientific scenes follow the herd instinct


Apropos Herd Instinct


Some Programming Languages






40 years Software Crisis

F. L. Bauer 1968: coined „Software Crisis“

N. N. 1995: THE STANDISH GROUP REPORT

Robert N. Charette 2005: Why Software Fails; IEEE Spectrum, Sep 2005

Anthony Berglas 2008: Why it is Important that Software Projects Fail

Oct 1957 The Economist: Nov 19th 1955

In 1955, Parkinson could not have foreseen the impact of software.

The size of bureaucracy is independent of the amount of real work to be done.




term | controlled by | execution triggered by | paradigm
CPU | program counter (at ALU) | instruction fetch | instruction stream
DPU**, rDPU** | data counter(s) (at memory) | data arrival* | data-stream-based

*) “transport-triggered”
**) does not have a program counter - no instruction fetch

single paradigm (from the mainframe age) is obsolete
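The contrast between the two machine paradigms can be sketched in a few lines. This is a toy model, not a description of any real CPU or FPGA: the four-entry `PROGRAM`, the function names, and the idea of counting "fetches" as a cost measure are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Hypothetical computation y = 3*x + 1, accumulated over all inputs.
PROGRAM = [("load", 0), ("mul", 3), ("add", 1), ("acc", 0)]


def run_instruction_stream(data: List[int]) -> Tuple[int, int]:
    """CPU style: a program counter selects each instruction, and every
    executed instruction costs one instruction fetch."""
    total = 0
    fetches = 0
    for x in data:
        reg = 0
        pc = 0  # program counter
        while pc < len(PROGRAM):
            op, arg = PROGRAM[pc]  # instruction fetch
            fetches += 1
            if op == "load":
                reg = x
            elif op == "mul":
                reg *= arg
            elif op == "add":
                reg += arg
            elif op == "acc":
                total += reg
            pc += 1
    return total, fetches


def data_counter(start: int, stride: int, count: int) -> Iterator[int]:
    """Address generator located at the memory: yields data addresses."""
    for i in range(count):
        yield start + i * stride


def run_data_stream(data: List[int]) -> Tuple[int, int]:
    """DPU style: the datapath is fixed before run time; execution is
    triggered by data arrival, without any instruction fetch."""
    def datapath(x: int) -> int:  # stands in for the configured structure
        return 3 * x + 1

    total = 0
    for addr in data_counter(0, 1, len(data)):
        total += datapath(data[addr])
    return total, 0  # no instruction fetches at run time


if __name__ == "__main__":
    samples = [1, 2, 3, 4]
    print(run_instruction_stream(samples))
    print(run_data_stream(samples))
```

Both functions compute the same result; the difference is that the first pays one fetch per executed instruction, while the second is sequenced only by the data counter.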


+ New Machine Model for FPGAs: twin paradigm

term | controlled by | execution triggered by | paradigm
CPU | program counter (at ALU) | instruction fetch | instruction stream
DPU**, rDPU** | data counter(s) (at memory) | data arrival* | data-stream

*) “transport-triggered”
**) does not have a program counter - no instruction fetch


CPU-centric flat world model

Aristotelian model

This Software-only world model is obsolete (CS: introduced in the 40ies)

CPU-centric sequential-only mind set, but no hardware know-how

not visible from SE

[Figure axis: 10, 100, 1000, 10,000, 100,000, 1,000,000]


The Machine Model Dichotomy

(1) von Neumann versus data stream* machine

1st Step: The Generalization of Software Engineering

PE (Program Engineering) comprises:
• SE (Software Engineering): CPU
• FE (Flowware Engineering): asM (auto-sequencing Memory)

*) do not confuse with „dataflow“!









Software Engineering, 2nd Step:

PE (Program Engineering) comprises:
• SE (Software Engineering): CPU
• FE (Flowware Engineering): asM
• CE (Configware Engineering): structures, pipe network model etc.; DPU (Data-Path Unit), DPA (Data-Path Array)

mapping issue: time to time, time to space


Programming Model: Flowware

[Figure: stream graph with the nodes FMDemod, LPF1, Split, LPF2, LPF3, HPF1, HPF2, HPF3, Gather, Adder, Speaker. Source: MIT StreamIT]

• Pros for streaming [Pierre Paulin]
  – Streamlined, low-overhead communication
  – (More) deterministic behaviour
  – Good match for many simple media-rich applications
• Cons
  – control-dominated applications
  – shunt yard problem

We have to find out which application types and programming models students should exercise for the flowware approach
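A stream graph of the kind shown in the figure (filters chained in a pipeline, with a split and a gather around parallel branches) can be imitated with Python generators. This is only a sketch of the programming style: it is not StreamIt code, and the stage names `scale`, `moving_average`, `pipeline`, and `split_join` as well as the filter internals are placeholders chosen for illustration.

```python
from itertools import tee
from typing import Callable, Iterable, Iterator, List

# A stage consumes one stream of samples and produces another.
Stage = Callable[[Iterator[float]], Iterator[float]]


def scale(k: float) -> Stage:
    """Stage that multiplies every sample by k."""
    def stage(xs: Iterator[float]) -> Iterator[float]:
        for x in xs:
            yield k * x
    return stage


def moving_average(n: int) -> Stage:
    """Very simple low-pass placeholder: mean of the last n samples."""
    def stage(xs: Iterator[float]) -> Iterator[float]:
        window: List[float] = []
        for x in xs:
            window.append(x)
            if len(window) > n:
                window.pop(0)
            yield sum(window) / len(window)
    return stage


def pipeline(*stages: Stage) -> Stage:
    """Chain stages so that each sample flows through all of them."""
    def stage(xs: Iterator[float]) -> Iterator[float]:
        for s in stages:
            xs = s(xs)
        return xs
    return stage


def split_join(*branches: Stage) -> Stage:
    """Split: duplicate the stream to every branch.
    Gather + Adder: sum one output sample from each branch."""
    def stage(xs: Iterator[float]) -> Iterator[float]:
        copies = tee(xs, len(branches))
        outputs = [branch(copy) for branch, copy in zip(branches, copies)]
        for group in zip(*outputs):
            yield sum(group)
    return stage


def run(graph: Stage, samples: Iterable[float]) -> List[float]:
    """Drive the whole graph and collect its output stream."""
    return list(graph(iter(samples)))


if __name__ == "__main__":
    graph = pipeline(moving_average(2), split_join(scale(1.0), scale(2.0)))
    print(run(graph, [2, 4, 6]))
```

The program text describes only how stages are connected; no stage addresses memory or controls another stage, which is what makes the communication streamlined and the behaviour easier to predict.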




Programming Language Paradigms

(1)

language category | Computer Languages (Software Languages) | Languages f. Anti Machine (Flowware Languages)
both deterministic | procedural sequencing: traceable, checkpointable | procedural sequencing: traceable, checkpointable
operation sequence driven by: | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting, no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register | program counter | data counter(s)
address computation | massive memory cycle overhead | overhead avoided
Instruction fetch | memory cycle overhead | overhead avoided
parallel memory bank access | interleaving only | no restrictions
language features | control flow + data manipulation | data streams only (no data manipulation)

imperative language twins







Procedural Languages Twins

imperative Software Languages:
• read next instruction
• goto (instruction address)
• jump to (instruction address)
• instruction loop
• instruction loop nesting
• instruction loop escape
• instruction stream branching
• no: no internally parallel loops

systolic Flowware Languages:
• read next data item
• goto (data address)
• jump to (data address)
• data loop
• data loop nesting
• data loop escape
• data stream branching
• yes: internally parallel loops

But there is the Asymmetry

program counter data counter(s)

for data parallelism

super




The FPGA Programming Crisis


We need a good Textbook

N. Conner et al.: FPGAs for Dummies; Wiley, 2008


Acceleration Mechanisms

• parallelism by multi bank memory architecture
• auxiliary hardware for address calculation
• address calculation before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• optimization by storage scheme transformations
• optimization by memory architecture transformations
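One of these mechanisms, avoiding multiple accesses to the same data, can be illustrated with a 3-point sum over an array. The sketch below counts memory reads for a naive loop and for a streamed version that keeps the last three items in a small buffer; the example computation, the read counters, and the function names are assumptions for illustration only.

```python
from collections import deque
from typing import Deque, List, Tuple


def stencil_naive(mem: List[int]) -> Tuple[List[int], int]:
    """Each output re-reads its three neighbours from memory."""
    reads = 0
    out = []
    for i in range(1, len(mem) - 1):
        out.append(mem[i - 1] + mem[i] + mem[i + 1])
        reads += 3
    return out, reads


def stencil_streamed(mem: List[int]) -> Tuple[List[int], int]:
    """Each item is read from memory once and kept in a 3-entry buffer
    (the software analogue of a shift register next to the datapath)."""
    reads = 0
    out = []
    window: Deque[int] = deque(maxlen=3)
    for x in mem:  # addresses follow a fixed pattern known before run time
        reads += 1
        window.append(x)
        if len(window) == 3:
            out.append(sum(window))
    return out, reads


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(stencil_naive(data))
    print(stencil_streamed(data))
```

For n items the naive version issues 3·(n−2) reads while the streamed version issues n, so the saving approaches a factor of three for this window size and grows with wider windows.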


The language and tool disaster

Software people do not speak VHDL

Hardware people do not speak MPI nor OpenMP

Bad quality of the application development tools

Poll at FCCM’98: 86% of designers hate their tools

progress stalled by qualification problems in industry and academia

Not only in embedded systems: comprehensibility barrier between procedural and structural mind set

Software people urgently need locality awareness






New boundary constraints are the limiting factor


Legacy scientific applications: predominantly sequential

The entire software ecosystem will need to evolve (including curricula): O/S, libraries, software development environments, compilers and languages

additional levels of parallelism: chaining, pipelining, systolic, super-systolic, wavefront arrays

additional data structures and storage organization: the new distributed memory discipline

New boundary constraints


HLL programming models



Taxonomy of Twin Paradigm Programming Flows (HPRC)

E. El-Araby et al.: Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology And Empirical Study; Proc. SPL2007 Symp., Mar del Plata, Argentina, Febr. 2007


Dual paradigm mind set: an old hat

Software mind set: instruction-stream-based: flow chart -> control instructions

Mapped into a Hardware mind set: action box = Flipflop, decision box = (de)multiplexer

(mapping from procedural to structural domain)

1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc.

1972: C. G. Bell et al: The Description and Use of Register-Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972

[Figure: flipflops (FF), token bit, evoke]


Multicore Programming Requirements

Efficient distribution of tasks


Layers of abstraction hide critical sources of and limits to efficient parallel execution

Being memory limited

Internode communications (data assembly & dispatch) reduces computational efficiency: speedup/nodes

Result: scaled up cost, power, cooling and reliability concerns
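The efficiency measure named above, speedup/nodes, can be written down directly. The sketch below uses made-up measurements purely to show how the ratio falls when communication overhead grows with the node count; the numbers and the function name `parallel_efficiency` are hypothetical and do not come from the talk.

```python
from typing import List, Tuple


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Computational efficiency as speed-up divided by node count."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Hypothetical (nodes, observed speed-up) pairs, for illustration only.
MEASUREMENTS: List[Tuple[int, float]] = [(1, 1.0), (8, 6.0), (64, 24.0)]

if __name__ == "__main__":
    for nodes, speedup in MEASUREMENTS:
        eff = parallel_efficiency(speedup, nodes)
        print(f"{nodes:3d} nodes: speed-up {speedup:5.1f}, efficiency {eff:.3f}")
```

An efficiency well below 1 means that most of the added nodes, and the power and cooling they require, are paying for communication rather than computation.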


how Programmers think

“Parallel programming: informal approaches are not working” [Mattson]


“We must adopt a systematic approach by insight into how programmers think” [Mattson]

We must adopt a systematic approach by changing how programmers think [R. H.]





Newer Developments in Semiconductor Technology

Limits by increasing power density

significant problems in performance, power consumption and reliability: great challenges for Reconfigurable Computing.

the golden

CMOS era

is gone

Technology scaling does not deliver significant performance speedup

transistors less reliable: additional sources of errors*
• defective at manufacture time
• degrade and fail over the expected lifetime (EM, HCD, TDDB)
• process variations
• increasing number of soft errors

December 28, 2009

Fault-Tolerance Techniques needed

*) J. M. P. Cardoso, M. Hübner (editors): Reconfigurable Computing; Springer, 2011
*) S. Borkar: Designing reliable systems from unreliable components; 2005

Fault Tolerance


CPU

hardwired

accelerators

reconfigurable

accelerators

Fault Tolerance Implementation

accelerators


Neurocomputing


The Memristor

hp: discovered 2008, production announced for 2013

resistor with memory: doping moved by electric field

direct synapse emulation replacing massively inefficient digital simulation

fewer transistors for logic circuits

third paradigm:


Mem(r)istor History

1960: missing device postulated by Prof. Karl Steinbuch
1963: Memistor Corp. founded by Prof. Bernie Widrow, Stanford U.
1971: Leon Chua, UCB, specifies Memristor
2007: Stan Williams, hp, finds Memristor
2013 ? agreement Hewlett Packard / Hynix Semiconductor

Photo: Storz 1975
[Picture: Leon Chua]


normal Hype Curve


[Olivier Temam: The Rebirth of Neural Networks; 37th ISCA, June 19-23, 2010, Saint-Malo, France]

[Olivier Temam, 2010]


Neural Network Hype Curve




(Olivier Temam never mentions Karl Steinbuch)






stopped Funding for 15 Years (world-wide) [Olivier Temam, 2010]

Marvin Minsky, Seymour Papert: Perceptrons; 1969.

SVM: Support vector machines: set of supervised learning methods to analyze data and recognize patterns


Marvin Minsky‘s false alarm

1960: Steinbuch's Lernmatrix

20 years earlier!

1962: Karl Steinbuch (was ignored by Marvin Minsky's book)

W. Hilberg: Karl Steinbuch, ein zu Unrecht vergessener Pionier der Künstlichen Neuronalen Systeme (Karl Steinbuch, an unjustly forgotten pioneer of artificial neural systems); FREQUENZ 1995, 49#(1-2):28-35.

[Olivier Temam, 2010]


What ANNs can do


Defects-Tolerant Accelerators ?




Triple paradigm systems?


accelerators CPU

hardwired

accelerators

reconfigurable

accelerators

(self-reconfigurable) neurocomputing

Self-organizing Fault Tolerance


Development with VHDL is expensive

FPGAs’ Achilles’ heel is their long development time:

– Relatively low-level HDLs (VHDL/Verilog) are still dominant
– A large part of FPGA solution development is spent on learning specific FPGA board APIs and debugging in hardware (70% in our experiments!)
– Unlike software, FPGAs do not currently offer forward/backward compatibility, not even within the same family!
– FPGAs have a relatively low technology maturity and small user base compared to software

Courtesy of Dr Khaled Benkrid, University of Edinburgh

In 2009, Berkeley Design Technology Inc. (BDTI), an independent benchmarking and analysis firm, launched the BDTI High-Level Synthesis Tool Certification Program™ to evaluate high-level synthesis tools for FPGAs.

[1] Grant Martin, Gary Smith: “High-Level Synthesis: Past, Present, and Future,” IEEE Design and Test of Computers, July/August 2009.





The Role of Reconfigurable Computing

• Reconfigurable Computing

• Using the power of FPGAs they hope to solve the multi-core crisis

• Or, in this case, confusing computing with processor cores

• For many years FPGAs were just prototyping vehicles for ASICs

• Now they are replacing many ASICs & ASSPs

• Watch for the same Trojan effect with FPGAs in HPC

• Reconfigurable computing is a key part of the solution for concurrent programming


Architectural Impact [Patrick Lysaght]

• Only very high volume architectures transition to leading processes

• Programmability and concurrency are the new architectural imperatives

• MPSoCs evolve into heterogeneous, multi-core architectures

• Power dissipation is a dominant concern

• Design productivity lags silicon progress


EPP

• Xilinx EPP Solution

• Processor-centric approach

• Software-centric approach

• ARM® processing engine


Outline









• Conclusions


We urgently need to reinvent computing

Conclusions

We should begin as early as we can still afford retrofitting.

But this will require a major effort for many years.

This will create many, many new jobs.

We need „une levée en masse“


Conclusions

To avoid future unaffordability of our cyber infrastructure we need a massive software-to-configware migration

… which saves much more energy than most proposals from the climate protection scene

… and is impossible without reinventing programmer education

The migration of the huge supply of legacy software creates masses of jobs for decades.

RC is the silver bullet

We have to hurry to mobilize the public and the media, which currently completely ignore this vital world-wide issue

We must hurry to start the required time-consuming massive campaign while we can still afford it





Programming Datastream

• Accelerate tasks by streaming

• MISD structured computation: streaming computations across a long array before storing results in memory.

• Can achieve a 100x improvement in the use of memory.
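
The streaming idea can be illustrated in software with a minimal sketch. It only mimics the principle (chained stages, results stored once at the end) and is not an FPGA implementation; the stage functions `scale`, `offset` and `clip` are hypothetical examples, not taken from the slides.

```python
def scale(stream, factor):
    # Stage 1: multiply each element as it passes through.
    for x in stream:
        yield x * factor

def offset(stream, delta):
    # Stage 2: add a constant to each element.
    for x in stream:
        yield x + delta

def clip(stream, lo, hi):
    # Stage 3: limit each element to the range [lo, hi].
    for x in stream:
        yield max(lo, min(hi, x))

def run_streamed(data):
    # Elements flow through all stages one by one;
    # only the final result is stored in memory.
    return list(clip(offset(scale(data, 2), -3), 0, 10))

def run_staged(data):
    # Conventional version: every stage writes a full intermediate array.
    a = [x * 2 for x in data]
    b = [x - 3 for x in a]
    return [max(0, min(10, x)) for x in b]

if __name__ == "__main__":
    print(run_streamed(range(8)))  # [0, 0, 1, 3, 5, 7, 9, 10]
```

Both functions compute the same result; the streamed version never materializes the two intermediate arrays, which is the kind of memory traffic the slide refers to.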


Reinvent? (final remark)

avoid traditional tunnel views

to obtain new perspectives

rediscovery and revival of old ideas

rearrange and teach them properly

to reach promising new horizons



Obrigado!


http://hartenstein.de/reinvent-m.pdf


Debunking the GPU Myth

R. Vuduc et al.: On the Limits of GPU Acceleration; USENIX Workshop HotPar’2010, June 14 - 15, 2010, Berkeley, CA, USA

R. Bordawekar, U. Bondhugula, R. Rao: Believe it or Not! Multicore CPUs can Match GPUs for FLOP-intensive Applications! IBM Research Report, April 23, 2010, Yorktown Heights, NY, USA

V. Natoli: Kudos for CUDA, HPCwire, July 06, 2010, http://www.hpcwire.com/features/Kudos-for-CUDA-97889444.html

CUDA: a programming language

code easier to maintain, the maturity of its compilers, elegance or simplicity [Natoli]

CPUs and GPUs much closer in performance (2.5X) than the reported orders of magnitude:

V. W. Lee et al.: Debunking the 100X GPU vs. CPU myth; 37th ISCA, June 19-23, 2010, Saint-Malo, France


END


Locality awareness is essential for flowware

How data are moved:

Software: by addresses, read from instructions; here locality is less relevant

Flowware: by wire (configured before run time); the relation to configware calls for locality awareness





Education Revolution

by twin paradigm co-education:

traditional qualification in the time domain

+ lean qualification in the space domain

= lean hardware modeling qualification at a higher level of abstraction


New Programmer Education

New mix of skills needed, currently not available

essential: awareness of locality, focusing on memory mapping issues and transfer modes to detect overhead and bottlenecks

understanding streams through complex fabrics


Two classes of solutions

Migration of a particular algorithm to RC

Understanding a complex modern hetero system to detect overhead and bottlenecks


understanding architecture

[Figure: heterogeneous system architecture. Labels: NoC; memory; ASIC; ASIP; FPGA; µP; I/O; off-chip memory, streams; the memory wall; several transfer modes; reconfigurable accelerators; hardwired accelerators; many-core]

3% ASIC, 97% FPGA [Dataquest March 25, 2009]


New Book on NoC

Jih-Sheng Shen, Pao-Ann Hsiung (editors): Dynamic Reconfigurable Network-on-Chip Designs: Innovations for Computational Processing and Communication; Information Science Reference, Hershey, USA, April 1, 2010


Visible architecture

The programming model: the hardware view presented to the programmer, i.e. which parts of the hardware architecture are visible and under the programmer’s direct control.

RC programming model: whether (and how) the programmer can control data transfers between

- FPGA and onboard memory,

- FPGA and microprocessor memory, as well as

- FPGA and microprocessor.





the program counter is the problem

the program counter indicates the problem

using data counters is much more efficient


Rewriting needed anyway

• Rewriting of software is needed anyway: for the survival of the µP industry (to cope with the transition to manycore)

• Extended scope of software rewriting: to save energy by orders of magnitude

• different from „classical“ green computing

strong synergy


Green computing vs. Reinvent Computing

Scene: „classical“ Green Computing (GC) | Reinvent Computing (RC)

Predicted energy saving: factor of about 3 | orders of magnitude

Status: already on track | reinventing programmer education needed for 2 reasons, also for µP industry survival

To do: funding continued | years of massive world-wide action required; support by media needed


The Anti Machine

Generalization of the DMA

Uses data counters instead of a Program Counter

Does not need memory-cycle-hungry instruction streams

[M. Herz et al.: IEEE ICECS 2003, Dubrovnik]

[Figure: ASM (Auto-Sequencing Memory): RAM with data counter (GAG), delivering a data stream]
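
A minimal software sketch of the data-counter principle may help. It assumes a simple base-plus-stride address sequence, which is only one possible pattern; the names `address_generator`, `auto_sequencing_memory` and `datapath_sum_of_squares` are hypothetical and do not come from the cited paper.

```python
def address_generator(base, stride, count):
    # Data counter: the address sequence is fixed ("configured") up front,
    # so no instruction has to be fetched to compute each address.
    addr = base
    for _ in range(count):
        yield addr
        addr += stride

def auto_sequencing_memory(ram, base, stride, count):
    # Memory block that delivers a data stream driven by its data counter.
    for addr in address_generator(base, stride, count):
        yield ram[addr]

def datapath_sum_of_squares(stream):
    # Fixed data path: consumes the stream, one item per step.
    acc = 0
    for x in stream:
        acc += x * x
    return acc

if __name__ == "__main__":
    ram = list(range(16))
    stream = auto_sequencing_memory(ram, base=1, stride=4, count=4)
    print(datapath_sum_of_squares(stream))  # 276
```

The data path never sees an address or an instruction: it only receives the stream, which is the contrast to a program-counter-driven machine drawn on this slide.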


Bio of Reiner Hartenstein


http://hartenstein.de/Hartenstein-bio.pdf


Absence of the Need to Think

• Too much effort?

• “The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” - Daniel Slotnick, 1967







Need a new world model

Reconfigurability is the silver bullet to obtain massively better energy efficiency as well as much better performance by HPRC, the upcoming heterogeneous methodology.

The impact is a fascinating challenge to reach new horizons of research in computer science.

We need a new generation of talented innovative scientists and engineers to start the second history of computing.

This chapter discusses its new world model.

We need a massive campaign for the migration of software over to configware. Because of the multicore parallelism dilemma, we need to reinvent programmer education anyway.


Platform Collision

Industry faces 'platform collision'

Which platform technology will win in the long run? Will it be the ASIC, ASSP, FPGA, MCU or IP core? And which company will be left standing?

"It's not clear, and all may coexist" [Brad Howe, VP IC, Altera]

Far Future is Cloudy!

Battles will get even more interesting if/when the parallel programming crisis is over

NoC research: world-wide >60 projects


versatility and heterogeneity

The semiconductor industry in all its history has not seen anything that can match a microprocessor or FPGA in terms of versatility and heterogeneity of potentials.

Not long ago, at the beginning of the last decade, the reconfigurable computing research community developed a serious crush on coarse-grain reconfigurable hardware and FPGAs.

Computation in time vs. computation in space was a major focus.


Von Neumann coarse-grained

Reconfigurable Architectures ?

Von Neumann once again came back as a “hero” to the community, telling us that, as a team in the form of multi/many cores, he can compete with FPGAs and exploit features of non-von-Neumann coarse-grained reconfigurable architectures. That opened a new portal of research and products for academia and industry, including progress in Networks on Chip (NoCs).


Market failure reasons

The reasons for failure are more commercial than technical.

The market dominance of well-established players has kept the competitive stakes quite high for new entrants:

companies with low differentiation failed badly;

in comparison, innovative startups with strong differentiation succeeded in either finding a market niche or being acquired by a bigger company, which bought them to strengthen its product portfolio or existing technology.





RTL vs Software

programming battle

At this point we can also see the most challenging battle at present: FPGAs vs. MPSoC-like platforms.

It is RTL vs. software programming.

The programming of multicore is discussed, along with the efforts under way to address it with open solutions like OpenMP (www.openmp.org) and with several tools from Intel that help programmers exploit its multicore processors.


Processor inside FPGA vs

FPGA inside Processor: EPP

The concept totally changed for these new devices.

This makes the device more like the heterogeneous SoCs discussed in the last section (fig. 4), which gives the devices significant benefits for high-performance applications:

Automotive Driver Assistance,

Intelligent Video Surveillance,

Wireless Communications,

Industrial, etc.

FPGAs became software-centric, not hardware-centric: EDUCATION !!!!


Power consumption

Power consumption is becoming a severe problem for future integrated circuits, and therefore power-efficient solutions will be very important.

Consequently, the challenge for reconfigurable computing is to show that the customization and massive parallelism of reconfigurable hardware can overcome its power consumption overhead over ASICs, providing power-efficient solutions.

Reconfigurable computing: opportunity to provide such solutions for future systems.




Language-of-the-Year Phenomenon

[courtesy Richard Newton]

KARL


Some special Languages






Some Programming Languages


Some Parallel Languages

ANN

understanding architecture

[Figure: heterogeneous system architecture including ANNs. Labels: NoC; memory; ASIC; ASIP; FPGA; µP; I/O; ANN; off-chip memory, streams; the memory wall; several transfer modes; reconfigurable accelerators; hardwired accelerators; self-reconfigurable accelerators; many-core]


threshold logic:

Neuron Model

x1 + x2 + x3 ≥ 1

x1 + x2 + x3 ≥ 3
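
The two inequalities above can be checked with a minimal sketch of a threshold neuron with unit weights: a threshold of 1 yields a 3-input OR, a threshold of 3 yields a 3-input AND. The function names are hypothetical, and the optional `weights` parameter is an addition for illustration.

```python
from itertools import product

def threshold_neuron(inputs, threshold, weights=None):
    # Fire (return 1) when the weighted input sum reaches the threshold.
    if weights is None:
        weights = [1] * len(inputs)  # unit weights, as in x1 + x2 + x3
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def truth_table(threshold, n=3):
    # Evaluate the neuron for every combination of n binary inputs.
    return {bits: threshold_neuron(bits, threshold)
            for bits in product((0, 1), repeat=n)}

or_table = truth_table(threshold=1)   # x1 + x2 + x3 >= 1
and_table = truth_table(threshold=3)  # x1 + x2 + x3 >= 3

if __name__ == "__main__":
    for bits in or_table:
        print(bits, "OR:", or_table[bits], "AND:", and_table[bits])
```

Only the threshold changes between the two functions; the structure of the neuron stays the same.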


Memristor

technology discovered at HP, 2008

resistor with memory

TiO2 semiconductor: high resistance; made conductive by doping

resistance manipulated by moving the doping via an electrical field

“predicted” by UCB 1971

Postulated: KIT 1960

Widrow’s Memistor 1963-65





FPNA

Field-Programmable Neuron Array

logic function depends on resistor dimensioning

Memristor LUT


Teachable Neuron

generalization of the LUT: from Boolean algebra to Steinbuch algebra

from Reconfigurable Computing to Reconfigurable Neuro Computing


Miscellaneous


FPGA to ASIC design start ratio

3% ASIC, 97% FPGA [Dataquest March 25, 2009]


Speed-up by MoM-1 compared to 68020

PISA project


No more cheap oil






Fat vs Slim processor cores

It is possible that within a few years a standard will emerge for it, which could benefit emerging technologies like MPPAs that use smaller RISC machines than high-end processors in order to exploit more parallelism (thin vs. thick, or fat vs. slim, processor cores).

Currently the major focus of industry is to get tools for the high-end multicore processor market of Intel/AMD, ARM etc.

These companies are slowly and carefully increasing their core counts, taking their legacy software and the maturity of their tools into consideration.


New roadmaps of FPGA giants

We covered the fundamental strengths and weaknesses of FPGAs.

We saw how new technologies evolved and tried to address specific market segments where they can provide a better solution than FPGAs, by using the weaknesses of FPGAs as their strength.

However, FPGAs have also changed dramatically over time, and the FPGA vendors are well aware of the pros and cons of their technology.

The most recent announcements by the FPGA giants of adding hard processor blocks are a milestone for FPGAs and a response to the competing technologies.


Tilera

Figure 3 shows the Tile64 device of Tilera Corporation (www.tilera.com).

It is a nice example of massively parallel processor arrays.

The architecture style is again regular like FPGAs.

In case of Tilera each tile is a processor core which can run a full operating system, or multiple tiles together can run a multi-processing operating system like SMP Linux.

The processor cores are connected by their iMesh on-chip network.

Their programming tool suite MDE (Multicore Development Environment) provides ease of programming with ANSI C/C++.


Run time support of RC

Challenges to runtime support of a reconfigurable system:

Online monitoring;

Load balancing;

HW dependable SW;

Visualization;

Runtime resource management and scheduling;

Very fast re-layout for dynamic reconfiguration;

Managing adaptive dynamic routing.

A reconfigurable system with the above characteristics is far from trivial to develop.

Challenging issues in ES: developing generic embedded platforms to improve productivity and reusability.

ES domain application requirement examples: being energy efficient and/or safety critical (even more challenging).


Structured ASICs

Structured ASICs are a class of devices based mostly on an FPGA-like architecture, with a special configuration mechanism to program the device at mask level.

This greatly reduces the cost and provides enhanced performance; however, once created, the device is not re-programmable.

eASIC is a prominent example in this regard.

Xilinx and Altera also offer similar solutions for mass production: EasyPath and HardCopy, respectively.


The new developments in semiconductor technology

The new developments in semiconductor technology make Reconfigurable Computing a widely used solution for future systems.

Reconfigurable Computing can achieve such a goal; however, several improvements are required.

Difference to ASICs:

an order of magnitude more resources

an order of magnitude higher delay

an order of magnitude higher power consumption

in total: three orders of magnitude higher Area*Time*Power product than ASICs.





Universal nature due to prototype capability

The greatest strength of FPGAs is that they have universal capabilities due to their prototyping ability & HDL programming model.

This is not true for a microprocessor.

This power helps FPGAs absorb complex functionalities in form of Hard Macro blocks.

It can be a processor, an IP or anything else.

Since the programming model is HDL it gives instant usability of the component without any burden of new standards or languages.

Highly mature in-house or 3rd party synthesis tools are available due to standard RTL flow.


fine grain vs coarse grain

Ensuring high flexibility in interconnecting these LUTs requires a huge amount of routing, composed of programmable switches and their configuration, which takes a significant area of the device.

This gave rise to new architectural concepts that reduce the fine-grained flexibility of FPGAs to a coarser-grained and, moreover, application-specific one; this is inherent, because the application domains narrow when the level of flexibility changes.

However, the resulting solutions are orders of magnitude better in performance, power and cost when compared to general purpose FPGAs.

rebooting ?

Rebooting after each crash ?

… prevented rebooting the ACM/IEEE task force on curriculum recommendations


Multimedia in the Multicore Era

Multimedia Performance Needs

application performance needs up to:

Audio: 800 MIPS

Graphics: 11 GOPS

Video: 160 GOPS

Digital TV: 900 GOPS

[Pierre Paulin, MPSoC’09]

needed performance growing faster than Moore‘s law [courtesy E. Sanchez]

[Figure: relative performance (MIPS) vs. year, 1994 to 2030. Labels: begin of the multicore era; GSM, GPRS, EDGE, UMTS, next standard]


FPGA tools


IP eco-system is RTL dominated

The RTL flow of FPGAs provides an added benefit to the IP eco-system of the industry.

It is easier to port IPs for both ASICs and FPGAs, as both use RTL.

This also holds true for FPGAs of different vendors: because they all use an RTL flow, porting a design to another FPGA is not extremely complicated, unlike with microprocessors, where legacy code plays a major role in success and market dominance.

Furthermore, as RTL is inherently parallel, a mapped application is automatically and optimally parallelized by the CAD tools, utilizing the best of the target hardware resources (this is still one of the major difficulties for multicore and multicore-like solutions).





RTL Programming

- Has become the programmable platform language of silicon

- Highly mature tools

- Path to ASICs/ASSPs, across FPGAs

- No programming crisis (the rising issue is compile time, not programming!)

- In theory can implement anything

- Relative ease of absorbing functionalities into hard blocks and going heterogeneous

- IP ecosystem is RTL dominated

- Attractive target for IP providers


Programming successful

With FPGAs or successful multicore-like solutions it is obvious that programming is always HDL or ANSI C/C++ and now ESL (Electronic System Level) at industrial level is bridging HDL and ANSI C/C++.

Multicore and Massively Parallel Processor Arrays (MPPAs)



Menta Startup: early research state

Founded 2007 in Montpellier, Fr: focused on eFPGAs*.

The technology is highly scalable and target independent;

customers immediately benefit from having an eFPGA in their system

and, based on the constraints of their target market, can go for a custom solution for a specific node.

Creating highly customized domain-specific eFPGAs for the market segment of ASICs and ASSPs,

so the target market segment is different from that of FPGAs.

*) embedded FPGAs

1444 LUT, 120 nm; press release, Q4 2010 (Laurent Rougé, Menta founder and CEO): provides embedded-FPGA (eFPGA) technology for SoC (System on Chip); eFPGA Programmer® tool suite.


Menta tool suite

The eFPGA Creator tool suite of the company allows the creation of a customized eFPGA core in a user-friendly GUI environment.

Built-in analysis tools and close coupling with backend silicon tools help to build, analyze and validate the architecture and to fine-tune it to target needs.

Close collaboration with LIRMM (University of Montpellier), working on the use of MRAM (Magnetoresistive Random Access Memory) for non-volatile configuration and superior architectural benefits for partial/dynamic reconfiguration and multi-context, compared to SRAM and FLASH, with ease of fabrication in a standard CMOS process compared to FLASH.


Cylindrical Model:

Accelerated System


Operations within the

Cylindrical Model

• Cylinder contains the data flow graph of the kernel / application

• Diameter (and base) is determined by the I/O bandwidth from data memory. This is usually a function of the technology and type of operation.

• Height is determined by data flow size (number of operations) and effective base size.

• Parts of the data flow graph must be folded / expanded to maintain constant operational diameter





Parallelism within the

Cylindrical Model

• The cylinder model supports all forms of parallelism; MIMD / SIMD but most naturally supports streaming (MISD).

• The model assumes a defined, static data flow graph which is then realized by streaming or pipelining operations.

• It also requires a synchronizing technology (e.g. FIFOs) to assure data coordination at an operational node.

• As each node is activated each cycle, the entire data flow graph is executed each cycle
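
The last two bullets can be sketched as a small software simulation: a static, linear data flow graph whose nodes are connected by FIFOs, with every node firing once per cycle. This is an illustrative sketch under simplifying assumptions (a linear chain, one token per cycle); `run_pipeline` and the three example stages are hypothetical names, not part of the model's definition.

```python
from collections import deque

def run_pipeline(stages, inputs):
    n = len(stages)
    # fifos[i] feeds stage i; fifos[n] collects the results.
    fifos = [deque() for _ in range(n + 1)]
    pending = deque(inputs)
    cycles = 0
    while pending or any(fifos[:n]):
        # One clock cycle: accept one new token from data memory ...
        if pending:
            fifos[0].append(pending.popleft())
        # ... then fire every node once. Going from the last node to the
        # first makes each token advance exactly one stage per cycle.
        for i in reversed(range(n)):
            if fifos[i]:
                fifos[i + 1].append(stages[i](fifos[i].popleft()))
        cycles += 1
    return list(fifos[n]), cycles

if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    results, cycles = run_pipeline(stages, [1, 2, 3, 4])
    print(results, cycles)  # [1, 3, 5, 7] 6
```

Once the pipeline is full, one result leaves per cycle, so 4 inputs through 3 stages take 4 + 3 - 1 = 6 cycles in this sketch.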


Design methods and tools

Tools have impact on designer productivity

Synthesis tools focus on:

automatically mapping high-level descriptions into efficient hardware implementations according to performance metrics, such as speed, size, and power consumption.

They can support verification of functionality, timing and testability.

This covers all the main steps of synthesis and analysis, including:

capturing domain-specific knowledge,

profiling,

design space exploration,

multi-core partitioning,

hardware/software (H/S) system partitioning,

data representation optimization,

static and dynamic reconfiguration,

optimal custom instruction set generation,

functional simulation,

programming mixed SW/RH systems,

ensuring effective cross-boundary communication (another challenging issue).


Miscellaneous


ICT market at an inflection point

Cowhey‘s & Aronson‘s Law

Senior Counselor to the U.S. Trade Representative (USTR) on strategy and negotiations.

Prosperity depends on network capacity, ..., efficient pricing, flexible platforms, & ...

Broadband is significant at the inflection point, prompting major market governance changes; massive funding needed

The battle for the living room & mobile is more important than the PC market.

... Cheap Revolution: • affordable broadband • software performance • low power


"Imagination is intelligence with an erection" — Victor Hugo

how programmers think

We must imagine how programmers think


Only a fraction of the chip used

In current general purpose architectures only a small fraction of the chip is dedicated to carrying out useful computations; the remaining resources go into the memory hierarchy and into modules that serve performance only indirectly (branch predictor, pipeline control).

Exploiting RP to speed up computationally intensive tasks.

To deploy this at larger scale, several challenging issues need to be addressed:

- Supply voltage reduced by more than 15% per technology generation, in order to keep the power consumption low

- Operating frequency increasing by 20-30% per annum.





Too many HDLs


Area cost not a limiting factor

rapid increase of on-chip devices (currently billions of transistors) & large number of metal layers

Reconfigurable computing can fill, at least partially, the above gap in the missing performance speedup.

Due to power limitations, not all resources can be active at the same time; such resources are then used to offer reconfigurability and flexibility on a chip, targeting fault tolerance, better performance, or certain lower-power computations.

Reconfigurable hardware area cost is no longer a limiting factor: resources get “cheaper”

New trends in industry


Field-Programmable Gate Array

[Figure: FPGA (Xilinx 1984). Labels: array of CLBs (Configurable Logic Boxes); switch box, forming a wire; connect box, connecting to a CLB; function select; inputs A, B]





Data meet the Processing Unit (PU)

We have 2 choices:

by Software: routing the data by memory-cycle-hungry instruction streams through shared memory

by Configware: placement of the execution locality ...


The Tail wagging the Dog


CPU „Central Processing Unit“


(almost) everything However,

it needs

accelerators

accelerators CPU


von Neumann dominance

Even hardware design went von Neumann about 1969

instruction streams + microinstruction streams

Microprogramming: nested von Neumann machines

[G. Koch et al.: The universal Bus considered harmful; 1st EUROMICRO Symp., June 1975, Nice, France]

nested von Neumann bottlenecks:

multiple multiplexing overhead:


Massive Overhead Phenomena


overhead | von Neumann machine | RC

instruction fetch | instruction stream | none

state address computation | instruction stream | none

data address computation | instruction stream | none

data meet PU | instruction stream | none

i/o to / from off-chip RAM | instruction stream | none

multi-threading overhead | instruction stream | none

… other overhead | instruction stream | none


Acceleration by FPGA

• Only one tenth the frequency.

• Magnitude of parallelism overcomes frequency limitations.

• Stream data across large cell array, minimizing memory bandwidth.

• Customized data structures, e.g. 17-bit floating point; always just enough precision.

• A software (re)configurable technology

• Needs an in-depth application study to realize acceleration;

• acceleration requires more programming effort (acceleration is not automatic).
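
The trade-off between clock frequency and parallelism in the first two bullets can be made concrete with a first-order throughput model. This is a rough sketch under strong assumptions (the application is fully streamable, memory bandwidth is not the bottleneck); the clock values and the parallelism of 500 are hypothetical illustration numbers, not measurements.

```python
def estimated_speedup(parallel_ops_per_cycle, fpga_clock_mhz,
                      cpu_clock_mhz, cpu_ops_per_cycle=1):
    # First-order model: throughput = operations per cycle * clock rate.
    fpga_throughput = parallel_ops_per_cycle * fpga_clock_mhz
    cpu_throughput = cpu_ops_per_cycle * cpu_clock_mhz
    return fpga_throughput / cpu_throughput

def break_even_parallelism(fpga_clock_mhz, cpu_clock_mhz, cpu_ops_per_cycle=1):
    # Parallel operations per cycle the FPGA needs just to match the CPU.
    return cpu_ops_per_cycle * cpu_clock_mhz / fpga_clock_mhz

if __name__ == "__main__":
    # Hypothetical: FPGA at one tenth of the CPU clock frequency.
    print(break_even_parallelism(300, 3000))   # 10.0
    print(estimated_speedup(500, 300, 3000))   # 50.0
```

In this model an FPGA at one tenth of the clock frequency breaks even at 10 parallel operations per cycle; anything beyond that is net speed-up.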


Strength and Weaknesses

Section 2 will present the strongest potentials of FPGAs along with their weaknesses, which created opportunities for other solutions.

In section 3, to complete the scenario, we will discuss structured ASICs, which are similar to FPGAs in architecture and offer an interesting tradeoff between FPGAs and ASICs for low to average volumes.

Section 4 will discuss the emerging technologies that have tried to take a niche from the FPGA market share. We will show how these technologies have been inspired by FPGAs and have used the multicore concept to compete with FPGAs.

Section 5 will provide a brief overview of the new FPGA startup companies that have emerged in the past few years.

Section 6 will give a glimpse of the high-end heterogeneous MPSoC platforms.

Section 7 will present what FPGA vendors have learned in all these years from their challenges and emerging competition, and how they are adapting their future devices to it.





FPGA to ASIC Gap

• Measuring the Gap between FPGAs and ASICs

• 30-40X Area,

• 12-14X Power,

• 3-5X Speed

• Like µProcessor:

• high price of flexibility
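
As a quick cross-check, the three gap factors above can be multiplied into a combined Area*Time*Power gap. The sketch assumes that the factors are independent and can simply be multiplied, and that "3-5X Speed" means the FPGA is 3 to 5 times slower; both are assumptions made for illustration.

```python
import math

# Gap ranges copied from the slide: (low, high) factor of FPGA relative to ASIC.
GAP = {"area": (30, 40), "power": (12, 14), "delay": (3, 5)}

def combined_gap(gap):
    # Multiply the lower bounds and the upper bounds of all factors.
    low = math.prod(lo for lo, _ in gap.values())
    high = math.prod(hi for _, hi in gap.values())
    return low, high

if __name__ == "__main__":
    low, high = combined_gap(GAP)
    print(low, high)  # 1080 2800
    print(math.log10(low), math.log10(high))  # orders of magnitude
```

Under these assumptions the combined gap lies between roughly 1,000x and 3,000x, i.e. about three orders of magnitude.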


Productivity vs. Efficiency



Evaluation Metrics


The different HLL paradigms/approaches:

- imperative programming (Impulse-C)

- functional programming (Mitrion-C)

- schematic/graphical programming (DSPLogic)

Tarek’s evaluation metrics:

- the explicitness of the programming model,

- ease-of-use

- efficiency of generating hardware


Taxonomy of Twin Paradigm Programming Flows (HPRC)

E. El-Araby et al.: Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology And Empirical Study; Proc. SPL2007, Mar del Plata, Argentina, Febr. 2007

[courtesy Richard Newton]

„The nroff of EDA“ [R. N.]


Growth of the Internet

The future of our world-wide total computing ecosystem is facing a mind-blowing and growing electricity consumption, together with a trend toward growing cost and shrinking availability of energy sources.

The carbon footprint of the internet is higher than that of world-wide air traffic.

Will The Internet Break?

Consumer broadband connections in NA, Mex, WE by end of 2007: 155 million; predicted for after 2011: 228 million.

Accelerated trend: new technologies, larger e-mails, and an explosion in services integrating video and software. This will be intensified by the increasing popularity of games, massive use of video on demand, high-definition video and pay-TV to the living room, newer services by mobile phone companies, and multiple connected PCs and devices using the connection.


World Economic Forum’s “Global Redesign Initiative”

Organizations like UN, GATT, G8, G20 are becoming increasingly inept at fixing what ails the world. Goals:

• economic growth, • climate protection, • poverty eradication, • conflict avoidance, • human security and • promotion of shared values.

Klaus Schwab: "Our existing global institutions require extensive rewiring to confront contemporary challenges in an effective, inclusive and sustainable way."

IT cross-border integration enabling virtual interaction created a world that is:

• much more complex and more bottom-up than top-down

• economically, politically and environmentally more interdependent

• without a new set of international bureaucracies piled on existing ones.





Important: Reinvent Computing

The growth of IT and internet for:

• broad engineering issues

• ensuring sustainability of the world, like

• smart energy production and distribution,

• dealing with ageing and young populations,

• intelligent water management,

• strengthening welfare,

• mitigating risks.

Helping existing institutions by IT networking, to enable them to:

• unleash public value,

• catalyze initiatives,

• unleash human capital in the world


The Wikinomics Approach

a global system with graphic visualization to measure success, for

- more agile structures enabled by global networks for new kinds of collaboration without bureaucracy

• complete redesign of the global legal system • a global vaccine protocol • a global intellectual property system • global risk management, etc.

launch a new paradigm to involve world citizens through mass collaboration by a new communication medium, including tools like

• digital brainstorms and town hall meetings • decision-making initiatives like citizen juries and deliberative polling • execution tools like policy wikis • social networks with government and evaluation programs.

mass collaboration of citizens worldwide


Further Progress stalled

It is not only disruptive architectural developments in industry that stall further progress of IT with respect to energy efficiency and performance improvements.

Because of the inevitable manycore architecture, contemporary computer systems are in an all-dominant programmability crisis.

The progress of performance is massively stalled by this „programming wall“, caused by the lacking scalability of parallelism and a ubiquitous programmer productivity gap.

Unaffordable operation costs caused by excessive power consumption are a massive future survival problem for existing cyber infrastructures, which we must not surrender.


The growing core count

The growing core counts are racing ahead of programming paradigms and programmer productivity, not only in supercomputing: everywhere!

Almost all supercomputing applications were originally written for a single processor, and more than 50% of the applications still do not scale beyond eight cores, although the newest petascale machines employ up to 100,000 processor cores each.

What about future exascale giants expected to come with up to a million cores?


Crashing into the

Programming Wall

• The list (not even complete) demonstrates that most of the much earlier supercomputing projects and start-ups failed by crashing into the parallel programming wall.

• Even today the vast majority of HPC or supercomputing applications were originally written for a single processor with direct access to main memory.

• But the first petascale supercomputers employ more than 100,000 processor cores each, and distributed memory.

• Three real-world applications have broken the petaflop barrier (10^15 calculations/second). A slightly larger number have surpassed 100 teraflops (100 x 10^12 calculations/second), mostly on IBM and Cray.

• The community hopes that dozens of applications are inherently parallel enough to be laboriously decomposed, sliced and diced, for mapping onto such highly parallel computers.

• But a large number of applications is only modestly scalable. More than 50% of the codes do not scale beyond eight cores, and only about 6% can exploit more than 128 processing elements (PEs), still a tiny fraction of the 100,000 or more available cores.
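One common way to see why codes stop scaling is Amdahl's law. The slides do not name it, so the following is only an illustrative sketch: the function name and the 90% parallel fraction are assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup according to Amdahl's law.

    parallel_fraction: share of the run time that can be parallelized (0..1)
    cores: number of processor cores working on the parallel part
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application: 90% of its run time is parallelizable.
    for cores in (8, 128, 100_000):
        print(cores, "cores ->", round(amdahl_speedup(0.9, cores), 2), "x")
```

Under this assumption the speedup stays below 10 no matter how many cores are added, which is one possible reading of why a code written for a single processor gains little from 100,000 cores.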


Amid the Clamor

Michael Wrinn (keynote at SIGCSE 2010): Suddenly, All Computing Is Parallel: Seizing Opportunity Amid the Clamor http://www.sigcse.org/sigcse2010/attendees/keynotes.php

„Foundational change will disrupt traditional habits throughout the discipline ....“

„The proud era of von Neumann architecture passes into history.“

a senior course architect in the Intel Software College

bring parallel computing into the mainstream of undergraduate education


Programming Research stalled

The programming wall forces us to reshape the fundamental nature of system design and programming methods.

The scientific community, with its current discussions, appears to be desperately seeking a needle in a haystack.

The still unanswered question is what it will really take to build affordable and successfully programmable high-performance platforms.

Will we be successful in addressing scalability challenges and in finding new programming models to support finding novel environments and algorithms which improve performance, resilience and power efficiency, and can exploit extreme concurrency?

However, the evolutionary path is not addressing the key issues.

Extrapolating from petascale to future exascale machines yields a power consumption of about 120 MW or more: the power wall.


Max Planck: Replacing false doctrines by new insights takes 50 years of waiting, not only for the old professors but also for their students to die off.

50 years Software Crisis

Criticism of software engineering is not new:

F. L. Bauer 1968, coined the term „Software Crisis“

N. N. 1995: THE STANDISH GROUP REPORT

Robert N. Charette 2005: Why Software Fails; IEEE Spectrum, Sep 2005

Anthony Berglas 2008: Why it is Important that Software Projects Fail

Software Crisis: term by F. L. Bauer [1968]


Will we be successful?

Will we be successful in addressing scalability

and in finding new programming models

to support finding novel environments and algorithms

which improve performance, resilience and power efficiency

and can exploit extreme concurrency?


A Rescue Campaign is urgently needed

Software must be rewritten not only for manycore,

but also in general for energy-efficient computing.

However, a qualified programmer population does not exist (we do not yet know how to rewrite software).

We need to reinvent computing (and its education)


Delaying such actions will cause a world-wide disaster

Must be done as long as we can afford a rescue campaign

Will be costly and take many, many years

Creates thousands of new jobs

To convince politicians we need presence in the media


All but ALU is overhead: x20 efficiency

[R. Hameed et al.: Understanding Sources of Inefficiency in General-Purpose Chips; 37th ISCA, June 19-23, 2010, St. Malo, France]

… quantifying the overheads of a 720p HD H.264 encoder

explores methods to eliminate overheads by transformations (data cache)

Just one of several overhead abstraction layers


Programmable SOC in the media

No. of design starts: +13.4% in 2006 [Dataquest]

Until 2010: from 80,000 to 110,000 [Dataquest]

June 2005


alumnus

(CV) at Karlsruhe: first graduate student and Ph. D. student of Karl Steinbuch

Reiner Hartenstein giving a keynote address at the ITIV 25th anniversary (in 1983)

[Photo labels: alumnus (three times); director Utz Baitinger; Karl Steinbuch, founder of ITIV; director Klaus Müller-Glaser; Jürgen Becker, vice president, Univ. Karlsruhe]


widening the semantic gap


ISC2006 BoF Session Title and Abstract

Is Reconfigurable Computing the Next Generation Supercomputing?

Advances in reconfigurable computing, particularly FPGA (field-programmable gate array) technology, have reached a performance level where they rival and exceed the performance of general purpose processors for the right applications. FPGAs have gotten cheaper thanks to smaller geometries, multimillion gate counts and volume market leverage from ASIC preproduction and other conventional uses. The potential benefit from the widespread incorporation of FPGA technology into high-performance applications is high, provided present day barriers to their incorporation can be overcome. This session will focus on defining the anticipated market changes, anticipated roles of FPGA technology in high-performance computing (from accelerators to hybrid architectures), characterizing present day barriers to the incorporation of FPGA technology (such as identifying the right applications), and partnering efforts required (tools, benchmarks, standards, etc.) to speed the adoption of reconfigurable technology in high-performance supercomputing.

Keywords: Reconfigurable computing, FPGA Accelerators, Supercomputing

Date and Time

This BoF session is part of the conference program and will take place within a 45-minute slot on Wednesday, 28 June 2006, from 18:00 - 19:30.

BoF Organizers

John Abott

Chief Analyst, The 451 Group, USA

Dr. Joshua Harr

CTO, Linux Networx, USA

As CTO for Linux Networx, Dr. Joshua Harr has the responsibility of laying the technical roadmap for the company and is leading the team developing cluster management tools. Josh's experience with parallel processing, distributed computing, large server farms, and Linux clustering began when he built an eight-node cluster system out of used components while in college. An industry expert, Josh has been called upon to consult with businesses and lecture in college classrooms. He earned a Ph.D. in computational chemistry and a bachelor's degree in molecular biology from BYU.

Dr. Eric Stahlberg

Organizing founder OpenFPGA, Ohio Supercomputer Center (OSC), USA


more offending statements to come

[Figure labels: speaker, audience]


How to achieve acceptance

[Courtesy Richard Newton]

Your name here: your proposals

how to hide the ugliness from the user [Herman Schmit]


Lean hdw Qualification: not this way!

[Richard Newton]

We want a WYSIWYG design entry [Richard Newton]

Richard Newton: The Next EDA Revolution (Japan, Sept 1996)

(nroff: from Unix in the 60ies)


The „von Neumann“ mainframe

introduced in the early 1940s

The contemporary basic mindset of programmers is still tape-oriented

Time domain: instruction streams, controlled by program counter

notorious headache with parallelism


Languages turned into Religions

Teaching students the tunnel view of language designers

falling in love with the subtleties of formalisms

instead of meeting the needs of the user

Java is a religion – not a language [Yale Patt]


The spirit of the Mainframe Age

For decades, we’ve trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps …

Even in “hardware” courses (unloved child of CS scenes) we often teach von Neumann machine design – deepening this tunnel view

… finally tending to code sizes of astronomic dimensions

1951: Hardware Design going von Neumann (Microprogramming)

27 October 2008, Software 2008, Zurich

End of Moore’s Dividend?

Can multicore supplant Moore’s dividend? Not without major innovation:

• Difficult programming: synchronization, races, non-determinism, missing language and tool support

• Lack of parallel algorithms

• Few parallel abstractions: low-level machine-specific models (shared memory, message passing), assembly-level constructs (thread, semaphore, lock), machine-specific performance models; parallel programs are low level and machine-specific (hard to port, reuse investments, develop market, gain economies of scale)

• Sequential code

• Multiple programming models: long-standing consensus on von Neumann, no consensus on parallelism model (data parallelism, thread parallelism, message passing); a single application may use all of them; language and tools needed to support and integrate models (education and training, when and how to use)
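To make the „synchronization, races, non-determinism“ item concrete, here is a minimal sketch using Python threads. The helper name `run_counter` and the thread and iteration counts are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
import threading


def run_counter(num_threads, increments, use_lock):
    """Let several threads increment one shared counter.

    With use_lock=True every read-modify-write is protected by a lock.
    With use_lock=False updates may be lost, so the result can be too small.
    """
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                with lock:          # explicit synchronization
                    counter += 1
            else:
                counter += 1        # unsynchronized read-modify-write

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print("with lock:   ", run_counter(4, 100_000, use_lock=True))
    print("without lock:", run_counter(4, 100_000, use_lock=False))
```

The locked variant always returns the exact total. The unlocked variant may or may not lose updates, depending on interpreter and timing, which is the kind of non-determinism the slide refers to.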

http://www.eecs.berkeley.edu/~newton/Presentations/EDARev8_96/index.htm



What Language ?

Computer scientists haven’t been interested in programming clusters. If putting the cluster on a chip is what excites them, fine.

Gordon Bell: It will still have to run Fortran!

*) like CoDe-X

Based on classical programming language principles, a dual paradigm dichotomy approach (instruction-procedural interlaced with data-procedural) is a good candidate to support parallel programming.


Tools for Team Design

• At 28-nm, FPGAs deliver the equivalent of a 20- to 30-million gate application-specific integrated circuit (ASIC).

• At this size, FPGA design tools begin to break down. Design and verification in a reasonable amount of time becomes impossible.

• Tools for team-design are coming up, where we should discuss:

• Distributed and parallel development

• Design flows

• Tracking and reporting

• An important consideration is the management and integration of sub-projects into the top level, including

• source code version control

• Design / IP re-use

• Time budgeting

• In-context synthesis


43% growth

• In 2010 iSuppli sees 43 percent growth for the PLD market (including FPGAs), and 30% just for FPGAs.

• The market for core silicon (PLDs, ASICs, and application-specific standard products (ASSPs)) is predicted to grow 21.2 percent in 2010, where PLDs will grow fastest, by 43% up to $4.7 billion.
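As a quick plausibility check of the quoted figures, the previous-year PLD market size implied by "43% growth up to $4.7 billion" can be computed directly. This is only arithmetic on the numbers above; the function name is a hypothetical helper, and the result is not a figure reported by iSuppli.

```python
def implied_base(final_value, growth_percent):
    """Value before growth, given the value after growth and the growth rate in percent."""
    return final_value / (1.0 + growth_percent / 100.0)


if __name__ == "__main__":
    # $4.7 billion after 43% growth implies roughly $3.3 billion the year before.
    print(round(implied_base(4.7, 43.0), 2), "billion USD")
```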


Dave Patterson‘s Law

[Figure labels: memory, streams, off-chip I/O, the memory wall (Patterson)]


trends: revival of graphic HDLs ?

Results from CVT project funded by the EU within the ESPRIT program

abutment expressions describe compound cells, ...


Only a fraction of the chip used

In current general-purpose architectures, only a small fraction of the chip is dedicated to carrying out useful computations;

the remaining resources go into the memory hierarchy and into modules that serve performance only indirectly (branch predictor, pipeline control).

Exploiting RP to speed up computationally intensive tasks.

To deploy this at a larger scale, several challenging issues need to be addressed:

- Supply voltage is reduced by more than 15% per technology generation, in order to keep the power consumption low

- Operating frequency is increasing by 20-30% per annum.
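The interplay of the two trends can be sketched with the textbook CMOS dynamic power relation P ≈ α·C·V²·f. This relation is not stated on the slide; the sketch assumes constant switched capacitance and activity, and it applies the voltage and frequency changes over the same step, which is a simplification because the slide gives one per generation and the other per annum.

```python
def relative_dynamic_power(voltage_scale, frequency_scale, capacitance_scale=1.0):
    """Relative change of dynamic power, using P proportional to C * V^2 * f.

    All arguments are ratios new/old (e.g. 0.85 means a 15% reduction).
    """
    return capacitance_scale * voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # 15% lower supply voltage combined with 20%, 25% and 30% higher clock frequency.
    for f_scale in (1.20, 1.25, 1.30):
        ratio = relative_dynamic_power(0.85, f_scale)
        print(f"f x{f_scale:.2f}: dynamic power x{ratio:.3f}")
```

Under these assumptions the quadratic voltage term roughly compensates the frequency increase, which illustrates why the supply voltage is lowered to keep power consumption in check.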


On Bottlenecks

R. Hartenstein, G. Koch: The Universal Bus considered harmful; Symposium on the Microarchitecture of Computing Systems; June 1975, Nice, France [North Holland/American Elsevier].


Von Neumann Coarse-Grained Reconfigurable Architectures?

Von Neumann once again came back as a „hero“ to the community, telling us that he, as a team in the form of multi/many cores, can compete with FPGAs and exploit features of non-von-Neumann coarse-grained reconfigurable architectures. That opened a new portal of research and products for academia and industry, including progress in Networks on Chip (NoCs).


Architectural Impact [Patrick Lysaght]

• Architectural impact

– Only very high volume architectures transition to leading processes

– Programmability and concurrency are the new architectural imperatives

– MPSoCs evolve into heterogeneous, multi-core architectures

– Power dissipation is a dominant concern

– Design productivity lags silicon progress


Innovation-driven computing

[Andy Hopper]

• Simulation and modelling are important tools which will help predict global warming and its effects.

• Computing will play a key part in optimizing use of resources in the physical world.

• The amount of infrastructure making up the digital world is continuing to grow rapidly and starting to consume significant energy resources.

• To help generate momentum and achieve these goals, it is important that a coordinated set of challenging international projects are investigated.

• We are experiencing a shift to the digital world in our daily lives as witnessed by the wide scale adoption of the world wide web.

Green IT:

• Smart energy meters: housing, buildings, facilities

• Carpooling and public transport by info web sites

• Road traffic and transport logistics optimization

• Reduce travelling by telecommuting.


Hitting 28nm, and beyond

Both de facto FPGA giants (Xilinx and Altera) are hitting 28nm at the end of 2010.

FPGAs are now capable of implementing entire SoCs.

They have turned into a complex heterogeneous mix of coarse-grained elements and classical fine-grained LUTs.

2009: Intel ships 32nm,

2010: foundries to ship 28nm

Intel will ship 22 nm in 2011,

16 nm in 2013

Xilinx partner TSMC, the world’s largest standalone fab, is almost the de facto fab for all FPGAs in the world.

Altera, too, is well known for its long partnership with TSMC since the early 1990s.


Cray-XD1 Architecture

The Cray-XD1 allows the Opteron µP to access the FPGA internal registers, internal and external memory.

It provides several transfer modes between the µP and the FPGA (depending on the initiator).

The µP can read from / write to the FPGA local memory space (i.e. internal registers, internal BRAMs, and external memory).

The FPGA can read from / write to the µP local memory space.

However, the use of HLL can disable some of these features.

The most bandwidth-efficient transfer mode is the write-only mode (the producer initiates the transfer), either burst (for large amounts of data) or non-burst.


The new developments in semiconductor technology

make Reconfigurable Computing a widely used solution for future systems.

Reconfigurable Computing can achieve such a goal; however, several improvements are required.

Difference to ASICs: three orders of magnitude higher Area*Time*Power product than ASICs:

• an order of magnitude more resources

• an order of magnitude higher delay

• an order of magnitude higher power consumption
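The "three orders of magnitude" figure follows from multiplying the three individual penalties. A minimal sketch, assuming each "order of magnitude" is exactly a factor of 10 (the slide gives only rough orders, and the function name is hypothetical):

```python
import math


def atp_ratio(area_ratio, time_ratio, power_ratio):
    """Ratio of the Area*Time*Power products of two implementations."""
    return area_ratio * time_ratio * power_ratio


if __name__ == "__main__":
    ratio = atp_ratio(10.0, 10.0, 10.0)   # FPGA relative to ASIC, factor 10 each
    print("A*T*P ratio:", ratio)
    print("orders of magnitude:", round(math.log10(ratio)))
```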


Run time support of RC

Challenges to runtime support of a reconfigurable system:

Online monitoring;

Load balancing;

HW dependable SW;

Visualization;

Runtime resource management and scheduling;

very fast re-layout for dynamic reconfiguration;

Managing adaptive dynamic routing.

Challenging issue in ES: developing generic embedded platforms to improve productivity and reusability.

A reconfigurable system with the above characteristics is far from trivial to develop.

Examples of ES domain application requirements:

being energy efficient and/or

safety critical (even more challenging).


fine grain vs coarse grain

Ensuring high flexibility in interconnecting these LUTs requires a huge amount of routing, composed of programmable switches and the configuration for them, which takes up a significant area of the device.

This gave rise to new architectural concepts where the focus was to decrease the fine-grained flexibility of FPGAs to a coarser-grained one, and furthermore to an application-specific one. The latter is inherent: when we change the level of flexibility, the application domains narrow.

However, the resulting solutions are orders of magnitude better in performance, power and cost when compared to general-purpose FPGAs.


Some special Languages


Some Parallel Languages


Area cost not a limiting factor

Rapid increase of on-chip devices (currently billions of transistors) and a large number of metal layers

Reconfigurable computing can fill, at least partially, the above gap in the missing performance speedup.

Due to power limitations, not all resources can be active at the same time;

such resources are then used to offer reconfigurability and flexibility on a chip, targeting fault tolerance, better performance, or certain lower-power computations.

Resources get “cheaper”: reconfigurable hardware area cost is no longer a limiting factor.


Comment

avoid traditional tunnel views

to obtain new perspectives

rediscovery and revival of old ideas

rearrange and teach them properly

to reach promising new horizons
