My parallel universe
The Prologue...
In 2008 I quit my job to launch a chip startup with the goal of boosting processor energy efficiency by 25X.
The Epiphany Manycore Architecture
Each core: RISC CPU, SRAM, router, DMA
No Caches, No Standards, No MMU, No Legacy... No Power
The market reception...
• Ambric • Asocs • Aspex • Axis Semi • BOPS • Boston Circuits • Brightscale • Chameleon • Clearspeed
• PACT • Picochip • Plurality • Quicksilver • Rapport • Recore • Sandbridge • SiByte
• TILERA
• SiCortex • Silicon Hive • Spiral Gateway • Stream Processors • Stretch • Venray • Xelerated
• XMOS • ZiiLABS
How the $@%# will we program this thing??
There is no “C” of parallel programming
Erlang, SystemC, Intel TBB, Co-Fortran, Lisp, Janus, Scala, Haskell, Pragmas, Fortress, Hadoop, Linda,
Smalltalk, CUDA, Clojure, UPC, PVM, Rust, Julia, OpenCL, Go, X10, POSIX, XC,
Occam, OpenHMPP, ParaSail, APL, Simulink, Charm++, Occam-pi, OpenMP, Ada, LabVIEW, Ptolemy, StreamIt,
Verilog, OpenACC, C++ AMP, Sisal, Star-P, VHDL, Cilk, Chapel, MPI, MCAPI, Java
The Problem(s)!
● Parallel programming is HARD!
● Productivity matters. Time is money
● <1% of developers know parallel programming
Technology doesn't move backwards!
Presenting “Parallella”
● Launched in September 2012 at $99 (now starting at $119)
● Open source SW/HW!
● Runs Linux (Ubuntu)
● Dual-core ARM A9 processor
● A sizable FPGA
● 1GB RAM, USB, HDMI, GigE
● 16- or 64-core Epiphany coprocessor
● 50 Gbit/sec IO, 25/100 GFLOPS
Parallella Mission and Principles
● Mission: To help make parallel computing ubiquitous
● Principles:
● Complete and open documentation
● Low cost
● Open source software
● Open standards
● Open source hardware (schematics, layout)
● Open collaboration:
http://github.com/parallella
http://forums.parallella.org
Some Perspective...
● 1993 CM-5: 1024 processors, 136 GFLOPS at 100 kW, #1 on the 1993 Top500 list, price >$30M
● 2014 Parallella-64: 66 processors (64 Epiphany + 2 ARM), 100 GFLOPS* at 5 W, #1 in energy efficiency, price $199*
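Taking the slide's own figures at face value, the gap is clearest when both machines are expressed as throughput per watt:

```latex
\begin{aligned}
\text{CM-5:} \quad & \frac{136\ \text{GFLOPS}}{100{,}000\ \text{W}} = 1.36\times10^{-3}\ \text{GFLOPS/W} \\
\text{Parallella-64:} \quad & \frac{100\ \text{GFLOPS}}{5\ \text{W}} = 20\ \text{GFLOPS/W} \\
\text{Ratio:} \quad & \frac{20}{1.36\times10^{-3}} \approx 1.5\times10^{4}
\end{aligned}
```

That is roughly a 15,000X gain in GFLOPS per watt between 1993 and 2014, before accounting for the difference in price.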
25X: Size does matter...
Tianhe-2:
● 33 PFLOPS
● $390M USD
● 24 MW
● Insanity!!!!
“There is STILL plenty of room at the bottom”
33 PFLOPS ≈ 16 28nm Epiphany wafers**
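The same arithmetic applied to Tianhe-2, again using only the numbers on this slide:

```latex
\begin{aligned}
\text{Tianhe-2 efficiency:} \quad & \frac{33\times10^{6}\ \text{GFLOPS}}{24\times10^{6}\ \text{W}} \approx 1.4\ \text{GFLOPS/W} \\
\text{Implied throughput per wafer:} \quad & \frac{33\ \text{PFLOPS}}{16\ \text{wafers}} \approx 2.1\ \text{PFLOPS per wafer}
\end{aligned}
```

The per-wafer figure is only what the wafer comparison implies; it rests on whatever assumptions the ** footnote originally stated.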
Parallella Research in 2014
● >10,000 Parallella boards shipped
● 200+ university collaborations
● $10K in hardware donated
● Active research areas:
  ● Computer science education
  ● Robotics/drones
  ● Software defined radio
  ● HPC
Parallella Universities in South America
Brazil:
● São Paulo State University
● CELTAB
● Federal University of Uberlândia
Argentina:
● Universidad Austral
● Universidad de Buenos Aires
● Universidad Nacional de La Plata
● Universidad Tecnológica Nacional
● Universidad Nacional de Córdoba
Chile:
● Universidad Mayor
Colombia:
● Pontificia Universidad Javeriana
● Universidad Industrial de Santander
Some Parallella Lessons
● Openness is more important than cost
● You CAN build hardware at a profit outside China; we did it!
● Collaboration is VERY hard work
● Time is our developers' most precious resource
● Ease of use wins over performance every time (simplicity + docs + support)
How we benefited from open source
● As consumers: Linux, U-Boot, Ubuntu, BeagleBone, Verilator
● As recipients:
  ● Eclipse Multicore IDE ($1M)
  ● OpenCL ($1M)
  ● Multicore Epiphany simulator ($50K)
  ● Demos ($50K)
It is “your” responsibility to make pervasive parallel computing a reality!
Explorers:
1. Create the tools to make parallel programming easier
2. Create algorithms that scale (Amdahl's law)
3. Create a universal parallel software stack
Teachers:
1. Rewrite the computer science curriculum
2. Retrain 20M programmers
The Future of HW: A Brief Summary
Constraint --> Result
Performance limits --> Massive parallelism
Thermal density --> Slow clocks (1MHz-1GHz)
Failure rate --> Distributed systems
Bandwidth --> No shared resources
Density --> 3D chip stacking
Efficiency --> Heterogeneous HW
Productivity --> Heterogeneous SW
Amdahl's law --> New algorithms
Development cost --> Open collaboration
Latency --> Open collaboration
Get ready now!!
● Critical code must scale in performance to 1000 threads
● You (or a tool) will manage memory in software
● Know where in the universe your bits are stored!
● The hardware will fail often; can your SW handle it?
● The minimum number of languages is 2.
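The 1000-thread target is where Amdahl's law starts to bite. Below is a minimal Python sketch of the ideal, overhead-free model S(N) = 1 / ((1 - p) + p/N), where p is the parallelizable fraction of run time and N the thread count. The function names are illustrative only and are not part of any Parallella or Epiphany SDK; real speedups will be lower once communication and memory traffic are counted.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Ideal speedup on n_threads under Amdahl's law, ignoring all overheads."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at 1x; the parallel part is split evenly over n_threads.
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


def required_parallel_fraction(target_speedup: float, n_threads: int) -> float:
    """Smallest parallel fraction p for which amdahl_speedup(p, n_threads) reaches target_speedup."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    if not 1.0 <= target_speedup <= n_threads:
        raise ValueError("target_speedup must lie between 1 and n_threads")
    if n_threads == 1:
        return 0.0  # a single thread gives 1x no matter what p is
    # Rearranged from 1/S = 1 - p * (1 - 1/N).
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_threads)


if __name__ == "__main__":
    n = 1000
    for p in (0.9, 0.99, 0.999):
        print(f"p = {p}: speedup on {n} threads = {amdahl_speedup(p, n):.1f}x")
    p_half = required_parallel_fraction(n / 2, n)
    print(f"for {n // 2}x on {n} threads, serial fraction must be <= {1 - p_half:.4%}")
```

On 1000 threads, p = 0.9 yields under 10X and p = 0.99 only about 91X; reaching even half the ideal speedup (500X) requires the serial fraction to shrink to roughly 0.1%.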
The Future is Heterogeneous
● FPGA: irregular math, IO, customization
● CPU: legacy code, 90% of LOC, <100 GFLOPS
● ASIC: makes a comeback at the end of Moore's Law, another 100X boost
● Accelerators: math crunching, scalable, >100 GFLOPS
Epiphany roadmap:
● 2013: 64 CPUs, 32KB/core, 100 GFLOPS, 0.1W-2W
● 2015: 64 CPUs, 128KB/core, 80 GFLOPS (double precision), 0.1W-3W
● 2015: 1K CPUs, 128KB/core, ~1.2 TFLOPS, 0.4W-40W
● 2018: 16K-64K CPUs, 1MB/core (3D), ~20 TFLOPS, 0.2W-20W
By 2018 there WILL be 64K-core chips!
This is a new world. Without legacy, there is a great opportunity to do software right!
Getting your hands dirty
● Tomorrow: LAB2 from 10am-2pm
● Email: [email protected]
● Twitter: @adapteva