
D4.1 Enabling Technologies Report on Critical Success Factors (based on first period analysis M03-M08), Version 1.2

© 2014 RETHINK big Project. All rights reserved. www.rethinkbig-project.eu

Document Information

Contract Number: 619788

Project Website: www.rethinkbig-project.eu

Contractual Deadline: Month 8 (31 Oct 2014)

Dissemination Level: Public

Nature: Report

Editor: Marcus Leich (TUB)

Contributing Authors:

Gina Alioto (BSC), Oriol Arcas (BSC), Rosa Badia (BSC), Francisco Cazorla (BSC), John Goodacre (ARM), Nikolaos Chrysos (FORTH), Adrián Cristal (BSC), Janez Demsar (University of Ljubljana), Santiago González-Tortosa (UPM), Consuelo Gonzalo-Martín, Max Heimel (TUB), Christos Kotselidis (UniMan), Marcus Leich (TUB), Xuesong Lu (EPFL), Evangelos Markatos (FORTH), Ernestina Menasalvas (UPM), Santiago Muelas-Pascual (UPM), Javier Navaridas (UniMan), Vasilis Pavlidis (UniMan), Adrian Popescu (EPFL), Damián Roca (BSC), Nehir Sonmez (BSC), Osman Unsal (BSC)

Reviewer: John Goodacre (ARM)

Keywords:

Volatile memory: DDR and 3-D, Non-volatile memory, Storage, Accelerators, Heterogeneous Computing, GPU, VP, FPGA, Co-processor, Interconnect, HLS, LAN, Optical, 3-D Integration, SW Frameworks, IoT, Edge Computing, Algorithms, HPC, Machine learning, Graph processing, Visualization, Co-design, Probabilistic, Alternative computing paradigms

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 619788.

The content of this document reflects only the authors’ views; the Union is not liable for any use that may be made of the information contained therein.


Change Log

Version | Description of Change
v1.2    | Initial release to the European Commission


Table of Contents

1 Executive summary
2 Big Data technology infrastructure
  2.1 Big Data processing
  2.2 General system architecture
3 Hardware architecture and components
  3.1 Memory and data storage
    3.1.1 Memory hierarchy overview
    3.1.2 Primary memory
    3.1.3 Secondary memory / storage
    3.1.4 Volatile memory: DDR and 3-D
    3.1.5 Non-volatile memory
    3.1.6 Main memory and storage evolution (predicted future characteristics)
    3.1.7 Main memory and storage algorithm implications
  3.2 Accelerators and heterogeneous computing
    3.2.1 Accelerators and co-processors overview
    3.2.2 General-purpose co-processors
    3.2.3 Vector Processors (VPs)
    3.2.4 Special-purpose co-processors
    3.2.5 Reconfigurable co-processors
    3.2.6 Accelerator design with High-Level Synthesis (HLS) tools
    3.2.7 Interconnects for discrete co-processors
    3.2.8 Accelerator evolution and Europe
  3.3 3-D or vertical integration
  3.4 On-chip interconnect
    3.4.1 On-chip optical technologies
    3.4.2 Optical Network-on-Chip (NoC)
    3.4.3 Conclusions
4 Network architecture
  4.1 Networking
    4.1.1 Networking interconnects
    4.1.2 Local Area Networks (LAN)
  4.2 IoT and Edge computing
    4.2.1 Evolution
    4.2.2 EU strategy
5 System software
  5.1 Software frameworks
    5.1.1 The Big Data software stack
    5.1.2 Batch processing
    5.1.3 Stream processing
    5.1.4 Graph processing
    5.1.5 High Performance Computing
    5.1.6 Hardware-independent program specification
    5.1.7 Broader vision for Europe
    5.1.8 Concrete strategy
  5.2 Big Data core algorithms
    5.2.1 Machine learning algorithms on graphs
    5.2.2 Graph processing
  5.3 Visualization
    5.3.1 General visual analytics
    5.3.2 Visual analytics for business intelligence
6 Alternative computing paradigms
  6.1 Probabilistic computing
7 Co-design
  7.1 Hardware/software co-design for Big Data
  7.2 Hardware/software co-designed architectures
  7.3 Hardware/software co-designed processors
  7.4 Hardware/software co-designed accelerators
  7.5 Broader strategy for Europe
  7.6 Milestones
  7.7 Chip security and privacy
    7.7.1 Overview
    7.7.2 Current status
    7.7.3 Hardware and hypervisor assistance
    7.7.4 Research challenges
8 Open topics requiring further investigation
9 Bibliography


1 Executive summary

This document provides an overview of the relevant Big Data related technologies. For each technology, we explain how it relates to Big Data processing, outline the state of the art, and describe the current trends in development and research.

Where applicable, the authors outline how a specific technology may contribute to strengthening Europe’s presence in the Big Data domain.

Starting from a general description of what Big Data processing actually is and how it is generally tackled in terms of a general system architecture, we discuss core hardware components such as memory and storage, accelerators, on-chip interconnects, and silicon integration techniques. Having discussed the local components, we cover the bridging elements such as LAN and cross-machine interconnects, as well as related topics such as the Internet of Things and Edge Computing. Following hardware, we describe crucial software aspects such as the capabilities of current frameworks, frequently employed algorithms, and data visualization challenges and approaches. The remainder of the document discusses advanced techniques such as alternative computing paradigms and hardware/software co-design opportunities that may help to greatly improve the efficiency of current systems.

Given the sheer size and diversity of both applications and technologies in the Big Data domain, the authors realize that this document cannot give an exhaustive overview of every potentially relevant technology. The final section covers topics that are not fully addressed in this document but appear to be relevant for the road-mapping effort.


2 Big Data technology infrastructure

2.1 Big Data processing

The term Big Data describes a broad domain of application scenarios and corresponding solutions that deal with processing large amounts of data.

The most important characteristic of all these scenarios is the complexity of the data analysis tasks, which stems from the so-called “three Vs”: volume, velocity, and variety. Depending on the definition, additional characteristics also include variability and veracity. In essence, these characteristics convey the notion that the computational complexity of a workload cannot be addressed in due time with conventional means.

Within the range of typical Big Data problems, practitioners may be faced with very large amounts of data (on the order of several terabytes), moderate amounts of data arriving in short intervals, or even small amounts of data arriving in extremely short (microsecond) intervals. Depending on the concrete data set and workload, Big Data processing jobs can thus be classified as batch processing jobs for very large amounts of data or, more recently, as stream processing jobs for comparatively small amounts of data that need to be processed in near-real time.

Judged by volume and velocity alone, these kinds of workloads are arguably not a new occurrence. Weather forecasting, physical simulation, and financial transaction processing have all imposed similar workloads on conventional high-performance computing architectures. So the defining feature of Big Data must, quite counter-intuitively, be something other than the size of the data, or at least not only the size of the data.

The fact that such large amounts of data require a number of directly connected computers to produce an answer in a reasonable amount of time, combined with the assumption that any hardware failure imposes an unacceptable delay, has traditionally made such processing very expensive: special high-performance components are employed to maximize throughput and to reduce the likelihood of failure to an acceptable level.

But what if one could accept small delays caused by hardware failures and instead subdivide the computation into much smaller chunks, each executed on one node of the system? If just a few machines fail, only a small fraction of the overall computation needs to be repeated. Instead of using hundreds of costly high-performance machines that work on large data chunks, one can distribute the workload across a system of thousands of expendable and inexpensive commodity machines.
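The economics of this trade-off can be made concrete with a simple back-of-the-envelope model. The sketch below uses purely illustrative job sizes and failure counts (not figures taken from any measurement) to estimate how much work has to be recomputed when a job is split into chunks of different granularity and a handful of nodes fail during the run.

```python
# Back-of-the-envelope model (illustrative assumptions, not figures from this
# report): a job of total_work CPU-hours is split into independent chunks; when
# a node fails, only the chunks it was running are re-executed elsewhere.

def expected_rework_hours(total_work_hours, num_chunks, node_failures):
    """Expected CPU-hours of recomputation if each failure loses one in-flight chunk."""
    chunk_hours = total_work_hours / num_chunks
    return node_failures * chunk_hours

total_work = 10_000          # CPU-hours for the whole job (assumed)
failures = 5                 # node failures during the run (assumed)

for chunks in (10, 1_000, 100_000):
    lost = expected_rework_hours(total_work, chunks, failures)
    print(f"{chunks:>7} chunks -> ~{lost:10.2f} CPU-hours recomputed "
          f"({100 * lost / total_work:.3f}% of the job)")
```

With coarse chunks, each failure throws away a large slice of the job; with fine-grained chunks, the same number of failures costs only a negligible fraction of the total work, which is what makes inexpensive, failure-prone commodity nodes economical.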

Just as they switch to less expensive hardware, practitioners in the Big Data context also switch to a less sophisticated software engineering process. Workloads in the Big Data domain are often characterized by a rather ad hoc approach. Practitioners first and foremost need to explore the value of their data, which may be subject to ever-changing features of the available data sources. Consequently, there are initially no ETL (data Extraction, Transformation, and Loading) processes that integrate the various data sources into a clean data warehouse. Source data is always kept in its native form lest some attribute that may later prove to be important is discarded. Similarly, the actual analysis algorithms are very diverse in development environments and only mature and stabilize during production, where they may soon need to be adjusted or even replaced in order to adapt to changing business needs.

To conclude, Big Data is all about processing large amounts of data, but it is not only about that. Big Data means processing diverse data sources, quickly adapting to changing workloads and demands, and doing so in a cost-effective manner.

2.2 General system architecture

As mentioned previously, the standard system architecture has evolved into a cluster of potentially heterogeneous commodity machines (nodes), each of which follows the design principles of a standard PC.


Standard PC hardware comprises one or more multicore CPUs and main memory, as well as peripherals such as secondary storage interfaces, Ethernet network adapters and graphics adapters. While the CPU and main memory are connected via a high-speed memory interface, peripherals currently connect to the CPU via the slower PCIe bus. In a typical cluster, hundreds or even thousands of these nodes are connected via an Ethernet network. It is crucial to understand that the nodes reside in a so-called shared-nothing environment: neither disks nor main memory are shared between nodes.

In contrast to HPC approaches, where CPUs may share memory (although with different performance characteristics for different memory regions, cf. NUMA), or to shared-disk solutions, in which each CPU has private memory but can at least access all disks in the cluster, in a shared-nothing cluster each CPU has access only to its own memory and disks. Any remote access must go through the network.

Consequently, software is required to compensate for the access limitations that come with cheap shared-nothing environments and to do so in a way that ensures maximum resource utilization.
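A minimal sketch of the kind of compensation such software performs is shown below: a scheduler that prefers to place a task on a node that already stores a local replica of the data block the task needs, and falls back to a remote read over the network only when no such node is idle. The block placement, node names and replication factor are hypothetical and serve only to illustrate the idea.

```python
# Minimal data-local scheduling sketch for a shared-nothing cluster.
# Block-to-node placement and node names are hypothetical.

block_replicas = {
    "block-0001": ["node-03", "node-17"],
    "block-0002": ["node-05", "node-03"],
    "block-0003": ["node-11", "node-21"],
}

def schedule(block_id, idle_nodes):
    """Prefer an idle node holding a local replica; otherwise accept a remote read."""
    local = [n for n in block_replicas.get(block_id, []) if n in idle_nodes]
    if local:
        return local[0], "local read"
    return next(iter(idle_nodes)), "remote read over the network"

idle = {"node-03", "node-08", "node-21"}
for block in block_replicas:
    node, mode = schedule(block, idle)
    print(f"{block} -> {node} ({mode})")
    idle.discard(node)               # the chosen node is now busy
```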


3 Hardware architecture and components

3.1 Memory and data storage

One of the principal components of any system architecture is some form of permanent or temporary data storage. Compute resources as well as other devices access this storage in a mostly random or sequential manner.

The term memory usually refers to fast, volatile data storage most often used to hold intermediate or frequently accessed data, while storage usually refers to permanent data stores used for long-term retention.

The key properties of each memory technology are its access time for sequential and random reads as well as writes, its storage density (capacity per drive or per area of silicon), and its throughput (gigabytes of data read or written per second).

3.1.1 Memory hierarchy overview

Being able to efficiently access and store data is one of the fundamental requirements for information management. Modern computer systems typically feature a multitude of different memory pools for this purpose, hierarchically organized by their corresponding access times and sizes. For the sake of simplicity, we classify these pools into two large categories: primary memory and secondary memory / storage. Primary memory includes registers and CPU caches (small amounts of extremely fast, expensive memory located directly on the CPU die) as well as the main memory. Secondary memory / storage describes the larger pools of slower but more economical memory used for more permanent or long-term storage.

3.1.2 Primary memory

CPU caches are used to keep copies of frequently accessed data close to the CPU in order to accelerate data access. Modern CPU caches employ Static Random Access Memory (SRAM), which has access bandwidths of a few hundred gigabytes per second and access latencies below ten nanoseconds (~1-4 cycles), and is limited to a range of 16 to 128 kilobytes for the level-1 cache. There are alternatives to SRAM, such as eDRAM (embedded DRAM), which allows the integration of a large cache (on the order of megabytes) on the same die. The cost per bit is higher than for the usual DRAM chips, but the advantage is the potential for having up to 96 MB, as opposed to a maximum of around 20 MB of SRAM. However, higher latency and the need to refresh limit this option to use as a last-level cache.

Moving further away from the CPU cache in the memory hierarchy is the main memory, which is used to store data that is currently being processed. Accessing main memory is one to two orders of magnitude slower than accessing the CPU cache (>100 cycles), but main memory volumes are much higher. Modern systems typically feature a few tens of gigabytes of main memory, with server-grade systems having up to a few hundred gigabytes and in some cases even 1 TB. The current state of the art in main memory technology is DRAM (Dynamic Random Access Memory) based on the DDR3 specification, which offers data rates of around 15 GB/s per module (extendable to roughly 30-45 GB/s by accessing two or three modules at once via dual or triple channel). Server-grade memory modules also feature automatic error correction (ECC). DDR4, the successor to DDR3, is currently being introduced, featuring higher data densities (mostly due to 3-D stacking, i.e., placing several dies on top of each other in a single package), faster data transfer, and lower power requirements.

The technologies (SRAM, DRAM) currently employed for caches and main memory are so-called volatile storage media: they require power to maintain their stored content. Accordingly, they cannot be used to store data permanently. For permanent storage, we look to much cheaper, non-volatile storage with capacities of up to a few terabytes per device.

3.1.3 Secondary memory / storage

Continuing beyond the main memory in the memory hierarchy is the secondary memory, often referred to as storage, for which we currently exclusively employ non-volatile mass storage. The long-established default in this area is the spinning hard disk, which offers very high capacity for a given price and, in combination with RAID controllers, can offer reasonable bandwidth while providing enterprise-class fault tolerance. Due to the mechanical movement of the read/write head, data access latency is quite high, especially for random data access. Seagate has recently introduced an 8 TB disk at only $0.03/GB using “Shingled Magnetic Recording (SMR) technology that can fit more data on a platter than the typical drive” [1].

Despite the falling cost of disks, the cheapest long-term storage medium is still tape. Tape has the highest access latency but also the highest density and the largest drive sizes (90 TB, more than one order of magnitude above its closest competitor, the disk) [2].

At the same time, solid-state drives (SSDs) based on flash memory are beginning to replace classical hard disks, as they offer superior bandwidth and data access latencies as low as a few hundred microseconds [3]. The price per gigabyte of flash-based SSDs lies between that of spinning disks and that of much faster main memory. This makes SSDs an attractive choice for companies that require faster data access than spinning disks provide while trying to avoid the cost of a purely main-memory-based system.
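To put these layers into perspective, the following back-of-the-envelope sketch estimates how long a full sequential scan of a 1 TB data set would take at each level of the hierarchy. The DRAM figures correspond to the DDR3 rates quoted above; the SSD and spinning-disk bandwidths are illustrative ballpark assumptions rather than figures from this report.

```python
# Rough scan-time comparison across the memory/storage hierarchy.
# Bandwidths are illustrative ballpark figures (DDR3 per-module rate as quoted
# above; SSD and disk rates are assumptions for the sake of the example).

tiers_GBps = {
    "DRAM (single DDR3 module)":  15.0,
    "DRAM (triple channel)":      45.0,
    "NAND-flash SSD":              0.5,
    "Spinning disk (sequential)":  0.15,
}

dataset_GB = 1_000   # 1 TB working set (assumed)

for tier, bw in tiers_GBps.items():
    seconds = dataset_GB / bw
    print(f"{tier:<28} ~{seconds:8.0f} s to scan {dataset_GB} GB")
```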

3.1.4 Volatile memory: DDR and 3-D

As previously mentioned, in 2015 the market for (volatile) DRAM is dominated by DDR3, but DDR4 is expected to become the dominant technology by the end of the year. The most important aspects of DDR4 are higher data densities, mostly due to 3-D stacking, i.e., placing several dies on top of each other in a single package. These 3-D technologies are expected to dominate the market over the next 5 years (2015-2020). Currently, three new standards that fit into the 3-D category are under proposal, and each of them targets a different market. Here we discuss these standards; please refer to the section on 3-D integration for the full technology review.

The first 3-D standard under revision is the Hybrid Memory Cube (HMC) specification [4], created by a consortium led by Altera, ARM Ltd., IBM, Micron, Open Silicon, Samsung, SK Hynix and Xilinx. HMC consists of 3-D stacked memory placed on top of a high-speed logic die and connected by through-silicon vias (TSVs). The HMC can be attached to the CPU as memory mounted adjacent to the CPU or in a scalable module form factor; up to 8 modules can be connected to work together, and atomic operations are also supported. HMC has shown a 70% reduction in power consumption and a 15x increase in performance over DDR3. HMC differs notably in that the actual access to the DRAM is controlled within the HMC structure, leaving the processor to use a message-based protocol to request memory operations. The physical connection to HMC memory also differs from the traditional parallel DDR bus in that it uses a high-speed serial bus, resulting in a significant reduction in the number of data paths between the memory and the processor; HMC hence occupies 90% less space thanks to its density and form factor. Due to its versatility, it is targeted at a wide range of general uses.

The second 3-D standard is High Bandwidth Memory (HBM) [5], which is proposed as a JEDEC standard. HBM is being pitched as a replacement for GDDR5 in graphics cards, although it is not limited to this use. It has up to 8 independent channels, each 128 data bits wide plus ECC capability. The proposed bandwidth is 256 GB/s and the capacity is 8 GB of memory. In addition, it will include a self-refresh mode.

Wide I/O 2 is the last of the 3-D stacking standards currently under consideration [6]. It is also proposed as a JEDEC standard, but it is oriented towards mobile devices. Wide I/O 2 is a low-power, high-bandwidth solution. It has 8 channels, each 64 bits wide, and is connected to the CPU via an interposer.

There are other technologies that could potentially impact the DRAM market. One is DiRAM4 [7], a 3-D technology and manufacturing process from Tezzaron that is comparable in performance and capacity to the DDR4-related standards mentioned above but is constructed using a thin wafer with greater redundancy. This allows for more vertical connections, which could greatly improve wafer yields, increase the number of 3-D layers and potentially reduce cost.

3.1.5 Non-volatile memory

The increasing need for efficient data access has led to high interest in non-volatile RAM technologies (NVRAM), which aim to combine the non-volatility of mass storage with the access speed of main memory. At the moment, there are two primary NVRAM technologies in use: battery-backed DRAM and flash storage, with the former being used as a replacement for volatile RAM and the latter being used as secondary memory.

Battery-backed DRAM modules (BBU DIMMs) package a small battery directly onto the memory module, which can sustain the memory content for up to 72 hours in case of a power outage. Since they are based on off-the-shelf memory modules, they offer the same performance characteristics as main memory. However, the utilized batteries have a limited life, add hazardous materials to the system, and significantly increase the price of the memory module. It is due to this combination of disadvantages that BBU DIMMs are only seldom found in commercial systems.

Flash storage uses isolated electrical charges within a transistor to store information durably. Since it is solid-state, it has several advantages over magnetic disks: first, data transfer rates and access latencies are strongly improved compared to traditional magnetic disk drives. Second, flash storage does not require any mechanical parts, giving it a higher failure tolerance. Additionally, flash supports multi-level cells (MLC), which make it possible to store more than one bit per cell. Finally, producing flash storage is highly similar to producing DRAM or CPUs, meaning manufacturers can share production lines and technological advancements with other semiconductor-based products. However, flash storage also has several shortcomings: first, it only has a limited lifespan in terms of write/erase cycles (cf. the endurance figures in the table in Section 3.1.6), making it unfeasible for long-term backup solutions. Second, it is still much slower and requires more energy for read/write access than main memory. Currently, the most widely used flash technology is NAND flash, which typically has higher densities and lower cost than NOR flash (which has faster access times). In 2014, Samsung announced its first incarnation of V-NAND, a vertically stacked 3-D version of these chips with 32 layers, in order to improve upon these endurance and latency issues, and other companies are expected to follow suit [8].

However, manufacturers are also currently investigating several alternative technologies to replace flash, offering non-volatile storage with DRAM-like access speeds or serving as storage class memory. These technologies include, but are not limited to, ST-MRAM (magnetic polarization), phase-change memory or PCM (crystalline state), and memristors (electrical resistance).

Phase-change memory (PCM, also referred to as PC-RAM) has faster read/write speeds than NAND flash and can be constructed in such a way that it can replace DRAM thanks to its addressing capabilities (flash can only be accessed by pages). Its endurance is orders of magnitude better than that of NAND flash, but still not good enough to directly replace DRAM. That said, PCM is a very good candidate for constructing hybrid memories that combine a small volatile RAM with a non-volatile RAM. The price of these hybrid memories is expected to decrease to that of standard DRAM cells. Currently, there are some chips on the market constructed with PCM.

Ferroelectric RAM (FeRAM) also has faster read/write speeds than NAND flash and an endurance equivalent to DRAM. FeRAM and DRAM function in the same manner in that reading is destructive. However, FeRAM is not a good potential replacement for DRAM or for NAND flash due to its low density and its inability to store multiple bits in one cell (no MLC).

Magnetoresistive random-access memory (MRAM) is one of the technologies that show potential as a replacement for DRAM (faster read/write speeds than NAND flash and an endurance equivalent to DRAM). Its only drawback is its cell size, which is too large to serve as DRAM and is comparable to that of SRAM cells.

Spin-transfer torque RAM (STT-RAM) is another emerging technology that shows great promise as a replacement for DRAM due to its expected density, possible feature size, speed and endurance. Today it is still not competitive with DRAM, but the projected evolution is optimistic.

Resistive or redox RAM (ReRAM or RRAM) and memristors are a group of cells based on the common principle of metal filaments. ReRAM is also seen as a potential replacement for both DRAM and NAND flash, as there are excellent projections for high endurance, very low operating voltage (lower than current DRAM, which must be refreshed, and lower than NAND flash) and multi-level cells (MLC). Additionally, some types of ReRAM are CMOS-compatible, while others may lend themselves to 3-D stacking as well as 3-D vertical cells. An additional advantage of ReRAM and memristors is their ability to store a weight of an artificial neural network in an analog way.

Another interesting emerging memory technology is the nanoelectromechanical switch, or NEMS. NEMS cells can be constructed in two different flavors, either silicon-based or carbon-based; Nantero Inc. has announced that it can produce carbon-nanotube-based cells. NEMS also has good technical prospects as a universal non-volatile memory that could replace DRAM, SRAM and NAND, with an endurance of 10^11 write cycles.

3.1.6 Main memory and storage evolution (predicted future characteristics)

The table below summarizes the predicted characteristics. The first four columns (DRAM, SRAM, and NOR and NAND flash, the latter two grouped as “improved flash” in the source data) cover traditional technologies; the remaining columns cover emerging technologies.

Parameter             | DRAM [9] | SRAM [9] | NOR [9]      | NAND [9]   | FeRAM [9] | STT-RAM [9] | PCRAM [9] | Memristor [10] | NEMS [11]
Cell elements         | 1T1C     | 6T       | 1T           | 1T         | 1T1C      | 1(2)T1R     | 1T1R      | 1M             | 1T1N
Half pitch F (nm)     | 9        | 10       | 25           | >10        | 65        | 16          | 8         | 3-10           | 10
Smallest cell area    | 4 F2     | 140 F2   | 10 F2        | 4 F2       | 12 F2     | 8 F2        | 4 F2      | 4 F2           | 36 F2
Read time             | <10 ns   | 70 ps    | 8 ns         | 0.1 ms     | <20 ns    | <10 ns      | <10 ns    | <50 ns         | -
Write/erase time      | <10 ns   | 70 ps    | 1 ms / 10 ms | 1 / 0.1 ms | <10 ns    | <1 ns       | <50 ns    | <250 ns        | 1 ns (140 ps - 5 ns)
Retention time        | 64 ms    | N/A      | 10 y         | 10 y       | 10 y      | >10 y       | >10 y     | >10 y          | >10 y
Write endurance       | >1E16    | >1E16    | 1E5          | 1E5        | >1E15     | >1E15       | 1E9       | 1E15           | 1E11
Write op. voltage (V) | 1.5      | 0.7      | 8            | 15         | 0.7-1.5   | <1          | <3        | <3             | <1
Read op. voltage (V)  | 1.5      | 0.7      | 4.5          | 4.5        | 0.7-1.5   | <1          | <1        | <3             | <1
Write energy (J/bit)  | 1E-13    | unavail. | 2E-10        | 1E-12      | unavail.  | unavail.    | unavail.  | <5E-14         | <7E-16
Scalability outlook   | -        | -        | -            | -          | -         | Promising   | Promising | Promising      | Promising

The data in this table is compiled from different sources: [9], [10], [11].

The table above shows the expected evolution of the different technologies. The cell element describes the structure of the cell; for example, 1T1C means that the cell consists of one transistor and one capacitor. The half pitch (half the distance between identical features in an array) is used together with the smallest cell area (in F^2) to determine the size of the cell: the smaller the half pitch and the smaller the number of F^2, the higher the total density. The read/write time is the time needed to read from or write into the cell; for some technologies the difference between read and write is very large, so such cells are best used in environments where reads are much more frequent than writes, such as secondary storage. The retention time is the time for which the cell can still be read without losing its data. The endurance is the number of writes or erases that a cell can withstand without being corrupted; for a technology to replace DRAM this number must be very large (ideally on the order of 10^15), or some technique must be applied to deal with the errors or to ensure that the cells are written in a uniform way. The read/write voltage is the voltage needed to read or write a cell; in some technologies these two voltages are not equal. The write energy is the energy needed to write one bit.

Ideally, a replacement for DRAM should have very high endurance, high density, low read/write voltages, and low read/write latency. For secondary storage, the endurance does not need to be as high, and write latency and write voltage are also less important; what is crucial for secondary storage is a very high density.
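As a worked example of how the half pitch and the cell size in F^2 translate into density, the sketch below computes the raw cell density for a few of the technologies listed in the table. This is the idealized cell-array density only; real devices add decoders, sense amplifiers and spare structures, so actual chip densities are considerably lower.

```python
# Worked example: turning the table's half pitch F and cell size (in F^2) into
# a raw cell density. Array overheads (decoders, sense amplifiers, spares) are
# ignored, so real chip densities are lower than these figures.

def density_gbit_per_mm2(half_pitch_nm, cell_area_F2):
    cell_area_nm2 = cell_area_F2 * half_pitch_nm ** 2
    bits_per_mm2 = 1e12 / cell_area_nm2          # 1 mm^2 = 1e12 nm^2
    return bits_per_mm2 / 1e9

examples = {                  # (half pitch in nm, cell size in F^2), from the table
    "DRAM":      (9, 4),
    "SRAM":      (10, 140),
    "STT-RAM":   (16, 8),
    "Memristor": (3, 4),
}

for name, (f_nm, f2) in examples.items():
    print(f"{name:<10} F={f_nm:>2} nm, {f2:>3} F^2 -> "
          f"~{density_gbit_per_mm2(f_nm, f2):7.2f} Gbit/mm^2")
```

For instance, a 4 F^2 DRAM cell at a 9 nm half pitch occupies 4 x 81 = 324 nm^2, which corresponds to roughly 3 Gbit of raw cells per mm^2, while a 140 F^2 SRAM cell at 10 nm yields only about 0.07 Gbit/mm^2.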

3.1.6.1 Main memory

Over the next 5 years (2015-2020), 3-D stacked memory is a promising approach for further increasing memory density and possibly lowering prices for main memory. At the time of writing there are several products for mobile phones on the market that claim to be 3-D, although they are mostly Wide I/O 2.5-D. In 2014, Samsung announced its first incarnation of V-NAND, a 2.5-D or 3-D version of these chips with 32 layers, in order to improve upon the known endurance and latency issues of NAND flash [12]. Other companies are expected to follow suit over the next 5 years, including Intel, which announced in late 2014 a partnership with Micron to launch its own product in 2015 [13].

Moving further out (2020-2025), NVMs or Non-Volatile Memories look set to take on more importance, if recent Intel patent applications for new instructions using NVMs are any indication [14]. Specifically, STT-RAM, NEMS and ReRAM and/or memristors look to be the next DRAM replacement. HP is currently betting on the merits of memristor performance projections by including memristors as part of its new computer architecture, the Machine. In its public roadmap for the Machine, HP expects memristor prototypes in late 2015 and memristors in a commercial product by 2018 [15]. Moreover, Silicon Valley start-up Crossbar has announced that it expects its 3-D ReRAM to be available in products beginning in 2016 [16].

Although it was thought that PCM hybrid memories might become popular in the near term, it is notable that the most prominent vendor of PCM products, Micron, removed PCM from its public roadmap in favor of NAND flash in January 2014 [17].

3.1.6.2 Data storage (secondary memory)

Looking at the next 5 years (2015-2020) for secondary memory / storage, SSD prices are expected to drop further. We expect SSDs to slowly erode the disk market, with the focus remaining primarily on NAND flash and V-NAND. However, conventional mass storage interconnects such as SATA and even PCI Express are becoming a bottleneck as vendors start to manufacture SSD components with DDR RAM interfaces or as Storage Class Memory. Storage Class Memory is non-volatile memory with higher latency than DRAM but larger capacity; it can be connected directly to the processor through the memory controller or the I/O controller, depending on the cell technology. It will act as a new level in the memory hierarchy, between main memory and solid-state devices or disks.

3.1.7 Main memory and storage algorithm implications

Improving memory bandwidth with 3-D stacking technologies and accelerating the production of NVMs while bringing them closer to the CPU will enable more power-efficient data analytics thanks to performance and latency improvements. However, such enabling technologies will necessitate significant re-design efforts for the employed software and algorithms. Since main memory will, in the near future, lose its volatility, algorithms and abstractions must be adapted to take advantage of this characteristic; for example, the abstraction of a conventional file system may no longer be necessary. Data locality has always been very important, but with the new 3-D stacking technologies it will be even more so. Consequently, it will likely be more efficient to move the computation to where the data is located instead of moving the data to the compute resource.

Besides tackling Big Data memory-related problems with the aforementioned advances, significant research challenges are also emerging. Advanced techniques to tackle “wear-out” pathologies have to be designed. Especially for Big Data processing, where the pressure on memory reads and writes is higher than in any other domain, algorithmic designs that minimize repeated writes to the same memory cells, and thus improve the average lifespan of NVMs, should be researched and developed.
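One simple algorithmic idea in this direction is to keep remapping hot logical locations onto different physical cells so that repeated writes are spread across the device. The sketch below is a deliberately simplified rotation scheme (real wear-leveling techniques such as start-gap are considerably more refined) meant only to illustrate how a modest amount of remapping reduces the worst-case write count seen by any single cell.

```python
# Simplified write-spreading sketch for NVM cells with limited write endurance.
# A rotating logical-to-physical mapping spreads repeated writes to a hot
# logical address over many physical cells. (Real wear-leveling schemes such
# as start-gap are more sophisticated; this only illustrates the idea.)

class RotatingStore:
    def __init__(self, logical_size, spare_cells, rotate_every=1000):
        self.logical_size = logical_size
        self.cells = [0] * (logical_size + spare_cells)
        self.writes = [0] * len(self.cells)        # per-cell write counters
        self.offset = 0                            # current mapping shift
        self.rotate_every = rotate_every
        self.total_writes = 0

    def _phys(self, logical):
        return (logical + self.offset) % len(self.cells)

    def read(self, logical):
        return self.cells[self._phys(logical)]

    def write(self, logical, value):
        p = self._phys(logical)
        self.cells[p] = value
        self.writes[p] += 1
        self.total_writes += 1
        if self.total_writes % self.rotate_every == 0:
            self._rotate()

    def _rotate(self):
        # Shift the mapping by one cell and move the stored data along with it
        # (one extra write per logical cell) so that reads stay consistent.
        snapshot = [self.read(l) for l in range(self.logical_size)]
        self.offset = (self.offset + 1) % len(self.cells)
        for logical, value in enumerate(snapshot):
            p = self._phys(logical)
            self.cells[p] = value
            self.writes[p] += 1

store = RotatingStore(logical_size=8, spare_cells=2, rotate_every=100)
for i in range(100_000):
    store.write(0, i)                              # a pathologically hot address
print("most-written cell saw", max(store.writes), "writes")
print("without remapping, one cell would have seen 100000 writes")
```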

3.2 Accelerators and heterogeneous computing

3.2.1 Accelerators and co-processors overview

Accelerators, or co-processors, are specialized processing devices that work alongside a Central Processing Unit (CPU) to speed up certain classes of computations. In general, we distinguish between three classes of co-processors: general-purpose, special-purpose, and reconfigurable. General-purpose co-processors like GPUs or the Intel Many Integrated Core (Intel MIC) are freely programmable and can run more or less any type of computation, although they are typically best suited to certain special classes of programs. Special-purpose co-processors are used to accelerate very specific computations, such as operations from signal processing, encryption, or compression. Finally, reconfigurable co-processors like FPGAs are a special class of co-processors that can modify their behavior at runtime. Depending on its configuration, an FPGA can act either as a general-purpose processor or as a highly specialized accelerator.

In addition to these three classes, we also differentiate between discrete and embedded co-processors: while discrete co-processors are shipped in their own package, usually containing their own memory, embedded ones are integrated directly into the CPU die. Embedded co-processors, often appearing in so-called “systems-on-a-chip”, have the advantage of being able to directly operate on and address data in the system’s main memory, sometimes taking advantage of the CPU’s caching architecture. However, they are also limited in their size, power and heat budgets by the CPU. Discrete co-processors are not subject to these limitations, but they must always be physically connected to the CPU via some kind of interconnect, typically PCI Express. All data processed by the accelerator has to pass through this interconnect, and depending on the computation and the utilized bus, these additional transfers can quickly eat up any savings gained from the accelerator.
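Whether offloading pays off can be estimated with a simple model that weighs the transfer time over the interconnect against the expected computational savings. The figures below (PCIe bandwidth, data volume and accelerator speedup) are illustrative assumptions, not measurements.

```python
# Does offloading pay off? A rough model comparing CPU-only execution against a
# discrete accelerator that is k times faster but must receive and return the
# data over PCIe. All figures are illustrative assumptions.

def offload_wins(data_GB, cpu_seconds, accel_speedup, pcie_GBps=12.0):
    transfer = 2 * data_GB / pcie_GBps            # data in + results out (worst case)
    accel_total = transfer + cpu_seconds / accel_speedup
    return accel_total < cpu_seconds, accel_total

for cpu_s in (0.5, 5.0, 50.0):
    wins, t = offload_wins(data_GB=8, cpu_seconds=cpu_s, accel_speedup=10)
    print(f"CPU time {cpu_s:5.1f} s -> accelerator path {t:6.2f} s "
          f"({'offload wins' if wins else 'stay on the CPU'})")
```

The pattern is the general one: short, data-heavy operations are dominated by the transfer and stay on the CPU, while compute-heavy operations amortize the transfer and benefit from the accelerator.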

Key properties of co-processors are the speed of computation, usually measured in operations per second, their energy consumption per operation performed and their size.

3.2.2 General-Purpose Co-processors

3.2.2.1 Graphics Processing Units (GPUs)

GPUs were originally developed to relieve the CPU of the additional computational load of rendering pixel graphics, e.g. for video gaming or CAD. Historically, GPUs only implemented fixed rendering pipelines; however, over the last ten years they have matured into fully programmable, powerful co-processors. A modern GPU contains up to a few thousand simplistic compute units: for instance, Nvidia’s latest consumer GPU (GM204) features 1536 compute units, while AMD’s latest GPU (Hawaii XT) features 2816. These numbers are expected to keep growing in the future. Besides having a high number of cores, modern graphics cards also feature around 2-3 GB of dedicated high-bandwidth GDDR5 memory, with professional models featuring around 4-5 GB. Thanks to much wider memory interfaces (at the cost of higher access latencies compared to DDR3), GDDR5 memory achieves bandwidths of up to a few hundred GB/s.

Several frameworks, including OpenCL, CUDA and DirectCompute, allow harnessing this massive performance to accelerate general-purpose computations. While GPUs are not suited to all types of computations, such as code containing complex branching logic, their high number of cores and high-bandwidth device memory can lead to tremendous acceleration factors for data-parallel computations, e.g. in numerics, signal processing, bioinformatics, data processing or machine learning. In these scenarios, GPUs can offer speedups of around an order of magnitude over a modern multi-core CPU.
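To give an impression of this programming model, the sketch below implements a trivially data-parallel operation with OpenCL through the PyOpenCL bindings, launching one work-item per array element. It assumes that the pyopencl and numpy packages and a working OpenCL driver are installed; the kernel and variable names are ours.

```python
# Minimal data-parallel example with OpenCL via PyOpenCL (assumes pyopencl,
# numpy and a working OpenCL driver are installed). One work-item per element.
import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global const float *y,
                    __global float *out) {
    int i = get_global_id(0);
    out[i] = a * x[i] + y[i];
}
"""

x = np.random.rand(10_000_000).astype(np.float32)
y = np.random.rand(10_000_000).astype(np.float32)
out = np.empty_like(x)

ctx = cl.create_some_context()                 # picks an available OpenCL device
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
o_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

prog = cl.Program(ctx, kernel_src).build()
prog.saxpy(queue, x.shape, None, np.float32(2.0), x_buf, y_buf, o_buf)
cl.enqueue_copy(queue, out, o_buf)             # copy results back over the bus

print("max error:", float(np.max(np.abs(out - (2.0 * x + y)))))
```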

Today’s most important server-class GPU vendors are AMD, Intel, and Nvidia. Of these three, Intel has the lion’s share of the market, producing around 67% of all GPUs sold [18]. However, it should be noted that this number includes the embedded GPUs that are nowadays found in virtually every Intel CPU. AMD and Nvidia cover 18% and 15% of all produced GPUs, respectively, both offering discrete cards (AMD Radeon, Nvidia Geforce). AMD also focuses on embedded GPUs, offering them for their Opteron series of processors (AMD Fusion). Other vendors (like Matrox or S3) play only a marginal role.

Besides their consumer-oriented graphics cards, AMD (FireStream) and Nvidia (Tesla) also offer professional accelerator cards based on their GPUs, targeting High-Performance Computing (HPC) applications. Compared to the consumer cards, these typically feature more on-device memory (typically 12-24 GB), more cores with higher clock frequencies, error-correcting memory, and advanced features (such as better support for double-precision floating-point arithmetic) for professional users.

3.2.2.2 GPUs for High Performance Computing (HPC)

As of June 2014, 66 of the 500 fastest computers had GPUs integrated into their design. With the floating-point (FP) capability of CPUs increasing more slowly than that of GPUs, this trend is expected to continue.

The use of CUDA, OpenACC and OpenCL will continue to be extended in the HPC space, mainly because the HPC community is used to rewriting its code to take advantage of new hardware. However, there is an even stronger push to use standard languages, unmodified, for GPUs, because programmer productivity increases significantly when unmodified (or minimally modified) code can run on GPUs. We have already started to see certain flavors of OpenMP runtimes that map application code onto GPUs, and this approach is expected to be used more frequently in the future.

Another development is the integration of general-purpose GPUs (GPGPUs) with CPUs on the same die; both Intel and AMD have announced such products. These integrated GPGPUs have a different data-sharing model than discrete GPUs: with discrete GPUs, data is shared with the CPU through Direct Memory Access transfer protocols, whereas for integrated GPGPUs this model is replaced by a much simpler view for software, with hardware support for coherence between the CPU and GPU memories. Also, with embedded GPGPUs already sharing a single unified memory, this integration is expected to become even tighter in the future. Such an evolution is being promoted by a broad group of graphics vendors participating in the Heterogeneous System Architecture (HSA) consortium.

There will also be changes in the way GPUs are seen by the rest of the system. Previously it was not possible to run GPUs by themselves, so a co-processor model was used; recently there has been considerable effort to run operating systems such as Linux directly on GPUs. This implies that GPGPUs may truly become general-purpose, standalone products in the future.

3.2.3 Vector Processors (VPs)

Vector Processors (VPs) were initially developed for leading-edge supercomputers such as the CRAY machines. Their key feature is to operate very efficiently over the large vectors of data typically found in supercomputing or High Performance Computing (HPC) applications by expressing a vector operation as a single machine-code instruction. Recently, VPs have been experiencing a renaissance thanks to technology scaling: current technology nodes allow billions of transistors to be packed on a chip, which in turn facilitates the inclusion of massive VP resources. Currently, Intel dominates the market with its Many Integrated Core (MIC) product, the Xeon Phi, offering up to 60 cores with wide 512-bit vector processing units.

Two future trends are foreseen for VPs. 1) On one side, we expect VP designs to expand into the low-power datacenter market segment. VPs are inherently power-efficient, since the front end (fetch and decode) of the VP core can be clock-gated while a single instruction iterates over large vectors. Preliminary academic studies have also indicated the feasibility of VPs for the low-power mobile market segment; Europe is strong in low-power, embedded systems and is ideally placed to benefit from this proliferation. 2) On the other side, VPs will move away from the co-processor model towards full-blown standalone processors. This will eliminate one current weakness of VP co-processors: the need to be coupled with a master processor running the OS services. Intel has already announced that the next generation of Xeon Phi, Knights Landing, will also be available in a host-processor configuration. This development might favor VPs over GPUs; accordingly, there is also a movement (academic for the time being) to provide OS services on GPUs.
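The essence of the vector programming model, expressing one operation over an entire vector instead of an explicit element-by-element loop, can be illustrated even in plain Python with NumPy. Note that this is only an analogy for how VP-style code is written: NumPy dispatches to SIMD-optimized CPU libraries, not to a discrete vector processor.

```python
# Illustration of the vector programming model: one operation expressed over a
# whole vector rather than an explicit scalar loop. (NumPy maps this onto
# SIMD-optimized CPU code; it is an analogy for VP-style programming only.)
import time
import numpy as np

n = 5_000_000
a = np.random.rand(n)
b = np.random.rand(n)

t0 = time.perf_counter()
c_loop = [a[i] * b[i] for i in range(n)]       # scalar, element-at-a-time
t1 = time.perf_counter()
c_vec = a * b                                  # single vector expression
t2 = time.perf_counter()

assert abs(c_loop[0] - c_vec[0]) < 1e-12       # same result, very different cost
print(f"scalar loop: {t1 - t0:6.2f} s, vector expression: {t2 - t1:6.3f} s")
```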

3.2.4 Special-purpose co-processors

Special-purpose co-processors are used to accelerate specific applications. They can be embedded as well as discrete; here, embedded means that the special-purpose co-processor is on the same die as the host processor. Examples of such embedded accelerators include the database accelerator units in Oracle's SPARC M7, the Cyclic Redundancy Check (CRC) units on some ARM technology based and Intel devices, and Intel's AES unit (for encryption). Interestingly, GPUs often contain embedded special-purpose co-processors themselves, mostly to accelerate special media operations like video/audio encoding. The expectation is for this embedding trend to increase, since the dark silicon problem (not all of the chip can be powered on at the same time due to power density constraints) will get worse in the future, and embedding special-purpose accelerators is an effective way to cope with it.

Discrete special-purpose co-processors are a type of Application-Specific Integrated Circuit (ASIC). Compared to GPUs, VPs or FPGAs, ASICs offer the most efficient allocation of resources from both a power and a performance point of view. An ASIC is custom-produced for a specific application and is therefore optimized for that application from the circuit all the way up to the architecture level. The main limiting factor is the cost of ASIC design, so ASICs make sense for high-volume applications such as automotive. One interesting recent application of ASICs, however, has been High Performance Computing, with the ANTON ASIC achieving two orders of magnitude speedup for molecular dynamics applications. Recently, ASIC designs have been proposed by academia to accelerate database queries; the expectation is for ASICs to be increasingly utilized to accelerate DBMSs as well as Big Data analytics.

3.2.5 Reconfigurable co-processors

The most prominent type of reconfigurable co-processor is the Field-Programmable Gate Array (FPGA). Internally, FPGAs consist of an array of programmable logic blocks (generic binary logic elements whose behavior can be fully re-defined at runtime) as well as configurable interconnect paths to route information between these blocks. The behavior of both the blocks and the interconnect paths is controlled by a configuration layer, which can be modified by the user. This dynamic design allows FPGAs to re-define their complete behavior at runtime; they are "field-programmable". System and hardware developers mainly utilize FPGAs to prototype new hardware designs and to provide special-purpose accelerators for which the cost of producing ASICs would not be justified, for instance in robotics. However, FPGAs are also starting to play a larger role in the world of data processing: Microsoft announced in June 2014 that by adding discrete FPGA boards to their data centers, they could accelerate certain operations in their Bing search engine by up to 95%.
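The reconfigurability idea can be illustrated with a toy model of the basic FPGA building block, a look-up table (LUT) whose stored truth table defines an arbitrary Boolean function of its inputs, so that "reprogramming" simply means loading new table contents. This is a conceptual illustration only and has nothing to do with a real FPGA toolchain.

```python
# Toy model of an FPGA logic block: a k-input look-up table (LUT) stores a
# truth table, so changing the stored bits redefines the logic function.
# Purely illustrative; real FPGAs add carry chains, flip-flops and routing.

class LUT:
    def __init__(self, truth_table):
        self.table = list(truth_table)            # 2**k entries of 0/1

    def reconfigure(self, truth_table):
        self.table = list(truth_table)            # "field programming"

    def __call__(self, *inputs):
        index = 0
        for bit in inputs:                        # inputs select one table entry
            index = (index << 1) | (bit & 1)
        return self.table[index]

lut = LUT([0, 0, 0, 1])                           # 2-input AND
print("AND:", [lut(a, b) for a in (0, 1) for b in (0, 1)])
lut.reconfigure([0, 1, 1, 0])                     # same "hardware", now XOR
print("XOR:", [lut(a, b) for a in (0, 1) for b in (0, 1)])
```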


Defining the behavior of an FPGA is a complex process: First, the intended behavior is expressed using a hardware description language (HDL) such as VHDL or Verilog. Next, this description is (automatically) translated into a configuration update for the specific FPGA by “place-and-route” tools provided by the vendor. Finally, the generated configuration update is sent to the FPGA. While the actual configuration update takes only a few milliseconds, the “place-and-route” process can take hours or even days, depending on the complexity of the HDL design. Because of this highly involved process, and the scarce availability of FPGAs in data centers, FPGAs are typically not used directly by “regular” programmers. However, this may change in the future, with FPGA vendors adding support for “simpler” programming frameworks such as OpenCL, as well as for techniques like partial reconfiguration, which allows smaller subsets of the FPGA to be reconfigured without having to rebuild the whole chip. Another development that will likely lead to wider adoption of FPGAs in “general-purpose” settings is the emergence of CPU-embedded FPGAs: for instance, Intel announced in August 2014 a version of its Xeon server processor that will feature an on-chip FPGA connected directly into the processor’s coherent memory view via the Intel QPI interface.

3.2.6 Accelerator design with High-Level Synthesis (HLS) tools As mentioned previously, FPGAs are based on very low-level hardware concepts, such as gates and clock-cycles. Programming FPGAs is consequently a matter of describing low-level elements in languages like VHDL and Verilog. The concepts of these languages are so different from those found in other domains, such as GPUs, that software written for one component cannot be easily reused for the other.

These problems can be addressed by providing higher-level abstractions for FPGA programming. Such new languages and tools for describing hardware models can be classified into three groups. The first comprises new-generation hardware description languages based on a different paradigm than Verilog and VHDL. Such is the case of Bluespec SystemVerilog, a hardware description language based on guarded atomic rules and dataflow models. A second group comprises domain-specific languages, for instance for the simulation of mathematical models or for graphical programming.

The third group is the most prominent and includes tools that can convert a description written in a popular software programming language into hardware. Some of these tools are based on building blocks similar to those of Verilog or VHDL. Examples of this subgroup are MyHDL, a project that aims to allow developers to describe hardware in a familiar language like Python; Chisel, a hardware construction language embedded in Scala, which itself is a higher-level JVM language; and Lime, a Java extension that can be synthesized into hardware. Other tools are based on programming paradigms that may be new to software programmers. For instance, OpenSPL is a Java-like language based on dataflow and streaming paradigms. Altera OpenCL is a commercial tool from the FPGA vendor Altera that can map OpenCL descriptions into hardware.
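
As a flavour of the first subgroup mentioned above, the following minimal sketch describes a combinational adder in MyHDL and converts it to Verilog; it assumes MyHDL's pre-1.0 toVerilog conversion interface, and the signal widths and names are purely illustrative.

from myhdl import Signal, intbv, always_comb, toVerilog

def adder(z, a, b):
    # Combinational adder described in ordinary Python
    @always_comb
    def logic():
        z.next = a + b
    return logic

# 8-bit operands, 9-bit result to accommodate the carry (illustrative widths)
a, b = Signal(intbv(0)[8:]), Signal(intbv(0)[8:])
z = Signal(intbv(0)[9:])

# Convert the Python description into Verilog for the vendor place-and-route flow
toVerilog(adder, z, a, b)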

Finally, the tools that transform C-like programs into hardware have received considerable attention in recent years. This approach is also known as High-Level Synthesis (HLS). Synflow Cx is a C dialect that allows cycle-accurate hardware descriptions. Vivado HLS is the commercial product from the FPGA vendor Xilinx; it can currently transform standard C and C++ programs into Verilog and VHDL. LegUp is another compiler similar to Vivado HLS; it is open source and targets Altera FPGAs.

The adoption of HLS technology by the main FPGA vendors shows the industry's desire to embrace the software programmer market. Specifically, the objective is to let software programmers accelerate their applications using FPGAs. This market is currently dominated by GPUs.

HLS tools still have several limitations compared to manually coded Verilog or VHDL. For instance, nested loops or complex data structures may be difficult to map to hardware accelerators, and applications with complex execution flows may exhibit low performance. However, many digital signal processing and multimedia algorithms, based on streaming and data-flow paradigms, are well suited for HLS. Such limitations may be mainly related to the algorithms and techniques used by HLS tools. For this reason, the quality of HLS-generated designs can be expected to improve continuously, in some cases matching that of expert hardware designers within a few years.

Improving the programming experience is the main objective of HLS developers, who are trying to simplify hardware development and make it accessible to software programmers. However, FPGAs still suffer from several limitations compared to GPUs that may hinder the success of HLS.

If the current trends in FPGA technologies continue, the following can be expected:

• HLS tools will be able to generate high performance accelerators for some classes of algorithms.

• In order to make FPGAs more usable for software programmers, the main vendors will develop standard interfaces and the necessary interconnection software stack, so that programming and communicating with the devices becomes transparent to the developer.

• Two different market segments will exist. The first comprises the existing community of expert hardware designers, who will continue to use Verilog, VHDL and advanced hardware description languages. The second, composed of regular software programmers using very high abstraction levels to accelerate software applications, will emerge alongside it; this latter group can potentially be much larger than the former.

• In the next 10 years, FPGAs can be expected to share the market for high-end computers with GPUs, especially for big data applications. They will be present in some servers as high-speed PCI cards or inside heterogeneous CPUs.

3.2.7 Interconnects for discrete co-processors PCI Express is by far the most important interconnect bus for discrete co-processors today. The current generation, PCI Express 3.1, achieves a data transfer rate of around 980MB/s per lane in each direction. Given that co-processors are typically attached via a 16-lane PCI Express interface, this leads to a peak data transfer rate of a little under 16GB/s. While this sounds high, it is only about half of the peak achievable main memory bandwidth, which is around 30GB/s for dual-channel DDR3 memory. The next generation of PCI Express (4.0), which is expected by the end of 2015, aims to reduce this bandwidth gap by doubling the peak data transfer rate to roughly 1.9GB/s per lane, or nearly 30GB/s for a 16-lane interface. It should be noted, though, that by the time PCI Express 4.0 hits the consumer market, the faster DDR4 memory will already be well established, again leading to a bandwidth gap.
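
The bandwidth figures quoted above can be checked with simple arithmetic; the sketch below uses the rounded values from the preceding paragraph and is only meant as a back-of-the-envelope illustration.

# Rough check of the interconnect figures quoted above (rounded, assumed values)
per_lane_gb_s = 0.98                      # ~980 MB/s per lane and direction (PCIe 3.x)
lanes = 16

pcie3_x16 = per_lane_gb_s * lanes         # ~15.7 GB/s peak in one direction
pcie4_x16 = 2 * pcie3_x16                 # PCI Express 4.0 doubles the per-lane rate
ddr3_dual_channel = 30.0                  # ~30 GB/s peak dual-channel DDR3 bandwidth

print("PCIe 3.x x16: %.1f GB/s" % pcie3_x16)
print("PCIe 4.0 x16: %.1f GB/s" % pcie4_x16)
print("Gap to dual-channel DDR3: %.1f GB/s" % (ddr3_dual_channel - pcie3_x16))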

Nvidia is currently developing NVLink, a proprietary alternative interconnect to augment PCI Express. NVLink is mostly meant to establish a direct physical connection between multiple graphics cards, offering 20GB/s per link (at 4 links per card, this amounts to around 80GB/s of bandwidth between cards, roughly 3 times the speed that PCI Express achieves). Additionally, IBM is partnering with Nvidia to support NVLink connections between a graphics card and main memory on its latest POWER architecture. NVLink is expected to be introduced in 2016 with Nvidia's Pascal line of GPUs.

Another interesting development in the field of co-processor interconnects is the Coherent Accelerator Processor Interface (CAPI), which IBM added in the latest version of its POWER line of CPUs. Via CAPI, discrete co-processors act like virtual embedded cores: while the co-processors still have to be physically connected via PCI Express (or NVLink), they can transparently operate in the CPU's memory space and have hardware-managed cache coherence with the CPU. In contrast to comparable software solutions that offer similar features – such as CUDA's Unified Memory or OpenCL's Shared Virtual Memory – CAPI is implemented directly in hardware on the POWER CPU die and is therefore fully transparent to the operating system and device drivers.

There have also recently been announcements from a number of ARM technology partners regarding the adoption of the RapidIO multi-socket technology. Such a move could suggest future products that include coherent multi-socket solutions built from devices utilizing the ARM architecture.

In all of the above cases, the interconnect is designed for deployment using a backplane approach. It should however be noted that such approaches require a non-negligible amount of energy to transfer data, typically measured at around 100pJ per bit transferred. The power consumption of such technologies could therefore somewhat limit their use in scaling computers for big data.
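
As a back-of-the-envelope illustration of what 100 pJ per bit means in practice, the following sketch computes the power needed merely to keep an accelerator link busy; the 16 GB/s link rate is an assumed example value.

# Power required to sustain a co-processor link over a backplane interconnect,
# using the ~100 pJ/bit figure quoted above (the link rate is an assumed example)
energy_per_bit_j = 100e-12                # 100 pJ per bit
link_rate_bytes_s = 16e9                  # e.g. a 16 GB/s PCIe-class link

power_w = link_rate_bytes_s * 8 * energy_per_bit_j
print("~%.1f W spent purely on moving the data" % power_w)   # ~12.8 W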

3.2.8 Accelerator evolution and Europe In addition to the future trends identified in the previous sections, the main evolution for accelerators is expected to come from the software side. The performance/energy advantage of accelerators is expected to persist in the future; at present, the performance-per-joule improvement over standard processors is roughly 10X for GPUs, 100X for FPGAs and 1000X for ASICs. While this seems advantageous for accelerators, they are difficult (GPUs), very difficult (FPGAs), or almost impossible (ASICs) to program. For this reason, most future accelerator development is expected to occur on the software side, first with a view to making each accelerator easier to program individually and then, as the number of different accelerators in a system grows, to program them without specific knowledge of the existence of any particular type of accelerator. Europe, traditionally strong in programming languages, stands to profit from this development if it marshals its resources accordingly.

3.3 3-D or vertical integration Silicon components such as memory or compute units usually do not provide the desired functionality on their own, but come in integrated packages. While horizontal integration (either on one die, or via multiple dice per package) is the default, there are other solutions to this problem, such as vertical integration.

Vertical integration can be broadly described as a system integration paradigm where multiple physical tiers (dies) are stacked forming a multi-tier system [19].

The benefits of such an approach can be:

1. Increase of functional elements per area, without further decreasing structure size: this allows for integration of more units on the same area. This translates to higher memory density or to higher circuit complexity for equal package footprints.

2. Decreased power consumption: shorter wires between layers, instead of long wires routed across a single layer, may allow significantly lower energy consumption in the interconnect.

3. Wider interconnect buses: 3-D interconnects may allow for much wider buses than what is feasible with 2-D interconnects.

4. Higher production yields: depending on the size of the stacked elements, individual tiers with a small footprint may allow for a higher yield than very large chips that cover the same functionality as the stacked tiers.

As was discussed regarding the energy consumption associated with backplane-based interconnection for compute acceleration, vertical deployment of accelerators would also allow the interconnect to use short-reach technologies (as are being adopted by HMC) to connect accelerators with an energy consumption of less than 10pJ/bit.

At the package level, 3-D packaging has been a commercially successful approach, where package-on-package (PoP) and system-in-package (SiP) are typical variants of this mature technology [20]. These well-established packaging methods are not discussed further due to their limited scalability and rather coarse interconnect pitch. The granularity of vertically integrated systems, by contrast, varies significantly from the device level all the way to the die level, leading to multiple fabrication processes for these systems.

Vertical integration or stacking of circuits includes two major steps: i) the formation of vertical interconnections and ii) the bonding of the individual tiers [9]. The vertical interconnects are usually implemented with short wires (with lengths ranging from a few µm to a few hundreds of µm), which are called “Through-Silicon Vias” (TSVs) and are formed within the silicon substrate to electrically connect the interconnect layers (i.e., the back end of line (BEOL)) between physically adjacent tiers [19]. TSVs are a critical element of 3-D ICs, as the electrical behaviour of these wires essentially determines the improvement in performance, power, and area that the third physical dimension can offer. The high-yield fabrication of TSVs with high aspect ratios is, consequently, a crucial parameter for the advancement of 3-D ICs. Bonding between tiers can also be implemented at different levels and in different ways. The usual methods include wafer-to-wafer (W2W), die-to-wafer (D2W), and die-to-die (D2D) bonding, where these methods are ordered in terms of descending manufacturing throughput. Moreover, two tiers can be bonded either face-to-face (F2F) or face-to-back (F2B), where the “face” side corresponds to the BEOL and the “back” side corresponds to the thinned silicon substrate. Another variant of vertical integration is that of interposers, also known as 2.5-D integration. An interposer is an interconnect fabric that provides higher-density and higher-performance interconnections among dies compared to traditional package substrates and PCBs. Interposer substrates can be based on different materials, such as silicon, organic materials, and glass, and in the near future are expected to embed passive devices as well as transistors [9], [21]. A coarse classification of vertical systems is illustrated in Fig. 1.

Fig. 1. A broad categorization of vertically integrated systems, depicting 3-D and interposer (2.5-D) based systems (solid rectangles), homogeneous 3-D and 2.5-D systems (dotted line), and heterogeneous 3-D and 2.5-D systems (dashed line).

The processing complexity added by the TSVs and this disparity of processes for manufacturing 3-D ICs require a shift away from the traditional supply chain in the semiconductor industry. Drawing an evolution path for 3-D integration is not a straightforward task. Previous efforts to roadmap 3-D ICs have encountered several difficulties and have necessitated frequent deferments of milestones due to a number of challenges relating to the business model, supply chain, identification of lucrative applications, manufacturing equipment, thermal management, and design and test tools and methodologies. Each of these challenges has delayed to a certain extent the volume production of 3-D ICs despite the many significant research findings [21], [22]. However, recent announcements from memory vendors such as Micron, SK Hynix, and Samsung are encouraging and demonstrate the significant strides industry has made towards mass production of 3-D ICs [22], [23]. Memory products have sensibly been the first application of 3-D ICs due to the homogeneity of the stack (which facilitates manufacturing) and the improved bandwidth at low power that 3-D memories exhibit. In addition, Xilinx recently adopted an interposer-based interconnection solution, greatly improving the performance of their FPGAs [24]. Although interposer solutions exhibit inferior performance characteristics and lower integration density compared to 3-D ICs, their lower cost and time to market are appealing and will most likely remain an economically viable approach in the near term.

Looking further into the future, 3-D integration has been identified as one of the technologies for the sustainable future of the electronics industry in Europe beyond 2020 [25]. The Electronic Leaders Group has also pinpointed the need for additional research on this potent technology. Indeed, there is a critical requirement for the industry to provide high manufacturing yield and robust assembly processes to reduce production cost and enable large-scale production of 3-D ICs. Beyond the business logistics, several technical challenges have yet to be addressed. The research community in Europe has contributed on both fronts, although more research is still required. Major research institutes across Europe, including CEA-LETI in France, IMEC in Belgium, and Fraunhofer in Germany, have developed fabrication processes for specific stages of 3-D IC manufacturing, including high-yield TSV processes. In addition, these research centers have produced 3-D prototypes and, in the process, have developed design flows to compensate for the lack of commercial EDA tools that can efficiently adapt to the complexity and diversity of 3-D fabrication processes.


The plurality of manufacturing techniques suggests that no single optimal technique will exist. Rather, applications will determine which of these processes gain a larger share of the market, although a “killer application” for 3-D ICs has yet to appear. Such an application may lie within the realm of heterogeneous 3-D systems. These systems contain a mixture of functionalities, where sensing is the vital function in addition to processing, storage, and communication. The envisioned application domains for these systems include ambient intelligence, smart cities and buildings, as well as wearable electronics, and align well with the Digital Agenda for Europe. These domains are intrinsically related to the Internet-of-Things (IoT) [26], [27]. As the IoT evolves, the myriads of sensory data will become relevant to the Big Data challenges that the community will need to address. In other words, 3-D technologies can be seen as an enabler for Big Data, providing the versatile and energy-efficient platforms that will collect, process, and store vast amounts of data through IoT devices.

3.4 On-chip Interconnect The metal interconnects typically used within integrated circuits (ICs) are starting to become the limiting factor for computing scalability, both in terms of operating frequency and power consumption [28], [29]. The main issue inherent to metal interconnects is that parasitic capacitance and signal propagation delay increase as CMOS technology is downscaled, which has been the main way for the microelectronics fabrication industry to achieve higher performance. This phenomenon is broadly known as the interconnect bottleneck. The interconnect bottleneck is particularly exacerbated in the context of Big Data applications and systems because of their data-intensive, and thus communication-intensive, nature. In such workloads, transferring information between the different components of the system (storage, memories, processing cores) is essential and occurs very frequently.

Currently, the most feasible solution to the interconnect bottleneck is acknowledged to be replacing long metal interconnects with optical ones which, apart from overcoming the main limitations discussed above, offer a number of other desirable properties such as temperature invariance and enhanced reliability [30].

However, this technology is not yet completely mature and hence there is still no manufacturer able to mass-produce optically interconnected chips within a competitive budget. The following section discusses the latest technological advances and how optical interconnects are moving closer and closer to becoming a widespread, mature technology.

3.4.1 On-chip optical technologies The production of optical interconnects within standard CMOS processes has recently become a reality and has reached the prototyping phase. Two of the major players in the chip manufacturing industry, namely IBM—with their Silicon Integrated Nanophotonics [31]—and Intel—with their Silicon Photonics Technology [32]—have demonstrated the production of silicon-based, CMOS-compatible optical components (such as light sources, modulators, wavelength-division multiplexers and photodetectors) that will be used as building blocks in the design of future chips. These components make it possible to generate photons, encode the data, transmit data in parallel and, finally, retrieve the data from the transmitted photons. The key value of such components will be the ability to terminate optical communication within close physical proximity of the silicon. This allows the long-reach, low-power advantages of optical connectivity to be delivered directly to the silicon. If a traditional backplane technology is used to deliver the optical termination, the inefficiency of the backplane can quickly become the dominant power component again.

On-chip photon transport is another issue that is being extensively investigated [33], [34], with advances in terms of the best materials and techniques to use when producing waveguides—the name given to optical on-chip wires.

With all the components discussed above, it would be possible to produce a hybrid NoC in which optical links interconnect electronic router circuitry. However, while this is clearly an improvement over an all-electronic NoC, some of the problems of an electronic system would still remain, mostly those related to power and performance, due to the electronic switching and buffering becoming the communication bottleneck. To fully overcome these issues and build a fully optical NoC, there is an extensive body of research on how to create optical buffering [35], [36] and switching [37], [38]. Even though these have been investigated for a long time, it is only very recently that we are reaching the point at which the bit error rate in such devices is starting to become negligible, one of the most important milestones on their way to being adopted by industry and, eventually, to becoming available in mass production processes.

Of special interest, due to its very recent nature, is the possibility of extending this technology to allow optical interconnections among the different layers of 3-D stacked chips [39], [40], [41]. Such a system design would exploit the benefits of these two technologies to produce higher-density, higher-performance and lower-power computing systems, enabling more cost-effective and productive computing. Unique to optics is the ability to broadcast data to a large number of components simultaneously, whereas traditional wire-based interconnects quickly become complex and a significant source of energy consumption when broadcasting. The application of such systems is speculated to be a very effective way to improve the performance and efficiency of Big Data workloads by bringing memory and processing closer together, in what has come to be called Near-Data Processing [42], [43].

3.4.2 Optical Network-on-Chip (NoC) In preparation for the advent of optical on-chip interconnects, the computer architecture research community has generated a wide body of knowledge on the best avenues to exploit these technologies as soon as they become widely available. There are a number of areas in which the community has been investigating, all of which are key to a successful exploitation of the foreseeable near-future technologies:

• Floor planning—how to place the different components of the chip in an efficient and effective way—will undergo a fair amount of change due to the new constraints imposed by the integration of the optical components. These effects, and how to best overcome them, have been investigated extensively [44], [45], [46], [40].

• There is also a significant effort put into investigating the best router architectures for optical NoCs [47], [48]. Again, the new opportunities and constraints inherent to the utilization of optical components will drastically change how we design NoC elements.

• Of great importance to the overall performance and efficiency of the chip are the topology [49], [50], [51], [52] in which the system components are arranged and the routing functions [53] used to traverse this topology. The peculiarities of optical NoCs also greatly affect the performance metrics of the system and hence will change the design methodologies in this area.

• The power [54], [55] and thermal [56] effects of optical NoCs and their impact on the NoC architecture have been analyzed as well.

3.4.3 Conclusions To conclude the discussion on optical interconnects, we would like to highlight the high likelihood that this technology will become a standard part of future computing systems, because of its clear benefits compared with electronic interconnects and its high level of maturity. This will create an ecosystem around the technology with the potential for very large economic growth as it is adopted, mostly related to its suitability for industry- and business-level computations, with Big Data workloads being the clearest representatives of the applications that will benefit. Other optical-related interconnection technologies that are being investigated, but are currently at a much lower level of maturity, lie in the context of carbon-based integrated circuits. Nanophotonics seems very promising in both the context of graphene [57], [58], [59] and carbon nanotubes [60], [61], but is still far from demonstrating its viability for commercialization.


4 Network architecture Single computers are often not sufficient to perform the entirety of the complex computations required by Big Data workloads. Consequently, multiple computers have to be connected by a network to work collectively on the solution to a given problem.

The main challenges here are to find efficient ways to connect multiple computers so as to maximize their combined utility (networking) and to house and manage very large numbers of systems in a cost-effective way (data centers).

4.1 Networking Networks are undergoing a big change, since the computational capabilities of commodity hardware can now equal or even exceed those of specialized, highly optimized routers. As a result, a current server blade, combined with a few network interface cards, is capable of delivering the same performance as traditional routers and network devices. The advantages are numerous, including a reduction in the cost of the hardware devices. However, to obtain these benefits, a lot of data has to be collected, transported and analyzed in real time. To make this new paradigm a reality, new concepts have been conceived. A major revolution was the introduction of Software Defined Networks (SDN), where the control plane is physically separated from the components that forward the packets; in other words, the separation of the control plane and the data plane. Other technologies, such as OpenFlow, appeared to connect the SDN controller with the network devices. The figure below illustrates what SDN implies, with the separation of the control plane and the data plane.

Figure: SDN architecture, with the control plane separated from the data plane (source: http://www.thetech.in/).

SDN brought another revolution: the possibility to define in software the functions to be applied to the traffic, such as routing schemes, security, or firewalling. Most of this software is open source, which has led to an explosion of new applications, functions, and optimizations for routing devices implemented following the SDN concepts. It is no longer necessary to use specific operating systems to control the routing hardware; for example, Cisco's IOS is losing relevance with the eruption of these new software possibilities, because specific hardware is no longer required. As a result, the network gains more efficient control, new functionality and a wide area for future improvement. It should however be noted that, today, such flexibility comes at the cost of power efficiency and performance.
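
To make the separation of control plane and data plane concrete, the following purely illustrative Python sketch (not a real controller API such as OpenFlow; all names are hypothetical) shows a controller installing match/action rules that a switch then merely looks up.

# Illustrative sketch of the SDN idea: policy is decided centrally (control plane)
# and pushed into a flow table that the switch (data plane) only consults.
flow_table = []   # data-plane state, populated exclusively by the controller

def controller_install_rule(match, action):
    # Control plane: decide the policy and push it to the switch
    flow_table.append((match, action))

def switch_forward(packet):
    # Data plane: apply the first matching rule, with no local decision logic
    for match, action in flow_table:
        if all(packet.get(field) == value for field, value in match.items()):
            return action
    return "send_to_controller"   # unknown flows are punted back to the controller

controller_install_rule({"dst_ip": "10.0.0.5"}, "forward:port2")
controller_install_rule({"tcp_port": 23}, "drop")   # a trivial firewall rule
print(switch_forward({"dst_ip": "10.0.0.5", "tcp_port": 80}))   # forward:port2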


A complementary concept to SDN is Network Function Virtualization (NFV). It proposes the use of virtualization techniques to virtualize entire network functions such as load balancers or firewalls. It builds on overlay networks (OVN), the basic building blocks that an NFV infrastructure manipulates. A Virtual Network Function (VNF) can consist of different virtual machines running on top of a high-end server, or even in the Cloud. It is important to remember that VNFs are software implementations of network functions. This avoids the use of specific and costly hardware, but again at the cost of performance and power efficiency.

Internet of Things deployments are becoming important, since the IoT is one of the next big revolutions in Information Technology. This technology will require more information from the network topology, and with SDN it will be much easier to provide it than with traditional routers and network technology. Moreover, since network functions can be defined in software, novel techniques can be tested and implemented to empower IoT applications, taking advantage of the greater openness that SDN provides.

Additionally, networks are no longer just a collection of pipes to transport data around. More intelligence is being added to the network itself to allow more efficient processing and storage of the information. Technologies like Fog Computing bring the Cloud closer to the Edge to take advantage of locality and reduced latency.

4.1.1 Networking interconnects Optical fiber-based interconnects currently offer the highest bandwidth when it comes to connecting one or more machines with each other. One of the biggest challenges of optical interconnects is transforming electrical signals into optical ones.

On-chip photonics promises to alleviate this situation by reducing energy consumption and production costs through integrating the electrical/optical conversion directly on silicon.

Network components in which such conversions would take place extremely frequently are network switches and routers. Current research in this domain tries to eliminate the need for optical/electrical conversion entirely by building photonic switching hardware that operates directly on light instead of electrical signals.

An alternative to increasing the network bandwidth is to reduce the amount of data that needs to be transmitted between network nodes. If data reduction by means of compression, filtering, and aggregation is suitable for a given application, it is beneficial to perform these tasks prior to shipping data over the network.

This goal can be achieved by equipping network hardware, and especially shared network components like switches and routers, with suitable compute resources. The reasoning behind this approach is that multiple components can send their data to a shared network component, where it is aggregated and/or filtered before being shipped to the next node, instead of shipping the entire data stream there.

4.1.2 Local Area Networks or LAN Electronic switches are currently the basic building block for large (off-chip) datacenter networks, which play a key role in scale-out applications dealing with big data. The bandwidth requirements for such networks scale together with the dataset size, as well as with the degree of application and tenant multiplexing in the datacenter. According to a recent review [62], [63], the bandwidth requirements of datacenter networks increase by a factor of 20 every four years.

Today, big data applications that employ a MapReduce framework rely heavily on high-quality network services in order to scale out to more processors and larger data sets. The in-network latency, including its mean and high-percentile values, is seen as a key differentiator of network architectures. Consistently providing low latencies to network flows is a subject of elaborate network engineering that spans all network layers, starting from the kernel network stack [63] and reaching the in-network switches and routers [64].

At the same time, the power consumption of datacenters is becoming a major concern. According to [65], datacenters in the US consumed 1.5% of the national power at a cost of more than $4.5B. Of this power, the servers consume 40%, the storage devices 37%, and the network 23%. Whereas the bandwidth requirements increase rapidly, the power budget that we can allocate to "light them up" is almost constant. The power consumption of current electronic networks, however, does not scale well. For example, the Huawei CE12812 router requires power supplies for up to 170 Watts per 100 Gb/s port [66]. Effectively, a breakthrough technology that will enable us to economically scale datacenter networks is much anticipated.
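
Converting the quoted per-port figure into an energy-per-bit number makes the comparison with the interconnect figures cited earlier in this report straightforward; the arithmetic below is only a rough sketch based on the numbers above.

# Energy per bit of an electronic router port, from the figures quoted above
port_power_w = 170            # up to 170 W per 100 Gb/s port (Huawei CE12812)
port_rate_bps = 100e9

joules_per_bit = port_power_w / port_rate_bps
print("~%.0f pJ/bit" % (joules_per_bit / 1e-12))   # ~1700 pJ/bit
# For comparison, backplane accelerator links were quoted earlier at ~100 pJ/bit.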

The optical network is one candidate technology, promising energy-efficient datacenter networks. The most significant benefits of optical networks are twofold. First, they promise to reduce the power consumption per switch port. Second, they can help to tame the cabling complexity by carrying multiple high speed streams on a single fiber cable using Wavelength Division Multiplexing (WDM).

Unfortunately, the benefits of optical interconnects stop there. In practice, the full replacement of electronic networks by optical ones is hindered by the lack of efficient technologies for optical packet buffering and optical header processing. Large switching latencies are also a concern but, thanks to recent developments, perhaps the smallest one. Whereas traditional micro-electro-mechanical systems (MEMS) based switching nodes exhibited switching latencies of many milliseconds, recent efforts have managed to bring this latency down to the range of a few microseconds [67].

Optical networks have been widely used in long-haul telecommunication networks, providing high throughput and low power consumption. Currently, they are also deployed in datacenter networks for point-to-point links, in the same way as point-to-point optical links were used in telecommunication networks. Optics are preferred for distances longer than a handful of meters; below that range, copper cables are more efficient. The drawback is that power-hungry electrical-to-optical (E/O) and optical-to-electrical (O/E) transceivers are required at every switching node.

The aforementioned technological trends have shaped the network architectures that have been proposed for optical datacenter networks. Commercially available optical switches are mainly based on MEMS switches, whose long reconfiguration time is inappropriate for latency-critical applications [68]. Therefore, hybrid electrical/optical interconnects have been proposed to introduce optical technologies into datacenter networks. C-Through [69] and Helios [70] are two major representatives of this approach.

C-Through adopts two parallel networks among the top-of-rack switches: a traditional Ethernet-based electronic network and an optical circuit-switched network. Using commercially available optical components, it presents an alternative that is viable today. Similarly, Helios brings MEMS switches and WDM links into datacenters through commercially available optical components. While inappropriate for low-latency datacenter communications, these designs present an interesting starting point for future datacenter networks. By routing the large datacenter flows through a circuit-switched network, they can indirectly favor the low-latency flows as well, by relieving the electronic network of the so-called elephant flows. In this way, higher quality of service can be obtained, since the electronic network queues will have smaller backlogs and hence will deliver reduced latencies for small flows.

The OSMOSIS project [71], by IBM and Corning, takes a different path. Aiming at a unified all-optical network, OSMOSIS proposes a low-latency optical switch, featuring electronic buffers at the inputs and at the outputs and using Semiconductor Optical Amplifiers (SOAs) for packet switching. SOAs can switch every few nanoseconds, thus presenting a very compelling technology for optical interconnects. However, at the same time, SOAs consume a lot of power. In addition, by buffering packets at the inputs and at the outputs, OSMOSIS requires four power-hungry O/E or E/O data conversions per switching node.

Looking forward, an all-optical transparent datacenter network is a far-reaching goal, but also one that is not expected to bring tangible results within the next five years. Optical technologies are currently being examined by Intel for intra-rack communication, to connect pools of compute, memory and storage chassis within a single rack, with the goal of eliminating network bottlenecks between high-density servers while reducing the number of cables in a rack. For inter-rack networks, the lack of reliable buffering and packet processing in the optical domain limits the spectrum of alternatives to architectures that move all sophisticated processing and buffering to the end-nodes (e.g. TOR switches), keeping the all-optical core as simple as possible, presumably responsible only for the switching of packets. Today, this is exploited by architectures like C-Through, where large flows can use the slow optical network and small flows an electronic one. Once we have low-power switching elements that can reliably switch packets (with low attenuation) at sub-microsecond latencies, commercial all-optical inter-rack networks will emerge, with the potential to replace the current electronic ones.

Care should however be taken when considering the overall energy efficiency of an optically based communication system. Today's high-speed analog short-reach communication technologies can deliver data at similar rates at around 1pJ/bit, whereas even the smallest and most power-efficient optical system delivering data over the same short reach consumes at least 10 times this level of energy. Only when the inherent energy efficiency of optical "wire" transmission is exploited over longer distances can the energy consumed in generating the light be mitigated.

4.2 IoT and Edge computing

4.2.1 Evolution The Internet of Things (IoT) is the connection of embedded devices to the Internet and other networks. Sensors and actuators form a huge part of these things, and in the near future virtually all devices could be connected to the Internet. As stated by IBM, each day 2.5 quintillion bytes of data are created, and 90% of the data existing in the world today has been created in the last two years. Working with these massive amounts of data is the next major computational challenge. Estimates suggest that by 2020 there will be around 50 billion devices connected to the Internet. IoT has great potential, since it is able to collect very useful and, at the same time, critical information. With the correct processing of all that data, human life can improve in quality and satisfaction to levels that today seem like science fiction. This will result in the proliferation of a large number of IoT applications.

IoT implies a new model of data sources. Until a few years ago, most data was generated by devices with a human being behind them, like a computer or a smartphone. With IoT, this will change and we will see data generated by sensors and actuators without a human controlling them. Some examples can be found in different applications. A first example is a smart building, where sensors distributed throughout the building collect information. The sensors are the basic element of the IoT environment, the part that generates the data which, after processing, becomes useful information for making more efficient use of energy while at the same time improving the comfort of the occupants. Another example of an IoT application is precision agriculture. In this case, sensors are distributed over a field and can provide information regarding humidity and nutrients in the ground, among other values. Later, depending on the aggregated information from the sensors across the field, actuators can irrigate or fertilize the soil whenever necessary. In both applications, IoT can improve efficiency and deliver benefits in terms of productivity and a better life for the workers. In the two cases mentioned above, the systems are not critical. However, there are other applications where the requirements are vital to guarantee the correct behaviour of the system into which the IoT is integrated, for example a healthcare application that monitors the vital signs of patients at risk of heart attack. In this case, the response time of the network and of the system itself is critical to ensure the survival of the patient.

Going through the examples discussed in the previous paragraph, we can identify the key pillars of IoT. It is an end-to-end platform involving distributed networking, computing, and storage, and it brings together engineering and advanced services. IoT is not a collection of applications working independently from each other; they all share the same physical environment (such as a city). Currently, this is a problem: each IoT application vendor installs its own hardware platform at different locations in a city, which means that each company deploys its own compute and storage resources. In the case of the network, the sensors are mostly connected wirelessly. This is known as the silos problem. Its implications are very costly in terms of money and development effort, since each IoT application is agnostic of the resources of the others, resulting in duplication. Since sensors are developed specifically for each application, the duplication starts at the first level that gathers the information and performs a pre-processing of it. We can say this is the first level of aggregation in the system. Depending on the geographic distribution of the application, there may be a hierarchy of nodes performing this aggregation.

In the last few years, several alternatives have appeared that try to solve the issues mentioned above. Researchers and industry are focused on offering an infrastructure capable of handling efficient IoT deployments with different verticals operating simultaneously and without interference (or competition for the hardware resources), since the Cloud alone is not feasible due to latency constraints and lack of flexibility. The most promising alternative is Fog Computing. This technology extends the Cloud to the edge, exploiting locality and offering great versatility to deal with mobility and dynamic systems. It also offers reduced latency, a critical aspect for applications like healthcare. Fog Computing is a non-trivial extension of Cloud Computing, enabling features and use-cases that have not been able to run on top of the Cloud. For example, low-latency applications like streaming or gaming can easily be executed on top of it; it can be used to control and exploit networks of sensors and actuators over a specific region, or to implement large-scale distributed control systems. Among its characteristics we find a multi-tiered, hierarchical organization, an orchestration plane organized in a hierarchical manner, and the ability to offer application-based parameters such as security, data consistency, real-time analytics, and availability, among others.

Fog nodes range from the first aggregation level of the system up to the Cloud. They are distributed in a hierarchical manner, but behave dynamically to adapt to the IoT applications. As mentioned before, in most cases the sensors and actuators are connected to the aggregation nodes via wireless technology. Techniques like Software Defined Radio are under study to control the radio layer with software rather than with specific hardware. This technique will allow the development of new functions and optimizations for the communication layer.
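
The following purely illustrative sketch (all names, window sizes and thresholds are hypothetical, not a real fog framework API) shows the kind of aggregation a fog node can perform: raw sensor readings are summarized locally, and only the compact summary, plus any alert, travels upstream to the Cloud.

import statistics

def fog_node_summarize(readings, alert_threshold=30.0):
    # Reduce a window of raw sensor readings to a compact summary for the Cloud
    return {
        "count": len(readings),
        "mean": statistics.mean(readings),
        "max": max(readings),
        "alert": max(readings) > alert_threshold,   # only critical events matter upstream
    }

# 1,000 raw temperature samples collected at the edge ...
window = [20.0 + 0.01 * i for i in range(1000)]
# ... are reduced to four values before being forwarded to the Cloud
print(fog_node_summarize(window))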

A key aspect of guaranteeing a fully operational IoT deployment is security. It concerns the entire system architecture, from the sensor plane to the Cloud or datacenter. Security is therefore a critical aspect that can delay the adoption of IoT applications. For example, a malicious attack could consist of introducing “bad” sensors to produce an incorrect average temperature in a building. In a more extreme case, if we are dealing with critical systems like grid distribution or healthcare, an attack can be catastrophic because it can cause deaths. Moreover, aspects like data privacy and management have to be controlled. Imagine your house full of sensors that monitor temperature, humidity, the amount of food in the fridge, etc. This information captures all the habits of a person and, in the wrong hands, can be used to rob the house or violate the privacy of its occupants.

To conclude, a problem observed in IoT is that each research group or company works on a specific part of IoT, such as security, sensor design, or networking. However, almost nobody is working on the entire vertical that IoT implies, from the sensors to the Cloud.

4.2.2 EU strategy Europe has the strength and is a pioneer in IoT deployments. We have great strength in different areas of the IoT ecosystem. Europe is a leader in embedded processors, a key area within IoT because they offer excellent computing capabilities with low power consumption. Many SMEs are dedicated to designing and producing innovative IoT applications, from the sensors themselves to the verticals they need. In areas like security and data privacy there are big companies working on the topic, and EU legislation is very clear with respect to data management policies. This point has to be emphasized because IoT will be involved in the entire life of people; all of this information will then be available and susceptible to attacks.

Another problem we have nowadays is that the different technological areas involved in the IoT environment are not treated as one. Each institution or company works in its area of expertise, without taking into account the rest of the layers. It is very important to start treating IoT as a cross-vertical technology. If we can optimize the entire vertical, the performance and efficiency obtained will be much better than in the isolated case.

A key contribution to guaranteeing vertical integration is the definition of standards at all levels. Standards to enable the communication of devices from different companies are a must, and Europe should define them taking into account the objectives and needs of the IoT community in Europe. We have leading companies and research centers in all the areas involved in IoT, and we should use this fact to define world-wide standards.

A great opportunity, and also a strength compared with other countries, is the adoption of IoT technologies by city councils all around Europe. Cities like Stockholm and Barcelona offer a great opportunity to show the potential of IoT. Once they have massive deployments, the community will see the great advantages for citizens and how their lives can be improved.


5 System software While the hardware provides the means to realize certain computations, it is the software that instructs the hardware how to do so. Software provides the specification that describes how the available resources are to be put to work.

This section describes the current state of existing Big Data software frameworks as well as potential influence from the HPC domain.

Additionally, we discuss Big Data core algorithms, which can be thought of as those most applicable to the majority of Big Data problems. The last part of this section discusses the current state of visualization techniques that help users understand their large data sets in a visual and exploratory way.

5.1 Software frameworks

5.1.1 The Big Data software stack Following a series of publications by Google, practitioners have adopted a highly robust and scalable software stack that allows complex computations to be performed on very large clusters while being resilient to the failure of one or more machines.

The basis for this stack is a cluster of shared-nothing commodity servers interconnected using standard Ethernet.

The basis for any Big Data workload is, of course, the data. In the classical software stack, the data resides in very large files stored in a distributed filesystem (DFS), such as HDFS (Hadoop Distributed Filesystem) [72], which, like Hadoop, has been modelled according to a Google publication [73]. Since a file may be much larger than a single disk, or even all disks available to a single node, the filesystem usually slices files into blocks of equal size (usually 64MB-512MB) and distributes these blocks across the cluster, replicating each block as often as is necessary to compensate for the failure of a given number of machines (e.g. each block must be replicated 3 times to guarantee full availability of all data in case 2 machines fail).
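
A small back-of-the-envelope sketch illustrates how the slicing and replication described above play out; the file size, block size and replication factor below are assumed example values within the ranges mentioned in the text.

# How a DFS such as HDFS spreads a large file across a cluster (assumed figures)
file_size_gb  = 1024          # a hypothetical 1 TB input file
block_size_mb = 128           # within the typical 64-512 MB range
replication   = 3             # tolerate the loss of two machines

blocks = (file_size_gb * 1024) // block_size_mb
raw_storage_gb = file_size_gb * replication

print("%d blocks of %d MB, %d GB of raw disk across the cluster"
      % (blocks, block_size_mb, raw_storage_gb))
# 8192 blocks of 128 MB, 3072 GB of raw disk across the cluster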

The DFS also enables access to each block from any node of the cluster. When compute jobs are submitted to the stack (for example using a processing framework such as Hadoop MapReduce), the stack tries to maximize the locality of data access. This means that any given subtask that processes a given block should be spawned on a node that has this block locally available. Should any machine fail during execution, the corresponding, currently unfinished, subtasks are restarted on a different machine.

In addition to these basics a complete software stack usually includes some sort of cluster management software, such as YARN [74], that allows flexible resource allocation between frameworks, as well as high level systems that use underlying frameworks as execution engines.


Figure: Sample Big Data software stack – storage layer (HDFS, S3, JDBC), cluster management (YARN, Mesos), execution engines (Hadoop, Tez, Spark, Flink), and APIs / high-level languages (Pig, Hive, Giraph, engine-specific Java and Scala).

5.1.2 Batch processing Batch processing is characterized by the fact that the size of all data that needs to be processed is known before the computation starts. The computation is thus performed on batches of data. The following sections describe the properties of established and novel batch processing frameworks.

5.1.2.1 Apache Hadoop MapReduce The first popular paradigm for massively parallel batch processing was MapReduce, as described by Google. Alluding to common concepts in functional programming, the programming model consists of two second-order functions, map and reduce, that process independent key-value pairs or groups of key-value pairs on partitioned data. Map and reduce only specify the processing semantics (“one tuple at a time” and “group of tuples at a time”); the developer has to implement what is actually to be done with the tuples in each of these functions. Here, the entire functionality of C++ (Google MapReduce [75]) or Java (Apache Hadoop [72]) is available. In addition, developers can use broadcast variables to transmit small-volume side-channel information.

The execution environment, together with input connectors that can read from distributed filesystems or databases, keeps track of the data partitioning. The runtime also materializes intermediate data to the distributed filesystem and assigns processing jobs of partitions to task slots on the cluster. That way the developer is not concerned with execution details such as hardware failure.

The order in which map and reduce are called is fixed, and the execution of the program ends with the reduce phase. Should the developer require multiple map-reduce passes to complete a certain program, a driver program needs to be implemented that requests the execution of multiple sub-programs.

Certain complex operations, like joining two datasets, are cumbersome to implement with the available primitives. Iterative algorithms are only supported via driver programs. Moreover, any application in which the code processing a tuple must maintain and access global state shared across multiple, if not all, instances of the processing quickly conflicts with the overall MapReduce mechanisms and the associated hardware cluster architecture, leading to very inefficient application performance.
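
The following pure-Python sketch mimics the map/reduce semantics described above (it is not Hadoop's Java API; the shuffle is emulated with a local sort, and the sample documents are invented).

from itertools import groupby
from operator import itemgetter

def map_fn(_, line):                 # "one tuple at a time"
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):         # "group of tuples at a time"
    yield word, sum(counts)

documents = [(0, "big data needs big clusters"),
             (1, "big clusters need software")]

# The framework would normally partition, shuffle and re-execute failed tasks;
# here the shuffle phase is emulated with a simple sort and group-by.
mapped = [kv for key, line in documents for kv in map_fn(key, line)]
grouped = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))
result = [out for word, group in grouped
          for out in reduce_fn(word, (count for _, count in group))]
print(result)   # [('big', 3), ('clusters', 2), ...]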



5.1.2.2 Apache Hive Data-warehouse-style analytics are quite popular within the big data domain. Writing such programs using low-level constructs such as those provided by MapReduce comes with a large overhead compared to conventional OLAP, which relies on SQL. Apache Hive [76] seeks to solve this problem by allowing users to formulate queries in an SQL-like language called HiveQL [77].

Hive’s data processing model is inherently relational. Concepts like schema-on-read allow Hive to access non-relational data such as distributed flat files and convert them on-the-fly into tuples of a user-defined schema. Besides relational operations such as joins and aggregations, Hive allows users to plug map and reduce UDFs into the program to add more advanced functionality not available in HiveQL. For the actual execution of HiveQL queries, Hive can employ several different systems, such as Apache Hadoop MapReduce and Apache Tez.

5.1.2.3 Apache Pig

Apache Pig [78] is a system usually employed for (exploratory) massively parallel data analysis. In Pig, programs are specified in a scripting language called Pig Latin [79]. Pig’s data model is relation-based and allows relational processing. Just like Hive, Apache Pig builds on top of Hadoop for execution.

5.1.2.4 Apache Tez

Pig and Hive have shown that the promise of an execution engine that deals with the hassles of running distributed software in a shared-nothing environment has a very strong appeal. Both systems compile their queries or programs into a sequence of MapReduce jobs. While this has the benefit of massively parallel, fault-tolerant execution, it also comes with a large overhead, since the pure MapReduce programming model is quite rigid. Apache Tez [80] is an execution engine that succeeds the MapReduce engine. Now MapReduce, Pig, Hive, and potentially other languages can be executed using the much more flexible execution primitives of Tez, which allow, for example, the specification of compute operations as vertices of a directed acyclic dataflow graph (DAG).

5.1.2.5 Apache Spark

Compared to Hadoop and Tez, Spark [81] offers a more flexible set of operators that includes relational primitives such as joins and aggregations. Spark also supports an interactive programming mode where subsequent results are not necessarily materialized to the distributed filesystem, but kept in main memory for faster processing. Moreover, intermediate datasets are not computed while the computation is being specified, but rather computed lazily once the user requests the materialization of the results. This is especially beneficial for iterative programs in which a driver program running on the master node repeatedly submits a step of the iteration for computation.
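
The following PySpark sketch illustrates lazy evaluation, in-memory caching and a driver-side iteration loop; the input path, the toy computation and all variable names are illustrative assumptions rather than part of the Spark distribution.

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-eval-sketch")

    # Transformations only build a lineage graph; nothing is computed yet.
    lines  = sc.textFile("hdfs:///data/points.txt")   # hypothetical input path
    points = lines.map(lambda l: [float(x) for x in l.split(",")])
    points.cache()   # keep the parsed dataset in memory across iterations

    # A driver-side loop repeatedly submits one iteration step (here: a
    # trivial re-scaling); only the action reduce() triggers execution.
    scale = 1.0
    for i in range(10):
        scaled = points.map(lambda p: [scale * x for x in p])
        total  = scaled.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
        print("iteration", i, "sum =", total)
        scale = 0.5 * scale

    sc.stop()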

5.1.2.6 Apache Flink

Flink [82], originating from Stratosphere [83], offers functionality similar to Spark, but prefers to execute operators in pipelined mode: data items are usually piped from one operator to the next in the dataflow graph. Should there be a need to materialize intermediate data, it is kept in main memory as long as possible before spilling to disk.

Iterations are first class citizens in the data flows of Flink. Instead of relying on a driver program to trigger the computation of an iteration step, Flink offers explicit iteration operators that allow the system to reason about the iteration during job submission time, which in turn allows for data flow optimization similar to the kind of query optimization that is performed in conventional relational databases.

5.1.3 Stream processing

Besides batch processing, where the size and content of the input data do not change during processing, there is a need to handle data streams. Data streams provide a virtually infinite number of data items at fixed or arbitrary intervals. Stream processing systems operate on windows of the incoming data items. Aside from adaptations for stream windows, these systems usually provide similar processing primitives as DAG-based systems such as Flink and Spark.

Besides Flink and Spark, which both offer stream processing capabilities, there exist dedicated stream processing tools such as Apache Storm [84] and Apache Samza [85] that offer different windowing approaches and provide different guarantees as to how often an incoming data item is processed.
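
Storm and Samza expose their own Java APIs; as a framework-independent illustration of the windowing idea, the Python sketch below aggregates a simulated, unbounded stream over tumbling windows of a fixed size. The stream generator and window size are assumptions made only for this example.

    import itertools, random, time

    def sensor_stream():
        """Simulates an unbounded stream of (timestamp, value) items."""
        for _ in itertools.count():
            yield (time.time(), random.random())

    WINDOW_SIZE = 100   # tumbling window of 100 items

    def tumbling_window_averages(stream, size):
        window = []
        for item in stream:
            window.append(item)
            if len(window) == size:
                # emit one aggregate per full window, then start a new window
                yield sum(v for _, v in window) / size
                window = []

    for avg in itertools.islice(tumbling_window_averages(sensor_stream(), WINDOW_SIZE), 5):
        print("window average:", avg)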

5.1.4 Graph processing

Graph processing is a domain of data processing that addresses the specific problem of algorithms operating on graph input data. The corresponding frameworks provide specialized abstractions for graph algorithms that are based either on vertex message passing or on graph traversal. Pregel [86] is one of the most well-known graph processing systems; it offers a vertex-centric programming model whose programs are deployed on a shared-nothing cluster. Apache Giraph implements a similar model based on Java and Apache Hadoop. Gunrock [87], a CUDA-based implementation of graph processing primitives, has been proposed to ease the development of graph processing algorithms for NVIDIA GPUs.

Even though initially not tailored for graph processing, Apache Flink offers an API that allows developers to use programming abstractions similar to the ones used by Pregel in an Apache Flink data flow program. Internally these instructions are compiled into an iterative data flow that emulates Pregel’s vertex message passing approach.
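
Neither the Giraph nor the Flink API is reproduced here; instead, the following single-machine Python sketch illustrates the vertex-centric, superstep-based message-passing model itself, using minimum-label propagation on a toy in-memory graph to find connected components.

    # Toy undirected graph as an adjacency list.
    graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}

    # Each vertex starts with its own id as its component label.
    labels = {v: v for v in graph}

    # Superstep loop: every vertex sends its current label to its neighbours,
    # then adopts the minimum label it has seen. The computation halts when
    # no label changes any more (the BSP-style termination condition).
    changed, superstep = True, 0
    while changed:
        changed = False
        inbox = {v: [] for v in graph}
        for v, neighbours in graph.items():      # message-sending phase
            for n in neighbours:
                inbox[n].append(labels[v])
        for v, messages in inbox.items():        # compute phase
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                changed = True
        superstep += 1

    print("components after", superstep, "supersteps:", labels)
    # e.g. {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}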

A noted challenge of all these software approaches is the assumption that the hardware platform uses a shared-nothing paradigm, and that any application state that must be shared across the computation must be explicitly defined by the application programmer. This leads to inherent inefficiencies and bottlenecks, since the shared-nothing cluster was defined around the strong functional model of no shared state. The software challenge is then how to express shared state in a system that was designed not to support shared state.

5.1.5 High Performance Computing

5.1.5.1 OpenMP

OpenMP (Open Multi-Processing) is a parallel programming model for shared-memory multi-core processors in C, C++ and Fortran. OpenMP consists of language-level extensions in the form of compiler directives, a runtime and environment variables. The language-level extensions allow a declarative and platform-independent way of expressing both data and task parallelism, with recent language extensions also supporting GPGPU and SIMD acceleration, whereas the OpenMP runtime is the platform-specific implementation of the OpenMP specification for the different hardware and software platforms such as Windows, Linux, Solaris, Mac OS X, AIX, HP-UX and others.

OpenMP is suitable for parallelizing an already existing serial program to run on a shared-memory multi-core processor. More specifically, OpenMP is suitable for regular parallelism such as parallelizing loops without dependencies. OpenMP does not support parallelization across a shared-nothing distributed system.

In OpenMP, parts of the code that can execute in parallel, as well as shared variables, are annotated with compiler directives. Because directives are not first-class language extensions, they are ignored if the compiler does not support them, and the program simply runs serially. This sort of backward compatibility allowed the wide adoption of OpenMP, because OpenMP programs could be executed on single-core processors at a time when multi-core processors were not yet widespread.

The parallel execution of a typical OpenMP program consists of a master thread and multiple slave threads. At the beginning, the execution of the program starts serially. The OpenMP runtime designates the main program thread as the master thread. The master thread forks multiple slave threads, allocating threads to each available processor. When the program execution reaches a parallel section, for example a parallel for loop, the OpenMP runtime executes the section in parallel by allocating it to the slave threads as well as the master thread. By default each thread executes its own parallel section independently. However, with work-sharing constructs, it is possible to divide the parallel section among the threads such that each thread executes its designated portion. When a thread completes the execution of its parallel section it waits for the others, and then the program execution continues serially on the master thread.

Newer versions of OpenMP extend the model with task parallelism and data-dependency-driven execution. Tasks make irregular parallelism easier by simply declaring any code block as a task. Data dependencies are an extension of tasks, whereby tasks can execute asynchronously as long as their data dependencies are satisfied. The asynchronous execution of tasks exploits the available parallelism and promises to improve performance. These newer versions of the OpenMP language also enable the explicit offload of tasks to defined accelerators, which in turn may expose additional concurrency within the accelerator. For example, an OpenMP task may be explicitly targeted to a GPU, and within that task, OpenCL could be used to further exploit the data parallelism of the given GPGPU.

5.1.5.2 MPI

MPI (Message Passing Interface) is a language-independent and portable message-passing application programming interface for distributed parallel computing. The MPI specification defines the communication protocol and the semantics of how its features must behave in different implementations. The goals of MPI are high performance, scalability and portability, which make it the preferred and dominant model used in high-performance computing. Although MPI can be used by the application author to parallelize and run an MPI application on any distributed system, achieving high performance relies on the application exploiting specific hardware knowledge of the target platform, leading to solutions that typically target homogeneous clusters composed of machines that have the same or similar architecture and performance characteristics. Some MPI library implementations, however, have been made available on specialized processors and accelerators, allowing the MPI application author to explicitly target the heterogeneity within some systems.

The MPI interface provides primitives to define and implement a virtual topology, synchronization and communication functionality among processes that are mapped to a compute instance (or processor). The assignment of the MPI processes to the CPUs (or cores in multi-core CPUs) is done through the agent that launches the MPI program (mpirun or mpiexec). Typically, for achieving high performance each process is assigned to a single core.

MPI defines functions for communication between processes. These communication functions are used to distribute (send/receive) input and output data. Other MPI functions include setting logical topologies for the compute processes, combining partial results, synchronization, as well as obtaining network-related information.

MPI is mostly suitable for applying the same type of operation to very large amounts of data. Typically, the data is split into smaller chunks and each chunk is assigned to an MPI process. The data is distributed via point-to-point or broadcast messages to the designated compute processes. When a compute process receives its chunk, it carries out the computation on the data and sends the output using the most appropriate communication type. To improve performance, the MPI specification enables implementations that overlap computation with communication.
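
As an illustration of this split/compute/combine pattern, the sketch below uses the mpi4py Python binding (an assumption made for the example; the report discusses MPI independently of any particular binding) to scatter chunks of an array, compute partial sums locally and combine them with a reduction. It would be launched with, e.g., mpirun -n 4 python sum.py.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # The root process splits the input into one chunk per MPI process.
    if rank == 0:
        data = np.arange(1000000, dtype="d")
        chunks = np.array_split(data, size)
    else:
        chunks = None

    # Distribute the chunks, compute locally, then combine the partial
    # results on the root process with a reduction.
    chunk = comm.scatter(chunks, root=0)
    partial_sum = chunk.sum()
    total = comm.reduce(partial_sum, op=MPI.SUM, root=0)

    if rank == 0:
        print("global sum:", total)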

MPI and OpenMP are complementary parallel programming approaches and can be combined. Because OpenMP has less overhead than MPI for processes that run on the same shared-memory multi-core machine, combining OpenMP with MPI can deliver a significant performance improvement.

Due to the fundamental software paradigm shift between using OpenMP or MPI to express application parallelism, some research examples exist of extending the OpenMP runtime to utilize MPI, extending the perception of a shared-memory machine over the shared-nothing structures typically targeted by MPI. Likewise, MPI is available as a runtime for operation on a shared-memory machine.


A conclusion from this is that different software and algorithms are typically best suited to either a shared-memory or a shared-nothing communication paradigm to best express parallelism and achieve the best performance; however, the hardware platform on which such applications need to run may not support the software paradigm at the level of performance and capability required. For example, MapReduce, with its strong alignment to the functional programming paradigm, requires only a shared-nothing computing platform. This means there is no intrinsic overhead in the hardware platform to ensure a coherent and consistent view of globally shared data. However, if an application requires such a coherent view of some globally shared data, then significant overheads in the hardware platform are required to ensure a consistent view of that data. Building a hardware platform that can efficiently support applications that require shared global state while still benefiting from the inherent efficiency of a shared-nothing platform is beyond today’s technology. Within the HPC community, however, various initiatives around PGAS (Partitioned Global Address Space) and associated globally shared data schemes on partitioned systems are providing applications with an abstraction that allows them to utilise the concepts of both shared memory and partitioned execution in one software paradigm.

5.1.6 Hardware-independent program specification

CPUs, GPUs, and FPGAs all require different programming approaches to take advantage of their respective architectures. However, there exist efforts to unify programming abstractions across these platforms to allow developers to write code once and have it execute on any of these platforms.

5.1.6.1 OpenCL

OpenCL focuses on massively parallel data processing and bases its programming abstractions around the concept of compute kernels that are applied in parallel to large data arrays. While this approach may be cumbersome for certain non-parallel programming tasks, it is very well suited to solving inherently parallel problems and ensures that OpenCL programs can be efficiently parallelized. At runtime, OpenCL kernels are compiled for the desired hardware and executed. Usually CPU and GPU vendors provide the corresponding compilers and runtime libraries. While OpenCL obviates the burden of targeting specific hardware platforms, the code very often still needs to be manually tuned to achieve maximum performance for specific cache sizes, memory arrangements and transfer mechanisms.
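
The sketch below shows the runtime-compilation workflow through the pyopencl binding (an assumption made for illustration; OpenCL itself defines C APIs): a vector-addition kernel is built for whatever device the installed OpenCL runtime exposes and then launched over the data arrays.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()         # picks an available OpenCL device
    queue = cl.CommandQueue(ctx)

    a = np.random.rand(1024).astype(np.float32)
    b = np.random.rand(1024).astype(np.float32)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # The kernel source is compiled at runtime for the selected device.
    src = """
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *out) {
        int gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];
    }
    """
    prg = cl.Program(ctx, src).build()

    prg.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)
    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)
    print(np.allclose(result, a + b))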

5.1.6.2 Liquid Metal

IBM Liquid Metal [88] offers a high-level programming language called Lime that is Java-compatible and allows developers to use familiar object-oriented abstractions. Liquid Metal can compile Lime programs for execution on the CPU, the GPU or even FPGAs. Consequently, it enables developers familiar with well-known object-oriented concepts to develop code for GPUs and FPGAs that would otherwise have required them to develop in CUDA or OpenCL, or even Verilog in the case of FPGAs.

5.1.7 Broader vision for Europe

European SMEs in general cannot afford to run their own large clusters and data centers, so they have to use highly specialized hardware for their Big Data processing needs. This specialized hardware usually requires developers that are familiar with the hardware, and naturally also familiar with the SME’s problem domain, to develop efficient code.

At the same time, committing to a specific application scenario and a suitable hardware platform comes with risks. Changing the application scenario may now also require changes to the hardware, which in turn will require different development skills. Moreover, there is often a need for predictability in terms of fault tolerance for some applications. In order to minimize these risks we argue that SMEs should run their workloads in datacenters instead of on their own hardware. Having multiple tenants, datacenters can afford to have various pieces of specialized hardware available.

Given suitable programming abstractions, along the lines of IBM Liquid Metal or even a declarative language, SMEs could submit their workload to the datacenter and have it run on whatever hardware is suitable, without adapting the code to the respective platform. The primary challenge here is to ensure that the programmer, at the time of code authorship, does not have to make specific assumptions about the hardware platform on which the application may run.

5.1.8 Concrete strategy

• Develop declarative languages and optimizers for heterogeneous hardware (may incorporate HW/SW co-design).
• Set up datacenters in the EU (or use U.S. datacenters if we have suitable approaches for tackling privacy concerns).
• Let SMEs run their declarative workloads on these datacenters, and let the system support software figure out the best type of hardware (that is available for the money the SME would like to invest, and available at the point of execution). That way the SMEs do not need to optimize their software for any specific hardware.
• If we cannot have a declarative language with a suitable runtime optimizer, we should at least have a hardware-independent one like Liquid Metal, so that SMEs can manually assign their workloads to different types of hardware in the datacenter.

5.2 Big Data Core Algorithms

5.2.1 Machine learning algorithms on graphs

PageRank: an iterative algorithm, proposed in the context of the Web graph (where vertices are web pages and edges are references from one page to another), that computes the rank of all vertices of a directed graph. Conceptually, PageRank associates with each vertex a rank value that is proportional to the number of inbound links it receives from the other vertices and to their corresponding PageRank values. Used in: social media, web analysis.
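
As a compact illustration, the following Python sketch runs the PageRank power iteration on a toy adjacency list; the damping factor, iteration count and example graph are chosen arbitrarily for the example.

    # Minimal PageRank power iteration on a toy directed graph (the adjacency
    # list maps a page to the pages it links to).
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

    damping, iterations = 0.85, 50
    n = len(links)
    rank = {page: 1.0 / n for page in links}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in links}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)   # rank is split among out-links
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank

    print(rank)   # pages with more (and better-ranked) in-links score higher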

Semi-clustering: an iterative algorithm popular in social networks, as it aims to find groups of people who interact frequently with each other and less frequently with others. A particularity of semi-clustering compared with other clustering algorithms is that a vertex can belong to more than one cluster. The input is an undirected weighted graph, while the output is an undirected graph where each vertex holds at most a maximum number of semi-clusters it belongs to. Used in: social media.

Collaborative filtering: With the exponential growth of data in recent years, it becomes increasingly difficult to separate the useful data from the noise. Collaborative filtering is a technique used by recommender systems to filter out irrelevant products from large datasets based on collaboration among users, data sources, or agents.

Collaborative filtering explores techniques for matching users with similar interests and then uses such matchings to make recommendations for new products. One of the most popular algorithms for collaborative filtering is Alternating Least Squares (ALS). Used by: recommender systems.
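
The following dense toy sketch of ALS with NumPy alternates between closed-form updates of the user and item factor matrices; production recommenders additionally mask unobserved ratings and distribute the updates, and the rating matrix, rank and regularization constant below are invented purely for illustration.

    import numpy as np

    # Toy user-item rating matrix (rows: users, columns: items).
    R = np.array([[5.0, 3.0, 0.0, 1.0],
                  [4.0, 0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 5.0],
                  [1.0, 0.0, 0.0, 4.0]])

    k, lam, steps = 2, 0.1, 20       # latent factors, regularization, iterations
    users, items = R.shape
    U = np.random.rand(users, k)
    V = np.random.rand(items, k)

    # Alternating least squares: fix V and solve for U in closed form,
    # then fix U and solve for V. (Missing-entry masking omitted for brevity.)
    for _ in range(steps):
        U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
        V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

    print(np.round(U @ V.T, 2))      # reconstructed rating matrix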

Belief propagation is an inference algorithm for probabilistic graphical models that computes the marginal distributions of the unobserved nodes in a graph, conditional on the observed ones. It is used in many domains such as graph analysis, computer vision and image restoration, fraud detection, and error correction.

HADI (diameter and neighborhood estimation): Estimating the number of vertices reachable from a vertex v within h hops, in short the neighborhood of v, is used in social applications today.

LinkedIn for instance provides information on the number of professionals reachable within some number of hops from any given user. Neighborhood estimation can be implemented for all the vertices of an input graph using an iterative, probabilistic algorithm. Used in graph analysis, social media, web analysis.


Labeling connected components: an algorithm that finds the number of connected components in a graph by mapping each vertex to a connected-component identifier. Used in: graph analysis, social media, web analysis.

5.2.2 Graph processing

Many applications that store and process large-scale data and their relationships make use of graph structures or graph databases. Examples include social network analysis (Facebook, Twitter), bioinformatics (protein interaction networks), scientific simulations (provenance tracking), geographic information systems (geo-graphs), etc. Graph processing is thereby a frequent task in these applications. Given the high complexity of typical graph algorithms (e.g., graph traversal, subgraph matching, graph drawing, etc.) and the explosion in the size of today’s graph data (millions to billions of nodes and even more edges), there is an urgent demand for the design of efficient graph processing schemes. While one solution is to develop novel efficient algorithms, other promising strategies, orthogonal to the first, are to scale down graph processing via sampling [89] and to scale up graph processing via parallel processing [90].

5.2.2.1 Graph processing via sampling

Instead of processing large-scale graphs directly, one can extract representative sample graphs of manageable size from the original graphs, apply graph algorithms on the sample graphs and scale the results back to the original graphs. The core of the scheme is to design efficient algorithms that extract representative sample graphs.

As graph properties are diverse, the representativeness of the sample graphs can be interpreted in different ways, according to the properties of interest that the applications request. In the past decade, researchers have developed graph sampling algorithms that capture the degree distribution, clustering coefficient distribution, connectivity, community structure and many other properties of the original graphs [89], [91], [92], [93]. Nevertheless, there is no efficient solution that fits all applications, where different graph properties need to be captured; one often has to develop a new sampling algorithm to reproduce a new property of the original graphs in the sample graphs. Therefore, an interesting problem is to devise efficient sampling schemes that capture diverse properties of the original graphs. On the other hand, existing sampling algorithms are evaluated based on empirical results, i.e., the representativeness is measured by comparing empirically the properties of the sample graphs with those of the original graphs. In contrast, sampling schemes with an error-bound guarantee are desired in order to meet the requirements of applications concerned with sample quality.
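
The sketch below shows one of the simplest such schemes, random (induced-subgraph) node sampling, and compares the degree distributions of the original and sampled graphs; the toy graph and sampling fraction are invented for the example, and other samplers (random walk, forest fire, etc.) would be used to preserve other properties.

    import random
    from collections import Counter

    def random_node_sample(graph, fraction):
        """Induced-subgraph sampling: keep a random subset of vertices and
        all edges between kept vertices."""
        kept = set(v for v in graph if random.random() < fraction)
        return {v: [n for n in graph[v] if n in kept] for v in kept}

    def degree_distribution(graph):
        return Counter(len(neighbours) for neighbours in graph.values())

    # Toy graph: a star attached to a triangle.
    graph = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0],
             4: [0, 5, 6], 5: [4, 6], 6: [4, 5]}

    sample = random_node_sample(graph, 0.7)
    print("original degrees:", degree_distribution(graph))
    print("sample degrees:  ", degree_distribution(sample))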

Also in recent years, applications have shown interest in not only the graph structure per se, but also other types of data that are annotated to graph nodes and edges, e.g., the textual data affiliated to nodes. The underlying graphs are called attributed graphs. Recent work has focused on attributed graph mining [94], [95], [96] and OLAP on attributed graphs [97], [98]. The next generation of graph sampling algorithms should take into account the attributes annotated to the graphs as graph properties and reproduce them in the sample graphs.

5.2.2.2 Parallel graph processing

It is natural to consider processing large-scale graphs by utilizing parallel computing resources. There are several challenges [90] in parallel graph processing due to the inherent characteristics of graph structures and graph algorithms, including the unstructured nature of graph data, the poor locality of the computations, irregular access patterns, a high ratio of data access to computation, the iterative nature of graph algorithms, etc.

To overcome these challenges, several programming models for parallel graph processing on different platforms have been developed in the past decade. Main approaches include CUDA on graphics processing units (GPUs) [99], [100], MapReduce [75], [101] and Bulk Synchronous Parallel (BSP) [86]. The MapReduce and BSP models have been implemented in the Hadoop environment. Still, one particular programming model is usually designed to apply to a category of graph-analytical problems. For example, the BSP model has demonstrated its superiority on graph algorithms with many iterations, whereas MapReduce remains a good option for massive graph data that does not fit into main memory [102]. It still remains a challenge to design a unified programming model that handles parallel graph processing problems in diverse applications.

5.3 Visualization

5.3.1 General visual analytics

Intel’s IT Manager Survey of 200 IT professionals [103] found that four of the top five data sources for IT managers today are semistructured or unstructured. Many companies are unable to analyse these emerging forms of data, which include everything from e-mails, photos, and social media to videos, voice, and sensor data. This survey found that only about half of IT managers perform data analytics in real time, while the other half continue to rely on batch processing that fails to capture the immediacy of big data. Visual analytics is a new field that has arisen in recent years. The objective of visual analytics is to help organise the information, generate overviews and explore the information space to extract potentially useful information. As defined in [104], “visual analytics is the science of analytical reasoning facilitated by interactive visual interfaces”. In this domain, visualisation is not used only at the end of the process; rather, the aim is to integrate it throughout the entire process to combine and optimise algorithmic analytics with human visual skills.

However, adapting visual analytics to Big Data problems presents new challenges and problems. The challenges are difficult and numerous: some are technological (e.g., computation, storage, rendering, ...) and some are related to human cognition and perception (e.g., visual representation, data abstraction, scale, ...). In 2012, an article in the IEEE Computer Graphics & Applications journal analysed the top 10 challenges in extreme-scale visual analytics [105]. These challenges were described as:

1. In Situ Analysis: In situ visual analytics tries to conduct as much of the analysis as possible while the data are still in memory, since the traditional approach of first storing the data on disk might not be feasible at petascale levels and beyond. This approach can greatly reduce I/O costs. To solve the challenges that could arise, this solution will require a change of practice in the HPC community.

2. Interaction and user interfaces: Whereas data sizes are growing, human abilities have remained unchanged. The challenges of interaction and UIs are deep, multifaceted and overlapping. Ten major challenges regarding interaction and user interfaces were identified in [106], listed as follows:

• In Situ interactive analysis: similar to the previous point.
• User-Driven data reduction: To develop a flexible mechanism that users can easily control according to their data collection practices and analytical needs.
• Scalability and Multilevel Hierarchy: Multilevel hierarchy is a prevailing approach to many visual analytics scalability issues. But as data size grows, so does the complexity of constructing and navigating the hierarchy.

• Representing evidence and uncertainty: The challenge is how to clearly represent the evidence and uncertainty of extreme-scale data. Many algorithms must be redesigned to consider data as distributions.

• Heterogeneous data-fusion: To analyse the interrelationships among heterogeneous data objects or entities.

• Data summarization and triage for interactive query: analyzing the entire dataset might not be feasible. Data summarization and triage lets users request data with certain characteristics.

• Analytics of temporally evolved features: The challenge is to develop effective visual analytics techniques that are computationally practical and exploit humans’ ability to track data dynamics.

• The human bottleneck: The challenge is to find ways to compensate for human cognitive limitations, which do not scale with the increase in the performance of the hardware.

• Design and Engineering Development: To develop community-wide API and framework support on an HPC platform.


• The renaissance of conventional wisdom: Successfully returning to the solid principles found in conventional wisdom and discovering how to apply them to extreme-scale data.

3. Large-Data Visualization: Focused on data presentation, including the visual techniques (abstract visualization, data projection, dimension reduction, ...) and the visual display of information (high-resolution displays and power wall displays). However, more data projection and dimension reduction in visualization also mean more abstract representations. Such representations require additional insight and interpretation for those performing visual reasoning. Moreover, the limitations of human visual capacity reduce the effects of high-resolution displays.

4. Databases and storage: One ongoing concern is that the cost of cloud storage per gigabyte is still significantly higher than hard drive storage in a private cluster. Another concern is that a cloud database’s latency and throughput are still limited by the cloud’s network bandwidth.

5. Algorithms: Traditional visual analytics algorithms were not designed with scalability in mind. Moreover, most algorithms assume a post-processing model in which all data are readily available in memory or on a local disk, which is unfeasible at extreme data scales.

6. Data Movement, Data Transport, and Network Infrastructure: This challenge concerns the cost of data movement in the visual analytics pipeline. The geographical dispersion of data sources and the increase in the size of the data will create new challenges for big data visualization. New algorithms and software that efficiently use networking resources and provide convenient abstractions must be developed.

7. Uncertainty quantification: Uncertainty quantification and visualization will be particularly important in future data analytics. Understanding the source of uncertainty in the data is important in decision-making and risk analysis. Novel visualization techniques will provide an intuitive view of uncertainty to help users understand risks.

8. Parallelism: In order to deal with the size of data, parallel processing can effectively reduce the processing time for visual analytics. To fully exploit the upcoming pervasive parallelism, many visual analytics algorithms must be completely redesigned; not only to increase the degree of parallelism, but also to develop new data models.

9. Domain and Development Libraries, Frameworks, and Tools: Similar to the challenge in user interaction and interfaces, the community needs to develop new APIs and support for HPC platforms.

10. Social, Community, and Government Engagements: The final challenge is for the government and online-commerce vendor communities to jointly provide leadership in disseminating their extreme-scale-data technologies to society at large.

Many of the aforementioned challenges are general and can be applied to most areas of Big Data computation, management and analytics. The challenge that is more specific to visual analytics seems to be visual representation. There are several techniques that have been developed for visualizing Big Data. Two of the most popular are abstraction and aggregation.

As defined in [106], “Abstraction is the process by which data is defined with a representation similar in form to its meaning (i.e., semantics) while hiding away details”. Abstraction is based on simplifying details without losing the semantic meaning. It can be described as graduated layers of detail, where the highest level contains few details and the lowest contains the full set.

Abstraction can be applied across many levels and scales. However, it is not trivial to find a mapping for this multilevel representation that faithfully preserves the semantics across all levels without causing the user to lose context.

An example of a technique based on this approach is abstract rendering [107].

Abstract rendering is based on the observation that pixels are fundamentally bins, and that rendering is essentially a binning process on a lattice of bins. By providing a more flexible binning process, the majority of rendering can be done with the geometric information stored out-of-core. Only the bin representations need to reside in memory. It provides greater control over a visual representation and access to larger data sets through server-side processing.
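
As a rough, hypothetical sketch of this pixels-as-bins idea (not the abstract rendering system of [107] itself), the Python code below streams batches of points through a binning pass onto a fixed pixel lattice, keeping only the bin counts in memory before a transfer function maps them to displayable intensities.

    import numpy as np

    WIDTH, HEIGHT = 64, 48                 # the pixel lattice, i.e. the bins
    bins = np.zeros((HEIGHT, WIDTH))

    def stream_of_points(batches):
        """Stands in for out-of-core geometry: points arrive one batch at a
        time and never have to reside in memory all at once."""
        for _ in range(batches):
            yield np.random.rand(10000, 2)  # batches of (x, y) in [0, 1)

    # Binning pass: aggregate every batch into the fixed-size lattice.
    for batch in stream_of_points(100):
        xs = (batch[:, 0] * WIDTH).astype(int)
        ys = (batch[:, 1] * HEIGHT).astype(int)
        np.add.at(bins, (ys, xs), 1)

    # Transfer function: map aggregated counts to displayable intensities.
    image = np.log1p(bins) / np.log1p(bins).max()
    print(image.shape, image.min(), image.max())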


A concept similar to abstraction is aggregation. In this strategy, developed in the cartography and geographic visualization domain, several items are put together and represented as a single unit. Aggregation reduces the number of items without simply deleting some of them. It transforms the original items into a smaller number of constructs that summarize the properties of the original items; it is therefore a data summarization process. For example, in [108] the space is divided into compartments, the trajectories are transformed into moves between the compartments, and the moves with common origins and common destinations are aggregated to represent multiple trajectories of moving agents.

Besides abstraction and aggregation, there are several other interesting approaches. For example, in [108] Fisher et al. developed sampleAction, a prototype system that uses incremental approximate queries to display approximate aggregates, allowing the user to explore and refine approximate solutions. In [109] Jang et al. used a functional representation approach for time-varying data sets and developed an efficient encoding technique using temporal similarity between time steps. Using this approach, they were able to encode the information in a functional representation, considerably reducing data storage. Finally, in [106] the Typograph prototype is described, a system for visualizing large text repositories that uses a spatial semantic map metaphor. It allows global views of a text corpus, highlights important terms (computed by text analytics) and regions of interest, and presents the results in a multilevel, multiscale hierarchy.

5.3.2 Visual Analytics for Business Intelligence

Businesses are finding that the traditional reporting process does not work nearly as well for big data, and certainly is not sufficient to capture the potential value that big data represents.

One of the main factors driving the rise of visualization-based data discovery tools is the increasing availability of mobile devices. Businesses that continue to rely on centralized creation of reports by a few highly trained experts are missing an opportunity to adopt a faster, more cost-effective, and more democratized BI model that takes advantage of the intersection of big data and the mobile workforce to speed insights and improve collaboration.

• By 2015, IT managers expect that 63% of all analytics will be done in real time.
• Of seven possibilities, IT managers indicated that they would find the most value in receiving help deploying cost-effective data visualization methods.

Visualization-based data discovery tools allow business users to mash up disparate data sources to create custom analytical views with flexibility and ease of use that simply did not exist before. Advanced analytics are integrated in the tools to support the creation of interactive, animated graphics on desktops, as well as on powerful mobile devices.

Key features of visualization-based data discovery tools:

• Enable real-time data analysis
• Support real-time creation of dynamic, interactive presentations and reports
• Allow end users to interact with data, often on mobile devices
• Hold data in-memory, where it is accessible to multiple users
• Allow users to share and collaborate securely

Additional features to look for:

• Ability to visualize and explore data in-database as well as in-memory
• Governance dashboard that displays user activity and data lineage
• In-memory data compression to enable handling of large datasets without driving up hardware costs
• Touch optimization for use with touch-enabled mobile devices

The achievement of all these features is intimately linked with the current trend of transforming classical computation processes into “in-memory computation”.


“In-memory analytics” is a technology that will allow operational data to be held in a single database that can handle all the day-to-day customer transactions and updates as well as analytical requests — in virtually real time.

The advantages of in-memory analytics are many: performance gains will allow business users to run richer queries and create more complex models, allowing them to experiment more fully with the data when creating sales and marketing campaigns, and to retrieve current customer information, even while on the road, through mobile applications. The resulting boost in customer insights will give those who move first to these systems a real competitive advantage. Companies whose operations depend on frequent data updates will be able to run more efficiently. And by merging operational and analytical systems, with their attendant hardware and software, companies can cut the total cost of ownership of their customer data efforts significantly.


6 Alternative computing paradigms

6.1 Probabilistic computing

Computing systems are going to witness a convergence between the mission-critical market (such as automotive and healthcare) and the mainstream consumer market (such as mobile phones). Such convergence is fuelled by the common needs of both markets for (1) more performance for an available level of power and (2) support for mission-critical functionalities, such as time predictability. This convergence imposes new needs on both markets, in particular, providing guarantees, beyond best effort, on non-functional Quality of Service (QoS) requirements. Such QoS requirements include, but are not limited to, time and energy predictability.

Current approaches fail to provide guarantees for QoS requirements mainly because of the complexity of the performance-improving hardware (HW) and software (SW) features used to provide the level of performance required in both mission-critical and mainstream markets. The intricate dependences between applications themselves and between HW and SW components contribute to the inaccuracy of deterministic models. Increased HW/SW complexity also increases the vulnerability of programs to pathological behaviours: small changes in the execution conditions of programs that cause huge variations in their behaviour. As a result, (1) current QoS approaches cannot use the past behaviour of applications to accurately predict their future behaviour, causing system designers to over-provision the HW/SW resources in order to provide guarantees, reducing the efficiency of the system and increasing its cost; and (2) HW- and SW-improving features cannot be used since evidence cannot be provided about those features’ guaranteed non-functional behaviour, limiting the progress of mission-critical and mainstream systems.

Hence, one view is that modelling systems and their non-functional properties in a deterministic way – as done nowadays – has a cost that depends exponentially on the complexity of the system. Trends towards the internet-of-things, big data, and similar systems increase system complexity exponentially and thus their modelling, characterization and implementation require a super-exponential cost increase for accurate modelling. Therefore, the predictability needed to size computer systems in terms of power needs (batteries, power supply, etc.), temperature dissipation (mobile phones, cooling solutions for data centres, etc.), timing guarantees (deadlines in real-time systems, QoS in datacenters, response time in the internet-of-things, etc.) and other non-functional properties cannot be achieved with affordable means.

The view is therefore that, by moving away from deterministic hardware/software features towards hardware/software that exhibits randomised non-functional behaviours, tighter guarantees can be provided on applications’ non-functional behaviour with negligible effect on performance. This shift will also enable probabilistic analysis of the non-functional (QoS) requirements of the system, providing guarantees of whether QoS requirements will be met.


7 Co-design

Author: Christos Kotselidis

Any form of data processing relies on the interplay of hardware and software. If one understands where either hardware or software currently falls short, or where the abstraction between them is unbalanced, one can seize the opportunity to co-design hardware and software in a way that alleviates these shortcomings.

This section describes hardware/software co-design approaches and elaborates on privacy and security matters, which, in a virtualized environment, have become a cross-cutting hardware/software issue.

7.1 Hardware/Software co-design for Big Data

Computing is advancing at a remarkably rapid pace. “Smart devices” such as personal computers, laptops and smartphones have been integrated into almost every aspect of our lives. Furthermore, the boom of social networking (Facebook, etc.) has resulted in a vast amount of personal multimedia content being distributed and processed by personal devices and servers. The processing power needed to perform such complex computations is more demanding than ever. Moreover, for various reasons (environmental, mobility, costs, etc.) the power footprint of all computations has to be minimized. Throughout the past decades, both performance and power efficiency have been progressing at a steady pace via technological advancements in both software and hardware. The formerly “guaranteed performance boosts” led to the alienation of the two sides of computing, resulting in the employment of “horizontal engineering”: software and hardware teams contribute independently to technological advancements.

Nowadays, due to physical limits in transistor manufacturing and research saturation in optimal code generation, advances in performance and power efficiency seem to be stalling. Pure software or hardware solutions no longer guarantee computing evolution at acceptable power envelopes. Therefore, “the currently separated hardware and software teams must provide co-designed and vertically integrated solutions in order to keep pushing the boundaries of computing”, and as such adopt a systems approach to addressing an application challenge.

Hardware/Software co-design is a promising solution aiming to bridge the strict semantic gap often asserted between hardware and software when a design assumes the hardware cannot be altered. Software should adapt its code abstractions to better fit the underlying hardware, while, simultaneously, hardware has to adapt the appropriate components in order to optimally support the application code.

Regarding Big Data, where the problems become even more complicated, HW/SW co-design seems a natural way of resolving them. Furthermore, HW/SW co-designed architectures already exist at various levels, and this knowledge can be transferred to solving Big Data related problems.

7.2 Hardware/Software co-designed architectures

HW/SW co-designed architectures have been employed in order to solve challenging problems where hardware-only or software-only solutions do not suffice. Azul, a US company [110], designed and sold the Vega [111] line of processors, which are co-designed with a Java runtime system. The purpose of the Vega systems is to provide high-throughput Java-based transactional systems for the banking sector by achieving pauseless automatic memory management. The Vega systems employ a relaxed memory model and have a co-designed garbage collector in which special hardware instructions that serve as read barriers can reclaim free space without stalling the application threads. Furthermore, the co-designed page table system of the operating system can perform bulk reclamation of memory pages. By co-designing the runtime system, the operating system, and the hardware, the Vega systems manage to outperform all other platforms in terms of transactional throughput for Java-based workloads. Theoretically, such systems or techniques could be extended to other managed languages such as .NET, JavaScript, etc.

Another example of co-designed architectures is the recently announced data center of Microsoft and other collaborators [112], which offloads specific workloads to dedicated FPGAs residing on the die along with commodity processors. By using such combinations, it is possible through co-design to move the hardware/software abstraction and dramatically improve the throughput of servers while reducing power and latency. Integrating FPGAs into data servers, however, is not a new idea and has already been researched and developed [113], [114]. The move to allow the FPGA full concurrent shared-memory access to the processor caches and memory will nevertheless invigorate the push for even higher performance at a minimal power footprint, and makes the application of ASICs in data centers even more appealing.

7.3 Hardware/Software co-designed processors

The HW/SW co-designed processor is a very intriguing idea with very promising results. Transmeta [115] pioneered the development of such processors with the Efficeon and Crusoe architectures. The idea behind such processors is to embed a software layer, in the form of a compiler or a runtime system, inside a chip which is less complex, and thus slower but more power-efficient, than big out-of-order cores. The software layer can perform a variety of tasks, including binary translation from one Instruction Set Architecture (ISA) to another, re-compiling and re-optimizing the executed low-level code, performing speculative optimizations, etc. For example, the Transmeta line of CPUs translated the x86 ISA to a VLIW ISA, which in turn was optimized and executed on a simple in-order core. Offloading complex and expensive (both in power and die area) tasks onto software can yield better power/performance results. Although the Transmeta processors had some success in their early days, the company eventually did not manage to sustain its business model and sold its IP to various other HW vendors such as Intel and Nvidia.

Both companies researched and developed this kind of architecture, with NVIDIA announcing Project Denver [116], a HW/SW co-designed processor based on a premise similar to that of Transmeta. The first generation of the Denver project is expected to come to market this year.

Another approach dealing with hardware/software co-design is Oracle’s RAPID project [117], which aims to reduce the energy consumption of database workloads by executing them on a system that offers heterogeneous hardware. Corresponding hardware-conscious software is meant to utilize the hardware in a way that maximizes power efficiency.

7.4 Hardware/Software co-designed accelerators

The shift from the traditional CPU-only execution model to heterogeneous execution is a form of HW/SW co-design. GPGPU, ASIC and FPGA offloading, while not yet fully embraced, sets the ground for future development at both the hardware and the software level. Heterogeneous programming models and initiatives such as CUDA, OpenCL and the HSA Foundation are working on establishing new programming models for heterogeneous execution or on integrating it into existing ones. Although GPGPU has been gaining momentum thanks to its inherent applicability to data-parallel workloads, the full potential of such approaches will not be realized until the interface abstractions can deliver a more homogeneous co-design view of the platform.

7.5 Broader strategy for Europe

Tackling Big Data issues requires a holistic re-design and implementation of current solutions at all levels: from the programming model and the analytics software layers down to the compilers and the runtime systems. To that extent, the EU is strategically well placed to utilize expertise from all levels of the stack in order to solve Big Data problems in a holistic manner.

Combining the software expertise of European institutions and SMEs while targeting EU-derived ARM microservers can potentially give Europe the edge in achieving unrivalled power/performance benefits.


7.6 Milestones

• Optimize existing compilers and runtimes on ARMv8 architectures with special extensions to exploit heterogeneous execution.
• In parallel, research, develop and deploy large ARM-based clusters (a similar approach to HP’s The Machine).
• Co-design existing or new programming languages that are suitable for Big Data. The co-design aspect will concern the semantic enrichment of code. The metadata will be exploited by the underlying software components (compiler/runtime) for code optimization/generation and offloading.
• Build on top of EU-derived Big Data analytics frameworks that will serve as benchmarks for achieving better performance with low energy.

7.7 Chip security and privacy

7.7.1 Overview

Over the past few years we have been observing a significant amount of data being collected and stored on-line. Frequently referred to as “big data”, this vast amount of data slowly takes over a significant percentage of human activities. Entertainment, or more generally, infotainment is slowly phasing out paper-based media such as newspapers, and even optical media, such as DVDs, and rapidly adopts Internet-based information delivery. This shift from paper-based media to Internet-based information delivery has a fundamental implication: when people read a news article on-line, when they watch a movie on-line, when they listen to a song on-line, they themselves generate a wealth of data – collectively much more data than the data contained in the movie itself: the mere fact that they watched the movie, the selection process they followed to select the particular movie, the time they started watching and the time they stopped watching, the time they hit the pause “button”, the time they increased the speed, the time they watched it in slow motion, the web sites they visited while the pause button was pressed – all these are data generated by the users: a lot of data generated by the users and usually collected and stored for further processing somewhere in what is so colourfully called “the cloud”. Although infotainment was the spearhead of the Internet revolution, several other aspects of our lives are being moved to the online world as well: eBanking, shopping, making videos, taking pictures, holiday planning, to name a few. Each of these aspects moves even more personal information to the cloud: items bought, items returned, items browsed, method of purchase, amount of money spent in shopping, etc. We have reached the point where it is difficult to find an aspect of human life that does not have some digital dimension and does not plan to migrate to the on-line world.

It can be easily seen that such data may have huge security and privacy implications. On the security side, big data collections contain a lot of information about real people, which can cause real damage if lost or tampered with. For example, over the past few months we have observed headlines stating that password files have been leaked from several well-known sites. Password leakage may potentially lead to theft of user data, and once the data is stolen it is gone forever.

On the privacy side, we also have significant implications. A lot of the collected information contains private data (or “personally identifiable information” – PII) such as names, dates, credit card numbers, the movies people watch, the places they visit, the illnesses they have, and the challenges they face in their lives. Although most people “have nothing to hide”, they certainly do not want strangers to have access to all the personal details of their lives.

7.7.2 Current status

The main mechanism for privacy has been, and probably still is, the extensive use of anonymity. In these approaches, users perform their tasks anonymously, frequently behind an anonymizing network such as the onion router [118] or mix networks [119]. Although these approaches are effective for anonymous web browsing, their usefulness is limited in settings, such as social networks, where users have to be logged in. Indeed, while these approaches are very effective at hiding a user’s IP address, anonymity is lost as soon as the user logs in to a social network with their username and password. In such settings, pseudonyms may be used instead: users may log in under a pseudonym and operate under a fake name [120].
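To make the reliance on anonymizing networks more concrete, the sketch below routes ordinary web requests through a locally running Tor client exposed as a SOCKS proxy. This is an illustrative example rather than anything prescribed by [118]: the requests and PySocks packages, the default local proxy port 9050, and the test URL are assumptions about a typical local setup.

    # Illustrative sketch: send HTTP(S) requests through a local Tor SOCKS proxy.
    # Assumes a Tor client is listening on 127.0.0.1:9050 and that the
    # third-party packages "requests" and "PySocks" are installed
    # (pip install requests[socks]).
    import requests

    TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h also resolves DNS via the proxy

    session = requests.Session()
    session.proxies = {"http": TOR_PROXY, "https": TOR_PROXY}

    # The request below leaves the machine via the Tor network, so the destination
    # site sees a Tor exit node's address rather than the client's own IP address.
    response = session.get("https://check.torproject.org/")
    print(response.status_code)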

Although anonymity and pseudonymity are very useful, sometimes people cannot really hide. If they frequently use the same computer and the same web browser from a small number of venues (such as their workplace and their home), all of their browsing sessions can eventually be linked, and their true identity can be closely approximated [121]. In such cases privacy may be protected through obfuscation [122]: the addition of “noise” information to confuse an observer. For example, to hide a web search they want to perform, users may issue a whole sequence of queries, hiding among them the one query they are really interested in.
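The decoy-query idea behind obfuscation tools such as the one described in [122] can be sketched in a few lines of Python. The snippet is only a schematic illustration: the decoy list, the submit_query helper, and the timing parameters are hypothetical placeholders, not part of the cited work.

    # Schematic sketch of query obfuscation: the real query is submitted hidden
    # among randomly chosen decoy queries issued at irregular intervals.
    import random
    import time

    DECOY_QUERIES = [
        "weather tomorrow", "train timetable", "movie reviews",
        "news headlines", "chocolate cake recipe", "football results",
    ]

    def submit_query(query):
        # Placeholder: a real client would send the query to a search engine here.
        print("submitting:", query)

    def obfuscated_search(real_query, num_decoys=5):
        # Mix the real query into a shuffled batch of decoys so that an observer
        # of the query stream cannot tell which query the user actually cares about.
        batch = random.sample(DECOY_QUERIES, num_decoys) + [real_query]
        random.shuffle(batch)
        for query in batch:
            submit_query(query)
            time.sleep(random.uniform(0.5, 3.0))  # avoid an obvious timing pattern

    obfuscated_search("symptoms of a rare illness")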

In the world of big data, however, further challenges arise: even when each record in a vast collection is anonymous on its own, combining the demographic attributes of many records can narrow the matching population down to a single individual. The combination therefore removes an anonymity that no single piece of data would have exposed by itself.
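A toy example makes this concrete. In the sketch below, which uses entirely invented records and attribute names, every attribute value occurs in several records on its own, yet one combination of three seemingly harmless attributes matches exactly one record and therefore singles out one individual.

    # Toy illustration with invented data: individually common attribute values
    # can, in combination, identify a single individual.
    from collections import Counter

    records = [
        {"zip": "10115", "birth_year": 1975, "gender": "F"},
        {"zip": "10115", "birth_year": 1982, "gender": "M"},
        {"zip": "10117", "birth_year": 1975, "gender": "M"},
        {"zip": "10115", "birth_year": 1975, "gender": "M"},
        {"zip": "10117", "birth_year": 1982, "gender": "F"},
    ]

    counts = Counter((r["zip"], r["birth_year"], r["gender"]) for r in records)
    print(counts[("10115", 1975, "M")])  # 1 -> this combination identifies one person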

Privacy is also an important dimension when data are shared between different organizations, for example for research purposes. In such cases, all personally identifiable information (such as names, birthdates, etc.) is removed or generalized. It has been repeatedly shown in the literature, however, that even after the data are anonymized in this way, some of the information can still be linked back to individual people [123]. Such cases may benefit from k-anonymity [124], an approach which ensures that data are anonymized or generalized in such a way that there are always at least “k” individuals matching any potential profile. To avoid such dangers of de-anonymization, an alternative approach to data sharing assumes that the data are not shared at all (not even in anonymized form) and that users are only allowed to send queries to a curator of the data. The curator usually employs a differential privacy algorithm [125], which essentially adds some noise to each query response. So, if a user asks “what percentage of patients in your database had a stroke?”, the differential privacy algorithm adds some amount of noise (a different amount each time the query is asked) so that the result is useful in aggregate but never accurate enough to de-anonymize individual people.
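The curator model with the Laplace mechanism of [125] can be sketched as follows. The patient records, the choice of the privacy parameter epsilon, and the helper names are illustrative assumptions; the only property taken from the literature is that a counting query (sensitivity 1) perturbed with Laplace noise of scale 1/epsilon satisfies epsilon-differential privacy for that query.

    # Minimal sketch of a differentially private counting query (Laplace mechanism).
    import random

    # Invented example database held by the curator; never released directly.
    patients = [{"id": i, "had_stroke": random.random() < 0.1} for i in range(1000)]

    def noisy_count(predicate, rows, epsilon=0.5):
        true_count = sum(1 for r in rows if predicate(r))
        # Difference of two exponential variables is Laplace(0, 1/epsilon).
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

    # A different amount of noise is added each time the same query is asked.
    print(noisy_count(lambda r: r["had_stroke"], patients))
    print(noisy_count(lambda r: r["had_stroke"], patients))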

Keeping data private also challenges the technologies by which privacy is delivered today: there is a growing concern that the provider offering the privacy scheme cannot itself be trusted and may, through malicious, governmental, or business interests, expose data outside the privacy governance under which the data was given.

7.7.3 Hardware and hypervisor assistance

A number of research efforts have recently focused on systems that support trusted computation in hostile environments and operating systems [126], [127], [128]. These systems are designed as a generic solution for protecting computation performed by applications in cases where the host machine has been compromised. In such settings, a process responsible for cryptographically signing a message must never expose its keys to the operating system, so that encryption remains trustworthy even when the operating system is compromised. These systems require either that applications run on a hypervisor, which introduces significant performance overhead, or the addition of extra hardware abstraction layers and the recompilation of the operating system.

These systems, however, are still often considered insecure and limited in their ability to protect the privacy of data.

Intel recently introduced SGX (Software Guard Extensions) [129], a set of new CPU instructions for establishing private memory regions. SGX allows an application to instantiate a protected container: a protected area in the application’s address space that provides confidentiality and integrity even in the presence of privileged malware. SGX could potentially be used to implement secure computation regions, using an SGX container for secure storage and taking advantage of the CPU’s cryptographic instructions. Such regions can keep secrets out of the hands of attackers even when the rest of the system is compromised.

However, such schemes can only protect data as far as the technology’s reach extends. SGX may indeed prevent code executing on the given processor from accessing protected regions of memory; when deployed in a full system, however, other components do not have such extensions and may still be able to access that memory. For example, in recent shared-memory GPU systems the GPU does not implement the extension and can be programmed to access memory regardless of any SGX protection; the same applies to DMA-enabled interfaces and other heterogeneous processing elements.

The solutions described so far require specialized hardware, instruction-set extensions that are not yet available, or slow hypervisor-based approaches. Another solution, based solely on cheap, commodity, widely available hardware, is PixelVault [130]. PixelVault keeps cryptographic keys and carries out cryptographic operations exclusively on the GPU, which allows it to protect secret keys from leakage even when the platform’s operating system is compromised. This is achieved by exposing secret keys only in GPU registers, keeping PixelVault’s critical code in the GPU instruction cache, and preventing any access to both of them from the host. Due to the non-preemptive execution mode of the GPU, an adversary with full control of the host cannot tamper with PixelVault’s GPU code but can only terminate it, in which case all sensitive data is lost.

Within the ARM architecture, in addition to the more traditional hardware virtualization extensions, the processor and system architecture also provide an additional level of software privilege elevation known as TrustZone. Although often misrepresented as virtualization, the key benefit of TrustZone over virtualization is that the only technology footprint exposed for moving between security modes is a single vector; the design space of that interface is therefore limited to the values of the registers at the point the vector is executed. Unlike virtualization, which exposes a significant number of vectors, traps, and hypercalls, the transition into the secure mode can be explicitly proven not to expose any potential exploits, and it is therefore used today to deliver services that require formal certification, such as PIN-based credit card payment.

TrustZone also communicates the current level of TrustZone privilege from the processor to the rest of the hardware system. This enables the system to expose the rest of a rich heterogeneous platform only at the specified privilege level. For example, the ability of DMA engines, GPUs, or other programmable system masters to access memory can be forcibly disabled at the hardware address-bus level while specific secure operations, such as PIN entry or security-key computations, are carried out, removing the risk that the platform software, or any other component in the wider system, could observe the keypad entry. Likewise, access to the configuration registers of system components can be restricted to certified code protected behind the TrustZone single-vector interface. This could, for example, ensure that virtualization of system masters can only be asserted through system-MMU capabilities into the address space of a single guest operating system, removing any risk from a compromised host OS or hypervisor.

7.7.4 Research challenges

There are several research challenges in this area. Some of them include:

• Enable users to have control of their data; that is, enable users to know which of their data have been collected and by whom.

• Enable users to know how/when their data are being transferred/sold from one entity to another.

• Provide privacy mechanisms for users who cannot hide their identity, for example because they have to log in to their smartphone or their social network.

• Provide data integrity mechanisms: make sure that data provided by users are not changed or tampered with.

• Create security schemes that remain trustworthy even when neither the security provider nor the hardware platform on which the data is manipulated can be trusted.

8 Open topics requiring further investigation

This section lists the topics that have been identified as relevant to the roadmapping effort but that require further investigation:

1. Core processing IP market
2. Silicon nano-technology development and EDA
3. Multi-module and system in package device flexibility
4. Rich heterogeneous devices and abstraction granularity
5. Techniques in approximate and good-enough computing
6. Efficiency of power management schemes
7. Processor element roadmap
8. Data capture and realtime processing
9. Datacenter architecture
10. Wide Area Networks (WAN)
11. Further security aspects
12. Data management (NoSQL databases and related approaches)
13. Neuromorphic computing
14. Content addressable and associative memories

9 Bibliography

[1] (2014, Dec.) [Online]. http://www.engadget.com/2014/12/12/seagate-ships-8tb-shingled-hard-drive/

[2] (2012, June) [Online]. http://bigdatachallenges.com/2012/06/10/tape-vs-disk-its-time-for-a-truce/

[3] [Online]. http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/ssd850pro/specifications.html

[4] [Online]. http://www.hybridmemorycube.org/

[5] [Online]. http://www.jedec.org/standards-documents/docs/jesd235

[6] [Online]. http://www.jedec.org/standards-documents/docs/jesd229-2

[7] [Online]. http://www.tezzaron.com/products/diram4-3d-memory/

[8] [Online]. http://www.samsung.com/global/business/semiconductor/html/product/flash-solution/vnand/overview.html

[9] International Technology Roadmap for Semiconductors. (2013) ITRS Report 2013 Edition. [Online]. http://www.itrs.net/ITRS%201999-2014%20Mtgs,%20Presentations%20&%20Links/2013ITRS/Summary2013.htm

[10] K. Eshraghian et al., "Memristor MOS Content Addressable Memory (MCAM): Hybrid Architecture for Future High Performance Search Engines," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 19, no. 8, pp. 1407-1417, Aug 2011.

[11] Soogine Chong et al., "Nanoelectromechanical (NEM) relays integrated with CMOS SRAM for improved stability and low leakage," in Computer-Aided Design - Digest of Technical Papers, 2009. ICCAD 2009. IEEE/ACM International Conference on, 2009, pp. 478-484.

[12] [Online]. http://www.pcworld.com/article/2449508/samsung-reveals-supercharged-850-pro-ssds-stuffed-with-bleeding-edge-v-nand-tech.html

[13] [Online]. http://www.itworld.com/article/2851019/intel-plans-3-D-nand-flash-next-year-for-as-much-storage-as-you-want.html

[14] Ferad Zyulkyarov, Qiong Cai, Nevin Hyuseinova, and Serkan Ozdemir, "System and method for managing persistence with a multi-level memory hierarchy including non-volatile memory," WO 2013147820 A1, Mar 29, 2012.

[15] (2014, June) [Online]. http://www.engadget.com/2014/06/11/hp-the-machine/

[16] [Online]. http://www.networkworld.com/article/2859386/network-storage/a-terabyte-on-a-postage-stamp-rram-heads-into-commercialization.html

[17] (2014, Jan) [Online]. http://www.theregister.co.uk/2014/01/14/phase_change_micron_drops_phase_change_memory_products/

[18] Jon Peddie Research. GPU market up—Intel and Nvidia graphics winners in Q4, AMD down. [Online]. http://jonpeddie.com/press-releases/details/gpu-market-upintel-and-nvidia-graphics-winners-in-q4-amd-down/

[19] Vasilis F. Pavlidis and Eby G. Friedman, Three-dimensional Integrated Circuit Design. Morgan Kaufmann, 2009, ISBN: 978-0-12-374343-5.

[20] M. Karnezos, "3D packaging: where all technologies come together," in Electronics Manufacturing Technology Symposium, 2004. IEEE/CPMT/SEMI 29th International, 2004, pp. 64-67.

[21] G. Patti, "3-D stacked ICs: From Vision to Volume," Chip Scale Review, vol. 18, no. 6, pp. 32-35, Nov/Dec 2014.

[22] E. Jan Vardaman, "Moving 3-D ICs into HVM: Narrowing the Gap," Chip Scale Review, vol. 18, no. 6, pp. 5-7, Nov/Dec 2014.

[23] R. Courtland, "Memory in the Third Dimension," IEEE Spectrum, vol. 51, no. 1, pp. 60-61, Jan 2014.

[24] [Online]. http://www.xilinx.com/products/silicon-devices/3-Dic.html

[25] Electronic Leaders Group. (2014, Feb) A European Industrial Strategic Roadmap for Micro- and Nano-Electronic Components and Systems. [Online]. http://ec.europa.eu/digital-agenda/en/news/european-industrial-strategic-roadmap-micro-and-nano-electronic-components-and-systems

[26] Mark Weiser, "The Computer for the 21st Century," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 3, no. 3, pp. 3-11, July 1999.

[27] Internet of Things – Architecture. [Online]. http://www.iot-a.eu/public

[28] D. Bertozzi, G. Dimitrakopoulos, J. Flich, and S. Sonntag, "The fast evolving landscape of on-chip communication," Design Automation for Embedded Systems, April 2014.

[29] M. Iqbal, M.J. McFadden, and M.W. Haney, "Intrachip Global Interconnects and the Saturation of Moore's Law," in Proceedings of LEOS Summer Topical: Optical Interconnects and VLSI Photonics, 2004.

[30] K. Bergman, L.P. Carloni, A. Biberman, J. Chan, and G. Hendry, Photonic Network-on-Chip Design. New York: Springer-Verlag, 2014, vol. 68.

[31] (2015) Silicon Integrated Nanophotonics. [Online]. http://researcher.watson.ibm.com/researcher/view_group.php?id=2757

[32] Moving Data with Silicon and Light. [Online]. http://www.intel.com/content/www/us/en/research/intel-labs-silicon-photonics-research.html

[33] Kingsley A. Ogudo, Diethelm Schmieder, Daniel Foty, and Lukas W. Snyman, "Optical propagation and refraction in silicon complementary metal–oxide–semiconductor structures at 750 nm: toward on-chip optical links and microphotonic systems," Journal of Micro/Nanolithography, MEMS, and MOEMS, vol. 12, no. 1, pp. 013015-013015, March 2013.

[34] Noam Ophir et al., "First Demonstration of Error-Free Operation of a Full Silicon On-Chip Photonic Link," in Optical Fiber Communication Conference, Los Angeles, 2011.

[35] B. Garbin, J. Javaloyes, G. Tissoni, and S. Barland, "Buffering optical data with topological localized structures," in CLEO: QELS_Fundamental Science, San Jose, 2014.

[36] J. K. Jang et al., "High-fidelity optical buffer based on temporal cavity solitons," in CLEO: 2014, OSA Technical Digest, 2014.

[37] Qiaoshan Chen, Fanfan Zhang, Ruiqiang Ji, Lei Zhang, and Lin Yang, "Universal method for constructing N-port non-blocking optical router based on 2 × 2 optical switch for photonic networks-on-chip," Optics Express, vol. 22, no. 10, pp. 12614-12627, 2014.

[38] Alexander L. Gaeta, "All-optical switching in silicon ring resonators," in Advanced Photonics for Communications, OSA Technical Digest, 2014.

[39] Kingsley A. Ogudo et al., "Towards 10-40 GHz on-chip micro-optical links with all integrated Si Av LED optical sources, Si N based waveguides and Si-Ge detector technology," in Proc. SPIE , Optical Interconnects XIV, vol. 8991, 2014.

[40] L. Ramini, D. Bertozzi, and L.P. Carloni, "Engineering a Bandwidth-Scalable Optical Layer for a 3D Multi-core Processor with Awareness of Layout Constraints," in Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on, 2012, pp. 185-192.

[41] Aniruddha N. Udipi, Naveen Muralimanohar, Rajeev Balasubramonian, Al Davis, and Norman P. Jouppi, "Combining Memory and a Controller with Photonics Through 3D-stacking to Enable Scalable and Energy-efficient Systems," in ISCA '11 Proceedings of the 38th annual international symposium on Computer architecture , June 2011, pp. 425-436.

[42] Q. Guo et al., "3D-Stacked Memory-Side Acceleration: Accelerator and System Design," in 2nd Workshop on Near Data Processing (WONDP) in conjunction with the 47th IEEE/ACM International Symposium on Microarchitecture (MICRO-47), Cambridge, UK, 2014.

[43] Seth H. Pugsley et al., "NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads," in International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, USA, 2014, pp. 190-200.

[44] Anja Boos, Luca Ramini, Ulf Schlichtmann, and Davide Bertozzi, "PROTON: An Automatic Place-and-route Tool for Optical Networks-on-chip," in Proceedings of the International Conference on Computer-Aided Design (ICCAD '13), San Jose, California, 2013, pp. 138-145.

[45] Kai Feng, Yaoyao Ye, and Jiang Xu, "A Formal Study on Topology and Floorplan Characteristics of Mesh and Torus-based Optical Networks-on-chip," Microprocessors & Microsystems, vol. 37, no. 8, pp. 941-952, Nov 2013.

[46] SéBastien Le Beux, Ian O'Connor, Gabriela Nicolescu, Guy Bois, and Pierre Paulin, "Reduction Methods for Adapting Optical Network on Chip Topologies to 3D Architectures," Microprocessors & Microsystems, vol. 37, no. 1, pp. 87-98, Feb 2013.

[47] Xianfang Tan, Mei Yang, Lei Zhang, Yingtao Jiang, and Jianyi Yang, "A Generic Optical Router Design for Photonic Network-on-Chips," Lightwave Technology, Journal of, vol. 30, no. 3, pp. 368-376, Feb 2012.

[48] Yaoyao Ye et al., "Holistic comparison of optical routers for chip multiprocessors," in Anti-Counterfeiting, Security and Identification (ASID), 2012 International Conference on, 2012, pp. 1-5.

[49] A. Bose, P. Ghosal, and S.P. Mohanty, "A Low Latency Scalable 3D NoC Using BFT Topology with Table Based Uniform Routing," in VLSI (ISVLSI), 2014 IEEE Computer Society Annual Symposium on, 2014, pp. 136-141.

[50] Weigang Hou, Lei Guo, Qing Cai, and Lijiao Zhu, "3D Torus ONoC: Topology design, router modeling and adaptive routing algorithm," in Optical Communications and Networks (ICOCN), 2014 13th International Conference on, 2014, pp. 1-4.

[51] Luca Ramini, Paolo Grani, Sandro Bartolini, and Davide Bertozzi, "Contrasting wavelength-routed optical NoC topologies for power-efficient 3d-stacked multicore processors using physical-layer analysis," in Design, Automation & Test in Europe Conference & Exhibition (DATE), Proceeding of the Conference on, 2013, pp. 1589-1594.

[52] Yaoyao Ye et al., "3-D Mesh-Based Optical Network-on-Chip for Multiprocessor System-on-Chip," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 32, no. 4, pp. 584-596, April 2013.

[53] Lin Liu and Yuanyuan Yang, "Energy-aware routing in hybrid optical network-on-chip for future multi-processor system-on-chip," Journal of Parallel and Distributed Computing, vol. 73, no. 2, pp. 189-197, 2013.

[54] L. Ramini et al., "Assessing the energy break-even point between an optical NoC architecture and an aggressive electronic baseline," in Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pp. 1-6.

[55] Assaf Shacham, Keren Bergman, and Luca P. Carloni, "The Case for Low-power Photonic Networks on Chip," in Proceedings of the 44th Annual Design Automation Conference, DAC'07, San Diego, California, 2007, pp. 132-135.

[56] Yaoyao Ye et al., "System-Level Modeling and Analysis of Thermal Effects in Optical Networks-on-Chip," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 21, no. 2, pp. 292-305, Feb 2013.

[57] N. Gruhler, C. Benz, R. Danneau, and W. Pernice, in CLEO: 2014, OSA Technical Digest, 2014.

[58] A. Majumdar, Jonghwan Kim, J. Vuckovic, and Feng Wang, "Graphene for Tunable Nanophotonic Resonators," Selected Topics in Quantum Electronics, IEEE Journal of, vol. 20, no. 1, pp. 68-71, Jan-Feb 2014.

[59] Longhai Yu, Jiajiu Zheng, Daoxin Dai, and Sailing He, "Observation of optically induced transparency effect in silicon nanophotonic wires with graphene," in Proc. SPIE, Smart Photonic and Optoelectronic Integrated Circuits XVI, vol. 8989, 2014.

[60] Yi Huang, Lin-Sheng Wu, Min Tang, and Junfa Mao, "High-performance resonator based on single-walled carbon nanotube bundle for THz application," Journal of Electromagnetic Waves and Applications, vol. 28, no. 3, pp. 316-325, 2014.

[61] Svetlana Khasminskaya, Feliks Pyatkov, Benjamin S. Flavel, Wolfram H. Pernice, and Ralph Krupke, "Waveguide-Integrated Light-Emitting Carbon Nanotubes," Advanced Materials, vol. 26, no. 21, pp. 3465-3472, June 2014.

[62] Petar K. Pepeljugoski et al., "Low Power and High Density Optical Interconnects for Future Supercomputers," in Optical Fiber Communication Conference, OSA Technical Digest, 2010.

[63] Marc Taubenblatt, Jeffrey A. Kash, and Yoichi Taira, "Optical Interconnects for High Performance Computing," in Asia Communications and Photonics Conference and Exhibition, 2009.

[64] D. Crisan, R. Birke, N. Chrysos, C. Minkenberg, and M. Gusat, "zFabric: How to virtualize lossless ethernet?," in Cluster Computing (CLUSTER), 2014 IEEE International Conference on, 2014, pp. 75-83.

[65] Mohammad Alizadeh et al., "Deconstructing datacenter packet transport," in Proceedings of the 11th ACM Workshop on Hot Topics in Networks (HotNets-XI), New York, NY, USA, 2012, pp. 133-138.

[66] Christoforos Kachris and I. Tomkos, "A Survey on Optical Interconnects for Data Centers," Communications Surveys & Tutorials, IEEE, vol. 14, no. 4, pp. 1021-1036, Fourth Quarter 2012.

[67] N. Farrington et al., "A Multiport Microsecond Optical Circuit Switch for Data Center Networking," Photonics Technology Letters, IEEE, vol. 25, no. 16, pp. 1589-1592, Aug 2013.

[68] Polatis Inc., "The New Optical Data Center," Polatis Data Sheet 2009.

[69] Guohui Wang et al., "c-Through: part-time optics in data centers," in Proceedings of the ACM SIGCOMM 2010 conference (SIGCOMM '10), New York, NY, USA, 2010, pp. 327-338.

[70] Nathan Farrington et al., "Helios: a hybrid electrical/optical switch architecture for modular data centers," in Proceedings of the ACM SIGCOMM 2010 conference (SIGCOMM '10), New York, NY, USA, 2010, pp. 339-350.

[71] R. Luijten, W.E. Denzel, R.R. Grzybowski, and R. Hemenway, "Optical interconnection networks: The OSMOSIS project," in Lasers and Electro-Optics Society, 2004. LEOS 2004. The 17th Annual Meeting of the IEEE, 2004, pp. 563-564.

[72] (2014) Hadoop. [Online]. http://hadoop.apache.org

[73] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google file system," in Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP '03), 2003, pp. 29-43.

[74] (2014) Apache Hadoop - YARN. [Online]. http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

[75] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM - 50th anniversary issue: 1958 - 2008, vol. 51, no. 1, pp. 107-113, Jan 2008.

[76] (2014) Apache Hive. [Online]. https://hive.apache.org

[77] Ashish Thusoo et al., "Hive: a warehousing solution over a map-reduce framework," Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626-1629, Aug 2009.

[78] (2012) Apache Pig. [Online]. https://pig.apache.org

[79] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins, "Pig latin: a not-so-foreign language for data processing," in Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08), 2008, pp. 1099-1110.

[80] (2015) Apache Tez. [Online]. http://tez.apache.org

[81] (2015) Apache Spark. [Online]. https://spark.apache.org

[82] (2015) Apache Flink. [Online]. http://flink.apache.org

[83] Alexander Alexandrov et al., "The Stratosphere platform for big data analytics," The VLDB Journal — The International Journal on Very Large Data Bases, vol. 23, no. 6, pp. 939-964, Dec 2014.

[84] Apache Storm. [Online]. https://storm.apache.org

[85] Apache Samza. [Online]. http://samza.apache.org

[86] Grzegorz Malewicz et al., "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (SIGMOD '10), 2010, pp. 135-146.

[87] (2013) Gunrock – High-performance Graph Primitives on GPU. [Online]. http://gunrock.github.io/gunrock/

[88] Shan Shan Huang, Amir Hormati, David F. Bacon, and Rodric Rabbah, "Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary," in ECOOP 2008, 22nd European Conference on Object-Oriented Programming, 2008, pp. 76-103.

[89] Jure Leskovec and Christos Faloutsos, "Sampling from large graphs," in KDD '06 Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining , 2006.

[90] Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan W. Berry, "Challenges in Parallel Graph Processing," Parallel Processing Letters, vol. 17, no. 1, pp. 5-20, 2007.

[91] C. Hubler, H.-P. Kriegel, K. Borgwardt, and Z. Ghahramani, "Metropolis Algorithms for Representative Subgraph Sampling," in Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, 2008, pp. 283-292.

[92] Arun S. Maiya and Tanya Y. Berger-Wolf, "Sampling community structure," in Proceedings of the 19th international conference on World wide web (WWW '10), 2010, pp. 701-710.

[93] Xuesong Lu and Stéphane Bressan, "Sampling connected induced subgraphs uniformly at random," in Proceedings of the 24th international conference on Scientific and Statistical Database Management (SSDBM'12), 2012, pp. 195-212.

[94] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng, "A model-based approach to attributed graph clustering," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), 2012, pp. 505-516.

[95] Arlei Silva, Wagner Meira Jr., and Mohammed J. Zaki, "Mining attribute-structure correlated patterns in large attributed graphs," Proceedings of the VLDB Endowment, vol. 5, no. 5, pp. 466-477, Jan 2012.

[96] Jieming Shi, Nikos Mamoulis, Dingming Wu, and David W. Cheung, "Density-based place clustering in geo-social networks," in Proceedings of the 2014 ACM SIGMOD international conference on Management of data (SIGMOD '14), 2014, pp. 99-110.

[97] Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, "Graph OLAP: a multi-dimensional framework for graph data analysis," Knowledge and Information Systems, vol. 21, no. 1, pp. 41-63, Aug 2009.

[98] Zhengkui Wang et al., "Pagrol: Parallel graph olap over large-scale attributed graphs," in Data Engineering (ICDE), 2014 IEEE 30th International Conference on, 2014, pp. 496-507.

[99] Pawan Harish and P. J. Narayanan, "Accelerating large graph algorithms on the GPU using CUDA," in Proceedings of the 14th international conference on High performance computing (HiPC'07), 2007, pp. 197-208.

[100] Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun, "Accelerating CUDA graph algorithms at maximum warp," in Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (PPoPP '11), 2011, pp. 267-276.

[101] Lu Qin et al., "Scalable big graph processing in MapReduce," in Proceedings of the 2014 ACM SIGMOD international conference on Management of data (SIGMOD '14), 2014, pp. 827-838.

[102] Tomasz Kajdanowicz, Przemyslaw Kazienko, and Wojciech Indyk, "Parallel processing of large graphs," Future Generation Computer Systems, vol. 32, pp. 324-447, Mar 2014.

[103] Intel Corporation. (2014, Aug) Big Data Analytics – Intel’s IT Manager Survey on How Organizations Are Using Big Data. [Online]. http://www.intel.de/content/dam/www/public/us/en/documents/reports/data-insights-peer-research-report.pdf

[104] James J. Thomas and Kristin A. Cook, Eds., Illuminating the Path: The Research and Development Agenda for Visual Analytics.: IEEE CS Press, 2005, ISBN: 0769523234.

[105] Pak Chung Wong, Han-Wei Shen, C.R. Johnson, C. Chen, and Robert B. Ross, "The Top 10 Challenges in Extreme-Scale Visual Analytics," Computer Graphics and Applications, IEEE, vol. 32, no. 4, pp. 63-67, July-Aug 2012.

[106] Randall Rohrer, Celeste Lyn Paul, and Bohdan Nebesh, "Visual analytics for Big Data," The Next Wave, vol. 20, no. 4, 2014.

[107] Joseph A. Cottam, Andrew Lumsdaine, and Peter Wang, "Abstract rendering: out-of-core rendering for information visualization," in Proceedings SPIE 9017, Visualization and Data Analysis , vol. 9017, 2014.

[108] Danyel Fisher, Steven M. Drucker, and A. Christian König, "Exploratory Visualization Involving Incremental, Approximate Database Queries and Uncertainty," Computer Graphics and Applications, IEEE, vol. 32, no. 4, pp. 55-62, July-Aug 2012.

[109] Yun Jang, D.S. Ebert, and K. Gaither, "Time-Varying Data Visualization Using Functional Representations," Visualization and Computer Graphics, IEEE Transactions on, vol. 18, no. 3, pp. 421-433, March 2012.

[110] Azul Systems, Inc. (2015) Azul Systems: Java for the Real Time Business. [Online]. http://www.azulsystems.com

[111] Azul Systems, Inc. (2015) Azul Systems: Java for the Real Time Business. [Online]. http://www.azulsystems.com/products/vega/processor

[112] A. Putnam et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, 2014, pp. 13-24.

[113] Oliver Pell and Oskar Mencer, "Surviving the end of frequency scaling with reconfigurable dataflow computing," ACM SIGARCH Computer Architecture News, vol. 39, no. 4, pp. 60-65, Dec 2011.

[114] Cray Inc., "Cray XD1 Datasheet," 2005.

[115] (2015, Feb) Transmeta Corporation. [Online]. http://en.wikipedia.org/wiki/Transmeta

[116] (2015, Mar) Project Denver. [Online]. http://en.wikipedia.org/wiki/Project_Denver

[117] Oracle Corporation. (2015) [Online]. https://labs.oracle.com/pls/apex/f?p=labs:49:P49_PROJECT_ID:14

[118] Roger Dingledine, Nick Mathewson, and Paul Syverson, "Tor: the second-generation onion router," in Proceedings of the 13th conference on USENIX Security Symposium (SSYM'04), vol. 13, Berkeley, CA, USA , 2004, pp. 21-21.

[119] David L. Chaum, "Untraceable electronic mail, return addresses, and digital pseudonyms," Communications of the ACM, vol. 24, no. 2, pp. 84-90, Feb 1981.

[120] Georgios Kontaxis, Michalis Polychronakis, and Evangelos P. Markatos, "SudoWeb: Minimizing Information Disclosure to Third Parties in Single Sign-on Platforms," in Information Security, 14th International Conference, Proceedings, Xi’an, China, 2011, pp. 197-212.

[121] Peter Eckersley, "How unique is your web browser?," in Proceedings of the 10th international conference on Privacy enhancing technologies (PETS'10), 2010, pp. 1-18.

[122] Daniel C. Howe and Helen Nissenbaum, "TrackMeNot: Resisting Surveillance in Web Search," in On the Identity Trail: Privacy, Anonymity and Identity in a Networked Society, Ian Kerr, Carole Lucock, and Valerie Steeves, Eds.: Oxford University Press, 2009, pp. 417-436.

[123] Michael Barbaro and Tom Zeller. (2006, Aug) The New York Times. [Online]. http://www.nytimes.com/2006/08/09/technology/09aol.html

[124] P. Samarati, "Protecting respondents identities in microdata release," Knowledge and Data Engineering, IEEE Transactions on, vol. 13, no. 6, pp. 1010-1027, Nov/Dec 2001.

[125] Cynthia Dwork, "Differential Privacy," in Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Proceedings, Venice, Italy, 2006.

[126] Xiaoxin Chen et al., "Overshadow: a virtualization-based approach to retrofitting protection in commodity operating systems," in Proceedings of the 13th international conference on Architectural support for programming languages and operating systems (ASPLOS XIII), New York, NY, USA, 2008.

[127] Owen S. Hofmann, Sangman Kim, Alan M. Dunn, Michael Z. Lee, and Emmett Witchel, "InkTag: secure applications on an untrusted operating system," in Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (ASPLOS '13), 2013, pp. 265-278.

[128] John Criswell, Nathan Dautenhahn, and Vikram Adve, "Virtual ghost: protecting applications from hostile operating systems," in Proceedings of the 19th international conference on Architectural support for programming languages and operating systems (ASPLOS '14), 2014, pp. 81-96.

[129] Intel Corporation. (2013, Sept) Software Guard Extensions Programming Reference. [Online]. https://software.intel.com/sites/default/files/329298-001.pdf

[130] Giorgos Vasiliadis, Elias Athanasopoulos, Michalis Polychronakis, and Sotiris Ioannidis, "PixelVault: Using GPUs for Securing Cryptographic Operations," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14), 2014, pp. 1131-1142.