d3.1 review v1 - cordis


ICT-2011.9.8 - FET Proactive PHIDIAS

File: PHIDIAS_D_3_1.doc 1 of 16

Deliverable report for

P H I D I A S

"Ultra-Low-Power Holistic Design for Smart Biosignals Computing Platforms"

Grant Agreement Number 318013

Deliverable D 3.1 System Specification

Due date of deliverable: 30/09/2013

Lead beneficiary for this deliverable: UNIBO, EPFL, IMEC-NL

Contributors:
- UNIBO: deliverable draft, system specification, memory hierarchy, dynamic interconnect, technology integration
- EPFL: core processor, digital CS workload analysis, synchronization mechanism
- IMEC-NL: analog CS workload analysis, low-power memories, analogue front-end and A/D converter

Dissemination Level:
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)

Version: 1.3 (review)
Date: 04.03.2014
Draft of the WP Leader
Commented version for amendment
Version accepted by the Steering Board  X

 


1. Description of task

This deliverable reports the results of Task 1 of WP3. It describes the overall ultra-low-power hardware and software architecture that will be used as the baseline for the PHIDIAS project tasks.

2. Summary

The objective of Phidias is to investigate new technologies that allow the development of extremely power-efficient bio-sensing nodes. The project addresses the shift in healthcare policies, which require long-term monitoring of bio-signals by means of embedded, ultra-low-power Wireless Body Sensor Networks. Phidias aims at unlocking the development of ultra-low-power bio-sensing platforms through disruptive technologies:

• Novel signal processing models and methods based on advanced Compressive Sensing paradigms, departing from traditional Shannon sampling methodologies

• Efficient hardware implementation of components (analog and digital) for ultra-low-power sensing

• Joint architecture optimization and integration of those components in novel heterogeneous architectures

• Evaluation of the system-wide power reduction achieved when performing bio-signal applications based on compressed sensing

In this deliverable we describe the PHIDIAS System Specification and Architecture from both a Hardware and Software viewpoint.

3. System Specification

Compressive Sensing (CS) leverages the intrinsic sparsity of bio-signals when they are represented in a suitable domain, such as the wavelet domain. Sparsity makes it possible to sample below the minimum sampling frequency imposed by the Shannon-Nyquist theorem, while still allowing a high-quality reconstruction of the sampled signal. On the sensing side, CS is performed by applying an M×N sensing operator Φ to a window x of the acquired signal¹, of length N, with M << N. The obtained coefficients y are then the compressed representation of the signal. In mathematical terms, y = Φx.

CS can be implemented either at the analogue hardware level or as a digital (software) routine executing on an ultra-low-power platform. In the context of the Phidias project, both embodiments of CS are being considered. The envisioned system therefore comprises both analogue components, such as the front-end (AFE) and the analogue-to-digital converter (ADC), and an ultra-low-power digital platform. The flexibility deriving from this choice also allows the partners to explore novel research avenues, including hybrid solutions where the digital and analogue parts cooperate to increase the energy efficiency and the reconstruction quality of the sampled signals.

1 A universally good choice for the Φ operator is a random matrix containing independent identically distributed (i.i.d.) elements.
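As an illustration of the sensing step y = Φx, the following minimal sketch compresses one window of N samples into M measurements. It is not the project's implementation: the ±1 entries of Φ are only one possible i.i.d. choice, the window contents are a synthetic stand-in for acquired samples, and the function names `make_sensing_matrix` and `compress` are hypothetical. The sizes N = 256 and M = 77 are borrowed from the lifestyle scenario quoted in this deliverable.

```python
import random

def make_sensing_matrix(m, n, seed=0):
    """Return an m x n matrix of i.i.d. +1/-1 entries (assumed distribution)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]

def compress(phi, x):
    """Compute y = phi * x for one window x of length N."""
    return [sum(p * s for p, s in zip(row, x)) for row in phi]

N, M = 256, 77                      # window length and number of measurements (M << N)
phi = make_sensing_matrix(M, N)
# Synthetic stand-in for N acquired samples (not real bio-signal data).
window = [((7 * i) % 23) - 11 for i in range(N)]
y = compress(phi, window)
print(len(window), "samples ->", len(y), "measurements")
```

With ±1 entries each measurement needs only additions and subtractions, which is one reason such matrices are attractive on small cores; reconstruction of x from y happens off-node and is not shown here.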

Digital CS workload analysis

Digital CS can be efficiently parallelized. Parallel execution, in turn, requires a lower operating clock frequency, opening the opportunity to scale down the supply voltage and thus the energy consumption. Aggressive voltage/frequency scaling (VFS) increases energy efficiency by reducing the static power consumption due to leakage. To illustrate this point, Figure 1 shows a comparison, derived from post-layout simulations, of the energy efficiency of single-core and eight-core implementations of digital CS using TamaRISC cores. The figure shows that, for sampling frequencies higher than 250 Hz, the parallel architecture outperforms the single-core one in terms of energy efficiency (including dynamic and leakage power).

Figure 1: Single- and multi-core power consumption for digital CS (from footnote 2)

Moreover, because computation is performed on multiple cores concurrently, Single Instruction Multiple Data (SIMD) execution paradigms can be implemented, coalescing the accesses of multiple cores to the instruction memory banks. This strategy can lead to further energy savings with respect to the data shown in Figure 1, and will be extensively investigated throughout the project.

2 A. Y. Dogan et al., "Power/Performance Exploration of Single-core and Multi-core Processor Approaches for Biomedical Signal Processing," PATMOS 2011


Ultra-low-power multi-core architectures are therefore promising candidates as digital CS platforms. Nonetheless, care must be taken to tailor architectural elements such as computing cores, interconnects and memories to the needs of the considered application, as described in the following sections of the present deliverable.

Analog CS implementation analysis

A number of architectural components must be considered when devising different implementation choices for analog CS. In particular, the choice of the ADC topology, as well as of the front-end amplifier and the mixer/integrator implementation, results in different performance/energy-efficiency points. In D2.1, two scenarios with different requirements were considered: a medical-grade device with 16-bit resolution and 1 µV rms input-referred noise, and a lifestyle scenario with the less stringent constraints of 12-bit resolution and 3 µV rms input-referred noise. Table 1 summarizes the resulting estimates of the power consumption in both cases.

Table 1: Power consumption estimates for the analogue CS implementation under different requirements

Application      Power consumption per channel (µW)
Medical grade    830
Lifestyle        0.3

D2.1 shows that the total power consumption becomes 170 mW for the medical-grade scenario (M=205, N=256) and 23.1 µW for the lifestyle scenario (M=77, N=256). This highlights the potential benefits of digital CS in medical-grade applications.
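The totals quoted from D2.1 are numerically consistent with multiplying the per-channel figures of Table 1 by the number of measurements M (830 µW × 205 ≈ 170 mW, 0.3 µW × 77 = 23.1 µW). The sketch below reproduces this arithmetic. Reading the total as "one analogue channel per measurement" is an assumption made here for illustration and is not stated explicitly in this deliverable; the function name `total_power_uw` is hypothetical.

```python
# Per-channel power (uW) from Table 1 and (M, N) as quoted from D2.1.
scenarios = {
    "medical grade": {"per_channel_uw": 830.0, "M": 205, "N": 256},
    "lifestyle": {"per_channel_uw": 0.3, "M": 77, "N": 256},
}

def total_power_uw(per_channel_uw, m):
    """Assumed model: total power = M channels x per-channel power."""
    return per_channel_uw * m

for name, s in scenarios.items():
    total = total_power_uw(s["per_channel_uw"], s["M"])
    ratio = s["N"] / s["M"]          # samples per measurement in one window
    print(f"{name}: {total:.1f} uW total, N/M = {ratio:.2f}")
```

Under this reading the medical-grade case pays twice: a much higher per-channel cost and a larger M, which is why the gap between the two totals spans almost four orders of magnitude.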

System Architecture - Overview

Figure 2 depicts the block diagram of the proposed architecture. It is composed of five main components: Analog Front End & Digital Converter, ULP Processor, Wireless Link, Software Stack and Integrating Technology. In the context of the project, state-of-the-art solutions such as the Bluetooth Low Energy³ or Zigbee⁴ protocols are assumed for the wireless transmission stage.

3 http://www.bluetooth.com/Pages/Low-Energy.aspx
4 www.zigbee.org

Figure 2: System Architecture - Block Diagram


The incoming bio-signal is sampled by the analog front end (AFE) and converted to a digital signal. Depending on the most appropriate strategy (given the input bio-signal and the required quality), the AFE can produce digital samples of the input bio-signal at the Nyquist frequency or at a lower one. In the latter case the analog front end directly compresses the input bio-signal, and we refer to this as "analog compressive sensing". Otherwise, the compression is performed digitally by the ultra-low-power processors. A software stack running on top of the digital processors makes it possible to exploit parallelism in the compressive sensing computation, and to execute either digital signal processing operations on the compressed signal, optimizing it for transmission through the wireless link, or machine learning algorithms that extract important features and reduce the information sent to the host. The analog front end and the ultra-low-power processors are built in two different processes as two separate dies. The integrating technology substrate is in charge of interfacing these two physical components.

4. Analog Front End and Digital Converter

A vital building block of bio-potential measurement systems is the analog front end (AFE), which defines the signal quality and rejects the measurement aggressors. High-quality signal extraction using portable and wearable bio-potential measurement systems requires increased power dissipation in the AFE. This introduces a trade-off between signal quality and battery lifetime in the design of such systems. Figure 3 shows a typical analog front end for a bio-potential acquisition system. The core of the AFE comprises the instrumentation amplifier (IA), whose performance largely dictates the performance of the system; an optional programmable gain amplifier (PGA) to set the overall gain of the system; and a sample-and-hold amplifier (SHA) that discretizes the input signal so that it can subsequently be digitized by an analog-to-digital converter (ADC). The ADC typically interacts with the digital back end (DBE) through SPI. To ensure the appropriate functioning of the AFE and DBE, ancillary blocks such as bias circuits, voltage references, clock generators and power management circuits are required as part of the AFE. More details on analog CS are reported in D2.1.


Figure 3: Typical analog front end

 

5. ULP Processor

Near-threshold computing (NTC) has emerged as a promising approach to achieve up to one order of magnitude improvement in the energy efficiency of integrated circuits. The key feature of NTC is to lower the supply voltage to a value slightly higher than the threshold voltage. One of the main issues with low-voltage operation is performance degradation, which can limit the extent to which voltage scaling can be used for a given processing requirement. When the algorithms to be executed can be parallelized, as in the case of CS applications, parallel computing using multiple cores can alleviate this issue. The proposed ULP processor is based on a multi-core architecture working in the near-threshold operating region. The devised architectural template features a parametric number of processors with independent instruction streams, sharing a multi-banked tightly coupled data memory (TCDM). Efficient communication between the cores, the TCDM and the external resources is achieved through a high-bandwidth, low-latency interconnect. An abstracted view of the ULP processor architecture is presented in Figure 4, while the variants currently under investigation, covering different aspects of the design, are described in the next section.


Figure 4: Architectural template of the ULP multi-core processor

   

ULP Processor Architecture

The investigated solution for the multi-core architecture comprises an array of processing cores, connected through crossbars to multi-banked instruction and data memories. Peripherals (i.e. the AFE) are connected through dedicated memory-mapped registers and interrupt lines. The memory is organized into several banks to allow multiple cores to access data or instructions in different banks in parallel, and to minimize the energy per access. The resulting platform is an architectural template, from which diverse instances can be derived parametrically by tuning, at design time, the number of computing cores and of instruction or data memory banks. This approach enables the study of the area, performance and energy-efficiency trade-offs deriving from different choices. The following sections describe the design trade-offs that this platform template makes it possible to evaluate.

Design Challenge #1: Sampled data movement

The presented architectural template enables two main strategies for moving data from the analog front end (AFE) to the multi-core node where the actual computation is performed. Both share an interrupt-based mechanism that allows the digital node to stay in a deep low-power state when no computation is needed. What differs is the unit in charge of transferring data from the sampled-data buffer within the AFE to the TCDM.

1) DMA-based

This solution exploits a DMA engine to transfer all the sampled data to the local memory before triggering the cores for computation. This architectural variant is presented in Figure 5. The DMA engine is intended to be a simple transfer engine without complex features, so as to reduce area overhead. At the same time, a dedicated engine can improve the energy efficiency of the architecture by moving the data in burst transfers.

 


Figure 5: Architectural variant with DMA-based sampled data transfer

2) Core-based

Another option considered so far puts one core in charge of performing the data movement. Due to the simplicity of the core design and the lack of instruction and data caches (details in the next section), burst transfers are not possible. In this version of the architecture (see Figure 6), an interrupt wakes up the core, which in turn moves the sampled data to the local memory. The time required to perform this task is inevitably longer than in the previous version, but area occupation is reduced.

 

Figure 6: Architectural variant with core-based sampled data transfer

   

Design Challenge #2: Synchronizer

A light-weight synchronization unit can be used to orchestrate the execution of the different cores. Substantial energy gains can be achieved when multiple cores execute the same application on multiple acquired inputs, as in the case of ECG. In this case, cores can execute in lock-step, so that the same instruction is accessed in the Instruction Memory (IM) and broadcast to multiple cores, following a Single Instruction / Multiple Data (SIMD) operating mode. SIMD operation reduces the energy consumption due to instruction memory accesses, while also decreasing the number of conflicts and, consequently, processor stalls.

In some circumstances, cores must execute out of lock-step, either because the input data they must process is not available from an external source (such as an ADC or another core), or because a data-dependent branch is executed. In these cases, an efficient mechanism must be devised to resume lock-step execution as soon as possible (for example, at the end of a data-dependent branch). The solution being investigated by EPFL is a hybrid hardware/software approach based on dedicated synchronization instructions and a synchronizer block. Synchronization instructions are issued by the cores when entering or exiting data-dependent branches and when waiting for input data. The synchronizer, in turn, clock-gates waiting cores and resumes their execution according to the issued instructions. EPFL is developing the synchronization mechanism and exploring its energy efficiency, as well as the required area and timing overhead. A comprehensive evaluation will be provided in D2.3.
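The benefit of lock-step execution on instruction memory traffic can be illustrated with a toy counting model. This is a sketch under simplifying assumptions, not a model of the EPFL synchronizer: each core is described by a hypothetical per-cycle trace of fetch addresses, `None` marks a cycle in which the core is clock-gated, and all cores fetching the same address in the same cycle are served by a single broadcast access.

```python
def broadcast_fetches(traces):
    """Instruction-memory accesses when identical same-cycle fetches are broadcast.

    traces[c][t] is the address core c fetches in cycle t (None = clock-gated).
    """
    total = 0
    for cycle in zip(*traces):
        total += len({pc for pc in cycle if pc is not None})
    return total

def independent_fetches(traces):
    """Instruction-memory accesses when every active core fetches on its own."""
    return sum(pc is not None for trace in traces for pc in trace)

# Four cores in lock-step for three cycles.
lockstep = [[0, 1, 2]] * 4
# Core 3 takes a data-dependent branch (addresses 10, 11); the other three
# are gated while they wait, then all four resume in lock-step at address 2.
diverged = [[0, 1, None, None, 2]] * 3 + [[0, 1, 10, 11, 2]]

print(broadcast_fetches(lockstep), independent_fetches(lockstep))
print(broadcast_fetches(diverged), independent_fetches(diverged))
```

In this model the gated cores cost no fetches while they wait, and the access count returns to one per cycle as soon as the diverging core reaches the common resume point.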

Figure 7: Synchronizer architecture


Core Processor

The TamaRISC processing core is being developed by EPFL with the goal of providing an ultra-low-power computing unit, tailored to the digital CS application but flexible enough to execute different workloads in the embedded bio-signal analysis domain. Since most operands in this context do not require more than 16-bit precision, TamaRISC is designed as a 16-bit architecture, with 32-bit operations emulated at the compiler level. It features only three pipeline stages, a choice which allows full forwarding paths among stages and minimizes stalls due to data dependencies. Finally, it has a minimalistic instruction set (comprising 18 different instructions), which permits instruction words to be only 24 bits wide and therefore results in a low energy per access in the instruction memory. These choices resulted in a very compact implementation, requiring ~12 kilo-gates. When instantiated in a multi-core architecture, a simple Address Translation Unit (ATU) is employed to separate each core's private memory from the shared section of the data memory. Moreover, the instruction set can be easily enriched with dedicated extensions to support efficient synchronization among cores. The block scheme of TamaRISC is presented in Figure 8.
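To make "32-bit operations emulated at the compiler level" concrete, the sketch below builds a 32-bit addition from two 16-bit additions with carry propagation. It shows the general technique only; it is not taken from the TamaRISC compiler, and the helper names `add16` and `add32` are hypothetical.

```python
MASK16 = 0xFFFF

def add16(a, b, carry_in=0):
    """16-bit add with carry: returns (16-bit sum, carry out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16

def add32(a, b):
    """32-bit add from two 16-bit adds: low halves first, carry into high halves."""
    lo, carry = add16(a, b)
    hi, _ = add16(a >> 16, b >> 16, carry)   # carry out of bit 31 is dropped
    return (hi << 16) | lo

print(hex(add32(0x0001FFFF, 0x00000001)))
```

Each 32-bit addition therefore costs two 16-bit ALU operations, which is acceptable when, as stated above, most operands fit in 16 bits.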

 

 

Figure 8: Core processor block scheme
[Block diagram of TamaRISC. Labelled blocks: Fetch, Decode, Execute and Writeback; instruction fetch with HW loops and interrupt support (IRQ, sleep); instruction decode with operand modes and address generation; ALU with ADD(C)/SUB(B), AND/OR/XOR, L/R(A)-shift and a 16-bit (un)signed multiplier; register file R0 to R15 (R15 is the link register); status flags C, Z, N, OV; program counter; bypass paths; program memory and data memory interfaces.]

Memory Hierarchy

The considered CS architecture, shown in Figure 4, features a configurable number of Processing Elements (PEs). The PEs have neither private instruction caches nor private data caches, which avoids memory coherency overhead. Instead, they share an L1 multi-banked tightly coupled data memory (TCDM) acting as a shared scratchpad memory, as well as a multi-banked instruction memory for instruction fetch. The TCDM has a number of ports equal to the number of banks, allowing concurrent access to different memory locations. Intra-cluster communication is based on a high-bandwidth logarithmic interconnect (LIC). It consists of a Mesh-of-Trees (MoT) interconnection network able to support single-cycle communication between PEs and memory banks (MBs). In case of multiple conflicting requests, a round-robin scheduler arbitrates the accesses to ensure fair access to the memory banks. To limit the negative impact of banking conflicts we consider a banking factor of 2 (i.e. 32 banks for 16 PEs). To reduce memory access time and increase shared-memory throughput, PEs can benefit from the broadcast mechanism.

In the architecture discussed here, no further memory hierarchy levels are needed, given the requirements of the CS node: the memory footprint of the target CS application is small and known at design time, so it fits entirely in the first level of the memory hierarchy. Figure 9 shows the memory map of the CS node, where a reference size of 256 KB is considered for the TCDM.
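The interplay between banking and arbitration described above can be sketched as follows. Two points are assumptions made for illustration and are not specified in this deliverable: the word-interleaved address-to-bank mapping with 16-bit words, and the exact rotation policy of the round-robin arbiter. The names `bank_of` and `RoundRobinArbiter` are hypothetical.

```python
def bank_of(addr, n_banks, word_bytes=2):
    """Word-interleaved mapping (assumed): consecutive words go to consecutive banks."""
    return (addr // word_bytes) % n_banks

class RoundRobinArbiter:
    """Grants one requester per cycle for a bank, rotating priority for fairness."""

    def __init__(self, n_pes):
        self.n_pes = n_pes
        self.next_pe = 0                     # PE with highest priority this cycle

    def grant(self, requesters):
        """requesters: set of PE ids contending for the same bank in this cycle."""
        for offset in range(self.n_pes):
            pe = (self.next_pe + offset) % self.n_pes
            if pe in requesters:
                self.next_pe = (pe + 1) % self.n_pes
                return pe
        return None                          # nobody asked for this bank

n_pes = 16
n_banks = 2 * n_pes                          # banking factor of 2, as in the text
# 16 PEs reading consecutive 16-bit words land in 16 distinct banks: no conflict.
banks = [bank_of(2 * pe, n_banks) for pe in range(n_pes)]
print(len(set(banks)), "distinct banks for", n_pes, "PEs")
```

With twice as many banks as PEs, unit-stride accesses never collide in this model; the arbiter only matters for irregular patterns, where it serialises the colliding requests one per cycle.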

   

Figure 9: Memory Map

Design Challenge #3: Hybrid Memory

It is important to observe that a CS application, like other computations based on sensor data, is composed of two phases: data collection and computation. During the data collection phase the ULP processor waits for the number of samples required to perform the CS computation. Considering typical sampling frequencies for biomedical signals, this phase lasts longer than the computation phase and has low workload and low memory requirements. During the computation phase, instead, the system is in an operating point characterized by high workload requirements and a high memory footprint: all processing elements are active and working on the sampled data. These considerations point to the possibility of having only portions of the data and instruction memories active during the different phases. For instance, a hybrid memory architecture combining classic 6T SRAM cells with more reliable SRAM cells (8T, 10T, standard cell memories) could achieve this behaviour and at the same time offer reliable operation at a lower supply voltage. This concept is depicted in Figure 10, where the data memory is divided into two portions, namely 6T and 8T (made of 6T-SRAM and 8T-SRAM cells, respectively), and, depending on the phase (data collection or computation), only the more reliable portion of the memory is kept active.

Figure 10: Hybrid memory architecture

Such an architecture benefits greatly from the varying workload and memory footprint requirements of biomedical processing, adapting reliably to different operating points. When only the more reliable portion of the memory is active, more aggressive voltage scaling can be applied to the digital node, achieving higher energy efficiency at the cost of an area penalty.

Design Challenge #4: Architecture support to process variability

Operating at low voltages exacerbates the effects of both systematic and random variations, which are already significant issues in today's advanced process technologies. Performance uncertainty due to global process variation alone increases from 1.3x at nominal operating voltage to 5x in the near-threshold region⁵. Moreover, operating in this voltage range also heightens sensitivity to temperature⁶.

In the architecture proposed in PHIDIAS the interconnect represents a single point of failure, and therefore requires mechanisms to tolerate variability at the architectural level. In this paragraph we describe the architectural components that extend the resiliency of the baseline architecture.

• Resilient Interconnect: the resilient interconnect (Task 2.4) consists of a Mesh-of-Trees (MoT) interconnection network, suitable for a multi-core processor where all cores share an L1 tightly-coupled memory. Such an interconnect can tolerate delay variations due to aging or static variations with a small overhead on read/write latency. Variation tolerance is based on reconfigurable pipeline stages. The reconfigurable module is inserted on the paths to the memory banks so that it can compensate for the effect of variations by inserting an extra cycle of latency on the request or response transactions. The module does not increase the latency in normal mode, in which a read/write operation completes in one cycle. The modified reliable interconnect and the pipeline stages are shown in Figure 11.

5 R. G. Dreslinski et al., "Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits", Proceedings of the IEEE, pp. 253-266, Feb. 2010
6 A. Calimera et al., "Reducing leakage power by accounting for temperature inversion dependence in dual-Vt synthesized circuits", Proc. of ISLPED 2008

Figure 11: Reconfigurable pipeline stages (a) and resilient interconnect (b)

• Control Unit: the control unit has a centralized control and coordination role: it detects timing violations due to variations and reconfigures the resilient interconnect (D4.1). Figure 12 shows a schematic view of the interaction between this unit and the other architectural elements.

Figure 12: Control Unit

• Testing Units: a testing unit implements two similar state machines, which generate the data and addresses for the write and read transactions and verify the incoming data from the memories. During the Detection Phase (D4.1) these modules are in charge of detecting timing failures.
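The cooperation between the three components listed above can be sketched as a toy behavioural model. Everything specific in it is an assumption made for illustration: the test patterns, the modelling of a timing failure as a flipped bit, and the names `BankPath` and `detection_phase` are hypothetical and are not taken from D4.1.

```python
class BankPath:
    """Toy model of one path to a memory bank with a reconfigurable pipeline stage."""

    def __init__(self, slow=False):
        self.slow = slow             # True if variation breaks single-cycle timing
        self.extra_stage = False     # reconfigurable stage, bypassed in normal mode
        self.mem = {}

    def latency(self):
        """Access latency in cycles: 1 in normal mode, 2 with the extra stage."""
        return 2 if self.extra_stage else 1

    def write(self, addr, data):
        self.mem[addr] = data

    def read(self, addr):
        data = self.mem.get(addr, 0)
        if self.slow and not self.extra_stage:
            return data ^ 0x1        # assumed failure model: one corrupted bit
        return data

def detection_phase(paths, patterns=(0x5555, 0xAAAA)):
    """Testing-unit role: write known patterns and verify the read-back data.
    Control-unit role: enable the extra stage on every path that fails."""
    for path in paths:
        for addr, pattern in enumerate(patterns):
            path.write(addr, pattern)
            if path.read(addr) != pattern:
                path.extra_stage = True
                break
    return [path.latency() for path in paths]

paths = [BankPath(), BankPath(slow=True), BankPath()]
print(detection_phase(paths))
```

Only the failing path pays the extra cycle, which mirrors the stated goal of tolerating variations with a small overhead on read/write latency.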


6. Integrating Technology  

With continued technology scaling, interconnect has emerged as a dominant source of circuit delay and power consumption. In many application domains, especially those strongly constrained in terms of energy efficiency, reducing interconnect delay and the related power consumption is of primary importance for IC design. In the recent past, 3D ICs emerged as an attractive solution to these problems [1]. TSV-based 3D integration has the potential to offer the greatest vertical interconnect density, and is therefore one of the most promising approaches in terms of both performance and energy efficiency. Although 3D integration technology can offer many advantages, such as global interconnect reduction, large footprint, high bandwidth and heterogeneous integration [2], it introduces overhead from the extra bonding steps and the extra area for the TSVs penetrating through the dies. Moreover, 3D integration technologies, due to their novelty, are not yet mature from the industrialization point of view, especially in terms of reliability, testing, automation of the design flows and stress on the integrated devices [3].

Recently, 2.5D technology has been extensively researched and implemented as an alternative to 3D technology [4]. Like a 3D IC design, a 2.5D design can be partitioned into different system components that are prefabricated separately as different tiers before assembly. However, these tiers are not directly stacked as in a TSV-based 3D IC chip. Instead, the tiers in a 2.5D design are spread on a substrate or an interposer, and the inter-tier connections are implemented via the substrate. Like 3D technology, 2.5D technology provides advantages such as low interconnect delay, high bandwidth and heterogeneous technology integration. In addition, the 2.5D scheme mitigates the area overhead of TSVs, because TSVs penetrating through different tiers are avoided. Moreover, 2.5D technology can help reduce the high temperature and the bonding cost of a 3D design. An extra interposer layer, however, is introduced for interconnect in a 2.5D design.

Table 1 provides a quantitative comparison of the performance improvement provided by 2.5D and 3D technologies with respect to standard 2D integration [5], while Table 2 provides a qualitative evaluation of the challenges of 3D integration with respect to 2.5D integration [3]. Although 3D integration technology provides slightly better performance than 2.5D as far as the average wire length is concerned, it performs worse in terms of silicon area and sensitivity to temperature. Moreover, unlike 3D technologies, the only challenging point for 2.5D integration technology is testing, which can be considered negligible in the context of the PHIDIAS project due to the intrinsically simple nature of the components integrated on the target platform.

                3D         2.5D
Wire Length     47.64%     40.70%
Chip Area       3.39%      8.24%
Peak T          -3.64%     -0.64%

                3D               2.5D
Design Flow     New Co-Design    Evolutionary
Testing         New Methods      Evolutionary
Cost            High             65nm Interposer
Thermal         Challenging      Evolutionary
Device Impact   Stress           None
Reliability     Challenging      Evolutionary

From the industrial viewpoint, many alternative 2.5D/3D technologies have been proposed. Stacked Chip Scale Packaging (SCSP) by Amkor Technology [6] is one such example: it provides several different 2.5D/3D options for integrating different dies in a package and delivering power and signals to them. One interesting feature of this packaging is the coexistence of TSVs and wire-bonds (advanced multi-tier wire-bonding) at fine pitches. This makes it possible to use wire-bonds as well as TSVs for power or signal delivery. On the other hand, technologies such as TSV Silicon Interposer (TSI) [7] and wafer reconstitution [8] provide more flexibility in hybrid 2.5D/3D stacking. TSIs enable stacking of different dies on both of their sides, to make better use of space and to facilitate heat transfer from high-power chips, while wafer reconstitution provides electrical connections from the chip pads to the interconnects by means of an artificial wafer. Redistributed Chip Packaging (RCP) [8], provided by Freescale Semiconductor, offers scalable chip-scale packaging and multi-die heterogeneous integration. The building blocks of an RCP package can include embedded dies and memory blocks, standard Surface Mount Technology (SMT) packages, and discrete Surface Mount Devices (SMDs), stacked and connected using wire-bonds, side metal interconnects, TSVs and micro-bumps. In addition, Package-on-Package stacking is supported in RCP by means of Through-Package Vias. All these new technologies support 2.5D and 3D integration, and allow the design of flexible Systems in Package.

In the context of the PHIDIAS project, a 2.5D technology based on silicon interposers has been selected to integrate the heterogeneous components of the system, including analogue and digital blocks. This choice has mainly been driven by the low bandwidth required for communication between the two parts, by the small number of required connections, and by the low-cost and high-reliability requirements of the applications proposed in the context of the PHIDIAS project. The energy, delay, area and cost models of the integrating technologies will be derived from the industrial solutions described above, and used in the PHIDIAS project to analyse and evaluate them and to select the most suitable one for system integration.

7. Software Stack

The ultimate goal of PHIDIAS is to enable ULP biosensor nodes. In the software layer, this will be achieved by using digital CS and by performing bio-signal analysis tasks directly on the sensing node. In addition, the project will explore run-time power management APIs that further reduce energy consumption by exploiting the HW heterogeneity and the variation in SW computational requirements.

CS Software Layer

The devised target digital platform can be exploited to support a number of optimizations of the digital CS execution. The parallelism intrinsic in the application maps efficiently onto the target multi-core architecture. Moreover, multiple accesses by different cores to the instruction or data memory, as required by the application, are coalesced into a single operation by broadcasting the memory contents to the cores, thus increasing energy efficiency. To maximize the benefits of broadcasting, cores can be kept in lock-step whenever possible by employing a light-weight synchronization mechanism. The distinct computational phases typical of the digital CS pattern can exploit the hybrid memory topology to reduce the power consumption of the memory subsystem without compromising overall CS performance.
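The interplay between application parallelism and broadcast can be illustrated with a minimal sketch. It assumes, purely for illustration, that digital CS compresses each window as y = Φx with a sparse binary sensing matrix Φ (so only additions are needed) and that each bio-signal lead is processed by its own core; neither assumption is fixed by this specification. The function names and the idealized fetch-counting model are hypothetical.

```python
import random


def make_sparse_binary_matrix(m, n, ones_per_column, seed=0):
    """Describe a sparse binary m x n matrix by listing, for each of the
    n columns, the rows that hold a 1 (assumed matrix structure)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(m), ones_per_column)) for _ in range(n)]


def cs_compress(window, columns, m):
    """Compute y = Phi @ x for one window using additions only."""
    y = [0] * m
    for sample, rows in zip(window, columns):
        for r in rows:
            y[r] += sample
    return y


def compress_all_leads(leads, columns, m):
    """Compress every lead with the same sensing matrix.

    Leads are independent, so each one could run on a separate core.
    If the cores run in lock-step, they all need the same column of Phi
    at the same time, so one broadcast fetch could serve every core.
    The two counters below model that idealized case only.
    """
    outputs = [cs_compress(lead, columns, m) for lead in leads]
    fetches_without_broadcast = len(leads) * len(columns)
    fetches_with_broadcast = len(columns)
    return outputs, fetches_without_broadcast, fetches_with_broadcast
```

Under these assumptions, the number of sensing-matrix fetches with broadcast is independent of the number of cores, which is where the energy saving would come from; cores that drift out of lock-step would lose part of that benefit.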

Application Layer

Digital CS is the main target application for the low-power digital platform. Nonetheless, the scope of the scientific investigation will extend to diverse application scenarios in the bio-signal compression and analysis domain. In this context, applications in both the time domain and the CS domain are being considered. In the time domain, embedded digital signal processing that performs filtering and feature extraction on different bio-signal modalities is of particular interest. In the CS domain, embedded classification is being studied.

Run-Time Layer

To further reduce the energy consumption of the proposed ULP processor running digital CS routines and analysis tasks, a set of run-time (event-triggered) APIs supporting platform-level power management will be evaluated and designed in the PHIDIAS project. As the PHIDIAS ULP device is memory constrained, the run-time APIs will be designed to limit their memory footprint. In addition, as PHIDIAS does not contain dedicated master cores, the run-time can take advantage of the shared L1 memory, and a distributed design solution will be considered.
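As a rough illustration of what an event-triggered, master-less run-time could look like, the sketch below keeps a single handler table (standing in for state held in the shared L1 memory) and lets any core service any event, returning to sleep afterwards. The class name, method names, event name and the two power states are all hypothetical; the actual API is still to be designed within the project.

```python
class PowerRuntime:
    """Hypothetical event-triggered run-time without a master core."""

    def __init__(self, num_cores):
        # One small table shared by all cores (models shared L1 state).
        self.handlers = {}
        # Every core starts in a low-power state.
        self.core_state = ["sleep"] * num_cores

    def on(self, event, handler):
        """Register the handler to run when `event` is raised."""
        self.handlers[event] = handler

    def raise_event(self, event, core_id, payload=None):
        """Wake `core_id`, run the handler for `event`, then sleep again.

        Any core may service any event, so no dedicated master is needed.
        Events without a registered handler are ignored.
        """
        handler = self.handlers.get(event)
        if handler is None:
            return None
        self.core_state[core_id] = "active"
        try:
            return handler(core_id, payload)
        finally:
            self.core_state[core_id] = "sleep"


# Usage: core 2 is woken by a (hypothetical) "window_ready" event.
runtime = PowerRuntime(num_cores=4)
runtime.on("window_ready", lambda core_id, samples: sum(samples))
result = runtime.raise_event("window_ready", core_id=2, payload=[1, 2, 3])
```

Keeping the run-time state to one table and one state entry per core reflects the memory-footprint goal stated above; a real implementation would also need to arbitrate concurrent access to the shared table.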

8. Conclusions

This deliverable describes the foundation on which the PHIDIAS project will build. Horizontally, the PHIDIAS architecture comprises the main HW components that, starting from an analog biological signal, enable its energy-efficient and reliable digital computation; vertically, it explores the entire software stack, composed of the CS and bio-signal analysis applications as well as the power-management APIs.

[1] Y. Xie, J. Cong, and S. Sachin, 3D IC Design: EDA, Design and Microarchitectures, Springer, 2010.

[2] Y. Xie, G. Loh, B. Black, and K. Bernstein, "Design Space Exploration for 3D Architecture," ACM Journal of Emerging Technologies for Computer Systems, Vol. 2, No. 2, pp. 65-103, April 2006.

[3] I. Bolsens, "2.5D ICs: Just a Stepping Stone or a Long Term Alternative to 3D?", Available: http://www.xilinx.com/innovation/research-labs/keynotes/3-D_Architectures.pdf

[4] Y. Deng and W. P. Maly, "Interconnect characteristics of 2.5-D system integration scheme," in International Symposium on Physical Design, pp. 171-175, 2001.

[5] C. Zhang and G. Sun, "Fabrication cost analysis for 2D, 2.5D, and 3D IC designs," 2011 IEEE International 3D Systems Integration Conference (3DIC), pp. 1-4, Jan. 31-Feb. 2, 2012.

[6] A. Syed. (2012, Dec.) Emerging IC packaging technologies. [Online]. Available: http://www.smta.org/chapters/files/Arizona-Sonora_Amkor_SMTA_AZ_Expo_2012Dec4.pdf

[7] A-Star-IME. (2010, Nov.) TSV silicon interposer for high IO applications. [Online]. Available: http://www.ime.a-star.edu.sg/uploadfiles/3 Proposal-TSV-Interposer.pdf

[8] Freescale-Semiconductor. (2013, Jan.) Freescale's redistributed chip packaging. [Online]. Available: http://www.freescale.com/files/shared/doc/reports_presentations/RCPPRESENTATION.pdf