Track G: An Innovative Multicore System Architecture for Wireless SoCs
Alon Yaakov, DSP Architecture Manager, CEVA
May 1, 2013

TRANSCRIPT

1. An Innovative Multicore System Architecture for Wireless SoCs
   Alon Yaakov, DSP Architecture Manager, CEVA

2. Multicore in Embedded System - Defining the Problem
   - Control-plane
     > Synchronization between cores
     > Semaphores
     > Message passing using a mailbox mechanism
     > Snooping mechanism
     > Interrupt handling
   - Data-plane (this will be the focus of today's presentation)
     > Equalization
     > Antenna processing
     > Error correction

3. Outline
   - Multicore Challenges
   - The CEVA-XC Solution

4. Multicore Challenges
   1. Partitioning
      > Task partitioning onto different chip resources
      > Data partitioning onto different chip resources
   2. Resource sharing
      > Memories, buses, system I/Fs, peripherals, etc.
   3. Scheduling
      > Allocating tasks/data
   4. Data sharing
      > Transferring data between engines
   (Diagram: an application mapped onto DSP A, DSP B, DSP C and the CTC, MLD and FFT blocks)

5. Challenge 1: Task Partitioning
   - Tasks
     > Parts of an algorithm running in a sequential order
     > A task must have a defined input and output data structure (packets); a minimal C sketch of such a task descriptor follows after slide 12
   (Diagram: the task/data chain - antenna processing (FFT, channel estimation), equalization (MLD, reordering) and error correction (interleaver, CTC, concatenation & CRC checker))

6. Challenge 1: Task Partitioning - HW Offloading
   - Parts of the algorithm are more suited for HW acceleration
     > Well-known algorithms that require little programmability
     > Heavy computational effort
   (Diagram: the same task chain with the FFT, MLD, interleaver, CTC and CRC blocks marked for offloading)

7. Challenge 1: Data Partitioning
   - Several cores are used to process different input data packets
     > Suitable for homogeneous systems
   - Shared memory is used for storing history data
     > A core must wait for the data to update before using it - latency
   - The entire program code is used by all cores
     > A core suffers stall cycles if L1 memory is small

8. Challenge 1: Partitioning - OK, Now What?
   - Efficient partitioning is dependent on the hardware platform
   - Building the optimal system depends on the partitioning
   - There is no single optimal solution; each approach has its merits
   - Partitioning can be eased by starting with a reference that can be used as a basis

9. Challenge 2: Resource Sharing
   - Resource types
     > DSP cores
     > HW accelerators
     > Memories
     > Buses
     > DMA
   - Resource sharing creates contention
   (Diagram: several requesters contending for a shared memory)

10. Challenge 2: Resource Sharing - Avoiding Contentions
    - If possible, avoid contentions by duplicating HW
      > Multiple DMAs
      > Duplicated HW accelerators
      > Multilayer bus
      > Partition memory into blocks, enabling concurrent access
    - Throughput and latency govern the minimum amount of hardware resources
    (Diagram: memory split into several independently accessible blocks)

11. Challenge 2: Resource Sharing - Arbitration
    - When a simple set of known rules can be defined, a resource can be shared using a HW arbiter
    - QoS
      > Priority
      > Bandwidth allocation (weight)
      > Well-known algorithms (round robin); a toy weighted round-robin model follows after slide 12
    - Arbitration is based on time-sharing of resources - scheduling
    (Diagram: an arbiter in front of a shared memory)

12. Challenge 3: Scheduling
    - How do we assign and schedule tasks to cores?
    (Diagram: the task chain from slide 5 next to DSP A, DSP B, DSP C and the CTC, MLD and FFT resources)
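The point in slide 5 that a task needs a defined input and output data structure (packets) can be made concrete with a short C sketch. Everything below (the packet_t and task_t types, their fields, and the ping-pong buffering) is an illustrative assumption, not a CEVA data structure.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical packet and task descriptors: each task consumes one
 * well-defined input packet and produces one output packet, so the
 * chain FFT -> MLD -> CTC can be cut between any two tasks. */
typedef struct {
    uint32_t  flow_id;    /* which data-flow channel the packet belongs to */
    size_t    length;     /* payload size in bytes */
    void     *payload;    /* points into L1/L2 memory or DDR */
} packet_t;

typedef struct task {
    const char *name;                                /* e.g. "FFT", "MLD", "CTC" */
    void (*run)(const packet_t *in, packet_t *out);  /* the kernel itself */
    struct task *next;                               /* successor in the chain */
} task_t;

/* Run a statically partitioned chain on one core: each task's output
 * packet becomes the next task's input packet. */
static void run_chain(task_t *first, packet_t *in, packet_t *scratch)
{
    for (task_t *t = first; t != NULL; t = t->next) {
        t->run(in, scratch);
        packet_t *tmp = in; in = scratch; scratch = tmp;  /* ping-pong buffers */
    }
}
```

With tasks expressed this way, cutting the chain between any two tasks and moving the downstream half to another core or a HW accelerator only requires moving packets, which is what makes the partitioning discussion in slides 5-7 tractable.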
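Slide 11 lists priority, bandwidth weights and round robin as typical arbitration rules. The toy model below shows one way a weighted round-robin grant could be expressed in C, purely to make the time-sharing idea concrete; a real arbiter is hardware, and the credit scheme here is an assumption.

```c
#include <stdint.h>

/* Toy model of a weighted round-robin arbiter for a shared memory port:
 * each requester gets a weight (its bandwidth share per round); the grant
 * rotates, skipping requesters that are idle or out of credit. */
#define NUM_REQ 4

typedef struct {
    uint8_t weight[NUM_REQ];   /* credits granted per round */
    uint8_t credit[NUM_REQ];   /* credits left in the current round */
    uint8_t last;              /* last granted requester */
} arbiter_t;

/* 'req' is a bitmask of requesters asking for the port this cycle.
 * Returns the granted requester index, or -1 if nobody is requesting. */
static int arbiter_grant(arbiter_t *a, uint8_t req)
{
    for (int i = 1; i <= NUM_REQ; i++) {
        int r = (a->last + i) % NUM_REQ;
        if ((req & (1u << r)) && a->credit[r] > 0) {
            a->credit[r]--;
            a->last = (uint8_t)r;
            return r;
        }
    }
    /* no requesting master had credit left: start a new round and retry */
    for (int r = 0; r < NUM_REQ; r++)
        a->credit[r] = a->weight[r];
    for (int i = 1; i <= NUM_REQ; i++) {
        int r = (a->last + i) % NUM_REQ;
        if (req & (1u << r)) {
            a->credit[r]--;
            a->last = (uint8_t)r;
            return r;
        }
    }
    return -1;
}
```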
13. Challenge 3: Scheduling - Static Scheduling
    - Tasks are statically assigned to DSP cores
    - The design phase includes task scheduling
    - The data flow is fixed
    - Suitable when the load on each task is fixed
    (Diagram: the task chain statically mapped onto DSP A with the FFT HW core, DSP B with the MLD HW core and DSP C with the CTC HW core)

14. Challenge 3: Scheduling - Dynamic Scheduling
    > A scheduler dynamically assigns tasks to cores
    > The scheduler algorithm selects the best core to execute the task, based on
      > Processing capabilities
      > Locality of data
      > Load balance
      (a core-selection sketch follows after slide 22)
    > Suitable for complex designs with variable processing load and QoS
    (Diagram: a master scheduler dispatching the task chain to DSP A, DSP B, DSP C and the CTC, MLD and FFT accelerators)

15. Challenge 4: Data Sharing - Memory Hierarchy
    - Internal L1 memory
      > Fast memory with no access penalty
      > Small / medium size
      > Dedicated per core
    - External memory
      > Can be on-chip (L2) or off-chip (e.g. DDR)
      > Slow memory with access penalty
      > Large size
      > Shared among several cores - contentions

16. Challenge 4: Data Sharing - Using Cache
    - When shared data is used, a cache system can be used to reduce the stall count
      > Statistically reduces memory stalls, but not deterministic
    - Useful only for accessing narrow data widths
      > Cache should be used for control data
      > Not recommended for vector DSP data flow - it requires large caches and still costs many stall cycles
    - So how do we share vector data?

17. Challenge 4: Data Sharing - Pre-Fetching Data
    - A task cannot start until its preceding task completes
    - If we can schedule the next task to be executed, we can pre-fetch its input data
      > Using static scheduling, the data flow is known
      > Using dynamic scheduling, the scheduler must handle the data move prior to activating a task
    (Diagram: DMA engines moving data between every pair of adjacent tasks in the chain)

18. Challenge 4: Data Sharing - Pre-Fetching Using DMA
    - A DMA transfer must wait for the following conditions:
      > Source data is available
      > Destination data can be written (i.e. the allocated memory is free)
    - DMA activation schemes
      > Real-time SW: programmable, large MIPS overhead
      > HW system events: not programmable
      > Queue manager: programmable, no MIPS overhead

19. Challenge 4: Data Sharing - Pre-Fetching Using DMA with a Queue Manager
    - A queue is a list of tasks handled in a FIFO manner
    - Each DMA queue contains all DMA tasks related to one data-flow channel
    - DMA tasks are pushed to the queue by
      > DSP software (i.e. static scheduling)
      > The system scheduler (i.e. dynamic scheduling)
    - Tasks are automatically activated using HW or SW events
      > Source data is available & destination memory is free
      (a DMA queue sketch follows after slide 22)
    (Diagram: a DMA queue feeding the FFT / channel estimation task)

20. Outline
    - Multicore Challenges
    - The CEVA-XC Solution

21. CEVA-XC4000 Multicore Solution
    (Diagram: CEVA-XC4000 cluster block diagram with an optional cache controller and an ACE port)

22. MUST (Multi-core System Technology) - Overview
    - Fully featured data cache
      > Non-blocking, software operations, write-back & write-through
      > Advanced support for cache coherency, based on ARM's leading AMBA-4 ACE technology
    - Advanced system interconnect
      > AXI-4 - easy system integration and high Quality of Service (QoS)
      > Multi-layer FIC (Fast Inter-Connect) - low latency, high throughput master and slave ports
    - Multi-level memory architecture using local TCMs and a hierarchy of caches
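Slide 14 says the scheduler picks the best core based on processing capabilities, locality of data and load balance. The sketch below turns that into a simple cost function; the structures, field names and the locality penalty are assumptions made for illustration, not the actual CEVA scheduler.

```c
#include <stdint.h>

/* Sketch of the core-selection step of a dynamic scheduler: score each
 * candidate core by its queued load, with a penalty when the task's
 * input data does not already sit in that core's local memory. */
#define NUM_CORES 3

typedef struct {
    uint32_t pending_cycles;   /* estimated work already queued on the core */
    uint32_t can_run_mask;     /* bitmask of task types this core can execute */
} core_state_t;

typedef struct {
    uint32_t type_bit;         /* which task type this is (one-hot) */
    int      data_home_core;   /* core whose L1/TCM already holds the input, -1 if none */
    uint32_t est_cycles;       /* estimated cost of the task */
} task_req_t;

static int pick_core(const core_state_t cores[NUM_CORES], const task_req_t *t)
{
    int best = -1;
    uint32_t best_cost = UINT32_MAX;
    for (int c = 0; c < NUM_CORES; c++) {
        if (!(cores[c].can_run_mask & t->type_bit))
            continue;                                   /* processing capability */
        uint32_t cost = cores[c].pending_cycles;        /* load balance */
        if (c != t->data_home_core)
            cost += t->est_cycles / 4;                  /* penalty for moving data */
        if (cost < best_cost) { best_cost = cost; best = c; }
    }
    return best;  /* -1 means no capable core; queue the task for later */
}
```

Static scheduling, by contrast, would replace this function with a fixed task-to-core table decided at design time, as described in slide 13.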
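Slides 18 and 19 describe DMA pre-fetching driven by a queue manager: a FIFO of DMA tasks per data-flow channel, each fired only when its source data is available and its destination memory is free. The C sketch below models that behaviour in software; the types, the dma_start() stand-in and the event-driven kick function are illustrative assumptions, not the real queue-manager interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* A FIFO of DMA descriptors for one data-flow channel. A descriptor only
 * fires when its source buffer is full and its destination buffer is free. */
#define QUEUE_DEPTH 8

typedef struct {
    volatile bool *src_ready;   /* set by the producer task / HW event */
    volatile bool *dst_free;    /* cleared while the consumer still owns the buffer */
    const void    *src;
    void          *dst;
    uint32_t       bytes;
} dma_task_t;

typedef struct {
    dma_task_t tasks[QUEUE_DEPTH];
    uint32_t   head, tail;      /* free-running FIFO indices */
} dma_queue_t;

static bool dma_queue_push(dma_queue_t *q, dma_task_t t)
{
    if (q->tail - q->head == QUEUE_DEPTH)
        return false;                       /* queue full */
    q->tasks[q->tail % QUEUE_DEPTH] = t;
    q->tail++;
    return true;
}

/* Called on a HW or SW event: start the oldest task whose buffer
 * conditions are met. dma_start() stands in for programming a real DMA. */
extern void dma_start(const void *src, void *dst, uint32_t bytes);

static void dma_queue_kick(dma_queue_t *q)
{
    if (q->head == q->tail)
        return;                              /* nothing queued */
    dma_task_t *t = &q->tasks[q->head % QUEUE_DEPTH];
    if (*t->src_ready && *t->dst_free) {
        dma_start(t->src, t->dst, t->bytes);
        *t->dst_free = false;                /* destination now in flight */
        q->head++;
    }
}
```

dma_queue_push() corresponds to the DSP software or system scheduler pushing tasks, and dma_queue_kick() corresponds to the HW/SW event that activates the head-of-line task once both buffer conditions hold.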
23. MUST (Multi-core System Technology) - Cont.
    - Data Traffic Manager
      > Automated data traffic management without DSP intervention
    - Comprehensive software development support
      > Advanced multicore debug and profiling
      > Complete system emulation with real hardware
      > Hardware abstraction layer (HAL) including drivers and system APIs
    - Support for homogeneous and heterogeneous clusters of multiple DSPs and CPUs
    - Support for advanced resource management and sharing
    - Flexible task scheduling for different system architectures: dynamic, event-based, data-driven, etc.

24. Cache Coherency Support
    > Allows multiple cores to use shared memory without any software intervention
      > Superior performance to SW coherency
    > Simplifies software development
      > Easy SW partitioning and scaling from single core to multi-core
    > External memory can be dynamically partitioned into shared and unique areas
      > Minimizes system memory size
      > Flexible memory allocation speeds up SW development
      > Snooping is only applied to shared areas

25. Data Traffic Management
    - The Data Traffic Manager is based on Queue Manager and Buffer Manager structures
      > Queue Manager - maintains multiple independent queues of tasks
      > Buffer Manager - autonomously tracks the data status of source and destination buffers
      (a buffer-manager style sketch follows after slide 29)
    - Data transfers are automatically managed based on task status and the load of the input and output data buffers
    - Automatic data traffic management and DSP offloading
      > Prioritized scheduling for guaranteed QoS
      > Low-latency packet transfers without software intervention
    - Results in lower memory consumption and improved system performance

26. Data Traffic Manager
    - Allows sharing a resource among multiple cores via a shared queue
      > Tasks are executed based on priority and buffer status
      > Prevents starvation and deadlocks
    - Allows a single core to work with multiple queues
    - Each core reads from / writes to its own buffers (which can be local or external)
    - All data transfers between cores and accelerators are performed automatically via the data traffic manager

27. Dynamic Scheduling
    - Dynamic scheduling in symmetric systems
      > A clustered system based on homogeneous DSP cores
      > Dynamic task allocation to DSP cores at runtime
      > Flexible choice of algorithms based on system load
      > Hardware abstraction using task-oriented APIs
      > Shared external memories
      > FIC interface for low-latency, high-bandwidth data accesses
    - Commonly used in wireless infrastructure applications

28. MUST Hardware Abstraction Layer (HAL)
    - MUST is assisted by user-friendly software support
      > Abstracts the queues, buffers, DMA and caches (a hypothetical usage sketch follows after slide 29)
    - The software package includes:
      > Drivers and APIs
      > Full system profiling
      > A graphical interface via CEVA ToolBox

29. Multicore Modeling and Simulation
    - Simulation of any number of cores
      > Support for symmetric and asymmetric configurations
    - Support for ARM CADI (Cycle Accurate Debug Interface)
      > Including connectivity to ARM's RealView debugger
    - Comprehensive multicore simulation support
      > Synchronization, system browsing, shared memories, interconnect, accelerator simulation, cross-triggering, AXI / FIC I/F
    - Support for user-defined components
    - ESL tools integration with full debug capabilities
      > Compliant with TLM 2.0
      > Full support for Carbon and Synopsys
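Slides 25 and 26 describe the Buffer Manager tracking source and destination buffer status and the traffic manager serving transfers by priority. The sketch below is a minimal software model of that policy; the buffer_t/channel_t types and the eligibility rule are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each buffer tracks how much valid data it holds; a transfer between two
 * buffers is eligible only when the source has data and the destination
 * has room. Among eligible channels, the highest priority is served first. */
typedef struct {
    uint32_t fill;       /* bytes currently valid in the buffer */
    uint32_t capacity;   /* total size of the buffer */
} buffer_t;

typedef struct {
    buffer_t *src;
    buffer_t *dst;
    uint32_t  chunk;     /* transfer granularity in bytes */
    uint8_t   priority;  /* 0 = highest */
} channel_t;

static bool channel_eligible(const channel_t *ch)
{
    return ch->src->fill >= ch->chunk &&
           ch->dst->capacity - ch->dst->fill >= ch->chunk;
}

/* One arbitration step of the traffic manager: pick the highest-priority
 * eligible channel and account for its transfer. */
static int traffic_manager_step(channel_t *ch, int nch)
{
    int pick = -1;
    for (int i = 0; i < nch; i++)
        if (channel_eligible(&ch[i]) &&
            (pick < 0 || ch[i].priority < ch[pick].priority))
            pick = i;
    if (pick >= 0) {
        ch[pick].src->fill -= ch[pick].chunk;   /* data drained from source */
        ch[pick].dst->fill += ch[pick].chunk;   /* data landed at destination */
    }
    return pick;  /* index served, or -1 if nothing was eligible */
}
```

Serving only eligible channels is what prevents the starvation and deadlock cases slide 26 mentions, since a stalled consumer simply leaves its channel ineligible rather than blocking the shared queue.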
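Slide 28 says the HAL abstracts queues, buffers, DMA and caches behind drivers and APIs. The fragment below shows what application code might look like against such a task-oriented API. All of the hal_* names and signatures are hypothetical, invented for this sketch; they are not CEVA's actual HAL.

```c
/* Hypothetical use of a task-oriented HAL of the kind slides 27-28
 * describe: application code stays free of queue/DMA/cache details. */
#include <stdint.h>

typedef int hal_queue_t;
typedef int hal_task_t;

/* Assumed HAL entry points (drivers + APIs layer); purely illustrative. */
extern hal_queue_t hal_queue_open(const char *name);
extern hal_task_t  hal_task_create(const char *kernel, uint32_t prio);
extern int         hal_task_submit(hal_queue_t q, hal_task_t t,
                                   const void *in, uint32_t in_len,
                                   void *out, uint32_t out_len);
extern int         hal_task_wait(hal_task_t t);

/* Application code: submit an equalization kernel and wait for the result.
 * The HAL decides which core runs it and performs the data moves. */
static int equalize_block(const void *grid, uint32_t grid_len,
                          void *symbols, uint32_t symbols_len)
{
    hal_queue_t q = hal_queue_open("equalizer");
    hal_task_t  t = hal_task_create("mld_equalize", /*prio=*/1);
    if (hal_task_submit(q, t, grid, grid_len, symbols, symbols_len) != 0)
        return -1;
    return hal_task_wait(t);   /* 0 on success */
}
```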
30. Co-processor Portfolio for Wireless Modems
    - A wide range of co-processors offering power-efficient DSP offloading at extreme processing rates
    - A complete wireless platform addressing all major modem PHY processing requirements
      > Offering flexible hardware-software partitioning
      > Customers can focus on differentiation via DSP software
    - Unique automated data traffic management between DSP memory and hardware accelerators
      > Allows fully parallel processing support
      > Based on the data traffic manager

31. Co-processor Portfolio for Wireless Modems - Cont.
    - Optimized tightly coupled extensions (TCE)
      > MLD: Maximum Likelihood MIMO detectors; supports up to four MIMO layers; achieves near-ML performance
      > De-spreader: 3G de-spreader units; supports all WCDMA and HSDPA channels; scalable up to 3GPP HSPA+ Rel-11
      > DFT / FFT: supports multi-radix DFTs; includes NCO correction
      > Viterbi: programmable K and r values; supports tail biting
      > LLR processing and HARQ combining: supports LTE de-rate matching; significantly reduces HARQ memory buffer sizes
    - Dramatically reduces time-to-market

32. Putting It All Together - A Cluster of Four CEVA-XC DSPs
    > Processor level
      > Fixed-point and floating-point vector DSPs
      > Running at over 1 GHz
      > Data cache
    > Platform level
      > Complete set of tightly coupled co-processor units
      > Automated DSP offloading using data traffic management
    > System level
      > Full cache coherency support
      > AMBA-4 and FIC system interfaces
    > Software development support
      > HAL using drivers and APIs
      > Comprehensive system debug & profiling

33. Thank You