INSIDE

Accelerated System Performance with APU-Enhanced Processing

The New Virtex-4 FPGA Family with Immersed PowerPC

Embedded Processing with Virtex and Spartan FPGAs

High-Bandwidth TCP/IP PowerPC Applications

Scalable Software-Defined Radio

Emulate 8051 Microprocessor in PicoBlaze IP Core

Optimize MicroBlaze Processors for Consumer Electronics Products

Xilinx Embedded Processing with Optos

Issue 1, March 2005

Embedded magazine

EMBEDDED SOLUTIONS FOR PROGRAMMABLE PROCESSING DESIGNS

CONTENTS

EMBEDDED MAGAZINE ISSUE 1, MARCH 2005

Introducing the New Virtex-4 FPGA Family

Accelerated System Performance with APU-Enhanced Processing

Sign of the Times

Xilinx Teams with Optos

What Customers and Partners are Saying about Xilinx Embedded Processor Solutions

Considerations for High-Bandwidth TCP/IP PowerPC Applications

Configure and Build the Embedded Nucleus PLUS RTOS Using Xilinx EDK

MicroBlaze and PowerPC Cores as Hardware Test Generators

Nohau Shortens Debugging Time for MicroBlaze and Virtex-II Pro PowerPC Users

A Scalable Software-Defined Radio Development System

Nucleus for Xilinx FPGAs – A New Platform for Embedded System Design

Emulate 8051 Microprocessor in PicoBlaze IP Core

Optimize MicroBlaze Processors for Consumer Electronics Products

Use an RTOS on Your Next MicroBlaze-Based Project

UltraController-II Reference Design

Welcome to the inaugural edition of the new Xilinx Embedded Magazine and the Embedded Systems Conference 2005. In this, our first edition of Embedded Magazine, we have assembled a host of articles representing a wide range of embedded processing applications. During the last few years, Xilinx®, our partners, and our customers have developed and shared a vision to build and assemble all of the elements required for a complete and robust range of embedded processing solutions for FPGA technologies.

Included in this magazine and at the Embedded Systems Conference 2005 are examples and demonstrations of state-of-the-art commercial applications, real-time operating systems, multi-processor debugging environments, testing of complex hardware modules, and high-speed Internet communication protocols.

Last year, we made significant strides in the embedded processing arena. In July, we announced the formation of the Embedded Processing Division, reinforcing our commitment to the increasingly diverse and evolving embedded systems market. This division brings talent and technology together in an organization that intends to accelerate development of an even wider range of embedded system solutions, optimizing the full capabilities of our silicon architectures at multiple performance and price points.

In September, Xilinx launched our breakthrough Virtex™-4 family of products, featuring embedded processing solutions with PicoBlaze™ and MicroBlaze™ soft-processor cores and PowerPC™ (available on all Virtex-4 FX devices). The Virtex-4 FX device includes the new PowerPC Auxiliary Processor Unit (APU) controller to easily connect the CPU to the FPGA fabric, enabling the implementation of acceleration hardware for virtually any application. Once only the domain of high-budget ASIC and ASSP design teams, the Virtex-4 FPGA’s architectural ability to combine application-specific hardware acceleration with the high-performance PowerPC and MicroBlaze processors shatters the traditional barriers of cost, time to market, and risk.

In February 2005, our Xilinx Platform Studio (XPS) embedded tool suite received top honors at the inaugural International Engineering Consortium (IEC) DesignVision Awards. Judged by independent industry experts, XPS was selected in the FPGA/PLD Design Tools category for its innovation, uniqueness, market impact, customer benefits, and value to society.

We have an exciting lineup of events at the conference, showcasing both Xilinx and our partners’ latest solutions. The Xilinx booth (#1525) will feature our award-winning Xilinx Platform Studio, as well as the latest Virtex-4 FX solution; our flexible PicoBlaze and MicroBlaze soft processors with Spartan™-3 devices; and our high-performance DSP solutions.

I hope you find the conference and our first edition of Embedded Magazine informative and inspiring. We invite you to unlock the power of Xilinx programmability. The advantages will change the way embedded systems are designed.

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780

© 2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

Mark Aaldering

Vice President, Embedded Processing & IP Divisions

EDITOR IN CHIEF Carlis [email protected]

MANAGING EDITOR Forrest [email protected]

ASSISTANT MANAGING EDITOR Charmaine Cooper Hussain

XCELL ONLINE EDITOR Tom [email protected]

ADVERTISING SALES Dan Teie, 1-800-493-5551

ART DIRECTOR Scott Blair

www.xilinx.com/xcell/embedded


Theatre Presentations in Xilinx Booth #1525

A Complete Spectrum of Programmable Processing Solutions
Tuesday, March 8: 1:30 p.m., 4:30 p.m., 7 p.m.
Wednesday, March 9: 12 p.m., 3 p.m., 5:30 p.m.
Thursday, March 10: 10:30 a.m., 1:30 p.m.

Boot a Live Embedded System in Just 15 Minutes
Tuesday, March 8: 2 p.m., 5 p.m., 7:30 p.m.
Wednesday, March 9: 12:30 p.m., 3:30 p.m., 6 p.m.
Thursday, March 10: 11 a.m.

Breakthrough Performance with Virtex-4
Tuesday, March 8: 2:30 p.m., 5:30 p.m.
Wednesday, March 9: 10:30 a.m., 1 p.m., 4 p.m., 6:30 p.m.
Thursday, March 10: 12 p.m.

Eclipse-Based HW/SW Debug
Tuesday, March 8: 3 p.m., 6 p.m.
Wednesday, March 9: 11 a.m., 1:30 p.m., 4:30 p.m.
Thursday, March 10: 9:30 a.m., 12:30 p.m.

Breakthrough DSP Performance at the Lowest Cost
Tuesday, March 8: 3:30 p.m., 6:30 p.m.
Wednesday, March 9: 11:30 a.m., 2 p.m., 5 p.m.
Thursday, March 10: 10 a.m., 1 p.m.

Xilinx Distributor Presentations
Tuesday, March 8: 4 p.m. – Presenter: Insight/Memec
Wednesday, March 9: 2:30 p.m. – Presenter: Avnet
Thursday, March 10: 11:30 a.m. – Presenter: Nu Horizons

Product Demonstrations in Xilinx Booth #1525

Intelligent Tools Accelerate Development
We will use the Xilinx Platform Studio tool suite to create an embedded hardware/software/IP system and boot a real-time operating system. We will show how Platform Studio accelerates embedded processing development for programmable systems through the IDE, as well as powerful design wizards, BSP, and application code generators.

Low Cost and Flexible Processing
Come see the new Spartan-3 SP305 board implementing complex embedded motor control using the MicroBlaze soft 32-bit processor, a floating-point unit, PID control loops, PWM, DAC, and ADC to control both a brushless DC and a three-phase AC induction motor. We will also show a frequency generator using the PicoBlaze 8-bit processor, capable of generating a frequency range up to 640 MHz incremented at 1 Hz.

High-Performance Processing
Using a Xilinx ML310 Embedded Development Platform board with multi-OS boot (Linux, VxWorks, and QNX), along with the Xilinx Platform Studio (XPS) tool suite, we will implement a dual-processor system with inter-processor communication. The first processor will be used as a compute engine utilizing the FPU. The second processor will be used as a host processor managing data flow and I/O.

Accelerate Programmable Embedded Platform Performance
Accelerate system performance with the Virtex-4 FX family of devices using the Auxiliary Processor Unit (APU) controller. We will introduce the Xilinx ML403 FX Embedded Development Platform board and demonstrate algorithm code acceleration with APU implementation and graphical display of output. We will utilize the Gigabit System Reference Design (GSRD) with TCP/IP acceleration and Linux eOS.

High-Performance DSP Implementation and Embedded Development Kit Integration Using System Generator
This demo shows how you can use System Generator for DSP 6.3 to build DSP microprocessor-based control circuits for high-performance DSP systems. System Generator for DSP and Xilinx DSP IP cores provide unrivaled DSP design solutions for FPGA-based DSP systems.

Hands-On Workshops

Implementing a Virtex-4 FX PowerPC System with Floating-Point Co-Processor Using Platform Studio
Instructor: Glenn Steiner

Following a brief presentation, this workshop will provide hands-on experience implementing a PowerPC processor system demonstrating code acceleration through an FPU co-processor. You will create the design utilizing the award-winning Platform Studio tool suite and target actual hardware. The application will demonstrate code acceleration via the FPU with graphical display of the output.

Location: Room 256
Tuesday, March 8: 10:00 a.m.–11:30 a.m., 2:00 p.m.–3:30 p.m., 4:00 p.m.–5:30 p.m.

Accelerating an IDCT Software Application Using the Configurable MicroBlaze 32-bit RISC Processor
Instructor: Derek Palmer

Utilizing the new Fast Simplex Link (FSL) port on the Xilinx MicroBlaze soft-core processor, we can add dedicated low-latency co-processing engines tuned specifically for any application at hand. We will demonstrate a 10X performance improvement of an inverse discrete cosine transform (IDCT) algorithm, utilizing a Spartan-3 platform for hands-on experience.

Location: Room 256
Wednesday, March 9: 10:00 a.m.–11:30 a.m., 2:00 p.m.–3:30 p.m., 4:00 p.m.–5:30 p.m.

High-Performance DSP on an FPGA Made Easy
Instructor: Sabine Lam

In this DSP workshop, designers will learn how to implement high-performance DSP designs using the new Xilinx Virtex-4 FPGAs. By leveraging tools such as Xilinx System Generator for DSP, IP cores, and hardware, even designers new to FPGAs will realize the capability and ease-of-use of Xilinx DSP design solutions.

Location: Room 256
Thursday, March 10: 9:00 a.m.–10:30 a.m., 11:00 a.m.–12:30 p.m., 1:00 p.m.–2:30 p.m.

Technical Conference Papers by Xilinx

Virtex-4 FX Platform FPGA Extends PowerPC Instructions
Microprocessor Summit
Session MPS-900
Argent Hotel, San Francisco
Monday, March 7: 9:00 a.m.–10:30 a.m.
Presenter: Dan Isaacs, Xilinx, APD Marketing

Approaches in FPGA Design
Easy Paths to Silicon Design Seminar: EPD-664
Tuesday, March 8: 4:00 p.m.–5:30 p.m.
Presenter: Tim Tripp, Xilinx, Advanced System Tools Product Marketing

Switching Fabrics at Layers 1 and 2
Network Systems Design Seminar: NSD-707
Wednesday, March 9: 8:30 a.m.–10:00 a.m.
Presenter: Hamish Fallside, Xilinx, Sr. Manager, Networking Systems Solutions

Hardware/Software Codesign for Platform FPGAs
Embedded Training Program: ETP-423
Thursday, March 10: 11:15 a.m.–12:45 p.m.
Presenter: Don Davis, Xilinx, Sr. Manager, High-Level Tools

Software-Defined Radio with Reconfigurable Hardware and Software
Embedded Training Program: ETP-442
Thursday, March 10: 2:00 p.m.–3:30 p.m.
Presenter: Peter Ryser, Xilinx, Embedded SW Systems Engineer Manager

Other Xilinx-Related Technical Papers

Designing with Embedded Processor Cores in FPGAs
Easy Paths to Silicon Design Seminar: EPD-524
Monday, March 7: 11:00 a.m.–12:30 p.m.
Presenter: Avnet

Designing with IP in FPGAs
Easy Paths to Silicon Design Seminar: EPD-544
Monday, March 7: 1:30 p.m.–3:00 p.m.
Presenter: Avnet

Implementing DSPs in FPGAs
Easy Paths to Silicon Design Seminar: EPD-604
Tuesday, March 8: 8:30 a.m.–10:00 a.m.
Presenter: Avnet

Designing with Soft-Processor Cores in FPGAs, Part 1
Embedded Training Program: ETP-309
Wednesday, March 9: 8:30 a.m.–10:00 a.m.
Presenter: Avnet

Designing with Soft-Processor Cores in FPGAs, Part 2
Embedded Training Program: ETP-329
Wednesday, March 9: 11:00 a.m.–12:30 p.m.
Presenter: Avnet

FPGA Design for Firmware Engineers, Part 1
Embedded Training Program: ETP-343
Wednesday, March 9: 2:00 p.m.–3:30 p.m.
Presenter: Avnet

FPGA Design for Firmware Engineers, Part 2
Embedded Training Program: ETP-363
Wednesday, March 9: 3:45 p.m.–5:15 p.m.
Presenter: Avnet

FPGA Embedded Processors: Revealing True System Performance
Embedded Training Program: ETP-367
Wednesday, March 9: 3:45 p.m.–5:15 p.m.
Presenter: Memec

View from the top

Introducing the New Virtex-4 FPGA Family

The latest FPGAs from Xilinx set new records in capacity, capability, performance, power efficiency, and value.

by Erich Goetting
Vice President & General Manager, Advanced Products Division
Xilinx, [email protected]

Welcome to the Xilinx® Virtex-4™ edition of the Xcell Journal. We’ve created this special issue to show you the new Virtex-4 FPGA family, and how its innovations enable the creation of next-generation systems that do more than ever thought possible only a few years ago.

In this article, I’ll take you behind the scenes for a guided tour of some of the new technologies, as well as a bit of the inspiration and rationale behind them.

With more than 100 innovations, the Virtex-4 family represents a new milestone in the evolution of FPGA technology. After conducting extensive interviews with leading design engineers worldwide, we knew that they wanted the following things in an advanced next-generation FPGA family:

• Higher performance

• Higher logic density

• Lower power

• Lower cost

• More advanced capabilities

It’s relatively easy to deliver on one or two of these items – our challenge was to deliver all of them at the same time. We did this through a combination of innovative process and circuit design, process development, the ASMBL architectural approach, and the use of advanced embedded functions.

Development work on the Virtex-4 family (code-named “Whitney” after the highest mountain in the continental United States) began more than two years ago. It represents the creativity and dedication of hundreds of engineers, spanning integrated circuit design and layout, software and IP development, process development, testing and characterization, systems and applications engineering, technical documentation, and product marketing.

One of the most remarkable developments embodied in the new Virtex-4 FPGA family is the ASMBL architecture, which represents a fundamentally new way of constructing the FPGA floor plan and its interconnect to the package. First of all, ASMBL enables I/O pins, clock pins, and power and ground pins to be located anywhere on the silicon chip, not just along the periphery as with previous approaches. This in turn allows power and ground pins to be brought directly into the center of the silicon die, thereby significantly reducing on-chip IR drops that can occur with the largest FPGAs running at the highest frequencies.

Clock input pins are also located in the center of the die, which reduces clock latency. This is because clock networks need to have equal delay to all endpoints (that is, minimum skew), and thus the clock must emanate from the center. In periphery-connected clock input pins, the signal first traverses from the edge of the die to the center, and is then distributed to all regions. The Virtex-4 ASMBL design eliminates this traversal completely, and thus directly reduces the clock network propagation delay.

In addition to its electrical advantages, ASMBL provides another significant benefit in that it allows a more flexible – and thus more precise – allocation of on-chip resources. That in turn has enabled us to offer Virtex-4 devices in three unique platforms, each with a different mix of on-chip resources:

• The LX platform, optimized for logic applications

• The SX platform, optimized for high-end DSP applications

• The FX platform, optimized for embedded processing and high-speed serial applications

A Look Inside the Virtex-4 FPGA

At the heart of the Virtex-4 FPGA is our next-generation 90 nm triple-oxide 10-layer copper CMOS process technology. While that’s quite a lot of adjectives, every one of them is incredibly important. The first, 90 nm, refers to the “drawn” gate length of the smallest transistors. As transistors get smaller, they get faster, use less dynamic power, and enable higher complexity at lower price points. Chip designers think in terms of “transistor budgets,” which are now in the billion-transistor range.

Triple-Oxide 90 nm CMOS Technology

Triple-oxide technology refers to the number of transistor oxide thicknesses available in the process. More oxide thicknesses allow more tuning of performance and power in the device circuitry, and enable Virtex-4 devices to deliver industry-leading performance while dramatically lowering power consumption.

One of our key inputs from many engineers was that performance and power were very important constraints in their systems designs, and that they needed both high performance and low power. With a dual-oxide 90 nm process, we would have had to choose performance or power. This wasn’t good enough. By employing a triple-oxide 90 nm process, we achieved high performance and low power.

The 10-layer copper refers to the number of metal interconnect layers and their material, which is copper rather than aluminum (the traditional material). More layers provide more routing in less space and shorter connection distances. Copper reduces resistance compared to aluminum, and thus speeds signal interconnect and reduces on-chip power-distribution IR drop. As clock rates go up and voltages go down, these considerations have become increasingly important, and have driven the industry-wide shift to copper interconnect.

The Virtex-4 logic fabric was completely re-engineered to fully take advantage of the 90 nm triple-oxide CMOS process, resulting in the highest performance fabric ever, with system clock rates in excess of 500 MHz (at three LUT levels). At the same time, static power was cut in half compared to 130 nm Virtex-II Pro™ devices, as was dynamic power.

Thus, while some industry pundits were proclaiming that the future of deep submicron CMOS devices was getting hotter and hotter, with chip temperatures destined to reach that of rocket nozzles and the surface of the sun, the Virtex-4 design’s creative approach has turned that conventional wisdom on its head, resulting in overall power reductions of 50% compared to our previous 130 nm generation. In many applications, such as DSP functions, power levels are reduced even more – as much as 90%. No wonder design engineers say that Virtex-4 FPGAs are cool – they literally are.

High-Performance Clocking

Clocks were rated as one of the most important and critical FPGA resources in our surveys of design engineers. Quantity, quality, connectivity, frequency, duty cycle, jitter, and skew all made a big difference.

To take clocking to the next level in Virtex-4 devices, all global clock resources were made fully differential, thereby reducing skew, jitter, and duty-cycle distortion. This marks the first implementation of differential clocking in a programmable logic device. Not only that, but the number of global clocks was increased to 32 for every device, and internal connectivity options were enhanced to allow any region to use any 8 clocks simultaneously.

500 MHz Synchronous Memories and FIFOs

On-chip synchronous block RAM was enhanced to run at 500 MHz. Built-in support for first-in first-out (FIFO) memories was included directly in the block RAM unit, enabling the same 500 MHz operation for FIFOs (approximately a 2X speedup over fabric-based FIFOs), while eliminating the need for any additional logic cells or complex FIFO designs.

If you’re designing systems requiring ECC (error checking and correcting) memory, Virtex-4 devices have built-in ECC support, with single-bit correct and double-bit detect. ECC is common in infrastructure equipment in networking, telecom, storage, servers, instrumentation, and aerospace applications, and provides the highest levels of data integrity. Like the integrated FIFO support, the integrated ECC eliminates the cost and delay of fabric-based solutions.
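To make "single-bit correct and double-bit detect" concrete, here is a minimal sketch of a textbook extended Hamming (8,4) code in Python. It illustrates the general SECDED idea only. The word width, bit layout, and function names are assumptions chosen for brevity, and nothing here describes the actual Virtex-4 ECC circuit.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions with check bits at 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1          # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'double_error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1              # single error; syndrome 0 means bit 0
        status = "corrected"
    else:
        return "double_error", None      # even parity but nonzero syndrome
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of a codeword yields `'corrected'` with the original data, while flipping any two bits yields `'double_error'`, which is the behavior the paragraph above describes.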

Speaking of on-chip memory, Virtex-4 devices continue to offer SelectRAM™ memory, whereby each LUT is transformed into a 16 x 1 RAM, ideally suited for building high-speed register files and local buffers.
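The 16 x 1 organization can be pictured with a small behavioral model. The sketch below is an illustration in Python, not vendor code: the class and helper names are hypothetical, and it only shows how a 4-input LUT's 16 configuration bits can act as a 16-entry, 1-bit-wide memory, and how several such LUTs side by side form a small register file.

```python
class Lut16x1:
    """Behavioral model of a 4-input LUT used as a 16 x 1 RAM (hypothetical name)."""

    def __init__(self, init=0):
        self.bits = init & 0xFFFF        # 16 storage bits, one per address

    def read(self, addr):
        return (self.bits >> (addr & 0xF)) & 1

    def write(self, addr, value):
        mask = 1 << (addr & 0xF)
        if value & 1:
            self.bits |= mask
        else:
            self.bits &= ~mask


def make_register_file(width=8):
    """A 16-entry register file needs one 16 x 1 RAM per data bit."""
    return [Lut16x1() for _ in range(width)]


def rf_write(rf, addr, word):
    for bit, lut in enumerate(rf):
        lut.write(addr, (word >> bit) & 1)


def rf_read(rf, addr):
    return sum(lut.read(addr) << bit for bit, lut in enumerate(rf))
```

In this model a 16-entry, 8-bit register file costs eight LUTs, which suggests why the structure suits register files and local buffers.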

At the other end of the spectrum, interfaces to external memory devices such as DDR, DDR2, QDR-II, and RLDRAM-II are dramatically enhanced through our new ChipSync™ technology, which offers memory interface speeds at rates limited only by the speed of the external memory devices.

The new Virtex-4 ML461 Advanced Memory Development System contains fully functional and hardware-proven reference designs for all of today’s most popular memory technologies. If you plan to use external memory, I highly recommend that you check this out.

DSP Performance of 256 GigaMAC/s

In the DSP domain, we incorporated some of the world’s fastest multiply accumulate (MAC) technology. The XtremeDSP™ slice can perform an 18 x 18 signed multiply and 48-bit accumulate every 2 ns.

The Virtex-4 LX, FX, and SX platforms include the breakthrough XtremeDSP technology. With the new SX platform we did something completely new – we dramatically increased the ratio of DSP units to logic cells. Given the highly integrated nature of XtremeDSP slices, they need only small amounts of logic fabric to implement most common DSP functions, and thus increasing the ratio provides a significant increase in DSP compute power per unit silicon area. In fact, SX devices provide a 10X performance increase per unit cost over previous solutions.

Power is dramatically reduced as well, with more than a 10X reduction for multiply/add functions from previous FPGA solutions. The Virtex-4 SX55 contains 512 XtremeDSP slices, providing an aggregate DSP compute performance of 256 GigaMAC/s, making it one of the most powerful DSP devices ever manufactured.
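The 256 GigaMAC/s figure follows directly from the numbers quoted above: one MAC every 2 ns per slice, times 512 slices. The short Python sketch below reproduces that arithmetic and adds a behavioral model of a signed 18 x 18 multiply with a 48-bit accumulator. The wraparound-on-overflow behavior and the function names are assumptions made for illustration, not a description of the slice's internals.

```python
MAC_PERIOD_NS = 2.0      # one multiply-accumulate every 2 ns (from the text)
SLICES_SX55 = 512        # XtremeDSP slices in the Virtex-4 SX55 (from the text)

macs_per_slice_per_s = 1e9 / MAC_PERIOD_NS             # 500 million MAC/s
aggregate_gmacs = SLICES_SX55 * macs_per_slice_per_s / 1e9


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b):
    """Signed 18 x 18 multiply added into a 48-bit accumulator.

    Overflow wraps around here, which is an assumption of this model.
    """
    product = to_signed(a, 18) * to_signed(b, 18)
    return to_signed(acc + product, 48)


print(aggregate_gmacs)   # 256.0
```

A 36-bit product feeding a 48-bit accumulator leaves 12 bits of headroom, so thousands of worst-case products can be summed before overflow becomes a concern.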

The state-of-the-art XtremeDSP slice employs new “silicon algorithms” developed by a company called Arithmatica™. Many different architectures exist for implementing multiplication, and the Arithmatica architecture is truly a breakthrough. We are excited to see it available for the first time to FPGA users. For more information, visit Arithmatica’s website at www.arithmatica.com.

The Evolution of Advanced I/O Technology

I/O continues to be a critical success factor for today’s systems designers. During the last decade, we have seen four major changes in I/O. First was the shift away from 5V, the result of the need to scale voltages as we scaled the transistor. This in turn led to the plethora of I/O standards that we are all familiar with today: SSTL, HSTL, LVDS, and LVCMOS 1.5. The Virtex-4 SelectIO™ resource continues to lead the industry, supporting virtually every I/O standard in use today on every pin.

XCITE On-Chip Termination

The second major change was the transition from lumped loads to transmission line loads – again the direct result of Moore’s Law. As transistors got faster and clock rates increased, I/O edge rates increased as well. But because the propagation speed of signals is a constant, dictated by the speed of light, we entered the realm in which a signal on one end of a wire was no longer the same as the signal on the other end of the same wire. This is what transmission lines are all about, and their appearance during the last few years has driven a sea change in all aspects of signal interconnect and I/O design.

To make sure that these signal “waves” don’t start “splashing” uncontrollably, transmission lines need to be driven, built, and received using proper signal integrity approaches, the most critical of which is termination. Traditionally implemented with discrete resistors on the PCB, termination layouts can become exceedingly difficult around high-density pinouts like those used in FPGAs. This often dictates more PCB layers and thus more system cost.

Virtex-4 FPGAs include our third generation of XCITE™ integrated digitally controlled termination technology. Offering a precisely controlled source impedance at the output drive pin, it is designed to enable the driving of transmission lines without external components, with maximum speed and signal integrity, and with straightforward PCB layout and layer stack-ups.

Likewise, on inputs, XCITE offers parallel termination for single-ended inputs and true differential termination for differential inputs. Termination occurs on the end of the transmission line at the die, not on the way there on the PCB, offering maximum signal integrity. Many customers report that the XCITE technology has saved them many PCB layers, increased PCB packing density, and saved them substantial dollars in their bill of materials.
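Why termination matters can be seen from the standard transmission-line reflection coefficient, Γ = (Z_term − Z0) / (Z_term + Z0). The Python sketch below applies that textbook formula. The 50 Ω line impedance and the other values in the usage lines are illustrative assumptions, not XCITE specifications.

```python
def reflection_coefficient(z_term, z0):
    """Fraction of an incident wave reflected where a line of impedance z0
    meets a termination of impedance z_term (both in ohms)."""
    return (z_term - z0) / (z_term + z0)


def launch_fraction(z_source, z0):
    """Fraction of the driver's open-circuit swing launched into the line
    when the driver has output (source) impedance z_source."""
    return z0 / (z_source + z0)


Z0 = 50.0                                        # assumed trace impedance, ohms
matched = reflection_coefficient(50.0, Z0)       # parallel termination equal to Z0
open_end = reflection_coefficient(1e9, Z0)       # effectively unterminated input
half_swing = launch_fraction(50.0, Z0)           # matched source termination
```

With a matched termination the reflection is zero, while an unterminated high-impedance input reflects almost the entire wave. A matched source impedance launches half the swing, and the full-amplitude reflection from the far end then brings the line to the full level.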

Source-Synchronous Interfaces

The third major change was the shift from system-synchronous to source-synchronous interfaces. Traditional system-synchronous interfaces work by distributing a single clock to all transmitters and receivers in the system, and transmitting data between source and destination within a single clock cycle. This makes the data rate inversely proportional to the sum of clock-to-out, transmission line delay, and input setup time.

Typically, system-synchronous interfaces top out at speeds in the range of 100 MHz. To go faster, source-synchronous interfaces transmit a clock along with the data, and the receiver uses this clock to capture the data. Using this technique, along with double-data-rate transmissions, enables parallel I/O data rates in excess of 1 Gbps.
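The timing budget described above can be worked through numerically. In the Python sketch below, the three delay values are illustrative assumptions picked to land near the 100 MHz ceiling the text mentions, and the function names are hypothetical. The second function shows the double-data-rate relationship: two bits per forwarded-clock cycle.

```python
def max_system_sync_mhz(t_clock_to_out_ns, t_flight_ns, t_setup_ns):
    """Highest clock rate at which data still arrives within one cycle:
    the period must cover clock-to-out + line delay + input setup."""
    min_period_ns = t_clock_to_out_ns + t_flight_ns + t_setup_ns
    return 1000.0 / min_period_ns


def ddr_rate_mbps(forwarded_clock_mhz):
    """Per-pin data rate when data is transferred on both clock edges."""
    return 2 * forwarded_clock_mhz


# Assumed delays (not from the article): 4 ns clock-to-out,
# 3 ns board flight time, 3 ns input setup.
print(max_system_sync_mhz(4.0, 3.0, 3.0))   # 100.0
print(ddr_rate_mbps(500))                   # 1000
```

Because a source-synchronous link sends the clock alongside the data, flight time largely cancels out of the budget, which is why such links are not bound by the first formula.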

The challenge of source-synchronous interfaces is that each interface generates a new clock domain at the receiver. On top of this, to operate at high speeds, the precise alignment of clock and data at the receiver is paramount. To address this new world of source-synchronous interfaces, Virtex-4 devices include the breakthrough ChipSync technology. ChipSync units lie between the SelectIO technology and the core FPGA fabric, are available on every I/O pin on the device, and serve to transmit and receive high-speed source-synchronous data and clocks, achieving speeds of 1 Gbps per pin pair.

On the receiver, precise digital delay lines work internally to align data signals to each other, and then to align these to the received clock. The captured data is synchronized and transferred to the selected FPGA core clock domain.
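One common way to use a tapped delay line for alignment is to sweep the tap setting, record which settings capture a known pattern correctly, and park the delay in the middle of the widest passing window. The Python sketch below shows that generic window-centering step only. It is an assumed illustration, not the documented ChipSync algorithm, and the function name and the tap sweep in the usage line are hypothetical.

```python
def center_of_widest_window(valid):
    """Given one pass/fail flag per delay tap, return the tap index at the
    center of the longest run of passing taps, or None if no tap passes."""
    best_start, best_len = None, 0
    run_start = None
    # A trailing False acts as a sentinel that closes a run ending at the last tap.
    for i, ok in enumerate(list(valid) + [False]):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical sweep: taps 2..6 sample the data correctly, tap 9 is a stray pass.
sweep = [False, False, True, True, True, True, True, False, False, True]
chosen_tap = center_of_widest_window(sweep)   # tap 4
```

Centering in the window leaves roughly equal margin on both sides, so the capture point tolerates drift in either direction.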

To operate at maximum data rates, the transmit and receive units include parallel-to-serial and serial-to-parallel conversion units, respectively. Using ChipSync technology is virtually automatic for most designs, as it is built into the various Xilinx IP cores and reference designs.

Networking interfaces such as SPI-4.2 and HyperTransport™, and memory interfaces such as DDR, DDR2 SDRAM, and QDR II SRAM, all employ the Virtex-4 ChipSync technology. And if you’re designing your own source-synchronous interface, the ChipSync wizard gives you complete control and an easy-to-use GUI that lets you dial in exactly what you want to build.

Multi-Gigabit Serial Interfaces

The fourth major change in I/O has been the rapid adoption of high-speed serial interfaces. For years, serial interfaces were limited to long-distance communications, such as those used in fiber-optic links in the SONET/SDH world and Ethernet links like 100BASE-T.

A key breakthrough occurred in the late 1990s, in which high-speed serial transceivers (which traditionally had been designed using complex process technology such as GaAs [Gallium-Arsenide]) were for the first time created using advanced design techniques in standard CMOS. Once implemented in CMOS, these transceivers had lower cost and much lower power, and could even be integrated into complex CMOS chips.

Virtually overnight, gigabit serial technology changed from a rare, expensive, and power-hungry technology to a common, low-cost, and very power-efficient technology. This has been the economic and technical impetus behind the industry’s “Serial Tsunami,” in which interface after interface has shifted from parallel to gigabit serial links. Two common examples are visible in today’s computer architectures, with the shift from parallel PCI to 2.5 Gbps serial PCI Express™, and the shift from the parallel ATA drive interface to the Serial ATA interface.

There are more than a dozen multi-gigabit serial interfaces in widespread use today, with more being introduced every year. The Virtex-4 FX family provides our third-generation RocketIO™ multi-gigabit serial transceiver technology. Spanning speeds from 622 Mbps to more than 10 Gbps, each Virtex-4 RocketIO transceiver is programmable and can implement a myriad of speeds and serial standards. Link-layer IP is available for such standards as PCI Express, Serial ATA, Fibre Channel, Gigabit Ethernet, and Aurora, to name a few.

In addition, Virtex-4 FX devices each include multiple embedded tri-mode (or 10/100/1000) Ethernet MACs, making implementation of compliant Ethernet devices simpler and faster than ever.

Application-Specific Embedded Processing

Virtex-4 embedded processing solutions include full support for both MicroBlaze™ 32-bit soft CPUs on all devices, and embedded PowerPC™ 32-bit RISC CPUs on all Virtex-4 FX devices.

The number of CPUs in one device is limited only by your imagination, and of course by the available logic cells. The high-performance PowerPC CPU runs at clock rates up to 450 MHz and delivers up to 702 DMIPS each. The first PowerPC processor available by any manufacturer on 90 nm, the PowerPC processor is incredibly power-efficient, using only 0.29 mW/DMIPS. This makes it among the lowest power microprocessors available from any manufacturer worldwide.
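The figures quoted above can be combined into two derived numbers: the processor power implied at peak throughput, and the throughput per megahertz. The Python lines below are simple arithmetic on the quoted values and assume the 0.29 mW/DMIPS figure applies at the 702 DMIPS operating point, which the text does not state explicitly.

```python
DMIPS = 702            # peak throughput per PowerPC CPU (from the text)
CLOCK_MHZ = 450        # maximum clock rate (from the text)
MW_PER_DMIPS = 0.29    # quoted power efficiency (from the text)

implied_power_mw = DMIPS * MW_PER_DMIPS    # roughly 204 mW per CPU
dmips_per_mhz = DMIPS / CLOCK_MHZ          # 1.56 DMIPS per MHz

print(round(implied_power_mw), round(dmips_per_mhz, 2))
```

Under that assumption, each embedded CPU would draw on the order of a fifth of a watt at full speed.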

New Auxiliary Processor Unit (APU)technology connects the CPU to theFPGA fabric, enabling implementation ofacceleration hardware for virtually anyapplication. Once only the domain ofhigh-budget ASIC and ASSP design teams,the Virtex-4 FPGA’s architectural ability tocombine application-specific hardwareacceleration with high-performance RISCCPUs shatters traditional barriers of cost,time-to-market, and risk.

During the next few years, I expect tosee more and more instances of application-specific acceleration, as it truly offers theability to deliver very high performance atlow cost and low power. A recent researchprogram completed within Xilinx ResearchLabs, led by Dr. Kees Vissers, demonstrateda 20-fold speedup for anencryption/decryp-tion (IPSEC) application over the basePowerPC processor. Using only 135 mW,it outperforms a 3.2 GHz Pentium™-4,while at the same time reducing power by99%. That, in my opinion, is what state-of-the-art embedded processing is all about.

Conclusion

I hope that you’ve enjoyed reading a bit about the Virtex-4 Platform FPGA and the factors that drove its design. From the breakthrough ASMBL architecture and the triple-oxide 90 nm CMOS process technology, to the world’s most capable embedded processing and multi-gigabit serial solutions, Virtex-4 devices offer an unparalleled set of enabling technologies for your next-generation systems designs. I look forward to seeing the creativity of the world’s designers in tomorrow’s products.



Accelerated System Performance with APU-Enhanced Processing

The Auxiliary Processor Unit (APU) controller is a key embedded processing feature in the Virtex-4 FX family.

by Ahmad Ansari
Senior Staff Systems Architect
Xilinx, [email protected]

Peter Ryser
Manager, Systems Engineering
Xilinx, [email protected]

Dan Isaacs
Director, APD Embedded Marketing
Xilinx, [email protected]

The APU controller provides a flexible high-bandwidth interface between the reconfigurable logic in the FPGA fabric and the pipeline of the integrated IBM™ PowerPC™ 405 CPU. Fabric coprocessor modules (FCM) implemented in the FPGA fabric are connected to the embedded PowerPC processor through the APU controller interface to enable user-defined configurable hardware accelerators. These hardware accelerator functions operate as extensions to the PowerPC 405, thereby offloading the CPU from demanding computational tasks.

APU Instructions

The APU controller allows you to extend the native PowerPC 405 instruction set with custom instructions that are executed by the soft FCM; the primary capabilities are shown in Figure 1. This provides a more efficient integration between an application-specific function and the processor pipeline than is possible using a memory-mapped coprocessor and shared bus implementation.

The instructions supported by the APU are classified into three main categories:

• User-defined instructions (UDI)

• PowerPC floating-point instructions

• APU load/store instructions

The UDIs are programmed into the controller either dynamically through the PowerPC 405 device control register (DCR) or statically when the FPGA is configured through its bitstream. The APU controller allows you to optimize your system architecture by decoding instructions either internally or in the FCM.

The floating-point unit (FPU) is an example of an FCM. The PowerPC floating-point instruction set is decoded in the APU controller, whereas the computational functionality is implemented in the FPGA fabric. To support FPUs with different complexities, the APU controller allows you to select subgroups of the PowerPC floating-point instructions. These instructions are executed in the FCM while other subgroups of instructions are either computed through software FPU emulation or ignored completely. This fine-tuning optimizes FPGA resources while accelerating the most critical calculations with dedicated logic.

The APU controller also decodes high-performance load and store instructions between the processor data cache or system memory and the FPGA fabric. A single instruction transfers up to 16 bytes of data – four times greater than a load or store instruction for one of the general purpose registers (GPR) in the processor itself. Thus, this capability creates a low-latency and high-bandwidth data path to and from the FCM.

APU Controller Operation

Figure 2 identifies the key modules of the APU controller and the 405 CPU in relation to the FCM soft coprocessor module implemented in FPGA logic. To explain the operation of the APU controller and the processor interactions related to the execution units in soft logic, we can trace the step-by-step sequence of events that occur when an instruction is fetched from cache or memory.

Once the instruction reaches the decode stage, it is simultaneously presented to both the CPU and APU decode blocks. If the instruction is detected as a CPU instruction, the CPU will continue to execute the instruction as it would normally. Otherwise, within the same cycle, the CPU will look for a response from the APU controller. If the APU controller recognizes the instruction, it will provide the necessary information back to the CPU.

If the APU controller does not respond within that same cycle, an invalid instruction exception will be generated by the CPU. If the instruction is a valid and recognized instruction, the necessary operands are fetched from the processor and passed to the FCM for processing.

Because the PowerPC processor and the FCM reside in two separate clock domains, synchronization modules of the APU controller manage the clock frequency difference. This allows the FCM to operate at a slower frequency than the processor. In this instance, the APU controller would receive the resultant data from the coprocessor and at the proper execution time send the data back to the processor. The APU controller knows in advance, based on instruction type, if or when it will get the result.

Figure 1 – APU expanded processing capabilities

• Extends PPC 405 Instruction Set – Floating Point Support (with soft auxiliary processor) – User-Defined Instructions

• Offloads CPU-Intensive Operations – Matrix Calculations – Video Processing – Floating-Point Mathematics – 3D Data Processing

• Direct Interface to HW Accelerators – High Bandwidth – Low Latency

[Figure 1 block diagram labels: Processor Block, PowerPC, APU Controller, APU I/F, FPGA I/F, PLB, OCM, FPGA Fabric, Soft Auxiliary Processor]

Figure 2 – APU controller processing operative block diagram

[Figure 2 block diagram labels: Processor Block, 405 Core, APU Controller, Soft Coprocessor Module; Instructions from Cache or Memory; Fetch Stage, Decode Stage, Decode, EXE Stage, Exec. Units, WB Stage, Load WB Stage; Decode Control, Decode Registers, APU Decode, Pipeline Control, Buffers and Synchronization; Optional Decode, Execution Units, Register File; Instruction, Control, Operands, Result, Load Data]

Autonomous and Non-Autonomous Instructions

Two major categories of instructions exist: autonomous and non-autonomous. For autonomous instructions, the CPU continues issuing instructions and does not stall while the FCM is operating on an instruction. This overlap of execution allows you to achieve high performance through techniques such as software pipelining.

On the other hand, during the synchronized execution, the CPU pipeline stalls while the FCM is operating on an instruction. This feature allows you to implement synchronization semantics to pace the software execution with the hardware FCM latency.

Non-autonomous instruction types are further divided into blocking and non-blocking. If blocking, asynchronous exceptions or interrupts are blocked until the FCM instruction completes. Otherwise, if non-blocking, the exception or interrupt is taken and the FCM is flushed.

Software Description

Software engineers can access the FCM from within assembler or C code. On one side, Xilinx has enabled the GCC compiler (which is contained in the Embedded Development Kit) to generate code that uses an FCM floating-point unit to calculate floating-point operations. Furthermore, assembler mnemonics are available for UDIs and the pre-defined load/store instructions, enabling you to place hardware-accelerated functions into the regular program flow. For the ultimate level of flexibility, you can define your own instructions designed specifically for the hardware functionality of the FCM.

You can easily use the pre-defined load/store instructions through high-level C macros. For example, in an application where the FCM is used to convert pixel data into the frequency domain, 8 pixels of 16 bits are transferred from main memory to an FCM register with a simple program:

    unsigned short pixel_row[8];  // 8 pixels, each pixel has a size of 16 bits
    lqfcm(0, pixel_row);          // transfer a row of pixels to FCM register zero

The quadword load operation maintains cache coherency as the data is moved through the cache, if caching is enabled for the corresponding address space.

The FCM operation on the pixel data can start on an explicit command; for example, a UDI. However, for many applications the operation starts immediately after the FCM hardware detects the completion of the load instruction.

The latter approach has many advantages:

• Simple software – A load operation moves the data from the memory to the FCM and starts the operation. A subsequent store instruction retrieves the result of the operation and stores it back to main memory.

• High data transfer rates – Quadword load and store operations take just a few cycles to complete. A single operation moves 16 bytes within that timeframe.

• Low latency – FCM load operations are simple to use. The processor completes the operation in a single cycle.
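To make the data movement concrete, the following host-side sketch models a quadword load into an FCM register as a plain 16-byte copy. It is not the EDK `lqfcm` macro: `fcm_model_t`, `model_lqfcm`, `model_fcm_pixel`, and the choice of 32 registers are assumptions made for illustration, and on real hardware the transfer is a single APU load instruction rather than a `memcpy`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FCM_QUADWORD_BYTES 16   /* one quadword transfer moves 16 bytes */
#define FCM_MODEL_REGS     32   /* assumed register count, illustration only */

/* Hypothetical software model of an FCM register file. */
typedef struct {
    uint8_t reg[FCM_MODEL_REGS][FCM_QUADWORD_BYTES];
    int     loads_seen;   /* an FCM could start its operation when this changes */
} fcm_model_t;

/* Model of a quadword load: copy 16 bytes from memory into FCM register n. */
void model_lqfcm(fcm_model_t *fcm, unsigned n, const void *src)
{
    assert(fcm != 0 && src != 0 && n < FCM_MODEL_REGS);
    memcpy(fcm->reg[n], src, FCM_QUADWORD_BYTES);
    fcm->loads_seen++;
}

/* Read 16-bit pixel i (0..7) back out of modeled register n. */
uint16_t model_fcm_pixel(const fcm_model_t *fcm, unsigned n, unsigned i)
{
    uint16_t v;
    assert(fcm != 0 && n < FCM_MODEL_REGS && i < 8);
    memcpy(&v, &fcm->reg[n][i * sizeof v], sizeof v);
    return v;
}
```

The point of the model is the size match: an array of eight `unsigned short` pixels is exactly 16 bytes, so one load fills one register with one row.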

The principle of the RISC architecture uses a number of simple instructions on data stored in general-purpose registers (GPR) to compute complex operations. User-defined instructions fall into this category but take the concept a step further in that the system architect defines the complexity of the operation on data stored in GPRs and FCM registers (FCR). Again, from a software point of view, the engineer codes user-defined instructions through C macros. GCC recognizes mnemonics such as udi0fcm as a user-defined operation of the general form:

    udi0fcm <FCRT5/RT5>, <FCRA5/RA5/imm>, <FCRB5/RB5/imm>

The target of the operation is either a GPR or an FCR. The operands are either GPRs, FCRs, immediate values, or a combination. As you can see, the semantics are not defined by the instruction and depend on your intentions and the implementation in the FCM.

This code sequence demonstrates the use of a user-defined instruction as an example of a complex add operation:

    struct complex {
        int r, i;          // 32 bit integer for real and imaginary parts
    };
    struct complex a, b, r;

    ldfcm(0, &a);          // load complex number a into FCM register 0
    ldfcm(1, &b);          // load complex number b into FCM register 1
    udi0fcm(2, 1, 0);      // udi0fcm computes r = a + b, where r is stored in FCM register 2
    stdfcm(&r, 2);         // store complex result from FCM register 2 to variable r

To increase the readability of the code, you can redefine the user-defined instruction with regular C preprocessor constructs. Instead of using the udi0fcm() macro, you can redefine it to a more comprehensible complex_add() macro with #define complex_add(r, a, b) udi0fcm(r, a, b) and change the listing to call complex_add(2, 1, 0) instead of udi0fcm(2, 1, 0).
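The same layering can be tried on a host machine by standing in a C function for the hardware instruction. In this sketch `udi0fcm_model`, the `fcr` array, `struct complex_i32`, and `complex_add_demo` are hypothetical stand-ins; the real `udi0fcm` is an instruction whose behavior is whatever the FCM implements.

```c
#include <assert.h>
#include <stdint.h>

struct complex_i32 { int32_t r, i; };   /* real and imaginary parts */

/* Hypothetical model of the FCM registers: each holds one complex value. */
static struct complex_i32 fcr[32];

/* Stand-in for the user-defined instruction: FCR[t] = FCR[a] + FCR[b]. */
void udi0fcm_model(unsigned t, unsigned a, unsigned b)
{
    assert(t < 32 && a < 32 && b < 32);
    fcr[t].r = fcr[a].r + fcr[b].r;
    fcr[t].i = fcr[a].i + fcr[b].i;
}

/* Readable alias, mirroring the #define described in the text. */
#define complex_add(r, a, b) udi0fcm_model((r), (a), (b))

/* Load two operands, add them in the modeled FCM, and return the result. */
struct complex_i32 complex_add_demo(struct complex_i32 a, struct complex_i32 b)
{
    fcr[0] = a;              /* models ldfcm(0, &a)      */
    fcr[1] = b;              /* models ldfcm(1, &b)      */
    complex_add(2, 1, 0);    /* models udi0fcm(2, 1, 0)  */
    return fcr[2];           /* models stdfcm(&r, 2)     */
}
```

Because the alias is a preprocessor macro, it costs nothing at run time; the call site simply reads as the operation it performs.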

Therefore, system architects can partition their tasks into hardware- and software-executed pieces that are efficiently and precisely interfaced to one another through the use of the APU controller. This partitioning can be done statically during the initial system configuration or dynamically during the program execution. Using the direct processor/FPGA coupling presented by the APU controller and its high throughput interfaces, hardware/software synchronization is greatly simplified and performance significantly improved.

Accelerating System Performance

The following examples showcase key advantages the APU provides based on two different scenarios. The first scenario is essentially a benchmarking comparison of a finite impulse response (FIR) filter using a soft FPU core, implemented as an FCM attached directly to the APU controller (as compared to software emulation used to calculate the filter function). The second scenario implements a two-dimensional inverse discrete cosine transform (2D-IDCT) typically used as one of the processing blocks in MPEG-2 video decompression, again compared to emulating the 2D-IDCT function in software.

The two use cases are different in that the FPU implements a set of registers in the FPGA fabric upon which the FPU instructions operate. The 2D-IDCT only requires load and store operations, while the functionality of the operation on the data stream is fixed. In either case the operations are complex enough to justify offloading into the FPGA fabric.

Thus, the combination of using the APU and FPGA hardware acceleration clearly provides a significant performance advantage over software emulation – or the conventional method involving the processor and processor local bus architecture with a soft co-processing function.

FIR Filter

The implementation of floating-point calculations in hardware yields an improvement by a factor of 20 over software emulation. Connecting the FPU as an FCM to the APU controller provides performance improvement because the latency to access the floating-point registers is reduced and dedicated load and store instructions move the operands and results between the FPU registers and the system memory.


Figure 3 – Utilizing APU to decode pixel data for display output

[Figure 3 content: MPEG decode flow (RGB, YUV, blocks) and spatial redundancy, with pixel decoding using the IDCT. An 8 x 8 block of pixel DCT values, mostly zeros, is transformed into an 8 x 8 block of pixel amplitude values in which every row reads 223 191 159 128 98 72 39 16. APU function: decompresses encoded pixel data for output display and utilizes FPGA resources, with less overhead logic and fast data transfer.]

2D-IDCT

The 2D-IDCT transforms a block of 8 x 8 data points from the frequency domain into pixel information. A high-level diagram depicting the pixel decode by the APU controller, along with advantages, is shown in Figure 3. In this example, each data point has a resolution of 12 bits and is represented as a 16-bit integer value. The data structure is defined where each row of 8 pixels consumes 16 bytes. This is an ideal size that allows optimal use of the FCM load and store instructions described earlier. In other words, eight FCM quadword load instructions are needed to load a data block into the 2D-IDCT hardware. Eight FCM quadword store instructions are sufficient to copy the pixel data back into the system memory.
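A short sketch of that layout follows. The row type is checked at compile time to be exactly 16 bytes, and a block is moved with eight row loads followed by eight row stores. `row_load_fn`, `row_store_fn`, and the loopback functions are hypothetical stand-ins for the FCM quadword load and store macros so that the loop can run on a host; the loopback simply returns what it was given and performs no transform.

```c
#include <assert.h>
#include <stdint.h>

#define IDCT_ROWS 8
#define IDCT_COLS 8

/* One row: 8 data points stored as 16-bit integers = 16 bytes, one quadword. */
typedef struct { int16_t v[IDCT_COLS]; } idct_row_t;
typedef struct { idct_row_t row[IDCT_ROWS]; } idct_block_t;

_Static_assert(sizeof(idct_row_t) == 16, "a row must fill exactly one quadword");

/* Hypothetical stand-ins for the quadword load and store operations. */
typedef void (*row_load_fn)(unsigned fcm_reg, const idct_row_t *src);
typedef void (*row_store_fn)(idct_row_t *dst, unsigned fcm_reg);

/* Host-side loopback "FCM" that only holds the rows it is given. */
static idct_row_t loopback_reg[IDCT_ROWS];

void loopback_load(unsigned fcm_reg, const idct_row_t *src)
{
    assert(fcm_reg < IDCT_ROWS);
    loopback_reg[fcm_reg] = *src;
}

void loopback_store(idct_row_t *dst, unsigned fcm_reg)
{
    assert(fcm_reg < IDCT_ROWS);
    *dst = loopback_reg[fcm_reg];
}

/* Move one block in and out; returns the number of transfers issued. */
int idct_block_transfer(const idct_block_t *in, idct_block_t *out,
                        row_load_fn load_row, row_store_fn store_row)
{
    int transfers = 0;
    assert(in != 0 && out != 0 && load_row != 0 && store_row != 0);
    for (unsigned r = 0; r < IDCT_ROWS; r++) {   /* eight quadword loads */
        load_row(r, &in->row[r]);
        transfers++;
    }
    for (unsigned r = 0; r < IDCT_ROWS; r++) {   /* eight quadword stores */
        store_row(&out->row[r], r);
        transfers++;
    }
    return transfers;
}
```

Sixteen transfers per 128-byte block is the whole software cost in this model, which is the property the paragraph above relies on.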

The calculation of the 2D-IDCT in the FCM starts immediately after the first load, and the pixel data is available shortly after the last load operation. As shown in Figure 4, the 2D-IDCT makes use of the new XtremeDSP™ slices in the Virtex-4 architecture that offer multiply-and-accumulate functionality.

A software-only implementation of a 2D-IDCT takes 11 multiplies and 29 additions together with a number of 32-bit load and store operations, while the hardware-accelerated version takes 8 load and 8 store operations. The reduced number of operations results in a speed-up of 20X in favor of a 2D-IDCT FCM attached through the APU controller.
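For comparison, a plain software reference can be written as two passes of a one-dimensional 8-point inverse DCT, first over rows and then over columns. The sketch below is a textbook floating-point form with a small cosine table; it is neither the optimized routine whose operation counts are quoted above nor the FCM implementation, and all names in it are illustrative.

```c
#include <assert.h>

#define IDCT_N 8

/* cos(k*pi/16) for k = 0..8, tabulated so no math library is needed. */
static const double COS16[9] = {
    1.0,
    0.98078528040323044,
    0.92387953251128674,
    0.83146961230254524,
    0.70710678118654752,
    0.55557023301960222,
    0.38268343236508977,
    0.19509032201612827,
    0.0
};

/* cos(k*pi/16) for any non-negative k, using the symmetries of cosine. */
double cos_pi16(unsigned k)
{
    k %= 32;                              /* 2*pi corresponds to 32 steps */
    if (k <= 8)  return  COS16[k];
    if (k <= 16) return -COS16[16 - k];
    if (k <= 24) return -COS16[k - 16];
    return COS16[32 - k];
}

/* One-dimensional 8-point inverse DCT with orthonormal scaling. */
void idct_1d(const double in[IDCT_N], double out[IDCT_N])
{
    for (unsigned x = 0; x < IDCT_N; x++) {
        double sum = COS16[4] * in[0];    /* u = 0 term carries 1/sqrt(2) */
        for (unsigned u = 1; u < IDCT_N; u++) {
            sum += in[u] * cos_pi16((2 * x + 1) * u);
        }
        out[x] = 0.5 * sum;               /* sqrt(2/N) = 0.5 for N = 8 */
    }
}

/* Two-dimensional inverse DCT on a row-major 8 x 8 block. */
void idct_2d(const double coeff[IDCT_N * IDCT_N], double pixel[IDCT_N * IDCT_N])
{
    double tmp[IDCT_N * IDCT_N], vec_in[IDCT_N], vec_out[IDCT_N];
    assert(coeff != 0 && pixel != 0);
    for (int r = 0; r < IDCT_N; r++) {            /* pass 1: rows */
        idct_1d(&coeff[r * IDCT_N], &tmp[r * IDCT_N]);
    }
    for (int c = 0; c < IDCT_N; c++) {            /* pass 2: columns */
        for (int r = 0; r < IDCT_N; r++) vec_in[r] = tmp[r * IDCT_N + c];
        idct_1d(vec_in, vec_out);
        for (int r = 0; r < IDCT_N; r++) pixel[r * IDCT_N + c] = vec_out[r];
    }
}
```

Even in this compact form each block costs sixteen one-dimensional passes of 64 multiply-accumulates each, which gives a feel for the work that the 8 loads and 8 stores of the accelerated version replace.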

By comparison, if you connect the 2D-IDCT hardware block to the processor local bus, as is done conventionally, the system performance will be reduced. The added latency is mainly caused by the bus arbitration overhead and the large number of 32-bit load and store instructions. This is illustrated schematically in Figure 5.

Conclusion

The low-latency and high-bandwidth fabric coprocessor module interface of the APU controller enables you to accelerate algorithms through the use of dedicated hardware. Where operations are complex enough to justify the offloading into the FPGA fabric, or when acceleration of a specific algorithm is desired to achieve optimal performance, the combination of the APU controller and FPGA hardware acceleration provides a definitive performance advantage over software emulation or the conventional method of attaching coprocessors to the processor memory bus.

Generating the accelerated functions called by user-defined instructions is easily performed through GUI-based wizards. This functionality will be included in subsequent releases of the powerful Embedded Development Kit or Platform Studio.

If you are more comfortable working at the source code or assembly level, the APU controller allows you to define your own instructions written specifically for the hardware functionality of the FCM, or you can easily use the pre-defined load/store instructions through high-level C macros.

The APU controller provides a close coupling between the PowerPC processor and the FPGA fabric. This opens up an entire range of applications that can immediately benefit customers by achieving increases in system performance that were previously unattainable.

For additional details on the APU controller in Virtex-4 FX devices, including detailed descriptions and timing waveforms, refer to the Virtex-4 PowerPC 405 Processor Block Reference Guide at www.xilinx.com/bvdocs/userguides/ug018.pdf.


Figure 4 – Accelerated system performance with APU

[Figure 4 content: block diagram of the Processor Block (PowerPC, APU Controller, APU I/F, FPGA Interface, PLB, OCM) connected to an auxiliary processor in soft logic built with XtremeDSP slices in the FPGA fabric. Example: video application, MPEG decompression algorithm.]

• Leverages Integrated Features – PowerPC, APU, XtremeDSP Blocks

• HW Acceleration Over Software – Lower Latency and High Bandwidth

• Efficient HW/SW Design Partitioning – Optimized Implementation

• Significant Performance Increase – Over 20X Performance Improvement Compared to Software Emulation

Figure 5 – Comparison of implementation models for 2D-IDCT

[Figure 5 content: three implementation models of the inverse two-dimensional DCT algorithm. Software emulation: software only, more than 200 lines of code, several hundred instructions. Processor with soft IDCT on the processor local bus: accelerated with a soft core, multiple load/store operations per IDCT. APU accelerated: APU with XtremeDSP slices, single instruction execution, leverages APU and soft logic.]

Sign of the Times

Xilinx makes high-tech outdoor advertising in Times Square possible.

by Jason Daughenbaugh
Sr. Design Engineer
Advanced Electronic Designs, [email protected]

New York City’s Times Square is known as the “Crossroads of the World.” Approximately 1.5 million people pass through the intersection of Broadway and 42nd Street every day, and millions more see the area daily on television broadcasts. No better place for outdoor advertising exists. As a result, dazzling signs have become a Times Square trademark.

Every advertiser wants to have the best advertising medium possible, so new signs must use the latest technology. Times Square tenants rely on MultiMedia, which manufactures the majority of the spectacular signs in Times Square. When MultiMedia asked our company, Advanced Electronic Designs, Inc. (AED), to design an LED sign for JPMorgan Chase™ in Times Square, we needed a huge amount of signal processing, data distribution, and interfacing. We also needed to design the sign very quickly. We met this challenge by utilizing the advantages of Xilinx® components.

We used Virtex-II™ XC2V1000 FPGAs for video processing, and for control and distribution we chose low-cost Spartan-3™ XC3S200 FPGAs. To configure the FPGAs, we chose the Platform Flash XCF00 configuration PROM family. And for final distribution of the data on the 3,800 LED blocks, we used XC9572XL PLDs.

The Design

An LED sign is like a large computer monitor; video data goes in and is displayed on the sign. The sign comprises red, green, and blue LEDs that turn on and off (pulse-width modulation) to generate more than four trillion colors.

What made this particular design a challenge was the scale, both in terms of physical size as well as the amount of data and the transfer rates involved. The sign is 135 feet long and 26 feet tall. With nearly two million pixels, it is the highest definition LED display in the world. This is ten times the resolution of the average television screen and twice the resolution of top-of-the-line HDTV sets.

After considering our options, designing with Xilinx programmable logic was the obvious choice. The high-performance, low-cost FPGAs are well suited for all three main components of this design: video processing, data distribution, and sign control.

Video Processing

The video processor accepts a variety of video inputs. It captures these video streams as 36-bit RGB (12 bits per color). It then crops and places these inputs onto a master sign image for display. Color-space conversion adjusts image characteristics such as color temperature and balance. Additional processing corrects for individual LED differences. We also use proprietary image processing algorithms to operate the LEDs efficiently while maintaining optimal image quality.

Data Distribution

Video data starts in a control room and ends at the LEDs. The first step is the video processor, which is located in the control room. The video processor breaks the images into manageable chunks to send to the many modules of the sign so that each LED displays the data for the corresponding pixel. More than 3 Gbps of video data alone is required to operate the LEDs. In addition to video data, we also transfer a variety of control and status functions.

Not wanting to re-invent the wheel, we chose Ethernet as our data distribution medium. Our video processor has multiple Gigabit Ethernet ports that interface to the sign. Gigabit Ethernet can be transferred over fiber-optic cable, allowing great distances between the controller and the sign itself.

We were able to use off-the-shelf switches to distribute the data within the sign and put inexpensive 10/100 Ethernet ports on the individual distribution boards. The availability of Ethernet protocol analyzers, such as the open-source project Ethereal, allowed us to easily analyze and debug the system.

Sign Control

In advertising, time is money; thus it is crucial to monitor the sign at all times. The control system monitors temperatures throughout the sign to ensure that adequate cooling is present. Voltages are monitored to detect malfunctioning power supplies. The control system maintains error and resend counts to detect faulty data links. It also provides an interface to upgrade the FPGAs remotely for enhancements and bug fixes.

Figure 1 – The world’s highest resolution LED display is based on Xilinx devices.

Figure 2 – The sign is built out of 3,800 display blocks.

The Benefits of Xilinx Devices

Xilinx devices include a large number of features that are ideal for our sign project:

• The reconfigurable nature of Xilinx devices is necessary for a project like this. Without FPGAs, the only alternative would have been an ASIC. But an ASIC was not feasible for this project for several reasons.

First, this project had a very tight schedule. An ASIC could not have been completed in the time allotted. Second, the component volumes in this sign are not sufficient to absorb the NRE costs of an ASIC. Third, an ASIC lacks the development opportunities of an FPGA. To me, as an engineer, this reason is the most important. No matter how much simulation you perform, there can always be unexpected bugs. In an ASIC, these bugs are expensive; in an FPGA, they can be fixed easily.

Another FPGA advantage is that it can meet future needs through feature upgrades; an ASIC cannot. The reconfigurable nature of Xilinx FPGAs allows us to provide feature upgrades and bug fixes to the customer via e-mail, making it easy for them to apply to the sign. Through an Ethernet interface, the FPGA reprograms the Platform Flash configuration PROM and automatically reboots.

• Video processing requires a large number of multiply operations. The video processor must perform color-space conversion and apply calibration coefficients in real time. It would require a large portion of FPGA logic resources to build multipliers. Instead, this can be done very efficiently by utilizing the embedded multipliers. Building pipelined processing structures with the embedded multipliers allowed us to easily meet the processing requirements.

• This design required a large variety of signaling standards. The flexible Xilinx I/O blocks allowed us to connect directly to a large number of different interfaces. Voltages ranged from standard 3.3V CMOS down to 1.5V HSTL. We required single-ended and differential interfaces. In some cases we could have used external driver and receiver parts, but that would have added complexity and cost to the product.

Other high-speed I/O interfaces, such as to the DDR 333 memory, would not have been possible without direct FPGA support. The digitally controlled impedance (DCI) modes were necessary on the high-speed single-ended traces.

• With the high data rates involved and the many data interfaces, we had a large number of clock domains. The quantity of global clock nets available and the ability of the digital clock managers (DCMs) to synthesize clock frequencies made this easy. We also used the phase-shift ability of the DCM to adjust sample times on various interfaces.

• Block RAM is my favorite resource in an FPGA. Without block RAM, there are two memory options. The first option is the logic slices, using flip-flops or distributed RAM, but this is expensive and slow for anything more than 16- to 32-bit addresses. The second option is external memory, such as SDRAM. SDRAM storage is generally in the range of tens to hundreds of megabytes, leaving a huge size gap between these two memory options.

Block RAM bridges this size gap. It can be used for a limitless number of things, from FIFOs for processing engines to loadable tables for data conversions. The flexible port-widths of block RAM allow you to use them individually or in efficient combinations. The dual-port capability makes them easy to use for transferring data between clock domains or sharing data.

• While very powerful and convenient, the PLDs and Spartan-3 FPGAs are also very inexpensive. When combined with the development advantages, the low device price makes Xilinx devices unbeatable when developing high-performance embedded systems.
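The embedded-multiplier point in the list above can be illustrated with a little fixed-point arithmetic. The sketch below applies a 3 x 3 color matrix and a per-channel calibration gain to one 12-bit RGB pixel using only integer multiplies, adds, and shifts, which is the kind of operation that maps onto one multiplier per coefficient in a pipeline. The Q12 coefficient format, the rounding, and all names are assumptions for illustration; the article does not give AED's coefficients or pipeline structure.

```c
#include <assert.h>
#include <stdint.h>

#define Q_BITS  12                 /* fraction bits of each coefficient (assumed) */
#define Q_ONE   (1 << Q_BITS)      /* 1.0 in Q12 */
#define PIX_MAX 4095               /* 12 bits per color, as captured by the processor */

typedef struct { uint16_t r, g, b; } rgb12_t;

/* out = gain .* (M * in): nine multiplies for the matrix, three for calibration. */
rgb12_t convert_pixel(rgb12_t in, const int32_t m[3][3], const int32_t gain[3])
{
    const int32_t c[3] = { in.r, in.g, in.b };
    uint16_t v[3];
    rgb12_t out;

    for (int row = 0; row < 3; row++) {
        int64_t acc = 0;
        assert(gain[row] >= 0);
        for (int col = 0; col < 3; col++) {
            acc += (int64_t)m[row][col] * c[col];        /* color-space matrix */
        }
        if (acc < 0) acc = 0;                            /* no negative light */
        acc = (acc + Q_ONE / 2) >> Q_BITS;               /* round back to 12 bits */
        acc = (acc * gain[row] + Q_ONE / 2) >> Q_BITS;   /* per-LED calibration */
        if (acc > PIX_MAX) acc = PIX_MAX;                /* saturate */
        v[row] = (uint16_t)acc;
    }
    out.r = v[0];
    out.g = v[1];
    out.b = v[2];
    return out;
}
```

With an identity matrix and unity gains the pixel passes through unchanged; halving the gains halves each channel, and gains above 1.0 saturate at 4095 rather than wrapping.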

PicoBlaze Processors

Device hardware capabilities are essential for any design, but development tools and tricks are also very important. The favorite toy in our Xilinx bag-of-tricks is the PicoBlaze™ processor. We could not have completed the project in the time allowed without extensive use of the PicoBlaze processor. The sign contains an impressive count of more than 1,000 of these embedded processors, with nine different designs.

PicoBlaze processors provide efficient logic resource utilization by time-multiplexing logic circuits. Many functions, especially control functions, do not need to be especially fast, or don’t happen very often.

One example of this would be a serial transfer to read a temperature sensor. For this application, the sensor only needs to be read every ten seconds. It would be a waste to have a state machine for temperature sensor reading that ran once every ten seconds but only took a few milliseconds to complete. The logic would be unused 99.9% of the time.

Table 1 – This sign includes nearly 4,500 Xilinx devices.

Quantity   Device
10         XC2V1000 Virtex-II FPGA
323        XC3S200 Spartan-3 FPGA
333        XCF00 Platform Flash PROM
3,800      XC9572XL 72 macrocell PLD

Table 2 – Xilinx devices achieve impressive specifications.

Quantity     Specification
8 Gbps       video processing
18 billion   16-bit multiply operations per second
16           DDR333 SDRAM banks
6            Gigabit Ethernet MACs
333          Fast Ethernet MACs
>1,000       PicoBlaze processors

These types of functions can be efficiently combined into a single PicoBlaze processor, which in the previous example can not only read the temperature every ten seconds but perform other similar tasks in the meantime.
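The idea of folding several slow control functions into one processor can be sketched as a simple polling loop. The code below is written in C for readability rather than in PicoBlaze assembler, and the task structure, the periods, and the millisecond tick are hypothetical; it shows only how one sequencer can serve a ten-second task and faster tasks in turn.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*task_fn)(void);

typedef struct {
    task_fn  run;           /* work to perform; may be null for a dry run */
    uint32_t period_ms;     /* how often the task is due */
    uint32_t next_due_ms;   /* next time the task should run */
    uint32_t run_count;     /* how many times it has run so far */
} task_t;

/* Run every task whose due time has arrived; returns how many tasks ran. */
unsigned scheduler_tick(task_t *tasks, unsigned n, uint32_t now_ms)
{
    unsigned ran = 0;
    assert(tasks != 0);
    for (unsigned i = 0; i < n; i++) {
        /* Signed difference keeps the comparison valid across counter wrap. */
        if ((int32_t)(now_ms - tasks[i].next_due_ms) >= 0) {
            if (tasks[i].run != 0) {
                tasks[i].run();
            }
            tasks[i].run_count++;
            tasks[i].next_due_ms += tasks[i].period_ms;
            ran++;
        }
    }
    return ran;
}
```

Called once per millisecond for 30 seconds with a 10,000 ms task and a 100 ms task, the loop runs the first three times and the second 300 times, so the rarely needed function costs almost nothing while sharing the same hardware.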

The PicoBlaze processor also provides a quick and easy way to develop control functions. The alternative would be to build a custom state machine for each function. The PicoBlaze processor is a programmable state machine, meaning that the state machine is already built; one just has to program it. It has an intuitive and powerful instruction set and a large code-space of 1,024 instructions. Programming the PicoBlaze processor is a quick and easy way to define many functions.

The PicoBlaze processor is also a great tool for accelerating the testing and debugging process. The PicoBlaze program code is stored in block RAM. To make a change to the program we only need to change the block RAM contents. It is possible to do this without re-implementing the FPGA, saving a lot of time.

Our favorite method of PicoBlaze processor development, which is somewhat unusual, is to use a PC serial port and a simple PC application to download the program code into the block RAM of a configured FPGA. We have developed an interface board that connects to the FPGA and has the serial port, as well as several seven-segment displays to which the PicoBlaze processor can write for debugging. We also allow the selection of different processors so that we can work on multiple processors through the same interface.

This interface is not only useful for debugging PicoBlaze programs, but also for debugging the logic connected to the processor. Because it is so quick and easy to write programs for PicoBlaze processors, it is very straightforward to write programs to test the various logic circuits attached to the processor. We can test each function individually, greatly simplifying and accelerating any debugging that becomes necessary.

A key application of the PicoBlaze processor in this project is the Ethernet controller. As mentioned earlier, we selected Ethernet to distribute data throughout the sign. At each Ethernet connection, we have an Ethernet physical layer transceiver (PHY) device connected directly to an FPGA. We developed a very simple and tiny media access controller (MAC) module, which we use inside the FPGA to connect the PHY to an instantiation of the PicoBlaze processor.

This Ethernet unit is small, requiring less than a quarter of the logic resources in the XC3S200 FPGA. It handles the basic Ethernet layers and protocols, including ARP (address resolution protocol). It also supports the IP (Internet protocol) layer with ICMP (Internet control message protocol), UDP (user datagram protocol), and DHCP (dynamic host configuration protocol). With this Ethernet controller, we can plug an FPGA into our network and it negotiates an IP address. Then we can transfer files and data to and from it.
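One small piece that any IP-capable controller of this kind needs is the Internet checksum carried in IP and ICMP headers and optionally in UDP. The sketch below is the standard ones'-complement sum described in RFC 1071, written in C; it illustrates the arithmetic only and is not AED's PicoBlaze code, and the function name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement Internet checksum over len bytes in network byte order. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    assert(data != 0 || len == 0);

    while (len > 1) {                       /* add 16-bit big-endian words */
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1) {                         /* odd trailing byte, zero padded */
        sum += (uint32_t)data[0] << 8;
    }
    while (sum >> 16) {                     /* fold carries back into 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

A sender computes the value with the checksum field set to zero and writes the result into the header; a receiver runs the same routine over the header as received and expects zero. The loop needs only additions and an end-around carry, which suits a small 8-bit controller.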

Conclusion

Xilinx devices made the challenge of developing the world’s highest definition LED display achievable. These devices are a perfect fit for a complex design because of their flexible nature and powerful feature set. Valuable design components such as the PicoBlaze processor further increase their ease of use and thus their value.

The reconfigurable and flexible nature of the devices allowed us to ship the sign with all first-revision circuit boards, enabling us to develop a very complex system in very little time.

For more information about MultiMedia LED signs, visit www.multimediaLED.com. For more information about the engineering provided by AED, visit www.aedmt.com.

Figure 3 – The video data distribution board is based on an XC3S200 FPGA. It also includes SRAM, a 10/100 Ethernet port, a status display, and numerous connections to display blocks.

by Barrie Mullins
Integrated Solutions Manager, Xilinx Design Services
Xilinx, [email protected]

Optos PLC, a revolutionary eye care technology company, selected Xilinx Virtex-II Pro™ FPGAs to be at the heart of their wide-field visual imaging system. To lower the risk, reduce the learning curve, and meet their time-to-market requirements for this new technology, Optos engaged Xilinx Design Services (XDS) to help them plan, design, implement, and deliver their complex solution.

The Optos Panoramic System comprises two IBM™ PowerPC™ 405 microprocessors and multiple proprietary interfaces, as well as industry standard interfaces. The design utilizes the latest technology, tools, and software provided by Xilinx and gave XDS the opportunity to apply its system, hardware, and embedded software knowledge, along with its project management expertise, to meet the time-to-market requirements.

The Design Challenge

For Optos, the challenge was to find a technology that provided them with the technical features to take their system from concept to reality. The system is used by a qualified operator to take a scan of the patient’s eye while they look at a target presented on an LCD panel. This image capture uses Optos’ patented technology in wide-field visual imaging and transfers the data to a microprocessor. The processor is used for storage of patient data and processing by the operator.

The challenge for XDS was to design, develop, test, simulate, integrate, and bring up the hardware and software on a single platform to enable a complete connectivity solution addressing all layers of the Optos application. The solution also had to be scalable for future features and upgrades. XDS was also required to meet budget constraints.

The Teams

The Xilinx Design Services team worked with the Optos engineering team to set the system requirements and to agree on a schedule and deliverables for the project. Over the period of the project, the XDS project manager worked with the Optos project management team to ensure that milestones were delivered and that information flowed smoothly between the two teams.

Optos

Optos was founded to develop a technology that will provide ophthalmologists and optometrists with improved image capture and analysis capability for the early detection and prevention of eye disease. Optos’ patented technology in wide-field visual imaging is unique. Their business strategy is to make their visual imaging system available to the greatest number of practitioners, thus allowing an overall improvement in eye care and early disease detection for a far greater number of people.

The Optos headquarters in Dunfermline, United Kingdom, is the base for research and development of the Panoramic diagnostic ophthalmic apparatus, the technology behind Optomap Retinal Images.

Xilinx Design Services

Optos contracted XDS to design and deliver the Virtex-II Pro 2VP20 design. By doing so, they reduced the risk and learning curve for this new technology, and they ensured they would meet their time-to-market goal. XDS gathered a team of experienced co-design engineers with in-depth knowledge of project management, embedded software design, FPGA design, co-design tools, IP cores, and platform FPGAs.

Xilinx Teams with Optos

Xilinx Design Services assists in the development of a new eye care technology.

[Figures 1 and 2: block diagrams of the Virtex-II Pro 2VP20 design. Labeled blocks include the two PowerPC 405 processors (configuration and DSP), PLB and OPB arbiters and bridges, EMC cores for flash and ZBT RAM, the DDR SDRAM core, SPI, IIC, UART 16550, and GPIO cores, the DCR interrupt core and 256 control and status registers, three MGTs with a proprietary high-speed interface, the proprietary low-resolution interface for two cameras, and the OPB2PCI core connecting to the host x86 system running Linux.]

The final delivery to Optos included the source code for the embedded test software, HDL designs, design scripts, HDL test vectors, simulation test software, build scripts, and associated testbenches. The delivery specifications also included all associated project documentation: project plan, design specifications, testbench strategy, software specifications, memory map, and verification plan.

The Panoramic System

Figure 1 shows the basic Panoramic system design, which includes:

• System control processor (SCP), described as System Configuration and Control in the diagram legend

• Camera and digital signal processing (DSP) components, described as DSP and Camera Logic

• High-speed multi-gigabit serial transceiver (MGT) data and PCI, described as High-Speed Data Logic.

Each section interacts with the others, but is not 100 percent interdependent.

In Figure 2 you can see a more detailed diagram of the system that XDS designed. The outer color blocks in Figure 2 show the division of the logic in the system. The interface between the two main blocks shown in Figure 2 was through the use of device control registers (DCRs) accessed and controlled by both the SCP and associated hardware.

Figure 2 – Detailed view of Virtex-II Pro design in the Panoramic system

Figure 1 – Block diagram of Virtex-II Pro design in the Panoramic system

System Control Processor

The main function of the SCP is system configuration management, coordination, status reporting, and control of the timing of events based on inputs. It starts image capture, and controls LCD display and other logic through registers in the system.

The main technical blocks of the SCP perform the following functions:

• Communicate with the Intel™ x86 operating system environment over PCI and RS-232 connections

• Configure the LCD panel and camera interfaces via IIC

• Display images on the LCD panel

• Communicate with peripheral subsystems via SPI

• Boot from internal Virtex-II Pro block RAM and load application from external flash memory to ZBT RAM before running it

• Schedule and control events in the application software via interrupts and GPIO

• Use DCR registers to gather status and control operations in other sections.

DSP and Camera Logic

The camera interfaces are required to support 30 frames per second (fps) of 640 x 480 pixels, requiring bandwidths greater than 9 Mbps for each camera. The data is stored locally in ZBT RAM so that a DSP algorithm running on the second PowerPC 405 microprocessor can be applied to the data. While the data is being received from the cameras, it is also copied to DDR SDRAM for transmission over the PCI to the host x86 memory space.
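The per-camera figure can be sanity-checked with a little arithmetic. The sketch below derives the pixel rate from the stated resolution and frame rate only. The 8-bit pixel depth passed to the second function is an assumption, since the article does not give one, and the function names are illustrative.

```python
def pixel_rate(width: int, height: int, fps: int) -> int:
    """Pixels per second that one camera interface must sustain."""
    return width * height * fps


def data_rate_bits(width: int, height: int, fps: int, bits_per_pixel: int) -> int:
    """Raw video data rate in bits per second, ignoring blanking and protocol overhead."""
    return pixel_rate(width, height, fps) * bits_per_pixel


if __name__ == "__main__":
    rate = pixel_rate(640, 480, 30)
    print(f"{rate / 1e6:.2f} Mpixel/s per camera")
    # Assumption: 8 bits per pixel (not stated in the article).
    print(f"{data_rate_bits(640, 480, 30, 8) / 1e6:.1f} Mbit/s at 8 bits per pixel")
```

At 640 x 480 and 30 fps each camera produces 9,216,000 pixels per second. With one byte per pixel that is about 9.2 megabytes per second, so the quoted figure of 9 may be a byte rate rather than a bit rate; that reading is a guess, not something the article states.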

High-Speed Data Logic

RocketIO™ MGTs are used to receive high-speed data from three peripheral boards, each transmitting data from a single MGT on a Virtex-II Pro 2VP7. The MGT data transfers are initiated by an external system triggered by the operator. This event is synchronized using interrupts to the SCP. The data from the three MGTs is formatted in the Virtex-II Pro device and used to form a larger data packet, which is then stored in DDR SDRAM for transmission over PCI.

The PCI section is used to transfer all the data stored in the DDR SDRAM to the x86 host processor environment. XDS designed a number of proprietary hardware blocks to control the flow of data from DDR SDRAM to the OPB-PCI core and over to the host x86 memory space. Configuration and status of the PCI core is carried out by the SCP through DCR registers.

Tools

During this project, XDS used several software tools to create the Panoramic system, to simulate the design, and to manage the development environment to ensure traceability and quality releases. The tools used are listed in Table 1.

Test and Verification

When designing a project of this size, test and verification are of paramount importance to the successful completion of the project. The XDS team focused a large number of resources to ensure the design met Optos’ requirements.

The team developed a full testbench that would fit around the design (shown in Figure 2). This testbench was designed to simulate every interface in the design. It consisted of a PowerPC microprocessor with peripheral cores to test protocols and communication interfaces. Both the design and the testbench used the swift model for the PowerPC microprocessor to run the software during simulation. A swift model is a cycle-accurate simulation model that can interpret PowerPC instructions and act accordingly.

Two software applications were used to test the system in simulation. The first ran on the testbench and the second ran on the system under test (SUT).

The function of the SUT application is to interact with the peripheral cores and to drive the inputs and outputs of the SUT. The application in the testbench interacts with the SUT in the following manner:

• Data received from the SUT is stored in memory.

• Protocol data is decoded, stored, and replied to, if appropriate.

The application for the SUT allows hardware design engineers to control many different features and to run a suite of tests in any order required. The XDS team also designed tests to verify the pure hardware interfaces without interaction with the applications.

All functionality was tested at unit, functional, and system level. The tests were all documented in the test and verification plan, which was approved by Optos before the system design phase began.

Conclusion

With Xilinx Design Services, you get a highly skilled and experienced team that will help get you to market faster by bridging the learning curve, delivering quality on-time designs, and working in close contact with your development team.

In a letter to the engineers at Xilinx Design Services, David Cairns, technical director at Optos, wrote, “I have been impressed with Xilinx Design Services. Their commitment to quality and high standards has eased transition into production. XDS went the extra mile, and we could not have achieved our design goals without their help.”

For more information about Xilinx Design Services, visit www.xilinx.com/xds.

| Function | Tool Used | Vendor |
|---|---|---|
| VHDL Design, Compile, Test | ISE 5.2i | Xilinx |
| Software Design, Compile, Test | EDK 3.2 | Xilinx |
| Logic and Software Simulation | MTI 5.6d | Mentor Graphics |
| Version Management | CVS | Open Source |

Table 1 – Software used to develop, simulate, and manage the development environment of the Panoramic system



PowerPC

“We designed our latest Nova identity4 broadcast production video switcher with what we consider a truly disruptive technology utilizing only a single Virtex™-II Pro FPGA. This approach reduced our part count 10-fold over prior systems while providing a per-channel cost at a fraction of our competition.

“We are very excited about the new Virtex-4 FX family of devices. The higher performance processor and integrated APU will allow us to rapidly create hardware modules for accelerating specific software functions. By leveraging these capabilities with the integrated EMAC cores and enhanced RocketIO™ transceivers available in Virtex-4 FX devices, our next-generation systems will further widen the distance between us and our competition.”

Xilinx Customer
Roger Smith, Chief Engineer, Echolab

“Leveraging a high degree of IP design re-use developed with our first Virtex-II Pro-based system, we can rapidly migrate our high-performance storage networking architecture to the Virtex-4 FX family of devices. By utilizing many of the FX features, we are able to increase both functionality and performance while reducing system cost.

“Key elements include the enhanced PowerPC™ processor running Linux OS to manage and control our high throughput multi-channel switching matrix and the integrated dual tri-mode Ethernet MAC and RocketIO transceivers for high-speed data transfer. The APU controller can also offload some tasks via custom-defined instructions, resulting in further computing and performance headroom.”

Xilinx Customer
Sandy Helton, Chief Technology Officer, SAN Valley Systems

MicroBlaze

“The MicroBlaze™ solution has proven to be far more than a companion microprocessor to our FPGA compute engines. It has become quite an integral part in shaping the way we are solving problems and is helping reduce our engineering costs.

“Currently, we use the MicroBlaze processor in our 68-billion-color LED display system to manage calibration and setup routines for the video processor module. The ability to easily modify these functions without completely changing the design allows us to offer a cutting-edge product.”

Xilinx Customer
Ricardo Ramos, President and CEO, Interativa Paineis Eletronicos

What Customers and Partners are Saying about Xilinx Embedded Processor Solutions

Real breakthroughs. Real stories.

“Incorporating the MicroBlaze processor inside a single FPGA allowed us to use a low pin-count device and reduce PCB complexity and cost. But the most significant impact that the MicroBlaze processor has offered is the ability to almost completely change a design without needing to change the physical design whatsoever. As requirements change, we no longer need to remap functionality into different peripherals, as is often the case when moving to a larger microcontroller. We can add, delete, or move peripherals; the Embedded Development Kit (EDK) makes this effortless.”

Xilinx Partner
Erik Widding, President, Birger Engineering

“The storage module was a nice application for the MicroBlaze embedded processor. We wrote routines in C for the MicroBlaze core for asynchronous transfer, including the packet handler and interpretation. The routines that needed acceleration were re-coded in RTL and moved to hardware implementations on the same Virtex-II device. The storage elements needed to take data streams in, process them, and put the data out to disk. We could accomplish that 100% with the MicroBlaze core, just not at speed.

“The MicroBlaze processor allowed us to update the program without reprogramming the device, drop in a test program in C, set breakpoints, and functionally debug by examining memory, variables, and registers. This helped us debug the functionality very quickly. We also were able to do performance analysis and profiling that helped us decide what modules needed acceleration.”

Xilinx Partner
Derek Stark, Senior Software Engineer, Nallatech

“The PicoBlaze processor made our transition towards embedding a software-based application into our well-known FPGA environment seamless and surprisingly simple.”

Xilinx Customer
Moti Cohen, Hardware Design Engineer, TeraSync

Xilinx Platform Studio

Regarding Xilinx Platform Studio winning the DesignVision Award:

“It is inspiring to see Platform Studio recognized in this way. The platform approach to next-generation embedded systems will become the only viable solution to meet the economic and technological challenges of the future. Xilinx continues to prove its undeniable leadership in developing world-class technologies that make FPGAs a compelling system platform for an expanding range of embedded processing applications.”

Jerry Worchel, Principal Analyst, In-Stat/MDR

Regarding Xilinx Platform Studio 6.3i product release:

“The platform approach will not only find a permanent home in the FPGA world, but in the ASIC and ASSP worlds as well. With this announcement, Xilinx continues to prove its undeniable leadership in this market.”

Max Baron, Principal Analyst, In-Stat/MDR

PicoBlaze

“The favorite toy in our Xilinx bag-of-tricks is the PicoBlaze™ processor. We could not have completed the sign project in the time allowed without extensive use of the PicoBlaze processor.

“The sign contains an impressive count of more than 1,000 of these embedded processors, with nine different designs. PicoBlaze processors provide efficient logic resource utilization by time-multiplexing logic circuits. The PicoBlaze processor also provides a quick and easy way to develop control functions. The alternative would be to build a custom state machine for each function. The PicoBlaze processor is a programmable state machine, meaning that the state machine is already built; one just has to program it.”

Xilinx Partner
Jason Daughenbaugh, Sr. Design Engineer, Advanced Electronic Designs (AED)

“We do four channels of PID + feedforward control + profile generation + host interface, all with 32-bit position resolution and 32-bit breakpoints at a sample rate of 10 kHz, with all four axes in motion. Not bad for an 8-bit microcontroller!”

Xilinx Customer
Peter Wallace, Partner, Mesa Electronics

“We needed the control that a CPU provides without the overhead of extra peripherals and external hardware. The PicoBlaze processor was easy to implement and worked very well in our custom protocol application.”

Xilinx Customer
Scott Orangio, Senior Hardware Engineer, Polycom Inc.


by Chris Borrelli
Embedded Networking Manager
Xilinx, [email protected]

The TCP/IP protocol suite is the de facto worldwide standard for communications over the Internet and almost all intranets. Interconnecting embedded devices is becoming standard practice even in device classes that were previously stand-alone entities.

By its very definition, an embedded architecture has constrained resources, which is often at odds with rising application requirements. Achieving wire-speed TCP/IP performance continues to be a significant engineering challenge, even for high-powered Intel™ Pentium™-class PCs.

In this article, we discuss the per-byte and per-packet overheads limiting TCP/IP performance and present the techniques utilized in the Xilinx Gigabit System Reference Design (GSRD) to maximize TCP/IP over Gigabit Ethernet performance in embedded PowerPC™-based applications.

Considerations for High-Bandwidth TCP/IP PowerPC Applications

The Xilinx Gigabit System Reference Design maximizes TCP/IP performance.

GSRD Overview

The GSRD terminates IP-based transport protocols such as TCP or UDP. It incorporates the embedded PowerPC and RocketIO™ blocks of the Virtex-II Pro™ device family, and is delivered as an Embedded Development Kit (EDK) reference system.

The reference system as described in Xilinx Application Note XAPP536 leverages a multi-port DDR SDRAM memory controller to allocate memory bandwidth between the PowerPC processor local bus (PLB) interfaces and two data ports. Each data port is attached to a direct memory access (DMA) controller, allowing hardware peripherals high-bandwidth access to memory.

A MontaVista™ Linux™ port is available for applications requiring an embedded operating system, while a commercial standalone TCP/IP stack from Treck™ is also available to satisfy applications with the highest bandwidth requirements.

System Architecture

Memory bandwidth is an important consideration for high-performance network-attached applications. Typically, external DDR memory is shared between the processor and one or more high-bandwidth peripherals such as Gigabit Ethernet.

The four-port multi-port memory controller (MPMC) efficiently divides the available memory bandwidth between the PowerPC’s instruction/data PLB interfaces and a communications direct memory access controller (CDMAC). The CDMAC provides two bi-directional channels of DMA that connect to peripherals through a Xilinx standard LocalLink streaming interface. The CDMAC implements data re-alignment to support arbitrary alignment of packet buffers in memory. A block diagram of the system is shown in Figure 1.

The LocalLink Gigabit Ethernet MAC (LLGMAC) peripheral incorporates the UNH-tested Xilinx LogiCORE™ 1-Gigabit Ethernet MAC to provide a 1 Gbps 1000BASE-X Ethernet interface to the reference system. The LLGMAC implements checksum offload on both the transmit and receive paths for optimal TCP performance. Figure 2 is a simplified block diagram of the peripheral.


[Figures 1 and 2: block diagrams. Figure 1 labels include the PPC405 with its instruction-side and data-side PLB ports, the four-port Multi Port Memory Controller (two PLB interface ports and the Communication DMA Controller with RX0/TX0 and RX1/TX1 channels) attached to DDR SDRAM, DCR bus peripherals (dual GPIO, UART Lite), the LocalLink GMAC peripheral, and a user-defined LocalLink peripheral. Figure 2 labels include the LocalLink TX and RX signals, the DCR interface and peripheral registers, the TX and RX FIFOs, checksum generation, checksum FIFO, and checksum insertion blocks, and the Xilinx LogiCORE 1-Gigabit Ethernet MAC with its TXP/TXN and RXP/RXN serial pins.]

Figure 2 – LocalLink Gigabit Ethernet MAC peripheral block diagram

Figure 1 – GSRD system block diagram

TCP/IP Per-Byte Overhead

Per-byte overhead occurs when the processor touches payload data. The two most common operations of this type are buffer copies and TCP checksum calculation. Buffer copies represent a significant overhead for two reasons:

1. Most of the copies are unnecessary.

2. The processor is not an efficient data mover.

TCP checksum calculation is also expensive, as it is calculated over each payload data byte.
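To see why the checksum is a per-byte cost, here is a minimal software sketch of the 16-bit ones'-complement Internet checksum described in RFC 1071, which TCP uses. It is illustrative only: the function name is ours, and a real TCP checksum also covers a pseudo-header and the TCP header, which this sketch leaves out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over a byte string (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Every payload byte is read once: this loop is the per-byte cost.
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    sample = bytes.fromhex("0001f203f4f5f6f7")  # example bytes from RFC 1071
    print(hex(internet_checksum(sample)))
```

The loop visits every 16-bit word of the payload, so on a processor the work grows linearly with the amount of data sent.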

Embedded TCP/IP-enabled applications such as medical imaging require near wire-speed TCP bandwidth to reliably transfer image data over a Gigabit Ethernet network. The data is generated from a high-resolution image source, not the processor.

In this case, introducing a zero-copy software API and offloading the checksum calculation into FPGA fabric completely removes the per-byte overheads. “Zero-copy” is a term that describes a TCP software interface where no buffer copies occur. Linux and other operating systems have introduced software interfaces like sendfile() that serve this purpose, and commercial standalone TCP/IP stack vendors like Treck offer similar zero-copy features. These software features allow the removal of buffer copies between the user application and the TCP/IP stack or operating system.
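As a host-side illustration of the sendfile() style of interface, the sketch below uses Python's standard `socket.sendfile()` method. It is not GSRD or Treck code, and the helper name `send_file_zero_copy` is hypothetical.

```python
import socket


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket; return the bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile() uses os.sendfile() where the platform supports it,
        # so the kernel moves the data without a user-space buffer copy.
        # Elsewhere it falls back to ordinary send() calls.
        return sock.sendfile(f)
```

The application never reads the file into its own buffer. It hands the kernel a file object and a socket, which is the copy that a zero-copy interface is meant to remove.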

The data re-alignment and the checksum offload features of GSRD provide the hardware support necessary for zero-copy functionality. The data re-alignment feature is a flexibility of the CDMAC that allows software buffers to be located at any byte offset. This removes the need for the processor to copy unaligned buffers.

Checksum offload is a feature of the LocalLink Gigabit Ethernet (LLGMAC) peripheral. It allows the TCP payload checksum to be calculated in FPGA fabric as Ethernet frames are transferred between main memory and the peripheral’s hardware FIFOs. GSRD removes the need for costly buffer copies and processor checksum operations, leaving the PowerPC 405 to process only protocol headers.

TCP/IP Per-Packet Overhead

Per-packet overhead is associated with operations surrounding the transmission or reception of packets. Packet interrupts, hardware interfacing, and header processing are examples of per-packet overheads.

Interrupt overhead represents a considerable burden on the processor and memory subsystem, especially when small packets are transferred. Interrupt moderation (coalescing) is a technique used in GSRD to alleviate some of this pressure by amortizing the interrupt overhead across multiple packets. The DMA engine waits until there are n frames to process before interrupting the processor, where n is a software-tunable value.
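The effect of the threshold n can be illustrated with a small software model. This is a sketch of the counting idea only, not the CDMAC logic; the class and attribute names are hypothetical.

```python
class InterruptCoalescer:
    """Model of interrupt moderation: one interrupt per `threshold` frames."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold  # the software-tunable value n
        self.pending = 0            # frames completed since the last interrupt
        self.interrupts = 0         # interrupts raised so far

    def frame_done(self) -> bool:
        """Record one completed frame; return True if an interrupt fires."""
        self.pending += 1
        if self.pending >= self.threshold:
            self.pending = 0
            self.interrupts += 1
            return True
        return False


if __name__ == "__main__":
    for n in (1, 8):
        model = InterruptCoalescer(n)
        for _ in range(1000):
            model.frame_done()
        print(f"n={n}: {model.interrupts} interrupts for 1000 frames")
```

With n = 8, a burst of 1,000 frames raises 125 interrupts instead of 1,000. Coalescing schemes commonly pair the counter with a timeout so that a short burst is not left waiting; whether GSRD does so is not stated here.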

Transferring larger sized packets (jumbo frames of 9,000 bytes) has a similar effect by reducing the number of frames transmitted, and therefore the number of interrupts generated. This amortizes the per-packet overhead over a larger data payload. GSRD supports the use of Ethernet jumbo frames.
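A rough calculation shows how much jumbo frames cut the frame rate on a saturated Gigabit link. The sketch assumes that 9,000 bytes is the frame payload size and that each frame carries the standard 38 bytes of Ethernet framing overhead (preamble, MAC header, frame check sequence, and inter-frame gap). Both are our assumptions rather than figures from the reference design.

```python
# Per-frame overhead in bytes: preamble/SFD, MAC header, FCS, inter-frame gap.
ETH_OVERHEAD_BYTES = 8 + 14 + 4 + 12


def frames_per_second(payload_bytes: int, link_bps: int = 1_000_000_000) -> float:
    """Maximum frames per second on a saturated link for a given payload size."""
    bits_on_wire = (payload_bytes + ETH_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    for size in (1500, 9000):
        print(f"{size}-byte payload: {frames_per_second(size):,.0f} frames/s")
```

Under these assumptions a 1 Gbps link carries about 81,000 standard frames per second but only about 14,000 jumbo frames per second, roughly a sixfold reduction in per-packet work.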

The components of GSRD use the device control register (DCR) bus for control and status. This provides a clean interface to software without interfering with the high-bandwidth data ports. The per-packet features of GSRD help make efficient use of the processor and improve system-level TCP/IP performance.

Conclusion

The Xilinx GSRD is an EDK-based reference system geared toward high-performance bridging between TCP/IP-based protocols and user data interfaces like high-resolution image capture or Fibre Channel. The components of GSRD contain features to address the per-byte and per-packet overheads of a TCP/IP system.

Table 1 details the GSRD TCP transmit performance with varying levels of optimization for Linux and standalone Treck stacks.
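The relative gains in Table 1 are easier to compare as ratios. The short sketch below uses only the four bandwidth figures from the table; the dictionary layout and function names are ours.

```python
# TCP transmit bandwidth in Mbps, copied from Table 1 (9000-byte jumbo frames).
RESULTS = {
    ("MontaVista Linux", "None"): 270,
    ("MontaVista Linux", "Zero-copy, checksum offload"): 540,
    ("Treck", "Zero-copy"): 490,
    ("Treck", "Zero-copy, checksum offload"): 780,
}


def speedup(stack: str, before: str, after: str) -> float:
    """Ratio of two measured bandwidths for the same TCP/IP stack."""
    return RESULTS[(stack, after)] / RESULTS[(stack, before)]


def link_utilization(mbps: float, link_mbps: float = 1000.0) -> float:
    """Fraction of a Gigabit Ethernet link's nominal rate that is achieved."""
    return mbps / link_mbps


if __name__ == "__main__":
    print(speedup("MontaVista Linux", "None", "Zero-copy, checksum offload"))
    print(round(speedup("Treck", "Zero-copy", "Zero-copy, checksum offload"), 2))
    print(link_utilization(780))
```

In these figures, zero-copy plus checksum offload doubles Linux throughput. Adding checksum offload to the zero-copy Treck stack raises it by about 59 percent, to 78 percent of the nominal 1 Gbps line rate.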

Future releases of GSRD will explore further opportunities for TCP acceleration using the FPGA fabric to offload functions such as TCP segmentation.

The GSRD Verilog™ source code is available as part of Xilinx Application Note XAPP536. It leverages the MPMC and CDMAC detailed in Xilinx Application Note XAPP535 to allocate memory bandwidth between the processor and the LocalLink Gigabit Ethernet MAC peripheral. The MPMC and CDMAC can be leveraged for PowerPC-based embedded applications where high-bandwidth access to DDR SDRAM memory is required.

For more information about XAPP536 and XAPP535, visit www.xilinx.com/gsrd/.


| TCP/IP Stack | Ethernet Frame Size | Optimization | TCP Transmit Bandwidth |
|---|---|---|---|
| MontaVista Linux | 9000 bytes (jumbo) | None | 270 Mbps |
| MontaVista Linux | 9000 bytes (jumbo) | Zero-copy, checksum offload | 540 Mbps |
| Treck, Inc | 9000 bytes (jumbo) | Zero-copy | 490 Mbps |
| Treck, Inc | 9000 bytes (jumbo) | Zero-copy, checksum offload | 780 Mbps |

Associated Links:

Xilinx XAPP536, “Gigabit System Reference Design” (http://direct.xilinx.com/bvdocs/appnotes/xapp536.pdf)

Xilinx XAPP535, “High Performance Multi Port Memory Controller” (http://direct.xilinx.com/bvdocs/appnotes/xapp535.pdf)

Treck, Inc. (www.treck.com)

MontaVista Software (www.mvista.com)

“End-System Optimizations for High-Speed TCP” (www.cs.duke.edu/ari/publications/end-system.pdf)

“Use sendfile to optimize data transfer” (http://builder.com.com/5100-6372-1044112.html)

Table 1 – TCP transmit benchmark results

by Gordon Cameron
Business Development Manager
Accelerated Technology, Mentor [email protected]

Accelerated Technology’s Nucleus™ PLUS real-time operating system (RTOS) is already available for both the Xilinx® MicroBlaze™ 32-bit soft processor core and the IBM™ PowerPC™ 405 core integrated into Virtex-II Pro™ devices. This deterministic, fast, small footprint RTOS is ideal for “hard” real-time applications.

With the release of the Xilinx Platform Studio EDK 6.3i, configuration of this leading royalty-free RTOS on your newly designed system is as easy as selecting from a pull-down menu. Instead of spending hours modifying your target software to work with your new hardware configuration, you can configure the target software automatically in minutes, without the error-prone possibilities of configuring by hand. This is especially valuable during the earlier design phases when the hardware may be changing frequently. This process was enabled by one of the underlying technologies of Xilinx Platform Studio EDK, called microprocessor library definition, or MLD.

MLD Technology

The Xilinx Platform Studio EDK development system is based on a data-driven code base that makes it extensible and open. MLD is one example of this underlying capability. It was created specifically to allow you to easily create and modify kernel configurations and associated board support packages (BSPs) for partner-supported RTOSs like Nucleus PLUS and its extensive middleware offering.

MLD has two required file types: the data definition file (.MLD) and data generation file (.Tcl). The .MLD file contains the Nucleus user-customization parameters, while the .Tcl file is a Tcl script that defines a set of Nucleus-specific procedures for building the final software system (see Figure 1).

Configure and Build the Embedded Nucleus PLUS RTOS Using Xilinx EDK

Nucleus PLUS RTOS for MicroBlaze and PowerPC 405 processors is now automatically configurable using XPS MLD technology.

Installing MLD files in XPSThe installation CD of Nucleus PLUSinstalls the RTOS, associated drivers, andthe two Nucleus-configured MLD filesthat enable you to use MLD technologywithin the Xilinx Platform Studio EDK.The default install path for the MLD filesis the \nucleus\bsp sub-directory, located in\edk_user_repository.

Once you have installed these, youcan use Xilinx Platform Studio EDKwith Nucleus now visible in the RTOSpull-down selection menu. See Figure 3afor the PPC405 and Figure 3b for theMicroBlaze processor.

Evaluating Nucleus PLUS in EDKThe Accelerated Technology NucleusPLUS evaluation software provided inthe EDK Platform Studio 6.3i shipmentincludes a limited version (LV) ofNucleus PLUS. This is a fully function-al version of the RTOS compiled into alibrary format (rather than the normalsource code distribution) with the singlerestriction that it will stop working after60 minutes, facilitating evaluation of itsfull functionality. When you purchase afull license of Nucleus PLUS fromAccelerated Technology, you receive thefull source code and, obviously, the 60-minute run time restriction is lifted.

The LV version of Nucleus PLUS isconfigured to execute from the off-chipSRAM or SDRAM module. Once youhave a full license to the RTOS, you canconfigure it to run from any memory inyour system.

Nucleus PLUS is a scalable RTOS: only the software you use in your design is included in the downloaded code. This may be contrasted with other larger, more static systems, which consume far more system resources. In some circumstances, the whole RTOS and application can fit in the on-chip memory, thus achieving high performance and low power consumption. Even with larger applications, which may utilize extensive middleware, the efficient use of the relatively small amount of on-chip memory means that the size of the kernel footprint is an important consideration.

You can configure other components of the Nucleus system by hand to work in this environment, such as networking, web server, graphics, file management, USB, WiFi, and CAN bus. Future releases of Nucleus will move these products into full integration with Xilinx Platform Studio EDK and MLD technology.

Accelerated Technology supplies these files and the associated installer as an evaluation disk, included in the latest release of Xilinx EDK 6.3i. Accelerated Technology has also established a website to support and distribute this evaluation. This site contains updates, evaluations, reference designs, and documentation for all of the Accelerated Technology Xilinx offerings and will be updated regularly with new middleware implementations that you can add to the automatic configuration of your application. The website is located at www.acceleratedtechnology.com/xilinx/.

To get up and running quickly with your first Nucleus-based system, the installation also includes a sample pre-built reference design with a compiled Nucleus PLUS demonstration. The pre-built reference designs currently support the Memec™ design-based DS-KIT-2VP7FG456 and DS-KIT-V2MB1000 FPGAs. This is the fastest method to employ for a sample Nucleus-based, MLD-enabled Xilinx system.

Use of Xilinx's Base System Builder is also well documented inside the application notes accompanying the installation. With the Base System Builder, you can build a variety of system core configurations to work with the Nucleus PLUS RTOS (see Figure 2).

If you have received your EDK 6.3i update recently or have purchased a seat, please check the contents for this evaluation. After running the Nucleus PLUS evaluation disk installer, the necessary files will be placed into the Xilinx EDK 6.3i installation and support for Nucleus PLUS will be added automatically.

The elements of Nucleus PLUS modified by the data generation file (.Tcl) for specific hardware configuration are:

• The number and type of devices used by the hardware designer

• Memory map information

• Locations of memory-mapped device registers

• Timer configuration

• Interrupt controller configuration
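As a rough illustration of the kind of information listed above, the following sketch gathers it into a single C table that an RTOS board support package could consult at start-up. Every type name, device name, address, and value here is hypothetical; the files actually produced by the EDK data generation script are not reproduced in this article.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one memory-mapped device in the design. */
typedef struct {
    const char *name;   /* device instance name */
    uint32_t    base;   /* base address of its register window */
    uint32_t    size;   /* size of the register window in bytes */
    int         irq;    /* interrupt controller input, or -1 if none */
} bsp_device_t;

/* Hypothetical board configuration: memory map, timer, and device list. */
typedef struct {
    uint32_t            ram_base;     /* start of on-chip memory */
    uint32_t            ram_size;     /* size of on-chip memory in bytes */
    uint32_t            timer_hz;     /* RTOS tick rate */
    int                 timer_irq;    /* interrupt input used by the timer */
    size_t              num_devices;  /* number of entries in devices[] */
    const bsp_device_t *devices;
} bsp_config_t;

/* Illustrative values only; a real design would supply its own. */
const bsp_device_t example_devices[] = {
    { "uart0",  0x40600000u, 0x100u,  2 },
    { "gpio0",  0x40000000u, 0x100u, -1 },
    { "timer0", 0x41C00000u, 0x100u,  0 },
};

const bsp_config_t example_config = {
    0x00000000u, 0x00010000u, 100u, 0,
    sizeof example_devices / sizeof example_devices[0],
    example_devices
};

/* Look up a device by name; returns NULL when the design does not contain it. */
const bsp_device_t *bsp_find_device(const bsp_config_t *cfg, const char *name)
{
    size_t i;
    for (i = 0; i < cfg->num_devices; i++) {
        if (strcmp(cfg->devices[i].name, name) == 0)
            return &cfg->devices[i];
    }
    return NULL;
}
```

A driver could call `bsp_find_device(&example_config, "uart0")` and skip initialization when the result is NULL, which is one way a single RTOS image can adapt to different hardware builds.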


Figure 1 – Outline of an MLD-enabled system design. Diagram labels: HW Design, OS Selection, XPS, HW Netlist (to ISE), RTOS BSP (to RTOS), MLD Files.

Figure 2 – The hardware design is complete and ready to configure the software.

Nucleus PLUS and Xilinx Devices

As we have said, the process of creating a working BSP for Nucleus PLUS begins with configuring the hardware platform using Xilinx Base System Builder or the supplied sample reference designs. Then you can go to the Project > Software Platform Settings menu item and select the operating system you want to use from the list. Choosing the Nucleus option (Figure 3 a/b) will make available the specific software settings for the RTOS under each of the tabs on the Software Platform Settings menu (Figure 4 shows the user enabling the cache on the PowerPC). Note that in the LV version most of the software options are disabled, but can be changed in the full version.

Once you are satisfied with the software settings, you can use the Generate Netlist and Generate Bitstream commands and download the hardware configuration onto the FPGA using Xilinx XPS or ISE tools.

You can now execute the Tools > Generate Libraries and BSP commands to configure Nucleus PLUS. The application software can then be linked with the RTOS. Now you are ready to switch over to the GDB debugger and download the combined RTOS and application image to the FPGA.

Advanced Software Tools

Up to this point, we have bypassed many aspects of application software design, assuming that you have code ready to compile, link, and download to the FPGA. In fact, as systems become ever more complex, both hardware and software designers require advanced state-of-the-art tools to help them complete their projects within budget and on time.

The Xilinx EDK-configurable version of Nucleus PLUS uses the standard GNU suite of tools supplied with the Xilinx EDK package. This is more than adequate for getting many projects up and running, but advanced application development often needs more. Accelerated Technology can provide a complete range of tools that encompass all phases of the software design process.

• If code footprint or performance is important, then consider the highly optimizing Microtec compiler for PowerPC Virtex-II Pro devices. This ensures that the code that is shipped is the same as the code that is debugged – a goal not achieved by many compilers.

• Application debugging often needs RTOS awareness, advanced breakpoints, and debugging of fully optimized code. These features are available on PowerPC Virtex-II Pro devices with the industry-standard XRAY debugger.

• To bring software development forward in time so that it can be started before the hardware is complete, software teams can use our advanced prototyping products Nucleus SIM or Nucleus SIMdx. These tools allow the development of the complete application software in a host-based environment.

• UML enables software teams to raise their level of abstraction and produce models of their software. Nucleus BridgePoint enables full code generation by using the xtUML subset of UML 2.0.

• You can verify software/hardware interaction in the Mentor Graphics® Seamless® co-verification environment, which allows combined hardware and software simulation for PowerPC Virtex-II Pro devices.

These tools, when combined with the Nucleus PLUS RTOS, are ideal for helping you maximize the functionality and efficiency of your designs.

Conclusion

The latest EDK-configurable Nucleus PLUS RTOS brings a new dimension to systems incorporating high-performance embedded processors from Xilinx. Its small size means that it can use available on-chip memory to minimize power dissipation and deliver increased performance, while its wealth of middleware makes it ideal for products targeted at the networking, telecommunications, data, communication, and consumer markets.

Making this solution easy to configure within Xilinx EDK allows you to easily exploit the benefits of this powerful product. For more information, visit www.acceleratedtechnology.com or www.mentor.com.

Figure 3a – After installing Nucleus PLUS in EDK, Nucleus appears as an option in the drop-down menu for choosing which operating system to use with the PowerPC 405 processor.

Figure 3b – Nucleus appears as an option in the drop-down menu for choosing which operating system to use with the MicroBlaze soft-core processor.

Figure 4 – Enabling the cache in the RTOS configuration parameters


MicroBlaze and PowerPC Cores as Hardware Test Generators

Combining FPGA embedded processors with C-to-RTL compilation can accelerate the testing of complex hardware modules.

by David Pellerin
CTO
Impulse Accelerated [email protected]

Milan Saini
Technical Marketing Manager
Xilinx, [email protected]

Whether or not your FPGA design itself includes a processor core, a Xilinx® MicroBlaze™ or IBM™ PowerPC™ embedded processor can accelerate unit testing and debugging of many types of FPGA-based application components.

C code running on an embedded processor can act as an in-system software/hardware test bench, providing test inputs to the FPGA, validating the results, and obtaining performance numbers. In this role, the embedded processor acts as a vehicle for in-system FPGA verification and as a complement to hardware simulation.

By extending this approach to include not only C compilation to the embedded processor but C-to-hardware compilation as well, it is possible, with minimal effort, to create high-performance, mixed software/hardware test benches that closely model real-world conditions.

Key to this approach are high-performance standardized interfaces between test software (C-language test benches) running on the embedded processor and other components (including the hardware under test) implemented in the FPGA fabric. These interfaces take advantage of communication channels available in the target platform.

For example, the MicroBlaze soft-core processor has access to a high-speed serial interface called the Fast Simplex Link, or FSL. The FSL is an on-chip interconnect feature that provides a high-performance data channel between the MicroBlaze processor and the surrounding FPGA fabric.

Similarly, the PowerPC hard processor core, as implemented in Virtex-II Pro™ and Virtex-4™ FPGAs, provides high-performance communication channels through the processor local bus (PLB) and on-chip memory (OCM) interfaces, as illustrated in Figure 1.

Using these Xilinx-provided interfaces to define an in-system unit test allows you to quickly verify critical components of a larger application. Unlike system tests (which model real-world conditions of the entire application), a unit test allows you to focus on potential trouble spots for a given component, such as boundary conditions and corner cases, that might be difficult or impossible to test from a system-level perspective. Such unit testing improves the quality and robustness of the application as a whole.

Unit Testing

A comprehensive hardware/software testing strategy includes many types of tests, including the previously described unit tests, for all critical modules in an application. Traditionally, system designers and FPGA application developers have used HDL simulators for this purpose.

Using simulators, the FPGA designer creates test benches that exercise specific modules by providing stimulus (test vectors or their equivalents) and verifying the resulting outputs. For algorithms that process large quantities of data, such testing methods can result in very long simulation times, or may not adequately emulate real-world conditions. Adding an in-system prototype test environment bolsters simulation-based verification and allows more complex real-world testing scenarios.

Unit testing is most effective when it focuses on unexpected or boundary conditions that might be difficult to generate when testing at the system level. For example, in an image processing application that performs multiple convolutions in sequence, you may want to focus your efforts on one specific filter by testing pixel combinations that are outside the scope of what the filter would normally encounter in a typical image.

It may be impossible to test all permutations from the system perspective, so the unit test lets you build a suite to test specific areas of interest or test only the boundary/corner cases. Performing these tests with actual hardware (which may for testing purposes be running at slower than usual clock rates) obtains real, quantifiable performance numbers for specific application components.
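To make the idea of a boundary-focused unit test concrete, here is a minimal sketch in C. It assumes a hypothetical 3-tap smoothing filter with kernel [1 2 1]/4 on 8-bit pixels; `uut_filter3` is only a software stand-in for the hardware filter, and in a real in-system test its result would be read back from the FPGA fabric instead. The function names and the choice of edge values are illustrative assumptions, not taken from any Xilinx or Impulse library.

```c
#include <stdint.h>

/* Stand-in for the hardware filter under test: kernel [1 2 1]/4, rounded.
 * In a real in-system test this result would come from the FPGA. */
uint8_t uut_filter3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a + 2u * b + c + 2u) >> 2);
}

/* Reference model, written independently in wider arithmetic. */
static unsigned ref_filter3(unsigned a, unsigned b, unsigned c)
{
    unsigned sum = a + 2u * b + c;    /* at most 1020 for 8-bit inputs */
    unsigned out = (sum + 2u) / 4u;   /* round to nearest */
    return out > 255u ? 255u : out;   /* saturate to the 8-bit range */
}

/* Try every combination of the chosen edge values on the three taps.
 * Returns the number of mismatches between the unit and the model. */
int run_boundary_tests(void)
{
    static const uint8_t edge[] = { 0, 1, 127, 128, 254, 255 };
    const int n = (int)(sizeof edge / sizeof edge[0]);
    int failures = 0;
    int i, j, k;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                if (uut_filter3(edge[i], edge[j], edge[k]) !=
                    ref_filter3(edge[i], edge[j], edge[k]))
                    failures++;
    return failures;
}
```

The suite covers 216 combinations of extreme pixel values that a typical image would rarely present to the filter, which is exactly the kind of coverage that is hard to reach from a system-level test.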

Introducing C-to-RTL compilation into the testing strategy can be an effective way to increase testing productivity. For example, to quickly generate mixed software/hardware test routines that run both on the embedded processor and in dedicated hardware, you can use tools such as CoDeveloper (available from Impulse Accelerated Technologies) to create prototype hardware and custom test generation hardware that operates within the FPGA to generate sample inputs and validate test outputs.

CoDeveloper generates FPGA hardware from the C-language software processes and automatically generates software-to-hardware and hardware-to-software interfaces. You can optimize these generated interfaces for the MicroBlaze processor and its FSL interface or the PowerPC and its PLB interface. Other approaches to data movement, including shared memories, are also supported.

Figure 1 – Hardware and software test components are mapped to the FPGA target and communicate across the OPB or FSL bus to the MicroBlaze or PowerPC processor. Diagram labels: an Impulse C application with software processes on a MicroBlaze, PowerPC, or other processor; an OPB, FSL, or other interface; hardware processes in the FPGA hardware resources of a Virtex or Spartan FPGA.

Desktop Simulation and Modeling Using C

Using C language for hardware unit testing lets you create software/hardware models (for the purpose of algorithm debugging) in software, using Microsoft™ Visual Studio™, GCC/GDB, or similar C development and debugging environments. For the purpose of desktop simulation, the complete application – the unit under test, the producer and consumer test functions, and any other needed test bench elements – is described using C, compiled under a standard desktop compiler, and executed.

Although you can do this using SystemC, the complexity of SystemC libraries (in particular their support for data-flow abstractions through channels) makes the process of creating such test benches somewhat complex. CoDeveloper's Impulse C libraries take a simpler approach, providing a set of functions that allow multiple C processes – representing parallel software or hardware modules – to be described and interconnected using buffered communication channels called streams.

Impulse C also supports communication through signals and shared memories, which are useful for testing hardware processes that must access external or static data such as coefficients.

Data Throughput and Processor Selection

When evaluating processors for in-system testing, you must first consider the fact that the MicroBlaze processor or any other soft processor requires a certain amount of area in the target FPGA device. If you are only using the MicroBlaze processor as a test generator for a relatively small element of your complete application, this added resource usage may be of no concern. If, however, the unit under test already pushes the limits in the FPGA, you may want to target a bigger device during the testing phase or consider the PowerPC core provided in the Virtex-II Pro and Virtex-4 platforms as an alternative.

Synthesis time can also be a factor. Depending on the synthesis tool you use, adding a MicroBlaze core to your complete application may add substantially to the time required to synthesize and map the application to the FPGA, which can be a factor if you are performing iterative compile, test, and debug operations.

Again, the PowerPC core, being a hard core that does not require synthesis, has an advantage over the MicroBlaze core when design iteration times are a concern. The 16 KB of data cache and 16 KB of instruction cache available in the PowerPC 405 processor also make it possible to run small test programs entirely within cache memory, thereby increasing the performance of the test application.

If a high test data rate (the throughput from the processor to the FPGA) is your primary concern, using the MicroBlaze core with the FSL bus or the PowerPC with its on-chip memory (OCM) interface will provide the highest possible performance for streaming data between software and hardware components.

By using CoDeveloper and the Impulse C libraries, you can make use of multiple streaming software/hardware interfaces using a common set of stream read and write functions. These functions provide an abstract programming model for streaming communications. Figure 2 shows how the Impulse C library functions support streams-based communication on the software side of a typical streaming interface.
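The streaming model can be sketched on a desktop machine with a small software FIFO standing in for the hardware channel. The code below is a hedged illustration only: `stream_t`, `stream_write`, `stream_read`, and `uut_process` are hypothetical names invented for this sketch and are not the Impulse C API. On a real target, the FIFO would be replaced by an interface such as the FSL, and the unit under test would run in the FPGA fabric.

```c
#include <stdint.h>

#define STREAM_DEPTH 16u   /* FIFO depth; must be a power of two */

/* Hypothetical software model of a buffered stream (a FIFO). */
typedef struct {
    int32_t  data[STREAM_DEPTH];
    unsigned head;   /* total number of words written so far */
    unsigned tail;   /* total number of words read so far */
} stream_t;

void stream_init(stream_t *s) { s->head = 0; s->tail = 0; }

/* Write one word; returns 1 on success, 0 if the stream is full. */
int stream_write(stream_t *s, int32_t v)
{
    if (s->head - s->tail == STREAM_DEPTH)
        return 0;
    s->data[s->head % STREAM_DEPTH] = v;
    s->head++;
    return 1;
}

/* Read one word; returns 1 on success, 0 if the stream is empty. */
int stream_read(stream_t *s, int32_t *v)
{
    if (s->head == s->tail)
        return 0;
    *v = s->data[s->tail % STREAM_DEPTH];
    s->tail++;
    return 1;
}

/* Stand-in for the hardware process under test: scales each sample by 3.
 * Stops when the input is empty or the output has no room left. */
void uut_process(stream_t *in, stream_t *out)
{
    int32_t v;
    while (out->head - out->tail < STREAM_DEPTH && stream_read(in, &v))
        (void)stream_write(out, 3 * v);
}

/* Test bench: produce up to n samples, run the unit, then consume and
 * check the results. Returns the number of mismatches. */
int run_stream_test(int n)
{
    stream_t to_hw, from_hw;
    int32_t v;
    int i, sent = 0, failures = 0;

    stream_init(&to_hw);
    stream_init(&from_hw);
    for (i = 0; i < n; i++)                 /* producer */
        sent += stream_write(&to_hw, i - 4);
    uut_process(&to_hw, &from_hw);          /* unit under test */
    for (i = 0; i < sent; i++)              /* consumer */
        if (!stream_read(&from_hw, &v) || v != 3 * (i - 4))
            failures++;
    return failures;
}
```

Because the producer and consumer only ever call the read and write functions, the same test code could in principle be retargeted to a different channel by swapping the implementation behind those two calls.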

Moving Test Generators to Hardware

To maximize the performance of test generation software routines, you can migrate critical test functions such as stimulus generators into hardware. Rather than re-implementing such functions in VHDL or Verilog™, you can use automated C-to-RTL compilation to quickly generate hardware representing test producer or consumer functions. These functions interact with the unit under test, using FIFO or other interfaces to implement data streams and supply other test inputs.

The CoDeveloper C-to-RTL compiler analyzes C processes (individual functions that communicate via streams, signals, and shared memories) and generates synthesizable HDL compatible with Xilinx Platform Studio (EDK), Xilinx ISE, and third-party synthesis tools including Synplicity® (Figure 3). The generated RTL is automatically parallelized at the level of inner code loops to reduce process latencies and increase data rates for output data streams.

Automated compilation, together with the ability to express system-level parallelism (creating multiple pipelined processes, for example), makes it possible to generate hardware directly from C that runs orders of magnitude faster than the equivalent algorithm implemented in software on the embedded microprocessor. This creates hardware test generators that generate outputs at a high rate.
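A stimulus generator is a natural candidate for this kind of migration because it is usually a small loop of shifts and XORs. The sketch below shows one in plain C: a 16-bit linear feedback shift register (polynomial x^16 + x^14 + x^13 + x^11 + 1) that produces pseudo-random test words. The function names are hypothetical, and whether a particular C-to-RTL compiler accepts this code unchanged is an assumption; it is meant only to show the style of C that maps readily to hardware.

```c
#include <stdint.h>

/* One step of a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, 11. */
uint16_t lfsr_next(uint16_t state)
{
    uint16_t bit = (uint16_t)(((state >> 0) ^ (state >> 2) ^
                               (state >> 3) ^ (state >> 5)) & 1u);
    return (uint16_t)((state >> 1) | (bit << 15));
}

/* Fill buf with n pseudo-random stimulus words, starting from seed.
 * Returns the final state so that generation can be resumed later. */
uint16_t generate_stimulus(uint16_t seed, uint16_t *buf, unsigned n)
{
    uint16_t s = seed;
    unsigned i;
    for (i = 0; i < n; i++) {
        s = lfsr_next(s);
        buf[i] = s;
    }
    return s;
}

/* Count the steps until the sequence returns to a nonzero seed. */
unsigned long lfsr_period(uint16_t seed)
{
    unsigned long steps = 0;
    uint16_t s = seed;
    do {
        s = lfsr_next(s);
        steps++;
    } while (s != seed);
    return steps;
}
```

Because the generator is deterministic, the same seed can be used in desktop simulation and in hardware, so the consumer side of the test bench can predict every expected input.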

Does C-Based Testing Eliminate the Need for HDL Simulators?

C-based test methods such as those described in this article are a useful addition to a designer's bag of tricks, but they are certainly not replacements for comprehensive hardware simulation. HDL simulation can be an effective way to determine cycle counts and explore hardware interface issues. HDL simulators can also help alleviate the typically long compile/synthesize/map times required before testing a given hardware module in-system. Hardware simulators provide much more visibility into a design under test, and allow single-stepping and other methods to be used to zero in on errors.

If tests require very specific timing, using an embedded processor to create test data will most likely result in data rates that are only a fraction of what is needed to obtain timing closure. In fact, if the test routine is implemented as a state machine on the processor, the speed at which the state machine can be made to operate will be slower than the clock frequency of the test logic in hardware. Hence, for most cases, the hardware portion would need to slow down so the CPU can keep pace – providing test stimulus and measuring expected responses. Alternatively, you can create a buffered interface – a software-to-hardware bridge – to manage the test data using a streaming programming model.
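The slowdown described above can be estimated with simple arithmetic. If the processor needs a certain number of its own clock cycles to produce each test word, and the hardware consumes one word per hardware clock, then the hardware clock must be divided until its word rate no longer exceeds what the processor can supply. The helper below is a sketch of that calculation; the function name and the example figures are hypothetical and are not measurements from any of the platforms discussed here.

```c
/* Smallest integer clock divider for the hardware under test such that
 * the processor can supply one word per (divided) hardware clock.
 *
 * The processor supplies f_cpu_hz / cpu_cycles_per_word words per second,
 * so we need f_hw_hz / div <= f_cpu_hz / cpu_cycles_per_word, that is
 * div >= f_hw_hz * cpu_cycles_per_word / f_cpu_hz, rounded up.
 * All three arguments must be nonzero. */
unsigned long clock_divider(unsigned long f_hw_hz,
                            unsigned long f_cpu_hz,
                            unsigned long cpu_cycles_per_word)
{
    unsigned long long num =
        (unsigned long long)f_hw_hz * cpu_cycles_per_word;
    return (unsigned long)((num + f_cpu_hz - 1u) / f_cpu_hz);
}
```

For instance, with both clocks at 100 MHz and an assumed 20 processor cycles per word, `clock_divider(100000000ul, 100000000ul, 20ul)` returns 20, meaning the hardware would have to run at one-twentieth of its nominal rate unless a buffered interface absorbs the difference.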

Given the performance differences between a processor-based test bench and the potential performance of an all-hardware system, it should be clear that software-based testing of such applications cannot replace true hardware simulation, in which you can observe, using post-route simulation models, the design running at any simulated clock speed.

Conclusion

In-system testing using embedded processors is an excellent complement to simulation-based testing methods, allowing you to test hardware elements at lower clock rates efficiently using actual hardware interfaces and potentially more accurate real-world input stimulus. This helps to augment simulation, because even at reduced clock rates the hardware under test will operate substantially faster than is possible in RTL simulation.

By combining this approach with C-to-hardware compilation tools, you can model large parts of the system (including the hardware test bench) in C language. The system can then be iteratively ported to hand-coded HDL or optimized from the level of C code to create increasingly high-performance system components and their accompanying unit- and system-level tests.

For more information, visit www.impulsec.com, e-mail [email protected], or call (425) 576-4066.


Figure 2 – A software-based unit test operating on the embedded processor communicates with the hardware unit under test through data streams.

Figure 3 – Hardware and software C code is compiled and debugged in a standard IDE; the interfaces are automatically generated and the generated HDL synthesized in Xilinx tools before downloading into the Virtex-II, Virtex-II Pro, or Virtex-4 device. Diagram labels: Impulse C design files and Impulse platform libraries; Visual Studio, CodeWarrior, GCC, etc.; Generate RTL, Generate Hardware Interfaces, Generate Software Interfaces; HDL files and software libraries.


Nohau Shortens Debugging Time for MicroBlaze and Virtex-II Pro PowerPC Users

Nohau tools provide multiprocessor debug environments for embedded systems, resulting in increased design modification and debug efficiency.

by Darrell Wilburn
President
I.Q. Services, Inc.*
[email protected] or [email protected]

Platform FPGAs can implement a completely configurable system-on-chip by containing one or more microprocessors in a tightly coupled fabric. This delivers very flexible hardware and software, which can change continuously throughout the design and debug cycle.

A powerful set of software debug tools that can properly support sophisticated FPGAs is critical for successful project completion. Debugging and verifying a design from external pins is problematic at best. Reliably measuring 200 to 300 MHz signals (like Fast Simplex Links) with sub-nanosecond precision over a 3-foot logic cable to an external trace facility is very difficult and sometimes impossible. Furthermore, adding logic paths to provide for external probing is greatly intrusive, which may create new place and route problems as well as timing differences in the final design.

Simulation can still help you overcome the simpler roadblocks, but for real-time or intermittent problems, observing in real time through in-circuit methods quickly becomes a necessity. On-board instrumentation circuits can provide visibility to all system signals as well as executing programs.

The challenges of verification and debug steps are:

• Instrumentation to provide correlated hardware and software measurements

• Needing a broad range of engineering skills

• Extreme flexibility with ever-changing needs for both

Nohau Corporation has developed a compact on-chip development system that enables you to efficiently address these debug issues. The Nohau solution includes compact on-chip debug IP called DebugTraceBlaze that is minimized for size, connects directly to the on-chip peripheral bus (OPB), and utilizes on-chip block RAM for trace storage.

The debug facilities are implemented two ways: through hardware or software. The software-based solution uses a small Xilinx® program called XMD-STUB that resides in the first 1K block of memory. The hardware solution uses programmable logic in the hardware and is transparent to the software. You may choose the solution that is best for you.

Personally, I prefer the software solution because it has less impact on the hardware and is more flexible for customization. Also, the cost of 1K of memory is usually insignificant in systems that often have 1 MB or more. A block diagram for a typical small system is shown in Figure 1, illustrating placement of the Nohau DebugTraceBlaze module.

Please note that the Nohau solution requires no external signal pins; all access is through the JTAG port. Furthermore, it does not impact timing because it only interfaces through the OPB bus. The resource utilization in the FPGA for the Nohau IP is very small. Actual requirements are shown in Figure 2.

Design/Debug Flow

The design flow with Nohau tools present is illustrated in Figure 3. You may build an initial system from scratch or use a platform generator like Nohau BlazeGen or Xilinx Platform Studio (XPS) Base System Builder (BSB).

Simply specify your system with the .MHS, .MSS, .UCF, and project options files, which are generated by the platform builder or are user-generated text files. To add the Nohau DebugTraceBlaze IP to a project, you first build it with BSB or BlazeGen and then add the DebugTraceBlaze IP with a pass through BlazeGen.

The output of the XPS build is a .bit file that contains the bitstream required to program the target FPGA with your system. The Nohau Seehau debugger is a convenient and easy-to-use GUI that allows fast and easy updating of system hardware and software as well as test and check-out of software execution. Seehau loads the .bit file and programs the FPGA in just a few seconds.

A second path for code development is shown in Figure 3. BlazeGen generates small pre-tested code snippets that fit entirely in one on-board block RAM to provide you with a solid starting place for initial power-up check-out. These snippets are treated just like user code for input to the GNU compiler.

You can enter and compile C/C++ code from inside XPS or from an external editor and compiler using its own make files. For large programs, I recommend using an external GNU make facility. The output from the compile process is an .elf file that contains all code and symbolic information to be loaded directly by Seehau.

As shown in Figure 3, the classic edit/compile/debug loop familiar to embedded system engineers centers around the Seehau debugger. Additionally, a hardware edit/compile/debug loop is now included that loops back through new builds in XPS.

Debugging with Seehau

Seehau provides an intuitive source-level debugger that can be made aware of logic signals in the fabric; RTOS state and variables; correlation of hardware signals to code execution; and Ethernet performance characteristics in Internet-aware applications. Seehau is a full-featured source or assembly debugger with an integral real-time trace facility. It supports either PowerPC™ hard-core or MicroBlaze™ soft-core processors.


Figure 1 – Nohau Debug IP with trace, shown in a simple five-chip Internet-aware system. Diagram labels: 32-bit MicroBlaze processor; I-OPB and D-OPB; ILMB and DLMB to dual-port block RAM holding the boot program; JTAG; UART; GPIO; Nohau Debug IP with trace; external memory controller with 256K x 32 SRAM; 10/100 EMAC with PHY and RJ45.

Figure 2 – Nohau resource usage with and without trace

Configuration          Slices   LUTs   Flops   Mults   BRAMs
Debug with no trace        84     86     148       0       0
Debug with trace          425    642     489       0       4

Figure 3 – Development flow with Seehau source-level debugger in place. Diagram labels: Start; BlazeGen or BSB; XPS (MHS, MSS, UCF, XMD); Download .bit; BlazeGen or user C source; user code; GNU GCC; Retargeting; Seehau (source-level debugger); software edit/compile loop; hardware edit/compile loop; Debug Done.

You can look back in time from any execution point to follow the path backward, or you can use the Seehau event configuration system to specify pre- and post-triggering, complex breakpoints, triggers on register reads and writes, and triggers on data from the fabric. Figure 4 shows a typical source-level debug display with processor registers, memory data, program data in source form, and trace and breakpoint status.

Nohau tools are sold as a system, which includes the Seehau debugger, an interface pod to the appropriate JTAG connector, and the DebugTraceBlaze IP configured with trace memory. As a system, it may be ordered as EMUL-MICROBLAZE-PC.

The Nohau EMUL-MICROBLAZE-PC provides a 512-frame or 2K-frame deep trace with a trigger, post-trigger count, and break control. Probe pins may be either 8 or 40 bits wide. The trace displays the data connected to the probe pins as specified in the XPS .MHS file.
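The interplay of a fixed-depth trace, a trigger, and a post-trigger count can be modeled in a few lines of C. The sketch below is a generic software model written for illustration; it is not Nohau's implementation, and all type names, field layouts, and function names are hypothetical. It records frames into a 512-entry circular buffer, starts a countdown when the trigger address is seen, and stops capturing when the countdown expires, leaving the frames before and after the trigger available for inspection.

```c
#include <stdint.h>
#include <string.h>

#define TRACE_DEPTH 512u   /* frames held in the circular buffer */

typedef struct {
    uint32_t addr;    /* execution address */
    uint32_t probe;   /* sampled fabric signals (hypothetical layout) */
} trace_frame_t;

typedef struct {
    trace_frame_t frames[TRACE_DEPTH];
    unsigned      next;          /* slot that the next frame overwrites */
    unsigned      count;         /* valid frames, at most TRACE_DEPTH */
    uint32_t      trigger_addr;  /* address that arms the countdown */
    unsigned      post_count;    /* frames to capture after the trigger */
    unsigned      remaining;     /* post-trigger frames still to capture */
    int           triggered;
    int           stopped;
} trace_t;

void trace_init(trace_t *t, uint32_t trigger_addr, unsigned post_count)
{
    memset(t, 0, sizeof *t);
    t->trigger_addr = trigger_addr;
    t->post_count = post_count;
}

/* Record one frame. Returns 1 while capture continues, 0 once stopped. */
int trace_capture(trace_t *t, uint32_t addr, uint32_t probe)
{
    if (t->stopped)
        return 0;
    t->frames[t->next].addr = addr;
    t->frames[t->next].probe = probe;
    t->next = (t->next + 1u) % TRACE_DEPTH;
    if (t->count < TRACE_DEPTH)
        t->count++;
    if (t->triggered) {
        if (--t->remaining == 0)
            t->stopped = 1;
    } else if (addr == t->trigger_addr) {
        t->triggered = 1;
        t->remaining = t->post_count;
        if (t->remaining == 0)
            t->stopped = 1;
    }
    return !t->stopped;
}

/* Look back in time: age 0 is the most recent frame.
 * Returns NULL when age reaches beyond the stored history. */
const trace_frame_t *trace_look_back(const trace_t *t, unsigned age)
{
    if (age >= t->count)
        return NULL;
    return &t->frames[(t->next + TRACE_DEPTH - 1u - age) % TRACE_DEPTH];
}

/* Demo: step through addresses 0, 4, 8, ... with a trigger at 0x100 and
 * three post-trigger frames. Returns the last captured address. */
uint32_t trace_demo(void)
{
    static trace_t t;
    uint32_t addr = 0;
    trace_init(&t, 0x100u, 3u);
    while (trace_capture(&t, addr, 0u))
        addr += 4u;
    return trace_look_back(&t, 0)->addr;
}
```

In this model, a larger post-trigger count shifts the captured window toward events after the trigger, while a count of zero keeps only the history leading up to it.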

Figure 5 illustrates a trace display in mixed mode with C source and assembly source intermixed. On this single display, you can correlate the frame at capture time, the execution address, the opcode of the instruction executed, the disassembled MicroBlaze instruction, the C source line that generated that instruction, and 40 bits of data from any logic in the system. Data from logic can include signals from your own logic design.

Multiprocessor System Support

Recently, Nohau completed a joint project with Xilinx, expanding the Seehau system to include support for the hard-core PowerPC processor found in Virtex-II Pro™ devices. As Seehau is a robust, source-level debugger, the user interface and source-level feature set are nearly identical. The only major changes from an embedded system engineer's point of view are the processor-level language on disassembled screens and the register set associated with the PowerPC architecture. Figure 6 shows a source-level debug screen of a typical PowerPC debug session.

Seehau has also been expanded to include support for multiple processors in the same fabric. The processors are run independently. Figure 7 shows a set of screens for a two-processor system.

The Nohau GUI provides a simple, easy-to-use interface that assigns a complete set of control and status windows to each processor. All Seehau windows are available for both processors, and show the name given to each processor in the top banner. In the case shown in Figure 7, the execution sites are named MB1 and MB2.

When you select a command from a pull-down list by clicking on it, the command is directed to the processor assigned to the window in focus. You control the set of windows open for each processor through a pull-down menu. The choice of open windows is controlled by your selection of new windows to view. The result is an easy-to-use, intuitive user interface.

Conclusion

Getting the right tool set and development environment set up for a new FPGA project is critical to the success of the product development cycle. A highly productive development and debug environment based around Nohau tools supports these new multiprocessor systems, with an extension of the same powerful debug and test tools the company has offered for the last 20 years.

For more information, please visit www.nohau.com, www.iq-service.com, or e-mail [email protected] or [email protected].

* I.Q. Services is under contract to support and market platform FPGA tools for Nohau Corporation and performs custom start-up engineering for platform FPGA embedded system designs.

Figure 4 – Full source with breakpoint at context switch in µC/OS-II

Figure 5 – Trace 40-bit mixed-mode display in binary logic signals from FPGA fabric

Figure 6 – PowerPC source-level debug with trace

Figure 7 – Multi-core display with two MicroBlaze processors


by Flemming ChristensenManaging [email protected]

There has been a strong push in the pastfew years to replace analog radio systemswith digital radio systems. TheDepartment of Defense Joint TacticalRadio System program shifted the empha-sis on the development of software-definedradio (SDR) to the forefront of researchand development efforts in the defense,civilian, and commercial fields.

Although it has existed for many years,SDR technology continues to evolvethrough newly funded ventures. The “holygrail” of SDR is its promise to solve incom-patible wireless network issues by imple-menting radio functionalities as softwaremodules running on generic hardware plat-forms. Future SDR platforms will comprisehardware and software technologies thatenable reconfigurable system architecturesfor wireless networks and user terminals.

Although new SDR-based systems arepurported to be highly reconfigurable andreprogrammable right now, the truth isthat SDR hardware platforms are still intheir early development stages. Many issuesmust still be resolved, including reconfig-urable signal processing algorithms, hard-ware and software co-designmethodologies, and dynamically reconfig-urable hardware. Overall, the main keyissues for SDR embedded system platformsare flexibility, expandability, scalability,reconfigurability, and reprogrammability.

A Scalable Software-Defined Radio Development SystemA Scalable Software-Defined Radio Development System


Sundance enters the SDR fray with a Xilinx-based platform.Sundance enters the SDR fray with a Xilinx-based platform.

Meeting the ChallengeSDR is characterized by a decouplingbetween the heterogeneous execution plat-form, based on hardware functions togeth-er with DSP and MCU processors, and theapplications through abstraction layers.

One of the approaches for meeting thecomplexity demands of third-generationsystems is to use multi-core systems wherea DSP and a microcontroller work togeth-er with an FPGA-based hardware co-processor. With its embedded IBM™PowerPC™, the Xilinx® Virtex-II Pro™FPGA is rapidly becoming a solution thatembeds and tightly couples reconfigurablelogic and a processor in the same device.

Sundance, a developer of advanced system architectures for high-performance signal processing applications, has focused on designing FPGA-based development platforms that address an SDR OEM’s wish list. Our challenge was to design a system that would provide the scalability SDR systems require.

The Embedded System Controller

The SMT148 (Figure 1) is one of many development systems Sundance has launched recently. Aimed specifically at SDR, the SMT148 is a fully configurable and expandable waveform development environment that meets the many requirements of SDR developers. This entry-level, stand-alone system enables radio designers to investigate and experiment with the many configurations of multi-channel software programmable and hardware-configurable digital radio.

The SMT148 has at its heart a powerful embedded system controller that leverages the Xilinx Virtex-II Pro FPGA with its embedded PowerPC 405 processor (Figure 2). As an embedded system controller, the role of the Virtex-II Pro device is to manage the reconfiguration of the add-on modules, especially when downloading. Reconfiguration means switching between modes or updating a hardware/software component.

In the global functioning of the SMT148, you can download many kinds of software (high-level applications, protocol stacks, low-level signal processing algorithms) and employ several methods to download software. The eight RocketIO™ transceivers on the Virtex-II Pro device enable high-speed data transfer to additional SMT148 or Virtex-II Pro add-on modules.

Downloads can leverage a powerful I/O architecture that includes the popular FireWire, USB interfaces, LVDS interfaces, and JTAG for debugging and downloads. Data flows into the FPGA and is managed by a Sundance program written for the embedded PowerPC before processing. Figure 3 is the C code comprising the data flow and RocketIO PowerPC program.

Scalable, Reconfigurable Embedded Processors

Scalability is addressed through four add-on module sites, and you can partly resolve the requirements of dynamic reconfiguration by adding additional Xilinx-based FPGA modules. All add-on modules communicate through the Virtex-II Pro device, which also manages two 32-bit microcontrollers that enable communications with most widely used standards.

With the RocketIO transceivers connected to differential pair connectors, you can connect FPGA systems directly through simple cable connections supporting more than 2 Gbps data rates.

Figure 1 – The SMT148

Figure 2 – SMT148 systems architecture (block diagram: a Virtex-II Pro FPGA, XC2VP7, 20, or 30, connected to four TIM add-on module sites through global buses and ComPorts; 8-channel 16-bit 200 kSps DAC circuitry and 8-channel 14-bit 400 kSps ADC circuitry; RSL at 20 Gbits full duplex, SHB at 400 Mbytes, LVDS, RS485, RS232, UART, USB, and FireWire (IEEE1394) interfaces; a 100 MHz crystal; a CPLD; a 32-LED display; the JTAG configuration chain; and power management)

The SMT148 leverages the Xilinx Virtex-II Pro block RAM configuration to generate FIFOs for the RocketIO transceivers, add-on module communication ports, and a high-speed bus, as well as the embedded PowerPC code. The single high-speed bus allows parallel data transfer to and from a wide range of high-speed ADC/DAC modules.

Data rates on this port are in excess of 100 MHz (400 Mbps), and are useful for transferring sampled 16-bit I and Q. Processing data streams can take place either in the embedded PowerPC in the Virtex-II Pro device or throughout an array of other add-on Virtex-II Pro FPGA-based modules with embedded PowerPC.

The FPGA on the SMT148 carrier card is connected to many different devices and therefore has many internal interfaces that allow it to exchange data or commands with the external world. All interfaces are reset at power-on or when a manual reset is applied. Figure 3 shows the interconnections between the digital modules inside the FPGA.

When we designed the SMT148, we understood that one of the many challenges OEMs would face in developing JTRS-compliant platforms was the availability of interchangeable and networked processing nodes. The availability of processing nodes aims to meet the expandability and scalability requirements in complex waveform applications.

The SMT148 meets this availability challenge with a network of daughter sites that you can use for additional resources such as signal processors, reconfigurable computing modules, and Sundance’s large family of add-on modules. These add-on modules include a variety of embedded system options such as reconfigurable modules with tightly coupled Virtex-II Pro FPGAs and DSPs, digital and analog converters, data conversions, transceivers, and I/Os of all types.

Designed for Developers

Powered by an external supply, the SMT148 platform has an impressive topology that accepts input signals from various sources through a network of multi-pin connectors. High-speed I/O channels support the additional nodes in a network topology neatly harmonized to the Xilinx architecture.

Fully interconnected and configurable through their communication ports, these add-on sites are also connected to the embedded Virtex-II Pro FPGA. The availability of a network of add-on sites removes the main expandability restrictions often associated with other platforms, and offers the OEMs a highly compact design and development tool.

More importantly, this scalable system architecture makes the SMT148 a perfect development platform with which to resolve the many issues related to the implementation of multiple radio functionalities in a single environment. These can be addressed as multiple software modules running on Sundance’s reconfigurable hardware platform.

IP Cores

Sundance takes advantage of the high-performance DSP acceleration capabilities and flexible connectivity that the Virtex-II Pro FPGA provides by supporting developers with a family of software tools and IP cores.

The SMT148 I/O flexibility enables you to rapidly investigate and experiment with features of the Virtex-II Pro FPGA as well as those of developed IP cores from Sundance. These include multi-tap complex filters, Viterbi decoders, encoders, a complete transmitter, QAM mapper, multi-phase pulse shaping filter, multi-phase cascaded integrator comb (CIC) interpolation filter with a fixed interpolation rate, multi-phase numerical-controlled oscillator (NCO), and multi-phase digital mixer.

Conclusion

Developing, testing, and implementing SDR IP cores is simplified with the Sundance SMT148 platform. You can now focus on developing additional IPs without worrying about peripheral processing or I/O devices, as these are simply off-the-shelf add-on IP blocks.

For more information, please visit www.sundance.com.


Figure 3 – Interconnections diagram of the digital modules inside the SMT148 (block diagram: a switch fabric linking eight ComPort interfaces, CP0 to CP7; four global bus interfaces, GB1 to GB4, serving TIM1 to TIM4; two microcontroller interfaces; the McBSP interface with four enable/disable blocks; four watchdog timers; the DAC, ADC, LED, RSL, SHB, and RS485 interfaces to and from the LVDS drivers/receivers; and the reset logic for the TIMs)

by Chang Ning Sun
Technical Marketing Engineer
Mentor Graphics [email protected]

Embedded system development traditionally requires a hardware design cycle to create a prototype; a software implementation cycle can only begin once that prototype is available. Co-design and co-verification tools such as the Seamless™ co-verification environment (CVE) from Mentor Graphics® provide great flexibility and early system integration, but these tools have mostly been used for large-scale systems such as ASIC designs. Ordinary embedded system designs have not yet benefited from hardware/software co-design and co-verification.

The Nucleus real-time operating system is now available for Xilinx FPGAs with MicroBlaze and PowerPC 405 cores.

Nucleus for Xilinx FPGAs – A New Platform for Embedded System Design



In recent years, innovations in FPGA technology have shifted the use of these devices from supporting logic to forming a central part of the embedded system. With their high logic density and high performance, modern FPGAs allow you to implement almost an entire embedded hardware system in a single FPGA device.

Xilinx® Spartan-II™, Spartan-3™, and Virtex-II™ Platform FPGAs (combined with the MicroBlaze™ soft processor core) and Virtex-II Pro™ Platform FPGAs (combined with the PowerPC™ 405 hard processor core) offer new hardware platforms and facilitate new methodologies in embedded system design.

The Nucleus™ real-time operating system (RTOS) from Accelerated Technology (Mentor Graphics’ Embedded Systems Division) fully supports Xilinx Platform FPGAs. Together with our FPGA design flow and Seamless CVE tools, the Nucleus RTOS is a complete hardware/software co-design and co-verification environment.

The Nucleus RTOS

Very few embedded software developers start their development on bare hardware; most choose a commercial embedded RTOS like Nucleus as the base environment and develop their applications on top of the operating system.

Generally, an RTOS provides a multi-tasking kernel and middleware components such as a TCP/IP stack, file system, or USB stack. Application developers build their application software by using the system services provided by the RTOS kernel and the middleware components. A number of commercial RTOSs are available: some have hard real-time kernels, some have extensive middleware support, and some have good development tools. The Nucleus RTOS has all of these features from a single vendor.

Figure 1 shows the Nucleus system architecture.

Nucleus PLUS

Nucleus PLUS, a fully preemptive and scalable kernel suitable for hard real-time applications, is designed specifically for embedded systems. It has a very small footprint; the kernel itself can be as small as 15-20 kB.

All Nucleus kernel functions are provided as libraries, so only the required kernel functionality is linked with the application code in the final image. Nucleus PLUS provides complete multi-tasking kernel functions, including task management, inter-task communication, inter-task synchronization, memory management (MMU), and timer management. Recently, we added a dynamic download library (DDL) into Nucleus PLUS to allow dynamic downloading and allocation of system modules.

Nucleus Middleware

The middleware available for an RTOS has become an important factor when selecting a commercial RTOS for a particular application. As embedded applications become increasingly complicated, application developers find implementing industry standards like TCP/IP, WiFi, USB, and MPEG more difficult. These standards are often implemented as middleware components and provided together with the RTOS.

The Nucleus RTOS has an extensive library of middleware support (Table 1), which covers every vertical market of the embedded industry. Nucleus NET utilizes a zero-copy mechanism for transferring data between user memory space and the TCP/IP stack, which significantly improves data transmission efficiency and reduces dynamic memory allocation.

Nucleus software fully supports the POSIX standard for software portability. POSIX interface libraries are available for all Nucleus middleware components.

Nucleus Device Drivers

The Nucleus RTOS has an extensive list of off-the-shelf device drivers. When you select a new hardware IP or peripheral to use in your design, the driver source code may already be available. If not, a driver template lets you easily develop a functional driver.


Figure 1 - Nucleus system architecture (applications written in C, C++, POSIX, or Java™)

Mentor Graphics will continue to add device drivers to the Nucleus library for new hardware IP blocks for Xilinx FPGAs as well as for third parties, such as its own IP Division.

Nucleus Development Tools

The Nucleus RTOS has the most integrated development environment (IDE) in the industry. Combined with our Microtec compiler tools, the code|lab embedded development environment (EDE) and XRAY debuggers provide a complete software design flow, from compilation to debugging.

Our C/C++ source-level debuggers support both run-mode and freeze-mode debugging with complete Nucleus kernel awareness. The XRAY debugger supports both homogeneous and heterogeneous multi-core debugging, and a wide range of processors and DSPs.

In addition to the Microtec compilers, the EDE, IDE, and debuggers are also integrated with many third-party compiler tools such as GNU.

For system-on-chip (SoC) users, integration of the Nucleus RTOS and XRAY debugger with other products in the Mentor Graphics lineup (such as Seamless CVE for hardware/software co-verification) offers a complete hardware/software co-design and co-verification platform.

Royalty-Free with Source Code

The royalty-free with source code business model of the Nucleus RTOS, combined with the high scalability and small footprint architecture, provides great value to customers developing large-volume products such as cell phones, PDAs, and cameras. A single up-front license fee covers all production rights for a single product using Nucleus software, with no additional royalty fees and no need to track shipment numbers.

Nucleus software has proven to be a robust RTOS and has been widely used in every vertical market of the embedded industry, including consumer electronics, telecom, defense/aerospace, automotive, and telematics.

Nucleus for Xilinx FPGAs

The scalable and configurable nature of Nucleus products, combined with the flexibility and versatility of FPGAs, provides great potential for hardware/software co-design, co-verification, and early integration of your software and hardware projects.

Nucleus for MicroBlaze

The MicroBlaze soft processor core features a Harvard-style RISC architecture with 32-bit instruction and data buses. A MicroBlaze soft core can be programmed into Spartan-II, Spartan-3, or Virtex-II Platform FPGAs.

Nucleus software fully supports the Xilinx MicroBlaze soft processor core with GNU compiler tool and GDB debugger.

Nucleus for Virtex-II Pro FPGAs/PowerPC 405

Many embedded system developers are already familiar with the PowerPC 405 processor. Xilinx Virtex-II Pro FPGAs provide fully configurable hardware platforms with the PowerPC 405 hard processor core.

Nucleus software fully supports Virtex-II Pro FPGAs with the PowerPC 405 hard core, providing a Microtec PowerPC compiler and XRAY kernel-aware debugger, as well as GNU compiler tools and GDB debugger.

FPGA Reference Design for Nucleus Software

To give embedded system developers a quick start with Nucleus software for Xilinx FPGAs, we worked with Xilinx and Memec™ Insight to develop reference designs for running Nucleus software. Each reference design is a sample FPGA hardware configuration based on a Memec Insight FPGA development board.

You can build and implement all of the reference designs into the FPGA device by using Xilinx EDK and ISE software. For each reference design, we provide a Limited Version (LV) of our Nucleus software, including the Nucleus PLUS kernel and the relevant middleware (such as Nucleus NET) to help you evaluate Nucleus software with the FPGA.

Nucleus Product                 Function

Nucleus NET                     Complete embedded TCP/IP stack
Nucleus Extended Protocols      Telnet, FTP, TFTP, Embedded Shell
Nucleus SNMP version 1,2,3      Network management protocols
Nucleus RMON (1-9 groups)       Network remote monitoring protocols
Nucleus Residential Gateway     NET, PPP, PPPoE, DHCP server
Nucleus WebServ                 Embedded HTTPD web server
Nucleus EMAIL                   POP3 and SMTP
Nucleus FILE                    MS DOS compatible file system
Nucleus GRAFIX                  Graphics-rendering engine with windowing toolkit
Nucleus USB                     Complete USB stack for USB specifications 1.1, 2.0, and OTG
Nucleus 802.11 STA (WiFi)       802.11b and 802.11g protocol stack
Nucleus IPv6                    IP version 6 protocol stack
CEE-J Java VM for Nucleus       CLDC, MIDP, Embedded Java, and Personal Java

Table 1 - Primary Nucleus middleware components

The LV version of Nucleus software functions exactly the same as a full version, but is provided as binary only and runs for a limited time. Currently, we have FPGA reference designs for the following Memec Insight boards:

• Spartan-IIE LC MicroBlaze development board

• Spartan-3 LC MicroBlaze development board

• Spartan-3 MB MicroBlaze development board

• Virtex-II MB MicroBlaze development board

• Virtex-II Pro PowerPC development board

We have also created a website (www.acceleratedtechnology.com/xilinx) to support Nucleus software for Xilinx FPGAs, where you can download FPGA reference designs and Nucleus software updates.

FPGA Design Flow

The FPGA reference designs for the Nucleus RTOS can be used as your starting point to build a custom FPGA-based hardware platform. All of the reference designs are provided as Xilinx Platform Studio (XPS) projects, so you can directly open these projects in XPS and edit, build, and implement the design with Xilinx EDK and ISE software. Typically, building a Nucleus system using an FPGA and configurable cores involves the following:

• Configuring a hardware platform with FPGA, processor core, hardware IP, and memory

• Building the design with EDK/ISE software to generate an FPGA bitstream

• Building software libraries with EDK to generate low-level device driver code

• Developing Nucleus device drivers and Nucleus applications

• Programming the FPGA device

• Downloading Nucleus into the FPGA hardware board

Once you have reached this point, you can start evaluating your Nucleus software and begin debugging your application.

Xilinx EDK and ISE software provide a complete environment for configuring and implementing an FPGA-based hardware platform. At the same time, EDK also generates software libraries and C header files for the hardware design. These libraries provide basic processor boot-up code and low-level hardware IP device driver function code. These low-level functions can be treated as a “BIOS” layer for Nucleus.

Figure 2 shows the source code directories generated by EDK in an XPS project. You will find one subdirectory for each piece of hardware IP, such as “emac_v1_00_d” for Ethernet and “uartlite_v1_00_b” for UART. Each subdirectory contains the driver code for that hardware IP. These driver functions are included with the hardware IP from the vendor and have been integrated into the EDK installation.

Nucleus software provides an OS adaptation layer that is an interface layer between the EDK-generated “BIOS” function layer and the Nucleus high-level device drivers. This OS adaptation layer serves as a hardware abstraction layer, which will significantly simplify the Nucleus device driver porting over FPGA-based hardware platforms.
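The layering described here can be pictured with a small Python sketch. It is only an illustration of the pattern, not Nucleus or EDK code: `LowLevelUart`, `UartAdapter`, and `SerialDriver` are hypothetical names, and the low-level class merely stands in for an EDK-generated "BIOS" driver.

```python
class LowLevelUart:
    """Hypothetical stand-in for an EDK-generated low-level UART driver."""

    def __init__(self):
        self.tx_log = bytearray()  # records bytes "sent" to the hardware

    def send_byte(self, value: int) -> None:
        self.tx_log.append(value & 0xFF)


class UartAdapter:
    """Adaptation layer: maps a generic write() onto one IP's low-level calls."""

    def __init__(self, low_level: LowLevelUart):
        self._hw = low_level

    def write(self, data: bytes) -> int:
        for byte in data:
            self._hw.send_byte(byte)
        return len(data)


class SerialDriver:
    """High-level driver: depends only on the adapter's write() interface."""

    def __init__(self, adapter):
        self._adapter = adapter

    def put_line(self, text: str) -> int:
        return self._adapter.write(text.encode("ascii") + b"\r\n")
```

With this split, moving to a board with a different UART core would mean writing a new adapter, while the high-level driver stays unchanged.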

Nucleus Software Debugging

Both MicroBlaze soft-core and PowerPC 405 hard-core designs provide a standard JTAG debug interface. So after implementing the hardware design in the FPGA, you can take full advantage of our JTAG-based code|lab EDE or XRAY debugger to facilitate application debugging. Both code|lab and XRAY have complete Nucleus kernel awareness, which can help you analyze complex real-time behaviors in a multi-tasking system.

Co-Design and Co-Verification

As we mentioned earlier, Nucleus software provides a viable platform for hardware/software co-design and co-verification. Mentor Graphics is one of the only EDA companies with an embedded software focus, and its Nucleus RTOS and XRAY debugger are fully integrated with our Seamless hardware/software co-verification software platform.

The combination of the Nucleus RTOS and Seamless co-verification tool provides you with one of the most comprehensive hardware/software co-design and co-verification tools possible.

Conclusion

Embedded systems are becoming increasingly complicated. To meet the challenge, integrated solutions including hardware/software co-design and co-verification are required. FPGAs with configurable processor cores bring a new, innovative approach to modern embedded system designs and open the door to hardware/software co-design and co-verification.

Nucleus software for FPGAs can significantly accelerate your hardware/software development cycle and improve your hardware/software integration quality. For more information, please visit www.acceleratedtechnology.com.


Figure 2 - Xilinx EDK-generated source code directories

Put the functions of a legacy microprocessor into a Xilinx FPGA.


by Lance Roman
President
Roman-Jones, [email protected]

Brad Fayette
Senior Software Engineer
Roman-Jones, [email protected]



How do you put a one-dollar Intel™ 8051 microprocessor into an FPGA without using 10 dollars’ worth of FPGA fabric?

The answer is emulation. Using software emulation, Roman-Jones Inc. has developed a new type of 8051 processor core built on a Xilinx 8-bit, soft-core PicoBlaze™ (PB) processor. This “new” PB8051 is more than 70% smaller than competing soft-core implementations – without sacrificing any of the performance of this legacy part. The PB8051 is a Xilinx AllianceCORE™ microprocessor built through emulation.

The Legacy of the 8051

The Intel 8051 family of microprocessors – probably one of the most popular architectures around – is still the core of many embedded applications. This processor just refuses to retire. Many designers are using legacy code from previous projects, while others are actually writing new code.

The 8051 architecture was designed for ASIC fabric. It is not efficient in an FPGA, resulting in excess logic usage with marginal performance.

FPGA microprocessor integration is a solution for older 8051 products undergoing redesign to eliminate obsolescence, lower costs, decrease component count, and increase overall performance.

The new FPGA-embedded PB8051 designs allow you to take advantage of existing in-house software tools and your own architecture familiarity to quickly implement a finished design. The integrated PB8051 can be customized on the FPGA to exact requirements.

Processor Emulation

Programmers have used microprocessor emulation for many years as a software development vehicle. It allows programmers to write and test code on a development platform before testing on target hardware.

This same concept can be practical when the target microprocessor architecture does not lend itself to efficient implementation and use of FPGA resources.

The features of our PB8051 emulated processor include:

• Smaller Size – Traditional 8051 implementations range from 1,100 to 1,600 slices of FPGA logic. The PicoBlaze 8051 processor requires just 76 slices. Emulation hardware requires 77 slices. Add another 158 slices for two timers and a four-mode serial port, and you have a total of 311 slices. This is a reduction of more than two-thirds of the FPGA fabric of competing products.

• Faster – At 1.3 million instructions per second (MIPS), the PB8051 is faster than a legacy 8051 (1 MIPS) running at 12 MHz. Compare this to 5 MIPS with a 40 MHz Dallas version or 8 MIPS with a traditional FPGA implementation – which takes up more than three times as much FPGA fabric as the PB8051. The PicoBlaze processor itself runs at a remarkable 40 MIPS using an 80 MHz clock.

• Software Friendly – You can write C code or assembler code with your present software development tools to generate programs. You can also run legacy objects out of a 27C512 EPROM.

• Easy Hardware – You can use VHDL or Verilog™ hardware description languages to instantiate the 8051-type core. Hook up block RAM or off-chip program ROM, and you’re ready to go.

• Low Cost – The PB8051 is $495 with an easy Xilinx SignOnce IP license.

Architecture of Emulated 8051

As shown in Figure 1, the architecture of an emulated processor has several elements. Each element is designed independently, but together, they act as a whole.

PicoBlaze Platform

The PicoBlaze host processor is the heart of the emulated system and defines the architecture of the PB8051 emulated processor core. It comprises:

• PicoBlaze Code ROM – Block ROM contains the software code to emulate 8051 instructions.

• Internal Address/Bus – The PicoBlaze peripheral bus is a 256-byte address space accessed by PicoBlaze I/O instructions. It allows the PicoBlaze processor to interface with RAM, emulation peripherals, timers, and the serial port. This bus is internal to the core and thus hidden from the user.

• Emulation Peripherals – Not all emulation tasks can be done in software and still meet the performance requirements for your particular application. Specialized hardware assists the PicoBlaze code to perform tasks that are time-consuming, on a critical execution path, or both.

A good example is the parity bit in the PSW (program status word) register that reflects 8051 accumulator parity. This function must be performed for every 8051 instruction executed, and would take several PicoBlaze instruction cycles to perform. It is, therefore, done in hardware.

• Instruction Decode ROM – This is a key emulation peripheral that is used to decode 8051 instructions to set PicoBlaze routine locations and emulation parameters.

• Address Decode – A significant amount of the emulation peripheral hardware is dedicated to simple address decode of the PicoBlaze peripheral bus. This address space is for the PicoBlaze processor only and is insulated from the 8051 application.

• Block RAM – 256 bytes are available to 8051 internal RAM and some 8051 registers. You can access this block RAM via 8051 instructions.

• Serial Port and Timer – The actual 8051 timer and multi-mode serial port were best done in hardware instead of trying to implement these functions in software. Clock prescaling inputs are provided so that these functions can run at a clock rate independent of the emulated system.
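To make the PSW parity example concrete, here is a minimal Python sketch of what that hardware computes. It is not the PB8051 logic itself; the function names are illustrative. On the 8051, the P flag is bit 0 of the PSW and is 1 when the accumulator holds an odd number of 1 bits. The XOR folding below mirrors the XOR tree that a few LUTs would implement in parallel, which is why hardware beats a multi-instruction software loop here.

```python
def accumulator_parity(acc: int) -> int:
    """Return the 8051 P flag: 1 if ACC holds an odd number of 1 bits."""
    acc &= 0xFF
    acc ^= acc >> 4  # fold the upper nibble onto the lower nibble
    acc ^= acc >> 2  # fold the remaining 4 bits down to 2
    acc ^= acc >> 1  # fold the remaining 2 bits down to 1
    return acc & 1


def update_psw(psw: int, acc: int) -> int:
    """Write the parity result into PSW bit 0, leaving the other flags alone."""
    return (psw & 0xFE) | accumulator_parity(acc)
```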

PicoBlaze Emulation Software

A 1K x 16 block ROM holds the PicoBlaze code that performs the actual emulation. The emulation program is carefully constructed in very tight PicoBlaze assembler code, optimized for speed and efficiency.
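A rough back-of-the-envelope calculation shows why the emulation code has to be tight. The sketch below uses only the figures quoted in this article (40 MIPS for PicoBlaze at 80 MHz, 1.3 MIPS for the PB8051) and assumes, which the article does not state, that both figures refer to the same 80 MHz clock.

```python
PICOBLAZE_CLOCK_HZ = 80_000_000  # clock quoted for the PicoBlaze processor
PICOBLAZE_MIPS = 40.0            # PicoBlaze throughput quoted at that clock
PB8051_MIPS = 1.3                # emulated 8051 throughput (same clock assumed)

# Clock cycles the host spends on each PicoBlaze instruction.
clocks_per_pb_instruction = PICOBLAZE_CLOCK_HZ / (PICOBLAZE_MIPS * 1e6)

# Average number of PicoBlaze instructions available per emulated 8051
# instruction, covering fetch, decode, operand fetch, execute, and
# interrupt scan.
pb_instructions_per_8051_instruction = PICOBLAZE_MIPS / PB8051_MIPS

print(clocks_per_pb_instruction)
print(round(pb_instructions_per_8051_instruction, 1))
```

Under that assumption the emulator has an average budget of roughly 30 PicoBlaze instructions for each 8051 instruction.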

The emulation program is divided intoseveral segments:

• Instruction Fetch – An 8051 bus cycle is simulated to fetch the next instruction from 8051 program memory (64 KB in size), which may be on-chip block RAM or off-chip EPROM (such as the 27C256).

• Instruction Decode – Fetched instructions are decoded to determine the addressing mode and operation.

• Fetch Operands – Depending upon the addressing mode, additional operands are fetched for the instruction from program memory, internal RAM, or external RAM.

• Instruction Execution – An instruction performs the desired operation and updates emulated register contents, including affected 8051 PSW flags.

• Scan Interrupts – This function determines if an interrupt is pending, and if so, services it. An interrupt window is generated at the end of every emulated instruction when required.
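The five segments above can be sketched as a toy interpreter in Python. This is not the PB8051 emulation code, which is written in PicoBlaze assembler; the class name, the five-opcode subset, and the simplified interrupt model (a Python list standing in for the 8051 stack, a single `pending_irq` vector) are all assumptions made for illustration. The opcode values themselves are standard 8051 encodings. The `decode` dictionary plays the role the article assigns to the instruction decode ROM: it maps an opcode to a routine and its parameters.

```python
class Tiny8051:
    """Illustrative emulator for a handful of 8051 opcodes."""

    def __init__(self, program: bytes):
        self.rom = bytes(program).ljust(0x10000, b"\x00")  # 64 KB program space
        self.pc = 0
        self.acc = 0
        self.carry = 0
        self.stack = []          # simplification: not the real internal-RAM stack
        self.pending_irq = None  # hypothetical: vector address, or None
        # Decode table: opcode -> (routine, number of operand bytes).
        self.decode = {
            0x00: (self.op_nop, 0),        # NOP
            0x04: (self.op_inc_a, 0),      # INC A
            0x24: (self.op_add_a_imm, 1),  # ADD A,#imm
            0x74: (self.op_mov_a_imm, 1),  # MOV A,#imm
            0x80: (self.op_sjmp, 1),       # SJMP rel
        }

    def fetch_byte(self) -> int:
        byte = self.rom[self.pc]
        self.pc = (self.pc + 1) & 0xFFFF
        return byte

    def step(self) -> None:
        opcode = self.fetch_byte()                      # instruction fetch
        routine, count = self.decode[opcode]            # instruction decode
        operands = [self.fetch_byte() for _ in range(count)]  # fetch operands
        routine(*operands)                              # instruction execution
        self.scan_interrupts()                          # scan interrupts

    def scan_interrupts(self) -> None:
        if self.pending_irq is not None:
            self.stack.append(self.pc)  # save the return address
            self.pc = self.pending_irq
            self.pending_irq = None

    def op_nop(self) -> None:
        pass

    def op_inc_a(self) -> None:
        self.acc = (self.acc + 1) & 0xFF  # INC does not touch the carry flag

    def op_add_a_imm(self, imm: int) -> None:
        total = self.acc + imm
        self.carry = int(total > 0xFF)  # AC and OV flags omitted in this sketch
        self.acc = total & 0xFF

    def op_mov_a_imm(self, imm: int) -> None:
        self.acc = imm

    def op_sjmp(self, rel: int) -> None:
        offset = rel - 0x100 if rel & 0x80 else rel  # signed 8-bit displacement
        self.pc = (self.pc + offset) & 0xFFFF
```

Calling `step()` once per emulated instruction walks through the segments in the order listed; a real emulator would cover the full opcode map and all PSW flags.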

In addition to PicoBlaze code, Java™ software utilities process symbols taken from PicoBlaze listings (.LOG files) into a ROM table. These .LOG files are used to decode 8051 opcodes.

This program also produces the .COE files used by the Xilinx CORE Generator™ system to create PicoBlaze code ROM. All of this is transparent to designers integrating with the PB8051.

User Back-End Interface

What gives the PB8051 its “hardware flavor” is the user back-end interface, where you interface your logic design with the emulated 8051 processor. The back-end interface is part of the emulation peripherals, controlled by the PicoBlaze platform.


Figure 1 – PB8051 block diagram (the PicoBlaze processor with a 1K x 16 block ROM for emulation program code, a 256 x 8 instruction decode ROM, a 256 x 8 block RAM, the serial port, the timer, and address decode on the internal address/data bus; the 8051 emulation peripherals provide the back-end signals CLK, RST, RST_8051, RD, WR, PSEN, INSTR_FETCH, EXT_BUS_START, EXT_BUS_HOLD, P1_IN[7:0], P1_OUT[7:0], P3_IN[7:0], P3_OUT[7:0], EXT_DATA_IN[7:0], EXT_DATA_OUT[7:0], EXT_ADDRESS[15:0], ROM_ADDRESS[15:0], ROM_DATA[7:0], the prescale inputs SERIAL0_PRE12, SERIAL2_PRE32, and TIME_PRE, and the interrupt inputs)

Just as the 8051 processor family has many derivatives to define port, functionality, features, and pinout, the back-end interface serves the same function. The PB8051 has a back-end interface that resembles the generic 8031, the ROM-less version of the 8051.

Roman-Jones Inc. customizes back-end interfaces to meet your exact 8051 needs, such as removing an unused serial port or adding an I2C port to emulate the 80C652 derivative.

Designing with the PB8051

Incorporating the PB8051 into the rest of your design is easy, because it comes with reference designs and examples of Xilinx Integrated Software Environment (ISE) projects.

Hardware Considerations

Figure 1 illustrates the signal names available on the user back-end interface. A VHDL or Verilog template provides the exact signal names – many of which are already familiar to 8051 designers. A few new types of signals exist, including:

• Pre-scales used by the timer and serial port to set counting and baud rates.

• One-clock-wide read/write strobes to read and write external memory space. There is also a program store enable (PSEN) strobe for the 8051 code memory.

• Bus start and hold signals used to insert wait states for external memory cycles that cannot be completed in one system clock cycle.

The PB8051 is instantiated as a component into your top-level design. You determine if the 8051 program resides in on-chip block RAM or off-chip EPROM. Peripherals can be hooked up to either the P1/P3 port lines or to the external address and data buses. For convenience, the address and data lines for program and data memory spaces are separated, so conventional multiplexer circuitry is not needed.

If your entire design, including the 8051 program, resides on the FPGA, simply set the EXT_BUS_HOLD to “low” to take full advantage of running at clock speed. If you elect to use an off-chip EPROM or have slow peripherals, wait states can be inserted by asserting EXT_BUS_HOLD at “high.” One of our reference designs illustrates wait state generation.
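One common way to generate wait states is a small down-counter that is loaded when a bus cycle starts and holds the bus until it reaches zero. The Python model below sketches that idea at clock-cycle level. It is an assumption-laden illustration, not the Roman-Jones reference design: the class name and the exact timing relationship between EXT_BUS_START and EXT_BUS_HOLD are hypothetical, and in a real design this would be a few lines of VHDL or Verilog.

```python
class WaitStateGenerator:
    """Cycle-level model of a wait-state counter (hypothetical timing)."""

    def __init__(self, wait_states: int):
        self.wait_states = wait_states  # clocks to stall each external access
        self.remaining = 0

    def clock(self, ext_bus_start: bool) -> bool:
        """Advance one system clock; return the level to drive on EXT_BUS_HOLD."""
        if ext_bus_start:
            self.remaining = self.wait_states  # new bus cycle: reload the counter
        elif self.remaining > 0:
            self.remaining -= 1                # count down while the cycle is stalled
        return self.remaining > 0              # high = hold, low = cycle may finish
```

Setting `wait_states` to 0 models the all-on-chip case, where the hold signal stays low and every access completes at clock speed.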

Xilinx Implementation Considerations

There are few implementation considerations other than the need to place the PB8051 design netlist into your project directory and instantiate it as a component in your VHDL or Verilog design; ISE software will do the rest. ISE schematic capture is also supported.

Simulation Considerations

We’ve included a behavioral simulation model of the PB8051 for adding (along with the rest of your design) to your favorite simulator. Modeltech and Aldec™ simulators have been tested for correct operation. Post place-and-route or timing simulations follow a conventional design flow.

You will enjoy watching your 8051 instruction execution flow go by on the simulation waveforms. This makes behavioral debugging straightforward, fast, and easy. We’ve provided a reference test bench with example waveform files.

8051 Software Considerations

To generate your 8051 programs the way you always have, use your favorite C compiler or assembler and linker (if necessary) to produce the same Intel hex format file that you would use to burn an EPROM. You can even use an existing hex file, because the PB8051 looks like a regular 8051 as far as software code is concerned. An Intel hex to .COE utility is included for those designs that put the 8051 software into on-chip block RAM. On-chip program storage provides maximum speed performance.
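For readers curious what an Intel hex to .COE conversion involves, here is a minimal Python sketch. It is not the utility shipped with the PB8051; the function names are illustrative, and it handles only data (type 00) and end-of-file (type 01) records, which is enough for a 64 KB 8051 image. The output uses the `memory_initialization_radix` and `memory_initialization_vector` keywords of the Xilinx COE format.

```python
def parse_intel_hex(text: str, size: int) -> bytearray:
    """Load type-00 data records into a zero-filled image of `size` bytes."""
    image = bytearray(size)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError(f"not an Intel HEX record: {line!r}")
        record = bytes.fromhex(line[1:])
        # All bytes of a valid record, checksum included, sum to 0 modulo 256.
        if len(record) < 5 or sum(record) & 0xFF != 0:
            raise ValueError(f"bad length or checksum: {line!r}")
        count = record[0]
        address = (record[1] << 8) | record[2]
        record_type = record[3]
        if len(record) != count + 5:
            raise ValueError(f"byte count mismatch: {line!r}")
        if record_type == 0x01:  # end-of-file record
            break
        if record_type != 0x00:  # extended addressing is not handled here
            raise ValueError(f"unsupported record type {record_type:#04x}")
        if address + count > size:
            raise ValueError(f"record exceeds image size: {line!r}")
        image[address:address + count] = record[4:4 + count]
    return image


def to_coe(image: bytes) -> str:
    """Render a byte image as COE text, one hex byte per line."""
    values = ",\n".join(f"{byte:02X}" for byte in image)
    return (
        "memory_initialization_radix=16;\n"
        "memory_initialization_vector=\n"
        f"{values};\n"
    )
```

A typical use would read the compiler's hex output, call `parse_intel_hex` with the block RAM size, and write the `to_coe` result to a file for the CORE Generator memory core.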

Test and Debug Considerations
We recommend you use design tools to quickly and easily test and debug your design. The most useful tool will be an HDL simulator, such as Modeltech or Aldec programs. Most problems and bugs can be solved at the behavioral level. For interactive debugging, the Xilinx ChipScope™ integrated logic analyzer has proved to be the tool of choice. At the current time, no source code debugger tools are available for the PB8051.

Designer’s Learning Curve
Designers should have some experience in 8051 hardware/software and FPGA design before attempting to consolidate the two. The PB8051 core is designed for ease of use and integration.

Your 8051 hardware and software expertise should include hardware understanding of the part and experience in writing 8051 code using software development tools. The basic design flow using the PB8051 is identical to the normal packaged processor flow.

Integrating the PB8051 onto the Xilinx part is much the same as instantiating a core using the Xilinx CORE Generator tool. If you are familiar with VHDL or Verilog language, and have a couple of Xilinx designs under your belt, you’re good to go.

Conclusion
Integrating microprocessors onto FPGAs through emulation, as illustrated with the PB8051, is a viable alternative to a full hardware functional design. The advantage of FPGA fabric savings correlates to reduced parts cost.

Integrating the PB8051 processor into your design yields lower component count, easier board debugging, less noise, and optimum performance of the peripheral/8051 micro-interface, because both are in an FPGA. For more information about the PB8051 microcontroller, visit www.roman-jones.com/rj2/PB8051Microcontroller.htm.


What gives the PB8051 its “hardware flavor” is the user back-end interface, where you interface your logic design with the emulated 8051 processor.

by John Carbone
VP, Marketing
Express Logic

A fast processor is essential for today’s demanding electronic products. Efficient application software is equally essential, as it enables the processor to keep up with the real-world demands of networking and consumer products.

But what about the real-time operating system (RTOS) that handles your system interrupts and schedules your application software’s multiple threads?

If it’s not optimized for the processor you’re using, and isn’t efficient in its handling of external events, you’ll end up with only half a processor to do useful work.

An efficient RTOS should do more than just handle multiple threads and service interrupts. It must also have a small footprint and be able to deliver the full power of the processor used by your application software. Express Logic’s ThreadX® real-time operating system is just such an RTOS.

The ThreadX RTOS has now been optimized to support the Xilinx MicroBlaze™ soft processor, and is available off-the-shelf for your next development project.

A Good Fit
Let’s look at some of the reasons why the ThreadX RTOS is a good fit with the MicroBlaze processor.

Size
The ThreadX RTOS is only about 6 KB to 12 KB in total size for a complete multitasking system. This includes basic services, queues, event flags, semaphores, and memory allocation services.

Its small size means that more of your memory resources will be preserved for use by the application and not absorbed by the RTOS. This will result in smaller memory requirements, lower costs, and lower power consumption. Table 1 shows code sizes for various optional components of the ThreadX RTOS on the MicroBlaze processor.

Efficiency
The ThreadX RTOS can perform a full context switch between active threads in a mere 1.2 µs on a 100 MHz Xilinx Virtex-II Pro™ FPGA. This is particularly critical in applications with high interrupt incidences, such as network packet processing.

The ability of the ThreadX RTOS to handle interrupts efficiently means that your system can handle higher rates of external events such as TCP/IP packet arrivals, delivering greater throughput.

Speed
The ThreadX RTOS responds to interrupts and schedules required processing in less than 1 microsecond. The single-level interrupt processing architecture of the ThreadX RTOS eliminates excess overhead found in many other RTOSs.

An RTOS can be critical for getting the most out of the MicroBlaze processor.

Optimize MicroBlaze Processors for Consumer Electronics Products


This keeps the ThreadX RTOS from gobbling up valuable processor cycles with housekeeping tasks, leaving more processor bandwidth for your application. As a result, you can configure a less expensive processor or add more features to your application without increasing processor speed and cost.
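To put the quoted figures in terms of processor cycles, here is a small C sketch of the arithmetic. The 1.2 µs context switch at 100 MHz is the figure given in the text; the rate of 10,000 context switches per second mentioned below is an invented workload used purely for illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Clock cycles consumed by an operation lasting time_ns on a clock_mhz clock. */
double cycles_for(double time_ns, double clock_mhz)
{
    return time_ns * clock_mhz / 1000.0;   /* MHz = cycles per microsecond */
}

/* Percentage of processor time spent on an operation that takes time_ns
 * and happens events_per_sec times every second. */
double overhead_percent(double time_ns, double events_per_sec)
{
    return time_ns * events_per_sec / 1e9 * 100.0;
}
```

With these helpers, cycles_for(1200.0, 100.0) gives 120 cycles per context switch, and under the assumed 10,000 switches per second, overhead_percent(1200.0, 10000.0) comes to roughly 1.2 percent of the processor.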

Table 2 shows execution times for various ThreadX services measured according to the following definitions:

• Immediate Response (IR): Time required to process the request immediately, with no thread suspension or thread resumption.

• Thread Suspend (TS): Time required to process the request when the calling thread is suspended because the resource is unavailable.

• Thread Resumed (TR): Time required to process the request when a previously suspended thread (of the same or lower priority) is resumed because of the request.

• Thread Resumed and Context Switched (TRCS): Time required to process the request when a previously suspended thread with a higher priority is resumed because of the request. As the resumed thread has a higher priority, a context switch to the resumed thread is also performed from within the request.

• Context Switch (CS): Time required to save the current thread’s context, find the highest priority ready thread, and restore its context.

• Interrupt Latency Range (ILR): Time duration of disabled interrupts.

Ease of Use
All services are available as function calls, and full source code is provided so you can see exactly what the ThreadX RTOS is doing at any point.

Many other commercial RTOS products are massive collections of unintelligible system calls with non-intuitive names. The ThreadX RTOS uses service call names that are orderly, simple in structure, and ultimately easier to use. This enables shorter development times and faster time to market.

Fast Interrupt Handling
The ThreadX RTOS is optimized for fast interrupt handling, saving only scratch registers at the beginning of an interrupt service routine (ISR), enabling use of all services from the ISR, and using the system stack in the ISR.

Efficient interrupt processing delivers the best performance from MicroBlaze processors, giving your product a boost over the competition.

Full Source Code
Because full source code is provided, you have complete visibility into all ThreadX services – no more mysteries are hidden within the RTOS “black box.”

This full view of source code enables you to avoid delays in development caused by an incomplete understanding of RTOS services. With the ThreadX RTOS, full source code is there any time you need it.

Conclusion
Express Logic’s ThreadX RTOS is well matched for networking and consumer applications based on MicroBlaze processors. It’s been optimized to deliver the processor’s inherent performance directly through to the application.

What’s next? Express Logic is working to port NetX™, its TCP/IP network stack, to MicroBlaze. With so many of today’s consumer electronics devices accessing the Internet, efficient TCP/IP software is increasingly critical. The NetX TCP/IP stack will provide further support for MicroBlaze in consumer electronics products, and will enable developers to concentrate on their application code, getting their product to market faster.

To develop a successful, high-performance consumer, networking, or other electronic product around MicroBlaze processors, use an efficient RTOS like ThreadX by Express Logic. For further information on the ThreadX RTOS and other embedded software products from Express Logic, please visit www.expresslogic.com.

ThreadX Instruction Area Size (in bytes)

Core Services (required)     4,804
Queue Services               3,200
Event Flag Services          1,720
Semaphore Services             932
Mutex Services               2,540
Block Memory Services        1,144
Byte Memory Services         1,876

Table 1 – Code size of various ThreadX services

ThreadX Service Call Execution Times

Service IR TS TR TRCS
tx_thread_suspend 1.0 µs 2.0 µs
tx_thread_resume 1.0 µs 2.8 µs
tx_thread_relinquish 0.5 µs 1.9 µs
tx_queue_send 0.9 µs 2.4 µs 1.6 µs 3.4 µs
tx_queue_receive 0.9 µs 2.3 µs 2.5 µs 4.2 µs
tx_semaphore_get 0.3 µs 2.3 µs
tx_semaphore_put 0.3 µs 1.3 µs 3.0 µs
tx_mutex_get 0.4 µs 2.4 µs
tx_mutex_put 1.0 µs 2.0 µs 3.6 µs
tx_event_flags_set 0.6 µs 1.7 µs 3.4 µs
tx_event_flags_get 0.4 µs 2.5 µs
tx_block_allocate 0.4 µs 2.3 µs
tx_block_release 0.4 µs 1.4 µs 3.1 µs
tx_byte_allocate 1.4 µs 2.9 µs
tx_byte_release 0.9 µs 3.0 µs 4.7 µs
Context Switch (CS) 1.2 µs
Interrupt Latency Range (ILR) 0.0 µs-0.8 µs

Table 2 – Execution times of various ThreadX services
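The Table 1 figures can be added up to estimate the instruction footprint of a particular configuration. The C sketch below does that sum; the flag and function names are hypothetical and are not ThreadX identifiers, and the byte counts are simply the ones listed in Table 1.

```c
#include <assert.h>
#include <stddef.h>

/* Optional components from Table 1 (hypothetical flag names). */
#define SVC_QUEUE         (1 << 0)
#define SVC_EVENT_FLAGS   (1 << 1)
#define SVC_SEMAPHORE     (1 << 2)
#define SVC_MUTEX         (1 << 3)
#define SVC_BLOCK_MEMORY  (1 << 4)
#define SVC_BYTE_MEMORY   (1 << 5)
#define SVC_ALL           ((1 << 6) - 1)

#define CORE_BYTES 4804u   /* Core Services are always required */

static const struct { int flag; unsigned bytes; } optional_sizes[] = {
    { SVC_QUEUE,        3200u },
    { SVC_EVENT_FLAGS,  1720u },
    { SVC_SEMAPHORE,     932u },
    { SVC_MUTEX,        2540u },
    { SVC_BLOCK_MEMORY, 1144u },
    { SVC_BYTE_MEMORY,  1876u },
};

/* Instruction area size in bytes for the core plus the selected components. */
unsigned footprint_bytes(int selected)
{
    unsigned total = CORE_BYTES;
    for (size_t i = 0; i < sizeof optional_sizes / sizeof optional_sizes[0]; i++) {
        if (selected & optional_sizes[i].flag)
            total += optional_sizes[i].bytes;
    }
    return total;
}
```

For example, footprint_bytes(SVC_QUEUE | SVC_SEMAPHORE) adds 4,804 + 3,200 + 932 for a total of 8,936 bytes.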


by Jean J. Labrosse
President
Micrium

Designing software for a microprocessor-based application can be a challenging undertaking. To reduce the risk and complexity of such a project, it’s always important to break the problem into small pieces, or tasks. Each task is therefore responsible for a certain aspect of the application. Keyboard scanning, operator interfaces, display, protocol stacks, reading sensors, performing control functions, updating outputs, and data logging are all examples of tasks that a microprocessor can perform.

In many embedded applications, these tasks are simply executed sequentially by the microprocessor in a giant infinite loop. Unfortunately, these types of systems do not provide the level of responsiveness required by a growing number of real-time applications because the code executes in sequence. Thus, all tasks have virtually the same priority.

An RTOS called µC/OS-II has been ported to the MicroBlaze soft-core processor.

Use an RTOS on Your Next MicroBlaze-Based Product


For this and other reasons, consider the use of a real-time operating system (RTOS) for your next Xilinx MicroBlaze™ processor-based product. This article introduces you to a low-cost, high-performance, and high-quality RTOS from Micrium called µC/OS-II.

What is µC/OS-II?
µC/OS-II is a highly portable, ROM-able, scalable, preemptive RTOS. The source code for µC/OS-II contains about 5,500 lines of clean ANSI C code. It is included on a CD that accompanies a book fully describing the inner workings of µC/OS-II. The book is called MicroC/OS-II, The Real-Time Kernel (ISBN 1-5782-0103-9) and was written by the author (Figure 2). The book also contains more than 50 pages of RTOS basics.

Although the source code for µC/OS-II is provided with the book, µC/OS-II is not freeware, nor is it open source. In fact, you need to license µC/OS-II to use it in actual products.

Micrium’s application note AN-1013 provides the complete details about the porting of µC/OS-II to the MicroBlaze soft-core processor. This application note is available from the Micrium website at www.micrium.com.

µC/OS-II was designed specifically for embedded applications and thus has a very small footprint. In fact, the footprint for µC/OS-II can be scaled based on which µC/OS-II services you need in your application. On the MicroBlaze processor, µC/OS-II can be scaled (as shown in Table 1) and can easily fit into Xilinx FPGA block RAM.

µC/OS-II is a fully preemptive real-time kernel. This means that µC/OS-II always attempts to run the highest-priority task that is ready to run. Interrupts can suspend the execution of a task and, if a higher-priority task is awakened as a result of the interrupt, that task will run as soon as the interrupt service routines (ISR) are completed.
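The idea of always running the highest-priority ready task can be sketched with a simple ready bitmap in C. This is a simplified illustration, not µC/OS-II’s actual scheduler code; the names are hypothetical, and the sketch assumes 64 priority levels with lower numbers meaning higher priority, which is the convention µC/OS-II uses.

```c
#include <assert.h>
#include <stdint.h>

#define NO_TASK_READY (-1)

/* Bit n set means the task at priority n is ready to run (0 = most important). */
static uint64_t ready_bits;

/* Mark the task at the given priority (0..63) as ready, e.g. from an ISR. */
void ready_set(int prio)
{
    ready_bits |= (UINT64_C(1) << prio);
}

/* Mark the task at the given priority as waiting for an event. */
void ready_clear(int prio)
{
    ready_bits &= ~(UINT64_C(1) << prio);
}

/* Return the priority of the most important ready task, or NO_TASK_READY. */
int highest_ready(void)
{
    for (int prio = 0; prio < 64; prio++) {
        if (ready_bits & (UINT64_C(1) << prio))
            return prio;
    }
    return NO_TASK_READY;
}
```

If a task at priority 20 is running and an interrupt calls ready_set(5), highest_ready() then returns 5, so a preemptive kernel would switch to that task when the ISR completes. A production kernel would typically replace the linear scan with a constant-time lookup.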

µC/OS-II provides a number of system services, such as task and time management, semaphores, mutual exclusion semaphores, event flags, message mailboxes, message queues, and fixed-sized memory partitions. In all, µC/OS-II provides more than 65 functions you can call from your application. µC/OS-II can manage as many as 63 application tasks that could result in quite complex systems.

µC/OS-II is used in hundreds of products from companies worldwide. µC/OS-II is also very popular among colleges and universities, and is the foundation for many courses about real-time systems.

RTOS Basics with µC/OS-II
When you use an RTOS, very little has to change in the way you design your product; you still need to break your application into tasks. An RTOS is software that manages and prioritizes these tasks based on your input. The RTOS takes the most important task that is ready to run and executes it on the microprocessor.

With most RTOSs, a task is an infinite loop, as shown here:

void MyTask (void)
{
    while (1) {
        // Do something (Your code)
        // Wait for an event, either:
        //   Time to expire
        //   A signal from an ISR
        //   A message from another task
        //   Other
    }
}

You’re probably wondering how it’s possible to have multiple infinite loops on a single processor, as well as how the CPU goes from one infinite loop to the next. The RTOS takes care of that for you by suspending and resuming tasks based on events. To have the RTOS manage these tasks, you need to provide at least the following information to an RTOS:

• The starting address of the task (MyTask())

• The task priority based on the importance you give to the task

• The amount of stack space needed by the task.

With µC/OS-II, you provide this information by calling a function to “create a task” (OSTaskCreate()). µC/OS-II maintains a list of all tasks that have been created and organizes them by priority. The task with the highest priority gets to execute its code on the MicroBlaze processor. You determine how much stack space each task gets based on the amount of space needed by the task for local variables, function calls, and ISRs.

In the task body, within the infinite loop, you need to include calls to RTOS functions so that your task waits for some event. One task can ask the RTOS to “run me every 100 milliseconds.” Another task can say, “I want to wait for a character to be received on a serial port.” Yet another task can say, “I want to wait for an analog-to-digital conversion to complete.”

Events are typically generated by your application code, either from ISRs or issued by other tasks. In other words, an ISR or another task can send a message to a task to wake that task up. With µC/OS-II, a task waits for an event using a “pend” call. An event is signaled using a “post” call, as shown in the pseudo-code below:

void EthernetRxISR(void)
{
    Get buffer to store received packet;
    Copy NIC packet to buffer;
    Call OSQPost() to send packet to task (this signals the task);
}

void EthernetRxTask (void)
{
    while (1) {
        Wait for packet by calling OSQPend();
        Process the packet received;
    }
}


                Code (ROM)    Data (RAM)
Minimum Size    5.5 KB        1.25 KB (excluding task stacks)
Maximum Size    24 KB         9 KB (excluding task stacks)

Table 1 – Memory footprint for µC/OS-II on the MicroBlaze processor

The execution profile for these two functions is shown in Figure 1.

Let’s assume that a low-priority task is currently executing (1). When an Ethernet packet is received on the Network Interface Card (NIC), an interrupt is generated. The MicroBlaze processor suspends execution of the current task (the low-priority task) and vectors to the ISR handler (2).

With µC/OS-II, the ISR starts by saving all of the MicroBlaze registers, taking a mere 300 nanoseconds at 150 MHz. The ISR handler then needs to determine the source of the interrupt and execute the appropriate function (in this case, EthernetRxISR()) (3).

This ISR obtains a buffer large enough to hold the packet received by the NIC. It copies the contents of the packet received into the buffer and sends the address of this buffer to the task responsible for handling received packets (EthernetRxTask() in our example) using µC/OS-II’s OSQPost() (4).

When this operation is completed, the ISR invokes µC/OS-II (calls a function) and µC/OS-II determines that a higher-priority task needs to execute. µC/OS-II then resumes the more important task and doesn’t return to the interrupted task (5). With µC/OS-II on the MicroBlaze processor, this takes about 750 nanoseconds at 150 MHz.

When the packet is processed, EthernetRxTask() calls OSQPend() (6). µC/OS-II notices that there are no more packets to process and suspends execution of this task by saving the contents of all MicroBlaze registers onto that task’s stack. Note that when the task is suspended, it doesn’t consume any processing time.

µC/OS-II then locates the next most important task that’s ready to run (the “Low-Priority Task” in Figure 1) and switches to that task by loading the MicroBlaze registers with the context saved by the ISR (7). This is called a context switch. µC/OS-II takes less than 1 microsecond at 150 MHz on the MicroBlaze processor to complete this process. The interrupted task is resumed exactly as if it was never interrupted (8).
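The hand-off between the ISR and the task in steps (4) and (6) amounts to passing a buffer pointer through a message queue. The following is a host-side C sketch of that data flow only. It is not µC/OS-II code and does not use the kernel’s OSQPost() or OSQPend(); the demo_ names are hypothetical stand-ins, and where a real kernel would suspend the waiting task, this sketch simply returns NULL.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_Q_SIZE 4u

/* A tiny ring buffer of message pointers (illustrative stand-in for an RTOS queue). */
struct demo_queue {
    void    *slots[DEMO_Q_SIZE];
    unsigned head;     /* index of the oldest message */
    unsigned count;    /* number of messages waiting */
};

/* ISR side: hand a packet buffer to the task.
 * Returns 0 on success, or -1 if the queue is full. */
int demo_q_post(struct demo_queue *q, void *msg)
{
    if (q->count == DEMO_Q_SIZE)
        return -1;
    q->slots[(q->head + q->count) % DEMO_Q_SIZE] = msg;
    q->count++;
    return 0;
}

/* Task side: take the oldest message, or NULL when the queue is empty.
 * A real kernel would suspend the calling task here instead of returning. */
void *demo_q_pend(struct demo_queue *q)
{
    if (q->count == 0)
        return NULL;
    void *msg = q->slots[q->head];
    q->head = (q->head + 1) % DEMO_Q_SIZE;
    q->count--;
    return msg;
}
```

Because only the address of the buffer travels through the queue, the ISR stays short: the packet contents are copied once into the buffer, and all further processing happens in the task.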

The general rule for an application using an RTOS is to put most of your code into tasks and very little code into ISRs. In fact, you always want to keep your ISRs as short as possible. With an RTOS, your system is easily expanded by simply adding more tasks. Adding lower-priority tasks to your application will not affect the responsiveness of higher-priority tasks.

Safety-Critical Systems Certified
µC/OS-II has been certified by the Federal Aviation Administration (FAA) for use in commercial aircraft by meeting the demanding requirements of the RTCA DO-178B standard (Level A) for software used in avionics equipment. To meet the requirements for this standard, it must be possible to demonstrate through documentation and testing that the software is both robust and safe.

This is particularly important for an operating system, as it demonstrates that it has the proven quality to be usable in any application. Every feature, function, and line of code of µC/OS-II has been examined and tested to demonstrate its safety and robustness for safety-critical systems where human life is on the line.

µC/OS-II also follows most of the Motor Industry Software Reliability Association (MISRA) C Coding Standards. These standards were created by MISRA to improve the reliability and predictability of C programs in critical automotive systems.

Your application may not be a safety-critical application, but it’s reassuring to know that µC/OS-II has been subjected to the rigors of such environments.

Conclusion
An RTOS is a very useful piece of software that provides multitasking capabilities to your application and ensures that the most important task that is ready to run executes on the CPU. You can actually experiment with µC/OS-II in your embedded product by obtaining a copy of the MicroC/OS-II book. You only need to purchase a license to use µC/OS-II for the production phase of your product.

Figure 1 – Execution profile of an ISR and µC/OS-II (timelines for EthernetRxISR(), EthernetRxTask(), and the Low-Priority Task, with steps (1) through (8) and the OSQPost() and OSQPend() calls marked)

Figure 2 – MicroC/OS-II book cover


UltraController™-II
Ultra-Fast, Ultra-Small PowerPC-Based Microcontroller Reference Design

Embedded applications present a variety of design challenges, requiring both hardware and software to accomplish real-time processing. This is because non-performance-critical functions and complex state machines are handled more efficiently in software, while timing-critical, parallel structures are handled more efficiently in hardware. The Virtex-II Pro and Virtex-4 FX FPGAs meet these requirements by providing fast FPGA fabric and immersed PowerPC™ processors on the same device.

The UltraController-II, an enhanced version of the popular UltraController, provides the easiest way to use the embedded PowerPC processor in the Virtex-II Pro and Virtex-4 FX FPGAs. This minimal footprint embedded processing engine maximizes performance and minimizes FPGA resources by running code strictly from the integrated PowerPC caches. System designers can easily incorporate the UltraController-II design by using the award winning Xilinx Platform Studio tool suite.

PowerPC Made Easy

The UltraController-II, an enhanced version of the popular UltraController, is a minimal footprint embedded processing engine based on the PowerPC 405 processor core in the Virtex-II Pro and Virtex-4 FX FPGAs. With exciting new features, the UltraController-II further simplifies PowerPC based FPGA designs.

Ease-of-Use for Hardware and Software Designers – The UltraController-II reference design is a pre-verified processing engine with 32 bits of user-defined general purpose input and output as well as interrupt handling capability. It enables FPGA logic centric designers to partition design functions between FPGA logic and software while leveraging the embedded PowerPC core in the Virtex-II Pro and Virtex-4 FX FPGAs, using the award winning Xilinx Platform Studio design flow.

Maximum Performance – Processing power is determined by program execution speed. The UltraController-II can be clocked at the maximum PowerPC clock frequency (up to 400 MHz in Virtex-II Pro and 450 MHz in Virtex-4 FX FPGA), which far exceeds any soft core processor implementation. It also delivers up to 702 DMIPS with a low 0.45 mW/MHz power consumption in Virtex-4 FX.

Minimum FPGA Resource – Code now executes faster by using the 16 KB instruction and 16 KB data-side cache memory. By utilizing cache memory to hold instructions and data, no additional BRAM resources are required. As a result, the UltraController-II occupies only 10 logic cells, 1/5 the logic size of the original UltraController!


© 2005 Xilinx Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Printed in U.S.A. PN 0010751-1

Corporate Headquarters

Xilinx, Inc.

2100 Logic Drive

San Jose, CA 95124

Tel: 408-559-7778

Fax: 408-559-7114

Web: www.xilinx.com

Japan

Xilinx, K. K.

Shinjuku Square Tower 18F

6-22-1 Nishi-Shinjuku

Shinjuku-ku, Tokyo

163-1118, Japan

Tel: 81-3-5321-7711

Fax: 81-3-5321-7765

Web: www.xilinx.co.jp

Asia Pacific Distributed by

Xilinx, Asia Pacific Pte. Ltd.

No. 3 Changi Business Park Vista, #04-01

Singapore 486051

Tel: (65) 6544-8999

Fax: (65) 6789-8886

RCB no: 20-0312557-M

Web: www.xilinx.com

Europe Headquarters

Xilinx, Ltd.

Citywest Business Campus

Saggart,

Co. Dublin

Ireland

Tel: +353-1-464-0311

Fax: +353-1-464-0324

Web: www.xilinx.com

The Programmable Logic CompanySM

[Block diagram labels: PPC405, PLB2OPB Bridge, OPB/Arbiter, OPB UART, OPB GPIO, OPB EMC, BRAM Block, JTAG, CNTL, I/F]

Integrated Design Flow

Design Flow

The UltraController-II development leverages the proven hardware flow of ISE and simplified software flow within the Xilinx Platform Studio (XPS) and Embedded Development Kit (EDK) to provide rapid prototyping and debug abilities. The simplified software flow enables a direct code-to-download option with zero hardware impact, yet maintains the standard ISE development tool flow for full synthesis, simulation and PAR capabilities.

Take the Next Step
Visit www.xilinx.com/ultracontroller to download the free UltraController II reference design. Contents include:

• Complete VHDL/Verilog Source Code
• IP and software included
• Collateral Included
  – QuickStart Tutorial
  – HDL Sources & Examples
  – “C” Source Code Examples

The UltraController II application note (XAPP575) is available at www.xilinx.com/bvdocsappnotes/xapp575.pdf

The UltraController application note (XAPP672) is available at www.xilinx.com/bvdocsappnotes/xapp672.pdf

Key Features

• Utilizes the PowerPC in the Virtex-II Pro and Virtex-4 FX
• Occupies only 10 logic cells
• Runs at up to 400 MHz in Virtex-II Pro and 450 MHz in Virtex-4 FX
• Code and data execute from the integrated 16 KB instruction and 16 KB data caches
• 32-bit input and 32-bit output ports
• External interrupt support
• Timer functionality support (FIT, PIT, Watchdog)
• Easy to integrate using the award winning Xilinx Platform Studio (XPS) tool flow
• Fully observable with ChipScope Pro
• Verilog and VHDL source code provided
• "C" source code examples provided

UltraController Features Comparison

Feature          UltraController         UltraController-II    UltraController-II
                 in Virtex-II Pro        in Virtex-II Pro      in Virtex-4 FX
Performance
  Max fmax       200 MHz                 400 MHz               450 MHz
  Max DMIPS      220 DMIPS               624 DMIPS             702 DMIPS
  DMIPS/MHz      1.1                     1.56                  1.56
Logic Cells      49 Logic Cells          10 Logic Cells        10 Logic Cells
Power mW/MHz     0.9                     0.9                   0.45
BRAM Usage       UC 4x4: 4, UC 8x8: 8    0                     0

ISE Flow
Step 1a: Implement HW (Map, Place & Route)
Step 2: Generate Bitstream (Bitgen)
Step 3: Configure FPGA (iMPACT)

XPS Flow
Step 1b: Create and Update SW Application
Step 4: Modify & Compile C code in XPS


Enabling success from the center of technology™

1 . 8 0 0 . 4 0 8 . 8 3 5 3 www.em.avnet.com

© Avnet, Inc. 2005. All rights reserved. AVNET is a registered trademark of Avnet, Inc. Virtex-4, Spartan-3, MicroBlaze and Platform Studio are registered trademarks of Xilinx. PowerPC is a trademark of International Business Machines Corporation.

Avnet Electronics Marketing introduces a new FREE, one-day seminar that will detail how to utilize Xilinx FPGA technologies in the areas of programmable logic, connectivity, embedded processing, and design tools. It will include presentations, case studies, and live demonstrations to help you assess how these pioneering offerings can be leveraged in your company’s product plans.

Seminar attendees will qualify for reduced pricing on special bundles of training classes, software tools, development hardware and the Embedded Systems Starter kit. Free Avnet Design Services evaluation kits (choice of Virtex or Spartan-based) will also be given away.

Seminar Topics
• The Virtex-4 FX family of FPGAs, which delivers breakthrough performance at the lowest cost and offers a compelling alternative to ASICs and ASSPs.
• The Spartan-3 FPGA family, which delivers an unmatched combination of low-cost and full-feature capability.
• Detailed coverage of the capabilities of these two families using the Integrated Software Environment (ISE) and Embedded Development Kit (EDK).

Registration: em.avnet.com/xleratedsolutions

Locations Dates

Minneapolis, MN March 22

San Jose, CA March 22

Chicago, IL March 23

Ottawa, ON March 24

Toronto, ON March 25

Boston, MA March 30

Rochester, NY March 31

Atlanta, GA April 4

Huntsville, AL April 5

Los Angeles, CA April 5

Irvine, CA April 6

Baltimore, MD April 6

San Diego, CA April 7

Philadelphia, PA April 7

Seattle, WA April 7

Phoenix, AZ April 8

Salt Lake City, UT April 18

Denver, CO April 19

Raleigh, NC April 19

Orlando, FL April 20

Ft. Lauderdale, FL April 21

Austin, TX April 27

Dallas, TX April 28

Support Across the Board.™

XLerated Solutions Seminar from Avnet Electronics Marketing Showcases the Latest Innovations from Xilinx®

PN 0010862
