List at least 3 advantages of implementing selected portions of a complex design
in software
Problem 1
Software vs. Hardware Trade-offs
Implement more in Software
Implementmore in Hardware
Manage Design ComplexityReduce Design Cost
Stick to Design ScheduleHandle Deep Submicron
Improve PerformanceImprove Energy EfficiencyReduce Power Density
Source: A Practical Introduction to Hardware/Software Codesign
What are the primary advantages of Zynq All Programmable SoC over a traditional FPGA,
such as Virtex 7, with a soft microprocessor core, such as MicroBlaze?
Problem 2
Explain the term differentiation, and describe how Zynq can be
used to accomplish a high-level of differentiation
Problem 3
In economics and marketing, product differentiation (or simply differentiation)
is the process of distinguishing a product or service from others, to make it more attractive to
a particular target market. This involves differentiating it from competitors' products
as well as a firm's own products.
Vivado Design Suite
• scalable data model, supporting designs with up to 100 million ASIC gate equivalents (GEs)
• based on industry standards, such as• AMBA AXI4 interconnect• IP-XACT IP packaging metadata• Tool Command Language (Tcl)• Synopsys Design Constraints (SDC)
Productivity Gains
• Synthesis tool 3x faster than Xilinx XST• Substantial improvement in runtime and
maximum design size compared to Xilinx ISE• Vivado Simulator 3x faster than Isim• Much better visibility into key design metrics,
such as timing, power, resource utilization, and routing congestion much earlier during the design process
• Estimates becomes progressively more accurate
Multidimensional Analytical PlacerISE:• One-dimensional, timing-driven place-and-route algorithms• Simulated annealing algorithms that determine
randomly where the tool should place logic cells• Does adequate job for FPGAs below 1 million GEs
Vivado:• Modern multidimensional analytic placement algorithm• Deterministically finds a solution that primarily minimizes:
timing, congestion, and wire length• Better results, fewer iterations• Efficient up to 100 million GEs
Hierachical Chip Planning& Advantages of Standards
• ability to partition the design for processing by synthesis, implementation and verification
• divide-and-conquer team approach to big projects• design preservation feature enabling repeatable timing results• access to state of the art third-party EDA tools for tasks such as
• constraint generation• formal verification• static timing analysis
Power Optimization and Analysis
• capable of analyzing design logic and removing unnecessary switching activity
• advanced clock gating techniques• up to 30% reduction in dynamic power• power estimates at every stage of the design flow
IP Packager, Integrator, and Extensible IP Catalog
• any part of the design (including the entire design) can beturned into a reusable core at any level of the design flow:
• RTL• netlist• placed netlist• placed-and-routed netlist
• IP-XACT descriptions easy to integrate into future designs• IP Packager specifies the IP data using XML file• Extensible IP Catalog allows users to build their own standard
repositories from IP they’ve created, licensed from Xilinx, or licensed from third-party vendors
High Level LanguageC, C++, System C
Hardware Description LanguageVHDL or Verilog
VivadoHLS
Vivado HLS
Support for Xilinx Families
90 nm Spartan-3, Virtex-4
65 nm Virtex-5
45 nm Spartan-6
40 nm Virtex-6
28 nm Artix-7, Kintex-7, Virtex-7, Zynq 7000
Future families
ISE
Vivado
Explain the ways of minimizing the area of the GPIO core, shown conceptually in the diagram below
Problem 7
Source: LogiCORE IP AXI GPIO: Product Specification
AXI GPIO Resource Utilization andMaximum Clock Frequency
What are the primary limitations of the Accelerator Coherency Port (ACP) of Zynq, which go against the idea of full coherency?
Problem 9
• ACP allows limited support for Hardware Coherency– Allows a PL accelerator to access cache of the Cortex-A9 processors– PL has access through the same path as CPUs including caches,
OCM, DDR, and peripherals– Access is low latency (assuming data is in processor cache)
no switches in path
• ACP does not allow full coherency– PL is not notified of changes in processor caches– Use write to PL register for synchronization
Accelerator Coherency Port (ACP) Summary
Pulse Width Modulation (PWM) Mode
• Two timer/counters are used as a pair to produce an output signal (PWM0) with a specified frequency and duty factor
• Timer 0 sets the period• Timer 1 sets the high time for the PWM0 output
• Can be used to generate• Periodical signals with varying period and duty cycle
Which mode of AXI Timer can be used to measure interval between two external events,such as two consecutive pushes of a button
connected to the pins of PL?
Problem 11
Capture Mode• The counter can be configured as an up or down counter• The value of the counter is stored in the load register when
the external capture signal is asserted• The TINT flag is also set on detection of the capture event• The Auto Reload/Hold (ARHT) bit controls whether the
capture value is overwritten with a new capture value before the previous TINT flag is cleared
• Can be used to measure• Widths of non-periodical signals• Periods of periodical signals• Intervals between edges of two different signals, etc.
AXI Interfaces and Interconnects
Source: The Zynq Book
Interface
A point-to-point connection for passing data, addresses, and hand-shaking signals between master and slave clients within the system
Interconnect
A switch which manages and directs traffic between attached AXI interfaces
Summary of AXI Full and AXI Lite Interfaces
Source: Building Zynq Accelerators with Vivado HLS, FPL 2013 Tutorial
Summary of AXI Stream Interface
Source: Building Zynq Accelerators with Vivado HLS, FPL 2013 Tutorial
Provide the names of at least two System-on-Chip Bus Standards
competing with AMBA AXI4 standards
Problem 14
Bus Developedby
High-PerformanceShared Bus
PeripheralShared
Bus
Point-to-PointBus
AMBA v3 ARM AHB APBAMBA v4 ARM AXI4 AXI4-Lite AXI4-Stream
Coreconnect IBM PLB OPBWishbone SiliCore
Corp.Crossbar Topology
SharedTopology
Point to PointTopology
Avalon Altera Avalon-MM Avalon-MM Avalon-STAMBA: Advanced Microcontroller Bus ArchitectureAXI: Advanced eXtensible InterfaceAHB: AMBA High-speed BusAPB: AMBA Peripheral BusPLB: Processor Local BusOPB: On-chip Peripheral BusMM: Memory MappedST: Streaming
Competing System-on-Chip Bus Standards
Source: A Practical Introduction toHardware/Software Codesign
Explain the need for the highlighted operations in case of the DMA-based communication
between an ARM core and a hardware accelerator using GP Ports between PS and PL.
Problem 15
A. Write to Acceleratorprocessor allocates bufferprocessor writes data into bufferprocessor flushes cache for bufferprocessor initiates DMA transfer
B. Read from Acceleratorprocessor allocates bufferprocessor initiates DMA transferprocessor waits for DMA to completeprocessor invalidates cache for bufferprocessor reads data from buffer
Explain the meaning of each number in the following estimate of
the maximum bandwidth supported by
Problem 16
A. HP ports of Zynq:
Maximum bandwidth = 4 * 64 bits * 150 MHz * 2 = 9.6 GByte/sec
B. external DDR
Maximum bandwidth = 1 * 32 bits * 2 * 533 MHz * 2 = 4.3 GByte/s
A. HP ports of Zynq:
Maximum bandwidth = 4 * 64 bits * 150 MHz * 2 = 9.6 GByte/sec
Bidirectional transfer
PL clock frequency
Data bus width for HP interfaces
Number of HP interfaces
B. external DDR
Maximum bandwidth = 1 * 32 bits * 2 * 533 MHz * 2 = 4.3 GByte/s
Bidirectional transfer
Maximum clock rate (clock frequency) of DDR3
Data bus width for DRAM interface
Number of DRAM interfaces
Double data rate (using both clk edges)
1. Write the address of the starting descriptor to
the Current Descriptor register2. Start the MM2S channel running by setting the run/stop bit
to 1, MM2S_DMACR.RS = 1.
3. If desired, enable interrupts by writing a 1 to
MM2S_DMACR.IOC_IrqEn and MM2S_DMACR.Err_IrqEn.
4. Write a valid address to the Tail Descriptor register.Writing to the Tail Descriptor register triggers the DMA
to start fetching the descriptors from the memory.
Scatter Gather DMA TransferProgramming Sequence for MM2S channel (1)
1. Write the address of the starting descriptor to
the Current Descriptor register2. Start the S2MM channel running by setting the run/stop bit
to 1, S2MM_DMACR.RS = 1.
3. If desired, enable interrupts by by writing a 1 to
S2MM_DMACR.IOC_IrqEn and S2MM_DMACR.Err_IrqEn.
4. Write a valid address to the Tail Descriptor register.Writing to the Tail Descriptor register triggers the DMA
to start fetching the descriptors from the memory.
Simple DMA TransferProgramming Sequence for S2MM channel (1)
What is the primary advantage of using a Scatter-Gather DMA Transfer
rather than a Simple DMA transfer?
Problem 18
Based on the table below calculate the latencyin nanoseconds for communication between:
Problem 19
A. DMA and DDR, using HP interfaceB. ARM and internal memory of AXI Slave
located in PL, using GP interface
L1 Cache L2 Cache DDR OCM IOP Slave M_AXI_GP0
CPU Pipeline 1 25 67 20 122 86
Peripheral Master – – 136 106 – 126
S_AXI_ACP 27 32 89 27 124 –S_AXI_HP0 – – 76 46 – –S_AXI_GP0 – – 118 88 144 –
AXI Interconnects and Interfaces
Source: The Zynq Book
ACP - Accelerator Coherency Port SCU - Snoop Control Unit
• Processing system– CPU at 667 MHz (650 MHz on Zybo board)– DDR at 533 MHz – Programmable logic is running at 150 MHz
• Latency in terms of CPU clock cycles– S_AXI_HP at 76 CPU cycles is
17 cycles for the PL (114 ns)
CPU Clock Period = 1 / CPU Clock Frequency = 1.5 ns
A. 76 * 1.5 ns = 114 ns
B. 86 * 1.5 ns = 129 ns
Provide any missing fragments of the VHDL code describing Xilinx BRAM in the
WRITE_FIRST mode, illustrated in the diagram below
Problem 20
60
Block RAM with synchronous read Write-First Mode - cont'darchitecture behavioral of raminfr is
type ram_type is array (0 to 2**r-1) of std_logic_vector (w-1 downto 0);
signal RAM : ram_type := (others => (others => '0'));
beginprocess (clk) beginif (clk'event and clk = '1') then
if (en = '1') thenif (we = '1') then
RAM(to_integer(unsigned(addr))) <= di;do <= di;
elsedo <= RAM(to_integer(unsigned(addr)));
end if; end if;
end if; end process;
end behavioral;
62
Block RAM
Simple Dual Port (SDP) = one port for read, one port for write(write_A-read_B, read_A_write_B)
True Dual Port (TDP) =both ports can be used for read or write
(read_A-read_B, read_A-write_B, write_A-read_B, write_A-write_B)
1. Location and basic building blocks
2. Distributed RAM has asynchronous outputBlock RAM has synchronous output.
3. Dual Port Distributed RAM requires twice asmany resources (MLUTs)Dual Port Block RAM requires the same numberof resources as a Single-Port Block RAM
4. Distributes Dual Port RAM does not allowtwo writes, only one read and one write.Dual Port Block RAM does not have this restriction
65
Location of Distributed RAMRAM blocks
Multipliers
Logic blocks
Graphics based on The Design Warrior’s Guide to FPGAsDevices, Tools, and Flows. ISBN 0750676043
Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)
DSP unitsRAM blocks
Logic resources
Logic resources(CLB slices)
66
Location of Block RAMsRAM blocks
Multipliers
Logic blocks
Graphics based on The Design Warrior’s Guide to FPGAsDevices, Tools, and Flows. ISBN 0750676043
Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)
DSP unitsRAM blocks
Logic resources
Logic resources(CLB slices)
Determine the number of address bits, data bits, and parity bits in
PORTA of a 36k BRAM configured as True Dual Port RAM,
with 2048 memory locations
Problem 23
Supplement the attached timing diagrams with the values of dout for
Problem 25
A. Standard FIFOB. B. First-Word Fall-Through FIFO