Practical Design and Performance Evaluation of
Completion Detection Circuits
Fu-Chiung Cheng
Department of Computer Science, Columbia University
Reading 4
Outline
• Motivation
• Previous Work
• New Completion Detection Circuit
• Performance Evaluation
• Conclusion
Motivation
• Circuits: synchronous or asynchronous.
• Synchronization:
Sync: a global clock
Async: start and completion mechanisms
Motivation
• Potential advantages of async. design:
  • No clock skew problem
  • Low power consumption
  • Average-case performance
  • Modularity, composability and reusability
  • Easier technology migration
• The promise of high performance is especially attractive.
Motivation
• High performance async. design:
  1. fast self-timed components with good average-case performance
  2. fast completion detection circuits to detect completion
[Figure: a dual-rail self-timed component with inputs A and B and outputs S; OR gates produce $Ack_0 \ldots Ack_{n-1}$, which a C-element combines into DoneReset.]
Motivation
• Fast self-timed components:
  1. Delay-insensitive carry-lookahead adders:
     Time complexity: on the order of log log n
     Logic complexity: on the order of n
  2. Delay-insensitive comparators:
     Time complexity: on the order of 1
     Logic complexity: on the order of n
Motivation
• Fast completion detection circuits:
  1. Completion detection circuits (CDCs) are considered the major overhead.
  2. This paper addresses the design of fast completion detection circuits.
Previous Work:
• Self-timed components may use
1. bundled data protocol
2. dual-rail signaling
Previous Work:
• CDCs for bundled data components
  1. Delay elements (an inverter chain): delay > worst-case delay.
  2. Speculative completion [Nowick97]: performance depends on
     A. the number of matched delays and
     B. the associated abort detection network
  3. Current-sensing completion detection [Dean94, Grass96]
     A. consumes substantial power
     B. requires several gate delays
Previous Work:
• CDCs for dual-rail self-timed components
  1. General model:
     A. n two-input ORs
     B. one n-input C-element
  2. Operations:
     A. computation cycle: DoneReset = 1
     B. reset cycle: DoneReset = 0
[Figure: self-timed component with dual-rail outputs S; each OR gate produces one $Ack_i$, and a C-element combines $Ack_0 \ldots Ack_{n-1}$ into DoneReset.]
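The general model above can be sketched behaviorally as follows. This is a minimal logic-level illustration, not the circuit from the slides: the function names `c_element` and `done_reset` are hypothetical, and gate delays are ignored.

```python
from typing import List, Tuple


def c_element(inputs: List[int], prev: int) -> int:
    """Muller C-element: 1 when all inputs are 1, 0 when all are 0, else hold prev."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev


def done_reset(rails: List[Tuple[int, int]], prev: int) -> int:
    """General dual-rail CDC model: one 2-input OR per bit, then an n-input C-element.

    rails holds one (S0_i, S1_i) pair per output bit; prev is the previous DoneReset.
    """
    acks = [s0 | s1 for s0, s1 in rails]  # Ack_i = S0_i OR S1_i
    return c_element(acks, prev)
```

Stepping a 2-bit example through one cycle: `done_reset([(1, 0), (0, 0)], 0)` stays at 0 while one bit is still pending, `done_reset([(1, 0), (0, 1)], 0)` rises to 1, and the output falls back to 0 only once every rail has returned to 0.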
Previous Work:
• N-input C-element: a tree of 2-input C-elements
  1. long delay
  2. large variance
[Figure: tree of 2-input C-elements combining $Ack_0, Ack_1, \ldots, Ack_{n-2}, Ack_{n-1}$.]
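To see why the tree has a long delay, the sketch below counts the elements and levels on the signal path. It assumes a balanced binary tree, which the slide does not state explicitly; `c_tree_cost` is a hypothetical helper name.

```python
from typing import Tuple


def c_tree_cost(n: int) -> Tuple[int, int]:
    """Return (number of 2-input C-elements, tree depth) for n inputs.

    Assumes a balanced binary tree; depth is ceil(log2(n)).
    """
    if n < 2:
        raise ValueError("need at least two inputs")
    elements = n - 1                # each 2-input element merges two signals into one
    depth = (n - 1).bit_length()    # integer form of ceil(log2(n))
    return elements, depth
```

Under this assumption a 32-bit detector needs 31 C-elements in 5 levels, so every Ack signal passes through 5 state-holding gates before DoneReset changes.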
Previous Work:
• N-input C-element:
  1. More efficient implementation: $DoneReset = done + reset \cdot DoneReset$
     A. done circuit: an n-input AND, $done = Ack_0 \cdot Ack_1 \cdots Ack_{n-1}$
     B. reset circuit: an n-input OR, $reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$
     C. a 2-input C-element
  2. Delay and variance: better than the tree of 2-input C-elements
[Figure: an n-input AND produces done and an n-input OR produces reset from $Ack_0 \ldots Ack_{n-1}$; a 2-input C-element combines them into DoneReset.]
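A quick way to convince oneself that the AND/OR split behaves like an n-input C-element is an exhaustive check over small n. The sketch below is a logic-level illustration only; the function names are hypothetical.

```python
from itertools import product


def c_element_n(acks, prev):
    """Reference n-input C-element: follow the inputs when they agree, else hold."""
    if all(acks):
        return 1
    if not any(acks):
        return 0
    return prev


def done_reset_split(acks, prev):
    """DoneReset = done + reset * DoneReset, with done = AND(acks), reset = OR(acks)."""
    done = int(all(acks))
    reset = int(any(acks))
    return done | (reset & prev)


def equivalent(n):
    """Compare both models for every input pattern and both previous states."""
    return all(
        c_element_n(acks, prev) == done_reset_split(acks, prev)
        for acks in product((0, 1), repeat=n)
        for prev in (0, 1)
    )
```

The two models agree because done = 1 implies reset = 1, so the only state-holding case is the mixed one where done = 0 and reset = 1.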
Previous Work:
• Wuu's CDCs [Wuu93]:
  A. done circuit: a tree of NANDs
  B. reset circuit: a tree of NORs
  C. long delay
  D. small variance
  E. uses static gates
$done = Ack_0 \cdot Ack_1 \cdots Ack_{n-1}$
$reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1}$
$DoneReset = done + (reset \cdot DoneReset) = \overline{\overline{done + (reset \cdot DoneReset)}} = \overline{\overline{done} \cdot \overline{(reset \cdot DoneReset)}}$
Previous Work:
• Yun's CDCs [Yun97]:
  A. done circuit: a tree of domino logic
  B. no reset circuit
  C. variable delay
  D. large variance
  E. uses dynamic CMOS
[Figure: 8-bit completion detection domino logic. Precharged 8-bit stages combine the terms $(S^0_i + S^1_i)$ over i = 0..7, 8..15, 16..23 and 24..31; their outputs are combined into done.]
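Our Design
• Computation-completion detection circuit
$done = Ack_0 \cdot Ack_1 \cdots Ack_{n-1} = \overline{\overline{Ack_0} + \overline{Ack_1} + \cdots + \overline{Ack_{n-1}}}$ (dynamic n-input NOR)
$\overline{Ack_i} = \overline{S^0_i + S^1_i}$ (static 2-input NOR)
[Figure: done circuit. A 2-input NOR per bit takes $S^0_i$ and $S^1_i$; the resulting signals for bits 0 to n-1 drive the n-input NOR that produces done.]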
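Our Design
• Reset-completion detection circuit
$reset = Ack_0 + Ack_1 + \cdots + Ack_{n-1} = (S^0_0 + S^1_0) + \cdots + (S^0_{n-1} + S^1_{n-1})$ (dynamic 2n-input OR)
$Ack_i = S^0_i + S^1_i$
[Figure: reset circuit. The rails $S^0_0, S^1_0, \ldots, S^0_{n-1}, S^1_{n-1}$ drive the 2n-input OR that produces reset.]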
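The Boolean functions of the two detectors can be written down directly from the dual-rail outputs. The sketch below models only the logic (a NOR of per-bit NORs for done, a 2n-input OR for reset); it ignores precharge, transistor sizing and timing, and the function names are hypothetical.

```python
def ack_bar(s0: int, s1: int) -> int:
    """Static 2-input NOR: the complement of Ack_i = S0_i + S1_i."""
    return int(not (s0 or s1))


def done(rails) -> int:
    """n-input NOR over the complemented Acks: 1 only when every bit is valid."""
    return int(not any(ack_bar(s0, s1) for s0, s1 in rails))


def reset(rails) -> int:
    """2n-input OR over all rails: 1 as long as any rail is still high."""
    return int(any(s0 or s1 for s0, s1 in rails))
```

For example, with rails `[(1, 0), (0, 0)]` the second bit is still empty, so `done` is 0 while `reset` is already 1.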
Our Design
• Computation cycle:
  Either $S^0_i$ or $S^1_i$ will eventually be turned on.
  For the done signal,
  1. the PMOS transistor ($Ack_i$) will be closed (conducting) and
  2. all NMOS transistors will be open (cut off).
  3. Thus, the done signal will be turned on.
Our Design
• Computation cycle:
  Either $S^0_i$ or $S^1_i$ will eventually be turned on.
  For the reset signal, the reset signal is turned on as soon as any $Ack_i$ signal goes high.
Our Design
• Reset cycle:
  Either $S^0_i$ or $S^1_i$ will eventually be turned off.
  For the done signal, the done signal is turned off as soon as any $Ack_i$ signal is turned off.
Our Design
• Reset cycle:
  Either $S^0_i$ or $S^1_i$ will eventually be turned off.
  For the reset signal, the reset signal is turned off only after all $Ack_i$ signals are turned off.
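The four cycle descriptions can be summarized as a small timing sketch: done follows the last Ack on the way up and the first Ack on the way down, while reset does the opposite. The code below is an idealized illustration with zero gate delay; the function name and the arrival times in the usage note are hypothetical, not measured values.

```python
def cdc_event_times(ack_rise, ack_fall):
    """Idealized switching times of done, reset and DoneReset.

    ack_rise / ack_fall: per-bit times at which Ack_i goes high / low.
    Gate delays are ignored.
    """
    times = {
        "done_rise": max(ack_rise),   # done turns on once every Ack is high
        "reset_rise": min(ack_rise),  # reset turns on as soon as any Ack is high
        "done_fall": min(ack_fall),   # done turns off as soon as any Ack is low
        "reset_fall": max(ack_fall),  # reset turns off only after all Acks are low
    }
    # DoneReset = done + reset * DoneReset: set by done, held until reset falls.
    times["donereset_rise"] = times["done_rise"]
    times["donereset_fall"] = times["reset_fall"]
    return times
```

With hypothetical rise times `[1.0, 3.0, 2.0]` and fall times `[5.0, 4.0, 6.0]`, DoneReset rises at 3.0 (the latest Ack) and falls at 6.0 (the last Ack to reset).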
Our Design • done + reset circuits = dual-rail multi-input C-element
• done + reset circuits + 2-input C-element = single-rail multi-input C-element
• Implementation of 2-input C-element:
[Figure: 2-input C-element with inputs done and reset, a stage labeled "Weak", and output DoneReset.]
DIRCA With CDC: part 1
DIRCA With CDC: part 2
Our Design
• The PMOS transistor in the pull-up circuit of the done circuit saves power in non-operation mode.
• In a quiescent state, all $Ack_i$ signals are zero, so all pull-down transistors are closed (conducting).
• To save power, the pull-up transistor is open, cutting off the path from Vdd to ground.
Our Design
• If the input goes low too early, power is wasted.
• If the input goes low too late, it takes longer to turn on the done signal.
• Low power consumption: latest $Ack_i$ signal
• High performance: any not-latest $Ack_i$ signal
SPICE Output: done circuit
ChengDone0:
1. Ack0 is the latest signal.
2. input pulses: 3 and 4
3. buffered input: 1004
4. Ack0: 100
5. Done: 24680
6. DoneReset: 200
Delay = 0.55 ns
SPICE Output: done circuit
ChengDone1:
1. Ack1 is the latest signal.
2. input pulses: 5 and 6
3. buffered input: 1006
4. Ack1: 101
5. Done: 24680
6. DoneReset: 200
Delay = 0.22 ns
SPICE Output: done circuit
ChengDone37:
1. All Ack arrive at the same time
2. Done: 24680
3. DoneReset: 200
Delay = 0.64 ns
SPICE Output: reset circuit
ChengReset0:
1. Ack0 is the latest signal.
2. input pulse: 3 and 4
3. buffered input: 1004
5. Reset: 13579
6. DoneReset: 200
Delay = 1.23 ns
SPICE Output: reset circuit
ChengReset1:
1. Ack0 is the latest signal.
2. input pulse: 3 and 4
3. buffered input: 1004
5. Reset: 13579
6. DoneReset: 200
Delay = 0.87 ns
SPICE Output: reset circuit
ChengReset37:
1. All Ack reset at the same time
2. Done: 24680
3. DoneReset: 200
Delay = 1.34 ns
Our Design
• Constraint: $R_{\text{pull-up}} \ge 5\,R_{\text{pull-down}}$ when conducting, in the case where only one pull-down transistor is conducting.
• This can be achieved by properly sizing transistors.
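To get a feel for what a factor of 5 between the pull-up and pull-down resistances buys, the sketch below treats the conducting pull-up and a single conducting pull-down as a plain resistive voltage divider. This first-order model is an assumption made here for illustration, not something taken from the slides, and `low_level_fraction` is a hypothetical name.

```python
def low_level_fraction(ratio: float) -> float:
    """Output low level as a fraction of Vdd for R_pull_up = ratio * R_pull_down.

    Resistive-divider approximation:
    V_out / Vdd = R_pull_down / (R_pull_up + R_pull_down) = 1 / (1 + ratio).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / (1.0 + ratio)
```

In this model a ratio of 5 holds the output at one sixth of Vdd in the worst case of a single conducting pull-down; each additional conducting pull-down lowers the level further.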
Logic Complexity (# of transistors)

         done circuit              done + reset circuit
         n-bit   32-bit  64-bit    n-bit   32-bit  64-bit
Wuu      10n-4   316     636       14n-8   440     888
Yun      4n-5    123     251       N/A
Cheng    5n+1    161     321       7n+5    229     453
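The n-bit formulas in the table can be evaluated for other word sizes with a few lines of code. The sketch below simply encodes the formulas as given; the dictionary layout and the function name `transistors` are illustrative choices.

```python
# Transistor-count formulas from the Logic Complexity table.
# Yun's design has no reset circuit, so it has no "done+reset" entry.
TRANSISTOR_COUNT = {
    "Wuu":   {"done": lambda n: 10 * n - 4, "done+reset": lambda n: 14 * n - 8},
    "Yun":   {"done": lambda n: 4 * n - 5},
    "Cheng": {"done": lambda n: 5 * n + 1, "done+reset": lambda n: 7 * n + 5},
}


def transistors(design: str, circuit: str, n: int) -> int:
    """Number of transistors for an n-bit detector of the given design."""
    return TRANSISTOR_COUNT[design][circuit](n)
```

For instance, `transistors("Cheng", "done+reset", 32)` gives 229 and `transistors("Wuu", "done+reset", 32)` gives 440, matching the 32-bit column.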
Performance Evaluation
• SPICE simulation:
  1. uses MOSIS 2 micron CMOS level 2 parameters
  2. W = 3u, L = 2u (buffer: 0.4 ns; 2-input NOR: 0.18 ns)
• Computation-completion detection circuits: 38 typical cases (for Wuu, Yun and Cheng). The delay measured includes the delay of the OR gate for $Ack_i$.
• Reset-completion detection circuits: 38 typical cases (Wuu and Cheng)
Performance Evaluation
Computation Completion Detection, 32-bit (C = Cheng, W = Wuu, Y = Yun)

Case   done (ns)               Speed-up
       Wuu    Yun    Cheng     C vs W   C vs Y
Min    2.18   1.46   0.22      4.1      2.8
Max    2.65   3.36   0.64      10.4     14.3
Avg    2.27   2.53   0.28      9.2      10.2
Performance Evaluation
Reset Completion Detection, 32-bit (C = Cheng, W = Wuu)

Case   reset (ns)        Speed-up
       Wuu    Cheng      C vs W
Min    2.40   0.87
Max    2.89   1.34
Avg    2.85   0.71       4.0
Conclusions
• A new completion detection circuit for dual-rail self-timed components:
  1. very fast computation-completion detection
  2. very fast reset-completion detection
• A low-overhead, very fast completion detection circuit is crucial for high-performance self-timed circuits.
Conclusions
• SPICE simulation results:
  1. our computation-completion detection circuit is 9 times faster than Wuu's and Yun's
  2. our reset-completion detection circuit is 4 times faster than Wuu's