Title: CEG3470
1 CEG3470
Digital Circuits (Spring 2009)
Lecture 6 Memory Decoder Design
Courtesy slides from DIC 2/e and EE141 notes
from Prof. Jan Rabaey
2 Why Memory?
'Penryn' 45nm die
Dual-core chips will get up to 6MB of cache,
while quad-core parts will get up to 12MB.
From http://regmedia.co.uk/2007/01/25/intel_penryn_3.jpg
3 Semiconductor Memory Classification
4 Random Access Memory (RAM)
- Static (SRAM)
  - Data stored as long as supply is applied
  - Larger (6 transistors/cell)
  - Fast
  - Differential (usually)
- Dynamic (DRAM)
  - Periodic refresh required
  - Smaller (1-3 transistors/cell)
  - Slower
  - Single-ended
5 Random Access Chip Architecture
- Conceptual linear array
- Each box holds some data
- But this does not lead to a nice layout shape
- Too long and skinny
- Create a 2-D array
- Decode row and column address to get data
6 Basic Memory Array
- Core
  - keep square, within a 2:1 ratio
  - rows are word lines
  - columns are bit lines
  - data in and out on columns
- Decoders
  - needed to reduce the total number of pins: N+M address lines for 2^(N+M) bits of storage, e.g. if N+M = 20, then 2^20 = 1 Mb
- Multiplexing
  - used to select one or more columns for input or output of data
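The row/column addressing described above can be sketched in Python. This is a minimal illustration, not a description of any particular chip: the function name `split_address`, the 10/10 split of the 20 address bits, and the choice of the upper bits for the row are all assumptions made for the example.

```python
def split_address(addr, row_bits, col_bits):
    """Split a flat bit address into (row, column) for a
    2^row_bits x 2^col_bits array (bit assignment is an assumption)."""
    total_bits = row_bits + col_bits
    if not 0 <= addr < (1 << total_bits):
        raise ValueError("address out of range")
    row = addr >> col_bits                # upper bits go to the row decoder
    col = addr & ((1 << col_bits) - 1)    # lower bits go to the column mux
    return row, col


# N + M = 20 address lines reach 2^20 bits = 1 Mb of storage
ROW_BITS, COL_BITS = 10, 10
capacity_bits = 1 << (ROW_BITS + COL_BITS)

row, col = split_address((1023 << 10) | 1, ROW_BITS, COL_BITS)
print(capacity_bits, row, col)
```

A square split keeps the core close to the 2:1 aspect-ratio target; other splits only change `ROW_BITS` and `COL_BITS`.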
7 Memory Architecture: Decoders
[Figure: N words (Word 0 to Word N-1) of M bits each; every word has its own select signal (S0 to SN-1) driving its storage cells; input-output is M bits wide]
- Intuitive architecture for an N x M memory
- Too many select signals: N words need N select signals
8 Memory Architecture: Decoders
[Figure: the same N-word x M-bit array, with a decoder driven by address bits A0 to AK-1 generating the word selects; input-output is M bits wide]
- Decoder reduces the number of select signals: K = log2(N)
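A behavioral sketch of what the decoder does, assuming A0 is the least significant address bit (the slide does not say which end is which) and using illustrative function names:

```python
def address_lines_needed(n_words):
    """K = ceil(log2(N)) address lines select one of N words."""
    return (n_words - 1).bit_length()


def decode(address_bits):
    """One-hot word selects for address bits [A0, A1, ..., AK-1].

    A0 is treated as the least significant bit (an assumption)."""
    k = len(address_bits)
    index = sum(bit << i for i, bit in enumerate(address_bits))
    return [1 if word == index else 0 for word in range(1 << k)]


print(address_lines_needed(256))
print(decode([0, 1, 0]))
```

With 256 words, 8 address lines replace 256 external select signals; the decoder regenerates the one-hot selects on chip.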
9 Row Decoders
- Collection of 2^M complex gates organized in regular and dense fashion
[Figure: (N)AND decoder and NOR decoder]
10 Decoder Design Example
Look at decoder for 256 x 256 memory block (8 KBytes)
11 Problem Setup
- Goal: build the fastest possible decoder with static CMOS logic
- What we know
  - Basically need 256 AND gates, each of which drives one word line
[Figure: 2^N gates driven by 2N address lines (true and complement), N = 8]
12 Problem Setup (1)
- Each word line has 256 cells connected to it
- Total output load is 256 x Ccell + Cwire
- Assume that decoder input capacitance is Caddress = 4 x Ccell
- Each address drives 2^8/2 = 128 AND gates
  - A0 drives half of the gates, A0' (the complement of A0) the other half
- Neglecting Cwire, the fan-out on each one of the 16 address wires is
  FB = 128 x (256 x Ccell) / (4 x Ccell) = 2^7 x 2^6 = 2^13
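As a quick numeric check of the fan-out seen by one address wire, here is a minimal Python sketch. It assumes capacitances normalized to Ccell = 1 and neglects Cwire, as the slide does; the variable names are illustrative.

```python
N_ADDRESS_BITS = 8
CELLS_PER_WORD_LINE = 256
C_CELL = 1.0                 # normalized unit capacitance (assumption)
C_ADDRESS = 4 * C_CELL       # decoder input capacitance, from the slide

# Each address wire (true or complement) feeds half of the 2^8 AND gates.
branching = 2 ** N_ADDRESS_BITS // 2

# Electrical fan-out of one path: word-line load over input capacitance.
electrical_fanout = CELLS_PER_WORD_LINE * C_CELL / C_ADDRESS

path_fanout = branching * electrical_fanout
print(branching, electrical_fanout, path_fanout)
```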
13 Decoder Fan-out
- FB of at least 2^13 means that we will want to use more than log4(2^13) = 6.5 stages to implement the AND8
- Need many stages anyway
- So what is the best way to implement the AND gate?
  - Will see next that it's the one with the most stages and least complicated gates
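14 Example: 8-input AND
[Figure: three implementations of an 8-input AND; the gate types below are inferred from the listed logical efforts]
- NAND8 + INV: g = 10/3, 1; G = 10/3; P = 8 + 1
- NAND4 + NOR2: g = 2, 5/3; G = 10/3; P = 4 + 2
- NAND2 + NOR2 + NAND2 + INV: g = 4/3, 5/3, 4/3, 1; G = 80/27; P = 2 + 2 + 2 + 1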
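The path logical effort G and parasitic delay P of the three implementations can be recomputed from the per-gate values. The sketch below takes the g and p numbers from the slide; the gate-type labels are inferred from those numbers, and `path_delay` uses the usual logical-effort estimate D = N(GH)^(1/N) + P with a caller-supplied electrical effort H, which is a hypothetical parameter not given on the slide.

```python
from fractions import Fraction
from math import prod

# (logical effort g, parasitic delay p) per stage, input to output
OPTIONS = {
    "NAND8 + INV": [(Fraction(10, 3), 8), (Fraction(1), 1)],
    "NAND4 + NOR2": [(Fraction(2), 4), (Fraction(5, 3), 2)],
    "NAND2 + NOR2 + NAND2 + INV": [
        (Fraction(4, 3), 2), (Fraction(5, 3), 2),
        (Fraction(4, 3), 2), (Fraction(1), 1),
    ],
}


def path_logical_effort(stages):
    """G is the product of the stage logical efforts."""
    return prod(g for g, _ in stages)


def path_parasitic(stages):
    """P is the sum of the stage parasitic delays."""
    return sum(p for _, p in stages)


def path_delay(stages, electrical_effort):
    """Minimum-delay estimate D = N * (G*H)^(1/N) + P (no branching)."""
    n = len(stages)
    effort = float(path_logical_effort(stages)) * electrical_effort
    return n * effort ** (1.0 / n) + path_parasitic(stages)


for name, stages in OPTIONS.items():
    print(name, path_logical_effort(stages), path_parasitic(stages))
```

Calling `path_delay(stages, H)` for a few values of H shows how the preferred implementation shifts toward more stages as the load grows.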
15 8-input AND
- Using 2-input NAND gates
- 8-input gate takes 6 stages
- Total LE is (4/3)^3 ≈ 2.4
- So PE is 2.4 x 2^13, giving optimal N ≈ 7.1
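The stage-count estimates can be reproduced with a few lines of Python. This sketch assumes a target effort of 4 per stage (the base of the log4 used in these slides) and that the inverters between NAND2 levels contribute a logical effort of 1; the function name is illustrative.

```python
import math

G_NAND2 = 4 / 3
G = G_NAND2 ** 3          # three NAND2 levels; inverters have g = 1
FB = 2 ** 13              # fan-out times branching on an address wire
PE = G * FB               # path effort


def optimal_stages(path_effort, stage_effort=4.0):
    """Number of stages that gives roughly `stage_effort` per stage."""
    return math.log(path_effort) / math.log(stage_effort)


print(round(G, 2))                    # total logical effort
print(round(optimal_stages(FB), 2))   # ignoring logical effort
print(round(optimal_stages(PE), 2))   # including logical effort
```

Since the NAND2 tree already has 6 stages and the estimate is about 7, the simple-gate implementation is close to the optimum stage count.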
16 Decoder So Far
- 256 8-input AND gates
  - Each built out of a tree of NAND gates and inverters
- Issue
  - Every address line has to drive 128 gates (and wire) right away
  - Can't build gates small enough, which forces us to add buffers just to drive address inputs
[Figure: 16 address lines feeding 256 gates that drive word lines up to wl254 and wl255]
17 Look Inside Each AND8 Gate
[Figure: the AND8 gates for wl254 and wl255, each taking a0 to a7 in true or complemented form]
18 Predecoders
- Use a single gate for each of the shared terms
  - e.g., from A0, A0', A1, and A1' (where ' marks the complement), generate four signals: A0A1, A0'A1, A0A1', A0'A1'
- In other words, we are decoding smaller groups of address bits first
  - And using the predecoded outputs to do the rest of the decoding
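The predecoding idea can be checked behaviorally: decode each pair of address bits into four one-hot signals, then AND one signal from each group per word line. The sketch below assumes four 2-bit groups with a0 as the least significant bit; both the grouping and the function names are illustrative choices, not taken from the slides.

```python
from itertools import product


def predecode_pair(lo, hi):
    """One-hot decode of a 2-bit group: [lo'hi', lo hi', lo'hi, lo hi]."""
    return [int(lo == (i & 1) and hi == (i >> 1)) for i in range(4)]


def decode_with_predecoders(bits):
    """bits = [a0, ..., a7]; returns 256 one-hot word-line values."""
    groups = [predecode_pair(bits[i], bits[i + 1]) for i in range(0, 8, 2)]
    word_lines = []
    for w in range(256):
        # Final decoder: a 4-input AND of one predecoded wire per group.
        selected = [groups[g][(w >> (2 * g)) & 3] for g in range(4)]
        word_lines.append(int(all(selected)))
    return word_lines


def decode_direct(bits):
    """Reference: a flat 8-input AND per word line."""
    index = sum(b << i for i, b in enumerate(bits))
    return [int(w == index) for w in range(256)]


matches = all(
    decode_with_predecoders(list(bits)) == decode_direct(list(bits))
    for bits in product((0, 1), repeat=8)
)
print(matches)
```

The shared two-input terms are computed once per group instead of once inside every AND8 tree.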
19 Predecoder and Decoder
[Figure: predecoders on address pairs (a0, a1), (a2, a3), (a4, a5) feeding the final decoder]
20 Predecoder/Decoder Layout
- Predecoder outputs run along the height of the memory array
- Decoder must match the height of the RAM cell
[Figure: address inputs, final decoders, and the SRAM cell array]
21 Predecoder Options
- Two options for predecoding
Option 1
Option 2
22 Predecoder Options (2)
- Larger predecode usually better
- More stages before the long wires
- Decreases their effect on the circuit
- Fewer long wires switch
- Lower power
- Easier to fit 2-input gate into cell pitch
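The trade-off in the bullets above can be made concrete by counting wires for different predecode widths. The two figures labeled Option 1 and Option 2 are not recoverable here, so the sketch simply compares 2-bit and 4-bit groups for an 8-bit address as an assumed pair of alternatives; the function name and dictionary keys are illustrative.

```python
def predecode_summary(address_bits, group_bits):
    """Wire and gate counts for predecoding in groups of `group_bits`."""
    if address_bits % group_bits:
        raise ValueError("group size must divide the address width")
    groups = address_bits // group_bits
    return {
        # each group produces 2^group_bits one-hot wires along the array
        "predecoded_wires": groups * 2 ** group_bits,
        # the final decoder ANDs one wire from each group
        "final_gate_fan_in": groups,
        # per group, at most one wire falls and one rises per access
        "max_wires_switching": 2 * groups,
    }


print(predecode_summary(8, 2))
print(predecode_summary(8, 4))
```

Under these assumptions the wider predecode uses more long wires in total, but fewer of them switch per access and the final gate shrinks to two inputs.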
23 What We Now Know
- Given decoder structure, input capacitance, final load
  - Can size the entire chain using LE for minimum delay
- Is this the best we can do in terms of power too?
  - Not necessarily: probably want to reduce sizes (especially on final decoder inputs)
- Is there anything else we can do to improve energy even further?
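Sizing a chain with logical effort for minimum delay can be sketched as follows. This is a generic illustration under stated assumptions, not the decoder sizing from the lecture: each stage is described by a hypothetical (g, b) pair of logical effort and branching, every stage is given the same effort f = (G x B x Cload/Cin)^(1/N), and input capacitances are found by working back from the load.

```python
from math import prod


def size_chain(gates, c_in, c_load):
    """Return (stage_effort, input_caps) for a chain of (g, b) stages.

    gates  -- list of (logical effort, branching) from input to output
    c_in   -- input capacitance of the first stage
    c_load -- capacitance driven by the last stage
    """
    n = len(gates)
    path_effort = prod(g * b for g, b in gates) * c_load / c_in
    stage_effort = path_effort ** (1.0 / n)

    caps = []
    c_next = c_load
    for g, b in reversed(gates):
        # Equal effort per stage: g * b * C_next / C_in_stage = stage_effort
        c_next = g * b * c_next / stage_effort
        caps.append(c_next)
    caps.reverse()
    return stage_effort, caps


# Three inverters (g = 1, no branching) driving 64x their input capacitance
f, caps = size_chain([(1.0, 1)] * 3, c_in=1.0, c_load=64.0)
print(round(f, 3), [round(c, 3) for c in caps])
```

Shrinking selected stages below these minimum-delay sizes trades some delay for lower switched capacitance, which is the power question the slide raises.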