Structured Hardware Design - PowerPoint PPT Presentation

About This Presentation
Title:

Structured Hardware Design

Slides: 43
Provided by: iap7

Transcript and Presenter's Notes



1
Structured Hardware Design
  • Ian Pratt
  • University of Cambridge
  • Computer Laboratory
  • Ian.pratt_at_cl.cam.ac.uk

2
Designing Hardware Systems
  • A good design should work first time
  • Simulation
  • Verification
  • Testing
  • Top-down methodology
  • Decompose into modules
  • Modules
  • Well-defined functions and interfaces
  • Often different technologies
  • Using pre-existing modules desirable

3
Broadside components
  • Bus
  • parallel signals carrying a binary number
  • Represented with thick lines
  • Broadside components
  • Building block is instantiated once for each wire
    in the bus
  • Building block inputs and outputs connected to
    the corresponding members of the buses
  • Control connections are wired in parallel
  • Registers, buffers, multiplexors

4
Read-Only Memories
  • Non-volatile, but typically slow
  • Mask programmable
  • Cheapest in mass production by far
  • One-time programmable (PROM)
  • UV Erasable (EPROM)
  • Electrically re-programmable (e.g. FLASH)
  • Expensive, but many rewrite cycles possible
  • Field upgrades possible
  • Choose technology based on units required and
    rewrite cycles expected

5
DRAM
  • Each bit stored in a small capacitor (1T)
  • Needs refreshing periodically
  • Recovery time required after reads
  • Bits arranged in a square array
  • Accessed by row, column (multiplexed address bus)
  • Typically 1,4,8 bits wide
  • E.g. 8Mbx8 (64Mbit) 50ns access time
  • New parts have synchronous interface
  • SDRAM / DDR / RAMBUS (still same core)
  • Modules, e.g. 16Mbx64 100MHz SDRAM
  • Made from eight 8Mbx8 parts on a PCB (DIMM)
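
The row/column access described above can be sketched as follows. This is only an illustration: the function name and the 13-bit row / 10-bit column split for an 8M-location part are assumptions, not figures from the slides.

```python
def split_dram_address(addr: int, col_bits: int) -> tuple[int, int]:
    """Split a flat location address into (row, column).

    On a multiplexed address bus the row half is presented first,
    then the column half is presented on the same pins.
    """
    row = addr >> col_bits
    col = addr & ((1 << col_bits) - 1)
    return row, col


# Hypothetical 8M-location part: 23 address bits, assumed here to be
# divided into 13 row bits and 10 column bits.
row, col = split_dram_address(0x400001, col_bits=10)
```

Because both halves share the same pins, the part needs roughly half as many address pins as a non-multiplexed memory of the same size.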

6
SRAM
  • Transparent latch per bit (6T)
  • Not as dense as DRAM, more expensive
  • Fast (7-50ns) access times
  • Used in caches
  • Easy to use: no refresh to worry about
  • Non-multiplexed address bus
  • Modern parts have synchronous interfaces
  • Pipelined design
  • E.g. 256Kbx32 (8Mb) 10ns

7
Clock generation
  • RC oscillators rather inaccurate, but cheap
  • Quartz crystal oscillators commonplace
  • Require a little care to make work
  • Accurate to 50ppm
  • Clock multiplication
  • Phase Locked Loop (PLL)
  • E.g. 133MHz x 7.5 = 997.5MHz (Pentium III)
  • Clock distribution trees
  • Buffers, or PLLs to get zero propagation delay

8
Miscellaneous
  • Power-on reset
  • Release reset after power stable
  • Get all flip-flops into known state
  • (manual reset by shorting capacitor)
  • Relays can be used to switch large loads
  • (alternative is to use power transistors)
  • Must protect transistor with a diode
  • Mechanical switches bounce when switching
  • Use a 2-pole switch and RS latch
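
A minimal behavioural sketch of the switch-debounce idea, assuming a changeover switch whose two contacts drive the Set and Reset inputs of the latch. While the wiper is bouncing it touches neither contact, so the latch simply holds its value. The function names and the sample sequence are illustrative only.

```python
def rs_latch(q: bool, set_: int, reset: int) -> bool:
    """Next state of an RS latch; with neither input asserted it holds."""
    if set_ and not reset:
        return True
    if reset and not set_:
        return False
    # Neither contact touched (wiper in flight or bouncing): hold.
    # Both asserted cannot happen with a changeover switch.
    return q


def debounce(samples, q=False):
    """Feed (set, reset) contact samples through the latch."""
    out = []
    for set_, reset in samples:
        q = rs_latch(q, set_, reset)
        out.append(q)
    return out


# Wiper leaves the reset contact, then bounces on the set contact.
bouncy = [(0, 1), (0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 0)]
clean = debounce(bouncy)
```

The latch output changes exactly once even though the set contact makes and breaks several times.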

9
ALUs
  • Combinatorial logic implementation
  • Takes two N-bit inputs and function selector
  • Propagation delay typically determined by carry
    chain
  • Typically twos-complement representation
  • ADD, ADC, SUB, NOT, AND, OR, BIC, etc.
  • Flags: Carry-out, Negative, Overflow, Zero
  • Output will typically be latched, along with flag
    status results
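
As a rough behavioural model of such an ALU, the sketch below implements the listed operations on N-bit two's-complement values and produces the four flags. It assumes SUB is computed as a + NOT b + 1, so the carry flag after SUB means "no borrow"; that convention, and the function signature, are assumptions rather than something the slides specify.

```python
def alu(op, a, b, carry_in=0, n=8):
    """Combinatorial N-bit ALU model: returns (result, flags)."""
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a &= mask
    b &= mask
    carry = overflow = 0
    if op in ("ADD", "ADC", "SUB"):
        if op == "SUB":
            # Two's-complement subtract: a + (~b) + 1.
            b, cin = ~b & mask, 1
        else:
            cin = carry_in & 1 if op == "ADC" else 0
        total = a + b + cin
        result = total & mask
        carry = total >> n
        # Signed overflow: both addends differ in sign from the result.
        overflow = int(bool((a ^ result) & (b ^ result) & sign))
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "BIC":
        result = a & ~b & mask  # bit clear
    elif op == "NOT":
        result = ~a & mask
    else:
        raise ValueError(f"unknown op {op!r}")
    flags = {"C": carry, "N": result >> (n - 1),
             "V": overflow, "Z": int(result == 0)}
    return result, flags
```

For example, `alu("ADD", 0x7F, 0x01)` gives result `0x80` with Negative and Overflow set, since 127 + 1 does not fit in a signed 8-bit value.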

10
Microprocessors
  • Simple microprocessor control signals
  • Inputs: Clock, Reset
  • Outputs: Request, Read/nWrite, Addr<0..N>
  • InOut: Data<0..M>
  • Read cycles to fetch instructions and load data
  • Write cycles when updating memory
  • Begins execution by fetching from reset location
  • PC incremented unless branch/jump instruction

11
Address decoding
  • Devising a memory map for a design
  • Address that memory/peripherals are available at
  • Non-volatile memory typically mapped at the reset
    location
  • Use combinatorial function of high-order address
    bits to generate enable signals
  • Devise memory map for decoding convenience
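
A small sketch of decoding with high-order address bits. The memory map here is hypothetical (16-bit addresses, top two bits select ROM, RAM or peripherals, with ROM placed at a reset location assumed to be address 0); it is chosen only because it makes the decode function trivial.

```python
def decode(addr: int) -> dict:
    """Generate enable signals from the top two bits of a 16-bit address."""
    top = (addr >> 14) & 0b11
    return {
        "rom_en": top == 0b00,  # 0x0000-0x3FFF, includes assumed reset location
        "ram_en": top == 0b01,  # 0x4000-0x7FFF
        "io_en": top == 0b10,   # 0x8000-0xBFFF
        # 0b11 (0xC000-0xFFFF) is left unmapped in this sketch
    }
```

Aligning each region to a power-of-two boundary is what keeps the enable logic down to a couple of gates per device.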

12
The PC as a component
  • Motherboard cost 30-100
  • 4 wiring layers in PCB
  • CPU, DRAM, keyboard, USB, VGA, IDE, floppy,
    serial, parallel, audio, IRDA
  • Cheap general purpose platform for supporting
    other hardware
  • System-on-a-chip (SOC) implementations available
    soon

13
Interconnecting Modules
  • How much data in bps needs to flow?
  • Will the connection be synchronous or async?
  • Is flow-control needed to limit the flow?
  • How long do the wires need to reach?
  • Is the topology fixed at design time?
  • Is hot-plugging needed?
  • Can we use an existing design?

14
PC Parallel Port
  • 8 data wires, 3 control wires
  • Unidirectional in its most basic form
  • Flow-control mechanism
  • Master drives data then asserts strobe_bar
  • Slave asserts acknowledge
  • Slave optionally asserts busy
  • When both busy and acknowledge are deasserted
    master can send another byte

15
RS232 Serial Ports
  • Asynchronous bit stream
  • One wire for each direction plus ground
  • Start, data, parity, stop
  • Start bits assist clock recovery
  • Baud rate (e.g. 300, 1200, 9600, 115200)
  • Various flow-control schemes
  • s/w XOn/XOff characters
  • h/w CTS/RTS signals
  • Excellent for simple debugging support

16
Finite State Machines
  • Building everything from FSMs
  • Avoid generated clocks / async resets
  • Avoid loops in combinatorial logic
  • Current CAD tools only work with FSMs
  • Timing specifications
  • Tck_to_out, Tsetup, Thold, Tprop
  • Beware of long Tholds
  • Use Moore outputs between modules
  • Easier to characterize delay into next module
  • Critical path is longest logic path ending in an
    FF
  • Determines maximum clock speed
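
As a worked illustration of how the critical path sets the clock rate, the sketch below uses the timing parameters named above. The numeric values are hypothetical, and clock skew is ignored.

```python
def min_period_ns(t_ck_to_out: float, t_prop_max: float, t_setup: float) -> float:
    """Shortest clock period for which the critical path meets setup."""
    return t_ck_to_out + t_prop_max + t_setup


def max_clock_mhz(t_ck_to_out: float, t_prop_max: float, t_setup: float) -> float:
    """Maximum clock frequency in MHz for delays given in ns."""
    return 1000.0 / min_period_ns(t_ck_to_out, t_prop_max, t_setup)


def hold_ok(t_ck_to_out: float, t_prop_min: float, t_hold: float) -> bool:
    """Hold is met if new data cannot arrive before Thold has elapsed."""
    return t_ck_to_out + t_prop_min >= t_hold


# Hypothetical figures: 2ns clock-to-out, 6ns of logic, 2ns setup.
f_max = max_clock_mhz(2.0, 6.0, 2.0)
```

Note that a hold violation cannot be fixed by slowing the clock, which is one reason long hold times are dangerous.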

17
Johnson Counters
  • Traditional binary counters require long logic
    paths for high-order bits
  • Limit clock frequency
  • Johnson counters are based on shift registers
    with feedback
  • E.g. using a NOR gate for a /5 with 3FFs
  • Clock prescalers: easy clock output
  • PRBS counter (XOR): 2^n-1 states with n FFs
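
The PRBS counter is a shift register with XOR feedback (an LFSR). The sketch below measures its sequence length; the tap positions used, (4, 3) for 4 FFs and (3, 2) for 3 FFs, are well-known maximal-length choices but are an assumption here, since the slides do not give a specific circuit.

```python
def lfsr_period(n: int, taps: tuple[int, ...], seed: int = 1) -> int:
    """Count clock ticks until an n-bit XOR-feedback shift register
    returns to its seed state. Taps are 1-based FF positions and must
    include position n so that the sequence is a closed cycle."""
    mask = (1 << n) - 1
    state = seed
    ticks = 0
    while True:
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
        ticks += 1
        if state == seed:
            return ticks
```

The all-zero state is excluded (it would lock up with XOR feedback), which is why n FFs give 2^n-1 states rather than 2^n.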

18
One Hot Coding
  • FSM encoding using 1FF per state
  • Single FF set, others all clear
  • Uses more FFs than necessary, but
  • Only very simple decode logic required
  • High clock speeds
  • Particularly useful in FPGAs

19
Pipelining
  • Split combinatorial logic into stages separated
    by FFs
  • Enables increased clock speed
  • Improved throughput
  • But increases delay (latency)
  • Adds Tsetup + Tclock_to_out of each FF
  • Unbalanced pipeline stages
  • Feedback paths can make life tricky
  • CAD tools can help distribute FFs
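
The throughput/latency trade-off can be illustrated numerically. This is a simplified model: every stage pays the same FF overhead, the clock period is set by the slowest stage, and the delays used are hypothetical.

```python
def pipeline_timing(stage_delays_ns, t_setup_ns, t_ck_to_out_ns):
    """Return (clock_period_ns, total_latency_ns) for a pipeline whose
    combinatorial stages have the given delays."""
    overhead = t_setup_ns + t_ck_to_out_ns
    period = max(stage_delays_ns) + overhead
    latency = period * len(stage_delays_ns)
    return period, latency


unpipelined = pipeline_timing([12], 0.5, 0.5)       # one 12ns block of logic
balanced = pipeline_timing([4, 4, 4], 0.5, 0.5)     # split into three equal stages
unbalanced = pipeline_timing([2, 4, 6], 0.5, 0.5)   # same logic, split badly
```

In this example the balanced split shortens the clock period from 13ns to 5ns while total latency rises from 13ns to 15ns; the unbalanced split is limited by its 6ns stage.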

20
Gated / Guarded Clocks
  • Clock Enable safer than derived clocks
  • Internal multiplexor selects between Din and Q
  • But, power is proportional to clock freq, so in
    some designs it is necessary to
  • Gate lower frequency clocks
  • Turn off clocks to currently idle units
  • When necessary, create clock by ORing clock with
    synchronised enable_bar

21
Clock and Data Skew
  • Skew when the same signal arrives at different
    places at slightly different times
  • The enemy of synchronous design
  • Clock signals are especially vulnerable
  • Early clock can cause setup time violation on
    critical paths
  • Late clock can allow output of previous stage to
    race into this one (hold time violation)
  • Take special care routing clocks!

22
Crossing Clock Domains
  • Setup/hold time violations unavoidable
  • Metastability can occur, but typically only
    briefly
  • Allow extra time for setup into next FF
  • Or, use 2FFs for safety
  • Synchronize each signal at a single point
  • Can use guard signal for buses
  • Guard indicates when bus is safe to sample
  • Or, FIFOs with separate read/write clocks

23
FSM clocks derived from another FSM
  • When it's necessary to use derived clocks
  • Use a Moore output to clock slave
  • Function should be hazard free
  • Be careful to avoid races with other outputs
    connected to slave
  • Mustn't change at same time as clock
  • Outputs from slave back to master may restrict
    max clock rate

24
Integrated Circuits
  • Si or GaAs substrate with implants
  • 200/300mm wafers, 0.3mm thick
  • Only the top few microns active
  • Ion implant and etching steps, controlled via
    stencils created by exposing a photo-resistive
    coating to UV / X-rays via a mask generated by
    CAD tools
  • 7-30 different masks used
  • Masks stepped over wafer for each die
  • 4-500mm² die size

25
CMOS Technology
  • nMOS, CMOS, ECL (Bipolar)
  • CMOS most popular (and best supported)
  • Feature size reduces at 10-20% p.a.
  • Smaller → faster, lower power, higher density
  • 0.5, 0.35, 0.25, 0.18, 0.15, 0.13µm
  • Max die size increasing at 10-25% p.a.
  • Number of available Ts increasing at 60-80% p.a.
  • 2-7 metal wiring layers. Al (or now Cu)
  • Separate processes for DRAM, logic, analog

26
Pads and IO
  • Pad ring around edge of die
  • Pads are typically 50 micron square
  • Contain high-power drive outputs and ESD
    protection circuitry
  • Power / ground ring around pads
  • Gold bond wires connect to package pins
  • Up to 1000 pins (with expensive packaging)
  • Packaging eases handling and dissipates heat
  • Core bound vs. Pad bound designs

27
Chip costs
  • Non Recurring Expenditure (NRE)
  • Design costs (labour, tools, overheads...)
  • Mask making costs
  • Per device costs
  • Raw wafer, Processing, Testing, Packaging
  • Influenced by yield
  • P(die defect free) ≈ K^(die area)
  • K is probability that any given mm² is defect free
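
A short numeric sketch of the yield relationship, treating each mm² as independently defect free with probability K so that the yield is K raised to the die area. The value of K, the wafer cost and the die count below are made-up illustration values.

```python
def die_yield(k: float, area_mm2: float) -> float:
    """Probability that a die of the given area has no defect."""
    return k ** area_mm2


def cost_per_good_die(wafer_cost: float, dies_per_wafer: int,
                      k: float, area_mm2: float) -> float:
    """Processed-wafer cost shared over the dies that actually work."""
    return wafer_cost / (dies_per_wafer * die_yield(k, area_mm2))


# Hypothetical process with K = 0.99 per mm².
small = die_yield(0.99, 100)   # 100 mm² die
large = die_yield(0.99, 200)   # 200 mm² die
```

Doubling the area squares the yield (here from roughly 37% to roughly 13%), so large dies are penalised twice: fewer fit on a wafer and fewer of those work.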

28
Taxonomy of ICs
  • Standard parts (off-the-shelf, datasheet
    available)
  • Full-custom ASICs
  • For best performance, but greatest NRE
  • CPUs, memory, DSPs
  • Semi-custom standard cell ASICs
  • Designed from a library of standard gates/cores
  • Semi-custom gate array ASICs
  • Only a few masks required, but inefficient
  • Field programmable parts
  • FPGAs, PALs

29
Field Programmable Gate Arrays
  • Volatile, re-programmable and OTP types
  • All programmable in situ
  • Array of Configurable Logic Blocks (CLBs) and
    switch matrices (configurable wiring with
    buffers)
  • IO Blocks (IOBs) around edge of die
  • CLB typically consists of Lookup Tables (LUTs),
    1-2 FFs and programmable MUXs
  • 16x1 LUT (SRAM) implements any fn of 4 variables
  • Allowing writes to LUT enables use as RAM
  • Switch matrices provide hierarchical routing
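
To show why a 16x1 memory can implement any function of 4 variables, here is a minimal model: the four inputs form the address, and the stored bit is the output. The class name and the input-to-address bit ordering are arbitrary choices for this sketch.

```python
class Lut4:
    """16x1 lookup table: the 4 inputs select one of 16 stored bits."""

    def __init__(self, func):
        # "Configuration": tabulate the function for all 16 input combinations.
        self.bits = [func(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1) & 1
                     for i in range(16)]

    def read(self, a, b, c, d):
        return self.bits[a | (b << 1) | (c << 2) | (d << 3)]

    def write(self, index, bit):
        # Allowing run-time writes turns the same cell into a 16x1 RAM.
        self.bits[index] = bit & 1


xor4 = Lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
majority = Lut4(lambda a, b, c, d: int(a + b + c + d >= 3))
```

The delay through the LUT is the same whichever function is stored, which makes timing in an FPGA depend mostly on routing.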

30
Field Programmable Gate Arrays
  • Different families use different CLB sizes
  • Xilinx 4K series 2x 4 input LUTs and 2x FFs
  • Others more or less fine grained
  • Very low NRE, rapid turnaround
  • Only requires a place and route tool run
  • Great for prototypes, but parts typically cost
    10x more than equivalent gate array
  • SRAM/Flash parts enable field upgrades
  • Switch to gate arrays in mature designs

31
Programmable Array Logic Devices (PALs)
  • Programmable sum of products array feeding
    macrocells
  • Good for simple FSMs and glue logic
  • Macrocell enables combinatorial or registered
    output, usually tristateable
  • More complex devices also contain buried
    macrocells, and may organise macrocells into
    clusters with separate clock sources, sometimes
    called CPLDs (Complex Programmable Logic Devices)
  • New parts in-circuit-programmable, while others
    require a special programmer
  • JEDEC description file

32
Delay and Power
  • Si/CMOS
  • nmos/pmos unipolar transistors, generally small
  • Power proportional to frequency
  • Si/BiCMOS
  • CMOS augmented with bipolar for driving large
    loads
  • Si/ECL
  • Bipolar transistors, kept unsaturated
  • x3 performance, but large static current
  • GaAs/MESFET/Bipolar
  • x10 performance, but yield generally poor
  • Up-coming technologies SOI, SiGe

33
Fanout and delay
  • Output stage speed decreases with load
  • Dominant aspect of load is Capacitance
  • Proportional to area of output conductor
  • Sum of input capacitances of devices driven
  • delay = intrinsic delay + (output load x derating
    factor) + propagation delay
  • Gate specification includes intrinsic delay,
    input loads and output derating figures
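
A minimal sketch of a gate delay estimate from datasheet-style figures, assuming the load-dependent term (output load times derating factor) is added to the intrinsic delay and that the output load is the sum of the driven input loads plus any wiring load. The units and numbers are hypothetical.

```python
def gate_delay_ns(intrinsic_ns: float, derating_ns_per_load: float,
                  input_loads, wiring_load: float = 0.0) -> float:
    """Intrinsic delay plus the extra delay caused by the load driven."""
    output_load = sum(input_loads) + wiring_load
    return intrinsic_ns + output_load * derating_ns_per_load


# A gate with 1ns intrinsic delay and 0.5ns per unit load,
# driving four inputs of one unit load each.
fanout4 = gate_delay_ns(1.0, 0.5, [1, 1, 1, 1])
```

Any wire propagation delay to the next gate would be added on top of this figure.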

34
Design Partitioning: h/w vs s/w
  • Hardware
  • Use where high throughput required, but
  • Harder to design and debug
  • Harder to modify
  • Software
  • Running on CPU(s) or microcontroller(s)
  • A whole PC on a PCB, or embedded on an ASIC
  • Better support for complexity
  • Field upgrades
  • Can help debug hardware

35
Hardware partitioning
  • Partitioning logic over chips motivated by
  • Availability of standard parts
  • Use existing parts wherever possible, especially
    for prototypes or low volume designs
  • Speed required by different function units
  • Use exotic technologies as sparingly as possible
  • Interconnection speed and width required
  • External interconnects much slower than on-chip
    and have limited pin count
  • ASIC size, pin count, power

36
Logic Synthesis and Layout
  • Complex functions expressed algorithmically, then
    synthesized to gates
  • Good at mechanical tasks on relatively small
    sections of a design
  • Critical sections of a design still done by hand
  • Place tool attempts to layout gates to minimize
    wiring paths
  • Route tool attempts to wire gates
  • Tools are continually improving
  • More feedback and integration between tools

37
The Cambridge Fast Ring
  • 100MHz ECL chip implements
  • Transceivers and serial de/modulator
  • ECL has good high-power line driving
    characteristics
  • Serial to parallel and parallel to serial
  • Byte alignment
  • CMOS chip, 50x more logic than ECL chip
  • Media access control protocol / CRC generation
  • Small buffer memory / Host processor interface
  • Ring monitoring and maintenance
  • DRAM, VCO, PALs for glue logic to host iface

38
External Modem
  • Analogue frontend to telephone line
  • Isolation, surge suppression, off-hook relay
  • Digital Signal Processor as Codec
  • Dedicated to a single task
  • Microcontroller for control
  • Talking to host, processing commands etc.
  • External NVRAM e.g. Flash to store state
  • RS232 Line drivers (+/- 12V)
  • Requires special fabrication process

39
Scan multiplexing
  • Scan multiplexing saves wires (and thus pins)
  • Used for LEDs and switches (keyboards)
  • LED matrix
  • Drive column high, write pattern on row
  • Scan at >50Hz to avoid flicker
  • Drive LEDs hard to make bright
  • Pseudo dual porting enables pixel RAM to be
    updated
  • Keyboard matrix of push-to-make switches
  • Drive column high, read row
  • Pull down resistors keep row wires normally low
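
The keyboard scan can be sketched in software as follows. The matrix is modelled as a set of pressed (column, row) positions; the names are illustrative, and ghosting when several keys are held at once is ignored.

```python
def scan_keyboard(pressed, n_cols: int, n_rows: int):
    """Drive each column high in turn and read the rows.

    A row reads 1 only if the switch at (driven column, row) is closed;
    otherwise its pull-down resistor keeps it at 0.
    """
    detected = set()
    for col in range(n_cols):            # drive this column high
        for row in range(n_rows):        # read every row wire
            row_level = 1 if (col, row) in pressed else 0
            if row_level:
                detected.add((col, row))
    return detected


def wires_needed(n_cols: int, n_rows: int) -> int:
    """Scan multiplexing needs one wire per column plus one per row."""
    return n_cols + n_rows
```

An 8x8 matrix therefore serves 64 keys with 16 wires instead of 64.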

40
Audio delay unit
  • Sample clock of 44.1kHz sufficient for audio
  • Single counter provides fixed delay
  • Read cycle followed by write to same location
  • Two counters (one loadable) and a mux enables
    variable delays
  • Lead that write counter has over read counter sets delay
  • Could use LFSR counters, but no need here
  • Could use DRAM, but SRAM easier and dense enough
  • Accesses unlikely to be to same page, hence slow
  • Could use small staging FIFOs to enable burst
    reads/writes
  • Audio so slow, we could use a microcontroller
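
A software sketch of the single-counter fixed delay: on every sample the current location is read, then overwritten with the new sample, and the counter advances. The class name is illustrative, and the memory is assumed to start out zeroed.

```python
class DelayLine:
    """Fixed delay of `size` samples using one address counter."""

    def __init__(self, size: int):
        self.mem = [0] * size
        self.addr = 0

    def step(self, sample: int) -> int:
        out = self.mem[self.addr]        # read cycle: oldest sample
        self.mem[self.addr] = sample     # write new sample to same location
        self.addr = (self.addr + 1) % len(self.mem)
        return out


line = DelayLine(4)
outputs = [line.step(s) for s in range(1, 9)]
```

Each sample emerges `size` ticks after it went in, so at a 44.1kHz sample clock the delay is size / 44100 seconds.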

41
Network Camera Device 1
  • Standard parts for
  • Video frontend and resizer, Audio digitizer
  • JPEG compression engine
  • 100Mb/s Network SERDES (de/serializer)
  • Three 8KBx8 SRAMs for scanline to tile
    conversion, controlled by PAL
  • Three 256KBx8 DRAM FIFOs for framebuffer
  • PAL for colour conversion / muxing (non
    compressed)

42
Network Camera Device 2
  • FPGA for assembling audio/video/CPU cells for TX
  • 2KBx8 dual ported SRAM acting as small 3 channel
    FIFO
  • FPGA for network interface control
  • MAC and CRC generation
  • Determines stream priority and reads cell out of
    SRAM and feeds it to SERDES (CoDec)
  • EPROM microcontroller
  • Communicates over network with management
    software
  • Co-ordinates frame capture and compression