Title: Buses: connecting cores to CPU and Memory
1. Buses: connecting cores to CPU and Memory
- Brendan Boulter
- November 12, 2002
2. Connecting I/O devices to CPU and Memory
- In a system-on-chip architecture, the various cores may need to communicate with each other:
  - CPU to memory
  - CPU to I/O devices
  - Device to device
- This communication is performed by a bus.
- The bus is a shared communication link between cores.
3. Bus advantages
- Two major advantages of a bus organization are:
  - Low cost: a single set of wires is shared between multiple cores
  - Versatility: new cores can be added easily, and cores can be reused in a different design if the bus interface is the same
4. Bus disadvantages
- The major disadvantage of a bus is that it creates a communications bottleneck.
  - A bus limits the maximum I/O throughput.
- For high-performance systems, designing a bus system capable of meeting the demands of the processor is a major challenge.
5. Bus classification (1)
- Buses are traditionally classified as:
  - CPU-memory buses, or
  - I/O buses
- I/O buses:
  - may have many types of devices connected to them
  - have a wide range in the data bandwidth of the connected devices
  - normally follow a bus standard
6. Bus classification (2)
- CPU-memory buses:
  - are generally high speed
  - are matched to the memory system to maximize memory-to-CPU bandwidth
- The designer of a CPU-memory bus knows all the types of devices that must connect together (at the design stage).
- The I/O bus designer must allow for devices of varying latency and bandwidth characteristics.
7. Bus transaction
- A bus transaction involves two parts:
  - Sending the address
  - Sending/receiving the data
- Bus transactions are normally defined by what they do to memory:
  - A read transaction transfers data from memory to either the CPU or an I/O device
  - A write transaction writes data to memory
8. Typical bus read transaction (1)
[Figure 1: a typical bus read transaction, showing the Clock, Address, Data, Read and Wait signals]
9. Typical bus read transaction (2)
- In a read transaction, the address is first sent down the bus to memory, together with the appropriate control signals indicating a read.
  - In figure 1, a read is indicated by deasserting the read signal.
- The memory responds by returning the data on the bus, together with the appropriate control signals.
  - In this case, this is indicated by deasserting the wait signal.
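The handshake above can be sketched as a toy simulation. The signal names (read, wait) follow figure 1, but the memory model and the cycle counts are invented for illustration.

```python
def bus_read(memory, address, access_cycles=3):
    """Simulate a synchronous bus read; return (data, clock cycles used)."""
    cycle = 1                        # cycle 1: master drives address + read
    wait = True                      # slave asserts wait while it is busy
    while wait:                      # master samples wait on each clock edge
        cycle += 1
        if cycle >= 1 + access_cycles:
            wait = False             # memory deasserts wait: data is valid
    data = memory[address]           # data lines are driven in this cycle
    return data, cycle

memory = {0x100: 0xCAFE}
data, cycles = bus_read(memory, 0x100)   # data arrives after 4 cycles here
```

The master does nothing but wait while `wait` stays asserted, which is exactly why a slow slave ties up a non-split bus.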
10. Bus design decisions
- The design of the bus presents many options to the designer.
- Decisions depend on the design goal:
  - Cost
  - Performance
- Typical options and their impact on cost and performance are outlined in figure 2.
11. Bus design decisions (figure 2)
[Figure 2: bus design options and their impact on cost and performance]
12. Bus design decisions
- The first three options in figure 2 are clear and obvious:
  - Separate address and data lines, wider data lines and multiple-word transfers all give higher performance at greater cost.
- The next item concerns the number of bus masters.
  - A bus master is a device that can initiate a read or a write transaction.
13. Bus masters
- A bus has multiple masters when there are multiple CPUs in the system or when I/O devices can initiate a bus transaction.
- If there are multiple masters, an arbitration scheme is required to decide which master gets access to (and control of) the bus next.
- Arbitration is often implemented as:
  - a fixed-priority scheme, or
  - an approximately fair scheme, such as round-robin, or one that randomly chooses which master gets the bus
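As a sketch of the round-robin style of fair arbitration mentioned above (the function interface is invented for illustration): each grant starts the search just after the previous winner, so no requesting master is starved.

```python
def round_robin_grant(requests, last_granted, n_masters):
    """Return the next master to grant, scanning from last_granted + 1."""
    for offset in range(1, n_masters + 1):
        candidate = (last_granted + offset) % n_masters
        if candidate in requests:
            return candidate
    return None                      # no master is requesting the bus

# Masters 0 and 2 both keep requesting: the grant alternates between them.
winner1 = round_robin_grant({0, 2}, last_granted=0, n_masters=4)
winner2 = round_robin_grant({0, 2}, last_granted=winner1, n_masters=4)
```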
14. Split-transaction bus (1)
- With multiple masters, a bus can offer higher bandwidth by using packets, as opposed to holding the bus for a full transaction.
- This technique is called split transactions.
- The read transaction is now split into:
  - A read-request transaction that contains the address
  - A memory-reply transaction that contains the data
15. Split-transaction bus (2)
- On a split-transaction bus, each transaction must be tagged so that the processor and memory can tell what it is.
- Split transactions make the bus available to other masters while the memory reads the words from the requested address.
- The CPU must arbitrate for the bus in order to send the address, and the memory must arbitrate in order to return the data.
- The split-transaction bus has higher bandwidth, but at the cost of higher latency than a non-split bus.
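The tagging described above can be sketched as follows (class and method names are invented): because every reply carries the tag of its request, replies may return out of order and still reach the right master.

```python
class SplitBus:
    """Toy model of split-transaction tagging."""
    def __init__(self):
        self.next_tag = 0
        self.outstanding = {}            # tag -> requesting master

    def read_request(self, master, address):
        """Request transaction: address + tag go on the bus, then it is free."""
        tag = self.next_tag
        self.next_tag += 1
        self.outstanding[tag] = master
        return tag

    def memory_reply(self, tag, data):
        """Reply transaction: the tag identifies the waiting master."""
        master = self.outstanding.pop(tag)
        return master, data

bus = SplitBus()
t_cpu = bus.read_request("cpu", 0x100)
t_dma = bus.read_request("dma", 0x200)
# Replies may come back in either order; tags keep them matched.
reply = bus.memory_reply(t_dma, 42)
```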
16. Clocking (1)
- Clocking concerns whether a bus is synchronous or asynchronous.
- A synchronous bus includes a clock in the control lines and a fixed protocol for address and data relative to the clock.
- Since little or no logic is required to decide what to do next, a synchronous bus is both fast and inexpensive.
- Two major disadvantages:
  - Everything on the bus must run at the same clock rate
  - Due to clock skew, the bus cannot be long
- CPU-memory buses are typically synchronous.
17. Clocking (2)
- Asynchrony makes it easier to accommodate a wide variety of devices and to lengthen the bus without worrying about clock skew or synchronization problems.
- With an asynchronous bus, there is an overhead associated with synchronizing the bus on each transaction.
- Asynchronous buses scale better with technological changes.
- I/O buses are typically asynchronous.
18. Clocking (3)
[Figure: choosing a clocking scheme. Axes: clock skew, a function of bus length (short to long), versus the mixture of I/O device speeds (similar to varied). Synchronous is better for short buses with similar device speeds; asynchronous is better for long buses with varied device speeds.]
19. Bus standards: I/O buses
[Table of I/O bus standards]
20. Bus standards: CPU-memory buses
[Table of CPU-memory bus standards]
21. The memory system
- The amount of time it takes to read or write a memory location is called the memory access time.
- A related quantity is the memory cycle time.
  - The access time measures how quickly you can read a memory location.
  - The cycle time measures how quickly you can repeat memory references.
- For example, you can ask for data from a DRAM chip and receive it within 50 ns, but it may be 100 ns before you can ask for more data.
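The distinction matters for bandwidth: latency is set by the access time, but the sustained request rate is set by the cycle time. A quick check with the 50 ns / 100 ns figures above (the 4-byte word size is an assumption for the bandwidth figure):

```python
access_time_ns = 50        # time until the requested data arrives
cycle_time_ns = 100        # minimum spacing between successive requests

read_latency_ns = access_time_ns
max_requests_per_second = 1e9 / cycle_time_ns    # 10 million reads/s

# Assuming 4-byte words, peak bandwidth is limited by the cycle time:
peak_bandwidth_mb_per_s = 4 * max_requests_per_second / 1e6
```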
22. The memory system vs the CPU
- In the early 1980s, the access time of commodity DRAMs (200 ns) was shorter than the processor clock cycle (4.77 MHz, i.e. 210 ns).
- This meant that DRAM could be connected to the system without worrying about overrunning the memory system.
- However, CPUs have become faster, a lot faster!
- Wait states were added to make the memory-system speed appear to match the processor speed.
23. The memory system vs the CPU
[Figure: relative speed (log scale, 1 to 1000) of CPUs versus DRAM, 1975 to 2010. CPU speed grows far faster than DRAM speed, so the gap widens over time.]
24. Memory hierarchies
- The clock time for commodity processors has gone from 210 ns to around 1 ns for 1 GHz processors.
- However, the access time for commodity DRAMs has decreased disproportionately less: from around 200 ns to around 50 ns.
- We could use fast SRAM, but this would be very expensive.
- The solution is to use a hierarchy of memories:
  - Registers
  - 1-3 levels of SRAM cache
  - DRAM main memory
25. Cache memory
- Caches are small amounts of SRAM that store a subset of the contents of the memory.
- The actual cache architecture has had to change as the cycle time of processors has improved.
- Processors are now so fast that off-chip SRAM chips are not fast enough.
- This has led to multilevel cache architectures.
26. The DEC 21164 memory system
Memory access speed on the DEC 21164 Alpha
27. Cache effectiveness
- When every memory reference can be found in the cache, we have a 100% hit rate.
- The hit rate of an application depends on many factors:
  - The algorithm and its locality of reference
  - The compiler, linker and other software tools
  - The availability of tuned libraries
  - The cache implementation method
- When the hit rate is high, the system operates near the speed of the top of the hierarchy.
- When the hit rate is low, the system operates near the speed of the bottom of the hierarchy.
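One common way to quantify this (a standard formula, not stated on the slide) is the average memory access time; the timing numbers below are assumed for illustration.

```python
def average_access_time(hit_time, miss_penalty, hit_rate):
    """Every access pays hit_time; misses additionally pay miss_penalty."""
    return hit_time + (1.0 - hit_rate) * miss_penalty

# Assumed numbers: 2 ns cache hit, 50 ns penalty to go down to DRAM.
near_top = average_access_time(2.0, 50.0, hit_rate=0.98)      # about 3 ns
near_bottom = average_access_time(2.0, 50.0, hit_rate=0.50)   # 27 ns
```

With a 98% hit rate the system runs at roughly cache speed; at 50% it is dominated by the DRAM penalty.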
28. Cache organization: direct mapped
- The process of pairing memory locations with cache lines is called mapping.
- The simplest method is direct mapping.
[Figure: main memory addresses 0, 4K, 8K, ... all map onto the same location in a 4K cache, which covers addresses 0 to 4K]
29. Direct mapped cache
- Memory locations 0, 4K, 8K, etc. map into the same location in cache.
- This can cause problems when alternating runtime memory references point to the same cache line:

      real*4 a(1024), b(1024)
      common /arrays/ a, b
      do i = 1, 1024
        a(i) = b(i) + c(i)
      end do

- Since a and b are each 4 KB and adjacent in the common block, a(i) and b(i) lie exactly 4 KB apart and map to the same line of a 4K cache.
- Each reference causes a cache miss: thrashing.
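The mapping behind this example can be written out directly; the 4 KB cache size matches the previous slide, while the 32-byte line size is an assumption for illustration.

```python
def cache_line(address, cache_size=4096, line_size=32):
    """Direct mapping: line index = (address // line_size) mod number of lines."""
    num_lines = cache_size // line_size
    return (address // line_size) % num_lines

# Locations 0, 4K and 8K all collide on line 0, as the slide describes:
lines = [cache_line(a) for a in (0, 4096, 8192)]
# a(i) and b(i) in the Fortran loop sit exactly 4 KB apart, so each
# alternating reference evicts the other array's line: thrashing.
```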
30. Fully associative cache
- At the other extreme, a fully associative cache allows any memory location to be mapped into any cache line.
- It is difficult to find real-world examples of programs that will cause cache thrashing in a fully associative cache.
- However, fully associative cache is expensive in terms of size, price and speed.
31. Set associative cache
- Set associative cache is composed of a number of sets of direct mapped cache.
- Common choices are 2- and 4-way set associativity.
- In the previous example, references to the a array might be stored in set 1, and subsequent references to b would be stored in set 2.
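One way in a 2-way set can be sketched as below; the class name is invented and the LRU replacement policy is an assumption (a common choice, though the slide does not specify one). Two lines that collide on the same index can now coexist, so the alternating a(i)/b(i) pattern from the earlier example stops thrashing.

```python
class TwoWaySet:
    """One set of a 2-way set-associative cache with LRU replacement."""
    def __init__(self):
        self.ways = []                   # up to two tags, least recent first

    def access(self, tag):
        """Return True on a hit, False on a miss (which fills the set)."""
        if tag in self.ways:
            self.ways.remove(tag)        # hit: mark as most recently used
            self.ways.append(tag)
            return True
        if len(self.ways) == 2:
            self.ways.pop(0)             # miss with the set full: evict LRU
        self.ways.append(tag)
        return False

s = TwoWaySet()
s.access("a"); s.access("b")             # cold misses fill both ways
hits = (s.access("a"), s.access("b"))    # alternating references now hit
```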
32. Interfacing to the processor
[Figure: the CPU and its cache sit on the CPU-memory bus with main memory; a bus adapter bridges to the I/O bus, which carries I/O controllers for the network, disk and graphics output]
33. Interfacing to the processor
- Two methods of interfacing to an I/O device:
  - Memory mapped
  - I/O mapped
- For a memory-mapped device, portions of the address space are assigned to I/O space.
  - Reads/writes to these addresses cause data to be transferred.
  - Some part of the address space may also be reserved for control signals.
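A memory-mapped device in C is typically touched through a volatile pointer at a fixed address; the sketch below models the same idea in Python with an address-decoded register file. The base address, register offsets and bit meanings are all invented for illustration.

```python
UART_BASE   = 0xFFFF0000
UART_DATA   = UART_BASE + 0x0        # store here: transmit one byte
UART_STATUS = UART_BASE + 0x4        # load here: bit 0 = transmitter ready

class MemoryMappedUart:
    def __init__(self):
        self.transmitted = []

    def store(self, address, value):
        """An ordinary CPU store whose address decodes into I/O space."""
        if address == UART_DATA:
            self.transmitted.append(value)

    def load(self, address):
        """An ordinary CPU load whose address decodes into I/O space."""
        if address == UART_STATUS:
            return 0x1               # always ready in this sketch
        return 0

uart = MemoryMappedUart()
if uart.load(UART_STATUS) & 0x1:     # read the status register
    uart.store(UART_DATA, 0x41)      # write 'A' to the data register
```

No special opcodes are needed: ordinary loads and stores do all the work, which is the point of memory mapping.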
34. Interfacing to the processor
- The alternative is to use dedicated I/O opcodes.
- Such devices are known as I/O-mapped devices.
- Intel 80x86 processors use I/O-mapped devices and special opcodes (IN/OUT) to communicate with these devices.
- I/O mapping is less popular than memory mapping.
35. Controlling an I/O device
- I/O devices typically have control and status registers.
- The processor can control the device using two methods:
  - Programmed I/O: the processor periodically checks the status registers for completion of the transaction. This method puts a burden on the processor.
  - Interrupt-driven I/O: the processor works on some other process while waiting for I/O to complete. The I/O device raises an interrupt on completion of the transaction.
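The programmed-I/O case can be sketched as a polling loop (the device model, status bit and cycle counts are invented); every iteration is a CPU cycle spent doing nothing useful, which is the burden mentioned above.

```python
DONE = 0x1                           # assumed completion bit in the status register

class SlowDevice:
    def __init__(self, busy_cycles):
        self.busy_cycles = busy_cycles
        self.status = 0

    def tick(self):                  # stands in for the device working on its own
        self.busy_cycles -= 1
        if self.busy_cycles <= 0:
            self.status |= DONE

def programmed_io_wait(device):
    """Poll the status register until DONE; return polls wasted by the CPU."""
    polls = 0
    while not (device.status & DONE):
        device.tick()
        polls += 1
    return polls

wasted = programmed_io_wait(SlowDevice(busy_cycles=1000))   # 1000 wasted polls
```

Interrupt-driven I/O removes this loop entirely: the CPU does other work until the device signals completion.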
36. Direct memory access
- Interrupt-driven I/O relieves the processor from waiting for every I/O event, but a significant number of CPU cycles may still be required to move the data.
  - Transferring a disk block of 2048 words might require 2048 reads and 2048 stores, as well as the overhead for the interrupt.
- Direct memory access (DMA) can relieve the processor from the burden of bulk data movement.
37. Direct memory access
- A DMA engine is a specialized processor that transfers data between memory and an I/O device, allowing the processor to work on other tasks.
- A DMA engine is external to the CPU and must act as a bus master.
- The processor writes the start address and number of words to the DMA control registers.
- The DMA engine interrupts the processor when the transfer is complete.
- There may be multiple DMA devices in a system.
38. Advanced Microcontroller Bus Architecture
- A system-on-chip (SoC) design consists of a collection of cores and an interconnection scheme.
- Using an ad hoc scheme each time wastes design cycles.
- ARM's AMBA (Advanced Microcontroller Bus Architecture) can be used to standardize the on-chip connections of different cores.
- Use of a standard bus facilitates design reuse.
39. AMBA
- Three buses are defined within the AMBA specification:
  - The Advanced High-performance Bus (AHB)
  - The Advanced System Bus (ASB)
  - The Advanced Peripheral Bus (APB)
- A typical system will incorporate either an AHB or an ASB, together with an APB.
- The ASB is the older form of the system bus.
- The AHB was introduced to provide improved support for high performance, synthesis and timing verification.
40. AMBA system architecture
[Figure: the ARM processor core, on-chip RAM, external bus interface and DMA controller sit on the AHB or ASB; a bridge connects to the APB, which carries the test interface, UART, timer and parallel interface]
41. Arbitration
- A bus transaction is initiated by a bus master, which requests access from a central arbiter.
- The arbiter decides priorities when there are conflicting requests.
- The design of the arbiter is a system-specific issue; the ASB only specifies the protocol:
  - The master issues a request to the arbiter
  - When the bus is available, the arbiter issues a grant to the master
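The request/grant protocol above can be sketched as below. AMBA leaves the arbitration policy to the system designer; the fixed lowest-index-wins priority here is just one illustrative choice, and all names are invented.

```python
class AsbArbiter:
    """Toy central arbiter implementing a request/grant handshake."""
    def __init__(self):
        self.requests = set()
        self.owner = None

    def request(self, master):           # a master asserts its request line
        self.requests.add(master)

    def arbitrate(self):                 # grant only when the bus is free
        if self.owner is None and self.requests:
            self.owner = min(self.requests)   # policy: lowest index wins
            self.requests.discard(self.owner)
        return self.owner

    def release(self):                   # current master ends its transaction
        self.owner = None

arb = AsbArbiter()
arb.request(2); arb.request(1)
first = arb.arbitrate()                  # master 1 wins under this policy
arb.release()
second = arb.arbitrate()                 # master 2 is granted next
```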