Title: Buses: connecting cores to CPU and Memory
1. Buses: connecting cores to CPU and Memory
- Brendan Boulter
- November 12, 2002
2. Connecting I/O devices to CPU and Memory
- In a system-on-chip architecture, the various cores may need to communicate with each other:
  - CPU to memory
  - CPU to I/O devices
  - Device to device
- This communication is performed by a bus.
- The bus is a shared communication link between cores.
3. Bus advantages
- Two major advantages of a bus organization are:
  - Low cost: a single set of wires is shared between multiple cores
  - Versatility: new cores can be added easily, and cores can be reused in a different design if the bus interface is the same
4. Bus disadvantages
- The major disadvantage of a bus is that it creates a communications bottleneck.
  - A bus limits the maximum I/O throughput.
- For high-performance systems, designing a bus system capable of meeting the demands of the processor is a major challenge.
5. Bus classification (1)
- Buses are traditionally classified as:
  - CPU-memory buses, or
  - I/O buses
- I/O buses:
  - may have many types of devices connected to them
  - have a wide range in the data bandwidth of the connected devices
  - normally follow a bus standard
6. Bus classification (2)
- CPU-memory buses:
  - are generally high speed
  - are matched to the memory system to maximize memory-to-CPU bandwidth
- The designer of a CPU-memory bus knows all the types of devices that must connect together (at the design stage).
- The I/O bus designer must allow for devices of varying latency and bandwidth characteristics.
7. Bus transaction
- A bus transaction involves two parts:
  - Sending the address
  - Sending/receiving the data
- Bus transactions are normally defined by what they do to memory:
  - A read transaction transfers data from memory to either the CPU or an I/O device
  - A write transaction writes data to memory
8. Typical bus read transaction (1)
[Figure 1: a typical bus read transaction, showing the Clock, Address, Data, Read and Wait signals]
9. Typical bus read transaction (2)
- In a read transaction, the address is first sent down the bus to memory, together with the appropriate control signals indicating a read.
  - In figure 1, a read is indicated by deasserting the read signal.
- The memory responds by returning the data on the bus, together with the appropriate control signals.
  - In this case, this is indicated by deasserting the wait signal.
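The handshake above can be sketched as a toy simulation. The signal names (read, wait) follow figure 1, but the memory model and the cycle counts are invented for illustration.

```python
def bus_read(memory, address, access_cycles=3):
    """Simulate a synchronous bus read; return (data, clock cycles used)."""
    cycle = 1                        # cycle 1: master drives address + read
    wait = True                      # slave asserts wait while it is busy
    while wait:                      # master samples wait on each clock edge
        cycle += 1
        if cycle >= 1 + access_cycles:
            wait = False             # memory deasserts wait: data is valid
    data = memory[address]           # data lines are driven in this cycle
    return data, cycle

memory = {0x100: 0xCAFE}
data, cycles = bus_read(memory, 0x100)   # data arrives after 4 cycles here
```

The master does nothing but wait while `wait` stays asserted, which is exactly why a slow slave ties up a non-split bus.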
10. Bus design decisions
- The design of the bus presents many options to the designer.
- Decisions depend on the design goal:
  - Cost
  - Performance
- Typical options and their impact on cost and performance are outlined in figure 2.
11. Bus design decisions (figure 2)
[Figure 2: bus design options and their impact on cost and performance]
12. Bus design decisions
- The first three options in figure 2 are clear and obvious:
  - Separate address and data lines, wider data lines and multiple-word transfers all give higher performance at greater cost.
- The next item concerns the number of bus masters.
  - A bus master is a device that can initiate a read or a write transaction.
13. Bus masters
- A bus has multiple masters when there are multiple CPUs in the system or when I/O devices can initiate a bus transaction.
- If there are multiple masters, an arbitration scheme is required to decide which master gets access to (and control of) the bus next.
- Arbitration is often implemented as:
  - a fixed-priority scheme, or
  - an approximately fair scheme, such as round-robin, or one that randomly chooses which master gets the bus
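As a sketch of the round-robin style of fair arbitration mentioned above (the function interface is invented for illustration): each grant starts the search just after the previous winner, so no requesting master is starved.

```python
def round_robin_grant(requests, last_granted, n_masters):
    """Return the next master to grant, scanning from last_granted + 1."""
    for offset in range(1, n_masters + 1):
        candidate = (last_granted + offset) % n_masters
        if candidate in requests:
            return candidate
    return None                      # no master is requesting the bus

# Masters 0 and 2 both keep requesting: the grant alternates between them.
winner1 = round_robin_grant({0, 2}, last_granted=0, n_masters=4)
winner2 = round_robin_grant({0, 2}, last_granted=winner1, n_masters=4)
```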
14. Split-transaction bus (1)
- With multiple masters, a bus can offer higher bandwidth by using packets, as opposed to holding the bus for a full transaction.
- This technique is called split transactions.
- The read transaction is now split into:
  - A read-request transaction that contains the address
  - A memory-reply transaction that contains the data
15. Split-transaction bus (2)
- On a split-transaction bus, each transaction must be tagged so that the processor and memory can tell what it is.
- Split transactions make the bus available to other masters while the memory reads the words from the requested address.
- The CPU must arbitrate for the bus in order to send the address, and the memory must arbitrate in order to return the data.
- The split-transaction bus has higher bandwidth, but at the cost of higher latency than a non-split bus.
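The tagging described above can be sketched as follows (class and method names are invented): because every reply carries the tag of its request, replies may return out of order and still reach the right master.

```python
class SplitBus:
    """Toy model of split-transaction tagging."""
    def __init__(self):
        self.next_tag = 0
        self.outstanding = {}            # tag -> requesting master

    def read_request(self, master, address):
        """Request transaction: address + tag go on the bus, then it is free."""
        tag = self.next_tag
        self.next_tag += 1
        self.outstanding[tag] = master
        return tag

    def memory_reply(self, tag, data):
        """Reply transaction: the tag identifies the waiting master."""
        master = self.outstanding.pop(tag)
        return master, data

bus = SplitBus()
t_cpu = bus.read_request("cpu", 0x100)
t_dma = bus.read_request("dma", 0x200)
# Replies may come back in either order; tags keep them matched.
reply = bus.memory_reply(t_dma, 42)
```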
16. Clocking (1)
- Clocking concerns whether a bus is synchronous or asynchronous.
- A synchronous bus includes a clock in the control lines and a fixed protocol for address and data relative to the clock.
- Since little or no logic is required to decide what to do next, a synchronous bus is both fast and inexpensive.
- Two major disadvantages:
  - Everything on the bus must run at the same clock rate
  - Due to clock skew, the bus cannot be long
- CPU-memory buses are typically synchronous.
17. Clocking (2)
- Asynchrony makes it easier to accommodate a wide variety of devices and to lengthen the bus without worrying about clock skew or synchronization problems.
- With an asynchronous bus, there is an overhead associated with synchronizing the bus on each transaction.
- Asynchronous buses scale better with technological changes.
- I/O buses are typically asynchronous.
18. Clocking (3)
[Figure: choosing a clocking scheme. Axes: clock skew, a function of bus length (short to long), versus the mixture of I/O device speeds (similar to varied). Synchronous is better for short buses with similar device speeds; asynchronous is better for long buses with varied device speeds.]
19. Bus standards: I/O buses
[Table of I/O bus standards]
20. Bus standards: CPU-memory buses
[Table of CPU-memory bus standards]
21. The memory system
- The amount of time it takes to read or write a memory location is called the memory access time.
- A related quantity is the memory cycle time.
  - The access time measures how quickly you can read a memory location.
  - The cycle time measures how quickly you can repeat memory references.
- For example, you can ask for data from a DRAM chip and receive it within 50 ns, but it may be 100 ns before you can ask for more data.
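The distinction matters for bandwidth: latency is set by the access time, but the sustained request rate is set by the cycle time. A quick check with the 50 ns / 100 ns figures above (the 4-byte word size is an assumption for the bandwidth figure):

```python
access_time_ns = 50        # time until the requested data arrives
cycle_time_ns = 100        # minimum spacing between successive requests

read_latency_ns = access_time_ns
max_requests_per_second = 1e9 / cycle_time_ns    # 10 million reads/s

# Assuming 4-byte words, peak bandwidth is limited by the cycle time:
peak_bandwidth_mb_per_s = 4 * max_requests_per_second / 1e6
```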
22. The memory system vs the CPU
- In the early 1980s, the access time of commodity DRAMs (200 ns) was shorter than the processor clock cycle (4.77 MHz, i.e. 210 ns).
- This meant that DRAM could be connected to the system without worrying about overrunning the memory system.
- However, CPUs have become faster, a lot faster!
- Wait states were added to make the memory-system speed appear to match the processor speed.
23. The memory system vs the CPU
[Figure: relative speed (log scale, 1 to 1000) of CPUs versus DRAM, 1975 to 2010. CPU speed grows far faster than DRAM speed, so the gap widens over time.]
24. Memory hierarchies
- The clock time for commodity processors has gone from 210 ns to around 1 ns for 1 GHz processors.
- However, the access time for commodity DRAMs has decreased disproportionately less: from around 200 ns to around 50 ns.
- We could use fast SRAM, but this would be very expensive.
- The solution is to use a hierarchy of memories:
  - Registers
  - 1-3 levels of SRAM cache
  - DRAM main memory
25. Cache memory
- Caches are small amounts of SRAM that store a subset of the contents of the memory.
- The actual cache architecture has had to change as the cycle time of processors has improved.
- Processors are now so fast that off-chip SRAM chips are not fast enough.
- This has led to multilevel cache architectures.
26. The DEC 21164 memory system
Memory access speed on the DEC 21164 Alpha
27. Cache effectiveness
- When every memory reference can be found in the cache, we have a 100% hit rate.
- The hit rate of an application depends on many factors:
  - The algorithm and its locality of reference
  - The compiler, linker and other software tools
  - The availability of tuned libraries
  - The cache implementation method
- When the hit rate is high, the system operates near the speed of the top of the hierarchy.
- When the hit rate is low, the system operates near the speed of the bottom of the hierarchy.
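One common way to quantify this (a standard formula, not stated on the slide) is the average memory access time; the timing numbers below are assumed for illustration.

```python
def average_access_time(hit_time, miss_penalty, hit_rate):
    """Every access pays hit_time; misses additionally pay miss_penalty."""
    return hit_time + (1.0 - hit_rate) * miss_penalty

# Assumed numbers: 2 ns cache hit, 50 ns penalty to go down to DRAM.
near_top = average_access_time(2.0, 50.0, hit_rate=0.98)      # about 3 ns
near_bottom = average_access_time(2.0, 50.0, hit_rate=0.50)   # 27 ns
```

With a 98% hit rate the system runs at roughly cache speed; at 50% it is dominated by the DRAM penalty.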
28. Cache organization: direct mapped
- The process of pairing memory locations with cache lines is called mapping.
- The simplest method is direct mapping.
[Figure: main memory addresses 0, 4K, 8K, ... all map onto the same location in a 4K cache, which covers addresses 0 to 4K]
29. Direct mapped cache
- Memory locations 0, 4K, 8K, etc. map into the same location in cache.
- This can cause problems when alternating runtime memory references point to the same cache line:

      real*4 a(1024), b(1024)
      common /arrays/ a, b
      do i = 1, 1024
        a(i) = b(i) + c(i)
      end do

- Since a and b are each 4 KB and adjacent in the common block, a(i) and b(i) lie exactly 4 KB apart and map to the same line of a 4K cache.
- Each reference causes a cache miss: thrashing.
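The mapping behind this example can be written out directly; the 4 KB cache size matches the previous slide, while the 32-byte line size is an assumption for illustration.

```python
def cache_line(address, cache_size=4096, line_size=32):
    """Direct mapping: line index = (address // line_size) mod number of lines."""
    num_lines = cache_size // line_size
    return (address // line_size) % num_lines

# Locations 0, 4K and 8K all collide on line 0, as the slide describes:
lines = [cache_line(a) for a in (0, 4096, 8192)]
# a(i) and b(i) in the Fortran loop sit exactly 4 KB apart, so each
# alternating reference evicts the other array's line: thrashing.
```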
30. Fully associative cache
- At the other extreme, a fully associative cache allows any memory location to be mapped into any cache line.
- It is difficult to find real-world examples of programs that will cause cache thrashing in a fully associative cache.
- However, fully associative cache is expensive in terms of size, price and speed.
31. Set associative cache
- Set associative cache is composed of a number of sets of direct mapped cache.
- Common choices are 2- and 4-way set associativity.
- In the previous example, references to the a array might be stored in set 1, and subsequent references to b would be stored in set 2.
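One way in a 2-way set can be sketched as below; the class name is invented and the LRU replacement policy is an assumption (a common choice, though the slide does not specify one). Two lines that collide on the same index can now coexist, so the alternating a(i)/b(i) pattern from the earlier example stops thrashing.

```python
class TwoWaySet:
    """One set of a 2-way set-associative cache with LRU replacement."""
    def __init__(self):
        self.ways = []                   # up to two tags, least recent first

    def access(self, tag):
        """Return True on a hit, False on a miss (which fills the set)."""
        if tag in self.ways:
            self.ways.remove(tag)        # hit: mark as most recently used
            self.ways.append(tag)
            return True
        if len(self.ways) == 2:
            self.ways.pop(0)             # miss with the set full: evict LRU
        self.ways.append(tag)
        return False

s = TwoWaySet()
s.access("a"); s.access("b")             # cold misses fill both ways
hits = (s.access("a"), s.access("b"))    # alternating references now hit
```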
32. Interfacing to the processor
[Figure: the CPU and its cache sit on the CPU-memory bus with main memory; a bus adapter bridges to the I/O bus, which carries I/O controllers for the network, disk and graphics output]
33. Interfacing to the processor
- Two methods of interfacing to an I/O device:
  - Memory mapped
  - I/O mapped
- For a memory-mapped device, portions of the address space are assigned to I/O space.
  - Reads/writes to these addresses cause data to be transferred.
  - Some part of the address space may also be reserved for control signals.
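A memory-mapped device in C is typically touched through a volatile pointer at a fixed address; the sketch below models the same idea in Python with an address-decoded register file. The base address, register offsets and bit meanings are all invented for illustration.

```python
UART_BASE   = 0xFFFF0000
UART_DATA   = UART_BASE + 0x0        # store here: transmit one byte
UART_STATUS = UART_BASE + 0x4        # load here: bit 0 = transmitter ready

class MemoryMappedUart:
    def __init__(self):
        self.transmitted = []

    def store(self, address, value):
        """An ordinary CPU store whose address decodes into I/O space."""
        if address == UART_DATA:
            self.transmitted.append(value)

    def load(self, address):
        """An ordinary CPU load whose address decodes into I/O space."""
        if address == UART_STATUS:
            return 0x1               # always ready in this sketch
        return 0

uart = MemoryMappedUart()
if uart.load(UART_STATUS) & 0x1:     # read the status register
    uart.store(UART_DATA, 0x41)      # write 'A' to the data register
```

No special opcodes are needed: ordinary loads and stores do all the work, which is the point of memory mapping.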
34. Interfacing to the processor
- The alternative is to use dedicated I/O opcodes.
- Such devices are known as I/O-mapped devices.
- Intel 80x86 processors use I/O-mapped devices and special opcodes (IN/OUT) to communicate with these devices.
- I/O mapping is less popular than memory mapping.
35. Controlling an I/O device
- I/O devices typically have control and status registers.
- The processor can control the device using two methods:
  - Programmed I/O: the processor periodically checks the status registers for completion of the transaction. This method puts a burden on the processor.
  - Interrupt-driven I/O: the processor works on some other process while waiting for I/O to complete. The I/O device raises an interrupt on completion of the transaction.
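The programmed-I/O case can be sketched as a polling loop (the device model, status bit and cycle counts are invented); every iteration is a CPU cycle spent doing nothing useful, which is the burden mentioned above.

```python
DONE = 0x1                           # assumed completion bit in the status register

class SlowDevice:
    def __init__(self, busy_cycles):
        self.busy_cycles = busy_cycles
        self.status = 0

    def tick(self):                  # stands in for the device working on its own
        self.busy_cycles -= 1
        if self.busy_cycles <= 0:
            self.status |= DONE

def programmed_io_wait(device):
    """Poll the status register until DONE; return polls wasted by the CPU."""
    polls = 0
    while not (device.status & DONE):
        device.tick()
        polls += 1
    return polls

wasted = programmed_io_wait(SlowDevice(busy_cycles=1000))   # 1000 wasted polls
```

Interrupt-driven I/O removes this loop entirely: the CPU does other work until the device signals completion.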
36. Direct memory access
- Interrupt-driven I/O relieves the processor from waiting for every I/O event, but a significant number of CPU cycles may still be required to move the data.
  - Transferring a disk block of 2048 words might require 2048 reads and 2048 stores, as well as the overhead for the interrupt.
- Direct memory access (DMA) can relieve the processor from the burden of bulk data movement.
37. Direct memory access
- A DMA engine is a specialized processor that transfers data between memory and an I/O device, allowing the processor to work on other tasks.
- A DMA engine is external to the CPU and must act as a bus master.
- The processor writes the start address and number of words to the DMA control registers.
- The DMA engine interrupts the processor when the transfer is complete.
- There may be multiple DMA devices in a system.
38. Advanced Microcontroller Bus Architecture
- A system-on-chip (SoC) design consists of a collection of cores and an interconnection scheme.
- Using an ad hoc scheme each time wastes design cycles.
- ARM's AMBA (Advanced Microcontroller Bus Architecture) can be used to standardize the on-chip connections of different cores.
- Use of a standard bus facilitates design reuse.
39. AMBA
- Three buses are defined within the AMBA specification:
  - The Advanced High-performance Bus (AHB)
  - The Advanced System Bus (ASB)
  - The Advanced Peripheral Bus (APB)
- A typical system will incorporate either an AHB or an ASB, together with an APB.
- The ASB is the older form of the system bus.
- The AHB was introduced to provide improved support for high performance, synthesis and timing verification.
40. AMBA system architecture
[Figure: the ARM processor core, on-chip RAM, external bus interface and DMA controller sit on the AHB or ASB; a bridge connects to the APB, which carries the test interface, UART, timer and parallel interface]
41. Arbitration
- A bus transaction is initiated by a bus master, which requests access from a central arbiter.
- The arbiter decides priorities when there are conflicting requests.
- The design of the arbiter is a system-specific issue; the ASB only specifies the protocol:
  - The master issues a request to the arbiter
  - When the bus is available, the arbiter issues a grant to the master
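The request/grant protocol above can be sketched as below. AMBA leaves the arbitration policy to the system designer; the fixed lowest-index-wins priority here is just one illustrative choice, and all names are invented.

```python
class AsbArbiter:
    """Toy central arbiter implementing a request/grant handshake."""
    def __init__(self):
        self.requests = set()
        self.owner = None

    def request(self, master):           # a master asserts its request line
        self.requests.add(master)

    def arbitrate(self):                 # grant only when the bus is free
        if self.owner is None and self.requests:
            self.owner = min(self.requests)   # policy: lowest index wins
            self.requests.discard(self.owner)
        return self.owner

    def release(self):                   # current master ends its transaction
        self.owner = None

arb = AsbArbiter()
arb.request(2); arb.request(1)
first = arb.arbitrate()                  # master 1 wins under this policy
arb.release()
second = arb.arbitrate()                 # master 2 is granted next
```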