Introduction to Computer Hardware

Transcript and Presenter's Notes

Title: Introduction to Computer Hardware

1
Array Libraries
2
Array Libraries
  • Function extensions of C and Fortran 77 with
    array or vector libraries
  • The libraries are supposed to be optimised for
    each particular computer
  • Regular compilers can be used → no need for
    dedicated optimising compilers
  • One of the most well-known and well-designed
    array libraries is the Basic Linear Algebra
    Subprograms (BLAS)
  • Provides basic array operations for numerical
    linear algebra
  • Available for most modern VP and SP computers

3
BLAS
  • All BLAS routines are divided into 3 main
    categories
  • Level 1 BLAS addresses scalar and vector
    operations
  • Level 2 BLAS addresses matrix-vector operations
  • Level 3 BLAS addresses matrix-matrix operations
  • Routines of Level 1 do
  • vector reduction operations
  • vector rotation operations
  • element-wise and combined vector operations
  • data movement with vectors

4
Level 1 BLAS
  • A vector reduction operation
  • The addition of the scaled dot product of two
    real vectors x and y into a scaled scalar r:
    r ← βr + αxᵀy
  • The C interface of the routine implementing the
    operation is
  • void BLAS_ddot( enum blas_conj_type conj, int n, double alpha,
                    const double *x, int incx, double beta,
                    const double *y, int incy, double *r );
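The semantics of this routine can be sketched in plain C. The function `ddot_ref` below is an illustrative reference loop, not the optimised library implementation:

```c
#include <assert.h>
#include <math.h>

/* Reference semantics of r = beta*r + alpha*(x . y), with increments
   incx/incy between logically successive elements. */
static void ddot_ref(int n, double alpha, const double *x, int incx,
                     double beta, const double *y, int incy, double *r)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    *r = beta * (*r) + alpha * sum;
}
```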

5
Level 1 BLAS (ctd)
  • Other routines doing reduction operations
  • Compute different vector norms of vector x
  • Compute the sum of the entries of vector x
  • Find the smallest or biggest component of vector
    x
  • Compute the sum of squares of the entries of
    vector x
  • Routines doing rotation operations
  • Generate Givens plane rotation
  • Generate Jacobi rotation
  • Generate Householder transformation

6
Level 1 BLAS (ctd)
  • An element-wise vector operation
  • The scaled addition of two real vectors x and y:
    w ← αx + βy
  • The C interface of the routine implementing the
    operation is
  • void BLAS_dwaxpby( int n, double alpha, const double *x, int incx,
                       double beta, const double *y, int incy,
                       double *w, int incw );
  • Function BLAS_cwaxpby does the same operation but
    on complex vectors
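What the routine computes can be sketched in plain C (`waxpby_ref` is an illustrative reference loop, not the library code):

```c
#include <assert.h>

/* Reference semantics of w = alpha*x + beta*y (element-wise, with
   increments between logically successive elements). */
static void waxpby_ref(int n, double alpha, const double *x, int incx,
                       double beta, const double *y, int incy,
                       double *w, int incw)
{
    for (int i = 0; i < n; i++)
        w[i * incw] = alpha * x[i * incx] + beta * y[i * incy];
}
```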

7
Level 1 BLAS (ctd)
  • Other routines doing element-wise operations
  • Scale the entries of a vector x by the real
    scalar 1/a
  • Scale a vector x by a and a vector y by b, add
    these two vectors to one another and store the
    result in the vector y
  • Combine a scaled vector accumulation and a dot
    product
  • Apply a plane rotation to vectors x and y

8
Level 1 BLAS (ctd)
  • An example of data movement with vectors
  • The interchange of real vectors x and y
  • The C interface of the routine implementing the
    operation is
  • void BLAS_dswap( int n, double *x, int incx,
                     double *y, int incy );
  • Function BLAS_cswap does the same operation but
    on complex vectors

9
Level 1 BLAS (ctd)
  • Other routines doing data movement with vectors
  • Copy vector x into vector y
  • Sort the entries of real vector x in increasing
    or decreasing order and overwrite this vector x
    with the sorted vector as well as compute the
    corresponding permutation vector p
  • Scale the entries of a vector x by the real
    scalar 1/a
  • Permute the entries of vector x according to
    permutation vector p

10
Level 2 BLAS
  • Routines of Level 2
  • Compute different matrix vector products
  • Do addition of scaled matrix vector products
  • Compute multiple matrix vector products
  • Solve triangular equations
  • Perform rank one and rank two updates
  • Some operations use symmetric or triangular
    matrices

11
Level 2 BLAS (ctd)
  • To store matrices, the following schemes are used
  • Column-based and row-based storage
  • Packed storage for symmetric or triangular
    matrices
  • Band storage for band matrices
  • Conventional storage
  • An n×n matrix A is stored in a one-dimensional
    array a
  • a_ij → a[i + j*s] (C, column-wise storage)
  • a_ij → a[j + i*s] (C, row-wise storage)
  • If s = n, rows (columns) will be contiguous in
    memory
  • If s > n, there will be a gap of (s−n) memory
    elements between two successive rows (columns)
  • Only significant elements of symmetric/triangular
    matrices need be set
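The index arithmetic for row-wise conventional storage can be illustrated in plain C; the sizes N and S and the helper `elem` below are illustrative, not part of BLAS:

```c
#include <assert.h>

/* Row-wise conventional storage with leading dimension S >= N:
   element a_ij lives at a[j + i*S]; S > N leaves a gap of (S - N)
   elements between successive rows. */
#define N 3
#define S 5
static double a[N * S];

static double *elem(int i, int j) { return &a[j + i * S]; }
```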

12
Packed Storage
  • Packed storage
  • The relevant triangle of a symmetric/triangular
    matrix is packed by columns or rows in a
    one-dimensional array
  • The upper triangle of an nxn matrix A may be
    stored in a one-dimensional array a
  • a_ij (i ≤ j) → a[j + i*(2n−i−1)/2] (C, row-wise
    storage)
  • Example.
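The packed index formula can be sketched in C (`packed_index` is an illustrative helper):

```c
#include <assert.h>

/* Upper triangle of an n x n matrix packed row-wise: a_ij (i <= j)
   is stored at index j + i*(2n - i - 1)/2 of the packed array. */
static int packed_index(int n, int i, int j)
{
    return j + i * (2 * n - i - 1) / 2;
}
```

For n = 3 this lays the six upper-triangle entries out consecutively as a00, a01, a02, a11, a12, a22.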

13
Band Storage
  • Band storage
  • A compact storage scheme for band matrices
  • Consider Fortran and a column-wise storage scheme
  • An m×n band matrix A with l subdiagonals and u
    superdiagonals may be stored in a 2-dimensional
    array A with l+u+1 rows and n columns
  • Columns of matrix A are stored in corresponding
    columns of array A
  • Diagonals of matrix A are stored in rows of array
    A
  • a_ij → A(u+i−j, j) for max(0, j−u) ≤ i ≤
    min(m−1, j+l)
  • Example.
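The band mapping can be sketched in C with 0-based indices (the helpers `band_row` and `in_band` are illustrative):

```c
#include <assert.h>

/* Band storage: a_ij of an m x n band matrix with l subdiagonals and
   u superdiagonals maps to row u + i - j of the (l+u+1) x n array,
   provided max(0, j-u) <= i <= min(m-1, j+l). */
static int band_row(int u, int i, int j) { return u + i - j; }

static int in_band(int m, int l, int u, int i, int j)
{
    int lo = (j - u > 0) ? j - u : 0;
    int hi = (j + l < m - 1) ? j + l : m - 1;
    return i >= lo && i <= hi;
}
```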

14
Level 2 BLAS (ctd)
  • An example of matrix vector multiplication
    operation
  • The scaled addition of a real m-length vector y
    and the product of a general real m×n matrix A
    and a real n-length vector x: y ← αAx + βy
  • The C interface of the routine implementing this
    operation is
  • void BLAS_dgemv( enum blas_order_type order,
                     enum blas_trans_type trans,
                     int m, int n,
                     double alpha, const double *a, int stride,
                     const double *x, int incx, double beta,
                     double *y, int incy );
  • Parameters
  • order → blas_rowmajor or blas_colmajor
  • trans → blas_no_trans (do not transpose A)
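The operation itself, for a row-major matrix with leading dimension `stride`, can be sketched in plain C (`gemv_ref` is an illustrative reference loop, not the library code):

```c
#include <assert.h>

/* Reference semantics of y = alpha*A*x + beta*y for a row-major
   m x n matrix A with leading dimension stride >= n. */
static void gemv_ref(int m, int n, double alpha, const double *a,
                     int stride, const double *x, double beta, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += a[i * stride + j] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}
```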

15
Level 2 BLAS (ctd)
  • If matrix A is a general band matrix with l
    subdiagonals and u superdiagonals, the function
  • void BLAS_dgbmv( enum blas_order_type order,
                     enum blas_trans_type trans,
                     int m, int n, int l, int u,
                     double alpha, const double *a, int stride,
                     const double *x, int incx, double beta,
                     double *y, int incy );

makes better use of the memory. It assumes that a
band storage scheme is used to store matrix A.
16
Level 2 BLAS (ctd)
  • Other routines of Level 2 perform the following
    operations
  • as well as many others
  • For any matrix-vector operation with a specific
    matrix operand (triangular, symmetric, banded,
    etc.), there is a routine for each storage scheme
    that can be used to store the operand

17
Level 3 BLAS
  • Routines of Level 3 do
  • O(n²) matrix operations
  • norms, diagonal scaling, scaled accumulation and
    addition
  • different storage schemes to store matrix
    operands are supported
  • O(n³) matrix-matrix operations
  • multiplication, solving matrix equations,
    symmetric rank k and 2k updates
  • Data movement with matrices

18
Level 3 BLAS (ctd)
  • An example of an O(n²) matrix operation, which
    scales two real m×n matrices A and B and stores
    their sum in a matrix C, is C ← αA + βB
  • The C interface of the routine implementing this
    operation, under the assumption that the matrices
    A, B and C are of the general form, is
  • void BLAS_dge_add( enum blas_order_type order,
                       int m, int n,
                       double alpha, const double *a, int stride_a,
                       double beta, const double *b, int stride_b,
                       double *c, int stride_c );
  • There are 15 other routines performing this
    operation for different types and forms of the
    matrices A, B and C

19
Level 3 BLAS (ctd)
  • An example of an O(n³) matrix-matrix operation
    involving a real m×n matrix A, a real n×k matrix
    B, and a real m×k matrix C is C ← αAB + βC
  • The C routine implementing the operation for
    matrices A, B and C in the general form is
  • void BLAS_dgemm( enum blas_order_type order,
                     enum blas_trans_type trans_a,
                     enum blas_trans_type trans_b,
                     int m, int n, int k, double alpha,
                     const double *a, int stride_a,
                     const double *b, int stride_b,
                     double beta, double *c, int stride_c );
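What dgemm computes can be sketched in plain C for contiguous row-major matrices in the slide's shape convention (A is m×n, B is n×k, C is m×k); `gemm_ref` is an illustrative reference loop, not the optimised library code:

```c
#include <assert.h>

/* Reference semantics of C = alpha*A*B + beta*C, row-major,
   no transposition. */
static void gemm_ref(int m, int n, int k, double alpha,
                     const double *a, const double *b,
                     double beta, double *c)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int p = 0; p < n; p++)
                sum += a[i * n + p] * b[p * k + j];
            c[i * k + j] = alpha * sum + beta * c[i * k + j];
        }
}
```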

20
Level 3 BLAS (ctd)
  • Data movement with matrices includes
  • Copying matrix A or its transpose with storing
    the result in matrix B
  • B ← A or B ← Aᵀ
  • Transposition of a square matrix A with the
    result overwriting matrix A
  • A ← Aᵀ
  • Permutation of the rows or columns of matrix A by
    a permutation matrix P
  • A ← PA or A ← AP
  • Different types and forms of matrix operands as
    well as different storage schemes are supported

21
Sparse BLAS
  • Sparse BLAS
  • Provides routines for unstructured sparse
    matrices
  • Poorer functionality compared to Dense and Banded
    BLAS
  • only some basic array operations used in solving
    large sparse linear equations using iterative
    techniques
  • matrix multiply, triangular solve, sparse vector
    update, dot product, gather/scatter
  • Does not specify methods to store a sparse matrix
  • storage format is dependent on the algorithm, the
    original sparsity pattern, the format in which
    the data already exists, etc.
  • sparse matrix arguments are a placeholder, or
    handle, which refers to an abstract
    representation of a matrix, not the actual data
    components

22
Sparse BLAS (ctd)
  • Several routines provided to create sparse
    matrices
  • The internal representation is implementation
    dependent
  • Sparse BLAS applications are independent of the
    matrix storage scheme, relying on the scheme
    provided by each implementation
  • A typical Sparse BLAS application
  • Creates an internal sparse matrix representation
    and returns its handle
  • Uses the handle as a parameter in computational
    Sparse BLAS routines
  • Calls a cleanup routine to free resources
    associated with the handle, when the matrix is no
    longer needed

23
Example
  • Example. Consider a C program using Sparse BLAS
    performing the matrix-vector operation y = Ax,
    where

24
Example (ctd)
  • #include <blas_sparse.h>
  • int main()
  • {
  •   const int n = 4, nonzeros = 6;
  •   double values[] = {1.1, 2.2, 2.4, 3.3, 4.1, 4.4};
  •   int index_i[] = {0, 1, 1, 2, 3, 3};
  •   int index_j[] = {0, 1, 3, 2, 0, 3};
  •   double x[] = {1.0, 1.0, 1.0, 1.0}, y[] = {0.0, 0.0, 0.0, 0.0};
  •   blas_sparse_matrix A;
  •   int k;
  •   double alpha = 1.0;
  •  
  •   A = BLAS_duscr_begin(n, n);      //Create Sparse BLAS handle
  •   for(k = 0; k < nonzeros; k++)    //Insert entries one by one
  •     BLAS_duscr_insert_entry(A, values[k], index_i[k], index_j[k]);
  •   BLAS_duscr_end(A);               //Complete construction of sparse matrix
  •  
  •   //Compute matrix-vector product y = Ax
  •   BLAS_dusmv(blas_no_trans, alpha, A, x, 1, y, 1);
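Since Sparse BLAS hides the storage format behind the handle, the arithmetic the example performs can be shown separately in plain C as a coordinate-format (COO) matrix-vector product, using the data from the slide. The helper `coo_mv` is illustrative and not part of Sparse BLAS:

```c
#include <assert.h>
#include <math.h>

/* y += A*x for a matrix given as COO triples (index_i, index_j, values). */
static void coo_mv(int nonzeros, const double *values,
                   const int *index_i, const int *index_j,
                   const double *x, double *y)
{
    for (int k = 0; k < nonzeros; k++)
        y[index_i[k]] += values[k] * x[index_j[k]];
}
```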

25
Parallel Languages
26
Parallel Languages
  • C and Fortran 77 do not reflect some essential
    features of VP and SP architectures
  • They cannot play the same role for VPs and SPs
  • Optimizing compilers
  • Only for a simple and limited class of
    applications
  • Array libraries
  • Cover a limited class of array operations
  • Other array operations can be only expressed as a
    combination of the locally-optimized library
    array operations
  • This excludes global optimization of combined
    array operations

27
Parallel Languages (ctd)
  • Parallel extensions of C and Fortran 77 allow
    programmers
  • To explicitly express in a portable form any
    array operation
  • Compiler does not need to recognize code to
    parallelize
  • Global optimisation of operations on arrays is
    possible
  • We consider 2 parallel supersets of C and Fortran
    77
  • Fortran 90
  • C[]

28
Fortran 90
  • Fortran 90 is a new Fortran standard released in
    1991
  • Widely implemented since then
  • Two categories of new features
  • Modernization of Fortran according to the
    state-of-the-art in serial programming languages
  • Support for explicit expression of operations on
    arrays

29
Fortran 90 (ctd)
  • Serial extensions include
  • Free-format source code and some other simple
    improvements
  • Dynamic memory allocation (automatic arrays,
    allocatable arrays, and pointers and associated
    heap storage management)
  • User-defined data types (structures)
  • Generic user-defined procedures (functions and
    subroutines) and operators

30
Fortran 90 (ctd)
  • Serial extensions (ctd)
  • Recursive procedures
  • New control structures to support structured
    programming
  • A new program unit, MODULE, for encapsulation of
    data and a related set of procedures
  • We focus on parallel extensions

31
Fortran 90 (ctd)
  • Fortran 90 considers arrays first-class objects
  • Whole-array operations, assignments, and
    functions
  • Operations and assignments are extended in an
    obvious way, on an element-by-element basis
  • Intrinsic functions are array-valued for array
    arguments
  • operate element-wise if given an array as their
    argument
  • Array expressions may include scalar constants
    and variables, which are replicated (or expanded)
    to the required number of elements

32
Fortran 90 (ctd)
  • Example.
  • REAL, DIMENSION(3,4,5) :: a, b, c, d
  • c = a + b
  • d = SQRT(a)
  • c = a + 2.0

33
WHERE Structure
  • Sometimes, some elements of arrays in an
    array-valued expression should be treated
    specially
  • Division by zero in a = 1./a should be avoided
  • WHERE statement
  • WHERE (a /= 0.) a = 1./a
  • WHERE construct
  • WHERE (a /= 0.)
  •   a = 1./a
  • ELSEWHERE
  •   a = HUGE(a)
  • END WHERE

34
Fortran 90 (ctd)
  • All the array elements in an array-valued
    expression or array assignment must be
    conformable, i.e., they must have the same shape
  • the same number of axes
  • the same number of elements along each axis
  • Example.
  • REAL a(3,4,5), b(0:2,4,5), c(3,4,-1:3)
  • Arrays a, b, and c have the same rank of 3,
    extents of 3, 4, and 5, shape of (3,4,5), and
    size of 60
  • They only differ in the lower and upper dimension
    bounds

35
Array Section
  • An array section can be used everywhere in array
    assignments and array-valued expressions where a
    whole array is allowed
  • An array section may be specified with subscripts
    of the form of the triplet lower:upper:stride
  • It designates an ordered set i1,…,ik such that
  • i1 = lower
  • i(j+1) = i(j) + stride ( j = 1,…,k−1 )
  • ik ≤ upper, upper − ik < stride
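The triplet semantics can be mimicked in plain C (positive stride assumed; the helper `triplet` is illustrative):

```c
#include <assert.h>

/* Expand the Fortran 90 triplet lower:upper:stride into explicit
   indices; returns the number of elements in the section. */
static int triplet(int lower, int upper, int stride, int *out)
{
    int k = 0;
    for (int i = lower; i <= upper; i += stride)
        out[k++] = i;
    return k;
}
```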

36
Array Section (ctd)
  • Example. REAL a(50,50)
  • What sections are designated by the following
    expressions? What are the rank and shape of each
    section?
  • a(i,1:50:1), a(i,1:50)
  • a(i,:)
  • a(i,1:50:3)
  • a(i,50:1:-1)
  • a(11:40,j)
  • a(1:10,1:10)

37
Array Section (ctd)
  • Vector subscripts may also be used to specify
    array sections
  • Any expression whose value is a rank 1 integer
    array may be used as a vector subscript
  • Example.
  • REAL a(5,5), b(5)
  • INTEGER index(5)
  • index = (/5,4,3,2,1/)
  • b = a(index,1)

38
Array Section (ctd)
  • Whole arrays and array sections of the same shape
    can be mixed in expressions and assignments
  • Note, that unlike a whole array, an array section
    may not occupy contiguous storage locations

39
Array Constants
  • Fortran 90 introduces array constants, or array
    constructors
  • The simplest form is just a list of elements
    enclosed in (/ and /)
  • May contain lists of scalars, lists of arrays,
    and implied-DO loops
  • Examples.
  • (/ (0, i=1,50) /)
  • (/ (3.14*i, i=4,100,3) /)
  • (/ ( (/ 5,4,3,2,1 /), i=1,5 ) /)

40
Array Constants (ctd)
  • The array constructors can only produce
    1-dimensional arrays
  • Function RESHAPE can be used to construct arrays
    of higher rank
  • REAL a(500,500)
  • a = RESHAPE( (/ (0., i=1,250000) /), (/ 500,500 /) )

41
Assumed-Shape and Automatic Arrays
  • Consider the user-defined procedure operating on
    arrays
  • SUBROUTINE swap(a,b)
  •   REAL, DIMENSION(:,:) :: a, b
  •   REAL, DIMENSION(SIZE(a,1), SIZE(a,2)) :: temp
  •   temp = a
  •   a = b
  •   b = temp
  • END SUBROUTINE swap

42
Assumed-Shape and Automatic Arrays (ctd)
  • Formal array arguments a and b are of assumed
    shape
  • Only the type and rank are specified
  • The actual shape is taken from that of the actual
    array arguments
  • The local array temp is an example of the
    automatic array
  • Its size is set at runtime
  • It stops existing as soon as control leaves the
    procedure

43
Intrinsic Array Functions
  • Intrinsic array functions include
  • Extension of such intrinsic functions as SQRT,
    SIN, etc. to array arguments
  • Specific array intrinsic functions
  • Specific array intrinsic functions do the
    following
  • Compute the scalar product of two vectors
    (DOT_PRODUCT) and the matrix product of two
    matrices (MATMUL)

44
Specific Intrinsic Array Functions
  • Perform diverse reduction operations on an array
  • logical multiplication (ALL) and addition (ANY)
  • counting the number of true elements in the array
  • arithmetical multiplication (PRODUCT) and
    addition (SUM) of its elements
  • finding the smallest (MINVAL) or the largest
    (MAXVAL) element

45
Specific Intrinsic Array Functions (ctd)
  • Return diverse attributes of an array
  • its shape (SHAPE)
  • the lower dimension bounds of the array (LBOUND)
  • the upper dimension bounds (UBOUND)
  • the number of elements (SIZE)
  • the allocation status of the array (ALLOCATED)

46
Specific Intrinsic Array Functions (ctd)
  • Construct arrays by means of
  • merging two arrays under mask (MERGE)
  • packing an array into a vector (PACK)
  • replication of an array by adding a dimension
    (SPREAD)
  • unpacking a vector (a rank 1 array) into an array
    under mask (UNPACK)

47
Specific Intrinsic Array Functions (ctd)
  • Reshape arrays (RESHAPE)
  • Move array elements performing
  • the circular shift (CSHIFT)
  • the end-off shift (EOSHIFT)
  • the transpose of a rank 2 array (TRANSPOSE)
  • Locate the first maximum (MAXLOC) or minimum
    (MINLOC) element in an array

48
C[]
  • C[] ("C brackets") is a strict ANSI C superset
    allowing programmers to explicitly describe
    operations on arrays
  • Vector value, or vector
  • An ordered set of values (or vector values) of
    any one type
  • Any vector type is characterised by
  • the number of elements
  • the type of elements

49
Vector Value and Vector Object
  • Vector object
  • A region of data storage, the contents of which
    can represent vector values
  • An ordered sequence of objects (or vector
    objects) of any one type
  • Unlike ANSI C, C[] defines the notion of the
    value of an array object
  • This value is a vector

50
Vector Value and Vector Object (ctd)
  • Example. The value of the array
  • int a[3][2] = {0,1,2,3,4,5};
  • is the vector
  • {{0,1}, {2,3}, {4,5}}
  • This vector has the shape [3][2].
  • This vector type is named int[3][2]
  • The shape of an array is that of its vector value
  • In C[], an array object is a particular case of a
    vector object

51
Arrays and Pointers
  • A C array is a contiguously allocated set of
    elements of any one type of object
  • A C[] array is a set of elements of any one type
    of object sequentially allocated with a positive
    stride
  • The stride is the distance between successive
    elements of the array measured in units equal to
    the size of an array element
  • If the stride is not specified, it is assumed to
    be 1

52
Arrays and Pointers (ctd)
  • A C[] array has at least three attributes
  • the type of elements
  • the number of elements
  • the allocation stride

53
Arrays and Pointers (ctd)
  • Example 1.
  • int a[3];
  • int a[3:1];
  • Example 2.
  • int a[3:3];

The slot between array elements is of
2×sizeof(int) bytes
54
Arrays and Pointers (ctd)
  • In ANSI C, a pointer has only one attribute
  • The type of the object it points to
  • It is needed to correctly interpret
  • the value of the object it points to
  • the additive operators + and - (operand(s) and
    result should point into the same array)
  • In C[], a pointer has an additional attribute,
    the stride
  • If the stride is not specified, it is assumed to
    be 1

55
Arrays and Pointers (ctd)
  • Example 1. The declarations
  • int a[] = {0,1,2,3,4};
  • int *p1 = (void *)a;
  • int *2 p2 = (void *)&a[4];
  • form the following structure of storage

p1+2 and p2-1 point to the same array element,
a[2]
56
Arrays and Pointers (ctd)
  • Expressions e1[e2] and e2[e1] provide access to
    the e2-th element of an array e1
  • Identical to (*((e1)+(e2)))
  • e2 is an integer expression
  • e1 is an lvalue that has the type "array of type"
  • converted to an expression of the type "pointer
    to type" pointing to the initial element of the
    array object
  • the attribute stride of this pointer is identical
    to that of the array object

57
Arrays and Pointers (ctd)
  • C[] allows dynamic arrays
  • typedef int (*pDiag)[n:n+1];
  • int a[n][n];
  • int j;
  • pDiag p = (void *)a;
  • ...
  • for(j = 0; j < n; j++)
  •   (*p)[j] = 1;

58
Blocking Operator
  • In C[], the value of an array object is a vector
  • The i-th element of the vector is the value of
    the i-th element of the array object
  • The postfix operator [] (the blocking operator)
  • Supports access to an array as a whole
  • Its operand has the type "array of type"
  • Blocks the conversion of the operand to a pointer
  • Example. int a[5], b[5:2], c[5:3];
  • a[], b[], and c[] designate arrays a, b, and c as
    a whole
  • c[] = a[] + b[]

59
Lvector
  • In C[], an lvalue is an expression designating an
    object
  • Example. int d[5][5];
  • d[i][j], d and d[0] are lvalues
  • d[i][j]+1 and d[0]+1 are not
  • Modifiable lvalue
  • May be the left operand of an assignment operator
  • d[i][j] is a modifiable lvalue
  • d and d[0] are not

60
Lvector (ctd)
  • In C[], an lvector is an expression designating a
    vector object
  • Modifiable lvector
  • May be the left operand of an assignment operator
  • Example. int d[5][5];
  • d[], d[0][], d, and d[0] are lvectors
  • d[] and d[0][] are modifiable
  • d and d[0] are not modifiable

61
Lvector (ctd)
  • Example. int a[4][4];
  • (*(int (*)[4:5])a)[]

62
Subarray
  • An object belongs to an array if
  • It is an element of the array, or
  • It belongs to an element of the array
  • Subarray
  • A set of objects belonging to an array
  • An array itself
  • Example (ctd). The main diagonal is a subarray
  • It is an array object of the type int[4:5]

63
Subarray (ctd)
  • Example. int a[4][4];
  • (*(int (*)[3:5])&a[0][1])[]

64
Subarray (ctd)
  • Not every regular set of objects belonging to an
    array makes up its subarray
  • Example. int a[5][5];

No constant modifiable lvector designates this
inner square
65
Array Section
  • The operator [:] (the grid operator)
  • Supports access to array sections of general form
  • Syntax: e[l:r:s]
  • Expression e may have the type "array of type" or
    "pointer to type"
  • Expressions l, r, and s have integer types and
    denote
  • the left bound
  • the right bound
  • the stride

66
Array Section (ctd)
  • Semantics: e[l:r:s]
  • A vector object of (r−l)/s+1 elements of type
    type
  • Its i-th element is e[l+s*i]
  • Expression e[l:r:s] is an lvector
  • Expression e[l:r:s] is a modifiable lvector if
  • All expressions e[l+s*i], i = 0,1,…, are
    modifiable lvalues
  • e[l:r:1] ≡ e[l:r]

67
Array Section (ctd)
  • Operand e in e[l:r:s] may have a vector type
  • The operator is applied element-wise
  • Let the vector value of e be u1,…,uk
  • e[l:r:s] will designate a vector of k vectors
  • The i-th element of the j-th vector will be
    u_j[l+s*i] (j = 1,…,k)

68
Example
  • Example. int a[5][5];
  • a[1:3][1:3]

69
Example (ctd)
  • a[1:3]

a[1:3][1:3]
70
Array Section (ctd)
  • Operands l and/or r in e[l:r:s] may be omitted
  • If l is omitted, the left bound is set to 0
  • If r is omitted, the right bound is
  • set to n−1, if the first operand e is an
    n-element array
  • determined from the context, if e is a pointer
  • Example. int a[5][5];
  • a[:] ≡ a[0:4]

71
Element-Wise Vector Operators
  • The operand of the cast operator and of the unary
    ++, --, &, *, +, -, ~, and ! operators may have a
    vector type
  • The operators are applied element-wise
  • Example.
  • int j, k, l, m, n;
  • int *p[5] = {&j, &k, &l, &m, &n};
  • *p[1:3] designates a vector object consisting
    of three integer variables k, l, and m.

72
Element-Wise Vector Operators (ctd)
  • Binary operators *, /, %, +, -, <<, >>, <, >, <=,
    >=, ==, !=, &, ^, |, &&, and || may have vector
    operands
  • If the operands have the same shape, then the
    operator is executed element-wise producing the
    result of this shape

73
Element-Wise Vector Operators (ctd)
  • In general, the operands may have different
    shapes but they must be conformable
  • 2 operands are conformable iff the beginning of
    the shape of one operand is identical to the
    shape of the other operand
  • Vectors having shapes [9][8][7][6] and [9][8] are
    conformable
  • A non-vector operand is conformable with any
    vector operand (why?)

74
Element-Wise Vector Operators (ctd)
  • Let operands a and b be conformable, and rank(a)
    lt rank(b)
  • The execution of the operator starts from
    conformable extension of the value of a to the
    shape of b
  • The conformable extension just replicates the
    value by adding dimensions

75
Element-Wise Vector Operators (ctd)
  • Example. The conformable extension of the vector
    {{1,2,3},{4,5,6}} of shape [2][3] to shape
    [2][3][2] is the vector
    {{{1,1},{2,2},{3,3}},{{4,4},{5,5},{6,6}}}
  • Then the operator is applied element-wise to the
    result of the conformable extension of the value
    of a and the value of b, producing the result of
    the same shape as that of b

76
Element-Wise Vector Operators (ctd)
  • The assignment operators =, +=, *=, etc. may have
    vector operands
  • The left operand shall be a modifiable lvector
  • Its rank shall not be less than that of the right
    operand
  • The operands shall be conformable
  • Two-step execution
  • The right operand is conformably extended to the
    shape of the left one
  • The assignment is executed element-wise

77
Element-Wise Vector Operators (ctd)
  • Example.
  • int a[m][n], b[m];
  • ...
  • a[][] = b[]

78
Example
  • LU-factorization of the square matrix a by using
    the Gaussian elimination
  • double a[n][n], t;
  • int i, j;
  • ...
  • for(i = 0; i < n; i++)
  •   for(j = i+1; j < n; j++)
  •   {
  •     t = a[j][i]/a[i][i];
  •     if(a[j][i] != 0.)
  •       a[j][i:n-1] -= t*a[i][i:n-1];
  •   }
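The same elimination written in plain scalar C, with the vector update expanded into an inner loop (no pivoting, as in the fragment above; the fixed size N is illustrative):

```c
#include <assert.h>
#include <math.h>

#define N 3

/* Gaussian elimination overwriting a with the eliminated (upper
   triangular) result; a sketch without pivoting. */
static void lu(double a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (a[j][i] != 0.0) {
                double t = a[j][i] / a[i][i];
                for (int k = i; k < N; k++)
                    a[j][k] -= t * a[i][k];
            }
}
```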

79
Element-Wise Vector Operators (ctd)
  • By definition, e1[e2] is identical to
    (*((e1)+(e2)))
  • Therefore, e1 and e2 may be of vector type
  • The programmer can construct lvectors that
    designate irregular array sections
  • Example.
  • int a[m][n], ind[] = {0,1,6,18};
  • ...
  • a[ind[]] = 0;
  • This code zeros the elements of the 0-th, 1-st,
    6-th, and 18-th rows of array a

80
Element-Wise Vector Operators (ctd)
  • The first operand of the . operator may have a
    vector type
  • The second operand shall name a member of a
    structure or union type
  • The operator is executed element-wise
  • The result will have the same shape as the first
    operand
  • e->id is identical to (*e).id

81
Reduction Operators
  • Reduction operators [+], [*], [&], [^], [|],
    [&&], [||], [?<], and [?>]
  • Unary operators
  • Only applicable to vector operands
  • If v1,…,vn are the elements of the vector value
    of the expression e, then the value of the
    expression [op]e is that of the expression
    (v1 op … op vn)

82
Examples
  • Example 1. Dot product of the vectors a and b.
  • double a[n];
  • double b[n];
  • double c;
  • ...
  • c = [+](a[]*b[]);

83
Examples (ctd)
  • Example 2. Maximal element of the matrix a
  • int a[m][n];
  • int max;
  • ...
  • max = [?>][?>]a;

84
Examples (ctd)
  • Example 3. Multiplication of matrices a and b
  • double a[m][l];
  • double b[l][n];
  • double c[m][n];
  • int i;
  • ...
  • for(i = 0; i < m; i++)
  •   c[i][] = [+](a[i][]*b[][]);

85
Memory Hierarchy
  • Parallel programming systems for VPs and SPs take
    into account their modern memory structure
  • Optimal memory management is often more efficient
    than optimal usage of IEUs
  • Approaches to optimal memory management appear
    surprisingly similar to optimisation of parallel
    facilities
  • Simple two-level memory model
  • Small and fast register memory
  • Large and relatively slow main memory

86
Memory Hierarchy (ctd)
  • A simple modern memory hierarchy
  • Register memory
  • Cache memory
  • Main memory
  • Disk memory
  • Cache memory
  • A buffer memory between main memory and registers
  • Holds copies of some data from the main memory

87
Memory Hierarchy (ctd)
  • Execution of instruction reading a data item from
    the main memory into a register
  • Check if a copy of the data item is already in
    the cache
  • If so, the data item will be actually transferred
    into the register from the cache
  • If not, the data item will be transferred into
    the register from the main memory, and a copy of
    the item will appear in the cache

88
Cache
  • Cache
  • Partitioned into cache lines
  • Cache line is a minimum unit of data transfer
    between the cache and the main memory
  • Scalars may be transferred only as a part of a
    cache line
  • Much smaller than the main memory
  • The same cache line may reflect different data
    blocks from the main memory

89
Cache (ctd)
  • Types of cache memory
  • Direct mapped
  • each block of the main memory has only one place
    it can appear in the cache
  • Fully associative
  • a block can be placed anywhere in the cache
  • Set associative
  • a block can be placed in a restricted set of
    places
  • a set is a group of two or more cache lines
  • n-way associative cache

90
Cache (ctd)
  • Cache miss is the situation when a data item
    being referenced is not in the cache
  • Minimization of cache misses is able to
    significantly accelerate execution of the program
  • Programs intensively using basic operations on
    arrays are obviously suitable for that type of
    optimization

91
Loop Tiling
  • The main specific optimization minimizing the
    number of cache misses is loop tiling
  • Consider the loop nest
  • for(i = 0; i < m; i++)    /* loop 1 */
  •   for(j = 0; j < n; j++)  /* loop 2 */
  •     if(i == 0)
  •       b[j] = a[i][j];
  •     else
  •       b[j] += a[i][j];
  • b[j] are repeatedly used by successive iterations
    of loop 1

92
Loop Tiling (ctd)
  • If n is large enough, the data items may be
    flushed from the cache by the moment of their
    repeated use
  • To minimize the flushing of repeatedly used data
    items, the number of iterations of loop 2 may be
    decreased
  • To keep the total number of iterations of this
    loop nest unchanged, an additional controlling
    loop is introduced

93
Loop Tiling (ctd)
  • The transformed loop nest is
  • for(k = 0; k < n; k += T)            //additional controlling loop 0
  •   for(i = 0; i < m; i++)             // loop 1
  •     for(j = k; j < min(k+T, n); j++) // loop 2
  •       if(i == 0)
  •         b[j] = a[i][j];
  •       else
  •         b[j] += a[i][j];
  • This transformation is called tiling
  • T is the tile size
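The transformation can be checked in plain C: both loop nests below compute column sums of a matrix, and tiling only changes the traversal order, not the result (the sizes M, N and tile size T are illustrative):

```c
#include <assert.h>

#define M 4
#define N 10
#define T 3

static int min_int(int x, int y) { return x < y ? x : y; }

/* Original nest: b[j] accumulates column j over successive i. */
static void sums_untiled(double a[M][N], double b[N])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            b[j] = (i == 0) ? a[i][j] : b[j] + a[i][j];
}

/* Tiled nest: an extra controlling loop restricts j to tiles of
   size T, improving reuse of b[j] while it is still cached. */
static void sums_tiled(double a[M][N], double b[N])
{
    for (int k = 0; k < N; k += T)
        for (int i = 0; i < M; i++)
            for (int j = k; j < min_int(k + T, N); j++)
                b[j] = (i == 0) ? a[i][j] : b[j] + a[i][j];
}
```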

94
Loop Tiling (ctd)
  • In general, the loop tiling is applied to loop
    nests of the form
  • for(i1 ...)        /* loop 1 */
  •   for(i2 ...)      /* loop 2 */
  •     ...
  •       for(in ...)  /* loop n */
  •       {
  •         ...
  •         ... e[i2]...[in] ...
  •         ...
  •       }
  • The goal is to minimize the number of cache
    misses for the reference e[i2]...[in], which is
    repeatedly used by successive iterations of loop
    1

95
Loop Tiling and Optimising Compilers
  • The recognition of the loop nests that can be
    tiled is the most difficult problem to be solved
    by optimising C and Fortran 77 compilers
  • Based on the analysis of data dependencies in
    loop nests
  • Theorem. The loop tiling is legally applicable
    (to the above loop nest) iff the loops from loop
    2 to loop n are fully interchangeable
  • To prove the interchangeability, an analysis of
    data dependence between different iterations of
    the loop nest is needed

96
Loop Tiling and Array Libraries
  • Level 3 BLAS is specified to support block
    algorithms of matrix-matrix operations
  • Partitioning matrices into blocks and performing
    the computation on the blocks maximizes the reuse
    of data held in the upper levels of memory
    hierarchy

97
Loop Tiling and Parallel Languages
  • Compilers for parallel languages do not need to
    recognize loops suitable for tiling
  • They can translate explicit operations on arrays
    into loop nests with the best possible temporal
    locality

98
Virtual memory
  • Instructions address virtual memory rather than
    the real physical memory
  • The virtual memory is partitioned into pages of a
    fixed size
  • Each page is stored on a disk until it is needed
  • When the page is needed, it is copied to main
    memory, with the virtual addresses mapped into
    real addresses
  • This copying is known as paging or swapping

99
Virtual memory (ctd)
  • In programs processing large enough arrays, the
    data do not fit into main memory
  • The swapping takes place each time when required
    data are not in the main memory
  • The swapping is a very expensive operation
  • Minimization of the number of swappings can
    significantly accelerate the programs
  • The problem is similar to minimization of cache
    misses and can be, therefore, approached
    similarly

100
Vector and Superscalar Processors Summary
  • VPs and SPs provide instruction-level
    parallelism, which is best exploited by
    applications with intensive operations on arrays
  • Such applications can be written in a serial
    programming language and compiled by dedicated
    optimizing compilers performing some specific
    loop optimizations
  • Modular, portable, and reliable programming are
    supported
  • Efficiency and portable efficiency are also
    supported but only for a limited class of
    programs

101
Vector and Superscalar Processors Summary (ctd)
  • Array libraries allow the programmers to avoid
    the use of dedicated compilers
  • The programmers express operations on arrays
    directly using calls to carefully implemented
    subroutines
  • Modular, portable, and reliable programming are
    supported
  • Limited efficiency and portable efficiency
  • Excludes global optimization of combined array
    operations

102
Vector and Superscalar Processors Summary (ctd)
  • Parallel languages combine advantages of the
    first and second approaches
  • Operations on arrays can be explicitly expressed
  • No need for sophisticated algorithms to recognize
    parallelizable loops
  • Global optimisation of combined array operations
    is possible
  • They support general-purpose programming (unlike
    existing array libraries)