Title: OPTIMIZING C CODE FOR THE ARM PROCESSOR
1. OPTIMIZING C CODE FOR THE ARM PROCESSOR
- Optimizing code takes time and reduces source code readability
- Usually done for functions that are critical for performance or power consumption and are executed frequently
- Usually done in combination with profiling
2. LOCAL VARIABLES
- ARM registers are 32 bits wide, so 32-bit data types are the most efficient choice for local variables
- Use signed int and unsigned int types and avoid char and short
- The only exception is when you want wraparound to occur (for example, modulo-256 arithmetic on a char)
- unsigned int is more efficient than signed int for division
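The effect of the counter type can be sketched in a few lines of C. The function names are illustrative and do not come from the slides. With an 8-bit counter the compiler may have to truncate the value after every increment, whereas an unsigned int counter maps directly onto a 32-bit register. The last function shows the one case where a narrow type is wanted: deliberate wraparound.

```c
#include <stdint.h>

/* Narrow counter: i must behave as an 8-bit value, so the compiler may
   need an extra instruction per iteration to keep it in range. */
int sum_narrow(const int *data)
{
    uint8_t i;
    int sum = 0;

    for (i = 0; i < 64; i++)
        sum += data[i];
    return sum;
}

/* Register-width counter: no truncation is needed. */
int sum_wide(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
        sum += data[i];
    return sum;
}

/* Deliberate wraparound: an 8-bit index that goes from 255 back to 0. */
uint8_t next_index(uint8_t i)
{
    return (uint8_t)(i + 1);
}
```

Both sum functions return the same value; only the generated code differs.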
3. LOOP STRUCTURES (incrementing for loop)
    int checksum_v5(int *data)
    {
        unsigned int i;
        int sum = 0;

        for (i = 0; i < 64; i++)
        {
            sum += *(data++);
        }
        return sum;
    }

    checksum_v5
            MOV   r2,r0               ; r2 = data
            MOV   r0,#0               ; sum = 0
            MOV   r1,#0               ; i = 0
    checksum_v5_loop
            LDR   r3,[r2],#4          ; r3 = *(data++)
            ADD   r1,r1,#1            ; i++
            CMP   r1,#0x40            ; compare i, 64
            ADD   r0,r3,r0            ; sum += r3
            BCC   checksum_v5_loop    ; if (i<64) goto loop
            MOV   pc,r14              ; return sum
4. LOOP STRUCTURES (decrementing for loop)
    int checksum_v6(int *data)
    {
        unsigned int i;
        int sum = 0;

        for (i = 64; i != 0; i--)
        {
            sum += *(data++);
        }
        return sum;
    }

    checksum_v6
            MOV   r2,r0               ; r2 = data
            MOV   r0,#0               ; sum = 0
            MOV   r1,#0x40            ; i = 64
    checksum_v6_loop
            LDR   r3,[r2],#4          ; r3 = *(data++)
            SUBS  r1,r1,#1            ; i-- and set flags
            ADD   r0,r3,r0            ; sum += r3
            BNE   checksum_v6_loop    ; if (i!=0) goto loop
            MOV   pc,r14              ; return sum
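The two loop forms can be compared side by side with a variable element count. This is a minimal sketch: the function names and the parameter n are assumptions, and the count-down version assumes n is at least 1 because a do-while body always runs once. The count-down form needs no separate compare, since the decrement itself can set the flags tested by the branch.

```c
/* Counting up: each iteration needs an increment and a compare. */
int checksum_up(const int *data, unsigned int n)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < n; i++)
        sum += *(data++);
    return sum;
}

/* Counting down to zero with do-while. Assumes n >= 1. */
int checksum_down(const int *data, unsigned int n)
{
    int sum = 0;

    do {
        sum += *(data++);
    } while (--n != 0);
    return sum;
}
```

For any n of at least 1, both functions return the same sum.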
5. LOOP UNROLLING
    int checksum_v7(int *data, unsigned int N)
    {
        int sum = 0;

        do
        {
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            N -= 4;
        } while (N != 0);
        return sum;
    }

    checksum_v7                       ; r0 = data, r1 = N
            MOV   r2,#0               ; sum = 0
    checksum_v7_loop
            LDR   r3,[r0],#4          ; r3 = *(data++)
            SUBS  r1,r1,#4            ; N -= 4 and set flags
            ADD   r2,r3,r2            ; sum += r3
            LDR   r3,[r0],#4          ; r3 = *(data++)
            ADD   r2,r3,r2            ; sum += r3
            LDR   r3,[r0],#4          ; r3 = *(data++)
            ADD   r2,r3,r2            ; sum += r3
            LDR   r3,[r0],#4          ; r3 = *(data++)
            ADD   r2,r3,r2            ; sum += r3
            BNE   checksum_v7_loop    ; if (N!=0) goto loop
            MOV   r0,r2               ; r0 = sum
            MOV   pc,r14              ; return r0
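The unrolled loop above only works when N is a nonzero multiple of four. One way to lift that restriction, sketched here under the assumption that a second short loop is acceptable, is to run the unrolled body N/4 times and then handle the remaining 0 to 3 elements one at a time. The function name is illustrative.

```c
/* Unrolled by four, with a clean-up loop for the leftover elements.
   Works for any n, including 0 and values that are not multiples of 4. */
int checksum_unrolled(const int *data, unsigned int n)
{
    unsigned int i;
    int sum = 0;

    for (i = n / 4; i != 0; i--) {      /* full groups of four */
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (i = n & 3; i != 0; i--)        /* remaining 0..3 elements */
        sum += *(data++);
    return sum;
}
```

Both loops still count down to zero, so each keeps the cheap loop-termination test.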
6. Loop Unrolling example
- Unroll the following loop by a factor of 2, 4, and 8
    for (i = 0; i < 64; i++)
    {
        a[i] = b[i] + c[i+1];
    }
7. Factor of 2
    for (i = 0; i < 32; i++)
    {
        a[2*i]   = b[2*i]   + c[2*i+1];
        a[2*i+1] = b[2*i+1] + c[2*i+1+1];
    }
8. Factor of 4
    for (i = 0; i < 16; i++)
    {
        a[4*i]   = b[4*i]   + c[4*i+1];
        a[4*i+1] = b[4*i+1] + c[4*i+1+1];
        a[4*i+2] = b[4*i+2] + c[4*i+2+1];
        a[4*i+3] = b[4*i+3] + c[4*i+3+1];
    }
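9. Factor of 8
    for (i = 0; i < 8; i++)
    {
        a[8*i]   = b[8*i]   + c[8*i+1];
        a[8*i+1] = b[8*i+1] + c[8*i+1+1];
        a[8*i+2] = b[8*i+2] + c[8*i+2+1];
        a[8*i+3] = b[8*i+3] + c[8*i+3+1];
        a[8*i+4] = b[8*i+4] + c[8*i+4+1];
        a[8*i+5] = b[8*i+5] + c[8*i+5+1];
        a[8*i+6] = b[8*i+6] + c[8*i+6+1];
        a[8*i+7] = b[8*i+7] + c[8*i+7+1];
    }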
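An unrolled loop should produce exactly the same result as the original. The sketch below assumes the loop body is the element-wise sum a[i] = b[i] + c[i+1] and shows a variant of the factor-of-4 version that steps the index by four instead of multiplying it, which keeps the index arithmetic simple. VEC_LEN and the function names are illustrative.

```c
#define VEC_LEN 64   /* loop trip count from the slides */

/* Original loop. c must hold VEC_LEN + 1 elements, since c[i + 1] is read. */
void add_rolled(int *a, const int *b, const int *c)
{
    int i;

    for (i = 0; i < VEC_LEN; i++)
        a[i] = b[i] + c[i + 1];
}

/* Unrolled by four, stepping i by 4 on each pass. */
void add_unrolled4(int *a, const int *b, const int *c)
{
    int i;

    for (i = 0; i < VEC_LEN; i += 4) {
        a[i]     = b[i]     + c[i + 1];
        a[i + 1] = b[i + 1] + c[i + 2];
        a[i + 2] = b[i + 2] + c[i + 3];
        a[i + 3] = b[i + 3] + c[i + 4];
    }
}
```

Comparing the two output arrays element by element is a quick way to confirm that an unrolling was done correctly.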
10. REGISTER ALLOCATION
- Limit the number of local variables in the inner loop of a function to 12
- Use the most important variables inside the innermost loop; this helps the compiler decide which ones to keep in registers
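11. CALLING FUNCTIONS
- Try to restrict functions to four arguments, since the first four are passed in registers r0-r3 and any further arguments go on the stack. Use structures to group related arguments and pass structure pointers instead
- Define small functions in the same source file as, and before, the functions that call them, so that the compiler can inline them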
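Both guidelines can be illustrated together. In this sketch every name (clamp, scaled_sum_args, scaled_sum_struct, ScaleJob) is hypothetical. The first version takes six loose arguments, so two of them are passed on the stack; the second groups them in a structure and passes a single pointer. The small helper is defined before its callers in the same file, which gives the compiler the chance to inline it.

```c
/* Small helper, defined before the functions that call it. */
static int clamp(int x, int lo, int hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Six arguments: only the first four fit in r0-r3. */
int scaled_sum_args(const int *data, unsigned int count,
                    int scale, int offset, int lo, int hi)
{
    int sum = 0;

    while (count-- != 0)
        sum += clamp(*(data++) * scale + offset, lo, hi);
    return sum;
}

/* The same parameters grouped in a structure (hypothetical type). */
typedef struct {
    const int *data;
    unsigned int count;
    int scale, offset, lo, hi;
} ScaleJob;

/* One pointer argument, passed in a single register. */
int scaled_sum_struct(const ScaleJob *job)
{
    const int *p = job->data;
    unsigned int n = job->count;
    int sum = 0;

    while (n-- != 0)
        sum += clamp(*(p++) * job->scale + job->offset, job->lo, job->hi);
    return sum;
}
```

The structure version pays for extra loads inside the function, so it is worth measuring with a profiler before committing to either form.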
12. REGISTER ALLOCATION
- Limit the number of inner-loop variables to 12 so that they can all be held in registers
13. SUMMARY
- Use signed int and unsigned int types for local variables, function arguments and return values
- The most efficient form of loop is the do-while loop that counts down to zero
- Unroll important loops
- Try to limit functions to four arguments
- Avoid divisions. Use multiplication by a reciprocal instead
- Use the inline assembler
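Multiplication by a reciprocal can be sketched for the common case of dividing an unsigned 32-bit value by 10. The constant 0xCCCCCCCD is 2^35 / 10 rounded up, so multiplying by it and shifting right by 35 gives the exact quotient for every 32-bit input. The function name div10 is illustrative; on ARM the 64-bit product would typically come from a single long-multiply instruction, though the code the compiler actually emits should be checked.

```c
#include <stdint.h>

/* n / 10 without a divide: multiply by ceil(2^35 / 10), then shift. */
uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```

The same approach works for other constant divisors, each with its own multiplier and shift; many compilers apply it automatically when the divisor is a compile-time constant.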
14. ARM INLINE ASSEMBLY
    #include <stdio.h>

    int main()
    {
        int n1, n2, m;

        n1 = 5;
        n2 = 3;
        __asm                   // inline assembly code
        {
            MUL m, n1, n2
        }
        printf("The result is %d\n", m);
        return (0);
    }
15. USING INLINE ASSEMBLY
- Used for ARM instructions not supported by the C compiler (coprocessor instructions and instruction set extensions)
- Creates portability issues
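One way to contain the portability problem is to wrap each assembly instruction in a small inline function with a plain C fallback. The sketch below is an assumption-laden illustration: it uses GCC-style extended asm rather than the __asm { } block syntax shown earlier, it assumes the ARM or Thumb-2 instruction set when the assembly path is taken, and the name mul_asm is hypothetical.

```c
/* Multiply wrapper: inline assembly on 32-bit ARM with a GCC-compatible
   compiler, ordinary C everywhere else. */
static inline int mul_asm(int a, int b)
{
#if defined(__GNUC__) && defined(__arm__)
    int result;

    /* "=&r" keeps the result register distinct from the inputs, which
       older ARM architectures require for MUL. */
    __asm__("mul %0, %1, %2" : "=&r"(result) : "r"(a), "r"(b));
    return result;
#else
    return a * b;               /* portable fallback */
#endif
}
```

Callers use mul_asm(n1, n2) like any other function, so the rest of the program stays free of compiler-specific syntax.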
16. ALTERNATIVE: CALLING ASSEMBLY FUNCTION FROM C
    #include <stdio.h>

    extern void multip(int n1, int n2, int m);

    int n1, n2, m;              // file scope, so the assembly IMPORT directives can resolve them

    int main()
    {
        n1 = 5;                 // Assigning numbers
        n2 = 3;
        multip(n1, n2, m);      // calling function
        printf("The result is %d\n", m);
        return 0;
    }
17. Assembly function
            AREA   example, CODE, READONLY
            EXPORT multip             ; external function name
            IMPORT n1                 ; input
            IMPORT n2
            IMPORT m                  ; return variable
    multip                            ; multip function begins
            LDR    r3,=n1             ; load data from memory to registers
            LDR    r1,[r3]
            LDR    r3,=n2
            LDR    r2,[r3]
            LDR    r12,=m             ; r3 and r12 are scratch registers, so nothing needs saving
            MUL    r0,r1,r2
            STR    r0,[r12]           ; store result to m memory location
            MOV    pc,lr              ; return from call
            END
18. PORTABILITY ISSUES
- char type: unsigned by default on ARM, signed on many other processors
- Alignment: ARM load and store instructions (LDR, STR) assume the address is a multiple of the size of the type being loaded or stored
- Endianness: little-endian by default, can be configured as big-endian
- Inline assembly: separate inline assembly into small inlined functions
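The first three issues each have a common defensive idiom in portable C, sketched below. The function names are illustrative. Stating signed char explicitly removes the dependence on the platform default; copying bytes with memcpy avoids dereferencing a misaligned pointer; and assembling a value byte by byte fixes the byte order regardless of how the processor is configured.

```c
#include <stdint.h>
#include <string.h>

/* char signedness: say "signed char" when negative values matter. */
int is_negative_byte(signed char c)
{
    return c < 0;
}

/* Alignment: copy the bytes instead of casting a pointer that may not
   be a multiple of 4. The result is in the machine's own byte order. */
uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}

/* Endianness: read a little-endian 32-bit value on any machine. */
uint32_t load_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

load_u32_le also works at any address, so it covers the alignment case whenever the byte order of the data is known.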
19. EXAMPLE
- Write a program that reads 8-element row and column vectors from memory and:
- Multiplies both by a scalar also found in memory
- Calculates the scalar product of the two vectors
- Assume no partial product exceeds 32 bits
- Use v1 = [1 2 3 4 5 6 7 8], v2 = [0 1 2 3 4 5 6 7]^T, s = 5 as test inputs
- Unroll the loop by two and four
- Repeat using inline assembly for the multiplications
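A possible starting point for the portable C part of this exercise is sketched below; it is one way to structure the solution, not a reference answer, and the inline assembly variant is left out. VLEN and the function names are assumptions. All loops count down to zero, and the two scalar-product functions are unrolled by two and by four.

```c
#define VLEN 8   /* vector length from the exercise */

/* Multiply every element of v by the scalar s, in place. */
void scale_vector(int *v, int s)
{
    unsigned int n = VLEN;

    do {
        *v *= s;
        v++;
    } while (--n != 0);
}

/* Scalar product, unrolled by two. */
int dot_unrolled2(const int *x, const int *y)
{
    unsigned int n = VLEN;
    int sum = 0;

    do {
        sum += x[0] * y[0];
        sum += x[1] * y[1];
        x += 2;
        y += 2;
        n -= 2;
    } while (n != 0);
    return sum;
}

/* Scalar product, unrolled by four. */
int dot_unrolled4(const int *x, const int *y)
{
    unsigned int n = VLEN;
    int sum = 0;

    do {
        sum += x[0] * y[0];
        sum += x[1] * y[1];
        sum += x[2] * y[2];
        sum += x[3] * y[3];
        x += 4;
        y += 4;
        n -= 4;
    } while (n != 0);
    return sum;
}
```

With the test inputs given above, scaling both vectors by 5 and taking the scalar product yields 4200 (25 times the unscaled product of 168) from either function.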