AN253
Optimizing Code Speed for the MaverickCrunch™
Coprocessor
Brett Davis

1. Introduction

This application note is intended to assist developers in optimizing their source code for use with the MaverickCrunch coprocessor. This document begins with a brief overview of the MaverickCrunch coprocessor, followed by optimization guidelines, and concludes with an example applying the guidelines discussed.
Multiple facets of code optimization must be considered in order to realize the full benefit of the MaverickCrunch coprocessor. The guidelines in this document are categorized as algorithm, compiler, or hardware optimizations. The discussion on algorithm optimization centers on high-level programming details such as compound expressions and loop unrolling. Next, the compiler optimization guidelines deal with the effects of compiler optimization on code performance - primarily code size and execution speed. Finally, the hardware optimization section enumerates optimization guidelines related to the MaverickCrunch coprocessor implementation, such as the IEEE-754 implementation and pipeline stalls.
Note: Algorithm selection will not be discussed in this application note. It is assumed that the
developer has selected and implemented the correct algorithm for their application.

2. MaverickCrunch

This section introduces and summarizes the features, instruction set and architecture of the MaverickCrunch coprocessor. For further in-depth information on these topics, please read Chapter 3 of the User's Guide.

2.1 Features

The MaverickCrunch coprocessor accelerates IEEE-754 floating point arithmetic, and 32-bit and 64-bit fixed point arithmetic. The MaverickCrunch coprocessor is an excellent candidate for encoding and decoding digital audio, digital signal processing (such as IIR, FIR, FFT) and numeric approximations. Key features of the MaverickCrunch include:
- IEEE-754 based single and double precision floating point support
- Full IEEE-754 rounding support
- Inexact, Overflow, Underflow, and Invalid Operator IEEE-754 exceptions
- 32/64-bit fixed point integer operations
- Add, multiply, and compare functions for all data types
- Fixed point integer MAC: 32-bit input with 72-bit accumulate
- Fixed point integer shifts
http://www.cirrus.com
Copyright Cirrus Logic, Inc. 2004
(All Rights Reserved)
JAN ‘04
AN253REV1
- Conversion between floating point and integer data representations
- Sixteen (16) 64-bit general-purpose registers
- Four (4) 72-bit accumulators
- Status and control registers

2.2 Instruction Set

The MaverickCrunch coprocessor's instruction set is robust and includes memory, control, and arithmetic operations. MaverickCrunch mnemonics are translated by the compiler or assembler into ARM coprocessor instructions. For example, the MaverickCrunch mnemonic for double-precision floating-point multiply is:
cfmuld c0, c1, c2
The equivalent ARM coprocessor instruction is:
cdp p4, 1, c0, c1, c2, 1
There are five categories of ARM coprocessor instructions: Data Path (CDP), Load (LDC), Store (STC), ARM to Coprocessor Moves (MCR), and Coprocessor to ARM Moves (MRC). CDP instructions include all arithmetic operations, and any other operation internal to the coprocessor. LDC and STC instructions include the set of operations responsible for moving data between memory and the coprocessor. MCR and MRC instructions are responsible for moving data between ARM and coprocessor registers.
Table 1, Table 2 and Table 3 summarize all of the MaverickCrunch's instruction mnemonics. For more information on the MaverickCrunch instruction set, please see the table "MaverickCrunch Instruction Set" in the User's Guide.

Table 1. MaverickCrunch Load/Store Mnemonics

cfldrs Cd, [Rn]      cfldrd Cd, [Rn]      cfldr32 Cd, [Rn]
cfldr64 Cd, [Rn]     cfstrs Cd, [Rn]      cfstrd Cd, [Rn]
cfstr32 Cd, [Rn]     cfstr64 Cd, [Rn]     cfmvsr Cn, Rd
cfmvdlr Cn, Rd       cfmvdhr Cn, Rd       cfmv64lr Cn, Rd
cfmv64hr Cn, Rd      cfmvrs Rd, Cn        cfmvrdl Rd, Cn
cfmvrdh Rd, Cn       cfmvr64l Rd, Cn      cfmvr64h Rd, Cn
cfmval32 Cd, Cn      cfmvam32 Cd, Cn      cfmv32a Cd, Cn
cfmv64a Cd, Cn       cfmvsc32 Cd, Cn      cfmv32sc Cd, Cn
cfcpys Cd, Cn        cfcpyd Cd, Cn

Table 2. MaverickCrunch Data Manipulation Mnemonics

cfcvtsd Cd, Cn cfcvtds Cd, Cn cfcmp64 Rd, Cn, Cm
cfcvt32d Cd, Cn cfcvt64s Cd, Cn cfcvt32s Cd, Cn
cfcvts32 Cd, Cn cfcvtd32 Cd, Cn cfcvt64d Cd, Cn
cfrshl32 Cm, Cn, Rd cftruncs32 Cd, Cn cftruncd32 Cd, Cn
cfsh64 Cd, Cn, <imm> cfrshl64 Cm, Cn, Rd cfsh32 Cd, Cn, <imm>
cfcmp32 Rd, Cn, Cm cfcmps Rd, Cn, Cm cfcmpd Rd, Cn, Cm

Table 3. MaverickCrunch Arithmetic Mnemonics

cfabss Cd, Cn            cfnegs Cd, Cn            cfadds Cd, Cn, Cm
cfsubs Cd, Cn, Cm        cfnegd Cd, Cn            cfaddd Cd, Cn, Cm
cfsubd Cd, Cn, Cm        cfmuld Cd, Cn, Cm        cfabs32 Cd, Cn
cfadd64 Cd, Cn, Cm       cfneg32 Cd, Cn           cfadd32 Cd, Cn, Cm
cfsub32 Cd, Cn, Cm       cfmul32 Cd, Cn, Cm       cfmac32 Cd, Cn, Cm
cfmsc32 Cd, Cn, Cm       cfabs64 Cd, Cn           cfneg64 Cd, Cn
cfsub64 Cd, Cn, Cm       cfmul64 Cd, Cn, Cm       cfmadd32 Ca, Cd, Cn, Cm
cfmsub32 Ca, Cd, Cn, Cm  cfmadda32 Ca, Cd, Cn, Cm cfmsuba32 Ca, Cd, Cn, Cm
cfmuls Cd, Cn, Cm        cfabsd Cd, Cn

2.3 Architecture

The MaverickCrunch coprocessor uses the standard ARM coprocessor interface, sharing its memory interface and instruction stream. The MaverickCrunch coprocessor is pipelined, has data forwarding capabilities, and can run synchronously or asynchronously with respect to the ARM920T pipeline.
There are two separate pipelines in the MaverickCrunch coprocessor (see Figure 1). The first pipeline, five stages long, is used for LDC, STC, MCR, and MRC instructions. Its stages are Fetch (F), Decode (D), Execute (E), Memory Access (M), and Register Write-Back (W). The second pipeline, seven stages long, is used for the CDP instructions. Its stages are Fetch (F), Decode (D), Execute/Operand Fetch (E), Execute (E1), Execute (E2), Execute (E3), and Register Write-Back (W).
The MaverickCrunch LDC/STC/MCR/MRC pipeline is identical to, and 'follows', the ARM920T's pipeline. That is, the contents of the LDC/STC/MCR/MRC pipeline are identical to the contents of the ARM920T pipeline.
Note: The ARM pipeline is not shown in Figure 1, but is identical to the LDC/STC/MCR/MRC
pipeline.
The MaverickCrunch CDP pipeline is nearly twice as deep and runs at half the speed of the ARM920T's pipeline. The CDP pipeline may run asynchronously with respect to the ARM920T's pipeline after the initial execution stage; specifically, the CDP pipeline may run asynchronously in the E1, E2, E3 and W stages. Running MaverickCrunch in synchronous mode forces the CDP pipeline to serialize the instruction stream, resulting in an eight-cycle stall per data path instruction. The CDP pipeline's asynchronous-capable stages are shaded in Figure 1.
Figure 1. MaverickCrunch Pipelines

[Figure: both pipelines are clocked by ARM MCLK.
LDC/STC/MCR/MRC pipeline stages: F D E M W
CDP pipeline stages: F D E E1 E2 E3 W]

3. Code Optimization for MaverickCrunch

This section describes guidelines for writing optimized code for the MaverickCrunch coprocessor. These guidelines are divided into algorithm, compiler and architecture sections. It is assumed that the correct algorithm has been chosen, and that all non-hardware-specific optimizations have been completed. However, optimization should not begin until all of the code has been written and tested for functionality.

3.1 Algorithms

This section focuses on methods to reduce algorithm execution time. After the code's functionality is verified, profile and disassemble the objects. Look for and optimize the following:
- Sections of code that are executed most frequently
- Sections of code that take the most CPU cycles to execute
- Inefficiencies in assembly code from the compilation
When optimizing these sections, keep in mind the following general concepts of code optimization:
- Avoid Redundancy - store computations rather than recomputing them
- Serialize Code - code should be designed with a minimum amount of branching. Code branching is expensive; the ARM920T does not support branch prediction
- Code Locality - code executed closely together in time should be placed closely together in memory, increasing spatial locality of reference and reducing expensive cache misses
Code density is not always an indicator of optimized code unless small code size is your goal. Loop unrolling is an optimization technique that generally increases code size, but also increases code speed: unrolled loops iterate fewer times than their unoptimized versions, resulting in fewer index calculations, comparisons and branches taken.
Note: Taken branches are expensive operations, as they take three cycles to complete and cause
the pipeline to be flushed. (There is no branch prediction in the ARM920T.)
Induction Variable Analysis is another speed optimization technique, used in the case where a variable in a loop is a function of the loop index. This variable can be updated each time the index is updated, reducing the number of calculations in the loop.