This application note is intended to assist developers in optimizing their source code for use with the MaverickCrunch coprocessor. This document begins with a brief overview of the MaverickCrunch coprocessor, followed by optimization guidelines, and concludes with an example applying the guidelines discussed.
Multiple facets of code optimization must be considered in order to realize the full benefit of the MaverickCrunch coprocessor. The guidelines in this document are categorized as algorithm, compiler, or hardware optimizations. The discussion on algorithm optimization centers on high-level programming details such as compound expressions and loop unrolling. Next, the compiler optimization guidelines deal with the effects of compiler optimization on code performance, primarily code size and execution speed. Finally, the hardware optimization section enumerates optimization guidelines related to the MaverickCrunch coprocessor implementation, such as IEEE-754 implementation and pipeline stalls.
Note: Algorithm selection will not be discussed in this application note. It is assumed that the developer has selected and implemented the correct algorithm for their application.
2. MaverickCrunch
This section introduces and summarizes the features, instruction set, and architecture of the MaverickCrunch coprocessor. For further in-depth information on these topics, please read Chapter 3 of the User's Guide.
2.1 Features
The MaverickCrunch coprocessor accelerates IEEE-754 floating point arithmetic and 32-bit and 64-bit fixed point arithmetic. The MaverickCrunch coprocessor is an excellent candidate for encoding and decoding digital audio, digital signal processing (such as IIR, FIR, FFT), and numeric approximations. Key features of the MaverickCrunch include:
-IEEE-754 based single and double precision floating point support
-Full IEEE-754 rounding support
-Inexact, Overflow, Underflow, and Invalid Operator IEEE-754 exceptions
-32/64-bit fixed point integer operations
-Add, multiply, and compare functions for all data types
-Fixed point integer MAC: 32-bit input with 72-bit accumulate
-Fixed point integer shifts
http://www.cirrus.com
Copyright Cirrus Logic, Inc. 2004
(All Rights Reserved)
JAN ‘04
AN253REV1
-Conversion between floating point and integer data representations
-Sixteen (16) 64-bit general-purpose registers
-Four (4) 72-bit accumulators
-Status and control registers
2.2 Instruction Set
The MaverickCrunch coprocessor's instruction set is robust and includes memory, control, and arithmetic operations. MaverickCrunch mnemonics are translated by the compiler or assembler into ARM coprocessor instructions. For example, the MaverickCrunch mnemonic for double precision floating-point multiply is:
cfmuld c0, c1, c2
The equivalent ARM coprocessor instruction is:
cdp p4, 1, c0, c1, c2, 1
There are five categories of ARM coprocessor instructions: Data Path (CDP), Load (LDC), Store (STC), ARM to Coprocessor Moves (MCR), and Coprocessor to ARM Moves (MRC). CDP instructions include all arithmetic operations and any other operation internal to the coprocessor. LDC and STC instructions include the set of operations responsible for moving data between memory and the coprocessor. MCR and MRC instructions are responsible for moving data between ARM and coprocessor registers.
Table 1, Table 2, and Table 3 summarize all of the MaverickCrunch's instruction mnemonics. For more information on the MaverickCrunch instruction set, please see the table "MaverickCrunch Instruction Set" in the User's Guide.
Table 1. MaverickCrunch Load/Store Mnemonics
cfldrs Cd, [Rn]      cfldrd Cd, [Rn]      cfldr32 Cd, [Rn]
cfldr64 Cd, [Rn]     cfstrs Cd, [Rn]      cfstrd Cd, [Rn]
cfstr32 Cd, [Rn]     cfstr64 Cd, [Rn]     cfmvsr Cn, Rd
cfmvdlr Cn, Rd       cfmvdhr Cn, Rd       cfmv64lr Cn, Rd
cfmv64hr Cn, Rd      cfmvrs Rd, Cn        cfmvrdl Rd, Cn
cfmvrdh Rd, Cn       cfmvr64l Rd, Cn      cfmvr64h Rd, Cn
cfmval32 Cd, Cn      cfmvam32 Cd, Cn      cfmv32a Cd, Cn
cfmv64a Cd, Cn       cfmvsc32 Cd, Cn      cfmv32sc Cd, Cn
cfcpys Cd, Cn        cfcpyd Cd, Cn
Table 2. MaverickCrunch Data Manipulation Mnemonics
cfcvtsd Cd, Cn       cfcvtds Cd, Cn       cfcmp64 Rd, Cn, Cm
cfcvt32d Cd, Cn      cfcvt64s Cd, Cn      cfcvt32s Cd, Cn
cfcvts32 Cd, Cn      cfcvtd32 Cd, Cn      cfcvt64d Cd, Cn
cfrshl32 Cm, Cn, Rd  cftruncs32 Cd, Cn    cftruncd32 Cd, Cn
cfcmp32 Rd, Cn, Cm   cfcmps Rd, Cn, Cm    cfcmpd Rd, Cn, Cm
Table 3. MaverickCrunch Arithmetic Mnemonics
cfabss Cd, Cn           cfnegs Cd, Cn            cfadds Cd, Cn, Cm
cfsubs Cd, Cn, Cm       cfnegd Cd, Cn            cfaddd Cd, Cn, Cm
cfsubd Cd, Cn, Cm       cfmuld Cd, Cn, Cm        cfabs32 Cd, Cn
cfadd64 Cd, Cn, Cm      cfneg32 Cd, Cn           cfadd32 Cd, Cn, Cm
cfsub32 Cd, Cn, Cm      cfmul32 Cd, Cn, Cm       cfmac32 Cd, Cn, Cm
cfmsc32 Cd, Cn, Cm      cfabs64 Cd, Cn           cfneg64 Cd, Cn
cfsub64 Cd, Cn, Cm      cfmul64 Cd, Cn, Cm       cfmadd32 Ca, Cd, Cn, Cm
cfmsub32 Ca, Cd, Cn, Cm cfmadda32 Ca, Cd, Cn, Cm cfmsuba32 Ca, Cd, Cn, Cm
cfmuls Cd, Cn, Cm       cfabsd Cd, Cn
2.3 Architecture
The MaverickCrunch coprocessor uses the standard ARM coprocessor interface, sharing its memory interface and instruction stream. The MaverickCrunch coprocessor is pipelined, has data forwarding capabilities, and can run synchronously or asynchronously with respect to the ARM920T pipeline.
There are two separate pipelines in the MaverickCrunch coprocessor (see Figure 1). The first pipeline, five stages long, is used for LDC, STC, MCR, and MRC instructions. Its stages are Fetch (F), Decode (D), Execute (E), Memory Access (M), and Register Write-Back (W). The second pipeline, seven stages long, is used for the CDP instructions. Its stages are Fetch (F), Decode (D), Execute/Operand Fetch (E), Execute (E1), Execute (E2), Execute (E3), and Register Write-Back (W).
The MaverickCrunch LDC/STC/MCR/MRC pipeline is identical to, and follows, the ARM920T's pipeline. That is, the contents of the LDC/STC/MCR/MRC pipeline are identical to the contents of the ARM920T pipeline.
Note: The ARM pipeline is not shown in Figure 1, but is identical to the LDC/STC/MCR/MRC
pipeline.
The MaverickCrunch CDP pipeline is nearly twice as deep as, and runs at half the speed of, the ARM920T's pipeline. The CDP pipeline may run asynchronously with respect to the ARM920T's pipeline after the initial execution stage. Specifically, the CDP pipeline may run asynchronously in the E1, E2, E3, and W stages. Running MaverickCrunch in synchronous mode forces the CDP pipeline to serialize the instruction stream, resulting in an eight-cycle stall per data path instruction. The CDP pipeline's asynchronous-capable stages are shaded in Figure 1.
Figure 1. MaverickCrunch Pipelines
[Figure: the LDC/STC/MCR/MRC pipeline has stages F, D, E, M, W; the CDP pipeline has stages F, D, E, E1, E2, E3, W; both are clocked from ARM MCLK.]
3. Code Optimization for MaverickCrunch
This section describes guidelines for writing optimized code for the MaverickCrunch coprocessor. These guidelines are divided into algorithm, compiler, and architecture sections. It is assumed that the correct algorithm has been chosen, and that all non-hardware-specific optimizations have been completed. However, optimization should not begin until all of the code has been written and tested for functionality.
3.1 Algorithms
This section focuses on methods to reduce algorithm execution time. After the code's functionality is verified, profile and disassemble the objects. Look for and optimize the following:
-Sections of code that are executed most frequently
-Sections of code that take the most CPU cycles to execute
-Inefficiencies in assembly code from the compilation
When optimizing these sections, keep in mind the following general concepts of code optimization:
-Avoid Redundancy - store computations rather than recomputing them
-Serialize Code - code should be designed with a minimum amount of branching. Code branching is expensive; the ARM920T does not support branch prediction
-Code Locality - code executed closely together in time should be placed closely together in memory, increasing spatial locality of reference and reducing expensive cache misses
Unless the goal is to create small code, code density is not always an indicator of code optimization. Loop unrolling is an optimization technique that generally increases code size, but also increases code speed. This is because unrolled loops iterate fewer times than their unoptimized versions, resulting in fewer index calculations, comparisons, and branches taken.
Note: Taken branches are expensive operations, as they take three cycles to complete and cause the pipeline to be flushed. (There is no branch prediction in the ARM920T.)
Induction Variable Analysis is another speed optimization technique, used in the case where a variable in a loop is a function of the loop index. This variable can be updated each time the index is updated, reducing the number of calculations in the loop.