
Application Report

SPRAA56 – September 2004

DSP/BIOS Real-Time Analysis (RTA) and Debugging Applied to a Video Application

Brian Jeff

DSP Field Software Applications

Arnie Reynoso

Software Development Systems

ABSTRACT

DSP/BIOS and the Reference Frameworks allow developers to non-intrusively instrument real-time applications. The software provided with this application note applies real-time analysis (RTA) services to a working application: an H.263 encode/decode loopback example for the TMS320DM642 evaluation module. The software demonstrates techniques for benchmarking and controlling video software. It also introduces a service to programmatically measure CPU and TSK loading. Debugging and troubleshooting techniques for real-time applications, using Code Composer Studio, are also discussed.

 

 

Contents

1 Important Benchmarks for Video Applications
2 Base Application Overview
    2.1 DSP/BIOS and RF5 Components Used
    2.2 Requirements for Viewing RTA Benchmarks
3 Modifications to the Base Example
    3.1 Splitting the Encode and Decode CELLs
    3.2 Adding the Control TSK and MBX Communication
    3.3 Querying the H.263 Encoder for Status
    3.4 Controlling the Frame Rate
4 RTA Techniques for Performance Measurement
    4.1 Measuring Function Execution Time with the UTL Module
    4.2 Measuring Task Scheduling Latencies
    4.3 Measuring End-to-End Latencies
    4.4 Measuring the Frame Rate
    4.5 Simulating High CPU Load Stress Conditions with Dummy NOP Loads
    4.6 Programmatic Measurement of Total CPU Load
    4.7 Memory Bus Utilization
    4.8 Bitrate and Frame Type
    4.9 Methods for Transmitting Measured Performance Data
    4.10 Application-Specific Control via GEL Scripts in CCStudio
5 Viewing Benchmarks in the Instrumented Application
    5.1 Requirements
    5.2 Running the Application
    5.3 Interpreting the Benchmarks
    5.4 Controlling the Run-Time Parameters Dynamically
6 References
Appendix A. Performance Impact
    A.1 Overhead of Performance Measurement Techniques
    A.2 RTA Effects on CPU Load
    A.3 Memory Footprint

Figures

Figure 1. Basic Data Flow of the Video Application
Figure 2. Detailed Application Data Flow Showing Memory Buffers
Figure 3. Task Partitioning in the Modified Application
Figure 4. CPU Load Measurement at Run-Time
Figure 5. External Internal Memory Transfers, YUV4:2:0 to 4:2:2 Conversion Function
Figure 6. Workspace Including RTA Windows
Figure 7. Statistics View Showing Benchmark Measurements

1 Important Benchmarks for Video Applications

Diverse video applications often require similar benchmarks to quantify their performance. Some of the most commonly needed benchmarks are as follows:

Frame rate

Resolution

End-to-end latency

Processor utilization

Bitrate*

Quantization factor*

Frame type*

Group-of-pictures (GOP) structure*

Items marked with an asterisk are of importance in applications where encoders or decoders are involved. This application note provides a method for measuring many of these benchmarks during the capture, processing, and display phases of the example video application.

Frame rate is the rate at which frames are captured, processed, and displayed. The capture, process, and display frame rates can differ by design or under overloaded conditions where frames are dropped. Therefore, it is important to measure all three frame rates separately.

Resolution is the size in pixels of the capture, processing, and display. Resolution is typically static at run-time, so it is not usually benchmarked with real-time tools. However, it is important to know the capture, processing, and display resolutions of the system design. For example, the H.263 loopback application used in this application note captures and displays video in D1 resolution and processes in 4CIF resolution.

End-to-end latency is a measurement of the time between the capture of a video frame in real time and the display of that same video frame some number of milliseconds later.

Processor utilization is the percentage of DSP resources used by an algorithm. In video applications, the significant benchmarks of processor utilization include not only the number of CPU cycles used, but also the memory bus utilization since such large amounts of data must be moved from external memory to L2 and back repeatedly.

Bitrate is the number of bits per second output by a video encoder, or delivered to a video decoder. Higher bitrates are generally associated with higher quality video. The bitrate often varies with the complexity and motion in a video source, so it is important to measure bitrate dynamically in video applications.


Quantization is the process of dividing a continuous range of input values into a finite number of subranges. Each subrange is assigned a specific output value. The Q factor, or quantization factor, describes the level of quantization used to store the frequency domain representation of the encoded image. Q factor often varies dynamically in an encoder when a constant bitrate is targeted, so it is useful to display the Q factor dynamically with the video stream.

Frame type designates whether a particular frame was encoded independently (I frame) or whether it depends upon previous frames (P) or both previous and future frames (B). Frame type is a useful benchmark when shown in real time. Note that P and B frame types are relevant for H.263, MPEG-4, MPEG-2, and similar compression standards. They are not relevant for JPEG or uncompressed video applications.

Group-of-pictures (GOP) structure is the sequence of frame types (I, P, and B) produced by the encoder. Common structure lengths are 12 and 15 frames; for example, a 15-frame GOP might be IBBPBBPBBPBBPBB. If the video stream does not change greatly from frame to frame, P frames may be about 10% the size of I frames, and B frames may be about 2% the size of I frames.
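The rule-of-thumb frame sizes above can be turned into a quick estimate of how much a GOP costs relative to a single I frame. The following is a minimal sketch, not part of the application software; the function name and the REL_SIZE_* constants are hypothetical, and the 10% and 2% figures are simply the approximate ratios quoted in the text.

```c
#include <stddef.h>

/* Relative frame sizes, in percent of one I frame. These are the
 * approximate ratios quoted in the text, not measured values. */
#define REL_SIZE_I 100u
#define REL_SIZE_P 10u
#define REL_SIZE_B 2u

/* Hypothetical helper: returns the total size of a GOP, expressed in
 * percent of one I frame. The GOP is given as a string of frame types
 * such as "IBBP"; characters other than I, P, and B are ignored. */
unsigned gop_relative_size(const char *gop)
{
    unsigned total = 0;
    size_t i;

    for (i = 0; gop[i] != '\0'; i++) {
        switch (gop[i]) {
        case 'I': total += REL_SIZE_I; break;
        case 'P': total += REL_SIZE_P; break;
        case 'B': total += REL_SIZE_B; break;
        default:  break;
        }
    }
    return total;
}
```

For the 15-frame structure IBBPBBPBBPBBPBB (1 I, 4 P, and 10 B frames), the estimate is 100 + 40 + 20 = 160, that is, roughly 1.6 I frames of data for 15 frames of video under these assumed ratios.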

2 Base Application Overview

The base "h263_loopback" example used to create the application described here is a video application supplied with the TMS320DM642 evaluation module board support package. After you install the board support package, the source code and included object libraries for the base example are in the <CCS_install_dir>\boards\evmdm642\examples\video\h263_loopback directory.

The H.263 loopback example was chosen because it integrates the following pieces of eXpressDSP software in a working video system:

xDAIS-compliant algorithms

eXpressDSP-compliant video device drivers from the device driver kit (DDK)

DSP/BIOS real-time kernel for scheduling

Chip Support Library (CSL) for low level device function calls

Reference Framework Level 5 (RF5) as a software / scheduling foundation

This example could be used as the basis for any video design that uses an xDAIS-compliant codec. It could be modified to support networking or streaming input/output by following the video networking examples provided with the EVM's board support package.

While some real-time analysis tools are enabled in the base example, this note describes a more comprehensive set of tools for real-time analysis, benchmarking, and debugging. This set of tools can be used with any video application that has a similar DSP/BIOS-based foundation.

The design of the base example is described in detail in H.263 Loopback on the DM642 EVM (SPRA933), but a brief description of the design and components used is provided here for reference.


Figure 1 shows a simplified view of the sequential flow of capture, processing, and display tasks in the application.

[Figure: Camera, input device driver, tskInput (TSK), tskVideoProcess (TSK), tskOutput (TSK), and output device driver in sequence, with SCOM messaging between the tasks]

Figure 1. Basic Data Flow of the Video Application

Before video data reaches the first stage, it must be converted to digital data, a process that is managed by the input device driver. Analog video input is converted by an on-board NTSC decoder chip into a digital bitstream compliant with the BT.656 format with embedded synchronization. The decoder chip sends the bitstream to the TMS320DM642 DSP's video port. A device driver, implementing the IOM interface recommended in the DSP/BIOS Driver Developer's Guide (SPRU616), is used to manage the initialization and synchronization of the EDMA channel, the video port, and the NTSC decoder used for video capture.

In Figure 1, TSK refers to a DSP/BIOS task, which is described in detail in the DSP/BIOS User's Guide and the DSP/BIOS API Reference. Tasks support blocking calls, which are used to synchronize the application and the video data stream. The main data flow then has three stages: capture, processing, and display. Each stage has its own task object.

The example's first stage is a task called tskInput, which runs the tskVideoInput function. The task receives digital video buffers from the device driver. It then converts the buffers to the 4:2:0 format from the 4:2:2 formatted data it receives from the driver.
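To make the effect of the 4:2:2 to 4:2:0 conversion concrete, the buffer sizes for one frame can be computed from the resolution. The following is a minimal sketch under the assumption of 8 bits per sample and even frame dimensions; the function names are hypothetical and are not part of the example software.

```c
#include <stddef.h>

/* All helpers assume 8 bits (1 byte) per sample. */

/* Luma (Y) plane: one sample per pixel. */
size_t luma_bytes(size_t width, size_t height)
{
    return width * height;
}

/* 4:2:2: chroma is halved horizontally only, so Cb and Cr together
 * are as large as the Y plane. */
size_t yuv422_frame_bytes(size_t width, size_t height)
{
    return 2 * width * height;
}

/* 4:2:0: chroma is halved horizontally and vertically, so Cb and Cr
 * are each one quarter the size of the Y plane. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    size_t chroma_plane = (width / 2) * (height / 2);
    return width * height + 2 * chroma_plane;
}
```

For a 720x576 frame, luma_bytes gives 414,720 bytes, which is consistent with the 414 KB luma buffers labeled in Figure 2. Under the same assumptions a full 4:2:2 frame occupies 829,440 bytes and a full 4:2:0 frame occupies 622,080 bytes, so the conversion reduces the data handed to the encoder by one quarter.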

The next stage is the tskVideoProcess task, which runs the tskProcess function. The task includes algorithms that require input data in the 4:2:0 format. The tskInput task sends a message to the tskVideoProcess task with pointers to the newly formatted data buffers. The tskVideoProcess task then calls an xDAIS-compliant H.263 encoder algorithm to compress the data, which is stored in an intermediate buffer. A second xDAIS-compliant algorithm, an H.263 decoder, is called to decode the data in the buffer.

The tskOutput task runs the tskVideoOutput function. It converts the data back to 4:2:2 format as required by the output driver and the NTSC encoder chip, and calls the driver with the data buffer for display. The output driver is also an eXpressDSP-compliant device driver with the same API as the input driver.

Data is passed between the tasks using SCOM messaging objects to pass the pointers and synchronization semaphore required to ready the output task. The SCOM module is from Reference Framework 5, which is described in the next section.


2.1 DSP/BIOS and RF5 Components Used

The base application leverages various DSP/BIOS real-time analysis components to support debugging capabilities that are not intrusive to the system performance. The following three modules are included with the core DSP/BIOS library, and can be used in any application that uses DSP/BIOS and on any TI DSP supported by DSP/BIOS:

LOG Logging events

STS Statistics accumulators

TRC Control of real-time capture

In addition to these DSP/BIOS components, the application also uses the UTL module for debugging and diagnostics. This module is provided in the Reference Frameworks distribution. The UTL module is described in more detail in Reference Frameworks for eXpressDSP Software: API Reference (SPRA147).

In addition to modules used for real-time analysis and debugging, the base application uses the following DSP/BIOS and Reference Frameworks (RF) modules.

MBX Mailbox software module for inter-task communication (DSP/BIOS)

TSK Task scheduling module (DSP/BIOS)

SCOM Synchronization and pointer-passing mechanism for data flow between TSKs (RF)

CHAN Instantiates and serially executes xDAIS-compliant algorithms (RF)

CELL Container for xDAIS algorithms in a CHAN (RF)

ALGRF Encapsulates the procedure for xDAIS algorithm instantiation (RF)

The following module provides an interface to the video port device driver, and is described in The TMS320DM642 Video Port Mini-Driver (SPRA918).

FVID Frame Video APIs for communicating with video port device drivers

A brief description of the DSP/BIOS and RF5 modules used extensively in benchmarking the application is given in the following subsections.

2.1.1 LOG

The LOG module captures events in real time while the target program executes. You can use the system log (LOG_system) or create user-defined logs, such as myTrace. Log buffers are of a fixed size and reside in data memory. Individual messages use four words of storage in the log's buffer. The first word holds a sequence number that allows the Event Log to display logs in the correct order. The remaining three words contain data specified by the call that writes the message to the log. The LOG module is much less intrusive to a running system (both in MIPS and memory) than the RTS printf function, while providing a similar capability.
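The four-word record layout described above can be illustrated with a small host-side model. This is only a sketch of the idea, not the DSP/BIOS implementation: the type and function names are hypothetical, the buffer length is arbitrary, and the model assumes a circular log in which new records overwrite the oldest ones.

```c
#include <stdint.h>
#include <string.h>

#define LOG_MODEL_LEN 4u   /* number of records in the buffer (arbitrary) */

/* One log record: four 32-bit words, as described in the text. */
typedef struct {
    uint32_t seqnum;       /* word 1: sequence number used for ordering */
    uint32_t val[3];       /* words 2-4: data supplied by the caller    */
} LogRecord;

/* Hypothetical fixed-size log held in data memory. */
typedef struct {
    LogRecord buf[LOG_MODEL_LEN];
    uint32_t  nextSeq;     /* sequence number of the next record */
} LogModel;

void log_model_init(LogModel *log)
{
    memset(log, 0, sizeof(*log));
}

/* Store one record; when the buffer is full the oldest record is
 * overwritten (circular behavior is an assumption of this model). */
void log_model_event(LogModel *log, uint32_t a, uint32_t b, uint32_t c)
{
    LogRecord *r = &log->buf[log->nextSeq % LOG_MODEL_LEN];

    r->seqnum = log->nextSeq;
    r->val[0] = a;
    r->val[1] = b;
    r->val[2] = c;
    log->nextSeq++;
}
```

Because each record is a fixed four words and writing one takes a handful of stores, the cost per event is small and constant, which is the property that makes this style of logging far less intrusive than formatting text on the target.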


2.1.2 STS

An STS object accumulates the following statistical information about an arbitrary 32-bit wide data series: count, total, and maximum.

Statistics are accumulated in 32-bit variables on the target DSP and in 64-bit variables on the host PC. When the host polls the target for real-time statistics, it resets the variables on the target. This minimizes space requirements on the target, while allowing you to keep statistics for long test runs.

As part of using the DSP/BIOS instrumented kernel, the application automatically acquires STS information for HWI, PIP, PRD, SWI, and TSK objects. To use this built-in feature on TSKs, the application must call the TSK_settime and TSK_deltatime APIs to obtain STS information.

Custom STS objects can also be created in the DSP/BIOS configuration. By using the STS APIs for the created objects, you can determine what statistical information needs to be acquired by the system application during run-time.
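The division of labor between the 32-bit accumulators on the target and the 64-bit accumulators on the host can be modeled in a few lines. The following is a minimal sketch under stated assumptions, not the DSP/BIOS implementation: all type and function names are hypothetical, and the model assumes that a poll resets the target maximum along with the count and total.

```c
#include <stdint.h>

/* Hypothetical target-side accumulator: 32-bit count, total, maximum. */
typedef struct {
    int32_t num;
    int32_t acc;
    int32_t max;
} TargetSts;

/* Hypothetical host-side accumulator: the same fields, 64 bits wide. */
typedef struct {
    int64_t num;
    int64_t acc;
    int64_t max;
} HostSts;

void sts_model_init(TargetSts *t, HostSts *h)
{
    t->num = 0; t->acc = 0; t->max = INT32_MIN;
    h->num = 0; h->acc = 0; h->max = INT64_MIN;
}

/* Target side: fold one value into the data series. */
void sts_model_add(TargetSts *t, int32_t value)
{
    t->num += 1;
    t->acc += value;
    if (value > t->max) {
        t->max = value;
    }
}

/* Host side: merge the target values into the 64-bit totals, then
 * reset the target so its 32-bit variables start over. */
void sts_model_poll(HostSts *h, TargetSts *t)
{
    h->num += t->num;
    h->acc += t->acc;
    if (t->max > h->max) {
        h->max = t->max;
    }
    t->num = 0; t->acc = 0; t->max = INT32_MIN;
}
```

Since the target variables are cleared at every poll, they only need to hold the values accumulated between two polls, while the host totals can keep growing over a long test run.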

2.1.3 TRC

The TRC module manages a set of trace control bits that control the real-time capture of program information through event logs and statistics accumulators. For greater efficiency, the target does not execute log or statistics APIs unless tracing is enabled.

This module contains two user-defined TRC flags that can be toggled using the DSP/BIOS RTA Control Panel in Code Composer Studio. The application can use these bits to enable or disable sets of explicit instrumentation. The program can use the TRC_query API to check the settings of these bits and either perform or omit instrumentation calls based on the result. DSP/BIOS does not use or set these bits.
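The pattern of gating explicit instrumentation on a user trace bit can be sketched with a host-side model. Everything in this block is hypothetical (the TRACE_* masks and trace_* functions are stand-ins, not DSP/BIOS APIs). The convention that a query returns zero when every requested bit is enabled is an assumption modeled on our reading of TRC_query; check the DSP/BIOS API Reference before relying on it.

```c
/* Hypothetical stand-ins for the two user-defined trace bits. */
#define TRACE_USER0 0x1u
#define TRACE_USER1 0x2u

static unsigned trace_enabled = 0;   /* bits currently switched on       */
static unsigned trace_samples = 0;   /* instrumentation events collected */

void trace_enable(unsigned mask)  { trace_enabled |= mask; }
void trace_disable(unsigned mask) { trace_enabled &= ~mask; }

/* Returns 0 if every bit in mask is enabled; otherwise returns the
 * bits of mask that are disabled (assumed convention, see above). */
unsigned trace_query(unsigned mask)
{
    return mask & ~trace_enabled;
}

/* Application code: the instrumentation is skipped unless the
 * user bit has been switched on. */
void instrumented_step(void)
{
    if (trace_query(TRACE_USER0) == 0) {
        trace_samples++;             /* placeholder for a log or statistics call */
    }
    /* ... normal processing would go here ... */
}

unsigned trace_sample_count(void)
{
    return trace_samples;
}
```

With this structure, a set of benchmarks can be left compiled into the application and switched on only while they are being observed, so that they cost a single test per call when disabled.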

2.1.4 UTL

UTL is part of the Reference Frameworks distribution. The UTL module is used for debugging and diagnostics.

The module is essentially a set of macros that can either be expanded to code that performs the desired debugging function, or removed completely when building depending on the value of the UTL_DBGLEVEL preprocessor flag. The UTL module encapsulates DSP/BIOS services such as CLK, STS, and LOG in APIs. These services can be easily removed in the final build by using the preprocessor flag, -d “UTL_DBGLEVEL=0”.

With conditional expansion of macros to code you can reduce code size and remove unnecessary functionality in the deployment phase without having to remove development debugging/diagnostics aids. This technique also means you don't need to modify code at deployment time, thus reducing the possibility of error.
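The conditional-expansion technique can be illustrated with a short, self-contained sketch. The names APP_DBGLEVEL and APP_DBG_EVENT are hypothetical and merely play the role that UTL_DBGLEVEL and the UTL macros play in the Reference Frameworks; the actual UTL macros are documented in SPRA147.

```c
/* Hypothetical debug-level flag; it can be overridden on the compiler
 * command line in the same way as UTL_DBGLEVEL. */
#ifndef APP_DBGLEVEL
#define APP_DBGLEVEL 1
#endif

static unsigned app_dbg_events = 0;  /* diagnostic counter */

#if APP_DBGLEVEL > 0
/* Development build: the macro expands to real diagnostic code. */
#define APP_DBG_EVENT()  ((void)(app_dbg_events++))
#else
/* Deployment build: the macro expands to nothing that generates code. */
#define APP_DBG_EVENT()  ((void)0)
#endif

/* Application code is identical in both builds. */
int process_frame(int frame)
{
    APP_DBG_EVENT();
    return frame * 2;                /* placeholder for real processing */
}

unsigned app_dbg_event_count(void)
{
    return app_dbg_events;
}
```

Because process_frame itself never changes, switching between a development build and a deployment build only requires changing the value of the flag, not editing the source.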


2.2 Requirements for Viewing RTA Benchmarks

In order for any of the DSP/BIOS-based RTA tools to be visible, the DSP/BIOS components in Code Composer Studio version 2.30 or earlier and version 3.0 require that the application's .cdb configuration file be accessible and consistent with the executable .out file.

This requirement is easily met during development. It can also be satisfied in demonstrations or delivered test examples. If you do not want to deliver source code with the application for external testing or demonstration, you can still enable all the RTA tools by providing a current DSP/BIOS configuration .cdb file along with the executable .out file to be tested. The tester will be able to view the CPU load, individual thread statistics, and other important benchmark details described in the sections to follow.

The RTA tools can be used in stop mode or real-time mode. In the GBL module of the DSP/BIOS configuration, you can enable or disable real-time analysis. If you disable real-time analysis, the three RTA functions in the IDL background loop are removed. Those functions normally move RTA data from buffers on the DSP to the host PC and calculate the CPU load for the load graph.

When RTA is disabled, the Message Log, Statistics View, Execution Graph, and other RTA windows are updated only when the DSP is halted. An update displays the most recent contents of their respective buffers. This stop mode of RTA offers a good compromise when some visibility is required, but the additional code and background function calls are undesirable. Stop mode can also occur if RTA is enabled but the CPU is so heavily loaded that it never runs the IDL background loop long enough to provide real-time updates. In either case of stop-mode operation, the CPU Load Graph is not updated. However, the programmatic method for CPU load measurement discussed later in this application note provides a useful working alternative.

The next section describes structural modifications made to the application to make it more suitable for benchmarking and further development.

3 Modifications to the Base Example

The application associated with this document has very few structural changes from the base application shipped with the TMS320DM642 evaluation module. Some variables have been renamed for readability, the encoder and decoder have been separated, and an additional task has been added for application control. The data flow in the application has not been modified.

The steps to convert the base example to the modified example are provided in a readme file in the directory that contains the source code.

Figure 2 shows a more detailed look at the data flow in the modified H.263 loopback example:


[Figure: input device driver buffer (YUV, 3 frames); 4:2:2-to-4:2:0 conversion using scratch1 (14 KB = 20 lines); YAfter420 (720x576, 414 KB), CbAfter420, and CrAfter420 buffers; H.263 encoder; bitBuf (512 KB); H.263 decoder; Y (414 KB), Cb, and Cr buffers; format conversion using scratch2 (14 KB); output device driver buffer (YUV, 3 frames). Shared scratch and codec instance memory are also shown. Key: internal memory, external memory, DMA read/write (background), CPU read/write, DSP CPU function.]

Figure 2. Detailed Application Data Flow Showing Memory Buffers

Note: The dotted lines in Figure 2 indicate EDMA moves, and the solid lines indicate CPU reads/writes. The application performs only CPU reads/writes from mapped internal memory, relying on the EDMA to copy working data into internal scratch buffers.

3.1 Splitting the Encode and Decode CELLs

In the base example, the H.263 encoder and decoder are wrapped in sequential CELLs in a single channel. This is suitable for an example application, but in actual video systems the input to the decoder would be an encoded bitstream from an external source, and the output from the encoder would be sent to an external destination such as a network stream or a hard disk drive. Splitting the encoder and decoder into separate channels better supports external sourcing or transport of the encoded bitstream. Additionally, splitting the encoder and decoder allows them to be benchmarked separately for execution time.

A separate CHAN was created and initialized for the H.263 encoder and the H.263 decoder. At run-time, a separate CHAN_execute command can be executed for each channel.

3.2 Adding the Control TSK and MBX Communication

The second change to the base example was the addition of a control TSK to send control commands to the process TSK using the MBX module from DSP/BIOS. An MBX object, mbxProcess, was added in the DSP/BIOS text-based configuration file appThread.tci. That MBX object transmits control commands to the tskVideoProcess TSK to change run-time parameters such as the video frame rate and the encoder bitrate.


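    /* Forward a pending frame rate change to the processing task */
    if (controlVideoProc.frameRateChanged) {
        txMsg.cmd  = FRAMERATECHANGED;
        txMsg.arg1 = chanNum;
        txMsg.arg2 = controlVideoProc.frameRateTarget;
        controlVideoProc.frameRateChanged = FALSE;
        /* Timeout of 0: do not block if the mailbox is full */
        MBX_post(&mbxProcess, &txMsg, 0);
    }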

While implementing control via the host PC did not specifically require a separate task in the modified application, adding a discrete control task makes the application more scalable. For example, a user interface or communications link from another processor could send control commands to a DSP-based video system. The control task could then service that user interface or communications link. In the modified example, the control task simply monitors a global structure for commands, and sends appropriate commands to the processing task if necessary.

The priority of the control TSK is set to a lower level than that of the tskVideoProcess, tskInput, and tskOutput TSKs. This prevents the control task from adding latency or CPU overhead when responding to control commands. The control commands are only serviced at times when the three TSKs in the data stream are all in the blocked state and the processor would normally be running its background loop.

Figure 3 shows the task partitioning added to the application flow in Figure 2.

[Figure: the data flow and memory buffers of Figure 2, divided among the tskInput, tskProcess, and tskOutput tasks, with the added tskControl task]

Figure 3. Task Partitioning in the Modified Application

3.3 Querying the H.263 Encoder for Status

The third change made to the base application was the use of a run-time API call to query the algorithm as to its status after each frame. The eXpressDSP algorithm standard (xDAIS) states that algorithms should provide a control API such as the following.

    H263ENC_cellControl(&(chanHandle->cellSet[CELLH263ENC]),
                        IH263ENC_GETSTATUS,
                        (IALG_Status *)&encStatus);


This call returns a status structure of type IH263ENC_Status that contains the number of bits sent to the encoder, the frame type, and other data.

The features implemented in the control API can vary widely from one algorithm to another. The bitrate and frame type measured by this API may not be available with all third-party video algorithms unless specifically requested. Thus, it is important that the encoder and decoder algorithms used by your application have the necessary hooks to allow complete benchmarking of the end application.

3.4 Controlling the Frame Rate

The final structural change made to the base example was the addition of a mechanism for controlling the processing frame rate of the application. This change required the introduction of some counters and a conditional statement to measure the number of frames skipped during the last 30 frames. The conditional statement is shown here:

    if (DISPLAYRATE * (frameCnt - frameSkip) > frameCnt * frameRateTarget) {
        frameSkip++;
        // Tell the capture routine we're done
        SCOM_putMsg(fromProctoInput, &(thrProcess.scomMsgRx));
        continue;
    }

The condition keeps the ratio of frames shown to frames captured from exceeding the ratio of the target frame rate to the display frame rate. If the counters indicate that the ratio would be exceeded, then the current captured frame is not processed or displayed, prompting the display driver to re-display the most recent frame.
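The behavior of the skip condition can be checked with a small host-side simulation. This is a minimal sketch, not the application code: the function name is hypothetical, the SCOM call is omitted, and it assumes that frameCnt is incremented once per captured frame before the test and that the counters are not reset during the run.

```c
#define DISPLAYRATE 30u   /* NTSC display rate in frames per second */

/* Hypothetical simulation: applies the skip condition to totalFrames
 * captured frames and returns how many of them would be processed. */
unsigned simulate_frame_skip(unsigned totalFrames, unsigned frameRateTarget)
{
    unsigned frameCnt = 0;
    unsigned frameSkip = 0;
    unsigned i;

    for (i = 0; i < totalFrames; i++) {
        frameCnt++;       /* assumption: counted before the test */

        if (DISPLAYRATE * (frameCnt - frameSkip) > frameCnt * frameRateTarget) {
            frameSkip++;  /* the real code also releases the buffer via SCOM */
            continue;
        }

        /* the frame would be encoded, decoded, and displayed here */
    }
    return frameCnt - frameSkip;
}
```

Under these assumptions, a target of 15 frames per second causes every other captured frame to be skipped, a target of 10 processes every third frame, and a target equal to DISPLAYRATE never skips.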

The capture frame rate and display frame rate are left unchanged at DISPLAYRATE, which is set to 30 frames per second in NTSC applications or 25 frames per second in PAL applications. Because the capture driver is using external memory bandwidth to copy unused frames from the video port FIFO to external buffers, it may be desirable or necessary to control the frame rate at the driver to eliminate this overhead. The frame rate control allows you to quickly evaluate the visual quality of an encoder and decoder when using a lower frame rate.

The frame rate target can be controlled at run-time from a GEL script. Code Composer Studio's General Extension Language (GEL) provides a means for script-based control of most of the debugger functions available in CCStudio. You can also manipulate variables on the target using GEL, though this briefly halts the processor to update the value.

The GEL file included with the modified application is h263rateControl.gel. It provides sliders and dialog boxes to control bitrate, frame rate, and other application parameters. Its control is implemented by manipulating flags and variables in a global structure visible to the tskVideoProcess and tskControl tasks. The control task passes bitrate and frame rate control messages to the processing task, while other manipulations are handled directly by tskVideoProcess.

The remaining changes to the application are not structural in nature. Instead, they consist of short API calls added for run-time benchmarking. These remaining modifications are therefore described in the next section on RTA techniques.
