Brian Jeff DSP Field Software Applications
Arnie Reynoso Software Development Systems
ABSTRACT
DSP/BIOS and the Reference Frameworks allow developers to non-intrusively instrument
real-time applications. The software provided with this application note applies real-time
analysis (RTA) services to a working application: an H.263 encode/decode loopback
example for the TMS320DM642 evaluation module. The software demonstrates
techniques for benchmarking and controlling video software. It also introduces a service to
programmatically measure CPU and TSK loading. Debugging and troubleshooting
techniques for real-time applications, using Code Composer Studio, are also discussed.
Contents
1 Important Benchmarks for Video Applications
2 Base Application Overview
2.1 DSP/BIOS and RF5 Components Used
2.2 Requirements for Viewing RTA Benchmarks
3 Modifications to the Base Example
3.1 Splitting the Encode and Decode CELLs
3.2 Adding the Control TSK and MBX Communication
3.3 Querying the H.263 Encoder for Status
3.4 Controlling the Frame Rate
4 RTA Techniques for Performance Measurement
4.1 Measuring Function Execution Time with the UTL Module
Figure 1 Basic Data Flow of the Video Application
1 Important Benchmarks for Video Applications
Diverse video applications often require similar benchmarks to quantify their performance. Some
of the most commonly needed benchmarks are as follows:
• Frame rate
• Resolution
• End-to-end latency
• Processor utilization
• Bitrate*
• Quantization factor*
• Frame type*
• Group-of-pictures (GOP) structure*
Items marked with an asterisk are of importance in applications where encoders or decoders are
involved. This application note provides a method for measuring many of these benchmarks
during the capture, processing, and display phases of the example video application.
Frame rate is the rate at which frames are captured, processed, and displayed. The capture,
process, and display frame rates can differ by design or under overloaded conditions where
frames are “dropped.” Therefore, it is important to measure all three frame rates separately.
Resolution is the frame size, in pixels, at the capture, processing, and display stages. Resolution is typically
static at run-time, so it is not usually benchmarked with real-time tools. However, it is important
to know the capture, processing, and display resolutions of the system design. For example, the
H.263 loopback application used in this application note captures and displays video in D1
resolution and processes in 4CIF resolution.
End-to-end latency is a measurement of the time between the capture of a video frame in real time and the display of that same video frame some number of milliseconds later.
Processor utilization is the percentage of DSP resources used by an algorithm. In video
applications, the significant benchmarks of processor utilization include not only the number of
CPU cycles used, but also the memory bus utilization since such large amounts of data must be
moved from external memory to L2 and back repeatedly.
Bitrate is the number of bits per second output by a video encoder, or delivered to a video
decoder. Higher bitrates are generally associated with higher quality video. The bitrate often
varies with the complexity and motion in a video source, so it is important to measure bitrate
dynamically in video applications.
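As a rough illustration of dynamic bitrate measurement, the following sketch averages the encoder output over a window of frames. The function name and the idea of passing an array of per-frame byte counts are assumptions made for this illustration; they are not taken from the example application.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: average bitrate, in bits per second, over a window
 * of encoded frames. frameBytes[i] is the size of encoded frame i in bytes;
 * framesPerSecond is the rate at which those frames were produced. */
uint32_t averageBitrate(const uint32_t *frameBytes, size_t numFrames,
                        uint32_t framesPerSecond)
{
    uint64_t totalBits = 0;
    size_t i;

    if (numFrames == 0) {
        return 0;   /* nothing measured yet */
    }
    for (i = 0; i < numFrames; i++) {
        totalBits += (uint64_t)frameBytes[i] * 8u;
    }
    /* (bits per window) * (frames per second) / (frames per window) */
    return (uint32_t)(totalBits * framesPerSecond / numFrames);
}
```

A short window tracks changes in scene complexity quickly, while a longer window gives a steadier figure; the window length is a design choice.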
2 DSP/BIOS Real-Time Analysis (RTA) and Debugging Applied to a Video Application
Quantization is the process of dividing a continuous range of input values into a finite number of
subranges. Each subrange is assigned a specific output value. The Q factor, or quantization
factor, describes the level of quantization used to store the frequency domain representation of
the encoded image. Q factor often varies dynamically in an encoder when a constant bitrate is
targeted, so it is useful to display the Q factor dynamically with the video stream.
Frame type designates whether a particular frame was encoded independently (I frame) or
whether it depends upon previous frames (P) or both previous and future frames (B). Frame
type is a useful benchmark when shown in real-time. Note that P and B frame types are relevant
for H.263, MPEG-4, MPEG-2, and similar compression standards. They are not relevant for
JPEG or uncompressed video applications.
Group-of-pictures (GOP) structure is the sequence of frame types (I, P, and B) produced by the
encoder. Common structure lengths are 12 and 15 frames; for example, IBBPBBPBBPBBPBB is a 15-frame structure.
If the video stream does not change greatly from frame to frame, P frames may be about 10%
the size of I frames, and B frames may be about 2% the size of I frames.
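To make the arithmetic concrete, the sketch below totals the relative size of one GOP using the approximate ratios quoted above. The helper name and the choice to work in thousandths of an I frame are assumptions for illustration only; actual ratios depend on the content and the encoder settings.

```c
#include <stddef.h>

/* Relative frame sizes in thousandths of an I frame, from the rough
 * figures quoted in the text (P about 10%, B about 2%). */
enum { I_SIZE = 1000, P_SIZE = 100, B_SIZE = 20 };

/* Hypothetical helper: total size of one GOP, in thousandths of an
 * I frame. gop is a string such as "IBBPBBPBBPBBPBB". Returns -1 if
 * the string contains an unknown frame type. */
long gopRelativeSize(const char *gop)
{
    long total = 0;
    size_t i;

    for (i = 0; gop[i] != '\0'; i++) {
        switch (gop[i]) {
        case 'I': total += I_SIZE; break;
        case 'P': total += P_SIZE; break;
        case 'B': total += B_SIZE; break;
        default:  return -1;
        }
    }
    return total;
}
```

For the 15-frame structure IBBPBBPBBPBBPBB (one I, four P, and ten B frames), this gives 1000 + 400 + 200 = 1600, so under these assumptions the whole GOP occupies about 1.6 I-frame equivalents.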
2 Base Application Overview
The base "h263_loopback" example used to create the application described here is a video
application supplied with the TMS320DM642 evaluation module board support package. After
you install the board support package, the source code and included object libraries for the base
example are in the <CCS_install_dir>\boards\evmdm642\examples\video\h263_loopback
directory.
SPRAA56
The H.263 loopback example was chosen because it integrates the following pieces of
eXpressDSP software in a working video system:
• xDAIS-compliant algorithms
• eXpressDSP-compliant video device drivers from the device driver kit (DDK)
• DSP/BIOS real-time kernel for scheduling
• Chip Support Library (CSL) for low level device function calls
• Reference Framework Level 5 (RF5) as a software / scheduling foundation
This example could be used as the basis for any video design that uses an xDAIS-compliant
codec. It could be modified to support networking or streaming input/output by following the
video networking examples provided with the EVM’s board support package.
While some real-time analysis tools are enabled in the base example, this note describes a
more comprehensive set of tools for real-time analysis, benchmarking, and debugging. This set
of tools can be used with any video application that has a similar DSP/BIOS-based foundation.
The design of the base example is described in detail in the application note H.263 Loopback on the DM642 EVM (SPRA933), but a brief description of the design and components used is
provided here for reference.
Figure 1 shows a simplified view of the sequential flow of capture, processing, and display tasks
in the application.
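[Figure 1 (diagram not reproduced). Labeled elements: Camera, input Device Driver, tskInput (TSK), tskVideoProcess (TSK), tskOutput (TSK), SCOM links between the tasks, and output Device Driver.]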
Figure 1. Basic Data Flow of the Video Application
Before video data reaches the first stage, it must be converted to digital data, a process that is
managed by the input device driver. Analog video input is converted by an on-board NTSC
decoder chip into a digital bitstream compliant with the BT.656 format with embedded
synchronization. The decoder chip sends the bitstream to the TMS320DM642 DSP’s video port.
A device driver, implementing the IOM interface recommended in the DSP/BIOS Driver Developer’s Guide (SPRU616), is used to manage the initialization and synchronization of the
EDMA channel, the video port, and the NTSC decoder used for video capture.
In Figure 1, TSK refers to a DSP/BIOS task, which is described in detail in the DSP/BIOS User's Guide and the DSP/BIOS API Reference. Tasks support blocking calls, which are used to
synchronize the application and the video data stream. The main data flow then has three
stages: capture, processing, and display. Each stage has its own task object.
• The example’s first stage is a task called tskInput, which runs the tskVideoInput function.
The task receives digital video buffers from the device driver. It then converts the buffers to
the 4:2:0 format from the 4:2:2 formatted data it receives from the driver.
• The next stage is the tskVideoProcess task, which runs the tskProcess function. The task
includes algorithms that require input data in the 4:2:0 format. The tskInput task sends a
message to the tskVideoProcess task with pointers to the newly formatted data buffers. The
tskVideoProcess task then calls an xDAIS-compliant H.263 encoder algorithm to compress
the data, which is stored in an intermediate buffer. A second xDAIS-compliant algorithm, an
H.263 decoder, is called to decode the data in the buffer.
• The tskOutput task runs the tskVideoOutput function. It converts the data back to 4:2:2
format as required by the output driver and the NTSC encoder chip, and calls the driver with
the data buffer for display. The output driver is also an eXpressDSP-compliant device driver
with the same API as the input driver.
Data is passed between the tasks using SCOM messaging objects, which carry the buffer pointers
and the synchronization semaphore required to ready the output task. The SCOM module is from
Reference Framework 5, which is described in the next section.
2.1 DSP/BIOS and RF5 Components Used
The base application leverages various DSP/BIOS real-time analysis components to support
debugging capabilities that are not intrusive to the system performance. The following three
modules are included with the core DSP/BIOS library, and can be used in any application that
uses DSP/BIOS and on any TI DSP supported by DSP/BIOS:
• LOG – Logging events
• STS – Statistics accumulators
• TRC – Control of real-time capture
In addition to these DSP/BIOS components, the application also uses the UTL module for
debugging and diagnostics. This module is provided in the Reference Frameworks
distribution. The UTL module is described in more detail in Reference Frameworks for eXpressDSP Software: API Reference (SPRA147).
In addition to modules used for real-time analysis and debugging, the base application uses the
following DSP/BIOS and Reference Frameworks (RF) modules.
• MBX – Mailbox software module for inter-task communication (DSP/BIOS)
• TSK – Task scheduling module (DSP/BIOS)
• SCOM – Synchronization and pointer-passing mechanism for data flow between TSKs (RF)
• CHAN – Instantiates and serially executes xDAIS-compliant algorithms (RF)
• CELL – Container for xDAIS algorithms in a CHAN (RF)
• ALGRF – Encapsulates the procedure for xDAIS algorithm instantiation (RF)
The following module provides an interface to the video port device driver, and is described in
The TMS320DM642 Video Port Mini-Driver (SPRA918).
• FVID – Frame Video APIs for communicating with video port device drivers
A brief description of the DSP/BIOS and RF5 modules used extensively in benchmarking the
application is given in the following subsections.
2.1.1 LOG
The LOG module captures events in real time while the target program executes. You can use
the system log (LOG_system) or create user-defined logs, such as myTrace. Log buffers are of
a fixed size and reside in data memory. Individual messages use four words of storage in the
log's buffer. The first word holds a sequence number that allows the Event Log to display logs in
the correct order. The remaining three words contain data specified by the call that writes the
message to the log. The LOG module is much less intrusive to a running system (both in MIPS
and memory) than the RTS printf function, while providing a similar capability.
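The record layout described above can be pictured with the following sketch. It is not the DSP/BIOS LOG implementation: the type and function names are hypothetical, and the wrap-around behavior when the buffer fills is an assumption made to keep the example short.

```c
#include <stdint.h>

#define SKETCH_LOG_LEN 4u   /* number of records in this toy buffer */

/* One four-word record: a sequence number plus three data words. */
typedef struct {
    uint32_t seq;       /* lets a viewer display records in order */
    uint32_t data[3];   /* values supplied by the caller */
} SketchLogRecord;

typedef struct {
    SketchLogRecord records[SKETCH_LOG_LEN];  /* fixed-size storage */
    uint32_t nextSeq;                         /* next sequence number */
} SketchLog;

/* Write one record. Assumption: the oldest record is overwritten once
 * the buffer is full. */
void sketchLogWrite(SketchLog *log, uint32_t a, uint32_t b, uint32_t c)
{
    SketchLogRecord *rec = &log->records[log->nextSeq % SKETCH_LOG_LEN];

    rec->seq = log->nextSeq++;
    rec->data[0] = a;
    rec->data[1] = b;
    rec->data[2] = c;
}
```

Because each write stores only a few words and does no formatting on the target, the cost per event stays small, which is the property that makes this style of logging less intrusive than printf.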
2.1.2 STS
An STS object accumulates the following statistical information about an arbitrary 32-bit wide
data series: count, total, and maximum.
Statistics are accumulated in 32-bit variables on the target DSP and in 64-bit variables on the
host PC. When the host polls the target for real-time statistics, it resets the variables on the
target. This minimizes space requirements on the target, while allowing you to keep statistics for
long test runs.
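The split between 32-bit target accumulators and 64-bit host accumulators can be sketched as follows. The types and functions here are hypothetical and simplified (unsigned values only); they illustrate the accumulate-then-reset idea and are not the STS API or the host-side tooling.

```c
#include <stdint.h>

/* Target-side accumulator: 32-bit count, total, and maximum. */
typedef struct {
    uint32_t count;
    uint32_t total;
    uint32_t max;
} SketchTargetStats;

/* Host-side accumulator: the same three values, 64 bits wide. */
typedef struct {
    uint64_t count;
    uint64_t total;
    uint64_t max;
} SketchHostStats;

/* Target: fold one new value into the statistics. */
void sketchTargetAdd(SketchTargetStats *t, uint32_t value)
{
    t->count++;
    t->total += value;
    if (value > t->max) {
        t->max = value;
    }
}

/* Host: merge the target values into the long-running totals, then
 * reset the target so its 32-bit fields start again from zero. */
void sketchHostPoll(SketchHostStats *h, SketchTargetStats *t)
{
    h->count += t->count;
    h->total += t->total;
    if (t->max > h->max) {
        h->max = t->max;
    }
    t->count = 0;
    t->total = 0;
    t->max = 0;
}
```

The average over the whole run is then the host total divided by the host count, computed on the PC rather than on the DSP.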
When the DSP/BIOS instrumented kernel is used, the application automatically acquires STS
information for HWI, PIP, PRD, SWI, and TSK objects. For TSKs, the application must also call
the TSK_settime and TSK_deltatime APIs for this built-in STS information to be collected.
Custom STS objects can also be created in the DSP/BIOS configuration. By using the STS APIs
for the created objects, you can determine what statistical information needs to be acquired by
the system application during run-time.
2.1.3 TRC
The TRC module manages a set of trace control bits that control the real-time capture of
program information through event logs and statistics accumulators. For greater efficiency, the
target does not execute log or statistics APIs unless tracing is enabled.
This module contains two user-defined TRC flags that can be toggled using the DSP/BIOS RTA
Control Panel in Code Composer Studio. The application can use these bits to enable or disable
sets of explicit instrumentation. The program can use the TRC_query API to check the settings
of these bits and either perform or omit instrumentation calls based on the result. DSP/BIOS
does not use or set these bits.
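The gating pattern described above, in which instrumentation runs only when a user-controlled bit is set, might look like the following sketch. All names are hypothetical stand-ins; this is not the TRC API, and the mask here is an ordinary variable rather than something toggled from the RTA Control Panel.

```c
#include <stdint.h>

/* Hypothetical user trace bits. */
#define SKETCH_USER0 0x1u
#define SKETCH_USER1 0x2u

static uint32_t sketchTraceMask = 0;     /* all tracing off at start */
static uint32_t sketchFramesLogged = 0;  /* stands in for a log/statistics call */

void sketchTraceEnable(uint32_t bits)  { sketchTraceMask |= bits; }
void sketchTraceDisable(uint32_t bits) { sketchTraceMask &= ~bits; }

/* Returns nonzero only if every bit in 'bits' is currently enabled. */
int sketchTraceAllEnabled(uint32_t bits)
{
    return (sketchTraceMask & bits) == bits;
}

/* Application code: perform the instrumentation only when enabled. */
void sketchOnFrameEncoded(void)
{
    if (sketchTraceAllEnabled(SKETCH_USER0)) {
        sketchFramesLogged++;
    }
}

uint32_t sketchGetFramesLogged(void)
{
    return sketchFramesLogged;
}
```

Grouping related instrumentation under one bit lets a whole set of measurements be switched on or off at run time without rebuilding the application.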
2.1.4 UTL
UTL is part of the Reference Frameworks distribution. The UTL module is used for debugging
and diagnostics.
The module is essentially a set of macros that can either be expanded to code that performs the
desired debugging function, or removed completely when building depending on the value of the
UTL_DBGLEVEL preprocessor flag. The UTL module encapsulates DSP/BIOS services such as
CLK, STS, and LOG in APIs. These services can be easily removed in the final build by using
the preprocessor flag, -d "UTL_DBGLEVEL=0".
With conditional expansion of macros to code, you can reduce code size and remove
unnecessary functionality in the deployment phase without having to remove development
debugging/diagnostics aids. This technique also means you don’t need to modify code at
deployment time, thus reducing the possibility of error.
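The conditional-expansion technique can be illustrated with a small sketch. The macro and function names below are hypothetical and are not the UTL macros; only the general idea, a preprocessor symbol that decides whether debug code is compiled in, is taken from the text.

```c
/* Hypothetical debug-level symbol; define it as 0 on the compiler
 * command line to strip the debug code from the build. */
#ifndef SKETCH_DBGLEVEL
#define SKETCH_DBGLEVEL 1
#endif

#if SKETCH_DBGLEVEL > 0
/* Debug build: the statement is compiled in. */
#define SKETCH_DEBUG(stmt) do { stmt; } while (0)
#else
/* Deployment build: the statement disappears entirely. */
#define SKETCH_DEBUG(stmt) do { } while (0)
#endif

static unsigned sketchDebugEvents = 0;

/* Application code is identical in both builds; only the macro
 * expansion differs. */
unsigned sketchProcessFrame(unsigned frame)
{
    SKETCH_DEBUG(sketchDebugEvents++);
    return frame + 1;
}

unsigned sketchGetDebugEvents(void)
{
    return sketchDebugEvents;
}
```

The source file itself never changes between development and deployment; only the value of the preprocessor symbol does.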
2.2 Requirements for Viewing RTA Benchmarks
For any of the DSP/BIOS-based RTA tools to display data, the DSP/BIOS components in
Code Composer Studio (version 2.30 or earlier, and version 3.0) require that the application’s .cdb
configuration file be accessible and consistent with the executable .out file.
This requirement is easily met during development. It can also be satisfied in demonstrations or
delivered test examples. If you do not want to deliver source code with the application for
external testing or demonstration, you can still enable all the RTA tools by providing a current
DSP/BIOS configuration .cdb file along with the executable .out file to be tested. The tester will
be able to view the CPU load, individual thread statistics, and other important benchmark details
described in the sections to follow.
The RTA tools can be used in stop mode or real-time mode. In the GBL module of the
DSP/BIOS configuration, you can enable or disable real-time analysis. If you disable real-time
analysis, the three RTA functions in the IDL background loop are removed. Those functions
normally move RTA data from buffers on the DSP to the host PC and calculate the CPU load for
the load graph.
When RTA is disabled, the Message Log, Statistics View, Execution Graph, and other RTA
windows are updated only when the DSP is halted. An update displays the most recent contents
of their respective buffers. This “stop mode” of RTA offers a good compromise when some
visibility is required, but the additional code and background function calls are undesirable. Stop
mode can also occur if RTA is enabled but the CPU is so heavily loaded that it never runs the
IDL background loop long enough to provide real-time updates. In either case of stop-mode
operation, the CPU Load Graph is not updated. However, the programmatic method for CPU
load measurement discussed later in this application note provides a useful working alternative.
The next section describes structural modifications made to the application to make it more
suitable for benchmarking and further development.
3 Modifications to the Base Example
The application associated with this document has very few structural changes from the base
application shipped with the TMS320DM642 evaluation module. Some variables have been
renamed for readability, the encoder and decoder have been separated, and an additional task
has been added for application control. The data flow in the application has not been modified.
The steps to convert the base example to the modified example are provided in a readme file in
the directory that contains the source code.
Figure 2 shows a more detailed look at the data flow in the modified H.263 loopback example:
[Figure 2 (diagram not reproduced). Labeled elements include: input device driver buffer (3 frames); Yuv422to420 conversion with scratch1 (14 KB = 20 lines); YAfter420 (414 KB) and CbAfter420/CrAfter420 (207 KB) buffers, 720x576; H.263 encoder (6 KB instance memory); bitBuf (512 KB); shared scratch (92 KB); H.263 decoder (1.5 KB instance memory); y (414 KB) and Cb/Cr (207 KB) buffers; a second conversion stage with scratch2 (14 KB); output device driver buffer (3 frames). The key distinguishes internal memory, external memory, DSP CPU functions, CPU reads/writes, and background DMA reads/writes.]
Figure 2. Detailed Application Data Flow Showing Memory Buffers
Note: The dotted lines in Figure 2 indicate EDMA moves, and the solid lines indicate CPU
reads/writes. The application performs only CPU reads/writes from mapped internal memory,
relying on the EDMA to copy working data into internal scratch buffers.
3.1 Splitting the Encode and Decode CELLs
In the base example, the H.263 encoder and decoder are wrapped in sequential CELLs in a
single channel. This is suitable for an example application, but in actual video systems the input
to the decoder would be an encoded bitstream from an external source, and the output from the
encoder would be sent to an external destination such as a network stream or a hard disk drive.
Splitting the encoder and decoder into separate channels better supports external sourcing or
transport of the encoded bitstream. Additionally, splitting the encoder and decoder allows them
to be benchmarked separately for execution time.
Separate CHANs were created and initialized for the H.263 encoder and the H.263 decoder. At
run-time, a separate CHAN_execute call can be made for each channel.
3.2 Adding the Control TSK and MBX Communication
The second change to the base example was the addition of a control TSK to send control
commands to the process TSK using the MBX module from DSP/BIOS. An MBX object,
mbxProcess, was added in the DSP/BIOS text-based configuration file appThread.tci. That MBX
object transmits control commands to the tskVideoProcess TSK to change run-time parameters
such as the video frame rate and the encoder bitrate.
While implementing control via the host PC did not specifically require a separate task in the
modified application, adding a discrete control task makes the application more scalable. For
example, a user interface or communications link from another processor could send control
commands to a DSP-based video system. The control task could then service that user interface
or communications link. In the modified example, the control task simply monitors a global
structure for commands, and sends appropriate commands to the processing task if necessary.
The priority of the control TSK is set to a lower level than that of the tskVideoProcess, tskInput,
and tskOutput TSKs. This prevents the control task from adding latency or CPU overhead when
responding to control commands. The control commands are only serviced at times when the
three TSKs in the data stream are all in the blocked state and the processor would normally be
running its background loop.
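The control path described above, in which the control task watches a global structure and forwards commands through a mailbox, can be sketched in plain C as follows. This is a single-threaded model for illustration: the structure layout, command names, and queue are all hypothetical, and a real implementation would use the DSP/BIOS MBX calls and task blocking rather than this stand-in FIFO.

```c
#include <stdint.h>

typedef enum { CMD_NONE, CMD_SET_FRAME_RATE, CMD_SET_BITRATE } SketchCmdId;

typedef struct {
    SketchCmdId id;
    uint32_t value;
} SketchCtrlMsg;

/* Stand-in for the global structure written from the host side. */
typedef struct {
    volatile int pending;   /* nonzero when msg holds a new command */
    SketchCtrlMsg msg;
} SketchHostCommand;

/* Stand-in for a mailbox: a small fixed-depth FIFO. */
#define SKETCH_MBX_DEPTH 4u
typedef struct {
    SketchCtrlMsg slots[SKETCH_MBX_DEPTH];
    uint32_t head, tail, count;
} SketchMbx;

/* Returns 1 if the message was queued, 0 if the mailbox is full. */
int sketchMbxPost(SketchMbx *m, const SketchCtrlMsg *msg)
{
    if (m->count == SKETCH_MBX_DEPTH) {
        return 0;
    }
    m->slots[m->tail] = *msg;
    m->tail = (m->tail + 1u) % SKETCH_MBX_DEPTH;
    m->count++;
    return 1;
}

/* Returns 1 and fills *msg if a message was waiting, else 0. */
int sketchMbxPend(SketchMbx *m, SketchCtrlMsg *msg)
{
    if (m->count == 0) {
        return 0;
    }
    *msg = m->slots[m->head];
    m->head = (m->head + 1u) % SKETCH_MBX_DEPTH;
    m->count--;
    return 1;
}

/* One pass of the control task: forward a pending command, if any. */
int sketchControlPoll(SketchHostCommand *host, SketchMbx *mbx)
{
    if (!host->pending) {
        return 0;
    }
    if (!sketchMbxPost(mbx, &host->msg)) {
        return 0;   /* mailbox full: leave pending set and retry later */
    }
    host->pending = 0;
    return 1;
}
```

In this model the processing task would call the pend function once per frame without blocking, so that a missing command never stalls the video data flow.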
Figure 3 shows the task partitioning added to the application flow in Figure 2.
[Figure 3 (diagram not reproduced). The data flow of Figure 2 partitioned among the tskInput, tskProcess, tskOutput, and tskControl tasks.]
Figure 3. Task Partitioning in the Modified Application
3.3 Querying the H.263 Encoder for Status
The third change made to the base application was the use of a run-time API call to query the
algorithm as to its status after each frame. The eXpressDSP algorithm standard (xDAIS) states
that algorithms should provide a control API such as the following.