ST AN910 Application note

AN910
APPLICATION NOTE
ST7 AND ST9 PERFORMANCE BENCHMARKING
INTRODUCTION
STMicroelectronics has developed a set of test routines related to 8-bit and low-end 16-bit microcontroller applications to evaluate computing performance and interrupt processing performance of microcontroller cores. These routines have been implemented on ST7 and ST9 Microcontroller Units (MCUs) as well as several MCUs available on the market.
The routines have been written in assembly language to optimize their implementation and to focus on core performance, without depending on compiler code transformation.
For each test, the two parameters of interest are execution time and code size. Timings were measured whenever possible and calculated theoretically only when measurement was not possible. In most cases the programs were actually run and their execution times measured, so the assembly sources should be free of implementation errors and the results can be considered correct and reliable.
The results of this study point out the capability of the ST9+ to compete with 16-bit MCUs on 8-bit and low-end 16-bit applications and confirm its position as a high-end 8/16-bit MCU. They also confirm the ST7 as an outstanding 8-bit MCU.
The first four sections provide summary information:
1. Overview of the test routines
2. Overview of the MCU cores
3. Benchmark results
4. Result analysis
More detailed information is provided in the appendixes:
5. Description of MCU work environments
6. Complete numerical results
7. MCU core architecture analysis
8. Description of the test routines
9. Measurement procedure and calculation
Rev. 2.0
AN910/1104
1 OVERVIEW OF THE TEST ROUTINES
Eleven different test routines have been implemented in assembly language.
The first ten routines are aimed at measuring core computing performance. They are based on known algorithms and represent operations commonly used in 8-bit and low-end 16-bit applications. They mix bit, 8-bit and 16-bit operations, as many applications do.
This set of tests is described in Table 1.
Table 1. Test routine overview
Each entry gives the abbreviated name, the full name in parentheses, the description and the features stressed.
sieve (Eratosthenes sieve)
  Description: find prime numbers 3 out of 8189 elements
  Features stressed: 16-bit data computation, bit manipulation
acker(m,n) 1) (Ackermann function)
  Description: make recursive function calls, the number of calls depending upon two parameters (m,n)
  Features stressed: function calls, stack use
string (String search)
  Description: search a 16-byte string in a 128-character array
  Features stressed: 8-bit data block manipulation, string manipulation
char (Character search)
  Description: search a byte in a 40-byte array
  Features stressed: 8-bit data manipulation, char manipulation
bubble(n) 2) (Bubble sort)
  Description: sort a one-dimensional array of n 16-bit integers
  Features stressed: 16-bit data manipulation, integer manipulation
blkmov(n) 3) (Block move)
  Description: move an n-byte block from one place in memory to another
  Features stressed: 8-bit data block manipulation, block move
convert (Block translation)
  Description: translate a 121-byte block into a different format
  Features stressed: 8-bit data manipulation, use of a lookup table
16mul (16-bit integer multiplication)
  Description: multiplication of two unsigned words giving a 32-bit result
  Features stressed: 16-bit data computation, integer manipulation
shright (16-bit value right shift)
  Description: shift a 16-bit value five places to the right
  Features stressed: 16-bit data manipulation, bit manipulation
bitsrt (Bit manipulation)
  Description: set, reset, and test of 3 bits in a 128-bit array
  Features stressed: bit computation, bit and 8-bit data manipulation
1) The pairs of values used are (m,n)=(3,5) and (m,n)=(3,6)
2) The values used are n=10 (words) and n=600 (words)
3) The values used are n=64 (bytes) and n=512 (bytes)
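The acker routine stresses function calls and stack use. The following Python sketch is not the assembly source used in this study (which is not reproduced here). It only illustrates the Ackermann recursion with an explicit stack standing in for the hardware call stack, so that the peak stack depth can be observed. The function name and the returned depth figure are illustrative assumptions.

```python
def acker(m, n):
    """Return (value, peak_stack_depth) for the Ackermann function A(m, n).

    An explicit list replaces the hardware call stack, so the peak depth
    gives a feel for the stack use that this kind of routine exercises.
    """
    stack = [m]              # pending first arguments
    peak_depth = 1
    while stack:
        m = stack.pop()
        if m == 0:
            n += 1                   # A(0, n) = n + 1
        elif n == 0:
            stack.append(m - 1)      # A(m, 0) = A(m - 1, 1)
            n = 1
        else:
            stack.append(m - 1)      # A(m, n) = A(m - 1, A(m, n - 1))
            stack.append(m)
            n -= 1
        peak_depth = max(peak_depth, len(stack))
    return n, peak_depth


# The two parameter pairs named in footnote 1 of Table 1.
value_3_5, depth_3_5 = acker(3, 5)   # value is 253
value_3_6, depth_3_6 = acker(3, 6)   # value is 509
```

Moving from (3,5) to (3,6) roughly doubles the result and the stack depth, which is why the two pairs load the stack differently.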
Another test routine handling a timer interrupt has been used to measure core interrupt processing performance:
interrupt (Timer interrupt)
  Description: standard timer input capture and/or output compare interrupt service routine
  Features stressed: interrupt processing
A more precise description of the test routines is available in section 8.
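As an illustration of why 16mul and shright are demanding on an 8-bit core, the sketch below builds a 16 x 16 multiplication from four 8 x 8 partial products and performs the five-place right shift one bit at a time through two bytes. It is a Python model written under these assumptions, not the assembly code of the benchmark, and the function names are hypothetical.

```python
def mul8(a, b):
    """8 x 8 -> 16-bit unsigned multiply, the primitive an 8-bit ALU offers."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b


def mul16(x, y):
    """16 x 16 -> 32-bit unsigned multiply from four 8 x 8 partial products."""
    xh, xl = (x >> 8) & 0xFF, x & 0xFF
    yh, yl = (y >> 8) & 0xFF, y & 0xFF
    result = mul8(xl, yl)                            # low x low
    result += (mul8(xl, yh) + mul8(xh, yl)) << 8     # cross terms
    result += mul8(xh, yh) << 16                     # high x high
    return result & 0xFFFFFFFF


def shright5(value):
    """Shift a 16-bit value five places right, byte by byte with a carry."""
    hi, lo = (value >> 8) & 0xFF, value & 0xFF
    for _ in range(5):
        carry = hi & 1                     # bit leaving the high byte
        hi >>= 1
        lo = (lo >> 1) | (carry << 7)      # carry enters the low byte
    return (hi << 8) | lo
```

A core with 16-bit registers or a 16-bit shift instruction collapses each of these loops into far fewer instructions, which is the kind of difference these two tests are designed to expose.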
2 OVERVIEW OF THE MCU CORES
The set of MCUs evaluated is composed of various 8-bit, 8/16-bit, and 16-bit microcontrollers with accumulator, register file or mixed architectures.
Table 2 is an overview of the MCU cores.
Table 2. MCU cores overview
Each entry gives the MCU name and manufacturer, followed by the architecture, the internal frequency (Freq) 1) and a short core description.
80C51XA (PHILIPS)
  Architecture: 16-bit; register file
  Freq: 20 MHz
  Short core description: eXtended Architecture (XA) of 80C51’s - upward compatible; 8/16-bit register bus - 16-bit data/program memory buses; register file programming model with sixteen 16-bit banked registers
68HC16 (MOTOROLA)
  Architecture: 16-bit; two accumulators
  Freq: 16 MHz
  Short core description: core architecture superset of 68HC11’s - upward compatible; accumulator programming model with two 16-bit accumulators, and three 16-bit index registers (all with 4-bit extensions)
68HC12 (MOTOROLA)
  Architecture: 16-bit; two accumulators
  Freq: 8 MHz
  Short core description: instruction set is superset of 68HC11’s - upward compatible; programming model identical to 68HC11’s
ST9+ (STMicroelectronics)
  Architecture: 8/16-bit; register file
  Freq: 25 MHz
  Short core description: evolution of the ST9; enhanced clock speed, instruction cycle time; enlarged memory space
ST9 (STMicroelectronics)
  Architecture: 8/16-bit; register file
  Freq: 12 MHz
  Short core description: 8/16-bit architecture; 8-bit register bus - 16-bit memory bus; register file programming model with 14 groups of sixteen 8-bit registers, useable as 16-bit registers; modular paged registers for access to peripheral registers
H8/300 (HITACHI)
  Architecture: 8/16-bit; register file
  Freq: 10 MHz
  Short core description: RISC-like architecture and instruction set; register file programming model with sixteen 8-bit registers
68HC11 (MOTOROLA)
  Architecture: 8-bit; two accumulators
  Freq: 4 MHz
  Short core description: market standard 8-bit MCU; accumulator programming model with two 8-bit accumulators or one 16-bit accumulator, and two 16-bit index registers
68HC08 (MOTOROLA)
  Architecture: 8-bit; accumulator
  Freq: 8 MHz
  Short core description: superset of the 68HC05 - upward compatible; enhanced performance and instruction set; accumulator programming model with one 8-bit accumulator, and one 16-bit index register
ST7 (STMicroelectronics)
  Architecture: 8-bit; accumulator
  Freq: 4 MHz, 8 MHz
  Short core description: upward compatible with the 68HC05; accumulator programming model with one 8-bit accumulator, and two 8-bit index registers
80C51 (INTEL, PHILIPS...)
  Architecture: 8-bit; register file and accumulator
  Freq: 20 MHz
  Short core description: mixed accumulator and register file programming model with four banks of eight 8-bit registers (including the accumulator), and a 16-bit data pointer
KS88 (SAMSUNG)
  Architecture: 8-bit; register file
  Freq: 8 MHz
  Short core description: core architecture superset of SUPER8’s; 8-bit register bus; register file programming model with 192 8-bit prime data registers, and two register sets with system/peripheral/data registers
78K0 (NEC)
  Architecture: 8-bit; register file and accumulator
  Freq: 10 MHz
  Short core description: mixed accumulator and register file programming model with four banks of eight 8-bit or four 16-bit registers (including the accumulator)
1) As the goal is to obtain the best of each MCU core, the maximum internal frequency (Freq) available for each MCU on its development board has been used (unless otherwise specified). Note that performance results are directly proportional to this frequency (execution times are inversely proportional to it).
A description of the MCU work environments is available in section 5.
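Footnote 1 of Table 2 notes that results scale with the internal frequency. The sketch below shows the arithmetic, assuming that a routine takes a fixed number of internal clock cycles and that nothing else (such as memory wait states) changes with the clock. The function names and the numbers in the example are hypothetical and are not results from this study.

```python
def execution_time_us(cycles, freq_mhz):
    """Execution time in microseconds for a fixed cycle count.

    With the frequency in MHz, cycles / freq_mhz is directly in microseconds.
    """
    return cycles / freq_mhz


def rescale_time_us(time_us, measured_mhz, target_mhz):
    """Rescale a measured time to another internal frequency.

    Assumes the cycle count stays the same, so time is inversely
    proportional to the frequency.
    """
    return time_us * measured_mhz / target_mhz


# Hypothetical example: a 1000-cycle routine at 25 MHz, then at half the clock.
t_fast = execution_time_us(1000, 25.0)          # 40.0 us
t_slow = rescale_time_us(t_fast, 25.0, 12.5)    # 80.0 us
```

This is also the reason the frequency used for each MCU is stated next to its name in the result tables: a ratio between two MCUs is only meaningful for the stated pair of frequencies.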
3 BENCHMARK RESULTS
3.1 CORE COMPUTING PERFORMANCE
The following two charts show benchmark results for computing performance. Execution time and code size are presented as global ratios, taking the ST9+ as reference.
Preliminary ratios have been calculated for each test. Using those results, a global execution time ratio and a global code size ratio have been calculated as the average of all ratios. As not all the tests could be implemented on all MCUs (see 9.2.2 Memory considerations), one or two different results are presented for each MCU. The first one, available for all the MCUs, has been calculated with the reduced set of tests performed on all the MCUs. The second one, only available for some MCUs, has been calculated with the full set of tests.
Refer to section 6 for complete results. Refer to section 9 for a description of the measurement procedure and calculations.
Figure 1 presents execution time ratios and Figure 2 shows code size ratios.
Notes: The reduced set of tests is composed of:
string, char, bubble(10 words), blkmov(64 bytes), convert, 16mul, shright, bitsrt
The full set of tests is composed of:
string, char, bubble(10 words), blkmov(64 bytes), convert, 16mul, shright, bitsrt, sieve, acker(3,5), acker(3,6), bubble(600 words), blkmov(512 bytes)
The 80C51 results are preliminary results. They may change in later versions.
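The global ratio calculation described in this section can be sketched as follows. Two points are assumptions, since the exact formula is given only in section 9: each per-test ratio is taken as the MCU result divided by the ST9+ result, and the average is an arithmetic mean. All numbers below are hypothetical and are not results from this study. A shortened set of three tests is used to keep the example small.

```python
from statistics import mean


def per_test_ratios(mcu_results, reference_results, tests):
    """Ratio of each MCU result to the reference result, test by test."""
    return {t: mcu_results[t] / reference_results[t] for t in tests}


def global_ratio(mcu_results, reference_results, tests):
    """Arithmetic mean of the per-test ratios, or None if a test is missing."""
    if any(t not in mcu_results for t in tests):
        return None  # e.g. a test that could not be implemented on this MCU
    ratios = per_test_ratios(mcu_results, reference_results, tests)
    return mean(list(ratios.values()))


# Hypothetical execution times in microseconds (NOT values from this note).
reference = {"string": 100.0, "char": 50.0, "16mul": 10.0, "sieve": 4000.0}
other_mcu = {"string": 150.0, "char": 100.0, "16mul": 25.0}  # no sieve result

reduced_set = ["string", "char", "16mul"]
full_set = reduced_set + ["sieve"]

reduced = global_ratio(other_mcu, reference, reduced_set)  # mean of 1.5, 2.0, 2.5
full = global_ratio(other_mcu, reference, full_set)        # None: sieve missing
```

With this convention the reference MCU always has a global ratio of 1, and an MCU that lacks one of the full-set tests simply has no full-set result, which matches the "one or two different results" presented for each MCU.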
Figure 1. Computing performance global execution time ratios (ST9+ as reference)
[Chart not reproduced: MCUs grouped as 8-bit, 8/16-bit and 16-bit MCUs, with the direction of best performance marked.]
Figure 2. Computing performance global code size ratios (ST9+ as reference)
[Chart not reproduced: MCUs grouped as 8-bit, 8/16-bit and 16-bit MCUs, with the direction of best density marked.]
3.2 CORE INTERRUPT PROCESSING PERFORMANCE
The following three charts show benchmark results for interrupt processing performance. Execution time results are presented as time values (in microseconds), and also as ratios taking the ST9+ as reference. Code size results are presented as ratios taking the ST9+ as reference.
Refer to section 6 for complete results and details on calculation.
Figure 3 presents execution time results in microseconds, showing interrupt latency and return time. Figure 4 presents execution time ratios, and Figure 5 presents code size ratios.
Figure 3. Interrupt processing performance execution time values
[Chart not reproduced: MCUs grouped as 8-bit, 8/16-bit and 16-bit MCUs, with the direction of best performance marked.]
Figure 4. Interrupt processing performance execution time ratios (ST9+ as reference)
[Chart not reproduced: MCUs grouped as 8-bit, 8/16-bit and 16-bit MCUs, with the direction of best performance marked.]
Figure 5. Interrupt processing performance code size ratios (ST9+ as reference)
[Chart not reproduced: MCUs grouped as 8-bit, 8/16-bit and 16-bit MCUs, with the direction of best density marked.]
4 RESULT ANALYSIS
This section is an analysis of computing performance and interrupt processing performance results (for execution time and code size). Based on core architecture analysis (see section 7), two comparisons are presented, pointing out the strong and weak points of each MCU. The first concerns the high-end to medium-end MCUs versus the ST9+. The second concerns the medium-end to low-end MCUs versus the ST7.
4.1 PRELIMINARY REMARK
The results show that the two ratios for execution time and code size, calculated with the full and reduced sets of tests, differ only slightly. In most cases, the ranking of the MCUs is preserved. The reduced set can therefore be considered sufficient for comparing the MCU cores.
4.2 HIGH-END TO MEDIUM-END MCU ANALYSIS VERSUS ST9+
Table 3 presents the strong and weak points of the high-end to medium-end MCUs, compared to the ST9+ MCU.
Notes: ICT means Instruction Cycle Time and IL means Instruction Length.
Refer to paragraph 7.2.2 Average ICT/CPI and IL for details on calculation.
Refer to paragraph 7.3.4 ST9+ MCU core to see the main characteristics of the ST9+ MCU core.
4.2.1 Computing performance results
Regarding speed, the ST9+ MCU ranks at the top of the 8/16-bit MCUs. This new version of the ST9 has been improved on several points, including clocks per instruction and clock speed. These enhancements have considerably reduced its instruction cycle time. A large and powerful register file organized in groups allows the ST9+ to perform heavy computation (with many registers), to access peripheral and I/O port registers easily (with paged registers), and to manage multitasking (with register group pointers). Addressing modes such as register pair, register indirect with pre/post-increment, and indexed give the ST9+ the ability to perform 16-bit data computation and manipulation, and to manipulate tables and move blocks easily. A new memory management unit enlarges the memory space up to 4 Mbytes. New instructions have been added to handle this new space and improve C-language support.
Concerning code efficiency, the ST9+ MCU is also among the best. The 16-bit MCUs are only slightly better, even though they benefit from true 16-bit computing and data manipulation instructions. Among the 8/16-bit MCUs, the H8/300 gains a slight advantage from its special block move instruction. All the 8-bit MCUs, even with shorter instruction lengths, produce larger code.
4.2.2 Interrupt processing performance results
Regarding speed, the ST9+ MCU ranks first. The value chart shows that it has the shortest interrupt latency and an interrupt routine execution time that is among the best. These results show that its interrupt management and instruction cycle time have been considerably enhanced. In addition, the register groups provide fast context switching capabilities.
Some 8-bit MCUs, such as the 68HC08, perform quite well in this test. But their performance must be put in perspective, because such MCUs can manage only one interrupt at a time and so avoid a complex arbitration phase. The interrupt management of the ST9+ is one of the most advanced, allowing nested interrupts with fully software-programmable priorities and program priority level control.
Code efficiency results for interrupt processing performance are not really significant. The code represents only a very small part of an entire interrupt service routine, so no conclusion can be drawn.
4.2.3 Conclusion
The overall results and the characteristics of the ST9+ allow it to compete with the true 16-bit MCUs on 8-bit and low-end 16-bit applications, and confirm its position as a high-end 8/16-bit MCU.
Table 3. High-end to low-end MCU strong and weak points
80C51XA (20 MHz)
  Strong points
    instruction processing: 7-byte prefetch queue, predecoding
    fast 8/16-bit ALU: 16-bit datapath, 600 ns 8x8 multiplication
    short average ICT: 250 to 300 ns
    special addr. modes: indirect with 8/16 offset or auto-increment
    special instructions: compare & branch like, decrement & branch like, memory-to-memory moves
    multitasking: context switching capabilities
    large memory space: up to 16 Mbytes
    interrupt processing: nested mode, 4-bit program priority register, programmable priority levels
  Weak points
    address alignment: even jump/branch address, even word operand address, NOP instructions in assembly code
    lacking addr. modes: no indexed addressing
68HC16 (16 MHz)
  Strong points
    instruction processing: 3-stage prefetch queue, predecoding
    fast 8/16/32-bit ALU: 16-bit datapath, 625 ns 8x8 multiplication
    short average ICT: 375 to 440 ns
    special addr. modes: post-modified indexed with 8-bit offset
    special instructions: memory-to-memory moves
    multitasking: context switching capabilities
    large memory space: up to 1 Mbyte, up to 16 Mbytes with memory expansion module
    interrupt processing: nested mode, 3-bit program priority register, programmable priority levels
  Weak points
    address alignment: performance penalty if odd word operand addresses
    instruction lengths: only even
    lacking addr. modes: no direct addressing
    lacking instructions: index register manipulation, compare & branch like, decrement & branch like
68HC12 (8 MHz)
  Strong points
    instruction processing: 2-stage prefetch queue, predecoding
    fast 8/16-bit ALU: 20-bit datapath, 375 ns 8x8 multiplication
    short average ICT: 375 to 500 ns
    special addr. modes: auto-incr/decrement indexed, accumulator offset indexed
    special instructions: memory-to-memory moves, incr/decrement & branch like, test & branch like
    large memory space: up to 4 Mbytes with memory expansion module
  Weak points
    multitasking: need memory expansion module
    interrupt processing: one interrupt at a time recommended, no program priority register, hardware fixed priorities
H8/300 (10 MHz)
  Strong points
    instruction encoding: risc-like encoding
    short average IL: 2 to 3 bytes
    special addr. modes: register indirect, 16-bit offset or pre/post-increment
    special instructions: block moves
  Weak points
    instruction processing: standard (no prefetch)
    medium 8/16-bit ALU: 1400 ns 8x8 multiplication
    medium average ICT: 500 to 600 ns
    lacking instructions: 16-bit shifts/rotations, compare & branch like, decrement & branch like
    multitasking: no special capabilities
    memory space: 64 kbytes
    interrupt processing: one interrupt at a time recommended, no program priority register, hardware fixed priorities
68HC11 (4 MHz)
  Strong points
    (none listed)
  Weak points
    instruction processing: standard (no prefetch)
    medium 8/16-bit ALU: 2500 ns 8x8 multiplication
    long average ICT: 1500 to 1750 ns
    lacking instructions: compare & branch like, decrement & branch like
    multitasking: no special capabilities
    memory space: 64 kbytes
    interrupt processing: one interrupt at a time recommended, no program priority register, hardware fixed priorities
68HC08 (8 MHz)
  Strong points
    instruction processing: 1-byte prefetch queue
    fast 8-bit ALU: 8-bit datapath, 625 ns 8x8 multiplication
    special addr. modes: indexed with 8-bit offset or post-increment
    special instructions: memory-to-memory moves, compare & branch like, decrement & branch like
    large memory space: up to 4 Mbytes with memory expansion module
  Weak points
    medium average ICT: 500 to 625 ns
    lacking addr. modes: no indirect addressing
    multitasking: no special capabilities
    interrupt processing: one interrupt at a time recommended, no program priority register, hardware fixed priorities
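The 8x8 multiplication times quoted in Table 3 can be related to the internal frequencies of Table 2 by a simple conversion to clock cycles. The sketch below shows that arithmetic. The pairing of each time with its MCU follows the reading of Table 3 used here and should be treated as an assumption. The function and variable names are hypothetical.

```python
def clock_cycles(time_ns, freq_mhz):
    """Number of internal clock cycles for a duration at a given frequency.

    One cycle lasts 1000 / freq_mhz nanoseconds.
    """
    return time_ns * freq_mhz / 1000.0


# (internal frequency in MHz, 8x8 multiplication time in ns), as read
# from Table 2 and Table 3.
MUL_8X8 = {
    "80C51XA": (20, 600),
    "68HC16": (16, 625),
    "68HC12": (8, 375),
    "H8/300": (10, 1400),
    "68HC11": (4, 2500),
    "68HC08": (8, 625),
}

cycles = {mcu: clock_cycles(ns, mhz) for mcu, (mhz, ns) in MUL_8X8.items()}
# 80C51XA: 12, 68HC16: 10, 68HC12: 3, H8/300: 14, 68HC11: 10, 68HC08: 5
```

Seen this way, the 68HC11 and the 68HC16 need the same number of cycles for an 8x8 multiplication, and the difference in their quoted times comes from the internal frequency alone.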
4.3 MEDIUM-END TO LOW-END MCU ANALYSIS VERSUS ST7
Table 4 presents the strong and weak points of the medium-end to low-end MCUs, compared to the ST7 MCU.
Notes: ICT means Instruction Cycle Time and IL means Instruction Length.
Refer to paragraph 7.2.2 Average ICT/CPI and IL for details on calculation.
Refer to paragraph 7.3.9 ST7 MCU core to see the main characteristics of the ST7 MCU core.
4.3.1 Computing performance results
Regarding speed, the ST7 MCU takes second position, just below the recently introduced 68HC08. Even with no prefetch mechanism, it comes ahead of all the other MCUs. A low clocks-per-instruction count combined with a standard frequency explains its short instruction cycle time and its advantageous position. The two index registers and the indirect addressing mode allow the ST7 to perform data manipulation such as table manipulation and block moves easily. A direct addressing mode in a 256-byte zero page gives rapid access to important data and peripheral registers.
Concerning code efficiency, the ST7 MCU ranks in the middle of the 8-bit MCUs, very closely above the 68HC08. A standard instruction length explains its average position.
4.3.2 Interrupt processing performance results
Regarding speed, the ST7 MCU ranks very close to the 68HC08. A longer instruction cycle time explains this small gap. The strong point of its interrupt management is the automatic stacking of the CPU state, accumulator and index register. This process eliminates software stacking, and so saves time and space.
Code efficiency results for interrupt processing performance are not really significant. The code represents only a very small part of an entire interrupt service routine, so no conclusion can be drawn.
4.3.3 Conclusion
The overall results and the characteristics of the ST7 confirm it as an outstanding 8-bit MCU.
Table 4. Medium-end to low-end MCU strong and weak points
68HC11 (4 MHz)
  Strong points
    (none listed)
  Weak points
    medium 8/16-bit ALU: 2500 ns 8x8 multiplication
    long average ICT: 1500 to 1750 ns
    lacking instructions: compare & branch like, decrement & branch like
    multitasking: no special capabilities
68HC08 (8 MHz)
  Strong points
    instruction processing: 1-byte prefetch queue
    fast 8-bit ALU: 8-bit datapath, 625 ns 8x8 multiplication
    short average ICT: 500 to 625 ns
    special addr. modes: indexed with 8-bit offset or post-increment
    special instructions: compare & branch like, decrement & branch like, memory-to-memory moves
    large memory space: up to 4 Mbytes with memory expansion module
  Weak points
    lacking addr. modes: no indirect addressing
    multitasking: no special capabilities
80C51 (20 MHz)
  Strong points
    short average IL: 1 to 2 bytes
    special addr. modes: register indirect, stack pointer relative
    special instructions: compare & branch like, decrement & branch like, bit test & bit clear & jump, memory-to-memory moves
    multitasking: context switching capabilities
  Weak points
    slow 8-bit ALU: 2400 ns 8x8 multiplication
    long average ICT: 900 to 1000 ns
KS88 (8 MHz)
  Strong points
    special addr. modes: register pair indirect, register/address indexed (short/long)
    special instructions: compare & increment & branch like, decrement & branch like
    multitasking: context switching capabilities
    interrupt processing: nested mode, level priority control register
  Weak points
    slow 8-bit ALU: 3000 ns 8x8 multiplication
    long average ICT: 1250 to 1500 ns
    data memory location: off-chip only
78K0 (10 MHz)
  Strong points
    special addr. modes: register indirect, stack pointer relative, indexed with 8-bit offset
    special instructions: decrement & branch like
    multitasking: context switching capabilities
  Weak points
    mixed architecture: only accumulator oriented
    slow 8-bit ALU: 3200 ns 8x8 multiplication
    long average ICT: 1400 to 1600 ns