3DNow!™ Instruction Porting Guide
Application Note
Publication # 22621
Rev: B
Issue Date: August 1999
© 1999 Advanced Micro Devices, Inc. All rights reserved.  
The contents of this document are provided in connection with Advanced  
Micro Devices, Inc. (“AMD”) products. AMD makes no representations or  
warranties with respect to the accuracy or completeness of the contents of  
this publication and reserves the right to make changes to specifications and  
product descriptions at any time without notice. No license, whether express,  
implied, arising by estoppel or otherwise, to any intellectual property rights  
is granted by this publication. Except as set forth in AMD’s Standard Terms  
and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims  
any express or implied warranty, relating to its products including, but not  
limited to, the implied warranty of merchantability, fitness for a particular  
purpose, or infringement of any intellectual property right.  
AMD’s products are not designed, intended, authorized or warranted for use
as components in systems intended for surgical implant into the body, or in
other applications intended to support or sustain life, or in any other
application in which the failure of AMD’s product could create a situation
where personal injury, death, or severe property or environmental damage
may occur. AMD reserves the right to discontinue or make changes to its
products at any time without notice.
Trademarks  
AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, and K86 are trademarks, and AMD-K6
is a registered trademark of Advanced Micro Devices, Inc.
Microsoft is a registered trademark of Microsoft Corporation.  
Metrowerks and CodeWarrior are trademarks of Metrowerks, Inc.
MMX is a trademark and Pentium is a registered trademark of Intel Corporation.  
Other product names used in this publication are for identification purposes only and may be trademarks of their  
respective companies.  
22621B/0—August 1999  
3DNow!™ Instruction Porting  
Contents
Revision History
3DNow!™ Instruction Porting Guide
    Introduction
    Detecting 3DNow!™ Technology Support
    Related Documents
3DNow!™ Instruction Porting
    Code Support Considerations
        Separate Executables
        Separate DLL
        Different Optimized Versions
        Conditional Code Paths
    3DNow! Porting Preparations
        Perform High-Level Optimizations
        Profile Existing Code
        Port Major Hotspots
    Use Compiler Optimizations
    Use MASM Code for Critical Code
    Port Code in Blocks
    3DNow! Code versus x87 FPU Code
    Optimize Register Allocation
    Schedule Instructions
    3DNow! Code Debugging
    Decode Degradation Checking
        [ESI] Inhibits Short Decode
        Instructions Longer Than Seven Bytes
        Crossing Cache Line Boundary
        Instruction Length Determination
        Align Loops on 32-Byte Boundary
Blended Code Guidelines
    Introduction
    Data Alignment
        Alignment of Structures
        Alignment of Structure Components
        Alignment of Dynamically Allocated Memory
        Alignment of Stack Data
    Maximize SIMD Processing
    Use PREFETCH and PREFETCHW Instructions
    Take Advantage of Write Combining
    Use FEMMS Instruction
    Load-Execute Instruction Usage
    Scheduling Instructions
    Instruction and Addressing Mode Selection
General Porting Guidelines
    Minimize AMD-K6®-2 Processor Switching Overhead
    Using PREFETCH
        PREFETCH on the AMD-K6 Processor
        PREFETCH on the AMD Athlon™ Processor
            PREFETCHW Usage
            Multiple Prefetches
            Determining Prefetch Distance
            Prefetch at Least 64 Bytes Away from Surrounding Stores
    Use PFSUBR Instruction When Needed
    Using PAND and PXOR
    Swapping MMX™ Registers Halves
    PUNPCKL* and PUNPCKH* Instructions
    Storing the Upper 32 Bits of an MMX Register
    PFMIN and PFMAX
    Precision Considerations
    Moving Data Between MMX and Integer Registers
    Store-to-Load Forwarding
    Block Copies
    Instruction Cache and Branch Prediction Effects
        Use the Linker
    Code Alignment
    Software Write Combining
    Addressing Modes on the AMD-K6-2 and AMD-K6-III Processors
Revision History

Date          Rev   Description
August 1999   B     Initial public release.

3DNow!™ Instruction Porting Guide
Introduction  
This document contains information to assist programmers in  
creating optimized code for AMD processors with 3DNow!™  
technology. Compiler and assembler designers and assembly  
language programmers writing execution-sensitive code  
sequences as well as high-level C programmers will also find the  
guidelines useful. This document assumes that the reader  
possesses in-depth knowledge of the x86 instruction set, the x86  
architecture (registers, programming modes, etc.), and the IBM  
PC-AT platform.  
This document has three sections of guidelines for 3DNow!  
porting:  
3DNow!™ Instruction Porting  
Blended Code Guidelines  
General Porting Guidelines  
The 3DNow! Instruction Porting section describes the actual  
process of converting existing code to 3DNow! code. The  
Blended Code Guidelines section deals specifically with the  
creation of blended code—3DNow! code that provides high  
performance on AMD-K6® processors as well as on the  
AMD Athlon™ processor. New applications should use blended  
code to ensure optimal performance on current and future  
platforms. The General Porting Guidelines section describes a  
number of important issues for 3DNow! code optimization  
mainly for the family of AMD-K6 processors, but also  
addressing the AMD Athlon processor.  
Detecting 3DNow!™ Technology Support  
3DNow! technology is an open standard that has been adopted  
by multiple processor vendors. Therefore, checking for 3DNow!  
technology capability should not be limited to AMD processors.  
All 3DNow! technology licensees have agreed to indicate  
3DNow! technology capability through bit 31 of the extended  
feature flags. Checks for 3DNow! technology support can be  
made without first checking for the processor vendor. This  
allows current detection code to also detect future 3DNow!  
technology licensees.  
The basic steps of the 3DNow! technology capability detection  
are as follows:  
1. Test that the processor has the CPUID instruction.  
2. Check that CPUID instruction also supports extended  
function 8000_0001h.  
3. Execute CPUID extended function 8000_0001h and retrieve  
the EDX register.  
4. If bit 31 of the EDX register is set, the processor supports  
the 3DNow! instruction set.  
The following assembly language code shows how this can be  
implemented:  
    ;; check whether CPUID is supported
    ;; (bit 21 of Eflags can be toggled)
    pushfd                      ;save Eflags
    pop     eax                 ;transfer Eflags into EAX
    mov     edx, eax            ;save original Eflags
    xor     eax, 00200000h      ;toggle bit 21
    push    eax                 ;put new value on stack
    popfd                       ;transfer new value to Eflags
    pushfd                      ;save updated Eflags
    pop     eax                 ;transfer Eflags to EAX
    xor     eax, edx            ;updated Eflags and original differ?
    jz      NO_CPUID            ;no diff, bit 21 can't be toggled

    ;; test whether extended function 80000001h is supported
    mov     eax, 80000000h      ;call extended function 80000000h
    cpuid                       ;reports back highest supported ext. function
    cmp     eax, 80000000h      ;supports functions > 80000000h?
    jbe     NO_EXTENDED         ;no 3DNow! support, either

    ;; test if function 80000001h indicates 3DNow! support
    mov     eax, 80000001h      ;call extended function 80000001h
    cpuid                       ;reports back extended feature flags
    test    edx, 80000000h      ;bit 31 in extended features
    jnz     YES_3DNOW           ;if set, 3DNow! is supported
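The decision logic of steps 2 through 4 can also be sketched in C. This is a minimal illustration and not part of the original listing: the function name has_3dnow and its parameters are hypothetical, and the caller is assumed to have already executed CPUID (for example with inline assembly) and captured EAX from function 8000_0000h and EDX from function 8000_0001h.

```c
#include <stdint.h>

/* Hypothetical helper: decides 3DNow! support from CPUID results that the
   caller has already captured.
   max_ext_function: EAX returned by CPUID function 80000000h
   ext_feature_edx:  EDX returned by CPUID function 80000001h */
int has_3dnow(uint32_t max_ext_function, uint32_t ext_feature_edx)
{
    /* Function 80000001h must be supported (mirrors the CMP/JBE pair). */
    if (max_ext_function <= 0x80000000u)
        return 0;

    /* Bit 31 of the extended feature flags signals 3DNow! support. */
    return (ext_feature_edx & 0x80000000u) != 0;
}
```

Keeping the decision separate from the CPUID invocation makes it possible to test the logic on any host, independent of the processor the test runs on.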
Related Documents  
Related documents can be downloaded at the following URL:  
http://www.amd.com/support/techdocdir.html  
Including:  
AMD-K6® Processor Code Optimization Application Note,  
order# 21924  
3DNow!™ Technology Manual, order# 21928  
AMD-K6® Processor Multimedia Technology, order# 20726  
Implementation of Write Allocate Application Note, order#  
21326  
AMD Athlon™ Processor x86 Code Optimization Guide, order#  
22007  
AMD Extensions to the 3DNow!™ and MMX™ Instruction Sets  
Manual, order# 22466  
AMD Processor Recognition Application Note, order# 20734  
3DNow!™ Instruction Porting  
Code Support Considerations  
Consider how your software can support several paths through  
the different code optimized for various processors. Choices  
include the following methods:  
Separate Executables  
Build separate executables optimized for each platform. This is  
probably the highest performance option, but can be  
impractical due to code distribution issues and other problems.  
Separate DLL  
Place all performance-sensitive code into a separate DLL,  
providing several DLLs optimized for each target platform to be  
supported. This is a high-performance solution as the overhead  
is typically no more than selecting and loading the DLL version  
most appropriate for the platform detected at run time. The  
problem with this approach is that the performance-sensitive  
code can come from different and unrelated parts of the source  
tree, but becomes grouped together in a single DLL.  
Different Optimized Versions  
Provide optimized versions of each performance-critical  
function for each target platform, and call the functions  
through pointers that are initialized at run time based on the  
system processor the software is running on. This has a negative  
performance impact on AMD-K6® processors because function  
calls through pointers are slower than regular function calls.  
Conditional Code Paths  
Inside performance-critical parts of the code, conditionally  
select code paths based on capability flags. On AMD-K6  
processors, this can be faster than the approach using function  
pointers, because the branches will be well predicted since the  
capabilities do not change during run time. On the other hand,  
this approach can make the code less clear and more difficult to  
maintain.  
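The last two methods can be contrasted in a short C sketch. Everything named here is hypothetical and chosen only for illustration: sum_3dnow_standin is plain C standing in for a routine that real code would write in 3DNow! assembly, and the capability value passed to init_dispatch is assumed to come from the CPUID check described earlier.

```c
#include <stddef.h>

/* Hypothetical capability flag, set once at start-up. */
static int g_has_3dnow = 0;

/* Portable reference version of a performance-critical routine. */
float sum_generic(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a 3DNow!-optimized version. It keeps two partial sums,
   the way a two-wide SIMD loop would. */
float sum_3dnow_standin(const float *v, size_t n)
{
    float lo = 0.0f, hi = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lo += v[i];
        hi += v[i + 1];
    }
    if (i < n)                  /* odd trailing element */
        lo += v[i];
    return lo + hi;
}

/* Different Optimized Versions: call through a pointer set at run time. */
typedef float (*sum_fn)(const float *, size_t);
sum_fn g_sum = sum_generic;

void init_dispatch(int has_3dnow)
{
    g_has_3dnow = has_3dnow;
    g_sum = has_3dnow ? sum_3dnow_standin : sum_generic;
}

/* Conditional Code Paths: branch on a flag that never changes at run time,
   so the branch is easy to predict. */
float sum_conditional(const float *v, size_t n)
{
    if (g_has_3dnow)
        return sum_3dnow_standin(v, n);
    return sum_generic(v, n);
}
```

A caller would invoke init_dispatch once during start-up and then use either g_sum(v, n) or sum_conditional(v, n) in the hot path.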
3DNow!™ Porting Preparations  
Perform High-Level Optimizations  
Before starting a 3DNow! porting effort, perform all high-level  
optimizations that can be done at the source-code level. This  
primarily affects loops, which can be transformed in a variety of  
ways for better performance—loop unrolling, loop splitting,  
loop merging, loop inversion, loop switching, and hoisting of  
loop invariant expressions and conditionals. Function calls can  
also be optimized by inlining. It is much more difficult to  
perform high-level transformations once the code has been  
ported to the assembly-language level.  
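As a small illustration of two of these transformations, the following C sketch hoists a loop-invariant expression and unrolls the loop by two. The function names and the scaling operation are hypothetical examples, not taken from the original note; unrolling by two is chosen here only because it matches the two-element width of a 3DNow! operand.

```c
#include <stddef.h>

/* Before: the invariant product scale * gain appears inside the loop. */
void scale_naive(float *dst, const float *src, size_t n,
                 float scale, float gain)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * (scale * gain);
}

/* After: the invariant is hoisted and the loop is unrolled by two, so each
   iteration handles one pair of elements. */
void scale_transformed(float *dst, const float *src, size_t n,
                       float scale, float gain)
{
    const float k = scale * gain;   /* hoisted loop-invariant expression */
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
    }
    if (i < n)                      /* odd trailing element */
        dst[i] = src[i] * k;
}
```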
Profile Existing Code  
Before starting the actual porting process, profile the existing  
code on the target platform to identify the hotspots that merit  
manual porting work.  
Profilers come in various types. Some require source code, some
can instrument binaries, and others use a sampling approach. All
profilers should work on AMD processors. VTUNE works too,
but it provides neither event-based profiling nor disassembly of
3DNow! code. Some profilers, like Metrowerks™ CATS,
have built-in support for 3DNow! instructions and can be easier
to use when reprofiling code during the porting process.
Port Major Hotspots  
Candidates for 3DNow! porting are hotspots that frequently use  
x87 floating-point unit (FPU) instructions. AMD-K6 processors  
incur a penalty called switching overhead whenever the  
instruction flow changes between the use of x87 instructions  
and MMX™/3DNow! instructions (or vice versa). For full  
3DNow! optimization, port all x87 code down to hotspots that  
take up only a small percentage (approximately 2%) of the total  
execution time. Due to switching overhead, porting a few small  
functions to 3DNow! can often be detrimental to overall  
performance. The goal is to keep the processor operating on  
3DNow!/MMX code for long periods of time, with only  
occasional use of x87 code.  
Some manual porting work can be saved by compiling code
that contains fewer hotspots with a compiler that can
generate native 3DNow! code, such as Metrowerks
CodeWarrior™ Professional Release 4 and later. At this time,
major hotspots require manual porting for optimal
performance.
Use Compiler Optimizations  
To achieve the best performance from hotspots that are not  
floating-point intensive and so do not lend themselves to  
3DNow! porting, experiment with compiler flags to find which  
flag settings provide the best code for AMD processors. Most  
compilers allow processor-specific optimizations based on the  
capabilities of Intel processors. Since AMD processors are  
different from Intel processors, the available processor-specific  
settings are not fully optimal for AMD processors. The  
microarchitecture of AMD processors most closely resembles  
the P6, Pentium® II, and Pentium III microarchitecture, and in  
most cases selecting P6/PII/PIII-specific optimization results in  
the highest performance for AMD processors (for example, -G6  
for Microsoft® Visual C/C++). The Metrowerks CodeWarrior  
compiler has a specific optimization setting for AMD  
processors.  
Use MASM Code for Critical Code  
Use standalone MASM code for performance-critical parts of  
code that are ported to 3DNow!. This gives the best control over  
the code (for example, code alignment). To assemble 3DNow!  
code, use MASM 6.13 or MASM 6.14. Upgrade from an existing  
installation of MASM 6.11 to 6.13 by downloading ML613.EXE  
from the following ftp site:  
ftp://ftp.microsoft.com/Softlib/MSLFILES  
Apply this patch. To enable MMX instructions, use the .MMX
directive. To enable 3DNow! instructions, use the .K3D
directive after the .MMX directive; the order of the two
directives matters.
MASM 6.14 supports most of the new 3DNow! and MMX  
extensions introduced in the AMD Athlon processor. Use the  
.XMM directive to enable the use of these new extensions. Note
that the new instructions PFNACC and PFPNACC are not yet
accessible in MASM 6.14. Also, to use the new PSWAPD
instruction, define the following text macro:
pswapd TEXTEQU <pswapw>  
For some big functions where only a small part of the C code is  
replaced, use inline assembly. Since Microsoft Visual C 5.0 does  
not have native inline assembly support for the 3DNow!  
instruction set, download instruction macros from the AMD  
web site. The macros are in the amd3d.h file in the 3DNow!  
SDK, which can be downloaded from the following URL:  
http://www.amd.com/3dsdk/index2.html  
To get started on assembly language code, have the C compiler  
generate an assembly language listing and use that as the  
initial assembly language version. Make sure to compile with  
maximum optimizations to have the compiler perform all the  
high-level optimizations up front. The compiler will convert  
symbolic constants in the source code to “magic numbers”;  
however, the programmer may need to set up a mechanism to  
extract symbolic constants from C code and import them into  
the assembly code to maintain the assembly code as well as the  
C code.  
Port Code in Blocks  
Most functions contain several more-or-less self-contained  
blocks. Port blocks one-by-one to 3DNow! and surround the  
3DNow! code with FEMMS. This block-by-block approach  
minimizes debug time. After each block is ported, run the code  
to verify that it is still working. If the code is not working, it’s  
usually easy to locate errors because they are isolated to a  
block. With this approach a debugger is only rarely necessary in  
3DNow! porting work. The commenting conventions for 3DNow!  
code show the most significant half of the operand on the left  
hand side, the least significant half of the operand on the right  
hand side, with the halves separated by a vertical bar.  
3DNow!™ Code versus x87 FPU Code  
When porting, most programmers find that 3DNow! code is  
much easier to write than x87 FPU code because the register  
file is flat and because with the 3DNow! single instruction  
multiple data (SIMD) capability twice as many operands can be  
manipulated. It is often possible to remove local temporary  
variables.  
Maximize the use of SIMD—always try to do useful work on  
both parts of the operands. It can be advantageous to add  
overhead to pack and unpack operands in order to use the SIMD  
arithmetic. Consider modifying existing data structures so the  
data layout is more conducive to SIMD processing, thereby  
eliminating the need for additional pack and unpack  
instructions.  
Replace integer code with MMX code. Unroll small loops  
completely. This can free up integer registers, and branches  
that do not exist cannot be mispredicted. Due to the large  
number of global history bits, the AMD-K6 processor does not  
predict well on many short loops. If possible, use computations  
to replace branches caused by “if...then...else” constructs  
acting on 3DNow! data. Branching on 3DNow! data is a bit  
slower since 3DNow! instructions don’t affect the integer flags.  
Also, branching is disruptive to SIMD code as it is an inherently  
scalar operation which diminishes the advantages of SIMD  
processing.  
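Replacing an "if...then...else" with computation usually means turning the comparison result into a bit mask and combining both candidate values with logical operations. The C sketch below shows the idea for a single float; it is an assumption-laden illustration, not code from the original note. The helper names are hypothetical, float is assumed to be 32 bits wide, and in 3DNow! code the mask would typically come from a compare instruction such as PFCMPGT and be applied with PAND, PANDN, and POR.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer and back (memcpy keeps this
   well defined in C). */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Computes the same result as: r = (a > b) ? x : y;
   The comparison is widened to an all-ones or all-zeros mask, then both
   candidates are combined with AND / AND-NOT / OR. */
float select_gt(float a, float b, float x, float y)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a > b);  /* 0xFFFFFFFF or 0 */
    uint32_t r = (float_bits(x) & mask) | (float_bits(y) & ~mask);
    return bits_float(r);
}
```

In SIMD code the same mask-and-merge pattern handles both halves of an MMX register at once, which is where the benefit over branching comes from.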
Avoid moving data between the integer and the MMX registers  
because this is time-consuming on the AMD-K6 and  
AMD Athlon™ processors. To move data between the integer  
and the MMX registers, use the MOVD instruction. Write MMX  
and 3DNow! code in a load/store construction—but do not use  
load execute instructions such as PFADD MM0, [FOO]. Using a  
load/store construct enables aggressive scheduling which is  
essential for good performance. (See the Schedule Instructions
section.)
Maximize the use of instructions that guarantee high decode  
bandwidth. These are called short-decode instructions on  
AMD-K6 family processors and DirectPath for AMD Athlon  
family processors. The optimization guides for both processors  
list the short-decode or DirectPath instructions. Maintaining a  
high decode bandwidth is essential for high performance code.  
Using short-decoded instructions, the AMD-K6 family  
processors can decode two instructions per cycle. Using  
DirectPath instructions, the AMD Athlon family processors can  
decode three instructions per cycle. On the AMD-K6 family  
processors, the only 3DNow!/MMX instructions that are not  
short-decoded are EMMS, FEMMS, and PREFETCH.  
Avoid indirect calls and jumps, as the AMD-K6 processors do  
not apply branch prediction to these control-transfer  
instructions. At the source code level, this affects functions  
called through a function pointer (such as entry points into  
DLLs). The latency of a JMP DWORD PTR is eight cycles, and  
the latency of a CALL DWORD PTR is seven cycles. Note that  
AMD-K6 processors use the return stack on indirect calls, so the  
return from an indirectly called routine is still accelerated. The  
AMD Athlon processor applies branch prediction to indirect  
calls and jumps.  
Optimize Register Allocation  
After porting a complete function, optimize register allocation  
across the function. Keep as much data as possible in registers  
to reduce overall memory traffic. Make sure all data is aligned  
to natural boundaries—QWORDs on QWORD boundaries,  
DWORDs on DWORD boundaries. Note that data that is
accessed by 3DNow! code as QWORDs is not necessarily
declared as QWORDs in the program, and therefore may not be
properly aligned even if compiler switches are used to force
data alignment to natural boundaries. Ensuring alignment can
require slight changes or padding to data structures outside the
ported code, and can require manual QWORD alignment of
pointers returned by dynamic memory allocation routines such
as malloc(), calloc(), etc. Use the /Zp8 switch on Microsoft Visual
C to pad and align structs to QWORD boundaries. Note, however,
that /Zp8 doesn’t always do a perfect job, so a small amount of
manual padding may still be needed.
Schedule Instructions  
Schedule the code according to instruction latencies.  
Scheduling is important for AMD-K6-2 and AMD-K6-III  
processors because their scheduler is six deep and four wide,  
and it holds 24 OPs. OPs are pushed into the scheduler four OPs  
(an op-quad) at a time. As new OPs come in at the top, the  
previous lines shift down. When a line reaches the bottom of the  
scheduler and the OPs haven’t all completed yet, the scheduler  
stalls—no new OPs can be pushed in at the top. If all OPs have  
completed and the line is at the bottom of the scheduler, the  
results of the OPs are committed to architectural state (retired)  
and the op-quad is discarded from the scheduler, allowing the  
following lines to shift down.  
In the best possible case, the decoders push in a new op-quad  
every cycle. The OPs must complete after six cycles or else the  
processor loses performance. The 24 OPs are equivalent to 12  
short-decoded x86 instructions. The out-of-order window is
therefore not very big, and an instruction that doesn’t get its
source operands right away can reach the bottom of the scheduler
without having completed, which prevents the scheduler from
shifting.
There are a few basic scheduling rules. All 3DNow! instructions  
in the AMD-K6-2 and AMD-K6-III processors have two-cycle  
latency. All MMX instructions have one-cycle latency, except  
MMX multiplies which are two cycles. Loads have two-cycle  
latency. To guarantee smooth flow of code through the machine,  
group instructions into pairs that can decode together, issue  
together, and retire together. To achieve this, observe the  
following rules:  
No dependencies between instructions in a decode pair  
No resource conflicts between instructions in a decode pair  
Per cycle, the AMD-K6-2 and AMD-K6-III processors can  
perform the following:  
One load  
One store  
Two integer ALU operations  
One integer shift  
Two MMX ALU operations  
One MMX shift  
One 3DNow! add pipe op  
One 3DNow! mul pipe op  
One branch  
LEA counts as a store op. PUNPCK* instructions are MMX ALU  
instructions.  
One scheduling method is to first group the code following the  
above rules, marking the empty slots with <> and then move  
instructions to fill the slots. For example:  
    movd      mm1, [foo_var]  ;           0 | v[3],v[2],v[1],v[0]
    <>
    <>
    <>
    punpcklbw mm1, mm0        ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    <>
    movq      mm2, mm1        ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    punpcklwd mm1, mm0        ; 0,0,0,v[1] | 0,0,0,v[0]
    punpckhwd mm2, mm0        ; 0,0,0,v[3] | 0,0,0,v[2]
    pi2fd     mm1, mm1        ; float(v[1]) | float(v[0])
    pi2fd     mm2, mm2        ; float(v[3]) | float(v[2])
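When a block like this is ported, a scalar reference version is handy for checking the result. The C sketch below is a hypothetical reference, not part of the original note. It assumes, as the register comments suggest, that MM0 holds zero so the unpack instructions zero-extend each byte, and that the four bytes v[0] through v[3] are unsigned.

```c
#include <stdint.h>

/* Scalar reference for the sequence above: four unsigned bytes are
   zero-extended and converted to single-precision floats.
   out[0], out[1] correspond to MM1; out[2], out[3] correspond to MM2. */
void bytes_to_floats(const uint8_t v[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (float)v[i];
}
```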
3DNow!™ Code Debugging  
To debug 3DNow! code, it is best to have a debugger that  
supports both the disassembly of 3DNow! instructions, and  
allows the MMX registers to be viewed as pairs of single-  
precision floating-point values. NuMega SoftICE version 3.24  
and later has both these capabilities. Microsoft Visual C/C++  
6.0 can also disassemble 3DNow! instructions; however, it does  
not provide a convenient way of viewing the MMX registers as  
pairs of floating-point numbers.  
Decode Degradation Checking  
After code has been scheduled and thoroughly tested, the last  
stage of tweaking is to make sure there is no decode  
degradation. All AMD-K6 processors use a technique called pre-  
decode to speed up decoding. In certain instances, the pre-  
decode information can be degraded, resulting in decode of  
only one instruction per cycle (long decode) or even one  
instruction per two cycles (vector decode), even though the  
instruction itself is listed as short decoded. Use the following  
guidelines for AMD-K6 family processors:  
[ESI] Inhibits Short Decode  
Use of the [ESI] addressing mode inhibits short decode. Note that
[ESI+disp], [ESI+reg], etc. are acceptable. Also, note that
most assemblers optimize [ESI+0] to [ESI].
Instructions Longer Than Seven Bytes  
If the length of an instruction exceeds seven bytes, short  
decode is inhibited, and the instruction can never be short  
decoded.  
Crossing Cache Line Boundary  
If an instruction crosses a cache line boundary and the opcode
byte and modR/M byte are not in the same cache line, short
decode is inhibited. Instruction cache lines are 32 bytes long in
AMD-K6 family processors. If the code segment is only
paragraph (16-byte) aligned, check all 16-byte boundaries for
the occurrence of this case. Bad cases can be remedied as
follows:
Swap instructions in a decode pair.  
Choose alternative instructions to move code.  
(For example, use CMP EAX, 0 instead of TEST EAX, EAX)  
Insert filler instructions like NOPs. Since an instruction  
degraded to vector decode takes up two cycles, it’s better to  
add an additional instruction and have both be short  
decoded.  
Hand code an instruction to add a zero displacement or to  
make a displacement 32 bits instead of 8 bits.  
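Checking a listing for this case is simple address arithmetic. The C sketch below is one hypothetical way to flag suspect instructions; the function names and parameters are illustrative assumptions, and the addresses and byte offsets are expected to come from an assembler listing of a code segment that is 32-byte aligned.

```c
#include <stdint.h>

#define ICACHE_LINE 32u   /* instruction cache line size, AMD-K6 family */

/* Returns nonzero if two byte addresses fall in different 32-byte lines. */
int in_different_lines(uint32_t addr_a, uint32_t addr_b)
{
    return (addr_a / ICACHE_LINE) != (addr_b / ICACHE_LINE);
}

/* Hypothetical check for one instruction:
   start      = address of the first instruction byte
   length     = instruction length in bytes (at least 1)
   opcode_off = offset of the opcode byte from start
   modrm_off  = offset of the modR/M byte from start
   Returns nonzero when the instruction crosses a line boundary and the
   opcode and modR/M bytes end up in different lines. */
int short_decode_at_risk(uint32_t start, uint32_t length,
                         uint32_t opcode_off, uint32_t modrm_off)
{
    int crosses = in_different_lines(start, start + length - 1u);
    int split   = in_different_lines(start + opcode_off, start + modrm_off);
    return crosses && split;
}
```

For a code segment that is only paragraph aligned, the same check can be repeated with a line size of 16 to cover every boundary that might become a cache line boundary at load time.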
Instruction Length Determination  
Short-decode is inhibited if more than three instruction bytes  
are required to determine the length of an instruction. This  
happens for certain SIB addressing modes where the decoder  
needs to look at the SIB byte to determine instruction length,  
but 0Fh, opcode, and modR/M already make up the maximum of  
three bytes. Avoid these SIB addressing modes. For more  
information, see the AMD-K6 Processor Code Optimization  
Application Note, order# 21924. AMD-K6-2 processors with the  
CXT core (CPUIDs of 588h to 58Fh) and AMD-K6-III processors  
eliminate this particular form of degraded predecode.  
Align Loops on 32-Byte Boundary  
Align important loops on a 32-byte cache line boundary. At the  
minimum, make sure that after the start of the loop there are at  
least two instructions before the next 32-byte boundary.  
Blended Code Guidelines  
Introduction  
Blended code is 3DNow!™ optimized code that runs well on  
both the AMD-K6® and AMD Athlon™ processor platforms. The  
basic approach to blended code optimization is to address the  
AMD-K6 processor requirements first, and then to look for  
specific AMD Athlon processor improvements and issues which  
do not adversely affect AMD-K6 processor performance.  
With much larger buffers and a much larger out-of-order  
instruction window than other x86 processors, the AMD Athlon  
processor is good at automatically extracting performance out  
of existing executables, even if they are specifically optimized  
for a different processor. Of course, the best AMD Athlon  
performance can be achieved by optimizing code to exploit the  
specific strengths of the AMD Athlon processor.  
To learn more about AMD Athlon code optimization, refer to  
the AMD Athlon™ Processor x86 Code Optimization Guide, order#  
22007.  
Data Alignment  
Data alignment is very important for both AMD-K6 and  
AMD Athlon processor performance. Standard processor  
designs will work to their full potential if data is aligned.  
Alignment is especially important for data that is written by one  
instruction and subsequently read by another instruction.  
Three typical areas to watch for data alignment are:  
Alignment of structures and structure components  
Alignment of dynamically allocated memory  
Alignment of stack data  
Alignment of Structures  
With regard to alignment of structures, many compilers offer  
switches to automatically pad and align structures. These  
switches do not always work perfectly. It is best to check the  
alignment and to pad manually if necessary.  
Alignment of Structure Components  
Arranging structure components in order of decreasing size  
may help. For example, declare components with larger base  
type (e.g., DWORD) ahead of components with smaller base  
types (e.g., BYTE).  
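The effect of component ordering can be checked with sizeof and offsetof. The following is a minimal sketch, not taken from this note: the structure and member names are hypothetical, and the BYTE and DWORD typedefs assume an 8-bit char and a 32-bit int. Exact padding depends on the compiler and its alignment switches, so the sizes should be verified rather than assumed.

```c
#include <stddef.h>

typedef unsigned char BYTE;   /* assumed 8 bits  */
typedef unsigned int  DWORD;  /* assumed 32 bits */

/* Mixed order: a compiler typically inserts padding after flag_a so
   that count stays DWORD aligned, and again after flag_b. */
struct mixed_order {
    BYTE  flag_a;
    DWORD count;
    BYTE  flag_b;
};

/* Decreasing size: the DWORD comes first, the BYTEs pack behind it. */
struct sorted_order {
    DWORD count;
    BYTE  flag_a;
    BYTE  flag_b;
};

size_t mixed_size(void)  { return sizeof(struct mixed_order); }
size_t sorted_size(void) { return sizeof(struct sorted_order); }
```

Comparing mixed_size() with sorted_size() on the target compiler shows whether the reordering removed any padding.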
Alignment of Dynamically Allocated Memory  
With regard to alignment of dynamically allocated memory, if  
your programming environment does not guarantee pointers  
returned by dynamic memory allocators, such as malloc(), to be  
suitably aligned, allocate a slightly larger chunk of memory and  
align the pointer manually. For example, QWORD alignment  
can be achieved as follows:  
p=(QWORD *)malloc((sizeof(QWORD)*number_of_qwords)+7L);  
np=(QWORD *)((((long)(p))+7L) & (-8L));  
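The same idea can be written with uintptr_t, which avoids relying on long being as wide as a pointer. This is a sketch under stated assumptions: alloc_qword_aligned is a hypothetical helper name, and the caller is assumed to keep the raw pointer, because free() must receive the pointer that malloc() returned and not the aligned one.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns an 8-byte-aligned pointer inside a block
   obtained from malloc(). The unaligned pointer is handed back through
   raw_out so that the caller can pass it to free() later. */
void *alloc_qword_aligned(size_t bytes, void **raw_out)
{
    void *raw = malloc(bytes + 7);          /* up to 7 bytes of slack */
    *raw_out = raw;
    if (raw == NULL)
        return NULL;
    /* Round the address up to the next multiple of 8. */
    return (void *)(((uintptr_t)raw + 7u) & ~(uintptr_t)7u);
}
```

A caller would use the returned pointer for 3DNow! data and later call free() on the value stored through raw_out.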
Alignment of Stack Data  
Alignment of stack data is hard to control unless the complete  
functions are written in assembly language. In this case, use  
code like the following example to keep local 3DNow! data  
QWORD aligned.  
Prolog:  
PUSH EBP  
MOV  EBP, ESP  
AND  ESP, -8  
SUB  ESP, size_of_local_variable  
Note: Use EBP to access arguments, use ESP to access local  
variables.  
Epilog:  
MOV  ESP, EBP  
POP  EBP  
RET  
Maximize SIMD Processing  
Maximize the amount of SIMD processing in your code. If the  
programmer exploits the SIMD nature of the 3DNow!  
instructions aggressively, using 3DNow! instructions in the code  
can provide significant performance benefits as compared to  
x87 code.  
Using PUNPCK instructions to combine scalar data for SIMD  
processing can create significant overhead and should be  
avoided where possible. It is best to rearrange computations  
and data structures in the source such that the amount of SIMD  
computation can be maximized.  
Example 1 (Avoid):  
float Xscale, Xoffset, Yscale, Yoffset;  
xnew = x*Xscale+Xoffset;  
ynew = y*Yscale+Yoffset;  
Example 2 (Better):  
float Xscale, Yscale, Xoffset, Yoffset;  
xnew = x*Xscale+Xoffset;  
ynew = y*Yscale+Yoffset;  
The second example can now be efficiently implemented using  
3DNow! instructions:  
MOVQ  mm0, x        ;y | x  
MOVQ  mm1, Xscale   ;Yscale | Xscale  
MOVQ  mm2, Xoffset  ;Yoffset | Xoffset  
PFMUL mm0, mm1      ;y*Yscale | x*Xscale  
PFADD mm0, mm2      ;y*Yscale+Yoffset | x*Xscale+Xoffset  
MOVQ  xnew, mm0     ;store ynew | xnew  
As a rough goal, strive to use 90% or more of the available  
computational slots provided by the SIMD instructions.  
Use PREFETCH and PREFETCHW Instructions  
Use PREFETCH and PREFETCHW as aggressively as possible.  
On the AMD-K6-2 processor, use of PREFETCH results in only  
small performance improvements, because the prefetches share  
the frontside bus (FSB) bandwidth with all other cache and  
memory traffic. Due to the high FSB utilization at high  
core-clock multipliers, the prefetches often get bumped because  
they are low-priority memory accesses. This situation improves  
with the AMD-K6-III processor, where L2 traffic is redirected to  
a separate backside bus, which frees up FSB bandwidth.  
The AMD Athlon processor has large amounts of FSB  
bandwidth available, and application-level improvements of up  
to 20% have been observed using PREFETCH(W) aggressively.  
Examine code carefully to find opportunities for using  
PREFETCH(W). Good use of PREFETCH requires that  
essentially all of the prefetched data is actually used, and it  
therefore works best if data is accessed with unit stride and in  
ascending order. Sometimes algorithms can be rewritten to  
create such a data access pattern. On the AMD-K6 processor,  
PREFETCH creates a small overhead, since it is a vector  
decode instruction. On the AMD Athlon processor, PREFETCH  
is DirectPath.  
Use PREFETCH as aggressively as possible without decreasing  
AMD-K6 processor performance due to the overhead of the  
PREFETCH instruction. This is possible in almost all cases.  
PREFETCH on the AMD Athlon processor brings in 64 bytes  
per PREFETCH due to the cache line length having doubled  
over the AMD-K6 processor (32 bytes versus 64 bytes), but it is  
acceptable to have overlapping (on the AMD Athlon processor)  
prefetches to account for the shorter 32-byte cache lines of the  
AMD-K6 processor. Make sure to prefetch to addresses at least  
64 bytes apart from the target address of any stores in the  
vicinity of a PREFETCH(W) instruction. Also, for best  
AMD Athlon performance, prefetch about three cache lines  
(192 bytes) ahead of current loads. For a more detailed formula,  
see the PREFETCH usage guideline in the AMD Athlon™  
Processor Code Optimization Guide, order# 22007.  
Take Advantage of Write Combining  
Make sure to take advantage of the write-combining  
mechanisms provided by the hardware. For the AMD-K6  
processor, the best performance is achieved by using software  
write combining. (See “Software Write Combining” on  
page 34.) Also enable the write-combining features provided by  
the hardware on the AMD-K6-2 processor with the CXT core  
and on the AMD-K6-III processor. Aggressive software write  
combining can often do a better job than the AMD-K6  
processor’s hardware write-combining mechanism, but enabling  
the hardware write-combining mechanism provides the  
additional benefit of shorter latency writes to non-cacheable  
memory areas.  
The AMD Athlon processor has a very powerful write-  
combining mechanism that achieves even better acceleration of  
writes to non-cacheable space than is possible with write  
combining on the AMD-K6 processor. Specifically, the  
AMD Athlon write-combining buffer is 64 bytes and can  
combine writes of any size. The programming of the write-  
combining hardware is through model-specific registers  
(MSRs), which have been implemented compatibly with the  
Intel Pentium II processor. In addition to accelerating writes to  
write-combining (WC) regions, the AMD Athlon write  
combining can also accelerate writes to write-through (WT)  
memory areas if they occur in strictly ascending order. (Writes to  
WC areas can be combined regardless of the order of the  
writes.) See the Write Combining chapter of the AMD Athlon™  
Processor Code Optimization Guide, order# 22007, for more  
details.  
Use FEMMS Instruction  
The AMD Athlon processor does not have any switching  
overhead when switching between 3DNow!/MMX instructions  
and x87 instructions. Also, the FEMMS and EMMS instructions  
are essentially free because they execute with apparent zero-  
cycle latency. However, for blended code it is important to avoid  
frequent switching between 3DNow!/MMX and x87 code blocks  
and to use FEMMS before entering and after leaving a block of  
3DNow!/MMX code. Otherwise, AMD-K6 processor  
performance can suffer significantly.  
Load-Execute Instruction Usage  
The AMD Athlon processor performs well when load-execute  
instructions (i.e., instructions that have a register and a  
memory source, where the result goes to a register) are used. In  
fact, use of load-execute instructions is recommended for the  
AMD Athlon processor because they improve code density.  
However, for blended code, do not use load-execute instructions  
in 3DNow!/MMX code to enable proper scheduling of loads and  
to avoid potential problems with load-execute instructions  
(degradation to vector decode due to instruction length) on  
AMD-K6 family processors. The AMD Athlon processor has a  
built-in mechanism that enables a sequence of a load and a  
dependent 3DNow!/MMX instruction to execute just as quickly  
as a load-execute instruction. Avoiding load-execute  
instructions does not cause performance degradation on the  
AMD Athlon processor but can help the AMD-K6 processor.  
Scheduling Instructions  
Schedule instructions for the AMD-K6 processor. (See  
“Schedule Instructions” on page 11.) Due to the relatively small  
instruction re-order buffer in the AMD-K6 processor,  
instruction scheduling is important for maximizing  
performance on AMD-K6 processors. However, the AMD Athlon  
processor is a very aggressive out-of-order machine with a huge  
instruction re-order buffer. Therefore, instruction scheduling  
on the AMD Athlon processor is of minor importance, because  
the CPU can extract the available parallelism automatically.  
Scheduling code for the AMD-K6 processor has no adverse side  
effects on AMD Athlon performance.  
Instruction and Addressing Mode Selection  
As far as instruction selection is concerned, only a few issues  
require attention. On the AMD Athlon processor, transferring  
data between integer and MMX registers is somewhat slower  
than on the AMD-K6 processor. Therefore, such transfers should  
be minimized. Usually, this is not difficult to do.  
Among the integer instructions, avoid the LOOP instruction.  
While very fast on the AMD-K6 processor, it is somewhat slower  
on the AMD Athlon processor. Replace it with the sequence  
DEC ECX; JNZ. In most cases this does not reduce AMD-K6  
performance, and where it does, only by a very small amount.  
The AMD Athlon processor uses a different instruction pre-  
decode scheme than the AMD-K6 processor. It therefore has no  
sub-optimal addressing modes. However, since this is a real  
performance issue on the AMD-K6 processor, addressing modes  
considered sub-optimal for the AMD-K6 processor should be  
avoided in blended code. Sub-optimal addressing modes are  
described in “Addressing Modes on the AMD-K6®-2 and  
AMD-K6®-III Processors” on page 36.  
General Porting Guidelines  
Minimize AMD-K6®-2 Processor Switching Overhead  
Minimize the FPU to 3DNow!™ and MMX™ switching overhead  
by porting all hotspots containing x87 code to 3DNow! code.  
Even if FEMMS is used, switching incurs about 25 cycles in each  
direction—50 cycles round-trip. Always use FEMMS, and not  
EMMS, as the switching overhead with EMMS is about 100  
cycles round-trip.  
Always bracket 3DNow! code with FEMMS to ensure proper  
operation and minimize switching overhead. If there are  
function calls to functions that can contain FPU code, bracket  
the function call with FEMMS.  
It is also beneficial to simply minimize the number of FEMMS.  
One technique to use if there are multiple calls to a DLL (where  
the functions are _stdcall), is to perform the following in order:  
Push all the arguments first  
Execute a FEMMS  
Call all the functions (which unload the stack)  
Execute another FEMMS  
Because FEMMS is a three-cycle vector-path instruction, it adds  
significant overhead to very small functions. Avoid making  
functions that require FEMMS very small (functions consisting  
of just five x86 instructions have been observed in OpenGL).  
Note: For the AMD Athlon processor, it is important that CALLs  
not be spaced too closely together. No more than two CALLs  
for every 16 bytes of code are recommended.  
The switching overhead occurs on the first floating-point unit  
instruction after a piece of 3DNow!/MMX code, and it occurs on  
the first MMX or 3DNow! instruction after a piece of x87 code.  
FEMMS and EMMS are 3DNow!/MMX instructions. Thus,  
looking at the following sample code:  
code                         cycles  
<FPU instructions>  
FEMMS                        3 + switching overhead  
<MMX/3DNow! instructions>  
FEMMS                        3  
<1st FPU instruction>        x + switching overhead  
Note that PREFETCH(W), although introduced as part of the  
3DNow! instruction set extension, is treated like an ordinary  
integer instruction and therefore never incurs switching  
overhead. PREFETCH(W) can be used to accelerate integer,  
x87, MMX, or 3DNow! code.  
Using PREFETCH  
Use PREFETCH judiciously. PREFETCH on the AMD-K6®-2  
and AMD-K6®-III processors is microcoded, so it adds some  
overhead. Also on the AMD-K6-2 processor, all cache and  
memory accesses have to flow through the same frontside bus.  
Do not waste bandwidth on the frontside bus by executing  
useless prefetching.  
Opportunities for using PREFETCH are typically inside loops  
that process large amounts of data. If the loop goes through less  
than a cache line of data per iteration, partially unroll the loop.  
Make sure that close to 100% of the prefetched data is actually  
being used. This usually requires unit stride access—all  
accesses are to contiguous memory locations.  
PREFETCH on the AMD-K6® Processor  
The usefulness of PREFETCH on AMD-K6-III processors is  
limited by hardware constraints, the most important of which is  
that the AMD-K6-III processor allows only one load miss to be  
outstanding at any time.  
The cases where PREFETCH can most likely provide benefits  
are characterized as follows:  
The bandwidth requirements of the code are moderate—  
there is a relatively large amount of computation and  
relatively few memory accesses. An example of moderate  
bandwidth requirements would be code that consumes  
about 250 Mbytes per second worth of data when running  
out of the L1 cache on a 400-MHz processor.  
Stores in the code that access cacheable memory write to a  
small area of memory only; that is, the working set for stores is  
small or empty. Due to the write-allocate feature of the  
AMD-K6-2 and AMD-K6-III processors, stores bring lines  
into the cache which are subsequently dirtied and must be  
written back from the cache when the cache line is replaced  
with data brought in by PREFETCH. Cache writebacks use  
up bandwidth on the front-side bus.  
PREFETCHes do not overlap—no two PREFETCH  
instructions try to bring in the same data.  
The number of distinct memory regions being prefetched is  
small, preferably only one region—if there are multiple  
memory regions being prefetched (like multiple source  
arrays), the density of the loads must be low compared to the  
amount of computation, such that the computation can be  
overlapped with each PREFETCH. The PREFETCH  
instructions should be scheduled separately in such cases to  
allow each to overlap with computation, and to avoid the  
first PREFETCH blocking subsequent PREFETCHes due to  
the limit of one load miss in the machine at any time.  
PREFETCH on the AMD Athlon™ Processor  
PREFETCH on the AMD Athlon™ processor is a very powerful  
tool both because of the much larger available bandwidth that  
it can exploit and because of the ability to have multiple  
outstanding load misses.  
PREFETCHW Usage  
Code that intends to modify the cache line brought in through  
prefetching should use the PREFETCHW instruction. While  
PREFETCHW works the same as a PREFETCH on the  
AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a  
hint to the AMD Athlon processor of an intent to modify the  
cache line. The AMD Athlon processor will mark the cache line  
being brought in by PREFETCHW as modified. Using  
PREFETCHW can save an additional 15-25 cycles compared to  
a PREFETCH and the subsequent cache state change caused by  
a write to the prefetched cache line.  
Multiple Prefetches  
Programmers can initiate multiple outstanding prefetches on  
the AMD Athlon processor. While the AMD-K6-2 and  
AMD-K6-III processors can have only one outstanding prefetch,  
the AMD Athlon processor can have up to six outstanding  
prefetches. For example, when traversing more than one array,  
the programmer should initiate multiple prefetches.  
Example (Multiple Prefetches):  
double a[A_REALLY_LARGE_NUMBER];  
double b[A_REALLY_LARGE_NUMBER];  
double c[A_REALLY_LARGE_NUMBER];  
for (i=0; i<A_REALLY_LARGE_NUMBER/4; i++) {  
    prefetchw (a[i*4+64]); // will be modifying a  
    prefetch (b[i*4+64]);  
    prefetch (c[i*4+64]);  
    a[i*4]   = b[i*4]   * c[i*4];  
    a[i*4+1] = b[i*4+1] * c[i*4+1];  
    a[i*4+2] = b[i*4+2] * c[i*4+2];  
    a[i*4+3] = b[i*4+3] * c[i*4+3];  
}
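The prefetch() and prefetchw() calls in the example above are placeholders for the instructions. With a C compiler that provides the GCC/Clang __builtin_prefetch intrinsic (an assumption, as is the function name multiply_arrays), a comparable loop might look like the sketch below. The second argument to __builtin_prefetch is 1 for data that will be written and 0 for data that will only be read. The distance of 24 doubles (192 bytes) follows the three-cache-line heuristic, and the index is clamped so that no address beyond the arrays is formed.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 24   /* elements: 24 doubles * 8 bytes = 192 bytes */

/* Sketch: a[i] = b[i] * c[i] with one prefetch per array per iteration. */
void multiply_arrays(double *a, const double *b, const double *c, size_t n)
{
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        /* Clamp so the prefetch address stays inside the arrays. */
        size_t ahead = (i + PREFETCH_AHEAD < n) ? i + PREFETCH_AHEAD : n - 1;
        __builtin_prefetch(&a[ahead], 1);   /* will be modifying a */
        __builtin_prefetch(&b[ahead], 0);
        __builtin_prefetch(&c[ahead], 0);
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }
    for (; i < n; i++)                      /* remainder elements */
        a[i] = b[i] * c[i];
}
```

Whether the compiler emits PREFETCH and PREFETCHW for these calls depends on the target options, so the generated code should be inspected.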
Determining Prefetch  
Distance  
To make sure code with PREFETCH works well on the  
AMD Athlon processor, prefetch several cache lines ahead of  
the current loads. A good heuristic is to fetch three  
AMD Athlon cache lines (at 64 bytes each), or 192 bytes ahead  
of current loads. That is, if the code is currently operating on  
data at address X, prefetch at X+192.  
Given the latency of a typical AMD Athlon processor system  
and expected processor speeds, the following formula should be  
used to determine the prefetch distance in bytes:  
Prefetch Distance = 200 (DS/C) bytes  
Round up to the nearest 64-byte cache line.  
The number 200 is a constant that is based upon expected  
AMD Athlon processor clock frequencies and typical system  
memory latencies.  
DS is the data stride in bytes per loop iteration.  
C is the number of cycles for one loop to execute entirely  
from the L1 cache.  
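As a worked example, reading the formula above as 200 times DS divided by C: a loop that consumes 24 bytes per iteration and takes 10 cycles from the L1 cache gives 200 * 24 / 10 = 480 bytes, which rounds up to 512 bytes (eight 64-byte lines). The sketch below performs the same arithmetic; the function name prefetch_distance is hypothetical, and the cycle count must be nonzero.

```c
#define ATHLON_LINE_BYTES 64u

/* Prefetch distance in bytes: 200 * DS / C, rounded up to a whole
   number of 64-byte cache lines. cycles must be greater than zero. */
unsigned prefetch_distance(unsigned ds_bytes, unsigned cycles)
{
    unsigned raw = (200u * ds_bytes + cycles - 1u) / cycles;   /* ceiling */
    return (raw + ATHLON_LINE_BYTES - 1u) / ATHLON_LINE_BYTES
           * ATHLON_LINE_BYTES;
}
```

The result is a starting point for tuning, since the constant 200 is based on expected clock frequencies and typical memory latencies.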
Prefetch at Least 64  
Bytes Away from  
Surrounding Stores  
The PREFETCH and PREFETCHW instructions can suffer  
from false dependencies on stores. If there is a store to an  
address that matches a request on bits 14–6, that request (the  
PREFETCH or PREFETCHW instruction) is blocked until the  
store is written to the cache. Therefore, code should prefetch  
data that is located at least 64 bytes away from any surrounding  
store’s data address.  
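The address comparison described above can be expressed as a mask test. This is a sketch based only on the statement that bits 14–6 are compared; the macro and function names are hypothetical. Note that under this rule two addresses exactly 32 Kbytes apart also match, because they differ only in bit 15.

```c
#include <stdint.h>

/* Bits 14..6 of an address: 0x7FFF with the low six bits cleared. */
#define STORE_MATCH_MASK 0x7FC0u

/* Returns nonzero if a prefetch to prefetch_addr would match a store
   to store_addr on bits 14..6 and could therefore be blocked. */
int prefetch_may_block(uintptr_t prefetch_addr, uintptr_t store_addr)
{
    return ((prefetch_addr ^ store_addr) & STORE_MATCH_MASK) == 0;
}
```

A check like this can be used in a debug build to flag prefetch targets that fall too close to a nearby store.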
If PREFETCH fits a piece of code but has no measurable effect  
on AMD-K6-III processors, keep the PREFETCH code anyway.  
There is a good chance that it will help on the AMD Athlon  
processor, because the AMD Athlon processor’s  
implementation of PREFETCH is very aggressive. If an  
AMD Athlon processor is available, check that it benefits from  
the PREFETCH, and then make sure that the PREFETCH  
doesn’t hurt the AMD-K6-III processor.  
Use PFSUBR Instruction When Needed  
Note that there is a PFSUBR instruction, so in a subtraction the  
programmer can choose which operand to destroy.  
Using PAND and PXOR  
Use PAND and PXOR to perform FABS and FCHS work on  
3DNow! operands. For example:  
mabs DQ 07fffffff7fffffffh  
sgn  DQ 08000000080000000h  
movq mm0, [mabs]  
movq mm1, [sgn]  
pxor mm2, mm1    ;change sign  
pand mm2, mm0    ;absolute value  
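The two masks work because an IEEE-754 single-precision value keeps its sign in bit 31. The following scalar C sketch applies the same masks to one 32-bit lane; it assumes float is a 32-bit IEEE-754 type, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Copy the bit pattern of a float into an integer and back. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Same effect as PAND with 7fffffffh: clear the sign bit. */
float abs_by_mask(float x)    { return bits_float(float_bits(x) & 0x7FFFFFFFu); }

/* Same effect as PXOR with 80000000h: flip the sign bit. */
float negate_by_mask(float x) { return bits_float(float_bits(x) ^ 0x80000000u); }
```

The MMX versions apply the same operation to both 32-bit halves of the register at once.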
Use a PXOR MMreg, MMreg instruction to clear the bits in an MMX  
register.  
Use a PCMPEQD MMreg, MMreg instruction to set the bits in an  
MMX register.  
Swapping MMX™ Register Halves  
To swap the two register halves of an MMX register (which  
should be avoided) use the following:  
;mm1 = swapd (mm0), mm0 destroyed  
movq      mm1, mm0   ;y | x  
punpckldq mm0, mm0   ;x | x  
punpckhdq mm1, mm0   ;x | y  
;mm1 = swapd (mm0), mm0 preserved  
movq      mm1, mm0   ;y | x  
punpckhdq mm1, mm1   ;y | y  
punpckldq mm1, mm0   ;x | y  
For code being used only on AMD Athlon family processors, use  
the new PSWAPD instructions. See the AMD Extensions to the  
3DNow!™ and MMX Instruction Sets Manual, order# 22466 for  
the instruction usage.  
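The swap sequence that preserves mm0 can be traced with a small C model of the two unpack instructions operating on 64-bit integers. This is an illustrative emulation only; the helper functions are hypothetical stand-ins named after the instructions they model.

```c
#include <stdint.h>

/* Model of PUNPCKLDQ dst, src: result = src.low | dst.low (high | low). */
static uint64_t punpckldq(uint64_t dst, uint64_t src)
{
    return (src << 32) | (dst & 0xFFFFFFFFULL);
}

/* Model of PUNPCKHDQ dst, src: result = src.high | dst.high. */
static uint64_t punpckhdq(uint64_t dst, uint64_t src)
{
    return (src & 0xFFFFFFFF00000000ULL) | (dst >> 32);
}

/* mm1 = swapd(mm0), mm0 preserved. */
uint64_t swap_halves(uint64_t mm0)
{
    uint64_t mm1 = mm0;            /* y | x */
    mm1 = punpckhdq(mm1, mm1);     /* y | y */
    mm1 = punpckldq(mm1, mm0);     /* x | y */
    return mm1;
}
```

The model makes the cost visible: three instructions are needed for an operation that PSWAPD performs in one.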
PUNPCKL* and PUNPCKH* Instructions  
PUNPCKL* and PUNPCKH* are essential facilities for  
manipulating MMX and 3DNow! operands. Besides  
MOVQ/MOVD, these are the most frequently used MMX  
instructions in 3DNow! code. For example, converting a stream  
of unsigned bytes into 3DNow! floating-point operands:  
;outside loop:  
pxor      mm0, mm0  
;inside loop:  
movd      mm1, [foo_var]  ;0 | v[3],v[2],v[1],v[0]  
punpcklbw mm1, mm0        ;0,v[3],0,v[2] | 0,v[1],0,v[0]  
movq      mm2, mm1        ;0,v[3],0,v[2] | 0,v[1],0,v[0]  
punpcklwd mm1, mm0        ;0,0,0,v[1] | 0,0,0,v[0]  
punpckhwd mm2, mm0        ;0,0,0,v[3] | 0,0,0,v[2]  
pi2fd     mm1, mm1        ;float(v[1]) | float(v[0])  
pi2fd     mm2, mm2        ;float(v[3]) | float(v[2])  
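A scalar C model of the same steps can serve as a reference when testing the conversion. The sketch below is hypothetical code, not part of the original listing: it interleaves four packed bytes with zero bytes, widens each resulting word to a DWORD, and converts it to float, which is the role PI2FD plays above.

```c
#include <stdint.h>

/* Interleave the four bytes of v with zeros: 0,v3,0,v2,0,v1,0,v0. */
static uint64_t unpack_bytes_to_words(uint32_t v)
{
    uint64_t r = 0;
    int i;
    for (i = 0; i < 4; i++)
        r |= (uint64_t)((v >> (8 * i)) & 0xFFu) << (16 * i);
    return r;
}

/* Convert four packed unsigned bytes (v[0] in the low byte) to floats. */
void bytes4_to_floats(uint32_t packed, float out[4])
{
    uint64_t words = unpack_bytes_to_words(packed);
    int i;
    for (i = 0; i < 4; i++) {
        /* Zero-extend each word to a DWORD, then integer-to-float. */
        uint32_t dword = (uint32_t)((words >> (16 * i)) & 0xFFFFu);
        out[i] = (float)dword;
    }
}
```

Because the inputs are in the range 0 to 255, every result is exactly representable in single precision.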
Storing the Upper 32 Bits of an MMX™ Register  
To store the upper 32 bits of an MMX register using MOVD, one  
can use either a PSRLQ or a PUNPCKHDQ instruction to move  
the high-order 32 bits of the register to the low-order 32 bits of  
the register. In this situation, it is optimal to use the  
PUNPCKHDQ instruction. The AMD-K6-III processor has only  
one MMX shifter (which can execute a PSRLQ), but two MMX  
ALUs (which can execute a PUNPCKHDQ). Using PUNPCKHDQ  
therefore maximizes the likelihood of an execution unit being  
available.  
PFMIN and PFMAX  
Use PFMIN and PFMAX where possible. They are much faster  
than the equivalent code using MMX and 3DNow! instructions.  
PFMIN and PFMAX can be used for clamping. They can also be  
used in SIMD code that avoids branching by replacing it with  
computation. For example:  
float x,z;  
z = abs(x);  
if (z >= 1) {  
z = 1/z;  
}
can be coded using branchless SIMD code as follows:  
;;in:  mm0 = x  
;;out: mm0 = z  
movq     mm5, mabs  ;0x7fffffff  
movq     mm6, one   ;1.0  
pand     mm0, mm5   ;z=abs(x)  
pcmpgtd  mm6, mm0   ;z < 1 ? 0xffffffff : 0  
pfrcp    mm2, mm0   ;1/z approx  
movq     mm1, mm0   ;save z  
pfrcpit1 mm0, mm2   ;1/z step  
pfrcpit2 mm0, mm2   ;1/z final  
pfmin    mm0, mm1   ;z = z < 1 ? z : 1/z  
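The PFMIN at the end relies on the identity that, for z greater than zero, selecting 1/z when z >= 1 and z otherwise is the same as taking the minimum of z and 1/z. The scalar C sketch below states both forms so the identity can be checked; the function names are hypothetical, and the ternary helpers stand in for PAND and PFMIN rather than guaranteeing branch-free machine code.

```c
static float abs_f(float v)          { return v < 0.0f ? -v : v; }
static float min_f(float a, float b) { return a < b ? a : b; }

/* Branching form, as in the C fragment above. */
float fold_branching(float x)
{
    float z = abs_f(x);
    if (z >= 1.0f)
        z = 1.0f / z;
    return z;
}

/* Min form: always compute 1/z, then keep the smaller value. */
float fold_min(float x)
{
    float z = abs_f(x);
    return min_f(z, 1.0f / z);
}
```

The min form always pays for the reciprocal, which is the usual trade made when a branch is replaced with computation.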
As another example, the following code:  
#define PI 3.14159265358979323f  
float x,z,r,res;  
z = abs(x);  
if (z < 1) {  
    res = r;  
}
else {  
    res = PI/2-r;  
}
can be coded as branchless SIMD code as follows:  
;;in:  mm0 = x  
;;     mm1 = r  
;;out: mm1 = res  
movq    mm5, mabs  ;0x7fffffff  
movq    mm6, one   ;1.0  
pand    mm0, mm5   ;z=abs(x)  
pcmpgtd mm6, mm0   ;z < 1 ? 0xffffffff : 0  
movq    mm4, pio2  ;pi/2  
pfsub   mm4, mm1   ;pi/2-r  
pandn   mm6, mm4   ;z < 1 ? 0 : pi/2-r  
pfmax   mm1, mm6   ;res = z < 1 ? r : pi/2-r  
Precision Considerations  
Carefully consider whether to use reciprocals, divides, square  
roots, and reciprocal square roots to full precision. If full  
precision is not required, accelerate code by using just the  
approximations returned by PFRCP (14 bits accuracy), and  
PFRSQRT (15 bits accuracy) instead of coding the reciprocal or  
reciprocal square root sequence with the Newton-Raphson step  
instructions. For lighting computations, the accuracy of the  
approximation instructions often suffices, but geometry  
transforms typically require full precision.  
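To see why a refinement step matters, the generic Newton-Raphson iteration for a reciprocal, x1 = x0 * (2 - a * x0), roughly squares the relative error with each step, which doubles the number of accurate bits. The C sketch below shows the textbook iteration only; it is not a model of the exact arithmetic performed by the 3DNow! step instructions, and the function names are hypothetical.

```c
/* One Newton-Raphson step toward 1/a, starting from the estimate x0. */
float refine_reciprocal(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}

/* Relative error of x as an estimate of 1/a: |1 - a*x|. */
float reciprocal_error(float a, float x)
{
    float e = 1.0f - a * x;
    return e < 0.0f ? -e : e;
}
```

Starting from the estimate 0.3 for 1/3, the relative error falls from about 0.1 to about 0.01 after one step and to about 0.0001 after two.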
Moving Data Between MMX™ and Integer Registers  
For the AMD Athlon processor, avoid moving data between  
MMX and integer registers or vice versa. If this cannot be  
avoided, use the MOVD instruction to accomplish the transfer,  
and do not pass the data manually through memory (except  
where the store can be scheduled at least 15 instructions ahead  
of the load).  
Store-to-Load Forwarding  
Avoid any store-to-load forwarding (store feeding into load) that  
does not have address and size matches. The only exception is a  
wide store feeding into a small load where the addresses match:  
movq [foo], mm0  
mov  eax, [foo]  
Here are some cases to avoid:  
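;two narrow stores feeding one wide load  
mov  [foo], eax  
mov  [foo+4], edx  
movq mm0, [foo]  
;wide store feeding a narrow load at a different address  
movq [foo], mm0  
mov  eax, [foo+4]  
;load spanning two stores  
movq [foo], mm0  
movq [foo+8], mm1  
movq mm2, [foo+4]  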
Block Copies  
For memory block copies on the AMD-K6-III processor, most  
code will have very similar performance for large blocks,  
because it is limited by the bus interface. For the AMD-K6-2  
processor, this was verified by creating multiple block copy  
types and discovering that there were insignificant  
performance differences. This is also true for block copies  
inside L2 (for off-chip L2). However, in L1-to-L1 block copies  
there can be a big difference.  
The following are measurements performed with an  
AMD-K6-2/300 on an Epox motherboard with VIA MVP3  
chipset and PC100 DRAM. Data blocks are QWORD aligned.  
              MSVC 5.0 memcpy()    aggressive MOVQ loop  
L1-to-L1      985 MB/s             1718 MB/s  
L2-to-L2      122 MB/s             124 MB/s  
mem-to-mem    71 MB/s              72 MB/s  
The L2-to-L2 and mem-to-mem throughput increases with the  
AMD-K6-III processor and further increases on the  
AMD Athlon processor.  
The aggressive MOVQ loop performs at a minimum as well as a  
memcpy(), and does much better for L1-to-L1 transfers. It is also  
preferable for copies to non-cacheable areas on the AMD-K6-III  
processor due to the doubled chunk size over the REP MOVSD  
inside the memcpy() function. For this reason, consider using it  
for all block copies. The code is as follows:  
_asm {  
    mov  eax, [src]  
    mov  edx, [dst]  
    mov  ecx, (SIZE >> 6)  
xfer:  
    movq mm0, [eax]  
    add  edx, 64  
    movq mm1, [eax+8]  
    add  eax, 64  
    movq mm2, [eax-48]  
    movq [edx-64], mm0  
    movq mm3, [eax-40]  
    movq [edx-56], mm1  
    movq mm4, [eax-32]  
    movq [edx-48], mm2  
    movq mm5, [eax-24]  
    movq [edx-40], mm3  
    movq mm6, [eax-16]  
    movq [edx-32], mm4  
    movq mm7, [eax-8]  
    movq [edx-24], mm5  
    movq [edx-16], mm6  
    dec  ecx  
    movq [edx-8], mm7  
    jnz  xfer  
}
Take care to make the label xfer: 32-byte aligned for maximum  
performance. As a side note, Microsoft Visual C 5.0 without  
Service Pack 3 appears to ignore align directives in inline  
assembly. This problem may not occur after applying Service  
Pack 3 to Microsoft Visual C 5.0.  
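For reference, the structure of the loop (64 bytes per pass, moved as eight QWORDs) can be written in portable C. This is a sketch under stated assumptions: copy_blocks64 is a hypothetical name, both buffers are QWORD aligned, the size is a multiple of 64, and the loads are simply grouped ahead of the stores instead of being interleaved as in the hand-scheduled assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy bytes (a multiple of 64) between QWORD-aligned buffers. */
void copy_blocks64(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    size_t blocks = bytes >> 6;          /* number of 64-byte chunks */
    while (blocks--) {
        /* Eight 8-byte loads ... */
        uint64_t q0 = src[0], q1 = src[1], q2 = src[2], q3 = src[3];
        uint64_t q4 = src[4], q5 = src[5], q6 = src[6], q7 = src[7];
        /* ... followed by eight 8-byte stores. */
        dst[0] = q0; dst[1] = q1; dst[2] = q2; dst[3] = q3;
        dst[4] = q4; dst[5] = q5; dst[6] = q6; dst[7] = q7;
        src += 8;
        dst += 8;
    }
}
```

A C version like this is mainly useful for checking results; the compiler decides the actual instruction selection and scheduling.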
Instruction Cache and Branch Prediction Effects  
Try different function orderings to see how they affect  
performance. There are sometimes interesting differences of  
several FPS (frames per second, a measure of the performance  
of graphics applications) based on function order.  
Instruction cache thrashing is one suspect in this. The other one  
is the branch prediction which has a global history component  
where branches can influence the prediction of other branches.  
Most of the time this helps. (Two branches might be closely  
correlated—if one is taken the other one is always not taken.)  
But it can also hurt, like all heuristic algorithms.  
In order to reduce the potential for instruction cache thrashing,  
group all the program’s hotspots close together. For example,  
extract all the performance-critical functions into a single file.  
Use the Linker  
There is another way to affect function ordering that may be  
more desirable. The linker allows the programmer to specify  
the exact order of every function in a DLL/executable as  
follows:  
1. All source code must be compiled with the /Gy switch. This  
creates packaged functions—a COMDAT record is emitted  
into the object file for each function.  
2. At link time, use the /ORDER:@filename switch to fix the  
order of functions in the DLL/executable. The term  
filename refers to a file that lists all function names in the  
order to be emitted, one function name per line. For C code  
it’s simply the function name as it appears in the source (no  
pre-pended underscore, no @xx suffix for Pascal calling  
convention).  
3. This does not work for object files produced by MASM.  
MASM doesn’t have a switch to create packaged functions,  
and it does not allow the user to create a COMDAT entry  
manually by putting COMDAT func into your source.  
To reduce potential problems due to branch prediction,  
eliminate as many branches as possible. The AMD-K6-III and  
AMD Athlon processors have large instruction caches, and  
aggressive loop unrolling (which increases the code size) helps.  
It is also worthwhile to eliminate branches which have small  
amounts of code, replacing the branches with in-line  
computation.  
Code Alignment  
To get 32-byte alignment in MASM 6.13, forgo the convenience  
of new-style segment declarations, and use something like the  
following:  
_TEXT SEGMENT PAGE PUBLIC USE32 'CODE'  
ASSUME CS:FLAT, DS:FLAT, SS:FLAT, ES:FLAT  
ALIGN 32  
_TEXT ENDS  
MASM may not allow ALIGN to be more restrictive than the  
SEGMENT alignment. If .CODE is used, the result is a PARA  
aligned segment—a 16-byte aligned segment.  
For inline assembly in Microsoft® Visual C, the best alignment is  
16-byte alignment by using align 16 in the inline assembly code.  
Microsoft Visual C 5.0 without SP3 ignores this directive, so  
check whether the alignment is actually there. Microsoft Visual  
C 4.2 seems to work in this regard. At present, correct operation  
of align under Microsoft Visual C 5.0 with SP3 has not been  
verified.  
For inline assembly in Metrowerks™ CodeWarrior Pro 4, align  
32 is accepted and works. See the specific vendor for more  
information.  
Software Write Combining  
Writes to non-cacheable space are an important issue for  
low-level drivers. Processors communicate with graphics chips  
through a command buffer on the graphics card which is  
mapped to non-cacheable PCI (or AGP) space. On a Pentium® II,  
this can be made high-performance by setting up that space as  
UCWC (non-cacheable write-combining), in which case the  
Pentium II does write-combining and even bursting to that  
space.  
AMD-K6-2 processors that predate the CXT core (CPUID less  
than 588h) do not support a UCWC memory type, and they  
neither perform write combining, nor do they burst to UC  
memory. AMD-K6-2 processors with the CXT core (CPUID 588h  
to 58Fh) and AMD-K6-III processors support write combining to  
non-cacheable space, but they are not able to use burst  
transfers when writing to non-cacheable memory areas.  
Also, AMD-K6-2 processors predating the CXT core do not  
pipeline writes to non-cacheable space well. This can create a  
bottleneck when a lot of data needs to be transferred to the  
graphics card, which for 3D graphics drivers happens  
predominantly in texture download and triangle download  
code. (These two can cover about 99% of all writes.) Therefore,  
for good performance with the millions of existing AMD-K6-2  
processors and even for AMD-K6-2 processors with the CXT  
core and AMD-K6-III processors, software needs to organize the
PCI or AGP writes carefully. Doing so can yield a performance
gain of around 20%.
This technique is called software write combining. The basic  
technique is to collect all writes to non-cacheable space into  
aligned QWORDs as much as possible. This is accomplished by  
using an MMX register as a write buffer and collecting DWORD  
writes using PUNPCK. Then store data out using aligned MOVQ  
stores. The following two basic approaches can align the  
QWORD writes:  
1. If there is a NOP command consisting of a single DWORD,  
which takes no processing time in the graphics chip, issue  
the NOP command if the buffer pointer is not QWORD  
aligned, then continue writing out QWORDs. This works if  
the DWORDs in the command buffer are at least DWORD  
aligned. It has the drawback of wasting some bandwidth for  
the NOP commands.  
2. We can split the code into two code streams. If the buffer  
pointer is not QWORD aligned, take path one and write the  
first chunk as a DWORD, then continue writing QWORDs. If
the buffer pointer is aligned, take path two and start  
writing out QWORDs immediately.  
In both cases there is the end case where we need to flush the  
write buffer (MMX register) at the end of the write loop.  
Option 2 is recommended for highest possible performance, but  
option 1 is often easier to implement and often provides similar  
performance.  
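The store pattern of option 2 can be modeled in portable C as shown below. This is a sketch under stated assumptions, not driver code: the names swc_copy and swc_stats are hypothetical, the two-element array pending stands in for the MMX register used as the write buffer, and memcpy of eight bytes stands in for an aligned MOVQ store. Real code would use the MMX instructions and write to the mapped command buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts the stores of each width so the pattern can be inspected. */
typedef struct {
    size_t dword_stores;
    size_t qword_stores;
} swc_stats;

/* Copy n DWORDs from src to dst, using QWORD-aligned 8-byte stores wherever
   possible. dst must be at least DWORD aligned. */
static swc_stats swc_copy(uint32_t *dst, const uint32_t *src, size_t n)
{
    swc_stats st = {0, 0};
    size_t i = 0;

    assert(((uintptr_t)dst & 3u) == 0);

    /* Path one: buffer pointer not QWORD aligned, so write the first chunk
       as a single DWORD to reach a QWORD boundary. */
    if (n > 0 && ((uintptr_t)dst & 7u) != 0) {
        *dst++ = src[i++];
        st.dword_stores++;
    }

    /* Path two (and the rest of path one): collect two DWORDs, as PUNPCK
       would, then write them with one aligned 8-byte store, as MOVQ would. */
    while (n - i >= 2) {
        uint32_t pending[2] = { src[i], src[i + 1] };
        memcpy(dst, pending, sizeof pending);
        dst += 2;
        i += 2;
        st.qword_stores++;
    }

    /* End case: flush the write buffer if one DWORD is left over. */
    if (i < n) {
        *dst = src[i];
        st.dword_stores++;
    }
    return st;
}
```

Copying five DWORDs to a QWORD-aligned pointer issues two QWORD stores and one trailing DWORD store. Copying the same data to a pointer that is only DWORD aligned issues one leading DWORD store followed by two QWORD stores.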
For AMD-K6-2 processors with the CXT core and for
AMD-K6-III processors, use software write combining and also
enable the hardware write-combining features of these
processors.
Addressing Modes on the AMD-K6®-2 and AMD-K6®-III  
Processors  
The addressing modes listed below are sub-optimal for all  
instructions. They degrade short-decoded instructions to vector  
decode (degrade to long-decode in the case of 3DNow!  
instructions). This is due to the lack of on-the-fly corrections to  
the instruction length that is computed during predecode.  
16-bit addressing: [SI], [SI+disp8], [SI+disp16], [DI]  
32-bit addressing: [ESI]  
The following addressing modes are sub-optimal for all
instructions with a 0Fh prefix (including all MMX/3DNow!
instructions). Again, they degrade short-decoded instructions to
vector decode (long decode in the case of 3DNow! instructions),
because the instruction length cannot be determined from the
first three bytes (0Fh prefix, opcode, ModR/M). Note:
This category has been eliminated in AMD-K6-2 processors with
the CXT core and in AMD-K6-III processors. However, millions of
existing AMD-K6-2 processors are affected by this issue, so it is
highly recommended to avoid these addressing modes.
1. ModR/M = 00_xxx_100b is the only ModR/M encoding that  
requires the SIB value to determine instruction length. For  
this ModR/M, the processor doesn’t know whether there is a  
disp32 or not until it looks at the SIB (which predecode  
cannot do in the case of MMX/3DNow!). For ModR/M =  
01_xxx_100b there is always a disp8, and for ModR/M =  
10_xxx_100b there is always a disp32, so the length can be  
determined from looking at ModR/M without looking at the  
SIB byte.  
2. This ModR/M encoding is encountered with the following  
source-level addressing modes:  
[base+scale×index]  
[scale×index+disp]  
[scale×index]  
[base+index]  
The following example demonstrates the ModR/M byte and SIB  
byte resulting from several addressing modes; note that the  
MOV instruction is not affected by the issue described here.
instruction             opc  mod  sib  disp
mov eax, [edx+8*esi]    8B   04   F2
mov eax, [4*esi+ebx]    8B   04   B3
mov eax, [8*edx]        8B   04   D5   00000000
mov eax, [edx+ebx]      8B   04   13
Note that the third addressing mode listed above, [scale×index],
is identical to the second, [scale×index+disp], as far as the
encoding is concerned (it is encoded as [scale×index+0]).
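The dependency on the SIB byte can be made concrete with a small decoder for 32-bit addressing. This is a minimal sketch following the standard x86 ModR/M and SIB field layout. The function names are hypothetical, and the code only illustrates the length rule described above, not the processor's predecode logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when ModR/M = 00_xxx_100b: a SIB byte follows, and whether a disp32
   is present cannot be known without looking at the SIB base field. */
static bool modrm_needs_sib_for_length(uint8_t modrm)
{
    return (modrm & 0xC7u) == 0x04u;
}

/* Displacement size in bytes for 32-bit addressing. The sib argument is
   only examined when ModR/M = 00_xxx_100b. */
static int disp_bytes32(uint8_t modrm, uint8_t sib)
{
    unsigned mod = modrm >> 6;
    unsigned rm  = modrm & 7u;

    if (mod == 3) return 0;              /* register operand, no memory  */
    if (mod == 1) return 1;              /* always a disp8               */
    if (mod == 2) return 4;              /* always a disp32              */
    if (rm == 5)  return 4;              /* mod 00, r/m 101: [disp32]    */
    if (rm == 4 && (sib & 7u) == 5u)     /* SIB base 101: no base, disp32 */
        return 4;
    return 0;
}
```

Applied to the four encodings in the example, only the pair 04 D5 (mov eax, [8*edx]) yields a 4-byte displacement. The other three yield none, although all four share the same ModR/M byte.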
Also, there is a length restriction. Any instruction longer than  
seven bytes cannot be short decoded. For MMX instructions,  
avoid addressing modes with SIB and 32-bit displacement. For  
3DNow! instructions, avoid all addressing modes with 32-bit  
displacement.  
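The two recommendations follow from simple byte counting, sketched below in C. The sketch assumes the usual encodings (MMX: 0Fh, opcode, ModR/M, optional SIB, displacement; 3DNow!: 0Fh 0Fh, ModR/M, optional SIB, displacement, one-byte opcode suffix) with no additional prefixes or immediates. The function names are hypothetical.

```c
#include <stdbool.h>

enum { SHORT_DECODE_MAX_BYTES = 7 };

/* MMX memory form: 0Fh + opcode + ModR/M + optional SIB + displacement. */
static int mmx_length(bool has_sib, int disp_bytes)
{
    return 2 + 1 + (has_sib ? 1 : 0) + disp_bytes;
}

/* 3DNow! memory form: 0Fh 0Fh + ModR/M + optional SIB + displacement
   + trailing opcode suffix byte. */
static int amd3dnow_length(bool has_sib, int disp_bytes)
{
    return 2 + 1 + (has_sib ? 1 : 0) + disp_bytes + 1;
}

/* Length test only; the addressing-mode restrictions still apply. */
static bool fits_short_decode(int length)
{
    return length <= SHORT_DECODE_MAX_BYTES;
}
```

An MMX instruction with a 32-bit displacement is 7 bytes without a SIB byte and 8 bytes with one. A 3DNow! instruction with a 32-bit displacement is already 8 bytes without a SIB byte.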