32bit x86 tips

===========================================================================
TITLE: x86 TIPS AND TRICKS
AUTHOR: LAURA FAIRHEAD
LAST MODIFIED: 04/00
===========================================================================

32bit CODE

I find it strange that an awful lot of ASM programmers who write real-mode code seem to think that they (i) have to be 8086 compatible and (ii) can only use 16-bit functions. The first point is just a disease, but the second I think is due to misinformation.

In real mode you have access to ALMOST all of the functionality of the processor; in fact you typically have MORE access than the average p-mode program (because the O/S will block you). Worse still, it has been common knowledge for a long time now that you can even access the WHOLE of memory (that is, ALL 4GB) without any help from special drivers. (Mind you, you do have to toggle that bit in CR0, albeit momentarily....)

Any assembler will assemble code that uses a 32-bit operand size or address size by adding the prefixes 066h and 067h for you. Of course this is the hit: you add one byte each time you access a 32-bit register, but the same goes if you were accessing a 16-bit register in 32-bit code.

Apart from using 32-bit registers/operands you also have access to a 32-bit address size. This holds so many benefits. For example, if you want to access some data on the stack you DON'T have to do that MOV BP,SP rubbish, just:-

    MOV AX,[ESP+somedisplacement]

TRICK #1675301

One of the notable deficiencies of the x86 architecture is that a MOV does not affect the flags. Having it set them would be the most natural state of affairs, since we would then have an instant test for zero/sign. One of the ways around this:-

    MOV reg,data
    AND reg,reg
    JZ wherever

becomes....

    ADD regX,data
    JZ wherever

where regX was = 0. Adding the data to a zeroed register is equivalent to a MOV that sets the flags for us. Do you like writing "cryptic" code ? :-

    XOR reg,reg
    ADD reg,data
    JZ wherever

So what's the point other than to make your code "unreadable" ? Aha!, secrets....

FLAGS

One of the keys to understanding assembly language is the flags. Efficient use of the flags is also the way to write very compact code. I remember the major block I had for grasping assembly was the conditional branch, in particular:-

    JE where_to_go

(It was actually BEQ, the equivalent 6502 instruction, but the principle is the same.) Here I remember asking the question: yes, but jump if equal WHAT??!

The fact is that most people seem to view CMP and JE/JNE as inseparable partners, and they are not. This is machine code; it is NOT a language, all we have is the machine state.

    CMP AL,1
    SBB AH,AH

Here the CMP sets the carry flag only if AL was 0, and the SBB then turns that carry into AH = 0FFh (or AH = 000h otherwise), with no conditional jump anywhere in sight.

There is a synonym in x86 for JE called JZ, and its partner JNE can be called JNZ. I always use these synonyms, for the basic reason that the name more accurately describes what the instruction actually does (as opposed to what it is generally used for). JZ => Jump if Zero: the JZ instruction branches iff ZF=1, there is no EQUAL !! So:-

    DEC AX
    MOV AX,[another_value]
    JZ gohereifaxwas1

will branch to gohereifaxwas1 if AX was equal to 1 at the start, since the DEC instruction will set the zero flag if it results in a 0, and the MOV doesn't affect the flags (many of the Intel instructions don't).

BT,BTC,BTS,BTR

When you have arrays where the elements can only take two possible values it is best to use bits. The x86 from the 386 onwards has very nice bit test instructions which allow you to address individual bits without messing about. God knows why, but a lot of ASM programmers are still programming 8086-compatible code; didn't someone tell them the 286 is dead?

    BT [mem],AX

tests the AX'th bit of a bit array that starts at mem. You have to use a 16/32-bit register, but otherwise this instruction is extremely useful; it sets the carry flag to the state of the bit. The others, BTC, BTS and BTR, also complement, set, and reset the respective bit (AFTER getting its previous state into the CF!). Do you know how many C instructions are needed to perform this operation ?

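As a rough answer to that last question, here is a minimal C sketch of what BT/BTS/BTR/BTC do to a byte array. The function names are hypothetical (they do not come from any library), and the sketch assumes a non-negative bit index, whereas the real instructions treat a register index as signed.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of BT: returns the state of bit `index` (what would land in CF). */
static int bit_test(const uint8_t *bits, size_t index)
{
    return (bits[index >> 3] >> (index & 7)) & 1;
}

/* Model of BTS: returns the previous state of the bit, then sets it. */
static int bit_test_and_set(uint8_t *bits, size_t index)
{
    int old = bit_test(bits, index);
    bits[index >> 3] |= (uint8_t)(1u << (index & 7));
    return old;
}

/* Model of BTR: returns the previous state of the bit, then clears it. */
static int bit_test_and_reset(uint8_t *bits, size_t index)
{
    int old = bit_test(bits, index);
    bits[index >> 3] &= (uint8_t)~(1u << (index & 7));
    return old;
}

/* Model of BTC: returns the previous state of the bit, then flips it. */
static int bit_test_and_complement(uint8_t *bits, size_t index)
{
    int old = bit_test(bits, index);
    bits[index >> 3] ^= (uint8_t)(1u << (index & 7));
    return old;
}
```

Each helper needs a shift, a mask, an index calculation and a read-modify-write, all of which the processor folds into the one instruction.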
BSWAP reg32

Although documented as being useful for endian conversion, this instruction (486 and later) also allows the programmer to make the hi-word of a 32-bit register more accessible.

    EAX before   1 2 3 4
        after    4 3 2 1

Of course you get all your bytes al-reverso, but if you were dealing with 4 bytes of data you now have the other 2 in AL and AH. For example, if you are copying DS:SI->ES:DI but you want each byte in the target to be 0FFh if the source was non-zero and 000h otherwise, it is best to do it in dwords:-

    LODSD
    NEG AL          ; CF=1 iff AL was non-zero
    SBB AL,AL       ; AL = 0FFh if CF=1, else 000h
    NEG AH
    SBB AH,AH
    BSWAP EAX       ; bring the upper two bytes down into AL/AH
    NEG AL
    SBB AL,AL
    NEG AH
    SBB AH,AH
    BSWAP EAX       ; restore the original byte order
    STOSD

Not the MOST efficient way, granted, but you can see my point. For further examples of BSWAP in action take a look at the graphic primitives in some decent demo source.

SIGN EXTENSION ENCODING

The instruction family CMP/SUB/OR/XOR/AND/ADD/SBB/ADC has an encoding (opcode 083h) where the immediate 8-bit source operand is sign-extended to the destination size. This can be used to set a word/dword memory location to 0 or -1 using fewer bytes than the equivalent MOV. Here xx stands for the ModR/M and address bytes, which are the same in both forms; unlike the MOVs, the OR/AND forms do alter the flags.

sets memory to -1
    83 xx FF                OR WORD PTR [mem],-1
    83 xx FF                OR DWORD PTR [mem],-1
equivalent MOVs:
    C7 xx FF FF             MOV WORD PTR [mem],-1
    C7 xx FF FF FF FF       MOV DWORD PTR [mem],-1

sets memory to 0
    83 xx 00                AND WORD PTR [mem],0
    83 xx 00                AND DWORD PTR [mem],0
equivalent MOVs:
    C7 xx 00 00             MOV WORD PTR [mem],0
    C7 xx 00 00 00 00       MOV DWORD PTR [mem],0

LAHF

LAHF/SAHF are under-used instructions which offer a wonderful service. LAHF copies the least significant byte of EFLAGS to the AH register; likewise SAHF copies the AH register to the least significant byte of EFLAGS. The least significant byte of EFLAGS is:-

    b7  b6  b5  b4  b3  b2  b1  b0
    SF  ZF  x   AF  x   PF  x   CF

So you can use this to save all of the general-purpose flags minus OF. This can be used instead of PUSHF/POPF in a lot of cases; you don't save any bytes, but your code will be quicker as you are not accessing memory anymore.

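To make the AH layout concrete, here is a small C sketch that works out the byte LAHF would deliver after an 8-bit ADD. It is only a model under stated assumptions: the FLAG_* names and lahf_after_add8 are hypothetical, and the reserved bits are taken as b5 = 0, b3 = 0 and b1 = 1, which is how they read on real hardware.

```c
#include <stdint.h>

/* Masks for the byte that LAHF places in AH (low byte of EFLAGS). */
enum {
    FLAG_CF = 0x01,  /* b0: carry           */
    FLAG_PF = 0x04,  /* b2: parity          */
    FLAG_AF = 0x10,  /* b4: auxiliary carry */
    FLAG_ZF = 0x40,  /* b6: zero            */
    FLAG_SF = 0x80   /* b7: sign            */
};

/* Hypothetical helper: the AH value LAHF would give right after
   ADD r8,r8 with operands a and b. OF is not part of this byte. */
static uint8_t lahf_after_add8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    uint8_t  r   = (uint8_t)sum;   /* the 8-bit result        */
    uint8_t  p   = r;              /* scratch for parity fold */
    uint8_t  ah  = 0x02;           /* b1 always reads as 1    */

    if (sum > 0xFF)                        ah |= FLAG_CF; /* carry out of bit 7 */
    if (((a & 0x0F) + (b & 0x0F)) > 0x0F)  ah |= FLAG_AF; /* carry out of bit 3 */
    if (r == 0)                            ah |= FLAG_ZF;
    if (r & 0x80)                          ah |= FLAG_SF;

    /* PF is set when the result byte holds an even number of 1 bits. */
    p ^= (uint8_t)(p >> 4);
    p ^= (uint8_t)(p >> 2);
    p ^= (uint8_t)(p >> 1);
    if ((p & 1) == 0)                      ah |= FLAG_PF;

    return ah;
}
```

For instance, adding 001h to 0FFh gives a zero result with carry, so the model reports CF, PF, AF and ZF set and SF clear.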
INC/DEC reg16/32

There is a special encoding for INC/DEC that only takes 1 byte. This is INC/DEC reg16 (or INC/DEC reg32 in 32-bit mode), whereas INC/DEC reg8 takes 2 bytes. There are many places where you can use this. Note that:-

    INC AX

is the same as (*)

    INC AL
    JNZ ko
    INC AH
ko:

So if we don't care about AH, we can use INC AX in place of INC AL, and save a byte.

(*) (well, okay, not the flags, but you get my point)

The same thing works for DEC:-

    DEC AX

is the same as

    DEC AL
    CMP AL,0FFh
    JNZ ko
    DEC AH
ko:

So if we don't care about AH, we can use DEC AX in place of DEC AL, and save another byte. And if you know that AL will never = 0FFh, then INC AX is doing the same operation as INC AL anyway.

EXTRA REGISTERS

Running out of registers? You are probably just not using them efficiently. So we have

    EAX,EBX,ECX,EDX,ESI,EDI,EBP

Yes, 32-bit. I only rarely even see people use these in 16-bit code; don't know why, because they are still there. Okay, so you waste a byte with that operand prefix, true. 7*32 bits, that's 224 bits, almost enough to write a tiny program!! Don't believe me? Think of a bit as a matchbox which can have a pebble in it or not; just imagine how much state information you have with 224 of these. Could you fit them all on the kitchen table? It's really amazing what you can do with just 3 registers, let alone, how many?? XCHG comes in very handy, you know.

TRAP #5466332

Oh! Do watch out! I didn't warn you: when an interrupt occurs the processor itself pushes only FLAGS, CS and IP, and handlers written as 16-bit code will typically preserve only the 16-bit registers. So if your interrupt routine uses any 32-bit registers it must save/restore them itself. Or you could catch a bug like Windoze did.

JUST PLAIN BAD ?

I am a horror really; do you know that I once used SP as an extra register. Naughty, naughty... Of course when you get an interrupt, SPLAT!, all that stack space used by the routine is actually writing over your most important data structures. Using SP/ESP is really fine though, if and only if you don't use the stack. So NO interrupts.

Even the hottest hackers start to cringe if they have to get this dirty, it's just PLAIN INELEGANT; however it is important to be aware of possibilities even if you are sure you are not going to use them.

SO ARE THERE REALLY ANY EXTRA REGISTERS ?

Oh yes, right in front of your eyes. DS, ES, FS, GS. 4 lovely 16-bit chunks of memory right in the processor core. Still, I'm not absolutely sure anymore whether it's worth it to use these. The way the processors are being built these days it would probably be (a lot) quicker to use the stack frame....

    MOV AX,0
lop:
    MOV FS,AX           ; park the loop counter in FS
    MOV AH,2
    MOV DL,030h
    INT 021h            ; DOS: write the character in DL
    MOV AX,FS           ; fetch the counter back
    INC AL
    JNO lop

Okay, so you've run out of registers in that inner loop. You're trying to blit your super-rotating-warping sprites at the speed of light and it's no good if we get all those L1 cache accesses (which do take TIME).

Other places? Well, how many of the registers you need are going to be constant for the period of the loop? I bet you there are at least a few. Here is where we get off:-

    *LOOP START*
    .
    .
    .
    MOV AX,[BX]
    ADD AX,CX
    .
    .
    .
    *LOOP END*

Say this is part of your loop, but you know that CX is never going to change. What a waste !!

    *INITIALISATION*
    .
    .
    MOV WORD PTR CS:[k1],konstant
    .
    .
    JMP SHORT $+2
    *LOOP START*
    .
    .
    .
    MOV AX,[BX]
    k1 EQU $+1          ; ADD AX,imm16 is opcode 05h + imm16
    ADD AX,0AA55h
    .
    .
    .
    *LOOP END*

You just saved a register. Note that k1 must point at the immediate itself: ADD AX,imm16 assembles to the short accumulator form, so the immediate starts 1 byte in; with any other register the ADD carries a ModR/M byte as well and the offset becomes $+2.

The JMP SHORT is a just-in-case: if the code at offset k1 in the loop is already in the prefetch queue when it is modified, you could end up with the processor modifying the memory but not the prefetch queue, so the first time around the loop executes improperly. This doesn't happen on the Pentium and later, since the Pentium flushes the prefetch queue if you write to code that is in it (oh, all those debugger traps that are now going wrong....).

LEA

LEA is useful for so many things; I find it quaint now that on first encountering it I thought all it was was a MOV instruction. LEA allows you to access no fewer than 3 adders in the CPU simultaneously.
One of the terms being added (the index register, in the 32-bit address format) can be scaled by 1, 2, 4 or 8, allowing multiplication. There are two address formats with x86, 16-bit and 32-bit, and you can use BOTH regardless of the code size.

    LEA reg16/32,EA

If the address is 32-bit and the operand size is only 16-bit, the effective address gets truncated into the destination. If the address is 16-bit and the operand size is 32-bit, then the effective address gets zero-extended into the destination. This latter point allows you to get a MOVZX instruction on immediate values, a la:-

    LEA EAX,[01234h]

which is equivalent to:-

    MOV EAX,01234h

only the LEA takes a byte less, as the data is represented as a WORD only.

LEA can give you various multiples of a register:

    LEA EAX,[EAX*2]         *2
    LEA EAX,[EAX*2+EAX]     *3
    LEA EAX,[EAX*4]         *4
    LEA EAX,[EAX*4+EAX]     *5
    LEA EAX,[EAX*8]         *8
    LEA EAX,[EAX*8+EAX]     *9

Of course the source and destination don't have to be the same, so you can get an extra MOV out of it. In addition you have the displacement factor, so instead of:-

    MOV EBX,EAX
    ADD EAX,EAX
    ADD EAX,EBX
    ADD EAX,01234h

you can do:

    LEA EAX,[EAX*2+EAX+01234h]

Pretty impressive ! LEA doesn't affect the flags, so if you need to add without affecting flags here is your instruction.

    CMP [SI],EAX
    LEA SI,[SI+4]
    JZ wherever

As an aside, it is at this point where MASM becomes rather embarrassed. It will not assemble the following instruction:-

    LEA EAX,[01234h]

Instead it requires you to put:-

    LEA EAX,DWORD PTR DS:[01234h]

which makes your code look right up the garden path (what the hell do the DWORD and DS: have to do with this instruction??). But then again MASM specializes in bulk and gimmicks rather than precision in functionality.
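The arithmetic LEA performs can be modelled in a few lines of C. This is only a sketch: lea32 and lea32_addr16 are hypothetical names, the first standing for the 32-bit address form (base + index*scale + displacement, wrapping at 32 bits) and the second for a 16-bit address with a 32-bit destination (the sum wraps at 16 bits and is then zero-extended).

```c
#include <stdint.h>

/* Model of LEA reg32,[base + index*scale + disp] with a 32-bit address.
   scale is assumed to be 1, 2, 4 or 8; the sum wraps modulo 2^32. */
static uint32_t lea32(uint32_t base, uint32_t index,
                      unsigned scale, uint32_t disp)
{
    return base + index * (uint32_t)scale + disp;
}

/* Model of LEA reg32,[16-bit address]: the effective address is
   computed modulo 2^16, then zero-extended into the destination. */
static uint32_t lea32_addr16(uint16_t base, uint16_t index, uint16_t disp)
{
    return (uint16_t)(base + index + disp);
}
```

With these, lea32(x, x, 2, 0x1234) gives the same value as the four-instruction MOV/ADD/ADD/ADD sequence, that is x*3 + 01234h, and lea32_addr16(0, 0, 0x1234) mirrors the LEA EAX,[01234h] trick.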

Welcome to sxlist.com!
