Angelo Ken Pesce (ken@uniserv.uniplan.it)
MODE 13H OPTIMIZATION TUTORIAL
------------------------------------------------------------------------------
' PowerBasic Mode13h optimization tutorial 1:
' HOW TO OPTIMIZE THE PUTPIXEL ROUTINE
' BY Angelo KEN Pesce
' ken@uniserv.uniplan.it
' ________________________________________________________________
' If you find this tutorial useful please send me some COMMENTS!!!
' ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

' Ok, you're going to read my first mode13h optimization tutorial
' but before you start you should read the...

' HARDWARE/SOFTWARE/KNOWLEDGE REQUIREMENTS:
'   A brain
'   A little mode 13h knowledge
'   A little asm knowledge (for chapter 3/4/5a)
'   Powerbasic 3.2+ (Hehe!!!)
'   386+ CPU (286 is needed for pp4, 386 for pp5)

' Everything ok??? So let's start!!!

'INDEX: ---------------->>>
'1) USING INTERRUPT 10h
'2) THE VGA MEM
'3) INLINE ASM POWER
'4) SHL vs MUL: asm optimization
'5) LOOK-UP TABLES

' 1) USING INTERRUPT 10h ********************************
' A simple method to put a pixel (in any vga screen mode) is using int 10,C:
' -----------------------------------------------------------
' INT 10,C - Write Graphics Pixel at Coordinate
' AH = 0C
' AL = color value (XOR'ED with current pixel if bit 7=1)
' BH = page number, see VIDEO PAGES
' CX = column number (zero based)
' DX = row number (zero based)
' -----------------------------------------------------------
sub PP1(x as integer,y as integer,col as byte)
 regax%=col+&h0c00 :' Remember:col is a byte so it can't modify &h0c (ah)
 reg 1,regax% :' AX
 reg 3,x :' CX
 reg 4,y :' DX
 call interrupt &h10
end sub

' A simple test:
' Init mode 13h
mode13h
' Call pp1 sub
pp1 10,10,15
beep
sleep
' Works!!!!! But using interrupts is VERY slow...
' Let's try to fill a page using pp1:
for x%=0 to 319
 for y%=0 to 199
  pp1 x%,y%,14
 next y%
next x%
beep
sleep
' We can't do a realtime game or demo with pp1, we have to use another
' method...
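
' If you want a number instead of a feeling, you can time the fill. This is
' only a sketch: it assumes PB's TIMER function (seconds since midnight, in
' ticks of about 1/18 s) is precise enough for a loop this slow, and t0!/t1!
' are names made up for the example. The result is printed after switching
' back to text mode 3 with int 10h, because PRINT may not show up correctly
' in a video mode that was set behind PB's back.
t0!=timer
for x%=0 to 319
 for y%=0 to 199
  pp1 x%,y%,14
 next y%
next x%
t1!=timer
reg 1,&h0003 :' AH=00 (set video mode), AL=03 (80x25 text)
call interrupt &h10
print "pp1 fill took about";t1!-t0!;"seconds"
sleep
mode13h :' back to mode 13h for the next tests
' The same wrapper should work around any of the putpixel subs in this
' tutorial, so you can compare them on your own machine.
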
' 2) THE VGA MEM ****************************************
' A faster way to put a pixel on the screen is direct vga mem access.
' Vga mem. is located at A000:0000h and we know that mode 13h is 320x200x8
' (x res, y res, color depth) so we can easily write a faster putpixel using
' def seg and poke.
sub PP2(x as integer, y as integer, col as byte)
 def seg=&hA000 :' VGA mem segment
 poke y*320+x,col :' y*320+x = pixel offset
end sub
' Now we can clear the screen using the new (and faster) pp routine...
for x%=0 to 319
 for y%=0 to 199
  pp2 x%,y%,0
 next y%
next x%
beep
sleep
' But this is still too slow for real game & demo coding stuff...
' We can optimize the mult using shift left and put the def seg into
' the mode13h init sub, but it's better to move the entire routine to...

' 3) INLINE ASSEMBLER POWER *****************************
' Ok... Let's convert pp2 to inline asm...
' STARTING CONVERSION...
' >>> sub PP2(x as integer, y as integer, col as byte) <<<
sub PP3(byval x as integer, byval y as integer, byval col as byte)
' >>> def seg=&Ha000 <<<
 ! mov ax,&ha000
 ! mov es,ax
' >>> poke y*320+x,col <<<
' >> step1: y*320 <<
 ! mov ax,y
 ! mov bx,320
 ! mul bx
' >> step2: +x <<
 ! add ax,x
' >> step3: poke offset,col <<
 ! mov di,ax
 ! mov al,col
 ! mov es:[di],al ; FASTER THAN STOSB
' >>> end sub <<<
end sub
' CONVERSION FINISHED!!!
' Now we can perform our "standard" fill screen test:
for x%=0 to 319
 for y%=0 to 199
  pp3 x%,y%,y%
 next y%
next x%
beep
sleep
' Uhmmm... Fast... But we can do this better!!!

' 4) SHL vs MUL: asm optimization ***********************
' The slow part of pp3 is the mul (*320) op. It takes up to 26 cycles on a
' 486!! But... wait a minute... we can replace mul using 2 shl (shift left)!
' SHL/SHR Example:
'
'            ________            EasyBinDecod (TM) Table ;)
'            1|||||||
'            2631||||
'            84268421
'            ||||||||
'   1 dec=   00000001 bin        == (original number)
' SHL ,1     <<<<<<<<            == (shl ,1= *2)
'            00000010 = 2 dec    == (result 1)
' SHL ,2     <<<<<<<<            == (shl ,2= *4)
'            00001000 = 8 dec    == (result 2)
' SHR ,3     >>>>>>>>            == (shr ,3= /8)
'            00000001 = 1 dec    == (result 3)

' Understood???? So:
' (y*256)+(y*64)=y*(256+64)=y*320 !!!!
' ^^^^^^^ ^^^^^^
' Shift   Shift
' Left    left
' y,8     y,6

' But there's a problem... With pb 8086 inline asm we can only do shl dest,cl
' or shl dest,1... To do shl dest,imm we must use 286 opcodes...

' SHL dest,8 done using 8 shl dest,1
' shl dest,1 = 3 cycles on a 486
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' SHL dest,6 done using 6 shl dest,1
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' shl dest,1 = 3 cycles
' total: (8*3) + (6*3) = 42 cycles to do 2 shl...
' Mul was only 26 cycles, so this isn't a good optimization...

' Let's try using shl dest,cl
' mov cl,8    = 1 cycle
' shl dest,cl = 3 cycles
' mov cl,6    = 1 cycle
' shl dest,cl = 3 cycles
' total: (1*2) + (3*2) = 8 cycles!!! PRETTY FAST!!!
' But we can remove those mov cl,source too!!!
' Using a good disassembler like hiew 5.40 we can look up the opcode bytes of
' shl ax,imm and code our FOURTH PP ROUTINE:
sub PP4(byval x as integer, byval y as integer, byval col as byte)
 ! mov ax,&ha000
 ! mov es,ax
 ! mov di,x
 ' AX=Y*64
 ! mov ax,y
 ' shl ax,6
 ! dw &he0c1 ;shl ax
 ! db 6 ;,6
 ! add di,ax ; DI=(DI+AX)=(X+(Y*64))
 ' AX=Y*64*4=Y*256
 ' shl ax,2
 ! dw &he0c1 ;shl ax (DW &he0c1= DB &hc1,DB &he0)
 ! db 2 ;,2
 ! add di,ax ; DI=(DI+AX)=((X+(Y*64)+(Y*256)))=(X+Y*320)
 ! mov al,col
 ! mov es:[di],al
end sub
' Another test (Yawn...)
for x%=0 to 319
 for y%=0 to 199
  pp4 x%,y%,x%
 next y%
next x%
beep
sleep
' PP4 takes only 10 cpu-cycles!!!
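
' The (y*256)+(y*64) identity is easy to check in plain PB before trusting it
' in asm. A small sketch, meant to be run on its own in text mode: the
' multiplications by 256 and 64 stand in for shl 8 and shl 6, and long
' integers (y&) are used only so that nothing overflows. The names y& and
' bad% are made up for this example.
bad%=0
for y&=0 to 199
 if (y&*256)+(y&*64)<>y&*320 then bad%=bad%+1
next y&
print "rows where the shift sum differs from y*320:";bad%
' The biggest offset is 199*320+319=63999, so it still fits in a 16-bit
' register like DI.
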
' 5) LOOK UP TABLES *********************************
' Another method to optimize mul is using look up tables...
' A Look Up Table is a precalculated array that replaces time-wasting
' ops like mul, sin, cos etc...
' You can use lookup tables with PB pointers or inside an asm
' routine (I don't use pointers for my progs so I don't know which method
' is faster...)

' PP5/NO POINTERS VERSION ************
' INIT YMULT LOOKUP TABLE (put this in your init mode13h code...)
dim Yarr(199) as word
yseg=varseg(Yarr(0))
for I=0 to 199
 Yarr(i)=i*320
next

' NOTE: the asm below reads the table at fs:[bx] with bx=y*2, so it assumes
' that Yarr(0) starts at offset 0 of the segment passed in yarsg.
sub pp5(byval x as integer, byval y as integer,byval col as byte,_
        byval yarsg as word,byval vgaseg as word)
 ! mov es,vgaseg ;VgaSeg is needed in this routine, so you can do double
 ' buffering too!!!
 ! mov cx,yarsg
 ! mov di,x
 ! mov bx,y
 ! add bx,bx ; BX=y*2 (each table entry is a word)
 ' mov fs,cx   FS isn't used by pb... We don't need to push/pop
 ! db &h8e
 ! db &he1
 ' add di,[fs:bx]
 ! db &h64
 ! db &h03
 ! db &h3f
 ! mov al,col
 ! mov es:[di],al
end sub
' Try and guess... A test routine!!!!
for x%=0 to 319
 for y%=0 to 199
  pp5 x%,y%,y%,yseg,&ha000
 next y%
next x%
beep
sleep
' PP5 ISN'T PERFECT AND *CAN* BE OPTIMIZED... When I find how, I'll tell you ;-)
' I'm not sure that the current pp5 is faster than pp4...

' PP5/POINTERS VERSION *************** (not tested, not optimized, never used...)
' LOOKUP TABLE OF THE POINTER VERSION OF PP5
dim parr(199) as shared byte ptr
dim ptemp as shared byte ptr
for i=0 to 199
 parr(i)=&Ha0000000+(i*320)
next
sub pp5p(x as integer, y as integer, col as byte)
 ptemp=parr(y)
 incr ptemp,x
 @ptemp=col
end sub
' Ahhh... THIS IS THE LAST TEST ROUTINE OF THIS TUT. ENJOY!!!!
for x%=0 to 319
 for y%=0 to 199
  pp5p x%,y%,0
 next y%
next x%
beep
sleep
END
' That's all folks...
' In tutor 2 I'm going to talk about:
' HOW TO OPTIMIZE BIG VGAMEM WRITES:
' THE X-Y LINE ROUTINE
' THE FILLSCREEN ROUTINE (386 code... And maybe fpu-tricks)
' AND (maybe) THE flat TRIFILL ROUTINE!!! (If I debug it)

' MODE 13h INIT SUB
sub mode13h
 ! mov ax,&h13 ; AL=vga mode AH=00=Int 10 function number
 ! int &h10
end sub