with debugging output:
1057/30 sec = 35fps    with nodisplay in UpdateDisplayDC
300/30 sec = 10fps     normally
380/30 sec = 12fps     without waitvbl + pageflip
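
The figures above are frame counts divided by a 30-second run. A minimal sketch of that arithmetic in C follows. The helper name `fps` is an assumption made for illustration and is not taken from the emulator source.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from the emulator): frames per second from a
 * frame count and an elapsed time in seconds. Returns 0 for a
 * non-positive time rather than dividing by zero.
 */
static double fps(long frames, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)frames / seconds;
}

/* Print one line in roughly the "frames/time = fps" style used above. */
static void print_fps(const char *label, long frames, double seconds)
{
    printf("%ld/%.0f sec = %.2ffps    %s\n",
           frames, seconds, fps(frames, seconds), label);
}
```

Called as `print_fps("normally", 300, 30.0)`, this gives 10 fps; the notes above appear to drop the fractional part (1057/30 is about 35.2, 380/30 about 12.7).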

4/10 24.5fps normal
7/20 59.72fps normal SMB1 ;)

1/11/02 (scores are for gcc-0312; gcc-3.0.3 scores in parens)
        42.2  fps w/sound cvania 46.5 w/o sound (soundname="none") (42.2/46.7) [42.5/]
        46.7  fps w/sound zelda  51.2 w/o sound (45.95/50.7)
        42.2  fps w/sound zelda2 46.1 w/o sound (40.75/44.6) [41.18/]
        45.25 fps w/sound smario 49.2 w/o sound (44.04/48.2)
     gcc-3.0.3 seems to lose us between 0 and 1.5 fps depending on the game.

1/12 add about 1-2 fps to the above due to use of a memory write technique that
     omits "bit" in pixels.h

1/12 with new bgmask_ptr handling, additional speedup of 3-4 fps
     zelda2 46.35 w/sound 50.72 w/o sound [47.13 w/o before]
     cvania 49.20 w/      54.11 w/o       [50.3 w/o before]
     smario 50.43 w/      54.96 w/o
     zelda  52.17 w/      56.68 w/o

from cvania:
frames: 1461
time  : 37.54 sec
fps   : 38.91
Profiles:
000: calls 000003350 ticks 000000022667045 (29.01 sec)  [drawimage, total]
001: calls 000350640 ticks 000000003955460 (5.06 sec)   [drawimage, sprites]
002: calls 000001461 ticks 000000000001112 (0.00 sec)   [UpdateAudio (off)]
004: calls 000003350 ticks 000000000008528 (0.01 sec)   [drawimage, init]
006: calls 001118124 ticks 000000005208807 (6.66 sec)   [nes6502_execute]
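
Each row of the dump above pairs a slot number with a call count and an accumulated tick count. A minimal sketch of how such counters could be kept is shown below. This is not the emulator's actual profiling code: the names `profile_start`, `profile_stop`, `profile_print` and the use of the standard `clock()` as the tick source are all assumptions.

```c
#include <stdio.h>
#include <time.h>

#define NUM_PROFILES 10   /* assumption: slots 000..009 as in the dump */

/* One accumulator per numbered profile slot (hypothetical layout). */
struct profile_slot {
    unsigned long      calls;    /* completed start/stop pairs */
    unsigned long long ticks;    /* accumulated clock() ticks  */
    clock_t            started;  /* tick value at last start   */
};

static struct profile_slot profiles[NUM_PROFILES];

/* Mark the start of the code region measured by slot `id`. */
static void profile_start(int id)
{
    profiles[id].started = clock();
}

/* Mark the end of the region: add elapsed ticks, count one call. */
static void profile_stop(int id)
{
    profiles[id].ticks += (unsigned long long)(clock() - profiles[id].started);
    profiles[id].calls++;
}

/* Print one slot in the same column layout as the dump above. */
static void profile_print(int id, FILE *out)
{
    fprintf(out, "%03d: calls %09lu ticks %015llu (%.2f sec)\n",
            id, profiles[id].calls, profiles[id].ticks,
            (double)profiles[id].ticks / CLOCKS_PER_SEC);
}
```

Wrapping a region in `profile_start(6); ... profile_stop(6);` charges its time to slot 6. Every start/stop pair costs time itself, which matters when a slot sits inside a loop that runs millions of times.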

frames: 1213
time  : 38.29 sec
fps   : 31.67
Profiles:
000: calls 000002694 ticks 000000023329442 (29.86 sec)
001: calls 000291120 ticks 000000000059197 (0.07 sec)
002: calls 000001213 ticks 000000000000942 (0.00 sec)
003: calls 009385608 ticks 000000005358337 (6.85 sec)
004: calls 000002694 ticks 000000000006620 (0.00 sec)
005: calls 000002694 ticks 000000023320546 (29.85 sec)
006: calls 000900483 ticks 000000005397218 (6.90 sec)
007: calls 009385608 ticks 000000011058497 (14.15 sec)
008: calls 000291120 ticks 000000000260964 (0.33 sec)
009: calls 000291120 ticks 000000022663591 (29.00 sec)


1: surrounding sprite draw
5: outside while(curclock<endclock) loop
3: inside while (hposition < HCYCLES) loop--around per-tile calcs
7: inside while (hposition < HCYCLES) loop--around per-tile memory write
8: end of line code at end of while(curclock<endclock) loop
9: outside while (hposition < HCYCLES) loop (which should show profiling overhead)
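   in this case it's high because the loop body runs about 9.4 million times
   (9385608 calls each for 3 and 7)
   3+7 (21.00 sec) should equal 9 (29.00 sec) minus profiling overhead
   this shows memory writes take up about 67% of the loop, calcs about 33%
   (14.15 and 6.85 of 21.00 sec)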
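
The split between calcs and memory writes can be reproduced from the seconds reported for slots 3, 7 and 9. The sketch below is only a worked version of that arithmetic; the struct and function names are made up for illustration.

```c
#include <stdio.h>

/* Result of splitting an outer timing into two inner timings. */
struct loop_breakdown {
    double calc_share;    /* slot 3 as a fraction of (3 + 7)       */
    double write_share;   /* slot 7 as a fraction of (3 + 7)       */
    double overhead_sec;  /* slot 9 minus (3 + 7): time unaccounted */
};

/*
 * Hypothetical helper: calc_sec is slot 3, write_sec is slot 7,
 * outer_sec is slot 9, all in seconds.
 */
static struct loop_breakdown break_down(double calc_sec, double write_sec,
                                        double outer_sec)
{
    struct loop_breakdown b = {0.0, 0.0, 0.0};
    double inner = calc_sec + write_sec;

    if (inner > 0.0) {
        b.calc_share  = calc_sec / inner;
        b.write_share = write_sec / inner;
    }
    b.overhead_sec = outer_sec - inner;
    return b;
}

/* Print the breakdown as percentages plus the leftover seconds. */
static void print_breakdown(struct loop_breakdown b)
{
    printf("calcs %.1f%%  writes %.1f%%  unaccounted %.2f sec\n",
           b.calc_share * 100.0, b.write_share * 100.0, b.overhead_sec);
}
```

With the values from the second dump, `break_down(6.85, 14.15, 29.00)` gives roughly a 33/67 split and about 8 seconds left over, which the notes attribute to profiling overhead.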

