globals found, along with where they're assigned to, and
maybe where they're accessed:

MAPTABLE[]
XMAP
STACK
STACKPTR
--- CTNI
see VBL
...
---
VFLAG
FLAGS
PCR
*XPC
ZP
INRET
VRAMPTR
VRAM[]
RAM[]
Mapper[]
DMOD
LASTBANK
OUTPUT()
U()
-- irqflag 
  checked in NMI:
  set in NMI:irq:
  set in mapper.c in MMC3: looks like this is the only one that uses it
--

--- NMI()
x86.S CTNI,CLOCK,CPF,VFLAG,STACKPTR,FLAGS,MAPTABLE

ESI=7;
ESI -= CTNI;
ESI += CLOCK;
xxx ESI -= CPF;
  sbb ecx, ecx         (ecx was scratch; ecx is now -1 or 0)
  and CPF, ecx         (ecx is now CPF or 0)
  add ecx, esi         (esi += CPF, or esi += 0)  carry discarded
--> if (ESI >= CPF) { ESI -= CPF; }
CLOCK = ESI;

==   CLOCK = (7 - CTNI + CLOCK) % CPF;

if (irqflag) goto irq;
if (!donmi()) goto skipint;
/* do the actual NMI code */
...
ESI=CTNI

...
we can skip irqflag handling for now, only mmc3 uses it
--- if irqflag not handled, smb3 cannot display bottom 1/5 of screen

/* Normal nmi handling, according to m6502.c */
/* push high byte then low byte of PC */
/* push processor flags, but with B flag zero */
/* set decimal flag to 0  (CLD) */
/* jump to 0xFFFA */

popa/pusha
/* Push the return address */
EAX=STACKPTR
*EAX=BH  (*(char *)STACKPTR   = BH)
EAX--
EAX |= 0x100   (set bit 8 (0x100) of EAX)   (ensure if stack < 0x100, we wrap around to 0x1ff)
*EAX=BL  (*(char *)STACKPTR-1 = BL)
EAX--
/* End push return address */
/* php ? */
FLAGS &= 0x3C   (zero out N,V,Z,C, which are stored elsewhere)
EBX=0
add $0xffffffff,%ebp;mov _FLAGS,%ebp;adc $0,%ebp /* store FLAGS in EBP, set C if EBP was nonzero */
    /* I'm not sure why shr %ebp wasn't used instead of add; difference is shr will trigger if %ebp was odd; add if %ebp is nonzero */
testl $0xff,%edi; setz %bl; shll %ebx; orl %ebx,%ebp /* set bit 1 (Z) in %ebp if %edi zero flag set */
EBX=EDI; EBX>>=1; EBX &= 0x80; EBP |= EBX /* get N flag from %edi bit 8 into %ebp bit 7 */

EBX=VFLAG   /* not sure how vflag works, seems bit 10 is relevant */
EBX += 0x80
EBX += 0xffffff00   (looks like overflow if VFLAG >= 0x80)
sbbl %ebx,%ebx  (%ebx is 0 if !V or -1 if V)
and $0x40,%ebx (%ebx bit 6 is now V)
orl %ebp,%ebx  (combine ebp into ebx; ebx holds all flags)

*EAX=BL  (*(char *)STACKPTR-2 = processor flags)
EAX--

popa
movl $0,-4(%esp) -- move 0 into top-of-intel-stack dword, but don't modify stack pointer -- prob. for U:
   -- U moves this into DMOD (dest addr to modify) -- something to do with "linking" or debugging in U -- but this code skips null address and so that code is not executed when movl $0,-4(%esp)
ESI=CTNI
FLAGS=(FLAGS & 0x2C) | 0x04;  (looks like break flag is cleared--isn't this supposed to be done on the stack?)
EBX=(zero-extend-int)(MAPTABLE[15]+0xFFFA)   (jump to nmi vector)
goto U

skipint:
    popa
	movl $0,-4(%esp)
	ESI=CTNI
	goto U

irq:
... looks almost identical to nmi 
---
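
The php/flag-packing steps in the NMI notes, written out as plain C to check my reading of the asm.  This is only a sketch: pack_flags and its parameter names are made up for illustration (carry_src = old %ebp, nz = %edi, vflag = VFLAG), and the V rule is just my interpretation of the add/add/sbb sequence, not confirmed against x86.S.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: builds the 6502 P byte the way the NMI notes
 * describe it.  Names are illustrative, not taken from x86.S. */
static uint8_t pack_flags(uint32_t flags, uint32_t carry_src,
                          uint32_t nz, uint32_t vflag)
{
    uint32_t p = flags & 0x3C;              /* drop N,V,Z,C (kept elsewhere) */
    uint32_t t = vflag + 0x80;              /* first add of the V test */

    p |= (carry_src != 0) ? 0x01u : 0u;     /* C: add $-1 carries iff nonzero */
    p |= ((nz & 0xFF) == 0) ? 0x02u : 0u;   /* Z: low 8 bits all zero */
    p |= (nz >> 1) & 0x80;                  /* N: nz bit 8 -> P bit 7 */
    p |= (t >= 0x100) ? 0x40u : 0u;         /* V: t + 0xffffff00 carries iff t >= 0x100 */

    assert((p & ~0xFFu) == 0);              /* result must fit in one byte */
    return (uint8_t)p;
}
```

With vflag = 0x80 this sets V and with 0x7f it does not, which matches the "overflow if VFLAG >= 0x80" guess.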

--- INPUT()
x86.S
ECX=ESI
ESI -= CTNI
CTNI=ECX
ESI += CLOCK
   sub CPF, esi 
   sbb ecx, ecx
   and CPF, ecx
   add ecx, esi     
-> if (ESI < CPF) { ESI -= CPF; ESI += CPF } else { ESI -= CPF }
--> if (ESI >= CPF) { ESI -= CPF; }   // I think this is a modulus, ESI %= CPF

CLOCK=ESI
input(%ebx == addr)
does input use %ecx? no, clobbered
  input() (and only input()) modifies INRET
ESI=CTNI
sign extend INRET to %edi (set/clear 6502 N and Z flags)
---
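
The sub/sbb/and/add sequence in C, as a sketch to settle the "is this a modulus" question.  wrap_once is a made-up name; CPF is the consts.h value.  It is a single conditional subtraction, so it only equals ESI % CPF while ESI < 2*CPF (the assert states that assumption).

```c
#include <assert.h>
#include <stdint.h>

#define CPF 29781u   /* cpu cycles per frame, from consts.h */

/* sub CPF,esi ; sbb ecx,ecx ; and CPF,ecx ; add ecx,esi */
static uint32_t wrap_once(uint32_t esi)
{
    uint32_t borrow = (esi < CPF);   /* carry flag after the sub */
    uint32_t mask   = 0u - borrow;   /* sbb ecx,ecx: all ones or zero */

    assert(esi < 2u * CPF);          /* beyond this it is no longer % */
    esi -= CPF;                      /* sub (may wrap below zero) */
    esi += mask & CPF;               /* add CPF back only if it borrowed */
    return esi;
}
```

The branchless form avoids a conditional jump in the asm; in C a plain "if (esi >= CPF) esi -= CPF;" does the same job.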

--- OUTPUT()
oldESI = ESI
ESI = ESI - CTNI + CLOCK;
CTNI = oldESI
ESI %= CPF
CLOCK = ESI  (if we went past frame, start CLOCK over)
output(%ebx,%eax == addr,val)
ESI = CTNI  (fxn does not preserve ESI, above could be written with locals)
    curESI = ESI;
	curESI = curESI - CTNI + CLOCK;
	CTNI = ESI;
	curESI %= CPF;
	CLOCK = curESI;
	ESI = CTNI;

where do we see this pattern: [INPUT,OUTPUT,NMI,...]

---
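
A possible C shape for the shared CTNI/CLOCK/CPF preamble used by INPUT and OUTPUT, as a sketch only.  io_preamble is a hypothetical name, the globals here are local stand-ins for the real ones, and I assume one subtraction of CPF is enough (see the assert).

```c
#include <assert.h>
#include <stdint.h>

#define CPF 29781            /* cpu cycles per frame, from consts.h */

static int32_t CTNI, CLOCK;  /* stand-ins for the real globals */

/* Hypothetical shim preamble.  Takes the current cycle counter (ESI),
 * updates CLOCK/CTNI, and returns the value ESI should hold afterwards. */
static int32_t io_preamble(int32_t esi)
{
    int32_t cur = esi - CTNI + CLOCK;   /* cycles elapsed + old clock */

    assert(cur >= 0 && cur < 2 * CPF);  /* assumption: at most one wrap */
    CTNI = esi;
    if (cur >= CPF)                     /* went past frame: start over */
        cur -= CPF;
    CLOCK = cur;
    return CTNI;                        /* ESI = CTNI */
}
```

The caller would do ESI = io_preamble(ESI) and then call input()/output() proper.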

--- PPF
consts.h #define PPF 89342 (ppu cycles per frame)
used to set CTNI in mapper.c
---

--- CPF
consts.h #define CPF 29781
x86.S NMI, INPUT, OUTPUT, MMC1, MMC* (always same access pattern)
mapper.c  used to set CTNI
io.c      CTNI = -CPF + 7;
---

--- HCYCLES
consts.h #define HCYCLES 341
---

--- CLOCK
io.c  CLOCK  = VBL+7
io.c  CLOCK += 514
x86.S ???
---

--- VBL
consts.h #define VBL 27428    (cpu cycles until vblank interrupt)
x86.S, in irq:   CTNI = CLOCK - VBL;
x86.S, in START: ESI = 0; CLOCK = ESI; ESI -= VBL; CTNI = ESI; /* CTNI = CLOCK - VBL; */
---






--------


[Z+1] used to access Zero Page; prob. endianness issue or word-size
movsbl _INRET,%edi   copies and sign extends byte _INRET to dword %edi
                     normally sign compare instructions test bit 8 (0x100) of edi ??
                     and arith instructions do movsbl %al,%edi -- why would bit 8 be
                     changed by this? -- ah, sign bit extended into all higher bits of course!    0x80 -> 0xffffff80
                     INRET is signed int !!
                     So we don't have to worry about INRET return value except that it should be sign extended to set/clear the 6502 sign flag (%edi bit 8 is sign, bit 7-0 are combined zero)
                     this should be automatically taken care of by the cpu core
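
The movsbl trick in C, as a sketch of how one sign-extended value carries both N and Z.  The function names are mine; the bit positions (bit 8 = sign, bits 7-0 = zero test) are the ones from the note above.

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of movsbl: sign-extend a byte to 32 bits. */
static int32_t nz_from_byte(uint8_t v)
{
    int32_t r = (v & 0x80) ? (int32_t)v - 0x100 : (int32_t)v;

    assert(r >= -128 && r <= 127);
    return r;
}

/* N is read from bit 8, which the sign extension filled in. */
static int flag_n(int32_t nz) { return (int)(((uint32_t)nz >> 8) & 1u); }

/* Z is "low 8 bits are all zero". */
static int flag_z(int32_t nz) { return ((uint32_t)nz & 0xFFu) == 0; }
```

So 0x80 becomes 0xffffff80 (N set, Z clear) and 0x00 stays 0 (N clear, Z set).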


created input_shim, output_shim to replace INPUT: and OUTPUT:, although
right now INPUT,OUTPUT remain as asm->C call gateways.

table.x86 only jumps to NMI during a branch instruction
x86.S jumps to NMI in i_next (self-mod interpreter), and when compiling, in U:

x86 mov does NOT affect any flags

smb3 has droppings in left column on level select, possible sprite clip problem
  -- present in original code

MAPPER [Y] only called during store (write) to $8000-$FFFF
LDA # movsbl 0x0000AAAA(XMAP),%eax
   # XMAP is [X+1], Precalculated remap table address for word16 at SRC+1

STA (ind,X): addresses >= 8000 -> mapper
             addresses 2000-4FFF -> io   [does this by subtracting
			     2000 from address, then io if mod_address < 4000
				  -- because address < 2000 becomes mod_address < 0,
				  -- which is >= E000 unsigned.]
				  -- I believe registers 2000-7 repeat every 8 bytes,
				     and registers 4000-4015 repeat every 16 bytes.
				  -- Note special LDA/STA absolute cases  only check
				     for 2000-2007 (almost identical code), but also 40xx
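
The subtract-then-unsigned-compare range test from the STA (ind,X) note, sketched in C.  in_window is a made-up name, and the base/size constants in the usage line are the ones written in the note; the exact window in x86.S may differ.

```c
#include <assert.h>
#include <stdint.h>

/* One compare instead of two: addresses below base wrap around to a
 * large unsigned value and fail the "< size" test. */
static int in_window(uint16_t addr, uint16_t base, uint16_t size)
{
    assert(size != 0);
    return (uint16_t)(addr - base) < size;
}
```

Usage: in_window(addr, 0x2000, 0x4000) is false for $1FFF (which becomes $FFFF after the subtract) and true for $2000.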

STA absolute   does NOT call mapper or io??  ah, dynamically
               recompiled special cases
STA absolute,X does NOT call mapper or io except for $40xx; which
               means legal code STA $2000,X will not work!!
looks like reads use the MAPTABLE, writes don't

esi might be implemented by R->IPeriod - R->ICount, total number of
cycles executed since last period -- but INPUT/OUTPUT/NMI/MAPPERS change %esi
(set it to CTNI after fxn) -- although no one else does except for cycle
increment.

   global ESI
   ESI = 0      before emulation start
   ESI += icycles for each instruction
      ... could update ESI after every instruction, but
	      it's only used in input/output, and checked occasionally for
			  NMI purposes (jmp NMI  when %esi > 0, at "certain points")
   ESI = CTNI   at end of input, output, nmi, mappers  shims

STACKPTR is not preserved yet by new code; but no one uses it.
   I could "#define STACKPTR R->S"

mmc1,mmc3,aorom only require the CTNI/CLOCK/CPF preamble
  -- shims for these, others are called directly

dynrec.c can't be removed yet b/c it defines CTNI, etc.
   move these somewhere else

gradius.nes gives unrecognized instructions by m6502



calls to replace/remove:
gettimeofday -> libdream timing calls (or rip out timing completely :P)
paletteGGI->paletteDC ?
mmap calls are problematic
memcpy?

line "
  r = (int) mmap (RAM, 0x8000,
  " in emu.c should be RAM, 0x9000  I think, b/c ROM pages are mapped at
  RAM + 0x8000   but also at RAM + 0xC000 !!  so doesn't make sense.
  I think this is a bug in the code

  references to ROM_BASE  - offset, since RAM is mmapped just before
  ROM in memory this points to RAM.  This is BAD!!
  nevermind this is ok, maptable is accessed using real RAM offset
  so e.g. $FFFF -> maptable[15]+0xffff = ROM_BASE-0x8000+0xffff = ROM_BASE+0x7fff
  
  yet I cannot change the location of _RAM !!  even when mmapping at
  different location, it doesn't work
  -- cpu-glue.c wasn't being recompiled ;)
  -- on the dc we can simulate a mmap function b/c we have free access
  to all memory.  but if not want to, we can make uchar array, size
  0x210000 -- with RAM = 0x0, ROM_BASE=0x10000.
  -- actually, two separate arrays will work, it didn't work b4 prob.
     b/c of cpu-glue not being recompiled. -- done
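
A sketch of the maptable arithmetic using one flat array instead of mmap, assuming the layout floated above (RAM at 0x0, ROM_BASE at 0x10000).  The array is cut down to 64K + 32K for the example, MAPTABLE holds offsets rather than pointers, and map_rom/read6502 are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROM_BASE 0x10000                /* RAM = 0x0, ROM_BASE = 0x10000 */

static uint8_t mem[ROM_BASE + 0x8000];  /* 64K RAM region, then 32K ROM */
static int32_t MAPTABLE[16];            /* per-4K-page bias; 0 = identity */

/* Point 6502 pages $8000-$FFFF at the ROM area. */
static void map_rom(void)
{
    for (int page = 8; page < 16; page++)
        MAPTABLE[page] = ROM_BASE - 0x8000;
}

/* The real 6502 address is added to the page's bias, as in the notes. */
static uint8_t read6502(uint16_t addr)
{
    size_t idx = (size_t)(MAPTABLE[addr >> 12] + (int32_t)addr);

    assert(idx < sizeof mem);
    return mem[idx];
}
```

With this, $FFFF resolves to maptable[15]+0xffff = ROM_BASE+0x7fff, same as the worked example.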

magstep is declared in x11.c (!)
CLOCK is not initialized in io.c
CTNI is not initialized in io.c (orig. dynrec.c)


oh god... adding an fprintf of "drawimage: frameskip" into pixels.h makes
it work; take it out and it doesn't.


RAM is not zeroed out!  If we memset entire region to 0x05, smb1
starts on world 6, although everything else is normal.  This is
because:

World # is kept in $075f
A copy of world # is kept in $7fd
if, on bootup, high score ($7d7) is valid and boot_flag ($7ff) is set to $A5,
   memory is cleared from $0000 - $07d6.
otherwise, it's cleared from $0000 - $07fe (except on the stack page, where only $100-$12F is cleared).
World # appears to be loaded from $7fd and inc'ed.  So we expect $7fd to
be 0 and inc it to 1, the first world.
This may explain why all values get corrupted sometimes---smario finds
$a5 in location from previous boot, and accepts values (if we set $7ff
to $a5, this is confirmed).  But if that's the case, why would it
happen on first boot?  Has it ever?  Perhaps it's never happened after
-hard- dc reset.
(Also explains how world and high score survive soft reset.  NES does not
 zero out memory on soft reset, apparently).
But it doesn't explain why world# can be changed by a memset.
Boot_flag should not be set to 0xa5, so memory 0-7fe should be
cleared, so world should be initialized to 1. ...
Checking.  Regardless of boot_flag, setting $75f before boot does not
affect world, as I expect.
Write 0x05 from $700-$7ff has no effect.
Write 0x05 from $000-$fff has effect.
$13e holds the start world (on the $100 page, only
$100-$12F are zeroed out on bootup).  If $13e holds $5a,
then the initial world is zeroed out.  Otherwise,
$13e is used as-is.  Either way the value is stored in $76B.
I believe this is vestigial debugging code.  No other interest.


Okay, all games I tested work.  However if smario.h is not compiled
in, making emu.o smaller by 0xa010 bytes, Zelda e.g. fails.
So there is still memory corruption going on.

QBERT is mapper 0, but graphics are corrupt in control select screen.
   This occurs in original tuxnes as well

Mapper 0 seems great with or without the smario.h compiled in.  Is
this due to size (40976 bytes) or no mapper?

this version of GRADIUS.NES I think has wrong mapper number

Confirmed that MMC1 games play incorrectly if I don't compile in the
40976-byte smario[] array.  Points to memory overrun.  Other mappers
seem okay.  Zelda crashes; but metroid and megaman2 just have
background swap-in problems.
--- static int mmc1reg[4], mmc1shc[4] were not initialized to {0,0,0,0}.
Lots more statics in mapper.c, io.c.. unchecked.
--- static variables should automatically ;) be initialized to 0!
.bss section is not being properly zeroed, I assume.
--- fixed by adding bss-clear code to dan's crt0.s
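
What the bss-clear amounts to, sketched in C rather than the asm that went into crt0.s.  clear_words is a made-up name, and in a real startup file the two pointers would come from linker-provided section symbols, whose names depend on the linker script, so they are left as parameters here.

```c
#include <assert.h>
#include <stdint.h>

/* Zero every word in [start, end).  Must run before main(), since C
 * code assumes uninitialized statics start out as 0. */
static void clear_words(uint32_t *start, uint32_t *end)
{
    assert(start <= end);
    while (start < end)
        *start++ = 0;
}
```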


SMARIO:
i386 m6502 core, none renderer gives 350 fps (15000 frames in 42 sec)
i386 m6502 core + FAST_RDOP, none renderer gives 440 fps (15000 frames in 34 sec)
i386 m6502 core + FAST_RDOP + FAST_ZP gives no additional benefit here
i386 nsf core gives 600 fps (15000 frames in 25 sec)
i386 dynamic core, none renderer gives 4250 fps (!)

TIMING
-----

(1min)

defaults: metroid / dc renderer 3/10 6pm  640x480 frameskip=1 -O2 vblank
frames: 912 = 15.20 fps
inst/cycles 10287056/26713749 = 170k/445k ips/cps

no_vblank
frames: 965 = 16fps 
inst/cycles 10885205/28267422 = 180k/470k ips/cps

320x200
frames: 1200 (20.00)
inst/cycles 13531571/35155101 (225k/585k)
320x200,fast_rdop+fast_zp -- same

320x200,novblank
frames: 1238 (20.63)
inst/cycles 13961768/36269087 (232k/604k)
320x200,novblank,fast_rdop+fast_zp -- same (maybe +1fps)

none renderer, 30sec
frames: 3601 (120fps)
inst/cycles 40694643/105547627 (1350k/3500k)

none renderer, 35sec, -O
frames: 3601 (105fps)

320x200, -O, fast_rdop+fast_zp
frames: 910 (1516)
inst/cycles 10264689/26655132 (171078/444252)







--------

possible cpu speedups:
rptr++ can be removed from pixels.h for BPP != 1, 4, 24
explicit read_zp function for zeropage addressing opcodes
With R.IPeriod=9999999 (so nmi, irq functions skipped), and -O, and
        neverending sequence of ORA $05, and 50,000,000 instructions:
With FAST_RDOP+FAST_ZP, 24sec
With FAST_RDOP,31sec
With FAST_ZP,36sec
With neither,42sec


40M inst, neither, none renderer, metroid, 36sec
40M inst, FAST_RDOP+FAST_ZP, none renderer, metroid, 25sec

---

mem.c:

48sec/3600       none, 76800 16bit
35sec/3600frames -O
35sec/3600frames -O2

13sec/3600frames -O2 mainmem, 76800 32-bit

34sec/3600frames -O2 video, 76800 16-bit
34sec/3600frames -O2 video, 76800 32-bit
16.5sec/         -O2 videomem, 38400 32-bit   (~220fps)
16.5sec/         -O2 videomem, 38400 16-bit

If we combine two consecutive 16-bit writes into one 32-bit write,
speedup is 2x.
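
A sketch of the pairing idea: two 16-bit pixels packed into one 32-bit store.  copy_pairs is a made-up name.  Assumptions: little-endian byte order (so the first pixel goes in the low half), an even pixel count, and a 4-byte-aligned destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write npix 16-bit pixels using npix/2 32-bit stores. */
static void copy_pairs(uint32_t *dst, const uint16_t *src, size_t npix)
{
    assert(npix % 2 == 0);              /* pixels are consumed in pairs */
    for (size_t i = 0; i < npix; i += 2)
        *dst++ = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 16);
}
```

This halves the number of stores, which is where the 2x would come from if the loop is store-bound.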

new mem.c:

70sec 640x480x16 32-bit writes vram 3600 frames (~52fps)  <-- is this right?  we can't achieve 60fps? yep
28sec 640x480x16 store queues  vram 3600 frames (~128fps)
22sec 640x480x16 32-bit writes RAM  3600 frames (~164fps)
10sec 640x480x16 store queues  RAM 3600 frames  (~360fps) !!!

17sec 320x240x16 32-bit writes vram 3600 frames (~210fps)
 7sec 320x240x16 store queues  vram 3600 frames (~512fps)
 6sec 320x240x16 32-bit writes RAM  3600 frames (~600fps)
 3sec 320x240x16 store queues  RAM  3600 frames (~1200fps) !!!

----

last try
frames: 1358 (2263)
inst/cycles 13433897/39780518 (223898/663008)

w/o all drawimage (zelda)
frames: 9162 (15270)
inst/cycles 89626978/268607426 (1493782/4476790)
w/o all drawimage (smario)
frames: 9804 (16340)
inst/cycles 96969580/287333116 (1616159/4788885)

w/o drawimage
frames: 10360 (17266)
inst/cycles 102470333/303631447 (1707838/5060524)

w/drawimage, and looping through every clock cycle but doing no operation
frames: 7008 (11680)
inst/cycles 69321550/205384331 (1155359/3423072)
(addition of one variable increment every line has no effect)
fps increases to 136 if I cache endclock in a register (eliminating two memory loads)

w/drawimage, and looping through half the clock cycles but doing no operation
frames: 8343 (13905)
inst/cycles 82521821/244511127 (1375363/4075185)
fps increases to 155 if I cache endclock in a register

w/o sprites,bg
  i can get around 47fps, which means drawimage is taking up a huge
  amt of time even not drawing.  turn sprites on, lose 10fps.
  turn bg solid color on, lose 3fps.  turn bg on, lose another
  12fps.

Upgrading to newest binutils/gcc produces a worse tight loop (giving
105 fps instead of the 116 or 136 above).  The code looks better---more
registers and less memory, use of cmp/movt/cmp/movt/and/tst/jump instead of 
cmp/jmp/cmp/jmp.  Perhaps branch prediction outweighs the extra
instructions.  We still draw at 20fps with full drawing on, so
there's no effect there, but this is disconcerting.  Hand optimization
may give a significant benefit.

A later tight loop (with more operations) gives 140fps.  SH-4 is weird.
Even later I get 131 with first tight loop, not because of tight loop optimization (I think) but
because nsf core is faster.
And now 136, after removing unused hscroll[]/vscroll[]/linereg[] code in io.c.

new gcc
51 fps solid bg, no sprites
51 fps no bg, no sprites
25 fps bg, no sprites
30 fps no bg, no sprites, if do all bg calcs except vram write
38 fps bg, no sprites, if do only vram write (no other calcs)
57 fps no bg, no sprites, minimal (curclock,x,hpos) calcs except at end of loop
57 fps solid bg, no sprites, minimal (curclock,x,hpos) calcs except at end of loop
69 fps solid bg, no sprites, minimal (curclock,x,hpos) calcs even at end of loop
74 fps solid bg, no sprites, tight loop
79 fps solid bg, no sprites, tight loop w/nsf
78 fps no bg, no sprites, minimal (curclock,x,hpos) calcs even at end of loop
106 fps no bg, no sprites, minimal (curclock,x,hpos) calcs even at end of loop, if only half the number of pixels are done
86 fps no bg, no sprites, minimal (curclock,x,hpos) calcs even at end of loop, removing rptr++
105 fps no bg, no sprites, minimal (curclock,x,hpos) calcs even at end of loop, removing rptr++ and if (bit < 0) bit=7;
102 fps no bg, no sprites, minimal (curclock,x,hpos) calcs even at end of loop, removing rptr++ and changing bit if to bit &= 0x08;
65 fps no bg, no sprites, minimal (curclock,x,hpos) calcs at end of loop, full per-pixel
59 fps no bg, no sprites, all calcs
51 fps no bg, no sprites, all calcs, skip vram section completely
   -- this conflicts with earlier measurement of 30 fps above!  wtf??
49 fps 1/256 bg, no sprites
41 fps 1/5 bg, no sprites
26 fps bg, no sprites
  28 fps if remove x==256 test
30 fps bg, no sprites, skip every other pixel, x==256 test missing
32 fps bg, no sprites, skip all pixels (but do bit--, byte shift), x==256 test missing
32 fps 1/256 bg, no sprites, skip all pixels and bit--, byte shift, x==256 test missing
           -- this conflicts with earlier measurement of 49 fps, again!
66 fps no bg, no sprites, all calcs --- conflicting with earlier measurement of 59fps
56 fps 1/256 bg, no sprites, all calcs --- conflicting with earlier measurement of 51fps (and 32 fps!)
32 fps bg, no sprites, skip all pixels (but do bit--, byte shift), x==256 test missing -- reconfirmed
56 fps 1/256 bg, no sprites, all calcs --- reconfirmed, earlier was probably a failed recompile.
45 fps bg, no sprites, all calcs except bit--, byte shift
45 fps bg, no sprites, all calcs except bit--
57 fps bg, no sprites, all calcs, skip if (bit < 0)
62 fps bg, no sprites, all calcs, skip if (bit < 0) and bit--
49 fps 1/256 bg, no sprites, all calcs, shorten to if (bit < 0) bit=7;
   44 fps bg, no sprites, all calcs, shorten to if (bit < 0) bit=7;
   49 fps bg, no sprites, all calcs, eliminate if (bit < 0) bit=7;

33 fps no bg, sprites
21 fps bg, sprites




m6502->nsf
bank_readbyte, called from mem_read after range checks, or when getting opcode/address mode, also DMA, also ZP immediate byte.  == Op6502
bank_writebyte, called only from mem_write after checking for mapper/output
bank_readaddress, gets 16bit addr, ?equivalent to double read
zp_address, gets 16bit value from zp, called from indirect addressing macros
ZP_READ, called from zero page macros, gets "ram" directly, must rewrite (or set ram = RAM or MAPTABLE[0])
ZP_WRITE, see ZP_READ
mem_read, == Rd6502
mem_write, == Wr6502
duplicate nes6502_setcontext functions, removing unnecessary
call nes6502_init
call nes6502_execute in a loop, m6502 does this for you; we can call the equivalent of Loop6502 at the end of the loop,
   or simply inline it.
Interrupts?


   !! When writing into banked memory, we should always use the MAPTABLE, right?  the m6502 code does NOT do this, it just writes into RAM[Addr].


SMB1 choking on jmp (X0006) in magic jump handler (jumps to brk); perhaps indirect absolute addressing/jumps are messed up.
or X0006 holds wrong value, or...

magic_jump at $8E04  -- fixed, zp_address problem

--

106 fps with m6502 tightloop degrades to 96 fps with nsf, even though
nsf is faster on i386

none renderer, 20sec, m6502
frames: 3601 (180fps)
inst/cycles 40694643/105547627 (1350k/3500k)
** this is a big boost over previous 30 sec -- what happened?

none renderer, 27sec, nsf
frames: 3601 (133fps)
inst/cycles 40694643/105547627 (1350k/3500k)

none renderer, 12sec, nsf  !!
frames: 3601 (300fps)
inst/cycles 40694643/105547627 (1350k/3500k)

none renderer, 17sec, nsf (later build; after hscroll[] redact from io.c, it's slower??)
frames: 3601 (~212fps)
inst/cycles 40694643/105547627 (1350k/3500k)

now I'm getting 31.5 fps bg, no sprites for m6502
but I'm getting 36.5 fps bg, no sprites for nsf  -- tightloop and none
renderer are slower so why is this faster??

24fps nsf bg, sprites (regardless of inlining read functions)
21fps nsf bg, no sprites  (???)

--

Leftmost column now kind of renders, except it gets shifted off the
screen at twice the rate expected.  I don't know what I changed to
effect this, except maybe the different handling/shifting of "bit" and
"byte".

Top row no longer scrolls correctly in megaman2 (did it ever?)

0 fps bonus for writing only half the pixels (but same # of iterations)
1 fps bonus for removing bgmask[hpos-85] clause from inner loop
no bonus for removing per-loop RAM[0x2001] & 8 test (though maybe this is optimized out?)
                (previous indications that perf dropped 5 fps notwithstanding)


We can remove rptr at outer loop in pixels.h...

39.5 bg,sprites   with initial 8-pixels-per-loop kludge
60!  bg,nosprites
72   bg,nosprites rendering into RAM, so a speedup with stoq's may be possible.
     With the single-pixel-per-loop code, rendering into RAM provides
	 absolutely -no- speedup.
76   bg,nosprites into VRAM but skipping half the pixels, which is like
     using 32-bit writes (we're currently doing 16-bit).  76fps appears to
	 be enough to use waitvbl (though 60fps is not).
~100 bg,nosprites but pixel pushing proper skipped.

57   bg,nosprites after I fixed the colors, though this should have had
     no effect.
53   bg,nosprites when I put the bgmask stuff back.
39   bg,sprites now.  Same as above.  Sprites don't render quite
     correctly though (mushroom "dissolves");

---
pixels.h will only work correctly when "x" is a multiple of 8; since
we add 8 every time and x is tested for being a multiple of 16 or 32 ((x & 0xf) == 0 etc.),
that condition won't be satisfied if we start with e.g. x==2.  Assuming we
start at the beginning of a scanline, the only time "x" will be non-zero is
when we're scrolling, i.e. curhscroll != 0.  If we move the whole screen x&7
pixels to the left, we can start rendering where x==0 (offscreen).  
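
A quick check of the alignment problem, as a sketch: step x by 8 across a 256-pixel line and count how often the multiple-of-16 test fires.  count_hits is a made-up helper, not code from pixels.h.  Note the parentheses: in C, == binds tighter than &, so x & 0xf == 0 would parse as x & (0xf == 0).

```c
#include <assert.h>

/* Count how many times the "(x & 0xf) == 0" test fires when x starts
 * at `start` and advances 8 pixels per iteration. */
static int count_hits(int start)
{
    int hits = 0;

    assert(start >= 0 && start < 8);
    for (int x = start; x < 256; x += 8)
        if ((x & 0xf) == 0)
            hits++;
    return hits;
}
```

Starting at x==0 the test fires 16 times per line; starting at x==2 it never fires, which is the failure described above.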

1) We can do this by changing the start address of vram, which means we'll
start rendering on a 32-byte boundary (important for the stoq's).  This
also means we'd have to render x&7 pixels extra at the right side, and the
left side of the image -must- be aligned with the left side of the physical
screen, or we'd have to clip.  -Also-, the stride or width of VRAM would
have to be increased (i.e. a virtual screen) by at least 8 pixels
horizontally, or the left side of the image will show up on the right side
of the screen.

2) Alternatively, we can render x&7 bytes before the main loop, and skip x&7
bytes at the end, without moving VRAM.  However, then we don't start on a
32-byte boundary.  Also, the code is more ugly with extra start and stop
sections.

3) Alternatively, we can take the approach of pNES, without doing the tile
thing.  We can render the whole screen to a texture (via the store queues,
hopefully), and put the resulting two triangular polygons x&7 pixels to the
left of 0.  Then we take advantage of TA clipping to prevent the left edge
of the image from appearing on the right side.  Unfortunately, we are then
missing x&7 pixels on the right, which cannot be included in the 256x256
texture.  Hmm, that's bad.  Well, with the TA we don't have to worry about
the 32-byte boundary thing, because the texture will always be on a 32-byte
boundary in texture memory, only the polygons move [and presumably there's
no penalty for TA blitting to a non-32-byte boundary].  That makes choice 2
more attractive, except for the added code mess... no, that won't work, we
still have to write the texture at 32-byte boundary.

What about a DMA copy?  May be too much trouble (and still may be boundary
issues, I don't know).

Too bad pixels don't go through the renderer pipeline, otherwise we could
clip them in hardware...  TA clipping appears to have no effect.

Remember, tiles are 8x8 -- which means two tiles per stoq write in 16-bit
mode.

Try triple buffering, suggested by Ken of nesterdc fame, to avoid
simultaneous write and read access of vid memory.  I think this only
works if you vertical sync; which means it would help if you're at
less than 60fps.  It should provide extra few fps you need if you're
at the borderline...
--triple buffering also lets us not waste idle time

tuxnes-dc-0-1  -- the release just before branching the multi-pel code

tuxnes-dc-0-1-multipel   -- the multi-pixel per loop (chunking) branch
tuxnes-dc-0-1-multipel-1 -- working multi-pel (except clipping); + timer
tuxnes-dc-0-1-multipel-2 -- now clips, and renders line-at-a-time
tuxnes-dc-0-1-multipel-3 -- TA code present, and sprites re-enabled
tuxnes-dc-0-1-multipel-4 -- mmc2 sprite latch disabled (tag for branch)
tuxnes-dc-0-1-multipel-4-sq -- the store queue testing branch
tuxnes-dc-0-1-multipel-5 -- some speedups
tuxnes-dc-0-1-multipel-6 -- more speedups (59.72 fps now), and tag in preparation for 20010718 merge
tuxnes-dc-0-1-multipel-6-merge20010718 -- merge tuxnes-20010718 cvs
tuxnes-dc-0-1-multipel-7 -- tuxnes-20010718 cvs snapshot merged
tuxnes-dc-0-1-multipel-7-sound -- sound branch
tuxnes-dc-0-1-multipel-7-sound-1 -- sound engine abstracted and framework added
tuxnes-dc-0-1-multipel-7-sound-1 -- initial sound engine written, but not enabled
tuxnes-dc-0-1-multipel-7-sound-2 -- sound engine enabled, sounds like crap
tuxnes-dc-0-1-multipel-7-sound-3 -- sound engine enabled, it works!  choppy though
tuxnes-dc-0-1-multipel-7-sound-4 -- vbl,nosprites,16bit,zelda,pingpong
tuxnes-dc-0-1-multipel-7-sound-5 -- abstract sound engine further (merged in tuxnes-20010718-sndabs-3)
tuxnes-dc-0-1-multipel-7-sound-6 -- use AICA sample position polling
tuxnes-dc-0-1-multipel-7-sound-7 -- address AICA alignment issues
tuxnes-dc-0-1-multipel-7-sound-8 -- new shutdown audio function call
tuxnes-dc-0-1-multipel-7-sound-9 -- use 4 buffers instead of 2; sound works good now
tuxnes-dc-0-1-multipel-8 -- merged in multipel-7-sound-9
tuxnes-dc-0-1-multipel-9 -- multi- (actually, two-) player support
tuxnes-dc-0-1-multipel-10 -- prepare for kos branch
tuxnes-dc-0-1-multipel-10-kos -- kos conversion branch
tuxnes-dc-0-1-multipel-10-kos-1 -- converted to kos.
tuxnes-dc-0-1-multipel-11 -- merge in kos branch (initial thru kos-1)
tuxnes-dc-0-1-multipel-12 -- fix sprite display (right edge); rudimentary single save thru /pc
tuxnes-dc-0-1-multipel-13 -- multiple saves (1 per ROM); game select menu
tuxnes-dc-0-1-multipel-14 -- make dirtyheader handling sane; allow main() to retake control
tuxnes-dc-0-1-multipel-15 -- game select now reappears
tuxnes-dc-0-1-multipel-16 -- cleaned up old crust in emu.c; abstracted palettes to palettes.h
tuxnes-dc-0-1-multipel-17 -- cleaned up even more crust in emu.c; add delay before repeat in game select menu
tuxnes-dc-0-1-multipel-18 -- added splash loading screen; kludged around dancing status bar (sprite0hit);
                             fix major screen corruption with x &= 255 in wrong place
tuxnes-dc-0-1-multipel-19 -- add help bar below game select; change exit key sequence; #define ROMDIR/SAVDIR
tuxnes-dc-0-1-2           -- merged in multipel branch
tuxnes-dc-0-1-3           -- cleaned up old, unnecessary files; updated dc-aica Makefile

tuxnes-20010718-sndabs-1 -- initial sound abstraction; only one renderer (oss)
tuxnes-20010718-sndabs-2 -- add mute and none renderers; allow user to choose
tuxnes-20010718-sndabs-3 -- InitAudio* responsible for dev select; update Makefile.am

http://www.sfu.ca/~ccovell/ NEStechfaq:
"HBlank is the period on each scanline when the electron beam moves
back from the right side of one scanline to the left side of the next
scanline.  Generally, changes to scrolling and name table updating
will not become apparent until after the next HBlank.  So, say for
example you are doing some wavy scrolling by updating $2005 in a timed
loop.  The changes that you make to $2005 will not be reflected
on-screen until after the next HBlank; thus, they will show on the
next scanline.  The same applies to mid-screen palette changing or
writes to $2006.  However, turning off the screen, and changing the
Colour Emphasis or Monochrome bits of $2001 can be reflected in the
middle of a scanline."

We must forcibly increment endclock to end of the current line,
the first time endclock enters a line (the first time we draw a line).
Since endclock will then be on a line boundary, the next time we
call drawimage() lastclock will also be on a line boundary.
So we can't start mid-line.

curclock = lastclock;        <- At beginning of fxn, curclk is synonym for lastclock
if (endclock > PBL) return;  <- Don't draw anything after frame 
                                completed/NMI (UpdateDisplay, where endclock == PBL).
if (curclock > PBL) curclock = 0; <- Reset curclock to 0 the first time we try to
                                     draw after frame completes (NMI/vblank start).
if (endclock <= curclock) return;  <- If endclock is in same line as
                                      previous drawn line, return.
									  because curclock==lastclock is 
									  always end of previous line.
if (curclock == 0) { Begin new frame } <- curclock is zero the first time we draw
                                          in this frame; endclock must be > curclock
										  due to previous test, so we're actually
										  drawing 1 or more lines.

if 0<=hpos<=84 (in hblank) tile,byte1,byte2 etc. are set.  This is fine.
However curclock is increased by 85 - hposition 

[[ I think this sets it to currentline + 85 because: curclock = lastclock;
hposition=lastclock%HCYCLES; therefore curclock += 85 - hposition -> curclock =
curclock + 85 - (curclock % HCYCLES) -> curclock = curclock - (curclock %
HCYCLES) + 85 -> curclock = currentline + 85 ]]
   
   which is end of hblank, which means further hblank updates are discarded.  Because for another hblank to be accepted, endclock must be < currentline + 85, which means next frame curclock < currentline + 85, 

   Stop, stop.  The above is incorrect, drawimage() handles hblank correctly.  What I missed was, at the end of every line (hposition >= HCYCLES) the variables are refreshed.  But they are -not- refreshed if we start mid-line (hpos >= 85), so even if values are changed mid-line they are not "propagated" to drawimage() until the next line.  Scanpage, FWIW, is calculated entirely in drawimage... not modified outside of drawimage() at all.  So in fact, the hscrolling values -are- cached in hscrollreg, because scanpage is not updated with the new value of hscrollreg until next line.
   And we -can- have multiple mid-(real)-hblank updates; it's just that each one will always set the variables since hpos < 85, which is kind of wasteful.  Remember, curclock and hposition are calculated anew at the top of the fxn.  Updates before the last one have endclock < currentline + 85, so they won't be drawn.  For the last update,   endclock will be >= currentline + 85, we update the variables since hpos < 85, and we will force curclock = currentline + 85 and hpos = 85, and proceed normally.
*** Why, then, did I see mid-line scrolling when I changed the asm core??

With full-line method, hposition never >= 85 at the beginning of the function, because previous endclock is always forced to somewhere within hblank.  So hposition <= 84, and the variables are always refreshed at the top whenever this fxn is called.
Note: we must only increase endclock to the next line when endclock is on the screen, i.e. endclock >= currentline + 85. (That implies endclock > curclock.)
We should be able to set endclock to having an hpos of anywhere between 0 and 84.  The lower it is, the higher probability of wasted variable-sets, if we're called multiple times in hblank.  If we set to 84, though, the function will immediately exit when endclock is not on screen (endclock <= curclock).

There may be an off-by-one error in the compare operators somewhere, because if endclock == 85 and curclock == 84 then we fall through the endclock <= curclock test and set the variables, even though we don't draw anything since hpos < 85 and curclock is immediately set to 85, failing the (while curclock < endclock) loop.  During the next call, hpos >= 85, and we don't set the variables.  This is ok, because we set them when endclock was 85/curclock was 84, and that was the last possible update anyway.  Though technically, we haven't drawn the first pixel yet, so the last IO should take effect... however emulation timing isn't that precise anyway.  To be fair, this occurs in the original per-pixel code as well.  To sum up, we must increase endclock if IT has an hpos of 85 or greater.
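
The two clock adjustments discussed above, as a C sketch.  HCYCLES is the consts.h value and 85 is the first visible hpos from these notes; the function names are mine.  snap_endclock picks hpos 0 of the next line, which is one valid choice in the 0..84 range.

```c
#include <assert.h>
#include <stdint.h>

#define HCYCLES    341   /* from consts.h */
#define HBLANK_END 85    /* first on-screen hpos, per the notes */

/* curclock += 85 - hposition, i.e. currentline + 85, when in hblank. */
static int32_t skip_hblank(int32_t curclock)
{
    int32_t hpos = curclock % HCYCLES;

    assert(curclock >= 0);
    if (hpos < HBLANK_END)
        curclock += HBLANK_END - hpos;
    return curclock;
}

/* Full-line method: if endclock is on screen (hpos >= 85), push it to
 * the start of the next line so the next call begins inside hblank. */
static int32_t snap_endclock(int32_t endclock)
{
    int32_t hpos = endclock % HCYCLES;

    assert(endclock >= 0);
    if (hpos >= HBLANK_END)
        endclock += HCYCLES - hpos;
    return endclock;
}
```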

Ensure drawimage updates correctly at end of screen.


Re: clipping.  A 256x256  texture (probably) won't work, b/c we're writing more than 256 horiz pixels when scrolling.  Right side should work, because it will be overwritten by beginning of next line, but left side will back up over previous line's right side.


Next up---figure out why left tile scrolls weird.  In original tuxnes,
scroll is okay, just left half of sprites disappears.  It must have
started when I flipped the meaning of bit around.
Removed shl x&7 of byte1 and byte2 and fixed.
This seems to have lowered fps by about 2. ??

Eliminating byte1/byte2 calculation, as would happen with precalculated
(cached) tiles, and replacing with a RAM-VRAM transfer (just curpal[i]
for testing) speeds us up by about 4 fps.
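
For reference, the byte1/byte2 work that a tile cache would precompute is the usual NES two-bitplane decode.  Rough sketch only; the function names and the 8x8 output layout are hypothetical, not the cache format planned here.

```c
#include <stdint.h>

/* Decode one 8-pixel tile row from its two bitplane bytes into
 * 2-bit color indices (0..3), leftmost pixel first. */
static void decode_tile_row(uint8_t byte1, uint8_t byte2, uint8_t out[8])
{
    for (int x = 0; x < 8; x++) {
        int shift = 7 - x;
        out[x] = (uint8_t)(((byte1 >> shift) & 1) |
                           (((byte2 >> shift) & 1) << 1));
    }
}

/* Decode a whole 16-byte pattern-table tile: bytes 0..7 are plane 0,
 * bytes 8..15 are plane 1 of the same rows. */
static void decode_tile(const uint8_t chr[16], uint8_t out[8][8])
{
    for (int y = 0; y < 8; y++)
        decode_tile_row(chr[y], chr[y + 8], out[y]);
}
```

Done once per tile change, the inner draw loop is left with only the curpal[] lookup per pixel.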

Testing integrating sprites.
55 fps bg,nosprites
38 fps bg,sprites
  43.5 fps bg,sprites without drawing (no for i = 0..256 ptr=linebuffer[x] loop)
  43.0 fps bg,sprites drawing direct-to-screen, without linebuffer (behind doesn't work yet)
  44.6 fps bg,sprites drawing direct-to-screen, removed memset(linebuffer,0,256).
  40.0 fps bg,sprites re-enabled bgmask
  44.5 fps bg,sprites fixed bgmask and behind processing (why the fps increase? but I'll take it)
  44.0 fps bg,sprites playing smb1 for a while (to halfway thru 1-2)
  46.0 fps bg,sprites smb1 after #ifdef cleanup, I attribute this to endian_fix being removed.
  44.5 fps bg,sprites playing zelda for a while
  42.5 fps bg,sprites megaman2 air man stage
  49.5 fps bg,sprites smb1 if disable MMC2 sprite latch, whatever that is.
!   53.5 fps bg,sprites smb1 if also disable MMC2 tile latch
!   62.5 fps bg,nosprites
  45.7 fps bg,sprites megaman2 air man stage now

Try initial storeq test in pixels.h
Try replacing double rectangle blackout bars with 8-byte storeq stores
(extra, but may be faster--test by removing black bars for theoretical
max speed first)

52.6 fps bg,nosprites with and without letterboxing, so ignore sq optimization for now.
  Strangely, bg,nosprites has dropped to 52.6 fps, meaning sprites only
  cost ~2.5-3 fps in smb1.
62.7 fps solidbg,nosprites
   same fps solidbg,nosprites writing to RAM -- weird, no speedup.
97.6 fps nodraw solidbg, nosprites
85.7 fps solidbg,nosprites using store queues, initial try (+23fps over direct VRAM write)
91.0 fps solidbg,nosprites using store queues, eliminating redundant writes. (+28fps)
91.0 fps solidbg,nosprites writing directly to RAM, eliminating redundant writes. (+28fps)
  This is only a 7 fps penalty over not touching VRAM at all, as opposed to 35 for direct writes!

~64.0 fps bg,nosprites with 32-bit writes, without updating bgmask
~57.5 fps bg,nosprites with 32-bit writes, updating bgmask -- (only +5fps due to 32-bit writes :( )
~56.5 fps bg,nosprites with store queues, updating bgmask -- (no benefit from store queues?)
  ~59.0 fps now (moved QARC0/1 assign out of loop -- +7.5 fps)
  ~66.5 fps now (removed mmc2_latch)
    ~54.6 fps (re-enabled sprites)
~60.5 fps bg,nosprites with store queues, not updating bgmask -- (no benefit from store queues?)
~62.5 fps bg,nosprites with 32-bit RAM writes -- but was this w/ or w/o bgmask? (+10fps to RAM)
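
What "32-bit writes" means in the numbers above, roughly: two palette-mapped 16-bit pixels are combined and stored as one 32-bit word.  A sketch with hypothetical names (draw_pixels32, pal, idx); the real loop in pixels.h is laid out differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write n palette-mapped 16-bit pixels to dst, two per 32-bit store.
 * Going through memcpy keeps pixel order correct on either endianness and
 * sidesteps aliasing/alignment trouble; compilers normally reduce each
 * 4-byte memcpy to a single load or store. */
static void draw_pixels32(uint16_t *dst, const uint16_t *pal,
                          const uint8_t *idx, int n)
{
    assert(n >= 0);
    int x = 0;
    for (; x + 1 < n; x += 2) {
        uint16_t pair[2] = { pal[idx[x]], pal[idx[x + 1]] };
        uint32_t word;
        memcpy(&word, pair, sizeof word);
        memcpy(dst + x, &word, sizeof word);
    }
    if (x < n)                      /* odd pixel left over */
        dst[x] = pal[idx[x]];
}
```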

53.8 fps bg,sprites normal 16-bit writes    120.1 with 1/8 of solid screen; 90.4 with 1/8,+sprites
    55.1 fps (removed curclock from inner loop) -- 123.1 with 1/8 solid pixels
    55.6 fps (unknown)
	   51.1 fps megaman 2 played thru entire air stage and got password
    56.24 fps (moved int_pending and dma_cycles code out of inner cpu loop)
    58.25 fps (also removed total_cycles from nes6502_execute; put back in
             	with better calculation method and get 58.00 fps)

Could the mysterious top-scrolling line crap in smb1 and zelda be because
scanpage etc. are only set at the bottom of the loop, and when we enter
drawimage after changing registers during hblank, previous line's registers are
therefore used?  

BUG in original pixels.h:  Yes, setting scanpage always at function enter fixes
the crap on the top of the screen in smb1; it does not fix the Zelda line,
although that seems to be on the wrong scanpage, rather than on a continuation
of the status bar.  smb1 may still be hiding this bug because its line is solid blue,
while zelda's is patterned.
Zelda's line is -not scrolled- i.e. hscroll has no effect on it, --if hscroll is being set by this time
RCR has no status line problem, neither does athena
BUG cvania doesn't have this problem, but it does have a sprite-behind-other-sprite problem 
(walk in door on first pre-stage to test) and a graphical flicker in status bar during slow scroll (big door)

BUG in original pixels.h: only left half of sprites partially off the right side of the screen is displayed.


Hmm, apparently I didn't get rid of "bit" in trunk

asm takes up 56.3% of time in smb1 -- branch and try to optimize asm

0x4c (JMP abs) and 0x8d (STA abs) take up most time -- jmp is executed the most
(takes 1/5 tick) but 0x8d (STA abs) takes about 16 ticks and 0xad (LDA abs) takes about 1.2 ticks
!!! because these can trigger I/O events!!
  once drawimage() calls are removed from io.c, STA drops down to
  about 1.2 ticks.

-fschedule-insns
-fschedule-insns2
-ffast-math
-fomit-frame-pointer
-fverbose-asm
-frename-registers
-fssa

Merging with 20010716 (previous 20010205)  on tuxnes-dc-0-1-multipel-6-merge20010718
  AUTHORS -done
  BUGS -done
  CHANGES -done
  COPYING -done
  ChangeLog -done
  HACKING -done
  INSTALL -done
  INSTALL.BSD -done
  Makefile.am -done
  NEWS -done
  README -done
  THANKS -done
  acconfig.h -done
  autogen.sh -done
                                                            comptbl.c -n/a
  configure.in -done
  consts.h -done
						   cpu-glue.c -n/a
                                                            d6502.c -n/a
						   dc-glue.c -n/a
  dc.c -done
						   dcload-syscalls.c -n/a
						                                    dynrec.c -n/a
  emu.c  -done
  fb.c -done
  gamegenie.c -done
  gamegenie.h -done
                                                            ggi.c -n/a
  globals.h -done
  io.c -done
  mapper.c -done
  mapper.h -done
  nes6502.c -n/a
  ntsc_pal.c -done
  pixels.h -done
  renderer.c -done
  renderer.h -done
  sound.c -done
  sound.h -done
                                                            table.x86
  tuxnes.xpm -done
  tuxnes2.xpm -done
                           types.h -n/a															
  ulaw.c -done
                                                            unzip.c
                                                            unzip.h
                                                            w.c
															x11.c
															x86.S
															ziploader.c
															ziploader.h



----------------- sound ----------------

Should probably be rewritten to allow multiple sound devices through
indirection as in UpdateDisplay -> UpdateDisplayDC.  I.e.
UpdateAudio -> UpdateAudioDC.
v Abstracted in this manner; user can't select engine yet though
v InitSoundParms
v Implement engine select
v Implement Mute (calc/no output) and None (no calc, no output) engines
v "OSS sound" message is out of order, fixed
v Setting audiofile should be done by Init fxns; not by emu.c.
  emu.c can set audio device through -e; if unset, Init fxns will set to a default.
  emu.c no longer tests DSP to see if available and therefore sets default--
  it just defaults to NULL in emu.c and is set in Init.
v sound_config.audiofile set to NULL on mute, we want to change mute to use Init/UpdateAudioNone.
. ulaw not implemented (at all, no hooks)
. Since OSS driver returns immediately if we can't open device file,
  sound calcs will not be done (samples_per_vsync==0)
  This condition was in the original code
v perhaps replace/augment device select, allow user to select device?
v help text
- name change OSS to DSP? 
- Auto select device just like renderer?
- Sounder name change?
- xxx-sound.c name change -> sound-xxx.c?
? There may be an error in use of buf_size in oss-sound.c
- Abstraction of sample_format may be difficult or not worthwhile
- Terms not standardized, I'm using "sounder", sound engine, and sound driver
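
Sketch of the indirection described in this section (UpdateAudio -> UpdateAudioDC style), including the Mute and None engines.  The struct layout, select_engine, and the return conventions are assumptions for illustration; the actual table in sound.c may look different.

```c
#include <stddef.h>
#include <string.h>

struct sound_engine {
    const char *name;
    int (*init)(void);            /* returns 0 on success */
    int (*update)(int samples);   /* returns samples actually calculated */
};

static int init_ok(void) { return 0; }

/* Mute: do the calculations, emit nothing. */
static int update_mute(int samples) { return samples; }

/* None: no calculation, no output. */
static int update_none(int samples) { (void)samples; return 0; }

static const struct sound_engine engines[] = {
    { "mute", init_ok, update_mute },
    { "none", init_ok, update_none },
};

/* Look an engine up by name; a NULL name picks the first entry as default. */
static const struct sound_engine *select_engine(const char *name)
{
    size_t n = sizeof engines / sizeof engines[0];
    if (name == NULL)
        return &engines[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(engines[i].name, name) == 0)
            return &engines[i];
    return NULL;
}
```

An OSS or DC engine would just be another row in the table, with its init responsible for picking the default audio device as noted above.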


53.6 fps with initial sound engine (no output)
  56.0 fps if I disable the progressive sound update
    49.3 fps if I enable mid-screen sound updates
	  55.0 fps if I then make ProgressiveSoundUpdate a no-op
47.5 fps with progressive sound completely enabled
56.8 fps when I added sound engine code, with progressive
     sound on but actual sound output disabled -- don't understand
	 speedup

As suspected, the vbl is causing timing issues with the code.
It may not take me the same amount of time to render each frame.
Sound starts when I'm done rendering the frame, so there will
be "jitter".  There are two ways to solve:
1) have the AICA poll itself for completion, then start the new
   sample.  Probably easier to have it control the ping pong buffers
   then as well, and just continually stream.
2) continuous loop as suggested by Marcus.
Either way, sound will screw up if we're not running at an
effective 60HZ.


Perhaps I am only generating half the samples I should be.
No, tuxnes's "sample" is always 1 byte, even in 16-bit mode.
Looping works -- sound now distorts then cleans up, 
repeat ad infinitum.  Periodically we get out of sync with
the buffer (we're writing into the buffer we're currently using)
which causes the distortion.

- Don't forget to use DC mixing hardware once sound is good!
- Verify that Mute driver actually does the calculations
- Advancing the ping-pong buffer manually ends the distortion.
  We may be de-syncing because of the samples_per_vsync + 2?
  Distortion seems to last for 6 seconds; then 6 seconds of good sound.
. "To poll the current mixing address for a channel, write the channel
  number to 0x80280d (8 bit) and then read 0x802814 (32 bit)."
  (0xa070280d, 0xa0702814)
v Checking the position and dumping the next sample where we aren't
  playing eliminates almost all the distortion.
- With frameskip==1, we've got (periodic) distortion again.
  There are a few notes.
  1. Because the skipped frame isn't drawn, it takes very little CPU time
  to calculate; also, we don't wait for vbl.  Therefore, we may complete 
  the frame during the same sample as the last frame, so we write to the
  wrong buffer.
  2. Because of vbl, every second frame is synchronized at a 30Hz rate.
  The sample buffer also loops at a 30Hz rate (44100Hz / (735*2)) = 30Hz.
  Sound is updated before the display is drawn, pretty close to after the
  previous vbl:
     vbl  ...  drawimages if register touch ... update audio ... complete frame ... vbl
  In a 0 frameskip situation, then, we should never cross buffers; sound should
  usually be updated around the same sample position, and always within the same buffer.
  Note we may cross buffers if we miss a frame because we're running too slow, though.
  This could cause a buffer switch, which is what the sample position polling code is
  designed to catch.
  3. In a 1 frameskip situation, every second frame hits at 30Hz.  So, really,
  every second frame should hit the same buffer (2), while the other frames hit
  within the same buffer (1).  If we direct the alternate frames to buffer (2)--or at least the next buffer--,
  it should sound just like frameskip==0.
  4. However, it appears that the sample buffer does not loop at exactly 30Hz (2VBL)!
     For some reason, using 1474 samples (not 1470) almost fixes the problem, as
	 shown by examining the sample positions.
  probably because ntsc refresh is 59.94 fps ?
  if so, we need 750 or 800 samples per frame (44955 or 47952 Hz)

  with 735 samples and 43992 Hz, we loop too fast
  with 735 samples and 43991 Hz, we loop too slow
  this implies a screen refresh rate of ~59.8525 fps!  nonsensical!
  Also, when i change the Hz, my fps calculation changes!! 
    guess not.
  I was able to clean up most of the crackling by overcoming the problem with needing to be on a 4-byte boundary for AICA accesses;
  essentially I buffered the last sample and moved it to the next frame (734 samples one frame, 736 the next).
  However the sample position is now -required- to reside in alternating buffers each frame---it will resync if you cross a buffer boundary, but if you enter the same buffer twice, the second time it will not load the sound at all.  This is a side effect of the 734/736 samples thing,
  a requirement for alignment purposes.  It works fine when we display every frame at full speed, since the positions are
  almost always 735 samples apart (vbl occurs every frame). However, if we frameskip, a skipped frame does not wait for
  vbl (since the previous frame used >1 vbl periods).  Thus every two frames occurs at a constant rate (30Hz), but two 
  frames themselves can occur very close together (the purpose of frameskip!)--within the same buffer in fact.  When
  this happens, the current alignment rules give no sound or garbled sound.  This occurs half the time.  For the other
  half, two buffers will be used, even though the position distance is less than 735 samples, because it is still possible to straddle a boundary.
  So.  One idea for fixing is, figure out if the (absolute) last position differential is somewhat less than 735 samples; if so,
  we're probably using frameskip and this is the second (fast) frame, so we can force it to use the next buffer.
  With properly implemented frameskip, we should hopefully never skip a frame without intending to, as would happen
  occasionally when trying to run at full 60fps even with sprites disabled.  We seem easily capable of 30fps without
  dropping, with sprites enabled. ... Again, the first (slow) frame will still determine which buffer we write the next sample to, 
  so that we still sync if we drift over a buffer boundary. ... I believe that for frameskip > 1, we may need more than two buffers.

  still crackling if initialize audio_buffer to all 0x50... 0x50 shows up at 0x5c0, 0x5c1 == 1472, 1473; i.e. the very last 16-bit sample
    snd_load has an off-by-one error!
	However, this did not cause the crackling; apparently the AICA can only -play- multiples of four bytes (2 16-bit samples) as well
	  so it always plays an extra sample (1472 bytes, instead of 1470)
	  solution: for now, play 1468 bytes
	            in future, double (at least) the number of buffers
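
The 734/736 alternation mentioned above can be written as a one-sample carry.  Sketch only: samples_to_upload and the carry variable are hypothetical names, and 735 samples per frame is the 44100 Hz / 60 figure used in these notes.

```c
#include <assert.h>

#define SAMPLES_PER_FRAME 735

/* Return how many 16-bit samples to upload this frame so that the byte
 * count is a multiple of 4.  *carry is the number of samples held back
 * from the previous frame (0 or 1) and is updated for the next call. */
static int samples_to_upload(int *carry)
{
    assert(*carry == 0 || *carry == 1);
    int avail = SAMPLES_PER_FRAME + *carry;
    int send  = avail & ~1;   /* even sample count == multiple of 4 bytes */
    *carry = avail - send;
    return send;
}
```

Starting from carry 0 this yields 734, 736, 734, ... samples (1468, 1472, 1468, ... bytes), so every transfer stays 4-byte aligned.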

increased #buffers from 2 to 4, because 2 was unsolvable (given buffers A and B,
if this_pos == A twice in a row (due to frameskip), then I write to B the first
time, and A the second... but I'm still in A!)  With 4, if we're in A twice,
the first time we write to B, second to C.  Cleanest solution is generating
only the samples I need, but I can't do this yet.
We don't need (currently) to check that the position differential is less than
one buffer, because checking that this_pos and last_pos are in the same buffer suffices.
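
One reading of the 4-buffer rule above, as a sketch.  The struct, next_write_buffer, and BUF_SAMPLES = 735 are assumptions for illustration; buffers A..D are numbered 0..3.

```c
#define NUM_BUFFERS 4
#define BUF_SAMPLES 735   /* assumed: one frame's worth of samples per buffer */

struct pingpong {
    int last_play;    /* buffer the hardware was playing on the previous call, -1 at start */
    int last_write;   /* buffer we filled on the previous call */
};

/* Pick the buffer to fill, given the current playback position in samples. */
static int next_write_buffer(struct pingpong *pp, int this_pos)
{
    int play = (this_pos / BUF_SAMPLES) % NUM_BUFFERS;
    int write;
    if (play == pp->last_play)
        write = (pp->last_write + 1) % NUM_BUFFERS;  /* same buffer twice: skip one further ahead */
    else
        write = (play + 1) % NUM_BUFFERS;            /* normal case: buffer after the one playing */
    pp->last_play  = play;
    pp->last_write = write;
    return write;
}
```

Two calls while playback is still in A give B and then C, as described; only the buffer index is compared, not the position differential.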

v todo: convert to kos
v todo: dctool uses 100% CPU and hangs after a few minutes
  v temp solution to hang: printf(".") every 15 seconds
  v solution to 100% CPU: use select()
v todo: OLYMPUS no longer works, faxanad.nes,  (dirtyheader)
v todo: research duff's device
v todo: sort game menu		
v todo: dirtyheader check is almost always right; make it the default.
v todo: fix legend of kage (KAGE.NES) / double dragon 1 (DRAGON1.NES) 
  . note tuxnes original does not have the major screen corruption!
  . x not being reset to zero every time it reaches 256 ; fixed in pixels.h 1.6.2.28
  ! but this trivial change slows down Zelda 2!
- todo: backport new shutdown audio routines to tuxnes
- todo: create new shutdown renderer routines (how is this handled now?)
  alternatively, at every point we can call quit, prepend a shutdown renderer
     of course, then we should rethink the shutdown audio routines
- todo: fix left/right sprite disappearing (top/bottom too?) if incorrect
  v SMARIO: on real nintendo, right sprite tile does not disappear (but left does)
    this may vary between roms; must obtain one with the title screen intact
  - ZELDA: on real nintendo, during item intro, double-high item does not disappear
    until first line of bottom half leaves window.  In tuxnes, it disappears as
	soon as first line of top half leaves window.
- todo: fix top row tile incorrect in megaman2
- todo: why does SMARIO have a problem with the select button??
  - looks like the rom with the full title screen doesn't experience this
    (of course that ROM seems to execute faster than normal too, so)
  - tuxnes orig has same problem
- TODO: implement save games!  -- nesterdc vmu.cpp has some good routines.
  v partially done: save one per rom thru /pc (to ng/save/basename.sav)
- todo: cleanup kos conversion
- todo: move palette data to palette.h
  - done, but it's suggested that I might want to use palettes.c instead.
    research this.
- todo: recheck strlen return value w/r/t strncpy 
- todo: strncpy pads with \0 and not guaranteed to set \0 at end of string;
        recommend convert to strncat
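
Sketch of the strncat conversion recommended in the last todo above.  copy_string is a hypothetical helper name, not an existing function in the tree.

```c
#include <string.h>

/* Copy src into dst (capacity dstsize bytes, terminator included),
 * truncating if needed.  Unlike strncpy, the result is always
 * NUL-terminated and the rest of dst is not padded with zeros. */
static void copy_string(char *dst, size_t dstsize, const char *src)
{
    if (dstsize == 0)
        return;
    dst[0] = '\0';
    strncat(dst, src, dstsize - 1);
}
```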

- todo: tuxnes orig. chg sprite code so it doesn't call mmc2_4_latch 63 times per drawimage if mmc2/4 not enabled
   mmc2 and 4 are mappers number 9 and 10 respectively (stored in MAPPERNUMBER)
- todo: backport dirtyheader fix to tuxnes
v todo: loop main() control flow (don't quit() directly any more)
- todo: find unfreed mallocs now that main() loops
- todo: figure out source of annoying border flickering after certain games run
  (I believe this has to do with border color or color 0)
- todo: game aborts (nicely!) once in a while, as if quit sequence was pressed
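
Sketch for the mmc2_4_latch todo above: test the mapper once, outside the sprite loop.  MAPPERNUMBER and mmc2_4_latch are names from these notes, but the latch signature, the call counter, and process_sprites are stand-ins made up for the example.

```c
static int MAPPERNUMBER;   /* 9 = MMC2, 10 = MMC4 */
static int latch_calls;    /* instrumentation for this sketch only */

/* Stand-in for the real latch routine; the real signature may differ. */
static void mmc2_4_latch(int tile)
{
    (void)tile;
    latch_calls++;
}

/* Walk the sprite list, calling the latch only when the mapper needs it.
 * The mapper test is hoisted out of the loop so other mappers pay nothing. */
static void process_sprites(const int *tiles, int nsprites)
{
    int need_latch = (MAPPERNUMBER == 9 || MAPPERNUMBER == 10);
    for (int i = 0; i < nsprites; i++) {
        if (need_latch)
            mmc2_4_latch(tiles[i]);
        /* ... draw sprite i ... */
    }
}
```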


- todo: fix smb3 left/right scrolling.
  Both colors and shapes are wrong.
  It only occurs when hscrollval % 8 != 0 (i.e. mid-tile scroll).
  Sometimes, the right tiles are extended onto the left side of the screen 
    (as if there is a wraparound bug only valid for the first column).
  This is when moving right.  Upcoming tiles are correctly drawn, just wrong
  colors.  When going left, the left tiles are wrapped onto the right side,
  and the left side is drawn correctly with wrong colors.  Though I just noticed the leftmost 4 pixels are still reflecting the right side...
  
  Left and right tiles always seem to use the same palette, even when colors are wrong.
  Could this have to do with the register that stops display of 8 leftmost pixels?
  Need to see actual game.


- todo: palettized textures
  it is indicated that you can have 64 palette tables of 16 entries each (4BPP textures)
  each tile can contain a maximum of 4 colors (2 bits / pixel) taken from 4 4-color (2bit) palettes
  so it is probably not necessary to have 4 copies of each tile in memory---just 1 copy,
    and send which palette we want to use in the "TSP" control bits for each texture.
  we'd only be using the first 4 entries of the first four 16-entry DC palettes in this case.
  is that true?
  also, when the nintendo updates its own palette entries, we can just update the DC palette
  entries as well, without having to re-upload every affected texture.
  we would have to put the actual RGB value into each DC palette entry, while the nintendo
  just says "use hardware color X"
  mid-screen palette updates would be weird.  we could use unused palette space for these,
  but it would be difficult (how many games do this anyway?)
  what about sprite palettes?  we can use 4 more DC palette entries for these (leaving 56 left!)
     but check pNES for comments on why sprite textures were not practical
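
Sketch of the "first 4 entries of the first four 16-entry DC palettes" idea.  Everything here is an assumption for illustration: the flat dc_pal array stands in for the DC palette memory, nes_rgb is a hypothetical 64-entry hardware-color-to-RGB table, and no real PVR call is shown.

```c
#include <stdint.h>

#define DC_BANK_SIZE 16   /* entries per 4bpp palette bank */

/* Copy the four 4-color NES background palettes into the first four
 * entries of the first four DC palette banks.
 *   nes_pal : 16 bytes of NES palette RAM (hardware color numbers)
 *   nes_rgb : 64-entry table mapping a hardware color to an RGB value
 *   dc_pal  : DC palette memory, at least 4 * 16 entries */
static void upload_bg_palettes(const uint8_t nes_pal[16],
                               const uint16_t nes_rgb[64],
                               uint16_t *dc_pal)
{
    for (int bank = 0; bank < 4; bank++)
        for (int c = 0; c < 4; c++)
            dc_pal[bank * DC_BANK_SIZE + c] =
                nes_rgb[nes_pal[bank * 4 + c] & 0x3f];
}
```

When the game rewrites its palette, only this function needs to run again; the uploaded tile textures stay untouched.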

- todo: sprite 0 hit
  . Sprite 0 hit has to do with why status bar ends early in SMB1, Zelda, Castlevania.
  Rendering could be wrong, or it could be that new cpu core changes the timing from the (fudged)
  tuxnes timings.  
  . NESTECH claims sprite 0 hit is set when PPU -starts- refreshing first line when sprite 0 is
  located, and is cleared at end of frame (VBL).  This appears to be incorrect.
  conte says clear after VBL not before, marat agrees--confirm with s0, since it displays background
    when sprite ends.
  check PPU patent, but it should only be set when the first visible pixel that overlaps a non-zero
  bg pixel is drawn.  Note: we can't emulate this perfectly with line-based rendering; changes can only
  take place as soon as next line.
  . zelda: clock 05520: sprite0hit is at 16581, y=48, x=213; spriteram[0](y)=39
  . smb1:  clock 03487: sprite0hit is at 10443, y=30, x=213; spriteram[0](y)=24
  . I think the calculation for sprite0hit is wrong.  It adds 40 cycles, ostensibly for time from which
  sprite0 is hit until flag is set---is this right?  It also adds 85 cycles, for the HBLANKing period.
  . when I throw in a drawimage in $2002 read when flag is set, s0.NES bar flickers.  This doesn't make sense
  . i'm thinking drawimage should be called whenever the sprite0 flag is changed..once the
    drawimage issues are fixed
  v cpu timing is off and can cause $2002 read a few hundred cycles too late.  Moving sprite0hit
    back by 80 cycles fixes smb1, zelda, castlevania, but is NOT A GOOD FIX!!
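
Arithmetic check on the two logged hits above: both fit clock = y * 341 + x.  The 341 clocks per line is inferred from those two values, not read from the source, and sprite0_hit_clock is a hypothetical name.

```c
#define LINE_CLOCKS 341   /* inferred from the logged values */

/* Clock, counted from the start of the frame, at which a hit at
 * scanline y and horizontal position x is flagged. */
static int sprite0_hit_clock(int y, int x)
{
    return y * LINE_CLOCKS + x;
}
```

zelda: 48 * 341 + 213 = 16581.  smb1: 30 * 341 + 213 = 10443.  This says nothing about whether y and x themselves (including the +40 and +85 terms) are right.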

- todo: battery-backed memory -- apparently a header byte signifies whether or not battery backed memory
  is present -- rather than deducing based on contents of $6000-$7fff
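
Sketch of the header check: in the iNES format, bit 1 of header byte 6 marks battery-backed RAM.  ines_has_battery is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the 16-byte iNES header marks battery-backed RAM,
 * 0 if it does not, -1 if the header magic is wrong. */
static int ines_has_battery(const uint8_t hdr[16])
{
    static const uint8_t magic[4] = { 'N', 'E', 'S', 0x1a };
    if (memcmp(hdr, magic, sizeof magic) != 0)
        return -1;
    return (hdr[6] & 0x02) ? 1 : 0;
}
```

Dirty headers may still get this bit wrong, so falling back to the $6000-$7fff contents check could be kept as a second opinion.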

- todo: check vblank flag cleared after first read to $2002 after flag set:
	  ppustatus equ $2002
	nestc_detect:
	  bit ppustatus  ;wait for a vblank
	  bpl nestc_detect
	  bit ppustatus  ;NES should have cleared it after the last read...
	  bmi is_nesticle  ;...but NESticle and friends don't.
	  ; if you got here, you're not on nesticle
	is_nesticle:
	  lda #nesticle_warning_screen
	  jsr show_dialog
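
What the emulator side of that test needs, as a sketch: the $2002 read handler returns the status and then clears bit 7.  The ppustatus variable and read_2002 are hypothetical names, not the ones in io.c.

```c
#include <stdint.h>

#define PPUSTATUS_VBLANK 0x80

static uint8_t ppustatus;   /* hypothetical backing variable for $2002 */

/* Read handler for $2002: return the current value, then clear the
 * vblank flag so that a second read sees bit 7 clear. */
static uint8_t read_2002(void)
{
    uint8_t value = ppustatus;
    ppustatus &= (uint8_t)~PPUSTATUS_VBLANK;
    return value;
}
```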

- todo: check I flag not set when NMI taken
v todo: fix flashing screen bug
- todo: figure out why at game init we will display background color of previous game for a few frames,
  even though we clear all video memory at game init
- todo: fix initial sound burst (from previous game) at game init
