How many micro sprites could the 3DO handle?
A long time ago I watched the Atari Jaguar demo Facts ( https://www.pouet.net/prod.php?which=50467 ). It showed off the power of the Jaguar by using the object processor to push as many sprites as possible. I was quite impressed back then, and wondered whether the same thing would be possible on the 3DO. So I started a project, loading an image similar in size and bit depth to the one in the Jaguar demo, but this time it was a different animal: what I assume is the mascot of the 3DO, a gecko lizard. The total image size is 192*160 at 16bpp, broken into pieces of 4*4 pixels, for a total of 1920 sprites. For now I have only tried to replicate the second part of the Jaguar demo; I might try the spirals part in the future. The result of my attempts is here, now reaching 80fps with vsync off!
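(For reference, here is the tile math as a quick plain-C sketch; the numbers are just the figures from above, nothing 3DO-specific:)

```c
/* How many 4*4 micro sprites a 192*160 image breaks into. */
static int spriteCount(int imgW, int imgH, int tileW, int tileH)
{
    int cols = imgW / tileW;   /* 192 / 4 = 48 tiles per row    */
    int rows = imgH / tileH;   /* 160 / 4 = 40 tile rows        */
    return cols * rows;        /* 48 * 40 = 1920 micro sprites  */
}
```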
But it wasn't like that initially. The earliest version ran at 40fps. The animation code on the CPU would, on its own, reach 109fps (with the rendering of the CELs disabled but the CELs still repositioned every frame). With some optimizations that went over 200fps (it's only 1920 elements to move, after all), but I was still stuck at barely 56fps overall. Later on I tried to push the CEL rendering. Rendering without moving the CELs initially gave 50fps max; a few easy techniques got that to 64fps, and then I discovered some more things that can hurt performance when you bother rendering many micro-cels. One fix got me to 72fps with rendering alone. The final fix surprised me: I never expected such an improvement from disabling a CCB flag I had considered not worth bothering with. Could you call it a CEL bug? Sort of. But it's interesting and good to know, and I'd like to write something about it.
The first thing that pushed me from 50fps to 64fps was avoiding the WOFFSET technique. With that approach I load the initial big 192*160 bitmap for a single CEL, then create new CELs as sub-windows of it: each 4*4 micro CEL sets its SourcePtr to point at its 4*4 tile inside the big 192*160 image, but then its woffset can't be 2 words; it has to reflect the full 192-pixel width instead (192 / 2 - 2, since WOFFSET is in 32-bit words while the bit depth is 16 bits, and the hardware wants the value stored minus 2; also, the minimum width of a bitmap line is 8 bytes, which happens to fit our 4*4 16bpp sprite exactly). I quickly realized that this maybe doesn't cache well in the hardware, or that something else is going on. So I spent the extra memory to copy the 4*4 blocks from the big linear image into a buffer where each 4*4 tile is stored linearly on its own, with no woffset tricks needed to sub-window small tiles out of a bigger image. This seems to suit how the CEL hardware loads the little data for the micro-cels much better than trying to be too clever with woffset.
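(The two layouts can be sketched in plain C like this; it's illustrative code, not the actual 3DO SDK calls, and `linearizeTiles` is just a name I made up for the copy step:)

```c
#include <stdint.h>
#include <string.h>

/* WOFFSET is stored in 32-bit words minus 2. For a windowed 4*4 sub-cel
 * inside the 192-pixel-wide 16bpp image:
 *   192 pixels * 2 bytes = 384 bytes = 96 words  ->  WOFFSET = 96 - 2 = 94
 * For a linearly stored 4*4 16bpp tile, each row is 8 bytes = 2 words
 * (also the hardware's minimum line width):
 *   WOFFSET = 2 - 2 = 0
 */

enum { IMG_W = 192, IMG_H = 160, TILE = 4 };

/* Copy every 4*4 block of the big image into its own small linear
 * buffer, so each micro-cel reads contiguous memory instead of
 * striding through the big bitmap with a large woffset. dst must hold
 * IMG_W * IMG_H pixels; tiles are stored one after another in
 * row-major tile order. */
void linearizeTiles(const uint16_t *src, uint16_t *dst)
{
    int ty, tx, row;
    for (ty = 0; ty < IMG_H / TILE; ty++)
        for (tx = 0; tx < IMG_W / TILE; tx++)
            for (row = 0; row < TILE; row++)
                memcpy(dst + ((ty * (IMG_W / TILE) + tx) * TILE + row) * TILE,
                       src + (ty * TILE + row) * IMG_W + tx * TILE,
                       TILE * sizeof(uint16_t));
}
```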
That kind of makes sense, because who knows how the CEL hardware really works internally in every aspect? And here comes my big surprise. I was looking at what happens when enabling/disabling various CEL flags. Most of them didn't seem to have any effect (enable CCB_ACW and not CCB_ACCW? Set CCB_TWD? Other flags that could alter the way the hardware prepares to render each micro-cel?). Oh, I almost forgot to say: I had already implemented the tricks from my post https://3dologic.blogspot.com/2021/11/the-mysterious-cel-flags-that-disable.html. That also gave a slight improvement (though I can't remember how much, or whether it came before the woffset optimization). But the one flag I didn't disable, because I thought it was meaningless, was LD_PLUT. I am using PRE0_LINEAR when creating the cel, which means the cel type is CEL_TYPE_UNCODED. It's a 16bpp texture, so the palette isn't used at all. I even set the PLUT pointer to NULL to be sure.
And here is the surprise. I never thought I would even need to disable LD_PLUT (which tells the CEL engine to try to load a palette), as I assumed the hardware, in its 16bpp UNCODED mode, would simply skip reading the palette even if told to load it. And NULL in the palette pointer should also help with that. But no!!! As a naive test, thinking it wouldn't matter, I disabled LD_PLUT on the micro-cels, and what did I get? 98fps!!! Up from the 72fps bottleneck, just for the little LD_PLUT. Out of curiosity about what the hardware would do, I even set the PLUT pointer to 0x12345678. And, well? It froze!!! It tried to load the palette even though the cel is 16bpp and UNCODED, where it should know that this bitmap format skips palette loading. Even the 3DO API functions that create a CEL always enable LD_PLUT in the flags, regardless of whether the cel is CODED or UNCODED. But that's not good for the hardware.
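(The fix itself is one line. Here's a sketch with a mocked-up CCB; on real hardware you'd include the 3DO headers and use the SDK's own CCB struct and CCB_LDPLUT definition instead of the stand-in values below:)

```c
#include <stdint.h>

/* Stand-in for the SDK's LD_PLUT flag bit -- the value here is
 * illustrative only; use the real definition from the 3DO headers. */
#define CCB_LDPLUT  0x00400000u

/* Mock of the two CCB fields that matter for this fix. */
typedef struct {
    uint32_t ccb_Flags;
    void    *ccb_PLUTPtr;
} MockCCB;

/* For an UNCODED 16bpp cel the palette is never used, but the hardware
 * apparently still fetches it whenever LD_PLUT is set -- so strip the
 * flag explicitly on every micro-cel. */
void stripPlutLoad(MockCCB *ccb)
{
    ccb->ccb_Flags &= ~CCB_LDPLUT;
    ccb->ccb_PLUTPtr = 0;  /* NULL alone is NOT enough, as the freeze test showed */
}
```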
I don't know if it's a hardware bug or an oversight, or whether I need to recheck my code, but I was flabbergasted. And also happy that the CEL rendering could still improve, when previously I thought I had hit a wall and nothing could push it well beyond 60fps, which I need in order to safely beat the record and leave time to implement more on the CPU (I haven't implemented the added sphere-mapping offset from the Jaguar demo, but that should be fast if it only operates on a few cel positions). Now it's 80fps with vsync off. Plenty of room for more while still keeping the throne of 60fps!
But that's another thing to know: if you use UNCODED CELs, especially very many of them, disabling LD_PLUT is still a good thing to do. Previously I assumed it was unnecessary. Well, the lesson here is: never assume.