How many micro sprites could the 3DO handle?
A long time ago, I watched the Atari Jaguar demo Facts ( https://www.pouet.net/prod.php?which=50467 ). It boasted the power of the Jaguar by using the object processor to push as many sprites as possible. I was quite impressed back then and wondered whether it would be possible to do the same thing on the 3DO. I started a project that loads an image similar in size and bit depth to the one in the Jaguar demo, but this time it is a different animal: a gecko lizard, assuming that is the mascot of the 3DO. The total image size is 192*160 in 16bpp, broken into pieces of 4*4 pixels, totalling 1920 sprites. For now I have only implemented the second part of the Jaguar demo; I could try the spirals part in the future. The result of my attempts is here, now reaching 80fps when vsync is off!
But that wasn't the case initially. The earliest performance was 40fps. The animation code alone, running on the CPU, managed 109fps (with the rendering of the CELs disabled but still repositioning them every frame). With some optimizations that went over 200fps (it's only 1920 elements to move anyway). But the whole thing was still barely at 56fps. So next I tried to push the CEL side. Rendering the CELs without moving them initially gave 50fps max, then a few easy techniques got it to 64fps, and later I discovered some more things that can reduce performance when you render many microcels. One fix got me to 72fps with rendering alone. The final fix surprised me: I never expected such an improvement from disabling a CCB flag that I had considered not worth bothering with. Could you call it a CEL bug? Sort of. But it's interesting and good to know, and I'd like to write something about it.
The first thing that pushed me from 50fps to 64fps was to stop using a WOFFSET technique. With that technique I load the big 192*160 bitmap into a single CEL, then create new CELs as sub-windows into it, so each 4*4 micro CEL sets its SourcePtr to point at its small 4*4 tile right inside the big 192*160 image. Its woffset then can't be the value for a 2-word row; it has to reflect the full 192-pixel width instead (192 / 2 - 2: the width is counted in 32-bit words while the bit depth is 16-bit, and the hardware wants 2 subtracted from the word count. The minimum width of a bitmap line must be 8 bytes, which fits our 4*4 16bpp sprite nicely). I quickly realized that maybe this doesn't cache well in the hardware, or something else is going on. So I spent the extra memory to copy the 4*4 blocks from the big linear image into a buffer where each 4*4 tile is stored linearly on its own, and no woffset tricks are needed to sub-window small tiles out of a bigger image. This seems to suit the way the CEL hardware loads the little data for the microcels better than tricking it (or feeling too clever) with woffset.
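To make the arithmetic and the copy concrete, here is a minimal sketch in plain C, not the demo's actual code. The function names are made up for illustration, and the "row words minus 2" rule is taken from the description above rather than from the SDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TILE_W = 4, TILE_H = 4 };

/* Length of one 16bpp row in 32-bit words (two pixels per word). */
static int row_words_16bpp(int width_px)
{
    return (width_px * 2) / 4;
}

/* woffset value as described in the post: row length in words, minus 2. */
static int woffset_field_16bpp(int width_px)
{
    return row_words_16bpp(width_px) - 2;
}

/* Copy every 4x4 tile of a linear 16bpp image into dst so that the
   16 pixels of each tile are contiguous. dst must hold img_w * img_h
   pixels. Returns the number of tiles written. */
static size_t split_into_tiles(const uint16_t *src, int img_w, int img_h,
                               uint16_t *dst)
{
    size_t n = 0;

    assert(img_w % TILE_W == 0 && img_h % TILE_H == 0);

    for (int ty = 0; ty < img_h; ty += TILE_H) {
        for (int tx = 0; tx < img_w; tx += TILE_W) {
            for (int row = 0; row < TILE_H; ++row) {
                /* One tile row is 4 pixels = 8 bytes. */
                memcpy(dst + n * TILE_W * TILE_H + (size_t)row * TILE_W,
                       src + (size_t)(ty + row) * img_w + tx,
                       TILE_W * sizeof(uint16_t));
            }
            ++n;
        }
    }
    return n;
}
```

For the 192*160 image this yields 1920 tiles. A sub-windowed tile would need a woffset of 192 / 2 - 2 = 94, while a standalone 4-pixel-wide tile row is exactly 2 words, so the same rule gives 0.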
That might kind of make sense, because who knows how the CEL hardware really works internally in every aspect? And here comes my big surprise. I was looking at what happens when enabling/disabling various CEL flags. Most of them don't seem to affect performance (enable CCB_ACW and not CCB_ACCW? Set CCB_TWD? Other flags that could alter the way the hardware prepares to render each microcel?). Oh, I almost forgot to say, I had already implemented the tricks from my post https://3dologic.blogspot.com/2021/11/the-mysterious-cel-flags-that-disable.html. That also gave a slight improvement (but I can't remember how much, or whether it was before the woffset optimization). But the one flag I didn't disable, because I thought it was meaningless, was LD_PLUT. I am using PRE0_LINEAR when creating the cel, which means the cel type is CEL_TYPE_UNCODED. It doesn't use the palette, as it's a 16bpp texture that doesn't need one. I even set CCB_PlutPtr to NULL to be sure.
And here is the surprise. I never thought I would even need to disable LD_PLUT (which tells the CEL to try to load a palette), as I assumed the hardware, in its 16bpp UNCODED type, would just skip reading the palette even if you tell it to. And NULL on the palette pointer would also help with that. But no!!! As a naive test, thinking it wouldn't matter, I disabled LD_PLUT on the microcels, and what did I get? 98FPS!!!!!!!!! From a 72 max bottleneck! Just for the little LD_PLUT. I even made a test and set Source_PlutPtr to 0x12345678, out of curiosity about what the hardware would do. And, well? It froze!!! It tried to load the palette even though the cel is 16bpp and UNCODED, where it should know that this bitmap format can skip palette loading. Even the 3DO API functions always enable LD_PLUT in the flags when they create a CEL, regardless of whether it's CODED or UNCODED. But that's not good for the hardware.
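The fix itself is just clearing one bit on every microcel. Here is a minimal sketch, assuming stand-in types so that it compiles anywhere: `DemoCCB`, `DEMO_LD_PLUT` and `disable_plut_load` are hypothetical names, and the flag's bit value is an assumption, so take the real one from the SDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the SDK's CCB, reduced to the fields this sketch touches. */
typedef struct DemoCCB {
    uint32_t        flags;   /* CEL control flags */
    struct DemoCCB *next;    /* next CEL in the render list, or NULL */
    const void     *source;  /* pixel data */
    const void     *plut;    /* palette pointer, unused by UNCODED cels */
} DemoCCB;

/* ASSUMED bit for the "load palette" flag; check the real SDK value. */
#define DEMO_LD_PLUT 0x00800000u

/* Clear the palette-load flag on every CEL in a linked list.
   Returns how many CELs actually had the flag set. */
static size_t disable_plut_load(DemoCCB *first)
{
    size_t changed = 0;

    for (DemoCCB *c = first; c != NULL; c = c->next) {
        if (c->flags & DEMO_LD_PLUT) {
            c->flags &= ~DEMO_LD_PLUT;
            ++changed;
        }
    }
    return changed;
}
```

You would run something like this once, after building the list of 16bpp UNCODED microcels and before the first render. CODED cels still need their palette loaded, so they should not go through it.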
I don't know if it's a hardware bug or an oversight, or maybe I need to recheck my code, but I was flabbergasted. And also happy that the CEL rendering could improve, because previously I thought I had hit the wall and that nothing I could do would get us well beyond 60fps, which I need so that I can safely beat the record and have time left to implement more on the CPU (I haven't implemented the added sphere mapping offset from the Jaguar demo, but that one should be fast if it operates on only a few cel positions). Now it's 80fps when vsync is off. Plenty of room for more while still keeping the throne of 60fps!
But that's another thing to know. If you use UNCODED CELs, especially a great many of them, disabling LD_PLUT is still a good thing to do. Previously I assumed it was unnecessary. Well, the lesson here is Never Assume.
This is really impressive! It would be interesting to see if the 3DO could do the spirals as well, as presumably this would be the ARM/maths co-processor doing the calculations and not the Cel Engine(?). I'd also assume that the music would be straightforward, given there is an audio DSP that is not doing anything?
Are there any 3DO-specific enhancements that could push it further, for example, using the SPORT bus to display a background image at the same time?
It all goes to demonstrate just how little of the 3DO's potential has so far been realised. I used to think the Jaguar was the more powerful console, and perhaps it is in a raw processing use-case, but the 3DO certainly has a lot of specialised functionality that I suspect the Jaguar would be hard pushed to reproduce.
Right now the rotations of those sprites are happening purely on the CPU. Maybe I could have tried the hardware matrix on the 3DO to see if it is faster, but the way I have optimized the CPU code is so simple and straightforward that I am not sure I would gain rather than lose anything with the HW matrix. I think I could figure out similar optimizations for the spirals too. The bigger challenge in the beginning was to figure out why the CEL couldn't get over 60fps in the pure rendering of the sprites, even without moving them. Once I had achieved that, the CPU rotation of 1920 positions was easier to optimize.
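For readers wondering what rotating the positions on the CPU involves, here is a generic 16.16 fixed-point sketch. It is not the optimized routine from the demo: all names are made up for illustration, and the 64-bit intermediate is used for clarity rather than speed.

```c
#include <assert.h>
#include <stdint.h>

#define FP_SHIFT 16
#define FP_ONE   (1 << FP_SHIFT)

typedef struct { int32_t x, y; } Point;   /* 16.16 fixed point */

/* Fixed-point multiply. Relies on arithmetic right shift of negative
   values, which common compilers provide. */
static int32_t fp_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FP_SHIFT);
}

/* Rotate count points around the origin. cos_a and sin_a are the
   cosine and sine of the angle in 16.16 fixed point. */
static void rotate_points(const Point *src, Point *dst, int count,
                          int32_t cos_a, int32_t sin_a)
{
    assert(count >= 0);

    for (int i = 0; i < count; ++i) {
        dst[i].x = fp_mul(src[i].x, cos_a) - fp_mul(src[i].y, sin_a);
        dst[i].y = fp_mul(src[i].x, sin_a) + fp_mul(src[i].y, cos_a);
    }
}
```

With 1920 positions that is four multiplies per sprite per frame, which is the kind of workload the paragraph above is talking about.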
I can drop in a nice background with the SPORT and I think it won't affect the frame rate; it's very, very fast. Clearing the screen with the SPORT (which I already do) is pretty fast. If I render a background, I don't know how fast it is, as the SPORT copy-background function forces vsync on no matter what. As long as it's a steady 60fps after trying that, I will be certain it's ok, but I won't be able to benchmark how far above 60fps it is. But SPORT is fast and I am confident it will work well.
I think the Jaguar sometimes has more bandwidth for blitting images to the screen, but the 3DO is not bad either if you know which traps of the CEL to avoid. The Jaguar might do better thanks to its faster programmable processors: its main CPU is worse, but the others could achieve more than the slow ARM. As far as I know, the bottleneck only shows up when they try to do real texture mapping. There is a bottleneck between the memory and the processors, which have to be manually programmed to rasterize texture-mapped scanlines. The 3DO is going to be faster at that because the CEL hardware does it for you.