The mysterious CEL flags that disable loading elements

There is a series of CEL flags with the LD prefix that I hadn't properly used before, except one that seemed to be the only one working: LDPLUT. Actually, I phrased that wrong. When you use CreateCel to make a new CCB, all four of these flags, LDSIZE, LDPRS, LDPIXC and LDPLUT, are enabled by default. It is when you decide to disable them that things go awry!


I have (ab)used LDPLUT in OptiDoom once (although I may have removed this trick in recent versions, as some refactoring for speed made it impractical to keep), where a wall segment would send individual CELs (in a linked list) for each column of a wall. So if a single wall covered the whole screen, 280 CCBs would be passed (in reality Doom has a bigger array, fed with various different elements, that is only sent and flushed when it reaches the max). Since a wall uses a palettized texture that is the same for every column, why does every CCB have to be fed with the palette pointer, and why does the hardware have to reload the same palette again and again? Needless to say, this was a hack proving the concept of "premature optimization is the root of all evil", as the CPU bottleneck is so big compared to this small optimization that there wouldn't be any measurable difference. But I was just experimenting with whatever I had in mind at the time. And I learned a bit about these LD flags.

So, these flags tell the CEL hardware to load things from a CCB, or to skip them if the flags are absent. It works like this: if I send a CEL with a pointer to a palette, the palette is read from memory and the hardware state changes to use that palette. There might be dedicated palette registers in the hardware, for all I know, whose state has now changed. But for every individual CEL you send, the palette has to be reloaded from the palette pointer (ccb_PLUTPtr) in the CCB. The idea is that you can clear this flag for all the CCBs except the first one. The first one will load the palette into the hardware; the rest will not provide a palette at all and will reuse the one already loaded. This saves a lot of unnecessary uploading to the hardware.
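The pattern can be sketched like this. Note this is an illustrative mock, not real SDK code: the flag name CCB_LDPLUT and the field names come from the SDK's graphics.h, but the flag value and the trimmed-down struct here are stand-ins so the sketch is self-contained.

```c
#include <assert.h>

/* Stand-in for the real CCB_LDPLUT bit defined in the SDK's graphics.h. */
#define CCB_LDPLUT 0x00800000

/* Trimmed-down mock of the CCB; only the fields this sketch touches. */
typedef struct CCB {
    unsigned int ccb_Flags;
    struct CCB  *ccb_NextPtr;
    void        *ccb_PLUTPtr;
    /* ...the remaining CCB fields are omitted for brevity... */
} CCB;

/* Make every cel in a linked list after the first reuse the palette
   that the first cel loaded into the hardware state. */
static void SharePLUT(CCB *first, void *sharedPlut)
{
    CCB *c;
    first->ccb_Flags |= CCB_LDPLUT;   /* first cel uploads the palette */
    first->ccb_PLUTPtr = sharedPlut;
    for (c = first->ccb_NextPtr; c != 0; c = c->ccb_NextPtr)
        c->ccb_Flags &= ~CCB_LDPLUT;  /* the rest skip the 64-byte reload */
}
```

With 1024 cels sharing one PLUT, only the head of the list carries the palette upload; the other 1023 inherit the hardware state.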

That worked in a test of 1024 sprites of size 16*16 at 8bpp. The 8bpp palettized mode really uses 5 bits for the palette index (32 colors), while the rest of the bits affect other things (they still affect color, by adding shade to the existing color; I haven't experimented with that, to be honest). With LDPLUT enabled (forcing the loading of each individual palette) I got 35fps. With it disabled I got 38fps. A slight improvement, but good for such a simple trick. I even think the bottleneck here is reading the texture: 1024 * 16*16 = 256k of texels, while 1024 * 32 * 2 (16-bit palette entries) = 64k. I could try later with smaller sprites or 4bpp data.


But this works. However, I didn't have the same luck with the rest of the flags. All my sprites would disappear and/or the frame rate would drop extremely low, depending on where I ran it (different emulators, the real machine). Yet I guessed and found what you should do if you disable the other three flags, LDSIZE, LDPRS and LDPIXC: you have to truncate the CCB struct! And I did it by hand, as I haven't seen a suitable function in Lib3DO to do this.

First of all, what do these flags do (when you clear them)?

  • LDSIZE: If not set, the hardware will not load the scale vectors HDX, HDY, VDX and VDY. You don't need to load them for sprites that have no scale or rotation. But the values already in the hardware state will be used, so you have to send a single sprite (your 1st in the list) with this flag enabled and those four vectors at their default values for 1x1 scale.
  • LDPRS: If not set, the hardware will not load the 3rd vector, HDDX and HDDY. Whatever was in the hardware state before will be reused. These are only necessary if you are rendering arbitrary polygons. If you have a big list of sprites, scaled/rotated or not, then these should always be set to zero. The first sprite, with LDPRS enabled, will pass them as zero, and the rest of the sprites in the list can simply clear the flag.
  • LDPIXC: This one affects the shading/blending capabilities of the CEL. Why reload it again and again (with the default value 0x1F001F00 for opaque) if you have a thousand sprites where no shading or blending is applied? Or better said, with the same shading/blending applied: if you have a thousand sprites that all blend with the background using the exact same PIXC value, you can enable the flag for the 1st one and clear it for the rest, so PIXC is loaded into the hardware once and all the remaining sprites reuse that value, with no need to re-upload.
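The three bullets above can be sketched together. Again this is a self-contained mock, not SDK code: the flag values and struct layout are stand-ins for the real definitions in graphics.h, and the 12.20 (HDX/HDY) and 16.16 (VDX/VDY) fixed-point formats for the 1x1 defaults are my recollection of the docs, so double-check them.

```c
#include <assert.h>

/* Stand-in flag values; the real bits live in the SDK's graphics.h
   (where LDPIXC may appear under the name CCB_LDPPMP). */
#define CCB_LDSIZE 0x04000000
#define CCB_LDPRS  0x02000000
#define CCB_LDPIXC 0x01000000

/* Trimmed-down mock CCB with just the fields these flags govern. */
typedef struct {
    unsigned int ccb_Flags;
    int ccb_HDX, ccb_HDY, ccb_VDX, ccb_VDY;  /* HDX/HDY: 12.20 fixed point */
    int ccb_HDDX, ccb_HDDY;                  /* VDX/VDY: 16.16 fixed point */
    unsigned int ccb_PIXC;
} CCB;

/* The first cel carries the shared state with all three LD flags set... */
static void SetSharedState(CCB *first)
{
    first->ccb_Flags |= CCB_LDSIZE | CCB_LDPRS | CCB_LDPIXC;
    first->ccb_HDX  = 1 << 20;  first->ccb_HDY = 0;  /* 1x1 scale defaults */
    first->ccb_VDX  = 0;        first->ccb_VDY = 1 << 16;
    first->ccb_HDDX = 0;        first->ccb_HDDY = 0; /* zero for sprites */
    first->ccb_PIXC = 0x1F001F00;                    /* opaque PIXC */
}

/* ...and every following cel clears them and reuses the hardware state. */
static void UseSharedState(CCB *rest)
{
    rest->ccb_Flags &= ~(CCB_LDSIZE | CCB_LDPRS | CCB_LDPIXC);
}
```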
The two scaling vectors (loaded with LDSIZE) take 16 bytes. The 3rd projection vector (not a real texture projection, it's just called that) is 8 bytes (loaded with LDPRS). PIXC is even cheaper, just one 32-bit word, or 4 bytes. Really, the biggest gain already came from clearing LDPLUT, with its 64-byte palette. Disabling all four in my sprite experiment yielded 39fps for some reason. Yes, from 35 to 38fps with the palette alone, then 39 with the rest. I might not be seeing the true effect (I was expecting more), as the total bitmap-reading bottleneck hides it. I should try lower bit depths or smaller sprite sizes, even a 1x1 dot sprite. I could even scale these up a bit; remember, LDSIZE cleared doesn't mean no sprite scaling, as I might have implied, but "use the same scale as the last state in the hardware". So if you set your first sprite's vectors to 2x2 scaling and clear the flag on the rest, all your sprites will be scaled up. It's just that much less data has to be read for the rest of the sprites. The last hardware state is used.

And now, here is an example of the truncation of the rest of my sprites to make this work. It's awkward, but that's how I made it work. If you have 1024 sprites, do not truncate the first one; leave it as it is with the flags enabled (they are by default). Do the truncation on the next 1023 sprites and disable the flags. Here is how I prepare my CCBs before I disable all 3 flags (LDSIZE, LDPRS, LDPIXC); LDPLUT doesn't need truncation.
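For reference, here is the CCB field layout in the order the fields appear in the SDK's graphics.h. This is a sketch with the types collapsed to plain 32-bit words so it stands alone (the 3DO is a 32-bit machine; in the real header the pointer fields are actual pointer types).

```c
#include <assert.h>
#include <stdint.h>

/* CCB field order as in the SDK's graphics.h; 17 words = 68 bytes.
   Pointer fields are shown as 32-bit words for this sketch. */
typedef struct {
    uint32_t ccb_Flags;
    uint32_t ccb_NextPtr;                           /* struct CCB * */
    uint32_t ccb_SourcePtr;                         /* cel data     */
    uint32_t ccb_PLUTPtr;                           /* palette, LDPLUT */
    int32_t  ccb_XPos, ccb_YPos;
    int32_t  ccb_HDX, ccb_HDY, ccb_VDX, ccb_VDY;    /* loaded by LDSIZE */
    int32_t  ccb_HDDX, ccb_HDDY;                    /* loaded by LDPRS  */
    uint32_t ccb_PIXC;                              /* loaded by LDPIXC */
    uint32_t ccb_PRE0, ccb_PRE1;
    int32_t  ccb_Width, ccb_Height;                 /* library-only     */
} CCB;
```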


The last two fields, ccb_Width and ccb_Height, are not part of the CEL hardware. In fact, the bits defining the width and height in pixels (and the stride in words, I think) are hidden somewhere in ccb_PRE0 and ccb_PRE1, but the 3DO libraries thought it would be nice to have them handy for easy reuse. In fact, the original Doom source code declares its own struct, MyCCB, where these two are simply missing because they're a waste of space, unless you use certain Lib3DO functions that might expect to read them. So, the real hardware doesn't read these. But as it turns out, the real hardware reads the rest of the data sequentially, and if you clear the 3 LD flags (but not LDPLUT), it assumes you provide a truncated stream of data: it won't skip those gaps, it will expect PRE0 to come right after YPos in the data stream. I originally passed the full CCB with the flags cleared, which resulted in the wrong data being fed to the hardware.

Another thing to mention: my solution is lazy. I leave big gaps of empty data (or not exactly empty, but whatever happened to be there before). I could define a new kind of struct that is missing the empty1-7 variables, so a list of MyTruncatedCCB would be much smaller overall. The original CCB is 17*4 = 68 bytes. The new struct, assuming we also remove the useless ccb_Width and ccb_Height (but be careful if you are using a Lib3DO function that expects them), would be 8*4 = 32 bytes. A list of 1024 CCBs would go from 68k down to 32k.
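Such a MyTruncatedCCB could look like the sketch below. The struct name and layout are hypothetical (it's not an SDK type), following the hardware's sequential read order described above: with LDSIZE, LDPRS and LDPIXC cleared, PRE0 must follow YPos directly. Pointer fields are shown as 32-bit words so the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed CCB for cels that clear LDSIZE, LDPRS and LDPIXC.
   The scale vectors, the 3rd vector and PIXC are simply absent, so the
   hardware streams PRE0 right after YPos. Link it from the first (full)
   CCB's ccb_NextPtr with a cast. */
typedef struct {
    uint32_t ccb_Flags;
    uint32_t ccb_NextPtr;    /* CCB * / MyTruncatedCCB * on the real machine */
    uint32_t ccb_SourcePtr;
    uint32_t ccb_PLUTPtr;
    int32_t  ccb_XPos;
    int32_t  ccb_YPos;
    uint32_t ccb_PRE0;
    uint32_t ccb_PRE1;
} MyTruncatedCCB;            /* 8 words = 32 bytes, down from the full 68 */
```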

Finally, since we want the 1st of all our 1024 CCBs to provide all the data with the flags enabled, and only the remaining 1023 to disable them, the 1st should be a regular CCB and the rest MyTruncatedCCB. The ccb_NextPtr can link the generic CCB structure of the very 1st element to the next one with your truncated struct; it's declared as a CCB*, so if the compiler complains, cast it. It's a hack, but I haven't seen any Lib3DO function that makes it easy to abuse these kinds of flags and generate suitable special CCBs for you.

Here is my hacky code that truncates, for various cases, from an originally stored CCB to a manipulated CCB (I keep the CCB struct, I didn't define my own; I just move elements around and ignore the garbage left behind).

for (i = 1; i < SPRITES_NUM; ++i) {
    switch (cont) {    /* cont selects which flag combination is being tested */
        case LD_ALL_ON:
        default:
            break;

        case LD_PIXC_OFF:
            /* PIXC dropped: PRE0,PRE1 (8 bytes) slide up into its place */
            memcpy(&blobSpr[i]->cel->ccb_PIXC, &origSprCCB[i].ccb_PRE0, 8);
            break;

        case LD_SIZE_OFF:
            /* HDX..VDY dropped: HDDX,HDDY,PIXC,PRE0,PRE1 (20 bytes) slide up */
            memcpy(&blobSpr[i]->cel->ccb_HDX, &origSprCCB[i].ccb_HDDX, 20);
            break;

        case LD_PRS_OFF:
            /* HDDX,HDDY dropped: PIXC,PRE0,PRE1 (12 bytes) slide up */
            memcpy(&blobSpr[i]->cel->ccb_HDDX, &origSprCCB[i].ccb_PIXC, 12);
            break;

        case LD_ALL_OFF:
            /* all three dropped: PRE0,PRE1 (8 bytes) land right after YPos */
            memcpy(&blobSpr[i]->cel->ccb_HDX, &origSprCCB[i].ccb_PRE0, 8);
            break;
    }
}

I start from i=1 to avoid manipulating the very first, zero-index CCB. Spr is just my own struct for sprites (blobSpr is an array of them) that holds extra info besides the pointer to the CCB (cel). Based on the default struct member locations, I simply move some elements earlier, overwriting others we won't need anyway. Here I was testing various cases: what if we only disable LD_PIXC, or only LD_SIZE or LD_PRS, or finally all three? I just wanted to check performance; needless to say, it barely moved for the individual cases and only a little with LD_ALL_OFF (which, elsewhere in the code, also disabled LD_PLUT for the palette).

These could be useful performance-wise in some cases (maybe with much smaller textures and more CCB elements), but this is a start if you ever wanna use them and wondered why they don't seem to work: they need dirty truncation of the CCB structs, or the creation of your own manipulated structs. It takes some pain, but this is how they work. I will post new experiments if I find something else, or a really cool case that vastly improves performance.

Comments

  1. If I’m understanding correctly, by setting LDSIZE for the first cel, and then not passing it for subsequent cels, is how you’d achieve the sort of screen scaling effect seen in Samurai Showdown? If so, then I can see how straightforward and simple that would be for a programmer.

    Replies
    1. Interesting, I never thought about that. When Samurai Showdown scales up, everything scales up by the same amount; one only needs to reposition things based on the scale. I remember looking at an emulator back then that had a hack specifically for Samurai Showdown. I didn't know what the hack was at the time; it was a code change applied only to fix an issue in Samurai Showdown. It could be the LDSIZE. I didn't know about it back then.

      Generally, the CEL hardware has these flags that can tell it, during the rendering of each CEL element, to skip loading various things because they are already loaded. It's like a state-machine renderer, where the previous state of certain things remains.

