Faster RGB Lighting
This is from a March 1998 posting to codeoptimize by Terje Mathisen. It describes my technique for combining an RGB texture with a RGB light, normally costing three multiplies, with only one integer multiply:
A couple of months ago, I made a post to UseNet (probably comp.lang.asm.x86) about how it is possible to use FMUL to do antialiasing of (8bit palletized) textures.
I.e. load the two nearest texels, scale them both using the fractional distance to the sample point, and then combine them to get the resulting RGB quad.
All this can be done very quickly by using a few tricks:
1. Make a double prec palette, where each entry contains a truecolor 8:8:8 RGB triplet, but with 8 or 9 zero bits between the parts, i.e.
dRGB = (double) R + (double) G * 2048.0 + (double) B * (2048.0 * 2048.0);
The scaling then becomes a simple matter of taking the fp U texture coordinate, FADD a constant such that the result (when FSTP'ed to memory) is a fixedpoint number with 8 fractional bits, and then use the integer part as the texture address.
The 8 fractional bits can then be used either directly, or as an index into a 256entry scaling table, to scale the two (double) texels surrounding the texture coordinate.
The same idea can also be used for bilinear filtering, but this will be a little slower due to the extra table lookups, two more FMULs and two more FADDs.
Anyway the final result must be stored to memory before the final RGB values can be extracted. Total time for all of this should be in the 1015 cycle range.
Some time later David Stafford sent me an email, where he mentioned that a similar trick could be used to do true RGB lighting, i.e. not just light intensity!
He had found a way to take 15bit (5:5:5) texel data, and a 12bit (4:4:4) rgb intensity, and then multiply this together using a _single_ integer MUL operation!
I must admit that I had to ponder this for a while, with a couple of false starts, before I was able to figure out how this could work:
The key idea is that the RGB texel and the rgb light values must both be aligned so they start on the same bit offsets in a 32bit register, with the gap between the components setup in such a way that there will be no overlap between the rR,gG,bB results we want and the other 6 unwanted pairs (rG, gR, rB, bR, gB, bG):
After setting up the matrices, it became obvious that this problem only has a single possible solution:
RGB = R + (G << 9) + (B << 27); rgb = r + (g << 9) + (b << 27);
After multiplication, the result becomes:
RGB * rgb = rR + ((rG + gR) << 9) + (gG << 18) + ((rB + bR) << 27) + ((gB + bG) << 36) + (bB << 54);
All the partial results consists of 9 bits, with no possibility of overlap, except for the middle (gG) product, where the (rG + gR) sum could conceivably generate a carry into the least significant bit of gG.
This is still not critical, because there is no way this could lead to an overflow of the gG value, worst case (odds of 1:16) it would give a one ulp error in the green component.
After confirming that this is indeed how he does it, he told me that he uses this with palettized textures, where he has already setup the palette values with the proper shifts between the RGB components. This makes the approach very efficient, esp. on PPro/PII cpus, where MUL is fast, and can overlap with fp operations.
The rgb intensity value is of course kept in a pair of 32bit registers, with the extra bits in between used for fractional intensity.
Before the MUL step you just MOV and AND the top register to generate a properly scaled intensity triplet.
The final packing of the 64bit result into a 32/24/16/15 bit result uses 4.5 Pentium cycles, so the full rgb lighting code can run in about the same time as my fp anitaliasing algorithm.
Very, very ingenious code. Vintage David Stafford indeed. :)
Terje


Using selfdiscipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"