Faster Texture Mapping

This is from my October, 1996 posting to the code-optimize mailing list. It describes an efficient way to organize the addressing within a texture to improve the number of L1 cache hits:

At Chris's request, here is a brief description of my stripped texture mapper.

Texture mapping code can spend a lot of its time, even a majority in extreme cases, doing nothing but loading the CPU cache. An obvious way to reduce the quantity of cache misses is to arrange the texture in memory in a tiled format. Sampling will tend to occur within the same tile and touching the same cache lines before moving on to the next tile.

The catch-22 is that changing the texture memory format to a tiling scheme increases the complexity of the addressing. You risk giving back all the cycles you saved and possibly even more (plus you get more code to debug as a bonus!)

My thinking was that it would be all right to expand the inner loop by one or two instructions if I could eliminate most of the cache misses. The problem was to find a way to simplify the addressing to give me the benefit of tiling but at a small cost.

The idea I hit on is to arrange the texture not in tiles but in vertical strips of eight pixels width. This has the same effect as tiles but the addressing is much simpler. Each four adjacent rows of a strip sit in one cache line on a Pentium so each cache line holds the equivalent of an 8x4 tile. This works very well.

The nifty part of an eight-pixel width is that it has the magical property of solving the hard (to me) problem of addressing across rows. Moving from columns is easy but moving across rows involves adding a number that can't be conveniently held in the carry flag. _Except_ in this case where we can use an SIB addressing mode and let the CPU do the work for us. This trick buys back most of the additional complexity.

Here's how the addressing works. You use two registers. One keeps track of the strip and the column within the strip. It has a gap of unused bits where the row goes. The row comes from another register. It is shifted into the gap and added with the other register to give you the address.

Here is an example: EBX and ESI are used together to address the texture. ESI is shifted left by three bits [ebx+esi*8] to position the row address. The register diagram below shows how they are used. The numbers reflect the bits required for a 256x256 texture. A 1024x1024 texture would require 7 bits in the Strip field and 10 in the Y field.

            +-------------16--------------+----5----+-----8-----+---3---+
            |                             |         |           |       |
EBX:        |                             |  Strip  | reserved  |   X   |
            |                             |         |           |       |
            +-----------------------------+---------+-----------+-------+

     +---3--+-------------------21------------------+-----8-----+
     |      |                                       |           |
ESI: | free |                                       |     Y     |
     |      |                                       |           |
     +------+---------------------------------------+-----------+

A serious problem to solve is how to carry across strips when X overflows. The solution is to keep the Reserved bits set to zero and add in all ones as a part of the X constant. This way, a carry from the column will propagate over to the Strip bits. The Reserved bits need to be cleared after this operation. I've racked my brain to find a way to avoid this step but with no success.

Here's the good news. The "AND" instruction to clear the Reserved bits is the only additional cost over the texture mapper used by Michael Abrash in Quake. Just one more instruction. To be thorough, this technique does have a soft cost in that it uses an additional register to do the addressing. Intel registers are a scarce commodity so it is possible that this could translate into a hard cost in your implementation.

The "AND" does have a couple of useful side benefits that are worth mentioning. It makes tiling in the X direction free. It also makes cylindrical mapping free.

There a three free bits at the top of the Y register which can be used for a counter (or even a low-precision fraction).

The inner loop works something like this:

          add   ecx,[Yfrac]          ; add y fraction
          adc   esi,[Yconst]         ; add y constant
          add   edx,[Xfrac]          ; add x fraction
          adc   ebx,[EbxAdd]         ; add x constant
          and   ebx,[EbxMask]        ; clear Reserved bits
          mov    al,[ebx+esi*8]      ; get pixel

The texture base address can be stored in either register (if it is suitably aligned) or it can be written into the texture mapping code.


Back to David's attic