Describing the Problem
Itís been eight years since I wrote Qlife. I remember how exciting it was to see it running at hundreds of generations per second on my 386-33. My K7-650 runs it at about 8,000 generations per second now. That was on a 201 x 200 grid and with todayís faster computers itís practical to run larger simulations. Unfortunately, the old Qlife topped out at less than 64K cells and there's just not much you can do with a small grid. There was nothing inherent in the algorithm that would prevent it from being used for larger grids so bringing Qlife up-to-date in flat 32-bit code seems like a reasonable thing to do. You can get my old Qlife here: qlife.zip
How large a grid could we make and still get good performance? At 8,000 generations per second on a 201 x 200 grid weíre effectively processing 321 million cells per second! On a 650 mhz machine that comes to about one cell every two cycles. This number is a bit of a fake since the core idea behind Qlife is to use a change-list and avoid touching every cell but itís still a useful number for purposes of estimating how well Qlife will scale to larger grids. Also, the 8,000 generations was with a grid that was initially random. Iíve seen much larger numbers on grids that had very little activity. On a large grid configured by a human there wonít be a lot of random activity so the performance should be even better. Iíll stick with the 8,000 number though to be conservative.
A 1,024 x 1,024 grid should run about 306 generations per second. Thatís about what my 386 did on a 201 x 200 grid! Even an 8,192 x 8,192 grid, if we could build such a thing, would run almost five generations per second so it would still be quite usable. We can expect to find cache complications with larger grids so our performance might not scale as well but the key point is that very large grids are practical now.
Thatís enough incentive to update Qlife but it wouldn't be satisfying unless the new Qlife is even faster than the old one. I thought about this off and on for a few years but nothing ever came to mind that would work. One idea I tried back in 1992 was to make a single pass through the change-list rather than two. I never submitted this version because it actually ran a few percent slower than the two-pass method. It executed less code but it had to touch more memory and the result was poorer cache coherency. If anything, this trend is even more extreme today as CPU speeds continue to outrun memory speeds. Maybe there was a way to make this more efficient? I couldn't see one but it doesn't mean that it can't be done. It's worth keeping in mind.
Another idea would be to make the grid itself more efficient. I was always annoyed at storing the odd number of three cells in a word. It would be nice if I could store four. The benefit would be huge. The grid would take less memory and a smaller grid is more cache friendly. The change-list would be significantly shorter and we would visit fewer cell clusters on each pass. Less data and less code. Everything gets better.
Right, this is a great goal, but is it possible? If you examine the old Qlife format it looks tightly packed. This is the description I wrote at the time:
Each three cells in the life grid are packed into two bytes:OABCabcX XXYYYZZZ O : 0 if an internal cell, 1 if on an edge (must handle wrapping) ABC : the life/death cell states for the next generation abc : the life/death cell states for the current generation XXX : neighbor count for cell 'A' YYY : neighbor count for cell 'B' ZZZ : neighbor count for cell 'C'
So, it is convenient if the width of the cell array is an even multiple of three. There's nothing in the algorithm which prevents it from supporting any arbitrary size but the code is a bit simpler this way. So if you want a 200 x 200 grid I recommend just using a 201 x 200 grid and be happy with the extra free column. Otherwise the edge wrapping code gets more complex.
Since every cell has from zero to eight neighbors you may be wondering how I can manage to keep track of them with only three bits. Each cell really has only a maximum of seven neighbors since we only need to keep track of neighbors OUTSIDE of the current cell word. That is- if cell 'B' changes state then we don't need to reflect this in the neighbor counts of cells 'A' and 'C'. Updating is made a little faster.
The basic idea is to maintain a "change list". This is an array of pointers into the cell array. Each element points to a word which changes in the next generation. This way we don't have to waste time scanning every cell since most of them do not change. For a 201 x 200 cell array the change list will be, at maximum, 26,800 bytes (plus one more word to terminate the list). That's only if EVERY cell changes in one generation. In practice it is much less.
Two passes are made through the change list. The first pass updates the cell display on the screen, corrects the life/death status of each cell and updates the neighbor counts for the adjacent cells. There are some efficiencies gained by using cell triples rather than individual cells since we usually don't need to examine all eight neighbors.
The second pass checks (and possibly resets) the life/death status for the cells and their neighbors to build the change list for the next generation.
Processing each word is a little complex but very fast. A 64k block of code exists with routines on each 256-byte boundary. Generally speaking, the entry point corresponds to the high byte of the cell word. This byte contains the life/death values and a bit to indicate if this is an edge condition. During the first pass we take the high byte of the cell, AND it with 0xFE and jump to that address. During the second pass we take the high byte, OR it with 0x01 and jump to that address.
Determining which changes must be made to a cell triple is easy and surprisingly quick. There's no counting! Instead, I use a 64k lookup table indexed by the cell triple itself. The value is equal to what the high byte should be in the next generation. If this value is equal to the current high byte then no changes are necessary to the cell. Otherwise it is placed in the change list. Look at the code in the Test() and Fix() functions to see how this is done.How each segment is used CS : 64k code (table). DS : DGROUP (1st pass). 64k cell life/death classification table (2nd pass). ES : Both change lists (so we can use STOSW). SS : DGROUP. The life cell grid and row/column table. Indexed via BP. FS : Video segment. GS : Unused.
There doesn't appear to be any flab in this format. Adding one more cell looks impossible. We could go to a 32-bit format instead of 16-bits but that has its own problems. First, the grid gets larger (exactly the opposite of what I wanted to achieve.) Second, tables that are indexed by cell values grow beyond 64K and cache performance will deteriorate.
No, it didn't look like it was possible to make this work.
And this is where I found myself stuck for a very long time.
On to Part 2. The first insight
Back to the main page