Let it crash?

I was playing around with a script to describe the behavior of turrets when I encountered a little problem: if an object the turret is following suddenly disappears from the scene (for whatever reason), the script would crash because it would try to refer to an object which no longer exists. Remember that for memory safety and other reasons, objects are referred to via ID’s in the scripting system: a script object always wraps an ID, not a reference. The ID becomes invalid once the object is removed from the scene, and so, when a script tries to call a method on such an object (for example, “obj GetPosition”), we get a crash. How do we solve this?

My first idea was to modify the spec of the scripting language and its do..undo constructs to also deal with failures. However, I decided it’s not worth it, as it would overcomplicate the language. Do..undo blocks, first and foremost, deal with expected cancellations, not unplanned-for hard failures. As such, undo blocks assume the objects they touch are still valid: if we crash inside a “do” block when accessing “obj GetPosition”, we will most likely crash in the corresponding “undo” block too, as it might access “obj” as well (for example, to set the object’s animation back to “idle”, as is the usual practice).

The second, brief, idea was to create a new set of blocks, similar to do…undo blocks, but with the semantics of dealing with hard failures. I immediately dismissed it, as it would require me to write too much code (introducing tons of bugs in the process).

The third idea was “let it crash”. It required only a few code modifications. Basically, we check whether the script track crashed, and if it did, we just create a new track state. It works OK for the turret: the target disappears, the turret’s script crashes as it tries to refer to a dead object, the script is reloaded, and the turret tries to find the next target to follow (according to the script).
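On the engine side it boils down to a few lines in the update loop. A simplified sketch (Track::HasCrashed() and the surrounding names are illustrative, not the exact API):

    #include <vector>

    struct TrackSlot { Executable* executable; Track* track; };

    // Simplified sketch; Track::HasCrashed() is an illustrative name.
    void PumpTracks(std::vector<TrackSlot>& tracks)
    {
        for (auto& slot : tracks)
        {
            slot.track->Resume();

            if (slot.track->HasCrashed())
            {
                // "Let it crash": throw the broken track state away and
                // start a fresh one from the same compiled executable.
                slot.track = new Track(slot.executable);
            }
        }
    }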

It works OK for my simple turret script, but I’m not yet sure about state consistency. A script which crashes in this way may leave its own object, or other objects, in a partial, corrupt state. Most importantly, it tears atomic operations! If an access to a dead object happens inside an atomic block, that atomic block is essentially preempted, which is against the spec.

I guess I have to devise another way to deal with this. All in all, the modification to reload scripts on crash is a good step forward: now, if there’s a problem in a script, a game object won’t freeze forever; instead, it will be reloaded. Even if its state is partly corrupt (a wrong animation, for example), it’s better than nothing! When devising a new way to “softly” deal with such problems, I also need to strengthen the (currently implicit) rule that scripts should never modify the state of objects they are not attached to. That is, we should only modify “self” and no one else. If we want another object to be modified, we should instead send it a message asking it to modify itself.


Time serialization?

As scripts are suspendable/preemptable at any point, it’s possible to have busy loops without blocking the entire thread. This allows us to describe animations in a very straightforward manner:


local lastTime = Application GetCurrentTime
local angle = 0

while { true } {
   local curTime = Application GetCurrentTime
   local delta = curTime - lastTime

   local angle = angle + (delta * 0.05)
   self SetGroundAngle angle

   local lastTime = curTime
}

The problem is that if the script state is serialized and later deserialized, the variable “lastTime” will be wrong: it holds the time at which the original script was saved, so when we compare it to “Application GetCurrentTime” after deserialization, the delta will be far too large.

To solve this, I now serialize the current time. Upon deserialization, we look at the real current time and calculate a time adjustment value. From then on, each time a script calls “Application GetCurrentTime”, it gets the adjusted time, not the real time. This makes scripts think that the world never stopped across serializations.
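The adjustment itself is tiny: conceptually it’s just an offset added to the real clock. A sketch of the idea (the names here are illustrative):

    // Sketch; RealTimeSeconds() stands for the engine's real clock.
    double RealTimeSeconds();

    struct ScriptClock
    {
        double offset = 0.0; // added to the real time on every query

        // Backs "Application GetCurrentTime" in scripts.
        double Now() const { return RealTimeSeconds() + offset; }

        // On save: store the adjusted time the scripts last observed.
        double Serialize() const { return Now(); }

        // On load: shift the clock so scripts see no gap in time.
        void Deserialize(double savedTime) { offset = savedTime - RealTimeSeconds(); }
    };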


Oops… the runtime can have dangling references!

After I implemented full script state serialization, it’s now possible to create full snapshots of the game state, which allows the player to go back in time. And I encountered an interesting bug I hadn’t thought about.

My VM allows objects to be accessed in several ways. The most straightforward way is to wrap pointers to C++ objects. Another way is to wrap ID’s of objects managed by someone else. For that, script objects expose an intptr_t payload; the VM doesn’t care how it’s used. A native class implementation can provide “ref” and “unref” function pointers which can implement reference counting, for example. But you can also just ignore them (if it’s an ID) and expect someone else to manage those ID’s. ID’s are safer because errors such as access to a deleted object can be tolerated more gracefully, without memory corruption.
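To illustrate the two styles, here’s a rough sketch of what a native class descriptor can look like (the class and field names are approximate, not the exact ones; Mesh is a hypothetical refcounted engine class):

    #include <cstdint>

    // Rough sketch; names are approximate.
    struct NativeClass
    {
        const char* name;
        void (*ref)(intptr_t payload);   // optional: add a reference to a C++ object
        void (*unref)(intptr_t payload); // optional: release that reference
    };

    // Style 1: the payload is a real C++ pointer, kept alive by refcounting.
    NativeClass meshClass = {
        "Mesh",
        [](intptr_t p) { reinterpret_cast<Mesh*>(p)->AddRef(); },
        [](intptr_t p) { reinterpret_cast<Mesh*>(p)->Release(); },
    };

    // Style 2: the payload is just an ID managed by someone else;
    // ref/unref are ignored, and a stale ID degrades into a logged error.
    NativeClass gameObjectClass = { "GameObject", nullptr, nullptr };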

And so, game objects are exposed through ID’s in my setup. Testing the game, I noticed that sometimes, traveling back in time would complain that the VM attempted to serialize an object that does not exist. I traced the problem to a script called “walkHere.clone” which is used by player clones in the past. The path finding system is based on “crowds”, i.e. to make an entity navigate through the level, one must first attach it to a crowd and provide it with a “walk target” (the advantage is that you can dynamically move the walk target and path navigation will update). To achieve that, the script creates a temporary entity to be used as a walk target to point to a position in space:



do {
    atomic {
        global targetMarker = GameInternal AddWalkTarget activator
        targetMarker SetPosition targetX targetY targetZ
        crowd AddMember activator targetMarker
    }

In the undo block, when the target is reached, we delete the temporary walk target:



} undo {
   crowd RemoveMember activator
   Scene Remove targetMarker
   sleep 500
}

After a bit of debugging, I noticed that the type of the object it complains about is “marker”. And I realized that when we create a game state snapshot (to travel back to later), we may end up serializing the track state somewhere after the line “Scene Remove targetMarker”.

Basically, after that call, “targetMarker” does not exist anymore because it was removed from the scene and its ID is now invalid, yet it still exists as a global variable. The engine tries to serialize the track state, including the global variables, and upon trying to serialize “targetMarker”, it fails.

The solution is to null out the variable inside an atomic block:


    atomic {
        Scene Remove targetMarker
        global targetMarker = nothing
    }

    sleep 500

An atomic block guarantees that the script cannot be preempted in the middle of it, and track state serialization happens only after a script was preempted. So basically, the VM got its own idea of a “dangling pointer”. It’s funny how coroutine-based code, despite being physically single-threaded, still suffers from the same problems as real multi-threaded code. Thankfully, we have atomic blocks.


My scripting language

It’s usually argued that you shouldn’t write your own stuff, because NIH is looked down on. I love writing my own stuff, but at some point I sat down and thought: maybe I’m the one who’s wrong after all. So many people argue against NIH; maybe they’re right. And so I used Lua. You see, Lua is the number one choice in gamedev. I had pretty much no other choice. However, I came to dislike Lua.

First of all, vanilla Lua’s C interface is extremely, EXTREMELY confusing. The authors boast as if their C interface (i.e. the way you glue your native code to the scripting language) were very simple because it’s stack-based with no complex marshalling stuff or rules involved. But that’s like saying Brainfuck is simple just because it only has 8 operators.

I guess there can be a decent, passable stack-based implementation, but the way they implemented it in Lua is very, very error-prone. The API is just horrendous. I can never remember how functions should be called. You must remember by heart which exact stack index a particular function manipulates, and there can be many indices per function, with various gotchas. It’s easy to confuse one index with another, or overflow/underflow the stack, and things start breaking in mysterious ways. Some functions are plain dangerous, like those that modify objects while iterating. So instead of dealing with all that mess, I implemented a bunch of declarative wrappers which my engine would read and then construct the appropriate stack frames for Lua. So I had to deal with the Lua API only once (the ugliest part of my code was the part with calls into Lua). So far so good.

Until I decided to add green threads/coroutines. Coroutines in Lua, as far as I learned, are actually quite limited and awkward to use. Most of the examples (at least in the official documentation) showed explicit creation of coroutines from within the language itself, manual yielding, etc. I couldn’t figure out how to just feed it pieces of code and say, “pseudo-parallelize this automatically as green threads, with atomic blocks, unwindable actions, asynchronous functions, hot-swapping etc.” It looked like I would have to write tons of code to make it work the way I wanted, and it would end up a monstrous, barely working something held together with duct tape… Not that I really cared much at that point, because I was becoming more and more frustrated with how dirty and hacky Lua felt. During one of my experiments, I started getting random segfaults deep inside the Lua call stack. I tried to look at the source, but the part where it crashed looked quite cryptic to me: no comments, one-letter identifiers, very dense code, just one large unidentifiable sheet of code (with a deep call stack). And I thought: enough, I can write a better DSL than this in less time than it takes to read its documentation and remember which index goes where.

(People may ask: why use vanilla Lua if there’s LuaJIT and other magical things? Well, LuaJIT depends on a lot of assembly trickery. It’s a very slippery slope. If something breaks, it breaks spectacularly and you can’t do much about it, because go figure what kind of race condition or exotic misalignment in the marshal stub triggered this bug in the tracing JIT while it was concurrently marking the heap… It’s not scalable in many senses (non-portable, potential for difficult-to-track bugs etc.). The more complex a library, the more problems I’m going to have in the long term. I don’t want my infrastructure to rely on such a fragile library, considering that the domain does not require such voodoo anyway.)

So in a week I created a replacement which was designed with all those requirements in mind. One more week to refine some corner cases (like unwinding inside an atomic block inside a sleep function etc.) and it was pretty much production-ready (I haven’t had problems with it since then, and it’s been 4 months).

1. First of all, there’s no global state. Non-value-type (stateful) objects in my engine are reference-counted from a common root, and so are the scripting language’s objects. So there’s no need for some sort of global state object, because objects are automatically deallocated when no one uses them. It means that several tracks (what I call coroutines/threads in the system) can share the same global variable state and refer to the same C++ objects with no problem. There’s no need for a GC either, as reference counting is intrinsic to both the C++ side and the DSL side. Basically, to wrap a C++ object and expose it to the DSL, all I need is to construct a script value like so: SScriptValue(myCppObject). That’s it!

2. Syntax is the least interesting part. It’s quite simple and handled by a hand-written parser. It’s basically a set of specialized S-expressions, where functions have {} syntax, top-level expressions don’t need parentheses, and several operators like “return”, “while” etc. have built-in support.

An example:


        local i = 0
        while { i < 10 } {
            Console WriteLine (i ToString)
            local i = i + 1
        }

Here, `while` is an expression that accepts two function expressions (condition and body). It’s intrinsic and compiles to a very specific bytecode. This kind of syntax requires a shockingly simple parser.

3. As mentioned above, scripts are compiled to bytecode (in memory). This is a requirement for tracks. A “track” is a little piece of interpreter state which executes “executables” (objects that represent compiled scripts). A track has its own stack and global variables (unless shared). You can create as many tracks as you want, provided you point them at an existing executable. All you need to run a track is to call Track::Resume(). That’s it. So, basically, this is how you set up my DSL:


    auto exe = Executable::Compile(code);
    auto track = new Track(exe);
    track->SetVariable("enemy", enemyObj); // some environment-specific variables to glue it all together
    while(track->Resume()) { }

That’s it. Several tracks can share the same executable. You can create as many tracks as you want. 1000 (one for each NPC in the scene)? No problem. Create them and forget about them; reference counting will do its work. Now, the best part is ::Resume(). The track’s execution can be suspended/resumed at any time. There are two modes. In explicit mode, the code itself has to call yield to stop executing and return from Resume. This is the worst mode. In implicit mode, you set the track’s “quantum”:


    track->SetQuantum(100);

This means that the track in Resume() will execute 100 bytecode instructions and will suspend itself automatically (return from the Resume function). You can see how with this simple system you can build very custom green thread implementations. All I need to do is to iterate over known tracks, set appropriate quanta and call Resume(). You can interleave it easily with other work, like rendering frames, doing physics etc. As you’re in charge of setting the quanta, you can implement your own scheduler.
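So the entire “scheduler” can literally be a few lines in the frame loop. A simplified sketch (everything except SetQuantum/Resume is illustrative):

    while (gameRunning)
    {
        // Give each script a slice of bytecode, interleaved with engine work.
        for (auto& track : tracks)
        {
            track->SetQuantum(100); // at most 100 instructions this frame
            track->Resume();        // returns as soon as the quantum is spent
        }

        UpdatePhysics();
        RenderFrame();
    }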

In the same way, it’s very simple to implement a stop-the-world scenario, for example when the user opens the game menu. To do it, you just… don’t call Resume() on the tracks.

4. However, all that wouldn’t be so amazing without do..undo blocks. In my game, the player should be able to cancel whatever his character was doing. This is accomplished with do..undo blocks. Consider this theoretical script of dancing:


    do {
        self SetAnimation "dance"
        while { true } { }
    } undo {
        self SetAnimation "idle"
    }

Undo blocks basically describe how to return to the original state which was “spoiled” by the corresponding do block. When the player activates the script, the character starts dancing. See the infinite loop? It’s not really blocking; it’s basically just busy waiting. First, if the track has a quantum set, say, 100 instructions, then Resume() will periodically exit the interpreter to give time to other tracks. And second, the DSL has “cancellation injection”.

Basically, when the user clicks the “cancel” button, my code calls track->Cancel(). This injects cancellation into the track. The track starts unwinding: it jumps to the corresponding “undo” block from wherever it was and returns the character to the original state (the “idle” animation). It’s kind of similar to try…finally, except you can inject it at any point from outside the code, and the whole unwinding process is suspendable as well. For example, a script may want to actually make the character walk back to the original place, which can take several seconds. So undo blocks can run for an indeterminate amount of time. Or you can put sleep 10000 inside undo and the script won’t do anything for 10 seconds (sleep suspends the track immediately and refuses to resume for the specified time, unless cancelled). There can be complex nested do…undo blocks.
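From the engine’s point of view, cancellation is just another thing the regular Resume() pump takes care of. A sketch (only Cancel/SetQuantum/Resume are the real calls here):

    // The "cancel" button handler: inject cancellation into the track.
    void OnCancelPressed(Track* track)
    {
        track->Cancel(); // the track will unwind through its undo blocks
    }

    // The usual per-frame pump; unwinding (undo blocks, sleeps inside undo,
    // nested do..undo) runs under the same quanta as normal execution.
    void PumpTrack(Track* track)
    {
        track->SetQuantum(100);
        track->Resume();
    }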

5. Now, that is cool and all, but some actions should be *atomic*. A script’s execution can be suspended at any time (or cancellation can happen at any time), and we often want to make sure some piece of logic is not preempted in the middle (for example, updating a global variable after first comparing it to some value: several tracks can be competing for this global variable, which would introduce race conditions). The solution is the introduction of atomic blocks:


    atomic {
        doSomething1 a b
        doSomething2 a b
        doSomething3 a b
    }

An atomic block guarantees that the track never suspends in the middle of it (meaning that atomic { while { true } {} } will hang the whole thread). If cancellation is injected while a track is executing inside an atomic block, the track is merely flagged; only once execution leaves all the atomic blocks (atomic blocks can be nested) is cancellation actually injected.

Another guarantee is that cancellation cannot be injected between do and atomic:


    do {
        atomic {
            # initialization code...
        }
        
        # Some other code...
    } undo {
        # deinitialization code
    }

The atomic block here guarantees that the initialization code is always run in full, so that deinitialization always matches it (it would be strange to undo an action which never happened due to premature cancellation). The same can be achieved by placing important initialization outside of do..undo blocks (only do..undo blocks are cancellable).

6. Another feature (“feature”) I have is the fact that closures are pseudo-closures. For example:


    local a = 0
    local inc = {
        return a + 1
    }

This looks like a cool-ass closure, but really isn’t. When the DSL resolves “a”, it just scans the current stack state for the appropriate symbol. It won’t work if “inc” is remembered and called in a context with no “a” on the stack. This, again, allows me to implement pretty simple lambdas with no memory overhead. Most functions take a lambda, do something with it, and forget about it. Rarely do they actually save the lambda somewhere (at least in my scripts), so actually saving the captured context isn’t a necessity for me.

7. Now, you’d think that it would take a lot of time to retrofit this DSL into my existing project (originally Lua-based), but it actually took, like, 2 days. The thing is, from the very beginning Lua was hooked up through declarative structures describing which types/methods are exposed. So all I needed was to hook the DSL to those declarative structures again, and the DSL got access to the existing scripting function implementations in no time. I only needed to rewrite the scripts (I didn’t have many by that time).

8. Another feature I implemented recently is hot-swapping. When you’re editing a script, you want to see how it behaves in the engine. The engine automatically detects changes and hot-swaps tracks with new tracks based on the new code (the script manager exposes tracks via ID’s, so this replacement is transparent to the user). All the global variables of the old track are copied to the new track. But what’s cooler is that the engine doesn’t just reload the script, it actually gracefully cancels outdated versions of scripts. So, for example, if a script modified some object, it would undo that modification (by unwinding undo blocks), returning the object to its original state, before being hot-swapped. This is good because the new version expects such an object to be in the original state. It avoids problems with partial/corrupt state due to sudden script termination. The only problem is that some scripts never enter do..undo blocks and so can never be gracefully cancelled, so for those I have a maximum timeout of 3 seconds.
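In rough pseudo-C++ the swap looks something like this (a simplified sketch; TrackSlot, CopyGlobals and RealTimeSeconds are illustrative helpers):

    void HotSwap(TrackSlot& slot, const std::string& newCode)
    {
        auto newExe = Executable::Compile(newCode);

        // Ask the old track to unwind its do..undo blocks gracefully,
        // but don't wait forever: some scripts never enter a do block.
        slot.track->Cancel();
        double deadline = RealTimeSeconds() + 3.0;
        while (slot.track->Resume() && RealTimeSeconds() < deadline) { }

        // Start a fresh track from the new code and carry the globals over.
        Track* newTrack = new Track(newExe);
        CopyGlobals(slot.track, newTrack);
        slot.track = newTrack; // the exposed track ID stays the same
    }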

9. And the last feature is asynchronous methods. You can associate a track with a thread pool (again, there’s no centralized state, so each track can have its own thread pool, or you can make 100 tracks share a single thread pool etc.). A method marked asynchronous (only native methods can be, at this point) will be automatically scheduled to run on the thread pool (with all the required marshalling of the arguments). A track which waits for the result of an asynchronous method suspends and returns from Resume() immediately, and stays suspended until the result is there. So it’s quite easy to add asynchronous methods to the language in such a way that they look no different from synchronous ones.
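Hooking it up is, again, per-track. A sketch (SetThreadPool and the pool itself are illustrative; only SetQuantum/Resume are the real calls described above):

    track->SetThreadPool(sharedPool); // e.g. 100 tracks can share one pool

    // From the host's point of view nothing changes: while an async call is
    // in flight, Resume() just returns without running the track, so the
    // frame loop stays exactly the same.
    for (auto& t : tracks)
    {
        t->SetQuantum(100);
        t->Resume();
    }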

What’s important is that all this doesn’t employ any assembly trickery, it’s pretty simple and well-documented C++ code. I have no desire to return to Lua or C#.

The next step is to implement script state serialization, for saving/reloading the game state at any point without fearing that a script may bug out. This requires coordination between many systems.


My time rewind (game replay backwards)

When I got the idea to implement backwards game replay for a time rewind effect, my engine had already been designed to route all access to the visual world state through one flat, uniform API:

int id = StageModel("models/chair.obj", &err);
SetModelPosition(id, pos, &err);

This has several advantages:

1) Better memory management: other layers cannot leak memory or hold dangling pointers.
2) The API is potentially serializable (network? replay?)
3) Encapsulation: layers that deal with game logic have no idea how models are implemented at all (even what classes they are made of), all we have is a model ID (or spline ID, etc.) which simplifies a lot of code.
4) Models in a session can be guaranteed to have practically unique IDs, which gracefully solves the problem if a script (especially a user script/mod) tries to reference a deleted model: we can simply log the error without crashing hard.

Disadvantages:

1) All accesses that modify a model go through a layer of indirection, which adds some overhead: first of all, we need to locate the internal model object by its ID. I currently mitigate this by caching the last accessed value, so that a chain of accesses such as:

SetModelPosition(id, pos, &err);
SetModelScale(id, scale, &err);

etc.

is almost as fast as direct field access.
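The cache is just the last (id, pointer) pair, checked before the real lookup. A sketch with illustrative internals (Model, LookupModelInTable and SetError are not the real names):

static int    g_lastId    = -1;
static Model* g_lastModel = nullptr;

static Model* FindModel(int id, Error* err)
{
    if (id == g_lastId)
        return g_lastModel;            // hot path: repeated access to the same model

    Model* m = LookupModelInTable(id); // slow path: look the model up by its ID
    if (!m) { SetError(err, "unknown model id"); return nullptr; }

    g_lastId = id;
    g_lastModel = m;
    return m;
}

void SetModelPosition(int id, const Vector3& pos, Error* err)
{
    if (Model* m = FindModel(id, err))
        m->position = pos;
}

(The cached pair has to be invalidated when a model is unstaged, of course.)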

So what about replay?

The API is easily replayable because it deals in simple POD structures. When recording is enabled, all accesses to these API’s are recorded to a stream: we remember the function ID and the parameters. For example, SetModelPosition(id, pos) is basically 20 bytes of data in the stream: 4 bytes to identify the function, 4 bytes for the id, and 12 bytes for the position vector. The overhead is almost zero; we basically just copy 20 bytes.
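The recording side of SetModelPosition, for example, is nothing more than this (a sketch; Stream and its Write helpers are illustrative):

const uint32_t CMD_SetModelPosition = 2; // function ID (one per API function)

void RecordSetModelPosition(Stream& out, int id, const Vector3& pos)
{
    out.WriteU32(CMD_SetModelPosition); // 4 bytes: function ID
    out.WriteI32(id);                   // 4 bytes: model ID
    out.WriteVector3(pos);              // 12 bytes: position, 20 bytes total
}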

Replaying forward is very straightforward: we simply read the stream, deserialize all the function calls and invoke the flat API again in the same order. This should give us the same end result as during the original play (although differences in FPS have to be accounted for). Note that we only record/replay the visual world state; we don’t record game logic at all (it’s suspended during replay). This decoupling of game logic and render logic greatly simplifies things.

So what about time rewind?

It all kind of falls apart when you want to replay the stream backwards. Something which used to be this:

int id = StageModel("models/chair.obj", &err);
SetModelPosition(id, pos, &err);

now becomes this:

SetModelPosition(id, pos, &err);
int id = StageModel("models/chair.obj", &err);

As you can see, SetModelPosition tries to access an ID which has not been created yet. The next idea during brainstorming was, obviously, to replay whole frames backwards, not individual commands. But it’s the same problem: a model can be created in frame 0 and then reused in frame 1. We can’t reverse that, because in frame 1 we’d try to access a model which does not exist yet.

So my brute-force solution was to run a processing pass I call “simulation”, which happens after we’re done recording and want to replay it. Now, after issuing a frame to render, we also emit a new special opcode, “FrameEnd”. This serves as a way to reconstruct where frames start and end when processing the stream. Then we “simulate” each frame (a “SimFrame”) by running through the recorded stream. When a StageModel is found, we create a “SimModel” object and store it in the frame’s internal table of objects. When a SetModelPosition is found, we retrieve the SimModel and assign its position value. By the time we reach the end of the SimFrame, we have a collection of SimModels (I’m simplifying; we also have SimSkeleton and SimSpline) with the latest relevant properties. So, to re-render the frame, we can simply clear the world, iterate over the known SimModels, stage everything, set the required properties, and we’re done.

However, that still does not solve the problem of accessing models that do not exist yet. To solve it, when creating a new SimFrame, we call curFrame.InheritState(prevFrame) on it. It basically imports all the accumulated data from the previous, already simulated, frame. The first frame ever inherits from the global world state outside of the recording (because the first frame can reference models which were created before the recording started). Now each SimFrame is what I call “time-independent”, i.e. we don’t need to replay frames in the correct order, because each frame carries a complete list of model data as of the moment the frame was issued. We can simply clear the world state, recreate the models, render them, in any order, and all is good.
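A condensed sketch of the simulation pass (the Stream/CMD_* names are illustrative; this assumes the ID returned by StageModel is also written to the stream, and the real code handles SimSkeleton/SimSpline the same way):

#include <map>
#include <string>

const uint32_t CMD_StageModel = 1, CMD_SetModelPosition = 2, CMD_FrameEnd = 3;

struct SimModel { std::string path; Vector3 position; /* scale, materials, ... */ };

struct SimFrame
{
    std::map<int, SimModel> models; // keyed by the recorded model ID

    // Import everything the previous frame already knew about, so this
    // frame becomes "time-independent" and can be replayed on its own.
    void InheritState(const SimFrame& prev) { models = prev.models; }
};

SimFrame SimulateFrame(const SimFrame& prev, Stream& in)
{
    SimFrame frame;
    frame.InheritState(prev);

    // Fold this frame's recorded commands into the table of SimModels.
    for (uint32_t cmd = in.ReadU32(); cmd != CMD_FrameEnd; cmd = in.ReadU32())
    {
        switch (cmd)
        {
        case CMD_StageModel:       { int id = in.ReadI32(); frame.models[id].path = in.ReadString(); break; }
        case CMD_SetModelPosition: { int id = in.ReadI32(); frame.models[id].position = in.ReadVector3(); break; }
        // ...other recorded calls update the corresponding SimModel fields
        }
    }
    return frame;
}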

However, what I found was that this solution was quite bad performance-wise: each frame, we stage many models, render them, and then delete them. Remember that we did not reuse existing models, because we want to be able to replay in any order (including backwards).

My solution was the following. Remember that we can (almost) guarantee that model ID’s are practically unique within a game session (it’s possible to overflow, but the recording would have to run for days non-stop). What this means is that if frame 0 refers to a model ID “42” and frame 1 refers to a model ID “42”, we can be 99.9% sure they refer to the same object (shown here going backwards in time):

frame 1:
SetModelPosition(42, Vector(6, 6, 6))

frame 0:
int r = StageModel("model.obj") // returned "42"

After the simulation step, we have:

Frame 1:
SimModel #42 (position = Vector(6, 6, 6))

Frame 0:
SimModel #42 (pos = default)

When we replay Frame 1, we call the same API, StageModel(..), to create a model to re-render; however, due to the uniqueness guarantee, the actual ID will be something new, say 68. So we maintain an “ID rewrite map” where we remember: “when you see 42 in a SimFrame, it’s actually 68 in the real world!”

In Frame 0, we can now consult the ID rewrite map to see if we already have a backing model for our virtual model #42 from the stream. And we do! It’s most likely the same object, so we reuse it. This is especially important if we have dynamically generated models in the scene: we don’t have to regenerate them. Material states don’t have to be recreated either.
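The rewrite map itself is just a map from recorded IDs to live IDs, consulted before staging anything new. A sketch (continuing the illustrative SimModel/SimFrame names from above; MarkReused ties into the garbage collection step below):

#include <unordered_map>

std::unordered_map<int, int> idRewrite; // recorded ID -> live (current world) ID

int ResolveOrStage(const SimModel& sim, int recordedId)
{
    auto it = idRewrite.find(recordedId);
    if (it != idRewrite.end())
        return it->second;                           // reuse the already-staged real model

    Error err;
    int liveId = StageModel(sim.path.c_str(), &err); // e.g. recorded 42 becomes live 68
    idRewrite[recordedId] = liveId;
    return liveId;
}

void ReplaySimFrame(const SimFrame& frame)
{
    for (const auto& [recordedId, sim] : frame.models)
    {
        Error err;
        int liveId = ResolveOrStage(sim, recordedId);
        SetModelPosition(liveId, sim.position, &err);
        MarkReused(liveId); // models not marked after this pass get unstaged
    }
}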

The last step is what I call garbage collection. When a real-world model is reused between two replayed frames (in any order) to replay a SimModel, we mark that real-world model as “reused”. All models (and skeletons, splines) which were not marked “reused” are unstaged (we don’t want to render them!) and deleted from the ID rewrite map.

I jump through all these hoops because I want recording during actual gameplay to be zero-overhead and memory-friendly. It’s in this post-processing step that we actually figure out what was going on.

Now that the time rewind gimmick is in place, I’m working on full script state serialization, so that it’s possible to restore the game state at a marked point in time with as few potential glitches as possible.
