

Precision Drift in t() Over Long Uptimes

Bug Description

Over several hours of uptime, the floating-point division behind t() loses precision, and the resulting drift grows the longer the process runs.

This results in noticeable glitching in all applications that derive any of their values from t() at any point during their _update() or _draw() functions.

Summary

  • Root cause: lua54_time() uses 60.0f (32-bit), resulting in precision drift
  • Fix: change 60.0f => 60.0
  • Alternatives: new stat(STAT_SEC_TICKS) or stat(STAT_TICKS_RUNNING, n)

If this is behaving as intended or you go with the "Fix", I implore you to consider implementing one of the stat() solutions!

Context

  • time() is the Lua-side binding of the C function lua54_time()
  • lua54_time() returns cproc->ticks_running / 60.0 as a lua_Number
  • ticks_running is a once-per-tick integer counter
    • (tick = 1/60 seconds) => t() increments ~0.01667 seconds per tick
  • No reliable sub-second API for long-running apps
    • stat(STAT_EPOCH) => returns the current unix timestamp => whole seconds
    • date("!*t") => constructs a table of time data from the current unix timestamp => whole seconds

Reproduction

  1. Run the bunny screensaver (or any loop using t() for timing)
    • To avoid accidents with the screensaver ending, run the screensaver as a normal cartridge
  2. Let it run for several hours
  3. Observe growing visual artifacts as frame data gets calculated incorrectly

Actual Behavior

  • t() steps by ~0.01667 s per frame, but floating-point error grows with uptime
  • After extended uptime, the precision loss is large enough to cause pronounced visual artifacts
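To put rough numbers on this, here is a small C sketch (mine, not taken from the Picotron source) that measures the gap between neighbouring 32-bit floats at a given value of t(). It assumes IEEE-754 binary32 floats; float_spacing and spacing_in_ticks are made-up helper names.

```c
#include <stdint.h>
#include <string.h>

/* Distance from x to the next representable float above it.
   Assumes IEEE-754 binary32 and a positive, finite x. */
static float float_spacing(float x) {
    uint32_t bits;
    float next;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bit pattern */
    bits += 1u;                       /* next bit pattern == next float up */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}

/* The same gap expressed as a fraction of one 1/60 s tick. */
static double spacing_in_ticks(float seconds) {
    return (double)float_spacing(seconds) * 60.0;
}
```

At t() = 14400 (4 hours) the gap is 2^-10 s, roughly 6% of a tick. From t() = 262144 (about 72.8 hours) the gap is 2^-5 s, which is larger than a whole tick, so a single-precision t() could no longer tell consecutive frames apart reliably.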

Expected Behavior

  • Consistent 1/60 s timing indefinitely, with no precision drift

Workarounds

Whole-Second Fallback

local START_TIME = stat(STAT_EPOCH)

function time()
    return stat(STAT_EPOCH) - START_TIME
end
t = time


This only works if the application doesn't need sub-second timing, which is a fairly niche use case.

Bit-Split Delay Hack

local TICKS_RUNNING = 0

function time()
    local high = TICKS_RUNNING >> 5 -- drop bottom 5 bits (effective div by 32)
    local low = TICKS_RUNNING & 0x1f -- store bottom 5 bits
    return (high / 1.875) + (low / 60) -- 1.875 == (60 / 32)
end
t = time

function _update()
    -- ...

    TICKS_RUNNING = TICKS_RUNNING + 1
end


This workaround gives us 32 times as long before precision errors appear. For long-running applications, though (users are wont to leave Picotron running), that still only buys a few days at most.
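For anyone checking the arithmetic: high / 1.875 + low / 60 equals (32 * high + low) / 60, which is ticks / 60 again. Below is a C port of the Lua function above, as a sketch only. It does all arithmetic in double and says nothing about Picotron's internal precision; split_time and nearly_equal are names I made up.

```c
#include <stdint.h>

/* C port of the Lua bit-split time(); all arithmetic here is in double. */
static double split_time(int64_t ticks) {
    int64_t high = ticks >> 5;     /* ticks / 32, rounded down */
    int64_t low  = ticks & 0x1f;   /* ticks % 32 */
    return (double)high / 1.875 + (double)low / 60.0;  /* 1.875 == 60 / 32 */
}

/* Absolute-tolerance comparison, to allow for rounding in the two divisions. */
static int nearly_equal(double a, double b) {
    double d = a - b;
    return d < 1e-9 && d > -1e-9;
}
```

For example, split_time(216000) gives 3600 seconds: high is 6750, low is 0, and 6750 / 1.875 is 3600.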

Identification

Having dug into the binary a bit, it appears that cproc's field for tracking the number of ticks the process has been alive is a long, and the assembly confirms that it is converted to a double (long-to-double conversion). However, the 60.0 is loaded as a 32-bit float, and the division is performed between two floats. This means that, for the purposes of lua54_time(), cproc->ticks_running is effectively a 32-bit float.

; FLOAT_60_0 is the address of where `60.0` sits in loaded
; memory. The Floating-Point Unit (FPU) is the part of the
; CPU that does maths on floats and doubles, and has a number
; of registers that can be treated as a stack. Please note
; the "dword ptr" signifier. A word, on x86, is 2 bytes.
; A "dword" is a "double word", so 4 bytes. "dword ptr"
; describes the instruction as treating the value as a pointer
; to a 4-byte value. In this case, that would be a `float *`.

FLD           dword ptr [FLOAT_60_0]            => __fpu_push(*FLOAT_60_0)
                                                => __fpu_push((float)60.0)
                                                ;  push 32-bit float 60.0 into
                                                ;; the FPU's register stack

CVTSI2SD      XMM1,qword ptr [RAX + 0x270]      => reg XMM1 = *(reg RAX + 0x270)
                                                => reg XMM1 = cproc->ticks_running

MOVSD         qword ptr [RSP]=>local_18,XMM1    => *reg RSP = reg XMM1
                                                => local_18 =
                                                    (double)cproc->ticks_running

FDIVR         qword ptr [RSP]=>local_18         => __fpu_push(
                                                     *reg RSP / __fpu_pop()
                                                   )
                                                => *reg __fpu_stk_top =
                                                     cproc->ticks_running / 60.0f

FSTP          qword ptr [RSP + local_10]        => *(reg RSP + local_10) =
                                                     __fpu_pop()
                                                => local_10 = __fpu_pop()
                                                ;  local_10 is stack variable


I only included the most relevant disassembly. For the full disassembly of lua54_time(), please see the "Reference" heading.

Conclusion

Confirmed: 60.0 is loaded as a 32-bit value, so cproc->ticks_running gets cast to a single-precision floating-point value.

I suspect the cause of this is an f suffix on the 60.0 as such:

int lua54_time(lua_State *L) {
    lua_pushnumber(L, cproc->ticks_running / 60.0f);
    return 1;
}
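To show what that suffix would cost, here is a standalone C sketch comparing the two divisions. It is a model of my guess above, not the real source: the ticks parameter stands in for cproc->ticks_running, and time_suspected and time_fixed are hypothetical names.

```c
#include <stdint.h>

/* Suspected behaviour: with a float divisor, C converts the tick count
   to float as well, so the whole division happens in single precision. */
static double time_suspected(int64_t ticks) {
    float seconds = (float)ticks / 60.0f;
    return seconds;
}

/* Proposed fix: a double divisor keeps the whole division in double. */
static double time_fixed(int64_t ticks) {
    return (double)ticks / 60.0;
}
```

With ticks = 33554432 (2^25, about 6.5 days of uptime), time_suspected returns the same value for that tick and the next one, because the tick count itself no longer fits in a float's 24-bit mantissa. time_fixed still separates the two.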

Proposed Solutions

I propose one or more of the following:

  • Modifying lua54_time()'s behaviour to return the result of a double-precision division
  • Introducing a new stat(STAT_SEC_TICKS) for determining how many ticks into a second the process is, using cproc->ticks_running % 60 or equivalent
  • Introducing a new stat(STAT_TICKS_RUNNING, n) for returning the integer result of cproc->ticks_running % n

Reasoning / Justification

I don't expect any of these to be adopted as-is; they are only suggestions for potential solutions. Of course, this all assumes the behaviour was unintentional. If this is behaving as intended, I implore you to still consider implementing one of the proposed stat() solutions.

  • Modifying lua54_time()'s behaviour to return the result of a double-precision division
    • it's the simplest and most effective solution
    • potentially a few cycles slower, but the difference is negligible in practice
    • a double constant needs 4 more bytes than a float, but that should not grow the executable: there are unused alignment bytes between FLOAT_60_0 and the next value, so the change would not affect the program in any meaningful way
  • Introducing a new stat(STAT_SEC_TICKS) for determining how many ticks into a second the process is.
    • this would function as a precision counterpart to t()
    • It lets users solve the problem for themselves if they need the higher precision.
    • This is most powerful in conjunction with stat(STAT_EPOCH), where stat(STAT_SEC_TICKS) would allow users to achieve additional precision in any timing data.
  • Introducing a new stat(STAT_TICKS_RUNNING, n) for returning the integer value of cproc->ticks_running % n
    • stat(STAT_TICKS_RUNNING, n?) where the return value is cproc->ticks_running % n when n is given, and the raw cproc->ticks_running when n is omitted (a literal default of n = 1 would always return 0, since x % 1 == 0)
    • this would function as a precision alternative to t(), allowing users to do stat(STAT_TICKS_RUNNING, 60) to get the same value as stat(STAT_SEC_TICKS), or do stat(STAT_TICKS_RUNNING) to just get the number of ticks the process has been running. This is the more flexible solution.
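The two stat() proposals could behave roughly as in the C sketch below. Everything in it is hypothetical: the function names are mine, and treating n <= 0 as "n omitted" is an assumption of this sketch, needed because ticks_running % 1 is always 0 and so cannot mean "give me the raw counter".

```c
#include <stdint.h>

/* Hypothetical backing for stat(STAT_SEC_TICKS):
   how many ticks into the current second the process is (0..59). */
static int64_t sec_ticks(int64_t ticks_running) {
    return ticks_running % 60;
}

/* Hypothetical backing for stat(STAT_TICKS_RUNNING, n).
   Assumption of this sketch: n <= 0 stands for "n omitted" and
   returns the raw counter without wrapping. */
static int64_t ticks_running_mod(int64_t ticks_running, int64_t n) {
    if (n <= 0) {
        return ticks_running;
    }
    return ticks_running % n;
}

/* User-side reconstruction of seconds from exact integer parts. */
static double seconds_from_ticks(int64_t ticks_running) {
    int64_t whole = ticks_running / 60;        /* exact whole seconds */
    int64_t part  = sec_ticks(ticks_running);  /* leftover ticks, 0..59 */
    return (double)whole + (double)part / 60.0;
}
```

The point of either stat() is that the integer parts stay exact no matter how long the process runs, so any rounding only happens in the final step on the user's side.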

Reference

Full lua54_time() disassembly:

**************************************************************
*                          FUNCTION                          *
**************************************************************
int __stdcall lua54_time(lua_State * L)

; function signature

int           EAX:4                             <RETURN>
lua_State *   RDI:8                             L
undefined8    Stack[-0x10]:8                    local_10
undefined8    Stack[-0x18]:8                    local_18

; procedure start

lua54_time:

SUB           RSP,0x18                          => reg RSP -= 0x18
                                                => reg RSP = &local_18

                                                ; RSP is stack register/pointer
                                                ; local_18 is stack variable

PXOR          XMM1,XMM1                         => reg XMM1 = XMM1 ^ XMM1
                                                => reg XMM1 = 0 ; fast impl

MOV           RAX,qword ptr [cproc]             => reg RAX = cproc

; FLOAT_60_0 is the address of where `60.0` sits in loaded
; memory. The Floating-Point Unit (FPU) is the part of the
; CPU that does maths on floats and doubles, and has a number
; of registers that can be treated as a stack. Please note
; the "dword ptr" signifier. A word, on x86, is 2 bytes.
; A "dword" is a "double word", so 4 bytes. "dword ptr"
; describes the instruction as treating the value as a pointer
; to a 4-byte value. In this case, that would be a `float *`.

FLD           dword ptr [FLOAT_60_0]            => __fpu_push(*FLOAT_60_0)
                                                => __fpu_push((float)60.0)
                                                ;  push 32-bit float 60.0 into
                                                ;; the FPU's register stack

CVTSI2SD      XMM1,qword ptr [RAX + 0x270]      => reg XMM1 = *(reg RAX + 0x270)
                                                => reg XMM1 = cproc->ticks_running

MOVSD         qword ptr [RSP]=>local_18,XMM1    => *reg RSP = reg XMM1
                                                => local_18 =
                                                    (double)cproc->ticks_running

FDIVR         qword ptr [RSP]=>local_18         => __fpu_push(
                                                     *reg RSP / __fpu_pop()
                                                   )
                                                => *reg __fpu_stk_top =
                                                     cproc->ticks_running / 60.0f

FSTP          qword ptr [RSP + local_10]        => *(reg RSP + local_10) =
                                                     __fpu_pop()
                                                => local_10 = __fpu_pop()
                                                ;  local_10 is stack variable

MOVSD         XMM0,qword ptr [RSP + local_10]   => reg XMM0 =
                                                     *(reg RSP + local_10)
                                                => reg XMM0 = local_10

CALL          lua_pushnumber                    => lua_pushnumber(L, reg XMM0)
                                                => lua_pushnumber(L, local_10)

MOV           EAX,0x1                           => reg EAX = 1
                                                ;  return value

ADD           RSP,0x18                          => reg RSP += 0x18
                                                ;  restore the stack

RET                                             => return

; procedure end



