what is this?

The wiki is nice for checking exactly how time-expensive various operations are, but it's a bit out of date. Also, it'd be nice to just be able to directly test two implementations against each other, rather than adding up how long each individual operation takes.

The wiki also has a code listing for a cpu profiler, but it's a bit hard to find if you don't know it exists. Plus, it was fun for me to dive in and double-check the math myself.

My profiler is pretty similar to the one on the wiki, although IMO mine has a nicer/simpler interface. Additionally, I've commented exactly how the cycle calculation works, which might be useful for other people to see:

-- slightly simplified from the version in the cart
function profile_one(func)
  local n = 0x1000

  -- n must be larger than 256, or m will overflow
  assert(n>0x100)

  -- we want to type
  --   local m = 0x80_0000/n
  -- but 8MHz is too large a number to handle in pico-8,
  -- so we do (0x80_0000>>16)/(n>>16) instead
  -- (n is always an integer, so n>>16 won't lose any bits)
  local m = 0x80/(n>>16)

  local function cycles(t0,t1,t2) return (t0+t2-2*t1)*m/30 end
  -- given three timestamps (pre-calibration, middle, post-measurement),
  --   calculate how many more CPU cycles func() took compared to nop()
  -- derivation:
  --   T := ((t2-t1)-(t1-t0))/n (frames)
  --     this is the extra time for each func call, compared to nop
  --     this is measured in #-of-frames (at 30fps) -- it will be a small fraction for most ops
  --   F := 1/30 (seconds/frame)
  --     this is just the framerate that the tests run at, not the framerate of your game
  --     can get this programmatically with stat(8) if you really wanted to
  --   M := 256*256*128 = 8MHz (cycles/second)
  --     (PICO-8 runs at 8MHz; source: https://www.lexaloffle.com/bbs/?tid=37695)
  --   cycles := T frames * F seconds/frame * M cycles/second
  -- optimization / working around pico-8's fixed point numbers:
  --   T2 := T*n = (t2-t1)-(t1-t0)
  --   M2 := M/n := m (e.g. when n is 0x1000, m is 0x800)
  --   cycles := T2*M2*F

  -- calibrate, then measure
  local nop=function() end -- this must be local, because func is local
  flip()
  local atot,asys=stat(1),stat(2)
  for i=1,n do nop() end
  local btot,bsys=stat(1),stat(2)
  for i=1,n do func() end
  local ctot,csys=stat(1),stat(2)

  -- report
  local lua=cycles(atot-asys,btot-bsys,ctot-csys)
  local sys=cycles(asys,bsys,csys)
  local tot=lua+sys
  return {
    lua=lua,
    sys=sys,
    tot=tot,
  }
end
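
For anyone who wants to double-check the arithmetic outside PICO-8, here is a minimal sketch in plain desktop Lua (floating-point numbers, so the full 8MHz constant fits). It is not part of the cart. The three timestamps are hypothetical values picked so the result is a round number, and n/65536 stands in for PICO-8's n>>16.

```lua
-- plain-lua sanity check of the cycle math (not pico-8 code)
local n = 0x1000        -- calls per loop, as in profile_one
local M = 256*256*128   -- 8MHz; fits here, but overflows in pico-8
local F = 1/30          -- seconds per frame

-- pico-8's n>>16 is the same as n/65536 in 16.16 fixed point
local m = 0x80/(n/65536)
assert(m == 0x800 and m == M/n)

-- hypothetical stat() readings, in fractions of a frame
local t0, t1, t2 = 0, 0.25, 0.6171875

-- the long way: T frames * F seconds/frame * M cycles/second
local T = ((t2-t1)-(t1-t0))/n
local direct = T*F*M

-- the short way, as written in profile_one
local short = (t0+t2-2*t1)*m/30

print(direct, short)
assert(math.abs(direct-short) < 1e-9)

-- why n must be larger than 0x100:
-- at n = 0x100, m reaches 0x8000, which pico-8 numbers cannot hold
local m_at_limit = 0x80/(0x100/65536)
print(m_at_limit)
```

Both routes give the same cycle count, which is the point of the T2/M2 rewrite: it never needs a number as large as M.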

how do I use it?

You can try it here online, but to really use it you'll want to download it yourself and edit the body of the analyze() function. There are instructions embedded in the cart with more details:

Cart #cyclecounter-2 | 2022-01-16 | License: CC4-BY-NC-SA

misc results

poke4 vs. memcpy

  profile("memcpy     ", function() memcpy(0,0x200,64)       end)
  profile("poke4/poke4", function() poke4(0,peek4(0x200,16)) end)

> memcpy : 7 +64 = 71 (lua+sys)
> poke4/poke4 : 7 +60 = 67 (lua+sys)

Copying 64 bytes of memory is very slightly faster if you use poke4 instead of memcpy -- interesting!
(iirc this is true for other data sizes... find out for yourself for sure by downloading and running the cart!)

edit: this has changed in 0.2.4b! the memcpy in this example now takes 7 +32 cycles

constant folding

I thought Lua code wasn't optimized by the compiler at all, but it turns out it does perform a few very specific optimizations.

  profile("     +", function() return 2+2 end)
  profile("   +++", function() return 2+2+2+2+2+2+2+2 end)

These functions both take a single cycle! That long addition gets optimized by lua, apparently. @luchak found these explanations:

https://stackoverflow.com/questions/33991369/does-the-lua-compiler-optimize-local-vars/33995520
> Since Lua often compiles source code into byte code on the fly, it is designed to be a fast single-pass compiler. It does do some constant folding

A No Frills Introduction to Lua 5.1 VM Instructions (book)
> As of Lua 5.1, the parser and code generator can perform limited constant expression folding or evaluation. Constant folding only works for binary arithmetic operators and the unary minus operator (UNM, which will be covered next.) There is no equivalent optimization for relational, boolean or string operators.

constant folding...?

One further test case:

  profile("tail add x3", function() local a=2 return 2+2+2+2+2+2+2+a end)
  profile("head add x3", function() local a=2 return a+2+2+2+2+2+2+2 end)

> tail add x3 : 2 + 0 = 2 (lua+sys)
> head add x3 : 8 + 0 = 8 (lua+sys)

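These cost different amounts! Constant folding only seems to work at the start of expressions. That fits with + being left-associative in Lua: 2+2+...+a is parsed as ((2+2)+...)+a, so the leading constants can be folded together, while a+2+2+... is parsed as ((a+2)+2)+..., where every addition involves a. (This is all highly impractical code anyway, but it's fun to dig in and figure out this sort of thing.)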
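
One way to poke at this further is to add a parenthesized variant next to the two cases above. This is only a sketch, not something from the cart: paren_add is a made-up name, profile() is assumed to be the cart's global function, and no measured numbers are given for the new case. If folding works on any all-constant subexpression, paren_add should land closer to tail add than to head add.

```lua
-- sketch, not from the cart: same sum, three groupings
local function tail_add() local a=2 return 2+2+2+2+2+2+2+a end    -- ((2+2)+...)+a
local function head_add() local a=2 return a+2+2+2+2+2+2+2 end    -- ((a+2)+2)+...
local function paren_add() local a=2 return a+(2+2+2+2+2+2+2) end -- a+(constants)

-- all three compute the same value
assert(tail_add()==16 and head_add()==16 and paren_add()==16)

-- profile() is assumed to come from the cart;
-- the guard lets this snippet also run in plain lua
if profile then
  profile("tail add ", tail_add)
  profile("head add ", head_add)
  profile("paren add", paren_add)
end
```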

update the wiki?

I have not updated the CPU page on the wiki; it's a bit hard to pin down exactly which operations take cycles, and I would personally rather use a tool like this to compare two potential implementations.

But, just so you're aware, the wiki is definitely out of date; when I ran the wiki's cpu profiler on pico-8 0.2.4, it produced different results. (I put a summary of the raw differences here)

edit: thisismypassword updated the wiki -- thank you!

credits

Cart by pancelor.

Thanks to @samhocevar for the initial snippet that I used as a basis for this profiler!

Thanks to @freds72 and @luchak for discussing an earlier version of this with me!

changelog

v1.1

  • added: press X to copy to clipboard
  • added: can pass args; e.g. profile("lerp", lerp, {args={1,4,0.3}})

v1.0

  • initial release
P#104795 2022-01-11 03:31 ( Edited 2022-08-13 22:46)

the profiler is missing an input variable somehow - the current pattern forces declaration of a local (or global) to mimic real life usage

qol request: copy results to clipboard

P#105134 2022-01-15 12:57 ( Edited 2022-01-15 12:59)

good points -- added! passing input variables is slightly awkward, but it's at least possible now

P#105168 2022-01-16 06:35

I updated this into a useful snippet; in my own code, I always have this snippet sitting in a tab ready to be used whenever I want to measure something. it's much easier than needing to switch to another cart to test things

load #prof to get the snippet yourself, then use it like this:

  prof(
    function(a,b)
      local c=((a+1)*(b+1))-1
    end,
    function(a,b)
      local c=a*b+a+b
    end,
    {locals={3,5},name="mult"}
  )

see the snippet's code for more instructions + documentation

Cart #prof-0 | 2022-10-26 | License: CC4-BY-NC-SA

P#119593 2022-10-26 12:17
