tl;dr
In the PICO-8 console, run load #prof, then edit the third tab with some code:
prof(function(x)
local _=sqrt(x) -- code to measure
end,function(x)
local _=x^0.5 -- some other code to measure
end,{ locals={9} }) -- "locals" (optional) are passed in as args
Run the cart: it will tell you exactly how many cycles it takes to run each code snippet.
what is this?
The wiki is helpful to look up CPU costs for various bits of code, but I often prefer to directly compare two larger snippets of code against each other. (plus, the wiki can get out of date sometimes)
For the curious, here's how I calculate exact cycle counts:
(essentially, I run the code many times and compare it against running nothing many times, using stat(1) and stat(2) for timing)
-- slightly simplified from the version in the cart
function profile_one(func)
local n = 0x1000
-- we want to type
-- local m = 0x80_0000/n
-- but 8MHz is too large a number to handle in pico-8,
-- so we do (0x80_0000>>16)/(n>>16) instead
-- (n is always an integer, so n>>16 won't lose any bits)
local m = 0x80/(n>>16)
-- given three timestamps (pre-calibration, middle, post-measurement),
-- calculate how many more 𝘤𝘱𝘶 cycles func() took compared to noop()
-- derivation:
-- 𝘵 := ((t2-t1)-(t1-t0))/n (frames)
-- this is the extra time for each func call, compared to noop
-- this is measured in #-of-frames (at 30fps) -- it will be a small fraction for most ops
-- 𝘧 := 1/30 (seconds/frame)
-- this is just the framerate that the tests run at, not the framerate of your game
-- can get this programmatically with stat(8) if you really wanted to
-- 𝘮 := 256*256*128 = 8𝘮𝘩z (cycles/second)
-- (𝘱𝘪𝘤𝘰-8 runs at 8𝘮𝘩z; see https://www.lexaloffle.com/bbs/?tid=37695)
-- cycles := 𝘵 frames * 𝘧 seconds/frame * 𝘮 cycles/second
-- optimization / working around pico-8's fixed point numbers:
-- 𝘵2 := 𝘵*n = (t2-t1)-(t1-t0)
-- 𝘮2 := 𝘮/n := m (e.g. when n is 0x1000, m is 0x800)
-- cycles := 𝘵2*𝘮2*𝘧
local function cycles(t0,t1,t2) return ((t2-t1)-(t1-t0))*m/30 end
local noop=function() end -- this must be local, because func is local
flip()
local atot,asys=stat(1),stat(2)
for i=1,n do noop() end -- calibrate
local btot,bsys=stat(1),stat(2)
for i=1,n do func() end -- measure
local ctot,csys=stat(1),stat(2)
-- gather results
local tot=cycles(atot,btot,ctot)
local sys=cycles(asys,bsys,csys)
return {
lua=tot-sys,
sys=sys,
total=tot,
}
end
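To sanity-check the formula, here is a small worked example. The three timestamps are made up (they are not real stat(1) readings), chosen so that the func loop costs exactly 15/256 of a frame more than the noop loop. It only uses plain arithmetic, so it should behave the same in PICO-8 and in regular Lua.

```lua
-- worked example of the cycles() formula with made-up timestamps
local n = 0x1000 -- 4096 iterations, as in profile_one
local m = 0x800  -- 8mhz/n = 8388608/4096 = 2048
local function cycles(t0,t1,t2) return ((t2-t1)-(t1-t0))*m/30 end

-- hypothetical stat(1) readings, in fractions of a 30fps frame:
local t0 = 0      -- before the noop loop
local t1 = 24/256 -- after the noop loop
local t2 = 63/256 -- after the func loop

-- the func loop took 39/256 frames and the noop loop took 24/256,
-- so func costs 15/256 frames more over n calls:
-- 15/256 * 2048 / 30 = 4 cycles per call
local result = cycles(t0,t1,t2)
print(result)
```

All the fractions here are exact in 16.16 fixed point, so the result comes out as exactly 4 cycles per call.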
This is neat but impractical; for everyday usage, you'll want to load #prof and edit the last tab.
The cart comes with detailed instructions, reproduced here for your convenience:
=================
★ usage guide ★
=================
웃: i have two code snippets;
which one is faster?
🐱: edit tab 2 with your
snippets, then run.
it will tell you precisely
how much cpu it takes to
run each snippet.
the results are also copied
to your clipboard.
(for ease of use, consider
integrating this cart into
your own cart during dev)
웃: what do the numbers mean?
🐱: the cpu cost is reported
as lua and system cycle
counts. look up stat(1)
and stat(2) for more info.
if you're not sure, just
look at the sum -- lower
is faster (better)
웃: what does "{locals={3,5}}"
do in the example?
🐱: accessing local variables
is faster than global vars.
/!\ /!\ /!\ /!\
"local" values outside
the current scope are also
slower to access!
/!\ /!\ /!\ /!\
so if the scenario you're
trying to test involves
local variables, simulate
this by passing them in:
prof(function(a)
local _=sqrt(a)
end,{ locals={9} })
note: you can profile many
functions at once, or just
one. also, passing options
at the end isn't required:
prof(function()
memcpy(0,0x200,64)
end,function()
poke4(0,peek4(0x200,16))
end)
웃: can i do "prof(myfunc)"?
🐱: no, this will give wrong
results! always use inline
functions:
prof(function()
-- code for myfunc here
end)
as an example, "prof(sin)"
reports "-2" -- wrong! but
"prof(function()sin()end)"
correctly reports "4"
(see the notes at the start
of the next tab for a brief
technical explanation)
======================
★ alternate method ★
======================
this cart is based on code by
samhocevar:
https://www.lexaloffle.com/bbs/?pid=60198#p
if you do this method, be very
careful with local/global vars.
it's very easy to accidentally
measure the wrong thing.
here's an example of how to
measure cycles (ignoring this
cart and using the old method)
local a=11.2 -- locals
local n=1024
flip()
local tot1,sys1=stat(1),stat(2)
for i=1,n do end -- calibrate
local tot2,sys2=stat(1),stat(2)
for i=1,n do local _=sqrt(a) end -- measure
local tot3,sys3=stat(1),stat(2)
function cyc(t0,t1,t2) return ((t2-t1)-(t1-t0))*128/n*256/stat(8)*256 end
local lua = cyc(tot1-sys1,tot2-sys2,tot3-sys3)
local sys = cyc(sys1,sys2,sys3)
print(lua.."+"..sys.."="..(lua+sys).." (lua+sys)")
run this once, see the results,
then change the "measure" line
to some other code you want
to measure.
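The order of operations in cyc() looks odd, but it seems to be what keeps every intermediate value inside PICO-8's number range. A minimal sketch with made-up numbers (fps is a stand-in for stat(8), and d is a hypothetical timing difference for a 4-cycle snippet):

```lua
-- made-up numbers: a 4-cycle snippet, n=1024 iterations, 30fps
local n = 1024
local fps = 30    -- stand-in for stat(8)
local d = 15/1024 -- hypothetical (t2-t1)-(t1-t0), in frames

-- same operation order as cyc(); the running value goes
-- 1.875 -> 15/8192 -> 0.46875 -> 1/64 -> 4
local per_call = d*128/n*256/fps*256
print(per_call)

-- multiplying first would need d*8388608 = 122880, which does
-- not fit in a pico-8 number (the max is just under 32768)
```

Interleaving the multiplications and divisions like this never lets the running value get large, while still multiplying by 128*256*256 = 8388608 overall.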
misc results
(these may be out of date now, but they were interesting)
Copying 64 bytes of memory is very slightly faster if you use poke4 instead of memcpy -- interesting!
(iirc this is true for other data sizes... find out for yourself for sure by downloading and running the cart!)
edit: this has changed in 0.2.4b! the memcpy in this example now takes 7+32 cycles (lua+sys)
constant folding
I thought lua code was not optimized by the lua compiler/JIT at all, but it turns out there are some very specific optimizations it will do.
A No Frills Introduction to Lua 5.1 VM Instructions (book)
> As of Lua 5.1, the parser and code generator can perform limited constant expression folding or evaluation. Constant folding only works for binary arithmetic operators and the unary minus operator (UNM, which will be covered next.) There is no equivalent optimization for relational, boolean or string operators.
constant folding...?
One further test case:
profile("tail add x3", function() local a=2 return 2+2+2+2+2+2+2+a end)
profile("head add x3", function() local a=2 return a+2+2+2+2+2+2+2 end)
These cost different amounts! Constant-folding only seems to work at the start of expressions. (This is all highly impractical code anyway, but it's fun to dig in and figure out this sort of thing)
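A likely explanation is that + groups from the left, so in the "tail" version the constants meet each other before the variable shows up, while in the "head" version every addition already has the variable on its left. The sketch below just writes out that grouping with explicit parentheses (shortened to three additions). The last line is an assumption I have not profiled: parenthesizing the constants should let them fold again.

```lua
local a=2

-- "tail" form: constants are grouped together first,
-- so constant+constant pairs exist and can be folded
local tail          = 2+2+2+a
local tail_explicit = ((2+2)+2)+a -- same grouping, written out

-- "head" form: the variable is always the left operand,
-- so there is never a constant+constant pair to fold
local head          = a+2+2+2
local head_explicit = ((a+2)+2)+2 -- same grouping, written out

-- assumption (not profiled): grouping the constants by hand
-- should give the compiler something to fold again
local head_grouped  = a+(2+2+2)

print(tail)
print(head)
print(head_grouped)
```

All five expressions evaluate to the same value; only the amount of work left for runtime should differ. Running the grouped version through prof would confirm or refute the assumption.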