Log In  

Here's a little profiler you can drop into an existing program - if you have 313 tokens to spare.
It instruments the global functions to count the number of calls and measure the CPU time spent inside them. Then it dumps it to a CSV file you can open in a spreadsheet viewer, and hopefully get some insight into where your program is spending the most time.

Instructions are in the comments at the top.

EDIT:

  • Count fn calls in a single 32bit integer (thanks @freds72).
  • Use "..." arguments syntax to support more than 9 arguments (thanks @Siapran)
  • Save uninstrumented fns to avoid calling the instrumented ones when dumping stats.
-->8
-- profiler

-- usage:
-- run your program
-- press esc to break
-- > _prof.begin()
-- > resume
-- run your program some more (to gather data)
-- press esc to break
-- > _prof.dump()
-- this will dump a profile.csv file to the cart's folder

-- notes:
-- profiling for more than 2-3 minutes could cause cpu measurements 
-- to overflow, and give incorrect results.
-- measures the amount of cpu spent in each function, where 1 is one whole frame
-- so divide by frame rate to get seconds (30, or 60 if using _update60).
-- profiling will slow down your program somewhat.

_prof={
    stats={},
    -- list of global functions to *not* instrument
    donttouch={flip=true,__flip=true,_startframe=true,_update_buttons=true,_get_menu_item_selected=true,_update_framerate=true,_map_display=true,_mark_cpu=true},
    -- original uninstrumented functions
    type=type,
    stat=stat,
    printh=printh,
    all=all,
    tostr=tostr,
    pairs=pairs
}

-- start profiling.
-- if called multiple times, will reset stats
_prof.begin=function()
    -- clear stats
    _prof.stats={}

    -- instrument functions
    if not _prof.instr then
        for n,v in _prof.pairs(_ENV) do
            if _prof.type(v)=="function" and not _prof.donttouch[n] then
                -- wrap function
                _ENV[n]=function(...)
                    -- run and get cpu timing
                    local cpuin=_prof.stat(1)
                    return (function(...)
                        -- update stats
                        local cpuout=_prof.stat(1)
                        local st=_prof.stats[n] or {ct=0,cpu=0}
                        st.ct+=0x0.0001
                        st.cpu+=cpuout-cpuin
                        _prof.stats[n]=st
                        return ...  
                    end)(v(...))
                end
            end
        end
        _prof.instr=true
    end
end

-- dump profile stats to file
-- order is field to order by (name,ct,cpu,avg)
-- defaults to "cpu"
-- dst is printh destination
-- defaults to "profile.csv"
_prof.dump=function(order,dst)
    order = order or "cpu"
    dst = dst or "profile.csv"
    local rpt={}
    for n,v in _prof.pairs(_prof.stats) do
        local cpu,ct,sh=v.cpu,v.ct,16

        -- average is cpu/(ct<<16), except that ct<<16 could
        -- overflow, in which case shift ct left as many bits 
        -- as possible and shift cpu right the remaining bits.
        while sh>0 and ct<0x4000 do       -- can't exceed 0x7fff as it would be treated as negative
            ct<<=1 sh-=1
        end
        cpu>>=sh

        -- create stat      
        local st={name=n,ct=v.ct,cpu=v.cpu,avg=cpu/ct}

        -- insert into rpt in order
        local i=#rpt+1
        while i>1 and (rpt[i-1][order]>st[order])==(order=="name") do
            rpt[i]=rpt[i-1]
            i-=1
        end
        rpt[i]=st
    end

    -- print rpt
    _prof.printh("name,ct,cpu,avg",dst,true)
    for st in _prof.all(rpt) do
        _prof.printh(st.name..",".._prof.tostr(st.ct,2)..","..st.cpu..","..st.avg,dst)
    end
end

Here's an example of my 2D transforms demo.
For example you can see it spends a large amount of time in the "map" function (which would normally be suspicious except that the "2D transforms" replaces the standard "map" function with a custom version).

P#142589 2024-03-08 11:03 ( Edited 2024-03-09 01:51)

1

suggest to use increment of 0x0.0001 to count function calls - would remove the need to handle overflows, use tostring(ct,2) to get full 32bits value

afk but would that work with less tokens:
— correction: doesn’t work 😰

_𝘦𝘯𝘷[n]=function(...)

  -- run and get cpu timing
  local cpuin=stat(1)
    return v(...),(function(cpuout)
      -- update stats
      if (not _prof.stats[n]) _prof.stats[n]={ct={0,0},cpu=0}
      local st=_prof.stats[n]
      st.ct+=0x0.0001
      st.cpu+=cpuout-cpuin
    end)(stat(1))
end
P#142591 2024-03-08 11:40 ( Edited 2024-03-08 13:22)
1

to make it work with multiple return values:

_𝘦𝘯𝘷[n]=function(...)

  -- run and get cpu timing
  local cpuin=stat(1)
  return (function(...)
      -- update stats
      local cpuout = stat(1)
      if (not _prof.stats[n]) _prof.stats[n]={ct={0,0},cpu=0}
      local st=_prof.stats[n]
      st.ct+=0x0.0001
      st.cpu+=cpuout-cpuin
      return ...
    end)(v(...))
end
P#142597 2024-03-08 13:37 ( Edited 2024-03-08 13:41)
1

@Mot, Awesome tool and perfect timing, since I'm currently working on performance for a 64x128 32 colors pico-8 draw library.
@Siapran, interesting, do you know a good resource for me to learn how to use triple dot ?
This could be very useful for Wpal() that is the 32 colors equivalent to pal(), but only supports single palette entry change at the moment.

P#142601 2024-03-08 14:18
1

@RealShadowCaster, ... represents a variable number of arguments (0 or more):

function foo(value, ...) -- must be the last position in the parameter list
  local a1, a2, a3 = ... -- in a function body, ... evaluates to 0 or more values
  local args = {...} -- you can package them into a table, like you can with any list of values
  print(a1)
  print(a2)
  print(a3)
  print(#args)
end

foo(1, 2, 3, 4, 5) -- prints 2 3 4 4
foo(1) -- prints nil nil nil 0
foo() -- prints nil nil nil 0
foo(1, 2, 3) -- prints 2 3 nil 2

... is very useful for generic programming like that profiler above, where we manipulate different functions.

P#142617 2024-03-08 17:10 ( Edited 2024-03-08 17:10)
2

@Siapran, @RealShadowCaster

It's worth noting that sometimes t=pack(...) is better than t={...}, as it allows you to detect the difference between trailing nil arguments vs. nothing being passed, since pack() adds a t.n member with the actual count of ..., whereas trailing nils are ignored when querying #t:

> t=pack(1,2,3,nil,nil)
> ?#t
3
> ?t.n
5

Similarly, using unpack(t) will produce a 5-tuple with two nils on the end, because it checks for the t.n value and pads the tuple with nils. This matters when you're storing deferred function calls, especially API function calls, since there are several that care about the difference between an explicit nil arg and one that isn't supplied.

This is sometimes useful, though not always, and pack() does cost an extra token and a little call overhead, so use it judiciously.

P#142625 2024-03-08 19:10 ( Edited 2024-03-08 19:12)

[Please log in to post a comment]