
0.2.4 has been released, and we now have an extra segment of memory to play with from 0x8000 to 0xffff. That's 32K, or 0x8000 bytes. A spritesheet takes up 0x2000 bytes... so we could stuff 4 extra spritesheets in there!

I've created a system where you can call my custom function cspr the same way you would normally call spr, and everything "just works". The difference is that cspr can handle up to 1024 sprites instead of spr's standard 256-sprite limit. (cspr is also a bit slower than spr, but not by much, because it has to manage a cache.)

Here's a demo that uses 4 full spritesheets; search the code for "cspr" to see how easy it is to use, once you've set it up!

Cart #hefafanino-4 | 2021-12-23 | License: CC4-BY-NC-SA

(the games are all by me: #linecook, #remains, #ocelotsafari, and #escalatorworld)

technical details

The core ideas are the blit and cspr functions:

  • blit: this is used to move sprites between the 4 upper spritesheets and the sprite cache (located at 0x0000, where the spritesheet normally is)
  • cspr: cspr_simple gets the core idea of the sprite caching across; the full cspr in the cart can deal with any arguments you would normally pass to spr (like the width/height/flip parameters)

These are the simple/readable versions; the versions in the cart have some optimizations such as replacing y0*64 with y0<<6 that may make it more difficult to follow:

-- copies a sprite from x0,y0
--   (from spritesheet based at
--   memory address base0)
--   into x1,y1 on the spritesheet
--   based at address base1
-- set base1 to 0x6000 to use the
--   screen as the destination
-- all coordinates are measured
--   in pixels
-- note! odd x-coordinates will
--   be rounded down
-- by pancelor
function blit(base1,x1,y1,base0,x0,y0, w,h)
 local a0=base0+y0*64+x0\2 --source
 local a1=base1+y1*64+x1\2 --destination
 local w2=w and w\2 or 4   --half-width
 for da=0,(h or 8)*64-1,64 do
  memcpy(a1+da,a0+da,w2)
 end
end

-- "cached sprite" by pancelor
--  uses a direct-mapped cache
--  up to 4 spritesheets are
--   stored in 0x8000+
--  they are numbered 0-1023
--  using any sprite s will write
--   it to spritesheet slot s%256
--   and then use it from there
_cspr_bank={} -- maps slots (0-255) to which bank the sprite comes from (0-3)
function cspr_simple(sbig,x,y)
 local bank,s=sbig>>8&3,sbig&0xff
 -- bank is 0-3, s is 0-255 (inclusive)
 if _cspr_bank[s]~=bank then
  -- cache miss!
  -- blit sprite s from its bank into the cache:
  local sx,sy=s%16*8,s\16*8
  blit(0,sx,sy,
       0x8000+bank*0x2000,sx,sy,
       8,8)
  _cspr_bank[s]=bank
 end
 spr(s,x,y)
end
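
To make the slot mapping concrete, here is a small sketch of the same arithmetic cspr_simple does, pulled out into a helper. The name cspr_where is hypothetical (it is not in the cart); it only returns numbers and never touches memory.

```lua
-- hypothetical helper (not from the cart):
-- splits a 0-1023 sprite number the same
-- way cspr_simple does
function cspr_where(sbig)
 local bank=(sbig>>8)&3 -- which upper sheet (0-3)
 local s=sbig&0xff      -- cache slot (0-255)
 local base=0x8000+bank*0x2000 -- address of that sheet
 local sx=s%16*8        -- pixel x of the cell
 local sy=(s>>4)*8      -- pixel y of the cell
 return bank,s,base,sx,sy
end
```

For example, sprite 290 is 256+34, so cspr_where(290) gives bank 1, slot 34, base 0xa000, and the cell at pixel 16,16 of that sheet.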

drawbacks:

An earlier version of this demo used a function called upspr instead of cspr and had many drawbacks, listed below (run load #hefafanino-1 in your local pico-8 to see that old version). I've since updated the demo cart to a much better version that "just works", using cspr.

  1. sprites can only be blitted on even x-pixel values. e.g., upspr(290,11,11) will draw sprite 290 to 10,11 instead
  2. you can't store anything inside upper memory until runtime, so you'll probably want to use PX9 or something similar to store your extra spritesheets (as code strings? inside 0x0000-0x2000?) and then decompress them at startup
  3. flipping sprites takes extra work
  4. palette and transparency will not be respected
  5. sprite editing is more difficult (due to 2)
  6. the upper memory is being used by spritesheets, which may get in the way of other uses for the upper memory (such as larger maps). But you don't need to completely fill the upper memory with spritesheets; you could have, say, 512 sprites (2x normal) and still have 0x4000 bytes leftover for a 128x128 map. we've got more room for either, but we still have to make a tradeoff between the two!
  7. map() no longer works -- you'll need to write your own version of map() that calls spr() repeatedly. luckily this is about as fast as calling map() directly. note that mget() and mset() will need to be rewritten too, because they only handle 1-byte entries.
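
For drawback 7, a replacement for map/mget/mset could look roughly like the sketch below. Everything named cmap_base, cmap_w, cmget, cmset and cmap is hypothetical. It assumes you only use 2 sprite banks (0x8000-0xbfff), which leaves 0x4000 bytes at 0xc000 for a 128x64 map of 2-byte entries, and it assumes cspr is the function from the cart.

```lua
-- hypothetical layout: a 128-wide map of
-- 2-byte entries starting at cmap_base
-- (an assumption; use any region your
-- sprite banks don't occupy)
cmap_base=0xc000
cmap_w=128

-- like mget, but reads a 2-byte entry
function cmget(tx,ty)
 return peek2(cmap_base+(ty*cmap_w+tx)*2)
end

-- like mset, but writes a 2-byte entry
function cmset(tx,ty,sbig)
 poke2(cmap_base+(ty*cmap_w+tx)*2,sbig)
end

-- like map(), but draws through cspr
-- tx,ty: top-left tile; sx,sy: screen pos
-- tw,th: size in tiles
function cmap(tx,ty,sx,sy,tw,th)
 for j=0,th-1 do
  for i=0,tw-1 do
   local sbig=cmget(tx+i,ty+j)
   -- tile 0 is treated as empty here
   if sbig~=0 then
    cspr(sbig,sx+i*8,sy+j*8)
   end
  end
 end
end
```

Something like cmap(0,0,0,0,16,16) would then draw one screen of tiles. This sketch ignores the layers argument of map().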

extension ideas:

  1. if you stored all of your sprites in upper memory and used the spritesheet at 0x0000 as a cache for the sprites you're actively using, you can fix drawbacks 1, 3, and 4 (above) "for free". See "Virtual Sprites" in @freds72's POOM devlog for an idea of how to do this
    • Done! I used a direct-mapping cache instead of an LRU cache, which might cause performance problems if you repeatedly draw sprite x and then sprite x+256*n (because those sprites both map to the same slot, x). For example, sprites 17, 256+17, 512+17, and 768+17 all get stored in slot 17 of my direct-mapped sprite cache.
    • However, cspr/blit is quite fast, so you might not even see performance problems. (see my next post for performance details)

recommendations:

Someone on discord asked: "should people draw all sprites from bank 1, then swap for bank 2, etc?"

My recommendation: That would help, but I don't recommend it -- there are more effective ways to improve the performance, I think:

  • If you're okay with needing to manage which sprite banks are currently loaded, you could avoid all of the overhead of calling cspr() and just call spr() directly. This lets you use map(), too. You would of course need to manually memcpy the sprite banks into the 0x0000 region when appropriate.
  • If you want better performance but want a low-maintenance cache that's easy to use, you should probably write an LRU cache instead of my simple direct-mapped cache. (I may do this myself soon)
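
As a rough idea of what the LRU option could look like, here is a minimal sketch. This is not what the cart does, and all lru_* names are hypothetical. It keeps the 256 cache slots in a doubly linked list ordered from least to most recently used, so a miss can evict the oldest slot without scanning. It reuses blit from the original post and only handles plain 8x8 spr(s,x,y) calls.

```lua
-- lru sketch (hypothetical; not from the cart)
lru_slot={}  -- sbig -> slot holding it
lru_owner={} -- slot -> sbig it holds
lru_prev={}  -- linked list of slots:
lru_next={}  --  oldest first, newest last
lru_head,lru_tail=0,255
for i=0,255 do
 lru_prev[i]=i-1 -- -1 means "none"
 lru_next[i]=i+1 -- 256 means "none"
end

-- mark slot i as most recently used
function lru_touch(i)
 if i==lru_tail then return end
 local p,n=lru_prev[i],lru_next[i]
 -- unlink i from its current position
 if i==lru_head then
  lru_head=n
 else
  lru_next[p]=n
 end
 lru_prev[n]=p
 -- append i at the newest end
 lru_next[lru_tail]=i
 lru_prev[i]=lru_tail
 lru_next[i]=256
 lru_tail=i
end

function cspr_lru(sbig,x,y)
 local i=lru_slot[sbig]
 if not i then
  -- miss: reuse the oldest slot
  i=lru_head
  local old=lru_owner[i]
  if old then lru_slot[old]=nil end
  lru_owner[i]=sbig
  lru_slot[sbig]=i
  -- copy the cell from its bank into slot i
  local bank,s=(sbig>>8)&3,sbig&0xff
  blit(0,i%16*8,(i>>4)*8,
       0x8000+bank*0x2000,s%16*8,(s>>4)*8,
       8,8)
 end
 lru_touch(i)
 spr(i,x,y)
end
```

With this layout, sprites 17 and 256+17 no longer fight over one slot; a sprite only gets evicted once 256 other sprites have been used more recently. The tradeoff is that a sprite's slot is no longer predictable, so multi-cell sprites would need extra handling.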

is cspr good enough to just use?

I think so! it's fast enough; it uses up to 10~15% of a frame (in my limited testing) and you get 4x the sprite space, without needing to think about the cache at all. An LRU cache might make this number way better, but I haven't tried that yet.

Keep in mind that map() does not work with cspr -- this may be a dealbreaker for some. (you'll need to roll your own implementation of map/mget/mset)

If you are willing to give up the "without needing to think about the cache at all" requirement, you should maybe manually move pages of sprites (128x32? 128x64? 128x128?) around instead -- it'd be mostly pretty simple, and very token-efficient. (thanks to merwok for the suggestion!)
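
A minimal sketch of that manual approach, assuming the 4 sheets already sit at 0x8000+ as in the demo. The names loaded_bank, load_bank and load_page are hypothetical:

```lua
-- which bank is in the sheet at 0x0000
-- (nil = unknown / mixed)
loaded_bank=nil

-- copy all of bank n (0-3) over the
-- regular spritesheet, unless it's
-- already there
function load_bank(n)
 if loaded_bank~=n then
  memcpy(0,0x8000+n*0x2000,0x2000)
  loaded_bank=n
 end
end

-- copy a smaller page: `rows` pixel rows
-- (64 bytes each) from pixel row y0 of
-- bank n to pixel row y1 of the sheet
function load_page(n,y0,y1,rows)
 memcpy(y1*64,0x8000+n*0x2000+y0*64,rows*64)
 loaded_bank=nil -- sheet is now mixed
end
```

After load_bank(2), plain spr() and map() draw from bank 2. A 128x32 page is load_page(n,y0,y1,32), which copies 2048 bytes.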

P#103168 2021-12-20 12:38 ( Edited 2021-12-23 08:37)


Your drawing a sprite >=256 is exciting enough right there! And yes, in my current project I'm already taking advantage of that extended 32k for image swapping. :)

I was looking at POOM recently and saw it did the same prior to reading this.

Gold star for your code and explanation, @pancelor.

I may find some free time and write a faster sprite plotter.

P#103194 2021-12-20 18:05 ( Edited 2021-12-20 18:28)

a follow-up on performance:

analysis

I took my game linecook and added some code (tab 5) that makes the game act as if it's running with 4 full spritesheets (essentially, I moved each of the 4 tabs to the 4 upper-memory spritebanks and then modified spr() to call a slightly modified version of cspr() that maps 0..63 to the first upper spritebank, 64..127 to the second, etc)

This causes a decent number of cache misses (the cart shows #misses and #total-spr-calls in the top left corner), which simulates what might happen in a game that uses all 4 spritebanks, without any special consideration about where the sprites are stored (to avoid cache misses).

Cart #majurataga-0 | 2021-12-23 | License: CC4-BY-NC-SA

Based on this testing cart, I sped up cspr() by about 2x -- mainly, I added a fast path for the common case when you call spr(s,x,y) without all the extra parameters. (These speed-ups have been backported to the main demo cart in the original post)

My test steps:

  • I did not edit the game code or reorganize the sprites to avoid cache thrashing. This simulates what it would be like for someone to use cspr() without caring too much about the specific details of how it works.
  • I ran the cart, started a game, and waited for all the birds to arrive.
    Then, I reported the middle number of the ctrl-p performance overlay
  • I did the above on the unmodified cart, on the cart with a custom map() implementation, and on the cart with a custom map() implementation combined with a custom spr() implementation that uses cspr()

Measurements: (lower is better)

  • no changes: 0.46
  • custom map(): 0.49
    (the vanilla pico-8 map doesn't work with cspr, so we need to write our own implementation)
  • custom spr() + custom map(): 0.59

There are generally 320~330 calls to cspr() per frame, and 35~45 of those result in a cache miss

(Note that the code that tracks cache misses itself costs roughly 0.01 ~ 0.02 of the perf. This extra cost is not included in the above numbers.)

Interpretation

Using cspr without doing any optimization can cost up to 10~15% of your cpu time budget (at 60fps).

I believe most of this time is spent handling cache misses -- when the birds are still arriving, the cache statistics show ~20 misses out of ~290 cspr calls, and the perf monitor shows 0.51. When the birds arrive and the game really begins, the stats change to ~40 misses out of ~320 cspr calls, with 0.61 perf. Drawing 30 extra sprites per frame only costs about 0.01, so 0.09 ish is spent doing cspr things, including function overhead and cache handling.

Testing has been limited; I'd be interested to see what this looks like in real-world scenarios. In the demo cart from the main post, cspr only takes 0.03 frames (3% of your budget) more than spr, but I don't think this is the most representative use case.

P#103374 2021-12-23 02:13 ( Edited 2021-12-23 04:30)

Keeping track of the loaded sprite banks and calling regular spr seems good. Matches what old consoles did, or what some current engines do!

P#103389 2021-12-23 03:48 ( Edited 2022-03-04 11:28)

mhm, that does sound kinda fun! This system was designed with the goal of making things "just work" without needing any extra mental overhead, but there are certainly other ways to add more sprite space.

I bet managing sprite banks is very token-efficient too (which is probably important to you if your game is large enough to want all these extra sprites!)

P#103390 2021-12-23 04:14

On #escalatorworld: Though the controls can be a bit awkward at times, @pancelor has implemented an innovative scoring system - the likes of which I've never seen - and has definitely struck the sweet spot between fun and difficulty. This cart has captured my imagination (do I see a sequel in the future?) - a true hidden gem of the PICO-8.

P#113315 2022-06-18 17:33

I've experimented a bit. With knowledge of most of the PICO-8's secrets, I found a method that could possibly make cspr() cheaper:

Cart #four_sprite_banks_demo-0 | 2023-06-12 | License: CC4-BY-NC-SA

What is this sorcery?

This cartridge does not touch upper RAM at all. This relies on a quirky property of PICO-8's remapping feature, plus a secret feature that unlocks a whopping 4 banks of workable screen memory, 8 KB each.

To enable the extra banks, set bit 0 of the byte at memory location 0x5F36 (e.g. poke(0x5f36,1), which also clears the other bits of that byte). Call _map_display(n) with n ranging from 0 to 3 to select the appropriate bank. You can also read back the current bank with stat(3).

The following chart shows how the 0x0000-0x1FFF and 0x6000-0x7FFF regions of memory are redirected as this cart runs:

0x0000    0x6000
spr       disp0     // boot up
spr       disp0     // poke(0x5f36,1) to enable multi-display
spr       disp0     // _map_display(0) -- write sprite bank 0
spr       disp1     // _map_display(1) -- write sprite bank 1
spr       disp2     // _map_display(2) -- write sprite bank 2
spr       disp3     // _map_display(3) -- write sprite bank 3
disp3     disp3     // poke(0x5f54,0x60) -- was 0x00
disp3     spr       // poke(0x5f55,0x00) -- was 0x60
disp0     spr       // _map_display(0) -- select sprite bank 0
disp1     spr       // _map_display(1) -- select sprite bank 1
disp2     spr       // _map_display(2) -- select sprite bank 2
disp3     spr       // _map_display(3) -- select sprite bank 3
disp0     spr       // _map_display(0) -- back to bank 0
...

As you can see, the bank switch can be used to select 4 sprite banks with ease. What was formerly the 8 KB of sprite memory at 0x0000 is now being used as the screen memory.
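
The sequence in the chart could be wrapped up roughly like this. This is only a sketch of the steps listed above, not the code from the cart: banks_init, bank_select and the fill callback are hypothetical names, and it relies on the multi-display behaviour described in this post, which may not hold in other PICO-8 versions.

```lua
-- fill(n) is a hypothetical callback that
-- draws or decompresses sheet n onto
-- "the screen" with normal draw calls
function banks_init(fill)
 -- enable multi-display
 poke(0x5f36,1)
 -- write each sprite bank while 0x6000
 -- still points at the selected display
 for n=0,3 do
  _map_display(n)
  fill(n)
 end
 -- swap roles, as in the chart:
 poke(0x5f54,0x60) -- was 0x00
 poke(0x5f55,0x00) -- was 0x60
end

-- afterwards, pick which bank spr() uses
function bank_select(n)
 _map_display(n)
end
```

After banks_init, calling bank_select(2) and then spr(5,x,y) would draw sprite 5 of bank 2, with no copying involved.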

The 4 sheets need to be filled at runtime, much like with the extended map you can enable from 0x5F56/57. However, I went the lazy route and just painted circles on them with varying size and colors. The PICO-8 cartridge is not big enough to store 4 entire sheets (32 KB) along with any other assets, so compression must be used (or in the worst possible case, multi-carting).

P#130832 2023-06-12 07:49 ( Edited 2024-02-28 21:55)

@StinkerB06 What's the bug when you pause? I used this in my last game so you've got me worried now...

https://www.lexaloffle.com/bbs/?tid=55129

I suppose it won't matter much since zep has just said he will be "fixing" it in a future version and I'll need to find another way...

Funnily enough my proof of concept filled the extra screens with circles too:

https://www.lexaloffle.com/bbs/?tid=49254

:]

P#142135 2024-02-28 16:32

@drakeblue 0.2.6 just came out, "fixing" the exploit.
I suggest using the official screen remapping asap!

https://www.lexaloffle.com/bbs/?tid=140421

P#142150 2024-02-28 19:06

@freds72 the news (for me at least) that it’ll be fixed in future was from the 0.2.6 release post. Testing just now it still seems to work in 0.2.6, but I guess I will need to change over soon.

P#142156 2024-02-28 20:18

@drakeblue Sorry, there was no actual bug. It was just that the transparency effect on the pause menu is weird at times.

P#142158 2024-02-28 22:11

nice, I can't wait to have time to work on my unfinished pico projects!

P#142168 2024-02-28 23:35
