Here is a simple implementation of merge sort in 97 tokens. It uses less PICO-8 CPU than anything else I’ve tried (with a caveat, see below).

A small benchmark comparing CPU usage, sorting an array of 200 random numbers:

Note that non-randomised quicksort suffers from very bad performance on nearly sorted lists, so the above benchmark does not represent what you would get in real life. Always test in your own application.

function sort(t, a, b)
    local a, b = a or 1, b or #t
    if (a >= b) return
    local m = (a + b) \ 2
    local j, k = a, m + 1
    sort(t, a, m)
    sort(t, k, b)
    local v = { unpack(t) }
    for i = a, b do
        if (k > b or j <= m and v[j] <= v[k]) t[i] = v[j] j += 1 else t[i] = v[k] k += 1
    end
end
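
For illustration, a benchmark along these lines might look like the following sketch (not the exact code behind the numbers above; stat(1) reports the CPU used so far in the current frame):

-- rough benchmark sketch: sort 200 random numbers and print the cpu fraction used
local t = {}
for i = 1, 200 do t[i] = rnd() end
local before = stat(1)
sort(t)
print(stat(1) - before)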

In real life merge sort performs worse than e.g. quicksort because it suffers from poor locality of reference (cache misses), but in a high-level language such as Lua this becomes meaningless.

To any computer scientist this specific implementation would look badly written and appear to perform extremely poorly, because {unpack(t)} copies the entire array at every recursive call, roughly n full copies in total. However, in the PICO-8 world this function is ridiculously fast, because the operation is almost free for the virtual CPU. Use at your own risk, because the actual CPU running PICO-8 will still perform all those operations!

Another limitation: it does not support arrays of more than 16384 elements because (a+b)\2 overflows. Just replace it with (a/2+b/2)\1 if necessary.
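
For clarity, that change would only touch the midpoint line (just a sketch of the substitution):

-- overflow-safe midpoint: a+b can exceed 32767 for very large arrays,
-- while a/2+b/2 stays within range and \1 floors the result
local m = (a/2 + b/2) \ 1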

Edit: here is a version that performs log(n) full array copies and is thus much lighter on the CPU. It is also about 10% faster. The drawback is that it’s now 111 tokens, but that’s still pretty good.

function sort(t)
  local function f(a, b)
    if a < b then
      local d = (b - a) \ 2 + 1
      f(a, a + d - 1)
      f(a + d, b)
      local v, j, k = { unpack(t, a, b) }, 1, d + 1
      local vj, vk = v[j], v[k]
      for i = a, b do
        if (not vk or j <= d and vj <= vk) t[i] = vj vj = v[j] j += 1 else t[i] = vk vk = v[k] k += 1
      end
    end
  end
  f(1, #t)
end
P#78990 2020-07-06 16:04 ( Edited 2020-07-07 10:18)

What is the actual cost of unpack()? I haven't timed it. If it's too cheap then we'll end up with "valid" carts running too slowly on lesser hardware again.

Intuitively, I would not expect "{ unpack(t) }" to be fast at all.

P#78994 2020-07-06 17:24 ( Edited 2020-07-06 17:37)

That’s a valid concern. I posted new code that does not perform that excessive number of array copies, at the cost of 20 tokens. I also added explanations about quicksort performance, because I had focused on the worst-case behaviour, and from my rough benchmark the reader might get a bad impression of its performance in the average case.

P#79002 2020-07-06 22:17 ( Edited 2020-07-06 22:17)

cpu police -- good and bad news.

The bad news is that unpack() still needs to be costed (my bad) -- it can't stay like that, because the real cpu cost should never be a required consideration. For example, at 2 cycles per item, the second version would cost ~0.40 for the same test.

The good news is that with the same cpu cost, the second version would remain quite fast: around ~0.13 instead of 0.11. Does that sound reasonable?

It is something of an oddity that {unpack(t)} is becoming the standard fastest way to do a shallow copy of a table. Perhaps that needs to be reviewed also.

P#79007 2020-07-07 08:00 ( Edited 2020-07-07 08:02)

2 cycles per item?
Would that kill unpack?
in its current form, it is really handy for replacing the boring:
local a,b,c,d=m[1],m[2],...
with:
local a,b,c,d=unpack(m)

P#79010 2020-07-07 12:44 ( Edited 2020-07-07 16:05)

@freds72

Isn't the cost of reading m[1],m[2],... directly already 2 cycles each?

P#79016 2020-07-07 17:29 ( Edited 2020-07-07 17:29)

@zep

By the way, in the absence of a built-in function to split or copy a table, {unpack(t)} should be the ideal way to do a shallow copy of an array, especially if you want to copy only a portion of it, e.g. {unpack(t,a,b)}.

You should assume people will frequently use it for that when you set costs.
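
For example (a small illustration of the idiom, not code from any cart in this thread):

local t = {5, 3, 8, 1}
local copy = { unpack(t) }        -- shallow copy of the whole array: {5, 3, 8, 1}
local slice = { unpack(t, 2, 3) } -- shallow copy of a portion: {3, 8}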

P#79049 2020-07-08 09:59 ( Edited 2020-07-08 10:00)

Side note: It's kinda funny that we're talking about balancing a game engine instead of a game. XD

P#79050 2020-07-08 10:01

@samhocevar did you try shellsort?
This is an implementation using 70 tokens:

gaps={701,301,132,57,23,10,4,1}
function shellsort(a)
    for gap in all(gaps) do
        for i=gap+1,#a do
            local x=a[i]
            local j=i-gap
            while j>=1 and a[j]>x do
                a[j+gap]=a[j]
                j-=gap
            end
            a[j+gap]=x
        end
    end
end
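
A quick sanity check could look like this (hypothetical snippet, not part of the original measurement):

-- sort a small array in place and print the result
local a = {9, 2, 7, 4, 1}
shellsort(a)
foreach(a, print) -- prints 1, 2, 4, 7, 9, one per line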

Sorting an array of 200 random numbers uses 5-9% cpu, 6% cpu on average.

Edit: I made a mistake in the middle loop; with it fixed, it looks like it's using 9% cpu on average on an array of 200.
I also tried sorting the array {1,2,3,4} with your second version of mergesort and it gives back a result of {1,1,3,3}.

P#79173 2020-07-11 21:15 ( Edited 2020-07-11 22:28)
