Just a few notes from playing around with userdata. These notes assume 8M VM cycles/sec and large enough arrays to avoid substantial overhead and fully realize economies of scale.
- Fast ops: `add`/`mul`/`copy`/etc. cost 1/16 cycle.
- Slow ops: `div`/`convert`/etc. cost 1/4 cycle. (See the measurement sketch after this list.)
- `matmul` is charged as a fast op based on the size of the output matrix. I'm a little suspicious that the answer seems to be so simple, so I'm wondering if I missed something.
- `copy` and `memmap`/`memcpy` are approximately the same speed for 64-bit datatypes. For smaller datatypes, `memcpy` is proportionally faster, though of course you then have to manage strides/spans yourself. `memcpy` should also enable `reinterpret_cast`-type shenanigans.
- There is substantial overhead for small spans. If you use spans of length 1, you pay 1/4 cycle per span, the same as a slow op. This may be a flat cost per span, but I'm not sure. Using the full/strided forms of the ops does not seem to add noticeable cost beyond the per-span cost.
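In case anyone wants to poke at this themselves, here's roughly how I'd measure per-element op cost. A minimal sketch, not gospel: I'm assuming `stat(1)` reports CPU used this frame as a fraction of the frame budget, and the simple `userdata(type, width)` / `:add()` / `:div()` forms.

```lua
-- rough per-element op cost: run an op many times, read CPU at end of frame
local n = 4096
local a = userdata("f64", n)
local b = userdata("f64", n)

function _draw()
    cls()
    local reps = 100
    for _ = 1, reps do
        a:add(b)    -- fast op: should land near 1/16 cycle/element
        -- a:div(b) -- swap in to compare: should land near 1/4 cycle/element
        -- note: each call allocates a result userdata, which may skew things
    end
    -- 8M cycles/sec at 60fps = ~133333 cycles/frame
    local cycles = stat(1) * 8e6 / 60
    print(string.format("cycles/element: %.4f", cycles / (reps * n)), 4, 4, 7)
end
```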
- For full-screen rendering, you have about 1 cycle/pixel at 480x270x60Hz (480*270*60 ≈ 7.8M pixels/sec against the 8M cycles/sec budget). This includes whatever scaling/conversion you need to do at the end of the process. So realistically, you'll get in the neighborhood of 10 additions/multiplications per pixel. Exact numbers depend on whether you need a divide at the end, and whether or not you can work in `u8`.
- userdata flat access w/ locals seems to cost 1 cycle/element, including the assignment (sketch below).
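For reference, this is the flat-access pattern I mean. A sketch, assuming userdata supports 0-based flat `[]` indexing:

```lua
-- flat access w/ locals: ~1 cycle/element, assignment included
local ud = userdata("f64", 1024)

local function sum_flat(u, n)
    local acc = 0
    for i = 0, n - 1 do
        local v = u[i] -- assumed 0-based flat index; read + assignment ~1 cycle
        acc = acc + v
    end
    return acc
end
```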
- userdata `get` is 1/4 cycle/element at scale ... but each explicit assignment will cost you 1 cycle on top of this.
- userdata `set` is 1 cycle/element at scale.
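To make the get/set asymmetry concrete: pulling values out in bulk as multivals is cheap, but every explicit Lua assignment adds a full cycle on top. A sketch, assuming an `(index, count)` form of `get` and an `(index, v1, v2, ...)` form of `set`:

```lua
local src = userdata("f64", 8)
local dst = userdata("f64", 8)

-- bulk multival get fed straight into set:
-- ~1/4 cycle/element for the get + ~1 cycle/element for the set
dst:set(0, src:get(0, 8))

-- by contrast, each explicit assignment costs ~1 cycle on top of the get
local a, b = src:get(0, 2)
```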
There also seems to be some interesting behavior where multivals, even very large ones, do not noticeably increase CPU usage when passed to `pack` or `set`. While I'm enjoying taking advantage of this for f64 to u8 conversions at the same cost as `convert`, I'm worried this might not last.
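Concretely, the trick looks something like this (a sketch; I'm assuming `set` converts values to the target datatype on write):

```lua
-- f64 -> u8 conversion via a big multival: the multival itself doesn't seem
-- to add CPU, so this costs about the same as a plain convert (for now!)
local f = userdata("f64", 256)
local u = userdata("u8", 256)
u:set(0, f:get(0, 256))
```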