Here's a 7500-word dictionary cart for making word games and whatnot. It contains the most common 3-6 letter words according to wiktionary.com, including proper names. The loader is 264 tokens, and the data is 11317, stored over the full map (including shared gfx) plus the last 44 SFX, so there are just 128 sprites and 20 SFX spare. The 5 most common and the 10 least common words on the list are:
THE AND THAT WAS HIS RADIUM BAWL LINTEL WAFER WELTER AUNTY OPTICS SIKH LIL BARB
The data is generated using a convoluted toolchain that I'll post later on, if I find time to organize it into something useful.
The compression works by enumerating every possible word, ordered by word length first and then alphabetically. So A is 0, B is 1, AA is 26, and so on. This means that to store the whole dictionary, only the distances between consecutive words' indices are needed. There are many clusters of close words (e.g. the distance between CAN and CAP is only 2), so the distances are sorted into range categories according to how many bits are needed to store each one. Repeated categories are common, so a repeat can be encoded with a single bit -- otherwise a 3-bit value stores the category for each distance. The encoding utility greedy-searches sets of 5 category bit-lengths and a roman cypher to try to optimize the encoded size, which saved around 1k compared with hand-optimizing the parameter set.
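To make the enumeration concrete, here's a sketch in Python (the cart itself is Lua, and this isn't my actual tool) of the word-to-index mapping described above:

```python
def word_index(word):
    """Enumerate words shortest-first, then alphabetically:
    A=0 ... Z=25, AA=26, AB=27, ..."""
    n = len(word)
    # all words shorter than n letters come first: 26 + 26^2 + ... + 26^(n-1)
    offset = sum(26 ** k for k in range(1, n))
    # within its length group, the word is just a base-26 number
    value = 0
    for ch in word:
        value = value * 26 + (ord(ch) - ord("A"))
    return offset + value
```

With this mapping, word_index("AA") is 26, and word_index("CAP") - word_index("CAN") is 2, matching the distances above.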
I analyzed the resulting dictionary to see if you could prune your 'potential' dictionary at all to reduce distances. I was looking for letters that didn't start a word, end a word, or follow a certain other letter, e.g. "qz". There's one invalid starting char, two invalid ending chars, and 249 invalid pairs, encoded into a string of 1+1+26=28 lists here:
Also note that you only need 5 bits per char (26 letters plus a separator), so those 279 chars plus a terminator (280 × 5 = 1400 bits) would fit in exactly 175 bytes.
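As a quick sketch of that packing, assuming a made-up 5-bit code (a-z → 0-25, separator → 26, terminator → 27 -- not anything actually used by the cart):

```python
def pack5(s):
    # hypothetical 5-bit alphabet: 'a'-'z' -> 0..25, ',' -> 26, end marker -> 27
    codes = [26 if c == "," else ord(c) - ord("a") for c in s] + [27]
    acc = bits = 0
    out = bytearray()
    for code in codes:
        acc = (acc << 5) | code  # append 5 bits to the accumulator
        bits += 5
        while bits >= 8:         # emit every complete byte
            bits -= 8
            out.append((acc >> bits) & 0xFF)
    if bits:                     # flush leftover bits, zero-padded
        out.append((acc << (8 - bits)) & 0xFF)
    return bytes(out)
```

For a 279-char rule string, that's 280 codes × 5 bits = 1400 bits, i.e. exactly 175 bytes.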
I wrote some code to work out the reduction in potential dictionary size, based on those rules. I'm getting a figure of about 17.7% of the original size for a 1-to-6-letter-word dictionary and 12.8% for a 1-to-7-letter-word dictionary, assuming I'm calculating correctly.
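For reference, here's a Python sketch of that calculation (not my original code), using the rule string from the PS below; it builds up counts of rule-respecting strings one letter at a time:

```python
# rule string from the PS: <invalid starts>,<invalid ends>,<doesn't follow 'a'>,...
RULES = ("z,jz,z,cfgkpqvxz,bdfgjpvwxz,bcfqxz,z,bcdhjkmnpqvwxz,cfjkqvxz,"
         "bcdfghjkpqvxz,hwyz,bcdfghjklmnpqrstvwxyz,bcdfgjmpqtvwxz,jqxz,"
         "cghjkqvwxz,qz,qz,bcdfgjmnqvxz,abcdefghijklmnopqrstvwxyz,jz,"
         "fgjvxz,bgjnqvxz,hjqwz,bcdfghjklmnpqrstvwxz,cgjqvwxz,"
         "bdgjkmnqrsvwz,jqvyz,abcdefghijklmnopqrstuvwxyz")

def valid_fraction(max_len, rules=RULES):
    parts = rules.split(",")
    bad_start, bad_end = set(parts[0]), set(parts[1])
    bad_after = parts[2:]  # bad_after[i]: letters that can't follow chr(97 + i)
    letters = "abcdefghijklmnopqrstuvwxyz"
    # dp[c] = number of rule-respecting prefixes of the current length ending in c
    dp = {c: (0 if c in bad_start else 1) for c in letters}
    valid = total = 0
    for n in range(1, max_len + 1):
        if n > 1:
            # extend every prefix by one letter, respecting the pair rules
            dp = {d: sum(dp[c] for c in letters
                         if d not in bad_after[ord(c) - 97])
                  for d in letters}
        # only prefixes ending in a legal final letter count as words
        valid += sum(v for c, v in dp.items() if c not in bad_end)
        total += 26 ** n
    return valid / total
```

Running valid_fraction(6) should reproduce roughly the 17.7% figure, and valid_fraction(7) the 12.8% one.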
Now, I'm not sure that pre-eliminating entries from the dictionary will get you longer/better runs of valid/invalid words, but it's possible. Distances should be shorter in general. The cost to include the rules is pretty low, so it wouldn't take much to make it worth it.
Mind you, with more room comes the urge to add more words, and the number of rules will probably shrink as words are added. Add "zoo" and the "doesn't start with 'z'" rule goes away. Add "pizzazz" and "raj" and we lose both of the "doesn't end with" rules too. Still, a lot should remain.
Anyway, I had food for thought and decided to split the meal with you. :)
PS: Here's the text from the image, if you want to play around with it quickly. It's just <invalid starts>,<invalid ends>,<doesn't follow 'a'>, ... ,<doesn't follow 'z'>.
z,jz,z,cfgkpqvxz,bdfgjpvwxz,bcfqxz,z,bcdhjkmnpqvwxz,cfjkqvxz,bcdfghjkpqvxz,hwyz,bcdfghjklmnpqrstvwxyz,bcdfgjmpqtvwxz,jqxz,cghjkqvwxz,qz,qz,bcdfgjmnqvxz,abcdefghijklmnopqrstvwxyz,jz,fgjvxz,bgjnqvxz,hjqwz,bcdfghjklmnpqrstvwxz,cgjqvwxz,bdgjkmnqrsvwz,jqvyz,abcdefghijklmnopqrstuvwxyz