20250808_150008

Photo by Scott Hamilton Personal Collection of Commodore Equipment.

I was reading a very interesting article about classic computing this week and learned that someone has managed to use a new technology to allow a classic computer to do something that seems impossible. A modern classic computing software developer, John Byrd, recently published a spell-checker software for the Commodore 64 computer from the early 1980s which has a major limitation when it comes to storing word lists. You see the Commodore 64 only has 64 kilobytes (KB) of memory and modern practical word lists for spell-checking contains 123,676 words and takes 1.19 megabytes (MB) to store. To put it into perspective, the disk on the C64 only held 170KB, which means in order to store the word list alone requires no less than seven disks. Not only will the list not fit in the memory of the computer, it won’t even fit on the storage media of the day. So how did Byrd do it?

Byrd used a fairly new technology which is one of the main drivers behind modern large language model artificial intelligence agents (LLMs). The technology he used is called a Bloom filter, which ironically is a classic technology itself, but didn’t come into mainstream use until recently because of the compute power required to use it reliably. The Bloom filter was designed by Burton Howard Bloom in 1970 as a method to test if an element was a member of a set. It has a trade-off of accuracy for immense memory efficiency; it allows for false positives, but zero false negatives, which basically means that it has the probability of saying an object is in the set when in reality it is not, but it will never tell you an object that is not in the list unless it is not in the list. As you can see from this use case it shrinks the amount of memory required to store the set by a factor of at least seven times in this particular case.

Let’s start with describing the Bloom filter and then go on to explain how it works for storing a spelling dictionary. The filter works by using hashing algorithms in conjunction with binary arrays to determine if an object is in the original list of objects. You start off by creating a binary array, which is a list of values that can only store a 1 or a zero in each block; this is also referred to as a bit and is the smallest possible memory element in any computer system. You can think of it as a bunch of tiny boxes that are either full or empty. We start off with all of the boxes being empty.

Now to add an item into the set, we take the value of the item, which can be anything, a word, a phrase, a number, the Bloom filter doesn’t care, it just adds whatever you tell it to add. It’s not even smart enough to know the difference between a word and number, but it can tell you later if you never added a particular item with 100% certainty, but can only tell you that you probably added the item because there is a risk of overlapping bit patterns. So you start with the item and you run it through an algorithm, called a hash algorithm, which is a mathematical formula that scrambles input data into a unique fixed-length string of characters referred to as a hash. This function is not reversible, you cannot figure out the original data from the generated hash, but same data will always give the same hash. This hash is then translated to its binary equivalent, which is a little complex to explain but in other words it becomes a unique string of ones and zeros. You then fill the buckets in the original array with the values from the hash; if the hash contains a one in a location you make that spot in the Bloom array a one.

How do you find out if an item is in the Bloom array? You take the item you are looking for and perform the same hash you would in order to insert it and check to see if every place it would have a one has a one in the Bloom array. Of course while its hash is unique, it is possible as enough values get entered that overlapping bits can result in the algorithm saying an item is in the list when it is not. But what can never happen, because an item inserted turns a bit in the Bloom array to a one, is that you report an item is in the list when it is not, because there is no way for a zero to exist in a location if the item was entered in the list.

I know that last part got a little confusing, but basically what this does is allows us to create a Bloom filter that fits in the 170KB of the C64 disk that has stored the hashes of all 123,676 words in the SCOWL 60 word list (used by all English spell-check programs) and has an amazing accuracy. It will never tell you that a correctly spelled word is incorrect, but it may rarely report that a misspelled word is correct. In fact with Byrd’s algorithm, one in 123 misspelled words might get through. This makes it possible to have modern level spell-check capabilities on a quite old computer system. The maximum dictionary size for spell check in the 1980s was a mere 25,000 words and that required swapping disks during the spell check to compare the entire dictionary, and the spell-check could take literally hours. Using this algorithm makes it work almost as quickly as a modern spell-check, but slightly less accurate. The bad news is that we would not have had enough compute power in the early 1980s to generate the Bloom filter without relying on much larger commercial computer systems to generate the filter. This is exactly what Byrd did; he uses a modern computer to generate the Bloom filter and then runs the filter on the C64.

I am impressed with the advancements in computers, but even more impressed by the fact that with modern techniques we are able to utilize older technology to do even more than they were designed to accomplish. Until next week stay safe and learn something new.

Scott Hamilton is an Expert in Emerging Technologies at ATOS and can be reached with questions and comments via email to shamilton@techshepherd.org or through his website at https://www.techshepherd.org.

Leave a Reply

Share via
Copy link
Powered by Social Snap