Last Friday when I was in the gym doing my cardio warm up on the cardio theater’s screens it was playing a phone-in game show. The quiz that the players were asked to solve was to find a word, the letters of which were presented jumbled up. The gym’s owner was there and we tried quite hard to find the answer but after a few minutes we gave up, recognizing that is indeed a very hard problem. We exchanged some comments about how in earth is it possible for people to pay all that money to phone in trying to find the right answer and then getting in a lottery to win a small prize. After that the owner was called to attend to something and I was left alone.
The first try
Having 10 more minutes on this very boring part of my workout I started to think. How could I solve this problem using a computer. There must be a database of all the Greek words that I can search on? My first though was to take all the permutations of the letters and search them against the word database but then I counted the number of letters which were 12. Thinking the number of possible permutations would mean factorial of 12 possibles to check. But wait… factorial can get huge fast so no, that would probably not work. (I run the numbers at home and actually 12! = 479,001,600 !).
Then I thought, since I know all the words, why not create a tree so I can know what words start with one letter, which words have a second letter and so on and finally the word at the leaf. This way I can traverse the tree with what letters I have at each level/ position in the word and finally reaching the leafs to find the word or words that correspond to the jumbled letters. But how much memory would that need? Would it be fast enough, or would it still take forever to search?… but then I finished my warm-up exercise and went on with the rest of my session.
The days went on, and whenever I had dead time I was thinking about all these questions, how it should work, would it perform, would it fit in memory etc? But then I reached a point where I was not able to do any more progress in my mind alone. The only way forward to find the answers would be to write code and test it! I had the gut feeling this would be a good idea solving the original problem and I wanted to know if my instinct was right. I also thought this would be good content for a post so I decided to write the code :).
Each node is about a character that words contain in that levels and what possible characters follow as references to other nodes each containing the possible letter. It also describes whether its character is the last in a word. The root node is special, having no specific letter. You might notice that I did not put the complete words as leafs of the tree. Instead leafs are nodes that have end true. For my case they were not required and I did not include them to conserve memory. I will elaborate more when I talk about the search algorithm.
Say I want to represent the words ΔΩΡΟ, ΓΑΤΑ, ΓΑΛΑ , ΓΚΟΛ and ΓΑΤΑΣ. In the root since my words start with either with Δ or Γ, I have two references to a node representing Δ the one and Γ the other. For ΔΩΡΟ we see that there are no other words sharing any of the next characters so we have one ref per node for each level until we reach the end with an empty leaf with END=true. For the other words we can see that they all share their first letter Γ. So they all are reachable by the Γ node in the second level of the tree. The words ΓΑΤΑ, ΓΑΛΑ and ΓΑΤΑΣ also share their second letter Α so they are in turn reachable from the A node at the third level of the tree. ΓΚΟΛ diverges from the second letter so we can also see it’s path diverge, being reachable by the Κ node at the second level.
To populate the tree the word database is iterated and for each word we iterate over the letters. Starting at the root node we follow the next reference node for each letter. If a reference node does not currently exist for a letter a new one is created. When there are no more letters in the word the last node that we encounter we also set the end=true flag denoting this whole path to represent a word.
Searching the tree for a permutated word starts with a list of all it’s characters. Then at each node level we match this list to characters that have references to lower nodes. For each of these characters we remove it from the list of remaining and move onto the next referenced node. We also store the matched characters in a list of found characters. When we reach where we have depleted our remaining characters we know we have found our original non permutated word. At this point the list of matched characters also represents our original dictionary word. This is why I finally did not add a leaf node with the complete word since the list of visited nodes also gives us the valid word we are searching for.
I have implemented and played with a Java version of the tree and algorithm described here. You can find the code in this GitHub repository https://github.com/gaganis/TreeStringIndex. I have used the the latest aspell database for Greek words from here elspell.math.upatras.gr. This database contains 588,557 greek words. In this they are contained with their usual lowercase form where they also contain punctuation marks. So when populating the tree I also pass them through class WordPreProcessor which converts punctuated letters to non punctuated form and then everything to uppercase.
I have created a main class that populates a tree with this database. Then goes through this list sorted descending from the longest to the shortest word, shuffles each word and then does a search.
I used a 2008 Core2 Duo @ 2.00GHz laptop with 3GB of memory to run this and I got the following results. The tree population takes 8.5-10 seconds and then each search takes less than 500ms except for the first and longest word – ΣΤΡΟΓΓΥΛΟΚΟΥΛΟΥΡΙΑΖΟΝΤΟΥΣΑΝ – which takes about 1.5-2 seconds to be found but I suspect this is also due to the cache warming-up.
Related data structures
I have done some research to find what kind of tree this is and what algorithms it is related but I wasn’t able to find much. I have found this wikipedia article https://en.wikipedia.org/wiki/Generalized_suffix_tree that describes a quite similar tree in the form of the example provided but I suspect the intent and precise structure are different.
I would be really grateful if anyone commented/contacted me with more info about what kind of tree this is and how it is related to other string trees and algorithms.
The idea seems to actually work for solving the original problem. Even with populating the tree and then running a search, the time it takes is about 10seconds which is acceptable to me. Since now it is not really portable, the next step is to make this into a Web Service so I can easily access it on my phone. So next time this quiz pops up, I will be able to quickly answer it! Some would argue though, this constitutes cheating since I would be using an algorithm and computer to get the answer instead of only ones mind.
Thanks for reading!
One thought on “A tree data structure for String indexing solving a phone-in TV quiz.”
Interesting George! Keep on working out!