Thursday, March 12, 2009

Simple Anagrammer

While watching my family feverishly play word jumble one day, I decided to write a program to list anagrams. They're still addicted to that game, but I've long since finished up my program. Although it wasn't much effort, I thought I'd go over what I came up with. I wrote code to find single word anagrams, sub anagrams, and multi word anagrams. Since those may be my own nomenclature, I'll explain more clearly. What I'm calling a Single Word Anagram is a single word made up from all the letters of one or more words, such as "cat" to "act" and "resume" to "sure me." What I'm calling Sub Anagrams are possible shorter words made up from one or more words, such as "cat" to "catharsis" and "user" to "sure me." What I'm calling Multi Word anagrams are two or more words made up from all the letters of , such as "sure me" to "resume." Now let's get into my implementation. You should know that my implementation does assume you're more concerned about processing time than memory space.

Also, I originally wrote the code in java, but that language is not very suited for code snippets... it's so, so verbose. So all my code examples are going to be in ruby, with an emphasis of being as concise as possible. I left out all the optimizations I did for my java program, I'm only intending to give a general example of how the code works. So the code runs very, very slowly for long words. But, you'll get the idea of what's going on.

Word List

Before I get into my algorithms, I should go over obtaining a good word list, which is probably the most important step. A word list is exactly what it sounds like and it's what you'll use to determine if a word is real or not. There are lots of word lists out there and you should choose one depending on how thorough you want to be and what language you want to use. There's one in /usr/dict/words if you're using unix, but it isn't very good. I like the very flexible set I found at word list called scowl. It gives you a lot of different word lists you can pick and choose from. The word lists are found in "[scowl home]\final"; they're arranged by a description followed by their part number. A part number of 10 includes a small set of very common words, a part number of 95 contains a very large set of uncommon words. No two sets for a description contain the same word. For my word list, I used english-words.10, english-words.20, english-words.30, and english-words.40. You can trim down the word lists however you want.

By the way, I also found a word list at wordnet that you can also use as a dictionary for looking up definitions. It's free, which is nice.

Single Word Anagrams

After you have your word list, you need a good algorithm. Let's say you want to find all the anagrams of a word. The most straightforward solution is to find all the permutations of the letters in that word and then figure out which ones are real by comparing to your word list. Of course, this is horribly inefficient. I mean, for every n length word you're doing n! comparisons. It's doable on today's machines, just very, very slow.

A better solution is to create a special dictionary (/hash table) for yourself. The key will be the letters of a word arranged alphabetically, the value will be a list of words with that key. So, for example, the key "abt" will contain the word list "bat, tab". You'll have to iterate over your whole word list to create the dictionary, which will take a few seconds, but it is a one time thing. And once it's created, listing the anagram is as easy as arranging the letters of your target word alphabetically and looking it up in the dictionary you created. Make sure everything is lower case for your comparisons.

Here's my ruby code:
AnagrammerBySort.rb - Part 1
view plaincopy to clipboardprint?
  1. def words text  
  2.   text.downcase.scan(/['\w]+/)  
  3. end  
  4.   
  5. def getWordKey word  
  6.   return word.split("").sort!.to_s.strip  
  7. end  
  8.   
  9. def train features  
  10.   model = Hash.new{|hash,key| hash[key] = Array.new}  
  11.   features.each {|f| model[getWordKey(f)].push(f)}  
  12.   return model  
  13. end  
  14.   
  15. NWORDS = train(words(File.new("words.txt").read))  
  16.   
  17. def singleWordAnagram word  
  18.   return NWORDS[getWordKey(word)]  
  19. end  
Sub Anagrams

To find sub anagrams, you first need to be able to determine if a word is a subset of another word. Remember that duplicates count for anagrams, in that "aabc" is not a subset of "abcd." There are many, many ways of checking subsets. In my ruby code example, I just went for the most concise way. It actually performs really poorly.

Once you have a way to determine subsets, all you need to do is iterate over all the keys in your anagram dictionary to determine if they are subsets of your target word. You could also create a new dictionary that maps all words in your word list to their sub anagrams, which would make many future sub anagram lookups instant (it won't be instant for words you haven't seen before.)

Here's my ruby code:
AnagrammerBySort.rb - Part 2
view plaincopy to clipboardprint?
  1. # Whether word1 a subset of word2, duplicates are not ignored.  
  2. def isSubset( word1, word2 )  
  3.   word1.split("").uniq.each { |f|   
  4.     return false if(word1.count(f) > word2.count(f))   
  5.   }  
  6.   return true  
  7. end  
  8.   
  9. def subAnagrams word  
  10.   solution = Array.new  
  11.   NWORDS.keys.each {|f| solution.push(NWORDS[f]) if(isSubset(f,word)) }  
  12.   return solution.flatten!  
  13. end  
(Edit: I found this page which explains a much faster (although much more memory intensive) method for finding Sub Anagrams that uses a tree algorithm *somewhat* similar to a tree. If you want to get deeper into anagrams, you should definitely take a look at it as well as its follow up. The comments are really useful to read too.)

Multi Word Anagrams

Multi Word Anagrams are easy after knowing how to make single word anagrams and sub anagrams. This is how I did it:

1) Make a candidate list of all possible sub anagrams of your target word.
2) Iterate the candidate list. In each iteration, remove the first word on the candidate list and compare it to all other words on the candidate list.
    A) Add a combination of the words (Word A Word B) to the candidate list if they are a subset of the target word.
    B) Record all words that have the same key as the target word key.
3) All the words you've recorded is your solution.

This has to be done carefully or you'll quickly run out of memory. You should add logic to make sure a word is added to the list only if you want to evaluate it later.

Here's my ruby code:
AnagrammerBySort.rb - Part 3
view plaincopy to clipboardprint?
  1. def multiWordAnagrams word  
  2.   solution = Array.new  
  3.   candidates = subAnagrams(word)  
  4.   while candidates.size > 0      
  5.     f = candidates.delete(candidates[0])  
  6.     candidates.each {|g|      
  7.       if( getWordKey(word) == getWordKey(f+" "+g) )   
  8.         solution.push(f+" "+g)  
  9.       elsif( isSubset( getWordKey(f+" "+g), getWordKey(word) ) )  
  10.         candidates.push(f+" "+g)  
  11.       end  
  12.     }  
  13.   end  
  14.   return solution  
  15. end  
  16.   
  17. # Testing #  
  18. require 'pp';  
  19. print "Enter word: "; word = gets.strip  
  20. print "Single Anagrams: "; pp singleWordAnagram(word)  
  21. print "Sub Anagrams: "; pp subAnagrams(word)  
  22. print "Multi Word Anagrams: "; pp multiWordAnagrams(word)  
Alternative Dictionary

I think alphabetical keys work very well for full word anagrams, but not as well for sub anagrams. The solution I preferred involved using a different key for the anagram dictionary. First assign each letter of the alphabet a prime number greater than 1. 'a' can be 2, 'b' can be 3, and so on. All characters you don't care about (such as apostraphes, periods, and spaces) should be assigned 1. You can find a list of prime numbers here. Use the smallest numbers you can, otherwise you'll have bigger problems later. The key for the new dictionary will be the product of the letters in the word. Like before, the value will be a list of words sharing the key. Since you're multiplying prime numbers, you'll know that if Word A / Word B is 0, then the words are anagrams of each other. If Word A % Word B is 0, then the Word A is a sub anagram of Word B. To determine sub anagrams, traverse all the keys in your dictionary and find the anagram lists that divide into the target word key evenly. Finding Multi Word Anagrams will also be basically the same as before, but switching out the way you determined words are anagrams or sub anagrams.

One thing to watch out for with the prime number method is the size the of the key. In java, a 32 Bit Int won't even hold the word resume. I used a long, but even that has issues if I'm trying to find the anagrams for a sentence, much less a paragraph. You can use code to deal with really large numbers or you can just just report an error if the word can't be represented. How you deal with this problem is up to you.

I admit, the prime number algorithm doesn't actually go noticably faster, although there should be significantly less comparisons being made. Furthermore, it does have a limit in the words based on the maximum size of the key. I still like this algorithm better. Anyways, here's the code using the prime number methodology:
  1. #!/usr/bin/ruby  
  2.   
  3. LETTERS = { 'a'=>2,'b'=>3,'c'=>5,'d'=>7,'e'=>11,'f'=>13,'g'=>17,'h'=>19,  
  4. 'i'=>23,'j'=>29,'k'=>31,'l'=>37,'m'=>41,'n'=>43,'o'=>47,'p'=>53,  
  5. 'q'=>59,'r'=>61,'s'=>67,'t'=>71,'u'=>73,'v'=>79,'w'=>83,'x'=>89,  
  6. 'y'=>97,'z'=>101 }  
  7. LETTERS.default=1  
  8.   
  9. # Warning: Requiring 2 Character Minimum, includes apostrophes  
  10. def words text  
  11.   text.downcase.scan(/['\w]{1}['\w]+/)  
  12. end  
  13.   
  14. # Warning: Not checking for value overflows  
  15. def getWordKey word  
  16.   product = 1  
  17.   n = word.length  
  18.   (0..n-1).collect {|c| product *= LETTERS[word[c,1]] }  
  19.   return product  
  20. end  
  21.   
  22. def train features  
  23.   model = Hash.new{|hash,key| hash[key] = Array.new}  
  24.   features.each {|f| model[getWordKey(f)].push(f)}  
  25.   return model  
  26. end  
  27.   
  28. NWORDS = train(words(File.new("words.txt").read))  
  29.   
  30. def singleWordAnagram word  
  31.   return NWORDS[getWordKey(word)]  
  32. end  
  33.   
  34. def subAnagrams word  
  35.   solution = Array.new  
  36.   NWORDS.keys.each {|f| solution.push(NWORDS[f]) if(getWordKey(word)%f==0) }  
  37.   return solution.flatten!  
  38. end  
  39.   
  40. def multiWordAnagrams word  
  41.   solution = Array.new  
  42.   candidates = subAnagrams(word)  
  43.   while candidates.size > 0   
  44.     f = candidates.delete(candidates[0])  
  45.     candidates.each { |g|   
  46.       if ( getWordKey(word) == getWordKey(f+" "+g) )   
  47.         solution.push(f+" "+g)  
  48.       elsif( getWordKey(word) % getWordKey(f+" "+g) == 0 )  
  49.         candidates.push(f+" "+g)  
  50.       end  
  51.     }  
  52.   end  
  53.   return solution  
  54. end  
  55.   
  56. # Testing #  
  57. require 'pp';  
  58. print "Enter word: "; word = gets.strip  
  59. print "Single Anagrams: "; pp singleWordAnagram(word)  
  60. print "Sub Anagrams: "; pp subAnagrams(word)  
  61. print "Multi Word Anagrams: "; pp multiWordAnagrams(word)  
Those are the basic solutions to all the problems I played with. The above should be good inspiration to make your own anagrammer. There's still lots of fine tuning and optimizations you can add that I didn't get into. For example, you may want to ignore words with apostrophes or words less than two characters. You could work in a multi-threaded fashion as well. There are also lots of word list tweaks you can do. There's also a lot of features you can add; fun behavioral tweaks. For example, you might want to look for sentences from your multi word anagrams by throwing them through a grammar checker. There are lots of really fun anagrams out there and I'm sure you could write a program to find many more.

No comments: