While watching my family feverishly play word jumble one day, I decided to write a program to list anagrams. They're still addicted to that game, but I've long since finished up my program. Although it wasn't much effort, I thought I'd go over what I came up with. I wrote code to find single word anagrams, sub anagrams, and multi word anagrams. Since those may be my own nomenclature, I'll explain more clearly. What I'm calling a Single Word Anagram is a single word made up from all the letters of one or more words, such as "cat" to "act" and "resume" to "sure me." What I'm calling Sub Anagrams are possible shorter words made up from one or more words, such as "cat" to "catharsis" and "user" to "sure me." What I'm calling Multi Word anagrams are two or more words made up from all the letters of , such as "sure me" to "resume." Now let's get into my implementation. You should know that my implementation does assume you're more concerned about processing time than memory space.
Also, I originally wrote the code in java, but that language is not very suited for code snippets... it's so, so verbose. So all my code examples are going to be in ruby, with an emphasis of being as concise as possible. I left out all the optimizations I did for my java program, I'm only intending to give a general example of how the code works. So the code runs very, very slowly for long words. But, you'll get the idea of what's going on.
Word ListBefore I get into my algorithms, I should go over obtaining a good word list, which is probably the most important step. A word list is exactly what it sounds like and it's what you'll use to determine if a word is real or not. There are lots of word lists out there and you should choose one depending on how thorough you want to be and what language you want to use. There's one in /usr/dict/words if you're using unix, but
it isn't very good. I like the
very flexible set I found at word list called scowl. It gives you a lot of different word lists you can pick and choose from. The word lists are found in "[scowl home]\final"; they're arranged by a description followed by their part number. A part number of 10 includes a small set of very common words, a part number of 95 contains a very large set of uncommon words. No two sets for a description contain the same word. For my word list, I used english-words.10, english-words.20, english-words.30, and english-words.40. You can trim down the word lists however you want.
By the way, I also found
a word list at wordnet that you can also use as a dictionary for looking up definitions. It's free, which is nice.
Single Word AnagramsAfter you have your word list, you need a good algorithm. Let's say you want to find all the anagrams of a word. The most straightforward solution is to find all the permutations of the letters in that word and then figure out which ones are real by comparing to your word list. Of course, this is horribly inefficient. I mean, for every n length word you're doing n! comparisons. It's doable on today's machines, just very, very slow.
A better solution is to create a special dictionary (/hash table) for yourself. The key will be the letters of a word arranged alphabetically, the value will be a list of words with that key. So, for example, the key "abt" will contain the word list "bat, tab". You'll have to iterate over your whole word list to create the dictionary, which will take a few seconds, but it is a one time thing. And once it's created, listing the anagram is as easy as arranging the letters of your target word alphabetically and looking it up in the dictionary you created. Make sure everything is lower case for your comparisons.
Here's my ruby code:
- def words text
- text.downcase.scan(/['\w]+/)
- end
-
- def getWordKey word
- return word.split("").sort!.to_s.strip
- end
-
- def train features
- model = Hash.new{|hash,key| hash[key] = Array.new}
- features.each {|f| model[getWordKey(f)].push(f)}
- return model
- end
-
- NWORDS = train(words(File.new("words.txt").read))
-
- def singleWordAnagram word
- return NWORDS[getWordKey(word)]
- end
Sub AnagramsTo find sub anagrams, you first need to be able to determine if a word is a subset of another word. Remember that duplicates count for anagrams, in that "aabc" is not a subset of "abcd." There are many, many ways of checking subsets. In my ruby code example, I just went for the most concise way. It actually performs really poorly.
Once you have a way to determine subsets, all you need to do is iterate over all the keys in your anagram dictionary to determine if they are subsets of your target word. You could also create a new dictionary that maps all words in your word list to their sub anagrams, which would make many future sub anagram lookups instant (it won't be instant for words you haven't seen before.)
Here's my ruby code:
-
- def isSubset( word1, word2 )
- word1.split("").uniq.each { |f|
- return false if(word1.count(f) > word2.count(f))
- }
- return true
- end
-
- def subAnagrams word
- solution = Array.new
- NWORDS.keys.each {|f| solution.push(NWORDS[f]) if(isSubset(f,word)) }
- return solution.flatten!
- end
(Edit: I found this page which explains a much faster (although much more memory intensive) method for finding Sub Anagrams that uses a tree algorithm *somewhat* similar to a tree. If you want to get deeper into anagrams, you should definitely
take a look at it as well as
its follow up. The comments are really useful to read too.)
Multi Word AnagramsMulti Word Anagrams are easy after knowing how to make single word anagrams and sub anagrams. This is how I did it:
1) Make a candidate list of all possible sub anagrams of your target word.
2) Iterate the candidate list. In each iteration, remove the first word on the candidate list and compare it to all other words on the candidate list.
A) Add a combination of the words (Word A Word B) to the candidate list if they are a subset of the target word.
B) Record all words that have the same key as the target word key.
3) All the words you've recorded is your solution.
This has to be done carefully or you'll quickly run out of memory. You should add logic to make sure a word is added to the list only if you want to evaluate it later.
Here's my ruby code:
- def multiWordAnagrams word
- solution = Array.new
- candidates = subAnagrams(word)
- while candidates.size > 0
- f = candidates.delete(candidates[0])
- candidates.each {|g|
- if( getWordKey(word) == getWordKey(f+" "+g) )
- solution.push(f+" "+g)
- elsif( isSubset( getWordKey(f+" "+g), getWordKey(word) ) )
- candidates.push(f+" "+g)
- end
- }
- end
- return solution
- end
-
-
- require 'pp';
- print "Enter word: "; word = gets.strip
- print "Single Anagrams: "; pp singleWordAnagram(word)
- print "Sub Anagrams: "; pp subAnagrams(word)
- print "Multi Word Anagrams: "; pp multiWordAnagrams(word)
Alternative DictionaryI think alphabetical keys work very well for full word anagrams, but not as well for sub anagrams. The solution I preferred involved using a different key for the anagram dictionary. First assign each letter of the alphabet a prime number greater than 1. 'a' can be 2, 'b' can be 3, and so on. All characters you don't care about (such as apostraphes, periods, and spaces) should be assigned 1. You can find a list of prime numbers
here. Use the smallest numbers you can, otherwise you'll have bigger problems later. The key for the new dictionary will be the product of the letters in the word. Like before, the value will be a list of words sharing the key. Since you're multiplying prime numbers, you'll know that if Word A / Word B is 0, then the words are anagrams of each other. If Word A % Word B is 0, then the Word A is a sub anagram of Word B. To determine sub anagrams, traverse all the keys in your dictionary and find the anagram lists that divide into the target word key evenly. Finding Multi Word Anagrams will also be basically the same as before, but switching out the way you determined words are anagrams or sub anagrams.
One thing to watch out for with the prime number method is the size the of the key. In java, a 32 Bit Int won't even hold the word resume. I used a long, but even that has issues if I'm trying to find the anagrams for a sentence, much less a paragraph. You can use code to deal with really large numbers or you can just just report an error if the word can't be represented. How you deal with this problem is up to you.
I admit, the prime number algorithm doesn't actually go noticably faster, although there should be significantly less comparisons being made. Furthermore, it does have a limit in the words based on the maximum size of the key. I still like this algorithm better. Anyways, here's the code using the prime number methodology:
-
-
- LETTERS = { 'a'=>2,'b'=>3,'c'=>5,'d'=>7,'e'=>11,'f'=>13,'g'=>17,'h'=>19,
- 'i'=>23,'j'=>29,'k'=>31,'l'=>37,'m'=>41,'n'=>43,'o'=>47,'p'=>53,
- 'q'=>59,'r'=>61,'s'=>67,'t'=>71,'u'=>73,'v'=>79,'w'=>83,'x'=>89,
- 'y'=>97,'z'=>101 }
- LETTERS.default=1
-
-
- def words text
- text.downcase.scan(/['\w]{1}['\w]+/)
- end
-
-
- def getWordKey word
- product = 1
- n = word.length
- (0..n-1).collect {|c| product *= LETTERS[word[c,1]] }
- return product
- end
-
- def train features
- model = Hash.new{|hash,key| hash[key] = Array.new}
- features.each {|f| model[getWordKey(f)].push(f)}
- return model
- end
-
- NWORDS = train(words(File.new("words.txt").read))
-
- def singleWordAnagram word
- return NWORDS[getWordKey(word)]
- end
-
- def subAnagrams word
- solution = Array.new
- NWORDS.keys.each {|f| solution.push(NWORDS[f]) if(getWordKey(word)%f==0) }
- return solution.flatten!
- end
-
- def multiWordAnagrams word
- solution = Array.new
- candidates = subAnagrams(word)
- while candidates.size > 0
- f = candidates.delete(candidates[0])
- candidates.each { |g|
- if ( getWordKey(word) == getWordKey(f+" "+g) )
- solution.push(f+" "+g)
- elsif( getWordKey(word) % getWordKey(f+" "+g) == 0 )
- candidates.push(f+" "+g)
- end
- }
- end
- return solution
- end
-
-
- require 'pp';
- print "Enter word: "; word = gets.strip
- print "Single Anagrams: "; pp singleWordAnagram(word)
- print "Sub Anagrams: "; pp subAnagrams(word)
- print "Multi Word Anagrams: "; pp multiWordAnagrams(word)
Those are the basic solutions to all the problems I played with. The above should be good inspiration to make your own anagrammer. There's still lots of fine tuning and optimizations you can add that I didn't get into. For example, you may want to ignore words with apostrophes or words less than two characters. You could work in a multi-threaded fashion as well. There are also lots of word list tweaks you can do. There's also a lot of features you can add; fun behavioral tweaks. For example, you might want to look for sentences from your multi word anagrams by throwing them through a grammar checker. There are lots of really fun anagrams out there and I'm sure you could write a program to find many more.