Register forum user name Search FAQ

Gammon Forum

Notice: Any messages purporting to come from this site telling you that your password has expired, or that you need to verify your details, confirm your email, resolve issues, making threats, or asking for money, are spam. We do not email users with any such messages. If you have lost your password you can obtain a new one by using the password reset link.

Due to spam on this forum, all posts now need moderator approval.

Entire forum ➜ MUSHclient ➜ Suggestions ➜ Spell checker improvements?

Spell checker improvements?

It is now over 60 days since the last post. This thread is closed. Refresh page

Pages: 1 2 3

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Thu 05 Oct 2006 02:14 AM (UTC)

Message

There have been a few requests in the past for improvements to the spell checker. Currently MUSHclient uses a proprietary system purchased from Wintertree Software Inc.

I am contemplating moving to a custom system, which would potentially have a few advantages:

You could supply your own dictionaries (eg. in other languages, like French, Greek, German, etc.).
The behaviour could be customised more (eg. ignore the first word, like @aset or something like that).
Maybe auto-correct things (like 'teh' to 'the').
Maybe correct as-you-type.

It seems to me that spell checking isn't really rocket science, it shouldn't be too hard to do something that is at least as good as we currently have.

I guess the features which are probably needed to make this work are:

Possibly ignore lines which match an alias.
Break up entered text into "words". A word would probably be a sequence of letters and possibly numbers, probably including single quotes (eg. Nick's).
Discard certain words if required (eg. starting with caps, all caps, words with numbers in them, words with mixed case, words of all numbers).
Look up remaining words in a dictionary, possibly allowing for a "user dictionary" which can be added to.
Also check for doubled words (eg. "bird in the the hand").
If the word isn't found, possibly auto-correct if in a list (eg. 'teh' to 'the').
Otherwise display a dialog box with the "wrong" word displayed, and a list of suggested alternatives.
Allow manual correction, selecting from an alternative, or adding the word to a user dictionary.

I wonder what the correct treatment of case errors is? For instance, you enter "boston" but "Boston" is in the dictionary ...

Accept as entered
Auto-correct the case
Suggest the corrected case
Just list all alternative-sounding words as usual

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #1 on Thu 05 Oct 2006 03:46 AM (UTC)

Message

From another thread, Ksilyan wrote:

Quote:

(That being said, I would be happy to provide Nick with a C++ class I wrote that fully implements a character-based lexical trie; the work would then be to add all words displayed into that trie and then query it for words. It already takes care of things like asking for a list of words that start with a certain prefix, or even a list of words that match a (simple) regular expression.)

Would that be suitable for a spell checker? Would it take less storage than a simple table? I currently tested loading in thousands of words into a Lua table, it seems fast enough, but takes about 9 Mb of memory.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by David Haley USA (3,881 posts) Bio

Date Reply #2 on Thu 05 Oct 2006 07:04 AM (UTC)

Message

Yes, lexical tries are very, very good for anything, really, that needs to store a list of words.

My class also happens to have a built-in suggestion request method, which you might find helpful. :-)

Here are some memory statistics, as seen by 'top':


13723 david     18   0 26992  20m  896 S    0  1.0   0:01.07 trie
13748 david     17   0  8776 2612  896 S    0  0.1   0:00.10 trie

Recall that memory usage is in the 5th column (virtual memory i.e. total) and 6th column (resident memory).

In the big case (26mb) I loaded 127,142 words. In the small case (8k), I loaded the first 10,000 words of the big file -- all the words up to and including 'bewail'.

One reason we're seeing such a big memory savings is that tries are very good at compressing string with common prefixes. It would be a better test of memory usage to load 10,000 random words and see what the memory usage is. Nonetheless, I'm not sure how many thousand words you added in your case -- let's assume it was 10,000. That took 9 mb in a straight Lua table. Mine takes just over twice as much memory for 12 times as many words; one would assume that the Lua table size would increase more or less linearly as you add more and more words.

I will put the code up on my website in a moment.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #3 on Thu 05 Oct 2006 07:46 AM (UTC)

Message

My only problem is that I need to keep suggested spellings. That is, for a given word I need to know "sound alike" words. Merely knowing "present or not" is not sufficient. Will the trie handle that? To put it another way, say someone enters "bote", we might suggest "boat", "bat", "bot", "boot" and so on.

I have an algorithm for the sound-alike stuff, and what I am currently doing is converting that (to, say, "BT") and then looking up that by key in the Lua table. Under BT in the table will be a sequential list of all the words that happen to sound like that. Now, if "boat" is in the list, then that counts as correct spelling. This also gives me a chance to match on, optionally, same or different case. If it isn't in the list, I immediately have all the sound-alikes.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by David Haley USA (3,881 posts) Bio

Date Reply #4 on Thu 05 Oct 2006 07:56 AM (UTC)

Message

Yes, it can do that. It uses edit distance as a measurement of proximity, which is not necessarily the best.

Here is what it gives for an edit distance of 1:


Suggested corrections for 'bote', e.d. = 1:
  bate bite bode boite bole bone bore bot bota botel both bots bott byte cote dote mote note rote tote vote

The word 'boat' doesn't appear in that, which is unfortunate but that's what edit distance does. If we increase the max edit distance to 2, we get all kinds of words, many of which we don't want.

A smarter algorithm would assign a max 'cost', and then use some heuristic to determine which changes cost how much. For instance changing 'o' to 'i' might be cheap due to keyboard proximity. This would be a fair bit more complicated, though.

I have finished packaging the class and am releasing it now. I'll put a post in Prog:General with the link. :)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #5 on Thu 05 Oct 2006 08:46 AM (UTC)

Amended on Thu 05 Oct 2006 08:47 AM (UTC) by Nick Gammon

Message

I really want sound-alikes.

Using the rules I currently have, looking up "knife" (which converts to NF) gives this:



1="knave"

2="knave's"

3="knaves"

4="knife"

5="knife's"

6="knifes"

7="knives"

8="naive"

9="naives"

10="nave"

11="nave's"

12="naves"

13="navies"

14="navy"

15="navy's"

16="nephew"

17="nephew's"

18="nephews"

19="nova"

20="nova's"

21="novae"

22="novas"

Some don't even start with "k" (but sound like "knife").

Maybe a trie per sound-alike sound.

The number of sound-alikes depends on the number of characters you store, for 4 characters I get 10181 unique sounds, but for 6 characters I get 23204 sounds.

This is from a dictionary of 76822 words. The whole thing loads from a straight text file, including computing the sound-alikes (itself implemented in C) in about a second.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #6 on Thu 05 Oct 2006 09:31 PM (UTC)

Message

Hmm. I wonder if this sound-alike notion could be adapted to the trie's notion of 'word distance'. Or, as you say, store a trie per sound-alike. Or, another idea, perhaps more efficient given the large number of sounds you have, would be to have a trie of sounds, and instead of a simple boolean indicating presence, you would use a pointer to another trie of words that sound alike. (And to really save memory you would want those tries to share string memory. My shared string library comes to mind...)

How do you map a given word to its "sound representation"? Do you compute that on the fly?

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #7 on Thu 05 Oct 2006 10:57 PM (UTC)

Message

This is the documentation I have for the algorithm, and yes it is done on the fly. ...


/***************************************************************

 Metaphone Algorithm

   Created by Lawrence Philips (location unknown). Metaphone presented
   in article in "Computer Language" December 1990 issue.

   Converted from Pick BASIC, as demonstrated in article, to C by
   Michael J. Kuhn (Baltimore, Maryland)

   My original intention was to replace SOUNDEX with METAPHONE in
   order to get lists of similar sounding names that were more precise.
   SOUNDEX maps "William" and "Williams" to the same values. METAPHONE
   as it turns out DOES THE SAME.  There are going to be problems
   that you need to resolve with your own set of data.

   Basically, for my problem with S's I think that if

      IF metaphone[strlen(metaphone)] == "S"
                                  AND strlen(metaphone) >= 4  THEN

           metaphone[strlen(metaphone)] = ""

   You can add you own rules as required.

   Also, Lawrence Philips suggests that for practical reasons only the
   first 4 characters of the metaphone be used. This happens to be the
   number of characters that Soundex produces. This is indeed practical
   if you already have reserved exactly 4 characters in your database.

   In addition an analysis of your data may show that names are split
   into undesirable "metaphone groups" as the number of metaphone characters
   increases.

             *********** BEGIN METAPHONE RULES ***********

 Lawrence Philips' RULES follow:

 The 16 consonant sounds:
                                             |--- ZERO represents "th"
                                             |
      B  X  S  K  J  T  F  H  L  M  N  P  R  0  W  Y

 Exceptions:

   Beginning of word: "ae-", "gn", "kn-", "pn-", "wr-"  ----> drop first letter
                      "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"

   Beginning of word: "x"                                ----> change to "s"
                                      as in "Deng Xiaopeng"

   Beginning of word: "wh-"                              ----> change to "w"
                                      as in "Whalen"

 Transformations:

   B ----> B      unless at the end of word after "m", as in "dumb", "McComb"

   C ----> X      (sh) if "-cia-" or "-ch-"
           S      if "-ci-", "-ce-", or "-cy-"
                  SILENT if "-sci-", "-sce-", or "-scy-"
           K      otherwise, including in "-sch-"

   D ----> J      if in "-dge-", "-dgy-", or "-dgi-"
           T      otherwise

   F ----> F

   G ---->        SILENT if in "-gh-" and not at end or before a vowel
                            in "-gn" or "-gned"
                            in "-dge-" etc., as in above rule
           J      if before "i", or "e", or "y" if not double "gg"
           K      otherwise

   H ---->        SILENT if after vowel and no vowel follows
                         or after "-ch-", "-sh-", "-ph-", "-th-", "-gh-"
           H      otherwise

   J ----> J

   K ---->        SILENT if after "c"
           K      otherwise

   L ----> L

   M ----> M

   N ----> N

   P ----> F      if before "h"
           P      otherwise

   Q ----> K

   R ----> R

   S ----> X      (sh) if before "h" or in "-sio-" or "-sia-"
           S      otherwise

   T ----> X      (sh) if "-tia-" or "-tio-"
           0      (th) if before "h"
                  silent if in "-tch-"
           T      otherwise

   V ----> F

   W ---->        SILENT if not followed by a vowel
           W      if followed by a vowel

   X ----> KS

   Y ---->        SILENT if not followed by a vowel
           Y      if followed by a vowel

   Z ----> S

 **************************************************************/

There is C code for that, I can post that if you want.

Now I immediately see a problem with this - the algorithm is obviously oriented around the English language. If it is used in a spell checker, then languages like French and German (and the rest) will have different rules for what is silent in what situation, and what consonants sound like each other.

I thought maybe a state machine? You are an expert on them are you not Ksilyan? Surely what is described above can be reduced to a fairly simple state machine? Then you could have one state table per language.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #8 on Fri 06 Oct 2006 12:22 AM (UTC)

Message

Yes, you are right about the problem with other languages. I'm somewhat curious how spell checkers handle this problem in general. Perhaps the dictionary files encode their language's notion of word similarity?

Quote:
You are an expert on them are you not Ksilyan? Surely what is described above can be reduced to a fairly simple state machine? Then you could have one state table per language.

Well I'm not sure if I'm an "expert". :-) But yes it should be relatively simple to convert that to a state machine, as long as the rules have no ambiguity. (I'm assuming they don't.) This could probably be done using the lexing tool 'flex'. This sounds like a fun, relatively short project. I'll see if I have time to work on this this weekend.

One problem I see is that some languages, including English for that matter, have different rules for the same arrangements of letters. 'Knuth' is actually a good example. This documentation claims it's pronounced "Nooth" when in fact it's pronounced "Ka-Nooth". This is a common Stanford way to tell if somebody else is in fact familiar with Don Knuth personally. :-) (For reference's sake, my claim comes straight from Knuth himself: http://www-cs-faculty.stanford.edu/~knuth/faq.html)
French is even worse as the rules are not quite regular, even for relatively common words.

I wonder how word processors like Word, Open Office, or Thunderbird's built-in checker handle these things.

But, in any case, yes, if a language could have its sound-alike rules written out, then you could have one state machine per language, which would let you quickly convert a word to its phoneme. Then you could imagine a "reverse-checker" that would iterate over the trie, seeing which words could possibly correspond to that suggestion. Hopefully the rules given in the algorithm go both ways unambiguously. If that is the case then it would be fairly easy to walk the trie recursively and efficiently determine which branches could correspond to the given word.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #9 on Fri 06 Oct 2006 01:38 AM (UTC)

Message

Or maybe it can be done with a series of regexps. My testing indicates it probably can, however it may be slow. Also, regexps, continually applied to the whole string, probably have a different sense than looking forwards by letter.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #10 on Fri 06 Oct 2006 02:13 AM (UTC)

Message

A bit more research reveals this article:

http://www.ddj.com/dept/cpp/184401251?pgno=2

The author of the original metaphone algorithm made another one which is supposed to handle other languages better.

Testing that on "knife" now gives me:


1="knave"
2="knife"
3="naive"
4="nave"
5="navy"
6="nova"
7="novae"

And, looking up Ksilyan gives me:


1="casualness"
2="casualness's"
3="gasoline"
4="gasoline's"
5="gazillion"
6="gazillions"
7="gosling"
8="gosling's"
9="guzzling"

I am down to 9,226 sound-alikes now (from 10,181 before).

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #11 on Fri 06 Oct 2006 09:50 PM (UTC)

Message

Anyone have any thoughts on this problem with spell checking?

The current spell-checker has an option to omit words starting with uppercase, so that this sentence would not raise an error:



We must find Eowiabwyn and slay her!

A game is likely to have fantasy names that are not in dictionaries, and thus omitting the check for such a word makes sense.

But what about this:



Runn away!

I have misspelt "run" but at the start of a sentence it has an uppercase first letter. Thus the spell checker would let it through.

Now you can make the spell checker detect the start of a sentence and check the word anyway, but then you might get this:



Eowiabwyn is standing here.

Now we have a proper name, that is at the start of a sentence. Does the initial uppercase indicate sentence start, or proper name?

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #12 on Fri 06 Oct 2006 10:56 PM (UTC)

Message

That is a very interesting question, honestly I'm not so sure what to do about it.

What's nice about correct-as-you-type, or, at least, highlight-as-you-type, is that you know what words the checker thinks are bad but you don't have to deal with popup windows or anything of the sort. That way, I can type whatever I want, and if something isn't in the dictionary but I know it's correct, I can either add it (if I'm going to use it a lot) or just ignore it.

The other metaphone algorithm you gave looks like it gives better results, by the way. I'm happier with its list of suggestions for 'knife'.

I've been thinking about how to best use the algorithm in combination with tries to make a very efficient spell-checker, but I'm not sure yet. I suspect that the metaphone algorithm can be combined into the recursive walking of the trie. If I had the time, I'd have wanted to do more proper research into how the open source spell-checkers solve this problem...

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by

Linda Sweden (164 posts) Bio

Date

Reply #13 on Fri 06 Oct 2006 11:04 PM (UTC)

Message

I think that most people find actual 'correct as you type' annoying, save for a very limited list of things that always are errors. Like 'teh' for the', for example. I believe Word, for example, allows you to set it so it only autocorrects very common mistakes.

On the other hand, live _marking_ of potentially misspelled words (I think ICQ, for example, uses a red dashed underline for such words) is excellent, since it gets your attention.

Even better is if you then can right-click on the marked word to get a list of suggestions.

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #14 on Fri 06 Oct 2006 11:24 PM (UTC)

Message

Quote:

That is a very interesting question ...

My experience in MUDs is you generally type everything in lower case, so maybe the problem goes away a bit. However the proper names are still an issue, maybe you quickly add them to the dictionary, or ignore-list.

Quote:

On the other hand, live _marking_ of potentially misspelled words (I think ICQ, for example, uses a red dashed underline for such words) is excellent

Well, live marking involves a spell check for every character as you type, which makes the burden of making it fast enough greater. A single check for a paragraph might accept a 0.5 second delay, but not for every character.

As for the marking, I'm not sure the edit box used for MUSHclient commands accepts stuff like red underlines, that potentially is another problem.

However I must admit that I was finding it hard to think of a nice way of spell checking without annoying dialog boxes getting in your way.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.

151,842 views.

This is page 1, subject is 3 pages long: 1 2 3

It is now over 60 days since the last post. This thread is closed. Refresh page

Go to topic: Search the forum

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.