Gammon Forum

TinyMUX 2.7 with UTF-8 and MUSHClient

Posted by

Sparks (7 posts) Bio

Date

Reply #15 on Sat 10 Mar 2007 09:23 PM (UTC)

Message

I would think that if the user unchecks UTF-8, the client should send a Telnet CHARSET request, get the list of available encodings from the server, and reply with either ISO-8859-1 or US-ASCII as appropriate. Likewise, if the UTF-8 box is checked while the connection is active, it should send the request and reply with UTF-8.

The charset/encoding is not negotiated only once at connection time; it can be renegotiated at any time during the connection.

If the user changes fonts during connection, unless you have a way to know it's a Unicode font (though most more recent TrueType fonts are), I think they're SOL. The UTF-8 box state can be conveyed entirely within the bounds of RFC 2066, though. :)

Either way, there /is/ a workaround in the meantime; a MUSHclient user can @set themselves UNICODE manually (though it then applies to all connections), and we can just note in the helpfiles that MUSHclient doesn't support RFC 2066 for negotiation.

Rachel 'Sparks' Blackman

Posted by

Brazil USA (10 posts) Bio

Date

Reply #16 on Sun 11 Mar 2007 03:15 PM (UTC)

Message

Here's a scenario to consider. This includes most of the elements we've been talking about.

Let's say someone @name's themselves with Chinese characters -- it's OK because everyone else connecting there is fluent in Mandarin. :)

They use MUSHclient to connect, and they have checked the UTF-8 box. Also, they are using a font that includes all the right characters.

We can even assume the welcome screen contains nothing but ASCII, but now it's time to login. The server does not have access to the player object because the player has not logged in yet. Likewise, the player cannot login because doing so requires using UTF-8 characters.

The workaround is that anyone with a UTF-8 player name will need to use their dbref, but that makes me sad.

This is going to be a problem for clients that 'auto' logon: everything happens so fast that, even if charset negotiation occurs, it won't be completed by the time the UTF-8 data needs to be sent.

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #17 on Sun 11 Mar 2007 08:53 PM (UTC)

Amended on Sun 11 Mar 2007 08:58 PM (UTC) by Nick Gammon

Message

Quote:

... the player cannot login because doing so requires using UTF-8 characters.

I don't perceive this as a problem. Remember, the UTF-8 problem is different in the two directions - inwards (what the player types), and outwards (what the server sends).

(I am using Unicode in this post, bear with me if your web browser does not render it properly).

Bearing in mind that the standard 7-bit ASCII character set is identical to the Unicode code points 0x0000 to 0x007F, you don't need to know whether I am using Unicode if I send the name "Nick". Either way the bytes will be the same. However if I give my name as 蓬蓬 then you can safely assume I am interested in Unicode output. You can tell it is Unicode text from the fact that the high-order (8th) bit will be set in every byte.
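To make the point about the high-order bit concrete, here is a small Python sketch. The helper name `has_high_bit` is made up for illustration; it is not part of MUSHclient or TinyMUX.

```python
def has_high_bit(data: bytes) -> bool:
    """Return True if any byte in data has its high-order (8th) bit set."""
    return any(b & 0x80 for b in data)


ascii_name = "Nick".encode("utf-8")
chinese_name = "蓬蓬".encode("utf-8")

# Pure ASCII text produces identical bytes whether it is encoded
# as US-ASCII, Latin-1 or UTF-8, so the server cannot (and need not) tell.
same_everywhere = (
    "Nick".encode("ascii") == "Nick".encode("latin-1") == ascii_name
)

print(same_everywhere, has_high_bit(ascii_name), has_high_bit(chinese_name))
```

A name with no high bits set tells the server nothing either way, which is the limitation raised later in the thread.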

This leads us to the initial login screen. A simple solution would be to ask for "Name" / "Password" and train non-English speakers to recognise those words. Another approach would be to do what some sites (eg. Wikipedia) do, and offer different languages on the home page. For example:



Name (Nom / &#22995;&#21517; / Nombre):

Password (Mot de passe / &#23494;&#30908; / Contraseña):

I think most people, on seeing this (and even if the Unicode letters were rendered as some sort of rubbish) would understand what was wanted.

Another approach again (although I think that my previous one was OK), would be to simply query for Unicode support, eg.:



Unicode? Y/N (&#21517;&#20026;Unicode):

Then, follow up with the other questions.

Either way, the auto-logon will still work - you don't need the (complicated and possibly unsupported) character set negotiation, the player simply sends down, through auto-logon, their name and password, whether or not it has UTF-8 characters in it, and the server figures out the rest.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #18 on Sun 11 Mar 2007 09:00 PM (UTC)

Message

I'm not sure how to get the Unicode into the post; for me at least it is rendering as stuff like &#34028; - but anyway you see the idea. Imagine Chinese characters replacing text like that sequence.

- Nick Gammon


Posted by

Brazil USA (10 posts) Bio

Date

Reply #19 on Sun 11 Mar 2007 09:08 PM (UTC)

Message

While 7-bit ASCII is encoded the same as Latin1 and UTF-8, there is a conflict between Latin1 and UTF-8 for the upper 128 characters. While it is possible after looking at a sufficiently long string to resolve between Latin1 and UTF-8, the server does not carry that much state before it needs to respond.
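A quick Python sketch of that conflict, with a hypothetical helper `possible_encodings` that simply tries each decoder in turn:

```python
from typing import List


def possible_encodings(raw: bytes) -> List[str]:
    """List which of ASCII, UTF-8 and Latin-1 can decode raw without error."""
    candidates = []
    for name in ("ascii", "utf-8", "latin-1"):
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        candidates.append(name)
    return candidates


raw = bytes([0xC3, 0xA9])
as_utf8 = raw.decode("utf-8")      # one character
as_latin1 = raw.decode("latin-1")  # two characters

print(possible_encodings(b"Nick"), possible_encodings(raw), possible_encodings(b"\xe9"))
```

The two bytes 0xC3 0xA9 are a valid "é" in UTF-8 and an equally valid "Ã©" in Latin-1, so a short string can be ambiguous. Latin-1 never fails to decode, which is why it cannot be ruled out by inspection alone.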

And the prompt solution has the same complexity as the charset option and involves the same deferring; instead of an IAC packet, it's a Y/N.

In any case, all the issues seem to be laid out, so over the next year or so, how it actually impacts players will become known in the form of questions and support.

Posted by

Sparks (7 posts) Bio

Date

Reply #20 on Sun 11 Mar 2007 09:13 PM (UTC)

Amended on Sun 11 Mar 2007 09:19 PM (UTC) by Sparks

Message

True, but RFC 2066 is an existing standard, and not difficult to implement. It's possible to do all kinds of guesswork to try and detect the system the player wants, but as Brazil points out that's not a guarantee (if they send a user-name with an accented Latin1 character, for instance, or if they send a name that's pure 7-bit ASCII when they still want Unicode support)... and arguably that's a broken workaround method, since it requires detection at the login screen and then ALSO potentially the player to set a flag for override (in case the login screen detection isn't appropriate), etc. Asking at the login screen also breaks existing autologin systems.

What's obviously called for is a simple standard to negotiate charsets with less guesswork involved; such a standard already exists, in the form of RFC 2066. And RFC2066 really /is/ simple as long as you don't support [TTABLE], and frankly, I've never seen anything that does.

Just reply with confirmation you support it when the server offers it (DO/WILL exchange), and upon receiving the list of IANA charset names in order of server preference (for instance, IAC SB CHARSET REQUEST 0 "UTF-8" 0 "ISO-8859-1" 0 "US-ASCII" 0 IAC SE), you reply with one of them (IAC SB CHARSET ACCEPTED "UTF-8" IAC SE) or a rejection message (IAC SB CHARSET REJECTED IAC SE) if you support none.

The logic would simply be 'if the UTF-8 box is checked and the server offers it, reply with that. Otherwise, pick ISO-8859-1 or US-ASCII, whichever is available. If neither is available, send a rejection message.' Since RFC 2066 charsets must be named with their IANA charset name, using the preferred mime-type name, it's easy to detect.
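That selection rule could be sketched in Python roughly as follows. The function name `choose_charset` and its parameters are hypothetical; this is only an illustration of the logic described above, not MUSHclient code.

```python
from typing import Optional, Sequence


def choose_charset(offered: Sequence[str], utf8_enabled: bool) -> Optional[str]:
    """Pick a charset from the server's offer, or None to send a rejection."""
    # IANA charset names make no distinction between upper and lower case.
    by_upper = {name.upper(): name for name in offered}

    # UTF-8 only if the user ticked the box and the server offers it.
    if utf8_enabled and "UTF-8" in by_upper:
        return by_upper["UTF-8"]

    # Otherwise fall back to Latin-1 or plain ASCII, whichever is on offer.
    for fallback in ("ISO-8859-1", "US-ASCII"):
        if fallback in by_upper:
            return by_upper[fallback]

    return None
```

For example, `choose_charset(["UTF-8", "ISO-8859-1", "US-ASCII"], utf8_enabled=False)` picks ISO-8859-1, and an offer of only UTF-8 with the box unchecked yields None, meaning a rejection should be sent.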

And really, MUSHclient only needs to know 'UTF-8' and 'ISO-8859-1' (Latin1) and 'US-ASCII' out of all the possible IANA stuff, unless you add support for the various non-Unicode legacy CJK encodings. So that makes it simple.

And it seems to me that if a player has gone and manually checked the UTF-8 checkbox in MUSHclient, they're potentially interested in seeing Unicode; if they don't have it checked, they probably aren't (or at least, couldn't see it anyway). :)

Rachel 'Sparks' Blackman

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #21 on Sun 11 Mar 2007 09:24 PM (UTC)

Amended on Sun 11 Mar 2007 09:25 PM (UTC) by Nick Gammon

Message

I'll take a look at RFC 2066 for you, but you still have the problem of people using older versions (almost everyone), and other clients.

I wouldn't dismiss my other idea out of hand. Look at the bit patterns for well-formed UTF-8:


bytes | bits | representation
    1 |    7 | 0vvvvvvv
    2 |   11 | 110vvvvv 10vvvvvv
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv
    4 |   21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv

Are you really going to support Latin1 with the high-order bit set, as well as Unicode? Seems to me this is extra effort, when Unicode embraces those characters, plus more.

Even if you are, and assuming players have (say) 3+ character names, it would be an unusual Latin1 name that consisted entirely of characters in the range 0x80 to 0xFF. Even given that, if they are using Unicode, then a 3-character non-ASCII name will be at least 6 UTF-8 bytes, in this sort of pattern:



110vvvvv 10vvvvvv 110vvvvv 10vvvvvv 110vvvvv 10vvvvvv

A quick test would indicate it seems to be Unicode, and of course you can see if that name is on the database.

The worst case scenario, and I doubt this would happen, would be for 2 players to happen to have the same names *and* the same passwords, one player using Latin1 characters, with all the high order bits set, and which has the bit patterns in the first 2/3 bits to make it look like Unicode, and the other one to have the identical byte pattern, but in UTF-8.

I really, really doubt this would occur, and if it does, you simply disallow such names.
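For what it's worth, the "quick test" could look something like this Python sketch. The name `looks_like_utf8` is invented here, and the check only follows the lead-byte / continuation-byte patterns in the table above; it does not reject overlong forms or surrogates as a strict validator would.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data contains at least one multi-byte sequence and every
    byte fits the UTF-8 lead/continuation bit patterns."""
    i = 0
    saw_multibyte = False
    while i < len(data):
        lead = data[i]
        if (lead & 0x80) == 0x00:      # 0vvvvvvv
            need = 0
        elif (lead & 0xE0) == 0xC0:    # 110vvvvv
            need = 1
        elif (lead & 0xF0) == 0xE0:    # 1110vvvv
            need = 2
        elif (lead & 0xF8) == 0xF0:    # 11110vvv
            need = 3
        else:                          # stray continuation byte or invalid lead
            return False
        tail = data[i + 1 : i + 1 + need]
        # Every continuation byte must match 10vvvvvv.
        if len(tail) != need or any((t & 0xC0) != 0x80 for t in tail):
            return False
        saw_multibyte = saw_multibyte or need > 0
        i += 1 + need
    return saw_multibyte
```

A Latin1 name such as "José" fails the test because 0xE9 looks like a 3-byte lead with nothing after it, while the rare Latin1 string "Ã©" passes, which is the collision case described above.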

- Nick Gammon


Posted by

Sparks (7 posts) Bio

Date

Reply #22 on Sun 11 Mar 2007 09:33 PM (UTC)

Message

MUX 2.7 must continue to support Latin1 particularly /for/ legacy clients.

Internally, it's UTF-8 now, and if you have a UTF-8 client it sends things as UTF-8. If you have a Latin1 client, it sends things as Latin1 (converting all accented characters from UTF-8 accordingly, and turning anything with no Latin1 equivalent into a '?'). If you have a client that doesn't support anything, it sends ASCII (downgrading everything to 7-bit, including turning all accented characters into plain unaccented).
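As an illustration only (this is not the MUX 2.7 source, and the function names are made up), the two downgrade paths behave roughly like this Python sketch. It assumes accent stripping can be approximated by Unicode decomposition followed by dropping combining marks.

```python
import unicodedata


def to_latin1(text: str) -> bytes:
    """Encode for a Latin1 client; anything without a Latin1 equivalent becomes '?'."""
    return text.encode("latin-1", errors="replace")


def to_ascii(text: str) -> bytes:
    """Encode for a 7-bit client: strip accents, then replace what is left with '?'."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks that NFKD splits off from accented letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", errors="replace")


print(to_latin1("café 蓬"), to_ascii("café 蓬"))
```

The internal text stays UTF-8 throughout; only the bytes written to a given connection change.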

Your login screen idea is a decent one, but assumes a somewhat simplified MU*ing world. Here's another example; say someone logs into a UTF-8 enabled game with a client set to speak S-JIS. They type something, and the server sees multi-byte patterns so assumes it's UTF-8... now neither side can talk to the other. RFC2066 prevents that by explicitly naming the charset, rather than doing guesswork. The guesswork still has a place especially during the transitional phase, which is why we have the override flags and why it'll even try to read the LC_CTYPE locale if the player has that exporting over the telnet NEW-ENVIRON option.

I don't dispute that this is a bit of a weird situation in terms of legacy stuff, but it's a necessary step in moving forward. Particularly with some of the newer servers considering it (I know of at least two MUD servers that are looking at doing their charset stuff using RFC2066 instead of just assuming the player's encoding), it's been a chicken-and-egg problem without client support.

The more clients that can at least use RFC2066 to say 'yeah, I'm using that' or 'er, no, I'm using a different multibyte encoding,' the less confusion there'll be. Right now, there are MU*s out there which just blindly assume S-JIS encoding, or one of the EUC encodings for East Asian languages, or assume Cyrillic, or whatever. RFC2066 addresses that inasmuch as even if you connect with a UTF-8 client to a game that only speaks S-JIS, both client and server can at least come to the agreement and realization (before displaying random garbage to the user) that they aren't going to be able to exchange text, since neither understands the encoding the other wants.

Saying you'll look into adding RFC2066 does a great deal to help move that forward, so, thank you. :)

Rachel 'Sparks' Blackman

Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #23 on Sun 11 Mar 2007 10:45 PM (UTC)

Amended on Sun 11 Mar 2007 11:59 PM (UTC) by Nick Gammon

Message

I'm muddling along trying to get the negotiation to work, let me confirm if this is what you expect:


Server sends:  IAC DO CHARSET
Client sends:  IAC WILL CHARSET
Server sends:  IAC SB CHARSET REQUEST DELIM NAME IAC SE
Client sends:  IAC SB CHARSET ACCEPTED NAME IAC SE
or
Client sends:  IAC SB CHARSET REJECTED IAC SE

Where the following definitions apply:


IAC:     0xFF
DO:      0xFD
CHARSET: 0x2A
WILL:    0xFB
SB:      0xFA
REQUEST: 0x01
ACCEPTED:0x02
REJECTED:0x03
SE:      0xF0
DELIM:   some character that does not appear in the charset name, other than IAC, eg. comma, space
NAME:    the character string "UTF-8" (or some other name like "S-JIS")

The server can request multiple character sets like this:


Server sends:  IAC SB CHARSET REQUEST DELIM NAME1 DELIM NAME2 IAC SE

The "rejected" message means none of the character sets can be used. In practice, MUSHclient would only accept the character set requested if it was the current one in use in the output window, and in the case of UTF-8, if the UTF-8 checkbox was enabled.
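Here is a minimal Python sketch of the client side of that exchange, using the byte values listed above. The function names are hypothetical, it assumes the caller has already stripped the surrounding IAC SB and IAC SE, and it deliberately ignores the optional [TTABLE] form of the request.

```python
from typing import List, Optional

IAC, SB, SE = 0xFF, 0xFA, 0xF0
CHARSET = 0x2A
REQUEST, ACCEPTED, REJECTED = 0x01, 0x02, 0x03


def parse_charset_request(payload: bytes) -> List[str]:
    """Parse the bytes between IAC SB and IAC SE of a CHARSET REQUEST.

    The byte right after REQUEST is the delimiter; the names follow,
    each preceded by that delimiter.
    """
    if len(payload) < 3 or payload[0] != CHARSET or payload[1] != REQUEST:
        raise ValueError("not a CHARSET REQUEST")
    delim = payload[2:3]
    # Empty pieces (for example from a trailing delimiter) are skipped.
    return [name.decode("ascii") for name in payload[3:].split(delim) if name]


def build_charset_reply(chosen: Optional[str]) -> bytes:
    """Build the full ACCEPTED or REJECTED subnegotiation to send back."""
    if chosen is None:
        body = bytes([CHARSET, REJECTED])
    else:
        body = bytes([CHARSET, ACCEPTED]) + chosen.encode("ascii")
    return bytes([IAC, SB]) + body + bytes([IAC, SE])
```

A client would parse the offer, decide which name (if any) matches its current settings, and pass that name or None to `build_charset_reply`.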

- Nick Gammon


Posted by Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date Reply #24 on Sun 11 Mar 2007 11:28 PM (UTC)

Message

I have hit a stumbling block now. Perhaps someone can help.

The problem is to identify, given a font (like Lucida Sans Unicode), what the *name* of the character set is. I gather from the preceding discussion we want a character string like "UTF-8".

As far as I can see, the Windows function (I am using GetLogFont, but am open to suggestions), returns a number, defined along these lines:


#define ANSI_CHARSET            0
#define DEFAULT_CHARSET         1
#define SYMBOL_CHARSET          2
#define SHIFTJIS_CHARSET        128
#define HANGEUL_CHARSET         129
#define HANGUL_CHARSET          129
#define GB2312_CHARSET          134
#define CHINESEBIG5_CHARSET     136
#define OEM_CHARSET             255
#if(WINVER >= 0x0400)
#define JOHAB_CHARSET           130
#define HEBREW_CHARSET          177
#define ARABIC_CHARSET          178
#define GREEK_CHARSET           161
#define TURKISH_CHARSET         162
#define VIETNAMESE_CHARSET      163
#define THAI_CHARSET            222
#define EASTEUROPE_CHARSET      238
#define RUSSIAN_CHARSET         204

#define MAC_CHARSET             77
#define BALTIC_CHARSET          186
#endif /* WINVER >= 0x0400 */

For a start, UTF-8 isn't in the list, and in any case, I'm not sure if you are going to request (say) "SHIFTJIS_CHARSET". Your example was "UTF-8", "ISO-8859-1" , "US-ASCII" - how does the above list get translated to those?

Quote:

Since RFC 2066 charsets must be named with their IANA charset name, using the preferred mime-type name, it's easy to detect.

Er, yes. But programmatically, I'm not sure, given a selected font, how to make that translation.
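One possible approach, sketched here in Python purely as a lookup table, is to map the font charset number to the Windows code page it implies and then to the closest IANA name. The table and the function `iana_name_for` are assumptions for illustration: the Windows code pages are vendor variants that only approximately match the IANA-named encodings, and nothing in the list corresponds to UTF-8.

```python
from typing import Optional

# Font charset constant -> nearest IANA charset name (approximate; see caveats above).
WINDOWS_CHARSET_TO_IANA = {
    0: "windows-1252",    # ANSI_CHARSET
    128: "Shift_JIS",     # SHIFTJIS_CHARSET
    134: "GB2312",        # GB2312_CHARSET
    136: "Big5",          # CHINESEBIG5_CHARSET
    161: "windows-1253",  # GREEK_CHARSET
    162: "windows-1254",  # TURKISH_CHARSET
    163: "windows-1258",  # VIETNAMESE_CHARSET
    177: "windows-1255",  # HEBREW_CHARSET
    178: "windows-1256",  # ARABIC_CHARSET
    186: "windows-1257",  # BALTIC_CHARSET
    204: "windows-1251",  # RUSSIAN_CHARSET
    222: "windows-874",   # THAI_CHARSET
    238: "windows-1250",  # EASTEUROPE_CHARSET
}


def iana_name_for(lf_charset: int) -> Optional[str]:
    """Return the nearest IANA name for a font charset value, or None if unknown."""
    return WINDOWS_CHARSET_TO_IANA.get(lf_charset)
```

Values such as DEFAULT_CHARSET, SYMBOL_CHARSET and OEM_CHARSET have no fixed encoding, so they are left out and come back as None.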

- Nick Gammon


Posted by

Sparks (7 posts) Bio

Date

Reply #25 on Sun 11 Mar 2007 11:54 PM (UTC)

Message

ANSI_CHARSET is the WCHAR goo that Windows uses internally. DEFAULT_CHARSET is confusingly named, but I believe it will more or less map to 'Microsoft's slightly mutant Unicode variant' in this case, I'm given to understand.

I'll add a caveat here that in Mac OS X (where I write a MU* client), working with Unicode is vastly different. And that in Windows (where I write code for my day job on an instant messaging package called Trillian), Trillian's natively UTF-8 internally and we cheat and just convert to/from WCHAR whenever we need to display stuff or take from an input box, using MultiByteToWideChar and WideCharToMultiByte with CP_ACP and CP_UTF8.

So over in my day job, we don't detect font support; we just do UTF8 internally for everything, turn everything into Windows' native WCHAR string format for display (or convert from WCHAR when getting input) and let Windows sort out what to use. Which it does admirably, I'll grant; if the user-selected font doesn't support the extended characters, it finds a fallback for us which does, just for those specific characters.

This means we don't need to worry about font support, just about converting the strings; as such, I'm not entirely certain about all of the font property stuff since I try to avoid it in my Windows life. But a quick perusal of the Scary Windows Unicode Reference seems to imply DEFAULT_CHARSET's what you want here. ;)

Your RFC2066 understanding is accurate; you're also welcome to try connecting to 'mux.riverdark.net 2860' which is a testbed MUX2.7 server with the RFC2066 extension enabled. Upon connect, you should get the various negotiations right then and there.

Rachel 'Sparks' Blackman

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #26 on Sun 11 Mar 2007 11:59 PM (UTC)

Message

My testing seems to reveal that it returns ANSI_CHARSET (0) for both Lucida Sans Unicode, which supports Unicode, and FixedSys, which doesn't.

So, that isn't particularly helpful.

I don't know what to do next, I could always "support" UTF-8 if the UTF-8 box is checked, regardless of font chosen. I am open to better suggestions.

- Nick Gammon


Posted by

Sparks (7 posts) Bio

Date

Reply #27 on Mon 12 Mar 2007 12:01 AM (UTC)

Message

I think honestly supporting it if the checkbox is checked is sufficient information; you're returning that the client can effectively display it, which is true. You can trust that if a user has checked the UTF-8 box, they've done so deliberately and are going to set an appropriate font.

I mean, a user could set the font to Wingdings, and even in ASCII or Latin1, what they get from the game won't make sense. So you can't account for every font choice and encoding interaction, I'd think. :)

Rachel 'Sparks' Blackman

Posted by

Nick Gammon Australia (23,173 posts) Bio Forum Administrator

Date

Reply #28 on Mon 12 Mar 2007 12:13 AM (UTC)

Message

I would like to be able to go slightly further, if possible, having gone to all the trouble of doing charset negotiation.

It seems to only partly solve your problem, if a player checks the UTF-8 box, but doesn't change the font, which is exactly what started this thread in the first place.

For the time being, however, are you suggesting you would want:

'UTF-8' - supported if the UTF-8 box is checked; and
'ISO-8859-1' - if not?
'US-ASCII' - or this instead?

These options do not cover the use of the S-JIS code set, which you mentioned earlier.

- Nick Gammon


Posted by

Sparks (7 posts) Bio

Date

Reply #29 on Mon 12 Mar 2007 01:07 AM (UTC)

Amended on Mon 12 Mar 2007 01:11 AM (UTC) by Sparks

Message

True, but if you want to support S-JIS et al (which would be useful for purposes of MUDs, even though MUX2.7 is just going the UTF-8 route to take the easier all-inclusive route), you'll need more than just the font.

For instance, MS Mincho can be used to render Japanese text regardless of what encoding it was encoded in. Just passing raw S-JIS data to it will generally not avail you much. If you choose to go down that path, you'll need to play with recording the encoding and using the MultiByteToWideChar and WideCharToMultiByte routines to convert between various encodings. Then you can ensure that whatever you get from the game, it's in a known encoding (namely, CP_ACP, which would be ideal for your circumstances if you want to support everything) and can be rendered the same way.
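The idea of recording the encoding and converting through one common form can be shown with a short Python sketch. Python's codecs stand in here for the MultiByteToWideChar / WideCharToMultiByte pair, and the helper name `transcode` is made up.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from a known source encoding and re-encode them as target."""
    # Decoding yields Python's internal Unicode string, the analogue of a WCHAR buffer.
    return data.decode(source).encode(target)


sjis_bytes = "日本語".encode("shift_jis")   # what an S-JIS game would send
utf8_bytes = transcode(sjis_bytes, "shift_jis")

# Interpreting the same bytes with the wrong encoding gives different text.
misread = sjis_bytes.decode("latin-1")

print(utf8_bytes.decode("utf-8"), misread != "日本語")
```

The font only matters at the final display step; the conversion has to know the source encoding first, which is exactly what the negotiation provides.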

I agree that's the ideal path -- over on Mac OS X, in Atlantis, I have an encoding value for the remote site. If the user picks one, I use that for both incoming/outgoing. If the server uses RFC2066 to negotiate a charset I can speak, I switch to that for both outgoing/incoming. (Since the server knows better than I do what it wants to talk, after all!) If the server doesn't use RFC2066 and the user doesn't pick a specific encoding, I default to assuming that incoming is Latin1 and outgoing is ASCII (since that's how some games work).
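As a rough Python sketch of that decision (my reading of the description above, not Atlantis source code; the names are hypothetical, and it assumes a successfully negotiated charset takes priority over the user's pick):

```python
from typing import NamedTuple, Optional


class Encodings(NamedTuple):
    incoming: str
    outgoing: str


def pick_encodings(user_choice: Optional[str], negotiated: Optional[str]) -> Encodings:
    """Decide which encodings to use for a connection."""
    # Assumption: a charset agreed via RFC 2066 wins, since the server
    # knows best what it wants to talk.
    if negotiated is not None:
        return Encodings(negotiated, negotiated)
    # Next, an explicit user setting applies in both directions.
    if user_choice is not None:
        return Encodings(user_choice, user_choice)
    # Default: read Latin1, write plain ASCII.
    return Encodings("ISO-8859-1", "US-ASCII")
```

With neither a negotiated charset nor a user setting, `pick_encodings(None, None)` gives Latin1 for incoming text and ASCII for outgoing text.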

If you want to go that route, I certainly won't discourage you! I'm all for other clients more fully supporting RFC2066! However, that's a much larger amount of work, to support many different encodings.

In the meantime, just conveying whether or not you're handling that 8th bit is at least 'sufficient' for advanced users; you already have an 'enable/disable' UTF-8 option, and you're just conveying whether or not that's enabled, in this case. (How you pick between Latin1 and ASCII is up to you, really. I'm just giving them as an example, since those are the three options MUX2.7 offers.)

This is the logic most MU* clients that support RFC2066 seem to use presently; they have a user-defined encoding, and if the server offers encodings they only return an accept if one of those encodings matches their own. Otherwise, it's a reject. I've discovered that Atlantis is a rarity inasmuch as it actually will change active encoding for the world based on what the server asks for. (I honestly didn't realize how everyone else had done it when I wrote that, or I might've saved myself a great deal of work, but...)

Rachel 'Sparks' Blackman

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.