Gammon Forum : MUSHclient : Python : Pickling

Pickling

It is now over 60 days since the last post. This thread is closed.

Posted by Ked Russia (524 posts)

Date Reply #15 on Sun 17 Apr 2005 06:16 AM (UTC)

Message

Was away from my computer that has both Mushclient and Python installed for a couple of days, but I'll look at what's causing this now.

Posted by Ked Russia (524 posts)

Date Reply #16 on Sun 17 Apr 2005 09:59 AM (UTC)

Message

Can you post the exact code that's demonstrating that behaviour? Because I've tried dumping/loading some stuff and it works without a problem, so it's probably something very specific.

Posted by Poromenos Greece (1,037 posts)

Date Reply #17 on Sun 17 Apr 2005 11:50 AM (UTC)

Message

Yes, it works fine if you don't save the world, but if you save and reload from the xml file, it messes up. Here it is:


import pickle, string, textwrap
from codecs import getencoder

dicAreas = {}

dicAreas[u"Test"] = [u"Test", u"Test", u"Test", u"", [u"Shyte", u"Small"], u"http://www.poromenos.org"]
world.SetVariable("varAreas", pickle.dumps(dicAreas))

- SAVE AND CLOSE WORLD -

import pickle
from codecs import getencoder

encUnicode = getencoder('utf8')
dicAreas = pickle.loads(encUnicode(world.GetVariable("varAreas"))[0])
world.note(repr(dicAreas))

Vidi, Vici, Veni.
http://porocrom.poromenos.org/ Read it!

Posted by Ked Russia (524 posts)

Date Reply #18 on Sun 17 Apr 2005 12:14 PM (UTC)

Message

Ok, I see it. This is likely Mushclient fixing up the newlines so that they fit the Windows way (CRLF). Here's an excerpt from the help page on Base64Decode:

Quote:
Note that due to the way strings are represented internally, it is not possible for the decoded string to contain the NULL character (hex 0x00) and be returned correctly.

The 0x00 is exactly what \r stands for in Python. So the solution looks to be stripping the NULL characters from the string stored in a Mushclient var before loads'ing it. This is fairly simple to do:


dicAreas = pickle.loads(encUnicode(world.GetVariable("varAreas").replace("\r",""))[0])

Or you can move the replace() method to the result of decoding, rather than the initial Unicode string returned by world.GetVariable, though it shouldn't make much difference. And once again, you are better off with using proxy functions unless you want to type in all that crap by hand every time.
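
The proxy functions mentioned above could look roughly like the following. This is only a sketch, written in modern Python 3 syntax rather than the Python 2 of the thread. The names `save_pickled`, `load_pickled` and `FakeWorld` are hypothetical, and `FakeWorld` merely imitates the reported LF to CR LF conversion so the idea can be tried outside MUSHclient. Protocol 0 is requested explicitly so that the pickle is plain text.

```python
import pickle


def save_pickled(world, name, obj):
    """Pickle obj as text (protocol 0) and store it in a client variable."""
    text = pickle.dumps(obj, protocol=0).decode("latin-1")
    world.SetVariable(name, text)


def load_pickled(world, name):
    """Fetch a variable, drop the carriage returns added on reload, unpickle."""
    text = world.GetVariable(name).replace("\r", "")
    return pickle.loads(text.encode("latin-1"))


class FakeWorld:
    """Stand-in for the MUSHclient world object (an assumption, not the real API).

    It imitates the behaviour reported in this thread by turning LF into CR LF.
    """

    def __init__(self):
        self._vars = {}

    def SetVariable(self, name, value):
        self._vars[name] = value.replace("\n", "\r\n")

    def GetVariable(self, name):
        return self._vars[name]


world = FakeWorld()
areas = {"Test": ["Test", "", ["Shyte", "Small"], "http://www.poromenos.org"]}
save_pickled(world, "varAreas", areas)
restored = load_pickled(world, "varAreas")
```

Inside a real script the same two functions would presumably be called with the actual `world` object, so the `replace()` call only has to be written once.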

Posted by Poromenos Greece (1,037 posts)

Date Reply #19 on Sun 17 Apr 2005 12:19 PM (UTC)

Message

Ah, thanks. I don't understand though, why would the variable contain \0s? I thought dumps produced a string, isn't that supposed to not have \0? I'll go with your solution, but what if I want to store newlines? Won't they break when I loads?
And I just use it once in OnPluginInstall, so a proxy function is not really needed.

Vidi, Vici, Veni.
http://porocrom.poromenos.org/ Read it!

Posted by Ked Russia (524 posts)

Date Reply #20 on Sun 17 Apr 2005 01:44 PM (UTC)

Amended on Sun 17 Apr 2005 01:50 PM (UTC) by Ked

Message

Dumps doesn't put them in there, Mushclient does. And what it does is actually 0x0d not 0x00, as I originally thought. When Python dumps() the dictionary, it delimits lines with 0x0a (\n). However, either when saving the variable, or loading it from XML, Mushclient seems to convert the line endings to 0x0d0x0a. I think there was some function somewhere in Python's standard lib that converted between the '\r\n' newlines and the '\n' ones, but I can't recall where I saw it.

And if you want to store newlines then you can always just store '\n' and rely on Mushclient to convert them for you. Using the '\n' by itself always worked for me in Mushclient, for example:

world.Tell("I want it to be a\n world.Note")

or:

world.Send("I am too lazy to do\nmany\nworld.Send's\nso\nI\nuse\nalot\n.")

Besides, pickling always stores newlines inside pickled strings by escaping them: '\n' -> '\\n' and then unescapes them when loading. So you can store whatever you want and it should come out as expected, this issue only has to do with the formatting of the pickle string, not what is actually pickled inside it.
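
The escaping described above can be checked with a small experiment. This is a sketch in Python 3 (the thread used Python 2, where the exact escape sequence differs), and it only looks at the text-based protocol 0.

```python
import pickle

original = "first line\nsecond line"
blob = pickle.dumps(original, protocol=0)  # text-based protocol

# The newline inside the string is stored in escaped form, so the only raw
# 0x0a bytes left in the blob are the pickle format's own line delimiters.
payload_has_raw_newline = b"first line\nsecond line" in blob

restored = pickle.loads(blob)
```

Here `payload_has_raw_newline` comes out false while `restored` equals `original`, which matches the point that the line-ending issue concerns the formatting of the pickle string and not the data pickled inside it.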

Posted by Poromenos Greece (1,037 posts)

Date Reply #21 on Sun 17 Apr 2005 01:54 PM (UTC)

Message

Ah, I see. Well, it's good to know that everything works, but I must have misunderstood something, because my question remains: Shouldn't dumps not include \0s in the strings (for MC to misconvert)? If I understood you correctly, either dumps includes 0x00 in the string, which MC saves as 0x0d, or MC converts \n to 0x0d0x0a (the windows default) and then does not remove the 0x0d. Is that it?

Vidi, Vici, Veni.
http://porocrom.poromenos.org/ Read it!

Posted by Ked Russia (524 posts)

Date Reply #22 on Sun 17 Apr 2005 02:06 PM (UTC)

Message

The answer is: MC converts 0x0a to 0x0d0x0a. Sorry for the confusion. I am not sure what happens when you try to store 0x00, but it looks like... nothing happens - you get the 0x00 back when you loads, even after closing and reopening the world. Same should happen with \r and \n in a pickled string, they are just escaped by the pickler in its picklerish way, and then unescaped when loading to produce the initial string.

Posted by Poromenos Greece (1,037 posts)

Date Reply #23 on Sun 17 Apr 2005 02:09 PM (UTC)

Message

Right, right... Well, that's understandable, since it runs on windows... But its inability to remove the \r is a MC bug, isn't it?

Vidi, Vici, Veni.
http://porocrom.poromenos.org/ Read it!

Posted by Ked Russia (524 posts)

Date Reply #24 on Sun 17 Apr 2005 04:18 PM (UTC)

Message

Well, I guess you could say that Mushclient doesn't preserve strings the way they were saved with it. But why would it want to battle against the OS it's running on? So you could equally say that certain parts of Python's standard library (namely the pickler) don't obey the rules of the OS where they are run. Then again - if pickler bends the Windows way when running on Windows, then so much for portability. So since pickler insists on being independent and doing things its own way (essentially the Unix way), it's up to its user to watch out for rough edges like newlines of different formats. Thus, I don't see anyone's fault here.

Posted by Poromenos Greece (1,037 posts)

Date Reply #25 on Sun 17 Apr 2005 04:20 PM (UTC)

Message

Ah, but the pickler saves extensions as \r\n in windows and \n in linux, doesn't it? Anyway, I'm not saying MC should save them as \n, but it should give me back exactly what I saved... If it adds a \r, it should remove it when it loads.

Vidi, Vici, Veni.
http://porocrom.poromenos.org/ Read it!

Posted by Worstje Netherlands (899 posts)

Date Reply #26 on Mon 30 Apr 2007 01:31 PM (UTC)

Message

I ran into this problem with MUSH not giving back precisely what it stores too. Nick, is there a chance you could have a look into this and see if it can be fixed up? The use of replace() like Ked suggested seems more of a workaround for a bug in MUSH than a work around for a bug in Python, and I can imagine MUSHclient undoing what it did being quite a bit faster than a Python replace function could for really huge pickles.

Posted by Shadowfyr USA (1,791 posts)

Date Reply #27 on Mon 30 Apr 2007 07:19 PM (UTC)

Message

Umm. Saving it exactly as it was given would be a lot easier than "undoing what it did". Mind you, even the former would require changing the code to **not** use whatever OS-dependent function is mangling things, but how would MUSH know that the original format "was" different than what it saved?

Frankly, I have always found the whole line ending stuff BS anyway. I mean, **technically** a CR is, on the old teletypes, supposed to, "return the print position back to the start of the line ***without*** moving to a new line." New line is supposed to, "move the print position to a new line, ***without*** changing the current column position." Ironically, this means that Windows is the one doing it right, not Linux. But, because Linux does it wrong, and that is the widest used standard, it also does something even stupider and treats CR and LF as the "same" thing when found, thus double spacing everything in some applications that are not smart enough to strip out the redundant information. It's all around a complete fracking mess. And the Linux/Unix convention was probably a result of some BS like, "Saving one byte of space per line, so documents don't take up as much when stored.", or something similarly absurd today. That's only a guess though.

Posted by David Haley USA (3,881 posts)

Date Reply #28 on Mon 30 Apr 2007 08:02 PM (UTC)

Message

Quote:
And the Linux/Unix convention was probably a result of some BS like, "Saving one byte of space per line, so documents don't take up as much when stored."

I would say it's because the only reason you would go to a new line without going to the beginning of the line is when you had physical limitations of the machine you were on. I see basically no reason to keep the notion of "new line, same column". I actually think that what Linux does makes lots of sense. I find it somewhat amusing that you say Windows acting like an old-fashioned type-writer is the "right" thing. :-) I also note that you are self-contradictory, in that you think saving one byte per line is absurd today, yet you think emulating ancient typewriters is the way to go.

Quote:
it also does something even stupider and treats CR and LF as the "same" thing when found, thus double spacing everything in some applications that are not smart enough to strip out the redundant information.

This doesn't make any sense. Why would a Windows application choose to add lines for \r characters? It can just always add lines for the \n and be done with it. I'm not sure what redundant information you're referring to.

You run into problems with Mac, which for some reason decided to go with yet another line convention and uses \r for everything.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by Nick Gammon Australia (23,165 posts) Forum Administrator

Date Reply #29 on Mon 30 Apr 2007 09:32 PM (UTC)

Amended on Mon 30 Apr 2007 09:35 PM (UTC) by Nick Gammon

Message

Quote:

I ran into this problem with MUSH not giving back precisely what it stores too. Nick, is there a chance you could have a look into this and see if it can be fixed up?

I am a bit lost about what "pickling" is exactly, I thought you did that to vegetables, like onions.

Anyway, I am assuming from the general gist here that you are using some function to dump something into binary format, which is saved in a variable, which is in a plugins save file or a world file, and then that variable either won't load in correctly next time, or has some characters omitted or replaced. Is that it?

Maybe it would help to construct a very small test case so we can agree on the exact problem. For example:


 SetVariable ("test", "test\n")

A quick test seems to show that any combination of \n, \r, and \r\n save correctly - that is, you can examine the saved variable and it is saved as written. However, attempting to save \0 does not work because MUSHclient uses 0x00 as a string terminator in a number of places, hence the warning under functions like the Base 64 decode.

It has been a standard practice when using C libraries for a long time to use 0x00 as a string terminator (rightly or wrongly), and when I started writing MUSHclient I used that convention. Various functions (like strlen) which are used extensively, detect string lengths by scanning a string for that terminator.
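
The effect of that convention can be seen from Python with the standard `ctypes` module. This is only an illustration of NUL-terminated string handling in general, not of MUSHclient's actual code; the variable names are made up for the example.

```python
import ctypes

data = b"abc\x00def"

# A buffer large enough for all 7 bytes plus a trailing terminator.
buf = ctypes.create_string_buffer(data)

whole_buffer = buf.raw      # every byte that was stored
c_string_view = buf.value   # what NUL-terminated string handling sees
```

All the bytes are still in `whole_buffer`, but `c_string_view` stops at the first 0x00, which is the same thing a function like strlen would report.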

Lua uses a different method, which is to store the length separately, which is why the Lua versions of those functions can handle the 0x00 byte, however only for storage in Lua variables. Once they are placed into MUSHclient variables the same problem is likely to rear its head.

However back to the pickled onions problem, things change when *loading* the saved variables. You can look at the source for the XML parser in MUSHclient (xmlparse.cpp), to see a couple of things it does to the string data whilst loading it:


  // convert tabs to spaces, we don't want tabs in our data
  m_strxmlBuffer.Replace ('\t', ' ');  // line 216

And a bit further on (line 675 onwards):


   // copy if not nested, and not inside an element definition
   //  -- omit carriage returns
   if (iDepth == 0 && 
      !bInside && 
      *pi != '\r')
     {
     // make linefeeds into carriage return/linefeed
     if (*pi == '\n')
       *po++ = '\r';
     *po++ = *pi;    // copy if not inside an element
     }

What I read from this is that you will find:

Tabs (0x09) will be converted to spaces
Carriage returns (0x0D) will be omitted
Line feeds (0x0A) will be loaded as 0x0D 0x0A
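
For anyone wanting to predict what a stored value will look like after a reload, the three rules above can be imitated in a few lines of Python. This is a simplified sketch: `simulate_xml_load` is a hypothetical helper, and it ignores the nesting and element checks that the real parser performs.

```python
def simulate_xml_load(text):
    """Apply the three transformations listed above to a saved value."""
    text = text.replace("\t", " ")      # tabs become spaces
    text = text.replace("\r", "")       # carriage returns are dropped
    return text.replace("\n", "\r\n")   # each line feed becomes CR LF


saved = "col1\tcol2\r\nline2\nline3"
loaded = simulate_xml_load(saved)
```

With this input, `loaded` is `"col1 col2\r\nline2\r\nline3"`: the tab is gone and both line breaks end up as CR LF regardless of how they were saved.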

I don't remember the reason to convert tabs to spaces, there must have been one. Perhaps it was because people were using external editors to create plugins, and put tabs inside the XML, which were then not recognised later on in the XML parser. As for the carriage-returns/linefeeds it is basically doing what it does to incoming MUD data - try to normalise the various line endings into standard Windows CR/LF form. For example, if you edited a multi-line trigger on a Unix platform, and only had linefeeds between lines, it would not display correctly on Windows.

What I recommend you do is convert "binary" strings yourself before saving them (eg. using Base64Encode) and then converting them back on loading (eg. using Base64Decode). However you will still have problems with a string with 0x00 in it. In that case I assume if Python is indeed creating strings with 0x00 in them, it will have tools to convert them into printable form - like its own version of Base64Encode / Base64Decode.
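
On the Python side, the standard `base64` module can play that role. The following is a minimal sketch in Python 3; `encode_for_variable` and `decode_from_variable` are hypothetical helper names, and the sample data is invented to include a tab, a NUL byte and a CR LF pair.

```python
import base64
import pickle


def encode_for_variable(obj):
    """Pickle obj and wrap it in Base64 so only printable ASCII is stored."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_from_variable(text):
    """Reverse encode_for_variable on a string read back from a variable."""
    return pickle.loads(base64.b64decode(text))


areas = {"Test": ["tab\there", "nul\x00byte", "line\r\nbreak"]}
stored = encode_for_variable(areas)
restored = decode_from_variable(stored)
```

The encoded string contains no tabs, line endings or 0x00 bytes, so there is nothing for the XML loader to rewrite. In a script it would presumably be passed to `world.SetVariable` and read back with `world.GetVariable`, as in the earlier posts.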

To be honest, when I wrote the variables stuff (and the XML loader) I expected people to want to store things like mob names, not pure binary data with imbedded nulls, carriage-returns and linefeeds. If I had expected that, then MUSHclient would have been written from the start with more robust string handling - that is, the ability to store strings with 0x00 in them, which basically means all the standard C string libraries can't be used.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

This is page 2, subject is 4 pages long.