[Home] [Downloads] [Search] [Help/forum]


Register forum user name Search FAQ

Gammon Forum

[Folder]  Entire forum
-> [Folder]  MUSHclient
. -> [Folder]  Lua
. . -> [Subject]  LPeg pattern parsing

LPeg pattern parsing

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page


Pages: 1 2  

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Mon 02 Jun 2008 06:01 AM (UTC)

Amended on Mon 02 Jun 2008 09:34 PM (UTC) by Nick Gammon

Message
I have been experimenting a bit with LPeg - a pattern matching library for Lua, based on Parsing Expression Grammars. This is written by Roberto Ierusalimschy, the developer of Lua.

You can read about the Lua implementation at:

http://www.inf.puc-rio.br/~roberto/lpeg/

and also:

http://www.inf.puc-rio.br/~roberto/lpeg/re.html

Plus more information in a PDF file at:

http://www.inf.puc-rio.br/~roberto/docs/peg.pdf


Effectively what lpeg does (as far as I can tell) is give you the tools to make something similar to a regular expression parser, but with the regular expression pattern (and its components) being "first class values". That is, they can be operated on in pieces.

I am pretty confident that David Haley could explain these grammars a lot better than I can, but I will make a stab at it.


If you want to try these examples at home, I have taken the Lua library distributed by Roberto, compiled it under Cygwin to get a Windows DLL, and released that here (84 Kb):

http://www.gammon.com.au/files/mushclient/lpeg-0-1.8.1_b.zip

Take the file lpeg.dll from the .zip file and place it in the same directory as MUSHclient.exe, and you are ready to roll.

To get started, let's look at looking for matching on something like "You gain 300 hp.".


require "lpeg"

digits = lpeg.R ("09")

hp = lpeg.P ("You gain ") * digits^1 * " hp."

print (lpeg.match (hp, "You gain 300 hp."))  --> 17


What we have done here is make a variable (actually a Lua userdatum) called "digits" which is the pattern for the range 0-9.

Then the main pattern consists of:


  • lpeg.P ("You gain ") --> match on this literal string
  • digits^1 --> match on the digits pattern, one or more times
  • "hp." --> match on "hp." literally


The "*" symbol is the concatenation operator for lpeg. That is, this followed by that.

We now pass that to lpeg.match, along with our target string, and get a non-nil result (actually the column number just after the match ends), which confirms we matched.

We can do something similar to match on mana:


mana = lpeg.P ("You gain ") * digits^1 * " mana."

print (lpeg.match (mana, "You gain 500 mana."))  --> 19


Now we probably want the "capture" results back (that, is what *was* our HP value?). So we can put lpeg.C (capture) around the part in question:


hp = lpeg.P ("You gain ") * lpeg.C (digits^1) * " hp."
print (lpeg.match (hp, "You gain 300 hp."))  --> 300

mana = lpeg.P ("You gain ") * lpeg.C (digits^1) * " mana."
print (lpeg.match (mana, "You gain 500 mana."))  --> 500


Things start to get interesting now. We have two patterns, one matches on hp, and the other on mana. We can concatenate the patterns together to get a larger pattern:


print (lpeg.match (hp * mana, "You gain 300 hp.You gain 500 mana."))  --> 300 500


Both results are returned, and we are matching on the concatenated string.

We can swap around the order:


print (lpeg.match (mana * hp, "You gain 500 mana.You gain 300 hp."))  --> 500 300


Or, we can do one *or* the other:


print (lpeg.match (hp + mana, "You gain 123 hp."))  --> 123
print (lpeg.match (hp + mana, "You gain 456 mana."))  --> 456


The "+" operator is the choice operator. We will match on the hp string, or the mana string.

Next, let's parse a more complex string. Something like "You see exits leading north, up, down, west and south".

This can be broken into components. The first component is the possible exit directions:


directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down")


This shows that the possible directions are (in each case) a choice of north, south etc., which we want captured.

The next part of our exits string is the part underlined:


You see exits leading north, up, down, west and south


We can see that what we have here is one initial direction (north in this case), followed by zero or more extra directions (up, down, west). These extra directions are preceded by a comma. Finally we have the word "and". This can all be expressed like this:


comma_directions = directions * (", " * directions)^0 * " and "


Finally the whole string starts with "You see exits leading" and finishes off with the final direction.


patt = lpeg.P ("You see exits leading ") * (comma_directions^-1) * directions


We want all the directions nicely placed in a table, so we can use lpeg.Ct to gather all the submatches together:


result = lpeg.match (lpeg.Ct (patt), "You see exits leading north, up, down, west and south")

require "tprint"

if result then
  tprint (result)
else
  print "no match"
end -- if

Output

1="north"
2="up"
3="down"
4="west"
5="south"


Another way of expressing the same idea is with an "exit line grammar". This might look like this:


directions       <- "north" | "south" | "east" | "west" | "up" | "down"

comma_directions <- directions (", " directions)*  " and "

exitline         <- "You see exits leading "  comma_directions? directions


In the notation above "|" means "or", "*" means "zero or more" and "?" means zero or one.

We can express that grammer in lpeg like this:


exitgrammar = lpeg.P {
           "exitline",
           directions = lpeg.C (lpeg.P "north" + "south" + "east" + "west" + "up" + "down"),
           comma_directions = lpeg.V "directions" * (", " * lpeg.V "directions")^0 * " and ",
           exitline = "You see exits leading " * lpeg.V "comma_directions"^-1 * lpeg.V "directions",
           }
           
result = lpeg.match (lpeg.Ct (exitgrammar),  "You see exits leading north, up, down, west and south")


This gives the same results as before.


There is considerably more power in lpeg than I have described here (or, indeed, been able to understand). The links given above give more details. There are examples there of patterns that parse comma-separated strings, and even evaluate whole arithmetic expressions, returning the result.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #1 on Mon 02 Jun 2008 06:12 AM (UTC)

Amended on Mon 16 Jun 2008 03:11 AM (UTC) by Nick Gammon

Message
Another pattern I was wresting with, as it wasn't particularly obvious, was how to match something like:


<someone> says, <something>


For example:


Nick Gammon says, But does it get goat's blood out?


My initial attempt was along these lines:


lpeg.P (1)^1 * " says " * lpeg.P (1)^1


That is, one or more of any character, followed by " says " followed by one or more of any character. However that failed because the initial sequence 'lpeg.P (1)^1' was "greedy" and gobbled up the entire line.

So it needs to be changed to <a single character, provided we are not at the start of " says, "> and match that multiple times. That effectively consumes all the characters up to, but not including " says, ".

The finished result looks like this:


require "lpeg"

line = lpeg.P (" says, ")
patt = lpeg.C ((1 - line)^1) * line * lpeg.C (lpeg.P(1)^1)

who, what = lpeg.match (patt, 
                   "Nick Gammon says, But does it get goat's blood out?")

print (who)  --> Nick Gammon
print (what) --> But does it get goat's blood out?


And see below for how to do this example using the "re" module.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #2 on Mon 02 Jun 2008 09:42 PM (UTC)
Message
The file to download has changed. I found yesterday that the original DLL crashed MUSHclient when running the "test.lua" file supplied as part of the lpeg download. I have recompiled it a different way, which incidentally made the DLL somewhat smaller. The new file (lpeg-0-1.8.1_b.zip) is available from the link given in the earlier post.

With this DLL you can run the lpeg "torture test" by copying test.lua into the MUSHclient install directory (along with lpeg.dll as mentioned above), and then in the Immediate window typing:


dofile "test.lua"


You should see in the output window:


General tests for LPeg library
version 0.8
+
+
+
+
+
OK


This 960 line file tests various aspects of lpeg (and also provides some examples if you want to read it).


- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #3 on Mon 02 Jun 2008 09:47 PM (UTC)

Amended on Tue 03 Jun 2008 06:50 AM (UTC) by Nick Gammon

Message
The lpeg download also includes the file "re.lua" which implements a form of regular expressions, using lpeg, written by Roberto Ierusalimschy.

As an example, you can write in a more "natural" way, a grammar which can be processed by lpeg. An example given in the enclosed file re.html, shows how you could write a grammar that parses comma-delimited fields.

As mentioned in the documentation:


A field is either a quoted field (which may contain any character except an individual quote, which may be written as two quotes that are replaced by one) or an unquoted field (which cannot contain commas, newlines, or quotes). A record is a list of fields separated by commas, ending with a newline or the string end (-1).


The example below shows this in action.


require "re"

record = re.compile[[
  record <- ( <field> (',' <field>)* ) -> {} ('%nl' / !.)
  field <- <escaped> / <nonescaped>
  nonescaped <- { [^,"%nl]* }
  escaped <- '"' {~ ([^"] / '""' -> '"')* ~} '"'
]]


result = lpeg.match (record, [[all,cows,"every good boy ""deserves"" fruit",eat,grass]])

require "tprint"

if result then
  tprint (result)
else
  print "no match"
end -- if

Output
1="all"
2="cows"
3="every good boy "deserves" fruit"
4="eat"
5="grass"



- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #4 on Tue 03 Jun 2008 03:19 AM (UTC)

Amended on Tue 03 Jun 2008 03:24 AM (UTC) by Nick Gammon

Message
The grammar for my earlier example of matching on a room exits list can also be expressed a bit more neatly by using the "re" syntax:


require "re"

exits = re.compile[[
  ExitLine         <- "You see exits leading "  <CommaDirections>? <Directions> 
  CommaDirections  <- <Directions> (", " <Directions>)*  " and "
  Directions       <- { "north" / "south" / "east" / "west" / "up" / "down" }
]]


print ( exits:match ("You see exits leading north, up, down, west and south") )

Output
north up down west south


A small "gotcha" here is that the PEG syntax defines identifiers as [A-Za-z]+ so I had to omit the underscore I had used previously, in order to avoid puzzling error messages.

You can see that the re syntax above describes the room exit list in a fairly natural way.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #5 on Sun 08 Jun 2008 10:18 PM (UTC)

Amended on Sat 14 Jun 2008 11:56 PM (UTC) by Nick Gammon

Message
Another nice use for LPeg is to handle strings with delimiters. In particular, you sometimes have the case of reading in a file which might be in Unix, Windows, or Mac format, and you need to handle the end-of-line situation in a graceful way. In the case of Unix the line ending will be \n, in the case of Windows it will \r\n and in the case of Mac (at least the older ones) it would be \r.

As suggested on another Lua forum, the way to handle this is to look for any ending on its own, or followed by a different one. In other words, any of these could be considered a line end:


  • \r
  • \n
  • \r\n
  • \n\r


However multiple instances of the same ending (eg. \n\n) would be two line breaks.

Using LPeg we can express the four line endings like this:


require "lpeg"

delim = lpeg.P (lpeg.P("\n") * "\r") + 
        lpeg.P (lpeg.P("\r") * "\n") + 
        lpeg.P ("\n") + 
        lpeg.P ("\r")


Bear in mind that "*" means concatenation, and "+" means "or".

Now from the LPeg documentation page, this function will split a string at a separator, returning a table:


function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)   -- make a table capture
  return lpeg.match(p, s)
end


Now to test it:


test = "every\n" .. "good\r" .. "boy\r\n" .. "deserves\n\r" .. "fruit"

t = split (test, delim)  -- table of results

table.foreachi (t, print)  -- print table

Output

1	every
2	good
3	boy
4	deserves
5	fruit


This "convert to a table" approach would be handy for reading something like a configuration file, and then processing it line-by-line, where you weren't certain of the line endings of the input file.

Another approach is to simply replace the line ending with our desired one.


function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

s = gsub (test, delim, ",")  -- string result

print (s)  -- print string

Output

every,good,boy,deserves,fruit


Of course, instead of a comma in the gsub we could use \n, \r\n or \r which would convert from any input style to a desired output style.


- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #6 on Mon 16 Jun 2008 03:17 AM (UTC)

Amended on Sun 07 Nov 2010 07:31 PM (UTC) by Nick Gammon

Message
Another way of doing the: "[who] says, [what]" example above is to use the "re" module, like this:


require "re"

patt = re.compile [[
          result <- { <who> } <says> { .+ }
          who <- (!<says> .)+
          says <- " says, "
          ]]
          
who, what = lpeg.match (patt, 
             "Nick Gammon says, But does it get goat's blood out?")
             
print (who)  --> Nick Gammon
print (what) --> But does it get goat's blood out?


Starting at the bottom and working up, we need the "delimiter" which divides the line into two parts - in this case the string " says, ". So we assign that to the grammar name "says".

Next, the "who" part is one or more characters, providing we are not up to " says, " yet.
We can express 'not up to " says, " and take a single character' like this: (!<says> .)
We need one or more of those, hence the "+" sign after it.

Finally the finished result is { <who> } - the "who" part, returned as a capture, followed by <says> (the word " says, ") followed by { .+ } which is one or more characters of any sort.




An interesting challenge is to take the above and make the "who" part greedier. If we add another "says," to target string we get this:


who, what = lpeg.match (patt, 
             "Nick Gammon says, John says, But does it get goat's blood out?")
             
print (who)  --> Nick Gammon
print (what) --> John says, But does it get goat's blood out?


This may well be what we want in this case, but what if we wanted everything after the last "says," to be in the "what" field, not everything after the first "says,"?

A small variation to the pattern achieves this:


require "re"

patt = re.compile [[
          result <- { (<who> <says>)+ }  { .+ }
          who <- (!<says> .)+
          says <- " says, "
          ]]
  
          
who, what = lpeg.match (patt, 
             "Nick Gammon says, John says, But does it get goat's blood out?")
             
print (who)  --> Nick Gammon says, John says, 
print (what) --> But does it get goat's blood out?


By putting <who> <says> into a group, and asking for one or more of them, that part of the pattern will consume characters until there are no more "says," in the string.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Worstje   Netherlands  (899 posts)  [Biography] bio
Date Reply #7 on Mon 16 Jun 2008 11:59 AM (UTC)
Message
That's pretty heavy stuff, Nick. I've read through most of it, but I'm kind of confused what its advantages are over regular expressions. Can you give some examples of stuff it can do that regular expressions can not?
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #8 on Mon 16 Jun 2008 10:06 PM (UTC)
Message
Well the classic example would be anything with nested definitions, which PCRE does not handle very well. For example, C-style nested comments.

If I want to match on /* .... */ using PCRE regexps I basically have to choose a greedy or non-greedy match. However neither really handles it, see below:


             /*  aaa    /*  nested   */  bbb    */ (not comment)     /*  cccc    */
Greedy:      <-------------------------------------------------------------------->
Not greedy:  <------------------------>


Neither case is matching the first comment (with the nested comment). Similar things would apply to, say, HTML where you might have nested unordered lists, eg.


<ul>    <ul>    </ul>   </ul>  <ul>  </ul>


However as Roberto Ierusalimschy points out, this is the sort of thing that LPeg is good at. Or indeed any complex sort of pattern (eg. comma-separated values, where commas inside quotes are not considered separators).

I must admit that it is a bit mind-boggling at first, which is why I tried some fairly simple examples. I should point out that for simple cases, use the simple tools.


  • Lua's regexp (string.match etc.) will handle many simple cases, but fails on complex "or" situations. For example "You (say|yell) something". For that you need PCRE.

  • PCRE (which is used in MUSHclient's triggers and aliases) will handle most situations, excepting the (probably rare) situations of nested definitions.

  • LPeg would be suited to more complex matches, like nested things, or indeed parsing stuff like an XML document.



- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Erendir   Germany  (47 posts)  [Biography] bio
Date Reply #9 on Tue 17 Jun 2008 12:41 AM (UTC)
Message
@Worstje
in a more scientific way:
regular expressions work with regular Languages (Type-0 in Chomsky-hierarchy), and LPeg, 'cause of possibility to describe grammars, match at least Context-sensitive languages (Type-1 in Chomsky), and this is much more.
More about the hierarchy: http://en.wikipedia.org/wiki/Chomsky_hierarchy
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #10 on Sat 21 Feb 2009 02:40 AM (UTC)
Message
lpeg 0.9 is now built into MUSHclient 4.40.

Also re.lua is supplied in the lua directory.

The above examples tested OK under lpeg 0.9

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #11 on Sat 21 Feb 2009 03:09 AM (UTC)
Message
This is fascinating stuff. I somehow missed this post the first time around. I knew that lpeg did some pretty fancy stuff with parsing and grammars, but I hadn't realized that it implements what seem to be context-free grammars. Very nifty indeed! (Or maybe it just implements something that looks like CFGs, but doesn't allow recursion, and so isn't truly more expressive than regular languages.) I'll have to look into this further... very nifty indeed.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #12 on Sat 21 Feb 2009 03:47 AM (UTC)
Message
See Roberto's talk: http://vimeo.com/1485123

I think I did a timing test a while back - lpeg came out pretty well compared to doing the same thing with PCRE or Lua's regexp. Not brilliantly faster, but not slower either, from memory.

The thing is, that of course you can express a heap more than you can in simple Lua regexs, and even with PCRE's parsing, which I think tends to fail if you try to do something complex like detect nested commas, or nested HTML elements, that sort of thing.

I know this might seem obscure in a MUD client, but after all, a lot of time we spend time trying to parse complex triggers or room descriptions, and being able to express them as a grammar just might make them easier to decode.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #13 on Sat 21 Feb 2009 03:49 AM (UTC)
Message
And David, you missed my opening comment:

Quote:

I am pretty confident that David Haley could explain these grammars a lot better than I can, but I will make a stab at it.



I await your analysis. ;)


- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by WillFa   USA  (525 posts)  [Biography] bio
Date Reply #14 on Sun 22 Feb 2009 10:09 AM (UTC)

Amended on Sun 22 Feb 2009 10:19 AM (UTC) by WillFa

Message
When you say it's built in, is it just compiled/linked into the exe, or are there any features that utilize it?

It'd be rather neat (though a lot of work, I'm suspecting) if the trigger engine had the option of matching PCRE or LPeg triggers. Although it may just be as simple as a new check box/tristate check box/drop down (Parsing: MC|PCRE|LPEG|LPEG-RE)

I've been playing with LPEG 0.8 all day, and there are some things that I'd really like to try out with 0.9 (anonymous captures, named captures... I'd like to see if (although I don't think this is possible if you could make a backreference a name (how simple would that be? " {{:key: %a+:} ': ' {:=key: %d+ :} ->{} " "HP: 450" would return a table {HP = '450'} (in 0.8 I've gotten a {patt} => func working to parse those key/value pairs. it'd be nice if there was a more elegant way to do it.)


Thanks Nick, as always.
[Go to top] top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.


76,707 views.

This is page 1, subject is 2 pages long: 1 2  [Next page]

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page

Go to topic:           Search the forum


[Go to top] top

Quick links: MUSHclient. MUSHclient help. Forum shortcuts. Posting templates. Lua modules. Lua documentation.

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.

[Home]


Written by Nick Gammon - 5K   profile for Nick Gammon on Stack Exchange, a network of free, community-driven Q&A sites   Marriage equality

Comments to: Gammon Software support
[RH click to get RSS URL] Forum RSS feed ( https://gammon.com.au/rss/forum.xml )

[Best viewed with any browser - 2K]    [Hosted at HostDash]