[Home] [Downloads] [Search] [Help/forum]


Register forum user name Search FAQ

Gammon Forum

[Folder]  Entire forum
-> [Folder]  Programming
. -> [Folder]  General
. . -> [Subject]  Lua regular expression library that does not need Lua

Lua regular expression library that does not need Lua

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page


Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Sat 30 Apr 2011 11:26 PM (UTC)

Amended on Sat 23 Aug 2014 04:01 AM (UTC) by Nick Gammon

Message
I am pleased to announce the release of a Regular Expression library for the Arduino and other applications where a compact library is required.

The examples given are for the Arduino IDE, but the library itself does not use any specific Arduino function calls. It could be used in PIC processors, or anywhere (like a C++ program) where you simply want a compact, fast, powerful regular expression matcher.

What are regular expressions?


If you haven't used them before, they are incredibly powerful ways of parsing text in strings. For example, you can look for a word starting with an upper-case letter, a hex string, a number with an optional leading minus sign, and so on.

They provide a somewhat easier and more powerful way of breaking up strings than using library functions like strtok, strcmp, strstr etc.


Download


The library can be downloaded from:

http://gammon.com.au/Arduino/Regexp.zip (9Kb).

[EDIT] Also at: https://github.com/nickgammon/Regexp

Add the file Regexp.cpp to your project. The "include" file Regexp.h has the relevant declarations in it. If you are using this on the Arduino the file keywords.txt has the library keywords in it.

To install on the Arduino, simply unzip the folder into your "libraries" folder.

[EDIT]

Uploaded version 1.2 on 19th May 2011. This now has extra functions for doing replacements, and global (multiple) finds, and counting how many matching strings there are.

Design


My design criteria for this library were:


  • Be powerful enough to be useful (ie. not a "toy")
  • Be fast enough that it could be used to do things like process input from GPS devices, RFID readers, web pages etc.
  • Be compact enough that it doesn't use most of the available memory on a microprocessor


I believe these have been met, as follows:

Power


The library processes regular expressions which are identical in syntax to those used by the Lua string.match, string.find and similar functions. This is because the code is adapted from the Lua code written by Roberto Ierusalimschy. It has been adapated enough to make it work outside the Lua structure basically.

Lua regular expressions are well-known and well-documented. Their power is such that (for example) very extensive add-ons for the World of Warcraft game are written in Lua, and use the regular expression matching to break up incoming data from the server.

My own documentation for Lua regular expressions is here:

http://www.gammon.com.au/scripts/doc.php?lua=string.find

You can not only match strings like "%d+" (for one or more digits) but you can specify "captures" which means each captured substring has its position returned, so you can easily extract it out from the original string.

There are some examples here too:

http://gammon.com.au/forum/?id=6034

That page mentions the string.find and string.match in Lua, but the part about the patterns themselves would apply.


Speed


The Lua regular expression matcher has been well-regarded for its speed, and this library performs well too. For example:

String to parse: "Testing: answer=42"
Regular expression: "(%a+)=(%d+)"

Time taken to match: around 2 milliseconds on an Arduino microprocessor running at 16 MHz.

This test returned the matching text ("answer=42"), its length, plus the two captures ("answer" and "42").


match start was 9
match length was 9
Match text: 'answer=42'
Captures: 2
Capture number: 1
Text: 'answer'
Capture number: 2
Text: '42'


This shows how easily you can use regular expressions to parse incoming text (eg. GPS data in the form keyword=value).


Size


The library takes about 2392 bytes on the Arduino. For example a minimal test would be:


#include <Regexp.h>
void setup ()
{
  MatchState ms;
  ms.Target ("cat");  // what to search
  char result = ms.Match ("a");  // look for "a"
} // end of setup

void loop () {}


This compiles to be 2842 bytes. However an "empty" sketch is 450 bytes, so the regular expression library has added 2392 bytes (2.33 Kb).

I believe this is an acceptable length. This is around 7% of the memory on a 32 Kb device. You can reduce the memory slightly by reducing the number of captures it supports (currently 32). Alternatively, if you need to do dozens of captures you can do that by changing one define, at the cost of 4 bytes per capture.


Error handling


The library "throws" exceptions by doing a non-local goto (longjmp), in exactly the same way Lua does. This keeps the code compact and simple. If there is a parsing problem then the library returns a negative number as the result of the regexp call. You can interpret those to tidy up your regular expressions to make them work properly.

It does not use C++ exceptions because they are not supported by the Arduino IDE.

Usage


You need to include the library:


#include <Regexp.h>


Since, unlike Lua, functions cannot return multiple results (eg. all the captures) the MatchState structure is used to communicate with the library. You start by setting up the string to be searched, and its length:


MatchState ms;
ms.Target ("Testing: answer=42");


You can supply either a zero-terminated string (like the above) or a char buffer and a length.

Then you call the Match method of the MatchState structure, supplying the regular expression string itself, and an zero-relative offset to commence searching from. The function returns 1 on a match, 0 on no match, and a negative number for a parsing error.


  char result = ms.Match ("(%a+)=(%d+)");
  if (result == REGEXP_MATCHED)
  {
    // matching offsets in ms.capture
  }
  else if (result == REGEXP_NOMATCH)
  {
    // no match
  }
  else
  {
    // some sort of error
  }


The meanings of the various error codes are defined in Regexp.h.

If and only if you get a REGEXP_MATCHED result (that is, 1) then the captures array in the MatchState structure is set up to indicate what the address and length of each capture substring is. You can use that information to index into your supplied string and extract out the substrings, if required.

For example:


char buf [100];  // large enough to hold expected string

Serial.print ("Captures: ");
Serial.println (ms.level);

for (int j = 0; j < ms.level; j++)
  {
  Serial.print ("Capture number: ");
  Serial.println (j, DEC);
  Serial.print ("Text: '");
  Serial.print (ms.GetCapture (buf, j));
  Serial.println ("'");
  
  } // end of for each capture


Also the matching text itself (the whole match) is stored as an offset and length. You could index into your original string to extract out the matching text, if required. It may not be required. You may only need to know if a match occurred, or not.


char buf [100];  // large enough to hold expected string
Serial.print ("Matched on: ");
Serial.println (ms.GetMatch (buf));
        


The library does not attempt to "pre-extract" those strings for you on the grounds that to do so would take extra time and memory, which you, the user of the library, may not care to expend.

MatchState methods



void MatchState::Target (char * s);
void MatchState::Target (char * s, const unsigned int len);


These let you supply the string to be searched (the target string). It can be null-terminated, in which case the library finds the end by doing a strlen on it, or you supply the length. If you have built up a buffer from incoming text you may prefer to just supply the length. You can also provide the target string on the MatchState constructor.


char MatchState::Match (const char * pattern, unsigned int index = 0);


This performs the match based on the supplied null-terminated pattern, and starting at the supplied index into the target string. By modifying the index parameter you can re-match further and further through the same target string, perhaps to keep finding the same sort of string (eg. a word).

The result of the match will be > 0 if successful, 0 if no match, and < 0 if an error occurred (invalid regular expression).


char * MatchState::GetMatch (char * s) const;


After a successful match, this copies the matching string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the matching string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on MatchLength which is calculated during the match. If no successful match was previously done, then an empty string is copied.

The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.


char * MatchState::GetCapture (char * s, const int n) const;


After a successful match, this copies the specified capture string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the matching string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on capture [n].len which is calculated during the match. If no successful match was previously done, or this capture does not exist, then an empty string is copied.

The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.

The variable "levels" in the MatchState object tells you how many captures were found. This could be zero if you did not use round brackets in the regular expression.

The match number is zero-relative, so the first match is match number 0.


unsigned int MatchCount (const char * pattern);


This applies the pattern to the target buffer repeatedly, and returns the count of matches (which could be zero of course). This could be used, for example, to count how many spaces are in a string.


unsigned int GlobalMatch (const char * pattern, GlobalMatchCallback f);


This applies the pattern to the target buffer repeatedly, and for each match calls the function f which must be declared like this:


void fname (const char * match,          // matching string (not null-terminated)
            const unsigned int length,   // length of matching string
            const MatchState & ms);      // MatchState in use (to get captures)


This function gets a pointer to the matching string in the actual target buffer, thus it will not be null-terminated. You need to use the length as well to work out how long it is. You could, for example, memcpy or memcmp the string. If you want to get the captures for each match then use the MatchState variable in the way you would for ordinary matches.

This function could be used to break a string up into words (for example) and for each word do something with it.


unsigned int GlobalReplace (const char * pattern, GlobalReplaceCallback f, const unsigned int max_count = 0);


This applies the pattern to the target buffer repeatedly, and for each match calls the function f which must be declared like this:


void fname (const char * match,          // matching string (not null-terminated)
            const unsigned int length,   // length of matching string
            char * & replacement,        // what to replace with
            unsigned int & replacement_length,  // how long the replacement is
            const MatchState & ms);      // MatchState in use (to get captures)


This function gets a pointer to the matching string in the actual target buffer, thus it will not be null-terminated. You need to use the length as well to work out how long it is. If the function does nothing, then the original string will be retained (thus nothing will change).

However the function can set the replacement variable to a new string, and the replacement_length variable to a new length. The new string must still be in scope when the function returns. Thus, you could use a static variable, or a literal string.

If max_count is zero then as many replacements as possible are done, otherwise it stops after doing max_count replacements.


unsigned int GlobalReplace (const char * pattern, char * replacement, const unsigned int max_count = 0);


This applies the pattern to the target buffer repeatedly, and for each match replaces the matched string with the replacement string. This can be used for simple replacements, like replacing multiple spaces with a single space.

If max_count is zero then as many replacements as possible are done, otherwise it stops after doing max_count replacements.

For both replacement functions the replacements are done "in situ". Thus the buffer holding the original string must be large enough for the replacement, especially if you replace with something longer than the original.

Also a null-terminator is appended to the string for convenience, even if there wasn't one there to begin with. The buffer would need room for that.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #1 on Sat 30 Apr 2011 11:40 PM (UTC)

Amended on Thu 19 May 2011 07:32 AM (UTC) by Nick Gammon

Message
Iterating over a string


This small example demonstrates how you can repeatedly apply a regular expression to a string (like string.gmatch in Lua) to do things like extract out words from some text. This is done by calling the GlobalMatch function with a "callback" function. The callback function is called once for each match, so you can do something with the matching text. The example below displays the matching text, and also the captures (sub-patterns):


#include <Regexp.h>

// called for each match
void match_callback  (const char * match,          // matching string (not null-terminated)
                      const unsigned int length,   // length of matching string
                      const MatchState & ms)      // MatchState in use (to get captures)
{
char cap [10];   // must be large enough to hold captures
  
  Serial.print ("Matched: ");
  Serial.write ((byte *) match, length);
  Serial.println ();
  
  for (byte i = 0; i < ms.level; i++)
    {
    Serial.print ("Capture "); 
    Serial.print (i, DEC);
    Serial.print (" = ");
    ms.GetCapture (cap, i);
    Serial.println (cap); 
    }  // end of for each capture

}  // end of match_callback 


void setup ()
{
  Serial.begin (115200);
  Serial.println ();
  unsigned long count;

  // what we are searching (the target)
  char buf [100] = "The quick brown fox jumps over the lazy wolf";

  // match state object
  MatchState ms (buf);

  // original buffer
  Serial.println (buf);

  // search for three letters followed by a space (two captures)
  count = ms.GlobalMatch ("(%a+)( )", match_callback);

  // show results
  Serial.print ("Found ");
  Serial.print (count);            // 8 in this case
  Serial.println (" matches.");
 

}  // end of setup  

void loop () {}


Output:


The quick brown fox jumps over the lazy wolf
Matched: The 
Capture 0 = The
Capture 1 =  
Matched: quick 
Capture 0 = quick
Capture 1 =  
Matched: brown 
Capture 0 = brown
Capture 1 =  
Matched: fox 
Capture 0 = fox
Capture 1 =  
Matched: jumps 
Capture 0 = jumps
Capture 1 =  
Matched: over 
Capture 0 = over
Capture 1 =  
Matched: the 
Capture 0 = the
Capture 1 =  
Matched: lazy 
Capture 0 = lazy
Capture 1 =  
Found 8 matches.


- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #2 on Fri 13 May 2011 08:50 AM (UTC)

Amended on Thu 19 May 2011 08:13 AM (UTC) by Nick Gammon

Message
Finding and replacing


The example below demonstrates how you can use a regular expression to search a string, repeatedly, for something (eg. &lt; or &gt;) and then do a replacement. Then keep searching. This is done by calling GlobalReplace, passing down a "replace callback" function. This callback function is called once for each match. It gets a chance to see what the matching string is, and if necessary, replace it with something else.


#include <Regexp.h>

// called for every match
void replace_callback (const char * match,         // what we found
                       const unsigned int length,  // how long it was
                       char * & replacement,       // put replacement here
                       unsigned int & replacement_length,  // put replacement length here
                       const MatchState & ms)      // for looking up captures
{
   
   if (memcmp (match, "&lt;", length) == 0)
     {
     replacement = "<";
     replacement_length = 1;     
     }
   else if (memcmp (match, "&gt;", length) == 0)
     {
     replacement = ">";
     replacement_length = 1;     
     }
  
}  // end of replace_callback 


void setup ()
{
  Serial.begin (115200);

  // what we are searching
  char buf [100] = "I do like to be &lt;&lt; beside &gt;&gt; the seaside";
  MatchState ms (buf);

  ms.GlobalReplace ("&%a+;", replace_callback);

  Serial.println (buf);
  
}  // end of setup  

void loop () {}


If a match is found we compare the matching data with a string we are interested in. If a match is found, we set up a replacement string, otherwise we just keep going.


Example output from the above:


I do like to be << beside >> the seaside






Here is another example of replacing. In this case the callback function gets a hex number (eg. %22) and converts that into the equivalent character.


#include <Regexp.h>

// called for every match
void replace_callback (const char * match,         // what we found
                       const unsigned int length,  // how long it was
                       char * & replacement,       // put replacement here
                       unsigned int & replacement_length,  // put replacement length here
                       const MatchState & ms)      // for looking up captures
{
static byte c;  // for holding replacement byte, must be static

   char hexdigits [3];  // to hold hex string
   
    // get first capture
    ms.GetCapture (hexdigits, 0);  
    // convert from hex to printable
    c = strtol (hexdigits, NULL, 16);
    
    // set as replacement
    replacement = (char *) &c;
    replacement_length = 1;
}  // end of replace_callback 


void setup ()
{
  Serial.begin (115200);

  // what we are searching
  char buf [100] = "%7B%22John+Doe%22%7D";

  // for matching regular expressions  
  MatchState ms (buf);

  // easy part, replace + by space
  ms.GlobalReplace ("%+", " ");
  
  // replace %xx (eg. %22) by what the hex code represents
  ms.GlobalReplace ("%%(%x%x)", replace_callback);

  Serial.println (buf);

}  // end of setup  

void loop () {}


Example output:


{"John Doe"}





For simple replacements (eg. replacing multiple spaces by one space) you can just provide a replacement string:


#include <Regexp.h>

void setup ()
{
  Serial.begin (115200);

  // what we are searching
  char buf [100] = "The quick brown fox jumps over the lazy pig";

  // for matching regular expressions  
  MatchState ms (buf);

  // replace vowels
  ms.GlobalReplace ("[aeiou]", "#");

  Serial.println (buf);

}  // end of setup  

void loop () {}


Example output:


Th# q##ck br#wn f#x j#mps #v#r th# l#zy p#g





Note that all replacing is done "inline" - in the original string. Thus, if you are replacing with something longer the buffer must be long enough to hold the replacement string.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Delsauce   USA  (1 post)  [Biography] bio
Date Reply #3 on Thu 05 Jan 2012 05:08 PM (UTC)
Message
Nice work on the library -- very useful. I created a command parser that uses this for a simple debug interface via a serial port (any interface that uses Print::write, actually).

https://github.com/delsauce/YASP
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #4 on Fri 06 Jan 2012 05:36 AM (UTC)
Message
Interesting, thank you.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Paul Webster   New Zealand  (1 post)  [Biography] bio
Date Reply #5 on Wed 17 Apr 2013 07:30 PM (UTC)
Message
Hi Nick,

just stumbled across this library, is exactly what I'm after, good work!

I found you on Guthub, but no sign of the project there - wondered if you'd consider pushing the source up as a project on Github?

Cheers,

Paul.

http://www.progression.co.nz
[Go to top] top

Posted by Nick Gammon   Australia  (22,975 posts)  [Biography] bio   Forum Administrator
Date Reply #6 on Thu 18 Apr 2013 12:06 PM (UTC)
Message
https://github.com/nickgammon/Regexp

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.


56,140 views.

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page

Go to topic:           Search the forum


[Go to top] top

Quick links: MUSHclient. MUSHclient help. Forum shortcuts. Posting templates. Lua modules. Lua documentation.

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.

[Home]


Written by Nick Gammon - 5K   profile for Nick Gammon on Stack Exchange, a network of free, community-driven Q&A sites   Marriage equality

Comments to: Gammon Software support
[RH click to get RSS URL] Forum RSS feed ( https://gammon.com.au/rss/forum.xml )

[Best viewed with any browser - 2K]    [Hosted at HostDash]