switchScanner 0.6 |
Overview |
switchScanner is a simple Python script that generates simple lexical scanners in C using switch statements.
The resulting scanners are generally portable (Win32, x86 Linux, OS X) and are incredibly fast.
The switchScanner home page is here:
http://www.midwinter.com/~larry/programming/switchScanner/
And you can download a fresh copy of the source code here:
http://www.midwinter.com/~larry/programming/switchScanner/switchScanner.0.6.tar.gz
http://www.midwinter.com/~larry/programming/switchScanner/switchScanner.0.6.zip
The Gory Details |
You use it from Python like so:
    import switchScanner
    s = switchScanner("myScannerName")
    s.addKeyword("keywordGoesHere")
    s.addKeyword("secondKeyword")
    ...
    switchScanner.write()

This will create scanner.c and scanner.h in the current directory. There will be one entry point in scanner.h:

    extern token myScannerName(char **s);

The "token" type is an enum; its values are auto-generated from the keywords. To use the scanner, #include "scanner.h", then call:

    token t;
    while ((t = myScannerName(&s)) > TOKEN_NOERROR) {
        // handle t as necessary
    }

The scanner will return TOKEN_EOF if it reaches the end of the input string without incident, and TOKEN_ERROR if it encounters an unknown token.
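Because the loop ends on either TOKEN_EOF or TOKEN_ERROR, the caller usually wants to check which one stopped it. Here is a minimal sketch of that check. The scanner in it (demoScanner) is a hand-written stand-in, not switchScanner output, and the enum values, the space-skipping, and the count_keywords helper are all assumptions made up for illustration.

```c
/* Assumed values: the reserved tokens sort below every keyword token. */
typedef enum {
    TOKEN_ERROR = 0,
    TOKEN_EOF = 1,
    TOKEN_NOERROR = 2,
    TOKEN_ON = 3,
    TOKEN_OFF
} token;

/* Hand-written stand-in for a generated scanner (hypothetical).
 * It skips spaces, then recognizes "on" or "off". */
token demoScanner(char **s)
{
    char *p = *s;
    token t = TOKEN_ERROR;

    while (*p == ' ')
        p++;
    if (*p == '\0') {
        t = TOKEN_EOF;
    } else if (p[0] == 'o' && p[1] == 'n') {
        p += 2;
        t = TOKEN_ON;
    } else if (p[0] == 'o' && p[1] == 'f' && p[2] == 'f') {
        p += 3;
        t = TOKEN_OFF;
    }
    *s = p;   /* report how far the scan got */
    return t;
}

/* Counts the keywords in text; returns -1 if the scan stopped on an
 * unknown token instead of reaching the end of the string. */
int count_keywords(const char *text)
{
    char *s = (char *)text;   /* the stand-in scanner never writes to the text */
    int count = 0;
    token t;

    while ((t = demoScanner(&s)) > TOKEN_NOERROR)
        count++;
    return (t == TOKEN_EOF) ? count : -1;
}
```

The only part that carries over to a real generated scanner is the shape of count_keywords: loop while the token is above TOKEN_NOERROR, then compare the last token against TOKEN_EOF.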
Okay, so what's the point of all this?
Simple: the scanners generated by this script are lightning-fast. They make exactly one pass through each character in the scanned string. And they are explicitly not data-driven; they use switch statements, recognizing each letter of the keywords in sequence. For instance, a scanner that looked for the strings cat, car, cur, and bar might look like this:
    switch (*s++) {
    case 'b':
        if (!strcmp(s, "ar"))
            return TOKEN_BAR;
        break;
    case 'c':
        switch (*s++) {
        case 'a':
            switch (*s++) {
            case 'r': return TOKEN_CAR;
            case 't': return TOKEN_CAT;
            }
            break;
        case 'u':
            if (*s == 'r')
                return TOKEN_CUR;
            break;
        }
        break;
    }

This is greatly simplified compared to the actual code, which handles case sensitivity and ensures that the keywords terminate. (The above example, for instance, would return TOKEN_CUR for the word curtsey.) The real thing also has some additional silly little optimizations.
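To make the "keywords terminate" point concrete, here is one way such a check could be added to the simplified example. This is a sketch, not the code switchScanner actually emits: the enum values, the helper names (continues_word, finish, scan_keyword), and the rule that letters, digits, and underscores continue a word are all assumptions.

```c
#include <ctype.h>

/* Hypothetical token values; the real generator derives its own. */
typedef enum {
    TOKEN_ERROR = 0,
    TOKEN_BAR,
    TOKEN_CAR,
    TOKEN_CAT,
    TOKEN_CUR
} token;

/* Nonzero if c could continue a word (assumed: letters, digits, '_'). */
static int continues_word(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Accept tok only if the keyword really ends at p.
 * On success, advance *s past the keyword. */
static token finish(const char **s, const char *p, token tok)
{
    if (continues_word(*p))
        return TOKEN_ERROR;   /* e.g. "curtsey" is not "cur" */
    *s = p;
    return tok;
}

token scan_keyword(const char **s)
{
    const char *p = *s;

    switch (*p++) {
    case 'b':
        if (p[0] == 'a' && p[1] == 'r')
            return finish(s, p + 2, TOKEN_BAR);
        break;
    case 'c':
        switch (*p++) {
        case 'a':
            switch (*p++) {
            case 'r': return finish(s, p, TOKEN_CAR);
            case 't': return finish(s, p, TOKEN_CAT);
            }
            break;
        case 'u':
            if (*p == 'r')
                return finish(s, p + 1, TOKEN_CUR);
            break;
        }
        break;
    }
    return TOKEN_ERROR;
}
```

With this version, scanning "cur" yields TOKEN_CUR and leaves the pointer at the end of the string, while "curtsey" is rejected and the pointer is left where it started.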
Trying out switchScanner |
I've included a simple (hacked-up!) sample program for switchScanner. Under UNIX, simply run "make"; Win32 developers should run "nmake /f win32.mak". This will produce sstest (on Windows, sstest.exe), with the scanner in scanner.c.
According to the (super-simple!) benchmark in sstest, my scanner can recognize over 15 million symbols per second on my 933MHz Pentium 3 Linux server.
Notes And Warnings |
To start token values at a specific number, set switchScanner.tokenValue to that number before adding your first token. The minimum number is 3.
You can use a generated scanner to hash strings: set switchScanner.tokenValue to 3, then add a token to the end of your token list called FINAL_TOKEN (or something similar). Then create an array to represent your hash values, with TOKEN_FINAL_TOKEN entries. To hash a string, call the scanner twice; the first time you should get a valid token, the second time you should get an EOF. Use the valid token as an index into the array, and you're done!
To change the names of the output files, set switchScanner.basename to another string (the default is "scanner"). Similarly, you can change the enum's name by changing switchScanner.enumName, and you can change the prefix automatically added to keywords by changing switchScanner.tokenPrefix.
The script also includes cPrinter, which makes printing C programs convenient by maintaining the indentation level for you.
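The hashing recipe in the notes above (call the scanner twice, then index an array with the first token) can be sketched as follows. The scanner here (colorScanner) is a hand-written stand-in rather than real switchScanner output, and the enum layout, the color keywords, and the lookup helper are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout: three reserved values, then keywords from 3 upward. */
typedef enum {
    TOKEN_ERROR = 0,
    TOKEN_EOF = 1,
    TOKEN_NOERROR = 2,
    TOKEN_RED = 3,
    TOKEN_GREEN,
    TOKEN_BLUE,
    TOKEN_FINAL_TOKEN   /* one past the last real keyword */
} token;

/* Hand-written stand-in for a generated scanner (hypothetical). */
token colorScanner(char **s)
{
    static const struct { const char *word; token tok; } words[] = {
        { "red", TOKEN_RED }, { "green", TOKEN_GREEN }, { "blue", TOKEN_BLUE }
    };
    size_t i;

    if (**s == '\0')
        return TOKEN_EOF;
    for (i = 0; i < sizeof words / sizeof words[0]; i++) {
        size_t n = strlen(words[i].word);
        if (strncmp(*s, words[i].word, n) == 0) {
            *s += n;   /* advance past the keyword */
            return words[i].tok;
        }
    }
    return TOKEN_ERROR;
}

/* One slot per token value; slots 0..2 belong to the reserved tokens
 * and are never read. */
static const unsigned long rgb[TOKEN_FINAL_TOKEN] = {
    0, 0, 0,        /* TOKEN_ERROR, TOKEN_EOF, TOKEN_NOERROR */
    0xFF0000UL,     /* TOKEN_RED */
    0x00FF00UL,     /* TOKEN_GREEN */
    0x0000FFUL      /* TOKEN_BLUE */
};

/* Returns 1 and stores the value if key is exactly one keyword, else 0. */
int lookup(const char *key, unsigned long *out)
{
    char *s = (char *)key;   /* the stand-in scanner never writes to the text */
    token first = colorScanner(&s);

    if (first <= TOKEN_NOERROR)
        return 0;            /* not a keyword at all */
    if (colorScanner(&s) != TOKEN_EOF)
        return 0;            /* trailing characters after the keyword */
    *out = rgb[first];
    return 1;
}
```

The second scanner call is what rejects strings such as "greenish": the first call returns a valid token, but the second does not return TOKEN_EOF.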
Licensing |
Here's the license:
[BEGIN NOTICE]

Copyright 2005-2006 Larry Hastings

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

The switchScanner homepage is here:
http://www.midwinter.com/~larry/programming/switchScanner/

[END NOTICE]

In non-legalese, my goal was to allow you to do anything you like with the software, except claim that you wrote the original. If my license prevents you from doing something you'd like to do, contact me (my email address is in the source) and we can discuss it.
Furthermore, I'd like to point out that my license makes no claim on the output of switchScanner. Scanners you create with switchScanner are entirely your property.
Version History |
Fixed a bug where the char ** passed in didn't get advanced to the end.
Added #include <string.h> to the output code.
Added extern "C" to make it link correctly into C++ projects.
Changed switchScanner.py to use spaces instead of tabs. (I keep fighting with my new editor...)
Changed the tokenValue logic; it's now legal to set it to less than 100.
basenameEnumNameLookup(); EnumName here being enumName only capitalized.
strnicmp is deprecated. Once again, the VC8 chuckleheads are getting up my nose.
larry