I am creating my own language.
The goal is to "compile" it to PHP or Javascript, and, ultimately, to interpret and run it on the same language, to make it look like a "middle-level" language.
Right now, I'm focusing on the aspect of interpreting it in PHP and run it.
At the moment, I'm using regex to split the string and extract the multiple tokens.
This is the regex I have:
/\:((?:cons@(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:
||
))*")))|(?:[a-z]+(?:@[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g
This is quite hard to read and maintain, even though it works.
Is there a better way of doing this?
Here is an example of the code for my language:
:define:&0:factorial
:param:~0:static
:case
:lower@equal:cons@1
:case:end
:scope
:return:cons@1
:scope:end
:scope
:define:~0:static
:define:~1:static
:require:static
:call:static@sub:^~0:~1 :store:~0
:call:&-1:~0 :store:~1
:call:static@sum:^~0:~1 :store:~0
:return:~0
:scope:end
:define:end
This defines a recursive function to calculate the factorial (not so well written, that isn't important).
The goal is to get what is after the :
, including the @
. :static@sub
is a whole token, saving it without the :
.
Everything is the same, except for the token :cons
, which can take a value after. The value is a numerical value (integer
or float
, called static
or dynamic
in the language, respectively) or a string, which must start and end with "
, supporting escaping like \"
. Multi-line strings aren't supported.
Variables are the ones with ~0
, using ^
before will get the value to the above :scope
.
Functions are similar, being used &0
instead and &-1
points to the current function (no need for ^&-1
here).
Said this, Is there a better way to get the tokens?
Here you can see it in action: http://regex101.com/r/nF7oF9/2
[Update] To issue the pattern being complicated and maintainability, you can split it using PCRE_EXTENDED, and comments:
preg_match('/
# read constant (?)
\:((?:cons@(?:\d+(?:\.\d+)?|
# read a string (?)
(?:"(?:(?:\\\\)+"|[^"]|(?:
||
))*")))|
# read an identifier (?)
(?:[a-z]+(?:@[a-z]+)?|
# read whatever
\^?[\~\&](?:[a-z]+|\d+|\-1)))
/gx
', $input)
Beware that all space are ignored, except under certain conditions ( is normally "safe").
Now, if you want to pimp you lexer and parser, then read that:
What does (f)lex [GNU equivalent of LEX] is simply let you pass a list of regexp, and eventually a "group". You can also try ANTLR and PHP Target Runtime to get the work done.
As for you request, I've made a lexer in the past, following the principle of FLEX. The idea is to cycle through the regexp like FLEX does:
$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS];
$input = ...;
$tokens = [];
while ($input) {
$best = null;
$k = null;
for ($regexp as $re => $kind) {
if (preg_match($re, $input, $match)) {
$best = $match[0];
$k = $kind;
break;
}
}
if (null === $best) {
throw new Exception("could not analyze input, invalid token");
}
$tokens[] = ['kind' => $kind, 'value' => $best];
$input = substr($input, strlen($best)); // move.
}
Since FLEX and Yacc/Bison integrates, the usual pattern is to read until next token (that is, they don't do a loop that read all input before parsing).
The $regexp
array can be anything, I expected it to be a "regexp" => "kind"
key/value, but you can also an array like that:
$regexp = [['reg' => '...', 'kind' => STRING], ...]
You can also enable/disable regexp using groups (like FLEX groups works): for example, consider the following code:
class Foobar {
const FOOBAR = "arg";
function x() {...}
}
There is no need to activate the string regexp until you need to read an expression (here, the expression is what come after the "="). And there is no need to activate the class
identifier when you are actually in a class
.
FLEX's group permits to read comments, using a first regexp, activating some group that would ignore other regexp, until some matches is done (like "*/
").
Note that this approach is a naïve approach: a lexer like FLEX will actually generate an automaton, which use different state to represent your need (the regexp is itself an automaton).
This use an algorithm of packed indexes or something alike (I used the naïve "for each" because I did not understand the algorithm enough) which is memory and speed efficient.
As I said, it was something I made in the past - something like 6/7 years ago.
preg_match
with an offset, to avoid doing the substr($input, ...)
at the end.You should try to use the ANTLR3 PHP Code Generation Target, since the ANTLR grammar editor is pretty easy to use, and you will have a really more readable/maintainable code :)