PHP regex parsing - splitting tokens for my own language. Is there a better way?

I am creating my own language.

The goal is to "compile" it to PHP or JavaScript and, ultimately, to interpret and run it in the same language, so that it behaves like a "middle-level" language.

Right now, I'm focusing on interpreting it in PHP and running it.

At the moment, I'm using regex to split the string and extract the multiple tokens.

This is the regex I have:

/\:((?:cons@(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:@[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g

This is quite hard to read and maintain, even though it works.

Is there a better way of doing this?

Here is an example of the code for my language:

:define:&0:factorial
    :param:~0:static
    :case
        :lower@equal:cons@1
    :case:end
    :scope
        :return:cons@1
    :scope:end
    :scope
        :define:~0:static
        :define:~1:static
        :require:static
        :call:static@sub:^~0:~1 :store:~0
        :call:&-1:~0 :store:~1
        :call:static@sum:^~0:~1 :store:~0
        :return:~0
    :scope:end
:define:end

This defines a recursive function to calculate the factorial (it isn't well written, but that isn't important).

The goal is to capture what comes after each :, including any @ part. For example, :static@sub is a single token, stored without the leading :.

All tokens follow this shape except :cons, which takes a value after the @. The value is either a number (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with " and supports escapes such as \". Multi-line strings aren't supported.

Variables are the ones written like ~0; a leading ^ gets the value from the :scope above.

Functions are similar but use &0 instead, and &-1 points to the current function (no need for ^&-1 here).

That said, is there a better way to get the tokens?

Here you can see it in action: http://regex101.com/r/nF7oF9/2

[Update] To address the complexity and maintainability of the pattern, you can split it across several lines using PCRE_EXTENDED (the x modifier) and add comments:

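$pattern = <<<'REGEX'
/
  # read constant (?)
  \:((?:cons@(?:\d+(?:\.\d+)?|
  # read a string (?)
  (?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
  # read an identifier (?)
  (?:[a-z]+(?:@[a-z]+)?|
  # read whatever
  \^?[\~\&](?:[a-z]+|\d+|\-1)))
/x
REGEX;
// The nowdoc passes backslashes to PCRE unchanged. PHP has no g modifier:
// preg_match_all() is what collects every match.
preg_match_all($pattern, $input, $matches);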

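Beware that in extended mode all whitespace in the pattern is ignored, except under certain conditions (an escaped space, or a space inside a character class such as [ ], is normally "safe").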
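To make the whitespace rule concrete, here is a minimal sketch. The patterns and subjects are made up for the illustration, and the variable names are arbitrary:

```php
<?php
// With the x modifier, unescaped whitespace in the pattern is dropped.
$ignored = preg_match('/a b/x', 'ab');    // pattern behaves like /ab/
$missed  = preg_match('/a b/x', 'a b');   // the pattern's space matches nothing
$inClass = preg_match('/a[ ]b/x', 'a b'); // a space inside a character class is kept
$escaped = preg_match('/a\ b/x', 'a b');  // an escaped space is kept

var_dump($ignored, $missed, $inClass, $escaped);
```

So any literal space your tokens need has to be written in one of the last two forms once you switch to extended mode.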

Now, if you want to pimp your lexer and parser, read on:

What (f)lex [the free equivalent of LEX] does is simply let you pass a list of regexps, each optionally attached to a "group". You can also try ANTLR and its PHP Target Runtime to get the work done.

As for your request, I made a lexer in the past following the principle of FLEX. The idea is to cycle through the regexps like FLEX does:

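// reg1, reg2, reg3 stand for complete patterns. Each one must be anchored
// at the start of the input (for example with the A modifier).
$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS];
$input = ...;
$tokens = [];
while ($input !== '') {
  $best = null;
  $k = null;
  foreach ($regexp as $re => $kind) {
    // the first regexp that matches a non-empty prefix wins
    if (preg_match($re, $input, $match) && $match[0] !== '') {
      $best = $match[0];
      $k = $kind;
      break;
    }
  }

  if (null === $best) {
    throw new Exception("could not analyze input, invalid token");
  }

  $tokens[] = ['kind' => $k, 'value' => $best];

  $input = (string) substr($input, strlen($best)); // move.
}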

Since FLEX and Yacc/Bison integrate with each other, the usual pattern is to read only up to the next token when the parser asks for one (that is, they don't loop over the whole input before parsing).

The $regexp array can have any shape. I wrote it as a "regexp" => "kind" key/value map, but you can also use an array like this:
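As a rough sketch of that pull style, a PHP generator can hand tokens over one at a time. This assumes PHP 7.0 or later. The RULES table, the kind names and lex() are all made up for this example and only approximate the language in the question; they are not part of any library:

```php
<?php
// Hypothetical rules. The first match wins, so the specific :cons rules come
// before the generic word rule. The A modifier anchors each pattern at the
// start of the remaining input.
const RULES = [
    'WS'     => '/\s+/A',
    'NUMBER' => '/:cons@\d+(?:\.\d+)?/A',
    'STRING' => '/:cons@"(?:\\\\.|[^"\\\\])*"/A',
    'REF'    => '/:\^?[~&](?:\d+|-1|[a-z]+)/A',
    'WORD'   => '/:[a-z]+(?:@[a-z]+)?/A',
];

// Yields one token at a time, so a parser can pull tokens on demand.
function lex(string $input): Generator
{
    while ($input !== '') {
        $token = null;
        foreach (RULES as $kind => $re) {
            if (preg_match($re, $input, $m) && $m[0] !== '') {
                $token = ['kind' => $kind, 'value' => $m[0]];
                break;
            }
        }
        if ($token === null) {
            throw new Exception('invalid token near: ' . substr($input, 0, 20));
        }
        $input = (string) substr($input, strlen($token['value']));
        if ($token['kind'] !== 'WS') {
            yield $token; // whitespace is skipped, everything else is handed over
        }
    }
}

foreach (lex(':call:static@sub:^~0:~1 :store:~0') as $t) {
    echo $t['kind'], ' ', $t['value'], PHP_EOL;
}
```

Nothing is scanned until the consumer asks for the next token, which is the same division of labour as between a FLEX scanner and a Bison parser. The values keep their leading ":" here; stripping it is a one-line change if you prefer the form described in the question.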

$regexp = [['reg' => '...', 'kind' => STRING], ...]

You can also enable or disable regexps using groups (the way FLEX groups work). For example, consider the following code:

class Foobar {
  const FOOBAR = "arg";
  function x() {...}  
}

There is no need to activate the string regexp until you need to read an expression (here, the expression is what comes after the "="). Likewise, there is no need to activate the class-identifier regexp once you are already inside a class.

FLEX's groups also make it possible to read comments: a first regexp matches the comment opener and activates a group in which the other regexps are ignored, until a closing match (like "*/") is found.

Note that this is a naïve approach: a lexer like FLEX actually generates an automaton, which uses different states to represent your rules (a regexp is itself an automaton).

The generated automaton uses packed index tables or something similar, which is efficient in both memory and speed (I used the naïve "for each" because I did not understand the algorithm well enough).
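Here is a minimal sketch of such groups, assuming PHP 7.1 or later. The STATES table, its [pattern, kind, next state] rule format and lexWithStates() are my own invention for this example, not FLEX syntax:

```php
<?php
// Hypothetical rule format:
// [pattern, token kind (null = discard), state to switch to (null = stay)].
const STATES = [
    'INITIAL' => [
        ['/\/\*/A',              null,    'COMMENT'], // comment opener: switch group
        ['/\s+/A',               null,    null],
        ['/[a-z_][a-z0-9_]*/Ai', 'ID',    null],
        ['/[{}();=]/A',          'PUNCT', null],
    ],
    'COMMENT' => [
        ['/\*\//A',  null, 'INITIAL'], // closing */ re-enables the normal rules
        ['/[^*]+/A', null, null],      // comment body
        ['/\*/A',    null, null],      // a lone * inside the comment
    ],
];

function lexWithStates(string $input): array
{
    $state = 'INITIAL';
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $matched = false;
        // only the rules of the active group are tried
        foreach (STATES[$state] as [$re, $kind, $next]) {
            if (preg_match($re, $input, $m, 0, $offset) && $m[0] !== '') {
                if ($kind !== null) {
                    $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                }
                $state = $next ?? $state;
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new Exception("invalid token at offset $offset in state $state");
        }
    }
    return $tokens;
}

print_r(lexWithStates('class /* x = 1; */ Foobar { }'));
```

While the COMMENT group is active the identifier and punctuation rules are simply never tried, so the text inside the comment produces no tokens. This sketch does not report an unterminated comment; a real lexer would check the final state.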
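A cheap middle ground, short of generating an automaton yourself, is to join all the rules into one alternation with named groups so that the regex engine picks the rule in a single call per token. To be clear, this is not what FLEX does internally; it is only a sketch, and the rule fragments, combineRules() and lexCombined() are hypothetical names for this example:

```php
<?php
// Hypothetical rule fragments (no delimiters); the keys become group names.
$rules = [
    'WS'     => '\s+',
    'NUMBER' => ':cons@\d+(?:\.\d+)?',
    'REF'    => ':\^?[~&](?:\d+|-1|[a-z]+)',
    'WORD'   => ':[a-z]+(?:@[a-z]+)?',
];

// Builds /(?<WS>...)|(?<NUMBER>...)|.../A from the fragments.
function combineRules(array $rules): string
{
    $parts = [];
    foreach ($rules as $kind => $fragment) {
        $parts[] = "(?<$kind>$fragment)";
    }
    return '/' . implode('|', $parts) . '/A';
}

function lexCombined(string $input, array $rules): array
{
    $regex = combineRules($rules);
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        if (!preg_match($regex, $input, $m, 0, $offset) || $m[0] === '') {
            throw new Exception("invalid token at offset $offset");
        }
        // the only non-empty named group tells us which rule matched
        foreach ($rules as $kind => $unused) {
            if (isset($m[$kind]) && $m[$kind] !== '') {
                $tokens[] = ['kind' => $kind, 'value' => $m[$kind]];
                break;
            }
        }
        $offset += strlen($m[0]);
    }
    return $tokens;
}

print_r(lexCombined(':return:cons@1', $rules));
```

The alternation still takes the first alternative that matches rather than the longest one, so rule order matters exactly as in the "for each" version.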

As I said, it was something I made in the past - something like 6/7 years ago.

It was on Windows.
It was not particularly quick (well, it is O(N²) because of the two loops).
I also think PHP was compiling the regexp each time. Now that I work in Java, I use the Pattern class, which compiles the regexp once and lets you reuse it. I don't know whether PHP does something similar by first checking a regexp cache for an already compiled regexp.
I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
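For reference, the offset variant can be sketched as follows. The $rules array and tokenizeWithOffset() are hypothetical names for this example. As far as I know, PHP's PCRE extension also keeps an internal cache of compiled patterns, so calling preg_match repeatedly with the same pattern strings should not recompile them each time:

```php
<?php
// Hypothetical rules. The A modifier anchors each match at $offset
// (a leading ^ would only match at the very start of the string).
$rules = ['NUMBER' => '/\d+/A', 'ID' => '/[a-z]+/A', 'WS' => '/\s+/A'];

function tokenizeWithOffset(string $input, array $rules): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $matched = false;
        foreach ($rules as $kind => $re) {
            // 5th argument: start matching at $offset, no copy of the input
            if (preg_match($re, $input, $m, 0, $offset) && $m[0] !== '') {
                $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new Exception("invalid token at offset $offset");
        }
    }
    return $tokens;
}

print_r(tokenizeWithOffset('abc 42', $rules));
```

Keeping the input intact and moving only an integer avoids copying the remaining string after every token.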

You should try the ANTLR3 PHP Code Generation Target: the ANTLR grammar editor is pretty easy to use, and you will end up with much more readable and maintainable code :)
