Creating lexing tokens for PHP using PLY

I am building a PHP lexer using PLY so that I can understand the concepts behind lexing/parsing. I decided to start with a very simple PHP code block:

<?php if (isset($_REQUEST['name'])){
    $name     = $_REQUEST['name'];
    $msg      = "Hello, " . $name . "!";
    $encoded  = htmlspecialchars($msg);
}
?>

Eventually my goal will be to parse this so that I can "trace" the user-input and verify that it has indeed landed in the htmlspecialchars function (or whatever function I choose). But before I get to parsing I need to generate the tokens correctly.

I am specifically stuck on tokenizing the following line:

$msg     = "Hello, " . $name . "!";

What is the best way to convert this line into tokens? My current code recognizes $msg as a variable and = as the correct token, but the rest of the line, "Hello, " . $name . "!", is treated as a single token, which is incorrect.

I am very new to this painful world of lexing/parsing so any help would be appreciated. Here is my current code and results:

import ply.lex as lex

states = (
    )


delimeters = ('LPAREN', 'RPAREN', 'LBRACKET', 'RBRACKET')

tokens = delimeters + (
    "CHAR",
    "NUM",
    "OPEN_TAG",
    "CLOSE_TAG",
    "php",
    "VARIABLE",
    "CONSTANT_ENCAPSED_STRING",
    "ENCAPSED_AND_WHITESPACE",
    "QUOTED_ENCAPSED_STRING",
    "LCURLYBRACKET",
    "RCURLYBRACKET",
    "EQUALS",
    "SEMICOLON",
    "QUOTE",
    "DOT",
    "IF"
)



t_ignore         = " \t"
t_CHAR           = r"[a-z]"
t_LPAREN         = r'\('
t_RPAREN         = r'\)'
t_RBRACKET       = r'\]'
t_LBRACKET       = r'\['
t_RCURLYBRACKET  = r'\}'
t_LCURLYBRACKET  = r'\{'
t_EQUALS         = r'='
t_SEMICOLON      = r';'
t_DOT            = r'\.'


def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")


def t_CONSTANT_ENCAPSED_STRING(t):
    r"'([^\\']|\\(.|\n))*'"
    t.lexer.lineno += t.value.count("\n")
    return t

def t_QUOTED_ENCAPSED_STRING(t):
    r"\"([^\\']|\\(.|\n))*\""
    t.lexer.lineno += t.value.count("\n")
    return t


def t_OPEN_TAG(t):
    r'<[?%]((php[ \t\r\n]?)|=)?'
    if '=' in t.value: t.type = 'OPEN_TAG_WITH_ECHO'
    t.lexer.lineno += t.value.count("\n")
    return t

def t_CLOSE_TAG(t):
    r'[?%]>\r?\n?'
    t.lexer.lineno += t.value.count("\n")
    #t.lexer.begin('INITIAL')
    return t

def t_VARIABLE(t):
    r'\$[A-Za-z_][\w_]*'
    return t

def t_QUOTE(t):
    r'"'


def t_NUM(t):
    r"\d+"
    t.value = int(t.value)
    return t

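def t_error(t):
    # current_state() is a method on the PLY lexer object
    print(t.lexer.current_state())
    print(dir(t.lexer))
    # t.value holds the rest of the input; report only the offending character
    raise TypeError("unknown char '%s'" % t.value[0])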


string = """<?php if (isset($_REQUEST['name'])){
               $name = $_REQUEST['name'];
               $msg = "Hello, " . $name . "!";
               $encoded = htmlspecialchars($msg);
}
?>"""

lex.lex()

lex.input(string)
for tok in iter(lex.token, None):
    print("%r %r" % (tok.type, tok.value))

RESULTS:

'OPEN_TAG' '<?php '
'CHAR' 'i'
'CHAR' 'f'
'LPAREN' '('
'CHAR' 'i'
'CHAR' 's'
'CHAR' 's'
'CHAR' 'e'
'CHAR' 't'
'LPAREN' '('
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'RPAREN' ')'
'RPAREN' ')'
'LCURLYBRACKET' '{'
'VARIABLE' '$name'
'EQUALS' '='
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'SEMICOLON' ';'
'VARIABLE' '$msg'
'EQUALS' '='
'QUOTED_ENCAPSED_STRING' '"Hello, " . $name . "!"'
'SEMICOLON' ';'
'VARIABLE' '$encoded'
'EQUALS' '='
'CHAR' 'h'
'CHAR' 't'
'CHAR' 'm'
'CHAR' 'l'
'CHAR' 's'
'CHAR' 'p'
'CHAR' 'e'
'CHAR' 'c'
'CHAR' 'i'
'CHAR' 'a'
'CHAR' 'l'
'CHAR' 'c'
'CHAR' 'h'
'CHAR' 'a'
'CHAR' 'r'
'CHAR' 's'
'LPAREN' '('
'VARIABLE' '$msg'
'RPAREN' ')'
'SEMICOLON' ';'
'RCURLYBRACKET' '}'
'CLOSE_TAG' '?>'

So all of the tokens seem correct except for the QUOTED_ENCAPSED_STRING token, but I don't know how to fix it. Is this where "states" come in handy?

Specific questions:

  1. How do I fix this so that the tokens are correctly assigned?

  2. What is the correct way of assigning a token to a function/method name? For example, in the above output the htmlspecialchars and isset functions are treated as a bunch of individual characters, but when I get around to parsing I will want tokens that make recognizing function names "easy".

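It turns out that a typo in my regular expression was causing the issue: the character class in the double-quoted string rule excluded ' instead of ", so the match ran past the first closing double quote to the last one on the line.

Updating t_QUOTED_ENCAPSED_STRING to the following allowed the lexer to break each element into its own token: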

# Single-quoted raw string so the pattern can contain a bare double quote
def t_QUOTED_ENCAPSED_STRING(t):
    r'"([^\\"]|\\(.|\n))*"'
    t.lexer.lineno += t.value.count("\n")
    return t
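
To see why the character class matters, both patterns can be tried on the problem line with Python's standard re module, outside of PLY. This is only a sketch: the names buggy, fixed, and line are made up for the illustration, and the patterns are written as single-quoted raw strings so they can contain a bare double quote.

```python
import re

# Original pattern: the character class excludes ' (and backslash),
# so a double quote inside the line does not stop the match.
buggy = re.compile(r'"([^\\\']|\\(.|\n))*"')

# Corrected pattern: the character class excludes " (and backslash).
fixed = re.compile(r'"([^\\"]|\\(.|\n))*"')

line = '"Hello, " . $name . "!";'

print(repr(buggy.match(line).group(0)))
print(repr(fixed.match(line).group(0)))
```

The first pattern runs to the last double quote on the line, which is the single large token in the original results; the second stops at the first closing quote.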

Now when I run the app, I get the following output:

'OPEN_TAG' '<?php '
'CHAR' 'i'
'CHAR' 'f'
'LPAREN' '('
'CHAR' 'i'
'CHAR' 's'
'CHAR' 's'
'CHAR' 'e'
'CHAR' 't'
'LPAREN' '('
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'RPAREN' ')'
'RPAREN' ')'
'LCURLYBRACKET' '{'
'VARIABLE' '$name'
'EQUALS' '='
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'SEMICOLON' ';'
'VARIABLE' '$msg'
'EQUALS' '='
'QUOTED_ENCAPSED_STRING' '"Hello, "'
'DOT' '.'
'VARIABLE' '$name'
'DOT' '.'
'QUOTED_ENCAPSED_STRING' '"!"'
'SEMICOLON' ';'
'VARIABLE' '$encoded'
'EQUALS' '='
'CHAR' 'h'
'CHAR' 't'
'CHAR' 'm'
'CHAR' 'l'
'CHAR' 's'
'CHAR' 'p'
'CHAR' 'e'
'CHAR' 'c'
'CHAR' 'i'
'CHAR' 'a'
'CHAR' 'l'
'CHAR' 'c'
'CHAR' 'h'
'CHAR' 'a'
'CHAR' 'r'
'CHAR' 's'
'LPAREN' '('
'VARIABLE' '$msg'
'RPAREN' ')'
'SEMICOLON' ';'
'RCURLYBRACKET' '}'
'CLOSE_TAG' '?>'
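
For question 2, the usual approach (and the one PLY's documentation describes for reserved words) is to match a whole name with one rule and then choose the token type by dictionary lookup, rather than emitting one CHAR token per letter. The sketch below shows only that lookup logic using the standard re module; the names RESERVED, IDENTIFIER, and classify, and the ISSET and IDENTIFIER token types, are hypothetical and not part of the lexer above. In PLY the same lookup would sit in the body of a function rule that sets t.type, and the t_CHAR rule would be dropped.

```python
import re

# Hypothetical reserved-word table (an assumption, not from the original lexer).
RESERVED = {"if": "IF", "isset": "ISSET"}

# One pattern for a whole name, instead of one CHAR token per letter.
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def classify(word):
    """Return (token_type, value) for a complete name."""
    if IDENTIFIER.fullmatch(word) is None:
        raise ValueError("not an identifier: %r" % word)
    # PHP keywords are case-insensitive, so compare in lowercase.
    return (RESERVED.get(word.lower(), "IDENTIFIER"), word)


for word in ("if", "isset", "htmlspecialchars"):
    print(classify(word))
```

With names arriving as single tokens, a later parsing stage can recognize a function call as a name token followed by LPAREN.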