I am building a PHP lexer using PLY so that I can understand the concepts behind lexing/parsing. I decided to start with a very simple PHP code block:
<?php if (isset($_REQUEST['name'])){
$name = $_REQUEST['name'];
$msg = "Hello, " . $name . "!";
$encoded = htmlspecialchars($msg);
}
?>
Eventually my goal is to parse this so that I can "trace" the user input and verify that it has indeed landed in the htmlspecialchars function (or whatever function I choose). But before I get to parsing, I need to generate the tokens correctly.
I am specifically stuck on tokenizing the following line:
$msg = "Hello, " . $name . "!";
What is the best way to convert everything into tokens? My current code recognizes $msg as a VARIABLE and = as EQUALS, but the rest of the line, "Hello, " . $name . "!", is treated as a single token, which is incorrect.
I am very new to this painful world of lexing/parsing so any help would be appreciated. Here is my current code and results:
import ply.lex as lex
states = ()
delimeters = ('LPAREN', 'RPAREN', 'LBRACKET', 'RBRACKET')
tokens = delimeters + (
    "CHAR",
    "NUM",
    "OPEN_TAG",
    "CLOSE_TAG",
    "php",
    "VARIABLE",
    "CONSTANT_ENCAPSED_STRING",
    "ENCAPSED_AND_WHITESPACE",
    "QUOTED_ENCAPSED_STRING",
    "LCURLYBRACKET",
    "RCURLYBRACKET",
    "EQUALS",
    "SEMICOLON",
    "QUOTE",
    "DOT",
    "IF"
)
t_ignore = " \t"
t_CHAR = r"[a-z]"
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_RBRACKET = r'\]'
t_LBRACKET = r'\['
t_RCURLYBRACKET = r'\}'
t_LCURLYBRACKET = r'\{'
t_EQUALS = r'='
t_SEMICOLON = r';'
t_DOT = r'\.'
def t_newline(t):
    r'\n+'
    # Track line numbers; no token is returned for newlines.
    t.lexer.lineno += t.value.count("\n")

def t_CONSTANT_ENCAPSED_STRING(t):
    r"'([^\\']|\\(.|\n))*'"
    t.lexer.lineno += t.value.count("\n")
    return t

def t_QUOTED_ENCAPSED_STRING(t):
    r"\"([^\\']|\\(.|\n))*\""
    t.lexer.lineno += t.value.count("\n")
    return t

def t_OPEN_TAG(t):
    r'<[?%]((php[ \t\r\n]?)|=)?'
    if '=' in t.value: t.type = 'OPEN_TAG_WITH_ECHO'
    t.lexer.lineno += t.value.count("\n")
    return t

def t_CLOSE_TAG(t):
    r'[?%]>\r?\n?'
    t.lexer.lineno += t.value.count("\n")
    #t.lexer.begin('INITIAL')
    return t

def t_VARIABLE(t):
    r'\$[A-Za-z_][\w_]*'
    return t

def t_QUOTE(t):
    r'"'
    # Nothing is returned, so a lone double quote is discarded.

def t_NUM(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    # Dump some lexer state to help with debugging, then abort.
    print(t.lexer.current_state())
    print(dir(t.lexer))
    raise TypeError("unknown char '%s'" % (t.value))

string = """<?php if (isset($_REQUEST['name'])){
$name = $_REQUEST['name'];
$msg = "Hello, " . $name . "!";
$encoded = htmlspecialchars($msg);
}
?>"""
lex.lex()
lex.input(string)
for tok in iter(lex.token, None):
    print("%r %r" % (tok.type, tok.value))
RESULTS:
'OPEN_TAG' '<?php '
'CHAR' 'i'
'CHAR' 'f'
'LPAREN' '('
'CHAR' 'i'
'CHAR' 's'
'CHAR' 's'
'CHAR' 'e'
'CHAR' 't'
'LPAREN' '('
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'RPAREN' ')'
'RPAREN' ')'
'LCURLYBRACKET' '{'
'VARIABLE' '$name'
'EQUALS' '='
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'SEMICOLON' ';'
'VARIABLE' '$msg'
'EQUALS' '='
'QUOTED_ENCAPSED_STRING' '"Hello, " . $name . "!"'
'SEMICOLON' ';'
'VARIABLE' '$encoded'
'EQUALS' '='
'CHAR' 'h'
'CHAR' 't'
'CHAR' 'm'
'CHAR' 'l'
'CHAR' 's'
'CHAR' 'p'
'CHAR' 'e'
'CHAR' 'c'
'CHAR' 'i'
'CHAR' 'a'
'CHAR' 'l'
'CHAR' 'c'
'CHAR' 'h'
'CHAR' 'a'
'CHAR' 'r'
'CHAR' 's'
'LPAREN' '('
'VARIABLE' '$msg'
'RPAREN' ')'
'SEMICOLON' ';'
'RCURLYBRACKET' '}'
'CLOSE_TAG' '?>'
So all of the tokens seem correct except for the QUOTED_ENCAPSED_STRING token, but I don't know how to fix it. Is this where "states" come in handy?
Specific questions:
How do I fix this so that the tokens are correctly assigned?
What is the correct way of assigning a token to a function/method name? For example, in the above output the htmlspecialchars and isset functions are treated as a bunch of individual characters, but when I get around to parsing I will want tokens that make recognizing function names "easy".
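It turns out that a typo in my regular expression was causing the issue: the character class in the double-quoted string rule excluded a single quote (') instead of a double quote ("), so the match ran on to the last double quote on the line. Updating t_QUOTED_ENCAPSED_STRING to the following breaks each element into its own token:
def t_QUOTED_ENCAPSED_STRING(t):
    # Single-quoted raw string, so the double quotes in the pattern need no escaping.
    r'"([^\\"]|\\(.|\n))*"'
    t.lexer.lineno += t.value.count("\n")
    return t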
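To see the difference between the two patterns outside of PLY, here is a small sketch using only Python's standard re module. It runs the original and the corrected expression against the right-hand side of the problem line. The variable names (buggy, fixed, line) are just for illustration and are not part of my lexer.

```python
import re

# Right-hand side of the line that was tokenized incorrectly.
line = '"Hello, " . $name . "!"'

# Original pattern: the class excludes a backslash and a SINGLE quote,
# so a double quote counts as ordinary string content.
buggy = re.compile(r"\"([^\\']|\\(.|\n))*\"")

# Corrected pattern: the class excludes a backslash and a DOUBLE quote,
# so the match has to stop at the first unescaped closing quote.
fixed = re.compile(r'"([^\\"]|\\(.|\n))*"')

buggy_match = buggy.match(line).group(0)
fixed_match = fixed.match(line).group(0)
all_fixed = [m.group(0) for m in fixed.finditer(line)]

print(buggy_match)  # the whole line comes back as one string
print(fixed_match)  # only the first string literal
print(all_fixed)    # both string literals, found separately
```

Because the repetition is greedy, the original pattern runs to the last double quote on the line, which is why the whole concatenation came back as a single QUOTED_ENCAPSED_STRING.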
Now when I run the app, I get the following output:
'OPEN_TAG' '<?php '
'CHAR' 'i'
'CHAR' 'f'
'LPAREN' '('
'CHAR' 'i'
'CHAR' 's'
'CHAR' 's'
'CHAR' 'e'
'CHAR' 't'
'LPAREN' '('
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'RPAREN' ')'
'RPAREN' ')'
'LCURLYBRACKET' '{'
'VARIABLE' '$name'
'EQUALS' '='
'VARIABLE' '$_REQUEST'
'LBRACKET' '['
'CONSTANT_ENCAPSED_STRING' "'name'"
'RBRACKET' ']'
'SEMICOLON' ';'
'VARIABLE' '$msg'
'EQUALS' '='
'QUOTED_ENCAPSED_STRING' '"Hello, "'
'DOT' '.'
'VARIABLE' '$name'
'DOT' '.'
'QUOTED_ENCAPSED_STRING' '"!"'
'SEMICOLON' ';'
'VARIABLE' '$encoded'
'EQUALS' '='
'CHAR' 'h'
'CHAR' 't'
'CHAR' 'm'
'CHAR' 'l'
'CHAR' 's'
'CHAR' 'p'
'CHAR' 'e'
'CHAR' 'c'
'CHAR' 'i'
'CHAR' 'a'
'CHAR' 'l'
'CHAR' 'c'
'CHAR' 'h'
'CHAR' 'a'
'CHAR' 'r'
'CHAR' 's'
'LPAREN' '('
'VARIABLE' '$msg'
'RPAREN' ')'
'SEMICOLON' ';'
'RCURLYBRACKET' '}'
'CLOSE_TAG' '?>'
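The function names (isset, htmlspecialchars) are still split into CHAR tokens in this output. One common approach to my second question is to match a whole identifier with a single pattern and then look the text up in a table of reserved words. The sketch below illustrates the idea with the standard re module rather than PLY. The token names IDENTIFIER and ISSET, the RESERVED table, and the name_tokens helper are all assumptions made for this example, and it deliberately ignores strings and punctuation.

```python
import re

# Hypothetical keyword table: lower-cased PHP keyword -> token type.
RESERVED = {"if": "IF", "isset": "ISSET"}

# Variables are tried first so that "$msg" is never read as the identifier "msg".
TOKEN_RE = re.compile(r"""
    (?P<VARIABLE>\$[A-Za-z_]\w*)
  | (?P<IDENTIFIER>[A-Za-z_]\w*)
""", re.VERBOSE)

def name_tokens(source):
    """Return (type, value) pairs for variables, keywords and identifiers only."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, value = match.lastgroup, match.group(0)
        if kind == "IDENTIFIER":
            # PHP keywords are case-insensitive, so look up the lower-cased text.
            kind = RESERVED.get(value.lower(), "IDENTIFIER")
        tokens.append((kind, value))
    return tokens

# Input without string literals, since this sketch does not handle them.
result = name_tokens("if (isset($name)) { $encoded = htmlspecialchars($msg); }")
print(result)
```

In PLY the same idea would presumably take the form of one function rule for identifiers that replaces t_CHAR and sets t.type from the lookup. The PLY documentation describes this reserved-word pattern, and any new token names would also have to be added to the tokens tuple.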