I need to parse and return the tagname and the attributes in our PHP code files:
<ct:tagname attr="attr1" attr="attr2">
For this purpose the following regular expression has been constructed:
(\<ct:([^\s\>]*)([^\>]*)\>)
This expression works as expected, but it breaks when the following code is parsed:
<ct:form/input type="attr1" value="$item->field">
The original regular expression breaks because of the > character in $item->field: the match stops there instead of at the end of the tag. I need a regular expression that ignores the > in -> or => but still treats a single > as the end of the tag.
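To make the failure concrete, here is a minimal sketch. It uses Python's re module only because it is easy to run; the pattern is the one above, unchanged, and the variable names are made up for the illustration.

```python
import re

# The original pattern from the question, unchanged.
ORIGINAL = re.compile(r'(\<ct:([^\s\>]*)([^\>]*)\>)')

simple = '<ct:tagname attr="attr1" attr="attr2">'
tricky = '<ct:form/input type="attr1" value="$item->field">'

m_simple = ORIGINAL.search(simple)
m_tricky = ORIGINAL.search(tricky)

# The simple tag is matched in full.
print(m_simple.group(1) == simple)

# The tricky tag is cut off at the ">" inside "$item->field".
print(m_tricky.group(1))  # <ct:form/input type="attr1" value="$item->
```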
I am open to any suggestions... Thanks for your help in advance.
You could try using a negative lookbehind like this:
(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)
Matches:
<ct:tagname attr="attr1" attr="attr2">
<ct:form/input type="attr1" value="$item->field">
I'm not sure it is the best-suited solution for your case, but it respects the constraints.
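A quick way to check this, sketched in Python's re module for convenience (the pattern is the one from this answer, the test strings are the two from the question, and the third string is a made-up case to show a limitation):

```python
import re

# Pattern from this answer; (?<!-|=) rejects a ">" preceded by "-" or "=".
LOOKBEHIND = re.compile(r'(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)')

tags = [
    '<ct:tagname attr="attr1" attr="attr2">',
    '<ct:form/input type="attr1" value="$item->field">',
]

results = []
for tag in tags:
    m = LOOKBEHIND.search(tag)
    # group(2) is the tag name, group(3) the raw attribute string.
    results.append((m.group(2), m.group(3).strip()))

for name, attrs in results:
    print(name, '|', attrs)

# Limitation: a bare ">" inside a quoted value still ends the match early.
cut = LOOKBEHIND.search('<ct:x note="a > b">').group(1)
print(cut)  # <ct:x note="a >
```

The lookbehind could equally be written with a character class, (?<![-=]), which means the same thing.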
I think what you want to do is not to recognize the -> and =>, but to ignore everything between pairs of quotes. I think it can be done by inserting ("[^"]*")* at the opportune place.
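One possible placement, which is my own assumption rather than something stated above, is to let the attribute part be any mix of plain characters (neither a quote nor >) and complete double-quoted strings. A minimal sketch in Python's re module:

```python
import re

# Assumed placement: the attribute part is any run of characters that are
# neither '"' nor '>', interleaved with complete double-quoted strings.
QUOTE_AWARE = re.compile(r'(\<ct:([^\s\>]*)((?:[^\>"]|"[^"]*")*)\>)')

tag = '<ct:form/input type="attr1" value="$item->field">'
m = QUOTE_AWARE.search(tag)
print(m.group(2))  # form/input
print(m.group(3))

# A bare ">" inside quotes is skipped as well.
other_tag = '<ct:x note="a > b" k="v">'
other = QUOTE_AWARE.search(other_tag)
print(other.group(1) == other_tag)
```

This sketch only handles double quotes; single-quoted values would need a second alternative of the same shape.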
In general, any parsing problem rapidly runs into language constructs that are context-free but not regular. It may be a better[1] solution to write a context-free parser, ignoring everything except the elements you're interested in.
[1] "better" as seen from a viewpoint of Being The Right Thing, not necessarily a return on investment one.
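As a rough illustration of that direction, here is a small hand-written scanner. It is not a full context-free parser, only a loop that tracks whether it is currently inside quotes. The function name scan_ct_tags and the sample string are hypothetical, and it is written in Python purely for the sake of a runnable sketch.

```python
def scan_ct_tags(source):
    """Yield (tag_name, raw_attributes) for each <ct:...> tag in source."""
    opener = '<ct:'
    pos = 0
    while True:
        start = source.find(opener, pos)
        if start == -1:
            return
        i = start + len(opener)
        quote = None  # the quote character we are currently inside, if any
        while i < len(source):
            ch = source[i]
            if quote:
                if ch == quote:
                    quote = None   # closing quote
            elif ch in '"\'':
                quote = ch         # opening quote
            elif ch == '>':
                break              # a ">" outside quotes ends the tag
            i += 1
        else:
            return                 # unterminated tag: stop scanning
        body = source[start + len(opener):i]
        parts = body.split(None, 1)
        name = parts[0] if parts else ''
        attrs = parts[1].strip() if len(parts) > 1 else ''
        yield name, attrs
        pos = i + 1


sample = 'x <ct:form/input type="attr1" value="$item->field"> y <ct:tagname attr="attr1">'
found = list(scan_ct_tags(sample))
print(found)
```

Everything outside the ct: tags is skipped, which matches the idea of ignoring all but the elements of interest.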
Try this:
<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>
But if that's XML, you should use an XML parser instead.
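A sketch of how this pattern could be used, in Python's re module for convenience. The tag pattern is the one from this answer; the second pattern ATTR and the helper parse_tag are hypothetical additions of mine that split the captured attribute string into name/value pairs.

```python
import re

# Tag pattern from this answer, unchanged.
TAG = re.compile(r'''<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>''')
# Hypothetical helper pattern (not from the answer) for a single attribute.
ATTR = re.compile(r'''(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')


def parse_tag(text):
    """Return (tag_name, [(attr_name, attr_value), ...]) or None if no tag matches."""
    m = TAG.search(text)
    if m is None:
        return None
    attrs = []
    for a in ATTR.finditer(m.group(2)):
        # Exactly one of groups 2 and 3 took part in the match.
        value = a.group(2) if a.group(2) is not None else a.group(3)
        attrs.append((a.group(1), value))
    return m.group(1), attrs


print(parse_tag('<ct:form/input type="attr1" value="$item->field">'))
# ('form/input', [('type', 'attr1'), ('value', '$item->field')])
```

Because the quoted values are matched as whole units, a > inside a value no longer ends the tag.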
My suggestion is to match the attributes in the same expression.
\<ct:([^\s\>]*)((\s+([a-z0-9]+)=\"([^\"]*)\")*)\>
Edit: removed the part about > not being valid XML in attribute values.
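A quick check of this approach in Python's re module. Two things in the pattern below are my reading of the intent and should be treated as assumptions: the character class [a-z0-9] (so that names like "type" match) and the \s+ in front of each attribute (so that whitespace-separated attributes match).

```python
import re

# Assumed variant: [a-z0-9] for attribute names, \s+ before each attribute.
PATTERN = re.compile(r'\<ct:([^\s\>]*)((\s+([a-z0-9]+)=\"([^\"]*)\")*)\>')

m = PATTERN.search('<ct:form/input type="attr1" value="$item->field">')
print(m.group(1))  # tag name
print(m.group(2))  # whole attribute string

# A repeated capturing group keeps only its last iteration,
# so groups 4 and 5 hold the final attribute only.
print(m.group(4), m.group(5))  # value $item->field
```

So the single expression is enough to delimit the tag correctly, but getting every attribute individually still takes a second pass over group 2.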