Escaping -> and => when parsing HTML with a regular expression

I need to parse our PHP code files and return the tag name and the attributes of tags like this one:

<ct:tagname attr="attr1" attr="attr2">

For this purpose the following regular expression has been constructed:

(\<ct:([^\s\>]*)([^\>]*)\>)

This expression works as expected, but it breaks when the following code is parsed:

<ct:form/input type="attr1" value="$item->field">

The original regular expression breaks because of the > character in $item->field: the attribute group [^\>]* stops at the first >, so the match ends in the middle of the value. I need a regular expression that ignores the > in -> or => but still stops at a standalone >.
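To make the failure concrete, here is a small sketch that runs the pattern above on both tags. It uses Python's `re` module purely for illustration (the question is about PHP, where the same pattern would be passed to `preg_match`); the helper name `first_match` is made up for this example.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r'(\<ct:([^\s\>]*)([^\>]*)\>)')

simple = '<ct:tagname attr="attr1" attr="attr2">'
tricky = '<ct:form/input type="attr1" value="$item->field">'


def first_match(text):
    """Return the text of the first match, or None if nothing matches."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None


print(first_match(simple))  # the whole tag
print(first_match(tricky))  # cut off right after "$item->"
```

The second result ends at the > of ->, which is exactly the truncation described above.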

I am open to any suggestions... Thanks for your help in advance.

You could try a negative lookbehind, like this:

(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)

Matches:

<ct:tagname attr="attr1" attr="attr2">
<ct:form/input type="attr1" value="$item->field">

I am not sure it is the best-suited solution for your case, but it respects the constraints.
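A quick way to check the lookbehind approach, again sketched in Python rather than PHP. The character class `(?<![-=])` is used here as an equivalent spelling of `(?<!-|=)`, and the third sample tag is an invented input that shows where the approach stops working: a bare > inside a quoted value that is not part of -> or =>.

```python
import re

# Lookbehind variant: the closing > must not be preceded by - or =.
LOOKBEHIND = re.compile(r'<ct:([^\s>]*)(.*?)(?<![-=])>')

tricky = '<ct:form/input type="attr1" value="$item->field">'
# Hypothetical input, not from the question: a plain > inside a value.
bare_gt = '<ct:x value="a > b">'

m = LOOKBEHIND.search(tricky)
print(m.group(1))  # form/input
print(m.group(2))  # the full attribute string, including $item->field

# The lookbehind only knows about -> and =>, so this match stops early.
print(LOOKBEHIND.search(bare_gt).group(0))
```

So the pattern meets the stated constraint, but it does not know about quoting.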

I think what you want is not to recognize -> and => specifically, but to ignore everything between pairs of quotes.

I think it can be done by inserting

("[^"]*")*

at the appropriate place, so that a quoted string is consumed as a unit and any > inside it never ends the tag.
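One possible placement, shown as a sketch in Python: the attribute part becomes a repetition of either a double-quoted string or a single character that is neither > nor ". This exact arrangement is an assumption on my part, not something spelled out above, and it only handles double quotes.

```python
import re

# Attribute part: any run of quoted strings and non-quote, non-> characters.
QUOTE_AWARE = re.compile(r'<ct:([^\s>]*)((?:[^>"]|"[^"]*")*)>')

samples = [
    '<ct:tagname attr="attr1" attr="attr2">',
    '<ct:form/input type="attr1" value="$item->field">',
    '<ct:x value="a > b">',  # hypothetical input with a bare > in a value
]

for text in samples:
    m = QUOTE_AWARE.search(text)
    # Each tag is matched in full because quoted text is skipped as a unit.
    print(m.group(1), '|', m.group(2).strip())
```

Because the two alternatives can never start on the same character, the pattern does not backtrack between them.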

In general, any parsing problem rapidly runs into language constructs that are context-free but not regular. It may be a better[1] solution to write a context-free parser, ignoring everything except the elements you're interested in.

[1] "better" as seen from a viewpoint of Being The Right Thing, not necessarily a return on investment one.
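As a rough illustration of the "ignore everything except what you care about" idea, here is a small hand-written scanner in Python. It is not a full context-free parser, only a sketch that tracks whether it is inside a quoted string; the function name `find_ct_tags` and the returned tuple shape are invented for this example.

```python
def find_ct_tags(source):
    """Yield (tag_name, raw_attributes) for each <ct:...> tag in source."""
    prefix = '<ct:'
    pos = 0
    while True:
        start = source.find(prefix, pos)
        if start == -1:
            return
        i = start + len(prefix)
        quote = None  # the quote character we are currently inside, if any
        while i < len(source):
            ch = source[i]
            if quote:
                if ch == quote:
                    quote = None  # closing quote
            elif ch in '"\'':
                quote = ch  # opening quote
            elif ch == '>':
                break  # a > outside quotes ends the tag
            i += 1
        else:
            return  # unterminated tag: stop scanning
        parts = source[start + len(prefix):i].split(None, 1)
        name = parts[0] if parts else ''
        attrs = parts[1].strip() if len(parts) > 1 else ''
        yield name, attrs
        pos = i + 1


text = 'x <ct:form/input type="attr1" value="$item->field"> y <ct:a>'
print(list(find_ct_tags(text)))
```

This is more code than a single regular expression, but each rule (quotes, tag end) lives in one obvious place.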

Try this:

<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>

But if that is XML, you would be better off using an XML parser.
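A sketch of how this pattern could be used, in Python for illustration. The second pattern `ATTR` and the helper `parse_tag` are my own additions for pulling individual name/value pairs out of the captured attribute string; they are not part of the suggestion above.

```python
import re

# The pattern above: attributes must be name="value" or name='value'.
TAG = re.compile(
    r'''<ct:([^\s>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>'''
)
# Hypothetical helper pattern: one attribute, value in either quote style.
ATTR = re.compile(r'''(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')


def parse_tag(text):
    """Return (tag_name, [(attr_name, attr_value), ...]) or None."""
    m = TAG.search(text)
    if m is None:
        return None
    attrs = [(name, dq or sq) for name, dq, sq in ATTR.findall(m.group(2))]
    return m.group(1), attrs


print(parse_tag('<ct:form/input type="attr1" value="$item->field">'))
print(parse_tag('<ct:tagname attr="attr1" attr="attr2">'))
```

The attributes are returned as a list of pairs rather than a dictionary because the first example in the question repeats the name attr, and a dictionary would silently drop one of the two values.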

My suggestion is to match the attributes in the same expression.

\<ct:([^\s\>]*)((\s+([a-z0-9]+)=\"([^\"]*)\")*)\>

edit: removed part about > not being valid xml in attribute values.