Looking for a robust HTML DOM method to correctly extract the text value of attributes that contain a single apostrophe

As part of a data migration task, I am using PHP to extract the values of the alt and title attributes of img elements from some HTML.

An example of the source html is:

<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>

As you can see, in the source HTML the values of the alt and title attributes are delimited by single apostrophes ('). But within the text itself, an apostrophe is also used in the possessive sense, to indicate vegetables belonging to Andy.

So for a simple parser this is going to be problematic, as it would incorrectly treat the apostrophe within the text as the end of the value, yielding 'Andy' rather than 'Andy's garden vegetables'.
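To make the failure concrete, here is a minimal sketch, assuming PHP's bundled DOM extension (DOMDocument) and its libxml-based HTML parser are available; the variable names are illustrative only. The parser ends the alt value at the first inner apostrophe:

```php
<?php
// The sample markup from above, unchanged.
$html = "<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>";

// The malformed attributes trigger parser warnings; collect them instead of printing.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$img = $doc->getElementsByTagName('img')->item(0);
$alt = $img->getAttribute('alt');

// The value stops at the first inner apostrophe.
var_dump($alt); // string(4) "Andy"
```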

The solution I can think of is to incorporate further surrounding text into a regex to mark the start and end of the attribute value, such as the alt=' at the start and the ' at the end. However, this would not work if there are spaces around the = or if double quotes were used. I think that unescaped single apostrophes inside a single-quoted value may not be legal HTML, but that is the data I have to work with.

Is there a more robust solution than regex, perhaps HTML DOM based, that can handle single apostrophes within the text and distinguish them from those used as delimiters?

I think this is what you're asking for:

(?<=alt='|title=').+(?='\s)

I just used a positive lookbehind and a positive lookahead to identify the attribute names and the closing single apostrophe.

This matches your sample data's alt and title fields by using lookarounds with alternation and a reluctant quantifier (.+?) to ensure the match doesn't skip past quotes to end on the last quote in the input:

(?<=alt='|title=').+?(?='(\s|/))

See a live demo of this regex working with your sample and some edge cases.
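For use from PHP, the reluctant pattern above can be passed to preg_match_all. This is only a sketch against the sample from the question: the variable names are illustrative, and ~ is used as the pattern delimiter so the / inside the regex needs no escaping.

```php
<?php
// The sample markup from the question.
$html = "<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>";

// Lookbehind: the value must follow alt=' or title='.
// .+? : reluctant, so it stops at the first apostrophe that is
//       followed by whitespace or a slash (taken as the closing delimiter).
$pattern = "~(?<=alt='|title=').+?(?='(\s|/))~";

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches: the alt value, then the title value.
print_r($matches[0]);
```

On the sample this yields "Andy's garden vegetables" twice, once for alt and once for title. It remains a heuristic: an apostrophe inside the text that happens to be followed by whitespace or a slash would still be taken as the closing delimiter, and double-quoted attributes or spaces around the = are not matched.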