I am trying to index some content from a series of .html files that share the same format.
So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that it will start with ">. Then I capture the number inside the []. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text will be interrupted when an underline or italics tag is found. However, if I put (<a|</p), the resulting array ends up empty. I've tried changing that to only (<a), but it seems that two characters mess up the whole thing.
What can I do? I've been struggling with this all day.
As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.
I suggest using an XML parser such as PHP's DOMDocument.
Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read each one's content from its nodeValue property.
It might look like
// Create a DomDocument object
$html = new DOMDocument();
// Load the url's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");
// make an array to hold the text
$anchors = array();
// Loop through the a tags and store them in an array
foreach ($html->getElementsByTagName('a') as $link) {
    $anchors[] = $link->nodeValue;
}
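Note that nodeValue returns plain text, so inner tags such as <i> and <u> are dropped, and the text the question asks for sits after the closing </a> rather than inside it. A minimal sketch of one way to handle that, assuming the markup looks like the sample string below (the sample and the $entries name are made up for illustration): walk the siblings that follow each numbered anchor and serialize them with saveHTML.

```php
<?php
// Sample markup standing in for the real page (an assumption; the actual format may differ).
$sample = '<p><a href="meh">[18]</a> blah <i>blah</i> blah <a href="x">[19]</a> more <u>text</u></p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();

$entries = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    // Only anchors whose text looks like "[18]" qualify.
    if (!preg_match('/^\[(\d+)\]$/', trim($link->nodeValue), $m)) {
        continue;
    }
    // Serialize the following siblings as HTML until the next <a>
    // or the end of the parent element (the </p> case).
    $text = '';
    for ($node = $link->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'a') {
            break;
        }
        $text .= $doc->saveHTML($node);
    }
    $entries[$m[1]] = trim($text);
}

print_r($entries);
```

Each entry is keyed by its number and keeps the formatting tags that appeared in the text.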
One alternative to this style of XML/HTML parser is phpQuery. The documentation on their page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.
PHP Tidy is your friend. Don't use regexes.
Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine for your given example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample where this doesn't work for you.
Though I agree that RegEx isn't a parser, it sounds like what you're looking for is part of a regularly behaved string, which is exactly what RegEx is good at.
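One possible reason the (<a|</p) attempt returned nothing: when the pattern is delimited by /, an unescaped / inside </p ends the pattern early, so PHP rejects it and preg_match_all fails. A hedged sketch of a variant that sidesteps this, assuming input shaped like the made-up sample line below: it uses ~ as the delimiter, a lazy capture, and a lookahead for the next <a or </p.

```php
<?php
// Sample line standing in for the real page (an assumption about the format).
$docString = '<p><a href="meh">[18]</a> blah <i>blah</i> blah <a href="x">[19]</a> more <u>text</u></p>';

// "~" is the delimiter, so the "/" in "</a>" and "</p" needs no escaping.
// The "s" modifier lets "." match newlines in case an entry wraps across lines.
// "\b" keeps "<a" from also matching tags such as "<abbr".
$regex = '~">\[(\d+)\]</a>(.*?)(?=<a\b|</p)~s';

preg_match_all($regex, $docString, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the number, $m[2] the text with inner tags left intact.
    echo $m[1], ' => ', trim($m[2]), "\n";
}
```

Because the lookahead does not consume the next <a, consecutive entries on the same line are each matched in turn.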