I have a large document and need to parse it and extract only this part: schule.php?schulnr=80287&lschb=
How do I parse it?
<td>
<A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
<center><img border=0 height=16 width=15 src="sh_info.gif"></center>
</A>
</td>
Love to hear from you
You could also do it this way (it's not Perl, but it is more "visual"):
Copy and paste this XPath expression into the text field labeled "XPath:"
//a[contains(@href, "schule")]/@href
Click "Eval" button.
There are also command-line tools for this, e.g. "xmllint" (on Unix):
xmllint --html --xpath '//a[contains(@href, "schule")]/@href' myfile.php.or.html
You can do further processing from there.
You ought to use a DOM parser such as PHP Simple HTML DOM Parser:
// Load the library (assumes simple_html_dom.php is on the include path)
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
In Perl, the quickest and best way I know to scan HTML is HTML::PullParser. It is based on a robust HTML parser rather than a simple finite-state automaton like Perl regexes (without recursion).
It works more like a SAX filter than a DOM.
use 5.010;
use strict;
use warnings;
use constant NOT_FOUND => -1;
use English qw<$OS_ERROR>;
use HTML::PullParser ();

my $pp
    = HTML::PullParser->new(
        # your file or even a handle
        file => 'my.html'
        # specifies that you want a tuple of tagname, attribute hash
      , start => 'tag, attr'
        # you only want to look at tags with tagname = 'a'
      , report_tags => [ 'a' ],
    )
    or die "$OS_ERROR"
    ;

my $anchor_url;
while ( defined( my $t = $pp->get_token )) {
    # skip anything that is not an 'a' start tag; this shouldn't happen, really
    next unless ref $t and $t->[0] eq 'a';
    my $href = $t->[1]->{href};
    # anchors without an href attribute (e.g. <a name="...">) have nothing to match
    next unless defined $href;
    if ( index( $href, 'schule.php?' ) > NOT_FOUND ) {
        $anchor_url = $href;
        last;
    }
}
What Rfvgyhn said, but in Perl flavor since that was one of the tags: use HTML::TreeBuilder
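As a rough illustration of the HTML::TreeBuilder route, here is a minimal sketch. It assumes the module is installed from CPAN; the inline sample markup is copied from the question, and the variable names are only illustrative. For a real document you would probably build the tree with HTML::TreeBuilder->new_from_file('my.html') instead.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup from the question, inlined so the sketch is self-contained.
my $html = <<'HTML';
<td>
<A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
<center><img border=0 height=16 width=15 src="sh_info.gif"></center>
</A>
</td>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down returns every <a> element whose href attribute matches the regex;
# tag and attribute names are lower-cased by the parser.
my @hrefs = map { $_->attr('href') }
            $tree->look_down( _tag => 'a', href => qr/schule\.php\?/ );

print "$_\n" for @hrefs;

# Free the tree explicitly once you are done with it.
$tree->delete;
```

Because look_down takes the regex as a criterion, anchors without an href or with unrelated targets are filtered out by the tree search itself, so no manual loop over all links is needed.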
Also, for the reasons why regex is almost never a good idea for parsing XML/HTML (sometimes it's Good Enough With Major Caveats), read the obligatory and infamous StackOverflow post:
RegEx match open tags except XHTML self-contained tags
Mind you, if the full extent of your task is literally "parse out HREF links", AND you don't have "<link>" tags, AND the links (e.g. HREF="something" substrings) are guaranteed not to appear in any other context (e.g. in comments, as text, or with "HREF=" being part of the link itself), it just might fall into the "Good Enough" category above for regex usage:
my @lines = <>; # Replace with proper method of reading in your file
my @hrefs = map { $_ =~ /href="([^"]+)"/gi; } @lines;