I have a file that looks something like this:
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
I need to extract whatever is inside the quotes that follow "name=", i.e., content_analyzer, content_analyzer2 and content_analyzer_items.
I am doing this on a Linux box, so a solution using sed, perl, grep or bash is fine.
Reposted from: https://stackoverflow.com/questions/5080988/how-to-extract-string-following-a-pattern-with-grep-regex-or-perl
Since you need to match content without including it in the result (name=" must match, but it is not part of the desired result), some form of zero-width matching or group capturing is required. This can be done easily with the following tools:
With Perl you could use the -n option to loop line by line and print the content of a capturing group if it matches:
perl -ne 'print "$1\n" if /name="(.*?)"/' filename
If you have an improved version of grep, such as GNU grep, you may have the -P option available. This option enables Perl-like regex, allowing you to use \K, which works as a shorthand for a lookbehind: it resets the start of the match, so anything matched before it is left out of the result, as if it were zero-width.
grep -Po 'name="\K.*?(?=")' filename
The -o option makes grep print only the matched text instead of the whole line.
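For comparison, the same extraction can be written with an explicit lookbehind instead of \K. This is only a sketch: it assumes a GNU grep built with PCRE support (-P is not available in every grep), and the temporary file stands in for the real input file.

```shell
# Sample input copied from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
EOF

# (?<=name=") asserts that name=" comes right before the match without
# including it; [^"]* then matches everything up to the closing quote.
names=$(grep -Po '(?<=name=")[^"]*' "$tmp")
printf '%s\n' "$names"

rm -f "$tmp"
```

A lookbehind in PCRE must have a fixed length, which is fine here; \K has no such restriction, which is one reason to prefer it for longer prefixes.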
Another way is to use a text editor directly. With Vim, one of the various ways of accomplishing this would be to delete lines without name= and then extract the content from the resulting lines:
:v/name=/d
:%s/\v.*name\="([^"]+)".*/\1
If you don't have access to these tools for some reason, something similar can be achieved with standard grep. However, without lookaround the output still contains the name=" prefix and the closing quote, so it will require some cleanup afterwards:
grep -o 'name="[^"]*"' filename
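One possible way to do that cleanup is to pipe the result through sed and strip the leading name=" and the trailing quote. This is a sketch, not the only option; the temporary file stands in for the real input file.

```shell
# Sample input copied from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
EOF

# grep -o prints only the matched name="..." fragments;
# sed then removes the name=" prefix and the closing quote.
names=$(grep -o 'name="[^"]*"' "$tmp" | sed 's/^name="//; s/"$//')
printf '%s\n' "$names"

rm -f "$tmp"
```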
In all of the commands above, the results are sent to stdout. Remember that you can always save them by redirecting the output to a file, appending:
> result
to the end of the command.
This could do it:
perl -ne 'if(m/name="(.*?)"/){ print $1 . "\n"; }' filename
The regular expression would be:
.+name="([^"]+)"
The quoted value is then available in the first capture group, \1.
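As a sketch of how a capture group like this could be used from the command line, here is a sed version. It is an adaptation rather than the exact expression above: sed's basic regex syntax needs \( \) for groups, and .* and [^"]* replace .+ and [^"]+ for portability. The temporary file stands in for the real input file.

```shell
# Sample input copied from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
EOF

# -n suppresses default output; the trailing p prints only lines where the
# substitution succeeded, i.e. lines replaced by the captured group \1.
names=$(sed -n 's/.*name="\([^"]*\)".*/\1/p' "$tmp")
printf '%s\n' "$names"

rm -f "$tmp"
```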
If you're using Perl, download a module to parse the XML: XML::Simple, XML::Twig, or XML::LibXML. Don't re-invent the wheel.
An HTML parser should be used for this purpose rather than regular expressions. A Perl program that makes use of HTML::TreeBuilder:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file( \*DATA );
my @elements = $tree->look_down(
    sub { defined $_[0]->attr('name') }
);
for (@elements) {
    print $_->attr('name'), "\n";
}
__DATA__
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
Output:
content_analyzer
content_analyzer2
content_analyzer_items
Here's a solution using HTML Tidy and xmlstarlet:
htmlstr='
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
'
echo "$htmlstr" | tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
sed '/type="global"/d' |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n
Oops, the sed command has to precede the tidy command of course:
echo "$htmlstr" |
sed '/type="global"/d' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n
If the structure of your XML (or text in general) is fixed, the easiest way is to use cut. For your specific case:
echo '<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>' | grep name= | cut -f2 -d '"'
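The same fixed-structure idea can be sketched with a single awk call that does both the filtering and the field splitting. Like the cut version, it assumes the name attribute is the first quoted value on the line; the temporary file stands in for the real input file.

```shell
# Sample input copied from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
EOF

# Split each line on double quotes; on lines containing name=, the second
# field is the text between the first pair of quotes.
names=$(awk -F'"' '/name=/ { print $2 }' "$tmp")
printf '%s\n' "$names"

rm -f "$tmp"
```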