How do I extract the contents of an alt attribute with a regex?
Given this text:
<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg " alt=" I want to get this text " width=" 400 " height="300" /></a>
How do I match only this part: I want to get this text
I've tried alt=".*"
but this yields alt=" I want to get this text " width=" 400 " height="300"
which is undesirable.
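The over-match happens because .* is greedy and runs to the last double quote on the line. A minimal Python sketch of the difference follows; Python is an assumption here, since the question does not say which regex engine is in use, and the variable names are illustrative only.

```python
import re

# The text from the question, copied verbatim.
html = '''<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg " alt=" I want to get this text " width=" 400 " height="300" /></a>'''

# Greedy: .* consumes as much as it can, so the match ends at the LAST quote.
greedy = re.search(r'alt=".*"', html).group(0)

# Lazy: .*? stops at the first quote that lets the rest of the pattern match.
lazy = re.search(r'alt="(.*?)"', html).group(1)

# Negated class: [^"]* can never step over a double quote at all.
negated = re.search(r'alt="([^"]*)"', html).group(1)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

Both the lazy and the negated-class versions keep the surrounding spaces inside the quotes, so the captured value may still need trimming.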
You should really use an HTML parser for this, but you seem to have creative control over the source string, and if the input really is this simple then there should be fewer edge cases to worry about.
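For comparison, here is a rough sketch of the parser route using Python's standard html.parser module. The class name AltCollector and the choice of Python are assumptions for illustration, not something from the question.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collects the alt value of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        # Self-closing tags such as <img ... /> also arrive here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value is not None:
                    self.alts.append(value.strip())


sample = '''<img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" />
<img onmouseover=' alt="This is not the droid you are looking for" ;' class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />'''

collector = AltCollector()
collector.feed(sample)
collector.close()
print(collector.alts)
```

The parser treats the quoted onmouseover value as a single attribute, so the alt text hidden inside it is not picked up. If a regex is still preferred for simple, controlled input, the expression below is one option.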
<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>
This regular expression will do the following: match img tags that have an alt attribute, then capture the alt attribute value and put it into capture group 1.
Live Demo
https://regex101.com/r/cN0lD4/2
Sample text
Note the difficult edge case in the second img tag.
<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" /></a>
<img onmouseover=' alt="This is not the droid you are looking for" ;' class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
Sample Matches
For each match, index 0 holds the entire img tag and index 1 holds the value of the alt attribute, not including any surrounding quotes.
[0][0] = <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" />
[0][1] = I want to get this text
[1][0] = <img onmouseover=' alt="This is not the droid you are looking for" ;' class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
[1][1] = This is the droid I'm looking for.
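The matches above can be reproduced with a short script. This is a sketch in Python (an assumption; the answer does not name a language), and ALT_PATTERN, sample_text and matches are illustrative names.

```python
import re

# The expression from the answer, split across two raw strings for readability.
ALT_PATTERN = re.compile(
    r'''<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)'''
    r'''(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>'''
)

# The sample text from the answer, copied verbatim.
sample_text = '''<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" /></a>
<img onmouseover=' alt="This is not the droid you are looking for" ;' class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />'''

# Group 0 is the whole img tag, group 1 is the alt value.
matches = [(m.group(0), m.group(1)) for m in ALT_PATTERN.finditer(sample_text)]

for index, (tag, alt) in enumerate(matches):
    print(f"[{index}][0] = {tag}")
    print(f"[{index}][1] = {alt}")
```

The captured value keeps any whitespace that sits inside the quotes, so call strip() on it if that matters.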
NODE EXPLANATION
----------------------------------------------------------------------
<img '<img'
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
alt= 'alt='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
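The part of the breakdown that tends to surprise people is the capture group sitting inside the look-ahead: the look-ahead finds and captures the alt value without consuming any text, and the rest of the pattern then consumes the tag. A stripped-down sketch of that idea in Python follows; the shortened pattern and the tag string are made up for illustration and are not the full expression.

```python
import re

# Hypothetical, simplified tag used only to illustrate the technique.
tag = '<img src="a.png" alt="hello" width="4">'

# The look-ahead (?= ... ) locates and captures the alt value, then the
# engine returns to the position right after "<img " and [^>]*> consumes
# the rest of the tag.
pattern = re.compile(r'<img\s(?=[^>]*?\salt="([^"]*)")[^>]*>')

m = pattern.match(tag)
print(m.group(0))  # the whole tag
print(m.group(1))  # the alt value captured inside the look-ahead
```

This simplified pattern uses [^>]* and so does not protect against alt= appearing inside another quoted attribute; the alternations in the full expression are what handle that.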
Thanks to those who helped solve this:
'/<img.*?alt="(.*?)".*>/'
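A quick way to see how this shorter expression behaves is the sketch below. It is written in Python as an assumption, so the surrounding '/' delimiters are dropped, and the variable names are illustrative.

```python
import re

# Same expression as above, without the '/' delimiters.
simple = re.compile(r'<img.*?alt="(.*?)".*>')

# The text from the question.
question_text = '''<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg " alt=" I want to get this text " width=" 400 " height="300" /></a>'''

# The edge-case tag from the answer's sample text.
edge_case = '''<img onmouseover=' alt="This is not the droid you are looking for" ;' class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />'''

first = simple.search(question_text)
second = simple.search(edge_case)

print(repr(first.group(1)))   # alt value from the question's text
print(repr(first.group(0)))   # the trailing .*> runs on to the last '>' of the line
print(repr(second.group(1)))  # picks up the alt= inside the onmouseover value
```

For input like the question's this captures the wanted text in group 1. The full match extends to the last > on the line, and an alt= inside another attribute's value is captured first, so it fits best when the markup is under your control.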