I want to match the address of a property on a realty server. Let's say the div containing the address is <div class="title"> and the address is located in the last <h2> section, like this:
<body>
<div class="price">
<h2>
h2
</h2>
</div>
<div class="title">
<abcd>
abcd
</abcd>
<efg>
efg
</efg>
<h2>
adress
</h2>
</div>
</body>
Is there a way to capture the address with a single regex, even if it ends up in a capture group?
My non-working attempt is:
regex="/<div class="title">everything_except_<h2>*([^<]*)/";
Try this regex:
<div class="title">(?:.(?!<\/div>))*<h2>([^<]*)
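The main point here is to make the .* after <div class="title"> greedy, but let it match only until </div> is found. So the regex restricts . to those characters that are not followed by </div>, which gives us (?:.(?!<\/div>))* as a result. Because the repetition is greedy, the engine backtracks from the end of the div to the last <h2> inside it. For multi-line markup like the sample, . must also be allowed to match line breaks (the s / DOTALL flag in most engines).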
Demo: https://regex101.com/r/2EGXue/1
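As a rough check outside regex101, here is a minimal Python sketch that runs the same pattern against the sample from the question. Python and the variable names (html, pattern, address) are my own choices for illustration, not something the question specifies. The \/ escape is dropped because Python has no / delimiter, and re.DOTALL is assumed so that . can cross line breaks.

```python
import re

# Sample markup from the question (the "adress" spelling is kept as posted).
html = """<body>
<div class="price">
<h2>
h2
</h2>
</div>
<div class="title">
<abcd>
abcd
</abcd>
<efg>
efg
</efg>
<h2>
adress
</h2>
</div>
</body>"""

# Tempered dot: consume a character only if it is not followed by </div>,
# then backtrack to the last <h2> that is still inside the title div.
pattern = re.compile(r'<div class="title">(?:.(?!</div>))*<h2>([^<]*)', re.DOTALL)

match = pattern.search(html)
# Group 1 includes the surrounding line breaks, so strip them.
address = match.group(1).strip() if match else None
print(address)
```

The <h2> inside <div class="price"> is ignored because the match can only start at <div class="title">.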
Update:
If nested divs may occur, but only one level of nesting is possible and the required <h2>...</h2> is not within any of those nested divs (as in the provided data sample), the greedily matching pattern (.(?!<\/div>)) should be extended to match either a whole nested <div ...>...</div> (which is <div.*?<\/div>) or just a character not followed by </div> (.(?!<\/div>)):
<div class="title">(?:<div.*?<\/div>|.(?!<\/div>))*<h2>([^<]*)
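To see the difference between the two patterns, here is a small Python sketch. The markup is a hypothetical variant of the sample (the nested <div class="badge"> is invented for illustration), and re.DOTALL is again assumed so that . matches line breaks.

```python
import re

# Hypothetical input: one nested div containing its own <h2>,
# followed by the <h2> we actually want.
html = """<div class="title">
<div class="badge">
<h2>
not the address
</h2>
</div>
<h2>
adress
</h2>
</div>"""

# Original pattern: stops at the first </div>, i.e. inside the nested div.
simple = re.compile(r'<div class="title">(?:.(?!</div>))*<h2>([^<]*)', re.DOTALL)
# Extended pattern: skips a whole nested <div ...>...</div> in one step.
nested = re.compile(r'<div class="title">(?:<div.*?</div>|.(?!</div>))*<h2>([^<]*)', re.DOTALL)

simple_hit = simple.search(html).group(1).strip()
nested_hit = nested.search(html).group(1).strip()
print(simple_hit)
print(nested_hit)
```

The first pattern picks up the <h2> inside the nested div, while the extended one reaches the outer <h2>. With two or more levels of nesting, <div.*?<\/div> would stop at the innermost </div>, so this only covers the single-level case described above.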