How to find any word in a text but distinguish tags?

I want to find every word with a minimum length (4 characters) in a text. The words may also sit between tags like <strong> or <h1>. After that I want to apply a kind of weighting to these words: normal words simply get a lower score than words inside a <strong>. The words are not necessarily alone inside a higher-scoring tag (like strong).

Example content

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor
invidunt ut labore et dolore <strong>magna aliquyam erat</strong>, sed diam voluptua. 
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor 
invidunt ut labore et dolore <strong>magna</strong> aliquyam erat, sed diam voluptua

Can I do this with a regexp, for example by finding every word and checking in the preg callback whether it is inside a tag? Or how else is that possible?

Thanks a lot!

(?<=\/|<)(\w{4,})(?=>)|\b(\w{4,})

You can try this. Group 1 of the match will always come from the tags themselves (the tag name, e.g. strong). Group 2 of the match will be the other, normal words.

See demo.

http://regex101.com/r/hQ1rP0/74
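
To see what the two groups of that pattern actually pick up, here is a minimal sketch in Python rather than PHP (an assumption made for illustration; the pattern itself is unchanged). The helper name `split_matches` and the sample string are made up for this example.

```python
import re

# Same pattern as in the answer above. Both lookbehind alternatives
# are one character wide, so Python's re module accepts it.
PATTERN = re.compile(r"(?<=\/|<)(\w{4,})(?=>)|\b(\w{4,})")

def split_matches(text):
    """Return (group-1 hits, group-2 hits) for every match in text."""
    group1, group2 = [], []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            group1.append(m.group(1))   # preceded by "<" or "/", followed by ">"
        else:
            group2.append(m.group(2))   # any other word of 4+ characters
    return group1, group2

sample = "dolore <strong>magna</strong> aliquyam erat"
tag_names, words = split_matches(sample)
print(tag_names)  # ['strong', 'strong']
print(words)      # ['dolore', 'magna', 'aliquyam', 'erat']
```

In this run the word magna lands in group 2 even though it sits inside <strong>, so deciding whether a word is inside a tag would still need an extra step.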

<\w*>([a-zA-Z0-9 ]{4,})</\w*>

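Use this to find the text between tags, then count the number of spaces in that text to know how many words it contains and apply the corresponding weighting. You control the minimum length with {4,}; in this case it is 4 or more. Note that here {4,} applies to the whole text between the tags, spaces included, not to each word.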

For normal words you just use

\w{4,}
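
The space-counting idea can be sketched like this, again in Python as an illustrative assumption. The function name `tagged_word_counts` is hypothetical, and the count assumes the words are separated by single spaces with no leading or trailing space inside the tag.

```python
import re

# Pattern from the answer: text made of letters, digits and spaces
# (4 or more characters in total) between an opening and a closing tag.
TAGGED = re.compile(r"<\w*>([a-zA-Z0-9 ]{4,})</\w*>")

def tagged_word_counts(text):
    """Return (inner_text, word_count) for each tagged run in text."""
    results = []
    for m in TAGGED.finditer(text):
        inner = m.group(1)
        # N single spaces separate N + 1 words.
        results.append((inner, inner.count(" ") + 1))
    return results

html = ("dolore <strong>magna aliquyam erat</strong>, sed diam "
        "<strong>magna</strong> aliquyam")
print(tagged_word_counts(html))
# [('magna aliquyam erat', 3), ('magna', 1)]
```

If the text between tags may contain double spaces or punctuation, splitting the captured text into words would be more robust than counting spaces.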

Is that all?

Oh, you probably wanted something like this, right?

<\w*>(?<case1>[a-zA-Z0-9 ]{4,})</\w*>|(?<case2>\w{4,})

The case1 group holds the words that are between tags, and the case2 group holds the words that are not. By the way, I don't know exactly how capture groups are written in PHP, so the regex might look a bit different there. Also, if "/" is your pattern delimiter in PHP, you need to escape it as \/.

http://regex101.com/r/iR5lW1/1
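
Putting the combined pattern to work for the weighting could look roughly like the sketch below. It is written in Python as an assumption for illustration, which is why the named groups use Python's (?P<name>...) spelling. The weights `TAG_WEIGHT` and `PLAIN_WEIGHT` and the function `score_words` are hypothetical and not part of the question.

```python
import re
from collections import Counter

# Combined pattern from the answer, with Python-style named groups.
COMBINED = re.compile(
    r"<\w*>(?P<case1>[a-zA-Z0-9 ]{4,})</\w*>|(?P<case2>\w{4,})"
)

MIN_LEN = 4       # minimum word length from the question
TAG_WEIGHT = 3    # hypothetical score for a word inside a tag
PLAIN_WEIGHT = 1  # hypothetical score for a normal word

def score_words(text):
    """Sum a per-word score, weighting words found inside tags higher."""
    scores = Counter()
    for m in COMBINED.finditer(text):
        if m.group("case1") is not None:
            # case1 is the whole text between the tags, so split it
            # and apply the minimum length to each word separately.
            for word in m.group("case1").split():
                if len(word) >= MIN_LEN:
                    scores[word.lower()] += TAG_WEIGHT
        else:
            scores[m.group("case2").lower()] += PLAIN_WEIGHT
    return scores

text = "dolore <strong>magna aliquyam erat</strong>, sed diam magna"
print(dict(score_words(text)))
# {'dolore': 1, 'magna': 4, 'aliquyam': 3, 'erat': 3, 'diam': 1}
```

One limitation to keep in mind: if the text between the tags contains anything other than letters, digits and spaces (a comma, say), the first alternative fails and the tag name itself, such as strong, can be picked up by case2 as a normal word.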