How can I separate words in a text that have been run together incorrectly? For example, I have this text:
HelloEveryOne, СаломБаХама, Ҳама дарПеши ҷаҳонЯк мебошадАммо. HELLOeveryOneHelloFORyouYOU HELLO everyOneHello FORyouYOU canBEcorrectedThisSTRINGinCorrectlyFORm canBEcorrected ThisSTRINGin CorrectlyFORm
and I want to turn it into this:
Hello Every One, Салом Ба Хама, Ҳама дар Пеши ҷаҳон Як мебошад Аммо. HELLO every One Hello FOR you YOU HELLO every One Hello FOR you YOU can BE corrected This STRING in Correctly FOR m can BE corrected This STRING in Correctly FOR m
Thanks in advance!
I don't recognize this locale so I wasn't able to test these strange chars, but the first string can be solved with this:
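<?php
$str = 'HelloEveryOne';
$newStr = '';
// Walk the string one byte at a time (so this handles ASCII letters only).
for ($i = 0; $i < strlen($str); $i++) {
    // Add a space before each uppercase letter, except at the very start.
    $newStr .= ($i > 0 && ctype_upper($str[$i])) ? ' ' : '';
    $newStr .= $str[$i];
}
echo $newStr; // Hello Every One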
The ctype_upper function returns true if every character in a string is uppercase. I'm passing it a single char at a time, so if that char is uppercase, the program adds a space before it.
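Since ctype_upper looks at single bytes, a loop like this will typically not detect uppercase letters in multibyte UTF-8 text such as the Cyrillic part of the question. A minimal sketch of the same idea working on Unicode characters follows. The function name spaceBeforeUppercase is hypothetical, and the use of preg_split and \p{Lu} is a swapped-in technique, not part of the answer above:

```php
<?php
// Hypothetical helper: same loop as above, but per Unicode character instead of per byte.
function spaceBeforeUppercase(string $str): string
{
    // Split into UTF-8 characters; fall back to an empty list on invalid input.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $newStr = '';
    foreach ($chars as $i => $char) {
        // \p{Lu} matches any Unicode uppercase letter; skip the very first character.
        if ($i > 0 && preg_match('/^\p{Lu}$/u', $char) === 1) {
            $newStr .= ' ';
        }
        $newStr .= $char;
    }
    return $newStr;
}

echo spaceBeforeUppercase('СаломБаХама'), PHP_EOL;
```

Like the original loop, this puts a space before every capital, so a run such as HELLO would still be split letter by letter.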
You can use Unicode character properties to look for uppercase and lowercase letters. Something like:
\B(\p{Lu}[\p{Ll}.,!]+)
and replace with a space followed by the first capture group:
 \1
Regex demo: https://regex101.com/r/QskwDd/2/
In PHP it can be used as:
$string = 'HelloEveryOne, СаломБаХама, Ҳама дарПеши ҷаҳонЯк мебошадАммо.';
echo preg_replace('/\B(\p{Lu}[\p{Ll}.,!]+)/u', ' \1', $string);
Demo: https://3v4l.org/ZjHh4
A simpler approach could be just looking for capital letters and adding a space.
\B\p{Lu}
replace with a space followed by the whole match:
 \0
Regex Demo: https://regex101.com/r/QskwDd/1/
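The simpler pattern can be tried in PHP the same way as the first one. This is only a sketch: the function name splitBeforeCapitals is hypothetical, while the pattern and replacement are the ones given above.

```php
<?php
// Hypothetical helper name; pattern and replacement come from the answer above.
function splitBeforeCapitals(string $text): string
{
    // \B     : a position that is not a word boundary, i.e. inside a word
    // \p{Lu} : any Unicode uppercase letter
    // ' \0'  : a space followed by the whole match
    return preg_replace('/\B\p{Lu}/u', ' \0', $text) ?? $text;
}

echo splitBeforeCapitals('HelloEveryOne, СаломБаХама'), PHP_EOL;
// A run of capitals is split letter by letter, which may not be what you want:
echo splitBeforeCapitals('FORyou'), PHP_EOL;
```

The second call shows the trade-off of the simpler pattern: it cannot keep an all-caps word such as FOR together.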
This was a bit of a tricky challenge to crack, but I got it. Using negative lookarounds proved unfruitful for negating unwanted substrings. The (*SKIP)(*FAIL) technique did the job.
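The logic behind it all is to target the three types of words regardless of spacing. They are: all lowercase, one uppercase letter followed by all lowercase, and all uppercase (two or more letters).
See the inline comments in the PHP code block for a plain-language explanation of the pattern.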
Pattern: Demo
/(?:\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}{2,}+)[,.!?]?(?:\s|$)(*SKIP)(*FAIL)|(?:\p{Ll}+|\p{Lu}{2,}+|\p{Lu}\p{Ll}+)[,.!?]?/u
Code: (Demo)
$input='HelloEveryOne, СаломБаХама, Ҳама дарПеши ҷаҳонЯк мебошадАммо.
HELLOeveryOneHelloFORyouYOU HELLO everyOneHello FORyouYOU
can,BEcorrectedThisSTRINGinCorrectlyFORm
canBEcorrected ThisSTRINGin CorrectlyFORm.';
// Alternative 1 (matches that we don't want to replace):
//   (?:\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}{2,}+)  all lower | one upper then all lower | all upper
//   [,.!?]?                                optional trailing punctuation
//   (?:\s|$)                               white space or end of input
//   (*SKIP)(*FAIL)                         discard these matches
// Alternative 2: repeat the first alternative without the trailing white space or end of input.
var_export(preg_replace('/(?:\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}{2,}+)[,.!?]?(?:\s|$)(*SKIP)(*FAIL)|(?:\p{Ll}+|\p{Lu}{2,}+|\p{Lu}\p{Ll}+)[,.!?]?/u','$0 ',$input));
Output:
'Hello Every One, Салом Ба Хама, Ҳама дар Пеши ҷаҳон Як мебошад Аммо.
HELLO every One Hello FOR you YOU HELLO every One Hello FOR you YOU
can, BE corrected This STRING in Correctly FOR m
can BE corrected This STRING in Correctly FOR m.'
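For readers new to (*SKIP)(*FAIL), here is a reduced sketch of the same idea, limited to lowercase runs so the two alternatives are easier to follow. The function name spaceAfterGluedLowercase is hypothetical and the pattern is a cut-down version of the one above, not a replacement for it:

```php
<?php
// Hypothetical helper: adds a space after a lowercase run only when the run
// is glued to whatever follows it.
function spaceAfterGluedLowercase(string $text): string
{
    $pattern = '/\p{Ll}+(?:\s|$)(*SKIP)(*FAIL)' // run already followed by white space or end: discard it
             . '|\p{Ll}+/u';                     // any other lowercase run: this is the match
    return preg_replace($pattern, '$0 ', $text) ?? $text;
}

echo spaceAfterGluedLowercase('fooBar baz'), PHP_EOL;
```

Here "foo" is followed directly by "B", so only the second alternative can match it and it receives a trailing space. The runs "ar" and "baz" are already followed by white space or the end of input, so the first alternative consumes them and (*SKIP)(*FAIL) throws those matches away.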