Performance improvements for text preprocessing

Is there a way to improve this code while keeping its behavior the same? Parts of it are the result of checking the code and its output on both Windows and Linux, so it has to stay "multi-OS".

// Remove tags
$input = strip_tags($input);
// Converts accented to non-accented
$input =  iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
// String to lower
$input = strtolower($input);
// Remove all non-word and non-space chars
$input = preg_replace('/[^\sa-z]/', '', $input);
// Replace enters
$input = preg_replace('/[\r\n]/', ' ', $input);
// Remove stopwords
$input = preg_replace('/\b(' . implode('|', $stopwords) . ')\b/', '', $input);
// Remove individual chars
$input = preg_replace('/\b([a-z])\b/', '', $input);
// Trim it
$input = trim($input);
// Remove multiple spaces
$input = preg_replace("/[[:blank:]]+/", " ", $input);

INPUT:

<doc id="603" url="http://it.wikipedia.org/wiki/Esperanto">
Esperanto.
Esperanto (pôvodne Lingvo Internacia – „medzinárodný jazyk“) je najrozšírenejší <a     href="Medzin%C3%A1rodn%C3%BD_pomocn%C3%BD_jazyk">medzinárodný</a> <a     href="Umel%C3%BD_jazyk">plánový jazyk</a>. Názov je odvodený od <a     href="Pseudonym">pseudonym</a>u, pod ktorým v roku <a href="1887">1887</a> zverejnil lekár     <a href="Ludwik_Lejzer_Zamenhof">L. L. Zamenhof</a> základy tohto jazyka. Zámerom tvorcu     bolo vytvoriť ľahko naučiteľný a použiteľný neutrálny jazyk, vhodný na použitie v     medzinárodnej komunikácii. Cieľom nebolo nahradiť <a href="N%C3%A1rodn%C3%BD_jazyk">národné     jazyky</a>, čo bolo neskôr aj deklarované v <a     href="Boulonsk%C3%A1_deklar%C3%A1cia">Boulonskej deklarácii</a>.
Hoci žiaden <a href="%C5%A1t%C3%A1t">štát</a> neprijal esperanto ako <a href="%C3%BAradn%C3%BD_jazyk">úradný jazyk</a>, používa ho komunita s odhadovaným počtom hovoriacich 100 000 až 2 000 000, z čoho približne 1 000 tvoria rodení hovoriaci. Získalo aj isté medzinárodné uznania, napríklad dve rezolúcie <a href="UNESCO">UNESCO</a> či podporu známych osobností verejného života. V súčasnosti sa esperanto využíva pri <a href="Cestovanie">cestovaní</a>, dopisovaní, medzinárodných stretnutiach a kultúrnych výmenách, <a href="Kongres">kongres</a>och, <a href="Veda">vedeckých</a> diskusiách, v pôvodnej aj prekladovej
</doc>

OUTPUT:

esperanto esperanto povodne lingvo internacia medzinarodny jazyk najrozsirenejsi medzinarodny planovy jazyk nazov odvodeny pseudonymu ktorym roku zverejnil lekar zamenhof zaklady tohto jazyka zamerom tvorcu vytvorit lahko naucitelny pouzitelny neutralny jazyk vhodny pouzitie medzinarodnej komunikacii cielom nebolo nahradit narodne jazyky neskor deklarovane boulonskej deklaracii hoci ziaden stat neprijal esperanto uradny jazyk pouziva komunita odhadovanym poctom hovoriacich coho priblizne tvoria rodeni hovoriaci ziskalo iste medzinarodne uznania napriklad dve rezolucie unesco podporu znamych osobnosti verejneho zivota sucasnosti esperanto vyuziva cestovani dopisovani medzinarodnych stretnutiach kulturnych vymenach kongresoch vedeckych diskusiach povodnej prekladovej

Are you sure this is a bottleneck in your application? You should definitely profile it before making performance optimizations that may turn out not to be optimizations at all.
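Before changing anything, a rough timing loop can show whether these steps matter at all. This is only a minimal sketch: the helper name time_it, the sample text and the iteration count are made up for illustration, and a real profiler (or your real input data) will give a more trustworthy picture.

```php
<?php
// Hypothetical helper (not from the question): average wall-clock
// seconds per call of $fn over $iterations runs.
function time_it(callable $fn, int $iterations = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $iterations;
}

// Made-up sample input; replace it with a representative document.
$sample = str_repeat("Some <b>sample</b> text with  spaces\r\n", 200);

$perCall = time_it(function () use ($sample) {
    $text = strip_tags($sample);
    $text = strtolower($text);
    $text = preg_replace('/[^\sa-z]/', '', $text);
    return preg_replace('/\s+/', ' ', $text);
}, 200);

printf("%.6f seconds per call\n", $perCall);
```

Timing each statement separately in the same way should show which one, if any, dominates.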

I am not sure whether this increases the performance significantly, but at least it simplifies the code a bit. You can get rid of the final "Remove multiple spaces" statement by folding it into the "Replace enters" call:

// Replace all (consecutive) whitespace characters with a single space:
$input = preg_replace('/\s+/', ' ', $input);

And you can combine the two replacements after that:

// Remove all stopwords and single letters:
$input = preg_replace('/\b(' . implode('|', $stopwords) . '|[a-z])\b/', '', $input);

So you end up with this:

// Remove tags
$input = strip_tags($input);
// Converts accented to non-accented
$input =  iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
// String to lower
$input = strtolower($input);
// Remove all non-word and non-space chars
$input = preg_replace('/[^\sa-z]/', '', $input);
// Replace all (consecutive) whitespace characters with a single space:
$input = preg_replace('/\s+/', ' ', $input);
// Remove all stopwords and single letters:
$input = preg_replace('/\b(' . implode('|', $stopwords) . '|[a-z])\b/', '', $input);
// Trim it
$input = trim($input);

In fact, you could do the trim with two more alternatives inside your last preg_replace, but I find that rather obscure, and again, I don't know whether it would even help your performance.
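One thing to watch with the combined version: the stopwords are removed after the whitespace has been collapsed, so each removed word can leave two adjacent spaces behind. The sketch below is one possible way around that, not part of the original answer. The function name filter_words is hypothetical, it covers only the regex steps (it assumes strip_tags, iconv and strtolower have already run), and the preg_quote call is an extra precaution in case a stopword contains regex metacharacters.

```php
<?php
// Hypothetical helper: covers only the regex steps and expects text that
// has already gone through strip_tags(), iconv() and strtolower().
function filter_words(string $text, array $stopwords): string
{
    // Drop everything except a-z and whitespace.
    $text = preg_replace('/[^\sa-z]+/', '', $text);

    // Escape each stopword so regex metacharacters are matched literally.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);
    $alternatives = $quoted ? implode('|', $quoted) . '|[a-z]' : '[a-z]';

    // Remove stopwords and single letters.
    $text = preg_replace('/\b(?:' . $alternatives . ')\b/', '', $text);

    // Collapse whitespace last, so gaps left by removed words are closed too.
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

var_dump(filter_words("one  cat a\r\ntwo dogs", ['one', 'two']));
// string(8) "cat dogs"
```

Whether the reordering costs or saves time is something to measure; it is here for correctness of the output, not for speed.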

m.buettner's answer is a good start. With that solution, my benchmarks show a speed improvement of better than 25%.

PCRE 'S' "Study" Modifier

For certain regexes, the PCRE 'S' Study modifier can speed up matching quite a bit. Here is an enhanced version of m.buettner's code:

// Remove tags
$input = strip_tags($input);
// Converts accented to non-accented
$input =  iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);
// String to lower
$input = strtolower($input);
// Remove all non-word and non-space chars
$input = preg_replace('/[^\sa-z]+/S', '', $input);
// Replace all (consecutive) whitespace characters with a single space:
$input = preg_replace('/\s+/S', ' ', $input);
// Remove all stopwords and single letters:
$input = preg_replace('/\b('.implode('|', $stopwords).'|[a-z])\b/', '', $input);
// Trim it
$input = trim($input);

This brings the total to about a 45% speed improvement over the original. Note that the S Study modifier does not help with regexes that begin with literal text or anchors. The bottleneck may well be the $stopwords statement, depending on how many stopwords you have in there. (I used a simple array with four elements for my benchmarking: ['one','two','three','four'].) A much larger $stopwords array will probably show less improvement.
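If the stopword list is large, one alternative worth benchmarking is to drop the big alternation regex and look each word up in an array instead. This is a different technique from the one above, and only a sketch: the function name remove_stopwords_by_lookup is hypothetical, and it assumes the text already contains nothing but a-z and whitespace (that is, the earlier steps have run).

```php
<?php
// Hypothetical alternative to the stopword regex: look words up in an array.
// Expects lowercase a-z text with words separated by whitespace.
function remove_stopwords_by_lookup(string $text, array $stopwords): string
{
    // Flip the list so each stopword becomes an array key (hash lookup).
    $lookup = array_flip($stopwords);

    $kept = [];
    // PREG_SPLIT_NO_EMPTY drops empty pieces from leading/trailing whitespace.
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Skip single letters and stopwords, as the regex version does.
        if (strlen($word) > 1 && !isset($lookup[$word])) {
            $kept[] = $word;
        }
    }

    // Joining with single spaces also takes care of trimming and collapsing.
    return implode(' ', $kept);
}

echo remove_stopwords_by_lookup(" one cat a\ntwo dogs ", ['one', 'two', 'three', 'four']), "\n";
// prints: cat dogs
```

Whether this beats the regex depends on the input size and the number of stopwords, so it should be timed against real data rather than assumed to be faster.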

There are many more useful efficiency tidbits like this one in the classic: Mastering Regular Expressions (3rd Edition) - a MUST READ for everyone who uses regexes on a regular basis.

8^)

Take a look at the following:

http://stuffivelearned.org/doku.php?id=programming:general:phpvspythonvsperl

It compares the speed of regexes for PHP, Perl and Python. Note especially the tremendous speed of Perl, which takes only around 20% of the time required by PHP.