I have 75,000+ texts stored in my database, each with 10,000+ characters.
Approximately every two minutes, a new text gets inserted into the database.
My task at the moment is to find duplicates. However, I can't compare the texts with the comparison operator ==, because in many cases small changes have been made to the text.
My idea so far is to compare each incoming text with all the other texts using the PHP function similar_text and to add a relationship between texts that are almost 100% alike.
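Roughly, that idea could be sketched as follows. This is only a minimal illustration, not the real code: the function name findNearDuplicates, the $storedTexts array keyed by text ID, and the 95% threshold are all hypothetical placeholders. The only real API used is PHP's built-in similar_text, which writes the similarity percentage into its third argument.

```php
<?php
// Hypothetical helper: returns the IDs of all stored texts whose
// similarity to $incoming is at least $threshold percent.
// $storedTexts is assumed to be an array of [id => text].
function findNearDuplicates(string $incoming, array $storedTexts, float $threshold = 95.0): array
{
    $matches = [];
    foreach ($storedTexts as $id => $text) {
        $percent = 0.0;
        // similar_text() stores the similarity (0 to 100) in $percent.
        similar_text($incoming, $text, $percent);
        if ($percent >= $threshold) {
            $matches[] = $id;
        }
    }
    return $matches;
}
```

Every incoming text triggers one similar_text call per stored text, which is where the running time comes from.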
The problem: similar_text uses a very expensive algorithm. A single comparison between two texts of 10,000+ characters takes over 0.1 seconds. That means comparing one text with all the other texts takes at least 75,000 * 0.1 = 7,500 seconds = 125 minutes. That is way too long, because after two minutes I already receive the next text, which has to be compared as well.
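The estimate can be reproduced with a few lines of PHP. The function names below are hypothetical; the inputs are just the numbers from this question (75,000 texts, 0.1 seconds per comparison, a new text every 120 seconds).

```php
<?php
// Hypothetical helper: seconds needed to compare one incoming text
// against every stored text, one comparison at a time.
function estimateFullScanSeconds(int $textCount, float $secondsPerComparison): float
{
    return $textCount * $secondsPerComparison;
}

// Hypothetical helper: how many new texts arrive while one full scan runs.
function arrivalsDuringScan(float $scanSeconds, float $arrivalIntervalSeconds): float
{
    return $scanSeconds / $arrivalIntervalSeconds;
}

$scanSeconds = estimateFullScanSeconds(75000, 0.1);   // about 7,500 seconds
$scanMinutes = $scanSeconds / 60;                     // about 125 minutes
$arrivals    = arrivalsDuringScan($scanSeconds, 120); // about 62.5 new texts
```

In other words, a single full scan would have to become more than 60 times faster just to keep up with the incoming texts.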
How can I speed this process up? Is there a quicker way to calculate the similarity of two texts? Or do you have other ideas on how I can find similar texts?