Preventing search abuse

I haven't been able to Google anything useful on this subject, so I'd appreciate either links to articles that deal with it or direct answers here; either is fine.

I am implementing a search system in PHP/MySQL on a site that gets quite a lot of visitors, so I am going to add some restrictions: a minimum length for the search term a visitor is allowed to enter, and a minimum time required between two searches. Since I'm fairly new to these problems and don't really know the "real reasons" why this is usually done, it's only my assumption that the minimum character length is implemented to minimize the number of results the database will return, and that the time between searches is there to prevent robots from spamming the search system and slowing down the site. Is that right?

And finally, the question of how to implement the minimum time between two searches. The solution I came up with, in pseudo-code, is this (a rough PHP sketch follows the list):

  1. Set a test cookie at the URL where the search form is submitted to
  2. Redirect user to the URL where the search results should be output
  3. Check if the test cookie exists
    • If not, output a warning that he isn't allowed to use the search system (is probably a robot)
  4. Check if a cookie exists that tells the time of the last search
    • If this was less than 5 seconds ago, output a warning that he should wait before searching again
  5. Search
  6. Set a cookie with the time of last search to current time
  7. Output search results
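
A minimal sketch of those steps in PHP could look like the following. The file names, cookie names, the articles table and the $pdo connection are just placeholders for illustration; the 5-second limit is the one from step 4.

```php
<?php
// submit.php - the URL the search form posts to (steps 1 and 2)
setcookie('search_test', '1', time() + 3600, '/');             // 1. set a test cookie
$q = isset($_POST['q']) ? $_POST['q'] : '';
header('Location: results.php?q=' . urlencode($q));            // 2. redirect to the results URL
exit;
```

```php
<?php
// results.php - outputs the search results (steps 3 to 7)
if (!isset($_COOKIE['search_test'])) {                          // 3. no test cookie: probably a robot
    exit('You are not allowed to use the search system.');
}
if (isset($_COOKIE['last_search']) && time() - (int) $_COOKIE['last_search'] < 5) {
    exit('Please wait a few seconds before searching again.');  // 4. last search was under 5 seconds ago
}

$q = isset($_GET['q']) ? trim($_GET['q']) : '';

// 5. run the search (prepared statement; $pdo is an already-connected PDO instance)
$stmt = $pdo->prepare('SELECT id, title FROM articles WHERE title LIKE ? LIMIT 50');
$stmt->execute(array('%' . $q . '%'));
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);

setcookie('last_search', (string) time(), time() + 3600, '/');  // 6. remember the time of this search

foreach ($results as $row) {                                    // 7. output the results
    echo htmlspecialchars($row['title']), "<br>\n";
}
```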

Is this the best way to do it?

I understand this means visitors who have cookies disabled will not be able to use the search system, but is that really a problem these days? I couldn't find statistics for 2012, but I did find data saying 3.7% of people had cookies disabled in 2009. That doesn't seem like a lot, and I suppose it's probably even lower these days.

"only my assumptions that the character minimum length is implemented to minimize the number of results the database will return". Your assumption is absolutely correct. It reduces the number of potential results, by forcing the user to think about, what it is they wish to search.

As far as bots spamming your search goes, you could implement a captcha; the most frequently used is reCAPTCHA. If you don't want to show a captcha right away, you can track (via the session) how many times the user has submitted a search, and if X searches occur within a certain time frame, render the captcha.

I've seen sites like SO and thechive.com implement this type of strategy, where the captcha isn't rendered right away but appears once a threshold is hit.
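
A rough sketch of that session-based threshold, assuming a limit of 10 searches per 60-second window. The reCAPTCHA rendering and verification are left out because they depend on the library you use, and show_captcha_form() is a hypothetical helper:

```php
<?php
session_start();

$limit  = 10;  // max searches per window before demanding a captcha (example value)
$window = 60;  // window length in seconds (example value)

$now = time();
if (!isset($_SESSION['search_log'])) {
    $_SESSION['search_log'] = array();
}

// Drop timestamps that have fallen out of the window, then record this search.
$_SESSION['search_log'] = array_filter($_SESSION['search_log'], function ($t) use ($now, $window) {
    return $now - $t < $window;
});
$_SESSION['search_log'][] = $now;

if (count($_SESSION['search_log']) > $limit) {
    // Threshold hit: render the captcha and only run the search
    // once the user's response has been verified.
    show_captcha_form();  // hypothetical helper
    exit;
}

// ...run the search as usual...
```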

This way you're preventing search engines from indexing your search results. A cleaner way of doing this would be:

  1. Get the IP the search originated from
  2. Store that IP, together with the time the query was made, in a caching system such as memcached
  3. If another query comes from the same IP and less than X seconds have passed, simply reject it or make the user wait (a sketch of this follows below)
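
A sketch of those three steps with PHP's Memcached extension; the 5-second window and the key prefix are just example values:

```php
<?php
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$ip     = $_SERVER['REMOTE_ADDR'];       // 1. IP the search originated from
$key    = 'search_last:' . $ip;
$window = 5;                             // minimum seconds between two searches

$last = $memcached->get($key);           // 2. time of this IP's previous query, if any
if ($last !== false && time() - $last < $window) {
    // 3. too soon: reject it (or sleep() here to make the user wait instead)
    header('HTTP/1.1 429 Too Many Requests');
    exit('Too many searches, please slow down.');
}

$memcached->set($key, time(), $window);  // remember this query for the next $window seconds

// ...perform the search...
```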

Another thing you can do to increase performance is to look at your analytics, see which queries are made most often, and cache those, so that when such a request comes in you serve the cached version instead of doing a full DB query, parsing, and so on.
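
For example, the result set could be kept in memcached under a key derived from the normalized search string; the table name, the LIKE query and the one-hour TTL are assumptions for illustration:

```php
<?php
// Assumes $memcached and $pdo are already connected, as in the earlier sketches.
function cached_search(Memcached $memcached, PDO $pdo, $q)
{
    $key = 'search_result:' . md5(mb_strtolower(trim($q)));

    $results = $memcached->get($key);
    if ($results === false) {
        // Cache miss: hit the database once, then keep the result set for an hour.
        $stmt = $pdo->prepare('SELECT id, title FROM articles WHERE title LIKE ? LIMIT 50');
        $stmt->execute(array('%' . $q . '%'));
        $results = $stmt->fetchAll(PDO::FETCH_ASSOC);
        $memcached->set($key, $results, 3600);
    }
    return $results;
}
```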

Another, more naive option would be to have a script run 1-2 times a day that executes all the common queries and creates static HTML files, so users making those particular search queries are served the files instead of hitting the db.
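
A naive cron-driven version of that could look like this, assuming a search_log table that records past queries and a web-accessible cache/ directory that already exists; the connection details and column names are placeholders:

```php
<?php
// generate_static.php - run from cron once or twice a day.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

// The 100 most common queries, based on a hypothetical search_log table.
$common = $pdo->query('SELECT query FROM search_log GROUP BY query ORDER BY COUNT(*) DESC LIMIT 100');

foreach ($common->fetchAll(PDO::FETCH_COLUMN) as $q) {
    $stmt = $pdo->prepare('SELECT title FROM articles WHERE title LIKE ? LIMIT 50');
    $stmt->execute(array('%' . $q . '%'));

    $html = '';
    foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $title) {
        $html .= '<li>' . htmlspecialchars($title) . "</li>\n";
    }

    // results.php can check for this file and serve it before falling back to a live query.
    file_put_contents(__DIR__ . '/cache/' . md5($q) . '.html', "<ul>\n" . $html . "</ul>\n");
}
```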