I was wondering how I could quickly search a data string of up to 1 billion bytes. The data is all numeric. Currently, we have the data split into 250k files, and the search runs strpos (the fastest built-in function) on each file until it finds something. Is there a way I can index the data to make it go faster? Any suggestions?
Eventually I would like to find multiple occurrences, which, as of now, would be done with the offset parameter on strpos.
Any help would surely lead to recognition where needed.
Thanks! - James Hartig
Well, your tags indicate what you should do (the tag I am referring to is "indexing").
Basically, you should have separate files that hold the indexes for the data. Each index would contain the data strings you are looking for, along with the file and byte positions where each one occurs.
You would then access the index, look up your value and then find the location(s) in the original file(s) for the data string, and process from there.
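To make that concrete, here is a minimal sketch of such an index, written in Python for brevity (the same idea ports to PHP). It assumes every query is at least K digits long, keeps everything in memory, and ignores matches that span two files; the key length K, the file names, and the function names are all made up for illustration.

```python
from collections import defaultdict

K = 4  # assumed key length: every K-digit substring gets an index entry

def build_index(files):
    """Map each K-digit substring to the (file name, byte offset) pairs where it starts.

    `files` is a dict of {file name: digit string}, standing in for the real data files.
    """
    index = defaultdict(list)
    for name, data in files.items():
        for pos in range(len(data) - K + 1):
            index[data[pos:pos + K]].append((name, pos))
    return index

def search(index, files, query):
    """Return every (file name, offset) where `query` occurs; query must be >= K digits."""
    if len(query) < K:
        raise ValueError("query shorter than the indexed key length")
    hits = []
    for name, pos in index.get(query[:K], []):
        # The index only proves the first K digits match; verify the rest in the data.
        if files[name].startswith(query, pos):
            hits.append((name, pos))
    return hits

# Hypothetical sample data standing in for two of the data files.
files = {"part0.txt": "3141592653589793", "part1.txt": "2718281828459045"}
index = build_index(files)
print(search(index, files, "8281828"))
```

For the real data set the index would live on disk (or in a database) rather than in a dict, and the lookup would seek to the recorded offset in the original file instead of holding the file contents in memory.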
A good answer may require that you get a little more specific.
How long is the search query? 1 digit? 10 digits? Arbitrary length?
How "fast" is fast enough? 1 second? 10 seconds? 1 minute?
How many total queries per second / minute / hour do you expect?
How frequently does the data change? Every day? Hour? Continuously?
When you say "multiple occurrences" it sounds like you mean overlapping matches.
What is the "value" of the answer and to how many people?
A billion ain't what it used to be, so you could just index the crap out of the whole thing and have an index that is 10 or even 100 times the size of the original data. But if the data is changing by the minute, that would mean you were burning more cycles creating the index than searching it.
The amount of time and money you put into a solution is a function of the value of that solution.
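On the "multiple occurrences" question above: whether matches may overlap only changes how far the search offset advances after each hit. A rough sketch in Python (the function name is made up; `str.find` with a start offset plays the role of strpos with its offset parameter):

```python
def find_all_offsets(data, query, overlapping=True):
    """Return the start offset of every occurrence of `query` in `data`."""
    if not query:
        raise ValueError("query must not be empty")
    # Overlapping search resumes one byte after the last hit;
    # non-overlapping search resumes just past the end of the last hit.
    step = 1 if overlapping else len(query)
    hits = []
    pos = data.find(query)
    while pos != -1:
        hits.append(pos)
        pos = data.find(query, pos + step)
    return hits

print(find_all_offsets("1212121", "121"))         # overlapping matches
print(find_all_offsets("1212121", "121", False))  # non-overlapping matches
```

In "1212121" the query "121" starts at offsets 0, 2 and 4 if overlaps count, but only at 0 and 4 if they don't.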
You should definitely get a girlfriend. Besides helping you spend your time better it can grow fat without bursting. Oh, and the same goes for databases.
All of Peter Rowell's questions pertain. If you absolutely must have an out-of-the-box answer, then try grep. You can even exec it from PHP if you like. It is orders of magnitude faster than strpos. We've actually used it quite successfully as a solution for something that couldn't deal with indexing.
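Roughly what shelling out to grep looks like, sketched here in Python rather than PHP. It assumes a grep binary is on the PATH; the function name is made up. `-F` treats the query as a fixed string instead of a regex, and `-l` prints only the names of the files that contain it.

```python
import subprocess

def files_containing(query, paths):
    """Return the subset of `paths` whose contents include `query`, using grep."""
    if not paths:
        return []
    # "--" keeps a query that starts with "-" from being read as an option.
    result = subprocess.run(
        ["grep", "-l", "-F", "--", query, *paths],
        capture_output=True,
        text=True,
    )
    # grep exits with 0 on a match, 1 on no match, and a higher code on error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```

This only narrows the search down to the files that contain the string; finding the byte positions inside those files still needs a second pass.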
But again, Peter's questions still all apply. I'd answer them before diving into a solution.
Would a hash function/table work? Or a Suffix Array/Tree?
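For the suffix array idea, a toy sketch in Python: sort the start offsets of all suffixes once, then binary-search for the range of suffixes that begin with the query. The construction below is the naive one and would be far too slow and memory-hungry for a billion bytes as written; it is only meant to show the lookup. Function names are made up.

```python
def build_suffix_array(data):
    """Return the start offsets of all suffixes of `data`, in sorted suffix order."""
    # Naive construction: compares whole suffixes, so only suitable for small inputs.
    return sorted(range(len(data)), key=lambda i: data[i:])

def sa_find_all(data, sa, query):
    """Return the sorted offsets of every occurrence of `query`, overlapping ones included."""
    m = len(query)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        return data[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])

data = "1212121"
sa = build_suffix_array(data)
print(sa_find_all(data, sa, "121"))
```

Each lookup takes a number of comparisons logarithmic in the data size, and it returns every occurrence in one go, which also covers the multiple-occurrences case.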