I need to design a program that loops through 100 million keywords, calls a web service (http://example.com/service.aspx?keyword=xxxx) for each one, and stores the result, which is JSON output, in Redis.
What I am thinking of is to start by mass-inserting all the keywords (the input to the web service) into Redis.
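For the mass insert, pipelining the writes avoids paying one network round trip per key. A minimal sketch, assuming the phpredis extension and a local Redis server; the keywords.txt input file (one keyword per line) and the kw:&lt;keyword&gt; hash layout are hypothetical:

<?php
// Bulk-load all keywords into Redis, pipelining the writes so we don't
// pay one network round trip per key. Assumes the phpredis extension and
// a local Redis server; "keywords.txt" and the kw:<keyword> key layout
// are illustrative only.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$fh = fopen('keywords.txt', 'r');
$pipe = $redis->multi(Redis::PIPELINE);
$count = 0;

while (($line = fgets($fh)) !== false) {
    $keyword = trim($line);
    if ($keyword === '') {
        continue;
    }
    // One hash per keyword; flag = 0 means "not fetched yet".
    $pipe->hMSet("kw:$keyword", ['flag' => 0, 'output' => '']);
    if (++$count % 10000 === 0) {
        $pipe->exec();                          // flush every 10k commands
        $pipe = $redis->multi(Redis::PIPELINE);
    }
}
$pipe->exec();
fclose($fh);

Alternatively, feeding the raw protocol to redis-cli --pipe (Redis's mass-insertion mode) skips PHP entirely for this step.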
After that, I will write a PHP script using rolling-curl that reads from Redis and updates each record with two things: a flag and the result. The flag will be used to track which records have already been updated, and the result will store the JSON string, as follows (a sketch of such a worker follows the example):
{
    "keyword": "keyword_1",
    "flag": 1,
    "output": {
        "result": [
            {
                "ID": "21",
                "field1": "some text",
                "field2": "another text"
            },
            {
                "ID": "150",
                "field1": "some text",
                "field2": "another text"
            },
            {
                "ID": "255",
                "field1": "some text",
                "field2": "another text"
            }
        ]
    }
}
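Not speaking for rolling-curl's exact API, here is a minimal sketch of such a worker using PHP's built-in curl_multi functions instead, so several requests are in flight at once; the kw:&lt;keyword&gt; hash layout matches the loader sketch above, and the fixed batch is a stand-in for keywords read from Redis:

<?php
// Worker sketch: fetch one batch of keywords concurrently and write the
// flag + JSON output back to Redis. Uses PHP's core curl_multi API rather
// than rolling-curl; batch size and key names are assumptions.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$batch = ['keyword_1', 'keyword_2', 'keyword_3'];   // placeholder batch

$mh = curl_multi_init();
$handles = [];
foreach ($batch as $keyword) {
    $ch = curl_init('http://example.com/service.aspx?keyword=' . urlencode($keyword));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$keyword] = $ch;
}

// Drive all transfers until none are still running.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);                     // wait for activity
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $keyword => $ch) {
    $json = curl_multi_getcontent($ch);
    $redis->hMSet("kw:$keyword", ['flag' => 1, 'output' => $json]);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);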
Questions:
1) Is this the best and most efficient way to do it, in terms of the expected time to complete the task?
2) Using this data structure, can I search Redis and find a keyword by field1 or field2? If not, how can this be implemented in Redis?
Thank you
+1 to Dagon. Redis is an in-memory store, so if your sample data set is representative of one row, you'll need over 35 GB of memory: the example JSON is roughly 350 bytes, and 350 bytes × 100 million keys is about 35 GB, before Redis's own per-key overhead.
The next issue is the time it will take to send out 100 million requests and collect the results. Even if you don't wait for one server's response before sending the next request, this will still take a very long time.
And lastly, querying by field isn't going to work out in Redis with 100 million entries. You might instead consider MongoDB, if you want rich, on-the-fly queries, or Couchbase or CouchDB, if views (rather limited, predefined queries) are enough.
So in the end it comes down to the timeframe you are aiming for. If the program only needs to run once a week, the performance should be good enough. If you can provide the memory, Redis will work, but the queries on fields will be very slow, to the point where you frankly shouldn't use them. With big data sets you should only query fields that are indexed, and because Redis has no built-in support for secondary indexes, you can't do that there.
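The usual workaround is to maintain your own index sets at write time, one set per field value, though at this scale those sets multiply the memory problem. A hypothetical sketch with made-up key names, again assuming phpredis:

<?php
// Hypothetical manual "secondary index" in Redis: at write time, add the
// keyword to a set named after the field value; at query time, read that
// set back. At 100 million entries these index sets multiply the memory
// problem, which is why this is rarely a good fit here.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Write path: store the record and index it under its field1 value.
$redis->hMSet('kw:keyword_1', ['flag' => 1, 'output' => '{...}']);
$redis->sAdd('idx:field1:some text', 'keyword_1');

// Read path: find every keyword whose field1 equals "some text".
foreach ($redis->sMembers('idx:field1:some text') as $kw) {
    var_dump($redis->hGetAll("kw:$kw"));
}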
Agreed with @zahorak and Dagon, 100 million records is too much data to be kept in memory and queried. Additionally, for querying nested data structures, MongoDB is a strong candidate, as it supports n-level nesting (http://docs.mongodb.org/manual/tutorial/model-tree-structures-with-nested-sets/).
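For illustration, the kind of nested query this enables might look like the following hypothetical sketch, using the legacy mongo extension; the database and collection names are made up:

<?php
// Hypothetical sketch of a nested query in MongoDB via the legacy mongo
// extension; "mydb" and "keywords" are placeholder names.
$m = new MongoClient();
$collection = $m->mydb->keywords;

// Dot notation reaches into the nested "output.result" array directly.
$cursor = $collection->find(['output.result.field1' => 'some text']);
foreach ($cursor as $doc) {
    echo $doc['keyword'], "\n";
}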