MongoDB: Compound Index Decisions

I recently had to optimize certain sets of queries on our MongoDB deployment, and ran into this particular problem:

Say I have a query that matches exactly on A and B, does a range select on C, and sorts the output by D. In the shell it looks like:

db.collection.find({ A: 'something', B: 'something-else', C: { $gt: 100 } })
             .sort({ D: -1 }).limit(10)

I read a post last year about creating indexes for such a scenario; its basic rules were:

  1. Exact-value match fields go first
  2. The sort field comes second
  3. Range-search fields ($in, $gt, etc.) come last

Their tree explanation looked reasonable, so I went ahead and created an index accordingly:

db.collection.ensureIndex({ A:1, B:1, D:-1, C:-1 })

Now the problem: MongoDB decides BasicCursor is better than this index. If I hint() the full index it works (and is much faster), but doing that would require quite a few changes in our codebase, so we are trying to avoid it if at all possible.
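
For reference, forcing the full index per query with hint() looks like this (a sketch; hint() accepts the index key pattern or the index name):

db.collection.find({ A: 'something', B: 'something-else', C: { $gt: 100 } })
             .sort({ D: -1 }).limit(10)
             .hint({ A: 1, B: 1, D: -1, C: -1 })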


My questions are:

  1. Why does the MongoDB query optimizer decide that { A:1, E:-1 }, { D:-1 }, or even BasicCursor is better than { A:1, B:1, D:-1, C:-1 }, when my query includes all four fields?

  2. Is { A:1, D:-1 } redundant? The MongoDB docs do say that using only part of an index is less efficient.

Furthermore, we also have queries like the following:

db.collection.find({ A: { $in : ['str1','str2'] }, B: 'something', C: { $gt: 100 } })
             .sort({ D: -1 }).limit(10)

To query it efficiently, do we need an extra index like the following? Frankly, I am not sure how the MongoDB query optimizer will treat them.

db.collection.ensureIndex({ B:1, D:-1, C:-1, A:1 })
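
Rather than guessing, one way to compare the two candidate indexes empirically is to hint() each in turn and read nscanned from the (legacy-format) explain() output; a sketch:

db.collection.find({ A: { $in: ['str1','str2'] }, B: 'something', C: { $gt: 100 } })
             .sort({ D: -1 }).limit(10)
             .hint({ A:1, B:1, D:-1, C:-1 }).explain().nscanned

db.collection.find({ A: { $in: ['str1','str2'] }, B: 'something', C: { $gt: 100 } })
             .sort({ D: -1 }).limit(10)
             .hint({ B:1, D:-1, C:-1, A:1 }).explain().nscanned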

These are the explain() results for my query with and without hint().

It turns out it was defaulting to { A:1, E:-1 }, not { A:1, D:-1 }, which seems even stranger, as we didn't query on field E at all.

I dropped the index on { A:1, E:-1 }; now explain() tells me it defaults to { D:-1 }, so I dropped that one as well, and now MongoDB begins using BasicCursor... It doesn't seem to like either my full index or the { A:1, D:-1 } index (even though hinting either one results in much better performance).

This feels weird.

The only reason something "unusual" like this would happen is if your data distribution happens to be such that BasicCursor actually completes the query (i.e. finds all the matching documents) faster than an indexed query. Same thing for a "partial" index.

A specific case where this would happen, using your data structure as an example: if a has relatively few distinct values at the beginning of the collection, and b has extremely low cardinality (i.e. very few distinct values, like one or a handful), then scanning the collection in order, or using a "less efficient" index, will show equal or better performance than using the theoretically "ideal" index.
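
A test set with that shape can be seeded along these lines (a sketch: the collection name compound4 and the two indexes match the explain output below, but the exact distributions are only illustrative):

// First 1000 documents all share a=1, b=2.
for (var i = 0; i < 1000; i++) {
    db.compound4.insert({ a: 1, b: 2, c: i, d: i % 150 });
}
// Later documents are distributed very differently.
for (var i = 0; i < 100000; i++) {
    db.compound4.insert({
        a: Math.floor(Math.random() * 1000),
        b: Math.floor(Math.random() * 1000),
        c: Math.floor(Math.random() * 1000),
        d: Math.floor(Math.random() * 1000)
    });
}
db.compound4.ensureIndex({ a: 1 });
db.compound4.ensureIndex({ a: 1, b: 1, c: 1, d: 1 });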

Here's an example where the first 1000 documents have a=1 and b=2; later documents are distributed very differently.

> db.compound4.find({a:1, b:2, d:{$lt:100}}).sort({c:-1}).limit(10).explain(true)
{
    "cursor" : "BtreeCursor a_1",
    "isMultiKey" : false,
    "n" : 10,
    "nscannedObjects" : 18,
    "nscanned" : 18,
    "nscannedObjectsAllPlans" : 46,
    "nscannedAllPlans" : 56,
    "scanAndOrder" : true,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "a" : [
            [
                1,
                1
            ]
        ]
    },
    "allPlans" : [
        {
            "cursor" : "BtreeCursor a_1",
            "n" : 18,
            "nscannedObjects" : 18,
            "nscanned" : 18,
            "indexBounds" : {
                "a" : [
                    [
                        1,
                        1
                    ]
                ]
            }
        },
        {
            "cursor" : "BtreeCursor a_1_b_1_c_1_d_1 reverse",
            "n" : 10,
            "nscannedObjects" : 10,
            "nscanned" : 20,
            "indexBounds" : {
                "a" : [
                    [
                        1,
                        1
                    ]
                ],
                "b" : [
                    [
                        2,
                        2
                    ]
                ],
                "c" : [
                    [
                        {
                            "$maxElement" : 1
                        },
                        {
                            "$minElement" : 1
                        }
                    ]
                ],
                "d" : [
                    [
                        100,
                        -1.7976931348623157e+308
                    ]
                ]
            }
        },
        {
            "cursor" : "BasicCursor",
            "n" : 18,
            "nscannedObjects" : 18,
            "nscanned" : 18,
            "indexBounds" : {

            }
        }
    ]
}

Since the compound index is larger, it takes longer to traverse than the smaller partial index, and because the selectivity of b is not very good (i.e. it is very bad), that query plan falls behind.

After running my original query with explain(true), the reasoning behind the MongoDB query optimizer became much clearer.

It's because field C, a range search on an int field that we expected to have high selectivity, does not have many matching results at the start of the search (especially when results are ordered by D descending, which is a date field).

The basic logic behind the MongoDB query optimizer appears to be: given the same number of results n, use the plan (index) with the smallest nscanned (best if nscanned = n). In the case of a small n, e.g. limit(10), the optimizer may not use the most efficient index we planned for it.
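
That trade-off is visible directly in the explain(true) output: iterate over allPlans and print n against nscanned for each candidate plan. A sketch, using the legacy explain format shown above:

var exp = db.collection.find({ A: 'something', B: 'something-else', C: { $gt: 100 } })
                       .sort({ D: -1 }).limit(10)
                       .explain(true);
exp.allPlans.forEach(function (plan) {
    // A plan is efficient when nscanned is close to n.
    print(plan.cursor + ': n=' + plan.n + ', nscanned=' + plan.nscanned);
});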

In my case, it turns out that both the { D:-1 } index and BasicCursor beat the full index on such a test. So the mystery is solved.

PS: out of curiosity, these are the explain(true) outputs for my testing dataset:

Hope this helps someone :)