In a blog post titled "New Benchmarks Show Postgres Dominating MongoDB in Varied Workloads", EDB claims that Postgres outperforms MongoDB in document-based data tests. While I generally find such marketing benchmarks useless, they do highlight the mistakes vendors make when comparing a database they know deeply with one they don’t know and are unwilling to learn about. This provides an opportunity to teach efficient index and query design.
There are four queries in this benchmark, with data loaded from the GitHub Archive for 2015. The data is in JSON, which makes it suitable for testing queries on documents, and the queries are OLAP-style, using aggregation pipelines in MongoDB.
Load documents to a collection
I used the same method to load the data into a small lab environment and reproduce the queries:
# Download each hourly GitHub Archive file of 2015, decompress it, and import it
# (non-existent dates like 2015-02-30 simply fail to download and are skipped)
for file in http://data.gharchive.org/2015-{01..12}-{01..31}-{0..23}.json.gz
do
  wget -q -O - "$file" |
    gunzip -c |
    mongoimport --collection="github2015"
done
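To verify the load, a quick count of the imported documents (the exact number depends on how many archive files downloaded successfully):
// Count all documents in the collection
db.github2015.countDocuments()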
Here is an example of one document:
db.github2015.find().limit(1);
[
{
_id: ObjectId('684ee281d8d90a383a078112'),
id: '2489368070',
type: 'PushEvent',
actor: {
id: 9152315,
login: 'davidjhulse',
gravatar_id: '',
url: 'https://api.github.com/users/davidjhulse',
avatar_url: 'https://avatars.githubusercontent.com/u/9152315?'
},
repo: {
id: 28635890,
name: 'davidjhulse/davesbingrewardsbot',
url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'
},
payload: {
push_id: 536740396,
size: 1,
distinct_size: 1,
ref: 'refs/heads/master',
head: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
before: '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
commits: [
{
sha: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
author: {
email: 'da8d7d1118ca5befd4d0d3e4f449c76ba6f1ee7e@live.com',
name: 'davidjhulse'
},
message: 'Altered BingBot.jar\n\nFixed issue with multiple account support',
distinct: true,
url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'
}
]
},
public: true,
created_at: '2015-01-01T00:00:00Z'
}
]
This dataset is ideal for testing a document database because:
- documents have an average size of three kilobytes, with some up to two megabytes (verified with the snippet below).
- it features a structured format with sub-documents like "actor" and "payload".
- it contains arrays, such as "payload.commits", necessitating multi-key or inverted indexes.
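Those size figures can be checked with a quick aggregation; here is a sketch using the $bsonSize operator (available since MongoDB 4.4):
// Average and maximum BSON size of the documents, in bytes
db.github2015.aggregate([
  { $group: {
      _id: null,
      avgSize: { $avg: { $bsonSize: "$$ROOT" } },
      maxSize: { $max: { $bsonSize: "$$ROOT" } }
  } }
])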
I will test the four queries used by the benchmark and include an additional one that queries the array "payload.commits", which the benchmark overlooked. MongoDB’s multi-key indexes significantly outperform all SQL databases in this regard (see the Multi-key Indexes series). While many vendor benchmarks limit their test coverage to create a favorable impression, my objective is to transparently showcase indexing best practices.
Single-field indexes
Here are the indexes that were created for the vendor benchmark:
db.github2015.createIndex( {type:1} )
db.github2015.createIndex( {"repo.name":1} )
db.github2015.createIndex( {"payload.action":1} )
db.github2015.createIndex( {"actor.login":1} )
db.github2015.createIndex( {"payload.issue.comments":1} )
Seeing the index definitions, I already understand why they got better results on PostgreSQL. Complex queries rarely filter or sort on a single field, and a well-designed database should have compound indexes. Without the right compound indexes, PostgreSQL can combine multiple single-field indexes with bitmap scans, but that is not ideal: a bitmap scan cannot cover range filters or sort orders. It can, however, reduce the number of indexes to create, which matters because each additional index adds work to the vacuum process.
The document model in MongoDB offers the advantage of having all important fields consolidated within a single document. This allows for the use of a compound index that can effectively handle equality, range, and sort order. These indexes can be applied to both scalar values and arrays.
Benchmark queries
I ran the benchmark queries in my lab to illustrate indexing practices. I created the appropriate indexes; the best starting point for understanding them is The ESR (Equality, Sort, Range) Guideline.
Query a) Repositories ordered by number of open issues
The query is:
db.github2015.aggregate([
{ $match: { $and: [ { type: "IssuesEvent"} , { "payload.action" : "opened" } ] }},
{ $group: { _id: "$repo.name", total: { $sum: 1 } } },
{ $sort: { total: -1 } }
])
Index according to the ESR guideline:
- Equality: "type" and "payload.action", with the less selective field first (better compression)
- Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex(
{ "payload.action": 1, "type": 1, "repo.name": 1 }
)
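The execution statistics shown below come from running the aggregation through explain; a sketch of how to obtain them:
// Capture execution statistics for the aggregation
db.github2015.explain("executionStats").aggregate([
  { $match: { $and: [ { type: "IssuesEvent" }, { "payload.action": "opened" } ] } },
  { $group: { _id: "$repo.name", total: { $sum: 1 } } },
  { $sort: { total: -1 } }
])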
In my test, the query scanned nearly three hundred thousand index keys and returned in 777 milliseconds:
executionStats: {
executionSuccess: true,
nReturned: 72361,
executionTimeMillis: 777,
totalKeysExamined: 278489,
totalDocsExamined: 0,
Query b) Return git event types ordered by quantity, from most to least
The query is:
db.github2015.aggregate([
{ $group: { _id: "$type", total: { $sum: 1 } } },
{ $sort: { total: -1 } }
])
Index according to the ESR guideline:
- Sort: "type", as grouping is faster on sorted keys
db.github2015.createIndex({ "type": 1 })
This index was created by the benchmark; however, it is important to know that the query planner will not choose it without an equality or range predicate, even if it could help with sorting or grouping.
The query should simply add an unbounded range on the sort key:
db.github2015.aggregate([
{ $match: { "type": { $gt: MinKey} }},
{ $group: { _id: "$type", total: { $sum: 1 } } },
{ $sort: { total: -1 } }
])
In my test, the query aggregated nearly ten million keys in 4.6 seconds:
executionStats: {
executionSuccess: true,
nReturned: 14,
executionTimeMillis: 4585,
totalKeysExamined: 9480600,
totalDocsExamined: 0,
Query c) Return the top 10 most active actors
The query is:
db.github2015.aggregate([
{ $group: { _id: "$actor.login", events: { $sum: 1 } } },
{ $sort: { events: -1 } },
{ $limit: 10 }
])
Index according to the ESR guideline:
- Sort: "actor.login", as grouping is faster on sorted keys
db.github2015.createIndex({ "actor.login": 1 })
As seen above, we need to add an unbounded range:
db.github2015.aggregate([
{ $match: { "actor.login": { $gt: MinKey} }},
{ $group: { _id: "$actor.login", events: { $sum: 1 } } },
{ $sort: { events: -1 } },
{ $limit: 10 }
])
In my test, the query aggregated nearly ten million keys in 13.5 seconds:
executionStats: {
executionSuccess: true,
nReturned: 727312,
executionTimeMillis: 13498,
totalKeysExamined: 9480600,
totalDocsExamined: 0,
If you need to optimize this access pattern further, use the computed design pattern: maintain a per-actor event counter in a separate collection, incremented with the atomic $inc operator.
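As a sketch of this pattern, assuming a hypothetical "actors" collection keyed by login, each event insert also bumps a counter, and the top-10 query becomes a short indexed read:
// Hypothetical "actors" collection: one counter document per login
db.actors.updateOne(
  { _id: "davidjhulse" },     // the actor's login as the document key
  { $inc: { events: 1 } },    // atomic increment, safe under concurrent writes
  { upsert: true }            // create the counter on the first event
)
// With an index on { events: -1 }, the top 10 is a simple sorted read
db.actors.createIndex({ events: -1 })
db.actors.find().sort({ events: -1 }).limit(10)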
Query d) Return repositories that have more than two comments and a specific event type, ordered by average comments from most to least
The query is:
db.github2015.aggregate([
{ $match: { "type": "PushEvent", "payload.issue.comments": { $gt : 2 } } },
{ $group: { _id: "$repo.name", avg: { $avg: "$payload.issue.comments" } } },
{ $sort: { avg: -1 } }
])
Index according to the ESR guideline:
- Equality: "type"
- Sort: "repo.name", as grouping is faster on sorted keys
- Range: "payload.issue.comments"
db.github2015.createIndex(
{ "type": 1, "repo.name": 1, "payload.issue.comments": 1 }
)
I don’t know if it was done on purpose for the benchmark, but the dataset has no "PushEvent" with a "payload.issue", so the query returns an empty result.
For this particular case, as there are no keys to sort, it is better to place the range key before the sort key:
db.github2015.createIndex(
{ "type": 1, "payload.issue.comments": 1, "repo.name": 1 }
)
The index immediately finds that there are no keys for those bounds:
executionStats: {
executionSuccess: true,
nReturned: 0,
executionTimeMillis: 1,
totalKeysExamined: 0,
totalDocsExamined: 0,
Ad-Hoc queries for OLAP
While acceptable performance can be achieved with the right index, it may not be the optimal solution for OLAP use cases. Instead, consider creating a single Atlas Search index to handle all related queries. For further guidance, refer to my previous post: Search Index for Reporting.
Atlas Search Indexes are maintained asynchronously on a dedicated node, ensuring no impact on the operational database. They provide a near-real-time state without the complexity of streaming changes to another database.
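As a sketch, assuming an Atlas deployment (or MongoDB 7.0+ with search nodes), a single search index with dynamic mappings can serve this kind of ad-hoc reporting:
// A minimal search index that dynamically indexes all fields
db.github2015.createSearchIndex(
  "reporting",
  { mappings: { dynamic: true } }
)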
Another Query: Recent push events by user
The recent push events by a user are a relevant use case for this dataset. However, the benchmark did not run any queries on "commits", since it is an array. While PostgreSQL supports JSON, it cannot be directly compared to a document database that handles non-scalar fields more effectively. If you run the benchmark on PostgreSQL, try this query:
SELECT
data->'repo'->>'name' as repo_name,
data->'payload'->'commits' as commits_info
FROM "github2015"
WHERE
data->>'type' = 'PushEvent'
AND data->'payload'->'commits' @> '[{"author": {"name": "ggolden@umich.edu"}}]'
ORDER BY
data->>'created_at' DESC
LIMIT 5;
You can explore various indexes, such as a GIN index, but you will never find one that directly retrieves only the five documents needed for the result. If you do, please correct me in a comment.
On MongoDB, the same index guideline applies:
- Equality: "type" (low selectivity) and "payload.commits.author.name" (will be multi-key)
- Sort: "created_at" (a must for pagination queries)
db.github2015.createIndex({
"type": 1,
"payload.commits.author.name": 1,
"created_at": -1
})
The query is simple and doesn’t even need an aggregation pipeline:
db.github2015.find({
"type": "PushEvent",
"payload.commits.author.name": "ggolden@umich.edu"
}, {
"repo.name": 1,
"payload.commits.author.name": 1,
"payload.commits.message": 1,
_id: 0
}).sort({
"created_at": -1
}).limit(5)
The execution statistics indicate that only 5 documents were read, the minimum required for the result:
executionStats: {
executionSuccess: true,
nReturned: 5,
executionTimeMillis: 0,
totalKeysExamined: 5,
totalDocsExamined: 5,
executionStages: {
The execution plan shows how a single seek read five index entries from the multi-key index. It applies index bounds for both the equality filters and the sort order, ensuring efficient retrieval:
stage: 'IXSCAN',
nReturned: 5,
executionTimeMillisEstimate: 0,
...
isMultiKey: true,
multiKeyPaths: {
type: [],
'payload.commits.author.name': [ 'payload.commits' ],
created_at: []
},
...
direction: 'forward',
indexBounds: {
type: [ '["PushEvent", "PushEvent"]' ],
'payload.commits.author.name': [ '["ggolden@umich.edu", "ggolden@umich.edu"]' ],
created_at: [ '[MaxKey, MinKey]' ]
},
keysExamined: 5,
seeks: 1,
dupsTested: 5,
Conclusion
Vendor benchmarks can be misleading in their timing comparisons and performance claims. However, when analyzed critically, they serve an educational role by highlighting common design mistakes in the database they compare against.
In this post, I applied The ESR (Equality, Sort, Range) Guideline, looked at the execution plans, and tested queries that are relevant to a document model, to show the power of MongoDB for querying JSON data.