Introduction
OpenAI is one of the companies at the forefront of the AI movement. ChatGPT is one of the most popular LLM applications, and it grows every day. OpenAI handles millions of queries per second and uses Postgres for critical workloads. Their use of Postgres isn’t a surprise; Postgres is one of the most popular relational databases. What is surprising, however, is how they optimize Postgres to support their needs. Most of their optimizations are based on battle-tested best practices.
In a recent talk titled “Scaling Postgres for OpenAI”, members of OpenAI’s technical staff shared how they scaled Postgres for their use cases. My takeaway from the talk is that they aren’t using exotic optimizations to get the best performance. They’re monitoring their database, finding bottlenecks, and applying best practices to fix issues.
Context
Here’s the context. OpenAI has a single primary database and several read replicas. At the start of their performance journey, the primary served reads and writes, while the read replicas served a subset of read queries. Over time, they had several incidents related to handling high traffic, and spent time hardening the database.
The talk has more in-depth analysis, but I want to highlight how they approach performance. Every optimization they made is in service of reducing load on their primary database as much as possible.
Techniques
1. Review Your Application for Unneeded Queries
The team profiled their application and found code paths issuing read queries that didn’t need to be made. Removing these queries reduced the number of requests hitting the primary database.
2. Audit ORM Queries and Use Raw SQL Where Needed
Object-Relational Mapping (ORM) libraries let us interact with the database without writing a single line of SQL. We get to focus on the relationships between our data rather than on writing the correct joins. The tradeoff is that the ORM may generate complex SQL when simpler queries would suffice. The OpenAI team reviewed the SQL generated by their ORM and replaced over-complicated queries with raw SQL where it made sense.
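As a sketch of the idea, the snippet below uses Python’s built-in sqlite3 module as a stand-in for Postgres so it runs anywhere; the tables and queries are hypothetical, not from the talk. An ORM loading “users with their orders” can fall into the classic N+1 pattern (one query per user), while a single hand-written aggregate query does the same work in one round trip:

```python
import sqlite3

# sqlite3 stands in for Postgres here; the pattern is identical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 9.0);
""")

# An ORM might fetch all users, then issue one query per user for their
# orders (N+1 queries total). One hand-written aggregate replaces them all:
raw_sql = """
    SELECT u.name, COUNT(o.id) AS order_count, SUM(o.total) AS spend
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
    ORDER BY u.id
"""
rows = conn.execute(raw_sql).fetchall()
```

The point isn’t to abandon the ORM everywhere, only to drop down to raw SQL on the hot paths where the generated query is demonstrably worse.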
3. Move Read Queries to the Read Replicas
To reduce load on the primary database, the OpenAI team moved as many read queries as possible to read replicas. This comes with a consistency tradeoff: replicas take some time to receive the latest changes from the primary, especially when replication lag is high. However, reducing load on the primary by an order of magnitude was an acceptable tradeoff for the potential impact on consistency.
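A minimal sketch of how such routing can work, with plain strings standing in for real connection pools (the `ReplicaRouter` class and its heuristic are my illustration, not OpenAI’s implementation):

```python
import itertools

class ReplicaRouter:
    """Route writes to the primary and spread reads across replicas.

    Hypothetical sketch: 'primary' and 'replicas' are just labels here;
    in a real system they would be connection pools or DSNs.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas so read load is spread evenly.
        self._replica_cycle = itertools.cycle(replicas)

    def connection_for(self, sql):
        # Crude heuristic: anything that isn't a SELECT mutates state and
        # must go to the primary. Real routers often also pin a session to
        # the primary right after a write to dodge replication-lag anomalies.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replica_cycle)
        return self.primary

router = ReplicaRouter("primary", ["replica-1", "replica-2"])
targets = [router.connection_for(q) for q in [
    "SELECT * FROM users",
    "INSERT INTO users VALUES (1)",
    "SELECT 1",
]]
```

In practice the routing decision usually lives in the data-access layer or in middleware, so application code never has to pick a host by hand.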
4. Use Timeouts for Queries
Long-running queries may hold onto connections and prevent other transactions from being processed. Configuring statement and transaction timeouts ensures the database won’t fail to process other queries because of a few long-running transactions. These timeouts can be applied at the application level or in Postgres itself.
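On the Postgres side, these limits map to standard configuration parameters. The values below are hypothetical placeholders to tune for your workload, and `app_user` is an assumed role name:

```sql
-- Kill any single statement that runs longer than 5 seconds:
ALTER ROLE app_user SET statement_timeout = '5s';

-- Kill transactions that sit idle while holding a connection (and locks):
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '10s';

-- The same settings can be applied per session from the application:
SET statement_timeout = '5s';
```

Setting them per role (or per session) rather than globally lets you give background jobs longer budgets than latency-sensitive request paths.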
5. Use Connection Pools to Optimize Connection Usage
The OpenAI team introduced PgBouncer, a connection pooler for Postgres, to optimize connection usage. Connection pooling is a common technique for efficiently managing the number of connections an application opens to a database. Most database libraries provide connection pooling at the application level: the library manages the connections and re-uses them as efficiently as possible. PgBouncer moves pooling one level closer to the database server. App servers connect to the PgBouncer proxy just like they would a database, and PgBouncer creates or re-uses server connections as efficiently as possible. Whether you go with PgBouncer or use the built-in functionality from your DB library, connection pooling is always a good idea.
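To make the application-level variant concrete, here is a minimal, single-threaded pool sketch (not PgBouncer, and not production code); the `ConnectionPool` class and the fake connection factory are assumptions for illustration:

```python
import queue

class ConnectionPool:
    """Minimal sketch of what a DB library's pool does for you.

    `factory` is any zero-argument callable that opens a new connection;
    a fake one is used below, since no real Postgres server is assumed.
    Not thread-safe: a real pool would guard its counters with a lock.
    """

    def __init__(self, factory, max_size=5):
        self._factory = factory
        self._idle = queue.Queue(maxsize=max_size)
        self._created = 0
        self._max_size = max_size

    def acquire(self):
        try:
            return self._idle.get_nowait()   # re-use an idle connection
        except queue.Empty:
            if self._created < self._max_size:
                self._created += 1
                return self._factory()       # open a new one, up to the cap
            return self._idle.get()          # at the cap: wait for a release

    def release(self, conn):
        self._idle.put(conn)

# Usage with a fake connection factory that records every "open":
opened = []
pool = ConnectionPool(lambda: opened.append(object()) or opened[-1], max_size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()   # re-uses c1 instead of opening a third connection
```

The cap is the whole point: no matter how many requests arrive, the database never sees more than `max_size` connections from this process, which is the same pressure-limiting role PgBouncer plays server-side for the whole fleet.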
Conclusion
Database performance doesn’t have to be complicated. Even hyper-scale AI companies lean on best practices to get the performance they’re looking for.
To learn more about how OpenAI scaled their Postgres instances, give the talks a view here:
- Full Talk (50 minutes): https://www.youtube.com/watch?v=Ni1SGhNu-Q4
- Truncated Talk (25 minutes): https://www.youtube.com/watch?v=NvY2kvi1Fa0
Leave a comment about how you’ve improved database performance in your systems!
And here are some related resources about PgBouncer and Postgres Config:
- PgBouncer: https://www.pgbouncer.org/usage.html
- Postgres Client Configuration: https://www.postgresql.org/docs/current/runtime-config-client.html