Compressing for Performance over Cost in OpenSearch

The OpenSearch folks who forked Elasticsearch are not stupid. That's why how one stores data in an OpenSearch cluster is an index-level configurable setting, not one flat setting applied throughout.

Which puts the onus on us: to determine the most efficient, cheapest, and fastest way to store data, for every different index.

If you have enough unused compute in your OpenSearch nodes, below are perhaps the best compression algorithms to use for your use cases, provided you know the read and write patterns of your index.

[Image: matrix of the best compression codec by read vs. write frequency for OpenSearch indexes]

Let’s look at the why:

  1. Frequent Writes, Frequent Reads: zstd_no_dict
    Reasoning: This codec offers the best write performance and very good read performance. Since both reads and writes are frequent, prioritizing speed is crucial. While its compression ratio is slightly lower than zstd's, the improved performance outweighs this in high-traffic scenarios. LZ4 would be another option, trading compression for speed.

  2. Few Writes, Frequent Reads: zstd
    Reasoning: Since writes are infrequent, the higher CPU cost of compression is less of a concern. zstd provides a better compression ratio than zstd_no_dict, reducing storage space, and read performance is still very good. If read performance is critical, you can again opt for zstd_no_dict.

  3. Few Writes, Few Reads: best_compression
    Reasoning: For archival or infrequently accessed data, maximizing compression is most important. best_compression (zlib) offers the highest compression ratio, minimizing storage space, even though it’s the slowest.

  4. Frequent Writes, Few Reads: zstd
    Reasoning: Writes are frequent, and zstd still delivers high write speed while maintaining a high compression ratio; since reads are few, its slightly slower decompression matters less. (Applying any of these codecs is sketched right after this list.)
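
Applying a codec is a one-line index setting. Below is a minimal sketch in Python, assuming a local OpenSearch at localhost:9200 with demo credentials and a hypothetical index named logs-2024; note that index.codec is a static setting, so it can only be set at index creation or on a closed index.

    # Minimal sketch: create an index whose segments are stored with zstd_no_dict.
    # "logs-2024" and the credentials are assumptions; adjust for your cluster.
    import requests

    resp = requests.put(
        "https://localhost:9200/logs-2024",
        json={"settings": {"index.codec": "zstd_no_dict"}},
        auth=("admin", "admin"),  # replace with your credentials
        verify=False,             # demo only; verify TLS certificates in production
    )
    print(resp.status_code, resp.json())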

‼️ However, do not use any of the above if:

  • you have small nodes in the cluster, with CPU averaging above 75%
  • or you have a use case that requires near-real-time ingestion & search

If speed or CPU/memory is a higher priority than storage cost, and near-real-time ingestion and search are crucial, then stick to the default codec, which relies on the LZ4 algorithm to compress your index data.

You can stop reading here, and trust my word. :)
But if you are still curious about the rationale (plus some bonus hacks), then keep scrolling.

About the Codecs

The above recommendations are the culmination of my experience running three-digit-terabyte OpenSearch clusters in production, and of years of tinkering with compression codecs.

Let's explore the 'why' behind the 'why', and how the different codecs bring different strengths and tradeoffs.

Codec                       default        best_compression   zstd        zstd_no_dict
Algorithm                   LZ4            zlib               Zstandard   Zstandard (no dictionary)
Compression ratio           Low            High ✅            High ✅     Moderate
Speed                       Very High ✅   Low                High        Very High ✅
CPU usage                   Low ✅         High               Moderate    Moderate
Memory usage                Low ✅         Low ✅             Moderate    Moderate
Default compression level   NA             NA                 3           3

Here is how the four codecs above were designed:

  • LZ4: Designed for raw speed. Low CPU and memory footprint. Ideal where fast compression/decompression is paramount.
  • zlib: Good balance of speed and compression, but slower than LZ4. Higher CPU usage than LZ4, especially during compression. Low memory usage.
  • Zstd: Offers a tunable balance. Can achieve compression ratios comparable to zlib with significantly better speed. Moderate CPU and memory usage, tunable by adjusting the compression level (the sketch below makes this tangible).
  • zstd_no_dict: Faster than zstd because it skips dictionary compression, trading compression ratio for speed.
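
To see these tradeoffs outside of OpenSearch, here is a small standalone sketch using the zstandard and lz4 PyPI packages: the same underlying algorithms, though not the Lucene codecs themselves. The sample payload and level choices are illustrative assumptions.

    # Standalone sketch comparing LZ4 against zstd at two compression levels.
    # Requires: pip install zstandard lz4
    import time
    import zstandard
    import lz4.frame

    # Illustrative, highly repetitive log-like payload (~2.8 MB).
    data = b"GET /api/v1/items 200 12ms user=42 region=eu-central-1\n" * 50_000

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{name:>14}: ratio {len(data) / len(out):6.2f}, {elapsed_ms:7.2f} ms")

    bench("lz4", lz4.frame.compress)
    bench("zstd level 3", zstandard.ZstdCompressor(level=3).compress)
    bench("zstd level 6", zstandard.ZstdCompressor(level=6).compress)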

Backing up with Data

The truth about zstd

The original zstd launch post from Facebook Engineering [1] has the caveat in plain sight: zstd achieves great compression ratios at the cost of lower speeds.

[Image: Facebook Engineering benchmark comparing zstd with zlib, LZ4, and xz]

The data reveals that while zstd gains a 49% higher compression ratio than LZ4, LZ4 is still 226.5% faster than zstd at compression, and 303.8% faster at decompression.

From AWS benchmarks (which avoid discussing CPU & memory overhead), we see that zstd brings only a slight increase (1%) in median and P90 read latencies. For writes, median latency is the same between zstd and LZ4, and P90 latency increases by 2% with zstd.
Of course, it cannot be denied that zstd yields 35% savings in disk footprint due to higher compression, and a 7% improvement in throughput compared to the default LZ4.

zstd_no_dict instead provides a nicer balance: a 30% reduced disk footprint, 2% better P90 read latency and 5% better P90 write latency than LZ4, and a greater 14% improvement in throughput.

A 5% larger index size with zstd_no_dict (vs. zstd) still buys demonstrably faster writes than all other codecs, with significant improvements in both latency and throughput.

Looking at public benchmarks done by others, one finds that zstd_no_dict provides higher compression & decompression speeds as well.
[Image: third-party benchmark showing zstd_no_dict's higher performance]

Where LZ4 shines

LZ4 has time and again been found to be performant: it uses the lowest CPU and memory while providing the highest compression and decompression speeds, at the price of a lower compression ratio.

People have thoroughly tested LZ4 in research papers, on Stack Overflow, and in personal blogs.

[Image: comparison of compression algorithms from a 2020 research paper]

Reading the graph from a 2020 paper [2] (above), one finds that LZ4's lowest compression speed is still 50% higher than Zstd's lowest.

[Image: Gregory Szorc's data on compression and decompression speed & ratio]
Looking at the benchmarks run by Gregory Szorc [3] (above), one finds that if speed is of utmost importance, LZ4 tops all charts. If a somewhat slower but amazing compression ratio is needed, Zstd is the way to go.

[Image: Stack Overflow benchmark findings]
People on Stack Overflow, too, have found [4] (above) that LZ4 is blazing fast when it comes to compression speeds.

LZ4 uses less CPU and memory than Zstd; a paper from 2023 [5] covers this as well.
[Image: 2023 paper comparing CPU & memory consumption of Zstd, Snappy, and LZ4]

Concluding

Be smart. Know what you are trading off. If you have unused CPU & memory but limited storage, use zstd_no_dict. If you have a need for speed, or don't have many nodes to optimize storage for, then stick to the default.

Bonus Hacks

Being the smarter platform engineer

Here are some time-tested tips for balancing read/write performance and storage cost when running large OpenSearch clusters.

  1. Shard responsibly. Sure, you want to save on storage cost by using zstd. Then spread out the compression load of the segments by sharding your index, so that the CPU & RAM load is distributed across multiple nodes during searches and writes.
    Don't shoot yourself in the foot by selecting zstd for a 1.5 TB index and then giving it just 10 shards. Go with 20 or 30.

  2. Compress more. Hmm, why stick to the default compression level (3) if you really want to save costs? Use the index.codec.compression_level index setting and raise it to the maximum (6) for zstd. Go all-in. What even is a CPU? (A sketch combining this with reindexing follows this list.)

  3. Hardware acceleration for Intel clusters: If you are not using UltraWarm or Graviton, and your cluster is fully Intel, you can use the qat_lz4 and qat_deflate codec values in OpenSearch v2.15+ to get hardware-accelerated compression and juice out performance. But only for LZ4 and deflate, not zstd. 🙂

  4. Use reindex to mutate. Did you know you could keep your more recent indices on LZ4 for performance, and older indices on zstd for optimized storage? With reindexing, you could also consolidate a 10-shard index into a 5-shard one (see the sketch after this list). Be creative!

  5. Talk to your developers. Understand their usage, and see whether their access patterns are read-heavy or sparse. Don't put zstd on a CPU-constrained cluster and hope you won't get paged to add nodes because writes & searches suddenly slowed down. Don't trade short-term storage costs for long-term cluster-management costs.

  6. Search more, fetch less. Use zstd if you know that not all fields in your index are returned at once, because even AWS says that fetching all stored fields may result in increased latency.

  7. Stop complicating life. Understand your customers & constraints before wanting to replicate someone else’s success.

  8. Get a life. Who even reads until the end these days. Scroll back to the few/frequent read/write matrix.
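
As promised in hacks #2 and #4, here is a minimal sketch that moves an aging index onto a storage-optimized layout: a new index using zstd at the maximum compression level and half the shards, populated via the _reindex API. The index names, shard counts, and connection details are illustrative assumptions.

    # Minimal sketch: reindex a hypothetical 10-shard LZ4 index ("logs-2023")
    # into a 5-shard zstd index compressed at the maximum level (6).
    # Assumes a local OpenSearch at https://localhost:9200.
    import requests

    BASE = "https://localhost:9200"
    OPTS = {"auth": ("admin", "admin"),   # replace with your credentials
            "verify": False}              # demo only; verify TLS in production

    # 1. Create the storage-optimized target index.
    requests.put(f"{BASE}/logs-2023-archive", json={
        "settings": {
            "index.codec": "zstd",
            "index.codec.compression_level": 6,
            "index.number_of_shards": 5,
        }
    }, **OPTS)

    # 2. Copy the documents across; OpenSearch handles the re-sharding.
    resp = requests.post(f"{BASE}/_reindex", json={
        "source": {"index": "logs-2023"},
        "dest": {"index": "logs-2023-archive"},
    }, **OPTS)
    print(resp.json())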
