Distributed databases promise linear scaling in both data size and throughput (at least in theory). In practice, harnessing the full power of a distributed system, especially one based on PostgreSQL like Citus, demands a holistic approach. While advanced partitioning schemes, elaborate caching layers, and carefully designed schemas are useful, the biggest gains often come from understanding hardware constraints, transaction patterns, and the internals of how PostgreSQL implements critical components like checkpointing and WAL (Write-Ahead Logging).
This article describes how we tuned a CitusDB (distributed PostgreSQL) cluster to achieve roughly 10x throughput by focusing primarily on tuning the database for its hardware environment and workload characteristics, rather than implementing fancy, complex sharding strategies.
We run a Citus cluster of powerful servers: each node has roughly 1 TB of RAM, fast NVMe storage, and plenty of modern CPU cores. However, we share these machines with other teams and thus cannot assume exclusive access to all resources. Still, that hardware offers tremendous potential throughput if configured properly.
Our workload is write-heavy OLTP: transactions issue a constant mix of INSERT, UPDATE, and SELECT statements.

Citus offers powerful sharding strategies, from hash distribution on specific columns (e.g., warehouse_id) to replicating read-mostly tables. In many typical scenarios (like “warehouse-centric” transaction processing), these designs absolutely matter. However, in our tests, we discovered that beyond a basic partitioning scheme, the real bottlenecks lay elsewhere. Adopting a fancy distribution strategy wouldn’t help if I/O contention, memory pressure, or checkpoints kept slowing everything down.
In other words, for our specific environment, focusing on hardware-aware tuning and understanding PostgreSQL’s internals yielded the largest performance jump. Your mileage may vary, but we believe the lesson is universal: always measure to see where the bottleneck actually is before investing effort in more complex solutions.
When we first deployed our database and ran the workload, throughput fell well short of what the hardware suggested was possible. Unsurprisingly, the system was under heavy transactional load with many frequent updates, an environment that can quickly saturate I/O when the database tries to write dirty buffers during a checkpoint.
A cardinal rule of tuning is: find the real bottleneck first. A seemingly slow query might be the symptom, not the cause. In our case, we suspected that massive updates, particularly on frequently accessed tables, were generating large WAL volumes, which in turn triggered frequent, extended checkpoints.
In PostgreSQL (and by extension, in Citus), WAL records every data change. Eventually, the database checkpoints, ensuring all dirty pages are flushed to disk so that recovery does not require replaying an overly large set of WAL files.
By default, PostgreSQL schedules these checkpoints every 5 minutes. However, in our high-throughput, I/O-intensive environment, each checkpoint was taking an unexpectedly long time to complete—sometimes consuming 2 to 3 minutes of processing time.
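To see how often checkpoints fire and how long they take, the built-in statistics views are a good starting point. Here is a minimal sketch; the columns below live in pg_stat_bgwriter on PostgreSQL 16 and earlier, and move to pg_stat_checkpointer in newer releases:

```sql
-- How often checkpoints run, and how much time they spend writing and syncing.
SELECT checkpoints_timed,       -- checkpoints triggered by checkpoint_timeout
       checkpoints_req,         -- checkpoints forced early (e.g., WAL filled up)
       checkpoint_write_time,   -- total ms spent writing dirty buffers
       checkpoint_sync_time     -- total ms spent fsyncing files
FROM   pg_stat_bgwriter;

-- Setting log_checkpoints = on also records each checkpoint's duration in the server log.
```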
Using extensions like pg_stat_statements, we identified that many of the slowest queries were simple accesses to our most frequently updated table. While the queries themselves were efficient, the heavy I/O load imposed by the checkpoint operations was the real culprit behind our performance degradation.
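For reference, a query along these lines surfaces the heaviest statements. It is a sketch rather than our exact monitoring setup, and the column names assume PostgreSQL 13 or later (older releases use total_time / mean_time):

```sql
-- Top statements by total execution time (requires the pg_stat_statements extension).
SELECT query,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       shared_blks_written                -- hints that a statement drives heavy writes
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;
```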
Why can checkpoints stall performance so drastically?
- If checkpoint_timeout is too short, checkpoints happen too frequently, causing repetitive heavy writes.
- If checkpoint_completion_target is small, PostgreSQL attempts to complete each checkpoint as quickly as possible, hammering I/O in a short burst.
- If max_wal_size is too small, PostgreSQL is forced to checkpoint more often to avoid filling up WAL.

All of these contribute to “checkpoint spikes,” during which throughput can plummet.
After confirming that checkpointing was indeed the major bottleneck, we adjusted PostgreSQL parameters to minimize checkpoint-induced stalls:

- checkpoint_timeout: lengthened well beyond the 5-minute default so checkpoints run less often.
- checkpoint_completion_target: raised toward 1.0 so each checkpoint’s writes are spread across the whole interval instead of arriving in one burst.
- max_wal_size: enlarged so PostgreSQL is not forced into early checkpoints just to keep WAL from filling up.
- wal_buffers: increased to give WAL more room in shared memory before it must be flushed.
- I/O concurrency settings (maintenance_io_concurrency and effective_io_concurrency): tell PostgreSQL how many parallel I/O operations it can perform, factoring in the system’s underlying storage.

Through a mix of these changes, we drastically reduced how often large checkpoint write storms occurred, thereby smoothing out performance. The net effect was that transactions spent more time actually processing requests and less time waiting on I/O.
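As a concrete sketch of what this looks like, the settings below use illustrative placeholder values rather than the exact numbers from our cluster; the right values depend on your storage, your RAM, and how long a crash recovery you can tolerate:

```sql
-- Illustrative values only; size them for your own hardware and recovery goals.
-- Applied with ALTER SYSTEM, they land in postgresql.auto.conf.
ALTER SYSTEM SET checkpoint_timeout = '30min';        -- checkpoint less often than the 5min default
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread checkpoint writes over the interval
ALTER SYSTEM SET max_wal_size = '64GB';               -- avoid WAL-forced checkpoints during bursts
ALTER SYSTEM SET wal_buffers = '64MB';                -- more room for WAL before it must be flushed
ALTER SYSTEM SET effective_io_concurrency = 200;      -- NVMe can service many concurrent I/Os
ALTER SYSTEM SET maintenance_io_concurrency = 200;    -- same idea for maintenance work (PostgreSQL 13+)

SELECT pg_reload_conf();  -- most of these are reloadable; wal_buffers requires a restart
```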
Large memory (1 TB RAM per node) can be a double-edged sword: more of the working set stays cached, but far more dirty pages can also accumulate between checkpoints, making each flush heavier.
Key parameters we adjusted:

- shared_buffers: how much RAM PostgreSQL dedicates to its own buffer cache.
- bgwriter_lru_maxpages and bgwriter_lru_multiplier: how aggressively the background writer cleans dirty buffers between checkpoints, leaving less work for each checkpoint itself.
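A hedged sketch of those memory-related settings, again with placeholder values rather than our production numbers:

```sql
-- Placeholder values: shared_buffers in particular must be balanced against the OS
-- page cache and the other tenants sharing the machine.
ALTER SYSTEM SET shared_buffers = '128GB';        -- a deliberate fraction of the 1 TB of RAM, not all of it
ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;    -- let the background writer clean more pages per round
ALTER SYSTEM SET bgwriter_lru_multiplier = 4.0;   -- clean ahead of anticipated buffer demand
-- shared_buffers requires a restart; the bgwriter settings take effect on reload.
```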
Under high concurrency, distributed Postgres can encounter several performance-sapping phenomena. For us, the most visible was backends stalling on I/O whenever heavy checkpoint activity was underway.

For common OLTP-like workloads, keeping the hottest tables (such as a warehouse or user table) well indexed and cheap to update can drastically boost throughput.

In our scenario, the warehouse table was updated the most in payment-related transactions. We found the best fix (beyond applying indexes) was to ensure that writes to warehouse were less frequently stalled by checkpoint or background processes; see the wait-event sketch below. That was a huge part of letting concurrency scale more smoothly.
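One simple way to sanity-check this kind of stall (a rough sketch, not a full monitoring setup) is to look at what active sessions are currently waiting on:

```sql
-- If many active backends show wait_event_type = 'IO' (or WAL-related LWLock waits),
-- writes to hot tables are likely queuing behind checkpoint/WAL flushing rather than
-- behind each other.
SELECT wait_event_type, wait_event, count(*)
FROM   pg_stat_activity
WHERE  state = 'active'
GROUP  BY wait_event_type, wait_event
ORDER  BY count(*) DESC;
```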
Though we did not rely heavily on complex sharding, we still used a modest distribution strategy:

- Hash-distributing the main transactional tables on a single column (warehouse_id).
- Replicating small, read-mostly lookup tables (such as item) to each node.

This approach already localizes most queries to a single node and avoids cross-node chatter for typical transactions. The complexity of implementing a second distribution dimension (like item_id) typically wasn’t worth it once we realized the real gating factor was I/O from checkpoint storms.
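For illustration, the distribution itself boils down to a couple of Citus calls. The orders and order_lines table names here are hypothetical stand-ins for our transactional tables; warehouse_id and item come from the schema described above:

```sql
-- Hash-distribute the large transactional tables on warehouse_id, co-locating them so
-- that joins on that key stay on a single worker node.
SELECT create_distributed_table('orders', 'warehouse_id');
SELECT create_distributed_table('order_lines', 'warehouse_id', colocate_with => 'orders');

-- Replicate the small, read-mostly lookup table to every node as a reference table.
SELECT create_reference_table('item');
```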
By focusing primarily on checkpoint and memory tuning, we achieved roughly 10x the throughput of our initial baseline, with far fewer of the periodic stalls that had plagued the untuned configuration.
While it might be tempting to apply advanced sharding or partitioning strategies to achieve data locality, our experiments showed that such approaches offered diminishing returns when the system was primarily constrained by I/O. By focusing our efforts on hardware-aware tuning to get the checkpoint bottleneck under control, we achieved a dramatic throughput improvement without the complexity of overhauling our data distribution design. The underlying hardware (fast NVMe and huge RAM) combined with a well-tuned checkpoint policy turned out to be the primary driver of performance gains.
Measure First
Use tools like pg_stat_statements and system resource monitoring to identify the real culprits. It’s pointless to implement fancy distribution logic if checkpoint thrashing or I/O saturation is the real bottleneck.
Longer Checkpoint Intervals on Large-Memory Systems
Short checkpoint intervals on servers with massive RAM are a recipe for excessive, constant flushes. Extend checkpoint_timeout, enlarge max_wal_size, and make sure you can manage the disk space that WAL will consume. Note that if throughput is not a problem for you, there is no need to change anything; periodic checkpointing is generally desirable for faster recovery. In our case, the hardware was extremely reliable and we specifically needed to handle bursts of transactions.
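Since a larger max_wal_size lets more WAL accumulate between checkpoints, a quick check of the WAL directory’s footprint (available on PostgreSQL 10+) helps keep that disk usage in view:

```sql
-- Current on-disk size of the WAL directory.
SELECT pg_size_pretty(sum(size)) AS wal_dir_size
FROM   pg_ls_waldir();
```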
Gradual Checkpoints
Tuning checkpoint_completion_target to near 1.0 ensures smoother, less spiky I/O usage.
Partitioning Matters (But Not Always)
If your workload is already well-localized by default (say, hashing by a key used in frequent joins) and a big chunk of your overhead is I/O or memory-related, fancy multi-column partitioning or highly sophisticated data distribution might not yield major incremental benefits.
Keep It Simple
The simplest solution that addresses the largest bottleneck is often best in complex systems. More advanced strategies are helpful only if your main problem is actually data distribution or network chatter, not I/O storms.
Achieving high throughput on a CitusDB cluster involves more than just slapping on an intricate sharding strategy. While distributing data effectively is essential to scale reads and writes, the real magic often lies in understanding how PostgreSQL handles write-ahead logging, checkpoints, and I/O concurrency, and tuning it accordingly.
In our case, adjusting checkpoint intervals, WAL sizes, memory parameters, and concurrency settings provided the biggest performance leap. By allowing the system to run longer between checkpoints and smoothing out the flushing process, we minimized the stalls and overhead associated with massive writes. This alone enabled our cluster to handle roughly 10x more transactions per second than our initial baseline, an impressive reminder that in performance-critical tuning, hardware configuration and PostgreSQL internals often matter more than advanced distribution tactics.
If you are operating a distributed PostgreSQL environment (with or without Citus), take a step back before implementing complicated partitioning. Identify your real bottlenecks (often it’s disk I/O or checkpoint stalls) and tune those first. You might just find that big gains come with a surprisingly simple solution.
Happy tuning! Once you conquer checkpoint overhead and tune your hardware parameters, the path to horizontally scaling your transactions becomes much smoother.