Tuning Distributed PostgreSQL


Distributed databases promise linear scaling in both data size and throughput (at least in theory). In practice, harnessing the full power of a distributed system, especially one based on PostgreSQL like Citus, demands a holistic approach. While advanced partitioning schemes, elaborate caching layers, and carefully designed schemas are useful, the biggest gains often come from understanding hardware constraints, transaction patterns, and the internals of how PostgreSQL implements critical components like checkpointing and WAL (Write-Ahead Logging).

This article describes how we tuned a CitusDB (distributed PostgreSQL) cluster to achieve roughly 10x throughput by focusing primarily on tuning the database for its hardware environment and workload characteristics, rather than implementing fancy, complex sharding strategies.

Setup

We run a Citus cluster comprising:

  • 5 nodes (1 coordinator, 4 workers)
  • 10 Gbps Ethernet interconnect
  • NVMe drives for main storage
  • 1 TB RAM per node

Although each server is quite powerful, we share them with other teams and thus cannot assume exclusive access to all resources. Still, each node’s substantial RAM, high-speed storage, and modern CPU cores offer tremendous potential throughput if configured properly.

In our workload:

  • 20 different clients generate ~430k total transactions (a mix of INSERT, UPDATE, and SELECT statements).
  • The main objective is to complete these transactions as fast as possible: maximizing throughput.
  • Latency is also relevant, but we treat it mostly as a natural extension of throughput improvement (i.e., as throughput rises and fewer queries contend for the same resources, latency also tends to improve).

Why Not a Complex Sharding Strategy?

Citus offers powerful sharding strategies, from hash distribution on specific columns (e.g., warehouse_id) to replicating read-mostly tables across all nodes. In many typical scenarios (like “warehouse-centric” transaction processing), these designs absolutely matter. However, in our tests, we discovered that beyond a basic partitioning scheme, the real bottlenecks lay elsewhere. Adopting a fancy distribution strategy wouldn’t help if I/O contention, memory pressure, or checkpoints kept slowing everything down.

In other words, for our specific environment, focusing on hardware-aware tuning and understanding PostgreSQL’s internals yielded the largest performance jump. Your mileage may vary, but we believe the lesson is universal: always measure to see where the bottleneck actually is before investing effort in more complex solutions.

Initial Observation

When we first deployed our database and ran the workload:

  1. Throughput started high (tens of thousands of transactions per minute).
  2. After a short while (~10 minutes), performance sharply dropped.
  3. Resource usage (CPU, I/O, memory) exhibited erratic patterns, especially around the times that PostgreSQL started performing checkpoints.

Unsurprisingly, the system was under heavy transactional load, with many frequent updates. This environment can quickly saturate I/O when the database tries to write dirty buffers during a checkpoint.
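
One low-effort way to confirm that correlation is to make checkpoints visible in the server log. A minimal sketch (log_checkpoints is off by default before PostgreSQL 15):

    ALTER SYSTEM SET log_checkpoints = on;
    SELECT pg_reload_conf();
    -- Each checkpoint then emits "checkpoint starting" / "checkpoint complete" lines
    -- with buffer counts and write/sync timings, which are easy to line up against
    -- the throughput dips.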

Finding the Bottleneck

A cardinal rule of tuning is: find the real bottleneck first. A seemingly slow query might be the symptom, not the cause. In our case, we suspected that massive updates, particularly on frequently accessed tables, were generating large WAL volumes, which in turn triggered frequent, extended checkpoints.
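
One way to sanity-check that hypothesis is to measure how much WAL the workload actually generates. A rough sketch, run from psql (the lsn_before variable name is just illustrative):

    -- Sample the WAL insert position, let the workload run for a while, then diff.
    -- pg_current_wal_lsn() and pg_wal_lsn_diff() are built in since PostgreSQL 10.
    SELECT pg_current_wal_lsn() AS lsn_before \gset
    -- ... run the workload for a minute or two ...
    SELECT pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before')
           ) AS wal_generated;

Hundreds of megabytes of WAL per minute against a 1 GB max_wal_size means PostgreSQL is forced into near-constant checkpointing regardless of checkpoint_timeout.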

PostgreSQL Checkpoints

In PostgreSQL (and by extension, in Citus), WAL records every data change. Eventually, the database checkpoints, ensuring all dirty pages are flushed to disk so that recovery does not require replaying an overly large set of WAL files.

By default, PostgreSQL schedules these checkpoints every 5 minutes. However, in our high-throughput, I/O-intensive environment, each checkpoint was taking an unexpectedly long time to complete—sometimes consuming 2 to 3 minutes of processing time.

Using extensions like pg_stat_statements, we identified that many of the slowest queries were simple accesses to our most frequently updated table. While the queries themselves were efficient, the heavy I/O load imposed by the checkpoint operations was the real culprit behind our performance degradation.
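
For reference, a minimal pg_stat_statements query of this kind (assuming the extension is in shared_preload_libraries and created in the database; the *_exec_time columns are named total_time/mean_time on PostgreSQL 12 and older):

    -- Top statements by cumulative execution time.
    SELECT calls,
           round(total_exec_time::numeric, 1) AS total_ms,
           round(mean_exec_time::numeric, 2)  AS mean_ms,
           rows,
           query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

When the top entries are trivially simple statements against a hot table, the problem is usually not the SQL itself but whatever those statements are waiting on.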

Why can checkpoints stall performance so drastically?

  • A checkpoint flushes potentially gigabytes of data from in-memory buffers to disk.
  • If checkpoint_timeout is too short, checkpoints happen too frequently, causing repetitive heavy writes.
  • If checkpoint_completion_target is small, PostgreSQL attempts to complete each checkpoint as quickly as possible, hammering I/O in a short burst.
  • If max_wal_size is too small, PostgreSQL is forced to checkpoint more often to avoid filling up WAL.

All these contribute to “checkpoint spikes,” during which throughput can plummet.
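
Both symptoms show up directly in the cumulative statistics. A sketch against pg_stat_bgwriter, which holds the checkpoint counters up to PostgreSQL 16 (they moved to pg_stat_checkpointer in 17):

    -- Timed vs. requested checkpoints, plus how the write work is split.
    SELECT checkpoints_timed,          -- triggered by checkpoint_timeout
           checkpoints_req,            -- forced early, typically by WAL volume
           checkpoint_write_time,      -- ms spent writing buffers in checkpoints
           checkpoint_sync_time,       -- ms spent in fsync during checkpoints
           buffers_checkpoint,         -- buffers written by checkpoints
           buffers_clean,              -- buffers written by the background writer
           buffers_backend             -- buffers written directly by backends
    FROM pg_stat_bgwriter;

A checkpoints_req count that grows much faster than checkpoints_timed is a strong sign that max_wal_size is too small for the write rate.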

Taming the Checkpoint

After confirming that checkpointing was indeed the major bottleneck, we adjusted PostgreSQL parameters to minimize checkpoint-induced stalls:

  1. checkpoint_timeout

    • What it does: Specifies how frequently PostgreSQL forces a checkpoint.
    • Default: 5 minutes.
    • Our Adjustment: We increased it substantially (e.g., to 15 or even 60 minutes). With large RAM and fast NVMe, we could safely keep more dirty pages in memory without jeopardizing recovery times.
  2. checkpoint_completion_target

    • What it does: Determines how gradually PostgreSQL spreads out checkpoint writes. A value closer to 1.0 spreads the flushing over more of the checkpoint interval, smoothing out I/O.
    • Default: 0.5 (0.9 as of PostgreSQL 14).
    • Our Adjustment: Raised it (e.g., 0.9+). This avoids an I/O storm at the start of each checkpoint and helps even out performance.
  3. max_wal_size

    • What it does: A soft limit on how much WAL can accumulate before PostgreSQL forces a checkpoint to keep WAL from growing unbounded.
    • Our Adjustment: Significantly increased it (e.g., from the 1 GB default to 8 GB or more). This let the database accumulate more WAL before forcing a checkpoint.
  4. wal_buffers

    • What it does: Buffer size for WAL data before it’s written out.
    • Our Adjustment: Increased from default (often 16MB) to something larger (e.g., 128MB) to better handle high concurrency.
  5. I/O Concurrency Settings

    • maintenance_io_concurrency and effective_io_concurrency: Tell PostgreSQL how many concurrent I/O requests the underlying storage can service, which it uses to prefetch data during scans and maintenance operations.
    • Our Adjustment: Cranked them up (e.g., 100), since we had fast NVMe drives capable of many parallel operations.

Through a mix of these changes, we drastically reduced how often large checkpoint write storms occurred, thereby smoothing out performance. The net effect was that transactions spent more time actually processing requests and less time waiting on I/O.
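
Put together, the changes boil down to a handful of settings. The values below are a sketch using the example numbers above, not a universal recommendation, and in a Citus cluster they have to be applied on the coordinator and every worker node:

    -- Checkpoint/WAL tuning sketch; size to your RAM, storage, and recovery-time budget.
    ALTER SYSTEM SET checkpoint_timeout = '15min';         -- up from the 5min default
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;   -- spread writes across the interval
    ALTER SYSTEM SET max_wal_size = '8GB';                 -- fewer WAL-pressure checkpoints
    ALTER SYSTEM SET wal_buffers = '128MB';                -- takes effect only after a restart
    ALTER SYSTEM SET effective_io_concurrency = 100;       -- NVMe handles many concurrent I/Os
    ALTER SYSTEM SET maintenance_io_concurrency = 100;
    SELECT pg_reload_conf();                               -- reloads everything except wal_buffers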

Memory and Cache Tuning

Large memory (1 TB RAM per node) can be a double-edged sword:

  • Pro: More data can be cached, and more dirty buffers can be held before writes.
  • Con: Large writes can create bigger checkpoint storms, and if you saturate your memory bandwidth or I/O, the entire system can lock up for extended periods.

Key parameters we adjusted:

  • shared_buffers
    • Commonly recommended to be 25–40% of total system memory in dedicated Postgres servers. With 1 TB of RAM (and other tenants on the same hardware), we used something more modest (e.g., 16–64 GB) to avoid overcommitting memory just for shared buffers.
  • bgwriter_lru_maxpages and bgwriter_lru_multiplier
    • The background writer flushes dirty buffers ahead of demand so that backends find clean pages to reuse and checkpoints have less left to write. Tweaking these helped us reduce the amount of dirty data that the checkpoint process later had to flush in a big batch.
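
A sketch of the corresponding settings; the numbers are illustrative, sized for machines we share with other tenants rather than for a dedicated 1 TB box:

    ALTER SYSTEM SET shared_buffers = '32GB';          -- deliberately modest; requires a restart
    ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;     -- default 100; let each round write more
    ALTER SYSTEM SET bgwriter_lru_multiplier = 4.0;    -- default 2.0; clean further ahead of demand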

Concurrency and Lock Tuning

Under high concurrency, distributed Postgres can encounter performance-sapping phenomena:

  • Row or table locking if many sessions touch the same data.
  • Distributed deadlocks, which we addressed by ensuring Citus had a consistent transaction ordering or by using retry logic on deadlock detection.

For common OLTP-like workloads:

  • Index usage is critical for quick lookups.
  • Minimizing lock contention on frequently updated rows or “hot spots” (e.g., a central warehouse or user table) can drastically boost throughput.

In our scenario, the Warehouse table was updated the most by payment-related transactions. We found the best fix (beyond applying indexes) was to ensure that writes to Warehouse were less frequently stalled by checkpoint or background processes. That was a huge part of letting concurrency scale more smoothly.
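
To rule lock waits in or out as the cause of a stall, a query along these lines is a handy sanity check (a sketch; pg_blocking_pids() exists since PostgreSQL 9.6 and reports blocking on the node where it runs, while Citus additionally exposes a citus_lock_waits view for waits that span nodes):

    -- Map each waiting session to the sessions holding the locks it needs.
    SELECT blocked.pid    AS blocked_pid,
           blocked.query  AS blocked_query,
           blocking.pid   AS blocking_pid,
           blocking.query AS blocking_query
    FROM pg_stat_activity AS blocked
    JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
    JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;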

Basic Sharding Strategy (But Not Overly Complex)

Though we did not rely heavily on complex sharding, we still used a modest distribution strategy:

  • Hash partitioning on a key that appears in the majority of joins/queries (e.g., warehouse_id).
  • Replicate small, read-mostly dimension tables (e.g., item) to each node.

This approach already localizes most queries to a single node and avoids cross-node chatter for typical transactions. The complexity of implementing a second distribution dimension (like item_id) typically wasn’t worth it once we realized the real gating factor was I/O from checkpoint storms.
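
In Citus terms, that distribution boils down to two calls run on the coordinator (the table names here are illustrative, not our actual schema):

    -- Hash-distribute the large transactional table on the key used in most joins,
    -- and replicate the small read-mostly dimension table to every node.
    SELECT create_distributed_table('orders', 'warehouse_id');
    SELECT create_reference_table('item');

Tables distributed on the same column type are co-located by default, so joins and transactions keyed on warehouse_id stay on a single worker.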

Results and Observations

By focusing primarily on checkpoint and memory tuning, we achieved:

  • ~10x improvement in average throughput compared to the initial baseline with default settings.
  • Reduced overall job execution time from multiple hours to under an hour (or even 40 minutes in some runs).

While it might be tempting to apply advanced sharding or partitioning strategies to achieve data locality, our experiments showed that such approaches offered diminishing returns when the system was primarily constrained by I/O. By focusing our efforts on hardware-aware tuning to get the checkpoint bottleneck under control, we achieved a dramatic throughput improvement without the complexity of overhauling our data distribution design. The underlying hardware (fast NVMe and huge RAM) combined with a well-tuned checkpoint policy turned out to be the primary driver of performance gains.

Lessons Learned

  1. Measure First
    Use tools like pg_stat_statements and system resource monitoring to identify the real culprits. It’s pointless to implement fancy distribution logic if checkpoint thrashing or I/O saturation is the real bottleneck.

  2. Longer Checkpoint Intervals on Large-Memory Systems
    Short checkpoint intervals on servers with massive RAM are a recipe for excessive, constant flushes. Extend checkpoint_timeout, enlarge max_wal_size, and make sure you have the disk space for the extra WAL. Note that if throughput is not a problem, you don't really need to change anything: periodic checkpointing is generally desirable for faster recovery. In our case, the hardware was extremely reliable and we specifically needed to handle bursts of transactions.

  3. Gradual Checkpoints
    Tuning checkpoint_completion_target to near 1.0 ensures smoother, less spiky I/O usage.

  4. Partitioning Matters (But Not Always)
    If your workload is already well-localized by default (say, hashing by a key used in frequent joins) and a big chunk of your overhead is I/O or memory-related, fancy multi-column partitioning or highly sophisticated data distribution might not yield major incremental benefits.

  5. Keep It Simple
    The simplest solution that addresses the largest bottleneck is often best in complex systems. More advanced strategies are helpful only if your main problem is actually data distribution or network chatter, not I/O storms.

Conclusion

Achieving high throughput on a CitusDB cluster involves more than just slapping on an intricate sharding strategy. While distributing data effectively is essential to scale reads and writes, the real magic often lies in understanding how PostgreSQL handles write-ahead logging, checkpoints, and I/O concurrency, and tuning it accordingly.

In our case, adjusting checkpoint intervals, WAL sizes, memory parameters, and concurrency settings provided the biggest performance leap. By allowing the system to run longer between checkpoints and smoothing out the flushing process, we minimized the stalls and overhead associated with massive writes. This alone enabled our cluster to handle roughly 10x more transactions per second than our initial baseline, an impressive reminder that in performance-critical tuning, hardware configuration and PostgreSQL internals often matter more than advanced distribution tactics.

If you are operating a distributed PostgreSQL environment (with or without Citus), take a step back before implementing complicated partitioning. Identify your real bottlenecks (often it’s disk I/O or checkpoint stalls) and tune those first. You might just find that big gains come with a surprisingly simple solution.


Happy tuning! Once you conquer checkpoint overhead and tune your hardware parameters, the path to horizontally scaling your transactions becomes much smoother.
