Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Asus ExpertCenter PN54 reviewed

    Huawei MatePad Mini: Launch date confirmed for compact flagship tablet with OLED screen

    P40WD-40: New Lenovo ThinkVision monitor leaks with Thunderbolt 4 and 120 Hz refresh rate for professionals

    Facebook X (Twitter) Instagram
    • Artificial Intelligence
    • Business Technology
    • Cryptocurrency
    • Gadgets
    • Gaming
    • Health
    • Software and Apps
    • Technology
    Facebook X (Twitter) Instagram Pinterest Vimeo
    Tech AI Verse
    • Home
    • Artificial Intelligence

      Blue-collar jobs are gaining popularity as AI threatens office work

      August 17, 2025

      Man who asked ChatGPT about cutting out salt from his diet was hospitalized with hallucinations

      August 15, 2025

      What happens when chatbots shape your reality? Concerns are growing online

      August 14, 2025

      Scientists want to prevent AI from going rogue by teaching it to be bad first

      August 8, 2025

      AI models may be accidentally (and secretly) learning each other’s bad behaviors

      July 30, 2025
    • Business

      Why Certified VMware Pros Are Driving the Future of IT

      August 24, 2025

      Murky Panda hackers exploit cloud trust to hack downstream customers

      August 23, 2025

      The rise of sovereign clouds: no data portability, no party

      August 20, 2025

      Israel is reportedly storing millions of Palestinian phone calls on Microsoft servers

      August 6, 2025

      AI site Perplexity uses “stealth tactics” to flout no-crawl edicts, Cloudflare says

      August 5, 2025
    • Crypto

      Chainlink (LINK) Price Uptrend Likely To Reverse as Charts Hint at Exhaustion

      August 31, 2025

      What to Expect From Solana in September

      August 31, 2025

      Bitcoin Risks Deeper Drop Toward $100,000 Amid Whale Rotation Into Ethereum

      August 31, 2025

      3 Altcoins Smart Money Are Buying During Market Pullback

      August 31, 2025

      Solana ETFs Move Closer to Approval as SEC Reviews Amended Filings

      August 31, 2025
    • Technology

      Asus ExpertCenter PN54 reviewed

      August 31, 2025

      Huawei MatePad Mini: Launch date confirmed for compact flagship tablet with OLED screen

      August 31, 2025

      P40WD-40: New Lenovo ThinkVision monitor leaks with Thunderbolt 4 and 120 Hz refresh rate for professionals

      August 31, 2025

      Best AI Workstation Processors 2025: Why AMD Ryzen Beats Intel for Local AI Computing for now!

      August 31, 2025

      How to turn a USB flash drive into a portable games console

      August 31, 2025
    • Others
      • Gadgets
      • Gaming
      • Health
      • Software and Apps
    Check BMI
    Tech AI Verse
    You are at:Home»Technology»When Sigterm Does Nothing: A Postgres Mystery
    Technology

    When Sigterm Does Nothing: A Postgres Mystery

    TechAiVerseBy TechAiVerseJuly 15, 2025No Comments13 Mins Read2 Views
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
    When Sigterm Does Nothing: A Postgres Mystery
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp Email

    BMI Calculator – Check your Body Mass Index for free!

    When Sigterm Does Nothing: A Postgres Mystery

    In my opinion, the worst bugs are the ones you decide not to look further into.

    Maybe the problem only manifests 2% of the time, or when you run the code on a particular brand of toaster. Maybe the customer doesn’t supply enough information to begin narrowing down the cause. Maybe you just don’t have enough time to track the issue down fully.

    And everyone eventually moves on. But when you see the same bug strike again months later, you quietly mutter to yourself and wish you’d been persistent the first time. This is one of those stories.

    The ClickPipes team had encountered a bug with logical replication slot creation on Postgres read replicas—specifically, an issue where a query that was already taking hours rather than the few seconds it usually took couldn’t be terminated by any of the usual methods in Postgres, causing customer frustration and risking the stability of production databases. In this blog post, I’ll walk through how I investigated the problem and ultimately discovered it was due to a Postgres bug. We’ll also share how we fixed it and our experience working with the Postgres community.

    At ClickPipes, our charter is simple. Allow customers to easily move lots of data from to ClickHouse Cloud, where X is an ever-growing set of object stores, queues, and databases. My team and I look after the database pipes. But before this, we were a startup named PeerDB, which made a product that moved lots of data from Postgres to instead. By being Postgres-focused, the team had significant experience with Postgres CDC at scale.

    For the uninitiated, change data capture (CDC) is the process of continuously tracking database changes, allowing ClickPipes to replicate them to ClickHouse in near real-time. Postgres performs change data capture (CDC) primarily through logical replication slots. Logical replication slots decode changes from the write-ahead log (WAL) and stream them to consumers for replay. They act as an intermediary layer, transforming raw WAL data into consumable change events. The Postgres ClickPipe is built around logical replication, creating slots and using them to read from hundreds of customer databases.

    It all began when one of our biggest PeerDB customers pinged us, letting us know that they had set up a pipe to replicate data from their new Postgres read replica, but it appeared to have become “stuck.” Keeping a few possibilities in mind, I pulled up their instance and saw nothing wrong—just a single active connection that was doing… something.

    SELECT pid,backend_start,xact_start,wait_event_type,wait_event,state,query 
    FROM pg_stat_activity 
    WHERE query LIKE '%pg_create_logical_replication_slot%' 
    AND pid!=pg_backend_pid();
    -[ RECORD 1 ]---+--------------------------------------------------------------
    pid             | 5074
    backend_start   | 2025-06-27 14:57:01.458979+05:30
    xact_start      | 2025-06-27 14:57:47.819348+05:30
    wait_event_type |
    wait_event      |
    state           | active
    query           | SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'pgoutput');
    

    When creating a new pipe, one of the first steps is to create a logical replication slot. This typically takes a few seconds, but it was taking far longer here. We don’t see any wait_event associated, which means Postgres thinks this command is not stuck waiting on anything. Additionally, we noticed the slot we’re trying to create has shown up in the pg_replication_slots system table with an active PID.

    After determining that the issue seemed to be on the Postgres side, one of our initial troubleshooting steps was to “turn it off and on again” and reset the pipe so it started afresh. Doing this led to an even more troubling discovery: even though the ClickPipe had disconnected, this rogue command wouldn’t go away. We even used SQL commands to send SIGINT and SIGTERM signals to the backend process assigned to this query, but neither was taking effect. SIGTERM not taking effect is bad because that’s supposed to be the “nuclear” option here.

    SELECT pg_cancel_backend(5074), pg_terminate_backend(5074);
     pg_cancel_backend | pg_terminate_backend
    -------------------+----------------------
     t                 | t
    (1 row)
    
    SELECT pid,wait_event_type,wait_event,state,query 
    FROM pg_stat_activity
    WHERE pid=5074;
     pid  | wait_event_type | wait_event | state  |                             query
    ------+-----------------+------------+--------+----------------------------------------------------------------
     5074 |                 |            | active | SELECT pg_create_logical_replication_slot('demo_slot', 'pgoutput');
    

    We now had a replication slot marked as “active”, and since the process connected to the slot refuses to quit, we cannot drop it. The slot was also now beginning to retain WAL for the decoding process, which is dangerous, as without a way to read from or drop the slot, it can continue retaining WAL indefinitely, risking storage exhaustion. As the customer was running a managed Postgres service, the only option they had to get out of this situation was to fully restart the Postgres instance to remove this process, which is not a very appealing proposition as it’d lead to downtime for things dependent on it.

    As alluded to earlier, this is not the first time we have seen this exact problem. A few months earlier, another customer using a different managed Postgres service saw the same symptoms while creating a few pipes on a read replica. While we were not able to deterministically solve it, tuning some settings (like turning on hot_standby_feedback) seemed to make it go away, so we attributed it to some idiosyncrasy in their managed service (we now know that this was not the case). But it was worrying to see it in a completely different managed service on the latest Postgres version with all recommended settings. This was when I set aside time to investigate this further.

    I got my first clue from support for the customer’s Postgres provider, which was a strace output on the backend process running the broken query:

    strace: Process 117778 attached
    restart_syscall(<... resuming interrupted nanosleep ...>) = 0
    nanosleep({tv_sec=0, tv_nsec=1000000}, NULL) = 0
    nanosleep({tv_sec=0, tv_nsec=1000000}, NULL) = 0
    nanosleep({tv_sec=0, tv_nsec=1000000}, NULL) = 0
    nanosleep({tv_sec=0, tv_nsec=1000000}, NULL) = 0
    nanosleep({tv_sec=0, tv_nsec=1000000}, NULL) = 0
    nanosleep({tv_sec=0, tv_nsec=1000000}, NULL) = 0
    ...
    

    Seeing only nanosleep here and no IO or other syscalls would suggest that the backend is in a loop of sleep() calls with a constant duration of 1ms each. This was already a good clue since there are only a few places where such a loop exists in Postgres code. After finding a function in the control flow of creating replication slots that could sleep in this manner under certain conditions, I worked backwards to figure out what this function does and what can get it into this state. And while this bug is not that common in production databases, the “reproduction” of this issue is relatively trivial:

    1. Either set up or nicely ask a managed service to give you a Postgres cluster (v16+ since that’s when standbys started supporting creation of logical replication slots) with a primary instance and a read replica.
    2. On the primary, start a transaction and perform any DML operation. As an optimization, Postgres only assigns transactions a transactionid/XID once they have written data, so this is to make the transaction “real”. Don’t COMMIT or ROLLBACK just yet; just leave the session open.
    3. On the read replica, create a logical replication slot (the most straightforward method would be to use the SQL function pg_create_logical_replication_slot). Once you issue the command, Ctrl-C becomes useless. Trying to cancel or terminate the backend from another session doesn’t work either. If you read syscalls from the backend process via strace or similar, you’d see a lot of nanosleep syscalls like the one above.
    4. COMMIT/ROLLBACK the transaction from step 2. Doing this will eventually cause the slot creation to return successfully (if you tried to cancel it earlier, it will notice the attempt now and cancel itself).

    Internally, creating a replication slot requires waiting for transactions before the slot creation query is issued to COMMIT/ROLLBACK, to reach a “consistent point” where the slot can decode all future transactions. While it is not ideal to have long-running transactions on Postgres for various reasons, the reality is that certain transactions (think reporting queries or large data backfills) can take a while. If we create a slot under these conditions, it will have to wait for transactions to complete and therefore take longer than usual.

    2025-05-22 10:44:53.868 UTC [1195] LOG:  00000: logical decoding found initial starting point at 0/356B148
    2025-05-22 10:44:53.868 UTC [1195] DETAIL:  Waiting for transactions (approximately 1) older than 6068 to end.
    

    Since waiting on these older transactions can take an arbitrary amount of time, it is crucial to implement this wait as efficiently as possible. Postgres already handles various locks for different operations, and we can reuse this infrastructure to take a lock on older transactions. Since transactions only release their transactionid lock after completing, acquiring this lock confirms that the transaction is done. This approach provides visibility—system tables display active waits, allowing administrators to monitor progress.

    SELECT pid,wait_event_type,wait_event,state,query 
    FROM pg_stat_activity
    WHERE pid=804;
     pid | wait_event_type |  wait_event   | state  |                             query
    -----+-----------------+---------------+--------+----------------------------------------------------------------
     804 | Lock            | transactionid | active | SELECT pg_create_logical_replication_slot('demo_slot', 'pgoutput');
    (1 row)
    
    SELECT locktype,transactionid,pid,mode 
    FROM pg_locks
    WHERE pid=804 AND granted='f';
       locktype    | transactionid | pid |   mode
    ---------------+---------------+-----+-----------
     transactionid |          6077 | 804 | ShareLock
    

    If needed, one can cancel the problematic transactions or the slot creation process that’s still waiting. However, this holds only for a “typical” Postgres instance. As it turns out, being a Postgres read replica breaks some invariants, which leads to different behaviour in the same case.

    While “read replica” is the colloquial term for a Postgres instance serving read-only traffic (a term that makes it seem more straightforward than it actually is), Postgres refers to these as “hot standbys”. A hot standby is a Postgres instance in “recovery mode” that constantly receives WAL records from a primary Postgres instance and uses these to maintain an exact copy of all data on the primary. The relevant detail here is that the standby can only indirectly observe what transactions running on the primary are doing via the write-ahead log. Specifically, the standby continuously maintains a list of transactions running on the primary known as KnownAssignedXids based on information from incoming WAL records.

    The function that implements the “wait for a transaction to complete” operation described earlier is named XactLockTableWait, and the crucial part for us is this loop (code simplified for brevity):

    void
    XactLockTableWait(TransactionId xid, Relation rel, ItemPointer ctid,
    				  XLTW_Oper oper)
        <...>
    	for (;;)
    	{
    		Assert(TransactionIdIsValid(xid));
    		SET_LOCKTAG_TRANSACTION(tag, xid);
    
    		(void) LockAcquire(&tag, ShareLock, false, false);
    		LockRelease(&tag, ShareLock, false);
    
    		if (!TransactionIdIsInProgress(xid))
    			break;
    
            	pg_usleep(1000L);
    	}
        <...>
    

    Postgres acquires a lock on the transaction ID provided as input via LockAcquire, which hangs until the transaction completes and releases its lock. After obtaining the lock, we release it immediately and check if the transaction is still in progress via TransactionIdIsInProgress. If it isn’t, we exit. However, there is a 1ms sleep before the loop repeats, which we hit if we find the transaction is still running. If we only reach the check after we acquire a lock, how can this happen? As mentioned in a comment within this function, there is a window where the transaction has registered itself as running but hasn’t yet acquired its lock on the transactionid. While the code handles this case regardless, the sleep is good for avoiding repeated locking until the transaction reaches a consistent state. Under regular operation, this case should rarely be hit and not for very long.

    Let’s now consider the case of a hot standby. A standby still needs to find a consistent point while creating a logical replication slot, and older transactions need to end for that. It isn’t the one running the transaction; therefore, the LockAcquire will always return immediately on standby. But TransactionIdIsInProgress does take KnownAssignedXids into account, so we know that the transaction is still running. Thus, we hit the 1ms sleep and then another loop iteration. But unlike earlier, this is not a transient situation; we can be stuck here for hours.

    1. The optimized LockAcquire wait on the primary is replaced by a fixed 1ms polling loop that wastes cycles if we need to wait a long time.
    2. If LockAcquire needs to wait, it also handles interrupts. Since the function was designed around waiting in LockAcquire and the sleep case is for transient cases, there is no code for handling interrupts within the loop. This is what made the backend process “unkillable”.
    3. We don’t report that we’re waiting on an external operation, as LockAcquire handled this. The only way a regular user can tell that their slot creation is stuck on this is to look at logs, which still mention the need to wait for older transactions to complete. On a primary, this same wait would be visible in system tables.

    This issue is a good example of well-written code causing problems because it was unwittingly roped into handling something it was not designed for. This tends to happen in any codebase of sizable complexity.

    Our immediate concern was that it was currently impossible to stop a slot creation waiting for a consistent point on a standby. To this effect, I submitted a patch to the Postgres mailing list, which just added an interrupt check before each sleep call:

    +			CHECK_FOR_INTERRUPTS();
     			pg_usleep(1000L);
    

    Postgres maintainers quickly accepted this patch (thanks, Fujii Masao, for reviewing it!) and backported it to all supported major versions of Postgres. After a minor version upgrade, this problem should be mitigated for all users running hot standbys.

    Another community member graciously addressed the issues around the lack of a wait_event here and the inefficient loop. A patch is out to add a new wait event that highlights this case, and there is also an ongoing discussion about improving the way standbys wait for transactions to complete on the primary. Both these changes are still in flight and should hopefully make it to Postgres 19 next year.

    Once the ClickPipes team understood that this issue was not a one-off case, it took me a day or two to determine the RCA and develop the patch that addresses it. Being familiar with the Postgres codebase, its well-structured nature allowed me to hone in on XactLockTableWait() and work backwards. While the Postgres community has an older and sometimes daunting contribution process of submitting patches to a mailing list, they were also instrumental in reviewing the patch soon and getting it backported, continuing to ensure a high-quality user experience.

    This investigation into a seemingly simple bug reveals the incredible depth and complexity that lie beneath the surface of modern database systems. Postgres, with decades of development and its battle-tested architecture, still harbors edge cases where different subsystems can interact with each other in unexpected ways. Similarly, ClickHouse, with its columnar storage engine and rich feature set, also presents fascinating challenges in pushing the boundaries of analytical performance. As both databases are open source, it’s easier to identify issues like this, fix them, and submit your patches upstream, repaying your debt to a system that has served you well.

    Replicating data from PostgreSQL to ClickHouse unlocks columnar storage performance, parallel query execution, and advanced time-series analytics, enabling workloads that would be impractical or impossible on row-based transactional systems. We recommend ClickPipes for Postgres for ClickHouse Cloud users, offering managed real-time replication with zero infrastructure overhead. Self-hosted ClickHouse users should consider PeerDB, which provides high-performance self-hosted CDC capabilities. Both solutions work seamlessly with Postgres read replicas for CDC operations, allowing you to offload replication traffic from your primary database.

    BMI Calculator – Check your Body Mass Index for free!

    Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
    Previous ArticleMartin (YC S23) Is Hiring Founding Engineers to Build a Better Siri
    Next Article Meta says it’s cracking down on Facebook creators who steal content
    TechAiVerse
    • Website

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he deliver clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.

    Related Posts

    Asus ExpertCenter PN54 reviewed

    August 31, 2025

    Huawei MatePad Mini: Launch date confirmed for compact flagship tablet with OLED screen

    August 31, 2025

    P40WD-40: New Lenovo ThinkVision monitor leaks with Thunderbolt 4 and 120 Hz refresh rate for professionals

    August 31, 2025
    Leave A Reply Cancel Reply

    Top Posts

    Ping, You’ve Got Whale: AI detection system alerts ships of whales in their path

    April 22, 2025168 Views

    6.7 Cummins Lifter Failure: What Years Are Affected (And Possible Fixes)

    April 14, 202548 Views

    New Akira ransomware decryptor cracks encryptions keys using GPUs

    March 16, 202530 Views

    Is Libby Compatible With Kobo E-Readers?

    March 31, 202528 Views
    Don't Miss
    Technology August 31, 2025

    Asus ExpertCenter PN54 reviewed

    Asus ExpertCenter PN54 reviewed – what the mini PC with AMD Ryzen AI 7 350…

    Huawei MatePad Mini: Launch date confirmed for compact flagship tablet with OLED screen

    P40WD-40: New Lenovo ThinkVision monitor leaks with Thunderbolt 4 and 120 Hz refresh rate for professionals

    Best AI Workstation Processors 2025: Why AMD Ryzen Beats Intel for Local AI Computing for now!

    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo

    Subscribe to Updates

    Get the latest creative news from SmartMag about art & design.

    About Us
    About Us

    Welcome to Tech AI Verse, your go-to destination for everything technology! We bring you the latest news, trends, and insights from the ever-evolving world of tech. Our coverage spans across global technology industry updates, artificial intelligence advancements, machine learning ethics, and automation innovations. Stay connected with us as we explore the limitless possibilities of technology!

    Facebook X (Twitter) Pinterest YouTube WhatsApp
    Our Picks

    Asus ExpertCenter PN54 reviewed

    August 31, 20252 Views

    Huawei MatePad Mini: Launch date confirmed for compact flagship tablet with OLED screen

    August 31, 20252 Views

    P40WD-40: New Lenovo ThinkVision monitor leaks with Thunderbolt 4 and 120 Hz refresh rate for professionals

    August 31, 20252 Views
    Most Popular

    Xiaomi 15 Ultra Officially Launched in China, Malaysia launch to follow after global event

    March 12, 20250 Views

    Apple thinks people won’t use MagSafe on iPhone 16e

    March 12, 20250 Views

    French Apex Legends voice cast refuses contracts over “unacceptable” AI clause

    March 12, 20250 Views
    © 2025 TechAiVerse. Designed by Divya Tech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions

    Type above and press Enter to search. Press Esc to cancel.