Arun Nukula

vRA 8.x PostgreSQL pods showing down?


I was analyzing a problem where the PostgreSQL pods were showing as down; vRA 8.x deployments were stuck and not working as expected.



In an ideal scenario, there should always be one master (primary) and two standby nodes in a vRA 8.x cluster.
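The per-node detail shown under "Ideal Status" below is what repmgr's node status check prints. A hedged sketch of how to pull it for every pod (the prelude namespace comes from the service FQDNs in the output; the repmgr.conf path and the su invocation are assumptions):

# Hypothetical sketch: print repmgr's view of each postgres pod.
# /etc/repmgr.conf is an assumed path; adjust to wherever the appliance keeps it.
for pod in postgres-0 postgres-1 postgres-2; do
  kubectl exec -n prelude "$pod" -- su postgres -c \
    "repmgr -f /etc/repmgr.conf node status"
done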


Ideal Status


Node "postgres-0.postgres.prelude.svc.cluster.local":
        PostgreSQL version: 10.10
        Total data size: 26 GB
        Conninfo: host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
        Role: standby
        WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
        Archive command: /bin/true
        WALs pending archiving: 0 pending files
        Replication connections: 0 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
        Replication lag: 0 seconds
        Last received LSN: 307/595EBA48
        Last replayed LSN: 307/595EBA48
Node "postgres-1.postgres.prelude.svc.cluster.local":
        PostgreSQL version: 10.10
        Total data size: 26 GB
        Conninfo: host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
        Role: primary
        WAL archiving: enabled
        Archive command: /bin/true
        WALs pending archiving: 0 pending files
        Replication connections: 1 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Replication lag: n/a
Node "postgres-2.postgres.prelude.svc.cluster.local":
        PostgreSQL version: 10.10
        Total data size: 26 GB
        Conninfo: host=postgres-2.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10
        Role: standby
        WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
        Archive command: /bin/true
        WALs pending archiving: 0 pending files
        Replication connections: 0 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Upstream node: postgres-1.postgres.prelude.svc.cluster.local (ID: 101)
        Replication lag: 0 seconds
        Last received LSN: 304/E1A1B458
        Last replayed LSN: 304/E1A1B458
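The same topology can also be read straight from repmgr's metadata tables. A minimal sketch, assuming psql access inside a pod and the repmgr-db database named in the conninfo above:

# Hypothetical sketch: list the registered nodes, their roles and priorities.
kubectl exec -n prelude postgres-0 -- su postgres -c \
  "psql -d repmgr-db -c 'SELECT node_id, node_name, type, upstream_node_id, priority, active FROM repmgr.nodes;'"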

When the issue was seen, one of the postgres pods was reporting down:


NAME         READY   STATUS    RESTARTS   AGE
postgres-0   1/1     Running   1          3d11h
postgres-1   1/1     Running   0          3d11h
postgres-2   0/1     Running   1          3d11h
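The 0/1 under READY means postgres-2's readiness probe was failing while the container itself kept running. A quick, hedged way to reproduce this view and dig into the probe failure:

# List the postgres pods, then check the Events section for failed probes.
kubectl get pods -n prelude | grep postgres
kubectl describe pod postgres-2 -n prelude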

By the time we realized what was happening, it had fixed itself.




I was curious to know what happened and how it fixed itself, so let's start our deep dive.

I started inspecting postgres-1, which is the current master node.
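The repmgrd messages below come straight from the pod log; a hedged way to follow them (the grep filter is just to cut noise):

kubectl logs -n prelude postgres-1 --timestamps | grep -E 'repmgrd|repmgr-event'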


Until 2021-01-29 06:10:19 +0000 UTC it was reporting postgres-0 as its master, and then all of a sudden, at 2021-01-29 06:10:35 +0000, it was unable to ping postgres-0:


2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts
2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited

The snippets below show that postgres-1 was promoted to MASTER, where the previous MASTER was postgres-0.

postgres-1 was promoted to MASTER at 2021-01-29 06:11:06.17188+00.
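That promotion timestamp is also recorded in repmgr's event history, so it can be confirmed independently of the pod log. A minimal sketch, assuming psql access on the current primary:

# Hypothetical sketch: show the most recent repmgr events (standby_promote, etc.).
kubectl exec -n prelude postgres-1 -- su postgres -c \
  "psql -d repmgr-db -c 'SELECT node_id, event, successful, event_timestamp FROM repmgr.events ORDER BY event_timestamp DESC LIMIT 10;'"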




2021-01-29 06:09:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:19 +0000 UTC [repmgrd] node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) monitoring upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to ping "host=postgres-0.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10"
2021-01-29 06:10:35 +0000 UTC [repmgrd] unable to connect to upstream node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100)
2021-01-29 06:10:35 +0000 UTC [repmgrd] checking state of node 100, 1 of 1 attempts
2021-01-29 06:10:37 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to ping "user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 dbname=repmgr-db host=postgres-0.postgres.prelude.svc.cluster.local fallback_application_name=repmgr"
2021-01-29 06:10:45 +0000 UTC [repmgrd] unable to reconnect to node 100 after 1 attempts
2021-01-29 06:10:45 +0000 UTC [repmgrd] 1 active sibling nodes registered
2021-01-29 06:10:45 +0000 UTC [repmgrd] primary node  "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) and this node have the same location ("default")
2021-01-29 06:10:45 +0000 UTC [repmgrd] local node's last receive lsn: 418/C2E8E700
2021-01-29 06:10:45 +0000 UTC [repmgrd] checking state of sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102)
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) reports its upstream is node 100, last seen 20 second(s) ago
2021-01-29 06:10:45 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) last saw primary node 20 second(s) ago
2021-01-29 06:10:45 +0000 UTC [repmgrd] last receive LSN for sibling node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) is: 418/C2E8E700
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has same LSN as current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:10:45 +0000 UTC [repmgrd] node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has lower priority (98) than current candidate "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) (99)
2021-01-29 06:10:45 +0000 UTC [repmgrd] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 2 seconds
2021-01-29 06:10:45 +0000 UTC [repmgrd] promotion candidate is "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:10:45 +0000 UTC [repmgrd] executing "failover_validation_command"
2021-01-29 06:10:56 +0000 UTC [repmgrd] no output returned from command
2021-01-29 06:10:56 +0000 UTC [repmgrd] failover validation command returned zero
2021-01-29 06:10:56 +0000 UTC [repmgrd] this node is the winner, will now promote itself and inform other nodes
2021-01-29 06:10:56 +0000 UTC [repmgrd] promote_command is:
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:56 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:10:58 +0000 UTC [socat] W waitpid(-1, {}, WNOHANG): no child has exited
*
*
*
2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:11:04 +0000 UTC [socat] E read(5, 0x55da35c642c0, 8192): Connection timed out
2021-01-29 06:11:06 +0000 UTC [repmgr] 1 sibling nodes found, but option "--siblings-follow" not specified
2021-01-29 06:11:06 +0000 UTC [repmgr] promoting standby to primary
2021-01-29 06:11:06 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:06.065 UTC [171] LOG:  received promote request
2021-01-29 06:11:06 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:06.065 UTC [175] FATAL:  terminating walreceiver process due to administrator command
2021-01-29 06:11:06 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:06.066 UTC [171] LOG:  invalid resource manager ID 47 at 418/C2E8E700
2021-01-29 06:11:06 +0000 UTC [postgres] [11-1] 2021-01-29 06:11:06.066 UTC [171] LOG:  redo done at 418/C2E8E6D8
2021-01-29 06:11:06 +0000 UTC [postgres] [12-1] 2021-01-29 06:11:06.066 UTC [171] LOG:  last completed transaction was at log time 2021-01-29 06:10:25.633443+00
2021-01-29 06:11:06 +0000 UTC [postgres] [13-1] 2021-01-29 06:11:06.076 UTC [171] LOG:  selected new timeline ID: 6
2021-01-29 06:11:06 +0000 UTC [postgres] [14-1] 2021-01-29 06:11:06.148 UTC [171] LOG:  archive recovery complete
2021-01-29 06:11:06 +0000 UTC [repmgr] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
2021-01-29 06:11:06 +0000 UTC [repmgr] STANDBY PROMOTE successful
2021-01-29 06:11:06 +0000 UTC [repmgr-event] 101 standby_promote 1 2021-01-29 06:11:06.17188+00 server "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) was successfully promoted to primary
2021-01-29 06:11:06 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local
2021-01-29 06:11:06 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:06.203 UTC [170] LOG:  database system is ready to accept connections
2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
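Everything in the walk-through above is driven by repmgr configuration: a single reconnect attempt, node priorities 99 and 98, a failover_validation_command and a promote_command. A hedged way to surface those settings (the /etc/repmgr.conf path is an assumption, and the values themselves are only inferred from the log):

kubectl exec -n prelude postgres-1 -- su postgres -c \
  "grep -E '^(priority|reconnect_attempts|reconnect_interval|failover|promote_command|follow_command|failover_validation_command)' /etc/repmgr.conf"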

Following this, on postgres-1 we see the snippets below. At 2021-01-29 06:11:07 +0000, postgres-0 is marked as failed, and postgres-2 is notified to follow the new primary as a standby.



2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
*
*
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/tee /tmp/primary
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:06 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv start db-proxy
2021-01-29 06:11:06 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
socat: no process found
2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:07 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/bin/touch /tmp/initialized
2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:07 +0000 UTC [sudo] pam_unix(sudo:session): session closed for user root
2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_failover_promote"
2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_failover_promote 1 2021-01-29 06:11:07.074526+00 node 101 promoted to primary; old primary 100 marked as failed
2021-01-29 06:11:07 +0000 UTC [repmgrd] 1 followers to notify
2021-01-29 06:11:07 +0000 UTC [repmgrd] notifying node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) to follow node 101
2021-01-29 06:11:07 +0000 UTC [repmgrd] switching to primary monitoring mode
2021-01-29 06:11:07 +0000 UTC [repmgrd] executing notification command for event "repmgrd_reload"
2021-01-29 06:11:07 +0000 UTC [repmgr-event] 101 repmgrd_reload 1 2021-01-29 06:11:07.086852+00 monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:11:07 +0000 UTC [repmgrd] monitoring cluster primary "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101)
2021-01-29 06:11:12 +0000 UTC [repmgrd] new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected
2021-01-29 06:11:12 +0000 UTC [repmgrd] executing notification command for event "child_node_new_connect"
2021-01-29 06:11:12 +0000 UTC [repmgr-event] 101 child_node_new_connect 1 2021-01-29 06:11:12.114782+00 new standby "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has connected
2021-01-29 06:12:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:13:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:14:07 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
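At this point postgres-1 is the primary and postgres-2 is streaming from it again. One way to verify the streaming state from the new primary (pg_stat_replication is standard PostgreSQL; the kubectl wrapper follows the same hedged pattern as above):

kubectl exec -n prelude postgres-1 -- su postgres -c \
  "psql -d repmgr-db -c 'SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;'"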

Around 2021-01-29 06:16:48 +0000, postgres-1 reports that postgres-2 has disconnected as a standby.





postgres-1


2021-01-29 06:16:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:16:48 +0000 UTC [repmgrd] standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected
2021-01-29 06:16:48 +0000 UTC [repmgrd] executing notification command for event "child_node_disconnect"
2021-01-29 06:16:48 +0000 UTC [repmgr-event] 101 child_node_disconnect 1 2021-01-29 06:16:48.7483+00 standby node "postgres-2.postgres.prelude.svc.cluster.local" (ID: 102) has disconnected
2021-01-29 06:17:08 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.899 UTC [2820] LOG:  unexpected EOF on client connection with an open transaction
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2982] LOG:  could not receive data from client: Connection reset by peer
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.914 UTC [2973] LOG:  could not send data to client: Broken pipe
2021-01-29 06:17:16 +0000 UTC [postgres] [8-1] 2021-01-29 06:17:16.918 UTC [2973] LOG:  could not receive data from client: Connection reset by peer
2021-01-29 06:17:16 +0000 UTC [postgres] [9-1] 2021-01-29 06:17:16.918 UTC [2973] FATAL:  connection to client lost
2021-01-29 06:17:16 +0000 UTC [postgres] [7-1] 2021-01-29 06:17:16.920 UTC [2811] LOG:  unexpected EOF on client connection with an open transaction
2021-01-29 06:18:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
2021-01-29 06:19:09 +0000 UTC [repmgrd] monitoring primary node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) in normal state
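When a standby drops like this, the standby's own WAL receiver view is the quickest confirmation; a sketch using pg_stat_wal_receiver (a standard PostgreSQL view, column names as of 10.x):

kubectl exec -n prelude postgres-2 -- su postgres -c \
  "psql -d repmgr-db -c 'SELECT status, received_lsn, last_msg_receipt_time FROM pg_stat_wal_receiver;'"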


postgres-2


2021-01-29 06:11:08 +0000 UTC [postgres] [6-1] 2021-01-29 06:11:08.631 UTC [7056] LOG:  fetching timeline history file for timeline 6 from primary server
2021-01-29 06:11:08 +0000 UTC [repmgr] STANDBY FOLLOW successful
2021-01-29 06:11:08 +0000 UTC [repmgr] executing notification command for event "standby_follow"
2021-01-29 06:11:08 +0000 UTC [repmgr-event] 102 standby_follow 1 2021-01-29 06:11:08.63358+00 standby attached to upstream node "postgres-1.postgres.prelude.svc.cluster.local" (ID: 101) 101 host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10 postgres-1.postgres.prelude.svc.cluster.local
2021-01-29 06:11:08 +0000 UTC [repmgr-event] Reconfiguring the db proxy with primary: postgres-1.postgres.prelude.svc.cluster.local
2021-01-29 06:11:08 +0000 UTC [postgres] [7-1] 2021-01-29 06:11:08.662 UTC [7056] LOG:  started streaming WAL from primary at 418/C2000000 on timeline 5
2021-01-29 06:11:08 +0000 UTC [sudo] postgres : TTY=unknown ; PWD=/etc/service/repmgrd ; USER=root ; COMMAND=/usr/sbin/sv stop db-proxy
2021-01-29 06:11:08 +0000 UTC [sudo] pam_unix(sudo:session): session opened for user root by (uid=0)
2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15
2021-01-29 06:11:08 +0000 UTC [socat] W exiting on signal 15
*
*
2021-01-29 06:11:08 +0000 UTC [postgres] [8-1] 2021-01-29 06:11:08.821 UTC [7056] LOG:  replication terminated by primary server
2021-01-29 06:11:08 +0000 UTC [postgres] [8-2] 2021-01-29 06:11:08.821 UTC [7056] DETAIL:  End of WAL reached on timeline 5 at 418/C2E8E700.
2021-01-29 06:11:08 +0000 UTC [postgres] [10-1] 2021-01-29 06:11:08.822 UTC [7052] LOG:  new target timeline is 6
2021-01-29 06:11:08 +0000 UTC [postgres] [9-1] 2021-01-29 06:11:08.828 UTC [7056]