• Arun Nukula

Timed out while waiting for Event Broker response

I recently came across a problem where provisioning and reconfiguration of virtual machines through vRealize Automation were failing.


Exceptions seen in catalina.out:

2019-06-02 09:53:25,643 vcac: [component="cafe:event-broker" priority="INFO" thread="event-broker-service-taskExecutor8" tenant="" context="" parent="" token=""] com.vmware.vcac.core.event.broker.integration.PublishReplyEventServiceActivator.onApplicationEvent:96 - Message Broker unavailable[internal stop]

2019-04-02 09:53:25,711 vcac: [component="cafe:console-proxy" priority="WARN" thread="Grizzly(1)" tenant="" context="" parent="" token=""] com.vmware.vcac.platform.event.broker.client.stomp.StompEventSubscribeHandler.handleException:493- Error during message processing: session:[f4329ebb-4e2e-7690-8b6a-3f420c8bd226], command[null], headers[{message=[Connection to broker closed.], content-length=[0]}], payload [{}]. Reason : [Connection to broker closed.]

2019-04-02 09:53:25,712 vcac: [component="cafe:console-proxy" priority="ERROR" thread="Grizzly(1)" tenant="" context="" parent="" token=""] com.vmware.vcac.core.service.event.ServerEventBrokerServiceFacade.handleError:337 - Error for command 'null', headers: '{message=[Connection to broker closed.], content-length=[0]}'
java.lang.Exception: Connection to broker closed.


The exceptions above clearly point to the message broker, which is RabbitMQ.


Resetting RabbitMQ and then re-adding the second and third nodes (if present) to the master would eventually resolve the problem.
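For reference, the reset-and-rejoin step described above can be sketched with plain rabbitmqctl commands. This is only a rough sketch: the function name rejoin_cluster is hypothetical, and it assumes rabbitmqctl can be run directly on the appliance. vRealize Automation may provide its own supported reset procedure, which should be preferred if it exists. Note that rabbitmqctl reset wipes the local node's RabbitMQ state, so it is meant for the node being re-added, not the master.

```shell
#!/bin/sh
# Hypothetical helper: rejoin the local RabbitMQ node to an existing cluster.
# $1 is the node name of the master, e.g. rabbit@psvra01.nukescloud.com
# Assumption: plain rabbitmqctl is usable on this appliance.
rejoin_cluster() {
    master_node="$1"
    rabbitmqctl stop_app &&                    # stop the RabbitMQ application, keep the Erlang VM up
    rabbitmqctl reset &&                       # wipe local state and leave any previous cluster
    rabbitmqctl join_cluster "$master_node" && # join the cluster the master belongs to
    rabbitmqctl start_app                      # start the application again
}
```

It would be run on the second or third node, for example as rejoin_cluster rabbit@psvra01.nukescloud.com, and each step only runs if the previous one succeeded.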


After a fair bit of digging through the RabbitMQ logs, we see:


on node psvra01.nukescloud.com:


=INFO REPORT==== 11-Jun-2019::16:31:41 ===

rabbit on node 'rabbit@psvra03.nukescloud.com' down


=INFO REPORT==== 11-Jun-2019::16:31:41 ===

Keep rabbit@psvra03.nukescloud.com listeners: the node is already back



on node psvra03.nukescloud.com:


=INFO REPORT==== 11-Jun-2019::16:54:12 ===

rabbit on node 'rabbit@psvra01.nukescloud.com' down


=INFO REPORT==== 11-Jun-2019::16:54:12 ===

Keep rabbit@psvra01.nukescloud.com listeners: the node is already back


...


=INFO REPORT==== 11-Jun-2019::18:55:09 ===

rabbit on node 'rabbit@psvra01.nukescloud.com' down


=INFO REPORT==== 11-Jun-2019::18:55:09 ===

Keep rabbit@psvra01.nukescloud.com listeners: the node is already back


The snippets above clearly show network partitions or connectivity issues between the nodes.
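To get a quick feel for how often a node sees its peers drop, the "down" reports in a log can simply be counted. The sketch below is a minimal example: count_node_down is a hypothetical helper name, and the log path in the usage note is an assumption that may differ on your appliance.

```shell
#!/bin/sh
# Hypothetical helper: count "rabbit on node '<name>' down" reports in a RabbitMQ log.
# $1 is the path to the log file to scan.
count_node_down() {
    log_file="$1"
    # grep -c prints the number of matching lines
    grep -c "rabbit on node '.*' down" "$log_file"
}
```

For example, count_node_down /var/log/rabbitmq/rabbit@psvra01.log (path assumed). Running rabbitmqctl cluster_status on each node is also worthwhile, since its output includes a partitions entry that is normally empty.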


RabbitMQ clusters do not tolerate network partitions well and do not recover from them properly.


Running the command below makes the cluster somewhat more resilient to these network partitions:


rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-failure":"always","ha-promote-on-shutdown":"always"}'
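After setting the policy, it is worth confirming that it exists and that the queues picked it up. A small sketch, assuming rabbitmqctl is run on one of the cluster nodes (show_ha_policy is a hypothetical helper name):

```shell
#!/bin/sh
# Hypothetical helper: show the defined policies and which policy each queue uses.
show_ha_policy() {
    rabbitmqctl list_policies            # ha-all should be listed with its definition
    rabbitmqctl list_queues name policy  # each queue name with the policy applied to it
}
```

Because the policy pattern is an empty string, it should match every queue, so each queue would be expected to show ha-all in the policy column.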

Newer versions of vRealize Automation have mechanisms in place to detect this sort of issue and attempt an automated recovery.


The command above helps to a certain extent, but there still has to be a fully available, redundant network between the vRealize Automation nodes.



Copyright © 2019 nukescloud