NSX data collection unavailable
In preparation for using NSX network, security, and load balancing capabilities in vRealize Automation, we first have to create an NSX endpoint.
I was asked to look into a problem where, even after an NSX endpoint had been created successfully and its association mapped, the Data Collection page under Compute Resources did not show Network and Security Inventory.
Looking at the logs after the NSX endpoint was created, we can see that a data collection work item was created for it: VCNSInventory.
Reference: ManagerService / All.log
[UTC:2019-09-03 10:29:48 Local:2019-09-03 15:59:48] [Debug]: [sub-thread-Id="45" context="" token=""] DC: Created data collection item, WorkflowInstanceId 183022, Task VCNSInventory, EntityID 8ed67519-99fb-4afa-811f-227e753a24eb, StatusID = 457b3af7-b739-45b2-ab9f-0cdd79596af0
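If All.log is large, picking these entries out by eye gets tedious. Here is a minimal Python sketch that pulls the workflow instance id and entity id out of such lines. The pattern is modelled only on the single line quoted above, so treat it as an assumption that other "Created data collection item" lines follow the same layout; the function name find_dc_items is my own.

```python
import re

# Pattern modelled on the one All.log line quoted above (assumption: other
# "Created data collection item" lines use the same layout).
DC_ITEM = re.compile(
    r"DC: Created data collection item, "
    r"WorkflowInstanceId (?P<workflow_id>\d+), "
    r"Task (?P<task>\w+), "
    r"EntityID (?P<entity_id>[0-9a-fA-F-]{36})"
)


def find_dc_items(lines, task="VCNSInventory"):
    """Return (workflow_id, entity_id) for each created item of the given task."""
    found = []
    for line in lines:
        match = DC_ITEM.search(line)
        if match and match.group("task") == task:
            found.append((int(match.group("workflow_id")), match.group("entity_id")))
    return found


sample = (
    '[UTC:2019-09-03 10:29:48 Local:2019-09-03 15:59:48] [Debug]: '
    '[sub-thread-Id="45" context="" token=""] DC: Created data collection item, '
    'WorkflowInstanceId 183022, Task VCNSInventory, '
    'EntityID 8ed67519-99fb-4afa-811f-227e753a24eb, '
    'StatusID = 457b3af7-b739-45b2-ab9f-0cdd79596af0'
)
print(find_dc_items([sample]))
```

In practice you would pass the lines of the real log file instead of the sample string.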
Taking instance 183022 as an example and inspecting the worker logs:
The worker initialises the instance
2019-09-03T10:29:49.962Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4268"] [sub-thread-Id="27" context="" token=""] Worker Controller: initializing instance 183022 - vSphereVCNSInventory of the workflow execution unit
2019-09-03T10:29:52.009Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4268"] [sub-thread-Id="27" context="" token=""] WorkflowExecutionUnit: initialize started: 183022
2019-09-03T10:30:14.401Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4864"] [sub-thread-Id="28" context="" token=""] Workflow ID: 183022 Activity <Mark Data Collection Complete>: State: Closed
2019-09-03T10:30:14.417Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Debug" thread="4864"] [sub-thread-Id="28" context="" token=""] Workflow Complete: 183022 - Successful
2019-09-03T10:30:14.417Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" priority="Trace" thread="4864"] [sub-thread-Id="28" context="" token=""] Worker Controller: WriteCompletedWorkflow
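To check the outcome of a given workflow instance without reading the whole worker log, something like the following sketch can help. It relies only on the "Workflow Complete: <id> - <status>" wording visible in the lines above; whether every outcome is logged in that exact form is an assumption, and workflow_outcome is a name I made up for illustration.

```python
import re

# Matches the "Workflow Complete: 183022 - Successful" wording seen above.
WORKFLOW_COMPLETE = re.compile(r"Workflow Complete: (\d+) - (\w+)")


def workflow_outcome(lines, workflow_id):
    """Return the logged status for workflow_id, or None if no completion line exists."""
    for line in lines:
        match = WORKFLOW_COMPLETE.search(line)
        if match and int(match.group(1)) == workflow_id:
            return match.group(2)
    return None


worker_log = [
    '2019-09-03T10:29:52.009Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" '
    'priority="Trace" thread="4268"] [sub-thread-Id="27" context="" token=""] '
    'WorkflowExecutionUnit: initialize started: 183022',
    '2019-09-03T10:30:14.417Z DC-DEM02 vcac: [component="iaas:DynamicOps.DEM.exe" '
    'priority="Debug" thread="4864"] [sub-thread-Id="28" context="" token=""] '
    'Workflow Complete: 183022 - Successful',
]
print(workflow_outcome(worker_log, 183022))
```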
As shown above, the workflow went through data collection and was marked successful, but the result never showed up in the UI.
At this point, when we ran Test Connection on an endpoint and clicked OK, the connection test succeeded but the endpoint could not be saved.
That's when I suspected that something must be wrong with the endpoints table.
That suspicion was confirmed after reviewing the API data captured in a HAR file.
Now we knew for certain that something was wrong with the endpoints.
Running the query select * from ManagementEndpoints showed that there were stale entries for all vSphere endpoints.
Ideally there should be only one entry per vSphere endpoint in this table, but here we had two per vSphere endpoint.
How do we identify which entry is the correct one and which ManagementEndpointId should be deleted?
For this, grep vSphereAgent.log (the proxy agent log) and search for managementEndpointId. The managementEndpointId you find in the log is the correct one, and the row with this value in the ManagementEndpointID column of the dbo.ManagementEndpoints table must remain.
Example 2019-09-09T03:54:23.466Z DC-AGENT01 vcac: [component="iaas:VRMAgent.exe" priority="Debug" thread="900"] [sub-thread-Id="6" context="" token=""] Ping Sent Successfully : [<?xml version="1.0" encoding="utf-16"?><pingReport agentName="vCenter" agentVersion="220.127.116.11" agentLocation="PRDVC" WorkitemsProcessed="9254"><Endpoint externalReferenceId="cbeebd33-245a-4b18-a8a8-d337e8c46627" productName="VMware vCenter Server" version="6.5.0" licenseName="VMware vCenter Server 6 Standard" /><ManagementEndpoint Name="vCenter" /><Nodes><Node name="SINGAPORE" type="Cluster" identity="prodvc/IDBI DC/host/SINGAPORE" datacenterExternalReferenceId="datacenter-21" externalReferenceId="domain-c26" isCluster="True" managementEndpointId="e5b052e1-0792-465a-a2a8-6b8b031f48ac" /><Node name="DC_PRODUCTION_RHEL_CLUSTER" type="Cluster" identity="prodvc/SGP/host/SINGAPORE" datacenterExternalReferenceId="datacenter-21" externalReferenceId="domain-c1310" isCluster="True" managementEndpointId="e5b052e1-0792-465a-a2a8-6b8b031f48ac" /></Nodes><AgentTypes><AgentType name="Hypervisor" /><AgentType name="vSphereHypervisor" /></AgentTypes></pingReport>]
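The cross-check between the agent log and the table can be sketched in Python as below. The sample line is an abbreviated copy of the ping report quoted above, and the function names are my own. IDs are compared in lower case because the log shows them in lower case while the delete statements in this post show them in upper case. Treat the output only as a list of candidates to review, not as something to delete blindly.

```python
import re

# managementEndpointId attribute as it appears on the <Node> elements above.
NODE_ENDPOINT_ID = re.compile(r'managementEndpointId="([0-9a-fA-F-]{36})"')


def active_endpoint_ids(log_lines):
    """Collect the endpoint ids the proxy agent reports in its ping lines."""
    ids = set()
    for line in log_lines:
        if "Ping Sent Successfully" in line:
            ids.update(i.lower() for i in NODE_ENDPOINT_ID.findall(line))
    return ids


def stale_endpoint_ids(table_ids, active_ids):
    """Return ids from dbo.ManagementEndpoints that no agent reports."""
    return sorted(i.lower() for i in table_ids if i.lower() not in active_ids)


# Abbreviated version of the vSphereAgent.log line quoted above.
ping_line = (
    'Ping Sent Successfully : [<pingReport agentName="vCenter"><Nodes>'
    '<Node name="SINGAPORE" type="Cluster" '
    'managementEndpointId="e5b052e1-0792-465a-a2a8-6b8b031f48ac" />'
    '</Nodes></pingReport>]'
)
active = active_endpoint_ids([ping_line])

# Ids as they might be listed by: select * from ManagementEndpoints
table_ids = [
    "E5B052E1-0792-465A-A2A8-6B8B031F48AC",
    "E15DFAAE-229E-4874-AACB-793BDB6076F4",
]
print(stale_endpoint_ids(table_ids, active))
```

With more than one proxy agent, feed in the log lines from every agent before comparing, otherwise a valid endpoint served by another agent would be flagged as stale.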
Now that we knew which entries were correct, by cross-checking vSphereAgent.log against the ManagementEndpoints table, we had to remove the stale entries from this table.
We took a backup of the SQL IaaS database along with snapshots, then executed delete statements for the entries we had identified as stale:
delete from dbo.ManagementEndpoints where ManagementEndpointID = 'E15DFAAE-229E-4874-AACB-793BDB6076F4';
delete from dbo.ManagementEndpoints where ManagementEndpointID = '03CACB31-23DD-444C-A493-8DDC8BC4E4CF';
But this did not solve our problem. After removing the stale entries, saving the endpoints threw a different exception.
When you create an endpoint in vRA, it creates an entry not only in IaaS but also in vRA's Postgres database.
We explored a table called epconf_endpoint. This table holds all endpoints created through the vRA UI, and the id in the Postgres database must match the ManagementEndpointId in the SQL database (IaaS).
Remember, these were the ids we deleted from SQL. The error "Endpoint with id [xxxxxx] is not found in iaas" is caused by this discrepancy between IaaS and Postgres.
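The matching rule between the two databases is easy to express as a small check. Below is a rough Python sketch that takes the name and id pairs from epconf_endpoint and the ids present in dbo.ManagementEndpoints, and reports which Postgres entries have no counterpart in IaaS. How you export those values is up to you. The post does not say which deleted id belonged to which endpoint, so the pairing in the example is purely illustrative, and mismatched_endpoints is a hypothetical helper name.

```python
def mismatched_endpoints(postgres_endpoints, iaas_ids):
    """Return {name: id} for epconf_endpoint rows whose id is missing from IaaS.

    postgres_endpoints: dict of endpoint name -> id from epconf_endpoint
    iaas_ids: iterable of ManagementEndpointID values from dbo.ManagementEndpoints
    """
    known = {i.lower() for i in iaas_ids}  # compare case-insensitively
    return {
        name: endpoint_id
        for name, endpoint_id in postgres_endpoints.items()
        if endpoint_id.lower() not in known
    }


# Illustrative pairing only: the post does not state which id belonged to which name.
postgres_endpoints = {
    "vCenter": "e15dfaae-229e-4874-aacb-793bdb6076f4",
    "vCenter01": "03cacb31-23dd-444c-a493-8ddc8bc4e4cf",
}
iaas_ids = [
    "E5B052E1-0792-465A-A2A8-6B8B031F48AC",
    "5646FA1E-6A2B-4D08-9381-219FE6D92A5E",
]
print(mismatched_endpoints(postgres_endpoints, iaas_ids))
```

An empty result would mean every endpoint known to Postgres also exists in IaaS.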
Updating the ids in Postgres with the correct ones for the appropriate endpoints would resolve this data mismatch. But there is a catch.
An NSX endpoint had already been created, which is no surprise, since getting it to work is exactly what we were troubleshooting.
Along with the NSX endpoint an association is created, and this association is stored in the epconf_association table.
This association table contains:
id : the id of the association
from_endpoint_id : the NSX endpoint id from the IaaS database, which is also the id in epconf_endpoint in the Postgres database
to_endpoint_id : the vSphere endpoint that the NSX endpoint is mapped to
Note : NSX endpoint information is stored in the table [DynamicOps.VCNSModel].[VCNSEndpoints] of the IaaS database
This is where we found an answer to our problem
The to_endpoint_id inside epconf_association was pointing to a wrong id
Both ids in epconf_endpoint had to be changed to the ones present in IaaS.
As a first step, we deleted the NSX endpoint from the vRA UI. This removed its entry from epconf_association, so that table no longer needed updating.
After removing the NSX endpoint from the UI, we moved on to epconf_endpoint to update the ids with the correct ones taken from the IaaS database.
Updating vCenter endpoint
update epconf_endpoint set id = 'e5b052e1-0792-465a-a2a8-6b8b031f48ac' where name = 'vCenter';
Updating vCenter01 endpoint
update epconf_endpoint set id = '5646fa1e-6a2b-4d08-9381-219fe6d92a5e' where name = 'vCenter01';
After we corrected the ids in epconf_endpoint (Postgres) to match ManagementEndpointID (IaaS database), we were able to save the endpoints successfully.
After this, creating the NSX endpoint and mapping it to the correct vSphere endpoint resulted in a successful NSX data collection.
!! Hope this helps !!