vIDM Inventory Sync was stuck and made no progress even though vIDM was healthy. Why?



 

Last night we configured ADFS on vIDM (globalenvironment) in my lab, and it was working as expected with vRealize Automation.


Over a period of time, we noticed that globalenvironment on vRSLCM was not reporting its health status.


This was weird as vIDM was healthy and authentication was going through as expected.



When I browsed to the Identity and Tenant Management pane, there was a message stating:


VMware Identity Manager is not available at the moment. There are some requests which are in progress


There was something wrong with the new ADFS configuration, as the permissions seemed to be messed up as well.





vIDM Inventory Sync never went past the initial stage






*** Request Creation *** 


2022-06-15 01:43:32.459 INFO  [http-nio-8080-exec-7] c.v.v.l.l.u.RequestSubmissionUtil -  -- ++++++++++++++++++ Creating request to Request_Service :::>>> {
  "vmid" : "7d637ba2-5294-4e92-836a-2610a509fa03",
  "transactionId" : null,
  "tenant" : "default",
  "requestName" : "productinventorysync",
  "requestReason" : "VIDM in Environment globalenvironment - Product Inventory Sync",
  "requestType" : "PRODUCT_INVENTORY_SYNC",
  "requestSource" : "globalenvironment",
  "requestSourceType" : "user",
  "inputMap" : {
    "environmentId" : "globalenvironment",
    "productId" : "vidm",
    "tenantId" : ""
  },
  "outputMap" : { },
  "state" : "CREATED",
  "executionId" : null,
  "executionPath" : null,
  "executionStatus" : null,
  "errorCause" : null,
  "resultSet" : null,
  "isCancelEnabled" : null,
  "lastUpdatedOn" : 1655257412458,
  "createdBy" : null
}



*** Request Response *** 


2022-06-15 01:43:32.466 INFO  [http-nio-8080-exec-7] c.v.v.l.l.u.RequestSubmissionUtil -  -- Generic Request Response : {
  "requestId" : "7d637ba2-5294-4e92-836a-2610a509fa03"
}
2022-06-15 01:43:34.001 INFO  [scheduling-1] c.v.v.l.r.c.RequestProcessor -  -- Number of request to be processed : 1
2022-06-15 01:43:34.022 INFO  [scheduling-1] c.v.v.l.r.c.p.ProductInventorySyncPlanner -  -- Creating spec for inventory sync for product : vidm in environment : globalenvironment
2022-06-15 01:43:34.025 INFO  [scheduling-1] c.v.v.l.r.u.InfrastructurePropertiesHelper -  -- VCF properties: {
  "vcfEnabled" : false,
  "sddcManagerDetails" : [ ]
}
2022-06-15 01:43:34.027 INFO  [scheduling-1] c.v.v.l.r.c.p.ProductInventorySyncPlanner -  -- Found product with id vidm
2022-06-15 01:43:34.038 INFO  [scheduling-1] c.v.v.l.r.c.p.CreateEnvironmentPlanner -  -- Not a clustered vIDM, fetching the hostname from primary node.



*** Suite Request is generated and then set to IN_PROGRESS ***

2022-06-15 01:43:34.785 INFO  [scheduling-1] c.v.v.l.r.c.RequestProcessor -  -- Processing request with ID : 7d637ba2-5294-4e92-836a-2610a509fa03 with request type PRODUCT_INVENTORY_SYNC with request state INPROGRESS.
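When digging through a large vRSLCM log for a request that never moves past this point, it can help to pull out only the "Processing request" lines. The following is a minimal sketch, assuming the log line layout shown in the excerpt above; the regular expression and the function name `find_processing_events` are my own illustration and not part of vRSLCM.

```python
import re

# Two lines copied from the log excerpt above; in practice these would be
# read from the vRSLCM log file.
LOG_LINES = [
    "2022-06-15 01:43:34.001 INFO  [scheduling-1] c.v.v.l.r.c.RequestProcessor"
    " -  -- Number of request to be processed : 1",
    "2022-06-15 01:43:34.785 INFO  [scheduling-1] c.v.v.l.r.c.RequestProcessor"
    " -  -- Processing request with ID : 7d637ba2-5294-4e92-836a-2610a509fa03"
    " with request type PRODUCT_INVENTORY_SYNC with request state INPROGRESS.",
]

# Assumed layout: timestamp, arbitrary text, then the "Processing request" message.
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"Processing request with ID : (?P<id>[0-9a-f-]{36}) "
    r"with request type (?P<type>\w+) "
    r"with request state (?P<state>\w+)\."
)


def find_processing_events(lines):
    """Return one dict (ts, id, type, state) per 'Processing request' line."""
    events = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            events.append(match.groupdict())
    return events


if __name__ == "__main__":
    for event in find_processing_events(LOG_LINES):
        print(event["ts"], event["id"], event["type"], event["state"])
```

A request ID that shows up once with state INPROGRESS and never again in later log lines is a good candidate for a stuck request.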





Browsing to "Authentication Provider" under Settings would return an exception stating:


Failed to fetch Complete Authentication Provider details



The networksettings API returned 400, and the response was "No Settings".




 

The problem described here could be caused either by the engine not picking up new requests because it is already busy processing others (overloaded), or by requests that are genuinely unprocessed and stuck.


So I logged into the database and executed a query that returns the requests stuck in the "IN_PROGRESS" state.



Log in to the vRSLCM database


/opt/vmware/vpostgres/11/bin/psql -U postgres -d vrlcm


Check for IN_PROGRESS entries in the vm_engine_event table



select currentState,status from vm_engine_event where status='IN_PROGRESS';


We do see 50 of them.


If this number is greater than or equal to 50 (on vRSLCM versions later than 8.4), then we need to follow the steps below to clear this data and bring the system back to a functional state.
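The threshold check can be expressed as a small helper. This is only a sketch: the value 50 comes from the text above, while the function names and the sample output (written in psql's default aligned format) are illustrative assumptions.

```python
STUCK_THRESHOLD = 50  # threshold quoted in the text above


def parse_psql_count(output: str) -> int:
    """Extract the integer from psql's default output for a select count(*)."""
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.isdigit():
            return int(stripped)
    raise ValueError("no count found in psql output")


def needs_cleanup(count: int, threshold: int = STUCK_THRESHOLD) -> bool:
    """True when the number of IN_PROGRESS events has reached the threshold."""
    return count >= threshold


if __name__ == "__main__":
    # Illustrative sample in psql's default aligned output format.
    sample = " count \n-------\n    50\n(1 row)\n"
    count = parse_psql_count(sample)
    print(count, needs_cleanup(count))
```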




 

Remediation



Before proceeding any further, please take a snapshot of the vRSLCM appliance. This is mandatory.



Here's the plan to fix this issue:




*** select count(*) queries. These queries will help you in identifying the number of records per table in a defined state  ***

select count(*) from vm_rs_request where requestname= 'lcmgenericsetting';

select count(*) from vm_engine_execution_request where enginestatus= 'INITIATED';

select count(*) from vm_engine_statemachine_instance where status= 'CREATED';

select count(*) from vm_engine_event where status= 'IN_PROGRESS';



*** execute these delete queries. This would clear up all stuck requests  ***

delete from vm_rs_request where requestname= 'lcmgenericsetting';

delete from vm_engine_execution_request where enginestatus= 'INITIATED';

delete from vm_engine_statemachine_instance where status= 'CREATED';

delete from vm_engine_event where status= 'IN_PROGRESS';


*** Perform a full VACUUM of the vm_rs_request table ***

VACUUM FULL verbose analyze vm_rs_request;



*** exit database ***
\q



*** Restart vRSLCM Service ***

systemctl restart vrlcm-server 
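Since every count query in the plan has a matching delete with the same filter, the SQL part of the plan can be generated from a single list so the two can never drift apart. The following is a hedged sketch, not a VMware-provided tool: the table, column, and value names come from the plan above (with `lcmgenericsetting` spelled as in the query actually executed later in this post), while the function names and the BEGIN/COMMIT wrapper around the deletes are my own additions. VACUUM is kept outside the transaction because PostgreSQL does not allow it inside a transaction block.

```python
# (table, column, value) triples taken from the plan above.
CLEANUP_TARGETS = [
    ("vm_rs_request", "requestname", "lcmgenericsetting"),
    ("vm_engine_execution_request", "enginestatus", "INITIATED"),
    ("vm_engine_statemachine_instance", "status", "CREATED"),
    ("vm_engine_event", "status", "IN_PROGRESS"),
]


def count_sql(table, column, value):
    """Build the select count(*) statement for one target."""
    return f"select count(*) from {table} where {column} = '{value}';"


def delete_sql(table, column, value):
    """Build the delete statement with exactly the same filter."""
    return f"delete from {table} where {column} = '{value}';"


def build_script(targets):
    """Return the SQL script: counts first, then deletes, then VACUUM."""
    lines = [count_sql(*target) for target in targets]
    # Assumption: wrapping the deletes in one transaction is not in the
    # original plan; it only makes the deletes all-or-nothing.
    lines.append("BEGIN;")
    lines.extend(delete_sql(*target) for target in targets)
    lines.append("COMMIT;")
    # VACUUM cannot run inside a transaction block, so it comes after COMMIT.
    lines.append("VACUUM FULL verbose analyze vm_rs_request;")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_script(CLEANUP_TARGETS))
```

The printed script can be reviewed and then pasted into the psql session; taking the snapshot and restarting vrlcm-server remain manual steps.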



Let's implement this plan at hand in my environment and see if it helps




   
*** select count(*) queries. These queries will help you in identifying the number of records per table in a defined state  ***  

 vrlcm=# select count(*) from vm_engine_event where status='IN_PROGRESS'; 
 count 
------- 
    50



vrlcm=# select count(*) from vm_engine_statemachine_instance  where status= 'CREATED'; 
 count 
------- 
  1580
  
  
  
vrlcm=# select count(*)  from vm_engine_execution_request where enginestatus = 'INITIATED'; 
 count 
------- 
  1616
  
  

vrlcm=# select count(*) from vm_rs_request where requestname = 'lcmgenericsetting'; 
 count 
------- 
     0

     
*** execute these delete queries. This would clear up all stuck requests  ***


vrlcm=# delete from vm_engine_event where status='IN_PROGRESS'; 
DELETE 50




vrlcm=# delete from vm_engine_statemachine_instance  where status= 'CREATED'; 
DELETE 1580




vrlcm=# delete from vm_engine_execution_request where enginestatus = 'INITIATED'; 
DELETE 1616


*** Note: since select count(*) from vm_rs_request where requestname = 'lcmgenericsetting'; returned 0 records, I am not executing the delete statement ***
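A quick sanity check at this point is to confirm that each DELETE removed exactly as many rows as the earlier count reported. Here is a small sketch using the numbers from the session above; the helper names are hypothetical.

```python
import re

# Counts reported by the select count(*) queries in the session above.
COUNTS_BEFORE = {
    "vm_engine_event": 50,
    "vm_engine_statemachine_instance": 1580,
    "vm_engine_execution_request": 1616,
}

# Command tags printed by psql after each delete in the session above.
DELETE_TAGS = {
    "vm_engine_event": "DELETE 50",
    "vm_engine_statemachine_instance": "DELETE 1580",
    "vm_engine_execution_request": "DELETE 1616",
}


def parse_delete_tag(tag: str) -> int:
    """Return the row count from a psql command tag such as 'DELETE 50'."""
    match = re.fullmatch(r"DELETE (\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unexpected command tag: {tag!r}")
    return int(match.group(1))


def find_mismatches(counts, tags):
    """Map table -> (counted, deleted) for every table where they differ."""
    mismatches = {}
    for table, counted in counts.items():
        deleted = parse_delete_tag(tags[table])
        if deleted != counted:
            mismatches[table] = (counted, deleted)
    return mismatches


if __name__ == "__main__":
    print(find_mismatches(COUNTS_BEFORE, DELETE_TAGS) or "all deletes match the counts")
```

A mismatch would simply mean rows were added or removed between the count and the delete, which is worth knowing before restarting the service.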


*** Perform a full VACUUM of the vm_rs_request table ***



vrlcm=# VACUUM FULL verbose analyze vm_rs_request; 
INFO:  vacuuming "public.vm_rs_request" 
INFO:  "vm_rs_request": found 48 removable, 1565 nonremovable row versions in 379 pages 
DETAIL:  0 dead row versions cannot be removed yet. 
CPU: user: 0.06 s, system: 0.02 s, elapsed: 0.46 s. 
INFO:  analyzing "public.vm_rs_request" 
INFO:  "vm_rs_request": scanned 339 of 339 pages, containing 1565 live rows and 0 dead rows; 1565 rows in sample, 1565 estimated total rows 
VACUUM


*** exit database ***
\q

*** Restart vRSLCM Service ***

systemctl restart vrlcm-server




After implementing this plan, it takes 2 minutes for the server to start up and return to its functional state.



Then, if we go ahead and check the Authentication setting information:




The globalenvironment product inventory sync completed too.





To conclude, something went wrong in last night's tests in the lab, which triggered these queued requests in vRSLCM.


Happy to get the environment back to a functional state. Ensure a snapshot is taken before making any changes. I can't stress enough how important this is if things do not work out.




 
