Workflow exceptions after onboarding a virtual machine


Last week I was involved in a use case where our team was helping one of our customers onboard virtual machines to vRealize Automation 8.x.


Once a virtual machine is onboarded, an extensibility subscription triggers a vRO workflow that gets all of the machine's properties from vRealize Automation 7.x and then updates the machine's records in version 8.x.


However, for some reason we were hitting an exception stating:



2022-06-02T08:16:07.454Z ERROR vco [host='vco-app-69b4969975-hvgbp' thread='WorkflowExecutorPool-Thread-65' user='vro-gateway-U9ITj3QActbfOqea' org='-' trace='-'] {|__SYSTEM|vro-gateway-U9ITj3QActbfOqea:Migration Set VM Properties :339411e8-ae83-4318-8e0f-924d3456a591:token=92f0de3f-07e7-475b-affd-3a2439d4f1d1} SCRIPTING_LOG - Workflow execution stack: 
*** 
item: 'Migration Set VM Properties /item4', state: 'failed', business state: '', exception: 'Error: Unable to set properties for machine 2ef5658e-3a51-381d-9cfb-9d1ff3531095. Error: 404 Service not found: http://10.244.2.28:8282/resources/compute/<<resourceid>> (Dynamic Script Module name : updateMachineProperties#26)' 
workflow: 'Migration Set VM Properties ' (339****91) 

Going by the exception, the workflow is unable to set the properties of the machine because the service (provisioning-service) returns a 404, which means there is some discrepancy in the machine's data/records.



 

Investigation



  • Based on the timestamp mentioned above, we start by looking at the provisioning service. Filtering on the trace id, we see just the five statements below:


2022-06-02T08:16:07.412Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='reactor-http-epoll-9' user='configadmin' org='<<replacedorgid>>' trace='9272***2986' parent='' span='c4e755***43e2'] c.v.i.api.common.ControllerRoutingUtil.handleOperationInternal:464 - [Username=configadmin][OrgID=<<replacedorgid>>] The request is missing mandatory query parameter apiVersion.

2022-06-02T08:16:07.414Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='xn-index-queries-19' user='configadmin' org='<<replacedorgid>>' trace='927****2986' parent='' span='c4e75****3e2'] c.v.xenon.common.ServiceErrorResponse.create:83 - message: Service not found: http://10.244.2.28:8282/resources/compute/<<resourceid>>, statusCode: 404, serverErrorId: ff2f34d3-bdab-4b1c-b4f5-378f3b46849d

2022-06-02T08:16:07.414Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='xn-index-queries-19' user='configadmin' org='<<replacedorgid>>' trace='92724***72986' parent='' span='c4e75***43e2'] c.v.a.s.w.p.x.c.StatefulServiceController.lambda$loadService$20:704 - Failed to start service /resources/compute/<<resourceid>> with 404 status code.

2022-06-02T08:16:07.414Z WARN provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='xn-index-queries-19' user='configadmin' org='<<replacedorgid>>' trace='927****72986' parent='' span='c4e75***3e2'] c.v.xenon.common.ServiceErrorResponse.create:83 - message: Service not found: http://10.244.2.28:8282/resources/compute/<<resourceid>>, statusCode: 404, serverErrorId: 1a207e42-2daa-48dd-bc4b-6c5f7599da47

2022-06-02T08:16:07.416Z INFO provisioning [host='provisioning-service-app-5d7b9467b7-lz5j9' thread='reactor-http-epoll-9' user='configadmin' org='<<replacedorgid>>' trace='92724***2986' parent='' span='c4e755***43e2'] reactor.netty.http.server.AccessLog.info:270 - 10.244.2.11 - - [02/Jun/2022:08:16:07 +0000] "PATCH /iaas/api/machines/<<resourceid>> HTTP/1.1" 404 269 8282 31 ms
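Pulling exactly these statements out of a pod log can be done by grepping on the trace id. A minimal, self-contained sketch; the file path and the sample lines below are illustrative placeholders, not the real provisioning log:

```shell
# Build a small sample log (illustrative lines only), then filter it by
# trace id the same way we filtered the provisioning-service log above.
cat > /tmp/provisioning-sample.log <<'EOF'
2022-06-02T08:16:07.412Z WARN provisioning [trace='9272xxx2986'] missing apiVersion
2022-06-02T08:16:07.414Z WARN provisioning [trace='9272xxx2986'] Service not found
2022-06-02T08:16:08.100Z INFO provisioning [trace='other-trace'] unrelated request
EOF
grep "trace='9272xxx2986'" /tmp/provisioning-sample.log
```

Only the lines carrying the trace id of the failing request are returned, which is how the five statements above were isolated.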

  • The orgID and resourceid values have been replaced for security reasons.

  • As we can see, we get a 404, which means the resource isn't found when the PATCH call is made:

PATCH /iaas/api/machines/<<resourceid>> HTTP/1.1" 404              
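For reference, the failing request can be sketched without touching a live environment. The endpoint shape follows the vRA IaaS API (`/iaas/api/machines/{id}` with a mandatory `apiVersion` query parameter, whose absence explains the first WARN in the log above); the base URL, machine id, and apiVersion value here are illustrative assumptions:

```python
# Sketch only: builds the PATCH URL and interprets the status code.
# Base URL, machine id, and apiVersion value are illustrative assumptions.

def machine_patch_url(base_url: str, machine_id: str, api_version: str) -> str:
    # apiVersion is mandatory - omitting it produces the
    # "missing mandatory query parameter apiVersion" WARN seen above
    return f"{base_url}/iaas/api/machines/{machine_id}?apiVersion={api_version}"

def interpret(status: int) -> str:
    if status == 404:
        # No backing /resources/compute/<id> document for this machine
        return "machine document not found - data discrepancy"
    return "ok" if 200 <= status < 300 else f"unexpected status {status}"

url = machine_patch_url("https://vra.example.com", "2ef5658e-3a51", "2021-07-15")
print(url)
print(interpret(404))
```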

  • Now that we know there is an issue somewhere with the resource data, let's log in to the database and see if we can find a clue.

  • Note: search_columns is a function. It can be created by executing the code below in Postgres:

-- Search every text-castable column in the given schemas for an exact
-- match of "needle" and report the table/column/ctid where it was found.
CREATE OR REPLACE FUNCTION search_columns(
   needle text,
   haystack_tables name[] default '{}',      -- empty array means "all tables"
   haystack_schema name[] default '{public}'
)
RETURNS table(schemaname text, tablename text, columnname text, rowctid text)
AS $$
begin
 -- Iterate over every column of every base table in the target schemas
 FOR schemaname,tablename,columnname IN
     SELECT c.table_schema,c.table_name,c.column_name
     FROM information_schema.columns c
     JOIN information_schema.tables t ON
       (t.table_name=c.table_name AND t.table_schema=c.table_schema)
     WHERE (c.table_name=ANY(haystack_tables) OR haystack_tables='{}')
       AND c.table_schema=ANY(haystack_schema)
       AND t.table_type='BASE TABLE'
 LOOP
   -- Probe the column for an exact (text-cast) match of the needle
   EXECUTE format('SELECT ctid FROM %I.%I WHERE cast(%I as text)=%L',
      schemaname,
      tablename,
      columnname,
      needle
   ) INTO rowctid;
   IF rowctid is not null THEN
     RETURN NEXT;
   END IF;
 END LOOP;
END;
$$ language plpgsql;

  • After the function is created, we will use it to identify all the tables in which the resource is present, based on the machine name:

select * from search_columns('nameoftheresourceonboarded');

  • We will connect to the catalog database and execute the queries. The result returned would look something like the one below.



  • Taking this result into consideration, let's search both tables. Starting with dep_request_event, we get four records: two ONBOARDED records and two ACTION records.


select * from dep_request_event where resource_name = 'nameoftheresourceonboarded';


  • Two ONBOARDED records indicate that the machine was onboarded twice; note the difference in the timestamps.

  • The next query searches the dep_resource table:


select id,name,deployment_id,summary_properties,origin,provider_resource_name,accounts from dep_resource where name = 'nameoftheresourceonboarded';


  • So we have two records: one DISCOVERED and one ONBOARDED.

  • Keep the ids from this table handy. For security reasons I've wiped them out, but the ids here should map to document_self_link in the compute_state table of the provisioning database.
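The duplicate pattern can also be checked mechanically once the rows are exported. A small sketch (the row values are hypothetical placeholders, not real ids) that flags any resource name carrying both a DISCOVERED and an ONBOARDED record in dep_resource:

```python
# Flag resources with both a DISCOVERED and an ONBOARDED dep_resource row.
# Row values are hypothetical placeholders, not real ids.
from collections import defaultdict

rows = [  # (id, name, origin)
    ("284ac-aaca9", "vm-onboarded-01", "DISCOVERED"),
    ("2ef56-31095", "vm-onboarded-01", "ONBOARDED"),
    ("aaaaa-00001", "vm-healthy-02", "ONBOARDED"),
]

def find_duplicates(rows):
    by_name = defaultdict(set)
    for _id, name, origin in rows:
        by_name[name].add(origin)
    # A name seen in both states is the duplicate pattern described above
    return [n for n, origins in by_name.items()
            if {"DISCOVERED", "ONBOARDED"} <= origins]

print(find_duplicates(rows))  # -> ['vm-onboarded-01']
```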

  • Let's now connect to the provisioning database and identify which tables the resource name is part of:

select * from search_columns('nameoftheresourceonboarded');
  • The result would be as shown below.


  • Keep this in mind when we review the records in the compute_description table:


select document_self_link,document_update_action,TO_TIMESTAMP(document_update_time_micros/1000000),id,TO_TIMESTAMP(document_creation_time_micros/1000000) from compute_description where name = 'nameoftheresourceonboarded';
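The *_time_micros columns store epoch microseconds, which is why the query divides by 1,000,000 inside TO_TIMESTAMP. The same conversion in Python, using the timestamp of the failing request from the log above:

```python
# Convert an epoch-microseconds value (document_update_time_micros style)
# to a UTC timestamp, mirroring TO_TIMESTAMP(micros/1000000) in the SQL.
from datetime import datetime, timezone

micros = 1654157767414000  # matches the 2022-06-02T08:16:07.414Z log line
dt = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
print(dt.isoformat())  # 2022-06-02T08:16:07.414000+00:00
```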

  • There is one record for DELETE and another for PATCH.

  • Next, check the other table, compute_state:


select document_self_link,document_update_action,TO_TIMESTAMP(document_update_time_micros/1000000),id,TO_TIMESTAMP(document_creation_time_micros/1000000) from compute_state where name = 'nameoftheresourceonboarded';


  • Only one record is returned. What's important to understand here is that the id from dep_resource should match the document_self_link of the ONBOARDED resource.

  • But what we observed is that it's matched with the DISCOVERED document_self_link instead. See the snippet below:


2022-06-02T08:15:37.690Z INFO tango-blueprint [host='tango-blueprint-service-app-5b46464947-xtkw8' thread='http-nio-8080-exec-11' user='relocation-8d1Zeo8aFltm4u8j(configadmin)' org='replacedorgid' project='2f***82' deployment='a68***270' trace='ad9***78744'] com.vmware.tango.blueprint.service.BlueprintDeploymentService - Sending read tile operation for resource 2ef5***095 with resource link /resources/compute/284ac****aaca9

  • Ideally, it should read as below:


 Sending read tile operation for resource 2ef5***095 with resource link /resources/compute/2ef5***095

  • So we now know that there is no record in the database for

 /resources/compute/2ef5***095

  • This is the reason for the 404 exception seen in the log snippet above.
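The root cause boils down to a simple consistency check between the two databases, sketched here with shortened placeholder ids (not the real values):

```python
# Consistency check from the investigation: the ONBOARDED dep_resource id
# should appear as a compute_state document_self_link. Ids are placeholders.
onboarded_dep_resource_id = "2ef5-095"  # id of the ONBOARDED dep_resource row
compute_state_links = ["/resources/compute/284ac-aaca9"]  # only the DISCOVERED doc exists

expected = f"/resources/compute/{onboarded_dep_resource_id}"
if expected not in compute_state_links:
    # This missing document is exactly what the provisioning service
    # reported as "Service not found ... 404"
    print(f"404 expected: no compute document for {expected}")
```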



 

Remediation


When onboarding a machine, a duplicate entry for the machine is created in the resource center, with one entry in the 'discovered' state and one in the 'onboarded' state. This is a regression caused by a fix for an onboarding failure affecting machines with legacy IDs in the provisioning service inventory.


Applying KB 88162 should fix this issue. Once done, we should be able to onboard the machines as expected, with no duplicate records created.


 

Note: Ensure a snapshot is taken from vRSLCM before implementing the patch mentioned in the KB article.


After the fix is implemented, offboard the machines onboarded previously; for us the count was very low.



 

This issue has been documented in the release notes as well.



 

