vRA fails to deploy from vRSLCM if the secondary or tertiary DNS servers cannot resolve hostnames
I have tried to install vRA 8.x quite a number of times and was never successful because of one simple problem. Let me explain what it was.
Every attempt failed at the same point, where it was installing client-secrets:
Release "client-secrets" does not exist. Installing it now.
Error: Job failed: BackoffLimitExceeded
helm failed to upgrade 'client-secrets' in namespace 'prelude'
Note: The above snippet is taken from deploy.log
When we check csp-fixture-job-XXXX.log under /services-logs/csp-clients-fixture, we see that curl failed because it could not resolve the vRA hostname:
Logging in
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: premvra.prem.com
But before starting the install we had cross-checked that nslookup against my DNS server worked absolutely fine, so why this problem?
My lab consists of the following nodes:
premvra, which is our vRA node
premidm, which is our vIDM node
premlcm, which is our vRSLCM node
When you trigger the Easy Installer, it asks for network information, as you can see in the screenshot below.
The first DNS server in my case is my Windows Active Directory which has forward and reverse lookup zones configured and contains all the DNS records for premlcm, premidm and premvra as well as the rest of the VMware environment.
The second DNS server, 10.yy.yy.yy, is our router, which also functions as a DNS server for all other systems outside my lab environment. This router cannot resolve anything within the DNS zone hosted on the MS DNS server, but it is reachable from all systems.
While the vRA installation is in progress, at the stage where client-secrets is being installed, a number of POST calls are made in the background for registrations.
From my research, it looks like name resolution is load-balanced round-robin across the DNS servers when more than one is configured. In my case, the servers (premlcm, premvra, and premidm) can only be resolved through my primary DNS.
If a POST call happens to use the secondary DNS for name resolution, it fails and throws the exception below:
2020-04-28 10:03:41.430+0000 ERROR 43 --- [or-http-epoll-1] c.v.i.common.util.HealthUtilComponent : premidm.prem.com: Name or service not known
java.net.UnknownHostException: premidm.prem.com: Name or service not known
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[na:1.8.0_241]
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
    |_ checkpoint ⇢ Request to POST https://premidm.prem.com/SAAS/API/1.0/oauth2/token?grant_type=client_credentials [DefaultWebClient]
Stack trace:
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[na:1.8.0_241]
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) ~[na:1.8.0_241]
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) ~[na:1.8.0_241]
        at java.net.InetAddress.getAllByName0(InetAddress.java:1277) ~[na:1.8.0_241]
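To make the round-robin explanation concrete, here is a toy model in Python. It is only a sketch of the behaviour I suspect, not the actual resolver used inside the vRA appliance: the two sets standing in for the DNS servers and the resolve_round_robin function are made up for illustration, and a real resolver may also retry or time out differently.

```python
from itertools import cycle

# Hypothetical model of the lab: the set of names each DNS server can answer.
AD_DNS = {"premlcm.prem.com", "premidm.prem.com", "premvra.prem.com"}
ROUTER_DNS = set()  # the router knows none of the lab records


def resolve_round_robin(names, servers):
    """Send each lookup to the next server in turn; return the names that failed."""
    rotation = cycle(servers)
    failed = []
    for name in names:
        zone = next(rotation)  # strict alternation, no retry on another server
        if name not in zone:
            failed.append(name)
    return failed


lookups = [
    "premvra.prem.com",
    "premidm.prem.com",
    "premvra.prem.com",
    "premidm.prem.com",
]

# Two servers configured: every second lookup lands on the router and fails.
print(resolve_round_robin(lookups, [AD_DNS, ROUTER_DNS]))

# Only the Active Directory DNS configured: nothing fails.
print(resolve_round_robin(lookups, [AD_DNS]))
```

In this model a manual nslookup that happens to hit the first server succeeds, while a later call made by the installer can still land on the router and fail, which matches what I saw.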
After scrapping the existing deployment, I started the installation again with only one DNS server, the one that has entries for all the nodes and can resolve them, and finally the installation was successful.
This scenario can occur in a lab where not all DNS servers are configured for name resolution, or even in production environments where DNS replication has issues.
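A pre-check along the following lines would have saved me several attempts: ask every DNS server you plan to enter in the installer, one at a time, whether it can resolve every node. This is a minimal sketch using only the Python standard library. The function names are my own, it only checks A records over UDP, and it does not follow truncated responses or verify the response ID, so treat it as a starting point.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN


def parse_response(packet):
    """Return (rcode, answer_count) read from the DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack(">HHHHHH", packet[:12])
    return flags & 0x000F, ancount


def server_resolves(hostname, server, timeout=3.0):
    """True if server answers NOERROR with at least one record for hostname."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, 53))
            packet, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable servers
            return False
    if len(packet) < 12:
        return False
    rcode, ancount = parse_response(packet)
    return rcode == 0 and ancount > 0


def check_all(hostnames, servers):
    """Return the (server, hostname) pairs that could not be resolved."""
    return [(s, h) for s in servers for h in hostnames if not server_resolves(h, s)]
```

Calling check_all with the three node FQDNs and the list of DNS server addresses should return an empty list before you start the install. Any pair it returns names a server that should either get the missing records or be left out of the installer's DNS list.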
After numerous attempts, it was so heartening to see this screen where it says "INSTALLED"