NSX-T 3.1 – Backup & Restore_Production DR Experience – Part3

In this last post of the NSX-T backup & restore series, we will perform the DR test and the post-restore tasks. In case you have missed them, here are the links to the previous posts.

NSX-T 3.1 – Backup & Restore_Production DR Experience – Part1
NSX-T 3.1 – Backup & Restore_Production DR Experience – Part2

We move to the next step by deploying a new NSX-T Manager.

Deploy the new NSX-T Manager on the DR site. Make sure to deploy the same node for which the last successful backup was taken on the primary site. In my case, the last successful backup was for the ‘nsx78’ node. I have reserved the new IP address (172.16.31.178) for this node. Remember, we do not need to create a DNS record for the new node at this point. I am deploying the new NSX-T Manager at Site-B (Singapore). Review the deployment parameters here…

Next, it’s time to shut down the primary site NSX-T Managers. In an actual DR scenario, the NSX-T Managers on the primary site will already be down, and that’s when we start performing the restore process.

Next, we need to change the DNS records for the primary site NSX-T Managers to the new IPs on the secondary site. In a production environment, you might have a different subnet on the DR site; I have the same subnet with different IPs.

Login to the DC and update the DNS records.

Existing IP’s on DNS server.

Update to new IP’s.

All 4 DNS records have been updated.

Next, we need to configure the newly deployed NSX-T Manager for FQDN using the API call.

https://172.16.31.178/api/v1/configs/management
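If you prefer the command line over a REST client, a rough curl equivalent looks like this (a hedged sketch: it assumes the admin account, a self-signed certificate — hence the -k flag — and that the _revision value in the PUT body matches whatever the GET returns):

curl -k -u admin https://172.16.31.178/api/v1/configs/management

curl -k -u admin -X PUT https://172.16.31.178/api/v1/configs/management -H "Content-Type: application/json" -d '{"publish_fqdns": true, "_revision": 0}'

The detailed Postman walkthrough of the same call is covered in Part 2.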

Let’s start the restore process now.

Configure the SFTP server on the newly deployed NSX-T Manager and check for the recent backups discovered by the appliance.

Note that we get the EULA prompt. We see this prompt only on a newly deployed NSX-T Manager.

Navigate to Backup config and enter sftp server details.

Note that the newly deployed appliance now discovers all successful backups.

Next,

Upon selecting the recent backup, you should see the ‘Restore’ option enabled. If you do not see the ‘Restore’ option, you will have to re-check all the steps provided in this article.

Quick tip: if you do not see the Restore option, scroll down and select the backup that lists the Appliance FQDN of your newly deployed appliance. 😊 We found this behavior while troubleshooting at the customer location, and it does make sense. That is why I mentioned earlier in this blog to deploy the new appliance as the node for which you see the last successful backup.

Note that we do not see any Tier-0 or segments in the newly deployed appliance.

Let’s begin the restore process,

Select the recent backup and restore…
You should see a prompt…

Restoring NSX Managers

If you are running NSX on Virtual Distributed Switch (VDS) and are attempting to perform a Restore operation to a backup file that contains N-VDS switches, then the restore operation might fail.

Please read the following steps carefully before you proceed:

  • Step 1: Power Off. Power off all NSX Manager appliances that may be running from the earlier deployment.
  • Step 2: Restore NSX Manager appliances. Restore the NSX Manager appliances from the backup. Note: During the restore process, your deployment will be in read-only mode and the UI will be inaccessible. Automatic backups will be disabled.
  • Step 3: Go to the Backup & Restore page. After the restore process ends, log in to your NSX Manager appliance and visit the Backup & Restore page to continue.

NOTE: If the restore fails, you must install a new NSX Manager appliance and try the restore process again.

Review and click Continue.
Restore process begins…

The restore process should start, and you will lose connectivity to the new NSX-T Manager after 10-15 minutes.
Once the restore has started, you will have to log back in to the new NSX-T Manager with your old password from the primary site.

It takes a significant amount of time to come back online. Once it is up, log in to the NSX-T Manager and navigate to the backup page again. You should see a prompt.

“It appears that you have a NSX Management Cluster configuration in your NSX backup. There were 3 NSX Manager VMs in a NSX Management Cluster at the time of a backup. Currently, there is only 1 NSX Manager VM(s) as reference below. Please navigate to the NSX Manager deployment wizard (System > Overview) and add 2 more NSX Manager VMs to form a NSX Management Cluster of 3 NSX Manager VMs.”

Since we had 3 NSX-T Managers on the primary site, the restore is prompting us to install the remaining 2 NSX-T Managers before it can proceed.

Note: After you finish deploying the NSX Management Cluster, you will need to return to this screen and click RESUME to finish the NSX restore workflow.

Before you install the additional NSX-T Managers, make sure to set the VIP of the cluster to the new IP address…

Change the VIP from 172.16.31.77 to 172.16.31.177…
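If you prefer to script this step, the cluster VIP can also be changed through the API. Here is a hedged sketch using curl against the new manager node (it assumes the standard cluster virtual IP API and the admin credentials; I used the UI for this during the actual activity):

curl -k -u admin -X POST "https://172.16.31.178/api/v1/cluster/api-virtual-ip?action=set_virtual_ip&ip_address=172.16.31.177"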

It takes around 10 minutes for the services to come back up and for the new VIP to become accessible.

Deploy the 2nd NSX-T Manager once the VIP is accessible, followed by the 3rd NSX-T Manager. Please note that the cluster status shows ‘Degraded’ during the entire process.

Login to the VIP once it is up and running.

Next, we deploy the additional NSX-T Managers.

On the next page, the compute managers were not listed. I cancelled the wizard and checked the compute managers; the compute manager shows as ‘Down’.

Click on ‘Down’ to check the error message.

“Plugin for compute manager is not running.”

The ‘Resolve’ option was grayed out. It looks like something is wrong with the vCenter Server extensions. Anyway, since it is a lab environment, I did not worry much about deploying the additional NSX-T Managers. Notably, we did not receive this error while doing the production DR at the customer site.

Let’s go back to backup and resume the operation.

The restore process moved to 62%…

Next, it prompts us to check the CM/VC (Compute Manager / vCenter) connectivity. We move on by clicking Resume here.

Then the restore process stops again and prompts us to check the connectivity of all listed fabric nodes. We can ignore this, since it says ‘These nodes might eventually discover the NSX Manager’.

Restore process continues to 96%

And finally restore was successful. 😊

Log out and log back in to the NSX-T VIP. You should see a message.

However, the work does not stop here. We need to make sure that all nodes are connected to the new NSX-T Manager. You can also deploy the additional NSX-T Managers at this stage if needed; I skipped it due to compute capacity in the lab.

A couple of tasks remain after the restore process…
Navigating to the host transport nodes, a couple of errors show up, and at the same time there is an option to ‘Resolve’ them.

One of the compute managers shows ‘Down’. Let’s try the Resolve option.

It’s UP now.

Dubai-vCenter transport nodes are UP.

Singapore-vCenter transport nodes are UP.

In an actual DR scenario, the primary site (Dubai) would show as down, since the site itself went down; that is what triggers this restore process in the first place.

Edge transport node shows ‘Failed’
Host configuration: Failed to send the HostConfig message. 

Surprisingly, putting the edge node into ‘NSX Maintenance Mode’ and exiting from it resolved the error.

We had to reboot all Edge VMs at the customer location to resolve this issue.

Let’s check the BGP routes on the TOR.

All looks good here.

The data plane was not affected at all during this entire activity. All workload VMs had connectivity to the TOR.

Note: vMotion of a VM, a snapshot revert, or any network-related change to a VM will result in that VM losing connectivity until the NSX-T Managers are back up and running.

We had already verified routing. For testing purposes, we moved a couple of test VMs from the primary site to the DR site, and it all worked; all newly connected VMs were able to reach the TOR. To move the workload from Site A to Site B, the customer can opt for SRM (Site Recovery Manager) or any third-party VMware-compatible product.

We have successfully restored the NSX-T Manager cluster on the DR site. The entire activity took 2.5 hours at the customer site. You will very likely face multiple issues while performing the DR, and it is difficult to get it to succeed on the first run. At the same time, you really don’t want to get used to this DR process. 😀

Thank you for reading the post. I hope it has added some value toward performing a successful DR. Good luck. Leave a comment if you face any issues and I should be able to give you some inputs.

Are you looking out for a lab to practice VMware products…? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in box below to receive notification on my new blogs.

NSX-T 3.1 – Backup & Restore_Production DR Experience – Part2

In the previous post, we validated the base NSX-T environment that is set up to perform disaster recovery of NSX-T. Here is the link to the previous post.

NSX-T 3.1 – Backup & Restore_Production DR Experience – Part1

There are a couple of validation steps from a DR point of view. Let’s start with the DR validation phase. I have the bullet points in an Excel sheet and will discuss them one by one with screen captures.

We need to make sure that ‘publish_fqdns’ is set to ‘true’ on all NSX-T Managers using the API call.

Here is the GET API call, followed by the PUT, for all 3 NSX-T Managers…

https://172.16.31.78/api/v1/configs/management
https://172.16.31.79/api/v1/configs/management
https://172.16.31.85/api/v1/configs/management

Before we go there let’s have a look at the backup details.

Note that the Appliance FQDN or IP lists an IP address of the appliance.

Let’s go back to the API call and run it. I am using postman utility for API calls.

GET https://172.16.31.78/api/v1/configs/management

Paste the above URL, change the Authorization type to ‘Basic Auth’, enter the credentials, and send. You might get an SSL error if SSL verification is not disabled already.

Click on Settings and disable SSL verification here…

and send the call again.

Note that the ‘publish_fqdns’ value is false. We need to change this to ‘true’.

Change GET to PUT, copy the 4 lines of the response body, click on ‘Body’, change the radio button to ‘raw’, and select JSON from the drop-down box at the end.

Paste the copied 4 lines, change the value to ‘true’, and send it again.
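For reference, the body you paste is just a small JSON document. It should look roughly like this, with only the publish_fqdns value flipped to true (a sketch from my environment; the _revision number will be whatever your GET returned):

{
  "publish_fqdns": true,
  "_revision": 0
}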

Make sure to see the status as 200 OK.

Use the GET command again to verify…

Make sure that you see this value set to ‘true’ on all 3 NSX-T Managers.

Let’s go back to NSX-T and run the backup again. Note that the ‘Appliance FQDN’ now lists the FQDN instead of an IP address.

All good till here. Move to next step in the excel sheet.

The next step is to verify that all transport nodes in the environment are using the FQDN instead of an IP address.

Run ‘get controller’ command on each edge node. Note that the “Controller FQDN” lists the actual FQDN instead of an IP.

Next,

All good here. Move to next action item…

Change the TTL value of all NSX-T Manager DNS records from 1 hour to 5 minutes in the DNS record properties.

We have completed all validation steps for DR test. We now move to actual DR test in the next part of this blog series. Thank You.

Are you looking out for a lab to practice VMware products…? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in box below to receive notification on my new blogs.

NSX-T 3.1 – Backup & Restore_Production DR Experience – Part1

Hello techies, this post focuses on the NSX-T disaster recovery of a production environment that I recently performed for one of our customers. The post talks about my own experience, and the procedure may differ as per your NSX-T design.

Here is the official VMware documentation which was referred to while doing the activity.

https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-A0B3667C-FB7D-413F-816D-019BFAD81AC5.html

Additionally, the following document is a must-read before you plan your DR.

https://communities.vmware.com/t5/VMware-NSX-Documents/NSX-T-Multisite/ta-p/2771370

For the screenshots in this post, I have recreated the environment in my lab. All captures in this post are from the lab that I created for testing purposes.

To set the right expectations, this DR was performed to back up and restore the management plane of NSX-T, not the data plane. Let me explain the existing environment to understand the reason for doing management plane recovery only.

  • NSX-T multisite environment.
  • Both sites are active and configured with their respective BGP routing to the local top-of-rack (TOR) switches.
  • The primary site hosts the NSX-T Manager cluster.
  • Backup of the NSX-T Manager is configured to an SFTP server which sits at the DR site.
  • Both sites have a vCenter, Edge VMs, and ESXi nodes.
  • The inter-site link has jumbo frames enabled.
  • Both sites host active workloads; load balancing, VPN, and micro-segmentation are also in place.
  • A 3rd-party solution is already configured to migrate / restart the VMs on the DR site in case of disaster.

Since both sites are independent and have separate Edge VMs and routing in place, only the management plane needs to be restored. The 3rd-party backup solution will restore the VMs on the DR site in case of disaster.

Important note: the data plane (i.e. host transport nodes, edge transport nodes…) is not affected even if you lose the NSX-T Manager cluster for any reason. Routing and connectivity to all workload VMs keep working perfectly fine. In short, during the loss of the management plane, the data plane keeps running as long as you do not add any new workload. Also, keep in mind that a vMotion of any VM connected to the NSX-T overlay network will result in that VM losing connectivity. So, it is a good idea to disable DRS until you bring the NSX-T Manager cluster back up on the DR site.

The other disadvantage is that you cannot make any configuration changes in NSX-T, since the UI itself is not available.

Here are some additional bullet points…

  • You must restore to new appliances running the same version of NSX-T Data Center as the appliances that were backed up.
  • If you are using an NSX Manager or Global Manager IP address to restore, you must use the same IP address as in the backup.
  • If you are using an NSX Manager or Global Manager FQDN to restore, you must use the same FQDN as in the backup. Note that only lowercase FQDN is supported for backup and restore.

In most cases, FQDN is configured in the environment, which involves additional steps while restoring the backup. We will discuss this in more detail later; for now, let’s focus on configuring the backup.

Check my following post for configuring the backup for NSX-T env.

NSX-T Backup Configuration on VMware Photon OS

To begin this post, let’s have a look at the existing env architecture…

List of servers in the env with IP’s.

Here is the screen capture from the env…

Site A vCenter – Dubai

Site B vCenter – Singapore

As I said earlier, we are going to perform management plane recovery and not data plane recovery, hence I did not configure an edge, Tier-0, etc. on Site-B. However, the customer environment had another edge cluster for Site B, and so its own Tier-0 (as shown in the above diagram).

Stable NSX-T manager cluster, VIP assigned to 172.16.31.78

Dubai vCenter host transport nodes

Singapore vCenter host transport nodes

Just a single Edge Transport node deployed at primary site.

BGP Neighbors Configuration…

Note the source addresses. We should see them on TOR as neighbors.

Let’s have a look at the TOR…

Established 172.27.11.2 & 172.27.12.2 neighbors.

BGP routes on the TOR.
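For reference, these captures come from the VyOS operational mode. The commands I use to display the learned BGP routes are below (a hedged sketch; the output format may vary slightly between VyOS versions):

show ip bgp
show ip route bgp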

Let’s create a new segment and see if the new route appears on the TOR.

We should see 10.2.98.X BGP route on the TOR.

Perfect. We have everything in place to perform the DR test and check the connectivity once we bring the NSX-T manager cluster UP in the DR site.

That’s it for this post. We will discuss further process in the next part of this blog series.

Are you looking out for a lab to practice VMware products…? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in box below to receive notification on my new blogs.

NSX-T Backup Configuration on VMware Photon OS

I want to keep this one short, since it is part of parent topic here…

“NSX-T 3.1 – Backup & Restore_Production DR Experience – Part1”

For NSX-T 3.1, the following are the supported operating systems as per the VMware documentation; however, it also says that other software versions might work. SFTP is the only supported protocol for now.

I remember having a discussion with someone about using VMware Photon OS for NSX-T backups. It is Linux-based and lightweight, and does not consume many resources. It is available for download at the following location…

https://github.com/vmware/photon/wiki/Downloading-Photon-OS

Get the Minimal ISO…

Installation is straightforward. Just mount the ISO on a VM and follow the instructions to install it. Then we run a couple of commands to set up the VMware Photon OS.

Here is the screen capture of the commands that were run to set up the SFTP server.

Add the sftp user…
root@VirtualRove [ ~ ]# useradd siteA

Create backup directory…
root@VirtualRove [ ~ ]# mkdir /home/nsxbkp-siteA/

Add a group…
root@VirtualRove [ ~ ]# groupadd bkpadmin

Add a user in the group…
root@VirtualRove [ ~ ]# groupmems -g bkpadmin -a siteA

Set the password for user
root@VirtualRove [ ~ ]# passwd siteA

New password:
Retype new password:

passwd: password updated successfully

The chown command changes user ownership of a file, directory, or link in Linux
chown  USER:[GROUP NAME] [Directory Path]
root@VirtualRove [ ~ ]# chown siteA:bkpadmin /home/nsxbkp-siteA/
root@VirtualRove [ ~ ]#

And that completes the configuration on Photon OS. We are good to configure this directory as backup directory in NSX-T.
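Before pointing NSX-T at it, you can quickly verify the SFTP login and the directory from any other Linux machine (a hedged sketch; replace the placeholder with your Photon OS address):

sftp siteA@<photon-os-ip>
sftp> cd /home/nsxbkp-siteA
sftp> ls
sftp> bye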

A couple of things…
Photon OS does not respond to ICMP ping by default. You must run the following commands on the console to enable ping.
iptables -A OUTPUT -p icmp -j ACCEPT
iptables -A INPUT -p icmp -j ACCEPT

Also, the root account is not permitted to log in over SSH by default. You need to edit the ‘sshd_config’ file located at ‘/etc/ssh/sshd_config’.
You can use any editor to edit this file…

vim /etc/ssh/sshd_config

Scroll to the end of the file and change the following value to enable SSH for the root account…

Change the value from ‘no’ to ‘yes’ and save the file. You should then be able to SSH to the Photon OS.
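For reference, the directive in question is PermitRootLogin; here is a hedged sketch of the relevant line, assuming the stock OpenSSH configuration shipped with Photon OS:

PermitRootLogin yes

If the change does not take effect immediately, restart the SSH daemon:

systemctl restart sshd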

Let’s move to NSX-T side configuration.
Login to NSX-T VIP and navigate to System> Backup & Restore…

Click on Edit for SFTP server and fill in all required information.
FQDN/IP: your SFTP server
Port: 22
Path: the backup directory we created in the steps above
Username, password & passphrase

Save

It will prompt you to accept the SFTP server’s SSH fingerprint.
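If you want to cross-check the fingerprint that NSX-T shows before accepting it, you can generate it yourself from any Linux machine (a hedged sketch; the key type NSX-T expects may differ in your environment, and the placeholder needs to be replaced with your SFTP server address):

ssh-keyscan -t ecdsa <sftp-server-ip> > /tmp/sftp_hostkey
ssh-keygen -lf /tmp/sftp_hostkey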

Click on ‘Start Backup’ once you save it.

You should see successful backup listed in the UI.

Additionally, you can use WinSCP to log in to the Photon OS and check the backup directory. You should see the recent backup folders.

You also want to set an interval to back up the NSX-T configuration on a schedule.
Click on ‘Edit’ on the NSX-T UI backup page and set an interval.

I preferred a daily backup, so I set it to a 24-hour interval.

Check your manager cluster to make sure it’s stable.

And take a backup again manually.

That’s it for this post.

We have successfully configured SFTP server for our NSX-T environment. We will use this backup to restore it at DR site in case of site failure or in case of NSX-T manager failure for any reason.

Are you looking out for a lab to practice VMware products…? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in box below to receive notification on my new blogs.

How to get free VMware Eval Licenses

Greetings VMware Techies,

We often get into situations where we need to apply VMware licenses to test VMware products, and many people are not aware that VMware provides a 60-day trial period for most of its products. VMware vCenter Server & ESXi get the default trial period as soon as you install them. However, for some products you have to ask VMware to provide an evaluation license key to test the product. It is easy and simple to get trial licenses for VMware products.

Check the following link to know the products that are available for trial period license.

https://www.vmware.com/try-vmware.html

Further, click on the (+) sign to see the available products…

Let’s assume that you want to test the VMware NSX-T Data Center product…
Click on Download Free Trial >>
Next, enter the account information if you have an account. If not, then you get an option to create one…

Once you log in, click on ‘License & Download’ and click on ‘Register’.

Fill out all the details and click on ‘START FREE TRIAL’

The home screen will show the following message…

And you should see the eval license key under ‘License & Download’ with expiration date.

On the same page, you get an option to download the ova / iso file for the product.

That was easy. 😊
Check out other products available on try-vmware page and enjoy the product evaluation.

Are you looking out for a lab to practice VMware products..? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in box below to receive notification on my new blogs.

VMware vCenter Error – ‘no healthy upstream’


You might encounter the ‘no healthy upstream’ error message on a newly installed vCenter. This is caused by some unexpected parameters while deploying vCenter. I was not able to find the exact root cause for this error; however, I knew the resolution from a discussion with technical experts a while back.

To start with, here is how it looks on the vCenter when you try to access web client.

You will be able to access vCenter Server Appliance Management Interface at port 5480. Check the services here…

All services show as healthy. In fact, the summary page shows the health status as Good.

Everything looks fine, but you cannot access the web client. I tried restarting the vCenter Server multiple times with no luck, and tried restarting all services from the management interface. Nothing worked.

Solution is to change network settings from vCenter Server Appliance Management Interface.

Click on networking & expand nic0.
Notice that the IP address shows as DHCP even if it was given as static.

Click on Edit at top right corner to edit the network settings & select your nic.

Expand the nic0 here and notice that IPv4 shows automatic.
Change this to manual.

Provide credentials on the next page.

Acknowledge the change. Take a backup of vCenter if necessary. It also recommends unregistering extensions before you save this.

Also, check the next steps after settings are saved successfully.

 

Click on Finish and you should see the progress.

Access web client once this is finished. You should be able to access it.

Go back to the vCenter Server Appliance Management Interface to verify. You should see the IP as static.

The issue has been resolved. This is definitely because of some unexpected parameters while deploying the vCenter, since this error does not show up for every deployment. Anyway, I wanted to write a small blog on it to help techies resolve the issue in case anyone sees this error.
Thank you for reading.

Are you looking out for a lab to practice VMware products..? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in box below to receive notification on my new blogs.

VMware vCloud Foundation 4.2.1 Step by Step Phase3 – Deployment

Welcome back. We have covered the background work as well as the deployment parameter sheet in the earlier posts. If you missed them, you can find them here…

VMware vCloud Foundation 4.2.1 Step by Step Phase1 – Preparation
VMware vCloud Foundation 4.2.1 Step by Step Phase2 – Cloud Builder & Deployment Parameters

It’s time to start the actual deployment. We will resolve the issues as we move on.
Let’s upload the “Deployment Parameter” sheet to Cloud Builder and begin the deployment.

Upload the file and click Next. I got an error here.

Bad Request: Invalid input
DNS Domain must match

It turned out to be an additional space in the DNS Zone Name here.

This was corrected. I updated the sheet and clicked NEXT.

All good. Validation process started.

To understand and troubleshoot the issues / failures that we might face while deploying VCF, keep an eye on the vcf-bringup.log file. The location of the file is ‘/opt/vmware/bringup/logs/’ on the Cloud Builder. This file gives you a live update of the deployment and any errors that caused the deployment to fail. Use ‘tail -f vcf-bringup.log’ to get the latest updates on the deployment. PFB.

Let’s continue with the deployment…

Next Error.

“Error connecting to ESXi host esxi01. SSL Certificate common name doesn’t match ESXi FQDN”

Look at the “vcf-bringup.log” file.

This is because the certificate for an ESXi host gets generated at installation time with the default name, and is not regenerated when we rename the host. You can check the hostname in the certificate. Log in to the ESXi host > Manage > Security & Users > Certificates.

You can see here that even though the hostname at the top shows ‘esxi01.virtualrove.local’, the CN in the certificate is still ‘localhost.localdomain’. We must change this to continue.

SSH to the ESXi server and run the following commands to change the hostname and FQDN and to generate new certificates.

esxcli system hostname set -H=esxi03
esxcli system hostname set -f=esxi03.virtualrove.local
cd /etc/vmware/ssl
/sbin/generate-certificates
/etc/init.d/hostd restart && /etc/init.d/vpxa restart
Reboot

You need to do this for all hosts by replacing the hostname in the commands for each ESXi respectively; a rough loop for this is sketched below.
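A hypothetical helper to run the same steps against every host in one go (an untested sketch: it assumes SSH is enabled on each host, that you are happy to enter the root password when prompted, and it uses my lab hostnames):

for h in esxi01 esxi02 esxi03 esxi04; do
  ssh root@${h}.virtualrove.local "esxcli system hostname set -H=${h}; \
    esxcli system hostname set -f=${h}.virtualrove.local; \
    cd /etc/vmware/ssl && /sbin/generate-certificates; \
    /etc/init.d/hostd restart && /etc/init.d/vpxa restart; reboot"
done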

Verify the hostname in the cert once server boots up.

Next, hit Retry on the Cloud Builder, and we should be good.

I am not sure why this showed up; I was able to reach these IPs from the Cloud Builder.

 

Anyway, this was a warning, and it can be ignored.

Next one was with host tep and edge tep.

VM Kernel ping from IP ‘172.27.13.2’ (‘NSXT_EDGE_TEP’) from host ‘esxi01.virtualrove.local’ to IP ” (‘NSXT_HOST_OVERLAY’) on host ‘esxi02.virtualrove.local’ failed
VM Kernel ping from IP ” (‘NSXT_HOST_OVERLAY’) from host ‘esxi01.virtualrove.local’ to IP ‘172.27.13.3’ (‘NSXT_EDGE_TEP’) on host ‘esxi02.virtualrove.local’ failed
VM Kernel ping from IP ” (‘NSXT_HOST_OVERLAY’) from host ‘esxi02.virtualrove.local’ to IP ‘172.27.13.2’ (‘NSXT_EDGE_TEP’) on host ‘esxi01.virtualrove.local’ failed
VM Kernel ping from IP ‘172.27.13.3’ (‘NSXT_EDGE_TEP’) from host ‘esxi02.virtualrove.local’ to IP ” (‘NSXT_HOST_OVERLAY’) on host ‘esxi01.virtualrove.local’ failed

VM Kernel ping from IP ‘172.27.13.2’ (‘NSXT_EDGE_TEP’) from host ‘esxi01.virtualrove.local’ to IP ‘169.254.50.254’ (‘NSXT_HOST_OVERLAY’) on host ‘esxi03.virtualrove.local’ failed
VM Kernel ping from IP ” (‘NSXT_HOST_OVERLAY’) from host ‘esxi01.virtualrove.local’ to IP ‘172.27.13.4’ (‘NSXT_EDGE_TEP’) on host ‘esxi03.virtualrove.local’ failed
VM Kernel ping from IP ‘169.254.50.254’ (‘NSXT_HOST_OVERLAY’) from host ‘esxi03.virtualrove.local’ to IP ‘172.27.13.2’ (‘NSXT_EDGE_TEP’) on host ‘esxi01.virtualrove.local’ failed
VM Kernel ping from IP ‘172.27.13.4’ (‘NSXT_EDGE_TEP’) from host ‘esxi03.virtualrove.local’ to IP ” (‘NSXT_HOST_OVERLAY’) on host ‘esxi01.virtualrove.local’ failed

First of all, I failed to understand the APIPA 169.254.x.x addresses. We had specified VLAN 1634 for the host TEPs, so it should have picked an IP address from 172.16.34.x. This VLAN was already in place on the TOR, and I was able to ping its gateway from the CB. I took a chance here and ignored it, since it was a warning.

Next, got warnings for NTP.

Host cb.virtaulrove.local is not currently synchronising time with NTP Server dc.virtaulrove.local
NTP Server 172.16.31.110 and host cb.virtaulrove.local time drift is not below 30 seconds
Host esxi01.virtaulrove.local is not currently synchronising time with NTP Server dc.virtaulrove.local

For the ESXi hosts, restarting ntpd.service resolved the issue.
For the CB, I had to sync the time manually.

Steps to manually sync NTP…

ntpq -p
systemctl stop ntpd.service
ntpdate 172.16.31.110
Wait for a min and again run this
ntpdate 172.16.31.110
systemctl start ntpd.service
systemctl restart ntpd.service
ntpq -p

Verify the offset again. It must be close to 0.

Next, I locked out the root account of the Cloud Builder VM due to multiple logon failures. 😊

This is common, since the passwords are complex and sometimes you have to type them manually on the console; on top of that, you don’t even see (in Linux) what you are typing.
Anyway, it’s a standard process to reset the root account password for Photon OS. The same applies to vCenter. Check the small write-up on it at the below link.

Next, back in the CB, click on “Acknowledge” if you want to ignore the warning.

Next, You will get this window once you resolve all errors.

Click on “Deploy SDDC”.

Important note: once you click on “Deploy SDDC”, the bring-up process first builds vSAN on the 1st ESXi server from the list and then deploys vCenter on that host. If bring-up fails for any reason and you figure out that one of the parameters in the Excel sheet is incorrect, then it is a tedious job to change a parameter that has already been uploaded to the CB. You have to use jsongenerator commands to replace the existing Excel sheet in the CB. I have not come across such a scenario yet; however, there is a good write-up on it from a good friend of mine.

Retry Failed Bringup with Modified Input Spec in VCF

So, make sure to fill in all the correct details in the “Deployment Parameter” sheet. 😊

Let the game begin…

Again, keep an eye on vcf-bringup.log file. The location of the file is ‘/opt/vmware/bringup/logs/’ in cloud builder. Use ‘tail -f vcf-bringup.log’ to get the latest update on deployment.

The installation starts. Good luck. Be prepared to see unexpected errors; don’t lose hope, as there might be several errors before the deployment completes. Mine took a week to deploy the first time I did it.

Bring-up process started. All looks good here. Status as “Success”. Let’s keep watching.

All looks good here. Up to this point, I had vCenter in place and it was deploying the first NSX-T OVA.

Looks great.

Glance at the NSX-T env.

Note that the host TEP IPs are from VLAN 1634, even though the CB validation stage was picking up APIPA addresses.

NSX-T was fine. The bring-up then moved further into the SDDC deployment.

Woo, Bring-up moved to post deployment task.

Moved to AVN (Application Virtual Networking). I am expecting some errors here.

Failed.

“A problem has occurred on the server. Please retry or contact the service provider and provide the reference token. Unable to create logical tier-1 gateway (0)”

This was an easy one. vcf-bringup.log showed that it was due to a missing DNS record for the edge VM. I created the DNS record and hit Retry.

Next one,

“Failed to validate BGP Neighbor Perring Status for edge node 172.16.31.125”

Let’s look at the log file.

Time to check NSX-T env.

The Tier-0 gateway interfaces look good as per our deployment parameters.

However, BGP Neighbors are down.

This was expected, since we haven’t done the BGP configuration on the TOR (VyOS) yet. Let’s get into VyOS and run some commands.

set protocols bgp 65001 parameters router-id 172.27.11.253
This command specifies the router ID. If the router ID is not specified, the highest interface IP address is used.

set protocols bgp 65001 neighbor 172.27.11.2 update-source eth4
This specifies the IPv4 source address to use for the BGP session to this neighbor; it may be specified as either an IPv4 address directly or as an interface name.

set protocols bgp 65001 neighbor 172.27.11.2 remote-as '65003'
This command creates a new neighbor with the given remote AS number. The neighbor address can be an IPv4 address, an IPv6 address, or an interface to use for the connection. The command is applicable for a peer and a peer group.

set protocols bgp 65001 neighbor 172.27.11.3 remote-as '65003'
set protocols bgp 65001 neighbor 172.27.11.2 password VMw@re1!
set protocols bgp 65001 neighbor 172.27.11.3 password VMw@re1!

commit
save

The TOR configuration is done for VLAN 2711. Let’s refresh and check the BGP status in NSX-T.

Looks good.

The same configuration is to be performed for the 2nd VLAN. I am using the same VyOS for both VLANs since it’s a lab environment. Usually, you will have 2 TORs, with each BGP peering VLAN configured on its respective TOR for redundancy.

set protocols bgp 65001 parameters router-id 172.27.12.253
set protocols bgp 65001 neighbor 172.27.12.2 update-source eth5
set protocols bgp 65001 neighbor 172.27.12.2 remote-as '65003'
set protocols bgp 65001 neighbor 172.27.12.3 remote-as '65003'
set protocols bgp 65001 neighbor 172.27.12.2 password VMw@re1!
set protocols bgp 65001 neighbor 172.27.12.3 password VMw@re1!
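Before going back to the Cloud Builder, you can optionally confirm the sessions from the VyOS side as well (a hedged sketch from operational mode; the neighbors should show as Established, or show a received prefix count, before you hit Retry):

show ip bgp summary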

Both BGP Neighbors are successful.

Hit Retry on CB and it should pass that phase.

Next Error on Cloud Builder: ‘Failed to validate BGP route distribution.’

Log File.

At this stage, routing has been configured in your NSX-T environment, both edges have been deployed, and BGP peering has been established. If you check the BGP peer information on the edge as well as on the VyOS router, it shows ‘Established’, and the routes from the NSX-T environment even appear on your VyOS router. That means route redistribution from NSX to VyOS works fine, and this error means that there are no routes advertised from VyOS (the TOR) to the NSX environment. Let’s get into VyOS and run some commands.

set protocols bgp 65001 address-family ipv4-unicast network 172.16.31.0/24
set protocols bgp 65001 address-family ipv4-unicast network 172.16.32.0/24
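To confirm that VyOS is now advertising these prefixes towards the edges, you can check the per-neighbor advertised routes (a hedged sketch; the exact syntax and output depend on the VyOS/FRR version):

show ip bgp neighbors 172.27.11.2 advertised-routes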

Retry on CB and you should be good.

Everything went smoothly after this. SDDC was deployed successfully.

That was fun. We have successfully deployed vCloud Foundation version 4.2.1 including AVN (Application Virtual Networking).

Time to verify and check the components that have been installed.

SDDC Manager.

Segments in NSX-T that were specified in the deployment parameters sheet.

Verify on the TOR (VyOS) if you see these segments as BGP published networks.

Added a test segment called “virtaulrove_overlay_172.16.50.0” in nsx-t to check if the newly created network gets published to TOR.

All looks good. I see the new segment subnet populated on TOR.

Let’s do some testing. As you can see above, the new segment subnets are being learned from 172.27.11.2; this interface is configured on the edge01 VM. Check it here.

We will take down edge01 VM to see if route learning changes to edge02.

Go to the edge nodes in NSX-T and select “Enter NSX Maintenance Mode” for the edge01 VM.

Edge01, Tunnels & Status down.

Notice that the next hop has failed over to 172.27.11.3.

All Fine, All Good. 😊

There are multiple tests that can be performed to check if the deployed environment is redundant at every level.

Additionally, you can use the command ‘systemctl restart vcf-bringup’ on the Cloud Builder to pause the deployment when required.

For example, in my case the NSX-T Manager was taking a long time to deploy, and due to a timeout on the Cloud Builder, it kept cancelling the deployment assuming a failure. So, I paused the deployment after the NSX-T OVA job was triggered from the CB and hit ‘Retry’ after NSX got deployed successfully in vCenter. It picked up from that point and moved on.

I hope you enjoyed reading the post. It’s time for you to get started and deploy VCF. See you in future posts. Feel free to comment below if you face any issues when you deploy the VCF environment.

Are you looking out for a lab to practice VMware products..? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in the box below to receive notification on my new blogs.

VMware vCloud Foundation 4.2.1 Step by Step Phase2 – Cloud Builder & Deployment Parameters

We have prepared the environment for the VCF deployment. It’s time to move to the CB and discuss the “Deployment Parameters” Excel sheet in detail. You can find my earlier blog here.

Login to Cloud Builder VM and start the deployment process.

Select “vCloud Foundation” here,

The other option, “Dell EMC VxRail”, is to be used when your physical hardware is Dell VxRail.

VxRail is a hyper-converged appliance: a single device which includes compute, storage, networking, and virtualization resources. It comes with a pre-configured vCenter and ESXi servers. There is then a manual process to convert this embedded vCenter into a user-managed vCenter, and that’s when we use this option. If possible, I will write a small blog on it too.

Read all prereqs on this page and make sure to fulfill them before you proceed.

Click on “Download” here to get the “Deployment Parameter” excel sheet.

Let’s dig into this sheet and talk in detail about all the parameters here.

The “Prerequisites Checklist” sheet from the deployment parameter workbook: check all line items one by one and select “Verified” in the status column. This does not affect anything; it is just for your reference.

“Management Workloads” sheet.

Place your license keys here.

This sheet also has compute resource calculator for management workload domain. Have a look and try to fit your requirements accordingly.

“Users and Groups”: define all passwords here. Pay attention to the NSX-T passwords, as the validation fails if they do not match the password policy.

Moving on to next sheet “Hosts and Networks”.

Couple of things to discuss here,

The DHCP requirement for the NSX-T host TEPs is optional now; they can be defined manually with static IP pools here. However, if you select No, then the DHCP option is still valid and will be used.

Moving on to the “vSphere Distributed Switch Profile” in this sheet: it has 3 profiles. Earlier VCF versions had only one option, which deployed with 2 pNICs only. Due to high customer demand to deploy with 4 pNICs, these profiles were introduced. Let’s talk about them.

Profile-1

This profile will deploy a single vDS with 2 or 4 uplinks. All network traffic will flow through the assigned nics in this vDS. Define the name and pNICs at row # 17,18 respectively.

Profile-2

This one deploys 2 VDS. You can see that the first vDS will carry management traffic and the other one is for NSX. Each vDS can have 2 or 4 pnics.

Profile-3

This one also deploys 2 vDS switches, except that the vSAN traffic is segregated instead of the NSX traffic as in the previous profile.

Select the profile as per your business requirement and move to next step.

Next – “Deploy Parameters”

Define all parameters here carefully. If something is not right, the cell turns RED. I have selected the VCSA size as small, since we are just testing the product.

Move to the NSX-T section and have a look at the AVN (Application Virtual Networking) option. If you select Yes here, then you must specify the BGP peering information and uplink configuration. If it’s No, then it does not do BGP peering.

TOR1 & TOR2 IPs are the interfaces configured on your VyOS. Make sure to create those interfaces. We will see this in detail when we reach that stage in the deployment phase.

We are all set to upload this “Deployment Parameter” sheet to Cloud Builder and begin the deployment. That is all for this blog. We will do the actual deployment in next blog.

Are you looking out for a lab to practice VMware products..? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in the box below to receive notification on my new blogs.

VMware vCloud Foundation 4.2.1 Step by Step Phase1 – Preparation

Finally, after a year and a half, I got a chance to deploy the latest version of vCloud Foundation, 4.2.1. It has been successfully deployed and tested. I have written a couple of blogs on an earlier version (i.e. version 4.0); you can find them here.

https://virtualrove.com/vcf/

Let’s have a look at the Cloud Foundation 4.2.1 Bill of Materials (BOM).

Software Component | Version | Date | Build Number
Cloud Builder VM | 4.2.1 | 25-May-21 | 18016307
SDDC Manager | 4.2.1 | 25-May-21 | 18016307
VMware vCenter Server Appliance | 7.0.1.00301 | 25-May-21 | 17956102
VMware ESXi | 7.0 Update 1d | 4-Feb-21 | 17551050*
VMware NSX-T Data Center | 3.1.2 | 17-Apr-21 | 17883596
VMware vRealize Suite Lifecycle Manager | 8.2 Patch 2 | 4-Feb-21 | 17513665
Workspace ONE Access | 3.3.4 | 4-Feb-21 | 17498518
vRealize Automation | 8.2 | 6-Oct-20 | 16980951
vRealize Log Insight | 8.2 | 6-Oct-20 | 16957702
vRealize Operations Manager | 8.2 | 6-Oct-20 | 16949153

It’s always a good idea to check release notes of the product before you design & deploy. You can find the release notes here. https://docs.vmware.com/en/VMware-Cloud-Foundation/4.2.1/rn/VMware-Cloud-Foundation-421-Release-Notes.html

Let’s discuss and understand the installation flow,

Configure the TOR for the networks that are used by VCF. In our case, we have a VyOS router.
Deploy a Cloud Builder VM on a standalone source ESXi or vCenter.
Install and configure 4 ESXi servers as per the prerequisites.
Fill in the Deployment Parameters Excel sheet carefully.
Upload the “Deployment Parameter” Excel sheet to the Cloud Builder.
Resolve the issues / warnings shown on the validation page of the CB.
Start the deployment.
Post deployment, you will have a vCenter, 4 ESXi servers, an NSX-T environment & SDDC Manager deployed.
Additionally, you can deploy a VI workload domain using SDDC Manager. This will allow you to deploy a Kubernetes cluster.
Also, vRealize Suite & Workspace ONE can be deployed using SDDC Manager.

You definitely need huge amount of compute resources to deploy this solution.
This entire solution was installed on a single ESXi server. Following is the configuration of the server.

Dell PowerEdge R630
2 X Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
256 GB Memory
4 TB SSD

Let’s prepare the infra for VMware vCloud Foundation.

I will call my physical esxi server as a base esxi in this blog.
So, here is my base esxi and VM’s installed on it.

dc.virtaulrove.local – This is a Domain Controller & DNS Server in the env.
VyOS – This virtual router will act as a TOR for VCF env.
jumpbox.virtaulrove.local – To connect to the env.
ESXi01 to ESXi 04 – These will be the target ESXi’s for our VCF deployment.
cb.virtaulrove.local – Cloud Builder VM to deploy VCF.

Here is a look at the TOR and interfaces configured…

Follow my blog here to configure the VyOS TOR.

Network requirements: the management domain networks must be in place on the physical switch (TOR). Jumbo frames (MTU 9000) are recommended on all VLANs, with a minimum of 1600 MTU.

And VLAN 1634 for the host TEPs, which is already configured on the TOR at eth3.
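For reference, this is roughly how a TEP VLAN such as 1634 looks on the VyOS side (a hedged sketch; the interface address and the jumbo MTU values are assumptions from my lab, adjust them to your topology):

set interfaces ethernet eth3 mtu 9000
set interfaces ethernet eth3 vif 1634 address '172.16.34.253/24'
set interfaces ethernet eth3 vif 1634 mtu 9000
commit
save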

The following DNS records must be in place before we start with the installation.

With all these things in place, our first step is to deploy the 4 target ESXi servers. Download the correct supported ESXi version ISO from VMware downloads.

VMware ESXi | 7.0 Update 1d | 4-Feb-21 | 17551050*

If you check VMware downloads page, this version is not available for download.

The release notes say to create a custom image to use for the deployment. However, there is another way to get this version of the ESXi image. Let’s get the Cloud Builder image from the VMware portal and install it; we will keep the ESXi installation on hold for now.

We start the Cloud Builder deployment once this 19 GB ova file is downloaded.

Cloud Builder Deployment:

Cloud Builder is an appliance provided by VMware to build the VCF environment on the target ESXi hosts. It is a one-time-use VM and can be powered off after the successful deployment of the VCF management domain. After deployment, we will use SDDC Manager for managing additional VI domains. I will be deploying this appliance in VLAN 1631, so that it has access to the DC and all our target ESXi servers.

Deployment is straightforward, like any other OVA deployment. Make sure you choose the right password while deploying the OVA. The admin & root passwords must be a minimum of 8 characters and include at least one uppercase letter, one lowercase letter, one digit, and one special character. If this requirement is not met, the deployment will fail, which results in re-deploying the OVA.

Once the deployment is complete, connect to the CB using WinSCP and navigate to…

/mnt/iso/sddc-foundation-bundle-4.2.1.0-18016307/esx_iso/

You should see an ESXi image at this path.

Click on Download to use this image to deploy our 4 target ESXi servers.

The next step is to create 4 new VMs on the base physical ESXi. These will be our nested ESXi hosts where our VCF environment will be installed. All ESXi hosts should have an identical configuration. I have the following configuration in my lab.

vCPU: 12
2 Sockets, 6 cores each.
CPU hot plug: Enabled
Hardware Virtualization: Enabled

Memory: 56 GB

HDD1: Thick: ESXi OS installation
HDD2: Thin VSAN Cache Tier
HDD3: Thin VSAN Capacity Tier
HDD4: Thin VSAN Capacity Tier

And 2 network cards attached to Trun_4095. This will allow an esxi to communicate with all networks on the TOR.

Map the ISO to CD drive and start the installation.

I am not going to show ESXi installation steps, since most of you know it already. Let’s look at the custom settings after the installation.

DCUI VLAN settings should be set to 1631.

Crosscheck the DNS and IP settings on esxi.

And finally, make sure that the ‘Test Management Network’ on DCUI shows OK for all tests.

Repeat this for all 4 esxi.

I have all my 4 target ESXi servers ready. Let’s look at the ESXi configuration that has to be in place before we can use them for the VCF deployment.

All ESXi must have ‘VM network’ and ‘Management network’ VLAN id 1631 configured.
NTP server address should be in place on all ESXi.
SSH & NTP service to be enabled and policy set to ‘Start & Stop with the host’
All additional disks to be present on an ESXi as a SSD and ready for VSAN configuration. You can check it here.

If your base ESXi has HDDs and not SSDs, then you can use the following commands to mark those HDDs as SSDs.

You can either connect from the DC and PuTTY to the ESXi host, or open the ESXi console, and run these commands.

esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d mpx.vmhba1:C0:T1:L0 -o enable_ssd
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d mpx.vmhba1:C0:T2:L0 -o enable_ssd
esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d mpx.vmhba1:C0:T3:L0 -o enable_ssd
esxcli storage core claiming reclaim -d mpx.vmhba1:C0:T1:L0
esxcli storage core claiming reclaim -d mpx.vmhba1:C0:T2:L0
esxcli storage core claiming reclaim -d mpx.vmhba1:C0:T3:L0

Once done, run ‘esxcli storage core device list’ command and verify if you see SSD instead of HDD.

Well, that should complete all our prerequisites for the target ESXi hosts.

So far, we have completed the configuration of the domain controller, the VyOS router, the 4 nested target ESXi hosts, and the Cloud Builder OVA deployment. The following VMs have been created on my physical ESXi host.

I will see you in next post, where we talk about “Deployment Parameters” excel sheet in detail.

Thank you.

Are you looking out for a lab to practice VMware products..? If yes, then click here to know more about our Lab-as-a-Service (LaaS).

Leave your email address in the box below to receive notification on my new blogs.