02 Feb How to Survive a Brutal VCF 9 DoD Deployment: NFS Validation Failure
In my last post, How to Survive a Brutal VCF 9 DoD Deployment: Offline Depot UMDS Setup, I jumped ahead to the post-deployment phase to walk through the UMDS configuration. Now we are backtracking slightly to continue the initial deployment and cover our next major hurdle: NFS Datastore Validation. This specific issue stalled our progress for weeks while we waited on the networking team. We exhausted every troubleshooting step on the host side to rule out local issues and help the network team pinpoint the exact error.
Troubleshooting VCF 9 NFS Validation
As I mentioned, we wanted to rule out every possible issue on our side before approaching the networking team for help. We started with a simple ping to the NFS VIP, but every packet failed. While that usually screams “network issue,” we refused to leave any stone unturned. We added a VMkernel interface to one of our hosts using the NFS network details and tried to ping the server again over that specific interface. We still saw 100% packet loss.
Next, we verified the firewall settings on the host to ensure the system wasn’t blocking ports 111 and 2049 (TCP and UDP). Everything looked clear. We also double-checked our DNS and VLAN settings, confirming that our configuration was correct. At that point, we knew we had done our part and it was time to engage the networking team.
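If you want a quick, repeatable check of the same two ports from a machine on the NFS network, a few lines of Python will do it. This is only a minimal sketch and not what we ran on the ESX hosts: the function names are my own, it tests TCP only (UDP cannot be probed reliably with a simple connect), and a failure cannot tell you whether a firewall, a routing problem, or the storage side is to blame.

```python
import socket

# TCP ports named in this post: 111 (rpcbind/portmapper) and 2049 (NFS).
NFS_TCP_PORTS = (111, 2049)


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks,
        # and name resolution failures.
        return False


def check_nfs_ports(host, ports=NFS_TCP_PORTS, timeout=3.0):
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Calling check_nfs_ports with the address of your NFS VIP returns a dictionary such as {111: False, 2049: False} when nothing gets through, which is consistent with what our pings were telling us at this stage.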
Network Team Engagement and the Waiting Game
At this point, we stalled while we waited for the networking team to validate things on their end. As is often the case, the networking team had plenty of other tasks besides troubleshooting our NFS issues. We were truly at the mercy of their availability. Also, as I mentioned at the start of this series, this customer had recently lost their entire contractor team, who served as their primary knowledge base. This meant the remaining staff had very little firsthand knowledge of the initial setup, which added another layer of complexity to our troubleshooting efforts.
The NetApp Angle
I should mention that NetApp backed their NFS storage. While the networking team investigated, the storage team worked to rule out any issues on the NetApp side. I also reviewed the official Broadcom NFS Storage Model documentation, but nothing stood out as a clear root cause. However, during my conversations with the storage team, one topic kept coming up: the Export Policy. This detail plays a major role in our story. Keep in mind that until recently, VCF required vSAN for the management domain. Because of that history, official documentation for using NFS in a management domain remains incredibly scarce.
Finally Uncovering the Problems
The Networking Problem
Our network engineer racked his brain for days troubleshooting this issue. He could see exactly where the ping dropped, but the cause remained a mystery. Between calls with Cisco, sessions with NetApp, and my own consulting insights, we finally tracked down the culprit. It turns out one switch had a duplicate VLAN configuration with conflicting information. Once the packets hit that specific switch, they simply dropped. As soon as the engineer resolved that conflict, the pings started working immediately. After weeks of waiting, a simple configuration fix finally cleared the path for our deployment.
On our next try, we passed validation and could finally start the deployment. Unfortunately, our excitement was short-lived. The deployment failed while trying to install vCenter, one of the first things that happens during bring-up.
The Export Policy Problem
I went back into the logs to determine the root cause of the failure. Two specific issues kept appearing: Error 13 (Permission Denied) and VSI node (5001) failure. Since we knew the hosts could reach the NetApp server, these errors pointed to one conclusion: the export policy blocked the connection. We could “see” the datastore, but we lacked the rights to perform any actions on it.
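On POSIX systems, error 13 is EACCES, the standard "permission denied" code. One way to reproduce that symptom outside of the VCF installer is to mount the same export on a Linux test client as root and try to write a file. The sketch below assumes such a test mount exists; the probe_write name and the hidden probe file are hypothetical, and this is not how VCF itself validates the datastore.

```python
import errno
import os
import uuid
from typing import Tuple


def probe_write(directory: str) -> Tuple[bool, str]:
    """Try to create and then remove a small file in `directory`.

    Returns (True, "ok") on success, otherwise (False, "<ERRNO name>: <message>").
    """
    # Random suffix so the probe never collides with an existing file.
    path = os.path.join(directory, ".write-probe-" + uuid.uuid4().hex)
    try:
        with open(path, "w") as handle:
            handle.write("probe\n")
        os.remove(path)
        return True, "ok"
    except OSError as exc:
        # Translate the numeric errno into its symbolic name, e.g. 13 -> EACCES.
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return False, "{}: {}".format(name, exc.strerror)
```

If the export policy is denying writes to root, a call against the mount point should come back as False with a message starting with EACCES, which matches the "we can see it but cannot touch it" behavior described above.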
I teamed up with the storage group to troubleshoot. Initially, the blocker stumped us because the export policy used superuser=any, which theoretically prevents root squashing. However, my research uncovered several articles noting that superuser=any can actually result in “root” on ESX hosts being squashed to “anon.” This effectively kills all write operations. That explained why the process failed during the very first task: deploying the vCenter VM.
Solving the VCF 9 NFS Validation Failure
I asked the storage team to change the export policy to superuser=sys, and we restarted the deployment. We watched the progress anxiously as the day went on. Since the deployment continued to run past quitting time, we felt optimistic and let it run overnight. We returned the next morning to find VCF fully deployed. The setting superuser=sys was the key that finally got us over the finish line.
Join the Conversation
Our first VCF 9 deployment in a DoD environment certainly had its share of ups and downs. We learned a lot about what to avoid and exactly what we must do in the future. Because VCF 9 is so new, we found that vital information was either missing from the official documentation or scarce in online forums. I hope that sharing the hurdles we cleared helps you move forward with your own secure VCF 9 deployment.
Perhaps you encountered different issues or found a unique way to solve a similar problem. I would love to hear about your experiences and the fixes that worked for you. Leave a comment below and let’s keep the conversation going!
Continue the Journey
This post is part of a series dedicated to navigating the complexities of VMware Cloud Foundation 9.