The Automation Conundrum
“If an automated system has an error, it will multiply the error until it’s fixed or shut down. Efficient Automation makes humans more important, not less.” (Kaufman, 2010)
Whether in manufacturing parts or, in our case, applying software updates and workflows, errors during an automated process are complex to resolve and have a domino effect because of the intricate relationships built between components.
I want to be clear, though, that the last thing I ever want to do is figure out how to update every element within a system (or systems) manually, determining which versions work and checking for incompatibilities between everything. It is ‘mind-numbing’ work and takes time which could be spent on better and more interesting things. Besides, upgrades are never-ending. Once you’ve done it, you have to do it all over again and again, for eternity. But what if I could press a “single button”, like on my iPhone?
Well… there isn’t really a single button, but having seen both sides of the coin I’m not so sure I’d want one. I think it’s about getting the balance right. If you’re going to have a single button then the entire process has to be “bulletproof”, particularly when multiple products are combined into that automation. If something does go wrong, the effect has to be limited or isolated so that it does not stop the overall function, and the fix needs to be modular and easy to apply.
A factor often forgotten is the human element: resolving errors now requires a higher level of knowledge because of the interdependencies created by automation. The natural belief with automation is that employees are freed from time-intensive, low-skilled tasks, so higher-skilled employees can be redistributed to more complex projects. But the automated tasks, taken as a whole, result in more complexity. These complex relationships are hidden as long as the wheels rotate, but when one stops …
Ideally automation solves problems rather than creating new ones, so it is important to clearly scope what to automate and what not to automate. In addition, when combining a number of different products under the same ‘umbrella’, one wants to avoid adding artificial limitations to those products that would not have existed had they been configured standalone. What relationship does the management layer have with these additional products, and does it dictate and limit any aspect of the configuration, or enable and ease the deployment? This is an important question to ask. Lastly, how much does it really ease deployment and updates of that particular product?
Bearing the above in mind, VMware Cloud Foundation (VCF) combined with Dell’s VxRail does automate a number of tasks which would normally take a lot of effort to piece together manually. For the purposes of this article I’m going to concentrate on Life-Cycle Management (LCM) and some workflows that I’ve played around with.
Life Cycle Management
Without going into too much of an origin story about VCF & VxRail, I’ll briefly describe what is Life-cycle managed (LCM) by the product. There is a lot of information already available on VCF & VxRail (I mean… how many times can you watch Peter Parker get bitten by that spider?).
To quickly get an idea of the components and software being updated by the automated LCM, a good place to look is the Release notes for the products. Below is the link to the VCF 4.4 Release notes, which contain the Bill of Materials (BoM). The list shows the components deployed and updated by VCF. This document is dated 10th Feb 2022:
And here we have the VxRail equivalent, a compatibility matrix which shows the components including firmware and drivers updated by the VxRail LCM.
This should help the reader understand just how much is handled by Life-Cycle Management. Trying to update the above manually is an incredible amount of work. That said, starting with vSphere 7 there is now vLCM, which makes it possible to build an ESXi base image that can also be extended with firmware and driver versions. It is based on a declarative model: a target version is set up as a desired state, which the hosts then adhere to. vLCM replaces the baseline concept used in vSphere Update Manager (VUM). vLCM is supported in VCF and is also supported on the VxRail product, but it is not currently supported with VCF on VxRail. There is a great article about vLCM here:
Looking at Figures 1 and 2, there is overlap between VCF and VxRail in terms of the products they contain when they’re not combined, for example ESXi. So, what happens when you join VCF and VxRail? All LCM activities are now controlled via the VCF SDDC Manager UI. VxRail, not VCF, does maintain responsibility for updating the ESXi components. See figure 3 below, which is taken from the VCF on VxRail Release notes on VMware’s site. Compared to the standard VCF BoM, ESXi has been removed from the list and VxRail Manager added.
VxRail on its own does have a robust Life Cycle Management system. Tested and validated software bundles from Engineering are made available for the product, containing the components shown in figure 2. Combining VxRail with VCF adds the SDDC Manager, NSX-T and optionally the vRealize suite via the vRealize Suite Life Cycle Manager (VRSLCM). The VCF and VxRail LCMs are then combined via API, and there is no clash between the responsibilities of what each product updates as part of its Life Cycle Management.
When combined with VxRail, VCF has less to life-cycle manage and, depending on the update, it will hand off the operation to VxRail. When installed on vSAN Ready Nodes (VSRNs), the VCF LCM has a wider scope of responsibility.
Is it Easier?
VCF and VxRail update their respective components, but any LCM task is only initiated using the SDDC Manager.
The SDDC Manager initiates the operations. This is apparently the “Single Button” and this is something a customer can or at least “should” be able to do on their own. See figure 6. So, the question is, how much easier does SDDC Manager make the upgrade experience for all these products?
SDDC Manager’s LCM has a number of responsibilities. It first has to connect to the online depots for VMware and Dell software updates and bundles, and then has to make the software available to SDDC Manager. It is then responsible for creating a task to begin the update. Once the update process begins, and depending on the update, it’s down to the product to run through its own update process.
As previously outlined, VCF and VCF on VxRail are not single products but combinations of a set of products, with a layer on top called the SDDC Manager. This model has a bearing on every operation that is subsequently run for the entire stack. SDDC Manager enables a number of workflows and so automates tasks. These include the installation of a product or component, or the update of a product’s components.
As mentioned earlier, before an update can be scheduled from the SDDC Manager, the actual updates or bundles need to be available and downloaded to the SDDC Manager. This is done via the UI.
It is nice to be able to see all updates and bundles just for the products you have installed, in a single place. I don’t have to go searching for drivers and firmware online and run through lots of compatibility checks as it’s all done for me.
The SDDC UI would perhaps be easier to use with better filtering options, though. For example, being able to filter by product version so I only see relevant updates. Currently I see every single bundle for multiple versions of VCF, whether relevant to me or not. From the perspective of a user this means sifting through lists of bundles to find the ones relevant to your version, which can be confusing. The ability to filter bundles seems like such a basic and obvious function to add to the UI. From my understanding, some form of filter has been applied in VCF 4.4.0 so that one only sees updates relevant to the installed version of VCF and the version you want to upgrade to, i.e. the source and target versions.
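As a rough illustration of the kind of source/target filter described above, here is a minimal Python sketch. The `Bundle` record, its field names and the catalogue entries are all hypothetical; this is not how SDDC Manager stores or filters bundles, just one way the idea could be expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    """Hypothetical bundle record; real bundle metadata will look different."""
    name: str
    from_version: str  # version the bundle upgrades from
    to_version: str    # version the bundle upgrades to


def as_tuple(version: str) -> tuple:
    """Turn '4.3.1' into (4, 3, 1), padding '4.4' to (4, 4, 0), so versions compare numerically."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def relevant_bundles(bundles, installed: str, target: str):
    """Keep bundles that start at or after the installed version and end at or before the target."""
    low, high = as_tuple(installed), as_tuple(target)
    return [
        b for b in bundles
        if low <= as_tuple(b.from_version) and as_tuple(b.to_version) <= high
    ]


# Illustrative catalogue only; names and version pairs are made up.
catalog = [
    Bundle("sddc-manager-a", "4.2.0", "4.3.0"),
    Bundle("sddc-manager-b", "4.3.1", "4.4.0"),
    Bundle("nsx-c", "4.3.1", "4.4.0"),
    Bundle("sddc-manager-d", "4.4.0", "4.5.0"),
]

for bundle in relevant_bundles(catalog, installed="4.3.1", target="4.4"):
    print(bundle.name)
```

With an installed version of 4.3.1 and a target of 4.4, only the two bundles that sit between those versions remain; the older and newer ones drop out of the list.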
In terms of applying the updates there is an order of upgrade that should normally be followed. There are pages and documents online giving guidance on which order to update components. Here is a link to VMware’s documentation on upgrade order with VCF 4.4:
VCF/SDDC Manager does provide some form of enforcement, i.e. the order in which bundles need to be applied. Again, this is a positive point, as at least the UI guides the user on the upgrade order of the components. Certain bundles will show the term “future” and cannot be applied until a prerequisite bundle has been applied. Once that is completed, the “future” update becomes available for scheduling as the next update.
The rough order for application of updates tends to be:
1. Perform Pre-checks – some of which are built into SDDC Manager and some of which are external tools such as VxRail’s VxVerify.
2. Upgrade the Management Domain components in the following order:
a. SDDC Manager
b. VMware Cloud Foundation software, which includes:
– Critical Bugs.
– Security Fixes for the SDDC Manager Appliance.
– The Configuration Drift Bundle (more on this later). This applies configuration changes across
3. vRealize Suite (there will be a separate article dedicated to this)
a. vRealize Suite Life Cycle Manager (VRSLCM).
b. vRealize Suite products.
– For example, vRealize Automation, Operations Manager etc.
c. Workspace ONE Access.
4. NSX-T Data Center
5. vCenter Server
7. Then your Applications
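To make the “future” behaviour concrete, here is a small Python sketch of how an enforced sequence could label each step. The step names loosely follow the order above, and the labels “applied” and “available” are my own assumptions; only “future” is a term the UI actually shows. It is a sketch of the idea, not SDDC Manager’s implementation.

```python
# Hypothetical upgrade sequence, loosely following the order listed above.
UPGRADE_ORDER = [
    "sddc-manager",
    "vrealize-suite",
    "nsx-t",
    "vcenter",
]


def bundle_status(applied: set) -> dict:
    """Label each step 'applied', 'available' or 'future'.

    Only the first step that has not been applied is 'available';
    everything after it stays 'future' until its predecessor completes.
    """
    status = {}
    next_found = False
    for step in UPGRADE_ORDER:
        if step in applied and not next_found:
            status[step] = "applied"
        elif not next_found:
            status[step] = "available"
            next_found = True
        else:
            status[step] = "future"
    return status


# After the SDDC Manager step, only the next step in the sequence can be scheduled.
for step, label in bundle_status({"sddc-manager"}).items():
    print(f"{step}: {label}")
```

Modelling the order as a simple list keeps the rule easy to reason about: a step can only be scheduled when everything before it has been applied.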
I have had one case where enforcement did not happen, and this concerned the deployment of the VRSLCM. I’d recently installed VCF 4.3.1 and then deployed VRSLCM using the SDDC Manager. Before running the deployment, I’d downloaded the various available VRSLCM bundles into the SDDC Manager, as a user would. After successfully deploying the VRSLCM I was then told by SDDC Manager, while deploying the vRealize environment and building Workspace ONE, that the version installed was not compatible with the version of VCF deployed… even though VCF had downloaded and deployed the VRSLCM! (Refer to Figure 8).
The problem may have occurred because multiple versions of VRSLCM were available to download into SDDC Manager. None were greyed out and, back to my original point, seeing only the bundle tied to my installed version of VCF would have helped. I am surprised SDDC Manager did not enforce any rule to prevent the installation of an incorrect version.
The only option I had to resolve the incorrect version of VRSLCM deployed by the SDDC Manager was to upgrade from VCF 4.3.1 to VCF 4.4. That did fix the issue, but it should not have occurred in the first place.
Once bundles are downloaded the upgrades are ready to run and, as described above, an order of update is enforced the majority of the time. When an update does begin, the respective product runs through its own update process. The LCM’s responsibility is to schedule and launch an Upgrade Task, and from that point its job is to track the activity and progress as the update is applied, which it does incredibly well. SDDC Manager shows detailed messages and status. If an issue does occur, it can be pinpointed fairly easily within the SDDC UI. In addition, a number of log files are available if more in-depth detail about the error is needed.
The level of detail and progress shown is much greater than what you would see in vCenter or the vSphere Client. In comparison, when deploying Kubernetes from the SDDC Manager, the installation is passed over completely to vSphere. There you get pretty much nothing in terms of status or detail about how far the install of Kubernetes has progressed. I did have an install stuck and just spinning, with no messages at all to tell me something was wrong. So yes, it’s great to be able to see such detail during a workflow or update.
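To show why per-step status matters, here is a minimal Python sketch of a task broken into sub-steps, each carrying its own status and message. The `Step` structure, the status values and the step names are hypothetical and only illustrate the idea of pinpointing where a workflow stopped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One sub-step of a hypothetical upgrade task."""
    name: str
    status: str  # "succeeded", "failed", "running" or "pending"
    message: str = ""


def progress(steps) -> str:
    """Summarise how many sub-steps have finished successfully."""
    done = sum(1 for step in steps if step.status == "succeeded")
    return f"{done}/{len(steps)} steps succeeded"


def first_failure(steps) -> Optional[Step]:
    """Return the first failed sub-step, or None if nothing has failed."""
    return next((step for step in steps if step.status == "failed"), None)


# Illustrative task; step names and the message are made up.
task = [
    Step("download bundle", "succeeded"),
    Step("pre-check", "succeeded"),
    Step("upgrade component", "failed", "conflicting configuration found"),
    Step("post-check", "pending"),
]

print(progress(task))
failed = first_failure(task)
if failed is not None:
    print(f"stopped at '{failed.name}': {failed.message}")
```

A workflow that only reports “running” or “failed” for the whole task gives you neither of these two answers, which is the gap I hit with the stuck Kubernetes install.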
Depending on the type of update task running, if an error does occur during the update then in many cases it will be product related. For example, if an issue occurs at the NSX-T update stage, it is an NSX-T update issue rather than an LCM one. The example in figure 10 occurred during the LCM process, but it was an NSX-T update issue caused entirely by a conflicting configuration within NSX-T. LCM’s only part here was to initiate the update and then tell me whether it was working or not. I guess what I’m trying to say is that LCM sometimes gets a “bad rap” even though it’s acting almost as a messenger here – “don’t shoot the messenger”.
This does bring up another interesting topic which I’ll discuss in a future article: what you can and cannot do, or “Guardrails”. These are not well documented, and actions taken outside them will have a negative impact on the workflows and LCM within VCF because they conflict with VCF. From experience, the pre-check tools will not always pick these conflicts up. The case mentioned above was exactly this: NSX-T configuration created directly in the NSX-T Manager, which is something that will likely happen at a customer site when a network engineer loads NSX-T Manager and builds some components as they usually would. In this case the component was an NSX-T Edge cluster. The problem, however, is that the SDDC Manager has its own workflow for this. The conflicting objects and entries within VCF’s databases caused the LCM update of NSX-T to fail until those objects were removed. There are a lot of examples like this, where either better documentation clearly showing what one can and cannot do is required, or the alerts and verification do not catch the problem and a later operation fails.
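One way to picture the kind of check that could catch this earlier is to compare what actually exists in a product with what the management layer believes it deployed. The Python sketch below does exactly that with plain dictionaries. The object types, names and data are invented for illustration, and this is not how the real pre-checks work.

```python
def find_untracked(live_objects: dict, tracked_inventory: dict) -> dict:
    """Return objects that exist in the product but are unknown to the management layer.

    Both arguments map an object type (e.g. 'edge-cluster') to a set of object names.
    """
    untracked = {}
    for obj_type, names in live_objects.items():
        extra = names - tracked_inventory.get(obj_type, set())
        if extra:
            untracked[obj_type] = extra
    return untracked


# Illustrative data: one edge cluster was built directly in the product's own manager.
live = {"edge-cluster": {"ec-01", "ec-manual"}, "transport-zone": {"tz-overlay"}}
tracked = {"edge-cluster": {"ec-01"}, "transport-zone": {"tz-overlay"}}

print(find_untracked(live, tracked))
```

Anything reported here was created outside the management layer’s workflows and is a candidate for the sort of conflict that later breaks an update.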
What is the Configuration Drift Bundle?
This is a bundle that I would just run without really knowing much about it, and there is little explanation out there about what it actually does and how it works. The Configuration Drift bundle is applied as a second update, so it’s run after the VCF service upgrade. It’s applied to the Management domain as an upgrade bundle. As VMware introduces modifications to the configuration of any component that was deployed by the SDDC Manager, the drift bundle can apply these changes to bring the configuration in line with a greenfield installation of that version. So, these updates can be applied to ESXi, NSX-T (including its configuration) and the other BoM components. The one negative for me is that there is very little status or activity shown while it is being applied and, because of the nature of the changes, it’s possible a customer setting could be overridden. Also, if it fails, the retry capability is limited to running the entire drift again rather than resuming from just the failed operation. On a positive note, I believe these issues are being looked at.
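The risk of a customer setting being overridden can be pictured as a three-way comparison between the old greenfield baseline, the new one and the current configuration. The Python sketch below is purely illustrative: the setting names and values are invented, and nothing here reflects how the drift bundle is really implemented.

```python
def plan_drift(old_baseline: dict, new_baseline: dict, current: dict) -> dict:
    """Classify each setting that changed between two baseline configurations.

    'update'   : current value still matches the old baseline, safe to move forward.
    'override' : current value differs from both baselines, so applying the new
                 baseline would replace a customised setting.
    """
    plan = {}
    for key, new_value in new_baseline.items():
        if old_baseline.get(key) == new_value:
            continue  # the baseline did not change for this key
        current_value = current.get(key)
        if current_value == new_value:
            continue  # already at the new value
        if current_value == old_baseline.get(key):
            plan[key] = "update"
        else:
            plan[key] = "override"
    return plan


# Hypothetical settings; the keys and values are made up for illustration.
old = {"ntp": "pool-a", "syslog-port": 514, "mtu": 1500}
new = {"ntp": "pool-a", "syslog-port": 6514, "mtu": 9000}
now = {"ntp": "pool-a", "syslog-port": 514, "mtu": 1600}

print(plan_drift(old, new, now))
```

Surfacing the “override” entries before anything is applied is the sort of status I would like to see, since those are the settings a customer changed on purpose.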
Back to my original question: is it easier compared to doing it all manually or directly from the product itself? I think it does depend on the update, the product and the associated workflows. The tracking and status messaging associated with these workflows are very good. It’s important to know how far a workflow or update has progressed and, if it does break down, exactly where. The fact that all the updates for everything sit in one place is also a huge time saver, despite the limited filtering, which should be resolved. The enforced order for applying updates is important, as the SDDC Manager guides you on what to upgrade first. I think there are some avoidable issues with some of the workflows within the SDDC Manager, as well as the UI, which can have an effect on the overall LCM process. We’ll look at those in more detail in the next part.
We’ll also look at some more examples of conflicts that can occur and disrupt workflows or LCM operations. We’ll discuss the other methods of upgrading VCF, using the Offline tools or Skip upgrades, as opposed to the sequential upgrade discussed in this article. Finally, we’ll look at the vRealize suite and VRSLCM in more depth, including the install and the “should I use SDDC Manager to deploy vRealize?” question.
P.S. The opinions expressed in this article are entirely my own and may or may not represent the views of Dell Technologies.