The Day 2 Operations of Async Tooling with VCF

VMware Cloud Foundation (VCF) is a solution-based platform built from VMware’s products and deployed to a consistent VMware Validated Design; however, as with all technology, there are several day 2 operational impacts to consider.

By keeping VCF deployments within operational guard rails and applying regular updates, VCF lifecycle management can be controlled successfully. But what can an administrator do when there are competing business constraints and requirements?

For example, applying an unexpected full update or a major release of VCF may impact deployment/project plans and timescales, while security compliance and risk assessments may require a specific patch to be deployed before workloads/applications can be consumed.

Released earlier this month, the new VCF async tool gives customers the ability to apply supported critical updates to an existing VCF deployment without applying a major release of VCF.

This provides greater flexibility for maintenance windows and project planning for large-scale VCF-powered platforms.

Before using the async tool, there are several considerations to review regarding its suitability for your business and project.

The mind map and links below provide more detail.

VCF Async Tooling Mind Map

My VMworld 2021 Session

It’s VMworld time again, and although we are still in a virtual setting I am very happy to have had the opportunity to present an interactive session on my favorite product, vSAN.

Incorporating good operational practices within the design process or within an upgrade review is something I find very useful in the field.

It is often at these stages that there are opportunities to transform a platform, review and ensure business requirements are being met, and understand the business workloads to a greater degree.

My session focussed on the theory of this operational approach to architectural design and troubleshooting while applying it to vSAN as a product.

The links to the mind map I used during the session are below:

Online Link | Download PDF

Accepted Risks in Architectural Design

With the introduction of any new business solution, there will be risks associated with a technology or design.
Whether these are specific product constraints, technical knowledge gaps, or the budget/duration of a rollout, the project risk register is often forgotten once the project team has released the solution.

An architect can spend many hours documenting, researching, and considering the logical and physical design decisions for various project areas.

“Some might say that we should never ponder

On our thoughts today ’cause they hold sway over time.”

Noel Gallagher – 1995

With every decision made, there can be an impact on the design and potentially on the solution from a conceptual requirement perspective (e.g., SLA/RPO/RTO, security, availability), as well as a potential risk that must be understood and mitigated to protect the business investment.

Having a systematic approach to enterprise risk management has become one of the most valuable takeaways from my VCDX journey and something I continually seek to improve in my fieldwork.

The specific framework or methodology may vary from project to project; however, the ability to relate technological decisions to business objectives is valuable.   

In times of crisis, for example, with a service-impacting incident, a robust method of risk identification and design review is essential for any IT professional or technologist, not just someone with an architectural design focus.

When faced with a seemingly unfixable problem that potentially costs a customer money or brand reputation, the fear of not knowing enough about a specific technology can be relentless, especially with the number of integration points and ever-changing approaches in the world today.

The ability to review, address, and mitigate technology areas in an agnostic manner can help calm these thoughts and help move forward within long-running troubleshooting or projects in crisis.

Some Thoughts From the Field & for Certification Efforts.

The business has accepted this risk.

Often this is agreed upon without an overarching understanding of the solution.

One objective for an architect is to minimize the risk impact of new technology, potentially during the operationalization phase.

For example, by creating a specific monitoring process.

Once identified, the risk can potentially be lowered, and the initial manual process developed into one that automates, notifies, and corrects with minimal service impact.

Developing a risk-based specific check is different from applying a general cloud monitoring service or creating a new local monitoring product/instance.

The decision impacts another area; we don’t have responsibility for that (e.g., Networking, Security).

For an architect proposing a solution, the aim is to create a working product that meets requirements and a measurable service definition (e.g., SLA, performance, cost optimization, operational improvement).

Creating a new service with dependencies on other business areas, or one that impacts existing layers, without due diligence or review can be risky in the long term and hard to justify within architecture-based certifications such as the VCDX.

It’s out of scope for this project & it will be covered in the next phase.

Conflicting requirements and scope can be challenging within projects.  

A pragmatic view of risk identification and mitigation is essential for this. What is the value of a project being delivered if it is not going to be successfully consumed or operationally reliable? 

Many business transformation programs consist of multiple projects, and the number increases over time as a business matures.

An error in one project could impact user confidence, create operational issues, and hinder the transformation journey.

Recommended Resources

vSAN Fault Domains | Some Design Thoughts

Recently I have been working on a number of projects using VCF and native vSAN with rack awareness requirements.

Differing Fault Domain approaches using multiple VCF Workload domains

The vSAN fault domain feature is extremely useful to ensure that data component placement considers the physical rack architecture of the datacentre.

vSAN Fault Domain & rack mapping considerations

However, as with all features, there are design impacts and operational processes to consider.

Some useful questions I find to think about when using vSAN fault domains are:

Do you need fault domains at all? Does it solve your business requirement?

What disaster event do you need to protect against? Consider all areas of the infrastructure; additional features can add complexity and, in some cases, reduce flexibility. Depending on the requirements and physical platform, vSAN fault domains will not mask reduced redundancy at other layers of the datacentre (e.g., network/power links and diversity).

Are you planning for object availability with automatic rebuild?

When the vSAN fault domain feature is enabled and the domains mapped within a cluster, the default of one fault domain per ESXi host is changed to a rack mapping. Depending on the FTT value, there is a minimum number of fault domains required. When planning the use of this feature, ensure the impact on vSAN capacity/availability following a component failure is considered.

Do you need additional hosts for a rebuild, or are you relying on administrator intervention?

Why would a rack fail in a platform? Are there rack interdependencies? Are there likely to be multiple rack failures, one after another, or both at the same time?
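As a rough planning aid, the relationship between FTT and the minimum number of fault domains can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of any VMware tooling. It uses the commonly documented rules of 2 x FTT + 1 fault domains for RAID-1 mirroring, 4 for RAID-5 (FTT=1), and 6 for RAID-6 (FTT=2). The "one extra fault domain for automatic rebuild" figure is an assumption based on general guidance, so validate it against the documentation for your vSAN version.

```python
# Illustrative planning helper; names are hypothetical, not a VMware API.
ERASURE_CODING_MINIMUMS = {("RAID-5", 1): 4, ("RAID-6", 2): 6}


def min_fault_domains(ftt: int, raid: str = "RAID-1") -> int:
    """Return the minimum number of fault domains for a storage policy."""
    if raid == "RAID-1":
        if not 1 <= ftt <= 3:
            raise ValueError("RAID-1 mirroring supports FTT values 1 to 3")
        # Mirroring needs 2 * FTT + 1 fault domains (replicas plus witnesses).
        return 2 * ftt + 1
    try:
        return ERASURE_CODING_MINIMUMS[(raid, ftt)]
    except KeyError:
        raise ValueError(f"unsupported combination: {raid} with FTT={ftt}") from None


def fault_domains_for_rebuild(ftt: int, raid: str = "RAID-1") -> int:
    """Assumption: one spare fault domain allows an automatic rebuild
    after a single fault domain (rack) failure."""
    return min_fault_domains(ftt, raid) + 1


if __name__ == "__main__":
    for ftt in (1, 2, 3):
        print(ftt, min_fault_domains(ftt), fault_domains_for_rebuild(ftt))
```

With only the minimum number of racks, a failed rack leaves nowhere compliant to rebuild to, which is where administrator intervention comes in.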

Do all the project workloads have the same platform requirements?

Can different approaches be used? Would the use of implicit fault domains, placing hosts thinly across racks rather than many servers in fewer racks, be more effective from a management or cost perspective?

Consider the addition of rebuild capacity and slack space. If using vSAN 7 U1, review the use of the new reserved capacity controls. Some features cannot be combined with fault domains.

Each approach has value and could help with capacity and operational complexity; however, rack-to-rack networking and clustered cache sizing should also be considered.
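The rebuild capacity question can be approximated with some simple arithmetic. The sketch below is deliberately coarse and rests on several assumptions: the function name, the rack figures, and the 80% maximum fill level are all hypothetical, and it ignores per-object placement rules (a rebuilt component cannot land in a fault domain that already holds another component of the same object). Treat it as a first sanity check rather than a sizing tool.

```python
from typing import Dict, Tuple

# rack name -> (raw capacity in TB, used capacity in TB); values are made up.
Racks = Dict[str, Tuple[float, float]]


def can_rebuild_after_rack_loss(
    racks: Racks,
    failed: str,
    required_fault_domains: int,
    max_fill: float = 0.8,  # assumed ceiling to leave slack space
) -> bool:
    """Coarse check: can the surviving racks hold all data after one rack fails?"""
    survivors = {name: r for name, r in racks.items() if name != failed}
    # Not enough fault domains left to place components compliantly.
    if len(survivors) < required_fault_domains:
        return False
    usable = sum(capacity * max_fill for capacity, _ in survivors.values())
    # All data, including what lived on the failed rack, must fit on survivors.
    total_used = sum(used for _, used in racks.values())
    return total_used <= usable


if __name__ == "__main__":
    racks = {f"rack-{i}": (100.0, 50.0) for i in range(1, 5)}
    print(can_rebuild_after_rack_loss(racks, "rack-1", required_fault_domains=3))
```

Running the same check against three racks instead of four shows why the minimum fault domain count leaves no room for an automatic rebuild.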

What is the scale of the deployment?

Fault domains require physical mapping and planning. Configuring them is a task carried out after VCF workload deployment, and if incorrectly configured or maintained, the feature could considerably impact capacity and availability.

Create a strategy/process for rack scaling following a capacity growth trigger. Consider using scripting/automation to maintain physical mapping.
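One way to approach maintaining the physical mapping is a simple drift check that compares the intended host-to-rack placement (for example, from a CMDB or rack plan) with the fault domain assignment currently configured. The sketch below assumes both are already available as plain dictionaries; how they are collected (PowerCLI, an API call, an export) is left out, and the host and rack names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def mapping_drift(
    intended: Dict[str, str],
    actual: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return hosts whose configured fault domain differs from the rack plan.

    Each value is (intended_rack, configured_fault_domain); None means the
    host is missing from that side.
    """
    drift = {}
    for host in sorted(set(intended) | set(actual)):
        want = intended.get(host)
        have = actual.get(host)
        if want != have:
            drift[host] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical data: esx02 sits in the wrong fault domain,
    # esx03 has not been assigned to one at all.
    rack_plan = {"esx01": "rack-a", "esx02": "rack-b", "esx03": "rack-c"}
    configured = {"esx01": "rack-a", "esx02": "rack-a"}
    for host, (want, have) in mapping_drift(rack_plan, configured).items():
        print(f"{host}: planned={want} configured={have}")
```

A check like this could run after every rack scaling event, so that a new host added without a fault domain assignment is flagged before it affects component placement.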

What is the impact on normal day 2 operations?

How many ESXi hosts per rack can be placed into maintenance with each vSAN option? What is the risk associated with the selected approach? Ensure this is well understood.

I have summarised these and other considerations, with links to useful documentation references, in the mind map below.

My vSAN Fault Domain Consideration Summary Mind Map