VCDX Mentoring Series with Customer Connect

Since transitioning to a panelist role for the VCDX program almost 3 years ago, I have missed the mentoring aspect. Working with VCDX candidates was very rewarding and a helpful learning experience for me to understand other business project use cases and designs.
I still get VCDX mentoring requests and recommend others in the field who may help. However, there is always a limited amount of time for people to volunteer. Typically mentoring is limited to design reviews, final mocks, and some 1:1 Q&A.

How would a person looking to achieve the VCDX develop the skills to create the design and deliver a presentation in the first place?
Having completed a VCP and maybe one VCAP, there is still a lot to do, and the journey to VCDX can seem confusing.

Over the past couple of months, I have been working with Diane Mayer and Daniela Quesada from the VMware Customer Connect Learning (VCCL) team to help plan a dedicated mentoring webinar series for the VCDX program.

With Karl Childs’s tremendous support, Customer Connect Learning has now teamed up with over 20 VCDX holders to discuss each phase of the VCDX journey in detail.

I joined the VCCL team and Karl Childs alongside the presenters for the first webinar. Each session is recorded and provides technical approaches that a VCDX candidate can use to succeed in their journey.

Common questions such as “How to get started?”, “What makes a good project?”, and “What to submit?” are discussed, along with deep dives into areas such as logical design and risk mitigation.
These deeper areas are often stumbling blocks at submission and on defense day.

Links to sessions

Thank you again to Diane Mayer, Daniela Quesada, and Karl Childs for letting me help develop the series.


Accepted Risks in Architectural Design

With the introduction of any new business solution, there will be risks associated with a technology or design.
Whether they stem from specific product constraints, technical knowledge gaps, or the budget or duration of a rollout, the risks recorded in a project risk register can often be forgotten once the project team has released the solution.

An architect can spend many hours documenting, researching, and considering the logical and physical design decisions for various project areas.

“Some might say that we should never ponder.

On our thoughts today ’cause they hold sway over time.”

Noel Gallagher – 1995

With every decision made, there can be an impact on the design and potentially on the solution from a conceptual requirement perspective (e.g., SLA/RPO/RTO, security, availability), along with a potential risk that must be understood and mitigated to protect the business investment.

Having a systematic approach to enterprise risk management has become one of the most valuable takeaways from my VCDX journey and something I continually seek to improve in my fieldwork.

The specific framework or methodology may vary from project to project; however, the ability to relate technological decisions to business objectives is valuable.   

In times of crisis, for example, with a service-impacting incident, a robust method of risk identification and design review is essential for any IT professional or technologist, not just someone with an architect design focus.   

When faced with a seemingly unfixable problem that potentially costs money/brand reputation for a customer, the fear of not knowing enough of a specific technology can be relentless, especially with the number of integration points and ever-changing approaches in the world today.

The ability to review, address, and mitigate technology areas in an agnostic manner can help calm these thoughts and help move forward within long-running troubleshooting or projects in crisis.

Some Thoughts From the Field & for Certification Efforts.

The business has accepted this risk.

Often this is agreed upon without the overarching understanding of a solution.

As an architect, one objective is to minimize the risk impact of new technology, potentially within the operationalizing phase.  

For example, creating a specific monitoring process.   

Once identified, the risk can potentially be lowered, and the initial manual process can be developed to automate, notify, and correct with minimal service impact.

Developing a risk-based specific check is different from applying a general cloud monitoring service or creating a new local monitoring product/instance.
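As a rough illustration of what a risk-specific check could look like, here is a minimal Python sketch. Everything in it is hypothetical: the `AcceptedRisk` structure, the risk ID, the metric name `active_storage_uplinks`, and the threshold are assumptions made up for the example, not part of any product or framework. The point is only that each accepted risk carries its own check, tied back to the requirement it threatens.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AcceptedRisk:
    """One entry from a (hypothetical) project risk register."""
    risk_id: str
    description: str
    requirement: str                      # business requirement the risk threatens
    check: Callable[[Dict[str, float]], bool]  # True when the risk condition is present


def triggered_risks(risks: List[AcceptedRisk], metrics: Dict[str, float]) -> List[str]:
    """Return the IDs of accepted risks whose specific check fires."""
    return [risk.risk_id for risk in risks if risk.check(metrics)]


# Hypothetical example: the business accepted a single point of failure
# on the storage network until a later phase.
register = [
    AcceptedRisk(
        risk_id="R-001",
        description="Storage network runs on reduced uplink redundancy",
        requirement="Availability SLA",
        check=lambda m: m.get("active_storage_uplinks", 0) < 2,
    ),
]

# Metrics would come from whatever monitoring source is available;
# the values here are invented for illustration.
print(triggered_risks(register, {"active_storage_uplinks": 1}))
```

Starting with a manual review of a list like this, and only later wiring the checks into notification and correction, matches the progression described above.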

The decision impacts another area; we don’t have responsibility for that (e.g., Networking, Security).

As an architect proposing a solution, the aim is to create a working product that meets requirements and a measurable service definition (e.g., SLA, performance, cost optimization, operational improvement).

Creating a new service with dependencies on other business areas or impacting existing layers without due diligence or review can be risky in the long term and hard to justify within architecture-based certifications such as the VCDX.

It’s out of scope for this project & It will be covered in the next phase.

Conflicting requirements and scope can be challenging within projects.  

A pragmatic view of risk identification and mitigation is essential for this. What is the value of a project being delivered if it is not going to be successfully consumed or operationally reliable? 

Many business transformation programs consist of multiple projects, and the number grows over time as a business matures.

An error in one project could impact user confidence, create operational issues and hinder the transformation journey.  

Recommended Resources


vSAN Fault Domains | Some Design Thoughts

Recently I have been working on a number of projects using VCF and native vSAN with rack awareness requirements.

Differing Fault Domain approaches using multiple VCF Workload domains

The vSAN fault domain feature is extremely useful to ensure that data component placement considers the physical rack architecture of the datacentre.

vSAN Fault Domain & rack mapping considerations

However, as with all features, there are design impacts and operational processes to consider.

Some questions I find useful to think about when using vSAN fault domains are:

Do you need fault domains at all? Does it solve your business requirement?

What disaster event do you need to protect against? Consider all areas of the infrastructure; additional features can add complexity and in some cases reduce flexibility. Depending on the requirements and physical platform, vSAN fault domains will not mask reduced redundancy at other layers of the datacentre (e.g., network/power links and diversity).

Are you planning for object availability with automatic rebuild?

When the vSAN fault domain feature is enabled and the domains mapped within a cluster, the default of one fault domain per ESXi host is replaced by a rack mapping. Depending on the FTT value, a minimum number of fault domains is required. When planning the use of this feature, ensure the impact on vSAN capacity and availability following a component failure is considered.
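To make the FTT relationship concrete, here is a small Python sketch. It assumes the commonly documented minimums for vSAN 7: 2n+1 fault domains for mirroring with FTT=n, four for RAID-5 (FTT=1), and six for RAID-6 (FTT=2). The function names are my own, and the figures should be verified against the documentation for the release in use. The "plus one" for automatic rebuild is a planning rule of thumb rather than a hard product limit.

```python
# Minimum fault domain counts as commonly documented for vSAN 7.
# Verify against the documentation for your release before relying on them.
ERASURE_CODING_MINIMUMS = {1: 4, 2: 6}  # RAID-5 (FTT=1), RAID-6 (FTT=2)


def min_fault_domains(ftt: int, erasure_coding: bool = False) -> int:
    """Return the minimum number of fault domains for a storage policy."""
    if erasure_coding:
        if ftt not in ERASURE_CODING_MINIMUMS:
            raise ValueError("erasure coding supports FTT=1 or FTT=2 only")
        return ERASURE_CODING_MINIMUMS[ftt]
    if not 0 <= ftt <= 3:
        raise ValueError("mirroring supports FTT values from 0 to 3")
    return 2 * ftt + 1  # mirroring needs 2n+1 fault domains


def fault_domains_for_rebuild(ftt: int, erasure_coding: bool = False) -> int:
    """Add one fault domain so components can rebuild after a rack failure."""
    return min_fault_domains(ftt, erasure_coding) + 1


for ftt in (1, 2):
    print(ftt, min_fault_domains(ftt), fault_domains_for_rebuild(ftt))
```

With only the minimum number of racks, a rack failure leaves objects running with reduced redundancy until the rack returns or an administrator intervenes.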

Do you need additional hosts for a rebuild, or are you relying on administrator intervention?

Why would a rack fail in a platform? Are there rack interdependencies? Are multiple rack failures likely, one after another or both at the same time?

Do all the project workloads have the same platform requirements?

Can different approaches be used? Would the use of implicit fault domains, placing hosts thinly across racks rather than lots of servers in a lower number of racks, be more effective from a management or cost perspective?

Consider the addition of rebuild capacity and slack space. If using vSAN 7 U1, review the use of the new reserved capacity controls. Some features cannot be combined with fault domains.

Each approach has value and could help with capacity and complexity of operations; however, rack-to-rack networking and clustered cache sizing should also be considered.
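A quick way to reason about rebuild capacity is a back-of-the-envelope check like the Python sketch below. It is deliberately simplified: it treats the surviving racks as one pool of free space and ignores per-object placement, which real vSAN does not. The `FaultDomain` structure, the function name, and the 25% slack figure are illustrative assumptions, not values taken from any specific sizing guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultDomain:
    name: str
    capacity_tb: float
    used_tb: float


def can_rebuild_after_loss(
    domains: List[FaultDomain],
    failed: str,
    min_domains: int,
    slack_fraction: float = 0.25,  # illustrative slack reservation, an assumption
) -> bool:
    """Rough check: can the surviving fault domains absorb the failed one?"""
    lost = [d for d in domains if d.name == failed]
    if not lost:
        raise ValueError(f"unknown fault domain: {failed}")
    survivors = [d for d in domains if d.name != failed]
    # Not enough fault domains left to satisfy the policy at all.
    if len(survivors) < min_domains:
        return False
    # Free space on the survivors after keeping the slack reservation.
    free_tb = sum(
        max(0.0, d.capacity_tb * (1 - slack_fraction) - d.used_tb)
        for d in survivors
    )
    return free_tb >= lost[0].used_tb


racks = [FaultDomain(f"rack-{c}", 100.0, 50.0) for c in "abcd"]
print(can_rebuild_after_loss(racks, "rack-a", min_domains=3))
```

Running the same check with only three racks, or with the racks filled further, shows quickly when a design depends on administrator intervention rather than automatic rebuild.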

What is the scale of the deployment?

Fault domains require physical mapping and planning. It is a task carried out after VCF workload deployment, and if incorrectly configured or maintained, the feature could impact capacity and availability considerably.

Create a strategy/process for rack scaling following a capacity growth trigger. Consider using scripting/automation to maintain physical mapping.
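As one possible starting point for that automation, the Python sketch below compares a physical rack inventory with the configured fault domain membership and reports any drift. The two dictionaries are assumed to have been exported already (for example from a CMDB and from the cluster configuration); how they are collected is out of scope here, and the host and rack names are invented.

```python
from typing import Dict, Optional, Tuple

Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def find_mapping_drift(
    rack_inventory: Dict[str, str],
    fault_domains: Dict[str, str],
) -> Drift:
    """Map each mismatched host to (physical rack, configured fault domain)."""
    drift: Drift = {}
    # Hosts whose configured fault domain differs from, or lacks, their rack.
    for host, rack in sorted(rack_inventory.items()):
        configured = fault_domains.get(host)
        if configured != rack:
            drift[host] = (rack, configured)
    # Hosts configured in a fault domain but missing from the inventory.
    for host in sorted(set(fault_domains) - set(rack_inventory)):
        drift[host] = (None, fault_domains[host])
    return drift


# Invented example data: esx02 sits in the wrong fault domain and
# esx04 was racked but never added to one.
inventory = {"esx01": "rack-a", "esx02": "rack-a", "esx03": "rack-b", "esx04": "rack-c"}
configured = {"esx01": "rack-a", "esx02": "rack-b", "esx03": "rack-b"}

for host, (rack, fd) in find_mapping_drift(inventory, configured).items():
    print(f"{host}: physical rack={rack}, fault domain={fd}")
```

A check like this could run after every capacity growth trigger, so that newly racked hosts are not left in the default per-host fault domain.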

What is the impact on normal day 2 operations?

How many ESXi hosts per rack can be placed into maintenance with each vSAN option? What is the risk associated with the selected approach? Ensure this is well understood.

I have summarised these and other considerations, with links to useful documentation references, in the mind map below.

My vSAN Fault Domain Consideration Summary Mind Map