- Accepted Risks in Architectural Design
With the introduction of any new business solution, there will be risks associated with a technology or design.
Whether they stem from specific product constraints, technical knowledge gaps, or the budget and duration of a rollout, these risks are recorded in a project risk register that is often forgotten once the project team has released the solution.
An architect can spend many hours documenting, researching, and considering the logical and physical design decisions for various project areas.
“Some might say that we should never ponder.
On our thoughts today ’cause they hold sway over time.”
– Noel Gallagher – 1995
Every decision can affect the design and, potentially, how the solution meets its conceptual requirements (e.g., SLA/RPO/RTO, security, availability). Each decision can also introduce a risk that must be understood and mitigated to protect the business investment.
Having a systematic approach to enterprise risk management has become one of the most valuable takeaways from my VCDX journey and something I continually seek to improve in my fieldwork.
The specific framework or methodology may vary from project to project; however, the ability to relate technological decisions to business objectives is valuable.
In times of crisis, for example, with a service-impacting incident, a robust method of risk identification and design review is essential for any IT professional or technologist, not just someone with an architect design focus.
When faced with a seemingly unfixable problem that could cost a customer money or brand reputation, the fear of not knowing enough about a specific technology can be relentless, especially with the number of integration points and ever-changing approaches in the world today.
The ability to review, address, and mitigate technology areas in an agnostic manner can help calm these thoughts and help you move forward during long-running troubleshooting or projects in crisis.
Some Thoughts From the Field & for Certification Efforts.
The business has accepted this risk.
Often this is agreed upon without the overarching understanding of a solution.
As an architect, one objective is to minimize the risk impact of new technology, potentially within the operationalizing phase, for example by creating a specific monitoring process.
Once identified, the risk can potentially be lowered, and the initial manual process can be developed into one that automates, notifies, and corrects with minimal service impact.
Developing a risk-based specific check is different from applying a general cloud monitoring service or creating a new local monitoring product/instance.
The decision impacts another area; we don’t have responsibility for that (e.g., Networking, Security).
As an architect proposing a solution, the aim is to create a working product that meets requirements and a measurable service definition (e.g., SLA, performance, cost optimization, operational improvement).
Creating a new service with dependencies on other business areas or impacting existing layers without due diligence or review can be risky in the long term and hard to justify within architecture-based certifications such as the VCDX.
It’s out of scope for this project & It will be covered in the next phase.
Conflicting requirements and scope can be challenging within projects.
A pragmatic view of risk identification and mitigation is essential for this. What is the value of a project being delivered if it is not going to be successfully consumed or operationally reliable?
Many business transformation programs consist of multiple projects, and the number of projects grows as the business matures.
An error in one project could impact user confidence, create operational issues and hinder the transformation journey.
- vSAN Fault Domains | Some Design Thoughts
Recently I have been working on a number of projects using VCF and native vSAN with rack awareness requirements.
The vSAN fault domain feature is extremely useful to ensure that data component placement considers the physical rack architecture of the datacentre.
However, as with all features, there are design impacts and operational processes to consider.
Some questions I find useful to think about when using vSAN fault domains are:
Do you need fault domains at all? Does it solve your business requirement?
What disaster event do you need to protect against? Consider all areas of the infrastructure; additional features can add complexity and in some cases reduce flexibility. Depending on the requirements and physical platform, vSAN fault domains will not mask reduced redundancy at other layers of the datacentre (e.g., network/power links and diversity).
Are you planning for object availability with automatic rebuild?
When the vSAN fault domain feature is enabled and the domains are mapped within a cluster, the default of one fault domain per ESXi host is changed to a rack mapping. Depending on the FTT value, there is a minimum number of fault domains required. When planning the use of this feature, ensure the impact on vSAN capacity/availability following component failure is considered.
Do you need additional hosts for a rebuild, or are you relying on administrator intervention?
Why would a rack fail in a platform? Are there rack interdependencies? Are there likely to be multiple rack failures, one after another or both at the same time?
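The relationship between FTT and the minimum number of fault domains can be sketched as a small helper. This is a minimal sketch, assuming the commonly documented rules for the original vSAN storage architecture (RAID-1 needs 2 x FTT + 1 fault domains, RAID-5 needs 4, RAID-6 needs 6); the function names are hypothetical and the rules should be checked against the vSAN release in use.

```python
# Assumption: vSAN original storage architecture placement rules.
# RAID-1 mirroring needs 2*FTT+1 fault domains; RAID-5 needs 4; RAID-6 needs 6.

def min_fault_domains(ftt: int, erasure_coding: bool = False) -> int:
    """Minimum number of fault domains needed to place an object's components."""
    if ftt not in (1, 2, 3):
        raise ValueError("FTT is expected to be 1, 2 or 3")
    if erasure_coding:
        if ftt == 1:
            return 4  # RAID-5
        if ftt == 2:
            return 6  # RAID-6
        raise ValueError("erasure coding is assumed to support FTT 1 or 2 only")
    return 2 * ftt + 1  # RAID-1: FTT+1 replicas plus witness components


def can_self_heal(racks: int, ftt: int, erasure_coding: bool = False) -> bool:
    """True if the minimum is still met after one rack (fault domain) is lost,
    so an automatic rebuild has somewhere to place components."""
    return racks - 1 >= min_fault_domains(ftt, erasure_coding)


if __name__ == "__main__":
    for racks in (3, 4, 5, 6):
        print(racks, "racks, FTT=1 RAID-1, self-heal:", can_self_heal(racks, 1))
```

With only the minimum number of racks, a rack failure leaves objects accessible but with reduced redundancy until an administrator restores the rack; one additional rack is what allows the rebuild to happen without intervention.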
Do all the project workloads require the same platform requirements?
Can different approaches be used? Would the use of implicit fault domains, placing hosts thinly across racks rather than many servers in fewer racks, be more effective from a management or cost perspective?
Consider the addition of rebuild capacity and slack space. If using vSAN 7.0 U1, review the use of the new reserved capacity controls. Some features cannot be combined with fault domains.
Each approach has value and could help with capacity and complexity of operations; however, rack-to-rack networking and clustered cache sizing should also be considered.
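The rebuild capacity question can be roughed out with simple arithmetic before any detailed sizing. The sketch below is an illustration only: it assumes raw capacity figures per rack, a hypothetical slack fraction of 30%, and it ignores component placement rules, deduplication and compression. The function and parameter names are not from any VMware tool.

```python
from typing import Sequence


def survives_rack_loss(rack_capacity_tb: Sequence[float],
                       used_tb: float,
                       slack_fraction: float = 0.3) -> bool:
    """Check whether consumed capacity still fits, with slack space kept free,
    after the largest rack is lost.

    rack_capacity_tb: raw vSAN capacity contributed by each rack.
    used_tb: raw capacity consumed, including storage policy overhead.
    slack_fraction: hypothetical share of capacity to keep free.
    """
    if len(rack_capacity_tb) < 2:
        raise ValueError("at least two racks are needed for this check")
    # Worst case: the rack contributing the most capacity fails.
    remaining = sum(rack_capacity_tb) - max(rack_capacity_tb)
    usable = remaining * (1 - slack_fraction)
    return used_tb <= usable


if __name__ == "__main__":
    print(survives_rack_loss([100, 100, 100, 100], used_tb=200))
    print(survives_rack_loss([100, 100, 100], used_tb=150))
```

Spreading the same hosts thinly across more racks reduces the capacity lost with any single rack, which is one way the implicit approach can help with rebuild headroom.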
What is the scale of the deployment?
Fault domains require physical mapping and planning. Configuring them is a task performed after VCF workload deployment, and if incorrectly configured or maintained, the feature could impact capacity and availability considerably.
Create a strategy/process for rack scaling following a capacity growth trigger. Consider using scripting/automation to maintain physical mapping.
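As one possible shape for that automation, the check below compares a physical rack inventory with the configured fault domain of each host and reports any drift. It is a sketch under stated assumptions: both mappings are plain dictionaries that would be exported from a CMDB and from the cluster by some other means, fault domains are assumed to be named after racks, and all names here are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def fault_domain_drift(physical_rack: Dict[str, str],
                       configured_fd: Dict[str, Optional[str]]) -> List[str]:
    """Return one message per host whose fault domain does not match its rack."""
    issues = []
    for host, rack in sorted(physical_rack.items()):
        fd = configured_fd.get(host)  # None if the host has no fault domain
        if fd != rack:
            issues.append(f"{host}: rack={rack}, fault domain={fd}")
    return issues


def hosts_per_fault_domain(configured_fd: Dict[str, Optional[str]]) -> Counter:
    """Count hosts in each fault domain to spot unbalanced racks."""
    return Counter(fd for fd in configured_fd.values() if fd is not None)


if __name__ == "__main__":
    physical = {"esx01": "rack-a", "esx02": "rack-b", "esx03": "rack-c"}
    configured = {"esx01": "rack-a", "esx02": "rack-a"}
    for issue in fault_domain_drift(physical, configured):
        print(issue)
    print(hosts_per_fault_domain(configured))
```

Running a check like this after every host addition or rack move keeps the logical mapping aligned with the physical layout as the platform scales.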
What is the impact on normal day 2 operations?
How many ESXi hosts per rack can be placed into maintenance with each vSAN option? What is the risk associated with the selected approach? Ensure this is well understood.
I have summarised these and other considerations, with links to useful documentation references, in a mind map below.
- 2020 Tap Cancer Out Global Grappling Day – Virtual Edition
When I am not tinkering with technology, I am normally found training BJJ.
This year my daughter and I signed up for the virtual Global Grappling Day in aid of Tap Cancer Out.
Thank you for the donations. As you can see below, we had some fun rolling for a worthwhile cause.
Thank you to VMware for the matching gifts.
Tap Cancer Out – Global Grappling Day