By Paul McSharry, blogger and vExpert
I recently finished teaching the VMware vSphere 6 troubleshooting workshop. The 5-day course is amazing: it’s fun to teach and I’ve had some great feedback on the content.
The workshop-style course makes for a much more interactive learning experience. Rather than traditional module feature walkthroughs followed by a lab with step-by-step instructions, the course can be built around real-life experience. It allows the attendee to run a PowerCLI "break" script to create a problem, and then gives them the opportunity to find the solution themselves.
For this particular course I start with a review of troubleshooting processes, but add in a discussion which considers design principles and relates them back to the troubleshooting scenarios.
What makes a good design?
In a break-fix situation, the fix is important, but how often do we ask ourselves whether our fix has impacted the design? What is the end result for the application workloads? How do you update the design or documentation afterwards?
Each of the 5 datacenter layers (Management, Virtual Machine, Compute, Network, and Storage) is covered throughout the course, showing how to collect logs and troubleshoot common areas in each layer.
I’d expect each instructor to lead the workshop in a slightly different way. My approach was to discuss common areas, processes, and specific examples from scary P1 incidents that have happened to me.
I also covered the common issues and errors that you see day to day as a vAdmin. Everyone in the course works together to develop a troubleshooting process and walk through some daily checks to try to prevent issues proactively. For each example we asked:
- How did we solve it?
- What checks did we do?
- How would we prevent or detect a similar issue now?
This leads into common monitoring techniques using solutions such as vROps.
As an exercise to consolidate the information throughout the week, and to help attendees find a place to start in the break-fix labs and real-life problem areas, I created several troubleshooting mind maps.
These illustrate the common areas to review and the underlying thought process and concepts, and serve as a structured method for resolving potential configuration and performance issues.
As shown in the mind map (click the image below), the thought process before changing anything is to:
- Understand the trend of the key metrics based on experience or historical information
- Check configuration as per design
- Back up / document configuration prior to change
- Remediate or change
- Verify the fix
- At this point, verify whether any change has impacted the intended design qualities of the infrastructure (e.g. security, performance)
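The steps above can be sketched in a few lines of Python. This is only an illustration of the thought process, not material from the course: the `config_drift` helper, the setting names, and their values are all hypothetical, and a real environment would pull the actual settings from vCenter or PowerCLI rather than a hard-coded dictionary.

```python
from copy import deepcopy


def config_drift(design, actual):
    """Return {setting: (designed, actual)} for every setting that differs."""
    keys = sorted(set(design) | set(actual))
    return {
        k: (design.get(k), actual.get(k))
        for k in keys
        if design.get(k) != actual.get(k)
    }


# Hypothetical design baseline and current host settings (illustrative values only).
design = {"ntp_server": "ntp.example.com", "mtu": 9000, "ssh_enabled": False}
actual = {"ntp_server": "ntp.example.com", "mtu": 1500, "ssh_enabled": False}

backup = deepcopy(actual)               # back up / document before changing anything
before = config_drift(design, actual)   # check configuration as per design

for key, (designed, _) in before.items():
    actual[key] = designed              # remediate (stand-in for the real change)

after = config_drift(design, actual)    # verify the fix against the design

print("drift before:", before)
print("drift after:", after)
```

Keeping the pre-change copy in `backup` means the change can be documented or rolled back, and re-running the same comparison after the fix is a simple way to confirm the platform still matches its design.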
By following such a method, troubleshooting and break-fix work should ultimately become far less risky. The design and implementation of an infrastructure solution is a large investment for a company and represents many hours of work.
Having a platform deviate from its design a few days after implementation because of unplanned troubleshooting is a common frustration and a potential risk.
Key takeaways from the course
- Take the time to develop a troubleshooting process
- Understand key areas to monitor and check in the real world
- Interactive discussions on support issues with other vAdmins
- Guiding principles for change control, root cause analysis
- Guiding questions and informational areas for liaising with engineers in other silos (e.g. network, storage)
The mind maps are available as a download over on www.elasticsky.co.uk.
You can hear more thoughts from Paul McSharry over on Twitter.