Livesite, resource names and maintaining sanity under stress

Coming from CSS (Customer Service and Support), I was used to working under pressure. As you can imagine, when a customer opens a support ticket it means something is broken or, at the bare minimum, is not working as they would like, so being able to quickly figure out the root cause of the problem and suggest how to resolve it is a key component of the role. Equally important is being able to manage those situations when things are really badly broken and the stakes are high: imagine an e-commerce website where transactions keep failing for some reason. The customer is losing money and their customers are unhappy (frustrated? fuming?) and likely taking their business elsewhere. Not nice. As a Service Engineer in Azure, when one of our services is down it does not impact one customer, it impacts half a continent!

Something I learned quickly in my new role is to think in terms of livesite. What happens if I need to do “x” during a livesite incident? How quickly can I find that information, or get to that tool? This applies to almost everything I do, from seemingly negligible decisions (I need a new Storage Account, how do I name it?) to discussions that will help shape how the Services I support will grow.

Why bother with the small things? Simple. Everything grows over time and can quickly become unmanageable. Let’s take the Storage Account example: why would I care?

A few months after moving to Redmond I started working with a new Service that was still under development internally. It was not even in Private Preview yet (that service later became Azure Automation) and I had the opportunity to build the production environment from scratch. I worked closely with the Engineering team on requirements and specifications and provisioned the entire infrastructure (Cloud Services, databases, Storage, Networking, certificates and everything else you can imagine goes into building a live Cloud Service). Among other things, since I was the only Service Engineer working on the project at the time, I was given the choice to lay out the environment and shape it the way I thought was best, naming resources the way I liked (“You’ll have to manage them so it’s your choice” is more or less what I was told). That turned out to be key.

The Service went into Private Preview in one Region, followed by a second one 3-4 months later. After going GA (General Availability) growth was slow at first, but it didn’t take long to really pick up pace and we are (as of today) live in 26 Public Azure Regions and 5 Sovereign Cloud Regions (with more already in the pipeline, being built and validated before being deemed ready for customer consumption). Doing some quick math, I currently manage directly or indirectly about 300 Cloud Services, over 200 Storage Accounts, 60 SQL databases, 100 Cosmos DB accounts, roughly 10,000 VMs plus networking (Traffic Manager and the like), and the provisioning and rotation of hundreds of certificates and keys, all split across about 70 Azure Subscriptions. Considering other teams and Services I collaborate with, I can get access to almost 900 Subscriptions (!)… That’s a lot of moving parts, and I will not even mention our CI/CD (Continuous Integration/Continuous Deployment) solution, with services and components being deployed and updated in an endless cycle across the globe. All this requires good tools, tons of automation and solid processes.

So, what does all of this have to do with naming a Storage Account? Imagine a livesite incident happens and you have to narrow down the problem to a single component, in a specific region/service/cluster, so that you can take the appropriate action (mitigating the customer impact as fast as possible is the number one priority). Of course you have your monitors and traces to give you details about what is happening, and imagine it is a StorageException. But which storage account? Where is it? Wouldn’t it be nice if you had a descriptive, self-explanatory resource name rather than a jumble of letters and numbers to guide you?

Over time I tried to suggest (sometimes at the risk of imposing) a naming convention that I find useful in those situations and that, more generally, is easy to remember and use in day-to-day discussions. For example, we always use the Azure Region acronym to identify where the resource is located, some identifier for the resource type, an identifier for the scope or use of that resource and, also helpful, a counter (literally a number) because sometimes we need multiple instances of a given resource and we want to keep them properly ordered. So, for example, webackupstorage1 could be a storage account used to host backups in West Europe, and it is the first storage account of this type we have in the region. (I just made up the name for this post so don’t try to access it, it does not exist.)
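To make the convention concrete, here is a minimal sketch in Python of how such names could be built and decoded again during an incident. It is only an illustration, not the tooling we actually use: the region acronym table (apart from "we" for West Europe), the list of resource types and the helper names `build_name` and `parse_name` are all made up for this post. The one hard rule it relies on is that Azure storage account names must be 3 to 24 characters long and contain only lowercase letters and digits.

```python
import re
from typing import NamedTuple

# Hypothetical acronym table, for illustration only. "we" (West Europe)
# comes from the example in this post; the other entries are assumptions.
REGION_ACRONYMS = {
    "we": "West Europe",
    "ne": "North Europe",
    "eus": "East US",
}

# Hypothetical resource-type identifiers used by this sketch.
RESOURCE_TYPES = ("storage", "sql", "cosmos")


class ResourceName(NamedTuple):
    region: str         # region acronym, e.g. "we"
    scope: str          # what the resource is used for, e.g. "backup"
    resource_type: str  # e.g. "storage"
    counter: int        # instance number within the region


def build_name(region: str, scope: str, resource_type: str, counter: int) -> str:
    """Compose <region><scope><type><counter>, e.g. webackupstorage1."""
    if region not in REGION_ACRONYMS:
        raise ValueError(f"unknown region acronym: {region!r}")
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    name = f"{region}{scope}{resource_type}{counter}".lower()
    # Storage account names: 3-24 characters, lowercase letters and digits.
    if resource_type == "storage" and not re.fullmatch(r"[a-z0-9]{3,24}", name):
        raise ValueError(f"not a valid storage account name: {name!r}")
    return name


def parse_name(name: str) -> ResourceName:
    """Split a resource name back into its parts."""
    type_pattern = "|".join(RESOURCE_TYPES)
    # Try the longest acronyms first so a short one cannot shadow a long one.
    for acronym in sorted(REGION_ACRONYMS, key=len, reverse=True):
        if not name.startswith(acronym):
            continue
        rest = name[len(acronym):]
        match = re.fullmatch(rf"([a-z]+?)({type_pattern})(\d+)", rest)
        if match:
            scope, resource_type, counter = match.groups()
            return ResourceName(acronym, scope, resource_type, int(counter))
    raise ValueError(f"name does not follow the convention: {name!r}")


print(build_name("we", "backup", "storage", 1))  # webackupstorage1
print(parse_name("webackupstorage1"))
```

With something like this, the name that shows up in a trace can be turned straight back into "West Europe, backup, storage account number 1" without looking anything up.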

I apply the same principle to my tools and my machines. Over time I came up with a setup I am comfortable with, and I wrote a PowerShell script I use to prep every new machine I work with. I keep all my tools and documents in OneDrive so the location on disk is always the same (my fingers automatically type it without me even thinking). Moreover, I create junctions in the root of my C: drive for the most important folders I want easy access to. Typing C:\Utility is way faster than typing C:\Users\name\OneDrive\Utility (this is the folder where I keep all the tools I use most frequently). Even better, I add C:\Utility to my system path so that from the command line I just type the command I need and it is readily available no matter what my current folder is.
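My actual prep script is PowerShell and does a lot more, but the junction part can be sketched in a few lines. The version below is written in Python purely for illustration; the function names and the dry-run default are assumptions made for this post. It relies on `mklink /J <link> <target>`, which is a cmd.exe built-in and therefore has to be launched through `cmd /c`. Adding the folder to the system path is deliberately left out.

```python
import os
import subprocess
from pathlib import PureWindowsPath


def junction_command(link, target):
    """Return the cmd.exe invocation that creates a directory junction."""
    # mklink is built into cmd.exe, so it cannot be started on its own.
    return ["cmd", "/c", "mklink", "/J",
            str(PureWindowsPath(link)), str(PureWindowsPath(target))]


def create_junctions(onedrive_root, folders, drive_root="C:\\", dry_run=True):
    """Create one junction in drive_root for each folder under onedrive_root.

    With dry_run=True (the default in this sketch) nothing is executed and
    the commands are only returned, so they can be reviewed first.
    """
    commands = []
    for folder in folders:
        link = PureWindowsPath(drive_root) / folder
        target = PureWindowsPath(onedrive_root) / folder
        command = junction_command(link, target)
        commands.append(command)
        # Only touch the disk when explicitly asked to, and only on Windows.
        if not dry_run and os.name == "nt":
            subprocess.run(command, check=True)
    return commands


for command in create_junctions(r"C:\Users\name\OneDrive", ["Utility"]):
    print(" ".join(command))
```

Running it in dry-run mode first prints the exact `mklink` commands, which is a cheap way to double check the paths before anything is created on a new machine.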

Clear communication is key, and prompt access to past communications is equally important. Searching old emails can be a daunting task, especially if your inbox is really a recycle bin for all sorts of messages and you don’t have a good mechanism to organize them (folders, flags, categories etc.). Office applications (particularly Outlook, but OneNote as well) support advanced queries to help you quickly find what you are looking for (or at least narrow down the results as much as possible).

The bottom line is: to be efficient you have to be comfortable (with your tools, your environment, your information, your knowledge etc.) and minimize the stress of the situation as much as possible (there is plenty flying around already). A setup like this is continuously evolving (I’m always refining mine) and it is personal: something that works for me may not be good for you, and vice versa. Don’t settle for what seems good enough now, because you can always do better; just keep working on it and keep improving, even on small and apparently unimportant things such as a storage account name or the location of your tools.

Inside of a ring or out, ain’t nothing wrong with going down. It’s staying down that’s wrong. – Muhammad Ali
