Making k3s Self-Aware

Over the past couple of bank holidays I’ve kept playing around with k3s, which is a fun way to take my mind off the end-of-fiscal-year madness that peaks around this time. In this installment, we’re going to start making it self-aware, or, at the very least, infrastructure-aware, which is the only real way to do truly flexible cloud solutions.

Scaling Conundrums

One of the things I’ve been doing is figuring out how to deploy long-running batch processes on Kubernetes (think Blender, render farms, etc.).

Ordinarily you’d just pack things into a DaemonSet, set your pod affinity and call it a day, but running anything in a containerized environment doesn’t magically provide you with more cores–you can scale your containers horizontally, sure, but you’re only managing whatever capacity you already have provisioned.

My Azure cluster template, which is based on earlier stuff I did for horizontal scaling, does two things to help out with that:

It deploys compute nodes separately from shared storage, so I can tear down the entire cluster, rebuild it against the same shared volume and pick up from where I left off
It uses an Azure compute abstraction called a virtual machine scale set for the agent pool, which makes it easy to deploy a set of identical virtual machines and resize them either vertically (add more cores) or horizontally (add more agents)

There are built-in helpers for auto-scaling, in that you can have (for instance) a CPU metrics trigger for adding new nodes to a scale set, and thanks to a few simple helper scripts I wrote, new nodes will pop up in kubectl as they are booted, and the DaemonSet will spread across them uneventfully.

But Azure auto-scaling shares a significant limitation with just about every other similar mechanism–it hasn’t a clue about what your application is really up to.

Which brings us to a particular pet peeve of mine: it doesn’t scale down nicely; i.e., if you tell it to scale down by one node every 5 minutes as soon as average CPU load across the scale set is, say, below 50%, it will remove the last VM it added rather than an idle one–which breaks batch processes, since the most recent machines are usually the ones still working.
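To make the difference concrete, here is a toy sketch contrasting "remove the newest VM" with "remove an idle one". It is not Azure's actual algorithm: the `Node` fields, the sample values and the 10% idle threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    created_at: int   # hypothetical creation timestamp, in seconds
    cpu_load: float   # hypothetical average CPU load, 0.0 to 1.0


def newest_node(nodes: List[Node]) -> Node:
    """The behaviour described above: drop the last VM that was added."""
    return max(nodes, key=lambda n: n.created_at)


def idle_node(nodes: List[Node], threshold: float = 0.10) -> Optional[Node]:
    """What a batch workload wants: drop the least busy node, if it is idle."""
    candidate = min(nodes, key=lambda n: n.cpu_load)
    return candidate if candidate.cpu_load < threshold else None


nodes = [
    Node("agent0", created_at=100, cpu_load=0.02),  # finished its job
    Node("agent1", created_at=200, cpu_load=0.95),
    Node("agent2", created_at=300, cpu_load=0.97),  # newest, still rendering
]
```

With this sample data the first policy picks `agent2` (which is still working) and the second picks `agent0`, returning `None` when nothing is idle enough to remove.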

To be fair, it isn’t a trivial problem, but it is one I can solve by making k3s infrastructure-aware, and having it manage its own Azure compute resources.

There is a fair amount of prior art around for this. In fact, I originally thought about adopting this nice sample, but not only would I not learn anything by just dropping it in, in the end I thought it was too clunky–it does zero auto-discovery (which I find limiting) and requires a bunch of external components that I can implement as a much simpler Python or Go service¹.

And yes (again), I’m reinventing AKS here, for my own entertainment.

Self-Aware Infrastructure

So, how do you get your infrastructure to manage itself?

If you’re coming in fresh from the AWS boat, you’re probably going “IAM roles!”. And yes, Azure allows you to do almost exactly the same thing, through what it calls managed service identities.

A managed service identity (MSI for short) can be provided to a compute resource (a VM, an Azure function, etc.), and affords that resource the ability to perform Azure API calls to manage itself (or other resources) inside a given scope.

This means a VM can literally resize itself, or (in this case) that my k3s master node can manage the scale set where its node pool lives: It can add more nodes, change their VM type, reboot them, etc.

The managed service identity can be user-assigned or system-assigned, but to be actionable it has to have an Azure AD role assignment with the right privileges (in my case I can use the Virtual Machine Contributor pre-built role, which has the ability to manage VMs and scale sets).

Setting up the Managed Service Identity

I initially tried doing this with a user-assigned identity (just so that I could have a little bit more control and re-use the same identity with other resources), but it was easier to use a system-assigned identity.

Assuming that, like me, you’re doing everything through Azure Resource Manager templates (I also use Terraform, but all of my scale set stuff has usually been done in ARM), the only thing you need to get a valid identity is adding the following attribute to your VM resource definition:

      "identity": {
        "type": "SystemAssigned"
      },

Obviously, there is a little more to the VM definition than that, so for clarity, here is a more complete stanza for a VM with an MSI:

   {
      "comments": "Master Node",
      "type": "Microsoft.Compute/virtualMachines",
      "identity": {
        "type": "SystemAssigned"
      },
      "name": "[parameters('masterName')]",
      "apiVersion": "2019-03-01",
      "location": "[resourceGroup().location]",
      "properties": {...}
      ...
   }

Role Bindings

The trickier thing, though, is assigning it a role. You do that by creating a Microsoft.Authorization/roleAssignments resource, but there are a couple of things that you need to do for it to work properly, and which had me stumped for a while:

The name property of that resource needs to be a GUID, which is counter-intuitive even considering some of the oddities I’ve come across while doing ARM templating.
You also need to specify an Azure AD Role GUID for the role you are assigning–i.e., you cannot use a human-readable name like “Virtual Machine Contributor”.

As it turns out, there doesn’t seem to be a trivial way to look up Azure AD management roles by name from inside ARM templates. You can’t even see the GUIDs in the portal, and resources.azure.com does not list the Microsoft.Authorization provider for some reason.

So I ended up using the az CLI to look up the role GUID I wanted:

$ az role definition list | jq -c '.[] | {roleName: .roleName, name: .name}' | grep "Virtual Machine Contributor"
{"roleName":"Classic Virtual Machine Contributor","name":"d73bb868-a0df-4d4d-bd69-98a00b01fccb"}
{"roleName":"Virtual Machine Contributor","name":"9980e02c-c2be-4d73-94e8-173b1dc7cf3c"}
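Note that grep matches substrings, which is why the "Classic" role shows up as well. If an exact match is preferable, the same lookup can be sketched in Python. The `role_guid` helper is hypothetical, and `SAMPLE` simply reuses the two entries printed above; in practice the list would come from parsing the JSON that `az role definition list` prints.

```python
import json
from typing import Optional

# Two entries shaped like the jq output above
SAMPLE = json.loads("""
[
  {"roleName": "Classic Virtual Machine Contributor",
   "name": "d73bb868-a0df-4d4d-bd69-98a00b01fccb"},
  {"roleName": "Virtual Machine Contributor",
   "name": "9980e02c-c2be-4d73-94e8-173b1dc7cf3c"}
]
""")


def role_guid(definitions: list, role_name: str) -> Optional[str]:
    """Return the GUID of the role whose name matches exactly, or None."""
    for definition in definitions:
        if definition.get("roleName") == role_name:
            return definition.get("name")
    return None


print(role_guid(SAMPLE, "Virtual Machine Contributor"))
```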

I then assigned it to a template variable like so:

   "variables": {
      "virtualMachineContributor": "[concat('/subscriptions/', subscription().subscriptionId, '/providers/Microsoft.Authorization/roleDefinitions/', '9980e02c-c2be-4d73-94e8-173b1dc7cf3c')]"
      ...
   }

As to the role assignment itself, a few days after I got this working with the system assigned identity, a colleague of mine showed me how to do the same with a user-assigned identity.

First off, you need to create a specific resource for the identity itself:

   {
      "type": "Microsoft.ManagedIdentity/userAssignedIdentities",
      "name": "[variables('managedIdentityName')]",
      "apiVersion": "2018-11-30",
      "location": "[resourceGroup().location]"
   }

…and change the identity field in the VM stanza to reference it like this:

   "identity": {
      "type": "userAssigned",
      "userAssignedIdentities": {
         "[resourceID('Microsoft.ManagedIdentity/userAssignedIdentities', variables('managedIdentityName'))]": {}
      }
   }

…as well as making sure the VM is created after the managed identity by adding the identity resource to the dependsOn field of the VM stanza:

   "dependsOn": [
      "[resourceID('Microsoft.ManagedIdentity/userAssignedIdentities', variables('managedIdentityName'))]",
      ...
   ]

That I had managed to do all by myself earlier on during testing.

But I was lacking the displayName (which is apparently required in this case), and I had both the role assignment syntax and the API version wrong.

I found that difficult to figure out from the documentation (which still lacks proper examples of some features) so here’s mine for the record:

   {
      "type": "Microsoft.Authorization/roleAssignments",
      "name": "[variables('rbacGuid')]",
      "apiVersion": "2017-09-01",
      "properties": {
         "displayName": "[variables('managedIdentityName')]",
         "roleDefinitionId": "[variables('virtualMachineContributor')]",
         "principalId": "[reference(resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', variables('managedIdentityName')), '2018-11-30', 'Full').properties.principalId]",
         "scope": "[resourceGroup().id]"
      },
      "dependsOn": [
         "[resourceID('Microsoft.ManagedIdentity/userAssignedIdentities', variables('managedIdentityName'))]",
         "[resourceId('Microsoft.Compute/virtualMachines', parameters('masterName'))]"
      ]
   }

In this case, the scope is the entire resource group, since I set up my clusters as two separate resource groups (so this will only give the MSI access to the one it’s in) and the role only allows this identity to manage VMs and VM scale sets. You can be as granular as you want, but in my case the intersection of the role itself with the resources inside that group is restricted enough (and I want the master to be able to perform operations on itself at a later stage).
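Going back to the name property: the template above takes it from an rbacGuid variable whose definition isn't shown. One way to produce a suitable value outside the template is to derive a deterministic GUID from whatever makes the assignment unique, so that redeploying yields the same name. The sketch below is an assumption about how that could be done rather than the method used here; `role_assignment_name` and the sample inputs are hypothetical. (ARM templates also offer a `guid()` function that serves a similar purpose from inside the template.)

```python
import uuid


def role_assignment_name(scope: str, principal: str, role_definition_id: str) -> str:
    """Derive a stable GUID from the scope, the principal and the role."""
    seed = "|".join(
        part.lower() for part in (scope, principal, role_definition_id)
    )
    # uuid5 is a name-based UUID: the same seed always gives the same GUID
    return str(uuid.uuid5(uuid.NAMESPACE_URL, seed))


# Hypothetical inputs, for illustration only
name = role_assignment_name(
    "/subscriptions/<guid>/resourceGroups/acme-prod-k3s-cluster",
    "master-identity",
    "9980e02c-c2be-4d73-94e8-173b1dc7cf3c",
)
print(name)
```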

Using the MSI to invoke Azure APIs

Since this post is getting to be a bit long, I’m only going to go over the basics of how you can then use your service identity from inside the associated VM.

There are essentially three steps:

Discover your subscription ID and resource group (since you really don’t want to hard-code that into your scripts)
Obtain a bearer token for authentication against Azure Resource Manager (this is where your managed identity will come into its own)
Invoke whatever API you need

The first step can be done through the instance metadata service, which works quite similarly to its AWS counterpart; i.e., you invoke a private URL and get whatever metadata the platform has for the instance.

This is trivial to do using requests (I like to use a Session object to make sequential calls more efficient):

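import requests

# A Session reuses the underlying connection across sequential calls
session = requests.Session()

res = session.get(
   "http://169.254.169.254/metadata/instance",
   headers = {
      "Metadata": "true"
   },
   params = {
      "api-version": "2018-10-01"
   }
)
if res.status_code == 200:
   instance_metadata = res.json()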

The results are quite useful, really (abbreviated for clarity):

{
    "compute": {
        "azEnvironment": "AzurePublicCloud",
        "publisher": "Canonical",
        "sku": "18.04-LTS",
        "version": "18.04.201906040",
        "provider": "Microsoft.Compute",
        "location": "eastus",
        "name": "master0",
        ...
        "resourceGroupName": "acme-prod-k3s-cluster",
        "subscriptionId": "<guid>",
        "vmId": "<guid>",
        "vmScaleSetName": "",
        "vmSize": "Standard_B2ms",
        "zone": ""
    },
    "network": {
       ...
    }
}

We can leave that result aside for the moment–moving on to the second step, we’re going to ask the same service for a bearer token to authenticate against management.azure.com:

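res = session.get(
   "http://169.254.169.254/metadata/identity/oauth2/token",
   headers = {
      "Metadata": "true"
   },
   params = {
      "api-version": "2018-02-01",
      "resource": "https://management.azure.com/"
   }
)
if res.status_code == 200:
   # Keep the parsed response around for the management call further down
   token = res.json()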

And this is where the managed service identity kicks in, providing us with said token and the MSI’s client_id:

{
    "access_token": "<humungously long string>",
    "client_id": "<guid of our MSI>",
    "expires_in": "28800",
    "expires_on": "1560620938",
    "ext_expires_in": "28800",
    "not_before": "1560591838",
    "resource": "https://management.azure.com/",
    "token_type": "Bearer"
}

So, zero need to hack any of these values into config files–the managed service identity makes authenticating to these endpoints a seamless affair.
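Since the response carries its own validity window, a long-running service would probably want to reuse the token and only ask for a new one when it is close to expiring. A minimal sketch of that check follows; `token_is_fresh` and the five-minute safety margin are assumptions, and note that the timestamps arrive as strings, as shown above.

```python
import time
from typing import Optional


def token_is_fresh(token: dict, now: Optional[float] = None, margin: int = 300) -> bool:
    """True if the token is valid now and for at least `margin` more seconds."""
    now = time.time() if now is None else now
    not_before = int(token["not_before"])   # epoch seconds, sent as a string
    expires_on = int(token["expires_on"])   # epoch seconds, sent as a string
    return not_before <= now and now + margin < expires_on


# The validity window from the sample response above
sample = {"not_before": "1560591838", "expires_on": "1560620938"}
```

A caller would then fetch a new token from the metadata service whenever `token_is_fresh(token)` returns `False`.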

And if we tie them together with the subscriptionId and the resourceGroupName we got from the initial call, we can now issue an actual management API request:

# Let's merge in the agent scale set name for the sake of brevity
# (it goes last so that it overrides the empty value in the metadata):
data = {
   **instance_metadata["compute"],
   "vmScaleSetName": "agents"
}

res = session.get(
   "https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmScaleSetName}".format(**data),
   headers = {
      "Authorization": "Bearer {access_token}".format(**token)
   },
   params = {
      "api-version": "2018-06-01"
   }
)

…which gives us the following:

{
   "id": "/subscriptions/<guid>/resourceGroups/acme-prod-k3s-cluster/providers/Microsoft.Compute/virtualMachineScaleSets/agents",
   "location": "eastus",
   "name": "agents",
   "properties": {
      "overprovision": false,
      "provisioningState": "Succeeded",
      "singlePlacementGroup": true,
      "uniqueId": "<guid>",
      "upgradePolicy": {
         "automaticOSUpgrade": false,
         "mode": "Manual"
      },
      "virtualMachineProfile": {
         "diagnosticsProfile": {
               "bootDiagnostics": {
                  "enabled": true,
                  "storageUri": "https://someplace.blob.core.windows.net"
               }
         }
      },
      ...
   }
}

You will notice that Python 3’s .format() came in very handy there.

In fact, dealing with Azure APIs with dynamic languages is great, and the whole thing can be summarized as:

session = requests.Session()
token = get_token(session)
metadata = get_metadata(session)["compute"]
metadata["vmScaleSetName"] = "agents" # or discover it
do_management_call_scaleset_info(session, metadata, token)
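The helpers named in that summary aren't spelled out, so here is one way they might look, pieced together from the calls shown earlier. This is a sketch, not the original implementation: the function bodies, the `scaleset_url` helper and the use of `raise_for_status()` are assumptions, and `session` is expected to be a `requests.Session()` (or anything with a compatible `get` method).

```python
IMDS = "http://169.254.169.254/metadata"
ARM = "https://management.azure.com"


def get_metadata(session):
    """Fetch the instance metadata document for this VM."""
    res = session.get(
        IMDS + "/instance",
        headers={"Metadata": "true"},
        params={"api-version": "2018-10-01"},
    )
    res.raise_for_status()
    return res.json()


def get_token(session, resource=ARM + "/"):
    """Ask the metadata service for a bearer token for the given resource."""
    res = session.get(
        IMDS + "/identity/oauth2/token",
        headers={"Metadata": "true"},
        params={"api-version": "2018-02-01", "resource": resource},
    )
    res.raise_for_status()
    return res.json()


def scaleset_url(metadata):
    """Build the management URL from the compute metadata fields."""
    return (
        ARM + "/subscriptions/{subscriptionId}"
        "/resourceGroups/{resourceGroupName}"
        "/providers/Microsoft.Compute/virtualMachineScaleSets/{vmScaleSetName}"
    ).format(**metadata)


def do_management_call_scaleset_info(session, metadata, token):
    """Retrieve the scale set description from Azure Resource Manager."""
    res = session.get(
        scaleset_url(metadata),
        headers={"Authorization": "Bearer {access_token}".format(**token)},
        params={"api-version": "2018-06-01"},
    )
    res.raise_for_status()
    return res.json()
```

Taking the session as a parameter keeps the functions easy to exercise with a stub object, which helps when developing away from an Azure VM, where the metadata endpoint isn't reachable.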

Next Steps

I am now working through the process of coding in basic scale set management (shutdown, add new VM, etc.) as well as being able to identify idle VMs in a scale set, and will eventually pack the results into a Go executable that will do the kind of “smart” downscaling I need.

That will then go into a suitable container that can also talk to k3s APIs, so that managing the VMs can tie into Kubernetes resources. Two things I’d like to do are (possibly) evict pods from nodes about to be killed (I already force that when machines are about to be shut down, but it’s a bit rough) and postpone scheduling of new pods until there is enough capacity.

Those two are the only really tricky problems here, since picking out idle machines and terminating them seems like a fairly trivial proposition.
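For the terminating part, removing specific instances (rather than letting the platform choose) might look something like the sketch below. It only builds a description of the request instead of sending it. The `/delete` action with an `instanceIds` body reflects the scale set REST API as documented, but treat the exact shape and the API version as assumptions to check against the reference; `delete_instances_request` is a hypothetical helper.

```python
def delete_instances_request(metadata: dict, instance_ids: list) -> dict:
    """Describe a request that removes specific VMs from a scale set."""
    url = (
        "https://management.azure.com/subscriptions/{subscriptionId}"
        "/resourceGroups/{resourceGroupName}"
        "/providers/Microsoft.Compute/virtualMachineScaleSets/{vmScaleSetName}"
        "/delete"
    ).format(**metadata)
    return {
        "method": "POST",
        "url": url,
        "params": {"api-version": "2018-06-01"},
        # Instance IDs are sent as strings
        "json": {"instanceIds": [str(i) for i in instance_ids]},
    }


# Hypothetical values, for illustration only
req = delete_instances_request(
    {
        "subscriptionId": "<guid>",
        "resourceGroupName": "acme-prod-k3s-cluster",
        "vmScaleSetName": "agents",
    },
    [0, 3],
)
```

The resulting dictionary can be passed to a requests session as `session.request(headers=auth_headers, **req)`, where `auth_headers` carries the bearer token obtained earlier.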

Considering we’re getting into the slow swing of Summer (which is due just after the local entropy maxima I expect over the next couple of weeks), it might take a while, but I’ll get there.

¹ It’s also implemented as a set of C# Azure Functions, which, to be honest, doesn’t help me in the least–there are just too many moving parts there…

Tao of Mac