Terraforming Azure

As promised, this week I’m going to dive into Azure provisioning using Terraform, which is something I’ve been spending some time on, but which many folk in the Azure universe seem to be unaware of. Well, now there’s a copiously detailed example of how to bring them together, and I’m going to walk you through it.

Infrastructure as Code

Fully automated provisioning is, for me, the key advantage of using cloud solutions, but many people coming to the cloud tend (much to my bemusement) to carry over the manual “point and click” approaches to provisioning they have been using in their existing datacenters, or (at best) to try and get by with simple CLI scripting (be it PowerShell or bash).

As it turns out, infrastructure as code isn’t scripting. Scripting is automation (in the strictest sense), but imperative scripts (i.e., stuff like az vm create) are not a good approach for managing infrastructure because they do not usually account for state: they tend to ignore the current state of provisioned infrastructure and do not express the desired state you intend to achieve.

They may be nice for one-offs and demos, but if you’re going to do cloud deployments you intend to keep around for more than a week (and update down the road), you need to use the right tool for the job.
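
To make the contrast concrete, here is a minimal sketch of a declarative definition (the resource group name and region are illustrative placeholders, not taken from the example project). It describes what should exist rather than the steps to create it, so applying it a second time without changes should be a no-op:

```hcl
# Desired state: a single resource group.
# The name and location below are illustrative placeholders.
resource "azurerm_resource_group" "rg" {
  name     = "example-rg"
  location = "westeurope"
}
```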

Why Terraform?

Going straight to the point, for me (and a few of the customers I’ve worked with) choosing Terraform boils down to:

Hashicorp’s long track record in automation of various kinds
Its declarative approach to infrastructure provisioning
The ability to easily compare existing state with desired state and evaluate/execute diffs to converge to desired state
Its ability to work with multiple cloud/virtualisation providers

These factors translate to a number of benefits in terms of reliability, testability, reproducibility and flexibility, allowing teams to use the same tool to plan, document (as code) and evolve infrastructure without resorting to a myriad of other tools.

What About Azure Resource Manager?

Azure Resource Manager (ARM) templates (which I wrote about a couple of years back) are alive and well and provide a tried and tested declarative approach that works wonderfully well (ARM templates are composable, can be validated against running infra and support additive “deltas”), but most customers I come across are put off by the learning curve (which, honestly, is pretty easy) or are looking for the Holy Grail of a single tool to manage everything.

Terraform isn’t perfect by any means, but comes quite close to being universal, and seems to enjoy considerable mindshare even in the traditional Windows/IT environments.

So, what can I do with it on Azure?

Judging from the current state of the Azure provider, just about everything. It can already manage Kubernetes clusters, container instances and Azure Functions, so it covers most generally available services (although not, at this point, Azure Data Lake Analytics, which is a bit sad for me).
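
Before any resources can be declared, Terraform needs the provider itself. A minimal sketch, assuming credentials come from the environment (an Azure CLI login or the ARM_* variables) and that a 1.x release of the provider is what you want; the version constraint is my assumption, not something the example mandates:

```hcl
# Azure Resource Manager provider.
# Credentials are assumed to come from an Azure CLI session or from
# ARM_SUBSCRIPTION_ID, ARM_CLIENT_ID, ARM_CLIENT_SECRET and ARM_TENANT_ID.
provider "azurerm" {
  version = "~> 1.0" # illustrative version constraint
}
```

Running terraform init in the same directory downloads the provider plugin.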

But most of the time you’ll want to manage networking and compute resources, so let’s start with a VM:

resource "azurerm_virtual_machine" "linux" {
  name                  = "${var.name}"
  location              = "${var.location}"
  resource_group_name   = "${var.resource_group_name}"
  network_interface_ids = ["${azurerm_network_interface.nic.id}"]
  vm_size               = "${var.vm_size}"

  storage_image_reference {
    publisher = "Canonical"
    offer     = "UbuntuServer"
    sku       = "18.04-LTS"
    version   = "latest"
  }

  storage_os_disk {
    name              = "${var.name}-os"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "${var.storage_type}"
  }

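  os_profile {
    computer_name  = "${var.name}"
    admin_username = "${var.admin_username}"
    admin_password = "${var.admin_password}"
  }

  # Required for Linux VMs; password logins stay enabled in this simplified version
  os_profile_linux_config {
    disable_password_authentication = false
  }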

  tags = "${var.tags}"
}

The definition above is a little simpler than in my full example, but should give you an idea of how straightforward it is (and, in fact, how similar it is to an ARM template).
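
The VM refers to azurerm_network_interface.nic, which is not shown. A minimal sketch of what such a resource could look like, assuming a subnet_id variable is passed in and using naming suffixes of my own choosing:

```hcl
# Hypothetical NIC for the VM above; the name suffixes are illustrative.
variable "subnet_id" {
  description = "ID of the subnet the NIC attaches to"
}

resource "azurerm_network_interface" "nic" {
  name                = "${var.name}-nic"
  location            = "${var.location}"
  resource_group_name = "${var.resource_group_name}"

  ip_configuration {
    name                          = "${var.name}-ipconfig"
    subnet_id                     = "${var.subnet_id}"
    private_ip_address_allocation = "dynamic"
  }
}
```

Because the VM interpolates the NIC's id, Terraform infers the dependency and creates the NIC first.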

Depending on your naming conventions you might want to tweak various aspects of it, but in general Terraform’s variable interpolation should make it very easy for you to both name and tag resources to your heart’s content.

In fact, it is quite easy to have slight variations on tags on a per-resource basis. For instance, here’s how to merge a tag into a default set:

tags = "${merge(var.tags, map("provisionedBy", "terraform"))}"
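
If the same merged set is needed in several places, one option is to compute it once as a local value. This is a sketch that assumes var.tags is declared as a map; the default tag shown is a placeholder:

```hcl
variable "tags" {
  type = "map"

  default = {
    environment = "demo" # placeholder default tag
  }
}

locals {
  # Default tags plus one extra entry, computed once
  tags = "${merge(var.tags, map("provisionedBy", "terraform"))}"
}
```

Resources can then use tags = "${local.tags}" instead of repeating the merge.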

Terraform does not try to use generic (and potentially dangerously leaky) abstractions across different cloud providers–a VM on Azure is not represented using the same primitives as one on AWS or GCP, and each provider’s abstractions (or quirks) are not glossed over but surfaced onto the language through specific resource types for each provider.

Another important difference for people used to Azure portal or CLI deployments is that since Terraform uses Azure APIs directly, its actions are not listed in the Deployments blade in your resource groups.

All the individual provisioning actions are listed in the Azure Activity Log, which makes it a little harder to keep track of higher-level activities but ties in pretty well with Terraform’s approach when modifying existing infrastructure–for conciseness, here’s what happens when I change tags on a resource:

$ terraform apply
...
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ azurerm_resource_group.rg
      tags.ProvisionedVia: "Terraform" => ""
      tags.provisionedBy:  "" => "terraform"

Plan: 0 to add, 1 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

azurerm_resource_group.rg: Modifying... (ID: /subscriptions/...)
  tags.ProvisionedVia: "Terraform" => ""
  tags.provisionedBy:  "" => "terraform"
azurerm_resource_group.rg: Modifications complete after 2s (ID: /...)
$ _

This, of course, results in a single API call to change infrastructure state.

Iterating Quickly

This bit has little to do with Terraform itself, but is something I’ve been doing for many years–since I invariably use a Mac or Linux (or WSL) to do deployments, I use Makefiles extensively to save me some typing and to set some variables automatically.

For instance, I use this to pick up the current username and any existing SSH keys and hand them over to Terraform, using the TF_VAR_ prefix to mark them for import into the .tf context:

SSH_KEY := $(HOME)/.ssh/id_rsa.pub
ifneq (,$(wildcard $(SSH_KEY)))
    export TF_VAR_admin_username := $(USER)
    export TF_VAR_ssh_key := $(shell cat $(SSH_KEY))
endif

.PHONY: plan apply destroy

# Recipe lines must be indented with a tab, not spaces
plan apply destroy:
	terraform $@

The example (which I encourage you to explore) comes with a slightly improved version of the above, but the point here is that you should find a way to make it easy for you to invoke Terraform multiple times and iterate upon your configurations with minimal error.
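
For the TF_VAR_ values to be picked up, matching variables have to be declared on the Terraform side. A minimal sketch (the descriptions are mine):

```hcl
# Populated from TF_VAR_admin_username when that is set in the environment
variable "admin_username" {
  description = "Administrator account name"
}

# Populated from TF_VAR_ssh_key when that is set in the environment
variable "ssh_key" {
  description = "Contents of the public SSH key"
}
```

Since neither variable has a default, Terraform will prompt for a value if the corresponding environment variable is missing.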

Once you adopt it for production deployments (and you will) you want to make sure you’ve minimised the chances of fumbling manual inputs or setting up variables incorrectly.

Modularity, Logic and Complex Deployments

One of the big advantages of Terraform over resource templates is that modularity is easier to manage (at least for me).

For instance, one of the first things I did was to break out VM creation into a separate module (actually two, one for Windows and another for Linux, but I’m focusing on Linux in this example), which can be invoked like this:

module "linux" {
  source = "./linux"

  name                = "${var.instance_prefix}"
  vm_size             = "${var.standard_vm_size}"
  location            = "${var.location}"
  resource_group_name = "${azurerm_resource_group.rg.name}"
  admin_username      = "${var.admin_username}"
  admin_password      = "${var.admin_password}"

  // this is something that annoys me - passing the resource would be nicer
  diag_storage_name                  = "${azurerm_storage_account.diagnostics.name}"
  diag_storage_primary_blob_endpoint = "${azurerm_storage_account.diagnostics.primary_blob_endpoint}"
  diag_storage_primary_access_key    = "${azurerm_storage_account.diagnostics.primary_access_key}"

  subnet_id                          = "${azurerm_subnet.default.id}"
  storage_type                       = "${var.storage_type}"
  tags                               = "${local.tags}"
}

You can see above that I’d be much happier passing the whole azurerm_storage_account resource as a parameter rather than the individual strings, but this is much less of a pain than trying to use linked ARM templates or merging JSON files (which is something I’ve done with relative success in other projects, but adds to the complexity of maintaining them).
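
For reference, the storage account whose attributes are passed into the module could be declared along these lines. This is a sketch under my own assumptions: the account name is a placeholder (it has to be globally unique), and the tier and replication settings are simply cheap defaults:

```hcl
# Hypothetical diagnostics storage account
resource "azurerm_storage_account" "diagnostics" {
  name                     = "examplediag12345" # placeholder, must be globally unique
  resource_group_name      = "${azurerm_resource_group.rg.name}"
  location                 = "${var.location}"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```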

In general, things like conditionals, loops and (thankfully) IP address management (complete with CIDR awareness) are clearer and easier to maintain in Terraform than in Azure templates. Terraform doesn’t do everything, but it does provide a nice scaffolding for other tools to work on top of.
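
As an illustration of loops and CIDR awareness, here is a sketch that carves a number of /24 subnets out of a /16 using count and cidrsubnet. All names, the address space and the subnet count are assumptions of mine rather than values from the example:

```hcl
variable "address_space" {
  default = "10.0.0.0/16"
}

variable "subnet_count" {
  default = 2
}

resource "azurerm_virtual_network" "vnet" {
  name                = "example-vnet"
  location            = "${var.location}"
  resource_group_name = "${azurerm_resource_group.rg.name}"
  address_space       = ["${var.address_space}"]
}

resource "azurerm_subnet" "tier" {
  count                = "${var.subnet_count}"
  name                 = "tier-${count.index}"
  resource_group_name  = "${azurerm_resource_group.rg.name}"
  virtual_network_name = "${azurerm_virtual_network.vnet.name}"

  # Adds 8 bits to the /16 prefix: 10.0.0.0/24, 10.0.1.0/24, ...
  address_prefix = "${cidrsubnet(var.address_space, 8, count.index)}"
}
```

The same count argument doubles as a conditional: setting it to a ternary expression that yields 1 or 0 creates or skips a resource depending on a variable.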

And it does so across regions and multiple resource groups with great ease. You can use Terraform to lay down a solid, sprawling multi-region infrastructure and then provision your machines (and manage software configuration and deployments) with, say, Ansible with minimal overlap, and still be able to update sections of it by tweaking a single master project.

But let’s go back to the example, since it embodies some best practices I keep emphasising whenever I deploy Linux machines.

Hardening Stuff

I never use passwords in my deployments unless I need to deploy a Windows machine. Even though it’s now possible to log in to Linux machines using Active Directory, I prefer to use SSH keys, so I usually excise all mentions of passwords and change VM definitions to read:

os_profile {
  computer_name  = "${var.name}"
  admin_username = "${var.admin_username}"
}

os_profile_linux_config {
  disable_password_authentication = true

  ssh_keys {
    key_data = "${var.ssh_key}"
    path     = "/home/${var.admin_username}/.ssh/authorized_keys"
  }
}

As you’d expect, ssh_key is, in my case, picked up automatically from the environment by the Makefile outlined above and supplied to Terraform as the contents of TF_VAR_ssh_key.

For good measure, I also tweak the SSH daemon cyphers and change the SSH port (this last bit isn’t so much a real security measure as a way to drastically cut down the amount of noise in logs).

To accomplish that, my preferred approach is cloud-init, which is very easy to use in Terraform and Azure. In the example I fill out the cloud-config YAML file using a template_file data resource to do variable substitution:

data "template_file" "cloud_config" {
  template = "${file("${path.module}/cloud-config.yml.tpl")}"

  vars {
    ssh_port       = "${var.ssh_port}"
    admin_username = "${var.admin_username}"
  }
}

…and then base64encode the rendered version into custom_data:

os_profile {
  computer_name  = "${var.name}"
  admin_username = "${var.admin_username}"
  custom_data    = "${base64encode(data.template_file.cloud_config.rendered)}"
}
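
When debugging the template, it can help to look at the rendered text before it gets encoded. One way to do that, sketched here with an output name of my own choosing, is to expose it as an output:

```hcl
# Exposes the rendered cloud-config so it can be inspected after an apply
output "cloud_config" {
  value = "${data.template_file.cloud_config.rendered}"
}
```

After an apply, terraform output cloud_config prints the substituted YAML.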

The cloud-config template injects a tweaked sshd_config, which is applied on first boot:

#cloud-config

write_files:
  - path: /etc/ssh/sshd_config
    permissions: 0644
    # strengthen SSH cyphers
    content: |
      Port ${ssh_port}
      Protocol 2
      HostKey /etc/ssh/ssh_host_ed25519_key
      KexAlgorithms curve25519-sha256@libssh.org
      ...
  - path: /etc/fail2ban/jail.d/override-ssh-port.conf
    permissions: 0644
    content: |
      [sshd]
      enabled = true
      port    = ${ssh_port}
      logpath = %(sshd_log)s
      backend = %(sshd_backend)s

packages:
  - docker.io
  - fail2ban
  - auditd
  - audispd-plugins

timezone: Europe/Lisbon

runcmd:
  - usermod -aG docker ${admin_username}
  - systemctl enable docker
  - systemctl start docker

…where besides setting up a few other packages, I set up fail2ban with an extra config file, tweaked to match the new SSH port.

Once the machine is provisioned (with, say, port 1234), you can check that the result is correctly configured like this:

$ sudo fail2ban-client get sshd action iptables-multiport actionstart
iptables -N f2b-sshd
iptables -A f2b-sshd -j RETURN
iptables -I INPUT -p tcp -m multiport --dports 1234 -j f2b-sshd

Next, let’s look at monitoring, since by default Terraform (like most other tools) doesn’t add any monitoring functionality to your VMs.

Adding Diagnostics and Metrics

In Azure, there are currently four ways to get performance metrics and system diagnostics out of your VMs:

Boot diagnostic logs (which are now complemented with an interactive “serial” console in the portal)
Azure Metrics via the standard (v2.3) Linux agent, which is automatically deployed by the portal via the Microsoft.OSTCExtensions VM extension (which captures metrics in a way the Azure portal can render)
Enhanced metrics via the 3.0 Linux agent, which is deployed via the Azure.Linux.Diagnostics VM extension (which can send out metrics to Event Hub and other niceties)
Through the Operations Management Suite Agent, which requires you to create an OMS workspace, and is best suited for managing large deployments–I have a set of ARM templates for those that I will be updating in the near future, so I won’t cover that today.

That leaves us the first three to play around with for our example. Boot diagnostics, which I consider essential, are pretty easy to enable by adding the following stanza to the VM resource:

boot_diagnostics {
  enabled     = true
  storage_uri = "${var.diag_storage_primary_blob_endpoint}"
}

However, the others are rather more complex, and typically require at least three things:

A configuration file (XML or JSON, depending on version) that specifies the set of metrics to collect
A storage account for storing the diagnostics data, which in turn requires extracting its access keys
An azurerm_virtual_machine_extension resource that takes the configuration, storage and keys and injects the actual agent into the VM

When you create a Linux machine via the Azure portal, it prompts you for the storage account and then injects a default metrics configuration and a default extension (currently the one by Microsoft.OSTCExtensions, version 2.3).

That is what I have working in the example by default, and deploying it as-is yields a machine that is indistinguishable from a portal deployment (you can check that by going to Metrics and seeing all the Guest entries).
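
To give an idea of the shape of the extension resource, here is a rough sketch for the 2.3 agent. Treat it as an outline under stated assumptions rather than the example's actual code: the settings template file name is hypothetical, and the protected settings keys (storageAccountName and storageAccountKey) reflect my understanding of the 2.3 schema and should be checked against the extension's documentation:

```hcl
# Hypothetical template holding the public settings (metrics configuration)
data "template_file" "diag_settings" {
  template = "${file("${path.module}/diagnostics-settings.json.tpl")}"

  vars {
    storage_account = "${var.diag_storage_name}"
  }
}

resource "azurerm_virtual_machine_extension" "diagnostics" {
  name                       = "${var.name}-diagnostics"
  location                   = "${var.location}"
  resource_group_name        = "${var.resource_group_name}"
  virtual_machine_name       = "${azurerm_virtual_machine.linux.name}"
  publisher                  = "Microsoft.OSTCExtensions"
  type                       = "LinuxDiagnostic"
  type_handler_version       = "2.3"
  auto_upgrade_minor_version = true

  # Public settings: which metrics to collect
  settings = "${data.template_file.diag_settings.rendered}"

  # Protected settings: storage credentials (key names are an assumption)
  protected_settings = <<SETTINGS
{
  "storageAccountName": "${var.diag_storage_name}",
  "storageAccountKey": "${var.diag_storage_primary_access_key}"
}
SETTINGS
}
```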

However, it is much more interesting to look at Azure.Linux.Diagnostics, because the credentials it requires for the storage account are different–it needs an SAS token with very specific permissions, which can be specified in detail through this resource:

data "azurerm_storage_account_sas" "diagnostics" {
  connection_string = "${var.diag_storage_primary_connection_string}"
  https_only        = true

  resource_types {
    service   = false
    container = true
    object    = true
  }

  services {
    blob  = true
    queue = false
    table = true
    file  = false
  }

  start  = "2018-06-01"
  expiry = "2118-06-01"

  permissions {
    read    = true
    write   = true
    delete  = false
    list    = true
    add     = true
    create  = true
    update  = true
    process = false
  }
}

The above is actually included–but commented out–in the example, and I provide working templates for metrics configurations and the JSON bits to include as part of the extension settings for both 2.3 and 3.0.

I intend to add conditionals to choose the extension type later on, but for readability and time reasons decided to leave things be for now.

The most notable bit, however (and one that took me a bit to figure out) is that Azure requires you to add the SAS token without the leading question mark, which means you need to add it to protected_settings like so:

diag_storage_sas = "${substr(data.azurerm_storage_account_sas.diagnostics.sas,1,-1)}"

Bottom line: Terraform makes a lot of things easier, but you still need to understand enough about Azure to make things work smoothly.

The good news is that things can only get better from here, since Azure APIs and integrations keep expanding and Microsoft (my current employer, incidentally) stays the course.

In Closing

This was a necessarily brief (but still rather extensive) walkthrough of my example configuration.

Hopefully you’ve found it both enlightening and useful for tackling Azure with Terraform, and I’ll be updating the project itself to include a full-blown web stack and perform a few other tweaks for the sake of modularity and reusability (it works and seems readable enough, but can still be improved).

Tao of Mac