The Container Fabric team runs New Relic’s in-house container orchestration and runtime platform. One of the team’s critical functions is to confirm that we have healthy hardware with consistent settings and versions deployed across our container clusters. Simple mismatches in settings among hosts inside Container Fabric can lead to inconsistent performance and behavior that then impacts higher-level services and complicates troubleshooting.

When we started building the platform, we were using Ansible for basic configuration management on our CoreOS Container Linux hosts, and we made the early decision to extend that functionality so that we could also use Ansible to perform some non-intrusive auditing of our systems.

This post will cover how we utilized Ansible as a basic auditing tool to ensure that hosts met expected requirements regarding configuration, health, and security. In the early phases of the Container Fabric project, Ansible helped us keep the number of tools we needed down to a minimum, while helping us ensure that all the important aspects of our hardware were configured as expected. Our hardware configuration tests checked a broad range of things, including BIOS settings and RAID configuration. Hopefully, this post will give you some ideas for using Ansible to run audit checks in your infrastructure.

(Note: This post assumes a working knowledge of Ansible and task execution.)

First, defining our requirements for Ansible as a hardware auditing tool

Before we started building our auditing solution, we defined our requirements.

  • The solution had to be able to detect and report any non-conformant configuration that we identified as important. These checks should be for things we didn’t already permanently enforce with Terraform or Ansible configuration management runs.
  • Container Fabric is built on a diverse set of hosts in diverse data centers, so our solution had to be able to detect the underlying platform and apply the correct checks.
  • Every run had to complete without hard errors and produce a report of the results.
  • The solution had to be completely non-invasive, pulling and launching a support container only when absolutely required to properly query the host.
  • The solution had to be useful to both humans and automated provisioning processes.

Creating an Ansible playbook role to run our audits

Since the Container Fabric team already heavily leveraged Ansible playbook roles, we added two new roles to our codebase: validate_barebones, which we designed to run on newly provisioned hosts as soon as we received them (although it could safely run at any time), and validate_postansible, which built upon validate_barebones and was intended to run on a system that had already undergone configuration management.

The first thing we needed to do was create a simple playbook for our new role:

---
- include: bootstrap_ansible.yaml
  when: no_includes is not defined
- hosts: '{{ target_hosts }}'
  become: yes
  become_method: sudo
  gather_facts: True
  roles:
    - validate_barebones

The no_includes variable is a trick we used to more quickly test our playbooks in development without needing to run the whole Ansible workflow.

We assigned hosts to the {{target_hosts}} variable so that anything launching Ansible must define the hosts targeted in the run, instead of defaulting to the typical—and dangerous—all.
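As a minimal sketch of that safety net, a small guard play can fail fast when the variable is missing. (The assert task below is our illustration, not part of the actual playbook.)

```yaml
# Hypothetical guard play: fail fast if target_hosts was not supplied.
- hosts: localhost
  gather_facts: False
  tasks:
    - name: 'Require target_hosts to be defined'
      assert:
        that: 'target_hosts is defined'
        msg: 'Launch with -e target_hosts=<group>; never default to all'
```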

When we ran this playbook, Ansible executed the validate_barebones role. This role consisted of three files: a task YAML file (tasks/main.yaml), a handler YAML file (handlers/main.yaml), and a single Jinja 2 template (templates/barebones_report.j2), which we used when generating reports.

The tasks/main.yaml file is where we defined the tasks for the audit runs, but before we ran the audit tasks, we ran a few tasks to determine which platform the host was running on.

- name: 'Detect AWS Instance'
  shell: "grep -qi amazon /sys/devices/virtual/dmi/id/bios_version"
  register: in_aws
  ignore_errors: True
  changed_when: False
  failed_when: False
   
- name: 'Detect Dell Server'
  shell: "grep -qi PowerEdge /sys/devices/virtual/dmi/id/product_name"
  register: dell_hardware
  ignore_errors: True
  changed_when: False
  failed_when: False

- name: 'Check for Unknown Platform'
  # This check should actually fail the run, if it is True
  shell: "echo"
  failed_when: in_aws|failed and dell_hardware|failed

In the first two stanzas, we checked whether the host was running in Amazon or was a Dell PowerEdge server, and then registered the results of those checks in the in_aws and dell_hardware variables. If a host was neither in AWS nor deployed on Dell hardware, we failed the run, as described in the final stanza. (We never really expected that to happen, though.)
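With those variables registered, later checks could be gated per platform. Here is a hedged sketch of what an AWS-only task might look like; this particular check is illustrative and was not one of our actual audits:

```yaml
# Hypothetical AWS-only audit task, gated on the platform detection above.
- name: 'ENA network driver should be in use (illustrative check)'
  shell: 'ethtool -i eth0 | grep -q ena'
  register: aws_ena
  when: in_aws|success
  ignore_errors: True
  failed_when: False
  changed_when: aws_ena.rc != 0
```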

Example individual audit task

Once we determined the platform a host was running on, it was time to start running audit checks. The following check ensured that Docker was running on the host:

# Docker
- name: 'Docker daemon is running'
  # This check should actually fail the run, since a lot depends on it.
  shell: 'docker -H unix:///var/run/docker.sock ps'
  register: docker_running
  changed_when: False

Container Fabric runs on CoreOS Container Linux, so it must rely on containers to run most software, including the tools required to query important details about the Dell hardware. Because of this, we made sure this check would fail the whole run if Docker was not running: unlike our other checks, this task omits failed_when: False, so a non-zero exit code from the shell command fails the run. The line changed_when: False tells Ansible that this command didn’t make any changes to the system that we wanted to track.
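Later container-based checks could then assume a working daemon. As a hedged sketch (the task below is our illustration, not one of the original checks), a follow-on audit might report on container health without ever failing the run:

```yaml
# Hypothetical follow-on check that relies on the Docker daemon
# verified above; it reports divergence but never fails the run.
- name: 'No containers should be stuck restarting (illustrative)'
  shell: 'docker -H unix:///var/run/docker.sock ps --filter status=restarting -q'
  register: restarting_containers
  ignore_errors: True
  failed_when: False
  changed_when: restarting_containers.stdout != ""
```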

Example multi-step audit tasks

Besides running simple individual audits, we had more complicated checks. Dell EMC OpenManage Server Administrator (OMSA) is the Dell tool required to query most Dell-specific settings. The following check determined whether the OMSA container was already running, launched it if not, and then verified that our host’s BIOS power profile was configured correctly. We accomplished this check by running a few tasks with a single handler.

The tasks look like this:


- name: 'Is OMSA running'
  shell: 'curl -s --no-buffer -XGET --unix-socket /var/run/docker.sock http://localhost/containers/json?all=true | grep -q /omsa'
  register: omsa_local
  changed_when: False
  ignore_errors: True
  when: dell_hardware|success
  failed_when: False
  args:
    warn: False


# Launch OpenManage container (with sleep to allow processes to start up)
- name: 'Launch openmanage container'
  # This check should actually fail the run, since a lot depends on it.
  shell: "docker -H unix:///var/run/docker.sock run --privileged -d -p 1311:1311 -p 161:161/udp --restart=always -v /lib/modules/`uname -r`:/lib/modules/`uname -r` --name=omsa registry.example.net/dell-openmanage84:latest && sleep 40"
  when: (omsa_local|failed and dell_hardware|success)
  register: launch_omsa
  changed_when: launch_omsa|success
  notify: remove omsa

# Hardware BIOS
- name: 'Power profile should be PerfOptimized'
  # NOT PerfPerWattOptimizedDapc
  shell: 'docker -H unix:///var/run/docker.sock exec omsa omreport chassis biossetup display=shortnames -fmt ssv | grep SysProfile'
  ignore_errors: True
  when: dell_hardware|success
  failed_when: False
  register: bios_power
  changed_when: ("PerfOptimized" not in bios_power.stdout)

These tasks demonstrate where we get a little creative with Ansible’s abilities. Let me explain.

We wanted these checks to generate a comprehensive report, so we almost never wanted an Ansible run to fail partway through. To ensure that reporting tasks could not fail the run, we defined failed_when: False in each of them.

We also wanted the report to tell us more than whether or not a system was compliant; we wanted to understand in what way a system was out of compliance.

How the multi-step task works

The first stanza determines if the Dell OpenManage (OMSA) container is already running on the host. If it is, we use the existing container and refrain from uninstalling it at the end of the run. (The args: warn: False option disables a warning that Ansible gives us about using curl via the shell, instead of relying on Ansible’s underlying library functions.)

The second stanza launched the Dell OpenManage container if it wasn’t already running (when: omsa_local|failed) and the system was on Dell hardware (when: dell_hardware|success). If we launched the container, we also registered a handler to run at the end with notify: remove omsa. This way we only removed the container if we had launched it.

The third stanza executed a check utilizing the Dell OpenManage container. A shell command determined the Dell BIOS power profile setting, and register: bios_power captured the result. The registered bios_power variable (whose stdout key holds the command output) will often look something like this:

{
  "changed": false,
  "cmd": "docker -H unix:///var/run/docker.sock exec omsa omreport chassis biossetup display=shortnames -fmt ssv | grep SysProfile",
  "delta": "0:00:00.153451",
  "end": "2017-02-16 18:55:53.583467",
  "failed": false,
  "failed_when_result": false,
  "rc": 0,
  "start": "2017-02-16 18:55:53.430016",
  "stderr": "",
  "stdout": "SysProfile;PerfOptimized",
  "stdout_lines": ["SysProfile;PerfOptimized"],
  "warnings": []
}

We then detected any divergence with the changed_when: ("PerfOptimized" not in bios_power.stdout) statement in the third stanza. We also added this output to our larger report later.
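The same register/changed_when pattern extended naturally to our other hardware checks, such as RAID configuration. Here is a hedged sketch of what such a check might look like; the omreport invocation and expected layout are illustrative assumptions, not our exact task:

```yaml
# Hypothetical RAID audit following the same pattern as the BIOS check.
- name: 'Virtual disk layout should be RAID-10 (illustrative)'
  shell: 'docker -H unix:///var/run/docker.sock exec omsa omreport storage vdisk -fmt ssv | grep -i Layout'
  register: raid_layout
  when: dell_hardware|success
  ignore_errors: True
  failed_when: False
  changed_when: ("RAID-10" not in raid_layout.stdout)
```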

After the audit tasks completed, the last step was for Ansible to trigger any registered handlers (handlers/main.yaml), including the one that removed the Dell OpenManage container:


- name: remove omsa
  shell: "docker -H unix:///var/run/docker.sock rm -f omsa"

Generating the audit report

The final task in our playbook role was a local action that ran only once and generated our report.


# Report Results
- name: Generate report
  become: no
  local_action: template src=barebones_report.j2 dest=/tmp/barebones_report.txt
  run_once: True
  changed_when: False

The following is a representative portion of the Jinja 2 template file (templates/barebones_report.j2) that we used to create the final report:


{{'"%s", "%s", "%s", "%s", "%s"' | format("Barebones Section", "Test Name", "Changed", "Hostname", "Data") }}
{% set section = 'platform' %}
{% set test = 'type' %}
{% for i in play_hosts %}
{% if hostvars[i]['in_aws'] is defined and hostvars[i]['in_aws']['rc'] == 0 %}
{{'"%s", "%s", "%s", "%s", "%s"' | format(section, test, "False", i, "AWS instance") }}
{% elif hostvars[i]['dell_hardware'] is defined and hostvars[i]['dell_hardware']['rc'] == 0 %}
{{'"%s", "%s", "%s", "%s", "%s"' | format(section, test, "False", i, "Dell hardware") }}
{% else %}
{{'"%s", "%s", "%s", "%s", "%s"' | format(section, test, "True", i, "UNKNOWN/UNREACHABLE") }}
{% endif %}
{% endfor %}
{% set section = 'bios' %}
{% set test = 'Power Mode' %}
{% for i in play_hosts | sort %}
{% if hostvars[i]['bios_power'] is defined and "skipped" not in hostvars[i]['bios_power'] %}
{{'"%s", "%s", "%s", "%s", "%s"' | format(section, test, hostvars[i]['bios_power']['changed'], i, hostvars[i]['bios_power']['stdout'].strip(' \t\n\r')) }}
{% endif %}
{% endfor %}

And finally, here is a partial report (in comma-separated values (CSV) format) that we ran against two hosts:

"Barebones Section", "Test Name", "Changed", "Hostname", "Data"
"platform", "type", "False", "agent-10.example.net", "Dell hardware"
"platform", "type", "False", "agent-11.example.net", "Dell hardware"
"bios", "Power Mode", "False", "agent-10.example.net", "SysProfile;PerfOptimized"
"bios", "Power Mode", "False", "agent-11.example.net", "SysProfile;PerfOptimized"
"network", "bond0 - Bonding Mode", "False", "agent-10.example.net", "2"
"network", "bond0 - Bonding Mode", "True", "agent-11.example.net", "1"

The first line is a table header for the rest of the report. The last entry that has a True value in the Changed column indicates a divergence we needed to address. We then used this data to create tickets and schedule any required remediation work.

Proactive > Reactive

Although we always want to believe that hardware is delivered to us in an expected state, it is important to follow the “trust but verify” model to ensure that nothing important has “slipped through the cracks” during earlier testing performed outside our control.

Before the team implemented these checks, much of this discovery process would have been triggered by real-world incidents, and we wouldn’t know there was a problem until we had one. This Ansible-based audit workflow helped us identify problems ahead of time, and also helped us keep track of previously discovered issues so that we didn’t hit them again.

Designing and extending these audit playbooks was very easy for everyone on the team, and by regularly running them across our initial infrastructure, we learned a lot about the critical settings that we needed to track and correct. Using this knowledge we were able to move forward and implement a much more robust testing solution for our production launch using the Python-based TestInfra tool.


Sean is a Lead Site Reliability Engineer at New Relic. He is a long-time system administrator and operations engineer who has lived in places ranging from Alaska to Pakistan.
