LHC Tunnel

Tuesday, 30 January 2018

Keep calm and reboot: Patching recent exploits in a production cloud

At CERN, we have around 8,500 hypervisors running 36,000 guest virtual machines. These provide the compute resources both for the laboratory's physics program and for the organisation's administrative operations, such as paying bills and reserving rooms at the hostel. These resources are spread over many different server configurations, some of them over 5 years old.

With the accelerator stopped for the CERN annual closure until mid March, this is a good period to plan reconfigurations of compute resources, such as the migration of our central batch system, which schedules jobs across the central compute resources, to a new system based on HTCondor. The compute resources are heavily used, but there is more flexibility to drain some parts in the quieter periods of the year, when there is not 10 PB/month coming from the detectors. This year, however, we had an unexpected additional task: deploying the fixes for the Meltdown and Spectre exploits across the centre.

The CERN environment is based on Scientific Linux CERN 6 and CentOS 7. The hypervisors are now entirely CentOS 7 based, with guests running a variety of operating systems including Windows flavors and CERNVM. The campaign to upgrade involved a number of steps:
  • Assess the security risk
  • Evaluate the performance impact
  • Test the upgrade procedure and stability
  • Plan the upgrade campaign
  • Communicate with the users
  • Execute the campaign

Security Risk

The CERN environment consists of a mixture of different services, with thousands of projects on the cloud, distributed across two data centres in Geneva and Budapest. 

Two major risks were identified:
  • Services which provided the ability for end users to run their own programs along with others sharing the same kernel. Examples of this are the public login services and batch farms. Public login services provide an interactive Linux environment from which physicists around the world can log in, prepare papers, develop and debug applications and submit jobs to the central batch farms. The batch farms themselves provide thousands of worker nodes processing the data from CERN experiments by farming out event after event to free compute resources. Both of these environments are multi-user and allow end users to compile their own programs, and thus were rated as high risk for the Meltdown exploit.
  • The hypervisors provide support for a variety of different types of virtual machines. Different areas of the cloud provide access to different network domains or to compute optimised configurations. Many of these hypervisors host VMs owned by different end users and can therefore be exposed to the Spectre exploits, even if the performance is such that exploiting the problem would take significant computing time.
The remaining resources are VMs for dedicated services without access for end user applications, or dedicated bare metal servers for I/O intensive applications such as databases and disk or tape servers.

There are a variety of different hypervisor configurations, which we broke down by processor type (in view of the Spectre microcode patches). Each of these needed independent performance and stability checks.


Microcode | Assessment | #HVs | Processor name(s)
06-3f-02  | covered    | 3332 | E5-2630 v3 @ 2.40GHz, E5-2640 v3 @ 2.60GHz
06-4f-01  | covered    | 2460 | E5-2630 v4 @ 2.20GHz, E5-2650 v4 @ 2.20GHz
06-3e-04  | hopefully  | 1706 | E5-2650 v2 @ 2.60GHz
??        | unclear    | 427  | AMD Opteron(TM) Processor 6276 (family 21, model 1, stepping 2)
06-2d-07  | unclear    | 333  | E5-2630L 0 @ 2.00GHz, E5-2650 0 @ 2.00GHz
06-2c-02  | unlikely   | 168  | E5645 @ 2.40GHz, L5640 @ 2.27GHz, X5660 @ 2.80GHz

These risks were explained by the CERN security team to the end users in their regular blogs.

Evaluating the performance impact

The High Energy Physics community uses a suite called HEPSPEC06 to benchmark compute resources. These are synthetic programs based on the C++ components of SPEC CPU2006 that match the instruction mix of typical physics programs. With this benchmark, we have started to re-benchmark (the majority of) the CPU models we have in the data centres, both on the physical hosts and on the guests. The measured performance loss across all architectures tested so far is about 2.5% in HEPSPEC06 (a number also confirmed by one of the LHC experiments using their real workloads), with a few cases approaching 7%. So for our physics codes, the effect of patching seems measurable, but much smaller than many expected.

Test the upgrade procedure and stability

With our environment based on CentOS and Scientific Linux, the deployment of the updates for Meltdown and Spectre was dependent on the upstream availability of the patches. These could be broken down into several parts:
  • Firmware for the processors - the microcode_ctl packages provide additional patches to protect against some parts of Spectre. This package proved very dynamic, as new processor firmware was being added on a regular basis, and it was not always clear when it needed to be applied: the package version would increase, but this did not always include an update for the particular hardware type in use. Following the Intel release notes, there were entries such as "HSX C0(06-3f-02:6f) 3a->3b", which means that the processor described by 06-3f-02:6f is upgraded from release 0x3a to 0x3b. The fields are the CPU family, model and stepping from /proc/cpuinfo, and the firmware level can be found in /sys/devices/system/cpu/cpu0/microcode/version (see the sketch after this list). A simple script (spectre-cpu-microcode-checker.sh) was made available to the end users so they could check their systems, and this was also used by the administrators to validate the central IT services.
  • For the operating system, we used a second script (spectre-meltdown-checker.sh), which was derived from the upstream GitHub code at https://github.com/speed47/spectre-meltdown-checker. The team maintaining this package was very responsive in incorporating our patches, so that other sites could benefit from the combined analysis.
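As a rough illustration of what the checker script looks at, the family/model/stepping signature and the currently loaded microcode revision can be read directly from standard kernel interfaces. The snippet below is a minimal sketch (the /proc/cpuinfo parsing is simplified and may need adjustment on unusual layouts):

    #!/bin/bash
    # Print the CPU signature (family-model-stepping, in hex) and the loaded microcode revision
    family=$(awk -F': ' '/^cpu family/ {print $2; exit}' /proc/cpuinfo)
    model=$(awk -F': ' '/^model[[:space:]]*:/ {print $2; exit}' /proc/cpuinfo)
    stepping=$(awk -F': ' '/^stepping/ {print $2; exit}' /proc/cpuinfo)
    printf "signature: %02x-%02x-%02x\n" "$family" "$model" "$stepping"
    # Microcode revision currently loaded on CPU0, to compare with the Intel release notes
    echo "microcode: $(cat /sys/devices/system/cpu/cpu0/microcode/version)"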

Communication with the users

For the cloud, there are several types of resource consumers:
  • IT service administrators who provide higher level functions on top of the CERN cloud. Examples include file transfer services, information systems, web frameworks and experiment workload management systems. While some are in the IT department, others are representatives of their experiments or supporters of online control systems such as those used to manage the accelerator infrastructure.
  • End users who consume cloud resources by asking for virtual machines and using them as personal working environments. A typical case would be a macOS user who needs a Windows desktop: they create a Windows VM and use protocols such as RDP to access it when required.
The communication approach was as follows:
  • A meeting was held to discuss the risks of the exploits, the status of the operating systems and the plan for deployment across the production facilities. During the Q&A session, the major concerns raised were around the potential impact on performance and the tuning options.
  • An e-mail was sent to all owners of virtual machine resources informing them of the upcoming interventions.
  • CERN management was informed of the risks and the plan for deployment.
CERN uses ServiceNow to provide a service desk for tickets and a status board of interventions and incidents. A single entry was used to communicate the current plans and status so that all cloud consumers could go to a single place for the latest information.

Execute the campaign

With the accelerator starting up again in March and given the risk posed by the exploits, the approach taken was to complete the upgrades of the infrastructure in January, leaving February to find and resolve any residual problems. As the handling of the compute/batch part of the infrastructure was relatively straightforward (with only one service on top), we will focus in the following on the more delicate part: hypervisors running services that support several thousand users in their daily work.

The layout of our infrastructure with its availability zones (AVZs) determined the overall structure and timeline of the upgrade. With effectively four AVZs in our data centre in Geneva and two AVZs for our remote resources in Budapest, we scheduled the upgrade for the services part of the resources over four days.


The main zones in Geneva were done one per day, with a break after the first one (GVA-A) in case there were unexpected difficulties to handle on the infrastructure or on the application side. The remaining zones were scheduled on consecutive days (GVA-B and GVA-C), the smaller ones (critical, WIG-A, WIG-B) in sequential order on the last day. This way we upgraded around 400 hosts with 4,000 guests per day.

Within each zone, hypervisors were divided into 'reboot groups' which were restarted and checked before the next group was handled. These groups were determined by the OpenStack cells underlying the corresponding AVZs. Since some services needed the window of downtime to be limited, their hosting servers were moved to a special Group 1, the only one for which we could give a precise start time.

For each group several steps were performed:
  • install all relevant packages
  • check the next kernel is the desired one
  • reset the BMC (needed for some specific hardware to prevent boot problems)
  • log the nova and ping state of all guests
  • stop all alarming 
  • stop nova
  • shut down all instances via virsh
  • reboot the hosts
  • ... wait ... then fix hosts which did not come back
  • check running kernel and vulnerability status on the rebooted hosts
  • check and fix potential issues with the guests
Shutting down virtual machines via 'virsh', rather than via the OpenStack APIs, was chosen to speed up the overall process -- even if this required switching off nova-compute on the hosts as well (to keep nova in a consistent state). An alternative to issuing 'virsh' commands directly would be to configure 'libvirt-guests', especially in the context of the question of whether guests should be shut down and rebooted (which we did during this campaign) or paused/resumed. This is an option we'll have a look at when preparing for similar campaigns in the future.
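As a rough illustration, the core of the per-group procedure could be scripted per hypervisor along the following lines. This is a minimal sketch: the nova-compute service name corresponds to the RDO packaging on CentOS 7, and the alarm masking, BMC reset, logging and guest checks from the list above are omitted as they are site-specific:

    #!/bin/bash
    # Sketch: cleanly shut down all guests on this hypervisor and reboot it
    # Stop nova-compute first, so nova stays consistent while we act via virsh
    systemctl stop openstack-nova-compute
    # Ask every running guest to shut down
    for dom in $(virsh list --name); do
        virsh shutdown "$dom"
    done
    # Wait until no guest is left running
    while [ -n "$(virsh list --name)" ]; do
        sleep 10
    done
    # Reboot into the patched kernel and microcode
    systemctl reboot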

As some of the hypervisors in the cloud had very long uptimes, and this was the first time we systematically rebooted the whole infrastructure since the service went into full production about five years ago, we were not quite sure what kind of issues to expect -- and in particular at which scale. To our relief, the problems encountered on the hosts hit less than 1% of the servers and included (in descending order of appearance):
  • hosts stuck in shutdown (solved by IPMI reset)
  • libvirtd stuck after reboot (solved by another reboot)
  • hosts without network connectivity (solved by another reboot)
  • hosts stuck in grub during boot (solved by reinstalling grub) 
On the guest side, virtual machines were mostly fine when the underlying hypervisor was fine as well.
A few additional cases included:
  • incomplete kernel upgrades, so that the root partition could not be found (solved by booting back into an older kernel and reinstalling the desired kernel)
  • file system issues (solved by running file system repairs)
So, despite initial worries, we hit no major issues when rebooting the whole CERN cloud infrastructure!

Conclusions

While this kind of security issue does not arise very often, the key parts of the campaign followed standard steps: assessing the risk, planning the update, communicating with the user community, executing the campaign and handling incomplete updates.

Using the cloud availability zones to schedule the deployment allowed users to easily understand when there would be an impact on their virtual machines, and encouraged the good practice of load balancing resources across zones.

Authors

  • Arne Wiebalck
  • Jan Van Eldik
  • Tim Bell

Wednesday, 30 August 2017

Scheduled snapshots

While most of the machines on the CERN cloud are configured using Puppet with state stored in external databases or file stores, there are a few machines where this has been difficult, especially for legacy applications.

Doing a regular snapshot of these machines would be a way of protecting against failure scenarios such as hypervisor failure or disk corruptions.

This could always be scripted by the project administrators using the standard functions in the openstack client, but that would involve setting up the schedules and the credentials outside the cloud, and would require the appropriate skills from the project administrators. Since it is a common request, the CERN cloud team investigated how this could be done as part of the standard cloud offering.

The approach that we have taken uses the Mistral project to execute the appropriate workflows at a scheduled time. The CERN cloud is running a mixture of OpenStack Newton and Ocata but we used the Mistral Pike release in order to have the latest set of fixes such as in the cron triggers. With the RDO packages coming out in the same week as the upstream release, this avoided doing an upgrade later.

Mistral has a set of terms which explain the different parts of a workflow (https://docs.openstack.org/mistral/latest/terminology).

The approach needed several steps:
  • Mistral tasks to define the steps
  • Mistral workflows to provide the order to perform the steps in
  • Mistral cron triggers to execute the steps on schedule

Mistral Workflows

The Mistral workflows consist of a set of tasks and a process which decides which task to execute next based on different branch criteria such as success of a previous task or the value of some cloud properties.

Workflows can be private to the project, shared or public. By making these scheduled snapshot workflows public, the cloud administrators can improve the tasks incrementally and the cloud projects will receive the latest version of the workflow next time they execute them. With the CERN gitlab based continuous integration environment, the workflows are centrally maintained and then pushed to the cloud when the test suites have completed successfully.

The following Mistral workflows were defined:

instance_snapshot

Virtual machines can be snapshotted so that a copy of the virtual machine is saved and can be used for recovery or cloning in the future. The instance_snapshot workflow performs this operation for virtual machines booted either from a volume or locally.

Parameter | Description | Default
instance | The name of the instance to be snapshotted | Mandatory
pattern | The name of the snapshot to store. The text {0} is replaced by the instance name and {1} by the date in the format YYYYMMDDHHMM. | {0}_snapshot_{1}
max_snapshots | The number of snapshots to keep. Older snapshots are cleaned from the store when new ones are created. | 0 (i.e. keep all)
wait | Only complete the workflow when the steps have been completed and the snapshot is stored in the image storage | false
instance_stop | Shut the instance down before snapshotting and boot it up afterwards | false (i.e. do not stop the instance)
to_addr_success | E-mail address to send the report to if the workflow is successful | null (i.e. no mail sent)
to_addr_error | E-mail address to send the report to if the workflow failed | null (i.e. no mail sent)

The steps for this workflow are described in detail in the YAML/YAQL files at https://gitlab.cern.ch/cloud-infrastructure/mistral-workflows.

The operation is very fast for Ceph-based boot-from-volume instances, since the snapshot is done within Ceph. It can, however, take up to a minute for locally booted VMs, while the hypervisor ensures the complete disk contents are available. The VM is then resumed and the locally booted snapshot is sent to Glance in the background.

The high level steps are:

  • Identify the server
  • Stop the instance if requested by instance_stop
  • If the VM is locally booted
      ◦ Snapshot the instance
      ◦ Clean up the oldest image snapshot if over max_snapshots
  • If the VM is booted from volume
      ◦ Snapshot the volume
      ◦ Clean up the oldest volume snapshot if over max_snapshots
  • Start the instance if requested by instance_stop
  • If there is an error and to_addr_error is set
      ◦ Send an e-mail to to_addr_error
  • If there is no error and to_addr_success is set
      ◦ Send an e-mail to to_addr_success
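For reference, such a workflow could be launched once from the command line roughly as follows (a sketch assuming the Mistral plugin for the openstack client is installed; the instance name and e-mail address are placeholders):

    # Snapshot the VM 'myvm', keep the 5 most recent snapshots, mail on failure
    openstack workflow execution create instance_snapshot \
        '{"instance": "myvm", "max_snapshots": 5, "to_addr_error": "admin@example.org"}'
    # Check the state of the execution
    openstack workflow execution show <execution-id>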

restore_clone_snapshot

For applications which are not highly available, a common configuration is to use a LanDB alias pointing to a particular VM. In the event of a failure, the VM can be cloned from a snapshot and the LanDB alias updated to reflect the new endpoint location of the service. This workflow, called restore_clone_snapshot, will also create a new volume if the source instance is booted from volume.

The source instance still needs to exist, since information such as the properties, flavor and availability zone is not included in the snapshot; these are propagated from the source by default.

Parameter | Description | Default
instance | The name of the instance from which the snapshot will be cloned | Mandatory
date | The date of the snapshot to clone (either YYYYMMDD or YYYYMMDDHHMM) | Mandatory
pattern | The name of the snapshot to clone. The text {0} is replaced by the instance name and {1} by the date. | {0}_snapshot_{1}
clone_name | The name of the new instance to be created | Mandatory
avz_name | The availability zone to create the clone in | Same as the source instance
flavor | The flavor for the cloned instance | Same as the source instance
meta | The properties to copy to the new instance | All properties are copied from the source[1]
wait | Only complete the workflow when the steps have been completed and the cloned VM is running | false
to_addr_success | E-mail address to send the report to if the workflow is successful | null (i.e. no mail sent)
to_addr_error | E-mail address to send the report to if the workflow failed | null (i.e. no mail sent)

Thus, cloning the machine timbfvlinux143 to timbfvclone143 requires running the workflow with the parameters:

{"instance": "timbfvlinux143", "clone_name": "timbfvclone143", "date": "20170830"}

This results in:

  • A new volume created from the snapshot timbfvlinux143_snapshot_20170830
  • A new VM called timbfvclone143, booted from the new volume
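From the CLI, this clone could be requested roughly as follows (again a sketch assuming the Mistral plugin for the openstack client is installed):

    openstack workflow execution create restore_clone_snapshot \
        '{"instance": "timbfvlinux143", "clone_name": "timbfvclone143", "date": "20170830"}'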

An instance clone can be run for VMs which are booted from volume even when the hypervisor is not running. A machine can then be recovered from its current state using the following procedure:

  • Instance snapshot of the original machine
  • Instance clone from that snapshot (using today's date)
  • If DNS aliases are used, the alias can then be updated to point to the new instance name

For Linux guests, the rename of the hostname to the clone name occurs as the machine is booted. In the CERN environment, this took a few minutes to create the new virtual machine and then up to 10 minutes to wait for the DNS refresh.

For Windows guests, it may be necessary to refresh the Active Directory information given the change of hostname.

restore_inplace_snapshot

In the event of an issue such as a bad upgrade, the administrator may wish to roll back to the last snapshot. This can be done using the restore_inplace_snapshot workflow.

This operation works for locally booted machines and maintains the IP and MAC addresses, but cannot be used if the hypervisor is down. It does not currently work for boot-from-volume instances until the revert-to-snapshot feature (available in Pike, see https://specs.openstack.org/openstack/cinder-specs/specs/pike/cinder-volume-revert-by-snapshot.html) is in production.

Parameter | Description | Default
instance | The name of the instance to be restored from its snapshot | Mandatory
date | The date of the snapshot to restore from (either YYYYMMDD or YYYYMMDDHHMM) | Mandatory
pattern | The name of the snapshot to restore from. The text {0} is replaced by the instance name and {1} by the date. | {0}_snapshot_{1}
wait | Only complete the workflow when the steps have been completed and the restored VM is running | false
to_addr_success | E-mail address to send the report to if the workflow is successful | null (i.e. no mail sent)
to_addr_error | E-mail address to send the report to if the workflow failed | null (i.e. no mail sent)

Mistral Cron Triggers

Mistral has another nice feature: it can run a workflow at regular intervals. Compared to standard Unix cron, Mistral cron triggers use Keystone trusts to save the user token when the trigger is enabled. Thus, the execution can run without needing credentials such as a password or a valid Kerberos token.

A cron trigger can be created via Horizon or the CLI with the following parameters:
Parameter | Description | Example
Name | The name of the cron trigger | Nightly Snapshot
Workflow ID | The name or UUID of the workflow | instance_snapshot
Params | A JSON dictionary of the parameters | {"instance": "timbfvlinux143", "max_snapshots": 5, "to_addr_error": "theadmin@cern.ch"}
Pattern | A cron schedule pattern according to http://en.wikipedia.org/wiki/Cron | 0 5 * * * (i.e. run daily at 5 a.m.)

This will then execute the instance snapshot daily at 5 a.m., sending a mail to theadmin@cern.ch in the event of a failure of the snapshot. The five most recent snapshots will be kept.
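As a rough equivalent on the command line, the trigger could be created as follows (a sketch based on the Mistral plugin for the openstack client; the trigger name is just an example):

    # Run the instance_snapshot workflow every day at 5 a.m.
    openstack cron trigger create --pattern "0 5 * * *" nightly_snapshot instance_snapshot \
        '{"instance": "timbfvlinux143", "max_snapshots": 5, "to_addr_error": "theadmin@cern.ch"}'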

Mistral Executions

When Mistral runs a workflow, it provides details of the steps executed and the timestamps for start and end, along with the results. Each step can be inspected individually as part of debugging and root cause analysis in the event of failures.

The Horizon interface makes it easy to select the failing tasks. Some tasks may be reported as 'error' but be followed by subsequent actions which succeed, so an error step can be a normal part of a successful workflow execution, such as falling back to a default when no value can be found.
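The same information can be retrieved from the CLI, for example (a minimal sketch; the execution id is a placeholder):

    # List recent workflow executions and their state
    openstack workflow execution list
    # Inspect the individual task executions of one workflow execution
    openstack task execution list <execution-id>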


Credits
  • Jose Castro Leon from the CERN IT cloud team implemented the Mistral deployment and the workflows described above.




[1] Except for a CERN specific one called landb-alias for a DNS alias