This document explains how we deployed a green IT system that wakes up and shuts down desktop systems as required (after the necessary checks), monitors and reports on their power state, and gives ordinary users a way to wake systems when they need them.
The system deployed is entirely independent of external services and may be of interest to others considering deploying such a system (hence this document). We refer on several occasions to the central services provided by OUCS within the university, which may be more appropriate for some implementations, although we chose not to use them for our system. The ease with which we were able to deploy this system relies in part on the services and infrastructure we already had in place.
As with any project, it is sensible to do some background research first, in particular to see whether what you wish to achieve is already possible using systems available to you or existing products/packages you could install to do the job.
A good place to start is the OUCS information on green IT. This can help you understand what your aims might be and options to achieve those aims. In particular, if you are within the university you may wish to take advantage of the central power management monitoring and wake on LAN facilities available that may reduce the amount of work you need to do to achieve your aim.
We began by estimating our past usage, current usage and potential reduced usage based on different possible implementation scenarios from which we selected an initial aim and further possible steps to take in the future.
Technical systems required
There are three main elements required:
- Desktop hardware that easily supports wake-on-lan (i.e. ideally it is a simple and standard BIOS configuration option and/or software configuration on the machine that is easy to perform and robust when the machine is in a real usage situation). Without this desktop functionality such a project could become impractical to manage and thus generate new problems with other costs to the department.
- Systems/scripts/services in place that can identify when a machine is not in use and shut it down based on the time etc.
- Systems/scripts/services to wake up machines at given times/as required etc as well as potentially provide information and reports
One should also consider the impact on various services if many machines are now off overnight and all potentially come on at the same time. Our Linux desktops already stagger their scheduled updates so should be fine. Windows desktops, however, often pick up updates on reboot, which could have implications.
Clients - Managed shutdown of client/desktop machines
Our aim is to shut down the machines if they are not in use (i.e. no one logged in and no user-based processes running) outside core hours, namely midnight-8am and 6pm-midnight on weekdays and all day at weekends. See the policy being implemented for specific details.
On all client machines one needs to check the BIOS settings and enable the relevant option, e.g. in our case this was typically Power -> APM Configuration -> Power On By PCI(-E) Device.
One also needs to check the OS/installation to see if wake on LAN is enabled by default. Under Windows check this under the Device Manager settings for the network interface (we found it to be on by default).
At this stage it is also sensible to test several typical machines (e.g. we tested one from each major spec purchased in the last few years) to ensure that the changes made actually work: visiting each machine to change BIOS settings and later finding they are not correct would needlessly waste a lot of time.
In the case of Windows machines, Powerdown, developed at the University of Liverpool, is a small and simple system that is easy to install and powers down any Windows machine that has not been in use for 10 minutes.
However, our initial aim was to power down machines outside core hours rather than at any point when they have been unused for 10 minutes. Essentially this is just a small modification to the original Powerdown.
- (Modified) scripts and files used for our Powerdown variant system
- PsTools which provides the binaries to check the machine is not in use and then shut it down
This and the other changes we made are:
- Created a scheduled task file to install rather than using schtasks.exe
- Modified the task file to run the task and binaries from a share (rather than locally installing them on each machine), first checking an exceptions file in case the machine has been made exempt from the task
- Modified the setup script to install the scheduled task file on a client machine (but not installing the task if the machine is in the exceptions file), setting it to run as the system user
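The exceptions check mentioned in the list above can be sketched as follows. This is only an illustration in Python (our actual scripts are Windows scripts run from the share), and the file format assumed here, one hostname per line with optional # comments, is an assumption rather than the format we necessarily use.

```python
def is_exempt(hostname: str, exceptions_text: str) -> bool:
    """Return True if hostname is listed in the exceptions file contents.

    Assumed (hypothetical) format: one hostname per line, '#' starts a
    comment, matching is case-insensitive.
    """
    wanted = hostname.strip().lower()
    for line in exceptions_text.splitlines():
        # Drop any trailing comment and surrounding whitespace.
        entry = line.split("#", 1)[0].strip().lower()
        if entry and entry == wanted:
            return True
    return False
```

Both the task itself and the setup script would skip their work whenever such a check returns true for the local machine name.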
Creating the Scheduled Task File
The command line tool was not flexible enough to create our required task. Note that the graphical walk-through wizard is not flexible enough either, so one needs to use the advanced options of the graphical scheduled task tool to select the necessary options (and in the right order, otherwise options already set change too).
To create the scheduled task select Scheduled Tasks from the Control Panel Classic View, right click the window that appears and choose new task, name the task, e.g. powersave, then double click the task and select the schedule tab. The options that appear include a button to select advanced options. One can now create a weekly task that only runs on certain days of the week with a specific start and end time each day etc. as required. Furthermore one can display multiple schedules and thus define a different period for some days, e.g. to run all weekend rather than just out of hours.
The result is a single file c:\windows\tasks\powersave.job which defines the scheduled task, in our case with one schedule to run between 6pm and 8am on weekdays and another to run all day at weekends, in both cases running the same task. Having created the necessary job file we can then put it into our share to be installed by the setup script as required.
The setup/installation script itself can be something triggered via group policy on next reboot, something you run individually on specific machines as required, or something you deploy via ssh (if you have sshd running, say from cygwin, on your Windows desktops), looping over the relevant machines.
All our Linux machines use the puppet configuration management system so it is very easy for us to manage and adjust the configuration.
Under Linux running the command ethtool eth0 will show you information about your primary network interface. If there is a line showing Wake-on: d then by default wake on LAN is disabled. To enable it you can run the command ethtool -s eth0 wol g. However, this setting does not persist across reboots. Using puppet we thus inserted a file enable-wol under /etc/network/if-up.d/ and /etc/network/if-down.d/ to run this command. As an extra precaution, since we have a configuration management system running, it also checks this is set on each run (approximately every 3 hours) and would correct it if for any reason it were not set.
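As an illustration of the check described above, the following minimal Python sketch parses the output of ethtool and reports whether magic-packet wake-up (the g flag) is currently enabled. The function names are ours for this example only; on our systems the equivalent check is done by puppet.

```python
import subprocess


def wol_magic_enabled(ethtool_output: str) -> bool:
    """Return True if the 'Wake-on:' line of ethtool output contains 'g'."""
    for line in ethtool_output.splitlines():
        key, _, value = line.strip().partition(":")
        # 'Supports Wake-on' is a different key and is deliberately ignored.
        if key == "Wake-on":
            return "g" in value.strip()
    return False


def interface_wol_enabled(interface: str = "eth0") -> bool:
    """Run ethtool on the interface (usually needs root) and parse the result."""
    result = subprocess.run(
        ["ethtool", interface], capture_output=True, text=True, check=True
    )
    return wol_magic_enabled(result.stdout)
```

If the check returns false, running ethtool -s eth0 wol g as described above re-enables the setting.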
We need to insert one or more crontab entries to trigger a sequence of checks and commands to determine whether to shut a machine down at certain times of the day. We insert two crontab entries (using puppet) of the form
- */10 18-23,0-7 * * 1-5 command_to_run
- */10 0-23 * * 6-7 command_to_run
where the first covers the weekday periods for potential shutdown and the second the weekends.
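To make the schedule explicit, the two entries above can be read as the following predicate. This is a sketch for checking one's understanding of the crontab fields, not something we deploy; the function name is hypothetical.

```python
from datetime import datetime


def shutdown_check_fires(when: datetime) -> bool:
    """Return True if either crontab entry would run at this minute."""
    # '*/10' in the minute field: minutes 0, 10, 20, 30, 40 and 50.
    if when.minute % 10 != 0:
        return False
    weekday = when.isoweekday()  # Monday=1 ... Sunday=7, as in the cron fields
    if weekday <= 5:
        # First entry: hours 18-23 and 0-7 on Monday to Friday.
        return when.hour >= 18 or when.hour <= 7
    # Second entry: every hour on Saturday and Sunday.
    return True
```

Note that on weekdays the last check of the morning runs at 07:50 and the first of the evening at 18:00, so the machines are left alone throughout the 8am-6pm core hours.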
The command to be run is
/opt/sbin/check-logins >/dev/null && /opt/sbin/check-uptime 600 >/dev/null && /opt/sbin/check-tsm-schedule 8:00 >/dev/null && [ ! -e /var/lib/puppet/state/puppetdlock ] && /opt/sbin/logged-shutdown "green IT shutdown" -h now
This command is a sequence of checks which if all passed results in a shutdown. The individual scripts/commands (which could no doubt be cleaned up if desired although they work for our needs) are
- check-logins - check no one is logged in otherwise exit test sequence.
- check-uptime - check the machine has been up for N seconds (600 in our case, i.e. 10 minutes), to provide a minimum window after a machine is woken out of hours for someone to log in before it would otherwise shut down again, otherwise exit test sequence.
- check-tsm-schedule - check no TSM backup run is scheduled before the next general wake up event, in our case typically 8am (strictly speaking this does not handle the weekends properly but is a sufficient lower bound on a test so does the right thing still), otherwise exit test sequence. Note this script takes the backup time scheduled from the local log file rather than querying the TSM backup server directly each time it is run.
- [ ! -e /var/lib/puppet/state/puppetdlock ] - check puppet is not performing a configuration run otherwise exit test sequence.
- logged-shutdown - cancels any existing delayed shutdown that may be running (we have an automated system to ensure critical security updates are in place - if a user fails to respond to a series of requests to logout to allow a reboot to occur then ultimately a reboot is automatically scheduled after a further time limit has passed), then writes a log file with various state information, and initiates the desired immediate shutdown.
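The logic of the test sequence above can be sketched in Python as follows. This is not the content of our scripts (which are small shell commands chained with &&); the function names and the simplifications are assumptions for illustration, e.g. the login check here only looks at the output of who and does not look for stray user processes.

```python
from datetime import datetime
from typing import Callable, Iterable


def no_one_logged_in(who_output: str) -> bool:
    """True when the output of `who` lists no sessions."""
    return not any(line.strip() for line in who_output.splitlines())


def up_long_enough(proc_uptime: str, minimum_seconds: float) -> bool:
    """The first field of /proc/uptime is the number of seconds since boot."""
    return float(proc_uptime.split()[0]) >= minimum_seconds


def no_backup_before_wake(next_backup: datetime, now: datetime,
                          next_wake: datetime) -> bool:
    """True unless a backup is scheduled between now and the next wake up."""
    return not (now <= next_backup < next_wake)


def all_checks_pass(checks: Iterable[Callable[[], bool]]) -> bool:
    """Run the checks in order and stop at the first failure, like '&&'."""
    return all(check() for check in checks)
```

Only if every check passes would the final step, the logged shutdown, be run; a failing check ends the sequence without evaluating the later ones.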
The scripts themselves are installed and updated on the machines using the puppet configuration management system although of course they could be managed by any other suitable system or install script etc.
Mac OS X
We currently have only a very small number of managed Mac desktops and have not yet incorporated them into the power management system. The Macs also use puppet for configuration management. One complication/difference with Macs is that they cannot be woken from a shutdown, only from sleep, and thus need to be treated slightly differently (i.e. putting them into sleep mode rather than a full shutdown).
To enable wake-on-lan (or in this case resume-on-lan) on a Mac one needs to run 'pmset -a womp 1'. On our systems this is done by puppet.
Servers/Services - Managed Wake Ups, Monitoring and Reports
For those within the university there are central power management monitoring and wake on LAN facilities available which can provide this part of the solution. To use this system there is some initial work required to set up a FiDo system within the department (or OUCS can install and manage a system for you for an annual fee of £400, which is likely to be a lot less than the money you save by deploying a green IT system). There is then some further work required to register machines with the wake on LAN service, both initially and each time the hardware is changed, and to register specific individuals to be able to wake up specific machines.
One could alternatively implement such a system directly. Whether this is a sensible approach will very much depend on the individual needs and the infrastructure and services already in place.
Some pros of using the OUCS central service appear to be:
- Service design, development and management is OUCS' responsibility. They are providing a service to a wider community so may have more resources to put into it.
- Integration with other OUCS services, e.g. uses SSO and ties in with HFS scheduled backups.
- Basic graphical reporting produced by system to show number of machines on/off at a given time etc.
- Users can be registered to allow them to wake up a given PC via a web interface where access is controlled by SSO authentication.
- Can register a set of days of the week and time of day at which to wake a machine up.
Some possible cons/queries of the service are:
- Central service developments are tailored for the masses and can at times fail to exactly meet our needs.
- Although it is a central service we still need to run (and support and maintain) a WOL control node within our network.
- Although it uses SSO many of our users do not, so at present to some this is a complication not a benefit.
- The service offers the ability for individuals to wake up machines remotely. However, to do this the individual's university barcode number needs to be registered against the machine. If we had to manually register people to machines in an ever changing environment it would not be practical. Since it should not matter who wakes up a machine (provided they are a member of our department), ideally the service would offer an option for any person registered to our department to wake up any given machine. Another option would be to automate this setting from data already available to us, e.g. we have a register of which individuals and machines are in each room; if we also have a register of their university barcode numbers (which we should have already) we can easily script lists of people/barcodes to machines, and if this could be uploaded automatically each night to the OUCS service that could work fine too.
- Machines need to be registered with the system via a web interface. It has a bulk upload facility but if this is still manual then it is generating work for us. We have an asset register and would want to automate the registration of machines with the service based on their presence in our database.
- The service can wake up the machines at a prescribed time. Initial documentation would seem to suggest this is a fixed time on a given set of days of the week. We would ideally like to be able to schedule as many wake up events as desired throughout the week to allow for more flexibility.
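As an illustration of the scripted registration suggested in the list above, the following sketch derives, for each machine, the barcodes of the people who share its room. The three input mappings are hypothetical stand-ins for our room and barcode registers, and no upload format for the OUCS service is assumed here.

```python
from typing import Dict, List


def barcodes_by_machine(
    people_by_room: Dict[str, List[str]],
    machines_by_room: Dict[str, List[str]],
    barcode_by_person: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map each machine to the sorted barcodes of the people in its room."""
    result: Dict[str, List[str]] = {}
    for room, machines in machines_by_room.items():
        barcodes = sorted(
            barcode_by_person[person]
            for person in people_by_room.get(room, [])
            if person in barcode_by_person  # skip people with no barcode on file
        )
        for machine in machines:
            result[machine] = list(barcodes)  # separate copy per machine
    return result
```

The resulting mapping could then be written out in whatever form a bulk upload facility accepts.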
We have chosen to implement a solution purely within the department since we had the majority of components needed already in place and with only minor modifications and extensions have implemented a service which for us meets our needs and integrates more tightly with our existing systems:
- Fixed Asset Register (FAR) - we already have a database containing details of all our machines. When a new machine is installed it is part of the process to register a basic entry, after which all further facts are harvested and kept up to date. In the future we could also look to products such as Foreman which provide this functionality, integrating with the puppet configuration management system we already use to efficiently manage our several hundred systems.
- Configuration Management System - our servers and desktops are configured using puppet and thus deploying further configuration and managing it is relatively straightforward and flexible (there are of course many other configuration management tools such as chef, lcfg, bcfg2 and cfengine). The files puppet uses are themselves kept in a subversion repository to provide managed change tracking.
- System Monitoring - we already run nagios, a very flexible system monitoring tool (there are other good choices too, such as zenoss), where all the configuration is generated from the data in the FAR and managed via puppet. Nagios is already monitoring the state of our systems and the services on those systems, producing reports and plots as required.
- Departmental Web Site - website uses the drupal content management system and already provides access controlled material including lists of machines relevant to our users.
With only a few small adjustments and additions these systems provide all we need:
Wake up services
We have three main wake up scenarios:
- General wake up(s) within the period 8am-6pm
- General wake up of TSM clients prior to TSM backups (in our case on Thursday evenings from about 6:30pm)
- Out of hours wake up of a specific machine by a user
The first two cases are covered by a simple wake up script which leverages the fact that information about all our systems is already available from the FAR. The script, when run, triggers a wake up of all relevant machines. It is installed on our monitoring system and managed via puppet (it depends on the wakeonlan command from the wakeonlan package, which puppet also installs automatically as a dependency) and is run via a crontab entry also managed via puppet. For the simple script the only control over when it runs is the cronjob settings: one entry runs it to bring/keep the machines on during the core working hours and a further entry runs specifically to bring the TSM clients up on Thursday nights prior to the start of our backup window. In theory we could do better in the future for the TSM backups by waking the clients just before their individual slot rather than waking them all together before the first backup, e.g. by storing the backup slot info in the FAR and using this field within the wake up script.
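For readers unfamiliar with what a wake up involves, the sketch below builds and sends a wake-on-LAN "magic packet" directly in Python. Our own script instead calls the wakeonlan command, so this is a stand-in to show the mechanism, not our implementation. The record fields mac and purpose are hypothetical names for FAR columns.

```python
import socket
from typing import Dict, Iterable, List, Optional


def magic_packet(mac: str) -> bytes:
    """Six 0xFF bytes followed by the 6-byte MAC address repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def macs_to_wake(records: Iterable[Dict[str, str]],
                 purpose: Optional[str] = None) -> List[str]:
    """Select MAC addresses from FAR-like records, optionally by purpose."""
    return [r["mac"] for r in records
            if purpose is None or r.get("purpose") == purpose]


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

A cron-driven script would then simply loop over macs_to_wake(...) and call wake for each address.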
As we move forward with the system we may wish to treat different sets of desktops differently, e.g. continue to keep computers in lecture rooms on throughout core hours (as they need to be ready to use immediately) whilst having other desktops off for longer periods, including potentially parts of the working day. In such a case the simple wake up script would need to be called with different filters and periods for different types of machine (machine purpose already being a field in the FAR and hence easy to filter on more finely than the current example). We now use such a system, with an extended wake up script and associated configuration file for the wake ups and an additional initiate check for the shutdowns. There is also a modified scheduled task for the Windows machines to use the wake.conf file too. Using the configuration file with both the wake ups and shutdowns means that in principle these cronjobs could run 24 hours a day, as the configuration file can now control whether anything is actually allowed to happen/try to happen.
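The idea of a configuration file deciding whether a wake up is allowed can be sketched as below. The format shown (one section per machine purpose, with days, start and end keys) is invented for this example and is not the actual format of our wake.conf.

```python
import configparser
from datetime import datetime


def wake_allowed(conf_text: str, purpose: str, when: datetime) -> bool:
    """Return True if machines of this purpose should be on at this time.

    Hypothetical format, one section per purpose:
        [lecture]
        days = 1,2,3,4,5      (ISO weekdays, Monday=1)
        start = 08:00
        end = 18:00
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if not parser.has_section(purpose):
        return False  # unknown purposes are never woken
    section = parser[purpose]
    days = {int(day) for day in section["days"].split(",")}
    start = datetime.strptime(section["start"], "%H:%M").time()
    end = datetime.strptime(section["end"], "%H:%M").time()
    return when.isoweekday() in days and start <= when.time() < end
```

A shutdown check could consult the same function and decline to shut a machine down while wake_allowed is true for its purpose, which is what lets both cronjobs run around the clock.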
The third case is covered by two approaches:
- A wake script that a user may run from any Linux system in the department to wake any other by name. The script uses a lookup list of MAC addresses extracted from the FAR and managed by puppet (as is the script itself of course), and calls the wakeonlan command line tool from the wakeonlan package (which is of course automatically installed by our configuration management system).
- An access controlled web based list of machines (which already existed and is produced dynamically within our CMS via a function that queries the FAR) which has been extended (by extending the function slightly) to also include a link that runs a wake up of the specific machine.
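The name-based wake up described in the list above amounts to a lookup followed by a call to wakeonlan. A minimal sketch, assuming a lookup list with one "hostname MAC" pair per line (the real list format may differ):

```python
from typing import List, Optional


def lookup_mac(name: str, lookup_text: str) -> Optional[str]:
    """Find the MAC address for a machine name in the lookup list."""
    for line in lookup_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].lower() == name.lower():
            return fields[1]
    return None


def wake_command(name: str, lookup_text: str) -> List[str]:
    """Build the wakeonlan command line for the named machine."""
    mac = lookup_mac(name, lookup_text)
    if mac is None:
        raise KeyError(f"unknown machine: {name}")
    return ["wakeonlan", mac]
```

The resulting argument list can be passed to subprocess.run; the web page link would trigger the same lookup on the server side.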
Plots and Reports
Nagios is already monitoring whether our machines are up or down etc. In order to avoid lots of warnings out of hours about machines being down (that are allowed to be down) simply create a new time period within the nagios timeperiods.cfg configuration file and use this time period as the check period for machines that will now be part of the power management system. As ever, the time periods file is installed and managed by puppet for the monitor server and the modified check period is set in the relevant template entries for other nagios configuration entries for system and service monitoring which are then used as before when the configuration is all built from the underlying data held in the FAR.
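Since our nagios configuration is generated by scripts, a time period definition of the kind described above can be rendered from a small data structure. The sketch below is an assumption about how one might do this; the period name corehours and the hours used in the usage note are illustrative only.

```python
from typing import Dict, List


def render_timeperiod(name: str, alias: str,
                      hours_by_day: Dict[str, List[str]]) -> str:
    """Render a nagios 'define timeperiod' block as text."""
    lines = [
        "define timeperiod{",
        f"    timeperiod_name {name}",
        f"    alias           {alias}",
    ]
    for day, ranges in hours_by_day.items():
        # Several ranges on one day are comma separated, e.g. 00:00-08:00,18:00-24:00
        lines.append(f"    {day:<15} {','.join(ranges)}")
    lines.append("    }")
    return "\n".join(lines)
```

For example, render_timeperiod("corehours", "Core working hours", {"monday": ["08:00-18:00"], ...}) with an entry for each weekday and none for the weekend gives a period that can be used as the check_period for the power managed machines.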
Nagios already keeps a log of which machines are up/down and using pnp4nagios (again the appropriate nagios configuration lines etc are already being created using simple scripts and data from the FAR) also provides plots of relevant data for each host/service using the performance data that the nagios checks already produce. In order to thus produce suitable aggregated statistics of interest one simply needs a mechanism to use the data already available. The simple solution used was to actually write an additional nagios check which parses the nagios log file together with data from the FAR producing the relevant performance data. Having a single check that produces the aggregated statistic it can then easily be plotted by pnp4nagios with one additional config line in the nagios configs which as ever is a simple tweak of the scripts already generating the config which is again already being managed via puppet.
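A much simplified sketch of such an aggregating check is shown below. It assumes the standard nagios log lines for host states (HOST ALERT and CURRENT/INITIAL HOST STATE), takes the list of managed hosts as a stand-in for the FAR query, and treats hosts with no log entry as up; our real check differs in detail.

```python
import re
from typing import Dict, Iterable

HOST_LINE = re.compile(
    r"^\[(?P<time>\d+)\] "
    r"(?:HOST ALERT|CURRENT HOST STATE|INITIAL HOST STATE): "
    r"(?P<host>[^;]+);(?P<state>UP|DOWN|UNREACHABLE);(?P<kind>SOFT|HARD);"
)


def latest_states(log_text: str) -> Dict[str, str]:
    """Return the most recent HARD state seen for each host in the log."""
    states: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = HOST_LINE.match(line)
        if match and match.group("kind") == "HARD":
            states[match.group("host")] = match.group("state")
    return states


def plugin_output(log_text: str, managed_hosts: Iterable[str]) -> str:
    """Build a plugin status line with performance data after the '|'."""
    hosts = list(managed_hosts)
    states = latest_states(log_text)
    off = sum(1 for host in hosts if states.get(host, "UP") != "UP")
    percent = 100.0 * off / len(hosts) if hosts else 0.0
    return (f"OK - {off}/{len(hosts)} machines off"
            f" | off={off} total={len(hosts)} pct_off={percent:.1f}%")
```

The text after the | is the performance data that pnp4nagios picks up and plots, so one status line per FAR filter yields one aggregated graph.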
We actually call the check with several different FAR filters to produce different aggregated statistics. As well as the overall statistics being published we can also easily produce statistics by building (we are split across 5 at present), operating system, machine purpose etc. One can of course read too much into statistics, so one should be careful about their interpretation. In general terms after initial deployment, without actively encouraging users to change their natural behaviour, we are seeing about 55-65% of machines off at night and at weekends, slightly exceeding our initial target. Looking at the different OSes we see a much higher percentage of Windows machines off (typically about 85%) than Linux machines (typically about 60%), which is no doubt a reflection of the fact that the Windows machines are largely used by administrative staff who work regular core hours and log out nightly, compared to a broader range of academics who work more variable hours and some of whom rarely, if ever, log out.