Here at the Cancer Research UK Manchester Institute, we have computing needs typical of most biological labs, and indeed of scientists in general: large batch workloads, a cluster of moderately powerful computers in the basement, and impatient users who want to spend less of their time waiting on compute jobs. We developed SCAN to use the elastic compute cloud to dynamically expand our compute resources as needed, keeping our users happy whilst staying within budget.
This hybrid cloud, the combination of our on-site, owned resources and public, hireable infrastructure, can bring us the best of both worlds. On quiet days, when job queues are short and owned resources would otherwise sit idle, we can dedicate generous shares of our private cluster to each job and complete work ahead of schedule. On busy days, meanwhile, we can grant jobs only the minimal share of private resources needed to meet their deadlines, hiring just enough public capacity to avoid otherwise-inevitable overruns.
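As a toy illustration (not SCAN's actual policy), the quiet-day/busy-day decision might be sketched as follows; every name and number here is hypothetical:

```python
def plan_allocation(queued_jobs, private_slots, slots_per_job):
    """Hypothetical policy sketch: share private slots among queued jobs,
    hiring public slots only to cover the shortfall."""
    demand = queued_jobs * slots_per_job
    if demand <= private_slots:
        # Quiet day: give each job a generous share of owned resources.
        return {"private_per_job": private_slots // max(queued_jobs, 1),
                "public_slots": 0}
    # Busy day: minimal private share each, hire just enough public capacity.
    return {"private_per_job": private_slots // queued_jobs,
            "public_slots": demand - private_slots}

plan_allocation(2, 64, 8)   # quiet: 32 private slots each, nothing hired
plan_allocation(16, 64, 8)  # busy: 4 each, 64 public slots hired
```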
The problems facing a hybrid cloud scheduler are well known. We must anticipate the future well enough that generous allocations today do not starve newly enqueued jobs tomorrow, forcing us onto expensive public resources when our private cloud would have sufficed. We must account for the fact that the network link between our local site and the public cloud is far thinner than the links within our local cluster, and either assign a job running in the cloud extra compute resources to make up for data transfer delay, or carefully select jobs with low I/O requirements for cloud execution. We must learn jobs' characteristics well enough that our scheduler can make the right allocations. Above all, we must accomplish all of this without requiring more from our users than the typical non-specialist scientist is willing or able to give. We cannot ask them how the number of parallel threads or processes employed in a particular task relates to completion time, or how command-line options relate to I/O bandwidth requirements. All of this must be learned transparently by the system if it is to be practically usable. We can, however, ask for more intuitively understandable input: when does this task *need* to be done, and when would you *like* it done?
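Those two deadlines are the one piece of scheduling input we do ask of users. A minimal sketch of what such a submission might carry, with all field names hypothetical rather than SCAN's real interface:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Job:
    """Hypothetical job record: the user supplies only a command and
    two intuitive deadlines; everything else is learned by the system."""
    command: str
    essential_deadline: datetime  # when the task *needs* to be done
    desired_deadline: datetime    # when the user would *like* it done

job = Job(
    command="align_reads --input sample1.fastq",
    essential_deadline=datetime.now() + timedelta(days=2),
    desired_deadline=datetime.now() + timedelta(hours=6),
)
```

The gap between the two deadlines is what gives the scheduler room to trade waiting time against hired public capacity.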
SCAN can then use CELAR to adjust the pool of public resources to maximise user happiness per unit of cost invested in public cloud resources. Whilst CELAR itself does not consider multiple clouds or hybrid clouds, we can hide from it the fact that SCAN is also aware of a private cloud, and use CELAR to measure and control the public side of our operation.
CELAR’s ability to learn the relationship between workloads and their performance characteristics is particularly useful to SCAN. Users cannot be expected to characterise their workloads accurately in advance, since they are not computing specialists, and administrators cannot anticipate every factor that will affect a given analysis’ runtime and profile it accordingly. Instead, we can expose as much information as possible to CELAR and let it learn when a particular command-line option or input file variation means that extra resources should be allocated to that kind of task.
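A toy stand-in for this kind of learning, assuming nothing about CELAR's real models: track observed runtimes per command-line option and predict pessimistically from whichever option has the worst average.

```python
from collections import defaultdict

class RuntimeModel:
    """Hypothetical sketch, not CELAR's learner: average observed runtime
    per command-line option, predicting from the costliest option seen."""
    def __init__(self):
        self.totals = defaultdict(lambda: [0.0, 0])  # option -> [sum, count]

    def observe(self, options, runtime_hours):
        for opt in options:
            entry = self.totals[opt]
            entry[0] += runtime_hours
            entry[1] += 1

    def predict(self, options):
        seen = [self.totals[o] for o in options if self.totals[o][1] > 0]
        if not seen:
            return None  # never seen any of these options before
        return max(total / count for total, count in seen)

model = RuntimeModel()
model.observe(["--deep-scan"], 8.0)
model.observe(["--quick"], 0.5)
model.predict(["--deep-scan"])  # pessimistic estimate: 8.0 hours
```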
The cost of moving data over the network, to and from the public cloud, varies with local and global network traffic, and with the amount of output a particular job produces. CELAR can implicitly help us anticipate this problem as well, by learning the relationship between the compute resources a job was assigned and whether it met the user’s desired or essential deadlines once data transfer time was taken into account.
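A back-of-the-envelope version of that transfer-aware deadline check might look like the following sketch, which assumes perfect parallel speedup and a fixed link bandwidth (both deliberate simplifications):

```python
def meets_deadline(compute_hours, cores, transfer_gb, bandwidth_gb_per_hour,
                   hours_remaining):
    """Hypothetical check: a cloud job's effective runtime is its parallel
    compute time plus the time to move its data over the site-to-cloud link."""
    runtime = compute_hours / cores + transfer_gb / bandwidth_gb_per_hour
    return runtime <= hours_remaining

# 16 core-hours of work on 8 cores, 10 GB to move at 5 GB/hour:
# 2h compute + 2h transfer = 4h, so a 6-hour deadline holds, a 3-hour one fails.
meets_deadline(16, 8, 10, 5, 6)  # True
meets_deadline(16, 8, 10, 5, 3)  # False
```

In this framing, a job with heavy I/O needs either more cores or a longer deadline to remain viable in the public cloud, which is exactly the trade-off described above.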
SCAN is currently at the prototype stage: it can dispatch work against public cloud resources that are dynamically allocated according to demand, and adjust the degree of parallelism (and so the resource allocation) used for a particular job according to CELAR’s recommendations. Going forwards in 2015, we plan to explore the best ways to expose information about jobs to CELAR to optimise learning, and to extend SCAN, which currently runs against a public cloud alone, to use a full hybrid cloud. We anticipate an initial public prototype release during 2015.
To summarise, SCAN provides a bridge between CELAR’s advanced ability to learn the characteristics of cloud workloads and the batch-oriented interface scientific users expect. It will allow non-expert users to get the very best out of their hybrid cloud with minimal administrative effort.
– Chris Smowton, Cancer Research UK
To find out more about the CELAR platform and the progress of SCAN, why not join our LinkedIn group or follow us on Twitter?