Reliable Cron Across the Planet

Arthur Clune
Published in Impossible Dream · 2 min read · Aug 19, 2016

This paper by two Google engineers is a non-technical explanation of the issues found in large-scale distributed systems.

Starting with cron on a single machine, it nicely shows the differences that emerge as a job runner is scaled out to a distributed system. How do you know a job has started? Finished? How many copies of it are running? What happens when everyone schedules their job at midnight?

First they define a scope for jobs: jobs are submitted not to a machine but to a data centre. By restricting themselves to one DC they keep the problem tractable while retaining enough power to solve major problems.

The system is based on a master node that can modify the shared state, plus slave (worker) nodes. Failure of the master is handled by electing a new one; failed workers are simply replaced.
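A minimal sketch of the failover idea: when the master dies, the surviving nodes agree on a replacement. The real system uses Paxos for this; here, purely for illustration, the lowest-id live node wins, and the `elect_master` helper and node names are hypothetical.

```python
# Hypothetical sketch: on master failure, pick a new master from the live
# nodes. A real deployment would run Paxos/Raft; deterministically choosing
# the lowest live node id stands in for the agreement step.
def elect_master(nodes: dict) -> str:
    """nodes maps node id -> is_alive; return the new master's id."""
    live = sorted(n for n, alive in nodes.items() if alive)
    if not live:
        raise RuntimeError("no live nodes to elect from")
    return live[0]

# node-a (the old master) has failed; node-b takes over.
cluster = {"node-a": False, "node-b": True, "node-c": True}
print(elect_master(cluster))  # node-b
```

Because every node applies the same deterministic rule to the same membership view, they all converge on the same answer, which is the property the consensus layer actually guarantees.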

Jobs are labelled with a precomputed name and the launch timestamp. Using the label to ensure uniqueness, the master gives the task to a slave, which then announces the start and finish of the job via a Paxos system that shares the state of all jobs. This allows the master to track which jobs have started, crashed or successfully completed.
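The labelling scheme can be sketched in a few lines: because the label is derived deterministically from the job name and the launch timestamp, every replica computes the same identity independently, and a duplicate launch can be detected by a lookup in the shared state. The function names and the in-memory `set` standing in for the Paxos-replicated store are my own illustration, not the paper's API.

```python
import hashlib

def job_label(job_name: str, launch_ts: int) -> str:
    """Deterministic label: the same (name, timestamp) always maps to it."""
    raw = f"{job_name}@{launch_ts}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

launched = set()  # stand-in for the Paxos-replicated shared state

def try_launch(job_name: str, launch_ts: int) -> bool:
    """Launch only if this exact (job, timestamp) has not run before."""
    label = job_label(job_name, launch_ts)
    if label in launched:
        return False      # duplicate: this launch was already recorded
    launched.add(label)   # in the real system this write goes via Paxos
    return True

print(try_launch("daily-report", 1471564800))  # True: first launch
print(try_launch("daily-report", 1471564800))  # False: deduplicated
```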

As ever, failure conditions are the hardest problem, and jobs have to provide a way for the system to look up their state:

"To achieve this condition, all operations on external systems, which we may need to continue upon re-election, either have to be idempotent (i.e., we can safely do them again), or we need to be able to look up their state and see whether they completed or not, unambiguously."
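The two options in that quote can be sketched side by side, using an in-memory dict as a stand-in for the external system; the function names (`idempotent_put`, `resume`) and keys are hypothetical, chosen only to make the contrast concrete.

```python
external_state = {}  # stand-in for an external system (e.g. a datastore)

def idempotent_put(key: str, value: str) -> None:
    """Option 1: safe to repeat. Re-running after re-election changes nothing."""
    external_state[key] = value  # same key, same value: replays are harmless

def resume(key: str, value: str) -> str:
    """Option 2: look up the state first, then decide whether to redo work."""
    if key in external_state:
        return "already-done"    # unambiguous: the operation completed earlier
    idempotent_put(key, value)
    return "done"

idempotent_put("job-42", "ok")
idempotent_put("job-42", "ok")   # replayed after re-election: no harm done
print(resume("job-42", "ok"))    # already-done
print(resume("job-43", "ok"))    # done
```

A newly elected master can therefore replay the operation log blindly (option 1) or query the external system first (option 2); either way, a crash mid-operation never leaves the job's effect in doubt.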

Finally, even Google have synchronisation issues with jobs.
