 
 
C H P O X

CHeckPOinting for linuX

Checkpointing is a technique for saving the context of a process to a disk file and later restoring the process from that file. A process that has been checkpointed and restarted should continue to work as if it had never been interrupted. This is useful for long-running tasks (e.g. numerical simulations) that may be interrupted by system instability, power failures, reboots, etc. Checkpointing is usually a feature of advanced clustering operating systems.

CHPOX is a kernel module that provides process checkpointing for Linux.


FEATURES

Transparent dumping of the state of a specified process, or of a process together with all its child processes, into a disk file. The process or group of processes can later be restarted from that file at the point where they were dumped.

CHPOX supports dumping of virtual memory, regular files, terminal state, the current working directory, pipes, Unix sockets, and multiple non-interacting processes.

It does not crash on openMosix [4] and is SMP-safe. Because it works as a kernel module, you do not need to recompile the Linux kernel or to recompile and relink your programs.

CHPOX should work with various Linux kernels based on the 2.4.x series, e.g. MOSIX [2]. The development team tests CHPOX only with vanilla and openMosix [4] Linux kernels, so other patched kernels may have compilation or runtime issues.


INSTALLATION

Before installation you must have configured and compiled kernel sources and the corresponding System.map file.

  1. Unpack the archive with the CHPOX sources:

    tar -xzf chpox-<version>.tar.gz

  2. Change to the directory containing the source code and run the './configure' script:

    cd chpox-<version>
    ./configure


    You can pass the path to the Linux sources as an argument to 'configure':

    ./configure --with-linux=/path/to/kernel/sources

    If there is no System.map file in the Linux source directory, you must specify its path:

    ./configure --with-sysmap=/path/to/System.map

    See the output of

    ./configure --help

    for other configure options.

  3. Run 'make', 'make install' and 'depmod -ae':

    make
    make install
    depmod -ae

!!!WARNING: Compile CHPOX with the same compiler that was used to compile the kernel.
!!!WARNING: Recompile CHPOX after recompiling the kernel.



USAGE


Before checkpointing or restoring processes you must load the CHPOX module:

modprobe chpox_mod

or

insmod chpox_mod

Checkpointing is controlled through the proc interface. You can register or unregister processes by writing a string of the form

<pid>:<signal>:<arg>:<dump file>   into the file   '/proc/chpox/register'.

This registers or unregisters the process with identifier <pid> (optionally together with all its child processes). Passing <arg> == 0 unregisters the process with the given <pid> (if <pid> is 0, all registered processes are unregistered). If <arg> is not 0, the given process (or group of processes) is registered. The dump is written to the file <dump file>.

The meaning of <arg>:

  • if bit 2 is set, the executable file is included in the dump. This can be useful for transferring processes between different machines that have the same shared libraries.
  • if bit 3 is set, the registered shared libraries are included in the dump file.
  • if bit 4 is set, CHPOX also checkpoints all child processes of the given process.
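A minimal shell sketch of composing these strings follows. It assumes the bits are counted from 1, with bit 1 (value 1) marking an ordinary registration, so that bit 2 = 2, bit 3 = 4 and bit 4 = 8. This numbering is inferred from the example values 1 and 9 (= 1 + 8, "with all child processes") in the examples, not stated explicitly; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for building /proc/chpox/register strings.
# ASSUMPTION: bits are counted from 1; value 1 = ordinary registration,
# bit 2 = 2 (include executable), bit 3 = 4 (include registered libraries),
# bit 4 = 8 (include child processes). Inferred from example values 1 and 9.

# chpox_arg EXE LIBS CHILDREN   (each argument is 0 or 1)
chpox_arg() {
    arg=1
    if [ "$1" = 1 ]; then arg=$((arg + 2)); fi
    if [ "$2" = 1 ]; then arg=$((arg + 4)); fi
    if [ "$3" = 1 ]; then arg=$((arg + 8)); fi
    echo "$arg"
}

# chpox_register_line PID SIGNAL ARG DUMPFILE
chpox_register_line() {
    printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
}

# chpox_unregister_line [PID]   (PID 0, the default, means all processes)
chpox_unregister_line() {
    printf '%s:0:0:\n' "${1:-0}"
}
```

For example, `chpox_register_line 1234 31 "$(chpox_arg 0 0 1)" /tmp/proc.dump > /proc/chpox/register` would produce the "with all child processes" registration string 1234:31:9:/tmp/proc.dump, provided the bit numbering above is correct.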

After a process is registered, you can checkpoint it as many times as necessary by sending it the signal <signal>.

Examples

Registering the process with PID 1234 for checkpointing with signal 31 (SIGSYS):

echo "1234:31:1:/tmp/proc.dump" > /proc/chpox/register

Registering a process together with all its child processes:

echo "1234:31:9:/tmp/proc.dump" > /proc/chpox/register

Unregistering one process:

echo "1234:0:0:" > /proc/chpox/register

Unregistering all registered processes:

echo "0:0:0:" > /proc/chpox/register

After the process is registered, you can checkpoint it as many times as necessary by sending it signal 31:

kill -31 1234
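Since every signal produces a new checkpoint, periodic checkpointing can be driven from an ordinary shell loop. The sketch below (hypothetical function `chpox_periodic`) assumes the process has already been registered with the same signal; otherwise the signal's default action applies, which for SIGSYS terminates the process. The loop stops when the process exits or after a given number of signals.

```shell
#!/bin/sh
# Hypothetical helper: chpox_periodic PID INTERVAL SIGNAL MAX
# Sends SIGNAL to PID every INTERVAL seconds, at most MAX times, and
# prints how many signals were actually sent. The first signal is sent
# only after the first INTERVAL has elapsed.
# ASSUMPTION: PID is already registered in /proc/chpox/register with SIGNAL.
chpox_periodic() {
    pid=$1 interval=$2 sig=$3 max=$4
    n=0
    # kill -0 only tests whether the process still exists
    while [ "$n" -lt "$max" ] && kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        kill -"$sig" "$pid" 2>/dev/null || break
        n=$((n + 1))
    done
    echo "$n"
}
```

For instance, `chpox_periodic 1234 3600 31 24` would request an hourly checkpoint of process 1234 for one day.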

Extended features

The file "/proc/chpox/info" holds information about registered processes in the form:

<pid>:<signal>:<arg> [<flag>|<number of checkpoints>] -> <filename> [<last checkpoint time>]

The flag is one of:
C - new entry
S - checkpoint is in progress
O - last request was processed correctly
E - last request caused an error

Example: 2108:31:9 [O|1] -> /tmp/proc.dump [1040130768.799984]
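To monitor registered processes from a script, the lines of /proc/chpox/info can be split with awk. This sketch (hypothetical function `chpox_status`) assumes every line follows the layout of the example above and that dump file names contain no whitespace; it prints the PID, flag, number of checkpoints and dump file of each entry.

```shell
#!/bin/sh
# Hypothetical helper: summarize /proc/chpox/info (read on stdin) as
# "<pid> <flag> <number of checkpoints> <dump file>", one entry per line.
# ASSUMPTION: lines look like "2108:31:9 [O|1] -> /tmp/proc.dump [...]"
# and file names contain no whitespace.
chpox_status() {
    awk '{
        split($1, id, ":")                      # "2108:31:9" -> pid, signal, arg
        inner = substr($2, 2, length($2) - 2)   # "[O|1]" -> "O|1"
        split(inner, st, "[|]")                 # st[1] = flag, st[2] = count
        print id[1], st[1], st[2], $4
    }'
}
```

For example, `chpox_status < /proc/chpox/info | awk '$2 == "E"'` would list the entries whose last request caused an error.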

The file "/proc/chpox/libs" holds the list of libraries registered for inclusion in dumps (see also the `chpoxctl' program). This is necessary if you need to checkpoint a process on one machine and restart it on another machine that has different library versions. You can add a library to the list by writing a string of the form "+<library file name>" to this file:

echo "+/lib/ld-linux.so.2" > /proc/chpox/libs

To remove a library from the list, write the string "-<library file name>":

echo "-/lib/ld-linux.so.2" > /proc/chpox/libs

You can clear the list of libraries by writing the string "-":

echo "-" > /proc/chpox/libs

Any manipulation of the library list requires root privileges.
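One way to fill this list for a given program is from the output of ldd(1). The sketch below assumes ldd prints absolute library paths, as it does with glibc; the function names are hypothetical, and the target file is a parameter only so the loop can first be tried against an ordinary file.

```shell
#!/bin/sh
# Hypothetical helpers: register every shared library of a program in
# the CHPOX library list.

# Print the absolute paths found in ldd output read on stdin. Lines such as
# "libc.so.6 => /lib/libc.so.6 (0x...)" or "/lib/ld-linux.so.2 (0x...)"
# yield one path each; entries without a path (e.g. the vdso) are skipped.
chpox_lib_paths() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i }'
}

# chpox_add_libs PROGRAM [TARGET]   (TARGET defaults to /proc/chpox/libs;
# writing there requires root)
chpox_add_libs() {
    target=${2:-/proc/chpox/libs}
    ldd "$1" | chpox_lib_paths | while read -r lib; do
        # one "+<library file name>" per write; >> keeps every line when
        # TARGET is an ordinary file used for a dry run
        echo "+$lib" >> "$target"
    done
}
```

For example, `chpox_add_libs /usr/bin/myprog` (a hypothetical program) run as root would add each library that ldd reports for it.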
The file "/proc/chpox/version" contains the version number of the chpox module.

User-level tools

Another way to control CHPOX is the user-level program `chpoxctl', which uses the ioctl interface of chpox. See the output of the command

chpoxctl --help

for details.

Examples:

Registering the process with PID 1234 for checkpointing with signal 31 (SIGSYS):

chpoxctl add 1234 31 1 /tmp/proc.dump

Registering a process together with all its child processes:

chpoxctl add 1234 31 9 /tmp/proc.dump

Unregistering one process:

chpoxctl del 1234

Unregistering all registered processes:

chpoxctl clear

Adding a library to the list:

chpoxctl addlib /lib/ld-linux.so.2

Removing one library from the list:

chpoxctl dellib /lib/ld-linux.so.2

Removing all libraries from the list:

chpoxctl clearlibs

Displaying the list of registered libraries:

chpoxctl liblist

RESTORING PROCESSES

To restore a checkpointed process, run the program `ld-chpox' while the chpox module is loaded:

ld-chpox /tmp/proc.dump

See the output of the command

ld-chpox --help

for details.
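Because ld-chpox needs the module to be loaded, a restart script may want to check for that first. A hedged sketch: the function name is hypothetical, and it assumes that the presence of /proc/chpox/version indicates the module is loaded. The proc directory is a parameter only to make a dry run possible.

```shell
#!/bin/sh
# Hypothetical wrapper: chpox_restore DUMPFILE [PROCDIR]
# ASSUMPTION: PROCDIR/version (default /proc/chpox/version) exists exactly
# when the chpox module is loaded.
chpox_restore() {
    dump=$1 procdir=${2:-/proc/chpox}
    if [ ! -e "$procdir/version" ]; then
        echo "chpox module not loaded (try: modprobe chpox_mod)" >&2
        return 1
    fi
    if [ ! -r "$dump" ]; then
        echo "cannot read dump file: $dump" >&2
        return 1
    fi
    ld-chpox "$dump"
}
```

Usage would be `chpox_restore /tmp/proc.dump`, which falls through to `ld-chpox /tmp/proc.dump` once both checks pass.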

REGISTERING CHPOX FILE FORMAT AS EXECUTABLE

In order to run chpox dumps as executable files, you can register the dump format as a misc binary format. This requires misc binary format support in the Linux kernel (the CONFIG_BINFMT_MISC option).
On a Debian system you can install the package binfmt-support and execute the following command as root:

update-binfmts --install chpox /usr/local/bin/ld-chpox --magic "CHPOX"

On other systems you can register the chpox format by writing the string ":chpox:M:0:CHPOX::/usr/local/bin/ld-chpox:" into the file "/proc/sys/fs/binfmt_misc/register". In that case, however, you will need to repeat this after each reboot.
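Since this registration is lost at reboot, the write can be placed in a boot script. A minimal sketch that skips the write when an entry named chpox already exists (binfmt_misc exposes each registered format as a file named after it). It assumes binfmt_misc is already mounted; the function name and the directory parameter are hypothetical, the latter only to allow a dry run.

```shell
#!/bin/sh
# Hypothetical helper: chpox_binfmt [DIR]   (DIR defaults to
# /proc/sys/fs/binfmt_misc, which must already be mounted)
# Registers the CHPOX dump format unless an entry named "chpox" exists.
chpox_binfmt() {
    dir=${1:-/proc/sys/fs/binfmt_misc}
    if [ -e "$dir/chpox" ]; then
        echo "chpox already registered"
        return 0
    fi
    # name=chpox, type M (magic), offset 0, magic "CHPOX", no mask,
    # interpreter /usr/local/bin/ld-chpox, no flags
    echo ':chpox:M:0:CHPOX::/usr/local/bin/ld-chpox:' > "$dir/register"
}
```

Calling `chpox_binfmt` as root from a boot script is therefore safe to repeat.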

OPERATION DETAILS

Registering a process for checkpointing blocks the specified signal for that process. The signal is blocked by installing a notifier function that returns 1 and clears the signal queue before returning. When the specified signal is sent to the process, the notifier is executed in the context of the process being checkpointed. While the notifier runs, the process's virtual memory areas (VMAs) and the information about its files are dumped. If MOSIX is configured, the process first returns to its home node before dumping. This mode of operation is similar to that of EPCKPT [1]. EPCKPT is very powerful (it supports almost everything except sockets), but it requires patching the kernel and therefore does not work with MOSIX. CRAK, by contrast, stops the process before checkpointing and then operates on its structures. CRAK is not designed for SMP (at least the version for Linux 2.4.4), although it is claimed to support sockets (not in the version for Linux 2.4.4).

EXAMINING CHPOX DUMP FILES

To see information about a chpox dump file, you can use the file(1) program. Execute the command:

file -m chpox.magic <dump-file>

and you will see the following information about the dump file: the chpox file format version, the architecture for which the dump was created, whether the file is complete or was corrupted during checkpointing, and the number of child processes dumped.
The file chpox.magic is included in the chpox distribution. Alternatively, you can merge it into the /etc/magic file.
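For scripts, a cheaper test is to compare the first five bytes of a file with the magic string "CHPOX", the same signature used in the binfmt_misc registration string. This only recognizes the format; it cannot tell whether a dump is complete, which is what file -m chpox.magic reports. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: is_chpox_dump FILE
# Succeeds if FILE starts with the "CHPOX" magic at offset 0.
# It does not check whether the dump is complete or corrupted.
is_chpox_dump() {
    [ "$(dd if="$1" bs=5 count=1 2>/dev/null)" = "CHPOX" ]
}
```

For example, `for f in /tmp/*.dump; do is_chpox_dump "$f" && echo "$f"; done` lists the files that look like chpox dumps.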

AUTHORS

Parallel Computing Cluster Group,
Information & Computer Center,
National Taras Shevchenko University of Kyiv

Olexander O. Sudakov, Ph.D.
Eugeniy S. Meshcheryakov

CHPOX is based on:

VMADUMP   Erik Hendriks
EPCKPT    Eduardo Pinheiro
CRAK      Hua Zhong

VERSION

This is version 0.7.2. It is tested with Linux kernel version 2.4.32.

BUGS

This version does not work correctly with interactive programs.
Probably many more.

SUPPORTED ARCHITECTURES

This version of CHPOX has been tested on i386, PowerPC (in an emulator) and s390/s390x (in an emulator) architectures.
If you succeed in using CHPOX on PowerPC or s390/s390x machines, please let us know.
Please also let us know if you have trouble compiling or using CHPOX.
We have only i386 machines in our department, so if you have problems using CHPOX on other architectures, we can only start solving them if someone can grant us access to a machine of the appropriate architecture.

TODO

Support for Internet sockets, shared memory, System V IPC, and multithreaded processes.
Better integration with openMosix.

DOWNLOAD

By downloading chpox-1.0.tar.gz, I state that I know and agree that CHPOX IS PROVIDED IN ITS ``AS IS'' CONDITION, WITH NO WARRANTY WHATSOEVER, THAT NO LIABILITY OF ANY KIND FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF CHPOX WILL BE ACCEPTED.

DEBIAN PACKAGES

You can also download precompiled Debian packages. To do this, add the following lines to your /etc/apt/sources.list file:

deb http://www.cluster.kiev.ua/support/files/chpox stable main
deb-src http://www.cluster.kiev.ua/support/files/chpox stable main

Alternatively, you can download the packages directly. The package chpox contains the user-level tools, and the package chpox-source contains the source code of the Linux kernel module, which can be compiled with make-kpkg(1).

REFERENCES

  1. http://www.checkpointing.org/
  2. http://www.mosix.org/
  3. http://www.beowulf.org/software/bproc.html
  4. openMosix

CHPOX LINKS

Below are links to other projects that use chpox. If your project uses CHPOX successfully, please let us know.

  1. openMosix Add-Ons and Community Contributions
  2. ClusterKnoppix
  3. The chpox - Checkpointing Utility and How to Use It, by Matt Rechenburg
  4. General openMosix daemon (gomd)
  5. Quantain
  6. Checkpointing and Distributed Shared Memory in openMosix by Mulyadi Santosa

PUBLICATIONS

  1. Process checkpointing and restart system for Linux / Sudakov O.O., Boyko Yu.V., Tretyak O.V., Korotkova T.P., Meshcheryakov E.S. // Mathematical Machines and Systems. -2003. -N.2. -P.146-153.
 
 
 
Kyiv National Taras Shevchenko University
© Information and Computer Center, 2002-2017