3. Checkpoint/Restart Design

3.1. Requirements

Three code functional requirements of I/O in Cello are:

  1. writing data dumps for subsequent reading by external analysis/visualization applications

  2. writing checkpoint files, and

  3. reading checkpoint files to restart a previously run simulation

(While writing image files such as “png” files is also included in the I/O component of Cello, here we focus on HDF5 files containing actual block data.)

Additionally, writing and reading disk files must be scalable to the largest simulations runnable on the largest HPC platforms available, which necessarily include the largest parallel file systems available.

This scalable I/O approach has been implemented for checkpoint/restart, and will be adapted for use with data dumps in the near future.

3.2. Approach

The approach used includes determining a block ordering to aid mapping blocks to files, what data are written to the files, and how file I/O is parallelized.

3.2.1. Ordering

The approach involves a generalization of the previous MethodOutput method, but enables load-balancing of data between disk files through the use of block orderings to define how blocks are mapped to files. Currently, the ordering used in MethodOutput, which is implicit and embedded in the code, is based on a regular partitioning of root-level blocks together with their descendents. The updated implementation factors out this ordering into an Ordering class, provides a Morton space-filling curve ordering, and allows enables defining other orderings, such as Hilbert curves

3.2.2. File content

The content of the data files must be augmented to include all state data required to recreate a previously saved AMR block array on restart. Some information such as block connectivity are generated as blocks are inserted into the mesh hierarchy. Other information such as method or solver parameters are not stored, but are taken from the parameter file. This allows for “tweaking” of parameters on restart, for example to adjust refinement criteria or solver convergence criteria.

3.2.3. Control flow

Control flow is handled by separate IoWriter or IoReader chare arrays, where each element is associated with a single HDF5 file. Advantages over previous approaches are better load-balancing of I/O operations, and decoupling of I/O operations from the Block chare array. For Enzo-E checkpoint/restart data in particular, IoEnzoReader and IoEnzoWriter chare arrays are used.

3.3. Design

Components of the new I/O approach include

  1. Control management

    • control_restart.cpp

      • Main::r_restart_enter()

      • Main::p_restart_done()

      • Main::restart_exit()

  2. New Classes

    • EnzoMethodCheck

    • IoEnzoReader
      • IoEnzoReader::IoEnzoReader()

    • IoEnzoWriter
      • IoEnzoWriter::IoEnzoWriter()

    • IoReader
      • IoReader::IoReader()

    • IoWriter
      • IoWriter::IoWriter()

    • MethodOrderMorton

3.3.1. Output: checkpoint


3.3.2. Input: restart

The UML sequence diagram below shows how the Simulation group, IoReader chare array, and Block chare array interoperate to read data from a checkpoint directory. Time runs vertically starting from the top, and the three Charm++ group/arrays are arranged into three columns. Code for restart is found in the enzo_control_restart.cpp file.

../_images/io-read.png startup

Restart begins in the “startup” phase, with the unique root block for the (0,0,0) octree in the array-of-octrees calling the Simulation entry method p_restart_enter().

The p_restart_enter() entry method reads the number of restart files from the top-level file-list file, initializes synchronization counters, and creates the IoEnzoReader chare array, one element for each file.

The IoEnzoReader constructors calld the p_io_reader_created() entry method in the root Simulation object to notify it that they’ve been created.

p_io_reader_created counts the number of calls, and after it has received the last IoEnzoReader notification, it distributes the proxy_io_enzo_reader array proxy to all other Simulation objects by calling p_set_io_reader().

p_set_io_reader() stores the incoming proxy, then calls the r_restart_start() barrier across Simulation objects, which is used to guarantee that all proxy elements will have been initialized before any are accessed in subsequent phases. level 0

In the level-0 (root-level) phase, the root Simulation object reads the file names from the file-list file, and calls the p_init_root() entry method in all IoEnzoReader objects, sending the checkpoint directory and file names.

The p_init_root() entry method opens the block-data (HDF5) file and reads global attributes. It also opens and reads tho block-list (text) file, reading in the list of blocks and organizing them by mesh refinement level. It reads in each block data, saving data in blocks levels greater than 0, and sending data to level-0 blocks. Note level-0 blocks exist at the beginning of restart, but no blocks in levels higher than 0 do. Data are packed and sent to blocks in levels <= 0 using the EnzoBlock::p_restart_set_data() entry method.

The EnzoBlock::p_restart_set_data() method unpacks the data into the Block, then notifies the associated IoEnzoReader file object that data has been received using the p_block_ready entry method.

IoEnzoReader::p_block_ready() counts the number of block-reday acknowledgements, and after the last one calls Simulation::p_restart_next_level() to process the next refinement level blocks. level k

The level-k phase for k=1 to L is more complicated than level-0 because the level k > 0 blocks must be created first.

Assuming blocks up through level k-1 have been created, the root Simulation object calls IoEnzoReader::p_create_level(k) for each IoEnzoReader.

In p_create_level(), synchronization counters are initialized for counting the k-level blocks, and then each block in the list of level-k blocks is processed. To reuse code from the adapt phase, level-k blocks are created by refining the parent block, via a p_restart_refine() entry method.

In p_restart_refine(), the parent level k-1 block creates a new child block, inserts the new block in its own child list, and recategorizes as a non-leaf.

In the EnzoBlock constructor, the newly created block checks if it’s in a restart phase, and if so sends an acknowledgement to the associated IoEnzoReader object using the p_block_created() entry method.

In p_block_created the IoEnzoReader object counts the number of acknowledgements from newly-created level-k blocks, and after it receives the last one it calls p_restart_level_created() on the root-level Simulation object. After this, the rest of the level-k phase mirrors that of the level-0 phase. cleanup

In the cleanup section, after all blocks up to the maximum level have been created and initialized, the p_restart_next_level() entry method calls the Charm++ call doneInserting() on the block chare array, then calls p_restart_done() on all the blocks, which completes the restart phase.

3.3.3. Classes


3.4. Data format

Data for a given checkpoint dump are stored in a single checkpoint directory, specified in the user’s parameter file using the Method:check:dir parameter.

The number of data files in the directory is specified using the Method:check:num_files parameter. A rule-of-thumb is to use the same number of files as (physical) nodes in the simulation.

Data files are named block_data- x .h5, where 0 <= x < num_files. The format of data files is given in the next section.

Each data file has an associated block-list text file named block_data- x .block_list. The block-list file contains a list of all block names in the associated data file, together with each block’s mesh refinement level. There is one block listed per line, and the block name and level are separated by a space.

A check.file_list text file is also included, which includes the number of data files, and a list of the file prefixes block_data- x.

Note all blocks are included in the files, not just leaf-blocks, and including blocks in “negative” refinement levels.

3.4.1. Data file contents

The HDF5 data files are used to store all block state data, as well as some global data. Simulation attributes

Metadata for the simulation are stored in the top-level “/” group. These include the following:

  • cycle: Cycle of the simulation dump.

  • dt: Current global time-step.

  • time: Current time in code units.

  • rank: Dimensionality of the problem.

  • lower: Lower extents of the simulation domain.

  • upper: Upper extents of the simulation domain.

  • max_level: Maximum refinement level. Block attributes

Block attributes and data are stored in HDF5 groups with the same name as the block, e.g. “B00:0_00:0_00:0”.

Block attribute data include the following:

  • cycle: Cycle of this block.

  • dt: Current block time-step.

  • time: Current time of this block.

  • lower: Lower extents of the block.

  • upper: Upper extents of the block.

  • index: Index of the block, specified using three 32-bit integers.

  • adapt_buffer: Encoding of the block’s neighbor configuration.

  • num_field_data: currently unused.

  • array: Indices identifying the octree containing the block in the “array-of-octrees”.

  • enzo_CellWidth: Corresponds to the EnzoBlock CellWidth parameter.

  • enzo_GridDimension: Corresponds to the EnzoBlock GridDimension parameter.

  • enzo_GridEndIndex: Corresponds to the EnzoBlock GridEndIndex parameter.

  • enzo_GridLeftEdge: Corresponds to the EnzoBlock GridLeftEdge parameter.

  • enzo_GridStartIndex: Corresponds to the EnzoBlock GridStartIndex parameter.

  • enzo_dt: Corresponds to the EnzoBlock dt parameter.

  • enzo_redshift: Corresponds to the EnzoBlock redshift parameter. Block data

Block data are stored as HDF5 datasets.

Fields are currently stored as arrays of size (mx,my,mz), where mx, my, and mz are the dimensions of the field data including ghost data. (Note that future checkpoint versions may only include non-ghost data to reduce disk space.) Dataset names are field names with "field_ prepended, for example "field_density".

Particles are stored as one-dimensional HDF5 datasets, one dataset per attribute per particle type. Datasets are named using "particle" + particle-type + particle attribute, delimited by underscores. For example, "particle_dark_vx" for the x-velocity particle attribute "vx" values of the "dark" type particles in the block. The length of the arrays equals the number of that type of particle in the block.