CCSM Tutorial - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1

Running CCSM
Tony Craig CCSM Software Engineering Group
ccsm@ucar.edu
2
Outline
  • General review of CCSM
  • Setting up and running a simple case
  • Datasets
  • Production
  • Modifying source code
  • Errors
  • Tools
  • Performance

3
Review of CCSM
  • Five components / Ten models
  • Atmosphere(3): atm, datm, latm
  • Ocean(2): ocn, docn
  • Land(2): lnd, dlnd
  • Ice(2): ice (also runs in prescribed and mixed-layer-ocean modes), dice
  • Coupler(1): cpl
  • Communication via MPI between components and
    coupler only
  • Each component runs on multiple processors via
    MPI, OpenMP, MPI/OpenMP

4
Component parallelization
  • atm: MPI, OpenMP, or MPI/OpenMP
  • lnd: MPI, OpenMP, or MPI/OpenMP
  • ice: MPI only
  • ocn: MPI only
  • cpl: OpenMP only
  • The data models (datm, docn, dice, dlnd, latm) are serial only, 1 processor

5
Configurations
  • A: datm, dlnd, docn, dice, cpl
  • B: atm, lnd, ocn, ice, cpl
  • C: datm, dlnd, ocn, dice, cpl
  • D: datm, dlnd, docn, ice, cpl
  • F: atm, lnd, docn, ice (prescribed mode), cpl
  • G: latm, dlnd, ocn, ice, cpl
  • H: atm, dlnd, docn, dice, cpl
  • I: datm, lnd, docn, dice, cpl
  • K: atm, lnd, docn, dice, cpl
  • M: latm, dlnd, docn, ice (mixed layer ocean mode), cpl

6
Resolutions
  • atm/lnd/datm/dlnd: T42, T31
  • ocn/ice/docn/dice: gx1v3, gx3, gx3v4
  • latm: T62
  • Scientifically validated combinations:
  • B, T42_gx1v3: b20.007 control run (test.a1 case)
  • B, T31_gx3v4: paleo control run (test.a2 case)

7
Available configurations

8
Platforms
  • IBM
  • SGI
  • Compaq

9
Review of scripts
  • Main script (test.a1.run)
  • Sets primary ccsm environment variables
  • Calls model.setup.csh
  • Gets input datasets
  • Builds components
  • Runs model
  • Archives
  • Harvests

10
Setting up a simple case
  • Use the GUI !!
  • The GUI modifies the scripts and creates a new
    case for you
  • Input CASE, CSMROOT, CSMDATA, EXEROOT
  • Input resolution
  • Input configuration (A-M)
  • Sets processor layout based on configuration
    (first guess)
  • Sets some batch environment variables
  • Works well in the NCAR environment; other sites
    require tuning after script generation

11
Setting up a simple case, without GUI
  • Create new case directory under scripts, copy
    over test.a1 files
  • Rename file test.a1.run to CASE.run
  • Edit CASE, CSMROOT, CSMDATA, EXEROOT,
    ARCROOT
  • Edit batch environment parameters
  • Edit GRID
  • Edit SETUPS
  • Edit NTASKS, NTHRDS
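The copy-and-rename steps above can be sketched in shell. Here the scripts directory is mocked under a temp dir so the sketch runs standalone; in practice you would work in your real scripts directory with the actual test.a1 files.

```shell
# Mock the scripts area so this sketch runs standalone (paths illustrative).
SCRIPTS=$(mktemp -d)
mkdir -p $SCRIPTS/test.a1
touch $SCRIPTS/test.a1/test.a1.run $SCRIPTS/test.a1/atm.setup.csh

CASE=mynewcase
mkdir -p $SCRIPTS/$CASE
cp $SCRIPTS/test.a1/* $SCRIPTS/$CASE/
mv $SCRIPTS/$CASE/test.a1.run $SCRIPTS/$CASE/$CASE.run
# next: edit CASE, CSMROOT, CSMDATA, EXEROOT, ARCROOT, GRID, SETUPS,
# NTASKS, and NTHRDS inside $SCRIPTS/$CASE/$CASE.run
ls $SCRIPTS/$CASE
```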

12
NTASKS, NTHRDS, batch
  • NTASKS is the total number of MPI tasks for each component
  • NTHRDS is the number of OpenMP threads per MPI task
  • NTASKS × NTHRDS gives the total number of processors for each component
  • Tuning is required to get an optimal load balance
  • Batch parameters should match the processors used; consistency is important, and task_geometry (LoadLeveler) is very powerful
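As a worked example of the arithmetic, using the task/thread counts from the sample B-case layout shown on a later slide:

```shell
# NTASKS x NTHRDS = processors per component; the sum is the job total.
NTASKS=(8 2 40 8 1)   # atm lnd ocn ice cpl
NTHRDS=(4 4 1 1 4)
total=0
for i in 0 1 2 3 4; do
  total=$(( total + NTASKS[i] * NTHRDS[i] ))
done
echo "total processors: $total"   # 32 + 8 + 40 + 8 + 4 = 92
```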

13
Component parallelization
  • atm: MPI, OpenMP, or MPI/OpenMP
  • lnd: MPI, OpenMP, or MPI/OpenMP
  • ice: MPI only, NTHRDS=1
  • ocn: MPI only, NTHRDS=1
  • cpl: OpenMP only, NTASKS=1
  • The data models (datm, docn, dice, dlnd, latm): serial only, 1 processor, NTASKS=1, NTHRDS=1

14
Main script configuration summary
  • B case
  • MODELS ( atm lnd ocn ice cpl)
  • SETUPS ( atm lnd ocn ice cpl)
  • NTASKS ( 8 2 40 8 1)
  • NTHRDS ( 4 4 1 1 4)
  • datm/dlnd/ocn/ice case
  • MODELS ( atm lnd ocn ice cpl)
  • SETUPS ( datm dlnd ocn ice cpl)
  • NTASKS ( 1 1 64 16 1)
  • NTHRDS ( 1 1 1 1 4)

15
RUNTYPE
  • Startup - initial startup of the model using arbitrary initialization
  • set CASE, BASEDATE
  • Continue - continuation of a case, bit-for-bit guaranteed, uses model restart files
  • set CASE
  • Branch - start a new case as a bit-for-bit continuation of another case, uses model restart files, requires a continuous date
  • set CASE, REFCASE, REFDATE
  • Hybrid - start a new case, not a bit-for-bit continuation, uses model initial files in atm and lnd, can change the starting date
  • set CASE, BASEDATE, REFCASE, REFDATE
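In the main script these become csh environment settings. A sketch for a branch run follows; the case names and date are invented for illustration.

```csh
setenv RUNTYPE  branch
setenv CASE     b20.new      # new case name (example)
setenv REFCASE  b20.007      # case being branched from (example)
setenv REFDATE  0050-01-01   # restart date of REFCASE (example)
```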

16
Coupler namelist
  • stop_option: ndays, nmonths, newmonth, halfyear, newyear, newdecade
  • stop_n: integer (ndays, nmonths)
  • rest_freq: ndays, monthly, quarterly, halfyear, yearly
  • rest_n: integer (ndays)
  • diag_freq: daily, weekly, biweekly, monthly, quarterly, yearly, ndays
  • diag_n: integer (ndays)
  • info_bcheck: integer
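A hedged sketch of how these variables might appear together in the coupler namelist; the values are illustrative, and the real template lives in cpl.setup.csh:

```
 stop_option = 'nmonths'
 stop_n      = 12
 rest_freq   = 'monthly'
 diag_freq   = 'monthly'
 info_bcheck = 0
```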

17
Data Sets
  • Types
  • Grid files, binary
  • Namelist input, ascii
  • Initial datasets, binary/netcdf
  • Restart datasets, binary
  • History datasets, netcdf
  • Log files, ascii
  • inputdata directory
  • This is usually pointed to by CSMDATA

18
Data Flow, Input
  • Everything is copied to EXEROOT
  • Tools and scripts attempt to automate most of the
    input file acquisition
  • Main script variables include CSMDATA, LFSINP,
    LMSINP, MACINP, RFSINP, RMSINP

19
Data Flow, Output
  • Output files are moved out of EXEROOT
  • Harvesting is a separate process
  • Writing of restart files coordinated by the
    coupler
  • Writing of history files is not coordinated
    between components, monthly average is default
  • Main script variables include LMSOUT, MACOUT,
    RFSOUT

[Diagram: output flows from the Scripts/EXEROOT area through archiving to ARCROOT, and from ARCROOT through harvesting to the Mass Store]
20
Log Files
  • Each component produces a log file,
    model.log.LID
  • LID is a system date stamp
  • Date stamps are the same on all log files for a
    run
  • Log files are written into the EXEROOT/model
    directories during execution
  • Log files are copied to SCRIPTS/logs at the end
    of a run
  • There are separate stdout and stderr files that
    sometimes contain useful output information

21
Archiving, ccsm_archive
  • Archiving moves model output to a separate area on
    local disk
  • The local disk area is set by ARCROOT in the main
    script
  • Benefits
  • Allows separation of running and harvesting
  • Mass storage availability does not prevent
    continued execution of the model
  • Allows users to run in volatile temporary space
  • Supports simple harvesting in a clustered
    machine environment (like nirvana)

22
Harvesting, CASE.har
  • Harvesting copies model output to the local mass
    store
  • Separate script in scripts/CASE, CASE.har
  • Typically submitted in batch, can also be run
    interactively
  • Submitted by main script after model run, off by
    default
  • Sources ccsm_joe for important environment
    variables
  • Harvests all files in ARCROOT/atm, lnd, ocn, ice, cpl
  • Verifies accurate copy on mass store before
    removing
  • Can scp files to remote machines

23
Exact Restart
  • CCSM can stop and restart exactly
  • The coupler controls the frequency of restart
    file writes
  • Restart files guarantee bit-for-bit continuity at
    a checkpoint boundary
  • rpointer files are updated in the scripts/CASE
    directory after each run

24
Restart file management (1)
  • ccsm_archive
  • In scripts/CASE
  • Called from main script after model run is
    complete, commented out by default
  • ARCROOT/restart contains the latest full set of
    restart files
  • ccsm_archive copies full set of restart datasets
    into ARCROOT/restart after each run
  • ccsm_archive then tars up that restart set into
    the ARCROOT/restart.tars directory
  • These tar files can be large, regular clean up
    required

25
Restart file management (2)
  • ccsm_getrestart
  • In scripts/tools
  • Called from main script before model run starts,
    commented out by default
  • Copies the latest set of restart files from
    ARCROOT/restart to the appropriate directories
  • To back up a model run to a previous model date:
  • Assumes both ccsm_archive and ccsm_getrestart
    have been active in the main script
  • Delete all files in ARCROOT/restart
  • Untar an ARCROOT/restart.tars file into
    ARCROOT/restart
  • Resubmit
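The rollback recipe can be sketched in shell. ARCROOT is mocked in a temp dir so the sketch runs standalone; the restart file name is invented.

```shell
# Mock ARCROOT and one archived restart tar (real ARCROOT is set in the
# main script). Steps mirror the slide: delete, untar, resubmit.
ARCROOT=$(mktemp -d)
mkdir -p $ARCROOT/restart $ARCROOT/restart.tars
touch cpl.r.0049-01-01                                  # mock restart file
tar cf $ARCROOT/restart.tars/restart.0049-01-01.tar cpl.r.0049-01-01

rm -f $ARCROOT/restart/*                                # 1. delete current set
tar xf $ARCROOT/restart.tars/restart.0049-01-01.tar \
    -C $ARCROOT/restart                                 # 2. untar earlier set
ls $ARCROOT/restart                                     # 3. then resubmit CASE.run
```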

26
Auto-Resubmit
  • RESUBMIT file in scripts/CASE directory
  • contains a single integer
  • If the integer is greater than 0, the main script
    resubmits itself and decrements the integer
  • Runaway jobs
  • FIRST! set value in RESUBMIT file to 0
  • Attempt to kill running jobs
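A minimal sh sketch of the resubmit check (the actual main script is csh; the RESUBMIT file name is from the slide, the temp path is only for the demo):

```shell
# Mock the RESUBMIT file containing a single integer.
RESUBMIT_FILE=$(mktemp)
echo 3 > $RESUBMIT_FILE          # allow 3 more automatic resubmissions

n=$(cat $RESUBMIT_FILE)
if [ "$n" -gt 0 ]; then
  echo $(( n - 1 )) > $RESUBMIT_FILE           # decrement the counter
  echo "would resubmit CASE.run, $(( n - 1 )) runs left after this one"
fi
cat $RESUBMIT_FILE               # now holds 2
```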

27
Production
  • Modify coupler namelist in cpl.setup.csh, set run
    length and restart frequency, turn down
    diagnostic frequency, set info_bcheck to 0.
  • Run a startup, hybrid, or branch case RUNTYPE
  • Transition to continue RUNTYPE
  • Turn on archiving, harvesting, and
    ccsm_getrestart
  • Edit RESUBMIT file to initiate auto-resubmission

28
Monitoring a run
  • Monitor the batch jobs using llq, bjobs, qstat
  • Verify that runs complete successfully, check for
    timing information at the end of a log file
  • tail -f EXEROOT/cpl/cpl.log
  • If runs are not succeeding,
  • tail each log file
  • grep for ENDRUN in atm and lnd log files
  • Check stdout and stderr files for component
    messages or system messages
  • Look for core files in EXEROOT/model
  • Look for zero length files in EXEROOT/model
  • Check email
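The log-file checks above can be sketched against a mock log; the ENDRUN marker is from the slide, while the log content itself is invented for the demo.

```shell
# Build a mock log file (real logs live in EXEROOT/model directories).
LOG=$(mktemp)
printf 'nstep 100 ok\nENDRUN: bad namelist value\n' > $LOG

tail -2 "$LOG"          # the last lines usually show where the run died
grep ENDRUN "$LOG"      # abort marker in atm and lnd logs, per the slide
```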

29
Modifying source code
  • Modifying files in the ccsm models directory is
    not recommended
  • Create directories under scripts/CASE
  • src.atm, src.lnd, src.ocn, src.ice, src.cpl
  • Copy a subset of the model source code into these
    directories and modify the copies
  • These copies have the highest priority in the build
  • Benefits include
  • Release source code remains unmodified and
    available
  • Allows implementation of case dependent code
    modifications
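A sketch of the src.* override workflow, with the model and case trees mocked in temp dirs; radiation.F90 is a hypothetical file name chosen for illustration.

```shell
MODELS=$(mktemp -d)      # stands in for the ccsm models directory
CASEDIR=$(mktemp -d)     # stands in for scripts/CASE
mkdir -p $MODELS/atm $CASEDIR/src.atm
echo '! release version' > $MODELS/atm/radiation.F90    # hypothetical file

# Copy only the routine to be changed, then edit the copy in src.atm;
# the build gives src.* copies priority over the release source.
cp $MODELS/atm/radiation.F90 $CASEDIR/src.atm/
echo '! case-local modification' >> $CASEDIR/src.atm/radiation.F90
diff $MODELS/atm/radiation.F90 $CASEDIR/src.atm/radiation.F90 || true
```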

30
Multiple Machine Support
  • Should run on blackforest, babyblue, and ute out
    of the box
  • Other machines include seaborg, nirvana, eagle,
    falcon, cheetah
  • Supported platforms are indicated in OS, SITE,
    MACH, ARCH environment variables in the main
    script
  • See also scripts/tools/test.a1.mods.MACH for
    suggested changes to test.a1.run for other
    machines.

31
Running on a New Machine
  • Main script
  • Set batch queue commands
  • Add new OS, SITE, MACH, ARCH options
  • Set standard CCSM path names, CSMROOT,
  • Harvester submission issues
  • Set data movement variables, LMSINP,
  • Harvester script
  • May require modification
  • Tools
  • May need to modify ccsm_msread, ccsm_mswrite
  • Build
  • Modify models/bld/Macros.OS file

32
ccsm_joe
  • Created by main script
  • Updated every time the main script runs
  • Case dependent
  • Records important ccsm environment variables
  • Can be sourced by other scripts to inherit ccsm
    environment variables

33
Interactive/Batch Issues
  • Can run main script interactively
  • Typically used to build and pre-stage initial
    data
  • Uncomment the exit command in the main script to
    stop it before CCSM execution starts
  • Batch environment highly site dependent
  • NQS
  • Loadleveler
  • LSF
  • PBS

34
Common Errors (1)
  • Model won't build
  • Try rebuilding clean
  • Remove all obj directories; these are
    OBJROOT/model/obj, normally equivalent to
    EXEROOT/model/obj
  • When rebuilding, make sure SETBLD is true in
    main script
  • Model won't continue due to a restart problem
  • Determine the cause of the problem: quota, hardware,
    script, zero-length files, rpointer problems
  • Fix if possible
  • Back up to latest good restart dataset
  • Rerun
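The clean-rebuild step above can be sketched as follows; EXEROOT is mocked under a temp dir, and in a real run you would also set SETBLD to true in the main script before resubmitting.

```shell
# Mock an EXEROOT with per-model obj directories.
EXEROOT=$(mktemp -d)
mkdir -p $EXEROOT/atm/obj $EXEROOT/ocn/obj $EXEROOT/cpl/obj

rm -rf $EXEROOT/*/obj     # force full recompilation on the next build
# then ensure SETBLD is true in the main script and resubmit
ls $EXEROOT/atm
```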

35
Common Errors (2)
  • Ice model stops due to an mp transport error
  • Double ndte in the ice model namelist in
    ice.setup.csh
  • Back up to latest good restart dataset
  • Run past previous stop date
  • Reset ndte value
  • Ocean model non-convergence
  • Increase the number of model timesteps per hour
    (DT_COUNT in ocn.setup.csh) by about 10
  • Back up to latest good restart dataset
  • Run past previous stop date
  • Reset DT_COUNT
  • Non-convergence on the first timestep is a special case

36
Tools
  • Under scripts/tools
  • ccsm_getfile: hierarchical search for a file
  • ccsm_getinput: hierarchical search for an input file
  • ccsm_msread: copies a file from the local mass store
  • ccsm_mswrite: copies a file to the local mass store
  • ccsm_checkenvs: echoes ccsm environment variables,
    used to create ccsm_joe
  • ccsm_getrestart: copies restart files from
    ARCROOT/restart to the appropriate EXEROOT and
    scripts/CASE directories

37
Performance
  • This is complicated!
  • Issues
  • Performance of components and system as a
    function of resolution and configuration
  • Scalability of individual components, scaling
    efficiency of individual components
  • Task/Thread counts
  • Components sharing nodes, overloading nodes with
    multiple components, overloading threads,
    overloading tasks
  • Load balance of coupled system

38
Component Timings
39
CCSM Load Balancing
  • 40 ocean
  • 32 atm
  • 16 ice
  • 12 land
  • 04 cpl
  • 104 total

[Chart: component timings in seconds per day for the processor layout above; numeric labels not reproduced in transcript]
40
Component/Hardware layout
  • Machine, set of nodes
  • Nodes, group of processors that share memory
  • Processors, individual computing elements
  • General rules
  • Do not oversubscribe processors, place only 1 MPI
    task or 1 thread on each processor
  • Minimize the number of nodes used for a given
    component and processor requirement
  • Multiple components can share a node as long as
    there is no oversubscription of processors
  • Test several decompositions, layouts, task/thread
    combinations to try to optimize performance

41
Summary
  • CCSM is a complicated multi-executable climate
    model, expect there to be spin-up time
  • CCSM is a scientific research code
  • There are many possible components, configurations,
    platforms, and resolutions; we are unable to test
    everything
  • Users are responsible for validating their
    science
  • NCAR can help with software/configuration
    problems: ccsm@ucar.edu
  • Please report bugs, fixes, improvements, and
    ports to new hardware so we can incorporate
    those changes! ccsm@ucar.edu