Part III. Lab 3 - Using Condor-G and DAGMan to Submit to the Grid

In this exercise, we will use Condor-G and DAGMan to submit jobs to the grid.

Getting Set Up

  1. Make Condor available to you by typing:

    $ source /opt/osg/setup.sh
    

  2. Check Condor with condor_q

    Condor should already be set up and running on wkstn108-34.leavey.georgetown.edu. You can check this by running condor_q:

    $ condor_q
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:36236> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    
    0 jobs; 0 idle, 0 running, 0 held

    This command lists every job that Condor has been asked to run that is waiting in the Condor queue. Everyone will be using the same Condor installation for these exercises, so you will often see other students' jobs in the Condor queue alongside your own.

  3. Create Your Working Directories

    Next, create some directories for you to work in. Make them in your home directory:

    $ cd ~
    $ mkdir condor-tutorial
    $ cd condor-tutorial
    $ mkdir submit

Submit a Simple Grid Job with Condor-G

Now we are ready to submit our first job with Condor-G. The basic procedure is to create a Condor job submit description file. This file can tell Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. Then this file is given to condor_submit.

There are many options that can be specified in a Condor-G submit description file; we will start out with just a few. We'll send the job to the compute element osgce.cs.clemson.edu, running under the "jobmanager-fork" job manager. We set notification to Never to avoid getting email messages about the completion of our job, and redirect the stdout/stderr of the job back to the submission computer.

For more information, see the condor_submit manual.

Create the Submit File

Move to the submission directory you created earlier and create the submit file. Then verify that it was entered correctly:

$ cd ~/condor-tutorial/submit
USE YOUR FAVOURITE TEXT EDITOR TO ENTER THE FILE CONTENT
$ cat myjob.submit
Universe   = grid
grid_resource = gt2 osgce.cs.clemson.edu/jobmanager-fork
Executable = /bin/hostname
Arguments  = -f
Notification = Never
Log        = results.log
Output     = results.output
Error      = results.error
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Queue
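A typo in a submit file is easy to make, so a quick sanity check before submitting can save a confusing failure later. A minimal sketch using only grep; the check_submit name is our own, and this is no substitute for condor_submit's own checking:

```shell
# check_submit: warn about any of the attributes used above that are
# missing from a submit description file (a rough sanity check only).
check_submit() {
    for key in Universe grid_resource Executable Output Error Log Queue; do
        grep -qi "^$key" "$1" || echo "missing: $key"
    done
}
```

Running check_submit myjob.submit prints nothing when every attribute is present.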

Submit your test job to Condor-G

$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.

Run condor_q to see the progress of your job. You can also run condor_q -globus to see Globus-specific status information. (See the condor_q manual for more information.)

$ condor_q


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
   ID    OWNER              SUBMITTED      RUN_TIME  ST PRI SIZE CMD               
   1.0   train99         7/10 17:28   0+00:00:00 I  0   0.0  hostname -f

1 jobs; 1 idle, 0 running, 0 held
$ condor_q -globus


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   1.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /home/train99/cond

Monitoring Progress with tail

In another window, run tail -f on the log file for your job to monitor its progress. Re-run tail each time you submit a job during this tutorial; you will see how typical Condor-G jobs progress. Use Ctrl+C to stop watching the file.

$ cd ~/condor-tutorial/submit
$ tail -f --lines=500 results.log
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <129.93.164.161:35688>
...
017 (001.000.000) 03/24 19:13:30 Job submitted to Globus
    RM-Contact: wkstn108-34.leavey.georgetown.edu/jobmanager-fork
    JM-Contact: https://wkstn108-34.leavey.georgetown.edu:34127/28997/1174763610/
    Can-Restart-JM: 1
...
027 (001.000.000) 07/10 17:29:01 Job submitted to grid resource
    GridResource: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
    GridJobId: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork https://wkstn108-34.leavey.georgetown.edu:51277/31413/1174756212/
...
001 (001.000.000) 07/10 17:29:01 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
005 (001.000.000) 07/10 17:30:08 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
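Each event record in the user log begins at column 1 with a numeric code (000 submit, 017/027 grid submission, 001 execute, 012 held, 005 terminated), so the log is easy to summarize with standard text tools. A minimal sketch, assuming the log format shown above; summarize_log is our own name:

```shell
# summarize_log: count user-log events by their three-digit code.
# Assumes the classic Condor user-log format shown above, where each
# event record begins at column 1 with a numeric event code.
summarize_log() {
    grep -E '^[0-9]{3} ' "$1" |
        awk '{ codes[$1]++ }
             END { for (c in codes) printf "code %s: %d event(s)\n", c, codes[c] }'
}
```

Run it as summarize_log results.log after a job has finished.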

Verifying completed jobs

When the job is no longer listed in condor_q, or when the log file reports Job terminated, the results can be viewed using condor_history:

$ condor_history
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
   1.0   train99         7/10 10:28   0+00:00:00 C   ???        /home/train99/cond

When the job completes, verify that the output is as expected. The binary name shown is different from the one you submitted because of how Globus and Condor-G cooperate to stage your file to the execute computer.

$ ls
myjob.submit  results.error  results.log   results.output
$ cat results.error
$ cat results.output 
osgce.cs.clemson.edu

If you didn't watch results.log with tail -f, you can examine the logged information with cat results.log.

Held Jobs in Condor

When a problem occurs in the middleware, Condor-G will hold your job. Held jobs remain in the queue, waiting for user intervention. When you resolve the problem, you can use condor_release to free the job to continue.

Use condor_hold to manually place jobs on hold (e.g., to delay your run).

For this example, we'll deliberately cause a problem and watch how Condor handles it. We will make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.

Submit the job again, but this time immediately after submitting it, mark the output file as read-only:

$ condor_submit myjob.submit ; chmod a-w results.output
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3.
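What will make this job go on hold is simply that the stage-out step cannot write its destination file. You can see the effect of the chmod locally with a scratch file (demo.out here is just an illustration, not one of the job's files):

```shell
# After `chmod a-w`, the permission string shows no 'w' bits at all;
# writing to the file as a normal user now fails, which is exactly
# what the Globus stage-out step will run into.
touch demo.out
chmod a-w demo.out
ls -l demo.out | cut -c1-10   # something like -r--r--r--
chmod u+w demo.out            # restore write permission
rm demo.out
```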

Watch the job with tail. When the job goes on hold, use Ctrl+C to exit tail. Note that condor_q reports that the job is in the H ("held") state.

$ tail -f --lines=500 results.log

000 (003.000.000) 07/12 22:35:44 Job submitted from host: <129.93.164.161:32864>
...
027 (003.000.000) 07/12 22:35:57 Job submitted to grid resource
    GridResource: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
    GridJobId: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork https://wkstn108-34.leavey.georgetown.edu:44026/31670/1174757075/
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
012 (003.000.000) 07/12 22:36:52 Job was held.
        Globus error 155: the job manager could not stage out a file
        Code 2 Subcode 155
...
Ctrl+C
$ condor_q

 
-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:32864> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   3.0   train99         7/12 22:35   0+00:00:55 H  0   0.0  hostname -f
 
1 jobs; 0 idle, 0 running, 1 held

Fix the problem (make the file writable again), then release the job. You can specify the job's ID, or use the -all option to release all held jobs.

$ chmod u+w results.output
$ condor_release -all
All jobs released.
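If you script around condor_q (for example, to wait until no jobs are held), the summary line it prints is easy to parse. A small sketch assuming the summary format shown above; held_count is our own name:

```shell
# held_count: read condor_q output on stdin and print the number of
# held jobs from its summary line, e.g.
#   "1 jobs; 0 idle, 0 running, 1 held"  ->  1
held_count() {
    grep -Eo '[0-9]+ held' | awk '{ print $1 }'
}
```

Use it as: condor_q | held_count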

Run tail -f in another window to watch the log until the job finishes:

$ tail -f --lines=500 results.log
000 (003.000.000) 07/12 22:35:44 Job submitted from host: <129.93.164.161:32864>
...
027 (003.000.000) 07/12 22:35:57 Job submitted to grid resource
    GridResource: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
    GridJobId: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork https://wkstn108-34.leavey.georgetown.edu:44026/31670/1174757075/...
...
001 (003.000.000) 07/12 22:35:57 Job executing on host: wkstn108-34.leavey.georgetown.edu
...
012 (003.000.000) 07/12 22:36:52 Job was held.
        Globus error 155: the job manager could not stage out a file
        Code 2 Subcode 155
...
013 (003.000.000) 07/12 22:44:33 Job was released.
        via condor_release (by user train99)
...
027 (003.000.000) 07/12 22:35:57 Job submitted to grid resource
    GridResource: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
    GridJobId: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork https://wkstn108-34.leavey.georgetown.edu:44026/31670/1174757075/...
...
001 (003.000.000) 07/12 22:44:46 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
005 (003.000.000) 07/12 22:44:51 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
Ctrl+C

After your job has finished running, check that the results have been retrieved successfully:

$ cat results.output
osgce.cs.clemson.edu

Clean up the results before continuing:

$ rm results.output results.error results.log

A Simple DAG

Now we'll use DAGMan, a tool that helps run several grid jobs at once.

  1. Create a small shell script to monitor the Condor-G queue. We will use this throughout the rest of the tutorial:

    $ cat > watch_condor_q
    #! /bin/sh
    while true; do
         condor_q train99
         condor_q -globus train99
         sleep 10
    done
    Ctrl+D
    $ cat watch_condor_q
    #! /bin/sh
    while true; do
         condor_q train99
         condor_q -globus train99
         sleep 10
    done
    $ chmod a+x watch_condor_q 
    

  2. Create a minimal DAG for DAGMan. This DAG will have a single node.

    $ cat > mydag.dag
    Job HelloWorld myjob.submit
    Ctrl+D
    $ cat mydag.dag
    Job HelloWorld myjob.submit

  3. Submit the DAG.

    This section requires you to have three windows open. We will submit the DAG in the first window and watch the progress of the DAG and its job in the other two. We will do these in the following order:

    1. In the first window, submit the DAG and then watch condor with watch_condor_q.

    2. In the second window, tail the results log.

    3. In the third window, tail the DAGMan log.

    Submit the DAG with condor_submit_dag and watch the run with watch_condor_q. condor_dagman is running as a job and submits your real job on your behalf, without your direct intervention. You might see the C (completed) state as your job finishes, but that often goes by too quickly to notice.

    $ condor_submit_dag mydag.dag
    
    Checking your DAG input file and all submit files it references.
    This might take a while... 
    Done.
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor   : mydag.dag.condor.sub
    Log of DAGMan debugging messages         : mydag.dag.dagman.out
    Log of Condor library debug messages     : mydag.dag.lib.out
    Log of the life of condor_dagman itself  : mydag.dag.dagman.log
    
    Condor Log file for all jobs of this DAG : results.log
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 2.
    -----------------------------------------------------------------------
    $ ./watch_condor_q 
  4. In the second window, watch the job log file as your job runs:

    $ tail -f --lines=500 results.log

  5. In a third window, watch DAGMan's log file by running tail -f --lines=500 mydag.dag.dagman.out. We suggest that you re-run this command whenever you submit a DAG during the remainder of this tutorial. This will show you how a typical DAG progresses. Use Ctrl+C to stop watching the file. An example is shown below:

    $ cd ~/condor-tutorial/submit
    $ tail -f --lines=500 mydag.dag.dagman.out
    
    7/10 10:36:43 ******************************************************
    7/10 10:36:43 ** condor_scheduniv_exec.6.0 (CONDOR_DAGMAN) STARTING UP
    7/10 10:36:43 ** $CondorVersion: 6.8.4 Apr 22 2006 $
    7/10 10:36:43 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
    7/10 10:36:43 ** PID = 26844
    7/10 10:36:43 ******************************************************
    7/10 10:36:44 DaemonCore: Command Socket at <129.93.164.161:34571>
    7/10 10:36:44 argv[0] == "condor_scheduniv_exec.6.0"
    7/10 10:36:44 argv[1] == "-Debug"
    7/10 10:36:44 argv[2] == "3"
    7/10 10:36:44 argv[3] == "-Lockfile"
    7/10 10:36:44 argv[4] == "mydag.dag.lock"
    7/10 10:36:44 argv[5] == "-Condorlog"
    7/10 10:36:44 argv[6] == "results.log"
    7/10 10:36:44 argv[7] == "-Dag"
    7/10 10:36:44 argv[8] == "mydag.dag"
    7/10 10:36:44 argv[9] == "-Rescue"
    7/10 10:36:44 argv[10] == "mydag.dag.rescue"
    7/10 10:36:44 Condor log will be written to results.log
    7/10 10:36:44 DAG Lockfile will be written to mydag.dag.lock
    7/10 10:36:44 DAG Input file is mydag.dag
    7/10 10:36:44 Rescue DAG will be written to mydag.dag.rescue
    7/10 10:36:44 Parsing mydag.dag ...
    7/10 10:36:44 Dag contains 1 total jobs
    7/10 10:36:44 Bootstrapping...
    7/10 10:36:44 Number of pre-completed jobs: 0
    7/10 10:36:44 Submitting Job HelloWorld ...
    7/10 10:36:44    assigned Condor ID (7.0.0)
    7/10 10:36:45 Event: ULOG_SUBMIT for Job HelloWorld (7.0.0)
    7/10 10:36:45 0/1 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
    7/10 10:37:05 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (7.0.0)
    7/10 10:37:05 Event: ULOG_EXECUTE for Job HelloWorld (7.0.0)
    7/10 10:38:10 Event: ULOG_JOB_TERMINATED for Job HelloWorld (7.0.0)
    7/10 10:38:10 Job HelloWorld completed successfully.
    7/10 10:38:10 1/1 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
    7/10 10:38:10 All jobs Completed!
    7/10 10:38:10 **** condor_scheduniv_exec.6.0 (condor_DAGMAN) EXITING WITH STATUS 0
    

    The first window, running watch_condor_q, should look something like the following:

    $ ./watch_condor_q 
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
       2.0   train99         7/10 17:33   0+00:00:03 R  0   2.6  condor_dagman -f -
       3.0   train99         7/10 17:33   0+00:00:00 I  0   0.0  myscript.sh TestJo
    
    2 jobs; 1 idle, 1 running, 0 held
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
       3.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /tmp/train99-cond
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
       2.0   train99         7/10 17:33   0+00:00:33 R  0   2.6  condor_dagman -f -
       3.0   train99         7/10 17:33   0+00:00:15 R  0   0.0  myscript.sh TestJo
    
    2 jobs; 0 idle, 2 running, 0 held
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
       3.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /home/train99/cond
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
       2.0   train99         7/10 17:33   0+00:01:03 R  0   2.6  condor_dagman -f -
       3.0   train99         7/10 17:33   0+00:00:45 R  0   0.0  myscript.sh TestJo
    
    2 jobs; 0 idle, 2 running, 0 held
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
       3.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/train99-cond
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    
    0 jobs; 0 idle, 0 running, 0 held
    
    
    -- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
     ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
    
    
    Ctrl+C

  6. Verify your results:

    $ ls -l
    total 12
    -rw-r--r--    1 train99  train99        28 Jul 10 10:35 mydag.dag
    -rw-r--r--    1 train99  train99       523 Jul 10 10:36 mydag.dag.condor.sub
    -rw-r--r--    1 train99  train99       608 Jul 10 10:38 mydag.dag.dagman.log
    -rw-r--r--    1 train99  train99      1860 Jul 10 10:38 mydag.dag.dagman.out
    -rw-r--r--    1 train99  train99        29 Jul 10 10:38 mydag.dag.lib.out
    -rw-------    1 train99  train99         0 Jul 10 10:36 mydag.dag.lock
    -rw-r--r--    1 train99  train99       175 Jul  9 18:13 myjob.submit
    -rwxr-xr-x    1 train99  train99       194 Jul 10 10:36 myscript.sh
    -rw-r--r--    1 train99  train99        31 Jul 10 10:37 results.error
    -rw-------    1 train99  train99       833 Jul 10 10:38 results.log
    -rw-r--r--    1 train99  train99       261 Jul 10 10:37 results.output
    -rwxr-xr-x    1 train99  train99        81 Jul 10 10:35 watch_condor_q
    $ cat results.error 
    $ cat results.output 
    osgce.cs.clemson.edu

    Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job):

    $ ls
    mydag.dag         mydag.dag.dagman.log  mydag.dag.lib.out  myjob.submit  results.error  results.output
    mydag.dag.condor.sub  mydag.dag.dagman.out  mydag.dag.lock     myscript.sh   results.log    watch_condor_q
    $ cat mydag.dag.condor.sub
    # Filename: mydag.dag.condor.sub
    # Generated by condor_submit_dag mydag.dag
    universe   = scheduler
    executable   = /path/to/condor/bin/condor_dagman
    getenv      = True
    output      = mydag.dag.lib.out
    error      = mydag.dag.lib.out
    log      = mydag.dag.dagman.log
    remove_kill_sig   = SIGUSR1
    arguments   = -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue
    environment   = _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
    queue
    $ cat mydag.dag.dagman.log
    000 (006.000.000) 07/10 10:36:43 Job submitted from host: <129.93.164.161:33785>
    ...
    001 (006.000.000) 07/10 10:36:44 Job executing on host: <129.93.164.161:33785>
    
    ...
    005 (006.000.000) 07/10 10:38:10 Job terminated.
       (1) Normal termination (return value 0)
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
       0  -  Run Bytes Sent By Job
       0  -  Run Bytes Received By Job
       0  -  Total Bytes Sent By Job
       0  -  Total Bytes Received By Job
    ...

    If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:

    $ cat mydag.dag.dagman.out

  7. Clean up your results. Be careful to type mydag.dag.* (note the .*) rather than mydag.dag*, so that you do not delete mydag.dag itself:

    $ rm mydag.dag.* results.*
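The reason the .* matters: the pattern mydag.dag.* requires a literal dot after "dag", so it can never match mydag.dag itself, whereas mydag.dag* (no dot) would match and delete it. A quick check in a scratch directory:

```shell
# The mydag.dag.* glob matches only the generated files, never
# mydag.dag itself, because the pattern requires a trailing dot.
mkdir -p /tmp/glob-demo && cd /tmp/glob-demo
touch mydag.dag mydag.dag.condor.sub mydag.dag.lock
rm mydag.dag.*
ls    # only mydag.dag remains
```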

Running a job with a more complex DAG

Typically each node in a DAG will have its own Condor submit file. Create two more submit files: one renders an image tile, and one converts the tile to PNG.

$ cat tile.submit
Universe   = grid
grid_resource = gt2 osgce.cs.clemson.edu/jobmanager-condor
Executable = /nfs/home/benc/mandel/mandel10
Arguments  = 0 0 1 0.0582 1.99965 2000  1000 1000 32000
Log        = /tmp/benc-grid.log
Output     = tile.pgm
Error      = tile.error
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
Queue
$ cat convert.submit
Universe   = grid
grid_resource = gt2 osg-ce.grid.uj.ac.za/jobmanager-pbs
Executable = /usr/bin/convert
Arguments  = tile.pgm tile.png
Log        = /tmp/benc-grid.log
Output     = convert.out
Error      = convert.error
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = tile.pgm
transfer_output_files = tile.png
transfer_executable = false
Queue

Put the new nodes in a DAG:

$ cat mydag.dag 
Job Tile tile.submit
Job Convert convert.submit
Parent Tile Child Convert
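A real animation would have many tiles, each needing its own Tile -> Convert pair, and DAG files are plain text, so they are easy to generate. A hypothetical sketch; the per-tile submit file names (tile_0.submit, convert_0.submit, ...) are assumptions, not files this tutorial creates:

```shell
# make_tile_dag: write an N-tile DAG, one Tile -> Convert pair per
# tile, assuming per-tile submit files tile_<i>.submit and
# convert_<i>.submit exist (hypothetical names).
make_tile_dag() {
    n="$1"; out="$2"
    : > "$out"
    i=0
    while [ "$i" -lt "$n" ]; do
        {
            echo "Job Tile$i tile_$i.submit"
            echo "Job Convert$i convert_$i.submit"
            echo "Parent Tile$i Child Convert$i"
        } >> "$out"
        i=$((i + 1))
    done
}
```

For example, make_tile_dag 8 tiles.dag writes an 8-tile DAG ready for condor_submit_dag.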

Change the watch_condor_q Script

condor_q -dag will organize jobs into their associated DAGs. Change watch_condor_q to use this:

$ rm watch_condor_q
$ cat > watch_condor_q
#! /bin/sh
while true; do
    echo ....
    echo .... Output from condor_q
    echo ....
     condor_q train99
    echo ....
    echo .... Output from condor_q -globus
    echo ....
     condor_q -globus train99
    echo ....
    echo .... Output from condor_q -dag
    echo ....
     condor_q -dag train99
     sleep 10
done
Ctrl+D
$ cat watch_condor_q
#! /bin/sh
while true; do
    echo ....
    echo .... Output from condor_q
    echo ....
     condor_q train99
    echo ....
    echo .... Output from condor_q -globus
    echo ....
     condor_q -globus train99
    echo ....
    echo .... Output from condor_q -dag
    echo ....
     condor_q -dag train99
     sleep 10
done
$ chmod a+x watch_condor_q 

Submit your new DAG and monitor it.

In separate windows, run tail -f --lines=500 results.log and tail -f --lines=500 mydag.dag.dagman.out to monitor the job's progress.

$ condor_submit_dag mydag.dag

Checking your DAG input file and all submit files it references.
This might take a while... 
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.condor.sub
Log of DAGMan debugging messages         : mydag.dag.dagman.out
Log of Condor library debug messages     : mydag.dag.lib.out
Log of the life of condor_dagman itself  : mydag.dag.dagman.log

Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 8.
-----------------------------------------------------------------------
$ ./watch_condor_q

-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:00:08 R  0   2.6  condor_dagman -f -
   5.0   train99         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0   train99         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup

3 jobs; 2 idle, 1 running, 0 held

-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   5.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond
   6.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:00:08 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup 

3 jobs; 2 idle, 1 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:00:12 R  0   2.6  condor_dagman -f -
   5.0   train99         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0   train99         7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup 

3 jobs; 2 idle, 1 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   5.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond
   6.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:00:12 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:00 I  0   0.0  myscript.sh Setup 

3 jobs; 2 idle, 1 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:00:42 R  0   2.6  condor_dagman -f -
   5.0   train99         7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh TestJo
   6.0   train99         7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh Setup 

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   5.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond
   6.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:00:42 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:24 R  0   0.0  myscript.sh Setup 

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:01:12 R  0   2.6  condor_dagman -f -
   5.0   train99         7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh TestJo
   6.0   train99         7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh Setup 

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   5.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond
   6.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:01:12 R  0   2.6  condor_dagman -f -
   5.0    |-HelloWorld   7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh TestJo
   6.0    |-Setup        7/10 17:45   0+00:00:54 R  0   0.0  myscript.sh Setup 

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:01:42 R  0   2.6  condor_dagman -f -
   7.0   train99         7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh work1 
   8.0   train99         7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh Worker

3 jobs; 2 idle, 1 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   7.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond
   8.0   train99       UNSUBMITTED fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:01:42 R  0   2.6  condor_dagman -f -
   7.0    |-WorkerNode_  7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh work1 
   8.0    |-WorkerNode_  7/10 17:46   0+00:00:00 I  0   0.0  myscript.sh Worker

3 jobs; 2 idle, 1 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:02:12 R  0   2.6  condor_dagman -f -
   7.0   train99         7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh work1 
   8.0   train99         7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   7.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond
   8.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:02:12 R  0   2.6  condor_dagman -f -
   7.0    |-WorkerNode_  7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh work1 
   8.0    |-WorkerNode_  7/10 17:46   0+00:00:27 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:02:42 R  0   2.6  condor_dagman -f -
   7.0   train99         7/10 17:46   0+00:00:57 R  0   0.0  myscript.sh work1 
   8.0   train99         7/10 17:46   0+00:00:57 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   7.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond
   8.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:02:43 R  0   2.6  condor_dagman -f -
   7.0    |-WorkerNode_  7/10 17:46   0+00:00:58 R  0   0.0  myscript.sh work1 
   8.0    |-WorkerNode_  7/10 17:46   0+00:00:58 R  0   0.0  myscript.sh Worker

3 jobs; 0 idle, 3 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:03:13 R  0   2.6  condor_dagman -f -
   8.0   train99         7/10 17:46   0+00:01:28 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   8.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:03:13 R  0   2.6  condor_dagman -f -
   8.0    |-WorkerNode_  7/10 17:46   0+00:01:28 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:03:43 R  0   2.6  condor_dagman -f -
   8.0   train99         7/10 17:46   0+00:01:58 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   8.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:03:43 R  0   2.6  condor_dagman -f -
   8.0    |-WorkerNode_  7/10 17:46   0+00:01:58 R  0   0.0  myscript.sh Worker

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:04:13 R  0   2.6  condor_dagman -f -
   9.0   train99         7/10 17:49   0+00:00:02 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   9.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:04:13 R  0   2.6  condor_dagman -f -
   9.0    |-CollectResu  7/10 17:49   0+00:00:02 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:04:43 R  0   2.6  condor_dagman -f -
   9.0   train99         7/10 17:49   0+00:00:32 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   9.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:04:43 R  0   2.6  condor_dagman -f -
   9.0    |-CollectResu  7/10 17:49   0+00:00:32 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:05:13 R  0   2.6  condor_dagman -f -
   9.0   train99         7/10 17:49   0+00:01:02 R  0   0.0  myscript.sh workfi

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
   9.0   train99       DONE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:05:13 R  0   2.6  condor_dagman -f -
   9.0    |-CollectResu  7/10 17:49   0+00:01:02 C  0   0.0  myscript.sh workfi

1 jobs; 0 idle, 1 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:05:43 R  0   2.6  condor_dagman -f -
  10.0   train99         7/10 17:50   0+00:00:13 R  0   0.0  myscript.sh Final 

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
  10.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:05:44 R  0   2.6  condor_dagman -f -
  10.0    |-LastNode     7/10 17:50   0+00:00:13 R  0   0.0  myscript.sh Final 

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:06:14 R  0   2.6  condor_dagman -f -
  10.0   train99         7/10 17:50   0+00:00:43 R  0   0.0  myscript.sh Final 

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        
  10.0   train99       ACTIVE fork     wkstn108-34.leavey.georgetown.edu   /tmp/username-cond


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   train99         7/10 17:45   0+00:06:14 R  0   2.6  condor_dagman -f -
  10.0    |-LastNode     7/10 17:50   0+00:00:43 R  0   0.0  myscript.sh Final 

2 jobs; 0 idle, 2 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER          STATUS  MANAGER  HOST                EXECUTABLE        


-- Submitter: wkstn108-34.leavey.georgetown.edu : <129.93.164.161:35688> : wkstn108-34.leavey.georgetown.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

Press Ctrl+C to stop watching once the queue is empty.
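The interleaved output above shows three views of the queue repeated every half minute or so. The watch_condor_q script in your directory is presumably a simple loop over the corresponding condor_q invocations; a hedged sketch (bounded to three iterations here so it terminates, and guarded so it runs even where Condor is absent; the real script likely loops until interrupted):

```shell
#!/bin/sh
# Hypothetical sketch of watch_condor_q: repeatedly print the three views
# of the queue seen above until interrupted with Ctrl+C.
N=0
while [ $N -lt 3 ]; do                  # real script: while true
  if command -v condor_q >/dev/null 2>&1; then
    condor_q          # plain view: ST, RUN_TIME, CMD
    condor_q -globus  # grid view: STATUS, MANAGER, HOST, EXECUTABLE
    condor_q -dag     # DAG view: node jobs indented under their DAGMan job
  fi
  N=$((N + 1))
  sleep 1             # the snapshots above are roughly 30 seconds apart
done
```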

Watching the logs or the condor_q output, you'll notice that the CollectResults node (workfinal) did not run until both WorkerNode nodes (work1 and work2) had finished.
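For reference, the dependency structure implied by the log can be written as a DAG file like the following. The node names and submit-file names are taken from the output in this exercise, but the exact file is inferred (for example, the log does not reveal whether HelloWorld is also a parent of the worker nodes), so treat this as a sketch:

```
# Sketch of mydag.dag, reconstructed from the DAGMan log and the
# submit-file names in the directory listing.
JOB HelloWorld     myjob.submit
JOB Setup          job.setup.submit
JOB WorkerNode_1   job.work1.submit
JOB WorkerNode_Two job.work2.submit
JOB CollectResults job.workfinal.submit
JOB LastNode       job.finalize.submit

# HelloWorld has no PARENT/CHILD lines here, so it runs immediately.
PARENT Setup CHILD WorkerNode_1 WorkerNode_Two
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
PARENT CollectResults CHILD LastNode
```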

Examine your results

$ ls
job.finalize.submit   mydag.dag.condor.sub  myscript.sh              results.setup.error   results.workfinal.error
job.setup.submit      mydag.dag.dagman.log  results.error            results.setup.output  results.workfinal.output
job.work1.submit      mydag.dag.dagman.out  results.finalize.error   results.work1.error   watch_condor_q
job.work2.submit      mydag.dag.lib.out     results.finalize.output  results.work1.output
job.workfinal.submit  mydag.dag.lock        results.log              results.work2.error
mydag.dag             myjob.submit          results.output           results.work2.output
$ tail --lines=500 results.*.error
==> results.finalize.error <==
This is sent to standard error

==> results.setup.error <==
This is sent to standard error

==> results.work1.error <==
This is sent to standard error

==> results.work2.error <==
This is sent to standard error

==> results.workfinal.error <==
This is sent to standard error
$ tail --lines=500 results.*.output

==> results.finalize.output <==
I'm process id 29614 on wkstn108-34.leavey.georgetown.edu
Thu Jul 10 10:53:58 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/0d/7c60aa10b34817d3ffe467dd116816/md5/de/03c3eb8a20852948a2af53438bbce1/data Finalize 1
My name (argument 1) is Finalize
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting

==> results.setup.output <==
I'm process id 29337 on wkstn108-34.leavey.georgetown.edu
Thu Jul 10 10:50:31 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/a5/fab7b658db65dbfec3ecf0a5414e1c/md5/f4/e9a04ae03bff43f00a10c78ebd60fd/data Setup 1
My name (argument 1) is Setup
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting

==> results.work1.output <==
I'm process id 29444 on wkstn108-34.leavey.georgetown.edu
Thu Jul 10 10:51:04 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/2e/17db42df4e113f813cea7add42e03e/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode1 1
My name (argument 1) is WorkerNode1
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting

==> results.work2.output <==
I'm process id 29432 on wkstn108-34.leavey.georgetown.edu
Thu Jul 10 10:51:03 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/ea/9a3c8d16346b2fea808cda4b5969fa/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode2 120
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 120
Sleep of 120 seconds finished.  Exiting

==> results.workfinal.output <==
I'm process id 29554 on wkstn108-34.leavey.georgetown.edu
Thu Jul 10 10:53:27 CDT 2003
Running as binary /home/train99/.globus/.gass_cache/local/md5/c9/7ba5d43acad3d9ebdfa633839e75c3/md5/11/cd84efa75305d54100f0f451b46b35/data WorkFinal 1
My name (argument 1) is WorkFinal
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished.  Exiting
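Judging from the results.*.output files above, myscript.sh takes a name and a sleep duration and is roughly the following. This is a reconstruction inferred from the output, not the actual lab script (defaults are added so the sketch also runs with no arguments):

```shell
#!/bin/sh
# Inferred sketch of myscript.sh; the real script takes its node name
# and sleep time as $1 and $2.
NAME=${1:-Worker}
DURATION=${2:-1}
echo "I'm process id $$ on $(hostname)"
date
echo "Running as binary $0 $NAME $DURATION"
echo "My name (argument 1) is $NAME"
echo "My sleep duration (argument 2) is $DURATION"
sleep "$DURATION"
echo "Sleep of $DURATION seconds finished.  Exiting"
```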

Examine your log

$ cat results.log
000 (005.000.000) 07/10 17:45:24 Job submitted from host: <wkstn108-34.leavey.georgetown.edu:35688>
    DAG Node: HelloWorld
...
000 (006.000.000) 07/10 17:45:24 Job submitted from host: <wkstn108-34.leavey.georgetown.edu:35688>
    DAG Node: Setup
...
017 (006.000.000) 07/10 17:45:42 Job submitted to Globus
    RM-Contact: wkstn108-34.leavey.georgetown.edu:/jobmanager-fork
    JM-Contact: https://wkstn108-34.leavey.georgetown.edu:2349/914/1057877133/
    Can-Restart-JM: 1
...
001 (006.000.000) 07/10 17:45:42 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
017 (005.000.000) 07/10 17:45:42 Job submitted to Globus
    RM-Contact: wkstn108-34.leavey.georgetown.edu:/jobmanager-fork
    JM-Contact: https://wkstn108-34.leavey.georgetown.edu:2348/915/1057877133/
    Can-Restart-JM: 1
...
001 (005.000.000) 07/10 17:45:42 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
005 (005.000.000) 07/10 17:46:50 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
005 (006.000.000) 07/10 17:46:50 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
000 (007.000.000) 07/10 17:46:55 Job submitted from host: <wkstn108-34.leavey.georgetown.edu:35688>
    DAG Node: WorkerNode_1
...
000 (008.000.000) 07/10 17:46:56 Job submitted from host: <wkstn108-34.leavey.georgetown.edu:35688>
    DAG Node: WorkerNode_Two
...
017 (008.000.000) 07/10 17:47:09 Job submitted to Globus
    RM-Contact: wkstn108-34.leavey.georgetown.edu:/jobmanager-fork
    JM-Contact: https://wkstn108-34.leavey.georgetown.edu:2364/1037/1057877219/
    Can-Restart-JM: 1
...
001 (008.000.000) 07/10 17:47:09 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
017 (007.000.000) 07/10 17:47:09 Job submitted to Globus
    RM-Contact: wkstn108-34.leavey.georgetown.edu:/jobmanager-fork
    JM-Contact: https://wkstn108-34.leavey.georgetown.edu:2367/1040/1057877220/
    Can-Restart-JM: 1
...
001 (007.000.000) 07/10 17:47:09 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
005 (007.000.000) 07/10 17:48:17 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
005 (008.000.000) 07/10 17:49:18 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
000 (009.000.000) 07/10 17:49:22 Job submitted from host: <wkstn108-34.leavey.georgetown.edu:35688>
    DAG Node: CollectResults
...
017 (009.000.000) 07/10 17:49:35 Job submitted to Globus
    RM-Contact: wkstn108-34.leavey.georgetown.edu:/jobmanager-fork
    JM-Contact: https://wkstn108-34.leavey.georgetown.edu:2383/1185/1057877366/
    Can-Restart-JM: 1
...
001 (009.000.000) 07/10 17:49:35 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
005 (009.000.000) 07/10 17:50:42 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...
000 (010.000.000) 07/10 17:50:42 Job submitted from host: <wkstn108-34.leavey.georgetown.edu:35688>
    DAG Node: LastNode
...
017 (010.000.000) 07/10 17:50:55 Job submitted to Globus
    RM-Contact: wkstn108-34.leavey.georgetown.edu:/jobmanager-fork
    JM-Contact: https://wkstn108-34.leavey.georgetown.edu:2392/1247/1057877446/
    Can-Restart-JM: 1
...
001 (010.000.000) 07/10 17:50:55 Job executing on host: gt2 wkstn108-34.leavey.georgetown.edu/jobmanager-fork
...
005 (010.000.000) 07/10 17:52:02 Job terminated.
   (1) Normal termination (return value 0)
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
      Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
   0  -  Run Bytes Sent By Job
   0  -  Run Bytes Received By Job
   0  -  Total Bytes Sent By Job
   0  -  Total Bytes Received By Job
...

Examine the DAGMan log

$ cat mydag.dag.dagman.out
7/10 17:45:24 ******************************************************
7/10 17:45:24 ** condor_scheduniv_exec.4.0 (CONDOR_DAGMAN) STARTING UP
7/10 17:45:24 ** $CondorVersion: 6.8.4 Apr 22 2006 $
7/10 17:45:24 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
7/10 17:45:24 ** PID = 18826
7/10 17:45:24 ******************************************************
7/10 17:45:24 DaemonCore: Command Socket at <wkstn108-34.leavey.georgetown.edu:35774>
7/10 17:45:24 argv[0] == "condor_scheduniv_exec.4.0"
7/10 17:45:24 argv[1] == "-Debug"
7/10 17:45:24 argv[2] == "3"
7/10 17:45:24 argv[3] == "-Lockfile"
7/10 17:45:24 argv[4] == "mydag.dag.lock"
7/10 17:45:24 argv[5] == "-Condorlog"
7/10 17:45:24 argv[6] == "results.log"
7/10 17:45:24 argv[7] == "-Dag"
7/10 17:45:24 argv[8] == "mydag.dag"
7/10 17:45:24 argv[9] == "-Rescue"
7/10 17:45:24 argv[10] == "mydag.dag.rescue"
7/10 17:45:24 Condor log will be written to results.log
7/10 17:45:24 DAG Lockfile will be written to mydag.dag.lock
7/10 17:45:24 DAG Input file is mydag.dag
7/10 17:45:24 Rescue DAG will be written to mydag.dag.rescue
7/10 17:45:24 Parsing mydag.dag ...
7/10 17:45:24 Dag contains 6 total jobs
7/10 17:45:24 Bootstrapping...
7/10 17:45:24 Number of pre-completed jobs: 0
7/10 17:45:24 Submitting Job HelloWorld ...
7/10 17:45:24    assigned Condor ID (5.0.0)
7/10 17:45:24 Submitting Job Setup ...
7/10 17:45:24    assigned Condor ID (6.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:25 Event: ULOG_SUBMIT for Job Setup (6.0.0)
7/10 17:45:25 0/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job Setup (6.0.0)
7/10 17:45:45 Event: ULOG_GLOBUS_SUBMIT for Job HelloWorld (5.0.0)
7/10 17:45:45 Event: ULOG_EXECUTE for Job HelloWorld (5.0.0)
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job HelloWorld (5.0.0)
7/10 17:46:55 Job HelloWorld completed successfully.
7/10 17:46:55 Event: ULOG_JOB_TERMINATED for Job Setup (6.0.0)
7/10 17:46:55 Job Setup completed successfully.
7/10 17:46:55 Submitting Job WorkerNode_1 ...
7/10 17:46:55    assigned Condor ID (7.0.0)
7/10 17:46:55 Submitting Job WorkerNode_Two ...
7/10 17:46:56    assigned Condor ID (8.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:46:56 Event: ULOG_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:46:56 2/6 done, 0 failed, 2 submitted, 0 ready, 0 pre, 0 post
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_Two (8.0.0)
7/10 17:47:11 Event: ULOG_GLOBUS_SUBMIT for Job WorkerNode_1 (7.0.0)
7/10 17:47:11 Event: ULOG_EXECUTE for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_1 (7.0.0)
7/10 17:48:21 Job WorkerNode_1 completed successfully.
7/10 17:48:21 3/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:21 Event: ULOG_JOB_TERMINATED for Job WorkerNode_Two (8.0.0)
7/10 17:49:21 Job WorkerNode_Two completed successfully.
7/10 17:49:21 Submitting Job CollectResults ...
7/10 17:49:22    assigned Condor ID (9.0.0)
7/10 17:49:22 Event: ULOG_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:22 4/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:49:37 Event: ULOG_GLOBUS_SUBMIT for Job CollectResults (9.0.0)
7/10 17:49:37 Event: ULOG_EXECUTE for Job CollectResults (9.0.0)
7/10 17:50:42 Event: ULOG_JOB_TERMINATED for Job CollectResults (9.0.0)
7/10 17:50:42 Job CollectResults completed successfully.
7/10 17:50:42 Submitting Job LastNode ...
7/10 17:50:42    assigned Condor ID (10.0.0)
7/10 17:50:42 Event: ULOG_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:42 5/6 done, 0 failed, 1 submitted, 0 ready, 0 pre, 0 post
7/10 17:50:57 Event: ULOG_GLOBUS_SUBMIT for Job LastNode (10.0.0)
7/10 17:50:57 Event: ULOG_EXECUTE for Job LastNode (10.0.0)
7/10 17:52:02 Event: ULOG_JOB_TERMINATED for Job LastNode (10.0.0)
7/10 17:52:02 Job LastNode completed successfully.
7/10 17:52:02 6/6 done, 0 failed, 0 submitted, 0 ready, 0 pre, 0 post
7/10 17:52:02 All jobs Completed!
7/10 17:52:02 **** condor_scheduniv_exec.4.0 (condor_DAGMAN) EXITING WITH STATUS 0

Clean up your results. Be careful with the deletion: you want to remove the generated mydag.dag.* files (the logs, lock, and rescue files), not the mydag.dag input file itself. The glob mydag.dag.* does not match mydag.dag, so the command below leaves your DAG intact.

$ rm mydag.dag.* results.*

Advanced exercise

Generate 10 different Mandelbrot frames (for example, by stepping the zoom factor down from 32000 until you reach a final zoom factor of 1) using Condor, running on the machine at Clemson (using jobmanager-condor) and on the UJ cluster (using jobmanager-pbs). Then assemble them together like this:

convert frame1.png frame2.png ... frame10.png animation.gif

and view animation.gif in your web browser.
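One way to step the zoom factor from 32000 down to 1 across 10 frames is a geometric progression, dividing by roughly 3.2 each step. A shell sketch; the mandelbrot program, its arguments, and the per-frame submit files are hypothetical and left to you:

```shell
#!/bin/sh
# Compute 10 zoom factors stepping geometrically from 32000 down to 1.
# Here we only print them; in the exercise you would write one submit
# file (or one DAG node) per frame using the computed value.
ZOOM=32000
for FRAME in 1 2 3 4 5 6 7 8 9 10; do
  echo "frame $FRAME: zoom $ZOOM"
  ZOOM=$((ZOOM * 10 / 32))        # divide by ~3.2 each step
  if [ "$ZOOM" -lt 1 ]; then ZOOM=1; fi   # never go below the final zoom
done
```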