Thursday Exercise 4.2: Large Output Data

In this exercise, we will run a job that produces a very large output file, based on a few parameters. This exercise should take 15-20 minutes.

Background

This exercise will be the reverse of the previous exercise! Instead of large input/small output, we will be using a program that has no input except for a few arguments on the command line, but produces a file that is several GB in size. As before, we will need to write a shell script that runs the program and handles the data.

The Program

If you haven't already, log in to learn.chtc.wisc.edu. Download the software package and untar it.

user@learn $ wget http://proxy.chtc.wisc.edu/SQUID/osgschool18/motif-flanks.tar.gz
user@learn $ tar -xzf motif-flanks.tar.gz

Use the cd command to enter the unpacked motif-flanks directory. Take a look at the README file and then do the following:

Compile the code.
Run the program without any arguments.
Based on the README, what is the largest amount of data we might expect?

This program generates all permutations of nucleotide sequences surrounding a given DNA motif. We can choose the length of the permuted sequence we want both before and after a motif of our choice. To run the program on the command line and save its output to a FASTA file, we can use the command:

user@learn $ ./motif-flanks 2 AGTTCATGCCT 2 > sequences.fa

According to the usage information and README, the two numerical arguments can add up to at most 13, and the middle sequence can be any DNA sequence up to 20 characters long. The largest output we can expect is around 4 GB.
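The size figure follows from simple counting: with four possible nucleotides at each flanking position, n flanking positions in total give 4^n sequences. The sketch below works through that arithmetic in shell. The function names are hypothetical and not part of the motif-flanks package, and the byte estimate assumes one sequence per line and ignores any FASTA header lines, so treat it as a lower bound rather than the program's actual output size.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the motif-flanks package.

# flank_count BEFORE AFTER: number of flank combinations, 4^(BEFORE+AFTER).
# 4^n is the same as 1 shifted left by 2n bits.
flank_count() {
    n=$(( $1 + $2 ))
    echo $(( 1 << (2 * n) ))
}

# min_bytes BEFORE MOTIF AFTER: lower bound on output size in bytes,
# ASSUMING one sequence plus a newline per combination (headers not counted).
min_bytes() {
    line_len=$(( $1 + ${#2} + $3 + 1 ))
    count=$(flank_count "$1" "$3")
    echo $(( count * line_len ))
}

flank_count 4 4               # small test job
flank_count 6 7               # largest allowed total of 13
min_bytes 6 GATTTTCGATC 7     # sequence bytes only, largest flank total
```

Going from flank lengths of 4 and 4 to 6 and 7 multiplies the number of sequences by 4^5 = 1024, which is why a small test run says little about the disk space a full run needs.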

Test Job

Having output of up to 4 GB means two things: we will want to run a smaller test before we run the program at its peak, and the output data will need to go into a shared location like Gluster, instead of returning to the submit server.

First, we'll create a shell script called run_motif.sh to serve as the job's executable.

What commands do you need to put in the script? What do you need to do with the sequences.fa file before the job exits?
Our script needs to run the motif-flanks command as shown above, redirecting the output to a file called sequences.fa. Then, after that command completes, the sequences.fa file should be moved to your Gluster directory, as it is too large to return to the submit server as usual.
Write the script and then check it against the script below. Yours might look slightly different.

#!/bin/sh

./motif-flanks 4 GATTTTCGATC 4 > sequences.fa
mv sequences.fa /mnt/gluster/username/

Note

The two arguments in the script (4 and 4) are much smaller than the maximum the software allows (two values that add up to 13). This is because we want to run a smaller test before submitting a job with the largest possible combination of arguments.
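One thing the simple script does not do is check that the program succeeded before moving the file: the redirect creates sequences.fa even if motif-flanks fails. A minimal sketch of a more defensive approach is shown here, assuming a hypothetical helper named stage_output that is not part of the exercise materials.

```shell
#!/bin/sh
# Hypothetical helper, not part of the exercise materials.
# stage_output FILE DEST_DIR: move FILE into DEST_DIR only if FILE exists
# and is non-empty; otherwise report the problem and return non-zero.
stage_output() {
    if [ ! -s "$1" ]; then
        echo "error: $1 is missing or empty" >&2
        return 1
    fi
    mv "$1" "$2"/
}

# Intended use inside the job script (username is a placeholder, as above):
#   ./motif-flanks 4 GATTTTCGATC 4 > sequences.fa || exit 1
#   stage_output sequences.fa /mnt/gluster/username || exit 1
```

Exiting with a non-zero status when something goes wrong makes the failure visible in the job's log and error files instead of leaving an empty file in Gluster.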

Next, create a submit file for this job, based on other submit files from the school. Some important considerations:

We're writing our file to the job's working directory, so make sure to request several GB of disk space (request_disk in the submit file).
Add a line to the file that ensures your job will land on computers that have access to Gluster (see the file from the last exercise).
The executable will be the script you wrote above.
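The considerations above could come together roughly as follows. This is only a sketch: the file name motif.sub, the log/output/error names, and the resource values are assumptions, and the requirements line uses an assumed attribute name for Gluster access, so check it against the submit file from the last exercise. The transfer_input_files line is included on the assumption that the compiled motif-flanks binary sits in the submit directory and has to travel with the job.

```shell
#!/bin/sh
# Writes a sketch of a submit file to motif.sub (hypothetical file name).
# Resource values and the Gluster requirement are ASSUMPTIONS; adjust them
# to match the submit file from the previous exercise.
cat > motif.sub <<'EOF'
universe = vanilla
executable = run_motif.sh
transfer_input_files = motif-flanks

log = motif.log
output = motif.out
error = motif.err

request_cpus = 1
request_memory = 1GB
request_disk = 5GB

# Assumed attribute name for machines that mount Gluster
requirements = (Target.HasGluster == true)

queue
EOF
```

With a file like this in place, the job would be submitted with condor_submit motif.sub.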

Once you have a submit file that does all these things, submit the test job.

Once the job has completed, do the following:

Check the directory where you submitted the job. Was the sequences.fa file accidentally returned there?
Check your Gluster directory. Was the sequences.fa file moved there successfully?
Check file size. How big is the sequences.fa file? You can use the ls -lh command with the filename to find out.
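These three checks can also be scripted. The sketch below uses a hypothetical helper, check_output, which takes the submit directory and the Gluster directory as arguments; it is one possible way to automate the checks, not part of the exercise materials.

```shell
#!/bin/sh
# Hypothetical helper. check_output SUBMIT_DIR GLUSTER_DIR reports where
# sequences.fa ended up and returns 0 only if it is present (and non-empty)
# in GLUSTER_DIR and absent from SUBMIT_DIR.
check_output() {
    status=0
    if [ -e "$1/sequences.fa" ]; then
        echo "unexpected: sequences.fa was returned to $1"
        status=1
    fi
    if [ -s "$2/sequences.fa" ]; then
        # Show the human-readable size of the staged file
        ls -lh "$2/sequences.fa"
    else
        echo "missing or empty: $2/sequences.fa"
        status=1
    fi
    return "$status"
}

# Example (username is a placeholder):
#   check_output . /mnt/gluster/username
```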

If your job successfully moved the sequences.fa file to Gluster and did not return it to your submission directory on the submit server, congratulations! Everything is working as it should and you can now submit a full job.

Final Job

Now that the test has worked, running a "full scale" job should be straightforward. Edit your run_motif.sh executable so that the motif-flanks command uses larger numerical arguments:

./motif-flanks 6 GATTTTCGATC 7 > sequences.fa
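Before submitting, it can be worth confirming that the chosen arguments stay within the limits described earlier (flank lengths that sum to at most 13 and a motif of at most 20 characters). The sketch below uses a hypothetical helper, valid_args, which checks lengths only and does not verify that the motif contains valid DNA characters.

```shell
#!/bin/sh
# Hypothetical helper. valid_args BEFORE MOTIF AFTER returns 0 when the
# flank lengths sum to at most 13 and the motif has at most 20 characters.
valid_args() {
    [ $(( $1 + $3 )) -le 13 ] && [ "${#2}" -le 20 ]
}

if valid_args 6 GATTTTCGATC 7; then
    echo "arguments are within the documented limits"
else
    echo "arguments exceed the documented limits" >&2
fi
```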

Then submit your job. When it completes, check the size of the output file in Gluster.