Thursday Exercise 4.2: Large Output Data
In this exercise, we will run a job that produces a very large output file, based on a few parameters. This exercise should take 15-20 minutes.
This exercise will be the reverse of the previous exercise! Instead of large input/small output, we will be using a program that has no input except for a few arguments on the command line, but produces a file that is several GB in size. As before, we will need to write a shell script that runs the program and handles the data.
If you haven't already, log in to learn.chtc.wisc.edu. Download the software package and untar it.
$ wget http://proxy.chtc.wisc.edu/SQUID/osgschool18/motif-flanks.tar.gz
$ tar -xzf motif-flanks.tar.gz
Use the cd command to enter the unpacked motif-flanks directory. Take a look at the README file and then do the following:
- Compile the code.
- Run the program without any arguments.
- Based on the README, what is the largest amount of data we might expect?
This program generates all permutations of nucleotide sequences surrounding a given DNA motif. We can choose the length of permutation we want both before and after a motif of our choice. To use this program on the command line and save the output to a FASTA file, we can use the command:
$ ./motif-flanks 2 AGTTCATGCCT 2 > sequences.fa
According to the usage information and README, the two numerical arguments can add up to 13, at most, and the middle sequence can be any DNA sequence up to 20 characters. The largest output we can expect is around 4 GB.
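As a rough sanity check on that figure, the arithmetic can be sketched in the shell. Each of the 13 flanking positions can hold one of 4 nucleotides, so there are 4^13 possible sequences. The per-record byte count below is an assumption (one short header line plus one sequence line per FASTA record); it is not taken from the README, so treat the result as an order-of-magnitude estimate only.

```shell
#!/bin/sh
# Rough estimate of the largest motif-flanks output.
flank_total=13      # maximum sum of the two numerical arguments
motif_len=20        # maximum motif length

# Number of distinct flanking combinations: 4^13
n_seqs=1
i=0
while [ "$i" -lt "$flank_total" ]; do
    n_seqs=$((n_seqs * 4))
    i=$((i + 1))
done

# ASSUMPTION: hypothetical header length per record, newline included.
header_bytes=30
# Sequence line: flanks + motif + newline.
seq_bytes=$((flank_total + motif_len + 1))
est_bytes=$((n_seqs * (header_bytes + seq_bytes)))

echo "sequences: $n_seqs"
echo "estimated size: $((est_bytes / 1024 / 1024)) MiB"
```

With these assumed record sizes, the estimate lands in the same range as the 4 GB stated above. A shorter motif or shorter headers would reduce it.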
Having output of up to 4 GB means two things: we will want to run a smaller test before we run the program at its peak, and the output data will need to go into a shared location like Gluster, instead of returning to the submit server.
First, we'll create a shell script to serve as the job's executable.
- What commands do you need to put in the script? What do you need to do with the sequences.fa file before the job exits?
- Our script needs to run the motif-flanks command as shown above, redirecting the output to a file called sequences.fa. Then, after that command completes, the sequences.fa file should be moved to your Gluster directory, as it is too large to return to the submit server as usual.
- Write the script and then check it against the script below. Yours might look slightly different.
#!/bin/sh
./motif-flanks 4 GATTTTCGATC 4 > sequences.fa
mv sequences.fa /mnt/gluster/username/
Note that the two arguments in the script (4 and 4) are much smaller than the total possible for the software (two values that add up to 13). This is because we want to run a smaller test before submitting a job with the largest possible combination of arguments.
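The three-line script is enough for this exercise. If you want a more defensive variant, the sketch below wraps the same two steps in a function and cleans up on failure, so that a partial multi-GB file is not left in the job's working directory (where it could be returned to the submit server). This is only one possible approach, not the official solution. The function name run_motif and the MOTIF_FLANKS variable are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch of a more defensive job script (not the official exercise solution).

run_motif() {
    before=$1
    motif=$2
    after=$3
    dest=$4
    # ASSUMPTION: MOTIF_FLANKS is a hypothetical override for the program path.
    prog=${MOTIF_FLANKS:-./motif-flanks}

    # Fail before generating any data if the destination is not available.
    if [ ! -d "$dest" ]; then
        echo "destination directory not found: $dest" >&2
        return 1
    fi

    # Generate the sequences. If the program fails or writes nothing,
    # remove the partial file so it is not left in the working directory.
    if ! "$prog" "$before" "$motif" "$after" > sequences.fa || [ ! -s sequences.fa ]; then
        echo "motif-flanks failed or produced no output" >&2
        rm -f sequences.fa
        return 1
    fi

    # Move the result to the shared location; clean up if the move fails.
    if ! mv sequences.fa "$dest"/; then
        rm -f sequences.fa
        return 1
    fi
}

# Entry point for the test job; uncomment when using this as the job executable.
# run_motif 4 GATTTTCGATC 4 /mnt/gluster/username
```

Because the function call would be the last command in the script, its return status becomes the job's exit code, which makes a failed run easier to spot.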
Next, create a submit file for this job, based on other submit files from the school. Some important considerations:
- We're writing our file to the job's working directory, so make sure to request several GB of disk space (request_disk in the submit file).
- Add a line to the file that ensures your job will land on computers that have access to Gluster (see the file from the last exercise).
- executable will be the script you wrote above.
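Try writing the submit file yourself first. For comparison, here is a minimal sketch that writes one from the shell. Several details are assumptions: the file name motif.sub, the log/output/error file names, and the memory request are hypothetical, and the requirements line shows the form commonly used at CHTC for Gluster access. Copy the exact Gluster line from your previous exercise's submit file rather than relying on this one.

```shell
#!/bin/sh
# Write a hypothetical submit file, motif.sub, for the test job.
cat > motif.sub <<'EOF'
executable = run_motif.sh

# The script calls ./motif-flanks, so the compiled program must travel with the job.
transfer_input_files = motif-flanks
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

log = motif.log
output = motif.out
error = motif.err

request_cpus = 1
request_memory = 1GB
# sequences.fa is written to the job's working directory before being moved.
request_disk = 5GB

# ASSUMPTION: replace with the Gluster line from the previous exercise's submit file.
requirements = (Target.HasGluster == true)

queue
EOF
```

With the file in place, the job would be submitted with condor_submit motif.sub.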
Once you have a submit file that does all these things, submit the test job.
Once the job has completed, do the following:
- Check the directory where you submitted the job. Has the sequences.fa file returned there, accidentally?
- Check your Gluster directory. Did the sequences.fa file get copied there successfully?
- Check file size. How big is the sequences.fa file? You can use the ls -lh command with the filename to find out.
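The three checks above can be sketched as a small helper run from the submit directory. The function name report_file is hypothetical, and /mnt/gluster/username stands in for your own Gluster directory.

```shell
#!/bin/sh
# Report whether a file exists in a directory and, if so, show its size.
report_file() {
    dir=$1
    name=$2
    if [ -e "$dir/$name" ]; then
        ls -lh "$dir/$name"
    else
        echo "$name not found in $dir"
    fi
}

# Submit directory: the file should NOT be here.
report_file . sequences.fa
# Gluster directory (placeholder username): the file should be here.
report_file /mnt/gluster/username sequences.fa
```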
If your job successfully copied the sequences.fa file to Gluster and did not return it to your submission directory on the submit server, congratulations! Everything is working as it should and you can now submit a full job.
Having done a test, it should be straightforward to run a "full scale" job. Edit your run_motif.sh executable so that the motif-flanks command uses larger numerical arguments:
./motif-flanks 6 GATTTTCGATC 7 > sequences.fa
Then submit your job. When it completes, check the size of the output file in Gluster.