That darn cache! Configuring the SRA Toolkit
Last night I started a batch job on our group's cluster to download and process 9 Illumina libraries from the NCBI SRA.
In the past, I have almost always downloaded such data via direct links to .sra
files on the SRA FTP site, and then converted these files to Fastq format using the fastq-dump
command from the SRA Toolkit.
However, for a while I've been aware that the fastq-dump
command is capable of downloading data (by accession) directly from NCBI.
So last night, I took advantage of this convenience function.
This morning, I woke up to a barrage of warnings on the cluster saying that their was not disk space left.
Impossible, I thought, there are terabytes of free space on the scratch mount.
It was not the scratch partition that had filled up though, it was the $HOME
partition.
I'm lucky I was even able to log on to the machine.
What the heck is going on?!?!
It turns out that by default, fastq-dump
doesn't actually stream .sra
files from the FTP site.
Instead, it downloads intermediate cache files, which by default are stored in ~/ncbi/sra/
.
This is a huge issue for several reasons.
First of all, this behavior is not at all obvious to a new user.
Sure, there may be use cases where this behavior is very useful, but it seems much more appropriate as an opt-in feature than a default.
Secondly, this is bound to create problems for cluster/HPC users all over the place that have very limited storage in their $HOME
directories.
Fortunately, the SRA Toolkit allows you to configure this behavior. The officially-sanctioned approach is to fire up a config program, which will allow you to disable caching behavior and/or remote network access of any kind. The unofficial unsanctioned approach would be to make the config file yourself with a couple of shell commands.
mkdir -p ~/.ncbi
echo '/repository/user/main/public/root = "/scratch/standage/sra-cache"' > ~/.ncbi/user-settings.mkfg
# Uncomment the next command if you want to disable network access altogether
# echo '/repository/user/cache-disabled = "true"' > ~/.ncbi/user-settings.mkfg