< August 2021 >
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405

Sunday, 29 August 2021

10:00 PM

Using snakemake to do simple wildcard operations on many, many, many files (local copy) [Living in an Ivory Basement] 10:00 PM, Sunday, 29 August 2021 08:20 PM, Monday, 08 November 2021

I recently co-taught another snakemake lesson (with Dr. Abhijna Parigi), and was reminded of one of my favorite off-label uses of snakemake: replacing complicated bash for loops with simple and robust snakemake workflows.

An example

As a bioinformatics researcher, I frequently need to do simple operations to many files. As part of this, I usually want to change the filename to represent the change in file content.

For example, let's suppose I have a bunch of FASTQ files (say, the ones here), and I want to subset them to the first 400 lines. The filenames all have the form NAME.fastq, and I want to add .subset.fastq to the end of the subset filenames to distinguish them. (See this shell scripting lesson for more background and motivation for this particular operation.)

Using bash, round 1

For many years I did this with bash for loops. The following code works, assuming the original fastq files are in a data/ subdirectory:

mkdir subset
for i in data/*.fastq
do
    head -400 $i > subset/$(basename $i).subset.fastq
done

Starting from a bunch of files,

>data/F3D0_S188_L001_R1_001.fastq
>data/F3D0_S188_L001_R2_001.fastq
>...

this loop will produce

>subset/F3D0_S188_L001_R1_001.fastq.subset.fastq
>subset/F3D0_S188_L001_R2_001.fastq.subset.fastq

Improving the bash solution

The output filenames are kind of ugly, because fastq is repeated. That's just because bash makes it so easy to append to filenames - we can fix this by adding .fastq into the $(basename ...) call:

mkdir subset2
for i in data/*.fastq
do
    head -400 $i > subset2/$(basename $i .fastq).subset.fastq
done

So... not difficult to read, and fairly straightforward. Why would I use anything else?

Using snakemake instead

tl;dr The bash code above is brittle when I modify it; it's not robust enough for important work.

In my (extensive <sigh>) experience, the above approach fails some reasonable percent of the time. Usually I get it right the first time I write it, and then I modify and tweak it, and chaos ensues because I omit something in the for loop.

So a year or two ago, I decided to try out snakemake for one of these operations.

Here's the contents of a file named Snakefile.subset, which does the same thing as the for loop above -

# pull in all files with .fastq on the end in the #data
FILES = glob_wildcards('data/{name}.fastq')

# extract the {name} values into a list
NAMES = FILES.name

rule all:
    input:
        # use the extracted name values to build new filenames
        expand("subset3/{name}.subset.fastq", name=NAMES)

rule subset:
    input:
        "data/{n}.fastq"
    output:
        "subset3/{n}.subset.fastq"
    shell: """
        head -400 {input} > {output}
    """

and you can run it with snakemake -s Snakefile.subset -j 1.

With this Snakefile, snakemake pulls in all files that match the glob pattern and extracts their names, and then constructs a set of "targets" in rule all that it must create. The subset rule specifies how to build targets of that name.

Why I like snakemake more than bash for this

So why do I like snakemake more? A few reasons that I think are intrinsic to snakemake vs bash -

  • snakemake fails if something is wonky about the filenames, before doing anything!
  • if any of the operations fail, snakemake stops and alerts me by default!
  • I can do the operations in parallel by specifying e.g. snakemake -j 4 to use 4 cores.
  • the templating language for using {...} is nice, simple, and Python-standard (see this blog post on f-strings and also the the templating minilanguage ref).
  • as the operations get more complicated, snakemake doesn't need to get more complicated, while the bash solution tends to complexify into illegibility...
  • I think the snakemake solution is easier to understand and modify!

Above all, the overall structure of snakemake is declarative rather than procedural. We declare what we want the result to look like, and snakemake uses the available rules to create the overall set of steps that must be executed and Makes It Happen. This is what makes the error checking and parallelization possible.

Another "feature" of this solution is that there are more comments because I comment Snakefiles more than bash scripts. This is probably a me-problem that is caused by snakemake forcing me to edit a file :).

I haven't reused Snakefiles that much, but I think you can reuse Snakefiles fairly easily - see next section.

Are there any downsides? The main one is that the snakemake solution feels more heavyweight to me - it involves creating a file, getting the spacing/indentation right, etc. etc. So I still don't use it as much as I probably should.

Thoughts welcome!

--titus

Appendix: A more reusable Snakefile

Below is a Snakefile that's a bit more reusable for situations where your input and output directories don't match the names I used above - you can override PREFIX and OUTPUT by running snakemake -C prefix=PREFIX output=OUTPUT.

(I don't really like the syntax of using f-strings here, but it's cleaner than anything else I've found. Suggestions welcome.)

# pull in all files with .fastq on the end in the 'data' directory.             
PREFIX = config.get('prefix', 'data')
print(f"looking for FASTQ files under '{PREFIX}'/")

OUTPUT = config.get('output', 'subset5')
print(f"subset results will go under '{OUTPUT}'/")

FILES = glob_wildcards(f'{PREFIX}/{{name}}.fastq')

# extract the {name} values into a list                                         
NAMES = FILES.name

# request the output files                                                      
rule all:
    input:
        # use the extracted 'name' values to build new filenames                
        expand("{output}/{name}.subset.fastq", output=OUTPUT, name=NAMES)

# actually do the subsetting                                                    
rule subset_wc:
    input:
        f"{PREFIX}/{{n}}.fastq"
    output:
        "{output}/{n}.subset.fastq"
    shell: """                                                                  
        head -400 {input} > {output}                                            
    """

Feeds

FeedRSSLast fetchedNext fetched after
XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Bits of DNA XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
blogs.perl.org XML 12:00 AM, Tuesday, 18 January 2022 12:15 AM, Tuesday, 18 January 2022
Blue Collar Bioinformatics XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Boing Boing XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Epistasis Blog XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Futility Closet XML 12:00 AM, Tuesday, 18 January 2022 12:15 AM, Tuesday, 18 January 2022
gCaptain XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Hackaday XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
In between lines of code XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
InciWeb Incidents for California XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
LeafSpring XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Living in an Ivory Basement XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
LWN.net XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Mastering Emacs XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Planet Debian XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Planet Emacsen XML 12:00 AM, Tuesday, 18 January 2022 12:15 AM, Tuesday, 18 January 2022
RNA-Seq Blog XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
RStudio Blog - Latest Comments XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
RWeekly.org - Blogs to Learn R from the Community XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
The Adventure Blog XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
The Allium XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Variance Explained XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
January 2022
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
December 2021
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102
November 2021
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29300102030405
October 2021
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
September 2021
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930010203
August 2021
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
July 2021
MonTueWedThuFriSatSun
28293001020304
05060708091011
12131415161718
19202122232425
26272829303101
June 2021
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293001020304
May 2021
MonTueWedThuFriSatSun
26272829300102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
April 2021
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829300102
March 2021
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
February 2021
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
November 2020
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30010203040506
September 2020
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293001020304
July 2020
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102
June 2020
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29300102030405
May 2020
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
April 2020
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930010203
February 2020
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282901
January 2020
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930310102
December 2019
MonTueWedThuFriSatSun
25262728293001
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
November 2019
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
October 2019
MonTueWedThuFriSatSun
30010203040506
07080910111213
14151617181920
21222324252627
28293031010203
August 2019
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829303101
July 2019
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
June 2019
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282930
May 2019
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102
April 2019
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29300102030405
March 2019
MonTueWedThuFriSatSun
25262728010203
04050607080910
11121314151617
18192021222324
25262728293031
February 2019
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728010203
January 2019
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293031010203
December 2018
MonTueWedThuFriSatSun
26272829300102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
November 2018
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829300102
October 2018
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
September 2018
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282930
August 2018
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930310102
July 2018
MonTueWedThuFriSatSun
25262728293001
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
June 2018
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
May 2018
MonTueWedThuFriSatSun
30010203040506
07080910111213
14151617181920
21222324252627
28293031010203
April 2018
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30010203040506
February 2018
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272801020304
January 2018
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
December 2017
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
November 2017
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930010203
September 2017
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
August 2017
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293031010203
March 2017
MonTueWedThuFriSatSun
27280102030405
06070809101112
13141516171819
20212223242526
27282930310102
January 2017
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
November 2016
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293001020304
October 2016
MonTueWedThuFriSatSun
26272829300102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
September 2016
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829300102
August 2016
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
July 2016
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
May 2016
MonTueWedThuFriSatSun
25262728293001
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
April 2016
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
December 2014
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
October 2014
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102