2017-06-15 86 views

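Multiple inputs and outputs in a single rule Snakemake file

I'm getting started with Snakemake and have a very basic question that I couldn't find answered in the Snakemake tutorial.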

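I want to create a single-rule Snakefile that downloads multiple files one by one under Linux. 'expand' cannot be used in the output because the files need to be downloaded one at a time, and wildcards cannot be used because this is the target rule.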

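The only approach I came up with is something like the following, which does not work properly. I can't figure out how to use {output} to send the downloaded items to a specific directory under specific names such as "downloaded_files.dwn", so that they can be used in later steps: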

links=[link1,link2,link3,....]
rule download:
    output:
        "outdir/{downloaded_file}.dwn"
    params:
        shellCallFile='callscript',
    run:
        callString=''
        for item in links:
            callString+='wget str(item) -O '+{output}+'\n'
        call('echo "' + callString + '\n" >> ' + params.shellCallFile, shell=True)
        call(callString, shell=True)

I would appreciate any hints on how to solve this, and on which part of Snakemake I have misunderstood.


If you run snakemake without the '-j' option, only one rule instance will run at any given time. Do the files need to be downloaded in a precise order? – bli


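Also, it is common to start with an "all" rule that has only an input, and expand can be used there. That rule then drives the rest of the workflow. – bli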


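Is there a pattern in the link names that could be used to determine the names of the downloaded files? Keep in mind that Snakemake is designed to exploit regularities in file names. – bli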

Answer


Here is a commented example that may help you solve your problem:

# Create some way of associating output files with links 
# The output file names will be built from the keys: "chain_{key}.gz" 
# One could probably directly use output file names as keys 
links = { 
    "1" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz", 
    "2" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz", 
    "3" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"} 


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/chain_1.gz", "outdir/chain_2.gz", "outdir/chain_3.gz"]
        # Note that we don't need to use {output} in the "run" or "shell" part.
        # This list will be used if we later add rules
        # that use the files generated by the present rule.
        expand("outdir/chain_{n}.gz", n=links.keys())
    run:
        # The sort is there to ensure the files are in the 1, 2, 3 order.
        # We could use an OrderedDict if we wanted an arbitrary order.
        for link_num in sorted(links.keys()):
            shell("wget {link} -O outdir/chain_{n}.gz".format(link=links[link_num], n=link_num))
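
To check which file names the output section declares without running Snakemake, the list can be reproduced in plain Python. This is only a sketch: expand_one is a hypothetical helper written for this illustration, not a Snakemake function, and it covers just the single-placeholder case used in the rule above.

```python
# Plain-Python stand-in for expand() with exactly one placeholder.
# expand_one is a hypothetical helper, not part of Snakemake.
links = {
    "1": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz",
}


def expand_one(pattern, **values):
    """Fill the single placeholder of pattern with each supplied value."""
    [(name, items)] = values.items()
    return [pattern.format(**{name: item}) for item in items]


# Sorting the keys makes the order of the resulting list explicit.
outputs = expand_one("outdir/chain_{n}.gz", n=sorted(links))
for path in outputs:
    print(path)
```

Each printed path corresponds to one file that the download rule promises to create.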

Here is another way to do it, using arbitrary names for the downloaded files and making use of output (albeit somewhat artificially):

links = [ 
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"), 
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"), 
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz")] 


rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/foo_chain.gz", "outdir/bar_chain.gz", "outdir/baz_chain.gz"]
        ["outdir/{f}".format(f=filename) for (filename, _) in links]
    run:
        for i in range(len(links)):
            # output is a list, so we can access its items by index
            shell("wget {link} -O {chain_file}".format(
                link=links[i][1], chain_file=output[i]))
        # using a direct loop over the pairs (filename, link)
        # could be considered "cleaner"
        # for (filename, link) in links:
        #     shell("wget {link} -O outdir/{filename}".format(
        #         link=link, filename=filename))
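
The index-based loop can also be written by pairing each link with its output path. The sketch below only builds the command strings in plain Python and does not call wget; the commands list is a name introduced for this illustration, and the pairing assumes that output keeps the order in which the files were declared, as the indexing above already relies on.

```python
# Build the wget commands by pairing links with output paths (no download is run).
links = [
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"),
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"),
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"),
]

# Same list comprehension as in the rule's output section.
output = ["outdir/{f}".format(f=filename) for (filename, _) in links]

# zip walks both lists in step, so no index arithmetic is needed.
commands = [
    "wget {link} -O {chain_file}".format(link=link, chain_file=chain_file)
    for (_, link), chain_file in zip(links, output)
]

for command in commands:
    print(command)
```

Inside the rule, each of these strings would be passed to shell() in turn.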

Here is an example where the three downloads can proceed in parallel when run with snakemake -j 3:

# To use os.path.join, 
# which is more robust than manually writing the separator. 
import os 

# Association between output files and source links 
links = { 
    "foo_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz", 
    "bar_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz", 
    "baz_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"} 


# Make this association accessible via a function of wildcards 
def chainfile2link(wildcards): 
    return links[wildcards.chainfile] 


# First rule will drive the rest of the workflow 
rule all:
    input:
        # expand generates the list of the final files we want
        expand(os.path.join("outdir", "{chainfile}"), chainfile=links.keys())


rule download:
    output:
        # We inform snakemake what this rule will generate
        os.path.join("outdir", "{chainfile}")
    params:
        # using a function of wildcards in params
        link = chainfile2link,
    shell:
        """
        wget {params.link} -O {output}
        """
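
One way to picture why this version can run in parallel: every file requested by the all rule is matched against the output pattern of the download rule, and each match becomes its own independent job. The plain-Python sketch below imitates that matching with a regular expression. The regex and the jobs dictionary are assumptions made for illustration only and do not reflect how Snakemake is implemented.

```python
import os
import re

links = {
    "foo_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "bar_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "baz_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz",
}

# Simplified stand-in for the output pattern os.path.join("outdir", "{chainfile}"):
# a literal directory prefix followed by a named group for the wildcard.
prefix = os.path.join("outdir", "")
pattern = re.compile(re.escape(prefix) + r"(?P<chainfile>.+)")

# The targets the "all" rule would request.
targets = [os.path.join("outdir", name) for name in links]

# One entry per target: each is independent of the others,
# which is what allows several of them to run at the same time.
jobs = {}
for target in targets:
    match = pattern.fullmatch(target)
    chainfile = match.group("chainfile")
    jobs[target] = "wget {link} -O {output}".format(link=links[chainfile], output=target)

for target, command in jobs.items():
    print(target, "<-", command)
```

With three independent jobs and -j 3, up to three downloads may be started at once.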

Thanks, bli, for the excellent solutions. One more question: can this rule also be modified to download the links in parallel? – user3015703


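To run in parallel, you can move the 'expand' into the 'input' of an 'all' rule, remove the 'for' loop from the 'run' section, and use '-j'. The 'all' rule will cause the 'download' rule to run once for each wanted file. I will add another example, but you may want to try it yourself first. – bli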


@user3015703 I added an example for parallel downloads. – bli
