2017-03-25

I am new to Python generators and want to nest them, i.e. have generator A depend on the output of generator B (B yields file paths, A parses the documents). However, only the first file is read; the nested generator is not triggered correctly.

Here is a minimal sample (using the TREC8all data):

import itertools
import spacy
from bs4 import BeautifulSoup
import os

def iter_all_files(p):
    for root, dirs, files in os.walk(p):
        for file in files:
            if not file.startswith('.'):
                print('using: ' + str(os.path.join(root, file)))
                yield os.path.join(root, file)


def gen_items(path):
    path = next(path)
    text_file = open(path, 'r').read()
    soup = BeautifulSoup(text_file, 'html.parser')
    for doc in soup.find_all("doc"):
        strdoc = doc.docno.string.strip()
        text_only = str(doc.find_all("text")[0])
        yield (strdoc, text_only)


file_counter = 0 
g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 
    file_counter += 1 
file_counter 

This outputs only:

using: data/TREC8all/Adhoc/fbis/fb396002 
Out[10]: 
33 

But there are of course more files to parse, as the following shows:

g = iter_all_files("data/TREC8all/Adhoc")
file_counter = 0
item_counter = 0
for file in g:
    file_counter += 1
    # print(file)
    for item in gen_items(g):
        item_counter += 1

print(item_counter)
file_counter

which returns around 2000 files, like:

using: data/TREC8all/Adhoc/fbis/fb396002 
using: data/TREC8all/Adhoc/fbis/fb396003 
using: data/TREC8all/Adhoc/fbis/fb396004 
using: data/TREC8all/Adhoc/fbis/fb396005 
using: data/TREC8all/Adhoc/fbis/fb396006 
using: data/TREC8all/Adhoc/fbis/fb396007 
using: data/TREC8all/Adhoc/fbis/fb396008 
using: data/TREC8all/Adhoc/fbis/fb396009 
using: data/TREC8all/Adhoc/fbis/fb396010 
using: data/TREC8all/Adhoc/fbis/fb396011 
using: data/TREC8all/Adhoc/fbis/fb396012 
using: data/TREC8all/Adhoc/fbis/fb396013 

So it is clear that my

g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 

is not consuming the nested generators in the right way.

Edit

Nesting with an outer for loop seems to work, but is not nice. Is there a better way to formulate it?

g = iter_all_files("data/TREC8all/Adhoc")
for file in g:
    file_counter += 1
    # print(file)
    #for item in gen_items(g):
    gen1, gen2 = itertools.tee(gen_items(g))

'path = next(path)' - why do you take only a single item from the iterator, if you did not intend to stop at the first one? – user2357112

Answers


"but only the first file is read"

Well, you only told Python to read one file:

def gen_items(path): 
    path = next(path) 
    ... 

If you want to go over all the files, you need a loop:

def gen_items(paths):
    for path in paths:
        ...
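Applied to the question's pattern, the fix might look like the following self-contained sketch. The stand-in generators replace `iter_all_files` and the BeautifulSoup parsing, which need the TREC8all files on disk; only the nesting structure is the point here.

```python
def iter_paths():
    # stand-in for iter_all_files: yields several "file paths"
    for name in ["fb396002", "fb396003", "fb396004"]:
        yield name

def gen_items(paths):
    # loop over ALL paths instead of calling next(paths) once
    for path in paths:
        # stand-in for the BeautifulSoup parsing: each "file"
        # contributes two (docno, text) items
        yield (path + "-doc1", "text1")
        yield (path + "-doc2", "text2")

items = list(gen_items(iter_paths()))
print(len(items))  # 6: all 3 paths are consumed, 2 items each
```

With this change, the `itertools.tee` and `zip` pipeline from the question sees every document, not just those in the first file.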

So there is no more elegant way to nest generators? –


@GeorgHeiler: In case you didn't notice it the first time, I told you to use a loop inside `gen_items`. If you want one generator to process the items of another generator, it needs a loop. If you want to take a function that handles a single item and apply it to the items produced by `iter_all_files`, you want `map`. – user2357112
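The `map` alternative the comment mentions could look like the following sketch, with a hypothetical `parse_file` standing in for the real per-file parser:

```python
import itertools

def parse_file(path):
    # hypothetical per-file parser: takes ONE path, returns its items
    return [(path + "-doc1", "text1"), (path + "-doc2", "text2")]

paths = iter(["fb396002", "fb396003"])
# map applies parse_file to every path; chain.from_iterable flattens
# the per-file lists back into a single stream of (docno, text) pairs
items = itertools.chain.from_iterable(map(parse_file, paths))
print(sum(1 for _ in items))  # 4: two files, two items each
```

This keeps the per-file logic in an ordinary function and leaves the iteration to `map`, which is equivalent to the loop-based generator above.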


Reviewing the code, I don't know what `nlp.pipe` means; try it like this:

#docs = nlp.pipe(texts, batch_size=50, n_threads=4) 
for id_, doc in zip(ids, texts): 
    file_counter += 1 
file_counter 

Look at `file_counter` and you will see the error.


Good idea, but `file_counter` is still only 33. –