2017-03-25

I am new to Python generators and want to nest them, i.e. have generator A depend on the output of generator B (B yields file paths, A parses the documents). However, only the first file is read; the nested generator is not triggered correctly.

Here is a minimal sample (using the TREC8all data):

import itertools
import spacy
from bs4 import BeautifulSoup
import os

def iter_all_files(p):
    for root, dirs, files in os.walk(p):
        for file in files:
            if not file.startswith('.'):
                print('using: ' + str(os.path.join(root, file)))
                yield os.path.join(root, file)


def gen_items(path):
    path = next(path)
    text_file = open(path, 'r').read()
    soup = BeautifulSoup(text_file, 'html.parser')
    for doc in soup.find_all("doc"):
        strdoc = doc.docno.string.strip()
        text_only = str(doc.find_all("text")[0])
        yield (strdoc, text_only)


file_counter = 0 
g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 
    file_counter += 1 
file_counter 

This outputs only:

using: data/TREC8all/Adhoc/fbis/fb396002 
Out[10]: 
33 

But there are of course more files to parse, as the following shows:

g = iter_all_files("data/TREC8all/Adhoc")
file_counter = 0
item_counter = 0
for file in g:
    file_counter += 1
    # print(file)
    for item in gen_items(g):
        item_counter += 1

print(item_counter)
file_counter

which returns around 2000 files, like:

using: data/TREC8all/Adhoc/fbis/fb396002 
using: data/TREC8all/Adhoc/fbis/fb396003 
using: data/TREC8all/Adhoc/fbis/fb396004 
using: data/TREC8all/Adhoc/fbis/fb396005 
using: data/TREC8all/Adhoc/fbis/fb396006 
using: data/TREC8all/Adhoc/fbis/fb396007 
using: data/TREC8all/Adhoc/fbis/fb396008 
using: data/TREC8all/Adhoc/fbis/fb396009 
using: data/TREC8all/Adhoc/fbis/fb396010 
using: data/TREC8all/Adhoc/fbis/fb396011 
using: data/TREC8all/Adhoc/fbis/fb396012 
using: data/TREC8all/Adhoc/fbis/fb396013 

So it is clear that my

g = iter_all_files("data/TREC8all/Adhoc") 
gen1, gen2 = itertools.tee(gen_items(g)) 
ids = (id_ for (id_, text) in gen1) 
texts = (text for (id_, text) in gen2) 
docs = nlp.pipe(texts, batch_size=50, n_threads=4) 

for id_, doc in zip(ids, docs): 

is not consuming the nested generators in the right way.

Edit

Nesting with an outer for loop seems to work, but is not nice. Is there a better way to formulate it?

g = iter_all_files("data/TREC8all/Adhoc")
for file in g:
    file_counter += 1
    # print(file)
    #for item in gen_items(g):
    gen1, gen2 = itertools.tee(gen_items(g))

'path = next(path)' - why do you take only a single item from the iterator, if you did not intend to stop at the first one? – user2357112

Answers


"but only the first file is read"

Well, you only told Python to read one file:

def gen_items(path): 
    path = next(path) 
    ... 

If you want to go over all the files, you need a loop:

def gen_items(paths):
    for path in paths:
        ...
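Applied to the question's pattern, the fix might look like the following self-contained sketch. The stand-in generators replace `iter_all_files` and the BeautifulSoup parsing, which need the TREC8all files on disk; only the nesting structure is the point here.

```python
def iter_paths():
    # stand-in for iter_all_files: yields several "file paths"
    for name in ["fb396002", "fb396003", "fb396004"]:
        yield name

def gen_items(paths):
    # loop over ALL paths instead of calling next(paths) once
    for path in paths:
        # stand-in for the BeautifulSoup parsing: each "file"
        # contributes two (docno, text) items
        yield (path + "-doc1", "text1")
        yield (path + "-doc2", "text2")

items = list(gen_items(iter_paths()))
print(len(items))  # 6: all 3 paths are consumed, 2 items each
```

With this change, the `itertools.tee` and `zip` pipeline from the question sees every document, not just those in the first file.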

So there is no more elegant way to nest generators? –


@GeorgHeiler: In case you didn't notice it the first time, I told you to use a loop inside `gen_items`. If you want one generator to process the items of another generator, it needs a loop. If you want to take a function that handles a single item and apply it to the items produced by `iter_all_files`, you want `map`. – user2357112
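The `map` alternative the comment mentions could look like the following sketch, with a hypothetical `parse_file` standing in for the real per-file parser:

```python
import itertools

def parse_file(path):
    # hypothetical per-file parser: takes ONE path, returns its items
    return [(path + "-doc1", "text1"), (path + "-doc2", "text2")]

paths = iter(["fb396002", "fb396003"])
# map applies parse_file to every path; chain.from_iterable flattens
# the per-file lists back into a single stream of (docno, text) pairs
items = itertools.chain.from_iterable(map(parse_file, paths))
print(sum(1 for _ in items))  # 4: two files, two items each
```

This keeps the per-file logic in an ordinary function and leaves the iteration to `map`, which is equivalent to the loop-based generator above.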


Reviewing the code, I don't know what `nlp.pipe` means; try it like this:

#docs = nlp.pipe(texts, batch_size=50, n_threads=4) 
for id_, doc in zip(ids, texts): 
    file_counter += 1 
file_counter 

Look at `file_counter` and you will see the error.


Good idea, but `file_counter` is still only 33. –