Combining corpora in tm

I have a list of URLs whose web content I fetch and load into tm corpora:

library(tm) 
library(XML) 

link <- c(
"http://www.r-statistics.com/tag/hadley-wickham/",              
"http://had.co.nz/",                      
"http://vita.had.co.nz/articles.html",                 
"http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",       
"http://www.analyticstory.com/hadley-wickham/" 
)    

create.corpus <- function(url.name){ 
doc=htmlParse(url.name) 
parag=xpathSApply(doc,'//p',xmlValue) 
if (length(parag)==0){ 
    parag="empty" 
} 
cc=Corpus(VectorSource(parag)) 
meta(cc,"link")=url.name 
return(cc) 
} 

link=catch$url 
cc <- lapply(link, create.corpus)

This gives me a "big list" of corpora, one for each URL. Combining them one by one works:

x=cc[[1]] 
y=cc[[2]] 
z=c(x,y,recursive=T) # preserved metadata 
x;y;z 
# A corpus with 8 text documents 
# A corpus with 2 text documents 
# A corpus with 10 text documents

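But this becomes infeasible for a list of several thousand corpora. So how can a list of corpora be merged into a single corpus while preserving the metadata?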


2014-01-07 Henk

You can use do.call to call c:

do.call(function(...) c(..., recursive = TRUE), cc) 
# A corpus with 155 text documents
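Why this works may be easier to see without tm. Below is a minimal base-R sketch that uses nested lists of strings as stand-ins for corpora; the names `parts` and `combine_all` are made up for illustration. do.call passes each list element as a separate argument, and the anonymous wrapper forwards them through `...` while fixing `recursive = TRUE`.

```r
# Stand-ins for the list of corpora: a list of (possibly nested) lists.
parts <- list(
  list("a", "b"),
  list("c"),
  list("d", list("e"))
)

# do.call(f, parts) is equivalent to f(parts[[1]], parts[[2]], parts[[3]]).
# The wrapper forwards every element via `...` and adds recursive = TRUE.
combine_all <- function(lst) {
  do.call(function(...) c(..., recursive = TRUE), lst)
}

combined <- combine_all(parts)

# Alternative without a wrapper: append the named argument to the list.
combined_alt <- do.call("c", c(parts, list(recursive = TRUE)))

print(combined)
```

The same call shape applies to the list `cc` from the question, where `c` dispatches to tm's method for corpora instead of the default one.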


2014-01-07 12:13:41

Works! I never realized you could use (...) that way. – Henk

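I don't think tm provides any built-in function for joining/merging many corpora. But after all, a corpus is a list of documents, so the question is how to turn a list of lists into a single list. I would create a new corpus from all the documents and then assign the metadata manually: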

y = Corpus(VectorSource(unlist(cc))) 
meta(y,'link') = do.call(rbind,lapply(cc,meta))$link
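The flatten-then-reassign idea can be sketched in base R alone. This is only an illustration under assumptions: the URLs and paragraphs in `pages` are invented stand-in data, and a data frame takes the place of the corpus. The point is that each link is repeated once per document so that the metadata stays aligned with the flattened texts.

```r
# Hypothetical stand-in data: one character vector of paragraphs per URL.
pages <- list(
  "http://example.org/a" = c("first paragraph", "second paragraph"),
  "http://example.org/b" = c("only paragraph")
)

# Flatten all paragraphs into one vector (the "list of lists -> list" step).
texts <- unlist(pages, use.names = FALSE)

# Repeat each URL once per paragraph so the metadata lines up with texts.
links <- rep(names(pages), times = lengths(pages))

flat <- data.frame(text = texts, link = links, stringsAsFactors = FALSE)
print(flat)
```

With tm, `texts` would play the role of the input to VectorSource and `links` the role of the values assigned through meta, as in the two lines above.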


2014-01-07 12:33:19 agstudy

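Your code doesn't work because catch is not defined, so I don't know exactly what it is supposed to do.

But tm corpora can now simply be put into a vector to make one big corpus: https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/tm_combine

So maybe c(unlist(cc)) will work. I have no way to test whether it does, because your code doesn't run.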


2017-11-17 21:23:41 wordsforthewise
