HTML character entity replacement in R

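I have a large set of HTML files that contain magazine text in span nodes. My PDF-to-HTML converter inserts the character entity &nbsp; throughout the HTML. The problem is that in R I use the xmlValue function (from the XML package) to extract the text, and wherever &nbsp; occurs, the space between the words is lost. For example: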

<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span>

comes out of the xmlValue function as:

"kids,and kids in your community,in DIYprojects."

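I was thinking that the simplest fix would be to find every &nbsp; and replace it with " " (a space) before running the span nodes through xmlValue. How would I go about this?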

Source

2013-01-15 Gene Burinsky

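I have rewritten the answer to reflect the original poster's problem of not being able to get the text from xmlValue. There may be different ways to solve this, but one approach is to open the HTML files themselves, do the replacement, and write them back out. Processing XML/HTML with regular expressions is usually a bad idea, but in this case we have a straightforward problem of unwanted non-breaking spaces, so it is probably not too much of an issue. The following code is an example of how to build a list of matching files and run gsub on their contents. It should be easy to modify or extend as needed.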

setwd("c:/test/") 
# Create 'html' file to use with test 
txt <- "<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span> 
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span> 
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in         DIY&nbsp;projects.&nbsp;</span>" 
writeLines(txt, "file1.html") 

# Now read files - in this case only one 
html.files <- list.files(pattern = "\\.html$")  # regex: file names ending in .html
html.files 

# Loop through the list of files 
retval <- lapply(html.files, function(x) { 
      in.lines <- readLines(x, n = -1) 
      # Replace non-breaking space with space 
      out.lines <- gsub("&nbsp;", " ", in.lines, fixed = TRUE)
      # Write out the corrected lines to a new file 
      writeLines(out.lines, paste("new_", x, sep = "")) 
})
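
If writing corrected copies to disk is not wanted, the same replacement can be done in memory and the cleaned text handed straight to the parser. The following is only a minimal sketch of that idea: the helper name read_html_without_nbsp is hypothetical, and the commented-out parsing step assumes the XML package is installed.

```r
# Hypothetical helper (not part of the original answer): read one HTML file
# and return its contents as a single string with &nbsp; replaced by spaces.
read_html_without_nbsp <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # fixed = TRUE treats "&nbsp;" as a literal string, not a regular expression
  cleaned <- gsub("&nbsp;", " ", lines, fixed = TRUE)
  paste(cleaned, collapse = "\n")
}

# Demonstration on a temporary file, so the working directory is not touched
tmp <- tempfile(fileext = ".html")
writeLines(
  "<span class=\"ft6\">kids,&nbsp;and kids in your community,&nbsp;in DIY&nbsp;projects.&nbsp;</span>",
  tmp
)
html_text <- read_html_without_nbsp(tmp)
print(html_text)

# With the XML package available, the string could then be parsed directly:
# doc <- XML::htmlParse(html_text, asText = TRUE)
# XML::xpathSApply(doc, "//span", XML::xmlValue)
```

This keeps the original files unchanged, at the cost of repeating the replacement each time a file is read.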

Source

2013-01-15 00:23:10 SlowLearner

It's '&nbsp;', not '$nbsp', by the way, so 'gsub("&nbsp;", " ", test)' should work. – thelatemail

@thelatemail Thanks for spotting that - typo now fixed. I must avoid posting before I am properly awake... – SlowLearner

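I tried gsub. The problem is that the input to xmlValue is not a character vector; it is an "XMLInternalNode". gsub needs a character vector or something coercible to one, and this is neither. –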
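
One possible way around the limitation raised in the comment above is to leave the node alone and clean the character vector that xmlValue returns. The sketch below assumes the parser decodes &nbsp; to the Unicode non-breaking space (U+00A0) rather than dropping it, which may not match every setup. The helper name replace_nbsp_chars is hypothetical, and the input string is only a stand-in for real xmlValue output.

```r
# Hypothetical helper: replace U+00A0 (non-breaking space) with a plain space
# in text that has already been extracted as a character vector.
replace_nbsp_chars <- function(x) {
  gsub("\u00a0", " ", x, fixed = TRUE)
}

# Stand-in for the result of xmlValue(node); assumes &nbsp; was decoded to U+00A0
extracted <- "kids,\u00a0and kids in your community,\u00a0in DIY\u00a0projects.\u00a0"
cleaned <- replace_nbsp_chars(extracted)
print(cleaned)
```

Because gsub is applied to the extracted string and not to the node, the "XMLInternalNode" type never reaches it.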
