Skipping lines that contain only whitespace in R

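I have a problem with reading HTML subpages. Most of them work fine, but http://www-history.mcs.st-andrews.ac.uk/Biographies/De_Morgan.html, for example, has empty lines in H1 and H3. Because of that my data.frame is a mess, e.g.: data frame example. The frame contains 4 columns: "Name", "Date and place of birth", "Date and place of death", "Link". I am trying to create a table in LaTeX, but because of those whitespace-only lines my tabular goes wrong at some point, and a person's name ends up where his date of birth should be, and so on. To read the sites I simply use a loop from j = 1 to length(LinkiWlasciwy):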

matematyk=LinkWlasciwy[j] %>% read_html() %>% html_nodes(selektor1) %>% html_text()

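where selektor1 = "h3 font, h1". After that I save the result to a .txt file and read it in another script, which is supposed to create a .tex file from the data. In my opinion the best fix would be to delete the lines in the file that contain only whitespace, e.g. spaces, \n and so on. In my txt file I have, for example: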

Marie-Sophie Germain| 1 April 1776

in Paris, France| 27 June 1831

in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|

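As a separator I use "|". Not all of the blank lines are the same: some contain just one space, some two, and so on. All I want is to bring every broken record into this form: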

Marie-Sophie Germain| 1 April 1776 in Paris, France| 27 June 1831 in Paris, France| www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|

I had to remove http:// from the text samples because I do not have 10 reputation yet and they were counted as links.
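One way to avoid the stray lines in the first place might be to drop whitespace-only elements right after html_text(), before anything is written to the .txt file. The sketch below is only an illustration: the vector stands in for a real scraping result, and the name matematyk_clean is made up for this example.

```r
# Stand-in for the result of html_text(); a real scrape would produce this vector.
matematyk <- c("Marie-Sophie Germain", " ", "1 April 1776", "\n", "in Paris, France")

# Keep only elements containing at least one non-whitespace character.
matematyk_clean <- matematyk[grepl("[^[:space:]]", matematyk)]

# Strip leading and trailing whitespace from what is left.
matematyk_clean <- trimws(matematyk_clean)
print(matematyk_clean)
```

Filtering at this stage means the later script that builds the .tex file never sees the blank entries.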

Source

2016-02-28 Karol Kreczman

([^ \t])[ \t]+$, take a look at this post: http://stackoverflow.com/questions/9532340/how-to-remove-trailing-white-spaces-using-a-regular-expression-without-removing – 2016-02-28 10:08:05

Thank you very much, I could not find that topic. –

Glad it helped. – 2016-02-28 10:51:47

You can use the stringi library:

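library(stringi)

# Sample lines; the " " entries stand for the whitespace-only lines in the file
line <- c("Marie-Sophie Germain| 1 April 1776",
          " ",
          "in Paris, France| 27 June 1831",
          " ",
          "in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|")

# Keep only the lines that do not consist solely of spaces and tabs
line2 <- line[stri_count_regex(line, "^[ \\t]+$") == 0]
line2
# Join the remaining lines into a single record
stri_paste(line2, collapse = "")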

Result:

[1] "Marie-Sophie Germain| 1 April 1776in Paris, France| 27 June 1831in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|"
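Note that collapse="" glues "1776" and "in" together. If a space between the joined pieces is wanted, as in the target record from the question, a base R variant might look like the following sketch. The names kept and record are invented for the example, and the pattern also treats completely empty strings as blank, which is an assumption beyond the original answer.

```r
line <- c("Marie-Sophie Germain| 1 April 1776",
          " ",
          "in Paris, France| 27 June 1831",
          " ",
          "in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|")

# Drop lines that are empty or contain only whitespace.
kept <- line[!grepl("^[[:space:]]*$", line)]

# Join the remaining pieces with a single space.
record <- paste(kept, collapse = " ")
print(record)
```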

Source

2016-02-28 10:26:40 bartoszukm

Thank you for your time, but using ([^ \t\r\n])[ \t]+$ in gsub solved all the problems. –
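The comment does not show the actual call. A minimal sketch of how that pattern could be used with gsub, assuming the text is held as one string per line (the vector x and its contents are made up for illustration):

```r
# Hypothetical input: two lines with trailing spaces/tabs and one whitespace-only line.
x <- c("Marie-Sophie Germain| 1 April 1776  ",
       "in Paris, France| 27 June 1831\t",
       " ")

# Remove trailing spaces and tabs, keeping the last non-whitespace character (\\1).
cleaned <- gsub("([^ \t\r\n])[ \t]+$", "\\1", x)
print(cleaned)
```

In this per-line form, a line made up only of whitespace has no non-whitespace character for the group to match, so the pattern leaves it unchanged; such lines would still have to be dropped separately.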
