2011-12-07 28 views

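How do I read a text file into R when the data is not in a table?

I have a very long phone log stored as a text file. I am trying to read it into R, but I am not getting anywhere. The text has a structure, but it is definitely not tabular. The structure is as follows: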

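  1. Each record spans multiple lines, so readLines on its own is not quite suitable
  2. Each line within a record is a separate field
  3. Some records have an extra field after the second field
  4. Each new record is set off by a blank line.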

readLines or scan would work if I could specify that records are separated by "\n\n" and that fields (or columns) are separated by "\n".

Here is an example:

TheInstitute 5467 
    telephone line 4125526987 x 4567 
    datetime 2011110516 12:56 
    blay blay blah who knows what, but anyway it may have a comma 

TheInstitute 5467 
    telephone line 4125526987 x 4567 
    datetime 2011110516 12:58 
    blay blay blah who knows what 

TheInstitute 5467 
    telephone line 412552999 x 4999 
    bump phone line 4125527777 
    datetime 2011110516 12:59 
    blay blay blah who knows what 

TheInstitute 5467 
    telephone line 4125526987 x 4567 
    bump phone line 4125527777 
    datetime 2011110516 13:51 
    blay blay blah who knows what, but anyway it may have a comma 

TheInstitute 5467 
    telephone line 4125526987 x 4567 
    datetime 2011110516 14:56 
    blay blay blah who knows what 

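How can I do this in R? I have tried tricks with scan, paste, and strsplit, but I am going in circles. I probably have to read it into a list, since a list can handle unequal numbers of elements. I want all records to have the same number of fields, and for records that lack a field (here, the bump phone line) I would like NA as the value of that field. Even help just getting started would be appreciated. I can tinker from there.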

Answer


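When using multi.line=TRUE in the scan function, records should be terminated by two end-of-lines. I am doing this with a textConnection wrapped around the file text, but you could use a valid file name instead: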

inp <- scan(textConnection(txt), multi.line=TRUE, 
      what=list(place="character", tline1="character", 
      cline1="character", cline2 ="character", cline3="character"), sep="\n") 
Read 5 records 
> str(as.data.frame(inp)) 
'data.frame': 5 obs. of 5 variables: 
$ place : Factor w/ 1 level "TheInstitute 5467": 1 1 1 1 1 
$ tline1: Factor w/ 2 levels " telephone line 4125526987 x 4567",..: 1 1 2 1 1 
$ cline1: Factor w/ 4 levels " bump phone line 4125527777",..: 2 3 1 1 4 
$ cline2: Factor w/ 4 levels " blay blay blah who knows what",..: 2 1 3 4 1 
$ cline3: Factor w/ 3 levels ""," blay blay blah who knows what",..: 1 1 2 3 1 
> as.data.frame(inp) 
       place        tline1 
1 TheInstitute 5467 telephone line 4125526987 x 4567 
2 TheInstitute 5467 telephone line 4125526987 x 4567 
3 TheInstitute 5467 telephone line 412552999 x 4999 
4 TheInstitute 5467 telephone line 4125526987 x 4567 
5 TheInstitute 5467 telephone line 4125526987 x 4567 
         cline1 
1 datetime 2011110516 12:56 
2 datetime 2011110516 12:58 
3 bump phone line 4125527777 
4 bump phone line 4125527777 
5 datetime 2011110516 14:56 
                  cline2 
1 blay blay blah who knows what, but anyway it may have a comma 
2         blay blay blah who knows what 
3          datetime 2011110516 12:59 
4          datetime 2011110516 13:51 
5         blay blay blah who knows what 
                  cline3 
1                 
2                 
3         blay blay blah who knows what 
4 blay blay blah who knows what, but anyway it may have a comma 
5                 
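
The result above leaves the bump line, the datetime, and the free text in different columns depending on the record. One possible way to get the NA-filled layout the question asks for is sketched below. It is a different technique from the scan call above, not the answerer's method: it reads the text with readLines, splits records on blank lines, and picks each field by its leading keyword. It assumes that the keywords "telephone line", "bump phone line", and "datetime" always start their lines, that the first line of a record is the place, and that the last line is the free text. The names `pick` and `log_df` are hypothetical, and only two of the five sample records are included to keep it short.

```r
# Two records from the question's sample: one without and one with a bump line.
txt <- "TheInstitute 5467
    telephone line 4125526987 x 4567
    datetime 2011110516 12:56
    blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
    telephone line 412552999 x 4999
    bump phone line 4125527777
    datetime 2011110516 12:59
    blay blay blah who knows what"

# Read all lines and strip the indentation; a file name passed to readLines()
# would work in place of the text connection.
con <- textConnection(txt)
lines <- trimws(readLines(con))
close(con)

# Every blank line bumps the record counter, so lines between two blank
# lines share one id. The blank lines themselves are dropped.
rec_id <- cumsum(lines == "")
keep <- lines != ""
records <- unname(split(lines[keep], rec_id[keep]))

# Hypothetical helper: first line of a record starting with `prefix`, else NA.
pick <- function(rec, prefix) {
  hit <- rec[startsWith(rec, prefix)]
  if (length(hit) > 0) hit[1] else NA_character_
}

# Assumption: the first line is the place and the last line is the free text.
log_df <- data.frame(
  place     = vapply(records, function(r) r[1], character(1)),
  telephone = vapply(records, pick, character(1), prefix = "telephone line"),
  bump      = vapply(records, pick, character(1), prefix = "bump phone line"),
  datetime  = vapply(records, pick, character(1), prefix = "datetime"),
  comment   = vapply(records, function(r) r[length(r)], character(1)),
  stringsAsFactors = FALSE
)

str(log_df)
```

Matching on keywords rather than on position means a record with no bump line simply gets NA in that column, and every record ends up with the same five fields. Splitting the individual fields further (for example, separating the extension from the phone number) would be a separate step.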

+1 Very nice... – Andrie

...but I guess you need to split 'place', 'tline' and 'cline1' further into sub-columns? – Tommy

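I think the next task would be to move the 'datetime' and 'bump line' data around, but I don't think the asker requested parsing of the comment text. –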