2017-08-12

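How to select specific nodes with the same name in XML using R

I am using the xml2 package in R to extract certain nodes that share the same class name. I am trying to extract the start and end dates (both have the class name 'date') that appear under the 'role' and 'company' tags in the XML. However, there are other date tags, related to training, that I do not need. In addition, the format varies from one XML file to another. Is there a function that would help me select the date tags that follow each role tag? Here is the XML snippet: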

<span class="work-hist-mark" id="57" inprof="n">CAREER HISTORY:</span> 
 
No Company Position Years * 
 
<span class="company" id="58" inprof="y">Nasioncom</span> 
 
<span class="role" id="59_1" inprof="y">Helpdesk</span> 
 
1st level 
 
<span class="date" id="60_1" inprof="y">Jan 1999</span> 
 
- 
 
<span class="date" id="60_2" inprof="y">June 2000</span> 
 
* 
 
<span class="role" id="61_1_1" inprof="y">Komputer Sistem System Engineer</span> 
 
<span class="date" id="61_2_1" inprof="y">June 2000</span> 
 
- 
 
<span class="date" id="61_2_2" inprof="y">Oct 2003</span> 
 
* 
 
<span class="role" id="62_1_1" inprof="y">Servicesoft Network Engineer</span> 
 
<span class="date" id="62_2_1" inprof="y">Oct 2003</span> 
 
- 
 
<span class="date" id="62_2_2" inprof="y">June 2006</span> 
 
* 
 
<span class="company" id="63_1" inprof="y">EDS</span> 
 
<span class="role" id="63_2_1" inprof="y">Infrastructure Associate</span> 
 
<span class="date" id="63_3_1" inprof="y">July</span> 
 
- 
 
<span class="date" id="63_3_2" inprof="y">Nov 2006</span> 
 
* 
 
<span class="company" id="64_1" inprof="y">Atos Origin</span> 
 
<span class="role" id="64_2_1" inprof="y">Technical Specialist</span> 
 
<span class="date" id="64_3_1" inprof="y">Nov 2006</span> 
 
- 
 
<span class="date" id="64_3_2" inprof="y">Nov 2008</span> 
 
* 
 
<span class="company" id="65" inprof="y">Hewlett Packard</span> 
 
<span class="role" id="66_1" inprof="y">Wintel Server Specialist</span> 
 
Level 3 
 
<span class="date" id="67_1" inprof="y">Nov 2008</span> 
 
to 
 
<span class="date" id="67_2" inprof="y">present</span> 
 
TRAINING ATTENDED: 
 
<span class="date" id="68" inprof="y">2001</span> 
 
<span class="sofwr" id="69" inprof="y">HP</span> 
 
& 
 
<span class="sofwr" id="70" inprof="y">Compaq Proliant server</span> 
 
series 
 
<span class="date" id="71_1_1" inprof="y">2003</span> 
 
/
 
<span class="date" id="71_1_2" inprof="y">05</span> 
 
<span class="role" id="71_2_1" inprof="y">Sophos Antivirus Technical Consultant</span> 
 
<span class="company" id="71_3" inprof="y">Mail Monitor SMTP</span> 
 
<span class="location" id="71_4" inprof="y">Pure</span> 
 
Message for 
 
<span class="sofwr" id="72" inprof="y">Exchange</span> 
 
or 
 
<span class="sofwr" id="73" inprof="y">UNIX</span> 
 
(antivirus + antispam) SAV Integrated (http web scanning) Remote Update (design for mobile user) Sophos in multiple platforms (open source eg: 
 
<span class="sofwr" id="74" inprof="y">UNIX</span> 
 
, 
 
<span class="sofwr" id="75" inprof="y">Linux</span> 
 
, 
 
<span class="sofwr" id="76" inprof="y">Mac9 &10</span> 
 
, 
 
<span class="sofwr" id="77" inprof="y">FreeBSD</span> 
 
) 
 
<span class="company" id="78" inprof="n">Small Business Enterprise</span> 
 
<span class="date" id="79" inprof="y">2005</span> 
 
Watchguard X500/ X2500 Add-on: 
 
<span class="company" id="80" inprof="y">GatewayAV, Weblocker & Spam</span> 
 
screen 
 
<span class="date" id="81" inprof="n">2007</span> 
 
<span class="sofwr" id="82" inprof="y">Microsoft Windows Vista</span> 
 
Install, configuring and managing 
 
<span class="sofwr" id="83" inprof="y">Windows Vista</span>

Answer

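This is an interesting one because the data is dirty: some dates are just a year, some are a three-letter month abbreviation followed by a year, and some spell out the full month name.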

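I'm not sure how you will choose to handle the dirty-data part, but what you are looking for is the readr package, and specifically the parse_date function.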

Here is an example. Suppose I have a string that reads "Jan foo 05, 2016 bar" and I want to get a Date object out of it.

library(readr) 
df1 <- "Jan foo 05, 2016 bar" 
parse_date(df1, "%b foo %d, %Y bar") 

[1] "2016-01-05" 

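You will need to take the same approach. I suggest storing each line as one observation, then filtering your observations down to only those where a date occurs. From there you can use parse_date the same way I did above. Because your date formats differ, you will need a function with if/else logic, or some other kind of handler, to accommodate the differences in the data.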
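One way to sketch the handler for differing formats is to try a short list of formats in turn and keep the first one that parses. The helper name `parse_mixed_date` and the list of formats are assumptions made for illustration; they are not part of readr. Values that match none of the formats, such as "present" or a bare "July", are left as NA.

```r
library(readr)

# Hypothetical helper (not a readr function): try each format in order and
# keep the first successful parse for every element of x.
parse_mixed_date <- function(x, formats = c("%b %Y", "%B %Y", "%Y")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)              # elements that have not parsed yet
    if (!any(todo)) break
    # parse_date() warns on failures; unparsed values simply stay NA here
    out[todo] <- suppressWarnings(parse_date(x[todo], format = fmt))
  }
  out
}

res <- parse_mixed_date(c("Jan 1999", "June 2000", "2001", "present"))
print(res)
```

Formats without a day or month fall back to the first day or month, so a bare year ends up on January 1 of that year. Whether that is acceptable depends on how the dates are used afterwards.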

For the filtering part, you can use dplyr's filter function, following the method mentioned in this thread.
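As a rough sketch of that filtering step, the spans can be read with xml2 into one row per span and then narrowed down with dplyr. The rule used here is an assumption about the data, not something the source confirms: a date is kept only if it is the first or second span after a role, with no other span in between. The snippet is a shortened copy of the one in the question, wrapped in a `<div>` so that all spans share one parent.

```r
library(xml2)
library(dplyr)

# Shortened copy of the question's snippet, wrapped in a <div> (assumption)
html <- '<div>
<span class="company" id="65">Hewlett Packard</span>
<span class="role" id="66_1">Wintel Server Specialist</span> Level 3
<span class="date" id="67_1">Nov 2008</span> to
<span class="date" id="67_2">present</span>
TRAINING ATTENDED:
<span class="date" id="68">2001</span>
<span class="sofwr" id="69">HP</span>
</div>'
doc <- read_html(html)

# One observation per <span>, in document order
spans <- xml_find_all(doc, "//span")
obs <- tibble(cls = xml_attr(spans, "class"), text = xml_text(spans))

career_dates <- obs %>%
  mutate(role_id = cumsum(cls == "role")) %>%   # new group at every role
  group_by(role_id) %>%
  mutate(
    role     = first(text),                     # the role heading the group
    pos      = row_number() - 1,                # 0 is the role span itself
    unbroken = cumall(cls == "date" | pos == 0) # only dates seen so far
  ) %>%
  ungroup() %>%
  filter(role_id > 0, cls == "date", pos <= 2, unbroken) %>%
  select(role, date = text)

print(career_dates)
```

The cap of two dates per role is what keeps the training date "2001" out, since it follows "present" with no other span in between. Files with a different layout would need a different rule.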

Does that make sense? Good luck!
