2016-11-15

How can I parse this XML data? I can download the data from http://mips.helmholtz-muenchen.de/proj/ppi/; on that page it is the last item, labelled "You can get the full dataset".

I then tried to use the XML package:

library(XML) 
doc <- xmlTreeParse("path to/allppis.xml", useInternal = TRUE) 
root <- xmlRoot(doc) 

but the result seems to be empty.

What do I want?

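If I open the allppis.xml file downloaded from that site, I want to parse specific lines into a txt file: the ones that start with <fullName> and end with </fullName>. For example, if I open the file, I can see this: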

<fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName> 

and I would then like to have this:

Proteins     description 
S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8) 

You need to download and decompress the file before you can parse it. [This shows one way](http://stackoverflow.com/questions/23899525/using-r-to-download-zipped-data-file-extract-and-import-csv). So try `temp <- tempfile(); download.file("http://mips.helmholtz-muenchen.de/proj/ppi/data/mppi.gz", temp); unz(temp, "allppis.xml")`, and then `doc <- xmlTreeParse(temp, useInternal = TRUE); root <- xmlRoot(doc)` – user20650
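A side note on the decompression step in the comment above: `unz()` is meant for zip archives, whereas a `.gz` file is usually read through `gzfile()`. The following is only a minimal base-R sketch of that idea. It assumes the download is a plain gzip-compressed XML file (not verified here), and it builds a small stand-in file locally instead of downloading anything; `gz_path` and `xml_path` are illustrative names.

```r
# Minimal sketch: decompress a gzip file to plain XML with base R only.
# The demo file is created locally; it stands in for the real download.
gz_path  <- tempfile(fileext = ".gz")
xml_path <- tempfile(fileext = ".xml")

# Write a tiny gzip-compressed XML file to work with
con <- gzfile(gz_path, "wt")
writeLines(c("<entrySet>",
             "  <fullName>TRP3; calcium influx channel protein</fullName>",
             "</entrySet>"), con)
close(con)

# gzfile() decompresses transparently when reading
con <- gzfile(gz_path, "rt")
xml_lines <- readLines(con)
close(con)

# Save the decompressed text so an XML parser can be pointed at xml_path
writeLines(xml_lines, xml_path)
```

If the real file turns out to be a tar archive or a zip file, a different reader would be needed.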


There is also this package, which may be useful: https://www.bioconductor.org/packages/release/bioc/html/RpsiXML.html – user20650


@user20650 Now I can just type doc and see the XML inside it, but where is it saved? Could you help me get the exact output I want? –

Answers


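I think you want something like this (the question is not very clear, IMO). I also think the main problem is the default namespace, which is definitely a royal pain: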

library(xml2) 
library(purrr) 
library(dplyr) 
library(stringi) 

doc <- read_xml("allppis.xml") 

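# xml2 labels the default namespace "d1"; rename the prefix to "x" for use in XPath
ns <- xml_ns_rename(xml_ns(doc), d1="x")

xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>%
    xml_text() %>%
    # split each entry at the first "; " into identifiers and description
    stri_split_fixed("; ", n=2, simplify=TRUE) %>%
    # as_data_frame() is deprecated in current tibble releases
    as.data.frame(stringsAsFactors=FALSE) %>%
    as_tibble() %>%
    setNames(c("Proteins", "Description")) %>%
    mutate(Proteins=trimws(Proteins),
           Description=trimws(Description))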
## # A tibble: 3,628 × 2
##    Proteins          Description
##    <chr>             <chr>
##  1 S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
##  2 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
##  3 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
##  4 S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
##  5 S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
##  6 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
##  7 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
##  8 S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
##  9 TRP3              calcium influx channel protein
## 10 IP3R-3            inositol 1,4,5-trisphosphate receptor, type 3
## # ... with 3,618 more rows

You will need to clean it up a bit (View() the resulting data frame to see what I mean).
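To show why the default namespace matters, and how the result could end up in a txt file as the question asks, here is a small self-contained sketch. It uses a made-up two-entry document with a hypothetical namespace URI (`http://example.org/mi`) in place of allppis.xml, and `out_file` is an illustrative name; the real file's structure is assumed to match the XPath used in the answer.

```r
library(xml2)

# Tiny stand-in document with a default namespace (hypothetical URI)
xml_txt <- '<entrySet xmlns="http://example.org/mi">
  <proteinInteractor>
    <names><fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName></names>
  </proteinInteractor>
  <proteinInteractor>
    <names><fullName>TRP3; calcium influx channel protein</fullName></names>
  </proteinInteractor>
</entrySet>'
doc <- read_xml(xml_txt)

# An unprefixed XPath does not match elements in the default namespace
no_ns <- xml_find_all(doc, ".//fullName")

# xml_ns() exposes the default namespace under the prefix "d1"
ns <- xml_ns(doc)
full <- xml_text(xml_find_all(doc, ".//d1:fullName", ns))

# Split each entry at the first "; " (base R alternative to stringi)
parts <- regmatches(full, regexpr("; ", full), invert = TRUE)
out <- data.frame(
  Proteins    = trimws(vapply(parts, `[`, character(1), 1)),
  Description = trimws(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)

# Write a tab-separated text file
out_file <- tempfile(fileext = ".txt")
write.table(out, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Entries that contain no "; " end up with `NA` in `Description` under this splitting rule, which is part of the clean-up mentioned above.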


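Thank you very much! I have a few concerns. 1- Sometimes I don't see the protein ID, only the description; would it be possible to have the 'db=' and 'id=' values for each protein in another column? I will definitely accept your answer. Thanks again –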
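One possible way to approach the `db=`/`id=` follow-up is sketched below. It assumes that each `proteinInteractor` may carry an `xref/primaryRef` element with `db` and `id` attributes; that layout, the namespace URI, and the attribute values (`exampleDB`, `X00001`) are all made up for illustration and have not been checked against allppis.xml.

```r
library(xml2)

# Stand-in document; the xref/primaryRef layout is an assumption
xml_txt <- '<entrySet xmlns="http://example.org/mi">
  <proteinInteractor>
    <names><fullName>S100A8;CAGA;MRP8; calgranulin A</fullName></names>
    <xref><primaryRef db="exampleDB" id="X00001"/></xref>
  </proteinInteractor>
  <proteinInteractor>
    <names><fullName>calcium influx channel protein</fullName></names>
  </proteinInteractor>
</entrySet>'
doc <- read_xml(xml_txt)
ns  <- xml_ns(doc)

# Work per interactor so that rows stay aligned even when a child is missing
interactors <- xml_find_all(doc, ".//d1:proteinInteractor", ns)
refs <- xml_find_first(interactors, "./d1:xref/d1:primaryRef", ns)

res <- data.frame(
  fullName = xml_text(xml_find_first(interactors, "./d1:names/d1:fullName", ns)),
  db       = xml_attr(refs, "db"),   # NA where no primaryRef exists
  id       = xml_attr(refs, "id"),
  stringsAsFactors = FALSE
)
```

Using `xml_find_first()` on the set of interactors, rather than `xml_find_all()` on the whole document, returns one result per interactor, so a missing reference shows up as `NA` instead of shifting later rows.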