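I think this is what you want (the question isn't very clear, IMO). I also think the main problem is the default namespace, which is definitely a royal pain: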
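library(xml2)
library(purrr)
library(dplyr)
library(stringi)

# Parse the already-unzipped file
doc <- read_xml("allppis.xml")

# xml2 labels the default namespace "d1"; rename it to a shorter prefix
ns <- xml_ns_rename(xml_ns(doc), d1="x")

xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>%
  xml_text() %>%
  # Split each string at the first "; " into two columns
  stri_split_fixed("; ", n=2, simplify=TRUE) %>%
  # as_data_frame() is deprecated in current tibble releases
  as.data.frame(stringsAsFactors=FALSE) %>%
  setNames(c("Proteins", "Description")) %>%
  as_tibble() %>%
  mutate(Proteins=trimws(Proteins),
         Description=trimws(Description))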
## # A tibble: 3,628 × 2
## Proteins Description
## <chr> <chr>
## 1 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 2 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 3 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 4 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 5 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 6 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 7 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 8 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8)
## 9 TRP3 calcium influx channel protein
## 10 IP3R-3 inositol 1,4,5-trisphosphate receptor, type 3
## # ... with 3,618 more rows
You'll need to clean it up a bit (`View()` the resulting data frame to see what I mean).
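To see why the namespace prefix matters, here is a minimal sketch on a toy document. The inline XML and the namespace URI `http://example.org/toy` are made up for illustration and are not taken from `allppis.xml`; only the element names mirror the XPath used above.

```r
library(xml2)

# Toy document with a default namespace (hypothetical URI, illustration only)
toy <- read_xml('<entrySet xmlns="http://example.org/toy">
  <proteinInteractor>
    <names><fullName>S100A8;CAGA;MRP8; calgranulin A</fullName></names>
  </proteinInteractor>
</entrySet>')

# An unprefixed XPath name only matches elements in no namespace,
# so this finds nothing even though <fullName> is clearly there
no_prefix <- xml_find_all(toy, ".//fullName")
length(no_prefix)

# xml2 reports the default namespace under the name "d1"; rename it to "x"
ns <- xml_ns_rename(xml_ns(toy), d1 = "x")

# With the prefix, the same element is found
with_prefix <- xml_find_all(toy, ".//x:fullName", ns)
xml_text(with_prefix)
```

The same pattern should carry over to the real file, assuming its elements all sit in a single default namespace.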
You need to download and unzip the file before you can parse it. [This shows one way](http://stackoverflow.com/questions/23899525/using-r-to-download-zipped-data-file-extract-and-import-csv). So try `temp <- tempfile(); download.file("http://mips.helmholtz-muenchen.de/proj/ppi/data/mppi.gz", temp); unz(temp, "allppis.xml")`, then `doc <- xmlTreeParse(temp, useInternal = TRUE); root <- xmlRoot(doc)` – user20650
There's also this package, which may be useful: https://www.bioconductor.org/packages/release/bioc/html/RpsiXML.html – user20650
@user20650 Now I just type `doc` and I can see the XML is in there, but where does it get saved? Can you help me get the exact output I want? –