How can I parse this XML data?

The data can be downloaded from http://mips.helmholtz-muenchen.de/proj/ppi/. It is the last item on the page, under "You can get the full dataset".

I then tried using the XML package:

library(XML) 
doc <- xmlTreeParse("path to/allppis.xml", useInternal = TRUE) 
root <- xmlRoot(doc)

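But the result seems to be empty.

What do I want?

If I open the allppis.xml file downloaded from that site, I want to extract specific lines into a txt file: those that start with <fullName> and end with </fullName>.

For example, if I open the file I can see this: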

<fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName>

And then I want to have this:

Proteins     description 
S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)

Source

2016-11-15 Learner Algorithm

You need to download and decompress the file before you can parse it. [This shows one way](http://stackoverflow.com/questions/23899525/using-r-to-download-zipped-data-file-extract-and-import-csv). So try 'temp <- tempfile(); download.file("http://mips.helmholtz-muenchen.de/proj/ppi/data/mppi.gz", temp); unz(temp, "allppis.xml")', then 'doc <- xmlTreeParse(temp, useInternal = TRUE); root <- xmlRoot(doc)' – user20650

There is also this package, which may be useful: https://www.bioconductor.org/packages/release/bioc/html/RpsiXML.html – user20650

@user20650 Now I just type doc and I can see the XML is in there, but where does it save it? Can you help me get the exact output I want? –
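As a minimal sketch of the decompress-then-parse step discussed in these comments, the snippet below reads a gzip-compressed XML file through a base R gzfile() connection and parses it with the XML package. It assumes the download is a plain gzip of a single XML file, which has not been checked against mppi.gz, and it uses a tiny made-up file in a temporary location instead of the real download. The parsed document only exists in memory; nothing is saved to disk until it is written out explicitly, as the last line does.

```r
library(XML)

# Hypothetical stand-in for the downloaded archive: a tiny gzip-compressed XML file.
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, "w")
writeLines(c("<entrySet>",
             "  <fullName>S100A8;CAGA;MRP8; calgranulin A</fullName>",
             "</entrySet>"), con)
close(con)

# Read the decompressed text through a gzfile() connection.
con <- gzfile(gz_path, "r")
xml_lines <- readLines(con)
close(con)

# Parse the text in memory and pull out the content of every <fullName> element.
doc <- xmlParse(paste(xml_lines, collapse = "\n"), asText = TRUE)
full_names <- xpathSApply(doc, "//fullName", xmlValue)

# Saving is a separate, explicit step (the output path here is a temp file).
out_path <- tempfile(fileext = ".txt")
writeLines(full_names, out_path)
```

This toy file has no default namespace, so the plain "//fullName" XPath is enough here; a file that declares one needs namespace-aware queries.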

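I think you want something like this (the question isn't very clear, IMO). I also think the main problem is the default namespace, which is definitely a royal pain: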

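library(xml2)
library(purrr)
library(dplyr)
library(stringi)

doc <- read_xml("allppis.xml")

# xml2 labels the default namespace "d1"; rename it to "x" for shorter XPath
ns <- xml_ns_rename(xml_ns(doc), d1="x")

xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>%
    xml_text() %>%
    # split on the first "; " only: symbols before it, description after it
    stri_split_fixed("; ", n=2, simplify=TRUE) %>%
    # as_data_frame() is deprecated in current tibble, so go via a data frame
    as.data.frame(stringsAsFactors=FALSE) %>%
    as_tibble() %>%
    setNames(c("Proteins", "Description")) %>%
    mutate(Proteins=trimws(Proteins),
           Description=trimws(Description))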
## # A tibble: 3,628 × 2 
##    Proteins             Description 
##    <chr>               <chr> 
## 1 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8) 
## 2 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14) 
## 3 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14) 
## 4 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8) 
## 5 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8) 
## 6 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14) 
## 7 S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14) 
## 8 S100A8;CAGA;MRP8 calgranulin A (migration inhibitory factor-related protein 8) 
## 9    TRP3         calcium influx channel protein 
## 10   IP3R-3     inositol 1,4,5-trisphosphate receptor, type 3 
## # ... with 3,618 more rows

You'll need to clean it up a bit (View() the resulting data frame to see what I mean).
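To illustrate why the default namespace matters, here is a small sketch on a made-up document (the namespace URI and the content are illustrative, not taken from allppis.xml). An unprefixed XPath name only matches elements in no namespace, so the first query finds nothing, which may be why a naive query on such a file looks empty. Registering a prefix for the default namespace makes the same query work.

```r
library(xml2)

# Made-up miniature document with a default namespace (illustrative URI).
doc <- read_xml('<entrySet xmlns="http://example.org/ns">
  <proteinInteractor>
    <names><fullName>TRP3; calcium influx channel protein</fullName></names>
  </proteinInteractor>
</entrySet>')

# Unprefixed name: looks for <fullName> in no namespace, so nothing matches.
n_unprefixed <- length(xml_find_all(doc, ".//fullName"))

# xml2 labels the default namespace "d1"; renaming it to "x" is cosmetic.
ns <- xml_ns_rename(xml_ns(doc), d1 = "x")
hits <- xml_find_all(doc, ".//x:fullName", ns)

print(n_unprefixed)
print(xml_text(hits))
```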

Source

2016-11-16 05:07:27 hrbrmstr

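Thank you very much! I have a small concern: sometimes I don't see a protein ID, only the description. Would it be possible to have the 'db=' and 'id=' of each protein in another column? I will definitely accept your answer. Thanks again –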
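One possible way to approach this follow-up is sketched below. It assumes that each proteinInteractor carries an xref/primaryRef element with db and id attributes; that layout is an assumption and has not been checked against allppis.xml. The miniature document, its namespace URI, the attribute values, and the output file are all made up for illustration.

```r
library(xml2)

# Made-up miniature document; the second interactor deliberately has no xref.
doc <- read_xml('<entrySet xmlns="http://example.org/ns">
  <proteinInteractor>
    <names><fullName>TRP3; calcium influx channel protein</fullName></names>
    <xref><primaryRef db="exampledb" id="ID0001"/></xref>
  </proteinInteractor>
  <proteinInteractor>
    <names><fullName>a description without a leading symbol</fullName></names>
  </proteinInteractor>
</entrySet>')

ns <- xml_ns_rename(xml_ns(doc), d1 = "x")
interactors <- xml_find_all(doc, ".//x:proteinInteractor", ns)

# xml_find_first() returns one result per input node (a missing node when
# nothing matches), so the columns stay aligned and gaps become NA.
ref <- xml_find_first(interactors, "./x:xref/x:primaryRef", ns)
result <- data.frame(
  fullName = xml_text(xml_find_first(interactors, "./x:names/x:fullName", ns)),
  db = xml_attr(ref, "db"),
  id = xml_attr(ref, "id"),
  stringsAsFactors = FALSE
)

# Write a tab-separated txt file (a temp file here, purely for the example).
out_file <- tempfile(fileext = ".txt")
write.table(result, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
print(result)
```

Working per interactor rather than collecting all fullName nodes at once keeps each name on the same row as its own db and id, even when one of them is absent.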
