2014-04-13

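getForm - how to send special characters?

I have a small script written with RCurl that connects to a Polish-language corpus and asks for the frequency of a target word. However, this solution only works for standard characters. If I ask about a word containing Polish letters (e.g. "ę", "ą"), it returns no match. The output log shows that the script does not transmit the Polish characters correctly in the URL.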

My script:

library(RCurl)

# slowo = word
wordCorpusChecker <- function(slowo, korpus = 2) {
  # this handle helps me bypass the redirection page after querying a specific word
  curl <- getCurlHandle(cookiefile = "", verbose = TRUE,
                        followlocation = TRUE, encoding = "utf-8")
  # standard call for submitting an HTML form
  getForm("http://korpus.pl/poliqarp/poliqarp.php",
          query = slowo, corpus = as.character(korpus), showMatch = "1",
          showContext = "3", leftContext = "5", rightContext = "5",
          wideContext = "50", hitsPerPage = "10",
          .opts = curlOptions(verbose = TRUE,
                              followlocation = TRUE,
                              encoding = "utf-8"),
          curl = curl)
  # test2 holds the HTML of the page with the information I'm interested in
  test1 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  test2 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  # scrape the frequency from the HTML: the text between "Found <em>" and "</em> results"
  start_match <- regexpr("Found <em>", test2)
  a <- start_match[1] + attr(start_match, "match.length")
  b <- regexpr("</em> results<br />\n", test2)[1] - 1
  value <- substring(test2, a, b)
  return(value)
}

# if you try this, you get a nice result for the frequency of "pies" (dog) in the Polish corpus
wordCorpusChecker("pies")

# if you try this, you get no match because of the special characters
wordCorpusChecker("kałuża")

# the log from `verbose`:

    GET /poliqarp/poliqarp.php?query=ka%B3u%BFa&corpus=2&showMatch=1&showContext=3&leftContext=5&rightContext=5&wideContext=50&hitsPerPage=10 

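I tried specifying the encoding option, but according to the manual it refers to the result of the query. I also experimented with curlUnescape, but without success. Any advice would be appreciated.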

Answer

One solution is to give the word in UTF-8, for example by writing the Polish letters as Unicode escapes:

> "ka\u0142u\u017Ca" 
[1] "kałuża" 
wordCorpusChecker("ka\u0142u\u017Ca") 

[1] "55" 
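
Why the escaped form works can be checked without touching the network. The sketch below percent-encodes a string byte by byte with a hypothetical helper, `percent_encode_bytes` (it is not part of RCurl and only approximates what happens when the form is submitted). It assumes the failing call received the word as ISO-8859-2 bytes, which is what `%B3` and `%BF` in the log suggest.

```r
# Hypothetical helper (not part of RCurl): percent-encode a string byte by byte,
# leaving only unreserved ASCII characters unchanged.
percent_encode_bytes <- function(x) {
  bytes <- charToRaw(x)
  unreserved <- utf8ToInt(paste0(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "abcdefghijklmnopqrstuvwxyz",
    "0123456789-._~"
  ))
  out <- paste0("%", toupper(as.character(bytes)))
  keep <- as.integer(bytes) %in% unreserved
  out[keep] <- rawToChar(bytes[keep], multiple = TRUE)
  paste(out, collapse = "")
}

# A literal written with \u escapes is stored and marked as UTF-8.
word_utf8 <- "ka\u0142u\u017Ca"
Encoding(word_utf8)                # "UTF-8"
percent_encode_bytes(word_utf8)    # "ka%C5%82u%C5%BCa"

# The same word as ISO-8859-2 bytes, built explicitly so that the
# result does not depend on the current locale (assumed failing case).
word_latin2 <- rawToChar(as.raw(c(0x6b, 0x61, 0xb3, 0x75, 0xbf, 0x61)))
percent_encode_bytes(word_latin2)  # "ka%B3u%BFa", the form seen in the verbose log
```

The two-byte sequences `%C5%82` and `%C5%BC` are the UTF-8 encodings of "ł" and "ż". The single bytes `%B3` and `%BF` are their ISO-8859-2 codes, which the server apparently does not interpret as Polish letters.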

Thanks to your answer I managed to use 'iconv' to convert the special characters to Unicode. Thanks a lot! – KkK
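
The comment does not show the actual call. A minimal sketch of what such a conversion might look like is given below, assuming the input arrives as ISO-8859-2. The wrapper name `to_utf8` is hypothetical, and the source encoding has to be adjusted to the real one (for example "CP1250" on a Polish Windows setup).

```r
# Hypothetical wrapper (not from the original post): make sure a word is
# UTF-8 before it is passed on as a form parameter.
to_utf8 <- function(x, from = "") {
  # from = "" means the encoding of the current locale; pass e.g.
  # "ISO-8859-2" or "CP1250" when the source encoding is known.
  iconv(x, from = from, to = "UTF-8")
}

# Example input: "kałuża" as ISO-8859-2 bytes (assumed source encoding)
word_latin2 <- rawToChar(as.raw(c(0x6b, 0x61, 0xb3, 0x75, 0xbf, 0x61)))
word_utf8 <- to_utf8(word_latin2, from = "ISO-8859-2")

# The converted word has the same bytes as the \u-escaped literal.
identical(charToRaw(word_utf8), charToRaw("ka\u0142u\u017Ca"))
```

The converted string could then be passed on as `wordCorpusChecker(to_utf8(word, from = "ISO-8859-2"))`. When the word is already in the session's native encoding, base R's `enc2utf8()` is a shorter alternative.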