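getForm - how to send special characters?
I have a small script written with `RCurl` that connects to a Polish language corpus and asks for the frequency of a target word. However, this solution only works with standard characters. If I ask about a word containing Polish letters (e.g. "ę", "ą"), it returns no match. The output log suggests that the script is not transmitting the Polish characters correctly in the URL.
My script: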
library(RCurl)

# slowo = word
wordCorpusChecker <- function(slowo, korpus = 2) {
  # this handle helps me bypass the redirection page after querying a specific word
  curl <- getCurlHandle(cookiefile = "", verbose = TRUE,
                        followlocation = TRUE, encoding = "utf-8")
  # standard call for submitting an HTML form
  getForm("http://korpus.pl/poliqarp/poliqarp.php",
          query = slowo, corpus = as.character(korpus), showMatch = "1",
          showContext = "3", leftContext = "5", rightContext = "5",
          wideContext = "50", hitsPerPage = "10",
          .opts = curlOptions(
            verbose = TRUE,
            followlocation = TRUE,
            encoding = "utf-8"
          ),
          curl = curl)
  # test2 holds the HTML of the page that contains the information I'm interested in
  test1 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  test2 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  # scraping the frequency from the HTML page
  m <- regexpr("Found <em>", test2)
  a <- m[1] + attr(m, "match.length")   # first character after "Found <em>"
  b <- regexpr("</em> results<br />\n", test2)[1] - 1
  value <- substring(test2, a, b)
  return(value)
}
# if you try this you will get a nice result for the frequency of "pies" (dog) in the Polish corpus
wordCorpusChecker("pies")
# if you try this you will get no match because of the special characters
wordCorpusChecker("kałuża")
#the log from `verbose`:
GET /poliqarp/poliqarp.php?query=ka%B3u%BFa&corpus=2&showMatch=1&showContext=3&leftContext=5&rightContext=5&wideContext=50&hitsPerPage=10
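I tried specifying the `encoding` option, but according to the manual it refers to the result of the query. I am experimenting with `curlUnescape`, but without positive results. Any advice would be appreciated.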
Thanks to your answer I managed to use 'iconv' to convert the special characters to Unicode. Thank you very much! – KkK
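The `iconv` approach mentioned in the comment can be illustrated with a small base-R sketch. It assumes the server expects UTF-8 and that the session was using a single-byte Polish encoding such as ISO-8859-2, which would match the `%B3` and `%BF` bytes in the log. The helper `percent_encode_bytes` is hypothetical, written only to make the bytes visible; it is not part of RCurl.

```r
# Hypothetical helper: percent-encode a string byte by byte,
# leaving only unreserved ASCII characters untouched.
percent_encode_bytes <- function(x) {
  bytes <- charToRaw(x)
  safe <- charToRaw(paste0("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                           "abcdefghijklmnopqrstuvwxyz",
                           "0123456789._~-"))
  out <- paste0("%", toupper(as.character(bytes)))
  is_safe <- bytes %in% safe
  out[is_safe] <- rawToChar(bytes[is_safe], multiple = TRUE)
  paste(out, collapse = "")
}

# "kałuża", written with escapes so the example does not depend on
# the encoding of the script file itself
word <- "ka\u0142u\u017ca"

# The same word as bytes in ISO-8859-2 (assumed session encoding)
# and in UTF-8 (assumed to be what the server expects)
latin2 <- iconv(word, from = "UTF-8", to = "ISO-8859-2")
utf8 <- enc2utf8(word)

percent_encode_bytes(latin2)  # "ka%B3u%BFa", as in the verbose log
percent_encode_bytes(utf8)    # "ka%C5%82u%C5%BCa"
```

Under those assumptions, converting the word before it reaches `getForm`, for example `query = iconv(slowo, from = "", to = "UTF-8")` (where `from = ""` means the current locale's encoding), should put the UTF-8 form into the URL. Whether `iconv` knows `ISO-8859-2` by that name depends on the platform.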