
Twitter sentiment analysis using SentiWS with a German word list

I want to do sentiment analysis on German tweets. The code I use works fine with English, but when I load the German word lists, all the scores come out as zero. My guess is that this has to do with the different structure of the word lists, so what I need to know is how to adapt my code to the structure of the German lists. Could someone take a look at the two lists?

English Wordlist
German Wordlist

# load the wordlists
pos.words = scan("~/positive-words.txt", what='character', comment.char=';')
neg.words = scan("~/negative-words.txt", what='character', comment.char=';')

# bring in the sentiment analysis algorithm
# we got a vector of sentences. plyr will handle a list or a vector as an "l"
# we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, pos.words, neg.words)
  {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  },
  pos.words, neg.words, .progress=.progress)
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

# and to see if it works, there should be a score... either in German or in English
sample = c("ich liebe dich. du bist wunderbar", "I hate you. Die!"); sample
test.sample = score.sentiment(sample, pos.words, neg.words); test.sample

Answer 1

This might work for you:

readAndflattenSentiWS <- function(filename) {
  # read the SentiWS file, which is UTF-8 encoded
  words = readLines(filename, encoding = "UTF-8")
  # replace the POS tag and weight (e.g. "|NN\t-0.058\t") with a comma,
  # so each line becomes "baseform,inflected1,inflected2,..."
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  # split on the commas and flatten into one plain character vector
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
pos.words <- c(scan("positive-words.txt", what='character', comment.char=';', quiet=TRUE),
               readAndflattenSentiWS("SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("negative-words.txt", what='character', comment.char=';', quiet=TRUE),
               readAndflattenSentiWS("SentiWS_v1.8c_Negative.txt"))

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') { 
    # ... see OP ... 
} 

sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!",
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample,
                                pos.words,
                                neg.words))
#   score                              text
# 1     2 ich liebe dich. du bist wunderbar
# 2    -2      Ich hasse dich, geh sterben!
# 3     2    i love you. you are wonderful.
# 4    -2                  i hate you, die.
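
To make the substitution in readAndflattenSentiWS concrete, here is what it does to a single SentiWS entry, run by hand (the input is the Abbau line quoted in the other answer):

line <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"
sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", line)
# [1] "Abbau,Abbaus,Abbaues,Abbauen,Abbaue"
tolower(unlist(strsplit(sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", line), ",")))
# [1] "abbau"   "abbaus"  "abbaues" "abbauen" "abbaue"
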
Answer 2

The German lists (the files named SentiWS_v1.8c_Negative.txt and SentiWS_v1.8c_Positive.txt) cannot be loaded the way you are loading them now; that only works for the English version:

pos.words = scan("~/positive-words.txt",what='character', comment.char=';') 
neg.words = scan("~/negative-words.txt",what='character', comment.char=';') 
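
As a quick sketch of what goes wrong (using one line of the German list as inline input): scan() splits on whitespace, so a SentiWS entry turns into tokens like "Abbau|NN" that can never match a cleaned, lower-cased tweet word, which is why every score comes out as zero:

scan(text = "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue",
     what = 'character', quiet = TRUE)
# [1] "Abbau|NN"                      "-0.058"
# [3] "Abbaus,Abbaues,Abbauen,Abbaue"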

Besides that, the lists are in different formats.
The German version looks like this:

Abbau|NN -0.058 Abbaus,Abbaues,Abbauen,Abbaue 
Abbruch|NN -0.0048 Abbruches,Abbrüche,Abbruchs,Abbrüchen 
Abdankung|NN -0.0048 Abdankungen 
Abdämpfung|NN -0.0048 Abdämpfungen 
Abfall|NN -0.0048 Abfalles,Abfälle,Abfalls,Abfällen 
Abfuhr|NN -0.3367 Abfuhren 

The English version:

charisma
charitable
charm
charming
charmingly
chaste
cheap
cheapest

The German entries follow this pattern: word|NN\tnumber\t<inflected forms, comma-separated>\n
The English entries follow this pattern: word\n
Also, the header of each file is different, so you may want to skip it (in the English lists it looks like a short article; there are no tweets or tweet words in it).

Solution: get the two files into the same format and then do whatever you want, or prepare your code to read both kinds of data.
Since your program already works on the English version, I suggest you change the format of the German one: turn every whitespace character or comma into \n, then remove all the POS tags (|NN) and the numbers. A sketch of that reformatting follows.
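
A minimal sketch of that reformatting, assuming the SentiWS files are tab-separated as shown above (the output filename positive-words-de.txt is made up for illustration):

lines <- readLines("SentiWS_v1.8c_Positive.txt", encoding = "UTF-8")
lines <- sub("\\|[A-Z]+", "", lines)           # drop the POS tag, e.g. "|NN"
lines <- sub("\t-?[0-9.]+\t?", "\t", lines)    # drop the sentiment weight
words <- unlist(strsplit(lines, "[\t,]"))      # split base form and inflections
words <- tolower(words[words != ""])           # drop empties; match tolower() in the scorer
writeLines(words, "positive-words-de.txt")     # one word per line, like the English list

The resulting file can then be loaded with the same scan() call you already use for the English lists.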