Recreating song lyrics (words) from term frequency counts (numbers)

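I am trying to "recreate" song lyrics from term frequency counts. I have two source data files. The first simply lists the 5000 most common terms in the lyrics corpus I am working with, ordered from most used (1) to least used (5000). The second file is the lyrics corpus itself, which consists of more than 200,000 songs.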

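Each "song" is a comma-separated string that looks like this: "SONGID1,SONGID2,1:13,2:10,4:6,7:15,...", where the first two entries are ID tags for the song, followed by pairs of a term ID (the number to the left of the colon) and the number of times that term is used in the song (the number to the right of the colon). In the example above, this means that in the given song "I" (entry "1", the first of the 5000 most common terms) occurs 13 times, "the" (the second most common term) occurs 10 times, and so on.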
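To make the format concrete, here is a minimal Java sketch that splits one song line into its termID:termCount pairs. The class name `SongLineParser` and the method `parseCounts` are made up for illustration, and the sketch assumes every entry after the two IDs is a well-formed `number:number` pair.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SongLineParser {
    // Parses the "termID:termCount" entries of one song line into an
    // insertion-ordered map. The first two comma-separated fields are
    // the song IDs and are skipped here.
    static Map<Integer, Integer> parseCounts(String line) {
        String[] fields = line.split(",");
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int i = 2; i < fields.length; i++) {
            String[] pair = fields[i].trim().split(":");
            int termId = Integer.parseInt(pair[0].trim());
            int termCount = Integer.parseInt(pair[1].trim());
            counts.put(termId, termCount);
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "SONGID1,SONGID2,1:13,2:10,4:6,7:15";
        System.out.println(parseCounts(line)); // prints {1=13, 2=10, 4=6, 7=15}
    }
}
```

Each key is a 1-based position in the list of 5000 terms, so term 1 maps to the first word in that file.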

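What I would like to do is go from this "termID:termCount" format to actually "recreating" the original (albeit scrambled) lyrics: replace the number to the left of the colon with the actual term, then list that term as many times as the count to the right of the colon says. Again using the short example above, my preferred output would be: "SONGID1,SONGID2,I I I I I I I I I I I I I the the the the the the the the the the ..." and so on. Thanks!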

2013-12-09 user3084485

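Maybe the following (untested) will inspire you. You didn't say how you want the output, so you may want to change the print() calls to file writes or something similar.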

// Assumes that each word is on its own line, sorted from most to least common.
String[] words = loadStrings("words.txt");

// Two approaches for reading the songs:
//   loadStrings() again, but that uses a lot of memory for big files;
//   a buffered reader, which is more complicated but works well for large files.
BufferedReader reader = createReader("songs.txt");
try {
  String line = reader.readLine();
  while (line != null) {
    String[] data = line.split(",");
    print(data[0] + "," + data[1] + ",");  // the two song IDs
    for (int i = 2; i < data.length; i++) {
      String[] pair = data[i].split(":");
      int termId = int(trim(pair[0]));
      int count = int(trim(pair[1]));
      // Inelegant, but clear. Term IDs start at 1 while the array
      // indexes from 0, so subtract 1 when looking up the word.
      for (int j = 0; j < count; j++) {
        print(words[termId - 1] + " ");
      }
    }
    println();
    line = reader.readLine();
  }
  reader.close();
} catch (IOException e) {
  // readLine() and close() can throw, so Processing requires this handler.
  e.printStackTrace();
}
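If you would rather write the result to a file than print it, the same idea can be sketched in plain Java. This is only an illustration under assumptions: the class `LyricsExpander`, its methods, and the file names `words.txt`, `songs.txt` and `lyrics.txt` are placeholders, the word list is assumed to hold one term per line with term 1 on the first line, and every song line is assumed to be well formed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class LyricsExpander {
    // Expands one "ID1,ID2,termID:count,..." line into "ID1,ID2,word word ...".
    // words.get(0) is assumed to hold term 1, hence the "- 1" below.
    static String expandLine(String line, List<String> words) {
        String[] fields = line.split(",");
        StringBuilder out = new StringBuilder();
        out.append(fields[0].trim()).append(',').append(fields[1].trim()).append(',');
        for (int i = 2; i < fields.length; i++) {
            String[] pair = fields[i].trim().split(":");
            int termId = Integer.parseInt(pair[0].trim());
            int count = Integer.parseInt(pair[1].trim());
            for (int j = 0; j < count; j++) {
                out.append(words.get(termId - 1)).append(' ');
            }
        }
        return out.toString().trim();
    }

    // Streams the songs file line by line so the whole corpus never sits in memory.
    static void expandFile(Path wordsFile, Path songsFile, Path outFile) throws IOException {
        List<String> words = Files.readAllLines(wordsFile, StandardCharsets.UTF_8);
        try (BufferedReader in = Files.newBufferedReader(songsFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue; // skip blank lines
                out.write(expandLine(line, words));
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder file names; adjust to your own data.
        expandFile(Paths.get("words.txt"), Paths.get("songs.txt"), Paths.get("lyrics.txt"));
    }
}
```

Keeping `expandLine` separate from the file handling makes it easy to check on a single short line before running it over all 200,000+ songs.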

2013-12-10 00:33:46 kevinsa5
