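Determining whether a String is a proper noun in a text
I'm trying to parse a text (http://pastebin.com/raw.php?i=0wD91r2i) and retrieve each word along with the number of times it occurs. However, proper nouns must not appear in the final output, and I'm not sure how to accomplish that.
My attempt at this: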
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Scanner;

public class TextAnalysis
{
    public static void main(String[] args)
    {
        ArrayList<Word> words = new ArrayList<Word>(); // list of Word objects (the Word class is not shown)
        try
        {
            int lineCount = 0;
            int wordCount = 0;
            int specialWord = 0;
            URL reader = new URL("http://pastebin.com/raw.php?i=0wD91r2i");
            Scanner in = new Scanner(reader.openStream());
            while (in.hasNextLine()) // parse the text line by line
            {
                lineCount++;
                // strip everything except letters and spaces, then split on whitespace
                String[] textInfo = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+");
                wordCount += textInfo.length;
                for (int i = 0; i < textInfo.length; i++)
                {
                    // connecting words are counted separately and otherwise skipped
                    if (textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+"))
                    {
                        specialWord++;
                        continue;
                    }
                    if (!textInfo[i].matches(".*\\w.*")) continue; // skip tokens with no word characters (e.g. empty strings)
                    boolean found = false;
                    for (Word word : words) // if the word is already in the list, update its count
                    {
                        if (word.getWord().equals(textInfo[i]))
                        {
                            word.addOccurence(1);
                            word.addLine(lineCount);
                            found = true;
                        }
                    }
                    if (!found) // otherwise add a new entry
                    {
                        words.add(new Word(textInfo[i], lineCount, 1));
                    }
                }
            }
            // ATTEMPT AT PROPER NOUNS HERE: add data from a capitalized word to its lowercase form
            for (Word word : words)
            {
                for (int i = 0; i < words.size(); i++)
                {
                    if (Character.isUpperCase(word.getWord().charAt(0)) && word.getWord().toLowerCase().equals(words.get(i).getWord()))
                    {
                        words.get(i).addOccurence(word.getOccurence());
                        words.get(i).addLine(word.getLine());
                    }
                }
            }
            Comparator<Word> occurenceComparator = new Comparator<Word>() // compares words by number of occurrences, descending
            {
                public int compare(Word n1, Word n2)
                {
                    if (n1.getOccurence() < n2.getOccurence()) return 1;
                    else if (n1.getOccurence() == n2.getOccurence()) return 0;
                    else return -1;
                }
            };
            Collections.sort(words);
            // Collections.sort(words, occurenceComparator);
            // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100));
            // Collections.sort(top_words);
            System.out.printf("%-15s%-15s%s\n", "Word", "Occurences", "Word Distribution Index");
            for (Word word : words)
            {
                word.setTotalLine(lineCount);
                System.out.println(word);
            }
            System.out.println(wordCount);
            System.out.printf("%s%.3f\n", "The connecting word index is ", specialWord * 100.0 / wordCount);
        }
        catch (IOException ex)
        {
            System.out.println("WEB URL NOT FOUND");
        }
    }
}
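(The formatting came out a bit off; I'm not sure how to do it properly.)
The code checks whether a word is capitalized and, if a lowercase version of the same word exists, adds the capitalized word's data to the lowercase entry. However, this misses ordinary words that never appear in lowercase anywhere in the text, such as "Four" or "Now". How can I solve this without cross-referencing a dictionary?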
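To make the limitation concrete, here is a minimal sketch of the same merge step on a plain map of counts instead of the `Word` list. The class name `CaseMergeSketch`, the method `mergeCapitalized`, and the sample counts are made up for illustration and are not part of the code above.

```java
import java.util.HashMap;
import java.util.Map;

public class CaseMergeSketch {
    // Folds the count of each capitalized word into its lowercase form,
    // but only when that lowercase form also occurs in the input.
    static Map<String, Integer> mergeCapitalized(Map<String, Integer> counts) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String word = entry.getKey();
            if (word.isEmpty()) continue; // nothing to classify
            String lower = word.toLowerCase();
            boolean capitalized = Character.isUpperCase(word.charAt(0));
            // Capitalized words without a lowercase twin keep their own key.
            String key = (capitalized && counts.containsKey(lower)) ? lower : word;
            merged.merge(key, entry.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>(); // hypothetical sample data
        counts.put("House", 2);
        counts.put("house", 3);
        counts.put("Now", 4);   // never seen in lowercase
        counts.put("Alice", 1); // an actual proper noun
        System.out.println(mergeCapitalized(counts));
    }
}
```

With this sample, "House" is folded into "house", but "Now" and "Alice" both survive as capitalized keys, so the merge alone cannot tell them apart.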
Edit: I have solved the problem MYSELF.
Thanks to Wes for trying to answer, though.
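There is no way to do this other than using some kind of dictionary. – 2014-12-05 23:39:16
I don't think it's possible to determine whether a word is a proper noun with logic alone. – khelwood 2014-12-05 23:40:04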
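Well, I don't think I have to cover every case, but I believe a word that follows sentence-ending punctuation (. ! ?) should be treated as an ordinary, non-proper noun, even though that may produce some false positives. I only need a solution that works for this particular text file. – 2014-12-05 23:50:14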
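The punctuation idea in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `SentenceStartSketch` and the method `countCommonWords` are hypothetical names, the sample sentence is made up, and the sketch reads the raw text directly, because the sentence punctuation has to be inspected before it is stripped.

```java
import java.util.Map;
import java.util.TreeMap;

public class SentenceStartSketch {
    // Counts words case-insensitively, skipping capitalized words that do NOT
    // start a sentence (those are assumed to be proper nouns).
    static Map<String, Integer> countCommonWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean sentenceStart = true; // the first token starts a sentence
        for (String token : text.trim().split("\\s+")) {
            String word = token.replaceAll("[^a-zA-Z]", ""); // letters only
            if (!word.isEmpty()) {
                boolean capitalized = Character.isUpperCase(word.charAt(0));
                if (!capitalized || sentenceStart) {
                    counts.merge(word.toLowerCase(), 1, Integer::sum);
                }
                // Capitalized mid-sentence: treated as a proper noun, not counted.
            }
            // The next token starts a sentence if this one ends in . ! or ?,
            // optionally followed by closing quotes or brackets.
            if (token.matches(".*[.!?][\"')\\]]*")) {
                sentenceStart = true;
            } else if (!word.isEmpty()) {
                sentenceStart = false;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "Now the dog saw Alice. Four dogs ran! Then Bob and the dog left.";
        System.out.println(countCommonWords(sample));
    }
}
```

As the comment anticipates, this is a heuristic with known misses: a proper noun that opens a sentence is counted as an ordinary word, abbreviations such as "Dr." are mistaken for sentence ends, and the pronoun "I" in mid-sentence is dropped as if it were a name. For a single known text those cases may be acceptable.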