2014-12-05 45 views
0

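Determining whether a string is a proper noun in a text

I'm trying to parse a text (http://pastebin.com/raw.php?i=0wD91r2i) and retrieve each word along with its number of occurrences. However, I must not include proper nouns in the final output, and I'm not sure how to accomplish that.

My attempt at this: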

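import java.io.IOException; 
import java.net.URL; 
import java.util.ArrayList; 
import java.util.Collections; 
import java.util.Comparator; 
import java.util.Scanner; 

public class TextAnalysis 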
{ 
    public static void main(String[] args) 
    { 
     ArrayList<Word> words = new ArrayList<Word>(); //instantiate array list of object Word 
     try 
     { 
      int lineCount = 0; 
      int wordCount = 0; 
      int specialWord = 0; 
      URL reader = new URL("http://pastebin.com/raw.php?i=0wD91r2i"); 
      Scanner in = new Scanner(reader.openStream()); 
      while(in.hasNextLine()) //while to parse text 
      { 
       lineCount++; 
       String textInfo[] = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+"); //use regex to replace all punctuation with empty char and split words with white space chars in between 
       wordCount += textInfo.length; 
       for(int i=0; i<textInfo.length; i++) 
       { 
        if(textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+")) //if word matches any special word case, add count of special words then continue to next word 
        { 
         specialWord++; 
         continue; 
        } 
        if(!textInfo[i].matches(".*\\w.*")) continue; //also if text matches white space then continue 
        boolean found = false; 
        for(Word word: words) //check whether word already exists in list -- if so add count 
        { 
         if(word.getWord().equals(textInfo[i])) 
         { 
          word.addOccurence(1); 
          word.addLine(lineCount); 
          found = true; 
         } 
        } 
        if(!found) //else add new entry 
        { 
         words.add(new Word(textInfo[i], lineCount, 1)); 
        } 
       } 
      } 
      //adds data from capital word to lowercase word ATTEMPT AT PROPER NOUNS HERE 
      for(Word word: words) 
      { 
       for(int i=0; i<words.size(); i++) 
       { 
        if(Character.isUpperCase(word.getWord().charAt(0)) && word.getWord().toLowerCase().equals(words.get(i).getWord())) 
        { 
         words.get(i).addOccurence(word.getOccurence()); 
         words.get(i).addLine(word.getLine()); 
        } 
       } 
      } 

      Comparator<Word> occurenceComparator = new Comparator<Word>() //compares words by number of occurences, highest first 
      { 
       public int compare(Word n1, Word n2) 
       { 
        if(n1.getOccurence() < n2.getOccurence()) return 1; 
        else if (n1.getOccurence() == n2.getOccurence()) return 0; 
        else return -1; 
       } 
      }; 
      Collections.sort(words); 
      // Collections.sort(words, occurenceComparator); 
      // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100)); 
      // Collections.sort(top_words); 
      System.out.printf("%-15s%-15s%s\n", "Word", "Occurences", "Word Distribution Index"); 
      for(Word word: words) 
      { 
       word.setTotalLine(lineCount); 
       System.out.println(word); 
      } 
      System.out.println(wordCount); 
      System.out.printf("%s%.3f\n","The connecting word index is ",specialWord*100.0/wordCount); 
     } 
     catch(IOException ex) 
     { 
      System.out.println("WEB URL NOT FOUND"); 
     } 
    } 
} 

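The formatting came out a bit off; I'm not sure how to do it properly.

The code checks whether a word is capitalized and, if a lowercase version of that word exists in the list, adds its data to the lowercase entry. However, this does not handle words that never appear in lowercase in the text, such as "Four" or "Now". How can I solve this without cross-referencing a dictionary?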

EDIT: I have solved the problem MYSELF.

But thanks, Wes, for trying to answer.

+0

There is no way to do this other than using some kind of dictionary. – 2014-12-05 23:39:16

+0

I don't think it's possible to tell whether a word is a proper noun by logic alone. – khelwood 2014-12-05 23:40:04

+0

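Well, I don't think I have to cover every case, but I believe a word that follows sentence-ending punctuation (. ! ?) should generally be treated as a non-proper noun, although there may be some false positives. I only need a solution that works for this particular text file – 2014-12-05 23:50:14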
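
The heuristic described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the asker's eventual solution (which was never posted): the class name `SentenceStartHeuristic` and the method `properNounCandidates` are hypothetical, sentence boundaries are guessed from `.`, `!` and `?` alone, and a capitalized word counts as a candidate only if it occurs at least once away from the start of a sentence.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class SentenceStartHeuristic {
    /**
     * Returns the capitalized words that occur at least once in a position
     * that is NOT the first word of a sentence (hypothetical helper).
     */
    static Set<String> properNounCandidates(String text) {
        Set<String> candidates = new LinkedHashSet<>();
        boolean sentenceStart = true; // the first word of the text starts a sentence
        for (String token : text.split("\\s+")) {
            // Strip everything except letters to get the bare word.
            String word = token.replaceAll("[^a-zA-Z]", "");
            if (!word.isEmpty()) {
                if (!sentenceStart && Character.isUpperCase(word.charAt(0))) {
                    candidates.add(word);
                }
                sentenceStart = false;
            }
            // A token ending in . ! or ? (optionally followed by a closing
            // quote or bracket) is assumed to end the sentence.
            if (token.matches(".*[.!?][\"')\\]]*")) {
                sentenceStart = true;
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        String sample = "Now the four men met Alice. Four of them left. Then Alice stayed in London!";
        System.out.println(properNounCandidates(sample)); // prints [Alice, London]
    }
}
```

"Now", "Four" and "Then" are skipped because they only occur at the start of a sentence. As the comment anticipates, the approach has false results in both directions: the pronoun "I" is reported, abbreviations such as "Mr." are mistaken for sentence ends, and a proper noun that only ever opens a sentence is missed.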

Answers

1

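It looks like your algorithm assumes that any word that appears capitalized but never appears uncapitalized is a proper noun. If that is the case, you can use the following algorithm to get the proper nouns.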

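import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

// Assume you have tokenized your whole file into a Collection<String> called allWords.
Set<String> lowercaseWords = new HashSet<>();
HashMap<String, String> lowerToCap = new HashMap<>();
for (String word : allWords) {
    if (word.isEmpty()) continue; // charAt(0) would throw on an empty token
    if (Character.isUpperCase(word.charAt(0))) {
        lowerToCap.put(word.toLowerCase(), word); // remember the capitalized spelling
    } else {
        lowercaseWords.add(word.toLowerCase());
    }
}

// Drop every capitalized word that also occurs in lowercase; only the words
// that never appear uncapitalized (the proper-noun candidates) are left.
Set<String> properNounKeys = new HashSet<>(lowerToCap.keySet());
properNounKeys.removeAll(lowercaseWords);
for (String properNounLower : properNounKeys) {
    System.out.println("Proper Noun: " + lowerToCap.get(properNounLower));
}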
+1

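You could also use a TreeMap constructed with String.CASE_INSENSITIVE_ORDER as its comparator argument, which would eliminate the need for the lowercase-to-capitalized map – 2014-12-05 23:58:24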
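
One way to read the TreeMap suggestion is sketched below. It is an interpretation, not the commenter's own code: the class `CaseInsensitiveTreeMapSketch`, the method `capitalizedOnly` and the choice of a `Boolean` value (meaning "a lowercase-initial spelling has been seen") are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CaseInsensitiveTreeMapSketch {
    /** Returns the words that never occur with a lowercase first letter (hypothetical helper). */
    static List<String> capitalizedOnly(List<String> allWords) {
        // Keys compare case-insensitively, so "Four" and "four" share one entry.
        // The value records whether a lowercase-initial spelling has been seen.
        Map<String, Boolean> seenLowercase = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String word : allWords) {
            if (word.isEmpty()) continue; // charAt(0) would throw on an empty token
            boolean startsLower = !Character.isUpperCase(word.charAt(0));
            seenLowercase.merge(word, startsLower, Boolean::logicalOr);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : seenLowercase.entrySet()) {
            if (!entry.getValue()) {
                result.add(entry.getKey()); // the key keeps the first spelling that was inserted
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Now", "the", "four", "men", "met", "Alice", "Four", "of", "them", "left", "Alice");
        System.out.println(capitalizedOnly(words)); // prints [Alice, Now]
    }
}
```

A single map replaces both the set and the lowercase-to-capitalized map, and the results come out in case-insensitive alphabetical order. Note that "Now" is still reported, because it never occurs in lowercase in the sample; that is the limitation raised in the question.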

+0

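I haven't learned HashSet or HashMap in class yet, so I'm not sure whether I can make use of them. Also, I'm wondering whether this would account for words that only appear capitalized, such as "Four" in the text file. – 2014-12-06 00:06:58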

+0

It should print only the words that appear exclusively in capitalized form, so "Four" should show up. – 2014-12-06 00:12:20
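
For the case raised in the comments where HashSet and HashMap are not yet available, the same idea can be sketched with nothing but lists. This is a hedged illustration rather than code from the thread; `ListOnlySketch` and `capitalizedOnly` are hypothetical names, and `allWords` is assumed to hold the already tokenized text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListOnlySketch {
    /** Returns each capitalized word whose lowercase form never occurs in the list. */
    static List<String> capitalizedOnly(List<String> allWords) {
        List<String> result = new ArrayList<>();
        for (String word : allWords) {
            if (word.isEmpty() || !Character.isUpperCase(word.charAt(0))) {
                continue; // only capitalized words can be candidates
            }
            // contains() is a linear scan, so the whole method is quadratic.
            boolean hasLowercaseForm = allWords.contains(word.toLowerCase());
            if (!hasLowercaseForm && !result.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Now", "the", "four", "men", "met", "Alice", "Four", "of", "them", "left", "Alice");
        System.out.println(capitalizedOnly(words)); // prints [Now, Alice]
    }
}
```

Because every lookup scans the whole list, this is noticeably slower than the hash-based version on a long text, but it gives the same candidates, in order of first appearance.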
