2014-01-22 66 views
1

给定多个彼此略有不同的文本(某些单词丢失/被其他单词替换),是否有一个很好的算法来创建某种“模板”?例如:多文本比较算法

Input: 
- Hello my name is Bob 
- Hi my name is Bob 
- Hi my name is John 

Output: 
- * my name is * 

的算法应该尽可能容忍异常,例如,如果我添加第四个输入:

- Hello i am Bob 

它不应该影响结果太多。

+0

搜索频繁出现的子串? –

回答

0

这是我将如何实现它。这不是花哨的或者使用最好的approuch,但是有诀窍!

我用java来实现,你可以从中提取算法。

public static String matchText(List<String> textsSet) 
{ 
    double maxWeight = 0; 
    String subStringMaxWeight = ""; 

    // create a table to store all substring occurrences 
    Hashtable<String, Integer> subStringsSet = new Hashtable<String, Integer>(); 

    // starts a loop through text's list 
    for (String text : textsSet) 
    { 
     // splits text into words (separeted by spaces) 
     String[] subStrings = text.split(" "); 
     // if text has more than 1 word, then we are considering it as a "phrase" 
     if (subStrings.length > 1) 
     { 
      // starts a loop to generate all possible word combination 
      // "aaa bbb ccc ddd eee" text will generate these combinations: 
      // aaa bbb 
      // aaa bbb ccc 
      // aaa bbb ccc ddd 
      // aaa bbb ccc ddd eee 
      //  bbb ccc 
      //  bbb ccc ddd 
      //  bbb ccc ddd eee 
      //   ccc ddd 
      //   ccc ddd eee 
      //    ddd eee 
      for (int posIni = 0; posIni < subStrings.length-1; posIni++) 
      { 
       for (int posEnd = posIni + 1; posEnd < subStrings.length; posEnd++) 
       { 
        StringBuilder subString = new StringBuilder(); 

        // if posIni is not the first word, adds * at the begining 
        if (posIni > 0) 
         subString.append("*"); 

        // generate the substring from posIni to posEnd 
        for (int pos = posIni; pos <= posEnd; pos++) 
        { 
         if (subString.length() > 0) 
          subString.append(" "); 
         subString.append(subStrings[pos]); 
        } 

        // if posEnd is not the last word, adds * at the end 
        if (posEnd < subStrings.length - 1) 
         subString.append(" *"); 

        // check if substring already exists in the Hashtable      
        Integer counter = subStringsSet.get(subString.toString()); 
        if (counter == null) 
        { 
         // substring does not exist, so stores it with the counter = 1 
         counter = new Integer(1); 
         subStringsSet.put(subString.toString(), counter); 
        } 
        else 
        { 
         // substring exists, so increments it's counter 
         subStringsSet.remove(subString.toString()); 
         counter = new Integer(counter.intValue() + 1); 
         subStringsSet.put(subString.toString(), counter); 
        } 

        // calculate the weight of the substring based on the number of occurrences plus weight based on the number of words. 
        // counter has heavier weight, but the number of words also has some weight (defined here as 0.05 - feel free to change it) 
        double weight = counter.intValue() + ((posEnd - posIni + 1) * 0.05); 
        if (weight > maxWeight) 
        { 
         maxWeight = weight; 
         subStringMaxWeight = subString.toString(); 
        } 

       } 

      } 
     } 
    } 
    // returns the substring 
    return subStringMaxWeight; 
+0

对于给定字符串的所有排列,但在我的操作中,这个问题的解决方案不错。 – Diversity