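Java: extract substrings from a sentence

There are word combinations such as "is", "is not" and "does not contain". We have to match these words in a sentence and split the sentence at them.

Input: if name is tom and age is not 45 or name does not contain tom then let me know.

Expected output: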

If name is 
tom and age is not 
45 or name does not contain 
tom then let me know

I tried the code below to split and extract, but "is" also occurs inside "is not", and my code cannot tell the two apart:

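import java.util.ArrayList;
import java.util.List;

public class Main { // class name and the two fields were not in the original snippet
    static List<String> operators = new ArrayList<>();
    static String str = "if name is tom and age is not 45 or name does not contain tom then let me know.";

    public static void loadOperators() {
        operators.add("is");
        operators.add("is not");
        operators.add("does not contain");
    }

    public static void main(String[] args) {
        loadOperators();
        for (String s : operators) {
            // Number of splits per operator; "is" also matches inside "is not"
            System.out.println(str.split(s).length - 1);
        }
    }
}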

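Source

2017-09-19 Rishi

Your input and your output are the same.

Please show the exact input and the expected output. I doubt that splitting a sentence is as trivial as you imagine, certainly not in English.

So you want to add a line break after each of the word combinations you search for? Is that what you want? You could use String.replace() to do that.

Since an operator may occur more than once, split will not solve your use case, because is and is not are different operators for you. Ideally you would: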

Iterate : 
1. Find the index of the 'operator'. 
2. Search for the next space _ or word. 
3. Then update your string as substring from its index to length-1.
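
The steps above can be sketched roughly as follows. This is only one possible reading of the list, not the answerer's code: the class name OperatorScanner and the method split are made up for illustration, operators are matched only when surrounded by single spaces, and when two operators start at the same position the longer one is taken.

```java
import java.util.ArrayList;
import java.util.List;

class OperatorScanner {
    // Operators from the question.
    static final String[] OPERATORS = { "is", "is not", "does not contain" };

    /** Splits the input after each operator occurrence, scanning left to right. */
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        String rest = input;
        while (true) {
            int bestIndex = -1;
            String bestOperator = null;
            for (String operator : OPERATORS) {
                // Pad with spaces so that "is" only matches as a whole word.
                int index = rest.indexOf(" " + operator + " ");
                if (index < 0) {
                    continue;
                }
                // The earliest occurrence wins; on a tie the longer operator wins.
                if (bestIndex < 0 || index < bestIndex
                        || (index == bestIndex && operator.length() > bestOperator.length())) {
                    bestIndex = index;
                    bestOperator = operator;
                }
            }
            if (bestIndex < 0) {
                break; // no operator left in the remaining text
            }
            int end = bestIndex + 1 + bestOperator.length(); // +1 skips the leading space
            parts.add(rest.substring(0, end).trim());
            rest = rest.substring(end); // continue with the text after the operator
        }
        parts.add(rest.trim());
        return parts;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input)) {
            System.out.println(part);
        }
    }
}
```

Because the scan restarts after every match, repeated operators are handled as well. An operator at the very start or end of the input, or one followed by punctuation instead of a space, would be missed by this sketch.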

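Source

2017-09-19 11:23:46 nullpointer

I'm not entirely sure what you are trying to achieve, but let's give it a try.

For your case, a simple "workaround" might work well enough: sort the operators by their length, descending. That way the "largest match" is found first. You can define "largest" as literally the longest string or, preferably, by word count (the number of spaces it contains), so that "is a" takes precedence over "contains".

You need to make sure that no matches overlap, but that can be done by comparing the start and end indices of all matches and discarding overlaps according to some criterion, for example "first match wins".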
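
A minimal sketch of this idea, under a few assumptions that are not part of the answer: the class name LongestOperatorFirst and the method split are hypothetical, "largest" is measured in characters rather than words, operators are matched only between single spaces, and an overlapping match is discarded when an earlier accepted match already covers part of it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LongestOperatorFirst {
    /** Splits the input after each operator; longer operators claim their span first. */
    static List<String> split(String input, String... operators) {
        String[] sorted = operators.clone();
        // Longest operator first, so "is not" is tried before "is".
        Arrays.sort(sorted, Comparator.comparingInt(String::length).reversed());

        List<int[]> matches = new ArrayList<>(); // accepted {start, end} pairs, end exclusive
        for (String operator : sorted) {
            String needle = " " + operator + " "; // whole words only
            for (int at = input.indexOf(needle); at >= 0; at = input.indexOf(needle, at + 1)) {
                int start = at + 1;
                int end = start + operator.length();
                // First match wins: skip anything that overlaps an accepted match.
                boolean overlaps = false;
                for (int[] m : matches) {
                    if (start < m[1] && m[0] < end) {
                        overlaps = true;
                        break;
                    }
                }
                if (!overlaps) {
                    matches.add(new int[] { start, end });
                }
            }
        }
        // Restore sentence order before cutting the input.
        matches.sort(Comparator.comparingInt(m -> m[0]));

        List<String> parts = new ArrayList<>();
        int last = 0;
        for (int[] m : matches) {
            parts.add(input.substring(last, m[1]).trim());
            last = m[1];
        }
        parts.add(input.substring(last).trim());
        return parts;
    }

    public static void main(String[] args) {
        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        for (String part : split(input, "is", "is not", "does not contain")) {
            System.out.println(part);
        }
    }
}
```

Since "is not" is collected before "is", the "is" inside it is rejected by the overlap check, and the order in which the caller passes the operators no longer matters.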

Source

2017-09-19 11:31:21 Felk

This code does what you seem to want (or what I guess you want):

import java.util.ArrayList;
import java.util.List;

public class Main { // class name added so the snippet compiles on its own
    public static void main(String[] args) {
        List<String> operators = new ArrayList<>();
        operators.add("is");
        operators.add("is not");
        operators.add("does not contain");

        String input = "if name is tom and age is not 45 or name does not contain tom then let me know.";
        List<String> output = new ArrayList<>();

        int lastFoundOperatorsEndIndex = 0; // First start at the beginning of input

        for (String operator : operators) {
            // Find the current operator's position, searching only behind the previous match
            int indexOfOperator = input.indexOf(operator, lastFoundOperatorsEndIndex);

            if (indexOfOperator > -1) { // If operator was found
                int thisOperatorsEndIndex = indexOfOperator + operator.length(); // End index, so the operator is included
                output.add(input.substring(lastFoundOperatorsEndIndex, thisOperatorsEndIndex).trim()); // Add segment up to and including the operator (trimmed)
                lastFoundOperatorsEndIndex = thisOperatorsEndIndex; // Update start index for next operator
            }
        }
        output.add(input.substring(lastFoundOperatorsEndIndex).trim()); // Add rest of input as last entry to output

        for (String part : output) { // Output to console
            System.out.println(part);
        }
    }
}

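But it is highly dependent on the order of the sentence and of the operators. If we are talking about user input, the task becomes much more complicated.

A better approach uses regular expressions (RegExp):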

public static void main(String... args) { 
    // Define inputs 
    String input1 = "if name is tom and age is not 45 or name does not contain tom then let me know."; 
    String input2 = "the name is tom and he is 22 years old but the name does not contain jack, but merry is 24 year old."; 

    // Output split strings 
    for (String part : split(input1)) { 
     System.out.println(part.trim()); 
    } 

    System.out.println(); 

    for (String part : split(input2)) { 
     System.out.println(part.trim()); 
    } 
} 

private static String[] split(String input) { 
    // Define list of operators - 'is not' has to precede 'is'!! 
    String[] operators = { "\\sis not\\s", "\\sis\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" }; 

    // Concatenate operators to regExp-String for search 
    StringBuilder searchString = new StringBuilder(); 

    for (String operator : operators) { 
     if (searchString.length() > 0) { 
      searchString.append("|"); 
     } 
     searchString.append(operator); 
    } 

    // Replace all operators by operator+\n and split resulting string at \n-character 
    return input.replaceAll("(" + searchString.toString() + ")", "$1\n").split("\n"); 
}

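Note the order of the operators! 'is' has to come after 'is not', or 'is not' will always be split.

You can prevent this problem by using a negative lookahead for the operator 'is'. "\\sis\\s" then becomes "\\sis(?! not)\\s" (read: "is" that is not followed by " not").

A compact version (JDK 1.8+, because of String.join) looks like this: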

private static String[] split(String input) { 
    String[] operators = { "\\sis(?! not)\\s", "\\sis not\\s", "\\sdoes not contain\\s", "\\sdoes contain\\s" }; 
    return input.replaceAll("(" + String.join("|", operators) + ")", "$1\n").split("\n"); 
}

Source

2017-09-19 11:45:56 zoku

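But it does not work when 'is' is repeated: if name is 'tom' and last name is 'jerry' or age is not '16' or name does not contain 'tom'. Output with your logic (the 'is' in the second line is not handled): if name is / 'tom' and last name is 'jerry' or age is not / '16' or name does not contain / 'tom' – Rishi

That is what I meant by "highly dependent". If you want to parse natural sentences, you need a more sophisticated approach, for example a RegExp that looks for the keywords. – zoku

I added an example that uses a RegExp. It still depends on the order of the operators 'is' and 'is not' (or of any other operator that contains another operator). – zoku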
