2017-10-04 53 views
-1

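I am trying to split a sentence into a set of words. I also want measurements to be taken into account when the data is chunked. In short: a Java regex that splits a sentence into words while keeping each value and its unit of measure together as a single token.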

E.g. (made-up):
document= The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs. 

What is required is a set of words:

the 
root 
cause 
problem 
... 
40 degrees 
30 percent 
1.67 gpm 
1-19666 tablet 
3 hrs 

What I have tried is:

List<String> bagOfWords = new ArrayList<>();  
String [] words = StringUtils.normalizeSpace(document.replaceAll("[^0-9a-zA-Z_.-]", " ")).split(" "); 
for(String word :words){ 
    bagOfWords.add(StringUtils.normalizeSpace(word.replaceAll("\\.(?!\\d)", " ")));   
    }     
System.out.println("NEW 2 :: " + bagOfWords.toString()); 
+0

Are you looking for a regex that solves this particular case, or for something that works on any sentence? –

+0

Any sentence. Basically, it should pull out each value together with its unit. – Betafish

Answers

2

Let's assume that a measurement is a word containing a digit, followed by one more word. Then here is the code:

private static final String DOC = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs";

// ...

Pattern pattern = Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");
Matcher matcher = pattern.matcher(DOC);
List<String> words = new ArrayList<>();
while (matcher.find()) {
    words.add(matcher.group());
}
for (String word : words) {
    System.out.println(word);
}

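Explanation:

  • \\b matches a word boundary.
  • \\S is a non-whitespace character, so a word may contain anything, such as a dot or a comma.
  • (...)? is the optional first part. It captures a word containing a digit, if there is one: some characters (\\S*), then a digit (\\d), then some more characters (\\S*), followed by whitespace (\\s+).
  • The second word is simple: at least one non-whitespace character. Hence the + rather than * after \\S.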
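Because the optional part is a capturing group, it can also be used to tell measurements apart from plain words. The following is a minimal sketch, not part of the original answer: the class name MeasureTokens and the method measurements are made up for illustration, and the pattern is the one from the answer, unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
public class MeasureTokens {
    // Same pattern as in the answer; group 1 is the optional "word with a digit" part.
    private static final Pattern TOKEN =
            Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");

    /** Returns only the tokens in which the optional numeric group took part in the match. */
    public static List<String> measurements(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // group(1) is null when the optional part was skipped, i.e. a plain word.
            if (m.group(1) != null) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
                measurements("it is currently 40 degrees, take 1.67 gpm every 3 hrs."));
    }
}
```

Note that the pattern treats any word containing a digit as a value, so a phrase such as "1990 was" would also be reported as a measurement; it does not know which words are real units.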
1

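Your question is a bit broad, but here is a hack that works for most sentences of this format.

First, create a list of keywords for your units, such as hrs, tablet, gpm, ... Once you have those, you can easily pick the measurements out.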

String document = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.";
// Drop the trailing full stop so that the last unit ("hrs") matches a keyword.
if (document.endsWith(".")) {
    document = document.substring(0, document.length() - 1);
}
System.out.println(document);
String[] splitted = document.split(" ");

// Unit keywords that should be attached to the preceding value.
List<String> keywords = new ArrayList<>();
keywords.add("degrees");
keywords.add("percent");
keywords.add("gpm");
keywords.add("tablet");
keywords.add("hrs");

List<String> words = new ArrayList<>();

for (String s : splitted) {
    if (!s.equals(",")) {
        // s is not a stand-alone comma
        if (keywords.contains(s) && words.size() != 0) {
            // s is a keyword: append it to the last item in the list
            int lastIndex = words.size() - 1;
            words.set(lastIndex, words.get(lastIndex) + " " + s);
        } else {
            words.add(s);
        }
    }
}
for (String s : words) {
    System.out.println(s);
}
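The keyword list can also be folded into a single regular expression, which avoids the manual split and handles punctuation attached to a word (such as "temperature," or a unit followed by a comma). This is only a sketch under stated assumptions: the class UnitTokenizer and the method tokenize are hypothetical names, a value is assumed to be digits optionally joined by "." or "-", and the unit list is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, named for illustration only.
public class UnitTokenizer {
    /** Splits text into words, keeping "<value> <unit>" together for the given units. */
    public static List<String> tokenize(String text, List<String> units) {
        if (units.isEmpty()) {
            throw new IllegalArgumentException("at least one unit keyword is required");
        }
        // Quote each unit so it is matched literally, then join them into an alternation.
        String unitAlternation = units.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // First alternative: a value, whitespace and a known unit.
        // Second alternative: a plain word, optionally joined by '.' or '-'.
        Pattern pattern = Pattern.compile(
                "\\d+(?:[.-]\\d+)*\\s+(?:" + unitAlternation + ")\\b"
                        + "|\\w+(?:[.-]\\w+)*");
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> units = List.of("degrees", "percent", "gpm", "tablet", "hrs");
        String doc = "it is currently 40 degrees, prescribed 1-19666 tablet "
                + "with 1.67 gpm every 3 hrs.";
        for (String token : tokenize(doc, units)) {
            System.out.println(token);
        }
    }
}
```

Like the loop above, this only recognises units that appear in the list; a number followed by an unlisted word is emitted as two separate tokens.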