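Split a string into lines and sentences, but ignore abbreviations

I have some string content that I have to split. First, I need to split the string content into lines.

This is how I do that: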

str.split('\n').forEach((item) => {
    if (item) {
        // TODO: also split each line into sentences

        let data = {
            type: 'item',
            content: [{
                content: item,
                timestamp: Math.floor(Date.now() / 1000)
            }]
        };

        // Save `data` to DB
    }
});

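But now I also need to split each line into sentences. My difficulty is splitting it correctly. I would use `. ` (a dot followed by a space) to split the lines. But there is also an array of abbreviations that should not split the line:

const abbr = ['vs.', 'min.', 'max.']; // Just an example; there are 70 abbreviations in that array

...and there are a few more rules:

Any number followed by a dot, or a single letter followed by a dot, should also be ignored when splitting the string: 1., 2., 30., A., b.
Upper and lower case should be ignored: `Max. Lorem ipsum` should not be split, and neither should `Lorem max. ipsum`.

Example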

const str = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';

The result of that should be four data objects:

{ type: 'item', content: [{ content: 'Just some examples:', timestamp: 123 }] } 
{ type: 'item', content: [{ content: 'This example has min. 2 lines.', timestamp: 123 }] } 
{ type: 'item', content: [{ content: 'Max. 10 lines.', timestamp: 123 }] } 
{ type: 'item', content: [{ content: 'There are some words: 1. Foo and 2. bar.', timestamp: 123 }] }

Source

2016-10-06 user3142695

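You might, just might, be able to do this with a single regex (I couldn't, but that doesn't mean it's impossible), but it would be a beast to write and maintain. I'd suggest using a very permissive regex to scan the string for potential matches, and then evaluating each one in context against a set of rules like the ones you've described. It's still complicated, but it should at least be easier to read and troubleshoot. Also, if you are splitting natural-language text, don't overlook cases like '"Hello, I'm Sue," she said. "Is this a string?" she asked. "It is."' and 'I like units like "string."' – Palpatim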
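The scan-then-evaluate idea from this comment could look roughly like the sketch below. It is only an illustration, not code from the thread: the name `splitSentences`, the `ABBREVIATIONS` set, and the assumption that a sentence boundary is a dot followed by whitespace are all made up for the example, and it does not handle the quoted-speech cases mentioned above.

```javascript
// Hypothetical helper, not from the original thread.
// Assumption: abbreviations are stored lower-case, including the trailing dot.
const ABBREVIATIONS = new Set(['vs.', 'min.', 'max.']);

function splitSentences(line) {
  const sentences = [];
  const candidate = /\.\s+/g; // permissive scan: every dot followed by whitespace
  let start = 0;
  let match;

  while ((match = candidate.exec(line)) !== null) {
    // Text from the current sentence start up to and including this dot
    const head = line.slice(start, match.index + 1);
    // The last whitespace-separated token, e.g. "min.", "2." or "lines."
    const token = head.split(/\s+/).pop().toLowerCase();

    const isException =
      ABBREVIATIONS.has(token) || // known abbreviation, case-insensitive
      /^\d+\.$/.test(token) ||    // numbering such as "1." or "30."
      /^[a-z]\.$/.test(token);    // single letter such as "A." or "b."

    if (!isException) {
      sentences.push(head);
      start = candidate.lastIndex; // next sentence begins after the whitespace
    }
  }

  // Whatever remains after the last accepted boundary is the final sentence
  const rest = line.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

const text = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';
const result = text.split('\n').filter(Boolean).flatMap(splitSentences);
console.log(result);
```

Keeping each rule as a separate test on the candidate token makes it straightforward to add or remove exceptions without touching the scanning regex.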

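You could first detect the abbreviations and numberings in the string and replace the dot in each of them with a dummy string. After splitting the string at the remaining dots, you can restore the original dots. Once you have the sentences, you can split each one at the newline characters, as in your original code.

The updated code allows more than one dot in an abbreviation (as shown with p.o. and s.v.p.).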

var i, j, strRegex, regex, abbrParts;

const DOT = "_DOT_";
const abbr = ["p.o.", "s.v.p.", "vs.", "min.", "max."];

var str = 'Just some examples:\nThis example s.v.p. has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar. And also A. p.o. professional letters.';

console.log("String: " + str);

// Replace the dots of each abbreviation found in the string
for (i = 0; i < abbr.length; i++) {
    // "s.v.p." becomes ["s", "v", "p", ""]; the regex captures each dot separately
    abbrParts = abbr[i].split(".");
    strRegex = "(\\W*" + abbrParts[0] + ")";
    for (j = 1; j < abbrParts.length - 1; j++) {
        strRegex += "(\\.)(" + abbrParts[j] + ")";
    }
    strRegex += "(\\.)(" + abbrParts[abbrParts.length - 1] + "\\W*)";
    regex = new RegExp(strRegex, "gi");
    str = str.replace(regex, function() {
        var groups = arguments;
        var result = groups[1];
        // Even-numbered groups are the captured dots; swap each for the placeholder
        for (j = 2; j < groups.length; j += 2) {
            result += (groups[j] === "." ? DOT + groups[j+1] : "");
        }
        return result;
    });
}

// Replace the dot after numbers found in the string
str = str.replace(/(\W*\d+)(\.)/gi, "$1" + DOT);

// Replace the dot after single-letter numbering found in the string
str = str.replace(/(\W+[a-zA-Z])(\.)/gi, "$1" + DOT);

// Split the string at the remaining dots
var parts = str.split(".");

// Restore the dots inside each sentence
var sentences = [];
regex = new RegExp(DOT, "gi");
for (i = 0; i < parts.length; i++) {
    if (parts[i].trim().length > 0) {
        sentences.push(parts[i].replace(regex, ".").trim() + ".");
        // Number by sentences.length: empty parts are skipped, so i may run ahead
        console.log("Sentence " + sentences.length + ": " + sentences[sentences.length - 1]);
    }
}
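Because the code above splits only at dots, a sentence can still contain a line break (the first one in the sample string does). The last step described in the answer, splitting each sentence at newlines and wrapping it as in the question's code, might be sketched as follows. The function name `toDataObjects` and its `now` parameter are hypothetical and exist only to keep the example self-contained.

```javascript
// Hypothetical helper, not part of the original answer.
// `now` is an assumed parameter (milliseconds) so the timestamp can be fixed.
function toDataObjects(sentences, now = Date.now()) {
  const timestamp = Math.floor(now / 1000); // seconds, as in the question's code
  return sentences
    .flatMap((sentence) => sentence.split('\n')) // a sentence may span two lines
    .map((line) => line.trim())
    .filter((line) => line.length > 0)           // skip empty lines
    .map((line) => ({
      type: 'item',
      content: [{ content: line, timestamp }]
    }));
}

const items = toDataObjects(
  ['Just some examples:\nThis example has min. 2 lines.', 'Max. 10 lines.'],
  123000
);
console.log(JSON.stringify(items, null, 2));
```

Each resulting object has the same shape as `data` in the question, so it can be saved to the database in the same way.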

Source

2016-10-06 20:45:27 ConnorsFan

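I forgot about one kind of abbreviation: they can have two dots, like 'p.o.'. With your code, this turns a string like 'professional' into a new string 'p_DOT_ofession', which it shouldn't. How can I improve your code? – user3142695

I'll get back to you later today (after work). – ConnorsFan

'new RegExp("(\W*" + abbrParts[0] + ")(\.)(" + abbrParts[1] + "\W*)", "gi")' contains a common mistake with backslashes. '"\."' is an invalid escape sequence. –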
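The backslash issue raised in the last comment can be seen in a few lines. In a JavaScript string literal, "\." evaluates to a plain ".", so the backslash never reaches the regex engine and the dot matches any character; that is also how 'p.o.' ends up matching inside 'professional'. The sketch below is an illustration only; `escapeRegExp` is a hypothetical helper, not something used in the answer.

```javascript
// Hypothetical helper: escapes regex metacharacters in literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Single backslash: the string is just ".", so the pattern matches any character
const singleBackslash = new RegExp("\.");
// Double backslash: the string is "\.", so the pattern matches a literal dot
const doubleBackslash = new RegExp("\\.");

console.log(singleBackslash.test("a")); // true
console.log(doubleBackslash.test("a")); // false
console.log(doubleBackslash.test(".")); // true

// Unescaped abbreviation: each "." is a wildcard, so "prof" matches "p.o."
const unescaped = new RegExp('p.o.', 'i');
console.log(unescaped.test('professional')); // true

// Escaped abbreviation, required to start the string or follow a non-word character
const escaped = new RegExp('(^|\\W)' + escapeRegExp('p.o.'), 'i');
console.log(escaped.test('professional')); // false
console.log(escaped.test('a P.O. box'));   // true
```

Escaping the abbreviation before building the pattern also protects against any other regex metacharacters that an entry in the abbreviation list might contain.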
