2017-04-24
2

Parsing a text file with citations

I am trying to split a text file into sentences with PHP. I currently use the following call:

$results = preg_split('/(?<=[.?!])\s+/', $stringtest, -1, PREG_SPLIT_NO_EMPTY); 

The problem is that there are sentences like this:

In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. in Isay 11). 

It gets split like this:

[0] In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. 
[1] in Isay 11). 

Another example is:

Dr. Evelyn Hooker, a heterosexual psychologist... 

Here the "Dr." part would be a problem.
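The behaviour is easy to reproduce. The snippet below is a minimal sketch: the input string is a made-up sentence that combines both problem cases (it is not taken from the corpus), and the call is the same preg_split shown above.

```php
<?php
// Hypothetical input combining the two problem cases ("Dr." and "qtd.").
$stringtest = 'Dr. Evelyn Hooker wrote a study (qtd. in Isay 11). It was influential.';

// Same call as in the question: split on whitespace that follows ., ? or !
$results = preg_split('/(?<=[.?!])\s+/', $stringtest, -1, PREG_SPLIT_NO_EMPTY);

print_r($results);
```

Every period followed by whitespace counts as a boundary, so the abbreviations "Dr." and "qtd." each produce an unwanted split, giving four pieces instead of two.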

These texts all come from the MASC corpus for NLP.

+4

What is your question? –

+0

@JayBlanchard: I guess the OP wants to split on punctuation, but because "." also appears elsewhere, it causes trouble. – Rahul

+0

I don't think regex is a good tool for this. – jrook

Answers

1

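You can extend @ndn's solution to get what you need. Note that $before_regexes contains a list of known abbreviations; add the abbreviations that occur in your corpus. I added qtd there.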
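If the abbreviation list grows, it may be easier to generate that rule than to edit the long alternation by hand. The following is only a sketch under assumptions: the $abbreviations array and the variable names are hypothetical, and the generated pattern merely mirrors the shape of the abbreviation rule in $before_regexes.

```php
<?php
// Hypothetical corpus-specific abbreviations, written without the trailing dot.
$abbreviations = array('vs', 'qtd', 'approx');

// Escape each entry so it is matched literally inside the pattern.
$escaped = array_map(function ($abbr) {
    return preg_quote($abbr, '/');
}, $abbreviations);

// Same shape as the abbreviation rule in $before_regexes: a listed abbreviation,
// a dot and one whitespace character at the very end of the text read so far.
$abbreviation_regex = '/(?:\b(?:' . implode('|', $escaped) . ')\.\s)\Z/su';

// Ends with "qtd. ", so the rule applies.
var_dump(preg_match($abbreviation_regex, 'physical contact with men” (qtd. ') === 1);

// Ends with an ordinary word, so the rule does not apply.
var_dump(preg_match($abbreviation_regex, 'This is the end. ') === 1);
```

The generated string can then replace the hand-written abbreviation entry in $before_regexes, keeping its partner in $after_regexes unchanged.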

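Then note that $before_regexes and $after_regexes are paired by index. I added the pair '/(?:[”’"\'»])\s*\Z/u' / '/\A(?:\(\p{L})/u' and marked it as a non-sentence boundary (the first false in the $is_sentence_boundary array). The pair means: find a closing quote (one of ”’"'»), then 0+ whitespace characters, followed by a ( (matched with \() and any Unicode letter (\p{L}); in that case there should be no split.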
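To see what the added pair does in isolation, the check can be run on a single position. This is a small sketch: the two patterns are the ones quoted above, while the $before and $after strings are hypothetical snapshots of the text already read and of the look-ahead window.

```php
<?php
// The pair added at the front of $before_regexes / $after_regexes.
$before_regex = '/(?:[”’"\'»])\s*\Z/u';
$after_regex  = '/\A(?:\(\p{L})/u';

// Hypothetical snapshot: text read so far ends with a closing quote and a space,
// and the look-ahead window starts with a parenthesised citation.
$before = 'take pleasure in physical contact with men.” ';
$after  = '(qtd. in I';

$citation_follows = preg_match($before_regex, $before) === 1
    && preg_match($after_regex, $after) === 1;

// Hypothetical contrast: the same quote followed by a new sentence.
$new_sentence_follows = preg_match($before_regex, $before) === 1
    && preg_match($after_regex, 'The next s') === 1;

var_dump($citation_follows);     // pair matches: position is marked as a non-boundary
var_dump($new_sentence_follows); // pair does not match: later rules decide
```

Because the pair sits first in the arrays and the loop stops at the first matching pair, it takes precedence over the later rules that would otherwise split after the closing quote.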

function sentence_split($text) { 
    $before_regexes = array('/(?:[”’"\'»])\s*\Z/u', 
     '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe}))\Z/su', 
     '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su', 
     '/(?:(?:[\[\(]*\.\.\.[\]\)]*))\Z/su', 
     '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|qtd)\.\s))\Z/su', 
     '/(?:(?:\b[Ee]tc\.\s))\Z/su', 
     '/(?:(?:[\.!?…]+\p{Pe})|(?:[\[\(]*…[\]\)]*))\Z/su', 
     '/(?:(?:\b\p{L}\.))\Z/su', 
     '/(?:(?:\b\p{L}\.\s))\Z/su', 
     '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su', 
     '/(?:(?:[\"”\']\s*))\Z/su', 
     '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su', 
     '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su', 
     '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su'); 
    $after_regexes = array('/\A(?:\(\p{L})/u', 
     '/\A(?:)/su', 
     '/\A(?:[\p{N}\p{Ll}])/su', 
     '/\A(?:[^\p{Lu}])/su', 
     '/\A(?:[^\p{Lu}]|I)/su', 
     '/\A(?:[^\p{Lu}])/su', 
     '/\A(?:\p{Ll})/su', 
     '/\A(?:\p{L}\.)/su', 
     '/\A(?:\p{L}\.\s)/su', 
     '/\A(?:\p{N})/su', 
     '/\A(?:\s*\p{Ll})/su', 
     '/\A(?:)/su', 
     '/\A(?:\p{Lu}[^\p{Lu}])/su', 
     '/\A(?:\p{Lu}\p{Ll})/su'); 
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, false, true, true, true); 
    $count = count($before_regexes); // check every (before, after) pair, including the last one

    $sentences = array(); 
    $sentence = ''; 
    $before = ''; 
    $after = substr($text, 0, 10); 
    $text = substr($text, 10); 

    while($text != '') { 
     for($i = 0; $i < $count; $i++) { 
      if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) { 
       if($is_sentence_boundary[$i]) { 
        array_push($sentences, $sentence); 
        $sentence = ''; 
       } 
       break; 
      } 
     } 

     $first_from_text = $text[0]; 
     $text = substr($text, 1); 
     $first_from_after = $after[0]; 
     $after = substr($after, 1); 
     $before .= $first_from_after; 
     $sentence .= $first_from_after; 
     $after .= $first_from_text; 
    } 

    // Flush whatever remains, including inputs too short to enter the loop.
    if(($sentence.$after) !== '') { 
     array_push($sentences, $sentence.$after); 
    } 

    return $sentences; 
} 
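To make the control flow easier to follow, here is a stripped-down sketch of the same idea with only two rules. The function mini_sentence_split, its two rules and the sample text are illustrative assumptions, not part of the function above. It also differs in one respect: it walks over UTF-8 characters rather than bytes, so multibyte quotes are never cut in half.

```php
<?php
// Hypothetical miniature of the paired-rule approach.
// Each rule is [before-regex, after-regex, is-boundary]; the first matching rule wins.
function mini_sentence_split(string $text): array {
    $rules = array(
        // Known abbreviation before the position: never a boundary.
        array('/\b(?:Dr|qtd)\.\s\z/u', '/\A/u', false),
        // Sentence punctuation, optional closing bracket, whitespace, then an uppercase letter.
        array('/[.!?]\p{Pe}*\s\z/u', '/\A\p{Lu}/u', true),
    );

    // Split into UTF-8 characters so the window never cuts a multibyte character.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $sentences = array();
    $sentence = '';
    $n = count($chars);

    for ($i = 0; $i < $n; $i++) {
        $sentence .= $chars[$i];
        // Look-ahead window of up to 10 characters.
        $after = implode('', array_slice($chars, $i + 1, 10));
        if ($after === '') {
            break; // end of text, nothing left to look ahead at
        }
        foreach ($rules as list($before_regex, $after_regex, $is_boundary)) {
            if (preg_match($before_regex, $sentence) && preg_match($after_regex, $after)) {
                if ($is_boundary) {
                    $sentences[] = $sentence;
                    $sentence = '';
                }
                break; // first matching rule decides
            }
        }
    }

    if ($sentence !== '') {
        $sentences[] = $sentence;
    }
    return $sentences;
}

print_r(mini_sentence_split(
    'Dr. Evelyn Hooker was a psychologist. She is cited (qtd. in Isay 11). The end.'
));
```

As in the full function, each sentence keeps its trailing whitespace; apply trim() to the results if that is not wanted.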
