2014-09-23
re_newspeaker =   r'^(<bullet> | )(?P<name>(%s|(((Mr)|(Ms)|(Mrs))\. [-A-Za-z \']+(of [A-Z][a-z]+)?))|((The ((VICE|ACTING|Acting))?(PRESIDENT|SPEAKER|CHAIR(MAN)?)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK)|(The CHIEF JUSTICE)|(The VICE PRESIDENT)|(Mr\. Counsel [A-Z]+))(\([A-Za-z.\'\- ]+\))?)\.' 


re_speaking =   r'^(<bullet> | )((((((Mr)|(Ms)|(Mrs))\. [A-Za-z \'\-]+(of [A-Z][a-z]+)?)|((The (VICE |Acting |ACTING)?(PRESIDENT|SPEAKER)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK))(\([A-Za-z.\'\- ]+\))?))\.)?(?P<start>.)' 

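For some reason, the Python regular expressions above do not capture names that contain an apostrophe.

For example, Mr. D'STALL is not matched. Any help with the regex pattern would be much appreciated.

The code is meant to take the input and mark it up as XML, like this: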

<speaker>Mr. D'STALL</speaker><speaking>Mr. President, I have been seeking to obtain a report on 
this bill. I am not on the Budget Committee, and I am not on the 
Government Relations Committee. But from what I understand, this is a 
very important bill, a big bill, a complex bill, far reaching in its 
contents. I have been queried, along with all other Senators, I 
suppose, as to whether or not they would have any objection to the 
adoption of the committee amendments, en bloc. I am going to object to 
the adoption of the committee amendments, en bloc, until I see the 
committee report.</speaking> 

    Mr. D'STALL. Mr. President, I have been seeking to obtain a report on 
this bill. I am not on the Budget Committee, and I am not on the 
Government Relations Committee. But from what I understand, this is a 
very important bill, a big bill, a complex bill, far reaching in its 
contents. I have been queried, along with all other Senators, I 
suppose, as to whether or not they would have any objection to the 
adoption of the committee amendments, en bloc. I am going to object to 
the adoption of the committee amendments, en bloc, until I see the 
committee report. 

The regular expression does not match the paragraph above.

What a horribly unmaintainable pattern you have there. I take it the problem affects both patterns? – 2014-09-23 08:47:42

http://regex101.com/r/dT6dN8/1 – 2014-09-23 08:47:58

Your regex requires a 'space' or 'bullet' at the start. Is that present in your input? – vks 2014-09-23 08:49:51
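
To illustrate the point raised in the last comment, here is a minimal sketch that isolates only the leading `(<bullet> | )` part of the patterns, followed by a title. The names `PREFIX` and `prefix_matches` are hypothetical and not part of the original code. It assumes the input line is indented with several spaces, as the sample paragraph appears to be.

```python
import re

# Hypothetical, cut-down version of the original patterns: only the
# leading "(<bullet> | )" alternative followed by a title.
PREFIX = re.compile(r"^(<bullet> | )(?:Mr|Ms|Mrs)\. ")

def prefix_matches(line):
    """Return True if the line starts with '<bullet> ' or exactly one space, then a title."""
    return PREFIX.match(line) is not None

print(prefix_matches(" Mr. D'STALL. Mr. President"))         # one leading space
print(prefix_matches("<bullet> Mr. D'STALL. Mr. President"))  # bullet marker
print(prefix_matches("    Mr. D'STALL. Mr. President"))       # four leading spaces
```

With this cut-down pattern, a line indented by four spaces is rejected before the name is ever examined, so the apostrophe may not be the only thing to check. Whether that applies to the real input depends on how the text is indented.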

Answer

re_newspeaker =   r'^(<bullet> | )(?P<name>(%s|(((Mr)|(Ms)|(Mrs))\. [-A-Z\']+|((Miss) [-A-Z\']+)(of [A-Z][a-z]+)?))|((The ((VICE|ACTING|Acting))?(PRESIDENT|SPEAKER|CHAIR(MAN)?)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK)|(The CHIEF JUSTICE)|(The VICE PRESIDENT)|(Mr\. Counsel [A-Z]+))(\([A-Za-z.\- ]+\))?)\.' 

re_speaking =   r'^(<bullet> | )((((((Mr)|(Ms)|(Mrs))\. [A-Z\']+|((Miss) [-A-Z\']+)(of [A-Z][a-z]+)?)|((The (VICE |Acting |ACTING)?(PRESIDENT|SPEAKER)(pro tempore)?)|(The PRESIDING OFFICER)|(The CLERK))(\([A-Za-z.\- ]+\))?))\.)?(?P<start>.)' 

The regex above solved my problem. I thought I would post it in case anyone else runs into the same issue!
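
For readers who want to try the idea in isolation, here is a small sketch of the key change: an uppercase surname class that also admits hyphens and apostrophes. It is a simplified stand-in, not the full patterns above. The names `SPEAKER` and `split_speaker` are hypothetical, and allowing any leading whitespace with `\s*` is an assumption rather than something the original patterns do.

```python
import re

# Simplified, hypothetical stand-in for the speaker pattern: a title,
# then a surname made of capital letters, hyphens, and apostrophes.
# Leading whitespace of any length is tolerated (an assumption).
SPEAKER = re.compile(
    r"^\s*(?P<name>(?:Mr|Ms|Mrs)\. [-A-Z']+)\. (?P<rest>.*)$",
    re.S,  # let '.' span newlines so a multi-line paragraph is captured whole
)

def split_speaker(paragraph):
    """Return (speaker name, spoken text), or None if no speaker prefix is found."""
    m = SPEAKER.match(paragraph)
    if m is None:
        return None
    return m.group("name"), m.group("rest")

print(split_speaker("    Mr. D'STALL. Mr. President, I object."))
print(split_speaker("committee report."))
```

Writing the pattern in a double-quoted raw string means the apostrophe inside the character class needs no backslash. In a single-quoted raw string, `\'` also works, because `re` treats an escaped apostrophe as a literal one.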