2017-09-29 23 views
1

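Parsing data from structured documents with ambiguous labels

I am trying to move legal documents from archaic SGML files into a database. Using regular expressions in Java, I have had good luck so far. However, I have run into a small problem: the labels for each part of a document are not standardized from file to file. For example, the most common labeling is: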

(<numeric>) 
    (<alpha>) 
     (<ROMAN>) 
      (<ALPHA>) 

Ex. (1)(a)(I)(A)

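However, other documents use variations, and in some the parentheses may be dropped altogether. My current algorithm has a hard-coded regex matching each element at each level. What I need is a way to set the label type for each level dynamically as I walk through the document.

Has anyone run into a problem like this? Does anyone have any suggestions?

Thanks in advance.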

Edit:

Here are the regexes I use to parse out the different items:

Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3} 
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.) 
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?) 
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"]) 
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$) 

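And here is some sample text, which I left out earlier. Although the ultimate source of the data is SGML, what I am actually parsing is slightly different: apart from having style tags, it is more or less plain text.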

<tab><b>SECTION 5.</b> In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1) 
introductory portion, (1)(b), and (3)(b)(II) as follows: 

<tab><b>13-5-142. National instant criminal background check system - reporting.</b> 
(1) On and after March 20, 2013, the state court administrator shall send electronically 
the following information to the Colorado bureau of investigation created pursuant to 
section 24-33.5-401, referred to in this section as the "bureau": 

<tab>(b) The name of each person who has been committed by order of the court to the 
custody of the office of behavioral health in the department of human services pursuant 
to section 27-81-112 or 27-82-108; and 

<tab>(3) The state court administrator shall take all necessary steps to cancel a record 
made by the state court administrator in the national instant criminal background check 
system if: 

<tab>(b) No less than three years before the date of the written request: 

<tab>(II) The period of commitment of the most recent order of commitment or 
recommitment expired, or a court entered an order terminating the person's incapacity or 
discharging the person from commitment in the nature of habeas corpus, if the record in 
the national instant criminal background check system is based on an order of 
commitment to the custody of the office of behavioral health in the department of human 
services; except that the state court administrator shall not cancel any record pertaining to 
a person with respect to whom two recommitment orders have been entered pursuant to 
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section 
27-81-112 (11) on the grounds that further treatment is not likely to bring about 
significant improvement in the person's condition; or 
+0

Does the SGML conform to a schema (DTD)? In general, when parsing structured data, it is better to use a standard parser than regular expressions. –

+0

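I should have mentioned that the SGML is not well structured. From what I can tell, the developers of these documents used styles to define each item. There are no descriptive tags for the individual items. – Thomas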

+0

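Could you provide more examples of well-formed and malformed SGML, along with examples of the output you want? Also, could you post the regexes you have tried, so that we can 1) review them, 2) edit them to work (if possible), and 3) avoid suggesting things you have already tried? – ctwheels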

Answers

1

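Your statement of the problem is vague, so the only possible answer is a general approach. I have dealt with converting documents in imprecise formats like this before.

The tool from CS that can help is a state machine. It is appropriate if you can detect (e.g. with a regex) that the format is changing to a new convention. Detecting that changes the state, which in this case amounts to the translator used for the current and subsequent chunks of text. That translator stays in effect until the next state change. Overall, the algorithm looks like this: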

translator = DEFAULT 
while (chunks of input remain) { 
    chunk = GetNextChunkOfInput // a line, paragraph, etc. 
    new_translator = ScanChunkForStateChange(chunk, translator) 
    if (new_translator != null) translator = new_translator // found a state change! 
    print(translator.Translate(chunk)) // use the translator on the chunk 
} 

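Within this framework, designing the translators and the state-change predicates is a tedious process. All you can do is try, inspect the output, fix problems, and repeat until you can no longer improve. At that point you have probably found the maximum structure in the input, so an algorithm that relies on pattern matching alone (without attempting semantic modeling with AI) will not get you much further.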
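As a rough illustration only, the loop above could be sketched in Java as follows. Everything named here is an assumption made for the example: the class LabelStateMachine, the two conventions PAREN ("(1) text") and BARE ("1. text", a hypothetical form for labels with the parentheses dropped), and the "[label] body" output shape. Real documents would need their own patterns and translators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LabelStateMachine {

    /** A labeling convention: group 1 captures the label, group 2 the body text. */
    enum Convention {
        PAREN("^\\(([0-9]+|[A-Za-z]+)\\)\\s+(.*)$"),     // e.g. "(1) text", "(b) text"
        BARE("^([0-9]+|[A-Za-z]{1,4})\\.\\s+(.*)$");     // hypothetical: "1. text", "b. text"

        private final Pattern pattern;

        Convention(String regex) {
            this.pattern = Pattern.compile(regex, Pattern.DOTALL);
        }

        boolean matches(String chunk) {
            return pattern.matcher(chunk).matches();
        }

        /** Rewrites a labeled chunk as "[label] body"; unlabeled chunks pass through unchanged. */
        String translate(String chunk) {
            Matcher m = pattern.matcher(chunk);
            return m.matches() ? "[" + m.group(1) + "] " + m.group(2) : chunk;
        }
    }

    /** Returns the convention to switch to, or null if the current one should stay in effect. */
    static Convention scanForStateChange(String chunk, Convention current) {
        if (current.matches(chunk)) {
            return null;
        }
        for (Convention candidate : Convention.values()) {
            if (candidate != current && candidate.matches(chunk)) {
                return candidate;
            }
        }
        return null;
    }

    /** Runs the state machine over the chunks, starting from the PAREN convention. */
    static List<String> run(List<String> chunks) {
        Convention translator = Convention.PAREN;
        List<String> out = new ArrayList<>();
        for (String chunk : chunks) {
            Convention next = scanForStateChange(chunk, translator);
            if (next != null) {
                translator = next; // found a state change
            }
            out.add(translator.translate(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList(
                "(1) On and after", "(b) The name", "3. The state court", "system if:");
        for (String line : run(chunks)) {
            System.out.println(line);
        }
    }
}
```

The state here is just the active convention, so supporting another labeling style means adding one enum constant. Per-level label types (numeric, alpha, Roman) could be tracked in the same way by making the state richer than a single enum value.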

+0

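Thanks, Gene. I adjusted my algorithm to be closer to your pseudocode and got better results. As you said, I should be able to keep tuning it to improve them further. – Thomas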

0

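The text excerpt you posted can be parsed and structured by an SGML parser using custom grammar rules in a DOCTYPE, also known as a DTD (assuming that <tab> in your example represents an actual tab start-element tag rather than a TAB character). I took your snippet, stored it in a file called data.ent, and then created the following SGML file, doc.sgm, which references it: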

<!DOCTYPE doc [ 
    <!ELEMENT doc O O (tab)+> 
    <!ELEMENT tab - O (((b,c?)|c),text)> 
    <!ELEMENT text O O (#PCDATA|b)+> 
    <!ELEMENT b - - (#PCDATA)> 
    <!ELEMENT c - - (#PCDATA)> 
    <!ENTITY data SYSTEM "data.ent"> 
    <!ENTITY startc "<c>"> 
    <!ENTITY endc "</c>"> 
    <!SHORTREF intab "(" startc ")" endc> 
    <!USEMAP intab tab> 
    <!USEMAP #EMPTY text> 
]> 
&data 

The result of parsing your data with these DTD rules (by running osgmlnorm doc.sgm on the command line) is as follows:

<DOC> 
    <TAB> 
    <B>SECTION 5.</B> 
    <TEXT>In Colorado Revised Statutes, 13-5-142, <B>amend</B> (1) 
     introductory portion, (1)(b), and (3)(b)(II) as follows: 
    </TEXT> 
    </TAB> 
    <TAB> 
    <B>13-5-142. National instant criminal background check system 
     reporting.</B> 
    <C>1</C> 
    <TEXT>On and after March 20, 2013, the state court administrator 
     shall send electronically the following information to the 
     Colorado bureau of investigation created pursuant to section 
     24-33.5-401, referred to in this section as the "bureau": 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>b</C> 
    <TEXT>The name of each person who has been committed by order 
     of the court to the custody of the office of behavioral health 
     in the department of human services pursuant to section 27-81-112 
     or 27-82-108; and 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>3</C> 
    <TEXT>The state court administrator shall take all necessary steps 
     to cancel a record made by the state court administrator in the 
     national instant criminal background check system if: 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>b</C> 
    <TEXT>No less than three years before the date of the written 
     request: 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>II</C> 
    <TEXT>The period of commitment of the most recent order of 
     commitment or recommitment expired, or a court entered an order 
     terminating the person's incapacity or discharging the person 
     from commitment in the nature of habeas corpus, if the record in 
     the national instant criminal background check system is based on 
     an order of commitment to the custody of the office of behavioral 
     health in the department of human services; except that the state 
     court administrator shall not cancel any record pertaining to 
     a person with respect to whom two recommitment orders have been 
     entered pursuant to section 27-81-112 (7) and (8), or who was 
     discharged from treatment pursuant to section 27-81-112 (11) on 
     the grounds that further treatment is not likely to bring about 
     significant improvement in the person's condition; or 
    </TEXT> 
    </TAB> 
</DOC> 

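Explanation:

  • The SGML DTD I created uses SGML tag inference to infer an artificial DOC element as the document element, as well as artificial TEXT and C elements. The main purpose is to impose a document structure consisting of a sequence of TAB elements, each containing a section identifier (such as <b>SECTION 5.</b> or (c)) followed by the section body text.
  • I also wrapped section identifier text that appears in parentheses (the ( and ) characters) in an ad hoc element C. The start and end tags for C are inserted automatically by the SGML processor because of the DTD's SHORTREF map rules. These tell SGML that, within a TAB element, every ( character should be replaced by the value of the startc entity (which expands to <C>) and every ) character by the value of the endc entity (which expands to </C>).
  • <!USEMAP #EMPTY text> turns off the expansion of parentheses in the TEXT body part of a TAB section, so that references such as (7) and (8) in the body text are left unchanged (although these could also be turned into HTML-like links using SGML).

If you are using <tab> to denote the TAB (ASCII 9) character, SGML can handle that as well, for example by translating TAB characters into <TAB> tags using a SHORTREF rule.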

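Note that you need the osgmlnorm program installed. On Ubuntu you can install it with sudo apt-get install opensp, and similar methods work on other Linux variants and on Mac OS. For your application, you will probably want to use the osx program (also part of OpenSP) to output the normalized parse result as XML (although the output shown above can already be parsed as XML), and then use the Java XML APIs to process the structured content as you need.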
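As a minimal sketch of that last step, assuming the normalized output has the DOC/TAB/B/C/TEXT shape shown above, the standard JAXP DOM API can pull out each section's heading, label, and body. The class name NormalizedDocReader and the three-element String[] result are choices made for this example, not part of OpenSP or of the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class NormalizedDocReader {

    /** Returns one {heading, label, text} triple per TAB element; missing parts are "". */
    static List<String[]> readSections(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList tabs = doc.getElementsByTagName("TAB");
            List<String[]> sections = new ArrayList<>();
            for (int i = 0; i < tabs.getLength(); i++) {
                Element tab = (Element) tabs.item(i);
                sections.add(new String[] {
                    childText(tab, "B"),     // bold heading, e.g. "SECTION 5."
                    childText(tab, "C"),     // label that was in parentheses, e.g. "b"
                    childText(tab, "TEXT")   // body text, including nested <B> content
                });
            }
            return sections;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse normalized document", e);
        }
    }

    /** Text of the first direct child element with the given name, whitespace collapsed. */
    static String childText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && tag.equals(n.getNodeName())) {
                return n.getTextContent().trim().replaceAll("\\s+", " ");
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String xml = "<DOC><TAB><C>b</C><TEXT>No less than three years before the date"
                + " of the written\n request:</TEXT></TAB></DOC>";
        for (String[] s : readSections(xml)) {
            System.out.println("heading=" + s[0] + " label=" + s[1] + " text=" + s[2]);
        }
    }
}
```

Only direct children of TAB are inspected for B and C, so a <B> nested inside TEXT (such as the word "amend" in the sample) is not mistaken for a section heading. From here the triples could be written to the database with whatever persistence layer you already use.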
