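Parsing data from a structured document with ambiguous labels

I am trying to move legal documents from archaic SGML files into a database. Using regular expressions in Java, I have had fairly good luck. However, I have run into a small problem: the labels for each part of a document are not standardized from file to file. For example, the most common labels are: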
(<numeric>)
(<alpha>)
(<ROMAN>)
(<ALPHA>)
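Ex. (1)(a)(I)(A)
However, other documents use variations on this scheme, including in how the () are used. My current algorithm has a hard-coded RegEx for each element at each level. What I need is a way to set the label type for each level dynamically as I work through the document.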
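To make "setting the label type per level dynamically" concrete, here is a minimal sketch rather than a finished solution. It assumes the labels have already been extracted as bare strings such as `1`, `a`, `II` or `A.5`, that the order in which label kinds first appear defines the nesting, and that a single ambiguous letter such as `I` is alphabetic only when it directly follows the previous letter at that level. The class name `LabelLevels`, the `Kind` enum and the `levelOf` method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class LabelLevels {
    enum Kind { NUMERIC, LOWER_ALPHA, UPPER_ALPHA, LOWER_ROMAN, UPPER_ROMAN }

    // Kinds in the order they were first seen; the index is the nesting level.
    private final List<Kind> levels = new ArrayList<>();
    // Last label seen for each kind, used to resolve ambiguous single letters.
    private final Map<Kind, String> last = new EnumMap<>(Kind.class);

    /** Returns the 0-based level of a non-empty bare label such as "1", "a", "II" or "A.5". */
    int levelOf(String label) {
        String base = label.replaceFirst("\\.\\d+$", ""); // "a.5" -> "a"
        Kind kind = classify(base);
        if (!levels.contains(kind)) {
            levels.add(kind); // first time this kind appears: it becomes the next level
        }
        int level = levels.indexOf(kind);
        // A label at this level starts a fresh run at every deeper level.
        for (int i = level + 1; i < levels.size(); i++) {
            last.remove(levels.get(i));
        }
        last.put(kind, base);
        return level;
    }

    private Kind classify(String base) {
        if (base.matches("\\d+")) {
            return Kind.NUMERIC;
        }
        boolean upper = Character.isUpperCase(base.charAt(0));
        Kind alpha = upper ? Kind.UPPER_ALPHA : Kind.LOWER_ALPHA;
        Kind roman = upper ? Kind.UPPER_ROMAN : Kind.LOWER_ROMAN;
        if (!base.matches("(?i)[ivxlc]+")) {
            return alpha; // contains a letter that is not a Roman digit
        }
        if (base.length() > 1) {
            return roman; // assumption: multi-letter Roman-looking labels are Roman
        }
        // Single letter such as "I" or "v": alphabetic only if it directly
        // follows the previous alphabetic label (for example "H" then "I").
        String prev = last.get(alpha);
        boolean continuesAlpha = prev != null && prev.length() == 1
                && prev.charAt(0) + 1 == base.charAt(0);
        return continuesAlpha ? alpha : roman;
    }

    public static void main(String[] args) {
        LabelLevels tracker = new LabelLevels();
        for (String label : new String[] {"1", "a", "I", "A", "B", "II", "2"}) {
            System.out.println("(" + label + ") -> level " + tracker.levelOf(label));
        }
    }
}
```

The heuristic is deliberately simple. It will misjudge documents that skip labels, for example amendment text that jumps straight to `(C)`, so it is a starting point for discussion rather than a drop-in replacement for the hard-coded patterns.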
Has anyone run into a problem like this? Does anyone have any suggestions?
Thanks in advance.
Edit:
Here are the RegExs I use to parse out the different items:
Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3}
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.)
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?)
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"])
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$)
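The ambiguity can be reproduced directly with two of the patterns above. The sketch below copies the SubParagraph and SubSubParagraph expressions verbatim into Java string literals (with backslashes doubled) and reports which of them match a given line. The class name `AmbiguityDemo`, the method `matchingLevels` and the test lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class AmbiguityDemo {
    // Insertion-ordered so results are reported in a stable order.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("SubParagraph", Pattern.compile(
            "(\\)|<tab>|<\\/b>)\\s*\\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\\.\\d+)?\\)\\s+"
            + "($|[A-Z\"<]|\\([A-Z](.\\d+)?\\)\\s*[A-Z\"])"));
        PATTERNS.put("SubSubParagraph", Pattern.compile(
            "(<tab>|\\)\\s*)\\([A-Z](.\\d+)?\\)\\s+([A-Z\"]|$)"));
    }

    /** Returns the names of all patterns that find a match somewhere in the line. */
    static List<String> matchingLevels(String line) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(line).find()) {
                names.add(entry.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(matchingLevels("<tab>(I) The"));  // [SubParagraph, SubSubParagraph]
        System.out.println(matchingLevels("<tab>(II) The")); // [SubParagraph]
        System.out.println(matchingLevels("<tab>(A) The"));  // [SubSubParagraph]
    }
}
```

A label like `(I)` satisfies both patterns, so the expressions alone cannot decide the level; some notion of the surrounding context is needed.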
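And here is some sample text. I misspoke earlier: although the ultimate source of the data is SGML, what I am parsing is slightly different. It is more or less plain text, except that it has style tags.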
<tab><b>SECTION 5.</b> In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)
introductory portion, (1)(b), and (3)(b)(II) as follows:
<tab><b>13-5-142. National instant criminal background check system - reporting.</b>
(1) On and after March 20, 2013, the state court administrator shall send electronically
the following information to the Colorado bureau of investigation created pursuant to
section 24-33.5-401, referred to in this section as the "bureau":
<tab>(b) The name of each person who has been committed by order of the court to the
custody of the office of behavioral health in the department of human services pursuant
to section 27-81-112 or 27-82-108; and
<tab>(3) The state court administrator shall take all necessary steps to cancel a record
made by the state court administrator in the national instant criminal background check
system if:
<tab>(b) No less than three years before the date of the written request:
<tab>(II) The period of commitment of the most recent order of commitment or
recommitment expired, or a court entered an order terminating the person's incapacity or
discharging the person from commitment in the nature of habeas corpus, if the record in
the national instant criminal background check system is based on an order of
commitment to the custody of the office of behavioral health in the department of human
services; except that the state court administrator shall not cancel any record pertaining to
a person with respect to whom two recommitment orders have been entered pursuant to
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section
27-81-112 (11) on the grounds that further treatment is not likely to bring about
significant improvement in the person's condition; or
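For reference, here is a small sketch of pulling only the leading label off each line of text like the sample above, independent of level. It assumes a label of interest sits at the very start of a line, optionally after `<tab>`, and is followed by whitespace; the class name `LeadingLabels` and the pattern are illustrative and are not the expressions I currently use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingLabels {
    // Optional <tab>, then one parenthesized label at the very start of the line.
    private static final Pattern LEADING =
        Pattern.compile("^(?:<tab>)?\\(([0-9A-Za-z]+(?:\\.\\d+)?)\\)\\s");

    /** Returns the leading label of every line that starts with one. */
    static List<String> extract(List<String> lines) {
        List<String> labels = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LEADING.matcher(line);
            if (m.find()) {
                labels.add(m.group(1));
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "<tab><b>13-5-142. National instant criminal background check system - reporting.</b>",
            "(1) On and after March 20, 2013, the state court administrator shall send electronically",
            "<tab>(b) The name of each person who has been committed by order of the court to the",
            "<tab>(3) The state court administrator shall take all necessary steps to cancel a record",
            "<tab>(b) No less than three years before the date of the written request:",
            "<tab>(II) The period of commitment of the most recent order of commitment or",
            "section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section");
        System.out.println(extract(sample)); // [1, b, 3, b, II]
    }
}
```

Note that the sample is amendment text, so the labels skip: `(b)` appears without `(a)`, and `(II)` without `(I)`. Any level-detection scheme has to tolerate that.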
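Does the SGML conform to a schema (DTD)? In general, when parsing structured data, it is better to use a standard parser than regular expressions.
I should have mentioned that the SGML is not well structured. From what I can tell, the developers of these documents used styles to define each item. There are no descriptive tags for the individual items. – Thomas
Could you provide more examples of well-formed and malformed SGML, along with an example of the output you want? Also, could you post the regexes you have tried, so that we can 1) check them, 2) edit them to work (if possible), and 3) avoid suggesting things you have already tried? – ctwheels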