2015-01-06 75 views

Parsing HTML tags as XML

I am trying to parse the XML embedded in the HTML file below. Here is a detail from one of the tags:

  <tr class="iris_table_row"> 
       <td style=" width:37.50%; text-align:left; " class="ta_10"><span class="ta_10">Tangible assets</span></td> 
       <td style=" width:2.50%; text-align:right; " class="ta_10"><span class="ta_10">2</span></td> 
       <td style=" width:30.00%; text-align:right; " class="ta_61"><ix:nonFraction contextRef="cfwd_31_03_2014" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">7,956</ix:nonFraction></td> 
       <td style=" width:1.25%; " class="ta_61" /> 
       <td style=" width:26.25%; text-align:right; " class="ta_60"><ix:nonFraction contextRef="cfwd_31_03_2013" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">5,402</ix:nonFraction></td> 
       <td style=" width:1.25%; " class="ta_60" /> 
       <td style=" width:1.25%; " class="ta_10" /> 
      </tr> 

I attempted this with the DOM parser in Java, but it does not recognize the XML tags.

In the code below, the value of db.parse(fXmlFile) is "null".

File fXmlFile = new File("Prod223_1254_04903825_20140331 copy.xml"); 

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance(); 
    dbf.setValidating(false); 
    dbf.setNamespaceAware(true); 
    dbf.setIgnoringComments(false); 
    dbf.setIgnoringElementContentWhitespace(false); 
    dbf.setExpandEntityReferences(false); 
    DocumentBuilder db = dbf.newDocumentBuilder(); 

    System.out.println(db.parse(fXmlFile)); 
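
A note on the snippet above, offered as a sketch rather than a diagnosis: printing a `Document` calls its `toString()`, which with the JDK's bundled parser typically shows the node name and node value, for example `[#document: null]`. A `Document` node's value is `null` by the DOM specification, so that output alone does not mean parsing failed. The class name and the inline sample string below are hypothetical stand-ins for the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DocumentToStringDemo {
    // Parses an XML string; a stand-in for db.parse(fXmlFile) in the question.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical minimal XHTML document, not the real filing.
        Document doc = parse("<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>");
        // The node value of a Document node is null by specification.
        System.out.println(doc.getNodeValue());
        // The parsed tree is still available through the root element.
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```

Inspecting `getDocumentElement()` instead of printing the `Document` itself is a quick way to check whether the tree was actually built.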

How can I get all the tags and their information into Java? Ideally, I would load them into a bean.

Here is an example of the type of file I am trying to parse.

<?xml version="1.0" encoding="utf-8"?><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transformation/2010-04-20" xmlns:ixt2="http://www.xbrl.org/inlineXBRL/transformation/2011-07-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xl="http://www.xbrl.org/2003/XLink" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:iris="http://www.iris.co.uk/ixbrl" xmlns:ns0="http://www.xbrl.org/uk/gaap/core-full/2009-09-01" xmlns:ns5="http://www.xbrl.org/uk/gaap/core/2009-09-01" xmlns:ns6="http://www.xbrl.org/uk/reports/direp/2009-09-01" xmlns:ns7="http://www.xbrl.org/uk/cd/business/2009-09-01" xmlns:ns8="http://www.xbrl.org/uk/all/types/2009-09-01" xmlns:ns9="http://xbrl.org/2005/xbrldt" xmlns:ns10="http://www.xbrl.org/uk/all/common/2009-09-01" xmlns:ns11="http://www.xbrl.org/2006/ref" xmlns:ns12="http://www.xbrl.org/uk/cd/countries/2009-09-01" xmlns:ns13="http://www.xbrl.org/uk/all/ref/2009-09-01" xmlns:ns14="http://www.xbrl.org/uk/cd/currencies/2009-09-01" xmlns:ns15="http://www.xbrl.org/uk/cd/exchanges/2009-09-01" xmlns:ns16="http://www.xbrl.org/uk/cd/languages/2009-09-01" xmlns:ns17="http://www.xbrl.org/2004/ref" xmlns:ns18="http://www.xbrl.org/uk/all/gaap-ref/2009-09-01" xmlns:ns19="http://www.xbrl.org/uk/reports/aurep/2009-09-01" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:ns20="http://www.govtalk.gov.uk/uk/fr/tax/full-gaap-dpl/2013-10-01" xmlns:ns21="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap-main/2013-10-01" xmlns:ns22="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap/2013-10-01" xmlns:ns23="http://www.govtalk.gov.uk/uk/fr/tax/dpl-core/2013-10-01"> 
<head> 
    <meta name="PostingEntryNumber" content="4" /> 
    <meta name="PeriodRecordNumber" content="2341" /> 
    <meta content="application/xhtml+xml; charset=UTF-8" http-equiv="Content-Type" /> 
    <meta name="description" content="iXBRL report production" /> 
    <meta name="Mode" content="CH" /> 
    <meta http-equiv="X-UA-Compatible" content="IE=8" /> 

    <title>Shortt Orthopaedics Limited - Limited company - abbreviated - 11.6</title> 
    <style type="text/css"> 
     @media print 
     { 
      hr { display:none; } 
      .portraitpage 
      { 
       min-height:273mm; 
       max-width:170mm; 
      } 
      .landscapepage 
      { 
       min-height:170mm; 
       max-width:273mm; 
      } 
     } 
     @media screen 
     { 
      .portraitpage 
      { 
       max-width:170mm; 
       min-height:273mm; 
       margin:12mm 20mm 12mm 20mm; 
      } 
      .landscapepage 
      { 
       max-width:273mm; 
       min-height:170mm; 
       margin:12mm 20mm 12mm 20mm; 
      } 
     } 
     body{ margin:0px; font-size:1.3em; } 
     td{ padding:0px; } 
     div.portraitpage{ page-break-after:always; position:relative; } 
     div.landscapepage{ page-break-after:always; position:relative; } 
      div.header{ position:relative; } 
      div.footer{ left:0px; right:0px; bottom:0px; text-align:center; position:absolute; } 
    div.container{ position:relative; } 
        div.maintext{ width:100.00%; position:relative; } 
        div.tagged_blob{ width:100.00%; position:relative; } 
           table.iris_table{ width:100.00%; border-collapse:collapse; } 
       table.iris_table_header{ width:100.00%; border-collapse:collapse; } 
       table.iris_table_footer{ width:100.00%; border-collapse:collapse; } 
     div.hr.iris_hr{ width:100.00%; } 
      td.total_single{ border-top:thin solid black; } 
      td.total_double{ border-top:double black; } 
     .ta_10{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_11{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_12{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_13{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_20{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_21{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_22{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_23{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_30{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_31{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_32{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_33{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_40{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_41{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_42{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_43{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_50{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_51{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_52{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_53{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_60{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_61{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_62{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_63{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_70{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_71{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_72{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_73{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_80{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_81{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_82{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_83{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_90{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_91{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_92{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_93{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_100{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_101{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_102{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_103{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_110{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_111{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_112{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_113{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_120{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_121{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_122{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; } 
     .ta_123{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; } 
     .ta_130{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; } 
     .ta_131{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; } 
     .ta_132{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; } 
     .ta_133{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; } 
     .ta_140{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
     .ta_141{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
     .ta_142{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
     .ta_143{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; } 
    </style> 
</head> 
<body xml:lang="en"> 
    <div style="display:none"> 
     <ix:header> 
      <ix:hidden> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:NameAuthor" order="1" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionOrTitleAuthor" order="2" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:UKCompaniesHouseRegisteredNumber" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">07189486</ix:nonNumeric> 
       <ix:nonNumeric contextRef="CountriesHypercube_FY_31_03_2014_Set1" name="ns7:CountryFormationOrIncorporation" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" /> 
       <ix:nonNumeric contextRef="CurrenciesHypercube_FY_31_03_2014_Set2" name="ns7:PrincipalCurrencyUsedInBusinessReport" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" /> 
       <ix:nonNumeric contextRef="EntityOfficersHypercube_FY_31_03_2014_Set3" name="ns5:NameDirectorSigningAccounts" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" /> 
       <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:StartDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">1.4.13</ix:nonNumeric> 
       <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:EndDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric> 
       <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:BalanceSheetDate" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityAccountsType" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Company accounts</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:LegalFormOfEntity" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Private Limited Company</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionPeriodCoveredByReport" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">FY</ix:nonNumeric> 
       <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityTrading" format="ixt2:booleantrue" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">true</ix:nonNumeric> 

[truncated: body character limit]

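If Stack Overflow limits the body text, remove the bits that are not relevant to your question. The limit is there for a reason; you don't need to post 4 KB of XML to make your point. (Also, what is your point? You have not specified *which* tags you want to load, or in *what form*.) – Tomalak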

I did not specify a tag because I want to load all of them. In what form? A String for string tags, and so on. Do you know how to parse the HTML? –

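Asked differently: what is the result, the end goal of the whole operation? An HTML file? And please reduce the size of your post; doing so will also help you build a meaningful example. – Tomalak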

Answers

0

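I think you need a two-step approach.

  • Use an HTML parser to get at the embedded XML
  • ...then use the DOM parser on the content

HTML does not always conform to the XML specification (unless you use XHTML, which has become less popular). Browsers let many things slide, such as missing tags, single versus double quotes, and attributes without values, which may be why your site fails to parse.

Many HTML parsers are available.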
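
The second step could look roughly like the sketch below, assuming the content handed to the DOM parser is already well-formed XML. It uses a namespace-aware parser and looks up `ix:nonFraction` elements by the namespace URI shown in the question rather than by prefix. The `Fact` bean, the class name, and the comma-stripping rule are assumptions made for illustration; the sketch ignores attributes such as `scale` and `format`.

```java
import java.io.ByteArrayInputStream;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class InlineXbrlFacts {
    // Namespace URI taken from the xmlns:ix declaration in the question.
    static final String IX_NS = "http://www.xbrl.org/2008/inlineXBRL";

    /** Hypothetical bean holding one ix:nonFraction fact. */
    static class Fact {
        final String name;
        final String contextRef;
        final String unitRef;
        final BigDecimal value;

        Fact(String name, String contextRef, String unitRef, BigDecimal value) {
            this.name = name;
            this.contextRef = contextRef;
            this.unitRef = unitRef;
            this.value = value;
        }
    }

    static List<Fact> extract(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for getElementsByTagNameNS
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        NodeList nodes = doc.getElementsByTagNameNS(IX_NS, "nonFraction");
        List<Fact> facts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            // Assumption: "," is only a thousands separator, as in "7,956".
            String text = e.getTextContent().trim().replace(",", "");
            facts.add(new Fact(e.getAttribute("name"), e.getAttribute("contextRef"),
                    e.getAttribute("unitRef"), new BigDecimal(text)));
        }
        return facts;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the table row from the question.
        String sample = "<tr xmlns=\"http://www.w3.org/1999/xhtml\"><td>"
                + "<ix:nonFraction contextRef=\"cfwd_31_03_2014\""
                + " name=\"ns5:TangibleFixedAssets\" unitRef=\"GBP\""
                + " xmlns:ix=\"" + IX_NS + "\">7,956</ix:nonFraction></td></tr>";
        for (Fact f : extract(sample)) {
            System.out.println(f.name + " " + f.contextRef + " " + f.unitRef + " " + f.value);
        }
    }
}
```

Matching on the namespace URI keeps the lookup independent of whichever prefix a given file happens to bind to the inline XBRL namespace.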

0

According to the documentation, DTD validation always takes place, even if you tell it not to!

What you want to do is create a new DTD that adds your namespace to the standard XHTML DTD; the W3 site discusses how to achieve this, and the example they give is MathML:

First, define a content model module that instantiates the MathML DTD and connects it to the content model:

<!-- File: mathml-model.mod --> 
<!ENTITY % XHTML1-math 
    PUBLIC "-//W3C//DTD MathML 2.0//EN" 
      "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" > 
%XHTML1-math; 

<!ENTITY % Inlspecial.extra 
    "%a.qname; | %img.qname; | %object.qname; | %map.qname; 
     | %Mathml.Math.qname;" > 

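Next, define a DTD driver that identifies our new content model module as the content model for the DTD, and hands off processing to the XHTML 1.1 driver (for example):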

<!-- File: xhtml-mathml.dtd --> 
<!ENTITY % xhtml-model.mod 
     SYSTEM "mathml-model.mod" > 
<!ENTITY % xhtml11.dtd 
    PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
      "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" > 
%xhtml11.dtd;
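
For comparison, here is a sketch of a different technique from the DTD extension above: if the only goal is to stop the Java parser from loading an external DTD at all, an `EntityResolver` that returns an empty source is one common option. This assumes the document does not rely on entities defined in the DTD (such as `&nbsp;`), which would become undefined. The class name and the sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SkipExternalDtd {
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Resolve every external entity, including the DOCTYPE's DTD,
        // to an empty stream so that nothing is fetched over the network.
        db.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document that references the XHTML 1.1 DTD.
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" "
                + "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        System.out.println(parse(xml).getDocumentElement().getLocalName());
    }
}
```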