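Parsing data from HTML without IDs

I want to parse a website (using JavaScript or JSoup). My problem is that I don't know how to access the data I want, because there are practically no ids in that file.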

I have something like:

<div id="content">
  <table>
    <tbody>
      <tr>
        <td align="">
          <div style="">
            <table>
              <tbody>
                <tr></tr>
                <tr></tr>
                <tr>
                  <td>
                    <br></br>
                    <h2><div class=""></div>Related</h2>

                    Adaptation:
                    <a href="/link">nameOfBook</a>
                    <br></br>

                    Prequel:
                    <a href="/link2">nameOfBook2</a>
                    <br></br>

                    Other:
                    <a href="link3"></a>
                    <br></br>
                    <br></br>
                    <h2></h2>
                    <table width=""></table>
                    ..........many tables and a.....
                  </td>
                </tr>
              </tbody>
            </table>
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>

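I hope that is understandable; the website is quite large. I want the "Related" stuff. So I want "Sequel" with its three names and their links, and then "Prequel" with name3.

Currently I get #content, then I get an array of all h2 tags and check whether the second child equals "Related". Then I get the parent (the td) and iterate over all "a" elements. There are more than 200 a's in this td.

My plan now is to iterate over these and check whether one of the terms (Prequel, Sequel or Adaptation) comes before the "a", but that sounds a bit complicated.

Alternatively, I could parse everything between the two h2 tags, since that part is always there. Or I could check the links, since the wanted links always have the same structure: search for that structure, then go to the parent and check which label it belongs to.

Can anyone help me? There are no ids or names anywhere in the file. I'm fairly sure I can find a workaround, but it would be too complicated, and with some JS knowledge this is probably easy.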

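Update:

It is not known how many Prequel/Sequel/whatever labels will be there. The only thing I really know is that there will be a "Related" text between two h2 tags, and that the next opening h2 starts something new.

I also changed the example above: it now has the correct structure. #content is itself inside another div, but I don't think that matters, since I can access #content directly.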

Source

2015-05-20 Nemos

Show us your JS; that would make it easier to understand and to help ;). – Bladepianist

I suggest using the **DOM** and **XPath**. http://stackoverflow.com/questions/6466831/selecting-element-from-dom-with-javascript-and-xpath – Brcinho

Are you sure your markup has '' outside the table? – Camusensei

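You can use document.querySelector or document.querySelectorAll and select elements relative to something you can identify.

For example, to select the first three a tags inside div[id="content"]: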

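var allAnchorsInDiv = document.querySelectorAll("div[id='content'] a"); // A static NodeList (array-like, not a real array) of anchors.
var firstThree = Array.prototype.slice.call(allAnchorsInDiv, 0, 3); // Pick the anchors you need from it.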

If you don't have any identifiers, you should use relative paths (such as XPath or CSS selectors).

With a CSS selector you would use something like this:

document.querySelectorAll('body>div:first-of-type>a');

Or you can use XPath; see w3schools.

Note: if you want something a bit easier, you can do the same with jQuery.

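Update:

So, for your requirement, you have to do the following:

Select the text node that contains the label text.
Find the anchor nodes that follow it.

So,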

var myKeyTerm = "Sequel"; // Set your key term here.
var myAnchorTags = [];
// Find the first text node that contains the key term.
var myTextNode = document.evaluate("//text()[contains(., '" + myKeyTerm + "')]", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
if (myTextNode) {
    var nextNode = myTextNode.nextSibling;
    while (nextNode) {
        if (nextNode.nodeName == "A") {
            myAnchorTags.push(nextNode);
        } else if (nextNode.nodeType != 3 || nextNode.nodeValue.trim() != "") {
            break; // Stop at the first node that is neither an anchor nor whitespace (e.g. <br>).
        }
        nextNode = nextNode.nextSibling;
    }
}
// All the anchors that directly follow your text are now in the myAnchorTags array.
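
If the anchors are collected first (for instance with querySelectorAll), the label each one belongs to can be recovered by walking backwards through its previous siblings. The following is only a minimal sketch and not part of the original answer: the helper name labelFor is hypothetical, and it assumes that every label is a text node ending in a colon (such as "Prequel:") that sits among the same siblings as its links, as in the question's sample markup.

```javascript
// Returns the label (e.g. "Prequel") that precedes an anchor, or null.
// Works on anything exposing nodeType, nodeName, nodeValue and previousSibling.
function labelFor(anchor) {
    var node = anchor.previousSibling;
    while (node) {
        // Reaching the section heading means there is no label for this link.
        if (node.nodeName.toUpperCase() === "H2") return null;
        if (node.nodeType === 3) {                 // 3 === Node.TEXT_NODE
            var text = node.nodeValue.trim();
            // Assumption: labels look like "Label:"; other text is skipped.
            if (/:$/.test(text)) return text.slice(0, -1).trim();
        }
        node = node.previousSibling;
    }
    return null;
}

// Browser usage; the selector is only an example.
if (typeof document !== "undefined") {
    var anchors = document.querySelectorAll("#content a");
    for (var i = 0; i < anchors.length; i++) {
        var label = labelFor(anchors[i]);
        if (label) console.log(label + ":", anchors[i].textContent, anchors[i].href);
    }
}
```

Anchors for which no such label is found yield null and can simply be skipped.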

Source

2015-05-20 09:47:10

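Ohh, I didn't know I could use XPath with JS. No, the first example would return more than 80 elements; how should I tell them apart? The second would return my first "a", right? Then I could get the others with .nextSibling and so on. But how do I get the Sequel/Prequel text? – Nemos

This may help you get started with XPath in JS: http://www.w3schools.com/xpath/xpath_examples.asp. I also edited my answer a little (separating XPath and CSS selectors) for better clarity. –

Thanks, I edited my comment :) I still press Enter when I want to start a new paragraph :/ – Nemos

My take on this problem would be: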

var content = document.getElementById("content"); 
var h2 = content.getElementsByTagName("h2")[0]; // the first h2 element 
var link1 = h2.nextElementSibling; 
var link2 = link1.nextElementSibling; 
var link3 = link2.nextElementSibling; 
var link4 = link3.nextElementSibling; 
console.log("Sequel: ", link1.innerHTML, link1.href); 
console.log("Sequel: ", link2.innerHTML, link2.href); 
console.log("Sequel: ", link3.innerHTML, link3.href); 
console.log("Prequel: ", link4.innerHTML, link4.href);

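This approach has the advantage of working even if there are links in the first (stripped-out) table.

However, it won't work if the first (stripped-out) table contains h2 elements... In that case, instead of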

var h2 = content.getElementsByTagName("h2")[0]; // the first h2 element

you should use

var h2 = Array.prototype.filter.call(content.children, function(c) {return c.tagName.toLowerCase() == "h2"})[0];

Edit

function listlinks() {
    var prequels = [];
    var sequels = [];
    var all_h2_elems = document.getElementsByTagName("h2");
    var h2_start = Array.prototype.filter.call(all_h2_elems, function(el){return el.innerText.indexOf("Related") != -1})[0];
    var parent = h2_start.parentElement;
    var h2_elems = Array.prototype.filter.call(parent.children, function(c) {return c.tagName.toLowerCase() == "h2"});
    if (h2_elems.length < 2) console.log("You lied, you said there were always 2 h2 tags!");
    if (!h2_start.isSameNode(h2_elems[0])) console.log("Hmmm, there should not be a h2 tag before the 'Related' one, fix your question.");
    var sequel = false;
    var prequel = false;
    var current = h2_elems[0];
    var end = h2_elems[1];
    while (current && !current.isSameNode(end))
    {
        if (current.tagName === undefined)
        {
            if (current.nodeValue.indexOf("Sequel") != -1)
            {
                if (sequel || prequel) { console.log("wtf? another sequel?"); break; }
                sequel = true;
            }
            else if (current.nodeValue.indexOf("Prequel") != -1)
            {
                if (!sequel) { console.log("wtf? prequel should be AFTER sequel"); break; }
                prequel = true;
                sequel = false;
            }
            else if (current.nodeValue.match(/[a-z]/))
            {
                prequel = false;
                sequel = false;
                // stop outputting links if anything else is found
            }
        } // end if (current.tagName === undefined)
        else if (current.tagName.toLowerCase() === "a")
        {
            if (prequel) prequels.push(current.innerHTML + " : " + current.href);
            if (sequel) sequels.push(current.innerHTML + " : " + current.href);
        }
        current = current.nextSibling;
    }
    return [prequels, sequels];
}
listlinks().forEach(function(el,i){console.log(i?"Sequels:":"Prequels:",el)})
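
Since the question's update says that neither the number nor the names of the labels are known in advance, the same sibling walk could be generalized so that it groups links under whatever label precedes them. This is a hedged sketch rather than the answerer's code: the function name groupLinksByLabel is hypothetical, and it assumes that labels are text nodes ending in a colon and that the section ends at the next h2 sibling.

```javascript
// Walks the siblings after the "Related" heading until the next <h2> and
// groups every link under the most recent "Label:" text node.
function groupLinksByLabel(heading) {
    var groups = Object.create(null); // no inherited keys, so any label name is safe
    var label = null;
    for (var node = heading.nextSibling; node; node = node.nextSibling) {
        var name = node.nodeName.toUpperCase();
        if (name === "H2") break;                 // the next section starts here
        if (node.nodeType === 3) {                // 3 === Node.TEXT_NODE
            var text = node.nodeValue.trim();
            if (/:$/.test(text)) {                // assumption: labels look like "Label:"
                label = text.slice(0, -1).trim();
                groups[label] = groups[label] || [];
            }
        } else if (name === "A" && label !== null) {
            groups[label].push({ text: node.textContent, href: node.getAttribute("href") });
        }
    }
    return groups;
}

// Browser usage: start from the first heading whose text contains "Related".
if (typeof document !== "undefined") {
    var headings = document.getElementsByTagName("h2");
    for (var i = 0; i < headings.length; i++) {
        if (headings[i].textContent.indexOf("Related") !== -1) {
            console.log(groupLinksByLabel(headings[i]));
            break;
        }
    }
}
```

The result maps each label to an array of link objects, so labels such as "Adaptation" or "Other" are picked up without being listed in the code.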

Source

2015-05-20 09:59:16 Camusensei

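Thanks, but I don't know how many elements I will need from the HTML, so a static approach probably won't work. I edited my original post. – Nemos

There you go. It prints all the links to the console, regardless of whatever else is on the page. – Camusensei

Never mind, I didn't see that the markup had changed -_-' I'll fix it – Camusensei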
