2017-08-16 17 views
1

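JSoup - check certain elements to see if they have text, then select only one

I'm trying to extract prices from Amazon using JSoup, but there are two different elements I can extract a price from. I can get it from the aria-label attribute of an element, or I can get it from the text inside an element. Preferably, I always want to get it from the aria-label attribute, but sometimes that attribute is not present, so I need to extract the price from the second span class instead. My question is: how do I write an if statement that checks whether the attribute has any text and, if it doesn't, falls back to extracting the text from the second span class?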

Also, I'm trying to get several prices from elements that share the same class name, but when I write doc.select("span.sx-price.sx-price-large").get(0).text(), for example, nothing shows up.

Here is the HTML code for one of the items I want to extract the price from:

<a class="a-size-small a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B01MZYYWUH">1</a></div> 
 
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary a-text-bold">Product Description</span><br><span class="a-size-small a-color-secondary">... Cards Radeon&trade; <em>RX</em> 460 Graphics Cards Radeon&trade; R9 <em>390</em> Graphics Cards ...</span></div> 
 
</div></div></div></div></div></div></li> 
 
<li id="result_2" data-asin="B00IAAU6SS" class="s-result-item celwidget "> 
 
    <div class="s-item-container"> 
 
    <div class="a-fixed-left-grid"> 
 
    <div class="a-fixed-left-grid-inner" style="padding-left:218px"> 
 
    <div class="a-fixed-left-grid-col a-col-left" style="width:218px;margin-left:-218px;_margin-left:-109px;float:left;"> 
 
     <div class="a-row"> 
 
     <div aria-hidden="true" class="a-column a-span12 a-text-center"> 
 
      <a class="a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B00IAAU6SS"><img src="https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US218_.jpg" srcset="https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US218_.jpg 1x, https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US327_FMwebp_QL65_.jpg 1.5x, https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US436_FMwebp_QL65_.jpg 2x, https://images-na.ssl-images-amazon.com/images/I/419c5Ci-UqL._AC_US500_FMwebp_QL65_.jpg 2.2935x" width="218" height="218" alt="Product Details" class="s-access-image cfMarker" data-search-image-load></a> 
 
      <div class="a-section a-spacing-none a-text-center"></div> 
 
     </div> 
 
     </div> 
 
    </div> 
 
    <div class="a-fixed-left-grid-col a-col-right" style="padding-left:2%;*width:97.6%;float:left;"> 
 
    <div class="a-row a-spacing-small"> 
 
     <div class="a-row a-spacing-none scx-truncate-medium sx-line-clamp-3 s-list-title-long"> 
 
     <a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal" title="Arctic Accelero Xtreme IV 280(X) - High-End Graphics Card Cooler with Backside Cooler for Efficient RAM and VR-Cooling - DCACO-V930001-GBA01" href="http://rads.stackoverflow.com/amzn/click/B00IAAU6SS"> 
 
      <h2 data-attribute="Arctic Accelero Xtreme IV 280(X) - High-End Graphics Card Cooler with Backside Cooler for Efficient RAM and VR-Cooling - DCACO-V930001-GBA01" data-max-rows="3" class="a-size-medium s-inline s-access-title a-text-normal">Arctic Accelero Xtreme IV 280(X) - High-End Graphics Card Cooler with Backside Cooler for Efficient RAM and VR-Cooling - DCACO-V930001-GBA01</h2> 
 
     </a> 
 
     </div> 
 
     <div class="a-row a-spacing-none"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">ARCTIC</span></div> 
 
    </div> 
 
    <div class="a-row"> 
 
    <div class="a-column a-span7"> 
 
    <div class="a-row a-spacing-none"><a class="a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B00IAAU6SS"><span aria-label="$85.99" class="a-color-base sx-zero-spacing"><span class="sx-price sx-price-large"> 
 
     <sup class="sx-price-currency">$</sup> 
 
     <span class="sx-price-whole">85</span> 
 
     <sup class="sx-price-fractional">99</sup> 
 
     </span> 
 
     </span></a><span class="a-letter-space"></span><i class="a-icon a-icon-prime a-icon-small s-align-text-bottom" aria-label="Prime"><span class="a-icon-alt">Prime</span></i> 
 
    </div> 
 
    <div class="a-row a-spacing-mini"> 
 
     <div class="a-row a-spacing-none"><span class="a-size-small a-color-secondary">FREE Shipping on eligible orders</span></div> 
 
     <div class="a-row a-spacing-none"><span class="a-size-small a-color-price">Only 8 left in stock - order soon.</span></div> 
 
    </div> 
 
    <div class="a-row a-spacing-mini"> 
 
    <div class="a-row a-spacing-none"> 
 
     <div class="a-row a-spacing-mini"></div> 
 
     <span class="a-size-small a-color-secondary">More Buying Choices</span> 
 
    </div> 
 
    <div class="a-row a-spacing-none"> 
 
    <a class="a-size-small a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B00IAAU6SS"><span class="a-color-secondary a-text-strike"></span><span class="a-size-base a-color-base">$85.99</span>

Any help with extracting the price would be appreciated. Thanks for your time!

Answers

0

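Given your example, we select the span. Then we check whether the selected element is null; if it is null, we take the text from the child spans instead.

Try this (I haven't tested it, it's written from memory; of course you need to fetch the document first, which I assume you have already done):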

try {
    // first() returns null only when no element in the document matches the selector
    Element span = doc.select("span.a-color-base.sx-zero-spacing").first();

    if (span != null) {
        System.out.println(span.attr("aria-label"));
    } else {
        // Fall back to the whole and fractional parts of the nested price span
        Element beforeSep = doc.select("span.sx-price-whole").first();
        Element afterSep = doc.select("sup.sx-price-fractional").first();

        System.out.println(beforeSep.text() + "." + afterSep.text());
    }
} catch (Exception ex) {
    // exception handler
}
+0

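That won't work, because even if the HTML code above doesn't contain the aria-label, other items certainly will have one, so the aria-label will never be null. – coolyfrost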
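
To make the comment's point concrete: `first()` returns null only when nothing in the whole document matches, so the check has to look at the attribute value of each item instead. Below is a minimal sketch of that decision in plain Java, with strings standing in for the values jsoup would hand back (`Element.attr` returns an empty string when the attribute is missing). The class name `PriceFallback`, the method `pickPrice`, and the price pattern are illustrative assumptions, not part of either answer.

```java
import java.util.regex.Pattern;

class PriceFallback {
    // Assumed label format: optional "$", digits, a dot, two decimals (e.g. "$85.99").
    private static final Pattern PRICE = Pattern.compile("^\\$?[0-9]+\\.[0-9]{2}$");

    /**
     * Returns the aria-label value when it looks like a price,
     * otherwise rebuilds the price from the nested span parts.
     *
     * @param ariaLabel  attribute value; "" when the attribute is absent
     * @param currency   text of .sx-price-currency
     * @param whole      text of .sx-price-whole
     * @param fractional text of .sx-price-fractional
     */
    static String pickPrice(String ariaLabel, String currency, String whole, String fractional) {
        if (ariaLabel != null && PRICE.matcher(ariaLabel.trim()).matches()) {
            return ariaLabel.trim();
        }
        return currency + whole + "." + fractional;
    }

    public static void main(String[] args) {
        // Label present and price-shaped: used as-is.
        System.out.println(pickPrice("$85.99", "$", "85", "99"));
        // Label missing (empty string): rebuilt from the span parts.
        System.out.println(pickPrice("", "$", "12", "50"));
        // Label present but not a price (the Prime icon in the sample has aria-label="Prime").
        System.out.println(pickPrice("Prime", "$", "12", "50"));
    }
}
```

Validating the label's content, rather than only testing for its presence, also guards against unrelated labels such as the `aria-label="Prime"` icon in the sample HTML.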

0

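I suggest selecting the .sx-price element, since its name suggests that it contains a price. You can then take its parent element, which is expected to carry the aria-label attribute, and use a simple regular expression to check whether that attribute contains a price. If it does, take the price directly from the attribute; otherwise collect the data from the nested child spans.

Below is some code I played around with; it works well.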

import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// html is assumed to hold the page source as a String
final Document doc = Jsoup.parse(html);

final Elements prices = doc.select(".sx-price");

// Optional "$", digits, a dot, then exactly two decimals, e.g. "$85.99"
final Pattern pattern = Pattern.compile("^\\$?([0-9]+)\\.([0-9]{2})$");

for (Element el : prices) {
    String price = "";
    // The aria-label, when present, sits on the parent of the .sx-price span
    final Element parent = el.parent();
    if (parent.hasAttr("aria-label") && pattern.matcher(parent.attr("aria-label")).find()) {
        System.out.println("Extracting price from aria-label...");
        price = parent.attr("aria-label");
    } else {
        System.out.println("Extracting price from span body...");
        String currency = el.select(".sx-price-currency").text();
        String whole = el.select(".sx-price-whole").text();
        String fractional = el.select(".sx-price-fractional").text();

        price = String.format("%s%s.%s", currency, !whole.isEmpty() ? whole : "00", !fractional.isEmpty() ? fractional : "00");
    }

    System.out.println(price);
}
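
One thing to watch with a pattern of the form `^\$?([0-9]+)\.([0-9]{2})$`: it would not match a label that uses a thousands separator, such as "$1,299.99". That format is an assumption here, since the sample HTML only shows $85.99; such items would simply go through the span-body branch. If a numeric value is needed, a more tolerant check could look like the sketch below. The class name `PriceLabelParser` and the method `parse` are hypothetical, and it uses only the standard library.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceLabelParser {
    // Optional "$", then either comma-grouped digits or plain digits, a dot, two decimals.
    private static final Pattern LABEL =
            Pattern.compile("^\\$?([0-9]{1,3}(?:,[0-9]{3})*|[0-9]+)\\.([0-9]{2})$");

    /** Returns the numeric price, or an empty Optional if the label is not price-shaped. */
    static Optional<BigDecimal> parse(String label) {
        if (label == null) {
            return Optional.empty();
        }
        Matcher m = LABEL.matcher(label.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        // Drop the grouping commas before building the number.
        String whole = m.group(1).replace(",", "");
        return Optional.of(new BigDecimal(whole + "." + m.group(2)));
    }

    public static void main(String[] args) {
        System.out.println(parse("$85.99"));
        System.out.println(parse("$1,299.99"));
        System.out.println(parse("Prime"));
    }
}
```

Returning an `Optional` keeps the "label is not a price" case explicit, so the caller can decide to fall back to the nested spans.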

I hope it helps.
