Scraping links from an eBay featured collection page

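I am trying to build a web scraper with Python and BeautifulSoup that goes into eBay's featured collections and retrieves the URLs of all the products in a collection (most collections have 17 products, although some have a few more or fewer). Here is the URL of the collection I am trying to scrape in my code: http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018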

Here is my code so far:

import requests 
from bs4 import BeautifulSoup 

url = 'http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018' 
soup = BeautifulSoup(requests.get(url).text, 'html.parser') 

product_links = [] 

item_thumb = soup.find_all('div', attrs={'class':'itemThumb'}) 
for link in item_thumb: 
    product_links.append(link.find('a').get('href')) 

print(product_links)

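This should scrape 17 links and append them to the list product_links. However, it only works halfway. Specifically, it scrapes only the first 12 product links every time and leaves the remaining 5 untouched, even though all 17 links sit inside the same HTML tags and attributes. Looking more closely at the page's HTML, the only difference I found is that the first 12 links and the final 5 are separated by a piece of XML script, which I have included here: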

<script escape-xml="true"> 
     if (typeof(collectionState) != 'object') { 
      var collectionState = { 
       itemImageSize: {sWidth: 280, sHeight: 280, lWidth: 580, lHeight: 620}, 
       page: 1, 
       totalPages: 2, 
       totalItems: 17, 
       pageId: '2057253', 
       currentUser: '', 
       collectionId: '323101965012', 
       serviceHost: 'svcs.ebay.com/buying/collections/v1', 
       owner: 'ebaytecheditor', 
       csrfToken: '', 
       localeId: 'en-US', 
       siteId: 'EBAY-US', 
       countryId: 'US', 
       collectionCosEnabled: 'true', 
       collectionCosHostExternal: 'https://api.ebay.com/social/collection/v1', 
       collectionCosEditEnabled: 'true', 
       isCollectionReorderEnabled: 'false', 
       isOwnerSignedIn: false || false, 
       partiallySignedInUser: '@@[email protected]@[email protected]@', 
       baseDomain: 'ebay.com', 
       currentDomain: 'www.ebay.com', 
       isTablet: false, 
       isMobile: false, 
       showViewCount: true 
      }; 
     } 
    </script>

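What is the function of this script? Could this script be the reason my scraper skips the last 5 links? Is there a way to work around this and pull in the last five as well?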

Source

2016-08-18 Federico Scivittaro

This happens because the next 5 links are loaded using JavaScript.

The last few are generated through an AJAX request to http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018:

The URL is made up of ebayhomeeditor and what must be some product id, 324079803018, both of which appear in the original URL of the page you visit.
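As a rough sketch of that composition, the AJAX URL can be derived from the collection URL. The helper name `ajax_url_for` is hypothetical, and treating the `2` in the path as a page number is an assumption based on `totalPages: 2` in the question's script. It is not something eBay documents here.

```python
from urllib.parse import urlparse


def ajax_url_for(collection_url, page=2):
    """Build the AJAX URL for a collection page (hypothetical helper).

    Assumes the path looks like /cln/<owner>/<collection-name>/<id>
    and that the number after /_ajax/ is a page number (a guess).
    """
    parts = urlparse(collection_url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 4 or segments[0] != "cln":
        raise ValueError("unexpected collection URL: %s" % collection_url)
    owner, collection_id = segments[1], segments[-1]
    return "%s://%s/cln/_ajax/%d/%s/%s" % (
        parts.scheme, parts.netloc, page, owner, collection_id)


print(ajax_url_for(
    "http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018"))
```

For the collection in the question this reproduces the AJAX URL quoted above. Whether the same pattern holds for other collections is untested.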

The only parameter that is essential to get data back is itemsPerPage, but you can play around with the rest and see what effect they have.

import requests
from bs4 import BeautifulSoup

params = {"itemsPerPage": "10"}
soup = BeautifulSoup(requests.get("http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018", params=params).content, "html.parser")
print([a["href"] for a in soup.select("div.itemThumb div.itemImg.image.lazy-image a[href]")])

Which would give you:

['http://www.ebay.com/itm/yamazaki-home-tower-book-end-white-stationary-holder-desktop-organizing-steel/171836462366?hash=item280240551e', 'http://www.ebay.com/itm/tetris-constructible-interlocking-desk-lamp-neon-light-nightlight-by-paladone/221571335719?hash=item3396ae4627', 'http://www.ebay.com/itm/iphone-docking-station-dock-native-union-new-in-box/222202878086?hash=item33bc52d886', 'http://www.ebay.com/itm/turnkey-pencil-sharpener-silver-office-home-school-desk-gift-peleg-design/201461359979?hash=item2ee808656b', 'http://www.ebay.com/itm/himori-weekly-times-desk-notepad-desktop-weekly-scheduler-30-weeks-planner/271985620013?hash=item3f539b342d']

So, putting it together to get all the URLs:

In [23]: params = {"itemsPerPage": "10"} 

In [24]: with requests.Session() as s: 
    ....:   soup1 = BeautifulSoup(s.get('http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018').content, 
    ....:        "html.parser") 
    ....:   main_urls = [a["href"] for a in soup1.select("div.itemThumb div.itemImg.image.lazy-image a[href]")] 
    ....:   soup2 = BeautifulSoup(s.get("http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018", params=params).content, 
    ....:        "html.parser") 
    ....:   print(len(main_urls)) 
    ....:   main_urls.extend(a["href"] for a in soup2.select("div.itemThumb div.itemImg.image.lazy-image a[href]")) 
    ....:   print(main_urls) 
    ....:   print(len(main_urls)) 
    ....:  
12 
['http://www.ebay.com/itm/archi-desk-accessories-pen-cup-designed-by-hsunli-huang-for-moma/262435041373?hash=item3d1a58f05d', 'http://www.ebay.com/itm/moorea-seal-violet-light-crane-scissors/201600302323?hash=item2ef0507cf3', 'http://www.ebay.com/itm/kikkerland-photo-holder-with-6-magnetic-wooden-clothespin-mh69-cable-47-long/361394782932?hash=item5424cec2d4', 'http://www.ebay.com/itm/authentic-22-design-studio-merge-concrete-pen-holder-desk-office-pencil/331846509549?hash=item4d4397e3ed', 'http://www.ebay.com/itm/supergal-bookend-by-artori-design-ad103-metal-black/272273290322?hash=item3f64c0b452', 'http://www.ebay.com/itm/elago-p2-stand-for-ipad-tablet-pcchampagne-gold/191527567203?hash=item2c97eebf63', 'http://www.ebay.com/itm/this-is-ground-mouse-pad-pro-ruler-100-authentic-natural-retail-100/201628986934?hash=item2ef2062e36', 'http://www.ebay.com/itm/hot-fuut-foot-rest-hammock-under-desk-office-footrest-mini-stand-hanging-swing/152166878943?hash=item236dda4edf', 'http://www.ebay.com/itm/unido-silver-white-black-led-desk-office-lamp-adjustable-neck-brightness-level/351654910666?hash=item51e0441aca', 'http://www.ebay.com/itm/in-house-black-desk-office-organizer-paper-clips-memo-notes-monkey-business/201645856763?hash=item2ef30797fb', 'http://www.ebay.com/itm/rifle-paper-co-2017-maps-desk-calendar-illustrated-worldwide-cities/262547131670?hash=item3d21074d16', 'http://www.ebay.com/itm/muji-erasable-pen-black/262272348079?hash=item3d10a66faf', 'http://www.ebay.com/itm/rifle-paper-co-2017-maps-desk-calendar-illustrated-worldwide-cities/262547131670?hash=item3d21074d16', 'http://www.ebay.com/itm/muji-erasable-pen-black/262272348079?hash=item3d10a66faf', 'http://www.ebay.com/itm/yamazaki-home-tower-book-end-white-stationary-holder-desktop-organizing-steel/171836462366?hash=item280240551e', 'http://www.ebay.com/itm/tetris-constructible-interlocking-desk-lamp-neon-light-nightlight-by-paladone/221571335719?hash=item3396ae4627', 
'http://www.ebay.com/itm/iphone-docking-station-dock-native-union-new-in-box/222202878086?hash=item33bc52d886', 'http://www.ebay.com/itm/turnkey-pencil-sharpener-silver-office-home-school-desk-gift-peleg-design/201461359979?hash=item2ee808656b', 'http://www.ebay.com/itm/himori-weekly-times-desk-notepad-desktop-weekly-scheduler-30-weeks-planner/271985620013?hash=item3f539b342d'] 
19 

In [25]:

There is a little overlap in what gets returned, so just use a set to store main_urls, or call set on the list:

In [25]: len(set(main_urls)) 
Out[25]: 17
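A set drops the duplicates but also loses the page order. If the order matters, one option is to deduplicate with a dict, which keeps first-seen order in Python 3.7+. This is a minimal sketch; `unique_in_order` is a hypothetical helper name and the sample URLs are shortened placeholders, not real listings.

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


sample = [
    "http://www.ebay.com/itm/a/1",
    "http://www.ebay.com/itm/b/2",
    "http://www.ebay.com/itm/a/1",  # overlap between the two responses
    "http://www.ebay.com/itm/c/3",
]
print(unique_in_order(sample))
```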

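I am not sure why that happens and have not really tried to figure it out. If it bothers you, you could parse "totalItems: 17" from the source returned by the AJAX call, subtract the length of main_urls after the first call, and set {"itemsPerPage": str(int(parsedtotal) - len(main_urls))}, but I would not worry too much about it.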
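A minimal sketch of that idea, assuming the returned source contains the `totalItems: 17` line in the same form as the `collectionState` script shown in the question. The helper names `parse_total_items` and `remaining_items` are hypothetical.

```python
import re


def parse_total_items(source):
    """Return the totalItems value found in the page source, or None."""
    match = re.search(r"totalItems\s*:\s*(\d+)", source)
    return int(match.group(1)) if match else None


def remaining_items(source, already_found):
    """How many items are still missing after the first request."""
    total = parse_total_items(source)
    if total is None:
        return None
    # Never return a negative count if more links were found than expected.
    return max(total - already_found, 0)


snippet = "page: 1,\n totalPages: 2,\n totalItems: 17,\n pageId: '2057253',"
left = remaining_items(snippet, 12)
params = {"itemsPerPage": str(left)}
print(params)
```

With the values from the question (17 items in total, 12 found on the first page), this asks for the 5 remaining items. Whether the endpoint then returns exactly those five without overlap has not been checked.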

Source

2016-08-18 12:03:15

Is there any eBay API that can do this? – pratibha
