2016-08-19 133 views
4

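Parsing a data-bind tag in HTML with Beautiful Soup

I'm having trouble selecting this 'div' object in Beautiful Soup and then parsing the data inside it.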

First, I have to decode the HTML entities in the page (for example with https://mothereff.in/html-entities).

What steps would I take to programmatically select, for example,

(extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&height=1000&mode=max')

from the markup below?

<div data-bind="component: { name: &#39;product-detail&#39;, params: {hasVariants:true,name:&#39;BROOKS LOUNGE CHAIR&#39;,hasCategory:true,superCategoryName:&#39;Furniture&#39;,categoryDisplayName:&#39;Living Room&#39;,categorySlug:&#39;living-room&#39;,subcategoryDisplayName:&#39;Chairs&#39;,subcategorySlug:&#39;chairs&#39;,collection:{id:1529,name:&#39;Irondale&#39;,description:&#39;Each piece is a striking conversation-starter. Tables are made from reclaimed doors paired with salvaged architecture or old machine parts. Storage solutions are inspired by libraries of the 1940’s. Cast iron beds with linen panels as well as seating in linen, lush velvet and top-grain leather offer a distinctive found feel.&#39;,isFeatured:true,isNew:false,image:&#39;/FourHandsMarketplace/media/General/Featured%20Collections/IRONDALE.jpg?width=500&#39;,shortDescription:&#39;Moving from Parisian flea market to modern to industrial, understated elegance is a common theme. Waxed leathers and distressed irons mix with fabrics for an intriguing style blend.\r\n&#39;,uri:&#39;/collections/irondale&#39;},attributes:[{id:384,name:&#39;COVER&#39;,displayOrder:30,swatches:true,values:[{id:12710,name:&#39;EBONY&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-G6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12711,name:&#39;STONEWASH DARK GREEN&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]},{id:385,name:&#39;FINISH&#39;,displayOrder:40,swatches:true,values:[{id:12712,name:&#39;BLACK WASH WEATHERED&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-K5_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12713,name:&#39;DISTRESSED WASHED OLD OAK&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-K6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]}],products:[{attributeValueIds:[12710,12712],description:&#39;Our take on the classic Adirondack emphasizes comfort with thick, top-grain 
leather cushioning. Wire-brushed oak is finished in black and hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.75&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >88&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Black Washed Weathered&#39;,&#39;Ebony&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#
39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_3.
jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_3.jpg&#39;},{order:11,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_4.jpg&#39;}],priceHtml:&#39;$520.00&#39;,itemNumber:&#39;CIRD-72K5-G6H6&#39;,name:&#39;Brooks Lounge Chair-Ebony, Blk Wsh Weath&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false},{attributeValueIds:[12711,12713],description:&#39;Our take on the classic Adirondack emphasizes comfort with green, stonewashed cotton canvas cushioning. 
Wire-brushed oak is hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.5&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >147&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Distressed Washed Old Oak&#39;,&#39;Stonewash Dark Green&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/
CIRD-72K6-H9_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_3.jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=200&amp;height=200&amp;
mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_3.jpg&#39;}],priceHtml:&#39;$290.00&#39;,itemNumber:&#39;CIRD-72K6-H9&#39;,name:&#39;Brooks Lounge Chair-Stonewsh Drk Green&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false}],activeItemNumber:&#39;CIRD-72K5-G6H6&#39;,priceDescription:&#39;Wholesale Price&#39;} }"></div> 

What would the code for this look like?

Answer

0

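It's not entirely clear where on the site this string comes from or what exactly you want to extract, but for the Beautiful Soup part you only need: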

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
text = soup.div['data-bind']

Here s is the string from your question. We first get the 'div' tag and then read its 'data-bind' attribute.
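
On the entity-decoding step from the question: as far as I know, Beautiful Soup hands back attribute values with the entities already decoded, so text above should need no extra pass. If you ever hold the raw attribute string instead, Python 3's standard html.unescape does the same job. A minimal sketch; the raw string is a shortened, made-up stand-in for the real attribute, not the full value:

```python
import html

# Shortened, hypothetical stand-in for the still-encoded attribute value.
raw = (
    "component: { name: &#39;product-detail&#39;, params: {"
    "dimensions:&#39;W: 27.75&quot; H: 29&quot;&#39;,"
    "thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg"
    "?width=200&amp;height=200&amp;mode=crop&#39;} }"
)

# html.unescape turns &#39; into ', &quot; into " and &amp; into & in one pass.
decoded = html.unescape(raw)
print(decoded)
```

After this step the quotes and ampersands are literal characters, which is the form the rest of this answer works on.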

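The format confuses me: it looks like JSON and like a Python dictionary, but neither parser accepts the input. I guess it's JavaScript? Inspired by this question, I wrote a quick and dirty brace-counting loop: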

nest_lvl = 0
lvl_string = list()   # lvl_string[n] collects the text seen at nesting depth n
for char in text:
    if char == '{':
        nest_lvl += 1

    try:
        lvl_string[nest_lvl] += char
    except IndexError:   # first character seen at this depth
        lvl_string.append(char)

    if char == '}':
        # a brace group just closed: print it, reset its buffer, step back out
        print(nest_lvl, lvl_string[nest_lvl])
        lvl_string[nest_lvl] = ''
        nest_lvl -= 1

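Hopefully that gets you started. Again, the parsing part really depends on how general the parser needs to be and on what you want to extract.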
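
If all you want is one kind of value, such as the extraLarge image URLs, a regular expression over the decoded text may be enough and avoids parsing the whole structure. The sketch below is one possible approach, not a general parser: find_values is a hypothetical helper, sample is a shortened stand-in for the decoded data-bind text, and it assumes values are single-quoted and contain no escaped quotes.

```python
import re


def find_values(text, key="extraLarge"):
    """Return every single-quoted value stored under `key` in the text."""
    # (?<!\w) keeps the key from matching inside a longer identifier.
    pattern = re.compile(r"(?<!\w)" + re.escape(key) + r"\s*:\s*'([^']*)'")
    return pattern.findall(text)


# Shortened stand-in for the decoded attribute (one image entry only).
sample = (
    "images:[{order:8,"
    "thumb:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&height=200&mode=crop',"
    "large:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&height=600&mode=max',"
    "extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&height=1000&mode=max',"
    "full:'http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg'}]"
)

urls = find_values(sample)
# Keep only the image the question asks about.
wanted = [u for u in urls if "CIRD-72K6-H9_SID_1" in u]
print(wanted)
```

On the full attribute you would call find_values(text) and filter the result the same way.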