2015-10-15 147 views
1

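I need to get the average div height and width of an HTML document, that is, compute the average height and the average width of its div tags.

I tried this solution, but it does not work: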

import numpy as np 
average_width = np.mean([div.attrs['width'] for div in my_doc.get_div() if 'width' in div.attrs]) 
average_height = np.mean([div.attrs['height'] for div in my_doc.get_div() if 'height' in div.attrs]) 
print average_height,average_width 

The get_div method returns the list of all divs retrieved with BeautifulSoup's find_all method.

Here is an example:

print my_doc.get_div()[1] 

<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;"> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of  Infection (2015) 
    </span> 
    <span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span> 
    <span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4 
    <br/> 
    </span> 
</div> 

When I get the attributes, it works perfectly:

print my_doc.get_div()[1].attrs 

{u'style': u'position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;'} 

But when I try to get the value:

print my_doc.get_div()[1].attrs['width'] 

I get an error:

KeyError: 'width' 

But I don't understand why, because when I check the type:

print type(my_doc.get_div()[1].attrs) 

it is a dictionary: <type 'dict'>

+0

Can you give the web page, or more of the HTML page source? – SIslam

+0

@SIslam, I edited my post –

+0

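How do you calculate the width of a 'div'? For example: I have a 'div' set to 100% width. If my window is full screen, that is probably ~1900px. If my window is smaller, the 'div' is smaller. So what is its width? How does 'average' even come into it? –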

Answers

1

There may be a better way.

Way-1

Below is the code I tested to extract the width and height:

from bs4 import BeautifulSoup 

html_doc = '''<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;"> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of  Infection (2015) 
    </span> 
    <span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span> 
    <span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4 
    <br/> 
    </span> 
</div>''' 

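# Parse the document and collect the inline style string of every div that has one.
soup = BeautifulSoup(html_doc, 'html.parser')
my_att = [i.attrs['style'] for i in soup.find_all("div") if 'style' in i.attrs]
# Split into single declarations on ';' and drop empty pieces and stray whitespace.
dd = ';'.join(my_att).split(";")
dd_cln = [i.strip() for i in dd if i.strip()]
# Split each declaration on the first ':' only, so values containing ':' stay whole.
# With several divs, a later value for the same property overwrites an earlier one.
my_dict = dict(i.split(':', 1) for i in dd_cln)
print(my_dict['width'])  # the value keeps its unit: 127px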
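
Because a single dictionary can hold only one width and one height, averaging over many divs needs each style string parsed on its own. A minimal sketch of that, written for Python 3 with the standard library only; the helper names parse_style and average_px and the sample style strings are made up for illustration, and in the question the strings would come from div.attrs['style']:

```python
from statistics import mean


def parse_style(style):
    """Turn one inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(';'):
        if ':' not in part:
            continue  # skip empty pieces such as the one after the last ';'
        name, value = part.split(':', 1)
        declarations[name.strip().lower()] = value.strip()
    return declarations


def average_px(styles, prop):
    """Average a pixel-valued property over the styles that define it."""
    values = []
    for style in styles:
        value = parse_style(style).get(prop, '')
        if value.endswith('px'):
            try:
                values.append(float(value[:-2]))
            except ValueError:
                pass  # not a plain number before 'px'; ignore it
    return mean(values) if values else None


# Hypothetical input: one style string per div.
styles = [
    'position:absolute; left:45px; top:81px; width:127px; height:9px;',
    'position:absolute; left:45px; top:95px; width:200px; height:11px;',
    'position:absolute; left:45px; top:120px; width:100%;',
]
print(average_px(styles, 'width'))   # mean of 127 and 200
print(average_px(styles, 'height'))  # mean of 9 and 11
```

Divs with no width, or with a width that is not in pixels (such as 100%), are left out of the average rather than guessed at.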

Way-2: use a regular expression, as described here, and take the mean with numpy.

Working code:

import numpy as np 
import re 
from bs4 import BeautifulSoup 

html_doc = '''<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;"> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of  Infection (2015) 
    </span> 
    <span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span> 
    <span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span> 
    <span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4 
    <br/> 
    </span> 
</div>''' 

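soup = BeautifulSoup(html_doc, 'html.parser')
# Join the inline styles of all divs into one string to scan with regular expressions.
my_att = [i.attrs['style'] for i in soup.find_all("div") if 'style' in i.attrs]
css = ';'.join(my_att)
print(css)
# (?<![\w-]) keeps longer names such as max-width or line-height from matching.
# List comprehensions give real lists, which np.mean needs on Python 3 as well.
width_list = [float(w) for w in re.findall(r'(?<![\w-])width:\s*(\d+(?:\.\d+)?)px', css)]
height_list = [float(h) for h in re.findall(r'(?<![\w-])height:\s*(\d+(?:\.\d+)?)px', css)]
print(np.mean(height_list))
print(np.mean(width_list))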
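
One thing to watch with a regular expression keyed on `width:` or `height:` is that longer property names such as max-width or line-height end in the same text. The sketch below, written for Python 3 with a made-up style string, compares a pattern that looks only for the name with one that also checks the character before it:

```python
import re

# Hypothetical style string containing look-alike property names.
style = 'max-width:500px; width:127px; line-height:12px; height:9px;'

# Keyed only on the property name: also matches inside max-width.
naive_width = re.findall(r'(?<=width:)(\d+)(?=px;)', style)

# Guarded: the character before the name must not be a word character or '-'.
strict_width = re.findall(r'(?<![\w-])width:\s*(\d+(?:\.\d+)?)px', style)
strict_height = re.findall(r'(?<![\w-])height:\s*(\d+(?:\.\d+)?)px', style)

print(naive_width)    # contains the max-width value as well
print(strict_width)   # only the plain width
print(strict_height)  # only the plain height
```

The guarded pattern also allows spaces after the colon and decimal values, and it does not require a trailing semicolon.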
+0

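Indeed, my code does not work because the key of the dictionary is 'style' and not 'width'. I tried this solution http://stackoverflow.com/questions/10401110/using-beautiful-soup-to-convert-css-attributes-to-individual-html-attributes: 'import cssutils a = cssutils.parseStyle(my_doc.get_div()[1].attrs['style']) print a['width']' but I get this error: 'ERROR Property: Invalid value for "CSS Backgrounds and Borders Module Level 3" property: textbox 1px solid [1:20: border] WARNING Property: Unknown Property name. [1:47: writing-mode]' –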

+0

Same here! Yes, it may be a custom tag that the Python library does not have! – SIslam

+0

Changed the answer! – SIslam