2016-07-24

Answers

6

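To add a bit more to @Bob's answer, and assuming you also need to find the right script tag in HTML that may contain other script tags:

The idea is to define one regular expression and use it both for locating the element with BeautifulSoup and for extracting the email value: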

import re

from bs4 import BeautifulSoup

# The address below is a placeholder.
data = """
<body>
    <script>jQuery(window).load(function() {
        setTimeout(function(){
            jQuery("input[name=Email]").val("user@example.com");
        }, 1000);
    });</script>
</body>
"""

# Used twice: to pick the right <script> tag and to capture the email.
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
soup = BeautifulSoup(data, "html.parser")

# string= is the current name of the older text= argument
script = soup.find("script", string=pattern)
if script:
    match = pattern.search(script.string)
    if match:
        email = match.group(1)
        print(email)

Prints: user@example.com

Here we are using a simple regular expression for the email address. We could go further and make it stricter, but I doubt that would actually be needed for this problem.
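For readers who do want a stricter match, here is a minimal sketch of one way to tighten the pattern. The name `STRICT_EMAIL_IN_VAL`, the helper `extract_email`, and the sample strings are hypothetical and not part of the answer above. The rules it applies (no whitespace or quotes in the address, a domain ending in a dot plus at least two letters) are only an assumption about what "stricter" might mean, not a full email validator.

```python
import re

# Hypothetical stricter pattern: no whitespace or quotes in the address,
# and a domain that ends in a dot followed by at least two letters.
STRICT_EMAIL_IN_VAL = re.compile(
    r'\.val\("([^@\s"]+@[^@\s"]+\.[A-Za-z]{2,})"\);'
)


def extract_email(script_text):
    """Return the first address passed to .val("..."), or None."""
    match = STRICT_EMAIL_IN_VAL.search(script_text)
    return match.group(1) if match else None


# Placeholder inputs for illustration only.
sample = 'jQuery("input[name=Email]").val("user@example.com");'
print(extract_email(sample))
print(extract_email('jQuery("input[name=q]").val("not an email");'))
```

Because the compiled pattern is an ordinary `re` object, it could be passed to `soup.find` in place of the simpler one.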

2

This isn't possible with BeautifulSoup alone, but you can do it with, for example, BS plus a regular expression:

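import re
from bs4 import BeautifulSoup as BS

html = """<script> ... </script>"""

bs = BS(html, "html.parser")  # name the parser explicitly

txt = bs.script.string  # the JavaScript source inside the first <script>

# re.search, not re.match: the call can appear on any line of the script
email = re.search(r'\.val\("(.+?)"\);', txt).group(1)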

Or like this:

... 

email = txt.split('.val("')[1].split('");')[0]
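Note that the `split` one-liner raises `IndexError` when the script does not contain `.val("`. A minimal sketch of a more forgiving variant, using `str.partition`, is shown below. The function name `email_from_script` and the sample strings are hypothetical, and returning `None` for a missing marker is an assumption about the desired behavior.

```python
def email_from_script(txt, start='.val("', end='");'):
    """Return the text between start and end, or None if either is missing."""
    _, found_start, rest = txt.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    return value if found_end else None


# Placeholder inputs for illustration only.
txt = 'jQuery("input[name=Email]").val("user@example.com");'
print(email_from_script(txt))
print(email_from_script("var x = 1;"))
```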