How to extract a string from a <script> tag using Beautiful Soup?

In a given .html page, I have a script tag like this:

 <script>jQuery(window).load(function() { 
    setTimeout(function(){ 
    jQuery("input[name=Email]").val("[email protected]"); 
    }, 1000); 
});</script>

How can I extract the email address using Beautiful Soup?

Source

2016-07-24 dundonian

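To add a bit more to @Bob's answer, assuming you also need to locate the right script tag in HTML that may contain other script tags.

The idea is to define a regular expression that is used both for locating the element with BeautifulSoup and for extracting the email value: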

import re

from bs4 import BeautifulSoup


data = """
<body>
    <script>jQuery(window).load(function() {
     setTimeout(function(){
     jQuery("input[name=Email]").val("[email protected]");
     }, 1000);
    });</script>
</body>
"""
# Match .val("...") and capture a loosely email-shaped argument.
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL)
soup = BeautifulSoup(data, "html.parser")

# The same pattern first selects the right <script> element, then extracts the value.
# (The string= argument was spelled text= before Beautiful Soup 4.4.)
script = soup.find("script", string=pattern)
if script:
    match = pattern.search(script.text)
    if match:
        email = match.group(1)
        print(email)

Prints: [email protected].
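To make the "other script tags" case concrete, here is a minimal sketch with three script elements where only one sets the Email field. The page content, the variable names, and the address user@example.com are all made up for illustration; they are not from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page with several <script> tags; only one sets the Email field.
# user@example.com is a placeholder address, not the one from the question.
html = """
<body>
    <script>var tracking = {"id": 42};</script>
    <script>jQuery(window).load(function() {
        setTimeout(function(){
            jQuery("input[name=Email]").val("user@example.com");
        }, 1000);
    });</script>
    <script>console.log("ready");</script>
</body>
"""

pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);')
soup = BeautifulSoup(html, "html.parser")

# Without a filter, every <script> element is returned.
all_scripts = soup.find_all("script")

# With string=pattern, only elements whose text the pattern matches are kept.
matching = soup.find_all("script", string=pattern)

# Reuse the same pattern to pull the captured value out of each match.
emails = [pattern.search(tag.string).group(1) for tag in matching]
print(len(all_scripts), len(matching), emails)
```

Using one compiled pattern for both steps keeps the selection rule and the extraction rule from drifting apart.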

Here we are using a simple regular expression for the email address. We could go further and make it stricter, but I doubt that is actually needed for this problem.
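For anyone who does want something stricter, a rough sketch of what that could look like follows. The STRICT_EMAIL pattern is my own assumption of a reasonable tightening, not a full RFC 5322 validator, and the script text and addresses are invented for the comparison.

```python
import re

# Assumed stricter pattern (illustrative only, not a complete validator):
# a local part of letters, digits and ._%+-, then a dotted domain that ends
# in a top-level label of at least two letters.
STRICT_EMAIL = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}'
strict_pattern = re.compile(r'\.val\(\s*"(' + STRICT_EMAIL + r')"\s*\)')

# The loose pattern: anything shaped like x@y.z with no extra "@".
loose_pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);')

# Hypothetical script text; the second value contains "@" and "." but is
# clearly not an email address.
script_text = '''jQuery("input[name=Email]").val("user@example.com");
jQuery("input[name=Note]").val("see @ home. ok");'''

loose_hits = loose_pattern.findall(script_text)
strict_hits = strict_pattern.findall(script_text)
print(loose_hits)
print(strict_hits)
```

The loose pattern accepts both values, while the stricter one keeps only the address-shaped one. Whether that extra strictness is worth it depends on how messy the real pages are.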

Source

2016-07-24 07:22:39 alecxe

It is not possible using BeautifulSoup alone, but you can do it with, for example, BS + regular expressions:

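import re
from bs4 import BeautifulSoup as BS

html = """<script> ... </script>"""

# Name the parser explicitly to avoid the "no parser specified" warning.
bs = BS(html, "html.parser")

txt = bs.script.get_text()

# re.search scans the whole (possibly multi-line) script; re.match with a
# leading ".+" would only succeed if .val(...) sat on the first line.
email = re.search(r'\.val\("(.+?)"\);', txt).group(1)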

Or like this:

... 

email = txt.split('.val("')[1].split('");')[0]
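Both one-liners assume the .val("...") call is always there and easy to reach. A small sketch of the two failure modes and one way to guard against them is below. The helper name extract_val_argument, the sample script text, and the address user@example.com are hypothetical.

```python
import re

# Hypothetical multi-line script body; user@example.com is a placeholder.
txt = '''jQuery(window).load(function() {
    jQuery("input[name=Email]").val("user@example.com");
});'''

# re.match anchors at position 0 and "." stops at newlines, so a leading
# ".+" cannot reach a .val(...) call on a later line.
anchored = re.match(r'.+val\("(.+?)"\);', txt)

# re.search scans the whole string, so no leading ".+" is needed.
scanned = re.search(r'\.val\("(.+?)"\);', txt)


def extract_val_argument(script_text, start='.val("', end='");'):
    """Return the text between the first start marker and the next end
    marker, or None when either marker is missing."""
    _, found_start, rest = script_text.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    return value if found_end else None


print(anchored)
print(scanned.group(1))
print(extract_val_argument(txt))
print(extract_val_argument('console.log("no assignment here");'))
```

str.partition never raises on a missing marker, whereas indexing the result of split with [1] raises IndexError, so the helper can return None and leave the decision to the caller.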

Source

2016-07-24 01:34:18 Bob
