Getting social network information from HTML content

I am doing research on processing news text from the Internet, so I am writing a program that fetches news items from their URLs and stores them in a database.

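For example, this is a random news URL (from a Spanish-language news website). I use BeautifulSoup to get the HTML content and, after some simple processing, I obtain the news title, summary, content, category, and more information about the news item.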

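However, as you can see in the news item I am using as an example, there is also some "social network" information (to the right of the news image):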

Number of recommendations (Facebook)
Number of tweets (Twitter)
Number of +1s (Google+)

I would like to get this information too, so I tried to process that part of the HTML content, but it is not there! This is what I did:

>>> import urllib 
>>> from BeautifulSoup import BeautifulSoup as Soup 
>>> news = urllib.urlopen('http://elcomercio.pe/mundo/1396187/noticia-horror-eeuu-cinco-ninos-muertos-deja-tiroteo-escuela-religiosa') 
>>> soup = Soup(news.read()) 
>>> sociales = soup.findAll('ul', {'class': 'sociales'})[0].findAll('li') 
>>> len(sociales) 
3

This is the HTML content of the Facebook part:

>>> sociales[0] # facebook 
<li class="top"> 
<div class="fb-plg"> 
<div id="fb-root"></div> 
<script>(function(d, s, id) { 
    var js, fjs = d.getElementsByTagName(s)[0]; 
    if (d.getElementById(id)) {return;} 
    js = d.createElement(s); js.id = id; 
    js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=224939367568467"; 
    fjs.parentNode.insertBefore(js, fjs); 
}(document, 'script', 'facebook-jssdk'));</script> 
<div class="fb-like" data-href="http://elcomercio.pe/noticia/1396187/horror-eeuu-cinco-ninos-muertos-deja-tiroteo-escuela-religiosa" data-send="false" data-layout="box_count" data-width="70" data-show-faces="false" data-action="recommend"></div></div></li>

The Twitter part:

>>> sociales[1] # twitter 
<li><a href="https://twitter.com/share" class="twitter-share-button" data-count="vertical" data-via="elcomercio" data-lang="es">Tweet</a><script type="text/javascript" src="//platform.twitter.com/widgets.js"></script></li>

The Google+ part:

>>> sociales[2] # google+ 
<li><script type="text/javascript" src="https://apis.google.com/js/plusone.js"> 
    {lang: 'es'} 
</script><g:plusone size="tall"></g:plusone></li>

As you can see, the information I am looking for is not included in the HTML content, so I guess it is obtained through some API call.

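So my question is: is there any way to get the information I am looking for (number of Facebook recommendations, number of tweets, number of +1s) from the HTML content of a given news item?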

2012-04-03 juliomalegria

Here is my solution. I am posting it because maybe someday someone will have the same problem. I followed @Hoff's suggestion and used phantomjs.

First I installed it (Linux, Windows, or MacOS, it doesn't matter). You just need to be able to run it as a command from your prompt/console, like this:

phantomjs file.js

Here is the phantomjs installation guide.

Then I wrote a simple script that receives a URL and returns a BeautifulSoup object (after executing all the JavaScript):

import os 
import os.path 
import hashlib 
import subprocess 
from BeautifulSoup import BeautifulSoup 

PHANTOM_DIR = os.path.join(os.getcwd(), 'phantom') 

try: 
    os.stat(PHANTOM_DIR) 
except OSError: 
    os.mkdir(PHANTOM_DIR) 

PHANTOM_TEMPLATE = """var page = require('webpage').create(); 
page.open('%(url)s', function (status) { 
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var p = page.evaluate(function() {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        console.log(p);
    }
    phantom.exit(); 
});""" 

def get_executed_soup(url):
    """Returns a BeautifulSoup object with the parsed HTML of the url
    passed, after executing all the scripts in it."""
    # Name the temporary files after a hash of the URL so they do not clash
    file_id = hashlib.md5(url).hexdigest()
    PHANTOM_ABS_PATH = os.path.join(PHANTOM_DIR, 'phantom%s.js' % file_id)
    OUTPUT_ABS_PATH = os.path.join(PHANTOM_DIR, 'output%s.html' % file_id)
    # Write the phantomjs script for this URL
    phantom = open(PHANTOM_ABS_PATH, 'w')
    phantom.write(PHANTOM_TEMPLATE % {'url': url})
    phantom.close()
    # Run phantomjs, redirect the rendered HTML to a file, and wait for it to finish
    cmd = 'phantomjs ' + PHANTOM_ABS_PATH + ' > ' + OUTPUT_ABS_PATH
    stdout, stderr = subprocess.Popen(cmd, shell=True).communicate()
    # Parse the rendered HTML
    output = open(OUTPUT_ABS_PATH, 'r')
    soup = BeautifulSoup(output.read())
    output.close()
    # Clean up the temporary files
    os.remove(PHANTOM_ABS_PATH)
    os.remove(OUTPUT_ABS_PATH)
    return soup

That's it!

PS: I have only tested this on Linux, so if anyone tries it on Windows and/or MacOS, please share your "experience". Thanks :)

PS 2: I have also tested it on Windows, and it works like a charm!

I also posted this on my personal blog :)
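
One caveat about the script above: the URL is pasted between single quotes in the phantomjs template, so a URL containing a quote character would break the generated JavaScript. The following is only a sketch of one possible way around that, assuming Python 3. `SAFE_TEMPLATE` and `build_phantom_script` are names made up for this illustration and are not part of the original script.

```python
import json

# Hypothetical variant of the template: the URL slot has no quotes around it,
# because the quotes are supplied by json.dumps below.
SAFE_TEMPLATE = """var page = require('webpage').create();
page.open(%(url)s, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        console.log(page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        }));
    }
    phantom.exit();
});"""


def build_phantom_script(url):
    """Return the phantomjs script text for the given URL."""
    # json.dumps produces a double-quoted, escaped string literal,
    # which is also a valid JavaScript string literal.
    return SAFE_TEMPLATE % {'url': json.dumps(url)}


script = build_phantom_script("http://example.com/it's-a-test")
print(script.splitlines()[1])
```

The returned text could be written to the temporary `.js` file in place of `PHANTOM_TEMPLATE % {'url': url}`.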

2012-04-12 17:51:31 juliomalegria

Good stuff, thanks for posting! – Hoff 2012-04-15 16:07:10

The client you are using (urllib) does not execute any JavaScript, which most social plugins rely on to display the data you want.
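
To illustrate the point, here is a small sketch (assuming Python 3 and only the standard library) that parses a shortened copy of the Facebook markup from the question. `WidgetFinder` is a hypothetical helper written for this example. It shows that the static HTML carries only the widget's configuration attributes, with no count inside the element.

```python
from html.parser import HTMLParser

# Static markup as returned without JavaScript, shortened from the question.
STATIC_HTML = '''
<li class="top"><div class="fb-plg">
<div class="fb-like" data-href="http://elcomercio.pe/noticia/1396187/horror-eeuu-cinco-ninos-muertos-deja-tiroteo-escuela-religiosa" data-layout="box_count" data-action="recommend"></div>
</div></li>
'''


class WidgetFinder(HTMLParser):
    """Collect the attributes of the fb-like div and any text inside it."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.attrs = {}
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'fb-like':
            self.inside = True
            self.attrs = attrs

    def handle_endtag(self, tag):
        if tag == 'div':
            self.inside = False

    def handle_data(self, data):
        # Record non-blank text that appears inside the widget placeholder
        if self.inside and data.strip():
            self.text.append(data.strip())


finder = WidgetFinder()
finder.feed(STATIC_HTML)
print(finder.attrs['data-href'])  # the URL the widget is configured for
print(finder.text)                # text found inside the placeholder div
```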

What you need is a client that is able to run JavaScript. phantomjs is a good choice, and here's a good explanation on how to do just what you want.

2012-04-03 16:17:07 Hoff

Is there any phantomjs Python module? – juliomalegria 2012-04-03 17:40:50

There used to be PyPhantomJs, but it has been discontinued. For simple use cases, you can simply use subprocess to run the phantomjs Linux command. – Hoff 2012-04-05 09:16:54
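
The subprocess approach mentioned in the comment above could look roughly like the following sketch, assuming Python 3.5 or later. `run_and_capture` is a hypothetical helper name. Passing the command as an argument list avoids going through the shell, and capturing stdout directly avoids a temporary output file. The Python interpreter is used as a stand-in command so the example runs even without phantomjs installed.

```python
import subprocess
import sys


def run_and_capture(argv, timeout=60):
    """Run a command given as an argument list and return its stdout as text."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode the output to str
        timeout=timeout,          # give up if the command hangs
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout


# With phantomjs on the PATH, the call would be along the lines of:
#     html = run_and_capture(['phantomjs', 'file.js'])
# Stand-in command for demonstration purposes only:
print(run_and_capture([sys.executable, '-c', 'print("ok")']))
```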
