2009-07-28 243 views
39

I want to capture a screenshot of an arbitrary website from Python. How can I take a screenshot/image of a website using Python?

Env: Linux

+4

A quick search of the site turns up many, many near-duplicates of this question. Here's a good place to start: http://stackoverflow.com/questions/713938/how-can-i-generate-a-screenshot-of-a-webpage-using-a-server-side-script – Shog9 2009-07-28 22:55:30

+0

Shog9: Thanks! Your link has some... will check it. – 2009-07-28 23:22:11

+0

Shog9: Why don't you add it as an answer? That way it can earn you points. – 2009-07-28 23:27:05

Answers

8

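On the Mac there's webkit2png, and on Linux + KDE you can use khtml2png. I've tried the former and it works quite well; I've heard of the latter being put to use.

I recently came across QtWebKit, which claims to be cross-platform (Qt rolled WebKit into their library, I guess). But I've never tried it, so I can't tell you much more.

The QtWebKit link shows how to access it from Python. You should at least be able to use subprocess to do the same with the other tools.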
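
The subprocess approach mentioned above could look roughly like this. It is a minimal sketch, not a tested recipe for either tool: the helper names `build_capture_command` and `run_capture` are made up for illustration, and the assumption that the tool takes the URL followed by the output file as its last two arguments must be checked against the tool's own help output.

```python
import subprocess


def build_capture_command(tool_args, url, output_file):
    """Return the argument list for a command-line screenshot tool.

    `tool_args` is the tool name plus any options. Appending
    `url output_file` at the end is an ASSUMPTION about the tool's
    command line, not something taken from its documentation.
    """
    return list(tool_args) + [url, output_file]


def run_capture(tool_args, url, output_file, timeout=60):
    """Run the tool; raises CalledProcessError if it exits non-zero."""
    command = build_capture_command(tool_args, url, output_file)
    # Passing a list (no shell=True) avoids quoting problems in URLs and paths.
    subprocess.run(command, check=True, timeout=timeout)
    return output_file
```

A call might then be `run_capture(["khtml2png"], "http://example.com", "example.png")`, assuming the tool is installed under that name and accepts its arguments in that order.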

0

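You don't mention what environment you're running in, which makes a big difference, because there is no pure-Python web browser capable of rendering HTML.

But if you're on a Mac, I've used webkit2png with great success. If not, as others have pointed out, there are plenty of options.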

5

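I can't comment on ars's answer, but I actually got Roland Tapken's code running with QtWebKit, and it works quite well.

I just wanted to confirm that what Roland posted on his blog works well on Ubuntu. Our production version ended up not using anything he wrote, but we are using the PyQt/QtWebKit bindings with much success.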

38

Here is a simple solution using WebKit: http://webscraping.com/blog/Webpage-screenshots-with-webkit/

import sys
import time
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication, QImage, QPainter
from PyQt4.QtWebKit import QWebView


class Screenshot(QWebView):
    def __init__(self):
        # a QApplication must exist before any widget is created
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def capture(self, url, output_file):
        self.load(QUrl(url))
        self.wait_load()
        # set the viewport to the full size of the web page
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        # render the page into an image
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        print('saving %s' % output_file)
        image.save(output_file)

    def wait_load(self, delay=0):
        # process app events until the page has loaded
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        self._loaded = True


s = Screenshot()
s.capture('http://webscraping.com', 'website.png')
s.capture('http://webscraping.com/blog', 'blog.png')
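
One thing to watch in the code above: `wait_load` keeps looping until `loadFinished` fires, so a page that never finishes loading would hang the script. A possible variation, sketched here in plain Python so it can be tried without Qt, is the same polling loop with a timeout. The helper name `wait_until` and its parameters are hypothetical and not part of PyQt.

```python
import time


def wait_until(predicate, process_events=lambda: None, timeout=30.0, poll=0.05):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    `process_events` is called on every iteration; with Qt this would be
    something like `app.processEvents`. Returns True if the predicate
    became true and False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            return False
        process_events()
        time.sleep(poll)
    return True
```

Inside the `Screenshot` class, the body of `wait_load` could then call `wait_until(lambda: self._loaded, self.app.processEvents)` and treat a `False` result as a failed capture.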
33

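Here is my solution, put together with help from various sources. It takes a full-page screenshot of a web page, optionally crops it, and can also generate a thumbnail from the cropped image. The requirements are:

    1. Install NodeJS.
    2. Install phantomjs using Node's package manager: npm -g install phantomjs
    3. Install selenium (in your virtualenv, if you are using one).
    4. Install ImageMagick.
    5. Add phantomjs to the system path (on Windows).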
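
Before running anything, it may help to confirm that these prerequisites can actually be found. The following is a small sketch using only the standard library; the function name `missing_prerequisites` is hypothetical, and the default names (`phantomjs` and `convert` as executables, `selenium` as a module) are assumptions based on the requirements listed above.

```python
import importlib.util
import shutil


def missing_prerequisites(executables=("phantomjs", "convert"),
                          modules=("selenium",)):
    """Return the names of executables and modules that cannot be found."""
    # shutil.which returns None when the executable is not on PATH
    missing = [name for name in executables if shutil.which(name) is None]
    # find_spec returns None when a top-level module is not importable
    missing += [name for name in modules
                if importlib.util.find_spec(name) is None]
    return missing
```

An empty list from `missing_prerequisites()` suggests the tools are on the path and the module is importable; it does not prove that the installed versions work together.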

    import os 
    from subprocess import Popen, PIPE 
    from selenium import webdriver 
    
    abspath = lambda *p: os.path.abspath(os.path.join(*p)) 
    ROOT = abspath(os.path.dirname(__file__)) 
    
    
    def execute_command(command):
        # run a shell command; anything it prints to stdout is treated as an error
        result = Popen(command, shell=True, stdout=PIPE).stdout.read()
        if len(result) > 0 and not result.isspace():
            raise Exception(result)


    def do_screen_capturing(url, screen_path, width, height):
        print("Capturing screen..")
        driver = webdriver.PhantomJS()
        # the service log file is saved in the same directory;
        # to store the log file elsewhere, initialize the driver as
        # driver = webdriver.PhantomJS(service_log_path='/var/log/phantomjs/ghostdriver.log')
        try:
            driver.set_script_timeout(30)
            if width and height:
                driver.set_window_size(width, height)
            driver.get(url)
            driver.save_screenshot(screen_path)
        finally:
            # shut down the PhantomJS process even if the capture fails
            driver.quit()
    
    
    def do_crop(params):
        print("Cropping captured image..")
        command = [
            'convert',
            params['screen_path'],
            '-crop', '%sx%s+0+0' % (params['width'], params['height']),
            params['crop_path']
        ]
        execute_command(' '.join(command))


    def do_thumbnail(params):
        print("Generating thumbnail from cropped captured image..")
        command = [
            'convert',
            params['crop_path'],
            '-filter', 'Lanczos',
            '-thumbnail', '%sx%s' % (params['width'], params['height']),
            params['thumbnail_path']
        ]
        execute_command(' '.join(command))
    
    
    def get_screen_shot(**kwargs):
        url = kwargs['url']
        width = int(kwargs.get('width', 1024))  # screen width to capture
        height = int(kwargs.get('height', 768))  # screen height to capture
        filename = kwargs.get('filename', 'screen.png')  # file name e.g. screen.png
        path = kwargs.get('path', ROOT)  # directory path to store screen

        crop = kwargs.get('crop', False)  # crop the captured screen
        crop_width = int(kwargs.get('crop_width', width))  # the width of crop screen
        crop_height = int(kwargs.get('crop_height', height))  # the height of crop screen
        crop_replace = kwargs.get('crop_replace', False)  # does crop image replace original screen capture?

        thumbnail = kwargs.get('thumbnail', False)  # generate thumbnail from screen, requires crop=True
        thumbnail_width = int(kwargs.get('thumbnail_width', width))  # the width of thumbnail
        thumbnail_height = int(kwargs.get('thumbnail_height', height))  # the height of thumbnail
        thumbnail_replace = kwargs.get('thumbnail_replace', False)  # does thumbnail image replace crop image?

        screen_path = abspath(path, filename)
        crop_path = thumbnail_path = screen_path

        if thumbnail and not crop:
            raise Exception('Thumbnail generation requires crop image, set crop=True')

        do_screen_capturing(url, screen_path, width, height)

        if crop:
            if not crop_replace:
                crop_path = abspath(path, 'crop_' + filename)
            params = {
                'width': crop_width, 'height': crop_height,
                'crop_path': crop_path, 'screen_path': screen_path}
            do_crop(params)

            if thumbnail:
                if not thumbnail_replace:
                    thumbnail_path = abspath(path, 'thumbnail_' + filename)
                params = {
                    'width': thumbnail_width, 'height': thumbnail_height,
                    'thumbnail_path': thumbnail_path, 'crop_path': crop_path}
                do_thumbnail(params)
        return screen_path, crop_path, thumbnail_path
    
    
    if __name__ == '__main__':
        # Requirements:
        #   Install NodeJS
        #   Using Node's package manager install phantomjs: npm -g install phantomjs
        #   install selenium (in your virtualenv, if you are using that)
        #   install imageMagick
        #   add phantomjs to system path (on windows)

        url = 'http://stackoverflow.com/questions/1197172/how-can-i-take-a-screenshot-image-of-a-website-using-python'
        screen_path, crop_path, thumbnail_path = get_screen_shot(
            url=url, filename='sof.png',
            crop=True, crop_replace=False,
            thumbnail=True, thumbnail_replace=False,
            thumbnail_width=200, thumbnail_height=150,
        )
    

    These are the generated images:
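
A note on the ImageMagick calls in the code above: the `convert` arguments are joined into one string and run through the shell, so a path containing spaces would be split into several arguments. One possible variation, sketched below, passes the argument list to `subprocess` directly. The function names `crop_command`, `thumbnail_command` and `run_convert` are hypothetical; the `convert` options are the same ones the original code uses.

```python
import subprocess


def crop_command(screen_path, crop_path, width, height):
    """Argument list for cropping a width x height region from the top-left corner."""
    geometry = "%dx%d+0+0" % (width, height)
    return ["convert", screen_path, "-crop", geometry, crop_path]


def thumbnail_command(crop_path, thumbnail_path, width, height):
    """Argument list for building a thumbnail with the Lanczos filter."""
    size = "%dx%d" % (width, height)
    return ["convert", crop_path, "-filter", "Lanczos",
            "-thumbnail", size, thumbnail_path]


def run_convert(argv):
    """Run ImageMagick without a shell; raises CalledProcessError on failure."""
    subprocess.run(argv, check=True)
```

With `check=True`, a failing `convert` is reported through its exit status rather than by inspecting what it printed.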

  • -1

    Try this..

    #!/usr/bin/env python
    # Note: this captures the whole desktop (the root window), not a single web page.
    import random
    import time

    import gtk.gdk

    while True:
        # pick a random wait between 120 and 300 seconds (2 to 5 minutes)
        random_time = random.randrange(120, 300)
        print("Next picture in: %.2f minutes" % (float(random_time) / 60))
        time.sleep(random_time)

        w = gtk.gdk.get_default_root_window()
        sz = w.get_size()
        print("The size of the window is %d x %d" % sz)

        pb = gtk.gdk.Pixbuf(gtk.gdk.COLORSPACE_RGB, False, 8, sz[0], sz[1])
        pb = pb.get_from_drawable(w, w.get_colormap(), 0, 0, 0, 0, sz[0], sz[1])

        # name the file after the current timestamp
        filename = "screenshot" + str(time.time()) + ".png"

        if pb is not None:
            pb.save(filename, "png")
            print("Screenshot saved to " + filename)
        else:
            print("Unable to get the screenshot.")