2016-05-02 177 views
0

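Python - web scraping problem

I'm running into a web scraping problem while trying to get a channel's title. I don't know how to fix it, but after some testing with the channel function it seems that video links work with it as well, whereas only channel links should work with the YoutubeChannel function.

Any ideas on how to solve this?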

#Required Modules
import urllib
import re
import sys #Needed for sys.exit() in the menu below

#Defining the YouTube Video function 
def YoutubeVideo(): 
    #Making videoLink equal to whatever the user enters as their video link 
    videoLink = input ('\nWhat is your video link? (In quotations, with http included)\n') 

    #Goes to the video URL, opens it and reads the HTML file 
    htmlfile = urllib.urlopen(videoLink) #Searches for this URL 
    htmltext = htmlfile.read() #Reads the HTML file and sets it to htmltext 

    #Setup for the view counter 
    regexView = "<div class=\"watch-view-count\">(.+?)</div>" #Searches for the view count number and sets it to regexView 
    pattern = re.compile(regexView) 
    viewCount = re.findall(pattern, htmltext) 

    #Setup for the video title 
    regexTitle = "<title>(.+?)</title>" #Searches for the title of the video 
    patternTitle = re.compile(regexTitle) 
    videoTitle = re.findall(patternTitle, htmltext) 

    #Setup for the video upload date 
    regexUpload = "<strong class=\"watch-time-text\">(.+?)</strong>" 
    patternUpload = re.compile(regexUpload) 
    videoUpload = re.findall(patternUpload, htmltext) 

    print ("\n%s" % (videoLink)) #Prints the video link, primarily for testing 
    print ("\nThe title of your video is %s and has %s views.\nIt was %s." % (videoTitle, viewCount, videoUpload)) #Prints the information about the video 


#Defining the YouTube Channel function 
def YoutubeChannel(): 
    #Making channelLink equal to whatever the user enters as their channel link
    channelLink = input ('\nWhat is your channel link? (In quotations, with http included)\n')

    #Goes to the channel URL, opens it and reads the HTML file
    htmlfile = urllib.urlopen(channelLink) #Opens this URL
    htmltext = htmlfile.read() #Reads the HTML file and sets it to htmltext

    #Setup for the channel name
    regexChannelTitle = "<title>(.+?)</title>" #Matches the page title, which holds the channel name
    patternChannelTitle = re.compile(regexChannelTitle)
    channelTitle = re.findall(patternChannelTitle, htmltext)

    print (channelTitle) 



ans = True 
while ans: 
    print ("\n[1] Get information regarding a YouTube video.") 
    print ("\n[2] Get information regarding a YouTube channel.") 
    print ("\n[Q] Quit the application.") 

    ans = raw_input("\nWhat would you like to do now? ")
    if ans == "1":
        YoutubeVideo()
    elif ans == "2":
        YoutubeChannel()
    elif ans.lower() == "q": #Accepts both "q" and "Q", matching the menu text
        sys.exit(0)
    elif ans != "":
        print ("Not a valid choice, try again.")
+4

At the very least, use 'BeautifulSoup' or something similar to parse the HTML rather than regular expressions. It is easy to install. – Pythonista

+3

Also, wouldn't it be easier to use the [Youtube API](https://developers.google.com/youtube/)? I'm sure there are plenty of scripts that would make your life easier. – patrick
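To make the API suggestion concrete, here is a minimal Python 3 sketch against the YouTube Data API v3 `channels` endpoint. It assumes you already have an API key and know the channel ID; the helper names (`build_channel_request`, `title_from_response`, `fetch_channel_title`) are made up for illustration and are not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint of the YouTube Data API v3 "channels" resource.
API_ENDPOINT = "https://www.googleapis.com/youtube/v3/channels"


def build_channel_request(channel_id, api_key):
    """Build the request URL, asking only for the 'snippet' part."""
    query = urlencode({"part": "snippet", "id": channel_id, "key": api_key})
    return API_ENDPOINT + "?" + query


def title_from_response(payload):
    """Return the channel title from a decoded JSON response, or None."""
    items = payload.get("items") or []
    if not items:
        return None
    return items[0].get("snippet", {}).get("title")


def fetch_channel_title(channel_id, api_key):
    """Perform the request (needs network access and a valid API key)."""
    with urlopen(build_channel_request(channel_id, api_key)) as response:
        return title_from_response(json.loads(response.read().decode("utf-8")))
```

Calling `fetch_channel_title("XXXXXX", "YOUR_API_KEY")` with a real channel ID and key should return the title as a string, and `None` if the API reports no matching channel.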

Answers

0

I'm not familiar with what you are using to parse the HTML content, but you can use BeautifulSoup, which is very easy:

import requests 
from bs4 import BeautifulSoup 

# channel url = https://www.youtube.com/channel/XXXXXX 

url = "your channel link" 
page = requests.get(url) 
plain_text = page.text 
soup = BeautifulSoup(plain_text,"html.parser") 
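span = soup.find('span',{'class' : 'qualified-channel-title-text'}) # span that wraps the channel title
title = soup.find('a',{'class' : 'spf-link branded-page-header-title-link yt-uix-sessionlink'}) # link inside that span
title = title.get('title') # the link's title attribute holds the channel name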
print(title) 

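As you can see, the whole title sits inside the HTML span tag I'm using, together with the link, a small picture, and the title text. The link uses the class "spf-link branded-page-header-title-link yt-uix-sessionlink", and I then read its title attribute :)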
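If you want to try the lookup without hitting the network, here is a rough Python 3 sketch of the same idea using only the standard library's `html.parser`. The `SAMPLE` markup is invented to mirror the structure described above, and `ChannelTitleParser` / `channel_title` are hypothetical names; YouTube's real markup may differ or change.

```python
from html.parser import HTMLParser

# One of the classes on the channel title link (taken from the answer above).
TITLE_LINK_CLASS = "branded-page-header-title-link"


class ChannelTitleParser(HTMLParser):
    """Remember the title attribute of the first matching <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.title is not None:
            return
        attributes = dict(attrs)
        # The class attribute may be missing or empty.
        classes = (attributes.get("class") or "").split()
        if TITLE_LINK_CLASS in classes:
            self.title = attributes.get("title")


def channel_title(html):
    """Return the channel title found in html, or None if there is none."""
    parser = ChannelTitleParser()
    parser.feed(html)
    return parser.title


# Invented sample markup, for illustration only.
SAMPLE = """
<span class="qualified-channel-title-text">
  <a class="spf-link branded-page-header-title-link yt-uix-sessionlink"
     href="/channel/XXXXXX" title="Example Channel">Example Channel</a>
</span>
"""

print(channel_title(SAMPLE))                                 # Example Channel
print(channel_title("<title>Some video - YouTube</title>"))  # None
```

Returning `None` when the link is missing means a page without the channel header can be rejected instead of raising an error.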

I hope this is helpful if you get to run it.

Note that you must install beautifulsoup and requests; both can be installed with pip.
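On the original complaint that video links are accepted by YoutubeChannel: one option is to check the URL before downloading anything. Below is a small Python 3 sketch; the host list and the `/channel/`, `/user/`, `/c/` path prefixes are my assumptions about what a channel link looks like, and `link_kind` is a made-up helper name.

```python
from urllib.parse import urlparse

# Assumed hosts and path prefixes; adjust them to the links you expect.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
CHANNEL_PREFIXES = ("/channel/", "/user/", "/c/")


def link_kind(link):
    """Classify a link as 'video', 'channel' or 'other' from its URL alone."""
    parts = urlparse(link)
    if parts.hostname == "youtu.be":
        return "video"  # short video links
    if parts.hostname not in YOUTUBE_HOSTS:
        return "other"
    if parts.path == "/watch":
        return "video"
    if parts.path.startswith(CHANNEL_PREFIXES):
        return "channel"
    return "other"
```

YoutubeChannel could then refuse to continue unless `link_kind(channelLink)` returns `"channel"`, and YoutubeVideo could do the same for `"video"`.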