2017-02-19 65 views
0

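I want to take a URL from user input in Python

I want to get the website URL from user input, along with the maximum number of pages the user wants to crawl on that site, but I haven't found a solution. Here is my code: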

import requests
from bs4 import BeautifulSoup


def web_crawler(max_pages, url):
    # Request url + "1", url + "2", ... up to max_pages.
    page = 1
    while page <= max_pages:
        url4 = str(url) + str(page)
        url_get = requests.get(url4)
        plain_text = url_get.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # Print the target of every <a rel="bookmark"> link on the page.
        for a in soup.find_all('a', {'rel': 'bookmark'}):
            href = a.get('href')
            title = a.string
            # print(title)
            print(href)
            # info_about_web_pages(href)
        page += 1


def info_about_web_pages(url):
    # Collect the distinct href values of all links on a single page.
    url_get = requests.get(url)
    plain_text = url_get.text
    soup = BeautifulSoup(plain_text, "html.parser")
    links = set()
    for about in soup.find_all('a'):
        links.add(about.get('href'))
    print(links)


# Read the inputs only after both functions are defined.
url1 = input("Enter url you want to crawl:")
max_pages1 = int(input("Enter no. of pages you want to crawl:"))
web_crawler(max_pages1, url1)

It shows nothing in the output.

+0

Do you have an example of a URL you want to do this for? Are you sure that anchors with the attribute 'rel': 'bookmark' are present in its source code? –

+0

Yes, the links are in anchors with rel: bookmark. The URL is http://www.fonearena.com/blog/ – Trunks
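For reference, the loop in the question appends the page number directly to whatever URL was entered. A minimal sketch of just that step, using the URL given in the comments; the helper name `build_page_urls` is hypothetical and not part of the original code. Whether the site actually serves listing pages at these addresses is not confirmed here and is worth checking in a browser.

```python
def build_page_urls(url, max_pages):
    """Return the addresses the question's loop would request: url + "1", url + "2", ..."""
    return [str(url) + str(page) for page in range(1, max_pages + 1)]


# Show the first three addresses built from the URL mentioned in the comments.
for page_url in build_page_urls("http://www.fonearena.com/blog/", 3):
    print(page_url)
```

Printing the constructed addresses before requesting them makes it easy to see whether they match the site's real pagination scheme.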

Answers

1

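If the HTML source contains no anchors with the attribute you are searching for, this will always print nothing. Try printing soup.prettify() to see whether the tags you are looking for are actually there. When I get no values printed, it is usually because the elements do not have the attribute I am searching for.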
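The check described above can be sketched roughly as follows. This is only an illustration: `SAMPLE_HTML`, `bookmark_links` and `report` are hypothetical names, and the sample markup stands in for the text that `requests.get(...).text` would return, so the snippet runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h2><a href="https://example.com/post-1" rel="bookmark">Post 1</a></h2>
  <h2><a href="https://example.com/post-2" rel="bookmark">Post 2</a></h2>
  <a href="https://example.com/about">About</a>
</body></html>
"""


def bookmark_links(html):
    """Return the href of every <a> tag whose rel attribute includes 'bookmark'."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a", {"rel": "bookmark"})]


def report(html):
    """Print the matching links, or dump the parsed markup when nothing matches."""
    links = bookmark_links(html)
    if links:
        for href in links:
            print(href)
    else:
        # Nothing matched: show what was actually parsed so it can be inspected.
        print("No <a rel='bookmark'> tags found; parsed document follows:")
        print(BeautifulSoup(html, "html.parser").prettify())
    return links


report(SAMPLE_HTML)
```

If the fallback branch fires on a real page, the printed markup shows which attributes the anchors really carry, and the filter passed to `find_all` can be adjusted to match.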

+0

Where do I put soup.prettify() in the above code? – Trunks

+0

On the line after 'soup = BeautifulSoup(plain_text, "html.parser")', put 'print(str(soup.prettify()))'. –