2017-08-09

I want to scrape the links at https://www.panpages.my/search_results?q= . How do I return only the href links that start with "listings" using Python?

I wrote a Python script that gets all the links on each page; I want to filter them so that only the links starting with "\Listings" remain.

Please find my script below and help me:

import csv
import requests
from bs4 import BeautifulSoup

def simple_web_scrapper(url):
    # Fetch one page and print every href inside the listing containers
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for root in soup.find_all('div', {'class': 'mid_section col-xs-10 col-sm-7 tmargin xs-nomargin'}):
        for link in root.find_all('a'):
            print(link.get('href'))

f = open('paylinks.csv', 'w', newline='')    # output file; nothing is written to it yet
writer = csv.writer(f)

with open('D:/Mine/Python/Projects/Freelancer/seekProgramming/rootpages.csv', newline='') as dataFile:
    csvReader = csv.reader(dataFile)
    for row in csvReader:                    # one start URL per row
        simple_web_scrapper(row[0])
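
As a side note, str.startswith is case-sensitive and URL paths use forward slashes, so a prefix like "\Listings" will not match hrefs such as /listings/12345. A minimal illustration (the href value is hypothetical):

href = '/listings/12345'                      # hypothetical href scraped from the page
print(href.startswith('\\Listings'))          # False: wrong slash direction and case
print(href.lower().startswith('/listings'))   # True after normalizing the case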

It works fine. Thank you very much – Saranaone

Answer

def simple_web_scrapper(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for root in soup.find_all('div', {'class': 'mid_section col-xs-10 col-sm-7 tmargin xs-nomargin'}):
        for link in root.find_all('a'):
            href = link.get('href')
            # URL paths start with '/', not '\', and the comparison is case-sensitive
            if href and href.startswith('/listings'):    # that's the row you need
                print(href)

for row in csvReader:                        # csvReader from the question's script
    simple_web_scrapper(row[0])
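
The question's script also opens paylinks.csv with a csv.writer but never writes to it. Below is a minimal end-to-end sketch that saves the matched links, using urljoin to turn the site-relative hrefs into absolute URLs; the file names and the '/listings' prefix are assumptions carried over from the snippets above:

import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_listing_links(url):
    # Return the hrefs under the mid_section divs that start with '/listings'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    return [link['href']
            for root in soup.find_all('div', {'class': 'mid_section col-xs-10 col-sm-7 tmargin xs-nomargin'})
            for link in root.find_all('a', href=True)
            if link['href'].startswith('/listings')]

with open('rootpages.csv', newline='') as infile, \
     open('paylinks.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):                       # one start URL per row
        base_url = row[0]
        for href in scrape_listing_links(base_url):
            writer.writerow([urljoin(base_url, href)])   # write the absolute URL

urljoin is used because the scraped hrefs are relative to the site root; joining them with the page URL produces links that can be requested directly.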

@Saranaone, any feedback? Was my answer helpful? –