
I'm scraping massage therapists' names, along with their addresses, from a directory. The addresses are all saved into a single column of the CSV, but each therapist's title/name ends up spread over 2 or 3 columns, one word per column. How do I save the string in Python so that the whole name goes into one column?

What do I need to change so that the extracted name is saved in a single column, the same way the address string is? (The first two lines of code are example HTML from the page; the block of code after that is the part of the script that extracts this element.)

<span class="name"> 
    <img src="/images/famt-placeholder-sm.jpg" class="thumb" alt="Tiffani D Abraham"> Tiffani D Abraham</span> 


import mechanize 
from lxml import html 
import csv 
import io 
from time import sleep 

def save_products(products, writer): 

    for product in products: 

        for price in product['prices']: 
            writer.writerow([product["title"].encode('utf-8')]) 
            writer.writerow([price["contact"].encode('utf-8')]) 
            writer.writerow([price["services"].encode('utf-8')]) 

f_out = open('mtResult.csv', 'wb') 
writer = csv.writer(f_out) 

links = ["https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=2&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=3&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=4&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=5&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=6&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=7&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=8&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=9&PageSize=10", 
         "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=10&PageSize=10"] 

br = mechanize.Browser()  

for link in links: 

    print(link) 
    r = br.open(link) 

    content = r.read() 

    products = []   
    tree = html.fromstring(content)   
    product_nodes = tree.xpath('//ul[@class="famt-results"]/li') 

    for product_node in product_nodes: 

        product = {} 

        price_nodes = product_node.xpath('.//a') 

        product['prices'] = [] 
        for price_node in price_nodes: 

            price = {} 
            try: 
                product['title'] = product_node.xpath('.//span[1]/text()')[0] 
            except: 
                product['title'] = "" 

            try: 
                price['services'] = price_node.xpath('./span[2]/text()')[0] 
            except: 
                price['services'] = "" 

            try: 
                price['contact'] = price_node.xpath('./span[3]/text()')[0] 
            except: 
                price['contact'] = "" 

            product['prices'].append(price) 
        products.append(product) 
    save_products(products, writer) 

f_out.close() 

Please add a sample of the data to your question; it will be much easier to understand what you mean. – LetzerWille


@LetzerWille This is the page I'm extracting from: 'https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY' - what's happening is that every therapist takes up 3 rows of the CSV, going down in the order name, address, specialty. The address and specialty are only ever saved in column A, but the name gets spread across columns B, C and D, with one word in each. I've posted the whole script. – McLeodx


I've realized the problem is that the data for product["title"] is a string rather than a list (unlike the data for 'services' and 'contact', which are both lists). I know I need to change whatever is making it expect a list instead of a string, but I'm not sure which part of the code needs adjusting. – McLeodx
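That matches how csv.writer.writerow behaves: it writes one cell per element of the sequence it is given, so a list of name fragments (or, at worst, the individual characters of a bare string) gets spread across several columns, while a single joined string stays in one cell. A minimal sketch of the difference, using the name from the HTML snippet above:

import csv 
import sys 

writer = csv.writer(sys.stdout) 

# a list of text fragments: writerow() writes one column per element 
writer.writerow(["Tiffani", "D", "Abraham"]) 
# -> Tiffani,D,Abraham 

# joining the fragments first keeps the whole name in a single column 
writer.writerow([" ".join(["Tiffani", "D", "Abraham"])]) 
# -> Tiffani D Abraham 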

Answer


I'm not sure whether this solves the problem you're running into, but either way here are a few improvements and modifications you might be interested in.

For example, since every link is indexed by a page number, you can loop through the different links easily instead of copying all 50 of them down into a list (see the sketch just below). Each therapist also has their own index on every page, so you can loop through the xpaths for each therapist's information as well.
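A minimal sketch of that idea, assuming that PageIndex=1 with PageSize=10 returns the same first page as the plain results URL (the full script below builds each link the same way, just inside the loop):

base = "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY" 

# one URL per results page, built from the page index instead of hard-coded 
links = [base + "&PageIndex=" + str(i) + "&PageSize=10" for i in range(1, 51)] 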

#import modules 
import mechanize 
from lxml import html 
import csv 
import io 

#open browser 
br = mechanize.Browser() 

#create file headers 
titles = ["NAME"] 
services = ["TECHNIQUE(S)"] 
contacts = ["CONTACT INFO"] 

#loop through all 50 webpages for therapist data 
for link_index in range(1, 51): 

    link = "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=" + str(link_index) + "&PageSize=10" 
    r = br.open(link) 
    page = r.read() 
    tree = html.fromstring(page) 

    #loop through the data for each of the 10 therapists per page 
    for therapist_index in range(1, 11): 

        #store names (join the text fragments so the full name stays in one cell) 
        title = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[1]/text()') 
        titles.append(" ".join(title)) 

        #store techniques and convert to unicode 
        service = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[2]/text()') 
        try: 
            services.append(service[0].encode("utf-8")) 
        except: 
            services.append(" ") 

        #store contact info and convert to unicode 
        contact = tree.xpath('//*[@id="content"]/div[2]/ul[1]/li[' + str(therapist_index) + ']/a/span[3]/text()') 
        try: 
            contacts.append(contact[0].encode("utf-8")) 
        except: 
            contacts.append(" ") 

#open file to write to 
f_out = open('mtResult.csv', 'wb') 
writer = csv.writer(f_out) 

#get rows in correct format 
rows = zip(titles, services, contacts) 

#write csv line by line 
for row in rows: 
    writer.writerow(row) 
f_out.close() 

The script loops through all 50 of the result pages and appears to scrape all the relevant information for each therapist, where it is provided. At the end it writes everything out to the csv, with the data stored under the corresponding 'NAME', 'TECHNIQUE(S)' and 'CONTACT INFO' columns, if that is what you were originally going for.
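The columns line up because zip() pairs the i-th entry of each per-column list into a single row tuple before it is written out. A tiny sketch with sample values (only the name comes from the question's HTML; the technique and contact strings are made up for illustration):

titles = ["NAME", "Tiffani D Abraham"] 
services = ["TECHNIQUE(S)", "Swedish Massage"]   # hypothetical sample value 
contacts = ["CONTACT INFO", "New York, NY"]      # hypothetical sample value 

for row in zip(titles, services, contacts): 
    print(row) 
# ('NAME', 'TECHNIQUE(S)', 'CONTACT INFO') 
# ('Tiffani D Abraham', 'Swedish Massage', 'New York, NY') 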

Hope this helps!
