2014-05-25

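Skipping certain characters in a CSV file

I'm writing a script to parse the NASDAQ file that lists every company under the Technology category. The file is a comma-separated CSV. However, a company's name is sometimes listed as XXX, Inc. That extra comma throws off my field indexing in the script, so it picks up the wrong values. I'm extracting the company ticker symbols, so the ', Inc.' shifts every position after it.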

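I'm fairly new to Python, so I don't have much experience, but I've been doing my best and have gotten the script to read and write CSV. This parsing problem is the part I'm stuck on. Here's what I have so far: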

try: 
    # py3 
    from urllib.request import Request, urlopen 
    from urllib.parse import urlencode 
except ImportError: 
    # py2 
    from urllib2 import Request, urlopen 
    from urllib import urlencode 

import csv 
import urllib.request 
import string 

def _request(): 
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download' 
    req = Request(url) 
    resp = urlopen(req) 
    content = resp.read().decode().strip() 
    content1 = content.replace('"', '') 
    return content1 

def symbol_quote(): 
    counter = 1 
    recursive = 9*counter 

    values = _request().split(',') 
    values2 = values[recursive] 
    return values2 
    counter += 1 


def csvwrite(): 
    import csv 
    path = "symbol_comp.csv" 
    data = [symbol_quote()] 
    parsing = False 

    with open(path, 'w', newline='') as csv_file: 
     writer = csv.writer(csv_file, delimiter=' ') 
     for line in data: 
      writer.writerow(line) 

I haven't made it loop and act on the counter yet, because there's no point right now. The parsing problem is more pressing.

Can anyone please help a newbie out?

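Whoa, stop. You're using 'csv.writer' to *write* your data, but not 'csv.reader' to *read* your data (which will handle escaped commas, that is, commas enclosed in quoted fields). – roippi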
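To make the comment's point concrete, here is a minimal sketch comparing a naive `split(',')` with `csv.reader` on a single quoted row. It assumes Python 3, and the sample row is a made-up stand-in modeled on the output shown in the answer, not the live NASDAQ download.

```python
import csv
import io

# Hypothetical sample row in the quoted format the download appears to use.
sample = '"VNET","21Vianet Group, Inc.","25.87","Technology"\n'

# Stripping the quotes and splitting on commas cuts the company name in two.
naive = sample.strip().replace('"', '').split(',')

# csv.reader honors the quotes, so the embedded comma stays inside the field.
parsed = next(csv.reader(io.StringIO(sample)))

print(len(naive), repr(naive[1]))
print(len(parsed), repr(parsed[1]))
```

The naive version yields one field too many and every index after the name is shifted by one, which matches the symptom described in the question.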

Answers

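Change _request() to wrap the response text in a StringIO object (cStringIO.StringIO() on Python 2) and pass it to csv.reader(), so that it returns a csv.reader object you can iterate over: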

try:
    # py3
    from urllib.request import Request, urlopen
    from io import StringIO
except ImportError:
    # py2
    from urllib2 import Request, urlopen
    from cStringIO import StringIO

import csv

def _request():
    url = 'http://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&render=download'
    req = Request(url)
    resp = urlopen(req)
    # Leave the double quotes in place: csv.reader relies on them to keep
    # a name such as "XXX, Inc." together as a single field.
    sio = StringIO(resp.read().decode().strip())
    reader = csv.reader(sio)
    return reader
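The parsing step can also be tried without any network access. The following is a minimal sketch, assuming Python 3; `parse_csv_text` is a hypothetical helper name and the sample text is made up for illustration.

```python
import csv
import io

def parse_csv_text(text):
    """Return a list of rows parsed from CSV text (hypothetical helper)."""
    # newline='' leaves line endings untouched for the csv module to interpret.
    return list(csv.reader(io.StringIO(text, newline='')))

# Made-up sample with the same quoting style as the rows shown in the answer.
text = '"Symbol","Name"\r\n"JOBS","51job, Inc."\r\n'
rows = parse_csv_text(text)
print(rows)
```

Keeping the download and the parsing in separate functions makes it easier to check the parsing logic against a small, known input.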

Usage:

data = _request()
# next(data) and print(...) with a single argument work on both py2 and py3
print('fields:\n{}\n'.format('|'.join(next(data))))
for n, row in enumerate(data):
    print('|'.join(row))
    if n == 5: break

# fields: 
# Symbol|Name|LastSale|MarketCap|ADR TSO|IPOyear|Sector|Industry|Summary Quote| 
# 
# VNET|21Vianet Group, Inc.|25.87|1137471769.46|43968758|2011|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/vnet| 
# TWOU|2U, Inc.|13.28|534023394.4|n/a|2014|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/twou| 
# DDD|3D Systems Corporation|54.4|5630941606.4|n/a|n/a|Technology|Computer Software: Prepackaged Software|http://www.nasdaq.com/symbol/ddd| 
# JOBS|51job, Inc.|64.32|746633699.52|11608111|2004|Technology|Diversified Commercial Services|http://www.nasdaq.com/symbol/jobs| 
# WUBA|58.com Inc.|37.25|2959078388.5|n/a|2013|Technology|Computer Software: Programming, Data Processing|http://www.nasdaq.com/symbol/wuba| 
# ATEN|A10 Networks, Inc.|10.64|638979699.12|n/a|2014|Technology|Computer Communications Equipment|http://www.nasdaq.com/symbol/aten| 
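Since the goal in the question is to collect ticker symbols, the rows from the reader can be passed straight to `csv.writer`. The following is a minimal sketch, assuming Python 3 and assuming the symbol is the first column, as in the output above. `write_symbols` is a hypothetical helper, and the sample text stands in for the reader returned by `_request()`.

```python
import csv
import io

def write_symbols(rows, out_file):
    """Write the first column (the ticker symbol) of each data row."""
    rows = iter(rows)
    next(rows, None)              # skip the header row, if there is one
    writer = csv.writer(out_file)
    count = 0
    for row in rows:
        if row:                   # ignore blank lines
            # writerow expects a sequence of fields; passing a bare string
            # would write each character as its own field.
            writer.writerow([row[0]])
            count += 1
    return count

# Made-up sample standing in for the downloaded data.
sample = (
    '"Symbol","Name","LastSale",\n'
    '"VNET","21Vianet Group, Inc.","25.87",\n'
    '"TWOU","2U, Inc.","13.28",\n'
)
out = io.StringIO()
n = write_symbols(csv.reader(io.StringIO(sample)), out)
print(n)
print(out.getvalue())
```

To write to a file instead, something like `with open("symbol_comp.csv", "w", newline="") as f: write_symbols(_request(), f)` should work on Python 3.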