2013-09-05 39 views
1

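Re-reading a CSV file in Python without loading it again

I wrote the code below, but I'd like to improve it. I don't want to re-read the file, but if I remove sales_input.seek(0), it doesn't iterate through every row in sales. How can I improve this?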

def computeCritics(mode, cleaned_sales_input="data/cleaned_sales.csv"):
    if mode == 1:
        print "creating customer.critics.recommendations"
        critics_output = open("data/customer/customer.critics.recommendations",
                              "wb")
        ID = getCustomerSet(cleaned_sales_input)
        sales_dict = pickle.load(open("data/customer/books.dict.recommendations",
                                      "r"))
    else:
        print "creating books.critics.recommendations"
        critics_output = open("data/books/books.critics.recommendations",
                              "wb")
        ID = getBookSet(cleaned_sales_input)
        sales_dict = pickle.load(open("data/books/users.dict.recommendations",
                                      "r"))
    critics = {}
    # make critics dict and pickle it
    for i in ID:
        with open(cleaned_sales_input, 'rb') as sales_input:
            sales = csv.reader(sales_input)  # read new
            for j in sales:
                if mode == 1:
                    if int(i) == int(j[2]):
                        sales_dict[int(j[6])] = 1
                else:
                    if int(i) == int(j[6]):
                        sales_dict[int(j[2])] = 1
            critics[int(i)] = sales_dict
    pickle.dump(critics, critics_output)
    print "done"

cleaned_sales_input looks like this:

6042772,2723,3546414,9782072488887,1,9.99,314968 
6042769,2723,3546414,9782072488887,1,9.99,314968 
... 

where number 6 is the book ID and number 0 is the customer ID.

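I'd like to end up with a dictionary that looks like this (the first form when keyed by customer, the second when keyed by book):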

critics = { 
    CustomerID1: { 
     BookID1: 1, 
     BookID2: 0, 
     ........ 
     BookIDX: 0 
    }, 
    CustomerID2: { 
     BookID1: 0, 
     BookID2: 1, 
     ... 
    } 
} 

critics = { 
    BookID1: { 
     CustomerID1: 1, 
     CustomerID2: 0, 
     ........ 
     CustomerIDX: 0 
    }, 
    BookID2: {
     CustomerID1: 0, 
     CustomerID2: 1, 
     ... 
     CustomerIDX: 0 
    } 
} 

I hope this isn't too much information.

+0

Have you profiled this to see whether reading the CSV is actually the bottleneck? – RickyA

+0

Sorry, what is profiling? I've never heard of it. –

+0

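A [profiler](http://docs.python.org/2/library/profile.html) shows how much time each part of your code takes. You can use it to identify the bottlenecks in your code. Optimizing before profiling is almost useless, because you don't know where the bottleneck is. So reading the file may not be the bottleneck here at all. – RickyA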
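For readers who have not used a profiler, here is a minimal sketch of the idea using the standard-library cProfile and pstats modules. The function slow_lookup and the sample rows are hypothetical stand-ins for the nested loop in the question, and the sketch is written for Python 3 rather than the Python 2 used in the thread.

```python
import cProfile
import io
import pstats


def slow_lookup(ids, rows):
    """Hypothetical stand-in for the nested loop in the question."""
    hits = 0
    for i in ids:
        for row in rows:
            if i == row[2]:
                hits += 1
    return hits


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


# Made-up rows: column 2 cycles through the values 0..49.
rows = [(n, 0, n % 50) for n in range(2000)]
result, report = profile_call(slow_lookup, range(50), rows)
print(report)
```

The report lists each function with its call count and time, which is usually enough to tell whether the file reading or the looping dominates.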

Answers

2

Here are a few suggestions:

Let's first look at this code pattern:

for i in ID:
    for j in sales:
        if int(i) == int(j[2]):

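Notice that i is only ever compared with j[2]. That is its sole purpose in the loops. For any given row j, int(i) == int(j[2]) can be True for at most one i, because the IDs are distinct.

So we can remove the for i in ID loop entirely by rewriting it as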

for j in sales: 
    key = j[2] 
    if key in ID: 

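Based on the function names getCustomerSet and getBookSet, it sounds as if ID is a set (rather than a list or tuple). We want ID to be a set, because testing membership in a set is O(1) (as opposed to O(n) for a list or tuple).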
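As a rough illustration of this rewrite, the sketch below runs both patterns on a few made-up rows and shows that they select the same rows. The sample data and the names ROWS, nested_matches and single_pass_matches are assumptions made for this example, and it is written for Python 3 rather than the Python 2 used in the question.

```python
import csv
import io

# Made-up sample in the same layout as cleaned_sales.csv (hypothetical values).
ROWS = (
    "1,2723,100,9782072488887,1,9.99,500\n"
    "2,2723,101,9782072488887,1,9.99,501\n"
    "3,2723,100,9782072488887,1,9.99,502\n"
    "4,2723,999,9782072488887,1,9.99,503\n"
)
ID = {100, 101}  # a set, so `key in ID` is a constant-time lookup on average


def nested_matches(ids, text):
    """Original pattern: compare every ID against every row."""
    rows = list(csv.reader(io.StringIO(text)))
    return sorted(int(j[0]) for i in ids for j in rows if int(i) == int(j[2]))


def single_pass_matches(ids, text):
    """Rewritten pattern: one pass over the rows, one set lookup per row."""
    reader = csv.reader(io.StringIO(text))
    return sorted(int(j[0]) for j in reader if int(j[2]) in ids)


print(nested_matches(ID, ROWS))
print(single_pass_matches(ID, ROWS))
```

Both calls return the first-column values of the rows whose column 2 is in ID, but the second version touches each row only once.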


Next, consider this line:

critics[int(i)] = sales_dict 

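There is a potential pitfall here. This line assigns sales_dict to critics[int(i)] for every i in ID, so every key int(i) is mapped to the very same dict. As we loop through sales and ID, we keep modifying sales_dict, for example with: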

sales_dict[int(j[6])] = 1 

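But this causes all the values in critics to be modified at once, because every key in critics points to the same dict, sales_dict. I doubt that is what you want.

To avoid this pitfall, we need to make copies of sales_dict: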

critics = {i:sales_dict.copy() for i in ID} 
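The difference between sharing one dict and copying it can be seen in a small sketch. The names template, shared and separate are made up for this example; template stands in for the unpickled sales_dict. Note that dict.copy() is a shallow copy, which is sufficient here only because the values are plain integers.

```python
template = {10: 0, 11: 0}  # stands in for the unpickled sales_dict
ids = [1, 2]

shared = {i: template for i in ids}           # every key -> the same dict object
separate = {i: template.copy() for i in ids}  # every key -> its own copy

shared[1][10] = 1    # also visible through shared[2] (and through template)
separate[1][11] = 1  # only affects separate[1]

print(shared[2][10])    # 1: the change leaked to the other key
print(separate[2][11])  # 0: the other key is untouched
print(shared[1] is shared[2], separate[1] is separate[2])  # True False
```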

Putting it all together:

import csv
import os
import pickle

def computeCritics(mode, cleaned_sales_input="data/cleaned_sales.csv"):
    if mode == 1:
        filename = 'customer.critics.recommendations'
        path = os.path.join("data/customer", filename)
        ID = getCustomerSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/customer/books.dict.recommendations", "r"))
        key_idx, other_idx = 2, 6
    else:
        filename = 'books.critics.recommendations'
        path = os.path.join("data/books", filename)
        ID = getBookSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/books/users.dict.recommendations", "r"))
        key_idx, other_idx = 6, 2

    print "creating {}".format(filename)
    ID = {int(item) for item in ID}
    # give every ID its own copy, so the entries do not share one dict
    critics = {i: sales_dict.copy() for i in ID}
    # read the sales file exactly once
    with open(cleaned_sales_input, 'rb') as sales_input:
        sales = csv.reader(sales_input)
        for j in sales:
            key = int(j[key_idx])
            if key in ID:
                other_key = int(j[other_idx])
                critics[key][other_key] = 1
    # pickle the finished critics dict
    with open(path, "wb") as critics_output:
        pickle.dump(critics, critics_output)
    print "done"

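For anyone on Python 3, the same single-pass idea can be sketched without the pickle files. Everything below is an assumption made for illustration: the helper name build_critics, the in-memory sample rows, and the template dict that stands in for the unpickled sales_dict.

```python
import csv
import io


def build_critics(rows, ids, template, key_idx, other_idx):
    """Build {key: {other: 0 or 1}} in a single pass over rows."""
    ids = {int(item) for item in ids}
    # Each key gets its own copy of the template.
    critics = {i: template.copy() for i in ids}
    for row in rows:
        key = int(row[key_idx])
        if key in ids:
            critics[key][int(row[other_idx])] = 1
    return critics


# Made-up rows in the layout of cleaned_sales.csv (hypothetical values).
sample = io.StringIO(
    "1,2723,100,9782072488887,1,9.99,500\n"
    "2,2723,101,9782072488887,1,9.99,501\n"
    "3,2723,100,9782072488887,1,9.99,501\n"
)
template = {500: 0, 501: 0}
critics = build_critics(csv.reader(sample), {"100", "101"}, template, 2, 6)
print(critics)
```

Swapping the last two arguments (6 and 2) would key the result by the other column, which mirrors the two modes of computeCritics.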
+0

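Sorry for not adding this earlier, but I'd like the dictionary to look like c = {ID1: {book: 1, book: 0, ........ book: 0}, ID2: .....}. So do I have to do it this way, or am I just stuck? –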

+2

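There is no obvious relationship between your code and that dictionary. You need to fill in more details, such as what 'ID' equals and a sample of 'cleaned_sales_input', before your question can be answered. – unutbu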

+0

I've added more information; I hope it isn't too much ^^ –

0

@unutbu's answer is better, but if you insist on keeping this structure, you can load the whole file into memory:

import csv

with open(cleaned_sales_input, 'rb') as sales_input:
    # read every row into a list once, so it can be iterated repeatedly
    sales = list(csv.reader(sales_input))

for i in ID:
    for j in sales:
        pass  # do stuff
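The reason the question's loop stops working without seek(0) is that csv.reader consumes the underlying file as it iterates, so a second pass over the same reader yields nothing. The sketch below shows this with made-up data in Python 3; io.StringIO stands in for an open file.

```python
import csv
import io

# io.StringIO stands in for an open file; the rows are made up.
sales_input = io.StringIO("1,a\n2,b\n")
reader = csv.reader(sales_input)

first_pass = [row for row in reader]
second_pass = [row for row in reader]  # the reader is already exhausted
print(len(first_pass), len(second_pass))  # 2 0

sales_input.seek(0)  # rewind the underlying file
third_pass = [row for row in csv.reader(sales_input)]
print(len(third_pass))  # 2

rows = list(first_pass)  # rows kept in memory can be looped over repeatedly
passes = [len([r for r in rows]) for _ in range(3)]
print(passes)  # [2, 2, 2]
```

Holding the rows in a list trades memory for the ability to iterate as often as needed, which is reasonable as long as the file fits comfortably in RAM.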