使用大数据集优化循环Python

这是我第一次用Python做这么大的工作，所以我需要一些帮助。使用大数据集优化循环Python

我有一个MongoDB的（或Python字典）结构如下：

{ 
    "_id": { "$oid" : "521b1fabc36b440cbe3a6009" }, 
    "country": "Brazil", 
    "id": "96371952", 
    "latitude": -23.815124482000001649, 
    "longitude": -45.532670811999999216, 
    "name": "coffee", 
    "users": [ 
    { 
     "id": 277659258, 
     "photos": [ 
     { 
      "created_time": 1376857433, 
      "photo_id": "525440696606428630_277659258", 
     }, 
     { 
      "created_time": 1377483144, 
      "photo_id": "530689541585769912_10733844", 
     } 
     ], 
     "username": "foo" 
    }, 
    { 
     "id": 232745390, 
     "photos": [ 
     { 
      "created_time": 1369422344, 
      "photo_id": "463070647967686017_232745390", 
     } 
     ], 
     "username": "bar" 
    } 
    ] 
}

现在，我要创建两个文件，一个与摘要和其他与每个连接的权重。我的环路，适用于小型数据集如下：

#a is the dataset 
data = db.collection.find() 
a =[i for i in data] 

#here go the connections between the locations 
edges = csv.writer(open("edges.csv", "wb")) 
#and here the location data 
nodes = csv.writer(open("nodes.csv", "wb")) 

for i in a: 

    #find the users that match 
    for q in a: 
     if i['_id'] <> q['_id'] and q.get('users') : 
      weight = 0 
      for user_i in i['users']: 
       for user_q in q['users']: 
        if user_i['id'] == user_q['id']: 
         weight +=1 
      if weight>0: 
       edges.writerow([ i['id'], q['id'], weight]) 


    #find the number of photos 
    photos_number =0 
    for p in i['users']: 
     photos_number += len(p['photos']) 


    nodes.writerow([ i['id'], 
        i['name'], 
        i['latitude'], 
        i['longitude'], 
        len(i['users']), 
        photos_number 
       ])

的结垢问题：我有20000点的位置，每个位置最多可以有2000个用户，每个用户可能有大约10张照片。

有没有更有效的方法来创建上述循环？也许多线程，JIT，更多的索引？因为如果我在单线程中运行以上可以达到20000^2 * 2000 * 10的结果...

那么我怎样才能更有效地处理上述问题呢？感谢

来源

2013-08-26 Diolor

样式更改：用'！='替换'<>'。另外，“a”中有什么？ – Tadeck

'a'代表字典。我更新了我的问题。 – Diolor

我不认为它代表字典。否则'因为我在a'会迭代_keys_，所以进一步使用'i''_ id']'键会产生一个错误。我想这是一个列表。 – Tadeck

@YuchenXie和@ PaulMcGuire建议的微优化可能不是您的主要问题，即您循环使用20,000 x 20,000 = 400,000,000对条目，然后有一个2,000 x 2000用户对的内部循环。这会很慢。

幸运的是，通过对i['users']中的用户id进行预缓存set s，并用一个简单的集合交点替换您的内部循环，可以使内部循环快得多。这会将在Python解释器中发生的O(num_users^2)操作更改为发生在C中的O(num_users)操作，这应该有所帮助。（我只是用2000的整数列表对它进行计时;在我的计算机上，它从你做这件事的方式从156ms到41μs，对于4000倍的加速时间。）

你也可以切掉一半你注意到这个关系是对称的，因此你可以在i = a[1],q = a[2]和i = a[2],q = a[1]这两个点都没有意义。

考虑到这些，并@ PaulMcGuire的建议综合考虑，与其他文体的改动，你的代码变得（警告：今后未经测试的代码）：

from itertools import combinations, izip 

data = db.collection.find() 
a = list(data) 

user_ids = [{user['id'] for user in i['users']} if 'users' in i else set() 
      for i in a] 

with open("edges.csv", "wb") as f: 
    edges = csv.writer(f) 
    for (i, i_ids), (q, q_ids) in combinations(izip(a, user_ids), 2): 
     weight = len(i_ids & q_ids) 
     if weight > 0: 
      edges.writerow([i['id'], q['id'], weight]) 
      edges.writerow([q['id'], i['id'], weight]) 

with open("nodes.csv", "wb") as f: 
    nodes = csv.writer(f) 
    for i in a: 
     nodes.writerow([ 
      i['id'], 
      i['name'], 
      i['latitude'], 
      i['longitude'], 
      len(i['users']), 
      sum(len(p['photos']) for p in i['users']), # total number of photos 
     ])

希望这应该是足够的加速的。如果没有，那么@雨辰谢的建议可能会有所帮助，尽管我很怀疑，因为stdlib/OS在缓冲这种事情方面相当不错。（你可以使用文件对象上的缓冲设置。）

否则，它可能会归结为尝试从Python（在Cython或手写的C）中获取核心循环，或给PyPy一个镜头。尽管如此，我怀疑这会给你带来什么巨大的加速。

您也可以将重量计算推入Mongo，这可能会更聪明;我从来没有真正使用它，所以我不知道。

来源

2013-08-26 14:25:48 Dougal

谢谢，这就是我真正想要的。它确实加快了速度。然而，我得到一个'TypeError：'类型为'generator'的对象在'sum'（len（p ['photos']）中为'['users']））'行中没有len（）。这是否意味着我没有为该用户提供'p ['照片']？再次感谢。 – Diolor

哎呦，对不起，这是一个错字：应该是'sum（len（p ['photos']）给我['users']）中的p。在另一个地方，人们试图把发电机对象的长度，而不是增加每个列表的长度。 – Dougal

是否崩溃这个循环：

photos_number =0 
for p in i['users']: 
    photos_number += len(p['photos'])

到：

photos_number = sum(len(p['photos']) for p in i['users'])

帮助呢？

你的体重计算：

 weight = 0 
     for user_i in i['users']: 
      for user_q in q['users']: 
       if user_i['id'] == user_q['id']: 
        weight +=1

也应该是可折叠到：

 weight = sum(user_i['id'] == user_q['id'] 
         for user_i,user_q in product([i['users'],q['users']))

由于真等同于1，汇总所有的布尔条件是一样的计数所有的值真正。

来源

2013-08-26 13:05:54 PaulMcG

为什么要这样？它看起来像把循环放在一行，并依靠'sum（）'而不是加法。我错过了什么吗？ – Tadeck

您正在将迭代代码移动到Python的已编译C库中，而不是通过Python字节码中的循环。同样在for循环中使用产品而不是明确的for循环。当然，这些只是盲目的优化，更正式的分析会告诉你真正的瓶颈在哪里。 – PaulMcG

瓶颈是磁盘I/O。

当您合并结果并使用一个或多个writerows而不是多个writerow时，它应该快得多。

来源

2013-08-26 13:26:21

使用大数据集优化循环Python

回答

相关问题