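Extract unique elements from a 2D Python list and put them into a new 2D list

I have a 2D list with three columns and many rows, where each column holds one type of value. The first column is the UserID, the second is the timestamp, and the third is the URL. The list looks like this: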

[[304070, '2015:01:01', 'http:something1'],
[304070, '2015:01:02', 'http:something2'],
[304070, '2015:01:03', 'http:something2'],
[304070, '2015:01:03', 'http:something2'],
[304071, '2015:01:04', 'http:something2'],
[304071, '2015:01:05', 'http:something3'],
[304071, '2015:01:06', 'http:something3']]

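As you can see, some URLs are duplicated, regardless of the user ID and timestamp.

I need to extract the rows with unique URLs and put them into a new 2D list.

For example, the second, third, fourth, and fifth rows all have the same URL, regardless of user ID and timestamp. I only need the second row (the first occurrence) in my new 2D list. The first row has a unique URL, so it goes into the new list as well. The last two rows (sixth and seventh) share a URL, and I only need the sixth.

So my new list should look like this: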

[[304070, '2015:01:01', 'http:something1'],
[304070, '2015:01:02', 'http:something2'],
[304071, '2015:01:05', 'http:something3']]

I thought about using something like this:

for i in range(len(oldList)):
    if oldList[i][2] not in newList:
        newList.append(oldList[i])

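But this obviously doesn't work, because oldList[i][2] is a single element, while "not in newList" checks it against the whole 2D list, that is, against each row. Code like this just creates an exact copy of oldList.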
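A quick check makes the failure concrete. This is only a small sketch using two rows from the sample data (with the timestamps written as strings so the literal is valid Python): the membership test compares the URL string against whole rows, so it never matches.

```python
# Two rows from the sample data; timestamps are written as strings here.
new_list = [[304070, "2015:01:01", "http:something1"],
            [304070, "2015:01:02", "http:something2"]]

url = "http:something2"

# `in` compares the string against each row (a list), so nothing is equal.
found_in_rows = url in new_list

# Comparing against the third column of each row is what was intended.
found_in_urls = url in [row[2] for row in new_list]

print(found_in_rows, found_in_urls)  # False True
```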

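Alternatively, I could eliminate the rows with duplicate URLs instead, because using a for loop plus append on a 2D list with a million rows really takes a while.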

Source

2016-03-01 JY078

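A good way to go about this is to use a set. Go through your list of lists one row at a time, adding each URL to the set and appending the complete row that contains it to your new list. If a URL is already in the set, discard the current row and move on to the next one.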

old_list = [[304070, "2015:01:01", 'http:something1'], 
      [304070, "2015:01:02", 'http:something2'], 
      [304070, "2015:01:03", 'http:something2'], 
      [304070, "2015:01:03", 'http:something2'], 
      [304071, "2015:01:04", 'http:something2'], 
      [304071, "2015:01:05", 'http:something3'], 
      [304071, "2015:01:06", 'http:something3']] 
new_list = [] 
url_set = set() 

for item in old_list:
    # Keep only the first row seen for each URL (third column).
    if item[2] not in url_set:
        url_set.add(item[2])
        new_list.append(item)

>>> print(new_list) 
[[304070, '2015:01:01', 'http:something1'], [304070, '2015:01:02', 'http:something2'], [304071, '2015:01:05', 'http:something3']]
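If this is needed in more than one place, the same idea can be wrapped in a function. The following is just a sketch; the name unique_by_column and its col parameter are made up for illustration and are not part of the answer above. Set membership tests take roughly constant time on average, which is what keeps this to a single pass over the rows.

```python
def unique_by_column(rows, col):
    """Return the rows whose value in column `col` has not been seen before."""
    seen = set()
    result = []
    for row in rows:
        key = row[col]
        if key not in seen:
            seen.add(key)       # remember this value
            result.append(row)  # keep the first row that carries it
    return result


old_list = [[304070, "2015:01:01", "http:something1"],
            [304070, "2015:01:02", "http:something2"],
            [304070, "2015:01:03", "http:something2"],
            [304071, "2015:01:04", "http:something2"],
            [304071, "2015:01:05", "http:something3"],
            [304071, "2015:01:06", "http:something3"]]

new_list = unique_by_column(old_list, 2)
print(new_list)
```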

Source

2016-03-01 02:39:49 MattDMo

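This seems like unnecessary space usage to me. It may be faster, but for average URLs it takes almost twice the space. – BoltKey

What do you mean? What space are you talking about? – MattDMo

I mean physical memory. You save all the URLs in url_set, which allocates more memory. Or does Python somehow store them by reference? – BoltKey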
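On the reference question: Python containers hold references to objects, so adding a URL to the set does not copy the string. The set does need memory for its own hash table, but the string data is shared. A small sketch to check this, with the URLs built at runtime so that the result does not depend on literals being shared by the compiler (the variable names are illustrative):

```python
# Build the URL strings at runtime so each row owns a distinct str object.
rows = [[304070, "2015:01:01", "http:" + "something%d" % n] for n in (1, 2)]

url_set = set()
for row in rows:
    url_set.add(row[2])

# Every string in the set is the very same object as one stored in a row.
shared = all(any(u is row[2] for row in rows) for u in url_set)
print(shared)  # True
```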

You need to create a function that searches a list of items for a URL.

def hasUrl(rows, url):
    # The URL is the third column (index 2) of each row.
    for item in rows:
        if item[2] == url:
            return True
    return False

Then your algorithm for building the new list should look like this.

for i in range(len(oldList)):
    if not hasUrl(newList, oldList[i][2]):  # check that the url is not in the list yet
        newList.append(oldList[i])

Also, there is no need to create a range. Python's for loop iterates over values, so you can simply write

for item in oldList:
    if not hasUrl(newList, item[2]):  # check that the url is not in the list yet
        newList.append(item)
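As a sketch of the same approach in a more compact form, the helper can be written with the built-in any(). The name has_url is an illustrative variant, not the function from the answer. Note that every call scans the new list from the start, so the total work grows with the number of rows times the number of unique URLs, which may matter for the million-row list mentioned in the question.

```python
def has_url(rows, url):
    """Return True if any row has `url` in its third column."""
    return any(row[2] == url for row in rows)


old_list = [[304070, "2015:01:01", "http:something1"],
            [304070, "2015:01:02", "http:something2"],
            [304070, "2015:01:03", "http:something2"],
            [304071, "2015:01:05", "http:something3"],
            [304071, "2015:01:06", "http:something3"]]

new_list = []
for item in old_list:
    if not has_url(new_list, item[2]):  # linear scan of new_list each time
        new_list.append(item)

print(new_list)
```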

Source

2016-03-01 02:39:14 BoltKey

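if item[1] == url: return True. Doesn't this basically keep adding duplicate items? Sorry, haha. I didn't see the 'if not'. My bad – JY078

Originally there was no 'not'; I added it in an edit after realizing the mistake. – BoltKey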

my_list = [[304070, '2015:01:01', 'http:something1'], 
      [304070, '2015:01:02', 'http:something2'], 
      [304070, '2015:01:03', 'http:something2'], 
      [304070, '2015:01:03', 'http:something2'], 
      [304071, '2015:01:04', 'http:something2'], 
      [304071, '2015:01:05', 'http:something3'], 
      [304071, '2015:01:06', 'http:something3']]

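Pull all the URLs out of the original list. Create a set from this list to get the unique URL values. Use a list comprehension to iterate over this set, and call index on the list of URLs (urls) to locate the first occurrence of each URL.

Finally, use another list comprehension together with enumerate to select the rows whose index is among those first occurrences.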

urls = [row[2] for row in my_list] 
urls_unique = set(urls) 
idx = [urls.index(url) for url in urls_unique] 
my_shorter_list = [row for n, row in enumerate(my_list) if n in idx] 

>>> my_shorter_list 
[[304070, '2015:01:01', 'http:something1'], 
[304070, '2015:01:02', 'http:something2'], 
[304071, '2015:01:05', 'http:something3']]
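One thing to be aware of: urls.index(url) rescans the list for each unique URL, and "n in idx" scans the idx list for each row. A possible variant, sketched here with the illustrative names first_index and keep, records the first position of each URL in a dict during a single pass and keeps the positions in a set.

```python
my_list = [[304070, '2015:01:01', 'http:something1'],
           [304070, '2015:01:02', 'http:something2'],
           [304070, '2015:01:03', 'http:something2'],
           [304071, '2015:01:04', 'http:something2'],
           [304071, '2015:01:05', 'http:something3'],
           [304071, '2015:01:06', 'http:something3']]

# Map each URL to the index of the first row in which it appears.
first_index = {}
for n, row in enumerate(my_list):
    first_index.setdefault(row[2], n)  # later duplicates do not overwrite

keep = set(first_index.values())
my_shorter_list = [row for n, row in enumerate(my_list) if n in keep]
print(my_shorter_list)
```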

Source

2016-03-01 02:48:55 Alexander

>>> old_list = [[304070, "2015:01:01", 'http:something1'], 
...   [304070, "2015:01:02", 'http:something2'], 
...   [304070, "2015:01:03", 'http:something2'], 
...   [304070, "2015:01:03", 'http:something2'], 
...   [304071, "2015:01:04", 'http:something2'], 
...   [304071, "2015:01:05", 'http:something3'], 
...   [304071, "2015:01:06", 'http:something3']] 
>>> temp_dict = {} 
>>> for element in old_list: 
...  if element[2] not in temp_dict: 
...   temp_dict[element[2]] = [element[0], element[1], element[2]] 
... 
>>> temp_dict.values() 
[[304070, '2015:01:01', 'http:something1'], [304070, '2015:01:02', 'http:something2'], [304071, '2015:01:05', 'http:something3']]

Note: I am assuming that the order of the distinct URLs in the result does not matter. If it does, use an OrderedDict instead of the default dict.
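A minimal sketch of the OrderedDict variant mentioned in the note follows; the name first_rows is illustrative. On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict mainly matters on older versions. In Python 3, values() returns a view, so it is wrapped in list() here.

```python
from collections import OrderedDict

old_list = [[304070, "2015:01:01", "http:something1"],
            [304070, "2015:01:02", "http:something2"],
            [304070, "2015:01:03", "http:something2"],
            [304071, "2015:01:04", "http:something2"],
            [304071, "2015:01:05", "http:something3"],
            [304071, "2015:01:06", "http:something3"]]

# Keys are URLs; each value is the first row seen with that URL.
first_rows = OrderedDict()
for element in old_list:
    if element[2] not in first_rows:
        first_rows[element[2]] = element

new_list = list(first_rows.values())
print(new_list)
```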

Source

2016-03-01 02:55:05
