Most common element in a Python list

105

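What is an efficient way to find the most common element in a Python list?

My list items may not be hashable, so I can't use a dictionary. Also, in case of a draw, the item with the lowest index should be returned. Example: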

>>> most_common(['duck', 'duck', 'goose']) 
'duck' 
>>> most_common(['goose', 'duck', 'duck', 'goose']) 
'goose'

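Source

2009-10-05 hoju

If the items in the list are not hashable, how would you determine when they are "equal"? The loss of efficiency in determining equality for non-hashable items would probably negate any efficiency you hope to gain with a good algorithm :) – 2009-10-05 07:05:23

I think he means that the items can be mutable and therefore not eligible to be keys in a hash map. – fortran 2009-10-05 07:35:17

Yes, that's what I meant - sometimes it will contain lists – hoju 2009-10-05 12:02:31

With so many solutions proposed, I'm amazed nobody has proposed what I'd consider an obvious one (for non-hashable but comparable elements): itertools.groupby. itertools offers fast, reusable functionality and lets you delegate some tricky logic to well-tested standard library components. Consider for example: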

import itertools 
import operator 

def most_common(L):
    # get an iterable of (item, iterable) pairs
    SL = sorted((x, i) for i, x in enumerate(L))
    # print('SL:', SL)
    groups = itertools.groupby(SL, key=operator.itemgetter(0))
    # auxiliary function to get "quality" for an item
    def _auxfun(g):
        item, iterable = g
        count = 0
        min_index = len(L)
        for _, where in iterable:
            count += 1
            min_index = min(min_index, where)
        # print('item %r, count %r, minind %r' % (item, count, min_index))
        return count, -min_index
    # pick the highest-count/earliest item
    return max(groups, key=_auxfun)[0]

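This could of course be written more concisely, but I'm aiming for maximal clarity. The two print statements can be uncommented to better see the machinery in action; for example, with the prints uncommented:

print(most_common(['goose', 'duck', 'duck', 'goose']))

emits: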

SL: [('duck', 1), ('duck', 2), ('goose', 0), ('goose', 3)] 
item 'duck', count 2, minind 1 
item 'goose', count 2, minind 0 
goose

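As you can see, SL is a list of pairs, each pair an item followed by the item's index in the original list (to implement the key condition that, if more than one item shares the highest count, the result must be the one that occurs earliest).

groupby groups by the item only (via operator.itemgetter). The auxiliary function, called once per group during the max computation, receives and internally unpacks a group: a tuple with two items (item, iterable), where the iterable's items are also two-item tuples (item, original index), that is, the items of SL.

The auxiliary function then uses a loop to determine both the count of entries in the group's iterable and the minimum original index. It returns these as a combined "quality key", with the sign of the minimum index flipped so that the max operation considers items that occurred earlier in the original list to be "better".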
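To make the grouping step concrete, here is a minimal sketch that lists the count and lowest original index computed for each group. The helper name `describe_groups` is hypothetical and not part of the answer; it only exposes the intermediate values that the quality key is built from.

```python
import itertools
import operator

def describe_groups(L):
    """Return (item, count, min_index) for each group of equal items.

    Illustrative helper only; the name is an assumption made for this sketch.
    """
    # Decorate each item with its original index, then sort so that
    # equal items end up next to each other.
    SL = sorted((x, i) for i, x in enumerate(L))
    result = []
    for item, pairs in itertools.groupby(SL, key=operator.itemgetter(0)):
        indices = [where for _, where in pairs]
        result.append((item, len(indices), min(indices)))
    return result

groups = describe_groups(['goose', 'duck', 'duck', 'goose'])
# Highest count wins; on a tie, the negated index favours the earliest item.
best = max(groups, key=lambda t: (t[1], -t[2]))[0]
```

Here `groups` holds one entry per distinct item, and `best` is chosen with the same (count, -min_index) ordering the answer describes.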

This code could be much simpler if it worried a little less about big-O issues in time and space, for example:

def most_common(L):
    groups = itertools.groupby(sorted(L))
    def _auxfun(g):
        item, iterable = g
        return len(list(iterable)), -L.index(item)
    return max(groups, key=_auxfun)[0]

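Same basic idea, just expressed more simply and compactly... but, alas, at the cost of O(N) extra auxiliary space (to turn each group's iterable into a list) and O(N squared) time (to get the L.index of every item). While premature optimization is the root of all evil in programming, deliberately picking an O(N squared) approach when an O(N log N) one is available just goes too much against the grain of scalability!-)

Finally, for those who prefer "one-liners" to clarity and performance, a bonus one-liner version with suitably mangled names :-).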

from itertools import groupby as g 
def most_common_oneliner(L): 
    return max(g(sorted(L)), key=lambda xv: (len(list(xv[1])), -L.index(xv[0])))[0]

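Source

2009-10-05 15:16:29

This breaks on Python 3 if your list contains different types. – AlexLordThorsen 2016-02-24 22:47:19

'groupby' requires sorting first (O(N log N)); using a 'Counter()' with 'most_common()' can beat that because it uses heapq to find the highest-frequency item (for just 1 item, that's O(N) time). As Counter() is now heavily optimised (counting takes place in a C loop), it can easily beat this solution even for small lists. It blows it out of the water for large lists. – 2017-10-14 21:26:07

Only the "lowest index" requirement for ties makes this a valid solution for this particular problem. For the more general case you should definitely use the Counter approach. – 2017-10-14 22:11:01

If the items are not hashable, you can sort them and do a single pass over the result, counting the items (identical items will be next to each other). But it might be faster to make them hashable and use a dict.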

def most_common(lst): 
    cur_length = 0 
    max_length = 0 
    cur_i = 0 
    max_i = 0 
    cur_item = None 
    max_item = None 
    for i, item in sorted(enumerate(lst), key=lambda x: x[1]): 
        if cur_item is None or cur_item != item:
            if cur_length > max_length or (cur_length == max_length and cur_i < max_i):
                max_length = cur_length
                max_i = cur_i
                max_item = cur_item
            cur_length = 1
            cur_i = i
            cur_item = item
        else:
            cur_length += 1
    if cur_length > max_length or (cur_length == max_length and cur_i < max_i):
        return cur_item
    return max_item

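Source

2009-10-05 06:39:56

Here is a simpler way, http://ideone.com/Nq81vf, compared against Alex's 'Counter()' solution – Miguel 2017-01-27 14:49:38

Sort a copy of the list and find the longest run. You can decorate the list with the index of each element before sorting it, and then, in the case of a tie, choose the run that starts with the lowest index.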
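The description above might be sketched as follows. This is only one possible reading of it, assuming the items are mutually comparable; the function name `most_common_by_runs` is hypothetical.

```python
from itertools import groupby

def most_common_by_runs(items):
    """Longest run in a sorted, index-decorated copy (hypothetical name)."""
    if not items:
        raise ValueError("empty sequence")
    # Decorate with the original index. Tuples sort by item first, then by
    # index, so each run of equal items starts with its lowest original index.
    decorated = sorted((item, index) for index, item in enumerate(items))
    best_item, best_len, best_index = None, 0, len(items)
    for item, run in groupby(decorated, key=lambda pair: pair[0]):
        run = list(run)
        first_index = run[0][1]
        # Prefer longer runs; on a tie, prefer the run that starts earlier.
        if len(run) > best_len or (len(run) == best_len and first_index < best_index):
            best_item, best_len, best_index = item, len(run), first_index
    return best_item
```

Because only sorting and == are used, this also accepts unhashable items such as lists, as long as they can be ordered.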

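Source

2009-10-05 06:40:21 Boojum

The items may not be comparable. – 2013-06-29 12:06:32

Here is the obvious slow solution (O(n^2)) for when neither sorting nor hashing is feasible but equality comparison (==) is available: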

def most_common(items):
    if not items:
        raise ValueError
    # each entry of fitems is [item, count, order of first appearance]
    fitems = []
    best_idx = 0
    for item in items:
        item_missing = True
        i = 0
        for fitem in fitems:
            if fitem[0] == item:
                fitem[1] += 1
                d = fitem[1] - fitems[best_idx][1]
                if d > 0 or (d == 0 and fitems[best_idx][2] > fitem[2]):
                    best_idx = i
                item_missing = False
                break
            i += 1
        if item_missing:
            fitems.append([item, 1, i])
    # best_idx is an index into fitems, not into items
    return fitems[best_idx][0]

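But making your items hashable or sortable (as the other answers suggest) will almost always find the most common element faster when the length of the list (n) is large: O(n) on average with hashing, and O(n*log(n)) at worst with sorting.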
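As an illustration of the hashing route, here is a minimal sketch under stated assumptions: the unhashable items are (possibly nested) lists, and the helper `freeze` and the function `most_common_hashed` are hypothetical names introduced only for this example. Note one limitation: `freeze` maps the list [1, 2] and the tuple (1, 2) to the same key even though they do not compare equal.

```python
def freeze(obj):
    """Hypothetical helper: turn (nested) lists into tuples usable as dict keys."""
    if isinstance(obj, (list, tuple)):
        return tuple(freeze(x) for x in obj)
    return obj

def most_common_hashed(items):
    if not items:
        raise ValueError("empty sequence")
    keys = [freeze(item) for item in items]
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    # max() returns the first maximal element, so ties go to the lowest index.
    best = max(range(len(items)), key=lambda i: counts[keys[i]])
    return items[best]
```

Counting is a single pass over the list, so the expected running time is linear in n, in line with the estimate above.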

Source

2009-10-05 06:46:53 pts

To the downvoter: what's wrong with this answer? Does any of the other answers provide a solution when neither sorting nor hashing is feasible? – pts 2018-03-09 23:33:16

-1

Here:

def most_common(l): 
    max = 0 
    maxitem = None 
    for x in set(l): 
        count = l.count(x)
        if count > max:
            max = count
            maxitem = x
    return maxitem

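I have a vague feeling there is a method somewhere in the standard library that will give you the count of each element, but I can't find it.

Source

2009-10-05 06:56:18

'max' is a built-in function. Would you change the name of the variable? – 2009-10-05 07:04:27

Note that set() also requires hashable items, so the solution won't work in this case. – 2009-10-05 07:04:44

Wait, I missed the part about the items not being hashable. But if the objects have equality, it should be easy to make them hashable. – 2009-10-05 08:40:33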

-1

>>> li = ['goose', 'duck', 'duck'] 

>>> def foo(li):
...     st = set(li)
...     mx = -1
...     for each in st:
...         temp = li.count(each)
...         if mx < temp:
...             mx = temp
...             h = each
...     return h

>>> foo(li) 
'duck'

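Source

2009-10-05 06:56:50

This has terrible performance characteristics when n is big and the number of unique elements is big as well: O(n) for the conversion to a set and O(m*n) = O(n^2) for the counting (where m is the number of unique elements). Sort and walk would be O(n log n) for the sort and O(n) for the walk. – jmucchiello 2009-10-05 07:12:28

Yes, you're right. Now I know this is a terrible solution, and why. Thanks for the comment! :-) – 2009-10-05 07:22:15

A one-liner:

def most_common(lst): 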
    return max(((item, lst.count(item)) for item in set(lst)), key=lambda a: a[1])[0]

Source

2009-10-05 07:04:24 willurd

327

A simpler one-liner:

def most_common(lst): 
    return max(set(lst), key=lst.count)

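Source

2009-10-05 07:14:52 newacct

+15

The OP stated that *[..] in case of draws the item with the lowest index should be returned.* This code does not, in general, meet that requirement. – Stephan202 2009-10-05 07:45:14

Also, the OP said the elements may not be hashable, whereas sets must contain hashable objects. – EOL 2009-10-05 09:16:42

Also, this approach is algorithmically slow (for each element in 'set(lst)', the whole list must be scanned again)... Probably fast enough for most uses, though... – EOL 2009-10-05 09:17:27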
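One way to address the first two objections, while keeping the quadratic running time, is to iterate over the list itself rather than over a set. This is a sketch and not the answer author's code; the name `most_common_no_set` is hypothetical. It relies on the documented behaviour that max() returns the first maximal element it encounters, and on list.count needing only ==.

```python
def most_common_no_set(lst):
    # O(n^2): list.count scans the whole list once for every element.
    # max() returns the first maximal element, so ties go to the lowest index.
    return max(lst, key=lst.count)
```

Since no set is built, unhashable items such as lists are accepted as well.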

# use Decorate, Sort, Undecorate to solve the problem 

def most_common(iterable): 
    # Make a list with tuples: (item, index) 
    # The index will be used later to break ties for most common item. 
    lst = [(x, i) for i, x in enumerate(iterable)] 
    lst.sort() 

    # lst_final will also be a list of tuples: (count, index, item) 
    # Sorting on this list will find us the most common item, and the index 
    # will break ties so the one listed first wins. Count is negative so 
    # largest count will have lowest value and sort first. 
    lst_final = [] 

    # Get an iterator for our new list... 
    itr = iter(lst) 

    # ...and pop the first tuple off. Setup current state vars for loop. 
    count = 1 
    tup = next(itr) 
    x_cur, i_cur = tup 

    # Loop over sorted list of tuples, counting occurrences of item. 
    for tup in itr: 
        # Same item again?
        if x_cur == tup[0]:
            # Yes, same item; increment count
            count += 1
        else:
            # No, new item, so write previous current item to lst_final...
            t = (-count, i_cur, x_cur)
            lst_final.append(t)
            # ...and reset current state vars for loop.
            x_cur, i_cur = tup
            count = 1

    # Write final item after loop ends 
    t = (-count, i_cur, x_cur) 
    lst_final.append(t) 

    lst_final.sort() 
    answer = lst_final[0][2] 

    return answer 

print(most_common(['x', 'e', 'a', 'e', 'a', 'e', 'e'])) # prints 'e'
print(most_common(['goose', 'duck', 'duck', 'goose'])) # prints 'goose'

Source

2009-10-05 08:02:50 steveha

Here is an O(n) solution.

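lst = ['goose', 'duck', 'duck', 'goose']  # example input from the question
mydict = {}
cnt, itm = 0, ''
for item in reversed(lst):
    mydict[item] = mydict.get(item, 0) + 1
    if mydict[item] >= cnt:
        cnt, itm = mydict[item], item

print(itm)

(reversed is used to make sure that the item with the lowest index is returned)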

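Source

2009-10-05 10:29:02 ThisIsMeMoony

You probably don't need this anymore, but this is what I did for a similar problem. (It looks longer than it is because of the comments.)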

itemList = ['hi', 'hi', 'hello', 'bye'] 

counter = {} 
maxItemCount = 0 
for item in itemList: 
    try:
        # Referencing this will cause a KeyError exception
        # if it doesn't already exist
        counter[item]
        # ... meaning if we get this far it didn't happen so
        # we'll increment
        counter[item] += 1
    except KeyError:
        # If we got a KeyError we need to create the
        # dictionary key
        counter[item] = 1

    # Keep overwriting maxItemCount with the latest number,
    # if it's higher than the existing itemCount
    if counter[item] > maxItemCount:
        maxItemCount = counter[item]
        mostPopularItem = item

print(mostPopularItem)

Source

2010-04-14 00:35:16

You can use counter[item] = counter.get(item, 0) + 1 instead of the try/except part – XueYu 2016-09-28 00:02:57

105

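Borrowing from here, this can be used with Python 2.7:

from collections import Counter 

def Most_Common(lst): 
    data = Counter(lst) 
    return data.most_common(1)[0][0]

It runs about 4-6 times faster than Alex's solutions, and 50 times faster than the one-liner proposed by newacct.

To retrieve the element that occurs first in the list in case of ties: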

def most_common(lst): 
    data = Counter(lst) 
    return max(lst, key=data.get)

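Source

2014-01-01 20:10:48 Alex

This might be useful to some, but... unfortunately Counter is a dict subclass, and the OP said he couldn't use dictionaries (as the items may not be hashable). – Danimal 2014-09-08 15:32:41

Love this. The one-liner by @newacct above may be simple, but it runs in O(n^2), where n is the length of the list. This solution is O(n). – BoltzmannBrain 2015-05-22 16:50:42

Love the simplicity and the speed... maybe not ideal for the OP. But it suits me well! – Thom 2015-10-20 12:50:34

In statistics this is known as the mode, and of course Python has a built-in function to do exactly that for you: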

>>> from statistics import mode 
>>> mode([1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 6, 6, 6]) 
3

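Source

2016-04-07 13:43:14

This does not satisfy the OP's requirement about what to return when there is more than one most common value: a statistics.StatisticsError is raised – 2016-04-07 14:06:31

Oops, I missed that requirement when reading the question. I still believe this answer has value, though, since no one else suggested it here, and it is a good solution for people with the least restrictive requirements. This is one of the top results for "most common item in list python" – 2016-04-07 17:15:52

In that case, use the mode function in pandas DataFrames. – Elmex80s 2017-03-13 22:34:11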
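Regarding the StatisticsError mentioned in the comments above: since Python 3.8, statistics.mode returns the first mode encountered rather than raising when several values tie, and statistics.multimode returns all of them. A short sketch, assuming Python 3.8 or later and hashable items:

```python
from statistics import mode, multimode

data = ['goose', 'duck', 'duck', 'goose']
first_mode = mode(data)       # the tied value that appears first in the data
all_modes = multimode(data)   # every tied value, in order of first occurrence
```

On older Python 3 versions the call to mode would still raise for this input, so the version assumption matters.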

-3

def popular(L):
    C={}
    for a in L:
        C[a]=L.count(a)
    for b in C.keys():
        if C[b]==max(C.values()):
            return b

L=[2,3,5,3,6,3,6,3,6,3,7,467,4,7,4]
print(popular(L))

Source

2016-07-18 17:15:15 Pronoy

-2

def most_common(lst): 
    if max([lst.count(i) for i in lst]) == 1:
        return False
    else:
        return max(set(lst), key=lst.count)

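Source

2017-02-03 15:48:08 Ecanales

Please provide some information about your code; just posting code is not a complete answer – jhhoff02 2017-02-03 16:09:16

Is there a reason anyone should use this over the other 15 answers? – cpburnz 2017-02-03 20:43:51

-1

I needed to do this in a recent program. I'll admit I couldn't understand Alex's answer, so this is what I ended up with.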

def mostPopular(l): 
    mpEl=None 
    mpIndex=0 
    mpCount=0 
    curEl=None 
    curCount=0 
    for i, el in sorted(enumerate(l), key=lambda x: (x[1], x[0]), reverse=True): 
        curCount=curCount+1 if el==curEl else 1
        curEl=el
        if curCount>mpCount \
        or (curCount==mpCount and i<mpIndex):
            mpEl=curEl
            mpIndex=i
            mpCount=curCount
    return mpEl, mpCount, mpIndex

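I timed it against Alex's solution: it is about 10-15% faster for short lists, but once you go past 100 elements (tested up to 200000) it is about 20% slower.

Source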

2017-03-13 17:22:30 user3596355
