Python Cookbook —— 数据结构技巧

一、序列展开与多重赋值

任何数据序列（或可迭代对象）都可以只通过一个赋值操作展开自身并同时赋值给多个变量。只需要确保被赋值的变量的数目和结构与序列相符合即可。如：

>>> p = (4, 5)
>>> x, y = p
>>> x
4
>>> y
5

对于嵌套的多层次序列，此种方式的多重赋值仍然适用：

>>> data = [ 'ACME', 50, 91.1, (2012, 12, 21) ]
>>> name, shares, price, date = data
>>> name
'ACME'
>>> date
(2012, 12, 21)
>>> name, shares, price, (year, mon, day) = data
>>> name
'ACME'
>>> year
2012
>>> mon
12
>>> day
21

序列展开同样适用于任何可迭代对象（即内部实现了 __iter__ 方法的对象，如字符串等），示例如下：

>>> s = 'Hello'
>>> a, b, c, d, e = s
>>> a
'H'
>>> b
'e'
>>> e
'o'

忽略特定的值
在序列展开并赋值时，可以忽略指定项目，方法如下：

>>> data = [ 'ACME', 50, 91.1, (2012, 12, 21) ]
>>> _, shares, price, _ = data
>>> shares
50
>>> price
91.1

展开为固定长度

在序列展开并赋值时，被赋值的变量数目和结构如果与序列本身不符合，则会报出 ValueError ：

>>> p = (4, 5, 6)
>>> x, y = p
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 2)

如上面的示例，当展开的序列长度大于期望的变量数目时，可以对变量使用 * 操作符将多个项目保存在列表结构中，如：

def drop_first_last(grades):
    first, *middle, last = sorted(grades)
    return sum(middle) / len(middle)

grades = [ 100, 94, 96, 88, 70, 62, 80 ]
print(f"{ drop_first_last(grades) }")
# => 85.6
print(sum([94, 96, 88, 70, 80]) / 5)
# => 85.6

其他应用示例如：

>>> record = ('Dave', 'dave@example.com', '773-555-1212', '847-555-1212')
>>> name, email, *phone_numbers = record
>>> name
'Dave'
>>> email
'dave@example.com'
>>> phone_numbers
['773-555-1212', '847-555-1212']

* 操作符的特性使其可以很方便地应用在字符串分割中：

>>> line = 'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false'
>>> uname, *fields, homedir, sh = line.split(':')
>>> uname
'nobody'
>>> homedir
'/var/empty'
>>> sh
'/usr/bin/false'

结合 _ 还可以写出如下代码以在赋值时忽略序列中的多个项目：

>>> record = ('ACME', 50, 123.45, (12, 18, 2012))
>>> name, *_, (*_, year) = record
>>> name
'ACME'
>>> year
2012

二、检索序列中最大或最小的 N 个项目

Python 的 heapq 模块包含 nlargest() 和 nsmallest() 函数，可以用来筛选某个数据序列中最大或最小的 N 个值。

import heapq

nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
print(heapq.nlargest(3, nums))
print(heapq.nsmallest(3, nums))

# => [42, 37, 23]
# => [-4, 1, 2]

上面两个函数也可以接收一个名为 key 的参数，使其可以应用在更复杂的数据结构中：

import heapq

portfolio = [
    {'name': 'IBM', 'shares': 100, 'price': 91.1},
    {'name': 'AAPL', 'shares': 50, 'price': 543.22},
    {'name': 'FB', 'shares': 200, 'price': 21.09},
    {'name': 'HPQ', 'shares': 35, 'price': 31.75},
    {'name': 'YHOO', 'shares': 45, 'price': 16.35},
    {'name': 'ACME', 'shares': 75, 'price': 115.65}
]

cheap = heapq.nsmallest(2, portfolio, key=lambda s: s['price'])
expensive = heapq.nlargest(2, portfolio, key=lambda s: s['price'])

print(cheap)
# => [{'name': 'YHOO', 'shares': 45, 'price': 16.35}, {'name': 'FB', 'shares': 200, 'price': 21.09}]
print(expensive)
# => [{'name': 'AAPL', 'shares': 50, 'price': 543.22}, {'name': 'ACME', 'shares': 75, 'price': 115.65}]

heapq 中的 nlargest() 和 nsmallest() 两个函数的原理都是将数据序列转换为列表结构并且以堆（heap）的形式进行组织。
heap 最重要的属性为， heap[0] 永远是序列中最小的值。使用 heapq.heappop() 方法可以获取 heap 中的第一个值（即最小值），而之前第二小的值则移动到第一的位置。即不断调用 heappop() 可以一直获取当前序列中最小的值。

>>> nums = [1, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
>>> import heapq
>>> heapq.heapify(nums)
>>> nums
[-4, 2, 1, 23, 7, 2, 18, 23, 42, 37, 8]
>>> heapq.heappop(nums)
-4
>>> heapq.heappop(nums)
1
>>> heapq.heappop(nums)
2
>>> heapq.heappop(nums)
2
>>> heapq.heappop(nums)
7

PS：如果只想检索某一个最大值或最小值，min() 或者 max() 更快；
如果 nlargest(N, items) 或 nsmallest(N, items) 中的 N 与 items 集合的大小很接近，则先对集合进行排序再分片的方式更快一点，即 sorted(itmes)[:N]
（实际 nlargest 和 nsmallest 内部本身也是这样实现的）

三、实现一个加权队列

以下的代码实现了一种支持加权的队列，即队列中的项目会按指定的权重排序，并且每次调用 pop 方法都可以返回权重最高的项目。

import heapq

class PriorityQueue:
    def __init__(self):
        self._queue = []
        self._index = 0

    def push(self, item, priority):
        heapq.heappush(self._queue, (-priority, self._index, item))
        self._index += 1

    def pop(self):
        return heapq.heappop(self._queue)[-1]

# test code
q = PriorityQueue()
q.push('foo', 1)
q.push('bar', 5)
q.push('spam', 4)
q.push('grok', 1)

print(q.pop())
# => bar
print(q.pop())
# => spam
print(q.pop())
# => foo
print(q.pop())
# => grok

其中 heapq.heappush() 函数可以向 _queue 序列中插入数据，之后再使用 heapq.heappop() 函数获取序列中的数据时，总能保证取出的数据是当时队列中最小的那个。pop 和 push 操作的复杂度为 O(logN)，比普通列表形式的操作（复杂度为 O(N)）效率更高。
同时数据是以元组 (-priority, _index, item) 的形式存入到 _queue 队列中的，元组可以依据自身的每个数据项依次进行比对并确定大小关系。因而权重 -priority 可以作为排序的首要依据（加负号是为了使权重高的值更小，可以优先被 pop() 返回）。当权重一样即 -priority 的值相同时，则根据插入的顺序（_index）返回数据。

四、比较字典中的 Value

参考下面的代码示例：

prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}

print(min(prices))  # => 'AAPL'
print(max(prices))  # => 'IBM'

从输出中可以看出，min(prices) 和 max(prices) 函数只是处理字典 prices 中的 keys，对 values 则不做任何操作。
可以将其改为如下形式：

1 2	min(prices.values()) # => 10.75 max(prices.values()) # => 612.78

则此时上述两个函数又只对字典中的 values 有效，输出的结果中也只包含 values，不包含与之关联的 key 的值。

如果想根据 values 对字典中的数据进行排序，同时输出的结果中既包含 value，又包含与之关联的 key。则可以使用 zip() 函数。
zip() 函数以多个可迭代的对象作为参数，将各对象中位置对应的元素打包成一个个元组。
回到前面的需求，则可以将字典的 keys 和 values 拆分到两个列表中，再通过 zip() 函数将其中的数据合并成一个个元组（等同于之前的键值对），而 value 作为元组的第一个元素。

prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}

for item in zip(prices.values(), prices.keys()):
    print(item)
# => (45.23, 'ACME')
# => (612.78, 'AAPL')
# => (205.55, 'IBM')
# => (37.2, 'HPQ')
# => (10.75, 'FB')

max_price = max(zip(prices.values(), prices.keys()))
print(max_price)
# => (612.78, 'AAPL')

五、去除序列中的重复元素

集合（set）是 Python 中的一种数据结构，它包含了一系列无序的不重复元素。因此可以通过将其他类型的数据转为 set 类型，为序列中的数据去重。
但此种方法不能保留原序列中数据项原本的排列顺序（因为 set 是无序的）。

1
2
3

>>> a = [1, 5, 2, 1, 9, 1, 5, 10]
>>> set(a)
{1, 2, 5, 9, 10}

下面的函数 dedupe 则实现了去重并保留原本的排列顺序：

def dedupe(items):
    seen = set()
    
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
print(list(dedupe(a)))

# => [1, 5, 2, 9, 10]

而对于更复杂的数据结构，比如对如下的列表进行去重操作：
[{'x':1, 'y':2}, {'x':1, 'y':3}, {'x':1, 'y':2}, {'x':2, 'y':4}]

则可以将 dedupe 函数改为如下形式：

def dedupe(items, key=None):
    seen = set()

    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [{'x':1, 'y':2}, {'x':1, 'y':3}, {'x':1, 'y':2}, {'x':2, 'y':4}]
print(list(dedupe(a, key=lambda d: (d['x'],d['y']))))
# => [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]

print(list(dedupe(a, key=lambda d: (d['x']))))
# => [{'x': 1, 'y': 2}, {'x': 2, 'y': 4}]

额，自行体会吧。。。

六、找出序列中最常出现的项

collections.Counter 类可以用来寻找序列中出现次数最多的几个项目。

from collections import Counter

words = [
    'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
    'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around', 'the',
    'eyes', "don't", 'look', 'around', 'the', 'eyes', 'look', 'into',
    'my', 'eyes', "you're", 'under'
]

word_counts = Counter(words)
top_three = word_counts.most_common(3)
print(top_three)
# => [('eyes', 8), ('the', 5), ('look', 4)]

morewords = ['why', 'are', 'you', 'not', 'looking', 'in', 'my', 'eyes']
word_counts.update(morewords)
print(word_counts.most_common(3))
# => [('eyes', 9), ('the', 5), ('look', 4)]

a = Counter(words)
b = Counter(morewords)
print(a + b)
# => Counter({'eyes': 9, 'the': 5, 'look': 4, 'my': 4, 'into': 3, 'not': 2, 'around': 2, "don't": 1, "you're": 1, 'under': 1, 'why': 1, 'are': 1, 'you': 1, 'looking': 1, 'in': 1})

参考资料

Python Cookbook, 3rd Edition