Numpy - Fast Stable Arg-sort Of Large Array By Frequency
Solution 1:
I might be missing something, but it seems that with a Counter
you can then sort the indexes of each element according to the count of that element's value, using the element value and then the index to break ties. For example:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], v, i) for i, v inenumerate(a)]
t.sort()
print([v[2] for v in t])
t.sort(reverse=True)
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[23, 21, 20, 19, 12, 11, 7, 5, 15, 10, 9, 3, 2, 1, 0, 22, 18, 17, 6, 16, 8, 4, 14, 13]
If you want to maintain ascending order of indexes with groups with equal counts, you can just use a lambda function for the descending sort:
t.sort(key = lambda x:(-x[0],-x[1],x[2]))
print([v[2] for v in t])
Output:
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 14, 13]
If you want to maintain the ordering of elements in the order that they originally appeared in the array if their counts are the same, then rather than sort on the values, sort on the index of their first occurrence in the array:
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
idxs = {}
t = []
for i, v in enumerate(a):
ifnot v in idxs:
idxs[v] = i
t.append((counts[v], idxs[v], i))
t.sort()
print([v[2] for v in t])
t.sort(key = lambda x:(-x[0],x[1],x[2]))
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 13, 14]
To sort according to count, and then position in the array, you don't need the value or the first index at all:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], i) for i, v inenumerate(a)]
t.sort()
print([v[1] for v in t])
t.sort(key = lambda x:(-x[0],x[1]))
print([v[1] for v in t])
This produces the same output as the prior code for the sample data, for your string array:
a = ['g', 'g', 'c', 'f', 'd', 'd', 'g', 'a', 'a', 'a', 'f', 'f', 'f',
'g', 'f', 'c', 'f', 'a', 'e', 'b', 'g', 'd', 'c', 'b', 'f' ]
This produces the output:
[18, 19, 23, 2, 4, 5, 15, 21, 22, 7, 8, 9, 17, 0, 1, 6, 13, 20, 3, 10, 11, 12, 14, 16, 24]
[3, 10, 11, 12, 14, 16, 24, 0, 1, 6, 13, 20, 7, 8, 9, 17, 2, 4, 5, 15, 21, 22, 19, 23, 18]
Solution 2:
I just figured myself probably very fast solution for any dtype using just numpy functions without python looping, it works in O(N log N)
time. Used numpy functions: np.unique
, np.argsort
and array indexing.
Although wasn't asked in original question, I implemented extra flag equal_order_by_val
if it is False then array elements with same frequencies are sorted as equal stable range, meaning that there could be c d d c d c
output like in outputs dumps below, because this is the order as elements go in original array for equal frequency. When flag is True such elements are in addition sorted by value of original array, resulting in c c c d d d
. In other words in case of False we sort stably just by key freq
, and when it is True we sort by (freq, value)
for ascending order and by (-freq, value)
for descending order.
import string, math
import numpy as np
np.random.seed(0)
# Generating input data
hi, n, desc = 7, 25, True
letters = np.array(list(string.ascii_letters), dtype = np.object_)[:hi]
a = np.random.choice(letters, (n,), p = (
lambda p = np.random.random((letters.size,)): p / p.sum()
)())
for equal_order_by_val in [False, True]:
# Solving task
us, ui, cs = np.unique(a, return_inverse = True, return_counts = True)
af = cs[ui]
sort_key = -af if desc else af
if equal_order_by_val:
shift_bits = max(1, math.ceil(math.log(us.size) / math.log(2)))
sort_key = ((sort_key.astype(np.int64) << shift_bits) +
np.arange(us.size, dtype = np.int64)[ui])
ix = np.argsort(sort_key, kind = 'stable') # Do sorting itself# Printing resultsprint('\nequal_order_by_val:', equal_order_by_val)
for name, val in [
('i_col', np.arange(n)), ('original_a', a),
('freqs', af), ('sorted_a', a[ix]),
('sorted_freqs', af[ix]), ('sorting_ix', ix),
]:
print(name.rjust(12), ' '.join([str(e).rjust(2) for e in val]))
outputs:
equal_order_by_val: False
i_col 0123456789101112131415161718192021222324
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5537335444777573741253327
sorted_a f f f f f f f g g g g g a a a a c d d c d c b b e
sorted_freqs 7777777555554444333333221
sorting_ix 3101112141624016132078917245152122192318
equal_order_by_val: True
i_col 0123456789101112131415161718192021222324
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5537335444777573741253327
sorted_a f f f f f f f g g g g g a a a a ccc d d d b b e
sorted_freqs 7777777555554444333333221
sorting_ix 3101112141624016132078917215224521192318
Post a Comment for "Numpy - Fast Stable Arg-sort Of Large Array By Frequency"