Numpy - Fast Stable Arg-sort Of Large Array By Frequency

September 28, 2023 Post a Comment

I have large 1D NumPy array a of any comparable dtype, some of its elements may be repeated. How do I find sorting indexes ix that will stable-sort (stability in a sense described

Solution 1:

I might be missing something, but it seems that with a Counter you can then sort the indexes of each element according to the count of that element's value, using the element value and then the index to break ties. For example:

from collections import Counter

a = [ 1,  1,  1,  1,  3,  0,  5,  0,  3,  1,  1,  0,  0,  4,  6,  1,  3,  5,  5,  0,  0,  0,  5,  0]
counts = Counter(a)

t = [(counts[v], v, i) for i, v inenumerate(a)]
t.sort()
print([v[2] for v in t])
t.sort(reverse=True)
print([v[2] for v in t])

Output:

[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[23, 21, 20, 19, 12, 11, 7, 5, 15, 10, 9, 3, 2, 1, 0, 22, 18, 17, 6, 16, 8, 4, 14, 13]

If you want to maintain ascending order of indexes with groups with equal counts, you can just use a lambda function for the descending sort:

t.sort(key = lambda x:(-x[0],-x[1],x[2]))
print([v[2] for v in t])

Output:

[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 14, 13]

If you want to maintain the ordering of elements in the order that they originally appeared in the array if their counts are the same, then rather than sort on the values, sort on the index of their first occurrence in the array:

a = [ 1,  1,  1,  1,  3,  0,  5,  0,  3,  1,  1,  0,  0,  4,  6,  1,  3,  5,  5,  0,  0,  0,  5,  0]
counts = Counter(a)

idxs = {}
t = []
for i, v in enumerate(a):
    ifnot v in idxs:
        idxs[v] = i
    t.append((counts[v], idxs[v], i))

t.sort()
print([v[2] for v in t])
t.sort(key = lambda x:(-x[0],x[1],x[2]))
print([v[2] for v in t])

Output:

[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 13, 14]

To sort according to count, and then position in the array, you don't need the value or the first index at all:

from collections import Counter

a = [ 1,  1,  1,  1,  3,  0,  5,  0,  3,  1,  1,  0,  0,  4,  6,  1,  3,  5,  5,  0,  0,  0,  5,  0]
counts = Counter(a)

t = [(counts[v], i) for i, v inenumerate(a)]
t.sort()
print([v[1] for v in t])
t.sort(key = lambda x:(-x[0],x[1]))
print([v[1] for v in t])

This produces the same output as the prior code for the sample data, for your string array:

a = ['g',  'g',  'c',  'f',  'd',  'd',  'g',  'a',  'a',  'a',  'f',  'f',  'f',
     'g',  'f',  'c',  'f',  'a',  'e',  'b',  'g',  'd',  'c',  'b',  'f' ]

This produces the output:

[18, 19, 23, 2, 4, 5, 15, 21, 22, 7, 8, 9, 17, 0, 1, 6, 13, 20, 3, 10, 11, 12, 14, 16, 24]
[3, 10, 11, 12, 14, 16, 24, 0, 1, 6, 13, 20, 7, 8, 9, 17, 2, 4, 5, 15, 21, 22, 19, 23, 18]

Solution 2:

I just figured myself probably very fast solution for any dtype using just numpy functions without python looping, it works in O(N log N) time. Used numpy functions: np.unique, np.argsort and array indexing.

Although wasn't asked in original question, I implemented extra flag equal_order_by_val if it is False then array elements with same frequencies are sorted as equal stable range, meaning that there could be c d d c d c output like in outputs dumps below, because this is the order as elements go in original array for equal frequency. When flag is True such elements are in addition sorted by value of original array, resulting in c c c d d d. In other words in case of False we sort stably just by key freq, and when it is True we sort by (freq, value) for ascending order and by (-freq, value) for descending order.

Try it online!

import string, math
import numpy as np
np.random.seed(0)

# Generating input data

hi, n, desc = 7, 25, True
letters = np.array(list(string.ascii_letters), dtype = np.object_)[:hi]
a = np.random.choice(letters, (n,), p = (
    lambda p = np.random.random((letters.size,)): p / p.sum()
)())

for equal_order_by_val in [False, True]:
    # Solving task

    us, ui, cs = np.unique(a, return_inverse = True, return_counts = True)
    af = cs[ui]
    sort_key = -af if desc else af
    if equal_order_by_val:
        shift_bits = max(1, math.ceil(math.log(us.size) / math.log(2)))
        sort_key = ((sort_key.astype(np.int64) << shift_bits) +
            np.arange(us.size, dtype = np.int64)[ui])
    ix = np.argsort(sort_key, kind = 'stable') # Do sorting itself# Printing resultsprint('\nequal_order_by_val:', equal_order_by_val)
    for name, val in [
        ('i_col', np.arange(n)),  ('original_a', a),
        ('freqs', af),            ('sorted_a', a[ix]),
        ('sorted_freqs', af[ix]), ('sorting_ix', ix),
    ]:
        print(name.rjust(12), ' '.join([str(e).rjust(2) for e in val]))

outputs:

equal_order_by_val: False
       i_col  0123456789101112131415161718192021222324
  original_a  g  g  c  f  d  d  g  a  a  a  f  f  f  g  f  c  f  a  e  b  g  d  c  b  f
       freqs  5537335444777573741253327
    sorted_a  f  f  f  f  f  f  f  g  g  g  g  g  a  a  a  a  c  d  d  c  d  c  b  b  e
sorted_freqs  7777777555554444333333221
  sorting_ix  3101112141624016132078917245152122192318

equal_order_by_val: True
       i_col  0123456789101112131415161718192021222324
  original_a  g  g  c  f  d  d  g  a  a  a  f  f  f  g  f  c  f  a  e  b  g  d  c  b  f
       freqs  5537335444777573741253327
    sorted_a  f  f  f  f  f  f  f  g  g  g  g  g  a  a  a  a  ccc  d  d  d  b  b  e
sorted_freqs  7777777555554444333333221
  sorting_ix  3101112141624016132078917215224521192318

alezinhacris

Numpy - Fast Stable Arg-sort Of Large Array By Frequency

Solution 1:

Solution 2:

Post a Comment for "Numpy - Fast Stable Arg-sort Of Large Array By Frequency"

Widget HTML #3