Efficiency Of Sorting By Multiple Keys In Python
Solution 1:
Here's a third way to time:
start = time()
l3 = sorted(lst, key=lambda x: (''.join(sorted(x)) + "/" + x[::2]))
t3 = time() - start
and expanding the last line to
assert l1 == l2 == l3
This uses a single string as the key, combining the two strings you view as being the "primary" and "secondary" keys. Note that:
>>> chr(ord("0") - 1)
'/'
That's why the two keys can be combined - they're separated by an ASCII character that compares "less than" any ASCII digit (of course this is wholly specific to the precise kind of keys you're using).
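To convince yourself that the combined string orders the same way the tuple of the two keys would (for these digit-only strings), a quick check along these lines can be run - a sketch of mine, not part of the original timing program:

import random

random.seed(0)
lst = [str(random.randrange(1_000_000)) for _ in range(10_000)]

# Tuple key vs. single combined-string key: '/' sorts below every digit,
# so the combined key compares exactly like the (primary, secondary) pair.
by_tuple = sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))
by_onekey = sorted(lst, key=lambda x: ''.join(sorted(x)) + "/" + x[::2])
assert by_tuple == by_onekey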
This is typically a little faster than multisort() for me, using the precise program you posted.
prepare=3.628943920135498 multisort=15.646344423294067 multikey=34.255955934524536 slowdown=2.1893903782103075 onekey=15.11461067199707
I believe the primary reason "why" is briefly explained at the end of a modern CPython distribution's Objects/listsort.txt:
As noted above, even the simplest Python comparison triggers a large pile of C-level pointer dereferences, conditionals, and function calls. This can be partially mitigated by pre-scanning the data to determine whether the data is homogeneous with respect to type. If so, it is sometimes possible to substitute faster type-specific comparisons for the slower, generic PyObject_RichCompareBool.
When there's a single string used as the key, this pre-sorting scan deduces that all the keys in the list are in fact strings, so all the runtime expense of figuring out which comparison function to call can be skipped: sorting can always call the string-specific comparison function instead of the all-purpose (and significantly more expensive) PyObject_RichCompareBool.
multisort() also benefits from that optimization.
But multikey() doesn't, much. The pre-sorting scan sees that all the keys are tuples, but the tuple comparison function itself can't assume anything about the types of the tuple's elements: it has to resort to PyObject_RichCompareBool every time it's invoked. (Note: as touched on in comments, it's not really quite that simple: some optimization is still done exploiting that the keys are all tuples, but it doesn't always pay off, and at best is less effective - see the next section for clearer evidence of that.)
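As a very rough pure-Python model of what that pre-sorting scan decides (my own sketch, not CPython's actual C code), it amounts to a same-type check over the keys:

# Rough model of the homogeneity check: only if every key has exactly the
# same type can a type-specific comparison be substituted for the generic
# PyObject_RichCompareBool; otherwise sorting stays on the generic path.
def keys_are_homogeneous(keys):
    first_type = type(keys[0])
    return all(type(k) is first_type for k in keys)

print(keys_are_homogeneous(['b', 'a', 'c']))     # True: str-specific compare
print(keys_are_homogeneous([('b',), ('a', 1)]))  # True, but only at the tuple level -
                                                 # the elements inside still go through
                                                 # the generic comparison machinery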
Focus
There's a whole lot going on in the test case, which leads to needing ever-greater effort to explain ever-smaller distinctions.
So to look at the effects of the type homogeneity optimizations, let's simplify things a lot: no key function at all. Like so:
from random import random, seed
from time import time

length = 10000000
seed(1234567891)
xs = [random() for _ in range(length)]

# sort the bare floats
ys = xs[:]
start = time()
ys.sort()
e1 = time() - start

# floats wrapped in 1-tuples
ys = [(x,) for x in xs]
start = time()
ys.sort()
e2 = time() - start

# floats wrapped in singleton lists
ys = [[x] for x in xs]
start = time()
ys.sort()
e3 = time() - start

print(e1, e2, e3)
Here's typical output on my box:
3.1991195678710938 12.756590843200684 26.31903386116028
So it's by far fastest to sort floats directly. It's already very damaging to stick the floats in 1-tuples, but the optimization still gives highly significant benefit: it takes over twice as long again to stick the floats in singleton lists. In that last case (and only in that last case), PyObject_RichCompareBool is always called.
Solution 2:
Update
Tim and a_guest nudged me to analyze more. Here are counts of the different comparisons (per element) and other statistics:
length:                |    10,000   |   100,000   |  1,000,000  | 10,000,000  |
                       | msort  mkey | msort  mkey | msort  mkey | msort  mkey |
-----------------------+-------------+-------------+-------------+-------------+
< ''.join(sorted(x))   | 11.64 11.01 | 13.86 11.88 | 13.10 12.00 | 12.16 12.06 |
< x[::2]               | 11.99  0.96 | 13.51  3.20 | 13.77  5.35 | 13.79  6.68 |
== ''.join(sorted(x))  |       11.99 |       15.29 |       18.60 |       21.26 |
== x[::2]              |        0.98 |        3.42 |        6.60 |        9.20 |
-----------------------+-------------+-------------+-------------+-------------+
time, μs per element   |  0.84  1.03 |  0.95  1.42 |  1.31  2.35 |  1.39  3.15 |
-----------------------+-------------+-------------+-------------+-------------+
tracemalloc peak MiB   |  0.82  1.67 |  8.26 17.71 |  82.7 178.0 |   826  1780 |
-----------------------+-------------+-------------+-------------+-------------+
sys.getsizeof MiB      |  0.58  1.80 |  5.72 17.82 |  57.6 179.5 |   581  1808 |
                       |  0.60       |  6.00       |  60.4       |   608       |
-----------------------+-------------+-------------+-------------+-------------+
Numbers of comparisons
The numbers of comparisons are per element, i.e., I counted all comparisons and divided by the list size. So for example for a list with a million elements, multisort did 13.77 < comparisons per x[::2] string and then 13.10 < comparisons per ''.join(sorted(x)) string. Whereas multikey has many == comparisons of the strings and fewer < comparisons of the strings. As Tim pointed out, unsafe_tuple_compare uses the slower PyObject_RichCompareBool(..., Py_EQ) before it uses the faster unsafe_latin_compare (on the ''.join(sorted(x)) strings) or another slower PyObject_RichCompareBool(..., Py_LT) (on the x[::2] strings).
A key thing to note is that multisort's numbers of comparisons per element stay roughly constant, and it only uses the faster unsafe_latin_compare. Whereas multikey's numbers of comparisons other than the < comparisons on ''.join(sorted(x)) grow rapidly, including the additional slower equality comparisons.
That has to do with your data, as x is the string representation of an integer from 0 to 1,000,000. That causes more and more duplicates, both for ''.join(sorted(x)) (which also has duplicates from the sorting, e.g., 314159 and 911345 both become '113459') and for x[::2] (which also has duplicates from the slicing, e.g., 123456 and 183957 both become '135').
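You can see that duplicate growth directly with a quick count of distinct keys (my own sketch, using the same kind of random data):

import random

random.seed(0)
lst = [str(random.randrange(1_000_001)) for _ in range(1_000_000)]

# A million inputs collapse onto far fewer distinct keys.
print(len(set(''.join(sorted(x)) for x in lst)))   # distinct primary keys
print(len(set(x[::2] for x in lst)))               # distinct secondary keys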
For multisort this is good, as duplicates mean less work for it - equal things don't need to be "sorted further" (and I think streaks of equal things are good for Timsort's galloping).
But multikey suffers from the duplicates, because the == comparisons on ''.join(sorted(x)) result in "they're equal" more often, leading to more comparisons of the x[::2] strings. And these x[::2] often turn out unequal; note how the numbers for < x[::2] are relatively large compared to the == x[::2] comparisons (e.g., at ten million elements, there were 9.20 == comparisons and 6.68 < comparisons, so 73% of the time they were unequal). Which means the tuples also often turn out unequal, so they do need to be "sorted further" (and I think that's bad for galloping). And this further sorting will compare whole tuples again, also meaning even more == comparisons of the ''.join(sorted(x)) strings even though they're equal! That's why in multikey, the < comparisons of the ''.join(sorted(x)) strings remain fairly stable (from 11.01 per string for 10,000 elements, up to only 12.06 for 10 million elements) while their == comparisons grow so much (from 11.99 up to 21.26).
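A simplified pure-Python model of the tuple comparison cost pattern described above (my sketch; CPython's unsafe_tuple_compare differs in detail):

# Lexicographic tuple '<': an equality check on each leading pair first,
# then a single '<' on the first unequal pair. Every duplicate primary key
# therefore costs an extra '==' before the secondary key is even consulted.
def tuple_less_than(a, b):
    for x, y in zip(a, b):
        if not x == y:      # the extra equality comparison counted above
            return x < y    # only now the (possibly type-specific) '<'
    return len(a) < len(b)  # all shared positions equal: shorter tuple is smaller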
Times and Memory
The runtimes in the above table also reflect the little growth for multisort and the large growth for multikey. The one time where multisort got quite a bit slower was at the step from 100,000 to 1,000,000 elements, going from 0.95 μs to 1.31 μs per element. As can be seen from the tracemalloc and sys.getsizeof rows of my table, its memory stepped up from ~6 MiB to ~59 MiB, and the CPU has about 34 MiB of cache, so cache misses might be responsible for that.
For the sys.getsizeof values I put all keys into a list and add the size of the list (since sorted also stores them) and the sizes of all strings/tuples in the list. For multisort, my table shows two values separately, as the two sorts happen one after the other, and the memory for the first sort is released before the second sort.
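Roughly how such a sys.getsizeof measurement can be reproduced (a sketch under my own assumptions, not the exact measurement code used for the table):

import sys

def keys_size_mib(keys):
    # size of the key list itself plus the sizes of the key objects it holds;
    # for tuple keys, also count the strings inside each tuple
    total = sys.getsizeof(keys) + sum(sys.getsizeof(k) for k in keys)
    total += sum(sys.getsizeof(s)
                 for k in keys if isinstance(k, tuple)
                 for s in k)
    return total / 2**20

lst = [str(i) for i in range(10_000)]
print(keys_size_mib([''.join(sorted(x)) for x in lst]))           # one multisort pass
print(keys_size_mib([(''.join(sorted(x)), x[::2]) for x in lst])) # multikey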
(Note: runtimes differ from my original answer as I had to use a different computer to sort ten million elements.)
Code for counting comparisons
I wrap the strings in an S object that increases a counter for each comparison.
import random
from collections import Counter

class S:
    """Wraps a key string and counts every comparison made on it."""
    def __init__(self, value, label):
        self.value = value
        self.label = label
    def __eq__(self, other):
        comparisons['==', self.label] += 1
        return self.value == other.value
    def __ne__(self, other):
        comparisons['!=', self.label] += 1
        return self.value != other.value
    def __lt__(self, other):
        comparisons['<', self.label] += 1
        return self.value < other.value
    def __le__(self, other):
        comparisons['<=', self.label] += 1
        return self.value <= other.value
    def __ge__(self, other):
        comparisons['>=', self.label] += 1
        return self.value >= other.value
    def __gt__(self, other):
        comparisons['>', self.label] += 1
        return self.value > other.value

def key1(x):
    return S(''.join(sorted(x)), "''.join(sorted(x))")

def key2(x):
    return S(x[::2], 'x[::2]')

def multisort(l):
    tmp = sorted(l, key=lambda x: key2(x))
    return sorted(tmp, key=lambda x: key1(x))

def multikey(l):
    return sorted(l, key=lambda x: (key1(x), key2(x)))

funcs = [
    multisort,
    multikey,
]

def test(length, rounds):
    print(f'{length = :,}')
    largest = 1000000
    for _ in range(3):
        lst = list(map(str, random.choices(range(largest + 1), k=length)))
        for func in funcs:
            global comparisons
            comparisons = Counter()  # reset, then filled by S's comparison methods
            func(lst)
            print(func.__name__)
            for comparison, frequency in comparisons.most_common():
                print(f'{frequency / length:5.2f}', *comparison)
            print()

test(10_000, 1)
test(100_000, 10)
test(1_000_000, 1)
test(10_000_000, 1)
Original answer
Yeah, I've often used / pointed out that two simpler sorts can be faster than one more complex sort. Here are benchmarks with a few more versions:
At length = 10,000 elements, multikey took about 1.16 times as long as multisort:
multisort 12.09 ms 12.21 ms 12.32 ms (no previous result)
multikey 14.13 ms 14.14 ms 14.14 ms (same as previous result)
multisort_tupled 15.40 ms 15.61 ms 15.70 ms (same as previous result)
multisort_inplaced 11.46 ms 11.49 ms 11.49 ms (same as previous result)
At length = 100,000, it took about 1.43 times as long:
length = 100,000
multisort 0.162 s 0.164 s 0.164 s (no previous result)
multikey 0.231 s 0.233 s 0.237 s (same as previous result)
multisort_tupled 0.222 s 0.225 s 0.227 s (same as previous result)
multisort_inplaced 0.156 s 0.157 s 0.158 s (same as previous result)
At length = 1,000,000, it took about 2.15 times as long:
multisort 1.894 s 1.897 s 1.898 s (no previous result)
multikey 4.026 s 4.060 s 4.163 s (same as previous result)
multisort_tupled 2.734 s 2.765 s 2.771 s (same as previous result)
multisort_inplaced 1.840 s 1.841 s 1.863 s (same as previous result)
Reasons I see:
- Building tuples costs extra time, and comparing tuples of things is slower than comparing just those things. See the times of multisort_tupled, where in one of the two sorts I wrap each real key in a single-value tuple. That made it much slower.
- With large data, CPU/memory caches play a more significant role. Having extra tuples with two things per tuple means three times as many objects in memory as just having one of those things. That leads to more cache misses, or even to (more) swap file usage if you go really big.
- Tuples are registered with the reference cycle garbage collector, which can play a significant role. We don't produce reference cycles here, so we can disable it during the sorting (a sketch follows this list). That speeds it up a bit. Edit: I still think it should be at least slightly faster, but with more testing I got conflicting times regarding this, so I took this point out.
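What disabling the collector around the sort could look like (a sketch of mine; as noted above, the measured benefit was inconclusive):

import gc

def multikey_nogc(l):
    gc.disable()   # safe here: the sort creates no reference cycles
    try:
        return sorted(l, key=lambda x: (''.join(sorted(x)), x[::2]))
    finally:
        gc.enable()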
Btw, notice my multisort_inplaced is a bit faster still. If you do multiple sorts, it's pointless to do them all with sorted. After the first one, just use the in-place sort. There's no reason to create further new lists, which takes time/memory for the lists and takes time to update the reference counts for all the list elements.
Benchmark code (Try it online!):
import random
from time import perf_counter as timer

def multisort(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: ''.join(sorted(x)))

def multikey(l):
    return sorted(l, key=lambda x: (''.join(sorted(x)), x[::2]))

def multisort_tupled(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: (''.join(sorted(x)),))

def multisort_inplaced(l):
    tmp = sorted(l, key=lambda x: x[::2])
    tmp.sort(key=lambda x: ''.join(sorted(x)))
    return tmp

funcs = [
    multisort,
    multikey,
    multisort_tupled,
    multisort_inplaced,
]

def test(length, rounds):
    print(f'{length = :,}')
    largest = 1000000
    lst = list(map(str, random.choices(range(largest + 1), k=length)))
    prev = None
    for func in funcs:
        times = []
        for _ in range(5):
            time = 0
            for _ in range(rounds):
                copy = lst.copy()
                start = timer()
                result = func(copy)
                time += timer() - start
            times.append(time / rounds)
        # report the three best times and whether the result matches the previous function's
        print('%-19s' % func.__name__,
              *('%5.2f ms ' % (t * 1e3) for t in sorted(times)[:3]),
              # *('%5.3f s  ' % t for t in sorted(times)[:3]),
              '(%s previous result)' % ('no' if prev is None
                                        else 'same as' if result == prev
                                        else 'differs from'))
        prev = result

test(10_000, 10)
# test(100_000, 10)
# test(1_000_000, 1)