Efficiency Of Sorting By Multiple Keys In Python
Solution 1:
Here's a third way to time:
start = time()
l3 = sorted(lst, key=lambda x: (''.join(sorted(x)) + "/" + x[::2]))
t3 = time() - start
and expanding the last line to
assert l1 == l2 == l3
This uses a single string as the key, combining the two strings you view as being the "primary" and "secondary" keys. Note that:
>>> chr(ord("0") - 1)
'/'
That's why the two keys can be combined - they're separated by an ASCII character that compares "less than" any ASCII digit (of course this is wholly specific to the precise kind of keys you're using).
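To convince yourself that the combined string orders the same way the tuple of the two keys would (for these digit-only strings), a quick check along these lines can be run - a sketch of mine, not part of the original timing program:

import random

random.seed(0)
lst = [str(random.randrange(1_000_000)) for _ in range(10_000)]

# Tuple key vs. single combined-string key: '/' sorts below every digit,
# so the combined key compares exactly like the (primary, secondary) pair.
by_tuple = sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))
by_onekey = sorted(lst, key=lambda x: ''.join(sorted(x)) + "/" + x[::2])
assert by_tuple == by_onekey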
This is typically a little faster than multisort() for me, using the precise program you posted.
prepare=3.628943920135498 multisort=15.646344423294067 multikey=34.255955934524536 slowdown=2.1893903782103075 onekey=15.11461067199707
I believe the primary reason "why" is briefly explained at the end of a modern CPython distribution's Objects/listsort.txt:
As noted above, even the simplest Python comparison triggers a large pile of C-level pointer dereferences, conditionals, and function calls. This can be partially mitigated by pre-scanning the data to determine whether the data is homogeneous with respect to type. If so, it is sometimes possible to substitute faster type-specific comparisons for the slower, generic PyObject_RichCompareBool.
When there's a single string used as the key, this pre-sorting scan deduces that all the keys in the list are in fact strings, so all the runtime expense of figuring out which comparison function to call can be skipped: sorting can always call the string-specific comparison function instead of the all-purpose (and significantly more expensive) PyObject_RichCompareBool.
multisort() also benefits from that optimization.
But multikey() doesn't, much. The pre-sorting scan sees that all the keys are tuples, but the tuple comparison function itself can't assume anything about the types of the tuple's elements: it has to resort to PyObject_RichCompareBool every time it's invoked. (Note: as touched on in comments, it's not really quite that simple: some optimization is still done exploiting that the keys are all tuples, but it doesn't always pay off, and at best is less effective - see the next section for clearer evidence of that.)
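As a very rough pure-Python model of what that pre-sorting scan decides (my own sketch, not CPython's actual C code), it amounts to a same-type check over the keys:

# Rough model of the homogeneity check: only if every key has exactly the
# same type can a type-specific comparison be substituted for the generic
# PyObject_RichCompareBool; otherwise sorting stays on the generic path.
def keys_are_homogeneous(keys):
    first_type = type(keys[0])
    return all(type(k) is first_type for k in keys)

print(keys_are_homogeneous(['b', 'a', 'c']))     # True: str-specific compare
print(keys_are_homogeneous([('b',), ('a', 1)]))  # True, but only at the tuple level -
                                                 # the elements inside still go through
                                                 # the generic comparison machinery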
Focus
There's a whole lot going on in the test case, which leads to needing ever-greater effort to explain ever-smaller distinctions.
So to look at the effects of the type homogeneity optimizations, let's simplify things a lot: no key function at all. Like so:
from random import random, seed
from time import time

length = 10000000
seed(1234567891)
xs = [random() for _ in range(length)]

# sort the bare floats
ys = xs[:]
start = time()
ys.sort()
e1 = time() - start

# floats wrapped in 1-tuples
ys = [(x,) for x in xs]
start = time()
ys.sort()
e2 = time() - start

# floats wrapped in singleton lists
ys = [[x] for x in xs]
start = time()
ys.sort()
e3 = time() - start

print(e1, e2, e3)
Here's typical output on my box:
3.1991195678710938 12.756590843200684 26.31903386116028
So it's by far fastest to sort floats directly. It's already very damaging to stick the floats in 1-tuples, but the optimization still gives highly significant benefit: it takes over twice as long again to stick the floats in singleton lists. In that last case (and only in that last case), PyObject_RichCompareBool is always called.
Solution 2:
Update
Tim and a_guest nudged me to analyze more. Here are counts of the different comparisons (per element) and other statistics:
length:                |    10,000   |   100,000   |  1,000,000  | 10,000,000  |
                       | msort  mkey | msort  mkey | msort  mkey | msort  mkey |
-----------------------+-------------+-------------+-------------+-------------+
< ''.join(sorted(x))   | 11.64 11.01 | 13.86 11.88 | 13.10 12.00 | 12.16 12.06 |
< x[::2]               | 11.99  0.96 | 13.51  3.20 | 13.77  5.35 | 13.79  6.68 |
== ''.join(sorted(x))  |       11.99 |       15.29 |       18.60 |       21.26 |
== x[::2]              |        0.98 |        3.42 |        6.60 |        9.20 |
-----------------------+-------------+-------------+-------------+-------------+
time, μs per element   |  0.84  1.03 |  0.95  1.42 |  1.31  2.35 |  1.39  3.15 |
-----------------------+-------------+-------------+-------------+-------------+
tracemalloc peak MiB   |  0.82  1.67 |  8.26 17.71 |  82.7 178.0 |   826  1780 |
-----------------------+-------------+-------------+-------------+-------------+
sys.getsizeof MiB      |  0.58  1.80 |  5.72 17.82 |  57.6 179.5 |   581  1808 |
                       |  0.60       |  6.00       |  60.4       |   608       |
-----------------------+-------------+-------------+-------------+-------------+
Numbers of comparisons
The numbers of comparisons are per element, i.e., I counted all comparisons and divided by the list size. So for example for a list with a million elements, multisort did 13.77 < comparisons per x[::2] string and then 13.10 < comparisons per ''.join(sorted(x)) string. Whereas multikey has many == comparisons of the strings and fewer < comparisons of the strings. As Tim pointed out, unsafe_tuple_compare uses the slower PyObject_RichCompareBool(..., Py_EQ) before it uses the faster unsafe_latin_compare (on the ''.join(sorted(x)) strings) or another slower PyObject_RichCompareBool(..., Py_LT) (on the x[::2] strings).
A key thing to note is that multisort's numbers of comparisons per element stay roughly constant, and it only uses the faster unsafe_latin_compare. Whereas multikey's numbers of comparisons other than the < comparisons on ''.join(sorted(x)) grow rapidly, including the additional slower equality comparisons.
That has to do with your data, as x is the string representation of an integer from 0 to 1,000,000. That causes more and more duplicates, both for ''.join(sorted(x)) (which also has duplicates from the sorting, e.g., 314159 and 911345 both become '113459') and for x[::2] (which also has duplicates from the slicing, e.g., 123456 and 183957 both become '135').
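You can see that duplicate growth directly with a quick count of distinct keys (my own sketch, using the same kind of random data):

import random

random.seed(0)
lst = [str(random.randrange(1_000_001)) for _ in range(1_000_000)]

# A million inputs collapse onto far fewer distinct keys.
print(len(set(''.join(sorted(x)) for x in lst)))   # distinct primary keys
print(len(set(x[::2] for x in lst)))               # distinct secondary keys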
For multisort this is good, as duplicates mean less work for it - equal things don't need to be "sorted further" (and I think streaks of equal things are good for Timsort's galloping).
But multikey suffers from the duplicates, because the == comparisons on ''.join(sorted(x)) result in "they're equal" more often, leading to more comparisons of the x[::2] strings. And these x[::2] often turn out unequal; note how the numbers for < x[::2] are relatively large compared to the == x[::2] comparisons (e.g., at ten million elements, there were 9.20 == comparisons and 6.68 < comparisons, so 73% of the time they were unequal). Which means the tuples also often turn out unequal, so they do need to be "sorted further" (and I think that's bad for galloping). And this further sorting will compare whole tuples again, also meaning even more == comparisons of the ''.join(sorted(x)) strings even though they're equal! That's why in multikey, the < comparisons of the ''.join(sorted(x)) strings remain fairly stable (from 11.01 per string for 10,000 elements, up to only 12.06 for 10 million elements) while their == comparisons grow so much (from 11.99 up to 21.26).
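A simplified pure-Python model of the tuple comparison cost pattern described above (my sketch; CPython's unsafe_tuple_compare differs in detail):

# Lexicographic tuple '<': an equality check on each leading pair first,
# then a single '<' on the first unequal pair. Every duplicate primary key
# therefore costs an extra '==' before the secondary key is even consulted.
def tuple_less_than(a, b):
    for x, y in zip(a, b):
        if not x == y:      # the extra equality comparison counted above
            return x < y    # only now the (possibly type-specific) '<'
    return len(a) < len(b)  # all shared positions equal: shorter tuple is smaller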
Times and Memory
The runtimes in the above table also reflect the little growth for multisort and the large growth for multikey. The one time where multisort got quite a bit slower was at the step from 100,000 to 1,000,000 elements, going from 0.95 μs to 1.31 μs per element. As can be seen from the tracemalloc and sys.getsizeof rows of my table, its memory stepped up from ~6 MiB to ~59 MiB, and the CPU has about 34 MiB of cache, so cache misses might be responsible for that.
For the sys.getsizeof values I put all keys into a list and add the size of the list (since sorted also stores them) and the sizes of all strings/tuples in the list. For multisort, my table shows two values separately, as the two sorts happen one after the other, and the memory for the first sort is released before the second sort.
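Roughly how such a sys.getsizeof measurement can be reproduced (a sketch under my own assumptions, not the exact measurement code used for the table):

import sys

def keys_size_mib(keys):
    # size of the key list itself plus the sizes of the key objects it holds;
    # for tuple keys, also count the strings inside each tuple
    total = sys.getsizeof(keys) + sum(sys.getsizeof(k) for k in keys)
    total += sum(sys.getsizeof(s)
                 for k in keys if isinstance(k, tuple)
                 for s in k)
    return total / 2**20

lst = [str(i) for i in range(10_000)]
print(keys_size_mib([''.join(sorted(x)) for x in lst]))           # one multisort pass
print(keys_size_mib([(''.join(sorted(x)), x[::2]) for x in lst])) # multikey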
(Note: runtimes differ from my original answer as I had to use a different computer to sort ten million elements.)
Code for counting comparisons
I wrap the strings in an S object that increases a counter for each comparison.
import random
from collections import Counter

class S:
    """Wraps a key string and counts every comparison made on it."""
    def __init__(self, value, label):
        self.value = value
        self.label = label
    def __eq__(self, other):
        comparisons['==', self.label] += 1
        return self.value == other.value
    def __ne__(self, other):
        comparisons['!=', self.label] += 1
        return self.value != other.value
    def __lt__(self, other):
        comparisons['<', self.label] += 1
        return self.value < other.value
    def __le__(self, other):
        comparisons['<=', self.label] += 1
        return self.value <= other.value
    def __ge__(self, other):
        comparisons['>=', self.label] += 1
        return self.value >= other.value
    def __gt__(self, other):
        comparisons['>', self.label] += 1
        return self.value > other.value

def key1(x):
    return S(''.join(sorted(x)), "''.join(sorted(x))")

def key2(x):
    return S(x[::2], 'x[::2]')

def multisort(l):
    tmp = sorted(l, key=lambda x: key2(x))
    return sorted(tmp, key=lambda x: key1(x))

def multikey(l):
    return sorted(l, key=lambda x: (key1(x), key2(x)))

funcs = [
    multisort,
    multikey,
]

def test(length, rounds):
    print(f'{length = :,}')
    largest = 1000000
    for _ in range(3):
        lst = list(map(str, random.choices(range(largest + 1), k=length)))
        for func in funcs:
            global comparisons
            comparisons = Counter()  # reset, then filled by S's comparison methods
            func(lst)
            print(func.__name__)
            for comparison, frequency in comparisons.most_common():
                print(f'{frequency / length:5.2f}', *comparison)
            print()

test(10_000, 1)
test(100_000, 10)
test(1_000_000, 1)
test(10_000_000, 1)
Original answer
Yeah, I've often used / pointed out that two simpler sorts can be faster than one more complex sort. Here are benchmarks with a few more versions:
At length = 10,000 elements, multikey took about 1.16 times as long as multisort:
multisort 12.09 ms 12.21 ms 12.32 ms (no previous result)
multikey 14.13 ms 14.14 ms 14.14 ms (same as previous result)
multisort_tupled 15.40 ms 15.61 ms 15.70 ms (same as previous result)
multisort_inplaced 11.46 ms 11.49 ms 11.49 ms (same as previous result)
At length = 100,000, it took about 1.43 times as long:
length = 100,000
multisort 0.162 s 0.164 s 0.164 s (no previous result)
multikey 0.231 s 0.233 s 0.237 s (same as previous result)
multisort_tupled 0.222 s 0.225 s 0.227 s (same as previous result)
multisort_inplaced 0.156 s 0.157 s 0.158 s (same as previous result)
At length = 1,000,000, it took about 2.15 times as long:
multisort 1.894 s 1.897 s 1.898 s (no previous result)
multikey 4.026 s 4.060 s 4.163 s (same as previous result)
multisort_tupled 2.734 s 2.765 s 2.771 s (same as previous result)
multisort_inplaced 1.840 s 1.841 s 1.863 s (same as previous result)
Reasons I see:
- Building tuples costs extra time, and comparing tuples of things is slower than comparing just those things. See the times of multisort_tupled, where in one of the two sorts I wrap each real key in a single-value tuple. That made it much slower.
- With large data, CPU/memory caches play a more significant role. Having extra tuples with two things per tuple means three times as many objects in memory as just having one of those things. That leads to more cache misses, or even to (more) swap file usage if you go really big.
- Tuples are registered with the reference cycle garbage collector, which can play a significant role. We don't produce reference cycles here, so we can disable it during the sorting (a sketch follows this list). That speeds it up a bit. Edit: I still think it should be at least slightly faster, but with more testing I got conflicting times regarding this, so I took this point out.
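What disabling the collector around the sort could look like (a sketch of mine; as noted above, the measured benefit was inconclusive):

import gc

def multikey_nogc(l):
    gc.disable()   # safe here: the sort creates no reference cycles
    try:
        return sorted(l, key=lambda x: (''.join(sorted(x)), x[::2]))
    finally:
        gc.enable()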
Btw, notice my multisort_inplaced is a bit faster still. If you do multiple sorts, it's pointless to do them all with sorted. After the first one, just use the in-place sort. There's no reason to create further new lists, which takes time/memory for the lists and takes time to update the reference counts for all the list elements.
Benchmark code (Try it online!):
import random
from time import perf_counter as timer

def multisort(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: ''.join(sorted(x)))

def multikey(l):
    return sorted(l, key=lambda x: (''.join(sorted(x)), x[::2]))

def multisort_tupled(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: (''.join(sorted(x)),))

def multisort_inplaced(l):
    tmp = sorted(l, key=lambda x: x[::2])
    tmp.sort(key=lambda x: ''.join(sorted(x)))
    return tmp

funcs = [
    multisort,
    multikey,
    multisort_tupled,
    multisort_inplaced,
]

def test(length, rounds):
    print(f'{length = :,}')
    largest = 1000000
    lst = list(map(str, random.choices(range(largest + 1), k=length)))
    prev = None
    for func in funcs:
        times = []
        for _ in range(5):
            time = 0
            for _ in range(rounds):
                copy = lst.copy()
                start = timer()
                result = func(copy)
                time += timer() - start
            times.append(time / rounds)
        # report the three best times and whether the result matches the previous function's
        print('%-19s' % func.__name__,
              *('%5.2f ms ' % (t * 1e3) for t in sorted(times)[:3]),
              # *('%5.3f s  ' % t for t in sorted(times)[:3]),
              '(%s previous result)' % ('no' if prev is None
                                        else 'same as' if result == prev
                                        else 'differs from'))
        prev = result

test(10_000, 10)
# test(100_000, 10)
# test(1_000_000, 1)