Python Group By

October 27, 2023 Post a Comment

Assume that I have a set of data pair where index 0 is the value and index 1 is the type: input = [ ('11013331', 'KAT'), ('9085267', 'NOT'), ('5238

Solution 1:

Do it in 2 steps. First, create a dictionary.

>>>input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')]>>>from collections import defaultdict>>>res = defaultdict(list)>>>for v, k ininput: res[k].append(v)...

Then, convert that dictionary into the expected format.

>>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}]

It is also possible with itertools.groupby but it requires the input to be sorted first.

>>> sorted_input = sorted(input, key=itemgetter(1))
>>> groups = groupby(sorted_input, key=itemgetter(1))
>>> [{'type':k, 'items':[x[0] for x in v]} for k, v in groups]
[{'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}]

Note both of these do not respect the original order of the keys. You need an OrderedDict if you need to keep the order.

>>> from collections import OrderedDict
>>> res = OrderedDict()
>>> for v, k ininput:
... if k in res: res[k].append(v)
... else: res[k] = [v]
... >>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}]

Solution 2:

Python's built-in itertools module actually has a groupby function , but for that the elements to be grouped must first be sorted such that the elements to be grouped are contiguous in the list:

from operator import itemgetter
sortkeyfn = itemgetter(1)
input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), 
 ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), 
 ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')] 
input.sort(key=sortkeyfn)

Now input looks like:

[('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'),
 ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH'), ('11013331', 'KAT'),
 ('9843236', 'KAT'), ('9085267', 'NOT'), ('11788544', 'NOT')]

groupby returns a sequence of 2-tuples, of the form (key, values_iterator). What we want is to turn this into a list of dicts where the 'type' is the key, and 'items' is a list of the 0'th elements of the tuples returned by the values_iterator. Like this:

from itertools import groupby
result = []
for key,valuesiter in groupby(input, key=sortkeyfn):
    result.append(dict(type=key, items=list(v[0] for v in valuesiter)))

Now result contains your desired dict, as stated in your question.

You might consider, though, just making a single dict out of this, keyed by type, and each value containing the list of values. In your current form, to find the values for a particular type, you'll have to iterate over the list to find the dict containing the matching 'type' key, and then get the 'items' element from it. If you use a single dict instead of a list of 1-item dicts, you can find the items for a particular type with a single keyed lookup into the master dict. Using groupby, this would look like:

result = {}
forkey,valuesiter ingroupby(input, key=sortkeyfn):
    result[key] = list(v[0] forvin valuesiter)

result now contains this dict (this is similar to the intermediate res defaultdict in @KennyTM's answer):

{'NOT': ['9085267', '11788544'], 
 'ETH': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 
 'KAT': ['11013331', '9843236']}

(If you want to reduce this to a one-liner, you can:

result = dict((key,list(v[0] for v in valuesiter)
              forkey,valuesiter in groupby(input, key=sortkeyfn))

or using the newfangled dict-comprehension form:

result = {key:list(v[0] forvin valuesiter)
              forkey,valuesiter ingroupby(input, key=sortkeyfn)}

Solution 3:

This answer is similar to @PaulMcG's answer but doesn't require sorting the input.

For those into functional programming, groupBy can be written in one line (not including imports!), and unlike itertools.groupby it doesn't require the input to be sorted:

from functools import reduce # import needed for python3; builtin in python2from collections import defaultdict

defgroupBy(key, seq):
 return reduce(lambda grp, val: grp[key(val)].append(val) or grp, seq, defaultdict(list))

(The reason for ... or grp in the lambda is that for this reduce() to work, the lambda needs to return its first argument; because list.append() always returns None the or will always return grp. I.e. it's a hack to get around python's restriction that a lambda can only evaluate a single expression.)

This returns a dict whose keys are found by evaluating the given function and whose values are a list of the original items in the original order. For the OP's example, calling this as groupBy(lambda pair: pair[1], input) will return this dict:

{'KAT': [('11013331', 'KAT'), ('9843236', 'KAT')],
 'NOT': [('9085267', 'NOT'), ('11788544', 'NOT')],
 'ETH': [('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH')]}

And as per @PaulMcG's answer the OP's requested format can be found by wrapping that in a list comprehension. So this will do it:

result = {key: [pair[0] for pair in values],
          for key, values in groupBy(lambda pair: pair[1], input).items()}

Solution 4:

I also liked pandas simple grouping. it's powerful, simple and most adequate for large data set

result = pandas.DataFrame(input).groupby(1).groups

Solution 5:

The following function will quickly (no sorting required) group tuples of any length by a key having any index:

# given a sequence of tuples like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],# returns a dict grouping tuples by idx-th element - with idx=1 we have:# if merge is True {'c':(3,6,88,4),     'a':(7,2,45,0)}# if merge is False {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
def group_by(seqs,idx=0,merge=True):
    d = dict()
    for seq in seqs:
        k = seq[idx]
        v = d.get(k,tuple()) + (seq[:idx]+seq[idx+1:]if merge else (seq[:idx]+seq[idx+1:],))
        d.update({k:v})
    return d

In the case of your question, the index of key you want to group by is 1, therefore:

group_by(input,1)

gives

{'ETH': ('5238761','5349618','962142','7795297','7341464','5594916','1550003'),
 'KAT': ('11013331', '9843236'),
 'NOT': ('9085267', '11788544')}

which is not exactly the output you asked for, but might as well suit your needs.

alezinhacris