Skip to content Skip to sidebar Skip to footer

Pandas: Counting Index Pairwise Occurrences Of Same Value In A Dataframe

I cluster dataframe indices based on a similarity matrix multiple times (trials) and store the cluster assignments in a dataframe as follows: trial 0 trial 1 trial 2 tri

Solution 1:

Here is a solution that requires looping over columns only.

res = sum(df[c].transform(lambda x: x == df[c]) for c in df.columns)

However, if your data is fairly sparse, using loops or graphs may end up being faster.

Solution 2:

I figured out how to do it with a bit of linear algebra.

First, the trial matrix is decomposed into a sum corresponding to each number (the cluster numbers should start at 1 in order to formulate the method mathematically though this isn't necessary in the implementation).

That is:

        trial 0  trial 1  trial 2  trial 3
index 0    0        1        0        0index 1    0        1        1        0index 2    2        0        2        0index 3    1        2        2        1

becomes

        trial 0  trial 1  trial 2  trial 3
index 0    1        2        1        1index 1    1        2        2        1index 2    3        1        3        1index 3    2        3        3        2

(incremented by one), which is decomposed as follows:

T = 1  0  1  1  +  2 * 0  1  0  0  + 3 * 0  0  0  0
    1  0  0  1         0  1  1  0        0  0  0  0
    0  1  0  1         0  0  0  0        1  0  1  0
    0  0  0  0         1  0  0  1        0  1  1  0

Then each (normalized) component matrix is multiplied by its transpose and summed together:

C1*C1.T/1 + C2*C2.T/2 + C3*C3.T/3

Where Ci is the matrix component of T corresponding to the cluster number i.

This sum is then the resultant co-occurrence matrix. Below is the implementation and the result for the above example:

test = pd.DataFrame(np.array([[0, 1, 0, 0], 
                              [0, 1, 1, 0], 
                              [2, 0, 2, 0], 
                              [1, 2, 2, 1]]), 
                    columns = ['trial 1', 'trial 2', 'trial 3', 'trial 4'])
test_val = test.values

# Base matrix that will be added to.
curr_mat = np.zeros((test_val.shape[0], test_val.shape[0]))

# Max index of matrix components (i.e. max_val + 1 is number of clusters/matrix components)
max_val = np.max(test_val)

for n_clus inrange(max_val + 1):

    # Extract component matrix corresponding to current iteration.
    clus_mem = (test_val == n_clus).astype(int)
    curr_mat += np.dot(clus_mem, clus_mem.T)

res = pd.DataFrame(curr_mat, index=test.index, columns=test.index)

With the result:

index0index1index2index3index04310index13410index21141index30014

Unfortunately I had to use a for loop but the number of iterations is now only the number of clusters and I take advantage of numpy's efficient array operations.

Post a Comment for "Pandas: Counting Index Pairwise Occurrences Of Same Value In A Dataframe"