Pandas: Counting Index Pairwise Occurrences Of Same Value In A Dataframe
Solution 1:
Here is a solution that requires looping over columns only.
res = sum(df[c].transform(lambda x: x == df[c]) for c in df.columns)
However, if your data is fairly sparse, using loops or graphs may end up being faster.
Solution 2:
I figured out how to do it with a bit of linear algebra.
First, the trial matrix is decomposed into a sum corresponding to each number (the cluster numbers should start at 1 in order to formulate the method mathematically though this isn't necessary in the implementation).
That is:
trial 0 trial 1 trial 2 trial 3
index 0 0 1 0 0index 1 0 1 1 0index 2 2 0 2 0index 3 1 2 2 1
becomes
trial 0 trial 1 trial 2 trial 3
index 0 1 2 1 1index 1 1 2 2 1index 2 3 1 3 1index 3 2 3 3 2
(incremented by one), which is decomposed as follows:
T = 1 0 1 1 + 2 * 0 1 0 0 + 3 * 0 0 0 0
1 0 0 1 0 1 1 0 0 0 0 0
0 1 0 1 0 0 0 0 1 0 1 0
0 0 0 0 1 0 0 1 0 1 1 0
Then each (normalized) component matrix is multiplied by its transpose and summed together:
C1*C1.T/1 + C2*C2.T/2 + C3*C3.T/3
Where Ci
is the matrix component of T
corresponding to the cluster number i
.
This sum is then the resultant co-occurrence matrix. Below is the implementation and the result for the above example:
test = pd.DataFrame(np.array([[0, 1, 0, 0],
[0, 1, 1, 0],
[2, 0, 2, 0],
[1, 2, 2, 1]]),
columns = ['trial 1', 'trial 2', 'trial 3', 'trial 4'])
test_val = test.values
# Base matrix that will be added to.
curr_mat = np.zeros((test_val.shape[0], test_val.shape[0]))
# Max index of matrix components (i.e. max_val + 1 is number of clusters/matrix components)
max_val = np.max(test_val)
for n_clus inrange(max_val + 1):
# Extract component matrix corresponding to current iteration.
clus_mem = (test_val == n_clus).astype(int)
curr_mat += np.dot(clus_mem, clus_mem.T)
res = pd.DataFrame(curr_mat, index=test.index, columns=test.index)
With the result:
index0index1index2index3index04310index13410index21141index30014
Unfortunately I had to use a for
loop but the number of iterations is now only the number of clusters and I take advantage of numpy's efficient array operations.
Post a Comment for "Pandas: Counting Index Pairwise Occurrences Of Same Value In A Dataframe"