Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I would like to compare two parts of two different columns from an Excel file that have a different number of elements. The comparison should be made between a part of Column 3 and a part of Column 2. Column 3 part has a length of j elements and Column 2 has a length of k elements(k>j). Column 2 part starts from row "j+1" and column 3 part starts from row 1. If an element from column 3 part is matching an element from column 2 part, then should check if the element from column1, before the j row, which has the same index as matched item from column 3 part is matching with the element from Column 1 part between j+1 and k, which has the same index as matched item from column 2 part. If yes, then should be written the element from Column 4 with the same index as matched element from column 2 part in a new Excel sheet.

Example: Column3[1]==Column2[2](which represents element 'A') => Column1[1]==Column1[j+2](which represents element 'P') => Column4[j+2] should be written in a new sheet.

Column 1 Column 2 Column 3 Column 4
  P         F        A          S
  B         G        X          T
  C         H        K          V
  D         I        M          W
  P         B        R          B
  P         A        R          D
  C         D        H          E
  D         E        J          k
  E         M        K          W
  F         F        L          Q
  Q         F        K          Q

For reading the Excel sheet cells from original sheet, I have used the df27.ix[:j-1,1].

One part of the code which reads the values of the mention part from column 3 and column 2 might be:

for j in range(1,j):
        c3=sheet['B'+str(j)].value
        for k in range(j,j+k):
                c2=sheet['B'+str(k)].value

Any hint how I can accomplish this?

UPDATED

I have tried a new code which takes in consideration that we have '-', like joaquin mentioned in his example.

Joaquin's example:

   C1  C2  C3  C4
0   P   -   A   -
1   B   -   X   -
2   C   -   K   -
3   D   -   M   -
4   P   B   -   B
5   P   A   -   D
6   C   D   -   E
7   D   E   -   k
8   E   M   -   W
9   F   F   -   Q
10  Q   F   -   Q

New code:

from pandas import DataFrame as df
import pandas as pd
import openpyxl

wb=openpyxl.load_workbook('/media/sf_vboxshared/x.xlsx')
sheet=wb.get_sheet_by_name('Sheet1')
C13=[]
C12=[]
C1=[]
C2=[]
C3=[]
for s in range(2, sheet.max_row+1):
        C1second=sheet['A'+str(s)].value
        C2second=sheet['B'+str(s)].value
        C3second=sheet['C'+str(s)].value
        C1.append(C1second)
        C2.append(C2second)
        C3.append(C3second)
        C1=[x.encode('UTF8') for x in C1]
for y in C2:
        if y is not None:
                C2=[x.encode('UTF8') if x is not None else None for x in C2]
for z in C3:
        if z is not None:
                C3=[x.encode('UTF8') if x is not None else None for x in C3]
for x in C1:
        C13.append(x)
for x in C3:
        C13.append(x)
for x in C1:
        C12.append(x)
for x in C2:
        C12.append(x)
tosave = pd.DataFrame()
df[C13]=pd.DataFrame(C13)
df[C12]=pd.DataFrame(C12)
for item in df[C13]:
    if '-' in item: continue
    new = df[df[C12] == item]
    tosave = tosave.append(new)

But I still get the following error: df[C13]=pd.DataFrame(C13) TypeError: 'type' object does not support item assignment. Any idea what is wrong?

Many thanks in advance, Dan

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
129 views
Welcome To Ask or Share your Answers For Others

1 Answer

Given your df is

    C1  C2  C3  C4
0   P   -   A   -
1   B   -   X   -
2   C   -   K   -
3   D   -   M   -
4   P   B   -   B
5   P   A   -   D
6   C   D   -   E
7   D   E   -   k
8   E   M   -   W
9   F   F   -   Q
10  Q   F   -   Q

then, I combine C1 and C3 and C1 and C2

df['C13'] = df.apply(lambda x: x['C1'] + x['C3'], axis=1)
df['C12'] = df.apply(lambda x: x['C1'] + x['C2'], axis=1)

and compare which rows have the same pair of characters in columns C13 and C12, and save them in tosave

tosave = p.DataFrame()

for item in df['C13']:
    if '-' in item: continue
    new = df[df['C12'] == item]
    tosave = tosave.append(new)

this gives you a tosave dataframe with the rows matching:

   C1   C2  C3  C4  C13 C12
5   P   A   -   D   P-  PA 

That can be directly saved as it is or you can save just column C4

UPDATE: If you have data on each row, then you can not use the '-' detection (or any other kind of detection based on the differences between empty and filled columns). On the other hand, if j,k are not defined (for any j and k), your problem is actually reduced to find, for each row, identical pairs below that row. In consecuence, this:

tosave = p.DataFrame()

for idx, item in enumerate(df['C13']):
    new = df[df['C12'] == item]
    tosave = tosave.append(new.loc[idx+1:])

solves the problem given your labels and data is like:

    C1  C2  C3  C4
0   P   F   A   S
1   B   G   X   T
2   C   H   K   V
3   D   I   M   W
4   P   B   R   B
5   P   A   R   D
6   C   D   H   E
7   D   E   J   k
8   E   M   K   W
9   F   F   L   Q
10  Q   F   K   Q

This code also produces the same output as before:

   C1   C2  C3  C4  C13 C12
5   P   A   R   D   PR  PA

Note this probably needs some refinenment (p.e when a row produces 2 matches, the second row with produce 1 match, and you will need to remove replicates from the final output).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...