Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I'm experienced with Pandas but stumbled upon a problem that I can't seem to figure out.

I have a large dataset ((40,000, 16)) and I am trying to group it by a specific column ("group_name" for this matter) and then for each group apply the following rules so it'd merge into one row per group:

  • x1, x2, x3 are the "important" columns, if one row has less nulls than the others, take it. (see example with row D)
  • If there are conflicts in any column, it's arbitrary and we can pick whatever.
  • Combine the nulls on the important fields (x1, x2, x3), see example with row A.

Here is an example with 6 rows that should turn into 4 groups (aka 4 rows).

Input, result and some explanations.

So far I have groups = df.groupby['group_name']

I tried many other solutions such as summing each group, applying a transformation, aggregating by each 'important' column, merging on each 'important' column and more. Each solution brought it's own problems so I'm offering this question here without limiting people to a certain way.

Also, I spent nearly two days combining different solutions from other questions but none has seem to work. Perhaps I've missed something.

  • Please note that since this is a large dataset, I'd very much like to avoid using for loop on each group since efficiency is something to consider here.

I hope I explained everything properly, please let me know if something is unclear.

Code to re-create the dataframe (thanks to @Henry Ecker from the first answer):

df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', None, None, 'abc', 'xx', None],
    'x2': [None, 'b', None, 'def', 'yy', None],
    'x3': [None, None, None, None, 'zz', 'ff']
})
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
144 views
Welcome To Ask or Share your Answers For Others

1 Answer

Try with groupby aggregate 'first' to get the first (valid) value from every column for each group_name:

new_df = df.groupby('group_name', as_index=False).agg('first')

new_df:

  group_name       z1        z2        z3      zN    x1    x2    x3
0          A   value1    value2    value3  valueN     a     b  None
1          B   value1    value2    value3  valueN  None  None  None
2          C   value1    value2    value3  valueN   abc   def  None
3          D  value99  value100  value101  valueN    xx    yy    zz

*Note if those are string 'null' mask them out first + fillna to put them back:

new_df = (
    df.mask(df.eq('null'))
        .groupby('group_name', as_index=False).agg('first')
        .fillna('null')
)

new_df:

  group_name       z1        z2        z3      zN    x1    x2    x3
0          A   value1    value2    value3  valueN     a     b  null
1          B   value1    value2    value3  valueN  null  null  null
2          C   value1    value2    value3  valueN   abc   def  null
3          D  value99  value100  value101  valueN    xx    yy    zz

DataFrame used:

  group_name               z1         z2        z3         zN    x1    x2    x3
0          A           value1     value2    value3     valueN     a  None  None
1          A  different_value     value2    value3     valueN  None     b  None
2          B           value1     value2    value3     valueN  None  None  None
3          C           value1     value2    value3     valueN   abc   def  None
4          D          value99   value100  value101     valueN    xx    yy    zz
5          D         value999  value1000  value101  valueN200  None  None    ff
df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', None, None, 'abc', 'xx', None],
    'x2': [None, 'b', None, 'def', 'yy', None],
    'x3': [None, None, None, None, 'zz', 'ff']
})

DataFrame with string 'null' used:

  group_name               z1         z2        z3         zN    x1    x2    x3
0          A           value1     value2    value3     valueN     a  null  null
1          A  different_value     value2    value3     valueN  null     b  null
2          B           value1     value2    value3     valueN  null  null  null
3          C           value1     value2    value3     valueN   abc   def  null
4          D          value99   value100  value101     valueN    xx    yy    zz
5          D         value999  value1000  value101  valueN200  null  null    ff
df = pd.DataFrame({
    'group_name': ['A', 'A', 'B', 'C', 'D', 'D'],
    'z1': ['value1', 'different_value', 'value1',
           'value1', 'value99', 'value999'],
    'z2': ['value2'] * 4 + ['value100', 'value1000'],
    'z3': ['value3'] * 4 + ['value101', 'value101'],
    'zN': ['valueN'] * 5 + ['valueN200'],
    'x1': ['a', 'null', 'null', 'abc', 'xx', 'null'],
    'x2': ['null', 'b', 'null', 'def', 'yy', 'null'],
    'x3': ['null', 'null', 'null', 'null', 'zz', 'ff']
})

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...