Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I want to go from this:

name pet
1 Rashida dog
2 Rashida cat
3 Jim dog
4 JIm dog
question from:https://stackoverflow.com/questions/65837805/aggregate-multiple-columns-of-qualitative-data-using-pandas

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
952 views
Welcome To Ask or Share your Answers For Others

1 Answer

There are lots of different ways to do this.

If you are filtering the value of a single column, then you can use the .agg with a custom lambda function.

(df.groupby(["name"])
  .agg(
      num_dogs=("pet", lambda x: np.sum(x == "dog")), 
      num_cats=("pet", lambda x: np.sum(x == "cat")))
)

Or

(df
  .groupby(["name", "pet"])
  .size()
  .unstack("pet", fill_value=0)
  .add_prefix("num_").add_suffix("s")
)

You can also use a pivot table.

df.reset_index().pivot_table(index="name", columns="pet", values="index", aggfunc="count", fill_value=0)

But if you need to filter based on two columns, then that approach will not work. For example if you need to know how many old dogs.

df = pd.DataFrame({'name': ["Rashida", "Rashida", "Joe", "Joe"],
                   'pet': ['dog', 'cat', 'dog', 'dog'],
                   'age': ["old", "old", "old", "young"]})

You can use the pivot table.

df.reset_index().pivot_table(index="name", columns=["pet", "age"], values="index", aggfunc="count", fill_value=0)

Or a crosstabs.

pd.crosstab(df["name"], [df["pet"], df["age"]], dropna=False).unstack().reset_index()

Or you can use the port of Dplyr called siuba to mimic the original R syntax but I haven't used this enough to know how to use it well.

from siuba import group_by, summarize, _

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...