Data Aggregation in Python Pandas

Posted: 2020-11-13 14:38:27

1. Introduction

In this article we use the classic "tips.csv" dataset as an example.

import pandas as pd
import numpy as np

tips = pd.read_csv("tips.csv")
tips.head()
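
If "tips.csv" is not at hand, a small hand-made frame is enough to follow along. The sketch below is only a stand-in: it keeps just the four columns this article uses, and every value in it is invented for illustration rather than taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: the values are made up,
# only the column names match the ones used in this article.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 5.0, 6.0],
    "sex": ["Female", "Male", "Female", "Male"],
    "size": [2, 3, 2, 4],
})

# Show the first rows, as tips.head() does for the real data
print(tips.head())
```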


2. Traditional Method

Traditionally, we use groupby() and "[[" to select variables, and then summarize them with an aggregation function.

This process is easy to understand, so many people learn it first. However, it has a shortcoming:

The aggregation function is applied to each variable separately. If the summary needs a calculation that involves two or more variables, we have to do it in one or more additional steps.

For example, in the process below, the aggregation function "sum" is applied to "total_bill", "tip", and "size" separately. If we want sum(tip) / sum(size), we have to compute it in an additional step.

In other words, the process is verbose:

1. We have to name intermediate variables. A reasonable name is sometimes hard to think of, and the variable is not used anywhere else.

2. We type extra syntax just to make the code work, such as [, ", =...

3. We are using imperative programming, which may make the code harder to modify in the future.

summary_sex = tips.groupby("sex")[["total_bill", "tip", "size"]].sum()
summary_sex["average tip"] = summary_sex["tip"] / summary_sex["size"]
summary_sex


3. agg()

agg() was introduced for DataFrame and Series in pandas 0.20.0 (groupby objects supported it before that). It removes some of the verbosity through the idea of a pipeline, that is, method chaining.

to_summary = {"total_bill": np.sum, "tip": np.sum, "size": np.sum}
tips.groupby("sex").agg(to_summary)


It may not seem very different from the traditional method above. But thanks to the pipeline idea, we can keep chaining further operations after it.

1. We don't have to name intermediate variables.

2. We type less but do more, and the code is even more readable.

(By the way, if you don't like np.sum, you can use the string "sum" instead. The same works for other aggregation functions.)

to_summary = {"total_bill": np.sum, "tip": np.sum, "size": np.sum}
(tips.groupby("sex")
    .agg(to_summary)
    .assign(average_tip=lambda df: df["tip"]/df["size"])
    .round(2)
)
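
As a rough sketch of the string form mentioned above, the same chain can be written with "sum" in place of np.sum. The small frame here is a hypothetical stand-in with invented values, used only so the snippet runs without "tips.csv".

```python
import pandas as pd

# Hypothetical stand-in for tips.csv (values invented for illustration)
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 5.0, 6.0],
    "sex": ["Female", "Male", "Female", "Male"],
    "size": [2, 3, 2, 4],
})

# Aggregation functions given by name instead of as NumPy functions
to_summary = {"total_bill": "sum", "tip": "sum", "size": "sum"}

summary = (
    tips.groupby("sex")
    .agg(to_summary)                                         # one sum per column
    .assign(average_tip=lambda df: df["tip"] / df["size"])  # ratio of the two sums
    .round(2)
)
print(summary)
```

With the string form the numpy import is no longer needed for this step.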


The process is better than the traditional method, but we are still aggregating each variable separately. How can we compute a summary that involves two or more variables in one step?

4. apply()

The difference between agg() and apply() is that apply() receives the whole dataframe of each group, while agg() with a dict works on one column at a time. Because of this, apply() can compute a summary that involves two or more variables in a single step.

(But for the same reason, apply() may run slowly if the dataframe is huge.)

I am surprised by how much time I spent finding a solution for this process. I will certainly use it a lot in future daily analysis.

def func_average_tip(df):
    result = {
        "average_tip": df["tip"].sum() / df["size"].sum()
    }
    return pd.Series(result)

tips.groupby("sex").apply(func_average_tip).round(2)
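
To check that the one-step apply() version agrees with the two-step agg()-style version, the two results can be compared directly. This is a minimal sketch on a hypothetical stand-in frame with invented values. Selecting the columns before apply() is a choice made here so that each group holds only the columns the function needs; it is not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv (values invented for illustration)
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 5.0, 6.0],
    "sex": ["Female", "Male", "Female", "Male"],
    "size": [2, 3, 2, 4],
})

def func_average_tip(df):
    # df is the sub-dataframe of one group, so both columns are available
    return pd.Series({"average_tip": df["tip"].sum() / df["size"].sum()})

# One step: apply() sees "tip" and "size" of each group together
via_apply = tips.groupby("sex")[["tip", "size"]].apply(func_average_tip)

# Two steps: sum each column separately, then divide the sums
via_agg = (
    tips.groupby("sex")[["tip", "size"]]
    .sum()
    .assign(average_tip=lambda df: df["tip"] / df["size"])
    [["average_tip"]]
)

print(via_apply.round(2))
print(via_agg.round(2))
```

Both frames are indexed by "sex" and hold the same "average_tip" values, so on large data the agg()-style chain is a reasonable fallback when apply() turns out to be slow.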


Source: https://www.cnblogs.com/drvongoosewing/p/13968739.html
