Pandas GroupBy

Pandas GroupBy的操作实例

任何groupby操作都会对原始对象进行以下操作:

拆分对象应用函数合并结果

在许多情况下,我们将数据分成几组,然后在每个子集上应用一些功能。在Apply功能中,我们可以执行以下操作-

聚合 − 计算汇总统计转换 − 分组操作过滤 − 在某些条件下过滤数据

现在我们创建一个DataFrame对象并对其执行所有操作。

#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df)

运行结果如下:

PointsRankTeamYear08761Riders201417892Riders201528632Devils201436733Devils201547413Kings201458124kings201567561Kings201677881Kings201786942Riders201697014Royals2014108041Royals2015116902Riders2017

将数据分成组

象可以拆分为任何对象。有多种分割对象的方法,例如:

obj.groupby('key')obj.groupby(['key1','key2'])obj.groupby(key,axis=1)

现在我们看看如何将分组对象应用于DataFrame对象

实例

#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby('Team'))

运行结果如下:

查看组

#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby('Team').groups)

运行结果如下:

{'Kings':Int64Index([4,6,7],dtype='int64'),'Devils':Int64Index([2,3],dtype='int64'),'Riders':Int64Index([0,1,8,11],dtype='int64'),'Royals':Int64Index([9,10],dtype='int64'),'kings':Int64Index([5],dtype='int64')}

实例

用多列分组

#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby(['Team','Year']).groups)

运行结果如下:

{('Kings',2014):Int64Index([4],dtype='int64'),('Royals',2014):Int64Index([9],dtype='int64'),('Riders',2014):Int64Index([0],dtype='int64'),('Riders',2015):Int64Index([1],dtype='int64'),('Kings',2016):Int64Index([6],dtype='int64'),('Riders',2016):Int64Index([8],dtype='int64'),('Riders',2017):Int64Index([11],dtype='int64'),('Devils',2014):Int64Index([2],dtype='int64'),('Devils',2015):Int64Index([3],dtype='int64'),('kings',2015):Int64Index([5],dtype='int64'),('Royals',2015):Int64Index([10],dtype='int64'),('Kings',2017):Int64Index([7],dtype='int64')}

遍历组

有了groupby对象,我们可以类似于itertools.obj遍历该对象。

#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Year')forname,groupingrouped:print(name)print(group)

运行结果如下:

2014PointsRankTeamYear08761Riders201428632Devils201447413Kings201497014Royals20142015PointsRankTeamYear17892Riders201536733Devils201558124kings2015108041Royals20152016PointsRankTeamYear67561Kings201686942Riders20162017PointsRankTeamYear77881Kings2017116902Riders2017

默认情况下,groupby对象的标签名称与组名称相同。

选择组p

使用get_group()方法,我们可以选择一个组。

#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Year')print(grouped.get_group(2014))

运行结果如下:

PointsRankTeamYear08761Riders201428632Devils201447413Kings201497014Royals2014

集合体

聚合函数为每个组返回一个聚合值。一旦通过组对象被创建,几个聚合操作可以在分组的数据来执行。

一个明显的方法是通过合计或等效的agg方法进行合计。

#importthepandaslibraryimportpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Year')print(grouped['Points'].agg(np.mean))

运行结果如下:

Year2014795.252015769.502016725.002017739.00Name:Points,dtype:float64

查看每个组的大小的另一种方法是通过应用size()函数。

importpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)AttributeAccessinPythonPandasgrouped=df.groupby('Team')print(grouped.agg(np.size))

运行结果如下:

PointsRankYearTeamDevils222Kings333Riders444Royals222kings111

一次应用多个聚合功能

借助分组的Series,您还可以传递函数的列表或字典来进行聚合,并生成DataFrame作为输出-

#importthepandaslibraryimportpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Team')print(grouped['Points'].agg([np.sum,np.mean,np.std]))

运行结果如下:

TeamsummeanstdDevils1536768.000000134.350288Kings2285761.66666724.006943Riders3049762.25000088.567771Royals1505752.50000072.831998kings812812.000000NaN

转换

在组或列上进行转换将返回一个索引,该索引的大小与正在分组的对象的大小相同。因此,转换应返回与组块大小相同的结果。

#importthepandaslibraryimportpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Team')score=lambdax:(x-x.mean())/x.std()*10print(grouped.transform(score))

运行结果如下:

PointsRankYear012.843272-15.000000-11.61895013.0202865.000000-3.87298327.071068-7.071068-7.0710683-7.0710687.0710687.0710684-8.60862111.547005-10.9108955NaNNaNNaN6-2.360428-5.7735032.182179710.969049-5.7735038.7287168-7.7059635.0000003.8729839-7.0710687.071068-7.071068107.071068-7.0710687.07106811-8.1575955.00000011.618950

过滤

过滤根据定义的条件过滤数据并返回数据的子集。所述过滤器()函数是用来筛选数据。

importpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby('Team').filter(lambdax:len(x)>=3))

运行结果如下:

PointsRankTeamYear08761Riders201417892Riders201547413Kings201467561Kings201677881Kings201786942Riders2016116902Riders2017
编辑于2024-05-20 13:41