Pandas GroupBy
Pandas GroupBy的操作实例
任何groupby操作都会对原始对象进行以下操作:
拆分对象应用函数合并结果
在许多情况下,我们将数据分成几组,然后在每个子集上应用一些功能。在Apply功能中,我们可以执行以下操作-
聚合 − 计算汇总统计转换 − 分组操作过滤 − 在某些条件下过滤数据
现在我们创建一个DataFrame对象并对其执行所有操作。
#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df)
运行结果如下:
PointsRankTeamYear08761Riders201417892Riders201528632Devils201436733Devils201547413Kings201458124kings201567561Kings201677881Kings201786942Riders201697014Royals2014108041Royals2015116902Riders2017
将数据分成组
象可以拆分为任何对象。有多种分割对象的方法,例如:
obj.groupby('key')obj.groupby(['key1','key2'])obj.groupby(key,axis=1)
现在我们看看如何将分组对象应用于DataFrame对象
实例
#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby('Team'))
运行结果如下:
查看组
#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby('Team').groups)
运行结果如下:
{'Kings':Int64Index([4,6,7],dtype='int64'),'Devils':Int64Index([2,3],dtype='int64'),'Riders':Int64Index([0,1,8,11],dtype='int64'),'Royals':Int64Index([9,10],dtype='int64'),'kings':Int64Index([5],dtype='int64')}
实例
用多列分组
#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby(['Team','Year']).groups)
运行结果如下:
{('Kings',2014):Int64Index([4],dtype='int64'),('Royals',2014):Int64Index([9],dtype='int64'),('Riders',2014):Int64Index([0],dtype='int64'),('Riders',2015):Int64Index([1],dtype='int64'),('Kings',2016):Int64Index([6],dtype='int64'),('Riders',2016):Int64Index([8],dtype='int64'),('Riders',2017):Int64Index([11],dtype='int64'),('Devils',2014):Int64Index([2],dtype='int64'),('Devils',2015):Int64Index([3],dtype='int64'),('kings',2015):Int64Index([5],dtype='int64'),('Royals',2015):Int64Index([10],dtype='int64'),('Kings',2017):Int64Index([7],dtype='int64')}
遍历组
有了groupby对象,我们可以类似于itertools.obj遍历该对象。
#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Year')forname,groupingrouped:print(name)print(group)
运行结果如下:
2014PointsRankTeamYear08761Riders201428632Devils201447413Kings201497014Royals20142015PointsRankTeamYear17892Riders201536733Devils201558124kings2015108041Royals20152016PointsRankTeamYear67561Kings201686942Riders20162017PointsRankTeamYear77881Kings2017116902Riders2017
默认情况下,groupby对象的标签名称与组名称相同。
选择组p
使用get_group()方法,我们可以选择一个组。
#importthepandaslibraryimportpandasaspdipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Year')print(grouped.get_group(2014))
运行结果如下:
PointsRankTeamYear08761Riders201428632Devils201447413Kings201497014Royals2014
集合体
聚合函数为每个组返回一个聚合值。一旦通过组对象被创建,几个聚合操作可以在分组的数据来执行。
一个明显的方法是通过合计或等效的agg方法进行合计。
#importthepandaslibraryimportpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Year')print(grouped['Points'].agg(np.mean))
运行结果如下:
Year2014795.252015769.502016725.002017739.00Name:Points,dtype:float64
查看每个组的大小的另一种方法是通过应用size()函数。
importpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)AttributeAccessinPythonPandasgrouped=df.groupby('Team')print(grouped.agg(np.size))
运行结果如下:
PointsRankYearTeamDevils222Kings333Riders444Royals222kings111
一次应用多个聚合功能
借助分组的Series,您还可以传递函数的列表或字典来进行聚合,并生成DataFrame作为输出-
#importthepandaslibraryimportpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Team')print(grouped['Points'].agg([np.sum,np.mean,np.std]))
运行结果如下:
TeamsummeanstdDevils1536768.000000134.350288Kings2285761.66666724.006943Riders3049762.25000088.567771Royals1505752.50000072.831998kings812812.000000NaN
转换
在组或列上进行转换将返回一个索引,该索引的大小与正在分组的对象的大小相同。因此,转换应返回与组块大小相同的结果。
#importthepandaslibraryimportpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)grouped=df.groupby('Team')score=lambdax:(x-x.mean())/x.std()*10print(grouped.transform(score))
运行结果如下:
PointsRankYear012.843272-15.000000-11.61895013.0202865.000000-3.87298327.071068-7.071068-7.0710683-7.0710687.0710687.0710684-8.60862111.547005-10.9108955NaNNaNNaN6-2.360428-5.7735032.182179710.969049-5.7735038.7287168-7.7059635.0000003.8729839-7.0710687.071068-7.071068107.071068-7.0710687.07106811-8.1575955.00000011.618950
过滤
过滤根据定义的条件过滤数据并返回数据的子集。所述过滤器()函数是用来筛选数据。
importpandasaspdimportnumpyasnpipl_data={'Team':['Riders','Riders','Devils','Devils','Kings','kings','Kings','Kings','Riders','Royals','Royals','Riders'],'Rank':[1,2,2,3,3,4,1,1,2,4,1,2],'Year':[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}df=pd.DataFrame(ipl_data)print(df.groupby('Team').filter(lambdax:len(x)>=3))
运行结果如下:
PointsRankTeamYear08761Riders201417892Riders201547413Kings201467561Kings201677881Kings201786942Riders2016116902Riders2017
编辑于2024-05-20 13:41