python学习——pandas的拼接操作
pandas的拼接分为两种:
- 级联:pd.concat, pd.append
- 合并:pd.merge, pd.join
0. 回顾numpy的级联
============================================
练习12:
- 生成2个3*3的矩阵,对其分别进行两个维度上的级联
============================================
In [19]:
import numpy as np import pandas as pd from pandas import Series,DataFrame
In [20]:
nd = np.random.randint(0,10,size=(3,3)) nd
Out[20]:
array([[6, 3, 4], [3, 9, 8], [8, 7, 8]])
In [24]:
np.concatenate((nd,nd),axis=0)#0代表行间操作
Out[24]:
array([[6, 3, 4], [3, 9, 8], [8, 7, 8], [6, 3, 4], [3, 9, 8], [8, 7, 8]])
In [25]:
np.concatenate([nd,nd],axis=1)#1代表列间操作,()huo[]效果一样
Out[25]:
array([[6, 3, 4, 6, 3, 4], [3, 9, 8, 3, 9, 8], [8, 7, 8, 8, 7, 8]])
为方便讲解,我们首先定义一个生成DataFrame的函数:
In [26]:
def make_df(inds,cols): #字典的key作为列名进行展示 data = {key:[key+str(i) for i in inds]for key in cols} return DataFrame(data,index=inds,columns=cols)
In [28]:
make_df([1,2],list('AB'))
Out[28]:
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
1. 使用pd.concat()级联
pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True)
1) 简单级联
和np.concatenate一样,优先增加行数(默认axis=0)
In [29]:
df1 = make_df([0,1],list('AB')) df2 = make_df([2,3],list('AB'))
In [30]:
display(df1,df2)
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
A | B | |
---|---|---|
2 | A2 | B2 |
3 | A3 | B3 |
可以通过设置axis来改变级联方向
In [31]:
pd.concat([df1,df2])
Out[31]:
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
2 | A2 | B2 |
3 | A3 | B3 |
In [32]:
pd.concat((df1,df2),axis = 1)
Out[32]:
A | B | A | B | |
---|---|---|---|---|
0 | A0 | B0 | NaN | NaN |
1 | A1 | B1 | NaN | NaN |
2 | NaN | NaN | A2 | B2 |
3 | NaN | NaN | A3 | B3 |
注意index在级联时可以重复
也可以选择忽略ignore_index,重新索引
In [34]:
pd.concat((df1,df2),axis=1,ignore_index=True)
Out[34]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | A0 | B0 | NaN | NaN |
1 | A1 | B1 | NaN | NaN |
2 | NaN | NaN | A2 | B2 |
3 | NaN | NaN | A3 | B3 |
或者使用多层索引 keys
concat([x,y],keys=['x','y'])
In [13]:
pd.concat([df1,df2],keys=['x','y'])
Out[13]:
A | B | ||
---|---|---|---|
x | 0 | A1 | B1 |
1 | A2 | B2 | |
y | 0 | A3 | B3 |
1 | A4 | B4 |
In [ ]:
#pd 模块 import pandas as pd #df1,df2 具体的实例 #级联的方法,属于上一级,DataFrame来自pandas
============================================
练习13:
想一想级联的应用场景?
使用昨天的知识,建立一个期中考试张三、李四的成绩表ddd
假设新增考试学科"计算机",如何实现?
新增王老五同学的成绩,如何实现?
============================================
In [ ]:
2) 不匹配级联
不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致
In [38]:
df1 = make_df([1,2],list('AB')) df2 = make_df([2,4],list('BC')) display(df1,df2)
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
B | C | |
---|---|---|
2 | B2 | C2 |
4 | B4 | C4 |
有3种连接方式:
- 外连接:补NaN(默认模式)
In [39]:
pd.concat([df1,df2])
C:\Users\BLX\AppData\Roaming\Python\Python37\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'. To retain the current behavior and silence the warning, pass 'sort=True'. """Entry point for launching an IPython kernel.
Out[39]:
A | B | C | |
---|---|---|---|
1 | A1 | B1 | NaN |
2 | A2 | B2 | NaN |
2 | NaN | B2 | C2 |
4 | NaN | B4 | C4 |
- 内连接:只连接匹配的项
In [41]:
#合并显示共有数据 pd.concat((df1,df2),join = 'inner',axis = 1)
Out[41]:
A | B | B | C | |
---|---|---|---|---|
2 | A2 | B2 | B2 | C2 |
- 连接指定轴 join_axes
In [42]:
df2.columns
Out[42]:
Index(['B', 'C'], dtype='object')
In [43]:
#join_axex以某个DataFrame的列索引为新的列索引值 pd.concat([df1,df2],join_axes=[df2.columns])
Out[43]:
B | C | |
---|---|---|
1 | B1 | NaN |
2 | B2 | NaN |
2 | B2 | C2 |
4 | B4 | C4 |
============================================
练习14:
假设【期末】考试ddd2的成绩没有张三的,只有李四、王老五、赵小六的,使用多种方法级联
============================================
3) 使用append()函数添加
由于在后面级联的使用非常普遍,因此有一个函数append专门用于在后面添加
In [44]:
display(df1,df2)
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
B | C | |
---|---|---|
2 | B2 | C2 |
4 | B4 | C4 |
In [49]:
#append函数属于DataFrame,concat这函数属于pandas模块 #pd.concat((df1,df2)) df1.append(df2)
Out[49]:
A | B | C | |
---|---|---|---|
1 | A1 | B1 | NaN |
2 | A2 | B2 | NaN |
2 | NaN | B2 | C2 |
4 | NaN | B4 | C4 |
============================================
练习15:
新建一个只有张三李四王老五的期末考试成绩单ddd3,使用append()与期中考试成绩表ddd级联
============================================
2. 使用pd.merge()合并
merge与concat的区别在于,merge需要依据某一共同的行或列来进行合并
使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。
注意每一列元素的顺序不要求一致
1) 一对一合并
In [54]:
#merge根据相同的元素进行合并的 df1 = DataFrame({'employee':['Po','Sara','Danis'], 'group':['sail','couting','marketing']}) df2 = DataFrame({'employee':['Po','Sara','Bush'], 'work_time':[2,3,1]}) display(df1,df2)
employee | group | |
---|---|---|
0 | Po | sail |
1 | Sara | couting |
2 | Danis | marketing |
employee | work_time | |
---|---|---|
0 | Po | 2 |
1 | Sara | 3 |
2 | Bush | 1 |
In [55]:
pd.merge(df1,df2)
Out[55]:
employee | group | work_time | |
---|---|---|---|
0 | Po | sail | 2 |
1 | Sara | couting | 3 |
In [56]:
df1.merge(df2)
Out[56]:
employee | group | work_time | |
---|---|---|---|
0 | Po | sail | 2 |
1 | Sara | couting | 3 |
2) 多对一合并
In [57]:
df1 = DataFrame({'employee':['Po','Sara','Danis'], 'group':['sail','couting','marketing']}) df2 = DataFrame({'employee':['Po','Po','Bush'], 'work_time':[2,3,1]}) display(df1,df2)
employee | group | |
---|---|---|
0 | Po | sail |
1 | Sara | couting |
2 | Danis | marketing |
employee | work_time | |
---|---|---|
0 | Po | 2 |
1 | Po | 3 |
2 | Bush | 1 |
In [58]:
pd.merge(df1,df2)
Out[58]:
employee | group | work_time | |
---|---|---|---|
0 | Po | sail | 2 |
1 | Po | sail | 3 |
3) 多对多合并
In [61]:
df1 = DataFrame({'employee':['Po','Po','Danis'], 'group':['sail','couting','marketing']}) df2 = DataFrame({'employee':['Po','Po','Bush'], 'work_time':[2,3,1]}) display(df1,df2)
employee | group | |
---|---|---|
0 | Po | sail |
1 | Po | couting |
2 | Danis | marketing |
employee | work_time | |
---|---|---|
0 | Po | 2 |
1 | Po | 3 |
2 | Bush | 1 |
In [62]:
pd.merge(df1,df2)
Out[62]:
employee | group | work_time | |
---|---|---|---|
0 | Po | sail | 2 |
1 | Po | sail | 3 |
2 | Po | couting | 2 |
3 | Po | couting | 3 |
4) key的规范化
- 使用on=显式指定哪一列为key,当有多个key相同时使用
In [66]:
df3 = DataFrame({'employee':['Po','Summer','Flower'], 'group':['sail','marketing','serch'], 'salary':[12000,10000,8000]}) df4 = DataFrame({'employee':['Po','Winter','Flower'], 'group':['marketing','marketing','serch'], 'work_time':[2,1,5]}) display(df3,df4)
employee | group | salary | |
---|---|---|---|
0 | Po | sail | 12000 |
1 | Summer | marketing | 10000 |
2 | Flower | serch | 8000 |
employee | group | work_time | |
---|---|---|---|
0 | Po | marketing | 2 |
1 | Winter | marketing | 1 |
2 | Flower | serch | 5 |
In [67]:
pd.merge(df3,df4)
Out[67]:
employee | group | salary | work_time | |
---|---|---|---|---|
0 | Flower | serch | 8000 | 5 |
In [70]:
pd.merge(df3,df4,on='employee')
Out[70]:
employee | group_x | salary | group_y | work_time | |
---|---|---|---|---|---|
0 | Po | sail | 12000 | marketing | 2 |
1 | Flower | serch | 8000 | serch | 5 |
In [73]:
pd.merge(df3,df4,on='group',suffixes=['_A','_B'])
Out[73]:
employee_A | group | salary | employee_B | work_time | |
---|---|---|---|---|---|
0 | Summer | marketing | 10000 | Po | 2 |
1 | Summer | marketing | 10000 | Winter | 1 |
2 | Flower | serch | 8000 | Flower | 5 |
- 使用left_on和right_on指定左右两边的列作为key,当左右两边的key都不想等时使用
- 参数1为左,参数2为右
In [79]:
df3 = DataFrame({'employer':['Po','Summer','Flower'], 'Team':['sail','marketing','serch'], 'salary':[12000,10000,8000]}) df4 = DataFrame({'employee':['Po','Winter','Flower'], 'group':['marketing','marketing','serch'], 'work_time':[2,1,5]}) display(df3,df4)
employer | Team | salary | |
---|---|---|---|
0 | Po | sail | 12000 |
1 | Summer | marketing | 10000 |
2 | Flower | serch | 8000 |
employee | group | work_time | |
---|---|---|---|
0 | Po | marketing | 2 |
1 | Winter | marketing | 1 |
2 | Flower | serch | 5 |
In [81]:
pd.merge(df3,df4,left_on='employer',right_on='employee')
Out[81]:
employer | Team | salary | employee | group | work_time | |
---|---|---|---|---|---|---|
0 | Po | sail | 12000 | Po | marketing | 2 |
1 | Flower | serch | 8000 | Flower | serch | 5 |
In [82]:
pd.merge(df3,df4,left_on='Team',right_on='group')
Out[82]:
employer | Team | salary | employee | group | work_time | |
---|---|---|---|---|---|---|
0 | Summer | marketing | 10000 | Po | marketing | 2 |
1 | Summer | marketing | 10000 | Winter | marketing | 1 |
2 | Flower | serch | 8000 | Flower | serch | 5 |
============================================
练习16:
假设有两份成绩单,除了ddd是张三李四王老五之外,还有ddd4是张三和赵小六的成绩单,如何合并?
如果ddd4中张三的名字被打错了,成为了张十三,怎么办?
自行练习多对一,多对多的情况
自学left_index,right_index
============================================
5) 内合并与外合并
- 内合并:只保留两者都有的key(默认模式)
In [85]:
df1 = DataFrame({'age':[18,22,33],'height':[175,169,180]}) df2 = DataFrame({'age':[18,23,31],'weight':[65,70,80]})
In [86]:
pd.merge(df1,df2)
Out[86]:
age | height | weight | |
---|---|---|---|
0 | 18 | 175 | 65 |
In [87]:
df1.merge(df2,how='inner')
Out[87]:
age | height | weight | |
---|---|---|---|
0 | 18 | 175 | 65 |
- 外合并 how='outer':补NaN
In [88]:
df1.merge(df2,how = 'outer')
Out[88]:
age | height | weight | |
---|---|---|---|
0 | 18 | 175.0 | 65.0 |
1 | 22 | 169.0 | NaN |
2 | 33 | 180.0 | NaN |
3 | 23 | NaN | 70.0 |
4 | 31 | NaN | 80.0 |
- 左合并、右合并:how='left',how='right',
In [89]:
df1.merge(df2,how = 'left')#保留左侧
Out[89]:
age | height | weight | |
---|---|---|---|
0 | 18 | 175 | 65.0 |
1 | 22 | 169 | NaN |
2 | 33 | 180 | NaN |
In [90]:
pd.merge(df1,df2,how='right')#保留右侧
Out[90]:
age | height | weight | |
---|---|---|---|
0 | 18 | 175.0 | 65 |
1 | 23 | NaN | 70 |
2 | 31 | NaN | 80 |
============================================
练习17:
如果只有张三赵小六语数英三个科目的成绩,如何合并?
考虑应用情景,使用多种方式合并ddd与ddd4
============================================
6) 列冲突的解决
当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key,配合suffixes指定冲突列名
可以使用suffixes=自己指定后缀
In [91]:
display(df3,df4)
employer | Team | salary | |
---|---|---|---|
0 | Po | sail | 12000 |
1 | Summer | marketing | 10000 |
2 | Flower | serch | 8000 |
employee | group | work_time | |
---|---|---|---|
0 | Po | marketing | 2 |
1 | Winter | marketing | 1 |
2 | Flower | serch | 5 |
In [93]:
df3.columns = ['employee','group','salary'] display(df3)
employee | group | salary | |
---|---|---|---|
0 | Po | sail | 12000 |
1 | Summer | marketing | 10000 |
2 | Flower | serch | 8000 |
In [94]:
pd.merge(df3,df4,on='employee',suffixes=['_李','_王'])
Out[94]:
employee | group_李 | salary | group_王 | work_time | |
---|---|---|---|---|---|
0 | Po | sail | 12000 | marketing | 2 |
1 | Flower | serch | 8000 | serch | 5 |
============================================
练习18:
假设有两个同学都叫李四,ddd5、ddd6都是张三和李四的成绩表,如何合并?
============================================
作业
3. 案例分析:美国各州人口数据分析
首先导入文件,并查看数据样本
In [62]:
import numpy as np import pandas as pd from pandas import Series,DataFrame
In [63]:
#使用pandas读取数据 pop = pd.read_csv('../../data/state-population.csv') areas = pd.read_csv('../../data/state-areas.csv') abb = pd.read_csv('../../data/state-abbrevs.csv')
In [64]:
pop.shape
Out[64]:
(2544, 4)
In [65]:
pop.head()
Out[65]:
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
In [70]:
areas.shape
Out[70]:
(52, 2)
In [69]:
abb.shape
Out[69]:
(51, 2)
合并pop与abbrevs两个DataFrame,分别依据state/region列和abbreviation列来合并。
为了保留所有信息,使用外合并。
In [71]:
pop.head()
Out[71]:
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
In [72]:
abb.head()
Out[72]:
state | abbreviation | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
3 | Arkansas | AR |
4 | California | CA |
In [73]:
display(pop.shape,abb.shape)
(2544, 4)
(51, 2)
In [78]:
#此时的场景 left == outer left数据大于abb #left效果比outer差一些 #abb 河北 pop_m = pop.merge(abb,left_on='state/region',right_on='abbreviation',how = 'outer') pop_m.shape
Out[78]:
(2544, 6)
去除abbreviation的那一列(axis=1)
In [79]:
pop_m.head()
Out[79]:
state/region | ages | year | population | state | abbreviation | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | AL |
1 | AL | total | 2012 | 4817528.0 | Alabama | AL |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | AL |
3 | AL | total | 2010 | 4785570.0 | Alabama | AL |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | AL |
In [83]:
pop_m.drop('abbreviation',axis = 1,inplace=True)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-83-15dcfc478d0b> in <module>() ----> 1 pop_m.drop('abbreviation',axis = 1,inplace=True) /usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors) 2159 new_axis = axis.drop(labels, level=level, errors=errors) 2160 else: -> 2161 new_axis = axis.drop(labels, errors=errors) 2162 dropped = self.reindex(**{axis_name: new_axis}) 2163 try: /usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in drop(self, labels, errors) 3622 if errors != 'ignore': 3623 raise ValueError('labels %s not contained in axis' % -> 3624 labels[mask]) 3625 indexer = indexer[~mask] 3626 return self.delete(indexer) ValueError: labels ['abbreviation'] not contained in axis
In [82]:
pop_m.head()
Out[82]:
state/region | ages | year | population | state | |
---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama |
1 | AL | total | 2012 | 4817528.0 | Alabama |
2 | AL | under18 | 2010 | 1130966.0 | Alabama |
3 | AL | total | 2010 | 4785570.0 | Alabama |
4 | AL | under18 | 2011 | 1125763.0 | Alabama |
查看存在缺失数据的列。
使用.isnull().any(),只有某一列存在一个缺失数据,就会显示True。
In [88]:
pop_m.isnull().any()
Out[88]:
state/region False ages False year False population True state True dtype: bool
In [ ]:
#population 和 state这两列有数据缺失的情况
查看缺失数据
In [92]:
#为空的行索引 pop_m.loc[pop_m.isnull().any(axis = 1)]
Out[92]:
state/region | ages | year | population | state | |
---|---|---|---|---|---|
2448 | PR | under18 | 1990 | NaN | NaN |
2449 | PR | total | 1990 | NaN | NaN |
2450 | PR | total | 1991 | NaN | NaN |
2451 | PR | under18 | 1991 | NaN | NaN |
2452 | PR | total | 1993 | NaN | NaN |
2453 | PR | under18 | 1993 | NaN | NaN |
2454 | PR | under18 | 1992 | NaN | NaN |
2455 | PR | total | 1992 | NaN | NaN |
2456 | PR | under18 | 1994 | NaN | NaN |
2457 | PR | total | 1994 | NaN | NaN |
2458 | PR | total | 1995 | NaN | NaN |
2459 | PR | under18 | 1995 | NaN | NaN |
2460 | PR | under18 | 1996 | NaN | NaN |
2461 | PR | total | 1996 | NaN | NaN |
2462 | PR | under18 | 1998 | NaN | NaN |
2463 | PR | total | 1998 | NaN | NaN |
2464 | PR | total | 1997 | NaN | NaN |
2465 | PR | under18 | 1997 | NaN | NaN |
2466 | PR | total | 1999 | NaN | NaN |
2467 | PR | under18 | 1999 | NaN | NaN |
2468 | PR | total | 2000 | 3810605.0 | NaN |
2469 | PR | under18 | 2000 | 1089063.0 | NaN |
2470 | PR | total | 2001 | 3818774.0 | NaN |
2471 | PR | under18 | 2001 | 1077566.0 | NaN |
2472 | PR | total | 2002 | 3823701.0 | NaN |
2473 | PR | under18 | 2002 | 1065051.0 | NaN |
2474 | PR | total | 2004 | 3826878.0 | NaN |
2475 | PR | under18 | 2004 | 1035919.0 | NaN |
2476 | PR | total | 2003 | 3826095.0 | NaN |
2477 | PR | under18 | 2003 | 1050615.0 | NaN |
... | ... | ... | ... | ... | ... |
2514 | USA | under18 | 1999 | 71946051.0 | NaN |
2515 | USA | total | 2000 | 282162411.0 | NaN |
2516 | USA | under18 | 2000 | 72376189.0 | NaN |
2517 | USA | total | 1999 | 279040181.0 | NaN |
2518 | USA | total | 2001 | 284968955.0 | NaN |
2519 | USA | under18 | 2001 | 72671175.0 | NaN |
2520 | USA | total | 2002 | 287625193.0 | NaN |
2521 | USA | under18 | 2002 | 72936457.0 | NaN |
2522 | USA | total | 2003 | 290107933.0 | NaN |
2523 | USA | under18 | 2003 | 73100758.0 | NaN |
2524 | USA | total | 2004 | 292805298.0 | NaN |
2525 | USA | under18 | 2004 | 73297735.0 | NaN |
2526 | USA | total | 2005 | 295516599.0 | NaN |
2527 | USA | under18 | 2005 | 73523669.0 | NaN |
2528 | USA | total | 2006 | 298379912.0 | NaN |
2529 | USA | under18 | 2006 | 73757714.0 | NaN |
2530 | USA | total | 2007 | 301231207.0 | NaN |
2531 | USA | under18 | 2007 | 74019405.0 | NaN |
2532 | USA | total | 2008 | 304093966.0 | NaN |
2533 | USA | under18 | 2008 | 74104602.0 | NaN |
2534 | USA | under18 | 2013 | 73585872.0 | NaN |
2535 | USA | total | 2013 | 316128839.0 | NaN |
2536 | USA | total | 2009 | 306771529.0 | NaN |
2537 | USA | under18 | 2009 | 74134167.0 | NaN |
2538 | USA | under18 | 2010 | 74119556.0 | NaN |
2539 | USA | total | 2010 | 309326295.0 | NaN |
2540 | USA | under18 | 2011 | 73902222.0 | NaN |
2541 | USA | total | 2011 | 311582564.0 | NaN |
2542 | USA | under18 | 2012 | 73708179.0 | NaN |
2543 | USA | total | 2012 | 313873685.0 | NaN |
96 rows × 5 columns
根据数据是否缺失情况显示数据,如果缺失为True,那么显示
找到有哪些state/region使得state的值为NaN,使用unique()查看非重复值
In [94]:
condition = pop_m['state'].isnull() pop_m['state/region'][condition].unique()
Out[94]:
array(['PR', 'USA'], dtype=object)
In [95]:
areas
Out[95]:
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
5 | Colorado | 104100 |
6 | Connecticut | 5544 |
7 | Delaware | 1954 |
8 | Florida | 65758 |
9 | Georgia | 59441 |
10 | Hawaii | 10932 |
11 | Idaho | 83574 |
12 | Illinois | 57918 |
13 | Indiana | 36420 |
14 | Iowa | 56276 |
15 | Kansas | 82282 |
16 | Kentucky | 40411 |
17 | Louisiana | 51843 |
18 | Maine | 35387 |
19 | Maryland | 12407 |
20 | Massachusetts | 10555 |
21 | Michigan | 96810 |
22 | Minnesota | 86943 |
23 | Mississippi | 48434 |
24 | Missouri | 69709 |
25 | Montana | 147046 |
26 | Nebraska | 77358 |
27 | Nevada | 110567 |
28 | New Hampshire | 9351 |
29 | New Jersey | 8722 |
30 | New Mexico | 121593 |
31 | New York | 54475 |
32 | North Carolina | 53821 |
33 | North Dakota | 70704 |
34 | Ohio | 44828 |
35 | Oklahoma | 69903 |
36 | Oregon | 98386 |
37 | Pennsylvania | 46058 |
38 | Rhode Island | 1545 |
39 | South Carolina | 32007 |
40 | South Dakota | 77121 |
41 | Tennessee | 42146 |
42 | Texas | 268601 |
43 | Utah | 84904 |
44 | Vermont | 9615 |
45 | Virginia | 42769 |
46 | Washington | 71303 |
47 | West Virginia | 24231 |
48 | Wisconsin | 65503 |
49 | Wyoming | 97818 |
50 | District of Columbia | 68 |
51 | Puerto Rico | 3515 |
In [ ]:
只有两个州,对应的州名为空
为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN!
记住这样清除缺失数据NaN的方法!
In [96]:
#Puerto Rico conditon = pop_m['state/region'] == 'PR' condition
Out[96]:
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 2514 True 2515 True 2516 True 2517 True 2518 True 2519 True 2520 True 2521 True 2522 True 2523 True 2524 True 2525 True 2526 True 2527 True 2528 True 2529 True 2530 True 2531 True 2532 True 2533 True 2534 True 2535 True 2536 True 2537 True 2538 True 2539 True 2540 True 2541 True 2542 True 2543 True Name: state, Length: 2544, dtype: bool
In [97]:
pop_m['state'][condition] = 'Puerto Rico'
/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy """Entry point for launching an IPython kernel.
In [99]:
condition = pop_m['state/region'] == 'USA' pop_m['state'][condition] = 'United State'
/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [100]:
#刚才的填补操作,起作用了 pop_m.isnull().any()
Out[100]:
state/region False ages False year False population True state False dtype: bool
合并各州面积数据areas,使用左合并。
思考一下为什么使用外合并?
In [102]:
pop.head() #人口的DataFrame和abb合并,有了州名全程 #可以和areas DataFrame进行合并
Out[102]:
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
In [103]:
pop_areas_m = pop_m.merge(areas,how = 'outer')
继续寻找存在缺失数据的列
In [105]:
pop_areas_m.shape
Out[105]:
(2544, 6)
In [109]:
areas
Out[109]:
state | area (sq. mi) | |
---|---|---|
0 | Alabama | 52423 |
1 | Alaska | 656425 |
2 | Arizona | 114006 |
3 | Arkansas | 53182 |
4 | California | 163707 |
5 | Colorado | 104100 |
6 | Connecticut | 5544 |
7 | Delaware | 1954 |
8 | Florida | 65758 |
9 | Georgia | 59441 |
10 | Hawaii | 10932 |
11 | Idaho | 83574 |
12 | Illinois | 57918 |
13 | Indiana | 36420 |
14 | Iowa | 56276 |
15 | Kansas | 82282 |
16 | Kentucky | 40411 |
17 | Louisiana | 51843 |
18 | Maine | 35387 |
19 | Maryland | 12407 |
20 | Massachusetts | 10555 |
21 | Michigan | 96810 |
22 | Minnesota | 86943 |
23 | Mississippi | 48434 |
24 | Missouri | 69709 |
25 | Montana | 147046 |
26 | Nebraska | 77358 |
27 | Nevada | 110567 |
28 | New Hampshire | 9351 |
29 | New Jersey | 8722 |
30 | New Mexico | 121593 |
31 | New York | 54475 |
32 | North Carolina | 53821 |
33 | North Dakota | 70704 |
34 | Ohio | 44828 |
35 | Oklahoma | 69903 |
36 | Oregon | 98386 |
37 | Pennsylvania | 46058 |
38 | Rhode Island | 1545 |
39 | South Carolina | 32007 |
40 | South Dakota | 77121 |
41 | Tennessee | 42146 |
42 | Texas | 268601 |
43 | Utah | 84904 |
44 | Vermont | 9615 |
45 | Virginia | 42769 |
46 | Washington | 71303 |
47 | West Virginia | 24231 |
48 | Wisconsin | 65503 |
49 | Wyoming | 97818 |
50 | District of Columbia | 68 |
51 | Puerto Rico | 3515 |
In [106]:
pop_areas_m.isnull().any()
Out[106]:
state/region False ages False year False population True state False area (sq. mi) True dtype: bool
我们会发现area(sq.mi)这一列有缺失数据,为了找出是哪一行,我们需要找出是哪个state没有数据
In [110]:
cond = pop_areas_m['area (sq. mi)'].isnull() cond
Out[110]:
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 2514 True 2515 True 2516 True 2517 True 2518 True 2519 True 2520 True 2521 True 2522 True 2523 True 2524 True 2525 True 2526 True 2527 True 2528 True 2529 True 2530 True 2531 True 2532 True 2533 True 2534 True 2535 True 2536 True 2537 True 2538 True 2539 True 2540 True 2541 True 2542 True 2543 True Name: area (sq. mi), Length: 2544, dtype: bool
In [111]:
pop_areas_m['state/region'][cond]
Out[111]:
2496 USA 2497 USA 2498 USA 2499 USA 2500 USA 2501 USA 2502 USA 2503 USA 2504 USA 2505 USA 2506 USA 2507 USA 2508 USA 2509 USA 2510 USA 2511 USA 2512 USA 2513 USA 2514 USA 2515 USA 2516 USA 2517 USA 2518 USA 2519 USA 2520 USA 2521 USA 2522 USA 2523 USA 2524 USA 2525 USA 2526 USA 2527 USA 2528 USA 2529 USA 2530 USA 2531 USA 2532 USA 2533 USA 2534 USA 2535 USA 2536 USA 2537 USA 2538 USA 2539 USA 2540 USA 2541 USA 2542 USA 2543 USA Name: state/region, dtype: object
去除含有缺失数据的行
In [112]:
pop_areas_m.shape
Out[112]:
(2544, 6)
In [114]:
pop_areas_r = pop_areas_m.dropna()
In [115]:
pop_areas_r.shape
Out[115]:
(2476, 6)
查看数据是否缺失
In [116]:
pop_areas_r.isnull().any()
Out[116]:
state/region False ages False year False population False state False area (sq. mi) False dtype: bool
找出2010年的全民人口数据,df.query(查询语句)
In [117]:
pop_areas_r.head()
Out[117]:
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 | Alabama | 52423.0 |
1 | AL | total | 2012 | 4817528.0 | Alabama | 52423.0 |
2 | AL | under18 | 2010 | 1130966.0 | Alabama | 52423.0 |
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
4 | AL | under18 | 2011 | 1125763.0 | Alabama | 52423.0 |
In [120]:
t_2010 = pop_areas_r.query("ages == 'total' and year == 2010")
In [121]:
t_2010.shape
Out[121]:
(52, 6)
In [122]:
t_2010
Out[122]:
state/region | ages | year | population | state | area (sq. mi) | |
---|---|---|---|---|---|---|
3 | AL | total | 2010 | 4785570.0 | Alabama | 52423.0 |
91 | AK | total | 2010 | 713868.0 | Alaska | 656425.0 |
101 | AZ | total | 2010 | 6408790.0 | Arizona | 114006.0 |
189 | AR | total | 2010 | 2922280.0 | Arkansas | 53182.0 |
197 | CA | total | 2010 | 37333601.0 | California | 163707.0 |
283 | CO | total | 2010 | 5048196.0 | Colorado | 104100.0 |
293 | CT | total | 2010 | 3579210.0 | Connecticut | 5544.0 |
379 | DE | total | 2010 | 899711.0 | Delaware | 1954.0 |
389 | DC | total | 2010 | 605125.0 | District of Columbia | 68.0 |
475 | FL | total | 2010 | 18846054.0 | Florida | 65758.0 |
485 | GA | total | 2010 | 9713248.0 | Georgia | 59441.0 |
570 | HI | total | 2010 | 1363731.0 | Hawaii | 10932.0 |
581 | ID | total | 2010 | 1570718.0 | Idaho | 83574.0 |
666 | IL | total | 2010 | 12839695.0 | Illinois | 57918.0 |
677 | IN | total | 2010 | 6489965.0 | Indiana | 36420.0 |
762 | IA | total | 2010 | 3050314.0 | Iowa | 56276.0 |
773 | KS | total | 2010 | 2858910.0 | Kansas | 82282.0 |
858 | KY | total | 2010 | 4347698.0 | Kentucky | 40411.0 |
869 | LA | total | 2010 | 4545392.0 | Louisiana | 51843.0 |
954 | ME | total | 2010 | 1327366.0 | Maine | 35387.0 |
965 | MD | total | 2010 | 5787193.0 | Maryland | 12407.0 |
1050 | MA | total | 2010 | 6563263.0 | Massachusetts | 10555.0 |
1061 | MI | total | 2010 | 9876149.0 | Michigan | 96810.0 |
1146 | MN | total | 2010 | 5310337.0 | Minnesota | 86943.0 |
1157 | MS | total | 2010 | 2970047.0 | Mississippi | 48434.0 |
1242 | MO | total | 2010 | 5996063.0 | Missouri | 69709.0 |
1253 | MT | total | 2010 | 990527.0 | Montana | 147046.0 |
1338 | NE | total | 2010 | 1829838.0 | Nebraska | 77358.0 |
1349 | NV | total | 2010 | 2703230.0 | Nevada | 110567.0 |
1434 | NH | total | 2010 | 1316614.0 | New Hampshire | 9351.0 |
1445 | NJ | total | 2010 | 8802707.0 | New Jersey | 8722.0 |
1530 | NM | total | 2010 | 2064982.0 | New Mexico | 121593.0 |
1541 | NY | total | 2010 | 19398228.0 | New York | 54475.0 |
1626 | NC | total | 2010 | 9559533.0 | North Carolina | 53821.0 |
1637 | ND | total | 2010 | 674344.0 | North Dakota | 70704.0 |
1722 | OH | total | 2010 | 11545435.0 | Ohio | 44828.0 |
1733 | OK | total | 2010 | 3759263.0 | Oklahoma | 69903.0 |
1818 | OR | total | 2010 | 3837208.0 | Oregon | 98386.0 |
1829 | PA | total | 2010 | 12710472.0 | Pennsylvania | 46058.0 |
1914 | RI | total | 2010 | 1052669.0 | Rhode Island | 1545.0 |
1925 | SC | total | 2010 | 4636361.0 | South Carolina | 32007.0 |
2010 | SD | total | 2010 | 816211.0 | South Dakota | 77121.0 |
2021 | TN | total | 2010 | 6356683.0 | Tennessee | 42146.0 |
2106 | TX | total | 2010 | 25245178.0 | Texas | 268601.0 |
2117 | UT | total | 2010 | 2774424.0 | Utah | 84904.0 |
2202 | VT | total | 2010 | 625793.0 | Vermont | 9615.0 |
2213 | VA | total | 2010 | 8024417.0 | Virginia | 42769.0 |
2298 | WA | total | 2010 | 6742256.0 | Washington | 71303.0 |
2309 | WV | total | 2010 | 1854146.0 | West Virginia | 24231.0 |
2394 | WI | total | 2010 | 5689060.0 | Wisconsin | 65503.0 |
2405 | WY | total | 2010 | 564222.0 | Wyoming | 97818.0 |
2490 | PR | total | 2010 | 3721208.0 | Puerto Rico | 3515.0 |
对查询结果进行处理,以state列作为新的行索引:set_index
In [124]:
t_2010.set_index('state',inplace=True)
In [126]:
t_2010
Out[126]:
state/region | ages | year | population | area (sq. mi) | |
---|---|---|---|---|---|
state | |||||
Alabama | AL | total | 2010 | 4785570.0 | 52423.0 |
Alaska | AK | total | 2010 | 713868.0 | 656425.0 |
Arizona | AZ | total | 2010 | 6408790.0 | 114006.0 |
Arkansas | AR | total | 2010 | 2922280.0 | 53182.0 |
California | CA | total | 2010 | 37333601.0 | 163707.0 |
Colorado | CO | total | 2010 | 5048196.0 | 104100.0 |
Connecticut | CT | total | 2010 | 3579210.0 | 5544.0 |
Delaware | DE | total | 2010 | 899711.0 | 1954.0 |
District of Columbia | DC | total | 2010 | 605125.0 | 68.0 |
Florida | FL | total | 2010 | 18846054.0 | 65758.0 |
Georgia | GA | total | 2010 | 9713248.0 | 59441.0 |
Hawaii | HI | total | 2010 | 1363731.0 | 10932.0 |
Idaho | ID | total | 2010 | 1570718.0 | 83574.0 |
Illinois | IL | total | 2010 | 12839695.0 | 57918.0 |
Indiana | IN | total | 2010 | 6489965.0 | 36420.0 |
Iowa | IA | total | 2010 | 3050314.0 | 56276.0 |
Kansas | KS | total | 2010 | 2858910.0 | 82282.0 |
Kentucky | KY | total | 2010 | 4347698.0 | 40411.0 |
Louisiana | LA | total | 2010 | 4545392.0 | 51843.0 |
Maine | ME | total | 2010 | 1327366.0 | 35387.0 |
Maryland | MD | total | 2010 | 5787193.0 | 12407.0 |
Massachusetts | MA | total | 2010 | 6563263.0 | 10555.0 |
Michigan | MI | total | 2010 | 9876149.0 | 96810.0 |
Minnesota | MN | total | 2010 | 5310337.0 | 86943.0 |
Mississippi | MS | total | 2010 | 2970047.0 | 48434.0 |
Missouri | MO | total | 2010 | 5996063.0 | 69709.0 |
Montana | MT | total | 2010 | 990527.0 | 147046.0 |
Nebraska | NE | total | 2010 | 1829838.0 | 77358.0 |
Nevada | NV | total | 2010 | 2703230.0 | 110567.0 |
New Hampshire | NH | total | 2010 | 1316614.0 | 9351.0 |
New Jersey | NJ | total | 2010 | 8802707.0 | 8722.0 |
New Mexico | NM | total | 2010 | 2064982.0 | 121593.0 |
New York | NY | total | 2010 | 19398228.0 | 54475.0 |
North Carolina | NC | total | 2010 | 9559533.0 | 53821.0 |
North Dakota | ND | total | 2010 | 674344.0 | 70704.0 |
Ohio | OH | total | 2010 | 11545435.0 | 44828.0 |
Oklahoma | OK | total | 2010 | 3759263.0 | 69903.0 |
Oregon | OR | total | 2010 | 3837208.0 | 98386.0 |
Pennsylvania | PA | total | 2010 | 12710472.0 | 46058.0 |
Rhode Island | RI | total | 2010 | 1052669.0 | 1545.0 |
South Carolina | SC | total | 2010 | 4636361.0 | 32007.0 |
South Dakota | SD | total | 2010 | 816211.0 | 77121.0 |
Tennessee | TN | total | 2010 | 6356683.0 | 42146.0 |
Texas | TX | total | 2010 | 25245178.0 | 268601.0 |
Utah | UT | total | 2010 | 2774424.0 | 84904.0 |
Vermont | VT | total | 2010 | 625793.0 | 9615.0 |
Virginia | VA | total | 2010 | 8024417.0 | 42769.0 |
Washington | WA | total | 2010 | 6742256.0 | 71303.0 |
West Virginia | WV | total | 2010 | 1854146.0 | 24231.0 |
Wisconsin | WI | total | 2010 | 5689060.0 | 65503.0 |
Wyoming | WY | total | 2010 | 564222.0 | 97818.0 |
Puerto Rico | PR | total | 2010 | 3721208.0 | 3515.0 |
计算人口密度。注意是Series/Series,其结果还是一个Series。
In [127]:
pop_density = t_2010['population']/t_2010["area (sq. mi)"] pop_density
Out[127]:
state Alabama 91.287603 Alaska 1.087509 Arizona 56.214497 Arkansas 54.948667 California 228.051342 Colorado 48.493718 Connecticut 645.600649 Delaware 460.445752 District of Columbia 8898.897059 Florida 286.597129 Georgia 163.409902 Hawaii 124.746707 Idaho 18.794338 Illinois 221.687472 Indiana 178.197831 Iowa 54.202751 Kansas 34.745266 Kentucky 107.586994 Louisiana 87.676099 Maine 37.509990 Maryland 466.445797 Massachusetts 621.815538 Michigan 102.015794 Minnesota 61.078373 Mississippi 61.321530 Missouri 86.015622 Montana 6.736171 Nebraska 23.654153 Nevada 24.448796 New Hampshire 140.799273 New Jersey 1009.253268 New Mexico 16.982737 New York 356.094135 North Carolina 177.617157 North Dakota 9.537565 Ohio 257.549634 Oklahoma 53.778278 Oregon 39.001565 Pennsylvania 275.966651 Rhode Island 681.339159 South Carolina 144.854594 South Dakota 10.583512 Tennessee 150.825298 Texas 93.987655 Utah 32.677188 Vermont 65.085075 Virginia 187.622273 Washington 94.557817 West Virginia 76.519582 Wisconsin 86.851900 Wyoming 5.768079 Puerto Rico 1058.665149 dtype: float64
排序,并找出人口密度最高的五个州sort_values()
In [128]:
type(pop_density)
Out[128]:
pandas.core.series.Series
In [130]:
pop_density.sort_values(inplace=True)
找出人口密度最低的五个州
In [131]:
pop_density[:5]
Out[131]:
state Alaska 1.087509 Wyoming 5.768079 Montana 6.736171 North Dakota 9.537565 South Dakota 10.583512 dtype: float64
In [132]:
pop_density.tail()
Out[132]:
state Connecticut 645.600649 Rhode Island 681.339159 New Jersey 1009.253268 Puerto Rico 1058.665149 District of Columbia 8898.897059 dtype: float64
要点总结:
- 统一用loc()索引
- 善于使用.isnull().any()找到存在NaN的列
- 善于使用.unique()确定该列中哪些key是我们需要的
- 一般使用外合并、左合并,目的只有一个:宁愿该列是NaN也不要丢弃其他列的信息
回顾:Series/DataFrame运算与ndarray运算的区别
- Series与DataFrame没有广播,如果对应index没有值,则记为NaN;或者使用add的fill_value来补缺失值
- ndarray有广播,通过重复已有值来计算