Pandas Text Processing

Pandas text processing: worked examples

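In this chapter we discuss string operations on a basic Series / Index. In later chapters we will learn how to apply these string functions to a DataFrame.

Pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions skip missing/NaN values: a NaN input stays NaN in the result instead of raising an error.

Almost all of these methods mirror Python's built-in string methods; see https://docs.python.org/3/library/stdtypes.html#string-methods for the full list. To use them, go through the .str accessor of the Series object, which exposes its elements as strings, and then call the operation.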
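The following minimal sketch illustrates how the string methods are reached through the `.str` accessor on both a Series and an Index. The sample values and variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
names = pd.Series(['Tom', 'William Rick', 'John'])
labels = pd.Index(['a_1', 'b_2', 'c_3'])

# String methods live under the .str accessor and apply element-wise.
upper_names = names.str.upper()
upper_labels = labels.str.upper()

# Non-string values can be converted to strings first with astype(str).
numbers = pd.Series([10, 200, 3000])
digit_counts = numbers.astype(str).str.len()

print(upper_names.tolist())   # ['TOM', 'WILLIAM RICK', 'JOHN']
print(upper_labels.tolist())  # ['A_1', 'B_2', 'C_3']
print(digit_counts.tolist())  # [2, 3, 4]
```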

Let's look at how each operation works.

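Method	Description
lower()	Converts the strings in the Series/Index to lowercase.
upper()	Converts the strings in the Series/Index to uppercase.
len()	Computes the length of each string.
strip()	Strips whitespace (including newlines) from both sides of each string in the Series/Index.
split(' ')	Splits each string on the given pattern.
cat(sep=' ')	Concatenates the Series/Index elements using the given separator.
get_dummies()	Returns a DataFrame of one-hot encoded values.
contains(pattern)	Returns True for each element that contains the pattern, otherwise False.
replace(a,b)	Replaces the value a with the value b.
repeat(value)	Repeats each element the specified number of times.
count(pattern)	Returns the number of times the pattern occurs in each element.
startswith(pattern)	Returns True if the element in the Series/Index starts with the pattern.
endswith(pattern)	Returns True if the element in the Series/Index ends with the pattern.
find(pattern)	Returns the position of the first occurrence of the pattern, or -1 if it does not occur.
findall(pattern)	Returns a list of all occurrences of the pattern.
swapcase()	Swaps upper and lower case.
islower()	Checks whether all characters in each string of the Series/Index are lowercase. Returns a Boolean.
isupper()	Checks whether all characters in each string of the Series/Index are uppercase. Returns a Boolean.
isnumeric()	Checks whether all characters in each string of the Series/Index are numeric. Returns a Boolean.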

Let's create a Series and see how all of the functions above work.

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s)

Output:

    0             Tom
    1    William Rick
    2            John
    3         Alber@t
    4             NaN
    5            1234
    6      SteveSmith
    dtype: object

lower()

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s.str.lower())

Output:

    0             tom
    1    william rick
    2            john
    3         alber@t
    4             NaN
    5            1234
    6      stevesmith
    dtype: object

upper()

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s.str.upper())

Output:

    0             TOM
    1    WILLIAM RICK
    2            JOHN
    3         ALBER@T
    4             NaN
    5            1234
    6      STEVESMITH
    dtype: object

len()

Example

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s.str.len())

Output:

    0     3.0
    1    12.0
    2     4.0
    3     7.0
    4     NaN
    5     4.0
    6    10.0
    dtype: float64
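In the output above the lengths are floats, since a NaN cannot be stored in a plain integer column. A minimal sketch, assuming you only need integer lengths for the non-missing entries, is to drop the missing values first (the shortened sample Series and the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', np.nan, '1234'])

# The NaN entry stays missing in the result.
lengths = s.str.len()

# Dropping missing values first allows a clean conversion to int.
int_lengths = s.dropna().str.len().astype(int)

print(lengths.isna().tolist())  # [False, False, True, False]
print(int_lengths.tolist())     # [3, 12, 4]
```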

strip()

Example

    import pandas as pd

    # 'Tom ' has a trailing space and ' William Rick' a leading space
    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s)
    print("After Stripping:")
    print(s.str.strip())

Output:

    0             Tom
    1     William Rick
    2             John
    3          Alber@t
    dtype: object
    After Stripping:
    0             Tom
    1    William Rick
    2            John
    3         Alber@t
    dtype: object

split(pattern)

Example

    import pandas as pd

    # 'Tom' is followed by 10 spaces; 'William Rick' is preceded by 5 spaces
    s = pd.Series(['Tom          ', '     William Rick', 'John', 'Alber@t'])
    print(s)
    print("Split Pattern:")
    print(s.str.split(' '))

Output:

    0        Tom
    1         William Rick
    2                 John
    3              Alber@t
    dtype: object
    Split Pattern:
    0    [Tom, , , , , , , , , , ]
    1    [, , , , , William, Rick]
    2                       [John]
    3                    [Alber@t]
    dtype: object
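Each element of the split result is a list. As a minimal sketch of two common follow-ups, `expand=True` spreads the pieces across DataFrame columns, and `.str[0]` picks the first piece of every list. The sample names below are hypothetical.

```python
import pandas as pd

# Hypothetical names, each containing at most one space.
s = pd.Series(['William Rick', 'Steve Smith', 'John'])

# expand=True returns a DataFrame with one column per piece;
# rows with fewer pieces are padded with missing values.
parts = s.str.split(' ', expand=True)

# .str[0] indexes into each list and returns the first piece.
first_words = s.str.split(' ').str[0]

print(parts)
print(first_words.tolist())  # ['William', 'Steve', 'John']
```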

cat(sep=pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.cat(sep='_'))

Output:

    Tom_William Rick_John_Alber@t
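When the Series contains missing values, `cat` leaves them out of the joined string by default. A small sketch, using an illustrative Series with one NaN, shows the default behaviour next to the `na_rep` parameter, which substitutes a placeholder instead:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', np.nan, 'John'])

# By default the missing value is skipped.
joined = s.str.cat(sep='_')

# na_rep gives missing values a visible placeholder.
joined_with_placeholder = s.str.cat(sep='_', na_rep='?')

print(joined)                   # Tom_John
print(joined_with_placeholder)  # Tom_?_John
```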

get_dummies()

Example

    import pandas as pd

    # 'Tom ' has a trailing space and ' William Rick' a leading space
    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.get_dummies())

Output:

        William Rick  Alber@t  John  Tom
    0              0        0     0     1
    1              1        0     0     0
    2              0        0     1     0
    3              0        1     0     0
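`get_dummies` can also split each string into several labels before encoding. A minimal sketch, assuming hypothetical tag strings separated by `|` (the default separator):

```python
import pandas as pd

# Hypothetical multi-label strings; '|' separates the labels.
tags = pd.Series(['a|b', 'a', 'b|c'])

# One indicator column per distinct label, in sorted order.
dummies = tags.str.get_dummies(sep='|')

print(dummies)
```

Each row has a 1 in every column whose label appears in that row's string.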

contains()

Example

    import pandas as pd

    # Check which strings contain a space
    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s.str.contains(' '))

Output:

    0     True
    1     True
    2    False
    3    False
    dtype: bool
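The Boolean result of `contains` is often used as a filter. The sketch below, on illustrative data that includes a NaN, passes `na=False` so missing entries count as non-matches, and `regex=False` so that `.` is matched literally rather than as a regular-expression wildcard:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', np.nan, 'Alber@t', 'a.b'])

# na=False treats missing values as "does not contain".
has_space = s.str.contains(' ', na=False)

# regex=False matches the dot as a plain character.
has_dot = s.str.contains('.', regex=False, na=False)

print(s[has_space].tolist())  # ['William Rick']
print(s[has_dot].tolist())    # ['a.b']
```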

replace(a,b)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s)
    print("After replacing @ with $:")
    print(s.str.replace('@', '$'))

Output:

    0             Tom
    1    William Rick
    2            John
    3         Alber@t
    dtype: object
    After replacing @ with $:
    0             Tom
    1    William Rick
    2            John
    3         Alber$t
    dtype: object
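`replace` can work with either literal text or a regular expression. Because the default for the `regex` parameter changed in pandas 2.0 (from True to False), the sketch below sets it explicitly in both calls. The sample strings are hypothetical.

```python
import pandas as pd

s = pd.Series(['Alber@t', 'T.o.m', 'John'])

# Literal replacement: only actual '.' characters are removed.
no_dots = s.str.replace('.', '', regex=False)

# Regex replacement: remove every character that is not an ASCII letter.
letters_only = s.str.replace(r'[^A-Za-z]', '', regex=True)

print(no_dots.tolist())       # ['Alber@t', 'Tom', 'John']
print(letters_only.tolist())  # ['Albert', 'Tom', 'John']
```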

repeat(value)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.repeat(2))

Output:

    0                      TomTom
    1    William RickWilliam Rick
    2                    JohnJohn
    3              Alber@tAlber@t
    dtype: object

count(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print("Number of 'm' characters in each string:")
    print(s.str.count('m'))

Output:

    Number of 'm' characters in each string:
    0    1
    1    1
    2    0
    3    0
    dtype: int64

startswith(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print("Strings that start with 'T':")
    print(s.str.startswith('T'))

Output:

    Strings that start with 'T':
    0     True
    1    False
    2    False
    3    False
    dtype: bool

endswith(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print("Strings that end with 't':")
    print(s.str.endswith('t'))

Output:

    Strings that end with 't':
    0    False
    1    False
    2    False
    3     True
    dtype: bool

find(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.find('e'))

Output:

    0   -1
    1   -1
    2   -1
    3    3
    dtype: int64

A value of -1 means the pattern was not found in that element.
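Since -1 marks "not found", comparing the positions against zero turns the result of `find` into a Boolean mask. A minimal sketch (variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

# Position of the first 'e' in each string, or -1 when absent.
positions = s.str.find('e')

# Keep only the elements where 'e' was actually found.
matched = s[positions >= 0]

print(positions.tolist())  # [-1, -1, -1, 3]
print(matched.tolist())    # ['Alber@t']
```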

findall(pattern)

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.findall('e'))

Output:

    0     []
    1     []
    2     []
    3    [e]
    dtype: object

An empty list ([]) means no match was found in that element.
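The pattern given to `findall` is a regular expression, so it can match more than a single fixed character. As an illustrative sketch, the character class `[A-Z]` collects every uppercase ASCII letter, and `count` with the same pattern reports how many there are:

```python
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

# All uppercase ASCII letters in each string, as a list per element.
capitals = s.str.findall(r'[A-Z]')

# Number of matches of the same pattern in each string.
capital_counts = s.str.count(r'[A-Z]')

print(capitals.tolist())        # [['T'], ['W', 'R'], ['J'], ['A']]
print(capital_counts.tolist())  # [1, 2, 1, 1]
```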

swapcase()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.swapcase())

Output:

    0             tOM
    1    wILLIAM rICK
    2            jOHN
    3         aLBER@T
    dtype: object

islower()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.islower())

Output:

    0    False
    1    False
    2    False
    3    False
    dtype: bool

isupper()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.isupper())

Output:

    0    False
    1    False
    2    False
    3    False
    dtype: bool

isnumeric()

Example

    import pandas as pd

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print(s.str.isnumeric())

Output:

    0    False
    1    False
    2    False
    3    False
    dtype: bool
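Every mixed-case name above fails all three checks, so the results are all False. For contrast, here is a small sketch on hypothetical data where each check succeeds for exactly one element:

```python
import pandas as pd

# Hypothetical data: one lowercase, one uppercase, one numeric, one mixed.
s = pd.Series(['tom', 'JOHN', '1234', 'Alber@t'])

lower_mask = s.str.islower()
upper_mask = s.str.isupper()
numeric_mask = s.str.isnumeric()

print(lower_mask.tolist())    # [True, False, False, False]
print(upper_mask.tolist())    # [False, True, False, False]
print(numeric_mask.tolist())  # [False, False, True, False]
```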

Edited on 2024-05-20 13:41