Writing Web Crawlers in Python 3, Part 02: Using the Basic Request Library requests

2023-11-08 21:55 • python • 2608 reads

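I. Using the requests library (install it first with pip install requests)

import requests  # import the requests library

response = requests.get("https://www.baidu.com")  # send a GET request to the URL

print(response)  # prints the response object, e.g. <Response [200]>

To pass extra information, for example name = germey and age = 22, append it to the URL as a query string:

req = requests.get("http://httpbin.org/get?name=germey&age=22")

This can be written more simply by passing a dict through the params argument: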

import requests

data = {
    'name': 'germey',
    'age': 22,
}
req = requests.get("http://httpbin.org/get", params=data)

print(req.text)
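
To see what params does without sending anything over the network, the following sketch builds the request locally and prints the final URL. It assumes requests is installed and uses Request(...).prepare(), which constructs a request without sending it.

```python
import requests

data = {
    'name': 'germey',
    'age': 22,
}

# Build the request locally; nothing is sent over the network.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()

# The dict has been URL-encoded and appended as the query string.
print(prepared.url)
```

Values that need escaping (spaces, non-ASCII characters) are percent-encoded for you, which is the main reason to prefer params over building the query string by hand.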

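The response body here is actually a str in JSON format, so it can be parsed directly by calling the json() method:

import requests

req = requests.get("http://httpbin.org/get")
print(type(req.text))    # <class 'str'>
print(req.json())
print(type(req.json()))  # <class 'dict'>

Calling json() converts a response body that is a JSON-formatted string into a dict. If the body is not valid JSON, json() raises an exception.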
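
The conversion itself can be illustrated offline with the standard json module. This is only a sketch: sample_text is a hand-written, abbreviated stand-in for what httpbin.org/get might return, not a captured response.

```python
import json

# Hypothetical, abbreviated response body (an assumption for illustration).
sample_text = '{"args": {"age": "22", "name": "germey"}, "url": "http://httpbin.org/get?name=germey&age=22"}'

parsed = json.loads(sample_text)  # the same kind of str-to-dict parsing json() performs

print(type(sample_text))
print(type(parsed))
print(parsed['args']['name'])

# Text that is not JSON cannot be parsed and raises an error.
try:
    json.loads('<html>not json</html>')
    is_json = True
except ValueError:
    is_json = False
print(is_json)
```

Note that query parameters come back as strings: the age sent as the integer 22 appears as "22" in the args field.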

II. Scraping a web page: Zhihu -> Explore

import requests
import re

header={
"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
}

req = requests.get('https://www.zhihu.com/explore',headers=header)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>',re.S)
titles = re.findall(pattern,req.text)
print(titles)
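
To show how the pattern picks out titles, here is a sketch that runs the same regular expression against a small HTML snippet. The snippet is hypothetical, written to match the class names in the pattern; the markup the live site serves may differ.

```python
import re

# Hypothetical HTML shaped like the page the pattern was written for.
sample_html = '''
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/1">
First question
  </a></h2>
</div>
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/2">Second question</a></h2>
</div>
'''

# re.S lets "." match newlines, so one match can span several lines.
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)

# Strip the surrounding whitespace captured along with each title.
titles = [title.strip() for title in re.findall(pattern, sample_html)]
print(titles)
```

The lazy quantifier .*? matters here: a greedy .* would run to the last </a> in the page and return a single oversized match.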

Fetching the GitHub site icon:

import requests

req = requests.get('https://github.com/favicon.ico')
print(req.text)
print(req.content)

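The first print shows garbled characters, because req.text decodes the binary image as text. The second output starts with b, which means req.content is raw binary data of type bytes.

Download the icon and save it locally:

import requests

req = requests.get('https://github.com/favicon.ico')

# open in binary write mode, since req.content is bytes
with open('favicon.ico', 'wb') as f:
    f.write(req.content)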

III. POST requests with requests

import requests

header={
"User-Agent":"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
}
data = {
'name' : 'germey',
'age' : 22
}
req = requests.post('https://httpbin.org/post',data=data,headers=header)
print(req.text)
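
The data argument sends the dict as a URL-encoded form. As a sketch of how that differs from the json argument, the code below prepares both requests locally (nothing is sent) and prints the headers and bodies that requests would transmit.

```python
import json
import requests

payload = {'name': 'germey', 'age': 22}
url = 'https://httpbin.org/post'

# Prepare both variants without sending them.
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

print(form_req.headers['Content-Type'])  # form encoding
print(form_req.body)

print(json_req.headers['Content-Type'])  # JSON encoding
print(json.loads(json_req.body))
```

Use data for endpoints that expect an HTML form submission and json for APIs that expect a JSON body.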

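Besides text and content, the response has other attributes: the status code (status_code), the response headers (headers), the cookies (cookies), the URL (url), and the request history (history).

requests also provides a built-in status code lookup object, requests.codes:

import requests

req = requests.get("http://www.baidu.com")
# a non-200 status means the request failed, not necessarily that it timed out
if req.status_code == requests.codes.ok:
    print('Request succeeded')
else:
    print('Request failed with status', req.status_code)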
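
As an alternative to comparing status codes by hand, a response can raise an exception for any 4xx or 5xx status through raise_for_status(). The sketch below shows this offline; the Response object is built by hand purely for illustration, whereas in real code it comes from requests.get().

```python
import requests

# requests.codes maps readable names to numeric status codes.
print(requests.codes.ok)
print(requests.codes.not_found)

# Hypothetical failed response, constructed manually for illustration.
resp = requests.Response()
resp.status_code = 404

try:
    resp.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    outcome = 'ok'
except requests.exceptions.HTTPError as exc:
    outcome = 'failed'
    print('request failed:', exc)
```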

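IV. Advanced usage

1. Simulating a file upload

import requests

# favicon.ico must be in the same directory as this script
with open('favicon.ico', 'rb') as f:
    req = requests.post('http://httpbin.org/post', files={'file': f})
print(req.text)  # for a file upload, the response contains a separate files field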
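
Under the hood, files makes requests encode the body as multipart/form-data. A minimal offline sketch follows; fake_icon is placeholder bytes standing in for a real file, and the (filename, content, content type) tuple form is used so no file on disk is needed.

```python
import requests

fake_icon = b'\x00\x00\x01\x00'  # placeholder bytes, not a real icon

# Each entry may be a (filename, content, content_type) tuple.
files = {'file': ('favicon.ico', fake_icon, 'image/x-icon')}

# Prepare the upload locally; nothing is sent.
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

print(prepared.headers['Content-Type'])  # multipart/form-data plus a generated boundary
print(len(prepared.body), 'bytes in the encoded body')
```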

2. Cookies

Example of retrieving cookies:

import requests

req = requests.get('https://www.baidu.com')
print(req.cookies)  # the returned object is of type RequestsCookieJar

for key, value in req.cookies.items():  # use items() to iterate over name/value pairs
    print(key + '=' + value)
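
A RequestsCookieJar can also be built by hand and sent with a request. The sketch below works offline; the cookie names and values are made up for illustration.

```python
import requests
from requests.cookies import RequestsCookieJar

# Build a jar manually; names and values here are hypothetical.
jar = RequestsCookieJar()
jar.set('number', '123456', domain='httpbin.org', path='/')
jar.set('theme', 'dark', domain='httpbin.org', path='/')

# The jar supports the same items() iteration as req.cookies.
for key, value in jar.items():
    print(key + '=' + value)

# Convert the jar to a plain dict when only names and values matter.
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)

# Either form can be passed along, e.g. requests.get(url, cookies=jar)
```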

Using cookies to stay logged in, with Zhihu as the example:

import requests

headers={
'Cookie' : '_zap=f03025ef-667e-4288-ba1d-fc1a8311d9a2; d_c0="AVCitvzSlA6PTtCufP48PF-2VJaklo6Z_LE=|1543285589"; q_c1=7ae656bc62b1417eb01d68786f1c95be|1543285592000|1543285592000; l_cap_>',  # value truncated in the source; use the Cookie header from your own logged-in browser session
'Host' : 'www.zhihu.com',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
}
req = requests.get('https://www.zhihu.com',headers=headers)
req.encoding = 'utf-8'  # avoid garbled Chinese characters
print(req.text)
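
Instead of placing the raw string in a Cookie header, the cookies can be passed as a dict through the cookies parameter. The helper below is a hypothetical sketch (cookie_header_to_dict is not part of requests) and the cookie string is invented; it splits a copied header into a dict and prepares a request offline to show the result.

```python
import requests

def cookie_header_to_dict(header):
    """Split a raw 'k1=v1; k2=v2' Cookie header into a dict (hypothetical helper)."""
    cookies = {}
    for item in header.split(';'):
        if '=' not in item:
            continue  # skip empty or malformed fragments
        name, value = item.strip().split('=', 1)  # split on the first "=" only
        cookies[name] = value
    return cookies

raw = 'session_id=abc123; theme=dark'  # made-up cookie string
cookies = cookie_header_to_dict(raw)
print(cookies)

# Prepare (but do not send) a request to see the header requests would build.
prepared = requests.Request('GET', 'https://www.zhihu.com', cookies=cookies).prepare()
print(prepared.headers.get('Cookie'))
```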

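3. Session persistence

Calling get() or post() directly does simulate a web request, but each call is effectively a separate session.

In other words, it is as if you opened the pages in two different browsers.

For example, you log in to a site with a POST request and then try to fetch your profile with a GET request. The second call behaves like a second browser that has never logged in.

Can two completely unrelated sessions retrieve the logged-in user's information? Obviously not.

You could set the same cookies on both requests by hand, and that does work, but it is tedious.

A better solution is the Session object. Without one, the cookie set by the first request is not sent with the second: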

import requests

requests.get('http://httpbin.org/cookies/set/number/123456')  # test URL that sets a cookie
req = requests.get('http://httpbin.org/cookies')  # ask the server which cookies it received
print(req.text)  # the cookies in the result are empty

4. Using a Session

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456')
req = s.get('http://httpbin.org/cookies')
print(req.text)  # the cookie is returned successfully
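
A Session keeps a cookie jar and default headers and merges them into every request it sends. The offline sketch below makes that visible by preparing a request through the session without sending it; the User-Agent value is an arbitrary placeholder.

```python
import requests

s = requests.Session()

# Defaults stored on the session apply to every request made through it.
s.headers.update({'User-Agent': 'demo-crawler/0.1'})  # placeholder value
s.cookies.set('number', '123456', domain='httpbin.org', path='/')

# Prepare a request through the session; nothing is sent.
prepared = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))

print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])

s.close()
```

In real crawling the cookie would be stored by an earlier response, such as a login, rather than set by hand.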

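5. SSL certificate verification

Certificate verification is controlled by the verify parameter and is on by default. Pass verify=False to skip it.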
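
With verify=False, every request triggers an InsecureRequestWarning. Here is a minimal sketch of a wrapper that skips verification and silences the warning for that call only; get_without_verification is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import warnings

import requests
from urllib3.exceptions import InsecureRequestWarning

def get_without_verification(url, timeout=10):
    """GET a URL with certificate checks disabled (hypothetical helper)."""
    with warnings.catch_warnings():
        # Suppress the warning inside this block only.
        warnings.simplefilter('ignore', InsecureRequestWarning)
        return requests.get(url, verify=False, timeout=timeout)

# Verification is enabled unless explicitly turned off.
default_verify = requests.Session().verify
print(default_verify)
```

verify also accepts a path to a CA bundle file, which is usually a safer choice than disabling verification when a site uses a private certificate authority.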

6. Proxy settings

Use the proxies parameter.

For example:

import requests
proxies = {
'http': 'http://127.0.0.1:12345',
'https': 'http://127.0.0.1:12345',
}
requests.get('https://www.taobao.com',proxies=proxies)

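If the proxy requires HTTP Basic Auth (the client has not authenticated yet and must supply a username and password before it is allowed through), put the credentials in the proxy URL.

Example:

import requests
proxies = {
    'http': 'http://user:password@127.0.0.1:12345/',
    'https': 'http://user:password@127.0.0.1:12345/',  # needed for the https:// URL below
}
requests.get('https://www.taobao.com', proxies=proxies)

requests also supports SOCKS proxies (SOCKS is a network protocol that relays traffic between a client and an external server through an intermediary).

Install the SOCKS extra first (pip install -U requests[socks]):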

import requests
proxies = {
'http': 'socks5://user:password@host:port',
'https': 'socks5://user:password@host:port',
}
requests.get('https://www.taobao.com',proxies=proxies)

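7. Timeout settings: the timeout parameter takes a number of seconds, or a (connect, read) tuple such as timeout=(5, 11), meaning 5 seconds to establish the connection and 11 seconds to read the response. There is no third value for a total time limit. The default is None, which waits indefinitely.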
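
A sketch of handling the two timeouts separately follows. The function name fetch and its defaults are assumptions for illustration; the exception classes are the ones requests raises for connect and read timeouts.

```python
import requests

def fetch(url, connect_timeout=5, read_timeout=11):
    """Return the response, or None if a timeout is hit (hypothetical helper)."""
    try:
        # timeout=(connect, read): the two phases are limited separately.
        return requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.ConnectTimeout:
        print('could not connect within', connect_timeout, 'seconds')
    except requests.exceptions.ReadTimeout:
        print('server sent no data for', read_timeout, 'seconds')
    return None
```

Both exceptions derive from requests.exceptions.Timeout, so a single except clause on that class is enough when the distinction does not matter. The read timeout limits the wait between bytes from the server, not the duration of the whole download.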

8. Authentication

import requests
from requests.auth import HTTPBasicAuth

req = requests.get('http://localhost:',auth=HTTPBasicAuth('username','password'))
print(req.status_code)

Shorthand (a plain tuple is treated as HTTP Basic Auth):

import requests
req = requests.get('http://localhost:',auth=('username','password'))
print(req.status_code)
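
To see what Basic Auth actually adds to a request, this offline sketch prepares one and decodes the resulting Authorization header. The URL is httpbin's basic-auth test endpoint, used here only as a placeholder because nothing is sent.

```python
import base64

import requests

# Prepare (but do not send) a request that carries Basic Auth credentials.
prepared = requests.Request(
    'GET',
    'http://httpbin.org/basic-auth/username/password',
    auth=('username', 'password'),
).prepare()

header = prepared.headers['Authorization']
print(header)

# The token is only base64-encoded, so it decodes straight back.
scheme, token = header.split(' ', 1)
decoded = base64.b64decode(token).decode()
print(scheme, decoded)
```

Because the credentials are encoded rather than encrypted, Basic Auth should only be used over HTTPS.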

OAuth authentication (requires the requests_oauthlib package: pip install requests_oauthlib); for reference only:

import requests
from requests_oauthlib import OAuth1

url = 'http://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY','YOUR_APP_SECRET','USER_OAUTH_TOKEN','USER_OAUTH_TOKEN_SECRET')
requests.get(url,auth=auth)

9. Prepared Request

from requests import Request,Session

url = 'http://httpbin.org/post'

data = {
'name':'germey'
}
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
}

s = Session()
req = Request('POST',url,data=data,headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)
