Crawler 3: HTML pages + the webdriver module + demo

2024-04-01 03:06 • html • 4812 views

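Some well-protected sites cannot be scraped by fetching the page with a plain requests call. In that case, the webdriver module can launch a real browser first and scrape the page from there. It can even perform actions such as click on the page and then scrape again.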

Typical workflow of the demo:

1) Import the selenium module

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

2) Use the Firefox browser (Chrome works as well)

driver = webdriver.Firefox()

3) Open the page with get (the URL is omitted for confidentiality)

driver.get("http://www.---------------")

4) Select elements with a CSS selector

from selenium.webdriver.common.by import By  # Selenium 4 removed find_elements_by_css_selector
elements = driver.find_elements(By.CSS_SELECTOR, "span.c9.ng-binding")

5) The webdriver module's own selection features are not very convenient, so it is better to take the page as HTML and filter it with BeautifulSoup

html = driver.page_source
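The introduction notes that the browser can also click on the page before scraping again. The following is only a minimal sketch of that idea, not the author's code: the helper name `click_and_get_html`, the `main` function, and the selector `"a.next"` are hypothetical, and the URL is left as a parameter because the original omits it. The selenium import sits inside `main` so that the helper itself can be tried with any object that behaves like a driver.

```python
def click_and_get_html(driver, css_selector):
    """Click the first element matching css_selector, then return the page HTML."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    element = driver.find_element("css selector", css_selector)
    element.click()
    # page_source reflects the DOM as the browser currently holds it.
    return driver.page_source


def main(url):
    # Imported here so the helper above does not need a browser to be loaded.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # "a.next" is a made-up selector used only for illustration.
        html = click_and_get_html(driver, "a.next")
        print(len(html))
    finally:
        driver.quit()  # always close the browser, even if a step fails
```

Calling `main` with the real address runs the whole sequence. If the click triggers content that loads asynchronously, the returned HTML may not yet contain it, so a wait before reading `page_source` may be needed.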

6) Filter the content with BeautifulSoup; find_all and CSS selectors are the more convenient options

from bs4 import BeautifulSoup
import re

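soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning
# soup.find_all('div',text=re.compile(u"信息"))[0]
# Print every <a> tag whose href contains "human"
for i in soup.select('a[href*="human"]'):
    print(i)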
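To see how the two filtering styles behave without opening a browser, they can be tried on a small static page. This is a sketch under assumptions: `SAMPLE_HTML` and `filter_page` are invented for illustration, and the keyword "信息" ("information") is taken from the commented-out line in the demo. `string=` is the current name of the argument that older bs4 versions called `text=`.

```python
import re

from bs4 import BeautifulSoup

# Made-up page standing in for driver.page_source.
SAMPLE_HTML = """
<html><body>
  <div>基本信息</div>
  <div>other</div>
  <a href="/human/1">first</a>
  <a href="/robot/2">second</a>
  <a href="/list?type=human">third</a>
</body></html>
"""


def filter_page(html):
    """Return (texts of matching divs, hrefs of matching links)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all with a compiled regex keeps <div> tags whose text contains "信息".
    info_divs = soup.find_all("div", string=re.compile("信息"))
    # CSS attribute selector: <a> tags whose href contains the substring "human".
    human_links = soup.select('a[href*="human"]')
    return [d.get_text() for d in info_divs], [a["href"] for a in human_links]


info, links = filter_page(SAMPLE_HTML)
print(info)
print(links)
```

The regex form is handy for matching on visible text, while the selector form is usually shorter when the target is identified by tag, class, or attribute.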
