Crawler 1: HTML pages + BeautifulSoup module + GET method + demo

2024-01-31 12:29 • html • 3045 views

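Preface: My company recently asked me to write a crawler to fill in data for an upcoming financial project. For confidentiality reasons I will not include the URL of the site being crawled. Below is a summary of how the spider works.

Language: Python. Tool: Jupyter.

Overview: Any discussion of spiders has to cover parsing HTML pages, and for that the BeautifulSoup module is the natural choice. It parses HTML thoroughly and makes it easy to locate the elements you want to scrape.

BeautifulSoup API documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Demo workflow:

(1) Use the requests module to fetch the page at the URL.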

import requests
url = "http://www.~~~~~~~~~~~~~~~~~~~~~~~~~~"
r = requests.get(url)
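
A slightly more defensive version of the request above might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are my own assumptions for illustration; they are not part of the original demo.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of parsing an error page.
        response.raise_for_status()
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic happens.
print(fetch_html("not-a-valid-url"))
```

Returning None on failure lets a crawl loop skip a bad page and continue instead of stopping at the first error.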

(2) Parse the HTML page with the BeautifulSoup module (a PDF page would need a different tool), turning the page returned by requests into a soup object.

from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(r.text, "html.parser")
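
To try the parsing step without touching a real site, a small inline page can stand in for the downloaded text. The markup below is made up for illustration, and `html.parser` is the parser bundled with Python's standard library.

```python
from bs4 import BeautifulSoup

# A tiny inline page stands in for r.text so the example runs offline.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Naming the parser avoids the warning bs4 emits when it has to guess one.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)
print(soup.p.get_text())
```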

(3) Use soup to find the hyperlinks (href attributes) and save them to a file for later use.

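# Write to the same file that step (4) reads back.
# Text mode with "\n" line endings; Python translates them to "\r\n" on Windows.
with open(r"E:\juchao.txt", "w", encoding="utf-8") as code:
    # href=True skips <a> tags that have no href attribute.
    for link in soup.find_all('a', href=True):
        code.write(link['href'] + '\n')
print("Download Complete!")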
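
Pages often contain relative links, so it can help to resolve each href against the page URL before saving it. This is only a sketch: the base URL and the markup are hypothetical, and `urljoin` comes from the standard library rather than from the original demo.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for illustration.
base_url = "http://example.com/list/"
html = """
<a href="/detail/000001.html">A</a>
<a href="000002.html">B</a>
<a name="anchor-without-href">C</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```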

(4) Read the saved href links back from the file written in the previous step and store them in a list.

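mylist = []
# Each line keeps its trailing newline; step (5) takes that into account.
with open(r"E:\juchao.txt", "r", encoding="utf-8") as fd:
    for line in fd:
        mylist.append(line)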
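
If the trailing newlines and blank lines are not wanted, they can be dropped while reading. In this sketch an `io.StringIO` object with invented links stands in for the saved file so that it runs anywhere.

```python
import io

# io.StringIO stands in for the saved link file; the links are made up.
saved = io.StringIO("http://example.com/a/000001\n\nhttp://example.com/a/000002\n")

# Strip the trailing newline from every line and skip blank lines.
links = [line.strip() for line in saved if line.strip()]

print(links)
```

Note that a slice that counts the trailing newline, such as `line[-7:-1]`, would need to become `line[-6:]` once the lines are stripped.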

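(5) Write the request headers so that requests (POST requests in particular) look like they come from a browser, setting the data parameter if necessary. Then build the URLs to visit; the browser's developer tools show the required values in the Network panel.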

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Cookie': 'JSESSION',  # value truncated in the original; copy the real one from your browser
    'Host': 'www.cninfo.com.cn',
    'Referer': 'http://www.~~~~~~~~~~~~~~~',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0',
    'Content-Length': '262',  # captured from a POST; leave it out for GET requests
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
}
urlpath = 'http://www.cninfo.com.cn/information/brief/szmb'
myUrls = []
for submylist in mylist:
    # The last six characters before the line break are the ID used in the page URL.
    urlId = submylist.strip()[-6:]
    myUrls.append(urlpath + urlId + '.html')
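
One way to check the headers and the assembled URLs before hitting the real site is to build a request without sending it. The sketch below assumes a placeholder base path and invented saved links; `requests.Session.prepare_request` only prepares the request, so nothing goes over the network.

```python
import requests

# Placeholder values; real header values come from the browser's network panel.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
}

# Hypothetical base path and saved links, for illustration only.
base = "http://example.com/brief/"
saved_links = ["http://example.com/list/000001\n", "http://example.com/list/000002\n"]

# Take the last six characters of each link (after stripping the newline) as the ID.
urls = [base + link.strip()[-6:] + ".html" for link in saved_links]

session = requests.Session()
session.headers.update(headers)  # sent with every request made through this session

# prepare_request() builds the request without sending it, so the merged
# headers can be inspected offline.
prepared = session.prepare_request(requests.Request("GET", urls[0]))
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Using a Session also means the headers are set once and reused for every page in the crawl.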

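(6) The newly built URLs point to the final pages we need. Fetch each one with requests (watch out for encoding issues), parse the HTML with soup, build a JSON string, and save it to a file.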

import json
with open(r"E:\juchao_json.txt", "w", encoding="utf-8") as code:
    for url in myUrls:
        r1 = requests.get(url)
        # Let requests guess the encoding from the body to avoid garbled text.
        r1.encoding = r1.apparent_encoding

        soup = BeautifulSoup(r1.text, "html.parser")
        # .zx_data elements hold the field names, .zx_data2 the matching values.
        jsonMapKey = [i.text for i in soup.select(".zx_data")]
        # The last three characters of each value are sliced off.
        jsonMapValue = [i.text[:-3] for i in soup.select(".zx_data2")]

        # zip() pairs keys with values and stops at the shorter list.
        jsonMap = dict(zip(jsonMapKey, jsonMapValue))

        # ensure_ascii=False keeps Chinese text readable in the output file.
        strJson = json.dumps(jsonMap, ensure_ascii=False)
        code.write(strJson + '\n')

print('Done!')
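
The key/value pairing in this step can be tried offline on a small table. The markup below is invented and only reuses the `.zx_data` and `.zx_data2` class names from the demo; the real page layout may differ.

```python
import json

from bs4 import BeautifulSoup

# Invented markup that reuses the .zx_data / .zx_data2 class names from the demo.
html = """
<table>
  <tr><td class="zx_data">Name</td><td class="zx_data2">Example Co.</td></tr>
  <tr><td class="zx_data">Code</td><td class="zx_data2">000001</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
keys = [tag.get_text(strip=True) for tag in soup.select(".zx_data")]
values = [tag.get_text(strip=True) for tag in soup.select(".zx_data2")]

# zip() stops at the shorter list, so a missing value cannot raise IndexError.
record = dict(zip(keys, values))

# ensure_ascii=False keeps non-ASCII text (such as Chinese) readable in the output.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one JSON object per line keeps the output file easy to read back record by record with `json.loads`.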

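Common BeautifulSoup APIs (the official documentation linked above is the recommended reference):

    1) Install: pip install beautifulsoup4

    2) Objects: Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object.

       All objects fall into four types: Tag, NavigableString, BeautifulSoup, Comment

    3) Navigating the tree: .tag (a child tag by name, e.g. soup.title)  .contents  .children  .descendants  .parent  .parents  .next_sibling  .previous_sibling  .next_element

    4) Searching the tree: find()  find_all()  find_parents()  find_next_siblings()  select()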
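
The object types and a few of the navigation and search calls listed above can be seen together on a small document. The markup is invented, and there is deliberately no whitespace between the tags so that `.next_sibling` lands on a tag rather than on a text node.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '<div id="box"><p class="a">one</p><p class="b">two</p><!--note--></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.div                        # .tag access: first <div> in the document
first_p = soup.find("p")              # first matching tag
all_p = soup.find_all("p")            # list of all matching tags
by_css = soup.select("div#box p.b")   # CSS selector search

print(isinstance(div, Tag))                         # Tag object
print(isinstance(first_p.string, NavigableString))  # text inside a tag
print(isinstance(div.contents[-1], Comment))        # the <!--note--> node

# .children yields every direct child, including the comment, so filter for tags.
print([child.name for child in div.children if isinstance(child, Tag)])
print(first_p.parent.name)
print(first_p.next_sibling.get_text())
print(by_css[0].get_text())
```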
