python_crawler_Scraping JD.com product information

Scraping JD.com product information

Code:

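import requests

# url = "https://item.jd.com/2967929.html"
url = "https://item.jd.com/100011585270.html"

try:
    r = requests.get(url)
    r.raise_for_status()  # raise HTTPError on a 4xx/5xx status
    r.encoding = r.apparent_encoding  # use the encoding guessed from the body
    print(r.text[:1000])  # show only the first 1000 characters
except requests.RequestException:
    print("Scraping failed")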

Result 1:

<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F100011585270.html'</script>

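Result 2:

The page content was retrieved, but it was incomplete. Result 2 appeared only once, and I did not save it in time.

At first I suspected that Result 1 appeared because I was not logged in, but it still appeared after I logged in, so I ruled that out.

Because Result 2 did show up once, I suspect either a network problem or that the crawler is being blocked.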
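To tell the two results apart in code rather than by eye, one option is to check whether the response body is just a JavaScript redirect to the login page, as in Result 1. This is a minimal sketch using only the standard library; the helper names find_js_redirect and is_login_redirect are made up for illustration, and the pattern only covers the window.location.href form shown in Result 1.

```python
import re
from urllib.parse import urlparse

# Matches window.location.href='...' or window.location.href="..."
_REDIRECT_RE = re.compile(r"window\.location\.href\s*=\s*['\"]([^'\"]+)['\"]")


def find_js_redirect(html):
    """Return the URL a page redirects to via window.location.href, or None."""
    match = _REDIRECT_RE.search(html)
    return match.group(1) if match else None


def is_login_redirect(html):
    """True if the page redirects to the passport.jd.com login host."""
    target = find_js_redirect(html)
    return target is not None and urlparse(target).hostname == "passport.jd.com"
```

Calling is_login_redirect(r.text) after the request would then separate a blocked attempt (Result 1) from a real product page (Result 2).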

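I wanted to try changing the headers to imitate a browser, but I have to make a fractal snowflake in Scratch right now, so this is on hold for the moment.

The above stops at the first video of Unit 3, Week 1 of the MOOC Python Web Crawlers and Information Retrieval (taught by Song Tian).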

I changed the url to a Taobao link: https://item.taobao.com/item.htm?spm=a1z0d.6639537.1997196601.24.77b47484qxHVRi&id=620107543829

Scraping result:

<!doctype html>

<html><!-- cph -->

<head>

<meta http-equiv="X-UA-Compatible" content="IE=edge"/>

<meta charset="gbk"/>

<meta name="format-detection" content="telephone=no, address=no">

<link rel="dns-prefetch" href="//g.alicdn.com">

<link rel="dns-prefetch" href="//gtms01.alicdn.com">

<link rel="dns-prefetch" href="//gtms02.alicdn.com">

<link rel="dns-prefetch" href="//gtms03.alicdn.com">

<link rel="dns-prefetch" href="//gtms04.alicdn.com">

<link rel="dns-prefetch" href="//gd1.alicdn.com">

<link rel="dns-prefetch" href="//gd2.alicdn.com">

<link rel="dns-prefetch" href="//gd3.alicdn.com">

<link rel="dns-prefetch" href="//gd4.alicdn.com">

<link rel="canonical" href="https://item.taobao.com/item.htm? />

<link rel="amphtml" href href="https://www.taobao.com/list/item-amp/620107543829.htm"/>

<link rel="alternate" href href="https://world.taobao.com/item/620107543829.htm" />

<meta name="renderer" content="webkit"/>

<meta name="refer

In shell mode:

>>> import requests

>>> r = requests.get("https://item.taobao.com/item.htm?spm=a1z0d.6639537.1997196601.24.77b47484qxHVRi&id=620107543829")

>>> r.encoding

'gb18030'

>>> r.apparent_encoding

'GB2312'

>>> r.encoding = r.apparent_encoding

>>> r.text[:1000]

'\r\n\r\n\r\n<!doctype html>\n<html><!-- cph -->\n <head>\n <meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<meta charset="gbk"/>\n<meta name="format-detection" content="telephone=no, address=no">\n<link rel="dns-prefetch" href="//g.alicdn.com">\n<link rel="dns-prefetch" href="//gtms01.alicdn.com">\n<link rel="dns-prefetch" href="//gtms02.alicdn.com">\n<link rel="dns-prefetch" href="//gtms03.alicdn.com">\n<link rel="dns-prefetch" href="//gtms04.alicdn.com">\n<link rel="dns-prefetch" href="//gd1.alicdn.com">\n<link rel="dns-prefetch" href="//gd2.alicdn.com">\n<link rel="dns-prefetch" href="//gd3.alicdn.com">\n<link rel="dns-prefetch" href="//gd4.alicdn.com">\n\n<link rel="canonical" href="https://item.taobao.com/item.htm? />\n<link rel="amphtml" href href="https://www.taobao.com/list/item-amp/620107543829.htm"/>\n<link rel="alternate" href href="https://world.taobao.com/item/620107543829.htm" />\n\n<meta name="renderer" content="webkit"/>\n<meta name="refer'

>>> r.encoding

'GB2312'
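The session above shows three labels for the same page: gb18030 from r.encoding, GB2312 from r.apparent_encoding, and gbk in the meta tag. These character sets are roughly nested (GB2312 inside GBK inside GB18030), so decoding with the narrowest one can fail on characters outside it. A minimal standard-library sketch; the sample string is chosen for illustration and is not taken from the page, and try_decode is a hypothetical helper.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Illustrative sample, not from the page: "镕" exists in GBK and GB18030
# but is not part of GB2312.
sample = "京东镕"
raw = sample.encode("gbk")

for name in ("gb2312", "gbk", "gb18030"):
    print(name, try_decode(raw, name))
```

If r.text shows garbled or missing characters after r.encoding = r.apparent_encoding, setting r.encoding = "gb18030" is a wider choice that may be worth trying.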

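My guess was right: JD.com appears to check where the request comes from, and changing the headers to Mozilla fixes it. The correct code is in the comments.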
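The header change can be sketched as follows. By default requests sends a User-Agent of the form python-requests/x.y.z, which a site can use to spot a script. The header value "Mozilla/5.0", the fetch_html helper, its session parameter and the 10-second timeout are assumptions for illustration, not the code from the comments.

```python
import requests

# Assumed browser-like header; a full browser User-Agent string would also work.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, session=None, timeout=10):
    """Fetch a page with a browser-like User-Agent and return its text."""
    if session is None:
        session = requests.Session()
    r = session.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    r.raise_for_status()  # raise HTTPError on a 4xx/5xx status
    r.encoding = r.apparent_encoding  # decode with the encoding guessed from the body
    return r.text
```

For example, print(fetch_html("https://item.jd.com/100011585270.html")[:1000]) repeats the original experiment with the new header. Whether the site still accepts this may change over time.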