Web Scraping in R: Crawling cnblogs Blog Posts

2024-04-09 21:29 • 513 views

Cnblog Crawl

a). Load the R packages used

## Load the packages needed in this case
library(proto)
library(gsubfn)

## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
##   Reason: image not found

## Could not load tcltk.  Will use slower R code instead.

library(bitops)
library(rvest)
library(stringr)
library(DBI)
library(RSQLite)
library(sqldf)
library(RCurl)
library(ggplot2)
library(sp)
library(raster)
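## Our computers usually run in a Chinese locale, but I want English weekday
## names such as Monday and Tuesday, so set LC_TIME to "C" to make date
## formatting use English names.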
Sys.setlocale("LC_TIME", "C")

## [1] "C"
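
As a quick check of what the locale setting changes, here is a minimal sketch in base R. The date is only an illustrative value (it is not taken from the crawl), and the variable names are made up for this example.

```r
# Switch date/time formatting to the "C" locale so that weekday and
# month names come out in English regardless of the system language.
Sys.setlocale("LC_TIME", "C")

# Illustrative date only; any Date or POSIXlt value works the same way.
sample_date <- as.Date("2024-04-09")

weekday_name <- weekdays(sample_date)        # full English weekday name
month_name   <- format(sample_date, "%B")    # full English month name

print(weekday_name)
print(month_name)
```

Without the `Sys.setlocale()` call, the same two lines would return names in whatever language the system locale uses.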

b). Define a custom function that will be used later to scrape the information.

## Create a function; the parameter 'i' is the page number.
getdata <- function(i){
    url <- paste0("www.cnblogs.com/p", i)  ## generate the url
    page <- html_session(url)  ## fetch the page once and reuse it for every field
    ## footer text of each post, split into lines
    combined_info <- page %>% html_nodes("div.post_item div.post_item_foot") %>% html_text() %>% strsplit(split = "\r\n")
    ## the third line of each footer holds the post date
    post_date <- sapply(combined_info, function(v) v[3]) %>% str_sub(9, 24) %>% as.POSIXlt()
    post_year <- post_date$year + 1900  ## POSIXlt counts years from 1900
    post_month <- post_date$mon + 1     ## POSIXlt months run from 0 to 11
    post_day <- post_date$mday
    post_hour <- post_date$hour
    post_weekday <- weekdays(post_date)
    title <- page %>% html_nodes("div.post_item h3") %>% html_text() %>% as.character() %>% trim()
    link <- page %>% html_nodes("div.post_item a.titlelnk") %>% html_attr("href") %>% as.character()
    author <- page %>% html_nodes("div.post_item a.lightblue") %>% html_text() %>% as.character() %>% trim()
    author_hp <- page %>% html_nodes("div.post_item a.lightblue") %>% html_attr("href") %>% as.character()
    recommendation <- page %>% html_nodes("div.post_item span.diggnum") %>% html_text() %>% trim() %>% as.numeric()
    ## keep the text after the label, then drop the closing parenthesis
    article_view <- page %>% html_nodes("div.post_item span.article_view") %>% html_text() %>% str_sub(4, 20)
    article_view <- gsub(")", "", article_view, fixed = TRUE) %>% trim()
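
The two text-cleaning steps in the function (slicing a timestamp out of a footer line, and stripping the parenthesis from the view counter) can be tried offline. The sketch below is only an approximation under stated assumptions: `sample_line` and `sample_view` are made-up ASCII strings, not real cnblogs markup, so the character offsets match these samples rather than the live page, and base `substr()` stands in for `str_sub()` so that no packages are needed.

```r
# English weekday names, as in the setup step.
Sys.setlocale("LC_TIME", "C")

# Hypothetical footer line: 8 characters of prefix, then a 16-character timestamp.
sample_line <- "posted: 2016-05-02 10:30 someone"

stamp     <- substr(sample_line, 9, 24)       # characters 9..24 hold the timestamp
post_date <- as.POSIXlt(stamp, tz = "UTC")    # tz fixed so the result is reproducible

post_year    <- post_date$year + 1900         # POSIXlt stores years since 1900
post_month   <- post_date$mon + 1             # POSIXlt stores months as 0..11
post_day     <- post_date$mday
post_hour    <- post_date$hour
post_weekday <- weekdays(post_date)

# Hypothetical view counter: a 5-character label prefix, then the number and ")".
sample_view  <- "view(513)"
article_view <- substr(sample_view, 6, 20)    # keeps the text after "view("
article_view <- as.numeric(gsub(")", "", article_view, fixed = TRUE))

print(c(post_year, post_month, post_day, post_hour))
print(post_weekday)
print(article_view)
```

Fixed offsets like these are brittle: if the page layout shifts by even one character, the slices no longer line up, so it is worth checking a few parsed rows by eye before crawling many pages.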
