Building a Web Crawler with TypeScript

Install TypeScript globally:

npm install -g typescript

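The current version is 2.0.3, which no longer needs the typings command. VS Code, however, bundles TypeScript 1.8, so a little configuration is required; the steps are covered later in this article.

Check that the tsc command works: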

tsc

Create a folder for the project:

mkdir test-typescript-spider

Change into that folder:

cd test-typescript-spider

Initialize the project:

npm init

Install the superagent and cheerio modules:

npm i --save superagent cheerio

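Install the matching type declaration packages:

npm i --save @types/superagent

npm i --save @types/cheerio

Install TypeScript inside the project as well (this step is required, because VS Code will be pointed at this local copy):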

npm i --save typescript

Open the project folder in VS Code. Create a tsconfig.json file in that folder and copy in the following configuration:

{
    "compilerOptions": {
        "target": "ES6",
        "module": "commonjs",
        "noEmitOnError": true,
        "noImplicitAny": true,
        "experimentalDecorators": true,
        "sourceMap": false,
        // "sourceRoot": "./",
        "outDir": "./out"
    },
    "exclude": [
        "node_modules"
    ]
}

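In VS Code, open File > Preferences > Workspace Settings.

Add the following to settings.json (without this setting, VS Code asks which TypeScript version to use whenever the project is opened):

{
    "typescript.tsdk": "node_modules/typescript/lib"
}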

Create api.ts and copy in the following code:

import superagent = require('superagent');
import cheerio = require('cheerio');

// Wrap superagent's callback-style request in a Promise so callers can await it.
export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err)
                    reject(err);
                }
            });
    });
    return promise;
}
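remote_get wraps a callback API in a Promise by hand. The same pattern can be factored into a small generic helper. The sketch below is only an illustration: toPromise, NodeCallback and fakeRequest are made-up names, and fakeRequest merely stands in for a callback-style call such as the one remote_get wraps.

```typescript
// Node-style callback: error first, result second.
type NodeCallback<T> = (err: Error | null, result?: T) => void;

// Hypothetical helper (not part of superagent): runs a function that expects
// a Node-style callback and exposes its outcome as a Promise.
function toPromise<T>(run: (done: NodeCallback<T>) => void): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        run((err, result) => {
            if (err) {
                reject(err);
            } else {
                resolve(result as T);
            }
        });
    });
}

// Stand-in for a callback-based request; fails when the URL is empty.
function fakeRequest(url: string, done: NodeCallback<string>): void {
    setTimeout(() => {
        if (url === '') {
            done(new Error('empty url'));
        } else {
            done(null, 'response for ' + url);
        }
    }, 10);
}

const pending = toPromise<string>(done => fakeRequest('http://example.com/', done));
pending.then(text => console.log(text));
```

With a helper like this, each new callback-based call needs one line instead of a hand-written Promise constructor.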

Create app.ts and write some test code:

import api = require('./api');

const go = async () => {
    const res = await api.remote_get('http://www.baidu.com/');
    console.log(res.text);
}

go();

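Compile the project:

tsc

Then run the result:

node out/app

Check that the output is correct.

Next, try scraping the article links on the first page of http://cnodejs.org/.

Change app.ts to the following: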

import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    let titles: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        const link = $(element).find('.topic_title').first();
        titles.push(link.text().trim());
        // The href is a site-relative path, so the base must not end with a slash.
        urls.push('http://cnodejs.org' + link.attr('href'));
    })
    console.log(titles, urls);
}

go();

The output shows that the article titles and links have been collected.
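When a site address and an href are joined by plain string concatenation, it is easy to end up with a doubled or missing slash. One alternative is the standard URL class, which resolves a link against a base address. This is a small sketch, assuming the href values are site-relative paths; resolveLink is a made-up name and /topic/123 is a hypothetical path.

```typescript
// Resolve an href found on a page against that page's address.
function resolveLink(href: string, base: string): string {
    return new URL(href, base).href;
}

const siteBase = 'http://cnodejs.org/';

// A path with a leading slash resolves against the site root.
console.log(resolveLink('/topic/123', siteBase)); // http://cnodejs.org/topic/123

// Naive concatenation of the same two strings doubles the slash.
console.log(siteBase + '/topic/123'); // http://cnodejs.org//topic/123
```

An href that is already absolute passes through unchanged, so the same helper also covers links that point to other sites.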

Now try going one level deeper and fetching the content of each article:

import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    $('.topic_title_wrapper').each(async (index, element) => {
        const url = 'http://cnodejs.org' + $(element).find('.topic_title').first().attr('href');
        const res_content = await api.remote_get(url);
        const $_content = cheerio.load(res_content.text);
        console.log($_content('.topic_content').first().text());
    })
}

go();

Because each() does not wait for the async callbacks, every request goes out at almost the same moment. The server is hit too hard and answers many of the requests with 503 errors.
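The difference between starting requests from a callback and awaiting them in a loop can be seen without any network access. The sketch below uses Array.forEach as a stand-in for each(), on the assumption that both ignore the promise an async callback returns; makeTracker, fakeGet, burstPeak and sequentialPeak are made-up names.

```typescript
// Builds a fake request function that records how many calls overlap.
function makeTracker() {
    let inFlight = 0;
    let peak = 0;
    const fakeGet = (url: string) => {
        inFlight++;
        peak = Math.max(peak, inFlight);
        return new Promise<string>(resolve => setTimeout(() => {
            inFlight--;
            resolve('body of ' + url);
        }, 20));
    };
    return { fakeGet, peak: () => peak };
}

// Callback style: forEach does not wait, so every request starts immediately.
function burstPeak(urls: string[]): number {
    const tracker = makeTracker();
    urls.forEach(async url => { await tracker.fakeGet(url); });
    return tracker.peak();
}

// Loop style: each request finishes before the next one starts.
async function sequentialPeak(urls: string[]): Promise<number> {
    const tracker = makeTracker();
    for (const url of urls) {
        await tracker.fakeGet(url);
    }
    return tracker.peak();
}

const paths = ['/a', '/b', '/c'];
console.log('callback style, peak concurrency:', burstPeak(paths)); // 3
sequentialPeak(paths).then(peak => console.log('loop style, peak concurrency:', peak)); // 1
```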

The fix:

Add a helper.ts file:

// Returns a Promise that resolves after the given number of seconds.
export const wait_seconds = function (seconds: number) {
    return new Promise(resolve => setTimeout(resolve, seconds * 1000));
}

Change api.ts to:

import superagent = require('superagent');
import cheerio = require('cheerio');

// Collect the article URLs listed on the first page.
// The function must be async because it awaits remote_get.
export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

// Fetch one article page and return its text.
export const get_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    return $('.topic_content').first().text();
}

export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err)
                    reject(err);
                }
            });
    });
    return promise;
}

Change app.ts to:

import api = require('./api');
import helper = require('./helper');

const go = async () => {
    const urls = await api.get_index_urls();
    for (let i = 0; i < urls.length; i++) {
        // Pause before each request so the server is not flooded.
        await helper.wait_seconds(1);
        const text = await api.get_content(urls[i]);
        console.log(text);
    }
}

go();

The output shows that the program now waits one second before requesting the next content page.
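Even with a pause between requests, an occasional request may still fail. One possible extension, which goes beyond what this tutorial implements, is to retry a failed request after a short wait. The following is a minimal sketch under that assumption; withRetry, wait and flaky are made-up names, and flaky only simulates a request that fails twice.

```typescript
// Resolves after the given number of milliseconds.
const wait = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Hypothetical helper: runs `task` up to `attempts` times, pausing between tries.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(task: () => Promise<T>, attempts: number, pauseMs: number): Promise<T> {
    let lastError: unknown;
    for (let i = 0; i < attempts; i++) {
        try {
            return await task();
        } catch (err) {
            lastError = err;
            if (i < attempts - 1) {
                await wait(pauseMs);
            }
        }
    }
    throw lastError;
}

// Stand-in task that fails twice before succeeding.
let calls = 0;
const flaky = async () => {
    calls++;
    if (calls < 3) {
        throw new Error('503');
    }
    return 'ok after ' + calls + ' calls';
};

const retried = withRetry(flaky, 5, 10);
retried.then(text => console.log(text)); // ok after 3 calls
```

In the crawler, the task would be a call such as () => api.get_content(url), with a pause long enough to give the server room to recover.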

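Now try saving the scraped data to a database.

Install the mongoose module and its type declarations:

npm i --save mongoose

npm i --save @types/mongoose

Next, define the schema. First create a models folder:

mkdir models

Create index.ts in the models folder: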

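import * as mongoose from 'mongoose';

mongoose.connect('mongodb://127.0.0.1/cnodejs_data', {
    server: { poolSize: 20 }
}, function (err) {
    if (err) {
        // Without a database connection nothing can be saved, so report and exit.
        console.error(err);
        process.exit(1);
    }
});

// models
// The path must match the file name Article.ts exactly on case-sensitive file systems.
export const Article = require('./Article');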

Create IArticle.ts in the models folder:

// Use the primitive type string rather than the String wrapper object type.
interface IArticle {
    title: string;
    url: string;
    text: string;
}

export = IArticle;

Create Article.ts in the models folder:

import mongoose = require('mongoose');
import IArticle = require('./IArticle');

// A stored article has the IArticle fields plus the usual mongoose document members.
interface IArticleModel extends IArticle, mongoose.Document { }

const ArticleSchema = new mongoose.Schema({
    title: { type: String },
    url: { type: String },
    text: { type: String },
});

const Article = mongoose.model<IArticleModel>("Article", ArticleSchema);

export = Article;

Change api.ts to:

import superagent = require('superagent');
import cheerio = require('cheerio');
import models = require('./models');

const Article = models.Article;

// Collect the article URLs listed on the first page.
export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

// Fetch one article page and save its title, URL and text to MongoDB.
export const fetch_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    const article = new Article();
    article.text = $('.topic_content').first().text();
    // Strip the '置顶' (pinned) and '精华' (featured) labels from the title.
    article.title = $('.topic_full_title').first().text().replace('置顶', '').replace('精华', '').trim();
    article.url = url;
    console.log('Fetched: ' + article.title);
    // Wait for the write to finish so that a failed save reaches the caller's try/catch.
    await article.save();
}

export const remote_get = function (url: string) {
    return new Promise<superagent.Response>((resolve, reject) => {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    reject(err);
                }
            });
    });
}
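The title cleanup in fetch_content relies on chained replace calls, each of which removes the first occurrence of a label wherever it sits in the string. A slightly stricter variant removes the labels only from the start of the title. This is a sketch that assumes the site places '置顶' (pinned) and '精华' (featured) before the title text; cleanTitle and LABELS are made-up names.

```typescript
// Labels assumed to be prepended to some titles: pinned and featured.
const LABELS = ['置顶', '精华'];

// Removes any leading labels and surrounding whitespace from a raw title.
function cleanTitle(raw: string): string {
    let title = raw.trim();
    let stripped = true;
    // Keep going until a full pass removes nothing, so stacked labels are handled.
    while (stripped) {
        stripped = false;
        for (const label of LABELS) {
            if (title.startsWith(label)) {
                title = title.slice(label.length).trim();
                stripped = true;
            }
        }
    }
    return title;
}

console.log(cleanTitle('置顶 精华 Hello TypeScript')); // Hello TypeScript
console.log(cleanTitle('Notes on 精华 posts'));        // Notes on 精华 posts
```

Keeping this logic in a pure function also makes it easy to check without fetching a page or touching the database.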

Change app.ts to:

import api = require('./api');
import helper = require('./helper');

(async () => {
    try {
        const urls = await api.get_index_urls();
        for (let i = 0; i < urls.length; i++) {
            // Pause before each request so the server is not flooded.
            await helper.wait_seconds(1);
            await api.fetch_content(urls[i]);
        }
    } catch (err) {
        console.log(err);
    }
    console.log('Done!');
})();

Run tsc, then:

node out/app

Watch the output, then check the database.

The articles have been saved successfully!