Scraping Caoliu's Erotic Fiction from Scratch (1) - Crawling

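I've recently wanted to learn Python. Working through tutorials bit by bit felt too dry, and I had neither the time nor the motivation for it. Since Python's most common uses are web scraping and data analysis, I decided to start with those two and build a small project: crawl some data from a website and run some simple analysis on it. I picked the title list of Caoliu's adult literature board (titles only, not the story content) as the target, for academic purposes.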

Preparing the Tools

Installing Scrapy

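Scrapy is a widely used crawling framework. It is highly extensible, has strong crawling and parsing capabilities, and is fairly simple to install.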

**pip install scrapy**

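Of course, there are plenty of pitfalls during installation and use, such as Twisted or lxml not being installed. In Python these are not real problems: a _pip install_ takes care of each.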

In addition, on a CentOS server you may hit a runtime error saying the _sqlite3 module cannot be found. The following fixes it:

_sudo yum install sqlite-devel_
Recompile Python 3.6: ./configure --enable-loadable-sqlite-extensions --with-ssl; make; sudo make install

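P.S. For simple tasks, a crawler can just as well be written with the existing HTTP request classes, and that is not complicated either. I used Scrapy purely for convenience.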
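
To give an idea of what the "plain HTTP request" route might look like, here is a minimal sketch using only the standard library. It is not what this project uses. The names TitleParser, parse_titles and fetch are made up for illustration, and the assumption that each title is a link inside an h3 element is borrowed from the XPath used by the spider in this post. The page encoding is also an assumption.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collect the text of every <a> element nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_link = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
            self.in_link = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def parse_titles(html):
    """Return the stripped link texts found inside <h3> elements."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def fetch(url, user_agent="Mozilla/5.0"):
    """Download one page. The caller supplies the URL; UTF-8 is an assumption."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling parse_titles(fetch(url)) for each listing page, with a pause between requests, would roughly cover the crawling part. Retries, throttling and item export are what Scrapy saves you from writing.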

Installing PyMongo

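I originally assumed the data volume would be large, so I planned to store the crawled data directly in a database. MongoDB is a database commonly used with Python, and using it requires installing PyMongo. (It later turned out there were only 2,000-odd records, so plain text files would have been enough.)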

pip install pymongo

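A note on the database: I simply used the free trial database officially provided by MongoDB, which comes with 500 MB of space and can be accessed remotely.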

For using PyMongo from Python, see the PyMongo documentation.
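
As a rough sketch of talking to a remote MongoDB from Python: the helper names build_mongo_uri and save_titles are hypothetical, and the URI layout follows the mongodb+srv form used by the pipeline in this post. Percent-escaping the user name and password with quote_plus matters when they contain characters such as @ or :. As far as I know, mongodb+srv URIs also need the dnspython package installed.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, passwd, host):
    """Build a mongodb+srv URI, percent-escaping the credentials."""
    return "mongodb+srv://{0}:{1}@{2}/?retryWrites=true&w=majority".format(
        quote_plus(user), quote_plus(passwd), host
    )


def save_titles(uri, dbname, collection_name, records):
    """Insert a non-empty list of dicts; return how many documents were written."""
    # Imported here so build_mongo_uri can be used without pymongo installed
    import pymongo

    client = pymongo.MongoClient(uri)
    try:
        result = client[dbname][collection_name].insert_many(records)
        return len(result.inserted_ids)
    finally:
        client.close()
```

A call might look like save_titles(build_mongo_uri(user, passwd, host), "mydb", "titles", [{"lit_title": "..."}]), where the database and collection names are placeholders.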

Crawling with Scrapy

First run _scrapy startproject <project name>_ to create the project, which produces the following directory structure:

t66y
├── proxy.py
├── scrapy.cfg
└── t66y
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Code

Create t66y.py under spiders. Here is the code directly (the site address is removed so as not to lead the kids astray):

import scrapy
from t66y.items import T66YItem

class t66yLitSpider(scrapy.Spider):
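    name = "t66yLit"
    allowed_domains = ['<caoliu-domain>']  # placeholder: the real domain was removed
    # start_urls lists the pages to crawl first (listing pages 1 through 21)
    start_urls = ['http://<caoliu-domain>/thread0806.php?fid=20&search=&page={}'.format(i) for i in range(1,22)]
    # Parse each fetched page with XPath. Each T66YItem corresponds to one MongoDB document.
    def parse(self, response):
        lits = response.xpath('//tr[@class="tr3 t_one tac"]')

        for lit in lits:
            # Create a fresh item per row so earlier yielded items are not overwritten
            item = T66YItem()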
            item['lit_type'] = lit.xpath('.//*[@class="tal"]/text()[1]').extract()[0].strip()
            item['lit_title'] = lit.xpath('.//*[@class="tal"]//h3/a/text()').extract()
            if len(item['lit_title']) == 1:
                item['lit_title'] = item['lit_title'][0].strip()
            item['lit_url'] = lit.xpath('.//*[@class="tal"]//h3/a/@href').extract()[0].strip()
            item['lit_writer'] = lit.xpath('.//td[1]//following-sibling::*[2]/a/text()').extract()[0].strip()
            item['lit_submit'] = lit.xpath('.//td[1]//following-sibling::*[2]/a//following-sibling::*[1]/text()').extract()
            if len(item['lit_submit']) == 1:
                lit_submit = lit.xpath('.//td[1]//following-sibling::*[2]/a//following-sibling::*[1]')
                item['lit_submit'] = lit_submit.xpath('string(.)').extract()[0].strip()
            item['lit_comments'] = lit.xpath('.//td[1]//following-sibling::*[3]/text()').extract()
            if len(item['lit_comments']) == 1:
                item['lit_comments'] = item['lit_comments'][0]
            item['lit_last_comments'] = lit.xpath('.//td[1]//following-sibling::*[4]/a/text()').extract()
            if len(item['lit_last_comments']) == 1:
                item['lit_last_comments'] = item['lit_last_comments'][0]
            self.log("[%s],[%s],[%s],[%s],[%s],[%s],[%s]" % (item['lit_type'],item['lit_title'],item['lit_url'],item['lit_writer'],item['lit_submit'],item['lit_comments'],item['lit_last_comments']))
            yield item
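
The spider repeats the same pattern several times: call extract(), then unwrap the list if it has exactly one element. One way this could be tidied up is a small helper like the one below. The name first_text is hypothetical, and note that its behaviour differs slightly from the code above: it always returns a string (the first match, or a default), never a list.

```python
def first_text(values, default=""):
    """Return the first extracted string, stripped, or `default` if nothing matched."""
    for value in values:
        return value.strip()
    return default
```

With it, a field assignment would shrink to something like item['lit_title'] = first_text(lit.xpath('.//*[@class="tal"]//h3/a/text()').extract()), and rows where an XPath matches nothing no longer raise an IndexError from extract()[0]. Scrapy's own extract_first() on a selector list does much the same job, minus the strip.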

T66YItem is defined in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class T66YItem(scrapy.Item):
    # define the fields for your item here like:
    lit_type = scrapy.Field()
    lit_title = scrapy.Field()
    lit_url = scrapy.Field()
    lit_writer = scrapy.Field()
    lit_submit = scrapy.Field()
    lit_comments = scrapy.Field()
    lit_last_comments = scrapy.Field()
    pass

pipelines.py handles storing the scraped items in MongoDB:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

class T66YPipeline(object):
    def __init__(self, settings):
        host = settings['MONGODB_HOST']
        user = settings['MONGODB_USER']
        passwd = settings['MONGODB_PASSWD']
        dbname = settings['MONGODB_DBNAME']
        sheetname = settings['MONGODB_SHEET']
        client = pymongo.MongoClient("mongodb+srv://{0}:{1}@{2}/test?retryWrites=true&w=majority".format(user,passwd,host))

        mydb = client[dbname]
        self.post = mydb[sheetname]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook; crawler.settings replaces the deprecated scrapy.conf module
        return cls(crawler.settings)

    def process_item(self, item, spider):
        data = dict(item)
        # This line is what stores the data in MongoDB (insert() was removed in PyMongo 4)
        self.post.insert_one(data)
        return item
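
Since the crawl turned out to produce only about 2,000 records, a plain text file would also have done the job. Below is a sketch of what a file-based pipeline could look like, writing one JSON object per line. The class name JsonLinesPipeline and the default file name titles.jl are made up for illustration; open_spider and close_spider are the hooks Scrapy calls on a pipeline when a spider starts and finishes.

```python
import json


class JsonLinesPipeline:
    """Item pipeline that appends each item to a text file, one JSON object per line."""

    def __init__(self, path="titles.jl"):
        self.path = path  # hypothetical output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

For a one-off dump, Scrapy's built-in feed export (for example scrapy crawl t66yLit -o titles.jl) should get a similar result without any pipeline code.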

To save items as they are crawled, settings.py needs to be configured:

BOT_NAME = 't66y'

SPIDER_MODULES = ['t66y.spiders']
NEWSPIDER_MODULE = 't66y.spiders'
# See how these settings tie in with pipelines.py? Each one is read there as settings['XXXX']
MONGODB_USER = '***'
MONGODB_PASSWD = '***'
MONGODB_DBNAME = '***'
MONGODB_SHEET = '***'
MONGODB_HOST = '***.mongodb.net'

ROBOTSTXT_OBEY = False
# A delay between requests is needed, otherwise the server may treat the crawl as a DDoS attack
DOWNLOAD_DELAY = 5
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
# The pipeline path must match the class name defined in pipelines.py
ITEM_PIPELINES = {
    't66y.pipelines.T66YPipeline': 300,
}
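
To get a feel for what DOWNLOAD_DELAY = 5 costs, here is a back-of-the-envelope sketch. The function name estimate_crawl_seconds is hypothetical. It assumes all requests go to one domain, ignores the time each download itself takes, and relies on Scrapy's default behaviour of randomizing each wait between 0.5 and 1.5 times DOWNLOAD_DELAY (the RANDOMIZE_DOWNLOAD_DELAY setting).

```python
def estimate_crawl_seconds(pages, download_delay, randomize=True):
    """Return (shortest, typical, longest) total waiting time in seconds."""
    gaps = max(pages - 1, 0)  # no wait is needed before the first request
    typical = gaps * download_delay
    if not randomize:
        return (typical, typical, typical)
    # Each wait is drawn from [0.5 * delay, 1.5 * delay]
    return (gaps * download_delay * 0.5, typical, gaps * download_delay * 1.5)
```

For the 21 listing pages in start_urls, estimate_crawl_seconds(21, 5) gives (50.0, 100, 150.0), so the whole crawl spends somewhere between one and three minutes just waiting.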

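Notice that a middlewares.py file was also generated when the project was created. Judging by the name, it should be some kind of middleware, right? In practice it is mostly used for switching proxy IPs (some sites check the crawler's IP, treat overly frequent requests as abuse, and block them), so proxy-rotation code usually lives there. Code that handles error responses from the site is also typically implemented there. Fortunately, CL was not troublesome enough to require rotating proxies.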
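
This project did not need it, but for reference, a proxy-switching downloader middleware might look roughly like the sketch below. The class name RandomProxyMiddleware and the PROXY_LIST setting are hypothetical, not part of Scrapy. What is real is the mechanism: Scrapy sends a request through whatever proxy URL is stored in request.meta['proxy'].

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each outgoing request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting name, not a built-in Scrapy setting
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy routes the request through the proxy named in request.meta['proxy']
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to keep processing the request normally
```

To try it, the class would go into middlewares.py and be enabled through the DOWNLOADER_MIDDLEWARES setting, for example {'t66y.middlewares.RandomProxyMiddleware': 350}, with PROXY_LIST holding proxy URLs you actually have access to.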

Results

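Let's look directly at what is stored in MongoDB. You can see the number of insert operations rising during the crawl; on the right is the crawled data. The article titles are too explicit, so they are not shown.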

In the next part we cover:

Simple data analysis with Jupyter + Pandas + Pyplot.