Installing and Using Scrapy, a Python Crawler Framework

by LauCyun · Nov 13, 2016, 14:22:12

Scrapy is a crawler framework written in pure Python on top of the Twisted asynchronous networking engine; implementing a crawler takes little more than customizing a few modules. See the Scrapy official site and the official installation documentation.

Installation

Install it directly with pip:

pip install Scrapy

Note:

Scrapy depends on these Python packages:

  • lxml
  • parsel
  • w3lib
  • twisted
  • cryptography and pyOpenSSL

Minimum package versions required by Scrapy:

  • Twisted 14.0
  • lxml 3.4
  • pyOpenSSL 0.14
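
If your environment already carries older copies of these packages, one way to bring them up to the minimums is to upgrade them explicitly with pip (the version pins simply mirror the list above):

$ pip install --upgrade "Twisted>=14.0" "lxml>=3.4" "pyOpenSSL>=0.14"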

Creating a New Scrapy Project

Use the following command to create a new Scrapy project:

$ scrapy startproject <project_name>

For example, create a project named helloscrapy:

$ scrapy startproject helloscrapy
New Scrapy project 'helloscrapy', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/helloscrapy

You can start your first spider with:
    cd helloscrapy
    scrapy genspider example example.com

Project Structure

.
`-- helloscrapy
    |-- helloscrapy           # the project's Python module
    |   |-- __init__.py
    |   |-- items.py          # where the project's Items are defined
    |   |-- middlewares.py    # where the project's middlewares live
    |   |-- pipelines.py      # the project's pipelines file
    |   |-- settings.py       # the project's settings file
    |   `-- spiders           # directory that holds the spiders
    |       `-- __init__.py
    `-- scrapy.cfg            # the Scrapy project configuration file
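
As an illustration of one of these files, items.py is where structured containers for scraped data are declared. A minimal sketch; the item class and its fields are hypothetical:

# items.py - illustrative sketch; class and field names are hypothetical
import scrapy


class BlogPostItem(scrapy.Item):
    title = scrapy.Field()      # post title
    url = scrapy.Field()        # post URL
    published = scrapy.Field()  # publication date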

Writing a Spider

As an example, we'll crawl LauCyun's Blog. First, generate a spider:

$ scrapy genspider laucyun liuker.org
Created spider 'laucyun' using template 'basic' in module:
  helloscrapy.spiders.laucyun

A laucyun.py file will appear in the spiders directory, with the following content:

# -*- coding: utf-8 -*-
import scrapy


class LaucyunSpider(scrapy.Spider):
    name = 'laucyun'
    allowed_domains = ['liuker.org']
    start_urls = ['http://liuker.org/']

    def parse(self, response):
        pass

A little explanation is in order. Every spider inherits from Spider, the base spider class Scrapy provides, and defines three attributes plus one callback:

  • name - the spider's name, used in a moment to launch it from the command line
  • allowed_domains - restricts which domains the spider may crawl, so it doesn't wander into sites we don't care about; above, only liuker.org is crawled. For multiple domains, use ['liuker.org', 'laucyun.com']
  • start_urls - the URLs crawling starts from
  • def parse(self, response): - the callback invoked once the downloader has fetched a page; this is where you get at the page content (see the shell sketch after this list for a quick way to experiment with it)
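
If you want to poke at the response object that parse receives before writing any code, Scrapy ships an interactive shell. A minimal session sketch (the selectors are purely illustrative):

$ scrapy shell "http://liuker.org/"
>>> response.status                                # HTTP status of the fetched page
>>> response.css('title::text').extract_first()    # page <title> via a CSS selector
>>> response.xpath('//a/@href').extract()          # all link hrefs via XPath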

With that, a simple spider is complete. Launch it with:

$ scrapy crawl laucyun
2016-11-13 07:36:31 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: helloscrapy)
2016-11-13 07:36:31 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'helloscrapy.spiders', 'SPIDER_MODULES': ['helloscrapy.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'helloscrapy'}
2016-11-13 07:36:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-11-13 07:36:31 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-11-13 07:36:31 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-11-13 07:36:31 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2016-11-13 07:36:31 [scrapy.core.engine] INFO: Spider opened
2016-11-13 07:36:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-13 07:36:31 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-13 07:36:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://liuker.org/robots.txt> (referer: None)
2016-11-13 07:36:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://liuker.org/> (referer: None)
2016-11-13 07:36:32 [scrapy.core.engine] INFO: Closing spider (finished)
2016-11-13 07:36:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 426,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 18916,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 11, 13, 7, 36, 32, 399465),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'memusage/max': 46956544,
 'memusage/startup': 46956544,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 11, 13, 7, 36, 31, 700915)}
2016-11-13 07:36:32 [scrapy.core.engine] INFO: Spider closed (finished)

Of course, the spider above doesn't scrape anything yet, because parse is not implemented.
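
As a hint of what a real implementation could look like, here is a minimal parse sketch that yields each post's title and URL and follows a pagination link. All CSS selectors are hypothetical and would have to be adapted to the blog's actual markup:

# -*- coding: utf-8 -*-
import scrapy


class LaucyunSpider(scrapy.Spider):
    name = 'laucyun'
    allowed_domains = ['liuker.org']
    start_urls = ['http://liuker.org/']

    def parse(self, response):
        # Selectors below are hypothetical; inspect the real HTML first.
        for post in response.css('article'):
            title = post.css('h2 a::text').extract_first()
            href = post.css('h2 a::attr(href)').extract_first()
            if title and href:
                # Yielding a dict makes Scrapy treat it as a scraped item.
                yield {'title': title, 'url': response.urljoin(href)}

        # Follow the (hypothetical) "next page" link, if present.
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

With something like this in place, scrapy crawl laucyun -o posts.json would write the scraped items to a JSON file.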

Summary

We have implemented a simple spider. It can only crawl static pages; if a site serves dynamic content or relies on many AJAX requests, this approach won't capture the complete data. A follow-up post will use Selenium to crawl dynamic pages.

(The End)
