23. Scrapy Framework: CrawlSpider

2020-07-03 10:23 Author: 自學Python的小姐姐呀

1. CrawlSpiders

[Figure: CrawlSpider schematic diagram]

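The following command quickly generates the code for a CrawlSpider template:

scrapy genspider -t crawl <spider_name> <allowed_domain>

First, a word about Spider: it is the base class of all spiders, and CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider is the better fit when you need to extract links from the crawled pages and keep crawling them.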

2. Rule objects

The Rule and CrawlSpider classes both live in the scrapy.spiders module (scrapy.contrib.spiders in old Scrapy releases).

class scrapy.spiders.Rule(
    link_extractor, callback=None, cb_kwargs=None, follow=None,
    process_links=None, process_request=None
)

Parameters:

link_extractor: a LinkExtractor that defines which links to extract
callback: the callback invoked with the response of each link that link_extractor extracts; it can be a callable or a string naming a spider method

Note on callback: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the CrawlSpider will stop working.

follow: whether the links extracted from a response by this rule should be followed. When callback is None, follow defaults to True; otherwise it defaults to False
process_links: mainly used to filter the links that link_extractor extracts
process_request: mainly used to filter the requests generated by the rule
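The follow default and the process_links hook are easier to see in code. The sketch below is a model only and runs without Scrapy: FakeLink stands in for scrapy.link.Link, and effective_follow and drop_login_links are hypothetical names made up for illustration, as are the example.com URLs.

```python
from collections import namedtuple

# Stand-in for scrapy.link.Link so the sketch runs without Scrapy installed;
# only the `url` attribute is used here. (Assumption: hypothetical helper type.)
FakeLink = namedtuple("FakeLink", ["url", "text"])


def effective_follow(callback=None, follow=None):
    """Model of the documented default: follow is True only when no callback is set."""
    if follow is not None:
        return follow          # an explicit value always wins
    return callback is None    # no callback -> follow links by default


def drop_login_links(links):
    """A process_links-style hook: receives the extracted links, returns the ones to keep."""
    return [link for link in links if "/login" not in link.url]


links = [
    FakeLink("http://example.com/read/1.shtml", "Chapter 1"),
    FakeLink("http://example.com/login?next=/read/1.shtml", "Log in"),
]
kept = drop_login_links(links)
print([link.url for link in kept])
```

In a real spider a hook like this would be passed to the rule as process_links=drop_login_links, or as a string naming a spider method.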

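3. LinkExtractors

3.1 Concept

As the name suggests, a link extractor.

3.2 Purpose

A LinkExtractor pulls links out of a response object, and those links are then crawled. Every LinkExtractor has exactly one public method, extract_links(), which takes a Response object and returns a list of scrapy.link.Link objects.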
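To make the idea concrete, here is a toy link collector written with the standard library only. It is not Scrapy's implementation: SimpleLinkCollector and this extract_links function are hypothetical, they return plain URL strings rather than Link objects, and the HTML snippet (including the 17829388.shtml chapter URL) is invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleLinkCollector(HTMLParser):
    """Toy stand-in for a link extractor: collects href values from <a> and <area> tags."""

    def __init__(self, base_url, tags=("a", "area"), attrs=("href",)):
        super().__init__()
        self.base_url = base_url
        self.tags = tags
        self.attrs = attrs
        self.links = []  # absolute URLs, in document order, without duplicates

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        for name, value in attrs:
            if name in self.attrs and value:
                url = urljoin(self.base_url, value)  # resolve relative links
                if url not in self.links:            # skip duplicates
                    self.links.append(url)


def extract_links(html, base_url):
    """Return the unique absolute URLs found in `html`."""
    collector = SimpleLinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Invented page fragment, for illustration only.
page = """
<div class="bottem">
  <a href="/read/33/33539/">Contents</a>
  <a href="17829388.shtml">Next chapter</a>
  <a href="17829388.shtml">Next chapter (duplicate)</a>
</div>
"""
links = extract_links(page, "http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml")
for url in links:
    print(url)
```

The real extract_links() does considerably more (filtering, optional canonicalization, Link objects carrying the anchor text), but the overall shape is the same: response in, list of links out.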

3.3 Usage

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a', 'area'),
    attrs = ('href',),
    canonicalize = False,
    unique = True,
    process_value = None
)

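Main parameters:

allow: URLs matching the given regular expression (or list of regular expressions) are extracted; if empty, every link matches.
deny: URLs matching this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
allow_domains: the domains whose links will be extracted.
deny_domains: the domains whose links will never be extracted.
restrict_xpaths: an XPath expression (or list of them) that limits the regions of the page links are extracted from; it works together with allow to filter links. The expression should select nodes, not attributes.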
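How these filters combine can be sketched as a single predicate. This is a simplified model written for this article, not Scrapy's code: is_link_allowed is a hypothetical helper, the second and third sample URLs are invented, and it ignores deny_extensions and restrict_xpaths.

```python
import re
from urllib.parse import urlparse


def is_link_allowed(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Simplified model of LinkExtractor URL filtering (not Scrapy's actual code)."""
    host = urlparse(url).hostname or ""

    def in_domains(domains):
        # A domain matches the host itself or any of its subdomains.
        return any(host == d or host.endswith("." + d) for d in domains)

    if allow and not any(re.search(p, url) for p in allow):
        return False  # allow is set, but no pattern matched
    if deny and any(re.search(p, url) for p in deny):
        return False  # deny wins even if allow matched
    if allow_domains and not in_domains(allow_domains):
        return False
    if deny_domains and in_domains(deny_domains):
        return False
    return True


urls = [
    "http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml",
    "http://www.fhxiaoshuo.com/login.php",   # invented URL
    "http://ads.example.com/read/1.shtml",   # invented URL
]
kept = [
    u for u in urls
    if is_link_allowed(u, allow=(r"/read/",), deny=(r"login",),
                       allow_domains=("fhxiaoshuo.com",))
]
print(kept)
```

Only the first URL survives: the second fails the allow pattern, and the third is outside allow_domains.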

3.3.1 Checking the result (verifying in the shell)

First, start the shell:

scrapy shell http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml

Then import the module:

from scrapy.linkextractors import LinkExtractor

Build an extractor for the link to pull from the current page:

link = LinkExtractor(restrict_xpaths=r'//div[@class="bottem"]/a[4]')

Call the LinkExtractor instance's extract_links() method to see what it matches:

link.extract_links(response)

3.3.2 Checking the result: CrawlSpider version

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from xiaoshuo.items import XiaoshuoItem


class XiaoshuoSpiderSpider(CrawlSpider):
    name = 'xiaoshuo_spider'
    allowed_domains = ['fhxiaoshuo.com']
    start_urls = ['http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml']

    # Extract the 4th link inside div.bottem and hand its response to parse_item.
    # With a callback set, follow defaults to False; pass follow=True to keep
    # extracting links from the pages this rule reaches.
    rules = [
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem"]/a[4]'), callback='parse_item'),
    ]

    def parse_item(self, response):
        # Collect the text nodes of the element with id="TXT".
        info = response.xpath("//div[@id='TXT']/text()").extract()
        it = XiaoshuoItem()
        it['info'] = info
        yield it

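Notes:

rules = [
    Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem"]/a[4]'), callback='parse_item'),
]

The function name after callback is written as a quoted string.
The function name must not be parse.
Watch the formatting: rules is a list of Rule objects, so keep the brackets, parentheses, and commas balanced.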
