
Crawling A Site Recursively Using Scrapy

I am trying to scrape a site using Scrapy. This is the code I have written so far, based on http://thuongnh.com/building-a-web-crawler-with-scrapy/ (the original code does not work at all).

Solution 1:

The following can be a good starting point.

There are two common use cases for 'Crawling a site recursively using scrapy'.

A) We just want to move across the website using, say, the pagination buttons of a table, and fetch data. This is relatively straightforward.

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        '''do something with this parser'''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
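If you want to try this spider without setting up a full Scrapy project, you can run it from a plain script using Scrapy's CrawlerProcess. This is a minimal sketch; 'somewebsite' above is a placeholder you would replace with a real URL:

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(TrainSpider)
    process.start()  # blocks until the crawl finishes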

Observe the last four lines of TrainSpider:

  1. We are getting the next-page link from the 'Next' pagination button via its XPath.
  2. The if condition checks whether we have reached the end of the pagination.
  3. We join this link (that we got in step 1) with the main URL using urljoin; see the sketch after this list for how that resolution works.
  4. We make a recursive call to the 'parse' callback method.
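For step 3, response.urljoin resolves the (often relative) href against the URL of the current response, much like urllib.parse.urljoin. A minimal illustration with hypothetical URLs:

    from urllib.parse import urljoin

    base = "https://example.com/trains?page=2"  # URL of the current response
    href = "/trains?page=3"                     # value extracted from @href
    print(urljoin(base, href))                  # https://example.com/trains?page=3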

B) We not only want to move across pages, but also want to extract data from one or more links on each page.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
    )

    def parse_trains(self, response):
        '''do your parsing here'''

Over here, observe that:

  1. We are using 'CrawlSpider', a subclass of 'scrapy.Spider'.

  2. We have set two 'Rules':

    a) The first rule just checks whether there is a 'next_page' link available and follows it.

    b) The second rule requests all the links on a page that match the format, say, '/trains/12343', and then calls 'parse_trains' to perform the parsing.

  3. Important: note that we don't want to use the regular 'parse' method here, since we are using the 'CrawlSpider' subclass. 'CrawlSpider' uses 'parse' internally, so we must not override it. Just remember to name your callback method something other than 'parse'; a hypothetical example follows this list.
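As an illustration, here is a hypothetical body for 'parse_trains'; the XPath and field names are made up for this sketch and would depend on the actual page:

    def parse_trains(self, response):
        # This method lives inside StationDetailSpider; note that it is
        # deliberately not named 'parse'.
        yield {
            'name': response.xpath("//h1/text()").get(),
            'url': response.url,
        }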

Solution 2:

The problem is which Spider class you are using as a base. scrapy.Spider is a simple spider that does not support rules and link extractors.

Instead, use CrawlSpider:

class MySpider(CrawlSpider):
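A minimal sketch of what that switch can look like; the URL, XPath, and callback name here are placeholders, not taken from the question:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'myspider'
        start_urls = ['https://example.com']
        rules = (
            # Follow pagination links and parse every followed page.
            Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # CrawlSpider uses 'parse' internally, so the callback
            # must have a different name.
            yield {'url': response.url}

You can then run it the usual way, e.g. with 'scrapy crawl myspider'.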
