Start a new Scrapy project

$ scrapy startproject adventuretime

This creates a Scrapy project with the following structure:

    adventuretime/
        scrapy.cfg
        adventuretime/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

Create a spider

Inside the spiders folder, create a new Python file for the spider:

import scrapy

class CharactersSpider(scrapy.Spider):
	# The name identifies the spider to the `scrapy crawl` command
	name = 'characters'

	# URL of the first characters category page to scrape
	start_urls = ['']


When Scrapy runs the spider, it sends an HTTP request to each URL in start_urls and passes the response to the parse method, where we extract the links to the character pages and to the next pages of the listing.

def parse(self, response):
	# Links to individual character pages on the category page
	for href in response.xpath('//a[@class="category-page__member-link"]/@href'):
		yield response.follow(href, self.parse_character)

	# "Next page" pagination link of the category listing
	for href in response.xpath('//a[@class="category-page__pagination-next wds-button wds-is-secondary"]/@href'):
		yield response.follow(href, self.parse)
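Note that response.follow accepts relative hrefs and resolves them against the URL of the page being parsed before scheduling the request. A rough illustration of that resolution using the standard library (not Scrapy's actual implementation; the page URL below is a hypothetical example):

```python
from urllib.parse import urljoin

# Hypothetical category page URL for illustration
page_url = 'https://adventuretime.fandom.com/wiki/Category:Characters'

# An absolute path replaces the path of the current page
print(urljoin(page_url, '/wiki/Finn'))
# -> https://adventuretime.fandom.com/wiki/Finn

# A query-only href (typical for pagination) keeps the current path
print(urljoin(page_url, '?from=Jake'))
# -> https://adventuretime.fandom.com/wiki/Category:Characters?from=Jake
```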

The pagination links are parsed by the parse method itself, while the character links are handled by a separate method, parse_character.

from scrapy.loader import ItemLoader        # add at the top of the spider file
from adventuretime.items import CharacterItem

def parse_character(self, response):
	i = ItemLoader(item=CharacterItem(), response=response)
	i.add_value('url', response.url)  # record which page the character came from
	i.add_xpath('name', '//*[contains(@class, "pi-group")]/div[contains(string(), "Name")]/div//text()')
	i.add_xpath('sex', '//*[contains(@class, "pi-group")]/div[contains(string(), "Sex")]/div//text()')
	i.add_xpath('species', '//*[contains(@class, "pi-group")]/div[contains(string(), "Species")]/div/*/text()')

	return i.load_item()
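Each XPath expression above finds the infobox row whose label contains "Name" (or "Sex", "Species") and takes the text of the adjacent value cell. A rough standard-library sketch of that label/value pairing, run against a made-up, simplified snippet rather than the real Fandom markup:

```python
from html.parser import HTMLParser

# Made-up, simplified infobox snippet (the real page markup is more complex)
SNIPPET = """
<div class="pi-group">
  <div><h3>Name</h3><div>Finn Mertens</div></div>
  <div><h3>Sex</h3><div>Male</div></div>
</div>
"""

class InfoboxParser(HTMLParser):
    """Pair each <h3> label with the text of the <div> that follows it."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._label = None    # label waiting for its value
        self._target = None   # what the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self._target = 'label'
        elif tag == 'div' and self._label is not None:
            self._target = 'value'

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._target == 'label':
            self._label = text
        elif self._target == 'value' and self._label:
            self.fields[self._label] = text
            self._label = None
        self._target = None

parser = InfoboxParser()
parser.feed(SNIPPET)
print(parser.fields)  # {'Name': 'Finn Mertens', 'Sex': 'Male'}
```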

To hold the extracted character data, we define a new item, CharacterItem, in items.py:

import scrapy
from itemloaders.processors import TakeFirst  # scrapy.loader.processors in older Scrapy versions

class CharacterItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field(output_processor=TakeFirst())
    sex = scrapy.Field(output_processor=TakeFirst())
    species = scrapy.Field()
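When the loader finishes, each field holds the list of every value its XPath matched; TakeFirst is an output processor that collapses that list to its first non-empty element, so name and sex come out as single strings while species stays a list. Conceptually (a plain-Python mimic, not Scrapy's implementation):

```python
def take_first(values):
    # Return the first value that is neither None nor an empty string,
    # mirroring what the TakeFirst output processor does.
    for value in values:
        if value is not None and value != '':
            return value

# Raw extracted values often include empty strings from the markup
extracted = {'name': ['', 'Finn Mertens'], 'species': ['Human']}
print(take_first(extracted['name']))  # Finn Mertens
print(extracted['species'])           # no processor, so it stays a list
```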

Extracting the data

Finally, run the spider and export the scraped items to a JSON file. Recent Scrapy versions infer the output format from the file extension, so the old -t flag is no longer needed:

$ scrapy crawl characters -o result.json
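The crawl writes one JSON object per scraped item. Loading the result back is a plain json call; the records below are made-up examples of the shape the items take, not actual crawl output:

```python
import json

# Hypothetical excerpt of what result.json might contain after the crawl
raw = '''
[
  {"name": "Finn Mertens", "sex": "Male", "species": ["Human"]},
  {"name": "Jake", "sex": "Male", "species": ["Dog", "Shape-shifter"]}
]
'''

characters = json.loads(raw)
print(len(characters))        # 2
print(characters[0]['name'])  # Finn Mertens
```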