Scrapy
Scrapy is a screen scraping and web crawling framework written in Python, used to crawl websites and extract data in a structured manner. Results can be exported to a variety of formats, including JSON, CSV and XML, and delivered to storage backends such as FTP. You can also use an item pipeline to store items in a database. Scrapy is used for a number of purposes, including data mining, automated testing and monitoring.
How to scrape a website example
I’m assuming you have installed Python and Scrapy already; if not, go to http://www.python.org/getit/ to download Python and then follow http://doc.scrapy.org/en/latest/intro/install.html#intro-install to install Scrapy.
In this tutorial I will show you how I set up a simple Scrapy spider to extract the name, price and URL of each new release from the Xtra-vision movie rental site, and store them in a CSV file.
Steps to create a spider:
- Set up a new Scrapy project by going to a directory of your choice in the cmd line/Terminal Window and typing: scrapy startproject xtravision. This will create the xtravision directory with the basic file structure set up.
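For reference, startproject generates a layout along these lines (the exact files can vary slightly between Scrapy versions):
xtravision/
    scrapy.cfg            # project configuration file
    xtravision/           # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py      # item pipelines go here
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py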
- Define our item by modeling it with a field for each attribute we wish to capture. For example, if we wish to capture the name, URL and price of each movie, we define fields for these three attributes in the items.py file found in the xtravision directory.
from scrapy.item import Item, Field

class XtravisionItem(Item):
    # define the fields for your item here like:
    # name = Field()
    movie = Field()
    link = Field()
    price = Field()
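Items behave much like Python dicts, so you can sanity-check the class from an interactive session; a quick illustrative example:
>>> from xtravision.items import XtravisionItem
>>> item = XtravisionItem(movie='Pitch Perfect')
>>> item['movie']
'Pitch Perfect'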
- Create the spider by subclassing scrapy.spider.BaseSpider and defining the three mandatory attributes:
name: the spider's name, which must be unique
start_urls: the URLs the spider will begin crawling from
parse(): a method which receives the response object downloaded from each start URL. This method processes the response objects, returning scraped data as item objects and further URLs to follow as request objects.
In the parse method we will pass the response object to an HtmlXPathSelector and then use XPath syntax to define the parts of the HTML to extract, passing each expression to the select() method (part of the XPathSelector class). If you are new to XPath, there is a good tutorial on the W3Schools.com website.
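For quick reference, these are the XPath patterns the rest of this tutorial relies on (expressions without a leading // are evaluated relative to the current node):
//a                                 every <a> element in the document
//a[@class="product-image"]         every <a> whose class attribute is "product-image"
@href                               the href attribute of the current node
h2[@class="product-name"]/text()    the text inside an <h2 class="product-name"> child of the current node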
- A good way to find out what XPath to pass to the select method is to use the built-in Scrapy shell (make sure IPython is installed on your system for a friendlier shell).
Start the shell by going to your project's top-level directory and typing scrapy shell followed by the URL of the page you want to extract data from:
scrapy shell http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html
You should get back something similar to this:
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html><head>\n<meta http-equiv="Content-T'>
[s] item {}
[s] request <GET http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html>
[s] response <200 http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x1e10450>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Once loaded, you will have the response stored in the local response variable, plus two selector variables: hxs for HTML and xxs for XML.
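You can now test candidate expressions interactively before committing them to the spider; for example (the output depends on the live page):
>>> hxs.select('//h2[@class="product-name"]/text()').extract()
>>> hxs.select('//a[@class="product-image"]/@href').extract()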
- Open up the webpage of interest in Firefox and use the Firebug plugin to look at the HTML source. Looking at the xtra-vision.ie website, we will scrape the new releases page for movie name, URL and price. So do a quick search in the Firebug source window (Ctrl+F) for the name of one of the movies. The HTML for one item looks like this:
<li class="item first">
  <a href="http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd/pitch-perf..." title="Image for Pitch Perfect" class="product-image">
    <span class="exclusive-star"></span>
    <img src="http://www.xtra-vision.ie/media/catalog/product/cache/3/small_image/124x..." alt="Image for Pitch Perfect" />
    <h2 class="product-name">Pitch Perfect</h2>
  </a>
</li>
- So from looking at the HTML we see that each new release has an <a> element whose class attribute is product-image. We can select each of these <a> elements with the XPath //a[@class="product-image"], like this:
hxs.select('//a[@class="product-image"]')
- This will give us a list of selectors that we can drill down into further with more select() calls:
links = hxs.select('//a[@class="product-image"]')
items = []
# loop through each selector result and extract the fields defined in items.py
for link in links:
    item = XtravisionItem()
    item['movie'] = link.select('h2[@class="product-name"]/text()').extract()
    item['link'] = link.select('@href').extract()
    item['price'] = link.select('div[@class="price-box"]//span[@class="price"]/text()').extract()
    items.append(item)
return items
- The full code for the spider is saved as xtravision_spider.py in the spiders folder.
from scrapy.spider import BaseSpider  # needed to subclass BaseSpider
from scrapy.selector import HtmlXPathSelector
from xtravision.items import XtravisionItem

class XtravisionSpider(BaseSpider):
    # the three mandatory attributes: name, allowed_domains, start_urls
    name = "xtravision"
    allowed_domains = ["xtra-vision.ie"]
    start_urls = [
        "http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html/",
        "http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html?p=2"
    ]

    # parse() is called with the downloaded response object of each start URL
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a[@class="product-image"]')
        items = []
        # loop through each selector result and extract the fields defined in items.py
        for link in links:
            item = XtravisionItem()
            item['movie'] = link.select('h2[@class="product-name"]/text()').extract()
            item['link'] = link.select('@href').extract()
            item['price'] = link.select('div[@class="price-box"]//span[@class="price"]/text()').extract()
            items.append(item)
        return items
- Start the crawl: go to the project's top-level directory in the Terminal Window/command line and type:
scrapy crawl xtravision
- You can store the scraped data in a number of formats, such as CSV, JSON or XML. We will store our scraped data in a CSV file named new_releases.csv by running the spider with the -o and -t options:
scrapy crawl xtravision -o new_releases.csv -t csv
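Alternatively, as mentioned at the start, an item pipeline can store the scraped items in a database rather than a flat file. Below is a minimal sketch of such a pipeline using Python's built-in sqlite3 module; the class name, database name and table layout are illustrative choices of mine, and the pipeline must be enabled in settings.py via the ITEM_PIPELINES setting before it takes effect:
# xtravision/pipelines.py -- illustrative sketch only
import sqlite3

class SQLitePipeline(object):
    def open_spider(self, spider):
        # called once when the spider opens: connect and create the table
        self.conn = sqlite3.connect('new_releases.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS releases (movie TEXT, link TEXT, price TEXT)')

    def close_spider(self, spider):
        # called once when the spider closes: persist and clean up
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # extract() returns a list per field, so take the first value (or '')
        row = tuple((item.get(field) or [''])[0]
                    for field in ('movie', 'link', 'price'))
        self.conn.execute('INSERT INTO releases VALUES (?, ?, ?)', row)
        return item  # hand the item on to any later pipelines/exporters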