Scrapy

Scrapy is a screen-scraping and web-crawling framework written in Python, used to crawl websites and extract data in a structured manner. Results can be exported to a variety of formats, including JSON, CSV and XML, or sent to a storage backend such as FTP. You can also use an item pipeline to store items in a database. Scrapy is used for a number of purposes, including data mining, automated testing and monitoring.

How to scrape a website: an example

I’m assuming you have already installed Python and Scrapy. If not, go to http://www.python.org/getit/ to download Python, then follow http://doc.scrapy.org/en/latest/intro/install.html#intro-install to install Scrapy.

In this tutorial I will show you how I set up a simple Scrapy spider to extract the name, price and URL of new releases from the Xtra-vision movie rental site, and store them in a CSV file.

Steps to create a spider:

  1. Set up a new Scrapy project by going to a directory of your choice in the command line/terminal window and typing: scrapy startproject xtravision. This will create the xtravision directory with the basic file structure set up, as shown below.
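
    The generated layout looks roughly like this (your Scrapy version may differ slightly):

    xtravision/
        scrapy.cfg
        xtravision/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
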
  2. Define our item by modeling it with a field for each attribute we wish to capture. Since we want the name, URL and price of each movie, we define fields for these three attributes in the items.py file found in the xtravision directory.

    from scrapy.item import Item, Field

    class XtravisionItem(Item):
        # define the fields for your item here, like:
        # name = Field()
        movie = Field()
        link = Field()
        price = Field()
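
    Items behave much like Python dicts; as a quick illustration in a Python shell (the value assigned here is just an example):

    >>> item = XtravisionItem()
    >>> item['movie'] = 'Pitch Perfect'
    >>> item['movie']
    'Pitch Perfect'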

  3. Create the spider by subclassing scrapy.spider.BaseSpider and defining the three mandatory attributes:

    name: the spider name, which must be unique

    start_urls: the URLs the spider will begin crawling from

    parse(): a method that receives the downloaded response object for each start URL. It processes the responses and returns the scraped data as item objects, and any further URLs to follow as request objects.
    In the parse method we will pass the response object to an HtmlXPathSelector and then use XPath expressions, passed to its select() method (part of the XPathSelector class), to pick out the parts of the HTML to extract. If you are new to XPath, there is a good tutorial on the W3Schools.com website. A minimal spider skeleton is sketched below; the full spider for this tutorial appears in step 8.
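
    As a rough sketch (the spider name and start URL here are placeholders, not part of this project):

    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = "example"                          # placeholder; must be unique
        start_urls = ["http://www.example.com"]   # placeholder start URL

        def parse(self, response):
            # examine the response and return items and/or new Requests
            pass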

  4. A good way to find out what XPath expression to pass to the select() method is to use the built-in Scrapy shell (make sure IPython is installed on your system).

    Start the shell by going to your project’s top-level directory and typing scrapy shell followed by the URL of the page you want to extract data from:

    scrapy shell http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html

    You should get back something similar to this:

    [s] Available Scrapy objects:
    [s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head>\n<meta http-equiv="Content-T'>
    [s]   item       {}
    [s]   request    <GET http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html>
    [s]   response   <200 http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html>
    [s]   settings   <CrawlerSettings module=None>
    [s]   spider     <BaseSpider 'default' at 0x1e10450>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

     
    Once loaded you will have the response stored in the local response variable, plus two selector variables: hxs for HTML and xxs for XML.
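
    You can now try XPath expressions interactively; for example (the output depends on the live page, so none is shown here):

    hxs.select('//title/text()').extract()
    hxs.select('//a[@class="product-image"]/@href').extract()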

  5. Open up the web page of interest in Firefox and use the Firebug plugin to look at the HTML source. Looking at the xtra-vision.ie website, we will scrape the new releases page for the movie name, URL and price. Do a quick search in the Firebug source window (Ctrl+F) for the name of one of the movies. The HTML is:

    <li class="item first">
      <a href="http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd/pitch-perf... title="Image for Pitch Perfect" class="product-image">
        <span class="exclusive-star">
        </span>
        <img src="http://www.xtra-vision.ie/media/catalog/product/cache/3/small_image/124x... alt="Image for Pitch Perfect" />
        <h2 class="product-name">Pitch Perfect</h2>
      </a>
    </li>

  6. So from looking at the HTML we see that each new release is wrapped in an <a> element whose class attribute is set to product-image. We can select each of these elements with the XPath //a[@class="product-image"] like this:

    hxs.select('//a[@class="product-image"]')

  7. This will give a list of selectors that we can drill down further with more select() calls.

    links = hxs.select('//a[@class="product-image"]')
    items = []
    # loop through each selector result and extract the fields defined in items.py
    for link in links:
        item = XtravisionItem()
        item['movie'] = link.select('h2[@class="product-name"]/text()').extract()
        item['link'] = link.select('@href').extract()
        item['price'] = link.select('div[@class="price-box"]//span[@class="price"]/text()').extract()
        items.append(item)
    return items
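
    Note that extract() returns a list of unicode strings rather than a single string, so each field (item['movie'], for example) will typically hold a one-element list. Keep this in mind if you post-process the scraped items.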

  8. The full code for the spider is saved as xtravision_spider.py in the spiders folder:

    from scrapy.spider import BaseSpider  # needed to subclass BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from xtravision.items import XtravisionItem

    class XtravisionSpider(BaseSpider):
        # define the spider attributes: name, allowed_domains, start_urls
        name = "xtravision"
        allowed_domains = ["xtra-vision.ie"]
        start_urls = [
            "http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html",
            "http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd.html?p=2"
        ]

        # parse is called with the downloaded response object of each start URL
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            links = hxs.select('//a[@class="product-image"]')
            items = []
            # loop through each selector result and extract the fields defined in items.py
            for link in links:
                item = XtravisionItem()
                item['movie'] = link.select('h2[@class="product-name"]/text()').extract()
                item['link'] = link.select('@href').extract()
                item['price'] = link.select('div[@class="price-box"]//span[@class="price"]/text()').extract()
                items.append(item)
            return items

  9. Start the crawl: go to the project’s top-level directory in the terminal window/command line and type:

    scrapy crawl xtravision

  10. You can store the scraped data in a number of formats such as CSV, JSON and XML. We will store ours in a CSV file named new_releases.csv by running the spider with the -o and -t options:

    scrapy crawl xtravision -o new_releases.csv -t csv
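
As mentioned in the introduction, you can also use an item pipeline to store items in a database instead of a flat file. Below is a rough sketch only, assuming SQLite: the SqlitePipeline class, database file and table names are my own illustrative choices, not part of the tutorial project, and you would still need to enable the pipeline via the ITEM_PIPELINES setting in settings.py.

    # pipelines.py -- a hypothetical sketch, not the tutorial's actual code
    import sqlite3

    class SqlitePipeline(object):
        def open_spider(self, spider):
            # illustrative database file and table
            self.conn = sqlite3.connect('new_releases.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS releases (movie TEXT, link TEXT, price TEXT)')

        def process_item(self, item, spider):
            # extract() gave us lists, so flatten each field to a single string
            self.conn.execute(
                'INSERT INTO releases VALUES (?, ?, ?)',
                (u''.join(item['movie']), u''.join(item['link']), u''.join(item['price'])))
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()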