python,web-scraping,scrapy,scrapy-spider,craigslist
Multiple issues here. The main problem is the invalid expressions inside the select() calls. Aside from that: use response.xpath() or response.css(); there is no need for HtmlXPathSelector anymore; there is no need to instantiate an Item instance in the parse() callback and pass it in meta; get the url from response.url in the parse_listing_page() callback. Improved...
python,web-scraping,scrapy,sitemap,scrapy-spider
I think the nicest and cleanest solution would be to add a downloader middleware which changes the malformed URLs without the spider noticing. import re import urlparse from scrapy.http import XmlResponse from scrapy.utils.gz import gunzip, is_gzipped from scrapy.contrib.spiders import SitemapSpider # downloader middleware class SitemapWithoutSchemeMiddleware(object): def process_response(self, request, response, spider):...
python,web-scraping,scrapy,scrapy-spider
The problems you have in the code: yield item should be inside the loop, since you are instantiating items there; the xpath you have is pretty messy and not quite reliable, since it heavily relies on the elements' location inside parent tags and starts from almost the top parent of...
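As a hedged illustration of the first point (the item class, field names, and selectors below are made up, not taken from the question), the yield belongs inside the loop that instantiates the items:

import scrapy

class ListingItem(scrapy.Item):
    # hypothetical item with made-up fields
    title = scrapy.Field()
    price = scrapy.Field()

class ListingSpider(scrapy.Spider):
    name = "listing_example"
    start_urls = ["http://example.com/listings"]

    def parse(self, response):
        # one item per listing block, yielded inside the loop
        for block in response.xpath('//div[contains(@class, "listing")]'):
            item = ListingItem()
            item['title'] = block.xpath('.//h2/a/text()').extract()
            item['price'] = block.xpath('.//span[@class="price"]/text()').extract()
            yield item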
python,web-scraping,scrapy,screen-scraping,scrapy-spider
From what I understand, you want something similar to restrict_xpaths, but provide a CSS selector instead of an XPath expression. This is actually a built-in feature in Scrapy 1.0 (currently in a release candidate state), the argument is called restrict_css: restrict_css a CSS selector (or list of selectors) which defines...
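A minimal sketch of how restrict_css could be wired into a CrawlSpider rule on Scrapy 1.0 (the domain, selector and callback names are assumptions for illustration):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CssRestrictedSpider(CrawlSpider):
    name = "css_restricted_example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]

    rules = [
        # only extract links found inside the (assumed) pagination block
        Rule(LinkExtractor(restrict_css='ul.pagination'),
             callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        self.log("visited %s" % response.url)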
python,ajax,web-scraping,scrapy,scrapy-spider
The key problem is in missing quotes around the filters in the body: url = 'https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters' req = scrapy.Request(url, method='POST', body='{"filters": []}', headers={'X-Requested-With': 'XMLHttpRequest', 'Content-Type': 'application/json; charset=UTF-8'}, callback=self.parser2) yield req Or, you can define it as a dictionary and then call json.dumps() to dump it to a string: params =...
python,web-scraping,scrapy,scrapy-spider
You can avoid multiple inheritance here. Combine both spiders in a single one. If start_urls is passed from the command line, it behaves like a CrawlSpider; otherwise, like a regular spider: from scrapy import Item from scrapy.contrib.spiders import CrawlSpider, Rule from foo.items import AtlanticFirearmsItem from scrapy.contrib.loader import ItemLoader...
python,regex,scrapy,multiple-inheritance,scrapy-spider
You are on the right track; the only thing left is that, at the end of your parse_product function, you have to yield all the urls extracted by the crawler, like so: def parse_product(self, response): loader = FlipkartItemLoader(response=response) loader.add_value('pid', 'value of pid') loader.add_xpath('name', 'xpath to name') yield loader.load_item() # CrawlSpider defines...
python,mongodb,web-scraping,scrapy,scrapy-spider
The process_item() method is not indented properly; it should be: class MongoDBPipeline(object): def __init__(self): connection = pymongo.Connection(settings['MONGODB_HOST'], settings['MONGODB_PORT']) db = connection[settings['MONGODB_DATABASE']] self.collection = db[settings['MONGODB_COLLECTION']] def process_item(self, item, spider): self.collection.insert(dict(item)) log.msg("Item written to MongoDB database {}, collection {}, at host {}, port {}".format( settings['MONGODB_DATABASE'], settings['MONGODB_COLLECTION'], settings['MONGODB_HOST'],...
web-scraping,scrapy,scrapy-spider
You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is the sample code (not tested) based on this answer and adapted for your use case: from scrapy.crawler import Crawler from scrapy import log, signals from scrapy.settings import Settings from twisted.internet import reactor #...
python,xpath,scrapy,scrapy-spider
You can get all the child tags except the div with class="needremove": response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract() Demo from the shell: $ scrapy shell index.html In [1]: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract() Out[1]: [u'<p>text</p>', u'<p>text</p>', u'<p>text</p>',...
python,xml,web-scraping,scrapy,scrapy-spider
You need to yield Request instances to the other URLs: def check_login_response(self, response): # check login succeed before going on if "incorrect" in response.body: self.log("Login failed", level=scrapy.log.ERROR) return for url in list_or_urls: yield Request(url, callback=self.parse_other_url) def parse_other_url(self, response): # ... ...
I think what you're looking for is the yield statement: def parse(self, response): # Get the list of URLs, for example: list = ["http://a.com", "http://b.com", "http://c.com"] for link in list: request = scrapy.Request(link) yield request ...
python,xml,xpath,scrapy,scrapy-spider
The signature of parse_node() is incorrect. There should be a selector argument, on which you should call the xpath() method. Example: def parse_node(self, response, selector): to = selector.xpath('//to/text()').extract() who = selector.xpath('//from/text()').extract() print to, who Prints: [u'Tove'] [u'Jani'] ...
You are using an old Scrapy version (0.14.4) with the latest documentation. Solution: upgrade to the latest version of Scrapy, or read the old docs that suit the currently installed version...
xpath,web-scraping,scrapy,scrapy-spider
Browsers add tbody to table elements, which is why your xpath works in dev tools but fails with scrapy; this is a common gotcha. Usually you need to find the xpath yourself: don't trust automatically generated xpaths, they are usually needlessly long. For example, to get data about ships you could just...
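For instance, assuming the browser's dev tools suggested something like //table[@id="ships"]/tbody/tr (the id is invented here), the Scrapy-side expression would simply drop the tbody:

# inside a spider's parse() callback:
def parse(self, response):
    # dev tools may suggest //table[@id="ships"]/tbody/tr;
    # drop the browser-inserted tbody for Scrapy:
    for row in response.xpath('//table[@id="ships"]//tr'):
        cells = row.xpath('./td/text()').extract()
        self.log("row: %s" % cells)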
java,bash,shell,scrapy,scrapy-spider
You are missing a forward slash in your PATH declaration, and the path should be to the directory, not the program. You have export PATH=usr/local/bin/scrapy:$PATH Should be export PATH=/usr/local/bin:$PATH ...
python,web-scraping,scrapy,scrapy-spider
Let me suggest a more reliable and readable XPath to locate, for the sake of an example, the "Humidity" value, anchored on the "Humidity" column label: "".join(i.xpath('.//td[dfn="Humidity"]/following-sibling::td//text()').extract()).strip() It outputs 45% now. FYI, your XPath had at least one problem - the tbody tag - remove it from the XPath expression....
python,web-scraping,scrapy,scrapy-spider
This is a perfect use case for the process_value argument: from scrapy.contrib.linkextractors import LinkExtractor addition = "?pag_sortorder=0&pag_perPage=999" LinkExtractor(process_value=lambda x: x + addition) ...
python,callback,web-scraping,scrapy,scrapy-spider
Theoretically, it is doable, since a callback is just a callable that has a response as its argument. However, Items are just containers for fields; they are meant for storing data, and you should not put logic there. It is better to create a method in the spider and pass the item instance inside meta:...
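A hedged sketch of that pattern, with invented item fields, a placeholder detail URL and made-up selectors:

import scrapy

class PageItem(scrapy.Item):
    # hypothetical item with made-up fields
    title = scrapy.Field()
    details = scrapy.Field()

class MetaPassingSpider(scrapy.Spider):
    name = "meta_passing_example"
    start_urls = ["http://example.com/list"]

    def parse(self, response):
        item = PageItem()
        item['title'] = response.xpath('//h1/text()').extract()
        # hand the partially filled item over to the next callback via meta
        yield scrapy.Request("http://example.com/details",
                             meta={'item': item},
                             callback=self.parse_details)

    def parse_details(self, response):
        item = response.meta['item']
        item['details'] = response.xpath('//div[@id="details"]//text()').extract()
        yield item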
web-crawler,scrapy,scrapy-spider
import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse import urljoin class CompItem(scrapy.Item): name = scrapy.Field() price = scrapy.Field() location = scrapy.Field() class criticspider(CrawlSpider): name = "craig" allowed_domains = ["newyork.craigslist.org"] start_urls = ["http://newyork.craigslist.org/search/cta"] def parse(self, response): sites = response.xpath('//div[@class="content"]') items = []...
python,web-scraping,scrapy,scrapy-spider
You need to pass the item instantiated in parse() inside the meta to the secondPage callback: def parse(self, response): for sel in response.xpath('//*[@id="conteudoInternas"]/ul/li'): item = CinemasItem() item['title'] = 'title' item['room'] = 'room' item['mclass'] = 'mclass' item['minAge'] = 'minAge' item['cover'] = 'cover' item['sessions'] = 'sessions' secondUrl = sel.xpath('p[1]/a/@href').extract()[0] # see: we...
python,list,key,scrapy,scrapy-spider
Scrapy Item class provides a dictionary-like interface for storing the extracted data. There are no default values set for item fields. To check whether the field was set or not, simply check for the field key in the item instance: if 'link' not in item: print "link has not been...
python,html,web-scraping,scrapy,scrapy-spider
I would use the starts-with() XPath function to get the div element's text that starts with "Timings": sel.xpath('.//div[starts-with(., "Timings")]/text()').extract() Note that the HTML structure of the page doesn't make it easy to distinguish locations from each other - there are no location-specific containers that you can iterate over. In this case,...
python,authentication,scrapy,scrapy-spider
I was able to figure this out. I had to override the Request object in order to set a new authorization token into the header when the token expires. I made the token a global variable. # override Request object in order to set new authorization token into the...
python,xml,scrapy,scrapy-spider
Change the word from to something different; in Python it is a keyword.
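For instance (reusing the to/from feed fields that appear in the surrounding answers), just bind the value to a differently named variable:

# inside an XMLFeedSpider subclass:
def parse_node(self, response, selector):
    # "from" is a Python keyword, so bind the value to a different name
    sender = selector.xpath('//from/text()').extract()
    recipient = selector.xpath('//to/text()').extract()
    print sender, recipient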
python,scrapy,web-crawler,scrape,scrapy-spider
You can also use the link extractor to pull all the links once you are parsing each page. The link extractor will filter the links for you. In this example the link extractor will deny links in the allowed domain so it only gets outside links. from scrapy.contrib.spiders import CrawlSpider,...
python,scrapy,python-requests,scrapy-spider
I resolved this by removing the cookies and not setting the "Content-Length" header manually on my yielded request. It seems like those 2 things were the extra 21 bytes on the 2nd TCP segment and caused the 413 response. Maybe the server was interpreting the "Content-Length" as the combined value...
python,web-scraping,scrapy,scrapy-spider
Allowed domains should be defined without the http://. For example: allowed_domains= ["cbfcindia.gov.in/"] If any issues persist, then please show the full log that includes details of the pages crawled and any redirects that may have occurred....
delete,web-crawler,scrapy,scrapy-spider,scrapinghub
You just need to remove the spider from your project, and deploy the project again, via shub deploy, or scrapyd-deploy.
I don't know if CrawlSpider is supposed to be deterministic, but a site with millions of pages probably does not always have the same links on a given page and may not deliver responses in the same order. No, I would not expect a CrawlSpider to take the same path. But...
python,xml,scrapy,scrapy-spider
The extract() method will always return a list of values, even if there is only a single value as a result, for example: [4], [3,4,5], or an empty list if nothing matched. To avoid this, if you know there is only one value, you can select it like: item['to'] = selector.xpath('//to/text()').extract()[0] Note: Be aware that...
python,xpath,web-scraping,scrapy,scrapy-spider
First of all, your XPath expressions are very fragile in general. The main problem with your approach is that it is not anchored on a review section, but it should be. In other words, you are not iterating over review blocks on the page. Also, the model name should be extracted outside...
python,web-scraping,scrapy,scrapy-spider
Which is the best way to make the spider follow the pagination of a URL? This is very site-specific and depends on how the pagination is implemented. If the pagination is done with jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination...
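For server-side pagination, a hedged sketch (the next-link selector and URLs are assumptions) is to yield a Request for the next page from the same callback:

import urlparse
import scrapy

class PaginationSpider(scrapy.Spider):
    name = "pagination_example"
    start_urls = ["http://example.com/page/1"]

    def parse(self, response):
        # ... extract items from the current page here ...

        # assumed selector for the "next" link; adjust to the real markup
        next_page = response.xpath('//a[@rel="next"]/@href').extract()
        if next_page:
            url = urlparse.urljoin(response.url, next_page[0])
            yield scrapy.Request(url, callback=self.parse)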
You need to indent the parse block. import scrapy class StockSpider(scrapy.Spider): name = "ugaz" allowed_domains = ["nasdaq.com"] start_urls = ["http://www.nasdaq.com/symbol/ugaz/"] # Indent this block def parse(self, response): stock = StockItem() stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract() stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract() return stock ...
python,web-crawler,scrapy,scrapy-spider
You need to add a scheme to the URL: ftp://ftp.site.co.uk The FTP URL syntax is defined as: ftp://[<user>[:<password>]@]<host>[:<port>]/<url-path> Basically, you do this: yield Request('ftp://ftp.site.co.uk/feed.xml', ...) Read more about schemes at Wikipedia: http://en.wikipedia.org/wiki/URI_scheme...
It was possible to run multiple spiders within one reactor by keeping the reactor open until all the spiders have stopped running. This was achieved by keeping a list of all the running spiders and not executing reactor.stop() until this list is empty: import sys import os from scrapy.utils.project import...
Take a closer look at the traceback; there is the line: File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item if word in unicode(item['description']).lower(): This means that your pipeline is throwing the error while trying to process an item. Then, see which fields you fill in the spider: for site in sites:...
python,web-scraping,arguments,scrapy,scrapy-spider
You need to pass your parameters to the crawl method of the CrawlerProcess, so you need to run it like this: crawler = CrawlerProcess(Settings()) crawler.crawl(BBSpider, start_url=url) crawler.start() ...
There is a Warning box in the CrawlSpider documentation. It says: When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work. Your code does...
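A hedged sketch of the fix: leave parse() alone and point the rule at a differently named callback (the domain, pattern and names below are invented):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class RuleSpider(CrawlSpider):
    name = "rule_example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]

    rules = [
        # do NOT call the callback "parse" -- CrawlSpider needs that name for itself
        Rule(LinkExtractor(allow=r'/items/'), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        self.log("scraping %s" % response.url)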
python,python-2.7,web-scraping,scrapy,scrapy-spider
You are looking for the CLOSESPIDER_PAGECOUNT setting of the CloseSpider extension: An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or non set), spiders won’t be closed by number of...
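For example (a hedged illustration; the number and spider name are arbitrary), the setting can go into settings.py or be passed on the command line:

# settings.py -- close the spider after roughly 100 crawled responses
CLOSESPIDER_PAGECOUNT = 100

or, equivalently, from the command line: scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=100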
python,web-scraping,scrapy,web-crawler,scrapy-spider
You need to call extract() and then get the first item: item['main_title'] = site.xpath('.//li[@id="nav_cat_0"]/text()').extract()[0] # HERE ^ If you want to have a separate category for each item, iterate over them: for title in site.xpath('.//li[starts-with(@id, "nav_cat_")]/text()').extract(): item = AmazonItem() item['main_title'] = title items.append(item) ...
In order for get_project_settings() to find the desired settings.py, set the SCRAPY_SETTINGS_MODULE environment variable: import os import sys # ... sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject")) os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings' settings = get_project_settings() Note that, due to the location of your runner script, you need to add myproject to the sys.path. Or, move scrapyScript.py under myproject...
python,html,web-crawler,scrapy,scrapy-spider
Nevermind, I found the answer. type() only gives information on the immediate type. It tells nothing of inheritance. I was looking for isinstance(). This code works: if isinstance(response, TextResponse): links = response.xpath("//a/@href").extract() ... http://stackoverflow.com/a/2225066/1455074, near the bottom...
I'm not sure if there is a way to push items directly to the engine, but what you could do is push dummy requests with the items in the meta variable and just yield them in the callback. def _file_thread_loop(self): while True: #... read files... for file in files: response...
python,web-crawler,scrapy,scrapy-spider,sgml
The problem is in the restrict_xpaths - it should point to a block where a link extractor should look for links. Don't specify allow at all: rules = [ Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), callback="parse_me", follow=True), ] And you need to fix your allowed_domains: allowed_domains = ["www.allgigs.co.uk"] Also note that the print items in...
python,python-2.7,scrapy,screen-scraping,scrapy-spider
Please show the position of items.py in the project structure. You should have something like this: craig (folder) craig.py items (folder) __init__.py items.py ...
The problem is that the HTML has an incorrect HTML base element, which is supposed to specify the base url for all the relative links in the page: <base href="http://www.lecture-en-ligne.com/"/> Scrapy is respecting that, that's why the links are being formed that way....
python,web-scraping,scrapy,scrapy-spider
There is a simple mistake inside the xpath expressions for the item fields. The loop is already going over the a tags; you don't need to specify a in the inner xpath expressions. In other words, you are currently searching for a tags inside the a tags inside the td...
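To illustrate with invented item fields and selectors, the inner expressions should stay relative to the a element the loop already selected:

import scrapy

class LinkItem(scrapy.Item):
    # hypothetical item with made-up fields
    title = scrapy.Field()
    url = scrapy.Field()

class RelativeXPathSpider(scrapy.Spider):
    name = "relative_xpath_example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # the loop variable is already an <a> element -- keep inner paths relative
        for link in response.xpath('//td[@class="title"]/a'):
            item = LinkItem()
            item['title'] = link.xpath('text()').extract()
            item['url'] = link.xpath('@href').extract()
            yield item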
python,xml,rss,scrapy,scrapy-spider
You need to handle namespaces: class PLoSSpider(XMLFeedSpider): name = "plos" namespaces = [('atom', 'http://www.w3.org/2005/Atom')] itertag = 'atom:entry' iterator = 'xml' # this is also important See also: how do I use empty namespaces in an lxml xpath query? Working example: from scrapy.contrib.spiders import XMLFeedSpider class PLoSSpider(XMLFeedSpider): name = "plos" namespaces...
python,web-scraping,scrapy,scrapy-spider
That's because your locators match multiple elements and are not context-specific (should start with a dot), fix it: def parse(self, response): for obj in response.css("ul.search-results li"): item = YelpItem() item['name'] = obj.xpath(".//div[@class='media-story']//h3//a/text()").extract()[0] item['address'] = ''.join(obj.xpath(".//div[@class='secondary-attributes']//address/text()").extract()).strip() yield item ...
python,web-scraping,scrapy,scrapy-spider
By default, Scrapy only handles responses with status codes 200-300. Let Scrapy handle 500 and 502: class Spider(...): handle_httpstatus_list = [500, 502] Then, in the parse() callback, check response.status: def parse(response): if response.status == 500: # logic here elif response.status == 502: # logic here ...
python,json,web-scraping,scrapy,scrapy-spider
The output file would contain only [ if there was an error while crawling or there were no items returned. In your case, it is because of the indentation; parse_item() should be indented: class CaptchaSpider(CrawlSpider): name = "CaptchaSpider" allowed_domains = ["*****.ac.in"] start_urls = [ "https://*****.ac.in/*****.asp" ] def parse_item(self, response): item...
python,web-scraping,scrapy,scrapy-spider
Collect the phone details in the list and join() them after the loop: def parse(self, response): phone = gsmArenaDataItem() details = [] for tableRows in response.css("div#specs-list table"): phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0] for ttl in tableRows.xpath(".//td[@class='ttl']"): ttl_value = " ".join(ttl.xpath(".//text()").extract()) nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract()) details.append('{title}: {info}'.format(title=ttl_value.encode("utf-8"),...
python,flask,scrapy,scrapy-spider
Override spider's __init__() method: class MySpider(Spider): name = 'my_spider' def __init__(self, *args, **kwargs): super(MySpider, self).__init__(*args, **kwargs) endpoints = kwargs.get('start_urls').split(',') self.start_urls = ["http://www.google.com/patents/" + x for x in endpoints] And pass the list of endpoints through the -a command line argument: scrapy crawl patents -a start_urls="US6249832,US20120095946" -o static/s.json See also: How...
You can subclass scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and override _retry() to do whatever you want with the request that is given up on. from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware from scrapy import log class CustomRetryMiddleware(RetryMiddleware): def _retry(self, request, reason, spider): retries = request.meta.get('retry_times', 0) + 1 if retries <= self.max_retry_times: log.msg(format="Retrying %(request)s (failed %(retries)d times):...
python,xml,xpath,scrapy,scrapy-spider
You need to instantiate a CrawlerItem instance in your parse_node() method: def parse_node(self, response, selector): item = CrawlerItem() item['to'] = selector.xpath('//to/text()').extract() item['who'] = selector.xpath('//from/text()').extract() item['heading'] = selector.xpath('//heading/text()').extract() item['body'] = selector.xpath('//body/text()').extract() return item ...
web-scraping,scrapy,scrape,scrapy-spider
The HTTP status 302 means Moved Temporarily. When I do an HTTP GET request to the url http://fuyuanxincun.fang.com/xiangqing/ it shows me an HTTP 200 status. It's common that the server won't send anything after sending the 302 status code (although technically sending data after a 302 is possible). The reason why...
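If you do want to see the 302 response yourself instead of having RedirectMiddleware follow it, a hedged sketch is to disable the redirect for that request and whitelist the status code:

import scrapy

class RedirectInspectionSpider(scrapy.Spider):
    name = "redirect_inspection_example"
    # let 302 responses reach the callback instead of being filtered out
    handle_httpstatus_list = [302]

    def start_requests(self):
        yield scrapy.Request(
            "http://fuyuanxincun.fang.com/xiangqing/",
            meta={'dont_redirect': True},  # stop RedirectMiddleware from following it
            callback=self.parse)

    def parse(self, response):
        self.log("got status %d for %s" % (response.status, response.url))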
python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider
from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.exceptions import CloseSpider from scrapy.http import Request from test.items import CraigslistSampleItem from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor URL = "http://example.com/subpage/%d" class MySpider(BaseSpider): name = "craig" allowed_domains = ["xyz.com"] def start_requests(self): for i in range(10): yield Request(URL % i,...
python,xpath,web-scraping,scrapy,scrapy-spider
Let's make it simpler. Each division is represented with a div with a mod-teams-list-medium class. Each division div consists of 2 parts: a div with class="mod-header" containing the division name, and a div with class="mod-content" containing the list of teams. Inside your spider it would be reflected this way: for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):...
python,web-scraping,scrapy,scrapy-spider
Before reading the technical part: make sure you are not violating the iTunes terms of use. All of the problems you have are inside the parse() callback: the main xpath is not correct (there is no ul element directly under the section); instead of response.selector you can directly use response...
For the rules to work, you need to use CrawlSpider, not the general scrapy Spider. Also, you need to rename your first parsing function to a name other than parse. Otherwise, you will be overwriting an important method of the CrawlSpider and it will not work. See the warning in the...
python-2.7,web-scraping,scrapy,screen-scraping,scrapy-spider
The problem is that you are putting your spider into the items.py. Instead, create a package spiders, inside it create a dmoz.py and put your spider into it. See more at Our first Spider paragraph of the tutorial....
python,selenium,web-scraping,scrapy,scrapy-spider
Just yield Request instances from the method and provide a callback: class AB_Spider(CrawlSpider): ... def parse_abcs(self, response): ... all_URLS = hxs.xpath('a/@href').extract() for url in all_URLS: yield Request(url, callback=self.parse_page) def parse_page(self, response): # Do the parsing here ...
python,web-scraping,scrapy,scrapy-spider
Set the default parse callback to spin off all the links. By default Scrapy does not visit the same page twice. def parse(self, response): links = LinkExtractor().extract_links(response) return (Request(url=link.url, callback=self.parse_page) for link in links) def parse_page(self, response): # name = manipulate response.url to be a unique file name with open(name,...
python,web-scraping,web-crawler,scrapy,scrapy-spider
The final answer did indeed lie in the indentation of one single yield line. This is the code that ended up doing what I needed it to do. from scrapy.spider import Spider from scrapy.selector import Selector from scrapy.http import Request import re from yelp2.items import YelpReviewItem RESTAURANTS = ['sixteen-chicago'] def...
python,selenium,selenium-webdriver,scrapy,scrapy-spider
I managed it differently.. See my code for further reference. Working fine for complete site.. class FlipkartSpider(BaseSpider): name = "flip1" allowed_domains = ["flipkart.com"] start_urls = [ "http://www.flipkart.com/tablets/pr?sid=tyy%2Chry&q=mobile&ref=b8b64676-065a-445c-a6a1-bc964d5ff938" ] '''def is_element_present(self, finder, selector, wait_time=None): wait_time = wait_time or self.wait_time end_time = time.time() + wait_time while time.time() < end_time: if finder(selector):...
It seems like your parameters don't match (it should be login instead of username) and you are missing some of them in your formdata. This is what Firebug shows me is delivered when trying to log in: it seems like layoutType and returnUrl can just be hardcoded in, but profillingSessionId needs to be retrieved...
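A hedged sketch of the resulting login request (the field names follow the ones mentioned above; the URL and all values are placeholders, and the hidden-field extraction is an assumption about the page):

import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = "login_example"
    start_urls = ["http://example.com/login"]

    def parse(self, response):
        # profillingSessionId is assumed to live in a hidden input on the login page
        session_id = response.xpath('//input[@name="profillingSessionId"]/@value').extract()
        return FormRequest.from_response(
            response,
            formdata={
                'login': 'my_username',        # note: "login", not "username"
                'password': 'my_password',
                'layoutType': '1',             # hardcoded, as suggested above
                'returnUrl': '/',
                'profillingSessionId': session_id[0] if session_id else '',
            },
            callback=self.after_login)

    def after_login(self, response):
        self.log("logged in, landed on %s" % response.url)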
python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider
The main problem is in how you are passing the booking type, country, pickup and dropoff. You need to pass the corresponding "id"s instead of literal strings. The following would work in your case: return FormRequest.from_response( response, formxpath="//form[@id='transfer_search']", formdata={ 'bookingtypeid': '1', 'airportgroupid': '14', 'pickup': '121', 'dropoff': '1076', 'arrivaldate': '12-07-2015', 'arrivalhour':...
python,web-scraping,scrapy,scrapy-spider
I don't think that you need two rules; you can declare one and have it follow links and parse each page. In the rule I restrict the xpath to the last link of the list, because otherwise you could be parsing some links multiple times. I use parse_start_url as...
python,xpath,web-scraping,scrapy,scrapy-spider
This is because you are setting default input and output processors, which are applied to all item fields, including time, which is a float. You have multiple options: instead of default processors, use field-specific processors: l.name_in = MapCompose(lambda v: v.split(), replace_escape_chars) l.name_out = Join() convert/format the time into a string:...
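A hedged sketch of the first option, declaring field-specific processors on a loader subclass (the loader name and fields are illustrative, not taken from the question):

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join, TakeFirst

class ProductLoader(ItemLoader):
    # a default that does not call string methods on numeric fields such as "time"
    default_output_processor = TakeFirst()

    # processors applied only to the "name" field
    name_in = MapCompose(unicode.strip)
    name_out = Join()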
python,postgresql,scrapy,scrapy-spider
You need to do this in Scrapy pipelines. You can create 2 item classes, one for each table. In the pipeline, you check the item type and, based on that, store it in the table you want. From your description, the code can look like this: class Item1(Item): location = Field() salary...
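A minimal sketch of that shape; the item classes mirror the (truncated) snippet above, the table names are illustrative, and the actual database writes are left as placeholders:

from scrapy import Item, Field

class JobItem(Item):
    # destined for the first table (field names follow the snippet above)
    location = Field()
    salary = Field()

class CompanyItem(Item):
    # destined for the second table
    name = Field()

class TableRoutingPipeline(object):
    def process_item(self, item, spider):
        # route the item to a table based on its type (actual INSERTs omitted)
        if isinstance(item, JobItem):
            spider.log("would insert into the jobs table: %s" % dict(item))
        elif isinstance(item, CompanyItem):
            spider.log("would insert into the companies table: %s" % dict(item))
        return item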
python,web-scraping,scrapy,scrapy-spider
The XPath configured inside restict_xpaths should point to an element, not an attribute. Replace: //*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a/@href with: //*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a ...
python,python-2.7,web-scraping,scrapy,scrapy-spider
You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string: from datetime import datetime class MySpider(CrawlSpider): name = 'tw' allowed_domains = ['test.com'] def __init__(self, *args, **kwargs): super(MySpider, self).__init__(*args, **kwargs) date = kwargs.get('date') if not date: raise ValueError('No date...
python,web-scraping,scrapy,scrapy-spider
The problem is that the menu is constructed dynamically with the help of the browser executing javascript. Scrapy is not a browser and doesn't have a javascript engine built in. Fortunately, there is a script tag containing a javascript array of menu objects. We can locate the desired script tag, extract...
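A hedged sketch of that idea, assuming the script assigns the menu to a variable called menuItems and that the array is valid JSON (both the variable name and the URL are assumptions):

import json
import re
import scrapy

class MenuSpider(scrapy.Spider):
    name = "menu_example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # locate the script tag that defines the (assumed) menuItems array
        script = response.xpath('//script[contains(., "menuItems")]/text()').extract()
        if not script:
            return
        match = re.search(r'menuItems\s*=\s*(\[.*?\])\s*;', script[0], re.DOTALL)
        if match:
            for entry in json.loads(match.group(1)):
                self.log("menu entry: %s" % entry)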
python,web-scraping,web-crawler,scrapy,scrapy-spider
The key problem with your code is that you have not set the rules for the CrawlSpider. Other improvements I would suggest: there is no need to instantiate HtmlXPathSelector, you can use response directly; select() is deprecated now, use xpath(); get the text() of the title element in order to...