Scrapy JSON export issues

python,web-scraping,scrapy,scrapy-spider,craigslist

Multiple issues here. The main problem is the invalid expressions inside the select() calls. Aside from that: use response.xpath() or response.css(); there is no need for HtmlXPathSelector anymore; there is no need to instantiate an Item instance in the parse() callback and pass it in meta; get the url from response.url in the parse_listing_page() callback. Improved...
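
A minimal sketch of what such a cleaned-up spider can look like, assuming Scrapy >= 1.0 (response.urljoin and yielding plain dicts); the craigslist XPaths and spider name below are illustrative placeholders, not the asker's originals:

    import scrapy


    class CraigslistSpider(scrapy.Spider):
        name = 'craigslist_sample'                                   # hypothetical name
        start_urls = ['http://newyork.craigslist.org/search/cta']

        def parse(self, response):
            # response.xpath()/response.css() replace HtmlXPathSelector
            for href in response.xpath('//p[@class="row"]//a/@href').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_listing_page)

        def parse_listing_page(self, response):
            # no Item carried through meta; read the url straight from response.url
            yield {
                'url': response.url,
                'title': response.xpath('//h2/span/text()').extract(),
            }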

Parsing the urls in sitemap with different url format using sitemap spider in scrapy, python

python,web-scraping,scrapy,sitemap,scrapy-spider

I think the nicest and cleanest solution would be to add a downloader middleware which rewrites the malformed URLs without the spider noticing. import re import urlparse from scrapy.http import XmlResponse from scrapy.utils.gz import gunzip, is_gzipped from scrapy.contrib.spiders import SitemapSpider # downloader middleware class SitemapWithoutSchemeMiddleware(object): def process_response(self, request, response, spider):...
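
A stripped-down sketch of that idea, assuming the sitemap entries simply lack a scheme; the http:// default is an assumption, and the original answer's full version also handles gzipped sitemaps:

    import re

    from scrapy.contrib.spiders import SitemapSpider  # scrapy.spiders in newer versions


    class SitemapWithoutSchemeMiddleware(object):
        """Prepend a scheme to scheme-less <loc> URLs before the spider sees them."""

        def process_response(self, request, response, spider):
            if isinstance(spider, SitemapSpider) and b'<loc>' in response.body:
                fixed = re.sub(b'<loc>(?!https?://)', b'<loc>http://', response.body)
                return response.replace(body=fixed)
            return response

Remember to register the class in DOWNLOADER_MIDDLEWARES in settings.py so it actually runs.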

Python scrapy to extract specific Xpath fields

python,web-scraping,scrapy,scrapy-spider

The problems you have in the code: yield item should be inside the loop, since you are instantiating items there; the xpath you have is pretty messy and not quite reliable, since it heavily relies on the elements' location inside parent tags and starts from almost the top parent of...

Extracting links with scrapy that have a specific css class

python,web-scraping,scrapy,screen-scraping,scrapy-spider

From what I understand, you want something similar to restrict_xpaths, but provide a CSS selector instead of an XPath expression. This is actually a built-in feature in Scrapy 1.0 (currently in a release candidate state); the argument is called restrict_css: restrict_css a CSS selector (or list of selectors) which defines...
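
For illustration, a Scrapy 1.0-style rule using restrict_css; the CSS class, spider name and callback below are placeholders:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class CssLinksSpider(CrawlSpider):
        name = 'css_links'                      # hypothetical spider
        start_urls = ['http://example.com/']

        rules = [
            # only extract links from <a> elements carrying the given class
            Rule(LinkExtractor(restrict_css='a.my-link-class'),
                 callback='parse_item', follow=True),
        ]

        def parse_item(self, response):
            yield {'url': response.url}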

Scrapy simulate XHR request - returning 400

python,ajax,web-scraping,scrapy,scrapy-spider

The key problem is the missing quotes around the filters in the body: url = 'https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters' req = scrapy.Request(url, method='POST', body='{"filters": []}', headers={'X-Requested-With': 'XMLHttpRequest', 'Content-Type': 'application/json; charset=UTF-8'}, callback=self.parser2, ) yield req Or, you can define it as a dictionary and then call json.dumps() to dump it to a string: params =...
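
The second variant, sketched as a full spider so it is self-contained; the spider name and the parser2 body are assumptions:

    import json

    import scrapy


    class PhoneFilterSpider(scrapy.Spider):
        name = 'tele2_filters'              # hypothetical spider

        def start_requests(self):
            url = ('https://www.tele2.no/Services/Webshop/'
                   'FilterService.svc/ApplyPhoneFilters')
            params = {'filters': []}
            yield scrapy.Request(
                url,
                method='POST',
                body=json.dumps(params),                     # same as '{"filters": []}'
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parser2,
            )

        def parser2(self, response):
            self.logger.debug(response.body)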

How can I reuse the parse method of my scrapy Spider-based spider in an inheriting CrawlSpider?

python,web-scraping,scrapy,scrapy-spider

You can avoid multiple inheritance here. Combine both spiders in a single one. If start_urls is passed from the command line, it behaves like a CrawlSpider; otherwise, like a regular spider: from scrapy import Item from scrapy.contrib.spiders import CrawlSpider, Rule from foo.items import AtlanticFirearmsItem from scrapy.contrib.loader import ItemLoader...

Multiple inheritance in scrapy spiders

python,regex,scrapy,multiple-inheritance,scrapy-spider

You are on the right track; the only thing left is that, at the end of your parse_product function, you have to yield all the urls extracted by the crawler, like so: def parse_product(self, response): loader = FlipkartItemLoader(response=response) loader.add_value('pid', 'value of pid') loader.add_xpath('name', 'xpath to name') yield loader.load_item() # CrawlSpider defines...

Why scrapy not storing data into mongodb?

python,mongodb,web-scraping,scrapy,scrapy-spider

The process_item() method is not indented properly; it should be: class MongoDBPipeline(object): def __init__(self): connection = pymongo.Connection(settings['MONGODB_HOST'], settings['MONGODB_PORT']) db = connection[settings['MONGODB_DATABASE']] self.collection = db[settings['MONGODB_COLLECTION']] def process_item(self, item, spider): self.collection.insert(dict(item)) log.msg("Item wrote to MongoDB database {}, collection {}, at host {}, port {}".format( settings['MONGODB_DATABASE'], settings['MONGODB_COLLECTION'], settings['MONGODB_HOST'],...
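
For comparison, the correctly indented pipeline, kept close to the excerpt's old-style imports; scrapy.conf and pymongo.Connection are long deprecated, so newer code would use crawler settings and pymongo.MongoClient instead:

    import pymongo

    from scrapy.conf import settings


    class MongoDBPipeline(object):
        def __init__(self):
            connection = pymongo.Connection(settings['MONGODB_HOST'],
                                            settings['MONGODB_PORT'])
            db = connection[settings['MONGODB_DATABASE']]
            self.collection = db[settings['MONGODB_COLLECTION']]

        # process_item() is a method of the class, not nested inside __init__()
        def process_item(self, item, spider):
            self.collection.insert(dict(item))
            return item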

Running multiple spiders in the same process, one spider at a time

web-scraping,scrapy,scrapy-spider

You can rely on the spider_closed signal to start crawling for the next postal code/category. Here is the sample code (not tested), based on this answer and adapted to your use case: from scrapy.crawler import Crawler from scrapy import log, signals from scrapy.settings import Settings from twisted.internet import reactor #...

Remove first tag html using python & scrapy

python,xpath,scrapy,scrapy-spider

You can get all the child tags except the div with class="needremove": response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract() Demo from the shell: $ scrapy shell index.html In [1]: response.xpath('//div[contains(@class, "abc")]/div[contains(@class, "xyz")]/*[local-name() != "div" and not(contains(@class, "needremove"))]').extract() Out[1]: [u'<p>text</p>', u'<p>text</p>', u'<p>text</p>',...

How to read xml directly from URLs with scrapy/python

python,xml,web-scraping,scrapy,scrapy-spider

You need to yield Request instances for the other URLs: def check_login_response(self, response): # check login succeed before going on if "incorrect" in response.body: self.log("Login failed", level=scrapy.log.ERROR) return for url in list_or_urls: yield Request(url, callback=self.parse_other_url) def parse_other_url(self, response): # ... ...

Scrapy - Scrape multiple URLs using results from the first URL

python,scrapy,scrapy-spider

I think what you're looking for is the yield statement: def parse(self, response): # Get the list of URLs, for example: list = ["http://a.com", "http://b.com", "http://c.com"] for link in list: request = scrapy.Request(link) yield request ...

Scrapy - Issue with xpath on an xml crawl

python,xml,xpath,scrapy,scrapy-spider

The signature of parse_node() is incorrect. There should be a selector argument given, on which you should call the xpath() method. Example: def parse_node(self, response, selector): to = selector.xpath('//to/text()').extract() who = selector.xpath('//from/text()').extract() print to, who Prints: [u'Tove'] [u'Jani'] ...

AttributeError: 'module' object has no attribute 'Spider'

python,scrapy,scrapy-spider

You are using an old Scrapy (0.14.4) with the latest documentation. Solution: upgrade to the latest version of Scrapy, or read the old docs that match the currently installed version...

Scrapy doesn't download data though the xpath is correct

xpath,web-scraping,scrapy,scrapy-spider

Browsers add tbody to table elements, which is why your xpath works in dev tools but fails with scrapy; this is a common gotcha. Usually you need to write the xpath yourself; don't trust automatically generated xpaths, they are usually needlessly long. For example, to get data about ships you could just...
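
A short illustration of the fix; the table id and column positions are hypothetical:

    def parse(self, response):
        # Browser devtools suggest //table[@id="ships"]/tbody/tr/td[2]/text(),
        # but the tbody is inserted by the browser - query without it:
        for row in response.xpath('//table[@id="ships"]//tr'):
            yield {
                'name': row.xpath('./td[1]//text()').extract(),
                'year': row.xpath('./td[2]//text()').extract(),
            }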

Scrapy command in shell script not executing when called from java

java,bash,shell,scrapy,scrapy-spider

You are missing a forward slash in your PATH declaration, and the path should be to the directory, not the program. You have: export PATH=usr/local/bin/scrapy:$PATH It should be: export PATH=/usr/local/bin:$PATH ...

Why does my Scrapy code return an empty array?

python,web-scraping,scrapy,scrapy-spider

Let me suggest a more reliable and readable XPath to locate, for the sake of an example, the "Humidity" value, anchored on the "Humidity" column label: "".join(i.xpath('.//td[dfn="Humidity"]/following-sibling::td//text()').extract()).strip() Outputs 45% now. FYI, your XPath had at least one problem - the tbody tag - remove it from the XPath expression....

scrapy append to linkextractor links

python,web-scraping,scrapy,scrapy-spider

This is a perfect use case for using process_value argument: from scrapy.contrib.linkextractors import LinkExtractor addition = "?pag_sortorder=0&pag_perPage=999" LinkExtractor(process_value=lambda x: x + addition) ...

Can I specify any method as the callback when constructing a Scrapy Request object?

python,callback,web-scraping,scrapy,scrapy-spider

Theoretically, it is doable, since the callback is just a callable that takes a response as its argument. However, Items are just containers of fields; they are for storing data, and you should not put logic there. Better to create a method in the spider and pass the item instance inside meta:...

How to crawl classified websites [closed]

web-crawler,scrapy,scrapy-spider

import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse import urljoin class CompItem(scrapy.Item): name = scrapy.Field() price = scrapy.Field() location = scrapy.Field() class criticspider(CrawlSpider): name = "craig" allowed_domains = ["newyork.craigslist.org"] start_urls = ["http://newyork.craigslist.org/search/cta"] def parse(self, response): sites = response.xpath('//div[@class="content"]') items = []...

Recursive crawling over a page

python,web-scraping,scrapy,scrapy-spider

You need to pass the item instantiated in parse() inside the meta to the secondPage callback: def parse(self, response): for sel in response.xpath('//*[@id="conteudoInternas"]/ul/li'): item = CinemasItem() item['title'] = 'title' item['room'] = 'room' item['mclass'] = 'mclass' item['minAge'] = 'minAge' item['cover'] = 'cover' item['sessions'] = 'sessions' secondUrl = sel.xpath('p[1]/a/@href').extract()[0] # see: we...
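
A condensed sketch of that hand-off, assuming the asker's CinemasItem lives in the project's items module; the field XPaths marked as placeholders are not from the original code:

    from scrapy.http import Request

    from myproject.items import CinemasItem   # path to the asker's item class (assumed)


    # inside the spider class:
    def parse(self, response):
        for sel in response.xpath('//*[@id="conteudoInternas"]/ul/li'):
            item = CinemasItem()
            item['title'] = sel.xpath('span[1]/text()').extract()       # placeholder xpath
            second_url = sel.xpath('p[1]/a/@href').extract()[0]
            # hand the half-filled item to the next callback through meta
            yield Request(second_url, callback=self.secondPage, meta={'item': item})


    def secondPage(self, response):
        item = response.meta['item']
        item['sessions'] = response.xpath('//div//text()').extract()    # placeholder xpath
        yield item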

Scrapy: If key exists, why do I get a KeyError?

python,list,key,scrapy,scrapy-spider

Scrapy Item class provides a dictionary-like interface for storing the extracted data. There are no default values set for item fields. To check whether the field was set or not, simply check for the field key in the item instance: if 'link' not in item: print "link has not been...

How to exclude a particular html tag(without any id) from several tags while using scrapy?

python,html,web-scraping,scrapy,scrapy-spider

I would use the starts-with() XPath function to get the div element's text that starts with "Timings": sel.xpath('.//div[starts-with(., "Timings")]/text()').extract() Note that the HTML structure of the page doesn't make it easy to distinguish locations from each other - there are no location-specific containers that you can iterate over. In this case,...

Scrapy - scraped website authentication token expires while scraping

python,authentication,scrapy,scrapy-spider

I was able to figure this out. I had to override the Request object in order to set a new authorization token into the header when the token expires. I made the token a global variable. # override Request object in order to set new authorization token into the...

Simple scrapy XML spider syntax error [closed]

python,xml,scrapy,scrapy-spider

Change the word from to something different; in Python it is a keyword.

Scrapy, only follow internal URLS but extract all links found

python,scrapy,web-crawler,scrape,scrapy-spider

You can also use the link extractor to pull all the links once you are parsing each page. The link extractor will filter the links for you. In this example the link extractor will deny links in the allowed domain so it only gets outside links. from scrapy.contrib.spiders import CrawlSpider,...
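
A sketch of that combination: CrawlSpider rules keep the crawl inside the allowed domain, while a second LinkExtractor with deny_domains collects only the outbound links. The import paths here are the newer scrapy.linkextractors/scrapy.spiders ones, and the item keys are illustrative:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class InternalCrawlSpider(CrawlSpider):
        name = 'internal_only'                       # hypothetical spider
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        # follow only internal pages
        rules = [Rule(LinkExtractor(), callback='parse_page', follow=True)]

        def parse_page(self, response):
            # but record every external link found on each page
            external = LinkExtractor(deny_domains=self.allowed_domains)
            yield {
                'url': response.url,
                'external_links': [link.url for link in external.extract_links(response)],
            }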

413 on XHR using scrapy, works fine in requests library

python,scrapy,python-requests,scrapy-spider

I resolved this by removing the cookies and not setting the "Content-Length" header manually on my yielded request. It seems like those 2 things were the extra 21 bytes on the 2nd TCP segment and caused the 413 response. Maybe the server was interpreting the "Content-Length" as the combined value...

Scraping data from multiple URL

python,web-scraping,scrapy,scrapy-spider

Allowed domains should be defined without the http://. For example: allowed_domains= ["cbfcindia.gov.in/"] If any issues persist, then please show the full log that includes details of the pages crawled and any redirects that may have occurred....

delete spiders from scrapinghub

delete,web-crawler,scrapy,scrapy-spider,scrapinghub

You just need to remove the spider from your project, and deploy the project again, via shub deploy, or scrapyd-deploy.

Scrap a huge site with scrapy never completed

scrapy,scrapy-spider

I don't know if CrawlSpider is supposed to be deterministic, but a site with millions of pages probably does not always have the same links on a given page, and it may not deliver responses in the same order. No, I would not expect a CrawlSpider to take the same path. But...

Scrapy creating XML feed wraps content in “value” tags

python,xml,scrapy,scrapy-spider

The extract() method will always return a list of values, even if there is only a single value as a result, for example: [4], [3,4,5] or None. To avoid this, if you know there is only one value, you can select it like: item['to'] = selector.xpath('//to/text()').extract()[0] Note: Be aware that...

Scrapy not giving individual results of all the reviews of a phone?

python,xpath,web-scraping,scrapy,scrapy-spider

First of all, your XPath expressions are very fragile in general. The main problem with your approach is that your item does not correspond to a review section, but it should. In other words, you are not iterating over review blocks on a page. Also, the model name should be extracted outside...

Scrapy: Spider optimization

python,web-scraping,scrapy,scrapy-spider

Which is the best way to make the spider follow the pagination of an url? This is very site-specific and depends on how the pagination is implemented. If the pagination is JQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination...

Scrapy returning zero results

python,scrapy,scrapy-spider

You need to indent the parse block. import scrapy class StockSpider(scrapy.Spider): name = "ugaz" allowed_domains = ["nasdaq.com"] start_urls = ["http://www.nasdaq.com/symbol/ugaz/"] # Indent this block def parse(self, response): stock = StockItem() stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract() stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract() return stock ...

Scrapy python error - Missing scheme in request URL

python,web-crawler,scrapy,scrapy-spider

You need to add a scheme to the URL: ftp://ftp.site.co.uk The FTP URL syntax is defined as: ftp://[<user>[:<password>]@]<host>[:<port>]/<url-path> Basically, you do this: yield Request('ftp://ftp.site.co.uk/feed.xml', ...) Read more about schemes at Wikipedia: http://en.wikipedia.org/wiki/URI_scheme...

Scrapy `ReactorNotRestartable`: one class to run two (or more) spiders

scrapy,twisted,scrapy-spider

It was possible to run multiple spiders within one reactor by keeping the reactor open until all the spiders have stopped running. This was achieved by keeping a list of all the running spiders and not executing reactor.stop() until this list is empty: import sys import os from scrapy.utils.project import...

Scrapy: ERROR: Error processing

python,scrapy,scrapy-spider

Take a closer look at the traceback; there is the line: File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item if word in unicode(item['description']).lower(): This means that your pipeline is throwing the error while trying to process an item. Then, see what fields you fill in the spider: for site in sites:...

Passing Argument to Scrapy Spider from Python Script

python,web-scraping,arguments,scrapy,scrapy-spider

You need to pass your parameters to the crawl method of the CrawlerProcess, so you need to run it like this: crawler = CrawlerProcess(Settings()) crawler.crawl(BBSpider, start_url=url) crawler.start() ...

scrapy crawling at depth not working

python,scrapy,scrapy-spider

There is a Warning box in the CrawlSpider documentation. It says: When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work. Your code does...

Scrapy Limit Requests For Testing

python,python-2.7,web-scraping,scrapy,scrapy-spider

You are looking for the CLOSESPIDER_PAGECOUNT setting of the CloseSpider extension: An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or not set), spiders won't be closed by number of...
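
For example, to stop a test crawl after roughly 10 responses (the number here is arbitrary), you could put this in settings.py:

    # settings.py - stop the spider after ~10 crawled responses
    CLOSESPIDER_PAGECOUNT = 10

The same limit can also be passed ad hoc on the command line with scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=10.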

While scraping getting error instance method has no attribute '__getitem__'

python,web-scraping,scrapy,web-crawler,scrapy-spider

You need to call extract() and then get the first item: item['main_title'] = site.xpath('.//li[@id="nav_cat_0"]/text()').extract()[0] # HERE ^ If you want to have a separate category for each item, iterate over them: for title in site.xpath('.//li[starts-with(@id, "nav_cat_")]/text()').extract(): item = AmazonItem() item['main_title'] = title items.append(item) ...

Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script

python,scrapy,scrapy-spider

In order for get_project_settings() to find the desired settings.py, set the SCRAPY_SETTINGS_MODULE environment variable: import os import sys # ... sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject")) os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings' settings = get_project_settings() Note that, due to the location of your runner script, you need to add myproject to the sys.path. Or, move scrapyScript.py under myproject...

Distinguishing between HTML and non-HTML pages in Scrapy

python,html,web-crawler,scrapy,scrapy-spider

Nevermind, I found the answer. type() only gives information on the immediate type. It tells nothing of inheritance. I was looking for isinstance(). This code works: if isinstance(response, TextResponse): links = response.xpath("//a/@href").extract() ... http://stackoverflow.com/a/2225066/1455074, near the bottom...

Scrapy - generating items outside of parse callback

python,scrapy,scrapy-spider

I'm not sure if there is a way to push items directly to the engine, but what you could do is push dummy requests with the items in the meta variable and just yield them in the callback. def _file_thread_loop(self): while True: #... read files... for file in files: response...

SgmlLinkExtractor not displaying results or following link

python,web-crawler,scrapy,scrapy-spider,sgml

The problem is in the restrict_xpaths - it should point to a block where a link extractor should look for links. Don't specify allow at all: rules = [ Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), callback="parse_me", follow=True), ] And you need to fix your allowed_domains: allowed_domains = ["www.allgigs.co.uk"] Also note that the print items in...

Error in scrapy screen scraper - Cant find what's wrong for the life of me

python,python-2.7,scrapy,screen-scraping,scrapy-spider

Please show the position of items.py in the project structure. You should have something like this: craig (folder) craig.py items (folder) __init__.py items.py ...

scrapy LxmlLinkExtractor and relative urls

python,scrapy,scrapy-spider

The problem is that the HTML has an incorrect HTML base element, which is supposed to specify the base url for all the relative links in the page: <base href="http://www.lecture-en-ligne.com/"/> Scrapy is respecting that, that's why the links are being formed that way....

Scrapy: Extract links and text

python,web-scraping,scrapy,scrapy-spider

There is a simple mistake inside the xpath expressions for the item fields. The loop is already going over the a tags, you don't need to specify a in the inner xpath expressions. In other words, currently you are searching for a tags inside the a tags inside the td...

Why is XMLFeedSpider failing to iterate through the designated nodes?

python,xml,rss,scrapy,scrapy-spider

You need to handle namespaces: class PLoSSpider(XMLFeedSpider): name = "plos" namespaces = [('atom', 'http://www.w3.org/2005/Atom')] itertag = 'atom:entry' iterator = 'xml' # this is also important See also: how do I use empty namespaces in an lxml xpath query? Working example: from scrapy.contrib.spiders import XMLFeedSpider class PLoSSpider(XMLFeedSpider): name = "plos" namespaces...

How to get scrapy results orderly?

python,web-scraping,scrapy,scrapy-spider

That's because your locators match multiple elements and are not context-specific (should start with a dot), fix it: def parse(self, response): for obj in response.css("ul.search-results li"): item = YelpItem() item['name'] = obj.xpath(".//div[@class='media-story']//h3//a/text()").extract()[0] item['address'] = ''.join(obj.xpath(".//div[@class='secondary-attributes']//address/text()").extract()).strip() yield item ...

Scrapy: catch responses with specific HTTP server codes

python,web-scraping,scrapy,scrapy-spider

By default, Scrapy only handles responses with status codes 200-300. Let Scrapy handle 500 and 502: class Spider(...): handle_httpstatus_list = [500, 502] Then, in the parse() callback, check response.status: def parse(response): if response.status == 500: # logic here elif response.status == 502: # logic here ...

Spider returns only “[” in the items.json file

python,json,web-scraping,scrapy,scrapy-spider

The output file would contain only [ if there was an error while crawling or, there were no items returned. In your case, it is because of the indentation, parse_item() should be indented: class CaptchaSpider(CrawlSpider): name = "CaptchaSpider" allowed_domains = ["*****.ac.in"] start_urls = [ "https://*****.ac.in/*****.asp" ] def parse_item(self, response): item...

Web scraping error: exceptions.MemoryError

python,web-scraping,scrapy,scrapy-spider

Collect the phone details in the list and join() them after the loop: def parse(self, response): phone = gsmArenaDataItem() details = [] for tableRows in response.css("div#specs-list table"): phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0] for ttl in tableRows.xpath(".//td[@class='ttl']"): ttl_value = " ".join(ttl.xpath(".//text()").extract()) nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract()) details.append('{title}: {info}'.format(title=ttl_value.encode("utf-8"),...

Passing list as arguments in Scrapy

python,flask,scrapy,scrapy-spider

Override spider's __init__() method: class MySpider(Spider): name = 'my_spider' def __init__(self, *args, **kwargs): super(MySpider, self).__init__(*args, **kwargs) endpoints = kwargs.get('start_urls').split(',') self.start_urls = ["http://www.google.com/patents/" + x for x in endpoints] And pass the list of endpoints through the -a command line argument: scrapy crawl patents -a start_urls="US6249832,US20120095946" -o static/s.json See also: How...

Scrapy: collect retry messages

python,scrapy,scrapy-spider

You can subclass scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and override _retry() to do whatever you want with the request that is given up on. from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware from scrapy import log class CustomRetryMiddleware(RetryMiddleware): def _retry(self, request, reason, spider): retries = request.meta.get('retry_times', 0) + 1 if retries <= self.max_retry_times: log.msg(format="Retrying %(request)s (failed %(retries)d times):...
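
To actually use a subclass like that, disable the stock middleware and register your own in settings.py; the module path to CustomRetryMiddleware below is a placeholder:

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        # turn off the built-in one (scrapy.downloadermiddlewares.retry.RetryMiddleware
        # in newer Scrapy versions)
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
        # and plug in the subclass at the same priority
        'myproject.middlewares.CustomRetryMiddleware': 500,
    }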

Scrapy prints fields but doesn't populate XML file

python,xml,xpath,scrapy,scrapy-spider

You need to instantiate a CrawlerItem instance in your parse_node() method: def parse_node(self, response, selector): item = CrawlerItem() item['to'] = selector.xpath('//to/text()').extract() item['who'] = selector.xpath('//from/text()').extract() item['heading'] = selector.xpath('//heading/text()').extract() item['body'] = selector.xpath('//body/text()').extract() return item ...

Scrapy: how can I get the content of pages whose response.status=302?

web-scraping,scrapy,scrape,scrapy-spider

The HTTP status 302 means Moved Temporarily. When I do an HTTP GET request to the url http://fuyuanxincun.fang.com/xiangqing/ it shows me an HTTP 200 status. It's common that the server won't send anything after sending the 302 status code (although technically sending data after a 302 is possible). The reason why...

how to output multiple webpages crawled data into csv file using python with scrapy

python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider

from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.exceptions import CloseSpider from scrapy.http import Request from test.items import CraigslistSampleItem from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor URL = "http://example.com/subpage/%d" class MySpider(BaseSpider): name = "craig" allowed_domains = ["xyz.com"] def start_requests(self): for i in range(10): yield Request(URL % i,...

xpath preceding child issue

python,xpath,web-scraping,scrapy,scrapy-spider

Let's make it simpler. Each division is represented with a div with a mod-teams-list-medium class. Each division div consists of 2 parts: a div with class="mod-header" containing the division name, and a div with class="mod-content" containing the list of teams. Inside your spider it would be reflected this way: for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):...

Scraping iTunes Charts using Scrapy

python,web-scraping,scrapy,scrapy-spider

Before reading the technical part: make sure you are not violating the iTunes terms of use. All of the problems you have are inside the parse() callback: the main xpath is not correct (there is no ul element directly under the section); instead of response.selector you can directly use response...

Scrapy extracting from Link

python,scrapy,scrapy-spider

For rules to work, you need to use CrawlSpider not the general scrapy Spider. Also, you need to rename your first parsing function to a name other than parse. Otherwise, you will be overwriting an important method of the CrawlSpider and it will not work. See the warning in the...

Scrapy-Scraper Does Not Run

python-2.7,web-scraping,scrapy,screen-scraping,scrapy-spider

The problem is that you are putting your spider into the items.py. Instead, create a package spiders, inside it create a dmoz.py and put your spider into it. See more at Our first Spider paragraph of the tutorial....

Pass Selenium HTML string to Scrapy to add URLs to Scrapy list of URLs to scrape

python,selenium,web-scraping,scrapy,scrapy-spider

Just yield Request instances from the method and provide a callback: class AB_Spider(CrawlSpider): ... def parse_abcs(self, response): ... all_URLS = hxs.xpath('a/@href').extract() for url in all_URLS: yield Request(url, callback=self.parse_page) def parse_page(self, response): # Do the parsing here ...

Scrapy keep all unique pages based on a list of start urls

python,web-scraping,scrapy,scrapy-spider

Set the default parse callback to spin off all the links. By default Scrapy does not visit the same page twice. def parse(self, response): links = LinkExtractor().extract_links(response) return (Request(url=link.url, callback=self.parse_page) for link in links) def parse_page(self, response): # name = manipulate response.url to be a unique file name with open(name,...

Scrapy spider not including all requested pages

python,web-scraping,web-crawler,scrapy,scrapy-spider

The final answer did indeed lie in the indentation of one single yield line. This is the code that ended up doing what I needed it to do. from scrapy.spider import Spider from scrapy.selector import Selector from scrapy.http import Request import re from yelp2.items import YelpReviewItem RESTAURANTS = ['sixteen-chicago'] def...

How to extract data from dynamic websites like Flipkart using selenium and Scrapy?

python,selenium,selenium-webdriver,scrapy,scrapy-spider

I managed it differently. See my code for further reference. It is working fine for the complete site. class FlipkartSpider(BaseSpider): name = "flip1" allowed_domains = ["flipkart.com"] start_urls = [ "http://www.flipkart.com/tablets/pr?sid=tyy%2Chry&q=mobile&ref=b8b64676-065a-445c-a6a1-bc964d5ff938" ] '''def is_element_present(self, finder, selector, wait_time=None): wait_time = wait_time or self.wait_time end_time = time.time() + wait_time while time.time() < end_time: if finder(selector):...

Why doesn't this FormRequest log me in?

python,scrapy,scrapy-spider

Seems like your parameters don't match (it should be login instead of username) and you are missing some of them in your formdata. This is what Firebug shows me is delivered when trying to log in: Seems like layoutType and returnUrl can just be hardcoded in, but profillingSessionId needs to be retrieved...

Using scrapy's FormRequest no form is submitted

python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider

The main problem is in how you are passing the booking type, country, pickup and dropoff. You need to pass the corresponding "id"s instead of literal strings. The following would work in your case: return FormRequest.from_response( response, formxpath="//form[@id='transfer_search']", formdata={ 'bookingtypeid': '1', 'airportgroupid': '14', 'pickup': '121', 'dropoff': '1076', 'arrivaldate': '12-07-2015', 'arrivalhour':...

Crawl spider not crawling ~ Rule Issue

python,web-scraping,scrapy,scrapy-spider

I don't think that you need two rules; you can declare one and make it follow links and parse each page. In the rule I restrict the xpath to the last link of the list, because otherwise you could be parsing some links multiple times. I use parse_start_url as...

Using ItemLoader but adding XPath, values etc. in Scrapy

python,xpath,web-scraping,scrapy,scrapy-spider

This is because you are setting default input and output processors, which are applied to all item fields, including time, which is a float. You have multiple options: instead of default processors, use field-specific processors: l.name_in = MapCompose(lambda v: v.split(), replace_escape_chars) l.name_out = Join() convert/format the time into a string:...
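
A declarative version of the first option, with processors attached to specific fields so numeric fields like time are left untouched; the field names are illustrative, and replace_escape_chars comes from w3lib:

    from scrapy.contrib.loader import ItemLoader            # scrapy.loader in newer versions
    from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
    from w3lib.html import replace_escape_chars


    class ProductLoader(ItemLoader):
        # only the text field gets the split/clean-up treatment
        name_in = MapCompose(lambda v: v.split(), replace_escape_chars)
        name_out = Join()

        # the float field just takes the first value as-is
        time_out = TakeFirst()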

Is it possible to scrape data to two different database tables?

python,postgresql,scrapy,scrapy-spider

You need to do this in scrapy pipelines. You can create 2 item classes, one for each table. In the pipeline, you check the item type and, based on that, store it in the table you want. From your description, the code can look like this: class Item1(Item): location = Field() salary...
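
A sketch of that layout: two item classes and a pipeline that routes each one to its own table. The class and field names, and the insert helpers, are placeholders for whatever PostgreSQL layer you use:

    from scrapy import Field, Item


    class LocationItem(Item):
        location = Field()
        salary = Field()


    class CompanyItem(Item):
        company = Field()
        website = Field()


    class DatabasePipeline(object):
        def process_item(self, item, spider):
            # route each item type to its own table
            if isinstance(item, LocationItem):
                self.insert_into_locations(dict(item))   # hypothetical helper
            elif isinstance(item, CompanyItem):
                self.insert_into_companies(dict(item))   # hypothetical helper
            return item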

Error using scrapy

python,web-scraping,scrapy,scrapy-spider

The XPath configured inside restrict_xpaths should point to an element, not an attribute. Replace: //*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a/@href with: //*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a ...

Pass argument to scrapy spider within a python script

python,python-2.7,web-scraping,scrapy,scrapy-spider

You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string: from datetime import datetime class MySpider(CrawlSpider): name = 'tw' allowed_domains = ['test.com'] def __init__(self, *args, **kwargs): super(MySpider, self).__init__(*args, **kwargs) date = kwargs.get('date') if not date: raise ValueError('No date...

Why is xpath selecting only the last <li> inside the <ul>?

python,web-scraping,scrapy,scrapy-spider

The problem is that the menu is constructed dynamically with the help of the browser executing javascript. Scrapy is not a browser and doesn't have a javascript engine built-in. Fortunately, there is a script tag containing a javascript array of menu objects. We can locate the desired script tag, extract...
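
A minimal sketch of that idea, assuming a hypothetical var menuItems = [...] assignment inside one of the page's script tags; the variable name, XPath and dict keys are illustrative, not taken from the original answer:

    import json
    import re

    import scrapy


    class MenuSpider(scrapy.Spider):
        name = 'menu'                          # hypothetical spider
        start_urls = ['http://example.com/']

        def parse(self, response):
            # grab the script tag that defines the (hypothetical) menu array
            script = response.xpath('//script[contains(., "menuItems")]/text()').extract()[0]
            # pull the JSON-looking array out of the javascript source
            match = re.search(r'menuItems\s*=\s*(\[.*?\]);', script, re.DOTALL)
            if match:
                for entry in json.loads(match.group(1)):
                    yield {'name': entry.get('name'), 'url': entry.get('url')}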

Scrapy CrawlSpider not following links

python,web-scraping,web-crawler,scrapy,scrapy-spider

The key problem with your code is that you have not set the rules for the CrawlSpider. Other improvements I would suggest: there is no need to instantiate HtmlXPathSelector, you can use response directly; select() is deprecated now, use xpath(); get the text() of the title element in order to...

How to extract all the source code under <table> and export as html?

python,html,scrapy,scrapy-spider

If I understand you correctly and you need to extract only the original tables' html from the page, then the solution is very simple: def parse(self, response): # XPath query to get all tables from response tables_selectors = response.xpath('//table') tables_html = tables_selectors.extract() ... tables_html is an array of strings of the original tables' html. Process...

scrapy itemloaders return list of items

scrapy,scrapy-spider

You need to instantiate a new ItemLoader in the loop, providing an item argument: l = MytemsLoader() l.add_value('main1', some xpath) l.add_value('main2', some xpath) l.add_value('main3', some xpath) item = l.load_item() rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]") for row in rows: l = MytemsLoader(item=item) l.add_value('table1', some xpath based on rows) l.add_value('table2', some xpath based...

Using arguments in scrapy pipeline on __init__

python,web-scraping,arguments,scrapy,scrapy-spider

Set the arguments inside the spider's constructor: class MySpider(CrawlSpider): def __init__(self, user_id='', *args, **kwargs): self.user_id = user_id super(MySpider, self).__init__(*args, **kwargs) And read them in the open_spider() method of your pipeline: def open_spider(self, spider): print spider.user_id ...

How to identify a request's crucial information that needs to be sent?

web-scraping,scrapy,session-cookies,scrapy-spider

You do not even need to send the cookies with your request. The problem is with body=urllib.urlencode(payload). This encodes the body in URL format; however, if you look at the body of the request in your browser, you will see that the body is JSON. So the solution is to...
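
In other words, serialize the payload as JSON rather than URL-encoding it. In this sketch the payload, url and callback are placeholders for whatever your spider already uses:

    import json

    import scrapy


    # inside the spider class; payload and url are the ones you already have
    def start_requests(self):
        payload = {'example_key': 'example_value'}        # placeholder payload
        yield scrapy.Request(
            'http://example.com/api',                      # placeholder url
            method='POST',
            body=json.dumps(payload),                      # JSON body, not urlencode()
            headers={'Content-Type': 'application/json; charset=UTF-8',
                     'X-Requested-With': 'XMLHttpRequest'},
            callback=self.parse_api,
        )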

Scrapy crawl and follow links within href

python,web-scraping,scrapy,scrapy-spider

From what I see, I can say that: URLs to product categories always end with .kat; URLs to products contain id_ followed by a set of digits. Let's use this information to define our spider rules: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor class CodeCheckspider(CrawlSpider): name = "code_check"...

How can I start to write Unit test in web Scrapy using python?

python,unit-testing,web-scraping,scrapy,scrapy-spider

If we are talking specifically about how to test the spiders (not pipelines, or loaders), then what we did is provide a "fake response" from a local HTML file. Sample code: def fake_response(file_name=None, url=None): """Create a Scrapy fake HTTP response from a HTML file""" if not url: url = 'http://www.example.com'...
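
A fuller version of that helper plus a typical test, under the assumption that the HTML fixture sits next to the test module; the spider import path and file names are illustrative:

    import os
    import unittest

    from scrapy.http import HtmlResponse, Request

    from myproject.spiders.sample import SampleSpider    # hypothetical spider


    def fake_response(file_name, url='http://www.example.com'):
        """Build a Scrapy HtmlResponse from a local HTML fixture file."""
        path = os.path.join(os.path.dirname(__file__), file_name)
        with open(path, 'rb') as f:
            body = f.read()
        return HtmlResponse(url=url, request=Request(url=url),
                            body=body, encoding='utf-8')


    class SampleSpiderTest(unittest.TestCase):
        def test_parse(self):
            spider = SampleSpider()
            items = list(spider.parse(fake_response('fixtures/listing.html')))
            self.assertTrue(items)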

Is there a way using scrapy to export each item that is scraped into a separate json file?

web-scraping,scrapy,scrapy-spider

You can use a scrapy pipeline and from there you can insert each item into separate files. I have set a counter in my spider so that it increments on each item yield, and added that value to the item. Using that counter value I'm creating file names. Test_spider.py class TestSpider(Spider): # spider...
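
A minimal pipeline along those lines, keeping its own counter and dumping every item into its own file; the file naming scheme is just an example:

    import json


    class JsonPerItemPipeline(object):
        """Write each scraped item to a separate JSON file."""

        def open_spider(self, spider):
            self.counter = 0

        def process_item(self, item, spider):
            self.counter += 1
            with open('item_{0}.json'.format(self.counter), 'w') as f:
                json.dump(dict(item), f)
            return item

Enable it through ITEM_PIPELINES in settings.py.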

Regular expression for Scrapy rules

python,regex,scrapy-spider

For the 1-100 range, you can use r"com/vessels\?page=(?:[1-9][0-9]?|100)\b" See demo In case you need any number, just use \d+: r"com/vessels\?page=\d+" See demo 2...

scrapy from script output in json

python,json,web-scraping,scrapy,scrapy-spider

You need to set the FEED_FORMAT and FEED_URI settings manually: settings.overrides['FEED_FORMAT'] = 'json' settings.overrides['FEED_URI'] = 'result.json' If you want to get the results into a variable, you can define a Pipeline class that would collect items into a list. Use the spider_closed signal handler to see the results: import json from...

How to get the scrapy form submission working

python,forms,web-scraping,scrapy,scrapy-spider

Your problem is that FormRequest.from_response() uses a different form - a "search form". But you wanted it to use a "log in form" instead. Provide a formnumber argument: yield FormRequest.from_response(response, formnumber=1, formdata=formdata, clickdata={'name': 'commit'}, callback=self.parse1) Here is what I see opened in the browser after applying the change (used "fake"...

CrawlSpider derived object has no attribute 'state'

python,web-scraping,scrapy,scrapy-spider

According to the source code, the state attribute is set on the spider by the scrapy.contrib.spiderstate.SpiderState extension in the spider_opened() signal handler: class SpiderState(object): """Store and load spider state during a scraping job""" ... def spider_closed(self, spider): if self.jobdir: with open(self.statefn, 'wb') as f: pickle.dump(spider.state, f, protocol=2) def spider_opened(self, spider): if self.jobdir...

Scrapy collect data from first element and post's title

python,web-scraping,web-crawler,scrapy,scrapy-spider

For the posting title, get all the text nodes from the span tag and join them: $ scrapy shell http://denver.craigslist.org/bik/5042090428.html In [1]: "".join(response.xpath("//span[@class='postingtitletext']//text()").extract()) Out[1]: u'Tonka double shock boys bike - $10 (Denver)' Note that the "Scrapy-way" to do this would be to use an ItemLoader and the Join() processor. Second is...

Rename output file after scrapy spider complete

python,scrapy,scrapy-spider,scrapyd

I got a working solution after trying different approaches. In my particular case I dump the output into files, specifically bz2 files, so I customized a FileFeedStorage to do the job before opening and after closing the file. See the code below: from scrapy.contrib.feedexport import FileFeedStorage import os import bz2 MB...

How to use regex in Portia visual scrapy?

python-2.7,web-crawler,scrapy-spider,portia

You need to use capture groups to extract the data, so in this case: Location: (.*) This tells Portia to extract all data following the Location: string. If, for example, you only wanted to extract the data between Location: and the comma, you could use the following: Location:...

Cannot download image with relative URL Python Scrapy

python,web-crawler,scrapy,scrapy-spider

Wrap your image url in a list, like so: item['image_urls'] = [self.page_name + imageStr[3:-2]] ...

Scrapy only crawling given page

python,web-scraping,scrapy,scrapy-spider

You need to make a couple of changes to make it work: inherit from CrawlSpider instead of Spider; provide a callback different from parse(). Here's the code of a spider that follows every link: class LoomSpider(CrawlSpider): name = "loom" allowed_domains = ["2loom.com"] start_urls = [ "http://2loom.com", ] rules = [Rule(SgmlLinkExtractor(), callback='parse_page',...