
Scrapy: how can I get the content of pages whose response.status=302?

Tag: web-scraping,scrapy,scrape,scrapy-spider

I get the following log when crawling:

DEBUG: Crawled (302) <GET http://fuyuanxincun.fang.com/xiangqing/> (referer: http://esf.hz.fang.com/housing/151__1_0_0_0_2_0_0/)
DEBUG: Scraped from <302 http://fuyuanxincun.fang.com/xiangqing/>

But it actually returns nothing. How can I deal with these responses with status=302?

Any help would be much appreciated!

Best How To:

The HTTP status 302 means Moved Temporarily. When I do an HTTP GET request to the URL http://fuyuanxincun.fang.com/xiangqing/ it shows me an HTTP 200 status. It is common that the server won't send anything after sending the 302 status code (although technically sending data after a 302 is possible).

The reason why you get an HTTP 302 status can be one of the following:

  1. The website does not serve its content when a specific referer (like: http://esf.hz.fang.com/housing/151__1_0_0_0_2_0_0/) is present.
  2. You didn't send the HTTP headers the server wants to see, for example a certain User-Agent. The website can decide to reject requests without a specific header by sending an HTTP 302 status instead of an HTTP 200 status.
  3. The specific IP address you send the request from has been blocked by the website you are trying to scrape.

I would recommend that you:

  1. Make the request look like a "real" browser request (send similar headers).
  2. Try to send the request from another IP address.
  3. Try to send the request with a (randomized) User-Agent.
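
If you are crawling with Scrapy, a minimal sketch of points 1 and 3 might look like the following (the header values, User-Agent strings, and spider name are illustrative, not taken from your project):

import random
import scrapy

# Small pool of User-Agent strings to rotate through (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17",
]

class XiangqingSpider(scrapy.Spider):
    name = "xiangqing"

    def start_requests(self):
        headers = {
            # Make the request look more like a "real" browser request.
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
        }
        yield scrapy.Request("http://fuyuanxincun.fang.com/xiangqing/",
                             headers=headers, callback=self.parse)

    def parse(self, response):
        self.log("Got %s for %s" % (response.status, response.url))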

I made the request at 07:30:29 UTC on Wednesday, 13 May 2015; the behavior of the website could have changed between your request and mine.

It can also be helpful to post the full raw HTTP request and response.

Python Web Scraping title in a special div & Page 1 + 15

python,css,xpath,web-scraping,request

Here are the things I would fix/improve: the code is not properly indented; you need to move the HTML-parsing code into the loop body; a URL like whisky.de/shop/Aktuell/1 for page number 1 would not work, so instead don't specify the page number (whisky.de/shop/Aktuell/); to get the prices and titles I would...

iMacros TAG to Find TXT and Click Nearby (previous) Link

javascript,dom,web-scraping,scrape,imacros

Relative positioning doesn't handle this case well, so: SET !EXTRACT_TEST_POPUP NO TAG POS=1 TYPE=SPAN ATTR=TXT:*Banana* EXTRACT=HTM SET FP EVAL("parseInt('{{!EXTRACT}}'.match(/check-num-(\\d)/)[1]) + 1;") TAG POS={{FP}} TYPE=A ATTR=TXT:* ...

How scrapy write in log while running spider?

python,scrapy,scrapyd,portia

Let me try to explain based on the Scrapy Sample Code shown on the Scrapy Website. I saved this in a file scrapy_example.py. from scrapy import Spider, Item, Field class Post(Item): title = Field() class BlogSpider(Spider): name, start_urls = 'blogspider', ['http://blog.scrapinghub.com'] def parse(self, response): return [Post(title=e.extract()) for e in response.css("h2...

Loop through downloading files using selenium in Python

python,selenium,selenium-webdriver,web-scraping,python-3.4

You need to pass the filename into the XPath expression: filename = driver.find_element_by_xpath('//a[contains(text(), "{filename}")]'.format(filename=f)) Though, an easier location technique here would be "by partial link text": for f in fname: filename = driver.find_element_by_partial_link_text(f) filename.click() ...

Scrapy Memory Error (too many requests) Python 2.7

python,django,python-2.7,memory,scrapy

You can process your URLs in batches by only queueing up a few at a time whenever the spider idles. This avoids having a lot of requests queued up in memory. The example below only reads the next batch of urls from your database/file and queues them as requests only...
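
A rough sketch of that pattern, assuming a hypothetical helper get_next_batch() that reads the next chunk of URLs from your database or file:

from scrapy import Spider, Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher

class BatchSpider(Spider):
    name = "batch_example"  # placeholder name

    def __init__(self, *args, **kwargs):
        super(BatchSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.queue_next_batch, signals.spider_idle)

    def queue_next_batch(self, spider):
        # Only queue a small batch each time the spider runs out of requests.
        urls = get_next_batch(size=100)  # hypothetical helper reading a DB/file
        if not urls:
            return  # nothing left, let the spider close normally
        for url in urls:
            self.crawler.engine.crawl(Request(url, callback=self.parse), spider)
        raise DontCloseSpider  # keep the spider alive for the next batch

    def parse(self, response):
        pass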

get div attribute val and div text body

python,web-scraping,beautifulsoup

Not all divs have the smturl attribute, so you need to add the attribute to the find call. Also, not all elements in productDivs contain the divs you are looking for, hence I've added a test for whether find returns None. In [27]: for div in productDivs: ....: if div.find('div', {"class":"vthumb", 'smturl': True}) is not...

xpath: how to select items between item A and item B

xpath,scrapy

Assuming that <big><b>Staff in:</b></big> is a unique element that we can use as an 'anchor', you can try this: //big[b='Staff in:']/following-sibling::a[preceding-sibling::big[1][b='Staff in:']] Basically, the XPath finds all <a> elements that are following siblings of the 'anchor' <big> element mentioned above, and restricts the result to those having the nearest preceding sibling...

getting specific images from page

python,html,web-scraping,beautifulsoup,html-parsing

Bing is using some techniques to block automated scrapers. I tried to print div.find('img') and found that they are sending the source in an attribute named src2, so the following should work: div.find('img')['src2'] This is working for me. Hope it helps....

How to get javascript output in python BeautifulSoup or any other module

javascript,python,html,web-scraping,beautifulsoup

BeautifulSoup just can't execute JavaScript code. I suggest integrating something like PhantomJS into your scraper. If you can drop Python, you can write the whole scraper in PhantomJS.
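
A minimal sketch of that integration from Python, assuming PhantomJS is installed and on your PATH (the URL is a placeholder):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()           # headless browser that does execute JavaScript
driver.get("http://www.example.com/")    # placeholder URL
html = driver.page_source                # HTML after the page's scripts have run
driver.quit()

soup = BeautifulSoup(html)               # parse the rendered markup as usual
print(soup.title)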

Scraping dynamic data with imacro to excell

web-scraping,imacros

Here is a more detailed answer with code. TAG POS=1 TYPE=TD ATTR=CLASS:"score" EXTRACT=TXT SET !EXTRACT EVAL("'{{!EXTRACT}}'.match(/\\d+/)[0];") TAG POS=1 TYPE=TD ATTR=CLASS:"score part" EXTRACT=TXT SAVEAS TYPE=EXTRACT FOLDER=* FILE=scores.csv TAG POS=2 TYPE=TD ATTR=CLASS:"score" EXTRACT=TXT SET !EXTRACT EVAL("'{{!EXTRACT}}'.match(/\\d+/)[0];") TAG POS=2 TYPE=TD ATTR=CLASS:"score part" EXTRACT=TXT SAVEAS TYPE=EXTRACT FOLDER=* FILE=scores.csv WAIT SECONDS=4 Play this macro...

Stuck scraping a specific table with scrapy

python,xpath,scrapy

When you say name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0] the XPath expression starts with td and so is relative to the context node that you have in the variable sel (i.e. the tr element in the set of tr elements that the for loop iterates over). However when you say name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]...
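
To illustrate the difference, a small sketch inside a Scrapy callback (the table structure is hypothetical):

for sel in response.xpath("//table//tr"):
    # Relative expression: evaluated against the current tr, so each
    # iteration returns that row's own cell.
    per_row = sel.xpath('td[@class="tipst"]/a/text()').extract()
    # Absolute expression: starts at the document root, so every iteration
    # returns the first matching cell of the whole page, not the current row's.
    always_first = sel.xpath('//td[@class="tipst"]/a/text()').extract()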

Scrapy running from python script processes only start url

python,python-2.7,scrapy

When you override the default parse_start_url for a crawl spider, the method has to yield Requests for the spider to follow, otherwise it can't go anywhere. You are not required to implement this method when subclassing CrawlSpider, and from the rest of your code, it looks like you really don't...

How to reset standard dupefilter in scrapy

scrapy

You can access the current dupefilter object used by the spider via self.crawler.engine.slot.scheduler.df. from scrapy import signals, Spider from scrapy.xlib.pydispatch import dispatcher class ExampleSpider(Spider): name = "example" start_urls = ['http://www.example.com/'] def __init__(self, *args, **kwargs): super(ExampleSpider, self).__init__(*args, **kwargs) dispatcher.connect(self.reset_dupefilter, signals.spider_idle) def reset_dupefilter(self, spider): # clear stored fingerprints by the dupefilter when...

Return html code of dynamic page using selenium

python,python-2.7,selenium,selenium-webdriver,web-scraping

You need to explicitly wait for the search results to appear before getting the page source: from selenium import webdriver from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC wd = webdriver.Firefox() wd.get("https://www.leforem.be/particuliers/offres-emploi-recherche-par-criteres.html?exParfullText=&exPar_search_=true& exParGeographyEdi=true") wd.switch_to.frame("cible") wait = WebDriverWait(wd, 10)...
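
The general pattern, with a placeholder URL and locator instead of the ones used for that site, looks roughly like this:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

wd = webdriver.Firefox()
wd.get("http://www.example.com/")  # placeholder URL
wait = WebDriverWait(wd, 10)
# Block for up to 10 seconds until at least one result row shows up.
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.results tr")))
html = wd.page_source  # now contains the dynamically loaded results
wd.quit()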

Scrapy extracting from Link

python,scrapy,scrapy-spider

For rules to work, you need to use CrawlSpider not the general scrapy Spider. Also, you need to rename your first parsing function to a name other than parse. Otherwise, you will be overwriting an important method of the CrawlSpider and it will not work. See the warning in the...
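
A bare-bones sketch of the required shape (names, URLs, and the link pattern are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):               # CrawlSpider, not the general Spider
    name = "example"
    start_urls = ["http://www.example.com/"]
    rules = (
        Rule(LinkExtractor(allow=r"/item/"), callback="parse_item", follow=True),
    )

    # The callback must NOT be named "parse"; that method is used internally
    # by CrawlSpider to drive the rules.
    def parse_item(self, response):
        yield {"url": response.url}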

Scrapy Limit Requests For Testing

python,python-2.7,web-scraping,scrapy,scrapy-spider

You are looking for the CLOSESPIDER_PAGECOUNT setting of the CloseSpider extension: An integer which specifies the maximum number of responses to crawl. If the spider crawls more than that, the spider will be closed with the reason closespider_pagecount. If zero (or not set), spiders won’t be closed by number of...
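
For example, to stop after roughly 10 responses you could pass the setting on the command line (the spider name is a placeholder):

scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=10

or set it in your project's settings.py:

CLOSESPIDER_PAGECOUNT = 10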

Can't get value from xpath python

python,html,xpath,web-scraping,html-parsing

The values in the table are generated with the help of javascript being executed in the browser. One option to approach it is to automate a browser via selenium, e.g. a headless PhantomJS: >>> from selenium import webdriver >>> >>> driver = webdriver.PhantomJS() >>> driver.get("http://www.tabele-kalorii.pl/kalorie,Actimel-cytryna-miod-Danone.html") >>> >>> table = driver.find_element_by_xpath(u"//table[tbody/tr/td/h3...

Web scraping error: exceptions.MemoryError

python,web-scraping,scrapy,scrapy-spider

Collect the phone details in the list and join() them after the loop: def parse(self, response): phone = gsmArenaDataItem() details = [] for tableRows in response.css("div#specs-list table"): phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0] for ttl in tableRows.xpath(".//td[@class='ttl']"): ttl_value = " ".join(ttl.xpath(".//text()").extract()) nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract()) details.append('{title}: {info}'.format(title=ttl_value.encode("utf-8"),...

How can I use beautiful soup to get the current price of a stock on Google Finance?

python,web-scraping

find() will allow you to find a tag within the HTML DOM. For example, if you want the title of the website you can do something like, bs.find("title") and it will return the first instance of title. (Like: <title>Some title here</title>) You can also filter tags with certain attributes. A...
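
For instance, filtering by an attribute with find() could look like this (the span class below is a guess at Google Finance's markup, not something verified here):

from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("https://www.google.com/finance?q=NASDAQ%3AGOOG"))
# find() with an attrs dict returns the first matching tag, or None if absent.
price = soup.find("span", attrs={"class": "pr"})  # guessed class name
if price is not None:
    print(price.get_text(strip=True))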

Parse an HTML table with Nokogiri in Ruby

html,ruby,web-scraping,nokogiri

Here is one approach I tried. But yes, you can take it further from here to meet the need you have : require 'nokogiri' require 'pp' doc = Nokogiri::HTML.parse(File.read("#{__dir__}/out1.html")) data = doc.css('.TTdata, .TTdata_lgrey').map do |tr| %i(position year name).zip(tr.css("td:nth-child(-n+3)").map(&:text)).to_h end pp data output [{:position=>"1.", :year=>"2015", :name=>"Yasmani Grandal"}, {:position=>"3.", :year=>"2015", :name=>"Francisco Cervelli"},...

How to identify an element via XPath when IDs keep changing

html,xml,xpath,selenium-webdriver,web-scraping

You can select a table cell whose contents equals a string such as "Item Pricing" via the following XPath: //td[. = 'Item Pricing'] ...
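
With selenium-webdriver, a usage sketch might be (the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.example.com/")  # placeholder URL
# Locate the cell by its text content instead of an id that keeps changing.
cell = driver.find_element_by_xpath("//td[. = 'Item Pricing']")
print(cell.text)
driver.quit()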

How to access response Body after simulating a POST request in Node.js?

node.js,web-scraping,http-post,reddit

In short, when you click the YES button, the form sends the over18=yes parameter to the url http://www.reddit.com/over18?dest=http%3A%2F%2Fwww.reddit.com%2Fr%2Fnsfw using the POST method. Then the server responds with a 302 redirection header and a cookie with the value over18=1, and finally redirects to the url http://www.reddit.com/r/nsfw using a GET request. Then the server just checks if you have a cookie with the needed...

Extracting links with scrapy that have a specific css class

python,web-scraping,scrapy,screen-scraping,scrapy-spider

From what I understand, you want something similar to restrict_xpaths, but provide a CSS selector instead of an XPath expression. This is actually a built-in feature in Scrapy 1.0 (currently in a release candidate state), the argument is called restrict_css: restrict_css a CSS selector (or list of selectors) which defines...
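
For example, in a scrapy shell session (the CSS class is a placeholder, and response is the shell's fetched response), usage would look roughly like this, assuming Scrapy 1.0 or later:

from scrapy.linkextractors import LinkExtractor

# Only links found inside elements matching the CSS selector are extracted.
links = LinkExtractor(restrict_css="div.listing").extract_links(response)
for link in links:
    print(link.url)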

BeautifulSoup is not getting all data, only some

python,html,web-scraping,beautifulsoup,html-parsing

Instead of getting the .string, get the text of the posting body (worked for me): item_name.get_text(strip=True) As a side note, your script has a blocking "nature", you may speed things up dramatically by switching to Scrapy web-scraping framework. ...

Scrapy not giving individual results of all the reviews of a phone?

python,xpath,web-scraping,scrapy,scrapy-spider

First of all, your XPath expressions are very fragile in general. The main problem with your approach is that your code does not treat each review section as a unit, but it should. In other words, you are not iterating over the review blocks on a page. Also, the model name should be extracted outside...

Scraping with BeautifulSoup: want to scrape entire column including header and title rows

python,web-scraping,beautifulsoup

How's this? I added th.getText() and created a list on the desired columns which pulled the column name, and then added row_name = row.findNext('th').getText() to get the row. from bs4 import BeautifulSoup from urllib import request page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read() soup = BeautifulSoup(page) desired_table = soup.findAll('table')[2] # Find the columns you...

Scrapy redirects to homepage for some urls

scrapy,scrapy-shell

When you have trouble replicating browser behavior using scrapy, you generally want to look at what is being communicated differently when your browser talks to the website compared with when your spider does. Remember that a website is (almost always) not...

Error fetching data from website

objective-c,osx,web-scraping,request

The request takes time, and at your log statement the data has not arrived yet. Put a log of responseData in the else clause; that is when the data is available and you will see it. You do not need (or want) the __block declaration. [[NSData alloc]initWithData:data] is unnecessary,...

Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script

python,scrapy,scrapy-spider

In order for get_project_settings() to find the desired settings.py, set the SCRAPY_SETTINGS_MODULE environment variable: import os import sys # ... sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject")) os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings' settings = get_project_settings() Note that, due to the location of your runner script, you need to add myproject to the sys.path. Or, move scrapyScript.py under myproject...

Python Beautiful Soup Web Scraping Specific Numbers

python,html,web-scraping,beautifulsoup,html-parsing

There are just two elements with class="finalScore", the first is the score of the home team, the second is the score of the away team: >>> from urllib import urlopen >>> from bs4 import BeautifulSoup >>> >>> favPrevGMInfoUrl = 'http://www.cbssports.com/nfl/gametracker/boxscore/[email protected]' >>> >>> favPrevGMInfoSoup = BeautifulSoup(urlopen(favPrevGMInfoUrl)) >>> score = [item.get_text() for...

Crawl spider not crawling ~ Rule Issue

python,web-scraping,scrapy,scrapy-spider

I don't think you need two rules; you can declare one and set it to follow links and parse each page. In the rule I restrict the XPath to the last link of the list, because otherwise you could be parsing some links multiple times. I use parse_start_url as...

Xpath text() wrong output

python,xpath,web-scraping,scrapy

Wow, I feel stupid. The answer just came to me. I forgot to call .extract() on the temp variable in the parse function of my spider class.

Save image from url to special folder

python,web-scraping,beautifulsoup

def get_img(html): soup = BeautifulSoup(html) img_box = [] imgs = soup.find_all('div', class_= 'pthumb') for img in imgs: img_box.append(get_domain(BASE_URL) + img.img['src']) my_path = '/home/<username>/Desktop' # use whatever path you like for img in img_box: urllib.request.urlretrieve(img, os.path.join(my_path, os.path.basename(img))) ...

How to parse Selenium driver elements?

python,parsing,selenium,selenium-webdriver,web-scraping

find_elements_by_css_selector() would return you a list of WebElement instances. Each web element has a number of methods and attributes available. For example, to get the inner text of an element, use .text: for element in driver.find_elements_by_css_selector("div.flightbox"): print(element.text) You can also make a context-specific search to find other elements inside the...

VBA skipping code directly after submitting form in IE

vba,internet-explorer,excel-vba,web-scraping

I solved this by using a completely different method. I used a query table with strings to go where I wanted. Sub ExtractTableData() Dim This_input As String Const prefix As String = "Beginning of url" Const postfix As String = "end of url" Dim qt As QueryTable Dim ws As...

Python Beautiful Soup Table Data Scraping Specific TD Tags

python,table,web-scraping,beautifulsoup,html-table

Following should work as well - import pickle import math import urllib2 from lxml import etree from bs4 import BeautifulSoup from urllib import urlopen year = '2014' lastWeek = '2' favQB1 = "Tom Brady" favQBurl2 = 'http://www.nfl.com/player/tombrady/2504211/gamelogs' favQBhtml2 = urlopen(favQBurl2).read() favQBsoup2 = BeautifulSoup(favQBhtml2) favQBpass2 = favQBsoup2.find_all("table", { "summary" : "Game...

Scrapy xpath construction for tables of data - yielding empty brackets

html,xpath,scrapy

The thing is that what you see in your browser is what is left after JavaScript has formatted things, presumably Angular. If you run the HTML source through an HTML beautifier and search for <span class="item_name"> you'll see a pattern like this, repeating blocks of <div class="menu_item" data-category-id="1" data-category-name="Indica" data-json="{}" id="menu_item_5390083" style="position:...

Grabbing text data from Baseball-reference Python

python,web-scraping,html-parsing

If you were to solve it with BeautifulSoup, you would find the b tag by text Throws: and get the following sibling: >>> from urllib2 import urlopen >>> from bs4 import BeautifulSoup >>> >>> url = "http://www.baseball-reference.com/players/split.cgi?id=aardsda01&year=2015&t=p" >>> soup = BeautifulSoup(urlopen(url)) >>> soup.find("b", text='Throws:').next_sibling.strip() u'Right' ...

Distinguishing between HTML and non-HTML pages in Scrapy

python,html,web-crawler,scrapy,scrapy-spider

Nevermind, I found the answer. type() only gives information on the immediate type. It tells nothing of inheritance. I was looking for isinstance(). This code works: if isinstance(response, TextResponse): links = response.xpath("//a/@href").extract() ... http://stackoverflow.com/a/2225066/1455074, near the bottom...

Scrapy not entering parse method

python,selenium,web-scraping,web-crawler,scrapy

Your parse(self, response): method is not part of the jobSpider class. If you look at the Scrapy documentation you'll see that the parse method needs to be a method of your spider class. from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC from selenium import...

Iterate over all links/sub-links with Scrapy run from script

python,windows,python-2.7,web-scraping,scrapy

Your spider is being blocked from visiting pages after the start page by your allowed_domains specification. The value should include just the domain, not the protocol. Try allowed_domains = ["www.WebStore.com"] Also the line desc_out = Join() in your WebStoreItemLoader definition may give an error as you have no desc field....

scrapy xpath not returning desired results. Any idea?

html,xpath,scrapy

It seems like your XPath has a problem; check out the demo from the scrapy shell. In [1]: response.xpath('//tr[td[@class="mainheaderq" and contains(font/text(), "ANSWER")]]/following-sibling::tr/td[@class="griditemq"]//text()').extract() Out[1]: [u'\r\n\r\n', u'MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY (SHRI PIYUSH GOYAL)\r\n\r\n ', u'(a) & (b): So far 29 coal mines have been auctioned under the provisions...