
Python: Scrapy start_urls list able to handle .format()?

python,function,while-loop,web-crawler,scrapy

I would use a for loop, like this: class MySpider(BaseSpider): stock = ["SCMP", "APPL", "GOOG"] name = "dozen" allowed_domains = ["yahoo.com"] def stock_list(stock): start_urls = [] for i in stock: start_urls.append("http://finance.yahoo.com/q/is?s={}".format(i)) return start_urls start_urls = stock_list(stock) Then assign the function call as I have at the bottom. UPDATE Using Scrapy...
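
A minimal, self-contained sketch of the same idea on the current Scrapy API (scrapy.Spider with a list comprehension); the ticker symbols and URL pattern come from the snippet above, and the parse body is just a placeholder:

    import scrapy

    class DozenSpider(scrapy.Spider):
        name = "dozen"
        allowed_domains = ["yahoo.com"]
        stock = ["SCMP", "APPL", "GOOG"]
        # build start_urls up front instead of calling a helper function
        start_urls = ["http://finance.yahoo.com/q/is?s={}".format(s) for s in stock]

        def parse(self, response):
            self.logger.info("visited %s", response.url)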

How to check whether the html has changed?

javascript,html,web-scraping,firefox-addon,web-crawler

Basically you might set up Google Spreadsheet to scrape parts of pages through the IMPORTXML function (here with an example) using xpath. Then you set up notifications in the spreadsheet: Tools -> Notification Rules. Now each time the scraping function (IMPORTXML) gets content that is different from the previous one, the spreadsheet should trigger...

Scraping Multi level data using Scrapy, optimum way

python,selenium,data-structures,web-crawler,scrapy

The problem with the code above was a mutable object (list, dict): all the callbacks were changing that same object in each loop, hence ...the first and second level of data was being overwritten in the last, third loop (mp3_son_url) ...(this was my failed attempt) the solution was to...

Web Crawler - TooManyRedirects: Exceeded 30 redirects. (python)

python,web-crawler

The url to that forum has changed. Two modifications for your code: 1. Changed the forum url ("https://www.thenewboston.com/forum/recent_activity.php?page=" + str(page)) 2. allow_redirects=False (to disable redirects, if any). import requests from bs4 import BeautifulSoup def trade_spider(max_pages): page = 1 while page <= max_pages: url = "https://www.thenewboston.com/forum/recent_activity.php?page=" + str(page) print url source_code = requests.get(url, allow_redirects=False)...
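
A sketch of what the corrected loop could look like with both changes applied; the link-extraction part is an assumption, since the original snippet is truncated:

    import requests
    from bs4 import BeautifulSoup

    def trade_spider(max_pages):
        page = 1
        while page <= max_pages:
            url = "https://www.thenewboston.com/forum/recent_activity.php?page=" + str(page)
            # allow_redirects=False keeps requests from following a redirect loop
            source_code = requests.get(url, allow_redirects=False)
            soup = BeautifulSoup(source_code.text, "html.parser")
            for link in soup.find_all("a"):
                print(link.get("href"))
            page += 1

    trade_spider(1)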

Stop Scrapy crawling the same URLs

python,web-scraping,web-crawler,scrapy,duplication

The DupeFilter is enabled by default: http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class and it's based on the request url. I tried a simplified version of your spider on a new vanilla scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say you have something wrong on...

Scrapy, detect when new start_url is being

python,scrapy,web-crawler

It looks like you might be able to accomplish this through the use of signals. Specifically, the item_scraped signal which allows you to register an event after an item is scraped. For each received response, check if the response.url is in the start_url list. scrapy.signals.item_scraped(item, response, spider) More info on...

How to use MessageQueue in Crawler?

architecture,web-crawler,message-queue

what it does next? How it separates found links into processed, discovered and the new ones? You would set up separate queues for these, which would stream back to your database. The idea is that you could have multiple workers going, and a feedback loop to send the newly...
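
A toy sketch (in Python, purely illustrative) of the separate-queue idea described above; fetch and extract_links are hypothetical callables, and a real system would use a message broker and a database rather than in-process structures:

    from queue import Queue

    discovered = Queue()      # links found but not fetched yet
    processed = set()         # links already fetched

    def worker(fetch, extract_links):
        while not discovered.empty():
            url = discovered.get()
            if url in processed:
                continue
            body = fetch(url)
            processed.add(url)
            # feedback loop: newly found links re-enter the discovered queue
            for link in extract_links(body):
                if link not in processed:
                    discovered.put(link)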

Why getting unexpected with loop of URL results when solving equation

java,web-crawler,equation

My best guess is you are using integer division from all the .size() methods which will result in 0 if the answer is less than 1. Cast your results of .size() to float or double

Why the difference of set-cookie after curl call in php

php,curl,cookies,header,web-crawler

CURLOPT_COOKIESESSION is not the option to set your request's Cookie: header. CURLOPT_COOKIE is. That said, path=/; HttpOnly should never be a part of it. Since that second URI expects session cookie to be present in the request and it isn't because you fail to set it, it redirects you to...

PHP Curl for encrypted pages

php,curl,encryption,web-crawler

Looks like their web server is rejecting requests based on HTTP headers. Or it might be on the application level as well. Try this <?php // Get cURL resource $curl = curl_init(); // Set some options - we are passing in a useragent too here curl_setopt_array($curl, array( CURLOPT_RETURNTRANSFER => 1,...

Heritrix not finding CSS files in conditional comment blocks

java,web-crawler,heritrix

ExtractorHTML parses the page using the following regex: static final String RELEVANT_TAG_EXTRACTOR = "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2 "|((style[^>]*+)>.*?</style)" + // 3, 4 "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7 "|(!--(?!\\[if).*?--))>"; // 8 Basically, cases 1 .. 7 match any interesting tags for link extractions, and case 8 matches HTML comments...

WxPython using Listbox and other UserInput with a Button

python,listbox,wxpython,web-crawler

You can just grab it from the listbox, you don't need it from the event. See below: import wx # for gui class MyFrame(wx.Frame): def __init__(self, parent, id): wx.Frame.__init__(self, parent, id, 'Title', size=(300,200)) tournLevel = ['$10,000', '$15,000', '$20,000', '$50,000','$75,000','$100,000'] self.levelBox = wx.ListBox(panel, -1, (40, 50), (90, 90), tournLevel) self.levelBox.SetSelection(1) #...

Howto use scrapy to crawl a website which hides the url as href=“javascript:;” in the next button

javascript,python,pagination,web-crawler,scrapy

Visiting the site with a Web-Browser and activated Web-Developer-Tools (the following screenshots are made with Firefox and add-on Firebug) you should be able to analyze the Network requests and responses. It will show you that the sites pagination buttons send requests like the following: So the URL seems to be:...

How to retrieve redirect url given in window.location

python,beautifulsoup,web-crawler,python-requests,url-redirection

How about a regex? You just check response.text for a redirect occurrence (python): regex= /window\.location\s*=\s*\"([^"]+)\"/ var occurance = regex.exec(responce.text) if (occurance[1]) print occurance[1]; See the demo....
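
The snippet above uses JavaScript-style syntax; a minimal Python version of the same idea (the function and variable names here are assumptions) could look like this:

    import re

    def extract_js_redirect(html):
        # pull the URL assigned to window.location out of the page source
        match = re.search(r'window\.location\s*=\s*"([^"]+)"', html)
        return match.group(1) if match else None

    print(extract_js_redirect('window.location = "http://example.com/next"'))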

Get Facebook name from id with API/crawler

facebook,facebook-graph-api,web-crawler

Since you didn't mention which language you are using, I'll use PHP since it's easier and quicker to learn (my opinion). If you have the ID then this link will take you to the user profile: www.facebook.com/profile.php?id=<user id digits only> If you have the username then you can use: www.facebook.com/<user name>...

how to output multiple webpages crawled data into csv file using python with scrapy

python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider

from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.exceptions import CloseSpider from scrapy.http import Request from test.items import CraigslistSampleItem from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor URL = "http://example.com/subpage/%d" class MySpider(BaseSpider): name = "craig" allowed_domains = ["xyz.com"] def start_requests(self): for i in range(10): yield Request(URL % i,...

Selenium interpret javascript on mac?

selenium,web-crawler,mechanize

Selenium is a browser automation tool. You can basically automate everything you can do in your browser. Start with going through the Getting Started section of the documentation. Example: from selenium import webdriver driver = webdriver.Firefox() driver.get("http://www.python.org") print driver.title driver.close() Besides automating common browsers, like Chrome, Firefox, Safari or Internet...

PhantomJS console charset

node.js,character-encoding,web-crawler,phantomjs

I found the cause. This problem was simply an MS console encoding issue. I just typed the 'chcp' command and the console showed me a clear screen. That's all.

PHP web crawler, check URL for path

php,url,path,web-crawler,bots

It looks like regular expressions would be helpful here. You could say, for instance: /* if $input_url contains a 4 digit year, slash, number(s), slash, number(s) */ if (preg_match("/\/20\d\d\/\d+\/\d+\/",$input_url)) { echo $input_url . "<br>"; } ...

My Java program reaches 80% cpu usage after 20-30 min

java,database,web-crawler,cpu

This sounds like a resource problem. Did you close all your resources in finally blocks? Did you start threads which never finish and keep going on and on?

Unable to click in CasperJS

javascript,web-crawler,phantomjs,casperjs

Try changing your selector to this.mouse.click('li[class="item1"] > a') because the li[class="item1"] is not clickable, but the a element inside of it is.

python3 - can't pass through autorization

authentication,python-3.x,web-crawler,authorization

In my case the problem was the 'Referer' parameter in the headers, which is required but wasn't specified.

Authorization issue with cron crawler inserting data into Google spreadsheet using Google API in Ruby

ruby,cron,google-api,web-crawler,google-api-client

Solved thanks to Ruby google_drive gem oAuth2 saving I needed to get a refresh token and make my code use it like below. CLIENT_ID = '!!!' CLIENT_SECRET = '!!!' OAUTH_SCOPE = 'https://www.googleapis.com/auth/drive' REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob' REFRESH_TOKEN = '!!!' client = Google::APIClient.new client.authorization.client_id = CLIENT_ID client.authorization.client_secret = CLIENT_SECRET client.authorization.scope = OAUTH_SCOPE...

Scrapy python error - Missing scheme in request URL

python,web-crawler,scrapy,scrapy-spider

You need to add a scheme to the URL: ftp://ftp.site.co.uk The FTP URL syntax is defined as: ftp://[<user>[:<password>]@]<host>[:<port>]/<url-path> Basically, you do this: yield Request('ftp://ftp.site.co.uk/feed.xml', ...) Read more about schemes at Wikipedia: http://en.wikipedia.org/wiki/URI_scheme...
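
A minimal sketch of yielding such a request from start_requests; the exact feed path and the anonymous FTP credentials in meta are assumptions:

    import scrapy

    class FeedSpider(scrapy.Spider):
        name = "ftp_feed"

        def start_requests(self):
            # the URL must carry an explicit scheme, here ftp://
            yield scrapy.Request(
                "ftp://ftp.site.co.uk/feed.xml",
                meta={"ftp_user": "anonymous", "ftp_password": ""},
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("fetched %d bytes", len(response.body))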

Scrapy CrawlSpider not following links

python,web-scraping,web-crawler,scrapy,scrapy-spider

The key problem with your code is that you have not set the rules for the CrawlSpider. Other improvements I would suggest: there is no need to instantiate HtmlXPathSelector, you can use response directly select() is deprecated now, use xpath() get the text() of the title element in order to...

Cannot download image with relative URL Python Scrapy

python,web-crawler,scrapy,scrapy-spider

Wrap your image url in a list like so: item['image_urls'] = [self.page_name + imageStr[3:-2]] ...

How to crawl links on all pages of a web site with Scrapy

website,web-crawler,scrapy,extract

So, a couple things first: 1) the rules attribute only works if you're extending the CrawlSpider class, they won't work if you extend the simpler scrapy.Spider. 2) if you go the rules and CrawlSpider route, you should not override the default parse callback, because the default implementation is what actually...

Net/HTTPS not getting all the content

ruby,web-crawler,nokogiri,net-http,mechanize-ruby

Thanks to the help of @theTinMan, @MarkThomas and a colleague, I've managed to log into jenkins and collect the page's XML, through Mechanize and Nokogiri: require 'rubygems' require 'nokogiri' require 'net/https' require 'openssl' require 'mechanize' # JenkinsXML logs into Jenkins and gets an...

Web Scraper for dynamic forms in python

python,web-scraping,web-crawler,mechanize

If you look at the request being sent to that site in developer tools, you'll see that a POST is sent as soon as you select a state. The response that is sent back has the form with the values in the city dropdown populated. So, to replicate this in...

Crawling & parsing results of querying google-like search engine

java,parsing,web-crawler,jsoup

I have to translate \x signs and add that site to my "toVisit" sites...I don't have any other idea, how to parse something like this... The \xAA is hexadecimal encoded ascii. For instance \x3d is =, and \x26 is &. These values can be converted using Integer.parseInt with radix...

Scrapy middleware setup

python,web-scraping,web-crawler,scrapy

With this setup, you need to move middlewares.py one level up into craiglist package.

How can I get the value of a Monad without System.IO.Unsafe? [duplicate]

haskell,web-crawler,monads

This is a lie: getResultCounter :: String -> Integer The type signature above is promising that the resulting integer only depends on the input string, when this is not the case: Google can add/remove results from one call to the other, affecting the output. Making the type more honest, we...

Check if element exists in fetched URL [closed]

javascript,jquery,python,web-crawler,window.open

I suggest you use an iframe for loading the pages. For example: $.each($your-links, function(index, link) { var href = $(link).attr("href"); // your link preprocess logic ... var $iframe = $("<iframe />").appendTo($("body")); $iframe.attr("src", href).on("load", function() { var $bodyContent = $iframe.contents().find("body"); // check iframe content and remove iframe $iframe.remove(); } } But, I...

Python: urllib2 get nothing which does exist

python,web-scraping,web-crawler,urllib2

Rather than working around redirects with urlopen, you're probably better off using a more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history r = requests.get('website', allow_redirects=True) print r.text ...

focused crawler by modifying nutch

web-crawler,nutch

If the extracted urls can be differentiated by a regular expression, you can do that with current Nutch by adding the specific regex filter. But if you are going to classify URLs according to some metadata features related to the page, you have to implement a customized HTMLParseFilter to filter Outlink[] during...

SgmlLinkExtractor not displaying results or following link

python,web-crawler,scrapy,scrapy-spider,sgml

The problem is in the restrict_xpaths - it should point to a block where a link extractor should look for links. Don't specify allow at all: rules = [ Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), callback="parse_me", follow=True), ] And you need to fix your allowed_domains: allowed_domains = ["www.allgigs.co.uk"] Also note that the print items in...

Cannot Write Web Crawler in Python

python,web-crawler,beautifulsoup,urllib2

I modified your function so it doesn't write to file, it just prints the urls, and this is what I got: http://www.nytimes.com/ http://cn.nytimes.com http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/ http://international.nytimes.com http://cn.nytimes.com http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/ http://international.nytimes.com http://cn.nytimes.com http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/ http://international.nytimes.com http://cn.nytimes.com...

Ruby - WebCrawler how to visit the links of the found links?

ruby,url,hyperlink,web-crawler,net-http

First of all you need a function that accepts a link and returns the body output. Then parse all the links out of the body and keep a list of links. Check that list to see whether you have visited the link yet. Remove those visited links from the new links list...

Scrapy delay request

python,web-crawler,scrapy

You need to set DOWNLOAD_DELAY in settings.py of your project. Note that you may also need to limit concurrency. By default concurrency is 8 so you are hitting website with 8 simultaneous requests. # settings.py DOWNLOAD_DELAY = 1 CONCURRENT_REQUESTS_PER_DOMAIN = 2 Starting with Scrapy 1.0 you can also place custom...
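
A sketch of the per-spider variant mentioned at the end of the answer (custom_settings, available since Scrapy 1.0); the spider name and URL are placeholders:

    import scrapy

    class SlowSpider(scrapy.Spider):
        name = "slow"
        start_urls = ["http://example.com/"]

        # per-spider overrides of the project settings
        custom_settings = {
            "DOWNLOAD_DELAY": 1,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        }

        def parse(self, response):
            pass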

scrapy crawling multiple pages [3 levels] but scraped data not linking properly

python,arrays,web-crawler,scrapy

You should instantiate a new item to yield each time. Assuming TV() is an item class class TV(Item): .... Then you should have a separate item = TV() for each episode. If you want to pass data from top levels - pass the data itself and create an Item only...
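
A sketch of that advice, assuming hypothetical selectors and a TV item with show/season/episode fields: data from the upper levels travels in response.meta, and a fresh TV() is created only in the deepest callback.

    import scrapy

    class TV(scrapy.Item):
        show = scrapy.Field()
        season = scrapy.Field()
        episode = scrapy.Field()

    class TvSpider(scrapy.Spider):
        name = "tv"
        start_urls = ["http://example.com/shows"]

        def parse(self, response):
            for href in response.css("a.show::attr(href)").getall():
                yield response.follow(href, self.parse_season,
                                      meta={"show": response.url})

        def parse_season(self, response):
            for href in response.css("a.episode::attr(href)").getall():
                yield response.follow(href, self.parse_episode,
                                      meta={"show": response.meta["show"],
                                            "season": response.url})

        def parse_episode(self, response):
            item = TV()                      # a new item for every episode
            item["show"] = response.meta["show"]
            item["season"] = response.meta["season"]
            item["episode"] = response.url
            yield item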

Scrapy returning a null output when extracting an element from a table using xpath

python,xpath,web-scraping,web-crawler,scrapy

Seems like it's an xpath problem: during development of this site they might have omitted tbody, but the browser automatically inserts it when the page is viewed. You can get more info about this from here. So if you need the county's value (WELD #123) in the given page then the...

ValueError:(“Invalid XPath: %s” % query) XPath Checker generating erroneous code

python,html,xpath,web-crawler,scrapy

'id("div_a1")/div[3]' works for me. See this sample scrapy shell session: $ scrapy shell http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml ... 2015-02-10 12:56:13+0100 [default] DEBUG: Crawled (200) <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml> (referer: None) [s] Available Scrapy objects: [s] crawler <scrapy.crawler.Crawler object at 0x7fbd4f5a7a50> [s] item {} [s] request <GET http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml> [s] response <200...

How to use file_get_contents to find 'a' and click 'a' to get inner contents

php,ajax,web-crawler

Use PHP cURL and PHP DOMDocument for this: libxml_use_internal_errors(true); for ($y = 1; $y <= 5; $y++) { $ch = curl_init(); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_setopt($ch, CURLOPT_URL, 'http://www.pakwheels.com/used-cars/search/-/?page=' . $y); $searchResults = curl_exec($ch); // save $searchResults here to a file or use DOMDocument to filter what you need $doc = new...

How could I get a part of a match string by RegEx in Python?

python,regex,web-crawler

Use capturing group or lookarounds. >>> pattern=re.compile(r'\bdata-id="(\d+)"') >>> s = 'data-id="48859672"' >>> pattern.search(s).group(1) '48859672' OR >>> pattern=re.compile(r'(?<=\bdata-id=")\d+(?=")') >>> s = 'data-id="48859672"' >>> pattern.search(s).group() '48859672' ...

Making AngularJS and Parse Web App Crawlable with Prerender

angularjs,parse.com,web-crawler,google-crawlers,prerender

After some help from Prerender.io team, here are the outlined steps that resulted in successful crawling by the Facebook and Google crawler tests. Remember this is for an AngularJS app running on a Parse.com backend add $locationProvider.hashPrefix("!") to your .config in your main module (I am not using HTML5Mode because...

T_STRING error in my php code [duplicate]

php,web-crawler

I think that you got this code from C#, C++ or another similar language; this code does not work in PHP. If you have an external Java application (jar), use the exec functions instead. $url_effective = "http://www.endclothing.co.uk/checkout/cart/add/uenc/aHR0cDovL3d3dy5lbmRjbG90aGluZy5jby51ay9ldHEtaGlnaC10b3AtMS1zbmVha2VyLWVuZC1leGNsdXNpdmUtZXRxLTQtZW5kYmsuaHRtbA,,/product/$i/form_key/DcwmUqDsjy4Wj4Az/"; $crwal = exec("end-cookie.jar -w".$url_effective." -L -s -S -o"); Or something in this style....

How to Pass variables inside functions using new method

java,variables,web-crawler

Yes, there are some ways. You need to understand constructors, which are the way to create instances of your classes in Java. Right now your code only allows you to do: LinkNode frl = new LinkNode(frontierUrl) webCrawler.enque( frl ); //you are passing your LinkNode instance Another way is changing your LinkNode constructor...

How do i cache pages that are created on the fly by java servlet so reusable and indexable

java,tomcat,amazon-web-services,web-crawler

Set up some cache headers on the pages so the pages are stored for a longer period of time (e.g. a few days), move Tomcat over to some other hostname, then set up Amazon CloudFront with Tomcat as the origin server. Then finally set up a CNAME DNS record to point...

limit web scraping extractions to once per xpath item, returning too many copies

python,xpath,web-crawler,scrapy

A very simple solution is to correct your parse function as this one. No need of the outside loop since there is just one div_a1 element in the html code. class Spider(BaseSpider): name = "hzIII" allowed_domains = ["tool.httpcn.com"] start_urls = ["http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml"] def parse(self, response): print response.xpath('//*[@id="div_a1"]/div[2]').extract() print response.xpath('//*[@id="div_a1"]/div[3]').extract() Note: About...

Python: Transform a unicode variable into a string variable

python,unicode,casting,web-crawler,unicode-string

Use a RegEx to extract the price from your Unicode string: import re def reducePrice(price): match = re.search(r'\d+', u' $500 ') price = match.group() # returns u"500" price = str(price) # convert "500" in unicode to single-byte characters. return price Even though this function converts Unicode to a "regular" string...

Extracting data from webpage using lxml XPath in Python

python,xpath,web-crawler,lxml,python-requests

I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty. I think your problem is that it is filled using javascript, and javascript will not be automatically evaluated just by reading the html with lxml.html. You might...

How to get file access information on linux (debian)

linux,logging,web-crawler,monitoring,server

Your http server probably produces logs in /var/log/something, where something depends on which server you use. Apache?

want to keep running my single ruby crawler that dont need html and nothing

ruby-on-rails,ruby,web-crawler

If your computer is going to be on everyday at 9am, you could schedule it using cron. If your computer is not going to be on everyday at 9am, you can simply buy a cheap VPS server from digitalocean, linode or one of the many other VPS hosting providers and...

jsoup crawler error when called inside a servlet

java,google-app-engine,servlets,web-crawler,jsoup

It works when I use the userAgent Mozilla. doc3 = Jsoup.connect(link).userAgent("Mozilla").timeout(250000).get(); ...

Why python print is delayed?

python,python-3.x,web-crawler,python-requests

You are accessing request.content here: size = len(respond_file.content)//1000000 Accessing that property forces the whole response to be downloaded, and for large responses this takes some time. Use int(respond_file.headers['content-length']) instead: size = int(respond_file.headers['content-length']) // 1000000 The Content-Length header is provided by the server and since it is part of the headers...
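
A minimal sketch of that change; the URL is a placeholder, and note that not every server sends a Content-Length header, so a fallback may be needed:

    import requests

    respond_file = requests.get("http://example.com/big.iso", stream=True)
    # read the size from the headers instead of downloading the whole body
    size = int(respond_file.headers["content-length"]) // 1000000
    print("about {} MB".format(size))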

How to get Google to re-index a page after removing noindex metatag?

web-crawler,sitemap,meta-tags,google-webmaster-tools,noindex

Google usually crawls your pages fairly quickly. Inclusion into the index is a bit slower, and getting a reasonable search rank takes time. Look at your web server log to confirm that the Google bot did crawl your pages; you can search for the exact page in Google and it usually comes up, but...

How to make a parser for a web crawler maintainable

ruby,web-crawler,nokogiri

You should try to use the data and metadata of the web page to find the element you care about as much as possible instead of using element index numbers like you are doing. The "class" and "id" attributes are a good way to do it. Nokogiri has XPath features...

Redirecting Crawler to internal service

facebook,nginx,service,web-crawler

Add a break after the proxy_pass. location / { if ($http_user_agent ~ Facebot) { proxy_pass http://127.0.0.1:9998; break; } root /etc/www/website; try_files $uri /index.html; ... other stuff... } ...

SgmlLinkExtractor in scrapy

web-crawler,scrapy,rules,extractor

It's not possible to answer "should I" questions if you don't provide complete example strings and what you want to match (and what you don't want to match) with a regular expression. I guess that your regex won't work because you use \ instead of /. I recommend you go...

Selenium pdf automatic download not working

python,selenium,selenium-webdriver,web-scraping,web-crawler

Disable the built-in pdfjs plugin and navigate to the URL - the PDF file would be downloaded automatically, the code: from selenium import webdriver fp = webdriver.FirefoxProfile() fp.set_preference("browser.download.folderList", 2) fp.set_preference("browser.download.manager.showWhenStarting",False) fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar") fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf") fp.set_preference("pdfjs.disabled", "true") # < KEY PART HERE...

“TypeError: 'Rule' object is not iterable” webscraping an .aspx page in python

python-2.7,selenium,web-crawler

The problem is in the following line. rules = (Rule(LinkExtractor (allow=(" .*http://profiles.ehs.state.ma.us/Profiles/Pages/PhysicianProfile.aspx?PhysicianID=.*," )))) You misplaced a comma. The correct code is: rules = (Rule(LinkExtractor(allow=(" .*http://profiles.ehs.state.ma.us/Profiles/Pages/PhysicianProfile.aspx?PhysicianID=.*" ))),) By this correction you make the rule iterable. A good definition of iterators is here: (Build a Basic Python Iterator) Iterator objects in python conform...

Single session multiple post/get in python requests

python,web-crawler,python-requests

Requests uses an internal version of urllib3. I have the impression that somehow there is a version mismatch between the internal urllib3 and requests itself. httplib_response = conn.getresponse(buffering=True) TypeError: getresponse() got an unexpected keyword argument 'buffering' Seems to indicate that requests is calling urllib3 (the internal version, not Python's), but...

How can I scrape pages with dynamic content using node.js?

node.js,request,web-crawler,phantomjs,cheerio

Here you go; var phantom = require('phantom'); phantom.create(function (ph) { ph.createPage(function (page) { var url = "http://www.bdtong.co.kr/index.php?c_category=C02"; page.open(url, function() { page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() { page.evaluate(function() { $('.listMain > li').each(function () { console.log($(this).find('a').attr('href')); }); }, function(){ ph.exit() }); }); }); }); }); ...

Heritrix single-site scrape, including required off-site assets

java,web-crawler,heritrix

You could use 3 decide rules: The first one accepts all non-html pages, using a ContentTypeNotMatchesRegexDecideRule; The second one accepts all urls in the current domain. The third one rejects all pages not in the domain and not directly reached from the domain (the alsoCheckVia option) So something like that:...

The scrapy LinkExtractor(allow=(url)) get the wrong crawled page, the regulex doesn't work

python,web-crawler,scrapy

The regular expression \d{2} matches every number that starts with two digits. If you want to limit the regular expression to two digits you can use \d{2}$ so that it only matches if there are two digits at the end of the line. Even more general would be to use...

How to retrieve all the images, js, css urls

python,http,web,web-crawler,scrapy

You can use a link extractor (more specifically, I've found the LxmlParserLinkExtractor works better for this kind of thing), customizing the elements and attributes like this: from scrapy.contrib.linkextractors.lxmlhtml import LxmlParserLinkExtractor tags = ['img', 'embed', 'link', 'script'] attrs = ['src', 'href'] extractor = LxmlParserLinkExtractor(lambda x: x in tags, lambda x: x...

How to crawl classified websites [closed]

web-crawler,scrapy,scrapy-spider

import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse import urljoin class CompItem(scrapy.Item): name = scrapy.Field() price = scrapy.Field() location = scrapy.Field() class criticspider(CrawlSpider): name = "craig" allowed_domains = ["newyork.craigslist.org"] start_urls = ["http://newyork.craigslist.org/search/cta"] def parse(self, response): sites = response.xpath('//div[@class="content"]') items = []...

Is it possible to list all the functions called after clicking the page with the use of Chrome Developer Tools

javascript,google-chrome,debugging,web-crawler

If you open devtools and go to your network tab. Search the request that you want to investigate. One of the columns in the network table is the "Initiator column". There you'll find the script and line number of the javascript that fired the request. If you then hover over...

Get a substring from a list item in python after a word

python,regex,beautifulsoup,web-crawler

Edited: Just print the text of txt (thanks for @angurar clarifying OP's requirements): for txt in soup.findAll('td',{'class':"field title"}): print txt.string Or if you're after the title attribute of <a>: for txt in soup.findAll('td',{'class':"field title"}): print [a.get('title') for a in txt.findAll('a')] It will return a list of all <a> title's attribute....

how to check whether a program using requests module is dead or not

python,web-crawler,downloading

You probably don't want to detect this from outside, when you can just use timeouts to have requests fail instead of stalling if the server stops sending bytes. Since you didn't show us your code, it's hard to show you how to change it… but I'll show you how to...
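
A sketch of the timeout idea with requests, under the assumption that the program streams a large download; the URL and the timeout values are placeholders:

    import requests

    try:
        # (connect timeout, read timeout) - the read timeout applies to every
        # wait for new bytes, so a stalled server raises instead of hanging
        r = requests.get("http://example.com/bigfile", stream=True, timeout=(3.05, 30))
        for chunk in r.iter_content(chunk_size=8192):
            pass  # write the chunk to disk here
    except requests.exceptions.Timeout:
        print("the server stopped sending data; retry or give up")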

Python: Can I use Chrome's “Inspect Element” XPath create tool as a Scrapy spider XPath?

python,google-chrome,xpath,web-crawler,scrapy

In most cases you need to tweak a bit the Xpath returned by the browsers, for these basic reasons: The HTML can be altered after the page loads by JavaScript. The HTML can be altered by the browser itself. They rely heavily on node position and include many unnecessary elements,...

brute force web crawler, how to use Link Extractor towards increased automation. Scrapy

python,xpath,hyperlink,web-crawler,scrapy

I think this is what you want: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.http import Request from brute_force.items import BruteForceItem from urlparse import urljoin class DmozSpider(BaseSpider): name = "brutus" allowed_domains = ["tool.httpcn.com"] start_urls = ['http://tool.httpcn.com/Zi/BuShou.html'] def parse(self, response): for url in response.css('td a::attr(href)').extract(): cb = self.parse if...

New to Python, what am I doing wrong and not seeing tag (links) returned with BS4

python,beautifulsoup,web-crawler,bs4

You are not actually running the function if you have the function call inside the actual function. After you correct that, you are going to get an error, as that is not a valid url to pass to requests. Lastly, your soup.findAll('a', {'h3 class': "two-lines-name"}) is not going to find...

How to resume a previous incomplete job in apache nutch crawler

apache,web-crawler,nutch,resume

One simple solution is to save the current batchid and the stage at which the crawler is running. Edit your crawl script so that you can jump to the stage that is in the stage file and change the batchid to the saved batchid (this is like a case-statement scenario). It will do...

delete spiders from scrapinghub

delete,web-crawler,scrapy,scrapy-spider,scrapinghub

You just need to remove the spider from your project, and deploy the project again, via shub deploy, or scrapyd-deploy.

Workload balancing between akka actors

multithreading,scala,web-crawler,akka,actor

I'm working on a similar program where the workers have a non-uniform resource cost (in my case the task is performing database queries and dumping the results in another database, but just as crawling different websites will have different costs so too will different queries have different costs). Two ways...

Scrape result export prooblem

python,web-crawler,scrapy

This is because you need to return Item instances: import scrapy from tutorial.items import TutorialItem class ChillumSpider(scrapy.Spider): name = "chillum" allowed_domains = ["flipkart.com"] start_urls = ["http://www.flipkart.com/search?q=brown+jacket&as=offas-show=off&otracker=start"] def parse(self, response): titles = response.xpath('//a[@class="fk-display-block"]/text()').extract() for title in titles: item = TutorialItem() item['title'] = title yield item ...

Get Web Bot To Properly Crawl All Pages Of A Site

python,web-scraping,web-crawler,beautifulsoup

The problem with your current code is that the URLs you are putting into the queue (urls) are pointing to the same page, but to a different anchor, for example: https://weedmaps.com/dispensaries/shakeandbake#videos https://weedmaps.com/dispensaries/shakeandbake#weedmenu In other words, tag['href'] not in visited condition does not filter different URLs pointing to the same page...
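
A minimal sketch of the fix implied above: drop the #fragment before the already-visited check so that anchors pointing at the same page are not queued again. The function and variable names are assumptions.

    from urllib.parse import urldefrag, urljoin

    visited = set()

    def enqueue(base_url, href, urls):
        # strip the fragment: .../shakeandbake#videos -> .../shakeandbake
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in visited:
            visited.add(url)
            urls.append(url)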

Create accounts only for real people

session,cookies,web-crawler

You can check whether your user accepts cookies via AJAX. On the landing page set a cookie, and then send a request back to the server immediately after page load with the cookie. Only if the cookie is present, create your user. This will be quick and confirms that the users...

efficient XPath syntax exclusively extract single component

html,xpath,web-scraping,web-crawler,scrapy

The warning and error you get are specific to the site you are using to test your XPath expression. It appears you have used a syntax that is used to declare namespaces on http://www.xpathtester.com/xpath. Given that you know how to submit an XPath expression, the following works fine: //td[@class =...

Get all links from page on Wikipedia

python,python-2.7,web-crawler

Wikipedia has a built-in tool that does just what you are describing: WhatLinksHere (backlinks). You can see this tool on every Wikipedia page. You can simply scrape all of the links off the goal page's back-links page. 'http://en.wikipedia.org/w/index.php?title=Special%3AWhatLinksHere&limit='500'&target='+goal+'&namespace=0' ^^^^ Article you are trying to reach here Wiki-help page...

Scrapy follow link and collect email

python,web-scraping,web-crawler,scrapy

In order to see an email on a craigslist item page, one would click the "Reply" button, which initiates a new request to a "reply/chi/vgm/" url. This is something you need to simulate in Scrapy by issuing a new Request and parsing the results in a callback: # -*- coding:...
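
A compressed sketch of that pattern (the CSS selectors, the listing URL and the reply link id are assumptions, since craigslist markup changes): the item callback yields a second Request for the reply URL, and the final callback reads the email.

    import scrapy

    class CraigsEmailSpider(scrapy.Spider):
        name = "craigs_email"
        start_urls = ["http://chicago.craigslist.org/search/vgm"]  # placeholder listing

        def parse(self, response):
            for href in response.css("a.result-title::attr(href)").getall():
                yield response.follow(href, self.parse_item)

        def parse_item(self, response):
            reply_href = response.css("a#replylink::attr(href)").get()
            if reply_href:
                yield response.follow(reply_href, self.parse_reply,
                                      meta={"item_url": response.url})

        def parse_reply(self, response):
            yield {
                "item_url": response.meta["item_url"],
                "email": response.css('a[href^="mailto"]::text').get(),
            }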

Why scrapy not giving all the results and the rules part is also not working?

python,xpath,web-scraping,web-crawler,scrapy

The problem is in how you define sites. Currently, it is just //table[@width="100%"] which would result into the complete table to be matched. Instead, find all div elements having id attribute directly inside a td tag: sites = response.xpath("//td/div[@id]") As for the rules part - here is the approach I...

Redis - list of visited sites from crawler

python,url,redis,queue,web-crawler

A few suggestions: Look into using Redis' (2.8.9+) HyperLogLog data structure - you can use PFADD and PFCOUNT to get a reasonable answer whether a URL was counted before. Don't keep each URL in its own url_ key - consolidate into a single or bucket Hashs as explained in "Memory...
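
A minimal sketch of the HyperLogLog suggestion with redis-py; the key name is an assumption. PFADD returns 1 when the element was (probably) not seen before, 0 when it likely was.

    import redis

    r = redis.Redis()

    def probably_new(url):
        # HyperLogLog: constant memory, small error rate, no per-URL keys
        return r.pfadd("crawler:visited", url) == 1

    print(probably_new("http://example.com/"))   # True on the first call
    print(probably_new("http://example.com/"))   # False afterwards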

Apache Nutch REST api

api,rest,web-crawler,nutch

At the time of this posting, the REST API is not yet complete. A much more detailed document exists, though it's still not comprehensive. It is linked to in the following email from the user mailing list (which you might want to consider joining): http://www.mail-archive.com/user%40nutch.apache.org/msg13652.html But to answer your question...

Python 3.3 TypeError: can't use a string pattern on a bytes-like object in re.findall()

python-3.x,web-crawler

You want to convert html (a byte-like object) into a string using .decode, e.g. html = response.read().decode('utf-8'). See Convert bytes to a Python String...
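
A short sketch of the fix in context, assuming the page is UTF-8 encoded and that re.findall is being used to pull links out of the page:

    import re
    from urllib.request import urlopen

    response = urlopen("http://example.com/")
    html = response.read().decode("utf-8")   # bytes -> str, so string patterns work
    print(re.findall(r'href="([^"]+)"', html))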

fullPage.js: Make all slides and sections visible in search engine results

jquery,seo,web-crawler,single-page-application,fullpage.js

You probably won't be able to force Google to index your anchor links as different pages. You will be able to index them as a single page. Google will read your page as what it really is. A single page. There are some recommendations which suggest to use the id...

how to download image in Goutte

php,web-crawler,guzzle,goutte

I got the answer: $client->getClient()->get($img_url, ['save_to' => $img_url_save_name, 'headers'=>['Referer'=>$src] ]); Actually I can set the Referer header in Goutte\Client, but there's no option to give a path to save the image. So I finally used the Guzzle client instead....

Scrapy collect data from first element and post's title

python,web-scraping,web-crawler,scrapy,scrapy-spider

For posting title, get all the text nodes from the span tag and join them: $ scrapy shell http://denver.craigslist.org/bik/5042090428.html In [1]: "".join(response.xpath("//span[@class='postingtitletext']//text()").extract()) Out[1]: u'Tonka double shock boys bike - $10 (Denver)' Note that the "Scrapy-way" to do this would be to use an ItemLoader and the Join() processor. Second is...

How to access the web page contents

java,html,web,web-crawler,jsoup

Your question isn't clear enough, but from your code I understand that you are looking to save the link's text. Using the .select() syntax you must use doc.select("a[href]"); then you can use your current for loop.