You'll probably kick yourself about this, but you need to include the scheme in the URL, i.e. http. Try changing codecademy = 'www.codecademy.com' to codecademy = 'http://www.codecademy.com' ...
There are lots of possible sources for a speed change between requests. A few that immediately spring to mind: DNS lookup cached on your client. The first call must convert "xyz.com" to "123.45.67.89", involving a DNS lookup which may be slow. HTTP keep-alive. There is an initial conversation between client...
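The keep-alive point is easy to demonstrate without touching a real site. Below is a minimal stdlib sketch (the local server, port, and handler are stand-ins, not anything from the question) that counts TCP connections: two requests over one persistent connection are accepted once, while two one-shot requests are accepted twice.

```python
import http.client
import http.server
import threading

class CountingServer(http.server.ThreadingHTTPServer):
    """HTTP server that counts distinct TCP connections accepted."""
    connection_count = 0

    def get_request(self):
        request = super().get_request()
        type(self).connection_count += 1
        return request

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 + Content-Length enables keep-alive

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep output quiet

server = CountingServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Two requests over ONE persistent (keep-alive) connection.
conn = http.client.HTTPConnection("127.0.0.1", port)
for _ in range(2):
    conn.request("GET", "/")
    conn.getresponse().read()  # must fully read to allow socket reuse
conn.close()
kept_alive = CountingServer.connection_count  # 1 accept for 2 requests

# Two requests, each opening a FRESH connection.
for _ in range(2):
    c = http.client.HTTPConnection("127.0.0.1", port)
    c.request("GET", "/")
    c.getresponse().read()
    c.close()
fresh = CountingServer.connection_count - kept_alive  # 2 accepts for 2 requests
server.shutdown()
```

The same effect is why a second request to a real site is often faster: the TCP (and TLS) setup cost is paid once and then amortized over the kept-alive connection.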
python,screen-scraping,mechanize
This is how I selected the first form in my code. br.select_form(nr=0) #Form fields to populate br.form['username'] = username br.form['password'] = password #Submit the login form br.submit() Modify it to suit your needs. The "nr=0" is probably what you're looking for. But the problem is the DOCTYPE. I tested the...
javascript,ruby,select,mechanize
Let's say the option looks like this: option = page.at('option[text()=foo]') You would do the action (change location to the option's value) with: page = agent.get option[:value] ...
Where does the link point to? Ie, what's the href? I ask because "scheme" usually refers to something like http or https or ftp, so maybe the link has a weird scheme that mechanize doesn't know how to handle, hence Mechanize::UnsupportedSchemeError...
python,web-scraping,mechanize,scrape
It's zero-indexed. Try the code below: br.select_form(nr=0) ...

python,caching,web-scraping,mechanize
There are plenty of easy ways to implement caching. Write data to a file. This is especially easy if you are just dealing with small amounts of data. JSON blobs can also be easily written to a file. with open("my_cache_file", "a+") as file_: file_.write(my_json_blob) Use a key value store to...
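As a minimal sketch of the file-backed variant the answer describes (function and file names here are illustrative, not from the question), a JSON file can serve as the cache so the expensive fetch only runs on a miss:

```python
import json
import os
import tempfile

def fetch_with_cache(key, fetch, cache_path):
    """Return the cached value for `key`, calling `fetch()` only on a cache miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    if key not in cache:
        cache[key] = fetch()  # expensive call, e.g. a network request
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return cache[key]

calls = []
def slow_fetch():
    calls.append(1)  # stands in for hitting the network
    return {"status": "ok"}

path = os.path.join(tempfile.mkdtemp(), "cache.json")
first = fetch_with_cache("http://example.com", slow_fetch, path)
second = fetch_with_cache("http://example.com", slow_fetch, path)
# slow_fetch ran once; the second call was served from the file
```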
python,html,web-scraping,mechanize,hidden
Pretty sure there is javascript involved which mechanize cannot handle. An alternative solution here would be to automate a real browser through selenium: from selenium import webdriver from selenium.common.exceptions import TimeoutException from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC driver = webdriver.Firefox() # could...
ruby,hash,web-scraping,nokogiri,mechanize
With help from here, here and here I have the fully working code: require 'mechanize' @hashes = [] # Initialize Mechanize object a = Mechanize.new # Begin scraping a.get('http://www.marktplaats.nl/') do |page| groups = page.search('//*[(@id = "navigation-categories")]//a') groups.each_with_index do |group, index_1| a.get(group[:href]) do |page_2| categories = page_2.search('//*[(@id = "category-browser")]//a') categories.each_with_index do...
python,facebook,web-scraping,beautifulsoup,mechanize
Or instead of scraping Facebook, you can do it the proper way - through their graph API ;) import requests url = "http://graph.facebook.com/{}".format("zuck") params = { "fields": "picture" } response = requests.get(url, params=params).json() picture_url = response['picture']['data']['url'] print(picture_url) # output: #...
Looks like there is a meta refresh in there (per your description). Try adding this to your Mechanize object: a.follow_meta_refresh = true Also, you may want to set your user_agent to an accepted value instead of your custom one: require 'mechanize' Mechanize::AGENT_ALIASES.each { |k,v| puts k } => Mechanize => Linux Firefox...
python,beautifulsoup,mechanize
You need to supply a user agent: url = "http://www.marinetraffic.com/en/ais/index/positions/all/shipid:415660/mmsi:354975000/shipname:ADESSA%20OCEAN%20KING/_:6012a2741fdfd2213679de8a23ab60d3" import requests headers = {'User-agent': 'Mozilla/5.0'} html = requests.get(url,headers=headers).content soup = BeautifulSoup(html) table = soup.find("table") # only one table So just unpack the list with something like: for row in table.findAll('tr')[1:]: items = row.text.replace(u"kn","") # remove kn so items line...
I worked with Ian (the OP) on this problem and just felt we should close this thread with some answers based on what we found when we spent some more time on the problem. 1) You can't scrape a Google Group with Mechanize. We managed to get logged in but...
open takes an argument that's a filename, not a URL. If you want to access the URL, you would normally have to do a lot more than simply open a file. Luckily, Ruby provides a nice wrapper for Net::HTTP, called open-uri. Just drop the following line at the top of...
python,loops,beautifulsoup,mechanize,bs4
You don't actually need requests module to iterate through paged search result, mechanize is more than enough. This is one possible approach using mechanize. First, get all paging links from current page : links = br.links(url_regex=r"fuseaction=home.search&pageNumber=") Then iterate through paging links, open each link and gather useful information from each...
Here's an example of how you could parse the bold text and href attribute from the anchor elements you describe: require 'nokogiri' require 'open-uri' url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea' doc = Nokogiri::HTML(open(url)) doc.xpath('//dd/*/a').each do |a| text = a.xpath('.//b').map {|b| b.text.gsub(/\s+/, ' ').strip} href = a['href'] puts "OK: text=#{text.inspect}, href=#{href.inspect}" end # OK:...
ruby-on-rails-3,web-scraping,mechanize
When you click one of the page links javascript on the page triggers a post request to the same path with the new page number. This can be found in their js file at http://eserver.goutsi.com:8080/js/LoadBoard.js function gotoPage(pageNumber) { document.getElementById("PageNbr").value= pageNumber; // Set new page number document.getElementById("PageDir").value="R"; // Refresh document.getElementById("theForm").submit(); }...
html,ruby,xpath,nokogiri,mechanize
It works for me; you can also try ;) array = page.search('//*[@class="visible-phone"]/a').each { |i| puts i['href'] } UPD: first_link_page = page.link_with(:href => array.first['href']).click UPD2: array = page.search('//*[@class="visible-phone"]/a') first_link_page = page.link_with(:href => array.first['href']).click ...
ruby,csv,automation,web-scraping,mechanize
Your error is telling you that something on line 19 in your code is causing the issue for line 442 in mechanize. I tried your sample out in IRB and it seems to work fine: 2.2.2 :001 > require 'mechanize' => true 2.2.2 :002 > agent = Mechanize.new => #<Mechanize:......
python,beautifulsoup,mechanize,urllib
assuming you know how to grab the page with requests you would do the following: ... from bs4 import BeautifulSoup ... gradetd = BeautifulSoup(html).find_all('td',{'class':'fixed-column important'}) for row in gradetd: grades = row.find('span',{'class':'expandable-row'}).text.strip() if grades: avg, grade = grades.split(' ') print("{}/{}".format(avg,grade)) 89.83/A- 99.14/A+ 98.20/A+ 91.14/A- 94.32/A 95.76/A 91.28/A- 85.42/B 97.86/A+ 95.63/A...
ruby,json,hash,web-scraping,mechanize
The same object, the result of @categories_hash['category'], is being updated each loop. Thus the array is filled with the same object 1747 times, and the object reflects the mutations done on the last loop when it is viewed later. While a fix might be to use @categories_hash[category_name] or similar (i.e....
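The answer above concerns Ruby, but the same aliasing pitfall is easy to reproduce in any language. Here is a small Python sketch (names are illustrative) of both the bug — appending a reference to one shared object — and the copy-based fix:

```python
# The bug: every element of `rows` is the SAME list object,
# so later mutations show up in every "copy".
shared = []
rows = []
for i in range(3):
    shared.append(i)   # mutating the same object each pass
    rows.append(shared)  # appending a reference, not a snapshot
# rows is now [[0, 1, 2], [0, 1, 2], [0, 1, 2]]

# The fix: take a snapshot copy at append time.
acc = []
fixed = []
for i in range(3):
    acc.append(i)
    fixed.append(list(acc))  # independent copy of the current state
# fixed is [[0], [0, 1], [0, 1, 2]]
```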
python,web-scraping,beautifulsoup,linkedin,mechanize
Try to set up a User-Agent header. Add this line after op.set_handle_robots(False) op.addheaders = [('User-Agent', "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")] Edit: If you want to scrape web sites, first check if there is an API, or a library that deals with the API....
It's been a while since I've written Python, but I think I have a workaround for your problem. Try this method: import requests except mechanize.HTTPError: while True: ## DANGER ## ## You will need to format and/or decode the POST for your form response = requests.post('http://yourwebsite.com/formlink', data=None, json=None) ##...
Mechanize does not support JavaScript and since the model select box is populated via JavaScript, using the select boxes is not going to work. However, http://www.parkers.co.uk/ works fine (though differently) if you disable JavaScript. Disable JavaScript in your browser while using the site and you'll notice that you get a...
Got the same error; try using a user agent or requests: import requests response=requests.get('http://proxygaz.com/country/india-proxy/') print(response.status_code) 200 Using a user agent works fine: import urllib2 resp = urllib2.Request('http://proxygaz.com/country/india-proxy/') resp.add_header('User-Agent', 'FIREFOX') opener = urllib2.build_opener() print opener.open(resp).read() ...
You can get data from PubChem with their service PUG REST A simple example: import urllib2 import json def get(url): req = urllib2.Request(url) response=urllib2.urlopen(req) return response.read() pugrest = 'http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/' cmpd = 'methane' prop ='/property/MolecularFormula,MolecularWeight,CanonicalSMILES,InChI,IUPACName/JSON' data = get(pugrest+cmpd+prop) print data This will give you this json: { "PropertyTable": { "Properties": [...
You need to specify the value as a list: if br.form["list"] == ["---"]: br.form["list"].value = ["1"] According to the mechanize - Forms documentation: # Controls that represent lists (checkbox, select and radio lists) are # ListControl instances. Their values are sequences of list item names. # They come in two...
I found the problem: mechanize doesn't support JavaScript. When mechanize reaches the page after submit, the JavaScript doesn't run, so the pagination click is never triggered. I achieved what I wanted using Selenium and Beautiful Soup, with the following Selenium selector: elem1 = driver.find_element_by_link_text("Next Page >>")...
ruby-on-rails,ruby,background,web-crawler,mechanize
You can create a new thread and call the method there: http://www.ruby-doc.org/core-2.1.5/Thread.html
ruby-on-rails,ruby,web-scraping,nokogiri,mechanize
I used capybara-webkit to solve my problem
I have managed to identify the culprit here. As my development machine is Windows-based, this seems to have been an issue with mechanize (or one of its dependencies) and Windows. By specifying the b (binary) part in the second argument of File.new, the problem went away on its own. tl;dr:...
ruby-on-rails,ruby,mechanize,open-uri
I think you'll want to use a Mechanize#start block: 10.times do Mechanize.start do |minion| minion.open_timeout = 15 minion.read_timeout = 15 minion.set_proxy '212.82.126.32', 80 page = minion.get("http://www.whatsmyip.org/") proxy_ip_adress = page.parser.css('#ip').text puts proxy_ip_adress end # minion definitely doesn't exist anymore end ...
You can try out something like this: from mechanize import Browser from bs4 import BeautifulSoup br = Browser() br.set_handle_robots( False ) br.addheaders = [('User-agent', 'Firefox')] br.open("http://www.mcxindia.com/sitepages/BhavCopyCommodityWise.aspx") br.select_form("form1") #now enter the dates according to your choice br.form["mTbFromDate"] = "date-From" br.form["mTbToDate"] = "date-To" response = br.submit() #now read the response with BeautifulSoup...
You can probably use "Watir" instead. It can mimic actions (such as clicking and scrolling the page) in a browser.
ruby-on-rails,ruby,login,web-scraping,mechanize
Instead of: docwatan = Nokogiri::HTML(open('http://www.elwatan.com/')) You want to do: docwatan = agent.get('http://www.elwatan.com/') otherwise the session cookie isn't getting sent in the request....
Pulled straight from the docs and changed to your example. from selenium import webdriver # Create a new instance of the Firefox driver driver = webdriver.Firefox() # go to the page driver.get("http://www.link.cs.cmu.edu/link/submit-sentence-4.html") # the page is ajaxy so the title is originally this: print driver.title # find the element that's...
python,flask,screen-scraping,mechanize,cookiejar
Once your Flask app is started it only imports each package once. That means that when it runs into import webscrap for the second time it says “well, I already imported that earlier, so no need to take further action…” and moves on to the next line, rendering the template...
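A minimal sketch of that import-caching behaviour (the module name and file here are throwaway stand-ins for the `webscrap` module in the question): the module's top-level code runs on the first import, a repeated import is a no-op because the module is cached in `sys.modules`, and only an explicit reload re-runs it.

```python
import importlib
import os
import sys
import tempfile

workdir = tempfile.mkdtemp()
counter_path = os.path.join(workdir, "runs.txt")

# A throwaway module whose top-level code has a visible side effect,
# standing in for a module that scrapes at import time.
with open(os.path.join(workdir, "webscrap_demo.py"), "w") as f:
    f.write("with open(%r, 'a') as fh:\n    fh.write('x')\n" % counter_path)
sys.path.insert(0, workdir)

import webscrap_demo             # runs the top-level code once
import webscrap_demo             # cached in sys.modules: does nothing
importlib.reload(webscrap_demo)  # forces the top-level code to run again

with open(counter_path) as fh:
    runs = len(fh.read())        # 2: first import + reload
```

The cleaner fix for the Flask case is usually to move the scraping into a function that the view calls on each request, rather than running it at module import time; `importlib.reload` is the blunt alternative.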
python,firefox,popup,mechanize,webpage
Mechanize cannot handle javascript and popup windows: How do I use Mechanize to process JavaScript? Mechanize and Javascript To accomplish the goal, you need to utilize a real browser, headless or not. This is where selenium would help. It has a built-in support for popup dialogs: Selenium WebDriver has built-in...
python,beautifulsoup,mechanize
When combining the two it works without a problem so it might be related to the html.parser you are using. import mechanize from bs4 import BeautifulSoup URL = ('http://www.airchina.com.cn/www/jsp/airlines_operating_data/' 'exlshow_en.jsp') control_year = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014'] control_month = ['01', '02', '03', '04', '05', '06', '07',...
selenium,web-crawler,mechanize
Selenium is a browser automation tool. You can basically automate everything you can do in your browser. Start with going through the Getting Started section of the documentation. Example: from selenium import webdriver driver = webdriver.Firefox() driver.get("http://www.python.org") print driver.title driver.close() Besides automating common browsers, like Chrome, Firefox, Safari or Internet...
html,ruby-on-rails,ruby,parsing,mechanize
More succinct version relying more on the black magic of XPath :) require 'nokogiri' require 'open-uri' doc = Nokogiri::HTML(open('http://www.alpineascents.com/8000m-peaks.asp')) last_td = doc./("//tr[td[strong[text()='#{ARGV[0]}']]]/td[5]") puts last_td.text.gsub(/.*?;/, '').strip ...
python,web-scraping,web-crawler,mechanize
If you look at the request being sent to that site in developer tools, you'll see that a POST is sent as soon as you select a state. The response that is sent back has the form with the values in the city dropdown populated. So, to replicate this in...
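Once you have that POST response in hand, the remaining work is pulling the populated city options out of it. Here is a small stdlib sketch of that parse step, using made-up markup and field names (the real site's will differ):

```python
from html.parser import HTMLParser

# A stand-in for the HTML fragment that comes back after POSTing a state;
# the real field names and markup depend on the site.
response_html = """
<select name="city">
  <option value="">-- choose --</option>
  <option value="SPR">Springfield</option>
  <option value="SHE">Shelbyville</option>
</select>
"""

class OptionCollector(HTMLParser):
    """Collect (value, label) pairs from <option> tags."""
    def __init__(self):
        super().__init__()
        self.options = []
        self._value = None  # value attr of the <option> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._value = dict(attrs).get("value", "")

    def handle_data(self, data):
        if self._value is not None and data.strip():
            self.options.append((self._value, data.strip()))
            self._value = None

parser = OptionCollector()
parser.feed(response_html)
# Drop the empty placeholder option, keep the real cities.
cities = [label for value, label in parser.options if value]
```

With the option values in hand you can then issue the second POST for the city you want, replicating what the page's JavaScript would have done.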
ruby-on-rails,ruby,web-scraping,nokogiri,mechanize
You can use page.parser to gain access to the underlying Nokogiri object. http://mechanize.rubyforge.org/Mechanize/Page.html#method-i-parser require 'mechanize' agent = Mechanize.new agent.get("http://stackoverflow.com/questions/23064821/using-the-mechanize-gem-with-the-nokogirl-gem/") agent.page.parser.class # => Nokogiri::HTML::Document agent.page.parser.css("#answer-23065003 .user-details a").text # => "akatakritos" ...
$mech -> field($name, $value) field() only lets you set one name at a time. But $mech -> set_fields($name => $value, $name2 => $value2,... $nameN => $valueN) ...set_fields() allows you to set multiple names at the same time. That's not really such a big deal because you could always use the...
python,html,selenium,web-scraping,mechanize
You can select the form by the order at which it appears on the page, firstly import & open import mechanize br = mechanize.Browser() br.open('http://www.staffordshire-pcc.gov.uk/space/') Loop through all the forms in the page forms = [f.name for f in br.forms()] Lets check whether form[0] is the correct index for the...
python,beautifulsoup,mechanize
There's a library to access instagram from python. To login, you need the following code: from instagram.client import InstagramAPI access_token = "YOUR_ACCESS_TOKEN" # get this from instagram api = InstagramAPI(access_token=access_token) recent_media, next_ = api.user_recent_media(user_id="userid", count=10) for media in recent_media: print media.caption.text In other words, don't reinvent the wheel....
In the end, without some extremely complex programming, this was only possible using Selenium and unfortunately it wasn't reliable enough to rely on.
ruby,table,web-scraping,nokogiri,mechanize
Using css selector, to print text and href attribute values: require 'nokogiri' doc = Nokogiri::HTML(page.body) doc.css('table#myTable tbody td[3] a').each {|a| puts a.text, a[:href] } ...
Author here. It seems like the form was a javascript. I used Selenium instead to output keys to the form.
ruby,web-scraping,nokogiri,mechanize
I would try this, replace lien.click with page = lien.click.
Can you try to change this to: except mechanize._mechanize.FormNotFoundError: instead of this: except FormNotFoundError: ...
I've not seen it used that way. Normally you need to create an agent, then issue the get. Try this require 'rubygems' require 'mechanize' uri = URI 'http://website.com/page.html' agent = Mechanize.new file = agent.get uri filename = file.save # saves to page.html puts filename # page.html ...
ruby,css-selectors,nokogiri,mechanize
The page you are searching doesn’t contain any tbody tags. When your browser parses the page it adds the missing tbody elements into the DOM that it creates. This means that when you examine the page through the browser’s inspector and console it acts like the tbody tags exist. Nokogiri...
Mechanize has regular expressions: page.link_with(text: /foo/).click page.link_with(href: /foo/).click Here are the Mechanize criteria that generally work for links and forms: name: name_matcher id: id_matcher class: class_matcher search: search_expression xpath: xpath_expression css: css_expression action: action_matcher ... If you're curious, here's the Mechanize ElementMatcher code...
python,mechanize,mechanize-python
Your code works for me, but I would remove the line ('Accept-Encoding', 'gzip, deflate, sdch'), to avoid having to reverse that encoding afterwards. To clarify: you are getting the content, but you expect it to be in "clear text". You get clear text by not requesting gzipped content....
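A tiny sketch of why that header matters: with `Accept-Encoding: gzip` the body arrives compressed and the client must decompress it by hand, whereas without the header the server sends clear text in the first place.

```python
import gzip

page = b"<html><body>clear text</body></html>"

# What the server sends back when the client advertises gzip support:
compressed = gzip.compress(page)
assert compressed != page  # unreadable as-is

# The client then has to reverse the encoding itself:
decoded = gzip.decompress(compressed)
```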
The problem here is that you need to actually interact with the javascript on the page. requests, while being an excellent library has no support for javascript interaction, it is just an http library. If you want to interact with javascript-rich web pages in a meaningful way I would suggest...
Easiest way: details = doc.css('table > tr > th') details2 = doc.css('table > tr > td > p') details.map!.with_index { |d, i| {name: d.text, value: details2[i].text } } details will look like [{name: 'asd', value: '123'}, {name: 'qwe', value: '234'}]...
Trying to access that page what you are actually doing is being directed to an error page. Paste that url in a browser and you get a page with: Not comply with the conditions of the inquiry data and no forms at all You need to access the page in...
Your problem are some broken HTML comment tags, leading to an invalid website which mechanize's parser can't read. But you can use the included BeautifulSoup parser instead, which works in my case (Python 2.7.9, mechanize 0.2.5): #!/usr/bin/env python #-*- coding: utf-8 -*- import mechanize br = mechanize.Browser(factory=mechanize.RobustFactory()) br.open('http://mspc.bii.a-star.edu.sg/tankp/run_depth.html') br.select_form(nr=0) br['pdb_id']...
python,forms,beautifulsoup,mechanize,python-requests
There could be different solutions applied here, though it is pretty clear - this is not an easy case for mechanize. Better make that submission (POST request) using requests: import requests url = 'http://bioportal.bioontology.org/annotator' params = { 'text': 'Sample text', # this is the contents of the text area 'longest_only':...
Adding the following line to change the user agent worked in the end: agent.user_agent_alias = 'Mac Safari' ...
python,web-scraping,beautifulsoup,mechanize
You can always make the request using requests: import requests url = 'http://genome-euro.ucsc.edu/cgi-bin/hgTables?hgsid=201790284_dkwVYFu7V6ISmTzFGlXzo23aUhXk' session = requests.Session() params = { 'hgsid': '201790284_dkwVYFu7V6ISmTzFGlXzo23aUhXk', 'jsh_pageVertPos': '0', 'clade': 'mammal', 'org': 'Human', 'db': 'hg19', 'hgta_group': 'genes', 'hgta_track': 'refGene', 'hgta_table': 'refFlat', 'hgta_regionType': 'range', 'position': 'chr9:21802635-21865969', 'hgta_outputType': 'gff', 'boolshad.sendToGalaxy': '0',...
html,ruby,web-scraping,nokogiri,mechanize
Here's the url for page 2 http://store.steampowered.com/search/results?sort_by=_ASC&page=2&auction=1&snr=1_7_7_230_7 That should be all you need, parse it the same as page 1....
python,forms,mechanize,mechanize-python
In order to submit the calendarForm, you should access a different URL with mechanize: br.open('https://eappointment.ica.gov.sg/ibook/gethome.do') br.select_form(name="calendarForm") ...
python,web,browser,login,mechanize
That form submits data somewhere. You need to find out where and what method it uses. After you find out, you can use the requests library to do a one-liner, like: response = requests.post("https://controller.mobile.lan/101/portal/", data={'login': "username", 'password': "password"}) print response.text # Dumps the whole webpage after. Note that if that...
form['user[email]'] = '[email protected]' is working...
After you br.submit() you go straight into br['Verication'] = str(input_var). This is incorrect, since after br.submit() your browser no longer has a form selected. After submitting I would try: for form in br.forms(): print form to see if there is another form to be selected. Read up on...
html,ruby-on-rails,encoding,utf-8,mechanize
OK, Google doesn't use the header encoding; you need to specify it in the request string with the params "ie=utf-8&oe=utf-8"... my problem is fixed with this.
Mechanize won't cut it because it doesn't evaluate Javascript code and you really need it because the input button triggers a Javascript function. Even if you try to use Mechanize to simply submit the form, you'll get the following message: The web site is unable to acknowledge your acceptance of...
python,html,google-app-engine,login,mechanize
First of all, you should know that if there isn't a publicly available API to do all this without scraping then it's very likely that what you are doing is not welcomed by the website owners, against their terms of service and could even be illegal and punishable by law...
python,osx,python-2.7,mechanize,python-3.4
OS X comes with Python pre-installed. Exactly which version(s) depends on your OS X version (e.g., 10.10 comes with 2.6 and 2.7, 10.8 comes with 2.5-2.7 plus a partial installation of 2.3, and 10.3 comes with 2.3). These are installed in /System/Library/Frameworks/Python.framework/Versions/2.*, and their site-packages are inside /Library/Python/2.*. (The reason...
java.util.List has no foreach method, but you can convert it to a Scala list using an implicit conversion. Just add import scala.collection.convert.wrapAsScala._ to your source file.
ruby-on-rails,ruby-on-rails-4,mechanize
Try this code. It mixes .map and .each_with_object to obtain the hash and array mix you describe: namespace :data do task scrap: :environment do agent = Mechanize.new results = (3000..25000).step(1000).each_with_object({}) do |amount, h| year_costs = (1..5).step(1).map do |year| total_cost = agent.get("https://www.cashbuddy.se/api/loancost?amount=#{amount}&numberOfYears=#{year}").body {year => total_cost} end # => an array of...
python,forms,web-scraping,mechanize
I don't know if you need to select the form, but if you did so, following should just do the trick: br.find_control(type="checkbox").items[0].selected=True if you want to select all checkboxes: for i in range(0, len(br.find_control(type="checkbox").items)): br.find_control(type="checkbox").items[i].selected =True then submit br.submit() ...
According to the FAQ, Mechanize can't handle invalid HTML like "br/", and you have such code on your website. You could use the BeautifulSoup parser import mechanize browser = mechanize.Browser(factory=mechanize.RobustFactory()) browser.open("http://example.com/") print browser.forms Alternatively, you can process the HTML (and headers) arbitrarily: browser = mechanize.Browser() browser.open("http://example.com/") html = browser.response().get_data().replace("<br/>",...