BeautifulSoup (bs4) parsing wrong

python,html,python-2.7,bs4

The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to make BS4 use different parsers. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but it apparently still has the problem described above in 2.7.6. Switching to "lxml" was unsuccessful for me, but switching to "html5lib" produces...
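As a rough illustration of choosing a parser explicitly, the sketch below tries several parser names in turn and keeps the first one that is installed. The helper name make_soup, the order of preference, and the sample markup are assumptions made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Returns the soup together with the name of the parser that was used.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Deliberately sloppy markup: the <li> and <p> tags are never closed.
broken_html = "<ul><li>one<li>two</ul><p>first<p>second"
soup, parser_used = make_soup(broken_html)

print(parser_used)
print(len(soup.find_all("li")))
```

Different parsers may build different trees from the same broken markup, so it is worth printing the result of each one against the page that is causing trouble.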

Extract the URL of stored html file

python,urllib2,bs4

It is only possible if the file itself contains that URL, which is not very common. So it depends on the files you have downloaded. Look for a <link rel="canonical" ...> as this is the way search engines recommend to publish the canonical address to a web page. If they...
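A minimal sketch of that lookup, assuming the stored page really does carry a canonical link. The function name canonical_url and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def canonical_url(html):
    """Return the href of <link rel="canonical">, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("link", rel="canonical")
    if link is None:
        return None
    return link.get("href")


# Hypothetical stored page that declares its canonical address.
sample = (
    '<html><head>'
    '<link rel="canonical" href="http://example.com/article">'
    '</head><body></body></html>'
)

print(canonical_url(sample))                        # http://example.com/article
print(canonical_url("<html><head></head></html>"))  # None
```

For a file on disk, the same function can be fed the result of reading the file.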

Writing loop over multiple pages with BeautifulSoup

python,loops,beautifulsoup,mechanize,bs4

You don't actually need the requests module to iterate through paged search results; mechanize is more than enough. This is one possible approach using mechanize. First, get all paging links from the current page: links = br.links(url_regex=r"fuseaction=home.search&pageNumber=") Then iterate through the paging links, open each link and gather useful information from each...
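The loop described above might look roughly like the sketch below. It assumes br is a mechanize.Browser that has already opened the first page of results; the function name collect_result_pages is hypothetical, and what you extract from each page is left open because the original answer is cut off at that point.

```python
def collect_result_pages(br, url_regex):
    """Return the raw HTML of every page reachable through a paging link.

    `br` is assumed to be a mechanize.Browser that has already opened the
    first page of search results.
    """
    # Materialize the links before navigating, so the iteration does not
    # depend on whichever page happens to be loaded at the time.
    paging_links = list(br.links(url_regex=url_regex))

    pages = []
    for link in paging_links:
        response = br.follow_link(link)  # open the paging link
        pages.append(response.read())    # raw HTML, ready for BeautifulSoup
        br.back()                        # return to the first results page
    return pages
```

With a real browser this would be called as collect_result_pages(br, r"fuseaction=home.search&pageNumber=") after br.open() on the search URL, and each returned page can then be handed to BeautifulSoup.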

Removing newlines (\n) with BeautifulSoup

python,regex,bs4

I suspect the newline actually appears in the source HTML file. I tried to reproduce your error using your paragraphs and I didn't get any \n until I actually inserted a new line in the source file. This would also explain why it doesn't happen for other longer paragraphs: they...
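If the newline does come from the source file, one simple way to get rid of it is to collapse all whitespace after extracting the text. This is a sketch with made-up sample markup, not the code from the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph that was hard-wrapped in the source file.
html = "<p>This paragraph was wrapped\nin the source file.</p>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.p.get_text()
# split() with no arguments breaks on any run of whitespace, including \n,
# so joining with a single space normalizes the text.
clean = " ".join(raw.split())

print(repr(raw))    # 'This paragraph was wrapped\nin the source file.'
print(repr(clean))  # 'This paragraph was wrapped in the source file.'
```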

How to use Beautiful Soup's find() instead of find_all() for better runtime

python,find,beautifulsoup,findall,bs4

According to the official documentation there is a way to search by the custom data-* attributes. You should try this: line = soup.find('img', attrs={'data-a-dynamic-image': True}) ...
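A small self-contained example of that call, using invented markup in which only the second image carries the data-a-dynamic-image attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the second <img> has the data-* attribute.
html = """
<div>
  <img src="a.png">
  <img src="b.png" data-a-dynamic-image='{"b-large.png": [500, 500]}'>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# True as the value means "the attribute is present, whatever it contains".
# find() stops at the first match instead of collecting every match.
img = soup.find("img", attrs={"data-a-dynamic-image": True})

print(img["src"])  # b.png
```

The attrs dictionary is needed here because a name containing hyphens cannot be written as a Python keyword argument.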

Extract News article content from stored .html pages

python,urllib2,bs4

There are libraries for this in Python too :) Since you mentioned Java, there's a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe If you want to use pure Python libraries, there are two options: https://github.com/buriy/python-readability and https://github.com/grangier/python-goose Of the two, I...

Python // BS4 // Tags

python,tags,bs4

I think you can use: [i.contents[0].strip() for i in soup.select('td.first')] Regarding the 2nd part of your question - you want to have the fields in individual variables? You can do it, but it's probably not a great idea. Is there a reason for it? Either, you know how many of...
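To show what that list comprehension returns, here is a sketch run against an invented table. The markup and the variable name names are assumptions for the example only.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the first cell of each row has class "first".
html = """
<table>
  <tr><td class="first"> Alice </td><td>30</td></tr>
  <tr><td class="first"> Bob </td><td>25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# contents[0] is the first child of each cell (here, its text node);
# strip() removes the surrounding whitespace.
names = [td.contents[0].strip() for td in soup.select("td.first")]

print(names)  # ['Alice', 'Bob']
```

Keeping the values in one list, rather than in individual variables, means the code still works when the number of matching cells changes.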

New to Python, what am I doing wrong and not seeing tag (links) returned with BS4

python,beautifulsoup,web-crawler,bs4

You are not actually running the function, because the function call is inside the function itself. Once you correct that, you are going to get an error, because that is not a valid URL to pass to requests. Lastly, your soup.findAll('a', {'h3 class': "two-lines-name"}) is not going to find...
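The last point can be seen in a small sketch. The markup below is a guess at the page structure (links nested inside h3 elements with class two-lines-name), so treat it as an assumption rather than the asker's real page. A dictionary key such as 'h3 class' is treated as a single attribute name, which no tag has, so nothing matches; a CSS selector is one way to express "a inside h3.two-lines-name" instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is not shown in the answer.
html = """
<h3 class="two-lines-name"><a href="/item/1">First</a></h3>
<h3 class="two-lines-name"><a href="/item/2">Second</a></h3>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Looks for an <a> tag with an attribute literally named "h3 class".
no_match = soup.find_all("a", {"h3 class": "two-lines-name"})

# Selects <a> tags that are descendants of <h3 class="two-lines-name">.
links = [a["href"] for a in soup.select("h3.two-lines-name a")]

print(no_match)  # []
print(links)     # ['/item/1', '/item/2']
```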

BS4 and onclick(): how to make action?

python,django,bs4

You need to use selenium to perform clicks. Mechanize might be useful too.