
lxml — how to isolate element from children

python,html,xml,html-parsing,lxml

You could use the remove method to remove the children: import lxml.html as LH code = '''<a foo="bar">some text<b></b> here <c><d>Hi</d></c> and here</a>''' root = LH.fromstring(code) print(root.text_content()) # some text here Hi and here for elt in root: root.remove(elt) print(LH.tostring(root)) yields <a foo="bar">some text</a> Note, however, that not all text...

LXML - parse td content within tr tag

python,html,python-3.x,lxml

Use range and format for i in range(1,5): contentA = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[{i}]".format(i=i))[0].text_content().strip() print(contentA) Output Total Revenue 31,821,000 30,871,000 29,904,000 for i in range(1,5): contentB = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[{i}]".format(i=i))[0].text_content().strip() print(contentB) Output Cost of Revenue 16,447,000 16,106,000 15,685,000 EDIT In [22]: d = {} In [23]: d.setdefault('Revenue', []) Out[23]: [] In [24]: for i in...

python lxml loop through all tags

python,xml,dictionary,lxml

You could try the iter() function. It will traverse all the child elements. The length comparison is there to print only those that have no children. A complete script like this one: from lxml import etree tree = etree.parse('xmlfile') for tag in tree.iter(): if not len(tag): print (tag.tag,...
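A minimal, self-contained version of that idea (the document here is made up for illustration):

```python
from lxml import etree

xml = '<root><a><b>1</b><c>2</c></a><d>3</d></root>'
tree = etree.fromstring(xml)

# len(element) counts child elements, so the leaves are the elements of length 0.
leaves = [(el.tag, el.text) for el in tree.iter() if not len(el)]
```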

Combine multiple tags with lxml

python,html,xpath,lxml

I have managed to solve my own problem. for p in self.tree.xpath('//body/p'): if p.tail is None: # some conditions specifically for my doc children = p.getchildren() if len(children)>1: for child in children: #if other stuffs present, break if child.tag!='strong' or child.tail is not None: break else: # If not break,...

XML scanning for value

python,xml,lxml

The proper way to access nodes in namespace is by passing prefix-namespace URL mapping as additional argument to xpath() method, for example : ns = {'tes' : 'http://www.blah.com/client/servlet'} source_type = sourceobject.xpath('//tes:actions/tes:type/text()', namespaces=ns) Or, another way which is less recommended, by literally ignoring namespaces using xpath function local-name() : source_type =...
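Both variants can be sketched against a small made-up document; the prefix used in the expression does not have to match the one in the document, only the namespace URI does:

```python
from lxml import etree

xml = ('<root xmlns:tes="http://www.blah.com/client/servlet">'
       '<tes:actions><tes:type>copy</tes:type></tes:actions></root>')
sourceobject = etree.fromstring(xml)

# recommended: map a prefix to the namespace URI
ns = {'tes': 'http://www.blah.com/client/servlet'}
source_type = sourceobject.xpath('//tes:actions/tes:type/text()', namespaces=ns)

# less recommended: ignore namespaces entirely with local-name()
source_type_2 = sourceobject.xpath("//*[local-name()='type']/text()")
```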

How do you write a constraint correctly and/or get lxml to throw an exception when such a constraint is violated? [duplicate]

python,xml,xsd,lxml

Looks like I just had the <xs:unique> constraint in the wrong place. It should've been: <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="parent"> <xs:complexType> <xs:sequence maxOccurs="unbounded"> <xs:element name="child"> <xs:complexType> <xs:attribute name="foo" type="xs:string" /> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> <xs:unique...

Pandas read_html equivalent for a lxml table

python,pandas,lxml

I am able to now answer my own question and maybe this can be of assistance to others. I tried modifying the read_html source code in pandas without much success because of some recognition issues. Nonetheless the answer is much simpler than you might think. >>>import pandas as pd >>>import...

Print only not null values

python,python-2.7,xpath,lxml

You are getting empty lists, not None objects. You are testing for the wrong thing here; you see [], while if a Python null was being returned you'd see None instead. The Element.xpath() method will always return a list object, and it can be empty. Use a boolean test: percentage...
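A small sketch of the boolean test (the document is invented for illustration):

```python
from lxml import etree

doc = etree.fromstring('<items><item>5%</item><item/></items>')

results = []
for item in doc.xpath('//item'):
    percentage = item.xpath('./text()')  # always a list, possibly empty
    if percentage:                       # [] is falsy, so empty results are skipped
        results.append(percentage[0])
```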

How can I strip namespaces out of an lxml tree?

python,xml,lxml,xml-namespaces,prefix

One possible way to remove the namespace prefix from each element: def strip_ns_prefix(tree): #iterate through only element nodes (skip comment nodes, text nodes, etc): for element in tree.xpath('descendant-or-self::*'): #if element has a prefix... if element.prefix: #replace element name with its local name element.tag = etree.QName(element).localname return tree Another version which...
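A runnable sketch of that function on a made-up document (the example.com namespace is hypothetical):

```python
from lxml import etree

def strip_ns_prefix(tree):
    # visit element nodes only (comments and text nodes have no prefix to strip)
    for element in tree.xpath('descendant-or-self::*'):
        if element.prefix:
            # replace the element name with its local name
            element.tag = etree.QName(element).localname
    return tree

xml = '<x:root xmlns:x="http://example.com/ns"><x:child/><plain/></x:root>'
tree = strip_ns_prefix(etree.fromstring(xml))
tags = [el.tag for el in tree.iter()]
```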

how to get unresolved entities from html attributes using python and lxml

python,html,python-2.7,lxml

An unescaped character is invalid in HTML, and the HTML abstraction model (lxml.etree in this case) only works with valid HTML. So there is no notion of an unescaped character once the source HTML is loaded into the object model. Given unescaped characters in the HTML source, the parser will either fail completely, or try to...

XSLT template does not apply to all elements using lxml

xml,xslt,lxml

That really seems odd - unless you realize what happens. This has nothing to do with how lxml performs XSLT transformations, as far as I can see. It's just that lxml.etree.tostring() expects an object containing well-formed HTML or XML as input. You don't hand it well-formed markup: <?xml version="1.0"?> <h1>book</h1>...

Japanese characters screwing up lxml parsing

python,lxml

I imagine you want: runtime_text = node.xpath(u"//dl/dt[text()='Runtime:' or text()='Laufzeit:' or text()='再生時間:']/following-sibling::dd")[0].text.strip() lxml probably doesn't understand python's unicode literals...

Set lxml as default BeautifulSoup parser

python,html,beautifulsoup,html-parsing,lxml

According to the Specifying the parser to use documentation page: The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed. If you don’t specify anything, you’ll get the best HTML parser that’s...

lxml - how to remove element but not it's content?

python,lxml

Consider using copy.deepcopy as suggested in the docs. For example: div_parents = root_element.xpath('//div[div]') for outer_div in div_parents: if len(outer_div.getchildren()) == 1: inner_div = outer_div[0] # Copy the children of inner_div to outer_div for e in inner_div: outer_div.append( copy.deepcopy(e) ) # Remove inner_div from outer_div outer_div.remove(inner_div) Full code used to test:...

With Xpath, how do you select these elements but not those?

xpath,lxml,scrape

You could use the XPath //div[starts-with(@class,"cl1 ")]; note the space after cl1. For example, In [20]: import lxml.html as LH In [21]: doc = LH.parse('data.html') In [24]: doc.xpath('//div[starts-with(@class,"cl1 ")]') Out[24]: [<Element div at 0x7f0568c68100>, <Element div at 0x7f0568c68158>] In [25]: [LH.tostring(elt) for elt in doc.xpath('//div[starts-with(@class,"cl1 ")]')] Out[25]: ['<div class="cl1 a"></div>\n',...
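The trailing space is what keeps a class like cl10 from matching; a compact demonstration:

```python
import lxml.html as LH

html = ('<div><div class="cl1 a"></div><div class="cl1 b"></div>'
        '<div class="cl10 c"></div></div>')
doc = LH.fromstring(html)

# "cl10 c" starts with "cl1" but not with "cl1 ", so it is excluded
matches = [d.get('class') for d in doc.xpath('//div[starts-with(@class,"cl1 ")]')]
```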

lxml: I can't remove a span tag and the text inside

html,lxml

You can try with strip_elements() function, like: from lxml import etree as et parser = et.XMLParser(remove_blank_text=True, recover=True) documentXml=et.parse(html_FileName, parser) for class1Node in documentXml.xpath('//div[@class="class1-text"]'): chineseNode=class1Node.xpath('.//span[@class="class3"]') et.strip_elements(chineseNode[0].getparent(), 'span', with_tail=False) print(et.tostring(documentXml)) It yields: b'<div num="1" class="class1"><div class="class1-text"><span class="class2">\n some english...

python lxml.html.parse not reading url

python,lxml,python-requests

Adding some clarification to @michael_stackof's answer. This particular URL returns a 403 Forbidden status code if the User-Agent header is not supplied. According to lxml's source code, it uses urllib2.urlopen() without supplying a User-Agent header, which results in the 403, which in turn causes the failed to load HTTP resource error. On the...

type error saving lxml elements in shelve

python,lxml,pickle

Only picklable data types can be stored in a shelf -- in particular, types added by C extensions need explicit support to be picklable, and lxml has not, as of this date, had that support written. Unless you're willing to provide a patch to upstream lxml and shepherd it through merge...

Scraping data python lxml

python,lxml

Use XPath to select the text explicitly: //legend[@align='center']/a/text() The Xpath Helper plugin for Chrome helps a lot when writing XPath queries...

Python 3.4 : LXML web scraping

python,lxml

Browsers add in missing HTML elements that the HTML specification states are part of the model. lxml does not add those in. The most common such element is the <tbody> element. Your document has no such element, but Chrome does and they put it in your XPath. Another such an...

Gracefully recover from parse error in lxml

python,xml,parsing,exception,lxml

You can use finally: try: #something except etree.XMLSyntaxError: #something else finally: root = tree.getroot() ...

LXML to write in unicode?

python,unicode,lxml

Yes, you can pass an encoding to the etree.tostring method using the encoding parameter: etree.tostring(node, pretty_print=True, encoding='unicode') From the etree.tostring docs: You can also serialise to a Unicode string without declaration by passing the unicode function as encoding (or str in Py3), or the name 'unicode'. This changes the return value...

Join consecutive HTML tags of the same kind, same CSS class

html,xslt,lxml

The following stylesheet: XSLT 1.0 <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:strip-space elements="*"/> <xsl:template match="p"> <xsl:copy> <xsl:copy-of select="@*"/> <span class="{span[1]/@class}"> <xsl:apply-templates/> </span> </xsl:copy> </xsl:template> <xsl:template match="br"> <xsl:text> </xsl:text> </xsl:template>...

How do I get all content between two html tags in Python?

python,xml,xpath,beautifulsoup,lxml

You can try this way : import lxml.html.soupparser import lxml.html my_html_page = '''...some html markup here...''' root = lxml.html.soupparser.fromstring(my_html_page) for elem in root.xpath("//div[@class='post_body']"): result = elem.text + ''.join(lxml.html.tostring(e, pretty_print=True) for e in elem) print result result variable constructed by combining text nodes within parent <div> with markup of all of...

Python XML parsing, lxml, urllib.request

python,xml,lxml,urllib

Per the doc string (see help(ET.parse)), ET.parse expects the first argument to be a file name/path import lxml.etree as ET tree = ET.parse(filename) a file object with open('data.xml') as f: tree = ET.parse(f) a file-like object import io tree = ET.parse(io.BytesIO(data)) a URL using the HTTP or FTP protocol import...
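For instance, feeding parse() an in-memory file-like object (a stand-in for any of the accepted inputs):

```python
import io
import lxml.etree as ET

data = b'<root><child>hi</child></root>'
tree = ET.parse(io.BytesIO(data))   # same API as parse(filename) or parse(url)
root = tree.getroot()
```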

Parsing XML using LXML and Python

python,xml,lxml

You will end up with different lists as you are looking for different things, but that should be alright. The goal here is not to immediately get the result out of an lxml interface, but get something later that you can then put into a mysql database, yes? If so,...

Python: specifying the namespace in an lxml.etree path

python,xml,svg,lxml,xml-namespaces

Mapping the default namespace to the None prefix didn't work for me either. You can, however, map it to a normal string prefix and use that prefix in the xpath; the rest of your code works without any change: from lxml import etree XHTML_NAMESPACE = "http://www.w3.org/2000/svg" XHTML = "{%s}" %...

Merging xpath generated lists, Python, lxml

python,xml,list,xpath,lxml

Using zip, you will get a list (or an iterator if you use Python 3.x) of tuples: records = zip( tree.xpath("/rss/channel/item/title/text()"), tree.xpath("/rss/channel/item/link/text()"), tree.xpath("/rss/channel/item/pubDate/text()") ) # records => [ # ('Klistermärkesuppsättning i Trollhättan', # 'https://www.nordfront.se/klistermarkesuppsattning-trollhattan.smr', # 'Mon, 02 Feb 2015 14:00:37 +0000'), # ('Dror Feiler vill inte betala sina böter –...
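A toy feed shows the pairing; note this assumes every item has exactly one title, link and so on, otherwise the columns drift out of step:

```python
from lxml import etree

rss = ('<rss><channel>'
       '<item><title>T1</title><link>L1</link></item>'
       '<item><title>T2</title><link>L2</link></item>'
       '</channel></rss>')
tree = etree.fromstring(rss)

# zip pairs the n-th title with the n-th link
records = list(zip(
    tree.xpath('/rss/channel/item/title/text()'),
    tree.xpath('/rss/channel/item/link/text()'),
))
```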

how to write the opening of an xml doc in lxml?

python,xml,lxml,cxml

You can pass them as arguments to the tostring() method. An example: from lxml import etree root = etree.Element('root') etree.SubElement(root, 'child1') etree.SubElement(root, 'child2') print etree.tostring(root, encoding='UTF-8', xml_declaration=True, doctype='''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">''') That yields: <?xml version='1.0' encoding='UTF-8'?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"...
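A compact variant with a simpler, made-up doctype, checking what ends up in the serialized output:

```python
from lxml import etree

root = etree.Element('root')
etree.SubElement(root, 'child1')

out = etree.tostring(root, encoding='UTF-8', xml_declaration=True,
                     doctype='<!DOCTYPE root>').decode('utf-8')
```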

Python - ccsselect with spaces

python,html,css,lxml

To match a div with both the photo and cover classes, use div.photo.cover. img = page.cssselect('div.photo.cover div.outer a') Instead of thinking of class="photo cover" as a class attribute with photo cover as a single value, think of it as a class attribute with photo and cover as values....

How to indent XML with lxml?

python,lxml

doc.write('albumtest.xml', pretty_print=True) ...

How can I set up lxml and pypy on Yosemite?

python,osx,lxml,pypy

PyPy does not work with lxml (at least not well, even when it appears to), because lxml is built on top of Cython, which uses CPython C API bindings. Consider using lxml-cffi instead: https://github.com/amauryfa/lxml/tree/cffi

How can I modify the attributes of an SVG file from Python?

python,svg,lxml

You do not set the attribute again, but use its value in place of the element in this line: yes_votes.set(yes_votes, str(int(yes_votes) + 1)) yes_votes contains the content of the attribute, not a reference to the attribute itself. Change it to: element.set( "data-yes", str(int(yes_votes) + 1)) ...

Case insensitive xpath [duplicate]

python,xpath,lxml

You can use the following XPATH: //locales/locale[translate(@name,'nl','NL')='NL-NL'] Or, if there are just two values, you can use even this: //locales/locale[@name='NL-NL' or @name = 'nl-NL'] ...
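A sketch of the translate() trick on an invented document; translate() upper-cases n and l before comparing, so any case mix of those letters matches:

```python
from lxml import etree

doc = etree.fromstring(
    '<locales><locale name="nl-NL"/><locale name="fr-FR"/></locales>')

hits = doc.xpath("//locales/locale[translate(@name,'nl','NL')='NL-NL']")
names = [h.get('name') for h in hits]
```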

Scrape data using lxml python

python,json,dictionary,lxml

findtext() would be handy in this case: league = tree.findtext("//legend[@align='center']/a") ...

How to extract efficientely content from an xml with python?

python,xml,python-2.7,pandas,lxml

There are several things wrong here. (Asking questions on selecting a library is against the rules here, so I'm ignoring that part of the question). You need to pass in a file handle, not a file name. That is: y = BeautifulSoup(open(x)) You need to tell BeautifulSoup that it's dealing...

How to get full text inside lxml element

python,lxml

Use the .text property of the enclosing element.

Append parent to xml

python,lxml,elementtree

lxml has the function parentElem.insert(position, new_element), which allows you to insert a new child element under its parent. You can find an example here and here (section Elements are lists)

How to merge two different paths in a XML file?

python,xml,parsing,xml-parsing,lxml

So there wasn't too much detail to go on here, but this at least give the correct output: from lxml import etree root = etree.fromstring(xml) replace_set = {} for node in root.iter("Node"): if 'NodeRef' in [c.tag for c in node]: # This is a <Node> type with child element <NodeRef>....

Extracting data from webpage using lxml XPath in Python

python,xpath,web-crawler,lxml,python-requests

I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty. I think your problem is that it is filled using javascript and javascript will not be automatically evaluated just by reading the html with lxml.html You might...

Creating xsd document from file download

python,amazon-s3,xsd,lxml

Use .content, which is of type bytes: >>> from lxml import etree >>> xsd_url = 'https://s3-us-west-1.amazonaws.com/premiere-avails/movie.xsd.xml' >>> node = etree.fromstring(requests.get(xsd_url).content) The problem is that your xml file specifies an encoding, and therefore it is the xml parser's job to decode this encoding. But your code uses .text, which asks requests to...

SOLVED: Installing lxml, libxml2, libxslt on Windows 8.1

python,windows,module,installation,lxml

I was able to fix the installation with the following steps. I hope others find this helpful. My installation of "pip" was working fine before the problem. I went to the Windows command line and made sure that "wheel" was installed. C:\Python34>python -m pip install wheel Requirement already satisfied (use...

Convert XML to string to find length

python,lxml

To serialize to a unicode string (thus, getting length in characters) in Python 2: root_str = etree.tostring(root_el, encoding=unicode) root_len_chars = len(root_str) To encode that unicode string in UTF-8, and get the length in bytes (for that encoding): root_len_bytes = len(root_str.encode('utf-8')) ...

Python lxml - find tag block ammend

python,xml,xml-parsing,tags,lxml

You can make an xpath expression to get all products where territory is IE: //product[territory = "IE"] But, you need to handle an empty namespace here: from lxml import etree data = """<?xml version="1.0" encoding="UTF-8"?> <package xmlns="http://apple.com/itunes/importer" version="film5.0"> <video> <products> <product> <territory>GB</territory> <cleared_for_sale>true</cleared_for_sale> <wholesale_price_tier>1</wholesale_price_tier> </product> <product>...
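A trimmed version of the document shows the empty-namespace handling: lxml's xpath() rejects an empty prefix, so the default namespace has to be bound to an arbitrary one:

```python
from lxml import etree

data = ('<package xmlns="http://apple.com/itunes/importer">'
        '<product><territory>GB</territory></product>'
        '<product><territory>IE</territory></product>'
        '</package>')
root = etree.fromstring(data)

ns = {'n': 'http://apple.com/itunes/importer'}  # 'n' is an arbitrary prefix
ie_products = root.xpath('//n:product[n:territory = "IE"]', namespaces=ns)
territories = [p.findtext('n:territory', namespaces=ns) for p in ie_products]
```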

Python 3.4 : LXML : Parsing Tables

python,python-3.x,datatables,lxml

Rather than use a huge long XPath as generated by Chrome, you can just search for a table with the yfnc_tabledata1 class; there is just the one: >>> tree.xpath("//table[@class='yfnc_tabledata1']") [<Element table at 0x10445e788>] Get to your <td> from there: >>> tree.xpath("//table[@class='yfnc_tabledata1']//td[1]")[0].text_content() 'Period EndingDec 31, 2014Dec 31, 2013Dec 31, 2012\n \n...

Parse HTML from local file

python,html,google-app-engine,lxml

Your working directory is the base of your app directory. So if your app is organized like: app.yaml nl/ home.html You can then read your file at nl/home.html (assuming you haven't changed your working directory)....

Python 2.7 Etree/lxml minimizing [duplicate]

python,xsd,formatting,lxml

Use the method parameter of the write method; valid values are "html" or "xml". E.g. tree.write("output.xsd", method="html") There is also a pretty_print parameter, which takes True or False, e.g. tree.write("output.xsd", method="html", pretty_print=True) The full signature has many parameters: write(self, file, encoding=None, method="xml", pretty_print=False, xml_declaration=None, with_tail=True, standalone=None, compression=0, exclusive=False, with_comments=True, inclusive_ns_prefixes=None) ...

how to make a selection?

python,xpath,lxml

If you are looking for an XPath-only solution, you could pick off all the pieces of text in the div tag using //text(): import lxml.html as LH doc = LH.parse('data', parser=LH.HTMLParser(encoding='utf-8')) for elt in doc.xpath( '//table[@id="single-stone"]/following-sibling::div/div//text()'): print(elt) yields Гранит Мансуровский продается в готовых изделиях, а также слэбах. Сейчас Мансуровский есть...

Get inner text from lxml

python,lxml

That's often referred to as "inner xml" rather than "inner text". This is one possible way to get inner xml of an element : import lxml.etree as etree import lxml.html html = "<p>this is <b>the</b> good stuff<p>" tree = lxml.html.fromstring(html) node = tree.xpath("//p")[0] result = node.text + ''.join(etree.tostring(e) for e...

Problems with parsing xml

python,xml,csv,lxml

What you need is: root = ET.fromstring(data) and to omit the line: tree = ET.parse('bikeStations.xml') Since connection.read() returns a string, you can parse the XML string directly using the fromstring method; you can read more HERE....

Issue with parsing html with lxml by xpath

python,parsing,xpath,lxml,lxml.html

You can make a shorter and more reliable xpath expression, and you have to use namespaces: tree.xpath('//n:text[@class="bar-text-label"]/text()', namespaces={'n': 'http://www.w3.org/2000/svg'}) An alternative solution could be to use the selenium browser automation package: from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC driver = webdriver.Firefox()...

Using lxml in Python, I need to replace “RNA” with RNA in input xml file. Code below

python-2.7,lxml

You are setting XML markup in the element's .text, so when the document is written out it is interpreted as text, not markup, and the characters are escaped with &...;. What you want to do is: divide .text into three parts: before the new tag, inside the new tag, after the new tag add new tag...

Get value using lxml

python,html,html-parsing,lxml,lxml.html

This is called the .tail of an element: from lxml.html import fromstring data = """ <div class="txt-block"> <h4 class="inline">Aspect Ratio:</h4> 2.35 : 1 </div> """ root = fromstring(data) print root.xpath('//h4[@class="inline"]')[0].tail Prints 2.35 : 1. As an alternative, you can get the following text sibling of the h4 element: root.xpath('//h4[@class="inline"]/following-sibling::text()')[0] Also,...
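Both ways of reading that text in one runnable sketch:

```python
from lxml.html import fromstring

data = '<div class="txt-block"><h4 class="inline">Aspect Ratio:</h4> 2.35 : 1 </div>'
root = fromstring(data)

h4 = root.xpath('//h4[@class="inline"]')[0]
ratio = h4.tail.strip()   # .tail is the text immediately following </h4>

# or as a following text sibling:
ratio2 = root.xpath('//h4[@class="inline"]/following-sibling::text()')[0].strip()
```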

Obtaining position info when parsing HTML in Python

python,html,parsing,lxml,html5lib

After some additional research and more carefully reviewing of the source code of html5lib, I discovered that html5lib.tokenizer.HTMLTokenizer does retain partial position information. By "partial," I mean that it knows the line and column of the last character of a given token. Unfortunately, it does not retain the position of...

URL with Ukrainian characters giving UnicodeEncodeError

python,character-encoding,lxml

Your issue is that you have non-ASCII characters within your URL path which must be properly encoded using urllib.parse.quote(string) in Python 3 or urllib.quote(string) in Python 2. # Python 3 import urllib.parse url = 'http://www.lingvo.ua' + urllib.parse.quote('/uk/Interpret/uk-ru/вікно') # Python 2 import urllib url = 'http://www.lingvo.ua' + urllib.quote(u'/uk/Interpret/uk-ru/вікно'.encode('UTF-8')) NOTE: According to...

Problems grabbing data with xpath and lxml in Python

python,xpath,lxml

As far as I can see, the correct URL is http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$filter=month%28NEW_DATE%29%20eq%201%20and%20year%28NEW_DATE%29%20eq%202015 whereas in your code there is (no ? after Data) http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData$filter=month%28NEW_DATE%29%20eq%201%20and%20year%28NEW_DATE%29%20eq%202015 That's why your code currently produces the following XML: <?xml version="1.0" encoding="utf-8" standalone="yes"?> <error...

Namespace argument in lxml parsing

python,lxml

You need to use the namespace prefix while querying. like node.xpath('//html:body', namespaces={'html': 'http://...'}) or you can use the .nsmap node.xpath('//html:body', namespaces=node.nsmap) This assumes all the namespaces are defined on tag pointed by node. This is usually true for most xml documents....

python - find xpath of element containing string

python,html,xml,xpath,lxml

You can use getpath() to get xpath from element, for example : import requests from lxml import html page = requests.get("http://www.w3schools.com/xpath/") root = html.fromstring(page.text) tree = root.getroottree() result = root.xpath('//*[. = "XML"]') for r in result: print(tree.getpath(r)) Output : /html/body/div[3]/div/ul/li[10] /html/body/div[3]/div/ul/li[10]/a /html/body/div[4]/div/div[2]/div[2]/div[1]/div/ul/li[2] /html/body/div[5]/div/div[6]/h3 /html/body/div[6]/div/div[4]/h3 /html/body/div[7]/div/div[4]/h3 ...

Issue Parsing a site with lxml and xpath in python

python,xpath,lxml

The primary problem here is that the actual document served from the server contains an empty table: <table id="cardtable" class="cardlist"/> The data is filled in after the page loads by the embedded javascript that follows the empty table element: <script> $('#cardtable').dataTable({ "aLengthMenu": [[25, 100, -1], [25, 100, "All"]], "bDeferRender": true,...

lxml startswith for xpath

python,lxml

You can use this syntax: /locale[starts-with(@name, "en")] ...

Difference between a/img/..//text() and a//text()

xpath,scrapy,lxml,lxml.html

The difference between a/img/..//text() and a//text() is that the first will return text nodes only from a elements that have img elements as children, whereas the second will return text nodes from a elements irrespective of whether they have img elements as children. Put another way, a/img/..//text() could equally be...

Scraping IMDb Review Page with lxml and requests package

python,lxml,lxml.html

You can use the following line to extract the number of reviews: number_of_reviews = int(reviewPage.xpath("//div[@id = 'tn15content']/table[2]/tr/td[2]")[0].text_content().split()[0]) You can even use your own code if you modify it a little bit. The problem lies in your XPath. Get rid of the tbody part and it works. number_of_reviews = reviewPage.xpath("/html/body/div[1]/div/layer/div[4]/div[3]/div[3]/div[3]/table[2]/tr/td[2]")[0] You...

Find all HTML and non-HTML encoded URLs in string

python,html,regex,beautifulsoup,lxml

I upvoted because it triggered my curiosity. There seems to be a library called twitter-text-python, that parses Twitter posts to detect both urls and hrefs. Otherwise, I would go with the combination regex + lxml

how to use lxml to get the url

python,html,lxml

res.content is a string, an HTML string. You need to use lxml.html.fromstring(): import lxml.html import requests res = requests.get('http://www.ipeen.com.tw/comment/778246') doc = lxml.html.fromstring(res.content) name = doc.xpath(".//meta[@itemprop='name']/@content") print name ...

Installing lxml on RHEL for Python 2.7

python,lxml,easy-install

The short answer is, not easily. I'm assuming you're using RHEL5 since you mentioned python 2.4 was in the repo. The problem is that lxml is not a pure python library, it is a rather thin wrapper around two C/C++ libraries, libxml2 and libxslt. This means that any RPM installer...

Parsing XML in Python using LXML

python,xml,lxml

I think this is what you need to get the id from the current node referenced by the leagues variable: leagueid = leagues.xpath('./id/text()') The above XPath looks for the child node <id> of the current <league> node....

Python - Requests: Correctly Using Params?

python,html,request,lxml,lxml.html

The <input> elements with the name attributes equal to: a, b, c, d, e, f, g(the radio button Daily/Weekly/Monthly) are inside a <form> tag, which has this hidden form field: <input type="hidden" name="s" value="^GSPC" data-rapid_p="11"> That will send a name/value pair to the server just like a regular <input> element....

Parsing xpath with python

python,xpath,lxml,lxml.html

How about using .text_content()? .text_content(): Returns the text content of the element, including the text content of its children, with no markup. table = tree.xpath('//table/tr') for item in table: print ' '.join(item.text_content().split()) join()+split() here help to replace multiple spaces with a single one. It prints: February 20, 2015 9:00 PM...

python install lxml on mac os 10.10.1

python,osx,python-2.7,scrapy,lxml

I had that problem, and here is what I did: installed all of Xcode (2.8 GB) from the App Store. To be sure that the installation finished successfully, open a terminal and type xcode-select -p; you should get something like this: /Applications/Xcode.app/Contents/Developer Now you need to install the command line tools; try typing gcc...

How to set up XPath query for HTML parsing?

python,xml,parsing,xpath,lxml

It is important to inspect the string returned by page.text and not just rely on the page source as returned by your Chrome browser. Web sites can return different content depending on the User-Agent, and moreover, GUI browsers such as your Chrome browser may change the content by executing JavaScript...

LXML namespace in sub-elements formatted like XML spreadsheet

python,xml,excel,lxml,spreadsheet

If you need to use a namespace for an attribute of an Element or SubElement, you can't pass it using the **kwargs (**_extra) syntax, as that only allows specifying attributes whose names are valid Python identifiers. So in this case you need to use the...

how to install lxml with pypy in virtualenv

python,lxml,pypy

I had a similar problem. I got it working by running sudo apt-get install python-dev libxml2 libxml2-dev libxslt-dev and then pip install --upgrade lxml

Why am I unable to receive data from only this website?

python,lxml,python-requests

Provide a User-Agent header and it would work for you: webpage = 'http://www.whosampled.com/search/?q=de+la+soul' page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}) Proof: >>> from lxml import html >>> import requests >>> >>> webpage = 'http://www.whosampled.com/search/?q=de+la+soul' >>> page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0...

Why won't lxml strip section tags?

python,html,lxml

Why is this the case? It isn't. The documentation says the following: Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. So: >>> tree = etree.fromstring('<section> outer <section> inner </section> </section>') >>> etree.strip_tags(tree,...
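The quoted example, completed: the outer section element survives because it is the node that was passed in; only the inner one is stripped:

```python
from lxml import etree

tree = etree.fromstring('<section> outer <section> inner </section> </section>')
etree.strip_tags(tree, 'section')   # only descendants are affected
result = etree.tostring(tree).decode()
```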

Saving spider results to database

python,python-3.x,sqlalchemy,web-scraping,lxml

As far as I understand, this is a Scrapy-specific question. If so, you just need to activate your pipeline in settings.py: ITEM_PIPELINES = { 'myproj.pipeline.NordfrontPipeline': 100 } This would let the engine know to send the crawled items to the pipeline (see control flow): If we are not talking about...

Python and lxml.html get_element_by_id output questions

python,html,html-parsing,lxml,lxml.html

I'm afraid lxml.html cannot handle parsing this particular HTML source. It parses the h3 tag with id="productDetails" as an empty element (and this is in a default "recover" mode): <h3 class="productDescription2" id="productDetails" itemprop="description"></h3> Switch to BeautifulSoup with html5lib parser (it is extremely lenient): from urllib2 import urlopen from bs4 import...

How to add spaces between nodes when using string() on a tree in XPath

html,xslt,xpath,lxml

You can try with itertext() method, that iterates over all text content: from lxml import etree root = etree.XML('<root><div>abc</div><div>def</div></root>') print(' '.join(e for e in root.itertext())) It yields: abc def ...
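The quoted snippet, runnable end to end (join() can take the iterator directly):

```python
from lxml import etree

root = etree.XML('<root><div>abc</div><div>def</div></root>')

# string(root) would give 'abcdef'; joining itertext() keeps a separator
joined = ' '.join(root.itertext())
```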

TypeError: unhashable type: 'list' while posting data

python,list,lxml

The lxml tree.xpath() method returned a list object. You cannot use a list object as a dictionary key. If you meant to retrieve the first (perhaps only) result of the XPath query, then do so explicitly: post_data={'A': 'B', token[0]: '1'} If you needed to use all results of the query...
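A small sketch of the failure and the fix, with a made-up form field name:

```python
from lxml import etree

doc = etree.fromstring('<form><input name="csrf_token"/></form>')
token = doc.xpath('//input/@name')   # xpath() returns a list

# {'A': 'B', token: '1'} would raise TypeError: unhashable type: 'list';
# index into the list instead:
post_data = {'A': 'B', token[0]: '1'}
```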

Print Soap Body data using lxml

python,lxml

I dumped your xml into variable root, here is how you can get that piece of XML: import lxml.etree as ET createCase=root.find('.//cre:createCase',namespaces=root.nsmap) print ET.tostring(createCase, pretty_print=True) prints this: <cre:createCase xmlns:cre="http://www.code.com/abc/V1/createCase" xmlns:wsu="http://docs.oasis-open.org/30.xsd" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"> <cre:Request> <cre:ServiceAttributesGrp> <cre:MinorVer>?</cre:MinorVer>...

lxml not parsing unicode properly for HTML

python,unicode,lxml

It's just an encoding issue. It looks like you're using requests, which is good, because it does this work for you. First, requests guesses at the encoding, which you can access with r.encoding. For that page, requests guessed at utf-8. You could do: data = r.content.decode('UTF-8') # or data =...

Insert xml node in specific location

python,xml,lxml

Using etree.SubElement always appends the subelement to the end of the parent item. So instead, to insert a new element at a particular location, use item.insert(pos, subelement): import lxml.etree as etree xml_node = etree.Element("node") item = etree.SubElement(xml_node, 'Item') etree.SubElement(item, 'Name').text = 'Hello' etree.SubElement(item, 'Hero').text = '1' etree.SubElement(item, 'Date').text = '2014-01-01'...
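The difference between SubElement (always appends) and insert() in a few lines:

```python
import lxml.etree as etree

item = etree.Element('Item')
etree.SubElement(item, 'Name').text = 'Hello'
etree.SubElement(item, 'Date').text = '2014-01-01'

# insert() places the new node at an index instead of at the end:
hero = etree.Element('Hero')
hero.text = '1'
item.insert(1, hero)                # between Name and Date
order = [child.tag for child in item]
```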

How do I do thread-safe python XML validation?

python,xml,multithreading,validation,lxml

On Linux, a forked process gets a copy-on-write view of the parent's memory. You can leverage this to process a large object with little overhead and without changing the object in the parent's memory space. After creating the object, create pipes for communication and fork a child to do the...
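A minimal sketch of the fork-and-pipe pattern (POSIX only); the "work" here is a stand-in len() call on a hypothetical large object, where the real use case would run the XML validation in the child:

```python
import os

big_object = 'x' * 10_000_000  # stand-in for the expensive-to-copy object

read_fd, write_fd = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: sees a copy-on-write view of big_object, does the work,
    # sends the result back over the pipe, and exits immediately.
    os.close(read_fd)
    os.write(write_fd, str(len(big_object)).encode())
    os.close(write_fd)
    os._exit(0)
else:
    # Parent: big_object is untouched regardless of what the child did.
    os.close(write_fd)
    result = os.read(read_fd, 1024).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    print(result)
```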

lxml - how to get minimal xpath of element?

python,xpath,xml-parsing,html-parsing,lxml

It is not possible to make lxml generate any xpath expression except the "absolute" one, most importantly because there could be an enormous number of xpath expressions pointing to the same element. There are also several important points raised here: XPath generator As far as I understand from the firebug documentation and...

Select by xpath knowing only ending of element's attribute

python,xml,xpath,web-scraping,lxml

Try using the contains selector to find the element with an href attribute containing the word parent: //*[contains(@href, 'parent')] Note that if you need to match "parent" specifically at the end of the attribute, ends-with() is an XPath 2.0 function that lxml (which implements XPath 1.0) does not support; an equivalent XPath 1.0 test can be built from substring() and string-length(). ...
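A sketch of both selectors against an invented two-link document; the second expression is the XPath 1.0 substring()/string-length() replacement for ends-with():

```python
from lxml import html

doc = html.fromstring(
    '<div>'
    '<a href="/a/parent">one</a>'
    '<a href="/parent/child">two</a>'
    '</div>'
)

# contains() matches the word anywhere in the attribute
both = [a.text for a in doc.xpath('//*[contains(@href, "parent")]')]
print(both)

# XPath 1.0 equivalent of ends-with(@href, "parent")
expr = ('//*[substring(@href, string-length(@href) - '
        'string-length("parent") + 1) = "parent"]')
ends = [a.text for a in doc.xpath(expr)]
print(ends)
```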

crawling imdb database using python and lxml

python,web-crawler,lxml

The following function lxml.html.document_fromstring(string) parses a document from the given string. This always creates a correct HTML document, which means the root node is <html>, and there is a body and possibly a head. hxs = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/" + id).content) You can see the html using this code: print lxml.html.tostring(hxs) Considering...
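A small offline illustration of that behavior (the fragment is invented, not taken from IMDb):

```python
import lxml.html

# Parsing even a bare fragment yields a full document tree
fragment = '<p>Movie title</p>'
doc = lxml.html.document_fromstring(fragment)

print(doc.tag)  # the root element is always html
print(lxml.html.tostring(doc).decode())
```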

Fix the location of attributes in etree.Element

python-2.7,lxml,xml.etree

As for your first question, this seems to be a good answer. As for your second question, the "tostring()" method follows this standard (as per the documentation)....

How to find text in a specific nested tag with lxml and Python?

python-3.x,lxml

Here's my attempt. It gives your desired output for your sample input. I made it tolerant of certain tag changes like div vs. span. import xml.etree.cElementTree as etree # or: from lxml import etree body = etree.parse('test.html').find('body') for aname in body.iterfind('*[@class="aname"]'): cname = aname.find('*[@class="bname"]//a[@class="cname"]') print cname.text # title print cname.get('href')...

Xpath extract current node content including all child node

python,xpath,lxml

To get an XML node's content markup (sometimes referred to as "innerXML"), you can start by selecting the node (instead of selecting the child or the text content): from lxml import html import lxml.etree as le input = "<pre>abcdefg<b>b1b2b3</b></pre>" tree = html.fromstring(input) node = tree.xpath("//pre")[0] then combine the text...

Extracting between specific tag

python,lxml

You could use regex on the script to find the quoteDataObj variable and load its contents with JSON. Example: import re import json #...your code... content = wgetUrl(url) matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', content) if len(matches) > 0: python_dict = json.loads(matches[0]) Output: {u'altSymbol': u'CL/M5', u'assetType': u'DERIVATIVE', u'change': u'-1.39', u'code': 0, u'curmktstatus': u'REG_MKT',...

Close open tags without wrapping in

python,django,lxml

Wrap everything in a throwaway tag, then get the outermost element's inner contents, like so: def clean_custom_title(self): title = self.cleaned_data.get('custom_title') if title: title = "<foo>%s</foo>"%title title = lxml.html.fromstring(title) # title = lxml.some_function(title) # strip <foo> in a 'proper' way title = lxml.etree.tostring(title) title = title[5:-6] # This is a hack...

how to use lxml find all the src tags and replace them

python,html,html-parsing,lxml,lxml.html

At least, you need to assign the result of replace to the body: for ss in src: body = body.replace(str(ss), '') print body Though, I personally don't like the approach. Better find all tags that have src attribute and set the attribute value to an empty string: for element in...
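A sketch of the attribute-based approach on an invented fragment; //*[@src] selects every element that carries a src attribute, regardless of tag name:

```python
import lxml.html

body = lxml.html.fromstring(
    '<div><img src="a.png"/><script src="b.js"></script><p>text</p></div>'
)

# Blank the src attribute in place on every matching element
for element in body.xpath('//*[@src]'):
    element.set('src', '')

print(lxml.html.tostring(body).decode())
```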

Python: Scrape Data from Web after Inputing Info

python,web-scraping,html-parsing,lxml,lxml.html

The main issue was that you needed to make a GET request, not a POST. Plus, @Paul Lo is right about the date ranges. For the sake of example, I'm querying from 2010 to 2015. Also, you have to pass query parameters as strings. 00 evaluated to 0, requests converted...
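A sketch of building such a GET request with string-valued query parameters; the endpoint and parameter names are invented, and the request is prepared but never sent:

```python
import requests

# Passing '00' as a string keeps the leading zero; a numeric 00 would
# collapse to 0 before requests ever saw it.
req = requests.Request(
    'GET',
    'http://example.com/quotes',  # hypothetical endpoint
    params={'from': '2010', 'to': '2015', 'page': '00'},
).prepare()
print(req.url)
```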

Parsing HTML tree in lxml : how can I retrieve the text inside the element?

python,html-parsing,lxml

You can try the xpath function string(), which returns the concatenated value of all text nodes within the current element: import lxml.html html = """<h5 class="msg-delivered" style="padding:0;text-rendering:optimizeLegibility;line-height:1.1;margin-bottom:15px;-webkit-font-smoothing:antialiased;font-family:&quot;Open Sans&quot;, &quot;Helvetica Neue&quot;, Arial, Helvetica, sans-serif;color:#888888;vertical-align:middle;margin:0;font-size:13px;font-weight:300 !important">&#13;\n<i...
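A stripped-down illustration with a minimal h5 in place of the heavily styled markup above:

```python
import lxml.html

doc = lxml.html.fromstring(
    '<h5 class="msg-delivered"><i></i> Delivered to <b>inbox</b></h5>'
)

# string() concatenates every text node under the matched element
text = doc.xpath('string(//h5[@class="msg-delivered"])')
print(text.strip())
```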

Parsing forum posts using lxml/python

python,parsing,web-scraping,lxml,lxml.html

Find the element with xpath (not text) and use text_content() method: details = tree.xpath('.//div[contains(@id, "post_message_33583236")]')[0] print(details.text_content()) Prints: With all the talk about raising the minimum wage, I think the real issue is that people are not getting a liveable wage anymore. This applies to many skilled people too in which...