python,html,xml,html-parsing,lxml
You could use the remove method to remove the children: import lxml.html as LH code = '''<a foo="bar">some text<b></b> here <c><d>Hi</d></c> and here</a>''' root = LH.fromstring(code) print(root.text_content()) # some text here Hi and here for elt in root: root.remove(elt) print(LH.tostring(root)) yields <a foo="bar">some text</a> Note, however, that not all text...
Use range and format: for i in range(1,5): contentA = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[2]/td[{i}]".format(i=i))[0].text_content().strip() print(contentA) Output: Total Revenue 31,821,000 30,871,000 29,904,000 for i in range(1,5): contentB = tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr[3]/td[{i}]".format(i=i))[0].text_content().strip() print(contentB) Output: Cost of Revenue 16,447,000 16,106,000 15,685,000 EDIT In [22]: d = {} In [23]: d.setdefault('Revenue', []) Out[23]: [] In [24]: for i in...
You could try the iter() function. It traverses all of the child elements. The length comparison is there to print only those elements that have no children. A complete script like this one: from lxml import etree tree = etree.parse('xmlfile') for tag in tree.iter(): if not len(tag): print (tag.tag,...
I have managed to solve my own problem. for p in self.tree.xpath('//body/p'): if p.tail is None: # some conditions specifically for my doc children = p.getchildren() if len(children)>1: for child in children: # if other stuff is present, break if child.tag!='strong' or child.tail is not None: break else: # If not break,...
The proper way to access nodes in a namespace is to pass a prefix-to-namespace-URI mapping as an additional argument to the xpath() method, for example: ns = {'tes' : 'http://www.blah.com/client/servlet'} source_type = sourceobject.xpath('//tes:actions/tes:type/text()', namespaces=ns) Another, less recommended, way is to literally ignore namespaces using the XPath function local-name() : source_type =...
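For illustration, here is a minimal self-contained sketch of the local-name() fallback; the document and namespace URI are made up for the example:

from lxml import etree

# Hypothetical document using a default namespace
doc = etree.fromstring(
    '<actions xmlns="http://www.blah.com/client/servlet"><type>A</type></actions>')
# local-name() matches elements regardless of their namespace
print(doc.xpath('//*[local-name()="actions"]/*[local-name()="type"]/text()'))  # ['A']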
Looks like I just had the <xs:unique> constraint in the wrong place. It should've been: <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="parent"> <xs:complexType> <xs:sequence maxOccurs="unbounded"> <xs:element name="child"> <xs:complexType> <xs:attribute name="foo" type="xs:string" /> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> <xs:unique...
I am able to now answer my own question and maybe this can be of assistance to others. I tried modifying the read_html source code in pandas without much success because of some recognition issues. Nonetheless the answer is much simpler than you might think. >>> import pandas as pd >>> import...
You are getting empty lists, not None objects. You are testing for the wrong thing here; you see [], while if a Python null was being returned you'd see None instead. The Element.xpath() method will always return a list object, and it can be empty. Use a boolean test: percentage...
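A minimal self-contained sketch of the boolean test, with a hypothetical document and query:

from lxml import etree

tree = etree.fromstring('<root><span class="percentage">42%</span></root>')
results = tree.xpath('//span[@class="missing"]/text()')
if results:              # an empty list is falsy; no comparison to None needed
    value = results[0]
else:
    value = None
print(results, value)    # [] None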
python,xml,lxml,xml-namespaces,prefix
One possible way to remove the namespace prefix from each element: def strip_ns_prefix(tree): # iterate through element nodes only (skip comment nodes, text nodes, etc): for element in tree.xpath('descendant-or-self::*'): # if the element has a prefix... if element.prefix: # replace the element name with its local name element.tag = etree.QName(element).localname return tree Another version which...
Unescaped characters are invalid in HTML, and an HTML abstraction model (lxml.etree in this case) only works with valid HTML. So there is no notion of an unescaped character once the source HTML is loaded into the object model. Given unescaped characters in the HTML source, the parser will either fail completely or try to...
That really seems odd - unless you realize what happens. This has nothing to do with how lxml performs XSLT transformations, as far as I can see. It's just that lxml.etree.tostring() expects an object containing well-formed HTML or XML as input. You don't hand it well-formed markup: <?xml version="1.0"?> <h1>book</h1>...
I imagine you want: runtime_text = node.xpath(u"//dl/dt[text()='Runtime:' or text()='Laufzeit:' or text()='再生時間:']/following-sibling::dd")[0].text.strip() lxml probably doesn't understand Python's unicode literals...
python,html,beautifulsoup,html-parsing,lxml
According to the Specifying the parser to use documentation page: The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed. If you don’t specify anything, you’ll get the best HTML parser that’s...
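A short sketch of passing that second argument explicitly, assuming lxml is installed:

from bs4 import BeautifulSoup

markup = '<p>Some <b>bad<br>markup</p>'
# Request the lxml parser explicitly instead of relying on the "best available"
# default, so the same markup parses identically on every machine:
soup = BeautifulSoup(markup, 'lxml')
print(soup.p.get_text())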
Consider using copy.deepcopy as suggested in the docs: For example: div_parents = root_element.xpath('//div[div]') for outer_div in div_parents: if len(outer_div.getchildren()) == 1: inner_div = outer_div[0] # Copy the children of inner_div to outer_div for e in inner_div: outer_div.append( copy.deepcopy(e) ) # Remove inner_div from outer_div outer_div.remove(inner_div) Full code used to test:...
You could use the XPath //div[starts-with(@class,"cl1 ")]; note the space after cl1. For example, In [20]: import lxml.html as LH In [21]: doc = LH.parse('data.html') In [24]: doc.xpath('//div[starts-with(@class,"cl1 ")]') Out[24]: [<Element div at 0x7f0568c68100>, <Element div at 0x7f0568c68158>] In [25]: [LH.tostring(elt) for elt in doc.xpath('//div[starts-with(@class,"cl1 ")]')] Out[25]: ['<div class="cl1 a"></div>\n',...
You can try the strip_elements() function, like: from lxml import etree as et parser = et.XMLParser(remove_blank_text=True, recover=True) documentXml=et.parse(html_FileName, parser) for class1Node in documentXml.xpath('//div[@class="class1-text"]'): chineseNode=class1Node.xpath('.//span[@class="class3"]') et.strip_elements(chineseNode[0].getparent(), 'span', with_tail=False) print(et.tostring(documentXml)) It yields: b'<div num="1" class="class1"><div class="class1-text"><span class="class2">\n some english...
Adding some clarification to @michael_stackof's answer. This particular URL would return a 403 Forbidden status code if the User-Agent header is not supplied. According to lxml's source code, it uses urllib2.urlopen() without supplying a User-Agent header, which results in a 403, which in turn results in the failed to load HTTP resource error. On the...
Only picklable data types can be stored in a shelf -- in particular, types added by C extensions need explicit support to be picklable, and lxml has not, as of this date, had that support written. Unless you're willing to provide a patch to upstream lxml and shepherd it through merge...
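A common workaround (a sketch, not from the answer above) is to shelve the serialized document instead of the element object itself:

import shelve
from lxml import etree

root = etree.fromstring('<root><child/></root>')
db = shelve.open('cache')
db['doc'] = etree.tostring(root)        # byte strings pickle fine
restored = etree.fromstring(db['doc'])  # re-parse when reading back
db.close()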
Use xpath to select the text explicitly: //legend[@align='center']/a/text() This Chrome plugin helps a lot when writing lxml queries: XPath Helper...
Browsers add in missing HTML elements that the HTML specification states are part of the model. lxml does not add those in. The most common such element is <tbody>. Your document has no such element, but Chrome adds one and puts it in your XPath. Another such an...
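A self-contained sketch of the mismatch, using a hypothetical single-cell table:

import lxml.html as LH

doc = LH.fromstring(
    '<html><body><table><tr><td>cell</td></tr></table></body></html>')
# Chrome's DOM inserts <tbody>, so its suggested path finds nothing in lxml:
print(doc.xpath('//table/tbody/tr/td'))    # []
# Dropping the tbody step matches the document as actually parsed:
print(doc.xpath('//table/tr/td')[0].text)  # 'cell'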
Yes, you can pass an encoding to the etree.tostring() method using the encoding parameter: etree.tostring(node, pretty_print=True, encoding='unicode') From the etree.tostring docs: You can also serialise to a Unicode string without declaration by passing the unicode function as encoding (or str in Py3), or the name 'unicode'. This changes the return value...
The following stylesheet: XSLT 1.0 <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:strip-space elements="*"/> <xsl:template match="p"> <xsl:copy> <xsl:copy-of select="@*"/> <span class="{span[1]/@class}"> <xsl:apply-templates/> </span> </xsl:copy> </xsl:template> <xsl:template match="br"> <xsl:text> </xsl:text> </xsl:template>...
python,xml,xpath,beautifulsoup,lxml
You can try it this way: import lxml.html.soupparser import lxml.html my_html_page = '''...some html markup here...''' root = lxml.html.soupparser.fromstring(my_html_page) for elem in root.xpath("//div[@class='post_body']"): result = elem.text + ''.join(lxml.html.tostring(e, pretty_print=True) for e in elem) print result The result variable is constructed by combining the text nodes within the parent <div> with the markup of all of...
Per the doc string (see help(ET.parse)), ET.parse expects the first argument to be a file name/path: import lxml.etree as ET tree = ET.parse(filename) a file object: with open('data.xml') as f: tree = ET.parse(f) a file-like object: import io tree = ET.parse(io.BytesIO(data)) or a URL using the HTTP or FTP protocol: import...
You will end up with different lists as you are looking for different things, but that should be alright. The goal here is not to immediately get the result out of an lxml interface, but get something later that you can then put into a mysql database, yes? If so,...
python,xml,svg,lxml,xml-namespaces
Mapping the default namespace to the None prefix didn't work for me either. You can, however, map it to a normal string prefix and use that prefix in the xpath; the rest of your code works without any change: from lxml import etree XHTML_NAMESPACE = "http://www.w3.org/2000/svg" XHTML = "{%s}" %...
Using zip, you will get a list (or an iterator if you use Python 3.x) of tuples: records = zip( tree.xpath("/rss/channel/item/title/text()"), tree.xpath("/rss/channel/item/link/text()"), tree.xpath("/rss/channel/item/pubDate/text()") ) # records => [ # ('Klistermärkesuppsättning i Trollhättan', # 'https://www.nordfront.se/klistermarkesuppsattning-trollhattan.smr', # 'Mon, 02 Feb 2015 14:00:37 +0000'), # ('Dror Feiler vill inte betala sina böter –...
You can pass them as arguments to the tostring() method. An example: from lxml import etree root = etree.Element('root') etree.SubElement(root, 'child1') etree.SubElement(root, 'child2') print etree.tostring(root, encoding='UTF-8', xml_declaration=True, doctype='''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">''') That yields: <?xml version='1.0' encoding='UTF-8'?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"...
To match a div with both the photo and cover classes, use div.photo.cover. img = page.cssselect('div.photo.cover div.outer a') Instead of thinking of class="photo cover" as a class attribute with photo cover as a single value, think of it as a class attribute with photo and cover as two values....
PyPy does not work with lxml (at least not very well, even if it accidentally does), because lxml is built on top of Cython, which uses the CPython C API. Consider using lxml-cffi instead: https://github.com/amauryfa/lxml/tree/cffi
You are not setting the attribute again; you are using its value instead of the element in this line: yes_votes.set(yes_votes, str(int(yes_votes) + 1)) yes_votes contains the content of the attribute and not a reference to the attribute itself. Change it to: element.set("data-yes", str(int(yes_votes) + 1)) ...
You can use the following XPATH: //locales/locale[translate(@name,'nl','NL')='NL-NL'] Or, if there are just two values, you can use even this: //locales/locale[@name='NL-NL' or @name = 'nl-NL'] ...
findtext() would be handy in this case: league = tree.findtext("//legend[@align='center']/a") ...
python,xml,python-2.7,pandas,lxml
There are several things wrong here. (Asking questions on selecting a library is against the rules here, so I'm ignoring that part of the question). You need to pass in a file handle, not a file name. That is: y = BeautifulSoup(open(x)) You need to tell BeautifulSoup that it's dealing...
lxml has the method parentElem.insert(position, new_element), which allows you to insert a new child element under its parent. You can find examples here and here (section "Elements are lists")
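A minimal sketch of insert() on a throwaway document:

from lxml import etree

root = etree.fromstring('<root><a/><c/></root>')
# Insert a new <b> element between <a> and <c>, i.e. at position 1:
root.insert(1, etree.Element('b'))
print(etree.tostring(root))  # b'<root><a/><b/><c/></root>'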
python,xml,parsing,xml-parsing,lxml
So there wasn't too much detail to go on here, but this at least gives the correct output: from lxml import etree root = etree.fromstring(xml) replace_set = {} for node in root.iter("Node"): if 'NodeRef' in [c.tag for c in node]: # This is a <Node> type with child element <NodeRef>....
python,xpath,web-crawler,lxml,python-requests
I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty. I think your problem is that it is filled in using JavaScript, and JavaScript will not be evaluated automatically just by reading the html with lxml.html. You might...
Use .content, which is of type bytes: >>> from lxml import etree >>> import requests >>> xsd_url = 'https://s3-us-west-1.amazonaws.com/premiere-avails/movie.xsd.xml' >>> node = etree.fromstring(requests.get(xsd_url).content) The problem is that your xml file specifies an encoding, and therefore it is the xml parser's job to decode this encoding. But your code uses .text, which asks requests to...
python,windows,module,installation,lxml
I was able to fix the installation with the following steps. I hope others find this helpful. My installation of "pip" was working fine before the problem. I went to the Windows command line and made sure that "wheel" was installed. C:\Python34>python -m pip install wheel Requirement already satisfied (use...
To serialize to a unicode string (thus, getting length in characters) in Python 2: root_str = etree.tostring(root_el, encoding=unicode) root_len_chars = len(root_str) To encode that unicode string in UTF-8, and get the length in bytes (for that encoding): root_len_bytes = len(root_str.encode('utf-8')) ...
python,xml,xml-parsing,tags,lxml
You can make an xpath expression to get all products where territory is IE: //product[territory = "IE"] But, you need to handle an empty namespace here: from lxml import etree data = """<?xml version="1.0" encoding="UTF-8"?> <package xmlns="http://apple.com/itunes/importer" version="film5.0"> <video> <products> <product> <territory>GB</territory> <cleared_for_sale>true</cleared_for_sale> <wholesale_price_tier>1</wholesale_price_tier> </product> <product>...
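For reference, a minimal self-contained sketch of mapping that default namespace to a prefix of our own choosing (the prefix name itunes is arbitrary):

from lxml import etree

data = ('<package xmlns="http://apple.com/itunes/importer">'
        '<video><products>'
        '<product><territory>GB</territory></product>'
        '<product><territory>IE</territory></product>'
        '</products></video></package>')
root = etree.fromstring(data)
ns = {'itunes': 'http://apple.com/itunes/importer'}
# Every step of the path must carry the prefix, since the elements are namespaced:
products = root.xpath('//itunes:product[itunes:territory = "IE"]', namespaces=ns)
print(len(products))  # 1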
python,python-3.x,datatables,lxml
Rather than use a huge long XPath as generated by Chrome, you can just search for a table with the yfnc_tabledata1 class; there is just the one: >>> tree.xpath("//table[@class='yfnc_tabledata1']") [<Element table at 0x10445e788>] Get to your <td> from there: >>> tree.xpath("//table[@class='yfnc_tabledata1']//td[1]")[0].text_content() 'Period EndingDec 31, 2014Dec 31, 2013Dec 31, 2012\n \n...
python,html,google-app-engine,lxml
Your working directory is the base of your app directory. So if your app is organized like: app.yaml nl/ home.html You can then read your file at nl/home.html (assuming you haven't changed your working directory)....
Use the method parameter of the write method. Valid values for the parameter are "html" or "xml". E.g. tree.write("output.xsd", method="html") There is also a pretty_print parameter, which takes True or False, e.g. tree.write("output.xsd", method="html", pretty_print=True) write has many parameters: write(self, file, encoding=None, method="xml", pretty_print=False, xml_declaration=None, with_tail=True, standalone=None, compression=0, exclusive=False, with_comments=True, inclusive_ns_prefixes=None) ...
If you are looking for an XPath-only solution, you could pick off all the pieces of text in the div tag using //text(): import lxml.html as LH doc = LH.parse('data', parser=LH.HTMLParser(encoding='utf-8')) for elt in doc.xpath( '//table[@id="single-stone"]/following-sibling::div/div//text()'): print(elt) yields Гранит Мансуровский продается в готовых изделиях, а также слэбах. Сейчас Мансуровский есть...
That's often referred to as "inner xml" rather than "inner text". This is one possible way to get the inner xml of an element: import lxml.etree as etree import lxml.html html = "<p>this is <b>the</b> good stuff</p>" tree = lxml.html.fromstring(html) node = tree.xpath("//p")[0] result = node.text + ''.join(etree.tostring(e) for e...
What you need is: root = ET.fromstring(data) and to omit the line: tree = ET.parse('bikeStations.xml') Since connection.read() returns a string, you can read the XML string directly by using the fromstring method; you can read more HERE....
python,parsing,xpath,lxml,lxml.html
You can make a shorter and more reliable xpath expression, and you have to use namespaces: tree.xpath('//n:text[@class="bar-text-label"]/text()', namespaces={'n': 'http://www.w3.org/2000/svg'}) An alternative solution could be to use the selenium browser automation package: from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC driver = webdriver.Firefox()...
You are setting XML markup in the element's .text, so when the XML is written it is interpreted as text, not markup, and the special characters are escaped with &...; entities. What you want to do is: divide .text into three parts: before the new tag, inside the new tag, and after the new tag; add the new tag...
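A self-contained sketch of that splitting, wrapping the word "bar" in a hypothetical <b> tag:

from lxml import etree

root = etree.fromstring('<p>foo bar baz</p>')
before, match, after = root.text.partition('bar')
root.text = before               # the part before the new tag stays on the parent
b = etree.SubElement(root, 'b')
b.text = match                   # the part inside the new tag
b.tail = after                   # the part after the new tag goes in .tail
print(etree.tostring(root))      # b'<p>foo <b>bar</b> baz</p>'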
python,html,html-parsing,lxml,lxml.html
This is called the .tail of an element: from lxml.html import fromstring data = """ <div class="txt-block"> <h4 class="inline">Aspect Ratio:</h4> 2.35 : 1 </div> """ root = fromstring(data) print root.xpath('//h4[@class="inline"]')[0].tail Prints 2.35 : 1. As an alternative, you can get the following text sibling of the h4 element: root.xpath('//h4[@class="inline"]/following-sibling::text()')[0] Also,...
python,html,parsing,lxml,html5lib
After some additional research and a more careful review of the source code of html5lib, I discovered that html5lib.tokenizer.HTMLTokenizer does retain partial position information. By "partial," I mean that it knows the line and column of the last character of a given token. Unfortunately, it does not retain the position of...
python,character-encoding,lxml
Your issue is that you have non-ASCII characters within your URL path which must be properly encoded using urllib.parse.quote(string) in Python 3 or urllib.quote(string) in Python 2. # Python 3 import urllib.parse url = 'http://www.lingvo.ua' + urllib.parse.quote('/uk/Interpret/uk-ru/вікно') # Python 2 import urllib url = 'http://www.lingvo.ua' + urllib.quote(u'/uk/Interpret/uk-ru/вікно'.encode('UTF-8')) NOTE: According to...
As far as I can see, the correct URL is http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$filter=month%28NEW_DATE%29%20eq%201%20and%20year%28NEW_DATE%29%20eq%202015 whereas in your code there is (no ? after Data) http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData$filter=month%28NEW_DATE%29%20eq%201%20and%20year%28NEW_DATE%29%20eq%202015 That's why your code currently produces the following XML: <?xml version="1.0" encoding="utf-8" standalone="yes"?> <error...
You need to use the namespace prefix while querying, like: node.xpath('//html:body', namespaces={'html': 'http://...'}) or you can use the .nsmap: node.xpath('//html:body', namespaces=node.nsmap) This assumes all the namespaces are defined on the tag pointed to by node, which is usually true for most xml documents....
You can use getpath() to get xpath from element, for example : import requests from lxml import html page = requests.get("http://www.w3schools.com/xpath/") root = html.fromstring(page.text) tree = root.getroottree() result = root.xpath('//*[. = "XML"]') for r in result: print(tree.getpath(r)) Output : /html/body/div[3]/div/ul/li[10] /html/body/div[3]/div/ul/li[10]/a /html/body/div[4]/div/div[2]/div[2]/div[1]/div/ul/li[2] /html/body/div[5]/div/div[6]/h3 /html/body/div[6]/div/div[4]/h3 /html/body/div[7]/div/div[4]/h3 ...
The primary problem here is that the actual document served from the server contains an empty table: <table id="cardtable" class="cardlist"/> The data is filled in after the page loads by the embedded javascript that follows the empty table element: <script> $('#cardtable').dataTable({ "aLengthMenu": [[25, 100, -1], [25, 100, "All"]], "bDeferRender": true,...
You can use this syntax: /locale[starts-with(@name, "en")] ...
The difference between a/img/..//text() and a//text() is that the first will return you text nodes ONLY from a elements with img elements as children, whereas the second will return text nodes from a elements irrespective of whether they have img elements as children. Put another way, a/img/..//text() could equally be...
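A self-contained sketch illustrating the difference on a made-up fragment:

import lxml.html as LH

doc = LH.fromstring('<html><body>'
                    '<a href="1"><img src="x.png"/>with image</a>'
                    '<a href="2">text only</a>'
                    '</body></html>')
# Only <a> elements that actually contain an <img> child:
print(doc.xpath('//a/img/..//text()'))  # ['with image']
# All <a> elements, with or without an <img> child:
print(doc.xpath('//a//text()'))         # ['with image', 'text only']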
You can use the following line to extract the number of reviews: number_of_reviews = int(reviewPage.xpath("//div[@id = 'tn15content']/table[2]/tr/td[2]")[0].text_content().split()[0]) You can even use your own code if you modify it a little bit. The problem lies in your XPath. Get rid of the tbody part and it works. number_of_reviews = reviewPage.xpath("/html/body/div[1]/div/layer/div[4]/div[3]/div[3]/div[3]/table[2]/tr/td[2]")[0] You...
python,html,regex,beautifulsoup,lxml
I upvoted because it triggered my curiosity. There seems to be a library called twitter-text-python that parses Twitter posts to detect both urls and hrefs. Otherwise, I would go with the combination of regex + lxml
res.content is a string, an HTML string. You need to use lxml.html.fromstring(): import lxml.html import requests res = requests.get('http://www.ipeen.com.tw/comment/778246') doc = lxml.html.fromstring(res.content) name = doc.xpath(".//meta[@itemprop='name']/@content") print name ...
The short answer is, not easily. I'm assuming you're using RHEL5 since you mentioned python 2.4 was in the repo. The problem is that lxml is not a pure python library, it is a rather thin wrapper around two C/C++ libraries, libxml2 and libxslt. This means that any RPM installer...
I think this is what you need to get the id from the current node referenced by the leagues variable: leagueid = leagues.xpath('./id/text()') The above XPath looks for the child node <id> of the current <league> node....
python,html,request,lxml,lxml.html
The <input> elements with the name attributes equal to: a, b, c, d, e, f, g(the radio button Daily/Weekly/Monthly) are inside a <form> tag, which has this hidden form field: <input type="hidden" name="s" value="^GSPC" data-rapid_p="11"> That will send a name/value pair to the server just like a regular <input> element....
How about using .text_content()? .text_content(): Returns the text content of the element, including the text content of its children, with no markup. table = tree.xpath('//table/tr') for item in table: print ' '.join(item.text_content().split()) join()+split() here help to replace multiple spaces with a single one. It prints: February 20, 2015 9:00 PM...
python,osx,python-2.7,scrapy,lxml
I had that problem, and what I did was: install all of Xcode (2.8 GB) from the App Store. To be sure that the installation finished successfully, open a terminal and type xcode-select -p; you should get something like this: /Applications/Xcode.app/Contents/Developer Now you need to install the command line tools. Try typing gcc...
It is important to inspect the string returned by page.text and not just rely on the page source as returned by your Chrome browser. Web sites can return different content depending on the User-Agent, and moreover, GUI browsers such as your Chrome browser may change the content by executing JavaScript...
python,xml,excel,lxml,spreadsheet
If you need to use a namespace for an attribute of an Element or SubElement, you can't pass it using the **kwargs (**_extra) syntax, as that only allows specifying attributes whose names are valid Python identifiers. So in this case you need to use the...
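A minimal sketch of the usual alternative, passing an attrib dict with Clark notation ({uri}name); the namespace URI here is made up:

from lxml import etree

NS = 'http://example.com/ns'  # hypothetical namespace URI
root = etree.Element('root', nsmap={'ex': NS})
# The attrib dict accepts names that are not valid Python identifiers:
child = etree.SubElement(root, 'child', {'{%s}attr' % NS: 'value'})
print(etree.tostring(root))
# b'<root xmlns:ex="http://example.com/ns"><child ex:attr="value"/></root>'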
I had a similar problem. I got it working by running sudo apt-get install python-dev libxml2 libxml2-dev libxslt-dev and then pip install --upgrade lxml
Provide a User-Agent header and it would work for you: webpage = 'http://www.whosampled.com/search/?q=de+la+soul' page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}) Proof: >>> from lxml import html >>> import requests >>> >>> webpage = 'http://www.whosampled.com/search/?q=de+la+soul' >>> page = requests.get(webpage, headers={'User-Agent': 'Mozilla/5.0...
Why is this the case? It isn't. The documentation says the following: Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. So: >>> tree = etree.fromstring('<section> outer <section> inner </section> </section>') >>> etree.strip_tags(tree,...
python,python-3.x,sqlalchemy,web-scraping,lxml
As far as I understand, this is a Scrapy-specific question. If so, you just need to activate your pipeline in settings.py: ITEM_PIPELINES = { 'myproj.pipeline.NordfrontPipeline': 100 } This would let the engine know to send the crawled items to the pipeline (see control flow): If we are not talking about...
python,html,html-parsing,lxml,lxml.html
I'm afraid lxml.html cannot handle parsing this particular HTML source. It parses the h3 tag with id="productDetails" as an empty element (and this is in a default "recover" mode): <h3 class="productDescription2" id="productDetails" itemprop="description"></h3> Switch to BeautifulSoup with html5lib parser (it is extremely lenient): from urllib2 import urlopen from bs4 import...
You can try the itertext() method, which iterates over all text content: from lxml import etree root = etree.XML('<root><div>abc</div><div>def</div></root>') print(' '.join(e for e in root.itertext())) It yields: abc def ...
The lxml tree.xpath() method returned a list object. You cannot use a list object as a dictionary key. If you meant to retrieve the first (perhaps only) result of the XPath query, then do so explicitly: post_data={'A': 'B', token[0]: '1'} If you needed to use all results of the query...
I dumped your xml into the variable root; here is how you can get that piece of XML: import lxml.etree as ET createCase=root.find('.//cre:createCase',namespaces=root.nsmap) print ET.tostring(createCase, pretty_print=True) prints this: <cre:createCase xmlns:cre="http://www.code.com/abc/V1/createCase" xmlns:wsu="http://docs.oasis-open.org/30.xsd" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"> <cre:Request> <cre:ServiceAttributesGrp> <cre:MinorVer>?</cre:MinorVer>...
It's just an encoding issue. It looks like you're using requests, which is good, because it does this work for you. First, requests guesses at the encoding, which you can access with r.encoding. For that page, requests guessed at utf-8. You could do: data = r.content.decode('UTF-8') # or data =...
Using etree.SubElement always appends the subelement to the end of the parent item. So instead, to insert a new element at a particular location, use item.insert(pos, subelement): import lxml.etree as etree xml_node = etree.Element("node") item = etree.SubElement(xml_node, 'Item') etree.SubElement(item, 'Name').text = 'Hello' etree.SubElement(item, 'Hero').text = '1' etree.SubElement(item, 'Date').text = '2014-01-01'...
python,xml,multithreading,validation,lxml
On Linux, a forked process gets a copy-on-write view of the parent's memory. You can leverage this to process a large object with little overhead and without changing the object in the parent's memory space. After creating the object, create pipes for communication and fork a child to do the...
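A minimal Linux-only sketch of the fork-and-pipe idea, assuming a prebuilt etree.XMLSchema in schema and a large parsed document in doc:

import os
from lxml import etree

def validate_in_child(schema, doc):
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                        # child: copy-on-write view of doc
        os.close(r)
        os.write(w, b'1' if schema.validate(doc) else b'0')
        os._exit(0)                     # exit without touching parent state
    os.close(w)                         # parent: read the child's verdict
    valid = os.read(r, 1) == b'1'
    os.close(r)
    os.waitpid(pid, 0)
    return valid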
python,xpath,xml-parsing,html-parsing,lxml
It is not possible to make lxml generate any xpath expression except the "absolute" one, most importantly because there could be an enormous number of xpath expressions pointing to an element. There are also several important points raised here: XPath generator As far as I understand from the firebug documentation and...
python,xml,xpath,web-scraping,lxml
Try using the contains selector to find the element with an attribute href which contains the word parent: //*[contains(@href, 'parent')] or, if you are sure about the position of the text "parent", you can use ends-with: //*[ends-with(@href, 'parent')] (note that ends-with() is XPath 2.0, so it is not available in lxml's XPath 1.0 engine) ...
The following function lxml.html.document_fromstring(string) parses a document from the given string. This always creates a correct HTML document, which means the parent node is <html>, and there is a body and possibly a head. hxs = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/" + id).content) You can see the html using this code: print lxml.html.tostring(hxs) Considering...
As for your first question, this seems to be a good answer. As for your second question, the "tostring()" method follows this standard (as per the documentation)....
Here's my attempt. It gives your desired output for your sample input. I made it tolerant of certain tag changes like div vs. span. import xml.etree.cElementTree as etree # or: from lxml import etree body = etree.parse('test.html').find('body') for aname in body.iterfind('*[@class="aname"]'): cname = aname.find('*[@class="bname"]//a[@class="cname"]') print cname.text # title print cname.get('href')...
To get XML node's content markup (sometimes referred to as "innerXML") , you can start by selecting the node (instead of selecting the child or the text content) : from lxml import html import lxml.etree as le input = "<pre>abcdefg<b>b1b2b3</b></pre>" tree = html.fromstring(input) node = tree.xpath("//pre")[0] then combine the text...
You could use regex on the script to find the quoteDataObj variable and load its contents with JSON. Example: import re import json #...your code... content = wgetUrl(url) matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', content) if len(matches) > 0: python_dict = json.loads(matches[0]) Output: {u'altSymbol': u'CL/M5', u'assetType': u'DERIVATIVE', u'change': u'-1.39', u'code': 0, u'curmktstatus': u'REG_MKT',...
Wrap everything in a throwaway tag, then get the outermost element's inner contents, like so: def clean_custom_title(self): title = self.cleaned_data.get('custom_title') if title: title = "<foo>%s</foo>"%title title = lxml.html.fromstring(title) # title = lxml.some_function(title) # strip <foo> in a 'proper' way title = lxml.etree.tostring(title) title = title[5:-6] # This is a hack...
python,html,html-parsing,lxml,lxml.html
At least, you need to assign the result of replace back to the body: for ss in src: body = body.replace(str(ss), '') print body Though, I personally don't like the approach. Better to find all tags that have a src attribute and set the attribute value to an empty string: for element in...
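A self-contained sketch of that attribute-based approach on made-up markup:

import lxml.html as LH

doc = LH.fromstring('<html><body><img src="a.png"/><p>text</p></body></html>')
for element in doc.xpath('//*[@src]'):  # every element carrying a src attribute
    element.set('src', '')
print(LH.tostring(doc))  # src values are now empty; the markup structure is untouched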
python,web-scraping,html-parsing,lxml,lxml.html
The main issue was that you needed to make a GET request, not a POST. Plus, @Paul Lo is right about the date ranges. For the sake of example, I'm querying from 2010 to 2015. Also, you have to pass the query parameters as strings: 00 evaluated to 0, and requests converted...
You can try using the XPath function string(), which returns the concatenated value of all text nodes within the current element: import lxml.html html = """<h5 class="msg-delivered" style="padding:0;text-rendering:optimizeLegibility;line-height:1.1;margin-bottom:15px;-webkit-font-smoothing:antialiased;font-family:"Open Sans", "Helvetica Neue", Arial, Helvetica, sans-serif;color:#888888;vertical-align:middle;margin:0;font-size:13px;font-weight:300 !important"> \n<i...
python,parsing,web-scraping,lxml,lxml.html
Find the element with xpath (not text) and use text_content() method: details = tree.xpath('.//div[contains(@id, "post_message_33583236")]')[0] print(details.text_content()) Prints: With all the talk about raising the minimum wage, I think the real issue is that people are not getting a liveable wage anymore. This applies to many skilled people too in which...