==============
XPath selector
==============

Implements a simple XPath selector usable for HTML pages:


  >>> from pprint import pprint
  >>> from p01.testbrowser.xpath import getXPathSelector

Let's try the getXPathSelector function, which returns an XPathSelector
instance:

  >>> content = u"""<div id="content">
  ...   <div class="address">
  ...     <div>FirstName</div>
  ...     <div>LastName</div>
  ...   </div>
  ... </div>
  ... """

  >>> xs = getXPathSelector(content)

We can extract the content from our xpath selector:

  >>> print(xs.extract())
  <html><body><div id="content">
    <div class="address">
      <div>FirstName</div>
      <div>LastName</div>
    </div>
  </div></body></html>

Now we can start selecting an xpath:

  >>> xPath = '//div[@id="content"]'
  >>> xsNew = xs.select(xPath)
  >>> xsNew
  [<XPathSelector data=...'<div id="content">\n  <div class="address'...>]

Now we can extract the content:

  >>> print(xsNew.extract())
  <div id="content">
    <div class="address">
      <div>FirstName</div>
      <div>LastName</div>
    </div>
  </div>

Of course we can also directly extract text from our xpath selector:

  >>> xs = getXPathSelector(content)
  >>> xPath = '//div[@id="content"]'
  >>> print(xs.extract(xPath))
  <div id="content">
    <div class="address">
      <div>FirstName</div>
      <div>LastName</div>
    </div>
  </div>

  >>> print(xs.extractContent(xPath))
  <div class="address">
      <div>FirstName</div>
      <div>LastName</div>
    </div>

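The difference between the two calls above (the element itself versus only
what sits inside it) can be sketched with the standard library. This is an
illustration only, not the p01.testbrowser implementation: ``outer_markup``
and ``inner_markup`` are hypothetical helper names, and
``xml.etree.ElementTree`` stands in for the HTML parser the package uses.

```python
import copy
import xml.etree.ElementTree as ET


def outer_markup(el):
    """Serialize the element itself, without the text that follows it."""
    clone = copy.copy(el)  # shallow copy, the parsed tree stays untouched
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")


def inner_markup(el):
    """Serialize only what sits between the element's start and end tag."""
    parts = [el.text or ""]
    for child in el:
        # ET.tostring() also writes the child's tail text, so text between
        # two child elements is kept.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


body = ET.fromstring(
    '<body><div id="content"><div class="address">'
    '<div>FirstName</div><div>LastName</div></div></div></body>'
)
content_div = body.find('.//div[@id="content"]')

outer = outer_markup(content_div)  # includes the <div id="content"> wrapper
inner = inner_markup(content_div)  # starts at <div class="address">
```

Here ``outer`` plays the role of ``extract`` and ``inner`` the role of
``extractContent``, under the assumptions stated above.
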
Or we can use a more generic xpath and extract details later:

  >>> xPath = '//div[@id="content"]'
  >>> xsList = xs.select(xPath)
  >>> print(xsList.extract('//div[@class="address"]/div[1]/text()'))
  FirstName

  >>> print(xsList.extract('//div[@class="address"]/div[2]/text()'))
  LastName

Our content can also include tags. This is important because users will add
additional tags during editing:

  >>> xPath = '//div[@id="content"]'
  >>> xsList = xs.select(xPath)
  >>> print(xsList.extract('//div[@class="address"]/div'))
  <div>FirstName</div>
      <div>LastName</div>
  <BLANKLINE>

Now this is the real deal that is used in recruiter:

  >>> xs = getXPathSelector(content)
  >>> xPath = '//div[@id="content"]/node()'
  >>> print(xs.extract(xPath))
  <BLANKLINE>
    <div class="address">
      <div>FirstName</div>
      <div>LastName</div>
    </div>
  <BLANKLINE>
  <BLANKLINE>

  >>> content = u"""<div id="content">just some text</div>"""

  >>> xs = getXPathSelector(content)
  >>> xPath = '//div[@id="content"]/node()'
  >>> print(xs.extract(xPath))
  just some text

More complete example
---------------------

Let's take a real advertisement from recruiter.

  >>> content = """<div id="advertisement">
  ...   <div id="advertisementWrapper">
  ...     <div id="advertisementBody">
  ...       <div id="adDescription" class="smartEditable">Sachbearbeiter Einkauf</div>
  ...       <div id="adTitle" class="smartEditable">Sachbearbeiter (100&#37;)
  ...       </div>
  ...       <div id="dutyWrapper">
  ...         <div id="adDutyHeadline" class="headline">
  ...           Your tasks
  ...         </div>
  ...         <div id="adDuty" class="smartEditable">Sachbearbeiter Duty</div>
  ...         <div class="clearer">&nbsp;</div>
  ...       </div>
  ...       <div id="requirementWrapper">
  ...         <div id="adRequirementHeadline" class="headline">
  ...           Your profile
  ...         </div>
  ...         <div id="adRequirement" class="smartEditable">Sachbearbeiter Requirement</div>
  ...         <div class="clearer">&nbsp;</div>
  ...       </div>
  ...       <div id="applicationNoteWrapper">
  ...         <div id="adContactHeadline" class="headline">
  ...           Your application
  ...         </div>
  ...         <div id="adDeptContact" class="smartEditable">
  ...             Please send your application to
  ...             ,
  ...             ,
  ...             Tel. ,
  ...             E-Mail: .
  ...         </div>
  ...         <div class="clearer">&nbsp;</div>
  ...       </div>
  ...     </div>
  ...     <div id="applyLinkBox">
  ...       <div id="applyLink">
  ...         <a class="applyLink"
  ...            href="http://127.0.0.1/index.html"
  ...            target="_top">apply now</a>
  ...       </div>
  ...     </div>
  ...     <div id="adContact" class="smartEditable">
  ...       Jessy
  ...       Ineichen,
  ...       Human Ressource Management,
  ...       Langackerstrasse 8,
  ...       6330
  ...       Cham.
  ...     </div>
  ...     <div id="advertisementFooter">
  ...       <div class="link">
  ...         <a class="link" href="http://www.foobar.tld/"
  ...            target="_blank">
  ...           www.foobar.tld
  ...         </a>
  ...       </div>
  ...     </div>
  ...   </div>
  ... </div>
  ... """

  >>> xs = getXPathSelector(content)

And extract various parts. Ids that do not occur in the markup, such as
adIntro and adSubTitle, yield an empty result:

  >>> print(xs.extract('//div[@id="adIntro"]/node()'))

  >>> print(xs.extractContent('//div[@id="adIntro"]'))

  >>> print(xs.extract('//div[@id="adTitle"]/node()'))
  Sachbearbeiter (100%)
  <BLANKLINE>

  >>> print(xs.extractContent('//div[@id="adTitle"]'))
  Sachbearbeiter (100%)
  <BLANKLINE>

  >>> print(xs.extract('//div[@id="adSubTitle"]/node()'))

  >>> print(xs.extractContent('//div[@id="adSubTitle"]'))

  >>> print(xs.extract('//div[@id="adDescription"]/node()'))
  Sachbearbeiter Einkauf

  >>> print(xs.extractContent('//div[@id="adDescription"]'))
  Sachbearbeiter Einkauf

  >>> print(xs.extract('//div[@id="adDuty"]/node()'))
  Sachbearbeiter Duty

  >>> print(xs.extractContent('//div[@id="adDuty"]'))
  Sachbearbeiter Duty

  >>> print(xs.extract('//div[@id="adContact"]/node()'))
  <BLANKLINE>
        Jessy
        Ineichen,
        Human Ressource Management,
        Langackerstrasse 8,
        6330
        Cham.
  <BLANKLINE>

  >>> print(xs.extractContent('//div[@id="adContact"]'))
  <BLANKLINE>
        Jessy
        Ineichen,
        Human Ressource Management,
        Langackerstrasse 8,
        6330
        Cham.
  <BLANKLINE>

We can also just select elements with an xpath expression:

  >>> xPath = '//div[@id="adTitle"]'
  >>> xs.xpath(xPath)
  [<Element div at ...>]

  >>> el = xs.xpath(xPath)[0]
  >>> el.attrib['id']
  'adTitle'

  >>> el.attrib['class']
  'smartEditable'


Error handling
--------------

Now let's show what happens with a bad xpath:

  >>> print(xsList.extract('////somethingelse'))


and what happens with an invalid xpath expression:

  >>> xsList.extract('//div[@class="addre...')
  Traceback (most recent call last):
  ...
  ValueError: Invalid XPath: //div[@class="addre...

As you can see, we can catch xpath errors as a ValueError.

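The pattern of reporting a malformed expression as ``ValueError`` can be
sketched as follows. This is an assumption-laden illustration, not the
package's code: ``select`` is a hypothetical helper, and it uses the standard
library's ``xml.etree.ElementTree``, which signals an expression it cannot
parse with ``SyntaxError`` and only supports a subset of XPath (hence the
leading ``.`` in the paths).

```python
import xml.etree.ElementTree as ET


def select(root, path):
    """Evaluate path and report a malformed expression as ValueError."""
    try:
        return root.findall(path)
    except SyntaxError:
        # ElementTree raises SyntaxError for expressions it cannot parse.
        raise ValueError("Invalid XPath: %s" % path) from None


root = ET.fromstring('<body><div class="address">FirstName</div></body>')

# A well-formed expression that matches nothing gives an empty list.
no_match = select(root, ".//somethingelse")

# A malformed predicate surfaces as ValueError.
try:
    select(root, ".//div[@@]")
    message = None
except ValueError as exc:
    message = str(exc)
```

Callers then only need to handle one exception type, whatever the underlying
parser raises.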

Issue with with_tail=True/False
-------------------------------

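XPath selection can't correctly handle <br /> tags out of the box because of
the tail concept: the text that follows an element is stored as that element's
tail, so serializing the element emits that text as well. This means we need
to pass with_tail=False to etree.tostring: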

  >>> content = u"""<div id="content">
  ...   <div id="address">
  ...     <div>FirstName</div>
  ...     <br /> and more text
  ...     <br /> and even more text
  ...     <div>LastName</div>
  ...   </div>
  ... </div>
  ... """

Before the bugfix (without "with_tail=False" in tostring) we got:

<div>FirstName</div>
<br> and more text
 and more text
<br> and even more text
 and even more text
<div>LastName</div>

As you can see, each <br /> tag is serialized together with the text that
follows it, so that text appears twice.
Also the XPathSelector list with tail data looks like:

[<XPathSelector data=u'\n    '...>,
 <XPathSelector data=u'<div>FirstName</div>\n    '...>,
 <XPathSelector data=u'\n    '...>,
 <XPathSelector data=u'<br> and more text\n    '...>,
 <XPathSelector data=u' and more text\n    '...>,
 <XPathSelector data=u'<br> and even more text\n    '...>,
 <XPathSelector data=u' and even more text\n    '...>,
 <XPathSelector data=u'<div>LastName</div>\n  '...>,
 <XPathSelector data=u'\n  '...>]

Now with the bugfix we will get:

  >>> xs = getXPathSelector(content)
  >>> xPath = '//div[@id="address"]/node()'
  >>> xsAddress = xs.select(xPath)

  >>> print(xsAddress.extract())
  <div>FirstName</div>
  <br> and more text
  <br> and even more text
  <div>LastName</div>

  >>> xPath = '//div[@id="address"]'
  >>> print(xs.extractContent(xPath))
  <BLANKLINE>
      <div>FirstName</div>
      <br> and more text
      <br> and even more text
      <div>LastName</div>
  <BLANKLINE>

As you can see, the basic XPathSelector ``data`` contains no tail data:

  >>> pprint(xsAddress)
  [<XPathSelector data=...'\n    '...>,
   <XPathSelector data=...'<div>FirstName</div>\n    '...>,
   <XPathSelector data=...'\n    '...>,
   <XPathSelector data=...'<br>'...>,
   <XPathSelector data=...' and more text\n    '...>,
   <XPathSelector data=...'<br>'...>,
   <XPathSelector data=...' and even more text\n    '...>,
   <XPathSelector data=...'<div>LastName</div>\n  '...>,
   <XPathSelector data=...'\n  '...>]

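The tail behaviour described in this section can be reproduced with the
standard library. The sketch below is not the package's code: ``node_list``
and ``serialize`` are hypothetical helpers, and since
``xml.etree.ElementTree.tostring`` has no ``with_tail`` argument, the tail is
cleared on a copy to imitate ``with_tail=False``.

```python
import copy
import xml.etree.ElementTree as ET

div = ET.fromstring("<div><br/> and more text<br/> and even more text</div>")


def node_list(el):
    """Imitate a node() result: leading text, then each child and its tail."""
    nodes = [el.text] if el.text else []
    for child in el:
        nodes.append(child)
        if child.tail:
            nodes.append(child.tail)
    return nodes


def serialize(node, with_tail):
    """Return text nodes as-is and elements as markup."""
    if isinstance(node, str):
        return node
    if not with_tail:
        node = copy.copy(node)  # do not modify the parsed tree
        node.tail = None
    return ET.tostring(node, encoding="unicode")


nodes = node_list(div)

# Each <br/> is written with its tail, and the tail is written again as a
# separate text node, so the text is duplicated.
duplicated = "".join(serialize(n, with_tail=True) for n in nodes)

# Dropping the tail during element serialization emits the text only once.
fixed = "".join(serialize(n, with_tail=False) for n in nodes)
```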

Issue with duplicated content
-----------------------------

The select method returns more than one item in its list if the content
starts with a tag. The content after such a tag was appended twice
by the extract method:

  >>> content = u"""<div id="adRequirement" class="smartEditable" contentEditable="true">
  ... <strong>attention</strong> repeated text</div>
  ... """

  >>> content = content.replace('\\n', ' ')
  >>> xs = getXPathSelector(content)

Extracting with /node() returns the WRONG text; the text after the tag is
repeated:

  >>> print(xs.extract('//div[@id="adRequirement"]/node()'))
  <BLANKLINE>
  <strong>attention</strong> repeated text repeated text

extractContent returns the right content:

  >>> print(xs.extractContent('//div[@id="adRequirement"]'))
  <strong>attention</strong> repeated text



  >>> content = u"""<div id="content">
  ...   <strong>attention</strong><div class="address">
  ...     <div>FirstName</div>
  ...     <div>LastName</div>
  ...   </div>
  ... </div>
  ... """

  >>> xs = getXPathSelector(content)

  >>> print(xs.extractContent('//div[@id="content"]'))
  <BLANKLINE>
    <strong>attention</strong><div class="address">
      <div>FirstName</div>
      <div>LastName</div>
    </div>
  <BLANKLINE>


  >>> content = u"""<div id="content">
  ...   blah
  ...   <strong>attention</strong>foo<div class="address">
  ...     <div>FirstName</div>
  ...     <div>LastName</div>
  ...   </div>
  ... </div>
  ... """

  >>> xs = getXPathSelector(content)

  >>> print(xs.extractContent('//div[@id="content"]'))
  <BLANKLINE>
    blah
    <strong>attention</strong>foo<div class="address">
      <div>FirstName</div>
      <div>LastName</div>
    </div>
  <BLANKLINE>


An edge case is a completely empty element:

  >>> content = u"""<div id="content"></div>"""

  >>> xs = getXPathSelector(content)

  >>> print(xs.extractContent('//div[@id="content"]'))
