Skip to content

html_token diff should highlight links when only href attr changed #4

Open
@Mr0grog

Description

@Mr0grog

The html_token diff (based on lxml.html.diff) used to pick up changes to a link’s href attribute by virtue of sticking the text “Link: [url]” after the link’s main text. This got really rough and noisy in practice, though, because many links are images (so the text broke the layout) or because URLs are extremely long and unbreakable (making layout nuts, especially when the link is a heading, which is very common), so we removed it by request of the analyst team. That left us with a single blank space getting highlighted in its stead, which was confusing as all get out (it also happened in other situations besides links), so we removed that, too, and now you can’t tell visually when a link’s URL has changed at all.

The change_href test case demonstrates this really well. Output looks like:

<html>
  <!-- clipped to brevity -->
  <body>
    <p>
      Here is a pragraph with
      <a href="/elsewhere">
        A link
      </a>
      to another site.
    </p>
    <script id="wm-diff-script"><!-- clipped for brevity --></script>
  </body>
</html>

Not sure what we want the ideal markup to look like in this case, but we should be signaling the change in the markup somehow. Once we’ve solved that, we should also add an actual pass/fail “validity” test.

See further notes in the code about this here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/f899fa2ef088ef83398a65a05f04321ab659eda2/web_monitoring/html_diff_render.py#L444-L460

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions