
Are there any established web scraping procedures to compare/diff migrated web sites?

I recently came across this post and wanted to share it to show you "the other side" of web scraping.

We are in the process of migrating a comparatively small web application (~2k pages) embedded in a large (enterprise scale) portal to a new platform. The rewrite is an enabling step for future expansion, although a few minor new features will already be added along the way.

Other than that, the site should look essentially identical to end users after the migration. Internally, however, the DOM as well as the CSS and JS composition are changing significantly, thanks to our excellent front end designer/developer, who is driving accessibility 'on the side' (i.e. unobtrusive JavaScript only etc.).

I'm aware of, and already use, unit and functional testing of web sites via various tools like Selenium, Canoo Webtest, HTTrack etc.; nonetheless I lack a procedure for achieving this task without a lot of manual test/XPath coding. Here is what's desired:

  • definitely compare/diff the page data (i.e. disregarding CSS/JS) before/after
    • account for well known differences, i.e. map/exclude expected changes
  • ideally compare/diff the page layout (as end users see it, semantically) before/after as well
    • This seems rather difficult; maybe a heuristic image based comparison would be a better approach here, e.g. with Pixel Perfect? An image based comparison would likely lack the ability to map/exclude changes easily though.
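
To make the first requirement concrete, here is a minimal sketch of a data-only comparison that ignores CSS/JS and masks well known differences. It is only an illustration under assumptions: the names `VisibleText`, `page_text` and `diff_pages` and the `expected_changes` list of (regex, replacement) pairs are hypothetical, and it uses only Python's standard library (`html.parser`, `re`, `difflib`).

```python
import difflib
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def page_text(html, expected_changes=()):
    """Return the page's text chunks with expected changes masked out."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    lines = []
    for chunk in parser.chunks:
        # expected_changes is a hypothetical list of (regex, replacement) pairs
        for pattern, replacement in expected_changes:
            chunk = re.sub(pattern, replacement, chunk)
        lines.append(chunk)
    return lines


def diff_pages(old_html, new_html, expected_changes=()):
    """Unified diff of the text content; an empty list means no data changes."""
    old = page_text(old_html, expected_changes)
    new = page_text(new_html, expected_changes)
    return list(difflib.unified_diff(old, new, "old", "new", lineterm=""))
```

Calling `diff_pages(old_html, new_html, [(r"Copyright \d{4}", "Copyright YEAR")])` would then report only differences that are not covered by the mapping. Note that this compares text chunks in document order, so it says nothing about layout.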

Here's how I'd approach this, approximately:

  1. scrape live portal sub-site and test sub-site accordingly
  2. strip portal induced framing via XSLT (especially ad stuff etc.)
  3. This would yield two DOM fragments that should be semantically identical through users' eyes, but will be quite different internally; consequently I'll need to:

    • either compare/diff the generated fragments via common procedures/tools, which in turn requires me to
      • either transform one fragment into the other via XSLT
      • or extract the data (e.g. into a CSV) and disregard the XHTML structure altogether for comparison - this approach pretty much boils down to a lot of manual test/XPath coding as mentioned above, which I'd like to avoid, obviously ;)
    • or map one fragment to the other with a dedicated tool/process for comparison
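
For illustration, steps 2 and 3 could be approximated without per-field XPath roughly as follows. This is a sketch under assumptions, not an established procedure: it assumes both pages are well-formed XHTML, that the application content sits in a container identifiable by an `id`, and that portal ads carry a known class. The function names, the ids and the `ad` class are all hypothetical, and Python's `xml.etree.ElementTree` stands in for the XSLT step.

```python
import difflib
import xml.etree.ElementTree as ET


def visible_words(element, drop_classes):
    """Words inside element in document order, skipping dropped subtrees."""
    classes = (element.get("class") or "").split()
    if any(c in drop_classes for c in classes):
        return []
    words = (element.text or "").split()
    for child in element:
        words += visible_words(child, drop_classes)
        # tail text belongs to the parent, so keep it even if child is dropped
        words += (child.tail or "").split()
    return words


def fragment_words(xhtml, content_id, drop_classes=("ad",)):
    """Strip the portal framing by keeping only the container with content_id."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        if element.get("id") == content_id:
            return visible_words(element, drop_classes)
    raise ValueError(f"no element with id {content_id!r}")


def compare_fragments(old_xhtml, new_xhtml, old_id, new_id, drop_classes=("ad",)):
    """Return (similarity ratio, list of non-equal opcodes with their words)."""
    old = fragment_words(old_xhtml, old_id, drop_classes)
    new = fragment_words(new_xhtml, new_id, drop_classes)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes
```

Because only attributes and text are inspected, the two fragments may use entirely different elements and nesting. The price is that structural meaning (e.g. which table cell a value sits in) is lost, so this covers the data comparison only, not the layout.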

Is there an established procedure/tooling available for this kind of task or are my automation desires too sophisticated here?