As someone tracking legislative changes, I occasionally have to waste time manually comparing existing and proposed text. Some of this can be done with office software and its merge capabilities, and PDF text can be extracted and diffed (tiny script here). However, this breaks down for larger amendments and recasts, where articles get moved around, renumbered or split, and you get many extraneous diff chunks just from all the page and heading renumbering.
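To illustrate the baseline approach, here is a minimal sketch of that kind of tiny script, assuming the PDF text has already been extracted (e.g. with a tool like pdftotext); the article texts below are made up for the example:

```python
import difflib

# Hypothetical extracted text of the existing and proposed versions.
old = """Article 1. Scope
This act applies to all public bodies.
Article 2. Definitions
A "body" means any agency."""

new = """Article 1. Scope
This act applies to all public and private bodies.
Article 2. Definitions
A "body" means any agency."""

# A plain line-based unified diff: fine for small, in-place edits.
diff = list(difflib.unified_diff(old.splitlines(), new.splitlines(),
                                 fromfile="old", tofile="new", lineterm=""))
print("\n".join(diff))
```

This works exactly up to the point described above: the moment articles are renumbered or reordered, every heading line shows up as a spurious change.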
So I looked around for context-aware diff tools or lexers that could alleviate this. I didn't find anything; can someone prove me wrong?
Also, all the text-mining tools I looked into were concerned only with linguistic details or tone.
In the worst case, I'm considering building something myself. Most legislation is already quite consistent in its formatting, so parsing shouldn't be a nightmare. And once the text is converted to a structured format, mixing and matching individual chunks shouldn't be hard either, even across non-sequential versions. Plus, there are plenty of diff libraries out there to reuse … Let me know if this would interest you, dear reader, so we can collaborate.
That would then also open up the option of (public) collaborative commenting on legislative changes, and of manually classifying chunks into, let's say, boring (spelling, renames …) and content-bearing.