Detailed legislation version tracking?




As someone who tracks legislative changes, I now and then needlessly have to waste time manually comparing the existing and proposed text. Some of this can be done with office software and its merging capabilities, while PDF text can be extracted and diffed (tiny script here). However, this fails for larger changes and renewals, where articles get moved around, renumbered or split, and you get many extraneous diff chunks just from all the page and title numbering changes.
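For the simple case, the extract-and-diff step can be sketched in a few lines of Python with the standard library's difflib (the text would normally come from a PDF extractor; the sample strings here are made up for illustration):

```python
import difflib

def diff_texts(old: str, new: str) -> str:
    """Return a unified diff of two extracted legislation texts."""
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="existing", tofile="proposed", lineterm=""))

# In practice the inputs might come from e.g. `pdftotext law.pdf -`
# (assuming poppler-utils); inline strings keep the sketch self-contained.
old = "Article 1. Dogs must be leashed.\nArticle 2. Fines are 50 EUR."
new = "Article 1. Dogs must be leashed.\nArticle 2. Fines are 100 EUR."
print(diff_texts(old, new))
```

This works fine line by line, but as noted above it falls apart as soon as articles move or get renumbered.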

So I was looking around to see if there are any context-aware diff tools or lexers that could alleviate this. I didn't find anything; can someone prove me wrong?
Also, all the text-mining tools I looked into were just about linguistic details or tone.

In the worst case, I’m considering building something myself. Most legislation is already very consistent in its formatting, so parsing shouldn’t be a nightmare. And once it is converted to a structured format, mixing and matching individual chunks shouldn’t be hard either, even for non-sequential versions. Plus, there are plenty of diff libraries out there to reuse … Let me know if this would interest you, dear reader, so we can collaborate.
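Once the text is chunked into articles, matching them across versions by content similarity rather than by number would sidestep the renumbering noise. A minimal sketch of that idea with difflib.SequenceMatcher (the article identifiers and texts are invented for illustration):

```python
import difflib

def best_matches(old_articles: dict, new_articles: dict, threshold: float = 0.6):
    """Pair each old article with its most similar new article,
    ignoring numbering, so moved/renumbered articles still match."""
    pairs = []
    for old_id, old_text in old_articles.items():
        scored = [(difflib.SequenceMatcher(None, old_text, text).ratio(), new_id)
                  for new_id, text in new_articles.items()]
        score, new_id = max(scored)
        # Below the threshold, treat the article as deleted rather than moved.
        pairs.append((old_id, new_id if score >= threshold else None, round(score, 2)))
    return pairs

old = {"Art. 5": "Dogs must be leashed in public parks.",
       "Art. 6": "Fines are set by the municipality."}
new = {"Art. 7": "Dogs must be leashed in public parks and squares.",
       "Art. 8": "Fines are set by the municipality."}
print(best_matches(old, new))
```

This quadratic all-pairs comparison is obviously naive, but it shows why a structured format makes the "articles got shuffled" problem tractable.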

It would then also open up the option of (public) collaborative commenting on legislative changes and manual classification of chunks into, let’s say, boring (spelling, renames …) and content-bearing.


Not sure if that’s relevant to your question, but you might want to look at these projects in France:


What about using a version control system (VCS), such as Git?

Software files often undergo large changes or refactorings. Git, and other VCSs, have evolved to keep track of small and large changes to text files (source code).


@samgta: thanks, that looks interesting. I’m happy to be from a smaller country with a shorter process than what is shown at lafabrique. :slight_smile: It looks like plain version tracking though, nothing to combat noise. Also, we don’t have the luxury of always getting nicely machine-readable texts (it’s not just laws and below; it can be strategies and similar free-form documents). So in the worst case of PDFs, you automatically lose any information in graphs and other images, unless you use a raster PDF differ and get more images as a result (1px differences besides all the text and imagery).

@brylie: the same problems apply. Git can track moves between files, but not within them. In contrast, the Wikipedia difflib also supports detecting such changes. So putting the results in an SCM would be a good storage option, but it wouldn’t help that much with the diffing (hacking on gitlib seems silly) and not at all with the parsing.


Review Board seems like a good example of what can be built on top of VCS. It has some features that might be relevant to tracking and reviewing changes in legislation. For an example directly related to this discussion:

Ever move some functions or other code around in a file, and then try to review it? It’s a pain! It’s hard to tell what code has moved and to where, or whether there were other changes to pay attention to.

Not here. Review Board checks that for you, helpfully showing where code moved to, from where, and whether there were any other changes made during the move.

Since Review Board is open-source, it can serve as a basis, or reference model, for a legislation revision control system :grinning:



The idea of versioning bills and laws has been around for a while and has led to many ideas and projects; here are a few links to more examples:



Hi, I liked the “git solution”, as in @samgta’s france.code-civil example. But the big problem is not converting to Markdown; it’s how to obtain the “compiled text” for each version.
Is there a system for this task?


In a Civil Law system like Brazil’s, there is no concept of a “new version”; alterations and revocations are made by new laws.
Each law is published in the official gazette… So it is like a blog where some posts are errata.

For the final user, the “new version” is another document, generated by another body (official or not), as a compiled text.
The compiled text of a legal norm is a rectified text, with a record of the official source that published the rectification… When the process is automated, it must be ratified by a human reviewer to be official.


  • Law L1 is published on date D1 to be in force from date D2;

  • Law L2 is published on date D2 to be in force from date D3 to D4, and L2 contains, among other things, an alteration of L1’s second paragraph, so L2 modifies L1.p2.
    The source of this modification is L2.p5.

  • The compiled text of L1 is the original text from D2 to D3. This is version v1 of L1, so L1v1 = L1.

  • The compiled text of L1 from D3 to D4 is L1v2 = L1 - L1.p2 + L2.p5.

  • The compiled text of L1 after D4 is again the original text, so L1v3 = L1.
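The timeline above can be modeled as a date-driven compilation step. A toy sketch (all names, paragraphs, and dates are invented; dates are plain sortable strings):

```python
def compile_text(paragraphs: dict, amendments: list, on_date: str) -> dict:
    """Build the compiled text of a law as of a given date by applying
    every amendment whose force interval [start, end) covers that date."""
    text = dict(paragraphs)
    for target, replacement, start, end in amendments:
        if start <= on_date < end:
            text[target] = replacement
    return text

L1 = {"p1": "original p1", "p2": "original p2"}
# L2.p5 replaces L1.p2, in force from D3 to D4
amendments = [("p2", "text of L2.p5", "2020-03-01", "2020-04-01")]

assert compile_text(L1, amendments, "2020-02-15")["p2"] == "original p2"    # L1v1
assert compile_text(L1, amendments, "2020-03-15")["p2"] == "text of L2.p5"  # L1v2
assert compile_text(L1, amendments, "2020-05-01")["p2"] == "original p2"    # L1v3
```

The hard part, of course, is not this bookkeeping but extracting the (target, replacement, dates) tuples from the amending law in the first place.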

Today in Brazil the compiled text is built by hand; it is craft work, with no relevant automation. What about other Civil Law countries? Is there a system to help (some computer-aided compilation) or to automate the compilation task?

There are experiments demonstrating that an XML transcription of a law can be used for automation… It is the only way to automate the task that I know of.

The “official body” that compiles and publishes the compiled text is not the authority behind the original texts.

PS: in many cases the compiled text does not exist (!); the government has no obligation to produce it. And in all cases the “official compiled text” is not legal proof, so the economic cost of reliable law compilation is high, while the real value (value as proof) of the compilation is low.


@brylie: yep, nice idea.

@RouxRC: again, basic version tracking doesn’t cut it. But thanks for some extra reading.

@ppkrauss: it’s similar here, but all substantial changes also include the new consolidated text, not just the “errata”. You should push for that. But parsing the errata shouldn’t be horribly hard either, since the syntax is predictable.
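To illustrate what “predictable syntax” buys you: if amendments follow a fixed formula, a simple pattern can pull out the structured pieces. The sentence template below is entirely hypothetical (real wording varies by country and drafting standard):

```python
import re

# Assumed amendment formula, e.g.:
#   In Article 5, paragraph 2, replace "old" with "new".
PATTERN = re.compile(
    r'In Article (?P<article>\d+), paragraph (?P<par>\d+), '
    r'replace "(?P<old>[^"]+)" with "(?P<new>[^"]+)"\.')

def parse_amendment(sentence: str):
    """Return the amendment's structured fields, or None if it
    does not follow the assumed formula."""
    m = PATTERN.match(sentence)
    return m.groupdict() if m else None

print(parse_amendment(
    'In Article 5, paragraph 2, replace "50 EUR" with "100 EUR".'))
```

A real parser would need a grammar per jurisdiction, but the point stands: formulaic drafting makes the extraction mechanical.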


Hi @lynxlynxlynx, I edited my post to add an example of the complexity of the “compilation task”. Can you show (a link?) an example of a system that parses the natural language? (When, as you say, “the syntax is predictable”.)


I was referring to the way L1.p2 and L2.p5 are specified, but no, I don’t know of any existing parsers for this; they’d be language- and standard-dependent anyway. The source of the impression was just thinking about all the changes I’ve reviewed and the way they were worded (formal nomotechnic standards).


This project might also be of interest in this topic