Copernic.space: versioned structured data, with change-request mechanic, at scale

I am happy to share with you the result of many years of work.

Introduction

In 2011, I was intrigued by Good Old Fashioned AI (GOFAI). I wanted to build some kind of Artificial Intelligence system. I did not know that I did not know.

I was not comfortable with SQL and wanted to write all the code in my favorite language, willy-nilly: Python. Also, at that time, let’s be honest, PostgreSQL was not as great as it is today. I got into MongoDB, then I was inspired by TinkerPop (its ecosystem nowadays includes JanusGraph) and Neo4j, the so-called knowledge graphs.

It was somewhat good, but not good enough. The developer experience was missing some ease of use. At that point, I started working with a Python 2.7 module called bsddb. It is a joy. You can store WHATEVER you want, the way you want. Some time passed, many bits were put together, and later fell into pieces of nothingness.

Fast forward. Today is different.

What is http://copernic.space?

  • copernic is meant to be a scalable wikidata.
  • copernic is powered by FoundationDB, Python and Django
  • copernic is a versioned triple store
  • copernic is a work-in-progress
  • http://copernic.space is mostly empty (at the time of writing).

For a more technical description look at: https://lists.w3.org/Archives/Public/semantic-web/2020Feb/0046.html

The license is AGPLv3+; the code is on GitHub: https://github.com/amirouche/copernic/

Feedback more than welcome!

What existing systems did you try out / contribute to before you started building this?

Here are the systems I considered, tried or reviewed:

  • postgresql
  • mongodb
  • neo4j
  • virtuoso
  • tinkerpop / rexster / blueprints / janusgraph
  • wikidata / wikibase + blazegraph
  • db.nomics.world
  • datahub.io
  • data.world
  • R&Wbase / R43ples / rdfostrich / QuitStore
  • qri.io

Great! :+1:

Could you give a bit of detail of what you found about them? What was your experience of trying them out? What led you to build a new one?

As I started to explain in the first post, this work is the result of several years of research to find a suitable database for working on symbolic artificial intelligence. To some extent, copernic is a collateral result of that work.

tl;dr: the big idea behind this work, which is NOT a new thing, is the use of an Ordered Key-Value Store (OKVS). Instead of building a database from scratch, I re-use existing software and make it work in the context of “cooperation around the making of knowledge bases”.

I set the following requirements:

  • ACID guarantees all around. That can be debated, and it goes against the current practice called BASE, or eventual consistency, where the data eventually ends up both in the primary source of truth and in secondary systems like Elasticsearch or Redis. The rationale for this requirement is that it is easier to reason about a system where either all instructions are applied or none are, unlike an eventually consistent system, where synchronization between several sources of truth must be handled by ad-hoc code.

  • Poly-structured data: relational, recursive, text, time and space.

  • Embedded in the host language, that is, I did not want to fiddle with string interpolation or yet another domain-specific language. That is a case of monoculture to some extent. But it is also a case for ease of use: since the API is available in the host language, it is easy to tap into existing IDE features for auto-completion and type checking.

  • Scale horizontally, that is you keep the above requirements while throwing more commodity hardware at it.
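
The "all instructions are applied OR none are applied" requirement can be sketched in a few lines of Python. This is only a toy illustration of atomicity, not how FoundationDB or any real OKVS implements transactions; `ToyTransaction`, `run_in_transaction` and the account keys are hypothetical names made up for the example.

```python
class ToyTransaction:
    """Buffers writes so they can be applied all at once (hypothetical helper)."""

    def __init__(self):
        self.writes = {}

    def set(self, key, value):
        self.writes[key] = value


def run_in_transaction(store, func):
    """Apply every write made by func(tx) to store, or none if func raises."""
    tx = ToyTransaction()
    func(tx)  # an exception here propagates before anything touches store
    store.update(tx.writes)


def transfer(tx):
    tx.set(b"account/alice", 90)
    tx.set(b"account/bob", 110)


def broken_transfer(tx):
    tx.set(b"account/alice", 0)
    raise RuntimeError("crash before the second write")


store = {b"account/alice": 100, b"account/bob": 100}
run_in_transaction(store, transfer)

try:
    run_in_transaction(store, broken_transfer)
except RuntimeError:
    pass  # the failed transaction left store untouched
```

After the failed call, `store` still holds the values written by `transfer`: the half-finished write to `account/alice` was never applied, so no ad-hoc repair code is needed.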

These requirements led me to OKVS databases, because I did not want to create a new system from scratch, and also because OKVS are versatile enough to cover all the requirements I have set.

I will reply in another post with what I think of the other systems.
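
For readers who have never used one, here is a minimal in-memory sketch of what "ordered key-value store" means: keys are bytes kept in sorted order, and the main query primitive is a range scan. The `ToyOKVS` class and the key layout are assumptions made for illustration; real stores such as wiredtiger or foundationdb expose their own APIs and persist data on disk.

```python
import bisect


class ToyOKVS:
    """In-memory ordered key-value store (illustrative only)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted (lexicographic) order
        self._values = {}  # key -> value

    def set(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


db = ToyOKVS()
db.set(b"user/2/name", b"bob")
db.set(b"user/1/name", b"alice")
db.set(b"user/1/email", b"alice@example.org")

# Everything stored under the prefix b"user/1/", whatever the insertion order.
user_one = db.range(b"user/1/", b"user/1/\xff")
```

Because keys are ordered, packing related data under a common prefix turns "fetch everything about user 1" into a single range scan. Higher-level structures (tables, graphs, tuple stores) are built on top of that one primitive by choosing how keys are laid out.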

  • postgresql / mongodb

    • not embedded in the host language. An external service makes the target solution more complicated to set up. Specialized indices must be written in C or C++ and require extra data operations; for example, PostgreSQL full-text search requires a manual operation to extend the dictionary of synonyms
    • not horizontally scalable
  • neo4j

    • not embedded in host language + extra service
    • not horizontally scalable
    • no support for specialized indices as part of transactions
    • no support for poly-structured data, that is, text and geo-spatial data require yet another service.
  • virtuoso

    • not embedded in host language + extra service
    • not horizontally scalable
    • ACID support is unclear
    • poly-structured data support is unclear
  • tinkerpop / rexster / blueprints / janusgraph

    • blueprints was embedded but JVM-to-Python interop is slow. Not maintained?
    • not horizontally scalable
    • no support for indices as part of transactions
    • no support for poly-structured data, that is, text and geo-spatial data require yet another service.
  • wikidata / wikibase + blazegraph

    • microservices approach, which makes it very difficult to reproduce.
    • blazegraph is not maintained
    • Eventually consistent
    • Not horizontally scalable
  • db.nomics.world / datahub.io

    • both rely on git + elastic search / solr + custom code
    • difficult to reproduce
    • git can not easily track the history of a given table cell
    • git does not scale with the size of the data
  • data.world

    • That is proprietary.
  • R&Wbase / R43ples / rdfostrich / QuitStore

    • rdfostrich relies on kyotocabinet, which does not offer transactions
    • QuitStore relies on git, hence it is subject to the same limitation as git, i.e. it does not scale in terms of data size.
    • I did not manage to run R&Wbase / R43ples
  • qri.io

    • As far as I know, that is not meant for cooperation in the large, it is more like a git-for-structured-data, cooperation in the small.

No mainstream database system meets the requirements I have set.

What led me to copernic, and to adopt OKVS in general, is that it is a more versatile, more tested, more straightforward approach to designing higher-level databases. OKVS are powerful, simple and elegant.

I forgot to mention it here and nobody cared to ask. There is a trick of some sort.

To summarize:

  • If you are serious about Software Engineering and want to build an Artificial Intelligence system, I urge you to consider an Ordered Key-Value Store with ACID transactions (preferred: wiredtiger, foundationdb).

  • The first result of my work is dubbed the generic tuple store (nstore). It is a generalization of triple and quad stores to tuples of n items. To achieve what blazegraph calls “perfect indices”, the code relies on a math result: the covering of the Boolean lattice by a minimal number of maximal chains. For tuples of n=5 items, that leads to a storage factor of 10.

  • Then, because nstore allows querying data along every “dimension”, it is a good fit for storing the history of a triple store. This assumes you do not know in advance what queries you will want to execute over the history.

  • Given the nstore, one can build a git-like Directed-Acyclic-Graph (DAG) history. The DAG approach may be useful for cooperation-in-the-small, but merging a branch into the main branch requires to “copy” and recompute the history significance, and to update the snapshot.

  • Given the nstore, one can build a single-branch history, which is more scalable and is meant for “cooperation-in-the-large”. This is what copernic is about. This is a Proof-of-Concept. When foundationdb replication is 2 (that is, at least two machines must be up to make progress), with the nstore storage factor of 10, it results in a storage factor of 50. That is, to store wikidata, one would need approximately 2*50 TB = 100 TB of SSD; if you include RAID 1, it becomes 200 TB spread over multiple commodity machines.

Note: If you look at the pricing of cloud or dedicated server providers, you will surely notice that 200 TB of SSD is a lot of money. Also, in the case of cloud providers, performance is poor. However, the equivalent SSD hardware costs approximately $20,000. If you add to the price of the SSD the price of rack servers (50 racks at $500), it ends up around $55,000, without location fees, maintenance or renewal. In comparison, though not an apples-to-apples comparison, the cost PER MONTH to have 200 TB of SSD on GCP is $34,641.92.
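
To make the nstore storage factor mentioned above more concrete: a query that binds some positions of a tuple can be answered with one range scan if some index stores the tuple items in an order that puts the bound positions first. The minimal number of such orderings is the central binomial coefficient C(n, n // 2), which gives 10 for n=5. The sketch below is not the nstore code; the function names are made up for the example, and the brute-force search is only practical for very small n.

```python
from itertools import combinations, permutations
from math import comb


def index_count(n):
    """Minimal number of orderings (maximal chains) needed for n-tuples."""
    return comb(n, n // 2)


def covers_all_patterns(orderings, n):
    """True if every subset of the n positions is a prefix of some ordering."""
    prefixes = {frozenset(o[:k]) for o in orderings for k in range(n + 1)}
    return all(
        frozenset(subset) in prefixes
        for k in range(n + 1)
        for subset in combinations(range(n), k)
    )


def minimal_orderings(n):
    """Brute-force the smallest covering set of orderings (tiny n only)."""
    perms = list(permutations(range(n)))
    for size in range(1, len(perms) + 1):
        for candidate in combinations(perms, size):
            if covers_all_patterns(candidate, n):
                return list(candidate)


triple_indices = minimal_orderings(3)  # orderings of (subject, predicate, object)
factors = {n: index_count(n) for n in (3, 4, 5)}
```

For triples the search finds three orderings, matching `index_count(3)`, and `factors` maps 3, 4 and 5 items to 3, 6 and 10 indices respectively. Each index stores a full copy of every tuple, which is where the storage factor comes from.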

As I wrote somewhere else, I do not want to compete with wikidata or dbpedia. I am looking for a use-case that complements the current #opendata #openknowledge offering.