Code generation in Python (Or, generating APIs from Data Packages)

Hackers.

I’m looking for some pointers on code generation in Python. My use case is around generating APIs from Tabular Data Packages.

There is a Ruby lib that generates a read-only API from data packages, and there are various CSV → JSON API codebases around, including part of the codebase I did for the Open Data Index.

So, straight CSV or Data Package → read-only APIs are a great thing, and I’d like to see more in this direction. But I’m also interested in leveraging the power of existing web frameworks, which is why I’m looking at the next step, and that step will require code generation.

Here is an example flow:

  1. I have a data package.
  2. I use the DataPackage lib with the SQL plugin to create and populate a database.
  3. I generate framework code to manage the data and generate an API (Example: serializers.py and models.py and other bootstrapping config in Django REST Framework)
  4. Instant API, which I can also hack on further via the generated code. The API is not necessarily read-only, so it is possible to write changes back to a DP. We also get access to the framework’s user management, etc., to control API use.

The code generation aspects are step 3 and parts of step 4 (if we went down the route of writable APIs that also update the datapackage!).
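To make step 3 concrete, here is a rough sketch of what generating a Django models.py from a descriptor might look like. Everything in it is an assumption for illustration: the names `FIELD_TYPES`, `class_name` and `render_models` are hypothetical, the type mapping is incomplete, and field names are assumed to already be valid Python identifiers. It uses plain string building with no template engine, so it runs with the standard library only.

```python
import ast

# Table Schema type -> Django field (illustrative and incomplete mapping).
FIELD_TYPES = {
    "string": "models.TextField()",
    "integer": "models.IntegerField()",
    "number": "models.FloatField()",
    "boolean": "models.BooleanField()",
    "date": "models.DateField()",
}


def class_name(resource_name):
    """Turn a resource name such as 'gdp-per-year' into 'GdpPerYear'."""
    parts = resource_name.replace("-", "_").split("_")
    return "".join(part.capitalize() for part in parts if part)


def render_models(descriptor):
    """Return models.py source covering every resource in the descriptor."""
    lines = ["from django.db import models", ""]
    for resource in descriptor["resources"]:
        lines += ["", f"class {class_name(resource['name'])}(models.Model):"]
        for field in resource["schema"]["fields"]:
            # Unknown or missing types fall back to a text column.
            django_field = FIELD_TYPES.get(
                field.get("type", "string"), "models.TextField()"
            )
            lines.append(f"    {field['name']} = {django_field}")
    source = "\n".join(lines) + "\n"
    ast.parse(source)  # fail early if the generated code is not valid Python
    return source


# A made-up descriptor, trimmed to the keys the sketch reads.
descriptor = {
    "resources": [
        {
            "name": "gdp",
            "schema": {
                "fields": [
                    {"name": "year", "type": "integer"},
                    {"name": "value", "type": "number"},
                ]
            },
        }
    ]
}

generated = render_models(descriptor)
print(generated)
```

The `ast.parse` call only checks syntax, and it checks the output rather than the generator, but it is a cheap guard against emitting a broken file.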

It looks to me like using Jinja templates to generate code is the friendliest, and possibly also the most powerful, way to go about this. Maybe others who have gone deep into code generation have other suggestions?

The great thing here is that, by using the SQL driver directly from the Data Package lib, we are not tied to a framework; we just generate the framework’s code. That means we can keep adding frameworks to generate APIs for!

We are also not even tied to SQL. We currently also have a BigQuery driver, and plan to roll out others. So we potentially can roll out a combination of framework + datastore backends.

Lastly, of course, we are not tied to Python. While the lib would likely be in Python, we can generate code for any framework, and accept contributions for new frameworks as part of the codebase.

Some really good web frameworks to target would be:

  • Django REST Framework
  • Express.js
  • Flask + SQLAlchemy + Restless
  • Hapi.js

In researching this, I found cookiecutter (cookiecutter/cookiecutter on GitHub), a cross-platform command-line utility that creates projects from project templates (“cookiecutters”), e.g. Python package projects or C projects, which may be of use here.

Any thoughts?

I like the idea of going from DataPackage to API easily, but code generators are tricky. They’re useful if the source (i.e. the DataPackage) isn’t expected to change, which might or might not be the case for us. Consider the following scenario:

  1. Get Data Package;
  2. Generate API code with this tool;
  3. Customize the generated code for our specific case;
  4. …time passes…
  5. The original Data Package changes (say altering the schema).

What now? I can re-generate the API code and try to merge it into the current codebase, which can be quite difficult depending on how much the code (and the generation tool) has changed in the meantime.

If we expect the Data Packages to change relatively often, it might be better to use them as the canonical source of truth, without generating intermediary code. However, this makes it harder to customize the API.

Overall, I think not generating code is the best approach. Any change to the Data Package then shows up in the API without extra effort, and our users easily get any improvements we make.

@vitorbaptista great points. It is an open question for me, how much we’d expect data packages to change (esp. the type that might be useful for auto-generated APIs). I’m still inclined to say it is worth building something out in this direction and seeing where it goes.

My 2¢: In my experience it’s usually best to avoid code generators. The templates are hard to understand and manage, and the project structure and build system become hard to maintain (e.g. you can’t lint a code template).

If possible, it is much better to have generic code that takes a configuration file and adapts to it than to generate large code files from these config files.
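As a small illustration of the generic approach, the sketch below reads rows by consulting a schema at runtime instead of generating per-dataset code. The helper `read_rows`, the `CASTS` table and the sample data are all made up for this example, and only three field types are handled.

```python
import csv
import io

# Schema type -> Python cast (illustrative subset).
CASTS = {"integer": int, "number": float, "string": str}


def read_rows(schema, csv_text):
    """Yield one dict per CSV row, cast according to the schema's field types."""
    casts = {
        field["name"]: CASTS.get(field.get("type", "string"), str)
        for field in schema["fields"]
    }
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {name: cast(row[name]) for name, cast in casts.items()}


# Made-up schema and data standing in for a real data package.
schema = {
    "fields": [
        {"name": "year", "type": "integer"},
        {"name": "value", "type": "number"},
    ]
}
csv_text = "year,value\n2014,1.5\n2015,2.25\n"

rows = list(read_rows(schema, csv_text))
print(rows)
```

If the schema changes, the same function picks up the change on the next call, with nothing to regenerate or merge.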

The only exceptions here are:

  • If performance is a big issue and you want to avoid opening and parsing a file, or to avoid these extra levels of indirection that come with more generic code,
  • If you can limit the code generation to a very thin layer of ‘code-configuration’ and keep the main logic out of it. For example, convert a JSON object to a Python object which can be imported without having to load and parse that JSON file.
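The second exception could be sketched roughly as follows: the generator emits nothing but a literal, and all logic stays in ordinary hand-written code. `render_config_module` and the sample JSON are hypothetical, and the sketch assumes the JSON contains no NaN or Infinity values, which would not round-trip as Python literals.

```python
import json
import pprint


def render_config_module(json_text, variable="CONFIG"):
    """Return Python source that defines `variable` as the parsed JSON value."""
    data = json.loads(json_text)
    # pformat emits valid Python literals for dicts, lists, strings,
    # numbers, True/False and None.
    body = pprint.pformat(data, width=79)
    return f"# Generated file, do not edit by hand.\n{variable} = {body}\n"


# Made-up configuration for illustration.
json_text = '{"name": "gdp", "fields": ["year", "value"], "public": true, "limit": null}'

source = render_config_module(json_text)
print(source)

# Executing the generated source stands in for importing the written module.
namespace = {}
exec(source, namespace)
```

In a real project the source would be written to a file and imported normally; the `exec` call here just shows that the generated module carries the same data as the JSON.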