Python dataclasses: A revolution

Python data classes are a new feature in Python 3.7, which is currently in Beta 4 and scheduled for final release in June 2018.  However, a simple pip install brings a backport to Python 3.6.  The name dataclass sounds like this feature is specifically for classes with data and no methods, but in reality, just as with the Kotlin dataclass, the Python dataclass is for all classes and more.

I suggest that the introduction of the dataclass will transform the Python language, and in fact signal a more significant change than the move from Python 2 to Python 3.

The Problems Addressed By Dataclasses

There are two negatives with Python 3.6 classes.

‘hidden’ Class Instance Variables

The style rules for Python suggest that all instance variables should be initialised (assigned to some value) in the class __init__() method.  This at least allows scanning __init__ to reverse engineer a list of class instance variables. Surely an explicit declaration of the instance variables is preferable?

A tool such as PyCharm scans the class __init__() method to find all assignments, and then any reference to an object variable that was not found in the __init__() method is flagged as an error.

However, having to discover the instance variables for a class by scanning for assignments in the __init__() method is a poor substitute for scanning a more explicit declaration.  The body of the __init__() method essentially becomes part of the class statement.

The dataclass provides for much cleaner class declarations.

Class Overhead for Simple Classes

Creating basic classes in Python is too tedious. One solution is for programmers to use dictionaries as classes – but this is bad programming practice. Other solutions include namedtuples, the Struct class from the ObjDict package or the attrs package. With this number of different solutions, it is clear that people are looking for a solution.

The dataclass provides a cleaner, and arguably more powerful syntax than any of those alternatives, and provides the stated Python goal of one single clear solution.

Dataclass Syntax Example

Below is an arbitrary class to implement an XY coordinate and provide addition and a __repr__ to allow simple printing.

class XYPoint:

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

    def __repr__(self):
        return f"XYPoint(x={self.x},y={self.y})"

Now the same functionality using a data class:

from dataclasses import dataclass

@dataclass
class XYPoint:
    x: float
    y: float

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

The dataclass is automatically provided with an __init__() method and a __repr__() method.

The class declaration now has the instance variables declared at the top of the class more explicitly.
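As a minimal sketch of what the generated methods provide, the following defines only the fields and then uses the automatic __init__, __repr__ and __eq__. The values are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class XYPoint:
    x: float
    y: float


# __init__ is generated from the annotated fields, in declaration order.
p = XYPoint(1.0, 2.0)
q = XYPoint(x=1.0, y=2.0)

# __repr__ and __eq__ are generated as well.
print(p)        # XYPoint(x=1.0, y=2.0)
print(p == q)   # True
```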

The types of ‘x’ and ‘y’ are declared as float above. To exactly match the previous example they should be of type Any, but float may be more precise, and more clearly illustrates that an annotation is usually a type.

The dataclass merely requires the class variables to be annotated using variable annotation.  ‘Any‘ provides a type completely open to any type, as is traditional with Python.  In fact, the type is currently just for documentation and is not type checked, so you could state ‘str‘ and supply an ‘int‘, and no warning or error is raised.
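The lack of run-time type checking can be seen in a short sketch. The Label class and its fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Label:          # hypothetical class, for illustration only
    text: str
    size: int = 10


# The annotation says str, but an int is accepted without complaint.
label = Label(42)
print(type(label.text).__name__)   # int

# The declared types are only recorded as metadata on each field.
print([f.name for f in fields(Label)])   # ['text', 'size']
```

A third-party checker such as mypy would be expected to flag Label(42), but Python itself does not.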

As you can see from the example, dataclass does not mean the class implemented is a data only class, but rather the class contains some data which almost all classes do. There are code savings for simple classes, mostly around the __init__ and __repr__ and other simple operations with the data, but the cleaner declaration syntax could be considered the main benefit and is useful for any class.

When to Use Dataclasses

The Candidates

The most significant candidates are any Python class and any dictionary used in place of a class.

Other examples are namedtuples and Struct classes from the ObjDict package.

Code using the attrs package can migrate to the more straightforward dataclass which has improved syntax at the expense of losing attrs compatibility with older Python versions.

Performance Considerations

There is no significant change to performance by using a dataclass in place of a regular Python class. A namedtuple could be slightly faster for some uses of immutable, ‘method free’ objects, but for all practical purposes, a dataclass introduces no performance overhead, and although there may be a reduction in code, this is insignificant. This video shows the results of actual performance tests.

Compatibility Limitations?

The primary compatibility constraint is that dataclasses require Python 3.6 or higher. With Python 3.6 having been released in 2016, most potential deployments are well supported, leaving the main restriction that dataclasses are not available in Python 2.

The only other compatibility limitation applies to classes with existing type annotated class variables.

Any class which can limit support to Python 3.6+, and does not have type annotated class variables, can add the dataclass decorator without compatibility problems.

Just adding the dataclass decorator does not break anything, but without then also adding data fields, it does not bring significant new functionality either. This compatibility does mean that data fields can be added incrementally as desired, with no extra step needed to ensure compatibility.

New capabilities do not guarantee delivery without code to make use of those capabilities. Unless the class is part of a library merely getting a spec upgrade, conversion to dataclasses makes most sense either when refactoring for readability or when code makes use of one or more functions made available by converting to a dataclass.

The ‘free’ Functionality

In addition to the clean syntax, the features provided automatically to dataclasses are:

  • Class methods generated automatically if not already defined
    • __init__ method code to save parameters
    • __repr__ to allow quick display of class data, e.g. for informative debugging
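    • __eq__, plus the ordering comparison methods when order=True is passed to the decorator
    • __hash__, allowing instances to function as dictionary keys (generated for frozen dataclasses, or when unsafe_hash=True is passed to the decorator)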
  • Helper functions (see PEP for details)
    • fields() returns a tuple of the fields of the dataclass
    • asdict() returns a dictionary of the class data fields
    • astuple() returns a tuple of the dataclass fields
    • make_dataclass() as a factory method
    • replace() to generate a modified clone of a dataclass
    • is_dataclass
  • New Standardized Metadata
    • more information in a standard form for new methods and classes
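The helper functions listed above can be tried out with a short sketch. All of the functions come from the standard dataclasses module; the Point3D name is only an example.

```python
from dataclasses import (
    asdict, astuple, dataclass, fields, is_dataclass, make_dataclass, replace,
)


@dataclass
class XYPoint:
    x: float
    y: float = 0.0


p = XYPoint(1.0, 2.0)

print([f.name for f in fields(p)])   # ['x', 'y']
print(asdict(p))                     # {'x': 1.0, 'y': 2.0}
print(astuple(p))                    # (1.0, 2.0)
print(is_dataclass(p))               # True

# replace() builds a new instance with some fields changed.
moved = replace(p, x=5.0)
print(moved.x, moved.y, p.x)         # 5.0 2.0 1.0

# make_dataclass() builds a dataclass at run time from (name, type) pairs.
Point3D = make_dataclass("Point3D", [("x", float), ("y", float), ("z", float)])
print(astuple(Point3D(1.0, 2.0, 3.0)))   # (1.0, 2.0, 3.0)
```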

Dataclass Full Syntax & Implementation

How Dataclasses Work

dataclass is based on the dataclass decorator. This decorator inspects the class, generates relevant metadata, then adds the required methods to the class.

The first step is to scan the class __annotations__ data. The __annotations__ data has an entry for each class level variable provided with an annotation.  Since variable annotations only appeared in Python 3.6, and class level variables are not common, there is no significant amount of legacy code with annotated class level variables.

This list is scanned for actual values of these class level variables which were created with field(). Such values can contain additional data for building the metadata, which is stored in __dataclass_fields__ and __dataclass_params__. Once this metadata is built, the standard methods are added if they are not already present in the class. Note that while an existing __init__() method blocks the very desirable, boilerplate-removing automatic __init__ method, simply renaming __init__ to __post_init__ allows retaining any code desired in an __init__ while removing the distracting boilerplate.

This process means that any class level variables that are not annotated are ignored by the dataclass decorator and not impacted by the move to a data class.
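The metadata built by the decorator can be inspected directly. In this sketch the Item class is hypothetical; it has one class level variable without an annotation and two annotated fields.

```python
from dataclasses import dataclass, field


@dataclass
class Item:                      # hypothetical class, for illustration only
    counter = 0                  # no annotation: ignored by the decorator
    name: str
    tags: list = field(default_factory=list)


print(list(Item.__annotations__))               # ['name', 'tags']
print(list(Item.__dataclass_fields__))          # ['name', 'tags']
print("counter" in Item.__dataclass_fields__)   # False
print(Item.counter)                             # 0
```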

Converting a Class

Consider the previous example, which was very simple. Real classes have default __init__ parameters, instance variables that are not passed to __init__, and code that will not be replaced with the automatic __init__. Here is a slightly more complicated contrived example to cover those complications with a straightforward use case.

This example adds a class level variable, last_serial_no, just to have an example of a working, class level variable, which allows a counter of each instance of the class.

Also added is serial_no which holds a serial number for each instance of the class.  Although it makes more sense to always increment the serial number by 1, an optional __init__ parameter allows incrementing by another value, showing how to deal with __init__ parameters which cannot be processed by the default __init__ method.

class XYPoint:

    last_serial_no = 0

    def __init__(self, x, y=0, skip=1):
        self.x = x
        self.y = y
        self.serial_no = self.__class__.last_serial_no + skip
        self.__class__.last_serial_no = self.serial_no

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

    def __repr__(self):
        return f"XYPoint(x={self.x},y={self.y})"

Now the same functionality using a dataclass.

from dataclasses import dataclass, field, InitVar

@dataclass
class XYPoint:
    last_serial_no = 0
    x: float
    y: float = 0
    skip: InitVar[int] = 1
    serial_no: int = field(init=False)

    def __post_init__(self, skip):
        # skip is an InitVar: it is passed in here and not stored on the instance
        self.serial_no = self.last_serial_no + skip
        self.__class__.last_serial_no = self.serial_no

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

The class level variable without annotation needs no change. The __init__ parameter that is not also an instance variable has the InitVar type wrapper. This ensures it is passed through to __post_init__  which provides all __init__ logic that is not automatic.

The serial number is an instance variable (a field) that is not an __init__ parameter. To change the default settings for a field like this, assign the result of field() to it; field() can still take a default value as a parameter.
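The same pattern can be exercised in a small, self-contained sketch. The Ticket class and its field names are hypothetical; the point is how InitVar, field(init=False) and __post_init__ interact.

```python
from dataclasses import InitVar, dataclass, field


@dataclass
class Ticket:                            # hypothetical class, for illustration
    last_serial_no = 0                   # plain class variable, not a field
    title: str
    skip: InitVar[int] = 1               # __init__ parameter only
    serial_no: int = field(init=False)   # instance variable, not in __init__
    notes: list = field(default_factory=list, repr=False)

    def __post_init__(self, skip):
        # skip arrives as an argument; it is not stored on the instance.
        self.serial_no = Ticket.last_serial_no + skip
        Ticket.last_serial_no = self.serial_no


a = Ticket("first")
b = Ticket("second", skip=10)
print(a.serial_no, b.serial_no)   # 1 11
print(b)                          # Ticket(title='second', serial_no=11)
```

Note that __post_init__ uses the skip argument and not self.skip: an InitVar is never stored on the instance, so self.skip would only ever find the class level default.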

I think this example covers every realistic use requirement to convert any existing class.

Types and Python

Dataclasses are based on usage of annotations. As noted in annotations, there is no requirement that annotations be types. The reason for providing annotations was primarily driven by the need to allow for third-party type hinting.

Dataclasses do give the first use of annotations (and by implication, potentially types) in the Python standard libraries.

Annotating with None or docstrings is possible. There are many in the Python community adamant that types will never be required, nor become the convention. I do see optional types slowly creeping in though.

Issues and Considerations

It is possible there are some issues with existing classes which use class level variables and instance variables, but none have been found so far, which leaves this section as mostly ‘to be added’ (check back).

Conclusion

There is a strong case that all classes, as well as namedtuples, and even other data not currently implemented as a class and some other constructs better implemented as classes, should move to dataclasses. For small classes, and all classes start as small classes, there is the advantage of saving some boilerplate code. Reducing boilerplate code makes classes easier to maintain, and ultimately more readable.

Ultimately, the main benefit is any class written using a dataclass is more readable and maintainable than without dataclasses. Converting existing classes is as simple as renaming.

Python Class vs Instance Variables

Python class variables offer a key tool for building libraries with DSL type syntax, as well as data shared between all instances of the class. This page explains how Python class variables work and gives some examples of usage.

  • Class & Instance Variables
    • Instances vs the class object
    • Instance Variables
    • Class Variables
  • Using Class and Instance Variables

Class & Instance Variables

Instances vs the class object

In object oriented programming there is the concept of a class, and the definition of that class is the ‘blueprint’ for objects of that class.  The class definition is used to create objects which are instances of the class. In Python, for each class, there is an additional object for the class itself.  This object for the class itself is an object of the class ‘type’. So if there is a class ‘Foo’ with two instanced objects ‘a’ and ‘b’, created by a = Foo() and b = Foo(), this creates objects a and b of class Foo. In Python, the code declaring the class Foo does the equivalent of, for example, Foo = type('Foo', (), {}). This Foo object can be manipulated at run time, the object can be inspected to discover things about the class, and the object can even be changed to alter the class itself at run time.
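A short sketch of these ideas follows. The names Bar, greeting and describe are arbitrary, chosen only for illustration.

```python
class Foo:
    pass


a = Foo()
b = Foo()

print(type(a) is Foo)      # True
print(type(Foo) is type)   # True

# Calling type() with a name, base classes and a namespace builds a class
# at run time, much as the class statement does.
Bar = type("Bar", (), {"greeting": "hello"})
print(Bar().greeting)      # hello

# The class object can be changed at run time; existing instances see it.
Foo.describe = lambda self: "I am a Foo"
print(a.describe())        # I am a Foo
```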

Instance Variables

Consider the following  code  as run in Idle (Python 3.6):


>>> class Foo:
    def __init__(self):
        self.var = 1

>>> a = Foo()
>>> type(a)
<class '__main__.Foo'>
>>> type(Foo)
<class 'type'>
>>> 'var' in a.__dict__
True
>>> 'var' in Foo.__dict__
False
>>> a.var
1
>>> Foo.var
Traceback (most recent call last):
  File "", line 1, in
AttributeError: type object 'Foo' has no attribute 'var'

Note the results of type(a) compared to type(Foo). 'var' appears in a.__dict__, but not in Foo.__dict__. Further, a.var gives a value of 1 while Foo.var raises an error.

This is all quite straightforward, and is as would be expected.

Class Variables

Now consider this code, which has a class variable, as run in Idle (Python 3.6):


>>> class Foo:
    var = 1

>>> a = Foo()
>>> 'var' in a.__dict__
False
>>> 'var' in Foo.__dict__
True

>>> Foo.var
1
>>> a.var
1
>>> a.var = 2
>>> a.var
2
>>> Foo.var
1
>>> a.__class__.var
1

All as would be expected, the __dict__ results are reversed, this time the class Foo initially has var, and the instance a does not.

But even though a does not have a var attribute,  a.var returns 1, because when there is no instance variable var, Python will return a class level variable if one is present with the same name.

However, assignment does not fall back to class level variables, so setting a.var = 2 actually creates an instance variable var and reviewing the __dict__ data now reveals  both that the class level object and instance each have var. Once an instance variable is present, the class level variable is hidden from access using a.var which will now access the instance variable. In this way, code can add an instance variable replacing the value of class variable which provides what can be effectively a default value until the instance value is set.
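The fall-back on reading, and the shadowing on assignment, can be confirmed with a few lines. Deleting the instance variable makes the class level value visible again.

```python
class Foo:
    var = 1


a = Foo()
print("var" in a.__dict__, a.var)            # False 1

a.var = 2                                    # creates an instance variable
print("var" in a.__dict__, a.var, Foo.var)   # True 2 1

del a.var                                    # remove the instance variable
print(a.var)                                 # 1
```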

Simple Usage Example

Consider the following Python class:


class XYPoint:

    scale = 2.0
    serial_no = 0

    def __init__(self, x, y):
        self.serial_no = self.serial_no + 1
        self.__class__.serial_no = self.serial_no
        self.x = x
        self.y = y

    def scaled_vector(self):
        vector = (self.x**2 + self.y**2)**.5
        return vector * self.scale

    def set_scale(self, scale):
        self.__class__.scale = scale

    def __repr__(self):
        return f"XYPoint#{self.serial_no}(x={self.x},y={self.y})"

The class encloses an (x,y) coordinate, and has a method scaled_vector() which calculates the length of the vector to that coordinate (using Pythagoras’ theorem), and then scales that length by the class variable scale.

If the class level variable scale is changed, automatically all XYPoint objects will return vectors using the new scale.

Being a class variable, scale can be read as self.scale but must be set as in the set_scale() method by self.__class__.scale = scale.

The other use case illustrated is the instance counter serial_no, which provides each instance of XYPoint a unique, incrementing serial_no.
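To see both uses in action, here is the class again with a few calls. The coordinates are arbitrary, picked so that the lengths come out as whole numbers.

```python
class XYPoint:

    scale = 2.0
    serial_no = 0

    def __init__(self, x, y):
        # Reads the class variable, then creates an instance variable.
        self.serial_no = self.serial_no + 1
        self.__class__.serial_no = self.serial_no
        self.x = x
        self.y = y

    def scaled_vector(self):
        vector = (self.x**2 + self.y**2)**.5
        return vector * self.scale

    def set_scale(self, scale):
        self.__class__.scale = scale

    def __repr__(self):
        return f"XYPoint#{self.serial_no}(x={self.x},y={self.y})"


p1 = XYPoint(3, 4)
p2 = XYPoint(6, 8)
print(p1, p2)                 # XYPoint#1(x=3,y=4) XYPoint#2(x=6,y=8)
print(p1.scaled_vector())     # 10.0

# Changing the class variable through one instance affects every instance.
p2.set_scale(0.5)
print(p1.scaled_vector())     # 2.5
```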

Python Annotations

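Python syntax allows for two flavours of annotations:

  • Variable annotations: introduced in Python 3.6
  • Function annotations: introduced in Python 3.0 (2008)

Introduction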

Both flavours of annotation work in the same manner.

They build a dictionary called __annotations__ which stores the list of annotations for a function, a module or a class.

It is common practice to annotate with types, such as int or str, but Python language implementation allows any valid expression. Using types can make Python look like other languages which have similar syntax and require types, and is one of the motivations for annotations in Python. Third-party tools may report expressions which are not types as errors, but Python itself currently allows any expression.

The Python dataclass now makes use of annotations.  Outside of this use, if you are not using an external type validator like mypy there seems little incentive to bother with type annotations, but  if you are going to document what type a variable should be, then annotation is the optimum solution.

The following code illustrates variable annotation at class level, module level, and local level:


>>> class Foo:

    class_var1: "Annotation"
    class_var2: "Another" + " annotation" = 3
    class_int: int

    def func(self):
        local1 : int
        local2 : undeclared_var
        self.variable : undeclared * 2 = 7
>>> module_var: Foo = Foo()

>>> module_var2: 3*2 = 7
>>> module_var3: "another" + " " + "one"
>>> Foo.__annotations__
{'class_var1': 'Annotation', 'class_var2': 'Another annotation',
'class_int': <class 'int'>}
>>> __annotations__
{'module_var': <class '__main__.Foo'>, 'module_var2': 6, 'module_var3': 'another one'}
>>> f = Foo()
>>> f.func()
>>> f.variable
7

Class Variables: The code annotates 3 identifiers at Foo scope (these identifiers, and their annotations, all then appear in Foo.__annotations__). Note that only class_var2 is actually a variable and will appear in a dir() for Foo.  class_var1 and class_int appear in __annotations__ but are not actually created as variables.

Module Variables: Three module_var variables are annotated at the module level, and all appear in __annotations__. Again, module_var3 does not appear in globals, as annotation itself does not actually create the variable; it solely creates the entry in __annotations__.  (module_var and module_var2 are assigned values, so are actual variables).

Local & Instance Variables: The func within the Foo class illustrates two local annotations, one of which uses an undeclared_var. This use of an undeclared identifier would generate an error with either class or module variables, because in those cases the expression is evaluated for the relevant __annotations__ dictionary. The expressions for local and instance variable annotations are not evaluated. At this stage, I have not found where, or how, the annotation data is stored.
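The difference between annotating a name and creating a variable can be checked with hasattr(). This is a small sketch with made-up names; not_defined_anywhere is deliberately left undefined.

```python
class Foo:
    declared_only: int          # annotated, but no value assigned
    with_value: str = "hello"   # annotated and assigned

    def func(self):
        # Local annotation: the expression is not evaluated, so the
        # undefined name below does not raise an error.
        local: not_defined_anywhere
        return "ok"


print(list(Foo.__annotations__))       # ['declared_only', 'with_value']
print(hasattr(Foo, "declared_only"))   # False
print(hasattr(Foo, "with_value"))      # True
print(Foo().func())                    # ok
print(Foo.func.__annotations__)        # {}
```

The empty dictionary on the last line shows that the local annotation inside func is not recorded in the function's own __annotations__, which only covers parameters and the return value.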

The PEP for variable annotations is available here. Note the stated goal is to enable third party type checking utilities, even though the implementation does not restrict annotations to types. The non-goals are also very interesting.

While the practice of using annotations only with valid types might be best practice, it is worth understanding the compiler does not require this.

Function annotations: Introduced in Python 3.0 (2008)

Here is an example of function annotation:


>>> def func(p1: int, p2: "this is also an int" + " but ...") -> float:
	return p1 + p2

>>> func.__annotations__
{'p1': <class 'int'>, 'p2': 'this is also an int but ...', 'return': <class 'float'>}

The expression following the ‘:‘ (colon) character (or the ‘->' symbol) is evaluated and the result stored in the __annotations__ dictionary.

The PEP is available here, but it is the fundamentals section that is the most highly recommended reading.

What is a DSL?

With Kotlin, the term ‘Kotlin DSL’ usually refers to DSLs built in Kotlin using specific Kotlin features (as discussed in ‘Kotlin DSLs’). This page, however, is about DSLs in general.

  • Introduction:
    • DSL: The general definition
    • DSL vs ‘a DSL’
  • The Types of DSL:
    1. External DSL
    2. Internal Detached DSL
    3. Internal Augmentation DSL
  • Detached vs Augmentation DSLs: A continuity?
  • Language Augmentation in general Vs An Augmentation DSL
  • Conclusion: DSL types can be quite different

Introduction:

DSL: The general definition

The acronym ‘DSL’ stands for ‘Domain Specific Language’. A ‘Domain’ being effectively a particular application or field of expertise. ‘Specific’ is self-explanatory, but what exactly is meant by ‘language’ does warrant further exploration later.

Contrasting with ‘general purpose languages’ which attempt to allow for solving any programming problem, a DSL can be purpose designed for a specific ‘domain’ or a specific type of problem.

The term DSL is a broad term, covering some different types of DSLs.  Sometimes people use the term DSL when they are referring to a specific type of DSL, resulting in the term appearing to mean different things in different contexts.

Martin Fowler (who has written books on DSLs that can be very worthwhile reading) described two different main types of DSL, External and Internal, which differ by how they are implemented.  Next, Martin Fowler explains that the second Implementation type, Internal, itself provides two types of DSL, the Internal Mini-language and the Internal Language Extension. This results in a total of three different types of DSL.

DSL vs a DSL

There is a semantic difference between ‘language’ and ‘a language’.  Consider the two phrases “he likes to use language which is considered antiquated” and “he likes to use a language which is considered antiquated”.  The first suggests vocabulary within a language, e.g. antiquated words within the English language; the second suggests use of a language such as ancient Greek or Latin.

Similarly, ‘domain specific language’ can be thought of as terms within a language which are specific to a particular domain, while ‘a domain specific language’ suggests an entirely new language developed for use in a specific domain.

The Types of DSL: External,  Detached & Augmentation

DSLs come in two main forms: external and internal. An external DSL is a language that is parsed independently of the host general purpose language: good examples include regular expressions and CSS.  Source: Martin Fowler.

These are DSLs like SQL, or HTML: languages only applicable within a specific domain (such as databases, or web pages) which are stand-alone languages, but with functionality focused on that specific field or domain, and too limited to be used as a general purpose language.  Implementing a DSL as an external DSL enables the DSL to be unrelated to the programming language used to write the DSL.

External DSLs generally have the same goal as Detached DSLs, but are built using a different implementation method.

The key advantage for external DSLs is that by being independent of any base language, they work unchanged with any general language.  So SQL is the same DSL when working with Java, Python, Kotlin or C#.

The first problem with independent DSLs is that a task written using the DSL often also needs some general purpose language functionality. So the task will then be written in two languages: a general purpose language for part of the solution, and a DSL for another part.  The project requires two different languages.

The second problem with independent DSLs is that the features of the general purpose language are not accessible from within the DSL. This means the DSL may need to duplicate features already available in any general purpose languages. Such duplicated features are generally inferior to those in general purpose languages.  E.g. numeric expressions in SQL are not as powerful as most general purpose languages, and there is often a syntax change from the general purpose language.
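Both problems show up in even a tiny program. The sketch below uses Python's standard sqlite3 module with an in-memory database; the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE point (x REAL, y REAL)")
conn.executemany("INSERT INTO point VALUES (?, ?)", [(3, 4), (6, 8)])

# The query is SQL; the Python side only sees it as a string.
rows = conn.execute(
    "SELECT x, y FROM point WHERE x > ? ORDER BY x", (1,)
).fetchall()

# The remaining calculation is done back in the general purpose language.
lengths = [(x * x + y * y) ** 0.5 for x, y in rows]
print(lengths)   # [5.0, 10.0]
conn.close()
```

The program is written in two languages: the SQL text is opaque to Python tooling, and Python expressions cannot be used inside the query.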

2. Internal Detached DSLs

When people talk about internal DSLs I see two styles: internal mini-languages and language enhancements.

An internal minilanguage is really using an internal DSL to do the same thing as you would with an external DSL.  Source: Martin Fowler.

Unlike an external DSL, you are limited by the syntax and programming model of your host language, but you do not need to bother with building a parser. You are also able to use the host language features in complicated cases should you need to.

Martin Fowler

Under Martin Fowler’s definition, a detached DSL is the first of two types of Internal DSL.  These Internal Detached DSLs, like External DSLs, are building their own ‘mini-language’ for a specific domain.  Detached DSLs are building ‘a domain specific language’ as opposed to ‘domain specific language’ vocabulary for an existing language.  With a Detached DSL, the new stand-alone language is created within an existing language. To achieve being a standalone language, the DSL needs to be separated or ‘detached’ from the host language.  Even if such a language is ‘fully-detached’ from the host language, it will normally be the case that some host language syntax is available from within the DSL.  In all cases, the rules and syntax of the DSL will be shaped by what can be built within the framework of the host language.

This Detached DSL is the type of DSL usually referred to in the discussion of Kotlin DSLs, and Gradle build files are an example of a Groovy Internal, Detached DSL.

As the goals are the same as External DSLs in creating what can be seen as a standalone language, these DSLs ideally require little understanding of the host language by those using the DSL.  So build.gradle files require, at least in theory, almost no understanding of the Groovy language, or perhaps more realistically, an understanding of only a tiny subset of the host language.  Kotlinx.html is a Kotlin example of this type of DSL built within Kotlin, and the actual Kotlinx.html syntax can seem very different to regular Kotlin syntax, even though all code is actually Kotlin.

3. Internal Augmentation DSLs

The alternative way of using internal DSLs is quite different to anything you might do with an external DSL. This is where you are using DSL techniques to enhance the host language. A good example of this is many of the facilities of Ruby on Rails.  Martin Fowler.

Why build a complete language if you can just add features to an existing language?  This third type of DSL no longer has the goal of creating a standalone language. It is ‘domain specific language’ more as a parallel to a set of jargon words for a specific domain that can be used in a conversation that is based in English.  The jargon provides new language, but the conversation overall is still in English.   To understand the conversation, you need to know English as well as the specific jargon.  Code using an augmentation DSL will still also make use of the host language.   The program is still seen as in the original language, but using some additional definitions specific to the augmentation DSL. The goal of the augmentation DSL is to add new vocabulary or capability to an existing language, and this makes Augmentation DSLs quite different to the previous DSL types. Instead of an entire stand-alone new language, the result is an extension or augmentation to an existing ‘host’ language, effectively extending the power of the original host language to have new vocabulary and perhaps also new grammar. This enables the simple and concise expression of ideas and concepts from a specific domain while continuing to use the host language. The augmentation is to be used in combination with the power and flexibility of the host language, which allows for more general areas of programming in combination with programming for the specialist domain.

Such augmentations still require users to know the host language, but provide a more homogeneous solution than the combination of a stand-alone language with a general purpose language.   For example, while a Python program can build SQL commands to send directly to an SQL database server, an augmentation to Python such as SQLAlchemy allows the same power as the SQL language, all within the general syntax of Python.
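The flavour of an augmentation DSL can be sketched in a few lines of plain Python. This is not SQLAlchemy's API; Column and where are hypothetical helpers written only to show domain vocabulary being added to the host language through operator overloading.

```python
class Column:
    """Hypothetical helper: a named column that builds filter predicates."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        # column > value returns a function that tests one row.
        return lambda row: row[self.name] > value

    def __lt__(self, value):
        # column < value returns a function that tests one row.
        return lambda row: row[self.name] < value


def where(rows, predicate):
    """Hypothetical helper: keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


rows = [{"x": 3, "y": 4}, {"x": 6, "y": 8}]
x = Column("x")

# The filter reads like a query, yet it is ordinary Python throughout.
print(where(rows, x > 4))   # [{'x': 6, 'y': 8}]
print(where(rows, x < 4))   # [{'x': 3, 'y': 4}]
```

Because the vocabulary lives inside the host language, the rest of Python (variables, functions, comprehensions) stays available around it, at the cost of the reader needing to know Python.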

Detached vs Augmentation DSLs: A continuity?

Both Detached DSLs and Augmentation DSLs are built inside an existing language, and the same set of language features can be used to build either type of DSL.   It is only the goal that is different: build a syntax that feels detached from the host language, or build a syntax that integrates with the host language.

The reality is not every detached DSL is fully detached from the host language, and many do require knowing the host language.

There is a clear test for a fully Detached DSL:  If the DSL can be read, or written, by people with knowledge only of the DSL without needing knowledge of the host language, then it is a fully detached language. Gradle build files are an example of an internal detached DSL that passes this test, as you can write build files without knowing the host language (which can be either Groovy or Kotlin).

However,  just because the DSL syntax can be used fully detached from the host language, does not mean actual code in the DSL always will be fully detached from the host language.   For example, Gradle build files can make use of the host language syntax within the build file, and when that host syntax is used, the result is  a build file that does require a knowledge of the host language (which can actually be either Groovy or Kotlin). So for some code, even with a DSL capable of fully detached use,  working with that code will require knowledge of the host language.

Fully detached code can be designed to be  possible, but with the host language syntax available, it cannot be guaranteed all code will be fully detached.

Further, in practice many examples seek to be only partially detached from the host language.  In fact our own examples all fit this pattern, as the semi-detached code actually exists interspersed with Kotlin code and there is no goal to enable the code to be read without knowing Kotlin.

Martin Fowler quotes the examples of the Rake DSL as being able to be categorised as either an independent language or an extension, which in my terminology would suggest it is more to the centre of the continuum.

When we use the term ‘Kotlin DSL’ or even ‘Python DSL’, we mean a DSL made by augmenting Kotlin or Python with extra ‘vocabulary’ for domain specific features and, more rarely, extra grammar.  The DSL is a set of new language constructs which extends an existing language.

Technically, this is always an extended language, but if the goal is to allow the use of these extensions by themselves, you have an independent language DSL, and if the goal is to allow programs in the host language access to new additional syntax, you have a Language Extension DSL.

An Augmentation DSL vs Language Augmentation

As discussed in languages, all but the simplest human communication makes use of language augmentation, and all but the simplest programs define variables, functions and other elements that then become part of the language used elsewhere in the program.  An augmentation DSL is created when a specific block of language augmentation (definitions of variables, functions, classes or even syntax) is separated from any specific application using that augmentation, and is provided for the use of any application which may require the same functionality.

Conclusion: DSL types can be quite different.

The Rake DSL (Detached/Augmentation hybrid DSL), Gradle (Detached DSL) and HTML (External DSL): these are all greatly different examples that can all be called DSLs.

When the term DSL is used, it can refer to DSLs in general, but more often one of three entirely different types is being discussed, and discussed as if all DSLs were of that type. This can be confusing if you usually deal with one of the other DSL types.  The term DSL is itself an extension of programming jargon, but perhaps it would be useful to have three additional terms (making a four-word language extension), with an agreed adjective for each of the three types of DSL.

DSL Methodology: A key software concept.

With Kotlin the term ‘DSL’ has taken on a specific meaning, and that more specific meaning is explored in another page on Kotlin DSLs like kotlinx.html and how to write them.  DSL methodology is a key reason why the capability of a language to support domain specific extensions becomes important.

This page concentrates on the concept of DSL Methodology.  DSL methodology is to consider software development as the task of creating the component tools that allow what the program does to be expressed within a single concise function, in a manner that not only executes correctly but is also easy for a person to read, understand and, when necessary, modify.

  • Introduction to DSLs
    • DSL: the general DSL definition
    • DSL: independent language or language extension?
    • When does a program create a DSL?
  • Human Language and DSLs
    • Why consider human languages?
    • The Dictionary is an insufficient reference.
    • Jargon as a form of DSL
    • Situation Specific Language extensions
    • Specific Human language: Independent Language vs Extensions.
    • Conclusion
  • DSL Methodology: The Basics
    • Simple language extension
    • DSL Methodology: Building Blocks
    • More Layers
    • The Core concept
    • When to apply DSL Methodology
  • Implementing DSL Methodology
    • The Core concept
    • software modules: group extensions together
    • Leveraging existing Language Extensions
  • Conclusion

Introduction

DSL: The general definition

DSL is an acronym for Domain Specific Language, a ‘domain’ being effectively a specific application. In contrast to ‘general purpose languages’, which attempt to allow for solving any programming problem, a DSL can be purpose designed for a specific ‘domain’ or a specific type of problem.

DSLs can sound like using Kotlin (or Python) to build an entirely new language, and this can seem true, but the reality is simpler. All Kotlin DSLs are extensions of Kotlin, and although some use cases focus on the extension and make little use of the underlying language, that underlying language is always still there.

Two Types of DSL: Independent DSLs and Language (extensions) DSLs

Independent DSLs

There are DSLs like SQL, which are stand-alone languages for tackling a specific field or domain.  The first problem with independent DSLs is that a general purpose language is almost always also required.  This problem can be solved by combining a DSL with a general language, such as using SQL from Python. You then have two different languages, but it does work.  The second problem with independent DSLs is that the features of the general purpose language are not accessible from within the DSL, so the DSL has to duplicate features of general purpose languages, and the duplicated features are generally inferior to those in general purpose languages.  For example, numeric expressions in SQL are not as powerful as those in most general purpose languages, and there is often a syntax change from the general purpose language.
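As a minimal sketch of the "SQL from Python" combination described above, the following uses the standard library sqlite3 module with an in-memory database. The table name, column names and data are made up for illustration only.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, price REAL, qty INTEGER)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [("pen", 1.5, 10), ("book", 12.0, 2), ("bag", 30.0, 1)],
)

# The query is just a string as far as Python is concerned, and the
# expression inside it follows SQL's syntax rather than Python's.
rows = conn.execute(
    "SELECT name, price * qty FROM item WHERE price * qty > ? ORDER BY name",
    (20,),
).fetchall()

# Back in the general purpose language to work with the result.
totals = {name: total for name, total in rows}
conn.close()
```

Two languages are visible in one file: Python for the program, and SQL inside the string literals.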

Language (extended) DSLs.

There is also a solution where a DSL is an extension to a general purpose language. For example, in place of developing in SQL with Python, SQLAlchemy brings the power of the SQL language to Python.

When we use the term ‘Kotlin DSL’ or even ‘Python DSL’, we mean a DSL made by extending Kotlin or Python with extra ‘vocabulary’ for domain specific features.  The DSL is a set of new language constructs which extends an existing language.  This type of DSL is seen as the preferred solution. In programming, with an extension, there is less syntax to learn and greater consistency if, in place of a new independent language, extra ‘vocabulary’ for a language we already know can be used.
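To make the idea of extra ‘vocabulary’ concrete, here is a small sketch of a language extension written in plain Python. The names Column, Condition and select are hypothetical, invented for this illustration; this is not the API of SQLAlchemy or of any other real package.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Condition:
    """A test that can be applied to one row (a dict)."""
    test: Callable[[dict], bool]

    def __and__(self, other):
        # Combine two conditions with '&'.
        return Condition(lambda row: self.test(row) and other.test(row))


@dataclass(frozen=True)
class Column:
    """Hypothetical vocabulary: a named column that builds Conditions."""
    name: str

    def __gt__(self, value):
        return Condition(lambda row: row[self.name] > value)

    def __lt__(self, value):
        return Condition(lambda row: row[self.name] < value)


def select(rows, where):
    """Return the rows that satisfy the given Condition."""
    return [row for row in rows if where.test(row)]


price = Column("price")
qty = Column("qty")

items = [
    {"name": "pen", "price": 1.5, "qty": 10},
    {"name": "book", "price": 12.0, "qty": 2},
    {"name": "bag", "price": 30.0, "qty": 1},
]

# The query reads like a small query language, yet it is ordinary Python.
cheap_and_stocked = select(items, where=(price < 20) & (qty > 1))
names = [row["name"] for row in cheap_and_stocked]
```

Because the new vocabulary is ordinary Python, everything else in the language (variables, functions, comprehensions) remains available alongside it.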

When does a program create a DSL?

When does adding new features to an existing language create a new DSL? As with many things, the extremes are evident. The “hello world” program would not be regarded as creating a DSL, and at the other extreme, some packages clearly implement a DSL. However, if you had half the features, would it still constitute a DSL? At what point does adding additional ‘vocabulary’ reach the definition of being a DSL?

The reality is that every program, including “hello world”, creates some new vocabulary somewhere; it is just that in the case of “hello world” that vocabulary is not very useful. However, if the program is called “hello”, then the computer gains new syntax in ‘the shell’ such that typing ‘hello’ now does something, and prints “Hello world”. The shell gains “hello” as one new word of vocabulary. While one word is not much of a language extension, the concept of extending a language is common throughout programming, so the principle of building a DSL always applies at some level.

Human language and DSLs

Why consider human languages?

A significant part of the human brain has evolved specifically to process language.  Since the first move from machine code to assembler, the goal has been for computer programs also to be processed by humans.  Just how do our brains handle domain specific language?  Don’t we take years to learn one single language and find learning another quite difficult?

Jargon as a DSL equivalent

One spoken language parallel to a computer extended DSL is ‘jargon’. Many ‘domains’ evolve their own jargon or extensions to the language. Jargon communicates the concepts needed in a specific domain more concisely and more specifically than regular ‘non-jargon’ language does, but is generally used in combination with an underlying general-purpose language, such as English or French or Chinese. People who already speak a general purpose language can learn one or more jargon vocabularies in a much shorter time than it takes to learn the general-purpose language.

The dictionary is an insufficient model.

If you are reading this document, it can be assumed you can read English, and the reference for English is the dictionary.  But there may be words even on this page where the dictionary is actually not that helpful, because the dictionary does cover what words are part of the language, but not how they combine to provide meaning, or even a full understanding of the concepts behind the meaning.  Wikipedia can be a useful source of far more information behind the words, but there are also words that change with context.  All of this means the language becomes more like the syntax for expressing meaning, but it can take a lot of language to actually convey meaning.  In the same way, Python or Kotlin have a language syntax, but language extensions are built within that base syntax.  To understand the meaning of what is written, terms can become familiar to us, but until they are familiar we may have to go to the dictionary, the encyclopedia, or for context specific things like where something is, we may have to ask.  All of this is paralleled in code by reading the definition of an object or function, but once we are familiar we should not need to keep referring to that source.

Situation Specific Language extensions.

However, humans also learn far more localised and situation specific language extensions.
Consider a random page from a novel. Most of the words can be found in the appropriate language dictionary, or an encyclopedia, because they are part of the general language (e.g. English). But there are words that are not sufficiently explained by either dictionary or encyclopedia, because they have a specific meaning in the context of the novel, and the meanings are explained through the novel.  Names are one class of such words. Names can make reading a page at random a challenge. Read a random page from a novel and we skip the explanation of the specific meaning.  Just who is ‘Harry’ or ‘Sally’? Have they met?  What is their relation to the protagonist? Novels are designed to be read sequentially, so there is no index to easily find what has been defined, and usually no clear list of what is defined.  Depending on the novel, significant amounts can be specific to the novel.  Consider Lord of the Rings.  Not only are characters explored, but also types of creatures, new locations and an imaginary world.  It can be described as a “Lord of the Rings Universe” being created.

So even to navigate an individual literary work, a new extended vocabulary can be required, varying from knowledge of just a few character names through to an entire altered universe.

Specific Human language: Independent Language vs Extension.

Jargon can seem impossible to understand for a ‘layman’ who only speaks regular ‘non-jargon’ language. The reality is that jargon normally does not create a replacement for regular language, and communication in jargon alone is rarely sufficient. Even very domain specific communication requires a mixture of regular language and jargon together.

Imagine, for example, a French person and a Chinese person who are both from the same industry and use the same jargon, but other than that jargon are unable to communicate. Even adding words like ‘very’ to the jargon would be a problem. Like independent DSLs, jargon needs to coexist with a general purpose language.
Just like the French and Chinese colleagues, every independent DSL also needs some ‘normal words’, so each adds its own limited set of ‘normal words’. This means independent DSLs have to revisit many things already present in general programming languages, and the result is still restrictive.
The French and Chinese hypothetical colleagues would be far better placed if they both spoke a common regular language in addition to just the jargon.

Conclusion

Human communication in human language actually relies on language extension using a base set of rules.  This suggests that our thinking should be well adapted to the same approach within programs.

DSL Methodology: The basics

Simple Language extension.

Language extension is the core programming concept of defining things.  Even a variable definition is defining a new language element. For more significant language extensions in blocks, in Python we have ‘import’, and to increase the scope of what can be imported there is ‘pip install’.
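A minimal sketch of these simplest forms of extension follows. The names ORIGIN, distance and is_near are invented for the illustration; math is from the Python standard library.

```python
import math  # 'import' brings in a ready-made block of vocabulary

# A variable definition adds one new word to the language of this program.
ORIGIN = (0, 0)


# A function definition adds a new verb, built on the imported vocabulary.
def distance(a, b):
    return math.hypot(b[0] - a[0], b[1] - a[1])


# Later definitions can be written in terms of the earlier ones.
def is_near(point, limit=10):
    return distance(ORIGIN, point) <= limit
```

After these definitions, code such as is_near((3, 4)) can be read without thinking about square roots at all.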

DSL methodology: building blocks

It is generally agreed that there is a maximum number of lines for a well defined function. Opinions on the actual limit vary, but generally the recommendations range from what can be seen on the screen at one time through to as high as around 100 lines.

Now consider that every program is described by a single function, usually called ‘main’.  With a very simple program, all the code could be held in main, but as the code grows, that limit of around 25 to 100 lines in one function will become a constraint.

How to describe the program in the main function, and keep main small enough to read and understand?

As the program grows, the developer can move some code to ‘other functions’ and in main simply call these functions containing the moved code. This is one way the size of main can be controlled.

But simply moving ‘chunks’ of main into functions is not DSL methodology. DSL methodology is to create building-block functions that allow the logic of main to be written in a more concise way.  The logic of main stays in main; only the logic that turns steps into extensions of the language is moved to the functions.

Main stays readable if the concepts of the functions are clear.  Usually any functions with ‘moved code’ will need to be generalised to convert them into building blocks, and the individual application specific nature comes from parameters specified when those blocks are called. Then what the program does can still be clear just from considering the main function.  A new developer may not need to read beyond main to understand what the program does, may limit going beyond main to an area of functionality of particular interest, and can infer the meaning of other new ‘vocabulary’.
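The difference between ‘moved code’ and building blocks can be sketched as follows. The sample data and the function names (read_records, total_by, top) are hypothetical, chosen only to illustrate the idea: each block is generic, and the application specific choices arrive as parameters from main.

```python
def read_records(text, separator=","):
    """Generic block: split lines of text into lists of stripped fields."""
    return [
        [field.strip() for field in line.split(separator)]
        for line in text.splitlines()
        if line.strip()
    ]


def total_by(records, key_index, value_index):
    """Generic block: sum one numeric column, grouped by another column."""
    totals = {}
    for record in records:
        key = record[key_index]
        totals[key] = totals.get(key, 0) + int(record[value_index])
    return totals


def top(totals, count):
    """Generic block: the 'count' largest entries, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:count]


SALES = """\
north, pens, 10
south, pens, 4
north, books, 7
east, bags, 12
"""


def main(text=SALES):
    # The application logic stays here: which columns, and how many results.
    records = read_records(text)
    by_region = total_by(records, key_index=0, value_index=2)
    return top(by_region, count=2)


if __name__ == "__main__":
    print(main())
```

None of the three blocks knows anything about regions or sales, so main alone tells the reader what this particular program does.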

More Levels

The concept is that main describes the program at the top level, but main will need to use program specific building blocks (‘extended language’) and/or language building blocks, known as packages, which are common to several applications.  In the case of “hello world”, the only other function is the print function, and that function is considered part of the language.  If you know the Python language, you know how the Python print works.  However, if the other functions beyond main are the ‘moved code’ described above, then these other functions will most likely be unique.  They are an extension to the language for the use of this main function, and unique to this main function.  While the function names can convey what is done at a high level, to know exactly what these functions do, a person reading the program will need to go and read them.  When reading main, this is the equivalent of looking up a word in the dictionary.

As the solution grows in detail, in turn these functions become complex and will also require their own extensions to stay within size constraints. As the program system grows the number of levels of extension to the original language grows.

The Core concept.

The core concept is that the end result is a main written using an extended language, with those extensions built on other extensions. Each level should be readable without looking up in full detail what each component of the new language means.  Each level is written in terms of an underlying extended language, and the program should be broken up into components that define new language blocks or ‘jargons’, which are separate from the layers built using those jargon language blocks.

The role of each level or block is to provide the language extensions to make the level above simple to understand. The lowest level of the program is the only level of the program written in Python or Kotlin or whatever itself, as all other levels are built on the extensions.
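A small sketch of the levels, using made-up function names: only the lowest level works directly with Python's own built-in vocabulary, while each level above is written in terms of the level beneath it.

```python
# Lowest level: written directly in the base language.
def words(text):
    return text.lower().split()


def count(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def largest_key(counts):
    return max(counts, key=counts.get)


# Middle level: written in the vocabulary of the lowest level.
def word_frequencies(text):
    return count(words(text))


def commonest_word(text):
    return largest_key(word_frequencies(text))


# Top level: written in the vocabulary of the middle level.
def main(text="the cat and the hat"):
    return commonest_word(text)
```

A reader of main needs only the meaning of commonest_word, and a reader of commonest_word needs only the two terms it is built from.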

A web server will rarely be built in ‘raw’ Python alone, but will normally be built on a software stack of template engines, routing engines, database engines etc.  The language of the project becomes not just Python, but Python plus all those extensions.  Then the project may add its own extensions.

But to work in the project, you have to learn the language of each of the extensions, or at least the language of the extensions being used in the area of the project you are working in.

Every project of any scale is not just built on a language, but on the language plus the set of extensions to that language.

When to apply DSL methodology?

It follows from the concept of ‘the main function describes the program’ that a very simple program such as hello world already achieves the goal of main conveying what the program does. In fact any project simple enough to be contained in a single file, or unlikely to require changes beyond a month from when it is written, is too small in scale or time frame to benefit significantly from DSL methodology.

The main relevance of DSL methodology is for long term projects with continued updates, developed by a team producing several releases over a time scale of more than one year.

It is with this type of software that reading what the code does can become a challenge even to the author of that block of code over time.

Implementing DSL methodology

the goals

DSL methodology is simply a slightly different way to view the normal principles of sound software development. The steps to implementing it all strive to achieve these goals.

The goals of DSL methodology are:

  • allow each part of the system to be expressed in the simplest language possible and with the smallest possible language extension
  • keep system specific functionality at the highest layer possible, and avoid buried functionality
  • building blocks should be as generic as possible and able to be understood without considering the overall application
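The second and third goals can be illustrated with a short sketch. The data and function names are hypothetical. In the first helper an application rule (the age limit of 18) is buried inside the building block; in the second the block is generic and the rule is stated at the top layer.

```python
# Buried: the helper silently contains an application rule (the limit of 18).
def eligible_buried(people):
    return [p["name"] for p in people if p["age"] >= 18]


# Generic: the helper only knows how to filter records by a minimum value.
def names_where_at_least(records, field, minimum):
    return [r["name"] for r in records if r[field] >= minimum]


PEOPLE = [{"name": "Ann", "age": 17}, {"name": "Bob", "age": 18}]


def main():
    # The application specific rule is visible here, at the highest layer.
    return names_where_at_least(PEOPLE, field="age", minimum=18)
```

Both versions give the same answer today, but only the second lets a reader of main see the rule, and lets another application reuse the block with a different rule.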

software modules:  group extensions together

Building blocks should, where possible, be grouped into logical modules that together provide a specific type of new functionality.  In fact these modules should be considered to have functionality independent of the central application, and be able to have their own documentation, and perhaps their own repository.

In fact, given that such extensions should not contain the logic of the main application, it may be possible to open source these extensions, even where the main application itself would not be open source.

Leveraging existing Language Extensions

Ready-made language extensions, most often described as packages, can give great capabilities.  Selection of these packages becomes very important, as each selection becomes additional ‘extended language’ that the team must become familiar with.

The reality is that, beyond well established and quite generic packages, there are many packages that take steps towards keeping your code simple, but in your usage still leave a program too big to maintain.  The choices then are:

  • extend an existing package
  • build a new package that uses the existing package internally but exposes only a new api
  • use the existing package as inspiration and effectively fork
  • build an alternative package
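The second choice in the list above, a new package that uses an existing package internally but exposes only a new API, might look like the following sketch. KeyValueStore is a hypothetical name; the existing package being wrapped here is the standard library sqlite3 module, chosen only so the example is self-contained.

```python
import sqlite3


class KeyValueStore:
    """Hypothetical wrapper: callers see put and get, never SQL."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, key, value):
        # Store the value, replacing any earlier value for the same key.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, value),
        )
        self._conn.commit()

    def get(self, key, default=None):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else row[0]


store = KeyValueStore()
store.put("language", "Python")
store.put("language", "Kotlin")
```

The team using this wrapper has two words of new vocabulary to learn, put and get, rather than the whole of the underlying package.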

Conclusion

The end goal is to have a set of smaller packages, some as modules internal to the application and some as separate packages, perhaps even with their own lifecycle, together with the smallest possible core application.