May 26, 2022
15 min read

Lingo: A Go micro language framework for building Domain Specific Languages

Design, build and integrate your own Domain Specific Language with Lingo.


Domain Specific Languages (DSL) are small, focused languages with a narrow
domain of applicability. DSLs are tailored towards their target domain so that
domain experts can formalize ideas based on their knowledge and background.

This makes DSLs powerful tools for increasing programmer efficiency: they are
more expressive in their target domain than general-purpose languages, and
they provide concepts that reduce the cognitive load on their users.

Consider the problem of summing up the balances of different bank accounts in a
CSV file. A sample CSV file is provided in the example below where the first
column contains the name of the account holder and the second column contains
the account balance.

name, balance
Lisa, 100.30
Bert, 241.41
Maria, 151.13

You could solve the problem of summing up balances by using a general-purpose
language such as Ruby as in the code snippet
below. Apart from the fact that the code below is not very robust, it contains
a lot of boilerplate that is irrelevant to the problem at hand, i.e., summing
up the account balances.

#!/usr/bin/env ruby

exit(1) if ARGV.empty? || !File.exist?(ARGV[0])

sum = 0
File.foreach(ARGV[0]).each_with_index do |line, idx|
  next if idx == 0
  sum += Float(line.split(',')[1])
end

puts sum.round(2)

Below is an example AWK script that solves
the same problem. AWK is a DSL that was specifically designed to address
problems related to text-processing.

#!/usr/bin/awk -f

BEGIN{FS=","}{sum+=$2}END{print sum}

The Ruby program has a size of 208 characters, whereas the AWK program has a
size of 56; the AWK program is roughly 4x smaller than its Ruby counterpart.
In addition, the AWK implementation is more robust by being less prone to
glitches that may appear in the CSV file (e.g., empty newlines, wrongly
formatted data-fields). The significant difference in size illustrates that
DSLs, by being more focused on solving specific problems, can make their users
more productive: they spare their users the burden of writing boilerplate code
and narrow the focus of the language to the problem at hand.

Some popular DSLs most software developers use on a regular basis include
Regular Expressions for pattern matching, AWK for text transformation, and
Structured Query Language (SQL) for interacting with databases.

Challenges when designing Domain Specific Languages

Prototyping, designing and evolving DSLs is a
challenging process. In our experience this is an exploratory cycle where you
constantly prototype ideas, incorporate them into the language, try them out in
reality, collect feedback and improve the DSL based on the feedback.

When designing a DSL, there are many components that have to be implemented and
evolved. At a very high level there are two main components: the language
lexer/parser and the language processor. The lexer/parser
is the component that accepts input as per the language definition which is
usually specified by means of a language grammar. The parsing/lexing
phase produces a syntax tree which is then passed onto the language processor.
A language processor evaluates the syntax tree. In the example we saw earlier,
we ran both the Ruby and AWK interpreters providing our scripts and the CSV
file as input; both interpreters evaluated the scripts and this evaluation
yielded the sum of all the account balances as a result.

Tools such as parser generators can significantly reduce the effort of
lexer/parser development by means of code generation. Sophisticated DSL
frameworks such as JetBrains MPS or
Xtext also provide features that help
implement custom language support in IDEs. However, if present at all, the
support for building the language processors is usually limited to generating
placeholder functions or boilerplate code for the language components that
have to be filled-in by the DSL developer. Moreover, such large and powerful DSL
frameworks usually have a fairly steep learning curve so that they are probably
a better fit for more sophisticated DSLs as opposed to small, easily
embeddable, focused languages, which we refer to as micro languages.

In some situations, it may be worth considering working around these problems
by just relying on standard data exchange formats such as .toml, .yaml or
.json as a means of configuration. Similar to the parser generators, using
such a format may relieve some of the burden when it comes to parser
development effort. However, this approach does not help when it comes to the
implementation of the actual language processor. In addition, most standard data
exchange formats are inherently limited to representing data in terms of simple
concepts (such as lists, dictionaries, strings and numbers). This limitation
can quickly lead to bloated configuration files, as shown in the following
example.

Imagine the development of a calculator that operates on integers using
multiplication (*) and addition (+). When using a data-description language
like YAML, as in the example below, even a small term like 1 + 2 * 3 + 5 can
be hard to reason about, and adding more terms would bloat the configuration
file quickly.

term:
  add: 
    - 1
    - times:
      - 2
      - 3
    - 5

This blog post is focused on the design of micro languages. The core idea is to
provide a simple, extensible language core that can be easily extended with
custom-types and custom functions; the language can evolve without having
to touch the parser or the language processor. Instead, the DSL designer can
just focus on the concepts that ought to be integrated into the DSL by
implementing interfaces and "hooking" them into the core language
implementation.

Lingo: A micro language framework for Go

At GitLab, Go is one of our main programming languages and some of the tools we
develop required their own, small, embeddable DSLs so that users could properly
configure and interact with them.

Initially, we tried to integrate already existing, embeddable and extensible
language implementations. Our only condition was that they had to be
embeddable natively into a Go application. We explored several great free and
open-source (FOSS) projects such as go-lua,
a Lua VM implemented in Go, go-yaegi,
which provides a Go interpreter with which Go can be used as a scripting
language, and go-zygomys, a LISP
interpreter written in Go. However, these packages are essentially modules
that integrate general-purpose languages on top of which a DSL could be
built, and they ended up being fairly complex. In contrast, we wanted basic
support to design, implement, embed and evolve DSLs natively in a Go
application: something flexible, small, and simple to grasp, evolve and adapt.

We were looking for a micro language framework with the properties listed below:

  1. Stability: Changes applied to the DSL should neither require any changes to the core lexer/parser implementation nor to the language processor implementation.
  2. Flexibility/Composability: New DSL concepts (data-types, functions) can be integrated via a simple plug-in mechanism.
  3. Simplicity: the language framework should have just
    enough features to provide a foundation that is powerful enough to implement
    and evolve a custom DSL. In addition, the whole implementation of the micro
    language framework should be in pure Go so that it is easily embeddable in Go
    applications.

Since none of the available FOSS tools we looked at was able to
fulfill all of those requirements, we developed our own micro language framework
in Go called Lingo which stands for "LISP-based Domain Specific Languages
(DSLs) in Go". Lingo is completely FOSS and available in the Lingo Git repository
under the free and open source space of the Vulnerability Research Team.

Lingo
provides a foundation for building DSLs based on Symbolic Expressions
(S-expressions), i.e., expressions in the form of nested lists (f ...) where
f is a placeholder for the function symbol. Using this format, the
mathematical term we saw earlier can be written as the S-expression
(+ 1 (* 2 3) 5).

S-expressions are versatile and easy to process due to their uniformity. In
addition, they can be used to represent both code and data in a consistent
manner.
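The uniformity of S-expressions is exactly what keeps their processing small. As an illustration (independent of Lingo; all names here are made up for this sketch), the following minimal Go program tokenizes and evaluates the arithmetic term above:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tokenize splits an S-expression into parentheses and atoms.
func tokenize(src string) []string {
	src = strings.ReplaceAll(src, "(", " ( ")
	src = strings.ReplaceAll(src, ")", " ) ")
	return strings.Fields(src)
}

// parse consumes tokens and builds a nested tree: either a string
// atom or a []interface{} list. Well-formed input is assumed.
func parse(tokens []string) (interface{}, []string) {
	tok, rest := tokens[0], tokens[1:]
	if tok == "(" {
		list := []interface{}{}
		for rest[0] != ")" {
			var node interface{}
			node, rest = parse(rest)
			list = append(list, node)
		}
		return list, rest[1:] // drop the closing ")"
	}
	return tok, rest
}

// eval reduces a tree to an integer; only "+" and "*" are known.
func eval(node interface{}) int {
	switch n := node.(type) {
	case string:
		v, _ := strconv.Atoi(n)
		return v
	case []interface{}:
		op := n[0].(string)
		acc := eval(n[1])
		for _, arg := range n[2:] {
			switch op {
			case "+":
				acc += eval(arg)
			case "*":
				acc *= eval(arg)
			}
		}
		return acc
	}
	return 0
}

func main() {
	tree, _ := parse(tokenize("(+ 1 (* 2 3) 5)"))
	fmt.Println(eval(tree)) // 1 + 2*3 + 5 = 12
}
```

Note that neither the tokenizer nor the parser knows anything about "+" or "*"; only the evaluator does, which is the property Lingo exploits for its plug-in design.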

With regard to the Stability, Flexibility and Composability properties,
Lingo
provides a simple plug-in mechanism to add new functions as well as types
without having to touch the core parser or language processor. From the
perspective of the S-expression parser, the actual function symbol is
essentially irrelevant. The language processor just evaluates S-expressions
and dispatches execution to the interface implementations, which are provided
by the plug-ins based on the function symbol.
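The dispatch idea can be sketched as a registry that maps function symbols to interface implementations. The following is a simplified, hypothetical illustration of that pattern; the actual Lingo interfaces and hook functions differ:

```go
package main

import "fmt"

// Function sketches the plug-in contract: the language processor
// only ever sees this interface, never concrete functions.
type Function interface {
	Symbol() string
	Evaluate(args []int) (int, error)
}

// registry maps function symbols to their implementations.
var registry = map[string]Function{}

// HookFunction registers a plug-in under its symbol.
func HookFunction(f Function) {
	registry[f.Symbol()] = f
}

// Dispatch looks up the symbol and delegates evaluation; the
// dispatcher itself never changes when new functions are added.
func Dispatch(symbol string, args []int) (int, error) {
	f, ok := registry[symbol]
	if !ok {
		return 0, fmt.Errorf("unknown function %q", symbol)
	}
	return f.Evaluate(args)
}

// Plus is a sample plug-in implementing (+ ...).
type Plus struct{}

func (Plus) Symbol() string { return "+" }

func (Plus) Evaluate(args []int) (int, error) {
	sum := 0
	for _, a := range args {
		sum += a
	}
	return sum, nil
}

func main() {
	HookFunction(Plus{})
	res, _ := Dispatch("+", []int{1, 6, 5})
	fmt.Println(res) // prints 12
}
```

Adding a new DSL concept then amounts to writing another Function implementation and hooking it in, which is the stability property described above.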

With regard to Simplicity, the Lingo code base is roughly 3K lines of pure Go code including the lexer/parser, an
engine for code transformation, and the interpreter/evaluator. The small size
should make it possible to understand the entirety of the implementation.

Readers that are interested in the technical details of
Lingo itself can have a look at the
README.md
where the implementation details and the used theoretical foundations are explained.
This blog post focuses on how
Lingo can be used to build a DSL from scratch.

Using Lingo to design a data generation engine

In this example we are designing a data-generation engine in Go using
Lingo as a foundation. Our data generation engine may be used to generate structured input
data for fuzzing or other application contexts. This example illustrates how
you can use Lingo to create a language and the corresponding language
processor. Going back to the example from the beginning, let us assume we
would like to generate CSV files covering account balances in the format we
saw earlier.

name, balance
Lisa, 100.30
Bert, 241.41
Maria, 151.13

Our language includes the following functions:

  1. (oneof s0 s1 ... sN): randomly returns one of the parameter strings sX (0 <= X <= N).
  2. (join e0 e1 ... eN): concatenates the string representations of all argument expressions eX (0 <= X <= N).
  3. (genfloat min max): generates a random float X (min <= X <= max) and returns it.
  4. (times num exp): repeats the pattern generated by exp num times.

For this example we are using
Lingo to build the language and the language processor to automatically generate CSV
output which we are going to feed back into the Ruby and AWK programs we saw in
the introduction in order to perform a stress test on them.

We refer to our new language/tool as Random Text Generator (RTG); its scripts use the .rtg extension.
Below is a sample script script.rtg we'd like our program to digest in order
to randomly generate CSV files. As you can see in the example below, we are
joining sub-strings starting with the CSV header name, balance
after which we randomly generate 10 lines of names and balance amounts. In
between, we also randomly generate some empty lines.

(join 
  (join "name" "," "balance" "\n")
  (times 10 
    '(join 
      (oneof 
        "Jim" 
        "Max" 
        "Simone" 
        "Carl" 
        "Paul" 
        "Karl" 
        "Ines" 
        "Jane" 
        "Geralt" 
        "Dandelion" 
        "Triss" 
        "Yennefer" 
        "Ciri") 
      "," 
      (genfloat 0 10000) 
      "\n" 
      (oneof "" "\n"))))

Our engine takes the script above written in RTG and generates random CSV
content. Below is an example CSV file generated from this script.

name,balance
Carl,25.648205
Ines,11758.551

Ciri,13300.558
...

For the remainder of this section, we explore how we can implement a
data generation engine based on Lingo. The implementation of RTG requires
two main ingredients: (1) a float data type and a result object to integrate a float
representation and (2) implementations for the times, oneof, genfloat and
join functions.

Introducing a float data type and result objects

Lingo differentiates between data types and result objects. Data types indicate how data is
meant to be used and result objects are used to pass intermediate results
between functions where every result has a unique type. In the code snippet
below, we introduce a new float data type. The comments in the code snippet below
provide more details.

// introduce float type
var TypeFloatId, TypeFloat = types.NewTypeWithProperties("float", types.Primitive)
// introduce token float type for parser
var TokFloat = parser.HookToken(parser.TokLabel(TypeFloat.Name))

// matcher that recognizes float literals such as 1.5
type FloatMatcher struct{}

// this function is used by the parser to "recognize" floats as such
func (i FloatMatcher) Match(s string) parser.TokLabel {
  if !strings.Contains(s, ".") {
    return parser.TokUnknown
  }

  if _, err := strconv.ParseFloat(s, 32); err == nil {
    return TokFloat.Label
  }

  return parser.TokUnknown
}

func (i FloatMatcher) Id() string {
  return string(TokFloat.Label)
}

func init() {
  // hook matcher into the parser
  parser.HookMatcher(FloatMatcher{})
}

In addition, we also require a result object which we can use to pass around
float values. This is an interface implementation where most of the function names
are self-explanatory. The important bit is the Type function
that returns our custom float type we introduced in the last snippet.

type FloatResult struct{ Val float32 }

// deep copy
func (r FloatResult) DeepCopy() eval.Result { return NewFloatResult(r.Val) }

// returns the string representation of this result type
func (r FloatResult) String() string {
  return strconv.FormatFloat(float64(r.Val), 'f', -1, 32)
}

// returns the data type for this result type
func (r FloatResult) Type() types.Type { return custtypes.TypeFloat }

// call-back that is invoked when the environment is cleaned up
func (r FloatResult) Tidy() {}

func (r FloatResult) Value() interface{} { return r.Val }

func (r *FloatResult) SetValue(value interface{}) error {
  floatVal, ok := value.(float32)
  if !ok {
    return fmt.Errorf("invalid type for Float")
  }
  r.Val = floatVal
  return nil
}

func NewFloatResult(value float32) *FloatResult {
  return &FloatResult{
    value,
  }
}

Implementing the DSL functions

Similar to the data type and result object, implementing a DSL function is
as simple as implementing an interface. In the example below we implement the
genfloat function as an example. The most important parts are the Symbol(),
Validate() and Evaluate() functions. The Symbol() function returns the
function symbol which is genfloat in this particular case.

Both the Validate() and Evaluate() functions take the environment env
and the parameter stack stack as parameters. The environment is used to store
intermediate results, which is useful when declaring/defining variables. The
stack contains the input parameters of the function. For
(genfloat 0 10000), the stack would consist of two IntResult parameters,
0 and 10000, where IntResult is a standard result object already provided by the
core implementation of Lingo. Validate() makes sure that the parameters can be
digested by the function at hand, whereas Evaluate() actually invokes the
function. In this particular case, we generate a float value within the
specified range and return the corresponding FloatResult.

type FunctionGenfloat struct{}

// returns a description of this function
func (f *FunctionGenfloat) Desc() (string, string) {
  return fmt.Sprintf("%s%s %s%s",
    string(parser.TokLeftPar),
    f.Symbol(),
	"min max",
	string(parser.TokRightPar)),
	"generate float in rang [min max]"
}

// this is the symbol f of the function (f ...)
func (f *FunctionGenfloat) Symbol() parser.TokLabel {
  return parser.TokLabel("genfloat")
}

// validates the parameters of this function which are passed in
func (f *FunctionGenfloat) Validate(env *eval.Environment, stack *eval.StackFrame) error {
  if stack.Size() != 2 {
    return eval.WrongNumberOfArgs(f.Symbol(), stack.Size(), 2)
  }

  for idx, item := range stack.Items() {
    if item.Type() != types.TypeInt {
	  return eval.WrongTypeOfArg(f.Symbol(), idx+1, item)
	}
  }
  return nil
}

// evaluates the function and returns the result
func (f *FunctionGenfloat) Evaluate(env *eval.Environment, stack *eval.StackFrame) (eval.Result, error) {
  var result float32
  rand.Seed(time.Now().UnixNano())
  for !stack.Empty() {
    max := stack.Pop().(*eval.IntResult)
    min := stack.Pop().(*eval.IntResult)

	minval := float32(min.Val)
	maxval := float32(max.Val)

	result = minval + (rand.Float32() * (maxval - minval))
  }

  return custresults.NewFloatResult(result), nil
}

func NewFunctionGenfloat() (eval.Function, error) {
  fun := &FunctionGenfloat{}
  parser.HookToken(fun.Symbol())
  return fun, nil
}

Putting it all together

After implementing all the functions, we only have to register/integrate them
(eval.HookFunction(...)) so that Lingo properly resolves them when processing
the program. In the example below, we are registering all of the custom functions
we implemented, i.e., times, oneof, join, genfloat. The main()
function in the example below includes the code required to evaluate our script
script.rtg.

// register function
func register(fn eval.Function, err error) {
  if err != nil {
    log.Fatalf("failed to create %s function: %s", fn.Symbol(), err.Error())
  }
  err = eval.HookFunction(fn)
  if err != nil {
    log.Fatalf("failed to hook function: %s", err.Error())
  }
}

func main() {
  // register custom functions
  register(functions.NewFunctionTimes())
  register(functions.NewFunctionOneof())
  register(functions.NewFunctionJoin())
  register(functions.NewFunctionGenfloat())
  register(functions.NewFunctionFloat())
  if len(os.Args) <= 1 {
    fmt.Println("No script provided")
    os.Exit(1)
  }
  // evaluate script
  result, err := eval.RunScriptPath(os.Args[1])
  if err != nil {
    fmt.Println(err.Error())
    os.Exit(1)
  }

  // print output
  fmt.Print(strings.ReplaceAll(result.String(), "\\n", "\n"))

  os.Exit(0)
}

The source code for RTG is available
here. You can find information
about how to build and run the tool in the
README.md.

With approx. 300 lines of Go code, we have successfully designed a language and
implemented a language processor. We can now use RTG to test the robustness of
the Ruby (computebalance.rb) and AWK scripts (computebalance.awk) we used
at the beginning to sum up account balances.

timeout 10 watch -e './rtg script.rtg > out.csv && ./computebalance.awk out.csv'
timeout 10 watch -e './rtg script.rtg > out.csv && ./computebalance.rb out.csv'

The experiment above shows that the files generated with RTG can be
properly digested by the AWK script, which is much more robust: it copes
with all generated CSV files. In contrast, executing the Ruby script
results in errors because it cannot properly cope with the empty lines that
appear in the CSV file.

Cover image by Charles Deluvio on Unsplash

We want to hear from you

Enjoyed reading this blog post or have questions or feedback? Share your thoughts by creating a new topic in the GitLab community forum.
