· 12 min read
Jurgen Vinju

Vallang (https://github.com/usethesource/vallang) is a highly integrated and mostly-closed collection of mutually recursive fundamental data-types on the Java Virtual Machine:

  • locations represented by URIs: |java+class://java/lang/String| and |file:///tmp/HelloWorld.java|
  • integers of arbitrary size: 1,2,3, 134812345123841234
  • reals of arbitrary size, precision and scale: 1., 1.0, 1e10
  • rational numbers: 1r1, 1r7
  • unicode strings: "hello 🌐"
  • lists: [1,2, 1.0, "hello 🌐"], []
  • sets: {1,2, 1.0, "hello 🌐"}, {}
  • maps: (1:0, "a":"b"), ()
  • n-ary tuples with named fields: <1,2,"a",1.0>, <>
  • n-ary relations (represented as sets of n-ary tuples): {<1,2,"a",1.0>, <>}
  • tree nodes: "myNode"(1,2,3)
  • many-sorted algebraic terms, acting as typed tree nodes: myNode(1,2,3).
  • keyword fields or properties to tree nodes and algebraic data-types: "myNode"(name="Winston"), myNode(age=12)
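
To give a first impression of the Java API, here is a minimal sketch of constructing some of these values through a value factory. This is a sketch, not documentation: the class and method names (ValueFactory.getInstance(), vf.integer, vf.node, and so on) are as we recall them from recent vallang versions and may differ per release.

import io.usethesource.vallang.IList;
import io.usethesource.vallang.INode;
import io.usethesource.vallang.ISet;
import io.usethesource.vallang.IValue;
import io.usethesource.vallang.IValueFactory;
import io.usethesource.vallang.impl.persistent.ValueFactory;

public class ValuesDemo {
  public static void main(String[] args) {
    // the default persistent, maximally shared factory implementation
    IValueFactory vf = ValueFactory.getInstance();

    // atomic values: arbitrary-size integers and unicode strings
    IValue one = vf.integer(1);
    IValue big = vf.integer("134812345123841234");
    IValue hello = vf.string("hello 🌐");

    // collections may be heterogeneous; element types widen accordingly
    IList xs = vf.list(one, vf.real(1.0), hello); // list[value]
    ISet s = vf.set(one, vf.integer(2));          // set[int]

    // an untyped tree node with positional children
    INode n = vf.node("myNode", one, vf.integer(2), vf.integer(3));

    System.out.println(xs); // prints the standard textual notation
  }
}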

Operations on these data-types are too many to list here. A selection is listed below, with a small Java sketch following the list, but you should expect the features to be fairly low-level; i.e. they directly access and manipulate the data rather than providing analysis algorithms. Algorithms are added to the library only if programming them below the abstraction layer of vallang provides a major efficiency benefit, or if it factors out highly common client code into a reusable feature. More on this design decision later.

  • relational calculus operators such as transitive (reflexive) closure, query and projections, compositions and joins
  • generic tree traversal and primitives for implementing pattern matching
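
For example, here is a minimal sketch of taking the transitive closure of a binary relation, assuming the asRelation()/closure() methods of recent vallang versions:

import io.usethesource.vallang.ISet;
import io.usethesource.vallang.IValueFactory;
import io.usethesource.vallang.impl.persistent.ValueFactory;

public class RelationDemo {
  public static void main(String[] args) {
    IValueFactory vf = ValueFactory.getInstance();

    // a binary relation is just a set of 2-tuples: {<1,2>, <2,3>}
    ISet edges = vf.set(
        vf.tuple(vf.integer(1), vf.integer(2)),
        vf.tuple(vf.integer(2), vf.integer(3)));

    // transitive closure adds the derived edge <1,3>
    ISet reachable = edges.asRelation().closure();

    System.out.println(reachable);
  }
}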

Vallang has a type system based on the type systems of functional programming languages, but note that every value always carries its most specific dynamic type at run-time. More on the design of the type system below, but here is a list of vallang types:

  • void - the bottom type with no values
  • value - the top type for all values
  • loc - the type for URI locations
  • int, real, rat are all sub-types of the aggregate type num
  • tuple[t1,...,tn] and tuple[t1 l1, ..., tn ln] to represent tuples of fixed but arbitrary arity
  • list[t], set[t], map[t1,t2] as incomparable alternative collection types.
  • node for trees
  • user-defined many-sorted mutually recursive algebraic data-types, acting as grammars for instances of tree nodes: data MyADT = myNode(int a, int b, int c, int age=...)
  • alias types for short-handed type equivalences
  • rel[t1, ..., tn] is an alias for set[tuple[t1,...,tn]]
  • open type parameters with upper bounds, &T <: node, can be used to type parameterize composite types (used in type aliases) and to construct higher-order abstract algebraic datatypes.

Sub-typing is co-variant for the container types list, map, set and tuple. Otherwise, the following rules define the entire type system (a small Java sketch follows the list):

  • value is a strict supertype of all types other than itself
  • node is the common supertype of all algebraic data-types
  • each algebraic constructor is a strict sub-type of its abstract sort
  • void is the sub-type of all types
  • num is the supertype of rat, int and real
  • an alias alias X = Y is a type equivalence
  • constructor types are sub-types if they have the same name and arity, and they are comparable in their argument types.
    • Within a single abstract sort no two alternative constructors may be sub-types of each other.
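
As an illustration, here is a minimal sketch that exercises these rules through the TypeFactory (method names as in recent vallang versions):

import io.usethesource.vallang.type.Type;
import io.usethesource.vallang.type.TypeFactory;

public class TypesDemo {
  public static void main(String[] args) {
    TypeFactory tf = TypeFactory.getInstance();

    Type intType = tf.integerType();
    Type numType = tf.numberType();

    // void <: int <: num <: value
    System.out.println(intType.isSubtypeOf(numType));        // true
    System.out.println(numType.isSubtypeOf(tf.valueType())); // true
    System.out.println(tf.voidType().isSubtypeOf(intType));  // true

    // co-variance of containers: list[int] <: list[num]
    System.out.println(tf.listType(intType).isSubtypeOf(tf.listType(numType))); // true

    // least upper bounds follow the same lattice: lub(int, real) = num
    System.out.println(intType.lub(tf.realType())); // num
  }
}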

There exists an extension mechanism for adding type kinds and their associated value kinds to the vallang system. Rascal, for example, uses this to represent functions and co-routines at run-time. The extension mechanism works by declaring a bi-directional transformation between the extensions and a symbolic representation of choice (chosen freely from the core representation mechanisms of vallang). This bidirectional mapping is mostly used when serializing and deserializing values (see below).

The types of vallang in Java are represented by a Composite design pattern with maximally shared instances of (the otherwise opaque) abstract class Type. These types expose fast implementations of sub-type and type equivalence for implementing fast pattern matching.

The values of vallang are all instances of IValue and sub-interfaces thereof. For every kind of value there is an interface, e.g. ISet, IList, ITuple and IConstructor but they are not type-parametrized because Java's type system can not represent the aforementioned co-variant sub-typing rules we require.

Why does Vallang exist?

vallang is a UseTheSource project recently renamed from rascal-values, which was known earlier as pdb.values.

The project started in 2007 as a part of the IDE Metatooling Platform, as a generic library for representing symbolic facts about source code for use in the construction of IDE services in Eclipse. It then continued to form the basis of the run-time system for Rascal starting in 2009, and finally was renamed to vallang to serve a wider scope.

We designed vallang based on our experience with, and study of, the ATerm library and ASF+SDF, but also by learning from RSF (Rigi Standard Format), Rscript, GXL and S-expressions. Perhaps JSON and YAML have also had a minor influence.

The main purpose of vallang is to provide a flexible and fully typed collection of symbolic representations of data, specifically "ready" to represent facts about software systems but amenable to basically any form of symbolic data analysis.

This purpose aligns with the mission of the Rascal metaprogramming language which is made to analyze and manipulate exactly such symbolic representations. Therefore vallang is the run-time environment for both interpreted and compiled Rascal programs.

Note that while vallang is a great fit for symbolic data analysis, it is currently not the best fit for numerical data analysis, as it features only a uniform symbolic representation of numbers of arbitrary precision and size (ints, reals, rationals). In other words, the numbers and collections of numbers in vallang are optimized for storage size, clarity and equational reasoning rather than optimal computational efficiency. This also means that indirect numerical encodings of data (i.e. using numerical vectors and matrices), which are often used in symbolic analyses to optimize computational efficiency, are not the right strategy when using vallang: it is better to stick with a more direct symbolic representation and let the vallang maintainers optimize it.

Next to the maintainers of Rascal, the main users of vallang are currently programmers who write data acquisition and (de)serialisation adapters for the Rascal ecosystem:

  • connecting open-compiler front-ends to Rascal
  • providing external data-sources such as SQL and MongoDB databases
  • connecting reusable interaction and visualization front-ends to Rascal

Nevertheless vallang is a generic and Rascal-independent library which may serve as the run-time system for other programming languages or analysis systems, such as term rewriting systems, relational calculus systems, constraint solvers, model checkers, model transformations, etc.

The latter perspective is the reason for the re-branding of rascal-values to vallang. You might consider vallang as a functional replacement for ECore, an alternative to the ATerm library on the JVM, or an alternative to JSON-based noSQL in-memory database systems, or a way of implementing graph databases.

Finally, vallang is a JVM library because that is where we needed it for Rascal and the Eclipse IDE Metatooling Platform. We hope other JVM programmers will also benefit from it, and we have no plans at the moment of porting it to any other technological space.

What are the main design considerations of Vallang?

Vallang values are symbolic and immutable.

We think software analysis is complex enough to be confusing to even the most experienced programmers. Manipulating huge stores of hierarchical and relational data about software easily goes wrong; trivial bugs due to aliasing and sharing data between different stages of an analysis or transformation can take weeks to resolve, or worse: will never even be diagnosed.

Since our goal is to provide many more variants of all kinds of software analyses, we wish to focus on the interesting algorithmic details rather than the trivial mistakes we make. Therefore, vallang values are immutable. Sharing of values or parts of values is allowed under the hood, but is not observable. The library is implemented using persistent and/or maximally shared data structures for reasons of efficiency.

Users of vallang freely share references to their data with other parts of an analysis, because they know the data can not change due to an unforeseen interaction. We also believe that the immutable values can be shared freely between threads on the JVM, but there are not enough tests yet to make such a bold claim with full confidence.

Vallang values are generated via the AbstractFactory design pattern and do not leak implementation representations

The reason is that client code must abstract from the implementation details to arrive at the mathematical precision of symbolic reasoning which vallang should provide.

This also serves a maintenance and evolution purpose for implementations of the library. We can plug in a new implementation of the library without affecting client code.

Note that for efficiency reasons values produced from different implementations of an abstract value factory (different implementations of IValueFactory) are not required to interact correctly.

Vallang values uniquely deserialize/serialize from/to a standard and simple expression language

The general rule is that for any two JVM object references o and p to any vallang objects, the following rule holds: o.toString().equals(p.toString()) <==> o.equals(p)

We currently test this rule by random testing; it sometimes fails due to a deprecated feature called "annotations", which we are removing to make the above contract true.
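
The following sketch shows the round trip this contract enables, assuming the StandardTextReader that ships with vallang (the exact read overloads may differ between versions):

import java.io.IOException;
import java.io.StringReader;

import io.usethesource.vallang.IValue;
import io.usethesource.vallang.IValueFactory;
import io.usethesource.vallang.impl.persistent.ValueFactory;
import io.usethesource.vallang.io.StandardTextReader;

public class RoundTripDemo {
  public static void main(String[] args) throws IOException {
    IValueFactory vf = ValueFactory.getInstance();
    IValue original = vf.set(vf.integer(1), vf.string("a"));

    // serialize to the standard textual notation ...
    String text = original.toString();

    // ... and parse it back again
    IValue parsed = new StandardTextReader().read(vf, new StringReader(text));

    // the contract guarantees a loss-free round trip
    System.out.println(original.equals(parsed)); // true
  }
}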

The intended effects of the toString/equals contract of vallang are the following:

  • What-you-see-is-what-you-get: debugging values by printing them gives you, as a programmer, full disclosure about the meaning of the object
  • Structural equality and equational reasoning: the context in which values are created can not have any influence on their identity
  • Sharing is safe
  • Serialisation and deserialisation is never lossy
  • The sub-type relation between types coincides exactly with the sub-language relation between the sets of sentences for all values of the given types.

The latter point is one of the main reasons why vallang is called a language. The result of anyValue.toString() is a member of a precisely defined textual language. The full textual language is generated from the value type, and sub-languages are generated from the respective sub-types. void is the empty language. In this manner the types of vallang act like non-terminals of a precise context-free grammar. The vallang language as defined above is a strict sub-language of the Expression sub-language of Rascal.

The other reason why vallang is named a language is that the IValue interface and its sub-interfaces are seen as a closed combinator language for computations on the values, and their implementations are interpreters for this language.

Vallang values always know their most-precise concrete ad-hoc run-time type

  • This is nice for debugging purposes: the types are descriptions of values, and if matching or equality checking fails, the type abstraction usually explains why without having to read the entire value.
  • Types may be computed lazily or eagerly, cached or not. This is not functionally observable, but it may affect run-time efficiency.
  • Having precise run-time types for every (nested) value, and having efficient access to them, is a prerequisite for building fast and type-safe rank-2 polymorphic higher-order functional computations. Or in functional terms: you need this to make folds and maps work on heterogeneous recursive and open data-types. Or in simpler terms: using this we can build statically type-safe data traversal and data transformation features into Rascal. A small sketch follows this list.
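
For example, a minimal sketch (same assumed factory API as above) of observing the most precise dynamic type via IValue.getType():

import io.usethesource.vallang.IValue;
import io.usethesource.vallang.IValueFactory;
import io.usethesource.vallang.impl.persistent.ValueFactory;

public class DynamicTypesDemo {
  public static void main(String[] args) {
    IValueFactory vf = ValueFactory.getInstance();

    // every value carries its most precise type at run-time
    IValue xs = vf.list(vf.integer(1), vf.integer(2));
    System.out.println(xs.getType()); // list[int]

    // mixing int and real widens the element type to num
    IValue ys = vf.list(vf.integer(1), vf.real(1.0));
    System.out.println(ys.getType()); // list[num]
  }
}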

Vallang values include both trees and relations

Even though both trees and relations are generic enough to represent any data, sometimes a graph or table is more natural than a tree, and sometimes the other way around.

  • trees are nice for abstract and concrete syntax representations
  • trees are nice for abstract symbolic domains, such as terms for constraint variables and binary constraints
  • relations are nice for graph-like unstructured data, such as project dependencies, call graphs, etc.
  • relations are nice for access to external data stored in spreadsheets and databases
  • trees are nice for access to web data stored in HTML, XML, JSON formats etc.
  • trees are good for transformation purposes, where we parse something, rewrite it and unparse it again
  • relations are good for analysis purposes, where we extract facts, elaborate on them and finally report the result.

Rascal is a language which can be used to easily switch between different representations of the same information, using pattern matching, querying, comprehensions, etc. From vallang you should not expect any help in this regard: the choice of representation for any information is a key design decision for the user of vallang.

Who contributed to Vallang?

  • Robert M. Fuhrer (IBM TJ Watson)
  • Jurgen J. Vinju (IBM TJ Watson and Centrum Wiskunde & Informatica)
  • Arnold Lankamp (Centrum Wiskunde & Informatica)
  • Michael Steindorfer (Centrum Wiskunde & Informatica and TU Delft)
  • Davy Landman (Centrum Wiskunde & Informatica and SWAT.engineering)
  • Paul Klint (Centrum Wiskunde & Informatica)

and occasional contributions from others; please see GitHub's factual overview.

What is in the near future for Vallang?

  1. Removal of the "annotations" feature, which is completely replaced by the "keyword fields" feature. The main differences between these features are:
    • While they both offer extensibility to the set of names and typed fields of nodes and constructors, annotations can never influence equals() while keyword fields always do.
    • Syntactically the notation for keyword fields is more compact: f()[@myAnno=1] versus f(myField=1)
  2. Further integration of the capabilities of Capsule for persistent and optimized immutable collections under the hood of IMap, ISet, IRelationAlgebra:
    • Reflexive relations with two indices (for both columns)
    • Heterogeneous collections of numbers (unboxing down to primitive types to save space)
    • Smooth and incremental transitions from map to multimap representations
  3. IBag, the bag[&T] type

· One min read
Mark Hills

Helping PHP developers using Rascal-based code analysis.

@inproceedings{hillsicpc2016,
  author = "Mark Hills",
  title = "Navigating the WordPress Plugin Landscape",
  fulltext = "http://www.cs.ecu.edu/hillsma/publications/icpc-plugins-2016.pdf",
  booktitle = "Proceedings of the 2016 {IEEE} 24th International Conference on Program Comprehension",
  year = 2016,
  location = "Austin, TX",
  month = may,
}

WordPress includes a plugin mechanism that allows user-provided code to be executed in response to specific system events and input/output requests. The large number of extension points provided by WordPress makes it challenging for new plugin developers to understand which extension points they should use, while the thousands of existing plugins make it hard to find existing extension point handler implementations for use as examples when creating a new plugin. In this paper, we present a lightweight analysis, supplemented with information mined from source comments and the webpages hosted by WordPress for each plugin, that guides developers to the right extension points and to existing implementations of handlers for these extension points. We also present empirical information about how plugins are used in practice, providing guidance to both tool and prospective plugin developers.

· 2 min read
Michael Steindorfer

This paper won a Best Paper award at ICPE 2016 in Delft.

@inproceedings{icpe2016-steindorfer,
  author = {Michael Steindorfer and Jurgen J. Vinju},
  title = {Performance Modeling of Maximal Sharing},
  booktitle = {7th ACM/SPEC International Conference on Performance Engineering (ICPE)},
  year = 2016,
  fulltext = "http://homepages.cwi.nl/~jurgenv/papers/ICPE16.pdf"
}

It is noticeably hard to predict the effect of optimization strategies in Java without implementing them. “Maximal sharing” (a.k.a. “hash-consing”) is one of these strategies that may have great benefit in terms of time and space, or may have detrimental overhead. It all depends on the redundancy of data and the use of equality. We used a combination of new techniques to predict the impact of maximal sharing on existing code: Object Redundancy Profiling (ORP) to model the effect on memory when sharing all immutable objects, and Equals-Call Profiling (ECP) to reason about how removing redundancy impacts runtime performance. With comparatively low effort, using the MAximal SHaring Oracle (MASHO), a prototype profiler based on ORP and ECP, we can uncover optimization opportunities that otherwise would remain hidden. We report on the experience of applying MASHO to a real and complex case: we conclude that ORP and ECP combined can accurately predict gains and losses of maximal sharing, and also that (by isolating variables) a cheap predictive model can sometimes provide more accurate information than an expensive experiment can.

· 2 min read
Davy Landman

Check out these two related articles on the empirical relation between the CC and SLOC source code metrics.

@ARTICLE{jsep2015-landman,
author = { Davy Landman and Alexander Serebrenik and Eric Bouwers and Jurgen J. Vinju },
title = { {Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions} },
journal = { Journal of Software: Evolution and Process },
year = { 2015 },
doi = { 10.1002/smr.1760 },
fulltext = "http://homepages.cwi.nl/~landman/docs/Landman2015-ccsloc-jsep2015-preprint.pdf",
datalink = { http://homepages.cwi.nl/~landman/jsep2015/ },
}

@INPROCEEDINGS{Landman2014,
author = { Davy Landman and Alexander Serebrenik and Jurgen J. Vinju },
title = { {Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods} },
booktitle = { 30th IEEE International Conference on Software Maintenance and
Evolution, ICSME 2014 },
year = { 2014 },
datalink = { http://homepages.cwi.nl/~landman/icsme2014/ },
fulltext= "http://homepages.cwi.nl/~landman/docs/Landman2014-ccsloc-icsme2014-preprint.pdf"
}

Measuring the internal quality of source code is one of the traditional goals of making software development into an engineering discipline. Cyclomatic Complexity (CC) is an often used source code quality metric, next to Source Lines of Code (SLOC). However, the use of the CC metric is challenged by the repeated claim that CC is redundant with respect to SLOC due to strong linear correlation.

We conducted an extensive literature study of the CC/SLOC correlation results. Next, we tested correlation on large Java (17.6 M methods) and C (6.3 M functions) corpora. Our results show that the linear correlation between SLOC and CC is only moderate, owing to increasingly high variance. We further observe that aggregating CC and SLOC, as well as performing a power transform, improves the correlation.

Our conclusion is that the observed linear correlation between CC and SLOC of Java methods or C functions is not strong enough to conclude that CC is redundant with SLOC. This conclusion contradicts earlier claims from the literature, but concurs with the widely accepted practice of measuring CC next to SLOC.

· 2 min read
Jurgen Vinju

Hash-tries are the data-structure under Rascal's sets, maps and relations. These papers explain how they work and how we make them lean and fast on the JVM. Others have blogged about these results as well. The code can be found in the Capsule project.

@inproceedings{oopsla2015,
  title = {Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections},
  author = {Michael Steindorfer and Jurgen J. Vinju},
  year = 2015,
  booktitle = {Proceedings of the Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA)},
  editor = {Patrick Eugster},
  fulltext = "http://michael.steindorfer.name/publications/oopsla15.pdf",
}

@inproceedings{gpce14,
author = {Steindorfer, Michael J. and Vinju, Jurgen J.},
title = {Code Specialization for Memory Efficient Hash Tries},
booktitle = {Proceedings of the 2014 International Conference on Generative Programming: Concepts and Experiences},
series = {GPCE 2014},
year = {2014},
pages = {11--14},
numpages = {4},
doi = {10.1145/2658761.2658763},
publisher = {ACM},
fulltext = "http://michael.steindorfer.name/publications/gpce14.pdf"
}

The data structures underpinning the collection APIs (e.g. lists, sets, maps) in the standard libraries of programming languages are used intensively in many applications. The standard libraries of recent Java Virtual Machine languages, such as Clojure or Scala, contain scalable and well-performing immutable collection data structures that are implemented as Hash-Array Mapped Tries (HAMTs). HAMTs already feature efficient lookup, insert, and delete operations; however, due to their tree-based nature, their memory footprints and the runtime performance of iteration and equality checking lag behind array-based counterparts. This particularly prohibits their application in programs which process larger data sets. In this paper, we propose changes to the HAMT design that increase the overall performance of immutable sets and maps. The resulting general purpose design increases cache locality and features a canonical representation. It outperforms Scala’s and Clojure’s data structure implementations in terms of memory footprint and runtime efficiency of iteration (1.3–6.7x) and equality checking (3–25.4x).

· 4 min read
Jurgen Vinju

Rascal features immutable data, but at the same time a number of language constructs that look a lot like "traditional" structured imperative programming: while, if, etc. Without going into the full power of these constructs in Rascal (featuring lexically scoped backtracking, for example), this post shows how we can program imperatively and functionally at the same time, in the same language.

The reason Rascal features these two styles is that we want to make it easy for programmers who are used to the imperative paradigm to step into the language. More importantly, we want to make it easy to translate classical textbook pseudocode for program analysis and transformation rather directly into Rascal syntax.

A story about even numbers

Let's write a function that generates all the even numbers in a list up to a certain maximum. We will do it in a few alternative ways: from very imperative to very declarative and some steps in between.

list[int] even0(int max) {
  list[int] result = [];
  for (int i <- [0..max])
    if (i % 2 == 0)
      result += i;
  return result;
}

Now let's remove the type declarations of the temporaries:

list[int] even1(int max) {
  result = [];
  for (i <- [0..max])
    if (i % 2 == 0)
      result += i;
  return result;
}

To make the code shorter, we can inline the condition in the for loop:

list[int] even2(int max) {
  result = [];
  for (i <- [0..max], i % 2 == 0)
    result += i;
  return result;
}

In fact, for loops may produce lists as values, using the append statement:

list[int] even3(int max) {
  result = for (i <- [0..max], i % 2 == 0)
    append i;
  return result;
}

So now, the result temporary is not necessary anymore:

list[int] even4(int max) {
  return for (i <- [0..max], i % 2 == 0)
    append i;
}

This code is actually very close to a list comprehension already:

list[int] even5(int max) {
  return [ i | i <- [0..max], i % 2 == 0];
}

And now we can just define even using an expression only:

list[int] even6(int max) = [i | i <- [0..max], i % 2 == 0];

Or, perhaps we like a set instead of a list:

set[int] even7(int max) = {i | i <- [0..max], i % 2 == 0};

What just happened?

In summary:

  • We went from 5 lines of code to 1
  • We went from 3 control flow constructs (for, if, return) to 0
  • We introduced a list comprehension
  • All expressions have remained equal
  • Intermediate temporary variables disappeared

What did not happen is any magic. The code still executes the same "algorithm" if you will. The functional programming style in Rascal can be seen as a shorter notation for a more bloated use of imperative control flow constructs.

The usefulness of imperative control flow constructs

First off, Rascal's control flow constructs are more powerful than those of general-purpose programming languages. They feature lexically scoped backtracking, lists of conditions, etc.

Even without those advanced features, it is sometimes very handy to split a computation into parts without having to introduce another function abstraction. While exploring a new algorithm, temporary variables can be printed and inspected at debug time, etc.

In Rascal, people often use imperative control flow to explore solutions or to copy them from textbooks, and when they are happy with the algorithm they try to improve the formulation by finding the right functional abstractions.

The benefit of functional abstraction over imperative control flow

The real benefits of the functional style over the imperative style, beyond brevity and elegance, are reusability and extensibility. Smaller functional abstractions are easier to reuse and easier to override. Also, with Rascal's dynamic dispatch via pattern matching (a.k.a. term rewriting), adding new alternatives to algebraic data types can be complemented by adding new cases to functions. In the imperative style such an extension would imply editing existing code, while in the functional style this is not necessary. See also this other post.

· 4 min read
Jurgen Vinju

This post dives into some of the design decisions regarding the manipulation of parse trees and abstract syntax trees in Rascal using concrete syntax notation.

Definitions

What we call concrete syntax is a notation for syntax trees that is embedded into the expression notation of meta programming languages. This notation should be equal, or mostly equal, to the surface syntax of the language that is represented by the syntax trees, and the expressions are still syntactically and statically checked for correctness. As far as notations for syntax trees go, concrete syntax is a clear winner. Let's compare some expressions representing a piece of C code.

Concrete syntax

if (a && b) {
  println("a and b");
}

Lisp S-expressions

(if (and (id "a") (id "b"))
  (block
    (call (id "println") (args (strconst "a and b")))))

XML

<if>
  <and>
    <id>a</id>
    <id>b</id>
  </and>
  <block>
    <call>
      <id>println</id>
      <args>
        <strconst>a and b</strconst>
      </args>
    </call>
  </block>
</if>

YAML/JSON

if:
  - and:
      - id: a
      - id: b
  - block:
      - call:
          - id: println
          - args:
              - strconst: a and b

For the degenerate case of a single node with no children, of course, any abstract notation wins. As soon as we have nesting, however, even for marginally interesting code snippets, concrete syntax wins by a landslide in terms of brevity and cognitive load.

Meta Variables

The above examples showed only literal program fragments. One distinguishing feature, however, of concrete syntax is ...

History

The concrete syntax feature appeared first, as far as I know (and please correct me if I am wrong), in the early 1980s in experimental meta programming systems and algebraic specification systems. There was a concept of mix-fix operator syntax, where algebraic operators would not be exclusively prefix, postfix or infix, but all at the same time. This allowed one, for example, to define readable algebraic operators with arity larger than two, such as if _ then _ else _. Some systems started to use BNF to define mix-fix functions, and concrete syntax was born. In extreme cases, such as ASF+SDF, any context-free grammar rule was allowed to be an operator, while in other systems more restrictions could apply.

If you are interested in what this all looked like, also in the years after that, the following is a list of names of systems that used or use mixfix operators or concrete syntax in some form or another:

  • ASF+SDF
  • TXL
  • StrategoXT
  • Maude
  • ELAN
  • OBJ
  • Smarttools
  • Repleo
  • SugarJ
  • K

People I can name off the top of my head who published on the concrete syntax feature are Annika Aasa, Kent Petersson, Dan Synek, Chris Verhoef, Paul Klint, Eelco Visser, Peter Borovansky, Jan Rekers, Martin Bravenboer, Rob Vermaas, Radu Mereuta, Dorel Lucanu, Jeroen Arnoldus, and yours truly. There must be more.

Concrete syntax should not be confused with string or file templates, such as those found in PHP-like template expanders. The difference is that such templates are flat strings that are not statically checked or parsed. With concrete syntax you cannot write a pattern that will never match, and you cannot write a pattern that will construct a syntactically incorrect output.

Concrete syntax is also strongly related to the older concept of syntax macros (1970's). The similarity is that with syntax macros the user can also define the syntax of functions. The difference is that syntax macros are always expanded into the host language, while with concrete syntax the objects can be manipulated in an arbitrary way, often not to expand into the host language but rather to output a transformed output form.

So, concrete syntax is not new or novel in any way. It is a good idea nevertheless, and it comes with interesting language usability trade-offs.

Quotes

Types

Rascal

· 4 min read
Tijs van der Storm

One of the goals of Rascal is to allow the definition of Domain-Specific Languages. In this small post we give a flavor of how you can use Rascal to define the syntax of a DSL, a simple semantic check and how to compile the DSL to Java.

The following example shows how to define a simple DSL for state machines. It includes a parser, a check for unreachable states and a compiler to Java code.

The grammar of the DSL is defined using Rascal's grammar formalism which is fully integrated in the language. Shown below is the syntax definition of a simple state machine language, inspired by Martin Fowler's example language for gothic security.

module Syntax

extend lang::std::Layout;
extend lang::std::Id;

start syntax Machine = machine: State+ states;
syntax State = @Foldable state: "state" Id name Trans* out;
syntax Trans = trans: Id event ":" Id to;

A state machine consists of a number of named state declarations, where each state contains transitions to other states (identified by name) when a certain event happens. The grammar reuses identifier syntax and whitespace conventions from the standard library. Each non-terminal defines a type. Parse trees are typed values like any other value in Rascal. As a result, you can write functions that process such trees. An example would be a semantic check on state machines, such as finding all unreachable states:

module Analyze

import Syntax;

set[Id] unreachable(Machine m) {
  r = { <q1,q2> | (State)`state <Id q1> <Trans* ts>` <- m.states,
                  (Trans)`<Id _>: <Id q2>` <- ts }+;
  qs = [ q.name | q <- m.states ];
  return { q | q <- qs, q notin r[qs[0]] };
}

To check for unreachable states, we first create a binary relation between states using a comprehension. This comprehension uses concrete syntax matching to find a state's transitions. The pattern between backticks is written in the object language, which in this case is the state machine language defined in the grammar above (note the embedded syntax highlighting!). The variables q1 and ts in between < and > are bound for each state that is found in the machine m. A similar pattern is used to find the target state q2 in each transition in ts. The postfix + then computes the transitive closure of the relation.

The relation r is based on the transitions in a state machine. This means that it does not include declared (final) states which have no outgoing transitions. We collect the names of all defined states in qs, again using a comprehension.

The initial state is (conventionally) defined to be the state that is declared first. An unreachable state is then defined as a state that is not in the right image of the initial state in the transitive closure of the transition relation. This is exactly what is described by the last comprehension! The notation r[x], where r is a relation, is shorthand for { y | <x, y> <- r }.

There are various ways of compiling a DSL to target code in Rascal. The simplest is to use string templates to generate code in a general-purpose language. The following snippet shows the generation of a Java while loop to execute a state machine.

module Compile

import Syntax;

str compile(Machine m) =
  "while (true) {
  '  event = input.next();
  '  switch (current) {
  '    <for (q <- m.states) {>
  '    case \"<q.name>\":
  '      <for (t <- q.out) {>
  '      if (event.equals(\"<t.event>\"))
  '        current = \"<t.to>\";
  '      <}>
  '      break;
  '    <}>
  '  }
  '}";

String templates allow arbitrary Rascal values and control-flow constructs to be interpolated in string literals. Note how this code does not use concrete matching, but instead uses the labels defined in the grammar (i.e., states, out, event, and to).

And that's it! A complete DSL in 36 lines of code. Of course, the parser and the unreachable and compile functions can be connected to the IDE. This provides custom syntax highlighting, error-marking and automatic building in state machine editors.

· 6 min read
Jurgen Vinju

Here's a nifty design element from Rascal that I personally like: functions are actually rewrite rules. The bottom line here is that pattern matching drives dynamic dispatch which results in openly extensible meta programs.

The design choice seems obvious to people who have been programming in ASF+SDF, Stratego and TXL, who would argue that rewrite rules and strategies were actually just functions all along.

Complete functions

In Rascal we write functions in a Java/C/C# like syntax:

int fac(int n) {
  if (n == 0)
    return 1;
  else
    return n * fac(n - 1);
}

Or slightly shorter and more elegant we could write:

int f(int n) = n == 0 ? 1 : n * f(n - 1);

In fact, function definitions are just rewrite rules with a funny syntax, the int n is actually a pattern that matches integers only and binds them to the variable n. This means we can write more concrete patterns and separate the case distinction:

int f(0) = 1;
default int f(int n) = n * f(n - 1);

The default keyword here indicates that this alternative should be tried only after the other ones, which is obligatory here since Rascal's rules must be statically mutually non-overlapping.

Functions with normal forms

So far we have written a function which is total, i.e. it has to provide a result for all elements of the parameter types. To make this more rewriting-like, where we have normal forms, consider the interaction with constructor functions in the following example:

data Bool
  = t()
  | f()
  | and(Bool l, Bool r)
  | or(Bool l, Bool r)
  ;

Bool and(f(), Bool _) = f();
Bool and(t(), Bool b) = b;

Here we see the and function defined for two cases, where the first argument matches either t() or f(). It is not defined for any other cases, for example the first argument could be an or(t(),t()), so in this sense it is partial. However, the data definition provides a default case for the and function, namely to construct the and term.

For those of us who are used to rewrite rules, we see rules that are labeled by the sort of the terms that are being rewritten. Here are some example expressions executed in the console:

rascal>and(t(),t())
Bool: t()
rascal>and(f(),t())
Bool: f()
rascal>and(or(t(),t()),t())
Bool: and(or(t(),t()),t())

Modularity by open extensibility

The key benefit of being able to use pattern matching for dynamic dispatch, is extensibility. Suppose we add "maybe" to our logical language in a separate module. This is possible since data signatures are extensible:

data Bool = maybe();

Now we need to reconsider the semantics of the and function, and without changing the original definitions for and we simply type these extensions to implement three-valued logic:

Bool and(maybe(), maybe()) = maybe();
Bool and(maybe(), t()) = maybe();
Bool and(maybe(), f()) = f();

Pattern matching galore

In Rascal we have a powerful pattern matching operator suite, including:

  • list matching (associative), for example: [*_, elem, *_, elem, *_]
  • set matching (commutative, associative, idempotent), for example: f({e,*other, f({e,*nested})})
  • deep matching (recursive), as in /t()
  • negative matching, as in !and(_,_)
  • non-linear matching (see above list matching example)
  • etc.

These operators may occur in the parameter positions of function definitions, just as they can in switch, visit, {list,set,map} comprehensions, loops, conditionals, etc. As such they give the programmer very broad means to dispatch between cases. This is only limited by the rule that patterns need to be mutually exclusive for a certain function.

Here is a function to remove double elements from a list, term rewriting style:

list[value] dup([*value a, value e, *value b, e, *value c]) = dup([*a, e, *b, *c]);
default list[value] dup(list[value] l) = l;

And here is the same function but type parametrized:

list[&T] dup([*&T a, &T e, *&T b, e, *&T c]) = dup([*a, e, *b, *c]);
default list[&T] dup(list[&T] l) = l;

From unlabeled to labeled rules

The big issue with rewrite rules is that they are applied automatically and this is sometimes cumbersome. Functions do not have this issue. Perhaps we should not have defined the boolean semantics so directly, and have wrapped it in a function:

data Bool
  = t()
  | f()
  | and(Bool l, Bool r)
  | or(Bool l, Bool r)
  ;

Bool eval(and(f(), Bool _)) = f();
Bool eval(and(t(), Bool b)) = b;

This reads as labeled rules: we have two rules called "eval" that can be applied, but they will never be applied automatically unless somebody calls them as a function.

rascal>eval(and(f(),t()))
Bool: f()

Or, if we wish to apply this rule bottom-up through an entire boolean expression:

rascal>visit (and(and(f(),t()),t())) { case Bool b => eval(b); }
Bool: f()

In the above we used visit to automate the recursion, but we could have manually implemented the recursion as well:

Bool eval(and(t(), Bool b)) = eval(b);

Higher order functions are higher-order rewrite rules

Since our "rules" are actually just functions, we can pass them and combine them just like in any other language that supports higher-order functions. This brings us dangerously close to the expressive power of what term rewriting people call strategies. Anonymous functions (lambdas) can use pattern matching just as any other function by the way. So we can write expressions such as (f + g)(x) where we non-deterministically choose between applying f and g using pattern matching.

A note on types

Rascal is a statically typed language, supporting type inference only within the bodies of functions. We made this choice to help keep the bodies of functions slim, without introducing the hard-to-understand error messages that an overly smart type inference algorithm can cause.

This is the reason why all pattern variables in function definitions need to be typed, while in normal patterns (nested in the bodies of functions) the types of pattern variables are inferred.

Conclusion

These were some thoughts on the correspondence between functions and rewrite rules as we put it into Rascal. We came from a term rewriting world and wanted to keep using their power of pattern matching and open extensibility. Now we are in a world of functional and imperative programming where we can control their application with the flick of a for loop.