Identifiers for Pattern Matching

Document #: xxx
Date: 2022-02-24
Project: Programming Language C++
SG17
Reply-to: Mihail Naydenov
<>

1 Abstract

This paper investigates the challenges we face with identifiers inside Pattern Matching (PM). It also proposes a method for their introduction, one that is inline with current practices, without new language syntax.

2 Challenges

One of the many challenges in PM has to do with identifiers. There is an unavoidable ambiguity in what we mean by a variable named x: Is it a newly introduced entity or a reference to an existing one? There is no right answer and each system decides on its own. For example in Python, x is a new name and if the user whats to access an existing one it must contain a dot (essentially to be a path) - something.x. In Rust also x is a new name. Swift and C# introduce new names the same way this is done outside PM - with let and var respectively.
Variable introduction style might seem like a minor, “esthetics” issue, but is really not, especially in C++, for two important reasons:

  • First, in C++ there is no variable introduction keyword. This is in contrast to the above mentioned Swift and C#.
  • Second, types and variables can have the same name - class s{}; s s;.

These two just add up to the preexisting ambiguities inside PM, creating an even more complicated context.

2.0.1 So many identifiers, so little syntax

Let’s examine the number of possibilities for what an identifier could mean in the C++ pattern matching scenario. We mentioned some of them already:

  • x can be a PM variable, a binding
  • x can be a non-PM variable, one to compare against
  • x can be a type

Before continuing, here are our current solutions for the above 3.
The main PM proposal, p13711, as of its latest iteration at the time of writing:

  • creating a binding is the default, with the explicit limitation, one can not bind an arbitrary pattern. This is, only the last value, after all matching is done, can be captured behind the name x.
  • non-PM variables must be proceeded with case - case x.
  • types are used inside the “Alternative Pattern”, which surrounds their names with angle brackets - <x>.

Alternative Pattern in p1371 also accepts a constant-expression, it is not clear if this means a constant variable as well. In that case we could have an ambiguity whatever <x> means a typename x or a variable x, used to select the alternative.

However, according to p26882, the above will be revisited in future versions and will be as follows:

  • creating a binding will be done via a let keyword.
  • non-PM variables will be “just expression” and used directly, making the “non-PM variable for comparison” the default.
  • (“Alternative Pattern” is unchanged.)

The “is as” proposal (p2392)3 is as follows:

  • binding is created if the identifier is to the left of is or as.
  • non-PM variables are the ones that appear to the right of is.
  • types also appear to the right, same as non-PM variables. This could lead to ambiguity as in is size - do we mean a type or a with the same name?

As you can see, things are already not easy with these 3 cases, but there is more:
- x can be an existing, non-PM variable, used for binding, an assignment.

Arguably, this last case mainly concerns patterns outside PM. For example, we can imagine extending structured bindings to bind to existing variables:

some_t a,b;
[a,b] = something; //< bind to preexisting variables

Although, there is std::tie, it’s a hidden, non-obvious way of doing things. More importantly, it is not composable with any patterns:

some_t a,b;
// some-pattern below is a hypothetical pattern-used-outside-PM.
some-pattern std::tie(some-pattern a, b) = something; //< can't bind to preexisting variables from within patterns

Outside PM we will certainly need to bind to existing variables, but binding to existing variables inside PM could also be beneficial. We might want to fill a preexisting data structure directly. Something along the lines of:

void useSize(size_t& size);
size_t useAndReturnSize() {
  size_t size{};
  inspect(some_rect) {
    [__, let sz = {5,5}] => useSize(size = sz);
    ...
  }
  ...
  return size;
};

Here, ideally we would have just used the size variable directly. sz is just a crutch to carry us around the fact, we can’t bind directly to preexisting variables. Granted, this use is not that important exactly because the workaround is easy and as such, current proposals do not address this usage.

There is however, much more important case, missing from the discussions so far.

We talk about introducing a variable for PM, a binding, but this binding should have control of its mutability, much like any other variable. We should have both a binding for reading and a binding for writing.
That scenario is not handled directly by the current proposals and a simplified approach is taken, where all binding are mutable, unless the object we “inspect” is const. In that case all binding are const by the regular constness preservation rules. Needless to say, this is far from perfect for two reasons:

  • Pattern Matching, in contrast to switch and if-else is recursive, which results in a lot of complexity, with many branches in an overall three structure. We will absolutely encounter the need to have almost all branches be const, but some be mutable simply because not all paths will have the same requirements and usage. An “all or nothing” approach will be suboptimal, hurting const correctness overall - if you want that one variable, in that one arm to be mutable, all of them will have to be, the root can no longer be const.
  • Making the root const will have no effect on the mutability of the data if the data is accessed indirectly (pointer, view). Considering the widespread use of such objects in C++, both modern and classical, this will be a half-measure or even useless. A practical mutability can only be achieved by having it be defined as part of the particular binding and not for the containing object we inspect.

As you can see, there are plenty of complications regarding identifiers in the PM context with many potential uses and very limited ways to express them. Here is the complete list:

  • x can be a PM variable, used for observation
  • x can be a PM variable, used for modification
  • x can be a non-PM variable, one to compare against
  • x can be a non-PM variable, one to write to
  • x can be a type

Of course, not all uses are equally important, but all must be taken into account as part of the overall requirements and view into “the bigger picture”. Otherwise we will write ourselves into a corner, as the syntax options are extremely limited.

2.1 Type selector

Identifiers for type selector need a special mention. If we step back and consider what syntax we would like to see as a natural way of doing type switching, we should look no further then the very first proposals for Pattern Matching like pattern-matching-November-20144 or the early p0095.5 They both advertise something like:

 Some{...} 

to select (and deconstruct) a type Some from a polymorphic source value. This makes sense - deconstructing is the opposite of constructing and often programming languages use a syntax, very similar to either class definition or construction:

Rust class definition

... Some{...}

C# class definition

... Some{...}

Python class construction

... Some(...)

Scala class definition

... Some(...)

Rust class deconstruction

... Some{...}

C# class deconstruction

... Some{...}

Python class deconstruction

... Some(...)

Scala class deconstruction

... Some(...)

In other words, our initial intention to have deconstruction mirroring construction was more or less “right”. Something however happen in the process and we drifted away of it. Using square instead of curly brackets for deconstruction is understandable, considering Structured Bindings are already in the language. However, we also lost the natural way of expressing the desired type:

Main Proposal

<Some>[...]

extra angle brackets

Is As Proposal

[...] as Some

a new as idiom

The as, although looking nice, is foreign to C++. To change this fact, we must alter the language, adding as outside PM also. On the other hand, angle brackets can hardly be perceived as pretty or natural and, considering the recursive nature of PM, will bring considerable visual noise.

3 The Bigger Picture

This is an overview, including future directions. For the concretely proposed items see next section.

In the previous section we made the observation, an identifier can be one of those:

  • а binding, creating a new variable
  • а binding, reusing an existing variable
  • аn existing variable to compare against
  • аn existing variable to write to
  • а type name

It is clear, it is not possible to have a simple, single identifier without encountering multiple ambiguities. To account for that, we must “mark” identifiers for their specific use. In the main proposal this is done by either using case for “existing variable to compare against” (p1371) or with let (p2688) to mark new identifiers. Angle brackets are used around type names. In the “is as” proposal, the location of the identifier in relation to is or as is the determining factor.

Regarding the latter, we can make the observation, is is quite similar to an operator, we already have, the operator==. This means, marking values for comparison with an == can be a reasonable approach as well, one that does not need introducing new language features:

int value = 5;
int number = ...;

inspect (number) {
  ==value => //< matches `value`
  // or !=value if we want!
  ...
};

This looks promising, the code is self-explanatory, free from ambiguity, though arguably a bit verbose. Later, in the proposal part, this issue is addressed as well.

Please note, inspect, and much of the syntax, is just a placeholder, reusing it from the main PM proposal (p1371). The authors seem to be moving away from it in p2688, but it can still serve us here.
Also, having == does not exclude intruding is in the future. is will be just another operator, both in and outside PM.

One very interesting side effect is that by using == (or possibly !=), we can also naturally introduce the other comparison operators:

inspect (number) {
  >=value => //< greater or equal
  <=value => //< less or equal
  < value => //< less then
  > value => //< greater then
  ...
};

This is a really nice bonus feature, already available in C# PM for example.

Moving forward. We saw how we can introduce existing identifiers for comparison naturally, now let’s get to the much harder cases - introducing new variables. In C++, much like C before that, there is no special keyword or any other syntactical marking. Identifiers for variables are introduced by a preceding identifier of a type:

id_of_type id_of_var;

id_of_var is new variable, indicated only by the fact, it has a preceding type id_of_type.

This is a though place to be as this kind of introduction is completely unusable for PM.
Facing that hard case, current proposals introduce new language. This is either special mini language, in the main PM proposal, where having new variables are introduced with the let keyword. Or, in the “as is” proposal, by having new language feature, the is and as operators, which in PM can create new variables among other things. Having to introduce a new language is high price to pay and it leads to inconsistencies - in and outside PM the variables are introduced differently. Faced with that new issue, we must either address it by changing the rest of the language (the “as is” proposal) or ignored it (main proposal).
This is not the case with languages, which already have a keyword for variable introduction! Swift using let inside PM is fundamentally different then C++ doing it - let is already the way to introduce variables. Same for C# and its var keyword.

This negative scenario is not the only one possible however.

Interestingly, having to precede the identifier with a type has one exception - the lambda capture. There we don’t have to supply the type before a new variable:

[a=something] { // New variable `a` is introduced for the lambda with the same value as `something`.
 ... 
} 

This is an interesting corner case. Is used to make the syntax concise and it works only because there are no uninitialized captures.
Coincidentally, both of those are true for the PM as well - we need a concise syntax and there are no uninitialized bindings!

Reusing the syntax seems like a very sensible option:

int number = ...;

inspect (number) {
  n=__ => // new binding `n` is introduced for the PM with same value as the matched ::number
  // or just n= omitting the __ 
  ...
};

Similarly how we create a lambda-local variable, always initialized to some preexisting value, we can create a PM-local variable, again always initialized to a preexisting value. This time however the value is coming implicitly from the object we “inspect”. In the first, lambda case, we assign manually, in the second, the PM system will assign for us.
In the above example, we might continue matching, adding an expected value:

int number = ...;

inspect (number) {
  n=5 => // new binding `n` is introduced for the PM, iff ::number is 5
  ...
};

Here we create a binding n, equal to 5. Of course, it is not we that “assign” 5 to the binding, in contrast to lambda. It is the PM system that will create the binding, with a value, equal to the number variable, but only if number is 5. The end result is the same and what is written will not be misleading - n is (equal to) 5.

Moving to lambda-like identifiers introduction has a major benefit - we have syntax ready for all other identifiers!

Mutable and immutable bindings:

int number = ...;

void mutate(int&){...}

inspect (number) {
   n=__ => mutate(n); // FAILS, n is immutable by default!
  &n=__ => mutate(n); // SUCCEEDS, n is requested to be mutable
  ...
};

Binding to preexisting variables:

int number = ...;

int n;

inspect (number) {
  &n __ => mutate(n); // binding to existing variable with the same value as ::number
  // or just &n omitting the __ 
  ...
}

Compared to lambda side by side, similaraties are there:
:::{.columns} :::{.column width=“50%”}
lambda capture

int x,y,z;


[ x= ..., // new x
 &y= ..., // new y, but reference 
 &z       // existing z reference
]{...};

::: :::{.column width=“50%”}
PM bindings

int x,y,z;

inspect (...) {
  x= ... => // new x
 &y= ... => // new y, but reference 
 &z  ... => // existing z reference
};

::: :::

Arguably, assignment to existing variable could look confusing when combined w/ other patterns. For example &z 5 will be the syntax to assign to a preexisting z if value is 5. We will have assignment w/o =, which is far from great. However, this use-case is rare, it is not even proposed in currently active proposals. As such, this paper is willing to accept this downside for the other benefits this approach brings.

And the benefits don’t stop there. One additional significant gain is the fact, we can name types directly!
Because all identifiers so far were “marked” by additional tokens, we could finally have “unmarked” names for types:

inspect (...) {
  int __ => // match int
  ...
};

Having natural type selection is a fundamental benefit, resurrecting the “dream syntax” from a decade ago, one w/o special tokens like <> or additional language features like as.

Putting it all together, we have a complete set with zero ambiguity:

struct size{...};
size size;
inspect (...) {
   size  => // match the ::size type
   size= => // new `size` binding, hiding ::size
  &size= => // new `size` binding, hiding ::size, mutable
  &size  => // assign to the ::size variable
  ==size => // match sizes, same as ::size
  <=size => // match sizes, less-then-or-equal to ::size
  >=size => // match sizes, greater-then-or-equal to ::size
  > size => // match sizes, greater then ::size
  < size => // match sizes, less then ::size
  ...
};

4 Proposal

The above study is not proposed in its entirety or detail. In particular, binding to existing variables, as well as the greater-then and less-then comparisons are not proposed. This proposal considers these to be future direction or at least future proofing for not writing ourselves into a corner.

With that out of the way, proposed identifiers are:

  • id - type name
  • id= - immutable binding
  • &id= - mutable binding
  • ==id - variable for equality comparison

Proposed is also a slightly different, more user-friendly version of the above specification.
In particular, having to compare to a value using == can get verbose and might seem atypical. Similarly, but to a lesser degree, having single identifier to mean a type name might be confusing:

struct size{...};
size size;

inspect (...){
   size         => //< There is a certain expectation, this is a value, not a type, we compare against!
   [size, size] => //< Here as well
};

To account for that, proposed is to have the == be optional if it is the last pattern, which is often the case. To avoid ambiguity, the type name must never be the last pattern. In practice this means, to match on type, one must follow up it’s name with another pattern, like __:

struct size{...};
size size;

inspect (...){
   size __          => //< Size is definitely a type
   size             => //< Value, "as expected"
   [size, size]     => //< Values here as well*
   [==size, ==size] => //< Same as above, more verbose, but more clear?
   
   size size     =>       //< also valid, type is size, value equals the ::size variable
   size (==size) =>       //< same as above, but more clear
};

// *assuming some hypothetical value, deconstructed into 2 sizes

There are cases however, where == is still required:

struct Size{...};
Size size;

inspect (...){
//     size[w=, h=]  =>  // Error, no type named size
// Size size[w=, h=] =>  // Same, also an error (attempt for two type matches one after another)
     ==size[w=, h=]  =>  // OK, match ::size 
};

Omitting the == here is not allowed because the pattern is not the last one. We continue to deconstruct, binding the width and height to w and h respectively. In that case an attempt is made to match on a type, named size, not a value. To request match by value instead, == must be is used.

Likewise the type match must not be the last pattern:

struct Size{...};
Size size;

struct Rectangle { int x; int y; Size size; };
Rectangle rect = ...;

inspect (rect) {
//    [__, __, size __] => // Error, no type named size
    [__, __, ==size __] => // OK, explicit value match
         [__, __, size] => // OK, last pattern, match value

//       [__, __, Size] => // Error, no such value (type match must not be last)
      [__, __, Size __] => // OK, not last pattern, match type
[__, __, Size [w=, h=]] => // OK, not last pattern, match type
};

Advanced example

int x = 5;

inspect (...){
    x     =>        // match ::x
    ==x   =>        // same as above, but explicit
    x=    =>        // new binding, hiding ::x
    x=x   =>        // new binding, match value ::x
  x=(==x) =>        // same as above, but explicit
  int x   =>        // match int, match value ::x
  px= (int x=x) =>  // bind px to a polymorphic type, then match the type to int, match value to ::x, bind result to a new `x` 
  int &x=x      =>  // match int, match value x, bind result to a new `x` as mutable
  ...
};

4.1 Conclusion

Presented here was an investigation of identifiers usage inside PM, with the hope all limitations and use-cases are accounted for, at least in principal.
Along that, a system of identifiers is presented, which accounts for all use cases, does not require new language syntax and is based on existing practices. A small, most useful subset is proposed as a way to handle identifiers in a practical PM implementation.
Last but not least, suggested is that, by using the described system, it will be possible to have type matching in a natural way by just naming the type - int x =>. This is the preferred way in most PM frameworks, both in C++ (all early proposals) and other languages.

4.2 More Examples

4.2.1 Point deconstruction

This Proposal

struct Point {...};
Point p;

int x = 10;
int y = 20;

inspect (...) {
  Point [==x, ==y] => //< Point type, at pos (::x, ::y), explicit
  Point [x, y] =>     //< Point type, at pos (::x, ::y)
  Point [&x=, &y=] => //< Point type, bind x, y for writing

  Point [x=, y] =>    //< Point type, at ::y pos, bind x pos to `x`
  Point [x, y=] =>    //< Point type, at ::x pos, bind x pos to `y`
  Point [x=x, y=y] => //< Point type, at pos (::x, ::y), bind x, y
};

Main Proposal

struct Point {...};
Point p;

int x = 10;
int y = 20;

inspect (...) {

  <Point> [x, y] =>    
// Same, always mutable

  <Point> [let x, y] =>    
  <Point> [x, y=] =>    
  <Point> [let x=x, let y=y] =>
};

Is As Proposal

struct Point {...};
Point p;

int x = 10;
int y = 20;

inspect (...) {

  [x, y] as Point =>    
// Same, always mutable

  [x, _ is y] as Point =>    
  [_ is x, y] as Point =>    
  [x is x, y is y] as Point =>
};

4.2.2 Explicitly mutable, implicitly immutable bindings

struct Person{ std::string name; uint age; };

void yearCycle(Person& p) {
  ...
  inspect (p) {
    [name=, &age=] => std::invoke([&]{
      age++;           // increase age
      if(name = "Bob") // Error, name is immutable
        sendHappyBirthdayToBob();
    });
  };
}

4.2.3 Pattern matching class hierarchies

class Animal { public: virtual void speak() const = 0; };

class Cat : public Animal {
public:
  virtual void speak() const override {  std::cout << "Meow from " << name << std::endl; }

  std::string name;
};

class Crow : public Animal {
public:
  virtual void speak() const override { std::cout << "Caaaw!" << std::endl; }
};

const char* fluffy = "Fluffy";

This Proposal

void listen(const Animal &a) {
  inspect (a) {
    Cat c=[fluffy] => // use cat c
    Cat [name=] => // use name
    Crow c= => // use crow c
  };
}

Main Proposal

void listen(const Animal &a) {
  inspect(a) {
    <Cat> (at!)[c, [case fluffy]] => // use cat c
    <Cat> [name] => // use name
    <Crow> c => // use crow c
  };
}

Is As Proposal

void listen(const Animal &a) {
 inspect (a) {
  c as Cat is [_ is fluffy] => // use cat c
  c as Cat is [name is _] => // use name
  c as Crow => // use crow c
 };
}

4.2.4 Extreme example

The following example extremity stems for the fact, both value matching and binding are assumed to be commutative. If this is not the case, then the patterns must appear in some predefined order. Composability is out of scope for this paper

struct Size{...};
Size size;

inspect (...){
   Size size [w=, h=] =>        //< type is ::Size, value equals ::size, w and h are introduced
   Size [w=, h=] size  =>       //< same as above

   Size sz= [w=, h=] size  =>   //< same as above, but we also bind the size value to `sz`
   Size size [w=, h=] sz=  =>   //< same as above
   Size sz= [w=, h=] ==size  => //< same as above (more clear)

   sz= [w=, h=] ==size  =>      //< same as above, but valid for any type, not just ::Size
   ==size [w=, h=] sz=  =>      //< same as above
};


  1. Pattern Matching: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1371r3.pdf↩︎

  2. Pattern Matching Discussion for Kona 2022: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2688r0.pdf↩︎

  3. Pattern matching using is and as: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2392r1.pdf↩︎

  4. Pattern Matching for C++: https://www.stroustrup.com/pattern-matching-November-2014.pdf↩︎

  5. Pattern Matching and Language Variants: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0095r1.html↩︎