Symmetry in Languages
This post is about languages and grammars*, and I’ll be using examples in a pseudo-BNF notation which looks like the following:
abc ::= ('foo' <bar>) | <baz> | <bar> <baz>
This reads as “An abc is defined as the letters ‘foo’ followed by a bar, or as a baz, or as a bar followed by a baz.”
Consider the Newspeak language from George Orwell’s 1984. Part of the grammar would look like the following:
word            ::= ('un' <base>) | <base>
emphasised      ::= 'plus' <word>
twiceemphasised ::= 'double' <emphasised>
valid           ::= <word> | <emphasised> | <twiceemphasised>
Such a grammar allows us, given the base word “good”, to say the valid words “good”, “ungood”, “plusgood”, “plusungood”, “doubleplusgood” and “doubleplusungood”. This is the entire spectrum of “goodness” allowed by Newspeak, which is a purposefully restrictive language.
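To see how the rules combine, here is one derivation of “doubleplusungood”, expanding one rule at a time (the arrow notation is mine, not part of the grammar):

valid -> <twiceemphasised>
      -> 'double' <emphasised>
      -> 'double' 'plus' <word>
      -> 'double' 'plus' 'un' <base>
      -> 'double' 'plus' 'un' 'good'   ("doubleplusungood")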
Now, what happens if we get rid of some of these rules? Let’s try getting rid of ‘twiceemphasised’:
word       ::= ('un' <base>) | <base>
emphasised ::= 'plus' <word>
valid      ::= <word> | <emphasised>
We reduce the expressiveness, since we now only have one way to emphasise things; there’s no level above “plusgood”.
Now what if we get rid of “emphasised”, but just call it another kind of “word”?
word  ::= 'un' <base> | <base> | 'plus' <word>
valid ::= <word>
Now the definition of “word” contains a use of “word”, which makes it recursive. What does this mean? Well, “plusgood” is now a “word”, and we’re allowed to write ‘plus’ in front of “words”, so we can say “plusplusgood”, “plusplusplusgood” and any number of other new words. We can also say “plusplusungood”, along with an infinite number of other negatives.
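Here is the recursion in action, deriving “plusplusungood” with the same arrow notation as before:

word -> 'plus' <word>
     -> 'plus' 'plus' <word>
     -> 'plus' 'plus' 'un' <base>
     -> 'plus' 'plus' 'un' 'good'   ("plusplusungood")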
Thus getting rid of rules completely (like “twiceemphasised”) reduces what we can say, but by getting rid of distinctions and special cases (like the distinction between “word” and “emphasised”) we get a whole new level of expressiveness. In fact, since we can say “plusplusgood” there’s no need for “doubleplusgood” any more; it’s redundant, so there’s no point putting it back in. The idea of “valid” is also redundant, since it’s now just the same as a “word”, so we can get rid of that too, reducing our entire grammar (minus the “base” vocabulary of words like “good”) to:
word ::= 'un' <base> | <base> | 'plus' <word>
This grammar, even though it only contains two concepts, allows us to say infinitely more than the original grammar that contained five. We can even get rid of the distinction between “base” and “word” by writing it as:
word ::= 'un' <word> | <base> | 'plus' <word>
Now we can say “unungood”, “unplusgood”, “plusunununununplusplusunplusgood” and infinitely many more new words.
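For example, “unungood” now derives by applying the ‘un’ rule twice:

word -> 'un' <word>
     -> 'un' 'un' <word>
     -> 'un' 'un' <base>
     -> 'un' 'un' 'good'   ("unungood")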
Now, where am I going with this? Well, let’s consider a bit of grammar for the C programming language:
simpletype   ::= 'int' | 'float' | 'double' | 'char'
modifiedtype ::= <simpletype> | ('const' | 'long' | 'signed' | 'unsigned') <modifiedtype>
composite    ::= <union> | <struct>
pointer      ::= <type> '*'
type         ::= <pointer> | <composite> | <modifiedtype>
This is a bit off, but it will serve for our purposes. It says, quite simply, that we can make things that have type “int”, “float”, “double” or “char”, or we can modify those types to get things like “unsigned long long int”. We can have “unions” and “structs”, which are built out of other types, and we can have pointers to any of them (including pointers to pointers (to pointers…)).
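For example, “unsigned long long int” comes out of this grammar like so:

type -> <modifiedtype>
     -> 'unsigned' <modifiedtype>
     -> 'unsigned' 'long' <modifiedtype>
     -> 'unsigned' 'long' 'long' <modifiedtype>
     -> 'unsigned' 'long' 'long' <simpletype>
     -> 'unsigned' 'long' 'long' 'int'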
Now let’s consider a similar thing in Java, which claims to be an “object oriented” language:
basetype ::= 'byte' | 'short' | 'int' | 'long' | 'float' | 'double' | 'boolean' | 'char'
type     ::= <object> | <basetype>
Here anybody can create their own kind of “object”, for example strings of text, key/value pairs, Web browsers, customers, etc. In the end, though, everything in Java ends up being defined via a few things from “basetype”.
If we look at the first purely object-oriented language, Smalltalk, we can see why I said Java is “object oriented” in quotes:
object
That’s it. EVERYTHING in Smalltalk is an object. If you want a number, that’s fine, since there are number objects. You want a true/false boolean? That’s fine, since there are true and false objects. No expressiveness has been lost by throwing away the complicated special cases of languages like C, and by adding “base types” Java has gained no expressiveness over Smalltalk. In fact, by having more special cases we lose expressiveness. For example, if we want to write, in pseudocode, a function for the square of something, we can do so in C like so:
int square(int num) {
    return num * num;
}

char square(char num) {
    return num * num;
}

float square(float num) {
    return num * num;
}

double square(double num) {
    return num * num;
}
We would also need to take into account the various modified versions, whilst structs and unions can’t be squared in a general way. We’re also forgetting here that C would complain about multiple function definitions, so we would have to call each function something different (eg. int square_int(int num)), and there’s the obvious issue that some of these number types might not be able to store the result of the multiplication if it’s too big (don’t get me started on that bollocks). In Java we get a similar thing, although we’re allowed multiple functions with the same name since they take different argument types:
int square(int num) {
    return num * num;
}

float square(float num) {
    return num * num;
}

byte square(byte num) {
    return (byte) (num * num); // byte * byte promotes to int, so a cast is needed
}

long square(long num) {
    return num * num;
}

short square(short num) {
    return (short) (num * num); // same promotion problem as byte
}

char square(char num) {
    return (char) (num * num); // same again
}

boolean square(boolean num) {
    return num * num; // won't compile: '*' isn't defined for boolean
}

double square(double num) {
    return num * num;
}

Object square(Object num) {
    return num * num; // won't compile: '*' isn't defined for Object
}
The last case, of Object, won’t work (and neither, strictly, will the boolean one), since Java has EVEN MORE special cases to do with operators like *, defining them for some kinds of things and not others. Now, what would we write for Smalltalk?
[:num | num * num]
That’s it. Done. Works for everything (for which multiplication is valid).
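As a quick sketch of the block in use (assuming a standard Smalltalk image such as Squeak or Pharo; the variable name squared is my own):

squared := [:num | num * num].
squared value: 5.      "25"
squared value: 2.5.    "6.25"
squared value: 3/4.    "9/16"

The one block handles integers, floats and fractions alike, because each of those objects responds to * in its own way.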
The nice thing about using objects in this way is that everything stays symmetric, i.e. you can treat everything in the same way. This lack of special cases and artificial distinctions means that it’s much easier and more intuitive to make systems which handle everything thrown at them. As long as there’s a sane fallback in the case of failure, we can leave this function (that one line) alone FOREVER and be confident that it will still work with everything we give it; we don’t need to keep fiddling with it as more types are added to the language.
What’s more, now that we’ve got this working function, we can stick it in a library and use it everywhere. We can do the same for branches, via the ifTrue:ifFalse: method:

foo ifTrue: [something run] ifFalse: [somethingElse run]
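Concretely (again assuming a standard image), true and false are ordinary objects which implement this message, so a branch is just:

(3 > 2)
    ifTrue: [Transcript show: 'three is bigger']
    ifFalse: [Transcript show: 'two is bigger']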
That works for ANY foo, and the two code blocks given are of course examples. We can do the same for iteration:
foo do: [:x | something run using x]
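For instance, arrays and strings both understand do:, so the same message iterates over either (again a sketch, assuming a standard image):

#(1 2 3) do: [:x | Transcript show: x printString].
'abc' do: [:c | Transcript show: c asString]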
That will work for ANY foo which is iterable. Everything works in the same way: 40-year-old code can work with objects written last week, and code written last week can work with objects defined 40 years ago.
If you’re adding a feature to a language via extra syntax, ask yourself whether you’re allowing the expression of something new which was not previously possible, or whether you’re just breaking the symmetry between that feature and the rest of the language.
To bring this back to where it started, I’m very interested in the development of the Newspeak programming language.
* More specifically, formally defined languages and grammars; not broken grammars retrofitted to natural languages with duct tape and an insistence that the language’s users aren’t adhering to the ‘standard’, which is based on those users’ usage anyway.