Some ramblings about graphics, compilers and CPUs
This is probably very wrong, but whatever, it’s my blog
:P
I am following the work going on with
Gallium3D
as much as possible, since this seems like it should offer some pretty
cool features when it becomes stable. It seems (from what I understand)
to be an abstraction layer over the graphics card. Until now, drivers have been written for graphics cards to implement certain things like OpenGL, etc., and anything which is not supported falls back to (much slower) software rendering (like Mesa for OpenGL) or, if that’s not available, simply doesn’t work.
Since the drivers are responsible for providing a lot of functionality as well as the basic hardware access, this results in a lot of duplicated work, with slightly different ways of doing the same things in each driver. For instance, every driver needs to implement a memory manager; the Intel driver’s memory manager, called GEM, is Free Software and available for other drivers to use, but it’s written in a very Intel-specific way and is therefore useless to other drivers.
There is work going on to make a generic memory manager for the kernel
and a generic way for graphics drivers to access the hardware (called
the Direct Rendering Infrastructure, working towards version 2), but it
still leaves rather a lot of common ground in each
driver.
Gallium3D is a very generic abstraction layer that is
designed to sit in between the drivers and the actual programs and
libraries being used. Gallium3D is implemented to work in a very similar
way to the actual silicon inside current graphics cards, thus writing a
driver to sit between the card and Gallium3D is pretty straightforward
because not much has to be translated from one representation to another
(and DRI can still be used to do a lot of the background work). This
makes Gallium3D act rather like a software graphics card, but doesn’t
cause too much slowdown since a) it is so similar to real hardware and
b) it uses the Low Level Virtual Machine
(LLVM) system to gain Just In Time compilation and optimisation for its
code. Since Gallium3D is software it can be run on any hardware with
a suitable driver (which, as I said, is pretty painless to make), so
libraries like OpenGL can be written to talk to Gallium3D and they will
run on whatever hardware Gallium3D talks to (with software like Mesa
running anything that the hardware can’t handle). This means that
writing a graphics library is a lot easier (since you only have to write
it for Gallium3D’s state tracker; everything else comes free after that)
and thus more than the basic OpenGL-and-that’s-it can be
used.
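To make that a little more concrete, here is a very rough C sketch of the idea. The names are made up for illustration and are not the real Gallium3D API; the point is just that a graphics library gets written once against one small interface, and each hardware driver fills that interface in for its own card.

    /* Hypothetical, hugely simplified interface in the spirit of Gallium3D
     * (not the real API): each driver provides these operations for its card. */
    struct hw_pipe {
        void (*clear)(void *ctx, float r, float g, float b);
        void (*draw_triangles)(void *ctx, const float *verts, int count);
    };

    /* An OpenGL-style state tracker is written once against that interface... */
    void gl_clear_screen(struct hw_pipe *drv, void *ctx)
    {
        drv->clear(ctx, 0.0f, 0.0f, 0.0f);  /* works on any card with a driver */
    }

    /* ...and a DirectX-style one (or anything else) can reuse the same drivers. */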
As an example,
WINE can run Windows programs on
Linux, BSD, MacOS, etc. (and even on Windows :P ). However, a lot of
Windows programs and especially games use Microsoft’s DirectX system for
3D graphics. Normally the WINE team writes replacements for Microsoft’s
libraries which perform the same job, thus allowing Windows programs to
run, but writing their own DirectX implementation would be a huge job on
its own. Making a different one for every graphics card driver out there
would be even worse, and relying on the driver’s developers to do it
like they currently do for OpenGL would make the drivers EVEN MORE
bloated and complicated than they already are.
Thus the WINE
team are writing a DirectX implementation, but instead of working with
the graphics card they are writing it to use OpenGL (since OpenGL is
already in the current drivers). This is pretty inefficient, however: DirectX operations need to be translated to OpenGL, then the OpenGL needs to be translated to however the graphics card works, and only then run. Since games are generally very resource intensive, not to mention that 3D is one of the hardest things for a computer to do anyway, it’s far from ideal. With Gallium3D, however, the OpenGL
translation can be skipped entirely and DirectX can be implemented
straight to Gallium3D, which then gets sent very efficiently to the
graphics card whilst still only needing to be written for one system.
Likewise other libraries can be accelerated through a graphics card
without fear of some users not having a driver capable of it.
Application developers could even write a custom library for their
application and count on it being accelerated regardless of the card it
runs on (for example, a proprietary program cannot count on drivers containing code to handle its custom, proprietary library, since that code can’t be included by its very nature. It can, however, count on
Gallium3D being installed and thus can include one implementation which
will be accelerated regardless of the driver).
JIT
compilation is pretty cool. There are, historically, two types of program:
one is compiled (like C) which means the program is first turned from
source code into machine code, and then this machine code runs when you
tell it to. The second is interpreted (like Python), which means that
instead of running directly on the computer, it is run inside
another program called a “virtual machine” (which is usually compiled,
although it can itself be interpreted, as long as at some point there is
a compiled virtual machine talking to the real computer). Programs are
more flexible than computer processors, which means that interpreted
programming languages can have many nice features added and usually be
written more simply than compiled languages. Compiled programs usually run faster than interpreted programs, since they run directly on the hardware with no virtual machine acting as a middle-man.
Just-In-Time
compilation, however, is like a mixture between the two. The source code
is NOT compiled into a runnable program, making it different to compiled
languages. However, there is no need for a virtual machine, thus making
it not an interpreted language. What happens is very similar to an
interpreted language, but instead of the virtual machine looking at what
to do next then telling the computer, in a JIT language the next bit of
code gets compiled just before it needs to run and is then given to the
real computer just like a compiled program.
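To make the difference concrete, here is a tiny, self-contained C sketch of that middle-man. The four-instruction bytecode is made up purely for illustration: the interpreter inspects each instruction as it goes and does the work itself, whereas a JIT would translate the same bytecode into real machine code just before it runs, cutting this loop out.

    #include <stdio.h>

    /* A made-up bytecode: push a number, add the top two, print, stop. */
    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    int main(void)
    {
        int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        int stack[16], sp = 0, pc = 0;

        for (;;) {                              /* the interpreter's dispatch loop */
            switch (program[pc++]) {
            case OP_PUSH:  stack[sp++] = program[pc++];        break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];   break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);      break;
            case OP_HALT:  return 0;
            }
        }
    }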
At first glance
it would seem like this might be faster than interpreting a language
(again, no middle-man) but slightly slower than a compiled program
(since with a compiled program all of the compiling is already done
before you run the program). However, JIT-compiled languages can
actually be FASTER than compiled programs!
This is because
when a program is compiled then that is it, the finished thing, that
code will be sent to the computer. Compiled programs must therefore be
able to handle anything they are given, so a banking program must be
able to handle everything from fractions of pennies up to billions and
perhaps trillions of pounds. Such large numbers take up a lot of memory,
and moving them around and performing calculations on them can take
time. Compiled programs have to use a lot of memory for everything and
accept these slow-downs just in case they are given a huge number. This
means handling £1 as something like £0000000000001, just so it can deal
with a trillion pounds if so commanded. A JIT-compiled program, however,
knows how much it is dealing with, thus £1 can be stored as simply £1,
whilst a thousand billion pounds can still be dealt with if it comes
along by storing it as £1000000000000. This means JIT programs can use
less memory for storing values, the calculations they do can be quicker
as they don’t need to deal with unneeded digits, and the program can
speed up greatly due to fewer cache misses.
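A rough C illustration of the idea (the types here are just stand-ins for whatever representation a compiler would actually pick): the precompiled version has to use a “big enough for anything” type, while a JIT that has only ever seen small values can get away with a much narrower one.

    #include <stdint.h>
    #include <stdio.h>

    /* Precompiled: must cope with anything it might ever be given. */
    int64_t add_generic(int64_t total, int64_t amount)
    {
        return total + amount;
    }

    /* JIT-specialised: the running program has only seen small values,
     * so narrower arithmetic (and tighter packing in the cache) will do. */
    int16_t add_small(int16_t total, int16_t amount)
    {
        return (int16_t)(total + amount);
    }

    int main(void)
    {
        printf("%lld %d\n", (long long)add_generic(1, 1), add_small(1, 1));
        return 0;
    }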
Another
advantage of JIT compilation is that unneeded calculations can be
skipped. For example, a program may need to add deposits to an account
and take away withdrawals. In a precompiled program this will always
take at least two calculations, either add the deposits to the total
then take the withdrawals from the total, or take the withdrawals away
from the deposits and add the result onto the total. In a JIT-compiled
program this can be made more efficient, since if the withdrawal amount
is zero then the compiler can skip one of the calculations, and likewise
if the deposits are zero. If both are zero then both calculations can be
skipped, and if they are both the same then the calculations can also be
skipped. For instance, compare the following precompiled pseudo-code,
where it has to work for any values, and the JIT pseudo-code which
already knows the values since it is compiled whilst the program is
running:
Precompiled:
    add(total, deposit)
    subtract(total, withdrawal)
OR
    add(total, subtract(deposit, withdrawal))
JIT:
    add(total, 15)
    subtract(total, 15)
OR
    add(total, subtract(15, 15))
The
JIT compiler can skip either of those, since it knows they cancel out
each other’s effects.
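In C-like terms (purely illustrative), the difference looks something like this: the precompiled function has to keep both operations, while the JIT, compiling with the real values of 15 in hand, can fold the whole thing away.

    /* What the precompiled program must contain: */
    long update(long total, long deposit, long withdrawal)
    {
        return total + deposit - withdrawal;   /* always does both operations */
    }

    /* What a JIT can effectively emit once it knows deposit == withdrawal == 15: */
    long update_jit(long total)
    {
        return total;                          /* 15 - 15 cancels out entirely */
    }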
JIT compilation is becoming more and
more common, with JIT compilers being written for Java and Smalltalk,
for example. There is even JIT support in EUAE (the
E(nhanced/xtended/ggplant) U(NIX/biquitous) Amiga Emulator, the exact
naming of which is open to interpretation). An emulator translates
programs written for one computer into something which will run on a
different type of computer. In EUAE’s case this means running code
written for the chips in Amiga computers (such as the Motorola 68020) on
different chips like the Intel 386. This used to be done by treating the
Amiga programs like an interpreted language, with the emulator acting as
the virtual machine. With the JIT engine, however, the Amiga programs
can be run directly on the different chip, with the “compilation” actually being a translation from one chip’s instructions to another’s.
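As a hugely simplified sketch of what such a translating JIT does (the instruction and the register mapping below are just an example I made up): it looks at a 68k instruction once, emits the x86 equivalent, and from then on the real chip runs the result directly.

    #include <stdio.h>
    #include <string.h>

    /* If the JIT keeps the 68k register D0 in eax and D1 in ebx, then the
     * 68k instruction "ADD.L D1,D0" (add D1 into D0) can be emitted as the
     * x86 instruction "add eax, ebx" instead of being interpreted each time. */
    void translate(const char *m68k)
    {
        if (strcmp(m68k, "ADD.L D1,D0") == 0)
            puts("add eax, ebx");
        else
            puts("; unhandled in this toy example");
    }

    int main(void)
    {
        translate("ADD.L D1,D0");
        return 0;
    }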
A very promising project currently being worked on
is the Low Level Virtual Machine (LLVM). Traditional compilers, like the
GNU Compiler Collection, work by translating the given language into a
common, internal language which the compiler can work with. Various
optimisations can then be done on this internal representation before
translating it into machine code for whatever computer is requested.
LLVM, however, is slightly more clever. It performs the same translation
and optimisation, but is not only able to translate this internal
representation into machine code, it also has a virtual machine capable
of interpreting it and is even able to JIT compile it, depending on the
options chosen by the user. This means that, for example, C code, which
is the classic example of a compiled language, can be fed into LLVM and
made to run in a virtual machine or be JIT compiled. The same goes for
Ada, Smalltalk and other languages which LLVM can handle. This means
that LLVM could potentially (when bugs are fixed and GCC-specific
assumptions in some programs are handled) make almost the whole of a
computer system compile Just-In-Time and be optimised on-the-fly (not
quite everything though, since something needs to start up the JIT
compiler :P ). LLVM could even optimise itself by compiling itself.
Likewise an entire system could be run in a virtual machine without the
need for the likes of KVM or Virtual Box. The future looks pretty
interesting!
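As a rough illustration (tool names and flags vary between LLVM versions, so treat this as a sketch rather than a recipe), the same C file can take any of the three routes with a current LLVM toolchain:

    clang -emit-llvm -c hello.c -o hello.bc   # C -> LLVM's internal representation
    llc hello.bc -o hello.s                   # that representation -> native machine code
    lli hello.bc                              # or run it directly, JIT compiling (or interpreting) it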
Computer processors can be as fast as
you care to make them, but that doesn’t matter if you can’t give them
things to do at the same rate. The processor contains “registers” which
contain the numbers it’s dealing with; however, these registers can only
contain one thing at a time and thus their values have to keep changing
for each operation. The place where numbers can actually be stored over
a long time is the RAM (Random Access Memory); the processor’s registers
can therefore get their next values from any point in the RAM and dump
the results of calculations just performed into any other point in the
RAM. However, this is a huge slowdown, so processors have a cache which
is able to store a few things at a time, and the registers are connected
to the cache rather than the RAM. This is a lot faster, since frequently
used values can be stored in the cache and accessed really quickly
(these are called “hits”). However, if something needed is not yet in the
cache then it has to be taken from the RAM (a “miss”) which a) takes
time and b) means kicking something else out of the cache to make room
for the new value.
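The effect of hits and misses shows up even in very ordinary code. A minimal C sketch: the same grid is summed twice, once in the order it is laid out in memory (mostly hits) and once jumping a whole row between reads (mostly misses). Both loops do identical arithmetic, but on most machines the first is noticeably faster.

    #include <stdio.h>

    #define N 1024
    static int grid[N][N];              /* C stores this row by row in memory */

    int main(void)
    {
        long sum = 0;

        /* Cache-friendly: consecutive reads come from the same cache line. */
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += grid[row][col];

        /* Cache-unfriendly: each read jumps a whole row ahead in memory,
         * so the cache keeps having to fetch new lines from RAM (misses). */
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += grid[row][col];

        printf("%ld\n", sum);
        return 0;
    }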
There are various algorithms for trying to
keep cache misses as low as possible, but an obvious way is to use small
numbers. In a really simplified (and decimal rather than binary)
example, let’s say a cache can store ten digits; then I can store, for example, 5, 16, 3, 78, 4, 67 and 1 all at once, since that is 10 digits
in total. That can’t really be made any more efficient, so if a
different number is needed (a miss occurs) then at least one of them has
to get replaced. If a program is being overly cautious about the size of
the numbers it is expecting, then it might use the cache inefficiently.
For example, say the program is anticipating values of a few hundred, it
might want to store the number 547 which would require three places in
the cache. If the number it actually deals with turns out to be five,
then it will store 005 in the cache, wasting two spaces, replacing
numbers it doesn’t need to and thus increasing the likelihood of a miss.
If there is a miss later and one or two of those three digits are chosen
to be replaced then the whole 005 gets kicked out rather than just the
wasteful zeroes, meaning that any future calculations needing that 5
will result in a miss and the whole 005 will have to be brought in again
from the RAM, causing the same inefficiencies again.
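A rough C equivalent of that decimal example (the 64-byte cache line is a typical size rather than a universal one): the values being stored are tiny, so the width of the type chosen decides how many of them fit in one cache line before something has to be kicked out.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Cautious "just in case" storage: 64-bit slots, 8 values per 64-byte line. */
        int64_t cautious[8] = { 5, 16, 3, 78, 4, 67, 1, 9 };

        /* Knowing the values are small: 8-bit slots, 64 values in the same space. */
        int8_t compact[64] = { 5, 16, 3, 78, 4, 67, 1, 9 };

        printf("%zu bytes hold %zu big values or %zu small ones\n",
               sizeof(cautious),
               sizeof(cautious) / sizeof(cautious[0]),
               sizeof(compact) / sizeof(compact[0]));
        return 0;
    }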
OK, back
to revision :P