From anton Tue Oct  5 20:23:52 1999
X-newsreader: xrn 9.02
Sender: anton@a0.complang.tuwien.ac.at (Anton Ertl)
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: Q: Merced a flop or not?
Path: a0.complang.tuwien.ac.at!anton
Newsgroups: comp.arch
Distribution: 
Followup-To: 
References: <37F32306.45496655@klagges.com> <rz7d7v0y4ka.fsf@corton.inria.fr> <7t0da1$55u@dfw-ixnews16.ix.netcom.com> <rstacpoo-0510991356080001@stud-207.appcomp.utas.edu.au>
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Keywords: 

In article <rstacpoo-0510991356080001@stud-207.appcomp.utas.edu.au>,
 rstacpoo@utas.edu.au (Stackers) writes:
>Serious question.  Has anyone done a comprehensive review of the IA64
>manual?  Is it on the web?  Any info appreciated.

Here's my review: It's written well enough (but then I am pretty
familiar with the techniques they are referring to).  The only
shortcoming I found when reading was that the special registers are
explained in one section without saying much about their use, and later,
when the features are explained, the registers are only referred to by
their (pretty cryptic) names, and I was left to wonder what they are
talking about.


If you want a review of the IA64 (user level, the published manual
does not explain system-level stuff), here's my take (there are
probably also earlier postings in this vein):

It's basically a RISC with lots of special features:

flag registers (Power/PPC also have this, but somewhat differently)
predicated execution (ARM, HPPA)
parallel-and/or
128 integer registers (AMD29K had 192)
128 FP registers
integer register stack with automatic saving and restoring (more
   flexible than SPARC's register windows, but less flexible than
   AMD29K's mechanism)
architectural support for moving loads speculatively up above branches
   (I wonder if they read our paper
   http://www.complang.tuwien.ac.at/papers/ertl-krall94cc.ps.Z, which
   proposes a similar mechanism)
support for run-time memory disambiguation (moving loads speculatively
   up above stores)
branch registers for indirect jumps (Power/PPC has two, IA64 8)
architectural support for counted loops (Power/PPC also has this)
rotating register files for register renaming in loops
the combination of loop support, rotating register files, and
   predication allows compact prologues and epilogues for
   modulo-scheduled loops
the architecture has 41-bit instructions, with three instructions per
   128-bit bundle; five bits per bundle are used for additional
   decoding info (in particular, program-specified instruction-group
   boundaries).  Bundles have significance as jump targets, but
   little significance for instruction grouping (it is possible to
   specify a group boundary within a bundle).
... (probably some features I forgot)
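
To make the bundle arithmetic in the list above concrete (3 x 41 + 5 =
128), here is a minimal sketch in Python.  The function names are made
up for illustration, and the placement of the five extra bits in the
low end of the bundle is an assumption of this sketch; the list above
only gives the field widths.

```python
SLOT_BITS = 41        # width of one instruction
TEMPLATE_BITS = 5     # extra decoding info per bundle
BUNDLE_BITS = 3 * SLOT_BITS + TEMPLATE_BITS   # 3 * 41 + 5 = 128


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into one integer.

    Assumed layout (not taken from the post): template in the lowest
    five bits, followed by slot 0, slot 1, slot 2 towards the high end.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(3)
    ]
    return template, slots
```

Packing a template and three instructions and unpacking the result
gives back the same four fields, and no bit of the 128 is left unused.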

The overall picture I get is: Most of the features they incorporate
look good today by themselves, but the combination is quite complex
and somehow does not feel well-rounded.

And I wonder how good these ideas will look in the future; e.g., there
is a technique for run-time disambiguation
(http://domino.watson.ibm.com/library/CyberDig.nsf/a3807c5b4823c53f85256561006324be/12a089effaf3a918852565930072a0db?OpenDocument)
that does not need architectural support (but it needs more loads).  I
am especially sceptical about the modulo scheduling stuff; modulo
scheduling works great for simple loops (including loops that have
been made simple by if-conversion), but hopefully we will see some
technique that can schedule more complex control structures in the
future, and the architectural support for modulo scheduling is
probably not flexible enough to be useful there.
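
For readers who have not seen it, here is a rough sketch in Python of
how rotating registers plus stage predicates let a modulo-scheduled
loop consist of the kernel only, with no separate prologue or epilogue
code.  It is a simplified model, not the exact IA64 loop-branch
semantics; the three-stage load/multiply/store pipeline and all names
are invented for this example.

```python
def pipelined_scale(src, factor):
    """Compute [x * factor for x in src] with a kernel-only software pipeline.

    Stage 0 loads an element, stage 1 multiplies it, stage 2 stores it.
    pred[s] guards stage s; regs[s] is the value travelling through stage s.
    """
    STAGES = 3
    pred = [False] * STAGES   # rotating stage predicates
    regs = [None] * STAGES    # rotating data registers
    out = []
    started = 0               # number of source iterations started so far

    while True:
        # Loop-branch step: rotate both files by one position.  The new
        # stage-0 predicate is true while source iterations remain
        # (prologue and steady state) and false afterwards (epilogue).
        start_new = started < len(src)
        pred = [start_new] + pred[:-1]
        regs = [None] + regs[:-1]
        if not any(pred):
            break             # pipeline has drained completely

        # Kernel: every stage is present in every trip; predicates
        # decide which stages actually do something.
        if pred[0]:
            regs[0] = src[started]        # load
            started += 1
        if pred[1]:
            regs[1] = regs[1] * factor    # compute
        if pred[2]:
            out.append(regs[2])           # store
    return out
```

With a three-element input the kernel runs five times: the first two
trips fill the pipeline, the last two drain it, and the same kernel
code serves all of them.  Note that this only works because the loop
body is a straight line of predicated operations, which is the
limitation discussed above.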

While the compiler techniques for exploiting the special features of
IA64 exist, putting them all in one compiler will make the compiler
quite unwieldy and probably pretty specialized for the IA64.  So
compilers for not-so-popular languages (nowadays, anything but C,
C++, FORTRAN, Java) will probably not make use of these features,
because 1) it's a lot of work, 2) it's not retargetable, and 3) these
features probably don't fit the typical usage of the language.
Retargetable compilers for popular languages will also suffer from 1)
and 2).

The architecture is much less VLIW than I expected; the grouping
information is the only thing that looks remotely like VLIW.  I wonder
why they bothered with the grouping information and the bundling
restrictions at all.  It may save a cycle upon a branch mispredict;
but why not generate this information in hardware on loading the code
into the instruction cache?
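
As an illustration of what "generating this information in hardware"
would have to compute, here is a small Python sketch that derives group
boundaries from register dependences.  The instruction representation
and the rule used (start a new group on a read-after-write or
write-after-write conflict within the current group) are simplifying
assumptions of this sketch, not the actual IA64 grouping rules.

```python
def group_boundaries(instructions):
    """Return the indices at which a new instruction group must start.

    Each instruction is a (dest, sources) pair: dest is a register name
    or None, sources is a tuple of register names.  A boundary is placed
    before any instruction that reads or writes a register already
    written in the current group.
    """
    boundaries = []
    written = set()           # registers written so far in this group
    for i, (dest, sources) in enumerate(instructions):
        conflict = dest in written or any(s in written for s in sources)
        if conflict:
            boundaries.append(i)
            written = set()   # a new group starts with this instruction
        if dest is not None:
            written.add(dest)
    return boundaries


program = [
    ("r1", ("r2", "r3")),     # r1 = r2 op r3
    ("r4", ("r5",)),          # independent of the first instruction
    ("r6", ("r1", "r4")),     # reads r1 and r4: needs a new group
    ("r1", ("r6",)),          # reads r6: needs another new group
]
print(group_boundaries(program))   # [2, 3]
```

This is a single linear scan over the instructions, which is the kind
of work that could in principle be done once when a cache line is
filled rather than being encoded in the program.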

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html