Gwydion Dylan ™ Porting and Maintenance Guide

Edited by

Peter Housel

Eric Kidd

Douglas Auclair

Use and copying of this software and preparation of derivative works based on this software are permitted, including commercial use, provided that the following conditions are observed:

  • This copyright notice must be retained in full on any copies and on appropriate parts of any derivative works.

  • Documentation (paper or online) accompanying any system that incorporates this software, or any part of it, must acknowledge the contribution of the Gwydion Project at Carnegie Mellon University.

This software is made available "as is". Neither the authors nor Carnegie Mellon University make any warranty about the software, its performance, or its conformity to any specification.

Bug reports, questions, comments, and suggestions should be sent by e-mail to the project's Internet address.

11 July 2001


Table of Contents

1. Introduction
2. Development Conventions
Basic Principles
CVS Suggestions
3. Porting Gwydion Dylan to a New Platform
4. mindy Internals Guide
5. d2c Internals Guide
The Compiler-Base library
Miscellaneous Utility Modules
OD-Format Module
The Dylan-Dump Module
The Compile-Time-Values Module
The Source Module
The Tokens Module
The Header Module
The Platform Module
The Errors Module
The Signature-Interface and Signature Modules
The Names Module
The Definitions Module
The Variables Module
The Policy Module
The CType Module
The Transformers Module
The Representation Module
The Classes Module
The Type-Dump Module
The C-Representation Module
The Compile-Time-Functions Module
The Flow Module
The Compiler-Parser library
The Tokenize Module
The Source-Utilities Module
The Lexer Module
The Fragments Module
The Parse-Tree Module
The Parser Module
The Macros Module
The Compiler-Front library
The Builder-Interface Module
The Primitives Module
The Front Module
The FER-OD Module
The Function-Definitions Module
The Variable-Definitions Module
The Top-Level-Forms Module
The Compiler-Convert library
The LexEnv Module
The Compile-Time-Eval Module
The Expanders Module
The FER-Convert Module
Processing of Top-Level Definitions
The Define-Macros Module
The Define-Libraries-and-Modules Module
The Define-Functions Module
The Define-Constants-and-Variables Module
The Define-Classes Module
The Top-Level-Expressions Module
The Compiler-Optimize Library
The Cheese Module
Data Formats
Control Flow
Call Optimization & Transformers
The Compiler-CBack Library
The Stack-Analysis Module
The CBack Module
The Heap Module
The Compiler-Main Library
The Main Module
The Dylan Runtime Library
6. Melange Internals Guide
A. Gwydion Project Coding Style Guide
Recommended
Indent level
Indent wrapping slot descriptions
Using /// comments
Using /* ... */
Wrapping function definition headers
Indent keyword parameters beyond #key token
Wrapping variable and constant definitions
Using end mumble to terminate statements
Horizontal space for columnization or other visual effects
Keyword syntax
Blank comment line between introductory comments and definitions
Wrapping case statements (applies to select too)
Using #t for otherwise in case statements
Wrapping for clauses
Mandated
Function name and open paren
Wrapping let statements
Terminate statements with semicolons
Single-line statements
Wrapping arithmetic/boolean expressions

Chapter 1. Introduction

This manual serves as a guide to porting and maintaining Gwydion Dylan ™. It includes:

  • Suggestions for working with other Gwydion developers.

  • Guidelines for porting mindy and d2c to new platforms.

  • An overview of the mindy interpreter.

  • A detailed guide to the internals of the d2c compiler.

  • A guide to the internals of the melange interface generator.

  • An appendix containing the Gwydion Project coding guidelines.

Chapter 2. Development Conventions

Eric Kidd

Note

None of the advice contained in this chapter is binding. However, an attempt has been made to accurately represent the working style and informal rules followed by the current developers.

If you think that this chapter does not represent the way we work, please tell me and I'll fix it. It's a “descriptive” document, not a “prescriptive” one.

Gwydion Dylan ™ was originally developed by the Gwydion Project at CMU. Today, it is maintained by a loose group of volunteers, some of whom have been working on it for several years. This chapter attempts to summarize the way we do things.

Basic Principles

We try to follow a few basic principles:

  • Decisions are made by consensus.  In general, we make an effort to discuss tricky problems until all the major contributors agree. This is similar to the IETF's philosophy of “rough consensus and running code”. If you think a decision is wrong, explain your viewpoint carefully and be prepared to back it up with code. Use the magic phrase often: “Does anybody object to this?”

  • Final rulings are made by the core team.  The core team consists of the most prolific Gwydion developers and those who have been involved the longest. If the other contributors would feel guilty about making a decision without considering your opinion, you're effectively a member of the core team.

    Decisions within the core team should be made by unanimous consent whenever possible. This seems to have worked well enough historically. If this process ever breaks down in the future, the core team might choose to vote as a measure of last resort.

  • No due dates.  As a matter of policy, we do not promise completion dates to outside entities. This is because of the volunteer nature of the project and the traditional unreliability of software estimation. Of course, individuals may promise due dates of their own, but nobody else will take responsibility for meeting those dates.

  • Don't bite off more than you can chew.  Work on one or two projects at a time. Don't commit to implementing many separate libraries or fixing many separate bugs. By “reserving” too many areas, you effectively discourage others from working on them.

CVS Suggestions

The source in CVS should always work correctly. Those who break the build will be tickled with feathers or suffer a similar fate. Always, always, always test your code before committing it. Documentation changes should validate cleanly before being checked in. For compiler changes, you should at least run the test suites in gd/src/tests as a smoke test. Also, a full bootstrap cycle is strongly recommended before committing more complicated changes.

Chapter 3. Porting Gwydion Dylan ™ to a New Platform

Here's a rough overview of how to port d2c. Take this with a grain of salt; the rules change occasionally.

  • Add your platform to the platforms.descr file and configure.in. This will require a bit of deliberate research.

  • Look in other places, such as the various Makegen files and platform-specific melange code. Do something intelligent.

  • Make sure that the Boehm GC runs on your platform. If not, you need to talk to the Boehm list.

  • Run configure with the flag --enable-mindy-bootstrap, then run make. Be prepared to wait, and be ready to fix any bugs you encounter.

  • Gripe to the mailing list when it breaks. We'll help you sort out the mess.

The more your platform looks like a 32-bit Unix system with the GNU tools, the easier things will be. Supporting 64-bit platforms is a minor project requiring a number of changes to d2c (mindy already works on Alpha Linux, thanks to the efforts of Jeff Dubrule). Native Windows support has been slightly broken since the 2.0 days, though it has been built as recently as Visual C++ 6.0. Most of the problems are in the build system, not the source itself.

Macintoshes, vendor C compilers and other such oddities are the responsibility of those who use them. We'll be happy to help, but you'll probably need to find a significant number of dedicated volunteers to help you fix any problems you encounter.

Chapter 4. mindy Internals Guide

Relatively little has been written down about the internal workings of mindy. Interested parties are encouraged to submit a chapter for this manual.

Chapter 5. d2c Internals Guide

Eric Kidd

The d2c compiler was developed by the Gwydion Project at CMU. This chapter explains the structure of d2c and gives a brief overview of how the various components interact.

d2c compiles one library at a time. Each library contains one or more modules. A module may be implemented by one or more files. Each file contains one or more source records, the logical unit of Dylan ™ compilation.
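
To make the terminology concrete, here is a minimal library/module skeleton of the kind d2c compiles (all names are hypothetical):

define library my-library
  use dylan;
  export my-module;
end library my-library;

define module my-module
  use dylan;
  export frobnicate;
end module my-module;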

By compiling many source records at once, d2c produces more efficient code at the cost of using more memory. In addition to processing an entire library's worth of source records, d2c also performs interlibrary optimizations by looking at certain information from previously compiled libraries.

In its current configuration, d2c is not well-suited to developing new Dylan ™ code. It is, however, well-suited to building production versions of libraries. Obvious future projects include reducing the granularity of compilation and implementing a selectively-incremental compiler that can modify the object code of a running program. The basic architecture was designed with these projects in mind.

The actual compilation process proceeds roughly as follows:

  • Load any libraries used by the current library.  The compiler begins by reading in the *.lib.du file for the Dylan library. As other libraries are used, it loads the appropriate *.lib.du files. These contain most of the compiler data structures from the original versions of those libraries.

  • Parse and macro-expand each of the input files.  The parser is implemented by the Compiler-Parser library. As each top-level form gets parsed, it is passed to process-top-level-form, which is implemented by the Compiler-Convert library. At the end of this stage, the input files have been completely parsed and macro-expanded.

  • Finalize the top-level forms.  Top-level forms may contain forward dependencies within a module. In most cases, these dependencies aren't necessitated by the design of the Dylan ™ language, but they are needed for efficient compilation. We need to study these dependencies if we want to make incremental compilation work. The implementation of finalize-top-level-form lives in the Compiler-Convert library.

  • Process the class hierarchy.  Optimizing Dylan ™ code requires careful attention to the object model. Unfortunately, d2c uses algorithms which require the entire class hierarchy to be computed before link time. (The biggest problem here appears to be assigning unique IDs to individual classes.) The code to analyze the class hierarchy lives in the Classes module of the Compiler-Base library.

  • Begin processing top-level definitions individually.  Up until this point, d2c performs each step on the entire library before continuing. The next several steps, however, are completed for each top-level form before moving on to the next top-level form. This process is controlled by compile-1-tlf in the Compiler-Main library.

    • Convert the parse tree into the front-end representation.  For various unfortunate reasons, d2c uses the term front-end representation (FER) for what other compilers call the intermediate representation. The conversion process is controlled by methods on convert-top-level-form provided by the Compiler-Convert library. These take the parsed representation of a top-level form and convert it to the FER using the Builder-Interface module.

    • Optimize the FER of each top-level form.  The compiler performs two kinds of optimizations: simplifications and required optimizations. The simplifications should produce output equivalent to their input. The required optimizations, however, clean up certain artifacts in the code and insert type checks. It is an error to pass a top-level form to the back end before performing the required optimizations. See the section called “The Compiler-Optimize Library” for more details.

    • Emit C code for the FER of each top-level form. The Compiler-CBack library emits the actual C code required to implement each top-level form.

  • Dump the local and global heaps.  Each library has a local heap which gets linked into the library itself. This contains as much static data as possible. However, not all data can safely live here; some must live in the executable application. The per-library heap is called the local heap and the per-application heap is called the global heap. The latter is only dumped if the current library will become an executable application. Both dumpers are implemented by the Heap module.

  • Save a *.lib.du file for the current library if necessary.  If d2c is not compiling an executable application, it dumps the data structures used by this library into a *.lib.du file. If another Dylan ™ library includes this library, the dumped data structures will be used to perform inter-library optimizations. The dumped data is stored as persistent objects in Object Description Format (ODF). The dumping process is controlled by the Compiler-Main library. ODF is implemented by the OD-Format module.

  • Generate a Makefile and run the C compiler.  The compilation driver in the Compiler-Main library creates a Makefile and C source files as needed during the compilation process. Once all the necessary code has been generated and the heaps are built, d2c invokes GNU make on the Makefile.

The Compiler-Base library

The Compiler-Base library exports the basic data structures used by the compiler. It also provides a number of miscellaneous utility modules. Every other library relies on it.

Miscellaneous Utility Modules

Several modules in the Compiler-Base library appear to contain assorted utility routines used elsewhere in the compiler.

Common

This module imports a number of standard modules and re-exports some or all of their constants. Essentially, this module is used to specify a local Dylan ™ dialect for use by the rest of the compiler. It includes Dylan, some of Extensions, a few names from Table-Extensions and several of the I/O-related modules.
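
A sketch of the re-export pattern (the clauses here are illustrative, not the actual Common definition): a use clause with the export: option imports another module's bindings and re-exports them.

define module common
  use dylan, export: all;
  use extensions,
    import: { false-or, <extended-integer> },  // assumed names
    export: all;
end module common;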

Utils

This appears to be the “grab bag” module. It holds a mix of completely unrelated minor classes and functions.

OD-Format Module

Nick Kramer

Introduction

The OD-Format module was designed by Rob MacLachlan and implements the Object Description Format, a binary format for storing persistent objects. This is used by the *.lib.du files emitted by d2c, though it might be potentially useful elsewhere.

The file gd/src/d2c/compiler/base/od-format.dylan contains an extensive discussion of the module's goals and data format.

The basic requirement of the "Object Description Format" (ODF) is to allow an open-ended variety of objects to be dumped into a file, where:

  • instances may refer to each other in an arbitrary graph structure, and

  • some of the references may be to objects that are defined in distinct object description units (files, etc.)

In order to make the design interesting, the ODF attempts to satisfy various incompatible requirements:

  • To support efficient parsing and unparsing on a variety of architectures, but also to potentially serve as a cross-platform interchange format.

  • To work well on sequential media (like a socket), but also to potentially support random-access demand-loading of objects when possible (e.g. on files.)

We should consider making this into a standalone library at some point in the future.

Dumping Objects

<data-unit-name> [Constant]

Things that can be the name of a data unit.

Value
<symbol>

Description

<data-unit-name> is a <symbol> because it must be comparable with \==.


$library-summary-unit-type [Constant]

A constant for begin-dumping.

Value
0

Description

This is currently the only type of OD format allowed. To build another format (e.g. DOOD), a new exported constant must be added to OD-Format and the file extension must be added to $unit-type-strings (an internal constant to OD-Format).


begin-dumping [Function]

Returns a dump state used by other OD functions.

Parameters
name - An instance of <data-unit-name>. The thing to dump.
type - An instance of <integer>. The "unit type code".
where: - An instance of false-or(<string>). The dump file name.

Return Values
dump-state - An instance of <dump-state>.

Description

Creates a state (which acts very much like a buffer) for other dumping functions to use to build the dump file in memory. The end-dumping function actually writes the built information to file.


end-dumping [Function]

Flushes the dump state to file.

Parameters
state - An instance of <dump-state>.

Return Values
None

Description

Actually output the file.
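
Putting begin-dumping, dump-od, and end-dumping together, a complete dump session looks roughly like this (the unit name and file name are hypothetical):

let state = begin-dumping(#"my-lib", $library-summary-unit-type,
                          where: "my-lib.lib.du");
dump-od(some-object, state);   // any number of dump-od calls
end-dumping(state);            // nothing reaches disk until here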


dump-od [Open Generic]

Write an object to the dump buffer.

Parameters
object - An instance of <object>. The object to dump.
buffer - An instance of <dump-buffer>. The in-memory buffer into which the object should be dumped.

Return Values
None

Description

This routine is overloaded to dump objects of individual types. Methods for primitive Dylan ™ types can be found in the section called “The Dylan-Dump Module”.


<dump-state> [Class]

Holder of the in-progress object dump.

Superclasses
<dump-buffer>

Initialization Keywords
None

Description

Contains all information about the current object-dump in progress. It is illegal to use a dump-state after it has been passed to end-dumping. All instances of <dump-state> are created by calls to begin-dumping.


Loading Dumped Objects

<dispatcher> [Class]

Binary to in-memory instance transformer.

Superclasses
<object>

Initialization Keywords
initialize-from: - An instance of <object>. Unless #f, initialize the instance's dispatch information from this dispatcher.

Description

A dispatcher is an object that tells the loader how to convert binary data into Dylan ™ objects.


*default-dispatcher* [Variable]

Default dispatcher used for initialization.

Type
<dispatcher>

Description

Dylan-dump loads this variable with dispatching information for several commonly-used Dylan ™ types (see the section called “The Dylan-Dump Module”).


*data-unit-search-path* [Variable]

Paths to du files.

Type
<sequence>

Value
#[]

Description

This is used by find-data-unit to locate the du file to load.


<location-hint> [Constant]

A hint of where to find a data unit.

Value
<byte-string>

Description

This is currently an absolute(?) pathname.


<word> [Constant]

A portable, full-sized <integer>.

Value
<integer>

Description

It seems mindy has a limited range for <integer>, so for mindy <word> is a <general-integer>, which allows proper bit manipulation that does not clobber the value via overflow.


find-data-unit [Method]

Locates then loads a du file.

Parameters
name - An instance of <data-unit-name>.
type - An instance of <integer>. (Currently only $library-summary-unit-type is allowed.)
location-hint: - An instance of false-or(<location-hint>). File name for the dump file; used if the data unit cannot be found in the current *data-unit-search-path*.
check-hash: - An instance of false-or(<word>). Can be used to detect version mismatches between data units (used internally as expected-hash).
dispatcher: - An instance of <dispatcher>. Provides the dispatcher to be used for the loading process.

Return Values
res - An instance of <data-unit>. Contains the resulting transformed objects from the du file.

Description

Given a data unit name and type, find-data-unit returns the corresponding data unit. This is the main entry point for loading; it is also used recursively.


Defining Simple Dumpers and Loaders

To write a dumper for a class, one needs to think of a name for the class (usually just take the <>'s off the class name), register the name with register-object-id, and define a dumper. If your data structure contains any object sharing or circular usages, also see the section called “Circular Usage” and the section called “Object Sharing”. For automatic generation of object dumpers and loaders, see the section called “Autodump”.

register-object-id [Method]

Associates a number for a type for the OD dumping functions.

Parameters
name - An instance of <symbol>. The name of the type.
id - An instance of <integer>. The associated integral value.

Return Values
None

Description

Equate the name with the id. This function is not exported; rather, all register-object-id calls must occur in the OD-Format module.

Of course, this requirement makes it really tough to extend OD-Format with user-defined types.


add-od-loader [Function]

Register a method for reading a dumped object back into memory.

Parameters
dispatcher - An instance of <dispatcher>. When loading .lib.du files, this is *default-dispatcher*.
name - An instance of <symbol>. The method will be activated when the loader encounters the type registered under this name (via register-object-id) in the du file.
func - An instance of <function>. The loader method.

Return Values
None

Description

This function is the “other half” of dump-od. It registers a function for reloading dumped objects; the registered function has this signature:

method(state :: <load-state>) => res :: <object-type-read-in>
	       

If you use add-od-loader, you will need to define a corresponding method on dump-od (see the section called “Dumping Objects”).


add-make-dumper [Method]

Convenience method to add both a loader and dumper.

Parameters
name - An instance of <symbol>. The type name.
dispatcher - An instance of <dispatcher>. Where the loader and dumper are being added.
obj-class - An instance of <class>. Must be the direct class of the instances we are to dump (and, of course, must also be instantiable via make).
slots - An instance of <list>. The slot descriptions of the class.
load-external: - An instance of <boolean>.
dumper-only: - An instance of <boolean>. If #t, do not create the loader.
load-side-effect: - An instance of false-or(<function>).
dump-side-effect: - An instance of false-or(<function>).

Return Values
None

Description

A convenient way to define both a loader and a dumper for objects of class obj-class. slots is a list in this format:

#(slot-accessor-function-1, init-keyword-1, setter-function-1,
  slot-accessor-function-2, init-keyword-2, setter-function-2,
  ...)
	      

Note that the entries are implicitly grouped in clumps of three. Either the init-keyword or the setter may be #f, but not both. The "make dumper" requests backpatching when it encounters an unresolved forward reference (due to circularities). Any slot which might hold a forward reference must have a setter function (it can also have an init-keyword, which will be used when possible).

A "make dumper" may be used when an object can be reconstructed from some accessor values by reinstatiating the object with make and backpatching some slots. This is pretty general, since the accessor and setter function are not required to be slot accessors and setters.


Autodump

[Description of Autodump coming soon]

ODF Representation of Objects

In the dump files, objects contain the following parts: a header, raw data, sub-objects, and an end-of-object marker. The header is a single machine word which contains the class of the object. (Actually, it contains the object-id, which is found by registry look-up.) The header also indicates whether there are any raw data or sub-objects. The raw data section is zero or more machine words which only have meaning to the object's loader. If raw data is present, it is preceded by a byte count (which is a single machine word). If there are sub-objects, they come next. The last object in a sub-object list is the magical $end-object, which denotes both the end of the sub-object list and the last word of the current object.

For more detail on the binary representation and on issues such as endian-ness and machine word size, see the section called “Binary Representation”.

When an object is loaded, the loading code reads in the header. The object class id is extracted from the header. The dispatcher then provides a load function based on the class id, which reads the remainder of the data associated with the object and returns the reconstructed object.

Functions for Writing Dumpers

When adding a method to dump-od, use the methods described here to write out the objects correctly.

dump-definition-header [Method]

Dump an object definition header.

Parameters
name - An instance of <symbol>.
buf - An instance of <dump-buffer>. It is probably best to use the <dump-state> instance passed into dump-od.
subobjects: - An instance of <object>. #t means this object has subobjects.
raw-data: - An instance of <object>. The Raw-Format field in an object-definition header encodes information needed for byte/word format translation and, in the degenerate case, indicates there is no raw data at all. Allowable format codes are $odf-no-raw-data-format, $odf-byte-raw-data-format, $odf-16bit-raw-data-format, $odf-32bit-raw-data-format, $odf-64bit-raw-data-format, $odf-untranslatable-raw-data-format, and $odf-word-raw-data-format (which is 32 or 64 bits as appropriate, used for <word> data in format pseudo-objects).

Return Values
None

Description

This function, the first dump call in a dump-od method for simple objects, dumps the object's header to the buffer. If one is dumping compound objects, first save the position of buf before calling this method, with some code like this:

define method dump-od(obj :: <ratio>, buf :: <dump-state>) => ();
  let start-pos = buf.current-pos;
  dump-definition-header(#"ratio", buf, subobjects: #t);
  dump-od(obj.numerator, buf);
  dump-od(obj.denominator, buf);
  dump-end-entry(start-pos, buf);
end method dump-od;
	     

This is necessary so that dump-end-entry (see the next documented method) writes out the end-of-object entry correctly.


dump-end-entry [Method]

Dump an "end of object" marker. This is only necessary if there are subobjects.

Parameters
start-posn - An instance of <integer>. Where the object began.
buf - An instance of <dump-buffer>. Usually the <dump-state> instance passed into dump-od.

Return Values
None

Description

Before dumping the compound object, be sure to save the buffer's current position (as in the sample code for dump-definition-header) because this method needs that information.


dump-simple-object [Method]

Dump an object that has only a constant number of subobjects (no raw data).

Parameters
name - An instance of <symbol>. The instance's class name.
buf - An instance of <dump-buffer>. Usually the <dump-state> instance passed into dump-od.
slots - An instance of <object>. The values of the slots of this object.

Return Values
None

Description

This method provides a simple way to dump objects of known size. Using it, the dump-od method for <ratio> would be:

define method dump-od(obj :: <ratio>, buf :: <dump-state>) => ();
  dump-simple-object(#"ratio", buf, obj.numerator, obj.denominator);
end method dump-od;
	   


dump-word [Method]

Dump an integer into the buffer as one word.

Parameters
obj - An instance of <word>. The integer to dump.
buf - An instance of <dump-buffer>.

Return Values
None

Description

This dumps an <integer> as part of an object. If you want to dump an <integer> as an <integer>, use the dump-od method on <integer> described in the section called “The Dylan-Dump Module”.


dump-raw-data [Method]

Dumps a collection into the buffer.

Parameters
obj - An instance of <collection>. The collection to dump.
bsize - An instance of <integer>. The number of elements in the collection.
buf - An instance of <dump-buffer>.

Return Values
None

Description

This takes a collection, obj, and dumps its contents to buf. This method is usually called when dumping a complex object; for example, the dump-od method for <byte-string> in the section called “The Dylan-Dump Module” dumps strings thusly:

define method dump-od(obj :: <byte-string>, buf :: <dump-buffer>) => ();
  dump-definition-header(#"byte-string", buf,
  			 raw-data: $odf-byte-raw-data-format);
  dump-raw-data(obj, obj.size, buf);
end method;
	     


Functions for Writing Loaders

The basic concept of writing a loader is to read in the information you wrote out with the method on dump-od. This process (writing loaders) is a bit different from writing dumpers, though. Here, you provide a method to add-od-loader (described in the section called “Defining Simple Dumpers and Loaders”) that is activated by a <dispatcher> instance when it encounters the corresponding object in the dump file.

A loader on <byte-string> (the corresponding dump-od method is found in the section called “Functions for Writing Dumpers”) would look like this:

add-od-loader(*default-dispatcher*, #"byte-string", 
  method (state :: <load-state>)
   => res :: <byte-string>;
    let next = state.od-next;
    let bsize = buffer-word(state.od-buffer, next);
    state.od-next := next + $word-bytes;
    let res = make(<byte-string>, size: bsize);
    load-raw-data(res, bsize, state);
    res;
  end method
);
	 

(In this case, the call to add-od-loader is made at the top-level in dylan-dump.dylan.) The functions used to build the dispatch method are described below.

<load-state> [Class]

A load-state is an object that describes the current state of the loading process. Among other things, it contains an input <stream> and the stream's buffer.

Superclasses
<object>

Initialization Keywords
stream: - An instance of <stream>. The stream we're reading from, used only for refilling the buffer. This slot is accessed via the od-stream method.
buffer: - An instance of <buffer>. The state of the stream buffer, which we hold for the duration of the load. It is accessed via the od-buffer method.
next: - An instance of <buffer-index>. The current index into the buffer; it is accessed via the od-next method and can be changed using the od-next-setter method.
end: - An instance of <buffer-index>. The end index for the buffer and stream. This slot is accessed via the od-end method.
position-offset: - An instance of <integer>. When added to the od-next pointer, this offset yields the byte offset of our current position from the data-unit start. This must be updated whenever we refill the buffer.
dispatcher: - An instance of <dispatcher>.

Description

A <load-state> object is passed into the dispatching method to reconstruct a dumped object (so you do not need to make a <load-state> instance yourself). Use that object in conjunction with the functions described below.


$end-object [Constant]

An end-of-subobjects marker.

Type
<end-object>

Description

A magical object which is always the last sub-object in an object.


load-object-dispatch [Method]

Read an object header and invoke the appropriate loader. Return the object that the loader returns.

Parameters
state - An instance of <load-state>.

Return Values
res - An instance of <object>. The reconstructed object.

Description

Start loading some objects from a load-state. Dispatches to an appropriate loader method depending on the dispatcher and the entry type. Returns the object.

For example, to load a <ratio> instance, build a dispatching method like this:

add-od-loader(*default-dispatcher*, #"ratio", 
  method (state :: <load-state>) => res :: <ratio>;
    
    let npart = load-object-dispatch(state);
    let dpart = load-object-dispatch(state);
    assert-end-object(state);
    ratio(npart, dpart);
  end method
);
	     


load-raw-data [Method]

Loads data into a parameter.

Parameters
res - An instance of <byte-string>. A pre-allocated buffer where the raw data gets loaded.
elsize - An instance of <integer>. The size of the buffer (or the maximum number of bytes to load).
state - An instance of <load-state>.

Return Values
nnext - An instance of <buffer-index>. The position in the <load-state>'s buffer after loading the raw data.

Description

Loads raw data into res. It is up to the dispatching method to interpret that loaded raw data in order to reconstruct the object's state.

Usually load-raw-data is called to load a string, so reconstruction is easy (see the example at the top of this section). One thing of note is that this method is not functional: res is not returned, but is a destructively modified parameter.


load-subobjects-vector [Method]

Returns a vector of an object's subobjects.

Parameters
state - An instance of <load-state>.
size-hint - An instance of false-or(<integer>). If a number, the number of subobjects MUST match it.

Return Values
res - An instance of <simple-object-vector>. The subobjects of the object.

Description

Loads the dumped subobjects into a vector. Note that the subobjects may be unresolved when loaded with this function; it is the responsibility of the caller to resolve those objects (see the section called “Circular Usage” for more information on functions for object resolution such as obj-resolved?, actual-obj, and request-backpatch).

So, getting a <vector> from this function still requires programming effort, e.g.:

add-od-loader(*default-dispatcher*, #"simple-object-vector", 
  method (state :: <load-state>) => res :: <simple-object-vector>;
    let contents = load-subobjects-vector(state, #f);
    let rsize = contents.size;
    let res = make(<simple-object-vector>, size: rsize);
    for (i from 0 below rsize)
      let el = contents[i];
      if (el.obj-resolved?)
        res[i] := el.actual-obj;
      else
        res[i] := el;
	request-backpatch(el, method (actual) res[i] := actual end);
      end;
    end;
    res;
  end method
);
	     

Fortunately, the Dylan-dump module provides this dispatching method. See the section called “The Dylan-Dump Module”.


load-sole-subobject [Method]

Loads the only subobject of this object.

Parameters
state - An instance of <load-state>.

Return Values
res - An instance of <object>. The resolved subobject.

Description

Load a one-slot object. The value is returned, and we check that there was in fact only one subobject.


assert-end-object [Method]

Reads from the buffer, and signals an error if the next thing read isn't an end-of-object marker.

Parameters
state - An instance of <load-state>.

Return Values
None

Description

Ensure the subobject list ends with the end-of-object token.


Circular Usage

To deal with circular object-usage graphs, several mechanisms are employed. Objects can be dumped as forward references and defined later. The loader for the object containing the forward reference has the option of delaying object creation until the forward-referenced object is loaded, or requesting that the object be backpatched when the forward-referenced object is finally loaded.

The following descriptions TBD:

  • new-local-id

  • label-next-object

  • dump-local-reference

  • <forward-ref>

  • obj-resolved?

  • actual-obj

  • request-backpatch

  • resolve-forward-ref

Object Sharing

By default, it is assumed that the object being dumped has no object sharing, i.e., for all sub-objects the ref-count is one. Any object that has a ref-count greater than one must be an instance of <identity-preserving-mixin>. (If it is OK to dump the object multiple times and load it back as separate objects, then this superclass may be omitted.)
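
For example, a hypothetical class whose instances may be referenced from several places within one dump would be declared like this, so that the dumper preserves identity instead of producing separate copies on reload:

define class <shared-node> (<identity-preserving-mixin>)
  slot node-children :: <list>, init-keyword: children:;
end class <shared-node>;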

The following descriptions TBD:

  • <identity-preserving-mixin>

  • maybe-dump-reference

Binary Representation

To allow efficiency as well as potential interchange, objects are described by a word sequence which vaguely resembles the in-core representation. The word size is implementation defined, and dumper/loader methods dump/load data as words having the native byte ordering. The format is defined so that any necessary byte-swapping and word-size adjustment can be done as a pre-pass filter to the actual loader. The format is described in detail at the top of od-format.dylan.

The implication of the above is that data-unit files produced from OD-Format are not portable.

Examples

These examples show the bits for some literals, assuming a 32-bit word size and big-endian byte order (which, by the way, the i386 is not), and assigning convenient values for the object class-IDs. See the actual code for the real definitions and class IDs.

For example, we could describe a byte-string like this:

    0: <Header? = #b1, Etype = #b00, Subobjects? = #b0, Raw-Format = #x1,
	Class-ID = byte-string-odf-id>
    1: string byte-count
    2[ceiling(byte-count, word-bytes)]: 
       <string-chars: *>
	   

Assuming byte-string-odf-id is #x666, the actual bits for "Foo" would be:

    #x81000666
    #x00000003
    #x466F6F00
	   

A list has subobjects, but no raw data, so it would look like:

    0: <Header? = #b1, Etype = #b00, Subobjects? = #b1, Raw-Format = #x0,
        Raw-Size = #x00, Class-ID = list-odf-id>

    1[N]: <header of subobject 0, subobject data>
       ...

    M: <Header? = #b1, Etype = #b01, Start-Offset: * = M>
	   

If list-odf-id were #x667, then #("Foo", "Bar") would be:

    0: #x90000667
    1: #x81000666
    2: #x00000003
    3: #x466F6F00
    4: #x81000666
    5: #x00000003
    6: #x42617200
    7: #xA0000007
	   

This demonstrates the nesting of objects, and shows how the end is marked. Note that:

  • "Foo" is represented by exactly the same bits here (it is position-independent), and that

  • The entire sequence is position-independent, since the end-entry has a relative offset, and that

  • The basic subobject protocol doesn't say in advance how many subobjects there's going to be; you have to wait for the end header. This could be avoided for e.g. vectors by considering subobject 0 to be the length.

The Dylan-Dump Module

This module provides ODF dumping and loading routines for all of the primitive types.

The Compile-Time-Values Module

This module exports <ct-value>, which represents a value known at compile time. It also supplies <literal> and a number of subclasses to represent literal strings, integers, lists and other primitive types.

Other compile-time values can be found elsewhere. Literal <type> objects are handled by classes in the CType module. Literal <class> objects are handled by classes in the Classes module. Literal <function> objects are handled by classes in the Compile-Time-Functions module.

The Source Module

d2c was designed to fetch source records from several sources, including flat text files and eventually code databases. To permit this flexibility, d2c uses a very abstract interface for representing locations in source code.

The <source-location> class is an open, abstract class with no slots. All instances must provide a method for describe-source-location.

The source-location generic takes an arbitrary object as an argument and returns a source location. The mix-in class <source-location-mixin> provides the init-keyword source-location: and a method on source-location which returns the value of the keyword or <unknown-source-location>.
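
A sketch of the mixin in use (the <my-node> class is hypothetical):

define class <my-node> (<source-location-mixin>)
end class <my-node>;

let node = make(<my-node>, source-location: loc);
source-location(node);  // => loc, or an <unknown-source-location>
                        //    if the keyword was omitted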

The <source-file> class reads an entire input file into memory and provides access to the contents. It's tightly coupled to the class <file-source-location>, which represents a location within a source file. Separating these two classes is not feasible. Among other things, this code generates the attractive, multiline displays of exactly where an error occurred.

The Tokens Module

This module provides classes and constants for representing tokens and syntax tables.

The Header Module

Dylan ™ source files have RFC 822 headers, similar to those used by Internet e-mail and many other protocols. This module provides support for parsing headers at the top of a file and extracting the values of keywords. Whitespace at the beginnings and ends of lines is removed automatically.
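
For example, a Dylan ™ source file typically begins with a header block like the following, terminated by a blank line (the module: keyword is the essential one; the others are illustrative):

module:    my-module
synopsis:  A one-line description of this file.
author:    A. Hacker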

The Platform Module

This module contains code to parse platforms.descr, which contains descriptions of the various host platforms supported by d2c. A description of each of the parameters appears in platforms.descr itself.

The Errors Module

This module exports a number of classes and functions related to compiler errors and warnings.

The Signature-Interface and Signature Modules

The Signature-Interface module creates a group of names which are actually implemented in the Signature module. These represent the formal parameter and result lists of a Dylan ™ function.

The Names Module

When reporting errors, d2c often needs to provide names for various functions, anonymous local methods and other things found in Dylan ™ programs. This module knows about several different kinds of names and how they should be displayed to the user.

It does not appear that these classes are used as part of the compiler's own computations. A quick look suggests that this is merely interface and debugging code.

The Definitions Module

I'm not quite sure what this module does, so take this description with a grain of salt. Better yet, figure out what this module does and fix this documentation.

The top-of-file comment for this module reads:

Abstract class that all definitions inherit from. A definition is the compilers handle on some run-time value. In addition to the obvious things, definitions exist for things like the type of a variable (when it isn't a compile-time constant).

It appears that named, top-level forms each get a definition. From the comment, it sounds as if the following code produces two definitions:

define variable foo :: run-time-expression() = #f;
	

One definition refers to the variable itself, and the other refers to the type value computed at initialization time. Is this a correct assumption?

The Variables Module

This module is in charge of libraries, modules and namespaces. It maintains the global table of libraries, processes the various clauses in define library and define module, and keeps track of where variable definitions live. The magic Dylan-User modules get set up here as well.

Each <variable> object also provides slots for the associated definition, transformers and constant evaluator. It would appear that this module defines some of the more important data structures in the compiler.

The Policy Module

The Policy module represents an optimization policy: a set of parameters used for making tradeoffs between speed, safety and code size. A quick inspection of the code shows that a <policy> object will not survive the dumping and loading processes intact. Do these actually get used?

The CType Module

Does ctype stand for “constant type”, “compile-time type”, or something else entirely?

This module defines the compiler's type model. This is used by the Flow module to represent the types of expressions, and by the Compiler-Optimize library during several phases of optimization.

Most of these types can be represented as Dylan ™ objects. Such types also inherit from <ct-value>, so they can be represented as compile-time values (see the section called “The Compile-Time-Values Module”).

All ctypes inherit from <values-ctype>. Single-value types (e.g., the types of variables) are represented by subclasses of <ctype>. Multiple-value types (e.g., the return values of functions) are represented as instances of <multi-value-ctype>.

Various subclasses of <ctype> represent class types, union types and all the standard (and non-standard) limited types supported by d2c. The class <unknown-ctype> represents a type we won't know until runtime. The function empty-type returns a type union with no members. No object can ever be an instance of this type. (Empty types are used extensively when computing the intersections, unions and differences of types.)

Class types are defined in the Classes module.

<multi-value-ctype> is a bit tricky. It represents zero or more required values of known types, zero or more optional values of known types, and an optional #rest list—which may also be typed. (See the source for more information.) The function wild-ctype returns a multi-value type representing zero or more values of any type.

This module supports type intersections, type unions and type differences, primarily for use by the optimizer. All of the operations may be imprecise; if so, the second return value will typically be false.

The intersection of two types is described as:

Return as restrictive a type as we can discover that is no more restrictive than the intersection of Type1 and Type2. The second value is true if the result is exact. At worst, we arbitrarily return one of the arguments as the first value (trying not to return an unknown type).

The union of two types is described as:

Find a type which includes both types. The result is an unknown type if either of the arguments are unknown; otherwise the result is precise.

The difference of two types is described as:

Return our best guess at the type that describes all objects that are in Type1 but not in Type2. If we can't precisely determine this type, then return something more inclusive than it, but never more inclusive than Type1.
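
A hedged sketch of how the optimizer might use these operations (empty-type is named above; the generic name ctype-intersection is an assumption about the module's exported API):

// If the intersection of two types is provably empty, no object can
// belong to both types at once.
let (isect, exact?) = ctype-intersection(type1, type2);
if (exact? & isect == empty-type())
  // type1 and type2 are disjoint
end if;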

Other things in this module include subtype tests, type extent computation and type specifiers. The various “-dispatch” functions have nothing to do with the compiler's support for generic functions—they're just hooks for extending various generic functions defined in this module.

The Transformers Module

Transformers are functions that are attached to function definitions. They serve as a sort of procedural macro in the optimizer. For instance, the do() function is transformed into an ordinary loop construct so the optimizer can bang on it some more. See the Cheese module for usage.

The Representation Module

This module provides a very abstract interface for choosing data representations in generated code. It currently appears to be implemented by the C-Representation module.

The Classes Module

This module contains the guts of the class system, which is implemented with classes as a subtype of <ctype>. This module also computes class precedence lists, unique class IDs and slot layouts.

The Type-Dump Module

This is an implementation module which contains ODF support for various subclasses of <ctype>.

The C-Representation Module

This module knows about the different C data types. It also knows about heap representations, direct representations and general representations.

The Compile-Time-Functions Module

From the top-of-file comment:

A ct-function is a kind of compile-time value used to represent a function. ct-functions contain various linkage-related information needed to call the function, but don't reference the FER for the function (e.g. the <function-literal>.) This information is used both by the backend and by the heap builder.

This is also where Peter Housel added the support for callback functions. This module includes support for general entries and generic entries, which are presumably the type-checked and dispatched entry points for functions, respectively.

The Flow Module

These classes define the flow model used by the compiler. Together with the classes defined in the Front module, they make up the intermediate representation used to communicate between the Compiler-Convert library, the Compiler-Front library, the Compiler-Optimize library and the Compiler-CBack library.

Unfortunately, the front-end and the back-end of the compiler use a slightly different version of the intermediate representation. To confuse matters further, both versions are called the “front-end representation” (see the section called “The Builder-Interface Module” for more details).

The actual flow model is fairly straightforward. Data-flow classes are defined in data-flow.dylan, and control-flow classes are defined in control-flow.dylan.

The data-flow classes represent variables, constants, operations and assignments. They use a generalization of the traditional three-address form. Expressions are broken down into a series of step-by-step assignments to temporary variables. The right-hand side of an assignment may contain a constant, a variable or a single operation (nesting is not allowed). The operation may take any number of arguments, each of which must be another variable or a constant.
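
For example, under this discipline an expression like a + b * c is flattened into a series of single-operation assignments to temporaries, conceptually:

t1 := b * c;
t2 := a + t1;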

When the flow classes are first created, a particular variable may be assigned to more than once. The optimizer, however, converts all variables to single static assignment (SSA) form. In SSA form, each variable is given a value at creation time and never modified. This restriction simplifies many optimization algorithms considerably.

Prior to SSA conversion, variables are represented by the class <initial-variable>. Each individual assignment to a variable is represented as an <initial-definition>. After SSA conversion, variables are represented by the class <ssa-variable>.

The control-flow model represents groups of assignments as regions. The simplest kind of region is <simple-region>, which represents a linear series of assignments. A <compound-region> holds several regions which are executed in sequence. The various subclasses of <join-region> represent control constructs with non-linear flow patterns. An <exit-region> transfers control to an enclosing block; various subclasses are used to return from functions, exit loops and perform other kinds of funky control flow.

This part of the compiler is well-commented, so you may want to read the source for further details. You'll also want to look at the classes defined in the Front module, most of which extend the control- and data-flow model in various useful ways.

The classes <join-operation>, <join-assignment> and <join-region> are currently ignored by the rest of the compiler. In each case the word “join” refers to the Phi function used in traditional SSA form. These classes will be used if anybody ever fixes the optimizer's SSA conversion.

The Compiler-Parser library

The Compiler-Parser library handles lexical analysis, parsing and macro expansion. It relies only on the Compiler-Base library.

The Tokenize Module

This module provides an abstract class <tokenizer> supporting get-token, unget-token and a few other functions.

For implementations of this interface, see the section called “The Lexer Module” and the section called “The Fragments Module”.

The Source-Utilities Module

This module implements a number of subclasses of <source-location> (see the section called “The Source Module” for details) that are used to represent the source of tokens during macro expansion.

There may be slightly more to this module than is apparent at first glance.

The Lexer Module

This module provides <lexer>, a subclass of <tokenizer> used to get tokens from a source record.

The Fragments Module

Fragments are the input to and output from macro expansion. According to the top-of-file comment:

The DRM says that:

The input to, and output from, a macro expansion is a fragment, which is a sequence of elementary fragments. An elementary fragment is one of the following:

  • A token: the output of the lexical grammar. ...

  • A bracketed fragment: balanced brackets ( (), [], or {} ) enclosing a fragment.

  • A macro-call fragment: a macro call.

  • A parsed fragment: a single unit that is not decomposable into its component tokens. It has been fully parsed by the phrase grammar. A parsed fragment is either an expression, a definition, or a local declaration.

So the parser needs to be able to produce fragments, the macro expander needs to be able to destructure and reconstruct fragments, and then the parser needs to be able to grovel the final results. This file implements all that.

This module also provides <fragment-tokenizer>, a subclass of <tokenizer> used to get tokens from a macro fragment.

The Parse-Tree Module

This module defines the tree representation used by the parser and the macro-expander. (It also includes the classes needed to represent the contents of a define macro form.)

The Parser Module

This module takes input from a <tokenizer> and generates a parse tree. It looks like the main entry point is parse-source-record. There are other entry points which are called recursively by the macro expander; these parse a single production of the Dylan ™ grammar.

This module defines the generic function process-top-level-form. It does not, however, define any methods. When the parser has found a top-level form, it calls process-top-level-form and allows other modules to take care of the details. This design may have implications for multi-threading and code reuse—it looks like the parser can only be used in one way by any given program.

The Macros Module

This module implements macro expansion as defined in the Dylan Reference Manual. It also provides <macro-definition> (see the section called “The Definitions Module”).

Note that this module only handles those macros which expand according to the standard rules. More complicated macros are handled elsewhere by procedural expanders. To define a new procedural expander, register it using define-procedural-expander (a method defined in this module). We'll discuss procedural macros as implemented by d2c in some detail below.

Gwydion Dylan ™ Procedural Macros

In this section, I shall use "procedural macro" to mean "procedural macro as implemented by the Gwydion Dylan Maintenance Project in d2c". This special definition contains an implicit caveat: procedural macros are not defined in the Dylan Reference Manual and are not standardized across Dylan ™ implementations — use at your own risk. Also, discussion on the Gwydion Dylan Maintenance Project mailing list has noted that macros defined by the Dylan Reference Manual (hereinafter referred to as "plain macros") are "Turing-complete" (yes, just as machine language and C are Turing-complete), which means that plain macros can do whatever procedural macros can do; it may just require more effort and creativity from the writer of the plain macro. Procedural macros give the macro writer alternative means to an end; their absence does not rule out any desired end. Those wishing to write Dylan Reference Manual-compliant Dylan ™ may safely skip this section.

An Introduction

First off, a procedural macro is a macro ... a special kind of macro that (explained simply) uses Dylan ™ expressions evaluated before expansion time to build a macro expansion. Then, like any plain macro, the expansion substitutes the fragments matched by the template with the procedural result. Procedural macros can be more expressive than the macros defined in the Dylan Reference Manual, but this expressiveness comes at the cost of more work both in preparing to write the macro (the bits of support code) and in actually writing the macro itself. I will use the make-if procedural macro as defined in the section called “The Expanders Module” to show how procedural macros work.

Second off, plain macros are straightforward (!) in their approach to source-code manipulation as compared to procedural macros. Plain macros match a pattern in the source code and then substitute that pattern with the (fixed) expansion:

source code (with a macro pattern) in
 => expanded source code out
	    

Procedural macros give the user programmatic control over the expansion:

source code (with a macro pattern) in
 => generate appropriate expansion (programmatically, of course)
  => expanded source code out
	    

A procedural macro writer must not only manage the issues of writing plain macros, but must also manage the process of writing the expansion. This additional requirement means that writing code that creates a macro expansion is writing code that manipulates the Front-End Representation (see the section called “The Compiler-Front library”), and, as such, is quite different from writing general-purpose code. In particular, the code (method) that implements the expander must interact with the code-generator engine and feed it things it understands: fragments (see the section called “The Fragments Module”).
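
For contrast, here is the classic pattern-rule form of a plain macro; the expansion to the right of => is a fixed template rather than something computed by Dylan ™ code:

define macro unless
  { unless (?test:expression) ?:body end }
    => { if (~ ?test) ?body end }
end macro unless;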

An Example

Section using make-if as an illustration TBD.

Roll Your Own, With Analysis

TBD: here we'll create avg from Paul Graham's On Lisp book as a regular macro (from Chris Double) and as a procedural macro, comparing and contrasting the two styles.

The Compiler-Front library

The Compiler-Front library is responsible for processing the Front-End Representation of a Dylan ™ program. It depends only on the Compiler-Base library.

Data is passed from the Compiler-Parser library to the Compiler-Front library by the Compiler-Convert library.

The Builder-Interface Module

This module defines an interface which can be used to build the front-end representation (FER) of a parsed Dylan ™ program.

Note that this module actually defines two interfaces: <flow-builder>, an abstract interface for constructing basic blocks (see the section called “The Flow Module”); and <fer-builder>, a more specific version of that interface which is used to construct the front-end representation. It's not apparent that the former, more general interface is actually used anywhere else.

This interface is implemented by the Front module.

The optimize-component generic function gets implemented by the Compiler-Optimize library. We need to figure out why it's defined here instead of there.

The Primitives Module

This module defines the primitive operations which may appear in Dylan ™ code.

The Front Module

This module does all the heavy lifting for the front-end of the compiler. It's split across a number of source files:

front.dylan

The data structures used in the front-end representation. These extend the classes defined in the Flow module.

builder.dylan

Abstract classes and generic functions for the <flow-builder> and <fer-builder> classes.

fer-builder.dylan

The actual implementation of the various builders. This seems to be where the actual front-end code lives.

fer-dump.dylan

Code for dumping the front-end representation as text.

clone.dylan

Code to clone the data structures representing a top-level function. This seems to be used when inlining functions.

To add a new class to the FER hierarchy, you'll need to add code to the following files: front.dylan, fer-dump.dylan, clone.dylan, fer-od.dylan and base/od-format.dylan. If you miss any of these spots, things will break in various entertaining ways.

The FER-OD Module

This module contains code for dumping the front-end representation as ODF (see the section called “OD-Format Module”). Special care is taken to make sure that all dumps start with a <function-literal> object; smaller components of the front-end representation cannot be dumped individually.

The Function-Definitions Module

This module implements various subclasses of <definition> representing functions (see the section called “The Definitions Module”). Most of the machinery for adding methods to a generic function at compile-time lives here, as does a good chunk of code related to sealing.

The Variable-Definitions Module

This module implements <variable-definition>. It also contains some of the code for handling load-time type expressions.

The Top-Level-Forms Module

This is not a very meaty module. It doesn't make much sense in the absence of the Compiler-Convert library. It does, however, contain a few classes for describing the things found at the top level of a Dylan ™ file and provides the generic function finalize-top-level-form, which is called by the compilation driver once everything has been parsed.

This should not be confused with the Top-Level-Expressions module, which instantiates the data structures defined in this module.

The Compiler-Convert library acts as an adaptor between the Compiler-Parser library and the Compiler-Front library.

As a rule of thumb, the Compiler-Convert library implements those parts of the front end which know about the output of the parser.

The LexEnv Module

This module implements lexical environments (a formal term for “local namespaces”). This is the code that associates variable names with the correct bindings.
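
As an illustration of what a lexical environment must track, consider shadowing (the names here are invented):

define method lexenv-demo () => (captured :: <integer>, current :: <integer>);
  let x = 1;                   // first binding of x
  let grab = method () x end;  // this closure refers to the first binding
  let x = 2;                   // a second let introduces a new binding of x
  values(grab(), x);           // => 1 and 2: same name, two bindings
end method;

Associating each use of x with the right one of those two bindings is exactly the job this module performs.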

The Compile-Time-Eval Module

This module performs compile-time evaluation of constant expressions. It operates on the parsed representation of a Dylan ™ program, not the front-end representation.

The Expanders Module

This module contains procedural macro expanders for basic, built-in macros such as make-if.

The FER-Convert Module

This module processes chunks of the parse tree and uses the <fer-builder> to produce the front-end representation.

Processing of Top-Level Definitions

The code to process top-level definitions is spread across several libraries. Most of the machinery actually lives here in the Compiler-Convert library.

Each top-level form must define methods on three generic functions:

process-top-level-form

When the parser finishes with a top-level form, it passes the form to this function, which produces the final parse tree and creates an object representing the top-level form. (See the section called “The Parser Module”.)

finalize-top-level-form

This function is called by the compilation driver when all the top-level forms have been parsed. It's used to implement various kinds of forward references within a module. (See the section called “The Top-Level-Forms Module”.)

convert-top-level-form

Once a top-level form has been processed and finalized, it can be converted to the appropriate front-end representation. Methods on this function use the <fer-builder> interface. (See the section called “The Top-Level-Forms Module” and the section called “The Builder-Interface Module”.)

Generally speaking, the parsed representation of a built-in define foo form would be named <define-foo-parse>. The object representing the top-level form itself would be named <define-foo-tlf>. There's also a <foo-definition> floating around in most cases. The relationships between these three representations need to be documented.
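
A purely illustrative skeleton for a hypothetical built-in define foo form follows. All names are invented and the signatures are simplified; consult the sources for the real argument lists:

define class <define-foo-tlf> (<top-level-form>)
  // Ties the top-level form to its <foo-definition>.
  slot tlf-defn :: <foo-definition>;
end class;

define method process-top-level-form (form :: <define-foo-parse>) => ();
  // Called as the parser finishes the form: finish parsing and record
  // a <define-foo-tlf> for the later stages.
  ...
end method;

define method finalize-top-level-form (tlf :: <define-foo-tlf>) => ();
  // Called once the whole file is parsed: resolve forward references.
  ...
end method;

define method convert-top-level-form
    (builder :: <fer-builder>, tlf :: <define-foo-tlf>) => ();
  // Emit the front-end representation via the builder interface.
  ...
end method;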

The Define-Macros Module

This module handles define macro forms and macro calls which appear at the top level.

The Define-Libraries-and-Modules Module

This module handles define library and define module forms.

The Define-Functions Module

This module handles define generic and define method forms. (define function forms are expanded by a macro defined in the runtime library.) Most of the logic for implicit generic functions lives here.

The Define-Constants-and-Variables Module

This module handles define constant and define variable forms. It also includes code for constant methods, which may get used in a few strange places. More research is needed here.

This module contains an alarming number of “shouldn't happen” errors. Is this left-over code, safety checks for a fragile area of the compiler, or evidence of some fundamental confusion about what's going on?

The Define-Classes Module

This module handles define class forms. There's a lot in here, including ctype calculation for classes and creation of getter and setter methods. It's not immediately clear how this relates to the code in the Classes module.

The Top-Level-Expressions Module

This module handles actual code appearing at the top level. There's also something going on with <magic-internal-primitives-placeholder>, which is (of course) excessively magic. If you're the sort of person who likes to figure out how magicians do their tricks, feel free to figure this out for the rest of us.

The Compiler-Optimize Library

d2c was written as a research project. As such, it contains some very sophisticated optimization routines.

The Cheese Module

For some unknown reason, all of the optimization code found in the compiler is contained in a single module named Cheese. Furthermore, this module does not actually provide an entry point for optimizing a chunk of code. Instead, it adds methods to the generic function optimize-component, which is exported by the Builder-Interface module. This interface is somewhat less than amusing.

This entry-point weirdness has been cleaned up on the OPTIMIZER_HACKING branch in CVS.

To study this module, begin by reading the method optimize-component in the file cheese.dylan. This routine is described in the section called “Control Flow”.

Data Formats

The optimizer's data structures are defined in a number of modules:

  • The type model.  Because this is a Dylan ™ compiler, the type model is pretty hairy. See the section called “The CType Module” and the section called “The Classes Module”.

  • The intermediate representation.  The optimizer manipulates code in the format defined by the Flow module and the Front module. In the code, this is referred to as FER, which is short for “Front End Representation”.

    The optimizer normally uses the Builder-Interface module to create new FER structures. If you're familiar with the underlying FER classes, you can pick up the builder interface as you read through the optimizer. Look for calls to build-assignment, make-unknown-call and related functions. They're everywhere.

The optimizer will make a lot more sense if you study the documentation and code for these modules first.

Control Flow

The optimizer has two modes: normal mode, and simplification mode. Normal mode is called from the driver in Compiler-Main library to optimize top-level functions. Simplification mode is called from several places in Compiler-Convert library to tidy up newly-defined inline functions. This means inline functions get optimized twice: once when they're defined, and a second time when they're inlined.

The optimizer's main entry point is the function optimize-component. This function contains a weird driver loop. The goal of the loop is to drain two queues: component.initial-variables and component.reoptimize-queue.

The initial-variables queue is filled by the FER builder (and by certain routines in optimize). At first, the queue contains all the variables in the component. The optimizer drains the queue by passing each variable to the function maybe-convert-to-ssa.

The function maybe-convert-to-ssa tries to convert each variable to SSA (“static single assignment”) form. Each SSA variable is defined by a single assignment and never changes in value. Because the values of these variables never change, the optimizer can make lots of useful assumptions about them.

Unfortunately, maybe-convert-to-ssa isn't very smart. It only does the most trivial kind of conversion: it changes <initial-variable>s with one <initial-definition> into <ssa-variable>s. (There are good algorithms for SSA conversion that can convert every variable. If we need these, check the literature.)
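
In source terms (f and g stand in for arbitrary calls):

let x = f();   // one definition, never assigned again: an
               // <initial-variable> with one <initial-definition>,
               // which maybe-convert-to-ssa turns into an <ssa-variable>
let y = 0;     // first definition of y ...
y := g(x);     // ... and a second one, so y keeps its <initial-variable> status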

Once the initial-variables queue has been drained, the optimizer starts work on the reoptimize-queue. This queue was initially set up during conversion (by the FER builder, I think). It contains everything we currently need to optimize.

The optimizer pops the first item off the front of the queue and passes it to the generic function optimize. This function knows how to do special-purpose optimizations on each subclass of <queueable-mixin>. These subclasses include expressions, assignments and certain basic blocks.

The generic function optimize is implemented by methods strewn throughout the optimizer. Each of these methods knows how to simplify a particular kind of queueable object. The method for <if-region>, for example, checks to see if the test condition is a compile-time value, and if so, replaces the <if-region> with the appropriate branch. Other methods do compile-time evaluation of primitives, dispatch generic functions, and optimize slot access.

The methods on optimize may or may not change a particular queueable object. But if a given method does make a change, it may open up new opportunities to improve various other queueables. So when we modify a queueable, we must generally reoptimize everything that depends on it.

To reoptimize dependents, we can't use recursive calls to optimize. Instead, we must re-insert the dependent into the reoptimize-queue (remember that?) using the function reoptimize. If the dependent is already in the queue, reoptimize does nothing. But if it isn't, reoptimize adds it to the very front.

Once we've optimized a single queueable, we return to the top-level driver. At this point, we have to drain the initial-variables queue again (which will almost always be empty). Once this is done, we pop the next item off of reoptimize-queue and make another call to optimize.

Once both queues are finally empty, the driver can move on. It first applies a number of “simplifications”: control-flow cleanups, common-subexpression elimination, and things like that. Any of these simplifications might add something to one of our two queues. If this happens, we need to drain our queues again and reapply our simplifications from the beginning. Fortunately, this backtracking happens very rarely, and it doesn't trigger any kind of global reoptimization.

The next step depends on our simplify-only? argument. If this argument is true, we're doing “pre-optimization” of an inline function, and we should stop at this point. But if simplify-only? is false, we're optimizing a regular function and need to get it ready for the back end.

To prepare a function for the back end, we need to add type checks, replace certain “placeholder” queueables, do environment analysis to find closure variables, and build the external entry points for local functions.

Each of these stages works like a simplification. It may add items to our two queues. If it does, we have to go drain those queues, reapply all our simplifications, and so forth.
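
The overall shape of the driver, paraphrased in code form from the description above (this is a summary, not the actual text of cheese.dylan):

define method optimize-component
    (component :: <component>, #key simplify-only? = #f) => ();
  // 1. Drain component.initial-variables through maybe-convert-to-ssa.
  // 2. Pop one queueable off component.reoptimize-queue; call optimize.
  // 3. Repeat steps 1-2 until both queues stay empty.
  // 4. Apply the simplifications; if any of them enqueue new work,
  //    start over from step 1.
  // 5. If simplify-only? is true, stop here.
  // 6. Otherwise add type checks, replace placeholders, do environment
  //    analysis, and build external entry points, draining the queues
  //    and re-simplifying after each stage.
  ...
end method;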

Call Optimization & Transformers

Dylan ™ , like LISP, uses function calls to represent most language operations. Arithmetic operators are functions. Slot getters and setters are functions. Even array access is done using functions. Given this, it's clear that the optimizer needs to spend a lot of time improving function calls.

Function calls are represented in FER using subclasses of <abstract-call>. You can find these classes in the file front/front.dylan. (Go ahead and read the comments; they're actually informative.)

There are three subclasses of <abstract-call>, each representing a different kind of function call. Calls to specific, known methods are represented by <known-call>. Calls to a generic function are represented by subclasses of <general-call>. (The most common of these is <unknown-call>.) Calls which are known to be erroneous are represented by <error-call>.

Most function calls start life as an instance of <unknown-call>. These are processed by the optimize method for <unknown-call>. (Look in callopt.dylan.) This method does some complicated dispatching. But in several of the most common cases, we wind up calling optimize-generic.

The function optimize-generic tries two different optimization strategies. First, it looks for an appropriate transformer. Transformers know how to optimize specific generic functions. If no transformer can be found, or the transformer declines, optimize-generic tries to do compile-time dispatch instead. This is done by looking for applicable methods. If there's a single applicable method, then we can replace the <general-call> with a <known-call>.
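
For example (the class, slot, and generic here are invented for illustration):

define class <circle> (<object>)
  slot radius :: <double-float>, required-init-keyword: radius:;
end class;

define sealed generic area (shape) => (a :: <double-float>);

define method area (c :: <circle>) => (a :: <double-float>);
  3.14159d0 * c.radius * c.radius;
end method;

// A call such as area(c) starts life as an <unknown-call>. If the
// compiler can prove c is a <circle>, the sealed generic has exactly one
// applicable method, and optimize-generic can replace the <general-call>
// with a <known-call> on that method.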

Simpler optimizations are applied to <known-call>s. But again, it's possible to register a transformer. This transformer will be applied to known calls with exactly matching function signatures.

Most of the transformers live in trans.dylan. There's also some black magic in here to handle calls to instance? by converting them into <instance?> placeholders, and changing them back into real code later, when replacing placeholders. This was probably done to hide these calls during simplification passes.

The Compiler-CBack Library

The Compiler-CBack library generates C code from the optimized front-end representation of a component.

The Stack-Analysis Module

This module analyzes the stack usage of a function. It's relatively straightforward and independent of the rest of the back end.

The CBack Module

This module has two main entry points: emit-tlf-gunk and emit-component. The former emits arbitrary C code needed by a given top-level form. The latter translates a Dylan ™ function into actual C code.

The file cback.dylan contains more documentation about how things work.

Right now, the back end assumes that the optimizer has been run. It's unclear which optimizations can be skipped safely.

The Heap Module

This module emits local (per library) and global (per application) heaps. Extensive documentation can be found in heap.dylan.

The Compiler-Main Library

The Compiler-Main library contains the actual compilation driver.

The Main Module

This module provides the command-line interface and compilation driver for d2c. It's relatively safe to make changes here, because they generally don't affect the rest of the compiler.

The Dylan Runtime Library

The runtime library is strongly dependent on the internal workings of d2c. A moderate amount of code duplication occurs: d2c will contain a compile-time version of a function and the runtime will contain a fully dynamic version.

Chapter 6. Melange Internals Guide

Melange is scheduled to be replaced by Pidgin in a future version of Gwydion Dylan ™ . Therefore, this section of the guide will eventually contain a description of Pidgin and its public APIs.

Appendix A. Gwydion Project Coding Style Guide

Note

This appendix lists the Dylan ™ language coding style guidelines formerly used by the Gwydion Project.

This appendix lists two broad categories of style issues for Dylan ™ . One category, called "Recommended", contains style points to which Gwydion programmers should adhere, as opposed to "must adhere". If someone sometimes, or even always, breaks a recommended style point, then others must be tolerant of that deviation. You must not modify the original author's code purely to satisfy an urge to convert the original author's style to your own. The second category, called "Mandated", contains style points to which Gwydion programmers must adhere. If you find someone has deviated from a mandated style point, then you may convert the code to the mandated style.

Recommended

Indent level

You should use two-space indentation. Each time code needs to be further indented, you should use one indent level, which is two spaces.
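
For example (an invented method, shown purely for the indentation):

define method tally-odd (seq :: <sequence>) => (n :: <integer>);
  let n = 0;                // one indent level: two spaces
  for (item in seq)
    if (odd?(item))         // each nesting adds one more level
      n := n + 1;
    end if;
  end for;
  n;
end method;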

Indent wrapping slot descriptions

When a slot description wraps to a new line in a "define class" form, the extra lines should be indented one extra indent level.
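
For example (the class and slot names are invented):

define class <espresso-cart> (<object>)
  slot bean-supply :: <bean-inventory>,
    required-init-keyword: beans:;   // wrapped line: one extra indent level
end class;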

Using /// comments

If you use /// comments, they should only be used between top-level declarations. /// comments can be used for logical page headers, inter-function comments, code-section introductions, and so on. For example:

...
end method;

/// order-espresso -- Method for Exported GF.
///
/// This method orders espresso at carts on the corner when it is
/// raining.  If it is not raining, this method delegates to
/// next-method. 
///
define method order-espresso (vendor :: <outdoor-espresso-cart>,
                              #next next-method)
  => ();
  ...
end method;
	

Many programmers do not like others to use /// comments. Rob noted that if these are only used between top-level declarations, where the /// indicates a comment of wider scope (as over a whole definition or logical page), then the Gwydion environment should be able to display such comments however any particular user wants to see them.

Using /* ... */

If you use /* ... */ comments, they should only be used at the beginning of a file or the beginning of a logical page so that their use is highly localized. This prevents them from showing up surprisingly in random places throughout a file of code. The need for /*...*/ comments should go away once the Gwydion environment has a means for linking general commentary to logical units of code. This issue has absolutely nothing to do with using /*...*/ to temporarily exclude code fragments.

Wrapping function definition headers

When a function header wraps (that is, a define generic or a define method declaration), you should either break the line within the parameter list or between the name and the parameter list. If you break the line between the name and the parameter list, or perhaps after the parameter list, then you should indent any extra lines one extra indent level. There is one exception: the first line of the return values specification may be indented only one space. The following are examples of recommended style:

define method some-longer-name-for-effect
    (one :: <class-1>, two :: <class-2>, ...)
    => (result1 :: <type1>, result2 :: <type2>);
  statement1;
  statement2;
  ...;
end method;

define method some-longer-name-for-effect
    (one :: <class-1>, two :: <class-2>, ...)
 => (result1 :: <type1>, result2 :: <type2>);
  statement1;
  statement2;
  ...;
end method;
	

Indent keyword parameters beyond #key token

If there is more than one keyword parameter specified in a function declaration, then all keyword parameter specifications should line up with the first specification. The following is an example:

define method foo (foo :: <simple-object-vector>,
                   #key start :: <fixed-integer> = 0,
                        stop :: <fixed-integer> = foo.size)
  ...
	

If a parameter list that contains a #key wraps, then you should wrap the parameter list before the #key token.

Wrapping variable and constant definitions

When a variable or constant definition wraps, you should break the line before the = token:

define constant western-culture :: <integer>
  = north-america-mask | south-america-mask | europe-mask;
	

Using end mumble to terminate statements

You should use end mumble, rather than just end, whenever there is sufficient vertical space between the beginning and the end of the statement to hinder readability. Some programmers like to use end mumble religiously, but those same programmers only use "end" when using end mumble would make a single-line statement wrap onto a new line. Programmers may add the mumble to an end if they feel the code was hard to visually scan, but programmers should not add mumbles to all ends in someone else's code just because they like that style. This policy is consistent with adding additional comments to someone else's code to help readability.
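
For example, reusing the espresso method from above (raining? is an invented predicate):

// Enough vertical distance to warrant a named end:
define method order-espresso (vendor :: <outdoor-espresso-cart>,
                              #next next-method)
  => ();
  ...
end method order-espresso;

// A bare end reads fine when the statement fits on one line:
if (raining?()) order-espresso(cart) end;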

Horizontal space for columnization or other visual effects

You should not use horizontal spacing to columnize aspects of Dylan ™ code, or to present other visual effects within the program; if you do, you should do so only rarely. The one exception to this rule is the "Indent keyword parameters beyond #key token" recommendation.

Keyword syntax

You should only use keyword notation for symbols in function calls and function declarations. You can also use keyword notation in slot descriptions within define class, where you are supplying a symbol to be the key for an initialization argument. When a symbol is a value, such as a default value in a declaration or an argument value in a call, you should use symbol notation, not keyword notation. The following are examples of recommended style:

define method reposition (stream :: <stream>, offset :: <integer>,
                          #key from: = #"start")


define method foo ()
  ...
  bar(x, y, color: #"red");
  ...
end method;

// In this example, required-init-keyword: would always be in
// keyword notation, and "foo:" is the item to which this style
// recommendation speaks.
//
define class <frob>
  slot gag, required-init-keyword: foo:;
end;
//
// Or ...
//
define class <frob>
  slot gag, required-init-keyword: #"foo";
end;
	

Blank comment line between introductory comments and definitions

You should include a blank comment line between the introductory comments for a function and the function definition. This applies to any top-level declaration for which you supply an introductory comment. The following is an example:

// Blah blah blah GINGER blah blah blah.
//
define ...
	

Wrapping case statements (applies to select too)

If any single branch of a case statement wraps, then you should wrap every branch. In practice, there are common exceptions to this recommendation.

When wrapping a case statement, you should wrap it after the => token. There should be no exceptions to this recommendation.

The following exhibit the recommended style:

case
  test1 => expression1;
  test2 => expression2;
end;

case
  test1 =>
    statement1;
    statement2;
  test2 =>
    statement1;
    statement2;
end;
	

Using #t for otherwise in case statements

You should not use #t for otherwise in case statements.
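
That is:

// Recommended:
case
  test1 => expression1;
  otherwise => default-expression;
end;

// Not recommended:
case
  test1 => expression1;
  #t => default-expression;
end;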

Wrapping for clauses

When an =/then clause needs to wrap, you should wrap before the = and before the then. Both the = and the then should be indented one extra indent level:

// Correct.
//
... nested levels ...
                 for (some-index
                        = initialization-expression(x, y, z)
                        then stepper(u, v, w))
                   stuff(some-index);
                 end;

// Incorrect.
//
... nested levels ...
                 for (some-index = initialization-expression(x, y, z)
                        then stepper(u, v, w))
                   stuff(some-index);
                 end;
	

Mandated

Function name and open paren

The define generic or define method declarations must have a space or newline between the name of the definition and the parameter list. At any function call site, the name and open paren must be adjacent, with no intervening characters.
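
For example (names invented):

// Definition: space (or newline) between the name and the parameter list.
define method order-espresso (vendor :: <espresso-stand>) => ();
  notify-barista(vendor);   // call site: name and open paren adjacent
end method;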

Wrapping let statements

If a let statement wraps, you must put the equal sign on a new line and indent it one extra indent level.
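
For example (names invented):

let thoroughly-steamed-milk
  = steam-to-temperature(milk-pitcher, 65);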

Terminate statements with semicolons

Semicolons are terminators and must appear wherever they are optional, except as covered by the "Single-line statements" mandate.

Single-line statements

Single-line statements, such as if-else, must not have any internal semicolons. If a single-line statement requires an internal semicolon, then you must break the statement into multiple lines. For example, the following would be illegal in Gwydion style:

let x = if (some-test) foo(); 3; else 5; end;
let x = if (some-test) foo(); 3 else 5 end;
let f = method (a, b) foo(); #f; end;
	

These should be formatted as follows (ignoring the end if issue):

let x = if (some-test)
          foo();
          3;
        else
          5;
        end;
let f = method (a, b)
          foo();
          #f;
        end;
	

Most programmers felt this rule, by its nature of eliminating multiple statements within a block on a single line, made code much more readable.

Wrapping arithmetic/boolean expressions

You must wrap long expressions before operators and indent the operators one extra indent level from the beginning of the expression. Note that assignments are considered to be "statements", and the "expression" that wraps is the code on the right-hand side of the assignment operator. Therefore, the first two examples below are considered approved style, but the third is not:

my-local := big-computation(arg1, arg2)
              + another-hairy-long-calculation(arg3, arg4, arg5)
my-local := (big-computation(arg1, arg2)
               + another-hairy-long-calculation(arg3, arg4, arg5))
my-local := big-computation(arg1, arg2)
  + another-hairy-long-calculation(arg3, arg4, arg5)