Coprocessor interfaces

Custom interfaces are used to enhance the capabilities of existing microprocessor chips. For example, the 680x0 and 80x86 devices have companion floating-point maths coprocessors.

A coprocessor may appear as a memory-mapped peripheral device and be accessed using standard instructions (the coprocessor could be built on a separate VME card, for example).

A coprocessor may be designed to operate with a coprocessor interface to its microprocessor. Special instructions will exist within the microprocessor to activate the coprocessor and to exchange data using the data bus.

The Undefined Instruction trap can be used to invent new instructions for a coprocessor that was not envisaged when the microprocessor was designed. Normally the trap is used to invoke a software routine (i.e. to emulate a coprocessor), but custom hardware may be used to create a coprocessor interface.

Software fixes

If a coprocessor is not physically present, its function may be emulated. However, this normally requires the software to be compiled in two forms, depending upon whether the coprocessor is present or absent, which can be inconvenient. A mid-way solution depends upon the software being able to detect the presence of a coprocessor and then to use specific coprocessor instructions or emulation routines accordingly.
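The mid-way solution can be sketched as a dispatch that is chosen once at start-up. This is only an illustration: the function names are hypothetical, the detection result is passed in as a plain flag, and `hardware_sqrt` merely stands in for a routine that would issue real coprocessor instructions.

```python
import math

def hardware_sqrt(x):
    # Stand-in (assumption) for a routine built from coprocessor instructions.
    return math.sqrt(x)

def emulated_sqrt(x):
    # Software emulation: Newton-Raphson iteration using basic arithmetic only.
    if x < 0:
        raise ValueError("square root of a negative number")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    while True:
        nxt = 0.5 * (guess + x / guess)
        if abs(nxt - guess) <= 1e-15 * nxt:
            return nxt
        guess = nxt

def select_sqrt(coprocessor_present):
    # Choose the implementation once, from the result of the presence test.
    return hardware_sqrt if coprocessor_present else emulated_sqrt

sqrt = select_sqrt(coprocessor_present=False)
root = sqrt(2.0)  # computed by the emulation routine
```

The rest of the program calls `sqrt` without knowing which version was selected, so only one compiled form of the software is needed.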

A novel implementation, which does not require recompilation of the source code, has been used on the ARM processor.

ARM processor

The ARM processor was originally designed by Acorn Computers as an upgrade to their BBC microcomputers. The Acorn RISC Machine (ARM) is used in Archimedes workstations and the Apple Newton PDA, and forms the core of many VLSI designs.

The ARM has an Undefined Instruction trap, which has its own entry in the vector table alongside the interrupt service routines. There are two kinds of undefined instruction, so a user can devise a use for some of these codes.

The ARM's coprocessor instructions are implemented in a similar way to the undefined instructions, the difference being that the /CPI pin is pulled low whenever one of the coprocessor instructions is executed. A 4-bit field in the coprocessor instructions allows up to 16 separate coprocessors to be addressed.
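As an illustration of the 4-bit field, the sketch below picks the coprocessor number out of a 32-bit instruction word. The bit positions are an assumption taken from the classic 32-bit ARM encoding (coprocessor number in bits 11..8, coprocessor instruction classes identified by bits 27..24); the function names are hypothetical.

```python
def is_coprocessor_instruction(word):
    # Assumption: classic 32-bit ARM encoding. Bits 27..24 are 0b110x for
    # coprocessor data transfers and 0b1110 for coprocessor data
    # operations and register transfers.
    top = (word >> 24) & 0xF
    return top in (0b1100, 0b1101, 0b1110)

def coprocessor_number(word):
    # Assumption: the 4-bit coprocessor number occupies bits 11..8,
    # giving the 16 possible coprocessors (0 to 15).
    return (word >> 8) & 0xF

MRC_P15 = 0xEE110F10   # MRC p15, 0, r0, c1, c0, 0
ADD_R0 = 0xE0800001    # ADD r0, r0, r1 (not a coprocessor instruction)

if is_coprocessor_instruction(MRC_P15):
    target = coprocessor_number(MRC_P15)  # coprocessor 15
```

Each coprocessor would compare this field with its own number and ignore instructions addressed to the others.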

When a coprocessor instruction is executed and /CPI is asserted, the designated coprocessor should respond by pulling CPA low immediately (CPA = coprocessor absent) to indicate that it will execute the command.

If no coprocessor exists, then CPA stays high and the undefined instruction trap is taken so that the coprocessor operation can be performed in software. (Of course, it could really be an undefined instruction, and therefore an error.)

CPB (coprocessor busy) is used by the coprocessor to delay the ARM processor while the operation is completed. The ARM will wait until CPB goes low and then start the next instruction.

The above process requires no recompilation whether a coprocessor is present or absent, but a software routine must be available if the coprocessor is absent.
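The decision made by the handshake can be modelled in software as follows. This is a behavioural sketch only: the signal levels are reduced to dictionary lookups, CPB wait states are not modelled, and the function name and the two tables are hypothetical.

```python
def execute_coprocessor_instruction(cp_number, operands, coprocessors, emulators):
    # 'coprocessors' maps a coprocessor number to fitted hardware;
    # 'emulators' maps it to a software routine reached through the trap.
    hardware = coprocessors.get(cp_number)
    if hardware is not None:
        # The addressed coprocessor pulled CPA low: it executes the command.
        return "hardware", hardware(*operands)
    handler = emulators.get(cp_number)
    if handler is not None:
        # CPA stayed high: the undefined instruction trap emulates the operation.
        return "emulated", handler(*operands)
    # No hardware and no emulator: a genuine undefined instruction.
    raise RuntimeError("undefined instruction")

fitted = {1: lambda a, b: a * b}
software = {1: lambda a, b: a * b, 2: lambda a, b: a + b}

path_1 = execute_coprocessor_instruction(1, (2, 3), fitted, software)
path_2 = execute_coprocessor_instruction(2, (2, 3), fitted, software)
```

The calling program is identical in both cases; only the path taken at run time differs.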

The coprocessor monitors the data bus, using the /OPC (op-code fetch) signal as a strobe, looking for coprocessor instructions. However, the ARM has a 3-word pipeline, i.e. instructions are fetched 2 cycles ahead of their execution. The coprocessor must therefore also have a 2-cycle delay (or an internal pipeline), so that it responds only to valid coprocessor instructions and not to an item of data which happens to have the same numeric value as a coprocessor instruction.
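A simplified model of such a delay is given below. It assumes one bus word per clock, treats the op-code fetch strobe as a boolean, and ignores pipeline flushes on branches; the class name `PipelineFollower` is hypothetical.

```python
from collections import deque

class PipelineFollower:
    def __init__(self, depth=2):
        # Two delay stages mirror the two fetches made ahead of execution.
        self.stages = deque([None] * depth, maxlen=depth)

    def clock(self, word, opcode_fetch):
        # Data cycles carry no instruction, so they never enter the delay line.
        if not opcode_fetch:
            return None
        executing = self.stages[0]   # word fetched two op-code fetches ago
        self.stages.append(word)     # oldest stage drops out automatically
        return executing

follower = PipelineFollower()
bus = [(0xA, True), (0xB, True), (0xEE110F10, False), (0xC, True), (0xD, True)]
seen = [follower.clock(word, fetch) for word, fetch in bus]
```

In this trace the third bus word is data that happens to look like a coprocessor instruction. Because the strobe is inactive on that cycle, it is never presented for execution.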

Inmos Transputer T800

This device adopts another method of achieving inter-processor communications, by providing 4 fast serial links.

The T800 processor is hardware and pin compatible with the T414 transputer. Its main features are described below.

The T800 has only six registers, of which three (A, B and C) are the sources and destinations for most arithmetic and logical operations. These three registers form an evaluation stack which operates in a Forth-like manner. For example, the add instruction adds the top two values on the stack and places the result on the top of the stack. It is up to the programmer to ensure that no more than three operands are stored in the stack.
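The behaviour of the three-register stack can be sketched as follows. This is a model for illustration, not the transputer instruction set: the class and method names are hypothetical, and the value left in C after an add should be treated as unspecified.

```python
class EvaluationStack:
    def __init__(self):
        self.a = 0  # top of stack
        self.b = 0
        self.c = 0  # bottom of stack

    def load(self, value):
        # Push: B moves to C, A moves to B, and the old C is lost.
        self.c = self.b
        self.b = self.a
        self.a = value

    def add(self):
        # Pop the top two values and leave their sum on top.
        self.a = self.b + self.a
        self.b = self.c

stack = EvaluationStack()
stack.load(2)
stack.load(3)
stack.add()
total = stack.a  # 5
```

Loading a fourth value before any operation silently discards the oldest one, which is why the programmer must keep the stack depth to three.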

Memory locations are accessed relative to the workspace pointer (one of the six registers). Instructions take a widely different number of cycles to execute (from 1 cycle to over 30 cycles). The FPU exhibits a similar variation and a single-precision multiplication takes 367 ns for a T800-30 device. Since there is no barrel-shifter, the execution time for Shift operations varies with the number of bits shifted.

The processor has a microcoded scheduler which enables concurrent processes to share processor time. Inactive processes do not consume any processor time. Active processes are of two types, high and low priority, and are kept on separate linked lists. Each process runs until its action is completed or until the current time slot is exceeded or until it requires input from another transputer or process. A process can only be de-scheduled on certain instructions so that expression evaluation and completion can be guaranteed.
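The two-list arrangement can be illustrated with a small model. It is a sketch under stated assumptions rather than the microcoded scheduler itself: each process is a hypothetical (name, work units) pair, a process waiting for input is not modelled, and the high-priority list is always served first.

```python
from collections import deque

def run_schedule(high, low, time_slice):
    if time_slice <= 0:
        raise ValueError("time_slice must be positive")
    high_queue = deque(high)   # active high-priority processes
    low_queue = deque(low)     # active low-priority processes
    trace = []
    while high_queue or low_queue:
        # Assumption: low-priority work runs only when no high-priority
        # process is active.
        queue = high_queue if high_queue else low_queue
        name, remaining = queue.popleft()
        ran = min(remaining, time_slice)
        trace.append((name, ran))
        if remaining > ran:
            # Time slot exceeded: rejoin the back of the same list.
            queue.append((name, remaining - ran))
    return trace

trace = run_schedule([("h", 1)], [("a", 3), ("b", 1)], time_slice=2)
# trace == [("h", 1), ("a", 2), ("b", 1), ("a", 1)]
```

Inactive processes never appear on either list, which is how they avoid consuming processor time in this model.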

An important feature of the transputer is its serial links. These allow fast inter-transputer communication or communication with a host system. The normal rate of data transfer is 10 Mbits per second, but rates of 5 or 20 Mbits per second can also be used. The links are TTL compatible. Each byte of data sent must be acknowledged, which signifies that the receiving process was able to receive the byte and that the serial link is ready to accept another byte. Communication via links is point-to-point, synchronised and unbuffered. Input message and output message instructions are used after loading the stack with a pointer to the message, the channel address and the number of bytes to transfer.
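The byte-by-byte acknowledgement can be modelled as shown below. This is a behavioural sketch, not the link's electrical protocol: the class and function names are hypothetical, and a missing acknowledge simply stops the transfer here, whereas a real sender would wait.

```python
class ReceivingProcess:
    def __init__(self, expected):
        self.expected = expected   # number of bytes the input message wants
        self.buffer = bytearray()

    def accept(self, byte):
        # Acknowledge only while the process is still waiting for input.
        if len(self.buffer) >= self.expected:
            return False
        self.buffer.append(byte)
        return True

def output_message(message, count, receiver):
    # Send 'count' bytes; the next byte goes out only after an acknowledge.
    sent = 0
    for byte in message[:count]:
        if not receiver.accept(byte):
            break
        sent += 1
    return sent

receiver = ReceivingProcess(expected=5)
sent = output_message(b"hello world", 5, receiver)
```

Since nothing is queued between the two ends, the sender and the receiver stay in step throughout the transfer, which is what makes the communication synchronised and unbuffered.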

The T800 may be used standalone, but it is commonly interfaced to IBM-PC computers, where high-level language support is available. The transputer can be programmed in C, for example, or in Occam to enable parallel processing between transputers.

