Thursday, March 27, 2008

Homogeneous, N-Dimensional Arrays (ndarrays)

The basic representational element for PLT Scheme Schemelab is a homogeneous, n-dimensional array - or ndarray. This post will discuss the initial design for ndarrays and discuss some of the operations on them.

Internally, an ndarray is represented by a structure with the following elements:
  • ordering - (symbols 'row 'column), specifies the order in which dimensions are stored ('row for row major and 'column for column major). Generally, this isn't that useful except for (eventually) interfacing with foreign code (C uses row major and FORTRAN uses column major). Also, the transpose operation converts from one to the other (although that is more a side effect that its real functionality). [read only]
  • shape - (listof natural-number/c), specifies the shape of the array. The length of the shape list is the number of dimensions for the array and each element is the cardinality of the corresponding dimension. This may be set to reshape the array.
  • ndim - natural-number/c, the number of dimensions for the array. This is equal to (length shape). [read only]
  • size - natural-number/c, the total number of elements in the array. [read only]
  • disp - natural-number/c, the displacement (in elements) of this subarray is the data vector. This will be zero for any ndarray that owns its own data. [read only]
  • strides - (listof natural-number/c), a list of the number of elements to skip to move to the next element in the corresponding dimension. [read only]
  • offset - natural-number/c, the offset (in elements) for each addressed element. This will be zero for any ndarray that owns its own data. [read only]
  • data - typed-vector?, the typed vector containing the data for the array. [A typed vector encapsulates an SRFI 4 vector (for u8, u16, u32, u64, s8, s16, s32, s64, f32, or f64 arrays), a complex vector (for c64 or c128 arrays), or a Scheme vector (for Scheme object arrays).] [read only]
  • base - (or/c array? false/c), the base array for this (sub)array or #f if the array owns its own data (i.e. it is a base array). Note that the referenced array may itself base a non-#f base entry. [read only]
Note that I may not need both the displacement (disp) and offset fields. But, the general idea for array iteration is to add the displacement once when the iterator is initialized and to add the offset for each element. It might be possible to collapse those into one offset without loss of generality.

The most general array constructor is make-array, which has the following contract:

(->* ((listof natural-number/c))
(#:vtype (or/c vtype? symbol?)
#:ordering (symbols 'row 'column)
#:fill any/c)

The required argument is the shape of the array. The optional keyword arguments are #:vtype, which specifies the type of the array (default object), #:ordering, which defaults to 'row, and #:fill. If #fill is not specified, the array elements are not initialized.


(define a1 (make-array '(3 2) #:vtype f32 #:fill 0.0))

Creates an array whose shape is '(3 2) (i.e. three rows and two columns) and whose elements are 32-bit floating-point numbers that are initially 0.0.

A fully qualified reference for the array returns the corresponding (scalar) element. For example, (array-ref a1 '(1 1) returns the value of the element at '(1 1).

A partially qualified reference for the array returns a reference to the corresponding subarray. For example, (array-ref a1 '(: 1) ) returns the subarray of shape '(3) that corresponds to the second column of a1. [Note that this is a reference to the second column of a1 and not a copy of it. Array referencing never copies anything.] Likewise, (array-ref a1 '(1 :)) [or equivalently, (array-ref a1 '(1))] returns a reference to the second row of a1.

Array mutation also allows partially qualified references. In that case, the value specified must be broadcastable (described in a future post) to the shape of the reference. For example, (array-set! a1 '(: 1) 1.0) sets the elements of the second column of a1 to 1,0 (yes, the scalar 1.0 is broadcastable to shape '(3)). Likewise, (array-set! a1 '(1) '(4.0 8.0)) sets the elements of the second row of a1 to 4.0 and 8.0.

A complete set of array manipulation functions will be provided. This will include: build-array, array-map, array-map!, array-for-each (i.e., similar to what SRFI 43 provides for vectors).

I haven't though about the semantics for things like array-fold and array-unfold. They may be well-defined someplace - I really haven't looked. If anyone knows where or has any ideas, please let me know.

Note that it is not my intention to define functions like add, subtract, multiply, divide, ... (as numpy does for Python). Rather, we will rely on the array mapping functions to extend scalar operations to arrays.


Schemelab Concept

My PLT Scheme project for this year is to create a Schemelab collection to provide better analysis capabilities in PLT Scheme. It will be something of a mini-Matlab with capabilities similar to what numpy, scipy, and matplotlib provide in Python. I plan on making a new numeric collection to provide homogeneous, n-dimensional arrays (and, as a subset, matrices) as the primary underlying representation. Next, the science collection will be updated to use the new numeric package. Finally, a new plot collection will be developed to provide better visualization functionality. [After these are done I'll also update the simulation and inference collections to use the new Schemelab packages.]

I've already started work on the numeric collection. The basic representational element is a homogeneous, n-dimensional array (ndarray). An ndarray's type and shape are specified when it is created. The type may be any of the element types allowed for SRFI 4 vector types (u8, u16, u32, u64, s8, s16, s32, s64, f32, or f64), a complex type (c64 or c128), or may hold any Scheme object (object). The shape is a list of natural numbers where the length of the shape is the number of dimensions for the ndarray and each element is the cardinality of the corresponding dimension. References to array elements will support array slicing (in any dimension) for both array accessing (array-ref) and mutation (array-set!). Slicing operations create new views of the referenced array as opposed to copying portions of the array. I will start posting entries on different aspects of the numeric collection in the next few days.

The updates to the science collection to support (or to make use of internally) the new numeric collection should be relatively straightforward. The main issue will be to retain compatibility with the existing data types (i.e. vectors). Most of the really numerically complex code (e.g. special functions and random number distributions) will not be affected since they don't generally provide vector or array operations. The main changes will be to the statistics and histogram modules, which will be modified to work with ndarrays as well as vectors. Finally, some of the modules (e.g. histograms and ordinary differential equations) will benefit from being reimplemented using ndarrays internally.

I've also prototyped some code for the new plot package that provides much more functionality than the current PLoT package. The initial capability will be very much patterned after the functionality provided by Matlab (or more precisely, matplotlib for Python, which is also based on Matlab). It provides precise control over the elements of a graph (or multiple subgraphs). It also provides interactive graphics functionality for more dynamic analysis capabilities.

Note that all of these new (or updated) collections require PLT Scheme Version 4 (currently 3.99) and are not compatible with earlier versions. As such, I am not releasing them to PLaneT until either PLT Scheme Version 4 is officially released or there is a separate PLaneT repository for PLT Scheme Version 4 code. I will be putting the code on the Schematics web site at some point.