Saturday, September 26, 2009

Getting Ready for Inference Collection V3.0 Update 1

One of the major changes for version 3.0 of the inference collection is a rule compiler that generates code for (portions of) the match/join process. This will be done at expansion (compile) time by the rule language macros themselves. The rule language macros include define-ruleset, define-facts, and define-rule. The rule compilation itself takes place in the define-rule macro.

So far, I have the rule parsing, rule validation, and rule normalization phases pretty much complete. The rule normalization phase converts Boolean expressions within the pattern clauses into a standard form: (and <pattern-clause> ...) | (or (and <pattern-clause> ...) ...). The first form is a simple rule, while the second is a sequence of simple rules. For example:

(define-rule (some-rule some-ruleset)
  (p1 ?a ?b ?c)
  (p2 ?a ?b ?c)
  (or (and (p3 ?a)
           (p4 ?b))
      (p3 ?c))
  ==>
  (printf "a = ~a, b = ~a, c = ~a~n" ?a ?b ?c))

would expand into the equivalent of the two rules:

(define-rule (some-rule-a some-ruleset)
  (p1 ?a ?b ?c)
  (p2 ?a ?b ?c)
  (p3 ?a)
  (p4 ?b)
  ==>
  (printf "a = ~a, b = ~a, c = ~a~n" ?a ?b ?c))

(define-rule (some-rule-b some-ruleset)
  (p1 ?a ?b ?c)
  (p2 ?a ?b ?c)
  (p3 ?c)
  ==>
  (printf "a = ~a, b = ~a, c = ~a~n" ?a ?b ?c))

This can simplify writing complex rulesets.

I am now coding the actual rule compiler. It will generate procedures for matching pattern elements and for joining matches (applying join constraints). For example, the pattern

(p1 ?a 12 ?b : (> ?b 10) (c . ?c))

would compile into:

(p1
 (#:variable #f ?a #f (lambda (?a bindings)
                        (vector-immutable
                         (unsafe-vector-ref bindings 0) ?a)))
 (#:literal #f #f 12 (lambda (x bindings)
                       (if (= x 12) bindings #f)))
 (#:variable #f ?b (> ?b 10) (lambda (?b bindings)
                               (let ((?a (unsafe-vector-ref bindings 1)))
                                 (if (> ?b 10)
                                     (vector-immutable
                                      (unsafe-vector-ref bindings 0) ?a ?b)
                                     #f))))
 (#:variable c ?c #f (lambda (?c bindings)
                       (let ((?a (unsafe-vector-ref bindings 1))
                             (?b (unsafe-vector-ref bindings 2)))
                         (vector-immutable
                          (unsafe-vector-ref bindings 0) ?a ?b ?c)))))

where each pattern element is a five-element list, (<type> <key> <variable> <constraint> <matcher>): <type> is the pattern element type (e.g., #:variable or #:literal), <key> is the key for association list matching, <variable> is a variable name, <constraint> is a constraint expression, and <matcher> is a procedure that matches the pattern element and extends the bindings. These are used at execution time to build the rule network.

The structure of the pattern is maintained in the compiled form and is used to structure the match nodes and the data flow between them. The <type>, <key>, <variable>, and <constraint> fields are used to determine when nodes can be shared among rules. Finally, the procedure in the <matcher> field does the actual matching and variable binding.
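
To make the data flow concrete, here is a hypothetical sketch, not the actual V3.0 code, of how the matchers might be applied to a fact at execution time: each matcher either extends the bindings vector or returns #f to reject the match.

;; Hypothetical sketch (not the actual V3.0 code): apply each pattern
;; element's matcher to the corresponding fact value, threading the
;; bindings vector through and failing fast when a matcher returns #f.
(define (match-pattern matchers fact-values initial-bindings)
  (let loop ((ms matchers)
             (vs fact-values)
             (bindings initial-bindings))
    (cond ((not bindings) #f)        ; some matcher rejected the fact
          ((null? ms) bindings)      ; all elements matched
          (else (loop (cdr ms)
                      (cdr vs)
                      ((car ms) (car vs) bindings))))))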

Data Structures for Matching and Joining

Immutable vectors are used to represent matches. For example, the pattern

(p1 ?a 12 ?b : (> ?b 10) (c . ?c))

matched against the asserted fact

(p1 10 12 14 (c . 16) (d . 18))

would produce a match of

#(<assertion> 10 14 16)

where <assertion> is the assertion object for the specified fact and 10, 14, and 16 are the values matching the variables ?a, ?b, and ?c, respectively.

Immutable vectors are also used to represent joined matches. For example, the patterns

(p1 ?a ?b ?c)
?p2 <- (p2 ?a ?b ?c)
(no (p3 ?a))

matched against the asserted facts

(p1 1 2 3)
(p2 1 2 3)

with no asserted (p3 1) fact, would produce a match of

#(#(<assertion> 1 2 3)
  #(<assertion> 1 2 3)
  #(#t))


A major bottleneck in the current inference engine is the amount of work done in propagating (or unpropagating) assertions and deletions through the join (and rule) nodes of the rule network. For version 3.0, I'm planning to use two (eq?) hash tables in each join node to index the left and right matches, together with a double doubly-linked list of joined matches. This will allow efficient insertion and deletion of both left and right matches during join processing (for propagating and unpropagating).
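
As a rough illustration of the indexing half of this design, here is a minimal sketch, with plain lists standing in for the doubly-linked structure of joined matches; the names and representation are illustrative, not the planned V3.0 code.

;; Minimal sketch of the join-node indexing idea (illustrative only;
;; plain lists stand in for the doubly-linked structure described above).
(define-struct join-node
  (left-index    ; eq? hash: left match  -> joined matches containing it
   right-index)) ; eq? hash: right match -> joined matches containing it

(define (make-empty-join-node)
  (make-join-node (make-hasheq) (make-hasheq)))

;; Record a new joined match under both of its parent matches.
(define (join-node-add! node left right joined)
  (hash-update! (join-node-left-index node) left
                (lambda (js) (cons joined js)) '())
  (hash-update! (join-node-right-index node) right
                (lambda (js) (cons joined js)) '()))

;; Unpropagate a retracted left match: return the joined matches that
;; must be removed downstream, and drop the index entry.
(define (join-node-remove-left! node left)
  (begin0
    (hash-ref (join-node-left-index node) left '())
    (hash-remove! (join-node-left-index node) left)))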

More to come.


Sunday, September 20, 2009

Science Collection Updated

I released a new version of the Science Collection with reimplemented statistics and Chebyshev approximation routines.

Statistics

The statistics routines have been rewritten to take sequences (for example, lists and vectors) as arguments instead of just vectors, except for the median and quantile routines, which require sorted vectors. Unfortunately, the sequence versions, which are generally implemented using for/fold, were more than twice as slow as the previous vector versions. So, I wrote dispatching for and for/fold macros (see below) that take a general sequence and generate a cond expression with special cases for vectors and lists that the compiler can better optimize. These perform about the same as the old code for vectors, and lists perform about as well as vectors. And, we still get the benefit of general sequences.

I also added the correlation function that I had somehow missed in the original implementation. [Unfortunately, the first version had a bug (#203) in it; actually, there were two, since the contract was also wrong. These have since been fixed.]

I also added running statistics to the statistics routines. If you only need simple statistics (e.g., n, min, max, mean, variance, and standard deviation), you don't need to maintain a list (or vector) of the values. Instead, you create a statistics object and tally successive values. The statistics object can be queried at any time for the values of its statistics at that point.
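
For example, a sketch of the intended usage, assuming names like make-statistics and statistics-tally! (see the reference documentation for the exact API):

;; Assumed API names for illustration; check the Science Collection
;; reference for the actual running statistics interface.
(define stats (make-statistics))
(for ((x (in-list '(1.0 4.0 2.0 8.0 5.0))))
  (statistics-tally! stats x))
(statistics-n stats)                  ; => 5
(statistics-mean stats)               ; => 4.0
(statistics-standard-deviation stats) ; computed from the running tallies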

I also implemented a mean-variance function that computes both the mean and variance for a sequence in a single pass through the data. [The previous functions required two passes: one to compute the mean and one to compute the variance.] This also reduces the number of passes required for computing skew and kurtosis when the mean and standard deviation aren't given.
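
A single pass can be achieved with Welford's recurrence; here is a minimal sketch of the idea (not necessarily the formulation used by mean-variance):

;; One-pass mean and (sample) variance via Welford's recurrence.
;; A sketch of the idea only; not the Science Collection implementation.
(define (mean+variance seq)
  (define-values (n mean m2)
    (for/fold ((n 0) (mean 0.0) (m2 0.0)) ((x seq))
      (let* ((n+1 (add1 n))
             (delta (- x mean))
             (new-mean (+ mean (/ delta n+1))))
        (values n+1 new-mean (+ m2 (* delta (- x new-mean)))))))
  (values mean (/ m2 (sub1 n))))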

I also implemented contracts for sequence-of-real? and nonempty-sequence-of-real? that are used in the contracts for the statistics routines. As usual in the Science Collection, there are unchecked versions of the statistics routines that bypass the contract check.
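
A minimal sketch of what such a predicate can look like (the actual definitions in the collection may differ, e.g., in how they handle infinite sequences):

;; Plausible definitions for illustration; the Science Collection's
;; actual contracts may be implemented differently.
(define (sequence-of-real? s)
  (and (sequence? s)
       (for/and ((x s)) (real? x))))

(define (nonempty-sequence-of-real? s)
  (and (sequence-of-real? s)
       (for/or ((x s)) #t)))  ; at least one element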

Chebyshev Approximations

The Chebyshev approximation routines were reimplemented for clarity and efficiency. I changed the make-chebyshev-series function to be more general. It replaces the functionality of make-chebyshev-series-order and chebyshev-series-init, which are still available as legacy routines. I also added make-chebyshev-series-derivative and make-chebyshev-series-integral that generate new series to compute the derivative and integral, respectively, of a given Chebyshev series.
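
For example, usage along these lines, where the argument order and the chebyshev-eval name are assumptions on my part (see the reference documentation for the exact signatures):

;; Hypothetical usage (argument order and names are assumed; check the
;; Science Collection reference for the exact signatures).
(define sin-series
  (make-chebyshev-series sin 20 0.0 pi))         ; order-20 fit on [0, pi]
(define cos-series
  (make-chebyshev-series-derivative sin-series)) ; series for the derivative

(chebyshev-eval sin-series 1.0)  ; approximately (sin 1.0)
(chebyshev-eval cos-series 1.0)  ; approximately (cos 1.0)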

Dispatching for and for/fold

In reimplementing the statistics routines to allow general sequences as arguments, I also implemented dispatch-for and dispatch-for/fold macros, as described above. These macros are not exported from the statistics collection and are implemented using syntax-rules, so error messages are pretty much limited to 'bad syntax'. With that caveat, they do produce code that outperforms (due to compiler optimizations) a straight for or for/fold for vectors and lists.
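
As a rough sketch of the expansion strategy, a single-sequence dispatch-for/fold might look like the following (the version in statistics.ss is more general):

;; Simplified sketch of dispatch-for/fold for a single sequence; the
;; version in statistics.ss is more general.  Dispatching to in-vector
;; and in-list lets the compiler generate tight loops for those cases.
(define-syntax dispatch-for/fold
  (syntax-rules ()
    ((_ ((accum init) ...) ((x seq-expr)) body ...)
     (let ((seq seq-expr))
       (cond ((vector? seq)
              (for/fold ((accum init) ...) ((x (in-vector seq))) body ...))
             ((list? seq)
              (for/fold ((accum init) ...) ((x (in-list seq))) body ...))
             (else
              (for/fold ((accum init) ...) ((x seq)) body ...)))))))

A call such as (dispatch-for/fold ((sum 0.0)) ((x data)) (+ sum x)) then expands into a cond whose vector and list branches use in-vector and in-list, which the compiler turns into tight, specialized loops.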

The implementation of the macros and examples of their usage can be seen in the statistics.ss file in the Science Collection.
