Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!harpo!seismo!hao!hplabs!sri-unix!OKeefe.R.A.%EDXA@UCL-CS
From: OKeefe.R.A.%EDXA%UCL-CS@sri-unix.UUCP
Newsgroups: net.lang.prolog
Subject: Sets in Prolog
Message-ID: <16248@sri-arpa.UUCP>
Date: Thu, 26-Jan-84 17:56:00 EST
Article-I.D.: sri-arpa.16248
Posted: Thu Jan 26 17:56:00 1984
Date-Received: Wed, 8-Feb-84 00:42:33 EST
Lines: 245
From: "OKeefe.R.A."
Re: Shimon Cohen's note on sets, V2 #5 Prolog Digest.
1. He expressed concern at "the inefficient implementation of sets
in Prolog". This is like being concerned about the anti-clerical
attitude of the present King of France. There *is* none. I do
not know about LM-Prolog, but none of {DEC-10 Prolog, C-Prolog,
UNSW Prolog, MU Prolog, PopLog, Prolog-X, NIP} offers you a
Prolog data-type called "set" or, for that matter, "bag".
2. There is a very good reason for this. In various programs, I
have used the following representations of sets:
a) unordered lists (supported by SETUTL.PL)
-- this may be the library file whose inefficiency is worrisome
b) ordered lists with no duplicates (supported by ORDSET.PL)
-- except for membership tests, these routines are as efficient
-- as anything you can do with hash tables. Trouble is, they
-- only really work with "sufficiently instantiated" elements,
-- but then hash tables have precisely the same limitation (and
-- you need an unusually co-operative garbage collector if you
-- are not to be restricted to ground elements only)
c) ordered binary trees (I haven't submitted this to the library at
SU-SCORE yet, but ASSOC.PL comes close)
-- element testing takes O(lg N) time compared with O(N) for (a)
-- or (b); all the other operations take O(N) time as in (b),
-- whereas (a) of course takes O(N^2) time. The union,
-- intersection, ... code is rather more complex than in (b)
d) vectors of var/nonvar (not submitted as I think it's hacky)
-- this applies to subsets of [1..N] (N can be up to 100 in
-- obsolete versions of C-Prolog, up to 200 in current versions,
-- up to several thousand in DEC-10 Prolog, any number in NIP
-- or PopLog), an element is in the set if arg(X,Set,A),nonvar(A)
-- or absent from it if var(A). I have used this for colouring
-- various kinds of graphs, where in the nonvar(A) case A is the
-- colour assigned.
e) bit-vectors using the /\ \/ / << and >> arithmetic operators. (If
you think I'm going to submit this you must be joking, it's
TRIVIAL, like so many things in Prolog if you look for a simple
solution, but you have to beware of different word-lengths)
-- this again applies to subsets of [1..N], but for much larger
-- N. An integer is 18 bits in DEC-10 Prolog, 24 bits in Prolog-X
-- and NIP, 30 bits in PopLog(?) and TIP. Note that I am talking
-- about a *vector* of integers, not a list. Lists can be used
-- too, but I've never had occasion to work with sets of integers
-- where I didn't know N before I wanted to construct the set.
f) a base set, represented as a table of elements (either vector
form or binary tree, see TREES.PL), and sets of integers
representing subsets of the base set in (d) or (e) form
I have also used (a), nested (b), (d), and (f)/(d) for graphs.
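As a rough illustration of representation (b), here is a minimal sketch of
union and membership on ordered lists without duplicates. The predicate
names (list_to_oset/2, oset_union/3, oset_memberchk/2) are made up for this
example and are not claimed to be those of ORDSET.PL; only sort/2 and
compare/3 are assumed from the Prolog system.

```prolog
% Sets as ordered lists without duplicates (representation b).
% All predicate names here are illustrative assumptions.

% list_to_oset(+List, -Set): sort/2 orders by the standard order
% of terms and removes duplicates.
list_to_oset(List, Set) :-
    sort(List, Set).

% oset_union(+Set1, +Set2, -Union): a single merging pass over
% both lists, so the cost is linear in their combined length.
oset_union([], Set2, Set2).
oset_union([H1|T1], [], [H1|T1]).
oset_union([H1|T1], [H2|T2], Union) :-
    compare(Order, H1, H2),
    oset_union(Order, H1, T1, H2, T2, Union).

oset_union(<, H1, T1, H2, T2, [H1|Union]) :-
    oset_union(T1, [H2|T2], Union).
oset_union(=, H1, T1, _, T2, [H1|Union]) :-
    oset_union(T1, T2, Union).
oset_union(>, H1, T1, H2, T2, [H2|Union]) :-
    oset_union([H1|T1], T2, Union).

% oset_memberchk(+X, +Set): still O(N), but the search stops as
% soon as it passes the place where X would have to be.
oset_memberchk(X, [Y|Ys]) :-
    compare(Order, X, Y),
    oset_memberchk(Order, X, Ys).

oset_memberchk(=, _, _).
oset_memberchk(>, X, Ys) :-
    oset_memberchk(X, Ys).
```

The elements must be sufficiently instantiated when the set is built,
since later bindings could change their relative order.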
In various programs, I have used the following representations of
bags:
A) unordered lists (there is no library file to support this)
B) ordered sequences of Element-Count pairs (BAGUTL.PL)
C) binary trees of Element-Count pairs (no library file for
this but it would be trivial to adapt ASSOC.PL)
D) for bags of 1..N, vectors of counts
-- adding an element requires copying the vector, but see
-- ARRAYS.PL for constant-time updatable arrays in Prolog,
-- so adding an element to a bag takes constant time
E) a base set and a bag of integers in (A) (B) (C) or (D) form.
**** Not *ONE* of these representations is best for all purposes.
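To make bag representation (B) concrete, here is a small sketch that
assumes a bag is an ordered list of Element-Count pairs. That reading of
the representation, and the names bag_add/3 and list_to_bag/2, are
assumptions made for this example; nothing here is taken from BAGUTL.PL.

```prolog
% Bags as ordered lists of Element-Count pairs (an assumed reading
% of representation B). Predicate names are illustrative only.

% bag_add(+Bag0, +Element, -Bag): add one occurrence of Element.
bag_add([], E, [E-1]).
bag_add([X-N|Rest], E, Bag) :-
    compare(Order, E, X),
    bag_add(Order, E, X-N, Rest, Bag).

% E sorts before the current pair: start a new pair here.
bag_add(<, E, Pair, Rest, [E-1, Pair|Rest]).
% E is already present: bump its count.
bag_add(=, E, _-N, Rest, [E-N1|Rest]) :-
    N1 is N + 1.
% E sorts after the current pair: keep looking.
bag_add(>, E, Pair, Rest, [Pair|Bag]) :-
    bag_add(Rest, E, Bag).

% list_to_bag(+List, -Bag): add the elements of List one by one.
list_to_bag(List, Bag) :-
    list_to_bag(List, [], Bag).

list_to_bag([], Bag, Bag).
list_to_bag([E|Es], Bag0, Bag) :-
    bag_add(Bag0, E, Bag1),
    list_to_bag(Es, Bag1, Bag).
```

Building a bag this way costs O(N) per insertion; sorting the list first
and then counting runs would be the cheaper route for large inputs.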
3. Hash tables are not best for all purposes either. If you spend
a lot of time working on entire sets, or if you are interested
in enumerating elements rather than testing for a single element,
ordered lists (use compare/3, {@<,@>=,@>,@=<,==,\==}/2) are hard
to beat. They do need the elements to be sufficiently instantiated
for you to tell them apart (basically, when making a set from a
list or other structure, you need (X = Y) => (X == Y)). But so
do hash tables. There are at least two ways of calculating hash
functions. You can use the print form (numbering variables from
left to right in the style of numbervars) or you can use the
internal addresses of variables, atoms, and functors, and the
values of integers. The former is slow, the latter means that
the garbage collector not only has to be prepared to rehash all
your sets, it has to FIND all your sets. (PopLog has hash
tables, but they are pop11 data structures, not Prolog data
structures [Prolog can't do anything to them except call Pop
functions], and the garbage collector CAN find the tables, and
DOES rehash them. Even so, ordered lists in C-Prolog are
competitive with hash-tables in PopLog.) It must be admitted
that DEC-10 Prolog and C-Prolog do look at the source form of
atoms to compare them, however the cost of atom comparison can
easily be reduced to the cost of integer comparison in a new
Prolog implementation at the cost of 3 pointers and 1 integer
extra per atom, and at the cost of making the creation of a
new atom O(lg|atoms|) instead of O(1).
4. It follows that "If you are honestly trying to make Prolog a
useful language", then by all means provide a "hash_on_name
(+Term,?Hash)" primitive so that people can build hash tables
(using vectors to hold the buckets, buckets being represented
perhaps as lists with unbound tails), but instead of forcing
the use of hash tables in every place, make them *JUST ONE* of
*SEVERAL* implementations of the same abstract data type.
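The hash table suggested in point 4 might look something like the sketch
below: a vector of buckets built with functor/3, each bucket a list with
an unbound tail. No hash_on_name/2 primitive is assumed to exist, so a
stand-in that only handles atoms (summing their character codes) is
defined here; that stand-in and every other predicate name are
assumptions for illustration.

```prolog
% hash_on_name(+Atom, -Hash): a stand-in for the proposed primitive.
% This version is an assumption: it handles atoms only and simply
% sums their character codes.
hash_on_name(Atom, Hash) :-
    atom_codes(Atom, Codes),
    sum_codes(Codes, 0, Hash).

sum_codes([], Sum, Sum).
sum_codes([C|Cs], Sum0, Sum) :-
    Sum1 is Sum0 + C,
    sum_codes(Cs, Sum1, Sum).

% new_table(+Size, -Table): a vector of Size buckets, all unbound.
new_table(Size, Table) :-
    functor(Table, table, Size).

% bucket(+Key, +Table, -Bucket): pick the bucket for Key.
bucket(Key, Table, Bucket) :-
    functor(Table, _, Size),
    hash_on_name(Key, Hash),
    Index is Hash mod Size + 1,
    arg(Index, Table, Bucket).

% table_insert(+Key, +Table): add Key by binding the unbound tail
% of its bucket. Nothing is copied, and backtracking undoes it.
table_insert(Key, Table) :-
    bucket(Key, Table, Bucket),
    open_insert(Bucket, Key).

open_insert(Bucket, Key) :-
    var(Bucket), !,
    Bucket = [Key|_].
open_insert([Key0|_], Key) :-
    Key0 == Key, !.
open_insert([_|Rest], Key) :-
    open_insert(Rest, Key).

% table_member(+Key, +Table): true if Key is already in its bucket.
% The nonvar/1 test stops the search at the unbound tail.
table_member(Key, Table) :-
    bucket(Key, Table, Bucket),
    open_member(Bucket, Key).

open_member(Bucket, Key) :-
    nonvar(Bucket),
    Bucket = [Key0|Rest],
    (   Key0 == Key
    ->  true
    ;   open_member(Rest, Key)
    ).
```

Note that a table built this way can only grow, and it has a single
version: there is no way to keep the table as it was before an insert.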
5. A great deal depends on the needs of the particular programmer
for the program he is actually writing. Most of the time when
I call setof/3, I either want to perform some operation on the
whole set as it stands, in which case I exploit the fact that
in DEC-10 Prolog, C-Prolog, Prolog-X, and NIP the answer is an
ORDERED list, and ORDSET.PL provides efficient operations
on ordered lists, or else I want to map down the entire set
transforming the elements in some way. It is precisely these
tasks to which hash tables are least well suited. I cannot think
of any occasion on which I have calculated a setof and followed
it by a membership test (as opposed to an enumeration using
member). This is NOT, and is not intended as, an argument
against having hash tables around, nor indeed against having
hashed_setof/3 around. What it IS an argument FOR is for having
setof/3 in its present form (see SETOF.PL) and the
ordered set library [as well].
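A tiny sketch of the two uses of setof/3 mentioned in point 5, relying on
the result being an ordered list without duplicates. The colour/1 facts
and the predicates all_colours/1 and colour_lengths/1 are invented for
this example.

```prolog
% Made-up facts; note the deliberate duplicate.
colour(red).
colour(green).
colour(blue).
colour(green).

% all_colours(-Set): setof/3 returns the solutions in the standard
% order of terms with duplicates removed, so Set can be handed
% directly to ordered-list set operations.
all_colours(Set) :-
    setof(C, colour(C), Set).

% colour_lengths(-Pairs): map down the entire set, transforming
% each element (here, pairing it with its length).
colour_lengths(Pairs) :-
    all_colours(Set),
    with_lengths(Set, Pairs).

with_lengths([], []).
with_lengths([C|Cs], [C-N|Ps]) :-
    atom_length(C, N),
    with_lengths(Cs, Ps).
```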
6. If you want to do a lot of membership tests, using binary trees
is nearly as efficient as using hash tables. For the next
couple of years on VAXen and M68000s (or on the DEC-10 with its
tiny address space), the difference between O(lgN) and O(1) is
likely to be swallowed up by the overheads of hashing. In the
longer term, the fact that (provided you accept comparison as a
primitive operation, and if you don't like comparing variables,
you can imitate MU-Prolog and delay until the terms are
sufficiently instantiated) binary trees can be accessed and
UPDATED in pure Prolog is likely to make binary trees the
method of choice in massively parallel systems. This fact
already provides binary trees with an important advantage
over hash tables when doing elementwise insertion and deletion.
See next paragraph.
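For concreteness, a minimal sketch of a set kept as a binary tree and
updated in pure Prolog. The tree shape (t for empty, t(Left,Element,Right)
otherwise) and the predicate names are assumptions for this example, not
the contents of any library file, and no rebalancing is attempted, so the
O(lg N) bound only holds while the tree happens to stay balanced.

```prolog
% A tree is the atom t (empty) or t(Left, Element, Right).
% Names and representation are illustrative assumptions.

% tree_member(+X, +Tree): follow one path from the root.
tree_member(X, t(L, Y, R)) :-
    compare(Order, X, Y),
    tree_member(Order, X, L, R).

tree_member(=, _, _, _).
tree_member(<, X, L, _) :- tree_member(X, L).
tree_member(>, X, _, R) :- tree_member(X, R).

% tree_insert(+Tree0, +X, -Tree): Tree shares every subtree of
% Tree0 that lies off the search path. Tree0 itself is untouched,
% so both versions remain usable afterwards.
tree_insert(t, X, t(t, X, t)).
tree_insert(t(L, Y, R), X, Tree) :-
    compare(Order, X, Y),
    tree_insert(Order, X, L, Y, R, Tree).

tree_insert(=, _, L, Y, R, t(L, Y, R)).
tree_insert(<, X, L, Y, R, t(L1, Y, R)) :- tree_insert(L, X, L1).
tree_insert(>, X, L, Y, R, t(L, Y, R1)) :- tree_insert(R, X, R1).
```

Because the old tree survives each insertion, two different extensions
of the same tree can coexist:

    ?- tree_insert(t, m, T0),
       tree_insert(T0, x, Tx), tree_insert(T0, y, Ty),
       tree_member(x, Tx), \+ tree_member(x, Ty).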
7. The scheme Cohen describes for maintaining multiple versions
of a hash table has some problems. He says "Each set has sort
of internal clock which is modified whenever we create new
pointer to this set." I presume this doesn't actually mean
that after doing "X = Set" we have to increment the counter,
because he says he has implemented his ideas in C-Prolog, and
having an extensive knowledge of the internals of C-Prolog (the
speed of old versions was 800 LIPS on a VAX 750. The speed of
version 1.4b.edai is 1100 LIPS, and we are now up to version
1.4c.edai.) I am quite sure that there is no way of detecting
when we "create a new pointer to" anything at all without
substantial changes to main.c and unify.c and a MASSIVE overhead.
So I surmise that he means that the clock is modified whenever a
new VERSION of a set is created, that is that the "insert" and
"delete" operations update the clock, and nothing else. Now we
have the question: what values does the clock take? If the
answer is "numbers", then the scheme has a serious deficiency.
    % element(Element, SetWithIt, SetWithoutIt)
    % is the assumed primitive, insert/delete depending on the
    % instantiation state of the variables, just 'insert' here.
    ... :-
        ...
        element(x, Set1, Set0),
        element(y, Set2, Set1),
        element(z, Set3, Set2),
        ...
works just fine, and operations on Set0..Set2 which just LOOK at
their elements work just fine as well. BUT
    ... :-
        ...
        element(x, Setx, Set0),
        element(y, Sety, Set0),
        element(w, Setw, Setx),
        element(v, Setv, Sety),
        ...
will almost certainly NOT work, because Setx and Sety will get
the SAME clock value, and Setw and Setv will also get the same
(but different from Setx and Sety) clock value. There is a cure
for this, which involves using structured clocks. I won't go
into the details, because it turns out that by the time you have
something that resets itself when you backtrack and permits
multiple versions AND lets the version space branch you would be
MUCH better off with binary trees.
8. The updatable array implementation in ARRAYS.PL does not
provide multiple versions. About a week before I sent it to
SU-SCORE I had this obvious idea about timestamping changes to
the array and started coding it. Then the deficiency struck me,
and I spent a couple of days trying to figure out an efficient
way of sharing as much structure as possible with a branching
version space. That's when I came up with TREES.PL.
Actual timing tests showed that the (compiled Prolog) binary
tree implementation of arrays was always much faster than the
(compiled Prolog) updatable vector implementation, and it was
purer Prolog and could be generalised to non-numeric indices.
Updatable hash tables which support branching version spaces
can easily be built from TREES.PL given the
hash_on_name/2 operation (which is not available in any Prolog
available to me except Poplog, where I wrote the Prolog-term
hash function -- the Pop hash function hashes on addresses).
If Shimon Cohen or Ken Forbus has actually invented a technique
for handling a branching version space in updatable structures
at LESS cost than O(lgN) space and O(lgN) time, I shall propose
his or their election to the pantheon and ask for a signed
photograph.
9. I would very much like to see Shimon Cohen's code. Thank you,
sir, for offering to mail it. If this means physically posting
it to me, fine, but it might be better if you posted it to SCORE
where anyone who is interested can FTP it without bothering you.
(Actually, I have an ulterior motive. I want your stuff to sink
into the total oblivion that ARRAYS.PL and
ORDSET.PL seem to have sunk into. (:-}) I would also like to
see Ken Forbus's code. As implementing constant-time updatable
arrays in Prolog is basically trivial (have a look at
ARRAYS.PL if you don't believe me) I would expect that the
ZetaLisp code would be similarly trivial. If anyone is really
interested in having multiple versions of arrays and can put
up with a purely sequential version space, I could send my
code for that to SU-SCORE as well.
[ Shimon's code is available from {SU-SCORE} as:
SCORE:Cohen_Sets.Txt ! note for using the PL code
Cohen_Sets.PL ! the PL code - ed ]
10. It just occurred to me that some people who haven't thought about
data structures in Prolog or read the manual or the Clocksin &
Mellish book with any degree of attention may wonder what I mean
by "vectors". I refer to the use of the functor/3 and arg/3
primitives which support 1-dimensional origin-1 vectors very
efficiently in any usable Prolog. I didn't call them "arrays"
because once you have bound an argument of a term you can't
change it.
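A short sketch of such a vector used as a subset of 1..N, in the style of
representation (d) above. Only functor/3, arg/3, var/1 and nonvar/1 are
assumed; the predicate names and the marker atom 'in' are invented for
the example.

```prolog
% new_vset(+N, -Set): an empty subset of 1..N. functor/3 builds a
% term with N distinct unbound arguments.
new_vset(N, Set) :-
    functor(Set, set, N).

% vset_add(+I, +Set): put I in the set by binding argument I.
% Once bound, the argument cannot be changed, only unbound again
% by backtracking. (For graph colouring, bind it to the colour
% instead of the marker 'in'.)
vset_add(I, Set) :-
    arg(I, Set, in).

% vset_member(+I, +Set): I is present if its argument is bound.
vset_member(I, Set) :-
    arg(I, Set, A),
    nonvar(A).

% vset_absent(+I, +Set): I is absent if its argument is unbound.
vset_absent(I, Set) :-
    arg(I, Set, A),
    var(A).
```

Access by arg/3 takes constant time, but elements can only ever be
added, which is why these are vectors rather than arrays.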
11. PS: there are versions of many of the {SU-SCORE}*.PL
files for C-Prolog. If enough people are interested I could
post them to net.sources now that our VAX is on the air again.
It is not possible for me to post them to ARPANet.
-- Richard A. O'Keefe
(Te Tohunga Prolog)