

Runtime Environment: DAOS

To become popular, parallel logic programming (PLP) systems should serve the broadest possible range of applications. To achieve this goal, PLP systems should support both and-parallelism (ANDP) and or-parallelism (ORP) for existing Prolog applications. In practice, exploiting even a single form of implicit parallelism requires sophisticated system design, and exploiting distinct forms of parallelism at the same time is harder still.

As mentioned before, shared-memory PLP systems have arguably been the most successful and widespread so far.

Several distributed-memory PLP systems have been proposed. In some models, computation nodes receive messages with goals to solve, and send back answers to these goals. Unfortunately, sending goals and receiving solutions is expensive and may become a bottleneck for the system. Another alternative is to copy segments of the execution stacks between workers, as exemplified by Opera [19,77]. Because stacks can grow rather large, incremental copying is used to reduce message size. Although these systems perform well for ORP, they have difficulties with IAP, as IAP execution cannot be represented by a simple stack [91].

The differences between shared-memory and message-based machines have become less significant in the last few years due to distributed shared memory (DSM) systems, which provide a shared-memory abstraction. Work on PLP systems for hardware DSMs has given interesting results. Silva's Dorpp or-parallel system [94] has shown that good performance is possible. More recently, Santos Costa et al. [85] performed an extensive analysis of the Andorra-I system on a DASH-like simulated machine. Again, their numbers confirm that most read cache misses result from scheduling and from accessing the code area. Misses to the execution stacks varied from 8% on an ORP application to 30% on an ANDP application. In fact, most of these misses were eviction or cold misses (the authors used a 128KB cache). Only the ORP application showed significant sharing misses for an execution stack, namely the choice-point stack (60% of the misses for this stack), and these misses are associated with the fact that this stack is also used for scheduling. We argue that this analysis suggests a new approach to distributed PLP systems:

The DAOS model (Distributing And/Or in Scalable machines) is the first parallel logic programming model for distributed systems that supports these innovations. The model's main contributions are:

ORP and IAP are arguably two of the most interesting forms of parallelism available in Prolog programs. Several methods have been proposed to exploit And/Or parallelism in PLP systems. We propose to use recomputation, following the C-Tree concept [57]. To understand the motivation for recomputation, consider the following and-parallel query:

?- a(X) & b(Y).
where & represents a parallel conjunction. Figure 1 shows that both goals have several solutions. After running the goals a(X) and b(Y) in parallel, one needs to combine the solutions. One solution is to use a special node that maintains pointers to each solution for each goal. Solutions to the conjunction are then obtained by calculating the cross-product between the values for X and the values for Y. This approach, shown in Figure 1(a), is known as reuse, because the solutions to each goal are reused several times while calculating the cross-product.

Recomputation-based models are based on having an Or search tree, such that and-parallel goals correspond to building parts of the search tree in advance. Thus, when one starts a(X) and b(Y) in parallel, the solution to b(Y) is a continuation branch for a(X), and is thus associated with a specific solution for a(X), as shown in Figure 1(b). These models are called recomputation-based because, to exploit several alternatives for a(X), the solution for b(Y) has to be recomputed for each alternative. Recomputation has important advantages, namely avoiding the overheads of implementing a cross-product node, and simplifying the support of Full Prolog, as the semantics are closer to sequential execution. The price is giving up the effort that reuse saves by not recomputing goals to the right. It has been argued that, for Prolog applications, reuse does not reduce the search space much over recomputing solutions, as Prolog programmers avoid such situations [57].
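To make the contrast concrete, the following C sketch caricatures the two strategies for the query above. The functions solve_a and solve_b and their integer solutions are hypothetical stand-ins for real Prolog goal execution, not part of any actual system:

    #include <stdio.h>

    /* Hypothetical stand-ins: each "goal" enumerates a fixed set of
     * solutions.  In a real system these would be Prolog computations. */
    static int solve_a(int xs[]) { xs[0] = 1; xs[1] = 2; return 2; }
    static int solve_b(int ys[]) { ys[0] = 7; ys[1] = 8; return 2; }

    int main(void) {
        int xs[8], ys[8];

        /* Reuse (Figure 1(a)): run each goal once, then pair the stored
         * solutions through a cross-product node. */
        int na = solve_a(xs), nb = solve_b(ys);
        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++)
                printf("reuse: X=%d Y=%d\n", xs[i], ys[j]);

        /* Recomputation (Figure 1(b)): for every solution of a(X),
         * re-run b(Y) as a continuation of that branch. */
        na = solve_a(xs);
        for (int i = 0; i < na; i++) {
            nb = solve_b(ys);           /* b(Y) recomputed per branch */
            for (int j = 0; j < nb; j++)
                printf("recompute: X=%d Y=%d\n", xs[i], ys[j]);
        }
        return 0;
    }

Both halves enumerate the same four answers; recomputation pays for re-running b(Y) on each branch, but needs no cross-product node.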

Figure 1: (a) Reuse (b) Recomputation (c) C-Tree

The C-tree is shown in Figure 1(c). In the C-tree [57], whenever a worker picks an alternative from another process, it creates a copy of all parallel conjunctions found along the way and restarts the work to the right. Traditionally, the C-tree has been associated with the notion of a team, i.e., a group of workers. Normally, IAP work is explored inside a team and ORP work is executed between teams. DAOS leaves the system designer free to decide whether or not to use teams. Teams have the advantage of simplifying scheduling, allowing the (re)use of IAP schedulers within teams and of ORP schedulers between teams. This organisation also simplifies solution propagation in IAP, by making it possible to use multicast inside a group.

DAOS aims at two goals: improved efficiency over traditional distributed systems, and good support for And/Or parallelism under Prolog semantics. A fundamental point in DAOS is to establish which data areas should be private to a worker, and which should be virtually shared. There are two opposite approaches: (a) all stacks are private, as in a distributed PLP system, or (b) all stacks are shared through a DSM layer. The latter option seems interesting because one could reuse an existing shared-memory implementation. However, it may be inefficient due to the scheduling data structures. DAOS presents an intermediate solution: the areas that implement the major data structures are logically shared, while the areas used to implement work management are kept private.

The key problem we address in DAOS is how to implement the virtually shared address space. This shared space contains the major data structures used in Prolog, such as all Prolog variables and compound terms. Because DAOS is designed to support or-parallelism, having a shared space suggests using a binding array (BA) based approach, such as the one used in Aurora [70] and Andorra-I [87]. The original BA was designed for ORP and keeps a private slot for every variable in the current search tree. Workers store their conditional bindings in these slots, instead of writing to the shared space. Because all bindings to the shared stacks are indeed conditional, BA-based models have the important property that workers only ever write to their own address space. Accesses to other memory areas are read-only until that memory is reclaimed. This gives a nice single-writer, multiple-reader pattern for the shared memory.
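As an illustration of this single-writer property, here is a minimal C sketch of BA-style dereferencing; the tags, the cell layout, and the binding_array name are our own illustrative assumptions, not the actual Aurora or Andorra-I representation:

    /* A minimal sketch of binding-array dereferencing. */
    typedef enum { UNBOUND, VALUE, REF, CONDITIONAL } tag_t;

    typedef struct cell {
        tag_t tag;
        union {
            struct cell *ref;  /* unconditional binding, in shared space */
            long ba_index;     /* this variable's private BA slot        */
            long value;        /* an atomic value                        */
        } v;
    } cell_t;

    /* Per-worker: one private slot per variable in the current search
     * tree.  Conditional bindings go here, never to the shared stacks,
     * so each worker writes only to its own memory. */
    extern cell_t *binding_array;

    cell_t *deref(cell_t *c) {
        for (;;) {
            switch (c->tag) {
            case REF:         c = c->v.ref; break;   /* shared space   */
            case CONDITIONAL: c = &binding_array[c->v.ba_index]; break;
            default:          return c;              /* UNBOUND/VALUE  */
            }
        }
    }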

Unfortunately, the original BA design does not extend to IAP, because one does not know beforehand how many cells each and-goal will need, and managing slots between workers running in and-parallel becomes highly complex [56]. In DAOS, we propose to use the Sparse Binding Array (SBA) [81] to manage bindings to shared variables. The SBA addresses the memory management problems of traditional BAs by shadowing the shared stacks in full, that is, every cell in a shared stack has a private ``shadow'' in a worker's or team's SBA. Although the SBA was designed to address the problems of combining AND and OR parallelism, it has shown quite good performance even for purely ORP applications [81].
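The shadowing idea reduces to a constant-offset address translation. A minimal C sketch, assuming a single contiguous shared region; the names are illustrative:

    extern char *shared_base;  /* start of the virtually shared stacks   */
    extern char *sba_base;     /* start of this worker's (or team's) SBA */

    /* Map a cell in the shared stacks to its private shadow. */
    static inline void *sba_shadow(void *shared_addr) {
        return sba_base + ((char *)shared_addr - shared_base);
    }

Because the SBA mirrors the shared stacks one-to-one, no per-goal slot allocation is needed, which is precisely what breaks the original BA under IAP.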

The initial SBA design depends on a shared-memory environment: the design organises workers into teams. All workers in a team share the same choice-points, and can thus share the same SBA. Unfortunately, sharing the SBA eliminates the main advantage of the design, as one can expect the SBA to be write-intensive and to have worse locality than the stacks. In DAOS, we propose a different solution:

The Heap and Environment Stack support forward execution. The Control Stack is used both for backtracking and for parallelism. The Trail supports backtracking. The Sparse Binding Array (SBA) provides access to private bindings. Last, the Goal Stack holds the goals to be exploited in IAP.
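The resulting split can be summarised in a sketch like the following; the struct and its field names are ours, and the shared/private annotations follow the discussion in this section (the trail's placement is left open below):

    /* Illustrative summary of the DAOS memory areas. */
    typedef struct worker_areas {
        /* Virtually shared, under DSM: the major Prolog data
         * structures, read-only to others once made public. */
        void *heap;           /* compound terms; forward execution      */
        void *env_stack;      /* environments; forward execution        */

        /* Private to the worker (or explicitly distributed): the
         * areas that implement work management. */
        void *control_stack;  /* choice-points and parcall frames       */
        void *goal_stack;     /* goals available for IAP                */
        void *sba;            /* private shadows of conditional bindings */
        void *trail;          /* conditional bindings; DSM vs. explicit
                                 distribution is discussed below        */
    } worker_areas_t;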

The two largest stacks in Prolog execution store terms and logical variables. Both hold data structures that are read-only after sharing, and should thus be virtually shared:

False sharing may happen in two situations (see Silva [94] for the initial discussion of these problems in an ORP context). First, workers can make previous work public and then continue execution on the same stack. In this case, their next updates to the stack might be sent to the sharing workers. Fortunately, such sharing can easily be addressed by standard techniques, such as using a lazy release consistency memory model [48]. Relaxed consistency ensures that new updates from the worker owning the stack segment will only be sent at synchronisation points.

A second source of false sharing is backtracking. In general, when all alternatives have been exploited, both global and environment stack space may be reclaimed and later reused to store new work. Updates to this space may then be sent to the workers that had originally shared the work, unless one uses an invalidate-based protocol.

The two previous stacks take care of Prolog's forward execution. They do not, however, address how to implement backtracking, ORP, and IAP. The Control Stack stores choice-points, that is, open alternatives. Choice-points include pointers to the stacks and to the goal's arguments as they were before the choice-point was created, plus pointers to the available alternatives. Some ORP systems also store scheduling information in choice-points, such as how many workers are exploiting alternatives from that choice-point. IAP systems extend the control stack to also include parcall frames, which describe the conjunctions to be executed in parallel.
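For illustration, a hedged C sketch of such control-stack entries; real WAM-derived systems differ in layout, and all names here are ours:

    typedef struct choice_point {
        struct choice_point *prev;  /* enclosing choice-point            */
        void *heap_top;             /* stack pointers saved at creation, */
        void *env_top;              /* restored on backtracking          */
        void *trail_top;
        void *args[4];              /* goal arguments at creation (fixed
                                       at 4 here only for illustration)  */
        void *next_alternative;     /* next unexplored clause            */
        int   n_workers;            /* ORP scheduling: workers exploiting
                                       alternatives from this node       */
    } choice_point_t;

    /* IAP extension: a parcall frame describes one parallel
     * conjunction on the same stack. */
    typedef struct parcall_frame {
        int   n_goals;              /* goals in the conjunction          */
        int   n_pending;            /* goals still running               */
        void *goal_status;          /* per-goal status/solution records  */
    } parcall_frame_t;

Note that fields such as n_workers are pure scheduling state; as discussed next, they are what turns this stack into a source of sharing misses.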

Originally, the control stack was used only to store alternatives. ORP systems have since also used this stack to manage the available work in the system. For instance, in the chat-80 ORP-only benchmark running under Aurora, about a third of the sharing misses originated from the control stack [80] (the rest originated from the scheduler data structures). Although a similar study is not available for IAP systems, we expect parcall frames to also be an important source of sharing misses.

The Trail Stack stores all conditional bindings. In a sequential execution, the trail is required to undo bindings to variables on backtracking. In a BA-based system, the trail is a fundamental data structure, as it is the one used to synchronise bindings between different branches. Since conditional bindings may not be placed directly on the stacks, the alternative is to store these bindings in the Trail. When workers fetch work, they read the bindings from the Trail and store them in the SBA. Thus, the BA or SBA is optimised for access to a variable's binding, while the trail is optimised for installing and deinstalling bindings as workers move around the search tree.
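A minimal sketch of this installation step, reusing the shadow translation from the SBA sketch above; the trail-entry layout and the zero-means-unbound convention are illustrative assumptions:

    typedef struct trail_entry {
        long *addr;   /* shared-stack cell that was conditionally bound */
        long  value;  /* the binding itself                             */
    } trail_entry_t;

    extern char *shared_base, *sba_base;   /* as in the SBA sketch */

    static long *shadow(long *shared_addr) {
        return (long *)(sba_base + ((char *)shared_addr - shared_base));
    }

    /* Install the conditional bindings recorded in a trail section
     * into this worker's SBA: done when work is shared. */
    void install_bindings(trail_entry_t *from, trail_entry_t *to) {
        for (trail_entry_t *t = from; t < to; t++)
            *shadow(t->addr) = t->value;
    }

    /* Backtracking past the same section deinstalls the bindings. */
    void deinstall_bindings(trail_entry_t *from, trail_entry_t *to) {
        for (trail_entry_t *t = from; t < to; t++)
            *shadow(t->addr) = 0;     /* 0 = unbound, by assumption */
    }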

In DAOS, the first operation to be performed when sharing work is to install all conditional bindings in the SBA, which requires immediate access to the corresponding trail section. This opens the question of whether to keep the trail under DSM, or whether to use explicit distribution, as for the Control Stack:

A final decision will depend on several factors, such as the efficiency of the DSM system and of the message-passing implementation. In IDAOS, the IAP part of DAOS, we have decided to initially follow a fully distributed implementation, because trail copying can be naturally integrated with control-stack copying.



