Concepts

XCentric is a logic programming language specialized for XML processing, with a syntax similar to Prolog. Its central features are:

  • new unification and pattern matching algorithms that allow easy querying and processing of semi-structured data;
  • a type system based on regular types;
  • builtins to translate HTML/XML data from local or remote files to an internal tree representation.

Unification and Pattern Matching

This approach is best explained with a simple example. Consider the following XML document representing an address book:

<?xml version="1.0" encoding="UTF-8"?>
<addressbook>
    <record number="1">
        <name>John</name>
        <address>New York</address>
        <email>john.ny@mailserver.com</email>
    </record>
    …. 
</addressbook>

The resulting term (internal representation) is:

addressbook([attribute(number,1)],record(name('John'),address('New York'),email('john.ny@mailserver.com'),…))

With this internal representation, one can use unification over a domain of trees with an arbitrary number of leaf nodes to retrieve data. For example, to get the name of every person living in New York, we can simply write:

http2pro(AddressBookURL,Term),
Term =*= addressbook(_,record(name(N),address('New York'),_),_).

First, the address book is loaded from the web and translated to its internal representation. Then the special kind of unification provided by the operator =*= queries the document, retrieving the name (variable N) of every record whose address is 'New York'. All records and other elements inside addressbook that do not match this pattern are simply ignored (variables '_'). Note that the first result is the first person found who lives in New York; the remaining results are retrieved through Prolog backtracking.
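To make the backtracking behavior concrete, the following Python sketch models what the query above computes. It is not XCentric code and says nothing about how =*= is implemented: the tuple encoding of terms, the names is_element and names_in, and every sample record other than John's are assumptions made for illustration.

```python
from typing import Iterator

# Assumed encoding: an element is (tag, [children]); text is a plain string.
# Only John's record comes from the address book above; the others are made up.
ADDRESS_BOOK = ("addressbook", [
    ("record", [("name", ["John"]), ("address", ["New York"]),
                ("email", ["john.ny@mailserver.com"])]),
    ("record", [("name", ["Mary"]), ("address", ["Boston"]),
                ("email", ["mary@mailserver.com"])]),
    ("record", [("name", ["Ann"]), ("address", ["New York"]),
                ("email", ["ann@mailserver.com"])]),
])


def is_element(node, tag: str) -> bool:
    """True if node is an element (not text) with the given tag."""
    return isinstance(node, tuple) and node[0] == tag


def names_in(book, city: str) -> Iterator[str]:
    """Yield N for each record matching record(name(N), address(city), _)."""
    if not is_element(book, "addressbook"):
        return
    for child in book[1]:
        # Children that are not matching records are skipped, which plays
        # the role of the two '_' around record(...) in the pattern.
        if not is_element(child, "record"):
            continue
        fields = child[1]
        if (len(fields) >= 2
                and is_element(fields[0], "name")
                and len(fields[0][1]) == 1
                and fields[1] == ("address", [city])):
            # Each yielded value corresponds to one solution.
            yield fields[0][1][0]


results = names_in(ADDRESS_BOOK, "New York")
first = next(results)   # the first solution
rest = list(results)    # the remaining solutions, as if found on backtracking
```

The generator is used here because it produces answers one at a time on demand, which is the closest everyday Python analogue to asking Prolog for further solutions.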

Pattern Matching

In a similar way, the user can use pattern matching (provided by the operator =~=). In this case, one of the arguments must be fully instantiated. This approach is more efficient, but it does not allow the dynamic creation of a new term in which some variables are not instantiated.

We also provide three predicates that use pattern matching: deep/2 finds a sequence of elements, deepp/3 finds the nth occurrence of a particular sequence of elements, and deepc/3 counts the number of occurrences of a sequence of elements.

For example, suppose we want to find the sequence of elements between two elements named incision in an XML file that has been translated to a term and stored in variable O. We can write:

deep(<incision(_),Critical,incision(_)>,O).

The sequence we want is stored in variable Critical. To find the text of the third occurrence of element author in document Bib, we can simply write:

deepp(author(_,T),Bib,3).

Variable T will store the value we want. To count the number of occurrences of author elements in document Book, we can simply write:

deepc(author(_),Book,C).

Variable C will store the number of occurrences of author elements.
Note that each of these predicates accepts both single elements and sequences of elements.
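The idea behind deepp/3 and deepc/3 can be modeled in a few lines of Python. This is a simplified sketch, not XCentric code: it handles single elements only (not sequences), it assumes occurrences are numbered in document order, and the tuple encoding, the function names, and the sample document are all hypothetical.

```python
from itertools import islice

# Assumed encoding: an element is (tag, [children]); text is a plain string.
# The sample bibliography is made up for illustration.
BIB = ("bib", [
    ("book", [("name", ["Book A"]), ("author", ["A1"]), ("author", ["A2"])]),
    ("book", [("name", ["Book B"]), ("author", ["B1"])]),
])


def occurrences(term, tag):
    """Yield every element named `tag` at any depth, in document order."""
    functor, children = term
    if functor == tag:
        yield term
    for child in children:
        if not isinstance(child, str):  # text nodes have no children
            yield from occurrences(child, tag)


def nth_occurrence(term, tag, n):
    """Return the n-th (1-based) occurrence of `tag`, or None if absent."""
    return next(islice(occurrences(term, tag), n - 1, None), None)


def count_occurrences(term, tag):
    """Count the elements named `tag` anywhere in the term."""
    return sum(1 for _ in occurrences(term, tag))


third_author = nth_occurrence(BIB, "author", 3)
author_count = count_occurrences(BIB, "author")
```

Both helpers are built on the same traversal, so finding the nth occurrence and counting occurrences differ only in how the stream of matches is consumed.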

Type system

The programmer can declare types and use them throughout the program to ensure that the data being manipulated has the correct form. This feature is optional; programs need not use types.

Type Declarations

The programmer can add type declarations to a program to impose constraints on the data values it processes. Type declarations in Typed XCentric are Regular Expression Types, built with the operators *, +, ?, | and the comma. The following table describes them:

Type    Meaning
a*      sequence of zero or more a's
a+      sequence of one or more a's
a?      zero or one a
a|b     a or b
a,b     a followed by b

In programs, the | operator is written as ';'. For example, a|b is written a;b.

Types are associated with sequence variables by means of the operator ::. For example X::t means that sequence variable X only unifies with values of type t.

Example 1. A sequence of two or more authors:

:- type ta ---> (author([],string),author([],string),author([],string)*).
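One way to read a Regular Expression Type is as an ordinary regular expression over the sequence of element names. The Python sketch below models type ta from Example 1 in that way, using the standard re module. It checks element names only and ignores element content (the string part). The space-separated encoding, the helper names, and the second pattern with its editor and name elements are assumptions for illustration, not part of XCentric.

```python
import re


def encode(tags):
    """Encode a sequence of element names as one string, e.g. 'author author '."""
    return "".join(tag + " " for tag in tags)


# ta ---> (author, author, author*): two authors followed by zero or more.
TA = re.compile(r"author author (?:author )*")

# A hypothetical type using | and ?: (author | editor), name?
AUTHOR_OR_EDITOR = re.compile(r"(?:author |editor )(?:name )?")


def has_type(tags, pattern):
    """True if the whole sequence of element names matches the pattern."""
    return pattern.fullmatch(encode(tags)) is not None


two_authors_ok = has_type(["author", "author"], TA)
one_author_ok = has_type(["author"], TA)
```

Using fullmatch rather than search matters here: a sequence has a type only if the entire sequence matches, not just some part of it.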

Example 2.
Given an XML file with the names and authors of books, get the names of all books with two or more authors (type ta is the one declared in Example 1):

process(N):-
    xml2pro('bibs.xml',Bib),
    Bib =*= bib([],_,book([],name([],N),X::ta),_).

Here we search for books that have, after the name element, a sequence of type ta (which unifies with X). The result is:

N = Practical Cryptography

Comparing XCentric with the usual list processing approach

XCentric simplifies XML processing by avoiding traditional list processing. Suppose we have a document containing books and need the names of all books with two or more authors. We can use the following XCentric program:

:- type type_a ---> (author(string),author(string),author(string)*).
bib(_,book(X::type_a,name(N)),_) =*= BibDoc.

Doing the same thing using only SWI-Prolog (which has a quite good library for processing XML) requires:

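% The document is a single root element; process its children.
pbib([element(_,_,L)]):-
      pbib2(L).

% Walk the list of books, printing the name of each one that qualifies.
pbib2([]).
pbib2([element('book',_,Cont)|Books]):-
      authors(Cont),!,pbib2(Books).
pbib2([_|Books]):-
      pbib2(Books).

% A book qualifies if its content starts with at least two author elements.
authors([element('author',_,_),element('author',_,_)|R]):-
      write_name(R).

% Skip elements until only the closing name element remains, then print it.
write_name([element('name',_,[N])]):-
      write(N),nl.
write_name([_|R]):-
      write_name(R).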


Basic XML Schema Support

XCentric also provides basic XML Schema support:

  • Basic types: string, integer, float and boolean.
  • Occurrences of sequences:
    The programmer can declare the minimum and maximum number of occurrences of a sequence. Consider for example the following type:

    type oc ---> author([],name){2,unbounded}.

    Type oc represents every sequence of two or more authors.

  • Orderless sequences:
    The programmer can declare a sequence of elements that can appear in any order. Consider the following type:

    type mix ---> record([],{name([],string) & address([],string) & email([],string)?}).

    This type represents a record whose name, address and optional email elements may appear in any order.
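The two constraints illustrated by types oc and mix can be modeled with simple checks over a sequence of element names. The Python sketch below is only an illustration of what the constraints mean, under the assumption that each element of an orderless group appears at most once. The function names are hypothetical, and element content is not checked.

```python
from collections import Counter


def within_bounds(tags, tag, minimum, maximum=None):
    """True if tags holds only `tag`, repeated between minimum and maximum
    times. A maximum of None stands for 'unbounded'."""
    if any(t != tag for t in tags):
        return False
    return len(tags) >= minimum and (maximum is None or len(tags) <= maximum)


def is_orderless(tags, required, optional=()):
    """True if every required name appears exactly once, every optional name
    at most once, and nothing else appears, in any order."""
    counts = Counter(tags)
    allowed = set(required) | set(optional)
    return (set(counts) <= allowed
            and all(counts[t] == 1 for t in required)
            and all(counts[t] <= 1 for t in optional))


# oc: author{2,unbounded}
oc_ok = within_bounds(["author", "author", "author"], "author", 2)

# mix: name & address & email?
mix_ok = is_orderless(["email", "name", "address"],
                      required=("name", "address"), optional=("email",))
```

Counting names rather than comparing positions is what makes the second check independent of order.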

Builtins for XML handling

There are four ways to load HTML/XML data:

  • Directly from the web using builtin http2pro(URL,Term).
  • Directly from the web with validation using builtin http2pro(URL,DTDFile,Term).
  • From a local file using builtin xml2pro(XMLFile,Term).
  • From a local file, ensuring the XML is valid against a given DTD, using builtin xml2pro(XMLFile,DTDFile,Term).
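As a rough picture of what translating XML into a tree-shaped term involves, the following Python sketch parses an XML string with the standard xml.etree.ElementTree module and converts it to nested tuples. It is not how the XCentric builtins work: the (tag, attributes, children) encoding and the name to_term are assumptions for illustration, the input is an inline string rather than a file or URL, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# A one-record version of the address book used earlier in this document.
SAMPLE = """<addressbook>
    <record number="1">
        <name>John</name>
        <address>New York</address>
        <email>john.ny@mailserver.com</email>
    </record>
</addressbook>"""


def to_term(element):
    """Convert an Element into (tag, attributes, children).

    Children are nested terms or text strings; whitespace-only text
    (indentation between tags) is dropped.
    """
    children = []
    if element.text and element.text.strip():
        children.append(element.text.strip())
    for child in element:
        children.append(to_term(child))
        # Text that follows a child element belongs to the parent.
        if child.tail and child.tail.strip():
            children.append(child.tail.strip())
    return (element.tag, dict(element.attrib), children)


term = to_term(ET.fromstring(SAMPLE))
record = term[2][0]
```

Here record holds the tag "record", the attribute dictionary with number mapped to "1", and the three child elements in document order.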