首页 > 解决方案 > What makes a DCG predicate expensive?

问题描述

I'm building a Definite Clause Grammar to parse 20,000 pieces of semi-natural text. As the size of my database of predicates grows (now up to 1,200 rules), parsing a string can take quite a long time -- particularly for strings that are not currently interpretable by the DCG, due to syntax I haven't yet encoded. The current worst-case is 3 minutes for a string containing 30 words. I'm trying to figure out how I can optimize this, or if I should just start researching cloud computing.

I'm using SWI-Prolog, and that provides a "profile" goal, which provides some statistics. I was surprised to find that the simplest rules in my database are taking up the majority of execution time. My corpus contains strings that represent numbers, and I want to capture these in a scalar/3 predicate. These are hogging ~50-60% of total execution time.

At the outset, I had 70 lines in my scalars.pl, representing the numeric and natural language representations of the numbers in my corpus. Like so:

scalar(scalar(3)) --> ["three"].
scalar(scalar(3)) --> ["3"].
scalar(scalar(4)) --> ["four"].
scalar(scalar(4)) --> ["4"].

...and so on.

Thinking that the length of the file was the problem, I put in a new rule that would automatically parse any numeric representations:

scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

Thanks to that, I've gone from 70 rules to 31, and helped a bit -- but it wasn't a huge savings. Is there anything more that can be done? My feeling is maybe not, because what could be simpler than a single atom in a list?

These scalars are called in a lot of places throughout the grammar, and I assume that's the root of the issue. Though they're simple rules, they're everywhere, and unavoidably so. A highly general grammar just won't work for my application, and I wouldn't be surprised if I end up with 3,000 rules or more.

I've never built a DCG this large, so I'm not sure how much I can expect in terms of performance. Happy to take any kind of advice on this one: is there some other way of encoding these rules? Should I accept that some parses will take a long time, and figure out how to run parses in parallel?

Thank you in advance!

EDIT: I was asked to provide a reproducible example, but to do that I'd have to link SO to the entire project, since this is an issue of scale. Here's a toy version of what I'm doing for the sake of completeness. Just imagine there were large files describing hundreds of nouns, hundreds of verbs, and hundreds of syntactic structures.

sent(sent(VP, NP)) --> vp(VP), np(NP).
vp(vp(V)) --> v(V).
np(np(Qty, Noun)) --> qty(Qty), n(Noun).
scalar(scalar(3)) --> ["three"].
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.

qty(qty(Scalar)) --> scalar(Scalar).
v(v(eat)) --> ["eat"].
n(n(pie)) --> ["pie"].

标签: prologswi-prologdcg

解决方案


One aspect of your program that you might investigate is to make sure individual predicates succeed quickly and fail quickly. This is particularly useful to check for predicates that have many clauses.

For instance, when scalar(X) is evaluated on a token that is not a scalar then the program will have to try 31 (by your last count) times before it can determine that scalar//1 fails. If the structure of your program is such that scalar(X) is checked against every token then this could be very expensive.

Further, if scalar(X) does happen to find that a token matches but a subsequent goal fails then it appears that your program will retry the scalar(X) until all of the scalar//1 clauses have been attempted.

The judicious use of cut (!) or if-then-else (C1->G1;C2->G2;G3) can provide a tremendous performance improvement. Or you can structure your predicates so that they rely on indexing to select the appropriate clause. E.g.:

scalar(scalar(N)) --> [Token], {scalar1(Token, scalar(N))}.

scalar1("3", scalar(3)) :- !.
scalar1(Y, scalar(X)) :- atom_number(Y, X).

This uses both cut and clause indexing (if the compiler provides it) with the scalar1/1 predicate.

EDIT: You should read R. A. O'Keefe's The Craft of Prolog. It is an excellent guide to the practical aspects of Prolog.


推荐阅读