(Advanced) Search Syntax

This page describes details of the search syntax that are not covered in the otherwise comprehensive search syntax help.

Big Picture

The search has two distinct phases:

building the query
executing the query, collecting results

The first phase is what we are going to discuss here. We’ll break it down further into:

parsing text into the AST (abstract syntax tree)
modifying the query tree (semantic parsing)
building the query object

Building the Query

The Search Grammar defines the search language of SciX. It is a context-free grammar and it is used to generate a client library (by ANTLR).

If you don’t like reading context-free computer grammars (who does?) you’ll find a good explanation of SciX syntax here: search syntax help.

But as a reward for having found this obscure corner of the help, we’ll illustrate a few more special situations not covered there.

Operators

The search operators have the following precedence (from higher to lower priority): NEARx -> NOT -> AND -> OR -> " "

Some details worth mentioning:

Empty space is the default AND

Better to illustrate it by way of examples:

jim and john not mary becomes, behind the scenes, (jim AND (john NOT mary)) because NOT has precedence over AND. But john jim and mary becomes (john AND (jim AND mary)) because the AND operator has precedence over the empty space (the default operator). Notice how jim and mary is evaluated as a group, i.e. the query is not parsed as a flat john AND jim AND mary.
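The grouping described above can be reproduced with a small precedence-climbing sketch. This is not the ANTLR-generated parser: the names `group`, `operator_name` and `PRECEDENCE` are illustrative assumptions that only encode the precedence order listed earlier, and the input is assumed to be well formed (terms and operators separated by spaces, no brackets, no fields).

```python
import re

# Binding strength taken from the precedence list above (higher binds tighter).
# DEFOP stands for the empty space between two terms.
PRECEDENCE = {"NEAR": 5, "NOT": 4, "AND": 3, "OR": 2, "DEFOP": 1}


def operator_name(token):
    """Return the operator family for a token, or None if it is a plain term."""
    upper = token.upper()
    if re.fullmatch(r"NEAR[1-5]", upper):
        return "NEAR"
    return upper if upper in ("NOT", "AND", "OR") else None


def group(query, default_op="AND"):
    """Show how a flat query is grouped (hypothetical helper, not SciX code)."""
    tokens = query.split()
    pos = 0

    def parse(min_prec):
        nonlocal pos
        left = tokens[pos]
        pos += 1
        while pos < len(tokens):
            token = tokens[pos]
            name = operator_name(token)
            explicit = name is not None
            prec = PRECEDENCE[name if explicit else "DEFOP"]
            if prec < min_prec:
                break
            if explicit:
                pos += 1  # consume the operator; empty space has no token
            right = parse(prec + 1)
            label = token.upper() if explicit else default_op
            left = f"({left} {label} {right})"
        return left

    return parse(1)


print(group("jim and john not mary"))  # (jim AND (john NOT mary))
print(group("john jim and mary"))      # (john AND (jim AND mary))
```

Passing `default_op="OR"` to the same helper shows how the grouping stays the same while the meaning of the empty space changes.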

You can modify the default operator

The default operator can be changed on demand by adding q.op=OR to the URL parameters (i.e. not inside the search form), in which case the logic will change dramatically. For any given query the results will contain many more records, but if sorted by relevancy score, the top items will still be the ones returned with the default AND operator.
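As a minimal sketch of where the parameter goes, the snippet below builds a request URL with `q.op=OR`. The endpoint in `BASE_URL` and the helper name `build_search_url` are placeholders invented for illustration, and the use of `q` for the query string is an assumption; only `q.op` comes from the text above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only to show where the parameter goes.
BASE_URL = "https://api.example.org/search/query"


def build_search_url(query, default_operator=None):
    """Return a search URL, optionally overriding the default operator."""
    params = {"q": query}  # "q" as the query parameter name is an assumption
    if default_operator is not None:
        params["q.op"] = default_operator
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("john jim", default_operator="OR"))
```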

SciX supports proximity searches for text fields

Yes, many people may not know about it, but you can do stuff like: title:(dog NEAR5 cat). This will find any mention of the barking animal appearing up to 5 text words (tokens) from the meowing animal. NEAR has to be followed by a number [1-5] and it does not care about the order, i.e. cat NEAR5 dog == dog NEAR5 cat. This search can be very powerful, especially if applied against fulltext. It can also be quite expensive (computationally), especially when the search term(s) have synonyms. Use this operator with fielded queries on text fields such as full, abs, title.
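To make the order-independent behaviour concrete, here is a simplified sketch over a plain list of tokens. It is not how the search engine evaluates NEAR (it ignores synonyms and text analysis), the helper name `near` is hypothetical, and it assumes that "up to 5 tokens from" means the token positions differ by at most 5.

```python
def near(tokens, first, second, max_distance):
    """True if `first` and `second` occur within `max_distance` positions, in any order."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    return any(
        i != j and abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


title = "the dog that never chased a cat".split()

print(near(title, "dog", "cat", 5))  # True
print(near(title, "cat", "dog", 5))  # True, the order does not matter
print(near(title, "dog", "cat", 4))  # False, the words are 5 positions apart
```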

There is no in-order proximity operator, but SciX still supports this feature

SciX has limited support for in-order proximity: if you make a phrase search such as abs:"newtonian solar"~3, the word newtonian (and its synonyms) will have to be followed by solar (and its synonyms) at most 3 positions away. We do not have a special operator for it though; if what you search for has more than 2 words, we’ll decide how far apart they can be. For example if you do abs:"one two three"~3 then one may be 3 words away from two three (and it doesn’t matter that there are really 4 tokens between one and three).
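The two-word case can be sketched as follows. This only illustrates the behaviour as described above and is not the engine's phrase matching: the helper name `followed_within` is hypothetical, synonyms are ignored, and "up to 3 positions away" is assumed to mean that the second word appears among the 3 tokens that follow the first.

```python
def followed_within(tokens, first, second, slop):
    """True if `second` appears after `first`, at most `slop` positions later."""
    for i, token in enumerate(tokens):
        if token == first and second in tokens[i + 1 : i + 1 + slop]:
            return True
    return False


print(followed_within("newtonian and solar".split(), "newtonian", "solar", 3))
print(followed_within("solar newtonian".split(), "newtonian", "solar", 3))
```

With these assumptions the first call reports a match and the second does not, because the order of the two words is reversed.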

Syntax Parsing

OK, so back to the syntactic parsing: how does it actually work? We have a formal grammar which describes the query language. Based on that, we have generated a library (in Java) which is included inside SOLR, our search engine. When SOLR receives the user input, before it can start searching for documents, the user query (a string) has to be turned into a query object. That is the objective of the parser. First comes the syntactic phase, during which an ANTLR parser will be ‘eating’ the input character by character. It will occasionally veer off to explore a possible (alternative) branch, to either pursue it further or return and start branching from some previous position. The input has to be syntactically correct; if we have explored all possible readings and there are still some input characters left, it means the input is non-conforming and we’ll generate an exception and give up.

If the query is correct, however, after the parser is finished parsing, we’ll have the AST (Abstract Syntax Tree): a hierarchical data structure (a tree) instead of the flat chain of characters.

Here is an example (it is only an illustration; inside the search engine the AST will be richer):

"(this) (that)" 	->	
                            (DEFOP 
(CLAUSE (                               (CLAUSE (
    DEFOP (                                 DEFOP (
        MODIFIER (                              MODIFIER (
            TMODIFIER (                             TMODIFIER (
                FIELD (                                 FIELD (
                    QNORMAL this))))))                      QNORMAL that)))))))

In human words: the input (this) (that) has been parsed into an AST; the tree starts at a DEFOP node (default operator) which has two children (CLAUSEs). Each CLAUSE itself is made of a strict chain of components: DEFOP->MODIFIER->TMODIFIER->FIELD. They are all empty (with an implicit value of none). It is only after we have arrived at a terminal QNORMAL node that we also see values. Inside the tree, we will have information about every bracket, position, and links to parent/children.
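The shape of that tree can be mimicked with a few lines of Python. This is only a sketch: the `Node` class and the helpers `clause`, `terminals` and `first_path` are hypothetical names, and the real AST inside the search engine carries more information (brackets, positions, parent links).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    value: Optional[str] = None  # only terminal nodes carry a value
    children: List["Node"] = field(default_factory=list)


def clause(term):
    """Build CLAUSE -> DEFOP -> MODIFIER -> TMODIFIER -> FIELD -> QNORMAL term."""
    chain = Node("QNORMAL", value=term)
    for label in ("FIELD", "TMODIFIER", "MODIFIER", "DEFOP", "CLAUSE"):
        chain = Node(label, children=[chain])
    return chain


def terminals(node):
    """Collect the values found at the leaves, left to right."""
    if not node.children:
        return [node.value]
    found = []
    for child in node.children:
        found.extend(terminals(child))
    return found


def first_path(node):
    """List the labels from this node down to its first leaf."""
    labels = [node.label]
    while node.children:
        node = node.children[0]
        labels.append(node.label)
    return labels


tree = Node("DEFOP", children=[clause("this"), clause("that")])
print(terminals(tree))   # ['this', 'that']
print(first_path(tree))
```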

This tree will be further modified in the next phase.

Semantic Parsing

There is a lot of magic that happens in this next phase. All of it is defined inside the pipeline.

Pro tip: if you add debugQuery=true to your search request URL parameters (and look at the data as returned by our API), you’ll see the serialized version of the query as parsed by SOLR. For example:

"debug":{
    "rawquerystring":"abs:\"newtonian solar\"~3",
    "querystring":"abs:\"newtonian solar\"~3",
    "parsedquery":"CustomScoreQuery(custom(abstract:\"syn::newton syn::solar\"~3 title:\"syn::newton syn::solar\"~3 keyword:\"syn::newton syn::solar\"~3, sum(float(cite_read_boost),const(0.5))))",
    "parsedquery_toString":"custom(abstract:\"syn::newton syn::solar\"~3 title:\"syn::newton syn::solar\"~3 keyword:\"syn::newton syn::solar\"~3, sum(float(cite_read_boost),const(0.5)))",
    ...
}
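Once the response has been decoded from JSON, the parsed query can be inspected programmatically. The sketch below works on a trimmed copy of the debug section shown above and lists the index fields that the query was expanded to. The helper name `searched_fields` is hypothetical, and the regular expression is an assumption that only fits quoted phrase clauses like the ones in this example.

```python
import re

# Trimmed copy of the debug section shown above (other keys omitted).
response = {
    "debug": {
        "rawquerystring": 'abs:"newtonian solar"~3',
        "parsedquery_toString": (
            'custom(abstract:"syn::newton syn::solar"~3 '
            'title:"syn::newton syn::solar"~3 '
            'keyword:"syn::newton syn::solar"~3, '
            "sum(float(cite_read_boost),const(0.5)))"
        ),
    }
}


def searched_fields(response):
    """Return the field names that prefix a quoted phrase in the parsed query."""
    parsed = response["debug"]["parsedquery_toString"]
    return re.findall(r'(\w+):"', parsed)


print(searched_fields(response))  # ['abstract', 'title', 'keyword']
```

Here the output makes it visible that a search on abs was expanded to the abstract, title and keyword fields.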