The code behind BibClassify: the extraction algorithm

Contents

1. Overview
2. Preprocessing
3. Single Keyword (mkw) processing
4. Composite keyword (ckw)processing
5. Postprocessing

1. Overview

This section provides a detailed account of the phrase matching techniques used by BibClassify to automatically extract significant terms from fulltext documents. While reading this guide, you are advised to refer to the original BibClassify code, mostly contained in lib/python/invenio/bibclassifylib.py. This guide refers to version 2006/09/15.

The bulk of the extraction mechanism takes place inside the function generate_keywords_rdf. This function is triggered when BibClassify is launched with parameter -K, --taxonomy=FILENAME. Let's have a look at what happens inside this function, step-by-step.

 NB. BibClassify can also run on top of simple text thesauri (parameter -k, --thesaurus=FILENAME, function generate_keywords), however this mode is now deprecated and no longer maintained.

2. Preprocessing

At the beginning of the function, various local variables are declared. Among these,

The taxonomy (dictfile) is stored and parsed into memory via RDFlib by the following two lines of code:

store = rdflib.Graph()
store.parse(dictfile)

 NB. RDFLib provides very handy libraries for RDF manipulation, however when dealing with large RDF files, loading and parsing are by far the principal factor affecting the performance of BibClassify. For example, when loading the HEP taxonomy (7.4 MB, 16000 Concepts) on an Intel(R) Xeon(TM) CPU 3.06GHz the two lines of code above take a total of 26 seconds to complete (over two thirds of the total execution time - 36 seconds). If performance is your main concern, consider a faster library or working on a pre-loaded RDF store.

At this point, the fulltext of the document is converted from text to PDF (using standard linux command pdftotext) and stored into a string text_string. This string will contain the full document if running in slow mode or an arbitrary excerpt of the document (about 40%) if running in fast mode. Moreover, the very beginning of the string (10%) is stored in a variable called abstract. This is done base on the assumption that manuscripts generally contain crucial information such as title and abstract in the very first portion of the document. This portion can be then treated to be more relevant than the remainder of the document. Please bear this in mind if running BibClassify on documents with different structures or when running on heterogeneous collections.

In many manuscripts, the author includes a list of pre-assigned terms that describe the topic of the article. The last step before keyword extraction begins is to locate these author-assigned keywords. We try to isolate these by searching for the key phrase Keywords followed by a list of terms. When found, the string is stored into variable safe_keys and used later to match BibClassify output against author assigned keywords (these are marked in the output with an asterisk, e.g. 13* Hubble constant)

3. Single Keyword (mkw) processing

The bulk of the phrase matching operations - the extraction of the single keywords from the fulltext - is contained in a big for loop. In this loop, every RDF Concept is parsed, one at a time, and its components (such as prefLabel and altLabel) matched inside the document.

 NB. For a detailed explanation of the RDF SKOS syntax, please refer to the SKOS W3C website.
The rationale behind single and composite keywords is covered in the HEP taxonomy hacking section.

For every concept in the taxonomy that has a prefLabel, we store its searchable labels (altLabel and hiddenLabel) in a list (candidates). At this point we also check whether the concept has been flagged with a nostandalone note, i.e. it will be used for the computation of composite keywords, but it does not count as a single keyword on its own (this is the case of many short particle names that if considered on their own would generate a lot of noise output).

For every candidate term in the candidates list, a regex is compiled around the term. In doing this, it is important to consider carefully the anatomy of the term:

hiddenLabel containing wildcards (e.g. gauge theor*) are processed separately at the beginning. This is done to avoid double matching, i.e. if a hiddenLabel's regex matches any other label, the latter will be excluded from the occurrence calculation.

The assembling and compilation of the regex is performed for each candidate in the function makePattern. The function wraps a regex around the candidate term according to the type of term detected, as explained above. When compiling the regular expressions around the candidate terms, the basic rule for making patterns is:

(?:[^A-Za-z0-9\+-])( + candidate term + )(?=[^A-Za-z0-9\+-])
The word separator (in bold) differs from the standard regex for non-whitespace character (\s) as it includes plus and minus signs (that in the case of HEP thesaurus terms cannot be regarded as whitespace). The compiled pattern is then matched inside the text_string via a sre.findall and the resulting keyword occurrences stored in a list of single keywords(keylist). This list is then sorted by term occurrence and can be used for the final processing task: the computation of the combined keywords.

 NB. The compilation of candidate terms into regex patterns also affects negatively performance times. However, this is not as influent as the storing and parsing of the RDF taxonomy, demonstrated above. When compiling the 16000 Concepts of the HEP taxonomy on an Intel(R) Xeon(TM) CPU 3.06GHz this step takes a total of 9 seconds (one fourth of the total execution time - 36 seconds). It is clear that such performance can be greatly improved by working on a pre-compiled set of regex patterns, to be re-generated only whenever the taxonomy is modified.

4. Composite keyword (ckw) processing

This is the final processing step of the extraction mechanism. Although this step takes place only if composite keywords are present in the taxonomy, it is the one step that - in the case of high energy physics - produces the most accurate results - especially thanks to the richness and level of detail reached by the current HEP thesaurus.

In this part, BibClassify looks for possible keyword combinations and key pairs by analysing the list of most recurrent single keywords (keylist). The total number of mkws that are possible ckw candidates is defined by the parameter -l, --limit. The higher this value (default is 70), the higher the number of mkws that make it to the pool of ckw candidates. In order to compute possible combinations, BibClassify loops around l single keywords in keylist. For each entry, it creates key : value pairs in two dictionaries:

For example, if the mkws zinc and tungsten both appear in the keylist, they make a possible composite keyword as the ckw zinc, tungsten exist in the taxonomy. Bibclassify would then create entries in the composites dictionary as follows:
http://cern.ch/thesauri/HEP.rdf#Composite.zinctungsten : [http://cern.ch/thesauri/HEP.rdf#zinc, http://cern.ch/thesauri/HEP.rdf#tungsten]
which would be a valid ckw candidate.

 NB. One could add at this point the possibility of having combinations of more than two mkws, e.g. ckwURI : [mkw1URI, mkw2URI, mkw3URI]. This is feasible, but was not implemented at this stage, because of the performance overhead that would be generated by the phrase matching of more complex regular expressions.

Once the composites dictionary is completed, its keys that point to lists of two values (like the example above) are ckw candidates. We now need to check whether they actually appear one next to the other in the text. This is done in function makeCompPattern by compiling a pair of regular expressions: one for mkw1 followed by mkw2 and one for the inverse situation, mkw2 followed by mkw1. Once again, when compiling the regex pattern we have to take extra care to treat special cases (hyphens, short names, wildcards) accordingly. The sum of the incidence of the two patterns in the text_string is stored in a list (compositesOUT) that is then sorted (by occurrence) and output.

5. Postprocessing

Before presenting the results to the user, some extra filtering occurs, primarily to refine the output keywords. The main postprocessing actions performed on the results are:

The final results that are produced to the user consist of the first n entries of keylist (the single keywords) and the entries in compositesOUT (the composite keywords). The results may be presented in text or html format, according to the output mode chosen at the command line. Sample output (both text and html) can be found in the BibClassify Admin Guide.