| Infoscience | ||||||||||
|
| Home > Hacking CDS Invenio > BibClassify Internals > The code behind BibClassify: the extraction algorithm |
This section provides a detailed account of the phrase matching
techniques used by BibClassify to automatically extract significant terms
from fulltext documents. While reading this guide, you are advised to refer to the original
BibClassify code, mostly contained in lib/python/invenio/bibclassifylib.py.
This guide refers to version 2006/09/15.
The bulk of the extraction mechanism takes place inside the function
generate_keywords_rdf. This function is triggered when
BibClassify is launched with parameter -K,
--taxonomy=FILENAME. Let's have a look at what happens inside this function, step-by-step.
NB. BibClassify can also run on top of simple text thesauri
(parameter -k, --thesaurus=FILENAME, function
generate_keywords), however this mode is now deprecated
and no longer maintained.
At the beginning of the function, various local variables are declared. Among these,
namespace: This variable
points to the main rdf:Namespace used in the taxonomy. In the case of
RDF SKOS, this is
http://www.w3.org/2004/02/skos/core#. If you need to use namespaces other than this one, please modify this variable
accordingly.delimiter: This variable represents the delimiter
symbol used to separate composite keywords. For example, current HEP
taxonomy adopts ":" (e.g. baryon: asymmetry) whereas the SPIRES standard
adopts "," (e.g. baryon, asymmetry). If you intend to set up and use
composite keywords in your taxonomy, please set this variable to the desired format.
The taxonomy (dictfile) is stored and parsed into memory via RDFlib by the following two lines of code:
store = rdflib.Graph()
store.parse(dictfile)
NB. RDFLib provides very handy libraries for RDF manipulation, however when dealing with large RDF files, loading and parsing are by far the principal factor affecting the performance of BibClassify. For example, when loading the HEP taxonomy (7.4 MB, 16000 Concepts) on an Intel(R) Xeon(TM) CPU 3.06GHz the two lines of code above take a total of 26 seconds to complete (over two thirds of the total execution time - 36 seconds). If performance is your main concern, consider a faster library or working on a pre-loaded RDF store.
At this point, the fulltext of the document is converted from text
to PDF (using standard linux command pdftotext) and stored into a
string text_string. This string will contain the full
document if running in slow mode or an arbitrary excerpt of the
document (about 40%) if running in fast mode. Moreover, the very
beginning of the string (10%) is stored in a variable called
abstract. This is done base on the assumption that
manuscripts generally contain crucial information such as title and
abstract in the very first portion of the document. This portion can
be then treated to be more relevant than the remainder of the document. Please
bear this in mind if running BibClassify on documents with different
structures or when running on heterogeneous collections.
In many manuscripts, the author includes a list of pre-assigned
terms that describe the topic of the article. The last step before
keyword extraction begins is to locate these author-assigned
keywords. We try to isolate these by searching for the key phrase
Keywords followed by a list of terms. When found, the string
is stored into variable safe_keys and used later to match
BibClassify output against author assigned keywords (these are marked
in the output with an asterisk, e.g. 13* Hubble constant)
The bulk of the phrase matching operations - the extraction of the single
keywords from the fulltext - is contained in a big for loop. In this
loop, every RDF Concept is parsed, one at a time, and its
components (such as prefLabel and altLabel) matched inside the document.
NB. For a detailed explanation
of the RDF SKOS syntax, please refer to the
SKOS W3C
website.
The rationale behind single and composite keywords is covered in the HEP taxonomy hacking
section.
For every concept in the taxonomy that has a prefLabel, we store
its searchable labels (altLabel and hiddenLabel) in a list
(candidates). At this point we also check whether the
concept has been flagged with a nostandalone note,
i.e. it will be used for the computation of composite keywords, but it
does not count as a single keyword on its own (this is the case of many
short particle names that if considered on their own would generate a lot of noise
output).
For every candidate term in the
candidates list, a regex is compiled around the term. In
doing this, it is important to consider carefully the anatomy of the
term:
hiddenLabel containing wildcards (e.g. gauge theor*) are
processed separately at the beginning. This is done to avoid double
matching, i.e. if a hiddenLabel's regex matches any other label, the
latter will be excluded from the occurrence calculation.
The assembling and compilation of the regex is performed for each
candidate in the function makePattern. The function wraps
a regex around the candidate term according to the type of term
detected, as explained above. When compiling the regular expressions around
the candidate terms, the basic rule for making patterns is:
(?:[^A-Za-z0-9\+-])( + candidate term + )(?=[^A-Za-z0-9\+-])The word separator (in bold) differs from the standard regex for non-whitespace character (\s) as it includes plus and minus signs (that in the case of HEP thesaurus terms cannot be regarded as whitespace). The compiled pattern is then matched inside the
text_string via a
sre.findall and the resulting keyword occurrences stored
in a list of single keywords(keylist). This list is then
sorted by term occurrence and can be used for the final processing
task: the computation of the combined keywords.
NB. The compilation of candidate terms into regex patterns also affects negatively performance times. However, this is not as influent as the storing and parsing of the RDF taxonomy, demonstrated above. When compiling the 16000 Concepts of the HEP taxonomy on an Intel(R) Xeon(TM) CPU 3.06GHz this step takes a total of 9 seconds (one fourth of the total execution time - 36 seconds). It is clear that such performance can be greatly improved by working on a pre-compiled set of regex patterns, to be re-generated only whenever the taxonomy is modified.
This is the final processing step of the extraction mechanism. Although this step takes place only if composite keywords are present in the taxonomy, it is the one step that - in the case of high energy physics - produces the most accurate results - especially thanks to the richness and level of detail reached by the current HEP thesaurus.
In this part, BibClassify looks for possible keyword combinations
and key pairs by analysing the list of most recurrent single keywords
(keylist). The total number of mkws that are possible ckw
candidates is defined by the parameter -l,
--limit. The higher this value (default is 70), the higher
the number of mkws that make it to the pool of ckw candidates. In
order to compute possible combinations, BibClassify loops around
l single keywords in keylist. For each entry,
it creates key : value pairs in two dictionaries:
compositesIDX, an index dictionary to aid retrieval of components in
keylist (key:value pair of the form URI : keylist entry)
composites, a dictionary to keep track of possible
combinations between mkws (key: value pair of the form
ckwURI : [mkw1URI, mkw2URI])keylist, they make
a possible composite keyword as the ckw zinc, tungsten exist
in the taxonomy. Bibclassify would then create entries in the
composites dictionary as follows:
http://cern.ch/thesauri/HEP.rdf#Composite.zinctungsten : [http://cern.ch/thesauri/HEP.rdf#zinc, http://cern.ch/thesauri/HEP.rdf#tungsten]which would be a valid ckw candidate.
NB. One could add at this point the possibility of having combinations of more than two mkws, e.g. ckwURI : [mkw1URI, mkw2URI, mkw3URI]. This is feasible, but was not implemented at this stage, because of the performance overhead that would be generated by the phrase matching of more complex regular expressions.
Once the composites dictionary is completed, its keys
that point to lists of two values (like the example above) are ckw
candidates. We now need to check whether they actually appear one next
to the other in the text. This is done in function
makeCompPattern by compiling a pair of regular
expressions: one for mkw1 followed by mkw2 and one
for the inverse situation, mkw2 followed by mkw1. Once again,
when compiling the regex pattern we have to take extra care to treat
special cases (hyphens, short names, wildcards) accordingly. The sum
of the incidence of the two patterns in the text_string is
stored in a list (compositesOUT) that is then sorted (by
occurrence) and output.
Before presenting the results to the user, some extra filtering occurs, primarily to refine the output keywords. The main postprocessing actions performed on the results are:
NB. One could also perform this last post-processing step at the composite keyword level, e.g. energy: density to be overridden by dark energy: density. This has not been yet implemented for security reasons (the incidence of altLabels on the ckw computation).
The final results that are produced to the user consist of the
first n entries of keylist (the single
keywords) and the entries in compositesOUT (the composite
keywords). The results may be presented in text or html format,
according to the output mode chosen at the command line. Sample output (both text and html) can be found in the
BibClassify Admin Guide.