Probabilistic Tagging of Unstructured Genealogical Records

Perrow, Mike; Barber, David

2005

Abstract

In this paper we present a method for parsing unstructured textual records that briefly describe a person and their direct relatives. The string 'Stephanus, brother of Johannes Magnin, from Saillon' is a typical record. We wish to annotate every term (word or symbol) in a record with a label that says whether the term is a name (e.g. 'Stephanus'), a place (e.g. 'Saillon'), or a relationship (e.g. 'brother'). We build on work developed for cleaning and standardizing names in record linkage corpora, adding several enhancements to handle our more difficult data, which contains common name structures from French, Italian and Latin spanning hundreds of years. Our approach works interactively with a user to annotate the data set accurately while greatly reducing the human effort required. It does so by learning a Hidden Markov Model that represents record structure and by finding structural patterns in new records.
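To make the tagging idea concrete, here is a minimal sketch of Viterbi decoding with a Hidden Markov Model on the example record. It is not the authors' implementation: the paper learns its model interactively from data, whereas every probability below is hand-set for illustration. The extra labels PUNCT, LINK and ORIGIN, the coarse token classes, and the helper names (`tokenize`, `token_class`, `viterbi`) are all assumptions introduced here.

```python
import math
import re

# Hypothetical label set: NAME, PLACE and REL come from the abstract;
# PUNCT, LINK ("of") and ORIGIN ("from") are assumptions of this sketch.
STATES = ["NAME", "PLACE", "REL", "PUNCT", "LINK", "ORIGIN"]
FLOOR = 1e-6  # stand-in for smoothing; avoids log(0) for unlisted events

# Hand-set, illustrative parameters (a real system would learn these).
START = {"NAME": 0.8, "REL": 0.15, "PLACE": 0.05}
TRANS = {
    "NAME": {"NAME": 0.4, "PUNCT": 0.5, "ORIGIN": 0.1},
    "PLACE": {"PLACE": 0.3, "PUNCT": 0.7},
    "REL": {"LINK": 0.9, "NAME": 0.1},
    "PUNCT": {"REL": 0.5, "ORIGIN": 0.4, "NAME": 0.1},
    "LINK": {"NAME": 0.9, "PLACE": 0.1},
    "ORIGIN": {"PLACE": 0.9, "NAME": 0.1},
}
# Each state emits a coarse token class rather than the raw word.
EMIT = {
    "NAME": {"cap": 1.0},
    "PLACE": {"cap": 1.0},
    "REL": {"relword": 1.0},
    "PUNCT": {"punct": 1.0},
    "LINK": {"of": 1.0},
    "ORIGIN": {"from": 1.0},
}
REL_WORDS = {"brother", "sister", "son", "daughter", "wife", "widow"}


def tokenize(text):
    """Split a record into words and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_class(tok):
    """Map a token to the coarse observation class the HMM emits."""
    low = tok.lower()
    if low in REL_WORDS:
        return "relword"
    if low in ("of", "from"):
        return low
    if not tok[0].isalnum():
        return "punct"
    if tok[0].isupper():
        return "cap"
    return "other"


def logp(table, key):
    return math.log(table.get(key, FLOOR))


def viterbi(tokens):
    """Return the most probable label sequence for the tokens."""
    if not tokens:
        return []
    obs = [token_class(t) for t in tokens]
    score = {s: logp(START, s) + logp(EMIT[s], obs[0]) for s in STATES}
    back = []  # one back-pointer table per position after the first
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + logp(TRANS[p], s))
            score[s] = prev[best] + logp(TRANS[best], s) + logp(EMIT[s], o)
            ptr[s] = best
        back.append(ptr)
    path = [max(STATES, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


record = "Stephanus, brother of Johannes Magnin, from Saillon"
tokens = tokenize(record)
for tok, tag in zip(tokens, viterbi(tokens)):
    print(f"{tok}\t{tag}")
```

Note that 'Johannes' and 'Saillon' fall into the same observation class (capitalized word), so only the transition structure separates them: a capitalized word following 'of' is most likely a name, and one following 'from' is most likely a place. This is the sense in which the model captures record structure rather than relying on a word list.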

Details

Title: Probabilistic Tagging of Unstructured Genealogical Records

Author(s): Perrow, Mike; Barber, David

Date: 2005

Publisher: IDIAP

Keywords (free): learning

Laboratory: LIDIAP (Laboratoire de l'IDIAP)

Collections: Work produced at EPFL; Technical reports; Published

Record created: 2006-03-10