Facilitating access to digitalized dictionaries in DjVu format

Bień, Janusz S. (2009) Facilitating access to digitalized dictionaries in DjVu format. In: MONDILEX project open workshop "Representing Semantics in Digital Lexicography" within an international scientific conference COGNITIVE and CONTRASTIVE STUDIES, June 29 - July 1, 2009, Warszawa. (Submitted)

Warning: There is a more recent version of this item available.
JSB_DL-09s.pdf - Presentation (2MB)
JSB_DL-09.pdf - PDF (paper), Submitted Version (959kB). Restricted to Repository staff only until 2010.
Official URL: http://www.ispan.waw.pl/images/konferencje/mondile...


One of the best formats for scanned documents is DjVu. An essential feature of the format is the hidden text layer, which usually contains the results of Optical Character Recognition. Another important feature is the ability to store the documents, and to serve them over the Internet, as a collection of individual pages. From the very beginning the format has also been used for dictionaries; in particular, several Polish dictionaries are available in it. The question, then, is how to search the text layer of such large multi-volume works efficiently. For this purpose we intend to adapt Poliqarp (Polyinterpretation Indexing Query and Retrieval Processor), a GPL-licensed corpus query tool developed at the Institute of Computer Science of the Polish Academy of Sciences. Some preliminary experiments are described in the talk. In our "quick and dirty" approach we treat every page as a single document whose metadata consists of the name of the document index and the name of the file with the page content. For every word, instead of grammatical tags, we provide its location on the page in the form of the line number and its position in the line. Taken together, these data make it possible to link the search results to the appropriate fragments of the original scans.
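
The page-as-document idea described above can be illustrated with a minimal Python sketch. It is not the Poliqarp-based implementation, and it does not read DjVu files: it assumes the hidden text layer of a page has already been extracted as plain text with one text line per line. The names Token, PageDocument, index_page and find_word, as well as the sample index name, file name and page text, are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    word: str
    line: int      # 1-based line number on the page
    position: int  # 1-based position of the word within its line


@dataclass
class PageDocument:
    index_name: str      # metadata: name of the document index (assumed plain string)
    page_file: str       # metadata: name of the file with the page content
    tokens: List[Token]


def index_page(index_name: str, page_file: str, text: str) -> PageDocument:
    """Treat one page as one document; tag every word with its location."""
    tokens = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Whitespace tokenization is an assumption made for this sketch.
        for pos, word in enumerate(line.split(), start=1):
            tokens.append(Token(word, line_no, pos))
    return PageDocument(index_name, page_file, tokens)


def find_word(pages: List[PageDocument], word: str) -> List[Tuple[str, str, int, int]]:
    """Return (index name, page file, line, position) for every exact match."""
    return [
        (page.index_name, page.page_file, tok.line, tok.position)
        for page in pages
        for tok in page.tokens
        if tok.word == word
    ]


if __name__ == "__main__":
    # Hypothetical page content standing in for an OCR text layer.
    page = index_page("dictionary", "p0001.djvu", "abak m.\nliczydło abak")
    for hit in find_word([page], "abak"):
        print(hit)
```

Each hit carries the page file name together with the line number and the position in the line, which is the information needed to point a search result back at a fragment of the scanned page.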

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: digitalization, DjVu, dictionaries, Poliqarp, djvu-xfgrep
Subjects: Z Bibliography. Library Science. Information Resources > Z004 Books. Writing. Paleography
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
P Language and Literature > PG Slavic, Baltic, Albanian languages and literature
P Language and Literature > P Philology. Linguistics
Q Science > QA Mathematics > QA76 Computer software
Depositing User: Janusz S. Bień
Date Deposited: 21 Sep 2009 06:17
Last Modified: 05 Aug 2012 10:27
URI: http://bc.klf.uw.edu.pl/id/eprint/118
