Scanned texts as corpora - a case study

Bień, Janusz S. (2010) Scanned texts as corpora - a case study. In: SLAVICORP. CORPORA OF SLAVIC LANGUAGES, 22-23 November 2010, University of Warsaw. (Unpublished)

JSB_SlavCorp10s.pdf - Presentation
Available under License GNU Verbatim Copying and Distribution.



A modification of the Poliqarp corpus search tool is described, oriented towards searching scanned texts with dirty OCR (i.e. fully automatic Optical Character Recognition without any proofreading). This search tool, called Poliqarp for DjVu, has been in operation since December 2009 and is available online. The two-level regular expressions that can be used in queries make it possible, at least in principle, to circumvent OCR errors. The crucial property of the search engine is that it highlights the hits on the original scans stored in the DjVu format. The feature is not original: it was first used for the Century Dictionary and later for Jamieson's Etymological Dictionary of the Scottish Language. It is, however, substantially augmented by allowing so-called graphical concordances and by providing a convenient way to bookmark the hits. Our system now handles several dictionaries. One of them is the so-called Warsaw Dictionary, published in 8 volumes in 1900-1927 and comprising almost 8000 pages, of which 7541 are pages with entries. It has been scanned and OCRed by the library of the University of Warsaw. Another is the monumental dictionary of 16th-century Polish, which is still a work in progress. Our system handles the 33 volumes published so far; three of them are born-digital, and the rest have been scanned by the Kujawsko-Pomorska Digital Library. Together the volumes comprise almost 20000 pages, of which 18296 are pages with entries. Our case study focuses on this latter dictionary, because it allows us to demonstrate the use of regular expressions in queries to account for the peculiarities of 16th-century Polish spelling. Although our system (initially called marasca) began as just a modification of Poliqarp, we contribute in return to the original project: since March 2010 the National Corpus of Polish has used our version of the WWW Poliqarp client.
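
The general idea of using regular expressions to tolerate OCR errors and historical spelling variants can be illustrated with a minimal sketch. It uses Python's standard re module, not the Poliqarp query syntax, and the confusion table, the function name tolerant_pattern, and the sample text are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical confusion table (an assumption, not taken from the abstract):
# each character maps to forms it might take in dirty OCR output or in
# variant historical spelling.
VARIANTS = {
    "s": ["s", "ſ", "f"],   # long s is often misread as f
    "ł": ["ł", "l", "t"],   # stroke lost or misread
    "m": ["m", "rn"],       # classic OCR confusion
    "j": ["j", "i", "y"],   # older spellings of j
}

def tolerant_pattern(word: str) -> re.Pattern:
    """Build a regex that matches `word` under the confusions above."""
    parts = []
    for ch in word:
        alternatives = VARIANTS.get(ch, [ch])
        # Non-capturing group so findall() returns whole matches.
        parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
    return re.compile("".join(parts), re.IGNORECASE)

pattern = tolerant_pattern("słowo")
ocr_text = "flowo ſtowo słowo slovo"   # made-up sample of noisy text
hits = pattern.findall(ocr_text)
print(hits)
```

Here the first three tokens are matched because each differs from the query only by a listed confusion, while the last is not, since no rule relates w to v. A wider table finds more damaged forms at the cost of more false hits.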

Item Type: Conference or Workshop Item (Paper)
Depositing User: Janusz S. Bień
Date Deposited: 15 Nov 2010 14:41
Last Modified: 05 Aug 2012 10:29
