Skanowane teksty jako korpusy

Bień, Janusz S. (2011) Skanowane teksty jako korpusy. Prace Filologiczne, LX. ISSN 0138-0567 (Submitted)

Warning: There is a more recent version of this item available.

Scanned texts as corpora

A modification of the Poliqarp corpus search tool is described, oriented towards searching scanned texts with dirty OCR (i.e. fully automatic Optical Character Recognition without any proofreading). The search tool has been in operation since December 2009 and is available online. The two-level regular expressions that can be used in queries make it possible, at least in principle, to work around OCR errors. The crucial property of the search engine is that it highlights the hits on the original scans, which are stored in the DjVu format. Although this feature is not original (it was first used for the Century Dictionary and later for Jamieson’s Etymological Dictionary of the Scottish Language), it is substantially augmented here by so-called graphical concordances and a convenient way to bookmark the hits. The system now handles four dictionaries with a total size of over 40,000 pages. Other texts are expected to be added to the system in the near future.
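The idea of using regular expressions to work around OCR errors can be illustrated with a minimal sketch. It uses Python's standard `re` module, not Poliqarp's own query syntax, and the confusion table `OCR_VARIANTS` and the helper `ocr_tolerant_pattern` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical table of common OCR confusions (an assumption for this sketch,
# not taken from the paper): each letter maps to the strings it may be read as.
OCR_VARIANTS = {
    "m": ["m", "rn"],
    "o": ["o", "0"],
    "e": ["e", "c"],
    "l": ["l", "1", "I"],
}


def ocr_tolerant_pattern(word: str) -> re.Pattern:
    """Build a regex matching `word` or its likely OCR misreadings."""
    parts = []
    for ch in word:
        variants = OCR_VARIANTS.get(ch, [ch])
        # Each letter becomes a non-capturing alternation of its variants.
        parts.append("(?:" + "|".join(re.escape(v) for v in variants) + ")")
    return re.compile(r"\b" + "".join(parts) + r"\b")


pattern = ocr_tolerant_pattern("modern")
text = "modern rnodern m0dcrn modem"
hits = pattern.findall(text)
print(hits)
```

The pattern only covers misreadings listed in the table, so a word such as "modem" is not matched when searching for "modern"; a real query would need a confusion table suited to the scans at hand.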

Item Type: Article
Additional Information: Polish-language version of the paper
Depositing User: Janusz S. Bień
Date Deposited: 20 Apr 2011 03:50
Last Modified: 05 Aug 2012 10:29
