Skanowane teksty jako korpusy (Scanned texts as corpora)

Bień, Janusz S. (2011) Skanowane teksty jako korpusy. Prace Filologiczne. ISSN 0138-0567 (Submitted)

Warning: There is a more recent version of this item available.
Full text: JSB-SlaviCorp-ost_PF-LX.doc (Microsoft Word, Submitted Version)
Available under License Personal Use Only.

Abstract

Scanned texts as corpora

This paper describes a modification of the Poliqarp corpus search tool oriented towards searching scanned texts with dirty OCR (i.e. fully automatic Optical Character Recognition without any proofreading). The search tool has been in operation since December 2009 and is available at http://wbl.klf.uw.edu.pl/. The two-level regular expressions that can be used in queries make it possible, at least in principle, to work around OCR errors. The crucial property of the search engine is its ability to highlight hits on the original scans stored in the DjVu format. The feature is not original: it was first used for the Century Dictionary and later for Jamieson’s Etymological Dictionary of the Scottish Language. Here it is substantially augmented by so-called graphical concordances and a convenient way to bookmark hits. The system now handles four dictionaries with a total size of over 40 000 pages. It is expected that other texts will be added to the system in the near future.
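
The idea of working around OCR errors with regular expressions can be illustrated with a minimal sketch in plain Python. This is not Poliqarp's two-level query syntax, which the abstract does not spell out; the confusion table, the function name `tolerant_pattern`, and the sample text are all hypothetical and serve only to show the general principle of widening a query so that it also matches plausible misreadings.

```python
import re

# Hypothetical table of characters that OCR often confuses (illustrative only,
# not taken from the paper or from Poliqarp).
CONFUSABLE = {
    "l": "[l1I|ł]",     # letter l read as digit 1, capital I, bar, or ł
    "o": "[o0ó]",       # letter o read as zero or ó
    "m": "(?:m|rn)",    # m read as the pair r + n
}

def tolerant_pattern(word: str) -> re.Pattern:
    """Build a regex matching `word` and some plausible OCR misreadings of it."""
    parts = [CONFUSABLE.get(ch, re.escape(ch)) for ch in word]
    return re.compile(r"\b" + "".join(parts) + r"\b")

# Made-up sample of uncorrected OCR output.
dirty_text = "The old rnill stood by the river; the mi1l wheel was broken."

pattern = tolerant_pattern("mill")
hits = pattern.findall(dirty_text)
print(hits)
```

A query for the clean form "mill" then also finds the misrecognised forms in the sample text, at the cost of possible false positives as the confusion table grows.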

Item Type: Article
Additional Information: Polish-language version of the paper http://bc.klf.uw.edu.pl/173/
Depositing User: Janusz S. Bień
Date Deposited: 10 Jul 2011 06:01
Last Modified: 13 Nov 2012 06:14
URI: http://bc.klf.uw.edu.pl/id/eprint/201
