Scanned publications in digital libraries: new Open Source DjVu tools

Bień, Janusz S. (2012) Scanned publications in digital libraries: new Open Source DjVu tools. In: the Library 2.012 worldwide virtual conference, October 3 - 5, 2012, Internet. (Unpublished)

PDF
JSBatal_Library-12s.pdf
Available under License Creative Commons Attribution.

Official URL: https://sas.elluminate.com/site/external/playback/...

Abstract

The DjVu technology is described by its authors as "an image compression technique, a document format, and a software platform for delivering documents images over the Internet"; according to recent statistics, about 80% of the documents stored in Polish digital libraries are in this format. Besides the commercial software supporting this technology, there is also the DjVuLibre suite of Open Source tools and utilities, developed by the creators of the technology. The presentation discusses another Open Source suite of programs, which consists of two sets. The first set contains programs for creating and improving DjVu documents, including the results of Optical Character Recognition. A typical OCR program outputs its results as a PDF "sandwich" document containing text under the image (although ABBYY FineReader, since version 11, can save its output directly as a DjVu file, the PDF output contains more information). The pdf2djvu program conceived by Jakub Wilk (http://jwilk.net/software/pdf2djvu) converts PDF files into DjVu, preserving all the features (e.g. outlines) that are representable in the latter format. The purpose of another program, also conceived by Jakub Wilk (http://jwilk.net/software/didjvu), is the conversion of graphic files into DjVu documents consisting of foreground (the printed text), mask, and background (e.g. illustrations) layers. Such separation not only makes a high compression ratio possible, but also improves the quality of the OCR results, since OCR should operate only on the foreground or the mask. The third program, named ocrodjvu for historical reasons (http://jwilk.net/software/ocrodjvu), is a wrapper for several Open Source OCR programs, including Tesseract, which achieves quality comparable with commercial systems (cf. e.g. test results).
The second set of programs concerns the delivery of DjVu documents to users. It consists of a search engine server and two kinds of clients: marasca, installable as a WWW site, and djview4poliqarp, a standalone client installable on a user's computer. As the server is based on the Poliqarp corpus tool, the whole set is called simply Poliqarp for DjVu. The author of djview4poliqarp is Michał Rudolf; the rest of the system was created by Jakub Wilk. The tools have been developed within a project directed by the present author; the results are available under the terms of the GNU General Public License.
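
As a rough illustration of how the three programs of the first set might be chained, the sketch below builds command lines for pdf2djvu, didjvu and ocrodjvu without running them. The helper function names and file names are hypothetical, and the options shown (`-o` for pdf2djvu, `encode -o` for didjvu, `--engine` and `-o` for ocrodjvu) are assumptions that should be checked against each tool's `--help` output for the installed version.

```python
from typing import List


def pdf_to_djvu(pdf_path: str, djvu_path: str) -> List[str]:
    """Command line converting an OCR 'sandwich' PDF into DjVu (assumed -o option)."""
    return ["pdf2djvu", "-o", djvu_path, pdf_path]


def scan_to_djvu(image_path: str, djvu_path: str) -> List[str]:
    """Command line turning a scanned image into a layered DjVu file
    (assumed 'encode' subcommand with -o)."""
    return ["didjvu", "encode", "-o", djvu_path, image_path]


def add_ocr_layer(in_djvu: str, out_djvu: str, engine: str = "tesseract") -> List[str]:
    """Command line adding a hidden text layer with the chosen OCR engine
    (assumed --engine and -o options)."""
    return ["ocrodjvu", "--engine", engine, "-o", out_djvu, in_djvu]


# Hypothetical file names, for illustration only.
pipeline = [
    scan_to_djvu("page.png", "page.djvu"),
    add_ocr_layer("page.djvu", "page-ocr.djvu"),
]
```

Each list could then be passed to `subprocess.run(cmd, check=True)` on a machine where the tools are installed; keeping the commands as argument lists avoids shell quoting problems with unusual file names.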

Item Type: Conference or Workshop Item (Speech)
Additional Information: The talk was presented on Thursday, 4 October 2012, from 12:00 to 13:00. A recording is available in several formats.
Depositing User: Janusz S. Bień
Date Deposited: 10 Aug 2012 08:21
Last Modified: 10 Dec 2013 20:05
URI: http://bc.klf.uw.edu.pl/id/eprint/298
