QT4 editor of tesseract-ocr box files
This is a box file editor, so you are expected to have an image and its corresponding box file. QT Box Editor can work with the following image formats:
Multipage TIFF is not supported yet. Compressed TIFF or PNG is strongly recommended for reasons of image quality and file size. If you are interested in good results, strictly follow the instructions on the Tesseract training page and use images with at least 300 DPI. Also try to avoid big files (a font size of 10-12 pt is OK; checking a lot of text can cause you to overlook errors).
You can open an image file from the menu (File->Open) or simply by drag and drop. QT Box Editor searches for the box file based on the image filename.
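The filename-based lookup can be pictured with a small Python sketch. It assumes the usual Tesseract convention that the box file sits next to the image and shares its name, with the extension replaced by ".box"; the editor's exact lookup rule is not stated here, and the helper name box_path_for is hypothetical.

```python
from pathlib import Path


def box_path_for(image_path):
    """Return the box file path assumed to belong to an image.

    Assumption: same directory, same name, last extension replaced by ".box".
    """
    return Path(image_path).with_suffix(".box")


print(box_path_for("calc.arial.exp0.tif"))
```

Only the last extension is replaced, so the font and "exp" parts of the name are preserved.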
You can create the box file outside the application (useful if you want to use additional parameters or check for possible errors). Since version 1.07, QT Box Editor can generate the box file while opening an image. First, however, you need to set the path to the tesseract language files (Edit->Settings... or Ctrl+T). If they are located in “C:/usr/projects/BuildFolder/tesseract-ocr/tessdata”, then enter “C:/usr/projects/BuildFolder/tesseract-ocr/” (the application removes “tessdata” anyway if you select “C:/usr/projects/BuildFolder/tesseract-ocr/tessdata”) and click the “Check” button. If QT Box Editor finds language data, it lists them in the combobox so that you can choose which one tesseract will use.
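The path handling described above (dropping a trailing “tessdata” folder) might look roughly like the following Python sketch. This is only an illustration of the described behaviour, not the application's actual code; the function name tessdata_prefix is hypothetical.

```python
from pathlib import PurePosixPath


def tessdata_prefix(selected):
    """Return the parent folder of "tessdata", with a trailing slash.

    If the selected folder is itself named "tessdata", step one level up;
    otherwise keep the folder as entered.
    """
    path = PurePosixPath(selected)
    if path.name == "tessdata":
        path = path.parent
    return str(path) + "/"


print(tessdata_prefix("C:/usr/projects/BuildFolder/tesseract-ocr/tessdata"))
```

Both forms of the path from the example above end up as the same prefix.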
Note: On Windows, QT Box Editor is linked against the tesseract 3.02 library and leptonica 1.68. If you build the application from source, make sure you use tesseract version 3.01 or 3.02.
The QT Box Editor workspace has several parts:
If you press F1, a dialogue with the available shortcuts is displayed. It is also available via the menu Help->Shortcut list.
In the table you can see the box file information. If you click a cell (a single box feature), its information is visualized in the viewer: the box rectangle is drawn together with the symbol/letter (if the “Show symbol” option/button is on). This gives you the opportunity to check the box information (the size of the box and whether the symbol matches its image).
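To make the table/viewer relationship concrete, here is a minimal Python sketch of reading one box file row and turning it into a rectangle that a viewer could draw. It assumes the tesseract 3.x box format with six space-separated fields (symbol, left, bottom, right, top, page) and coordinates measured from the bottom-left corner of the image; the names Box, parse_box_line and to_image_rect are hypothetical and are not taken from the QT Box Editor sources.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one line, assumed to be 'symbol left bottom right top page'."""
    symbol, *numbers = line.strip().rsplit(" ", 5)
    return Box(symbol, *(int(n) for n in numbers))


def to_image_rect(box, image_height):
    """Convert to (x, y, width, height) with the origin at the top-left."""
    return (box.left, image_height - box.top,
            box.right - box.left, box.top - box.bottom)


box = parse_box_line("a 128 1555 255 1582 0")
print(box)
print(to_image_rect(box, image_height=2000))
```

The vertical flip is needed because most image viewers place the origin at the top-left corner, while the box file counts rows from the bottom.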
If something is wrong, just start typing (in the correct table cell) and then press the Enter or Tab key.
You can also set the font style (bold, italic, underlined) if needed. This action works on multiple rows. Just select the rows (one cell from each row is enough) with the mouse or keyboard and click the desired button (or use the shortcut or the menu). The Ctrl and Shift keys can help you make the selection.
A “Find” function (Edit->Find or Ctrl+F) is available to search for the selected symbol.
If you train on real scans, your image is usually not ideal and tesseract is not able to create a good box file (e.g. some symbols do not have a box, a box is split into several parts, there are boxes for noise, etc.).
For this case there is a toolbar that helps you operate on whole box records (rows):
These actions do not support multirow selection (yet). Even if you select several rows, the action is applied to the last selected one.
Warning: There is no Undo in QT Box Editor yet, so you have to be careful and save the box file wisely.
QT Box Editor is a visual editor that helps the user understand what he/she is doing. The box rectangle is drawn automatically based on the table selection.
When you open a box file for the first time, you need to check the quality of the generated boxes. If you want to see all boxes, just use the “Show boxes” function (View->Show boxes or Ctrl+H).
To get the best view you need to zoom in/out. The application offers several zooming options (by step, to fit width, to fit height, to the selected box).
You can also select a box in the viewer: just click the box location in the image and the corresponding box will be selected in the table editor (the box must exist). Multibox selection from the viewer is not supported (it is supported only in the table editor).
Colors and fonts can be adjusted in the Settings dialogue.
If your image contains several types of fonts, you have a problem: you are not following the tesseract training instructions. But there is a workaround: mark symbols as bold, italic or underlined and use the menu function “Split to boxfiles” (File->Split to boxfiles). This creates several box files. E.g. if your (main) box filename is ‘calc.arial.exp0.box’, it will create (if you used those styles):
Then you need to create copies (or symlinks on Linux) of the main image, and you can run training with these commands:
Creating copies/symlinks is important because tesseract searches for box files based on the input image name, and the second argument is used for the output filename (tr file).
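Creating the image copies can be scripted. The following Python sketch copies the main image once for every split box file, so that each box file has an image with a matching name. It is an illustration only: the function name copy_image_for_boxes is hypothetical, and the box file name in the example is derived from the name calc.arialnormal.exp0 used further below.

```python
import shutil
from pathlib import Path


def copy_image_for_boxes(main_image, box_files):
    """Copy the main image next to each box file under a matching name.

    Existing files (including the main image itself) are left untouched.
    Returns the list of newly created image paths.
    """
    main_image = Path(main_image)
    created = []
    for box in map(Path, box_files):
        target = box.with_suffix(main_image.suffix)
        if target != main_image and not target.exists():
            shutil.copyfile(main_image, target)
            created.append(target)
    return created


if __name__ == "__main__":
    # Requires calc.arial.exp0.tif in the current directory.
    print(copy_image_for_boxes("calc.arial.exp0.tif",
                               ["calc.arialnormal.exp0.box"]))
```

On Linux, `os.symlink` could be used instead of `shutil.copyfile` to save disk space.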
You will get a lot of warnings like this:
APPLY_BOXES: Unlabelled word at :Bounding box=(128,1555)->(255,1582)
This is because there is no box data for part of the image (e.g. calc.arialnormal.exp0 lacks all box data for bold, italic and underlined symbols). If you want to avoid these warnings, you need to remove (clear) the unboxed symbols in an image editor (e.g. GIMP).
If you get an error message like this:
APPLY_BOXES: boxfile line 16/$ ((1943,2674),(2037,2758)): FAILURE! Couldn't find a matching blob
you should check that region. You can visualize it with the “Draw Rectangle...” function (Edit->Draw Rectangle... or Ctrl+R). A new dialogue opens with 4 fields where you can enter these coordinates (bounding box).
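If there are many such messages, the four numbers can be extracted with a short Python sketch like the one below. The regular expression is written against the single message shown above, and reading “16/$” as box file line number and symbol is an assumption; the name failed_box is hypothetical. Which dialogue field takes which number is not covered here.

```python
import re

# Matches e.g. "boxfile line 16/$ ((1943,2674),(2037,2758))"
FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(\S+) \(\((\d+),(\d+)\),\((\d+),(\d+)\)\)"
)


def failed_box(message):
    """Return (line_number, symbol, (x1, y1, x2, y2)) or None if no match."""
    match = FAILURE_RE.search(message)
    if match is None:
        return None
    coords = tuple(int(match.group(i)) for i in range(3, 7))
    return int(match.group(1)), match.group(2), coords


msg = ("APPLY_BOXES: boxfile line 16/$ ((1943,2674),(2037,2758)): "
       "FAILURE! Couldn't find a matching blob")
print(failed_box(msg))
```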