tesseract v5.3.3.20231005
tesseract::PageIterator Class Reference

#include <pageiterator.h>

Inheritance diagram for tesseract::PageIterator:
Derived classes shown in the diagram: tesseract::LTRResultIterator, tesseract::ResultIterator, tesseract::MutableIterator

Public Member Functions

 PageIterator (PAGE_RES *page_res, Tesseract *tesseract, int scale, int scaled_yres, int rect_left, int rect_top, int rect_width, int rect_height)
 
virtual ~PageIterator ()
 
 PageIterator (const PageIterator &src)
 
const PageIterator & operator= (const PageIterator &src)
 
bool PositionedAtSameWord (const PAGE_RES_IT *other) const
 
virtual void Begin ()
 
virtual void RestartParagraph ()
 
bool IsWithinFirstTextlineOfParagraph () const
 
virtual void RestartRow ()
 
virtual bool Next (PageIteratorLevel level)
 
virtual bool IsAtBeginningOf (PageIteratorLevel level) const
 
virtual bool IsAtFinalElement (PageIteratorLevel level, PageIteratorLevel element) const
 
int Cmp (const PageIterator &other) const
 
void SetBoundingBoxComponents (bool include_upper_dots, bool include_lower_dots)
 
bool BoundingBox (PageIteratorLevel level, int *left, int *top, int *right, int *bottom) const
 
bool BoundingBox (PageIteratorLevel level, int padding, int *left, int *top, int *right, int *bottom) const
 
bool BoundingBoxInternal (PageIteratorLevel level, int *left, int *top, int *right, int *bottom) const
 
bool Empty (PageIteratorLevel level) const
 
PolyBlockType BlockType () const
 
Pta * BlockPolygon () const
 
Pix * GetBinaryImage (PageIteratorLevel level) const
 
Pix * GetImage (PageIteratorLevel level, int padding, Pix *original_img, int *left, int *top) const
 
bool Baseline (PageIteratorLevel level, int *x1, int *y1, int *x2, int *y2) const
 
void RowAttributes (float *row_height, float *descenders, float *ascenders) const
 
void Orientation (tesseract::Orientation *orientation, tesseract::WritingDirection *writing_direction, tesseract::TextlineOrder *textline_order, float *deskew_angle) const
 
void ParagraphInfo (tesseract::ParagraphJustification *justification, bool *is_list_item, bool *is_crown, int *first_line_indent) const
 
bool SetWordBlamerBundle (BlamerBundle *blamer_bundle)
 

Protected Member Functions

void BeginWord (int offset)
 

Protected Attributes

PAGE_RES * page_res_
 
Tesseract * tesseract_
 
PAGE_RES_IT * it_
 
WERD * word_
 
int word_length_
 
int blob_index_
 
C_BLOB_IT * cblob_it_
 
bool include_upper_dots_
 
bool include_lower_dots_
 
int scale_
 
int scaled_yres_
 
int rect_left_
 
int rect_top_
 
int rect_width_
 
int rect_height_
 

Detailed Description

Class to iterate over tesseract page structure, providing access to all levels of the page hierarchy, without including any tesseract headers or having to handle any tesseract structures.

WARNING! This class points to data held within the TessBaseAPI class, and therefore can only be used while the TessBaseAPI class still exists and has not been subjected to a call of Init, SetImage, Recognize, Clear, End, DetectOS, or anything else that changes the internal PAGE_RES.

See tesseract/publictypes.h for the definition of PageIteratorLevel. See also ResultIterator, derived from PageIterator, which adds in the ability to access OCR output with text-specific methods.

Definition at line 50 of file pageiterator.h.

Constructor & Destructor Documentation

◆ PageIterator() [1/2]

tesseract::PageIterator::PageIterator ( PAGE_RES *  page_res,
Tesseract *  tesseract,
int  scale,
int  scaled_yres,
int  rect_left,
int  rect_top,
int  rect_width,
int  rect_height 
)

page_res and tesseract come directly from the BaseAPI. The rectangle parameters are copied indirectly from the Thresholder, via the BaseAPI. They represent the coordinates of some rectangle in an original image (in top-left-origin coordinates) and therefore the top-left needs to be added to any output boxes in order to specify coordinates in the original image. See TessBaseAPI::SetRectangle. The scale and scaled_yres are in case the Thresholder scaled the image rectangle prior to thresholding. Any coordinates in tesseract's image must be divided by scale before adding (rect_left, rect_top). The scaled_yres indicates the effective resolution of the binary image that tesseract has been given by the Thresholder. After the constructor, Begin has already been called.

Definition at line 30 of file pageiterator.cpp.

33 : page_res_(page_res),
34 tesseract_(tesseract),
35 word_(nullptr),
36 word_length_(0),
37 blob_index_(0),
38 cblob_it_(nullptr),
39 include_upper_dots_(false),
40 include_lower_dots_(false),
41 scale_(scale),
42 scaled_yres_(scaled_yres),
43 rect_left_(rect_left),
44 rect_top_(rect_top),
45 rect_width_(rect_width),
46 rect_height_(rect_height) {
47 it_ = new PAGE_RES_IT(page_res);
48 PageIterator::Begin();
49}

◆ ~PageIterator()

tesseract::PageIterator::~PageIterator ( )
virtual

Definition at line 51 of file pageiterator.cpp.

51 {
52 delete it_;
53 delete cblob_it_;
54}

◆ PageIterator() [2/2]

tesseract::PageIterator::PageIterator ( const PageIterator &  src)

Page/ResultIterators may be copied! This makes it possible to iterate over all the objects at a lower level, while maintaining an iterator to objects at a higher level. These constructors DO NOT CALL Begin, so iterations will continue from the location of src.

Definition at line 61 of file pageiterator.cpp.

62 : page_res_(src.page_res_),
63 tesseract_(src.tesseract_),
64 word_(nullptr),
65 word_length_(src.word_length_),
66 blob_index_(src.blob_index_),
67 cblob_it_(nullptr),
68 include_upper_dots_(src.include_upper_dots_),
69 include_lower_dots_(src.include_lower_dots_),
70 scale_(src.scale_),
71 scaled_yres_(src.scaled_yres_),
72 rect_left_(src.rect_left_),
73 rect_top_(src.rect_top_),
74 rect_width_(src.rect_width_),
75 rect_height_(src.rect_height_) {
76 it_ = new PAGE_RES_IT(*src.it_);
77 BeginWord(src.blob_index_);
78}

Member Function Documentation

◆ Baseline()

bool tesseract::PageIterator::Baseline ( PageIteratorLevel  level,
int *  x1,
int *  y1,
int *  x2,
int *  y2 
) const

Returns the baseline of the current object at the given level. The baseline is the line that passes through (x1, y1) and (x2, y2). WARNING: with vertical text, baselines may be vertical! Returns false if there is no baseline at the current position.
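
Because a baseline may be vertical, a slope computed as (y2 - y1) / (x2 - x1) can divide by zero. The following is a minimal sketch of a safer way to turn the two returned points into an angle. The helper name BaselineAngleRadians is an assumption made for illustration and is not part of Tesseract.

```cpp
#include <cmath>

// Hypothetical helper, not part of the Tesseract API.
// Returns the angle in radians of the line through (x1, y1) and (x2, y2),
// expressed in image coordinates, where y grows downward. std::atan2 copes
// with the vertical case (x1 == x2) without dividing by zero.
double BaselineAngleRadians(int x1, int y1, int x2, int y2) {
  return std::atan2(static_cast<double>(y2 - y1),
                    static_cast<double>(x2 - x1));
}
```

A horizontal baseline gives an angle of zero, and a baseline running straight down the image gives Pi/2.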

Definition at line 534 of file pageiterator.cpp.

535 {
536 if (it_->word() == nullptr) {
537 return false; // Already at the end!
538 }
539 ROW *row = it_->row()->row;
540 WERD *word = it_->word()->word;
541 TBOX box = (level == RIL_WORD || level == RIL_SYMBOL) ? word->bounding_box()
542 : row->bounding_box();
543 int left = box.left();
544 ICOORD startpt(left, static_cast<int16_t>(row->base_line(left) + 0.5));
545 int right = box.right();
546 ICOORD endpt(right, static_cast<int16_t>(row->base_line(right) + 0.5));
547 // Rotate to image coordinates and convert to global image coords.
548 startpt.rotate(it_->block()->block->re_rotation());
549 endpt.rotate(it_->block()->block->re_rotation());
550 *x1 = startpt.x() / scale_ + rect_left_;
551 *y1 = (rect_height_ - startpt.y()) / scale_ + rect_top_;
552 *x2 = endpt.x() / scale_ + rect_left_;
553 *y2 = (rect_height_ - endpt.y()) / scale_ + rect_top_;
554 return true;
555}

◆ Begin()

void tesseract::PageIterator::Begin ( )
virtual

Moves the iterator to point to the start of the page to begin an iteration.

Reimplemented in tesseract::ResultIterator.

Definition at line 105 of file pageiterator.cpp.

105 {
106 it_->restart_page_with_empties();
107 BeginWord(0);
108}

◆ BeginWord()

void tesseract::PageIterator::BeginWord ( int  offset)
protected

Sets up the internal data for iterating the blobs of a new word, then moves the iterator to the given offset.

Definition at line 636 of file pageiterator.cpp.

636 {
637 WERD_RES *word_res = it_->word();
638 if (word_res == nullptr) {
639 // This is a non-text block, so there is no word.
640 word_length_ = 0;
641 blob_index_ = 0;
642 word_ = nullptr;
643 return;
644 }
645 if (word_res->best_choice != nullptr) {
646 // Recognition has been done, so we are using the box_word, which
647 // is already baseline denormalized.
648 word_length_ = word_res->best_choice->length();
649 if (word_res->box_word != nullptr) {
650 if (word_res->box_word->length() != static_cast<unsigned>(word_length_)) {
651 tprintf("Corrupted word! best_choice[len=%d] = %s, box_word[len=%d]: ",
652 word_length_, word_res->best_choice->unichar_string().c_str(),
653 word_res->box_word->length());
654 word_res->box_word->bounding_box().print();
655 }
656 ASSERT_HOST(word_res->box_word->length() ==
657 static_cast<unsigned>(word_length_));
658 }
659 word_ = nullptr;
660 // We will be iterating the box_word.
661 delete cblob_it_;
662 cblob_it_ = nullptr;
663 } else {
664 // No recognition yet, so a "symbol" is a cblob.
665 word_ = word_res->word;
666 ASSERT_HOST(word_->cblob_list() != nullptr);
667 word_length_ = word_->cblob_list()->length();
668 if (cblob_it_ == nullptr) {
669 cblob_it_ = new C_BLOB_IT;
670 }
671 cblob_it_->set_to_list(word_->cblob_list());
672 }
673 for (blob_index_ = 0; blob_index_ < offset; ++blob_index_) {
674 if (cblob_it_ != nullptr) {
675 cblob_it_->forward();
676 }
677 }
678}

◆ BlockPolygon()

Pta * tesseract::PageIterator::BlockPolygon ( ) const

Returns the polygon outline of the current block. The returned Pta must be ptaDestroy-ed after use. Note that the returned Pta lists the vertices of the polygon, and the last edge is the line segment between the last point and the first point. nullptr will be returned if the iterator is at the end of the document or layout analysis was not used.
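
Since the closing edge from the last vertex back to the first is implicit, any loop over the polygon's edges has to wrap around. The sketch below shows that wrap-around with the shoelace formula. It uses a plain vector of points in place of a Leptonica Pta, and the Vertex type and TwicePolygonArea function are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Stand-in for one Pta entry; hypothetical type for illustration only.
struct Vertex {
  long x;
  long y;
};

// Shoelace formula over a vertex list whose last edge implicitly joins the
// last point back to the first. Returns twice the enclosed area, so the
// result stays an exact integer.
long TwicePolygonArea(const std::vector<Vertex> &pts) {
  long sum = 0;
  const std::size_t n = pts.size();
  for (std::size_t i = 0; i < n; ++i) {
    const Vertex &a = pts[i];
    const Vertex &b = pts[(i + 1) % n];  // wraps to pts[0] for the last edge
    sum += a.x * b.y - b.x * a.y;
  }
  return std::labs(sum);
}
```

For a 4 by 3 rectangle listed as four corners, the function returns 24, which is twice the area of 12.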

Definition at line 400 of file pageiterator.cpp.

400 {
401 if (it_->block() == nullptr || it_->block()->block == nullptr) {
402 return nullptr; // Already at the end!
403 }
404 if (it_->block()->block->pdblk.poly_block() == nullptr) {
405 return nullptr; // No layout analysis used - no polygon.
406 }
407 // Copy polygon, so we can unrotate it to image coordinates.
408 POLY_BLOCK *internal_poly = it_->block()->block->pdblk.poly_block();
409 ICOORDELT_LIST vertices;
410 vertices.deep_copy(internal_poly->points(), ICOORDELT::deep_copy);
411 POLY_BLOCK poly(&vertices, internal_poly->isA());
412 poly.rotate(it_->block()->block->re_rotation());
413 ICOORDELT_IT it(poly.points());
414 Pta *pta = ptaCreate(it.length());
415 int num_pts = 0;
416 for (it.mark_cycle_pt(); !it.cycled_list(); it.forward(), ++num_pts) {
417 ICOORD *pt = it.data();
418 // Convert to top-down coords within the input image.
419 int x = static_cast<float>(pt->x()) / scale_ + rect_left_;
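420 int y = rect_top_ + rect_height_ - static_cast<float>(pt->y()) / scale_;
421 x = ClipToRange(x, rect_left_, rect_left_ + rect_width_);
422 y = ClipToRange(y, rect_top_, rect_top_ + rect_height_);
423 ptaAddPt(pta, x, y);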
424 }
425 return pta;
426}

◆ BlockType()

PolyBlockType tesseract::PageIterator::BlockType ( ) const

Returns the type of the current block. See tesseract/publictypes.h for PolyBlockType.

Definition at line 388 of file pageiterator.cpp.

388 {
389 if (it_->block() == nullptr || it_->block()->block == nullptr) {
390 return PT_UNKNOWN; // Already at the end!
391 }
392 if (it_->block()->block->pdblk.poly_block() == nullptr) {
393 return PT_FLOWING_TEXT; // No layout analysis used - assume text.
394 }
395 return it_->block()->block->pdblk.poly_block()->isA();
396}

◆ BoundingBox() [1/2]

bool tesseract::PageIterator::BoundingBox ( PageIteratorLevel  level,
int *  left,
int *  top,
int *  right,
int *  bottom 
) const

Returns the bounding rectangle of the current object at the given level, in coordinates of the original image. See comment on coordinate system above. Returns false if there is no such object at the current position. The returned bounding box is guaranteed to match the size and position of the image returned by GetBinaryImage, but may clip foreground pixels from a grey image. The padding argument to GetImage can be used to expand the image to include more foreground pixels. See GetImage below.

Definition at line 349 of file pageiterator.cpp.

350 {
351 return BoundingBox(level, 0, left, top, right, bottom);
352}

◆ BoundingBox() [2/2]

bool tesseract::PageIterator::BoundingBox ( PageIteratorLevel  level,
int  padding,
int *  left,
int *  top,
int *  right,
int *  bottom 
) const

Definition at line 354 of file pageiterator.cpp.

356 {
357 if (!BoundingBoxInternal(level, left, top, right, bottom)) {
358 return false;
359 }
360 // Convert to the coordinate system of the original image.
361 *left = ClipToRange(*left / scale_ + rect_left_ - padding, rect_left_,
362 rect_left_ + rect_width_);
363 *top = ClipToRange(*top / scale_ + rect_top_ - padding, rect_top_,
364 rect_top_ + rect_height_);
365 *right = ClipToRange((*right + scale_ - 1) / scale_ + rect_left_ + padding,
366 *left, rect_left_ + rect_width_);
367 *bottom = ClipToRange((*bottom + scale_ - 1) / scale_ + rect_top_ + padding,
368 *top, rect_top_ + rect_height_);
369 return true;
370}
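
The conversion in the listing above can be reproduced outside Tesseract to check the rounding by hand. The following is a sketch only: RectParams, Box, and ToOriginalImage are hypothetical stand-ins for the iterator's protected members and are not part of the API, and the local ClipToRange is a plain int version of the helper the listing calls.

```cpp
#include <algorithm>

// Hypothetical stand-ins for scale_, rect_left_, rect_top_, rect_width_ and
// rect_height_.
struct RectParams {
  int scale;
  int rect_left;
  int rect_top;
  int rect_width;
  int rect_height;
};

struct Box {
  int left;
  int top;
  int right;
  int bottom;
};

// Clamps x into [lower, upper].
int ClipToRange(int x, int lower, int upper) {
  return std::min(std::max(x, lower), upper);
}

// Maps a box from the (possibly scaled) working image to the original image.
// left/top round down and right/bottom round up, so the converted box still
// covers the scaled one. The result is clipped to the image rectangle.
Box ToOriginalImage(const Box &in, int padding, const RectParams &p) {
  Box out;
  out.left = ClipToRange(in.left / p.scale + p.rect_left - padding,
                         p.rect_left, p.rect_left + p.rect_width);
  out.top = ClipToRange(in.top / p.scale + p.rect_top - padding,
                        p.rect_top, p.rect_top + p.rect_height);
  out.right = ClipToRange(
      (in.right + p.scale - 1) / p.scale + p.rect_left + padding, out.left,
      p.rect_left + p.rect_width);
  out.bottom = ClipToRange(
      (in.bottom + p.scale - 1) / p.scale + p.rect_top + padding, out.top,
      p.rect_top + p.rect_height);
  return out;
}
```

With scale 2 and a rectangle at (10, 20) of size 100 by 80, the working-image box (21, 11, 41, 31) maps to (20, 25, 31, 36) when padding is 0. A large padding is absorbed by the clipping, so the left and top edges never fall outside the rectangle.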

◆ BoundingBoxInternal()

bool tesseract::PageIterator::BoundingBoxInternal ( PageIteratorLevel  level,
int *  left,
int *  top,
int *  right,
int *  bottom 
) const

Returns the bounding rectangle of the current object at the given level in the coordinates of the working image, that is, pix_binary(). That coordinate system has its origin at (rect_left_, rect_top_) with respect to the original image and is scaled by a factor scale_. Returns false if there is no such object at the current position.

Definition at line 286 of file pageiterator.cpp.

288 {
289 if (Empty(level)) {
290 return false;
291 }
292 TBOX box;
293 PARA *para = nullptr;
294 switch (level) {
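295 case RIL_BLOCK:
296 box = it_->block()->block->restricted_bounding_box(include_upper_dots_,
297 include_lower_dots_);
298 break;
299 case RIL_PARA:
300 para = it_->row()->row->para();
301 // Fall through.
302 case RIL_TEXTLINE:
303 box = it_->row()->row->restricted_bounding_box(include_upper_dots_,
304 include_lower_dots_);
305 break;
306 case RIL_WORD:
307 box = it_->word()->word->restricted_bounding_box(include_upper_dots_,
308 include_lower_dots_);
309 break;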
310 case RIL_SYMBOL:
311 if (cblob_it_ == nullptr) {
312 box = it_->word()->box_word->BlobBox(blob_index_);
313 } else {
314 box = cblob_it_->data()->bounding_box();
315 }
316 }
317 if (level == RIL_PARA) {
318 PageIterator other = *this;
319 other.Begin();
320 do {
321 if (other.it_->block() &&
322 other.it_->block()->block == it_->block()->block &&
323 other.it_->row() && other.it_->row()->row &&
324 other.it_->row()->row->para() == para) {
325 box = box.bounding_union(other.it_->row()->row->bounding_box());
326 }
327 } while (other.Next(RIL_TEXTLINE));
328 }
329 if (level != RIL_SYMBOL || cblob_it_ != nullptr) {
330 box.rotate(it_->block()->block->re_rotation());
331 }
332 // Now we have a box in tesseract coordinates relative to the image rectangle,
333 // we have to convert the coords to a top-down system.
334 const int pix_height = pixGetHeight(tesseract_->pix_binary());
335 const int pix_width = pixGetWidth(tesseract_->pix_binary());
336 *left = ClipToRange(static_cast<int>(box.left()), 0, pix_width);
337 *top = ClipToRange(pix_height - box.top(), 0, pix_height);
338 *right = ClipToRange(static_cast<int>(box.right()), *left, pix_width);
339 *bottom = ClipToRange(pix_height - box.bottom(), *top, pix_height);
340 return true;
341}
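
The last step of the listing above converts a box whose y axis points up into the top-down system used for output. The sketch below isolates that flip. BottomUpBox, TopDownBox, and FlipToTopDown are hypothetical names used for illustration, with plain structs standing in for TBOX.

```cpp
#include <algorithm>

// Hypothetical stand-in for TBOX: y grows upward, so top > bottom.
struct BottomUpBox {
  int left;
  int bottom;
  int right;
  int top;
};

// Output rectangle: y grows downward, so bottom > top.
struct TopDownBox {
  int left;
  int top;
  int right;
  int bottom;
};

// Clamps x into [lower, upper].
int Clip(int x, int lower, int upper) {
  return std::min(std::max(x, lower), upper);
}

// Flips the y axis against the image height and clips to the image, in the
// same order as the listing: right is clipped against left, bottom against
// top, so the result is never inverted.
TopDownBox FlipToTopDown(const BottomUpBox &box, int pix_width,
                         int pix_height) {
  TopDownBox out;
  out.left = Clip(box.left, 0, pix_width);
  out.top = Clip(pix_height - box.top, 0, pix_height);
  out.right = Clip(box.right, out.left, pix_width);
  out.bottom = Clip(pix_height - box.bottom, out.top, pix_height);
  return out;
}
```

In a 200 by 100 image, a bottom-up box with bottom 20 and top 60 becomes a top-down box with top 40 and bottom 80.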

◆ Cmp()

int tesseract::PageIterator::Cmp ( const PageIterator &  other) const

Returns whether this iterator is positioned
before other: -1
equal to other: 0
after other: 1

Definition at line 253 of file pageiterator.cpp.

253 {
254 int word_cmp = it_->cmp(*other.it_);
255 if (word_cmp != 0) {
256 return word_cmp;
257 }
258 if (blob_index_ < other.blob_index_) {
259 return -1;
260 }
261 if (blob_index_ == other.blob_index_) {
262 return 0;
263 }
264 return 1;
265}
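
The comparison order in the listing above (word first, then symbol index within the word) can be modelled with two integers. This is a sketch under that assumption: Position and InRange are hypothetical names, and the word comparison is reduced to comparing indices.

```cpp
// Hypothetical model of an iterator position: which word, and which symbol
// (blob) inside that word.
struct Position {
  int word_index;
  int blob_index;
};

// Mirrors the order of comparisons in the listing: word first, then blob.
// Returns -1, 0 or 1.
int Cmp(const Position &a, const Position &b) {
  if (a.word_index != b.word_index) {
    return a.word_index < b.word_index ? -1 : 1;
  }
  if (a.blob_index != b.blob_index) {
    return a.blob_index < b.blob_index ? -1 : 1;
  }
  return 0;
}

// True if pos lies in the half-open range [begin, end).
bool InRange(const Position &pos, const Position &begin, const Position &end) {
  return Cmp(begin, pos) <= 0 && Cmp(pos, end) < 0;
}
```

A three-way result like this is convenient for range checks such as InRange, where one call decides "not before begin" and another decides "strictly before end".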

◆ Empty()

bool tesseract::PageIterator::Empty ( PageIteratorLevel  level) const

Returns whether there is no object of a given level.

Definition at line 373 of file pageiterator.cpp.

373 {
374 if (it_->block() == nullptr) {
375 return true; // Already at the end!
376 }
377 if (it_->word() == nullptr && level != RIL_BLOCK) {
378 return true; // image block
379 }
380 if (level == RIL_SYMBOL && blob_index_ >= word_length_) {
381 return true; // Zero length word, or already at the end of it.
382 }
383 return false;
384}

◆ GetBinaryImage()

Pix * tesseract::PageIterator::GetBinaryImage ( PageIteratorLevel  level) const

Returns a binary image of the current object at the given level. The position and size match the return from BoundingBoxInternal, and so this could be upscaled with respect to the original input image. Use pixDestroy to delete the image after use.

The following methods are used to generate the images:
RIL_BLOCK: mask the page image with the block polygon.
RIL_TEXTLINE: Clip the rectangle of the line box from the page image. TODO(rays) fix this to generate and use a line polygon.
RIL_WORD: Clip the rectangle of the word box from the page image.
RIL_SYMBOL: Render the symbol outline to an image for cblobs (prior to recognition) or the bounding box otherwise.

A reconstruction of the original image (using xor to check for double representation) should be reasonably accurate, apart from removed noise, at the block level. Below the block level, the reconstruction will be missing images and line separators. At the symbol level, kerned characters will invade the bounding box if rendered after recognition, making an xor reconstruction inaccurate, but an or construction better. Before recognition, symbol-level reconstruction should be good, even with xor, since the images come from the connected components.

Definition at line 450 of file pageiterator.cpp.

450 {
451 int left, top, right, bottom;
452 if (!BoundingBoxInternal(level, &left, &top, &right, &bottom)) {
453 return nullptr;
454 }
455 if (level == RIL_SYMBOL && cblob_it_ != nullptr &&
456 cblob_it_->data()->area() != 0) {
457 return cblob_it_->data()->render();
458 }
459 Box *box = boxCreate(left, top, right - left, bottom - top);
460 Image pix = pixClipRectangle(tesseract_->pix_binary(), box, nullptr);
461 boxDestroy(&box);
462 if (level == RIL_BLOCK || level == RIL_PARA) {
463 // Clip to the block polygon as well.
464 TBOX mask_box;
465 Image mask = it_->block()->block->render_mask(&mask_box);
466 int mask_x = left - mask_box.left();
467 int mask_y = top - (tesseract_->ImageHeight() - mask_box.top());
468 // AND the mask and pix, putting the result in pix.
469 pixRasterop(pix, std::max(0, -mask_x), std::max(0, -mask_y),
470 pixGetWidth(pix), pixGetHeight(pix), PIX_SRC & PIX_DST, mask,
471 std::max(0, mask_x), std::max(0, mask_y));
472 mask.destroy();
473 }
474 return pix;
475}
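
The difference between an xor and an or reconstruction shows up as soon as two symbol images overlap, as happens with kerned characters. The sketch below uses one byte as a row of eight pixels. The Row alias and both function names are assumptions made for illustration and have nothing to do with Leptonica's Pix.

```cpp
#include <cstdint>

// One row of a tiny bitmap, one bit per pixel; hypothetical stand-in for Pix.
using Row = std::uint8_t;

// Combines two symbol rows with xor: pixels set in both inputs cancel out.
Row ComposeXor(Row a, Row b) {
  return static_cast<Row>(a ^ b);
}

// Combines two symbol rows with or: pixels set in either input stay set.
Row ComposeOr(Row a, Row b) {
  return static_cast<Row>(a | b);
}
```

With rows 0x07 and 0x0C, which share one pixel, xor gives 0x0B and loses the shared pixel, while or gives 0x0F and keeps it.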

◆ GetImage()

Pix * tesseract::PageIterator::GetImage ( PageIteratorLevel  level,
int  padding,
Pix *  original_img,
int *  left,
int *  top 
) const

Returns an image of the current object at the given level in greyscale if available in the input. To guarantee a binary image use GetBinaryImage.

NOTE that in order to give the best possible image, the bounds are expanded slightly over the binary connected component, by the supplied padding, so the top-left position of the returned image is returned in (left,top). These will most likely not match the coordinates returned by BoundingBox.

If you do not supply an original image, you will get a binary one. Use pixDestroy to delete the image after use.

Definition at line 488 of file pageiterator.cpp.

489 {
490 int right, bottom;
491 if (!BoundingBox(level, left, top, &right, &bottom)) {
492 return nullptr;
493 }
494 if (original_img == nullptr) {
495 return GetBinaryImage(level);
496 }
497
498 // Expand the box.
499 *left = std::max(*left - padding, 0);
500 *top = std::max(*top - padding, 0);
501 right = std::min(right + padding, rect_width_);
502 bottom = std::min(bottom + padding, rect_height_);
503 Box *box = boxCreate(*left, *top, right - *left, bottom - *top);
504 Image grey_pix = pixClipRectangle(original_img, box, nullptr);
505 boxDestroy(&box);
506 if (level == RIL_BLOCK || level == RIL_PARA) {
507 // Clip to the block polygon as well.
508 TBOX mask_box;
509 Image mask = it_->block()->block->render_mask(&mask_box);
510 // Copy the mask registered correctly into an image the size of grey_pix.
511 int mask_x = *left - mask_box.left();
512 int mask_y = *top - (pixGetHeight(original_img) - mask_box.top());
513 int width = pixGetWidth(grey_pix);
514 int height = pixGetHeight(grey_pix);
515 Image resized_mask = pixCreate(width, height, 1);
516 pixRasterop(resized_mask, std::max(0, -mask_x), std::max(0, -mask_y), width,
517 height, PIX_SRC, mask, std::max(0, mask_x),
518 std::max(0, mask_y));
519 mask.destroy();
520 pixDilateBrick(resized_mask, resized_mask, 2 * padding + 1,
521 2 * padding + 1);
522 pixInvert(resized_mask, resized_mask);
523 pixSetMasked(grey_pix, resized_mask, UINT32_MAX);
524 resized_mask.destroy();
525 }
526 return grey_pix;
527}

◆ IsAtBeginningOf()

bool tesseract::PageIterator::IsAtBeginningOf ( PageIteratorLevel  level) const
virtual

Returns true if the iterator is at the start of an object at the given level. Possible uses include determining if a call to Next(RIL_WORD) moved to the start of a RIL_PARA.

For instance, suppose an iterator it is pointed to the first symbol of the first word of the third line of the second paragraph of the first block in a page, then:
it.IsAtBeginningOf(RIL_BLOCK) = false
it.IsAtBeginningOf(RIL_PARA) = false
it.IsAtBeginningOf(RIL_TEXTLINE) = true
it.IsAtBeginningOf(RIL_WORD) = true
it.IsAtBeginningOf(RIL_SYMBOL) = true

Reimplemented in tesseract::ResultIterator.

Definition at line 194 of file pageiterator.cpp.

194 {
195 if (it_->block() == nullptr) {
196 return false; // Already at the end!
197 }
198 if (it_->word() == nullptr) {
199 return true; // In an image block.
200 }
201 switch (level) {
202 case RIL_BLOCK:
203 return blob_index_ == 0 && it_->block() != it_->prev_block();
204 case RIL_PARA:
205 return blob_index_ == 0 &&
206 (it_->block() != it_->prev_block() ||
207 it_->row()->row->para() != it_->prev_row()->row->para());
208 case RIL_TEXTLINE:
209 return blob_index_ == 0 && it_->row() != it_->prev_row();
210 case RIL_WORD:
211 return blob_index_ == 0;
212 case RIL_SYMBOL:
213 return true;
214 }
215 return false;
216}
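
The rule behind the example in the description can be stated with indices: an object at some level starts at the current position only if every finer level is at index zero. The following is a simplified model of that rule, not the implementation shown in the listing, which compares against the previous row and block instead. Level and Position are hypothetical local types.

```cpp
// Local mirror of the five page hierarchy levels; the real enum is
// PageIteratorLevel in tesseract/publictypes.h.
enum Level { kBlock, kPara, kTextline, kWord, kSymbol };

// Hypothetical position: zero-based index of each object within its parent.
struct Position {
  int para_in_block;
  int line_in_para;
  int word_in_line;
  int symbol_in_word;
};

// An object at the given level starts here only if every finer level is at
// index 0.
bool IsAtBeginningOf(const Position &p, Level level) {
  switch (level) {
    case kBlock:
      return p.para_in_block == 0 && p.line_in_para == 0 &&
             p.word_in_line == 0 && p.symbol_in_word == 0;
    case kPara:
      return p.line_in_para == 0 && p.word_in_line == 0 &&
             p.symbol_in_word == 0;
    case kTextline:
      return p.word_in_line == 0 && p.symbol_in_word == 0;
    case kWord:
      return p.symbol_in_word == 0;
    case kSymbol:
      return true;
  }
  return false;
}
```

The position from the description (second paragraph, third line, first word, first symbol) is Position{1, 2, 0, 0}, and the model gives the same five answers as the example.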

◆ IsAtFinalElement()

bool tesseract::PageIterator::IsAtFinalElement ( PageIteratorLevel  level,
PageIteratorLevel  element 
) const
virtual

Returns whether the iterator is positioned at the last element in a given level. (e.g. the last word in a line, the last line in a block)

For example, take this two-paragraph text:

Here's some two-paragraph example text. It starts off innocuously enough but quickly turns bizarre.

The author inserts a cornucopia of words to guard against confused references.

Now take an iterator it pointed to the start of "bizarre."
it.IsAtFinalElement(RIL_PARA, RIL_SYMBOL) = false
it.IsAtFinalElement(RIL_PARA, RIL_WORD) = true
it.IsAtFinalElement(RIL_BLOCK, RIL_WORD) = false

Reimplemented in tesseract::ResultIterator.

Definition at line 222 of file pageiterator.cpp.

223 {
224 if (Empty(element)) {
225 return true; // Already at the end!
226 }
227 // The result is true if we step forward by element and find we are
228 // at the end of the page or at beginning of *all* levels in:
229 // [level, element).
230 // When there is more than one level difference between element and level,
231 // we could for instance move forward one symbol and still be at the first
232 // word on a line, so we also have to be at the first symbol in a word.
233 PageIterator next(*this);
234 next.Next(element);
235 if (next.Empty(element)) {
236 return true; // Reached the end of the page.
237 }
238 while (element > level) {
239 element = static_cast<PageIteratorLevel>(element - 1);
240 if (!next.IsAtBeginningOf(element)) {
241 return false;
242 }
243 }
244 return true;
245}
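
The strategy in the listing above (step forward by element, then require the new position to be at the beginning of every level in [level, element)) can be tried on a flat model of a page. Everything in this sketch is an assumption made for illustration: a page is a vector of symbols, each tagged with the ids of its containing block, paragraph, line, and word, and positions are plain indices.

```cpp
#include <cstddef>
#include <vector>

enum Level { kBlock = 0, kPara, kTextline, kWord, kSymbol };

// One symbol, tagged with the id of the block, paragraph, line and word that
// contain it. Ids are unique across the page at each level. Hypothetical
// flat model; Tesseract stores the hierarchy differently.
struct Symbol {
  int ids[4];  // indexed by kBlock..kWord
};

using Page = std::vector<Symbol>;

// True if an object at the given level starts at pos.
bool IsAtBeginningOf(const Page &page, std::size_t pos, Level level) {
  if (level == kSymbol || pos == 0) {
    return true;
  }
  return page[pos].ids[level] != page[pos - 1].ids[level];
}

// Returns the position of the next object at the given level, or page.size()
// if the end of the page was reached. Requires pos < page.size().
std::size_t Next(const Page &page, std::size_t pos, Level level) {
  if (level == kSymbol) {
    return pos + 1;
  }
  const int current = page[pos].ids[level];
  while (pos < page.size() && page[pos].ids[level] == current) {
    ++pos;
  }
  return pos;
}

// True if the element at pos is the last one inside its enclosing level.
bool IsAtFinalElement(const Page &page, std::size_t pos, Level level,
                      Level element) {
  const std::size_t next = Next(page, pos, element);
  if (next >= page.size()) {
    return true;  // Reached the end of the page.
  }
  // Must be at the beginning of every level in [level, element).
  for (int e = element - 1; e >= level; --e) {
    if (!IsAtBeginningOf(page, next, static_cast<Level>(e))) {
      return false;
    }
  }
  return true;
}
```

Building a page from the words "turns" and "bizarre." in one paragraph and "The" in the next, then asking at the first symbol of "bizarre.", reproduces the three results from the example in the description.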

◆ IsWithinFirstTextlineOfParagraph()

bool tesseract::PageIterator::IsWithinFirstTextlineOfParagraph ( ) const

Return whether this iterator points anywhere in the first textline of a paragraph.

Definition at line 125 of file pageiterator.cpp.

125 {
126 PageIterator p_start(*this);
127 p_start.RestartParagraph();
128 return p_start.it_->row() == it_->row();
129}

◆ Next()

bool tesseract::PageIterator::Next ( PageIteratorLevel  level)
virtual

Moves to the start of the next object at the given level in the page hierarchy, and returns false if the end of the page was reached.

The two source comments on this function disagree about non-text blocks. One says that RIL_SYMBOL will skip non-text blocks, but all other PageIteratorLevel level values will visit each non-text block once. The other, marked NOTE (CHANGED!), says that ALL PageIteratorLevel level values will visit each non-text block at least once. The listing below follows the second: when the current word is null, level is set to RIL_BLOCK. Think of non-text blocks as containing a single para, with at least one line, with a single imaginary word, containing a single symbol. The bounding boxes mark out any polygonal nature of the block, and PTIsTextType(BlockType()) is false for non-text blocks.

Calls to Next with different levels may be freely intermixed. This function iterates words in right-to-left scripts correctly, if the appropriate language has been loaded into Tesseract.

Reimplemented in tesseract::ResultIterator.

Definition at line 149 of file pageiterator.cpp.

149 {
150 if (it_->block() == nullptr) {
151 return false; // Already at the end!
152 }
153 if (it_->word() == nullptr) {
154 level = RIL_BLOCK;
155 }
156
157 switch (level) {
158 case RIL_BLOCK:
159 it_->forward_block();
160 break;
161 case RIL_PARA:
162 it_->forward_paragraph();
163 break;
164 case RIL_TEXTLINE:
165 for (it_->forward_with_empties(); it_->row() == it_->prev_row();
166 it_->forward_with_empties()) {
167 ;
168 }
169 break;
170 case RIL_WORD:
171 it_->forward_with_empties();
172 break;
173 case RIL_SYMBOL:
174 if (cblob_it_ != nullptr) {
175 cblob_it_->forward();
176 }
177 ++blob_index_;
178 if (blob_index_ >= word_length_) {
179 it_->forward_with_empties();
180 } else {
181 return true;
182 }
183 break;
184 }
185 BeginWord(0);
186 return it_->block() != nullptr;
187}
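
Because Next returns false at the end of the page and levels may be intermixed, a common pattern is a do/while loop per level, with a copy of the outer iterator walking the finer level. The sketch below shows that pattern on a toy two-level iterator. ToyIterator, Word, and LinesOf are hypothetical and only imitate the Next, Empty, and IsAtBeginningOf contract described here; they are not Tesseract classes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum Level { kTextline, kWord };

struct Word {
  int line_id;
  std::string text;
};

// Hypothetical two-level iterator over a flat word list. Copying it is cheap
// and the copy advances independently of the original.
class ToyIterator {
 public:
  explicit ToyIterator(const std::vector<Word> *words)
      : words_(words), pos_(0) {}

  bool Empty() const { return pos_ >= words_->size(); }
  const Word &Current() const { return (*words_)[pos_]; }

  // Moves to the start of the next object at the given level and returns
  // false once the end of the page is reached.
  bool Next(Level level) {
    if (Empty()) {
      return false;
    }
    const int line = Current().line_id;
    ++pos_;
    if (level == kTextline) {
      while (!Empty() && Current().line_id == line) {
        ++pos_;
      }
    }
    return !Empty();
  }

  // Only valid while the iterator is not Empty.
  bool IsAtBeginningOfLine() const {
    return pos_ == 0 || (*words_)[pos_ - 1].line_id != Current().line_id;
  }

 private:
  const std::vector<Word> *words_;
  std::size_t pos_;
};

// Joins the words of each line, using an outer iterator for lines and a copy
// of it for the words inside the current line.
std::vector<std::string> LinesOf(const std::vector<Word> &words) {
  std::vector<std::string> lines;
  ToyIterator line_it(&words);
  if (line_it.Empty()) {
    return lines;
  }
  do {
    std::string text;
    ToyIterator word_it = line_it;  // the copy continues from the same spot
    do {
      if (!text.empty()) {
        text += ' ';
      }
      text += word_it.Current().text;
    } while (word_it.Next(kWord) && !word_it.IsAtBeginningOfLine());
    lines.push_back(text);
  } while (line_it.Next(kTextline));
  return lines;
}
```

The inner loop stops either at the end of the page or when the copy crosses into the next line, while the outer iterator stays on the line it was on until its own Next call.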

◆ operator=()

const PageIterator & tesseract::PageIterator::operator= ( const PageIterator &  src)

Definition at line 80 of file pageiterator.cpp.

80 {
81 page_res_ = src.page_res_;
82 tesseract_ = src.tesseract_;
83 include_upper_dots_ = src.include_upper_dots_;
84 include_lower_dots_ = src.include_lower_dots_;
85 scale_ = src.scale_;
86 scaled_yres_ = src.scaled_yres_;
87 rect_left_ = src.rect_left_;
88 rect_top_ = src.rect_top_;
89 rect_width_ = src.rect_width_;
90 rect_height_ = src.rect_height_;
91 delete it_;
92 it_ = new PAGE_RES_IT(*src.it_);
93 BeginWord(src.blob_index_);
94 return *this;
95}

◆ Orientation()

void tesseract::PageIterator::Orientation ( tesseract::Orientation *  orientation,
tesseract::WritingDirection *  writing_direction,
tesseract::TextlineOrder *  textline_order,
float *  deskew_angle 
) const

Returns orientation for the block the iterator points to.

orientation, writing_direction, textline_order: see publictypes.h
deskew_angle: after rotating the block so the text orientation is upright, how many radians does one have to rotate the block anti-clockwise for it to be level? -Pi/4 <= deskew_angle <= Pi/4

Definition at line 565 of file pageiterator.cpp.

568 {
569 auto *block_res = it_->block();
570 if (block_res == nullptr) {
571 // Nothing can be done, so return default values.
572 *orientation = ORIENTATION_PAGE_UP;
573 *writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
574 *textline_order = TEXTLINE_ORDER_TOP_TO_BOTTOM;
575 return;
576 }
577 auto *block = block_res->block;
578
579 // Orientation
580 FCOORD up_in_image(0.0, 1.0);
581 up_in_image.unrotate(block->classify_rotation());
582 up_in_image.rotate(block->re_rotation());
583
584 if (up_in_image.x() == 0.0F) {
585 if (up_in_image.y() > 0.0F) {
586 *orientation = ORIENTATION_PAGE_UP;
587 } else {
588 *orientation = ORIENTATION_PAGE_DOWN;
589 }
590 } else if (up_in_image.x() > 0.0F) {
591 *orientation = ORIENTATION_PAGE_RIGHT;
592 } else {
593 *orientation = ORIENTATION_PAGE_LEFT;
594 }
595
596 // Writing direction
597 bool is_vertical_text = (block->classify_rotation().x() == 0.0);
598 bool right_to_left = block->right_to_left();
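599 *writing_direction = is_vertical_text
600 ? WRITING_DIRECTION_TOP_TO_BOTTOM
601 : (right_to_left ? WRITING_DIRECTION_RIGHT_TO_LEFT
602 : WRITING_DIRECTION_LEFT_TO_RIGHT);
603
604 // Textline Order
605 const bool is_mongolian = false; // TODO(eger): fix me
606 *textline_order = is_vertical_text
607 ? (is_mongolian ? TEXTLINE_ORDER_LEFT_TO_RIGHT
608 : TEXTLINE_ORDER_RIGHT_TO_LEFT)
609 : TEXTLINE_ORDER_TOP_TO_BOTTOM;
610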
611 // Deskew angle
612 FCOORD skew = block->skew(); // true horizontal for textlines
613 *deskew_angle = -skew.angle();
614}
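
The orientation decision in the listing above depends only on the rotated "up" vector, and deskew_angle is reported in radians. The sketch below isolates both pieces. PageOrientation, FromUpVector, and RadiansToDegrees are hypothetical names, and the local enum only mirrors tesseract::Orientation from publictypes.h.

```cpp
#include <cmath>

// Local mirror of the four page orientations; the real enum is
// tesseract::Orientation in publictypes.h.
enum PageOrientation { kPageUp, kPageRight, kPageDown, kPageLeft };

// Same decision order as the listing: y only matters when x is exactly zero,
// otherwise the sign of x decides between right and left.
PageOrientation FromUpVector(float x, float y) {
  if (x == 0.0F) {
    return y > 0.0F ? kPageUp : kPageDown;
  }
  return x > 0.0F ? kPageRight : kPageLeft;
}

// Converts an angle such as deskew_angle from radians to degrees.
double RadiansToDegrees(double radians) {
  const double pi = std::acos(-1.0);
  return radians * 180.0 / pi;
}
```

The documented range of -Pi/4 to Pi/4 for deskew_angle corresponds to -45 to 45 degrees.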
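
The orientation branch in the listing above reduces to a sign test on a rotated "up" vector. The following is a minimal, library-free sketch of that decision, together with a helper that converts deskew_angle from radians to degrees. The names PageOrientation, ClassifyUp and DeskewDegrees are hypothetical stand-ins introduced for illustration; they are not part of Tesseract.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for tesseract::Orientation (see publictypes.h).
enum class PageOrientation { Up, Right, Down, Left };

// Mirrors the sign test in the listing: x == 0 means the vector is
// vertical, so the sign of y decides up/down; otherwise the sign of x
// decides right/left.
PageOrientation ClassifyUp(float x, float y) {
  if (x == 0.0F) {
    return y > 0.0F ? PageOrientation::Up : PageOrientation::Down;
  }
  return x > 0.0F ? PageOrientation::Right : PageOrientation::Left;
}

// Converts a deskew angle from radians to degrees. The documented range
// is [-pi/4, pi/4]; the assert allows a small tolerance for rounding.
double DeskewDegrees(double deskew_radians) {
  const double kPi = 3.14159265358979323846;
  assert(std::fabs(deskew_radians) <= kPi / 4 + 1e-9);
  return deskew_radians * 180.0 / kPi;
}
```

With this sketch, ClassifyUp(0.0F, 1.0F) corresponds to ORIENTATION_PAGE_UP, and a deskew angle of Pi/4 maps to 45 degrees.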

◆ ParagraphInfo()

void tesseract::PageIterator::ParagraphInfo ( tesseract::ParagraphJustification *  justification,
bool *  is_list_item,
bool *  is_crown,
int *  first_line_indent 
) const

Returns information about the current paragraph, if available.

justification - LEFT if ragged right, or fully justified and script is left-to-right. RIGHT if ragged left, or fully justified and script is right-to-left. UNKNOWN if it looks like source code or we have very few lines.
is_list_item - true if we believe this is a member of an ordered or unordered list.
is_crown - true if the first line of the paragraph is aligned with the other lines of the paragraph even though subsequent paragraphs have first line indents. This typically indicates that this is the continuation of a previous paragraph or that it is the very first paragraph in the chapter.
first_line_indent - For LEFT aligned paragraphs, the first text line of paragraphs of this kind is indented this many pixels from the left edge of the rest of the paragraph. For RIGHT aligned paragraphs, the first text line of paragraphs of this kind is indented this many pixels from the right edge of the rest of the paragraph.
NOTE 1: This value may be negative.
NOTE 2: If *is_crown == true, the first line of this paragraph is actually flush, and first_line_indent is set to the "common" first_line_indent for subsequent paragraphs in this block of text.

Definition at line 616 of file pageiterator.cpp.

void PageIterator::ParagraphInfo(tesseract::ParagraphJustification *just,
                                 bool *is_list_item, bool *is_crown,
                                 int *first_line_indent) const {
  *just = tesseract::JUSTIFICATION_UNKNOWN;
  if (!it_->row() || !it_->row()->row || !it_->row()->row->para() ||
      !it_->row()->row->para()->model) {
    return;
  }

  PARA *para = it_->row()->row->para();
  *is_list_item = para->is_list_item;
  *is_crown = para->is_very_first_or_continuation;
  *first_line_indent = para->model->first_indent() - para->model->body_indent();
  *just = para->model->justification();
}
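
The first_line_indent value in the listing above is the difference between two model indents, and NOTE 2 changes how it should be read for crown paragraphs. The sketch below works through both points with plain integers. IndentModel, FirstLineIndent and IndentOfThisParagraph are hypothetical names used only for illustration, not Tesseract types.

```cpp
// Hypothetical stand-in for the two ParagraphModel values used above.
struct IndentModel {
  int first_indent;  // indent of the first line, in pixels
  int body_indent;   // indent of the remaining lines, in pixels
};

// Same subtraction as in the listing: first line relative to the body.
// A hanging indent (first line further out than the body) gives a
// negative result, which is why NOTE 1 allows negative values.
int FirstLineIndent(const IndentModel& model) {
  return model.first_indent - model.body_indent;
}

// Per NOTE 2, the first line of a crown paragraph is flush, so the indent
// that applies to that particular paragraph is zero even though
// first_line_indent reports the common value for the block.
int IndentOfThisParagraph(bool is_crown, int first_line_indent) {
  return is_crown ? 0 : first_line_indent;
}
```

For example, a model with first_indent 40 and body_indent 10 yields a first_line_indent of 30, while first_indent 0 and body_indent 25 yields -25.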

◆ PositionedAtSameWord()

bool tesseract::PageIterator::PositionedAtSameWord ( const PAGE_RES_IT *  other) const

Are we positioned at the same location as other?

Definition at line 97 of file pageiterator.cpp.

bool PageIterator::PositionedAtSameWord(const PAGE_RES_IT *other) const {
  return (it_ == nullptr && it_ == other) ||
         ((other != nullptr) && (it_ != nullptr) && (*it_ == *other));
}
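
The comparison above treats two positions as the same when both pointers are null, or when both are non-null and the pointed-to iterators compare equal. A generic sketch of that rule, using a hypothetical helper named SamePosition that is not part of Tesseract:

```cpp
// Two optional positions compare equal when both are absent, or when both
// are present and the pointed-to values are equal.
template <typename T>
bool SamePosition(const T* a, const T* b) {
  if (a == nullptr || b == nullptr) {
    return a == b;  // true only if both are null
  }
  return *a == *b;
}
```

A null pointer compared with a non-null one is never "the same word", which matches the behavior of the expression in the listing.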

◆ RestartParagraph()

void tesseract::PageIterator::RestartParagraph ( )
virtual

Moves the iterator to the beginning of the paragraph. This class implements this functionality by moving it to the zero indexed blob of the first (leftmost) word on the first row of the paragraph.

Definition at line 110 of file pageiterator.cpp.

void PageIterator::RestartParagraph() {
  if (it_->block() == nullptr) {
    return; // At end of the document.
  }
  PAGE_RES_IT para(page_res_);
  PAGE_RES_IT next_para(para);
  next_para.forward_paragraph();
  while (next_para.cmp(*it_) <= 0) {
    para = next_para;
    next_para.forward_paragraph();
  }
  *it_ = para;
  BeginWord(0);
}
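
The loop above finds the paragraph start by scanning forward from the beginning of the page and keeping the last paragraph start that does not lie past the current position. The same idea can be sketched on plain integers, assuming paragraph starts are given as an ascending list of positions. ParagraphStartFor is a hypothetical helper, and the explicit end-of-list check stands in for whatever PAGE_RES_IT::cmp does at the end of the page.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the start position of the paragraph containing `current`.
// `starts` holds paragraph start positions in ascending order and must be
// non-empty, with starts.front() <= current.
int ParagraphStartFor(const std::vector<int>& starts, int current) {
  assert(!starts.empty() && starts.front() <= current);
  std::size_t para = 0;  // candidate paragraph, like `para` in the listing
  std::size_t next = 1;  // lookahead, like `next_para` in the listing
  while (next < starts.size() && starts[next] <= current) {
    para = next;
    ++next;
  }
  return starts[para];
}
```

The scan is linear in the number of paragraphs before the current position, which mirrors the listing: it restarts from the first paragraph of the page rather than walking backwards.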

◆ RestartRow()

void tesseract::PageIterator::RestartRow ( )
virtual

Moves the iterator to the beginning of the text line. This class implements this functionality by moving it to the zero indexed blob of the first (leftmost) word of the row.

Definition at line 131 of file pageiterator.cpp.

void PageIterator::RestartRow() {
  it_->restart_row();
  BeginWord(0);
}

◆ RowAttributes()

void tesseract::PageIterator::RowAttributes ( float *  row_height,
float *  descenders,
float *  ascenders 
) const

Definition at line 557 of file pageiterator.cpp.

void PageIterator::RowAttributes(float *row_height, float *descenders,
                                 float *ascenders) const {
  *row_height = it_->row()->row->x_height() + it_->row()->row->ascenders() -
                it_->row()->row->descenders();
  *descenders = it_->row()->row->descenders();
  *ascenders = it_->row()->row->ascenders();
}
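
The row height in the listing above is the x-height plus the ascender value minus the descender value. A worked sketch follows. It assumes that descenders is stored as a non-positive offset below the baseline, which the source text does not state but which is what would make the subtraction add to the height. RowMetrics and RowHeight are hypothetical names, not Tesseract types.

```cpp
#include <cassert>

// Hypothetical stand-in for the three row values read in the listing.
struct RowMetrics {
  float x_height;    // height of lowercase letters such as 'x'
  float ascenders;   // extent above the x-height
  float descenders;  // assumed non-positive: extent below the baseline
};

// Same arithmetic as in the listing.
float RowHeight(const RowMetrics& m) {
  assert(m.x_height >= 0.0F);  // sanity check for this sketch only
  return m.x_height + m.ascenders - m.descenders;
}
```

For example, with x_height 20, ascenders 8 and descenders -6, the row height comes out as 20 + 8 - (-6) = 34.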

◆ SetBoundingBoxComponents()

void tesseract::PageIterator::SetBoundingBoxComponents ( bool  include_upper_dots,
bool  include_lower_dots 
)
inline

Controls what to include in a bounding box. Bounding boxes of all levels between RIL_WORD and RIL_BLOCK can include or exclude potential diacritics. Between layout analysis and recognition, it isn't known where all diacritics belong, so this control is used to include or exclude some diacritics that are above or below the main body of the word. In most cases where the placement is obvious, and after recognition, it doesn't make as much difference, as the diacritics will already be included in the word.

Definition at line 188 of file pageiterator.h.

void SetBoundingBoxComponents(bool include_upper_dots, bool include_lower_dots) {
  include_upper_dots_ = include_upper_dots;
  include_lower_dots_ = include_lower_dots;
}

◆ SetWordBlamerBundle()

bool tesseract::PageIterator::SetWordBlamerBundle ( BlamerBundle *  blamer_bundle)

Definition at line 680 of file pageiterator.cpp.

bool PageIterator::SetWordBlamerBundle(BlamerBundle *blamer_bundle) {
  if (it_->word() != nullptr) {
    it_->word()->blamer_bundle = blamer_bundle;
    return true;
  } else {
    return false;
  }
}

Member Data Documentation

◆ blob_index_

int tesseract::PageIterator::blob_index_
protected

The current blob index within the word.

Definition at line 343 of file pageiterator.h.

◆ cblob_it_

C_BLOB_IT* tesseract::PageIterator::cblob_it_
protected

Iterator to the blobs within the word. If nullptr, then we are iterating OCR results in the box_word. Owned by this ResultIterator.

Definition at line 349 of file pageiterator.h.

◆ include_lower_dots_

bool tesseract::PageIterator::include_lower_dots_
protected

Definition at line 352 of file pageiterator.h.

◆ include_upper_dots_

bool tesseract::PageIterator::include_upper_dots_
protected

Control over what to include in bounding boxes.

Definition at line 351 of file pageiterator.h.

◆ it_

PAGE_RES_IT* tesseract::PageIterator::it_
protected

The iterator to the page_res_. Owned by this ResultIterator. A pointer just to avoid dragging in Tesseract includes.

Definition at line 334 of file pageiterator.h.

◆ page_res_

PAGE_RES* tesseract::PageIterator::page_res_
protected

Pointer to the page_res owned by the API.

Definition at line 327 of file pageiterator.h.

◆ rect_height_

int tesseract::PageIterator::rect_height_
protected

Definition at line 359 of file pageiterator.h.

◆ rect_left_

int tesseract::PageIterator::rect_left_
protected

Definition at line 356 of file pageiterator.h.

◆ rect_top_

int tesseract::PageIterator::rect_top_
protected

Definition at line 357 of file pageiterator.h.

◆ rect_width_

int tesseract::PageIterator::rect_width_
protected

Definition at line 358 of file pageiterator.h.

◆ scale_

int tesseract::PageIterator::scale_
protected

Parameters saved from the Thresholder. Needed to rebuild coordinates.

Definition at line 354 of file pageiterator.h.

◆ scaled_yres_

int tesseract::PageIterator::scaled_yres_
protected

Definition at line 355 of file pageiterator.h.

◆ tesseract_

Tesseract* tesseract::PageIterator::tesseract_
protected

Pointer to the Tesseract object owned by the API.

Definition at line 329 of file pageiterator.h.

◆ word_

WERD* tesseract::PageIterator::word_
protected

The current input WERD being iterated. If there is an output from OCR, then word_ is nullptr. Owned by the API.

Definition at line 339 of file pageiterator.h.

◆ word_length_

int tesseract::PageIterator::word_length_
protected

The length of the current word_.

Definition at line 341 of file pageiterator.h.


The documentation for this class was generated from the following files: