tesseract v5.3.3.20231005
pdfrenderer.cpp

// File:        pdfrenderer.cpp
// Description: PDF rendering interface to inject into TessBaseAPI
//
// (C) Copyright 2011, Google Inc.
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//

// Include automatically generated configuration file if running autoconf.
#ifdef HAVE_CONFIG_H
#  include "config_auto.h"
#endif

#include "pdf_ttf.h"
#include "tprintf.h"

#include <allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/publictypes.h> // for PTIsTextType()
#include <tesseract/renderer.h>
#include <cmath>
#include <cstring>
#include <fstream>   // for std::ifstream
#include <locale>    // for std::locale::classic
#include <memory>    // std::unique_ptr
#include <sstream>   // for std::stringstream
#include "helpers.h" // for Swap

/*

Design notes from Ken Sharp, with light editing.

We think one solution is a font with a single glyph (.notdef) and a
CIDToGIDMap which maps all the CIDs to 0. That map would then be
stored as a stream in the PDF file, and when Flate compressed should
be pretty small. The font, of course, will be approximately the same
size as the one you currently use.

I'm working on such a font now, the CIDToGIDMap is trivial, you just
create a stream object which contains 128k bytes (2 bytes per possible
CID and your CIDs range from 0 to 65535) and where you currently have
"/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".

Note that if, in future, you were to use a different (i.e. not 2 byte)
CMap for character codes you could trivially extend the CIDToGIDMap.

The following is an explanation of how some of the font stuff works,
this may be too simple for you in which case please accept my
apologies, it's hard to know how much knowledge someone has. You can
skip all this anyway, it's just for information.

The font embedded in a PDF file is usually intended just to be
rendered, but extensions allow for at least some ability to locate (or
copy) text from a document. This isn't something which was an original
goal of the PDF format, but it's been retro-fitted, presumably due to
popular demand.

To do this reliably the PDF file must contain a ToUnicode CMap, a
device for mapping character codes to Unicode code points. If one of
these is present, then this will be used to convert the character
codes into Unicode values. If it's not present then the reader will
fall back through a series of heuristics to try and guess the
result. This is, as you would expect, prone to failure.

This doesn't concern you of course, since you always write a ToUnicode
CMap, so because you are writing the text in text rendering mode 3 it
would seem that you don't really need to worry about this, but in the
PDF spec you cannot have an isolated ToUnicode CMap, it has to be
attached to a font, so in order to get even copy/paste to work you
need to define a font.

This is what leads to problems, tools like pdfwrite assume that they
are going to be able to (or even have to) modify the font entries, so
they require that the font being embedded be valid, and to be honest
the font Tesseract embeds isn't valid (for this purpose).


To see why let's look at how text is specified in a PDF file:

(Test) Tj

Now that looks like text but actually it isn't. Each of those bytes is
a 'character code'. When it comes to rendering the text a complex
sequence of events takes place, which converts the character code into
'something' which the font understands. It's entirely possible via
character mappings to have that text render as 'Sftu'.

For simple fonts (PostScript type 1), we use the character code as the
index into an Encoding array (256 elements), each element of which is
a glyph name, so this gives us a glyph name. We then consult the
CharStrings dictionary in the font, that's a complex object which
contains pairs of keys and values, you can use the key to retrieve a
given value. So we have a glyph name, we then use that as the key to
the dictionary and retrieve the associated value. For a type 1 font,
the value is a glyph program that describes how to draw the glyph.

For CIDFonts, it's a little more complicated. Because CIDFonts can be
large, using a glyph name as the key is unreasonable (it would also
lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
as the key. CIDs are just numbers.

But.... We don't use the character code as the CID. What we do is use
a CMap to convert the character code into a CID. We then use the CID
to key the CharStrings dictionary and proceed as before. So the 'CMap'
is the equivalent of the Encoding array, but it's a more compact and
flexible representation.

Note that you have to use the CMap just to find out how many bytes
constitute a character code, and it can be variable. For example you
can say if the first byte is 0x00->0x7f then it's just one byte, if it's
0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
have seen CMaps defining character codes up to 5 bytes wide.

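Illustrative sketch (not part of the original notes or of Tesseract):
the variable-width rule above can be modelled as a function of the
first byte. The names CodeLength and SplitCodes are hypothetical, and
because the two ranges quoted above both include 0xf0, the sketch
assumes that 0xf0 starts a 3-byte code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: number of bytes in the character code that starts
// with first_byte, following the example ranges in the notes.
// Assumption: 0xf0 belongs to the 3-byte range.
inline std::size_t CodeLength(unsigned char first_byte) {
  if (first_byte <= 0x7f) {
    return 1;
  }
  if (first_byte < 0xf0) {
    return 2;
  }
  return 3;
}

// Splits a byte string into character codes and returns the width of each.
// A truncated trailing code is reported with however many bytes remain.
inline std::vector<std::size_t> SplitCodes(const std::vector<unsigned char> &bytes) {
  std::vector<std::size_t> widths;
  std::size_t i = 0;
  while (i < bytes.size()) {
    std::size_t n = CodeLength(bytes[i]);
    if (n > bytes.size() - i) {
      n = bytes.size() - i;
    }
    widths.push_back(n);
    i += n;
  }
  return widths;
}
```

With these assumptions, the six bytes 41 81 02 f1 02 03 split into
three codes of 1, 2 and 3 bytes. Tesseract itself sidesteps the issue
by always writing fixed 2-byte codes.
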
Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
TrueType CIDFonts. The thing is that TrueType fonts are accessed using
a Glyph ID (GID) (and the LOCA table) which may well not be anything
like the CID. So for this case PDF includes a CIDToGIDMap. That maps
the CIDs to GIDs, and we can then use the GID to get the glyph
description from the GLYF table of the font.

So for a TrueType CIDFont, character-code->CID->GID->glyf-program.

Looking at the PDF file I was supplied with we see that it contains
text like:

<0x0075> Tj

So we start by taking the character code (117) and look it up in the
CMap. Well you don't supply a CMap, you just use the Identity-H one
which is predefined. So character code 117 maps to CID 117. Then we
use the CIDToGIDMap, again you don't supply one, you just use the
predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
were supplied with only contains 116 glyphs.

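Illustrative sketch (not part of the original notes or of Tesseract):
the lookup chain for the example above, with both maps set to identity.
All function names are hypothetical; the sketch only models the index
arithmetic, not real font parsing.

```cpp
#include <cstdint>

// Identity-H CMap: the 2-byte character code is used directly as the CID.
inline std::uint16_t IdentityCMap(std::uint16_t code) {
  return code;
}

// Predefined 'Identity' CIDToGIDMap: the CID is used directly as the GID.
inline std::uint16_t IdentityCIDToGID(std::uint16_t cid) {
  return cid;
}

// Returns the glyph index selected by a character code, or -1 when the
// resulting GID does not exist in a font that holds num_glyphs glyphs
// (valid GIDs are 0 .. num_glyphs - 1).
inline int LookupGlyph(std::uint16_t code, int num_glyphs) {
  const std::uint16_t cid = IdentityCMap(code);
  const std::uint16_t gid = IdentityCIDToGID(cid);
  return gid < num_glyphs ? static_cast<int>(gid) : -1;
}
```

Here LookupGlyph(0x0075, 116) yields -1: character code 117 ends up at
GID 117, which a 116-glyph font does not contain.
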
Now for Latin that's not a huge problem, you can just supply a bigger
font. But for more complex languages that *is* going to be more of a
problem. Either you need to supply a font which contains glyphs for
all the possible CID->GID mappings, or we need to think laterally.

Our solution using a TrueType CIDFont is to intervene at the
CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
font with just one glyph, the .notdef glyph at GID 0. This is what I'm
looking into now.

It would also be possible to have a 'PostScript' (i.e. type 1 outlines)
CIDFont which contained 1 glyph, and a CMap which mapped all character
codes to CID 0. The effect would be the same.

It's possible (I haven't checked) that the PostScript CIDFont and
associated CMap would be smaller than the TrueType font and associated
CIDToGIDMap.

--- in a followup ---

OK there is a small problem there, if I use GID 0 then Acrobat gets
upset about it and complains it cannot extract the font. If I set the
CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
mad......

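Illustrative sketch (not part of the original notes): building the
uncompressed CIDToGIDMap described above, with two big-endian bytes per
CID and every entry pointing at the same GID. The function name
BuildCIDToGIDMap is hypothetical; the renderer code further down in
this file fills an equivalent buffer inline and then compresses it.

```cpp
#include <cstddef>
#include <vector>

// Returns a 128 KiB table: for each CID in 0..65535, the high byte and
// then the low byte of the glyph index that the CID maps to.
inline std::vector<unsigned char> BuildCIDToGIDMap(unsigned gid) {
  const std::size_t kNumCIDs = 1 << 16;
  std::vector<unsigned char> map(2 * kNumCIDs);
  for (std::size_t cid = 0; cid < kNumCIDs; ++cid) {
    map[2 * cid] = static_cast<unsigned char>((gid >> 8) & 0xff); // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xff);    // low byte
  }
  return map;
}
```

BuildCIDToGIDMap(1) gives the repeating byte pattern 00 01, which is
the "all entries are 1" variant from the followup. Such a regular
pattern compresses to a very small Flate stream.
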
*/

namespace tesseract {

// If the font is 10 pts, nominal character width is 5 pts
static const int kCharWidth = 2;

// Used for memory allocation. A codepoint must take no more than this
// many bytes, when written in the PDF way. e.g. "<0063>" for the
// letter 'c'
static const int kMaxBytesPerCodepoint = 20;

/**********************************************************************
 * PDF Renderer interface implementation
 **********************************************************************/
TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir, bool textonly)
    : TessResultRenderer(outputbase, "pdf"), datadir_(datadir) {
  obj_ = 0;
  textonly_ = textonly;
  offsets_.push_back(0);
}

void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
  offsets_.push_back(objectsize + offsets_.back());
  obj_++;
}

void TessPDFRenderer::AppendPDFObject(const char *data) {
  AppendPDFObjectDIY(strlen(data));
  AppendString(data);
}

// Helper function to prevent us from accidentally writing
// scientific notation to an HOCR or PDF file. Besides, three
// decimal places are all you really need.
static double prec(double x) {
  double kPrecision = 1000.0;
  double a = round(x * kPrecision) / kPrecision;
  if (a == -0) {
    return 0;
  }
  return a;
}

static long dist2(int x1, int y1, int x2, int y2) {
  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
}

// Viewers like evince can get really confused during copy-paste when
// the baseline wanders around. So I've decided to project every word
// onto the (straight) line baseline. All numbers are in the native
// PDF coordinate system, which has the origin in the bottom left and
// the unit is points, which is 1/72 inch. Tesseract reports baselines
// left-to-right no matter what the reading order is. We need the
// word baseline in reading order, so we do that conversion here. Returns
// the word's baseline origin and length.
static void GetWordBaseline(int writing_direction, int ppi, int height, int word_x1, int word_y1,
                            int word_x2, int word_y2, int line_x1, int line_y1, int line_x2,
                            int line_y2, double *x0, double *y0, double *length) {
  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
    std::swap(word_x1, word_x2);
    std::swap(word_y1, word_y2);
  }
  double word_length;
  double x, y;
  {
    int px = word_x1;
    int py = word_y1;
    double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
    if (l2 == 0) {
      x = line_x1;
      y = line_y1;
    } else {
      double t = ((px - line_x2) * (line_x2 - line_x1) + (py - line_y2) * (line_y2 - line_y1)) / l2;
      x = line_x2 + t * (line_x2 - line_x1);
      y = line_y2 + t * (line_y2 - line_y1);
    }
    word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1, word_x2, word_y2)));
    word_length = word_length * 72.0 / ppi;
    x = x * 72 / ppi;
    y = height - (y * 72.0 / ppi);
  }
  *x0 = x;
  *y0 = y;
  *length = word_length;
}

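/*
Illustrative sketch (not part of Tesseract): the two geometric steps in
GetWordBaseline, isolated so they can be checked by hand. The names
Point, ProjectOntoLine and PixelYToPdf are hypothetical; the arithmetic
follows the function above, using doubles throughout.

```cpp
#include <cassert>

struct Point {
  double x;
  double y;
};

// Orthogonal projection of p onto the infinite line through a and b.
// Falls back to a when a and b coincide, as GetWordBaseline does.
inline Point ProjectOntoLine(Point p, Point a, Point b) {
  const double dx = b.x - a.x;
  const double dy = b.y - a.y;
  const double l2 = dx * dx + dy * dy;
  if (l2 == 0) {
    return a;
  }
  // t is measured from b, matching the parametrisation used above.
  const double t = ((p.x - b.x) * dx + (p.y - b.y) * dy) / l2;
  return {b.x + t * dx, b.y + t * dy};
}

// Converts a pixel y coordinate (origin at the top left) to PDF points
// (origin at the bottom left) for a page page_height_pts points tall.
inline double PixelYToPdf(double y_px, double ppi, double page_height_pts) {
  assert(ppi > 0);
  return page_height_pts - y_px * 72.0 / ppi;
}
```

For example, projecting (5, 5) onto the line from (0, 0) to (10, 0)
gives (5, 0), and pixel row 300 at 300 ppi on a 792 pt page lands at
y = 720 pt.
*/
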
// Compute coefficients for an affine matrix describing the rotation
// of the text. If the text is right-to-left such as Arabic or Hebrew,
// we reflect over the Y-axis. This matrix will set the coordinate
// system for placing text in the PDF file.
//
//                           RTL
// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
static void AffineMatrix(int writing_direction, int line_x1, int line_y1, int line_x2, int line_y2,
                         double *a, double *b, double *c, double *d) {
  double theta =
      atan2(static_cast<double>(line_y1 - line_y2), static_cast<double>(line_x2 - line_x1));
  *a = cos(theta);
  *b = sin(theta);
  *c = -sin(theta);
  *d = cos(theta);
  switch (writing_direction) {
    case WRITING_DIRECTION_RIGHT_TO_LEFT:
      *a = -*a;
      *b = -*b;
      break;
    case WRITING_DIRECTION_TOP_TO_BOTTOM:
      // TODO(jbreiden) Consider using the vertical PDF writing mode.
      break;
    default:
      break;
  }
}

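/*
Illustrative sketch (not part of Tesseract): the same coefficients
computed by a standalone function that returns a value instead of
writing through pointers. TextMatrix and RotationFor are hypothetical
names, and the writing direction is reduced to a single bool.

```cpp
#include <cmath>

struct TextMatrix {
  double a, b, c, d;
};

// Baseline endpoints are in image coordinates (y grows downward), so the
// y difference is taken as y1 - y2 to obtain a counter-clockwise angle.
inline TextMatrix RotationFor(int x1, int y1, int x2, int y2, bool right_to_left) {
  const double theta = std::atan2(static_cast<double>(y1 - y2), static_cast<double>(x2 - x1));
  TextMatrix m{std::cos(theta), std::sin(theta), -std::sin(theta), std::cos(theta)};
  if (right_to_left) {
    // Reflect over the Y-axis: negate the first row.
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

A perfectly horizontal baseline gives the identity matrix for
left-to-right text and [-1 0; 0 1] for right-to-left text.
*/
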
// There are some really awkward PDF viewers in the wild, such as
// 'Preview' which ships with the Mac. They do a better job with text
// selection and highlighting when given perfectly flat baseline
// instead of very slightly tilted. We clip small tilts to appease
// these viewers. I chose this threshold large enough to absorb noise,
// but small enough that lines probably won't cross each other if the
// whole page is tilted at almost exactly the clipping threshold.
static void ClipBaseline(int ppi, int x1, int y1, int x2, int y2, int *line_x1, int *line_y1,
                         int *line_x2, int *line_y2) {
  *line_x1 = x1;
  *line_y1 = y1;
  *line_x2 = x2;
  *line_y2 = y2;
  int rise = abs(y2 - y1) * 72;
  int run = abs(x2 - x1) * 72;
  if (rise < 2 * ppi && 2 * ppi < run) {
    *line_y1 = *line_y2 = (y1 + y2) / 2;
  }
}

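/*
Illustrative sketch (not part of Tesseract): the clipping condition as a
predicate, to make the threshold concrete. Multiplying pixel distances
by 72 and comparing against 2 * ppi is the integer form of "rise below
2 points and run above 2 points". ShouldFlatten is a hypothetical name.

```cpp
#include <cstdlib>

// True when a baseline from (x1, y1) to (x2, y2), given in pixels at
// resolution ppi, would be flattened by the rule in ClipBaseline.
inline bool ShouldFlatten(int ppi, int x1, int y1, int x2, int y2) {
  const int rise = std::abs(y2 - y1) * 72;
  const int run = std::abs(x2 - x1) * 72;
  return rise < 2 * ppi && 2 * ppi < run;
}
```

At 300 ppi, 2 points is about 8.3 pixels, so a 1000-pixel line that
rises 8 pixels is flattened while one that rises 9 pixels is kept as is.
*/
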
static bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    tprintf("Dropping invalid codepoint %d\n", code);
    return false;
  }
  if (code < 0x10000) {
    snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
  } else {
    int a = code - 0x010000;
    int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
    int low_surrogate = (0x03FF & a) + 0xDC00;
    snprintf(utf16, kMaxBytesPerCodepoint, "%04X%04X", high_surrogate, low_surrogate);
  }
  return true;
}

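/*
Illustrative sketch (not part of Tesseract): a std::string variant of
the conversion above, convenient for checking the surrogate arithmetic
by hand. ToUtf16beHex is a hypothetical name. Unlike the function above
it also rejects negative input and signals rejection with an empty
string instead of logging.

```cpp
#include <cstdio>
#include <string>

// Returns the UTF-16BE code units of a code point as uppercase hex, or an
// empty string for surrogates and values outside U+0000..U+10FFFF.
inline std::string ToUtf16beHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return "";
  }
  char buf[9]; // at most 8 hex digits plus the terminating NUL
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(code));
  } else {
    const int a = code - 0x10000;
    const unsigned high = 0xD800u + ((a >> 10) & 0x3FF);
    const unsigned low = 0xDC00u + (a & 0x3FF);
    std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
  }
  return buf;
}
```

The letter 'c' becomes 0063. U+1F600 is above the BMP: subtracting
0x10000 leaves 0xF600, whose top ten bits (0x3D) and bottom ten bits
(0x200) give the surrogate pair D83D DE00.
*/
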
char *TessPDFRenderer::GetPDFTextObjects(TessBaseAPI *api, double width, double height) {
  double ppi = api->GetSourceYResolution();

  // These initial conditions are all arbitrary and will be overwritten
  double old_x = 0.0, old_y = 0.0;
  int old_fontsize = 0;
  tesseract::WritingDirection old_writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
  bool new_block = true;
  int fontsize = 0;
  double a = 1;
  double b = 0;
  double c = 0;
  double d = 1;

  std::stringstream pdf_str;
  // Use "C" locale (needed for double values prec()).
  pdf_str.imbue(std::locale::classic());
  // Use 8 digits for double values.
  pdf_str.precision(8);

  // TODO(jbreiden) This marries the text and image together.
  // Slightly cleaner from an abstraction standpoint if this were to
  // live inside a separate text object.
  pdf_str << "q " << prec(width) << " 0 0 " << prec(height) << " 0 0 cm";
  if (!textonly_) {
    pdf_str << " /Im1 Do";
  }
  pdf_str << " Q\n";

  int line_x1 = 0;
  int line_y1 = 0;
  int line_x2 = 0;
  int line_y2 = 0;

  const std::unique_ptr</*non-const*/ ResultIterator> res_it(api->GetIterator());
  while (!res_it->Empty(RIL_BLOCK)) {
    if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
      auto block_type = res_it->BlockType();
      if (!PTIsTextType(block_type)) {
        // ignore non-text blocks
        res_it->Next(RIL_BLOCK);
        continue;
      }
      pdf_str << "BT\n3 Tr"; // Begin text object, use invisible ink
      old_fontsize = 0;      // Every block will declare its fontsize
      new_block = true;      // Every block will declare its affine matrix
    }

    if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
      int x1, y1, x2, y2;
      res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
      ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
    }

    if (res_it->Empty(RIL_WORD)) {
      res_it->Next(RIL_WORD);
      continue;
    }

    // Writing direction changes at a per-word granularity
    tesseract::WritingDirection writing_direction;
    {
      tesseract::Orientation orientation;
      tesseract::TextlineOrder textline_order;
      float deskew_angle;
      res_it->Orientation(&orientation, &writing_direction, &textline_order, &deskew_angle);
      if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
        switch (res_it->WordDirection()) {
          case DIR_LEFT_TO_RIGHT:
            writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
            break;
          case DIR_RIGHT_TO_LEFT:
            writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
            break;
          default:
            writing_direction = old_writing_direction;
        }
      }
    }

    // Where is word origin and how long is it?
    double x, y, word_length;
    {
      int word_x1, word_y1, word_x2, word_y2;
      res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
      GetWordBaseline(writing_direction, ppi, height, word_x1, word_y1, word_x2, word_y2, line_x1,
                      line_y1, line_x2, line_y2, &x, &y, &word_length);
    }

    if (writing_direction != old_writing_direction || new_block) {
      AffineMatrix(writing_direction, line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
      pdf_str << " " << prec(a) // . This affine matrix
              << " " << prec(b) // . sets the coordinate
              << " " << prec(c) // . system for all
              << " " << prec(d) // . text that follows.
              << " " << prec(x) // .
              << " " << prec(y) // .
              << (" Tm ");      // Place cursor absolutely
      new_block = false;
    } else {
      double dx = x - old_x;
      double dy = y - old_y;
      pdf_str << " " << prec(dx * a + dy * b) << " " << prec(dx * c + dy * d)
              << (" Td "); // Relative moveto
    }
    old_x = x;
    old_y = y;
    old_writing_direction = writing_direction;

    // Adjust font size on a per word granularity. Pay attention to
    // fontsize, old_fontsize, and pdf_str. We've found that for
    // Arabic, Tesseract will happily return a fontsize of zero,
    // so we make up a default number to protect ourselves.
    {
      bool bold, italic, underlined, monospace, serif, smallcaps;
      int font_id;
      res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace, &serif, &smallcaps,
                                 &fontsize, &font_id);
      const int kDefaultFontsize = 8;
      if (fontsize <= 0) {
        fontsize = kDefaultFontsize;
      }
      if (fontsize != old_fontsize) {
        pdf_str << "/f-0-0 " << fontsize << " Tf ";
        old_fontsize = fontsize;
      }
    }

    bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
    bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
    std::string pdf_word;
    int pdf_word_len = 0;
    do {
      const std::unique_ptr<const char[]> grapheme(res_it->GetUTF8Text(RIL_SYMBOL));
      if (grapheme && grapheme[0] != '\0') {
        std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(grapheme.get());
        char utf16[kMaxBytesPerCodepoint];
        for (char32 code : unicodes) {
          if (CodepointToUtf16be(code, utf16)) {
            pdf_word += utf16;
            pdf_word_len++;
          }
        }
      }
      res_it->Next(RIL_SYMBOL);
    } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
    if (res_it->IsAtBeginningOf(RIL_WORD)) {
      pdf_word += "0020";
      pdf_word_len++;
    }
    if (word_length > 0 && pdf_word_len > 0) {
      double h_stretch = kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
      pdf_str << h_stretch << " Tz" // horizontal stretch
              << " [ <" << pdf_word // UTF-16BE representation
              << "> ] TJ";          // show the text
    }
    if (last_word_in_line) {
      pdf_str << " \n";
    }
    if (last_word_in_block) {
      pdf_str << "ET\n"; // end the text object
    }
  }
  const std::string &text = pdf_str.str();
  char *result = new char[text.length() + 1];
  strcpy(result, text.c_str());
  return result;
}

bool TessPDFRenderer::BeginDocumentHandler() {
  AppendPDFObject("%PDF-1.5\n%\xDE\xAD\xBE\xEB\n");

  // CATALOG
  AppendPDFObject(
      "1 0 obj\n"
      "<<\n"
      " /Type /Catalog\n"
      " /Pages 2 0 R\n"
      ">>\nendobj\n");

  // We are reserving object #2 for the /Pages
  // object, which I am going to create and write
  // at the end of the PDF file.
  AppendPDFObject("");

  // TYPE0 FONT
  AppendPDFObject(
      "3 0 obj\n"
      "<<\n"
      " /BaseFont /GlyphLessFont\n"
      " /DescendantFonts [ 4 0 R ]\n" // CIDFontType2 font
      " /Encoding /Identity-H\n"
      " /Subtype /Type0\n"
      " /ToUnicode 6 0 R\n" // ToUnicode
      " /Type /Font\n"
      ">>\n"
      "endobj\n");

  // CIDFONTTYPE2
  std::stringstream stream;
  // Use "C" locale (needed for int values larger than 999).
  stream.imbue(std::locale::classic());
  stream << "4 0 obj\n"
            "<<\n"
            " /BaseFont /GlyphLessFont\n"
            " /CIDToGIDMap 5 0 R\n" // CIDToGIDMap
            " /CIDSystemInfo\n"
            " <<\n"
            " /Ordering (Identity)\n"
            " /Registry (Adobe)\n"
            " /Supplement 0\n"
            " >>\n"
            " /FontDescriptor 7 0 R\n" // Font descriptor
            " /Subtype /CIDFontType2\n"
            " /Type /Font\n"
            " /DW "
         << (1000 / kCharWidth)
         << "\n"
            ">>\n"
            "endobj\n";
  AppendPDFObject(stream.str().c_str());

  // CIDTOGIDMAP
  const int kCIDToGIDMapSize = 2 * (1 << 16);
  const std::unique_ptr<unsigned char[]> cidtogidmap(new unsigned char[kCIDToGIDMapSize]);
  for (int i = 0; i < kCIDToGIDMapSize; i++) {
    cidtogidmap[i] = (i % 2) ? 1 : 0;
  }
  size_t len;
  unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
  stream.str("");
  stream << "5 0 obj\n"
            "<<\n"
            " /Length "
         << len
         << " /Filter /FlateDecode\n"
            ">>\n"
            "stream\n";
  AppendString(stream.str().c_str());
  long objsize = stream.str().size();
  AppendData(reinterpret_cast<char *>(comp), len);
  objsize += len;
  lept_free(comp);
  const char *endstream_endobj =
      "endstream\n"
      "endobj\n";
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);

  const char stream2[] =
      "/CIDInit /ProcSet findresource begin\n"
      "12 dict begin\n"
      "begincmap\n"
      "/CIDSystemInfo\n"
      "<<\n"
      " /Registry (Adobe)\n"
      " /Ordering (UCS)\n"
      " /Supplement 0\n"
      ">> def\n"
      "/CMapName /Adobe-Identify-UCS def\n"
      "/CMapType 2 def\n"
      "1 begincodespacerange\n"
      "<0000> <FFFF>\n"
      "endcodespacerange\n"
      "1 beginbfrange\n"
      "<0000> <FFFF> <0000>\n"
      "endbfrange\n"
      "endcmap\n"
      "CMapName currentdict /CMap defineresource pop\n"
      "end\n"
      "end\n";

  // TOUNICODE
  stream.str("");
  stream << "6 0 obj\n"
            "<< /Length "
         << (sizeof(stream2) - 1)
         << " >>\n"
            "stream\n"
         << stream2
         << "endstream\n"
            "endobj\n";
  AppendPDFObject(stream.str().c_str());

  // FONT DESCRIPTOR
  stream.str("");
  stream << "7 0 obj\n"
            "<<\n"
            " /Ascent 1000\n"
            " /CapHeight 1000\n"
            " /Descent -1\n" // Spec says must be negative
            " /Flags 5\n"    // FixedPitch + Symbolic
            " /FontBBox [ 0 0 "
         << (1000 / kCharWidth)
         << " 1000 ]\n"
            " /FontFile2 8 0 R\n"
            " /FontName /GlyphLessFont\n"
            " /ItalicAngle 0\n"
            " /StemV 80\n"
            " /Type /FontDescriptor\n"
            ">>\n"
            "endobj\n";
  AppendPDFObject(stream.str().c_str());

  stream.str("");
  stream << datadir_.c_str() << "/pdf.ttf";
  const uint8_t *font;
  std::ifstream input(stream.str().c_str(), std::ios::in | std::ios::binary);
  std::vector<unsigned char> buffer(std::istreambuf_iterator<char>(input), {});
  auto size = buffer.size();
  if (size) {
    font = buffer.data();
  } else {
#if !defined(NDEBUG)
    tprintf("Cannot open file \"%s\"!\nUsing internal glyphless font.\n", stream.str().c_str());
#endif
    font = pdf_ttf;
    size = sizeof(pdf_ttf);
  }

  // FONTFILE2
  stream.str("");
  stream << "8 0 obj\n"
            "<<\n"
            " /Length "
         << size
         << "\n"
            " /Length1 "
         << size
         << "\n"
            ">>\n"
            "stream\n";
  AppendString(stream.str().c_str());
  objsize = stream.str().size();
  AppendData(reinterpret_cast<const char *>(font), size);
  objsize += size;
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);
  return true;
}

bool TessPDFRenderer::imageToPDFObj(Pix *pix, const char *filename, long int objnum,
                                    char **pdf_object, long int *pdf_object_size,
                                    const int jpg_quality) {
  if (!pdf_object_size || !pdf_object) {
    return false;
  }
  *pdf_object = nullptr;
  *pdf_object_size = 0;
  if (!filename && !pix) {
    return false;
  }

  L_Compressed_Data *cid = nullptr;

  int sad = 0;
  if (pixGetInputFormat(pix) == IFF_PNG) {
    sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
  }
  if (!cid) {
    sad = l_generateCIDataForPdf(filename, pix, jpg_quality, &cid);
  }

  if (sad || !cid) {
    l_CIDataDestroy(&cid);
    return false;
  }

  const char *group4 = "";
  const char *filter;
  switch (cid->type) {
    case L_FLATE_ENCODE:
      filter = "/FlateDecode";
      break;
    case L_JPEG_ENCODE:
      filter = "/DCTDecode";
      break;
    case L_G4_ENCODE:
      filter = "/CCITTFaxDecode";
      group4 = " /K -1\n";
      break;
    case L_JP2K_ENCODE:
      filter = "/JPXDecode";
      break;
    default:
      l_CIDataDestroy(&cid);
      return false;
  }

  // Maybe someday we will accept RGBA but today is not that day.
  // It requires creating an /SMask for the alpha channel.
  // http://stackoverflow.com/questions/14220221
  std::stringstream colorspace;
  // Use "C" locale (needed for int values larger than 999).
  colorspace.imbue(std::locale::classic());
  if (cid->ncolors > 0) {
    colorspace << " /ColorSpace [ /Indexed /DeviceRGB " << (cid->ncolors - 1) << " "
               << cid->cmapdatahex << " ]\n";
  } else {
    switch (cid->spp) {
      case 1:
        if (cid->bps == 1 && pixGetInputFormat(pix) == IFF_PNG) {
          colorspace.str(
              " /ColorSpace /DeviceGray\n"
              " /Decode [1 0]\n");
        } else {
          colorspace.str(" /ColorSpace /DeviceGray\n");
        }
        break;
      case 3:
        colorspace.str(" /ColorSpace /DeviceRGB\n");
        break;
      default:
        l_CIDataDestroy(&cid);
        return false;
    }
  }

  int predictor = (cid->predictor) ? 14 : 1;

  // IMAGE
  std::stringstream b1;
  // Use "C" locale (needed for int values larger than 999).
  b1.imbue(std::locale::classic());
  b1 << objnum
     << " 0 obj\n"
        "<<\n"
        " /Length "
     << cid->nbytescomp
     << "\n"
        " /Subtype /Image\n";

  std::stringstream b2;
  // Use "C" locale (needed for int values larger than 999).
  b2.imbue(std::locale::classic());
  b2 << " /Width " << cid->w
     << "\n"
        " /Height "
     << cid->h
     << "\n"
        " /BitsPerComponent "
     << cid->bps
     << "\n"
        " /Filter "
     << filter
     << "\n"
        " /DecodeParms\n"
        " <<\n"
        " /Predictor "
     << predictor
     << "\n"
        " /Colors "
     << cid->spp << "\n"
     << group4 << " /Columns " << cid->w
     << "\n"
        " /BitsPerComponent "
     << cid->bps
     << "\n"
        " >>\n"
        ">>\n"
        "stream\n";

  const char *b3 =
      "endstream\n"
      "endobj\n";

  size_t b1_len = b1.str().size();
  size_t b2_len = b2.str().size();
  size_t b3_len = strlen(b3);
  size_t colorspace_len = colorspace.str().size();

  *pdf_object_size = b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
  *pdf_object = new char[*pdf_object_size];

  char *p = *pdf_object;
  memcpy(p, b1.str().c_str(), b1_len);
  p += b1_len;
  memcpy(p, colorspace.str().c_str(), colorspace_len);
  p += colorspace_len;
  memcpy(p, b2.str().c_str(), b2_len);
  p += b2_len;
  memcpy(p, cid->datacomp, cid->nbytescomp);
  p += cid->nbytescomp;
  memcpy(p, b3, b3_len);
  l_CIDataDestroy(&cid);
  return true;
}

bool TessPDFRenderer::AddImageHandler(TessBaseAPI *api) {
  Pix *pix = api->GetInputImage();
  const char *filename = api->GetInputName();
  int ppi = api->GetSourceYResolution();
  if (!pix || ppi <= 0) {
    return false;
  }
  double width = pixGetWidth(pix) * 72.0 / ppi;
  double height = pixGetHeight(pix) * 72.0 / ppi;

  std::stringstream xobject;
  // Use "C" locale (needed for int values larger than 999).
  xobject.imbue(std::locale::classic());
  if (!textonly_) {
    xobject << "/XObject << /Im1 " << (obj_ + 2) << " 0 R >>\n";
  }

  // PAGE
  std::stringstream stream;
  // Use "C" locale (needed for double values width and height).
  stream.imbue(std::locale::classic());
  stream.precision(2);
  stream << std::fixed << obj_
         << " 0 obj\n"
            "<<\n"
            " /Type /Page\n"
            " /Parent 2 0 R\n" // Pages object
            " /MediaBox [0 0 "
         << width << " " << height
         << "]\n"
            " /Contents "
         << (obj_ + 1)
         << " 0 R\n" // Contents object
            " /Resources\n"
            " <<\n"
            " "
         << xobject.str() << // Image object
      " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
      " /Font << /f-0-0 3 0 R >>\n" // Type0 Font
      " >>\n"
      ">>\n"
      "endobj\n";
  pages_.push_back(obj_);
  AppendPDFObject(stream.str().c_str());

  // CONTENTS
  const std::unique_ptr<char[]> pdftext(GetPDFTextObjects(api, width, height));
  const size_t pdftext_len = strlen(pdftext.get());
  size_t len;
  unsigned char *comp_pdftext =
      zlibCompress(reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
  long comp_pdftext_len = len;
  stream.str("");
  stream << obj_
         << " 0 obj\n"
            "<<\n"
            " /Length "
         << comp_pdftext_len
         << " /Filter /FlateDecode\n"
            ">>\n"
            "stream\n";
  AppendString(stream.str().c_str());
  long objsize = stream.str().size();
  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
  objsize += comp_pdftext_len;
  lept_free(comp_pdftext);
  const char *b2 =
      "endstream\n"
      "endobj\n";
  AppendString(b2);
  objsize += strlen(b2);
  AppendPDFObjectDIY(objsize);

  if (!textonly_) {
    char *pdf_object = nullptr;
    int jpg_quality;
    api->GetIntVariable("jpg_quality", &jpg_quality);
    if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize, jpg_quality)) {
      return false;
    }
    AppendData(pdf_object, objsize);
    AppendPDFObjectDIY(objsize);
    delete[] pdf_object;
  }
  return true;
}

bool TessPDFRenderer::EndDocumentHandler() {
  // We reserved the /Pages object number early, so that the /Page
  // objects could refer to their parent. We finally have enough
  // information to go fill it in. Using lower level calls to manipulate
  // the offset record in two spots, because we are placing objects
  // out of order in the file.

  // PAGES
  const long int kPagesObjectNumber = 2;
  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
  std::stringstream stream;
  // Use "C" locale (needed for int values larger than 999).
  stream.imbue(std::locale::classic());
  stream << kPagesObjectNumber << " 0 obj\n<<\n /Type /Pages\n /Kids [ ";
  AppendString(stream.str().c_str());
  size_t pages_objsize = stream.str().size();
  for (const auto &page : pages_) {
    stream.str("");
    stream << page << " 0 R ";
    AppendString(stream.str().c_str());
    pages_objsize += stream.str().size();
  }
  stream.str("");
  stream << "]\n /Count " << pages_.size() << "\n>>\nendobj\n";
  AppendString(stream.str().c_str());
  pages_objsize += stream.str().size();
  offsets_.back() += pages_objsize; // manipulation #2

  // INFO
  std::string utf16_title = "FEFF"; // byte_order_marker
  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(title());
  char utf16[kMaxBytesPerCodepoint];
  for (char32 code : unicodes) {
    if (CodepointToUtf16be(code, utf16)) {
      utf16_title += utf16;
    }
  }

  char *datestr = l_getFormattedDate();
  stream.str("");
  stream << obj_
         << " 0 obj\n"
            "<<\n"
            " /Producer (Tesseract "
         << tesseract::TessBaseAPI::Version()
         << ")\n"
            " /CreationDate (D:"
         << datestr
         << ")\n"
            " /Title <"
         << utf16_title.c_str()
         << ">\n"
            ">>\n"
            "endobj\n";
  lept_free(datestr);
  AppendPDFObject(stream.str().c_str());
  stream.str("");
  stream << "xref\n0 " << obj_ << "\n0000000000 65535 f \n";
  AppendString(stream.str().c_str());
  for (int i = 1; i < obj_; i++) {
    stream.str("");
    stream.width(10);
    stream.fill('0');
    stream << offsets_[i] << " 00000 n \n";
    AppendString(stream.str().c_str());
  }
  stream.str("");
  stream << "trailer\n<<\n /Size " << obj_
         << "\n"
            " /Root 1 0 R\n" // catalog
            " /Info "
         << (obj_ - 1)
         << " 0 R\n" // info
            ">>\nstartxref\n"
         << offsets_.back() << "\n%%EOF\n";
  AppendString(stream.str().c_str());
  return true;
}
} // namespace tesseract
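
/*
Illustrative sketch (not part of Tesseract): how a list of byte offsets
turns into the cross-reference table written by EndDocumentHandler.
FormatXref is a hypothetical name. It assumes offsets[i] is the byte
offset of object i and offsets[0] is a placeholder for object 0, the
head of the free list.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Formats a classic PDF xref section. Every entry is exactly 20 bytes:
// a 10-digit offset, a 5-digit generation number, a type flag and a
// two-character line ending (here a space followed by a newline).
inline std::string FormatXref(const std::vector<long> &offsets) {
  std::ostringstream out;
  out << "xref\n0 " << offsets.size() << "\n0000000000 65535 f \n";
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    out << std::setw(10) << std::setfill('0') << offsets[i] << " 00000 n \n";
  }
  return out.str();
}
```

For offsets {0, 15, 64} this yields a three-entry table whose in-use
entries read "0000000015 00000 n " and "0000000064 00000 n ". The
renderer keeps such a running list in offsets_ and patches the entry
for the /Pages object once its final position is known.
*/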