tesseract  4.00.00dev
pdfrenderer.cpp
1 // File: pdfrenderer.cpp
3 // Description: PDF rendering interface to inject into TessBaseAPI
4 //
5 // (C) Copyright 2011, Google Inc.
6 // Licensed under the Apache License, Version 2.0 (the "License");
7 // you may not use this file except in compliance with the License.
8 // You may obtain a copy of the License at
9 // http://www.apache.org/licenses/LICENSE-2.0
10 // Unless required by applicable law or agreed to in writing, software
11 // distributed under the License is distributed on an "AS IS" BASIS,
12 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 // See the License for the specific language governing permissions and
14 // limitations under the License.
15 //
17 
18 // Include automatically generated configuration file if running autoconf.
19 #ifdef HAVE_CONFIG_H
20 #include "config_auto.h"
21 #endif
22 
23 #include <memory> // std::unique_ptr
24 #include "allheaders.h"
25 #include "baseapi.h"
26 #include "math.h"
27 #include "renderer.h"
28 #include "strngs.h"
29 #include "tprintf.h"
30 
/*

Design notes from Ken Sharp, with light editing.

We think one solution is a font with a single glyph (.notdef) and a
CIDToGIDMap which maps all the CIDs to 0. That map would then be
stored as a stream in the PDF file, and when flate compressed should
be pretty small. The font, of course, will be approximately the same
size as the one you currently use.

I'm working on such a font now. The CIDToGIDMap is trivial: you just
create a stream object which contains 128k bytes (2 bytes per possible
CID and your CIDs range from 0 to 65535) and where you currently have
"/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".

Note that if, in future, you were to use a different (i.e. not 2 byte)
CMap for character codes you could trivially extend the CIDToGIDMap.

The following is an explanation of how some of the font stuff works.
This may be too simple for you, in which case please accept my
apologies; it's hard to know how much knowledge someone has. You can
skip all this anyway, it's just for information.

The font embedded in a PDF file is usually intended just to be
rendered, but extensions allow for at least some ability to locate (or
copy) text from a document. This isn't something which was an original
goal of the PDF format, but it's been retro-fitted, presumably due to
popular demand.

To do this reliably the PDF file must contain a ToUnicode CMap, a
device for mapping character codes to Unicode code points. If one of
these is present, then this will be used to convert the character
codes into Unicode values. If it's not present then the reader will
fall back through a series of heuristics to try and guess the
result. This is, as you would expect, prone to failure.

This doesn't concern you of course, since you always write a ToUnicode
CMap. Because you are writing the text in text rendering mode 3, it
would seem that you don't really need to worry about this, but in the
PDF spec you cannot have an isolated ToUnicode CMap; it has to be
attached to a font, so in order to get even copy/paste to work you
need to define a font.

This is what leads to problems. Tools like pdfwrite assume that they
are going to be able to (or even have to) modify the font entries, so
they require that the font being embedded be valid, and to be honest
the font Tesseract embeds isn't valid (for this purpose).


To see why, let's look at how text is specified in a PDF file:

(Test) Tj

Now that looks like text but actually it isn't. Each of those bytes is
a 'character code'. When it comes to rendering the text a complex
sequence of events takes place, which converts the character code into
'something' which the font understands. It's entirely possible via
character mappings to have that text render as 'Sftu'.

For simple fonts (PostScript type 1), we use the character code as the
index into an Encoding array (256 elements), each element of which is
a glyph name, so this gives us a glyph name. We then consult the
CharStrings dictionary in the font. That's a complex object which
contains pairs of keys and values; you can use the key to retrieve a
given value. So we have a glyph name, we then use that as the key to
the dictionary and retrieve the associated value. For a type 1 font,
the value is a glyph program that describes how to draw the glyph.

For CIDFonts, it's a little more complicated. Because CIDFonts can be
large, using a glyph name as the key is unreasonable (it would also
lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
as the key. CIDs are just numbers.

But.... We don't use the character code as the CID. What we do is use
a CMap to convert the character code into a CID. We then use the CID
to key the CharStrings dictionary and proceed as before. So the 'CMap'
is the equivalent of the Encoding array, but it's a more compact and
flexible representation.

Note that you have to use the CMap just to find out how many bytes
constitute a character code, and it can be variable. For example you
can say if the first byte is 0x00->0x7f then it's just one byte, if it's
0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
have seen CMaps defining character codes up to 5 bytes wide.

Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
TrueType CIDFonts. The thing is that TrueType fonts are accessed using
a Glyph ID (GID) (and the LOCA table) which may well not be anything
like the CID. So for this case PDF includes a CIDToGIDMap. That maps
the CIDs to GIDs, and we can then use the GID to get the glyph
description from the GLYF table of the font.

So for a TrueType CIDFont, character-code->CID->GID->glyf-program.

Looking at the PDF file I was supplied with we see that it contains
text like:

<0x0075> Tj

So we start by taking the character code (117) and look it up in the
CMap. Well you don't supply a CMap, you just use the Identity-H one
which is predefined. So character code 117 maps to CID 117. Then we
use the CIDToGIDMap. Again you don't supply one, you just use the
predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
were supplied with only contains 116 glyphs.

Now for Latin that's not a huge problem, you can just supply a bigger
font. But for more complex languages that *is* going to be more of a
problem. Either you need to supply a font which contains glyphs for
all the possible CID->GID mappings, or we need to think laterally.

Our solution using a TrueType CIDFont is to intervene at the
CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
font with just one glyph, the .notdef glyph at GID 0. This is what I'm
looking into now.

It would also be possible to have a 'PostScript' (i.e. type 1 outlines)
CIDFont which contained 1 glyph, and a CMap which mapped all character
codes to CID 0. The effect would be the same.

It's possible (I haven't checked) that the PostScript CIDFont and
associated CMap would be smaller than the TrueType font and associated
CIDToGIDMap.

--- in a followup ---

OK there is a small problem there. If I use GID 0 then Acrobat gets
upset about it and complains it cannot extract the font. If I set the
CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
mad......

*/
163 
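The chain described in the design notes (character code -> CID -> GID) can be illustrated with a small model. This is only a sketch under stated assumptions: Identity-H is modelled as an identity function on 2-byte codes, and the custom CIDToGIDMap as a 128 KiB table of big-endian 16-bit entries that all hold the same GID. The names `BuildCidToGidMap`, `CodeToCid` and `CidToGid` are hypothetical and are not part of Tesseract.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds a CIDToGIDMap in which each of the 65536
// possible CIDs maps to the same GID, stored as two big-endian bytes.
std::vector<unsigned char> BuildCidToGidMap(uint16_t gid) {
  std::vector<unsigned char> map(2 * 65536);
  for (std::size_t cid = 0; cid < 65536; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);  // low byte
  }
  return map;
}

// Identity-H: a 2-byte character code is used directly as the CID.
uint16_t CodeToCid(uint16_t code) { return code; }

// Looks up the GID for a CID in a table built by BuildCidToGidMap.
uint16_t CidToGid(const std::vector<unsigned char> &map, uint16_t cid) {
  assert(map.size() == 2 * 65536);
  return static_cast<uint16_t>((map[2 * cid] << 8) | map[2 * cid + 1]);
}
```

With `BuildCidToGidMap(1)`, the character code 0x0075 from the notes resolves to CID 117 and then to GID 1, so a font with very few glyphs is enough for any code.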
164 namespace tesseract {
165 
166 // Use for PDF object fragments. Must be large enough
167 // to hold a colormap with 256 colors in the verbose
168 // PDF representation.
169 static const int kBasicBufSize = 2048;
170 
171 // If the font is 10 pts, nominal character width is 5 pts
172 static const int kCharWidth = 2;
173 
174 // Used for memory allocation. A codepoint must take no more than this
175 // many bytes, when written in the PDF way. e.g. "<0063>" for the
176 // letter 'c'
177 static const int kMaxBytesPerCodepoint = 20;
178 
179 /**********************************************************************
180  * PDF Renderer interface implementation
181  **********************************************************************/
182 
183 TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
184  bool textonly)
185  : TessResultRenderer(outputbase, "pdf") {
186  obj_ = 0;
187  datadir_ = datadir;
188  textonly_ = textonly;
189  offsets_.push_back(0);
190 }
191 
192 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
193  offsets_.push_back(objectsize + offsets_.back());
194  obj_++;
195 }
196 
197 void TessPDFRenderer::AppendPDFObject(const char *data) {
198  AppendPDFObjectDIY(strlen(data));
199  AppendString((const char *)data);
200 }
201 
// Helper function to prevent us from accidentally writing
// scientific notation to an HOCR or PDF file. Besides, three
// decimal places are all you really need.
double prec(double x) {
  const double kPrecision = 1000.0;
  double a = round(x * kPrecision) / kPrecision;
  // -0.0 compares equal to 0.0; return positive zero so that "-0" is
  // never written to the output.
  if (a == 0.0)
    return 0.0;
  return a;
}
212 
213 long dist2(int x1, int y1, int x2, int y2) {
214  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
215 }
216 
217 // Viewers like evince can get really confused during copy-paste when
218 // the baseline wanders around. So I've decided to project every word
219 // onto the (straight) line baseline. All numbers are in the native
220 // PDF coordinate system, which has the origin in the bottom left and
221 // the unit is points, which is 1/72 inch. Tesseract reports baselines
222 // left-to-right no matter what the reading order is. We need the
223 // word baseline in reading order, so we do that conversion here. Returns
224 // the word's baseline origin and length.
225 void GetWordBaseline(int writing_direction, int ppi, int height,
226  int word_x1, int word_y1, int word_x2, int word_y2,
227  int line_x1, int line_y1, int line_x2, int line_y2,
228  double *x0, double *y0, double *length) {
229  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
230  Swap(&word_x1, &word_x2);
231  Swap(&word_y1, &word_y2);
232  }
233  double word_length;
234  double x, y;
235  {
236  int px = word_x1;
237  int py = word_y1;
238  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
239  if (l2 == 0) {
240  x = line_x1;
241  y = line_y1;
242  } else {
243  double t = ((px - line_x2) * (line_x2 - line_x1) +
244  (py - line_y2) * (line_y2 - line_y1)) / l2;
245  x = line_x2 + t * (line_x2 - line_x1);
246  y = line_y2 + t * (line_y2 - line_y1);
247  }
248  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
249  word_x2, word_y2)));
250  word_length = word_length * 72.0 / ppi;
251  x = x * 72 / ppi;
252  y = height - (y * 72.0 / ppi);
253  }
254  *x0 = x;
255  *y0 = y;
256  *length = word_length;
257 }
258 
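The two geometric steps in GetWordBaseline, projecting the word origin onto the line baseline and converting pixels to PDF points, can be sketched in isolation. The `Point` type and both function names are hypothetical. The sketch uses the first endpoint as the base of the projection, whereas the listing uses the second; the projected point is the same either way.

```cpp
#include <cassert>
#include <cmath>

struct Point {
  double x;
  double y;
};

// Projects p onto the infinite line through a and b. If a == b the line is
// degenerate and a is returned, mirroring the l2 == 0 branch in the listing.
Point ProjectOntoLine(Point p, Point a, Point b) {
  double dx = b.x - a.x;
  double dy = b.y - a.y;
  double l2 = dx * dx + dy * dy;
  if (l2 == 0) return a;
  double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / l2;
  return {a.x + t * dx, a.y + t * dy};
}

// Converts image pixels (origin top left) to PDF points (origin bottom left).
Point PixelsToPdf(Point p, double ppi, double page_height_pts) {
  assert(ppi > 0);
  return {p.x * 72.0 / ppi, page_height_pts - p.y * 72.0 / ppi};
}
```

For example, a point at pixel (300, 300) on a 300 ppi scan of a 792 pt tall page lands at (72, 720) in PDF space.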
// Compute coefficients for an affine matrix describing the rotation
// of the text. If the text is right-to-left such as Arabic or Hebrew,
// we reflect over the Y-axis. This matrix will set the coordinate
// system for placing text in the PDF file.
//
//                           RTL
// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
void AffineMatrix(int writing_direction,
                  int line_x1, int line_y1, int line_x2, int line_y2,
                  double *a, double *b, double *c, double *d) {
  double theta = atan2(static_cast<double>(line_y1 - line_y2),
                       static_cast<double>(line_x2 - line_x1));
  *a = cos(theta);
  *b = sin(theta);
  *c = -sin(theta);
  *d = cos(theta);
  switch(writing_direction) {
    case WRITING_DIRECTION_RIGHT_TO_LEFT:
      *a = -*a;
      *b = -*b;
      break;
    case WRITING_DIRECTION_TOP_TO_BOTTOM:
      // TODO(jbreiden) Consider using the vertical PDF writing mode.
      break;
    default:
      break;
  }
}
288 
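As a worked instance of the matrix above, the following sketch computes the four coefficients for a baseline given in image coordinates, where y grows downward (hence `y1 - y2`). `TextMatrix` and `BaselineMatrix` are hypothetical names, and the boolean flag stands in for the writing-direction enum.

```cpp
#include <cassert>
#include <cmath>

struct TextMatrix {
  double a, b, c, d;
};

// Rotation for a baseline running from (x1, y1) to (x2, y2). For
// right-to-left text the first row is negated (reflection over the Y-axis).
TextMatrix BaselineMatrix(bool right_to_left, int x1, int y1, int x2, int y2) {
  double theta = std::atan2(static_cast<double>(y1 - y2),
                            static_cast<double>(x2 - x1));
  TextMatrix m = {std::cos(theta), std::sin(theta),
                  -std::sin(theta), std::cos(theta)};
  if (right_to_left) {
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

A horizontal baseline gives the identity rotation (a = 1, b = 0, c = 0, d = 1); the same baseline in right-to-left mode gives a = -1 with the second row unchanged.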
// There are some really awkward PDF viewers in the wild, such as
// 'Preview' which ships with the Mac. They do a better job with text
// selection and highlighting when given a perfectly flat baseline
// instead of a very slightly tilted one. We clip small tilts to appease
// these viewers. I chose this threshold large enough to absorb noise,
// but small enough that lines probably won't cross each other if the
// whole page is tilted at almost exactly the clipping threshold.
void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
                  int *line_x1, int *line_y1,
                  int *line_x2, int *line_y2) {
  *line_x1 = x1;
  *line_y1 = y1;
  *line_x2 = x2;
  *line_y2 = y2;
  // Use 72.0 so the conversion to points is done in floating point;
  // integer division would truncate before the comparison below.
  double rise = abs(y2 - y1) * 72.0 / ppi;
  double run = abs(x2 - x1) * 72.0 / ppi;
  if (rise < 2.0 && 2.0 < run)
    *line_y1 = *line_y2 = (y1 + y2) / 2;
}
308 
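The clipping rule is easier to judge with concrete numbers. The sketch below restates it on a small value type and does the pixel-to-point conversion in floating point, which is a choice made here for illustration. `Baseline` and `ClipSmallTilt` are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>

struct Baseline {
  int x1, y1, x2, y2;
};

// Flattens the baseline when it rises by less than 2 pt over a run of more
// than 2 pt; otherwise returns it unchanged.
Baseline ClipSmallTilt(int ppi, Baseline in) {
  assert(ppi > 0);
  double rise = std::abs(in.y2 - in.y1) * 72.0 / ppi;
  double run = std::abs(in.x2 - in.x1) * 72.0 / ppi;
  Baseline out = in;
  if (rise < 2.0 && 2.0 < run) {
    out.y1 = out.y2 = (in.y1 + in.y2) / 2;
  }
  return out;
}
```

At 300 ppi, a baseline that drifts 4 pixels over 1000 pixels has a rise of 0.96 pt and is flattened to its midpoint; one that drifts 100 pixels (24 pt) is left alone.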
309 bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
310  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
311  tprintf("Dropping invalid codepoint %d\n", code);
312  return false;
313  }
314  if (code < 0x10000) {
315  snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
316  } else {
317  int a = code - 0x010000;
318  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
319  int low_surrogate = (0x03FF & a) + 0xDC00;
320  snprintf(utf16, kMaxBytesPerCodepoint,
321  "%04X%04X", high_surrogate, low_surrogate);
322  }
323  return true;
324 }
325 
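A worked example may help with the surrogate arithmetic. The sketch below follows the same encoding rules but returns a `std::string`; the name `CodepointToUtf16beHex` and the empty-string convention for invalid input are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns the UTF-16BE hex digits for one code point, or an empty string
// for surrogate values, negative values and values above U+10FFFF.
std::string CodepointToUtf16beHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return "";
  }
  char buf[16];
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(code));
  } else {
    unsigned a = static_cast<unsigned>(code) - 0x10000;
    unsigned high_surrogate = 0xD800 + ((a >> 10) & 0x3FF);  // top 10 bits
    unsigned low_surrogate = 0xDC00 + (a & 0x3FF);           // low 10 bits
    std::snprintf(buf, sizeof(buf), "%04X%04X", high_surrogate, low_surrogate);
  }
  return buf;
}
```

The letter 'c' (U+0063) becomes `0063`, and U+1F600 becomes the surrogate pair `D83DDE00`.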
326 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
327  double width, double height) {
328  STRING pdf_str("");
329  double ppi = api->GetSourceYResolution();
330 
  // These initial conditions are all arbitrary and will be overwritten
  double old_x = 0.0, old_y = 0.0;
  int old_fontsize = 0;
  tesseract::WritingDirection old_writing_direction =
      WRITING_DIRECTION_LEFT_TO_RIGHT;
  bool new_block = true;
  int fontsize = 0;
  double a = 1;
  double b = 0;
  double c = 0;
  double d = 1;
342 
343  // TODO(jbreiden) This marries the text and image together.
344  // Slightly cleaner from an abstraction standpoint if this were to
345  // live inside a separate text object.
346  pdf_str += "q ";
347  pdf_str.add_str_double("", prec(width));
348  pdf_str += " 0 0 ";
349  pdf_str.add_str_double("", prec(height));
350  pdf_str += " 0 0 cm";
351  if (!textonly_) {
352  pdf_str += " /Im1 Do";
353  }
354  pdf_str += " Q\n";
355 
356  int line_x1 = 0;
357  int line_y1 = 0;
358  int line_x2 = 0;
359  int line_y2 = 0;
360 
361  ResultIterator *res_it = api->GetIterator();
362  while (!res_it->Empty(RIL_BLOCK)) {
363  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
364  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
365  old_fontsize = 0; // Every block will declare its fontsize
366  new_block = true; // Every block will declare its affine matrix
367  }
368 
369  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
370  int x1, y1, x2, y2;
371  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
372  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
373  }
374 
375  if (res_it->Empty(RIL_WORD)) {
376  res_it->Next(RIL_WORD);
377  continue;
378  }
379 
380  // Writing direction changes at a per-word granularity
381  tesseract::WritingDirection writing_direction;
382  {
383  tesseract::Orientation orientation;
384  tesseract::TextlineOrder textline_order;
385  float deskew_angle;
386  res_it->Orientation(&orientation, &writing_direction,
387  &textline_order, &deskew_angle);
388  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
389  switch (res_it->WordDirection()) {
390  case DIR_LEFT_TO_RIGHT:
391  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
392  break;
393  case DIR_RIGHT_TO_LEFT:
394  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
395  break;
396  default:
397  writing_direction = old_writing_direction;
398  }
399  }
400  }
401 
402  // Where is word origin and how long is it?
403  double x, y, word_length;
404  {
405  int word_x1, word_y1, word_x2, word_y2;
406  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
407  GetWordBaseline(writing_direction, ppi, height,
408  word_x1, word_y1, word_x2, word_y2,
409  line_x1, line_y1, line_x2, line_y2,
410  &x, &y, &word_length);
411  }
412 
413  if (writing_direction != old_writing_direction || new_block) {
414  AffineMatrix(writing_direction,
415  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
416  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
417  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
418  pdf_str.add_str_double(" ", prec(c)); // . system for all
419  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
420  pdf_str.add_str_double(" ", prec(x)); // .
421  pdf_str.add_str_double(" ", prec(y)); // .
422  pdf_str += (" Tm "); // Place cursor absolutely
423  new_block = false;
424  } else {
425  double dx = x - old_x;
426  double dy = y - old_y;
427  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
428  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
429  pdf_str += (" Td "); // Relative moveto
430  }
431  old_x = x;
432  old_y = y;
433  old_writing_direction = writing_direction;
434 
    // Adjust font size on a per word granularity. Pay attention to
    // fontsize, old_fontsize, and pdf_str. We've found that for
    // Arabic, Tesseract will happily return a fontsize of zero,
    // so we make up a default number to protect ourselves.
439  {
440  bool bold, italic, underlined, monospace, serif, smallcaps;
441  int font_id;
442  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
443  &serif, &smallcaps, &fontsize, &font_id);
444  const int kDefaultFontsize = 8;
445  if (fontsize <= 0)
446  fontsize = kDefaultFontsize;
447  if (fontsize != old_fontsize) {
448  char textfont[20];
449  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
450  pdf_str += textfont;
451  old_fontsize = fontsize;
452  }
453  }
454 
455  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
456  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
457  STRING pdf_word("");
458  int pdf_word_len = 0;
459  do {
460  const std::unique_ptr<const char[]> grapheme(
461  res_it->GetUTF8Text(RIL_SYMBOL));
462  if (grapheme && grapheme[0] != '\0') {
463  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(grapheme.get());
464  char utf16[kMaxBytesPerCodepoint];
465  for (char32 code : unicodes) {
466  if (CodepointToUtf16be(code, utf16)) {
467  pdf_word += utf16;
468  pdf_word_len++;
469  }
470  }
471  }
472  res_it->Next(RIL_SYMBOL);
473  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
474  if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
475  double h_stretch =
476  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
477  pdf_str.add_str_double("", h_stretch);
478  pdf_str += " Tz"; // horizontal stretch
479  pdf_str += " [ <";
480  pdf_str += pdf_word; // UTF-16BE representation
481  pdf_str += "> ] TJ"; // show the text
482  }
483  if (last_word_in_line) {
484  pdf_str += " \n";
485  }
486  if (last_word_in_block) {
487  pdf_str += "ET\n"; // end the text object
488  }
489  }
490  char *ret = new char[pdf_str.length() + 1];
491  strcpy(ret, pdf_str.string());
492  delete res_it;
493  return ret;
494 }
495 
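The horizontal stretch passed to the `Tz` operator follows from the nominal glyph width: every glyph is `fontsize / kCharWidth` points wide at 100%, so the stretch is whatever percentage makes the glyphs span the measured word length. The sketch below is a simplified restatement with a hypothetical function name; unlike the listing it does not round the intermediate value with `prec`.

```cpp
#include <cassert>

// Percentage for the Tz operator so that num_codes glyphs, each
// font_size / char_width_divisor points wide at 100%, span word_length_pts.
double HorizontalStretchPercent(double word_length_pts, int font_size,
                                int num_codes, int char_width_divisor = 2) {
  assert(font_size > 0 && num_codes > 0 && char_width_divisor > 0);
  return char_width_divisor * 100.0 * word_length_pts /
         (font_size * num_codes);
}
```

Five glyphs at 10 pt occupy 25 pt unstretched, so a word measured at 50 pt needs a stretch of 200, and a word measured at 25 pt needs exactly 100.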
bool TessPDFRenderer::BeginDocumentHandler() {
  char buf[kBasicBufSize];
  size_t n;

  // Header: the version line, then a comment line made of four bytes with
  // the high bit set.
  n = snprintf(buf, sizeof(buf),
               "%%PDF-1.5\n"
               "%%%c%c%c%c\n",
               0xDE, 0xAD, 0xBE, 0xEB);
  if (n >= sizeof(buf)) return false;
  AppendPDFObject(buf);
506 
507  // CATALOG
508  n = snprintf(buf, sizeof(buf),
509  "1 0 obj\n"
510  "<<\n"
511  " /Type /Catalog\n"
512  " /Pages %ld 0 R\n"
513  ">>\n"
514  "endobj\n",
515  2L);
516  if (n >= sizeof(buf)) return false;
517  AppendPDFObject(buf);
518 
519  // We are reserving object #2 for the /Pages
520  // object, which I am going to create and write
521  // at the end of the PDF file.
522  AppendPDFObject("");
523 
524  // TYPE0 FONT
525  n = snprintf(buf, sizeof(buf),
526  "3 0 obj\n"
527  "<<\n"
528  " /BaseFont /GlyphLessFont\n"
529  " /DescendantFonts [ %ld 0 R ]\n"
530  " /Encoding /Identity-H\n"
531  " /Subtype /Type0\n"
532  " /ToUnicode %ld 0 R\n"
533  " /Type /Font\n"
534  ">>\n"
535  "endobj\n",
536  4L, // CIDFontType2 font
537  6L // ToUnicode
538  );
539  if (n >= sizeof(buf)) return false;
540  AppendPDFObject(buf);
541 
542  // CIDFONTTYPE2
543  n = snprintf(buf, sizeof(buf),
544  "4 0 obj\n"
545  "<<\n"
546  " /BaseFont /GlyphLessFont\n"
547  " /CIDToGIDMap %ld 0 R\n"
548  " /CIDSystemInfo\n"
549  " <<\n"
550  " /Ordering (Identity)\n"
551  " /Registry (Adobe)\n"
552  " /Supplement 0\n"
553  " >>\n"
554  " /FontDescriptor %ld 0 R\n"
555  " /Subtype /CIDFontType2\n"
556  " /Type /Font\n"
557  " /DW %d\n"
558  ">>\n"
559  "endobj\n",
560  5L, // CIDToGIDMap
561  7L, // Font descriptor
562  1000 / kCharWidth);
563  if (n >= sizeof(buf)) return false;
564  AppendPDFObject(buf);
565 
566  // CIDTOGIDMAP
567  const int kCIDToGIDMapSize = 2 * (1 << 16);
568  const std::unique_ptr<unsigned char[]> cidtogidmap(
569  new unsigned char[kCIDToGIDMapSize]);
570  for (int i = 0; i < kCIDToGIDMapSize; i++) {
571  cidtogidmap[i] = (i % 2) ? 1 : 0;
572  }
573  size_t len;
574  unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
575  n = snprintf(buf, sizeof(buf),
576  "5 0 obj\n"
577  "<<\n"
578  " /Length %lu /Filter /FlateDecode\n"
579  ">>\n"
580  "stream\n",
581  (unsigned long)len);
582  if (n >= sizeof(buf)) {
583  lept_free(comp);
584  return false;
585  }
586  AppendString(buf);
587  long objsize = strlen(buf);
588  AppendData(reinterpret_cast<char *>(comp), len);
589  objsize += len;
590  lept_free(comp);
591  const char *endstream_endobj =
592  "endstream\n"
593  "endobj\n";
594  AppendString(endstream_endobj);
595  objsize += strlen(endstream_endobj);
596  AppendPDFObjectDIY(objsize);
597 
598  const char *stream =
599  "/CIDInit /ProcSet findresource begin\n"
600  "12 dict begin\n"
601  "begincmap\n"
602  "/CIDSystemInfo\n"
603  "<<\n"
604  " /Registry (Adobe)\n"
605  " /Ordering (UCS)\n"
606  " /Supplement 0\n"
607  ">> def\n"
608  "/CMapName /Adobe-Identify-UCS def\n"
609  "/CMapType 2 def\n"
610  "1 begincodespacerange\n"
611  "<0000> <FFFF>\n"
612  "endcodespacerange\n"
613  "1 beginbfrange\n"
614  "<0000> <FFFF> <0000>\n"
615  "endbfrange\n"
616  "endcmap\n"
617  "CMapName currentdict /CMap defineresource pop\n"
618  "end\n"
619  "end\n";
620 
621  // TOUNICODE
622  n = snprintf(buf, sizeof(buf),
623  "6 0 obj\n"
624  "<< /Length %lu >>\n"
625  "stream\n"
626  "%s"
627  "endstream\n"
628  "endobj\n", (unsigned long) strlen(stream), stream);
629  if (n >= sizeof(buf)) return false;
630  AppendPDFObject(buf);
631 
632  // FONT DESCRIPTOR
633  n = snprintf(buf, sizeof(buf),
634  "7 0 obj\n"
635  "<<\n"
636  " /Ascent %d\n"
637  " /CapHeight %d\n"
638  " /Descent -1\n" // Spec says must be negative
639  " /Flags 5\n" // FixedPitch + Symbolic
640  " /FontBBox [ 0 0 %d %d ]\n"
641  " /FontFile2 %ld 0 R\n"
642  " /FontName /GlyphLessFont\n"
643  " /ItalicAngle 0\n"
644  " /StemV 80\n"
645  " /Type /FontDescriptor\n"
646  ">>\n"
647  "endobj\n",
648  1000,
649  1000,
650  1000 / kCharWidth,
651  1000,
652  8L // Font data
653  );
654  if (n >= sizeof(buf)) return false;
655  AppendPDFObject(buf);
656 
657  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
658  if (n >= sizeof(buf)) return false;
659  FILE *fp = fopen(buf, "rb");
660  if (!fp) {
661  tprintf("Can not open file \"%s\"!\n", buf);
662  return false;
663  }
664  fseek(fp, 0, SEEK_END);
665  long int size = ftell(fp);
666  fseek(fp, 0, SEEK_SET);
667  const std::unique_ptr<char[]> buffer(new char[size]);
668  if (fread(buffer.get(), 1, size, fp) != static_cast<size_t>(size)) {
669  fclose(fp);
670  return false;
671  }
672  fclose(fp);
673  // FONTFILE2
674  n = snprintf(buf, sizeof(buf),
675  "8 0 obj\n"
676  "<<\n"
677  " /Length %ld\n"
678  " /Length1 %ld\n"
679  ">>\n"
680  "stream\n", size, size);
681  if (n >= sizeof(buf)) {
682  return false;
683  }
684  AppendString(buf);
685  objsize = strlen(buf);
686  AppendData(buffer.get(), size);
687  objsize += size;
688  AppendString(endstream_endobj);
689  objsize += strlen(endstream_endobj);
690  AppendPDFObjectDIY(objsize);
691  return true;
692 }
693 
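Every object in BeginDocumentHandler is formatted into a fixed-size buffer and rejected if `snprintf` reports that the text would not fit. A minimal sketch of that idiom follows, using the catalog object as the payload. `FormatCatalog` is a hypothetical helper; it also treats a negative return value as failure, which the listing handles implicitly by storing the result in a `size_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a catalog object into a fixed-size buffer. Returns false when the
// output was truncated or snprintf reported an error.
template <std::size_t N>
bool FormatCatalog(char (&buf)[N], long pages_obj) {
  int n = std::snprintf(buf, N,
                        "1 0 obj\n"
                        "<<\n"
                        " /Type /Catalog\n"
                        " /Pages %ld 0 R\n"
                        ">>\n"
                        "endobj\n",
                        pages_obj);
  return n >= 0 && static_cast<std::size_t>(n) < N;
}
```

`snprintf` returns the length the full output would have had, so comparing it against the buffer size detects truncation without a second pass.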
694 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
695  char *filename,
696  long int objnum,
697  char **pdf_object,
698  long int *pdf_object_size) {
699  size_t n;
700  char b0[kBasicBufSize];
701  char b1[kBasicBufSize];
702  char b2[kBasicBufSize];
703  if (!pdf_object_size || !pdf_object)
704  return false;
705  *pdf_object = NULL;
706  *pdf_object_size = 0;
707  if (!filename)
708  return false;
709 
710  L_Compressed_Data *cid = NULL;
711  const int kJpegQuality = 85;
712 
713  int format, sad;
714  findFileFormat(filename, &format);
715  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
716  Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
717  sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
718  pixDestroy(&p1);
719  } else {
720  sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
721  }
722 
723  if (sad || !cid) {
724  l_CIDataDestroy(&cid);
725  return false;
726  }
727 
728  const char *group4 = "";
729  const char *filter;
730  switch(cid->type) {
731  case L_FLATE_ENCODE:
732  filter = "/FlateDecode";
733  break;
734  case L_JPEG_ENCODE:
735  filter = "/DCTDecode";
736  break;
737  case L_G4_ENCODE:
738  filter = "/CCITTFaxDecode";
739  group4 = " /K -1\n";
740  break;
741  case L_JP2K_ENCODE:
742  filter = "/JPXDecode";
743  break;
744  default:
745  l_CIDataDestroy(&cid);
746  return false;
747  }
748 
749  // Maybe someday we will accept RGBA but today is not that day.
750  // It requires creating an /SMask for the alpha channel.
751  // http://stackoverflow.com/questions/14220221
752  const char *colorspace;
753  if (cid->ncolors > 0) {
754  n = snprintf(b0, sizeof(b0),
755  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
756  cid->ncolors - 1, cid->cmapdatahex);
757  if (n >= sizeof(b0)) {
758  l_CIDataDestroy(&cid);
759  return false;
760  }
761  colorspace = b0;
762  } else {
763  switch (cid->spp) {
764  case 1:
765  colorspace = " /ColorSpace /DeviceGray\n";
766  break;
767  case 3:
768  colorspace = " /ColorSpace /DeviceRGB\n";
769  break;
770  default:
771  l_CIDataDestroy(&cid);
772  return false;
773  }
774  }
775 
776  int predictor = (cid->predictor) ? 14 : 1;
777 
  // IMAGE
  // nbytescomp is cast to unsigned long, so it is printed with %lu.
  n = snprintf(b1, sizeof(b1),
               "%ld 0 obj\n"
               "<<\n"
               " /Length %lu\n"
               " /Subtype /Image\n",
               objnum, (unsigned long) cid->nbytescomp);
785  if (n >= sizeof(b1)) {
786  l_CIDataDestroy(&cid);
787  return false;
788  }
789 
790  n = snprintf(b2, sizeof(b2),
791  " /Width %d\n"
792  " /Height %d\n"
793  " /BitsPerComponent %d\n"
794  " /Filter %s\n"
795  " /DecodeParms\n"
796  " <<\n"
797  " /Predictor %d\n"
798  " /Colors %d\n"
799  "%s"
800  " /Columns %d\n"
801  " /BitsPerComponent %d\n"
802  " >>\n"
803  ">>\n"
804  "stream\n",
805  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
806  group4, cid->w, cid->bps);
807  if (n >= sizeof(b2)) {
808  l_CIDataDestroy(&cid);
809  return false;
810  }
811 
812  const char *b3 =
813  "endstream\n"
814  "endobj\n";
815 
816  size_t b1_len = strlen(b1);
817  size_t b2_len = strlen(b2);
818  size_t b3_len = strlen(b3);
819  size_t colorspace_len = strlen(colorspace);
820 
821  *pdf_object_size =
822  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
823  *pdf_object = new char[*pdf_object_size];
824 
825  char *p = *pdf_object;
826  memcpy(p, b1, b1_len);
827  p += b1_len;
828  memcpy(p, colorspace, colorspace_len);
829  p += colorspace_len;
830  memcpy(p, b2, b2_len);
831  p += b2_len;
832  memcpy(p, cid->datacomp, cid->nbytescomp);
833  p += cid->nbytescomp;
834  memcpy(p, b3, b3_len);
835  l_CIDataDestroy(&cid);
836  return true;
837 }
838 
bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
  size_t n;
  char buf[kBasicBufSize];
  char buf2[kBasicBufSize];
  Pix *pix = api->GetInputImage();
  char *filename = (char *)api->GetInputName();
  int ppi = api->GetSourceYResolution();
  if (!pix || ppi <= 0)
    return false;
  double width = pixGetWidth(pix) * 72.0 / ppi;
  double height = pixGetHeight(pix) * 72.0 / ppi;
850 
851  snprintf(buf2, sizeof(buf2), "/XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
852  const char *xobject = (textonly_) ? "" : buf2;
853 
854  // PAGE
855  n = snprintf(buf, sizeof(buf),
856  "%ld 0 obj\n"
857  "<<\n"
858  " /Type /Page\n"
859  " /Parent %ld 0 R\n"
860  " /MediaBox [0 0 %.2f %.2f]\n"
861  " /Contents %ld 0 R\n"
862  " /Resources\n"
863  " <<\n"
864  " %s"
865  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
866  " /Font << /f-0-0 %ld 0 R >>\n"
867  " >>\n"
868  ">>\n"
869  "endobj\n",
870  obj_,
871  2L, // Pages object
872  width, height,
873  obj_ + 1, // Contents object
874  xobject, // Image object
875  3L); // Type0 Font
876  if (n >= sizeof(buf)) return false;
877  pages_.push_back(obj_);
878  AppendPDFObject(buf);
879 
880  // CONTENTS
881  const std::unique_ptr<char[]> pdftext(GetPDFTextObjects(api, width, height));
882  const size_t pdftext_len = strlen(pdftext.get());
883  size_t len;
884  unsigned char *comp_pdftext = zlibCompress(
885  reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
886  long comp_pdftext_len = len;
887  n = snprintf(buf, sizeof(buf),
888  "%ld 0 obj\n"
889  "<<\n"
890  " /Length %ld /Filter /FlateDecode\n"
891  ">>\n"
892  "stream\n", obj_, comp_pdftext_len);
893  if (n >= sizeof(buf)) {
894  lept_free(comp_pdftext);
895  return false;
896  }
897  AppendString(buf);
898  long objsize = strlen(buf);
899  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
900  objsize += comp_pdftext_len;
901  lept_free(comp_pdftext);
902  const char *b2 =
903  "endstream\n"
904  "endobj\n";
905  AppendString(b2);
906  objsize += strlen(b2);
907  AppendPDFObjectDIY(objsize);
908 
909  if (!textonly_) {
910  char *pdf_object = nullptr;
911  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
912  return false;
913  }
914  AppendData(pdf_object, objsize);
915  AppendPDFObjectDIY(objsize);
916  delete[] pdf_object;
917  }
918  return true;
919 }
920 
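The page size written to `/MediaBox` is the image size converted from pixels to points at 72 points per inch. A minimal sketch of that conversion, with a hypothetical `MediaBox` helper that returns just the dictionary entry:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Builds the /MediaBox entry for an image of the given pixel size and
// resolution, using the same "%.2f" precision as the listing.
std::string MediaBox(int width_px, int height_px, int ppi) {
  assert(ppi > 0);
  double width_pts = width_px * 72.0 / ppi;
  double height_pts = height_px * 72.0 / ppi;
  char buf[64];
  std::snprintf(buf, sizeof(buf), "/MediaBox [0 0 %.2f %.2f]",
                width_pts, height_pts);
  return buf;
}
```

A 2550 x 3300 pixel scan at 300 ppi yields `/MediaBox [0 0 612.00 792.00]`, which is US Letter.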
921 
bool TessPDFRenderer::EndDocumentHandler() {
  size_t n;
  char buf[kBasicBufSize];
925 
926  // We reserved the /Pages object number early, so that the /Page
927  // objects could refer to their parent. We finally have enough
928  // information to go fill it in. Using lower level calls to manipulate
929  // the offset record in two spots, because we are placing objects
930  // out of order in the file.
931 
932  // PAGES
933  const long int kPagesObjectNumber = 2;
934  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
935  n = snprintf(buf, sizeof(buf),
936  "%ld 0 obj\n"
937  "<<\n"
938  " /Type /Pages\n"
939  " /Kids [ ", kPagesObjectNumber);
940  if (n >= sizeof(buf)) return false;
941  AppendString(buf);
942  size_t pages_objsize = strlen(buf);
943  for (size_t i = 0; i < pages_.unsigned_size(); i++) {
944  n = snprintf(buf, sizeof(buf),
945  "%ld 0 R ", pages_[i]);
946  if (n >= sizeof(buf)) return false;
947  AppendString(buf);
948  pages_objsize += strlen(buf);
949  }
950  n = snprintf(buf, sizeof(buf),
951  "]\n"
952  " /Count %d\n"
953  ">>\n"
954  "endobj\n", pages_.size());
955  if (n >= sizeof(buf)) return false;
956  AppendString(buf);
957  pages_objsize += strlen(buf);
958  offsets_.back() += pages_objsize; // manipulation #2
959 
960  // INFO
961  STRING utf16_title = "FEFF"; // byte_order_marker
962  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(title());
963  char utf16[kMaxBytesPerCodepoint];
964  for (char32 code : unicodes) {
965  if (CodepointToUtf16be(code, utf16)) {
966  utf16_title += utf16;
967  }
968  }
969 
970  char* datestr = l_getFormattedDate();
971  n = snprintf(buf, sizeof(buf),
972  "%ld 0 obj\n"
973  "<<\n"
974  " /Producer (Tesseract %s)\n"
975  " /CreationDate (D:%s)\n"
976  " /Title <%s>\n"
977  ">>\n"
978  "endobj\n",
979  obj_, TESSERACT_VERSION_STR, datestr, utf16_title.c_str());
980  lept_free(datestr);
981  if (n >= sizeof(buf)) return false;
982  AppendPDFObject(buf);
983  n = snprintf(buf, sizeof(buf),
984  "xref\n"
985  "0 %ld\n"
986  "0000000000 65535 f \n", obj_);
987  if (n >= sizeof(buf)) return false;
988  AppendString(buf);
989  for (int i = 1; i < obj_; i++) {
990  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
991  if (n >= sizeof(buf)) return false;
992  AppendString(buf);
993  }
994  n = snprintf(buf, sizeof(buf),
995  "trailer\n"
996  "<<\n"
997  " /Size %ld\n"
998  " /Root %ld 0 R\n"
999  " /Info %ld 0 R\n"
1000  ">>\n"
1001  "startxref\n"
1002  "%ld\n"
1003  "%%%%EOF\n",
1004  obj_,
1005  1L, // catalog
1006  obj_ - 1, // info
1007  offsets_.back());
1008  if (n >= sizeof(buf)) return false;
1009  AppendString(buf);
1010  return true;
1011 }
1012 } // namespace tesseract
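The cross-reference table depends on the running byte offsets that the renderer accumulates as it appends objects. The sketch below models that bookkeeping on plain strings: the file header takes slot 0, so the offset stored for slot i is where object i begins. `XrefTable` is a hypothetical function and omits the out-of-order handling that the listing needs for the `/Pages` object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// chunks[0] is the file header; chunks[i] (i >= 1) is the text of object i.
// Returns the xref section for a file made by concatenating the chunks.
std::string XrefTable(const std::vector<std::string> &chunks) {
  std::vector<long> offsets(1, 0L);  // running total, starting at byte 0
  for (const std::string &chunk : chunks) {
    offsets.push_back(offsets.back() + static_cast<long>(chunk.size()));
  }
  // offsets[i] is now the byte position at which chunks[i] starts.
  std::string xref = "xref\n0 " + std::to_string(chunks.size()) + "\n";
  xref += "0000000000 65535 f \n";  // entry 0 is always the free-list head
  char line[32];
  for (std::size_t i = 1; i < chunks.size(); ++i) {
    std::snprintf(line, sizeof(line), "%010ld 00000 n \n", offsets[i]);
    xref += line;
  }
  return xref;
}
```

Each entry is exactly 20 bytes long, including the trailing space and newline, which is what lets a PDF reader seek directly to the entry for a given object number.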