tesseract  4.0.0-beta.1-59-g2cc4
pdfrenderer.cpp
1 // File: pdfrenderer.cpp
3 // Description: PDF rendering interface to inject into TessBaseAPI
4 //
5 // (C) Copyright 2011, Google Inc.
6 // Licensed under the Apache License, Version 2.0 (the "License");
7 // you may not use this file except in compliance with the License.
8 // You may obtain a copy of the License at
9 // http://www.apache.org/licenses/LICENSE-2.0
10 // Unless required by applicable law or agreed to in writing, software
11 // distributed under the License is distributed on an "AS IS" BASIS,
12 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 // See the License for the specific language governing permissions and
14 // limitations under the License.
15 //
17 
18 // Include automatically generated configuration file if running autoconf.
19 #ifdef HAVE_CONFIG_H
20 #include "config_auto.h"
21 #endif
22 
23 #include <memory> // std::unique_ptr
24 #include "allheaders.h"
25 #include "baseapi.h"
26 #include "math.h"
27 #include "renderer.h"
28 #include "strngs.h"
29 #include "tprintf.h"
30 
31 /*
32 
33 Design notes from Ken Sharp, with light editing.
34 
35 We think one solution is a font with a single glyph (.notdef) and a
36 CIDToGIDMap which maps all the CIDs to 0. That map would then be
37 stored as a stream in the PDF file, and when flate compressed should
38 be pretty small. The font, of course, will be approximately the same
39 size as the one you currently use.
40 
41 I'm working on such a font now, the CIDToGIDMap is trivial, you just
42 create a stream object which contains 128k bytes (2 bytes per possible
43 CID and your CIDs range from 0 to 65535) and where you currently have
44 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
45 
46 Note that if, in future, you were to use a different (ie not 2 byte)
47 CMap for character codes you could trivially extend the CIDToGIDMap.
48 
49 The following is an explanation of how some of the font stuff works,
50 this may be too simple for you in which case please accept my
51 apologies, it's hard to know how much knowledge someone has. You can
52 skip all this anyway, it's just for information.
53 
54 The font embedded in a PDF file is usually intended just to be
55 rendered, but extensions allow for at least some ability to locate (or
56 copy) text from a document. This isn't something which was an original
57 goal of the PDF format, but it's been retro-fitted, presumably due to
58 popular demand.
59 
60 To do this reliably the PDF file must contain a ToUnicode CMap, a
61 device for mapping character codes to Unicode code points. If one of
62 these is present, then this will be used to convert the character
63 codes into Unicode values. If it's not present then the reader will
64 fall back through a series of heuristics to try and guess the
65 result. This is, as you would expect, prone to failure.
66 
67 This doesn't concern you of course, since you always write a ToUnicode
68 CMap, so because you are writing the text in text rendering mode 3 it
69 would seem that you don't really need to worry about this, but in the
70 PDF spec you cannot have an isolated ToUnicode CMap, it has to be
71 attached to a font, so in order to get even copy/paste to work you
72 need to define a font.
73 
74 This is what leads to problems, tools like pdfwrite assume that they
75 are going to be able to (or even have to) modify the font entries, so
76 they require that the font being embedded be valid, and to be honest
77 the font Tesseract embeds isn't valid (for this purpose).
78 
79 
80 To see why, let's look at how text is specified in a PDF file:
81 
82 (Test) Tj
83 
84 Now that looks like text but actually it isn't. Each of those bytes is
85 a 'character code'. When it comes to rendering the text a complex
86 sequence of events takes place, which converts the character code into
87 'something' which the font understands. It's entirely possible via
88 character mappings to have that text render as 'Sftu'.
89 
90 For simple fonts (PostScript type 1), we use the character code as the
91 index into an Encoding array (256 elements), each element of which is
92 a glyph name, so this gives us a glyph name. We then consult the
93 CharStrings dictionary in the font, that's a complex object which
94 contains pairs of keys and values, you can use the key to retrieve a
95 given value. So we have a glyph name, we then use that as the key to
96 the dictionary and retrieve the associated value. For a type 1 font,
97 the value is a glyph program that describes how to draw the glyph.
98 
99 For CIDFonts, it's a little more complicated. Because CIDFonts can be
100 large, using a glyph name as the key is unreasonable (it would also
101 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
102 as the key. CIDs are just numbers.
103 
104 But.... We don't use the character code as the CID. What we do is use
105 a CMap to convert the character code into a CID. We then use the CID
106 to key the CharStrings dictionary and proceed as before. So the 'CMap'
107 is the equivalent of the Encoding array, but it's a more compact and
108 flexible representation.
109 
110 Note that you have to use the CMap just to find out how many bytes
111 constitute a character code, and it can be variable. For example you
112 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
113 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
114 have seen CMaps defining character codes up to 5 bytes wide.
115 
116 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
117 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
118 a Glyph ID (GID) (and the LOCA table) which may well not be anything
119 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
120 the CIDs to GIDs, and we can then use the GID to get the glyph
121 description from the GLYF table of the font.
122 
123 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
124 
125 Looking at the PDF file I was supplied with we see that it contains
126 text like :
127 
128 <0x0075> Tj
129 
130 So we start by taking the character code (117) and look it up in the
131 CMap. Well you don't supply a CMap, you just use the Identity-H one
132 which is predefined. So character code 117 maps to CID 117. Then we
133 use the CIDToGIDMap, again you don't supply one, you just use the
134 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
135 were supplied with only contains 116 glyphs.
136 
137 Now for Latin that's not a huge problem, you can just supply a bigger
138 font. But for more complex languages that *is* going to be more of a
139 problem. Either you need to supply a font which contains glyphs for
140 all the possible CID->GID mappings, or we need to think laterally.
141 
142 Our solution using a TrueType CIDFont is to intervene at the
143 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
144 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
145 looking into now.
146 
147 It would also be possible to have a 'PostScript' (ie type 1 outlines)
148 CIDFont which contained 1 glyph, and a CMap which mapped all character
149 codes to CID 0. The effect would be the same.
150 
151 It's possible (I haven't checked) that the PostScript CIDFont and
152 associated CMap would be smaller than the TrueType font and associated
153 CIDToGIDMap.
154 
155 --- in a followup ---
156 
157 OK there is a small problem there, if I use GID 0 then Acrobat gets
158 upset about it and complains it cannot extract the font. If I set the
159 CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
160 mad......
161 
162 */
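
The CIDToGIDMap described in the notes above is a flat table of 2-byte glyph IDs, one entry per CID. As a minimal sketch (the helper name `BuildConstantCidToGidMap` is hypothetical and not part of Tesseract), the following builds the uncompressed 128 KiB table with every CID pointing at the same glyph. It assumes the PDF convention that the high-order byte of each entry is stored first; passing 1 gives the variant that the follow-up note reports Acrobat accepting.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Tesseract): builds an uncompressed
// CIDToGIDMap in which every CID maps to the same glyph ID.
std::vector<unsigned char> BuildConstantCidToGidMap(std::uint16_t gid) {
  const std::size_t kNumCids = 1 << 16;          // CIDs 0..65535
  std::vector<unsigned char> map(2 * kNumCids);  // 2 bytes per CID
  for (std::size_t cid = 0; cid < kNumCids; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);  // low byte
  }
  return map;
}
```

Because the table is a single repeated 2-byte pattern, it should compress to a very small stream once flate-encoded, which is the point the notes make about size.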
163 
164 namespace tesseract {
165 
166 // Use for PDF object fragments. Must be large enough
167 // to hold a colormap with 256 colors in the verbose
168 // PDF representation.
169 static const int kBasicBufSize = 2048;
170 
171 // If the font is 10 pts, nominal character width is 5 pts
172 static const int kCharWidth = 2;
173 
174 // Used for memory allocation. A codepoint must take no more than this
175 // many bytes, when written in the PDF way. e.g. "<0063>" for the
176 // letter 'c'
177 static const int kMaxBytesPerCodepoint = 20;
178 
179 /**********************************************************************
180  * PDF Renderer interface implementation
181  **********************************************************************/
182 
183 TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
184  bool textonly)
185  : TessResultRenderer(outputbase, "pdf") {
186  obj_ = 0;
187  datadir_ = datadir;
188  textonly_ = textonly;
189  offsets_.push_back(0);
190 }
191 
192 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
193  offsets_.push_back(objectsize + offsets_.back());
194  obj_++;
195 }
196 
197 void TessPDFRenderer::AppendPDFObject(const char *data) {
198  AppendPDFObjectDIY(strlen(data));
199  AppendString((const char *)data);
200 }
201 
202 // Helper function to prevent us from accidentally writing
203 // scientific notation to an HOCR or PDF file. Besides, three
204 // decimal places are all you really need.
205 double prec(double x) {
206  double kPrecision = 1000.0;
207  double a = round(x * kPrecision) / kPrecision;
208  if (a == -0)
209  return 0;
210  return a;
211 }
212 
213 long dist2(int x1, int y1, int x2, int y2) {
214  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
215 }
216 
217 // Viewers like evince can get really confused during copy-paste when
218 // the baseline wanders around. So I've decided to project every word
219 // onto the (straight) line baseline. All numbers are in the native
220 // PDF coordinate system, which has the origin in the bottom left and
221 // the unit is points, which is 1/72 inch. Tesseract reports baselines
222 // left-to-right no matter what the reading order is. We need the
223 // word baseline in reading order, so we do that conversion here. Returns
224 // the word's baseline origin and length.
225 void GetWordBaseline(int writing_direction, int ppi, int height,
226  int word_x1, int word_y1, int word_x2, int word_y2,
227  int line_x1, int line_y1, int line_x2, int line_y2,
228  double *x0, double *y0, double *length) {
229  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
230  Swap(&word_x1, &word_x2);
231  Swap(&word_y1, &word_y2);
232  }
233  double word_length;
234  double x, y;
235  {
236  int px = word_x1;
237  int py = word_y1;
238  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
239  if (l2 == 0) {
240  x = line_x1;
241  y = line_y1;
242  } else {
243  double t = ((px - line_x2) * (line_x2 - line_x1) +
244  (py - line_y2) * (line_y2 - line_y1)) / l2;
245  x = line_x2 + t * (line_x2 - line_x1);
246  y = line_y2 + t * (line_y2 - line_y1);
247  }
248  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
249  word_x2, word_y2)));
250  word_length = word_length * 72.0 / ppi;
251  x = x * 72 / ppi;
252  y = height - (y * 72.0 / ppi);
253  }
254  *x0 = x;
255  *y0 = y;
256  *length = word_length;
257 }
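
The projection step inside GetWordBaseline is an orthogonal projection of the word origin onto the line baseline, parameterised from the line's second endpoint. A standalone sketch, under the hypothetical name `ProjectOntoLine` and leaving out the pixel-to-point conversion and the y-axis flip, might look like this:

```cpp
#include <utility>

// Hypothetical helper mirroring the projection step in GetWordBaseline.
// Projects (px, py) onto the infinite line through (x1, y1) and (x2, y2).
std::pair<double, double> ProjectOntoLine(double px, double py,
                                          double x1, double y1,
                                          double x2, double y2) {
  const double dx = x2 - x1;
  const double dy = y2 - y1;
  const double l2 = dx * dx + dy * dy;  // squared length of the baseline
  if (l2 == 0) return {x1, y1};         // degenerate: endpoints coincide
  // t is measured from (x2, y2) along the direction (dx, dy).
  const double t = ((px - x2) * dx + (py - y2) * dy) / l2;
  return {x2 + t * dx, y2 + t * dy};
}
```

For a flat baseline from (0, 0) to (8, 0), the point (2, 5) projects to (2, 0): the word keeps its horizontal position and is snapped onto the line.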
258 
259 // Compute coefficients for an affine matrix describing the rotation
260 // of the text. If the text is right-to-left such as Arabic or Hebrew,
261 // we reflect over the Y-axis. This matrix will set the coordinate
262 // system for placing text in the PDF file.
263 //
264 // RTL
265 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
266 // [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
267 void AffineMatrix(int writing_direction,
268  int line_x1, int line_y1, int line_x2, int line_y2,
269  double *a, double *b, double *c, double *d) {
270  double theta = atan2(static_cast<double>(line_y1 - line_y2),
271  static_cast<double>(line_x2 - line_x1));
272  *a = cos(theta);
273  *b = sin(theta);
274  *c = -sin(theta);
275  *d = cos(theta);
276  switch(writing_direction) {
277  case WRITING_DIRECTION_RIGHT_TO_LEFT:
278  *a = -*a;
279  *b = -*b;
280  break;
281  case WRITING_DIRECTION_TOP_TO_BOTTOM:
282  // TODO(jbreiden) Consider using the vertical PDF writing mode.
283  break;
284  default:
285  break;
286  }
287 }
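
To make the matrix coefficients concrete, here is a small standalone variant of the computation. The names `TextMatrix` and `RotationForBaseline` are hypothetical, and a plain `bool` stands in for the writing-direction enum. For a perfectly flat baseline it yields the coefficients 1 0 0 1 for left-to-right text and -1 0 0 1 for right-to-left text.

```cpp
#include <cmath>

struct TextMatrix {
  double a, b, c, d;
};

// Hypothetical standalone variant of AffineMatrix. Image coordinates have y
// pointing down, so the rotation angle is computed from (y1 - y2).
TextMatrix RotationForBaseline(bool right_to_left,
                               int x1, int y1, int x2, int y2) {
  const double theta = std::atan2(static_cast<double>(y1 - y2),
                                  static_cast<double>(x2 - x1));
  TextMatrix m = {std::cos(theta), std::sin(theta),
                  -std::sin(theta), std::cos(theta)};
  if (right_to_left) {  // reflect over the Y-axis
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

A baseline that runs straight up the image, for example from (0, 100) to (0, 0), gives a rotation of 90 degrees, so `a` is approximately 0 and `b` is approximately 1.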
288 
289 // There are some really awkward PDF viewers in the wild, such as
290 // 'Preview' which ships with the Mac. They do a better job with text
291 // selection and highlighting when given perfectly flat baseline
292 // instead of very slightly tilted. We clip small tilts to appease
293 // these viewers. I chose this threshold large enough to absorb noise,
294 // but small enough that lines probably won't cross each other if the
295 // whole page is tilted at almost exactly the clipping threshold.
296 void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
297  int *line_x1, int *line_y1,
298  int *line_x2, int *line_y2) {
299  *line_x1 = x1;
300  *line_y1 = y1;
301  *line_x2 = x2;
302  *line_y2 = y2;
303  double rise = abs(y2 - y1) * 72 / ppi;
304  double run = abs(x2 - x1) * 72 / ppi;
305  if (rise < 2.0 && 2.0 < run)
306  *line_y1 = *line_y2 = (y1 + y2) / 2;
307 }
308 
309 bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
310  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
311  tprintf("Dropping invalid codepoint %d\n", code);
312  return false;
313  }
314  if (code < 0x10000) {
315  snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
316  } else {
317  int a = code - 0x010000;
318  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
319  int low_surrogate = (0x03FF & a) + 0xDC00;
320  snprintf(utf16, kMaxBytesPerCodepoint,
321  "%04X%04X", high_surrogate, low_surrogate);
322  }
323  return true;
324 }
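
The surrogate-pair arithmetic above can be checked by hand on a few code points. The sketch below is a `std::string` variant under the hypothetical name `Utf16BeHex`; unlike the function above it also rejects negative inputs, which is an added assumption rather than behaviour of the original.

```cpp
#include <cstdio>
#include <string>

// Hypothetical std::string variant of CodepointToUtf16be. Returns the
// uppercase hex UTF-16BE encoding, or an empty string for invalid input.
std::string Utf16BeHex(int code) {
  if (code < 0 || (code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return "";
  }
  char buf[9];  // up to 8 hex digits plus the terminating NUL
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(code));
  } else {
    const unsigned a = static_cast<unsigned>(code) - 0x10000;
    const unsigned high = 0xD800 + ((a >> 10) & 0x3FF);  // top 10 bits
    const unsigned low = 0xDC00 + (a & 0x3FF);           // bottom 10 bits
    std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
  }
  return buf;
}
```

For example, `Utf16BeHex(0x63)` gives `0063` for the letter 'c', and `Utf16BeHex(0x1F600)` gives the surrogate pair `D83DDE00`.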
325 
326 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
327  double width, double height) {
328  STRING pdf_str("");
329  double ppi = api->GetSourceYResolution();
330 
331  // These initial conditions are all arbitrary and will be overwritten
332  double old_x = 0.0, old_y = 0.0;
333  int old_fontsize = 0;
334  tesseract::WritingDirection old_writing_direction =
335  WRITING_DIRECTION_LEFT_TO_RIGHT;
336  bool new_block = true;
337  int fontsize = 0;
338  double a = 1;
339  double b = 0;
340  double c = 0;
341  double d = 1;
342 
343  // TODO(jbreiden) This marries the text and image together.
344  // Slightly cleaner from an abstraction standpoint if this were to
345  // live inside a separate text object.
346  pdf_str += "q ";
347  pdf_str.add_str_double("", prec(width));
348  pdf_str += " 0 0 ";
349  pdf_str.add_str_double("", prec(height));
350  pdf_str += " 0 0 cm";
351  if (!textonly_) {
352  pdf_str += " /Im1 Do";
353  }
354  pdf_str += " Q\n";
355 
356  int line_x1 = 0;
357  int line_y1 = 0;
358  int line_x2 = 0;
359  int line_y2 = 0;
360 
361  ResultIterator *res_it = api->GetIterator();
362  while (!res_it->Empty(RIL_BLOCK)) {
363  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
364  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
365  old_fontsize = 0; // Every block will declare its fontsize
366  new_block = true; // Every block will declare its affine matrix
367  }
368 
369  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
370  int x1, y1, x2, y2;
371  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
372  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
373  }
374 
375  if (res_it->Empty(RIL_WORD)) {
376  res_it->Next(RIL_WORD);
377  continue;
378  }
379 
380  // Writing direction changes at a per-word granularity
381  tesseract::WritingDirection writing_direction;
382  {
383  tesseract::Orientation orientation;
384  tesseract::TextlineOrder textline_order;
385  float deskew_angle;
386  res_it->Orientation(&orientation, &writing_direction,
387  &textline_order, &deskew_angle);
388  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
389  switch (res_it->WordDirection()) {
390  case DIR_LEFT_TO_RIGHT:
391  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
392  break;
393  case DIR_RIGHT_TO_LEFT:
394  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
395  break;
396  default:
397  writing_direction = old_writing_direction;
398  }
399  }
400  }
401 
402  // Where is word origin and how long is it?
403  double x, y, word_length;
404  {
405  int word_x1, word_y1, word_x2, word_y2;
406  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
407  GetWordBaseline(writing_direction, ppi, height,
408  word_x1, word_y1, word_x2, word_y2,
409  line_x1, line_y1, line_x2, line_y2,
410  &x, &y, &word_length);
411  }
412 
413  if (writing_direction != old_writing_direction || new_block) {
414  AffineMatrix(writing_direction,
415  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
416  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
417  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
418  pdf_str.add_str_double(" ", prec(c)); // . system for all
419  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
420  pdf_str.add_str_double(" ", prec(x)); // .
421  pdf_str.add_str_double(" ", prec(y)); // .
422  pdf_str += (" Tm "); // Place cursor absolutely
423  new_block = false;
424  } else {
425  double dx = x - old_x;
426  double dy = y - old_y;
427  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
428  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
429  pdf_str += (" Td "); // Relative moveto
430  }
431  old_x = x;
432  old_y = y;
433  old_writing_direction = writing_direction;
434 
435  // Adjust font size on a per word granularity. Pay attention to
436  // fontsize, old_fontsize, and pdf_str. We've found that
437  // in Arabic, Tesseract will happily return a fontsize of zero,
438  // so we make up a default number to protect ourselves.
439  {
440  bool bold, italic, underlined, monospace, serif, smallcaps;
441  int font_id;
442  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
443  &serif, &smallcaps, &fontsize, &font_id);
444  const int kDefaultFontsize = 8;
445  if (fontsize <= 0)
446  fontsize = kDefaultFontsize;
447  if (fontsize != old_fontsize) {
448  char textfont[20];
449  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
450  pdf_str += textfont;
451  old_fontsize = fontsize;
452  }
453  }
454 
455  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
456  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
457  STRING pdf_word("");
458  int pdf_word_len = 0;
459  do {
460  const std::unique_ptr<const char[]> grapheme(
461  res_it->GetUTF8Text(RIL_SYMBOL));
462  if (grapheme && grapheme[0] != '\0') {
463  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(grapheme.get());
464  char utf16[kMaxBytesPerCodepoint];
465  for (char32 code : unicodes) {
466  if (CodepointToUtf16be(code, utf16)) {
467  pdf_word += utf16;
468  pdf_word_len++;
469  }
470  }
471  }
472  res_it->Next(RIL_SYMBOL);
473  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
474  if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
475  double h_stretch =
476  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
477  pdf_str.add_str_double("", h_stretch);
478  pdf_str += " Tz"; // horizontal stretch
479  pdf_str += " [ <";
480  pdf_str += pdf_word; // UTF-16BE representation
481  pdf_str += "> ] TJ"; // show the text
482  }
483  if (last_word_in_line) {
484  pdf_str += " \n";
485  }
486  if (last_word_in_block) {
487  pdf_str += "ET\n"; // end the text object
488  }
489  }
490  char *ret = new char[pdf_str.length() + 1];
491  strcpy(ret, pdf_str.string());
492  delete res_it;
493  return ret;
494 }
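
The `Tz` value written for each word follows from the glyph width: the font declares a default width of 1000 / kCharWidth units, so each glyph is fontsize / 2 points wide before stretching. A minimal sketch of that arithmetic, with the hypothetical names `RoundTo3` and `HorizontalStretch`:

```cpp
#include <cmath>

// Rounds to three decimal places, like prec() in the listing.
double RoundTo3(double x) { return std::round(x * 1000.0) / 1000.0; }

// Hypothetical helper: the Tz percentage that makes num_chars glyphs, each
// fontsize / kCharWidth points wide, span word_length points.
// Assumes fontsize > 0 and num_chars > 0, as the caller above checks.
double HorizontalStretch(double word_length, int fontsize, int num_chars) {
  const int kCharWidth = 2;
  return kCharWidth * RoundTo3(100.0 * word_length / (fontsize * num_chars));
}
```

Six glyphs at 10 pt are naturally 30 pt wide, so a 30 pt word needs a stretch of 100 percent and a 60 pt word needs 200 percent.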
495 
496 bool TessPDFRenderer::BeginDocumentHandler() {
497  char buf[kBasicBufSize];
498  size_t n;
499 
500  n = snprintf(buf, sizeof(buf),
501  "%%PDF-1.5\n"
502  "%%%c%c%c%c\n",
503  0xDE, 0xAD, 0xBE, 0xEB);
504  if (n >= sizeof(buf)) return false;
505  AppendPDFObject(buf);
506 
507  // CATALOG
508  n = snprintf(buf, sizeof(buf),
509  "1 0 obj\n"
510  "<<\n"
511  " /Type /Catalog\n"
512  " /Pages %ld 0 R\n"
513  ">>\n"
514  "endobj\n",
515  2L);
516  if (n >= sizeof(buf)) return false;
517  AppendPDFObject(buf);
518 
519  // We are reserving object #2 for the /Pages
520  // object, which I am going to create and write
521  // at the end of the PDF file.
522  AppendPDFObject("");
523 
524  // TYPE0 FONT
525  n = snprintf(buf, sizeof(buf),
526  "3 0 obj\n"
527  "<<\n"
528  " /BaseFont /GlyphLessFont\n"
529  " /DescendantFonts [ %ld 0 R ]\n"
530  " /Encoding /Identity-H\n"
531  " /Subtype /Type0\n"
532  " /ToUnicode %ld 0 R\n"
533  " /Type /Font\n"
534  ">>\n"
535  "endobj\n",
536  4L, // CIDFontType2 font
537  6L // ToUnicode
538  );
539  if (n >= sizeof(buf)) return false;
540  AppendPDFObject(buf);
541 
542  // CIDFONTTYPE2
543  n = snprintf(buf, sizeof(buf),
544  "4 0 obj\n"
545  "<<\n"
546  " /BaseFont /GlyphLessFont\n"
547  " /CIDToGIDMap %ld 0 R\n"
548  " /CIDSystemInfo\n"
549  " <<\n"
550  " /Ordering (Identity)\n"
551  " /Registry (Adobe)\n"
552  " /Supplement 0\n"
553  " >>\n"
554  " /FontDescriptor %ld 0 R\n"
555  " /Subtype /CIDFontType2\n"
556  " /Type /Font\n"
557  " /DW %d\n"
558  ">>\n"
559  "endobj\n",
560  5L, // CIDToGIDMap
561  7L, // Font descriptor
562  1000 / kCharWidth);
563  if (n >= sizeof(buf)) return false;
564  AppendPDFObject(buf);
565 
566  // CIDTOGIDMAP
567  const int kCIDToGIDMapSize = 2 * (1 << 16);
568  const std::unique_ptr<unsigned char[]> cidtogidmap(
569  new unsigned char[kCIDToGIDMapSize]);
570  for (int i = 0; i < kCIDToGIDMapSize; i++) {
571  cidtogidmap[i] = (i % 2) ? 1 : 0;
572  }
573  size_t len;
574  unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
575  n = snprintf(buf, sizeof(buf),
576  "5 0 obj\n"
577  "<<\n"
578  " /Length %lu /Filter /FlateDecode\n"
579  ">>\n"
580  "stream\n",
581  (unsigned long)len);
582  if (n >= sizeof(buf)) {
583  lept_free(comp);
584  return false;
585  }
586  AppendString(buf);
587  long objsize = strlen(buf);
588  AppendData(reinterpret_cast<char *>(comp), len);
589  objsize += len;
590  lept_free(comp);
591  const char *endstream_endobj =
592  "endstream\n"
593  "endobj\n";
594  AppendString(endstream_endobj);
595  objsize += strlen(endstream_endobj);
596  AppendPDFObjectDIY(objsize);
597 
598  const char *stream =
599  "/CIDInit /ProcSet findresource begin\n"
600  "12 dict begin\n"
601  "begincmap\n"
602  "/CIDSystemInfo\n"
603  "<<\n"
604  " /Registry (Adobe)\n"
605  " /Ordering (UCS)\n"
606  " /Supplement 0\n"
607  ">> def\n"
608  "/CMapName /Adobe-Identify-UCS def\n"
609  "/CMapType 2 def\n"
610  "1 begincodespacerange\n"
611  "<0000> <FFFF>\n"
612  "endcodespacerange\n"
613  "1 beginbfrange\n"
614  "<0000> <FFFF> <0000>\n"
615  "endbfrange\n"
616  "endcmap\n"
617  "CMapName currentdict /CMap defineresource pop\n"
618  "end\n"
619  "end\n";
620 
621  // TOUNICODE
622  n = snprintf(buf, sizeof(buf),
623  "6 0 obj\n"
624  "<< /Length %lu >>\n"
625  "stream\n"
626  "%s"
627  "endstream\n"
628  "endobj\n", (unsigned long) strlen(stream), stream);
629  if (n >= sizeof(buf)) return false;
630  AppendPDFObject(buf);
631 
632  // FONT DESCRIPTOR
633  n = snprintf(buf, sizeof(buf),
634  "7 0 obj\n"
635  "<<\n"
636  " /Ascent %d\n"
637  " /CapHeight %d\n"
638  " /Descent -1\n" // Spec says must be negative
639  " /Flags 5\n" // FixedPitch + Symbolic
640  " /FontBBox [ 0 0 %d %d ]\n"
641  " /FontFile2 %ld 0 R\n"
642  " /FontName /GlyphLessFont\n"
643  " /ItalicAngle 0\n"
644  " /StemV 80\n"
645  " /Type /FontDescriptor\n"
646  ">>\n"
647  "endobj\n",
648  1000,
649  1000,
650  1000 / kCharWidth,
651  1000,
652  8L // Font data
653  );
654  if (n >= sizeof(buf)) return false;
655  AppendPDFObject(buf);
656 
657  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
658  if (n >= sizeof(buf)) return false;
659  FILE *fp = fopen(buf, "rb");
660  if (!fp) {
661  tprintf("Can not open file \"%s\"!\n", buf);
662  return false;
663  }
664  fseek(fp, 0, SEEK_END);
665  long int size = ftell(fp);
666  if (size < 0) {
667  fclose(fp);
668  return false;
669  }
670  fseek(fp, 0, SEEK_SET);
671  const std::unique_ptr<char[]> buffer(new char[size]);
672  if (fread(buffer.get(), 1, size, fp) != static_cast<size_t>(size)) {
673  fclose(fp);
674  return false;
675  }
676  fclose(fp);
677  // FONTFILE2
678  n = snprintf(buf, sizeof(buf),
679  "8 0 obj\n"
680  "<<\n"
681  " /Length %ld\n"
682  " /Length1 %ld\n"
683  ">>\n"
684  "stream\n", size, size);
685  if (n >= sizeof(buf)) {
686  return false;
687  }
688  AppendString(buf);
689  objsize = strlen(buf);
690  AppendData(buffer.get(), size);
691  objsize += size;
692  AppendString(endstream_endobj);
693  objsize += strlen(endstream_endobj);
694  AppendPDFObjectDIY(objsize);
695  return true;
696 }
697 
698 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
699  char *filename,
700  long int objnum,
701  char **pdf_object,
702  long int *pdf_object_size) {
703  size_t n;
704  char b0[kBasicBufSize];
705  char b1[kBasicBufSize];
706  char b2[kBasicBufSize];
707  if (!pdf_object_size || !pdf_object)
708  return false;
709  *pdf_object = NULL;
710  *pdf_object_size = 0;
711  if (!filename)
712  return false;
713 
714  L_Compressed_Data *cid = NULL;
715  const int kJpegQuality = 85;
716 
717  int format, sad;
718  findFileFormat(filename, &format);
719  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
720  Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
721  sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
722  pixDestroy(&p1);
723  } else {
724  sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
725  }
726 
727  if (sad || !cid) {
728  l_CIDataDestroy(&cid);
729  return false;
730  }
731 
732  const char *group4 = "";
733  const char *filter;
734  switch(cid->type) {
735  case L_FLATE_ENCODE:
736  filter = "/FlateDecode";
737  break;
738  case L_JPEG_ENCODE:
739  filter = "/DCTDecode";
740  break;
741  case L_G4_ENCODE:
742  filter = "/CCITTFaxDecode";
743  group4 = " /K -1\n";
744  break;
745  case L_JP2K_ENCODE:
746  filter = "/JPXDecode";
747  break;
748  default:
749  l_CIDataDestroy(&cid);
750  return false;
751  }
752 
753  // Maybe someday we will accept RGBA but today is not that day.
754  // It requires creating an /SMask for the alpha channel.
755  // http://stackoverflow.com/questions/14220221
756  const char *colorspace;
757  if (cid->ncolors > 0) {
758  n = snprintf(b0, sizeof(b0),
759  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
760  cid->ncolors - 1, cid->cmapdatahex);
761  if (n >= sizeof(b0)) {
762  l_CIDataDestroy(&cid);
763  return false;
764  }
765  colorspace = b0;
766  } else {
767  switch (cid->spp) {
768  case 1:
769  colorspace = " /ColorSpace /DeviceGray\n";
770  break;
771  case 3:
772  colorspace = " /ColorSpace /DeviceRGB\n";
773  break;
774  default:
775  l_CIDataDestroy(&cid);
776  return false;
777  }
778  }
779 
780  int predictor = (cid->predictor) ? 14 : 1;
781 
782  // IMAGE
783  n = snprintf(b1, sizeof(b1),
784  "%ld 0 obj\n"
785  "<<\n"
786  " /Length %lu\n"
787  " /Subtype /Image\n",
788  objnum, (unsigned long) cid->nbytescomp);
789  if (n >= sizeof(b1)) {
790  l_CIDataDestroy(&cid);
791  return false;
792  }
793 
794  n = snprintf(b2, sizeof(b2),
795  " /Width %d\n"
796  " /Height %d\n"
797  " /BitsPerComponent %d\n"
798  " /Filter %s\n"
799  " /DecodeParms\n"
800  " <<\n"
801  " /Predictor %d\n"
802  " /Colors %d\n"
803  "%s"
804  " /Columns %d\n"
805  " /BitsPerComponent %d\n"
806  " >>\n"
807  ">>\n"
808  "stream\n",
809  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
810  group4, cid->w, cid->bps);
811  if (n >= sizeof(b2)) {
812  l_CIDataDestroy(&cid);
813  return false;
814  }
815 
816  const char *b3 =
817  "endstream\n"
818  "endobj\n";
819 
820  size_t b1_len = strlen(b1);
821  size_t b2_len = strlen(b2);
822  size_t b3_len = strlen(b3);
823  size_t colorspace_len = strlen(colorspace);
824 
825  *pdf_object_size =
826  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
827  *pdf_object = new char[*pdf_object_size];
828 
829  char *p = *pdf_object;
830  memcpy(p, b1, b1_len);
831  p += b1_len;
832  memcpy(p, colorspace, colorspace_len);
833  p += colorspace_len;
834  memcpy(p, b2, b2_len);
835  p += b2_len;
836  memcpy(p, cid->datacomp, cid->nbytescomp);
837  p += cid->nbytescomp;
838  memcpy(p, b3, b3_len);
839  l_CIDataDestroy(&cid);
840  return true;
841 }
842 
843 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
844  size_t n;
845  char buf[kBasicBufSize];
846  char buf2[kBasicBufSize];
847  Pix *pix = api->GetInputImage();
848  char *filename = (char *)api->GetInputName();
849  int ppi = api->GetSourceYResolution();
850  if (!pix || ppi <= 0)
851  return false;
852  double width = pixGetWidth(pix) * 72.0 / ppi;
853  double height = pixGetHeight(pix) * 72.0 / ppi;
854 
855  snprintf(buf2, sizeof(buf2), "/XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
856  const char *xobject = (textonly_) ? "" : buf2;
857 
858  // PAGE
859  n = snprintf(buf, sizeof(buf),
860  "%ld 0 obj\n"
861  "<<\n"
862  " /Type /Page\n"
863  " /Parent %ld 0 R\n"
864  " /MediaBox [0 0 %.2f %.2f]\n"
865  " /Contents %ld 0 R\n"
866  " /Resources\n"
867  " <<\n"
868  " %s"
869  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
870  " /Font << /f-0-0 %ld 0 R >>\n"
871  " >>\n"
872  ">>\n"
873  "endobj\n",
874  obj_,
875  2L, // Pages object
876  width, height,
877  obj_ + 1, // Contents object
878  xobject, // Image object
879  3L); // Type0 Font
880  if (n >= sizeof(buf)) return false;
881  pages_.push_back(obj_);
882  AppendPDFObject(buf);
883 
884  // CONTENTS
885  const std::unique_ptr<char[]> pdftext(GetPDFTextObjects(api, width, height));
886  const size_t pdftext_len = strlen(pdftext.get());
887  size_t len;
888  unsigned char *comp_pdftext = zlibCompress(
889  reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
890  long comp_pdftext_len = len;
891  n = snprintf(buf, sizeof(buf),
892  "%ld 0 obj\n"
893  "<<\n"
894  " /Length %ld /Filter /FlateDecode\n"
895  ">>\n"
896  "stream\n", obj_, comp_pdftext_len);
897  if (n >= sizeof(buf)) {
898  lept_free(comp_pdftext);
899  return false;
900  }
901  AppendString(buf);
902  long objsize = strlen(buf);
903  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
904  objsize += comp_pdftext_len;
905  lept_free(comp_pdftext);
906  const char *b2 =
907  "endstream\n"
908  "endobj\n";
909  AppendString(b2);
910  objsize += strlen(b2);
911  AppendPDFObjectDIY(objsize);
912 
913  if (!textonly_) {
914  char *pdf_object = nullptr;
915  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
916  return false;
917  }
918  AppendData(pdf_object, objsize);
919  AppendPDFObjectDIY(objsize);
920  delete[] pdf_object;
921  }
922  return true;
923 }
924 
925 
926 bool TessPDFRenderer::EndDocumentHandler() {
927  size_t n;
928  char buf[kBasicBufSize];
929 
930  // We reserved the /Pages object number early, so that the /Page
931  // objects could refer to their parent. We finally have enough
932  // information to go fill it in. Using lower level calls to manipulate
933  // the offset record in two spots, because we are placing objects
934  // out of order in the file.
935 
936  // PAGES
937  const long int kPagesObjectNumber = 2;
938  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
939  n = snprintf(buf, sizeof(buf),
940  "%ld 0 obj\n"
941  "<<\n"
942  " /Type /Pages\n"
943  " /Kids [ ", kPagesObjectNumber);
944  if (n >= sizeof(buf)) return false;
945  AppendString(buf);
946  size_t pages_objsize = strlen(buf);
947  for (size_t i = 0; i < pages_.unsigned_size(); i++) {
948  n = snprintf(buf, sizeof(buf),
949  "%ld 0 R ", pages_[i]);
950  if (n >= sizeof(buf)) return false;
951  AppendString(buf);
952  pages_objsize += strlen(buf);
953  }
954  n = snprintf(buf, sizeof(buf),
955  "]\n"
956  " /Count %d\n"
957  ">>\n"
958  "endobj\n", pages_.size());
959  if (n >= sizeof(buf)) return false;
960  AppendString(buf);
961  pages_objsize += strlen(buf);
962  offsets_.back() += pages_objsize; // manipulation #2
963 
964  // INFO
965  STRING utf16_title = "FEFF"; // byte_order_marker
966  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(title());
967  char utf16[kMaxBytesPerCodepoint];
968  for (char32 code : unicodes) {
969  if (CodepointToUtf16be(code, utf16)) {
970  utf16_title += utf16;
971  }
972  }
973 
974  char* datestr = l_getFormattedDate();
975  n = snprintf(buf, sizeof(buf),
976  "%ld 0 obj\n"
977  "<<\n"
978  " /Producer (Tesseract %s)\n"
979  " /CreationDate (D:%s)\n"
980  " /Title <%s>\n"
981  ">>\n"
982  "endobj\n",
983  obj_, tesseract::TessBaseAPI::Version(),
984  datestr, utf16_title.c_str());
985  lept_free(datestr);
986  if (n >= sizeof(buf)) return false;
987  AppendPDFObject(buf);
988  n = snprintf(buf, sizeof(buf),
989  "xref\n"
990  "0 %ld\n"
991  "0000000000 65535 f \n", obj_);
992  if (n >= sizeof(buf)) return false;
993  AppendString(buf);
994  for (int i = 1; i < obj_; i++) {
995  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
996  if (n >= sizeof(buf)) return false;
997  AppendString(buf);
998  }
999  n = snprintf(buf, sizeof(buf),
1000  "trailer\n"
1001  "<<\n"
1002  " /Size %ld\n"
1003  " /Root %ld 0 R\n"
1004  " /Info %ld 0 R\n"
1005  ">>\n"
1006  "startxref\n"
1007  "%ld\n"
1008  "%%%%EOF\n",
1009  obj_,
1010  1L, // catalog
1011  obj_ - 1, // info
1012  offsets_.back());
1013  if (n >= sizeof(buf)) return false;
1014  AppendString(buf);
1015  return true;
1016 }
1017 } // namespace tesseract
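
The offset bookkeeping in AppendPDFObjectDIY and the xref loop in EndDocumentHandler amount to a running sum of chunk sizes. The sketch below shows that idea in isolation. The name `BuildXref` is hypothetical, and it assumes objects are written in numeric order, so it does not model the out-of-order /Pages object handled above.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: chunk_sizes[0] is the byte size of the file header and
// chunk_sizes[i] is the byte size of object i. Returns the xref section text.
std::string BuildXref(const std::vector<long>& chunk_sizes) {
  char line[32];
  std::string xref = "xref\n";
  // Entry count includes the free entry for object 0.
  std::snprintf(line, sizeof(line), "0 %zu\n", chunk_sizes.size());
  xref += line;
  xref += "0000000000 65535 f \n";
  long offset = 0;
  for (std::size_t i = 0; i + 1 < chunk_sizes.size(); ++i) {
    offset += chunk_sizes[i];  // byte offset of object i + 1
    std::snprintf(line, sizeof(line), "%010ld 00000 n \n", offset);
    xref += line;
  }
  return xref;
}
```

With a 15-byte header followed by objects of 50 and 30 bytes, object 1 starts at offset 15 and object 2 at offset 65. Each entry is exactly 20 bytes long, which is why the format string pads the offset to ten digits and keeps the trailing space before the newline.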