pdfrenderer.cpp
1 // Include automatically generated configuration file if running autoconf.
2 #ifdef HAVE_CONFIG_H
3 #include "config_auto.h"
4 #endif
5 
6 #include "baseapi.h"
7 #include "renderer.h"
8 #include "math.h"
9 #include "strngs.h"
10 #include "cube_utils.h"
11 #include "allheaders.h"
12 
13 #ifdef _MSC_VER
14 #include "mathfix.h"
15 #endif
16 
17 /*
18 
19 Design notes from Ken Sharp, with light editing.
20 
21 We think one solution is a font with a single glyph (.notdef) and a
22 CIDToGIDMap which maps all the CIDs to 0. That map would then be
23 stored as a stream in the PDF file, and when flate compressed should
24 be pretty small. The font, of course, will be approximately the same
25 size as the one you currently use.
26 
27 I'm working on such a font now, the CIDToGIDMap is trivial, you just
28 create a stream object which contains 128k bytes (2 bytes per possible
29 CID and your CIDs range from 0 to 65535) and where you currently have
30 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
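
A minimal sketch of the uncompressed contents of such a stream, assuming
the PDF convention that each entry is a 2-byte glyph ID stored high byte
first. The helper name BuildConstantCIDToGIDMap is hypothetical and is
not part of Tesseract:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds an uncompressed CIDToGIDMap that sends
// every 2-byte CID (0..65535) to the same glyph ID. Entry c occupies
// bytes 2*c and 2*c+1, high-order byte first.
std::vector<unsigned char> BuildConstantCIDToGIDMap(std::uint16_t gid) {
  const std::size_t kNumCIDs = 1 << 16;
  std::vector<unsigned char> map(2 * kNumCIDs);
  for (std::size_t cid = 0; cid < kNumCIDs; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);  // low byte
  }
  assert(map.size() == 131072);  // the "128k bytes" mentioned above
  return map;
}
```

For instance, BuildConstantCIDToGIDMap(0) yields 131072 zero bytes, a
highly repetitive buffer that flate compression should shrink well.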
31 
32 Note that if, in future, you were to use a different (i.e. not 2-byte)
33 CMap for character codes you could trivially extend the CIDToGIDMap.
34 
35 The following is an explanation of how some of the font stuff works,
36 this may be too simple for you in which case please accept my
37 apologies, it's hard to know how much knowledge someone has. You can
38 skip all this anyway, it's just for information.
39 
40 The font embedded in a PDF file is usually intended just to be
41 rendered, but extensions allow for at least some ability to locate (or
42 copy) text from a document. This isn't something which was an original
43 goal of the PDF format, but it's been retro-fitted, presumably due to
44 popular demand.
45 
46 To do this reliably the PDF file must contain a ToUnicode CMap, a
47 device for mapping character codes to Unicode code points. If one of
48 these is present, then this will be used to convert the character
49 codes into Unicode values. If it's not present then the reader will
50 fall back through a series of heuristics to try and guess the
51 result. This is, as you would expect, prone to failure.
52 
53 This doesn't concern you of course, since you always write a ToUnicode
54 CMap. Because you are writing the text in text rendering mode 3, it
55 would seem that you don't really need to worry about this. But in the
56 PDF spec you cannot have an isolated ToUnicode CMap; it has to be
57 attached to a font, so in order to get even copy/paste to work you
58 need to define a font.
59 
60 This is what leads to problems: tools like pdfwrite assume that they
61 are going to be able to (or even have to) modify the font entries, so
62 they require that the font being embedded be valid, and to be honest
63 the font Tesseract embeds isn't valid (for this purpose).
64 
65 
66 To see why, let's look at how text is specified in a PDF file:
67 
68 (Test) Tj
69 
70 Now that looks like text but actually it isn't. Each of those bytes is
71 a 'character code'. When it comes to rendering the text a complex
72 sequence of events takes place, which converts the character code into
73 'something' which the font understands. It's entirely possible via
74 character mappings to have that text render as 'Sftu'.
75 
76 For simple fonts (PostScript type 1), we use the character code as the
77 index into an Encoding array (256 elements), each element of which is
78 a glyph name, so this gives us a glyph name. We then consult the
79 CharStrings dictionary in the font, that's a complex object which
80 contains pairs of keys and values, you can use the key to retrieve a
81 given value. So we have a glyph name, we then use that as the key to
82 the dictionary and retrieve the associated value. For a type 1 font,
83 the value is a glyph program that describes how to draw the glyph.
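
As a rough illustration of that lookup for a simple font, the sketch
below models the Encoding array and the CharStrings dictionary with
standard containers. All names here (SimpleFont, Render,
MakeScrambledFont) are hypothetical, and the "glyph program" is reduced
to the single character it would draw:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a simple font.
struct SimpleFont {
  std::vector<std::string> encoding;        // character code -> glyph name
  std::map<std::string, char> charstrings;  // glyph name -> drawn shape
};

// Resolves each character code through Encoding, then CharStrings.
// Glyph names missing from CharStrings are shown as '?'.
std::string Render(const SimpleFont& font, const std::string& codes) {
  assert(font.encoding.size() == 256);
  std::string out;
  for (unsigned char code : codes) {
    const std::string& glyph_name = font.encoding[code];
    auto it = font.charstrings.find(glyph_name);
    out += (it != font.charstrings.end()) ? it->second : '?';
  }
  return out;
}

// Builds a font whose Encoding deliberately points the codes of
// "Test" at the glyphs named S, f, t and u.
SimpleFont MakeScrambledFont() {
  SimpleFont font;
  font.encoding.assign(256, ".notdef");
  font.encoding['T'] = "S";
  font.encoding['e'] = "f";
  font.encoding['s'] = "t";
  font.encoding['t'] = "u";
  for (char name : std::string("Sftu")) {
    font.charstrings[std::string(1, name)] = name;
  }
  return font;
}
```

With this font, Render(MakeScrambledFont(), "Test") produces "Sftu":
the bytes in the content stream are only character codes, and the
font's mappings decide what is drawn.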
84 
85 For CIDFonts, it's a little more complicated. Because CIDFonts can be
86 large, using a glyph name as the key is unreasonable (it would also
87 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
88 as the key. CIDs are just numbers.
89 
90 But.... We don't use the character code as the CID. What we do is use
91 a CMap to convert the character code into a CID. We then use the CID
92 to key the CharStrings dictionary and proceed as before. So the 'CMap'
93 is the equivalent of the Encoding array, but it's a more compact and
94 flexible representation.
95 
96 Note that you have to use the CMap just to find out how many bytes
97 constitute a character code, and it can be variable. For example you
98 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
99 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
100 have seen CMaps defining character codes up to 5 bytes wide.
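
The variable-width rule in that example could be sketched as follows.
The function names are hypothetical; a real CMap declares its own
ranges. Because the two upper ranges quoted above overlap at 0xf0, this
sketch assumes 0xf0 starts the 3-byte range:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical rule taken from the example above: the first byte of a
// character code determines how many bytes the code occupies.
// Assumption: 0xf0 belongs to the 3-byte range.
int CodeLengthFromFirstByte(unsigned char first) {
  if (first <= 0x7f) return 1;
  if (first < 0xf0) return 2;
  return 3;
}

// Counts the character codes in a byte string using the rule above.
// Returns -1 if the string ends in the middle of a code.
int CountCodes(const unsigned char* bytes, std::size_t n) {
  assert(bytes != nullptr || n == 0);
  int count = 0;
  std::size_t i = 0;
  while (i < n) {
    std::size_t len = CodeLengthFromFirstByte(bytes[i]);
    if (i + len > n) return -1;  // truncated code
    i += len;
    ++count;
  }
  return count;
}
```

The point of the sketch is that a string cannot be split into character
codes without consulting the CMap: the same six bytes may be six codes
under one CMap and three under another.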
101 
102 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
103 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
104 a Glyph ID (GID) (and the LOCA table) which may well not be anything
105 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
106 the CIDs to GIDs, and we can then use the GID to get the glyph
107 description from the GLYF table of the font.
108 
109 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
110 
111 Looking at the PDF file I was supplied with we see that it contains
112 text like:
113 
114 <0x0075> Tj
115 
116 So we start by taking the character code (117) and look it up in the
117 CMap. Well you don't supply a CMap, you just use the Identity-H one
118 which is predefined. So character code 117 maps to CID 117. Then we
119 use the CIDToGIDMap, again you don't supply one, you just use the
120 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
121 were supplied with only contains 116 glyphs.
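
The failure described here can be sketched with two identity functions
standing in for the predefined Identity-H CMap and the predefined
Identity CIDToGIDMap. The function names are hypothetical and only
model the lookup chain:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two predefined identity mappings.
std::uint16_t CodeToCID_IdentityH(std::uint16_t code) { return code; }
std::uint16_t CIDToGID_Identity(std::uint16_t cid) { return cid; }

// Returns true if the glyph selected for this character code exists in
// a font holding num_glyphs glyphs (valid GIDs are 0 .. num_glyphs - 1).
bool GlyphExists(std::uint16_t code, int num_glyphs) {
  assert(num_glyphs >= 0);
  std::uint16_t gid = CIDToGID_Identity(CodeToCID_IdentityH(code));
  return gid < num_glyphs;
}
```

GlyphExists(0x0075, 116) is false: code 117 becomes CID 117 and then
GID 117, which lies past the last glyph (GID 115) of a 116-glyph font.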
122 
123 Now for Latin that's not a huge problem, you can just supply a bigger
124 font. But for more complex languages that *is* going to be more of a
125 problem. Either you need to supply a font which contains glyphs for
126 all the possible CID->GID mappings, or we need to think laterally.
127 
128 Our solution using a TrueType CIDFont is to intervene at the
129 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
130 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
131 looking into now.
132 
133 It would also be possible to have a 'PostScript' (i.e. type 1 outlines)
134 CIDFont which contained 1 glyph, and a CMap which mapped all character
135 codes to CID 0. The effect would be the same.
136 
137 It's possible (I haven't checked) that the PostScript CIDFont and
138 associated CMap would be smaller than the TrueType font and associated
139 CIDToGIDMap.
140 
141 --- in a followup ---
142 
143 OK there is a small problem there, if I use GID 0 then Acrobat gets
144 upset about it and complains it cannot extract the font. If I set the
145 CIDToGIDMap so that all the entries are 1 instead, its happy. Totally
146 mad......
147 
148 */
149 
150 namespace tesseract {
151 
152 // Use for PDF object fragments. Must be large enough
153 // to hold a colormap with 256 colors in the verbose
154 // PDF representation.
155 const int kBasicBufSize = 2048;
156 
157 // If the font is 10 pts, nominal character width is 5 pts
158 const int kCharWidth = 2;
159 
160 /**********************************************************************
161  * PDF Renderer interface implementation
162  **********************************************************************/
163 
164 TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
165  : TessResultRenderer(outputbase, "pdf") {
166  obj_ = 0;
167  datadir_ = datadir;
168  offsets_.push_back(0);
169 }
170 
171 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
172  offsets_.push_back(objectsize + offsets_.back());
173  obj_++;
174 }
175 
176 void TessPDFRenderer::AppendPDFObject(const char *data) {
177  AppendPDFObjectDIY(strlen(data));
178  AppendString((const char *)data);
179 }
180 
181 // Helper function to prevent us from accidentally writing
182 // scientific notation to an HOCR or PDF file. Besides, three
183 // decimal places are all you really need.
184 double prec(double x) {
185  double kPrecision = 1000.0;
186  double a = round(x * kPrecision) / kPrecision;
187  if (a == -0)
188  return 0;
189  return a;
190 }
191 
192 long dist2(int x1, int y1, int x2, int y2) {
193  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
194 }
195 
196 // Viewers like evince can get really confused during copy-paste when
197 // the baseline wanders around. So I've decided to project every word
198 // onto the (straight) line baseline. All numbers are in the native
199 // PDF coordinate system, which has the origin in the bottom left and
200 // the unit is points, which is 1/72 inch. Tesseract reports baselines
201 // left-to-right no matter what the reading order is. We need the
202 // word baseline in reading order, so we do that conversion here. Returns
203 // the word's baseline origin and length.
204 void GetWordBaseline(int writing_direction, int ppi, int height,
205  int word_x1, int word_y1, int word_x2, int word_y2,
206  int line_x1, int line_y1, int line_x2, int line_y2,
207  double *x0, double *y0, double *length) {
208  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
209  Swap(&word_x1, &word_x2);
210  Swap(&word_y1, &word_y2);
211  }
212  double word_length;
213  double x, y;
214  {
215  int px = word_x1;
216  int py = word_y1;
217  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
218  if (l2 == 0) {
219  x = line_x1;
220  y = line_y1;
221  } else {
222  double t = ((px - line_x2) * (line_x2 - line_x1) +
223  (py - line_y2) * (line_y2 - line_y1)) / l2;
224  x = line_x2 + t * (line_x2 - line_x1);
225  y = line_y2 + t * (line_y2 - line_y1);
226  }
227  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1,
228  word_x2, word_y2)));
229  word_length = word_length * 72.0 / ppi;
230  x = x * 72 / ppi;
231  y = height - (y * 72.0 / ppi);
232  }
233  *x0 = x;
234  *y0 = y;
235  *length = word_length;
236 }
237 
238 // Compute coefficients for an affine matrix describing the rotation
239 // of the text. If the text is right-to-left such as Arabic or Hebrew,
240 // we reflect over the Y-axis. This matrix will set the coordinate
241 // system for placing text in the PDF file.
242 //
243 // RTL
244 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
245 // [ y' ] [ c d ][ y ] [ 0 1 ] [-sin cos ][ y ]
246 void AffineMatrix(int writing_direction,
247  int line_x1, int line_y1, int line_x2, int line_y2,
248  double *a, double *b, double *c, double *d) {
249  double theta = atan2(static_cast<double>(line_y1 - line_y2),
250  static_cast<double>(line_x2 - line_x1));
251  *a = cos(theta);
252  *b = sin(theta);
253  *c = -sin(theta);
254  *d = cos(theta);
255  switch(writing_direction) {
256  case WRITING_DIRECTION_RIGHT_TO_LEFT:
257  *a = -*a;
258  *b = -*b;
259  break;
260  case WRITING_DIRECTION_TOP_TO_BOTTOM:
261  // TODO(jbreiden) Consider using the vertical PDF writing mode.
262  break;
263  default:
264  break;
265  }
266 }
267 
268 // There are some really stupid PDF viewers in the wild, such as
269 // 'Preview' which ships with the Mac. They do a better job with text
270 // selection and highlighting when given a perfectly flat baseline
271 // instead of a very slightly tilted one. We clip small tilts to appease
272 // these viewers. I chose this threshold large enough to absorb noise,
273 // but small enough that lines probably won't cross each other if the
274 // whole page is tilted at almost exactly the clipping threshold.
275 void ClipBaseline(int ppi, int x1, int y1, int x2, int y2,
276  int *line_x1, int *line_y1,
277  int *line_x2, int *line_y2) {
278  *line_x1 = x1;
279  *line_y1 = y1;
280  *line_x2 = x2;
281  *line_y2 = y2;
282  double rise = abs(y2 - y1) * 72.0 / ppi;
283  double run = abs(x2 - x1) * 72.0 / ppi;
284  if (rise < 2.0 && 2.0 < run)
285  *line_y1 = *line_y2 = (y1 + y2) / 2;
286 }
287 
288 char* TessPDFRenderer::GetPDFTextObjects(TessBaseAPI* api,
289  double width, double height) {
290  STRING pdf_str("");
291  double ppi = api->GetSourceYResolution();
292 
293  // These initial conditions are all arbitrary and will be overwritten
294  double old_x = 0.0, old_y = 0.0;
295  int old_fontsize = 0;
296  tesseract::WritingDirection old_writing_direction =
297  WRITING_DIRECTION_LEFT_TO_RIGHT;
298  bool new_block = true;
299  int fontsize = 0;
300  double a = 1;
301  double b = 0;
302  double c = 0;
303  double d = 1;
304 
305  // TODO(jbreiden) This marries the text and image together.
306  // Slightly cleaner from an abstraction standpoint if this were to
307  // live inside a separate text object.
308  pdf_str += "q ";
309  pdf_str.add_str_double("", prec(width));
310  pdf_str += " 0 0 ";
311  pdf_str.add_str_double("", prec(height));
312  pdf_str += " 0 0 cm /Im1 Do Q\n";
313 
314  ResultIterator *res_it = api->GetIterator();
315  int line_x1 = 0, line_y1 = 0, line_x2 = 0, line_y2 = 0; // Kept across words
316  while (!res_it->Empty(RIL_BLOCK)) {
317  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
318  pdf_str += "BT\n3 Tr"; // Begin text object, use invisible ink
319  old_fontsize = 0; // Every block will declare its fontsize
320  new_block = true; // Every block will declare its affine matrix
321  }
322 
323  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
324  int x1, y1, x2, y2;
325  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
326  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
327  }
328 
329  if (res_it->Empty(RIL_WORD)) {
330  res_it->Next(RIL_WORD);
331  continue;
332  }
333 
334  // Writing direction changes at a per-word granularity
335  tesseract::WritingDirection writing_direction;
336  {
337  tesseract::Orientation orientation;
338  tesseract::TextlineOrder textline_order;
339  float deskew_angle;
340  res_it->Orientation(&orientation, &writing_direction,
341  &textline_order, &deskew_angle);
342  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
343  switch (res_it->WordDirection()) {
344  case DIR_LEFT_TO_RIGHT:
345  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
346  break;
347  case DIR_RIGHT_TO_LEFT:
348  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
349  break;
350  default:
351  writing_direction = old_writing_direction;
352  }
353  }
354  }
355 
356  // Where is word origin and how long is it?
357  double x, y, word_length;
358  {
359  int word_x1, word_y1, word_x2, word_y2;
360  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
361  GetWordBaseline(writing_direction, ppi, height,
362  word_x1, word_y1, word_x2, word_y2,
363  line_x1, line_y1, line_x2, line_y2,
364  &x, &y, &word_length);
365  }
366 
367  if (writing_direction != old_writing_direction || new_block) {
368  AffineMatrix(writing_direction,
369  line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
370  pdf_str.add_str_double(" ", prec(a)); // . This affine matrix
371  pdf_str.add_str_double(" ", prec(b)); // . sets the coordinate
372  pdf_str.add_str_double(" ", prec(c)); // . system for all
373  pdf_str.add_str_double(" ", prec(d)); // . text that follows.
374  pdf_str.add_str_double(" ", prec(x)); // .
375  pdf_str.add_str_double(" ", prec(y)); // .
376  pdf_str += (" Tm "); // Place cursor absolutely
377  new_block = false;
378  } else {
379  double dx = x - old_x;
380  double dy = y - old_y;
381  pdf_str.add_str_double(" ", prec(dx * a + dy * b));
382  pdf_str.add_str_double(" ", prec(dx * c + dy * d));
383  pdf_str += (" Td "); // Relative moveto
384  }
385  old_x = x;
386  old_y = y;
387  old_writing_direction = writing_direction;
388 
389  // Adjust font size on a per word granularity. Pay attention to
390  // fontsize, old_fontsize, and pdf_str. We've found that for
391  // Arabic, Tesseract will happily return a fontsize of zero,
392  // so we make up a default number to protect ourselves.
393  {
394  bool bold, italic, underlined, monospace, serif, smallcaps;
395  int font_id;
396  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
397  &serif, &smallcaps, &fontsize, &font_id);
398  const int kDefaultFontsize = 8;
399  if (fontsize <= 0)
400  fontsize = kDefaultFontsize;
401  if (fontsize != old_fontsize) {
402  char textfont[20];
403  snprintf(textfont, sizeof(textfont), "/f-0-0 %d Tf ", fontsize);
404  pdf_str += textfont;
405  old_fontsize = fontsize;
406  }
407  }
408 
409  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
410  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
411  STRING pdf_word("");
412  int pdf_word_len = 0;
413  do {
414  const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
415  if (grapheme && grapheme[0] != '\0') {
416  // TODO(jbreiden) Do a real UTF-16BE conversion
417  // http://en.wikipedia.org/wiki/UTF-16#Example_UTF-16_encoding_procedure
418  string_32 utf32;
419  CubeUtils::UTF8ToUTF32(grapheme, &utf32);
420  char utf16[20];
421  for (int i = 0; i < static_cast<int>(utf32.length()); i++) {
422  snprintf(utf16, sizeof(utf16), "<%04X>", utf32[i]);
423  pdf_word += utf16;
424  pdf_word_len++;
425  }
426  }
427  delete []grapheme;
428  res_it->Next(RIL_SYMBOL);
429  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
430  if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
431  double h_stretch =
432  kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
433  pdf_str.add_str_double("", h_stretch);
434  pdf_str += " Tz"; // horizontal stretch
435  pdf_str += " [ ";
436  pdf_str += pdf_word; // UTF-16BE representation
437  pdf_str += " ] TJ"; // show the text
438  }
439  if (last_word_in_line) {
440  pdf_str += " \n";
441  }
442  if (last_word_in_block) {
443  pdf_str += "ET\n"; // end the text object
444  }
445  }
446  char *ret = new char[pdf_str.length() + 1];
447  strcpy(ret, pdf_str.string());
448  delete res_it;
449  return ret;
450 }
451 
452 bool TessPDFRenderer::BeginDocumentHandler() {
453  char buf[kBasicBufSize];
454  size_t n;
455 
456  n = snprintf(buf, sizeof(buf),
457  "%%PDF-1.5\n"
458  "%%%c%c%c%c\n",
459  0xDE, 0xAD, 0xBE, 0xEB);
460  if (n >= sizeof(buf)) return false;
461  AppendPDFObject(buf);
462 
463  // CATALOG
464  n = snprintf(buf, sizeof(buf),
465  "1 0 obj\n"
466  "<<\n"
467  " /Type /Catalog\n"
468  " /Pages %ld 0 R\n"
469  ">>\n"
470  "endobj\n",
471  2L);
472  if (n >= sizeof(buf)) return false;
473  AppendPDFObject(buf);
474 
475  // We are reserving object #2 for the /Pages
476  // object, which I am going to create and write
477  // at the end of the PDF file.
478  AppendPDFObject("");
479 
480  // TYPE0 FONT
481  n = snprintf(buf, sizeof(buf),
482  "3 0 obj\n"
483  "<<\n"
484  " /BaseFont /GlyphLessFont\n"
485  " /DescendantFonts [ %ld 0 R ]\n"
486  " /Encoding /Identity-H\n"
487  " /Subtype /Type0\n"
488  " /ToUnicode %ld 0 R\n"
489  " /Type /Font\n"
490  ">>\n"
491  "endobj\n",
492  4L, // CIDFontType2 font
493  6L // ToUnicode
494  );
495  if (n >= sizeof(buf)) return false;
496  AppendPDFObject(buf);
497 
498  // CIDFONTTYPE2
499  n = snprintf(buf, sizeof(buf),
500  "4 0 obj\n"
501  "<<\n"
502  " /BaseFont /GlyphLessFont\n"
503  " /CIDToGIDMap %ld 0 R\n"
504  " /CIDSystemInfo\n"
505  " <<\n"
506  " /Ordering (Identity)\n"
507  " /Registry (Adobe)\n"
508  " /Supplement 0\n"
509  " >>\n"
510  " /FontDescriptor %ld 0 R\n"
511  " /Subtype /CIDFontType2\n"
512  " /Type /Font\n"
513  " /DW %d\n"
514  ">>\n"
515  "endobj\n",
516  5L, // CIDToGIDMap
517  7L, // Font descriptor
518  1000 / kCharWidth);
519  if (n >= sizeof(buf)) return false;
520  AppendPDFObject(buf);
521 
522  // CIDTOGIDMAP
523  const int kCIDToGIDMapSize = 2 * (1 << 16);
524  unsigned char *cidtogidmap = new unsigned char[kCIDToGIDMapSize];
525  for (int i = 0; i < kCIDToGIDMapSize; i++) {
526  cidtogidmap[i] = (i % 2) ? 1 : 0;
527  }
528  size_t len;
529  unsigned char *comp =
530  zlibCompress(cidtogidmap, kCIDToGIDMapSize, &len);
531  delete[] cidtogidmap;
532  n = snprintf(buf, sizeof(buf),
533  "5 0 obj\n"
534  "<<\n"
535  " /Length %lu /Filter /FlateDecode\n"
536  ">>\n"
537  "stream\n", (unsigned long) len);
538  if (n >= sizeof(buf)) {
539  lept_free(comp);
540  return false;
541  }
542  AppendString(buf);
543  long objsize = strlen(buf);
544  AppendData(reinterpret_cast<char *>(comp), len);
545  objsize += len;
546  lept_free(comp);
547  const char *endstream_endobj =
548  "endstream\n"
549  "endobj\n";
550  AppendString(endstream_endobj);
551  objsize += strlen(endstream_endobj);
552  AppendPDFObjectDIY(objsize);
553 
554  const char *stream =
555  "/CIDInit /ProcSet findresource begin\n"
556  "12 dict begin\n"
557  "begincmap\n"
558  "/CIDSystemInfo\n"
559  "<<\n"
560  " /Registry (Adobe)\n"
561  " /Ordering (UCS)\n"
562  " /Supplement 0\n"
563  ">> def\n"
564  "/CMapName /Adobe-Identity-UCS def\n"
565  "/CMapType 2 def\n"
566  "1 begincodespacerange\n"
567  "<0000> <FFFF>\n"
568  "endcodespacerange\n"
569  "1 beginbfrange\n"
570  "<0000> <FFFF> <0000>\n"
571  "endbfrange\n"
572  "endcmap\n"
573  "CMapName currentdict /CMap defineresource pop\n"
574  "end\n"
575  "end\n";
576 
577  // TOUNICODE
578  n = snprintf(buf, sizeof(buf),
579  "6 0 obj\n"
580  "<< /Length %lu >>\n"
581  "stream\n"
582  "%s"
583  "endstream\n"
584  "endobj\n", (unsigned long) strlen(stream), stream);
585  if (n >= sizeof(buf)) return false;
586  AppendPDFObject(buf);
587 
588  // FONT DESCRIPTOR
589  const int kCharHeight = 2; // Effect: highlights are half height
590  n = snprintf(buf, sizeof(buf),
591  "7 0 obj\n"
592  "<<\n"
593  " /Ascent %d\n"
594  " /CapHeight %d\n"
595  " /Descent -1\n" // Spec says must be negative
596  " /Flags 5\n" // FixedPitch + Symbolic
597  " /FontBBox [ 0 0 %d %d ]\n"
598  " /FontFile2 %ld 0 R\n"
599  " /FontName /GlyphLessFont\n"
600  " /ItalicAngle 0\n"
601  " /StemV 80\n"
602  " /Type /FontDescriptor\n"
603  ">>\n"
604  "endobj\n",
605  1000 / kCharHeight,
606  1000 / kCharHeight,
607  1000 / kCharWidth,
608  1000 / kCharHeight,
609  8L // Font data
610  );
611  if (n >= sizeof(buf)) return false;
612  AppendPDFObject(buf);
613 
614  n = snprintf(buf, sizeof(buf), "%s/pdf.ttf", datadir_);
615  if (n >= sizeof(buf)) return false;
616  FILE *fp = fopen(buf, "rb");
617  if (!fp) {
618  tprintf("Can not open file \"%s\"!\n", buf);
619  return false;
620  }
621  fseek(fp, 0, SEEK_END);
622  long int size = ftell(fp);
623  fseek(fp, 0, SEEK_SET);
624  char *buffer = new char[size];
625  if (fread(buffer, 1, size, fp) != size) {
626  fclose(fp);
627  delete[] buffer;
628  return false;
629  }
630  fclose(fp);
631  // FONTFILE2
632  n = snprintf(buf, sizeof(buf),
633  "8 0 obj\n"
634  "<<\n"
635  " /Length %ld\n"
636  " /Length1 %ld\n"
637  ">>\n"
638  "stream\n", size, size);
639  if (n >= sizeof(buf)) {
640  delete[] buffer;
641  return false;
642  }
643  AppendString(buf);
644  objsize = strlen(buf);
645  AppendData(buffer, size);
646  delete[] buffer;
647  objsize += size;
648  AppendString(endstream_endobj);
649  objsize += strlen(endstream_endobj);
650  AppendPDFObjectDIY(objsize);
651  return true;
652 }
653 
654 bool TessPDFRenderer::imageToPDFObj(Pix *pix,
655  char *filename,
656  long int objnum,
657  char **pdf_object,
658  long int *pdf_object_size) {
659  size_t n;
660  char b0[kBasicBufSize];
661  char b1[kBasicBufSize];
662  char b2[kBasicBufSize];
663  if (!pdf_object_size || !pdf_object)
664  return false;
665  *pdf_object = NULL;
666  *pdf_object_size = 0;
667  if (!filename)
668  return false;
669 
670  L_COMP_DATA *cid = NULL;
671  const int kJpegQuality = 85;
672 
673  // TODO(jbreiden) Leptonica 1.71 doesn't correctly handle certain
674  // types of PNG files, especially if there are 2 samples per pixel.
675  // We can get rid of this logic after Leptonica 1.72 is released and
676  // has propagated everywhere. Bug discussion as follows.
677  // https://code.google.com/p/tesseract-ocr/issues/detail?id=1300
678  int format, sad;
679  findFileFormat(filename, &format);
680  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
681  pixSetSpp(pix, 3);
682  sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
683  } else {
684  sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
685  }
686 
687  if (sad || !cid) {
688  l_CIDataDestroy(&cid);
689  return false;
690  }
691 
692  const char *group4 = "";
693  const char *filter;
694  switch(cid->type) {
695  case L_FLATE_ENCODE:
696  filter = "/FlateDecode";
697  break;
698  case L_JPEG_ENCODE:
699  filter = "/DCTDecode";
700  break;
701  case L_G4_ENCODE:
702  filter = "/CCITTFaxDecode";
703  group4 = " /K -1\n";
704  break;
705  case L_JP2K_ENCODE:
706  filter = "/JPXDecode";
707  break;
708  default:
709  l_CIDataDestroy(&cid);
710  return false;
711  }
712 
713  // Maybe someday we will accept RGBA but today is not that day.
714  // It requires creating an /SMask for the alpha channel.
715  // http://stackoverflow.com/questions/14220221
716  const char *colorspace;
717  if (cid->ncolors > 0) {
718  n = snprintf(b0, sizeof(b0),
719  " /ColorSpace [ /Indexed /DeviceRGB %d %s ]\n",
720  cid->ncolors - 1, cid->cmapdatahex);
721  if (n >= sizeof(b0)) {
722  l_CIDataDestroy(&cid);
723  return false;
724  }
725  colorspace = b0;
726  } else {
727  switch (cid->spp) {
728  case 1:
729  colorspace = " /ColorSpace /DeviceGray\n";
730  break;
731  case 3:
732  colorspace = " /ColorSpace /DeviceRGB\n";
733  break;
734  default:
735  l_CIDataDestroy(&cid);
736  return false;
737  }
738  }
739 
740  int predictor = (cid->predictor) ? 14 : 1;
741 
742  // IMAGE
743  n = snprintf(b1, sizeof(b1),
744  "%ld 0 obj\n"
745  "<<\n"
746  " /Length %lu\n"
747  " /Subtype /Image\n",
748  objnum, (unsigned long) cid->nbytescomp);
749  if (n >= sizeof(b1)) {
750  l_CIDataDestroy(&cid);
751  return false;
752  }
753 
754  n = snprintf(b2, sizeof(b2),
755  " /Width %d\n"
756  " /Height %d\n"
757  " /BitsPerComponent %d\n"
758  " /Filter %s\n"
759  " /DecodeParms\n"
760  " <<\n"
761  " /Predictor %d\n"
762  " /Colors %d\n"
763  "%s"
764  " /Columns %d\n"
765  " /BitsPerComponent %d\n"
766  " >>\n"
767  ">>\n"
768  "stream\n",
769  cid->w, cid->h, cid->bps, filter, predictor, cid->spp,
770  group4, cid->w, cid->bps);
771  if (n >= sizeof(b2)) {
772  l_CIDataDestroy(&cid);
773  return false;
774  }
775 
776  const char *b3 =
777  "endstream\n"
778  "endobj\n";
779 
780  size_t b1_len = strlen(b1);
781  size_t b2_len = strlen(b2);
782  size_t b3_len = strlen(b3);
783  size_t colorspace_len = strlen(colorspace);
784 
785  *pdf_object_size =
786  b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
787  *pdf_object = new char[*pdf_object_size];
788  if (!*pdf_object) {
789  l_CIDataDestroy(&cid);
790  return false;
791  }
792 
793  char *p = *pdf_object;
794  memcpy(p, b1, b1_len);
795  p += b1_len;
796  memcpy(p, colorspace, colorspace_len);
797  p += colorspace_len;
798  memcpy(p, b2, b2_len);
799  p += b2_len;
800  memcpy(p, cid->datacomp, cid->nbytescomp);
801  p += cid->nbytescomp;
802  memcpy(p, b3, b3_len);
803  l_CIDataDestroy(&cid);
804  return true;
805 }
806 
807 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
808  size_t n;
809  char buf[kBasicBufSize];
810  Pix *pix = api->GetInputImage();
811  char *filename = (char *)api->GetInputName();
812  int ppi = api->GetSourceYResolution();
813  if (!pix || ppi <= 0)
814  return false;
815  double width = pixGetWidth(pix) * 72.0 / ppi;
816  double height = pixGetHeight(pix) * 72.0 / ppi;
817 
818  // PAGE
819  n = snprintf(buf, sizeof(buf),
820  "%ld 0 obj\n"
821  "<<\n"
822  " /Type /Page\n"
823  " /Parent %ld 0 R\n"
824  " /MediaBox [0 0 %.2f %.2f]\n"
825  " /Contents %ld 0 R\n"
826  " /Resources\n"
827  " <<\n"
828  " /XObject << /Im1 %ld 0 R >>\n"
829  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
830  " /Font << /f-0-0 %ld 0 R >>\n"
831  " >>\n"
832  ">>\n"
833  "endobj\n",
834  obj_,
835  2L, // Pages object
836  width,
837  height,
838  obj_ + 1, // Contents object
839  obj_ + 2, // Image object
840  3L); // Type0 Font
841  if (n >= sizeof(buf)) return false;
842  pages_.push_back(obj_);
843  AppendPDFObject(buf);
844 
845  // CONTENTS
846  char* pdftext = GetPDFTextObjects(api, width, height);
847  long pdftext_len = strlen(pdftext);
848  unsigned char *pdftext_casted = reinterpret_cast<unsigned char *>(pdftext);
849  size_t len;
850  unsigned char *comp_pdftext =
851  zlibCompress(pdftext_casted, pdftext_len, &len);
852  long comp_pdftext_len = len;
853  n = snprintf(buf, sizeof(buf),
854  "%ld 0 obj\n"
855  "<<\n"
856  " /Length %ld /Filter /FlateDecode\n"
857  ">>\n"
858  "stream\n", obj_, comp_pdftext_len);
859  if (n >= sizeof(buf)) {
860  delete[] pdftext;
861  lept_free(comp_pdftext);
862  return false;
863  }
864  AppendString(buf);
865  long objsize = strlen(buf);
866  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
867  objsize += comp_pdftext_len;
868  lept_free(comp_pdftext);
869  delete[] pdftext;
870  const char *b2 =
871  "endstream\n"
872  "endobj\n";
873  AppendString(b2);
874  objsize += strlen(b2);
875  AppendPDFObjectDIY(objsize);
876 
877  char *pdf_object;
878  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
879  return false;
880  }
881  AppendData(pdf_object, objsize);
882  AppendPDFObjectDIY(objsize);
883  delete[] pdf_object;
884  return true;
885 }
886 
887 
888 bool TessPDFRenderer::EndDocumentHandler() {
889  size_t n;
890  char buf[kBasicBufSize];
891 
892  // We reserved the /Pages object number early, so that the /Page
893  // objects could refer to their parent. We finally have enough
894  // information to go fill it in. Using lower level calls to manipulate
895  // the offset record in two spots, because we are placing objects
896  // out of order in the file.
897 
898  // PAGES
899  const long int kPagesObjectNumber = 2;
900  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
901  n = snprintf(buf, sizeof(buf),
902  "%ld 0 obj\n"
903  "<<\n"
904  " /Type /Pages\n"
905  " /Kids [ ", kPagesObjectNumber);
906  if (n >= sizeof(buf)) return false;
907  AppendString(buf);
908  size_t pages_objsize = strlen(buf);
909  for (size_t i = 0; i < pages_.size(); i++) {
910  n = snprintf(buf, sizeof(buf),
911  "%ld 0 R ", pages_[i]);
912  if (n >= sizeof(buf)) return false;
913  AppendString(buf);
914  pages_objsize += strlen(buf);
915  }
916  n = snprintf(buf, sizeof(buf),
917  "]\n"
918  " /Count %d\n"
919  ">>\n"
920  "endobj\n", pages_.size());
921  if (n >= sizeof(buf)) return false;
922  AppendString(buf);
923  pages_objsize += strlen(buf);
924  offsets_.back() += pages_objsize; // manipulation #2
925 
926  // INFO
927  char* datestr = l_getFormattedDate();
928  n = snprintf(buf, sizeof(buf),
929  "%ld 0 obj\n"
930  "<<\n"
931  " /Producer (Tesseract %s)\n"
932  " /CreationDate (D:%s)\n"
933  " /Title (%s)\n"
934  ">>\n"
935  "endobj\n", obj_, TESSERACT_VERSION_STR, datestr, title());
936  lept_free(datestr);
937  if (n >= sizeof(buf)) return false;
938  AppendPDFObject(buf);
939  n = snprintf(buf, sizeof(buf),
940  "xref\n"
941  "0 %ld\n"
942  "0000000000 65535 f \n", obj_);
943  if (n >= sizeof(buf)) return false;
944  AppendString(buf);
945  for (int i = 1; i < obj_; i++) {
946  n = snprintf(buf, sizeof(buf), "%010ld 00000 n \n", offsets_[i]);
947  if (n >= sizeof(buf)) return false;
948  AppendString(buf);
949  }
950  n = snprintf(buf, sizeof(buf),
951  "trailer\n"
952  "<<\n"
953  " /Size %ld\n"
954  " /Root %ld 0 R\n"
955  " /Info %ld 0 R\n"
956  ">>\n"
957  "startxref\n"
958  "%ld\n"
959  "%%%%EOF\n",
960  obj_,
961  1L, // catalog
962  obj_ - 1, // info
963  offsets_.back());
964  if (n >= sizeof(buf)) return false;
965  AppendString(buf);
966  return true;
967 }
968 } // namespace tesseract