Document indexing allows PDF Highlighter to analyze PDF documents ahead of time, before highlighting is requested for them. The PDF Highlighter server extracts, indexes, and caches text, text positions, and other metadata. The cached data allows the server to handle highlighting requests for large PDF documents much faster. When a document cache exists, PDF Highlighter only checks the timestamp of the PDF document and, if the document hasn't changed, uses the cache without reading the actual document.
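The timestamp check described above can be sketched as follows. This is only an illustration of the idea, not Highlighter's actual implementation: the CacheEntry class, its fields, and the load_if_fresh function are hypothetical names introduced for this example.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheEntry:
    """Hypothetical cache record: extracted text plus the PDF's mtime at indexing time."""
    source_mtime_ns: int
    pages_text: List[str]


def load_if_fresh(pdf_path: str, entry: Optional[CacheEntry]) -> Optional[CacheEntry]:
    """Return the cached entry if the PDF is unchanged since indexing, else None."""
    if entry is None:
        return None
    # Only file metadata is read here; the PDF content itself is never opened.
    current_mtime_ns = os.stat(pdf_path).st_mtime_ns
    if current_mtime_ns == entry.source_mtime_ns:
        return entry
    return None  # Stale cache: the caller would re-analyze the document.
```

A result of None means the document has to be analyzed again; otherwise the cached data can be used directly.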
Indexing is not required. If highlighting is requested for a document that has not been analyzed before, PDF Highlighter analyzes it on the fly. When the highlight-for-xml endpoint is used, hit pages are provided as part of the input, so PDF Highlighter analyzes only the pages known to contain hits.
There are two ways to index documents:
To index remote documents, use PDF Highlighter's /index web service endpoint.
If a copy of the documents being highlighted is accessible to the server on a local disk or a remote disk mount, PDF Highlighter can crawl and index all PDF documents it finds. In this case you need to set up documentDirs in application.conf and will probably need to configure uriMappings so that Highlighter can map your document URLs to local files.
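To illustrate what mapping document URLs to local files involves, here is a minimal sketch of prefix-based mapping. It does not reproduce the actual uriMappings syntax or Highlighter's behavior; the URI_MAPPINGS table, the example URL and directory, and the map_url_to_file function are assumptions made for this example.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional
from urllib.parse import unquote

# Hypothetical mapping table: URL prefix -> local directory (both end with "/").
URI_MAPPINGS: Dict[str, str] = {
    "https://example.com/docs/": "/var/data/pdfs/",
}


def map_url_to_file(url: str, mappings: Dict[str, str]) -> Optional[str]:
    """Translate a document URL into a local file path, or None if no prefix matches."""
    # Try the longest prefix first so that more specific mappings win.
    for prefix in sorted(mappings, key=len, reverse=True):
        if url.startswith(prefix):
            relative = unquote(url[len(prefix):])
            # Refuse paths that try to climb out of the mapped directory.
            if ".." in PurePosixPath(relative).parts:
                return None
            return mappings[prefix] + relative
    return None
```

With the table above, https://example.com/docs/a/b.pdf maps to /var/data/pdfs/a/b.pdf, while a URL outside the mapped prefix yields None and would have to be fetched over HTTP instead.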
Setup document indexing
To enable indexing:
Configure Highlighter to map PDF URLs to the local file system. See Serving Local Files.
Add document root folders to the documentDirs list in the indexing section of the configuration file.
documentDirs = [
In the config file, you can also set how often Highlighter should scan your document folders. Only updated files will be processed.
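A periodic scan that processes only updated files could look roughly like the sketch below. It assumes that "updated" means a changed modification time and that the scanner remembers the timestamps it saw on the previous pass; the find_updated_pdfs function and the indexed_mtimes dictionary are hypothetical and say nothing about how Highlighter tracks changes internally.

```python
import os
from typing import Dict, Iterable, List


def find_updated_pdfs(document_dirs: Iterable[str],
                      indexed_mtimes: Dict[str, int]) -> List[str]:
    """Walk the document folders and return PDFs that are new or changed.

    indexed_mtimes maps file path -> mtime (ns) seen on the previous scan
    and is updated in place so the next scan skips unchanged files.
    """
    updated = []
    for root_dir in document_dirs:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                if not name.lower().endswith(".pdf"):
                    continue  # Ignore non-PDF files.
                path = os.path.join(dirpath, name)
                mtime_ns = os.stat(path).st_mtime_ns
                if indexed_mtimes.get(path) != mtime_ns:
                    updated.append(path)
                    indexed_mtimes[path] = mtime_ns
    return sorted(updated)
```

Calling the function twice in a row on unchanged folders returns an empty list the second time, which is the behavior the scan interval relies on.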
Reading documents from file system vs HTTP
For better performance, it's recommended to make PDF documents accessible to Highlighter via the file system, ideally with both on the same server. That allows Highlighter to access a PDF document without an HTTP round trip.
See serving local files for details.
Highlighting Mode: In Viewer
In Highlighting In Viewer mode, the server provides the PDF viewer with the data necessary to render highlighting areas on top of the displayed PDF document. The PDF document is served by your web server from its original location, so users also benefit from web browser caching. In this mode, Highlighter handles requests faster than in Burning PDF mode because it doesn't need to serialize and optimize PDF documents for web delivery, which for huge files can take much longer than the highlighting process itself.
It's highly recommended to set up document indexing, as it greatly improves highlighting performance.
Highlighting Mode: Burning PDF
When burning highlights into a PDF, Highlighter creates a new PDF on the fly, adding highlighting annotations to the original document. The produced PDF file doesn't require a special viewer: the highlights are part of the document and will be shown by every PDF viewer that supports annotations.
Web optimized PDF delivery
PDF Highlighter outputs PDF documents as web optimized ("linearized"). This PDF feature, often referred to as "fast web view", allows the PDF viewer to load and show the first document page (and subsequent pages) before the PDF file is fully downloaded from the web server. The PDF viewer achieves this by requesting smaller chunks of the PDF document from the web server and showing the pages in each chunk as soon as the chunk is received.
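Chunked loading of this kind is typically done with HTTP Range requests. The sketch below only computes the Range header values a client could send to fetch a file in fixed-size chunks; it is a simplification, since a real viewer chooses its ranges according to its own loading strategy. The range_headers function name and the fixed chunk size are assumptions made for this example.

```python
from typing import List


def range_headers(total_size: int, chunk_size: int) -> List[str]:
    """Return HTTP Range header values covering a file of total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    headers = []
    start = 0
    while start < total_size:
        # Byte ranges are inclusive on both ends.
        end = min(start + chunk_size, total_size) - 1
        headers.append(f"bytes={start}-{end}")
        start = end + 1
    return headers
```

For a 10-byte file and 4-byte chunks this yields bytes=0-3, bytes=4-7, and bytes=8-9. Each chunk costs a separate request, which is where the networking overhead of web optimized delivery comes from.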
PDF Highlighter returns the URL of the highlighted document in a format that tells the PDF viewer which page to open first (i.e. the first hit page). It is then up to the PDF viewer to use this information, and different viewers have different document loading strategies. For example, our test from 2016 showed that:
Adobe Reader (which is usually the default PDF viewer for Internet Explorer users) uses the first-page information to load and render that page as soon as possible.
Recent releases of Mozilla Firefox come with an internal PDF viewer that works similarly to Adobe Reader, loading and showing the referenced page quickly.
Google Chrome has an internal PDF viewer that supports web optimized loading as well but is less efficient: it loads the first document page (not the one referenced in the URL) and continues loading the document. It jumps to the first hit page automatically only after the document is fully downloaded from the web server.
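The exact URL format Highlighter returns is not shown here. A common convention for telling a viewer which page to open is the #page=N fragment from Adobe's PDF open parameters, so the sketch below assumes that convention purely for illustration; with_first_hit_page is a hypothetical helper, not part of Highlighter.

```python
from urllib.parse import urlsplit, urlunsplit


def with_first_hit_page(pdf_url: str, page: int) -> str:
    """Return pdf_url with a '#page=N' fragment pointing at a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    parts = urlsplit(pdf_url)
    # Replace any existing fragment; scheme, host, path, and query stay untouched.
    return urlunsplit(parts._replace(fragment=f"page={page}"))
```

For example, with_first_hit_page("https://example.com/docs/report.pdf", 12) produces https://example.com/docs/report.pdf#page=12. Whether and when the viewer honors the fragment depends on the viewer, as the comparison above shows.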
Note that, although it generally provides a better user experience, web optimized PDF delivery comes with significant networking overhead. As a result, it takes more time to fully load a web optimized PDF document in a viewer than to download a non-optimized version of the same document.
Serving highlighted files using a front-end web server
Web servers like Apache, Nginx, or IIS will probably give you better static file serving performance than Highlighter can. In addition, you may prefer to serve the resulting PDF files through a front-end web server to enforce security restrictions. Using the docsCacheDir caching option, you can instruct Highlighter to save the resulting document to a directory that is accessible to and served by the front-end web server. To instruct Highlighter to send users to a different path (handled by your front-end web server), change the document serving options.
For Highlighter to function in a load-balanced environment, all nodes should use the same shared location for the cache directory. Use the docsCacheDir option to set this up.