IFX Group

Using LibreOffice for Automated Web to PDF Creation

Several of our web projects over the years have required some complex automation tasks. Most of those tasks can be addressed with a specialized combination of programs or scripts, but one task stands out from the rest and that is an automated way to turn a web page into a PDF. This task requires two different parts that rarely appear in the same place.

Turning Web Into PDF

The first part is a rendering engine able to turn HTML (hypertext markup language - how every web page is defined) into a graphical display with a layout structure using position, graphics, fonts and colors like what you see on this web page. Every web browser is designed to do the first part of this task well, but virtually all web browsers are very poor at formatting web pages on paper. Try it for yourself and notice how the web page content is sometimes split in arbitrary places at the paper edges. This is not acceptable.

The second part is generating an industry standard file format that is compatible with the widest range of operating systems, display methods and printing methods. Currently the most popular file format for this kind of compatibility is PDF (portable document format), which conveys formatted graphic documents equally well on Windows™, Mac™ and Linux™ and on virtually all modern printers.

What are the choices?

As we have already covered, adding a PDF printer driver to a web browser is not a good solution because web browsers are not built for nice looking print output. The better choice is to find something that is designed primarily for printing and also understands HTML. Most commercial word processors have some ability to read HTML (a web page), but until recently all of them lacked the ability to create PDF files without an external add-on printer driver. This was not a good choice because it required selecting a specific printer driver for the word processor, either before running the automation script or inside it. Both approaches are prone to problems for the user. Some PDF printer drivers even prompt the user for input (like the PDF file name or optional PDF settings), which is very bad for automation.

So our final choice was a word processor with native, built-in PDF creation, and it helps that it was also the least expensive of the choices - the LibreOffice office suite.

It is no secret we like this word processor for serious text editing and book publishing. All of the IFX Group books are written in the LibreOffice word processor and all of the professional quality master PDF documents sent to the printing house are generated directly by this program. What is not as well known is the variety of scripting languages available for customizing LibreOffice and extending how it works. At the time of this writing these languages include LibreOffice (a.k.a. OpenOffice) Basic (similar to Visual Basic for Applications - VBA), Python, BeanShell (similar to Java) and JavaScript (the most common web scripting language).

To accomplish our automation needs we will need two parts: the first is a very simple macro inside LibreOffice and the second is a specially formatted command line that can be called from any batch file or programming language. While this document primarily describes using Microsoft Windows to perform the automation, with very few changes to the command line this works equally well on other operating systems like Mac and Linux.

The Macro

To keep things simple and easy to understand we will use LibreOffice Basic for this macro.

Start by opening the LibreOffice word processor and going to the Tools, Macros, Organize Macros, LibreOffice Basic menu section. This brings up a window showing a list of choices. Under the My Macros section is a Standard selection. Open this, select the New button, then paste the following block of code into the window that opens.


 Sub HTML2PDF( cFile )
   ' Change input to URL format path (just to make sure)
   cURL = ConvertToURL( cFile )
   ' Open the file. Assume it is a format OOo will open without specifying an import filter.
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, Array( MakePropertyValue( "Hidden", False ) ) )
   ' Give LibreOffice time to paginate the document
   wait 10000
   ' Make new file name (assumes a 4 character extension including the dot, like .htm)
   cFile = Left( cFile, Len( cFile ) - 4 ) + ".pdf"
   ' Change to URL format path
   cURL = ConvertToURL( cFile )
   ' Save the document using the web to pdf export filter
   oDoc.storeToURL( cURL, Array( MakePropertyValue( "FilterName", "writer_web_pdf_Export" ) ) )
   ' Close the file, we are done
   oDoc.close( True )
 End Sub
 Function MakePropertyValue( Optional cName As String, _
   Optional uValue ) As com.sun.star.beans.PropertyValue
   Dim oPropertyValue As New com.sun.star.beans.PropertyValue
   If Not IsMissing( cName ) Then
     oPropertyValue.Name = cName
   EndIf
   If Not IsMissing( uValue ) Then
     oPropertyValue.Value = uValue
   EndIf
   MakePropertyValue = oPropertyValue
 End Function
 

In very simple terms, the first routine, HTML2PDF, opens a file (the name is passed on the command line shown below), changes the file extension to PDF and exports the file in PDF format under the new name. The second routine, MakePropertyValue, is called by the first to package each option name and value into the structure LibreOffice uses for options. While the two could be combined, the code is much easier to maintain if they are separate.
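The file name step in HTML2PDF removes a fixed four characters, so test.htm becomes test.pdf, but a name ending in .html would come out as test..pdf. As a hedged illustration of the same step done without assuming the extension length, here is a minimal Python sketch. Python is used only for illustration, and pdf_name_for is a hypothetical helper name, not part of the macro. A calling script could use something like it to predict where the PDF will appear; for .htm inputs it agrees with the macro as written.

```python
from pathlib import PureWindowsPath


def pdf_name_for(html_path: str) -> str:
    """Return html_path with its final extension replaced by .pdf.

    Hypothetical helper: it mirrors the rename step of the HTML2PDF macro,
    but works for any extension length (.htm, .html, ...).
    PureWindowsPath is used so the result is the same on any operating system.
    """
    return str(PureWindowsPath(html_path).with_suffix(".pdf"))


# Example: the test file used in this article.
expected_pdf = pdf_name_for(r"C:\web\test.htm")
```

For other extension lengths the macro itself would need an equivalent change before the two results match.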

Use the File, Save menu or press Ctrl-S to save your changes to disk. You don't need to close the editor window, which is very helpful if you run into issues later. For example, notice the wait line in the HTML2PDF function above. This pauses the script for ten seconds (the number is in milliseconds, where a thousand milliseconds is one second) so that LibreOffice has time to load the document and locate all of the page breaks. Without this pause the script could try to save the document before it is completely loaded, which can understandably be a source of trouble. Technically this is called a race condition. If you have a very large web page, a slower computer or one that is typically too busy to calculate the page breaks quickly, adding a few more seconds to this wait time will help avoid problems.

The Command Line

The second part of this is a special command line.


 "\Program Files\LibreOffice\program\soffice.exe" -norestore "macro:///Standard.Module1.HTML2PDF(file:///C:/web/test.htm)"
 

Copy and paste this into a text file called TEST.BAT and keep it open for additional editing.

The format of this special line is simple, but it is also strict. The first part is the path to the LibreOffice word processor. Currently the Windows version of this program is named soffice.exe, which reflects the StarOffice commercial word processor roots behind the LibreOffice project.

Next we have an optional command line switch. To make automation easier we temporarily skip the restore (crash recovery) feature. Otherwise, if LibreOffice detects an earlier abrupt shutdown (loss of power, crash, etc.), it presents the user with a choice to recover the previously open document. This is obviously not a good thing for automation, which is why we want to skip it. There are other command line options you may want to investigate for use here, but for now this is enough to get us going.

The last part is how we call our macro from a command line. The syntax starts with the word macro, a colon and three forward slash marks followed by the logical LibreOffice path (using periods to mark the levels) to the macro we want. We always store our macros in Standard.Module1, which is why that appears before our macro name. Then we need to pass a file name to our macro, so that name is included in parentheses. Note that the file name must be in URL format. This is the same format used by most non-Microsoft web browsers when viewing a local HTML file. It always starts with the word file, a colon and three slash marks followed by the full path and filename of the file we want to open, with the backslash (\) marks turned into forward slash (/) marks. This whole macro call is enclosed in double quotes to make sure it stays together.
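Since this command line can be called from any programming language, the URL rules described above can also be applied in code rather than by hand. The following is a minimal Python sketch under stated assumptions: macro_argument and build_command are hypothetical helper names, and the soffice.exe path and the Standard.Module1.HTML2PDF location are simply copied from the example in this article.

```python
from pathlib import PureWindowsPath

# Assumption: install path copied from the example command line; adjust as needed.
SOFFICE = r"\Program Files\LibreOffice\program\soffice.exe"


def macro_argument(html_path: str, macro: str = "Standard.Module1.HTML2PDF") -> str:
    """Build the macro:/// argument for a local Windows file path.

    as_uri() produces the file:/// form with forward slashes. It requires an
    absolute path (with a drive letter) and raises ValueError otherwise.
    """
    file_url = PureWindowsPath(html_path).as_uri()
    return f"macro:///{macro}({file_url})"


def build_command(html_path: str, soffice: str = SOFFICE) -> list:
    """Return the command as an argument list: program, switch, macro call."""
    return [soffice, "-norestore", macro_argument(html_path)]


command = build_command(r"C:\web\test.htm")
```

Keeping the command as a list means the macro call stays together as one argument without manual quoting. Note that as_uri() percent-encodes characters such as spaces, and paths containing parentheses or commas may not survive the macro argument syntax; neither case is handled in this sketch.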

Testing The Macro

Now we have all the parts except for a local web page to test. The fastest way to get a test web page is to press Ctrl-S (the control key and S key together) in your web browser and save this page to your computer. When asked, use the file name TEST.HTM and, to make it even easier, save it into a directory called web located in the root directory of your hard drive (like C:\WEB). You may need to create the WEB directory using the Create New Directory button and then navigate into it by double clicking it before you click the Save button.

Next edit the TEST.BAT file to make sure it points to C:/WEB/TEST.HTM (or whatever path and filename you used in the step above). Save the TEST.BAT file and then double click it in Windows Explorer to make it run. This should open the LibreOffice word processor, save the PDF file and then close the word processor, all without any additional action by you.

If everything went well you now have a working HTML to PDF conversion tool that you can use to turn any web page saved to the WEB directory on your hard drive into a PDF file. But that is only the beginning. Now it is easy for almost any program to generate nice looking and highly portable output for anything ranging from reports and receipts to books all using standard HTML formatting followed by a simple macro call. You can see a live example of how this looks on the www.GoodRVfood.com web site. This is exactly how their cookbook is generated each time the web site is updated.
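For a program that wants to trigger the conversion itself instead of going through TEST.BAT, one possible shape is sketched below in Python. This is an illustration only: run_conversion is a hypothetical helper, the timeout value is arbitrary, and the sketch assumes the launched process does not return until the macro has finished. If it returns earlier on your system, the existence check would need an additional wait.

```python
import subprocess
from pathlib import Path


def run_conversion(command, expected_pdf, timeout=120):
    """Run a conversion command and report whether the PDF was produced.

    command      -- argument list, e.g. the soffice.exe call from TEST.BAT
    expected_pdf -- path where the macro is expected to write the PDF
    timeout      -- seconds to wait before giving up (arbitrary default)
    """
    pdf = Path(expected_pdf)
    # Remove stale output so an old PDF is not mistaken for success.
    if pdf.exists():
        pdf.unlink()
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(command, check=True, timeout=timeout)
    return pdf.is_file()
```

A caller would pass the same three pieces used in TEST.BAT (the soffice.exe path, the -norestore switch and the quoted macro call) as separate list items, together with the PDF path it expects, such as C:\web\test.pdf.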

First published 2010-10-21. The last major review or update of this information was on 2012-01-10.