Peter Heslin's Homepage > Software > Latex

latex2doc

If you are an academic using Latex for your own work, everything is wonderful as long as you are generating PDF as your output. The problem comes when you need to produce MS Word for submission to an editor of a journal or proceedings. In many fields, such as in the humanities, this is the only format that will be accepted. Unfortunately, simple converters are not adequate when you have complex bibliographical material.

This is particularly unfortunate now that Biblatex makes it easy to produce bibliographies and footnotes in the range of styles used in the humanities. It should be possible to format all the referencing in an article in a given house style simply by including a style file. If that journal says no and you need to submit the article elsewhere, just substitute a different style file and you automatically generate an entirely different style of referencing. The only problem is that you will eventually need to supply an MS Word file to someone.

The Background

For simple files, I have always used latex2rtf, but it will only translate your own markup, not the automatically generated references. As the saying goes, the only thing that will parse TeX is tex. So I used to use tex4ht to convert articles, as it processed the dvi output and was in theory capable of the task. It worked fairly well with mature bibliographical packages like jurabib, but with the rapidly evolving biblatex, it was very difficult to get a version of tex4ht that would work with a given version of biblatex. It was a very brittle process at the best of times, hard to understand and install. When I heard that the maintainer of tex4ht had passed away suddenly, I decided to look for another method.

The Theory

After messing around with a number of dvi post-processing strategies, I came up with a surprisingly simple method. I cannot decide if it is an elegant solution or a disgraceful hack.

I use pdftotext to extract the text output from the pdf, but redefine various Latex commands so that, in addition to doing whatever they are normally supposed to do, they also inject into the output a backslash, the name of the command and curly braces around the argument. Thus the text extracted from the pdf has Latex commands embedded in it, which means that the text can be processed again, this time by running it through a Latex to Word converter.
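To make the idea concrete, here is a minimal sketch of such a redefinition for \emph. It is only an illustration of the technique, not the actual contents of latex2doc.sty; the helper name \origemph is made up for this example, and it assumes compilation with Xelatex and fontspec so that the backslash and braces are real text glyphs that pdftotext can extract.

```latex
\documentclass{article}
\usepackage{fontspec}   % assumption: compiled with XeLaTeX

% Keep a copy of the normal \emph under a hypothetical name.
\let\origemph\emph

% Redefine \emph so that it still emphasizes its argument, but also
% typesets a literal backslash, the command name, and literal braces
% around the argument.
\renewcommand{\emph}[1]{\textbackslash emph\{\origemph{#1}\}}

\begin{document}
This is \emph{important} text.
\end{document}
```

In the resulting pdf the markup is visible on the page, so the text that pdftotext extracts should again contain \emph{important}, ready to be fed to a converter such as latex2rtf.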

Naturally, this is not a perfect solution and I can imagine that there are tricks with macro expansion against which this would not be robust. If you use Latex as it was meant to be used, as semantic mark-up, and put all of your fancy formatting in a style file which can be omitted for the purposes of conversion, it should mostly work. If the file latex2doc.sty has not redefined a Latex command you need to have preserved and translated, it should be fairly obvious how to extend the file to do that yourself. Remember that the point is just to translate the content, not to preserve fancy formatting effects. This method will not work for complex math, but it should preserve all Unicode text and symbols.

The trickiest things are the footnotes. These are redefined as endnotes, and then a Perl script re-inserts them into the text as footnotes.
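The first half of that step, turning footnotes into endnotes, can be sketched as follows. This assumes the standard endnotes package and its usual idiom; latex2doc.sty may well do it differently, and the re-insertion done by the Perl script is not shown here.

```latex
\documentclass{article}
\usepackage{endnotes}    % assumption: the standard endnotes package

% Make every \footnote behave as an \endnote, so that the note text is
% collected instead of being scattered across page bottoms.
\let\footnote=\endnote

\begin{document}
Some claim.\footnote{A note that would normally sit at the foot of the page.}

% Print all collected notes in one place at the end of the text.
\theendnotes
\end{document}
```

Collecting the notes in one block means that the extracted text is not interrupted by note text at every page break, which is presumably what makes it feasible for a script to match each note to its marker afterwards.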

Prerequisites

The following instructions refer to Linux; it should be possible to do the same things on other platforms, but you are on your own there. You need to have the following:

Instructions

Open your .tex file in your favorite editor and put the following text in the preamble:

 \newif\ifconvert   % declare the switch \ifconvert (false by default)
 \converttrue       % comment out this line to get the normal PDF
 \ifconvert

 \usepackage{latex2doc}
 % Put simplified font-loading here

 \else

 % Put complicated formatting here

 \fi
            

Now move any fancy formatting from your normal preamble into the place indicated for complex formatting. This includes headers, footers, etc. It doesn't matter if the output is a bit uglier -- after all, the output is going to MS Word. Also put any fontspec font definitions for Xetex in here.

Next copy those very same fontspec definitions into the place indicated for simplified font-loading. Remove from those definitions any fancy features, such as old-style figures, small caps and ligatures. If it is a font that uses ligatures by default (as most do), add the specification Ligatures=NoCommon.
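By way of illustration, a filled-in preamble might end up looking something like the sketch below. The font name, the particular fontspec features and the use of fancyhdr are assumptions chosen for the example; substitute whatever your own document actually uses.

```latex
\newif\ifconvert
\converttrue
\ifconvert

  \usepackage{latex2doc}
  % Simplified font-loading: same font, but no old-style figures
  % and no common ligatures, so the extracted text stays plain.
  \usepackage{fontspec}
  \setmainfont[Ligatures=NoCommon]{TeX Gyre Pagella}  % example font only

\else

  % Complicated formatting: the full-featured font definition,
  % plus headers and footers (example packages only).
  \usepackage{fontspec}
  \setmainfont[Numbers=OldStyle, Ligatures=Common]{TeX Gyre Pagella}
  \usepackage{fancyhdr}
  \pagestyle{fancy}

\fi
```

Keeping the same font family in both branches means only the features differ between the conversion run and the normal run, which makes it easier to check that no glyphs have gone missing.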

Now compile the file to make sure it still works. Run bibtex and run latex again if necessary. When you look at the resulting pdf, you will see a lot of latex markup, which is supposed to be there. If any markup is missing which you think the converter will be able to render, look at the latex2doc.sty file and follow those examples to redefine the commands to put the markup in the pdf output. You might also want to run pdftotext by hand and look at the output in a Unicode-aware pager or editor, making sure that there are no unknown glyphs there in places where you use exotic languages or symbols.

Finally, you should run latex2doc.pl. If all goes well, you should have a file called myfilename-convert.rtf, which OpenOffice can read and then save as a Word file. If there are any errors, you will have to hunt down the causes, for something surely went wrong with the conversion process. Check the output carefully, making sure that exotic glyphs have been preserved. The default output font is Palatino Linotype, but you can change this by editing the file fonts.cfg. Make sure that footnotes have ended up back where they belong.

Search for any stray curly brackets, and beware that latex2rtf seems to insert a stray space at the start of environments like quote.

You can revert to the fancy, nicely formatted PDF version later by commenting out the \converttrue line.

Good luck with converting your files. I'm afraid I will not be able to give any detailed support for this, which is more of a methodology than a turn-key solution.