Counting the Words in a LaTeX Document

A word count is usually not an appropriate measure for a mathematical document. A better approach is to give guidance on the page size, line spacing and font size to be used, and then set a limit in pages after excluding certain material; for example, one might exclude figures, tables, appendices, and front and back matter from the count.

Nevertheless, a word count, or at least an estimate of one, is sometimes required or of interest.

The standard command wc counts the lines, words and characters in a file. However, this will grossly overestimate the count for many LaTeX documents, because a large number of the "words" are actually LaTeX commands and maths. To get a more accurate estimate, there is a need to count just the actual words in the document.
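To get a feel for the size of the overestimate, here is a small sketch. The sample file and the sed patterns are illustrative assumptions, and the stripping is far cruder than the tools described below: it only drops a few structural lines, inline maths, command names and braces.

```shell
#!/bin/sh
# Create a tiny sample LaTeX file (hypothetical content, for illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
\documentclass{article}
\begin{document}
\section{Introduction}
We show that $x^2 + y^2 = z^2$ has \emph{many} integer solutions.
\end{document}
EOF

# Raw count: every LaTeX command and maths token is counted as a word.
raw=$(( $(wc -w < "$sample") ))

# Crude count: delete structural lines, remove inline maths, remove
# command names, and turn braces into spaces before counting.
prose=$(( $(sed -e '/^\\documentclass/d' \
                -e '/^\\begin/d' \
                -e '/^\\end/d' \
                -e 's/\$[^$]*\$//g' \
                -e 's/\\[A-Za-z]*//g' \
                -e 's/[{}]/ /g' "$sample" | wc -w) ))

echo "raw wc -w: $raw, after crude stripping: $prose"
rm -f "$sample"
```

On this sample the raw count is about twice the stripped count, even though the file contains only one sentence and a section title.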

Using Kile

Kile is a LaTeX editor. If you open your document in Kile and select Statistics from the File menu, you will find a word count, among other statistics.

Using untex

Use untex first to remove the TeX markup and then count the words, e.g.

untex file.tex | wc -w

The accuracy of the estimate depends partly on how many LaTeX macros of your own you use that untex fails to handle well.

Using TeXcount

TeXcount is another tool that aims to parse the LaTeX document and count the words, e.g. one can run

texcount.pl -inc -html -v -sum file.tex > results.html

which produces an HTML file that you can view in a web browser to see the overall counts and which parts of the document were included in or excluded from the count.

Note that TeXcount also provides a web-based word count service.

As with untex, the accuracy of the estimate depends partly on how many LaTeX macros of your own you use that TeXcount fails to handle well.

Using a PostScript or PDF document

An alternative approach is to count the words in the PostScript or PDF file by converting it to plain text first, e.g.

dvips -o - file.dvi | ps2ascii | wc -w

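pdftotext file.pdf - | grep -E -e '\w\w\w+' | iconv -f ISO-8859-15 -t UTF-8 | wc -w

In the second command the grep step keeps only lines that contain at least one run of three or more word characters, which drops lines holding nothing but page numbers or stray symbols; it does not remove short words from the lines it keeps. The iconv step assumes the extracted text is encoded as ISO-8859-15; if your pdftotext already writes UTF-8, as poppler-based versions do by default, that step can be omitted.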
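The difference between filtering lines and filtering words can be seen in the following sketch. The sample text is a hypothetical stand-in for the output of pdftotext or ps2ascii, and the -o option and the \w escape are assumed to be available, as they are in GNU and BSD grep.

```shell
#!/bin/sh
# Hypothetical text standing in for the output of pdftotext or ps2ascii.
text='Counting words in a document
12
x = y
is harder than it looks'

# Line filter: keeps whole lines that contain at least one run of
# three or more word characters, then counts every word on those lines.
by_line=$(( $(printf '%s\n' "$text" | grep -E '\w\w\w+' | wc -w) ))

# Word filter: -o prints each match on its own line, so only words of
# three or more characters are counted.
by_word=$(( $(printf '%s\n' "$text" | grep -oE '\w\w\w+' | wc -w) ))

echo "line filter: $by_line, word filter: $by_word"
```

The line filter is the more conservative choice for a word count, since it discards only lines that are clearly not prose; the word filter also drops short words such as "in" and "is", so it undercounts.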