Tuesday, July 3, 2012

Easily create a table of contents for your EPUB e-book

Recently I published a series of books on Emanuel Swedenborg on Amazon, and I thought I would share some tips for anyone else planning to publish an e-book. I chose the EPUB format for the books, as it seemed to be the most universal. Amazon has its own format, a variant of MOBI, but it will accept an EPUB e-book and automatically convert it. Along the way I hit a few roadblocks; I will discuss two of them here, both concerning the table of contents.

For those of you not familiar with the EPUB format, it is for the most part a ZIP file containing the book text in XHTML format. One unusual part of the specification is the XML format chosen for the table of contents, a file typically named "toc.ncx". Here is a portion of that file from one of the books I published, The Divine Revelation of the New Jerusalem:

<navPoint id="DR-frontcover" playOrder="1">
<navLabel><text>Front Cover</text></navLabel>
<content src="Text/DR-frontcover.xml"/>
</navPoint>
<navPoint id="DR-edition" playOrder="2">
<navLabel><text>About this Edition</text></navLabel>
<content src="Text/DR-edition.xml"/>
</navPoint>
<navPoint id="DR-preface" playOrder="4">
<navLabel><text>Editor's Preface</text></navLabel>
<content src="Text/DR-preface.xml"/>
</navPoint>

In the above section, there are three entries: the front cover, a section about the edition, and an editor's preface. For The Divine Revelation of the New Jerusalem, the table of contents was huge, and where I hit a big problem was the "playOrder" attribute. According to the specification, each entry must have a playOrder attribute, beginning at 1 and incremented by 1 for each entry. That means that if you create a table of contents and later decide to add an entry in the middle, you have to renumber every playOrder value after it. Why? I am not sure. The values could simply have followed the order of the entries in the XML document, but the specification did not define it that way. For the book I was working on, the table of contents has 1,076 entries. And notice that in the above example the playOrder values are not completely in sequence: they jump from 2 to 4. What a pain!
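To find out where a toc.ncx breaks the sequence before a validator complains, a quick check along the following lines may help. This is only a minimal sketch in Python using the standard library; the function name `play_order_gaps` and the inline `SAMPLE` document are made up for illustration, and the check assumes no duplicate entries (it expects a plain 1, 2, 3, ... run in document order).

```python
import xml.etree.ElementTree as ET

NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"

def play_order_gaps(ncx_xml):
    """Return (id, expected, found) for every navPoint that breaks a 1, 2, 3, ... run."""
    root = ET.fromstring(ncx_xml)
    problems = []
    # iter() walks the tree in document order, nested navPoints included
    for expected, nav_point in enumerate(root.iter(NCX_NS + "navPoint"), start=1):
        found = int(nav_point.get("playOrder"))
        if found != expected:
            problems.append((nav_point.get("id"), expected, found))
    return problems

# Hypothetical minimal NCX wrapping the three entries shown earlier
SAMPLE = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/"><navMap>
<navPoint id="DR-frontcover" playOrder="1"><navLabel><text>Front Cover</text></navLabel><content src="Text/DR-frontcover.xml"/></navPoint>
<navPoint id="DR-edition" playOrder="2"><navLabel><text>About this Edition</text></navLabel><content src="Text/DR-edition.xml"/></navPoint>
<navPoint id="DR-preface" playOrder="4"><navLabel><text>Editor's Preface</text></navLabel><content src="Text/DR-preface.xml"/></navPoint>
</navMap></ncx>"""

print(play_order_gaps(SAMPLE))  # [('DR-preface', 3, 4)]
```

The third entry is reported because it carries 4 where 3 is expected.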

If you ask me whether I spent the time numbering all the playOrder values by hand: no. I was constantly changing the table of contents, so I initially set every playOrder value to 1. In the end, though, the EPUB failed a validation check on the playOrder sequence, and I knew I could not just ignore this part of the EPUB specification. As the toc.ncx file is an XML file, I decided to use an XSLT stylesheet to fix the problem. Applying the following XSLT stylesheet to the toc.ncx file automatically fixes all the playOrder values:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:ncx="http://www.daisy.org/z3986/2005/ncx/">

<!-- Unique content paths; this relies on the processor returning them in first-occurrence order -->
<xsl:variable name="paths" select="distinct-values(//ncx:content/normalize-space(@src))"/>
<!-- Recursive copy template -->
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*"/>
</xsl:copy>
</xsl:template>
<!-- Always compute a sequential value for playOrder -->
<xsl:template match="@playOrder">
<xsl:attribute name="playOrder">
<xsl:value-of select="index-of($paths, parent::*/ncx:content/normalize-space(@src))"/>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>

The above stylesheet not only renumbers all the playOrder values; it also ensures that duplicate entries in the table of contents (entries pointing to the same content file) each get the same playOrder value, according to the specification. Although duplicate entries are allowed by the EPUB specification, Amazon didn't like them, so I eventually removed them. Note that the stylesheet uses the XPath 2.0 functions distinct-values() and index-of(), so it needs an XSLT 2.0 processor. To apply XSLT stylesheets you will need an XML editor; I used Altova XMLSpy. Another good editor, with EPUB-specific features, is Oxygen XML Editor.
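For readers without an XSLT 2.0 processor at hand, the same renumbering idea can be sketched in Python with the standard library. This is not the stylesheet itself, just an approximation of its logic under the same assumption: entries are numbered by the first appearance of their content path, so duplicates share a value. The function name `renumber_play_order` and the `SAMPLE` document are hypothetical.

```python
import xml.etree.ElementTree as ET

NCX_URI = "http://www.daisy.org/z3986/2005/ncx/"
NCX_NS = "{" + NCX_URI + "}"

def renumber_play_order(ncx_xml):
    """Return the NCX text with playOrder rewritten by first appearance of each src."""
    ET.register_namespace("", NCX_URI)  # serialize without an ns0: prefix
    root = ET.fromstring(ncx_xml)
    order_by_src = {}
    for nav_point in root.iter(NCX_NS + "navPoint"):  # document order
        content = nav_point.find(NCX_NS + "content")
        src = " ".join(content.get("src").split())  # like normalize-space()
        # setdefault hands out the next number only the first time a src is seen
        order = order_by_src.setdefault(src, len(order_by_src) + 1)
        nav_point.set("playOrder", str(order))
    return ET.tostring(root, encoding="unicode")

# Hypothetical input: every playOrder set to 1, with one duplicate entry
SAMPLE = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/"><navMap>
<navPoint id="a" playOrder="1"><navLabel><text>A</text></navLabel><content src="Text/a.xml"/></navPoint>
<navPoint id="b" playOrder="1"><navLabel><text>B</text></navLabel><content src="Text/b.xml"/></navPoint>
<navPoint id="a-again" playOrder="1"><navLabel><text>A again</text></navLabel><content src="Text/a.xml"/></navPoint>
<navPoint id="c" playOrder="1"><navLabel><text>C</text></navLabel><content src="Text/c.xml"/></navPoint>
</navMap></ncx>"""

fixed = ET.fromstring(renumber_play_order(SAMPLE))
print([p.get("playOrder") for p in fixed.iter(NCX_NS + "navPoint")])  # ['1', '2', '1', '3']
```

The duplicate entry for Text/a.xml keeps the value 1, and the remaining entries are numbered without gaps.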

So with the above solution I thought I was done, but when I converted the book to Amazon's format, Amazon complained that I did not have a table of contents. They want to see a table of contents that appears in the book itself; the toc.ncx file was not enough. Again, my table of contents has 1,076 entries, and there was no way I was going to build this by hand. So instead I wrote another XSLT stylesheet which, when applied to a toc.ncx file, converts it into a valid XHTML file that can be used as the table of contents:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" exclude-result-prefixes="ncx xsl" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:ncx="http://www.daisy.org/z3986/2005/ncx/" xmlns="http://www.w3.org/1999/xhtml">

<xsl:output method="xml" indent="no" encoding="UTF-8" doctype-system="http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" doctype-public="-//W3C//DTD XHTML 1.1//EN"/>

<xsl:preserve-space elements="*"/>

<!-- Page skeleton: the html, head, title, body and div wrappers are a reconstruction; adjust to your book -->
<xsl:template match="ncx:ncx">
<html>
<head>
<title>Contents</title>
<link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css" />
</head>
<body>
<p class="hdrbig">CONTENTS</p>
<div>
<xsl:apply-templates select="ncx:navMap|ncx:pageList"/>
</div>
</body>
</html>
</xsl:template>

<xsl:template match="ncx:navPoint">
<a href="{ncx:content/@src}"><xsl:value-of select="ncx:navLabel/ncx:text/text()"/></a><br/>
<!-- Only some navPoints have more navPoints; indent those (wrapper element and margin are illustrative) -->
<xsl:if test="ncx:navPoint">
<div style="margin-left: 2em">
<xsl:apply-templates select="ncx:navPoint"/>
</div>
</xsl:if>
</xsl:template>
</xsl:stylesheet>

The advantage of the above stylesheet is that it takes nested entries in the toc.ncx table of contents into account and indents them in the HTML version of the table of contents. The example also generates a link to a global CSS stylesheet that I was using, so you will need to adjust it slightly to fit your own purposes.
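The recursion over nested entries can also be written out explicitly. The following Python sketch, again using only the standard library, walks the navPoint tree and emits one link per entry, indented by nesting depth. It is an illustration of the idea rather than a replacement for the stylesheet: the helper names, the 2em indent, the inline style, and the `SAMPLE` document are all assumptions, and it produces only the list of links, not a full XHTML page.

```python
import xml.etree.ElementTree as ET
from html import escape

NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"

def nav_point_lines(nav_point, depth=0):
    """Yield one HTML line per entry; nested entries are indented 2em per level."""
    label = nav_point.findtext(NCX_NS + "navLabel/" + NCX_NS + "text", default="")
    src = nav_point.find(NCX_NS + "content").get("src")
    yield '<div style="margin-left: %dem"><a href="%s">%s</a></div>' % (
        2 * depth, escape(src, quote=True), escape(label))
    # findall() with a bare tag matches direct children only, so each level recurses once
    for child in nav_point.findall(NCX_NS + "navPoint"):
        yield from nav_point_lines(child, depth + 1)

def toc_body(ncx_xml):
    """Return the HTML lines for every top-level navPoint in the navMap."""
    root = ET.fromstring(ncx_xml)
    lines = []
    for top in root.findall(NCX_NS + "navMap/" + NCX_NS + "navPoint"):
        lines.extend(nav_point_lines(top))
    return "\n".join(lines)

# Hypothetical input: one part containing one chapter
SAMPLE = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/"><navMap>
<navPoint id="part1" playOrder="1"><navLabel><text>Part I</text></navLabel><content src="Text/part1.xml"/>
<navPoint id="ch1" playOrder="2"><navLabel><text>Chapter 1</text></navLabel><content src="Text/ch1.xml"/></navPoint>
</navPoint>
</navMap></ncx>"""

print(toc_body(SAMPLE))
```

For the sample, the part is emitted with no indent and the chapter beneath it with a 2em left margin.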
