Two methods for exporting EPUB annotations (.annot)

This post is part of the { Programming } series.

See here for a follow-up.

My personal goal for this summer break was to read more. I really enjoy reading, but I do not schedule enough time for it during the many hectic days throughout the year. Somehow the threshold for working on some project at my PC is lower than for simply sitting down in a chair with a good book. A complication was that I would go backpacking through Europe for three weeks. I needed to pack very lightly, and bringing even a single book would be a major compromise. This is where, despite my being a bit of a chauvinistic philosopher who prefers the touch of “real” books, the e-reader comes into play. I purchased a Kobo Clara HD, and I have to say that the experience has been great. During my travels I finished “Crime and Punishment” by Dostojevski, read “Slaughterhouse Five” by Kurt Vonnegut, and read half of the uncomfortably thick “The Brothers Karamazov”, also by Dostojevski. Even now that I am home, I notice how much easier it is to pick up the e-reader than a book.

While reading, I made many annotations and notes on my Kobo. Back home, I wondered how to export them to my PC, because that would save me the trouble of looking up citations on the Kobo itself, which is slow, and perhaps retyping them by hand, which is even slower. To my surprise, there was no default export option for annotations.

Method 1: adjusting the Kobo configuration file

However, a reddit user found a solution. It was suggested for another Kobo model, but it also works for my Clara HD. I summarize it here for completeness:

  1. Connect your Kobo to your computer.
  2. Find and open “Kobo eReader.conf” in the Kobo drive. Mine is at /.kobo/Kobo/, relative to the root of the Kobo e-reader.
  3. Add the following code, including the newline. This section is brand new, so it’s probably easiest to just add it at the bottom of the file:
[FeatureSettings]
ExportHighlights=true
  4. Eject the Kobo and boot it up.
  5. This adds another option to the menu that is available while reading a book: “Export highlights”, under the “Notes” tab. After you enter a filename, the annotations are saved to the root directory of the Kobo.
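
The manual steps above can also be scripted. What follows is only a minimal sketch, not part of the original reddit solution: `enable_highlight_export` is a hypothetical helper that works on the text of the configuration file, and it assumes that the file is plain text with INI-style sections. It edits the text directly rather than going through a config library, so that the settings already in the file are left untouched.

```python
def enable_highlight_export(conf_text: str) -> str:
    """Return the config text with ExportHighlights=true switched on."""
    lines = conf_text.splitlines()

    # If the key already exists (with any value), set it to true in place.
    for index, line in enumerate(lines):
        key = line.split("=", 1)[0].strip().lower()
        if "=" in line and key == "exporthighlights":
            lines[index] = "ExportHighlights=true"
            return "\n".join(lines) + "\n"

    # If only the section exists, add the key directly below its header.
    for index, line in enumerate(lines):
        if line.strip() == "[FeatureSettings]":
            lines.insert(index + 1, "ExportHighlights=true")
            return "\n".join(lines) + "\n"

    # Otherwise append the whole section at the bottom of the file.
    return conf_text.rstrip("\n") + "\n\n[FeatureSettings]\nExportHighlights=true\n"


# Demo on an in-memory example; the [Reading] section is made up.
example = "[Reading]\nfontSize=22\n"
print(enable_highlight_export(example))
```

To apply this to the device, one would read the configuration file from step 2, pass its text through the function and write the result back. Keeping a backup copy of the file first seems wise.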

The export function produces a plain text file, starting with the title of the book, followed by a separate paragraph for each annotation. A note is shown directly below the citation it belongs to, like this:

The original citation goes here
Note: this is my smart comment 

And voila! With this method you have fast access to all your annotations in an open text format, so you can directly use it in an editor of your choice.
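
If you want to process this text file further, it is simple enough to parse. The sketch below assumes the layout described above and nothing more: the title comes first, paragraphs are separated by blank lines, and a note is a line starting with “Note:”. The function name `parse_export` and the sample text are my own; a real export may contain details this does not handle, such as notes that span several lines.

```python
import re


def parse_export(text: str):
    """Split an exported highlights file into a title and a list of entries."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    blocks = [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]
    if not blocks:
        return "", []

    title, entries = blocks[0], []
    for block in blocks[1:]:
        citation_lines, note = [], None
        for line in block.splitlines():
            if line.startswith("Note:"):
                # Everything after the "Note:" marker is the reader's comment.
                note = line[len("Note:"):].strip()
            else:
                citation_lines.append(line.strip())
        entries.append({"citation": " ".join(citation_lines), "note": note})
    return title, entries


sample = """On the Heights of Despair

The original citation goes here
Note: this is my smart comment

A citation without a note
"""
title, entries = parse_export(sample)
print(title)
for entry in entries:
    print(entry)
```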

Method 2: customize the exporting to your own needs by parsing the annotation files

However, if you want to export your annotations in a different manner, you can always find the full XHTML markup with all annotations at “/Digital Editions/Annotations/books/”. On inspection, the XHTML does not contain much more relevant information than the built-in export already gave us. Per annotation we additionally get the date on which it was made, as well as some non-human-readable identifiers. Having the date of an annotation is not essential, but if you intend to archive your notes, dates give insight into, for example, a reading from a few years back, and they add some flexibility. One could for example later sort the notes by date to distinguish notes from a first and a second reading.
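
Sorting by date could look roughly as follows. This is a sketch under the assumption that every timestamp has the form 2019-08-26T11:46:10Z, which is what my own files contain; `parse_annotation_date` is just a helper name I picked for the illustration.

```python
from datetime import datetime, timezone


def parse_annotation_date(stamp: str) -> datetime:
    """Parse a timestamp such as 2019-08-26T11:46:10Z into an aware datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# (timestamp, citation) pairs, deliberately out of order.
notes = [
    ("2019-08-26T11:46:10Z", "In illness, death is always already in life."),
    ("2019-08-24T08:21:54Z", "Each subjective existence is absolute to itself."),
    ("2019-08-26T11:38:32Z", "One of the greatest delusions of the average man "
                             "is to forget that life is death's prisoner."),
]

# Oldest annotation first.
chronological = sorted(notes, key=lambda note: parse_annotation_date(note[0]))
for stamp, citation in chronological:
    print(parse_annotation_date(stamp).date(), citation)
```

Because the timestamps are parsed into real datetime objects, splitting the notes into a first and a second reading comes down to comparing against a cut-off date.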

What I would have liked in my export is some more structure, for example notes grouped by chapter. I also find it odd that the default export lists neither the author nor the publisher of the book, both of which are handy for later reference. Another argument for writing our own “export function” is that we can immediately produce an output format of choice. For example, I currently store my notes in Markdown on Github, so we could export the notes directly in Markdown syntax. Another idea is to at least number the annotations, given that there is no ordering in chapters and that the epub format has no meaningful page numbering.

If someone knows how to parse chapters and pagination from .annot files, please hit me up!

Solution with a Python script

The annotation files with the .annot extension are written in XHTML. For parsing XHTML we can use the lxml XML parser. Consider this remark on their site:

Note that XHTML is best parsed as XML, parsing it with the HTML parser can lead to unexpected results.

I like using Python, and luckily Python has a nice package called BeautifulSoup that offers a simple interface for using the lxml parser.
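
To get a feeling for what parsing such a file as XML involves, here is a small sketch. Note that it swaps in `xml.etree.ElementTree` from the standard library instead of BeautifulSoup with lxml, only so that it runs without installing anything. The `SAMPLE` document is a hypothetical, heavily simplified stand-in for a .annot file: the element names are the ones my script looks up, but the nesting and namespaces of a real file may differ. The helper names are made up as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a .annot file (not a real export).
SAMPLE = """<annotationSet xmlns:dc="http://purl.org/dc/elements/1.1/">
  <publication>
    <dc:title>On the Heights of Despair</dc:title>
    <dc:creator>E. M. Cioran</dc:creator>
  </publication>
  <annotation>
    <dc:date>2019-08-31T07:28:00Z</dc:date>
    <target><fragment><text>The melancholy look is expressionless, without perspective.</text></fragment></target>
    <content><text>A personal note</text></content>
  </annotation>
</annotationSet>
"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return element.tag.rsplit("}", 1)[-1]


def find_first(parent, name):
    """First descendant of `parent` whose local name is `name`, else None."""
    for element in parent.iter():
        if element is not parent and local_name(element) == name:
            return element
    return None


def text_of(parent, *path, default=""):
    """Follow `path` of local names down from `parent`; return its text or `default`."""
    element = parent
    for name in path:
        element = find_first(element, name)
        if element is None:
            return default  # a missing field does not crash the export
    return (element.text or default).strip()


root = ET.fromstring(SAMPLE)
annotations = [el for el in root.iter() if local_name(el) == "annotation"]

print(text_of(root, "title"))
print(text_of(root, "publisher", default="unknown"))  # absent in SAMPLE
for annotation in annotations:
    print(text_of(annotation, "date"), text_of(annotation, "target", "text"))
    print(text_of(annotation, "content", "text"))
```

Matching on local names sidesteps the namespace prefixes, and returning a default for missing elements keeps one sloppy epub from aborting the whole export.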

The Python script I wrote extracts the title, author and publisher, and writes them to a YAML metadata block at the top of the output file. Such a block can be used within Markdown files and is supported by both Github Markdown and Pandoc Markdown (the two dialects I use). When Pandoc produces a pdf through LaTeX, it reads the YAML entries, and its default template renders the fields it knows, such as title and author, as a standard LaTeX title page. This lets you create a smooth pdf without writing any LaTeX.
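
One caveat when writing values straight into a YAML block is that a title containing a colon or a quotation mark can result in invalid YAML. Below is a minimal sketch of one way around this. It relies on the assumption that double-quoted JSON strings are acceptable YAML scalars (YAML 1.2 is designed as a superset of JSON). The helper `front_matter` and the example title are made up for the illustration.

```python
import json


def front_matter(title, author, publisher=""):
    """Build a YAML metadata block with safely quoted values."""
    fields = {"title": title, "author": author, "publisher": publisher}
    lines = ["---"]
    for key, value in fields.items():
        # json.dumps adds double quotes and escapes embedded quotes and backslashes.
        quoted = json.dumps(value or "", ensure_ascii=False)
        lines.append("{}: {}".format(key, quoted))
    lines.append("---")
    return "\n".join(lines) + "\n\n"


# A made-up title with characters that would otherwise confuse a YAML parser.
print(front_matter('Notes: a "difficult" title', "E. M. Cioran"))
```

A missing publisher simply becomes an empty quoted string, so the block stays valid even when the epub lacks that field.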

The script also distinguishes between annotations and notes, and displays them differently. All annotations are displayed in a numbered list. Notes are rendered as block quotes directly below the annotation they belong to. Because the list itself is already indented, I double the block quote marker, like so: “> > ”. In Pandoc Markdown this adds extra indentation. In Github Markdown the extra “>” does nothing, but it is also not needed there, since block quotes get a different color on Github.
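
As an isolated illustration of this formatting, consider the sketch below. The function `render_entry` is a hypothetical helper and not literally what my script does: unlike the script, it collapses line breaks inside the citation and prefixes every line of a multi-line note. The note text in the demo is made up.

```python
def render_entry(number, citation, date, note=None):
    """Render one annotation as a numbered Markdown item, with an optional note."""
    # Collapse line breaks and stray spaces so the list item stays on one line.
    citation = " ".join(citation.split())
    parts = ['{}. "{}" ({})\n\n'.format(number, citation, date)]
    if note:
        # Prefix every line, so that multi-line notes stay inside the block quote.
        quoted = "\n".join("> > " + line for line in note.strip().splitlines())
        parts.append(quoted + "\n\n")
    return "".join(parts)


print(render_entry(
    36,
    "The melancholy look is expressionless, without\nperspective. ",
    "2019-08-31T07:28:00Z",
    note="First line of a note\nSecond line of the same note",
))
```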

This is the script:

import sys

from bs4 import BeautifulSoup  # third-party: beautifulsoup4 (the "lxml-xml" parser also needs lxml)

args = sys.argv[1:]

if not args:
    print('usage: kobo_export.py filename')
    sys.exit(1)

filename = args[0]

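try:
    with open(filename, "r", encoding="utf-8") as f:
        # Parse as XML rather than HTML, as recommended for XHTML
        soup = BeautifulSoup(f, "lxml-xml")
except FileNotFoundError:
    print("The annotation file was not found")
    sys.exit(1)  # stop here: without a parsed file, `soup` is undefined below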

title = soup.find('title').get_text()
author = soup.find('creator').get_text() 
publisher = soup.find('publisher').get_text()
annotations = soup.find_all('annotation')

# YAML metadata
metadata = """---
title: {}
author: {}
publisher: {}
---

""".format(title, author, publisher)

export = []
export.append(metadata)

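for i, annotation in enumerate(annotations):
    date = annotation.date.get_text()
    # The highlighted passage lives in <target>, the optional note in <content>
    citation = annotation.target.find('text').get_text()
    export.append('{}. "{}" ({})\n\n'.format(i, citation, date))
    # Guard against annotations that have no <content> element at all
    note = annotation.content.find('text') if annotation.content else None
    if note:
        export.append('> > ' + note.get_text() + "\n\n")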

with open(filename + ".md", "w", encoding="utf-8") as output:
    output.writelines(export)

The result looks good in plain text, on Github, and as a pdf produced from the Markdown with pandoc. Consider these extracted annotations from Emil Cioran’s very gloomy early work:

Plain text:

---
title: On the Heights of Despair
author: E. M. Cioran
publisher: 
---

0. "In illness, death is always already in life. Genuine ailment links us to
metaphysical realities which the healthy, average man cannot understand. Young
people talk of death as external to life. But when an illness hits them with
full power, all the illusions and seductions of youth disappear. In this world,
the only genuine agonies are those sprung from illness. " (2019-08-26T11:46:10Z)

...

6. "The vulgar interpretation of universality calls it a phenomenon of quantitative
expansion rather than a qualitatively rich containment." (2019-08-23T10:19:09Z)

7. "Each subjective existence is absolute to itself. For this reason each man lives 
as if he were the center of the universe or the center of history. Then how could
his suffering fail to be absolute? I cannot understand another's suffering in
order to diminish my own. " (2019-08-24T08:21:54Z)

8. "One of the greatest delusions
of the average man is to forget that life is death's prisoner." (2019-08-26T11:38:32Z)

... 

36. "The melancholy look is expressionless, without
perspective. " (2019-08-31T07:28:00Z)

> > The absent gaze into the infinite externalizes the spatiality
that, according to Cioran, belongs internally to melancholy

37. "The sharper our consciousness of the world's infinity,
the more acute our awareness of our own finitude" (2019-08-31T07:29:48Z)

Github

See this gist.

Pdf through LaTeX




Comments

Ashley on Thursday, Feb 27, 2020:

Thanks a million for this. Works perfectly with the Kobo Forma.

Edwin on Thursday, Feb 27, 2020
In reply to Ashley

Great to hear! Nice to verify that this also works for other models (even though it would be surprising if it didn’t).

Mitchell on Tuesday, Jun 30, 2020:

Thanks for this walkthrough! Had some issues with permissions in the root folder area on my Kobo Libra H20 but by making a copy of Kobo eReader.conf on my desktop, making the edits and then dragging it into the Kobo and replacing the file this seems to have worked! No idea why this isn’t a default feature… Might look at your python options too because I also love Python.

Edwin on Tuesday, Jun 30, 2020
In reply to Mitchell

Glad to hear! I don’t know why exactly, but my own script has stopped working for me a while back. Since then I added some more logic to handle failures and missing fields (I noticed the quality of epubs can differ quite a bit w.r.t. specifying meta data etc.). Also added some logic for sorting the entries. If you find an issue, feel free to hit me up! I can always share the updated script via email, but if there’s interest I may write a post with an update :-)

fan on Wednesday, Sep 2, 2020:

Hi Edwin,

thanks for your nice article! Im going to have a kobo glo hd soon and I would like to try the python-solution, since I really want to work with python anyway… Did there appear any problems with our initial script? Would you like to share your experiences?

Edwin on Wednesday, Sep 2, 2020
In reply to fan

I’m still very happy with the Kobo, so I can wholeheartedly recommend it. The script above is really just a first sketch. One major problem is that it doesn’t handle errors at all. So if a particular field is not found, it just crashes. I redid this a little bit more properly and also added some more formatting for the markdown output. I didn’t spend much time on it though, so it’s not necessarily code I would publish. What I would also add is the option to process all notes in batch (also didn’t do that yet). I could make a repository with what I have so far, you could take that as a point of departure. TLDR; the code above is a starting point that you could use for your own project, if your goal is also to learn a bit of Python. Otherwise I could share my code with you. I’ll perhaps later make a follow-up post when I have something that’s worth publishing, but I’m currently a bit short on time :-)

Rob on Tuesday, Dec 8, 2020:

Amazing article and script. Any chance you could please share your updated version with the error handling and better formatting you mentioned?

Edwin on Tuesday, Dec 8, 2020
In reply to Rob

Hi Rob, thanks a lot! Since you are the second to request this, I quickly made a repo where you can download the code. You can find it here. Have a look at the README for disclaimers ;-)

Edwin on Friday, Jan 8, 2021
In reply to Rob

Hi Rob, check the new blog post :-) That should help. Let me know if you have questions

Jordy on Friday, Jan 8, 2021:

Hi Edwin, I’m not literate in Python so don’t really get what I have to do once I’ve downloaded the file. Is it something that changes all .txt files exported after they are exported or is it something we have to run for each .txt files once it’s in our computer. Could you have a dummy-level tutorial that would make people like me (and my dad, who loves to annotate but always find this hard to read the raw txt) understand and proceed ?

Thanks a lot for working on this ! I don’t understand how it’s not on the Kobo by default (what good are notes if you can’t use them ? People doing this are generally writing theses or articles and want useful info like pagenumber and such :/)

All the best, Jordy

Edwin on Friday, Jan 8, 2021
In reply to Jordy

Thanks for dropping by! You are not the first to ask, so it was about time I got busy again. Here you go: https://www.edwinwenink.xyz/posts/53-update_kobo_annotation/ :-) Take care!

Jordy on Saturday, Jan 9, 2021:

Thanks a lot that is awesome ! I’ll have a look into it !