{ "metadata": { "name": "", "signature": "sha256:79e7f4505df4df7b3f16885d4d975832795dea0dd4b4d0790179dc25b15f8eee" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Let me see some examples of OPDS in the wild to see how it works:\n", "\n", "available feeds: https://code.google.com/p/openpub/wiki/AvailableFeeds\n", "\n", "let's look at archive.org, which presumably should have a good feed\n", "\n", "* archive.org: http://bookserver.archive.org/catalog/\n", "* feedbooks.com: http://www.feedbooks.com/catalog.atom\n", "* oreilly.com: http://opds.oreilly.com/opds/\n" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Some concepts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "http://www.slideshare.net/fullscreen/HadrienGardeur/understanding-opds/7\n", "\n", "OPDS is based on\n", "\n", "* resources\n", "* collections \n", "\n", "A collection aggregates resources.\n", "\n", "Two kinds of resources:\n", "\n", "* Navigation link \n", "* Catalog entry \n", "\n", "for two kinds of collections:\n", "\n", "* Navigation \n", "* Acquisition" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Acquisition scenarios" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple acquisition scenarios:\n", " \n", "* Open Access\n", "* Sale\n", "* Lending\n", "* Subscription\n", "* Extract\n", "* Undefined" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import requests\n", "from lxml.etree import fromstring\n", "\n", "ATOM_NS = \"http://www.w3.org/2005/Atom\"\n", "\n", "def nsq(url, tag):\n", " return \"{\" + url +\"}\" + tag\n", "\n", "url = \"http://bookserver.archive.org/catalog/\"\n", " \n", "r = requests.get(url)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "doc=fromstring(r.text)\n", "doc" ], "language": "python", "metadata": {}, "outputs": [] }, { 
"cell_type": "code", "collapsed": false, "input": [ "# get links\n", "# what types specified in spec?\n", "\n", "[link.attrib for link in doc.findall(nsq(ATOM_NS,'link'))]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "it might be useful to use specialized libraries to handle Atom or AtomPub." ] }, { "cell_type": "code", "collapsed": false, "input": [ "doc.findall(nsq(ATOM_NS, \"entry\"))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Atom feed generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://github.com/sramana/pyatom\n", "\n", " pip install pyatom" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# let's try the basics of pyatom\n", "# puzzled wwhere come from.\n", "\n", "from pyatom import AtomFeed\n", "import datetime\n", "\n", "feed = AtomFeed(title=\"Unglue.it\",\n", " subtitle=\"Unglue.it OPDS Navigation\",\n", " feed_url=\"https://unglue.it/opds\",\n", " url=\"https://unglue.it/\",\n", " author=\"unglue.it\")\n", "\n", "# Do this for each feed entry\n", "feed.add(title=\"My Post\",\n", " content=\"Body of my post\",\n", " content_type=\"html\",\n", " author=\"Me\",\n", " url=\"http://example.org/entry1\",\n", " updated=datetime.datetime.utcnow())\n", "\n", "print feed.to_string()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Creating navigation feed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "template: https://gist.github.com/rdhyee/94d82f6639809fb7796f#file-unglueit_nav_opds-xml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "````xml\n", "\n", " Unglue.it Catalog\n", " https://unglue.it/opds\n", " 2014-06-13T21:48:34Z\n", " \n", " unglue.it\n", " https://unglue.it\n", " \n", " \n", " \n", " \n", " \n", " Creative Commons\n", " 
https://unglue.it/creativecommons/\n", " 2014-06-13T00:00:00Z\n", " \n", " These Creative Commons licensed ebooks are ready to read - the people who created them want you to read and share them..\n", " \n", " \n", " Active Campaigns\n", " https://unglue.it/campaigns/ending#2\n", " 2014-06-13T00:00:00Z\n", " \n", " With your help we're raising money to make these books free to the world.\n", " \n", "````" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from lxml import etree\n", "import datetime\n", "import pytz\n", "\n", "def text_node(tag, text):\n", " node = etree.Element(tag)\n", " node.text = text\n", " return node\n", "\n", "def entry_node(title, id_, updated, link_href, link_type, content):\n", " node = etree.Element(\"entry\")\n", " node.append(text_node(\"title\", title))\n", " node.append(text_node(\"id\", id_))\n", " node.append(text_node(\"updated\", updated))\n", " \n", " link_node = etree.Element(\"link\")\n", " link_node.attrib.update({'href':link_href, 'type':link_type})\n", " node.append(link_node)\n", " \n", " node.append(text_node(\"content\", content))\n", " return node\n", "\n", "feed_xml = \"\"\"\"\"\"\n", "\n", "feed = etree.fromstring(feed_xml)\n", "\n", "# add title\n", "\n", "feed.append(text_node('title', \"Unglue.it Catalog\"))\n", "\n", "# id \n", "\n", "feed.append(text_node('id', \"https://unglue.it/opds\"))\n", "\n", "# updated\n", "\n", "feed.append(text_node('updated',\n", " pytz.utc.localize(datetime.datetime.utcnow()).isoformat()))\n", "\n", "# author\n", "\n", "author_node = etree.Element(\"author\")\n", "author_node.append(text_node('name', 'unglue.it'))\n", "author_node.append(text_node('uri', 'https://unglue.it'))\n", "feed.append(author_node)\n", "\n", "# start link\n", "\n", "start_link = etree.Element(\"link\")\n", "start_link.attrib.update({\"rel\":\"start\",\n", " \"href\":\"https://unglue.it/opds\",\n", " \"type\":\"application/atom+xml;profile=opds-catalog;kind=navigation\",\n", "})\n", 
"feed.append(start_link)\n", "\n", "# crawlable link\n", "\n", "crawlable_link = etree.Element(\"link\")\n", "crawlable_link.attrib.update({\"rel\":\"http://opds-spec.org/crawlable\", \n", " \"href\":\"https://unglue.it/opds/crawlable\",\n", " \"type\":\"application/atom+xml;profile=opds-catalog;kind=acquisition\",\n", " \"title\":\"Crawlable feed\"})\n", "feed.append(crawlable_link)\n", "\n", "# CC entry_node\n", "\n", "cc_entry = entry_node(title=\"Creative Commons\",\n", " id_=\"https://unglue.it/creativecommons/\",\n", " updated=\"2014-06-13T00:00:00Z\",\n", " link_href=\"creativecommons.xml\",\n", " link_type=\"application/atom+xml;profile=opds-catalog;kind=acquisition\",\n", " content=\"These Creative Commons licensed ebooks are ready to read - the people who created them want you to read and share them..\")\n", "feed.append(cc_entry)\n", "\n", "print etree.tostring(feed, pretty_print=True)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Writing Crawlable Feed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "````xml\n", " \n", " Unglue.it Catalog -- 1 to 1 of 2000 -- crawlable feed\n", " https://unglue.it/opds/crawlable\n", " 2014-06-16T00:00:00Z\n", " \n", " \n", " \n", " unglue.it\n", " https://unglue.it\n", " \n", " \n", " \n", " Oral Literature In Africa\n", " https://unglue.it/work/81834/\n", " 2013-07-17T23:27:37Z\n", " \n", " \n", " \n", " \n", " 2012\n", " \n", " Ruth Finnegan\n", " \n", " \n", " \n", " \n", " Open Book Publishers\n", " en\n", " \n", " \n", "\n", "````" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# crawlable feed\n", "\n", "from itertools import islice\n", "\n", "from lxml import etree\n", "import datetime\n", "import urlparse\n", "\n", "import pytz\n", "\n", "from regluit.core import models\n", "import regluit.core.cc as cc\n", "\n", "licenses = cc.LICENSE_LIST\n", "\n", "FORMAT_TO_MIMETYPE = {'pdf':\"application/pdf\",\n", " 
'epub':\"application/epub+zip\",\n", " 'mobi':\"application/x-mobipocket-ebook\",\n", " 'html':\"text/html\",\n", " 'text':\"text/html\"}\n", "\n", "def text_node(tag, text):\n", " node = etree.Element(tag)\n", " node.text = text\n", " return node\n", "\n", "def map_to_unglueit(url):\n", " m = list(urlparse.urlparse(url))\n", " (m[0], m[1]) = ('https','unglue.it')\n", " return urlparse.urlunparse(m)\n", "\n", "def work_node(work):\n", " node = etree.Element(\"entry\")\n", " # title\n", " node.append(text_node(\"title\", work.title))\n", " \n", " # id\n", " node.append(text_node('id', \"https://unglue.it{0}\".format(work.get_absolute_url())))\n", " \n", " # updated -- using creation date\n", " node.append(text_node('updated', work.created.isoformat()))\n", " \n", " # links for all ebooks\n", " \n", " for ebook in work.ebooks():\n", " link_node = etree.Element(\"link\")\n", " link_node.attrib.update({\"href\":map_to_unglueit(ebook.download_url),\n", " \"type\":FORMAT_TO_MIMETYPE.get(ebook.format, \"\"),\n", " \"rel\":\"http://opds-spec.org/acquisition\"})\n", " node.append(link_node)\n", " \n", " # get the cover -- assume jpg?\n", " \n", " cover_node = etree.Element(\"link\")\n", " cover_node.attrib.update({\"href\":work.cover_image_small(),\n", " \"type\":\"image/jpeg\",\n", " \"rel\":\"http://opds-spec.org/image/thumbnail\"})\n", " node.append(cover_node)\n", " \n", " # issued date, e.g. 2012\n", " node.append(text_node(\"{http://purl.org/dc/terms/}issued\", work.publication_date))\n", " \n", " # author\n", " # TO DO: include all authors?\n", " author_node = etree.Element(\"author\")\n", " author_node.append(text_node(\"name\", work.author()))\n", " node.append(author_node)\n", " \n", " # publisher, e.g. Open Book Publishers\n", " # (uses the dcterms publisher element, not issued)\n", " if len(work.publishers()):\n", " for publisher in work.publishers():\n", " node.append(text_node(\"{http://purl.org/dc/terms/}publisher\", publisher.name.name))\n", " \n", " # language, e.g. en\n", " 
node.append(text_node(\"{http://purl.org/dc/terms/}language\", work.language))\n", "\n", " # subject tags\n", " # [[subject.name for subject in work.subjects.all()] for work in ccworks if work.subjects.all()]\n", " if work.subjects.all():\n", " for subject in work.subjects.all():\n", " category_node = etree.Element(\"category\")\n", " category_node.attrib[\"term\"] = subject.name \n", " node.append(category_node)\n", " \n", " return node\n", "\n", "feed_xml = \"\"\"\"\"\"\n", "\n", "feed = etree.fromstring(feed_xml)\n", "\n", "# add title\n", "# TO DO: will need to calculate the number items and where in the feed we are\n", "\n", "feed.append(text_node('title', \"Unglue.it Catalog: crawlable feed\"))\n", "\n", "# id \n", "\n", "feed.append(text_node('id', \"https://unglue.it/opds/crawlable\"))\n", "\n", "# updated\n", "# TO DO: fix time zone?\n", "\n", "feed.append(text_node('updated',\n", " pytz.utc.localize(datetime.datetime.utcnow()).isoformat()))\n", "\n", "# author\n", "\n", "author_node = etree.Element(\"author\")\n", "author_node.append(text_node('name', 'unglue.it'))\n", "author_node.append(text_node('uri', 'https://unglue.it'))\n", "feed.append(author_node)\n", "\n", "# links: start, self, next/prev (depending what's necessary -- to start with put all CC books)\n", "\n", "# start link\n", "\n", "start_link = etree.Element(\"link\")\n", "start_link.attrib.update({\"rel\":\"start\",\n", " \"href\":\"https://unglue.it/opds\",\n", " \"type\":\"application/atom+xml;profile=opds-catalog;kind=navigation\",\n", "})\n", "feed.append(start_link)\n", "\n", "# self link\n", "\n", "self_link = etree.Element(\"link\")\n", "self_link.attrib.update({\"rel\":\"self\",\n", " \"href\":\"https://unglue.it/opds/crawlable\",\n", " \"type\":\"application/atom+xml;profile=opds-catalog;kind=acquisition\",\n", "})\n", "feed.append(self_link)\n", "\n", "licenses = cc.LICENSE_LIST\n", "\n", "ccworks = models.Work.objects.filter(editions__ebooks__isnull=False, \n", " 
editions__ebooks__rights__in=licenses).distinct().order_by('-created')\n", "\n", "for work in islice(ccworks,None):\n", " node = work_node(work)\n", " feed.append(node)\n", "\n", "print etree.tostring(feed, pretty_print=True)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# how to get CC books?\n", "# make use of CCListView: https://github.com/Gluejar/regluit/blob/b675052736f79dcb8d84ddc6349c99fa392fa9bc/frontend/views.py#L878\n", "# template: https://github.com/Gluejar/regluit/blob/b675052736f79dcb8d84ddc6349c99fa392fa9bc/frontend/templates/cc_list.html\n", "\n", "from regluit.core import models\n", "import regluit.core.cc as cc\n", "\n", "licenses = cc.LICENSE_LIST\n", "\n", "ccworks = models.Work.objects.filter(editions__ebooks__isnull=False, \n", " editions__ebooks__rights__in=licenses).distinct().order_by('-created')\n", "ccworks" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "dir(ccworks[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "work = ccworks[0]\n", "ebook = work.ebooks()[0]\n", "dir(ebook)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from collections import Counter\n", "\n", "c = Counter()\n", "\n", "for work in islice(ccworks,None):\n", " c.update([ebook.format for ebook in work.ebooks()])\n", " \n", "print c\n", "\n", "#[[ebook.format for ebook in work.ebooks()] for work in islice(ccworks,1)]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Calling regluit.core.opds code" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from regluit.core import opds\n", "opds.creativecommons()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, 
"source": [ "Dealing URLs of downloaded books" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from regluit.core.models import Work\n", "\n", "work = Work.objects.get(id=137688)\n", "[ebook.download_url for ebook in work.ebooks()]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Tacking on a query component to a URL" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "\n", "def add_query_component(url, qc):\n", " m = list(urlparse.urlparse(url))\n", " if len(m[4]):\n", " m[4] = \"&\".join([m[4],qc])\n", " else:\n", " m[4] = qc\n", " return urlparse.urlunparse(m)\n", "\n", "add_query_component(\"https://unglue.it/download_ebook/906/\", \"feed=opds\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Getting works of active campaigns " ] }, { "cell_type": "code", "collapsed": false, "input": [ "campaigns = models.Campaign.objects.filter(status='ACTIVE').order_by('deadline')\n", "campaigns" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "models.Work.objects.filter(campaigns__status='ACTIVE')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "must exclude campaigns without ebooks" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from regluit.core import models\n", "from django.db.models import Q\n", "\n", "len(models.Work.objects.filter(campaigns__status='ACTIVE').order_by('-created'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "works = models.Work.objects.filter(campaigns__status='ACTIVE',\n", " editions__ebooks__isnull=False).distinct().order_by('-created')\n", "[w.ebooks() for w in works]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": 
"code", "collapsed": false, "input": [ "works = Work.objects.all()\n", "work = works[0]\n", "work.ebooks()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "models.Work.objects.filter(work)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Appendix: dealing with namespaces in ElementTree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe come back to http://effbot.org/zone/element-namespaces.htm for more sophisticated ways to register namespaces." ] } ], "metadata": {} } ] }