regluit/notebooks/doab_loading.ipynb

{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from __future__ import print_function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# optionally copy doab.json from DOAB repo\n",
"\n",
"import shutil\n",
"\n",
"# toggle the boolean to set whether to copy\n",
"if False:\n",
"    shutil.copyfile(\"/Users/raymondyee/D/Document/Gluejar/Gluejar.github/DOAB/doab.json\",\n",
"                    \"../bookdata/doab.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load up Django settings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"import django\n",
"\n",
"# http://stackoverflow.com/questions/24793351/django-appregistrynotready\n",
"\n",
"os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'regluit.settings.me')\n",
"django.setup()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading the list of books"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import io\n",
"import json\n",
"\n",
"# io.open decodes as UTF-8 on Python 2 and 3; the with-block closes the file\n",
"with io.open(\"../bookdata/doab.json\", encoding='UTF-8') as f:\n",
"    records = json.load(f)\n",
"records[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I need to remind myself how to check that there are no outstanding Celery jobs after I do this loading.\n",
"\n",
"I have a technique for using `django-celery` monitoring that works on Redis (what we use on just and production) but not on my laptop (http://stackoverflow.com/a/5451479/7782). I think a workable way is to look at the `celery_taskmeta` table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"limit = None"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from regluit.core.loaders import doab\n",
"\n",
"file_path = '../bookdata/doab.json'\n",
"\n",
"# int(None) raises TypeError, so convert only when a limit is actually set\n",
"doab.load_doab_records(file_path, limit=None if limit is None else int(limit))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import djcelery\n",
"[t.status for t in djcelery.models.TaskMeta.objects.all()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tests for the loading\n",
"\n",
" * can we find all the URLs?\n",
" * is it associated with the right doab_id?\n",
" * all the ISBNs loaded?\n",
" * which books are not matched with Google Books IDs -- and therefore might require URLs for covers?\n",
" * did I make sure the edition I'm attaching the ebooks to is the \"selected edition\"?\n",
" * for editions that I create [and maybe all editions?], attach a cover_image from DOAB.\n",
" * all clustered around the same work? (or do I need to do further merging?)\n",
" * are we creating extraneous works?\n",
" * subject metadata\n",
" * are we loading all the useful metadata? \n",
" * is the loading script idempotent?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Important limit to testing\n",
"\n",
"I have written code to handle the loading of all ISBNs associated with DOAB records. However, because we upload only records with non-null licenses, we will have only one ISBN per DOAB record for records with known licenses. So the loading of works for which we know the license won't exercise the code in question:\n",
"https://github.com/Gluejar/regluit/blob/5b3a8d7b1302bc1b1985c675add06c345567a7a1/core/doab.py#L91\n",
"I also checked that there is no intersection of DOAB ids between records with known licenses and those without."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from regluit.core.models import Work, Edition, Ebook, Identifier\n",
"from regluit.core.isbn import ISBN\n",
"from itertools import islice\n",
"\n",
"import traceback\n",
"import sys\n",
"\n",
"\n",
"tests_exceptions = []\n",
"no_google_book_id = []\n",
"all_problems = []\n",
"cover_problems = []\n",
"\n",
"os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'regluit.settings.me')\n",
"django.setup()\n",
"\n",
"records_to_load = list(islice(records,limit))\n",
"\n",
"for record in islice(records_to_load, None):\n",
"    d = dict(record)\n",
"\n",
"    problems = []\n",
"\n",
"    try:\n",
"\n",
"        # has a work been associated with the doab_id?\n",
"\n",
"        work = Identifier.objects.get(type='doab',\n",
"                                      value=d.get('doab_id')).work\n",
"\n",
"        edition = work.selected_edition\n",
"\n",
"        # check only one ebook with this URL.\n",
"\n",
"        # check for url if format\n",
"        if d.get('format') in ('pdf', 'epub', 'mobi'):\n",
"            ebooks = Ebook.objects.filter(url=d.get('url'))\n",
"            if len(ebooks) != 1:\n",
"                problems.append(\"len(ebooks): {}\".format(len(ebooks)))\n",
"\n",
"        # all the ISBNs loaded?\n",
"        # this code might be a bit inefficient given there might only be one isbn per record\n",
"\n",
"        isbns = [ISBN(i).to_string() for i in d.get('isbns')]\n",
"        if not(set(isbns) == set([id_.value for id_ in Identifier.objects.filter(type=\"isbn\",\n",
"                                value__in=isbns)])):\n",
"            problems.append(\"isbns not matching\")\n",
"\n",
"        if problems:\n",
"            all_problems.append((d, problems))\n",
"\n",
"        # check on presence of Google books id\n",
"        if len(edition.identifiers.filter(type=\"goog\")) < 1:\n",
"            no_google_book_id.append(d)\n",
"\n",
"        # check on the cover URLs\n",
"        #print (edition.work.cover_image_small())\n",
"        if edition.work.cover_image_small().find(\"amazonaws\") < 0:\n",
"            cover_problems.append((d))\n",
"\n",
"    except Exception as e:\n",
"        (exc_type, exc_value, exc_tb) = sys.exc_info()\n",
"        stack_trace = \" \".join(traceback.format_exception(exc_type, exc_value, exc_tb))\n",
"\n",
"        tests_exceptions.append((d, (e, stack_trace)))\n",
" \n",
"print (\"number of records loaded\", len(records_to_load))\n",
"print ()\n",
"print (\"all_problems\", all_problems)\n",
"print ()\n",
"print (\"tests_exceptions\", tests_exceptions)\n",
"print ()\n",
"print (\"no_google_book_id\", no_google_book_id)\n",
"print ()\n",
"print (\"cover problems\", cover_problems)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"all_problems[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# work through test exceptions\n",
"\n",
"for (d, (e, trace)) in tests_exceptions:\n",
"    print(d, trace)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# work through specific example\n",
"\n",
"d = dict(records_to_load[2])\n",
"\n",
"url = d.get('url')\n",
"print(url)\n",
"\n",
"# has a work been associated with the doab_id?\n",
"\n",
"work = Identifier.objects.get(type='doab',\n",
"                              value=d.get('doab_id')).work\n",
"\n",
"edition = work.selected_edition\n",
"\n",
"# check for url if format\n",
"if d.get('format') in ('pdf', 'epub', 'mobi'):\n",
"    ebooks = Ebook.objects.filter(url=url)\n",
"    print (len(ebooks))\n",
"\n",
"# google id\n",
"print ( len(edition.identifiers.filter(type=\"goog\")), len(edition.work.identifiers.filter(type=\"goog\")))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"d"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ebook = doab.load_doab_edition(**d)\n",
"ebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ebook is None"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"d"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from regluit.core import bookloader \n",
"isbn = '9788575414088'\n",
"\n",
"ed1 = bookloader.add_by_isbn(isbn)\n",
"ed1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"all_problems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stop"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"raise Exception(\"Stop here\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# invalid ISBNs?\n",
"\n",
"for (d, p) in all_problems:\n",
"    print (d['isbns'][0], ISBN(d['isbns'][0]).valid)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"[(d['doab_id'], d['isbns'][0]) for d in no_google_book_id]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# it is possible to do a query for a whole set of values, a technique I might make use of.\n",
"# http://stackoverflow.com/a/9304968\n",
"# e.g., Blog.objects.filter(pk__in=[1,4,7])\n",
"\n",
"urls = [dict(record).get('url') for record in records_to_load]\n",
"set([ebook.url for ebook in Ebook.objects.filter(url__in=urls)]) == set(urls)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Code I was working out to use Django querysets to pull out relationships among ebooks, editions, works"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from regluit.core.models import (Ebook, Edition, Work)\n",
"from django.db.models import (Q, F)\n",
"\n",
"# models.Identifier.objects.filter(edition__isnull=False).filter(~Q(edition__work__id = F('work__id'))).count()\n",
"\n",
"editions_with_ebooks = Edition.objects.filter(ebooks__isnull=False)\n",
"editions_with_ebooks\n",
"\n",
"edition = editions_with_ebooks[0]\n",
"print (edition.work_id)\n",
"work = edition.work\n",
"print (work.editions.all())\n",
"# didn't know you should use distinct()\n",
"Edition.objects.filter(Q(work__id=edition.work_id) & Q(ebooks__isnull=False)).distinct()\n",
"#Edition.objects.filter(Q(work__id=edition.work_id))\n",
"#work.objects.filter(editions__ebooks__isnull=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# let me grab ebooks and look at their parent works\n",
"\n",
"from regluit.core.models import Ebook\n",
"\n",
"[eb.edition for eb in Ebook.objects.all()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extra"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"raise Exception(\"Stop here\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Checking Celery Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Checking the results of a local celery task \n",
"from regluit.core import tasks\n",
"\n",
"task_id = \"28982485-efc3-44d7-9cf6-439645180d5d\"\n",
"result = tasks.fac.AsyncResult(task_id)\n",
"result.get()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}