SFR-453 Initial Commit

Includes an initial version of the utility script used to generate
the copyright entry/renewal database along with instructions on how
to run the script and create a version of the database locally
api-changes
Mike Benowitz 2019-05-29 12:47:03 -04:00
commit edd14d9cdb
22 changed files with 1608 additions and 0 deletions

5
.gitignore vendored Normal file
View File

@ -0,0 +1,5 @@
/__pycache__/*
config.yaml
*.pyc
*.xml
*.code-workspace

64
README.md Normal file
View File

@ -0,0 +1,64 @@
# Copyright Registrations and Renewals DB Maintenance Script
## Introduction
This script creates and updates a database of copyright records, both original registrations and renewals, drawn from two projects supported by the NYPL to digitize and gather this source data. The goal of this project is to create a single source for all copyright information for works registered after 1922 and renewed before 1978 (see [Duration of Copyright](https://www.copyright.gov/circs/circ15a.pdf) for more information). This should provide the connections between registrations and renewals and allow users to see the connections between related registrations and renewals.
The two projects that this data is drawn from are hosted on GitHub, more information about the source of data for these projects can be found in each repository:
- (Catalog of Copyright Entries)[https://github.com/NYPL/catalog_of_copyright_entries_project]
- (Catalog of Copyright Entries Renewals)[https://github.com/NYPL/cce-renewals]
## Database
This script generates and updates a Postgresql database and ElasticSearch search layer. It scans the above repository for updated source files and inserts/updates the appropriate records, ensuring that the search layer is kept up to date following these changes.
### Structure
The database is relatively simple but contains some complexities to describe the range of conditions that exist in the copyright data. The tables currently are:
- `cce` The core table that contains one copyright entry per row, identified by a UUIDv4. This includes a title and other descriptive information
- `renewal` The core table that contains one copyright renewal per row, identified by a UUIDv4 and a renewal number.
- `registration` This table contains copyright registration numbers (e.g. `A999999`) and registration dates. There can be multiple registration numbers per entry and renewal.
- `volume` Contains information describing the CCE volume that specific records were drawn from, including a URL where a digitized version of the volume can be found. This can be used to locate the original text of a copyright entry.
- `author` The authors associated with a copyright entry. For each entry one author is designated as the `primary` author, generally drawn from the entry heading and used for lookup and sorting purposes
- `publisher` The publishers associated with a copyright entry. Each entry is tagged with a boolean `claimant` field that designates if the publisher was the copyright claimant. These entries can overlap with the `author` table due to discrepancies in how data was entered.
- `lccn` Normalized LCCN numbers associated with a copyright entry
- `renewal_claimant` Claimants who have created a renewal record
- `xml` The source XML string for copyright entries. This is stored in an XML field and can be queried with `xpath`. This table also stores successive versions of the source XML for each entry, allowing changes to be found./visualized.
- `error_cce` Contains copyright entries that could not be properly parsed. These are stored to see if the issue lies in the parser or the source data.
### Relationships
The primary relationships are between the `registration` table, entries and renewals. Through registration numbers and dates, connections can be found between entries and renewals. It should be noted that an individual entry can have multiple renewals and vice-versa, related through multiple registration numbers
## ElasticSearch
The ElasticSearch instance is designed to provide a simple, lightweight search layer. The documents in the ElasticSearch do not contain a full set of metadata, simply enough to return results that can be fully realized with an additional query to the database.
The ES instance is comprised of two indexes. `cce` for entries and `ccr` for renewals. These indexes can be queried separately or together
## Configuration
This script can be configured to run locally using the NYPL data sources or your own forks of these sources. A `config.yaml-dist` file is included here that contains the structure for the necessary environment variables. The following will need to be supplied:
- Connection details for a PostgreSQL database
- Connection details for an ElasticSearch instance
- A GitHub Personal Access Token for interacting with the GitHub API
- Repository names for the entry and renewal repositories
## Setup
This script requires Python 3.6+
To create a local instance it is recommended to create a virtualenv and install dependencies with `pip install -r requirements.txt`
## Running
The utility script can be invoked simply with `python main.py` to initialize the database and pull down the full set of source data. Please note that this will take 2-4 hours to fully download, parse and index the source data.
Several command-line arguments control the options for the execution of this script
- `--REINITIALIZE` Will drop the existing database and rebuild it from scratch. WARNING THIS WILL DELETE ALL CURRENT DATA
- `-t` or `--time` The time ago in seconds to check for updated records in the GitHub repositories. This allows for only updating changed records
- `-y` or `--year` A specific year to load from either the entries or renewals
- `-x` or `--exclude` Set to exclude either the entries (with `cce`) or the renewals (with `ccr`) from the current execution. Useful when used in conjunction with the `year` parameter to control what records are updated.

416
builder.py Normal file
View File

@ -0,0 +1,416 @@
import base64
from datetime import datetime
from github import Github
import lccnorm
from lxml import etree
import os
import re
import traceback
from model.cce import CCE
from model.errorCCE import ErrorCCE
from model.volume import Volume
from helpers.errors import DataError
class CCEReader():
def __init__(self, manager=None):
self.git = Github(os.environ['ACCESS_TOKEN'])
self.repo = self.git.get_repo(os.environ['CCE_REPO'])
self.dbManager = manager
self.cceYears = {}
def loadYears(self, selectedYear):
for year in self.repo.get_contents('/xml'):
if not re.match(r'^19[0-9]{2}$', year.name): continue
if selectedYear is not None and year.name != selectedYear: continue
yearInfo = {'path': year.path, 'yearFiles': []}
self.cceYears[year.name] = yearInfo
def loadYearFiles(self, year, loadFromTime):
yearInfo = self.cceYears[year]
for yearFile in self.repo.get_contents(yearInfo['path']):
if 'alto' in yearFile.name or 'TOC' in yearFile.name: continue
fileCommit = self.repo.get_commits(path=yearFile.path)[0]
commitDate = fileCommit.commit.committer.date
if loadFromTime is not None and commitDate < loadFromTime: continue
self.cceYears[year]['yearFiles'].append({
'filename': yearFile.name,
'path': yearFile.path,
'sha': yearFile.sha
})
def getYearFiles(self, loadFromTime):
for year in self.cceYears.keys():
print('Loading files for {}'.format(year))
self.loadYearFiles(year, loadFromTime)
def importYearData(self):
for year in self.cceYears.keys(): self.importYear(year)
def importYear(self, year):
yearFiles = self.cceYears[year]['yearFiles']
for yearFile in yearFiles: self.importFile(yearFile)
def importFile(self, yearFile):
print('Importing data from {}'.format(yearFile['filename']))
cceFile = CCEFile(self.repo, yearFile, self.dbManager.session)
cceFile.loadFileXML()
cceFile.readXML()
self.dbManager.commitChanges()
class CCEFile():
tagOptions = {
'header': 'skipElement',
'page': 'parsePage',
'copyrightEntry': 'parseEntry',
'entryGroup': 'parseGroup',
'crossRef': 'skipElement'
}
dateTypes = ['regDate', 'copyDate', 'pubDate', 'affDate']
def __init__(self, repo, cceFile, session):
self.repo = repo
self.cceFile = cceFile
self.session = session
self.xmlString = None
self.root = None
self.fileHeader = None
self.currentPage = 0
self.pagePos = 0
def loadFileXML(self):
yearBlob = self.repo.get_git_blob(self.cceFile['sha'])
self.xmlString = base64.b64decode(yearBlob.content)
self.root = etree.fromstring(self.xmlString)
def readXML(self):
self.loadHeader()
for child in self.root:
childOp = getattr(self, CCEFile.tagOptions[child.tag])
self.pagePos += 1
try:
childOp(child)
except DataError as err:
print('Caught error')
self.createErrorEntry(
getattr(err, 'uuid', child.get('id')),
getattr(err, 'regnum', None),
getattr(err, 'entry', child),
getattr(err, 'message', None)
)
except Exception as err:
print('ERROR', err)
print(traceback.print_exc())
raise err
def skipElement(self, el):
print('Skipping element {}'.format(el.tag))
def parsePage(self, page):
self.currentPage = page.get('pgnum')
self.pagePos = 0
def parseEntry(self, entry, shared=[]):
uuid = entry.get('id')
if 'regnum' not in entry.attrib:
print('Entry Missing REGNUM')
raise DataError(
'missing_regnum',
uuid=uuid,
entry=entry
)
regnums = self.loadRegnums(entry)
entryDates = self.loadDates(entry)
regDates = entryDates['regDate']
if(len(regnums) != len(regDates)):
if len(regDates) == 1 and len(regnums) > 0:
regDates = [regDates[0]] * len(regnums)
else:
raise DataError(
'regnum_date_mismatch',
uuid=uuid,
regnum='; '.join(regnums),
entry=entry
)
regs = self.createRegistrations(regnums, regDates)
existingRec = self.matchUUID(uuid)
if existingRec:
self.updateEntry(existingRec, entryDates, entry, shared, regs)
else:
self.createEntry(uuid, entryDates, entry, shared, regs)
def matchUUID(self, uuid):
return self.session.query(CCE).filter(CCE.uuid == uuid).one_or_none()
def createEntry(self, uuid, dates, entry, shared, registrations):
titles = self.createTitleList(entry, shared)
authors = self.createAuthorList(entry, shared)
copies = CCEFile.fetchText(entry, 'copies')
description = CCEFile.fetchText(entry, 'desc')
newMatter = len(entry.findall('newMatterClaimed')) > 0
publishers = [
(c.text, c.get('claimant', None))
for c in entry.findall('.//pubName')
]
lccn = [ lccnorm.normalize(l.text) for l in entry.findall('lccn') ]
cceRec = CCE(
uuid=uuid,
page=self.currentPage,
page_position=self.pagePos,
title=titles,
copies=copies,
description=description,
new_matter=newMatter,
pub_date=CCEFile.fetchDateValue(dates['pubDate'], text=False),
pub_date_text=CCEFile.fetchDateValue(dates['pubDate'], text=True),
aff_date=CCEFile.fetchDateValue(dates['affDate'], text=False),
aff_date_text=CCEFile.fetchDateValue(dates['affDate'], text=True),
copy_date=CCEFile.fetchDateValue(dates['copyDate'], text=False),
copy_date_text=CCEFile.fetchDateValue(dates['copyDate'], text=True)
)
cceRec.addRelationships(
self.fileHeader,
entry,
authors=authors,
publishers=publishers,
lccn=lccn,
registrations=registrations
)
self.session.add(cceRec)
print('INSERT', cceRec)
def updateEntry(self, rec, dates, entry, shared, registrations):
rec.title = self.createTitleList(entry, shared)
rec.copies = CCEFile.fetchText(entry, 'copies')
rec.description = CCEFile.fetchText(entry, 'desc')
rec.new_matter = len(entry.findall('newMatterClaimed')) > 0
rec.page = self.currentPage
rec.page_position = self.pagePos
rec.pub_date = CCEFile.fetchDateValue(dates['pubDate'], text=False)
rec.pub_date_text = CCEFile.fetchDateValue(dates['pubDate'], text=True)
rec.aff_date = CCEFile.fetchDateValue(dates['affDate'], text=False)
rec.aff_date_text = CCEFile.fetchDateValue(dates['affDate'], text=True)
rec.copy_date = CCEFile.fetchDateValue(dates['copyDate'], text=False)
rec.copy_date_text = CCEFile.fetchDateValue(dates['copyDate'], text=True)
authors = self.createAuthorList(entry, shared)
publishers = [
(c.text, c.get('claimant', None))
for c in entry.findall('.//pubName')
]
lccn = [ lccnorm.normalize(l.text) for l in entry.findall('lccn') ]
rec.updateRelationships(
entry,
authors=authors,
publishers=publishers,
lccn=lccn,
registrations=registrations
)
print('UPDATE', rec)
def createRegistrations(self, regnums, regdates):
registrations = []
for x in range(len(regnums)):
try:
regCat = re.match(r'([A-Z]+)', regnums[x][0]).group(1)
except AttributeError:
regCat ='Unknown'
registrations.append({
'regnum': regnums[x],
'category': regCat,
'regDate': regdates[x][0],
'regDateText': regdates[x][1]
})
return registrations
def loadRegnums(self, entry):
rawRegnums = entry.get('regnum').strip()
regnums = rawRegnums.split(' ')
regnums.extend(self.loadAddtlEntries(entry))
outNums = []
for num in regnums:
parsedNum = self.parseRegNum(num)
if type(parsedNum) is str: outNums.append(parsedNum)
else: outNums.extend(parsedNum)
return outNums
def loadAddtlEntries(self, entry):
moreRegnums = []
for addtl in entry.findall('.//additionalEntry'):
try:
addtlRegnum = addtl.get('regnum').strip()
moreRegnums.extend(addtlRegnum.split(' '))
except AttributeError:
continue
return moreRegnums
def parseRegNum(self, num):
try:
if (
re.search(r'[0-9]+\-[A-Z0-9]+', num)
and num[2] != '0'
and num[:2] != 'B5'
):
regnumRange = num.split('-')
regRangePrefix = re.match(r'([A-Z]+)', regnumRange[0]).group(1)
regRangeStart = int(regnumRange[0].replace(regRangePrefix, ''))
regRangeEnd = int(regnumRange[1].replace(regRangePrefix, ''))
if regRangeEnd - regRangeStart < 1000:
return [
'{}{}'.format(regRangePrefix, str(i))
for i in range(regRangeStart, regRangeEnd)
]
except (ValueError, AttributeError) as err:
print('RANGE ERROR', self.currentPage, self.pagePos)
raise DataError('regnum_range_parsing_error')
return num
def loadDates(self, entry):
dates = {}
for dType in CCEFile.dateTypes:
dates[dType] = [
(CCEFile.parseDate(d.get('date', None), '%Y-%m-%d'), d.text)
for d in entry.findall('.//{}'.format(dType))
if not ('ignore' in d.attrib and d.get('ignore') == 'yes')
]
if len(dates['regDate']) < 1:
if len(dates['copyDate']) > 0:
dates['regDate'].append(dates['copyDate'][0])
elif len(dates['pubDate']) > 0:
dates['regDate'].append(dates['pubDate'][0])
return dates
@staticmethod
def parseDate(date, dateFormat):
if date is None or dateFormat == '': return None
try:
outDate = datetime.strptime(date, dateFormat)
return outDate.strftime('%Y-%m-%d')
except ValueError:
return CCEFile.parseDate(date, dateFormat[:-3])
def createTitleList(self, entry, shared):
return '; '.join([
t.text
for t in entry.findall('.//title') + [
t for t in shared if t.tag == 'title'
]
if t.text is not None
])
def createErrorEntry(self, uuid, regnum, entry, reason):
errorCCE = ErrorCCE(
uuid=uuid,
regnum=regnum,
page=self.currentPage,
page_position=self.pagePos,
reason=reason
)
errorCCE.volume = self.fileHeader
errorCCE.addXML(entry)
self.session.add(errorCCE)
def parseGroup(self, group):
entries = []
shared = []
for field in group:
if field.tag == 'page': self.parsePage(field)
elif field.tag == 'copyrightEntry': entries.append(field)
else: shared.append(field)
for entry in entries:
try:
self.parseEntry(entry, shared)
except DataError as err:
err.uuid = entry.get('id')
raise err
def createAuthorList(self, rec, shared=[]):
authors = [
(a.text, False) for a in rec.findall('.//authorName')
if len(list(a.itersiblings(tag='role', preceding=True))) > 0
]
if len(authors) < 1:
try:
authors.append((rec.findall('.//authorName')[0].text, False))
except IndexError:
pass
authors.extend([
(a.text, False) for a in shared if a.tag == 'authorName'
])
try:
authors[0] = (authors[0][0], True)
except IndexError:
pass
return authors
def loadHeader(self):
header = self.root.find('.//header')
startNum = endNum = None
if header.find('cite/division/numbers') is not None:
startNum = header.find('cite/division/numbers').get('start', None)
endNum = header.find('cite/division/numbers').get('end', None)
elif header.find('cite/division/number') is not None:
volNum = CCEFile.fetchText(header, 'cite/division/number')
startNum = volNum
endNum= volNum
headerDict = {
'source': header.find('source').get('url', None),
'status': CCEFile.fetchText(header, 'status'),
'series': header.find('cite/series').get('label', None),
'volume': CCEFile.fetchText(header, 'cite/volume'),
'year': CCEFile.fetchText(header, 'cite/year'),
'part': CCEFile.fetchText(header, 'cite/division/part'),
'group': CCEFile.fetchText(header, 'cite/division/group'),
'material': CCEFile.fetchText(header, 'cite/division/material'),
'numbers': {
'start': startNum,
'end': endNum
}
}
self.fileHeader = Volume(
source=headerDict['source'],
status=headerDict['status'],
series=headerDict['series'],
volume=headerDict['volume'],
year=headerDict['year'],
part=headerDict['part'],
group=headerDict['group'],
material=headerDict['material'],
start_number=headerDict['numbers']['start'],
end_number=headerDict['numbers']['end']
)
self.session.add(self.fileHeader)
@staticmethod
def fetchText(parent, tag):
element = parent.find(tag)
return element.text if element is not None else None
@staticmethod
def fetchDateValue(date, text=False):
x = 1 if text else 0
return date[0][x] if len(date) > 0 else None

18
config.yaml-dist Normal file
View File

@ -0,0 +1,18 @@
DATABASE:
DB_USER:
DB_PSWD:
DB_HOST:
DB_PORT:
DB_NAME:
GITHUB:
ACCESS_TOKEN:
CCE_REPO:
CCR_REPO:
ELASTICSEARCH:
ES_CCE_INDEX:
ES_CCR_INDEX:
ES_HOST:
ES_PORT:
ES_TIMEOUT:

151
esIndexer.py Normal file
View File

@ -0,0 +1,151 @@
import os
from datetime import datetime
from elasticsearch.helpers import bulk, BulkIndexError, streaming_bulk
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import (
ConnectionError,
TransportError,
ConflictError
)
from elasticsearch_dsl import connections
from elasticsearch_dsl.wrappers import Range
from sqlalchemy import or_
from sqlalchemy.orm import configure_mappers, raiseload
from sqlalchemy.dialects import postgresql
from model.cce import CCE as dbCCE
from model.renewal import Renewal as dbRenewal
from model.registration import Registration as dbRegistration
from model.elastic import (
CCE,
Registration,
Renewal
)
class ESIndexer():
def __init__(self, manager, loadFromTime):
self.cce_index = os.environ['ES_CCE_INDEX']
self.ccr_index = os.environ['ES_CCR_INDEX']
self.client = None
self.session = manager.session
self.loadFromTime = loadFromTime if loadFromTime else datetime.strptime('1970-01-01', '%Y-%m-%d')
self.createElasticConnection()
self.createIndex()
configure_mappers()
def createElasticConnection(self):
host = os.environ['ES_HOST']
port = os.environ['ES_PORT']
timeout = int(os.environ['ES_TIMEOUT'])
try:
self.client = Elasticsearch(
hosts=[{'host': host, 'port': port}],
timeout=timeout
)
except ConnectionError as err:
print('Failed to connect to ElasticSearch instance')
raise err
connections.connections._conns['default'] = self.client
def createIndex(self):
if self.client.indices.exists(index=self.cce_index) is False:
CCE.init()
if self.client.indices.exists(index=self.ccr_index) is False:
Renewal.init()
def indexRecords(self, recType='cce'):
"""Process the current batch of updating records. This utilizes the
elasticsearch-py bulk helper to import records in chunks of the
provided size. If a record in the batch errors that is reported and
logged but it does not prevent the other records in the batch from
being imported.
"""
success, failure = 0, 0
errors = []
try:
for status, work in streaming_bulk(self.client, self.process(recType)):
print(status, work)
if not status:
errors.append(work)
failure += 1
else:
success += 1
print('Success {} | Failure: {}'.format(success, failure))
except BulkIndexError as err:
print('One or more records in the chunk failed to import')
raise err
def process(self, recType):
if recType == 'cce':
for cce in self.retrieveEntries():
esEntry = ESDoc(cce)
esEntry.indexEntry()
yield esEntry.entry.to_dict(True)
elif recType == 'ccr':
for ccr in self.retrieveRenewals():
esRen = ESRen(ccr)
esRen.indexRen()
if esRen.renewal.rennum == '': continue
yield esRen.renewal.to_dict(True)
def retrieveEntries(self):
retQuery = self.session.query(dbCCE)\
.filter(dbCCE.date_modified > self.loadFromTime)
for cce in retQuery.all():
yield cce
def retrieveRenewals(self):
renQuery = self.session.query(dbRenewal)\
.filter(dbRenewal.date_modified > self.loadFromTime)
for ccr in renQuery.all():
yield ccr
class ESDoc():
def __init__(self, cce):
self.dbRec = cce
self.entry = None
self.initEntry()
def initEntry(self):
print('Creating ES record for {}'.format(self.dbRec))
self.entry = CCE(meta={'id': self.dbRec.uuid})
def indexEntry(self):
self.entry.uuid = self.dbRec.uuid
self.entry.title = self.dbRec.title
self.entry.authors = [ a.name for a in self.dbRec.authors ]
self.entry.publishers = [ p.name for p in self.dbRec.publishers ]
self.entry.lccns = [ l.lccn for l in self.dbRec.lccns ]
self.entry.registrations = [
Registration(regnum=r.regnum, regdate=r.reg_date)
for r in self.dbRec.registrations
]
class ESRen():
def __init__(self, ccr):
self.dbRen = ccr
self.renewal = None
self.initRenewal()
def initRenewal(self):
print('Creating ES record for {}'.format(self.dbRen))
self.renewal = Renewal(meta={'id': self.dbRen.renewal_num})
def indexRen(self):
self.renewal.rennum = self.dbRen.renewal_num
self.renewal.rendate = self.dbRen.renewal_date
self.renewal.title = self.dbRen.title
self.renewal.claimants = [
c.name for c in self.dbRen.claimants
]

7
helpers/errors.py Normal file
View File

@ -0,0 +1,7 @@
class DataError(Exception):
def __init__(self, message, **kwargs):
self.message = message
for key, value in kwargs.items():
setattr(self, key, value)

86
main.py Normal file
View File

@ -0,0 +1,86 @@
import argparse
from datetime import datetime, timedelta
import os
import yaml
from sessionManager import SessionManager
from builder import CCEReader, CCEFile
from renBuilder import CCRReader, CCRFile
from esIndexer import ESIndexer
def main(secondsAgo=None, year=None, exclude=None, reinit=False):
manager = SessionManager()
manager.generateEngine()
manager.initializeDatabase(reinit)
manager.createSession()
loadFromTime = None
startTime = datetime.now()
if secondsAgo is not None:
loadFromTime = startTime - timedelta(seconds=secondsAgo)
#if exclude != 'cce':
# loadCCE(manager, loadFromTime, year)
#if exclude != 'ccr':
# loadCCR(manager, loadFromTime, year)
indexUpdates(manager, None)
manager.closeConnection()
def loadCCE(manager, loadFromTime, selectedYear):
cceReader = CCEReader(manager)
cceReader.loadYears(selectedYear)
cceReader.getYearFiles(loadFromTime)
cceReader.importYearData()
def loadCCR(manager, loadFromTime, selectedYear):
ccrReader = CCRReader(manager)
ccrReader.loadYears(selectedYear, loadFromTime)
ccrReader.importYears()
def indexUpdates(manager, loadFromTime):
esIndexer = ESIndexer(manager, None)
# esIndexer.indexRecords(recType='cce')
esIndexer.indexRecords(recType='ccr')
def parseArgs():
parser = argparse.ArgumentParser(
description='Load CCE XML and CCR TSV into PostgresQL'
)
parser.add_argument('-t', '--time', type=int, required=False,
help='Time ago in seconds to check for file updates'
)
parser.add_argument('-y', '--year', type=str, required=False,
help='Specific year to load CCE entries and/or renewals from'
)
parser.add_argument('-x', '--exclude', type=str, required=False,
choices=['cce', 'ccr'],
help='Specify to exclude either entries or renewals from this run'
)
parser.add_argument('--REINITIALIZE', action='store_true')
return parser.parse_args()
def loadConfig():
with open('config.yaml', 'r') as yamlFile:
config = yaml.safe_load(yamlFile)
for section in config:
sectionDict = config[section]
for key, value in sectionDict.items():
os.environ[key] = value
if __name__ == '__main__':
args = parseArgs()
loadConfig()
main(
secondsAgo=args.time,
year=args.year,
exclude=args.exclude,
reinit=args.REINITIALIZE
)

28
model/author.py Normal file
View File

@ -0,0 +1,28 @@
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
class Author(Core, Base):
__tablename__ = 'author'
id = Column(Integer, primary_key=True)
name = Column(Unicode, nullable=False, index=True)
primary = Column(Boolean, index=True)
cce_id = Column(Integer, ForeignKey('cce.id'), index=True)
def __repr__(self):
return '<Author(name={}, primary={})>'.format(self.name, self.primary)

163
model/cce.py Normal file
View File

@ -0,0 +1,163 @@
from lxml import etree
import uuid
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
from model.xml import XML
from model.author import Author
from model.publisher import Publisher
from model.lccn import LCCN
from model.registration import Registration
class CCE(Core, Base):
__tablename__ = 'cce'
id = Column(Integer, primary_key=True)
uuid = Column(UUID(as_uuid=True), unique=False, nullable=False, index=True)
page = Column(Integer)
page_position = Column(Integer)
title = Column(Unicode, index=True)
copies = Column(Unicode)
description = Column(Unicode)
new_matter = Column(Boolean)
pub_date = Column(Date)
pub_date_text = Column(Unicode)
copy_date = Column(Date)
copy_date_text = Column(Unicode)
aff_date = Column(Date)
aff_date_text = Column(Unicode)
volume_id = Column(Integer, ForeignKey('volume.id'))
registrations = relationship('Registration', backref='cce')
lccns = relationship('LCCN', backref='cce', cascade='all, delete-orphan')
authors = relationship('Author', backref='cce', cascade='all, delete-orphan')
publishers = relationship('Publisher', backref='cce', cascade='all, delete-orphan')
def __repr__(self):
return '<CCE(regnums={}, uuid={}, title={})>'.format(self.registrations, self.uuid, self.title)
def addRelationships(self, volume, xml, lccn=[], authors=[], publishers=[], registrations=[]):
self.volume = volume
self.addLCCN(lccn)
self.addAuthor(authors)
self.addPublisher(publishers)
self.addRegistration(registrations)
self.addXML(xml)
def addLCCN(self, lccns):
self.lccns = [ LCCN(lccn=lccn) for lccn in lccns ]
def addXML(self, xml):
xmlString = etree.tostring(xml, encoding='utf-8').decode()
self.xml_sources.append(XML(xml_source=xmlString))
def addAuthor(self, authors):
for auth in authors:
if auth[0] is None:
print('No author name! for {}'.format(self.uuid))
continue
self.authors.append(Author(name=auth[0], primary=auth[1]))
def addPublisher(self, publishers):
for pub in publishers:
if pub[0] is None:
print('No publisher name! for {}'.format(self.uuid))
continue
claimant = True if pub[1] == 'yes' else False
self.publishers.append(Publisher(name=pub[0], claimant=claimant))
def addRegistration(self, registrations):
self.registrations = [
Registration(
regnum=reg['regnum'],
category=reg['category'],
reg_date=reg['regDate'],
reg_date_text=reg['regDateText']
)
for reg in registrations
]
def updateRelationships(self, xml, lccn=[], authors=[], publishers=[], registrations=[]):
self.addXML(xml)
self.updateLCCN(lccn)
self.updateAuthors(authors)
self.updatePublishers(publishers)
self.updateRegistrations(registrations)
def updateLCCN(self, lccns):
currentLCCNs = [ l.lccn for l in self.lccns ]
if lccns != currentLCCNs:
self.lccns = [
l for l in self.lccns
if l.lccn in list(set(currentLCCNs) & set(lccns))
]
for new in list(set(lccns) - set(currentLCCNs)):
self.lccns.append(LCCN(lccn=new))
def updateAuthors(self, authors):
currentAuthors = [ (a.name, a.primary) for a in self.authors ]
newAuthors = filter(lambda x: x[0] is None, authors)
if newAuthors != currentAuthors:
self.authors = [
a for a in self.authors
if a.name in list(set(currentAuthors) & set(newAuthors))
]
for new in list(set(newAuthors) - set(currentAuthors)):
self.authors.append(Author(name=new[0], primary=new[1]))
def updatePublishers(self, publishers):
currentPublishers = [ (a.name, a.claimant) for a in self.publishers ]
newPublishers = [
(p[0], True if p[1] == 'yes' else False)
for p in filter(lambda x: x[0] is None, publishers)
]
if newPublishers != currentPublishers:
self.authors = [
a for a in self.authors
if a.name in list(set(currentPublishers) & set(newPublishers))
]
for new in list(set(newPublishers) - set(currentPublishers)):
self.publishers.append(Publisher(name=new[0], claimant=new[1]))
def updateRegistrations(self, registrations):
existingRegs = [
self.updateReg(r, registrations) for r in self.registrations
if r.regnum in [ r['regnum'] for r in registrations ]
]
newRegs = [
r for r in registrations
if r['regnum'] not in [ r.regnum for r in existingRegs ]
]
self.registrations = existingRegs + [
Registration(
regnum=reg['regnum'],
category=reg['category'],
reg_date=reg['regDate'],
reg_date_text=reg['regDateText']
)
for reg in newRegs
]
def updateReg(self, reg, registrations):
newReg = CCE.getReg(reg.regnum, registrations)
reg.update(newReg)
@staticmethod
def getReg(regnum, newRegs):
for new in newRegs:
if regnum == new['regnum']: return new

21
model/core.py Normal file
View File

@ -0,0 +1,21 @@
from sqlalchemy import Column, DateTime
from sqlalchemy.ext.declarative import declarative_base
from datetime import datetime
Base = declarative_base()
class Core(object):
"""A mixin for other SQLAlchemy ORM classes. Includes a date_craeted and
date_updated field for all database tables."""
date_created = Column(
DateTime,
default=datetime.now()
)
date_modified = Column(
DateTime,
default=datetime.now(),
onupdate=datetime.now()
)

58
model/elastic.py Normal file
View File

@ -0,0 +1,58 @@
import os
import yaml
from elasticsearch_dsl import (
Index,
Document,
Keyword,
Text,
Date,
InnerDoc,
Nested
)
class BaseDoc(Document):
date_created = Date()
date_modified = Date()
def save(self, **kwargs):
return super(BaseDoc, self).save(**kwargs)
class BaseInner(InnerDoc):
date_created = Date()
date_modified = Date()
def save(self, **kwargs):
return super(BaseInner, self).save(**kwargs)
class Registration(BaseInner):
regnum = Keyword()
regdate = Date()
class Renewal(BaseDoc):
rennum = Keyword()
rendate = Date()
title = Text(fields={'keyword': Keyword()})
claimants = Text(multi=True)
class Index:
with open('config.yaml', 'r') as yamlFile:
config = yaml.safe_load(yamlFile)
name = config['ELASTICSEARCH']['ES_CCR_INDEX']
class CCE(BaseDoc):
uuid = Keyword(store=True)
title = Text(fields={'keyword': Keyword()})
authors = Text(multi=True)
publishers = Text(multi=True)
lccns = Keyword(multi=True)
registrations = Nested(Registration)
class Index:
with open('config.yaml', 'r') as yamlFile:
config = yaml.safe_load(yamlFile)
name = config['ELASTICSEARCH']['ES_CCE_INDEX']

42
model/errorCCE.py Normal file
View File

@ -0,0 +1,42 @@
from lxml import etree
import uuid
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
from model.xml import XML
from model.author import Author
from model.publisher import Publisher
from model.lccn import LCCN
class ErrorCCE(Core, Base):
__tablename__ = 'error_cce'
id = Column(Integer, primary_key=True)
uuid = Column(UUID(as_uuid=True), unique=False, nullable=False, index=True)
regnum = Column(Unicode, unique=False, nullable=True)
page = Column(Integer)
page_position = Column(Integer)
reason = Column(Unicode)
volume_id = Column(Integer, ForeignKey('volume.id'))
def __repr__(self):
return '<ErrorCCE(regnum={}, uuid={})>'.format(self.regnum, self.uuid)
def addXML(self, xml):
xmlString = etree.tostring(xml, encoding='utf-8').decode()
self.xml_sources.append(XML(xml_source=xmlString))

25
model/lccn.py Normal file
View File

@ -0,0 +1,25 @@
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
class LCCN(Core, Base):
__tablename__ = 'lccn'
id = Column(Integer, primary_key=True)
lccn = Column(Unicode, nullable=False, index=True)
cce_id = Column(Integer, ForeignKey('cce.id'), index=True)

28
model/publisher.py Normal file
View File

@ -0,0 +1,28 @@
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
class Publisher(Core, Base):
__tablename__ = 'publisher'
id = Column(Integer, primary_key=True)
name = Column(Unicode, nullable=False, index=True)
claimant = Column(Boolean, index=True)
cce_id = Column(Integer, ForeignKey('cce.id'), index=True)
def __repr__(self):
return '<Publisher(name={}, claimant={})>'.format(self.name, self.claimant)

34
model/registration.py Normal file
View File

@ -0,0 +1,34 @@
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
class Registration(Core, Base):
__tablename__ = 'registration'
id = Column(Integer, primary_key=True)
regnum = Column(Unicode, unique=False, nullable=False, index=True)
category = Column(Unicode)
reg_date = Column(Date)
reg_date_text = Column(Unicode)
cce_id = Column(Integer, ForeignKey('cce.id'), index=True)
def __repr__(self):
return '<Registration(regnum={}, date={})>'.format(self.regnum, self.reg_date_text)
def update(self, newReg):
for key, value in newReg.items():
setattr(self, key, value)

30
model/renClaimant.py Normal file
View File

@ -0,0 +1,30 @@
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
class RenClaimant(Core, Base):
__tablename__ = 'renewal_claimant'
id = Column(Integer, primary_key=True)
name = Column(Unicode, nullable=False, index=True)
claimant_type = Column(Unicode, index=True)
renewal_id = Column(Integer, ForeignKey('renewal.id'))
renewal = relationship('Renewal', backref='claimants')
def __repr__(self):
return '<Claimant(name={}, type={})>'.format(self.name, self.claimant_type)

88
model/renewal.py Normal file
View File

@ -0,0 +1,88 @@
import uuid
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
Boolean,
Table
)
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
from model.renClaimant import RenClaimant
RENEWAL_REG = Table(
'renewal_registration',
Base.metadata,
Column('renewal_id', Integer, ForeignKey('renewal.id'), index=True),
Column('registration_id', Integer, ForeignKey('registration.id'), index=True)
)
class Renewal(Core, Base):
__tablename__ = 'renewal'
id = Column(Integer, primary_key=True)
uuid = Column(UUID(as_uuid=True), unique=False, nullable=False, index=True)
volume = Column(Integer)
part = Column(Unicode)
number = Column(Integer)
page = Column(Integer)
author = Column(Unicode)
title = Column(Unicode, index=True)
reg_data = Column(Unicode)
renewal_num = Column(Unicode)
renewal_date = Column(Date)
renewal_date_text = Column(Unicode)
new_matter = Column(Unicode)
see_also_regs = Column(Unicode)
see_also_rens = Column(Unicode)
notes = Column(Unicode)
source = Column(Unicode)
orphan = Column(Boolean, default=False)
registrations = relationship(
'Registration',
secondary=RENEWAL_REG,
backref='renewals'
)
def __repr__(self):
return '<CCR(regs={}, uuid={}, title={})>'.format(self.registrations, self.uuid, self.title)
def addClaimants(self, claimants):
if claimants:
for claim in claimants.split('||'):
cParts = claim.split('|')
self.claimants.append(
RenClaimant(name=cParts[0], claimant_type=cParts[1])
)
def updateClaimants(self, claimants):
addClaims = [
claim.split('|') for claim in claimants.split('||')
]
existingClaims = [
c for c in self.claimants
if c.name in [ a[0] for a in addClaims ]
]
newClaims = [
c for c in addClaims
if c[0] not in [ e.name for e in existingClaims ]
and c[0] != ''
]
self.claimants = existingClaims + [
RenClaimant(
name=claim[0],
claimant_type=claim[1]
)
for claim in newClaims
]

37
model/volume.py Normal file
View File

@ -0,0 +1,37 @@
import uuid
import json
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
PrimaryKeyConstraint
)
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound
from model.core import Base, Core
class Volume(Core, Base):
__tablename__ = 'volume'
id = Column(Integer, primary_key=True)
source = Column(Unicode)
status = Column(Unicode)
series = Column(Unicode)
volume = Column(Integer)
year = Column(Integer)
part = Column(Unicode)
group = Column(Integer)
material = Column(Unicode)
start_number = Column(Integer)
end_number = Column(Integer)
entries = relationship('CCE', backref='volume')
error_entries = relationship('ErrorCCE', backref='volume')
def __repr__(self):
return '<Volume(series={}, volume={}, year={})>'.format(self.series, self.volume, self.year)

50
model/xml.py Normal file
View File

@ -0,0 +1,50 @@
from sqlalchemy.types import PickleType
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.orm import relationship, backref
from sqlalchemy import (
Column,
Date,
ForeignKey,
Integer,
String,
Unicode,
PrimaryKeyConstraint,
Table
)
from model.core import Base, Core
@compiles(String, 'postgresql')
def compile_xml(type_, compiler, **kw):
return "XML"
ENTRY_XML = Table(
'entry_xml',
Base.metadata,
Column('cce_id', Integer, ForeignKey('cce.id'), index=True),
Column('xml_id', Integer, ForeignKey('xml.id'), index=True)
)
ERROR_XML = Table(
'error_entry_xml',
Base.metadata,
Column('error_cce_id', Integer, ForeignKey('error_cce.id'), index=True),
Column('xml_id', Integer, ForeignKey('xml.id'), index=True)
)
class XML(Core, Base):
__tablename__ = 'xml'
id = Column(Integer, primary_key=True)
xml_source = Column(String)
entry = relationship(
'CCE',
secondary=ENTRY_XML,
backref='xml_sources'
)
error_entry = relationship(
'ErrorCCE',
secondary=ERROR_XML,
backref='xml_sources'
)

193
renBuilder.py Normal file
View File

@ -0,0 +1,193 @@
import base64
import csv
from datetime import datetime
from github import Github
from io import StringIO
import os
import re
from sqlalchemy.orm.exc import MultipleResultsFound
from model.renewal import Renewal
from model.registration import Registration
class CCRReader():
def __init__(self, manager):
self.git = Github(os.environ['ACCESS_TOKEN'])
self.repo = self.git.get_repo(os.environ['CCR_REPO'])
self.ccrYears = {}
self.dbManager = manager
def loadYears(self, selectedYear, loadFromTime):
for year in self.repo.get_contents('/data'):
yearMatch = re.match(r'^([0-9]{4}).*\.tsv$', year.name)
if not yearMatch: continue
fileYear = yearMatch.group(1)
if selectedYear is not None and selectedYear != fileYear: continue
fileCommit = self.repo.get_commits(path=year.path)[0]
commitDate = fileCommit.commit.committer.date
if loadFromTime is not None and commitDate < loadFromTime: continue
yearInfo = {
'path': year.path,
'filename': year.name,
'sha': year.sha
}
self.ccrYears[fileYear] = yearInfo
def importYears(self):
for year in self.ccrYears.keys(): self.importYear(year)
def importYear(self, year):
yearInfo = self.ccrYears[year]
print('Loading Year {}'.format(year))
cceFile = CCRFile(self.repo, yearInfo, self.dbManager.session)
cceFile.loadFileTSV()
cceFile.readRows()
self.dbManager.commitChanges()
class CCRFile():
def __init__(self, repo, ccrFile, session):
self.repo = repo
self.ccrFile = ccrFile
self.session = session
self.rows = []
def loadFileTSV(self):
yearBlob = self.repo.get_git_blob(self.ccrFile['sha'])
tsvString = base64.b64decode(yearBlob.content).decode('utf-8')
tsvFile = StringIO(tsvString)
self.rows = csv.DictReader(tsvFile, delimiter='\t', quotechar='"')
def readRows(self):
for row in self.rows: self.parseRow(row)
def parseRow(self, row):
rec = self.matchRenewal(row['entry_id'])
if rec: self.updateRenewal(rec, row)
else: self.createRenewal(row)
def createRenewal(self, row):
title = CCRFile.cascadeFieldNameLoad('title', 'titl', row=row)
renewalDateText = CCRFile.cascadeFieldNameLoad('rdat', 'dreg', row=row)
source = CCRFile.cascadeFieldNameLoad('source', 'full_text', row=row)
author = CCRFile.cascadeFieldNameLoad('author', 'auth', row=row)
notes = CCRFile.cascadeFieldNameLoad('notes', 'note', row=row)
try:
renDate = datetime.strptime(renewalDateText, '%Y-%m-%d')
except ValueError:
renDate = None
renRec = Renewal(
uuid=row['entry_id'],
author=author,
title=title,
reg_data='{}|{}'.format(row['oreg'], row['odat']),
renewal_num=row['id'],
renewal_date=renDate,
renewal_date_text=renewalDateText,
new_matter=row['new_matter'],
see_also_regs=row['see_also_reg'],
see_also_rens=row['see_also_ren'],
notes=notes,
source=source
)
for numField in ['volume', 'part', 'number', 'page']:
setattr(
renRec,
numField,
row[numField] if row[numField] != '' else None
)
self.matchRegistrations(renRec, row['oreg'], row['odat'])
renRec.addClaimants(row['claimants'])
self.session.add(renRec)
print('INSERT', renRec)
def updateRenewal(self, rec, row):
rec.uuid = row['entry_id']
rec.title = CCRFile.cascadeFieldNameLoad('title', 'titl', row=row)
rec.source = CCRFile.cascadeFieldNameLoad('source', 'full_text', row=row)
rec.author = CCRFile.cascadeFieldNameLoad('author', 'auth', row=row)
rec.notes = CCRFile.cascadeFieldNameLoad('notes', 'note', row=row)
rec.reg_data = '{}|{}'.format(row['oreg'], row['odat'])
rec.renewal_num = row['id']
rec.new_matter = row['new_matter']
if row['see_also_reg']:
rec.see_also_regs = row['see_also_reg']
if row['see_also_ren']:
rec.see_also_rens = row['see_also_ren']
rec.renewal_date_text = CCRFile.cascadeFieldNameLoad('rdat', 'dreg', row=row)
try:
rec.renewal_date = datetime.strptime(rec.renewal_date_text, '%Y-%m-%d')
except ValueError:
rec.renewal_date = None
for numField in ['volume', 'part', 'number', 'page']:
setattr(
rec,
numField,
row[numField] if row[numField] != '' else None
)
self.matchRegistrations(rec, row['oreg'], row['odat'])
rec.updateClaimants(row['claimants'])
print('UPDATE', rec)
def matchRenewal(self, uuid):
return self.session.query(Renewal).filter(Renewal.uuid == uuid).one_or_none()
def matchRegistrations(self, renRec, regnum, origDate):
if regnum is None or regnum.strip() == '': return
try:
checkDate = datetime.strptime(origDate, '%Y-%m-%d')
except ValueError:
checkDate = None
regnumQuery = self.session.query(Registration)\
.filter(Registration.regnum == regnum)\
.filter(Registration.reg_date == checkDate)
try:
origReg = regnumQuery.one_or_none()
except MultipleResultsFound:
origRegs = regnumQuery.all()
if len(origRegs) < 1:
origReg = None
seeAlsoRegs = regnumQuery.all()
renRec.see_also_regs = '{}|{}'.format(
renRec.see_also_regs,
'|'.join([ r.regnum for r in seeAlsoRegs ])
)
else:
origReg = origRegs[0]
if len(origRegs) > 1:
renRec.see_also_regs = '{}|{}'.format(
renRec.see_also_regs,
'|'.join([ r.regnum for r in origRegs[1:] ])
)
if origReg:
renRec.registrations.append(origReg)
renRec.orphan = False
else:
print('Matching Registration not found!')
if len(renRec.registrations) < 1:
renRec.orphan = True
@staticmethod
def cascadeFieldNameLoad(*fields, row=None):
for field in fields:
try:
return row[field]
except KeyError:
pass
print('No matching field found!')
raise KeyError

7
requirements.txt Normal file
View File

@ -0,0 +1,7 @@
elasticsearch-dsl>=6.0.0,<7.0.0
lccnorm
lxml
psycopg2-binary
pygithub
pyyaml
sqlalchemy

57
sessionManager.py Normal file
View File

@ -0,0 +1,57 @@
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from model.core import Base
class SessionManager():
def __init__(self, user=None, pswd=None, host=None, port=None, db=None):
self.user = user if user else os.environ.get('DB_USER', None)
self.pswd = pswd if pswd else os.environ.get('DB_PSWD', None)
self.host = host if host else os.environ.get('DB_HOST', None)
self.port = port if port else os.environ.get('DB_PORT', None)
self.db = db if db else os.environ.get('DB_NAME', None)
self.engine = None
self.session = None
def generateEngine(self):
try:
self.engine = create_engine(
'postgresql://{}:{}@{}:{}/{}'.format(
self.user,
self.pswd,
self.host,
self.port,
self.db
)
)
except Exception as e:
raise e
def initializeDatabase(self, reinit=False):
if reinit: Base.metadata.drop_all(self.engine, checkfirst=True)
if not self.engine.dialect.has_table(self.engine, 'cce'):
Base.metadata.create_all(self.engine)
def createSession(self, autoflush=True):
if not self.engine: self.generateEngine()
self.session = sessionmaker(
bind=self.engine,
autoflush=autoflush
)()
return self.session
def startSession(self):
self.session.begin_nested()
def commitChanges(self):
self.session.commit()
def rollbackChanges(self):
self.session.rollback()
def closeConnection(self):
self.commitChanges()
self.session.close()
self.engine.dispose()