SFR-453 Initial Commit
Includes an initial version of the utility script used to generate the copyright entry/renewal database, along with instructions on how to run the script and create a version of the database locally.
commit
edd14d9cdb
|
@ -0,0 +1,5 @@
|
|||
/__pycache__/*
|
||||
config.yaml
|
||||
*.pyc
|
||||
*.xml
|
||||
*.code-workspace
|
|
@ -0,0 +1,64 @@
|
|||
# Copyright Registrations and Renewals DB Maintenance Script
|
||||
|
||||
## Introduction
|
||||
|
||||
This script creates and updates a database of copyright records, both original registrations and renewals, drawn from two projects supported by the NYPL to digitize and gather this source data. The goal of this project is to create a single source for all copyright information for works registered after 1922 and renewed before 1978 (see [Duration of Copyright](https://www.copyright.gov/circs/circ15a.pdf) for more information), and to let users trace the connections between related registrations and renewals.
|
||||
|
||||
The two projects that this data is drawn from are hosted on GitHub; more information about the source data for each project can be found in its repository:
|
||||
|
||||
- [Catalog of Copyright Entries](https://github.com/NYPL/catalog_of_copyright_entries_project)
|
||||
- [Catalog of Copyright Entries Renewals](https://github.com/NYPL/cce-renewals)
|
||||
|
||||
## Database
|
||||
|
||||
This script generates and updates a PostgreSQL database and an ElasticSearch search layer. It scans the repositories above for updated source files and inserts/updates the appropriate records, ensuring that the search layer is kept up to date following these changes.
|
||||
|
||||
### Structure
|
||||
|
||||
The database is relatively simple but contains some complexity to describe the range of conditions that exist in the copyright data. The current tables are:
|
||||
|
||||
- `cce` The core table that contains one copyright entry per row, identified by a UUIDv4. This includes a title and other descriptive information.
|
||||
- `renewal` The core table that contains one copyright renewal per row, identified by a UUIDv4 and a renewal number.
|
||||
- `registration` This table contains copyright registration numbers (e.g. `A999999`) and registration dates. There can be multiple registration numbers per entry and renewal.
|
||||
- `volume` Contains information describing the CCE volume that specific records were drawn from, including a URL where a digitized version of the volume can be found. This can be used to locate the original text of a copyright entry.
|
||||
- `author` The authors associated with a copyright entry. For each entry one author is designated as the `primary` author, generally drawn from the entry heading and used for lookup and sorting purposes.
|
||||
- `publisher` The publishers associated with a copyright entry. Each entry is tagged with a boolean `claimant` field that designates if the publisher was the copyright claimant. These entries can overlap with the `author` table due to discrepancies in how data was entered.
|
||||
- `lccn` Normalized LCCN numbers associated with a copyright entry.
|
||||
- `renewal_claimant` Claimants who have created a renewal record.
|
||||
- `xml` The source XML string for copyright entries. This is stored in an XML field and can be queried with `xpath`. This table also stores successive versions of the source XML for each entry, allowing changes to be tracked and visualized.
|
||||
- `error_cce` Contains copyright entries that could not be properly parsed. These are stored to see if the issue lies in the parser or the source data.
|
||||
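Because the `xml` table stores raw XML, `xpath` lookups can be prototyped locally before writing the equivalent PostgreSQL `xpath()` query. A minimal sketch using the stdlib `ElementTree` (the sample entry below is illustrative, not real data):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general shape of a CCE copyright entry.
sample = """
<copyrightEntry id="example-uuid" regnum="A123456">
  <title>An Example Work</title>
  <author><authorName>Example, Author</authorName></author>
</copyrightEntry>
"""

root = ET.fromstring(sample)
# The equivalent query against the xml table would be something like:
#   SELECT xpath('//title/text()', xml_source) FROM xml;
title = root.findtext('.//title')
regnum = root.get('regnum')
```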
|
||||
### Relationships
|
||||
|
||||
The primary relationships run through the `registration` table, which links entries and renewals: through registration numbers and dates, connections can be found between the two. Note that an individual entry can have multiple renewals and vice versa, related through multiple registration numbers.
|
||||
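The matching described above can be illustrated with a toy in-memory sketch (made-up registration numbers; the real connection also considers registration dates):

```python
# Toy records: each entry/renewal carries one or more registration numbers.
entries = {
    'entry-1': {'A111111', 'A222222'},
    'entry-2': {'A333333'},
}
renewals = {
    'R-100': {'A111111'},
    'R-101': {'A222222', 'A333333'},
}

# Build the entry -> renewals mapping through shared registration numbers.
matches = {
    e: sorted(r for r, rnums in renewals.items() if nums & rnums)
    for e, nums in entries.items()
}
# entry-1 links to two renewals; R-101 links back to two entries,
# demonstrating the many-to-many relationship.
```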
|
||||
## ElasticSearch
|
||||
|
||||
The ElasticSearch instance is designed to provide a simple, lightweight search layer. The documents in the ElasticSearch indexes do not contain a full set of metadata, only enough to return results that can be fully realized with an additional query to the database.
|
||||
|
||||
The ES instance consists of two indexes: `cce` for entries and `ccr` for renewals. These indexes can be queried separately or together.
|
||||
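For example, a combined search might look like the following (a sketch: the `title` field name comes from the indexing code in this repo, but the query itself is illustrative):

```python
# Query body for a simple title search; with elasticsearch-py both indexes
# can be searched in one request by passing a comma-separated index list:
#   client.search(index='cce,ccr', body=query)
query = {
    'query': {
        'query_string': {
            'query': 'copyright',
            'fields': ['title']
        }
    }
}
```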
|
||||
## Configuration
|
||||
|
||||
This script can be configured to run locally using the NYPL data sources or your own forks of those sources. A `config.yaml-dist` file is included that shows the structure for the necessary environment variables. The following must be supplied:
|
||||
- Connection details for a PostgreSQL database
|
||||
- Connection details for an ElasticSearch instance
|
||||
- A GitHub Personal Access Token for interacting with the GitHub API
|
||||
- Repository names for the entry and renewal repositories
|
||||
|
||||
## Setup
|
||||
|
||||
This script requires Python 3.6+
|
||||
|
||||
To create a local instance, it is recommended to create a virtualenv and install dependencies with `pip install -r requirements.txt`.
|
||||
|
||||
## Running
|
||||
|
||||
The utility script can be invoked with `python main.py` to initialize the database and pull down the full set of source data. Please note that it will take 2-4 hours to fully download, parse, and index the source data.
|
||||
|
||||
Several command-line arguments control the execution of this script:
|
||||
|
||||
- `--REINITIALIZE` Drops the existing database and rebuilds it from scratch. WARNING: THIS WILL DELETE ALL CURRENT DATA.
|
||||
- `-t` or `--time` How far back, in seconds, to check for updated records in the GitHub repositories. This allows updating only changed records.
|
||||
- `-y` or `--year` A specific year to load from either the entries or renewals.
|
||||
- `-x` or `--exclude` Set to exclude either the entries (with `cce`) or the renewals (with `ccr`) from the current execution. Useful when used in conjunction with the `year` parameter to control what records are updated.
|
|
@ -0,0 +1,416 @@
|
|||
import base64
|
||||
from datetime import datetime
|
||||
from github import Github
|
||||
import lccnorm
|
||||
from lxml import etree
|
||||
import os
|
||||
import re
|
||||
import traceback
|
||||
|
||||
from model.cce import CCE
|
||||
from model.errorCCE import ErrorCCE
|
||||
from model.volume import Volume
|
||||
|
||||
from helpers.errors import DataError
|
||||
|
||||
class CCEReader():
|
||||
def __init__(self, manager=None):
|
||||
self.git = Github(os.environ['ACCESS_TOKEN'])
|
||||
self.repo = self.git.get_repo(os.environ['CCE_REPO'])
|
||||
self.dbManager = manager
|
||||
self.cceYears = {}
|
||||
|
||||
def loadYears(self, selectedYear):
|
||||
for year in self.repo.get_contents('/xml'):
|
||||
if not re.match(r'^19[0-9]{2}$', year.name): continue
|
||||
if selectedYear is not None and year.name != selectedYear: continue
|
||||
yearInfo = {'path': year.path, 'yearFiles': []}
|
||||
self.cceYears[year.name] = yearInfo
|
||||
|
||||
def loadYearFiles(self, year, loadFromTime):
|
||||
yearInfo = self.cceYears[year]
|
||||
for yearFile in self.repo.get_contents(yearInfo['path']):
|
||||
if 'alto' in yearFile.name or 'TOC' in yearFile.name: continue
|
||||
fileCommit = self.repo.get_commits(path=yearFile.path)[0]
|
||||
commitDate = fileCommit.commit.committer.date
|
||||
if loadFromTime is not None and commitDate < loadFromTime: continue
|
||||
self.cceYears[year]['yearFiles'].append({
|
||||
'filename': yearFile.name,
|
||||
'path': yearFile.path,
|
||||
'sha': yearFile.sha
|
||||
})
|
||||
|
||||
def getYearFiles(self, loadFromTime):
|
||||
for year in self.cceYears.keys():
|
||||
print('Loading files for {}'.format(year))
|
||||
self.loadYearFiles(year, loadFromTime)
|
||||
|
||||
def importYearData(self):
|
||||
for year in self.cceYears.keys(): self.importYear(year)
|
||||
|
||||
def importYear(self, year):
|
||||
yearFiles = self.cceYears[year]['yearFiles']
|
||||
for yearFile in yearFiles: self.importFile(yearFile)
|
||||
|
||||
def importFile(self, yearFile):
|
||||
print('Importing data from {}'.format(yearFile['filename']))
|
||||
cceFile = CCEFile(self.repo, yearFile, self.dbManager.session)
|
||||
cceFile.loadFileXML()
|
||||
cceFile.readXML()
|
||||
self.dbManager.commitChanges()
|
||||
|
||||
|
||||
class CCEFile():
|
||||
tagOptions = {
|
||||
'header': 'skipElement',
|
||||
'page': 'parsePage',
|
||||
'copyrightEntry': 'parseEntry',
|
||||
'entryGroup': 'parseGroup',
|
||||
'crossRef': 'skipElement'
|
||||
}
|
||||
|
||||
dateTypes = ['regDate', 'copyDate', 'pubDate', 'affDate']
|
||||
|
||||
def __init__(self, repo, cceFile, session):
|
||||
self.repo = repo
|
||||
self.cceFile = cceFile
|
||||
self.session = session
|
||||
|
||||
self.xmlString = None
|
||||
self.root = None
|
||||
self.fileHeader = None
|
||||
self.currentPage = 0
|
||||
self.pagePos = 0
|
||||
|
||||
def loadFileXML(self):
|
||||
yearBlob = self.repo.get_git_blob(self.cceFile['sha'])
|
||||
self.xmlString = base64.b64decode(yearBlob.content)
|
||||
self.root = etree.fromstring(self.xmlString)
|
||||
|
||||
def readXML(self):
|
||||
self.loadHeader()
|
||||
|
||||
for child in self.root:
|
||||
childOp = getattr(self, CCEFile.tagOptions[child.tag])
|
||||
self.pagePos += 1
|
||||
try:
|
||||
childOp(child)
|
||||
except DataError as err:
|
||||
print('Caught error')
|
||||
self.createErrorEntry(
|
||||
getattr(err, 'uuid', child.get('id')),
|
||||
getattr(err, 'regnum', None),
|
||||
getattr(err, 'entry', child),
|
||||
getattr(err, 'message', None)
|
||||
)
|
||||
except Exception as err:
|
||||
print('ERROR', err)
|
||||
traceback.print_exc()
|
||||
raise err
|
||||
|
||||
def skipElement(self, el):
|
||||
print('Skipping element {}'.format(el.tag))
|
||||
|
||||
def parsePage(self, page):
|
||||
self.currentPage = page.get('pgnum')
|
||||
self.pagePos = 0
|
||||
|
||||
def parseEntry(self, entry, shared=[]):
|
||||
uuid = entry.get('id')
|
||||
|
||||
if 'regnum' not in entry.attrib:
|
||||
print('Entry Missing REGNUM')
|
||||
raise DataError(
|
||||
'missing_regnum',
|
||||
uuid=uuid,
|
||||
entry=entry
|
||||
)
|
||||
|
||||
regnums = self.loadRegnums(entry)
|
||||
|
||||
entryDates = self.loadDates(entry)
|
||||
regDates = entryDates['regDate']
|
||||
if len(regnums) != len(regDates):
|
||||
if len(regDates) == 1 and len(regnums) > 0:
|
||||
regDates = [regDates[0]] * len(regnums)
|
||||
else:
|
||||
raise DataError(
|
||||
'regnum_date_mismatch',
|
||||
uuid=uuid,
|
||||
regnum='; '.join(regnums),
|
||||
entry=entry
|
||||
)
|
||||
|
||||
regs = self.createRegistrations(regnums, regDates)
|
||||
existingRec = self.matchUUID(uuid)
|
||||
if existingRec:
|
||||
self.updateEntry(existingRec, entryDates, entry, shared, regs)
|
||||
else:
|
||||
self.createEntry(uuid, entryDates, entry, shared, regs)
|
||||
|
||||
def matchUUID(self, uuid):
|
||||
return self.session.query(CCE).filter(CCE.uuid == uuid).one_or_none()
|
||||
|
||||
def createEntry(self, uuid, dates, entry, shared, registrations):
|
||||
titles = self.createTitleList(entry, shared)
|
||||
authors = self.createAuthorList(entry, shared)
|
||||
copies = CCEFile.fetchText(entry, 'copies')
|
||||
description = CCEFile.fetchText(entry, 'desc')
|
||||
newMatter = len(entry.findall('newMatterClaimed')) > 0
|
||||
|
||||
publishers = [
|
||||
(c.text, c.get('claimant', None))
|
||||
for c in entry.findall('.//pubName')
|
||||
]
|
||||
|
||||
lccn = [ lccnorm.normalize(l.text) for l in entry.findall('lccn') ]
|
||||
|
||||
cceRec = CCE(
|
||||
uuid=uuid,
|
||||
page=self.currentPage,
|
||||
page_position=self.pagePos,
|
||||
title=titles,
|
||||
copies=copies,
|
||||
description=description,
|
||||
new_matter=newMatter,
|
||||
pub_date=CCEFile.fetchDateValue(dates['pubDate'], text=False),
|
||||
pub_date_text=CCEFile.fetchDateValue(dates['pubDate'], text=True),
|
||||
aff_date=CCEFile.fetchDateValue(dates['affDate'], text=False),
|
||||
aff_date_text=CCEFile.fetchDateValue(dates['affDate'], text=True),
|
||||
copy_date=CCEFile.fetchDateValue(dates['copyDate'], text=False),
|
||||
copy_date_text=CCEFile.fetchDateValue(dates['copyDate'], text=True)
|
||||
)
|
||||
cceRec.addRelationships(
|
||||
self.fileHeader,
|
||||
entry,
|
||||
authors=authors,
|
||||
publishers=publishers,
|
||||
lccn=lccn,
|
||||
registrations=registrations
|
||||
)
|
||||
self.session.add(cceRec)
|
||||
print('INSERT', cceRec)
|
||||
|
||||
def updateEntry(self, rec, dates, entry, shared, registrations):
|
||||
rec.title = self.createTitleList(entry, shared)
|
||||
rec.copies = CCEFile.fetchText(entry, 'copies')
|
||||
rec.description = CCEFile.fetchText(entry, 'desc')
|
||||
rec.new_matter = len(entry.findall('newMatterClaimed')) > 0
|
||||
rec.page = self.currentPage
|
||||
rec.page_position = self.pagePos
|
||||
|
||||
rec.pub_date = CCEFile.fetchDateValue(dates['pubDate'], text=False)
|
||||
rec.pub_date_text = CCEFile.fetchDateValue(dates['pubDate'], text=True)
|
||||
rec.aff_date = CCEFile.fetchDateValue(dates['affDate'], text=False)
|
||||
rec.aff_date_text = CCEFile.fetchDateValue(dates['affDate'], text=True)
|
||||
rec.copy_date = CCEFile.fetchDateValue(dates['copyDate'], text=False)
|
||||
rec.copy_date_text = CCEFile.fetchDateValue(dates['copyDate'], text=True)
|
||||
|
||||
authors = self.createAuthorList(entry, shared)
|
||||
publishers = [
|
||||
(c.text, c.get('claimant', None))
|
||||
for c in entry.findall('.//pubName')
|
||||
]
|
||||
lccn = [ lccnorm.normalize(l.text) for l in entry.findall('lccn') ]
|
||||
|
||||
rec.updateRelationships(
|
||||
entry,
|
||||
authors=authors,
|
||||
publishers=publishers,
|
||||
lccn=lccn,
|
||||
registrations=registrations
|
||||
)
|
||||
print('UPDATE', rec)
|
||||
|
||||
def createRegistrations(self, regnums, regdates):
|
||||
registrations = []
|
||||
for x in range(len(regnums)):
|
||||
try:
|
||||
regCat = re.match(r'([A-Z]+)', regnums[x][0]).group(1)
|
||||
except AttributeError:
|
||||
regCat = 'Unknown'
|
||||
registrations.append({
|
||||
'regnum': regnums[x],
|
||||
'category': regCat,
|
||||
'regDate': regdates[x][0],
|
||||
'regDateText': regdates[x][1]
|
||||
})
|
||||
return registrations
|
||||
|
||||
def loadRegnums(self, entry):
|
||||
rawRegnums = entry.get('regnum').strip()
|
||||
regnums = rawRegnums.split(' ')
|
||||
regnums.extend(self.loadAddtlEntries(entry))
|
||||
outNums = []
|
||||
for num in regnums:
|
||||
parsedNum = self.parseRegNum(num)
|
||||
if type(parsedNum) is str: outNums.append(parsedNum)
|
||||
else: outNums.extend(parsedNum)
|
||||
return outNums
|
||||
|
||||
def loadAddtlEntries(self, entry):
|
||||
moreRegnums = []
|
||||
for addtl in entry.findall('.//additionalEntry'):
|
||||
try:
|
||||
addtlRegnum = addtl.get('regnum').strip()
|
||||
moreRegnums.extend(addtlRegnum.split(' '))
|
||||
except AttributeError:
|
||||
continue
|
||||
return moreRegnums
|
||||
|
||||
def parseRegNum(self, num):
|
||||
try:
|
||||
if (
|
||||
re.search(r'[0-9]+\-[A-Z0-9]+', num)
|
||||
and num[2] != '0'
|
||||
and num[:2] != 'B5'
|
||||
):
|
||||
regnumRange = num.split('-')
|
||||
regRangePrefix = re.match(r'([A-Z]+)', regnumRange[0]).group(1)
|
||||
regRangeStart = int(regnumRange[0].replace(regRangePrefix, ''))
|
||||
regRangeEnd = int(regnumRange[1].replace(regRangePrefix, ''))
|
||||
if regRangeEnd - regRangeStart < 1000:
|
||||
return [
|
||||
'{}{}'.format(regRangePrefix, str(i))
|
||||
for i in range(regRangeStart, regRangeEnd + 1)
|
||||
]
|
||||
except (ValueError, AttributeError) as err:
|
||||
print('RANGE ERROR', self.currentPage, self.pagePos)
|
||||
raise DataError('regnum_range_parsing_error')
|
||||
|
||||
return num
|
||||
|
||||
def loadDates(self, entry):
|
||||
dates = {}
|
||||
for dType in CCEFile.dateTypes:
|
||||
dates[dType] = [
|
||||
(CCEFile.parseDate(d.get('date', None), '%Y-%m-%d'), d.text)
|
||||
for d in entry.findall('.//{}'.format(dType))
|
||||
if not ('ignore' in d.attrib and d.get('ignore') == 'yes')
|
||||
]
|
||||
if len(dates['regDate']) < 1:
|
||||
if len(dates['copyDate']) > 0:
|
||||
dates['regDate'].append(dates['copyDate'][0])
|
||||
elif len(dates['pubDate']) > 0:
|
||||
dates['regDate'].append(dates['pubDate'][0])
|
||||
|
||||
return dates
|
||||
|
||||
@staticmethod
|
||||
def parseDate(date, dateFormat):
|
||||
if date is None or dateFormat == '': return None
|
||||
|
||||
try:
|
||||
outDate = datetime.strptime(date, dateFormat)
|
||||
return outDate.strftime('%Y-%m-%d')
|
||||
except ValueError:
|
||||
return CCEFile.parseDate(date, dateFormat[:-3])
|
||||
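The recursive format-truncation used by `parseDate` handles partial dates (year-month or year-only strings) by successively dropping trailing format components. The same idea in isolation:

```python
from datetime import datetime

def parse_partial_date(date, fmt='%Y-%m-%d'):
    # Drop trailing format components ('-%d', then '-%m') until the
    # string parses or no format remains.
    if date is None or fmt == '':
        return None
    try:
        return datetime.strptime(date, fmt).strftime('%Y-%m-%d')
    except ValueError:
        return parse_partial_date(date, fmt[:-3])
```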
|
||||
def createTitleList(self, entry, shared):
|
||||
return '; '.join([
|
||||
t.text
|
||||
for t in entry.findall('.//title') + [
|
||||
t for t in shared if t.tag == 'title'
|
||||
]
|
||||
if t.text is not None
|
||||
])
|
||||
|
||||
def createErrorEntry(self, uuid, regnum, entry, reason):
|
||||
errorCCE = ErrorCCE(
|
||||
uuid=uuid,
|
||||
regnum=regnum,
|
||||
page=self.currentPage,
|
||||
page_position=self.pagePos,
|
||||
reason=reason
|
||||
)
|
||||
errorCCE.volume = self.fileHeader
|
||||
errorCCE.addXML(entry)
|
||||
self.session.add(errorCCE)
|
||||
|
||||
def parseGroup(self, group):
|
||||
entries = []
|
||||
shared = []
|
||||
for field in group:
|
||||
if field.tag == 'page': self.parsePage(field)
|
||||
elif field.tag == 'copyrightEntry': entries.append(field)
|
||||
else: shared.append(field)
|
||||
|
||||
for entry in entries:
|
||||
try:
|
||||
self.parseEntry(entry, shared)
|
||||
except DataError as err:
|
||||
err.uuid = entry.get('id')
|
||||
raise err
|
||||
|
||||
def createAuthorList(self, rec, shared=[]):
|
||||
authors = [
|
||||
(a.text, False) for a in rec.findall('.//authorName')
|
||||
if len(list(a.itersiblings(tag='role', preceding=True))) > 0
|
||||
]
|
||||
if len(authors) < 1:
|
||||
try:
|
||||
authors.append((rec.findall('.//authorName')[0].text, False))
|
||||
except IndexError:
|
||||
pass
|
||||
|
||||
authors.extend([
|
||||
(a.text, False) for a in shared if a.tag == 'authorName'
|
||||
])
|
||||
|
||||
try:
|
||||
authors[0] = (authors[0][0], True)
|
||||
except IndexError:
|
||||
pass
|
||||
|
||||
return authors
|
||||
|
||||
def loadHeader(self):
|
||||
header = self.root.find('.//header')
|
||||
|
||||
startNum = endNum = None
|
||||
if header.find('cite/division/numbers') is not None:
|
||||
startNum = header.find('cite/division/numbers').get('start', None)
|
||||
endNum = header.find('cite/division/numbers').get('end', None)
|
||||
elif header.find('cite/division/number') is not None:
|
||||
volNum = CCEFile.fetchText(header, 'cite/division/number')
|
||||
startNum = volNum
|
||||
endNum = volNum
|
||||
|
||||
headerDict = {
|
||||
'source': header.find('source').get('url', None),
|
||||
'status': CCEFile.fetchText(header, 'status'),
|
||||
'series': header.find('cite/series').get('label', None),
|
||||
'volume': CCEFile.fetchText(header, 'cite/volume'),
|
||||
'year': CCEFile.fetchText(header, 'cite/year'),
|
||||
'part': CCEFile.fetchText(header, 'cite/division/part'),
|
||||
'group': CCEFile.fetchText(header, 'cite/division/group'),
|
||||
'material': CCEFile.fetchText(header, 'cite/division/material'),
|
||||
'numbers': {
|
||||
'start': startNum,
|
||||
'end': endNum
|
||||
}
|
||||
}
|
||||
|
||||
self.fileHeader = Volume(
|
||||
source=headerDict['source'],
|
||||
status=headerDict['status'],
|
||||
series=headerDict['series'],
|
||||
volume=headerDict['volume'],
|
||||
year=headerDict['year'],
|
||||
part=headerDict['part'],
|
||||
group=headerDict['group'],
|
||||
material=headerDict['material'],
|
||||
start_number=headerDict['numbers']['start'],
|
||||
end_number=headerDict['numbers']['end']
|
||||
)
|
||||
self.session.add(self.fileHeader)
|
||||
|
||||
@staticmethod
|
||||
def fetchText(parent, tag):
|
||||
element = parent.find(tag)
|
||||
return element.text if element is not None else None
|
||||
|
||||
@staticmethod
|
||||
def fetchDateValue(date, text=False):
|
||||
x = 1 if text else 0
|
||||
return date[0][x] if len(date) > 0 else None
|
|
@ -0,0 +1,18 @@
|
|||
DATABASE:
|
||||
DB_USER:
|
||||
DB_PSWD:
|
||||
DB_HOST:
|
||||
DB_PORT:
|
||||
DB_NAME:
|
||||
|
||||
GITHUB:
|
||||
ACCESS_TOKEN:
|
||||
CCE_REPO:
|
||||
CCR_REPO:
|
||||
|
||||
ELASTICSEARCH:
|
||||
ES_CCE_INDEX:
|
||||
ES_CCR_INDEX:
|
||||
ES_HOST:
|
||||
ES_PORT:
|
||||
ES_TIMEOUT:
|
|
@ -0,0 +1,151 @@
|
|||
import os
|
||||
from datetime import datetime
|
||||
from elasticsearch.helpers import bulk, BulkIndexError, streaming_bulk
|
||||
from elasticsearch import Elasticsearch
|
||||
from elasticsearch.exceptions import (
|
||||
ConnectionError,
|
||||
TransportError,
|
||||
ConflictError
|
||||
)
|
||||
from elasticsearch_dsl import connections
|
||||
from elasticsearch_dsl.wrappers import Range
|
||||
|
||||
from sqlalchemy import or_
|
||||
from sqlalchemy.orm import configure_mappers, raiseload
|
||||
from sqlalchemy.dialects import postgresql
|
||||
|
||||
from model.cce import CCE as dbCCE
|
||||
from model.renewal import Renewal as dbRenewal
|
||||
from model.registration import Registration as dbRegistration
|
||||
from model.elastic import (
|
||||
CCE,
|
||||
Registration,
|
||||
Renewal
|
||||
)
|
||||
|
||||
|
||||
class ESIndexer():
|
||||
def __init__(self, manager, loadFromTime):
|
||||
self.cce_index = os.environ['ES_CCE_INDEX']
|
||||
self.ccr_index = os.environ['ES_CCR_INDEX']
|
||||
self.client = None
|
||||
self.session = manager.session
|
||||
self.loadFromTime = loadFromTime if loadFromTime else datetime.strptime('1970-01-01', '%Y-%m-%d')
|
||||
|
||||
self.createElasticConnection()
|
||||
self.createIndex()
|
||||
|
||||
configure_mappers()
|
||||
|
||||
def createElasticConnection(self):
|
||||
host = os.environ['ES_HOST']
|
||||
port = os.environ['ES_PORT']
|
||||
timeout = int(os.environ['ES_TIMEOUT'])
|
||||
try:
|
||||
self.client = Elasticsearch(
|
||||
hosts=[{'host': host, 'port': port}],
|
||||
timeout=timeout
|
||||
)
|
||||
except ConnectionError as err:
|
||||
print('Failed to connect to ElasticSearch instance')
|
||||
raise err
|
||||
connections.connections._conns['default'] = self.client
|
||||
|
||||
def createIndex(self):
|
||||
if self.client.indices.exists(index=self.cce_index) is False:
|
||||
CCE.init()
|
||||
if self.client.indices.exists(index=self.ccr_index) is False:
|
||||
Renewal.init()
|
||||
|
||||
def indexRecords(self, recType='cce'):
|
||||
"""Process the current batch of updating records. This utilizes the
|
||||
elasticsearch-py bulk helper to import records in chunks of the
|
||||
provided size. If a record in the batch errors, that is reported and
|
||||
logged, but it does not prevent the other records in the batch from
|
||||
being imported.
|
||||
"""
|
||||
success, failure = 0, 0
|
||||
errors = []
|
||||
try:
|
||||
for status, work in streaming_bulk(self.client, self.process(recType)):
|
||||
print(status, work)
|
||||
if not status:
|
||||
errors.append(work)
|
||||
failure += 1
|
||||
else:
|
||||
success += 1
|
||||
|
||||
print('Success {} | Failure: {}'.format(success, failure))
|
||||
except BulkIndexError as err:
|
||||
print('One or more records in the chunk failed to import')
|
||||
raise err
|
||||
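The success/failure bookkeeping around `streaming_bulk` can be seen in isolation with a stand-in generator that yields the same `(ok, info)` tuples (toy data, no ES connection required):

```python
def fake_streaming_bulk(actions):
    # Stand-in for elasticsearch.helpers.streaming_bulk, which yields an
    # (ok, info) tuple per submitted action.
    for action in actions:
        yield action.get('valid', True), action

success, failure, errors = 0, 0, []
for ok, item in fake_streaming_bulk([{'id': 1}, {'id': 2, 'valid': False}, {'id': 3}]):
    if ok:
        success += 1
    else:
        # A failed record is recorded but does not stop the batch.
        failure += 1
        errors.append(item)
```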
|
||||
def process(self, recType):
|
||||
if recType == 'cce':
|
||||
for cce in self.retrieveEntries():
|
||||
esEntry = ESDoc(cce)
|
||||
esEntry.indexEntry()
|
||||
yield esEntry.entry.to_dict(True)
|
||||
elif recType == 'ccr':
|
||||
for ccr in self.retrieveRenewals():
|
||||
esRen = ESRen(ccr)
|
||||
esRen.indexRen()
|
||||
if esRen.renewal.rennum == '': continue
|
||||
yield esRen.renewal.to_dict(True)
|
||||
|
||||
def retrieveEntries(self):
|
||||
retQuery = self.session.query(dbCCE)\
|
||||
.filter(dbCCE.date_modified > self.loadFromTime)
|
||||
for cce in retQuery.all():
|
||||
yield cce
|
||||
|
||||
def retrieveRenewals(self):
|
||||
renQuery = self.session.query(dbRenewal)\
|
||||
.filter(dbRenewal.date_modified > self.loadFromTime)
|
||||
for ccr in renQuery.all():
|
||||
yield ccr
|
||||
|
||||
|
||||
class ESDoc():
|
||||
def __init__(self, cce):
|
||||
self.dbRec = cce
|
||||
self.entry = None
|
||||
|
||||
self.initEntry()
|
||||
|
||||
def initEntry(self):
|
||||
print('Creating ES record for {}'.format(self.dbRec))
|
||||
|
||||
self.entry = CCE(meta={'id': self.dbRec.uuid})
|
||||
|
||||
def indexEntry(self):
|
||||
self.entry.uuid = self.dbRec.uuid
|
||||
self.entry.title = self.dbRec.title
|
||||
self.entry.authors = [ a.name for a in self.dbRec.authors ]
|
||||
self.entry.publishers = [ p.name for p in self.dbRec.publishers ]
|
||||
self.entry.lccns = [ l.lccn for l in self.dbRec.lccns ]
|
||||
self.entry.registrations = [
|
||||
Registration(regnum=r.regnum, regdate=r.reg_date)
|
||||
for r in self.dbRec.registrations
|
||||
]
|
||||
|
||||
|
||||
class ESRen():
|
||||
def __init__(self, ccr):
|
||||
self.dbRen = ccr
|
||||
self.renewal = None
|
||||
|
||||
self.initRenewal()
|
||||
|
||||
def initRenewal(self):
|
||||
print('Creating ES record for {}'.format(self.dbRen))
|
||||
|
||||
self.renewal = Renewal(meta={'id': self.dbRen.renewal_num})
|
||||
|
||||
def indexRen(self):
|
||||
self.renewal.rennum = self.dbRen.renewal_num
|
||||
self.renewal.rendate = self.dbRen.renewal_date
|
||||
self.renewal.title = self.dbRen.title
|
||||
self.renewal.claimants = [
|
||||
c.name for c in self.dbRen.claimants
|
||||
]
|
|
@ -0,0 +1,7 @@
|
|||
|
||||
class DataError(Exception):
|
||||
def __init__(self, message, **kwargs):
|
||||
self.message = message
|
||||
|
||||
for key, value in kwargs.items():
|
||||
setattr(self, key, value)
|
|
@ -0,0 +1,86 @@
|
|||
import argparse
|
||||
from datetime import datetime, timedelta
|
||||
import os
|
||||
import yaml
|
||||
|
||||
from sessionManager import SessionManager
|
||||
from builder import CCEReader, CCEFile
|
||||
from renBuilder import CCRReader, CCRFile
|
||||
from esIndexer import ESIndexer
|
||||
|
||||
|
||||
def main(secondsAgo=None, year=None, exclude=None, reinit=False):
|
||||
manager = SessionManager()
|
||||
manager.generateEngine()
|
||||
manager.initializeDatabase(reinit)
|
||||
manager.createSession()
|
||||
|
||||
loadFromTime = None
|
||||
startTime = datetime.now()
|
||||
if secondsAgo is not None:
|
||||
loadFromTime = startTime - timedelta(seconds=secondsAgo)
|
||||
|
||||
if exclude != 'cce':
|
||||
loadCCE(manager, loadFromTime, year)
|
||||
if exclude != 'ccr':
|
||||
loadCCR(manager, loadFromTime, year)
|
||||

|
||||
indexUpdates(manager, loadFromTime)
|
||||
|
||||
manager.closeConnection()
|
||||
|
||||
|
||||
def loadCCE(manager, loadFromTime, selectedYear):
|
||||
cceReader = CCEReader(manager)
|
||||
cceReader.loadYears(selectedYear)
|
||||
cceReader.getYearFiles(loadFromTime)
|
||||
cceReader.importYearData()
|
||||
|
||||
|
||||
def loadCCR(manager, loadFromTime, selectedYear):
|
||||
ccrReader = CCRReader(manager)
|
||||
ccrReader.loadYears(selectedYear, loadFromTime)
|
||||
ccrReader.importYears()
|
||||
|
||||
def indexUpdates(manager, loadFromTime):
|
||||
esIndexer = ESIndexer(manager, loadFromTime)
|
||||
esIndexer.indexRecords(recType='cce')
|
||||
esIndexer.indexRecords(recType='ccr')
|
||||
|
||||
|
||||
def parseArgs():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Load CCE XML and CCR TSV into PostgreSQL'
|
||||
)
|
||||
parser.add_argument('-t', '--time', type=int, required=False,
|
||||
help='Time ago in seconds to check for file updates'
|
||||
)
|
||||
parser.add_argument('-y', '--year', type=str, required=False,
|
||||
help='Specific year to load CCE entries and/or renewals from'
|
||||
)
|
||||
parser.add_argument('-x', '--exclude', type=str, required=False,
|
||||
choices=['cce', 'ccr'],
|
||||
help='Specify to exclude either entries or renewals from this run'
|
||||
)
|
||||
parser.add_argument('--REINITIALIZE', action='store_true')
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def loadConfig():
|
||||
with open('config.yaml', 'r') as yamlFile:
|
||||
config = yaml.safe_load(yamlFile)
|
||||
for section in config:
|
||||
sectionDict = config[section]
|
||||
for key, value in sectionDict.items():
|
||||
os.environ[key] = value
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
args = parseArgs()
|
||||
loadConfig()
|
||||
main(
|
||||
secondsAgo=args.time,
|
||||
year=args.year,
|
||||
exclude=args.exclude,
|
||||
reinit=args.REINITIALIZE
|
||||
)
|
|
@ -0,0 +1,28 @@
|
|||
from sqlalchemy import (
|
||||
Column,
|
||||
Date,
|
||||
ForeignKey,
|
||||
Integer,
|
||||
String,
|
||||
Unicode,
|
||||
Boolean
|
||||
)
|
||||
|
||||
from sqlalchemy.dialects.postgresql import UUID
|
||||
from sqlalchemy.orm import relationship, backref
|
||||
from sqlalchemy.ext.associationproxy import association_proxy
|
||||
from sqlalchemy.orm.exc import NoResultFound
|
||||
|
||||
from model.core import Base, Core
|
||||
|
||||
|
||||
class Author(Core, Base):
|
||||
__tablename__ = 'author'
|
||||
id = Column(Integer, primary_key=True)
|
||||
name = Column(Unicode, nullable=False, index=True)
|
||||
primary = Column(Boolean, index=True)
|
||||
|
||||
cce_id = Column(Integer, ForeignKey('cce.id'), index=True)
|
||||
|
||||
def __repr__(self):
|
||||
return '<Author(name={}, primary={})>'.format(self.name, self.primary)
|
|
@ -0,0 +1,163 @@
|
|||
from lxml import etree
|
||||
import uuid
|
||||
from sqlalchemy import (
|
||||
Column,
|
||||
Date,
|
||||
ForeignKey,
|
||||
Integer,
|
||||
String,
|
||||
Unicode,
|
||||
Boolean
|
||||
)
|
||||
|
||||
from sqlalchemy.dialects.postgresql import UUID
|
||||
from sqlalchemy.orm import relationship, backref
|
||||
from sqlalchemy.ext.associationproxy import association_proxy
|
||||
from sqlalchemy.orm.exc import NoResultFound
|
||||
|
||||
from model.core import Base, Core
|
||||
|
||||
from model.xml import XML
from model.author import Author
from model.publisher import Publisher
from model.lccn import LCCN
from model.registration import Registration


class CCE(Core, Base):
    __tablename__ = 'cce'
    id = Column(Integer, primary_key=True)
    uuid = Column(UUID(as_uuid=True), unique=False, nullable=False, index=True)
    page = Column(Integer)
    page_position = Column(Integer)
    title = Column(Unicode, index=True)
    copies = Column(Unicode)
    description = Column(Unicode)
    new_matter = Column(Boolean)
    pub_date = Column(Date)
    pub_date_text = Column(Unicode)
    copy_date = Column(Date)
    copy_date_text = Column(Unicode)
    aff_date = Column(Date)
    aff_date_text = Column(Unicode)

    volume_id = Column(Integer, ForeignKey('volume.id'))

    registrations = relationship('Registration', backref='cce')
    lccns = relationship('LCCN', backref='cce', cascade='all, delete-orphan')
    authors = relationship('Author', backref='cce', cascade='all, delete-orphan')
    publishers = relationship('Publisher', backref='cce', cascade='all, delete-orphan')

    def __repr__(self):
        return '<CCE(regnums={}, uuid={}, title={})>'.format(self.registrations, self.uuid, self.title)

    def addRelationships(self, volume, xml, lccn=[], authors=[], publishers=[], registrations=[]):
        self.volume = volume
        self.addLCCN(lccn)
        self.addAuthor(authors)
        self.addPublisher(publishers)
        self.addRegistration(registrations)
        self.addXML(xml)

    def addLCCN(self, lccns):
        self.lccns = [ LCCN(lccn=lccn) for lccn in lccns ]

    def addXML(self, xml):
        xmlString = etree.tostring(xml, encoding='utf-8').decode()
        self.xml_sources.append(XML(xml_source=xmlString))

    def addAuthor(self, authors):
        for auth in authors:
            if auth[0] is None:
                print('No author name for {}'.format(self.uuid))
                continue
            self.authors.append(Author(name=auth[0], primary=auth[1]))

    def addPublisher(self, publishers):
        for pub in publishers:
            if pub[0] is None:
                print('No publisher name for {}'.format(self.uuid))
                continue
            self.publishers.append(Publisher(name=pub[0], claimant=(pub[1] == 'yes')))

    def addRegistration(self, registrations):
        self.registrations = [
            Registration(
                regnum=reg['regnum'],
                category=reg['category'],
                reg_date=reg['regDate'],
                reg_date_text=reg['regDateText']
            )
            for reg in registrations
        ]

    def updateRelationships(self, xml, lccn=[], authors=[], publishers=[], registrations=[]):
        self.addXML(xml)
        self.updateLCCN(lccn)
        self.updateAuthors(authors)
        self.updatePublishers(publishers)
        self.updateRegistrations(registrations)

    def updateLCCN(self, lccns):
        currentLCCNs = [ l.lccn for l in self.lccns ]
        if lccns != currentLCCNs:
            self.lccns = [
                l for l in self.lccns
                if l.lccn in set(currentLCCNs) & set(lccns)
            ]
            for new in set(lccns) - set(currentLCCNs):
                self.lccns.append(LCCN(lccn=new))

    def updateAuthors(self, authors):
        currentAuthors = [ (a.name, a.primary) for a in self.authors ]
        # Drop incoming rows without an author name, materialized as a list
        # of tuples so it can be compared and converted to a set
        newAuthors = [ (a[0], a[1]) for a in authors if a[0] is not None ]
        if newAuthors != currentAuthors:
            self.authors = [
                a for a in self.authors
                if (a.name, a.primary) in set(currentAuthors) & set(newAuthors)
            ]
            for new in set(newAuthors) - set(currentAuthors):
                self.authors.append(Author(name=new[0], primary=new[1]))

    def updatePublishers(self, publishers):
        currentPublishers = [ (p.name, p.claimant) for p in self.publishers ]
        newPublishers = [
            (p[0], p[1] == 'yes')
            for p in publishers if p[0] is not None
        ]
        if newPublishers != currentPublishers:
            self.publishers = [
                p for p in self.publishers
                if (p.name, p.claimant) in set(currentPublishers) & set(newPublishers)
            ]
            for new in set(newPublishers) - set(currentPublishers):
                self.publishers.append(Publisher(name=new[0], claimant=new[1]))

    def updateRegistrations(self, registrations):
        updateNums = [ reg['regnum'] for reg in registrations ]
        existingRegs = [
            self.updateReg(reg, registrations) for reg in self.registrations
            if reg.regnum in updateNums
        ]
        existingNums = [ reg.regnum for reg in existingRegs ]
        newRegs = [
            reg for reg in registrations
            if reg['regnum'] not in existingNums
        ]
        self.registrations = existingRegs + [
            Registration(
                regnum=reg['regnum'],
                category=reg['category'],
                reg_date=reg['regDate'],
                reg_date_text=reg['regDateText']
            )
            for reg in newRegs
        ]

    def updateReg(self, reg, registrations):
        newReg = CCE.getReg(reg.regnum, registrations)
        reg.update(newReg)
        return reg  # return the updated object so the caller can rebuild its list

    @staticmethod
    def getReg(regnum, newRegs):
        for new in newRegs:
            if regnum == new['regnum']: return new

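The `update*` methods above all follow the same pattern: compare incoming values against the current child records, keep the intersection, and append only the genuinely new items. A minimal pure-Python sketch of that diff (function and sample values are illustrative, not from the codebase):

```python
def diff_children(current, incoming):
    """Keep items present in both lists, then collect the genuinely new ones;
    mirrors the intersection/difference pattern in the update* methods."""
    keep = set(current) & set(incoming)
    kept = [c for c in current if c in keep]
    added = sorted(set(incoming) - set(current))
    return kept, added

kept, added = diff_children(['lccn-a', 'lccn-b'], ['lccn-b', 'lccn-c'])
# kept == ['lccn-b'], added == ['lccn-c']
```

Keeping the surviving ORM objects (rather than rebuilding all children) preserves their primary keys and timestamps; only the difference is inserted as new rows.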
@ -0,0 +1,21 @@
from sqlalchemy import Column, DateTime
from sqlalchemy.ext.declarative import declarative_base
from datetime import datetime

Base = declarative_base()


class Core(object):
    """A mixin for other SQLAlchemy ORM classes. Includes a date_created and
    date_modified field for all database tables."""
    date_created = Column(
        DateTime,
        default=datetime.now  # pass the callable so the timestamp is set per insert
    )

    date_modified = Column(
        DateTime,
        default=datetime.now,
        onupdate=datetime.now
    )

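SQLAlchemy column defaults should generally be the callable `datetime.now`, not the evaluated `datetime.now()`: the latter freezes a single import-time timestamp into every row. The difference is plain Python:

```python
from datetime import datetime

frozen = datetime.now()   # evaluated once, at definition time; never changes
make_now = datetime.now   # the callable itself, evaluated each time it is called

stamp_a = make_now()
stamp_b = make_now()
# stamp_a and stamp_b are fresh values taken at call time
```

SQLAlchemy invokes a callable `default`/`onupdate` at INSERT/UPDATE time, which is the behavior the `date_created`/`date_modified` columns want.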
@ -0,0 +1,58 @@
import os
import yaml
from elasticsearch_dsl import (
    Index,
    Document,
    Keyword,
    Text,
    Date,
    InnerDoc,
    Nested
)


class BaseDoc(Document):
    date_created = Date()
    date_modified = Date()

    def save(self, **kwargs):
        return super(BaseDoc, self).save(**kwargs)


class BaseInner(InnerDoc):
    date_created = Date()
    date_modified = Date()

    def save(self, **kwargs):
        return super(BaseInner, self).save(**kwargs)


class Registration(BaseInner):
    regnum = Keyword()
    regdate = Date()


class Renewal(BaseDoc):
    rennum = Keyword()
    rendate = Date()
    title = Text(fields={'keyword': Keyword()})
    claimants = Text(multi=True)

    class Index:
        with open('config.yaml', 'r') as yamlFile:
            config = yaml.safe_load(yamlFile)
        name = config['ELASTICSEARCH']['ES_CCR_INDEX']


class CCE(BaseDoc):
    uuid = Keyword(store=True)
    title = Text(fields={'keyword': Keyword()})
    authors = Text(multi=True)
    publishers = Text(multi=True)
    lccns = Keyword(multi=True)

    registrations = Nested(Registration)

    class Index:
        with open('config.yaml', 'r') as yamlFile:
            config = yaml.safe_load(yamlFile)
        name = config['ELASTICSEARCH']['ES_CCE_INDEX']

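Both `Index` inner classes read their index names from `config.yaml` at import time (note `config.yaml` is gitignored). A hypothetical config matching the keys referenced above, with placeholder values:

```yaml
ELASTICSEARCH:
  ES_CCE_INDEX: cce
  ES_CCR_INDEX: ccr
```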
@ -0,0 +1,42 @@
from lxml import etree
import uuid
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    Boolean
)

from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound

from model.core import Base, Core
from model.xml import XML

from model.author import Author
from model.publisher import Publisher
from model.lccn import LCCN


class ErrorCCE(Core, Base):
    __tablename__ = 'error_cce'
    id = Column(Integer, primary_key=True)
    uuid = Column(UUID(as_uuid=True), unique=False, nullable=False, index=True)
    regnum = Column(Unicode, unique=False, nullable=True)
    page = Column(Integer)
    page_position = Column(Integer)
    reason = Column(Unicode)

    volume_id = Column(Integer, ForeignKey('volume.id'))

    def __repr__(self):
        return '<ErrorCCE(regnum={}, uuid={})>'.format(self.regnum, self.uuid)

    def addXML(self, xml):
        xmlString = etree.tostring(xml, encoding='utf-8').decode()
        self.xml_sources.append(XML(xml_source=xmlString))

@ -0,0 +1,25 @@
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    Boolean
)

from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound

from model.core import Base, Core


class LCCN(Core, Base):
    __tablename__ = 'lccn'
    id = Column(Integer, primary_key=True)
    lccn = Column(Unicode, nullable=False, index=True)

    cce_id = Column(Integer, ForeignKey('cce.id'), index=True)

@ -0,0 +1,28 @@
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    Boolean
)

from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound

from model.core import Base, Core


class Publisher(Core, Base):
    __tablename__ = 'publisher'
    id = Column(Integer, primary_key=True)
    name = Column(Unicode, nullable=False, index=True)
    claimant = Column(Boolean, index=True)

    cce_id = Column(Integer, ForeignKey('cce.id'), index=True)

    def __repr__(self):
        return '<Publisher(name={}, claimant={})>'.format(self.name, self.claimant)

@ -0,0 +1,34 @@
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    Boolean
)

from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound

from model.core import Base, Core


class Registration(Core, Base):
    __tablename__ = 'registration'
    id = Column(Integer, primary_key=True)
    regnum = Column(Unicode, unique=False, nullable=False, index=True)
    category = Column(Unicode)
    reg_date = Column(Date)
    reg_date_text = Column(Unicode)

    cce_id = Column(Integer, ForeignKey('cce.id'), index=True)

    def __repr__(self):
        return '<Registration(regnum={}, date={})>'.format(self.regnum, self.reg_date_text)

    def update(self, newReg):
        # Translate camelCase keys from the source dicts onto the snake_case
        # column names before assignment
        fieldMap = {'regDate': 'reg_date', 'regDateText': 'reg_date_text'}
        for key, value in newReg.items():
            setattr(self, fieldMap.get(key, key), value)

@ -0,0 +1,30 @@
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    Boolean
)

from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound

from model.core import Base, Core


class RenClaimant(Core, Base):
    __tablename__ = 'renewal_claimant'
    id = Column(Integer, primary_key=True)
    name = Column(Unicode, nullable=False, index=True)
    claimant_type = Column(Unicode, index=True)

    renewal_id = Column(Integer, ForeignKey('renewal.id'))

    renewal = relationship('Renewal', backref='claimants')

    def __repr__(self):
        return '<Claimant(name={}, type={})>'.format(self.name, self.claimant_type)

@ -0,0 +1,88 @@
import uuid
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    Boolean,
    Table
)

from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound

from model.core import Base, Core
from model.renClaimant import RenClaimant


RENEWAL_REG = Table(
    'renewal_registration',
    Base.metadata,
    Column('renewal_id', Integer, ForeignKey('renewal.id'), index=True),
    Column('registration_id', Integer, ForeignKey('registration.id'), index=True)
)


class Renewal(Core, Base):
    __tablename__ = 'renewal'
    id = Column(Integer, primary_key=True)
    uuid = Column(UUID(as_uuid=True), unique=False, nullable=False, index=True)
    volume = Column(Integer)
    part = Column(Unicode)
    number = Column(Integer)
    page = Column(Integer)
    author = Column(Unicode)
    title = Column(Unicode, index=True)
    reg_data = Column(Unicode)
    renewal_num = Column(Unicode)
    renewal_date = Column(Date)
    renewal_date_text = Column(Unicode)
    new_matter = Column(Unicode)
    see_also_regs = Column(Unicode)
    see_also_rens = Column(Unicode)
    notes = Column(Unicode)
    source = Column(Unicode)
    orphan = Column(Boolean, default=False)

    registrations = relationship(
        'Registration',
        secondary=RENEWAL_REG,
        backref='renewals'
    )

    def __repr__(self):
        return '<CCR(regs={}, uuid={}, title={})>'.format(self.registrations, self.uuid, self.title)

    def addClaimants(self, claimants):
        if claimants:
            for claim in claimants.split('||'):
                cParts = claim.split('|')
                self.claimants.append(
                    RenClaimant(name=cParts[0], claimant_type=cParts[1])
                )

    def updateClaimants(self, claimants):
        addClaims = [
            claim.split('|') for claim in claimants.split('||')
        ]
        existingClaims = [
            c for c in self.claimants
            if c.name in [ a[0] for a in addClaims ]
        ]
        newClaims = [
            c for c in addClaims
            if c[0] not in [ e.name for e in existingClaims ]
            and c[0] != ''
        ]

        self.claimants = existingClaims + [
            RenClaimant(
                name=claim[0],
                claimant_type=claim[1]
            )
            for claim in newClaims
        ]

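The claimant strings handled by `addClaimants` use `||` between claimants and `|` between a claimant's name and type. A standalone sketch of that parsing (function name and sample data are invented for illustration):

```python
def parse_claimants(raw):
    """Split 'name|type||name|type' into (name, type) tuples,
    skipping entries with an empty name."""
    if not raw:
        return []
    pairs = [claim.split('|') for claim in raw.split('||')]
    return [(p[0], p[1]) for p in pairs if p[0] != '']

claims = parse_claimants('Jane Doe|A||Acme Press|B')
# claims == [('Jane Doe', 'A'), ('Acme Press', 'B')]
```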
@ -0,0 +1,37 @@
import uuid
import json
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    PrimaryKeyConstraint
)

from sqlalchemy.orm import relationship, backref
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.orm.exc import NoResultFound

from model.core import Base, Core


class Volume(Core, Base):
    __tablename__ = 'volume'
    id = Column(Integer, primary_key=True)
    source = Column(Unicode)
    status = Column(Unicode)
    series = Column(Unicode)
    volume = Column(Integer)
    year = Column(Integer)
    part = Column(Unicode)
    group = Column(Integer)
    material = Column(Unicode)
    start_number = Column(Integer)
    end_number = Column(Integer)

    entries = relationship('CCE', backref='volume')
    error_entries = relationship('ErrorCCE', backref='volume')

    def __repr__(self):
        return '<Volume(series={}, volume={}, year={})>'.format(self.series, self.volume, self.year)

@ -0,0 +1,50 @@
from sqlalchemy.types import PickleType
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.orm import relationship, backref
from sqlalchemy import (
    Column,
    Date,
    ForeignKey,
    Integer,
    String,
    Unicode,
    PrimaryKeyConstraint,
    Table
)

from model.core import Base, Core


# Compile String columns as the native XML type on PostgreSQL
@compiles(String, 'postgresql')
def compile_xml(type_, compiler, **kw):
    return "XML"


ENTRY_XML = Table(
    'entry_xml',
    Base.metadata,
    Column('cce_id', Integer, ForeignKey('cce.id'), index=True),
    Column('xml_id', Integer, ForeignKey('xml.id'), index=True)
)

ERROR_XML = Table(
    'error_entry_xml',
    Base.metadata,
    Column('error_cce_id', Integer, ForeignKey('error_cce.id'), index=True),
    Column('xml_id', Integer, ForeignKey('xml.id'), index=True)
)


class XML(Core, Base):
    __tablename__ = 'xml'
    id = Column(Integer, primary_key=True)
    xml_source = Column(String)

    entry = relationship(
        'CCE',
        secondary=ENTRY_XML,
        backref='xml_sources'
    )

    error_entry = relationship(
        'ErrorCCE',
        secondary=ERROR_XML,
        backref='xml_sources'
    )

@ -0,0 +1,193 @@
import base64
import csv
from datetime import datetime
from github import Github
from io import StringIO
import os
import re

from sqlalchemy.orm.exc import MultipleResultsFound

from model.renewal import Renewal
from model.registration import Registration


class CCRReader():
    def __init__(self, manager):
        self.git = Github(os.environ['ACCESS_TOKEN'])
        self.repo = self.git.get_repo(os.environ['CCR_REPO'])
        self.ccrYears = {}

        self.dbManager = manager

    def loadYears(self, selectedYear, loadFromTime):
        for year in self.repo.get_contents('/data'):
            yearMatch = re.match(r'^([0-9]{4}).*\.tsv$', year.name)
            if not yearMatch: continue
            fileYear = yearMatch.group(1)
            if selectedYear is not None and selectedYear != fileYear: continue
            fileCommit = self.repo.get_commits(path=year.path)[0]
            commitDate = fileCommit.commit.committer.date
            if loadFromTime is not None and commitDate < loadFromTime: continue
            self.ccrYears[fileYear] = {
                'path': year.path,
                'filename': year.name,
                'sha': year.sha
            }

    def importYears(self):
        for year in self.ccrYears.keys(): self.importYear(year)

    def importYear(self, year):
        yearInfo = self.ccrYears[year]
        print('Loading Year {}'.format(year))
        ccrFile = CCRFile(self.repo, yearInfo, self.dbManager.session)
        ccrFile.loadFileTSV()
        ccrFile.readRows()
        self.dbManager.commitChanges()


class CCRFile():
    def __init__(self, repo, ccrFile, session):
        self.repo = repo
        self.ccrFile = ccrFile
        self.session = session

        self.rows = []

    def loadFileTSV(self):
        yearBlob = self.repo.get_git_blob(self.ccrFile['sha'])
        tsvString = base64.b64decode(yearBlob.content).decode('utf-8')
        tsvFile = StringIO(tsvString)
        self.rows = csv.DictReader(tsvFile, delimiter='\t', quotechar='"')

    def readRows(self):
        for row in self.rows: self.parseRow(row)

    def parseRow(self, row):
        rec = self.matchRenewal(row['entry_id'])
        if rec: self.updateRenewal(rec, row)
        else: self.createRenewal(row)

    def createRenewal(self, row):
        title = CCRFile.cascadeFieldNameLoad('title', 'titl', row=row)
        renewalDateText = CCRFile.cascadeFieldNameLoad('rdat', 'dreg', row=row)
        source = CCRFile.cascadeFieldNameLoad('source', 'full_text', row=row)
        author = CCRFile.cascadeFieldNameLoad('author', 'auth', row=row)
        notes = CCRFile.cascadeFieldNameLoad('notes', 'note', row=row)

        try:
            renDate = datetime.strptime(renewalDateText, '%Y-%m-%d')
        except ValueError:
            renDate = None

        renRec = Renewal(
            uuid=row['entry_id'],
            author=author,
            title=title,
            reg_data='{}|{}'.format(row['oreg'], row['odat']),
            renewal_num=row['id'],
            renewal_date=renDate,
            renewal_date_text=renewalDateText,
            new_matter=row['new_matter'],
            see_also_regs=row['see_also_reg'],
            see_also_rens=row['see_also_ren'],
            notes=notes,
            source=source
        )

        for numField in ['volume', 'part', 'number', 'page']:
            setattr(
                renRec,
                numField,
                row[numField] if row[numField] != '' else None
            )

        self.matchRegistrations(renRec, row['oreg'], row['odat'])
        renRec.addClaimants(row['claimants'])

        self.session.add(renRec)
        print('INSERT', renRec)

    def updateRenewal(self, rec, row):
        rec.uuid = row['entry_id']
        rec.title = CCRFile.cascadeFieldNameLoad('title', 'titl', row=row)
        rec.source = CCRFile.cascadeFieldNameLoad('source', 'full_text', row=row)
        rec.author = CCRFile.cascadeFieldNameLoad('author', 'auth', row=row)
        rec.notes = CCRFile.cascadeFieldNameLoad('notes', 'note', row=row)
        rec.reg_data = '{}|{}'.format(row['oreg'], row['odat'])
        rec.renewal_num = row['id']
        rec.new_matter = row['new_matter']
        if row['see_also_reg']:
            rec.see_also_regs = row['see_also_reg']
        if row['see_also_ren']:
            rec.see_also_rens = row['see_also_ren']

        rec.renewal_date_text = CCRFile.cascadeFieldNameLoad('rdat', 'dreg', row=row)
        try:
            rec.renewal_date = datetime.strptime(rec.renewal_date_text, '%Y-%m-%d')
        except ValueError:
            rec.renewal_date = None

        for numField in ['volume', 'part', 'number', 'page']:
            setattr(
                rec,
                numField,
                row[numField] if row[numField] != '' else None
            )

        self.matchRegistrations(rec, row['oreg'], row['odat'])
        rec.updateClaimants(row['claimants'])

        print('UPDATE', rec)

    def matchRenewal(self, uuid):
        return self.session.query(Renewal).filter(Renewal.uuid == uuid).one_or_none()

    def matchRegistrations(self, renRec, regnum, origDate):
        if regnum is None or regnum.strip() == '': return
        try:
            checkDate = datetime.strptime(origDate, '%Y-%m-%d')
        except ValueError:
            checkDate = None
        regnumQuery = self.session.query(Registration)\
            .filter(Registration.regnum == regnum)\
            .filter(Registration.reg_date == checkDate)
        try:
            origReg = regnumQuery.one_or_none()
        except MultipleResultsFound:
            # More than one registration matched; treat the first as the
            # match and record the rest as see-also references
            origRegs = regnumQuery.all()
            origReg = origRegs[0]
            renRec.see_also_regs = '{}|{}'.format(
                renRec.see_also_regs,
                '|'.join([ r.regnum for r in origRegs[1:] ])
            )

        if origReg:
            renRec.registrations.append(origReg)
            renRec.orphan = False
        else:
            print('Matching Registration not found!')
            if len(renRec.registrations) < 1:
                renRec.orphan = True

    @staticmethod
    def cascadeFieldNameLoad(*fields, row=None):
        for field in fields:
            try:
                return row[field]
            except KeyError:
                pass
        raise KeyError('No matching field found in {}'.format(fields))

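`cascadeFieldNameLoad` tries each candidate column name in order, which lets the reader cope with TSV files whose headers vary across years (`title` vs. `titl`, `notes` vs. `note`, and so on). Its fallback behaves like this standalone sketch (function name and row data are invented):

```python
def cascade_field(row, *fields):
    """Return the value of the first of `fields` present in `row`;
    raise KeyError if none of the candidate names match."""
    for field in fields:
        if field in row:
            return row[field]
    raise KeyError('none of {} present'.format(fields))

value = cascade_field({'titl': 'Some Title'}, 'title', 'titl')
# value == 'Some Title'
```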
@ -0,0 +1,7 @@
elasticsearch-dsl>=6.0.0,<7.0.0
lccnorm
lxml
psycopg2-binary
pygithub
pyyaml
sqlalchemy

@ -0,0 +1,57 @@
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from model.core import Base


class SessionManager():
    def __init__(self, user=None, pswd=None, host=None, port=None, db=None):
        self.user = user if user else os.environ.get('DB_USER', None)
        self.pswd = pswd if pswd else os.environ.get('DB_PSWD', None)
        self.host = host if host else os.environ.get('DB_HOST', None)
        self.port = port if port else os.environ.get('DB_PORT', None)
        self.db = db if db else os.environ.get('DB_NAME', None)

        self.engine = None
        self.session = None

    def generateEngine(self):
        self.engine = create_engine(
            'postgresql://{}:{}@{}:{}/{}'.format(
                self.user,
                self.pswd,
                self.host,
                self.port,
                self.db
            )
        )

    def initializeDatabase(self, reinit=False):
        if reinit: Base.metadata.drop_all(self.engine, checkfirst=True)
        if not self.engine.dialect.has_table(self.engine, 'cce'):
            Base.metadata.create_all(self.engine)

    def createSession(self, autoflush=True):
        if not self.engine: self.generateEngine()
        self.session = sessionmaker(
            bind=self.engine,
            autoflush=autoflush
        )()
        return self.session

    def startSession(self):
        self.session.begin_nested()

    def commitChanges(self):
        self.session.commit()

    def rollbackChanges(self):
        self.session.rollback()

    def closeConnection(self):
        self.commitChanges()
        self.session.close()
        self.engine.dispose()

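`generateEngine` assembles a standard libpq-style connection URL from the five settings gathered in `__init__`. Extracted as a pure function for illustration (the credentials here are placeholders):

```python
def build_postgres_url(user, pswd, host, port, db):
    """Mirror of the connection string assembled in generateEngine."""
    return 'postgresql://{}:{}@{}:{}/{}'.format(user, pswd, host, port, db)

url = build_postgres_url('app', 'secret', 'localhost', 5432, 'copyright')
# url == 'postgresql://app:secret@localhost:5432/copyright'
```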