65 Commits (ca94c128deda4d32ec1850a8cc79b18743c6e382)

Author SHA1 Message Date
eric ca94c128de online to download handling
+ fix bug that made everything 'online'
+ handle online ebooks with multiple format downloads
+ download ebooks with volatile links
+ move contenttyper to core.loaders.utils
+ add handling for really html ebooks
2018-04-09 16:32:52 -04:00
eric 07fd095b9a fix bugs 2018-04-09 11:54:16 -04:00
eric 0ba2906c62 delint 2018-04-07 18:38:33 -04:00
eric e03fa239b4 revamp doab loading
- doab loading now done primarily by oai, no processing of csv.
- added pyoai and updated lxml
- doab ids or urls in ebook submission now handled by oai scrape
- doab_load_books removed
- doab_utils moved from Gluejar/DOAB
- licenses now recognizes OpenEdition
- new ebook type "online" will implement in UI after mobile launch;
ebooks now created for html contenttype
2018-04-07 17:11:36 -04:00
eric 533eb94152 load springer improvements
We've loaded about half the Springer Open books catalog, adding 20
books at a time. I wanted to load page 23 of results without having to
load pages 1-22. Also added some exception handling.
2018-03-22 16:13:55 -04:00
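Note on 533eb94152: the idea of starting a paged load at an arbitrary results page, with a failure on one page not aborting the rest, could be sketched roughly as follows. This is a minimal illustration, not the code from the commit. The names `page_range`, `load_pages`, and `fetch_page` are hypothetical, and the page size of 20 is taken only from the commit message.

```python
PAGE_SIZE = 20  # assumption: 20 books per results page, per the commit message


def page_range(start_page, page_count, total_pages):
    """Return the page numbers to load, beginning at start_page."""
    if start_page < 1:
        raise ValueError("start_page must be >= 1")
    last = min(start_page + page_count - 1, total_pages)
    return list(range(start_page, last + 1))


def load_pages(fetch_page, start_page=1, page_count=1, total_pages=1):
    """Load books page by page; fetch_page is a hypothetical callable
    that takes a page number and returns a list of book records."""
    loaded, failed = [], []
    for page in page_range(start_page, page_count, total_pages):
        try:
            loaded.extend(fetch_page(page))
        except Exception:
            # Record the failure and continue with the next page.
            failed.append(page)
    return loaded, failed
```

With this shape, loading only page 23 would be `load_pages(fetch_page, start_page=23)`, with no need to request pages 1-22 first.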
eric ad9523314d fix bug in ubiquity scraper 2018-02-20 13:07:44 -05:00
eric 33f4b75417 stricter RE 2018-01-04 16:53:29 -05:00
eric ba381add02 add smashwords 2018-01-03 15:53:02 -05:00
eric 59388933a9 one scraper per file 2018-01-03 13:58:45 -05:00
eric e837dd6ff2 added date validation 2018-01-03 13:30:36 -05:00
eric c8837c3c74 make check_metas case insensitive for name 2018-01-03 11:54:48 -05:00
eric 3f3428a68b add some opengraph support 2018-01-02 18:20:34 -05:00
eric f1213d590c fix can_scrape 2018-01-01 19:25:00 -05:00
eric cf093c945d add some custom code for ubiquity press sites 2017-12-23 18:29:16 -05:00
eric e6dbae05db update springer 2017-12-23 18:15:59 -05:00
eric f701f1ba36 refactor can_scrape 2017-12-23 18:12:07 -05:00
eric d1cf6e6fb3 fix some scraping bugs 2017-12-15 19:26:50 -05:00
eric ebf68befeb add Springer publisher 2017-12-10 16:38:30 -05:00
eric 3c7c9ade00 add Springer to get_scraper 2017-12-07 17:36:35 -05:00
eric d53b3bcc8d delint 2017-12-07 17:36:08 -05:00
eric 5ccd7a0a47 add get_role to scraper 2017-12-07 17:35:52 -05:00
eric c6885ff84b fix springer descriptions 2017-12-07 16:35:11 -05:00
eric 81c3268f70 fix license url 2017-12-07 16:34:25 -05:00
eric 82784778c4 add springer scraper 2017-12-06 18:13:46 -05:00
eric 28fa60ffba fix cover finding 2017-11-21 11:10:46 -05:00
eric a09f3907b3 add pressbooks sites, improve pubdata scraper 2017-11-20 18:05:07 -05:00
eric 98cbef7104 gather isbns from schema.org
and stop raising unwanted exceptions
2017-11-06 12:42:52 -05:00
eric 6487916adb omit review metadata 2017-11-06 12:38:06 -05:00
eric b5e52effd9 optimize id access
See
https://docs.djangoproject.com/en/1.11/topics/db/optimization/#use-foreign-key-values-directly
2017-10-28 18:33:58 -04:00
eric 2a7773fafa add hathitrust scraper 2017-10-27 12:09:03 -04:00
eric f2fb171708 fix bug 2017-09-28 14:17:12 -04:00
eric fa4573a74d authlist cleaner, definition lists 2017-09-28 13:25:56 -04:00
eric 467ab8a425 add scraper selector 2017-09-27 19:20:14 -04:00
eric db03b59fb4 add code for pressbooks scraping 2017-09-27 17:54:44 -04:00
eric 1ce4323bc4 precheck every new subject
fix bug with '/' in subject
interpret ';' as list delimiter
add cleaner script
2017-09-15 15:55:37 -04:00
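Note on 1ce4323bc4: treating ';' as a list delimiter for subjects might look something like the sketch below. It is an assumption-laden illustration rather than the commit's code: `split_subjects` is a hypothetical name, and since the log does not describe the '/' bug, '/' is simply left untouched here.

```python
def split_subjects(raw):
    """Split a raw subject string on ';', trimming whitespace and
    dropping empty entries. A '/' inside a subject is preserved."""
    return [part.strip() for part in raw.split(";") if part.strip()]
```

For example, `split_subjects("History; Art/Design")` yields two subjects, the second of which keeps its '/'.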
eric 5bbeb45053 improve merge_works
work_relations were not being updated
2017-09-04 16:10:24 -04:00
eric 6895302338 add OpenGraph type, title, and cover to scraper 2017-08-24 14:43:31 -04:00
eric e7847ae349 remove debug code 2017-08-23 12:24:04 -04:00
eric 0c687fdad4 add command to load from sitemaps 2017-08-23 12:21:56 -04:00
eric 1bd1f943f6 fix bug in edition assignment 2017-08-18 16:39:11 -04:00
eric ca5d9e1053 fix edition note assignment 2017-08-09 21:14:38 -04:00
eric f9d31b0f51 fix glue resolution 2017-08-07 21:46:21 -04:00
eric 489790fa2f add ebook loading code 2017-08-07 16:17:00 -04:00
eric e8bd8725cc handle edition ids better
also, allow contributor to request unglue.it id
2017-08-04 17:12:05 -04:00
eric 08702a7b08 scrapes the metadata
also moves id validation to core
2017-08-03 16:15:06 -04:00
eric 7bc72692c5 add exception handling 2017-07-30 13:55:46 -04:00
eric aaef670798 add scraper for webpages
gets title, description, language

adds beautiful soup to requirements
updates gitenberg.metadata import
2017-07-29 20:46:22 -04:00
eric 2adf3cc7cd handle isbn and goog lookups 2017-07-27 15:13:04 -04:00
eric 7294a5c679 update doi regexp and display
https://www.crossref.org/display-guidelines/
2017-02-22 11:21:24 -05:00
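Note on 7294a5c679: the Crossref display guidelines linked above recommend showing a DOI as a full `https://doi.org/` URL. A rough sketch of matching a DOI and producing that display form is below. The pattern is a commonly used approximation for modern DOIs, not the regexp from this commit, and `DOI_RE` and `doi_display` are hypothetical names.

```python
import re

# Approximate pattern for modern DOIs (assumption, not the commit's regexp):
# "10.", a 4-9 digit registrant code, "/", then a suffix of allowed characters.
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def doi_display(text):
    """Return the https://doi.org/ display form of the first DOI found
    in text, or None if no DOI-like string is present."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    return "https://doi.org/" + match.group(0)
```

Because the suffix class allows '.', a DOI followed by sentence punctuation would need extra trimming, which this sketch does not attempt.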
eric 652d9a3456 modify doab load to handle authlists
also fix a few encoding issues and null data problems resulting in
non-loading and ftp redirects
2016-12-02 15:50:07 -05:00