Commit Graph

1593 Commits (4c92a23dc89aa51242d50b22948041eeb7f1565c)

Author SHA1 Message Date
eric 4969994a87 urllib2 didn't handle chunked method 2019-06-13 16:20:05 -04:00
eric e3a5a50f34 catch S3 exception 2019-06-13 16:18:54 -04:00
eric 703db9ed98 add SciELO to good providers 2019-06-12 17:12:54 -04:00
eric 6814380aa4 tweak scielo handling
and add a management command to fix the old ones
2019-06-12 17:02:57 -04:00
eric d5f5656d3c fix missing logger 2019-06-12 17:02:11 -04:00
eric e60e8bfbf8 get dl url from dl link 2019-06-07 15:20:05 -04:00
eric e42d77589b tighten exception handling
got a bunch of integrity error failures; probably some other exception being thrown here.
2019-06-06 17:23:45 -04:00
eric e5ba5caab4 revert search method
fulltext search returned too many results
2019-06-05 14:21:02 -04:00
eric de3e6c499c try to fix missing scheme 2019-05-05 12:50:52 -04:00
eric 14346ed868 delint 2019-03-27 21:46:25 -04:00
eric c142533898 db cleaning 2019-03-27 21:22:56 -04:00
eric e563da9655 refactor lang validation 2019-03-27 21:22:37 -04:00
eric 6fd33d989c don't create bad works 2019-03-27 21:21:25 -04:00
eric 5fc6a2ee82 harvest more ebooks 2019-03-25 12:47:20 -04:00
eric fe05ff9f88 don't stall on super big pdf files 2019-03-25 12:47:04 -04:00
eric 2396e23ae4 fix missing lang string 2019-03-25 12:46:20 -04:00
eric 174b46abd1 add mobied to ebf admin 2019-03-25 12:45:53 -04:00
eric c190fc0bb1 fix undefined "stapled" 2019-03-08 23:45:54 -05:00
eric 9b12418ada catch more pdf errors 2019-03-05 12:02:42 -05:00
eric cefbc7c56f bugfix 2019-03-05 10:12:51 -05:00
eric d87578c5a0 harden stapler 2019-03-04 17:27:55 -05:00
eric 52b1621633 bugfix 2019-03-02 20:55:42 -05:00
eric 7c33cae82e refinements
- handle dropbox urls with no params
- catch exceptions in stapler
- fix dedupe summary
2019-03-02 19:16:47 -05:00
eric 9bf2d85108 fix degruyter signifier
also propagate user_agent
2019-03-02 16:00:11 -05:00
eric 943031ca22 whoops 2019-03-01 22:38:46 -05:00
eric 02170c9bc2 management commands
1. run an update of providers
2. dedupe the online ebooks
3. should have half the onlines to harvest
2019-03-01 21:26:39 -05:00
eric ac5c241e09 resolve doi in doab provider
- resolve the doi before setting the provider
- strip "www." from netloc
- strip url before setting provider
2019-03-01 21:23:54 -05:00
eric 1fdac9c548 remove dead code 2019-02-28 16:34:14 -05:00
eric 0282ed8136 delint 2019-02-28 16:22:23 -05:00
eric 72a40976bc add degruyter handling
- move harvest to separate module
- add ratelimiter class
- add pdf stapler
- add a googlebot UA
- add base url storage in get_soup
2019-02-28 15:32:41 -05:00
eric e162308191 change to a fulltext query and indices
(this is only a ~20% improvement)
2019-02-27 16:40:21 -05:00
eric 390f403e6c missing import 2019-02-18 15:29:16 -05:00
eric 1a8f22411a change to ku sso 2019-02-18 15:06:40 -05:00
eric 8652ce0b77 add rounds to ku 2019-01-18 12:03:04 -05:00
eric c6771f2eed fix limit on harvest_online 2018-12-10 14:30:54 -05:00
eric 260650ba92 handle application/binary 2018-12-10 14:28:39 -05:00
eric 24ab902e00 added ebook activation 2018-11-05 18:48:35 -05:00
eric ed64dc2b3f bugfix 2018-11-05 18:17:46 -05:00
eric 6535505e4d Revert "Merge branch 'master' into master"
This reverts commit bd52df020d, reversing
changes made to e455d9a766.
2018-11-03 17:23:07 -04:00
eshellman bd52df020d Merge branch 'master' into master 2018-11-03 17:06:09 -04:00
eric f4d7e6f888 working ku code 2018-11-03 14:47:41 -04:00
eric f98de7114e add oapn id 2018-11-03 14:33:23 -04:00
eric add0375ac3 working scraper 2018-11-02 14:03:30 -04:00
eshellman b727aaf9a9
Merge pull request #813 from Gluejar/kuscrape
Kuscrape
2018-11-02 13:58:24 -04:00
eric 57769f65a1 Update core/loaders/multiscrape.py
update to facilitate merge
2018-11-02 13:24:23 -04:00
eric 53995ffb4a allow scrapers to set parser
needed to support xml harvests
2018-10-29 22:42:49 -04:00
eric 3697789274 wip 2018-10-09 09:05:31 -04:00
eric 272616895d fix github3 issue 2018-09-10 12:04:12 -04:00
eric a87cdfc8ef make sure cc url is not garbage 2018-09-09 22:12:42 -04:00
eric 04aed3bf16 add opentextbc to pressbooks list 2018-09-09 21:55:38 -04:00