handle unencoded redirects

Dorien noticed that OAPEN links with "%" in them were resulting in "0", even though they worked fine in browsers.
main
eric 2024-03-13 10:43:51 -04:00
parent 40313c0834
commit 3614da5174
3 changed files with 17 additions and 2 deletions


@@ -5,7 +5,7 @@ import logging
 import re
 import threading
 import time
-from urllib.parse import urlparse
+from urllib.parse import quote, unquote, urlparse, urlsplit, urlunsplit
 import requests
@@ -29,7 +29,17 @@ class ContentTyper(object):
     def content_type(self, url):
         try:
-            r = requests.head(url, allow_redirects=True, headers=HEADERS)
+            try:
+                r = requests.get(url, allow_redirects=True, headers=HEADERS)
+            except UnicodeDecodeError as ude:
+                # fallback for non-ascii, non-utf8 bytes in redirect location
+                if 'utf-8' in str(ude):
+                    (scheme, netloc, path, query, fragment) = urlsplit(url)
+                    newpath = quote(unquote(path), encoding='latin1')
+                    url = urlunsplit((scheme, netloc, newpath, query, fragment))
+                    r = requests.get(url, allow_redirects=True, headers=HEADERS)
+                    if r.status_code == 200:
+                        r.status_code = 214 # unofficial status code where url is changed
             if r.status_code == 405:
                 r = requests.get(url, headers=HEADERS)
             return r
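
The path rewrite used in the fallback branch can be tried in isolation. The following is a minimal sketch, not part of the commit: the helper name `reencode_path_latin1` and the `example.org` URLs are hypothetical, chosen only for illustration. It decodes the percent-escapes in the path (as UTF-8, the `unquote` default) and re-escapes the result as Latin-1 bytes, leaving the scheme, host, query, and fragment untouched.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def reencode_path_latin1(url):
    """Re-escape the path of `url` using Latin-1 bytes (hypothetical helper)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # unquote() decodes %XX sequences as UTF-8 by default;
    # quote(..., encoding='latin1') then escapes each character as one Latin-1 byte.
    newpath = quote(unquote(path), encoding='latin1')
    return urlunsplit((scheme, netloc, newpath, query, fragment))


# A UTF-8 escaped "ä" (%C3%A4) becomes a Latin-1 escaped "ä" (%E4).
rewritten = reencode_path_latin1("https://example.org/b%C3%A4r/file.pdf")

# A pure-ASCII URL passes through unchanged.
unchanged = reencode_path_latin1("https://example.org/a/b?x=1#f")
```

Note that `quote` uses strict error handling by default, so a path containing characters that have no Latin-1 representation would raise `UnicodeEncodeError` in this sketch.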


@@ -22,6 +22,10 @@ When a link is checked we record the status code and content type returned by th
 </p>
 <dl>
+<dt id='code214'>"214" indicates an unescaped redirect location.
+<dd> When a server redirects a link, it is supposed to send a valid URL in the "location" header so that the web client knows where to go. URLs should contain only ASCII characters; non-ASCII characters are URL-escaped, using %XX to represent each non-ASCII byte. We have found that some servers send location strings with unescaped characters like "ä" (%E4 in Latin-1 or %C3%A4 in UTF-8 when escaped) in the location header. Most web client software makes a best guess at how to interpret the bytes, so the redirection usually succeeds. ("214" is not a standard HTTP code.)
 <dt id='code301'>"301" or "302" indicates a bad redirect.
 <dd> Redirects are used to keep links working after they've changed addresses. CrossRef links, for example, are usually redirected to the publisher's website. But sometimes the redirecting server gets it wrong, and there are problems with the resolutions. Another type of problem involves chains of redirects: there might be a loop, or there might be an insecure link in the middle of an otherwise secure chain of links. That used to be OK, but now it's an error.
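
The two escapes quoted in the "214" description correspond to two different byte encodings of the same character. As a small illustration (a sketch using only the standard library; the helper name `is_valid_utf8` is hypothetical and not part of this commit), the snippet below shows both escapes and why a client that expects UTF-8 can fail on a Latin-1 byte.

```python
from urllib.parse import quote

A_UMLAUT = "\u00e4"  # the character "ä"

# One byte in Latin-1, two bytes in UTF-8.
latin1_escaped = quote(A_UMLAUT, encoding="latin1")
utf8_escaped = quote(A_UMLAUT)  # quote() defaults to UTF-8


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The lone Latin-1 byte 0xE4 is not a complete UTF-8 sequence.
latin1_ok = is_valid_utf8(A_UMLAUT.encode("latin1"))
utf8_ok = is_valid_utf8(A_UMLAUT.encode("utf-8"))
```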


@@ -41,6 +41,7 @@ View <a href="{% url 'publishers' %}">the list of publishers whose links we've c
 When a link is checked we record the status code and content type returned by the web server.
 </p>
 <ul>
+<li><a href="{% url 'fixing' %}#code214">"214"</a> indicates an unescaped redirect.
 <li><a href="{% url 'fixing' %}#code302">"301" or "302"</a> indicates a bad redirect.
 <li><a href="{% url 'fixing' %}#code403">"403"</a> indicates a misconfigured server that is not allowing access to the promised resource.
 <li><a href="{% url 'fixing' %}#code404">"404"</a> means the link is broken - the resource is not found.