URL Processing¶
If your applications have reached the Glory of REST by using hypermedia controls throughout, then you aren’t manipulating URLs a lot unless you are responsible for generating them. However, if you are interacting with less mature web applications, you need to manipulate URLs and you are probably doing something like:
>>> url_pattern = 'http://example.com/api/movie/{movie_id}/actors'
>>> response = requests.get(url_pattern.format(movie_id=ident))
or even (the horror):
>>> url = 'http://{0}/{1}?{2}'.format(host, path, query)
>>> response = requests.get(url)
If you are a little more careful, you could be URL encoding the argument to prevent URL injection attacks. This isn’t a horrible pattern for generating URLs from a known pattern and data. But what about other types of manipulation? How do you take a URL and point it at a different host?
>>> # really brute force?
>>> url = url_pattern.format(movie_id=1234)
>>> url = url[:7] + 'host.example.com' + url[18:]
>>> # with str.split + str.join??
>>> parts = url.split('/')
>>> parts[2] = 'host.example.com'
>>> url = '/'.join(parts)
>>> # leverage the standard library???
>>> import urllib.parse
>>> parts = urllib.parse.urlsplit(url)
>>> url = urllib.parse.urlunsplit((parts.scheme, 'host.example.com',
... parts.path, parts.query, parts.fragment))
...
>>>
Let’s face it, manipulating URLs in Python is less than ideal. What about something like the following instead?
>>> from ietfparse import algorithms
>>> url = algorithms.encode_url_template(url_pattern, movie_id=1234)
>>> url = algorithms.rewrite_url(url, host='host.example.com')
And, yes, the encode_url_template()
is doing a bit more than
calling str.format()
. It implements the full gamut of RFC 6570 URL
Templates which happens to handle our case quite well.
rewrite_url()
is closer to the urlsplit()
and
urlunsplit()
case with a nicer interface and a bit of
additional functionality as well. For example, if you are a little more
forward looking, then you probably have heard of International Domain Names
(RFC 5980). The rewrite_url()
function will correctly encode names
using the codecs.idna
. It also implements the same query encoding
tricks that urlencode()
does.
>>> from ietfparse import algorithms
>>> algorithms.rewrite_url('http://example.com', query={'b': 12, 'a': 'c'})
'http://example.com?a=c&b=12'
>>> algorithms.rewrite_url('http://example.com', query=[('b', 12), ('a', 'c')])
'http://example.com?b=12&a=c'
There is a lot going on in those two examples. See the documentation for
rewrite_url()
for all of the details.
Relevant Specifications¶
Known and Accepted Variances¶
Some of the IETF specifications require deep understanding of the underlying URL scheme. These portions are not implemented since they would unnecessarily couple this library to an open-ended set of protocol specifications. This section attempts to cover all such variances.
The host
portion of a URL is not strictly required to be a valid DNS
name for schemes that are restricted to using DNS names. For example,
http://-/
is a questionably valid URL. RFC 1035#section-3.5 prohibits
domain names from beginning with a hyphen and RFC 7230#section-2.7.1
strongly implies (requires?) that the host be an IP literal or valid DNS
name. However, file:///-
is perfectly acceptable, so the requirement
specific to HTTP is left unenforced.
Similarly, the port
portion of a network location is usually a network
port which is limited to 16-bits by both RFC 793 and RFC 768. This
is strictly required to be a TCP port in the case of HTTP (RFC 7230).
This library only limits the port
to a non-negative integer. The other
SHOULD that is not implemented is the suggestion that default port numbers
are omitted - see section 3.2.3 of RFC 3986#section-3.2.3.
Influencing URL Processing¶
URLs are finicky things with a wealth of specifications that sometimes seem to contradict each other. Whenever a grey area was encountered, this library tried to make the result controllable from the outside. For example, section 3.2.2 of RFC 3986#section-3.2.2 contains the following paragraph when describing the host portion of the URL.
The reg-name syntax allows percent-encoded octets in order to represent non-ASCII registered names in a uniform way that is independent of the underlying name resolution technology. Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent- encoded to be represented as URI characters. URI producing applications must not use percent-encoding in host unless it is used to represent a UTF-8 character sequence. When a non-ASCII registered name represents an internationalized domain name intended for resolution via the DNS, the name must be transformed to the IDNA encoding [RFC3490] prior to name lookup. URI producers should provide these registered names in the IDNA encoding, rather than a percent-encoding, if they wish to maximize interoperability with legacy URI resolvers.
When rewrite_url()
is called with a host
parameter, it needs to
decide how to encode the string that it is given for inclusion into the URL.
In other words, it needs to decide whether the name represents an
internationalized domain name intended for resolution via the DNS or not.
There are two ways to control decisions like this. The recommended way is
to pass a parameter that explicitly states what you want - the
encode_with_dna
keyword to rewrite_url()
is one such case. A
configuration-based alternative is usually offered as well. The latter
should be used if you have a special case that is application specific.
For example, the ietfparse.algorithms.IDNA_SCHEMES
variable is a
collection that the library uses to know which schemes ALWAYS apply
IDNA rules to host names. You can modify this collection as needed to
meet your application requirements.