ietfparse¶
Wait… Why? What??¶
This is a gut reaction to the wealth of ways to parse URLs, MIME headers, HTTP messages, and other things described by IETF RFCs. They range from the Python standard library (urllib) to being buried in the guts of other kitchen-sink libraries (werkzeug), and most of them are broken in one way or another.
So why create another one? Good question… glad that you asked. This is a companion library to the great packages out there that are responsible for communicating with other systems. It concentrates on providing a crisp and usable set of APIs for parsing text. Nothing more. Hopefully, by focusing on the specific task of parsing, the result will be a beautiful and usable interface to the text strings that power the Internet.
Here’s a sample of the code that this library lets you write:
import flask

from ietfparse import algorithms, headers


def negotiate_versioned_representation(request, handler, data_dict):
    requested = headers.parse_accept(request.headers['Accept'])
    selected = algorithms.select_content_type(requested, [
        headers.parse_content_type('application/example+json; v=1'),
        headers.parse_content_type('application/example+json; v=2'),
        headers.parse_content_type('application/json'),
    ])
    output_version = selected.parameters.get('v', '2')
    if output_version == '1':
        handler.set_header('Content-Type', 'application/example+json; v=1')
        handler.write(generate_legacy_json(data_dict))
    else:
        handler.set_header('Content-Type', 'application/example+json; v=2')
        handler.write(generate_modern_json(data_dict))


def redirect_to_peer(host, port=80):
    flask.redirect(algorithms.rewrite_url(flask.request.url,
                                          host=host, port=port))
Ok… Where?¶
URL Processing¶
If your applications have reached the Glory of REST by using hypermedia controls throughout, then you aren’t manipulating URLs a lot unless you are responsible for generating them. However, if you are interacting with less mature web applications, you need to manipulate URLs and you are probably doing something like:
>>> url_pattern = 'http://example.com/api/movie/{movie_id}/actors'
>>> response = requests.get(url_pattern.format(movie_id=ident))
or even (the horror):
>>> url = 'http://{0}/{1}?{2}'.format(host, path, query)
>>> response = requests.get(url)
If you are a little more careful, you could be URL encoding the argument to prevent URL injection attacks. This isn’t a horrible pattern for generating URLs from a known pattern and data. But what about other types of manipulation? How do you take a URL and point it at a different host?
>>> # really brute force?
>>> url = url_pattern.format(movie_id=1234)
>>> url = url[:7] + 'host.example.com' + url[18:]
>>> # with str.split + str.join??
>>> parts = url.split('/')
>>> parts[2] = 'host.example.com'
>>> url = '/'.join(parts)
>>> # leverage the standard library???
>>> import urllib.parse
>>> parts = urllib.parse.urlsplit(url)
>>> url = urllib.parse.urlunsplit((parts.scheme, 'host.example.com',
... parts.path, parts.query, parts.fragment))
Let’s face it, manipulating URLs in Python is less than ideal. What about something like the following instead?
>>> from ietfparse import algorithms
>>> url = algorithms.encode_url_template(url_pattern, movie_id=1234)
>>> url = algorithms.rewrite_url(url, host='host.example.com')
And, yes, encode_url_template() is doing a bit more than calling str.format(). It implements the full gamut of RFC 6570 URL Templates, which happens to handle our case quite well. rewrite_url() is closer to the urlsplit() and urlunsplit() case, with a nicer interface and a bit of additional functionality as well. For example, if you are a little more forward looking, then you have probably heard of Internationalized Domain Names (RFC 5890). The rewrite_url() function will correctly encode names using the codecs.idna codec. It also implements the same query-encoding tricks that urlencode() does.
>>> from ietfparse import algorithms
>>> algorithms.rewrite_url('http://example.com', query={'b': 12, 'a': 'c'})
'http://example.com?a=c&b=12'
>>> algorithms.rewrite_url('http://example.com', query=[('b', 12), ('a', 'c')])
'http://example.com?b=12&a=c'
There is a lot going on in those two examples. See the documentation for rewrite_url() for all of the details.
Relevant Specifications¶
Known and Accepted Variances¶
Some of the IETF specifications require deep understanding of the underlying URL scheme. These portions are not implemented since they would unnecessarily couple this library to an open-ended set of protocol specifications. This section attempts to cover all such variances.
The host portion of a URL is not strictly required to be a valid DNS name for schemes that are restricted to using DNS names. For example, http://-/ is a questionably valid URL. RFC 1035#section-3.5 prohibits domain names from beginning with a hyphen, and RFC 7230#section-2.7.1 strongly implies (requires?) that the host be an IP literal or valid DNS name. However, file:///- is perfectly acceptable, so the requirement specific to HTTP is left unenforced.
Similarly, the port portion of a network location is usually a network port, which is limited to 16 bits by both RFC 793 and RFC 768. It is strictly required to be a TCP port in the case of HTTP (RFC 7230). This library only limits the port to a non-negative integer. The other SHOULD that is not implemented is the suggestion that default port numbers are omitted - see RFC 3986#section-3.2.3.
Influencing URL Processing¶
URLs are finicky things with a wealth of specifications that sometimes seem to contradict each other. Whenever a grey area was encountered, this library tried to make the result controllable from the outside. For example, section 3.2.2 of RFC 3986#section-3.2.2 contains the following paragraph when describing the host portion of the URL.
The reg-name syntax allows percent-encoded octets in order to represent non-ASCII registered names in a uniform way that is independent of the underlying name resolution technology. Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters. URI producing applications must not use percent-encoding in host unless it is used to represent a UTF-8 character sequence. When a non-ASCII registered name represents an internationalized domain name intended for resolution via the DNS, the name must be transformed to the IDNA encoding [RFC3490] prior to name lookup. URI producers should provide these registered names in the IDNA encoding, rather than a percent-encoding, if they wish to maximize interoperability with legacy URI resolvers.
When rewrite_url() is called with a host parameter, it needs to decide how to encode the string that it is given for inclusion into the URL. In other words, it needs to decide whether the name represents an internationalized domain name intended for resolution via the DNS or not. There are two ways to control decisions like this. The recommended way is to pass a parameter that explicitly states what you want - the encode_with_idna keyword to rewrite_url() is one such case. A configuration-based alternative is usually offered as well. The latter should be used if you have a special case that is application specific. For example, the ietfparse.algorithms.IDNA_SCHEMES variable is a collection that the library uses to know which schemes ALWAYS apply IDNA rules to host names. You can modify this collection as needed to meet your application requirements.
Header Parsing¶
Parsing IETF headers is a difficult science at best. They come in a wide variety of syntaxes, each with its own peculiarities. The functions in this module expect that the incoming header data is formatted appropriately. If it is not, then a data-related exception will be raised. Any of the following exceptions can be raised from any of the header parsing functions: AttributeError, IndexError, TypeError, and ValueError.
This approach is an intentional design decision on the part of the author. Instead of inventing another list of garbage-in -> garbage-out exception types, I chose to simply let the underlying exception propagate. This means that you should always guard against at least this set of exceptions.
Accept¶
parse_accept() parses the HTTP Accept header into a sorted list of ietfparse.datastructures.ContentType instances. The list is sorted according to the specified quality values. Elements with the same quality value are ordered with the most specific value first. The following is a good example of this from RFC 7231#section-5.3.2.
>>> from ietfparse import headers
>>> requested = headers.parse_accept(
... 'text/*, text/plain, text/plain;format=flowed, */*')
>>> [str(h) for h in requested]
['text/plain; format=flowed', 'text/plain', 'text/*', '*/*']
All of the requested types have the same quality (implicitly 1.0), so they are sorted purely by specificity. Though the result is sorted according to quality and specificity, selecting a matching content type is not as easy as traversing the list in order. The full algorithm for selecting the most appropriate content type is described in RFC 7231 and is fully implemented by select_content_type().
Accept-Charset¶
parse_accept_charset() parses the HTTP Accept-Charset header into a sorted sequence of character set identifiers. Character set identifiers are simple tokens with an optional quality value that is the strength of the preference, from most preferred (1.0) to rejection (0.0). After the header is parsed and sorted, the quality values are removed and the token list is returned.
>>> from ietfparse import headers
>>> charsets = headers.parse_accept_charset('latin1;q=0.5, utf-8;q=1.0, '
...                                         'us-ascii;q=0.1, ebcdic;q=0.0')
>>> charsets
['utf-8', 'latin1', 'us-ascii', 'ebcdic']
The wildcard character set, if present, will be sorted towards the end of the list. If both a wildcard and rejected values are present, then the wildcard will occur before the rejected values.
>>> from ietfparse import headers
>>> headers.parse_accept_charset('acceptable, rejected;q=0, *')
['acceptable', '*', 'rejected']
Note
The only attribute that is allowed to be specified per the RFC is the quality value. If additional parameters are included, they are not included in the response from this function. More specifically, the returned list contains only the character set strings.
Accept-Encoding¶
parse_accept_encoding() parses the HTTP Accept-Encoding header into a sorted sequence of encodings. Encodings are simple tokens with an optional quality value that is the strength of the preference, from most preferred (1.0) to rejection (0.0). After the header is parsed and sorted, the quality values are removed and the token list is returned.
>>> from ietfparse import headers
>>> headers.parse_accept_encoding('snappy, compress;q=0.7, gzip;q=0.8')
['snappy', 'gzip', 'compress']
The wildcard encoding, if present, will be sorted towards the end of the list. If both a wildcard and rejected values are present, then the wildcard will occur before the rejected values.
>>> from ietfparse import headers
>>> headers.parse_accept_encoding('compress, snappy;q=0, *')
['compress', '*', 'snappy']
Note
The only attribute that is allowed to be specified per the RFC is the quality value. If additional parameters are included, they are not included in the response from this function. More specifically, the returned list contains only the encoding strings.
Accept-Language¶
parse_accept_language() parses the HTTP Accept-Language header into a sorted sequence of languages. Languages are simple tokens with an optional quality value that is the strength of the preference, from most preferred (1.0) to rejection (0.0). After the header is parsed and sorted, the quality values are removed and the token list is returned.
>>> from ietfparse import headers
>>> headers.parse_accept_language('de, en;q=0.7, en-gb;q=0.8')
['de', 'en-gb', 'en']
The wildcard language, if present, will be sorted towards the end of the list. If both a wildcard and rejected values are present, then the wildcard will occur before the rejected values.
>>> from ietfparse import headers
>>> headers.parse_accept_language('es-es, en;q=0, *')
['es-es', '*', 'en']
Note
The only attribute that is allowed to be specified per the RFC is the quality value. If additional parameters are included, they are not included in the response from this function. More specifically, the returned list contains only the language strings.
Cache-Control¶
parse_cache_control() parses the HTTP Cache-Control header as described in RFC 7234 into a dictionary of directives. Directives without a value, such as public or no-cache, will be returned in the dictionary with a value of True if set.
>>> from ietfparse import headers
>>> headers.parse_cache_control('public, max-age=2592000')
{'public': True, 'max-age': 2592000}
Content-Type¶
parse_content_type() parses a MIME or HTTP Content-Type header into an object that exposes the structured data.
>>> from ietfparse import headers
>>> header = headers.parse_content_type('text/html; charset=ISO-8859-4')
>>> header.content_type, header.content_subtype
('text', 'html')
>>> header.parameters['charset']
'ISO-8859-4'
It handles dequoting and normalizing the value. The content type and all parameter names are translated to lower-case during the parsing process. The relatively unknown option to include comments in the content type is honored and comments are discarded.
>>> header = headers.parse_content_type(
... 'message/http; version=2.0 (someday); MSGTYPE="request"')
>>> header.parameters['version']
'2.0'
>>> header.parameters['msgtype']
'request'
Notice that the (someday) comment embedded in the version parameter was discarded and the msgtype parameter name was normalized as well.
Link¶
parse_link() parses an HTTP Link header as described in RFC 5988 into a sequence of ietfparse.datastructures.LinkHeader instances.
>>> from ietfparse import headers
>>> parsed = headers.parse_link(
... '<http://example.com/TheBook/chapter2>; rel="previous"; '
... 'title="previous chapter"')
>>> parsed[0].target
'http://example.com/TheBook/chapter2'
>>> parsed[0].parameters
[('rel', 'previous'), ('title', 'previous chapter')]
>>> str(parsed[0])
'<http://example.com/TheBook/chapter2>; rel="previous"; title="previous chapter"'
Notice that the parameter values are returned as a list of name and value tuples. This is by design and required by the RFC to support the hreflang parameter as specified:
The “hreflang” parameter, when present, is a hint indicating what the language of the result of dereferencing the link should be. Note that this is only a hint; for example, it does not override the Content-Language header of a HTTP response obtained by actually following the link. Multiple “hreflang” parameters on a single link-value indicate that multiple languages are available from the indicated resource.
Also note that you can cast an ietfparse.datastructures.LinkHeader instance to a string to get a correctly formatted representation of it.
Request Processing¶
Header parsing is only part of what you need to write modern web applications. You need to implement responsive behaviors that factor in the state of the server, the resource in question, and information from the requesting client.
Content Negotiation¶
RFC 7231#section-3.4 describes how Content Negotiation can be implemented. select_content_type() implements the type-selection portion of Proactive Negotiation. It takes a list of requested content types (e.g., from parse_accept()) along with a list of content types that the server is capable of producing, and returns the content type that is the best match. The algorithm is loosely described in RFC 7231#section-5.3.
>>> from ietfparse import algorithms, headers
>>> requested = headers.parse_accept(
...     'text/*;q=0.3, text/html;q=0.7, text/html;level=1, '
...     'text/html;level=2;q=0.4, */*;q=0.5')
>>> selected = algorithms.select_content_type(requested, [
...     headers.parse_content_type('text/html'),
...     headers.parse_content_type('text/html;level=4'),
...     headers.parse_content_type('text/html;level=3'),
... ])
>>> str(selected)
'text/html'
A more interesting case is to select the representation to produce based on what a server knows how to produce and what a client has requested.
>>> from ietfparse import algorithms, headers
>>> requested = headers.parse_accept(
... 'application/vnd.example.com+json;version=2, '
... 'application/vnd.example.com+json;q=0.75, '
... 'application/json;q=0.5, text/javascript;q=0.25'
... )
>>> selected = algorithms.select_content_type(requested, [
... headers.parse_content_type('application/vnd.example.com+json;version=3'),
... headers.parse_content_type('application/vnd.example.com+json;version=2'),
... ])
>>> str(selected)
'application/vnd.example.com+json; version=2'
The select_content_type() function is an implementation of Proactive Content Negotiation as described in RFC 7231#section-3.4.1.
API Reference¶
ietfparse.algorithms¶
Implementations of algorithms from various specifications.
- rewrite_url(): modify a portion of a URL
- select_content_type(): select the best match between an HTTP Accept header and a list of available Content-Types
This module implements some of the more interesting algorithms described in IETF RFCs.
- ietfparse.algorithms.IDNA_SCHEMES¶
A collection of schemes that use IDN encoding for their hosts.
- ietfparse.algorithms.rewrite_url(input_url, **kwargs)¶
Create a new URL from input_url with modifications applied.
Parameters:
- input_url (str) – the URL to modify
- fragment (str) – if specified, this keyword sets the fragment portion of the URL. A value of None will remove the fragment portion of the URL.
- host (str) – if specified, this keyword sets the host portion of the network location. A value of None will remove the network location portion of the URL.
- password (str) – if specified, this keyword sets the password portion of the URL. A value of None will remove the password from the URL.
- path (str) – if specified, this keyword sets the path portion of the URL. A value of None will remove the path from the URL.
- port (int) – if specified, this keyword sets the port portion of the network location. A value of None will remove the port from the URL.
- query – if specified, this keyword sets the query portion of the URL. See the comments for a description of this parameter.
- scheme (str) – if specified, this keyword sets the scheme portion of the URL. A value of None will remove the scheme. Note that this will make the URL relative and may have unintended consequences.
- user (str) – if specified, this keyword sets the user portion of the URL. A value of None will remove the user and password portions.
- enable_long_host (bool) – if this keyword is specified and it is True, then the host name length restriction from RFC 3986#section-3.2.2 is relaxed.
- encode_with_idna (bool) – if this keyword is specified and it is True, then the host parameter will be encoded using IDN. If this value is provided as False, then the percent-encoding scheme is used instead. If this parameter is omitted or included with a different value, then the host parameter is processed using IDNA_SCHEMES.
Returns: the modified URL
Raises: ValueError – when a keyword parameter is given an invalid value
Returns: the modified URL
Raises: ValueError – when a keyword parameter is given an invalid value
If the host parameter is specified and not None, then it will be processed as an Internationalized Domain Name (IDN) if the scheme appears in IDNA_SCHEMES. Otherwise, it will be encoded as UTF-8 and percent-encoded. The handling of the query parameter requires some additional explanation. You can specify a query value in three different ways: as a mapping, as a sequence of pairs, or as a string. This flexibility makes it possible to meet a wide range of finicky use cases.
If the query parameter is a mapping, then the key and value pairs are sorted by key before they are encoded. Use this method whenever possible.
If the query parameter is a sequence of pairs, then each pair is encoded in the given order. Use this method if you require that parameter order is controlled.
If the query parameter is a string, then it is used as-is. This form SHOULD BE AVOIDED because no URL escaping is performed, so it can easily result in broken URLs. It is the obvious pass-through case that is almost always present.
- ietfparse.algorithms.select_content_type(requested, available)¶
Select the best content type.
Parameters:
- requested – a sequence of ContentType instances
- available – a sequence of ContentType instances that the server is capable of producing
Returns: the selected content type (from available) and the pattern that it matched (from requested)
Return type: tuple of ContentType instances
Raises: NoMatch – when a suitable match was not found
This function implements the Proactive Content Negotiation algorithm as described in sections 3.4.1 and 5.3 of RFC 7231. The input is the Accept header as parsed by parse_http_accept_header() and a list of parsed ContentType instances. The available sequence should be a sequence of content types that the server is capable of producing. The selected value should ultimately be used as the Content-Type header in the generated response.
ietfparse.datastructures¶
Important data structures.
- ContentType: MIME Content-Type header
This module contains data structures that were useful in implementing this library. If a data structure might be useful outside of a particular piece of functionality, it is fully fleshed out and ends up here.
- class ietfparse.datastructures.ContentType(content_type, content_subtype, parameters=None)¶
A MIME Content-Type header.
Internet content types are described by the Content-Type header from RFC 2045. It was reused across many other protocol specifications, most notably HTTP (RFC 7231). This header’s syntax is described in RFC 2045#section-5.1. In its most basic form, a content type header looks like text/html. The primary content type is text with a subtype of html. Content type headers can include parameters as name=value pairs separated by semicolons.
- class ietfparse.datastructures.LinkHeader(target, parameters=None)¶
Represents a single link within a Link header.
- target¶
The target URL of the link. This may be a relative URL, so the caller may have to make the link absolute by resolving it against a base URL as described in RFC 3986#section-5.
- parameters¶
Possibly empty sequence of name and value pairs. Parameters are represented as a sequence since a single parameter may occur more than once.
The Link header is specified by RFC 5988. It is one of the methods used to represent HyperMedia links between HTTP resources.
ietfparse.errors¶
Exceptions raised from within ietfparse.
All exceptions are rooted at RootException, so you can catch it to implement error handling behavior associated with this library’s functionality.
- exception ietfparse.errors.MalformedLinkValue¶
Value specified is not a valid link header.
- exception ietfparse.errors.NoMatch¶
No match was found when selecting a content type.
- exception ietfparse.errors.RootException¶
Root of the ietfparse exception hierarchy.
ietfparse.headers¶
Functions for parsing headers.
- parse_accept_charset(): parse an Accept-Charset value
- parse_cache_control(): parse a Cache-Control value
- parse_content_type(): parse a Content-Type value
- parse_accept(): parse an Accept value
- parse_link(): parse an RFC 5988 Link value
- parse_list(): parse a comma-separated list that is present in so many headers
This module also defines classes that might be of some use outside of the module. They are not designed for direct usage unless otherwise mentioned.
- ietfparse.headers.parse_accept(header_value)¶
Parse an HTTP accept-like header.
Parameters: header_value (str) – the header value to parse
Returns: a list of ContentType instances in decreasing quality order. Each instance is augmented with the associated quality as a float property named quality.
Accept is a class of headers that contain a list of values and an associated preference value. The ever-present Accept header is a perfect example. It is a list of content types and an optional parameter named q that indicates the relative weight of a particular type. The most basic example is:
Accept: audio/*;q=0.2, audio/basic
which states that I prefer the audio/basic content type but will accept other audio sub-types with an 80% mark down.
- ietfparse.headers.parse_accept_charset(header_value)¶
Parse the Accept-Charset header into a sorted list.
Parameters: header_value (str) – header value to parse
Returns: list of character sets sorted from highest to lowest priority
The Accept-Charset header is a list of character set names with optional quality values. The quality value indicates the strength of the preference, where 1.0 is a strong preference and less than 0.001 is outright rejection by the client.
Note
Character sets with a quality value of less than 0.001 are rejected. If a wildcard is included in the header, then it will appear BEFORE values that are rejected.
- ietfparse.headers.parse_accept_encoding(header_value)¶
Parse the Accept-Encoding header into a sorted list.
Parameters: header_value (str) – header value to parse
Returns: list of encodings sorted from highest to lowest priority
The Accept-Encoding header is a list of encodings with optional quality values. The quality value indicates the strength of the preference, where 1.0 is a strong preference and less than 0.001 is outright rejection by the client.
Note
Encodings with a quality value of less than 0.001 are rejected. If a wildcard is included in the header, then it will appear BEFORE values that are rejected.
- ietfparse.headers.parse_accept_language(header_value)¶
Parse the Accept-Language header into a sorted list.
Parameters: header_value (str) – header value to parse
Returns: list of languages sorted from highest to lowest priority
The Accept-Language header is a list of languages with optional quality values. The quality value indicates the strength of the preference, where 1.0 is a strong preference and less than 0.001 is outright rejection by the client.
Note
Languages with a quality value of less than 0.001 are rejected. If a wildcard is included in the header, then it will appear BEFORE values that are rejected.
- ietfparse.headers.parse_cache_control(header_value)¶
Parse a Cache-Control header, returning a dictionary of key-value pairs.
Any of the Cache-Control directives that do not have a value, such as public or no-cache, will be returned with a value of True if they are set in the header.
Parameters: header_value (str) – Cache-Control header value to parse
Returns: the parsed Cache-Control header values
Return type: dict
- ietfparse.headers.parse_content_type(content_type, normalize_parameter_values=True)¶
Parse a content type-like header.
Parameters: content_type (str) – the content type value to parse; normalize_parameter_values (bool) – if True, parameter values are normalized to lower case
Returns: a ContentType instance
- ietfparse.headers.parse_http_accept_header(header_value)¶
Parse an HTTP accept-like header.
Parameters: header_value (str) – the header value to parse
Returns: a list of ContentType instances in decreasing quality order. Each instance is augmented with the associated quality as a float property named quality.
Accept is a class of headers that contain a list of values and an associated preference value. The ever-present Accept header is a perfect example. It is a list of content types and an optional parameter named q that indicates the relative weight of a particular type. The most basic example is:
Accept: audio/*;q=0.2, audio/basic
which states that I prefer the audio/basic content type but will accept other audio sub-types with an 80% mark down.
Deprecated since version 1.3.0: Use parse_accept() instead.
- ietfparse.headers.parse_link(header_value, strict=True)¶
Parse an HTTP Link header.
Parameters: header_value (str) – the header value to parse
Returns: a sequence of LinkHeader instances
Raises: ietfparse.errors.MalformedLinkValue – if the specified header_value cannot be parsed
- ietfparse.headers.parse_link_header(header_value, strict=True)¶
Parse an HTTP Link header.
Parameters: header_value (str) – the header value to parse
Returns: a sequence of LinkHeader instances
Raises: ietfparse.errors.MalformedLinkValue – if the specified header_value cannot be parsed
Deprecated since version 1.3.0: Use parse_link() instead.
- ietfparse.headers.parse_list(value)¶
Parse a comma-separated list header.
Parameters: value (str) – header value to split into elements
Returns: list of header elements as strings
- ietfparse.headers.parse_list_header(value)¶
Parse a comma-separated list header.
Parameters: value (str) – header value to split into elements
Returns: list of header elements as strings
Deprecated since version 1.3.0: Use parse_list() instead.
Relevant RFCs¶
RFC-2045¶
ietfparse.datastructures.ContentType is an abstraction of the Content-Type header described in RFC 2045 and fully specified in section 5.1.
RFC-3986¶
ietfparse.algorithms.rewrite_url() implements encoding and parsing per RFC 3986.
RFC-5890¶
ietfparse.algorithms.rewrite_url() encodes hostnames according to RFC 5890 for the schemes identified by IDNA_SCHEMES. Encoding can also be forced using the encode_with_idna keyword parameter.
RFC-5988¶
- ietfparse.headers.parse_link_header() parses a Link HTTP header.
- ietfparse.datastructures.LinkHeader represents a Link HTTP header.
RFC-7231¶
- ietfparse.algorithms.select_content_type() implements proactive content negotiation as described in sections 3.4.1 and 5.3 of RFC 7231.
- ietfparse.headers.parse_accept_charset() parses an Accept-Charset value as described in section 5.3.3.
- ietfparse.headers.parse_http_accept_header() parses an Accept value as described in section 5.3.2.
- ietfparse.headers.parse_list_header() parses just about any of the comma-separated lists from RFC 7231. It doesn't provide any logic other than parsing the header, though.
- ietfparse.headers.parse_parameter_list() parses the key=value portions common to many header values.
Contributing to ietfparse¶
Do you want to contribute extensions, fixes, or improvements?
Awesome! And thank you very much.
This is a nice little open source project that is released under the permissive BSD license, so you don't have to push your changes back if you do not want to. But if you do, your changes will be more than welcome.
Set up a development environment¶
The first thing that you need to do is set up a development environment so that you can run the test suite. The easiest way to do that is to create a virtual environment for your endeavours:
$ pyvenv env
If you are developing against something earlier than Python 3.4, then I highly recommend using virtualenv to create the environment, since earlier versions of pyvenv were slightly broken. The next step is to install the development tools that you will need.
dev-requirements.txt is a pip-formatted requirements file that will install everything that you need:
$ env/bin/pip install -qr dev-requirements.txt
$ env/bin/pip freeze
Fluent-Test==3.0.0
Jinja2==2.7.3
MarkupSafe==0.23
Pygments==1.6
Sphinx==1.2.3
coverage==3.7.1
docutils==0.12
flake8==2.2.3
mccabe==0.2.1
mock==1.0.1
nose==1.3.4
pep8==1.5.7
pyflakes==0.8.1
sphinx-rtd-theme==0.1.6
As usual, setup.py is the swiss-army knife in the development tool chest. The following commands are the ones that you will be using most often:
- ./setup.py nosetests
- Run the test suite using nose and generate a coverage report.
- ./setup.py build_sphinx
- Generate the documentation suite into build/sphinx/html
- ./setup.py flake8
- Run flake8 over the code and report any style violations.
- ./setup.py clean
- Remove generated files. By default, this will remove any top-level egg-related files and the build directory.
Running tests¶
The easiest way to run the test suite is with setup.py nosetests. It will run the test suite with the currently installed python version and report the result of the test run as well as the coverage:
$ env/bin/python setup.py nosetests
running nosetests
running egg_info
writing dependency_links to ietfparse.egg-info/dependency_links.txt
writing top-level names to ietfparse.egg-info/top_level.txt
writing ietfparse.egg-info/PKG-INFO
reading manifest file 'ietfparse.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '__pycache__'...
warning: no previously-included files matching '*.swp' found ...
writing manifest file 'ietfparse.egg-info/SOURCES.txt'
test_that_differing_parameters_is_acceptable_as_weak_match ...
...
Name Stmts Miss Branch BrMiss Cover Missing
----------------------------------------------------------------------
ietfparse 0 0 0 0 100%
ietfparse.algorithms 36 1 24 1 97% 98
ietfparse.datastructures 26 0 21 0 100%
ietfparse.errors 4 0 0 0 100%
ietfparse.headers 29 1 14 1 95% 82
----------------------------------------------------------------------
TOTAL 95 2 59 2 97%
----------------------------------------------------------------------
Ran 44 tests in 0.054s
OK
Before you can call the code complete, you really need to make sure that it works across the supported python versions. Travis-CI will take care of making sure that this is the case when the code is pushed to github but you should do this before you push. The easiest way to do this is to install detox and run it:
$ env/bin/pip install -q detox
$ env/bin/detox
py27 recreate: /.../ietfparse/build/tox/py27
GLOB sdist-make: /.../ietfparse/setup.py
py33 recreate: /.../ietfparse/build/tox/py33
py34 recreate: /.../ietfparse/build/tox/py34
py27 installdeps: -rtest-requirements.txt, mock
py33 installdeps: -rtest-requirements.txt
py34 installdeps: -rtest-requirements.txt
py27 inst: /.../ietfparse/build/tox/dist/ietfparse-0.0.0.zip
py27 runtests: PYTHONHASHSEED='2156646470'
py27 runtests: commands[0] | /../ietfparse/build/tox/py27/bin/nosetests
py33 inst: /../ietfparse/.build/tox/dist/ietfparse-0.0.0.zip
py34 inst: /../ietfparse/.build/tox/dist/ietfparse-0.0.0.zip
py33 runtests: PYTHONHASHSEED='2156646470'
py33 runtests: commands[0] | /.../ietfparse/build/tox/py33/bin/nosetests
py34 runtests: PYTHONHASHSEED='2156646470'
py34 runtests: commands[0] | /.../ietfparse/build/tox/py34/bin/nosetests
_________________________________ summary _________________________________
py27: commands succeeded
py33: commands succeeded
py34: commands succeeded
congratulations :)
This is what you want to see. Tests passing across the board. Time to submit a PR.
Submitting a Pull Request¶
The first thing to do is to fork the repository and set up a nice shiny environment in it. Once you can run the tests, it's time to write some. I developed this library using a test-first methodology. If you are fixing a defect, then write a test that verifies the correct behavior; it should fail. Now fix the defect, making the test pass in the process. New functionality follows a similar path: write a test that verifies the correct behavior of the new functionality, then add just enough functionality to make the test pass, and move on to the next test. This is test-driven development at its core. This is actually pretty important, since pull requests that are not tested will not be merged. This is why nose is configured to report coverage. The coverage doesn't have to be 100% but it should be pretty close. Anything that isn't covered is usually scrutinized.
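As a sketch of that red/green cycle, a defect fix might start with a failing test written against the standard unittest module. The function name `parse_list_sketch` and the test class are made up for illustration; they are not part of ietfparse:

```python
import unittest


def parse_list_sketch(value):
    """Stand-in for the function under test: split a comma-separated header."""
    return [item.strip() for item in value.split(',') if item.strip()]


class WhenParsingSimpleList(unittest.TestCase):
    """Written first; it fails until parse_list_sketch handles each case."""

    def test_that_items_are_split_on_commas(self):
        self.assertEqual(parse_list_sketch('gzip, deflate'),
                         ['gzip', 'deflate'])

    def test_that_surrounding_whitespace_is_removed(self):
        self.assertEqual(parse_list_sketch('  br ,gzip '), ['br', 'gzip'])
```

Run it with `python -m unittest` (or through `setup.py nosetests` here), watch it fail, then make it pass.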
Once you have a few tests written and some functionality working, you should probably commit your work. If you are not comfortable with rebasing in git or cleaning up a commit history, your best bet is to create small commits: commit early, commit often. The smaller each commit is, the easier it will be to squash and rearrange them.
When your change is written and tested, make sure to update and/or add documentation as needed. The documentation suite is written using reStructuredText and the excellent Sphinx utility. If you don't think that documentation matters, read Kenneth Reitz's Documentation is King presentation. Pull requests that are not simply bug fixes will almost always require some documentation.
After the tests are written, code is complete, and documents are up to date, it is time to push your code back to github.com and submit a pull request against the upstream repository.
Changelog¶
1.4.3 (30-Oct-2017)¶
- Change parsing of qualified lists to retain the initial ordering whenever possible. The algorithm prefers explicit highest quality (1.0) preferences over inferred highest quality preferences. It also retains the initial ordering in the presence of multiple highest quality matches. This affects headers.parse_accept_charset(), headers.parse_accept_encoding(), and headers.parse_accept_language().
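The ordering rule above can be sketched in plain Python. This is only an illustration of the stable-sort behavior described, not ietfparse's actual implementation, and `sketch_parse_qualified_list` is a made-up name:

```python
def sketch_parse_qualified_list(value):
    """Order header values by quality, keeping the original order on ties.

    An explicit q=1.0 ranks ahead of an implied (missing) q=1.0.
    """
    entries = []
    for position, raw in enumerate(value.split(',')):
        parts = [part.strip() for part in raw.split(';')]
        name, explicit, quality = parts[0], False, 1.0
        for param in parts[1:]:
            if param.startswith('q='):
                quality, explicit = float(param[2:]), True
        entries.append((name, quality, explicit, position))
    # sort() is stable, so equally weighted entries keep their relative order;
    # among q=1.0 entries, explicit ones sort before implied ones
    entries.sort(key=lambda e: (-e[1], not (e[1] == 1.0 and e[2]), e[3]))
    return [name for name, _, _, _ in entries]
```

For example, `sketch_parse_qualified_list('de, en;q=1.0, fr;q=0.5')` ranks the explicit `en;q=1.0` ahead of the implied-quality `de`.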
1.4.0 (18-Oct-2016)¶
- Fixed parsing of lists like max-age=5, x-foo="prune". The previous versions incorrectly produced ['max-age=5', 'x-foo="prune'].
- Added headers.parse_accept_encoding() which parses HTTP Accept-Encoding header values into a list.
- Added headers.parse_accept_language() which parses HTTP Accept-Language header values into a list.
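The list-parsing defect fixed in 1.4.0 came from splitting on commas without tracking quoted strings. A minimal quote-aware splitter looks roughly like this (an illustration of the technique, not the library's code):

```python
def split_quoted_list(value):
    """Split a header value on commas, ignoring commas inside double quotes."""
    items, current, in_quotes = [], [], False
    for ch in value:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == ',' and not in_quotes:
            items.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    if current:
        items.append(''.join(current).strip())
    return items
```

With this approach, `split_quoted_list('max-age=5, x-foo="prune"')` yields the two complete items rather than truncating the quoted part.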
1.3.0 (11-Aug-2016)¶
- Added headers.parse_cache_control() which parses HTTP Cache-Control header values into a dictionary.
- Renamed headers.parse_http_accept_header() to headers.parse_accept(), adding a wrapper function that raises a deprecation warning when invoking headers.parse_http_accept_header().
- Renamed headers.parse_link_header() to headers.parse_link(), adding a wrapper function that raises a deprecation warning when invoking headers.parse_link_header().
- Renamed headers.parse_list_header() to headers.parse_list(), adding a wrapper function that raises a deprecation warning when invoking headers.parse_list_header().
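The rename-with-deprecated-wrapper pattern used above can be sketched with the standard warnings module. This is a generic illustration of the technique with simplified bodies, not ietfparse's exact code:

```python
import warnings


def parse_list(value):
    """The new, preferred entry point."""
    return [item.strip() for item in value.split(',')]


def parse_list_header(value):
    """Deprecated alias kept for backwards compatibility."""
    warnings.warn('parse_list_header is deprecated, use parse_list instead',
                  DeprecationWarning, stacklevel=2)
    return parse_list(value)
```

Callers of the old name keep working but see a DeprecationWarning pointing at their own call site (thanks to `stacklevel=2`).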
1.2.2 (27-May-2015)¶
- Added headers.parse_list_header() which parses generic comma-separated list headers with support for quoted parts.
- Added headers.parse_accept_charset() which parses an HTTP Accept-Charset header into a sorted list.
1.2.1 (25-May-2015)¶
- algorithms.select_content_type() claims to work with datastructures.ContentType values but it was requiring the augmented ones returned from headers.parse_http_accept_header(). In other words, the algorithm required that the quality attribute exist. RFC 7231#section-5.3.1 states that missing quality values are treated as 1.0.
1.2.0 (19-Apr-2015)¶
- Added support for RFC 5988 Link headers. This consists of headers.parse_link_header() and datastructures.LinkHeader.
1.1.1 (10-Feb-2015)¶
- Removed the setupext module since it was causing problems with source distributions.
1.1.0 (26-Oct-2014)¶
- Added algorithms.rewrite_url().
1.0.0 (21-Sep-2014)¶
- Initial implementation containing the following functionality:
- algorithms.select_content_type()
- datastructures.ContentType
- errors.NoMatch
- errors.RootException
- headers.parse_content_type()
- headers.parse_http_accept_header()