Author: niksite

Description: URL normalization for Python
Language: Python
Repository: git://github.com/niksite/url-normalize.git
Created: 2013-02-02T07:08:17Z
Project page: https://github.com/niksite/url-normalize

License: MIT License

url-normalize

A Python library for standardizing and normalizing URLs with support for internationalized domain names (IDN).

Introduction

url-normalize provides a robust URI normalization function that:

  • Takes care of IDN domains.
  • Always provides the URI scheme in lowercase characters.
  • Always provides the host, if any, in lowercase characters.
  • Only performs percent-encoding where it is essential.
  • Always uses uppercase A-through-F characters when percent-encoding.
  • Prevents dot-segments appearing in non-relative URI paths.
  • For schemes that define a default authority, uses an empty authority if the
    default is desired.
  • For schemes that define an empty path to be equivalent to a path of “/”,
    uses “/”.
  • For schemes that define a port, uses an empty port if the default is desired.
  • Ensures all portions of the URI are UTF-8 encoded NFC from Unicode strings.
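
To make a few of the rules above concrete, here is a minimal standard-library sketch. It is not the url-normalize implementation: the names `DEFAULT_PORTS`, `remove_dot_segments`, and `sketch_normalize` are hypothetical, the port table is an illustrative subset, and user info, query strings, and NFC handling are deliberately left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Default ports for a few schemes (illustrative subset, an assumption of this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in an absolute path."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            # Never pop the leading empty segment that represents the root.
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." names a directory, so keep the final slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def sketch_normalize(url: str) -> str:
    """Hypothetical helper applying a few of the listed rules; not url_normalize itself."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # hostname is returned lowercased; the "idna" codec converts IDN labels to ASCII.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # An empty path becomes "/" and dot-segments are resolved.
    path = remove_dot_segments(parts.path) or "/"
    # Existing percent-escapes are rewritten with uppercase hex digits.
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(sketch_normalize("HTTP://WWW.Example.COM:80/a/./b/../c%2fd"))
# http://www.example.com/a/c%2Fd
print(sketch_normalize("https://münchen.de"))
# https://xn--mnchen-3ya.de/
```

Note that port 80 is dropped only because it is the default for http; the same port on an https URL would be kept.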

Inspired by Sam Ruby’s urlnorm.py.

Features

  • IDN Support: Full internationalized domain name handling
  • Configurable Defaults:
    • Customizable default scheme (https by default)
    • Configurable default domain for absolute paths
  • Query Parameter Control:
    • Parameter filtering with allowlists
    • Support for domain-specific parameter rules
  • Versatile URL Handling:
    • Empty string URLs
    • Double slash URLs (//domain.tld)
    • Shebang (#!) URLs
  • Developer Friendly:
    • Cross-version Python compatibility (3.8+)
    • 100% test coverage
    • Modern type hints and string handling
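
The allowlist-based query parameter filtering listed above can be pictured with a short standard-library sketch. This is an assumption about the general technique, not the library's code: `filter_query` is a hypothetical name, the host match is an exact string comparison, and any built-in per-domain rules the library may ship are not modelled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_query(url, allowlist):
    """Keep only allowlisted query parameters (hypothetical helper).

    `allowlist` is either a list of parameter names applied to every host,
    or a dict mapping a host name to the parameter names allowed for it.
    """
    parts = urlsplit(url)
    if isinstance(allowlist, dict):
        # Domain-specific rules: look up the names allowed for this host.
        allowed = set(allowlist.get(parts.hostname or "", []))
    else:
        allowed = set(allowlist)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in allowed
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(filter_query("https://example.com/?page=1&id=123&ref=test", ["page", "id"]))
# https://example.com/?page=1&id=123
```

Passing a dict such as `{"example.com": ["page"]}` restricts the rule to that host; for any other host this sketch removes every parameter.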

Installation

    pip install url-normalize

Usage

Python API

    from url_normalize import url_normalize

    # Basic normalization (uses https by default; port 80 is kept because
    # it is not the default port for https)
    print(url_normalize("www.foo.com:80/foo"))
    # Output: https://www.foo.com:80/foo

    # With custom default scheme
    print(url_normalize("www.foo.com/foo", default_scheme="http"))
    # Output: http://www.foo.com/foo

    # With query parameter filtering enabled
    print(url_normalize("www.google.com/search?q=test&utm_source=test", filter_params=True))
    # Output: https://www.google.com/search?q=test

    # With custom parameter allowlist as a dict
    print(url_normalize(
        "example.com?page=1&id=123&ref=test",
        filter_params=True,
        param_allowlist={"example.com": ["page", "id"]}
    ))
    # Output: https://example.com/?page=1&id=123

    # With custom parameter allowlist as a list
    print(url_normalize(
        "example.com?page=1&id=123&ref=test",
        filter_params=True,
        param_allowlist=["page", "id"]
    ))
    # Output: https://example.com/?page=1&id=123

    # With default domain for absolute paths
    print(url_normalize("/images/logo.png", default_domain="example.com"))
    # Output: https://example.com/images/logo.png

    # With default domain and custom scheme
    print(url_normalize("/images/logo.png", default_scheme="http", default_domain="example.com"))
    # Output: http://example.com/images/logo.png
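
The default_scheme and default_domain examples above suggest a pre-processing step that completes partial inputs before normalization. The following sketch shows one way such a step could look; `apply_defaults` is a hypothetical helper, not part of the url-normalize API, and it assumes that any input without "://" and without a leading slash is a bare host (so inputs such as `mailto:` URIs are out of scope).

```python
def apply_defaults(url, default_scheme="https", default_domain=None):
    """Complete a partial URL with a scheme and, optionally, a domain (hypothetical helper)."""
    if not url or "://" in url:
        # Empty strings and URLs that already carry a scheme are left alone.
        return url
    if url.startswith("//"):
        # Scheme-relative URL such as //domain.tld
        return f"{default_scheme}:{url}"
    if url.startswith("/"):
        # Absolute path: usable only when a default domain is supplied.
        if default_domain:
            return f"{default_scheme}://{default_domain}{url}"
        return url
    # Bare host, optionally followed by a port and path.
    return f"{default_scheme}://{url}"


print(apply_defaults("//domain.tld"))
# https://domain.tld
print(apply_defaults("/images/logo.png", default_domain="example.com"))
# https://example.com/images/logo.png
```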

Command-line Usage

You can also use url-normalize from the command line:

    $ url-normalize "www.foo.com:80/foo"
    # Output: https://www.foo.com:80/foo

    # With custom default scheme
    $ url-normalize -s http "www.foo.com/foo"
    # Output: http://www.foo.com/foo

    # With query parameter filtering
    $ url-normalize -f "www.google.com/search?q=test&utm_source=test"
    # Output: https://www.google.com/search?q=test

    # With custom allowlist
    $ url-normalize -f -p page,id "example.com?page=1&id=123&ref=test"
    # Output: https://example.com/?page=1&id=123

    # With default domain for absolute paths
    $ url-normalize -d example.com "/images/logo.png"
    # Output: https://example.com/images/logo.png

    # With default domain and custom scheme
    $ url-normalize -d example.com -s http "/images/logo.png"
    # Output: http://example.com/images/logo.png

    # Via uv tool/uvx
    $ uvx url-normalize www.foo.com:80/foo
    # Output: https://www.foo.com:80/foo
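
If you want similar flags in your own script, the short options shown above map naturally onto the keyword arguments of the Python API. The argparse sketch below is an illustration only: it is not the project's actual CLI code, `build_parser` and the program name are hypothetical, and only the short flags that appear in the examples are used.

```python
import argparse


def build_parser():
    """Build a parser mirroring the short flags shown above (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(prog="normalize-sketch")
    parser.add_argument("url")
    parser.add_argument("-s", dest="default_scheme", default="https")
    parser.add_argument("-d", dest="default_domain", default=None)
    parser.add_argument("-f", dest="filter_params", action="store_true")
    # "-p page,id" becomes the list ["page", "id"].
    parser.add_argument("-p", dest="param_allowlist", default=None,
                        type=lambda value: value.split(","))
    return parser


args = build_parser().parse_args(
    ["-f", "-p", "page,id", "example.com?page=1&id=123&ref=test"]
)
print(args.param_allowlist)
# ['page', 'id']
```

The parsed values carry the same names as the keyword arguments used in the Python API section (default_scheme, default_domain, filter_params, param_allowlist), so they could be forwarded to url_normalize directly.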

Documentation

For a complete history of changes, see CHANGELOG.md.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License