Sitemap URL List Cleaner Guide: Remove Duplicates, Normalize, and Validate URLs
A URL list cleaner processes a raw list of URLs — often exported from crawlers, analytics tools, CMS databases, or sitemaps — and removes problems that make the list unsuitable for SEO use, sitemap submission, or migration work. Common issues include duplicate URLs in different formats, tracking parameters that inflate the URL count, inconsistent trailing slashes, mixed HTTP and HTTPS versions, and malformed entries that aren't valid URLs at all.
Common URL List Problems
| Problem | Example | Impact |
|---|---|---|
| Duplicate URLs | Same URL listed twice | Wastes crawl budget, inflates counts |
| Trailing slash inconsistency | /page/ and /page | Google may index both as separate pages |
| HTTP vs HTTPS mix | http:// and https:// same page | Duplicate content signals to search engines |
| UTM/tracking parameters | ?utm_source=email&utm_medium=cpc | Creates thousands of "unique" URLs from one page |
| Session IDs in URLs | ?sessionid=abc123 | Every user visit becomes a different URL |
| www vs non-www | www.example.com and example.com | Duplicate domain versions in sitemap |
| Malformed URLs | Missing protocol, spaces in URL | Crawler errors, sitemap validation failure |
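Several of the problems in the table can be detected mechanically before any cleaning happens. A minimal Python sketch (the function name and grouping key are illustrative, not part of any particular tool) that groups URLs differing only by scheme, www prefix, or trailing slash:

```python
from urllib.parse import urlparse

def find_near_duplicates(urls):
    """Group URLs that differ only by scheme, www prefix, or trailing slash."""
    groups = {}
    for url in urls:
        p = urlparse(url)
        host = p.netloc.lower().removeprefix('www.')
        path = p.path.rstrip('/') or '/'
        key = (host, path, p.query)  # scheme, www, and trailing slash ignored
        groups.setdefault(key, []).append(url)
    # keep only keys where more than one raw URL collapsed together
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Running this over a raw export quickly shows how many entries are the same page in disguise.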
URL Normalization Rules
URL normalization is the process of transforming URLs into a canonical, consistent form so that equivalent URLs are recognized as duplicates. The standard normalization steps are:
- Lowercase the scheme and host — `HTTPS://Example.COM/page` → `https://example.com/page`
- Remove default ports — `https://example.com:443/` → `https://example.com/`
- Percent-decode unreserved characters — `/my%70age` → `/mypage` (decode unnecessarily encoded letters, but keep `%20`, since spaces must stay encoded)
- Resolve dot segments in paths — `/blog/../products/` → `/products/`
- Apply a consistent trailing-slash rule — choose either always-trailing-slash or never-trailing-slash and apply it uniformly
- Sort query parameters — `?b=2&a=1` → `?a=1&b=2` (so the same parameters in a different order match)
- Remove tracking parameters — strip `utm_*`, `fbclid`, `gclid`, and other known tracking parameters
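The rules above can be combined into a single normalizer. The sketch below is a simplified version, assuming a never-trailing-slash policy and a fixed tracking-parameter set; dot-segment resolution and selective percent-decoding are omitted for brevity:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING = {'utm_source', 'utm_medium', 'utm_campaign',
            'utm_term', 'utm_content', 'fbclid', 'gclid'}

def normalize(url):
    p = urlparse(url)
    scheme = p.scheme.lower()
    host = p.netloc.lower()
    # remove the default port for the scheme
    default = {'http': ':80', 'https': ':443'}.get(scheme)
    if default and host.endswith(default):
        host = host[:-len(default)]
    path = p.path.rstrip('/') or '/'  # never-trailing-slash policy
    # sort query parameters and strip known tracking parameters
    pairs = sorted((k, v) for k, v in parse_qsl(p.query) if k not in TRACKING)
    return urlunparse((scheme, host, path, '', urlencode(pairs), ''))
```

With this, `HTTPS://Example.COM:443/page/?b=2&a=1&utm_source=x` and `https://example.com/page?a=1&b=2` normalize to the same string, so a plain set or dict catches them as duplicates.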
Stripping Tracking Parameters
UTM parameters (utm_source, utm_medium, utm_campaign, utm_term, utm_content) are appended to URLs for analytics tracking. They do not change the page content — /blog/post?utm_source=twitter serves the same content as /blog/post. Including them in sitemaps is wrong because:
- They inflate the URL count, potentially to thousands or millions of "unique" URLs
- Google may index tracked versions and split ranking signals between the clean and tracked URLs
- Google's sitemap guidelines recommend listing only canonical URLs, so tracked variants can surface as duplicate-URL issues in Google Search Console
```python
# Python: strip UTM parameters from a URL list
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse

UTM_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign',
              'utm_term', 'utm_content', 'fbclid', 'gclid'}

def clean_url(url):
    parsed = urlparse(url)
    params = {k: v for k, v in parse_qs(parsed.query).items()
              if k not in UTM_PARAMS}
    clean_query = urlencode(params, doseq=True)
    return urlunparse(parsed._replace(query=clean_query))
```
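Applied to a whole list, the same idea pairs naturally with order-preserving deduplication. The sketch below repeats `clean_url` so it runs standalone; `dict.fromkeys` keeps the first occurrence of each cleaned URL:

```python
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse

UTM_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign',
              'utm_term', 'utm_content', 'fbclid', 'gclid'}

def clean_url(url):
    parsed = urlparse(url)
    params = {k: v for k, v in parse_qs(parsed.query).items()
              if k not in UTM_PARAMS}
    return urlunparse(parsed._replace(query=urlencode(params, doseq=True)))

def clean_list(urls):
    # strip whitespace, drop blank lines, clean each URL,
    # then deduplicate while preserving first-seen order
    return list(dict.fromkeys(clean_url(u.strip()) for u in urls if u.strip()))
```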
Google's sitemap protocol allows a maximum of 50,000 URLs per sitemap file and a maximum file size of 50MB (uncompressed). Large sites use sitemap index files that point to multiple individual sitemaps. Cleaning your URL list before generating a sitemap prevents exceeding these limits with junk URLs.
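These limits mean a large cleaned list must be split before generating sitemap files. A minimal sketch (the helper names are illustrative) that chunks a list and writes one `<urlset>` file per chunk, XML-escaping each URL:

```python
from xml.sax.saxutils import escape

SITEMAP_LIMIT = 50_000  # Google's per-file URL limit

def chunk_urls(urls, limit=SITEMAP_LIMIT):
    """Split a cleaned URL list into sitemap-sized chunks."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

def write_sitemap(urls, path):
    """Write one <urlset> file; URLs are XML-escaped (& becomes &amp;)."""
    entries = '\n'.join(f'  <url><loc>{escape(u)}</loc></url>' for u in urls)
    xml = ('<?xml version="1.0" encoding="UTF-8"?>\n'
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
           f'{entries}\n'
           '</urlset>\n')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(xml)
```

A sitemap index file pointing at each chunk would be generated the same way, with `<sitemapindex>` and `<sitemap><loc>` elements instead.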
What URLs Should NOT Be in a Sitemap
A sitemap is a signal to search engines about which pages you want indexed. Including the wrong URLs can actually harm your SEO:
- Pages with `noindex` — contradictory to include in a sitemap; Google may ignore the noindex or flag the inconsistency
- Redirect URLs — include only the final destination URL, not intermediate redirects
- Admin or login pages — no public value, and typically blocked by robots.txt anyway
- Filtered and paginated versions — e.g. `/products?color=red&size=small` usually shouldn't be in sitemaps unless they represent genuinely unique content
- Canonicalized-away pages — if a page's canonical tag points to a different URL, submit the canonical URL instead
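Some of these exclusions can be automated with simple pattern checks. The sketch below is illustrative only: the path patterns and parameter names are assumptions to adapt to your own site, and noindex, redirect, and canonical checks require fetching each URL, which is omitted here:

```python
import re
from urllib.parse import urlparse, parse_qsl

# Illustrative patterns only -- adapt to your own site structure
EXCLUDE_PATH = re.compile(r'^/(wp-admin|admin|login|cart|checkout)(/|$)')
FACET_PARAMS = {'color', 'size', 'sort', 'page'}  # faceted-navigation params

def sitemap_eligible(url):
    p = urlparse(url)
    if EXCLUDE_PATH.match(p.path):
        return False  # admin/login pages: no public value
    if any(k in FACET_PARAMS for k, _ in parse_qsl(p.query)):
        return False  # filtered/paginated variant of a canonical page
    return True
```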
Clean Your URL List Instantly
Paste your URLs and remove duplicates, strip tracking parameters, normalize slashes, and filter to a single domain.
Open the Sitemap URL List Cleaner

How to Use the Sitemap URL List Cleaner
- Open the Sitemap URL List Cleaner
- Paste your raw URL list (one URL per line)
- Select cleaning options: remove duplicates, strip UTM parameters, normalize trailing slashes, filter to one domain
- Click Clean — the tool shows how many URLs were removed and why
- Copy the cleaned list or download it as a text file
- Use the output directly in your sitemap generator or migration tool