FixThatApp

Sitemap URL List Cleaner Guide: Remove Duplicates, Normalize, and Validate URLs

Updated March 19, 2026

A URL list cleaner processes a raw list of URLs — often exported from crawlers, analytics tools, CMS databases, or sitemaps — and removes problems that make the list unsuitable for SEO use, sitemap submission, or migration work. Common issues include duplicate URLs in different formats, tracking parameters that inflate the URL count, inconsistent trailing slashes, mixed HTTP and HTTPS versions, and malformed entries that aren't valid URLs at all.

Common URL List Problems

| Problem | Example | Impact |
| --- | --- | --- |
| Duplicate URLs | Same URL listed twice | Wastes crawl budget, inflates counts |
| Trailing slash inconsistency | /page/ and /page | Google may index both as separate pages |
| HTTP vs HTTPS mix | http:// and https:// for the same page | Duplicate content signals to search engines |
| UTM/tracking parameters | ?utm_source=email&utm_medium=cpc | Creates thousands of "unique" URLs from one page |
| Session IDs in URLs | ?sessionid=abc123 | Every user visit becomes a different URL |
| www vs non-www | www.example.com and example.com | Duplicate domain versions in sitemap |
| Malformed URLs | Missing protocol, spaces in URL | Crawler errors, sitemap validation failure |
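
Several of these problems can be detected mechanically before any cleaning. A minimal sketch (the sample list and helper names are illustrative, not part of any tool):

```python
# Detect exact duplicates and HTTP/HTTPS mixes in a raw URL list
from collections import Counter
from urllib.parse import urlparse

urls = [
    "https://example.com/page",
    "http://example.com/page",    # HTTP/HTTPS mix
    "https://example.com/page/",  # trailing-slash variant
    "https://example.com/page",   # exact duplicate
]

# Exact duplicates: any URL that appears more than once
dupes = [u for u, n in Counter(urls).items() if n > 1]

# Hosts that are served over more than one scheme
schemes_by_host = {}
for u in urls:
    p = urlparse(u)
    schemes_by_host.setdefault(p.hostname, set()).add(p.scheme)
mixed = [h for h, s in schemes_by_host.items() if len(s) > 1]
```

Note that `/page` and `/page/` are not caught as duplicates by exact matching; that requires the normalization rules covered next.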

URL Normalization Rules

URL normalization is the process of transforming URLs into a canonical, consistent form so that equivalent URLs are recognized as duplicates. The standard normalization steps are:

  1. Lowercase the scheme and host: HTTPS://Example.COM/page → https://example.com/page
  2. Remove default ports: https://example.com:443/ → https://example.com/
  3. Percent-decode unreserved characters: /caf%65 → /cafe (decode unnecessarily encoded letters, but keep genuinely needed encodings such as %20 for spaces)
  4. Resolve dot segments in paths: /blog/../products/ → /products/
  5. Apply consistent trailing slash rule — choose either always-trailing-slash or never-trailing-slash and apply uniformly
  6. Sort query parameters: ?b=2&a=1 → ?a=1&b=2 (so the same parameters in a different order match)
  7. Remove tracking parameters — strip utm_*, fbclid, gclid, and other known tracking parameters
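
Most of the steps above can be sketched with the standard library. This is a minimal illustration, not a complete implementation: it applies the never-trailing-slash rule, omits step 3 (selective percent-decoding) for brevity, and drops fragments, which have no place in sitemap URLs. The name `normalize_url` is illustrative.

```python
# Sketch of URL normalization: lowercase scheme/host, drop default
# ports, resolve dot segments, strip trailing slashes, sort the query.
import posixpath
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url: str) -> str:
    p = urlparse(url)
    # Steps 1-2: lowercase scheme and host, remove default ports
    scheme = p.scheme.lower()
    host = p.hostname.lower() if p.hostname else ""
    netloc = f"{host}:{p.port}" if p.port and p.port != DEFAULT_PORTS.get(scheme) else host
    # Step 4: resolve dot segments (/blog/../products/ -> /products)
    path = posixpath.normpath(p.path) if p.path else "/"
    # Step 5: never-trailing-slash rule, except for the root path
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")
    # Step 6: sort query parameters so ?b=2&a=1 matches ?a=1&b=2
    query = urlencode(sorted(parse_qsl(p.query)))
    return urlunparse((scheme, netloc, path, "", query, ""))
```

After normalizing every entry this way, exact-match deduplication (for example via a `set`) catches variants that plain string comparison misses.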

Stripping Tracking Parameters

UTM parameters (utm_source, utm_medium, utm_campaign, utm_term, utm_content) are appended to URLs for analytics tracking. They do not change the page content — /blog/post?utm_source=twitter serves the same content as /blog/post. Including them in sitemaps is wrong because it sends duplicate-content signals, inflates the URL count toward sitemap limits, and wastes crawl budget on pages search engines have already seen.

# Python: strip UTM and other tracking parameters from a URL
from urllib.parse import urlparse, urlencode, parse_qs, urlunparse

# Known tracking parameters; the utm_* prefix covers the standard analytics tags
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign', 'utm_term',
                   'utm_content', 'fbclid', 'gclid'}

def clean_url(url):
    parsed = urlparse(url)
    # Keep only query parameters that are not tracking parameters
    params = {k: v for k, v in parse_qs(parsed.query).items()
              if k not in TRACKING_PARAMS and not k.startswith('utm_')}
    # doseq=True re-encodes the list values that parse_qs returns
    clean_query = urlencode(params, doseq=True)
    return urlunparse(parsed._replace(query=clean_query))

# e.g. clean_url('https://example.com/blog/post?utm_source=twitter&page=2')
#      -> 'https://example.com/blog/post?page=2'

Sitemap URL Limits

Google's sitemap protocol allows a maximum of 50,000 URLs per sitemap file and a maximum file size of 50MB (uncompressed). Large sites use sitemap index files that point to multiple individual sitemaps. Cleaning your URL list before generating a sitemap prevents exceeding these limits with junk URLs.
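
A cleaned list can be split to stay under the 50,000-URL limit with a simple chunking helper. This is a sketch; the function name is illustrative, and writing the actual sitemap XML (and the index file that references each chunk) is left to your generator:

```python
# Split a cleaned URL list into sitemap-sized chunks of at most
# 50,000 URLs each, per Google's sitemap protocol limit.
MAX_URLS_PER_SITEMAP = 50_000

def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive lists of at most `size` URLs."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```

Each chunk then becomes one sitemap file, and the sitemap index file lists the URL of every chunk.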

What URLs Should NOT Be in a Sitemap

A sitemap is a signal to search engines about which pages you want indexed. Including the wrong URLs can actually harm your SEO: leave out redirected (3xx) URLs, broken (4xx/5xx) URLs, pages blocked by robots.txt or marked noindex, non-canonical duplicates, and parameter-only variations of the same page.
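
Malformed entries, the last row of the problems table above, can be filtered out with a basic validity check. A minimal sketch (the function name is illustrative, and this checks syntax only, not HTTP status):

```python
# Reject entries that are not absolute http(s) URLs with a host,
# or that contain literal spaces (invalid in a sitemap URL).
from urllib.parse import urlparse

def is_valid_sitemap_url(url: str) -> bool:
    candidate = url.strip()
    try:
        p = urlparse(candidate)
    except ValueError:  # e.g. malformed IPv6 host
        return False
    return (p.scheme in ("http", "https")
            and bool(p.netloc)
            and " " not in candidate)
```

Applying this filter before deduplication keeps crawler errors and sitemap validation failures out of the final list.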

Clean Your URL List Instantly

Paste your URLs and remove duplicates, strip tracking parameters, normalize slashes, and filter to a single domain.

Open the Sitemap URL List Cleaner

How to Use the Sitemap URL List Cleaner

  1. Open the Sitemap URL List Cleaner
  2. Paste your raw URL list (one URL per line)
  3. Select cleaning options: remove duplicates, strip UTM parameters, normalize trailing slashes, filter to one domain
  4. Click Clean — the tool shows how many URLs were removed and why
  5. Copy the cleaned list or download it as a text file
  6. Use the output directly in your sitemap generator or migration tool