HOW TO FIND ALL EXISTING AND ARCHIVED URLS ON A WEBSITE


There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you’re searching for. For instance, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and difficult to extract data from.

In this post, I’ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website’s size.

Outdated sitemaps and crawl exports
If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.
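If you do turn up a saved sitemap file, pulling the URLs out of it takes only a few lines. Here is a minimal sketch using Python’s standard library; the inline sample stands in for a real saved sitemap.xml:

```python
import xml.etree.ElementTree as ET

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a standard sitemap XML document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

# Inline sample; in practice, read your saved sitemap.xml from disk
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
```

Sitemap index files work the same way; just run the function once per child sitemap.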

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.

That said, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
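If you’d rather skip the scraping plugin, the Wayback Machine also exposes its index programmatically through its CDX API. A hedged sketch: the query builder below uses CDX parameters as I understand them from the public documentation, and the small filter drops obvious resource files; the extension list and helper names are my own assumptions to adjust as needed:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain: str, limit: int = 50000) -> str:
    """Build a Wayback Machine CDX API query returning unique original URLs."""
    params = {
        "url": f"{domain}/*",
        "output": "text",
        "fl": "original",       # return only the original URL column
        "collapse": "urlkey",   # one row per unique URL
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Illustrative, not exhaustive: extensions that usually indicate resource files
RESOURCE_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                       ".svg", ".ico", ".woff", ".woff2")

def filter_pages(urls):
    """Drop obvious resource files so only page-like URLs remain."""
    return [u for u in urls
            if not u.lower().split("?")[0].endswith(RESOURCE_EXTENSIONS)]

sample = ["https://example.com/", "https://example.com/style.css",
          "https://example.com/blog/post"]
print(filter_pages(sample))
```

Fetch the built URL with any HTTP client, split the response into lines, and run the result through the filter.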

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re managing a large website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.

It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.

Google Research Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
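The API returns results in fixed-size pages (via its rowLimit and startRow parameters), so collecting a full URL list means looping until a short page comes back. The sketch below shows only that paging pattern; the fake fetcher stands in for the real, authenticated Search Analytics call, and the function names are my own:

```python
def fetch_all_pages(fetch_page, page_size=25000):
    """Collect rows from an API that returns at most `page_size` rows per call.

    `fetch_page(start_row, row_count)` should return a list of page URLs.
    A batch shorter than `page_size` signals the last page.
    """
    urls, start = [], 0
    while True:
        batch = fetch_page(start, page_size)
        urls.extend(batch)
        if len(batch) < page_size:
            return urls
        start += page_size

# Demo with a fake fetcher standing in for the real API call
dataset = [f"https://example.com/page-{i}" for i in range(7)]
fake_fetch = lambda start, count: dataset[start:start + count]
print(len(fetch_all_pages(fake_fetch, page_size=3)))  # 7
```

Swap the fake fetcher for a function that issues the actual API request and extracts page URLs from the response rows.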

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
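When planning these segments, it helps to know which path sections your site actually has and roughly how many URLs each contains, so no single export brushes up against the 100k cap. A small helper for that (my own naming, standard library only):

```python
from collections import Counter
from urllib.parse import urlparse

def top_segment(url):
    """Return the first path segment of a URL, e.g. '/blog/' for /blog/post-1."""
    path = urlparse(url).path.lstrip("/")
    return "/" + path.split("/")[0] + "/" if path else "/"

def section_counts(urls):
    """Count URLs per top-level section, to plan one GA4 segment per section."""
    return Counter(top_segment(u) for u in urls)

sample = [
    "https://example.com/",
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/shop/item-9",
]
print(section_counts(sample))
```

Run it over any partial URL list you already have (e.g., the sitemap or Archive.org data) to decide which segments to define.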

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.

Considerations:

File size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
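As a starting point, though, extracting the requested paths from a raw access log doesn’t require a dedicated tool. This sketch assumes the common Apache/Nginx “combined” log format and keeps only GET and HEAD requests; adapt the regex if your CDN’s format differs:

```python
import re

# Matches the request section of Apache/Nginx "combined" format log lines
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]+"')

def paths_from_log(lines):
    """Extract the unique URL paths requested in a combined-format access log."""
    paths = set()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            paths.add(m.group(1))
    return sorted(paths)

sample_log = [
    '66.249.66.1 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/post-1 HTTP/1.1" 200 5123 "-" "Googlebot/2.1"',
    '203.0.113.7 - - [10/Oct/2024:13:56:02 +0000] "GET /about HTTP/1.1" 404 321 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2024:13:56:05 +0000] "POST /api/form HTTP/1.1" 200 88 "-" "Mozilla/5.0"',
]
print(paths_from_log(sample_log))  # ['/about', '/blog/post-1']
```

For real logs, iterate over the file object line by line instead of a list, so multi-gigabyte files never need to fit in memory.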
Combine, and good luck
Once you’ve gathered URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
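In a Jupyter Notebook, the normalize-then-deduplicate step can look like this. It is a sketch: the normalization rules shown (lowercasing scheme and host, dropping fragments, stripping trailing slashes) are common choices, but adjust them to how your site actually serves URLs:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Normalize a URL so trivial variants collapse into one entry."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"          # treat /blog and /blog/ as one
    return urlunsplit((parts.scheme.lower(),       # lowercase scheme and host,
                       parts.netloc.lower(),       # which are case-insensitive
                       path, parts.query, ""))     # "" drops the #fragment

def dedupe(url_lists):
    """Merge URL lists from every source into one sorted, deduplicated list."""
    return sorted({normalize(u) for urls in url_lists for u in urls})

merged = dedupe([
    ["https://Example.com/blog/", "https://example.com/blog#intro"],
    ["https://example.com/blog", "https://example.com/"],
])
print(merged)  # ['https://example.com/', 'https://example.com/blog']
```

Each inner list stands in for one source (sitemap, Archive.org, Search Console, analytics, logs); the set comprehension does the deduplication.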

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
