Overview

The Get URLs from Sitemap block extracts URLs from a website’s XML sitemap. Use it to discover all pages on a website, filter for specific sections, and feed URLs into other blocks for bulk scraping, analysis, or content processing.

Configuration

Sitemap URL

Enter the URL of the sitemap to parse. Most websites serve it at /sitemap.xml.
Common sitemap locations:
  • https://example.com/sitemap.xml
  • https://example.com/sitemap_index.xml
  • https://example.com/post-sitemap.xml
This field supports placeholders for dynamic sitemap URLs:
https://{{step_1.output.domain}}/sitemap.xml
Most websites list their sitemap location in the robots.txt file at https://example.com/robots.txt. Check there if you can’t find the sitemap.
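
As a rough illustration of the robots.txt lookup, the sketch below pulls Sitemap: entries out of a robots.txt body. The function name sitemaps_from_robots and the sample text are hypothetical; the block itself does not necessarily do this for you, so treat it as a manual helper under those assumptions.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, since the URL itself contains colons.
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


sample = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/post-sitemap.xml
"""
print(sitemaps_from_robots(sample))
```

Any URL this returns can be pasted into the Sitemap URL field.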

Limit

Limit how many URLs are returned from the sitemap.
  • Default: 100 URLs
  • Minimum: 1 URL
Set a lower limit when:
  • Testing your workflow before running at scale
  • Processing only a sample of pages
  • Staying within credit budgets for subsequent scraping

Include

Filter URLs to include only those matching specific patterns. Enter one or more text patterns separated by commas.
How it works:
  • URLs must contain at least one of the specified patterns (OR logic)
  • Matching is case-sensitive
  • Partial matches work (e.g., /blog/ matches /blog/post-title)
Examples:
Pattern | Matches
/blog/ | All blog post URLs
/products/ | All product pages
/blog/, /news/ | Blog posts OR news articles
/2024/ | Pages with 2024 in the URL
/category/seo/ | SEO category pages only
This field supports placeholders:
/category/{{step_1.output.category}}/
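
The matching rules described here (OR logic, case-sensitive, substring match) can be approximated with the sketch below. The helper names parse_patterns and include_urls are hypothetical, and trimming whitespace around each comma-separated pattern is an assumption, not confirmed behavior of the block.

```python
def parse_patterns(field: str) -> list[str]:
    """Split a comma-separated filter field into trimmed, non-empty patterns."""
    # Assumption: whitespace around each pattern is ignored.
    return [p.strip() for p in field.split(",") if p.strip()]


def include_urls(urls: list[str], include_field: str) -> list[str]:
    """Keep URLs containing at least one pattern (case-sensitive substring)."""
    patterns = parse_patterns(include_field)
    if not patterns:
        return list(urls)  # no include filter set: keep everything
    return [u for u in urls if any(p in u for p in patterns)]


urls = [
    "https://example.com/blog/post-title",
    "https://example.com/news/launch",
    "https://example.com/Blog/draft",   # capital B: does not match /blog/
    "https://example.com/about",
]
print(include_urls(urls, "/blog/, /news/"))
```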

Exclude

Filter out URLs containing specific patterns. Enter one or more text patterns separated by commas.
How it works:
  • URLs matching any pattern are removed (OR logic)
  • Applied after include filter
  • Useful for removing unwanted page types
Examples:
Pattern | Excludes
/tag/ | Tag archive pages
/author/ | Author pages
/page/ | Pagination pages
/amp/, /feed/ | AMP pages and RSS feeds
/admin/, /api/ | Admin and API endpoints

Output

The block returns an array of URL strings extracted from the sitemap.

Output Example

[
  "https://example.com/blog/how-to-improve-seo",
  "https://example.com/blog/content-marketing-guide",
  "https://example.com/blog/keyword-research-tips",
  "https://example.com/blog/link-building-strategies"
]
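
For intuition about where this array comes from, the following minimal sketch extracts the loc entries from a standard sitemap document using Python's standard library. It assumes the sitemaps.org 0.9 namespace; the function name urls_from_urlset is hypothetical, and the block's actual parser may differ.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_urlset(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap <urlset>."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


sample_sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/how-to-improve-seo</loc></url>
  <url><loc>https://example.com/blog/content-marketing-guide</loc></url>
</urlset>"""
print(urls_from_urlset(sample_sitemap))
```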

Accessing URLs

Get all URLs:
{{step_n.output}}
Get first URL:
{{step_n.output[0]}}
Get URL count:
{{step_n.output | size}}
Loop through URLs: Use with the Loop block to process each URL individually.

Combining Filters

Include and exclude filters work together:
  1. First, the include filter is applied (if set)
  2. Then, the exclude filter removes unwanted URLs
  3. Finally, the limit is applied
Example:
  • Sitemap: https://example.com/sitemap.xml
  • Include: /blog/
  • Exclude: /tag/, /author/
  • Limit: 50
Result: Up to 50 blog post URLs, excluding tag and author pages.
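
The order of operations above (include, then exclude, then limit) can be sketched as a single function. This is an illustration under stated assumptions: filter_urls is a hypothetical name, patterns are treated as case-sensitive substrings, and whitespace around commas is assumed to be ignored.

```python
def filter_urls(urls, include="", exclude="", limit=100):
    """Apply include, then exclude, then limit, in that order."""
    if limit < 1:
        raise ValueError("limit must be at least 1")

    def patterns(field):
        # Comma-separated field -> list of trimmed, non-empty patterns.
        return [p.strip() for p in field.split(",") if p.strip()]

    inc, exc = patterns(include), patterns(exclude)
    # 1. Include: keep URLs matching any include pattern (or all, if unset).
    kept = [u for u in urls if not inc or any(p in u for p in inc)]
    # 2. Exclude: drop URLs matching any exclude pattern.
    kept = [u for u in kept if not any(p in u for p in exc)]
    # 3. Limit: truncate only after both filters have run.
    return kept[:limit]


urls = [
    "https://example.com/blog/how-to-improve-seo",
    "https://example.com/blog/tag/seo",
    "https://example.com/products/widget",
    "https://example.com/blog/author/jane",
    "https://example.com/blog/keyword-research-tips",
    "https://example.com/blog/link-building-strategies",
]
print(filter_urls(urls, include="/blog/", exclude="/tag/, /author/", limit=2))
```

Because the limit runs last, it caps the number of URLs that survive filtering rather than the number of URLs read from the sitemap.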

Best Practices

  • Start with a small limit when testing workflows
  • Use include filters to target specific content types
  • Exclude pagination, tags, and archives for cleaner results
  • Check the sitemap structure first to understand URL patterns
  • Combine with Loop and Web Scrape blocks for bulk content extraction
  • Some sites have multiple sitemaps; check the sitemap index
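
To check whether a given file is a sitemap index rather than a regular sitemap, a small sketch like the one below can help. It assumes the standard sitemaps.org namespace; child_sitemaps is a hypothetical helper, and whether the block follows index entries automatically is not stated here.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced tags as "{namespace}tag".
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def child_sitemaps(xml_text: str) -> list[str]:
    """Return child sitemap URLs if the document is a sitemap index, else []."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{NS}sitemapindex":
        return []  # a regular <urlset> (or something else entirely)
    return [
        loc.text.strip()
        for loc in root.iter(f"{NS}loc")
        if loc.text and loc.text.strip()
    ]


sample_index = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""
print(child_sitemaps(sample_index))
```

Each returned URL can then be used as its own Sitemap URL.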

Common Use Cases

Use Case | Configuration Tips
Blog content audit | Include /blog/, exclude /tag/, /author/, /page/
Product catalog extraction | Include /products/ or /shop/
Competitor page discovery | Set high limit, filter by content sections
Content migration | Extract all URLs, scrape content from each
SEO analysis | Get all URLs, analyze with LLM for optimization opportunities
Broken link checking | Extract URLs, use Call API to check status codes

Example Workflow: Bulk Content Analysis

Analyze all blog posts from a competitor:
  1. Get URLs from Sitemap Block:
    • Sitemap URL: https://competitor.com/sitemap.xml
    • Include: /blog/
    • Exclude: /tag/, /category/, /author/
    • Limit: 100
  2. Loop Block: Iterate through each URL
  3. Web Scrape Block:
    • URL: {{current}}
    • Format: Markdown
    • Only Main Content: On
  4. LLM Block: Analyze content themes and structure
  5. Google Sheets Block: Store analysis results

Example Workflow: Site Inventory

Create a complete inventory of a website’s pages:
  1. Get URLs from Sitemap Block:
    • Sitemap URL: https://yoursite.com/sitemap.xml
    • Limit: 500
  2. Loop Block: Process each URL
  3. Web Scrape Block:
    • URL: {{current}}
    • Format: Markdown
    • Include Metadata: On
  4. Google Sheets Block: Append URL, title, and description

Troubleshooting

Issue | Cause | Solution
No URLs returned | Invalid sitemap URL | Verify the sitemap exists and is accessible
Empty results | Filters too restrictive | Broaden include patterns or remove exclude patterns
Missing pages | Sitemap not complete | Check if site has multiple sitemaps
Wrong pages | Incorrect filter pattern | Test patterns against actual sitemap URLs
Timeout | Very large sitemap | Reduce limit to process fewer URLs
