Overview
The Get URLs from Sitemap block extracts URLs from a website’s XML sitemap. Use it to discover all pages on a website, filter for specific sections, and feed URLs into other blocks for bulk scraping, analysis, or content processing.
Configuration
Sitemap URL
Enter the URL of the sitemap to parse. Most websites serve it at /sitemap.xml.
Common sitemap locations:
https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/post-sitemap.xml
This field supports placeholders for dynamic sitemap URLs:
https://{{step_1.output.domain}}/sitemap.xml
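To make the placeholder idea concrete, here is a minimal Python sketch of how a template like the one above could expand into a final URL. The platform resolves placeholders itself; the `resolve` helper and the nested `context` dictionary are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical illustration only: matches {{dotted.path}} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def resolve(template: str, context: dict) -> str:
    """Replace each {{a.b.c}} with the value found at context["a"]["b"]["c"]."""
    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the path is missing
        return str(value)
    return PLACEHOLDER.sub(lookup, template)

# Assumed shape of an earlier step's output.
context = {"step_1": {"output": {"domain": "example.com"}}}
print(resolve("https://{{step_1.output.domain}}/sitemap.xml", context))
# prints: https://example.com/sitemap.xml
```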
Most websites list their sitemap location in the robots.txt file at https://example.com/robots.txt. Check there if you can’t find the sitemap.
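If you want to check robots.txt programmatically, the following Python sketch pulls the declared sitemap URLs out of a robots.txt body. It works on text you have already downloaded, and the function name `sitemaps_from_robots` is an assumption made for this example, not part of the block.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found

sample = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/post-sitemap.xml
"""
print(sitemaps_from_robots(sample))
# prints: ['https://example.com/sitemap.xml', 'https://example.com/post-sitemap.xml']
```

Any URL found this way can be pasted into the Sitemap URL field.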
Maximum Number of Links
Limit how many URLs are returned from the sitemap.
- Default: 100 URLs
- Minimum: 1 URL
Set a lower limit when:
- Testing your workflow before running at scale
- Processing only a sample of pages
- Staying within credit budgets for subsequent scraping
Include Only Links That Contain
Filter URLs to include only those matching specific patterns. Enter one or more text patterns separated by commas.
How it works:
- URLs must contain at least one of the specified patterns (OR logic)
- Matching is case-sensitive
- Partial matches work (e.g., /blog/ matches /blog/post-title)
Examples:
| Pattern | Matches |
|---|---|
| /blog/ | All blog post URLs |
| /products/ | All product pages |
| /blog/, /news/ | Blog posts OR news articles |
| /2024/ | Pages with 2024 in the URL |
| /category/seo/ | SEO category pages only |
This field supports placeholders:
/category/{{step_1.output.category}}/
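The matching rules in this section (OR logic, case-sensitive, partial matches) can be approximated with plain substring checks. The Python sketch below is an illustration, not the block's actual implementation; it assumes whitespace around commas is trimmed, and the names `parse_patterns` and `include_only` are hypothetical.

```python
def parse_patterns(raw: str) -> list[str]:
    """Split a comma-separated field into trimmed, non-empty patterns."""
    return [p.strip() for p in raw.split(",") if p.strip()]

def include_only(urls: list[str], raw_patterns: str) -> list[str]:
    """Keep URLs that contain at least one pattern (case-sensitive)."""
    patterns = parse_patterns(raw_patterns)
    if not patterns:
        return list(urls)  # no include filter set: keep everything
    return [u for u in urls if any(p in u for p in patterns)]

urls = [
    "https://example.com/blog/post-title",
    "https://example.com/news/launch",
    "https://example.com/Blog/Old-Post",    # capital B: no match for /blog/
    "https://example.com/products/widget",
]
print(include_only(urls, "/blog/, /news/"))
# prints: ['https://example.com/blog/post-title', 'https://example.com/news/launch']
```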
Exclude Links That Contain
Filter out URLs containing specific patterns. Enter one or more text patterns separated by commas.
How it works:
- URLs matching any pattern are removed (OR logic)
- Applied after the include filter
- Useful for removing unwanted page types
Examples:
| Pattern | Excludes |
|---|---|
| /tag/ | Tag archive pages |
| /author/ | Author pages |
| /page/ | Pagination pages |
| /amp/, /feed/ | AMP pages and RSS feeds |
| /admin/, /api/ | Admin and API endpoints |
Output
The block returns an array of URL strings extracted from the sitemap.
Output Example
[
"https://example.com/blog/how-to-improve-seo",
"https://example.com/blog/content-marketing-guide",
"https://example.com/blog/keyword-research-tips",
"https://example.com/blog/link-building-strategies"
]
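To show how a sitemap maps onto an array like this, here is a minimal Python sketch that reads the `<loc>` entries from a standard sitemap document. It assumes the sitemaps.org 0.9 namespace and uses only the standard library; `urls_from_sitemap` is a hypothetical name, and the block's internal parsing may differ.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the text of every <url><loc> element, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/how-to-improve-seo</loc></url>
  <url><loc>https://example.com/blog/content-marketing-guide</loc></url>
</urlset>"""
print(urls_from_sitemap(sample))
# prints: ['https://example.com/blog/how-to-improve-seo', 'https://example.com/blog/content-marketing-guide']
```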
Accessing URLs
Get all URLs:
Get first URL:
Get URL count:
Loop through URLs:
Use with the Loop block to process each URL individually.
Combining Filters
Include and exclude filters work together:
- First, the include filter is applied (if set)
- Then, the exclude filter removes unwanted URLs
- Finally, the limit is applied
Example:
- Sitemap: https://example.com/sitemap.xml
- Include: /blog/
- Exclude: /tag/, /author/
- Limit: 50
Result: Up to 50 blog post URLs, excluding tag and author pages.
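The order of operations described above (include, then exclude, then limit) can be sketched in Python as follows. This is an approximation under stated assumptions: patterns are treated as trimmed, case-sensitive substrings, and `select_urls` is a hypothetical name rather than anything the block exposes.

```python
def split_patterns(raw: str) -> list[str]:
    """Split a comma-separated field into trimmed, non-empty patterns."""
    return [p.strip() for p in raw.split(",") if p.strip()]

def select_urls(urls: list[str], include: str = "", exclude: str = "",
                limit: int = 100) -> list[str]:
    inc, exc = split_patterns(include), split_patterns(exclude)
    # 1. Include filter (skipped when no include patterns are set).
    kept = [u for u in urls if not inc or any(p in u for p in inc)]
    # 2. Exclude filter removes any URL containing an exclude pattern.
    kept = [u for u in kept if not any(p in u for p in exc)]
    # 3. The limit is applied last.
    return kept[:limit]

urls = [
    "https://example.com/blog/keyword-research-tips",
    "https://example.com/blog/tag/seo",
    "https://example.com/blog/author/jane",
    "https://example.com/products/widget",
    "https://example.com/blog/link-building-strategies",
]
print(select_urls(urls, include="/blog/", exclude="/tag/, /author/", limit=50))
# prints: ['https://example.com/blog/keyword-research-tips', 'https://example.com/blog/link-building-strategies']
```

Because the limit comes last, a limit of 50 yields up to 50 URLs that already passed both filters, not 50 raw sitemap entries that are filtered afterwards.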
Best Practices
- Start with a small limit when testing workflows
- Use include filters to target specific content types
- Exclude pagination, tags, and archives for cleaner results
- Check the sitemap structure first to understand URL patterns
- Combine with Loop and Web Scrape blocks for bulk content extraction
- Some sites have multiple sitemaps; check the sitemap index
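When checking the sitemap structure, it helps to know whether a file is a regular sitemap or a sitemap index that points to other sitemaps. The Python sketch below tells the two apart by the root element, assuming the sitemaps.org 0.9 namespace. `classify_sitemap` is a hypothetical helper, and this page does not say whether the block follows index files on its own, so treat it as a manual inspection aid.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def classify_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ('sitemapindex' or 'urlset', list of <loc> values)."""
    root = ET.fromstring(xml_text)
    # Tags are namespace-qualified, e.g. "{http://...}sitemapindex".
    kind = root.tag.rsplit("}", 1)[-1]
    path = "sm:sitemap/sm:loc" if kind == "sitemapindex" else "sm:url/sm:loc"
    locs = [loc.text.strip() for loc in root.findall(path, NS) if loc.text]
    return kind, locs

index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""
print(classify_sitemap(index))
# prints: ('sitemapindex', ['https://example.com/post-sitemap.xml', 'https://example.com/page-sitemap.xml'])
```

If the result is a sitemap index, each listed child sitemap can be used as its own Sitemap URL.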
Common Use Cases
| Use Case | Configuration Tips |
|---|---|
| Blog content audit | Include /blog/, exclude /tag/, /author/, /page/ |
| Product catalog extraction | Include /products/ or /shop/ |
| Competitor page discovery | Set high limit, filter by content sections |
| Content migration | Extract all URLs, scrape content from each |
| SEO analysis | Get all URLs, analyze with LLM for optimization opportunities |
| Broken link checking | Extract URLs, use Call API to check status codes |
Example Workflow: Bulk Content Analysis
Analyze all blog posts from a competitor:
- Get URLs from Sitemap Block:
  - Sitemap URL: https://competitor.com/sitemap.xml
  - Include: /blog/
  - Exclude: /tag/, /category/, /author/
  - Limit: 100
- Loop Block: Iterate through each URL
- Web Scrape Block:
  - URL: {{current}}
  - Format: Markdown
  - Only Main Content: On
- LLM Block: Analyze content themes and structure
- Google Sheets Block: Store analysis results
Example Workflow: Site Inventory
Create a complete inventory of a website’s pages:
- Get URLs from Sitemap Block:
  - Sitemap URL: https://yoursite.com/sitemap.xml
  - Limit: 500
- Loop Block: Process each URL
- Web Scrape Block:
  - URL: {{current}}
  - Format: Markdown
  - Include Metadata: On
- Google Sheets Block: Append URL, title, and description
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| No URLs returned | Invalid sitemap URL | Verify the sitemap exists and is accessible |
| Empty results | Filters too restrictive | Broaden include patterns or remove exclude patterns |
| Missing pages | Sitemap is incomplete | Check whether the site has multiple sitemaps |
| Wrong pages | Incorrect filter pattern | Test patterns against actual sitemap URLs |
| Timeout | Very large sitemap | Reduce limit to process fewer URLs |