Overview
The Web Scrape block extracts content from webpages. Use it to gather information from websites, pull article content, extract product details, or collect data for AI analysis. The block supports multiple output formats including AI-powered structured data extraction.
Configuration
Website URL
Enter the URL of the webpage to scrape. This field supports placeholders, so you can scrape URLs produced by previous steps.
Examples:
- Static URL: https://example.com/blog/article-title
- From search results: {{step_1.output.organic[0].link}}
- From loop: {{current.url}}
The scraper handles JavaScript-rendered pages, so dynamic content loads correctly.
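To make the placeholder syntax concrete, here is a minimal Python sketch of how a dotted path such as `step_1.output.organic[0].link` could be resolved against data from earlier steps. It is an illustration only, not the platform's implementation: the `resolve_path` and `render` helpers and the shape of the `context` dictionary are assumptions made for this example.

```python
import re
from typing import Any

# One path segment: a name, optionally followed by [index] parts.
_SEGMENT = re.compile(r"([A-Za-z_]\w*)((?:\[\d+\])*)")
# A placeholder such as {{current.url}}.
_PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def resolve_path(context: dict[str, Any], path: str) -> Any:
    """Walk a dotted path like 'step_1.output.organic[0].link' (hypothetical helper)."""
    value: Any = context
    for part in path.split("."):
        match = _SEGMENT.fullmatch(part)
        if match is None:
            raise ValueError(f"Unsupported path segment: {part!r}")
        name, indexes = match.groups()
        value = value[name]
        # Apply each [n] index in order, e.g. organic[0].
        for index in re.findall(r"\[(\d+)\]", indexes):
            value = value[int(index)]
    return value


def render(template: str, context: dict[str, Any]) -> str:
    """Replace every {{path}} placeholder with its value from the context."""
    return _PLACEHOLDER.sub(
        lambda m: str(resolve_path(context, m.group(1))), template
    )


# Assumed example data: one search step and one loop item.
context = {
    "step_1": {"output": {"organic": [{"link": "https://example.com/a"}]}},
    "current": {"url": "https://example.com/b"},
}
first_link = render("{{step_1.output.organic[0].link}}", context)
loop_url = render("{{current.url}}", context)
```

In this sketch a missing key or index raises an error instead of producing an empty URL, which makes a mistyped placeholder easier to spot.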
Output Format
Choose how the scraped content is returned.
| Format | Description | Best For |
|---|---|---|
| AI JSON Format | AI extracts structured data based on your prompt | Product details, article metadata, specific data points |
| Markdown Format | Clean, formatted text content | Content analysis, LLM processing, readability |
| HTML Format | Raw HTML markup | Preserving structure, custom parsing |
AI JSON Format
When you select AI JSON Format, the scraper uses AI to extract specific data from the page based on your prompt.
Prompt
Tell the AI what information to extract from the page. Be specific about the data structure you need.
Example prompts:
For a product page:
Extract the product name, price, description, and list of features.
For a blog article:
Extract the article title, author name, publication date, and main content summary.
For a company page:
Extract the company name, founding year, number of employees, and headquarters location.
JSON Output Example
For a product page with the prompt “Extract product name, price, and features”:
{
"product_name": "Wireless Bluetooth Headphones",
"price": "$79.99",
"features": [
"40-hour battery life",
"Active noise cancellation",
"Foldable design",
"Built-in microphone"
]
}
Accessing JSON data:
{{step_n.output.product_name}}
{{step_n.output.features[0]}}
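If the extracted JSON is passed to custom code downstream, it can help to check that the expected fields are present and to convert string values such as the price into numbers. The Python sketch below assumes the output has the same shape as the example above. The `missing_fields` and `parse_price` helpers are hypothetical, and `parse_price` only handles dollar-formatted strings.

```python
import json
from decimal import Decimal

# Sample output with the same shape as the example above (features shortened).
RAW_OUTPUT = """
{
  "product_name": "Wireless Bluetooth Headphones",
  "price": "$79.99",
  "features": ["40-hour battery life", "Active noise cancellation"]
}
"""


def missing_fields(data: dict, required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty in the output."""
    return [key for key in required if not data.get(key)]


def parse_price(text: str) -> Decimal:
    """Convert a price string such as '$79.99' or '$1,299.00' to a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


product = json.loads(RAW_OUTPUT)
gaps = missing_fields(product, ["product_name", "price", "rating"])
price = parse_price(product["price"])
```

A non-empty `gaps` list may be a sign that the prompt needs to name the missing fields more explicitly.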
Markdown Format
Returns the page content as clean, readable markdown text. Navigation, ads, and boilerplate are removed.
Markdown Output Example
# How to Improve Your SEO in 2024
By John Smith | January 15, 2024
Search engine optimization continues to evolve. Here are the key strategies
for improving your rankings this year.
## 1. Focus on User Experience
Google's algorithm increasingly prioritizes pages that provide excellent
user experiences...
## 2. Create Quality Content
Content remains king. Focus on creating comprehensive, valuable content
that answers user questions...
Accessing markdown:
HTML Format
Returns the raw HTML content of the page. Useful when you need to preserve exact structure or perform custom parsing.
HTML Output Example
<article>
<h1>How to Improve Your SEO in 2024</h1>
<div class="author">By John Smith</div>
<div class="content">
<p>Search engine optimization continues to evolve...</p>
</div>
</article>
Content Options
Only Main Content
When enabled, the scraper excludes navigation menus, footers, sidebars, and other peripheral content, and returns only the primary content area.
- On: Cleaner output focused on main content
- Off: Full page content including navigation and sidebars
Use this when you want article text without site-wide elements.
Include Metadata
When enabled, the output includes page metadata alongside the content.
Metadata fields included:
- title: Page title
- description: Meta description
- ogTitle: Open Graph title
- ogDescription: Open Graph description
- language: Page language
- favicon: Favicon URL
- sourceURL: Original URL
- twitter:title: Twitter card title
- twitter:description: Twitter card description
When metadata is included, the output structure changes:
Markdown with metadata:
{
"markdown": "# Article Title\n\nArticle content here...",
"metadata": {
"title": "Article Title | Site Name",
"description": "A brief description of the article",
"ogTitle": "Article Title",
"language": "en",
"sourceURL": "https://example.com/article"
}
}
Accessing content with metadata:
{{step_n.output.markdown}}
{{step_n.output.metadata.title}}
{{step_n.output.metadata.description}}
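Because the output structure depends on the metadata setting, downstream code may need to handle both shapes. The Python sketch below assumes that the output is a plain string when metadata is off and an object with `markdown` and `metadata` keys when it is on. The first part is an assumption, since only the metadata-enabled structure is shown above, and `split_output` is a hypothetical helper.

```python
from typing import Any


def split_output(output: Any) -> tuple[str, dict]:
    """Return (content, metadata) for either output shape (hypothetical helper)."""
    if isinstance(output, dict):
        # Metadata enabled: content sits under "markdown".
        return output.get("markdown", ""), output.get("metadata", {})
    # Assumed shape with metadata off: the output is the content itself.
    return str(output), {}


with_metadata = {
    "markdown": "# Article Title\n\nArticle content here...",
    "metadata": {"title": "Article Title | Site Name", "language": "en"},
}
content, metadata = split_output(with_metadata)
plain_content, no_metadata = split_output("# Article Title")
```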
Best Practices
- Use “Only Main Content” for cleaner article extraction
- Choose Markdown format when feeding content to LLM blocks
- Use AI JSON format when you need specific structured data
- Include metadata when you need page titles or descriptions
- Combine with Google Search to scrape top-ranking pages
- Test scraping on a single URL before running bulk operations
Common Use Cases
| Use Case | Configuration Tips |
|---|---|
| Content research | Markdown format + Only Main Content for clean articles |
| Competitor analysis | AI JSON to extract specific data points |
| Price monitoring | AI JSON with prompt for price and product details |
| Lead generation | AI JSON to extract contact information |
| SEO analysis | Include Metadata to get title tags and descriptions |
| Content aggregation | Loop through URLs, scrape each in Markdown |
Example Workflow: Competitor Content Analysis
Analyze content from top-ranking pages:
- Google Search Block: Search for target keyword
- Loop Block: Iterate through top 5 organic results
- Web Scrape Block:
  - URL: {{current.link}}
  - Format: Markdown
  - Only Main Content: On
- LLM Block: Analyze content themes and structure
- Google Sheets Block: Store analysis results
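The data flow of this workflow can be sketched in Python as a loop over the top search results. This is only an outline of the logic: the `scrape` and `analyze` callables stand in for the Web Scrape and LLM blocks and are assumptions, as is the `analyze_top_results` name. The only field taken from the examples on this page is `link` on each organic result.

```python
from typing import Callable


def analyze_top_results(
    organic_results: list[dict],
    scrape: Callable[[str], str],
    analyze: Callable[[str], str],
    limit: int = 5,
) -> list[dict]:
    """Scrape and analyze the first `limit` results, one row per page."""
    rows = []
    for result in organic_results[:limit]:  # Loop Block: top N organic results
        url = result["link"]                # same field as {{current.link}}
        markdown = scrape(url)              # Web Scrape Block (Markdown format)
        rows.append({"url": url, "analysis": analyze(markdown)})  # LLM Block
    return rows                             # rows to store in a spreadsheet


# Stand-in callables so the sketch runs without network access.
def fake_scrape(url: str) -> str:
    return f"# Page at {url}"


def fake_analyze(markdown: str) -> str:
    return f"{len(markdown)} characters"


results = [{"link": f"https://example.com/{n}"} for n in range(7)]
rows = analyze_top_results(results, fake_scrape, fake_analyze)
```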
Example Workflow: Product Data Extraction
Extract product details from e-commerce pages:
- Google Sheets Block: Read list of product URLs
- Loop Block: Process each URL
- Web Scrape Block:
  - URL: {{current.product_url}}
  - Format: AI JSON
  - Prompt: “Extract product name, price, rating, number of reviews, and availability status”
- Google Sheets Block: Append extracted data
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Empty output | Page blocks scraping | Try a different URL, or check the site's robots.txt |
| Missing content | JavaScript not rendered | The scraper renders JavaScript, so content should load; contact support if the problem persists |
| Timeout | Page too slow | Reduce concurrent scrapes or try again later |
| Prompt required error | Using AI JSON format without a prompt | Add an extraction prompt for AI JSON format |
| Incomplete JSON | Vague prompt | Be more specific about the data to extract |
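For the "Empty output" case, one way to check a site's robots.txt rules is Python's standard `urllib.robotparser` module. The sketch below parses a made-up robots.txt body offline. In practice you would load the site's real file, and the user agent the scraper sends is not documented here, so `"*"` is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blog_ok = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/blog/article-title")
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page")
```

A disallowed path is only one possible cause of empty output. Sites can also block scrapers in ways that robots.txt does not reveal.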
What’s Next
Now that you understand the Web Scrape block: