
Overview

The Web Scrape block extracts content from webpages. Use it to gather information from websites, pull article content, extract product details, or collect data for AI analysis. The block supports multiple output formats including AI-powered structured data extraction.

Configuration

Website URL

Enter the URL of the webpage to scrape. This field supports placeholders, so you can scrape URLs produced by previous steps.
Examples:
  • Static URL: https://example.com/blog/article-title
  • From search results: {{step_1.output.organic[0].link}}
  • From loop: {{current.url}}
The scraper handles JavaScript-rendered pages, so dynamic content loads correctly.
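
To make the placeholder syntax concrete, here is a minimal Python sketch of how a path such as `step_1.output.organic[0].link` can be walked through nested step output. It is an illustration only, not the platform's actual resolver; `resolve_path`, `render`, and the sample `context` data are hypothetical.

```python
import re

# Hypothetical resolver, for illustration only (not the platform's implementation).
TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")
PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def resolve_path(path, context):
    """Follow dotted keys and [index] lookups through nested dicts and lists."""
    value = context
    for key, index in TOKEN.findall(path):
        value = value[key] if key else value[int(index)]
    return value


def render(template, context):
    """Replace every {{ path }} in the template with the value found in context."""
    return PLACEHOLDER.sub(lambda m: str(resolve_path(m.group(1), context)), template)


# Sample data shaped like the placeholders above (values are made up).
context = {
    "step_1": {"output": {"organic": [{"link": "https://example.com/blog/article-title"}]}},
    "current": {"url": "https://example.com/page-2"},
}

first_result_url = render("{{step_1.output.organic[0].link}}", context)
loop_url = render("{{current.url}}", context)
```

Reading a placeholder this way helps when debugging: each dot or index is one lookup, so a wrong key or an empty result list is the usual reason a URL fails to resolve.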

Result Format

Choose how the scraped content is returned.
| Format | Description | Best For |
| --- | --- | --- |
| AI JSON Format | AI extracts structured data based on your prompt | Product details, article metadata, specific data points |
| Markdown Format | Clean, formatted text content | Content analysis, LLM processing, readability |
| HTML Format | Raw HTML markup | Preserving structure, custom parsing |

AI JSON Format

When you select AI JSON Format, the scraper uses AI to extract specific data from the page based on your prompt.

Prompt

Tell the AI what information to extract from the page. Be specific about the data structure you need.

Example prompts:

For a product page:
Extract the product name, price, description, and list of features.

For a blog article:
Extract the article title, author name, publication date, and main content summary.

For a company page:
Extract the company name, founding year, number of employees, and headquarters location.
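
One way to keep prompts specific is to generate them from an explicit field list. The sketch below is an assumption about what works well, not a documented prompt format; `build_prompt` and the field names are hypothetical, and the AI is not guaranteed to use these exact keys.

```python
def build_prompt(fields):
    """Build an extraction prompt that names every field and what it should contain."""
    lines = ["Extract the following fields and return them as JSON:"]
    for name, description in fields.items():
        lines.append(f"- {name}: {description}")
    lines.append("Use null for any field that is not present on the page.")
    return "\n".join(lines)


# Hypothetical field list for a product page.
prompt = build_prompt({
    "product_name": "the product title as shown on the page",
    "price": "the listed price, including the currency symbol",
    "features": "a list of short feature descriptions",
})
```

Paste the resulting text into the Prompt field. Naming each field up front makes it easier to spot missing data in the output later.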

JSON Output Example

For a product page with the prompt “Extract product name, price, and features”:
{
  "product_name": "Wireless Bluetooth Headphones",
  "price": "$79.99",
  "features": [
    "40-hour battery life",
    "Active noise cancellation",
    "Foldable design",
    "Built-in microphone"
  ]
}
Accessing JSON data:
{{step_n.output.product_name}}
{{step_n.output.features[0]}}
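
If you pass the JSON output to your own code, the same fields can be read with a standard JSON parser. This sketch uses the example output above. Converting the price to a number is an extra step that assumes a leading `$`, as in the example.

```python
import json
from decimal import Decimal

# The example output above, as it might arrive in downstream code.
raw_output = """
{
  "product_name": "Wireless Bluetooth Headphones",
  "price": "$79.99",
  "features": [
    "40-hour battery life",
    "Active noise cancellation",
    "Foldable design",
    "Built-in microphone"
  ]
}
"""

product = json.loads(raw_output)

name = product["product_name"]          # same value as {{step_n.output.product_name}}
first_feature = product["features"][0]  # same value as {{step_n.output.features[0]}}

# The price is a string; strip the currency symbol before doing arithmetic.
price = Decimal(product["price"].lstrip("$"))
```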

Markdown Format

Returns the page content as clean, readable markdown text. Navigation, ads, and boilerplate are removed.

Markdown Output Example

# How to Improve Your SEO in 2024

By John Smith | January 15, 2024

Search engine optimization continues to evolve. Here are the key strategies
for improving your rankings this year.

## 1. Focus on User Experience

Google's algorithm increasingly prioritizes pages that provide excellent
user experiences...

## 2. Create Quality Content

Content remains king. Focus on creating comprehensive, valuable content
that answers user questions...
Accessing markdown:
{{step_n.output}}
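
Markdown output is easy to post-process because headings follow a fixed pattern. As a rough sketch, the following pulls a heading outline from a shortened copy of the example above. The `outline` helper is hypothetical, and the regular expression is deliberately naive: it would also match `#` lines inside code blocks.

```python
import re

# Shortened copy of the example Markdown output.
markdown_output = """# How to Improve Your SEO in 2024

By John Smith | January 15, 2024

## 1. Focus on User Experience

Google's algorithm increasingly prioritizes pages...

## 2. Create Quality Content

Content remains king...
"""

HEADING = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)


def outline(markdown):
    """Return (level, text) pairs for each heading line."""
    return [(len(hashes), text.strip()) for hashes, text in HEADING.findall(markdown)]


headings = outline(markdown_output)
```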

HTML Format

Returns the raw HTML content of the page. Useful when you need to preserve exact structure or perform custom parsing.

HTML Output Example

<article>
  <h1>How to Improve Your SEO in 2024</h1>
  <div class="author">By John Smith</div>
  <div class="content">
    <p>Search engine optimization continues to evolve...</p>
  </div>
</article>
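
As an illustration of custom parsing, the sketch below reads the title and author from the example HTML using Python's standard `html.parser` module. The `ArticleParser` class is hypothetical and tied to this example's markup (`h1` and `div class="author"`); real pages will need their own selectors.

```python
from html.parser import HTMLParser

html_output = """<article>
  <h1>How to Improve Your SEO in 2024</h1>
  <div class="author">By John Smith</div>
  <div class="content">
    <p>Search engine optimization continues to evolve...</p>
  </div>
</article>"""


class ArticleParser(HTMLParser):
    """Collect the text of the <h1> and of the <div class="author"> element."""

    def __init__(self):
        super().__init__()
        self._field = None  # name of the field currently being captured
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1":
            self._field = "title"
        elif tag == "div" and "author" in classes:
            self._field = "author"

    def handle_endtag(self, tag):
        # Stop capturing at any closing tag; enough for this flat example.
        self._field = None

    def handle_data(self, data):
        if self._field and data.strip():
            self.fields[self._field] = self.fields.get(self._field, "") + data.strip()


parser = ArticleParser()
parser.feed(html_output)
article = parser.fields
```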

Content Options

Only Main Content

When enabled, the scraper excludes navigation menus, footers, sidebars, and other peripheral content, and returns only the primary content area.
  • On: Cleaner output focused on main content
  • Off: Full page content including navigation and sidebars
Use this when you want article text without site-wide elements.

Include Metadata

When enabled, the output includes page metadata alongside the content.
Metadata fields included:
  • title - Page title
  • description - Meta description
  • ogTitle - Open Graph title
  • ogDescription - Open Graph description
  • language - Page language
  • favicon - Favicon URL
  • sourceURL - Original URL
  • twitter:title - Twitter card title
  • twitter:description - Twitter card description

Output with Metadata

When metadata is included, the output changes from plain content to an object with separate content and metadata fields.
Markdown with metadata:
{
  "markdown": "# Article Title\n\nArticle content here...",
  "metadata": {
    "title": "Article Title | Site Name",
    "description": "A brief description of the article",
    "ogTitle": "Article Title",
    "language": "en",
    "sourceURL": "https://example.com/article"
  }
}
Accessing content with metadata:
{{step_n.output.markdown}}
{{step_n.output.metadata.title}}
{{step_n.output.metadata.description}}
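
Because the output shape depends on the Include Metadata setting, downstream code may need to handle both cases. This minimal sketch assumes the Markdown output is a plain string without metadata and an object with `markdown` and `metadata` keys with it, as the examples above suggest. `split_output` and `page_title` are hypothetical helpers.

```python
def split_output(output):
    """Return (markdown_text, metadata) for either output shape."""
    if isinstance(output, dict):
        return output.get("markdown", ""), output.get("metadata", {})
    return output, {}


def page_title(metadata, default="Untitled"):
    """Prefer the Open Graph title, then the page title, then a default."""
    return metadata.get("ogTitle") or metadata.get("title") or default


with_metadata = {
    "markdown": "# Article Title\n\nArticle content here...",
    "metadata": {"title": "Article Title | Site Name", "ogTitle": "Article Title"},
}
without_metadata = "# Article Title\n\nArticle content here..."

text_a, meta_a = split_output(with_metadata)
text_b, meta_b = split_output(without_metadata)
```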

Best Practices

  • Use “Only Main Content” for cleaner article extraction
  • Choose Markdown format when feeding content to LLM blocks
  • Use AI JSON format when you need specific structured data
  • Include metadata when you need page titles or descriptions
  • Combine with Google Search to scrape top-ranking pages
  • Test scraping on a single URL before running bulk operations

Common Use Cases

| Use Case | Configuration Tips |
| --- | --- |
| Content research | Markdown format + Only Main Content for clean articles |
| Competitor analysis | AI JSON to extract specific data points |
| Price monitoring | AI JSON with prompt for price and product details |
| Lead generation | AI JSON to extract contact information |
| SEO analysis | Include Metadata to get title tags and descriptions |
| Content aggregation | Loop through URLs, scrape each in Markdown |

Example Workflow: Competitor Content Analysis

Analyze content from top-ranking pages:
  1. Google Search Block: Search for target keyword
  2. Loop Block: Iterate through top 5 organic results
  3. Web Scrape Block:
    • URL: {{current.link}}
    • Format: Markdown
    • Only Main Content: On
  4. LLM Block: Analyze content themes and structure
  5. Google Sheets Block: Store analysis results
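
The data flow of this workflow can be sketched in Python. The blocks are configured in the workflow editor rather than written as code, so everything here is a stand-in: `search`, `scrape`, and `analyze` are hypothetical callables representing the Google Search, Web Scrape, and LLM blocks, and the fake implementations exist only so the sketch runs offline.

```python
def analyze_competitors(keyword, search, scrape, analyze, top_n=5):
    """Search, scrape the top results as Markdown, and analyze each page."""
    organic = search(keyword)["organic"][:top_n]
    rows = []
    for current in organic:  # mirrors the Loop Block's {{current.link}}
        markdown = scrape(current["link"], result_format="markdown", only_main_content=True)
        rows.append({"url": current["link"], "analysis": analyze(markdown)})
    return rows  # in the workflow, these rows go to the Google Sheets Block


# Stand-in implementations so the sketch runs without network access.
def fake_search(keyword):
    return {"organic": [{"link": f"https://example.com/{keyword}/{i}"} for i in range(8)]}


def fake_scrape(url, result_format, only_main_content):
    return f"# Page at {url}\n\nBody text."


def fake_analyze(markdown):
    return {"heading_count": markdown.count("# ")}


rows = analyze_competitors("seo", fake_search, fake_scrape, fake_analyze)
```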

Example Workflow: Product Data Extraction

Extract product details from e-commerce pages:
  1. Google Sheets Block: Read list of product URLs
  2. Loop Block: Process each URL
  3. Web Scrape Block:
    • URL: {{current.product_url}}
    • Format: AI JSON
    • Prompt: “Extract product name, price, rating, number of reviews, and availability status”
  4. Google Sheets Block: Append extracted data
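
Before appending to a sheet, it helps to map the extracted JSON onto fixed columns so that missing fields become blanks instead of shifting values. The sketch below writes CSV text as a stand-in for the sheet. The column names and the sample values are assumptions, since the keys the AI returns depend on the prompt wording.

```python
import csv
import io

# Assumed column names; the keys the AI returns depend on the prompt wording.
COLUMNS = ["product_url", "product_name", "price", "rating", "review_count", "availability"]


def to_row(product_url, extracted):
    """Build one spreadsheet row, leaving blanks for fields the AI did not return."""
    row = {"product_url": product_url}
    for column in COLUMNS[1:]:
        row[column] = extracted.get(column, "")
    return row


# Made-up extraction result with one field (availability) missing.
extracted = {
    "product_name": "Wireless Bluetooth Headphones",
    "price": "$79.99",
    "rating": 4.5,
    "review_count": 1280,
}

row = to_row("https://example.com/products/headphones", extracted)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```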

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Empty output | The site blocks scraping | Try a different URL, or check the site's robots.txt |
| Missing content | JavaScript-rendered content did not load | The scraper renders JavaScript, so content should normally load; contact support if the problem persists |
| Timeout | The page responds too slowly | Reduce concurrent scrapes or try again later |
| Prompt required error | AI JSON Format selected without a prompt | Add an extraction prompt |
| Incomplete JSON | The prompt is too vague | Name each field you want extracted |
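
Two of these issues, timeouts and incomplete JSON, can also be handled defensively in code that calls a scraper. The sketch below is a generic pattern, not a feature of the block: `scrape` is a hypothetical callable assumed to raise `TimeoutError` on slow pages, and the retry counts and delays are arbitrary.

```python
import time


def missing_fields(extracted, required):
    """List required fields that are absent or empty in the extracted JSON."""
    return [field for field in required if extracted.get(field) in (None, "", [])]


def scrape_with_retries(scrape, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call scrape(url), retrying on TimeoutError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return scrape(url)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller see the timeout
            sleep(base_delay * 2 ** attempt)


# Stand-in scraper that times out twice, then succeeds with a partial result.
calls = {"count": 0}


def flaky_scrape(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page too slow")
    return {"product_name": "Wireless Bluetooth Headphones", "price": ""}


delays = []
result = scrape_with_retries(flaky_scrape, "https://example.com/product", sleep=delays.append)
gaps = missing_fields(result, ["product_name", "price", "features"])
```

A non-empty `gaps` list is a signal to tighten the prompt by naming the missing fields explicitly.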
