🧩 Custom Development

How to Build an OpenClaw Web Scraping Skill

Advanced · 2-4 hours · Updated 2025-01-18

Web scraping with OpenClaw enables automated data extraction from websites: product prices, job listings, news articles, competitor data, and more. This advanced guide covers building a robust scraping skill with Playwright (for JavaScript-heavy sites) or Cheerio (for static HTML), including pagination, error handling, and anti-bot measures.

Why This Is Hard to Do Yourself

These are the common pitfalls that trip people up.

🤖 Anti-bot detection and blocking

Modern sites use Cloudflare, Imperva, and fingerprinting to block scrapers, and headless-browser detection is sophisticated.

🔄 Dynamic content and pagination

JavaScript-rendered content, infinite scroll, and complex pagination require browser automation, not just HTTP requests.

⏱️ Rate limiting and politeness

Aggressive scraping gets you IP-banned. You need delays, rotating proxies, and respect for robots.txt.

💾 Data extraction reliability

Websites change their HTML structure constantly. Selectors break without warning and need fallback strategies.

🧹 Data cleaning and normalization

Scraped data is messy: extra whitespace, inconsistent formats, HTML entities. Output needs cleaning and validation.

Step-by-Step Guide

Step 1

Choose scraping approach (Playwright vs Cheerio)

# Decision matrix:

# Use Cheerio (fast, simple) if:
# - Site is server-rendered HTML
# - No JavaScript required to load content
# - Static pagination
# - No login required

# Use Playwright (slower, powerful) if:
# - Content loads via JavaScript (React, Vue, etc.)
# - Infinite scroll or lazy loading
# - Forms, logins, or interactions required
# - Anti-bot detection present

# For this guide, we'll use Playwright (more common case)

# Install dependencies:
npm install playwright cheerio

# Download the browser binary Playwright will drive:
npx playwright install chromium
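One quick way to decide: fetch the raw HTML and check whether the target elements are already there. Below is a sketch with a placeholder URL and selector (Node 18+ assumed for the global fetch); if the count is zero here but the data shows up in a browser, the content is JavaScript-rendered and Playwright is the right choice.

// Quick probe: is the data in the server-rendered HTML?
import * as cheerio from 'cheerio';

const html = await (await fetch('https://example.com/products')).text();
const $ = cheerio.load(html);
console.log($('.product-card').length); // 0 => likely JS-rendered, reach for Playwright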
Step 2

Create the scraping skill

# Create skill structure:
mkdir -p ~/.openclaw/skills/web-scraper/scripts
cat > ~/.openclaw/skills/web-scraper/skill.md << 'EOF'
---
name: web-scraper
version: 1.0.0
description: Extracts structured data from websites
permissions:
  - network:outbound
  - process:spawn
  - filesystem:write
triggers:
  - command: /scrape
  - pattern: "scrape|extract data from"
---

## Instructions

You are a web scraping specialist.

When asked to scrape a website:
1. Determine if Playwright or Cheerio is needed
2. Navigate to the target URL
3. Extract the requested data using CSS selectors or XPath
4. Handle pagination if needed
5. Clean and normalize the output
6. Return structured data (JSON or CSV)
7. Respect rate limits and robots.txt
EOF

Warning: Web scraping may violate a website's Terms of Service. Always check robots.txt and terms before scraping. Some sites explicitly prohibit automated access.

Step 3

Implement URL parsing and validation

// ~/.openclaw/skills/web-scraper/scripts/scraper.js

import { chromium } from 'playwright';
import * as cheerio from 'cheerio';

export async function scrape(url, options = {}) {
  // Validate URL
  try {
    new URL(url);
  } catch {
    throw new Error('Invalid URL provided');
  }

  // Check robots.txt (simplified)
  if (options.respectRobotsTxt) {
    await checkRobotsTxt(url);
  }

  // Choose scraping method
  if (options.usePlaywright) {
    return await scrapeWithPlaywright(url, options);
  } else {
    return await scrapeWithCheerio(url, options);
  }
}

async function checkRobotsTxt(url) {
  const { origin } = new URL(url);
  const robotsUrl = `${origin}/robots.txt`;

  try {
    const response = await fetch(robotsUrl);
    const text = await response.text();

    // Simplified check - production should use robots-parser library
    if (text.includes('Disallow: /')) {
      console.warn('Site may disallow scraping. Check robots.txt manually.');
    }
  } catch {
    // robots.txt not found - proceed with caution
  }
}
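As the comment above notes, a production check should parse the rules properly. One option is the robots-parser npm package; here is a sketch (the user-agent string is a placeholder, and Node 18+ is assumed for the global fetch):

// Sketch: stricter robots.txt check with the robots-parser package
// npm install robots-parser
import robotsParser from 'robots-parser';

async function isAllowedByRobots(url, userAgent = 'OpenClawScraper') {
  const { origin } = new URL(url);
  const robotsUrl = `${origin}/robots.txt`;

  const response = await fetch(robotsUrl);
  if (!response.ok) return true; // no robots.txt found - proceed with caution

  const robots = robotsParser(robotsUrl, await response.text());
  // Treat only an explicit "false" as a block; unrelated hosts can return undefined
  return robots.isAllowed(url, userAgent) !== false;
}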
Step 4

Add data extraction logic

// Playwright-based scraping with selectors:

async function scrapeWithPlaywright(url, options) {
  const browser = await chromium.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    viewport: { width: 1280, height: 720 }
  });

  const page = await context.newPage();

  try {
    await page.goto(url, { waitUntil: 'networkidle' });

    // Wait for content to load
    if (options.waitFor) {
      await page.waitForSelector(options.waitFor);
    }

    // Extract data using provided selectors
    const data = await page.evaluate((selectors) => {
      const results = [];
      const items = document.querySelectorAll(selectors.item);

      items.forEach(item => {
        const result = {};
        for (const [key, selector] of Object.entries(selectors.fields)) {
          const element = item.querySelector(selector);
          result[key] = element ? element.textContent.trim() : null;
        }
        results.push(result);
      });

      return results;
    }, options.selectors);

    return data;
  } finally {
    await browser.close();
  }
}
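A custom user agent and viewport, as set above, only go so far against active bot detection. A common next step is launching through playwright-extra with the stealth plugin; the sketch below assumes those packages, and the exact import style and plugin coverage should be checked against their docs:

// Sketch: stealth-patched Chromium via playwright-extra
// npm install playwright-extra puppeteer-extra-plugin-stealth
import { chromium as stealthChromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

stealthChromium.use(StealthPlugin()); // patches common headless fingerprints before launch

async function launchStealthBrowser() {
  return stealthChromium.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
}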

// Example usage:
// scrape('https://example.com/products', {
//   usePlaywright: true,
//   selectors: {
//     item: '.product-card',
//     fields: {
//       title: '.product-title',
//       price: '.product-price',
//       url: 'a'
//     }
//   }
// });
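scrapeWithCheerio, referenced in Step 3, never needs a browser. Here is a sketch using the same selectors shape as the Playwright version (Node 18+ assumed for the global fetch; it reuses the cheerio import from Step 3):

// Cheerio path for server-rendered HTML - same selectors shape as above
async function scrapeWithCheerio(url, options) {
  const response = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36' }
  });
  if (!response.ok) {
    throw new Error(`Request failed with HTTP ${response.status}`);
  }

  const $ = cheerio.load(await response.text());
  const results = [];

  $(options.selectors.item).each((_, el) => {
    const result = {};
    for (const [key, selector] of Object.entries(options.selectors.fields)) {
      const match = $(el).find(selector).first();
      result[key] = match.length ? match.text().trim() : null;
    }
    results.push(result);
  });

  return results;
}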
Step 5

Handle pagination and multiple pages

// Add pagination support:

export async function scrapeMultiplePages(url, options) {
  const allData = [];
  let currentPage = 1;
  let hasNextPage = true;

  while (hasNextPage && currentPage <= (options.maxPages || 10)) {
    console.log(`Scraping page ${currentPage}...`);

    // Build paginated URL
    const pageUrl = options.paginationTemplate
      ? options.paginationTemplate.replace('{page}', currentPage)
      : `${url}?page=${currentPage}`;

    // Scrape this page
    const pageData = await scrape(pageUrl, options);
    allData.push(...pageData);

    // Check if there's a next page
    // (This logic varies by site - example only)
    if (pageData.length === 0) {
      hasNextPage = false;
    }

    currentPage++;

    // Polite delay between requests
    await sleep(options.delayMs || 2000);
  }

  return allData;
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Example with pagination:
// scrapeMultiplePages('https://example.com/products', {
//   paginationTemplate: 'https://example.com/products?page={page}',
//   maxPages: 5,
//   delayMs: 3000
// });
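The example above assumes URL-based pagination. For infinite scroll, one Playwright-only approach is to keep scrolling until the item count stops growing; the sketch below assumes a page object like the one opened in Step 4 (the selector and limits are placeholders):

// Sketch: scroll until no new items appear, then extract as in Step 4
async function loadAllItems(page, itemSelector, maxScrolls = 20) {
  let previousCount = 0;
  for (let i = 0; i < maxScrolls; i++) {
    await page.mouse.wheel(0, 2000);       // scroll down a step
    await page.waitForTimeout(1500);       // give lazy-loaded content time to render
    const count = await page.locator(itemSelector).count();
    if (count === previousCount) break;    // nothing new appeared - stop scrolling
    previousCount = count;
  }
}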

Warning: Always add delays between pages. Scraping too fast is rude, wastes server resources, and will get you IP-banned quickly.

Step 6

Configure output formatting and cleaning

// Clean and normalize scraped data:

export function cleanData(data) {
  return data.map(item => {
    const cleaned = {};

    for (const [key, value] of Object.entries(item)) {
      if (typeof value === 'string') {
        // Remove extra whitespace
        let cleanValue = value.replace(/\s+/g, ' ').trim();

        // Decode HTML entities
        cleanValue = decodeHtmlEntities(cleanValue);

        // Remove common cruft
        cleanValue = cleanValue.replace(/^[\s\n\t]+|[\s\n\t]+$/g, '');

        cleaned[key] = cleanValue;
      } else {
        cleaned[key] = value;
      }
    }

    return cleaned;
  });
}

function decodeHtmlEntities(text) {
  const entities = {
    '&amp;': '&',
    '&lt;': '<',
    '&gt;': '>',
    '&quot;': '"',
    '&#39;': "'"
  };

  return text.replace(/&[^;]+;/g, match => entities[match] || match);
}

// Export to CSV or JSON:
export function exportData(data, format = 'json') {
  if (format === 'json') {
    return JSON.stringify(data, null, 2);
  } else if (format === 'csv') {
    const headers = Object.keys(data[0] || {});
    const rows = data.map(item =>
      // JSON.stringify is a quoting shortcut; use a CSV library for full RFC 4180 escaping
      headers.map(h => JSON.stringify(item[h] ?? '')).join(',')
    );
    return [headers.join(','), ...rows].join('\n');
  }
  throw new Error(`Unsupported export format: ${format}`);
}
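Putting Steps 4 through 6 together, here is a usage sketch; the URL, selectors, and output path are placeholders, and it assumes the functions above are all exported from scraper.js:

// Example: scrape a few pages, clean the results, and write them to disk
import { writeFile } from 'node:fs/promises';
import { scrapeMultiplePages, cleanData, exportData } from './scraper.js';

const raw = await scrapeMultiplePages('https://example.com/products', {
  usePlaywright: true,
  selectors: {
    item: '.product-card',
    fields: { title: '.product-title', price: '.product-price' }
  },
  maxPages: 3,
  delayMs: 2000
});

await writeFile('products.csv', exportData(cleanData(raw), 'csv'), 'utf8');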
Step 7

Add error handling and rate limiting

// Robust scraping with retries and rate limiting:

export async function scrapeWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const data = await scrape(url, options);
      const cleaned = cleanData(data);
      return cleaned;
    } catch (error) {
      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
      }

      // Exponential backoff
      const delayMs = 1000 * Math.pow(2, attempt);
      console.warn(`Attempt ${attempt} failed, retrying in ${delayMs}ms...`);
      await sleep(delayMs);
    }
  }
}

// Rate limiter:
class RateLimiter {
  constructor(maxRequestsPerSecond = 1) {
    this.maxRequests = maxRequestsPerSecond;
    this.requests = [];
  }

  async waitForSlot() {
    const now = Date.now();
    this.requests = this.requests.filter(time => now - time < 1000);

    if (this.requests.length >= this.maxRequests) {
      const oldestRequest = Math.min(...this.requests);
      const waitTime = 1000 - (now - oldestRequest);
      await sleep(waitTime);
    }

    this.requests.push(Date.now());
  }
}

const limiter = new RateLimiter(1); // 1 request per second

export async function scrapeSafely(url, options) {
  await limiter.waitForSlot();
  return await scrapeWithRetry(url, options);
}
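scrapeWithRetry treats every failure the same way. In practice it helps to recognize block and rate-limit responses explicitly so the backoff can be longer or a proxy rotated; the sketch below could stand in for the page.goto call from Step 4 (the status codes listed are typical, not exhaustive):

// Sketch: turn block/rate-limit responses into explicit errors the retry logic can see
async function gotoOrThrow(page, url) {
  const response = await page.goto(url, { waitUntil: 'networkidle' });
  if (response && [403, 429, 503].includes(response.status())) {
    throw new Error(`Blocked or rate-limited (HTTP ${response.status()}) at ${url}`);
  }
  return response;
}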

Web Scraping That Actually Works in Production

Anti-bot detection, dynamic content, pagination edge cases, rate limiting: web scraping is full of challenges. Our experts build scrapers that stay online and extract clean data reliably.

Get matched with a specialist who can help.

Sign Up for Expert Help →

Frequently Asked Questions