WebScraper API Documentation

Overview

The WebScraper API provides endpoints for web scraping, content extraction, and video transcript extraction. All API endpoints require authentication via API tokens.

Base URL: https://webscraper.bearwood.ai

Authentication

All API endpoints require authentication using API tokens. Include the token in the Authorization header:

Authorization: Bearer wst_your_api_token_here

API tokens are generated by an administrator; contact yours to obtain one.

API Endpoints
POST /api/v1/scrape - Web Scraping

Extract content from any webpage and convert it to markdown.

Request Body
{
  "url": "https://example.com/article",
  "wait_for_selector": "article",  // Optional: CSS selector to wait for
  "wait_timeout": 30000,           // Optional: Timeout in milliseconds (default: 30000)
  "include_images": true,          // Optional: Include images in markdown (default: true)
  "include_links": true,           // Optional: Include links in markdown (default: true)
  "custom_headers": {              // Optional: Custom HTTP headers
    "User-Agent": "Custom Agent"
  },
  "use_cache": true,               // Optional: Use cached content (default: true)
  "force_refresh": false           // Optional: Force fresh scraping (default: false)
}
Parameters
Parameter          Type     Required  Description
-----------------  -------  --------  -----------------------------------------------------
url                string   Yes       The URL to scrape
wait_for_selector  string   No        CSS selector to wait for before scraping
wait_timeout       integer  No        Timeout in milliseconds (1000-120000, default: 30000)
include_images     boolean  No        Include images in the markdown output (default: true)
include_links      boolean  No        Include links in the markdown output (default: true)
custom_headers     object   No        Custom HTTP headers to send with the request
use_cache          boolean  No        Use cached content if available (default: true)
force_refresh      boolean  No        Force fresh scraping, bypass cache (default: false)
Response
{
  "success": true,
  "data": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "markdown": "# Article Title\n\nContent in markdown format...",
    "html": "...",
    "text": "Plain text content...",
    "metadata": {
      "title": "Article Title",
      "description": "Article description",
      "author": "John Doe",
      "published_date": "2024-01-19T10:00:00Z",
      "keywords": ["tech", "news"],
      "language": "en"
    },
    "extracted_at": "2024-01-19T10:00:00Z"
  },
  "status": "completed",
  "request_id": "req_123456",
  "cached": false
}
POST /api/v1/transcript - YouTube Transcript Extraction

Extract transcripts from YouTube videos with timestamps.

Request Body
{
  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "language": "en",        // Optional: preferred language code
  "use_cache": true        // Optional: use cached transcript (default: true)
}
Parameters
Parameter  Type     Required  Description
---------  -------  --------  -------------------------------------------------
url        string   Yes       YouTube video URL
language   string   No        Preferred language code (e.g., 'en', 'es', 'fr')
use_cache  boolean  No        Use cached transcript if available (default: true)
Response
{
  "success": true,
  "title": "Video Title",
  "description": "Video description...",
  "transcript": "- [00:00:00] Welcome to this video\n- [00:00:05] Today we'll discuss...",
  "metadata": {
    "video_id": "VIDEO_ID",
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "duration": 600,
    "duration_formatted": "0:10:00",
    "uploader": "Channel Name",
    "upload_date": "20240119",
    "view_count": 1234567,
    "language": "en",
    "transcript_type": "manual"  // or "automatic"
  },
  "error": null
}
Note: For age-restricted or private videos, valid YouTube cookies must be configured on the server.
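
The transcript field is a single newline-delimited string of "- [HH:MM:SS] text" entries, as in the example response above. If you need structured (offset, text) pairs, a small parser along these lines works; the regex is an assumption based on that example format, not a guaranteed contract.

import re

# Matches lines like "- [00:00:05] Today we'll discuss..."
# (pattern assumed from the example transcript format above)
LINE_RE = re.compile(r"^- \[(\d{2}):(\d{2}):(\d{2})\] (.*)$")

def parse_transcript(transcript: str) -> list[tuple[int, str]]:
    """Split the transcript string into (seconds_offset, text) pairs."""
    entries = []
    for line in transcript.splitlines():
        match = LINE_RE.match(line)
        if match:
            hours, minutes, seconds, text = match.groups()
            offset = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            entries.append((offset, text))
    return entries

# parse_transcript("- [00:00:00] Welcome to this video")
# -> [(0, 'Welcome to this video')]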
POST /api/v1/scrape/async - Asynchronous Scraping

For long-running scrape jobs, use the async endpoint.

Request Body

Accepts the same request body as POST /api/v1/scrape.

Response
{
  "job_id": "job_123456789",
  "status": "pending",
  "message": "Scraping job queued"
}
GET /api/v1/scrape/job/{job_id} - Check Async Job Status

Check the status and result of an asynchronous scraping job.

Response
{
  "job_id": "job_123456789",
  "status": "completed",  // Options: "pending", "processing", "completed", "failed"
  "result": {
    "success": true,
    "url": "https://example.com/article",
    "title": "Article Title",
    "markdown": "# Article Title\n\nContent...",
    "metadata": {...}
  },
  "error": null,
  "created_at": "2024-01-19T10:00:00Z",
  "completed_at": "2024-01-19T10:00:05Z"
}
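
Putting the two async endpoints together: submit a job, then poll the job endpoint until it reaches "completed" or "failed". A minimal Python sketch (the 2-second poll interval is an arbitrary choice, and there is no overall timeout):

import time
import requests

API_TOKEN = "wst_your_api_token_here"
BASE_URL = "https://webscraper.bearwood.ai"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Submit the async job (same body as /api/v1/scrape)
job = requests.post(
    f"{BASE_URL}/api/v1/scrape/async",
    headers=headers,
    json={"url": "https://example.com/article"},
).json()

# Poll until the job finishes or fails
while True:
    status = requests.get(
        f"{BASE_URL}/api/v1/scrape/job/{job['job_id']}",
        headers=headers,
    ).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(2)  # arbitrary poll interval

if status["status"] == "completed":
    print(status["result"]["markdown"])
else:
    print("Job failed:", status["error"])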
GET /health - Health Check

Check if the API is running and healthy. No authentication required.

Response
{
  "status": "healthy",
  "timestamp": "2024-01-19T10:00:00Z",
  "version": "1.0.0",
  "uptime": 3600.5,
  "cache_status": "connected",
  "playwright_status": "ready"
}
Code Examples
Python
import requests

# Use your API token
API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Extract YouTube transcript
transcript_response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/transcript",
    headers=headers,
    json={
        "url": "https://www.youtube.com/watch?v=VIDEO_ID",
        "use_cache": True
    }
)
transcript_data = transcript_response.json()
print(transcript_data["transcript"])

# Scrape a webpage
scrape_response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/scrape",
    headers=headers,
    json={
        "url": "https://example.com/article",
        "wait_for_selector": "article",
        "use_cache": True
    }
)
scrape_data = scrape_response.json()
print(scrape_data["data"]["markdown"])
JavaScript
// Use your API token
const API_TOKEN = 'wst_your_api_token_here';

// Scrape a webpage
const scrapeResponse = await fetch('https://webscraper.bearwood.ai/api/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/article',
    use_cache: true
  })
});
const scrapeData = await scrapeResponse.json();
console.log(scrapeData.data.markdown);

// Extract YouTube transcript
const transcriptResponse = await fetch('https://webscraper.bearwood.ai/api/v1/transcript', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://www.youtube.com/watch?v=VIDEO_ID',
    language: 'en'
  })
});
const transcriptData = await transcriptResponse.json();
console.log(transcriptData.transcript);
cURL
# Set your API token
API_TOKEN="wst_your_api_token_here"

# Extract YouTube transcript
curl -X POST https://webscraper.bearwood.ai/api/v1/transcript \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "use_cache": true
  }'

# Scrape a webpage
curl -X POST https://webscraper.bearwood.ai/api/v1/scrape \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "wait_for_selector": "article",
    "include_images": true
  }'
Error Handling

All endpoints return error responses in a consistent shape (a handling sketch follows the status code table):

{
  "detail": "Error message",
  "type": "error_type",
  "status_code": 400
}
Status Code  Description
-----------  -----------------------------------------
200          Success
400          Bad Request - Invalid parameters
401          Unauthorized - Missing or invalid token
403          Forbidden - Insufficient permissions
404          Not Found
429          Too Many Requests - Rate limit exceeded
500          Internal Server Error
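
In client code, branch on the HTTP status and fall back to the detail field from the error body. A minimal sketch based on the error shape documented above:

import requests

API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/scrape",
    headers=headers,
    json={"url": "https://example.com/article"},
)

if response.status_code == 200:
    print(response.json()["data"]["markdown"])
else:
    # Error responses carry "detail", "type", and "status_code"
    error = response.json()
    print(f"Request failed ({response.status_code}): {error['detail']}")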
Rate Limiting

The API implements rate limiting to prevent abuse (see the backoff sketch after this list):

  • Default limit: 100 requests per minute for authenticated users
  • Headers returned:
    • X-RateLimit-Limit: Maximum requests allowed
    • X-RateLimit-Remaining: Requests remaining
    • X-RateLimit-Reset: Unix timestamp when limit resets
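
One way to stay under the limit is to sleep until X-RateLimit-Reset whenever a 429 comes back. The sketch below assumes the reset header is present on 429 responses:

import time
import requests

API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def scrape_with_backoff(url: str, retries: int = 3) -> dict:
    """POST to /api/v1/scrape, sleeping until the rate limit resets on 429."""
    for _ in range(retries):
        response = requests.post(
            "https://webscraper.bearwood.ai/api/v1/scrape",
            headers=headers,
            json={"url": url},
        )
        if response.status_code != 429:
            return response.json()
        # X-RateLimit-Reset is a Unix timestamp; wait until then
        reset_at = int(response.headers.get("X-RateLimit-Reset", "0"))
        time.sleep(max(reset_at - time.time(), 1))
    raise RuntimeError("still rate limited after retries")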
Important Notes
  • Scraper Type: Only the Playwright scraper is supported; it provides full browser automation with JavaScript support.
  • YouTube Cookies: For YouTube transcript extraction to work with age-restricted or private videos, valid YouTube cookies must be configured on the server.
  • Caching: Cached content is stored for 24 hours by default. Use use_cache: false to force fresh extraction, as shown in the sketch after this list.
  • Timeouts: Default timeout is 30 seconds. For slow-loading sites, increase the wait_timeout value.
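
For example, to bypass a stale cache entry and scrape the live page (both cache flags shown for clarity; per the parameter table, either should skip the cache):

import requests

API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/scrape",
    headers=headers,
    json={
        "url": "https://example.com/article",
        "use_cache": False,     # don't serve a cached copy
        "force_refresh": True,  # re-scrape even if a copy exists
    },
)
print(response.json()["cached"])  # expected: false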