WebScraper API Documentation

Overview

The WebScraper API provides endpoints for web scraping, content extraction, and video transcript extraction. All API endpoints require authentication via API tokens.

Base URL: https://webscraper.bearwood.ai

Authentication

All API endpoints require authentication using API tokens. Include the token in the Authorization header:

Authorization: Bearer wst_your_api_token_here

API tokens are generated by an administrator; contact yours to obtain one.

API Endpoints
POST /api/v1/scrape - Web Scraping

Extract content from any webpage and convert it to markdown.

Request Body
{
  "url": "https://example.com/article",
  "wait_for_selector": "article",  // Optional: CSS selector to wait for
  "wait_timeout": 30000,           // Optional: Timeout in milliseconds (default: 30000)
  "include_images": true,          // Optional: Include images in markdown (default: true)
  "include_links": true,           // Optional: Include links in markdown (default: true)
  "custom_headers": {              // Optional: Custom HTTP headers
    "User-Agent": "Custom Agent"
  },
  "use_cache": true,               // Optional: Use cached content (default: true)
  "force_refresh": false           // Optional: Force fresh scraping (default: false)
}
Parameters
Parameter          Type     Required  Description
-----------------  -------  --------  -----------------------------------------------------
url                string   Yes       The URL to scrape
wait_for_selector  string   No        CSS selector to wait for before scraping
wait_timeout       integer  No        Timeout in milliseconds (1000-120000, default: 30000)
include_images     boolean  No        Include images in the markdown output (default: true)
include_links      boolean  No        Include links in the markdown output (default: true)
custom_headers     object   No        Custom HTTP headers to send with the request
use_cache          boolean  No        Use cached content if available (default: true)
force_refresh      boolean  No        Force fresh scraping, bypass cache (default: false)
Response
{
  "success": true,
  "data": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "markdown": "# Article Title\n\nContent in markdown format...",
    "html": "...",
    "text": "Plain text content...",
    "metadata": {
      "title": "Article Title",
      "description": "Article description",
      "author": "John Doe",
      "published_date": "2024-01-19T10:00:00Z",
      "keywords": ["tech", "news"],
      "language": "en"
    },
    "extracted_at": "2024-01-19T10:00:00Z"
  },
  "status": "completed",
  "request_id": "req_123456",
  "cached": false
}
POST /api/v1/transcript - YouTube Transcript Extraction

Extract transcripts from YouTube videos with timestamps.

Request Body
{
  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "language": "en",        // Optional: preferred language code
  "use_cache": true        // Optional: use cached transcript (default: true)
}
Parameters
Parameter  Type     Required  Description
---------  -------  --------  -------------------------------------------------
url        string   Yes       YouTube video URL
language   string   No        Preferred language code (e.g., 'en', 'es', 'fr')
use_cache  boolean  No        Use cached transcript if available (default: true)
Response
{
  "success": true,
  "title": "Video Title",
  "description": "Video description...",
  "transcript": "- [00:00:00] Welcome to this video\n- [00:00:05] Today we'll discuss...",
  "metadata": {
    "video_id": "VIDEO_ID",
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "duration": 600,
    "duration_formatted": "0:10:00",
    "uploader": "Channel Name",
    "upload_date": "20240119",
    "view_count": 1234567,
    "language": "en",
    "transcript_type": "manual"  // or "automatic"
  },
  "error": null
}
Note: For age-restricted or private videos, valid YouTube cookies must be configured on the server.
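
The transcript field is a single newline-delimited string of "- [HH:MM:SS] text" entries, as in the example response above. If you need structured (offset, text) pairs, a small parser along these lines works; the regex is an assumption based on that example format, not a guaranteed contract.

import re

# Matches lines like "- [00:00:05] Today we'll discuss..."
# (pattern assumed from the example transcript format above)
LINE_RE = re.compile(r"^- \[(\d{2}):(\d{2}):(\d{2})\] (.*)$")

def parse_transcript(transcript: str) -> list[tuple[int, str]]:
    """Split the transcript string into (seconds_offset, text) pairs."""
    entries = []
    for line in transcript.splitlines():
        match = LINE_RE.match(line)
        if match:
            hours, minutes, seconds, text = match.groups()
            offset = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            entries.append((offset, text))
    return entries

# parse_transcript("- [00:00:00] Welcome to this video")
# -> [(0, 'Welcome to this video')]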
POST /api/v1/scrape/async - Asynchronous Scraping

For long-running scrape jobs, use the async endpoint.

Request Body

Accepts the same request body as POST /api/v1/scrape.

Response
{
  "job_id": "job_123456789",
  "status": "pending",
  "message": "Scraping job queued"
}
GET /api/v1/scrape/job/{job_id} - Check Async Job Status

Check the status and result of an asynchronous scraping job.

Response
{
  "job_id": "job_123456789",
  "status": "completed",  // Options: "pending", "processing", "completed", "failed"
  "result": {
    "success": true,
    "url": "https://example.com/article",
    "title": "Article Title",
    "markdown": "# Article Title\n\nContent...",
    "metadata": {...}
  },
  "error": null,
  "created_at": "2024-01-19T10:00:00Z",
  "completed_at": "2024-01-19T10:00:05Z"
}
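
Putting the two async endpoints together: submit a job, then poll the job endpoint until it reaches "completed" or "failed". A minimal Python sketch (the 2-second poll interval is an arbitrary choice, and there is no overall timeout):

import time
import requests

API_TOKEN = "wst_your_api_token_here"
BASE_URL = "https://webscraper.bearwood.ai"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Submit the async job (same body as /api/v1/scrape)
job = requests.post(
    f"{BASE_URL}/api/v1/scrape/async",
    headers=headers,
    json={"url": "https://example.com/article"},
).json()

# Poll until the job finishes or fails
while True:
    status = requests.get(
        f"{BASE_URL}/api/v1/scrape/job/{job['job_id']}",
        headers=headers,
    ).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(2)  # arbitrary poll interval

if status["status"] == "completed":
    print(status["result"]["markdown"])
else:
    print("Job failed:", status["error"])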
GET /health - Health Check

Check if the API is running and healthy. No authentication required.

Response
{
  "status": "healthy",
  "timestamp": "2024-01-19T10:00:00Z",
  "version": "1.0.0",
  "uptime": 3600.5,
  "cache_status": "connected",
  "playwright_status": "ready"
}
Code Examples
Python
import requests

# Use your API token
API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Extract YouTube transcript
transcript_response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/transcript",
    headers=headers,
    json={
        "url": "https://www.youtube.com/watch?v=VIDEO_ID",
        "use_cache": True
    }
)
transcript_data = transcript_response.json()
print(transcript_data["transcript"])

# Scrape a webpage
scrape_response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/scrape",
    headers=headers,
    json={
        "url": "https://example.com/article",
        "wait_for_selector": "article",
        "use_cache": True
    }
)
scrape_data = scrape_response.json()
print(scrape_data["data"]["markdown"])
JavaScript
// Use your API token
const API_TOKEN = 'wst_your_api_token_here';

// Scrape a webpage
const scrapeResponse = await fetch('https://webscraper.bearwood.ai/api/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/article',
    use_cache: true
  })
});
const scrapeData = await scrapeResponse.json();
console.log(scrapeData.data.markdown);

// Extract YouTube transcript
const transcriptResponse = await fetch('https://webscraper.bearwood.ai/api/v1/transcript', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://www.youtube.com/watch?v=VIDEO_ID',
    language: 'en'
  })
});
const transcriptData = await transcriptResponse.json();
console.log(transcriptData.transcript);
cURL
# Set your API token
API_TOKEN="wst_your_api_token_here"

# Extract YouTube transcript
curl -X POST https://webscraper.bearwood.ai/api/v1/transcript \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "use_cache": true
  }'

# Scrape a webpage
curl -X POST https://webscraper.bearwood.ai/api/v1/scrape \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "wait_for_selector": "article",
    "include_images": true
  }'
Error Handling

All endpoints return error responses in a consistent shape (a handling sketch follows the status code table):

{
  "detail": "Error message",
  "type": "error_type",
  "status_code": 400
}
Status Code  Description
-----------  -----------------------------------------
200          Success
400          Bad Request - Invalid parameters
401          Unauthorized - Missing or invalid token
403          Forbidden - Insufficient permissions
404          Not Found
429          Too Many Requests - Rate limit exceeded
500          Internal Server Error
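
In client code, branch on the HTTP status and fall back to the detail field from the error body. A minimal sketch based on the error shape documented above:

import requests

API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/scrape",
    headers=headers,
    json={"url": "https://example.com/article"},
)

if response.status_code == 200:
    print(response.json()["data"]["markdown"])
else:
    # Error responses carry "detail", "type", and "status_code"
    error = response.json()
    print(f"Request failed ({response.status_code}): {error['detail']}")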
Rate Limiting

The API implements rate limiting to prevent abuse (see the backoff sketch after this list):

  • Default limit: 100 requests per minute for authenticated users
  • Headers returned:
    • X-RateLimit-Limit: Maximum requests allowed
    • X-RateLimit-Remaining: Requests remaining
    • X-RateLimit-Reset: Unix timestamp when limit resets
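
One way to stay under the limit is to sleep until X-RateLimit-Reset whenever a 429 comes back. The sketch below assumes the reset header is present on 429 responses:

import time
import requests

API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def scrape_with_backoff(url: str, retries: int = 3) -> dict:
    """POST to /api/v1/scrape, sleeping until the rate limit resets on 429."""
    for _ in range(retries):
        response = requests.post(
            "https://webscraper.bearwood.ai/api/v1/scrape",
            headers=headers,
            json={"url": url},
        )
        if response.status_code != 429:
            return response.json()
        # X-RateLimit-Reset is a Unix timestamp; wait until then
        reset_at = int(response.headers.get("X-RateLimit-Reset", "0"))
        time.sleep(max(reset_at - time.time(), 1))
    raise RuntimeError("still rate limited after retries")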
Important Notes
  • Scraper Type: Only the Playwright scraper is supported; it provides full browser automation with JavaScript support.
  • YouTube Cookies: For YouTube transcript extraction to work with age-restricted or private videos, valid YouTube cookies must be configured on the server.
  • Caching: Cached content is stored for 24 hours by default. Use use_cache: false to force fresh extraction, as shown in the sketch after this list.
  • Timeouts: Default timeout is 30 seconds. For slow-loading sites, increase the wait_timeout value.
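
For example, to bypass a stale cache entry and scrape the live page (both cache flags shown for clarity; per the parameter table, either should skip the cache):

import requests

API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/scrape",
    headers=headers,
    json={
        "url": "https://example.com/article",
        "use_cache": False,     # don't serve a cached copy
        "force_refresh": True,  # re-scrape even if a copy exists
    },
)
print(response.json()["cached"])  # expected: false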