# WebScraper API Documentation

## Overview

The WebScraper API provides endpoints for web scraping, content extraction, and video transcript extraction. All endpoints except the public `/health` check require authentication via API tokens.

Base URL: `https://webscraper.bearwood.ai`
## Authentication

All API endpoints (other than `/health`) require authentication using API tokens. Include the token in the `Authorization` header:

```
Authorization: Bearer wst_your_api_token_here
```

API tokens must be generated by an administrator. Contact your administrator to obtain one.
## API Endpoints

### Web Scraping

`POST /api/v1/scrape`

Extract content from a webpage and convert it to markdown.
#### Request Body

```json
{
  "url": "https://example.com/article",
  "wait_for_selector": "article",  // Optional: CSS selector to wait for
  "wait_timeout": 30000,           // Optional: timeout in milliseconds (default: 30000)
  "include_images": true,          // Optional: include images in markdown (default: true)
  "include_links": true,           // Optional: include links in markdown (default: true)
  "custom_headers": {              // Optional: custom HTTP headers
    "User-Agent": "Custom Agent"
  },
  "use_cache": true,               // Optional: use cached content (default: true)
  "force_refresh": false           // Optional: force fresh scraping (default: false)
}
```
#### Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The URL to scrape |
| `wait_for_selector` | string | No | CSS selector to wait for before scraping |
| `wait_timeout` | integer | No | Timeout in milliseconds (1000-120000, default: 30000) |
| `include_images` | boolean | No | Include images in the markdown output (default: true) |
| `include_links` | boolean | No | Include links in the markdown output (default: true) |
| `custom_headers` | object | No | Custom HTTP headers to send with the request |
| `use_cache` | boolean | No | Use cached content if available (default: true) |
| `force_refresh` | boolean | No | Force fresh scraping, bypassing the cache (default: false) |
#### Response

```json
{
  "success": true,
  "data": {
    "url": "https://example.com/article",
    "title": "Article Title",
    "markdown": "# Article Title\n\nContent in markdown format...",
    "html": "...",
    "text": "Plain text content...",
    "metadata": {
      "title": "Article Title",
      "description": "Article description",
      "author": "John Doe",
      "published_date": "2024-01-19T10:00:00Z",
      "keywords": ["tech", "news"],
      "language": "en"
    },
    "extracted_at": "2024-01-19T10:00:00Z"
  },
  "status": "completed",
  "request_id": "req_123456",
  "cached": false
}
```
### YouTube Transcript Extraction

`POST /api/v1/transcript`

Extract transcripts from YouTube videos with timestamps.
#### Request Body

```json
{
  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "language": "en",  // Optional: preferred language code
  "use_cache": true  // Optional: use cached transcript (default: true)
}
```
#### Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | YouTube video URL |
| `language` | string | No | Preferred language code (e.g., `en`, `es`, `fr`) |
| `use_cache` | boolean | No | Use cached transcript if available (default: true) |
#### Response

```json
{
  "success": true,
  "title": "Video Title",
  "description": "Video description...",
  "transcript": "- [00:00:00] Welcome to this video\n- [00:00:05] Today we'll discuss...",
  "metadata": {
    "video_id": "VIDEO_ID",
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "duration": 600,
    "duration_formatted": "0:10:00",
    "uploader": "Channel Name",
    "upload_date": "20240119",
    "view_count": 1234567,
    "language": "en",
    "transcript_type": "manual"  // or "automatic"
  },
  "error": null
}
```
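The `transcript` field is a single newline-delimited string. As a small sketch (assuming the `- [HH:MM:SS] text` line format shown above; the helper name is illustrative, not part of the API), it can be split into timestamped entries:

```python
import re

def parse_transcript(transcript: str) -> list[tuple[str, str]]:
    """Split the newline-delimited transcript string into
    (timestamp, text) pairs, based on the '- [HH:MM:SS] text' format."""
    pattern = re.compile(r"^- \[(\d{2}:\d{2}:\d{2})\] (.*)$")
    entries = []
    for line in transcript.split("\n"):
        match = pattern.match(line)
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries

sample = "- [00:00:00] Welcome to this video\n- [00:00:05] Today we'll discuss..."
print(parse_transcript(sample))
# [('00:00:00', 'Welcome to this video'), ('00:00:05', "Today we'll discuss...")]
```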
### Asynchronous Scraping

`POST /api/v1/scrape/async`

For long-running scrape jobs, use the async endpoint.

#### Request Body

Same parameters as `/api/v1/scrape`.
#### Response

```json
{
  "job_id": "job_123456789",
  "status": "pending",
  "message": "Scraping job queued"
}
```
### Check Async Job Status

`GET /api/v1/scrape/job/{job_id}`

Check the status and result of an asynchronous scraping job.
#### Response

```json
{
  "job_id": "job_123456789",
  "status": "completed",  // Options: "pending", "processing", "completed", "failed"
  "result": {
    "success": true,
    "url": "https://example.com/article",
    "title": "Article Title",
    "markdown": "# Article Title\n\nContent...",
    "metadata": {...}
  },
  "error": null,
  "created_at": "2024-01-19T10:00:00Z",
  "completed_at": "2024-01-19T10:00:05Z"
}
```
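A typical client submits a job to `/api/v1/scrape/async` and then polls this endpoint until the job reaches a terminal status. A minimal polling sketch, assuming the status values documented above (the helper names and interval defaults are illustrative, not part of the API):

```python
import time
import requests

BASE_URL = "https://webscraper.bearwood.ai"
TERMINAL_STATUSES = {"completed", "failed"}

def is_terminal(status: str) -> bool:
    """True once a job will no longer change state."""
    return status in TERMINAL_STATUSES

def poll_job(job_id: str, headers: dict,
             interval: float = 2.0, max_wait: float = 120.0) -> dict:
    """Poll GET /api/v1/scrape/job/{job_id} until completed or failed."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        resp = requests.get(f"{BASE_URL}/api/v1/scrape/job/{job_id}",
                            headers=headers)
        job = resp.json()
        if is_terminal(job["status"]):
            return job
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish within {max_wait}s")
```

Pick a polling interval appropriate to your job sizes; very tight loops will burn through the rate limit described below.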
### Health Check

`GET /health`

Check whether the API is running and healthy. No authentication required.
#### Response

```json
{
  "status": "healthy",
  "timestamp": "2024-01-19T10:00:00Z",
  "version": "1.0.0",
  "uptime": 3600.5,
  "cache_status": "connected",
  "playwright_status": "ready"
}
```
## Code Examples

### Python

```python
import requests

# Use your API token
API_TOKEN = "wst_your_api_token_here"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Extract a YouTube transcript
transcript_response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/transcript",
    headers=headers,
    json={
        "url": "https://www.youtube.com/watch?v=VIDEO_ID",
        "use_cache": True,
    },
)
transcript_data = transcript_response.json()
print(transcript_data["transcript"])

# Scrape a webpage
scrape_response = requests.post(
    "https://webscraper.bearwood.ai/api/v1/scrape",
    headers=headers,
    json={
        "url": "https://example.com/article",
        "wait_for_selector": "article",
        "use_cache": True,
    },
)
scrape_data = scrape_response.json()
print(scrape_data["data"]["markdown"])
```
### JavaScript

```javascript
// Use your API token
const API_TOKEN = 'wst_your_api_token_here';

// Scrape a webpage
const scrapeResponse = await fetch('https://webscraper.bearwood.ai/api/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/article',
    use_cache: true
  })
});
const scrapeData = await scrapeResponse.json();
console.log(scrapeData.data.markdown);

// Extract a YouTube transcript
const transcriptResponse = await fetch('https://webscraper.bearwood.ai/api/v1/transcript', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://www.youtube.com/watch?v=VIDEO_ID',
    language: 'en'
  })
});
const transcriptData = await transcriptResponse.json();
console.log(transcriptData.transcript);
```
### cURL

```bash
# Set your API token
API_TOKEN="wst_your_api_token_here"

# Extract a YouTube transcript
curl -X POST https://webscraper.bearwood.ai/api/v1/transcript \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.youtube.com/watch?v=VIDEO_ID",
    "use_cache": true
  }'

# Scrape a webpage
curl -X POST https://webscraper.bearwood.ai/api/v1/scrape \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "wait_for_selector": "article",
    "include_images": true
  }'
```
## Error Handling

All endpoints return a consistent error format:

```json
{
  "detail": "Error message",
  "type": "error_type",
  "status_code": 400
}
```
| Status Code | Description |
|---|---|
| 200 | Success |
| 400 | Bad Request: invalid parameters |
| 401 | Unauthorized: missing or invalid token |
| 403 | Forbidden: insufficient permissions |
| 404 | Not Found |
| 429 | Too Many Requests: rate limit exceeded |
| 500 | Internal Server Error |
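In a client, the `status_code` and `detail` fields map naturally onto an exception. A sketch of one way to do that (the class and function below are illustrative, not part of any official SDK):

```python
class WebScraperError(Exception):
    """Raised when the API returns a non-2xx error payload."""

    def __init__(self, payload: dict):
        self.status_code = payload.get("status_code")
        self.error_type = payload.get("type")
        super().__init__(
            f"{self.status_code} ({self.error_type}): {payload.get('detail')}"
        )

def raise_for_error(status_code: int, payload: dict) -> dict:
    """Return the payload on success; raise WebScraperError otherwise."""
    if 200 <= status_code < 300:
        return payload
    raise WebScraperError(payload)
```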
## Rate Limiting

The API implements rate limiting to prevent abuse:

- Default limit: 100 requests per minute for authenticated users
- Headers returned with each response:
  - `X-RateLimit-Limit`: maximum requests allowed
  - `X-RateLimit-Remaining`: requests remaining in the current window
  - `X-RateLimit-Reset`: Unix timestamp when the limit resets
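When a request comes back with status 429, the `X-RateLimit-Reset` header tells the client how long to wait before retrying. A small sketch of that calculation, assuming the header carries a Unix timestamp as documented above (the function name is illustrative):

```python
def seconds_until_reset(headers: dict, now: float) -> float:
    """Seconds to sleep before retrying after a 429, based on the
    X-RateLimit-Reset Unix timestamp. Falls back to 1s if missing."""
    reset = headers.get("X-RateLimit-Reset")
    if reset is None:
        return 1.0
    return max(0.0, float(reset) - now)

# Example: the limit resets 30 seconds from "now"
print(seconds_until_reset({"X-RateLimit-Reset": "1705658430"}, now=1705658400.0))  # 30.0
```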
## Important Notes

- Scraper type: only the Playwright scraper is supported, providing full browser automation with JavaScript support.
- YouTube cookies: for transcript extraction to work with age-restricted or private videos, valid YouTube cookies must be configured on the server.
- Caching: cached content is stored for 24 hours by default. Set `use_cache: false` to force fresh extraction.
- Timeouts: the default timeout is 30 seconds. For slow-loading sites, increase the `wait_timeout` value.