Spider is a high-performance web crawler and web scraping library written in Rust that enables developers to crawl and index websites efficiently. It focuses on speed, concurrency, and reliability, using asynchronous, multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents, and because it operates concurrently across many pages, it can gather large datasets in a short period of time.

Spider also provides mechanisms for subscribing to crawl events, so developers can process page data such as URLs, status codes, or HTML content as it is discovered. It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. These capabilities make the project suitable for building search indexers, data extraction pipelines, and SEO analysis tools.
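A basic crawl that collects links might look like the following. This is a minimal sketch assuming the `spider` crate's `Website` type and a Tokio runtime; the target URL is a placeholder, and exact method signatures should be verified against the crate's documentation:

```rust
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Build a crawler rooted at the target site (placeholder URL).
    let mut website = Website::new("https://example.com");

    // Crawl asynchronously; pages are fetched concurrently and newly
    // discovered links are queued as each page resolves.
    website.crawl().await;

    // Inspect every unique link the crawl collected.
    for link in website.get_links() {
        println!("{}", link.as_ref());
    }
}
```

This requires the `spider` and `tokio` crates as dependencies; the crawl respects the configuration set on the `Website` instance before `crawl` is called.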
Features
- Multi-threaded asynchronous web crawler for high-speed page discovery
- Website scraping that collects HTML content and extracted data
- Event subscription system to process pages during crawling
- Configurable crawl limits, path blacklists, and request budgets
- Optional headless Chrome rendering for dynamic web pages
- Background crawling tasks and scheduled jobs for automated runs