# Async Web Crawler
High-performance async web scraper for dataset collection.
## Install
```bash
pip install aiohttp
```
## Usage
```bash
python crawler.py seeds.txt output_dir/ --workers 100
```
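The `--workers` flag presumably caps how many pages are fetched at the same time. As a rough illustration only, and not the actual contents of `crawler.py`, a bounded worker pool can be sketched with `asyncio` alone. The names `crawl`, `worker`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real `aiohttp` request.

```python
import asyncio


async def crawl(urls, fetch, workers=100):
    """Fetch every URL with at most `workers` requests in flight.

    `fetch` is any coroutine function taking a URL and returning text;
    it is a stand-in (assumption) for a real aiohttp-based request.
    """
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results = {}

    async def worker():
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left, so this worker exits
            try:
                results[url] = await fetch(url)
            except Exception as exc:  # one bad page should not stop the crawl
                results[url] = exc

    # Start a fixed number of workers that all drain the same queue.
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results


async def fake_fetch(url):
    """Hypothetical placeholder that simulates a network round trip."""
    await asyncio.sleep(0)
    return f"text from {url}"


def run_demo():
    urls = [f"https://example.com/{i}" for i in range(5)]
    return asyncio.run(crawl(urls, fake_fetch, workers=2))
```

Draining a shared queue with a fixed number of workers keeps memory use flat even for a million seeds, because only `workers` requests exist at any moment.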
## Get Seeds
```bash
curl -sL https://tranco-list.eu/top-1m.csv.zip -o tranco.zip && unzip tranco.zip
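# Strip any carriage returns first so a CRLF line ending cannot end up inside the URL
tr -d '\r' < top-1m.csv | awk -F, '{print "https://"$2"/"}' > seeds.txt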
```
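Where `awk` is not available, the same conversion can be sketched in Python. This assumes the unzipped file is named `top-1m.csv` and that each row is `rank,domain`, as the command above implies. `rows_to_seeds` and `write_seeds` are hypothetical helper names.

```python
import csv


def rows_to_seeds(lines):
    """Turn `rank,domain` CSV rows into `https://domain/` seed URLs."""
    seeds = []
    for row in csv.reader(lines):
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank or malformed rows
        seeds.append(f"https://{row[1].strip()}/")
    return seeds


def write_seeds(csv_path="top-1m.csv", out_path="seeds.txt"):
    # newline="" lets the csv module handle LF and CRLF line endings itself
    with open(csv_path, newline="", encoding="utf-8") as src:
        seeds = rows_to_seeds(src)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(seeds) + "\n")
    return len(seeds)
```

Calling `write_seeds()` in the directory that holds `top-1m.csv` writes one seed URL per line to `seeds.txt`.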
## Output
Each output file contains the source URL and the text extracted from that page.
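The exact file layout is not documented here. Purely as an assumption-laden sketch, a record with the URL on the first line followed by the page's visible text could be built with the standard library as shown. `TextExtractor` and `make_record` are hypothetical names, and the real crawler may extract and store text differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def make_record(url, html):
    """Return the URL on the first line, then the extracted text (assumed layout)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return url + "\n" + " ".join(parser.parts) + "\n"
```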
*OpenTransformers Ltd*