# Async Web Crawler

An asynchronous web scraper, built on aiohttp, for collecting text datasets at high concurrency.

## Install
```bash
pip install aiohttp
```

## Usage
```bash
python crawler.py seeds.txt output_dir/ --workers 100
```
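The `--workers` flag suggests a fixed pool of concurrent tasks pulling URLs from a shared queue. The sketch below illustrates that pattern with `asyncio` only; it is not the code in `crawler.py`. The names `crawl`, `fetch`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP request so the example runs without network access.

```python
import asyncio


async def crawl(seeds, fetch, workers=100):
    """Run `fetch(url)` over `seeds` with at most `workers` tasks in flight."""
    queue = asyncio.Queue()
    for url in seeds:
        queue.put_nowait(url)
    results = {}

    async def worker():
        # Each worker drains the shared queue until it is empty.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            try:
                results[url] = await fetch(url)
            except Exception as exc:
                # Record the failure; one bad page should not stop the crawl.
                results[url] = exc

    # Never start more workers than there are URLs to process.
    n_workers = min(workers, max(1, queue.qsize()))
    await asyncio.gather(*(worker() for _ in range(n_workers)))
    return results


async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP GET (assumption, not the real fetcher)."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"text from {url}"


if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/"], fake_fetch, workers=2))
    print(pages)
```

In a real crawler, `fetch` would wrap an HTTP client call with a timeout. The pool size bounds open connections regardless of how many seeds are queued.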

## Get Seeds
```bash
curl -sL https://tranco-list.eu/top-1m.csv.zip -o tranco.zip && unzip tranco.zip
awk -F, '{print "https://"$2"/"}' top-1m.csv > seeds.txt
```
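For environments without `awk`, the same transformation can be sketched in Python. This assumes the CSV has the `rank,domain` layout that the `awk` command above relies on; the function name `seeds_from_tranco` is hypothetical.

```python
import csv
import io


def seeds_from_tranco(csv_text):
    """Turn `rank,domain` rows into `https://domain/` seed URLs."""
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed rows instead of emitting a broken URL.
        if len(row) < 2 or not row[1].strip():
            continue
        urls.append(f"https://{row[1].strip()}/")
    return urls


if __name__ == "__main__":
    sample = "1,example.com\n2,example.org\n"
    print("\n".join(seeds_from_tranco(sample)))
```

Stripping whitespace from the domain also guards against stray carriage returns if the file uses Windows line endings.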

## Output
Each output file contains the source URL and the text extracted from that page.
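
The exact file layout is not documented here. As an illustration only, the sketch below assumes the URL sits on the first line with the extracted text after it, and that files are named by a hash of the URL. Both choices, along with the helper names `write_page` and `read_page`, are assumptions rather than the crawler's confirmed behavior.

```python
import hashlib
from pathlib import Path


def write_page(output_dir, url, text):
    """Write one page as `<url>\\n<text>` (assumed layout) and return its path."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: a short hash of the URL keeps filenames safe.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".txt"
    path = out / name
    path.write_text(f"{url}\n{text}", encoding="utf-8")
    return path


def read_page(path):
    """Split a file written by `write_page` back into (url, text)."""
    url, _, text = Path(path).read_text(encoding="utf-8").partition("\n")
    return url, text
```

If the real files use a different layout, only `read_page` needs to change when loading the dataset.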

*OpenTransformers Ltd*