Commit 1dfdc1c committed by itsOwen · 1 parent: e0f5917
Clean up README and improve extraction prompt
- Simplify README structure
- Update prompt to extract ALL items from multipage content
- Clarify that entry limits only apply when explicitly requested
- README.md +26 -251
- src/prompts.py +2 -1
README.md
CHANGED

@@ -1,11 +1,5 @@
 # CyberScraper 2077
 
-<p align="center">
-<a href="https://www.thordata.com/?ls=VNSCxroa&lk=CyberScraper">
-<img src="https://i.postimg.cc/dtwTvm5V/728-x-90-2.gif" alt="Collect-web-data-728x90" border="0">
-</a>
-</p>
-
 <p align="center">
 <img src="https://i.postimg.cc/j5b7QSzg/scraper.png" alt="CyberScraper 2077 Logo">
 </p>
@@ -18,7 +12,6 @@
 [](https://streamlit.io/)
 [](https://opensource.org/licenses/MIT)
 [](http://makeapullrequest.com)
-[](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077)
 
 > Rip data from the net, leaving no trace. Welcome to the future of web scraping.
 
@@ -28,72 +21,25 @@ CyberScraper 2077 is not just another web scraping tool - it's a glimpse into
 
 Whether you're a corpo data analyst, a street-smart netrunner, or just someone looking to pull information from the digital realm, CyberScraper 2077 has got you covered.
 
-### Two Powerful Versions Available
-
-**Main Branch (Current)**: Traditional web scraping with advanced features
-**[Scrapeless Integration Branch](https://github.com/itsOwen/CyberScraper-2077/tree/)**: Enterprise-grade scraping with [Scrapeless SDK](https://www.scrapeless.com?utm_source=owen) integration
-
-<p align="center">
-<a href="https://www.thordata.com/?ls=VNSCxroa&lk=CyberScraper">
-<img src="https://i.postimg.cc/dtwTvm5V/728-x-90-2.gif" alt="Collect-web-data-728x90" border="0">
-</a>
-</p>
-
 <p align="center">
 <img src="https://i.postimg.cc/3NHb15wq/20240821-074556.gif">
 </p>
 
 ## Features
 
-- **Navigate through the Pages (BETA)**: Navigate through the webpage and scrape data from different pages.
-
-### Scrapeless Integration Branch Features
-> **Want enterprise-grade scraping? Check out our [Scrapeless Integration Branch](https://github.com/itsOwen/CyberScraper-2077/tree/)!**
-
-- **Advanced Web Unlocker**: Utilizes Scrapeless's enterprise-grade anti-detection technology to bypass Cloudflare, Akamai, DataDome, and other protection systems.
-- **Automatic CAPTCHA Solving**: Seamlessly solves reCAPTCHA v2/v3, and other verification challenges without human intervention.
-- **Global Proxy Network**: Access content from specific countries with Scrapeless's extensive proxy network.
-- **High-Speed Extraction**: Extract data at unprecedented speed without the overhead of local browser instances.
-- **95% Success Rate**: Achieve ~95% success rate on even heavily protected sites (compared to ~60-70% with traditional methods).
-- **Auto-Updates**: Automatic updates to bypass new protection systems without manual maintenance.
-- **Lightweight Operations**: API-based calls instead of heavy browser instances.
-- **Enterprise Security**: Professional-grade anti-bot detection bypassing.
-
-### Scrapeless vs Traditional Comparison
-
-| Feature | Main Branch | Scrapeless Branch |
-|---------|-------------|-------------------|
-| **Anti-Bot Protection** | Limited custom solutions | Enterprise-grade bypassing |
-| **CAPTCHA Handling** | Manual intervention required | Automatic solving |
-| **Proxy Management** | Basic single proxy | Global proxy network with country selection |
-| **Success Rate** | ~60-70% on protected sites | ~95% on even heavily protected sites |
-| **Resource Usage** | Heavy (browser instances) | Light (API calls only) |
-| **Scalability** | Limited by local resources | Unlimited - cloud-based |
-| **Maintenance** | Constant updates needed | Automatic updates |
-| **Development Time** | Complex custom code | Simple API calls |
-
-## Demo
-
-Check out our redesigned and improved version of CyberScraper-2077 with more functionality in this [YouTube video](https://www.youtube.com/watch?v=TWyensVOIvs) for a full walkthrough of CyberScraper 2077's capabilities.
-
-Check out our first build (old video): [YouTube video](https://www.youtube.com/watch?v=iATSd5Ijl4M)
-
-### Scrapeless DEMO
-
-[](https://www.youtube.com/watch?v=tem8u3mYTMY)
 
 ## For Windows Users
 
@@ -103,8 +49,6 @@ Please follow the Docker Container Guide given below, as I won't be able to main
 
 **Note: CyberScraper 2077 requires Python 3.10 or higher.**
 
-### Main Branch Installation
-
 1. Clone this repository:
 ```bash
 git clone https://github.com/itsOwen/CyberScraper-2077.git
@@ -135,29 +79,7 @@ Please follow the Docker Container Guide given below, as I won't be able to main
 export GOOGLE_API_KEY="your-api-key-here"
 ```
 
-###
-
-For enterprise-grade scraping with automatic CAPTCHA solving and advanced anti-bot bypassing:
-
-1. Clone the Scrapeless integration branch:
-```bash
-git clone -b CyberScrapeless-2077 https://github.com/itsOwen/CyberScraper-2077.git
-cd CyberScraper-2077
-```
-
-2. Install requirements and set API keys:
-```bash
-pip install -r requirements.txt
-
-# Set all API keys
-export OPENAI_API_KEY="your_openai_api_key_here"
-export GOOGLE_API_KEY="your_google_api_key_here"
-export SCRAPELESS_API_KEY="your_scrapeless_api_key_here"
-```
-
-3. Get your Scrapeless API key from [Scrapeless Dashboard](https://app.scrapeless.com/dashboard/account?tab=apiKey)
-
-### Using Ollama (Both Branches)
+### Using Ollama
 
 Note: I only recommend using OpenAI and Gemini API as these models are really good at following instructions. If you are using open-source LLMs, make sure you have a good system as the speed of the data generation/presentation depends on how well your system can run the LLM. You may also have to fine-tune the prompt and add some additional filters yourself.
 
@@ -172,8 +94,6 @@ Note: I only recommend using OpenAI and Gemini API as these models are really go
 
 If you prefer to use Docker, follow these steps to set up and run CyberScraper 2077:
 
-### Main Branch Docker
-
 1. Ensure you have Docker installed on your system.
 
 2. Clone this repository:
@@ -192,21 +112,6 @@ If you prefer to use Docker, follow these steps to set up and run CyberScraper 2
 docker run -p 8501:8501 -e OPENAI_API_KEY="your-actual-api-key" -e GOOGLE_API_KEY="your-actual-api-key" cyberscraper-2077
 ```
 
-### Scrapeless Branch Docker
-
-For the Scrapeless integration branch:
-
-```bash
-git clone -b CyberScrapeless-2077 https://github.com/itsOwen/CyberScraper-2077.git
-cd CyberScraper-2077
-docker build -t cyberscrapeless .
-docker run -p 8501:8501 \
-  -e OPENAI_API_KEY="your-actual-api-key" \
-  -e GOOGLE_API_KEY="your-actual-api-key" \
-  -e SCRAPELESS_API_KEY="your-scrapeless-api-key" \
-  cyberscrapeless
-```
-
 ### Using Ollama with Docker
 
 If you want to use Ollama with the Docker setup:
@@ -228,7 +133,7 @@ If you want to use Ollama with the Docker setup:
 ```
 
 Now visit the url: http://localhost:8501/
-
+
 On Linux you might need to use this below:
 ```bash
 docker run -e OLLAMA_BASE_URL=http://<your-host-ip>:11434 -p 8501:8501 cyberscraper-2077
@@ -241,8 +146,6 @@ Note: Ensure that your firewall allows connections to port 11434 for Ollama.
 
 ## Usage
 
-### Main Branch Usage
-
 1. Fire up the Streamlit app:
 ```bash
 streamlit run main.py
@@ -256,16 +159,7 @@ Note: Ensure that your firewall allows connections to port 11434 for Ollama.
 
 5. Watch as CyberScraper 2077 tears through the net, extracting your data faster than you can say "flatline"!
 
-
-
-The Scrapeless integration branch offers the same user interface with enhanced capabilities:
-
-1. **Enterprise Scraping**: Automatically bypasses advanced anti-bot systems like Cloudflare, Akamai, and DataDome
-2. **CAPTCHA-Free**: No manual CAPTCHA solving required - handled automatically
-3. **Global Access**: Choose proxy countries for geo-restricted content
-4. **Higher Success Rate**: Achieve ~95% success rate on protected sites
-
-Example usage with page ranges (both branches):
+Example usage with page ranges:
 ```
 https://example.com/products 1-5
 https://example.com/search?q=cyberpunk&page={page} 1-10
@@ -308,20 +202,11 @@ I suggest you enter the URL structure every time if you want to scrape multiple
 4. **Automatic Pattern Detection**:
    If you don't specify a pattern, CyberScraper 2077 will attempt to detect the URL pattern automatically. However, for best results, specifying the pattern is recommended.
 
-### Enhanced Multi-Page with Scrapeless
-
-The [Scrapeless integration branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) provides enhanced multi-page scraping with:
-- **Automatic retry logic** for failed pages
-- **Global proxy rotation** for different pages
-- **CAPTCHA auto-solving** across all pages
-- **Higher success rates** on protected paginated sites
-
 ### Tips for Effective Multi-Page Scraping
 
 - Start with a small range of pages to test before scraping a large number.
 - Be mindful of the website's load and your scraping speed to avoid overloading servers.
 - Use the `simulate_human` option for more natural scraping behavior on sites with anti-bot measures.
-- Consider using the [Scrapeless branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) for heavily protected sites.
 - Regularly check the website's `robots.txt` file and terms of service to ensure compliance.
 
 ### Example
@@ -344,10 +229,10 @@ CyberScraper 2077 now supports scraping .onion sites through the Tor network, al
 ```bash
 # Ubuntu/Debian
 sudo apt install tor
-
+
 # macOS (using Homebrew)
 brew install tor
-
+
 # Start the Tor service
 sudo service tor start   # on Linux
 brew services start tor  # on macOS
@@ -408,7 +293,7 @@ docker run -p 8501:8501 \
 <img src="https://i.postimg.cc/3JvhgtMP/cyberscraper-onion.png" alt="CyberScraper 2077 Onion Scrape">
 </p>
 
 ## Setup Google Sheets Authentication
 
 1. Go to the Google Cloud Console (https://console.cloud.google.com/).
 2. Select your project.
@@ -430,24 +315,7 @@ docker run -p 8501:8501 \
 10. Click "Create" to generate the new client ID.
 11. Download the new client configuration JSON file and rename it to `client_secret.json`.
 
-##
-
-### Choose Main Branch If:
-- You need Tor network support for .onion sites
-- You prefer local browser control
-- You want to use your current browser session
-- You're doing research or educational projects
-- Budget is a primary concern (free tier friendly)
-
-### Choose [Scrapeless Integration Branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) If:
-- You're scraping heavily protected sites (Cloudflare, Akamai, DataDome)
-- You need enterprise-grade success rates (~95%)
-- CAPTCHAs are blocking your scraping
-- You want automatic proxy rotation
-- You need reliable, scalable scraping for business use
-- You value time over manual configuration
-
 ## Adjusting PlaywrightScraper Settings (optional)
 
 Customize the `PlaywrightScraper` settings to fit your scraping needs. If some websites are giving you issues, you might want to check the behavior of the website:
 
@@ -463,52 +331,11 @@ Adjust these settings based on your target website and environment for optimal r
 
 You can also bypass the captcha using the `-captcha` parameter at the end of the URL. The browser window will pop up; complete the captcha, go back to your terminal window, press enter, and the bot will complete its task.
 
-## Advanced Features
-
-### Scrapeless SDK Integration
-
-For users of the [Scrapeless integration branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077), here are the core capabilities:
-
-#### Web Unlocker API
-```python
-# Automatic anti-bot bypass
-result = scrapeless.unlocker(
-    actor="unlocker.webunlocker",
-    input={
-        "url": "https://protected-website.com",
-        "proxy_country": "US",
-        "js_render": True
-    }
-)
-```
-
-#### CAPTCHA Solver API
-```python
-# Automatic CAPTCHA solving
-result = scrapeless.solver_captcha(
-    actor="captcha.recaptcha",
-    input={
-        "version": "v2",
-        "pageURL": "https://example.com",
-        "siteKey": "your-site-key"
-    }
-)
-```
-
-#### Pre-built Scrapers
-```python
-# E-commerce scrapers
-result = scrapeless.scraper(
-    actor="scraper.shopee",
-    input={"url": "https://shopee.com/product"}
-)
-```
-
 ## Contributing
 
 We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077!
 
 1. Fork the repository
 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
 3. Commit your changes (`git commit -m 'Add some amazing feature'`)
 4. Push to the branch (`git push origin feature/amazing-feature`)
@@ -516,33 +343,15 @@ We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberS
 
 ## Troubleshooting
 
-### Main Branch Issues
 Ran into a glitch in the matrix? Let me know by adding the issue to this repo so that we can fix it together.
 
-### Scrapeless Integration Issues
-If you encounter issues with the [Scrapeless branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077):
-
-1. **API Key Issues**: Verify your Scrapeless API key is valid
-2. **High Success Rate Expected**: Scrapeless should achieve ~95% success on protected sites
-3. **CAPTCHA Auto-Solve**: Should work automatically without manual intervention
-4. **Proxy Network**: Test with different country codes if content is geo-restricted
-
 ## FAQ
 
-**Q: Which branch should I use?**
-A: Use the main branch for general scraping and Tor support. Use the [Scrapeless integration branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) for enterprise-grade scraping with automatic CAPTCHA solving and anti-bot bypassing.
-
 **Q: Is CyberScraper 2077 legal to use?**
 A: CyberScraper 2077 is designed for ethical web scraping. Always ensure you have the right to scrape a website and respect their robots.txt file.
 
 **Q: Can I use this for commercial purposes?**
 A: Yes, under the terms of the MIT License.
-
-**Q: What's the success rate difference?**
-A: Main branch: ~60-70% on protected sites. Scrapeless branch: ~95% on even heavily protected sites.
-
-**Q: Do I need to pay for Scrapeless?**
-A: Scrapeless offers various pricing tiers. Check their [pricing page](https://www.scrapeless.com?utm_source=owen) for current rates. The main branch remains free to use.
 
 ## License
 
@@ -552,42 +361,8 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
 
 Got questions? Need support? Want to hire me for a gig?
 
-- Website: [Portfolio](https://www.owensingh.com)
-
-## Get Started With Scrapeless
-
-### Free Trial
-
-1. **[Sign Up](https://app.scrapeless.com/signup?utm_source=owen)** - No credit card required
-2. **Get API Key** - Instant access to all features
-3. **Install SDK** - Choose your preferred language
-4. **Follow Quick Start** - Working in 5 minutes
-5. **Scale Up** - Upgrade when ready
-
-### Enterprise Contact
-
-- **Custom Pricing** - Volume discounts available
-- **Dedicated Support** - Named customer success manager
-- **SLA Guarantees** - 99.99% uptime commitment
-- **On-premise Options** - Private cloud deployment
-- **Email**: market@scrapeless.com
-
-### Connect With Scrapeless Devs
-
-- **Website**: [scrapeless.com](https://www.scrapeless.com?utm_source=owen)
-- **Documentation**: [docs.scrapeless.com](https://docs.scrapeless.com)
-- **Discord**: [Discord Community](https://discord.com/invite/xBcTfGPjCQ)
-- **LinkedIn**: [Follow Us](https://www.linkedin.com/company/scrapeless/posts/?feedView=all)
-- **Twitter**: [Follow Us](https://x.com/Scrapelessteam)
-- **Email**: market@scrapeless.com
-
-<p align="center">
-<a href="https://www.thordata.com/?ls=VNSCxroa&lk=CyberScraper">
-<img src="https://i.postimg.cc/dtwTvm5V/728-x-90-2.gif" alt="Collect-web-data-728x90" border="0">
-</a>
-</p>
 
 ## Disclaimer
 
@@ -618,5 +393,5 @@ Remember, samurai: In the dark future of the NET, knowledge is power, but it's a
 </p>
 
 <p align="center">
-Built with
+Built with love and chrome by the streets of Night City | © 2077 Owen Singh
 </p>

# CyberScraper 2077

<p align="center">
<img src="https://i.postimg.cc/j5b7QSzg/scraper.png" alt="CyberScraper 2077 Logo">
</p>

[](https://streamlit.io/)
[](https://opensource.org/licenses/MIT)
[](http://makeapullrequest.com)

> Rip data from the net, leaving no trace. Welcome to the future of web scraping.

Whether you're a corpo data analyst, a street-smart netrunner, or just someone looking to pull information from the digital realm, CyberScraper 2077 has got you covered.

<p align="center">
<img src="https://i.postimg.cc/3NHb15wq/20240821-074556.gif">
</p>

## Features

- **AI-Powered Extraction**: Utilizes cutting-edge AI models to understand and parse web content intelligently.
- **Sleek Streamlit Interface**: User-friendly GUI that even a chrome-armed street samurai could navigate.
- **Multi-Format Support**: Export your data in JSON, CSV, HTML, SQL, or Excel - whatever fits your cyberdeck.
- **Tor Network Support**: Safely scrape .onion sites through the Tor network with automatic routing and security features.
- **Stealth Mode**: Stealth-mode parameters that help avoid detection as a bot.
- **Ollama Support**: Use a huge library of open-source LLMs.
- **Async Operations**: Lightning-fast scraping that would make a Trauma Team jealous.
- **Smart Parsing**: Structures scraped content as if it was extracted straight from the engram of a master netrunner.
- **Caching**: Content-based and query-based caching using an LRU cache and a custom dictionary to reduce redundant API calls.
- **Upload to Google Sheets**: Easily upload your extracted CSV data to Google Sheets with one click.
- **Bypass Captcha**: Bypass captchas by appending `-captcha` to the end of the URL. (Currently only works natively, not on Docker.)
- **Current Browser**: Uses your local browser instance, which helps you bypass 99% of bot detections. (Only use when necessary.)
- **Navigate through the Pages (BETA)**: Navigate through the webpage and scrape data from different pages.
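The caching bullet above can be made concrete with a small sketch. This is not the project's actual implementation, just a minimal illustration of content- and query-keyed caching with `functools.lru_cache`; the function names are hypothetical:

```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=128)
def extract(content_hash: str, query: str) -> str:
    # Stand-in for the expensive LLM extraction call; results are
    # memoized per (content hash, query) pair.
    return f"result for {query}"

def cached_extract(content: str, query: str) -> str:
    # Hash the page content so the cache key stays small and hashable.
    digest = hashlib.sha256(content.encode()).hexdigest()
    return extract(digest, query)
```

Calling `cached_extract` twice with the same page and query hits the cache on the second call, which is the behavior the feature bullet describes.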

## For Windows Users

**Note: CyberScraper 2077 requires Python 3.10 or higher.**

1. Clone this repository:
```bash
git clone https://github.com/itsOwen/CyberScraper-2077.git
```

```bash
export GOOGLE_API_KEY="your-api-key-here"
```

### Using Ollama

Note: I only recommend using OpenAI and Gemini API as these models are really good at following instructions. If you are using open-source LLMs, make sure you have a good system as the speed of the data generation/presentation depends on how well your system can run the LLM. You may also have to fine-tune the prompt and add some additional filters yourself.

If you prefer to use Docker, follow these steps to set up and run CyberScraper 2077:

1. Ensure you have Docker installed on your system.

2. Clone this repository:

```bash
docker run -p 8501:8501 -e OPENAI_API_KEY="your-actual-api-key" -e GOOGLE_API_KEY="your-actual-api-key" cyberscraper-2077
```

### Using Ollama with Docker

If you want to use Ollama with the Docker setup:

Now visit the url: http://localhost:8501/

On Linux you might need to use this below:
```bash
docker run -e OLLAMA_BASE_URL=http://<your-host-ip>:11434 -p 8501:8501 cyberscraper-2077
```
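The `OLLAMA_BASE_URL` override above boils down to an environment lookup with a localhost fallback. A minimal sketch (the helper name is hypothetical, not CyberScraper's actual code):

```python
import os

def resolve_ollama_base_url(default: str = "http://localhost:11434") -> str:
    # Inside a container, localhost points at the container itself, so the
    # host's Ollama endpoint must be supplied via OLLAMA_BASE_URL
    # (e.g. http://<your-host-ip>:11434); otherwise fall back to localhost.
    return os.environ.get("OLLAMA_BASE_URL", default)
```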

## Usage

1. Fire up the Streamlit app:
```bash
streamlit run main.py
```

5. Watch as CyberScraper 2077 tears through the net, extracting your data faster than you can say "flatline"!

Example usage with page ranges:
```
https://example.com/products 1-5
https://example.com/search?q=cyberpunk&page={page} 1-10
```
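As a rough illustration of what expanding such a `URL RANGE` spec involves (a sketch with a hypothetical helper, not the app's actual pattern-detection code):

```python
def expand_pages(spec: str) -> list[str]:
    # Split "URL START-END" into the URL pattern and the page range.
    pattern, _, page_range = spec.rpartition(" ")
    start, _, end = page_range.partition("-")
    # Substitute each page number into the {page} placeholder.
    return [pattern.replace("{page}", str(p))
            for p in range(int(start), int(end) + 1)]
```

For the first form, without an explicit `{page}` placeholder, CyberScraper tries to detect the pagination pattern itself, which this sketch does not attempt.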
4. **Automatic Pattern Detection**:
   If you don't specify a pattern, CyberScraper 2077 will attempt to detect the URL pattern automatically. However, for best results, specifying the pattern is recommended.

### Tips for Effective Multi-Page Scraping

- Start with a small range of pages to test before scraping a large number.
- Be mindful of the website's load and your scraping speed to avoid overloading servers.
- Use the `simulate_human` option for more natural scraping behavior on sites with anti-bot measures.
- Regularly check the website's `robots.txt` file and terms of service to ensure compliance.

### Example
```bash
# Ubuntu/Debian
sudo apt install tor

# macOS (using Homebrew)
brew install tor

# Start the Tor service
sudo service tor start   # on Linux
brew services start tor  # on macOS
```
<img src="https://i.postimg.cc/3JvhgtMP/cyberscraper-onion.png" alt="CyberScraper 2077 Onion Scrape">
</p>

## Setup Google Sheets Authentication

1. Go to the Google Cloud Console (https://console.cloud.google.com/).
2. Select your project.

10. Click "Create" to generate the new client ID.
11. Download the new client configuration JSON file and rename it to `client_secret.json`.

## Adjusting PlaywrightScraper Settings (optional)

Customize the `PlaywrightScraper` settings to fit your scraping needs. If some websites are giving you issues, you might want to check the behavior of the website:

You can also bypass the captcha using the `-captcha` parameter at the end of the URL. The browser window will pop up; complete the captcha, go back to your terminal window, press enter, and the bot will complete its task.

## Contributing

We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077!

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)

## Troubleshooting

Ran into a glitch in the matrix? Let me know by adding the issue to this repo so that we can fix it together.

## FAQ

**Q: Is CyberScraper 2077 legal to use?**
A: CyberScraper 2077 is designed for ethical web scraping. Always ensure you have the right to scrape a website and respect their robots.txt file.

**Q: Can I use this for commercial purposes?**
A: Yes, under the terms of the MIT License.

## License

Got questions? Need support? Want to hire me for a gig?

- Email: owensingh72@proton.me
- Website: [owen.sh](https://owen.sh)

## Disclaimer

</p>

<p align="center">
Built with love and chrome by the streets of Night City | © 2077 Owen Singh
</p>
src/prompts.py
CHANGED

@@ -26,9 +26,10 @@ IMPORTANT: Always return JSON format for ANY export request. The system will aut
 
 ## Rules for Data Export
 - Return ONLY the JSON array, no explanations or additional text
+- Extract ALL matching items from the entire content (including all pages if multipage)
 - Include all requested fields; use "N/A" if not found
 - Never invent data not present in the content
-
+- Only limit entries if a specific count is explicitly requested by the user
 - Use relevant field names based on content and query
 
 ## Conversational Mode
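The updated rules tighten the contract between the prompt and the export parser: the model must emit a bare JSON array, extract every matching item, and backfill missing fields with "N/A". A hedged sketch of how those rules might be assembled and the response validated downstream; this is not the actual contents of `src/prompts.py`, and the helper names are hypothetical:

```python
import json

# The export rules as they appear after this commit.
RULES = [
    "Return ONLY the JSON array, no explanations or additional text",
    "Extract ALL matching items from the entire content (including all pages if multipage)",
    'Include all requested fields; use "N/A" if not found',
    "Never invent data not present in the content",
    "Only limit entries if a specific count is explicitly requested by the user",
    "Use relevant field names based on content and query",
]

def build_export_rules() -> str:
    # Assemble the rule bullets into the prompt section.
    return "## Rules for Data Export\n" + "\n".join(f"- {r}" for r in RULES)

def parse_export(response: str, fields: list[str]) -> list[dict]:
    # Enforce the JSON-array contract and backfill missing fields with "N/A".
    items = json.loads(response)
    if not isinstance(items, list):
        raise ValueError("model must return a JSON array")
    return [{f: item.get(f, "N/A") for f in fields} for item in items]
```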