itsOwen committed on
Commit
1dfdc1c
·
1 Parent(s): e0f5917

Clean up README and improve extraction prompt


- Simplify README structure
- Update prompt to extract ALL items from multipage content
- Clarify that entry limits only apply when explicitly requested

Files changed (2)
  1. README.md +26 -251
  2. src/prompts.py +2 -1
README.md CHANGED
@@ -1,11 +1,5 @@
  # 🌐 CyberScraper 2077
 
- <p align="center">
- <a href="https://www.thordata.com/?ls=VNSCxroa&lk=CyberScraper">
- <img src="https://i.postimg.cc/dtwTvm5V/728-x-90-2.gif" alt="Collect-web-data-728x90" border="0">
- </a>
- </p>
-
  <p align="center">
  <img src="https://i.postimg.cc/j5b7QSzg/scraper.png" alt="CyberScraper 2077 Logo">
  </p>
@@ -18,7 +12,6 @@
  [![Streamlit](https://img.shields.io/badge/Streamlit-FF4B4B)](https://streamlit.io/)
  [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
  [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com)
- [![Scrapeless](https://img.shields.io/badge/Scrapeless%20Branch-Available-blue)](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077)
 
  > Rip data from the net, leaving no trace. Welcome to the future of web scraping.
 
@@ -28,72 +21,25 @@ CyberScraper 2077 is not just another web scraping tool – it's a glimpse into
 
  Whether you're a corpo data analyst, a street-smart netrunner, or just someone looking to pull information from the digital realm, CyberScraper 2077 has got you covered.
 
- ### 🚀 Two Powerful Versions Available
-
- **Main Branch (Current)**: Traditional web scraping with advanced features
- **[Scrapeless Integration Branch](https://github.com/itsOwen/CyberScraper-2077/tree/)**: Enterprise-grade scraping with [Scrapeless SDK](https://www.scrapeless.com?utm_source=owen) integration
-
- <p align="center">
- <a href="https://www.thordata.com/?ls=VNSCxroa&lk=CyberScraper">
- <img src="https://i.postimg.cc/dtwTvm5V/728-x-90-2.gif" alt="Collect-web-data-728x90" border="0">
- </a>
- </p>
-
  <p align="center">
  <img src="https://i.postimg.cc/3NHb15wq/20240821-074556.gif">
  </p>
 
  ## ✨ Features
 
- ### 🔧 Main Branch Features
- - 🤖 **AI-Powered Extraction**: Utilizes cutting-edge AI models to understand and parse web content intelligently.
- - 🖥️ **Sleek Streamlit Interface**: User-friendly GUI that even a chrome-armed street samurai could navigate.
- - 🔄 **Multi-Format Support**: Export your data in JSON, CSV, HTML, SQL or Excel – whatever fits your cyberdeck.
- - 🧅 **Tor Network Support**: Safely scrape .onion sites through the Tor network with automatic routing and security features.
- - 🕵️ **Stealth Mode**: Implemented stealth mode parameters that help avoid detection as a bot.
- - 🦙 **Ollama Support**: Use a huge library of open source LLMs.
- - ⚡ **Async Operations**: Lightning-fast scraping that would make a Trauma Team jealous.
- - 🧠 **Smart Parsing**: Structures scraped content as if it was extracted straight from the engram of a master netrunner.
- - 💾 **Caching**: Implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls.
- - 📊 **Upload to Google Sheets**: Now you can easily upload your extracted CSV data to Google Sheets with one click.
- - 🛡️ **Bypass Captcha**: Bypass captcha by using the -captcha at the end of the URL. (Currently only works natively, doesn't work on Docker)
- - 🌐 **Current Browser**: The current browser feature uses your local browser instance which will help you bypass 99% of bot detections. (Only use when necessary)
- - 🧭 **Navigate through the Pages (BETA)**: Navigate through the webpage and scrape data from different pages.
+ - **AI-Powered Extraction**: Utilizes cutting-edge AI models to understand and parse web content intelligently.
+ - **Sleek Streamlit Interface**: User-friendly GUI that even a chrome-armed street samurai could navigate.
+ - **Multi-Format Support**: Export your data in JSON, CSV, HTML, SQL or Excel – whatever fits your cyberdeck.
+ - **Tor Network Support**: Safely scrape .onion sites through the Tor network with automatic routing and security features.
+ - **Stealth Mode**: Implemented stealth mode parameters that help avoid detection as a bot.
+ - **Ollama Support**: Use a huge library of open source LLMs.
+ - **Async Operations**: Lightning-fast scraping that would make a Trauma Team jealous.
+ - **Smart Parsing**: Structures scraped content as if it was extracted straight from the engram of a master netrunner.
+ - **Caching**: Implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls.
+ - **Upload to Google Sheets**: Now you can easily upload your extracted CSV data to Google Sheets with one click.
+ - **Bypass Captcha**: Bypass captcha by using the -captcha at the end of the URL. (Currently only works natively, doesn't work on Docker)
+ - **Current Browser**: The current browser feature uses your local browser instance which will help you bypass 99% of bot detections. (Only use when necessary)
+ - **Navigate through the Pages (BETA)**: Navigate through the webpage and scrape data from different pages.
-
- ### ⚔️ Scrapeless Integration Branch Features
- > **Want enterprise-grade scraping? Check out our [Scrapeless Integration Branch](https://github.com/itsOwen/CyberScraper-2077/tree/)!**
-
- - 🔓 **Advanced Web Unlocker**: Utilizes Scrapeless's enterprise-grade anti-detection technology to bypass Cloudflare, Akamai, DataDome, and other protection systems.
- - 🤖 **Automatic CAPTCHA Solving**: Seamlessly solves reCAPTCHA v2/v3, and other verification challenges without human intervention.
- - 🌍 **Global Proxy Network**: Access content from specific countries with Scrapeless's extensive proxy network.
- - 🚀 **High-Speed Extraction**: Extract data at unprecedented speed without the overhead of local browser instances.
- - 📈 **95% Success Rate**: Achieve ~95% success rate on even heavily protected sites (compared to ~60-70% with traditional methods).
- - 🔄 **Auto-Updates**: Automatic updates to bypass new protection systems without manual maintenance.
- - ⚡ **Lightweight Operations**: API-based calls instead of heavy browser instances.
- - 🛡️ **Enterprise Security**: Professional-grade anti-bot detection bypassing.
-
- ### 📊 Scrapeless vs Traditional Comparison
-
- | Feature | Main Branch | Scrapeless Branch |
- |---------|-------------|-------------------|
- | **Anti-Bot Protection** | Limited custom solutions | Enterprise-grade bypassing |
- | **CAPTCHA Handling** | Manual intervention required | Automatic solving |
- | **Proxy Management** | Basic single proxy | Global proxy network with country selection |
- | **Success Rate** | ~60-70% on protected sites | ~95% on even heavily protected sites |
- | **Resource Usage** | Heavy (browser instances) | Light (API calls only) |
- | **Scalability** | Limited by local resources | Unlimited - cloud-based |
- | **Maintenance** | Constant updates needed | Automatic updates |
- | **Development Time** | Complex custom code | Simple API calls |
-
- ## 🎥 Demo
-
- Check out our Redesigned and Improved Version of CyberScraper-2077 with more functionality [YouTube video](https://www.youtube.com/watch?v=TWyensVOIvs) for a full walkthrough of CyberScraper 2077's capabilities.
-
- Check out our first build (Old Video) [YouTube video](https://www.youtube.com/watch?v=iATSd5Ijl4M)
-
- ### Scrapeless DEMO
-
- [![Video Demo](https://img.youtube.com/vi/tem8u3mYTMY/maxresdefault.jpg)](https://www.youtube.com/watch?v=tem8u3mYTMY)
 
  ## 🪟 For Windows Users
 
@@ -103,8 +49,6 @@ Please follow the Docker Container Guide given below, as I won't be able to main
 
  **Note: CyberScraper 2077 requires Python 3.10 or higher.**
 
- ### Main Branch Installation
-
  1. Clone this repository:
  ```bash
  git clone https://github.com/itsOwen/CyberScraper-2077.git
@@ -135,29 +79,7 @@ Please follow the Docker Container Guide given below, as I won't be able to main
  export GOOGLE_API_KEY="your-api-key-here"
  ```
 
- ### Scrapeless Integration Branch Installation
-
- For enterprise-grade scraping with automatic CAPTCHA solving and advanced anti-bot bypassing:
-
- 1. Clone the Scrapeless integration branch:
- ```bash
- git clone -b CyberScrapeless-2077 https://github.com/itsOwen/CyberScraper-2077.git
- cd CyberScraper-2077
- ```
-
- 2. Install requirements and set API keys:
- ```bash
- pip install -r requirements.txt
-
- # Set all API keys
- export OPENAI_API_KEY="your_openai_api_key_here"
- export GOOGLE_API_KEY="your_google_api_key_here"
- export SCRAPELESS_API_KEY="your_scrapeless_api_key_here"
- ```
-
- 3. Get your Scrapeless API key from [Scrapeless Dashboard](https://app.scrapeless.com/dashboard/account?tab=apiKey)
-
- ### Using Ollama (Both Branches)
+ ### Using Ollama
 
  Note: I only recommend using OpenAI and Gemini API as these models are really good at following instructions. If you are using open-source LLMs, make sure you have a good system as the speed of the data generation/presentation depends on how well your system can run the LLM. You may also have to fine-tune the prompt and add some additional filters yourself.
 
@@ -172,8 +94,6 @@ Note: I only recommend using OpenAI and Gemini API as these models are really go
 
  If you prefer to use Docker, follow these steps to set up and run CyberScraper 2077:
 
- ### Main Branch Docker
-
  1. Ensure you have Docker installed on your system.
 
  2. Clone this repository:
@@ -192,21 +112,6 @@ If you prefer to use Docker, follow these steps to set up and run CyberScraper 2
  docker run -p 8501:8501 -e OPENAI_API_KEY="your-actual-api-key" -e GOOGLE_API_KEY="your-actual-api-key" cyberscraper-2077
  ```
 
- ### Scrapeless Branch Docker
-
- For the Scrapeless integration branch:
-
- ```bash
- git clone -b CyberScrapeless-2077 https://github.com/itsOwen/CyberScraper-2077.git
- cd CyberScraper-2077
- docker build -t cyberscrapeless .
- docker run -p 8501:8501 \
- -e OPENAI_API_KEY="your-actual-api-key" \
- -e GOOGLE_API_KEY="your-actual-api-key" \
- -e SCRAPELESS_API_KEY="your-scrapeless-api-key" \
- cyberscrapeless
- ```
-
  ### Using Ollama with Docker
 
  If you want to use Ollama with the Docker setup:
@@ -228,7 +133,7 @@ If you want to use Ollama with the Docker setup:
  ```
 
  Now visit the url: http://localhost:8501/
-
+
  On Linux you might need to use this below:
  ```bash
  docker run -e OLLAMA_BASE_URL=http://<your-host-ip>:11434 -p 8501:8501 cyberscraper-2077
@@ -241,8 +146,6 @@ Note: Ensure that your firewall allows connections to port 11434 for Ollama.
 
  ## 🚀 Usage
 
- ### Main Branch Usage
-
  1. Fire up the Streamlit app:
  ```bash
  streamlit run main.py
@@ -256,16 +159,7 @@ Note: Ensure that your firewall allows connections to port 11434 for Ollama.
 
  5. Watch as CyberScraper 2077 tears through the net, extracting your data faster than you can say "flatline"!
 
- ### Scrapeless Branch Usage
-
- The Scrapeless integration branch offers the same user interface with enhanced capabilities:
-
- 1. **Enterprise Scraping**: Automatically bypasses advanced anti-bot systems like Cloudflare, Akamai, and DataDome
- 2. **CAPTCHA-Free**: No manual CAPTCHA solving required - handled automatically
- 3. **Global Access**: Choose proxy countries for geo-restricted content
- 4. **Higher Success Rate**: Achieve ~95% success rate on protected sites
-
- Example usage with page ranges (both branches):
+ Example usage with page ranges:
  ```
  https://example.com/products 1-5
  https://example.com/search?q=cyberpunk&page={page} 1-10
@@ -308,20 +202,11 @@ I suggest you enter the URL structure every time if you want to scrape multiple
  4. **Automatic Pattern Detection**:
     If you don't specify a pattern, CyberScraper 2077 will attempt to detect the URL pattern automatically. However, for best results, specifying the pattern is recommended.
 
- ### Enhanced Multi-Page with Scrapeless
-
- The [Scrapeless integration branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) provides enhanced multi-page scraping with:
- - **Automatic retry logic** for failed pages
- - **Global proxy rotation** for different pages
- - **CAPTCHA auto-solving** across all pages
- - **Higher success rates** on protected paginated sites
-
  ### Tips for Effective Multi-Page Scraping
 
  - Start with a small range of pages to test before scraping a large number.
  - Be mindful of the website's load and your scraping speed to avoid overloading servers.
  - Use the `simulate_human` option for more natural scraping behavior on sites with anti-bot measures.
- - Consider using the [Scrapeless branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) for heavily protected sites.
  - Regularly check the website's `robots.txt` file and terms of service to ensure compliance.
 
  ### Example
@@ -344,10 +229,10 @@ CyberScraper 2077 now supports scraping .onion sites through the Tor network, al
  ```bash
  # Ubuntu/Debian
  sudo apt install tor
-
+
  # macOS (using Homebrew)
  brew install tor
-
+
  # Start the Tor service
  sudo service tor start # on Linux
  brew services start tor # on macOS
@@ -408,7 +293,7 @@ docker run -p 8501:8501 \
  <img src="https://i.postimg.cc/3JvhgtMP/cyberscraper-onion.png" alt="CyberScraper 2077 Onion Scrape">
  </p>
 
- ## Setup Google Sheets Authentication:
+ ## 🔐 Setup Google Sheets Authentication
 
  1. Go to the Google Cloud Console (https://console.cloud.google.com/).
  2. Select your project.
@@ -430,24 +315,7 @@ docker run -p 8501:8501 \
  10. Click "Create" to generate the new client ID.
  11. Download the new client configuration JSON file and rename it to `client_secret.json`.
 
- ## 🔧 Branch Selection Guide
-
- ### Choose Main Branch If:
- - You need Tor network support for .onion sites
- - You prefer local browser control
- - You want to use your current browser session
- - You're doing research or educational projects
- - Budget is a primary concern (free tier friendly)
-
- ### Choose [Scrapeless Integration Branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) If:
- - You're scraping heavily protected sites (Cloudflare, Akamai, DataDome)
- - You need enterprise-grade success rates (~95%)
- - CAPTCHAs are blocking your scraping
- - You want automatic proxy rotation
- - You need reliable, scalable scraping for business use
- - You value time over manual configuration
-
- ## Adjusting PlaywrightScraper Settings (optional)
+ ## ⚙️ Adjusting PlaywrightScraper Settings (optional)
 
  Customize the `PlaywrightScraper` settings to fit your scraping needs. If some websites are giving you issues, you might want to check the behavior of the website:
 
@@ -463,52 +331,11 @@ Adjust these settings based on your target website and environment for optimal r
 
  You can also bypass the captcha using the ```-captcha``` parameter at the end of the URL. The browser window will pop up, complete the captcha, and go back to your terminal window. Press enter and the bot will complete its task.
 
- ## 🛠️ Advanced Features
-
- ### Scrapeless SDK Integration
-
- For users of the [Scrapeless integration branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077), here are the core capabilities:
-
- #### Web Unlocker API
- ```python
- # Automatic anti-bot bypass
- result = scrapeless.unlocker(
- actor="unlocker.webunlocker",
- input={
- "url": "https://protected-website.com",
- "proxy_country": "US",
- "js_render": True
- }
- )
- ```
-
- #### CAPTCHA Solver API
- ```python
- # Automatic CAPTCHA solving
- result = scrapeless.solver_captcha(
- actor="captcha.recaptcha",
- input={
- "version": "v2",
- "pageURL": "https://example.com",
- "siteKey": "your-site-key"
- }
- )
- ```
-
- #### Pre-built Scrapers
- ```python
- # E-commerce scrapers
- result = scrapeless.scraper(
- actor="scraper.shopee",
- input={"url": "https://shopee.com/product"}
- )
- ```
-
  ## 🤝 Contributing
 
- We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077! Whether you're enhancing the main branch, improving the Scrapeless integration, or adding new features, your contributions are valued.
+ We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077!
 
- 1. Fork the repository (choose your preferred branch)
+ 1. Fork the repository
  2. Create your feature branch (`git checkout -b feature/amazing-feature`)
  3. Commit your changes (`git commit -m 'Add some amazing feature'`)
  4. Push to the branch (`git push origin feature/amazing-feature`)
@@ -516,33 +343,15 @@ We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberS
 
  ## 🔧 Troubleshooting
 
- ### Main Branch Issues
  Ran into a glitch in the matrix? Let me know by adding the issue to this repo so that we can fix it together.
 
- ### Scrapeless Integration Issues
- If you encounter issues with the [Scrapeless branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077):
-
- 1. **API Key Issues**: Verify your Scrapeless API key is valid
- 2. **High Success Rate Expected**: Scrapeless should achieve ~95% success on protected sites
- 3. **CAPTCHA Auto-Solve**: Should work automatically without manual intervention
- 4. **Proxy Network**: Test with different country codes if content is geo-restricted
-
  ## ❓ FAQ
 
- **Q: Which branch should I use?**
- A: Use the main branch for general scraping and Tor support. Use the [Scrapeless integration branch](https://github.com/itsOwen/CyberScraper-2077/tree/CyberScrapeless-2077) for enterprise-grade scraping with automatic CAPTCHA solving and anti-bot bypassing.
-
  **Q: Is CyberScraper 2077 legal to use?**
  A: CyberScraper 2077 is designed for ethical web scraping. Always ensure you have the right to scrape a website and respect their robots.txt file.
 
  **Q: Can I use this for commercial purposes?**
- A: Yes, under the terms of the MIT License. The Scrapeless integration branch is particularly well-suited for commercial use with its enterprise-grade features.
-
- **Q: What's the success rate difference?**
- A: Main branch: ~60-70% on protected sites. Scrapeless branch: ~95% on even heavily protected sites.
-
- **Q: Do I need to pay for Scrapeless?**
- A: Scrapeless offers various pricing tiers. Check their [pricing page](https://www.scrapeless.com?utm_source=owen) for current rates. The main branch remains free to use.
+ A: Yes, under the terms of the MIT License.
 
  ## 📄 License
 
@@ -552,42 +361,8 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
 
  Got questions? Need support? Want to hire me for a gig?
 
- - 📧 Email: owensingh72@proton.me
- - 🐦 Twitter: [@owensingh_](https://x.com/owensingh_)
- - 💬 Website: [Portfolio](https://www.owensingh.com)
-
- ## 🚀 Get Started With Scrapeless
-
- ### 🆓 Free Trial
-
- 1. **📝 [Sign Up](https://app.scrapeless.com/signup?utm_source=owen)** - No credit card required
- 2. **🔑 Get API Key** - Instant access to all features
- 3. **📦 Install SDK** - Choose your preferred language
- 4. **📖 Follow Quick Start** - Working in 5 minutes
- 5. **📈 Scale Up** - Upgrade when ready
-
- ### 🏢 Enterprise Contact
-
- - **💰 Custom Pricing** - Volume discounts available
- - **👨‍💼 Dedicated Support** - Named customer success manager
- - **📋 SLA Guarantees** - 99.99% uptime commitment
- - **🏗️ On-premise Options** - Private cloud deployment
- - **📧 Email**: market@scrapeless.com
-
- ### 🌐 Connect With Scrapeless Devs
-
- - **🌐 Website**: [scrapeless.com](https://www.scrapeless.com?utm_source=owen)
- - **📚 Documentation**: [docs.scrapeless.com](https://docs.scrapeless.com)
- - **💬 Discord**: [Discord Community](https://discord.com/invite/xBcTfGPjCQ)
- - **💼 LinkedIn**: [Follow Us](https://www.linkedin.com/company/scrapeless/posts/?feedView=all)
- - **🐦 Twitter**: [Follow Us](https://x.com/Scrapelessteam)
- - **📧 Email**: market@scrapeless.com
-
- <p align="center">
- <a href="https://www.thordata.com/?ls=VNSCxroa&lk=CyberScraper">
- <img src="https://i.postimg.cc/dtwTvm5V/728-x-90-2.gif" alt="Collect-web-data-728x90" border="0">
- </a>
- </p>
+ - Email: owensingh72@proton.me
+ - Website: [owen.sh](https://owen.sh)
 
  ## 🚨 Disclaimer
 
@@ -618,5 +393,5 @@ Remember, samurai: In the dark future of the NET, knowledge is power, but it's a
  </p>
 
  <p align="center">
- Built with ❤️ and chrome by the streets of Night City | © 2077 Owen Singh
+ Built with love and chrome by the streets of Night City | © 2077 Owen Singh
  </p>
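The page-range syntax kept in the README diff above (`URL START-END`, with an optional `{page}` placeholder) can be sketched with a small helper. This function is hypothetical, not part of the repository; the `?page=N` fallback for URLs without a placeholder is an assumption for illustration only.

```python
def expand_page_range(spec: str) -> list[str]:
    """Expand a 'URL START-END' spec into concrete per-page URLs.

    If the URL contains a '{page}' placeholder, the page number is
    substituted there; otherwise a '?page=N' query parameter is
    appended (an assumed fallback, not the tool's actual behavior).
    """
    url, _, pages = spec.rpartition(" ")
    start, end = (int(n) for n in pages.split("-"))
    urls = []
    for n in range(start, end + 1):
        if "{page}" in url:
            urls.append(url.replace("{page}", str(n)))
        else:
            urls.append(f"{url}?page={n}")
    return urls
```

For example, `expand_page_range("https://example.com/products 1-5")` yields five per-page URLs, matching the `1-5` range shown in the usage examples.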
src/prompts.py CHANGED
@@ -26,9 +26,10 @@ IMPORTANT: Always return JSON format for ANY export request. The system will aut
 
  ## Rules for Data Export
  - Return ONLY the JSON array, no explanations or additional text
+ - Extract ALL matching items from the entire content (including all pages if multipage)
  - Include all requested fields; use "N/A" if not found
  - Never invent data not present in the content
- - Limit entries if a specific count is requested
+ - Only limit entries if a specific count is explicitly requested by the user
  - Use relevant field names based on content and query
 
  ## Conversational Mode
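The prompt rule added above assumes the model receives all scraped pages as one document. A minimal sketch of that assumption (a hypothetical helper, not taken from `src/prompts.py`) is joining per-page content with explicit markers before it reaches the extraction prompt:

```python
def combine_pages(pages: list[str]) -> str:
    """Join per-page scraped content with page markers so the
    extraction prompt sees the entire multipage content at once
    and can extract ALL matching items across every page."""
    parts = [
        f"--- Page {i} ---\n{content}"
        for i, content in enumerate(pages, start=1)
    ]
    return "\n\n".join(parts)
```

With the pages concatenated this way, the instruction to extract all items "from the entire content (including all pages if multipage)" applies to a single payload rather than one page at a time.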