raccy 2.0.0

Creator: railscoder56

Last updated:

Add to Cart

Description:

raccy 2.0.0

RACCY
OVERVIEW
Raccy is a multithreaded web scraping library based on selenium.
It can be used for web automation, web scraping, and
data mining.
REQUIREMENTS

Python 3.7+
Works on Linux, Windows, and Mac

ARCHITECTURE OVERVIEW


UrlDownloaderWorker: resonsible for downloading item(s) to be scraped urls and enqueue(s) them in ItemUrlQueue


ItemUrlQueue: receives item urls from UrlDownloaderWorker and enqueues them
for feeding them to CrawlerWorker


CrawlerWorker: fetches item web pages and scrapes or extract data from them and enqueues the data in DatabaseQueue


DatabaseQueue: receives scraped item data from CrawlerWorker(s) and enques them
for feeding them to DatabaseWorker.


DatabaseWorker: receives scraped data from DatabaseQueue and stores it in a persistent database.


INSTALL
pip install raccy

TUTORIAL
from raccy import (
UrlDownloaderWorker, CrawlerWorker, DatabaseWorker, WorkersManager
)
import ro as model
from selenium import webdriver
from shutil import which

config = model.Config()
config.DATABASE = model.SQLiteDatabase('quotes.sqlite3')


class Quote(model.Model):
quote_id = model.PrimaryKeyField()
quote = model.TextField()
author = model.CharField(max_length=100)


class UrlDownloader(UrlDownloaderWorker):
start_url = 'https://quotes.toscrape.com/page/1/'
max_url_download = 10

def job(self):
url = self.driver.current_url
self.url_queue.put(url)
self.follow(xpath="//a[contains(text(), 'Next')]", callback=self.job)


class Crawler(CrawlerWorker):

def parse(self, url):
self.driver.get(url)
quotes = self.driver.find_elements_by_xpath("//div[@class='quote']")
for q in quotes:
quote = q.find_element_by_xpath(".//span[@class='text']").text
author = q.find_element_by_xpath(".//span/small").text

data = {
'quote': quote,
'author': author
}
self.log.info(data)
self.db_queue.put(data)


class Db(DatabaseWorker):

def save(self, data):
Quote.objects.create(**data)


def get_driver():
driver_path = which('.\\chromedriver.exe')
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument("--start-maximized")
driver = webdriver.Chrome(executable_path=driver_path, options=options)
return driver


if __name__ == '__main__':
manager = WorkersManager()
manager.add_driver(get_driver)
manager.start()
print('Done scraping...........')

Author

Afriyie Daniel

Hope You Enjoy Using It !!!!

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.