parallelhtmlscraper 0.1.0
Parallel HTML Scraper
Helps you scrape HTML files in parallel without async / await syntax.
Features
Scrape multiple HTML pages in parallel without having to write async / await code yourself.
Installation
pip install parallelhtmlscraper
Usage
Minimum example:
from bs4 import BeautifulSoup

from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper

class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text

host_google = 'https://www.google.com'
path_and_content = [
    '',                                                          # Google Search
    '/imghp?hl=EN',                                              # Google Images
    '/shopping?hl=en',                                           # Google Shopping
    '/save',                                                     # Collection
    'https://www.google.com/maps?hl=en',                         # Google Maps
    'https://www.google.com/drive/apps.html',                    # Google Drive
    'https://www.google.com/mail/help/intl/en/about.html?vm=r',  # Gmail
]

list_response = ParallelHtmlScraper.execute(host_google, path_and_content, AnalyzerExample())
print(list_response)
$ pipenv run python test.py
['\n Gmail - Email from Google\n ', 'Google Images', ' Google Maps ', 'Using Google Drive - New Features, Benefits & Advantages of Google Cloud Storage', 'Google Shopping', 'Google', 'Collections']
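Note that the URL list above mixes relative paths and absolute URLs on the same host; both resolve to requests against the base URL. The standard library's urljoin illustrates how such mixed entries resolve (this is only an illustration of URL resolution, not part of the package's API):

```python
from urllib.parse import urljoin

host_google = 'https://www.google.com'
paths = ['', '/imghp?hl=EN', 'https://www.google.com/maps?hl=en']

# Relative paths join onto the host; absolute URLs are kept as-is.
resolved = [urljoin(host_google, path) for path in paths]
print(resolved)
```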
API
ParallelHtmlScraper.execute
class ParallelHtmlScraper:
    """API of parallel HTML scraping."""

    @staticmethod
    def execute(
        base_url: str,
        list_url: Iterable[str],
        analyzer: HtmlAnalyzer[_T],
        *,
        limit: int = 5,
        interval: int = 1,
    ) -> List[_T]:
base_url: str
    Common part of the request URLs.
    This helps when downloading URLs extracted from HTML.
list_url: Iterable[str]
    List of URLs. The method downloads them in parallel.
    Absolute URLs sharing the same base as base_url can also be specified.
analyzer: HtmlAnalyzer[_T]
    An instance extending HtmlAnalyzer that analyzes HTML using BeautifulSoup.
    The following example illustrates its role:

    class AnalyzerExample(HtmlAnalyzer):
        async def execute(self, soup: BeautifulSoup) -> str:
            return soup.find('title').text

limit: int = 5
    Maximum number of parallel requests.
interval: int = 1
    Interval between each request (seconds).
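The limit and interval parameters suggest a bounded-concurrency pattern built on a semaphore. A minimal stdlib sketch of that behavior (an assumption about the internals, not the package's actual code; fetch_title is a hypothetical stub standing in for a real HTTP request):

```python
import asyncio
from typing import List

async def fetch_title(url: str) -> str:
    # Stub: a real implementation would perform an HTTP request
    # and parse the response with BeautifulSoup.
    await asyncio.sleep(0)
    return f'title of {url}'

async def scrape_all(urls: List[str], limit: int = 5, interval: int = 1) -> List[str]:
    semaphore = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async def bounded(url: str) -> str:
        async with semaphore:
            result = await fetch_title(url)
            await asyncio.sleep(interval)  # pause between requests
            return result

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

titles = asyncio.run(scrape_all(['/a', '/b', '/c'], limit=2, interval=0))
print(titles)
```

The semaphore caps how many requests run concurrently, while the sleep spaces requests out; results come back in the same order as the input list.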