PrepGem 1.0.6

Creator: railscoder56

Last updated:

Add to Cart

Description:

PrepGem 1.0.6

PrepGem
PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.
Features
PrepGem offers the following features:

Handle Missing Values: Easily handle missing values in specified DataFrame columns.
Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
Remove URLs: Remove URLs from text or DataFrame columns.
Remove Punctuation: Remove punctuation from text or DataFrame columns.
Remove Emojis: Remove emojis from text or DataFrame columns.
Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
Remove Numbers: Remove numbers from text or DataFrame columns.
Lowercasing: Convert text to lowercase in text or DataFrame columns.
Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
Spell Correction: Perform spell-checking on text or DataFrame columns.
Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
Tokenize: Tokenize text using NLTK's word_tokenize function.
Remove Stopwords: Remove stopwords from text tokens.
Stemming: Perform stemming on text tokens.

Installation
You can install PrepGem via pip:
pip install prepgem

Usage
Importing the module python
import prepgem

Basic Usage
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Preprocessing a single text
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)

Preprocessing a DataFrame
import pandas as pd

# Create a sample DataFrame
data = {
'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)

# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)

Default preprocessing pipeline
Default available preprocessing step is:

clean_html_text.
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Custom preprocessing pipeline
You can customize the preprocessing steps by passing a list of parameters to the preprocess_text method. Available parameters include:

clean_html_text.
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
remove_nonsense_words
spell_corrector
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming
handle_missing_values

Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)

You can customize the preprocessing steps by passing a parameter remove with value of True remove=True to the preprocess_text method to remove a step. Available parameters include:
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)

You can use all step as normal function just by passing The text or DataFrame containing the text column to be cleaned
from prepgem import remove_urls

# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."

# Remove URLs from the text

cleaned_text = remove_urls(text_with_urls)

print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)

This will output:
Original text:
This is an example text with URLs: https://example.com and http://www.example.org.

Text after removing URLs:
This is an example text with URLs: and .

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.