PrepGem 1.0.6
PrepGem
PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.
Features
PrepGem offers the following features:
Handle Missing Values: Easily handle missing values in specified DataFrame columns.
Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
Remove URLs: Remove URLs from text or DataFrame columns.
Remove Punctuation: Remove punctuation from text or DataFrame columns.
Remove Emojis: Remove emojis from text or DataFrame columns.
Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
Remove Numbers: Remove numbers from text or DataFrame columns.
Lowercasing: Convert text to lowercase in text or DataFrame columns.
Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
Spell Correction: Perform spell-checking on text or DataFrame columns.
Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
Tokenize: Tokenize text using NLTK's word_tokenize function.
Remove Stopwords: Remove stopwords from text tokens.
Stemming: Perform stemming on text tokens.
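To give a rough idea of what a few of these cleaning steps involve, here is a minimal standard-library sketch. It is not PrepGem's implementation: the function names strip_punctuation, collapse_whitespace, and squeeze_repeats are hypothetical, and the real package may treat edge cases differently.

```python
import re
import string


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def squeeze_repeats(text: str, max_run: int = 2) -> str:
    """Shorten any run of one repeated character to at most max_run copies."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, text)


sample = "Sooooo   goood!!!  "
# Apply the hypothetical steps one after another.
result = collapse_whitespace(squeeze_repeats(strip_punctuation(sample.lower())))
print(result)
```

Capping repeats at two characters rather than one keeps legitimate double letters such as the "oo" in "good" intact.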
Installation
You can install PrepGem via pip:
pip install prepgem
Usage
Importing the module
import prepgem
Basic Usage
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)
Preprocessing a single text
text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)
Preprocessing a DataFrame
import pandas as pd
# Create a sample DataFrame
data = {
'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)
# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)
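For readers curious how column-wise cleaning can work in plain pandas, the sketch below fills missing values and then applies a text function to each selected column. The helper clean_columns and its fill_value parameter are hypothetical and only illustrate the idea. They are not part of PrepGem.

```python
import pandas as pd


def clean_columns(df, columns, clean, fill_value=""):
    """Return a copy of df with `clean` applied to each listed column.

    Missing values are replaced with fill_value first so that the
    cleaning function always receives a string.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(fill_value).astype(str).map(clean)
    return out


df = pd.DataFrame({"text_column": ["This is an example text.", None]})
cleaned = clean_columns(df, ["text_column"], str.lower)
print(cleaned)
```

Working on a copy leaves the original DataFrame untouched, which makes it easy to compare the text before and after cleaning.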
Default preprocessing pipeline
The default pipeline consists of the following steps:
clean_html_text
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)
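A pipeline like the one listed above can be pictured as an ordered list of functions, each receiving the output of the previous one. The sketch below uses simplified stand-in steps written with the standard library. The names drop_tags, drop_urls, lowercase, squeeze_spaces, and run_pipeline are hypothetical and do not come from PrepGem.

```python
import re
from typing import Callable, List

Step = Callable[[str], str]


def drop_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def drop_urls(text: str) -> str:
    """Replace http(s):// and www. links with a space."""
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)


def lowercase(text: str) -> str:
    return text.lower()


def squeeze_spaces(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Order matters: tags and URLs are removed while their markers are still present.
DEFAULT_STEPS: List[Step] = [drop_tags, drop_urls, lowercase, squeeze_spaces]


def run_pipeline(text: str, steps: List[Step] = DEFAULT_STEPS) -> str:
    """Apply each step in sequence and return the final text."""
    for step in steps:
        text = step(text)
    return text


example = "This is an example text with <html> tags and URLs: https://example.com."
print(run_pipeline(example))
```

The ordering mirrors the list above, where HTML and URL removal come before punctuation removal: stripping punctuation first would delete the angle brackets and the "://" that later steps rely on to recognize tags and links.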
Custom preprocessing pipeline
You can customize the preprocessing steps by passing a list of step names as the pipeline argument of preprocess_text. Available steps include:
clean_html_text
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
remove_nonsense_words
spell_corrector
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming
handle_missing_values
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)
You can also remove steps instead of selecting them: pass remove=True to preprocess_text together with the pipeline argument, and the steps listed in pipeline are removed rather than applied.
Example usage
text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)
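One way to picture the remove=True option is as a filter over the default list of step names. The sketch below assumes that reading of the documentation. The helper resolve_steps is hypothetical, and PrepGem's actual logic may differ.

```python
from typing import List, Optional

# Step names copied from the default pipeline listed in this document.
DEFAULT_PIPELINE = [
    "clean_html_text",
    "remove_urls",
    "remove_punctuation",
    "remove_emojis",
    "remove_foreign_letters",
    "remove_numbers",
    "lowercasing",
    "remove_white_spaces",
    "remove_repeated_characters",
    "nosense_words_and_spell_check",
    "tokenize",
    "remove_stopwords",
    "stemming",
]


def resolve_steps(pipeline: Optional[List[str]] = None, remove: bool = False) -> List[str]:
    """Work out which step names would run (assumed semantics).

    - no pipeline given: run every default step
    - remove=False: run only the listed steps
    - remove=True: run the defaults minus the listed steps
    """
    if pipeline is None:
        return list(DEFAULT_PIPELINE)
    if remove:
        excluded = set(pipeline)
        return [name for name in DEFAULT_PIPELINE if name not in excluded]
    return list(pipeline)


print(resolve_steps(["clean_html_text"], remove=True))
```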
You can also use each step as a standalone function by passing it the text, or the DataFrame containing the text column, to be cleaned:
from prepgem import remove_urls
# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."
# Remove URLs from the text
cleaned_text = remove_urls(text_with_urls)
print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)
This will output:
Original text:
This is an example text with URLs: https://example.com and http://www.example.org.
Text after removing URLs:
This is an example text with URLs: and .
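The leftover "and ." in the output above is typical of pattern-based URL removal: only the links are deleted, while the surrounding words and punctuation stay. The following sketch reproduces that behavior with a regular expression. The function strip_urls and its pattern are assumptions for illustration and are not taken from PrepGem's source.

```python
import re

# Match http:// or https:// links, stopping before an optional trailing
# punctuation mark that is followed by whitespace or the end of the text.
URL_PATTERN = re.compile(r"https?://\S+?(?=[.,;:!?]?(?:\s|$))")


def strip_urls(text: str) -> str:
    """Delete URLs but keep the surrounding words and punctuation."""
    return URL_PATTERN.sub("", text)


text_with_urls = (
    "This is an example text with URLs: https://example.com "
    "and http://www.example.org."
)
without_urls = strip_urls(text_with_urls)
# Deleting a URL leaves its neighboring spaces behind, so normalize them.
normalized = " ".join(without_urls.split())
print(normalized)
```

Running a whitespace-normalizing step after URL removal tidies up the gaps left where the links used to be.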
For personal and professional use. You cannot resell or redistribute these repositories in their original state.