data-preprocessors 0.58.0

Creator: bradpython12

Description:

Data Preprocessors
An easy-to-use tool for data preprocessing, especially text preprocessing.






Table of Contents

Installation
Quick Start
Features

Split Textfile
Build Parallel Corpus
Separate Parallel Corpus
Decontracting Words from Sentence
Remove Punctuation
Space Punctuation
Text File to List
Text File to Dataframe
List to Text File
Remove File
Count Characters of a Sentence
Count Words of Sentence
Count No of Lines in a Text File
Convert Excel to Multiple Text Files
Merge Multiple Text Files
Apply Any Function in a Full Text File



Installation
Install the latest stable release
For Windows
pip install -U data-preprocessors

For Linux/WSL2
pip3 install -U data-preprocessors

Quick Start
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

>> bla bla bla bla

Features
Split Textfile
This function splits a text file into three separate text files: train, validation, and test. Use the shuffle and seed values to randomly shuffle the lines of the text file.
from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)

# Total lines: 500
# Train set size: 300
# Validation set size: 100
# Test set size: 100
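
To make the splitting step concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper name split_lines and its rounding behavior are assumptions made for this example, and it works on an in-memory list instead of files.

```python
import random


def split_lines(lines, train_size=0.6, val_size=0.2, shuffle=True, seed=42):
    """Split a list of lines into train, validation and test parts.

    Hypothetical helper for illustration only; the test part receives
    whatever remains after the train and validation slices.
    """
    lines = list(lines)  # copy so the caller's list is not reordered
    if shuffle:
        random.Random(seed).shuffle(lines)  # seeded, so the order is repeatable
    n_train = round(len(lines) * train_size)
    n_val = round(len(lines) * val_size)
    train = lines[:n_train]
    val = lines[n_train:n_train + n_val]
    test = lines[n_train + n_val:]
    return train, val, test


lines = [f"line {i}" for i in range(500)]
train, val, test = split_lines(lines)
print(len(train), len(val), len(test))  # 300 100 100
```

Calling the sketch again with the same seed gives the same split, while a different seed reorders the lines before they are sliced.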

Separate Parallel Corpus
This function separates a combined src_tgt_file into a separate src_file and tgt_file.
from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
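
The idea behind the separation can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name separate_pairs is hypothetical, it works on a list of strings instead of files, and it skips lines that lack the separator.

```python
def separate_pairs(pairs, separator="|||"):
    """Split 'source ||| target' lines into two aligned lists.

    Hypothetical helper for illustration only; lines without the
    separator are skipped here, which may differ from the library.
    """
    sources, targets = [], []
    for line in pairs:
        if separator not in line:
            continue
        src, tgt = line.split(separator, 1)  # split on the first separator only
        sources.append(src.strip())
        targets.append(tgt.strip())
    return sources, targets


pairs = ["hello ||| bonjour", "thank you ||| merci"]
sources, targets = separate_pairs(pairs)
print(sources)  # ['hello', 'thank you']
print(targets)  # ['bonjour', 'merci']
```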

Decontracting Words from Sentence
from data_preprocessors import text_preprocessor as tp
tp.decontracting_words(sentence)
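
The function name suggests that contractions such as "can't" are expanded into their full forms. Assuming that is the intent, a minimal sketch of the technique looks like this. The helper name expand_contractions and the small replacement table are assumptions for this example and do not reflect the library's actual rules.

```python
import re

# A small, deliberately incomplete table of English contractions.
# Specific cases come first so the generic "n't" rule does not catch them.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def expand_contractions(sentence):
    """Expand the contractions listed in CONTRACTIONS (hypothetical helper)."""
    for pattern, replacement in CONTRACTIONS:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence


expanded = expand_contractions("I can't go, but they'll say we're late.")
print(expanded)  # I cannot go, but they will say we are late.
```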

Remove Punctuation
This function removes the punctuation from a single line of a text file.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla
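
For readers curious how punctuation removal can be done with the standard library alone, here is a minimal sketch. The helper name strip_punctuation is hypothetical, and it only handles ASCII punctuation, which may differ from what tp.remove_punc does.

```python
import string


def strip_punctuation(sentence):
    """Delete every ASCII punctuation character from the sentence."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table)


cleaned = strip_punctuation("bla! bla- ?bla ?bla.")
print(cleaned)  # bla bla bla bla
```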

Space Punctuation
This function adds one space on both sides of each punctuation mark, which makes the sentence easier to tokenize. It operates on a single line of a text file, but it can also be applied to a full text file.
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)

# The printed sentence has each punctuation mark surrounded by spaces.
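
The padding technique described above can be sketched with a regular expression. This is an illustration, not the library's implementation: the helper name pad_punctuation is hypothetical, it covers ASCII punctuation only, and collapsing repeated spaces is a choice made for this example.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
_PUNC = re.compile(f"([{re.escape(string.punctuation)}])")


def pad_punctuation(sentence):
    """Put a space on both sides of each punctuation mark."""
    padded = _PUNC.sub(r" \1 ", sentence)
    return " ".join(padded.split())  # collapse runs of whitespace


spaced = pad_punctuation("bla! bla- ?bla ?bla.")
print(spaced)  # bla ! bla - ? bla ? bla .
```

After padding, a plain spaced.split() yields punctuation marks as separate tokens.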

Text File to List
Convert any text file into a list.
from data_preprocessors import text_preprocessor as tp
mylist = tp.text2list(myfile_path="myfile.txt")

List to Text File
Convert any list into a text file (filename.txt).
from data_preprocessors import text_preprocessor as tp
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
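
The two conversions are inverses of each other. The sketch below shows one way such a round trip can work, assuming one list item per line; the helper names read_lines and write_lines are hypothetical, and the library's handling of newlines and encodings may differ.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file into a list, one item per line, without newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(items, path):
    """Write each list item to the file on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(f"{item}\n")


# Use a temporary directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
write_lines(["first line", "second line"], path)
roundtrip = read_lines(path)
print(roundtrip)  # ['first line', 'second line']
```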

Count Characters of a Sentence
This function counts the total number of characters in a sentence.
from data_preprocessors import text_preprocessor as tp
tp.count_chars(myfile="file.txt")
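
Since the call above takes a file path, here is a minimal sketch of counting the characters in a text file. The helper name count_file_chars and its include_newlines flag are assumptions for this example; whether the library counts newlines or spaces is not stated here.

```python
import os
import tempfile


def count_file_chars(path, include_newlines=False):
    """Count the characters in a text file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if not include_newlines:
        text = text.replace("\n", "")  # ignore line breaks by default
    return len(text)


sample_path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("abc\nde\n")

print(count_file_chars(sample_path))                         # 5
print(count_file_chars(sample_path, include_newlines=True))  # 7
```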

Convert Excel to Multiple Text Files
This function converts the columns of an Excel file into multiple text files.
from data_preprocessors import text_preprocessor as tp
tp.excel2multitext(
    excel_file_path="",
    column_names=None,
    src_file="",
    tgt_file="",
    aligns_file="",
    separator="|||",
    src_tgt_file="",
)

Apply Any Function in a Full Text File
In place of function_name you can pass any function, and that function will be applied to the whole text file.
from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name,
    myfile_path="myfile.txt",
    modified_file_path="modified_file.txt"
)
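
Conceptually, applying a function to a whole file means running it over the file's content and writing the result to a new file. The sketch below assumes the function is applied line by line, which is not confirmed for tp.apply_whole; the helper name apply_to_file is hypothetical.

```python
import os
import tempfile


def apply_to_file(func, in_path, out_path):
    """Apply func to each line of in_path and write the results to out_path."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(func(line.rstrip("\n")) + "\n")


# Build a small input file in a temporary directory.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "myfile.txt")
out_path = os.path.join(workdir, "modified_file.txt")
with open(in_path, "w", encoding="utf-8") as f:
    f.write("hello world\nsecond line\n")

apply_to_file(str.upper, in_path, out_path)
with open(out_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

Any function that takes a string and returns a string, such as a punctuation remover, can be passed in place of str.upper.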

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
