parsa 1.1.5

A text parser that doesn't care about your file extensions

Key Features • Supported Formats • Installation • Usage • Related projects • Contributing • MIT License

Parsa is a textract-based CLI text parser that supports multiple file extensions.
It takes any number of inputs, and outputs them to .txt files in a directory of choice, preserving the structure of the original text.
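The core step described above (extract text with textract, then write it to a .txt file) might look roughly like the following. This is a minimal Python 3 sketch, not parsa's actual source: `parse_to_txt` and its `extract` parameter are hypothetical names introduced here, and the only textract call assumed is `textract.process(path)`, which returns the extracted text as bytes.

```python
from pathlib import Path


def parse_to_txt(input_path, output_dir, extract=None):
    """Extract text from input_path and write it to <output_dir>/<stem>.txt.

    `extract` is a hypothetical hook (any callable taking a path string and
    returning bytes); it defaults to textract.process.
    """
    if extract is None:
        # Imported lazily so the sketch can be exercised without textract installed.
        import textract
        extract = textract.process  # returns the extracted text as bytes

    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    raw = extract(str(input_path))
    out_path = output_dir / (input_path.stem + ".txt")
    # Write the bytes untouched so line breaks in the extracted text are kept.
    out_path.write_bytes(raw)
    return out_path
```

Passing a stand-in callable as `extract` makes the file handling easy to try out without any of textract's system dependencies.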

Key features

Extends textract's functionality to handle multiple inputs and to save the output automatically
Takes an arbitrary number of inputs of different filetypes and processes every supported one in the same way
Outputs the parsed text from the input files individually to corresponding .txt files, with the option of selecting a custom output path
Includes a simple naming system that never overwrites existing files: a new output file gets a different name instead
Supports over 20 of the most common formats (see Supported formats for more)
Preserves the structure of document file formats (.docx, .pdf, ...)
Supports audio formats (.wav, .mp3, ...) via the speech recognition tools sox, SpeechRecognition and pocketsphinx
Supports image formats (.jpg, .png, ...), via the optical character recognition (OCR) tool tesseract-ocr
Prompts the user for an input file's extension if it's not explicitly present; this feature can be turned off via --noprompt
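The overwrite-avoiding naming mentioned in the list above could be sketched as follows. The README does not spell out the actual scheme, so the numeric suffix (`report_1.txt`, `report_2.txt`, ...) and the function name `unique_output_path` are assumptions made purely for illustration.

```python
import os


def unique_output_path(folder, stem, ext=".txt"):
    """Return a path inside `folder` that does not exist yet.

    Assumed scheme (not confirmed by parsa): try <stem>.txt first, then
    <stem>_1.txt, <stem>_2.txt, ... until a free name is found.
    """
    candidate = os.path.join(folder, stem + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, "{}_{}{}".format(stem, counter, ext))
        counter += 1
    return candidate
```

The check and the later write are two separate steps, so this sketch is only safe when a single process writes to the output folder.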

Supported formats
See this page from textract's documentation for a full list of the supported formats and their linked dependencies.
Installation
System requirements

Linux
Python 2.7 or 3.x (any Python 3 version)

Linux
Via pip:
$ pip install parsa

Or, if you prefer, you can install it from source:
# Clone the repository
$ git clone https://github.com/rdimaio/parsa

# Go into the parsa folder
$ cd parsa

# Install parsa
$ python setup.py install

Tests
$ python -m unittest discover tests

Usage
Single input
# Basic usage
$ parsa path/to/input_file
# The output will be saved inside the input file's parent folder.

Multi input
# Basic usage
$ parsa path/to/input_folder
# The output will be saved inside a folder named `parsaoutput` in the input folder.

Optional: custom output folder
# Basic usage
$ parsa path/to/input -o path/to/output_folder
# Works with both single and multi input.
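The output-folder rules from the three usage cases above (input file's parent folder, `parsaoutput` inside an input folder, or an explicit `-o` folder) can be summarised in a few lines. This is a sketch under the assumption that the rules are applied exactly as documented; `resolve_output_dir` is a hypothetical helper name, not part of parsa.

```python
import os


def resolve_output_dir(input_path, output=None):
    """Pick the folder where the .txt files would go.

    - explicit `output` wins (the -o option)
    - folder input  -> <input_folder>/parsaoutput
    - file input    -> the file's parent folder
    """
    if output:
        return os.path.abspath(output)
    input_path = os.path.abspath(input_path)
    if os.path.isdir(input_path):
        return os.path.join(input_path, "parsaoutput")
    return os.path.dirname(input_path)
```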

Optional: ignore files without an explicit extension
# Basic usage
$ parsa --noprompt path/to/input
# Useful for situations where your input includes log/system files without an extension.
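How the recursive scan and `--noprompt` might interact can be illustrated with the Python 3 sketch below. It assumes that a file "without an explicit extension" is one for which `os.path.splitext` returns an empty suffix, and that an empty answer at the prompt skips the file; `collect_inputs` and its `ask` parameter are hypothetical names, not parsa's API.

```python
import os


def collect_inputs(input_path, noprompt=False, ask=input):
    """Return (path, extension) pairs for every file to be parsed.

    Folders are walked recursively. Files without an extension are skipped
    when `noprompt` is set; otherwise `ask` is called to obtain one.
    """
    if os.path.isfile(input_path):
        candidates = [input_path]
    else:
        candidates = [
            os.path.join(root, name)
            for root, _dirs, files in os.walk(input_path)
            for name in sorted(files)
        ]

    results = []
    for path in candidates:
        ext = os.path.splitext(path)[1]
        if not ext:
            if noprompt:
                continue  # behaves like --noprompt: silently ignore the file
            ext = ask("Extension for {}: ".format(path)).strip()
            if not ext:
                continue  # assumption: an empty answer skips the file
            if not ext.startswith("."):
                ext = "." + ext
        results.append((path, ext.lower()))
    return results
```

Injecting `ask` instead of calling `input` directly keeps the prompt replaceable, which is convenient in scripts and tests.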

Full help message
$ parsa --help
usage: parsa [-h] [--noprompt] [--output [OUTPUT]] input

Textract-based text parser that supports most text file extensions. Parsa can
parse multiple formats at once, writing them to .txt files in the directory of
choice.

positional arguments:
  input                 input file or folder; if a folder is passed as input,
                        parsa will scan every file inside it recursively
                        (scanning subfolders as well)

optional arguments:
  -h, --help            show this help message and exit
  --noprompt, -n        ignore files without an extension and don't prompt the
                        user to input their extension
  --output [OUTPUT], -o [OUTPUT]
                        folder where the output files will be stored. The default folder is:
                        (a) the input file's parent folder, if the input is a file, or
                        (b) a folder named 'parsaoutput' located in the input folder, if the input is a folder.
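
For readers who want to script around the same interface, a command line with this shape can be declared with Python's standard `argparse` module. The sketch below is reconstructed from the help message alone and is an assumption about how such a parser could be built, not a copy of parsa's source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the options listed in the help message above."""
    parser = argparse.ArgumentParser(
        prog="parsa",
        description="Textract-based text parser that supports most text "
                    "file extensions.",
    )
    parser.add_argument(
        "input",
        help="input file or folder; folders are scanned recursively",
    )
    parser.add_argument(
        "--noprompt", "-n",
        action="store_true",
        help="ignore files without an extension instead of prompting",
    )
    parser.add_argument(
        "--output", "-o",
        nargs="?",       # rendered as [OUTPUT] in the usage line
        default=None,    # None means: fall back to the default folder
        help="folder where the output files will be stored",
    )
    return parser
```

With this definition, `build_parser().parse_args(["path/to/input", "-o", "out"])` yields `input="path/to/input"`, `output="out"` and `noprompt=False`.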

Related projects

parsa-gui - Graphical version of parsa (WIP)
xparsa - Extended parsa, enhanced with statistics about the parsed files (WIP)
xparsa-gui - GUI for xparsa (WIP)

Contributing
Pull requests are welcome! If you would like to include/remove/change a major feature, please open an issue first.
License
This project is licensed under the MIT License - see the LICENSE file for details.