pdfsim 0.3
PDF Similarity Matcher
The PDF Similarity Matcher is a command-line tool that finds PDF documents similar to a given input PDF. It extracts text from each file, derives features from that text, and scores every PDF in a directory against the input so you can see the closest matches.
Features
Extracts text from PDF files.
Processes and compares features from multiple PDFs.
Calculates similarity scores between an input PDF and PDFs in the directory.
Optionally displays detailed key-value feature information for similar PDFs.
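The README does not say how the similarity scores are computed. As a rough illustration of the idea, the sketch below assumes TF-IDF weighting with cosine similarity, a common choice for comparing text. It is written with the standard library only, as a stand-in for what scikit-learn would normally provide, and the function names (tokenize, tfidf_vectors, cosine, rank_similar) are hypothetical and not taken from pdfsim. It works on already-extracted text, so no PDF parsing is involved.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict of term -> weight) per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    # Smoothed IDF: terms found in every document keep a small positive weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(input_text, database, top_n=1):
    """Rank texts in `database` (file name -> extracted text) against the input."""
    names = list(database)
    vectors = tfidf_vectors([input_text] + [database[n] for n in names])
    query, rest = vectors[0], vectors[1:]
    scored = [(name, cosine(query, vec)) for name, vec in zip(names, rest)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Made-up texts standing in for text extracted from PDFs.
    database = {
        "invoice_a.pdf": "invoice total amount due payment",
        "manual.pdf": "installation guide for the printer driver",
    }
    for name, score in rank_similar("invoice payment due", database, top_n=2):
        print(f"{name}: {score:.3f}")
```

A document that shares no terms with the input scores 0, and the score approaches 1 as the term distributions converge, which is why a top-N cutoff is a natural way to present the results.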
Installation
Follow these steps to install and set up the PDF Similarity Matcher:
Clone the repository:
git clone https://github.com/yourusername/pdfsim.git
cd pdfsim
Create a virtual environment:
python3 -m venv venv
Activate the virtual environment:
On Windows:
venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate
Install the required packages:
pip install -r requirements.txt
Ensure requirements.txt includes the necessary libraries:
PyPDF2
scikit-learn
nltk
Usage
To find similar PDFs, use the following command:
python3 main.py -d <directory_containing_pdf> -i <input_pdf> [-t <top_n>] [-kv]
Arguments
-d, --database (required): Path to the directory containing PDF files to compare against.
-i, --input (required): Path to the input PDF file you want to compare.
-t, --top (optional, default: 1): Number of top similar PDFs to display.
-kv (optional): Enable detailed key-value feature output for similar PDFs.
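For readers who want to see how the arguments above fit together, here is a minimal sketch of an equivalent parser built with Python's argparse module. It is an assumption about how main.py could be wired, not a copy of its code; build_parser and the key_value attribute name are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical wiring)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Find PDFs similar to an input PDF.",
    )
    parser.add_argument(
        "-d", "--database", required=True,
        help="directory containing the PDF files to compare against",
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="input PDF file to compare",
    )
    parser.add_argument(
        "-t", "--top", type=int, default=1,
        help="number of top similar PDFs to display (default: 1)",
    )
    # argparse accepts a single-dash, multi-character flag such as -kv.
    parser.add_argument(
        "-kv", dest="key_value", action="store_true",
        help="show detailed key-value feature output",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.database, args.input, args.top, args.key_value)
```

With this layout, omitting -t falls back to a single result, and -kv acts as a plain on/off switch that takes no value.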
For personal and professional use. You cannot resell or redistribute these repositories in their original state.