datafog 4.0.0

Open-source PII Detection & Anonymization.

Installation
DataFog can be installed via pip:
pip install datafog

CLI
📚 Quick Reference

Command              Description
scan-text            Analyze text for PII
scan-image           Extract and analyze text from images
redact-text          Redact PII in text
replace-text         Replace PII with anonymized values
hash-text            Hash PII in text
health               Check service status
show-config          Display current settings
download-model       Get a specific spaCy model
list-spacy-models    Show available models
list-entities        View supported PII entities

🔍 Detailed Usage
Scanning Text
To scan and annotate text for PII entities:
datafog scan-text "Your text here"

Example:
datafog scan-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"

Scanning Images
To extract text from images and optionally perform PII annotation:
datafog scan-image "path/to/image.png" --operations extract

Example:
datafog scan-image "nokia-statement.png" --operations extract

To extract text and annotate PII:
datafog scan-image "nokia-statement.png" --operations scan

Redacting Text
To redact PII in text:
datafog redact-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"

which should output:
[REDACTED] is the CEO of [REDACTED] and is based out of [REDACTED], [REDACTED]
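As a rough illustration of what redaction amounts to once entities have been located, the sketch below swaps known substrings for the [REDACTED] placeholder. It is not DataFog's implementation: the entity list is hard-coded as an assumption (the CLI detects entities itself), and redact is a hypothetical helper written for this example.

```python
def redact(text: str, entities: list[str], placeholder: str = "[REDACTED]") -> str:
    """Replace every occurrence of each entity string with a placeholder."""
    # Longest entities first, so a short entity never splits a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, placeholder)
    return text


text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
# Assumed detection result; a real run would get this from the detector.
detected = ["Tim Cook", "Apple", "Cupertino", "California"]
redacted = redact(text, detected)
print(redacted)
```

Plain substring replacement is enough for a demonstration; a real pipeline would more likely work from character offsets reported by the detector.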

Replacing Text
To replace detected PII:
datafog replace-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"

which should return something like:
[PERSON_B86CACE6] is the CEO of [UNKNOWN_445944D7] and is based out of [UNKNOWN_32BA5DCA], [UNKNOWN_B7DF4969]

Note: A unique, randomly generated identifier is created for each detected entity.
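The token format in the sample output (a label, an underscore, and eight hexadecimal characters) can be imitated with the standard library. This is only a sketch of the idea: how DataFog actually generates its identifiers is not stated here, the use of secrets.token_hex is an assumption, and replace_entities and its entity-to-label mapping are hypothetical.

```python
import secrets


def replace_entities(text: str, entities: dict[str, str]) -> str:
    """Swap each entity for a [LABEL_XXXXXXXX] token with a random suffix."""
    for entity, label in entities.items():
        # 4 random bytes -> 8 uppercase hex characters, e.g. B86CACE6.
        token = f"[{label}_{secrets.token_hex(4).upper()}]"
        text = text.replace(entity, token)
    return text


text = "Tim Cook is the CEO of Apple"
# Assumed detection result mapping each entity to a label.
replaced = replace_entities(text, {"Tim Cook": "PERSON", "Apple": "UNKNOWN"})
print(replaced)  # the suffixes differ on every run
```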
Hashing Text
You can hash detected PII with SHA256 (the default), SHA3-256, or MD5. For privacy-preserving purposes, the hashed output currently does not match the length of the original entity.
datafog hash-text "Tim Cook is the CEO of Apple and is based out of Cupertino, California"

generating an output which looks like this:
5738a37f0af81594b8a8fd677e31b5e2cabd6d7791c89b9f0a1c233bb563ae39 is the CEO of f223faa96f22916294922b171a2696d868fd1f9129302eb41a45b2a2ea2ebbfd and is based out of ab5f41f04096cf7cd314357c4be26993eeebc0c094ca668506020017c35b7a9c, cad0535decc38b248b40e7aef9a1cfd91ce386fa5c46f05ea622649e7faf18fb
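To see why the hashed output does not reveal the length of the original entity, the sketch below hashes two strings of different lengths with Python's hashlib. It illustrates the three algorithms in general; whether DataFog hashes the raw entity text in exactly this way (encoding, salting, and so on) is not stated here, and hash_entity and the ALGORITHMS mapping are hypothetical names.

```python
import hashlib

# Algorithm names as hashlib spells them; the mapping itself is illustrative.
ALGORITHMS = {"SHA256": "sha256", "SHA3-256": "sha3_256", "MD5": "md5"}


def hash_entity(entity: str, algorithm: str = "SHA256") -> str:
    """Return the hex digest of an entity string under the chosen algorithm."""
    return hashlib.new(ALGORITHMS[algorithm], entity.encode("utf-8")).hexdigest()


for name in ("Apple", "Cupertino, California"):
    digest = hash_entity(name)
    # SHA256 digests are always 64 hex characters, whatever the input length.
    print(len(name), len(digest))

print(len(hash_entity("Apple", "MD5")))  # MD5 digests are 32 hex characters
```

MD5 is no longer considered collision-resistant, so SHA256 or SHA3-256 is usually the safer choice unless compatibility with an existing system requires MD5.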

Utility Commands
🏥 Health Check
datafog health

⚙️ Show Configuration
datafog show-config

📥 Download Model
datafog download-model en_core_web_sm

📂 Show Model Directory
datafog show-spacy-model-directory en_core_web_sm

📋 List Models
datafog list-spacy-models

🏷️ List Entities
datafog list-entities

⚠️ Important Notes

For scan-image and scan-text commands, use --operations to specify different operations. Default is scan.
Process multiple images or text strings in a single command by providing multiple arguments.
Ensure proper permissions and configuration of the DataFog service before running commands.
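When the CLI is driven from a script, passing several strings in one call can be sketched by building the argument list in Python. The helper build_scan_text_command is hypothetical, and the exact way the CLI handles multiple arguments is assumed from the notes above rather than verified.

```python
import shlex


def build_scan_text_command(texts: list[str], operations: str = "scan") -> list[str]:
    """Build an argv list for `datafog scan-text` with several input strings."""
    return ["datafog", "scan-text", *texts, "--operations", operations]


command = build_scan_text_command(["Tim Cook is the CEO of Apple", "Call me at noon"])
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
# To execute it (requires datafog to be installed):
#   import subprocess; subprocess.run(command, check=True)
```

Passing a list to subprocess.run, rather than a single shell string, avoids quoting problems when the input text contains spaces or quotes.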

💡 Tip: For more detailed information on each command, use the --help option, e.g., datafog scan-text --help.
Python SDK
Getting Started
To use DataFog, you'll need to create a DataFog client with the desired operations. Here's a basic setup:
from datafog import DataFog

# For text annotation
client = DataFog(operations="scan")

# For OCR (Optical Character Recognition)
ocr_client = DataFog(operations="extract")

Text PII Annotation
Here's an example of how to annotate PII in a text document:
import requests

# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
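The list comprehension in the example above drops blank lines before annotation. The sketch below isolates that preprocessing step so it can be tried offline; the sample text is made up, and non_empty_lines is a hypothetical helper name.

```python
def non_empty_lines(text: str) -> list[str]:
    """Split text into lines and drop those that are empty or whitespace-only."""
    return [line for line in text.splitlines() if line.strip()]


# Made-up sample text, standing in for the downloaded document.
sample = "Patient: Jane Doe\n\n   \nDOB: 01/02/1980\n"
text_lines = non_empty_lines(sample)
print(text_lines)  # ['Patient: Jane Doe', 'DOB: 01/02/1980']
```

The resulting list has the same shape as the str_list argument passed to run_text_pipeline_sync in the example above.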

OCR PII Annotation
For OCR capabilities, you can use the following:
import asyncio
import nest_asyncio

nest_asyncio.apply()

async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)

loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())

Note: The DataFog library uses asynchronous programming for OCR, so make sure to use the async/await syntax when calling the appropriate methods.
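For readers less familiar with async/await, the sketch below shows the general pattern of calling a coroutine from a plain script. The fake_ocr coroutine is a hypothetical stand-in that does no real OCR, so the example runs without DataFog or network access. nest_asyncio is typically needed only where an event loop is already running, such as in a Jupyter notebook; in a standalone script, asyncio.run is usually sufficient.

```python
import asyncio


async def fake_ocr(image_url: str) -> str:
    """Hypothetical stand-in for an async OCR call; it does no real work."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"text extracted from {image_url}"


async def main() -> list[str]:
    urls = ["a.png", "b.png"]
    # gather runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(fake_ocr(url) for url in urls))


# In a plain script, asyncio.run creates and closes the event loop itself.
results = asyncio.run(main())
print(results)
```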
Text Anonymization
DataFog provides various anonymization techniques to protect sensitive information. Here are examples of how to use them:
Redacting Text
To redact PII in text:
from datafog import DataFog
from datafog.config import OperationType

client = DataFog(operations=[OperationType.SCAN, OperationType.REDACT])

text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
redacted_text = client.run_text_pipeline_sync([text])[0]
print(redacted_text)

Output:
[REDACTED] is the CEO of [REDACTED] and is based out of [REDACTED], [REDACTED]

Replacing Text
To replace detected PII with unique identifiers:
from datafog import DataFog
from datafog.config import OperationType

client = DataFog(operations=[OperationType.SCAN, OperationType.REPLACE])

text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
replaced_text = client.run_text_pipeline_sync([text])[0]
print(replaced_text)

Output:
[PERSON_B86CACE6] is the CEO of [UNKNOWN_445944D7] and is based out of [UNKNOWN_32BA5DCA], [UNKNOWN_B7DF4969]

Hashing Text
To hash detected PII:
from datafog import DataFog
from datafog.config import OperationType
from datafog.models.anonymizer import HashType

client = DataFog(operations=[OperationType.SCAN, OperationType.HASH], hash_type=HashType.SHA256)

text = "Tim Cook is the CEO of Apple and is based out of Cupertino, California"
hashed_text = client.run_text_pipeline_sync([text])[0]
print(hashed_text)

Output:
5738a37f0af81594b8a8fd677e31b5e2cabd6d7791c89b9f0a1c233bb563ae39 is the CEO of f223faa96f22916294922b171a2696d868fd1f9129302eb41a45b2a2ea2ebbfd and is based out of ab5f41f04096cf7cd314357c4be26993eeebc0c094ca668506020017c35b7a9c, cad0535decc38b248b40e7aef9a1cfd91ce386fa5c46f05ea622649e7faf18fb

You can choose from SHA256 (default), SHA3-256, and MD5 hashing algorithms by specifying the hash_type parameter.
Examples
For more detailed examples, check out our Jupyter notebooks in the examples/ directory:

text_annotation_example.ipynb: Demonstrates text PII annotation
image_processing.ipynb: Shows OCR capabilities and text extraction from images

These notebooks provide step-by-step guides on how to use DataFog for various tasks.
Dev Notes
For local development:

Clone the repository.
Navigate to the project directory:
cd datafog-python

Create a new virtual environment (using .venv is recommended as it is hardcoded in the justfile):
python -m venv .venv

Activate the virtual environment:

On Windows:
.venv\Scripts\activate

On macOS/Linux:
source .venv/bin/activate

Install the package in editable mode:
pip install -r requirements-dev.txt

Set up the project:
just setup

Now, you can develop and run the project locally.
Important Actions:

Format the code:
just format

This runs isort to sort imports.
Lint the code:
just lint

This runs flake8 to check for linting errors.
Generate coverage report:
just coverage-html

This runs pytest and generates a coverage report in the htmlcov/ directory.

We use pre-commit to run checks locally before committing changes. Once installed, you can run:
pre-commit run --all-files

Dependencies
For OCR, we use Tesseract, which is incorporated into the build step. You can find the relevant configurations under .github/workflows/ in the following files:

dev-cicd.yml
feature-cicd.yml
main-cicd.yml

Testing

Tests are run against Python 3.10.

License
This software is published under the MIT license.