File Processing

The file processing library is a metadata extraction tool that supports 26 common file types, including OCR for image files and transcription for audio/video files. There are 3 main use cases:

  1. Extracting metadata from individual files
  2. Extracting metadata from files in a directory and generating reports (.csv) that:
    • list metadata for each file
    • provide aggregate statistics on file size and count of each file type
    • compare file similarity of document-based files to identify possible duplicates in a directory
  3. Comparing 2 document-based files via cosine similarity and Levenshtein distance
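
The two comparison measures named in the third use case can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names `cosine_similarity` and `levenshtein_distance` are hypothetical, and the whitespace tokenization with raw word counts is an assumption (the library may vectorize text differently).

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(cosine_similarity("the quick brown fox", "the quick red fox"))  # 0.75
print(levenshtein_distance("kitten", "sitting"))  # 3
```

Cosine similarity ignores word order and responds to shared vocabulary, while Levenshtein distance is sensitive to character-level edits, so the two measures complement each other when looking for near-duplicates.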

Example Usage and Output

Test files can be found in docs/sample_reports, and examples are shown below.

The metadata report lists the metadata of every file. Several report configurations are available, and only a subset of the columns is shown:

_images/metadata_report.png

The analytics report aggregates file size and count for each file type.

_images/analytics_report.png
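
The kind of aggregation behind such a report can be sketched with the standard library. This is an illustration under assumptions, not the library's code: `iter_files` and `aggregate_by_extension` are hypothetical names, and grouping by lowercased extension is an assumed convention.

```python
import os
from collections import defaultdict


def iter_files(directory):
    """Yield (file name, size in bytes) for every file under a directory."""
    for root, _dirs, names in os.walk(directory):
        for name in names:
            yield name, os.path.getsize(os.path.join(root, name))


def aggregate_by_extension(files):
    """Return {extension: {"count": n, "total_size": bytes}} for (name, size) pairs."""
    stats = defaultdict(lambda: {"count": 0, "total_size": 0})
    for name, size in files:
        # Files without an extension are grouped under a placeholder key.
        extension = os.path.splitext(name)[1].lower() or "(none)"
        stats[extension]["count"] += 1
        stats[extension]["total_size"] += size
    return dict(stats)


# Aggregating an in-memory listing; pass iter_files("some/dir") for a real directory.
listing = [("a.txt", 10), ("b.TXT", 5), ("c.png", 7), ("README", 1)]
print(aggregate_by_extension(listing))
```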

The cosine similarity report compares the text of each document-based file with that of every other file.

_images/cosine_similarity.png

The FAISS similarity report determines the top n matches for each file using FAISS indices.

_images/faiss_similarity.png
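
What a top-n similarity search computes can be shown without FAISS itself. The sketch below swaps in a brute-force search in plain Python: it compares every vector with every other vector, which an exact FAISS index would do far more efficiently. The function names are hypothetical, and how the library turns documents into vectors is not specified here.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_n_matches(vectors, n):
    """For each vector, return the n most similar other vectors as (index, score)."""
    unit = [normalize(v) for v in vectors]
    results = []
    for i, query in enumerate(unit):
        scores = (
            (j, sum(a * b for a, b in zip(query, other)))
            for j, other in enumerate(unit)
            if j != i  # a file is never reported as its own match
        )
        results.append(heapq.nlargest(n, scores, key=lambda pair: pair[1]))
    return results


document_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for index, matches in enumerate(top_n_matches(document_vectors, n=1)):
    print(index, matches)
```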

Metadata for an individual file is available through the File class:

from file_processing import File

file = File('path/to/file')
print(file.metadata)

Output:

{
    'original_format': 'PNG',
    'mode': 'RGBA',
    'width': 1188,
    'height': 429
}

Supported File Types and Extracted Metadata


For each file type, the metadata fields are listed with example values.

[All file types]

  • file_path: build/index.html
  • file_name: index.html
  • extension: .html
  • owner: AD/BLUO
  • size (in bytes): 53800
  • modification_time: 1707161382
  • access_time: 1707161382
  • creation_time: 1707161382
  • permissions: 666
  • parent_directory: build/
  • is_file: False
  • is_symlink: False
  • absolute_path: C:/…/index.html
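
Most of these generic fields map onto values that `pathlib` and `os.stat` already expose. The following is a rough sketch of how they could be gathered, not the library's implementation. `basic_metadata` is a hypothetical name, the permissions format (octal digits as a string) is an assumption based on the example value, and owner lookup is platform-dependent, so the sketch falls back to `None` when it is unavailable.

```python
import stat
import tempfile
from pathlib import Path


def basic_metadata(path_str: str) -> dict:
    """Collect file-system metadata that applies to any file type."""
    path = Path(path_str)
    info = path.stat()
    try:
        owner = path.owner()
    except (NotImplementedError, KeyError, OSError):
        owner = None  # not supported on this platform, or the user ID is unknown
    return {
        "file_path": str(path),
        "file_name": path.name,
        "extension": path.suffix,
        "owner": owner,
        "size": info.st_size,
        "modification_time": info.st_mtime,
        "access_time": info.st_atime,
        # st_ctime is the metadata-change time on Unix; Windows has traditionally
        # reported the creation time here.
        "creation_time": info.st_ctime,
        "permissions": oct(stat.S_IMODE(info.st_mode))[2:],
        "parent_directory": str(path.parent),
        "is_file": path.is_file(),
        "is_symlink": path.is_symlink(),
        "absolute_path": str(path.resolve()),
    }


# Try it on a temporary file so the example does not depend on local paths.
with tempfile.NamedTemporaryFile(suffix=".html") as tmp:
    tmp.write(b"<html></html>")
    tmp.flush()
    meta = basic_metadata(tmp.name)
print(meta["extension"], meta["size"], meta["permissions"])
```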

mp3, wav, mp4, flac, aiff, ogg

  • bitrate: 50000
  • length (in seconds): 4.32
  • artist: John Doe
  • date: 1707161382
  • title: The ABCs

jpeg, png, heic/heif, tiff/tif

  • original_format: PNG
  • mode: RGBA
  • width: 1188
  • height: 429

gif

  • original_format: GIF
  • mode: P
  • width: 1024
  • height: 980
  • animated: True
  • frames: 24

csv

  • text: Text goes here
  • encoding: utf-8
  • num_rows: 12
  • num_cols: 5
  • num_cells: 60
  • empty_cells: 5
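
The CSV counts can be derived with the standard `csv` module. The sketch below is one plausible reading of the fields, not the library's code. `csv_stats` is a hypothetical name, and two choices are assumptions: the header row is counted as a row, and `num_cells` is rows times columns, which matches the example values (12 × 5 = 60).

```python
import csv
import io


def csv_stats(text: str) -> dict:
    """Count rows, columns, cells, and empty cells in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    num_rows = len(rows)
    num_cols = max((len(row) for row in rows), default=0)
    num_cells = num_rows * num_cols
    present = [cell for row in rows for cell in row]
    blank = sum(1 for cell in present if not cell.strip())
    # Cells missing from short rows are treated as empty too.
    missing = num_cells - len(present)
    return {
        "num_rows": num_rows,
        "num_cols": num_cols,
        "num_cells": num_cells,
        "empty_cells": blank + missing,
    }


print(csv_stats("a,b,c\n1,,3\n4,5,\n"))
```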

xlsx

  • active_sheet: Sheet1
  • sheet_names: ["Sheet1"]
  • data: {"Sheet1": [()]}
  • last_modified_by: John Doe
  • creator: John Doe
  • has_password: False

docx

  • text: Text goes here
  • author: John Doe
  • last_modified_by: John Doe
  • has_password: False

pptx

  • text: Text goes here
  • author: John Doe
  • last_modified_by: John Doe
  • num_slides: 17
  • has_password: False

html, txt, xml

  • text: Text goes here
  • encoding: utf-8
  • lines: ["a b c", "d e"]
  • words: ["Word", "Text"]
  • num_lines: 2
  • num_words: 9

json

  • text: Text goes here
  • encoding: utf-8
  • num_keys: 3
  • key_names: ["A", "B", "C"]
  • empty_values: 2
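
The JSON fields can be illustrated with the standard `json` module. This is a sketch under assumptions, not the library's method. `json_stats` is a hypothetical name, only top-level keys are counted, and a value is considered empty when it is null, an empty string, an empty list, or an empty object. The library may treat nesting and emptiness differently.

```python
import json


def json_stats(text: str) -> dict:
    """Summarize the top-level keys of a JSON document."""
    data = json.loads(text)
    if not isinstance(data, dict):
        # A top-level array or scalar has no keys to report.
        return {"num_keys": 0, "key_names": [], "empty_values": 0}
    empty_values = sum(1 for value in data.values() if value in (None, "", [], {}))
    return {
        "num_keys": len(data),
        "key_names": list(data),
        "empty_values": empty_values,
    }


print(json_stats('{"A": 1, "B": "", "C": null}'))
# {'num_keys': 3, 'key_names': ['A', 'B', 'C'], 'empty_values': 2}
```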

msg

  • text: Text goes here
  • subject: "Title"
  • date: 1707161382
  • sender: John Doe

pdf

  • text: Text goes here
  • has_password: False
  • author: John Doe
  • producer: John Doe

py

  • num_lines: 3
  • num_functions: 3
  • num_classes: 3
  • imports: ["pandas"]
  • docstrings: ["Docstring A"]
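
Fields like these can be computed by parsing the source with the standard `ast` module rather than executing it. The sketch below is illustrative only. `python_source_stats` is a hypothetical name, and the choice to count methods and nested functions as functions is an assumption about what the library reports.

```python
import ast


def python_source_stats(source: str) -> dict:
    """Count functions and classes and collect imports and docstrings."""
    tree = ast.parse(source)
    num_functions = num_classes = 0
    imports, docstrings = [], []
    documentable = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            num_functions += 1
        elif isinstance(node, ast.ClassDef):
            num_classes += 1
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # "from . import x" has no module name, so fall back to an empty string.
            imports.append(node.module or "")
        if isinstance(node, documentable):
            docstring = ast.get_docstring(node)
            if docstring:
                docstrings.append(docstring)
    return {
        "num_lines": len(source.splitlines()),
        "num_functions": num_functions,
        "num_classes": num_classes,
        "imports": imports,
        "docstrings": docstrings,
    }


sample = (
    '"""Module doc."""\n'
    "import pandas\n"
    "from os import path\n"
    "\n"
    "class A:\n"
    '    """Class doc."""\n'
    "    def method(self):\n"
    "        return 1\n"
    "\n"
    "def f():\n"
    "    return 2\n"
)
stats = python_source_stats(sample)
print(stats)
```

Because the source is only parsed, modules named in import statements (such as pandas in the sample) do not need to be installed.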

rtf

  • text: Text goes here

zip

  • num_files: 1
  • file_types: ["html"]
  • file_names: ["a.html"]
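
The archive fields can be read with the standard `zipfile` module without extracting anything. This is a minimal sketch, not the library's implementation. `zip_stats` is a hypothetical name, and reporting file types as lowercased extensions without the dot is an assumption based on the example values.

```python
import io
import os
import zipfile


def zip_stats(archive) -> dict:
    """List the files in a zip archive (a path or file-like object) and their types."""
    with zipfile.ZipFile(archive) as zf:
        # Directory entries are skipped so that only real files are counted.
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
    file_types = sorted({os.path.splitext(name)[1].lstrip(".").lower() for name in names})
    return {"num_files": len(names), "file_types": file_types, "file_names": names}


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.html", "<html></html>")
buffer.seek(0)
print(zip_stats(buffer))
# {'num_files': 1, 'file_types': ['html'], 'file_names': ['a.html']}
```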