File Processing

The file processing library is a metadata extraction tool that supports 26 common file types, including OCR for image files and transcription for audio/video files. There are 3 main use cases:

1. Extracting metadata from individual files
2. Extracting metadata from files in a directory and generating reports (.csv) that:
   - list metadata for each file
   - provide aggregate statistics on file size and count of each file type
   - compare file similarity of document-based files to identify possible duplicates in a directory
3. Comparing 2 document-based files via cosine similarity and Levenshtein distance
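To make the two comparison measures concrete, here is a minimal standard-library sketch. It is not the library's implementation: the function names `cosine_similarity` and `levenshtein_distance` are hypothetical, and the cosine score is computed over simple word-count vectors, which is an assumption about how the text might be vectorized.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors of two texts (assumed vectorization)."""
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

Cosine similarity ignores word order and rewards shared vocabulary, while Levenshtein distance is sensitive to character-level edits, so the two measures complement each other when looking for near-duplicates.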

Example Usage and Output

Test files can be found in docs/sample_reports, and examples are shown below.

Metadata Report

Lists metadata for all files. Multiple report configurations are available; only a subset of the columns is shown.

Analytics Report

Aggregates file size and count for each file type.
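The aggregation described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's report generator: the function names `aggregate_by_extension` and `write_analytics_csv` and the CSV column names are hypothetical, and files are grouped by lowercased extension.

```python
import csv
from collections import defaultdict
from pathlib import Path


def aggregate_by_extension(directory):
    """Return {extension: {"count": int, "size": int}} for all files under directory."""
    stats = defaultdict(lambda: {"count": 0, "size": 0})
    for path in Path(directory).rglob("*"):
        if path.is_file():
            entry = stats[path.suffix.lower()]  # group ".TXT" with ".txt"
            entry["count"] += 1
            entry["size"] += path.stat().st_size  # size in bytes
    return dict(stats)


def write_analytics_csv(stats, report_path):
    """Write one row per extension; the column names are assumptions."""
    with open(report_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["extension", "count", "size_bytes"])
        for extension, entry in sorted(stats.items()):
            writer.writerow([extension, entry["count"], entry["size"]])
```

A call such as `write_analytics_csv(aggregate_by_extension("docs"), "analytics.csv")` would produce a small table with one row per file type.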

Similarity Report (1)

Compares the text in each document-based file to every other file according to cosine similarity.

Similarity Report (2)

Determines the top n matches to each file using FAISS indices.
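The top-n lookup can be illustrated without FAISS by a brute-force search over normalized vectors, which returns the same ranking an exact inner-product index would give on small inputs. This sketch does not call the FAISS API and is not the library's code: the function names are hypothetical, and each file is assumed to be already represented by a numeric vector.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_n_matches(vectors, n):
    """For each vector index, return the n most similar other indices as (index, score)."""
    unit = [normalize(v) for v in vectors]
    results = {}
    for i, query in enumerate(unit):
        # Score every other vector against the query (brute force, O(len(vectors)^2)).
        scores = [
            (sum(q * c for q, c in zip(query, candidate)), j)
            for j, candidate in enumerate(unit)
            if j != i  # a file should not be reported as its own match
        ]
        best = heapq.nlargest(n, scores)
        results[i] = [(j, score) for score, j in best]
    return results
```

Brute force scales quadratically with the number of files; an index structure becomes worthwhile once a directory holds many documents.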

File Metadata

from file_processing import File

# Load a file and print the metadata extracted for its type
file = File('path/to/file')
print(file.metadata)

Example output for a PNG image:

{
    'original_format': 'PNG',
    'mode': 'RGBA',
    'width': 1188,
    'height': 429
}

Supported File Types and Extracted Metadata

File Type	Metadata Fields	Example
[All file types]	file_path; file_name; extension; owner; size (in bytes); modification_time; access_time; creation_time; permissions; parent_directory; is_file; is_symlink; absolute_path	build/index.html; index.html; .html; AD/BLUO; 53800; 1707161382; 1707161382; 1707161382; 666; build/; False; False; C:/…/index.html
mp3, wav, mp4, flac, aiff, ogg	bitrate; length (in seconds); artist; date; title	50000; 4.32; John Doe; 1707161382; The ABCs
jpeg, png, heic/heif, tiff/tif	original_format; mode; width; height	GIF; P; 1024; 980
gif	original_format; mode; width; height; animated; frames	GIF; P; 1024; 980; True; 24
csv	text; encoding; num_rows; num_cols; num_cells; empty_cells	Text goes here; utf-8; 12; 5; 60; 5
xlsx	active_sheet; sheet_names; data; last_modified_by; creator; has_password	Sheet1; ["Sheet1"]; {"Sheet1": [()]}; John Doe; John Doe; False
docx	text; author; last_modified_by; has_password	Text goes here; John Doe; John Doe; False
pptx	text; author; last_modified_by; num_slides; has_password	Text goes here; John Doe; John Doe; 17; False
html, txt, xml	text; encoding; lines; words; num_lines; num_words	Text goes here; utf-8; ["a b c", "d e"]; ["Word", "Text"]; 2; 9
json	text; encoding; num_keys; key_names; empty_values	Text goes here; utf-8; 3; ["A", "B", "C"]; 2
msg	text; subject; date; sender	Text goes here; "Title"; 1707161382; John Doe
pdf	text; has_password; author; producer	Text goes here; False; John Doe; John Doe
py	num_lines; num_functions; num_classes; imports; docstrings	3; 3; 3; ["pandas"]; ["Docstring A"]
rtf	text	Text goes here
zip	num_files; file_types; file_names	1; ["html"]; ["a.html"]
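Most of the fields listed for all file types can be read from the filesystem with the standard library. The sketch below is an approximation, not the library's own code: the function name `common_metadata` is hypothetical, and `owner` and `creation_time` are omitted because how they are obtained is platform-dependent.

```python
import stat
from pathlib import Path


def common_metadata(file_path):
    """Collect filesystem-level fields similar to those listed for all file types."""
    path = Path(file_path)
    info = path.lstat()  # lstat describes a symlink itself rather than its target
    return {
        "file_path": str(path),
        "file_name": path.name,
        "extension": path.suffix,
        "size": info.st_size,                # in bytes
        "modification_time": info.st_mtime,  # seconds since the epoch
        "access_time": info.st_atime,
        "permissions": oct(stat.S_IMODE(info.st_mode))[2:],  # e.g. "644"
        "parent_directory": str(path.parent),
        "is_file": path.is_file(),
        "is_symlink": path.is_symlink(),
        "absolute_path": str(path.absolute()),
    }
```

Type-specific fields such as image dimensions or document authors need format-aware parsers and are outside the scope of this sketch.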