A Utility for Investigative Discovery of PDF Files
Initially developed to support the research process behind the IPYT Index, ArchivEye is an open-source investigative utility built with accessibility and security in mind. ArchivEye works offline with PDF files on your local or personal PC. By running OCR (optical character recognition) against any collection of PDF files, it makes it possible to find keywords and names within them.
The features of this utility grew out of needs found in the legal and investigative industries, including those of users with only an entry-level understanding of tech. Often an overwhelming number of files is collected for discovery, and all of them have to be sorted, indexed, and searched. In some cases, especially in litigation, these documents can be highly sensitive. An offline option that makes these investigations possible with significantly less risk to victim identity and privacy is incredibly valuable.
ArchivEye combines two tools: Ghostscript, which turns each PDF page into an image, and Tesseract, the OCR utility that turns those images into searchable text. Both must be installed for ArchivEye to work, and the utility checks that both are recognized and usable. After installation, the PDFs for your research are indexed, which entails scanning every page of every PDF document. This takes some time, but the search feature ends up saving significantly more.
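The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not ArchivEye's actual code: the function names, the list of executable names, the 300 DPI setting, and the PNG output format are all assumptions made for the example. It only relies on standard Ghostscript switches and on Tesseract's ability to write recognized text to stdout.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed executable names; Windows builds of Ghostscript usually ship as gswin64c.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")
TESSERACT_NAMES = ("tesseract",)


def find_executable(candidates):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def render_command(gs, pdf, page, png, dpi=300):
    """Build the Ghostscript command that renders one PDF page to a PNG image."""
    return [
        gs, "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        f"-sOutputFile={png}", str(pdf),
    ]


def ocr_command(tesseract, png):
    """Build the Tesseract command that prints the recognized text to stdout."""
    return [tesseract, str(png), "stdout"]


def ocr_page(gs, tesseract, pdf, page, workdir):
    """Render one page with Ghostscript, then OCR the image with Tesseract."""
    png = Path(workdir) / f"{Path(pdf).stem}-p{page}.png"
    subprocess.run(render_command(gs, pdf, page, png), check=True, capture_output=True)
    result = subprocess.run(
        ocr_command(tesseract, png), check=True, capture_output=True, text=True
    )
    return result.stdout
```

A caller would first run find_executable for each tool and stop with a clear message if either returns None, then loop ocr_page over every page of every PDF to build the index.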
After indexing, you can enter a term or keyword to search the document collection. Each result shows where the term was found: the document and the specific page. Beneath the search results, a highlighted excerpt shows the term in context, next to a view of the PDF page itself. The GUI makes examining the page easy, with a zoom feature for closer inspection.
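To make the search step concrete, here is a small sketch of a keyword search over an index of OCR text. It assumes a hypothetical index shape, a dictionary keyed by (document, page) with the page text as the value, and marks the match with square brackets in place of visual highlighting. ArchivEye's real index format and matching rules may differ.

```python
import re


def search(index, term, width=40):
    """Find every case-insensitive occurrence of term in the indexed page text.

    index maps (document, page) to that page's OCR text (a hypothetical layout).
    Each hit records the document, the page, and a short context snippet.
    """
    if not term:
        return []
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for (document, page), text in sorted(index.items()):
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Collapse line breaks and repeated spaces left over from OCR.
            snippet = " ".join(f"{left}[{match.group(0)}]{right}".split())
            hits.append({"document": document, "page": page, "snippet": snippet})
    return hits


# Made-up sample data for illustration only.
sample_index = {
    ("a.pdf", 1): "The witness John Doe signed.",
    ("a.pdf", 2): "No relevant names on this page.",
    ("b.pdf", 3): "JOHN DOE appears again",
}

for hit in search(sample_index, "john doe"):
    print(hit["document"], hit["page"], hit["snippet"])
```

Escaping the term with re.escape treats it as literal text, so names containing punctuation such as periods or parentheses are matched exactly and not interpreted as a pattern.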