Some PDFs contain \x00 null bytes in their text which PostgreSQL rejects
with "invalid byte sequence for encoding UTF8: 0x00". Sanitize extracted
text in both document-analyzer and file-content-extractor services.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pdf-parse v2 requires DOMMatrix (browser API) which fails in Node.js.
Replaced with unpdf (serverless PDF.js build) for PDFs and mammoth for
Word .docx files. Also fixed the same broken pdf-parse usage in
file-content-extractor.ts used by AI filtering.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces a document analyzer service that extracts page count (via pdf-parse),
text preview, and detected language (via franc) from uploaded files. Analysis runs
automatically on upload (configurable via SystemSettings) and can be triggered
retroactively for existing files. Results are displayed as badges in the FileViewer
and fed to AI screening for language-based filtering criteria.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>