MOPC-Portal

MOPC/MOPC-Portal

Commit Graph

Author

SHA1

Message

Date

Matt

d80043c4aa

Strip null bytes from extracted text to fix PostgreSQL UTF-8 errors

Build and Push Docker Image / build (push) Has been cancelled

Some PDFs contain \x00 null bytes in their text which PostgreSQL rejects
with "invalid byte sequence for encoding UTF8: 0x00". Sanitize extracted
text in both document-analyzer and file-content-extractor services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-17 11:34:05 +01:00
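
The sanitization described in this commit can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper name `stripNullBytes` and its signature are assumptions made here. It relies only on the fact stated above, that PostgreSQL rejects the 0x00 byte in UTF-8 text.

```typescript
// Hypothetical helper (the name is an assumption, not taken from the repository).
// Removes every U+0000 character so the extracted text can be stored in a
// PostgreSQL text column, which rejects the 0x00 byte.
function stripNullBytes(text: string): string {
  return text.replace(/\u0000/g, "");
}

// Example: text extracted from a PDF with embedded null bytes.
const extracted = "Annual\u0000 report\u0000 2025";
const sanitized = stripNullBytes(extracted);
console.log(JSON.stringify(sanitized));
```

Applying such a helper at the point where extracted text leaves each service (here, both document-analyzer and file-content-extractor) would keep the database layer free of special cases.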

Matt

ed5e782f61

Fix document analysis: switch to unpdf + mammoth for PDF/Word parsing

Build and Push Docker Image / build (push) Successful in 11m26s

pdf-parse v2 requires DOMMatrix (browser API) which fails in Node.js.
Replaced with unpdf (serverless PDF.js build) for PDFs and mammoth for
Word .docx files. Also fixed the same broken pdf-parse usage in
file-content-extractor.ts used by AI filtering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-17 10:27:36 +01:00

Matt

c9640c6086

Add document analysis: page count, text extraction & language detection

Build and Push Docker Image / build (push) Successful in 11m7s

Introduces a document analyzer service that extracts page count (via pdf-parse),
text preview, and detected language (via franc) from uploaded files. Analysis runs
automatically on upload (configurable via SystemSettings) and can be triggered
retroactively for existing files. Results are displayed as badges in the FileViewer
and fed to AI screening for language-based filtering criteria.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-17 10:09:04 +01:00

3 Commits