AI classification · Data extraction · Onboarding automation
A production-grade backend API built with Node.js and PHP that uses OpenAI to classify and extract structured data from South African FICA documents — ID documents, bank statements, and payslips — in PDF or image form. The API accepts multiple files per request, identifies the document type, extracts every relevant field, and returns clean structured JSON ready to plug into onboarding, CRM, or credit workflows.
Built to remove the bottleneck of manual FICA document handling during customer onboarding
Onboarding teams were manually opening every FICA document, identifying whether it was an ID, payslip, or bank statement, then typing the same fields into the CRM by hand. This was slow, error-prone, hard to audit, and impossible to scale during high-volume application periods.
A backend API built in Node.js and PHP that accepts a batch of files (PDF or image), automatically classifies each document using OpenAI, runs a tailored extraction prompt per document type, and returns a clean, structured JSON response containing every detected field. Encrypted PDFs are supported through per-file passwords.
Manual capture time per customer dropped dramatically, data quality improved through consistent structured output, and onboarding teams could focus on verification instead of typing. The API now plugs directly into downstream CRM and credit workflows for fully automated FICA intake.
From raw upload to structured JSON in one round trip
API accepts a batch of files (PDF, JPG, PNG, GIF, BMP, WebP, or TIFF) with optional per-file passwords for encrypted PDFs.
OpenAI is used to determine whether each document is an ID, bank statement, or payslip — no manual tagging required.
Document-specific prompts pull every relevant field (names, ID numbers, balances, salary breakdown, etc.) into structured data.
API responds with one structured JSON object per document, grouped by classification, ready to ingest into the CRM.
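The four steps above can be sketched as a small orchestration function. This is a minimal illustration, not the production code: the type labels, the `processBatch` name, and the injected `classify` and `extractors` callbacks are all hypothetical, and the OpenAI calls themselves are deliberately left out so the flow can be read (and stubbed) on its own.

```javascript
// Hypothetical labels for the three document categories named in the text.
const DOCUMENT_TYPES = ["id_document", "bank_statement", "payslip"];

/**
 * Classify each file, run the extractor that matches its type,
 * and group the results by classification.
 *
 * `classify(file)` resolves to a type label; `extractors[type](file)`
 * resolves to a plain object of fields. Both are injected, so the
 * model calls behind them can be swapped or stubbed.
 */
async function processBatch(files, { classify, extractors }) {
  const grouped = Object.fromEntries(DOCUMENT_TYPES.map((t) => [t, []]));
  grouped.unknown = [];

  for (const file of files) {
    const type = await classify(file);
    const supported = DOCUMENT_TYPES.includes(type) && typeof extractors[type] === "function";

    if (!supported) {
      // Unrecognised type: report the file rather than guessing its fields.
      grouped.unknown.push({ filename: file.name, fields: {} });
      continue;
    }

    const fields = await extractors[type](file);
    grouped[type].push({ filename: file.name, fields });
  }

  return grouped;
}
```

Injecting the classifier and extractors keeps the batch logic independent of any particular model client, which also makes it straightforward to test with stubbed responses.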
Tailored extraction logic per document type for maximum accuracy
Smart Card or Green Book — supports both formats and full identity extraction.
Detects all major SA banks and pulls account, balance, and statement metadata.
Full salary breakdown including deductions, employer details, and banking info.
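One way to keep per-type output predictable is to declare the expected fields for each document type and normalise whatever the model returns against that list. The sketch below assumes this approach; the field names and the `normalizeFields` helper are hypothetical, since the API's real schema is not published here. Missing or empty values become `N/A` instead of being guessed.

```javascript
// Hypothetical field lists, for illustration only.
const EXPECTED_FIELDS = {
  id_document: ["full_name", "id_number", "date_of_birth"],
  bank_statement: ["bank_name", "account_number", "closing_balance"],
  payslip: ["employer_name", "employee_name", "gross_pay", "net_pay"],
};

/**
 * Return an object with exactly the expected keys for `type`.
 * Values that are absent, null, or blank are replaced with "N/A";
 * keys that are not expected for the type are dropped.
 */
function normalizeFields(type, raw) {
  if (!Object.prototype.hasOwnProperty.call(EXPECTED_FIELDS, type)) {
    throw new Error(`Unsupported document type: ${type}`);
  }

  const source = raw ?? {};
  const normalized = {};
  for (const key of EXPECTED_FIELDS[type]) {
    const value = source[key];
    const missing =
      value === undefined || value === null || String(value).trim() === "";
    normalized[key] = missing ? "N/A" : value;
  }
  return normalized;
}
```

Because every response for a given type carries the same keys in the same order, downstream consumers can map fields without checking for their presence first.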
Built for real onboarding workflows — not just a tech demo
Uses OpenAI to automatically identify document type — no manual tagging.
Pulls every relevant data point per document type into a structured object.
Process several documents in one API call with mixed types and formats.
Handles PDF, JPG, PNG, GIF, BMP, WebP, and TIFF inputs up to 10 MB each.
Per-file password support for password-protected bank statements and payslips.
Clean, predictable JSON shape per document type — easy to consume downstream.
Demo frontend masks sensitive fields client-side; raw API returns full data.
Designed to slot directly into onboarding, CRM, or credit workflows.
Returns N/A for missing fields rather than guessing — safer for compliance.
Processes multi-document batches in seconds, replacing minutes of manual capture.
Tailored for South African ID formats, local banks, and SARS payslip layouts.
Currently in active use — not a prototype or proof-of-concept demo.
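The format, size, and password rules listed above lend themselves to a cheap pre-flight check before any file reaches the model. The following is a sketch under stated assumptions: `validateUpload` and the file object shape (`name`, `size`, optional `password`) are hypothetical, the `jpeg` and `tif` aliases are added for convenience, and 10 MB is read as 10 × 1024 × 1024 bytes.

```javascript
// Assumption: "10 MB" is interpreted as 10 MiB.
const MAX_BYTES = 10 * 1024 * 1024;

// Formats listed in the feature set, plus the common aliases jpeg and tif.
const ALLOWED_EXTENSIONS = new Set([
  "pdf", "jpg", "jpeg", "png", "gif", "bmp", "webp", "tif", "tiff",
]);

/**
 * Check one uploaded file against the documented limits.
 * Returns { ok, errors } instead of throwing, so a batch can report
 * every rejected file in a single response.
 */
function validateUpload(file) {
  const errors = [];
  const dot = file.name.lastIndexOf(".");
  const ext = dot === -1 ? "" : file.name.slice(dot + 1).toLowerCase();

  if (!ALLOWED_EXTENSIONS.has(ext)) {
    errors.push(`unsupported format: ${ext || "none"}`);
  }
  if (file.size > MAX_BYTES) {
    errors.push("file exceeds 10 MB");
  }
  if (file.password !== undefined && ext !== "pdf") {
    // Passwords only make sense for encrypted PDFs.
    errors.push("password supplied for a non-PDF file");
  }

  return { ok: errors.length === 0, errors };
}
```

An extension check is only a first pass; a real service would normally also inspect the file content before trusting the declared type.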
The screenshots below show a thin demo frontend wrapping the API. The frontend masks sensitive output for public viewing; the actual API returns complete data.
Drag-and-drop or browse to queue multiple files for the API. Supports PDF and common image formats.
Encrypted PDFs (like password-protected payslips) can be unlocked individually before processing.
Documents are classified and parsed using OpenAI — each file is handled with type-specific logic.
Employer, employee, salary breakdown, deductions, and banking details — all extracted automatically.
Bank statement metadata and full ID document fields returned in a single batch response.
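The client-side masking applied by the demo frontend could look something like the sketch below. It is an illustration only: the `maskValue` and `maskDocument` helpers, the choice of which keys count as sensitive, and the "keep the last four characters" rule are all assumptions, not a description of the actual frontend.

```javascript
// Hypothetical set of keys treated as sensitive in the demo view.
const SENSITIVE_KEYS = new Set(["id_number", "account_number"]);

/**
 * Replace all but the last `visible` characters with asterisks.
 * Values no longer than `visible` are masked entirely, and the
 * "N/A" placeholder is passed through unchanged.
 */
function maskValue(value, visible = 4) {
  const text = String(value);
  if (text === "N/A") return text;
  if (text.length <= visible) return "*".repeat(text.length);
  return "*".repeat(text.length - visible) + text.slice(-visible);
}

/** Return a copy of `fields` with sensitive values masked for display. */
function maskDocument(fields) {
  return Object.fromEntries(
    Object.entries(fields).map(([key, value]) => [
      key,
      SENSITIVE_KEYS.has(key) ? maskValue(value) : value,
    ])
  );
}
```

For example, `maskValue("1234567890")` returns `"******7890"`. Since the masking happens only in the browser, the raw API response still carries the full values and needs to be protected accordingly.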
Built end-to-end with a focus on reliability, accuracy, and integration
If you're processing high volumes of documents — onboarding, KYC, FICA, or anything that needs classification and structured extraction — let's talk.