From document import to AI-powered answers with verified sources.
Document import
Dokuhaku connects directly to OEM documentation portals. When a manufacturer publishes a new manual, it appears in your library automatically. No file downloads, no manual updates, no missed revisions. You can also create your own folder structure and upload internal procedures, service bulletins, and reference materials. Both paths feed into the same processing pipeline.
Automatic import from manufacturer documentation systems. Folder hierarchies are preserved exactly as the manufacturer organized them.
Create folders, upload files, and organize your internal documentation alongside OEM materials. Structure your library to match how your team works.
Only new and updated documents are processed. The system tracks revisions automatically so you never have duplicate versions or miss an update.
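One way to picture the revision check is a content-hash comparison: a document is queued for processing only when its hash is new or has changed. A minimal sketch; the hash store and its shape are illustrative, not the actual implementation:

```python
import hashlib

def needs_processing(doc_id: str, doc_bytes: bytes, seen: dict[str, str]) -> bool:
    """Queue a document only if it is new or its content has changed."""
    digest = hashlib.sha256(doc_bytes).hexdigest()
    if seen.get(doc_id) == digest:
        return False               # same revision already processed, skip it
    seen[doc_id] = digest          # new document or updated revision
    return True
```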
PDF, Word, Excel, PowerPoint, images, and plain text files are all supported. Each format is handled by a dedicated extraction engine optimized for its structure.
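Conceptually, routing by format is a dispatch table keyed on file extension. The engine names below are placeholders for whatever actually sits behind each format:

```python
from pathlib import Path

# Hypothetical extractor registry: one engine per supported format.
EXTRACTORS = {
    ".pdf": "pdf_engine",
    ".docx": "word_engine",
    ".xlsx": "excel_engine",
    ".pptx": "powerpoint_engine",
    ".png": "image_engine",
    ".jpg": "image_engine",
    ".txt": "text_engine",
}

def pick_extractor(filename: str) -> str:
    ext = Path(filename).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"unsupported format: {ext}")
    return EXTRACTORS[ext]
```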
Document processing
Each document goes through an automated pipeline. An extraction engine reads every page, identifies structure, and breaks the content into overlapping chunks optimized for search. Each chunk is converted into a vector representation and stored in a vector database. A parallel keyword index captures exact terms and part numbers.
An extraction engine reads every page and identifies structure: titles, paragraphs, tables, lists, and diagrams.
Tables are preserved with their row and column structure. Diagrams and figures are recognized and included alongside surrounding text for complete context.
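To see why preserving structure matters, here is a toy serialization that keeps a table's rows and columns readable inside a single text chunk (the pipe format is just one possible choice):

```python
def table_to_text(headers: list[str], rows: list[list[str]]) -> str:
    """Serialize an extracted table so row/column structure survives chunking."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)

# A torque-spec table stays answerable: "what is the M10 torque?" maps to one row.
print(table_to_text(["Bolt", "Torque (Nm)"], [["M8", "25"], ["M10", "49"]]))
```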
Content is split into overlapping chunks optimized for search. Small blocks are merged for context. Large blocks are split at natural boundaries.
Every chunk knows its page number and position within the document. This enables one-click navigation to the exact page and passage in the source viewer.
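A stripped-down version of overlap chunking with page tracking might look like this. The real pipeline splits at natural boundaries rather than fixed character windows, and the sizes here are arbitrary:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    page: int       # source page, used for one-click navigation
    offset: int     # character position within that page

def chunk_page(text: str, page: int, size: int = 800, overlap: int = 200) -> list[Chunk]:
    """Split one page into overlapping windows, each tagged with its origin."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(Chunk(text[start:start + size], page, start))
        start += size - overlap
    return chunks
```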
Each chunk is converted into a vector representation by an embedding model and stored in a vector database. A parallel keyword index captures exact terms, part numbers, and technical identifiers.
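The dual index can be sketched as two structures written in the same pass. The `embed` callable stands in for whatever embedding model is used; the tokenizer below deliberately keeps hyphenated part numbers intact:

```python
import re
from collections import defaultdict

vector_index: list[tuple[str, list[float]]] = []          # (chunk_id, embedding)
keyword_index: defaultdict[str, set] = defaultdict(set)   # token -> chunk ids

def index_chunk(chunk_id: str, text: str, embed) -> None:
    """Write one chunk into both the vector index and the keyword index."""
    vector_index.append((chunk_id, embed(text)))
    # Part numbers like "17-48-110" must survive as single exact-match tokens.
    for token in re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower()):
        keyword_index[token].add(chunk_id)
```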
When the processing pipeline improves, existing documents can be reprocessed without re-uploading. Updated extraction and chunking are applied automatically.
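Reprocessing without re-upload works because the original file is kept; a version stamp is one plausible way to decide what is stale (`run_pipeline` is a stand-in for the real entry point):

```python
PIPELINE_VERSION = 7  # illustrative; bumped whenever extraction or chunking improves

def reprocess_if_stale(doc: dict, run_pipeline) -> None:
    """Push a stored original back through the current pipeline, no re-upload."""
    if doc.get("pipeline_version", 0) < PIPELINE_VERSION:
        run_pipeline(doc["raw_bytes"])
        doc["pipeline_version"] = PIPELINE_VERSION
```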
Search and AI
When you ask a question, two search systems work in parallel. A vector search finds passages with similar meaning, even across languages. A keyword search catches exact terms and part numbers. Results from both are merged into a single candidate list, then reranked by a neural model trained for relevance. The best matches are passed to the AI, which writes an answer and cites every source.
Vector search finds semantically similar passages. Keyword search catches exact terms and identifiers. Both run in parallel across your entire documentation library.
Ask questions in Finnish, get answers from English documentation. The search pipeline translates queries and uses multilingual embeddings to find relevant passages regardless of language.
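The cross-language effect is easy to demonstrate with any multilingual embedding model. This example uses an open-source sentence-transformers model purely for illustration; the production model is not disclosed:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "moottorin öljynvaihtoväli"   # Finnish: engine oil change interval
passage = "Change the engine oil every 10,000 km or 12 months."

q_emb, p_emb = model.encode([query, passage])
print(util.cos_sim(q_emb, p_emb))     # high similarity despite different languages
```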
Results from both search engines are merged into a single unified ranking. This ensures that exact keyword matches and semantically similar passages are considered together.
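A standard technique for this kind of merge is reciprocal rank fusion, shown below. Whether Dokuhaku uses RRF specifically is not stated; treat it as one representative approach:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists into one, rewarding items that rank high anywhere."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c7", "c1"]
keyword_hits = ["c7", "c9"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # c7 wins: found by both
```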
A dedicated reranking model scores every candidate result for relevance to your specific question. The most relevant passages rise to the top.
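Cross-encoder rerankers are the usual tool for this step: they read the question and a candidate passage together and emit a relevance score. The model below is an open-source stand-in, not necessarily what runs in production:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model

def rerank(question: str, passages: list[str], top_n: int = 5) -> list[str]:
    """Score each (question, passage) pair and keep the most relevant passages."""
    scores = reranker.predict([(question, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)
    return [p for _, p in ranked[:top_n]]
```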
The AI reads the top-ranked source passages and writes an answer. Every claim is backed by a citation that links directly to the source document.
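Grounded answering usually comes down to how the sources are presented to the model. Here is a sketch of a citation-friendly prompt; the exact wording and fields are assumptions:

```python
def build_prompt(question: str, passages: list[dict]) -> str:
    """Number each source so the model can cite claims as [1], [2], ..."""
    sources = "\n".join(
        f"[{i}] (doc {p['doc_id']}, p. {p['page']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below. "
        "Cite every claim with its source number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```

Because each numbered source carries a document id and page, the citations in the answer map straight back to the viewer.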
One click opens the document viewer on the exact page with the source passage highlighted. Every answer is verifiable against the original documentation.
Infrastructure and security
Each workspace is fully isolated. Row-level security policies ensure that users see only documents belonging to their organization. Data never leaks between tenants.
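In Postgres terms, this kind of isolation is typically a row-level security policy plus a per-session tenant setting. Table and column names here are hypothetical:

```python
# Enable RLS so every query on documents is filtered by organization.
RLS_POLICY = """
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON documents
    USING (organization_id = current_setting('app.current_org', true)::uuid);
"""

def set_tenant(cursor, org_id: str) -> None:
    """Pin the database session to one organization before any query runs."""
    cursor.execute("SELECT set_config('app.current_org', %s, false)", (org_id,))
```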
All documents, credentials, and API keys are encrypted at rest. Connection secrets to OEM portals are stored with encryption and never exposed in plain text.
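Symmetric encryption of stored secrets can be as simple as the following sketch with Python's cryptography package; the real scheme and key management are internal details:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in production, the key lives in a key vault, not code
f = Fernet(key)

token = f.encrypt(b"oem-portal-password")  # only ciphertext is ever stored
secret = f.decrypt(token)                  # decrypted just-in-time for the connection
```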
Document processing runs on dedicated cloud infrastructure that scales with your library size. Search queries are served by high-performance vector and relational databases.
Every document upload, search query, and AI interaction is logged. Workspace administrators have full visibility into how documentation is being used across their team.
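Structured audit records make that visibility queryable. A minimal sketch; the field names are assumptions, not the actual log schema:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("audit")

def log_event(user: str, action: str, detail: dict) -> None:
    """Emit one structured record per upload, search, or AI interaction."""
    audit.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,          # e.g. "upload", "search", "ai_answer"
        **detail,
    }))

log_event("tech-042", "search", {"query": "brake bleed procedure"})
```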