How it works

From document import to AI-powered answers with verified sources.

Document import

Two paths into your library. Both fully indexed.

Dokuhaku connects directly to OEM documentation portals. When a manufacturer publishes a new manual, it appears in your library automatically. No file downloads, no manual updates, no missed revisions. You can also create your own folder structure and upload internal procedures, service bulletins, and reference materials. Both paths feed into the same processing pipeline.

OEM portal sync

Automatic import from manufacturer documentation systems. Folder hierarchies are preserved exactly as the manufacturer organized them.

Your own documents

Create folders, upload files, and organize your internal documentation alongside OEM materials. Structure your library to match how your team works.

Change detection

Only new and updated documents are processed. The system tracks revisions automatically so you never have duplicate versions or miss an update.
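In outline, revision tracking of this kind usually rests on content fingerprints: hash each file, compare against the stored fingerprint, and only send new or changed files to the pipeline. The sketch below is illustrative, not Dokuhaku's actual implementation; all names are hypothetical.

```python
import hashlib

def file_fingerprint(content: bytes) -> str:
    # A stable content hash; any byte change yields a new fingerprint.
    return hashlib.sha256(content).hexdigest()

def detect_changes(previous: dict[str, str],
                   current_files: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare stored fingerprints against the current library state."""
    changes = {"new": [], "updated": [], "unchanged": []}
    for name, content in current_files.items():
        fp = file_fingerprint(content)
        if name not in previous:
            changes["new"].append(name)       # never seen: process it
        elif previous[name] != fp:
            changes["updated"].append(name)   # revision: reprocess it
        else:
            changes["unchanged"].append(name) # identical: skip it
    return changes
```

Only the "new" and "updated" buckets would be handed to the processing pipeline; everything else is skipped.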

Multi-format support

PDF, Word, Excel, PowerPoint, images, and plain text files are all supported. Each format is handled by a dedicated extraction engine optimized for its structure.

Document processing

Every page parsed. Every paragraph indexed.

Each document goes through an automated pipeline. An extraction engine reads every page, identifies structure, and breaks the content into overlapping chunks optimized for search. Each chunk is converted into a vector representation and stored in a vector database. A parallel keyword index captures exact terms and part numbers.

Content extraction

An extraction engine reads every page and identifies structure: titles, paragraphs, tables, lists, and diagrams.

Table and diagram preservation

Tables are preserved with their row and column structure. Diagrams and figures are recognized and included alongside surrounding text for complete context.

Intelligent chunking

Content is split into overlapping chunks optimized for search. Small blocks are merged for context. Large blocks are split at natural boundaries.
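The core idea of overlapping chunks can be shown in a few lines: consecutive chunks share a window of words so that sentences near a boundary land in both. This is a generic sketch of the technique, with made-up parameter values, not Dokuhaku's actual chunker.

```python
def chunk_text(words: list[str], max_words: int = 200,
               overlap: int = 40) -> list[list[str]]:
    """Split a word stream into overlapping chunks.

    Each chunk shares `overlap` words with its predecessor, so content
    near a chunk boundary is searchable from either side of it.
    """
    if len(words) <= max_words:
        return [words]
    chunks = []
    step = max_words - overlap  # advance less than a full chunk
    for start in range(0, len(words), step):
        chunks.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # the final chunk already reached the end
    return chunks
```

A production chunker would additionally snap boundaries to sentence or section breaks rather than cutting at a fixed word count.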

Page-aware chunks

Every chunk knows its page number and position within the document. This enables one-click navigation to the exact page and passage in the source viewer.

Dual indexing

Each chunk is converted into a vector representation by an embedding model and stored in a vector database. A parallel keyword index captures exact terms, part numbers, and technical identifiers.
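Conceptually, dual indexing means every chunk is written to two stores at once: a vector store for meaning and an inverted keyword index for exact terms. The sketch below illustrates the idea with a toy hashed bag-of-words standing in for a real embedding model; it is not Dokuhaku's implementation.

```python
from collections import defaultdict

def toy_embedding(text: str, dims: int = 64) -> list[float]:
    # Placeholder for a real embedding model: hashed bag-of-words.
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[hash(word) % dims] += 1.0
    return vec

class DualIndex:
    """Writes each chunk to a vector store and a keyword index."""

    def __init__(self) -> None:
        self.vectors: dict[str, list[float]] = {}
        self.keywords: dict[str, set[str]] = defaultdict(set)

    def add(self, chunk_id: str, text: str) -> None:
        self.vectors[chunk_id] = toy_embedding(text)
        for term in text.lower().split():
            self.keywords[term].add(chunk_id)

    def keyword_lookup(self, term: str) -> set[str]:
        # Exact-term lookup: part numbers and identifiers hit directly.
        return self.keywords.get(term.lower(), set())
```

The keyword side is what makes an identifier like a part number findable verbatim, while the vector side handles paraphrased questions.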

Automatic reprocessing

When the processing pipeline improves, existing documents can be reprocessed without re-uploading. Updated extraction and chunking are applied automatically.

Search and AI

Two search engines. Neural reranking. Every source cited.

When you ask a question, two search systems work in parallel. A vector search finds passages with similar meaning, even across languages. A keyword search catches exact terms and part numbers. Results are merged and reranked by a neural model trained for relevance. The best matches feed into the AI, which writes an answer and cites every source.

Hybrid search

Vector search finds semantically similar passages. Keyword search catches exact terms and identifiers. Both run in parallel across your entire documentation library.

Cross-language queries

Ask questions in Finnish, get answers from English documentation. The search pipeline translates queries and uses multilingual embeddings to find relevant passages regardless of language.

Rank fusion

Results from both search engines are merged into a single unified ranking. This ensures that exact keyword matches and semantically similar passages are considered together.
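A standard way to merge two ranked lists is reciprocal rank fusion, sketched below. This is the well-known generic algorithm, shown for illustration; whether Dokuhaku uses exactly this formula is an assumption.

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[str]:
    """Merge several ranked result lists into one.

    A document scores 1 / (k + rank) in each list it appears in;
    summing those scores rewards documents both engines rank highly.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A passage ranked second by both engines can beat one ranked first by only a single engine, which is exactly the behavior a unified ranking wants.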

Neural reranking

A dedicated reranking model scores every candidate result for relevance to your specific question. The most relevant passages rise to the top.
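The reranking step reduces to "score every candidate against the question, sort best-first." In the sketch below, a crude word-overlap scorer stands in for the neural model; the structure, not the scorer, is the point, and none of this is Dokuhaku's actual code.

```python
from typing import Callable

def rerank(question: str, candidates: list[str],
           score_fn: Callable[[str, str], float]) -> list[str]:
    """Score every candidate against the question and sort best-first.

    `score_fn` stands in for a neural cross-encoder that reads the
    question and a passage together and returns a relevance score.
    """
    scored = [(score_fn(question, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored]

def token_overlap(question: str, passage: str) -> float:
    # Crude stand-in scorer: fraction of question words in the passage.
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)
```

Swapping `token_overlap` for a real cross-encoder changes the quality of the scores, not the shape of the step.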

AI generation with citations

The AI reads the top-ranked source passages and writes an answer. Every claim is backed by a citation that links directly to the source document.
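The mechanics of verifiable citations are simple: number each passage handed to the model, and keep a table that maps each marker back to its document and page. The sketch below is illustrative only; field names and formatting are hypothetical, not Dokuhaku's schema.

```python
def build_cited_context(passages: list[dict]) -> str:
    """Number each passage so the model can cite [1], [2], ... and
    every marker can be resolved back to a document and page."""
    lines = []
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] ({p['doc']}, p. {p['page']}) {p['text']}")
    return "\n".join(lines)

def resolve_citation(marker: int, passages: list[dict]) -> dict:
    # Map a citation marker in the generated answer to its source,
    # e.g. to open the viewer on the right document and page.
    p = passages[marker - 1]
    return {"doc": p["doc"], "page": p["page"]}
```

Because the mapping is built before generation, a cited claim can always be traced to a concrete page, regardless of what the model writes.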

Source verification

One click opens the document viewer on the exact page with the source passage highlighted. Every answer is verifiable against the original documentation.

Infrastructure and security

Built for production. Built for your data.

Multi-tenant isolation

Each workspace is fully isolated. Row-level security policies enforce that users only see documents belonging to their organization. No data leaks between tenants.

Encrypted at rest

All documents, credentials, and API keys are encrypted at rest. OEM portal connection secrets are stored encrypted and never exposed in plain text.

Scalable cloud infrastructure

Document processing runs on dedicated cloud infrastructure that scales with your library size. Search queries are served by high-performance vector and relational databases.

Audit logging

Every document upload, search query, and AI interaction is logged. Workspace administrators have full visibility into how documentation is being used across their team.

Interested?

We'd be happy to show you how Dokuhaku works with your documentation.

Contact Us