DeepSeek OCR is a document intelligence platform that uses context optical compression, representing a page's text as a small number of vision tokens, to achieve high accuracy and throughput. It uses a two-stage transformer-based architecture:
- Stage 1 (DeepEncoder): Compresses high-resolution page images into compact vision tokens using a window-attention SAM vision transformer, a convolutional compressor, and a global-attention CLIP-Large encoder.
- Stage 2 (MoE Decoder): Decodes the vision tokens with a 3B-parameter mixture-of-experts model to reconstruct text, layout, and diagrams.
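To make the encoder's compression step concrete, the following minimal sketch estimates how many vision tokens a page image would yield. The patch size of 16 pixels and the 16x token reduction in the convolutional compressor are assumptions chosen for illustration, not values stated in this overview, and `vision_token_count` is a hypothetical helper rather than part of any DeepSeek OCR API.

```python
def vision_token_count(width: int, height: int,
                       patch_size: int = 16,    # assumed ViT patch size
                       compression: int = 16):  # assumed compressor factor
    """Return (patch_tokens, vision_tokens) for an image of the given size.

    patch_tokens: tokens produced by splitting the image into square patches.
    vision_tokens: tokens left after the compressor reduces the count.
    """
    if width % patch_size or height % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patch_tokens = (width // patch_size) * (height // patch_size)
    return patch_tokens, patch_tokens // compression


if __name__ == "__main__":
    patches, tokens = vision_token_count(1024, 1024)
    print(patches, tokens)  # 4096 256
```

Under these assumptions, a 1024x1024 page is split into 4096 patches and handed to the decoder as 256 vision tokens, which is why the decoder's context stays small even for high-resolution scans.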
Key features include:
- Context Optical Compression: Encodes a page in roughly 10x fewer vision tokens than the text tokens it contains, while preserving essential information.
- Multilingual Support: Covers over 100 languages, including Latin, CJK, and Cyrillic scripts.
- Structured Output: Generates HTML tables, Markdown charts, SMILES chemistry strings, and geometry annotations.
- High Throughput: Processes up to 200k pages per day on a single NVIDIA A100 GPU.
- Open Source: MIT-licensed weights allow for on-premises deployment.
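The compression and throughput figures above can be turned into per-page numbers with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the 1000-token page is an assumed example, and the throughput calculation assumes the GPU runs continuously for 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / vision_tokens


def pages_per_second(pages_per_day: int) -> float:
    """Average sustained rate, assuming continuous 24-hour operation."""
    return pages_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed example: a page with 1000 text tokens encoded as 100 vision tokens.
    print(compression_ratio(1000, 100))             # 10.0
    # The quoted 200k pages per day on one GPU.
    rate = pages_per_second(200_000)
    print(round(rate, 2), round(1 / rate, 3))       # 2.31 0.432
```

Read this way, 200k pages per day corresponds to an average of about 2.3 pages per second, or roughly 0.43 seconds per page.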
Use cases include:
- Document Digitization: Converting scanned books and reports into searchable and analyzable data.
- Technical Diagram Extraction: Recovering structured content, such as SMILES strings and geometry annotations, from technical diagrams and formulas.
- Multilingual Dataset Creation: Building multilingual training datasets for language models from scanned documents.
- Document Conversion Apps: Embedding the model in platforms that process invoices, contracts, and forms.
