User Guide

multiply by zero. accelerate everything.

This guide walks you through using Trittask, from installing the Python driver to deploying on FPGA hardware.

Quick Start

Run inference in a few lines

No hardware required. Start with simulation mode to explore the API.

Simulation Mode

from trittask import Trittask
import numpy as np

# Create accelerator in simulation mode
accel = Trittask(simulation=True)

# Load ternary weights (-1, 0, +1)
weights = np.random.choice([-1, 0, 1], size=(768, 768)).astype(np.int8)
accel.load_weights(weights)

# Run inference
x = np.random.randn(768).astype(np.float32)
y = accel.forward(x)
print(f"Output shape: {y.shape}")
Installation

Set up your environment

Python Driver

# Install the Python driver
cd drivers/pynq
pip install .

# For development (with testing tools)
pip install -e .[dev]

# On PYNQ board (with hardware support)
pip install .[pynq]

RTL Simulation Tools

# macOS
brew install icarus-verilog verilator

# Ubuntu/Debian
sudo apt install iverilog verilator

# Verify installation
iverilog -V
verilator --version
Basic Usage

Inference API

Load and Run

from trittask import Trittask
import numpy as np

# Initialize (simulation mode for development)
accel = Trittask(simulation=True)

# Or with real hardware
# accel = Trittask('trittask.bit')

# Load model weights
weights = np.load('model_weights.npz')
for layer_name, layer_weights in weights.items():
    accel.load_weights(layer_weights, layer=layer_name)

# Run inference
input_data = np.random.randn(1, 768).astype(np.float32)
output = accel.forward(input_data)

Batch Processing

# Process multiple inputs
batch = np.random.randn(32, 768).astype(np.float32)
results = []

for i in range(batch.shape[0]):
    result = accel.forward(batch[i])
    results.append(result)

outputs = np.stack(results)
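The per-sample loop above can be wrapped in a small helper. This is a sketch against the driver API shown in the examples, assuming forward accepts one (dim,) sample at a time; the real driver may also accept batched input directly:

```python
import numpy as np

def forward_batch(accel, batch: np.ndarray) -> np.ndarray:
    """Run accel.forward over each row of `batch` and stack the results.

    Assumes `accel.forward` takes a single (dim,) float32 sample,
    as in the examples above.
    """
    return np.stack([accel.forward(x) for x in batch])
```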
Model Preparation

Convert and quantize models

Converting PyTorch Models

# Compile a PyTorch model to Trittask format
python3 scripts/compile_pytorch.py \
  path/to/model.pt --output build/weights

Quantizing to Ternary

from trittask.models import quantize_to_ternary

# Load FP32 weights
fp32_weights = np.load('original_weights.npy')

# Quantize to ternary (-1, 0, +1)
ternary_weights, scale = quantize_to_ternary(fp32_weights)

# Save for Trittask
np.savez('trittask_weights.npz',
         weights=ternary_weights,
         scale=scale)
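For reference, an absmean-style ternary quantizer in the spirit of BitNet b1.58 fits in a few lines. This is an illustrative sketch, not necessarily what trittask.models.quantize_to_ternary does internally:

```python
import numpy as np

def quantize_to_ternary_sketch(w: np.ndarray, eps: float = 1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps                          # per-tensor scale factor
    ternary = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return ternary, scale                                   # reconstruct with ternary * scale
```

Weights near zero quantize to 0 (and are skipped by the hardware), while large weights saturate to ±1.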

Translating Different Architectures

from trittask.models import translate_model, print_isa_summary

# Translate any PyTorch model
translated = translate_model(pytorch_model)

# Or specify the architecture type
translated = translate_model(resnet18, model_type='cnn')
translated = translate_model(bert, model_type='transformer')
translated = translate_model(lstm, model_type='lstm')

# View required ISA operations
print_isa_summary(translated)
On-Device Training

LoRA fine-tuning on FPGA

LoRA Training API

from trittask import Trittask
from trittask.lora import LoRAConfig, LoRATrainer

# Initialize accelerator
accel = Trittask('trittask.bit')
accel.load_weights(base_weights)

# Configure LoRA
config = LoRAConfig(
    rank=16,              # Bottleneck dimension
    alpha=16.0,           # Scaling factor
    learning_rate=1e-4,   # Adam learning rate
    use_adam=True         # Use Adam optimizer
)

# Create trainer
trainer = LoRATrainer(accel, config)

# Training loop
for epoch in range(10):
    total_loss = 0
    for x, y in dataloader:
        loss = trainer.step(x, y)
        total_loss += loss
    print(f"Epoch {epoch}: Loss = {total_loss / len(dataloader):.4f}")

# Save checkpoint
trainer.save_checkpoint('lora_adapted.npz')

Double-Buffer Mode

Train and serve inference simultaneously. Zero-downtime adaptation.

# Enable double buffering
accel.enable_lora(
    rank=16,
    double_buffer=True  # Train while serving
)

# Inference continues uninterrupted
# during training
FPGA Deployment

Supported boards and builds

Board      FPGA      TRUs                   TOPS / Power / Cost
EBAZ4205   XC7Z010   16-48                  3-5 TOPS / 3W / ~$25
PYNQ-Z2    XC7Z020   128 (4-way systolic)   25 TOPS / 5W / ~$150
ZCU102     XCZU9EG   128-512                25-100 TOPS / 15W / ~$3000

Building for FPGA

# Standard builds
make vivado_zcu102    # ZCU102 (max performance)
make vivado_pynq      # PYNQ-Z2 (best efficiency)
make vivado_ebaz      # EBAZ4205 (lowest cost)

# Maximum utilization builds (~90% FPGA fill)
make vivado_zcu102 DEFINES="+define+TARGET_ZCU102_MAX"
make vivado_pynq DEFINES="+define+TARGET_PYNQ_Z2_MAX"

Deploying to PYNQ

from trittask import Trittask

# Load bitstream
accel = Trittask('trittask.bit')

# Verify connection
status = accel.status()
print(f"Accelerator: {status}")

# Load weights and run
accel.load_weights(weights)
output = accel.forward(input_data)
Resource Utilization

FPGA resource usage by target

Choose between balanced configurations (headroom for timing closure) or MAX configurations (maximum throughput).

Board                Available LUTs   Available BRAMs   Available DSPs
ZCU102 (XCZU9EG)     274,080          912               2,520
PYNQ-Z2 (XC7Z020)    53,200           140               220
EBAZ4205 (XC7Z010)   17,600           60                80

Configuration               LUTs Used      BRAMs Used   DSPs Used
TARGET_ZCU102 (Balanced)    ~40K (15%)     ~100 (11%)   ~90 (4%)
TARGET_ZCU102_MAX           ~246K (90%)    ~820 (90%)   ~2268 (90%)
TARGET_PYNQ_Z2 (Balanced)   ~30K (56%)     ~80 (57%)    ~50 (23%)
TARGET_PYNQ_Z2_MAX          ~50K (94%)     ~8 (6%)      ~200 (91%)
TARGET_EBAZ4205 (Minimal)   ~10K (57%)     ~40 (67%)    ~16 (20%)
TARGET_EBAZ4205_MAX         ~15.8K (90%)   ~54 (90%)    ~72 (90%)

Balanced Config

Default configuration with headroom for timing closure. Recommended for development and production deployments.

MAX Config

128 TRUs with 4-way systolic read (PYNQ-Z2). Guaranteed 100MHz timing. 4x read throughput with 32-cycle max latency.

Power vs Performance

MAX configs trade power for throughput. Use balanced for battery-powered or fanless deployments.

Building with MAX Configuration

# ZCU102 MAX (~100 TOPS, 512+ TRUs)
make vivado_zcu102 DEFINES="+define+TARGET_ZCU102_MAX"

# PYNQ-Z2 MAX (~25 TOPS, 128 TRUs, 4-way systolic)
make vivado_pynq DEFINES="+define+TARGET_PYNQ_Z2_MAX"

# Configurable core count for PYNQ-Z2:
vivado -mode batch -source scripts/vivado_pynq.tcl -tclargs 128 200  # 128 TRUs, 200 DSPs
vivado -mode batch -source scripts/vivado_pynq.tcl -tclargs 80       # 80 TRUs (faster timing)

# EBAZ4205 MAX (~5 TOPS, 48 TRUs)
make vivado_ebaz DEFINES="+define+TARGET_EBAZ4205_MAX"
Simulation & Verification

RTL simulation and digital twins

Digital Twin Verification

Bit-accurate Python simulations that match RTL behavior exactly:

# Verify ternary MAC (BitNet b1.58)
python3 scripts/digital_twin/ternary_mac.py

# Verify SANTA stochastic attention
python3 scripts/digital_twin/santa.py

# Verify ECO L1-distance (EcoTransformer)
python3 scripts/digital_twin/eco.py

# Verify PWA softmax (4-segment exp)
python3 scripts/digital_twin/softmax.py

# Full MFE integration test
python3 scripts/digital_twin/mfe.py

RTL Simulation

# Compile MFE testbench
iverilog -g2012 -o build/tb_mfe \
  tb/tb_multiplier_free_engine.v \
  rtl/multiplier_free_engine.v \
  rtl/ternary_mac.v rtl/santa_unit.v \
  rtl/eco_transformer.v rtl/pwa_activation.v

# Run simulation
vvp build/tb_mfe

# View waveforms (optional)
gtkwave build/tb_multiplier_free_engine.vcd

Cached MFE (BRAM Double-Buffer)

# Compile cached MFE with double-buffering
iverilog -g2012 -o build/tb_cached_mfe \
  tb/tb_cached_mfe.sv \
  rtl/cached_mfe_top.sv \
  rtl/cache/weight_double_buffer.sv \
  rtl/cache/weight_cache_simple.sv \
  rtl/multiplier_free_engine.v \
  rtl/ternary_mac.v rtl/santa_unit.v

# Run simulation
vvp build/tb_cached_mfe
# Expected: Stall cycles: 1 (99.8% efficiency)

Co-Simulation

# Verify Python matches RTL exactly
python3 scripts/debug_cosim.py

# Expected output:
# "SUCCESS: Python and Verilog match perfectly!"

Overlay PE Testing

# Assemble a test program
python3 scripts/overlay/assembler.py \
  tb/overlay/test_programs/test_matmul.asm \
  -o build/test_matmul.bin

# Run overlay testbench
iverilog -g2012 -o build/tb_overlay \
  tb/overlay/tb_temporal_overlay_pe.sv \
  rtl/overlay/*.sv
vvp build/tb_overlay
Troubleshooting

Common issues

"PYNQ not available"

Normal on development machines. The driver runs in simulation mode. Install PYNQ on your FPGA board: pip install pynq

Weight dimension mismatch

Ensure weights are 2D and match the expected dimensions: (dim_out, dim_in), e.g., (768, 768).
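The expected layout follows the usual row-major matmul convention: a (dim_out, dim_in) weight matrix maps a (dim_in,) input to a (dim_out,) output. A quick NumPy sanity check (pure illustration, not driver code):

```python
import numpy as np

dim_out, dim_in = 768, 512
W = np.random.choice([-1, 0, 1], size=(dim_out, dim_in)).astype(np.int8)
x = np.random.randn(dim_in).astype(np.float32)

assert W.ndim == 2                 # weights must be 2D
y = W.astype(np.float32) @ x       # (dim_out, dim_in) @ (dim_in,) -> (dim_out,)
assert y.shape == (dim_out,)
```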

Simulation vs hardware

Check the mode with accel.simulation: it returns True in simulation mode; on a real FPGA it reflects the hardware status.

Feature Configuration

Modular builds for different FPGAs

Configure Trittask for your target FPGA with compile-time feature flags. Smaller FPGAs can disable unused features to fit.

Feature Presets

# Minimal build for tiny FPGAs (<10K LUTs)
# Tang Nano 9K, iCE40, etc.
make synth PRESET=PRESET_MINIMAL

# Inference-only for mid-range FPGAs
# ECP5, PYNQ-Z2 (no training)
make synth PRESET=PRESET_INFERENCE_ONLY

# Full training support
# ZCU102, ZCU104
make synth PRESET=PRESET_TRAINING_FOCUS

Individual Feature Flags

# Enable/disable features individually
+define+FEATURE_LORA       # LoRA training
+define+FEATURE_SANTA      # Stochastic attention
+define+FEATURE_ECO        # L1-distance attention
+define+FEATURE_PWA        # PWA activation
+define+FEATURE_SOFTMAX    # PWA softmax
+define+FEATURE_CACHE      # Weight cache

# Example: inference-only with softmax
make vivado_pynq DEFINES="+define+FEATURE_SOFTMAX"
Preset                  LUTs   Target                Features
PRESET_MINIMAL          ~5K    Tang Nano 9K, iCE40   Ternary MAC only
PRESET_INFERENCE_ONLY   ~20K   ECP5, PYNQ-Z2         MAC + SANTA + ECO + PWA
PRESET_TRAINING_FOCUS   ~40K   ZCU102, ZCU104        All features + LoRA/QLoRA
Quick Reference

Key files and memory map

Key Files

  • multiplier_free_engine.v Core TRU module
  • mfe_array.v 128-TRU array with 4-way systolic read
  • pynq_max_top.sv PYNQ-Z2 MAX configuration top
  • cached_mfe_top.sv Cached MFE with double-buffer
  • weight_double_buffer.sv BRAM ping-pong prefetch
  • temporal_overlay_pe.sv Programmable overlay PE
  • lora_adapter_v2.sv LoRA training engine
  • feature_config.svh Feature configuration header
  • trittask_top.sv Configurable top-level module
  • softmax_unit.sv PWA softmax (buffered/streaming)
  • drivers/pynq/tercel/ Python driver
  • scripts/digital_twin/ Bit-accurate simulations

Memory Map

Address Register
0x00 DATA_IN - Input activations
0x04 CONTROL - Mode, enable, reset
0x08 OUTPUT - Result output
0x10 LORA_CTRL - LoRA configuration

Compute Modes

Mode Description
MODE_TERNARY (0x0) Ternary MAC (default)
MODE_SANTA (0x1) Stochastic attention
MODE_ECO (0x2) L1-distance attention
MODE_LORA (0x3) LoRA training

Attention Modes

Mode                Path
ATTN_STANDARD (0)   (QK^T)V - standard attention
ATTN_LINEAR (1)     Q(K^T V) - ~8x fewer ops at long sequence lengths

Cached MFE Modes

mode_sel Description
0b00 (Streaming) Direct AXI-Stream weights
0b01 (Cache) Weight cache with miss handling
0b10 (Buffer) BRAM double-buffer (99.8% eff.)
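On PYNQ boards, the register map above is typically driven through memory-mapped I/O. The sketch below uses the offsets and mode codes from the tables, with a plain dict standing in for the MMIO window; this is illustrative only, and the real Python driver wraps all of this for you:

```python
# Register offsets and mode codes from the tables above
DATA_IN, CONTROL, OUTPUT, LORA_CTRL = 0x00, 0x04, 0x08, 0x10
MODE_TERNARY, MODE_SANTA, MODE_ECO, MODE_LORA = 0x0, 0x1, 0x2, 0x3

class MockBus:
    """Dict-backed stand-in for a PYNQ-style MMIO window (illustrative only)."""
    def __init__(self):
        self.regs = {}

    def write(self, offset, value):
        self.regs[offset] = value

    def read(self, offset):
        return self.regs.get(offset, 0)

bus = MockBus()
bus.write(CONTROL, MODE_SANTA)   # select stochastic attention mode
bus.write(DATA_IN, 0x1234)       # push one input word
assert bus.read(CONTROL) == MODE_SANTA
```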
Getting Help

Resources

Documentation

See docs/ARCHITECTURE.md for technical details on the hardware design.

Examples

Check drivers/pynq/notebooks/ for Jupyter tutorials and example workflows.


Terminology

Glossary

Key terms for understanding Trittask's ternary AI acceleration technology.

Term Definition
Trittask A reconfigurable AI accelerator designed for efficient edge deployment. Trittask uses ternary computation and multiplier-free architecture to deliver high-throughput AI inference at a fraction of the power consumption of traditional GPUs.
TRU (Trit Reconfigurable Unit) The fundamental compute core in a Trittask accelerator. Each TRU performs ternary operations without multipliers, enabling high-efficiency AI inference. Multiple TRUs work in parallel to accelerate neural network workloads.
Trit Short for ternary digit. The basic unit of information in ternary computing, representing three possible values: -1, 0, or +1. Analogous to a "bit" in binary computing. Trits enable efficient neural network computation with minimal precision loss.
Ternary Neural Network (TNN) A neural network where weights are constrained to three values (-1, 0, +1). TNNs dramatically reduce memory footprint and enable multiplier-free computation while maintaining accuracy for many AI tasks.
Multiplier-Free A compute architecture that replaces costly multiply operations with simple additions and sign flips. Since ternary weights are only -1, 0, or +1, multiplication becomes: add, skip, or negate—eliminating power-hungry multiplier circuits.
Reconfigurable The ability to reprogram the accelerator's compute fabric for different neural network architectures. Unlike fixed-function ASICs, Trittask can adapt to new models, layer types, and workloads without hardware changes.
Edge AI Running AI inference directly on local devices (drones, cameras, robots, IoT) rather than in the cloud. Edge AI reduces latency, bandwidth costs, and privacy concerns. Trittask is optimized for edge deployment.
Inference The process of running a trained neural network on new input data to generate predictions. Trittask accelerates inference workloads with high throughput and low power consumption.
Quantization Reducing the numerical precision of neural network weights and activations (e.g., from 32-bit floats to ternary). Quantization shrinks model size and speeds up computation with minimal accuracy loss.
LoRA (Low-Rank Adaptation) A parameter-efficient fine-tuning technique that adapts pre-trained models by training small, low-rank weight matrices. Trittask supports on-device LoRA for edge customization.
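The "add, skip, or negate" rule is easy to see in code. A minimal software model of a multiplier-free ternary dot product (illustrative; the hardware TRU does this with adders and sign logic):

```python
import numpy as np

def ternary_dot(weights: np.ndarray, x: np.ndarray) -> float:
    """Dot product with weights in {-1, 0, +1} using no multiplies."""
    acc = 0.0
    for w, xi in zip(weights, x):
        if w == 1:       # +1: add
            acc += xi
        elif w == -1:    # -1: negate (subtract)
            acc -= xi
        # 0: skip entirely -- sparsity is free
    return acc

w = np.array([1, 0, -1, 1], dtype=np.int8)
x = np.array([0.5, 9.9, 2.0, 1.5], dtype=np.float32)
assert ternary_dot(w, x) == float(w.astype(np.float32) @ x)  # matches a real matmul
```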

Quick Reference

What You Know        Trittask Equivalent    Notes
Bit (binary digit)   Trit (ternary digit)   -1, 0, +1 vs 0, 1
CUDA Core (NVIDIA)   TRU                    Ternary compute unit
GPU / TPU            Trittask Accelerator   Reconfigurable fabric
Float16 / Int8       Ternary (-1, 0, +1)    2 bits per weight
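The "2 bits per weight" figure comes from encoding each trit in two bits, four trits per byte. A hypothetical packing scheme for illustration; the actual on-chip format may differ:

```python
# Encode each trit as 2 bits: 0 -> 0b00, +1 -> 0b01, -1 -> 0b10
_ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}
_DECODE = {v: k for k, v in _ENCODE.items()}

def pack_trits(trits):
    """Pack a trit sequence into bytes, 4 trits per byte (2 bits each)."""
    out = bytearray()
    for i in range(0, len(trits), 4):
        byte = 0
        for j, t in enumerate(trits[i:i + 4]):
            byte |= _ENCODE[int(t)] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack_trits(data, n):
    """Recover the first n trits from packed bytes."""
    return [_DECODE[(b >> (2 * j)) & 0b11] for b in data for j in range(4)][:n]

w = [-1, 0, 1, 1, 0, -1, 1, 0]
packed = pack_trits(w)
assert len(packed) == 2            # 8 trits -> 2 bytes
assert unpack_trits(packed, 8) == w
```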
References

Standing on the shoulders of giants

Algorithm Verification

All algorithms are verified against their paper specifications:

# Run algorithm verification suite
python3 tests/test_algorithms_pynq.py

# Or run on PYNQ hardware
jupyter notebook drivers/pynq/notebooks/04_algorithm_verification.ipynb