User Guide

multiply by zero. accelerate everything.

This guide walks you through using Trittask, from installing the Python driver to deploying on FPGA hardware.

Quick Start

Run inference in a few lines

No hardware required. Start with simulation mode to explore the API.

Simulation Mode

from trittask import Trittask
import numpy as np

# Create accelerator in simulation mode
accel = Trittask(simulation=True)

# Load ternary weights (-1, 0, +1)
weights = np.random.choice([-1, 0, 1], size=(768, 768)).astype(np.int8)
accel.load_weights(weights)

# Run inference
x = np.random.randn(768).astype(np.float32)
y = accel.forward(x)
print(f"Output shape: {y.shape}")
Installation

Set up your environment

Python Driver

# Install the Python driver
cd drivers/pynq
pip install .

# For development (with testing tools)
pip install -e .[dev]

# On PYNQ board (with hardware support)
pip install .[pynq]

RTL Simulation Tools

# macOS
brew install icarus-verilog verilator

# Ubuntu/Debian
sudo apt install iverilog verilator

# Verify installation
iverilog -V
verilator --version
Basic Usage

Inference API

Load and Run

from trittask import Trittask
import numpy as np

# Initialize (simulation mode for development)
accel = Trittask(simulation=True)

# Or with real hardware
# accel = Trittask('trittask.bit')

# Load model weights
weights = np.load('model_weights.npz')
for layer_name, layer_weights in weights.items():
    accel.load_weights(layer_weights, layer=layer_name)

# Run inference
input_data = np.random.randn(1, 768).astype(np.float32)
output = accel.forward(input_data)

Batch Processing

# Process multiple inputs
batch = np.random.randn(32, 768).astype(np.float32)
results = []

for i in range(batch.shape[0]):
    result = accel.forward(batch[i])
    results.append(result)

outputs = np.stack(results)
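The per-sample loop above can be wrapped in a small helper. This is a sketch against the driver API shown in the examples, assuming forward accepts one (dim,) sample at a time; the real driver may also accept batched input directly:

```python
import numpy as np

def forward_batch(accel, batch: np.ndarray) -> np.ndarray:
    """Run accel.forward over each row of `batch` and stack the results.

    Assumes `accel.forward` takes a single (dim,) float32 sample,
    as in the examples above.
    """
    return np.stack([accel.forward(x) for x in batch])
```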
Model Preparation

Convert and quantize models

Converting PyTorch Models

# Compile a PyTorch model to Trittask format
python3 scripts/compile_pytorch.py \
  path/to/model.pt --output build/weights

Quantizing to Ternary

from trittask.models import quantize_to_ternary

# Load FP32 weights
fp32_weights = np.load('original_weights.npy')

# Quantize to ternary (-1, 0, +1)
ternary_weights, scale = quantize_to_ternary(fp32_weights)

# Save for Trittask
np.savez('trittask_weights.npz',
         weights=ternary_weights,
         scale=scale)
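For reference, an absmean-style ternary quantizer in the spirit of BitNet b1.58 fits in a few lines. This is an illustrative sketch, not necessarily what trittask.models.quantize_to_ternary does internally:

```python
import numpy as np

def quantize_to_ternary_sketch(w: np.ndarray, eps: float = 1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps                          # per-tensor scale factor
    ternary = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return ternary, scale                                   # reconstruct with ternary * scale
```

Weights near zero quantize to 0 (and are skipped by the hardware), while large weights saturate to ±1.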

Translating Different Architectures

from trittask.models import translate_model, print_isa_summary

# Translate any PyTorch model
translated = translate_model(pytorch_model)

# Or specify the architecture type
translated = translate_model(resnet18, model_type='cnn')
translated = translate_model(bert, model_type='transformer')
translated = translate_model(lstm, model_type='lstm')

# View required ISA operations
print_isa_summary(translated)
On-Device Training

LoRA fine-tuning on FPGA

LoRA Training API

from trittask import Trittask
from trittask.lora import LoRAConfig, LoRATrainer

# Initialize accelerator
accel = Trittask('trittask.bit')
accel.load_weights(base_weights)

# Configure LoRA
config = LoRAConfig(
    rank=16,              # Bottleneck dimension
    alpha=16.0,           # Scaling factor
    learning_rate=1e-4,   # Adam learning rate
    use_adam=True         # Use Adam optimizer
)

# Create trainer
trainer = LoRATrainer(accel, config)

# Training loop
for epoch in range(10):
    total_loss = 0
    for x, y in dataloader:
        loss = trainer.step(x, y)
        total_loss += loss
    print(f"Epoch {epoch}: Loss = {total_loss / len(dataloader):.4f}")

# Save checkpoint
trainer.save_checkpoint('lora_adapted.npz')

Double-Buffer Mode

Train and serve inference simultaneously. Zero-downtime adaptation.

# Enable double buffering
accel.enable_lora(
    rank=16,
    double_buffer=True  # Train while serving
)

# Inference continues uninterrupted
# during training
FPGA Deployment

Supported boards and builds

Board      FPGA      TRUs                   TOPS / Power / Cost
EBAZ4205   XC7Z010   16-48                  3-5 TOPS / 3W / ~$25
PYNQ-Z2    XC7Z020   128 (4-way systolic)   25 TOPS / 5W / ~$150
ZCU102     XCZU9EG   128-512                25-100 TOPS / 15W / ~$3000

Building for FPGA

# Standard builds
make vivado_zcu102    # ZCU102 (max performance)
make vivado_pynq      # PYNQ-Z2 (best efficiency)
make vivado_ebaz      # EBAZ4205 (lowest cost)

# Maximum utilization builds (~90% FPGA fill)
make vivado_zcu102 DEFINES="+define+TARGET_ZCU102_MAX"
make vivado_pynq DEFINES="+define+TARGET_PYNQ_Z2_MAX"

Deploying to PYNQ

from trittask import Trittask

# Load bitstream
accel = Trittask('trittask.bit')

# Verify connection
status = accel.status()
print(f"Accelerator: {status}")

# Load weights and run
accel.load_weights(weights)
output = accel.forward(input_data)
Resource Utilization

FPGA resource usage by target

Choose between balanced configurations (headroom for timing closure) or MAX configurations (maximum throughput).

Board                Available LUTs   Available BRAMs   Available DSPs
ZCU102 (XCZU9EG)     274,080          912               2,520
PYNQ-Z2 (XC7Z020)    53,200           140               220
EBAZ4205 (XC7Z010)   17,600           60                80

Configuration               LUTs Used      BRAMs Used   DSPs Used
TARGET_ZCU102 (Balanced)    ~40K (15%)     ~100 (11%)   ~90 (4%)
TARGET_ZCU102_MAX           ~246K (90%)    ~820 (90%)   ~2268 (90%)
TARGET_PYNQ_Z2 (Balanced)   ~30K (56%)     ~80 (57%)    ~50 (23%)
TARGET_PYNQ_Z2_MAX          ~50K (94%)     ~8 (6%)      ~200 (91%)
TARGET_EBAZ4205 (Minimal)   ~10K (57%)     ~40 (67%)    ~16 (20%)
TARGET_EBAZ4205_MAX         ~15.8K (90%)   ~54 (90%)    ~72 (90%)

Balanced Config

Default configuration with headroom for timing closure. Recommended for development and production deployments.

MAX Config

128 TRUs with 4-way systolic read (PYNQ-Z2). Guaranteed 100MHz timing. 4x read throughput with 32-cycle max latency.

Power vs Performance

MAX configs trade power for throughput. Use balanced for battery-powered or fanless deployments.

Building with MAX Configuration

# ZCU102 MAX (~100 TOPS, 512+ TRUs)
make vivado_zcu102 DEFINES="+define+TARGET_ZCU102_MAX"

# PYNQ-Z2 MAX (~25 TOPS, 128 TRUs, 4-way systolic)
make vivado_pynq DEFINES="+define+TARGET_PYNQ_Z2_MAX"

# Configurable core count for PYNQ-Z2:
vivado -mode batch -source scripts/vivado_pynq.tcl -tclargs 128 200  # 128 TRUs, 200 DSPs
vivado -mode batch -source scripts/vivado_pynq.tcl -tclargs 80       # 80 TRUs (faster timing)

# EBAZ4205 MAX (~5 TOPS, 48 TRUs)
make vivado_ebaz DEFINES="+define+TARGET_EBAZ4205_MAX"
Simulation & Verification

RTL simulation and digital twins

Digital Twin Verification

Bit-accurate Python simulations that match RTL behavior exactly:

# Verify ternary MAC (BitNet b1.58)
python3 scripts/digital_twin/ternary_mac.py

# Verify SANTA stochastic attention
python3 scripts/digital_twin/santa.py

# Verify ECO L1-distance (EcoTransformer)
python3 scripts/digital_twin/eco.py

# Verify PWA softmax (4-segment exp)
python3 scripts/digital_twin/softmax.py

# Full MFE integration test
python3 scripts/digital_twin/mfe.py

RTL Simulation

# Compile MFE testbench
iverilog -g2012 -o build/tb_mfe \
  tb/tb_multiplier_free_engine.v \
  rtl/multiplier_free_engine.v \
  rtl/ternary_mac.v rtl/santa_unit.v \
  rtl/eco_transformer.v rtl/pwa_activation.v

# Run simulation
vvp build/tb_mfe

# View waveforms (optional)
gtkwave build/tb_multiplier_free_engine.vcd

Cached MFE (BRAM Double-Buffer)

# Compile cached MFE with double-buffering
iverilog -g2012 -o build/tb_cached_mfe \
  tb/tb_cached_mfe.sv \
  rtl/cached_mfe_top.sv \
  rtl/cache/weight_double_buffer.sv \
  rtl/cache/weight_cache_simple.sv \
  rtl/multiplier_free_engine.v \
  rtl/ternary_mac.v rtl/santa_unit.v

# Run simulation
vvp build/tb_cached_mfe
# Expected: Stall cycles: 1 (99.8% efficiency)

Co-Simulation

# Verify Python matches RTL exactly
python3 scripts/debug_cosim.py

# Expected output:
# "SUCCESS: Python and Verilog match perfectly!"

Overlay PE Testing

# Assemble a test program
python3 scripts/overlay/assembler.py \
  tb/overlay/test_programs/test_matmul.asm \
  -o build/test_matmul.bin

# Run overlay testbench
iverilog -g2012 -o build/tb_overlay \
  tb/overlay/tb_temporal_overlay_pe.sv \
  rtl/overlay/*.sv
vvp build/tb_overlay
Troubleshooting

Common issues

"PYNQ not available"

Normal on development machines. The driver runs in simulation mode. Install PYNQ on your FPGA board: pip install pynq

Weight dimension mismatch

Ensure weights are 2D and match the expected dimensions: (dim_out, dim_in), e.g., (768, 768).
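The expected layout follows the usual row-major matmul convention: a (dim_out, dim_in) weight matrix maps a (dim_in,) input to a (dim_out,) output. A quick NumPy sanity check (pure illustration, not driver code):

```python
import numpy as np

dim_out, dim_in = 768, 512
W = np.random.choice([-1, 0, 1], size=(dim_out, dim_in)).astype(np.int8)
x = np.random.randn(dim_in).astype(np.float32)

assert W.ndim == 2                 # weights must be 2D
y = W.astype(np.float32) @ x       # (dim_out, dim_in) @ (dim_in,) -> (dim_out,)
assert y.shape == (dim_out,)
```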

Simulation vs hardware

Check the mode with accel.simulation: it returns True in simulation mode; on a real FPGA it reflects the hardware status.

Feature Configuration

Modular builds for different FPGAs

Configure Trittask for your target FPGA with compile-time feature flags. Smaller FPGAs can disable unused features to fit.

Feature Presets

# Minimal build for tiny FPGAs (<10K LUTs)
# Tang Nano 9K, iCE40, etc.
make synth PRESET=PRESET_MINIMAL

# Inference-only for mid-range FPGAs
# ECP5, PYNQ-Z2 (no training)
make synth PRESET=PRESET_INFERENCE_ONLY

# Full training support
# ZCU102, ZCU104
make synth PRESET=PRESET_TRAINING_FOCUS

Individual Feature Flags

# Enable/disable features individually
+define+FEATURE_LORA       # LoRA training
+define+FEATURE_SANTA      # Stochastic attention
+define+FEATURE_ECO        # L1-distance attention
+define+FEATURE_PWA        # PWA activation
+define+FEATURE_SOFTMAX    # PWA softmax
+define+FEATURE_CACHE      # Weight cache

# Example: inference-only with softmax
make vivado_pynq DEFINES="+define+FEATURE_SOFTMAX"
Preset                  LUTs   Target                Features
PRESET_MINIMAL          ~5K    Tang Nano 9K, iCE40   Ternary MAC only
PRESET_INFERENCE_ONLY   ~20K   ECP5, PYNQ-Z2         MAC + SANTA + ECO + PWA
PRESET_TRAINING_FOCUS   ~40K   ZCU102, ZCU104        All features + LoRA/QLoRA
Quick Reference

Key files and memory map

Key Files

  • multiplier_free_engine.v Core TRU module
  • mfe_array.v 128-TRU array with 4-way systolic read
  • pynq_max_top.sv PYNQ-Z2 MAX configuration top
  • cached_mfe_top.sv Cached MFE with double-buffer
  • weight_double_buffer.sv BRAM ping-pong prefetch
  • temporal_overlay_pe.sv Programmable overlay PE
  • lora_adapter_v2.sv LoRA training engine
  • feature_config.svh Feature configuration header
  • trittask_top.sv Configurable top-level module
  • softmax_unit.sv PWA softmax (buffered/streaming)
  • drivers/pynq/tercel/ Python driver
  • scripts/digital_twin/ Bit-accurate simulations

Memory Map

Address Register
0x00 DATA_IN - Input activations
0x04 CONTROL - Mode, enable, reset
0x08 OUTPUT - Result output
0x10 LORA_CTRL - LoRA configuration

Compute Modes

Mode Description
MODE_TERNARY (0x0) Ternary MAC (default)
MODE_SANTA (0x1) Stochastic attention
MODE_ECO (0x2) L1-distance attention
MODE_LORA (0x3) LoRA training

Attention Modes

Mode                Path
ATTN_STANDARD (0)   (QK^T)V - standard attention
ATTN_LINEAR (1)     Q(K^T V) - ~8x fewer ops at long sequence lengths

Cached MFE Modes

mode_sel Description
0b00 (Streaming) Direct AXI-Stream weights
0b01 (Cache) Weight cache with miss handling
0b10 (Buffer) BRAM double-buffer (99.8% eff.)
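On PYNQ boards, the register map above is typically driven through memory-mapped I/O. The sketch below uses the offsets and mode codes from the tables, with a plain dict standing in for the MMIO window; this is illustrative only, and the real Python driver wraps all of this for you:

```python
# Register offsets and mode codes from the tables above
DATA_IN, CONTROL, OUTPUT, LORA_CTRL = 0x00, 0x04, 0x08, 0x10
MODE_TERNARY, MODE_SANTA, MODE_ECO, MODE_LORA = 0x0, 0x1, 0x2, 0x3

class MockBus:
    """Dict-backed stand-in for a PYNQ-style MMIO window (illustrative only)."""
    def __init__(self):
        self.regs = {}

    def write(self, offset, value):
        self.regs[offset] = value

    def read(self, offset):
        return self.regs.get(offset, 0)

bus = MockBus()
bus.write(CONTROL, MODE_SANTA)   # select stochastic attention mode
bus.write(DATA_IN, 0x1234)       # push one input word
assert bus.read(CONTROL) == MODE_SANTA
```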
Getting Help

Resources

Documentation

See docs/ARCHITECTURE.md for technical details on the hardware design.

Examples

Check drivers/pynq/notebooks/ for Jupyter tutorials and example workflows.


Terminology

Glossary

Key terms for understanding Trittask's ternary AI acceleration technology.

Term Definition
Trittask A reconfigurable AI accelerator designed for efficient edge deployment. Trittask uses ternary computation and multiplier-free architecture to deliver high-throughput AI inference at a fraction of the power consumption of traditional GPUs.
TRU (Trit Reconfigurable Unit) The fundamental compute core in a Trittask accelerator. Each TRU performs ternary operations without multipliers, enabling high-efficiency AI inference. Multiple TRUs work in parallel to accelerate neural network workloads.
Trit Short for ternary digit. The basic unit of information in ternary computing, representing three possible values: -1, 0, or +1. Analogous to a "bit" in binary computing. Trits enable efficient neural network computation with minimal precision loss.
Ternary Neural Network (TNN) A neural network where weights are constrained to three values (-1, 0, +1). TNNs dramatically reduce memory footprint and enable multiplier-free computation while maintaining accuracy for many AI tasks.
Multiplier-Free A compute architecture that replaces costly multiply operations with simple additions and sign flips. Since ternary weights are only -1, 0, or +1, multiplication becomes: add, skip, or negate—eliminating power-hungry multiplier circuits.
Reconfigurable The ability to reprogram the accelerator's compute fabric for different neural network architectures. Unlike fixed-function ASICs, Trittask can adapt to new models, layer types, and workloads without hardware changes.
Edge AI Running AI inference directly on local devices (drones, cameras, robots, IoT) rather than in the cloud. Edge AI reduces latency, bandwidth costs, and privacy concerns. Trittask is optimized for edge deployment.
Inference The process of running a trained neural network on new input data to generate predictions. Trittask accelerates inference workloads with high throughput and low power consumption.
Quantization Reducing the numerical precision of neural network weights and activations (e.g., from 32-bit floats to ternary). Quantization shrinks model size and speeds up computation with minimal accuracy loss.
LoRA (Low-Rank Adaptation) A parameter-efficient fine-tuning technique that adapts pre-trained models by training small, low-rank weight matrices. Trittask supports on-device LoRA for edge customization.
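The "add, skip, or negate" rule is easy to see in code. A minimal software model of a multiplier-free ternary dot product (illustrative; the hardware TRU does this with adders and sign logic):

```python
import numpy as np

def ternary_dot(weights: np.ndarray, x: np.ndarray) -> float:
    """Dot product with weights in {-1, 0, +1} using no multiplies."""
    acc = 0.0
    for w, xi in zip(weights, x):
        if w == 1:       # +1: add
            acc += xi
        elif w == -1:    # -1: negate (subtract)
            acc -= xi
        # 0: skip entirely -- sparsity is free
    return acc

w = np.array([1, 0, -1, 1], dtype=np.int8)
x = np.array([0.5, 9.9, 2.0, 1.5], dtype=np.float32)
assert ternary_dot(w, x) == float(w.astype(np.float32) @ x)  # matches a real matmul
```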

Quick Reference

What You Know        Trittask Equivalent    Notes
Bit (binary digit)   Trit (ternary digit)   -1, 0, +1 vs 0, 1
CUDA Core (NVIDIA)   TRU                    Ternary compute unit
GPU / TPU            Trittask Accelerator   Reconfigurable fabric
Float16 / Int8       Ternary (-1, 0, +1)    2 bits per weight
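The "2 bits per weight" figure comes from encoding each trit in two bits, four trits per byte. A hypothetical packing scheme for illustration; the actual on-chip format may differ:

```python
# Encode each trit as 2 bits: 0 -> 0b00, +1 -> 0b01, -1 -> 0b10
_ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}
_DECODE = {v: k for k, v in _ENCODE.items()}

def pack_trits(trits):
    """Pack a trit sequence into bytes, 4 trits per byte (2 bits each)."""
    out = bytearray()
    for i in range(0, len(trits), 4):
        byte = 0
        for j, t in enumerate(trits[i:i + 4]):
            byte |= _ENCODE[int(t)] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack_trits(data, n):
    """Recover the first n trits from packed bytes."""
    return [_DECODE[(b >> (2 * j)) & 0b11] for b in data for j in range(4)][:n]

w = [-1, 0, 1, 1, 0, -1, 1, 0]
packed = pack_trits(w)
assert len(packed) == 2            # 8 trits -> 2 bytes
assert unpack_trits(packed, 8) == w
```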
References

Standing on the shoulders of giants

Algorithm Verification

All algorithms are verified against their paper specifications:

# Run algorithm verification suite
python3 tests/test_algorithms_pynq.py

# Or run on PYNQ hardware
jupyter notebook drivers/pynq/notebooks/04_algorithm_verification.ipynb