Trittask

trit reconfigurable accelerator

Trittask is an FPGA accelerator powered by TRUs (Trit Reconfigurable Units) that runs neural networks without hardware multipliers. Using ternary weights (-1, 0, +1), inference becomes pure addition and subtraction, enabling 16x memory compression (vs. FP32) on boards starting at $25.

(cons 'compress (cons 'score (cons 'mix (cons 'adapt nil))))

Zero multipliers · 16x compression · On-device training
25 TOPS (Z7020)
5.0 TOPS/W
$150 PYNQ-Z2

Get Your Trittask Kit

Complete hardware + software bundles ready to deploy

EVALUATION

PYNQ-Z2 Eval Kit

128 TRUs 25 TOPS 5W
  • PYNQ-Z2 development board
  • Trittask software license
  • Pre-loaded bitstream
  • Getting started guide
  • Email support (90 days)
$599
OPEN SOURCE

ECP5-85F Eval Kit

64 TRUs 12 TOPS 3W
  • ULX3S ECP5-85F board
  • Trittask software license
  • Yosys/nextpnr toolchain
  • Open-source bitstream
  • Email support (90 days)
$699
ENTERPRISE

Service & Support

24/7 Support SLA Monthly
  • 24/7 priority technical support
  • Dedicated support engineer
  • Custom development hours (4/mo)
  • 99.9% SLA guarantee
  • Architecture consulting
  • Private Slack/Discord channel
  • Quarterly roadmap reviews
$999/month

All kits ship within 5-7 business days. Hardware sourced from authorized distributors. Contact us for volume pricing.

The Problem

GPUs are overkill for edge AI

Running AI on edge devices shouldn't require 450W GPUs or cloud connections. Trittask solves this by changing the math—not the hardware budget.

Memory wall

Moving FP32 weights from memory consumes 100x more energy than the compute itself. Trittask uses 2-bit weights—16x less data to move.

Multiplier cost

Floating-point multipliers dominate chip area and power. Trittask eliminates them entirely—ternary weights mean add/subtract only.

No training on edge

Most accelerators are inference-only. Trittask includes LoRA training engines so models can adapt on-device without cloud round-trips.

Architecture

How Trittask eliminates multipliers

Traditional neural networks compute Y = W × X (multiply-accumulate). Trittask quantizes weights to {-1, 0, +1}, so multiplication becomes: add if +1, subtract if -1, skip if 0.

(cons 'prompt 'generated-code) → Transformer Self-Attention
(car . prompt)
Write assembly for a transformer
self-attention block using the
Trittask overlay ISA.

Target: FPGA with ternary weights
(-1, 0, +1). NO multipliers.

Available instructions:
- LOAD_WEIGHTS: Load ternary matrix
- MATMUL: Ternary multiply (add/sub)
- SETMODE: MODE_SANTA or MODE_ECO
- ATTN_QK: Attention scores
- SOFTMAX: 4-pass stable softmax
- ATTN_SV: Apply attention to V
- VADD: Vector add (residual)
- LAYERNORM: Bit-shift RMSNorm

Generate assembly that:
1. Projects hidden → Q, K, V
2. Computes multiplier-free attention
3. Applies softmax
4. Adds residual + LayerNorm
(cdr . generated-code)
;; Q, K, V projections (no multipliers)
(load-weights W_q)
(matmul Q W_q hidden)
(load-weights W_k)
(matmul K W_k hidden)
(load-weights W_v)
(matmul V W_v hidden)

;; Attention (SANTA: stochastic)
;; Or use ATTN_LINEAR for Q(K^TV) path
(setmode MODE_SANTA)
(attn-qk scores Q K)
(softmax attn scores)  ; PWA 4-segment
(attn-sv out attn V)

;; Output projection + residual
(load-weights W_o)
(matmul proj W_o out)
(vadd res hidden proj)
(layernorm normed res)
;; MODE_SANTA → stochastic attention (rand < weight ? add : skip)
;; MODE_ECO → L1-distance attention (|Q - K| instead of Q · K)
;; All ops use (op dst src1 src2) register format

Ternary MAC operation

;; Traditional: needs multiplier
(mul result weight activation)

;; Trittask TRU: no multiplier needed
(cond
  ((= weight +1) (add acc activation))
  ((= weight -1) (sub acc activation))
  ((= weight  0) acc))  ; skip

Weight encoding

;; 2-bit ternary encoding
(define trit-encoding
  '((#b00  0)    ; zero → skip
    (#b01 +1)    ; positive → add
    (#b10 -1)    ; negative → subtract
    (#b11 nil))) ; reserved

;; 16 weights per 32-bit word
;; 768×768 layer: 147KB vs 2.4MB FP32

Component Purpose Multipliers
TRU (Trit Reconfigurable Unit) Core inference: ternary MAC, attention, activations 0
TRU Array 128 TRUs with 4-way systolic read paths 0
SANTA Attention Stochastic attention without QK multiply 0
EcoTransformer L1-distance attention (|Q-K| instead of Q·K) 0
Linear Attention Q(KᵀV) path for long sequences (8x ops reduction) 0
PWA Softmax 4-segment exp() + reciprocal (streaming/buffered) 0
LoRA Training On-device fine-tuning (uses DSP for gradients) 8-64 DSPs

4-Way Systolic Read

128 TRUs in 4 banks with parallel shift chains. Max 32-cycle latency, 4x read throughput.

Overlay PE

Programmable 5-stage pipeline with custom ISA for transformer flows.

BRAM Double-Buffer

Ping-pong prefetch achieves 99.8% memory efficiency with only 1 stall cycle.

Weight Cache

256KB-2MB on-chip cache with prefetch to hide memory latency.

PWA Activation

Piecewise-affine GELU/SiLU approximation without exponentials.

Matrix Chain Optimizer

O(n³) DP compiler optimization. Linear attention mode provides up to 8x ops reduction for long sequences.
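The optimizer itself is the textbook matrix-chain dynamic program; a compact sketch (our code, not the shipped compiler):

```python
# O(n^3) matrix-chain DP: dims [d0, d1, ..., dn] describe n matrices,
# where matrix i has shape dims[i] x dims[i+1]. Returns the minimal
# scalar-op count over all parenthesizations.
def matrix_chain_order(dims):
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):              # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

For linear attention with sequence length 512 and head dim 64 (dims [512, 64, 512, 64]), the DP chooses the Q(KᵀV) order at ~4.2M ops versus ~33.6M for (QKᵀ)V, the 8x reduction quoted above.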

PWA Softmax

4-segment exp() approximation with direct reciprocal. ~6% avg error, ~17% worst-case. Buffered or streaming modes.
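The approach can be illustrated in Python; the segment breakpoints below are our own choice, not the shipped tables, so the error figures will differ:

```python
import math

# 4-segment piecewise-affine exp() over [-8, 0], the range a
# max-subtracted softmax needs. Each segment interpolates exp() at its
# endpoints, so values at segment boundaries are exact.
SEGS = [(-8.0, -4.0), (-4.0, -2.0), (-2.0, -1.0), (-1.0, 0.0)]
TABLE = []
for a, b in SEGS:
    m = (math.exp(b) - math.exp(a)) / (b - a)   # slope
    TABLE.append((m, math.exp(a) - a * m))      # (slope, intercept)

def pwa_exp(x):
    x = max(x, SEGS[0][0])                # clamp: exp(x) ≈ 0 below -8
    for (a, b), (m, c) in zip(SEGS, TABLE):
        if x <= b:
            return m * x + c
    return 1.0                            # x > 0 never occurs after max-subtraction

def pwa_softmax(xs):
    mx = max(xs)                          # subtract max for stability
    e = [pwa_exp(x - mx) for x in xs]
    s = sum(e)
    return [v / s for v in e]             # hardware uses a direct reciprocal
```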

Digital Twins

Bit-accurate Python simulation matching RTL exactly. Verify algorithms without hardware.
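In the same spirit, the TRU MAC lane reduces to a few lines of integer-only Python (an illustrative twin, not the shipped simulator):

```python
# Ternary dot product exactly as the add/sub/skip rule describes it:
# add on +1, subtract on -1, skip on 0. No multiplier anywhere.
def tru_dot(trits, activations):
    acc = 0
    for w, x in zip(trits, activations):
        if w == +1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0: skipped entirely (no compute, no energy)
    return acc
```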

Performance

Edge-optimized efficiency

Trittask targets cost and power efficiency for edge deployment. Ternary weights eliminate multipliers and reduce memory bandwidth 16x (vs FP32).

Trittask configurations (projected)

Platform TOPS* Power TOPS/W
EBAZ4205 ($25) 3-5 3W 1.0-1.7
PYNQ-Z2 ($150) 25 5W 5.0
ZCU104 ($1000) 50 10W 5.0
ZCU102 ($3000) 100 15W 6.7

Speedup vs Python

Operation Python Trittask
Ternary dot (8-elem) 0.44 µs 0.01 µs (44x)
Ternary matmul 768² 500 ms 0.05 ms (10000x)
LoRA forward 2 ms 0.02 ms (100x)
LoRA backward 5 ms 0.1 ms (50x)

*TOPS projected from RTL simulation @ target frequency. Actual hardware benchmarks in progress.

EBAZ4205 ($25)

16-48 TRUs, 3-5 TOPS, 3W. Ultra-low-cost for battery-powered IoT deployments.

PYNQ-Z2 ($150)

128 TRUs with 4-way systolic, ~25 TOPS, 5W. Best TOPS/$ with on-device training.

ZCU102 ($3000)

512+ TRUs, ~100 TOPS, 15W. Maximum throughput for production workloads.

Modular Feature Configuration

Configure Trittask for different FPGA targets with compile-time feature selection:

Preset Target Features
PRESET_MINIMAL Tang Nano 9K, iCE40 Ternary MAC only
PRESET_INFERENCE_ONLY ECP5, PYNQ-Z2 MAC + SANTA + ECO + PWA
PRESET_TRAINING_FOCUS ZCU102, ZCU104 Full LoRA/QLoRA/DoRA training

Competitive Landscape

How Trittask compares

Trittask is the only sub-$200 edge AI platform with hardware-accelerated on-device training. Compare against NVIDIA Jetson, Google Coral, and Hailo.

Platform TOPS Power Price Training
Trittask PYNQ-Z2 25 5W $150 LoRA/QLoRA/DoRA
Trittask ZCU102 100 15W $3,000 LoRA/QLoRA/DoRA
Google Coral Edge TPU 4 2W $25 None
Hailo-8L (RPi AI Kit) 13 3W $70 None
Hailo-8 26 8W $200 None
Jetson Orin Nano Super 67 25W $249 Full GPU
Jetson AGX Orin 64GB 275 60W $1,599 Full GPU

Note: TOPS figures are not directly comparable across architectures. Trittask uses ternary operations; Jetson uses INT8 sparse tensor ops; Coral/Hailo use INT8 dense.

BitNet Research Results

From Microsoft Research (Ma et al., 2024):

Metric BitNet vs FP16
Memory usage 3.55x less
Inference speed 2.71x faster
Energy (70B scale) 71.4x less

Total Cost of Ownership (1 year)

24/7 operation at $0.12/kWh:

Platform Hardware Total
Trittask PYNQ-Z2 $150 $155
Jetson Orin Nano $249 $275
Jetson AGX Orin $1,599 $1,662
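The rows above are hardware price plus one year of electricity; a quick reproduction (helper name is ours):

```python
# 1-year TCO: hardware + 24/7 energy at $0.12/kWh.
RATE_USD_PER_KWH = 0.12
HOURS_PER_YEAR = 24 * 365

def tco(hardware_usd, watts):
    energy_usd = watts / 1000 * HOURS_PER_YEAR * RATE_USD_PER_KWH
    return round(hardware_usd + energy_usd)

# tco(150, 5) → 155 (PYNQ-Z2), tco(249, 25) → 275 (Orin Nano)
```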

vs Jetson

Trittask achieves comparable training throughput per watt at 1/10th the hardware cost. Jetson wins on raw TOPS but requires 5-20x more power.

vs Coral/Hailo

Coral and Hailo offer mature inference-only solutions. Trittask adds on-device training and full FPGA reconfigurability.

When to choose Trittask

Ultra-low cost ($25-50), on-device training required, power-critical (3W), or custom architectures needed.

On-Device Training

Fine-tune TinyLlama on the edge

Trittask includes hardware LoRA training engines. Fine-tune TinyLlama-1.1B directly on the edge device—no GPU server, no data upload, no round-trip latency.

TinyLlama Spec Value Training Metric Z7020 Performance
Parameters 1.1B Samples/hour 700-1200
Layers 22 Step time 3-5 seconds
Hidden size 2048 LoRA memory 16-64 MB
Base weights (ternary) ~275 MB Total DDR3 used ~350 MB
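The base-weight figure is direct arithmetic, 1.1B parameters at 2 bits each:

```python
# 1.1B ternary weights at 2 bits/weight, expressed in MB
params = 1.1e9
base_weights_mb = params * 2 / 8 / 1e6  # 275.0 MB, matching the table
```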

LoRA

Low-rank adaptation with rank 8-64. Adam optimizer in hardware. Layer-wise management for 22 transformer layers.

QLoRA

4-bit NF4 quantized base weights with full-precision adapters. 4x more memory efficient.

Streaming LoRA

Double-buffered training during inference. Atomic bank swap—zero downtime adaptation.
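The swap can be pictured as a two-bank ping-pong (a toy model; class and method names are ours, not the driver API):

```python
# Streaming-LoRA ping-pong: inference reads the active bank while
# training writes the shadow bank; the swap is one index flip.
class LoraBanks:
    def __init__(self, weights):
        self.banks = [list(weights), list(weights)]
        self.active = 0                     # bank the inference path reads

    def read(self):                         # inference path
        return self.banks[self.active]

    def write_shadow(self, new_weights):    # training path
        self.banks[1 - self.active] = list(new_weights)

    def swap(self):                         # atomic flip → zero downtime
        self.active = 1 - self.active
```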

TinyLlama Training API

from trittask import Trittask, TinyLlamaTrainer, TrainingConfig

accel = Trittask("trittask.bit")
trainer = TinyLlamaTrainer(
    accelerator=accel,
    training_config=TrainingConfig(
        lora_rank=16,
        learning_rate=1e-4,
    )
)

# Add your domain-specific training data
trainer.add_training_data([
    ("What is error E-47?", "Replace the filter cartridge."),
    ("Machine vibration high", "Check bearing alignment."),
])

# Train on-device
for epoch in range(10):
    metrics = trainer.train_epoch()
    print(f"Epoch {epoch}: loss={metrics['loss']:.4f}")

# Save and use
trainer.save_checkpoint("my_domain_lora")

Use cases

Use Case Samples Time
Domain vocabulary 100-500 1-2 hours
Task specialization 500-2000 4-8 hours
Personalization 50-200 30-60 min
Continuous learning Streaming Ongoing

Perfect for: equipment manuals, error code explanations, company terminology, user preferences, style adaptation.

Model Support

Beyond Transformers

Trittask supports any architecture that can be expressed with ternary weights. The overlay ISA handles network-specific operations.

Architecture Inference Training Key Operations
Transformer (LLM, ViT) Native LoRA MATMUL, ATTENTION, SOFTMAX, LAYERNORM
CNN (ResNet, EfficientNet) Compiler* LoRA LOOP + MATMUL + POOL (shift-add)
LSTM / GRU / RNN Compiler* LoRA MATMUL + PWA (sigmoid/tanh) + VMUL
MLP Native LoRA MATMUL, RELU

Native: Direct hardware support. Compiler*: Decomposed to ternary primitives—verified via RTL simulation (tb_cnn_ops.sv, tb_rnn_ops.sv).

Vision Transformer (ViT)

from trittask import Trittask, ViTTrainer, VIT_SMALL

accel = Trittask('trittask.bit')
trainer = ViTTrainer(
    accelerator=accel,
    config=VIT_SMALL,
)

# Load UC Merced aerial dataset (21 classes)
trainer.add_training_directory('datasets/uc_merced/train')

# Fine-tune with LoRA
for epoch in range(10):
    metrics = trainer.train_epoch()
    print(f"Acc: {metrics['accuracy']:.2%}")

# Classify aerial imagery
predicted_class, probs = trainer.predict(drone_image)

ViT Performance (Z7020)

Model Params Images/hr
ViT-Tiny 5.7M ~1200
ViT-Small 22M ~600
ViT-Base 86M ~300

Use cases: drone/satellite imagery, product QC, defect detection, medical imaging, aerial scene classification.

UC Merced Dataset

21-class aerial scene classification dataset. Includes: agricultural, airplane, runway, harbor, buildings, forest, and more. Download with included script.

Drone Integration

Process drone imagery on-device. Classify terrain, detect objects, monitor infrastructure—all without cloud connectivity.

Edge Deployment

Fine-tune ViT directly on edge hardware. Adapt to new classes without retraining from scratch. LoRA enables rapid domain adaptation.

Model translation

from trittask.models import translate_model, print_isa_summary

# Translate any PyTorch module to the overlay ISA
translated = translate_model(pytorch_model)

# Or pass an explicit architecture hint (e.g. for CNNs)
translated = translate_model(resnet18, model_type="cnn")

print_isa_summary(translated)