Trittask

trit reconfigurable accelerator

Trittask is an FPGA accelerator powered by TRUs (Trit Reconfigurable Units) that runs neural networks without hardware multipliers. Using ternary weights (-1, 0, +1), inference becomes pure addition and subtraction, enabling 16x memory compression (vs. FP32) on boards starting at $25.

(cons 'compress (cons 'score (cons 'mix (cons 'adapt nil))))

Zero multipliers · 16x compression · On-device training
25 TOPS (Z7020)
5.0 TOPS/W
$150 PYNQ-Z2

Get Your Trittask Kit

Complete hardware + software bundles ready to deploy

EVALUATION

PYNQ-Z2 Eval Kit

128 TRUs 25 TOPS 5W
  • PYNQ-Z2 development board
  • Trittask software license
  • Pre-loaded bitstream
  • Getting started guide
  • Email support (90 days)
$599
OPEN SOURCE

ECP5-85F Eval Kit

64 TRUs 12 TOPS 3W
  • ULX3S ECP5-85F board
  • Trittask software license
  • Yosys/nextpnr toolchain
  • Open-source bitstream
  • Email support (90 days)
$699
ENTERPRISE

Service & Support

24/7 Support SLA Monthly
  • 24/7 priority technical support
  • Dedicated support engineer
  • Custom development hours (4/mo)
  • 99.9% SLA guarantee
  • Architecture consulting
  • Private Slack/Discord channel
  • Quarterly roadmap reviews
$999/month

All kits ship within 5-7 business days. Hardware sourced from authorized distributors. Contact us for volume pricing.

The Problem

GPUs are overkill for edge AI

Running AI on edge devices shouldn't require 450W GPUs or cloud connections. Trittask solves this by changing the math—not the hardware budget.

Memory wall

Moving FP32 weights from memory consumes 100x more energy than the compute itself. Trittask uses 2-bit weights—16x less data to move.

Multiplier cost

Floating-point multipliers dominate chip area and power. Trittask eliminates them entirely—ternary weights mean add/subtract only.

No training on edge

Most accelerators are inference-only. Trittask includes LoRA training engines so models can adapt on-device without cloud round-trips.

Architecture

How Trittask eliminates multipliers

Traditional neural networks compute Y = W × X (multiply-accumulate). Trittask quantizes weights to {-1, 0, +1}, so multiplication becomes: add if +1, subtract if -1, skip if 0.

(cons 'prompt 'generated-code) → Transformer Self-Attention
(car . prompt)
Write assembly for a transformer
self-attention block using the
Trittask overlay ISA.

Target: FPGA with ternary weights
(-1, 0, +1). NO multipliers.

Available instructions:
- LOAD_WEIGHTS: Load ternary matrix
- MATMUL: Ternary multiply (add/sub)
- SETMODE: MODE_SANTA or MODE_ECO
- ATTN_QK: Attention scores
- SOFTMAX: 4-pass stable softmax
- ATTN_SV: Apply attention to V
- VADD: Vector add (residual)
- LAYERNORM: Bit-shift RMSNorm

Generate assembly that:
1. Projects hidden → Q, K, V
2. Computes multiplier-free attention
3. Applies softmax
4. Adds residual + LayerNorm
(cdr . generated-code)
;; Q, K, V projections (no multipliers)
(load-weights W_q)
(matmul Q W_q hidden)
(load-weights W_k)
(matmul K W_k hidden)
(load-weights W_v)
(matmul V W_v hidden)

;; Attention (SANTA: stochastic)
;; Or use ATTN_LINEAR for Q(K^TV) path
(setmode MODE_SANTA)
(attn-qk scores Q K)
(softmax attn scores)  ; PWA 4-segment
(attn-sv out attn V)

;; Output projection + residual
(load-weights W_o)
(matmul proj W_o out)
(vadd res hidden proj)
(layernorm normed res)
;; MODE_SANTA → stochastic attention (rand < weight ? add : skip)
;; MODE_ECO → L1-distance attention (|Q - K| instead of Q · K)
;; All ops use (op dst src1 src2) register format

Ternary MAC operation

;; Traditional: needs multiplier
(mul result weight activation)

;; Trittask TRU: no multiplier needed
(cond
  ((= weight +1) (add acc activation))
  ((= weight -1) (sub acc activation))
  ((= weight  0) acc))  ; skip

Weight encoding

;; 2-bit ternary encoding
(define trit-encoding
  '((#b00  0)    ; zero → skip
    (#b01 +1)    ; positive → add
    (#b10 -1)    ; negative → subtract
    (#b11 nil))) ; reserved

;; 16 weights per 32-bit word
;; 768×768 layer: 147KB vs 2.4MB FP32

Component Purpose Multipliers
TRU (Trit Reconfigurable Unit) Core inference: ternary MAC, attention, activations 0
TRU Array 128 TRUs with 4-way systolic read paths 0
SANTA Attention Stochastic attention without QK multiply 0
EcoTransformer L1-distance attention (|Q-K| instead of Q·K) 0
Linear Attention Q(KᵀV) path for long sequences (8x ops reduction) 0
PWA Softmax 4-segment exp() + reciprocal (streaming/buffered) 0
LoRA Training On-device fine-tuning (uses DSP for gradients) 8-64 DSPs

4-Way Systolic Read

128 TRUs in 4 banks with parallel shift chains. Max 32-cycle latency, 4x read throughput.

Overlay PE

Programmable 5-stage pipeline with custom ISA for transformer flows.

BRAM Double-Buffer

Ping-pong prefetch achieves 99.8% memory efficiency with only 1 stall cycle.

Weight Cache

256KB-2MB on-chip cache with prefetch to hide memory latency.

PWA Activation

Piecewise-affine GELU/SiLU approximation without exponentials.

Matrix Chain Optimizer

O(n³) DP compiler optimization. Linear attention mode provides up to 8x ops reduction for long sequences.
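The optimizer itself is the textbook matrix-chain dynamic program; a compact sketch (our code, not the shipped compiler):

```python
# O(n^3) matrix-chain DP: dims [d0, d1, ..., dn] describe n matrices,
# where matrix i has shape dims[i] x dims[i+1]. Returns the minimal
# scalar-op count over all parenthesizations.
def matrix_chain_order(dims):
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):              # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

For linear attention with sequence length 512 and head dim 64 (dims [512, 64, 512, 64]), the DP chooses the Q(KᵀV) order at ~4.2M ops versus ~33.6M for (QKᵀ)V, the 8x reduction quoted above.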

PWA Softmax

4-segment exp() approximation with direct reciprocal. ~6% avg error, ~17% worst-case. Buffered or streaming modes.
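The approach can be illustrated in Python; the segment breakpoints below are our own choice, not the shipped tables, so the error figures will differ:

```python
import math

# 4-segment piecewise-affine exp() over [-8, 0], the range a
# max-subtracted softmax needs. Each segment interpolates exp() at its
# endpoints, so values at segment boundaries are exact.
SEGS = [(-8.0, -4.0), (-4.0, -2.0), (-2.0, -1.0), (-1.0, 0.0)]
TABLE = []
for a, b in SEGS:
    m = (math.exp(b) - math.exp(a)) / (b - a)   # slope
    TABLE.append((m, math.exp(a) - a * m))      # (slope, intercept)

def pwa_exp(x):
    x = max(x, SEGS[0][0])                # clamp: exp(x) ≈ 0 below -8
    for (a, b), (m, c) in zip(SEGS, TABLE):
        if x <= b:
            return m * x + c
    return 1.0                            # x > 0 never occurs after max-subtraction

def pwa_softmax(xs):
    mx = max(xs)                          # subtract max for stability
    e = [pwa_exp(x - mx) for x in xs]
    s = sum(e)
    return [v / s for v in e]             # hardware uses a direct reciprocal
```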

Digital Twins

Bit-accurate Python simulation matching RTL exactly. Verify algorithms without hardware.
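In the same spirit, the TRU MAC lane reduces to a few lines of integer-only Python (an illustrative twin, not the shipped simulator):

```python
# Ternary dot product exactly as the add/sub/skip rule describes it:
# add on +1, subtract on -1, skip on 0. No multiplier anywhere.
def tru_dot(trits, activations):
    acc = 0
    for w, x in zip(trits, activations):
        if w == +1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0: skipped entirely (no compute, no energy)
    return acc
```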

Performance

Edge-optimized efficiency

Trittask targets cost and power efficiency for edge deployment. Ternary weights eliminate multipliers and reduce memory bandwidth 16x (vs FP32).

Trittask configurations (projected)

Platform TOPS* Power TOPS/W
EBAZ4205 ($25) 3-5 3W 1.0-1.7
PYNQ-Z2 ($150) 25 5W 5.0
ZCU104 ($1000) 50 10W 5.0
ZCU102 ($3000) 100 15W 6.7

Speedup vs Python

Operation Python Trittask
Ternary dot (8-elem) 0.44 µs 0.01 µs (44x)
Ternary matmul 768² 500 ms 0.05 ms (10000x)
LoRA forward 2 ms 0.02 ms (100x)
LoRA backward 5 ms 0.1 ms (50x)

*TOPS projected from RTL simulation @ target frequency. Actual hardware benchmarks in progress.

EBAZ4205 ($25)

16-48 TRUs, 3-5 TOPS, 3W. Ultra-low-cost for battery-powered IoT deployments.

PYNQ-Z2 ($150)

128 TRUs with 4-way systolic, ~25 TOPS, 5W. Best TOPS/$ with on-device training.

ZCU102 ($3000)

512+ TRUs, ~100 TOPS, 15W. Maximum throughput for production workloads.

Modular Feature Configuration

Configure Trittask for different FPGA targets with compile-time feature selection:

Preset Target Features
PRESET_MINIMAL Tang Nano 9K, iCE40 Ternary MAC only
PRESET_INFERENCE_ONLY ECP5, PYNQ-Z2 MAC + SANTA + ECO + PWA
PRESET_TRAINING_FOCUS ZCU102, ZCU104 Full LoRA/QLoRA/DoRA training

Competitive Landscape

How Trittask compares

Trittask is the only sub-$200 edge AI platform with hardware-accelerated on-device training. Compare against NVIDIA Jetson, Google Coral, and Hailo.

Platform TOPS Power Price Training
Trittask PYNQ-Z2 25 5W $150 LoRA/QLoRA/DoRA
Trittask ZCU102 100 15W $3,000 LoRA/QLoRA/DoRA
Google Coral Edge TPU 4 2W $25 None
Hailo-8L (RPi AI Kit) 13 3W $70 None
Hailo-8 26 8W $200 None
Jetson Orin Nano Super 67 25W $249 Full GPU
Jetson AGX Orin 64GB 275 60W $1,599 Full GPU

Note: TOPS figures are not directly comparable across architectures. Trittask uses ternary operations; Jetson uses INT8 sparse tensor ops; Coral/Hailo use INT8 dense.

BitNet Research Results

From Microsoft Research (Ma et al., 2024):

Metric BitNet vs FP16
Memory usage 3.55x less
Inference speed 2.71x faster
Energy (70B scale) 71.4x less

Total Cost of Ownership (1 year)

24/7 operation at $0.12/kWh:

Platform Hardware Total
Trittask PYNQ-Z2 $150 $155
Jetson Orin Nano $249 $275
Jetson AGX Orin $1,599 $1,662
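The rows above are hardware price plus one year of electricity; a quick reproduction (helper name is ours):

```python
# 1-year TCO: hardware + 24/7 energy at $0.12/kWh.
RATE_USD_PER_KWH = 0.12
HOURS_PER_YEAR = 24 * 365

def tco(hardware_usd, watts):
    energy_usd = watts / 1000 * HOURS_PER_YEAR * RATE_USD_PER_KWH
    return round(hardware_usd + energy_usd)

# tco(150, 5) → 155 (PYNQ-Z2), tco(249, 25) → 275 (Orin Nano)
```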

vs Jetson

Trittask achieves comparable training throughput per watt at 1/10th the hardware cost. Jetson wins on raw TOPS but requires 5-20x more power.

vs Coral/Hailo

Coral and Hailo offer mature inference-only solutions. Trittask adds on-device training and full FPGA reconfigurability.

When to choose Trittask

Ultra-low cost ($25-50), on-device training required, power-critical (3W), or custom architectures needed.

On-Device Training

Fine-tune TinyLlama on the edge

Trittask includes hardware LoRA training engines. Fine-tune TinyLlama-1.1B directly on the edge device—no GPU server, no data upload, no round-trip latency.

TinyLlama Spec Value Training Metric Z7020 Performance
Parameters 1.1B Samples/hour 700-1200
Layers 22 Step time 3-5 seconds
Hidden size 2048 LoRA memory 16-64 MB
Base weights (ternary) ~275 MB Total DDR3 used ~350 MB
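The base-weight figure is direct arithmetic, 1.1B parameters at 2 bits each:

```python
# 1.1B ternary weights at 2 bits/weight, expressed in MB
params = 1.1e9
base_weights_mb = params * 2 / 8 / 1e6  # 275.0 MB, matching the table
```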

LoRA

Low-rank adaptation with rank 8-64. Adam optimizer in hardware. Layer-wise management for 22 transformer layers.

QLoRA

4-bit NF4 quantized base weights with full-precision adapters. 4x more memory efficient.

Streaming LoRA

Double-buffered training during inference. Atomic bank swap—zero downtime adaptation.
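The swap can be pictured as a two-bank ping-pong (a toy model; class and method names are ours, not the driver API):

```python
# Streaming-LoRA ping-pong: inference reads the active bank while
# training writes the shadow bank; the swap is one index flip.
class LoraBanks:
    def __init__(self, weights):
        self.banks = [list(weights), list(weights)]
        self.active = 0                     # bank the inference path reads

    def read(self):                         # inference path
        return self.banks[self.active]

    def write_shadow(self, new_weights):    # training path
        self.banks[1 - self.active] = list(new_weights)

    def swap(self):                         # atomic flip → zero downtime
        self.active = 1 - self.active
```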

TinyLlama Training API

from trittask import Trittask, TinyLlamaTrainer, TrainingConfig

accel = Trittask("trittask.bit")
trainer = TinyLlamaTrainer(
    accelerator=accel,
    training_config=TrainingConfig(
        lora_rank=16,
        learning_rate=1e-4,
    )
)

# Add your domain-specific training data
trainer.add_training_data([
    ("What is error E-47?", "Replace the filter cartridge."),
    ("Machine vibration high", "Check bearing alignment."),
])

# Train on-device
for epoch in range(10):
    metrics = trainer.train_epoch()
    print(f"Epoch {epoch}: loss={metrics['loss']:.4f}")

# Save and use
trainer.save_checkpoint("my_domain_lora")

Use cases

Use Case Samples Time
Domain vocabulary 100-500 1-2 hours
Task specialization 500-2000 4-8 hours
Personalization 50-200 30-60 min
Continuous learning Streaming Ongoing

Perfect for: equipment manuals, error code explanations, company terminology, user preferences, style adaptation.

Model Support

Beyond Transformers

Trittask supports any architecture that can be expressed with ternary weights. The overlay ISA handles network-specific operations.

Architecture Inference Training Key Operations
Transformer (LLM, ViT) Native LoRA MATMUL, ATTENTION, SOFTMAX, LAYERNORM
CNN (ResNet, EfficientNet) Compiler* LoRA LOOP + MATMUL + POOL (shift-add)
LSTM / GRU / RNN Compiler* LoRA MATMUL + PWA (sigmoid/tanh) + VMUL
MLP Native LoRA MATMUL, RELU

Native: Direct hardware support. Compiler*: Decomposed to ternary primitives—verified via RTL simulation (tb_cnn_ops.sv, tb_rnn_ops.sv).

Vision Transformer (ViT)

from trittask import Trittask, ViTTrainer, VIT_SMALL

accel = Trittask('trittask.bit')
trainer = ViTTrainer(
    accelerator=accel,
    config=VIT_SMALL,
)

# Load UC Merced aerial dataset (21 classes)
trainer.add_training_directory('datasets/uc_merced/train')

# Fine-tune with LoRA
for epoch in range(10):
    metrics = trainer.train_epoch()
    print(f"Acc: {metrics['accuracy']:.2%}")

# Classify aerial imagery
predicted_class, probs = trainer.predict(drone_image)

ViT Performance (Z7020)

Model Params Images/hr
ViT-Tiny 5.7M ~1200
ViT-Small 22M ~600
ViT-Base 86M ~300

Use cases: drone/satellite imagery, product QC, defect detection, medical imaging, aerial scene classification.

UC Merced Dataset

21-class aerial scene classification dataset. Includes: agricultural, airplane, runway, harbor, buildings, forest, and more. Download with included script.

Drone Integration

Process drone imagery on-device. Classify terrain, detect objects, monitor infrastructure—all without cloud connectivity.

Edge Deployment

Fine-tune ViT directly on edge hardware. Adapt to new classes without retraining from scratch. LoRA enables rapid domain adaptation.

Model translation

from trittask.models import translate_model, print_isa_summary

# Translate any PyTorch module to the overlay ISA
translated = translate_model(pytorch_model)

# Or pass an explicit architecture hint (e.g. for CNNs)
translated = translate_model(resnet18, model_type="cnn")

print_isa_summary(translated)