Trittask is an FPGA accelerator powered by TRUs (Trit Reconfigurable Units) that runs neural networks without hardware multipliers. Using ternary weights (-1, 0, +1), inference becomes pure addition and subtraction—enabling 16x memory compression on $50 hardware.
Complete hardware + software bundles ready to deploy
All kits ship within 5-7 business days. Hardware sourced from authorized distributors. Contact us for volume pricing.
Running AI on edge devices shouldn't require 450W GPUs or cloud connections. Trittask solves this by changing the math—not the hardware budget.
Moving FP32 weights from memory consumes 100x more energy than the compute itself. Trittask uses 2-bit weights—16x less data to move.
Floating-point multipliers dominate chip area and power. Trittask eliminates them entirely—ternary weights mean add/subtract only.
Most accelerators are inference-only. Trittask includes LoRA training engines so models can adapt on-device without cloud round-trips.
Traditional neural networks compute Y = W × X (multiply-accumulate). Trittask quantizes weights to {-1, 0, +1}, so multiplication becomes: add if +1, subtract if -1, skip if 0.
Write assembly for a transformer self-attention block using the Trittask overlay ISA. Target: FPGA with ternary weights (-1, 0, +1). NO multipliers.

Available instructions:
- LOAD_WEIGHTS: Load ternary matrix
- MATMUL: Ternary multiply (add/sub)
- SETMODE: MODE_SANTA or MODE_ECO
- ATTN_QK: Attention scores
- SOFTMAX: 4-pass stable softmax
- ATTN_SV: Apply attention to V
- VADD: Vector add (residual)
- LAYERNORM: Bit-shift RMSNorm

Generate assembly that:
1. Projects hidden → Q, K, V
2. Computes multiplier-free attention
3. Applies softmax
4. Adds residual + LayerNorm
;; Q, K, V projections (no multipliers)
(load-weights W_q)
(matmul Q W_q hidden)
(load-weights W_k)
(matmul K W_k hidden)
(load-weights W_v)
(matmul V W_v hidden)

;; Attention (SANTA: stochastic)
;; Or use ATTN_LINEAR for Q(K^T V) path
(setmode MODE_SANTA)
(attn-qk scores Q K)
(softmax attn scores)  ; PWA 4-segment
(attn-sv out attn V)

;; Output projection + residual
(load-weights W_o)
(matmul proj W_o out)
(vadd res hidden proj)
(layernorm normed res)
;; Traditional: needs multiplier
(mul result weight activation)

;; Trittask TRU: no multiplier needed
(cond ((= weight +1) (add acc activation))
      ((= weight -1) (sub acc activation))
      ((= weight 0) acc))  ; skip
;; 2-bit ternary encoding
(define trit-encoding
  ((#b00  0)     ; zero → skip
   (#b01 +1)     ; positive → add
   (#b10 -1)     ; negative → subtract
   (#b11 nil)))  ; reserved

;; 16 weights per 32-bit word
;; 768×768 layer: 147KB vs 2.4MB FP32
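The encoding above can be sketched in plain Python. This is an illustrative model, not the Trittask API: it packs 16 ternary weights into one 32-bit word and accumulates a dot product with add/subtract only, exactly the add-if-+1 / subtract-if--1 / skip-if-0 rule described earlier.

```python
# Illustrative sketch (not the Trittask API): 2-bit trit packing plus a
# multiplier-free dot product. Encoding matches the table above:
# 0b00 -> 0 (skip), 0b01 -> +1 (add), 0b10 -> -1 (subtract).

def pack_trits(weights):
    """Pack up to 16 ternary weights {-1, 0, +1} into one 32-bit word."""
    word = 0
    for i, w in enumerate(weights):
        bits = {0: 0b00, 1: 0b01, -1: 0b10}[w]
        word |= bits << (2 * i)
    return word

def ternary_dot(word, activations):
    """Multiplier-free accumulation: add, subtract, or skip per trit."""
    acc = 0
    for i, x in enumerate(activations):
        bits = (word >> (2 * i)) & 0b11
        if bits == 0b01:      # +1 -> add
            acc += x
        elif bits == 0b10:    # -1 -> subtract
            acc -= x
        # 0b00 -> zero weight: skipped entirely
    return acc

word = pack_trits([1, -1, 0, 1])
assert ternary_dot(word, [3, 5, 7, 2]) == 3 - 5 + 2  # == 0
```

At 2 bits per weight, a 768×768 layer is 768 × 768 × 2 / 8 = 147,456 bytes, matching the 147KB-vs-2.4MB figure above.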
128 TRUs in 4 banks with parallel shift chains. Max 32-cycle latency, 4x read throughput.
Programmable 5-stage pipeline with custom ISA for transformer flows.
Ping-pong prefetch achieves 99.8% memory efficiency with only 1 stall cycle.
256KB-2MB on-chip cache with prefetch to hide memory latency.
Piecewise-affine GELU/SiLU approximation without exponentials.
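A piecewise-affine activation reduces to a range compare plus one multiply-add per input. The sketch below hand-fits straight lines through exact GELU values at a few breakpoints; the segment count and coefficients are illustrative, not taken from the Trittask RTL.

```python
import math

# Illustrative piecewise-affine GELU (breakpoints chosen for this sketch,
# not from the hardware). Each segment is y = a*x + b: no exp(), no erf().
_SEGMENTS = [  # (upper_x, slope a, intercept b), lines through gelu(x)
    (-3.0,  0.0,     0.0),     # far negative: gelu(x) ~ 0
    (-1.0, -0.0775, -0.2365),  # through gelu(-3), gelu(-1)
    ( 0.0,  0.159,   0.0),     # through gelu(-1), gelu(0)
    ( 1.0,  0.841,   0.0),     # through gelu(0), gelu(1)
    ( 3.0,  1.0775, -0.2365),  # through gelu(1), gelu(3)
]

def pwa_gelu(x):
    """Piecewise-affine GELU: compare + multiply-add only."""
    for upper, a, b in _SEGMENTS:
        if x <= upper:
            return a * x + b
    return x  # for large positive x, gelu(x) ~ x

def exact_gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Crude 5-segment fit already stays within ~0.08 of exact GELU here.
for x in (-2.0, -0.5, 0.5, 2.0):
    assert abs(pwa_gelu(x) - exact_gelu(x)) < 0.15
```

More segments tighten the fit; in hardware, slopes are typically chosen so the multiply-add becomes shifts and adds.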
O(n³) dynamic-programming compiler optimization. Linear attention mode provides up to 8x ops reduction for long sequences.
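The savings from linear attention come from re-associating the matrix product: Q(KᵀV) keeps every intermediate d×d instead of materialising the n×n score matrix of (QKᵀ)V, cutting multiply-accumulates from O(n²d) to O(nd²). The shapes below are illustrative; real linear attention also swaps softmax for a kernel feature map, which is what makes the reordering legal.

```python
import numpy as np

n, d = 1024, 64                # sequence length, head dim (illustrative)
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Standard path: (Q @ K.T) @ V materialises an n x n score matrix -> O(n^2 d)
quadratic = (Q @ K.T) @ V

# Linear path: Q @ (K.T @ V) keeps the intermediate at d x d -> O(n d^2)
linear = Q @ (K.T @ V)

# Without softmax the two orderings are mathematically identical.
assert np.allclose(quadratic, linear)

# MAC-count ratio for this shape is roughly n/d.
print(n // d)  # -> 16 for this shape
```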
4-segment exp() approximation with direct reciprocal. ~6% avg error, ~17% worst-case. Buffered or streaming modes.
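A stable softmax subtracts the row max first, so exp() only ever sees inputs ≤ 0, which a small piecewise-affine table covers. The sketch below uses hand-fit 4-segment coefficients for illustration; the hardware's table, and its ~6%/~17% error figures, will differ.

```python
# Illustrative 4-segment exp() approximation feeding a stable softmax.
# Coefficients are hand-fit through exp(t) at t = 0, -1, -2, -4, -8;
# they are NOT the Trittask RTL coefficients.

_EXP_SEGMENTS = [  # (lower_t, slope a, intercept b), valid for t >= lower_t
    (-1.0, 0.6321,  1.0),
    (-2.0, 0.2326,  0.6005),
    (-4.0, 0.0585,  0.2523),
    (-8.0, 0.00449, 0.0363),
]

def approx_exp(t):
    """Piecewise-affine exp(t) for t <= 0 (after max-subtraction)."""
    for lower, a, b in _EXP_SEGMENTS:
        if t >= lower:
            return a * t + b
    return 0.0  # underflow region: treat as zero

def approx_softmax(xs):
    """Stable softmax: subtract max, PWA exp, one reciprocal, scale."""
    m = max(xs)
    exps = [approx_exp(x - m) for x in xs]
    inv_sum = 1.0 / sum(exps)   # the 'direct reciprocal' step
    return [e * inv_sum for e in exps]

probs = approx_softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[0] > probs[1] > probs[2]
```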
Bit-accurate Python simulation matching RTL exactly. Verify algorithms without hardware.
Trittask targets cost and power efficiency for edge deployment. Ternary weights eliminate multipliers and reduce memory bandwidth 16x (vs FP32).
*TOPS projected from RTL simulation @ target frequency. Actual hardware benchmarks in progress.
16-48 TRUs, 3-5 TOPS, 3W. Ultra-low-cost for battery-powered IoT deployments.
128 TRUs with 4-way systolic, ~25 TOPS, 5W. Best TOPS/$ with on-device training.
512+ TRUs, ~100 TOPS, 15W. Maximum throughput for production workloads.
Configure Trittask for different FPGA targets with compile-time feature selection:
Trittask is the only sub-$500 edge AI platform with hardware-accelerated on-device training. Compare against NVIDIA Jetson, Google Coral, and Hailo.
Note: TOPS figures are not directly comparable across architectures. Trittask uses ternary operations; Jetson uses INT8 sparse tensor ops; Coral/Hailo use INT8 dense.
From Microsoft Research (Ma et al., 2024):
24/7 operation at $0.12/kWh:
Trittask achieves comparable training throughput per watt at 1/10th the hardware cost. Jetson wins on raw TOPS but requires 5-20x more power.
Coral and Hailo offer mature inference-only solutions. Trittask adds on-device training and full FPGA reconfigurability.
Ultra-low cost ($25-50), on-device training required, power-critical (3W), or custom architectures needed.
Trittask includes hardware LoRA training engines. Fine-tune TinyLlama-1.1B directly on the edge device—no GPU server, no data upload, no round-trip latency.
Low-rank adaptation with rank 8-64. Adam optimizer in hardware. Layer-wise management for 22 transformer layers.
4-bit NF4 quantized base weights with full-precision adapters. 4x more memory efficient.
Double-buffered training during inference. Atomic bank swap—zero downtime adaptation.
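The LoRA mechanism behind these features is compact: the frozen base weight W is augmented by a low-rank product B·A, so only rank·(d_in + d_out) adapter parameters train instead of d_in·d_out. A minimal sketch, with toy dimensions chosen for illustration:

```python
import random

# Minimal LoRA forward sketch (illustrative, not the Trittask engine):
# y = W x + (alpha / r) * B (A x), with W frozen and only A, B trained.

d_in, d_out, rank = 8, 8, 2
alpha = 16.0                              # common LoRA scaling hyperparameter

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # B starts at zero: no initial drift

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x):
    base = matvec(W, x)                   # frozen base path (ternary in HW)
    update = matvec(B, matvec(A, x))      # low-rank adapter path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

x = [1.0] * d_in
# With B zero-initialised, the adapted output equals the base output.
assert lora_forward(x) == matvec(W, x)
```

At rank 2 this layer trains 2 × (8 + 8) = 32 adapter parameters instead of 64 base weights; at transformer scale the ratio is far larger, which is what makes per-layer on-device adaptation feasible.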
from trittask import Trittask, TinyLlamaTrainer, TrainingConfig

accel = Trittask("trittask.bit")

trainer = TinyLlamaTrainer(
    accelerator=accel,
    training_config=TrainingConfig(
        lora_rank=16,
        learning_rate=1e-4,
    )
)

# Add your domain-specific training data
trainer.add_training_data([
    ("What is error E-47?", "Replace the filter cartridge."),
    ("Machine vibration high", "Check bearing alignment."),
])

# Train on-device
for epoch in range(10):
    metrics = trainer.train_epoch()
    print(f"Epoch {epoch}: loss={metrics['loss']:.4f}")

# Save and use
trainer.save_checkpoint("my_domain_lora")
Perfect for: equipment manuals, error code explanations, company terminology, user preferences, style adaptation.
Trittask supports any architecture that can be expressed with ternary weights. The overlay ISA handles network-specific operations.
Native: Direct hardware support. Compiler*: Decomposed to ternary primitives—verified via RTL simulation (tb_cnn_ops.sv, tb_rnn_ops.sv).
from trittask import Trittask, ViTTrainer, VIT_SMALL

accel = Trittask('trittask.bit')

trainer = ViTTrainer(
    accelerator=accel,
    config=VIT_SMALL,
)

# Load UC Merced aerial dataset (21 classes)
trainer.add_training_directory('datasets/uc_merced/train')

# Fine-tune with LoRA
for epoch in range(10):
    metrics = trainer.train_epoch()
    print(f"Acc: {metrics['accuracy']:.2%}")

# Classify aerial imagery
predicted_class, probs = trainer.predict(drone_image)
Use cases: drone/satellite imagery, product QC, defect detection, medical imaging, aerial scene classification.
21-class aerial scene classification dataset. Includes: agricultural, airplane, runway, harbor, buildings, forest, and more. Download with included script.
Process drone imagery on-device. Classify terrain, detect objects, monitor infrastructure—all without cloud connectivity.
Fine-tune ViT directly on edge hardware. Adapt to new classes without retraining from scratch. LoRA enables rapid domain adaptation.
from trittask.models import translate_model, print_isa_summary

# Translate a PyTorch model to overlay ISA operations
translated = translate_model(pytorch_model)

# Or decompose a CNN to ternary primitives explicitly
translated = translate_model(resnet18, model_type="cnn")

print_isa_summary(translated)