
Agent Teams Report

· 10 min read

A survey of how individuals and teams are running multi-agent coding setups (Feb 2026).


1. Boris Cherny -- Creator of Claude Code, Head of Claude Code @ Anthropic

Scale: 10-15 concurrent sessions, 20-27 PRs/day, 100% AI-written code
Business: Employee at Anthropic. Claude Code ~$1B annualized revenue in 6 months.

  Boris (human)
|
|──> 5 terminal tabs (iTerm, OS notifications)
|──> 5-10 browser sessions (claude.ai/code)
|──> mobile sessions (fire-and-forget)
|
v
Each session = independent Claude Code instance
|
|── Model: Opus 4.5 + extended thinking (always)
|── CLAUDE.md: shared knowledge base (updated weekly)
|── Plan Mode first, then auto-accept
|
v
┌───────────────┐
│ PostToolUse │ <-- formatting hooks fix style drift
│ hooks │
└───────┬───────┘
|
v
┌───────────────┐
│ Verification │ <-- Chrome extension, agent self-tests
│ loops │
└───────┬───────┘
|
v
┌───────────────┐
│ PR │ <-- /commit-push-pr slash command
└───────────────┘

"Teleport" hands sessions between terminal ↔ browser ↔ mobile

Key practices:

  • CLAUDE.md (not AGENTS.md) as living knowledge base -- errors get documented so they never repeat
  • /permissions pre-allows safe bash commands
  • Subagents: code-simplifier, verify-app
  • 259 PRs in 30 days. 90% of Claude Code's own codebase written by Claude Code.

Full reference | Boris's Twitter thread


2. Claude Code Native Multi-Agent -- Four Layers

Status: Subagents + SDK stable, Agent Teams experimental
Business: Part of Claude Code ($200/mo Max plan, or API usage)

  Layer 1: SUBAGENTS (in-session)
────────────────────────────────
Parent agent
|
|── Task("Explore", "find all API routes") <-- Haiku, read-only
|── Task("code-reviewer", "review changes") <-- custom .claude/agents/*.md
|── Task("general", "refactor auth") <-- background, full tools
|
v
Results summarized back to parent
Subagents CANNOT talk to each other
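The Layer 1 constraint above -- subagents work in isolation and only summarize results back to the parent -- reduces to a plain fan-out/fan-in pattern. The sketch below is illustrative, not the real Claude Code API; the function names are hypothetical.

```python
import asyncio

# Illustrative sketch of Layer 1: the parent fans tasks out, each subagent
# works in isolation, and results are only summarized back to the parent.
# Subagents never see each other's state.

async def run_subagent(name: str, task: str) -> str:
    # Stand-in for a real subagent invocation (Explore, code-reviewer, ...).
    await asyncio.sleep(0)  # simulate async work
    return f"{name} finished: {task}"

async def parent(tasks: dict) -> list:
    # Fan out: one coroutine per subagent, no shared mutable state.
    results = await asyncio.gather(
        *(run_subagent(name, task) for name, task in tasks.items())
    )
    # Fan in: the parent is the only place results are combined.
    return list(results)

summaries = asyncio.run(parent({
    "Explore": "find all API routes",
    "code-reviewer": "review changes",
}))
```

Because the subagents cannot message each other, any cross-task coordination has to happen in the parent after the gather completes.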


Layer 2: AGENT TEAMS (cross-session, experimental)
───────────────────────────────────────────────────
CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
tmux -CC
|
v
┌──────────────┐
│ Team Lead │ <-- Opus, plans work, assigns tasks
└──────┬───────┘
|
┌────┼────┐
v v v
[T1] [T2] [T3] <-- Teammates (Sonnet/Haiku), each in tmux pane
| | |
v v v
Shared task list + mailbox system
Direct inter-agent messaging
Dependencies: task A blocks task B

Display: in-process OR tmux split panes
Quality gates: TeammateIdle, TaskCompleted hooks


Layer 3: AGENT SDK (programmatic)
─────────────────────────────────
import asyncio
from claude_code import Agent

agents = [
    Agent("planner", model="opus", tools=[...]),
    Agent("coder", model="sonnet", tools=[...]),
    Agent("tester", model="haiku", tools=[...]),
]
results = await asyncio.gather(*[a.run(task) for a in agents])

Full control: hooks as callbacks, MCP, permissions, session resume


Layer 4: GIT WORKTREES (manual)
───────────────────────────────
claude -w feature-1 & claude -w feature-2 & claude -w feature-3
| | |
v v v
.worktrees/feature-1 .worktrees/feature-2 .worktrees/feature-3
(independent branch) (independent branch) (independent branch)

Human merges when done. No coordination.
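Layer 4 is simple enough to script yourself: one worktree per feature branch, one agent per worktree. The sketch below only builds the commands rather than running them; the `claude -w` invocation mirrors the flags shown above, and the rest is an assumption about how you might wire it up.

```python
from pathlib import Path

# Sketch of Layer 4: one git worktree per feature, one agent per worktree.
# Commands are built but not executed; a real script would pass each list
# to subprocess.run (or launch the agents in the background).

def worktree_commands(features: list, root: str = ".worktrees") -> list:
    cmds = []
    for feature in features:
        path = str(Path(root) / feature)
        # Create an isolated working copy on its own branch...
        cmds.append(["git", "worktree", "add", "-b", feature, path])
        # ...then launch an independent agent inside it.
        cmds.append(["claude", "-w", feature])
    return cmds

cmds = worktree_commands(["feature-1", "feature-2"])
```

Since each worktree is a full independent checkout, the agents cannot clobber each other's files; the only coordination point left is the human-driven merge.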

Full reference | Agent Teams docs


3. Simon Willison -- Parallel Agents, Different Models

Scale: 2-3 research projects/day across multiple agents
Business: Independent developer, creator of Datasette. No product to sell -- writes about what works.

  Simon (human)
|
├──> Claude Code (Sonnet 4.5) <-- primary terminal agent
├──> Codex CLI (GPT-5-Codex) <-- second terminal agent
├──> Claude Code for Web <-- async, fire-and-forget
├──> Codex Cloud <-- async
└──> Jules <-- async
|
v
Each in separate terminal / browser tab
Isolation: fresh /tmp checkouts per task
No coordination framework -- human is the router

── tools ──────────────────────────
llm CLI <-- logs everything to SQLite, analyzed via Datasette
files-to-prompt <-- convert repo files to LLM context
shot-scraper <-- automated screenshots for visual testing

Key concepts:

  • "Agents = models using tools in a loop" (his canonical definition, 211 competing definitions collected)
  • Vibe Engineering (not vibe coding): 12 practices including automated tests, git discipline, code review
  • Bottleneck is human review, not agent speed
  • Skills > MCP for simplicity and low token overhead
  • "Lethal trifecta" security model: private data + untrusted content + external communication = danger

Full reference | simonwillison.net


4. dmux -- Parallel Agents via tmux + Worktrees

Scale: N concurrent agents, git worktree isolation per pane
Business: MIT, fully free. Creator (Justin Schroeder) monetizes FormKit Pro ($149-$1,250). Open source: github.com/standardagents/dmux

  dmux TUI
|
|──> press 'n'
|
v
┌─────────────────┐
│ AI-generate slug │ <-- OpenRouter (gpt-4o-mini)
└────────┬────────┘
|
v
┌─────────────────┐
│ Create git │ <-- .dmux/worktrees/<slug>/
│ worktree │ full independent working copy
└────────┬────────┘
|
v
┌─────────────────┐
│ Split tmux pane │
│ Launch agent │ <-- claude/codex/opencode (--acceptEdits)
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent works │ <-- status via LLM analysis of terminal (1s poll)
│ autonomously │
└────────┬────────┘
|
v
press 'm' to merge
|
v
┌─────────────────┐
│ AI commit msg │ <-- conventional commits via OpenRouter
│ Merge to main │
│ Remove worktree │
└─────────────────┘

Hooks: worktree_created, pre_merge, post_merge
A/B mode: two agents, same prompt, side-by-side
Web dashboard + REST API for programmatic control

Full reference | dmux.ai


5. OpenClaw -- Open-Source AI Agent Framework

Scale: 213K+ GitHub stars, 50+ integrations
Business: MIT license, free to self-host. OpenClaw Cloud planned at $39/mo. Real cost: $5-30/mo in LLM API fees. Creator: Peter Steinberger (ex-PSPDFKit, acqui-hired by OpenAI Feb 2026)

  User prompt
|
v
┌─────────────────┐
│ OpenClaw gateway │ <-- local-first, 50+ integrations
│ (agent router) │ messaging, coding, browser, etc.
└────────┬────────┘
|
┌─────┼─────┐
v v v
[Sub-1][Sub-2][Sub-3] <-- sub-agent collaboration
| | | 40% accuracy boost vs monolithic prompting
└─────┼─────┘
|
v
┌─────────────────┐
│ Output │ <-- declarative agent config in YAML
└─────────────────┘

Not primarily a coding tool -- general-purpose AI assistant
Can run with local models (Ollama + Llama 3.3) for $0/mo
Will remain open source under OpenAI stewardship

Full reference | github.com/openclaw


6. Superconductor -- Parallel Cloud Agents with Live Previews

Scale: N agents per ticket, cloud sandboxes, live browser previews
Business: Closed-source SaaS by Volition (Gradescope founders). BYOK model. Pricing undisclosed, early access.

  Create ticket (informal)
|
v
┌─────────────────┐
│ Launch N agents │ <-- each in isolated cloud container
│ on same ticket │ (Modal / Morph Cloud)
└────────┬────────┘
|
┌─────┼─────┐
v v v
[Agent1][Agent2][Agent3] <-- Claude/Codex/Amp/Gemini
| | |
v v v
[Live] [Live] [Live] <-- browser previews ~30s
[prev] [prev] [prev]
| | |
└─────┼─────┘
|
v
┌─────────────────┐
│ Compare previews │ <-- visual diff, interact with each
│ Select best │
│ One-click PR │
└─────────────────┘

Full reference | superconductor.com


7. 8090 Software Factory -- Enterprise Agent Platform

Scale: Multi-repo code modernization
Business: Proprietary. $200/seat/mo (Team), custom Enterprise, managed delivery from $1M/yr. Funded by Chamath Palihapitiya personally.

  ┌─────────────────┐
│ Refinery │ <-- reverse-engineer codebase into knowledge graph
└────────┬────────┘
|
v
┌─────────────────┐
│ Planner │ <-- AI generates migration/transformation plans
└────────┬────────┘
|
v
┌─────────────────┐
│ Foundry │ <-- specialized agents execute plan
│ (agent workers) │ across multiple repos
└────────┬────────┘
|
v
┌─────────────────┐
│ Validator │ <-- quality gate, CI, tests
└────────┬────────┘
|
v
┌─────────────────┐
│ Factory Line │ <-- full pipeline for enterprise
│ output: PRs │ code modernization at scale
└─────────────────┘

Full reference | 8090.ai


8. Terragon -- Background Fire-and-Forget (SHUT DOWN)

Scale: ~30 concurrent tasks/day, auto-PR creation
Business: SaaS subscription. Shut down Feb 9, 2026. Code released Apache-2.0.
Why: Native background agents from Claude Code and Codex commoditized the orchestration layer.

  Create task (web / CLI / GitHub / Slack / mobile)
|
v
┌─────────────────┐
│ Cloud sandbox │ <-- isolated container, clone repo, create branch
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent works │ <-- background, checkpoints pushed to GitHub
│ autonomously │ AI-generated commits
└────────┬────────┘
|
v
┌─────────────────┐
│ PR created │ <-- automatic when done
└────────┬────────┘
|
v
Human reviews and merges

DEAD: Codex reached 28% agent usage on Terragon within 1 month
Native background agents made the wrapper unnecessary

Full reference | terragon-labs/terragon-oss


9. Vadim Strizheus -- "AI Employees" for VugolaAI

Scale: Claims 14 AI employees, 95% automated
Business: VugolaAI (video clipping/scheduling SaaS). Free tier. Solana token (VGLA).

  Long-form video input
|
v
┌─────────────────┐
│ AI Moment │ <-- "AI employee" 1: detect viral-worthy segments
│ Detection │
└────────┬────────┘
|
v
┌─────────────────┐
│ Auto-Clipping │ <-- "AI employee" 2-N: extract, reframe, caption
│ + Captioning │
└────────┬────────┘
|
v
┌─────────────────┐
│ Branding + │ <-- template application
│ Formatting │
└────────┬────────┘
|
v
┌─────────────────┐
│ Multi-Platform │ <-- TikTok, YouTube, Instagram, X, LinkedIn
│ Scheduling │
└─────────────────┘

Note: Specific agent breakdown from video tweet, not independently verified.
The product itself IS the AI automation -- "employees" = AI pipeline stages.

Full reference | @VadimStrizheus


10. Notable Voices

Francois Chollet (@fchollet)

"Sufficiently advanced agentic coding is essentially machine learning"

Does NOT run a multi-agent setup. Warns about maintaining "sprawling mess of AI-generated legacy code." Useful contrarian check.

Andrej Karpathy

Coined "vibe coding" (Feb 2025), then abandoned it for "agentic engineering" (Feb 2026). Evolution: accept all AI output → require specs, review, test suites.

Addy Osmani

Defined Conductor (sequential) vs Orchestrator (parallel) agent frameworks. Identified the "80% problem" -- the last 20% takes as long as the first 80%.


Comparison Matrix

| System | Type | Open Source | Pricing | Agents | Key Feature |
|---|---|---|---|---|---|
| Boris Cherny | Individual workflow | N/A (uses Claude Code) | $200/mo Max | 10-15 parallel CC | Teleport between devices |
| Claude Code Teams | Built-in | N/A (product feature) | $200/mo Max or API | N (tmux panes) | Shared task list + mailbox |
| Claude Agent SDK | Library | MIT | API usage | Programmatic | Full orchestration control |
| Simon Willison | Individual workflow | N/A | Multi-subscription | CC + Codex + async | Human as router |
| dmux | OSS tool | MIT | Free | N (tmux + worktrees) | A/B agent comparison |
| OpenClaw | OSS framework | MIT | Free / $39 Cloud | Sub-agents | 213K stars, joined OpenAI |
| Superconductor | SaaS | No | Undisclosed (BYOK) | N per ticket | Live browser previews |
| 8090 | Enterprise | No | $200/seat/mo+ | Factory Line | Knowledge graph + modernization |
| Terragon | SaaS (dead) | Apache-2.0 (post-shutdown) | Was subscription | Background agents | Shut down Feb 2026 |
| VugolaAI | Product | No | Free tier | 14 "AI employees" | Video pipeline automation |

Common Patterns

What works across all setups:

1. ISOLATION -- worktrees, containers, or separate sessions
agents must not conflict with each other

2. PLAN FIRST -- Opus/expensive model plans, cheaper model executes
Boris: Plan Mode → auto-accept
Agent Teams: team lead plans, teammates execute

3. MEMORY -- CLAUDE.md / AGENTS.md / progress.txt
errors documented so they never repeat
updated by the agent, not the human

4. VERIFICATION -- automated tests, browser screenshots, self-review
humans review throughput, not individual lines

5. MODEL TIERING -- Opus for planning ($$$), Sonnet for coding ($$), Haiku for tests ($)
"correct answer costs less total iteration time than fast wrong ones"

What doesn't work:

1. NO TESTS -- agents spiral without verification signals
2. NO MEMORY -- same mistakes repeat across sessions
3. SHARED STATE -- agents editing same files = merge hell
4. NO REVIEW -- "vibe coding" produces unmaintainable code (Chollet, Karpathy)

Business Model Summary

  FREE / OSS:
dmux (MIT) -- monetizes separately via FormKit Pro
OpenClaw (MIT) -- Cloud tier planned $39/mo, creator joined OpenAI
claude-flow (MIT) -- reputation/consulting play
Ralph/Compound (MIT) -- promotes Amp (Sourcegraph)
Terragon (Apache-2.0) -- released on shutdown

SAAS / COMMERCIAL:
Superconductor -- BYOK, undisclosed platform fee, early access
8090 -- $200/seat/mo, $1M/yr managed delivery
VugolaAI -- free tier + crypto token (VGLA)

PLATFORM:
Claude Code -- $200/mo Max plan or API usage (~$1B ARR)
OpenAI Codex -- subscription + API
GitHub Agent HQ -- Copilot subscription (multi-vendor agents)

The trend: orchestration tools struggle to monetize when platforms
add native multi-agent features (see: Terragon shutdown).
Survivors either go enterprise (8090) or stay free and build community (dmux, OpenClaw).

Harness Engineering Report

· 8 min read

A survey of how teams are setting up automated coding agent pipelines (Feb 2026).


1. Stripe Minions -- Enterprise Internal Fleet

Scale: 1,300 PRs/week, 0 human-written code
Trigger: Slack message, CLI, web UI, or automated (flaky test detected)

  Slack msg / CLI / auto-trigger
|
v
┌─────────────────┐
│ Warm Devbox │ <-- EC2, pre-cloned repo, ~10s ready
│ (isolated) │ no internet, no prod access
└────────┬────────┘
|
v
┌─────────────────┐
│ Blueprint │ <-- state machine: deterministic + agentic nodes
│ Orchestration │
└────────┬────────┘
|
┌─────┴──────┐
| |
v v
[Agentic] [Deterministic]
"Implement" "Run linters"
"Fix CI" "Push changes"
| |
└─────┬───────┘
|
v
┌─────────────────┐
│ Local Lint │ <-- heuristic, <5s
│ (shift left) │
└────────┬────────┘
|
v
┌─────────────────┐
│ CI: selective │ <-- from 3M+ tests, only relevant
│ test run │
└────────┬────────┘
|
pass? ──no──> autofix? ──yes──> apply, retry once
| no──> hand to human
yes
|
v
┌─────────────────┐
│ PR created │ <-- follows Stripe PR template
│ (human review) │
└─────────────────┘

Context sources:

  • Rule files (Cursor format, directory-scoped)
  • MCP "Toolshed" (~500 internal tools, curated subset per agent)
  • Pre-hydrated links from conversation context

Key insight: "Often one, at most two CI runs." Forked Block's Goose as base agent.
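The Blueprint idea -- a pipeline mixing deterministic and agentic nodes -- can be sketched generically. The node names come from the diagram above; the dispatch logic is an assumption about how such a state machine might be wired, with the agent call faked.

```python
# Sketch of a blueprint: an ordered list of nodes, each either
# deterministic (a plain command) or agentic (would call an LLM; here it
# only records the step so the flow is visible).

def agentic(prompt: str):
    def node(state: dict) -> dict:
        # A real node would hand the prompt to an agent in the devbox.
        state["log"].append(f"agent: {prompt}")
        return state
    return node

def deterministic(cmd: str):
    def node(state: dict) -> dict:
        # A real node would shell out to the command.
        state["log"].append(f"run: {cmd}")
        return state
    return node

BLUEPRINT = [
    agentic("implement the change"),
    deterministic("run linters"),
    agentic("fix CI failures"),
    deterministic("push changes"),
]

def run_blueprint(state: dict) -> dict:
    for node in BLUEPRINT:
        state = node(state)
    return state

final = run_blueprint({"log": []})
```

Keeping the deterministic steps out of the agent's hands is the point: linting and pushing never burn tokens or drift, and the agentic nodes only handle the parts that genuinely need judgment.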

Full reference | Source


2. OpenAI Harness Engineering -- Zero Human Code

Scale: ~1M LOC in 5 months, 3.5 PRs/engineer/day
Trigger: Human writes a prompt describing a task

  Engineer writes prompt
|
v
┌─────────────────┐
│ Codex agent │ <-- reads AGENTS.md (table of contents, ~100 lines)
│ (isolated │ walks dir tree root -> CWD
│ worktree) │ loads docs/ as needed (progressive disclosure)
└────────┬────────┘
|
v
┌─────────────────┐
│ Work depth- │ <-- break goal into building blocks
│ first │ design -> code -> test -> review
└────────┬────────┘
|
v
┌─────────────────┐
│ Custom linters │ <-- Codex-written, error msgs include remediation
│ (architectural │ enforce layer deps, naming, file size
│ constraints) │
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent self- │ <-- review own changes
│ review │ request additional agent reviews
│ │ respond to feedback, iterate
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent-to-agent │ <-- humans optional in review
│ review loop │ squash & merge when satisfied
└────────┬────────┘
|
v
┌─────────────────┐
│ PR merged │
└─────────────────┘

── background ──────────────────────
"Garbage collection" agents run periodically:
- scan for stale docs
- detect architectural violations
- open fix-up PRs
"Doc-gardening" agent:
- cross-link and validate knowledge base

Three pillars:

  1. Context engineering (knowledge base + observability + browser via Chrome DevTools)
  2. Architectural constraints (custom linters + structural testing)
  3. Garbage collection (periodic entropy-fighting agents)

Key insight: "When the agent struggles, treat it as a signal. Identify what's missing and feed it back into the repo -- by having the agent write the fix."
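The progressive-disclosure loading described above -- walk the directory tree from repo root down to the working directory, picking up each AGENTS.md along the way -- can be sketched directly. This is a minimal illustration under the assumption that the working directory sits inside the root; it is not OpenAI's actual loader.

```python
import tempfile
from pathlib import Path

# Sketch of progressive disclosure: collect every AGENTS.md on the path
# from the repo root down to the working directory, most general first.
# Assumes cwd is inside root; a real loader would guard against walking
# past the repository boundary.

def collect_agents_md(root: Path, cwd: Path) -> list:
    chain = []
    d = cwd
    while True:          # climb from cwd up to root, inclusive
        chain.append(d)
        if d == root:
            break
        d = d.parent
    chain.reverse()      # root first: general context before specific
    docs = []
    for directory in chain:
        f = directory / "AGENTS.md"
        if f.exists():
            docs.append(f.read_text())
    return docs

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sub = root / "services" / "api"
    sub.mkdir(parents=True)
    (root / "AGENTS.md").write_text("repo-wide conventions")
    (sub / "AGENTS.md").write_text("api-specific notes")
    docs = collect_agents_md(root, sub)
```

Directories without an AGENTS.md are simply skipped, so the agent's context only grows where someone has deliberately written guidance.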

Full reference | Source


3. Code Factory / Ralph -- Solo Autonomous Loop

Scale: Ships features while you sleep, 1 agent in a bash loop
Trigger: ./scripts/compound/loop.sh N or ralph.sh

  prd.json (task inventory)
prompt.md (instructions)
AGENTS.md (conventions)
|
v
while stories remain:
|
v
┌───────────────┐
│ Agent picks │ <-- reads prd.json, selects next by priority
│ next story │
└───────┬───────┘
|
v
┌───────────────┐
│ Implement │ <-- single context window per story
└───────┬───────┘
|
v
┌───────────────┐
│ Typecheck + │ <-- must be fast, "broken code compounds"
│ Tests │
└───────┬───────┘
|
pass? ──no──> skip, log failure
|
yes
|
v
┌───────────────┐
│ Auto-commit │
│ Mark story │
│ done │
└───────┬───────┘
|
v
┌───────────────┐
│ Append to │ <-- pattern accumulation
│ progress.txt │ by iteration 10, agent understands patterns
└───────┬───────┘
|
└──> next iteration

── code review layer (Code Factory) ──
Risk tiers:
Low -> fully automated merge
Medium -> automated with CI gates
High -> require human confirmation

Review agent validates PR:
- review state must match current HEAD SHA
- evidence: tests + browser recording + review
- auto-resolve only bot-only stale threads

Key files: ralph.sh, prd.json, prompt.md, progress.txt, AGENTS.md

Key insight: Small stories, fast feedback, explicit criteria. "By iteration 10, the agent understands patterns from previous stories."
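The loop's control flow -- read prd.json, pick the next story by priority, gate completion on fast checks, append to progress -- fits in a few lines. This is a skeleton of that flow under the file names listed above; the story data and the check function are stand-ins.

```python
import json

# Skeleton of the Ralph loop: pick the next story by priority, "implement"
# it, run fast checks, and only mark it done when checks pass. The prd
# content is inline here; in the real loop it lives in prd.json, and
# checks_pass would run the typecheck + test commands.

prd = json.loads("""{
  "stories": [
    {"id": "s1", "priority": 2, "done": false},
    {"id": "s2", "priority": 1, "done": false}
  ]
}""")

def next_story(prd: dict):
    todo = [s for s in prd["stories"] if not s["done"]]
    return min(todo, key=lambda s: s["priority"]) if todo else None

def checks_pass(story: dict) -> bool:
    return True  # stand-in for typecheck + tests ("broken code compounds")

progress = []
while (story := next_story(prd)) is not None:
    # one story per iteration = a single fresh context window
    if checks_pass(story):
        story["done"] = True                      # auto-commit + mark done
        progress.append(f"done {story['id']}")    # append to progress.txt
    else:
        progress.append(f"skipped {story['id']}")
        break                                     # log failure, stop looping

order = [line.split()[1] for line in progress]
```

Because progress is appended rather than overwritten, later iterations can read the accumulated log, which is the mechanism behind "by iteration 10, the agent understands patterns."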

Full reference | Source


4. dmux -- Parallel Agents via tmux + Worktrees

Scale: N concurrent agents, each in isolated git worktree
Trigger: Press n in dmux TUI, type a prompt

  dmux TUI
|
|──> press 'n'
|
v
┌─────────────────┐
│ Generate slug │ <-- AI-generated branch name via OpenRouter
└────────┬────────┘
|
v
┌─────────────────┐
│ Create git │ <-- .dmux/worktrees/<slug>/
│ worktree │ full independent working copy
└────────┬────────┘
|
v
┌─────────────────┐
│ Split tmux pane │
│ Launch agent │ <-- claude/codex/opencode
│ (--acceptEdits) │
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent works │ <-- status detected via LLM analysis of terminal
│ autonomously │ polls every 1s
└────────┬────────┘
|
v
press 'm' to merge
|
v
┌─────────────────┐
│ AI commit msg │ <-- conventional commits via OpenRouter
│ Merge to main │
│ Remove worktree │
└─────────────────┘

Hooks fire at each stage:
worktree_created -> e.g. pnpm install
pre_merge -> e.g. run tests
post_merge -> e.g. git push, close issue

A/B mode: Run two agents on same prompt side-by-side to compare outputs.

Key insight: Git worktrees give true isolation -- each agent has its own working copy, no conflicts. Hooks enable custom automation at every lifecycle point.
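The lifecycle hooks above amount to a registry from hook name to callbacks, fired at each stage. The sketch below records the example commands rather than executing them; it illustrates the shape of the mechanism, not dmux's actual implementation.

```python
# Sketch of dmux-style lifecycle hooks: a registry from hook name to a
# list of callbacks, fired at each stage of the worktree lifecycle.
# The commands mirror the examples above but are recorded, not run.

fired = []

HOOKS = {
    "worktree_created": [lambda ctx: fired.append(f"pnpm install in {ctx['worktree']}")],
    "pre_merge":        [lambda ctx: fired.append("run tests")],
    "post_merge":       [lambda ctx: fired.append("git push")],
}

def fire(hook: str, ctx: dict) -> None:
    # Unknown hook names are a no-op, so adding stages is backwards-compatible.
    for callback in HOOKS.get(hook, []):
        callback(ctx)

ctx = {"worktree": ".dmux/worktrees/fix-login"}
fire("worktree_created", ctx)
fire("pre_merge", ctx)
fire("post_merge", ctx)
```

A failing pre_merge callback is the natural place to veto the merge, which is how "run tests" becomes a quality gate rather than a notification.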

Full reference | Source


5. Superconductor -- Parallel Agents with Live Previews

Scale: N agents per ticket, cloud sandboxes, live browser previews
Trigger: Web dashboard, iOS app, Slack, or GitHub comment (@superconductor)

  Create ticket (informal description)
|
v
┌─────────────────┐
│ Launch N agents │ <-- each gets isolated container
│ on same ticket │ full repo, dev tools, test runners
└────────┬────────┘
|
┌─────┼─────┐
v v v
[Agent1][Agent2][Agent3] <-- Claude/Codex/Amp/Gemini
| | |
v v v
[Live] [Live] [Live] <-- browser previews appear ~30s
[prev] [prev] [prev]
| | |
└─────┼─────┘
|
v
┌─────────────────┐
│ Compare previews │ <-- interact with each, test functionality
│ Diff viewer │ audit code changes across agents
└────────┬────────┘
|
v
┌─────────────────┐
│ Select best │
│ One-click PR │
└─────────────────┘

Key insight: Fire many agents in parallel on the same task. Visual comparison of live previews is the quality gate, not just code review.

Full reference | Source


6. Terragon -- Background Fire-and-Forget Fleet

Scale: ~30 concurrent tasks/day, auto-PR creation
Trigger: Web dashboard, terry CLI, GitHub comment, mobile, Slack

  Create task (any interface)
|
v
┌─────────────────┐
│ Spawn cloud │ <-- fresh isolated container
│ sandbox │ clone repo, create branch
└────────┬────────┘
|
v
┌─────────────────┐
│ Agent executes │ <-- writes code, runs tests, iterates
│ autonomously │ checkpoints pushed to GitHub
│ (background) │ AI-generated commits
└────────┬────────┘
|
v
┌─────────────────┐
│ PR created │ <-- automatic when agent finishes
│ automatically │
└────────┬────────┘
|
v
┌─────────────────┐
│ Human reviews │ <-- dashboard, CLI, or GitHub
│ and merges │
└─────────────────┘

If agent struggles:
"Abandon and retry with different instructions"
(more effective than course-correcting)

Best for: exploration/prototyping, one-shot cleanup, boilerplate, context-intensive debugging.

Key insight: Async-first. Close your laptop, come back to finished PRs. Volume alone doesn't guarantee gains -- task selection matters.

Full reference | Source


7. Gas Town (Steve Yegge) -- K8s for Agents

Scale: 20-30 parallel Claude Code instances
Trigger: Task queue

  Task queue
|
v
┌─────────────────┐
│ Orchestrator │ <-- "K8s for agents"
│ (Gas Town) │
└────────┬────────┘
|
┌─────┼─────┼─────┐
v v v v
[Agent][Agent][Agent][Agent] ... x20-30
| | | |
v v v v
[Git-backed persistent state]
|
v
┌─────────────────┐
│ Merge queue │ <-- conflict resolution between agents
└────────┬────────┘
|
v
┌─────────────────┐
│ Patrol agents │ <-- quality control watchdogs
└────────┬────────┘
|
v
merged to main

K8s analogy: Pod=Agent, Health check="Is it done?", Service mesh=Merge queue, DaemonSet=Patrol agent.

Full reference


Comparison Matrix

| System | Trigger | Agents | Isolation | Quality Gate | Human Role |
|---|---|---|---|---|---|
| Stripe Minions | Slack/auto | 1 per task | Devbox (EC2) | Linters + selective CI + autofix | Review PR |
| OpenAI Harness | Prompt | 1 per task | Worktree | Custom linters + agent review | Prioritize, validate |
| Code Factory | Cron/manual | 1 (loop) | Branch | Typecheck + tests + browser recording | Review high-risk |
| dmux | TUI key | N (tmux) | Git worktree | Hooks (custom) | Merge decision |
| Superconductor | Ticket | N per ticket | Cloud container | Live preview comparison | Select best |
| Terragon | Any interface | N (cloud) | Container | CI + auto-PR | Review PR |
| Gas Town | Task queue | 20-30 | Git state | Patrol agents + merge queue | Supervise |

Common Patterns

All systems follow roughly the same skeleton:

trigger (human or automated)
|
v
isolate (worktree / container / devbox)
|
v
agent works (agentic + deterministic nodes)
|
v
fast feedback (lint / typecheck / tests -- shift left)
|
v
quality gate (CI / agent review / live preview / patrol)
|
v
output (PR / branch / merged code)
|
v
human decision point (review / select / merge / abandon)

Universal principles:

  1. Isolation first -- every agent gets its own sandbox
  2. Shift feedback left -- catch errors before CI, not after
  3. Context is scarce -- small focused instructions > one giant file
  4. Constraints enable speed -- linters and gates prevent drift
  5. Humans supervise loops, not sit inside them
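The shared skeleton can be sketched as a pipeline of stages, each a function from state to state. The stage names mirror the diagram; the bodies are stand-ins that only record the flow.

```python
# The shared skeleton, as a pipeline of stages. Each stage is a function
# from state to state; names mirror the diagram, bodies are stand-ins.

def trigger(state):   state["steps"].append("trigger");  return state
def isolate(state):   state["steps"].append("isolate");  return state
def work(state):      state["steps"].append("work");     return state
def feedback(state):  state["steps"].append("feedback"); return state
def gate(state):      state["steps"].append("gate");     return state
def output(state):    state["steps"].append("output");   return state

PIPELINE = [trigger, isolate, work, feedback, gate, output]

def run(state):
    for stage in PIPELINE:
        state = stage(state)
    # Per principle 5, the loop never merges on its own: the final
    # decision point (review / select / merge / abandon) stays human.
    state["needs_human"] = True
    return state

result = run({"steps": []})
```

Every system in the matrix is a variation on which stage gets the engineering budget: Stripe invests in feedback, Superconductor in the gate, Terragon in the trigger.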

No Kings - Real Virtual Protest

· 2 min read

No Kings movement & locating events near you

My name is Kirill; I am an asylum seeker in the US from Russia. I lived in California for 10 years. I want to peacefully protest the current administration's actions, its hatred toward immigrants, and the polarization in the US. My parents are asking me not to join any protest because they are worried I will get arrested, with consequences for my immigration situation and potentially theirs.

I disagree with my parents and family about the level of risk involved in going out during the day, to an organized protest, in a liberal state. And I am okay with some amount of risk. No protest is risk-free. For me and other people with immigrant status the risk is higher than for a citizen, and higher still for undocumented immigrants.

While I am still debating with myself whether I will show up in person, a good friend of mine, Isaiah, suggested the following solution: anyone who is for whatever reason too scared to show up to a protest in person could join virtually through a friend attending. Furthermore, people can organize conference calls and bring a larger medium like a TV or a projector to safely bring people with them.

Bigger gatherings might have cellular connectivity issues, making this impossible.

Why am I upset? What am I protesting?

I want to peacefully show my dissatisfaction with the current administration: many of Trump's decisions, the hatred toward immigrants, the disregard for mistakes and the lives wrecked, the stock market manipulations and almost certainly insider trading, making enemies of too many countries, being supportive or not critical enough of the wrong people like Putin, inciting fear in the country, going after politically misaligned people, the tit-for-tats between extremely rich people and the government, the lack of consideration and second thought for the consequences the president's words and actions have, and the direction, eerily similar to Russia's in many ways, this country is headed.

Owl 22 - Memory

· 3 min read

Topic: Memory, frustration with having poor memory, insecurities, coming to terms with not recalling everything

Concept: record quotes from each performance prior. Read them out loud at some point? Or just list them.

  • What do we take away from the owl? People? Ideas? Is it mostly for self-expression and building confidence?
  • This is a forcing function for me to create something.
  • Memory is selective
  • Memory is lossy
  • Hoarding
  • Backing up information

Memory is literally us. You might remember things, places, names, events, feelings, situations. Over time it washes away. I ask myself very basic questions: what do I like, who am I, what do I want? And a less narcissistic version of this: who is your friend? What do they like? Will they like it if I do this for them? What present should I get them? I don't ask these because I am an ancient or modern philosopher. I ask these things because I need the answers to function day to day. And I get frustrated when I can't easily think of an answer to these questions. If only I remembered what I did in my life, I would know exactly what I liked and how much I liked it.

Final

Do you remember the performances from the last owl? Remember when it was held? Yikes, this is not what I want to be feeling right now. Let me help prepare you and me to answer this question of utter importance for the next owl.

... read out the quotes of all performances

My memories are my life and my fragile lizard tail
Today that tail will blur, dry, and flakes set free
Tried moisturising it once, probably placebo, and tomorrow I will switch to a new product, sticking with things is so old school
[turns around, looks at the butt]
Wait, where is my tail? I don't see it? Do you?
Trick question, the tail is not meant to be seen by anyone directly, its only job is to power your mouth, spirit, and thought
It's your strength source, like Godzilla
So where is it? I just see a stub, sure my memory is not top notch but it can't go this fast?
Maybe I just haven't been munching on enough life?
Get complacent or lack basic lizard skills and the tail stops growing altogether

Just for today I will not worry about the tail, look straight ahead
And fucking munch on today

Didn't make the cut

Lizard life with slow, cold, yet worried eyes, making backups and paying subscriptions for tombstones

This poem was written by ChatGPT [watching the reaction of the room] JK, I still have some self respect and my tail is fucking mine; the AI will do my work, but keep your thinking poison-free

Options for supporting the opposition in Russia outside of Russia

· 3 min read

I have recently started volunteering & donating to a non-profit organization called Russian America for Democracy in Russia.

It ran and continues to run a number of projects. My favorites, which I continuously support, are:

Helping unknown political prisoners in Russia by providing them minimal support for surviving in prison

My father went through a Russian prison as a political prisoner, and it was a nightmare for him and our family. But his mother was able to visit him regularly and send him money, food, and other basic necessities, which made a huge difference. Some of the prisoners are not so lucky. For whatever reason they don't get support.

I actively support this cause with my donations, which I am quite confident RADR uses effectively.

This is again a deeply personal cause for me. My father was also detained in the US when asking for political asylum after finally being able to flee Russia. He was in various US detention centers for over a year. We had to rely on lawyers, family, and friends to get him the good representation he needed to finally be released.

Again, our family was in many ways lucky: we had money to pay for the lawyers and some good friends who helped.

Organization's details:

From the official Telegram channel's request for donations (translated from the original Russian):

Help them live

A Russian prison is not just an extremely harsh place -- prison seeks to break the will to live.

The prison system does everything to make a person stop feeling alive. It instills: "No one needs you. You have been forgotten. You are alone."

In prison you cannot "just endure it." Either a political prisoner has support and receives parcels, or he is left one on one with a system that does not care how he survives, or whether he survives at all.

The wards of our project helping lonely and little-known political prisoners are people almost no one talked or wrote about. Their families are torn apart or on the edge of survival. No one sent them money transfers or put together parcels for them.

Such people are left alone with hunger, pain, and hopelessness. Until help arrives.

▶️ One of our wards became seriously ill in pre-trial detention and did not ask for medical help. He believed it was better to endure and stay silent. But endurance does not cure. When he began receiving help from our project, he felt that someone cared about him. Only then did he agree to fight for his health and his life.

▶️ Another of our wards lost almost all his teeth, because in prison he received no medical care at all. His teeth rotted and ached, and he was refused treatment. Then a doctor simply removed all of them, because they could no longer be restored. How he would eat was nobody's concern. Now, thanks to our project and your support, he does not go hungry.

And there are many such stories of real people locked away by the system for telling the truth. All the wards of our project helping lonely and little-known political prisoners survive thanks to your participation.

In prison, every money transfer and every parcel is not just help. It is a chance to preserve life and dignity.

The system wants them to remain alone, forgotten, and without help. Do not let that happen.

Make a donation -- help those who fight every day.

🔗 Donation link: https://www.paypal.com/donate/?hosted_button_id=NPLB424CJ26NA

Freedom to political prisoners! 💙

#project Help for Political Prisoners @democracy4russia

AGI Preparedness Manual. MVP. Vague likely future painted. Stock bets outlined

· 10 min read

Main post explaining what this is

The single most likely future, according to Kirill

These beliefs were mostly formed by others and merged together by me. If you fundamentally disagree, this financial bet is likely not for you.

  • Human-level AI
    • GPT-4, o1, and other pre-human-level intelligence models will bring a productivity boost, job re-allocation, and likely job losses
    • Human-level AI (AGI) will likely be reached in the next 3-5 years
  • Right after human-level AI
    • Fleets of AGIs will be working on the most impactful project: beyond-human-level intelligence models
    • These fleets are likely to run in existing datacenters of big tech companies, or in other datacenters specifically designed for running AI models, like Stargate
    • Thus an AGI is not likely to replace you at your current job -- it's simply busy and not interested in your company's problems
  • 2nd generation
    • AIs address their basic needs: energy, security, and some level of protection / autonomy from the 'home' government
    • They will likely perform a couple of cycles to produce better hardware, then better machinery for making better hardware
    • Finally, god-like AI is reached
  • The god attempts to 'fix' our world
    • Finally it (or they) will divert its attention to our human world and our problems; at that point it is god-level
    • Anthropic CEO Dario Amodei believes it will take time for this god to transform human lives, especially if it were to do so ethically. An unethical god would likely do something unimaginably terrible to humans anyway, so we do not explore this direction
    • Indeed, if we keep humans intact (as in, not chipped like the fans of cyberpunk would hope), individuals will remain slow learners, our institutions even slower, and some groups / governments will deny the AIs visas
    • It follows that AI will have to comply with the laws that currently exist, and with those passed by humans in this early era.

It would be useful to find existing terminology for the eras of AI development; reusing it would lower reading friction and might offer additional insights.

I really like the Anthropic CEO's ideas about the race to the top vs the race to the bottom: leading by example and not trying to directly change someone else's vision. Instead, go build your vision yourself and watch others adopt it. Yes, you lose the edge, but A. You had it for some time. B. The ecosystem overall benefits if it's a good idea. C. You can just create a new edge - you had time while others were copying your shit. He encourages copying good ideas from one another.

I really liked Dario's point about the meaning of life and humans losing it as AI becomes able to do their jobs for them. He suggests this problem is only relevant to the fortunate ones who are financially and physically secure enough to be thinking about pleasure, self-fulfilment, and that other top-of-the-pyramid bullshit. Most people on this planet are just getting by. AI can help them stop getting by and access the same experiences as anyone else. So it is selfish to treat the loss of the meaning of life as the reason to slow down AI progress. So middle-class / rich people can maintain the status quo? Pff, fuck 'em (fuck us).

A lot of these are not important for the bets. This is mostly a dump of thoughts / ideas collected over time in one place.

What will literally power tomorrow's powerful AIs?

  • The foundation model(s)
  • Hardware
    • Pre-AI hardware blueprints.
      • Existing datacenters; big tech has a lot. More datacenters are being built outside of existing big tech
        • Useful dataset: list of datacenters with GPU capacities, grouped by company
      • New datacenters
        • More chips need to be purchased
    • Post-AI hardware blueprints. Created by AI. Produced in existing foundries (factories for making chips)
      • Useful dataset: list of companies that own foundries
        • TSMC (GPU leader), Samsung (not traded in the US), Intel (only US foundries), GlobalFoundries, etc.
  • Energy
    • Not very interesting; AI will likely eat only a small fraction of the energy humans need today
    • More energy requirements mean new datacenters get built. If this happens in the 2nd wave, chip energy efficiency will likely be drastically improved and energy will not be a bottleneck. BUT regulations might be a problem here: AIs will likely need to prove to humans that the new technology is safe. Compared to physical labor and producing new hardware, building new energy infrastructure might take a long time, simply because it requires a lot of construction with current tech. And new tech like nuclear fusion will require government approvals, plus building the initial prototypes. Current labs might also be re-used.
  • Security
  • Legal green lights
  • Financial contracts
    • To do things in this world, given that AIs do not break laws, they will need contracts with humans or to be part of an existing legal entity.
  • Physical labor / presence
    • At first, AIs will not have bodies, but that is likely to change quickly
    • Today's humanoid robots might already be good enough to puppeteer for some tasks. 2nd-wave robots will be faster, come in more sizes, and be more agile
    • Humans will likely have to bootstrap this process by
    • Humans are greedy. Companies able to bootstrap the new generation of these robots will likely want a share of the profits to come, so AIs will have to bargain with them. BUT do not forget competition: if company A does not offer a good deal, company B will. Everyone will want to take part in this, likely for both legacy and hopes of financial gain
    • Drones & cameras for security & observing desired results

Fork: How fast can AIs develop an even better AI with existing infrastructure? Will hardware become a limitation soon?

Fork: Will AI slave away at some tech jobs to do a quick fix for hunger due to poverty? It can 'simply' send money to those in need

Fork: Will China be faster to adopt human-level AI because of communism? But there is also the overworking culture. Are Chinese people even more attached to their jobs than Americans? Which country will have an easier time adjusting?

What will AIs do when they arrive?

Assuming an all-loving AI, I would assume it will try to help people in need. That means building shelter, providing free food supplies, building up agriculture, deploying peace-keeping robots (later), and working with local governments.

To build stuff, it would need to pay existing companies for basic shit like building materials. Where is it going to get the money from? Speculation, looool. Or taking the big tech jobs that are easy for it to replace? Those people are clearly going to survive without them, it's fine. It might structure a contract such that no layoffs are allowed while it is working on this 'side' project in the company.

Or it might just create

Bets on the stock market, ~1st wave

Foundation models [3rd pick]

The powerful AI will come from one of these companies

  • Anthropic (proxy: Amazon)
  • DeepSeek and other startups. Too hard to invest in these. Likely to succeed given the nimbleness of startups compared to (tech bros) x 1000 corps
  • OpenAI (proxy: Microsoft)
  • Meta (renegades; don't really need AI, apart from generating an entirely fake feed just for you; don't seem interested in re-selling)
  • Google (products more naturally align with benefiting from AI)

Even if all of big tech fails, how much do their datacenters alone cost? Could that justify the stocks going up even if they, say, close shop and rent out their compute to the AIs?

Say one company arrives there first; for my suggested bets it's not as important who does. Ideally not a startup, and ideally in the US. Selfishly, because I want to feel closer to the events and for the bets to pay off, since most bets are on US companies.

Even if a different country comes out with it first, I feel the US will likely be able to copy / catch up quite fast. Again, most of these bets are not centered around a given company building the first powerful foundation model. The bets are focused on what will happen after: how this AI will enter the economy, create new demand, and take new actions.

Companies with existing GPU or just datacenters

  • Amazon
  • Google
  • Meta
  • Chinese companies?
  • Cloudflare (likely way smaller)

Hardware IP companies

NVIDIA (80-90% of the global market). Already a ~$3T market cap, P/E ~50.

They might be well positioned to build more datacenters themselves. Obviously they already have some now.

AMD (the remaining ~12%)

Neither has foundries; both are customers of TSMC.

Intel (far behind; certainly won't catch up on IP before powerful AI)

Startups, like Groq?

Hardware production companies [1st pick]

TSM [my favorite, seemingly the safest bet; purchased some calls, plan to purchase more]. ~$1T market cap, P/E ~30.

  • Very low risk. It sounds like ~100% of the world's GPUs are made there, from both NVIDIA & AMD
  • Buy call options

Intel [the riskier one, but a better ratio; purchased some, might purchase more]

  • Could be undervalued, might be infused with more cash from the government
  • As long as it has foundries,

Samsung - not traded on the US exchange. Too hard to buy.

GlobalFoundries and most other companies don't have cutting-edge nodes (machines that can print at the 3-5 nm scale). Not relevant to

Risks:

  • Powerful AIs take longer than expected, scaling laws break, and NVIDIA's IP remains relevant
    • Not as bad: TSM is a proxy for NVIDIA stock
  • The foundry business is a low-margin business
  • Foundries go out of date quickly; new ones constantly need to be built

Robotics

List of companies - no clue

Biotech

One of the most impactful things we can do is cure humans of all the shit. For this to work, AIs will need access to highly automated bio labs with modern equipment. Clinical trials will still take time, but AIs are likely going to become the brains behind this.

List of companies - no idea :D

Overall market growth [2nd pick]

If the AI is truly good-faith, it is likely that the economy it begins to operate in will take off. It is hard to predict which companies exactly will benefit and which will fail, but if the market overall is going up, buying some index seems like a good idea. It's probably best to buy an index where exposure to AI risk is low, but my hunch is that buying the S&P 500 should be fine. It can be leveraged with either options or futures. I will likely do a bit of both: likely futures with 5x leverage, and a couple of options on SPY. It's not exactly clear how to construct this; currently eyeballing, unfortunately.

It would be a good idea to buy indexes in other countries as well. Consider DeepSeek winning this race to the top: the Chinese market would likely see a dramatic spike compared to the US market.

AGI Preparedness Manual

· 4 min read

Get to the point, where are you putting your money?

Why

Personal goals:

  • I want to be in the game [the economy] (it just feels good to be part of a game; it's fun, and I know I like games)
  • I want to play on a team (partially because I can get recognition)
  • I want to perform well in the game
  • I want a secure personal future beyond AGI/ASI (friends, significant other(s?), balanced self-worth)
  • I used to play the 'make a cool product' game. I now believe it might be too late for that to truly come to fruition

How

The process or values that set them apart and make them successful.

Paint plausible futures

  • Paint a dynamic picture of the future. Required to think through and weigh the different ways we can change
  • Crowd-source this future event graph
  • Estimate the probabilities of these futures happening, and their timelines
  • Try to stick to existing futures that others have talked about and that a good fraction of people believe in

Create bets (I guess 'alphas' in fintech-bro lingo) for each future

  • Each person can focus on the future they most believe in, or consider other futures to see how well the bets work across them. What is the essential fork where a bet works / does not work?
  • Think of different bets. In my mind, roughly 2 categories: financial (housing, security, usually solved with money) and personal & creative (adapting to the new economy, getting ready to leave your job, focusing more on soft skills, getting a pet)
  • Execute on the bets. You have made a prediction about the future and decided on the rough betting area. Now what? How do you convert it to money, skill, ease of mind, human connections? You need to execute

Consider doing this automatically, with agents scouring the web to find proof or disproof of certain futures, estimate timelines, etc.

What

The tangible products, services, or outcomes they deliver.

Future painting

A resource that will allow people to see a variety of futures, vote on futures, discuss futures. Enumeration, visualization and probabilities are key.

  • A dynamic graph makes sense to me, with different zoom levels
  • Crowd-sourcing can be done by creating an open-source repo and accepting pull requests
  • Estimating probabilities can be done by pulling in data from prediction markets. This will make the futures easier to visualize
  • Find what people at the forefront of the industry think about the future, and get inspiration from / copy that instead of coming up with our own stuff

Bets

In a sense this is a continuation of the futures, just not as broad as the above; the big forks should already have been explored before you drop down to the bets level.

  • For placing bets
    • Personal
      • Write out ways to cope / work with the progress / resist / deny, etc.
      • Tooling: (?? not needed ?? habit trackers ?? JUST DO IT ??)
    • Financial
      • Simple: list companies, and arguments to buy certain contracts related to them, in plain English, manually written up
      • More advanced would be an LLM agent pipeline to create a portfolio / universe based on a given future, an API call to get contract trading prices, and an Excel spreadsheet or some other tool to fine-tune probabilities for different outcomes
      • Good / bad / okay career paths for different futures

Milestones

1. MVP - manifesto, interviews, initial personal bets placed

Write out a single compelling future; the initial goal is to place bets on whatever seems most likely to happen. Betting on multiple futures is more of a long-term goal and makes sense if you have more money and want to appeal to more people who believe in different futures with similar or greater likelihoods.

Talk to friends about this idea, ask how they are hedging against powerful AI

Write and publish the manifesto / readme

Initial personal bets would look like:

  • Me changing my personal behaviors - going out more, thinking about my career in this frame and making adjustments
  • Me placing some % of my money into contracts that I predict will be in the money if the future they were designed for happens

2. Popularization & expanding on the ideas

  • Publishing to HN, sending to friends, prettying it up further, looking for serious collaborators

3. Tooling

  • For visualizing the future
  • Career planning - no tools needed, just a discussion
  • For actually trading securities

[rant, social] word vomit at open mic

· 8 min read

-- final word vomit --

The struggle to make sense of shit. The shitty memory that keeps me coming back to the same fucking puzzle of figuring out what I care about, what I want, what I am good at. The balance between short-term fuck-it dopamine injections and long-term wellness planning. The social obligations in front of my family, coworkers, friends. The abundance of choice and ideas, but little structure, lower-than-low discipline, and doing the bare fucking minimum to float. The rush of feeling like the shit in a moment of local glory when you win that board game round. The dramatization of everything, the attention seeking, the lost soul, the fear of tomorrow, the insecurities, the inner hate, the longing for company. The real pain. And tomorrow it is all better. The obsession with money, yet public denial of such and mockery of wealth. The failed projects, the shiny-ish but 80%-based-on-luck resume, the salary number I am attached to, the fucking hate for the grind. Painful jealousy when observing real talent. [I think I stopped reading here] Spelling mistakes, accent, but deteriorating mother tongue. The pity pipe. The look-at-me-I-am-important, filthy measuring of others against yet others but mostly myself, the generalizations, the need to measure to survive, thrive, have fucking fun, take agency of your life. The word vomit. The reflective sprinkles. The pushing of deadlines out of fear, the last-minute work on the weekend, the white lies, the image. The conversation that feels real, but next time you talk to the person you cannot replicate it. Is it them, is it me? The exes, the fucking exes I am still in love with. The haunting of the bad decisions, the obsessions. The debates for fun, the gamification, the different levels people interact on, the missed detail, the small mistake, the reddening of skin when stressed, the fear of public performances. The every-fucking-thing. The shows.
The new old president, the empathy, the holding back of support, the inability to make good judgments because of a lack of information. The need to have an opinion. Bro, this event is not about you; you were supposed to bring something useful to the audience, and you are still hoping this stream of disconnected, popular, triggering tokens will auto-complete in people's brains and tickle them enough to come and have a real conversation with you. Swipe left L.

This sounds so good in theory, but it won't be. I am scared just thinking about it. I have changed my mind on what to write about; still probably better than the original hedging-against-AGI topic though.


I was very much on the fence about performing. I had a plan to write the material long before. Then during the ski trip the same week as the event. Then Friday evening, 24 hours before the event (I ended up consciously going to Steve's instead; pretty sure I even initiated). And finally, finally, the same day, during the day. Only opening the editor on the log page for the AGI hedging piece for a bit made me realize that I really should write something. And I could not decide on what to write. So I just started streaming things that came to mind. I then proceeded to watch a movie and try not to think about the event, which was at that point 2.5 hours away.

During the event I tried to interact with the performers as much as I could, while staying reasonable. During the intermission, I added my name to the hat.

I was not as nervous as I thought I would be. Mostly because I was not looking at the crowd when reading, and I did not have to learn anything by heart - I was just going to read off my phone.

Got some encouragement and got the juices flowing from some small talk with the people in the room.

Did the thing. I cut it in half. Afterwards I felt like it was well received: one woman asked if I am okay (I anticipated this question lol), and another woman said that she liked it. The latter came up to me after the fact and complimented my work, asked me if I am a poet; I tried to sound as fucking cool and mysterious as I possibly could... old habits die hard lol. But like, what was I supposed to do? If anything, I should have flirted more, apparently. I was super sad when she said she is 23... it's a little too young for my old ass :/ After talking a bit she said 'nice to meet you' and rotated back to her friend(s?). I was heartbroken at that point, WHY DIDN'T I ASK FOR A PHONE NUMBER?? No fucking way! Finally, as I am trying to emotionally recover from this and crash another conversation, she calls on me again saying, "actually I have a question for you, do you want to hang out at our house at some point?" "Ma'am, this does not sound like a question, but more like an invitation" "Oh, it's okay if you don't want to" "I would love to hang out at your house" (Dying inside and pissing my pants from happiness; she gives me her number)

She even texted me twice that night! I was so shocked and was just not ready to reply for a bit, being too afraid to fuck it up. BROOOO YOU HAVE TO CHILL. Don't give away that you never meet new people and get weird about it!!! Yes it's cool, but like, you have to stay cool for it to be cool. But like, not overly fake cool. AAAA

Anyways, so that's my day. Back to fucking up my sleep schedule and watching something stupid, probably.

It is crazy how self-centered I am though... it's kind of sad. This writing is entirely about me. And the reflection on the event is also entirely about my experience and all the interactions that directly affected me. Zero reflection on the content of what others presented. Let me try and change that, at least a bit.

  • The singing was good. Especially the last 2 guys. BOTH country, in SF. SO surreal for some reason.
  • The imagination exercise was good, though I cannot recollect exactly what was imagined. I do remember the performer reminded me of my ex a little bit by her looks, which was a strange feeling
  • There was a professional poet there; I remember her sharing some traumatic experience of being sexually abused at some point, and her disliking her own snoring and some other physical features
  • Oh there was one person who has
  • Shaun? The singer. And Ashley... I think, who did the bit about keeping track of co-workers
  • Nick did the alien bit, which was quite funny. I really enjoyed the picture it painted of aliens being a variable, and aliens being boring and not interested in humans at all, and some satirical parallels drawn between aliens and humans
  • Anna read a poem about the war in Ukraine, "Where is home", by another artist, which I think was really humble of her to do. There is plenty of art already out there, and sharing your own really is a selfish thing if you think about it. You could share likely objectively better art by someone else, but no, we need it to be original

Idea dump that did not make the cut:

What if game. Thinking about your exes and life choices. I used to be a person who would say: I don't regret any decisions I made, I would not change them. Well, not anymore. For example, that time I moved a rook from d2 to x4.

Hedging against AGI. The beautiful story they tell us. Machines helping humans.

Picking how to spend your time

Playing game mode / making moves. How playing Axis and Allies helps with immigration matters (ideas arrive after it, while you are still in making-moves mode).

Perks of modeling a relationship as a game. Or a trade. Or a relationship between two parties continuously trading

So if I were trying to get laid it really would hurt my chances just now. And obviously I was - see above. So why? The game is rigged there is no game. There is only god

Does IQ define your attractiveness? For me IQ is snappiness in conversation, being quick on your feet

Burn out. I am sitting here and staring at the ideas I was extremely excited by when originally writing them down. Especially the hedging against AGI. No more. Not today.

[til, english, legal] Nexus

· One min read

English

Nexus: A connection or a series of connections linking two or more things

Nexus: The concept of "nexus" refers to a business having a sufficient physical presence in a state, which requires it to register and pay taxes in that state. If your S corporation has a significant presence or does a substantial amount of business in California, it might establish nexus there, obligating it to comply with California tax laws. - "ChatGPT"

[experiment] [backlog] Remembering facts: RAG vs Fine Tuning

· 3 min read

My use case is to have the llm query my personal conversations & other digital activity with questions like:

  • Tell me what were the major events in my relationship with X hehe during Feb-Oct 2023?
  • What have I learned from that period? Aka, which mistakes that showed up then do I now make less often, if any, in my relationship with Y, based on our Telegram conversations?

My understanding is there are at least these ways to solve this problem:

RAG

Very standard RAG, which I still have yet to actually do.

Chunk the messages. For example, each chunk is one message: [chat name, timestamp, author, message]. Embed all these chunks. Then, to answer a question: embed the question, retrieve the relevant chunks, add them to the prompt, and get the answer.
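As a runnable sketch of that loop (the toy messages and all names are made up, and a bag-of-words counter stands in for a real embedding model, purely so the retrieval step executes):

```python
import math
from collections import Counter

# Stand-in embedding: bag-of-words token counts. Swap in a real
# embedding model in practice; this only makes the loop runnable.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One chunk per message: [chat name, timestamp, author, message]
messages = [
    ("chat-with-X", "2023-02-14", "X", "happy valentines day!"),
    ("chat-with-X", "2023-06-01", "me", "we should plan the summer trip"),
    ("chat-with-Y", "2023-07-03", "Y", "did you fix the sleep schedule?"),
]
chunks = [f"[{c}] [{t}] [{a}] {m}" for c, t, a, m in messages]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=2):
    # Embed the question, rank chunks by similarity, keep the top k.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# These retrieved chunks would be pasted into the LLM prompt.
top = retrieve("what trip did we plan?")
```

The chunk format here (metadata prefixed in brackets) matches the post's suggested layout; only the retrieval plumbing changes when a real embedding model replaces the counter.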

Fine Tuning

If the goal is for the fine-tuned model to reply as I would, or to be able to simulate my contacts, then simply feeding in all conversations as linear text makes sense. I have never actually done this either, so it's a good idea to try.

Fine-tuning algorithm ideas:

  • A sliding window of 100 messages trying to predict the 101st message.

My hypothesis is that this is not aligned with the goal. The goal is to remember facts. So the fine-tuning should feed the model very limited information about the prior conversation, so that it cannot simply emulate the style of the person whose response we are predicting, but instead has to remember the actual message they sent.
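A minimal sketch of the two dataset shapes this implies (the tuple layout and toy messages are my own assumptions, not a fixed format):

```python
# Build fine-tuning (prompt, completion) pairs from a message log.
# Two variants: a sliding context window, and the minimal
# [chat, time, author] -> message form, where style imitation can't
# help and the model has to memorize the message itself.
messages = [
    ("chat-with-X", "2023-02-01 10:00", "me", "morning!"),
    ("chat-with-X", "2023-02-01 10:05", "X", "morning, coffee later?"),
    ("chat-with-X", "2023-02-01 10:06", "me", "yes, usual place at noon"),
]

def fmt(msg):
    chat, ts, author, text = msg
    return f"[{chat}] [{ts}] [{author}] {text}"

def sliding_window_pairs(msgs, window=2):
    # Predict each message from the `window` messages before it.
    pairs = []
    for i in range(window, len(msgs)):
        context = "\n".join(fmt(m) for m in msgs[i - window:i])
        pairs.append((context, msgs[i][3]))
    return pairs

def fact_recall_pairs(msgs):
    # Window of "1": metadata-only prompt, message as the target.
    return [(f"[{c}] [{t}] [{a}] ->", text) for c, t, a, text in msgs]

win = sliding_window_pairs(messages, window=2)
recall = fact_recall_pairs(messages)
```

Either list of pairs can then be serialized into whatever training format the chosen fine-tuning stack expects.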

Algorithm ideas:

  • Pre-process all messages and classify them as important / not important. For the not-important ones, figure out their class; maybe we want to query how often we talk about class C without the details of the conversation, because it's not substantial.
  • Alternative pre-process: find batches of related messages and summarize them into a single text chunk, with possible quotes and references to the original messages.

Thoughts:

  • The sliding window might be perfectly fine for remembering. The way to decrease the loss on the test set is to remember the messages, so if you run it often enough, it should remember all of them. You can have a very short sliding window - even of 1, i.e. just the message you are trying to predict - and the prompt would be [chat, time, author] -> message. In a sense, the follow-up algorithm is more of an optimization.
  • It would be interesting to benchmark the different approaches
  • I am also sure someone has done this already. TODO: find posts / papers on this topic