
My 10 Days with OpenClaw 🦞 and Claude Opus 4.6 - AI Moves Faster Than You Can

Author
Łukasz Matuszewski

TL;DR

  • The breakthrough isn’t “better answers” - it’s autonomous execution loops: plan -> delegate -> implement -> test -> report.
  • With OpenClaw my role shifted from "supervisor" to idea generator giving feedback. I became the bottleneck, because I can't do code review at the pace of agents.
  • Claude Opus 4.6 is a real leap in autonomy, and the new Tasks and SubAgents in Claude Code allow managing a "swarm" of agents.
  • Anthropic advertises a 1M-token context window (beta) and better long-context retrieval: 76% on MRCR v2 at 1M tokens (Anthropic).
  • Economics matter: cheap GLM-4.7 (or GLM-5) for orchestration + expensive "executor" models is a practical architecture.
  • OpenClaw is still not enterprise-ready. The risks are real: permissions, prompt injection, tool abuse, compliance (especially GDPR), and legal ambiguity.
EDIT 16.02.2026: OpenAI just "acquired" OpenClaw - its creator, Peter Steinberger, will join the Codex team and take responsibility for autonomous agent development. OpenClaw itself will become a foundation, supported by OpenAI. Hopefully this move will accelerate the project's adaptation to enterprise requirements and raise security standards. Let's hope Peter finds time to keep developing the OpenClaw vision after joining Codex!

Anthropic probably regrets forcing Peter to rename from ClawdBot to MoltBot (apparently "Clawd" was too similar to "Claude"), because that pushed him to become a Codex CLI ambassador and brought him closer to OpenAI - which seemingly resulted in yet another rename, from MoltBot to OpenClaw ;)

The uncomfortable truth: AI model lifespan is now 2 months.

1–2 years ago, "AI in programming" mostly meant autocomplete and a chat sidebar in the IDE. Agents often wasted our time and made dumb mistakes. By the end of 2025 a breakthrough happened - new models (Opus 4.5, GPT-5.2, GLM-4.7, and to some extent Gemini 3 Pro) became much more reliable for agentic coding. What more could you want?

In February 2026, those models already feel outdated.

Opus 4.6 with OpenClaw or Claude Code is a capable system that:

  • reads your task list,
  • picks what it can handle autonomously,
  • plans the work (not just 5 minutes of it - even 5 hours),
  • delegates tasks to sub-agents,
  • manages, checks, gives feedback,
  • runs tests, creates PRs, checks CI/CD results,
  • pings you on WhatsApp: "Done, CR pls. Now issue ED-943, ok?" 😉

That's exactly the workflow I configured in the last 10 days.
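
To make the loop concrete, here is a minimal Python sketch of plan -> delegate -> implement -> test -> report. It is only an illustrative skeleton, not OpenClaw's actual code: `Task`, `run_loop`, and the callables passed in are hypothetical names, and the sub-agent and the test runner are replaced by plain functions.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    key: str                      # issue ID, e.g. "ED-943"
    description: str
    log: list[str] = field(default_factory=list)


def run_loop(task, plan, implement, test, report, max_attempts=3):
    """Plan the task, run each step through implement + test, then report."""
    for step in plan(task):
        for attempt in range(1, max_attempts + 1):
            result = implement(step)          # stands in for a sub-agent call
            if test(result):
                task.log.append(f"{step}: ok after {attempt} attempt(s)")
                break
        else:
            # Every attempt failed: stop and hand the task back to a human.
            task.log.append(f"{step}: failed, needs a human")
            report(task, done=False)
            return False
    report(task, done=True)
    return True


# Usage with stub callables (all hypothetical).
messages = []
task = Task("ED-943", "example task")
ok = run_loop(
    task,
    plan=lambda t: ["write code", "add tests"],
    implement=lambda step: step.upper(),
    test=lambda result: True,
    report=lambda t, done: messages.append(f"{t.key}: {'done' if done else 'blocked'}"),
)
print(ok, messages)
```

The point of the sketch is the shape: the human only appears in `report`, which is exactly where I ended up sitting.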

Someone will say that's nothing new - WebSockets, cron jobs, prompts. And technically, 2-3 years ago I was already building agent teams (architect, designer, dev, QA, copywriter) in CrewAI, LangChain, then LangGraph. Except back then those agents produced nightmare websites straight out of the 90s. Now? For my last 3 PRs I had only minor review notes; for 1 PR I had significant UI corrections because my agent didn't have browser access (for security reasons) - but the business logic was correct and far from trivial.

I'm not claiming this is "the one true way" to build software. I'm saying the pace of change is so fast that if you're a CTO, tech lead, or senior developer, you don't get to ignore this anymore. The biggest shift is not model IQ - it's autonomy.

Autonomy changes team workflows, the kind of competencies you need, tool choices, security posture, and how you measure productivity. Because how many thousands of lines of code can we actually review? And do we always have to? 😉

The biggest lesson: I became the bottleneck

Here's a line that may scare many engineers, but is very familiar to managers:

“I’m no longer the person writing code. I’m the bottleneck.”

When agents can produce "good enough" code quickly, the speed of our decision-making and review becomes the limiting factor. That shifts what you optimize from coding to:

  • better project management,
  • better slicing of tasks and PRs into smaller chunks,
  • a stronger test suite,
  • more deterministic CI,
  • clearer acceptance criteria and definition of done,
  • clear scope of permissions for agents (level of autonomy),
  • full automation - including AI doing code review, security tests, and maybe... deploys?

In other words: AI autonomy forces you to improve and think harder about engineering fundamentals. And once you have them in place, a new question emerges: Do we really need to read every single line of code? When does automation suffice? I don't have the answers yet — for now I'm reviewing, but painfully, because I'm the bottleneck. I'm blocking my agent's work both on the input side (decisions, access) and on the output side (code review, acceptance). Like... a manager? 😄

OpenClaw 🦞 as a real workflow, not a toy

OpenClaw is an autonomous agent framework. Out of the box, agents have access to:

  • communication channels (WhatsApp, Slack, Teams, Telegram, etc.),
  • context search and reading (plans, documentation, repo state, tasks, built-in memory with local RAG, Perplexity or Grok search),
  • tools, MCP, and skills (unlimited integration possibilities),
  • scheduled tasks (custom cron + heartbeats, e.g. every 1h).

Running in full mode (there's also a sandbox mode), it can:

  • install any tools and programs,
  • access files it needs for its work,
  • access multiple different projects,
  • compile code and even add binaries to PATH,
  • use a browser, click, fill out forms,
  • use any CLI or even TUI tools,
  • essentially do almost anything a human can do on a computer.

Our role then becomes: come up with ideas, delegate, and verify. Plus design and optimize the workflow (my agent happily helps with this - it can modify its own config files, prompts, and even write its own Skills - full flexibility).

The agent takes care of the rest. And it does it at a speed we can only dream of, while maintaining high quality (after proper workflow optimization and giving it tools to test and fix its own work).

The most surprising thing wasn't the speed - it was task discovery: I didn't explicitly assign most of the tasks. The agent picked them up from my list on its own and was able to judge when it had everything it needed to start, and when it needed to ask me, ask for help, or wait for a decision. And it asks smart questions.
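
The readiness check described above can be approximated deterministically. In my setup the model makes this judgment itself from the task text, so the sketch below is a simplification under stated assumptions: `BacklogItem`, `triage`, and the example "needs" are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BacklogItem:
    key: str
    needs: set[str] = field(default_factory=set)   # decisions or access required


def triage(backlog, available):
    """Split the backlog into items that can start now and questions to ask."""
    ready, questions = [], []
    for item in backlog:
        missing = item.needs - available
        if missing:
            questions.append((item.key, sorted(missing)))
        else:
            ready.append(item.key)
    return ready, questions


backlog = [
    BacklogItem("TASK-1", {"repo access"}),
    BacklogItem("TASK-2", {"repo access", "design decision"}),
    BacklogItem("TASK-3"),
]
ready, questions = triage(backlog, available={"repo access"})
print(ready)      # ['TASK-1', 'TASK-3']
print(questions)  # [('TASK-2', ['design decision'])]
```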

Hard numbers: what 10 days of agent work looked like

I tracked a 10-day window (Feb 2 → Feb 12). The dashboard summary from the Usage panel shows:

  • 893.4M tokens (in/out/cache combined)
  • $431.41 estimated API-equivalent cost
  • real cost approx. $40 (I used Claude Pro, Google AI Pro, ChatGPT Plus, and GLM/Z.ai subscriptions - more on that in the risks section)
  • 227 sessions, 11,297 messages, 8,499 tool calls
  • 5.23% error rate (591 errors) - mostly from experiments and... Gemini

What OpenClaw created during this period:

5 custom Skills for itself (workflow instructions + Python and JS scripts + templates):

  • Blogger - research, memory analysis (our past experience), topic proposals, draft generation, running quality-check scripts (avoiding filler words, character count, etc.), image generation, adding drafts to CMS, notification.
  • Email template selector - matching the right email template to a topic, with a Python script that returns the correct template based on subject.
  • Notification routing - which channels to use depending on the recipient.
  • System monitor - since it runs builds and linting, it needs to know whether system resources allow it at any given moment.
  • User interview - gathering information to personalize the agent for each person in the company.
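
For a sense of scale, a Skill script like the email template selector can be very small. The sketch below is my own illustration of the idea, not the script the agent wrote: the keyword rules and template file names are hypothetical.

```python
import re

# Hypothetical keyword -> template rules; the first matching rule wins.
TEMPLATE_RULES = [
    (re.compile(r"\b(invoice|payment)\b", re.I), "billing.html"),
    (re.compile(r"\b(course|training|workshop)\b", re.I), "training_offer.html"),
    (re.compile(r"\b(welcome|onboarding)\b", re.I), "welcome.html"),
]
DEFAULT_TEMPLATE = "generic.html"


def select_template(subject: str) -> str:
    """Return the template name for an email subject line."""
    for pattern, template in TEMPLATE_RULES:
        if pattern.search(subject):
            return template
    return DEFAULT_TEMPLATE


print(select_template("Your invoice for January"))   # billing.html
print(select_template("New AI Training dates"))      # training_offer.html
print(select_template("Hello"))                      # generic.html
```

Keeping this kind of repeatable logic in a script, rather than asking the model to decide each time, is what makes the Skill cheap and predictable.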

9 apps and tools, including:

  • CLI in Rust for counting tokens in files, to optimize and monitor context window usage (it built a binary and added it to PATH - more on this below)
  • CLI for RAG (building a knowledge base, generating embeddings, and semantic search)
  • Notepad app in Tauri with Rust, TypeScript and React - for me, Markdown-based
  • Heavily rebuilt Codex Monitor project (open source in Tauri, similar to Codex App but available on Linux and Windows), adding support for Gemini CLI and Claude Code (WIP)

10 Pull Requests to the Edukey repo - it would have done many more, but I'm the bottleneck, blocking it on input and output. I had to change the Heartbeat (agent trigger frequency) from 30min to 1h because I couldn't keep up when it was picking up tasks every 30 minutes. PRs included:

  • Many new features or major improvements to the Edukey website and our CRM
  • Improved Roles and Permissions system (RBAC) - built partly for the agent itself (so it can't delete anything and has restricted access)
  • Numerous SEO and GEO optimizations (GEO = Generative Engine Optimization, i.e. optimizing for AI search tools)

Costs and tokens by model:

  • gemini-3-pro-high - cost: $104.52, tokens: 193.3M, messages: 571
  • gpt-5.2-codex - cost: $96.98, tokens: 343.0M, messages: 2,887
  • claude-opus-4-6 - cost: $76.34, tokens: 94.4M, messages: 2,084
  • claude-opus-4-5-thinking - cost: $62.01, tokens: 62.7M, messages: 1,151
  • glm-4.7 - cost: $46.80, tokens: 170.9M, messages: 2,664
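
Dividing the costs above by tokens and by messages gives a rough unit price per model. This is plain arithmetic on the dashboard figures (API-equivalent estimates, with input, output, and cache tokens blended together), so treat it as a comparison aid rather than real pricing.

```python
# (cost in USD, tokens in millions, messages), copied from the list above
usage = {
    "gemini-3-pro-high":        (104.52, 193.3, 571),
    "gpt-5.2-codex":            (96.98, 343.0, 2887),
    "claude-opus-4-6":          (76.34, 94.4, 2084),
    "claude-opus-4-5-thinking": (62.01, 62.7, 1151),
    "glm-4.7":                  (46.80, 170.9, 2664),
}


def per_unit(cost, tokens_m, messages):
    """Return (USD per million tokens, USD per message)."""
    return cost / tokens_m, cost / messages


rows = {model: per_unit(*values) for model, values in usage.items()}

# Print cheapest-per-token first.
for model, (usd_per_m, usd_per_msg) in sorted(rows.items(), key=lambda kv: kv[1][0]):
    print(f"{model:26s} ${usd_per_m:.2f}/M tokens  ${usd_per_msg:.3f}/message")
```

On these figures glm-4.7 comes out cheapest both per million tokens (about $0.27) and per message (about $0.018), while gemini-3-pro-high is the most expensive per message (about $0.18).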

Token type breakdown:

  • Output: 2.7M
  • Input: 133.1M
  • Cache read: 770.4M
  • Cache write: 3.8M
  • Cache hit rate: 85.3%

This means the system consumed far more tokens on context handling and cache than on generating text or code: output was only 2.7M tokens, roughly 0.3% of the total.
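
The breakdown is easy to check by hand. The sketch below recomputes the shares and the hit rate; the hit-rate formula (cache reads divided by cache reads plus uncached input) is my assumption about how the dashboard defines it, though it does reproduce the 85.3% shown.

```python
# Token counts in millions, copied from the breakdown above.
tokens_m = {"output": 2.7, "input": 133.1, "cache_read": 770.4, "cache_write": 3.8}

total = sum(tokens_m.values())
shares = {kind: count / total for kind, count in tokens_m.items()}

# Assumed definition: cache reads / (cache reads + uncached input).
hit_rate = tokens_m["cache_read"] / (tokens_m["cache_read"] + tokens_m["input"])

for kind, share in shares.items():
    print(f"{kind:12s} {share:6.1%}")
print(f"cache hit rate {hit_rate:.1%}")
```

The four categories add up to 910.0M, a little above the 893.4M headline figure, so the dashboard presumably counts or rounds them slightly differently. Either way, output stays at about 0.3% of all tokens.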

That's why context engineering - designing orchestration, flows, and prompts - matters more than most teams realize. But the agent is happy to help with this and can improve its own prompts. 😊

Self-optimization and improvement

OpenClaw is "Agent-as-a-Code". The entire configuration lives in a single folder, and each agent has its own dedicated workspace (sub-folder) within it. On top of that, most of the agent's capabilities are easy-to-edit Skills - Markdown files with instructions + optional assets and scripts (e.g. in Bash, Python, or JS) for repeatable logic like API connections or data processing. This means autonomous systems can optimize themselves - if you allow it.

In my setup I ran into a real problem: Opus is expensive, and the delivered context (7 main Markdown files) duplicated a lot of information and was written verbosely. So I asked the agent to optimize the context (shorten the files) while preserving the information.

What it did next is simultaneously genius and terrifying:

  • created a plan,
  • wrote a Rust CLI tool to count tokens in workspace files,
  • compiled it and added the binary to PATH,
  • used it to measure context size during optimization,
  • cut context by 48% with no loss in quality.
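
The agent's tool was a Rust CLI; the Python sketch below only illustrates the same measure-then-optimize idea. It uses a crude 4-characters-per-token heuristic instead of a real tokenizer, and `estimate_tokens`, `workspace_tokens`, and `reduction` are hypothetical names.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic, NOT a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    if not text:
        return 0
    return max(1, len(text) // CHARS_PER_TOKEN)


def workspace_tokens(folder: Path) -> dict[str, int]:
    """Estimated tokens per Markdown file in a workspace folder."""
    return {
        path.name: estimate_tokens(path.read_text(encoding="utf-8"))
        for path in sorted(folder.glob("*.md"))
    }


def reduction(before: int, after: int) -> float:
    """Fractional saving, e.g. 0.48 for a 48% cut."""
    return (before - after) / before


# Example with made-up totals: going from 10,000 to 5,200 tokens is a 48% cut.
print(f"{reduction(10_000, 5_200):.0%}")
```

Running `workspace_tokens` before and after each edit gives the agent (or you) a number to optimize against, instead of guessing whether a rewrite actually shortened the context.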

That's real savings. And at the same time - enormous risk.

An agent that can compile and install binaries can also do much worse things. You cannot run it and leave it unsupervised without numerous restrictions: sandboxing, SSH tunnels, Tailscale, logs, automatic "circuit breakers", and so on.
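
As one example of what an automatic "circuit breaker" can look like, here is a minimal sketch that stops the agent after several consecutive failed tool calls. It is a generic pattern, not an OpenClaw feature; the class name and the threshold are hypothetical.

```python
class CircuitBreaker:
    """Trips after too many consecutive tool-call failures."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_failures = max_consecutive_failures
        self.failures = 0
        self.open = False   # open = agent must stop and wait for a human

    def record(self, success: bool) -> None:
        """Record the outcome of one tool call."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.open = True

    def allow(self) -> bool:
        """Check before every tool call; False means stop."""
        return not self.open

    def reset(self) -> None:
        """Called by a human after reviewing the logs."""
        self.failures = 0
        self.open = False


breaker = CircuitBreaker(max_consecutive_failures=2)
breaker.record(False)
breaker.record(False)
print(breaker.allow())   # False
```

Note that a later success does not close the breaker again: once tripped, only an explicit `reset()` lets the agent continue.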

Risks of autonomy at mass scale

Restrictions like these take technical skill to set up, yet a quick look at the OpenClaw Discord shows that most of its 14,000 members are non-technical people...

OpenClaw offers so many capabilities that within 2 months of launch (it started in November 2025) it became the fastest-growing project in GitHub history (currently over 180,000 stars). This means mass adoption of an immature product - with bugs and security gaps - by people with zero knowledge of the risks and how to mitigate them. And this is not a notepad app...

Of course, scams and dangerous Skills have already appeared in the ClawHub marketplace, including prompt injection attacks: prompts crafted to bypass an LLM's safeguards, for example to extract information or to trick the agent into dangerous actions such as running the proverbial rm -rf / or downloading malware.

In 2 years we went from "Don't give AI access to the internet!" to "Give AI access to everything!" - Madness!

That we haven't had a catastrophe yet is probably thanks mainly to the improving quality of LLMs, which are increasingly reluctant to do stupid things. But they still do them.

Why Opus 4.6 felt like a step-change

When people say "new model release", the default assumption is: slightly better answers. Opus 4.6 is not just smarter. Above all, it sustains multi-step, tool-driven work far more reliably (with thousands of tool calls, even a small error rate can cause cascading failures in a workflow), plans complex tasks better on large codebases, and delegates work to other agents effectively.
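
A quick back-of-the-envelope calculation shows why per-call reliability matters so much. It assumes every tool call fails independently and that a single failure breaks the chain; real agents retry and recover, so read it as a pessimistic illustration, not a measurement.

```python
def chain_success_probability(per_call_error_rate: float, calls: int) -> float:
    """P(no failure in `calls` independent calls) = (1 - p) ** n."""
    return (1.0 - per_call_error_rate) ** calls


for p in (0.05, 0.01, 0.001):
    row = [round(chain_success_probability(p, n), 3) for n in (10, 100, 1000)]
    print(f"error rate {p:.1%}: {row}")
```

Under these assumptions, a 1% per-call error rate leaves only about a 37% chance that a 100-call chain completes without a single failure.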

A few notable points Anthropic states publicly:

  • Opus 4.6 has a 1M token context window (beta), which allows keeping an entire medium-sized project in memory at once (Anthropic).
  • On the MRCR v2 "needle-in-a-haystack" benchmark at 1M tokens, Anthropic reports 76% for Opus 4.6 (vs 18.5% for Sonnet 4.5).
  • Agent Teams and Tasks: multi-step work spanning 2 weeks, autonomously, with delegation to up to 16 sub-agents. This way, Opus 4.6 built a working C compiler in Rust entirely on its own, writing 100k lines of code at a cost of only $20,000 (a project that would previously have cost millions). The resulting compiler can also build the Linux kernel for 3 architectures.
  • Issue triage and delegation: Opus 4.6 autonomously closed 13 issues and assigned 12 to the right people in a single day across an organization of about 50 people in 6 repositories — based on person and task descriptions.
  • Security: Without any explicit instructions, Opus 4.6 found over 500 security vulnerabilities in a large open source project used by millions of people — a project that had previously passed security audits conducted by human experts.

I recommend watching the analysis by Nate B. Jones, who breaks down the Anthropic blog post and explains why this is such an important shift: Nate’s Newsletter.

GLM-4.7: cheap orchestrator + premium executors

The most practical takeaway from this experiment is a pattern:

  1. Use a low-cost, stable model as the control plane (task selection, tool calling, summaries, coordination - tasks requiring good function-calling skills but not deep knowledge).
  2. Delegate heavy coding, architecture decisions, and complex reasoning to more expensive executor models.

In my case that often looked like:

  • "Architect / Planner": Opus 4.6 (large context window, full project understanding at once)
  • "Developer / Implementer": GPT-5.3 Codex (very strong for implementation-heavy tasks)
  • "Copywriter / Creative": Sonnet 4.5 (natural-sounding texts and articles)
  • "Control plane / Runner": GLM-4.7 (surprisingly good at function calling, lowest error rate - under 1% in my tests; Gemini performed worst here)
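
Expressed as configuration, the role split above is just a lookup table. The sketch below is hypothetical: the strings are the display names from the list, not real API model identifiers, and falling back to the cheap runner for unknown roles is my own design choice.

```python
# Role -> model label, mirroring the list above (labels, not API model IDs).
ROLE_MODELS = {
    "planner": "Opus 4.6",
    "implementer": "GPT-5.3 Codex",
    "copywriter": "Sonnet 4.5",
    "runner": "GLM-4.7",
}


def model_for(role: str) -> str:
    """Pick the model for a role; unknown roles go to the cheap control plane."""
    return ROLE_MODELS.get(role, ROLE_MODELS["runner"])


print(model_for("planner"))    # Opus 4.6
print(model_for("summarize"))  # GLM-4.7
```

Defaulting to the cheapest model means a misconfigured role costs you quality on one task, not a surprise bill.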

This matters because many teams try to use one model for everything and then get shocked by costs or reliability.

GLM cost me a laughable $2.50/month (holiday promo) on the Coding Plan Lite (the link gives an extra +10% discount), and I haven't managed to hit the 5-hour usage limit yet (max 36% reached). On the Lite plan it's a bit slow for pair-programming mode, but for autonomous background agents it's not a big deal - I'm not waiting for it.

For a concrete pricing reference (pay-per-use), Z.ai publishes token pricing and cached-input pricing in their docs (Z.ai pricing).

Yesterday a new GLM-5 was also released - reportedly at the level of Opus 4.5, but with the ability to work on a task for an hour, meaning they went in a similar direction to Opus 4.6. I'll update the article once I've tested it.

One important caveat about GLM: on Z.ai's Coding Plan the servers are located in China (though the model is open source and alternative hosts exist). This creates real problems with personal data / GDPR and security for EU-based companies.

Is this enterprise-ready?

When evaluating Opus 4.6 and Claude Code alone (with complex tasks and sub-agent delegation) - yes, these are already being used successfully in large companies.

However, universal autonomous agents like OpenClaw come with a pile of additional risks for larger organizations (which are often acceptable in small companies and startups).

1) Permission risk (the "rm -rf" class of problem)

If an agent can run tools, it can run destructive commands. Even when the model is "aligned", it can be tricked by ambiguous instructions or prompt injection.

Minimum recommendation: run agent workloads on a separate machine/VPS, with restricted credentials and a recoverable filesystem (backups).

2) Tool abuse and hallucinated actions

In my own testing, there was a moment where the agent tried to send a WhatsApp message to a non-existent number. It was blocked by restrictions, but the intent matters - agents can confidently attempt actions that look plausible but are dangerous.

Minimum: give agents only the tools they need, set up a sandbox environment, monitor tool usage, apply filters on dangerous commands.
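
A filter on dangerous commands can start as simply as the sketch below: allow only a short list of programs and reject anything containing shell metacharacters. The allowlist is hypothetical and deliberately incomplete, and the filter assumes the command is later executed without a shell (as an argument list), so treat it as one layer of defense, not a sandbox.

```python
import shlex

ALLOWED_PROGRAMS = {"git", "ls", "cat", "pytest", "npm"}   # hypothetical allowlist
SHELL_METACHARACTERS = set(";|&><`$\n")


def is_command_allowed(command: str) -> bool:
    """Return True only for a single, allowlisted program with no shell tricks."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False            # no chaining, pipes, redirects, or substitution
    try:
        tokens = shlex.split(command)
    except ValueError:          # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_PROGRAMS


print(is_command_allowed("git status"))                # True
print(is_command_allowed("rm -rf /"))                  # False
print(is_command_allowed("git status && rm -rf /"))    # False
```

Even an allowlisted program can do damage (think of a forced git push), which is why restricted credentials and backups still matter.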

3) ToS and platform enforcement

Consumer subscriptions and "agent frameworks" are still a gray zone. In my case, after about a week of heavy autonomous usage of Gemini CLI inside OpenClaw, access to Google AI Assist services was blocked for a ToS violation - with no clear reason given. I'm still waiting on their email response. I know Anthropic does the same.

Lesson: if you're building production workflows, you need API terms, quotas, and DPAs - not hope. Consumer plans don't provide the legal guarantees your company needs.

4) GDPR and personal data in the EU

If you're in the EU, assume that "just don't paste sensitive data" is not a compliance strategy.

  • Real anonymization is hard.
  • Logs and traces often contain identifiers.
  • Consumer plans may not provide the legal guarantees a company needs.

If you want to operationalize agents in a company, you need a proper data-processing model: EU region, data processing agreements, retention policies, and access controls. All major providers offer this - e.g. Azure (GPT and Anthropic) and GCP (Gemini and Anthropic).

What to do in the next 90 days (for CTOs and tech leads)

If you lead a team, don't start by buying "an agent". Start by upgrading the environment so agents can work safely.

  1. Pick one workflow that is safe and measurable (e.g., internal docs updates, small refactors with tests, automated code review).
  2. Build a test-first safety net (unit tests + lint + CI gates). Agents make tests more valuable, not less.
  3. Separate the system into roles and match models to them:
    • orchestrator (cheap, reliable)
    • executors (powerful, expensive)
  4. Implement a permission model:
    • no production credentials for the agent!
    • separate environment, Docker, VPS, sandbox (people are buying Mac Minis en masse for OpenClaw setups)
    • limited and monitored network access
    • tool allowlists
    • logging and audit trails
  5. Train people on the new skill - prompting alone is not enough. The development team needs to develop managerial competencies:
    • writing clear requirement specs for tasks,
    • breaking work into smaller pieces and delegating (though Opus 4.6 already helps with this),
    • verifying results and giving feedback,
    • analyzing failures and improving workflows,
    • managing multiple agents simultaneously, both locally and in the cloud.
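
Two items from the permission model in step 4, tool allowlists and audit trails, fit naturally into one small wrapper. The sketch below is a generic illustration with hypothetical names (`ToolGate`, the tool names); in practice the log would go to an append-only file or a logging service rather than a list in memory.

```python
import json
import time


class ToolGate:
    """Allow only listed tools and record every attempt, allowed or not."""

    def __init__(self, allowed_tools):
        self.allowed = set(allowed_tools)
        self.audit_log = []   # in-memory stand-in for an append-only log

    def call(self, agent, tool, func, *args, **kwargs):
        allowed = tool in self.allowed
        # Log before executing, so blocked attempts are visible too.
        self.audit_log.append(json.dumps(
            {"ts": time.time(), "agent": agent, "tool": tool, "allowed": allowed}
        ))
        if not allowed:
            raise PermissionError(f"{agent} may not use {tool}")
        return func(*args, **kwargs)


gate = ToolGate({"read_file"})
print(gate.call("dev-agent", "read_file", lambda: "file contents"))
try:
    gate.call("dev-agent", "deploy", lambda: "deployed")
except PermissionError as exc:
    print("blocked:", exc)
print(len(gate.audit_log))   # 2
```

Logging the blocked attempt matters as much as blocking it: the attempts an agent makes tell you where your instructions or permissions are unclear.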

At Edukey, we see exactly this on our AI training courses for developers: less "how to use Cursor or Copilot", and more "how to build a workflow you can actually trust".

FAQ

Is OpenClaw enterprise-ready today?

Not by default. It can be powerful in a controlled environment, but permissioning, security, and governance need real work. There are also performance issues at larger scale - it's still an immature project. It works well as an executive assistant for a CTO, or as an experiment that lets an experienced developer "see the future", but not for mass deployment - though it is probably only a matter of months before similar enterprise-grade solutions appear.

What is the biggest productivity bottleneck with agentic workflows?

Review and verification. Plus fast delegation and unblocking the agent (access, tool configuration). When agents write faster than humans can review or delegate, we become the bottleneck for the agents.

Do I need one frontier model (Opus) for everything?

Usually no. A cheaper orchestrator plus premium executors is often more cost-effective and operationally stable.

Does a 1M token context window solve "context rot"?

It helps, but you still need good information design. Even Anthropic's own reported MRCR v2 results show lower retrieval at 1M than at shorter contexts (Anthropic).

What's the safest way to start with autonomous agents?

Use a separate machine or sandbox, restrict credentials, log everything, and start with low-risk tasks.

Want to make this practical for your team?

If you want to train developers and tech leads on agentic workflows (architecture, tool safety, prompt injection defenses, and real delivery loops), Edukey can help. Join our course: OpenClaw 🦞 - Autonomous AI Agent in 1 Day

