250+ Hours of Claude Code for GTM: Here's What We Learned
What 250+ Hours Building a Claude Code-Powered GTM Campaign Taught Us About Automation and Accuracy
Blog · by Jan · March 04, 2026

We built a full outbound GTM campaign system that runs entirely from Claude Code: Databar for enrichment, Smartlead for sending, and deterministic Python scripts for everything that should not involve an LLM.
After running test campaigns, iterating on the architecture, and logging 250+ hours inside Claude Code sessions, we have a set of learnings that apply to anyone building GTM workflows with AI agents. Some of these were obvious in hindsight. Most of them cost us time and credits to figure out the hard way.
Here are the seven learnings, in the order that matters most.
1. Guardrails Are Your Most Important Feature
The most valuable lines of code we wrote were not the scoring algorithm or the enrichment waterfall. They were the guardrails that prevent expensive mistakes.
Without guardrails, Claude will happily send emails with links in the first touchpoint (which destroys deliverability). It will write personalized copy that could apply to literally anyone. It does not do these things maliciously. It does them because nobody told it not to, and it is optimizing for completion, not for your domain reputation.
The guardrails are what make the system safe to run autonomously.
For email copy:
→ 11 banned openers ("I noticed that...", "Hope this email finds you well", etc.). These kill reply rates.
→ Email 1 = zero links, always. Links in the first cold email destroy deliverability.
→ Max 75 words per email. Every word over 75 drops your reply rate.
→ No duplicate opening lines across any two contacts in the same campaign.
→ Personalization quality check: if the email could be sent to anyone at any company and still make sense, it is not personalized enough. Reject and regenerate.
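A minimal sketch of how these copy guardrails can be enforced in code. The banned openers below are examples, not our full list of 11, and the function shape is illustrative:

```python
# Illustrative guardrail checks for generated email copy.
# BANNED_OPENERS here is a sample, not the full production list.
BANNED_OPENERS = [
    "i noticed that",
    "hope this email finds you well",
    "i came across",
]
MAX_WORDS = 75

def check_email(body: str, is_first_touch: bool, seen_openers: set) -> list:
    """Return a list of guardrail violations; an empty list means the email passes."""
    violations = []
    first_line = body.strip().splitlines()[0].lower()

    if any(first_line.startswith(o) for o in BANNED_OPENERS):
        violations.append("banned opener")
    if is_first_touch and "http" in body.lower():
        violations.append("link in first touchpoint")
    if len(body.split()) > MAX_WORDS:
        violations.append(f"over {MAX_WORDS} words")
    if first_line in seen_openers:
        violations.append("duplicate opening line in campaign")

    seen_openers.add(first_line)
    return violations
```

Reject-and-regenerate is then a loop: if `check_email` returns anything, the draft goes back to the LLM with the violations listed.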
For campaigns:
→ Never send without explicit human confirmation. Campaigns are always created as PAUSED.
→ Test batch first: 50 contacts, wait 24 to 48 hours for deliverability data before loading the rest.
→ Check credit balance before bulk enrichment runs.
→ Auto-disqualify companies by geography, partner/competitor status, and gov/edu/nonprofit domains.
For data quality:
→ Domain-match check: if the enriched email domain does not match the expected company domain, drop it.
→ Email verification built into every waterfall call. No exceptions.
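The domain-match rule is a one-line comparison. A sketch (the record shape is an assumption; adapt it to your enrichment output):

```python
# Drop enriched emails whose domain does not match the expected company domain.
def domain_matches(email: str, company_domain: str) -> bool:
    """True if the email's domain matches the expected company domain."""
    try:
        email_domain = email.rsplit("@", 1)[1].lower().strip()
    except IndexError:
        return False  # malformed address: drop it
    return email_domain == company_domain.lower().strip()

def filter_by_domain(records: list, company_domain: str) -> list:
    """Keep only records whose enriched email passes the domain-match check."""
    return [r for r in records if domain_matches(r["email"], company_domain)]
```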
These guardrails sound obvious written down. They are not obvious when you are building at 11 PM and Claude is confidently offering to push your entire list to production. The guardrails are the difference between a system you trust and a system that sends your CEO a spam complaint at 3 AM.
2. Waterfall Everything Through a Single API
We started by manually chaining enrichment providers inside Claude Code. Try Provider A, if it fails try Provider B, then C. This approach was fragile, slow, and burned credits on failed lookups. Every provider needed its own API key, its own response format, its own error handling.
The shift: Databar's waterfall API handles the entire cascade through a single endpoint. One API call, multiple providers deep, with built-in email verification. You don't manage provider subscriptions, you don't write fallback logic, and you don't pay for failed lookups that returned nothing.
Before (manual chaining):
→ Sign up for Findymail, Prospeo, Hunter separately
→ Write custom integration code for each provider's API
→ Build your own fallback logic: if Provider A returns null, try Provider B, then C
→ Handle three different response schemas, three different error codes
→ Total: 3 API calls, 2 wasted on failures, weeks of integration work
After (Databar waterfall):
→ One API call to Databar's waterfall endpoint
→ Databar cascades through multiple providers automatically until one returns a result
→ Built-in email verification on the same call
→ Total: 1 API call, 1 integration to maintain, verified result
The key waterfalls we use in production:
→ email_getter. First name, last name, and company domain in. Verified email out. Cascades through multiple providers automatically.
→ person_getter. Email address in. Full person data out, including name and LinkedIn URL. Cascades through multiple providers.
→ person_by_link. LinkedIn URL in. Verified email out. Cascades through multiple providers with optional email verification.
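A single-call waterfall looks roughly like this from the Claude Code side. The endpoint path, auth header, and field names below are assumptions for illustration; check Databar's API documentation for the exact request shape:

```python
import json
import urllib.request

DATABAR_BASE = "https://api.databar.ai"  # placeholder base URL

def build_email_getter_payload(first: str, last: str, domain: str) -> dict:
    """Assemble the inputs the email_getter waterfall needs."""
    return {"first_name": first, "last_name": last, "company_domain": domain}

def get_verified_email(api_key: str, first: str, last: str, domain: str) -> dict:
    """One POST replaces the whole provider cascade; field names are hypothetical."""
    payload = build_email_getter_payload(first, last, domain)
    req = urllib.request.Request(
        f"{DATABAR_BASE}/waterfalls/email_getter",       # hypothetical path
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",        # hypothetical auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)  # expected to include the verified email, if any
```

The point is the shape, not the specifics: one request in, one verified result out, with the cascade running server-side.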
The real advantage is what you stop doing. You stop signing up for six different data providers. You stop maintaining six different API integrations. You stop writing brittle if/else chains that break when a provider changes their response format. Databar gives you access to 100+ data providers through a single API, and the waterfall logic runs server-side where it belongs, not in your Claude Code session where it adds complexity and failure points.
The math is simple. Fewer API calls = fewer failure points = lower cost. And the email verification happens inside the waterfall call itself, so you never push unverified emails to your sending platform. That single architectural decision eliminated an entire category of deliverability problems.
For bulk operations, Databar also supports a bulk waterfall endpoint that processes hundreds of records asynchronously. You submit the batch, poll for completion, and get results back with the same cascade logic applied to every record. No loops, no retry handling, no manual batching in your code.
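The submit-then-poll pattern can be reduced to one small helper. The endpoint details are hypothetical; only the polling pattern is the point:

```python
import time

def poll_until_done(check_status, interval_s: int = 10, timeout_s: int = 1800):
    """Call check_status() until the job reports 'completed' or 'failed'.

    check_status is any callable that fetches the job's status dict,
    e.g. a GET against a (hypothetical) bulk-job status endpoint.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check_status()
        if result.get("status") in ("completed", "failed"):
            return result
        time.sleep(interval_s)
    raise TimeoutError("bulk waterfall job did not finish in time")
```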
3. Deterministic Scoring Beats LLM Scoring Every Time
We built a 100-point company scoring system with deterministic Python logic. Every single point is assigned by code. Zero LLM involvement. The specific weights and thresholds are calibrated to our ICP, but the architecture works for any B2B outbound use case.
Here is how the structure works, using a simplified example:
| Dimension | Example Weight | What It Measures |
| --- | --- | --- |
| Industry Fit | 25-35 pts | How closely the company's industry matches your buyer profile |
| Company Size | 20-30 pts | Whether headcount falls in your sweet spot (varies by product) |
| Growth Signals | 15-25 pts | Funding stage, headcount growth, expansion indicators |
| Tech Stack | 10-20 pts | Whether they use tools adjacent to your product category |
| Geography | 5-15 pts | Region fit and market prioritization |
| Bonus Signals | +5-10 pts | Hiring patterns, specific job postings, intent indicators |
The weights in your system will look different. A company selling to enterprise will weight company size higher. A company selling to startups will weight funding signals higher. The point is not the specific numbers. The point is that every number is assigned by a Python if statement, not by Claude interpreting a prompt.
You define tiers based on total score. Something like: above 70 is Tier 1 (hot), 45 to 69 is Tier 2 (warm), 25 to 44 is Tier 3 (nurture), below 25 is disqualified. These thresholds shift as you run campaigns and learn which scores actually convert.
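A stripped-down sketch of what this looks like in code. Every weight and threshold below is illustrative, not our calibrated production values:

```python
# Illustrative deterministic scoring. Weights, ranges, and thresholds are
# examples only; calibrate them to your own ICP.
INDUSTRY_POINTS = {"b2b software": 30, "fintech": 25, "ecommerce": 15}

def score_company(company: dict) -> dict:
    score, reasons = 0, []

    pts = INDUSTRY_POINTS.get(company.get("industry", "").lower(), 0)
    score += pts
    reasons.append(f"Industry: {company.get('industry')} ({pts} pts)")

    headcount = company.get("employees", 0)
    if 50 <= headcount <= 200:                      # example sweet spot
        size_pts = 25
    elif 20 <= headcount < 50 or 200 < headcount <= 500:
        size_pts = 15
    else:
        size_pts = 5
    # Example edge-case override: agencies that match our criteria score well
    # even at small headcounts.
    if company.get("is_agency") and headcount >= 10:
        size_pts = max(size_pts, 20)
    score += size_pts
    reasons.append(f"{headcount} employees ({size_pts} pts)")

    if company.get("funding_stage") in ("Seed", "Series A"):
        score += 20
        reasons.append(f"Funding: {company['funding_stage']} (20 pts)")

    if score >= 70:
        tier = 1
    elif score >= 45:
        tier = 2
    elif score >= 25:
        tier = 3
    else:
        tier = None  # disqualified
    return {"score": score, "tier": tier, "reasoning": " | ".join(reasons)}
```

Run it twice on the same company and you get the same score, the same tier, and the same human-readable reasoning string.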
Why deterministic matters more than you think:
→ Run it 100 times, get the exact same result. Try that with a prompt.
→ Debug any score by reading the reasoning string. Something like: "Industry: B2B Software (high fit) | 85 employees (sweet spot) | Funding: Series A (ideal) | Tech signals: uses competitor product | Location: US"
→ Adjust one weight, re-run, see exactly what changed. No prompt engineering. No temperature tuning.
Edge cases are where this really pays off. Say your product sells well to small agencies, but a generic size-based model scores them low because they have 15 employees. With deterministic scoring, one conditional line fixes it: if the company matches your agency criteria, override the size score. That rule fires consistently on every run, on every record. Try doing that reliably with prompt engineering. You would need to explain the exception, hope Claude remembers it across runs, and still get inconsistent handling of edge cases.
The broader lesson: anything with a right answer should not involve an LLM. Scoring has a right answer for each company given your criteria. Qualification has a right answer given your rules. Filtering has a right answer given your parameters. Keep the LLM for tasks that genuinely require reasoning, like writing personalized copy or analyzing unstructured data. Give the math to Python.
4. The 8-Prompt Campaign Flow
A full outbound campaign is not one task. It is eight distinct steps, each with its own inputs, tools, quality checks, and outputs. Trying to run it as one mega-prompt does not work. Breaking it into eight sequential prompts with human checkpoints between them does.
The flow:
→ Step 1: Score and tier companies. Deterministic Python scoring. Human reviews Tier 1 list before proceeding.
→ Step 2: Find contacts. Enrichment waterfall to identify the right people at scored companies. Cross-verify when only one source returns data.
→ Step 3: Enrich emails. Databar's email_getter or person_by_link waterfall with auto-verification. Output: verified contact list with email, title, and LinkedIn URL.
→ Step 4: Qualify leads. Filter out partners, competitors, existing customers. Match against ICP role requirements. Enforce geographic and domain rules.
→ Step 5: Generate copy. Select email framework by tier. Apply all guardrails. Generate personalized sequences for each contact.
→ Step 6: Test batch. Push 50 contacts to Smartlead. Campaign created as PAUSED. Wait 24 to 48 hours for deliverability data.
→ Step 7: Load remaining. After deliverability is confirmed, push the remaining contacts. Monitor bounce rates.
→ Step 8: Campaign learnings. Pull analytics, document what worked, propose template updates for the next campaign.
Why eight steps instead of one: each step has a human checkpoint. You review Tier 1 companies before finding contacts. You verify deliverability before loading a huge number of contacts. You read every email before it sends. The system is autonomous within each step but requires human approval between steps.
This 8-step structure turned out to be the right granularity. Fewer steps meant too much happened without review. More steps added friction without catching additional errors. Eight gives you enough control to prevent mistakes while keeping the velocity high enough that a campaign goes from ICP to live sends in a single day.
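The checkpoint pattern itself is tiny. A minimal sketch, with placeholder step functions and an injectable confirm hook standing in for the human:

```python
# Sequential steps with an explicit human gate between them.
# Step functions and the confirm hook are placeholders.
def run_campaign(steps, confirm=input):
    """steps: list of (name, fn) pairs. fn receives results so far.
    After each step, require an explicit 'y' before moving on."""
    results = {}
    for name, fn in steps:
        results[name] = fn(results)
        answer = confirm(f"Step '{name}' done. Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            print(f"Stopped after '{name}'.")
            break
    return results
```

Defaulting the gate to "no" matters: anything other than an explicit "y" halts the flow, which is the same default-to-safe posture as the PAUSED campaigns.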
5. Slash Command Skills Turn Prompts into Repeatable Workflows
After running several campaigns, we noticed the same prompt patterns repeating. So we turned them into Claude Code skills, which are reusable slash commands with built-in guardrails.
The four skills we built:
→ /score-companies. Scores and tiers a company list using the deterministic Python model, with optional pre-enrichment for missing data points.
→ /enrich-leads. Finds contacts at scored companies and gets verified emails via Databar's waterfall enrichment. Handles the full cascade from LinkedIn URL to verified contact record.
→ /write-sequence. Generates email copy with all guardrails enforced. Banned openers, word limits, personalization checks, no links in email one.
→ /push-smartlead. Pushes contacts to Smartlead with three modes: test-batch (first 50, campaign PAUSED, the default for safety), remaining (contacts 51+ after deliverability is confirmed), and full (all contacts, only when explicitly requested).
What makes a good skill: each skill is a markdown file that tells Claude Code exactly what to do. Which files to read, which tools to call, what guardrails to check, what output to produce. The key is being specific enough that Claude cannot deviate, but flexible enough to handle different inputs.
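For illustration, a hypothetical skill file might look like this. The structure, file names, and paths below are invented for the example, not our production skill:

```markdown
# /write-sequence

Generate a personalized email sequence for each contact in the input CSV.

## Inputs
- `contacts.csv` with columns: name, title, company, tier, enrichment_notes

## Steps
1. Read the contact list and the copy framework for the contact's tier.
2. Draft a 3-email sequence per contact, max 75 words per email.
3. Run every draft through the guardrail checks before writing output.

## Guardrails
- No banned openers (see `guardrails/banned_openers.md`).
- Email 1 contains zero links.
- Reject any email that could apply to any company unchanged.

## Output
- `sequences/{contact_id}.md`, one file per contact.
```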
The /push-smartlead modes are a good illustration of why skills matter. The default mode pushes only 50 contacts and creates the campaign as PAUSED. This prevents the most common and most expensive mistake in outbound: pushing a bunch of contacts to a live campaign before verifying deliverability. A human has to explicitly request "full" mode. The system defaults to safe.
6. The WAT Framework: Why This Architecture Works
By now you have seen the individual components: guardrails preventing mistakes, waterfall APIs handling enrichment, deterministic scripts scoring companies, an 8-step flow with human checkpoints, and slash commands making it all repeatable. The question is why this specific combination works when simpler approaches fail.
The answer comes down to one architectural principle. When AI handles every step directly, accuracy compounds downward. If each step is 90% accurate, five steps in sequence give you a 59% success rate. That math gets ugly fast. By step eight, you are below 50%.
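The compounding is easy to verify:

```python
# Per-step accuracy compounds multiplicatively across sequential steps.
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    return per_step ** steps

print(round(end_to_end_accuracy(0.90, 5), 2))  # 0.59 -- the 59% five steps in
print(round(end_to_end_accuracy(0.90, 8), 2))  # 0.43 -- below 50% by step eight
```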
The fix: split your system into three layers. The WAT framework.
→ Workflows (markdown SOPs). Plain-language instructions defining what to do. These are the recipes.
→ Agents (Claude Code). Read workflows, make decisions, call tools. This is the orchestration layer.
→ Tools (Python scripts + APIs). Deterministic execution. Zero LLM involvement. Zero room for error.
Every component in this article maps to one of those layers. The scoring script is a Tool. The 8-step campaign flow is a Workflow. Claude Code reading the flow and deciding which tool to call at each step is the Agent. The guardrails are encoded in both the Workflows (rules) and the Tools (enforcement).
The pattern is clear: AI for reasoning tasks (orchestration, copy generation, analysis). Deterministic tools for deterministic tasks (scoring, qualification, data validation, filtering).
A concrete example: company scoring. The scoring uses a Python script with hardcoded rules. Industry fit = points based on a lookup table. Company size = points based on ranges. Claude never "decides" a score. It runs the script. The script scores companies in under a second with 100% consistency. If we had left scoring inside Claude's reasoning loop, we would still be debugging prompt drift.
7. Feed Campaign Results Back into Your Templates
Most teams run campaigns, check the metrics, and start the next campaign from scratch. We built a feedback loop where campaign analytics automatically propose updates to our scoring criteria and copy frameworks.
How it works (Step 8 of the campaign flow):
→ Pull analytics from the Smartlead API: open rates, reply rates, bounce rates per sequence step.
→ Analyze by segment: which tiers replied most? Which email framework performed best? Which subject lines got the highest opens?
→ Document learnings in a structured file (campaigns/{name}/learnings.md).
→ Propose (not auto-apply) updates to templates. Should tier thresholds change? Which angles worked? Any new phrases to add to the banned openers list?
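A sketch of the analyze-and-propose step. The metric fields and the 2% floor are illustrative; the real numbers come from the Smartlead API and your own baselines:

```python
# Turn per-contact campaign results into human-reviewable proposals.
# Field names and the reply-rate floor are illustrative.
def analyze_by_tier(contacts: list) -> dict:
    """Compute reply rate per tier from per-contact campaign results."""
    by_tier = {}
    for c in contacts:
        stats = by_tier.setdefault(c["tier"], {"sent": 0, "replied": 0})
        stats["sent"] += 1
        stats["replied"] += int(c["replied"])
    return {t: s["replied"] / s["sent"] for t, s in by_tier.items()}

def propose_updates(reply_rates: dict, floor: float = 0.02) -> list:
    """Return readable proposals; a human decides whether to apply them."""
    proposals = []
    for tier, rate in sorted(reply_rates.items()):
        if rate < floor:
            proposals.append(
                f"Tier {tier} reply rate {rate:.1%} is below {floor:.0%}: "
                "consider tightening its score threshold."
            )
    return proposals
```

The output is text, not a config change. That is deliberate: the system argues, the human decides.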
Why propose, not auto-apply: templates are your accumulated knowledge. Letting an AI auto-update them based on one campaign's data is risky. Instead, the system presents evidence-based recommendations and a human decides whether to accept them.
The compounding effect is real. After five campaigns, your scoring criteria are calibrated to your actual close rate. Your copy frameworks reflect what actually gets replies. Your banned opener list includes phrases you have proven do not work for your specific audience. Each campaign makes the next one measurably better because the learnings are captured in code and templates, not trapped in someone's head or buried in a Slack thread.
This feedback loop is the biggest difference between teams that plateau after their first campaign and teams that improve with every iteration. The system we built is not a static tool. It is a GTM workflow that gets smarter over time because the infrastructure captures and applies learnings automatically.
The Architecture in One Picture
If you take away one thing from 250+ hours of building this:
→ Guardrails everywhere (never send without confirmation, test batch first, verify every email)
→ Single-API enrichment (Databar waterfalls replace manual provider chaining)
→ Deterministic tools for deterministic tasks (scoring, qualification, filtering, data validation)
→ AI agents for reasoning tasks (orchestration, copy generation, analysis, research)
→ Feedback loops (campaign learnings feed back into templates and scoring weights)
→ Slash command skills for repeatability (same workflow runs the same way every time)
The system works because it respects the boundary between what AI is good at and what deterministic code is good at. Claude Code is the orchestrator, not the calculator. Python is the calculator. Databar's API is the data layer. Smartlead is the delivery layer. Each layer does what it does best, and the boundaries between them are enforced by the WAT framework and the guardrails.
That is 250+ hours distilled into seven learnings. The system is not done. It gets better every week. But the architecture is stable: the guardrails, waterfall enrichment, deterministic scoring, the 8-step flow, slash command skills, the WAT framework, and the feedback loops. Everything we build from here compounds on top of it.
FAQ
What is the WAT framework?
WAT stands for Workflows, Agents, Tools. Workflows are markdown SOPs defining what to do. Agents (Claude Code) read the workflows, make decisions, and call tools. Tools are deterministic Python scripts that execute with zero LLM involvement. The framework separates AI reasoning from deterministic execution, which prevents accuracy degradation across multi-step workflows. You can see each layer in action throughout the article: guardrails and the 8-step flow are Workflows, Claude Code is the Agent, and the scoring scripts and waterfall calls are Tools.
Do I need to know Python to build this?
You need Claude Code to write the Python. You do not need to write it yourself. The deterministic scoring scripts, the API integrations, and the data processing tools are all generated by Claude Code based on your specifications. The skill you need is knowing what the scoring criteria should be and how the campaign should flow. That is GTM expertise, not programming.
Why not let the LLM handle scoring?
Consistency. LLM scoring produces different results on different runs for the same input. Deterministic scoring produces identical results every time. When you debug a scoring issue, you can read the reasoning string and trace exactly why a company scored the way it did. With LLM scoring, you are debugging a prompt, which means guessing at why the model interpreted something differently this time.
Can this work for agencies managing multiple clients?
Yes. The slash command skills and WAT framework are designed to be client-agnostic. Each client gets their own scoring weights, ICP definition, and copy frameworks stored in separate configuration files. The skills load the appropriate config per client. One system, multiple clients, uniquely calibrated per engagement. Our article on enrichment tools for agencies covers the multi-client data layer in more detail.
Why use Databar instead of integrating with each data provider directly?
Managing individual provider integrations is an operational burden that compounds over time. Each provider has its own API format, authentication, rate limits, billing, and quirks. Databar consolidates 100+ providers behind a single API, so you write one integration and get access to all of them. The waterfall endpoints handle provider cascade logic server-side, and built-in caching means re-enriching the same contact later costs nothing. For a Claude Code based system where simplicity and reliability matter, a single data layer is significantly easier to maintain than a dozen individual provider scripts.