How I Cut My Infrastructure Costs by 35% Overnight - A Startup Survival Checklist

Posted October 5, 2025 by Gowri Shankar · 10 min read

Picture this: It's 6 AM, I'm nursing my third cup of coffee (don't judge), and I'm staring at my GCP billing dashboard like it personally offended my mother. The numbers are glowing with the enthusiasm of a neon sign in Vegas, except instead of promising jackpots, they're promising bankruptcy. ₹60,000+ per month. For a healthcare startup that's still figuring out if doctors actually want AI assistance or just want us to leave them alone.

That's when it hit me... not enlightenment, not a business epiphany, but pure, unadulterated panic. At this burn rate, QIQ Health would run out of runway faster than a paper airplane in a hurricane. The unit economics were laughing at me, and my cap table was about to become a cautionary tale at startup meetups.

But here's the plot twist: within 24 hours, I managed to slash that bill by more than a third. No venture debt, no emergency funding rounds, no selling my kidney on the dark web. Just some good old-fashioned detective work and the kind of infrastructure archaeology that would make Indiana Jones proud.

A brutally honest story about startup survival, billing shock therapy, and why sometimes the best architecture is the one that doesn’t bankrupt you.



The Problem: When Your Infra Costs More Than Your Team

Let me paint you a picture of startup life circa three months ago. We had built this beautiful, cloud-native architecture using Infrastructure as Code (IaC, using Pulumi with TypeScript). One-click deployment, auto-scaling, multi-region redundancy… the whole nine yards. It was engineering poetry in motion.

The problem? Poetry doesn’t pay the bills, and our infrastructure was eating through cash faster than a crypto bro in a bull market.

Here’s what was happening:

  • Monthly burn rate: ₹60K+ just on GCP
  • Growth in usage: Minimal (we’re still in MVP phase)
  • Growth in costs: Exponential (thanks, compound billing)
  • Sleep quality: Approaching negative territory

The math was simple and terrifying. At our current revenue of approximately ₹0 (ah, the joys of pre-revenue startups), we had maybe 4-5 months before we’d be explaining to investors why we needed another round to fund our cloud bill instead of actual product development.

Something had to give, and spoiler alert: it wasn’t going to be my sanity.


Going Full Sherlock Holmes on My GCP Bill

The Great SKU Detective Story

When you’re bleeding money, the first step isn’t panic (though I did plenty of that). It’s forensic accounting. I dove into our GCP billing like it was a murder mystery, except the victim was my bank account and the suspect was… well, me.

Google Cloud’s billing architecture is like Russian nesting dolls, but instead of cute wooden figures, each layer reveals another way you’re spending money you didn’t know you had. Here’s how it breaks down:

  1. Service Level: The big categories (Compute, Networking, Storage)
  2. SKU Level: The nitty-gritty details of what’s actually costing you money
  3. Usage Level: The cold, hard truth about how much you’re actually using
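
This hierarchy maps directly onto a group-by once you export your billing data to CSV. Here's a minimal sketch in Python; the column names (`service`, `sku`, `cost`) and the sample rows are illustrative, since real GCP billing exports have more columns and format-dependent names:

```python
import csv
import io
from collections import defaultdict

# Hypothetical billing export rows for illustration only.
SAMPLE = """service,sku,cost
Networking,Network Intelligence Center Network Analyzer Resource Hours,1744.0
Networking,Cloud NAT Gateway Uptime,630.0
Compute Engine,N1 Predefined Instance Core,5200.0
Compute Engine,Storage PD Capacity,900.0
"""

def top_cost_centers(csv_text: str, n: int = 5):
    """Group billing rows by (service, sku) and return the n most expensive."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[(row["service"], row["sku"])] += float(row["cost"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

for (service, sku), cost in top_cost_centers(SAMPLE):
    print(f"{service:15s} {sku[:45]:45s} ₹{cost:,.0f}")
```

Fifteen lines of stdlib Python, and your top cost centers fall out sorted. No BigQuery required at startup scale.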

Looking at my bill, three SKUs were absolutely demolishing my budget:

Network Intelligence Center SKUs (The Silent Budget Killers):

  • Network Intelligence Center Internet to Google Cloud Performance Resource Hours (BDBA-22FA-3925): ₹$$,275
  • Network Intelligence Center Network Analyzer Resource Hours (9BF8-CD36-F9B8): ₹$$,744
  • Network Intelligence Center Topology and Google Cloud Performance Resource Hours (D9AD-28F8-05D8): ₹$$,744

These weren’t providing any real value to our healthcare app. We weren’t running enterprise-scale networking that needed deep intelligence insights. We were a small startup serving medical intelligence for a mere 34 active users, not managing a global CDN.

Cloud NAT Gateway (The Honest Thief): At least this one was upfront about what it did. But at ₹$$,630 for something we barely used, it was like paying for a Ferrari when all you need is a bicycle.


The IaC Trap: When Automation Goes Rogue

Here’s the thing about Infrastructure as Code… it’s absolutely brilliant until it isn’t. Over three months of iterative development, our Pulumi scripts had accumulated services like a hoarder accumulates cats. Each “just in case” service, each “we might need this later” component, each “better safe than sorry” redundancy was quietly racking up costs.

Our IaC had become Infrastructure as Debt. One-click deployment meant one-click financial commitment to a dozen services we’d forgotten we’d enabled.

The Network Intelligence Center was particularly insidious because:

  1. It gets enabled automatically with certain GCP services
  2. It charges per resource-hour for EVERY running instance
  3. It provides insights that are overkill for most startups
  4. There’s no clear “disable” button in the console

The Great Infrastructure Purge

Step 1: Nuking Network Intelligence Center

Disabling Network Intelligence Center felt like defusing a bomb while wearing oven mitts. Google doesn’t make it obvious how to turn this thing off, and the documentation reads like it was written by someone who assumes you want to keep paying for it.

Here’s what actually worked:

# Disable the APIs (both are needed)
gcloud services disable networkmanagement.googleapis.com --force
gcloud services disable networkintelligence.googleapis.com --force

# Clean up any lingering resources
gcloud compute networks list  # Check for phantom networks
gcloud compute routers list   # Look for abandoned routers

The key insight: Network Intelligence Center isn’t just one service… it’s a collection of billing SKUs that get activated when you use networking features. Each SKU charges separately, and they’re all enabled by default.

Result: ₹75,000+ in annual savings. Just like that.

Step 2: The Redis Reality Check

Our second big win came from questioning something I thought was sacrosanct: our Redis cache infrastructure.

We were running a dedicated Compute Engine instance just for Redis caching. It made sense in theory… Redis is fast, it’s what everyone uses for caching, it’s the “right” way to do things. But it was also costing us ₹8,000+ monthly for what amounted to caching a few hundred medical insights.

That’s when I had my second coffee-induced epiphany: What if we didn’t need Redis at all?

The PostgreSQL UNLOGGED Revelation

PostgreSQL has this beautiful feature called UNLOGGED tables. They skip the Write-Ahead Log (WAL), which makes them:

  • 2-3x faster for writes than regular tables
  • Perfect for regenerable data like caches
  • Zero additional infrastructure cost (uses existing CloudSQL)
  • Still transactional (atomic, isolated), though not crash-durable: the table is truncated after a crash, which is acceptable for regenerable cache data

Here’s the magic:

-- Create cache table
CREATE UNLOGGED TABLE cache_store (
    key VARCHAR(255) NOT NULL PRIMARY KEY,
    value VARCHAR(2048) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP NULL,
    hit_count INTEGER NOT NULL DEFAULT 0
);

-- Add performance indexes
CREATE INDEX idx_cache_expires_at ON cache_store (expires_at);
CREATE INDEX idx_cache_created_at ON cache_store (created_at);

Performance comparison (based on our medical transcription workload):

  • Redis: ~0.1ms latency, 100K+ ops/sec
  • PostgreSQL UNLOGGED: ~1-2ms latency, 10K+ ops/sec
  • Our actual needs: <1K ops/sec for medical workflows

The PostgreSQL solution offered 10x more throughput than we needed, but it was infinitely more cost-effective than maintaining a separate Redis instance.



Building the Unified Cache System

Instead of just ripping out Redis, I built a unified cache interface that could work with either backend. This way, if we ever scale to the point where we need Redis performance, we can switch without rewriting application code.

from typing import Optional

from redis.asyncio import Redis  # only imported if the Redis backend is in play


class CacheInterface:
    """Thin facade over either a Redis client or a PostgreSQL-backed cache.

    The per-backend helpers (_redis_get, _postgres_get, etc.) are elided here;
    each wraps the corresponding client call or SQL statement.
    """

    def __init__(self, backend):
        self.backend = backend

    async def get(self, key: str) -> Optional[str]:
        if isinstance(self.backend, Redis):
            return await self._redis_get(key)
        # PostgreSQL CacheService
        return await self._postgres_get(key)

    async def set(self, key: str, value: str, ttl_seconds: int = 3600) -> None:
        if isinstance(self.backend, Redis):
            await self._redis_set(key, value, ttl_seconds)
        else:
            await self._postgres_set(key, value, ttl_seconds)
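
The PostgreSQL side of that facade boils down to two SQL statements against `cache_store`: an upsert and a TTL-checked select. Here's a synchronous sketch, using sqlite3 as a stand-in so it runs anywhere; the statements are the same shape you'd send to CloudSQL, and the key name is made up for illustration:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cache_store (
        key TEXT PRIMARY KEY,
        value TEXT NOT NULL,
        expires_at REAL NULL
    )
""")

def cache_set(key: str, value: str, ttl_seconds: int = 3600) -> None:
    # Upsert: same shape as PostgreSQL's INSERT ... ON CONFLICT DO UPDATE.
    conn.execute(
        "INSERT INTO cache_store (key, value, expires_at) VALUES (?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value, "
        "expires_at = excluded.expires_at",
        (key, value, time.time() + ttl_seconds),
    )

def cache_get(key: str):
    # Expired rows count as misses; a periodic DELETE keeps the table small.
    row = conn.execute(
        "SELECT value FROM cache_store WHERE key = ? AND expires_at > ?",
        (key, time.time()),
    ).fetchone()
    return row[0] if row else None

cache_set("insight:42", "cached medical insight", ttl_seconds=60)
print(cache_get("insight:42"))
```

No eviction daemon, no connection pool to a second service; expiry is just a WHERE clause.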

This abstraction gave us the best of both worlds:

  • Zero infrastructure overhead for our current scale
  • Future-proof architecture if we need to scale up
  • Operational simplicity with one less moving part to monitor

Result: ₹8,000+ monthly savings by eliminating the Redis compute instance.


The Numbers Don’t Lie (And Neither Does My Bank Account)

Let’s talk cold, hard numbers:

Before Optimization:

  • Total Monthly GCP Bill: ₹60,000+
  • Network Intelligence Center: ₹75,000+ annually
  • Redis Compute Instance: ₹8,000+ monthly
  • Runway at current burn: 4-5 months
  • Sleep quality: Non-existent

After 24-Hour Surgery:

  • Total Monthly GCP Bill: ₹40,000 (33% reduction)
  • Network Intelligence Center: ₹0 (100% reduction)
  • Redis Compute Instance: ₹0 (100% reduction)
  • Annual savings: ₹240,000+
  • Extended runway: 50% longer (6-7 months to profitability)

The Real Impact:

That 50% runway extension? That’s not just numbers on a spreadsheet. That’s:

  • Two more months to validate product-market fit
  • ₹240,000 that can go toward actual product development
  • Peace of mind knowing our infrastructure won’t bankrupt us
  • Buffer time to secure our next funding round from a position of strength, not desperation

The Broader Lessons: What I Learned About Startup Cost Optimization

Lesson 1: IaC Can Be Infrastructure as Debt

Infrastructure as Code is powerful, but it’s also dangerous in the hands of optimistic founders. Every terraform apply or pulumi up is a financial commitment, and those commitments compound faster than you think.

Pro tip: Treat your IaC like code reviews. Every new service should answer: “What’s the monthly cost, and what happens if we don’t use it?”

Lesson 2: Default Settings Favor Providers, Not Startups

Cloud providers design defaults for enterprise reliability, not startup affordability. Network Intelligence Center, detailed monitoring, multi-region redundancy… they’re all enabled by default because enterprises will pay for them.

Startups need to flip this mindset. Start minimal, add complexity only when you can measure its value.

Lesson 3: Sometimes “Wrong” Architecture is Right for Your Business

Using PostgreSQL as a cache might make Redis purists cringe. But you know what makes me cringe more? Running out of money before we prove our hypothesis.

Architecture decisions should serve business outcomes, not engineering aesthetics.

Lesson 4: The Best Optimization is the Service You Don’t Use

Every service you don’t deploy is money you don’t spend. Every SKU you avoid is complexity you don’t manage. Sometimes the most elegant solution is the one that doesn’t exist.



Your Turn: A Startup Survival Checklist

If you’re a startup founder reading this at 2 AM (we’ve all been there), here’s your action plan:

Week 1: Bill Archaeology

  1. Export 6 months of cloud billing data
  2. Group by service, then drill down to SKU level
  3. Identify your top 5 cost centers
  4. Question everything… What value is each service actually providing?

Week 2: The Great Purge

  1. Disable monitoring/intelligence services you don’t actively use
  2. Right-size your compute instances (most startups over-provision by 2-3x)
  3. Eliminate redundancy you’re not ready for (multi-region can wait)
  4. Question every “just in case” service

Week 3: Architecture Reality Check

  1. Do you need that Redis instance? (Spoiler: probably not)
  2. Are you using managed services where simple solutions would work?
  3. Can you consolidate workloads onto fewer instances?
  4. What would happen if you turned off non-critical services for a week?

Week 4: Build Your Safety Nets

  1. Set up billing alerts at 50%, 75%, and 90% of your budget
  2. Create a monthly cost review process (15 minutes, every month)
  3. Document what you turned off and why (future-you will thank present-you)
  4. Build cost awareness into your deployment process
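
You can create those budget alerts in the console (Billing → Budgets & alerts) or with `gcloud billing budgets create`. The logic itself is trivial, which is exactly why there's no excuse to skip it; here's a toy sketch of the 50/75/90 check, assuming you pull current spend from your billing export (the figures are examples, not our real budget):

```python
def crossed_thresholds(spend: float, budget: float, thresholds=(0.5, 0.75, 0.9)):
    """Return the budget thresholds the current spend has crossed."""
    return [t for t in thresholds if spend >= budget * t]

# e.g. ₹31,000 spent against a ₹40,000 monthly budget
print(crossed_thresholds(31_000, 40_000))  # → [0.5, 0.75]
```

Wire the output to a Slack webhook or email and you have a poor man's billing alert while the official ones propagate.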

The Plot Twist: It’s Not Just About the Money

Here’s what I didn’t expect: optimizing our infrastructure costs didn’t just save money… it made us better engineers.

When you’re forced to question every service, every instance, every line item, you develop a deeper understanding of what your system actually needs versus what you think it needs. You start building with constraints, and constraints breed creativity.

Our PostgreSQL cache solution is actually more robust than our Redis setup was. We have better monitoring, more flexibility, and one less failure point. Sometimes limitations force innovation.


The Ending (Spoiler: We’re Still Here)

Three months later, QIQ Health is still running. Our infrastructure costs have stayed stable even as our usage has grown. We’ve used the saved money to hire a part-time developer and extend our runway until we hit profitability.

More importantly, we’ve built a culture of cost consciousness without sacrificing product quality. Every architectural decision now includes a cost consideration, not as an afterthought but as a first-class requirement.

The moral of the story? Sometimes the best code is the code you don’t write, and sometimes the best infrastructure is the infrastructure you don’t deploy.

Your startup’s survival might depend on what you choose NOT to build.


P.S. For the Infrastructure Purists Reading This

Yes, I know PostgreSQL isn’t technically a cache. Yes, I know Redis would be faster at scale. Yes, I know Network Intelligence Center provides valuable insights for large deployments.

But you know what’s more valuable than perfect architecture? A startup that’s still alive to iterate on its architecture.

Sometimes survival trumps purity. And sometimes, just sometimes, the “wrong” solution is exactly right.


If this helped you optimize your own infrastructure costs (or if you think I’m completely wrong), I’d love to hear about it. Hit me up on LinkedIn or check out what we’re building at QIQ Health.

And remember: the best architecture is the one that keeps your startup funded long enough to build the architecture you actually need.