Case study: 18 hours a week, automated with one agent
How Astral Mantra Labs cut 18 hours a week of manual operations for an anonymised mid-sized hospitality client — the workflow we automated, the architecture we shipped, the numbers in production, and what we'd do differently next time.
TL;DR
Client: mid-sized hospitality operator (anonymised), ~40 properties, Nepal + India.
Problem: three ops staff spent 6+ hours each, every week, manually triaging guest emails, drafting replies, syncing bookings between channels, and chasing missing payment confirmations.
What we shipped: a single workflow agent reading every inbound guest email, classifying it, drafting a reply grounded in the property knowledge base, syncing the channel manager and PMS, and escalating only the genuine edge cases. Built in 6 weeks. Live for 7 months.
Result: 18 hours/week saved across the ops team, 96% inbound email classification accuracy, 22% faster guest-reply time, 11.4× ROI in the first year.
The problem
The client runs a portfolio of ~40 boutique hospitality properties across Nepal and India. Mid-sized operator. Strong brand. Heavy email volume. Their ops team had three full-time staff, each spending 6+ hours a week on what they called "the inbox":
- Reading every inbound guest email and figuring out which property and which booking it referred to.
- Drafting a reply — usually a copy-paste from a Notion doc of templates, edited to context.
- If the email referred to a booking change, manually updating the channel manager and the property's local PMS.
- Chasing payment confirmations between the bank, the gateway, and the booking record.
Combined: roughly 18 hours of repetitive work per week, by their own log. The problem wasn't volume — they could keep up. The problem was that this work was eating hours that should have gone to actual guest experience and revenue management.
What we scoped in discovery
Discovery ran two weeks. We did three things:
- Mapped the actual process. Sat with the ops staff for two days, watched them handle real emails, took notes. Documented every tool, every decision, every edge case.
- Sampled and labelled inbound email. Pulled three months of email history (~14,000 messages) and classified them into 11 categories: booking inquiry, booking change, payment confirmation, payment dispute, complaint, special request, post-stay feedback, B2B partner, internal, spam, and "actually needs a human."
- Identified what the agent could and couldn't do. Eight of the 11 categories are largely deterministic-with-judgement, which makes them a good fit for an agent. The other three (genuine complaints, payment disputes, and "actually needs a human" escalations) need a human, full stop.
We came back with a fixed-price discovery deliverable: written spec, system diagram, data model, integration list (Gmail, channel manager API, PMS API, payment gateway, Slack), model + infra decision, evaluation plan, and a fixed-price quote for a 6-week build.
What we shipped
A single workflow agent with the following architecture:
- Email ingest. Gmail webhook fires on every new inbound to the ops inbox. Message goes to the agent's queue.
- Property + booking resolution. The agent extracts the property, guest, and booking ID from the email — using a combination of structured matching (for emails with booking refs) and LLM extraction (for free-form messages).
- Classification. The agent classifies the email into one of the 11 categories. Confidence threshold of 0.85 — below that, the email goes to a human.
- Knowledge retrieval. For replies, the agent retrieves the relevant property's knowledge base (check-in time, amenities, local recommendations, policies) via RAG.
- Reply drafting. The agent drafts a reply in the brand voice, grounded in the retrieved knowledge. For booking changes, it also drafts the change action against the channel manager and PMS.
- Human checkpoint (first 4 weeks). Every reply was reviewed by a human before sending. We measured the override rate — how often the human edited the draft — and used that as the quality metric.
- Auto-send (after week 4). Once the override rate dropped below 8% on Tier-1 categories (the ones cleared for the agent to handle end to end), we cut over to auto-send for those categories. Edge cases still routed to humans.
- Slack escalation. Any email tagged as complaint, dispute, or escalation gets Slack-pinged to the right ops person, with the full thread and a suggested response already drafted.
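The routing rules in the list above can be sketched in a few lines. This is a minimal illustration, not the production code: the 0.85 confidence threshold and the 8% override gate come from the write-up, while the category identifiers, the function names, and the order in which the checks run are assumptions made for the example.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # from the write-up: below this, a human handles the email
OVERRIDE_RATE_GATE = 0.08    # from the write-up: auto-send once overrides drop below 8%

# Hypothetical identifiers; the write-up names the categories but not their labels.
ESCALATION_CATEGORIES = {"complaint", "payment_dispute", "needs_human"}


@dataclass
class Classification:
    category: str
    confidence: float


def override_rate(reviewed: int, edited: int) -> float:
    """Share of human-reviewed drafts that the reviewer edited before sending."""
    # With no reviewed drafts yet, report 1.0 so the category is treated as not ready.
    return edited / reviewed if reviewed else 1.0


def route(result: Classification, category_override_rate: float) -> str:
    """Return 'slack_escalation', 'human_review', or 'auto_send'."""
    # Assumed ordering: human-only categories are escalated before any other check.
    if result.category in ESCALATION_CATEGORIES:
        return "slack_escalation"
    # Low-confidence classifications always go to a person.
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    # A category stays human-gated until its override rate is below the gate.
    if category_override_rate >= OVERRIDE_RATE_GATE:
        return "human_review"
    return "auto_send"
```

Keeping the gate per category means one noisy category cannot hold back auto-send for the rest, which matches the staged cut-over described above.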
What we used under the hood
- Reasoning core: GPT-4-class model from a frontier vendor. We also tested an open-source equivalent; its accuracy was 4 percentage points lower at the categorisation step, which mattered.
- Retrieval: pgvector over the property knowledge base. ~6,000 chunks across 40 properties.
- Orchestration: n8n self-hosted, with a custom-coded agent step.
- Integrations: Gmail API, channel manager (a major industry vendor — anonymised), client's PMS (custom, REST API), Stripe + a local Nepali payment gateway, Slack.
- Evaluation: regression suite of 250 labelled emails covering all 11 categories. Run on every deploy. Drift dashboard checked weekly.
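A regression suite of this kind can be as small as a per-category accuracy check that blocks a deploy when any category slips. The sketch below is an assumption about shape only: the function names are hypothetical, the classifier is passed in as a plain callable, and the pass bar (`floor`) is a placeholder, since the write-up does not state one.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def per_category_accuracy(
    labelled: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Dict[str, float]:
    """Accuracy per expected category over (email_text, expected_category) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for text, expected in labelled:
        totals[expected] += 1
        if classify(text) == expected:
            correct[expected] += 1
    return {category: correct[category] / totals[category] for category in totals}


def failing_categories(accuracy: Dict[str, float], floor: float) -> List[str]:
    """Categories whose accuracy is below the (hypothetical) pass bar, sorted by name."""
    return sorted(category for category, value in accuracy.items() if value < floor)
```

In a deploy pipeline, a non-empty result from `failing_categories` would fail the build. Reporting per category rather than one overall number keeps a regression in a small category from hiding behind a large one.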
The numbers after 7 months in production
- Email volume handled: ~16,500/month at peak. Zero growth in ops headcount — they've reassigned the freed time to revenue management.
- Hours saved per week: 18, measured by the same time-log methodology the team used pre-launch.
- Classification accuracy: 96.2% on the held-out test set. Accuracy on the emails the agent actually handles in production is higher, because it abstains on low-confidence cases and routes them to a human.
- Auto-send rate: 71% of replies sent without a human in the loop. The other 29% are either edge cases or categories we deliberately keep human-gated.
- Guest-reply time: down 22% on average. Median reply time on booking inquiries is down 64%, and those inquiries convert faster as a result.
- ROI in year 1: 11.4× the build cost. By month 3, the system had paid for itself.
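The gap between held-out accuracy and production accuracy in the list above comes from abstention: the agent is scored only on the emails it chooses to handle. A minimal sketch of that calculation follows. The function name and the tuple layout are assumptions for illustration, and no client data is implied.

```python
from typing import Iterable, Tuple


def accuracy_with_abstention(
    predictions: Iterable[Tuple[str, str, float]],
    threshold: float,
) -> Tuple[float, float, float]:
    """Return (overall_accuracy, handled_accuracy, coverage).

    Each prediction is (predicted_category, expected_category, confidence).
    'Handled' means confidence >= threshold; everything else is an abstention.
    """
    rows = list(predictions)
    if not rows:
        raise ValueError("need at least one prediction")
    handled = [row for row in rows if row[2] >= threshold]
    overall = sum(pred == exp for pred, exp, _ in rows) / len(rows)
    coverage = len(handled) / len(rows)
    # With nothing handled there is no accuracy to report; 0.0 is a placeholder.
    handled_accuracy = (
        sum(pred == exp for pred, exp, _ in handled) / len(handled) if handled else 0.0
    )
    return overall, handled_accuracy, coverage
```

Raising the threshold trades coverage for accuracy on handled emails, so the two numbers are best read together with the auto-send rate.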
What we'd do differently
Honesty is more useful than a victory lap. Three things we'd change with hindsight:
- Start the human-checkpoint phase shorter. We ran 4 weeks of human review before cutting over to auto-send. The data showed the agent was ready at week 2.5. We left ~5 days of savings on the table by being conservative.
- Build the dashboard sooner. The ops team didn't trust the agent until they could see what it was doing. We added a live dashboard in week 5. It should have shipped in week 2 — adoption would have happened faster.
- Document the override policy in writing earlier. "When does the agent escalate vs handle?" was clear in our heads but ambiguous to the ops team for the first month. A written one-pager from week one would have prevented a lot of small confusion.
What it cost
- Build: low five figures USD, fixed-price after the discovery.
- Operate-and-improve retainer in months 1–6: low four figures USD/month, mostly model tuning and adding new properties to the knowledge base.
- Model API costs: ~USD 280/month at peak email volume, passed through transparently.
The full breakdown is consistent with our guide to AI development cost in Nepal.
The pattern behind this case study
The reason this worked isn't unique to hospitality. The pattern is:
- One workflow that costs your team many hours every week.
- Mostly deterministic, but with enough judgement that pure rules fail.
- An LLM grounded in your domain knowledge, with tool calls into your existing systems.
- A human-in-the-loop ramp before auto-send.
- Evaluation harness from day one.
If you have a workflow that fits this shape, an agent is very likely to pay for itself. Send us the workflow and we'll tell you within a day whether it fits.