Wang.se - AI Agency

Article5

Here’s a comprehensive AI agent architecture for automated customer support:

Architecture Overview

Router Agent — Classifies incoming queries and routes to the appropriate specialist agent.
FAQ Agent — Uses RAG over your knowledge base to answer common questions with cited sources.
Escalation Agent — Detects when a query requires human intervention and routes to a human agent with full conversation context.
Action Agent — Handles transactional requests (order status, password resets, account changes) by calling backend APIs via function calling.
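The routing step above can be sketched in a few lines. This is only an illustration under stated assumptions: the keyword lists, the `classify` and `route` functions, and the stub agents are all hypothetical, and a real router would more likely use an LLM or a trained classifier than keyword matching.

```python
from typing import Callable, Dict

# Hypothetical keyword rules standing in for a real classifier.
ESCALATION_KEYWORDS = ("human", "complaint", "lawyer")
ACTION_KEYWORDS = ("order status", "password", "reset", "account")


def classify(query: str) -> str:
    """Return the name of the agent that should handle the query."""
    text = query.lower()
    # Check escalation first so urgent requests are never swallowed by other rules.
    if any(keyword in text for keyword in ESCALATION_KEYWORDS):
        return "escalation"
    if any(keyword in text for keyword in ACTION_KEYWORDS):
        return "action"
    # Everything else falls through to the knowledge-base agent.
    return "faq"


# Stub agents: each returns a string describing what a real agent would do.
def faq_agent(query: str) -> str:
    return f"[FAQ] Searching the knowledge base for: {query}"


def escalation_agent(query: str) -> str:
    return f"[ESCALATION] Handing off to a human with context: {query}"


def action_agent(query: str) -> str:
    return f"[ACTION] Calling a backend API for: {query}"


AGENTS: Dict[str, Callable[[str], str]] = {
    "faq": faq_agent,
    "escalation": escalation_agent,
    "action": action_agent,
}


def route(query: str) -> str:
    """Classify the query and dispatch it to the matching agent."""
    return AGENTS[classify(query)](query)


if __name__ == "__main__":
    print(route("How do I reset my password?"))
```

Keeping classification separate from dispatch means the keyword rules can later be swapped for a model call without touching the agents themselves.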

Article4

Here’s a minimal brute-force vector similarity search implementation:

import numpy as np
from typing import List, Optional, Tuple

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    # The small epsilon avoids division by zero for all-zero vectors.
    return float(dot / (norm_a * norm_b + 1e-8))

class VectorSearchIndex:
    """Brute-force index: each search compares the query to every stored vector."""

    def __init__(self, dimension: int = 768):
        self.dimension = dimension
        self.vectors: List[np.ndarray] = []
        self.metadata: List[dict] = []

    def add(self, vector: np.ndarray, meta: Optional[dict] = None) -> None:
        """Add a vector to the index with optional metadata."""
        # Raise instead of assert: assertions are skipped under `python -O`.
        if vector.shape != (self.dimension,):
            raise ValueError(
                f"expected shape ({self.dimension},), got {vector.shape}"
            )
        self.vectors.append(vector)
        self.metadata.append(meta or {})

    def search(
        self,
        query: np.ndarray,
        top_k: int = 5
    ) -> List[Tuple[float, dict]]:
        """Find the top-k most similar vectors by cosine similarity."""
        similarities = [
            (cosine_similarity(query, vec), meta)
            for vec, meta in zip(self.vectors, self.metadata)
        ]
        # Highest similarity first.
        similarities.sort(key=lambda x: x[0], reverse=True)
        return similarities[:top_k]

For production use, consider using FAISS for GPU-accelerated search, or ScaNN for billion-scale datasets.
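Before reaching for a dedicated library, the same brute-force search can be made considerably faster by storing the vectors as one matrix and scoring them in a single NumPy operation. The sketch below assumes the vectors fit in memory as a 2-D array; the function name `top_k_cosine` is illustrative and not part of any library.

```python
import numpy as np


def top_k_cosine(matrix: np.ndarray, query: np.ndarray, top_k: int = 5):
    """Return (indices, scores) of the rows of `matrix` most similar to `query`."""
    k = min(top_k, matrix.shape[0])
    if k <= 0:
        return np.array([], dtype=int), np.array([], dtype=float)

    # Normalize rows and query so that a dot product equals cosine similarity.
    row_norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / np.maximum(row_norms, 1e-8)
    unit_query = query / max(float(np.linalg.norm(query)), 1e-8)

    # One matrix-vector product scores every stored vector at once.
    scores = unit_rows @ unit_query

    # argpartition selects the k best candidates without sorting everything;
    # only those k candidates are then sorted by descending score.
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


if __name__ == "__main__":
    vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    indices, scores = top_k_cosine(vectors, np.array([1.0, 0.0]), top_k=2)
    print(indices, scores)
```

This is still an exact, linear-time scan per query; it only removes Python-level loop overhead and does not replace an approximate nearest-neighbor index for very large collections.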

Article3

Here’s a comprehensive comparison of the leading LLMs for enterprise use:

GPT-4o (OpenAI)

Strengths: Best overall reasoning, multimodal, massive ecosystem
Context Window: 128K tokens
Best For: General-purpose enterprise apps, complex reasoning
Pricing: ~$2.50 / 1M input tokens

Claude 3.5 Sonnet (Anthropic)

Strengths: Nuanced writing, excellent code generation, strong safety
Context Window: 200K tokens
Best For: Long document analysis, code-heavy tasks
Pricing: ~$3.00 / 1M input tokens

Llama 3.1 405B (Meta)

Strengths: Open weights (Llama community license), self-hostable, customizable
Context Window: 128K tokens
Best For: Data sovereignty, custom fine-tuning, on-premise
Pricing: Infrastructure cost only (self-hosted)

For most enterprises, a multi-model strategy using GPT-4o for primary tasks and Llama for data-sensitive workloads provides the best balance of capability and compliance.
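To put the listed prices in perspective, here is a rough input-cost estimate. It is a sketch only: the prices are the approximate figures quoted above, the model keys are illustrative labels rather than official API identifiers, the traffic numbers are made up for the example, and output-token costs are not included.

```python
# Approximate input prices in USD per 1M tokens, taken from the comparison above.
INPUT_PRICE_PER_MILLION = {
    "gpt-4o": 2.50,
    "claude-3.5-sonnet": 3.00,
}


def monthly_input_cost(
    model: str,
    requests_per_day: int,
    tokens_per_request: int,
    days: int = 30,
) -> float:
    """Estimate the monthly input-token cost in USD for a hosted model."""
    total_tokens = requests_per_day * tokens_per_request * days
    return INPUT_PRICE_PER_MILLION[model] * total_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests per day at 2,000 input tokens each.
    for model in INPUT_PRICE_PER_MILLION:
        cost = monthly_input_cost(model, 10_000, 2_000)
        print(f"{model}: ${cost:,.2f} per month (input tokens only)")
```

With that hypothetical workload (600M input tokens per month) the estimate comes to $1,500 for GPT-4o and $1,800 for Claude 3.5 Sonnet. A self-hosted Llama deployment has no per-token fee, so its cost has to be estimated from infrastructure instead.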

Article2

Great question! Let me break down RAG (Retrieval-Augmented Generation) and its best practices.

RAG combines the power of large language models with external knowledge retrieval, creating systems that can access up-to-date information beyond their training data.

How RAG Pipelines Work

Document Ingestion — Source documents are chunked into manageable pieces (typically 256-1024 tokens) and converted into vector embeddings.
Vector Storage — Embeddings are stored in a vector database (Pinecone, Weaviate, Qdrant, or pgvector) with metadata for filtering.
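The chunking part of the ingestion step can be sketched as follows. This is a simplified illustration: it assumes the text is already tokenized (a whitespace split stands in for a real tokenizer), and the overlap between neighboring chunks is a common practice added here as an assumption, not something stated above.

```python
from typing import List


def chunk_tokens(
    tokens: List[str],
    chunk_size: int = 512,
    overlap: int = 64,
) -> List[List[str]]:
    """Split a token list into chunks of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    text = "RAG combines language models with external knowledge retrieval"
    # Whitespace split is a crude stand-in for a model-specific tokenizer.
    for chunk in chunk_tokens(text.split(), chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk would then be passed to an embedding model and stored with its metadata in the vector database.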

Tell me more about transformer architectures

Great question! Let me walk you through this step by step.

AI systems today leverage transformer architectures that process input through multiple attention layers, enabling contextual understanding at scale. The key components include:

Self-Attention Mechanism — Allows the model to weigh the importance of different parts of the input relative to each other, creating rich contextual representations.
Positional Encoding — Since transformers don’t process data sequentially, positional encodings inject order information so the model understands sequence.
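Both components can be illustrated with a small NumPy sketch. It makes several simplifying assumptions: a single attention head, random projection matrices in place of learned weights, and the sinusoidal encoding scheme (one common choice among several). The function names are illustrative.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # shape (1, ceil(d_model / 2))
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                    # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, : d_model // 2])  # odd dimensions
    return encoding


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each row of `scores` says how strongly one token attends to every token.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d_model = 4, 8
    # Random embeddings stand in for learned token embeddings.
    x = rng.normal(size=(seq_len, d_model))
    x = x + sinusoidal_positional_encoding(seq_len, d_model)
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    output, weights = self_attention(x, w_q, w_k, w_v)
    print(output.shape, weights.sum(axis=-1))
```

Without the added positional encoding, permuting the input rows would simply permute the output rows, which is why order information has to be injected explicitly.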