💼 Business Community 🟡 Some learning curve 👤 Executives, business owners, marketers

💼 RAG Accuracy Optimization Tool

rag-accuracy-optimizer

A Skill that comprehensively supports optimizing RAG system accuracy: database design, chunking strategies, retrieval optimization, accuracy testing, and hallucination prevention.

⏱ Proposal drafting: 2 days → half a day
📜 Original English description (for reference)

Optimize accuracy for RAG (Retrieval-Augmented Generation) systems. Covers: DB schema design, chunking strategies, retrieval optimization, accuracy testing, and anti-hallucination safeguards. Use when: (1) designing or improving a RAG pipeline, (2) choosing the right chunking strategy, (3) optimizing retrieval accuracy (hybrid search, reranking, multi-query), (4) evaluating chunk quality or testing accuracy, (5) setting up monitoring & safeguards for RAG production, (6) choosing SQL vs Vector DB, (7) designing metadata schemas for domain-specific data (insurance, finance, healthcare, e-commerce).

🇯🇵 Notes for Japanese creators

※ This commentary was added by the jpskill.com editorial team for Japanese business users. It is reference information, independent of the Skill's actual behavior.

⚡ Recommended: one-line install (60 seconds)

Copy the command below and paste it into your terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automated.

🍎 Mac / 🐧 Linux
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o rag-accuracy-optimizer.zip https://jpskill.com/download/5285.zip && unzip -o rag-accuracy-optimizer.zip && rm rag-accuracy-optimizer.zip
🪟 Windows (PowerShell)
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/5285.zip -OutFile "$d\rag-accuracy-optimizer.zip"; Expand-Archive "$d\rag-accuracy-optimizer.zip" -DestinationPath $d -Force; ri "$d\rag-accuracy-optimizer.zip"

When it finishes, restart Claude Code, then just make a relevant request (for example, "improve my RAG retrieval accuracy") and the Skill activates automatically.

💾 Manual download (if the command route is difficult)
  1. Click the blue button below to download rag-accuracy-optimizer.zip
  2. Double-click the ZIP to extract it; a rag-accuracy-optimizer folder appears
  3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
  4. Restart Claude Code

⚠️ Download and use at your own risk. This site assumes no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

Read the description below to see what this Skill can do for you. It activates automatically when you give Claude a request in this area.

📦 Installation (3 steps)

  1. Click the "Download" button above to get the .skill file
  2. Rename the extension from .skill to .zip and extract it (macOS can auto-extract)
  3. Place the extracted folder in .claude/skills/ under your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you're done. You don't need to say "use this Skill": it is invoked automatically for relevant requests.

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 13

💬 Sample prompts: just say this

  • With RAG Accuracy Optimizer, review my RAG pipeline and propose three accuracy improvements
  • Use RAG Accuracy Optimizer to pick a chunking strategy for my document set
  • With RAG Accuracy Optimizer, diagnose my current retrieval issues and turn them into an action plan

Paste any of these into Claude Code and the Skill activates automatically.

📖 Original SKILL.md (the text Claude reads)

The body below is the original text (English or Chinese) that the AI (Claude) reads. Japanese translations are being added over time.

RAG Accuracy Optimizer

A skill for optimizing end-to-end accuracy in RAG systems.

Workflow Overview

Data Design → Chunking → Indexing → Retrieval → Generation → Testing → Monitoring

Each step impacts accuracy. Optimize each step in order.


1. Structured Data Design

SQL vs Vector DB — When to Use What?

Criteria | SQL (PostgreSQL, MySQL) | Vector DB (Pinecone, Qdrant, Weaviate)
Exact facts (price, date, product code) | ✅ Optimal | ❌ Not suitable
Semantic search (query meaning) | ❌ Not supported | ✅ Optimal
Aggregation (SUM, COUNT, AVG) | ✅ Native | ❌ Not supported
Fuzzy matching ("similar to...") | ⚠️ Limited | ✅ Optimal
Hybrid (recommended) | pgvector for both | Vector DB + SQL metadata store

Principle: Clearly structured data → SQL. Unstructured data requiring semantic understanding → Vector DB. Most production systems need both.

Schema Design Patterns by Domain

Insurance:

policies(policy_id, product_type, effective_date)
clauses(clause_id, policy_id, clause_number, title, content)
exclusions(exclusion_id, clause_id, description)
-- Vector: embedding for clause.content + exclusion.description

Finance:

securities(ticker, name, sector, exchange)
reports(report_id, ticker, period, report_type)
sections(section_id, report_id, heading, content)
-- Vector: embedding for section.content, metadata: ticker + period

Healthcare:

drugs(drug_id, generic_name, brand_name, category)
guidelines(guideline_id, condition, recommendation, evidence_level)
interactions(drug_a_id, drug_b_id, severity, description)
-- Vector: embedding for guidelines.recommendation

E-commerce:

products(product_id, name, category, brand, price)
reviews(review_id, product_id, rating, content)
specs(product_id, attribute, value)
-- Vector: embedding for review.content + product description

Metadata Tagging Strategy

Each chunk/document needs at minimum:

metadata = {
    "source": "policy_doc_v2.pdf",       # Origin
    "source_type": "pdf",                 # File type
    "domain": "insurance",                # Domain
    "category": "life_insurance",          # Classification
    "entity_id": "POL-2024-001",          # Related entity ID
    "section": "exclusions",              # Section in doc
    "chunk_index": 3,                      # Chunk position
    "total_chunks": 12,                    # Total chunks in doc
    "created_at": "2024-01-15",           # Creation date
    "version": "2.0",                      # Version
    "language": "en"                       # Language
}

Metadata principles:

  • Always include source for traceability and citation
  • entity_id enables pre-filtering before search → reduces noise
  • chunk_index + total_chunks enables fetching surrounding context
  • Domain-specific fields (clause_number, ticker, drug_id) vary by use case

Normalization vs Denormalization

 | Normalized | Denormalized
Pros | Less duplication, easy to update | Faster queries, fewer JOINs
Cons | Requires JOINs, slower | Duplication, harder to sync
Use when | Source of truth (SQL) | Vector store chunks

Recommendation: Normalized for SQL source → Denormalized when creating chunks for Vector DB. Each chunk should contain sufficient context, no JOINs needed at retrieval time.
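
A minimal sketch of this denormalization step, assuming the hypothetical insurance tables above (policies, clauses): each chunk is assembled with enough parent context that retrieval needs no JOIN.

def build_clause_chunk(policy: dict, clause: dict) -> dict:
    """Denormalize one clause row into a self-contained, retrieval-ready chunk."""
    text = (
        f"Product: {policy['product_type']} (effective {policy['effective_date']})\n"
        f"Clause {clause['clause_number']}: {clause['title']}\n"
        f"{clause['content']}"
    )
    return {
        "text": text,
        "metadata": {
            "entity_id": policy["policy_id"],        # enables pre-filtering
            "section": str(clause["clause_number"]),
            "domain": "insurance",
        },
    }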


2. Chunking Strategies

Detailed code examples: read references/chunking-patterns.md

Choosing the Right Strategy

Data has clear structure (clauses, sections)?
  → Semantic chunking (by heading/section)

Long, continuous data (articles, transcripts)?
  → Fixed size + overlap (512 tokens, 10-20% overlap)

Need both overview + detail?
  → Hierarchical chunking (parent-child)

Domain-specific with its own logical units?
  → Domain-specific chunking

Chunk Size Guidelines

Size | Use case | Trade-off
128-256 tokens | FAQ, short definitions | High precision, less context
256-512 tokens | Recommended default | Good balance
512-1024 tokens | Complex text, legal docs | More context, potential noise
>1024 tokens | Rarely used | Too much noise

Semantic Chunking

Split by meaning (section, topic) instead of fixed size:

# Split by markdown headings
# Split by paragraph breaks (\n\n)
# Split by topic change (using NLP or LLM detection)
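
A minimal sketch of the heading-based variant, using only the standard library:

import re

def split_by_headings(markdown: str) -> list[str]:
    """Split a markdown document into one chunk per heading section."""
    # Split *before* each heading line so the heading stays with its body.
    parts = re.split(r"\n(?=#{1,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]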

Overlap Strategy

  • 10-20% overlap between adjacent chunks
  • Ensures information at boundaries is not lost
  • The last 1-2 sentences of chunk N are repeated at the start of chunk N+1 (see the sketch below)
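
A minimal sketch of fixed-size chunking with overlap; token counting via tiktoken is an assumption here, any tokenizer works the same way:

import tiktoken

def chunk_with_overlap(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size token windows; each window repeats the tail of the previous one."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + size]))
        start += size - overlap  # 64 / 512 = 12.5% overlap
    return chunks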

Hierarchical Chunking (Parent-Child)

Document (summary)
  └── Section (heading + key points)
        └── Paragraph (details)
  • Search at paragraph level (most detailed)
  • When matched, pull parent section for additional context
  • Keep parent_id in metadata
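
A minimal sketch of the query-time parent lookup, assuming a hypothetical vector_db client and a parents dict keyed by parent_id:

def retrieve_with_parents(query, vector_db, parents, top_k=5):
    """Search the fine-grained index, then attach each hit's parent section."""
    hits = vector_db.search(query, top_k=top_k)  # paragraph-level index
    return [
        {
            "paragraph": hit["text"],
            "section": parents.get(hit["metadata"].get("parent_id"), ""),
        }
        for hit in hits
    ]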

Domain-Specific Chunking

  • Insurance: 1 chunk = 1 clause
  • Finance: 1 chunk = 1 report section, metadata = ticker + period
  • Healthcare: 1 chunk = 1 guideline/recommendation
  • E-commerce: 1 chunk = 1 review or 1 product description
  • Legal: 1 chunk = 1 article/clause/section

Metadata Enrichment Per Chunk

Each chunk should be enriched with:

  • Summary: 1-2 sentence content summary (LLM-generated)
  • Keywords: Key terms (supports BM25)
  • Questions: 2-3 questions this chunk can answer (hypothetical questions)
  • Entities: Named entities (product names, codes, dates)
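
A sketch of LLM-based enrichment, assuming the OpenAI Python client; the model choice and prompt wording are illustrative:

import json
from openai import OpenAI

client = OpenAI()

def enrich_chunk(text: str) -> dict:
    """Ask an LLM for a summary and hypothetical questions for one chunk."""
    prompt = (
        "Return JSON with keys 'summary' (1-2 sentences) and "
        "'questions' (2-3 questions this text answers).\n\n" + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)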

3. Retrieval Optimization

Detailed code examples: read references/retrieval-patterns.md

Recommended Retrieval Pipeline

User Query
  → Query Rewriting (expand/reformulate)
  → Multi-Query Generation (3-5 variants)
  → Metadata Filtering (narrow scope)
  → Hybrid Search (Vector + BM25)
  → Merge & Deduplicate
  → Reranking (top 20 → top 5)
  → Contextual Compression
  → LLM Generation (with citations)

Hybrid Search (Vector + BM25)

  • Vector search: Find by meaning (semantic similarity)
  • BM25 (keyword): Find by exact keywords (product names, codes)
  • Combined: Weighted fusion or Reciprocal Rank Fusion (RRF)
final_score = α × vector_score + (1-α) × bm25_score
# α = 0.7 is a good starting point, tune per domain
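
Reciprocal Rank Fusion as a minimal sketch; inputs are ranked lists of document IDs from each retriever:

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """RRF: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([vector_result_ids, bm25_result_ids])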

Query Rewriting

Use LLM to reformulate the user question for clarity:

User: "does insurance pay?"
→ Rewritten: "Under what circumstances does life insurance pay out benefits?"

Multi-Query

From 1 question, generate 3-5 variants → search each variant → merge results:

Original: "Which bank has the highest savings rate?"
Query 1: "Compare savings interest rates across banks 2024"
Query 2: "Bank with highest deposit rate currently"
Query 3: "Top banks with best deposit interest rates"

Reranking

After retrieval, use a reranking model to re-sort by relevance:

  • Cohere Rerank: Simple API, highly effective
  • Cross-encoder: More accurate than bi-encoder, but slower
  • GPT Rerank: Use LLM to evaluate relevance (expensive but flexible)

Retrieve top 20 → rerank → take top 3-5 for generation.
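
A minimal cross-encoder reranking sketch with sentence-transformers (the checkpoint name is a common public model; swap in your own):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    """Score (query, doc) pairs jointly, then keep the top_n documents."""
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]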

Contextual Compression

After reranking, compress each chunk: keep only the part relevant to the question.

Original chunk (500 tokens) → Compressed (150 tokens, relevant part only)

Reduces noise, saves context window, improves accuracy.
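
One cheap way to sketch this without an extra LLM call: score each sentence of the chunk against the query with a bi-encoder and keep only the most relevant sentences (an approximation of LLM-based compression):

import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def compress_chunk(query: str, chunk: str, keep: int = 3) -> str:
    """Keep the `keep` sentences most similar to the query, in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", chunk) if s]
    sims = util.cos_sim(model.encode(query), model.encode(sentences))[0]
    top = sorted(range(len(sentences)), key=lambda i: float(sims[i]), reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))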

Metadata Filtering

Narrow the search space BEFORE vector search:

# Instead of searching all 1M chunks:
filter = {"domain": "insurance", "product_type": "life"}
# Only search within ~50K relevant chunks
results = vector_db.search(query, filter=filter, top_k=20)

4. Accuracy Testing & Monitoring

Test Suite Design

Create ground truth Q&A pairs:

{
    "test_cases": [
        {
            "question": "Does life insurance pay out for suicide?",
            "expected_answer": "No payout within the first 2 years",
            "expected_source": "clause_15_exclusions.pdf",
            "category": "exclusions",
            "difficulty": "medium"
        }
    ]
}

Recommendation: Minimum 50-100 test cases, evenly distributed across categories and difficulty levels.

Metrics

Metric | Meaning | Target
Precision@K | % relevant results in top K | >0.8
Recall@K | % ground truth found in top K | >0.9
F1 | Harmonic mean of Precision and Recall | >0.85
MRR | Mean Reciprocal Rank: average position of first correct result | >0.8
NDCG | Normalized Discounted Cumulative Gain: ranking quality | >0.85
Answer Accuracy | % correct answers (human eval or LLM judge) | >0.9
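
A minimal sketch of two of these metrics over retrieved ID lists; ground truth is a set of relevant chunk IDs per query:

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)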

A/B Testing

Compare strategies by running the same test suite:

Config A: chunk_size=256, overlap=10%, no_rerank
Config B: chunk_size=512, overlap=20%, cohere_rerank
→ Compare MRR, NDCG, Answer Accuracy
→ Choose the config with better metrics

Error Analysis Framework

Classify errors to know where to optimize:

Error Type | Cause | Solution
Retrieval Miss | Correct chunk not found | Improve chunking, add hypothetical questions
Ranking Error | Correct chunk found but ranked low | Add reranking
Generation Error | Correct chunk but LLM answers wrong | Improve prompt, add few-shot examples
No Answer | Information not in DB | Expand knowledge base
Hallucination | LLM fabricates information | Enforce citations

Production Monitoring

Log each query:

log_entry = {
    "timestamp": "2024-01-15T10:30:00",
    "query": "...",
    "retrieved_chunks": [...],
    "reranked_chunks": [...],
    "answer": "...",
    "confidence": 0.85,
    "latency_ms": 450,
    "user_feedback": None  # thumbs up/down
}

Alerts:

  • Confidence persistently < 0.5 → review chunking/retrieval
  • Latency > 2s → optimize index or reduce top_k
  • Negative feedback > 20% → audit error patterns

5. Safeguards

Hallucination Prevention

Mandatory system prompt:

Answer ONLY based on the information provided in the context.
If you cannot find the information, respond: "I could not find this
information in the available data."
NEVER fabricate information.

Citation Enforcement

Require source citations:

Every answer must include [Source: file_name, section/clause].
If a specific source cannot be cited, mark it as "unverified".

Confidence Thresholds

if max_relevance_score < 0.3:
    return "No relevant information found."
elif max_relevance_score < 0.6:
    return answer + "\n⚠️ Low confidence. Please verify."
else:
    return answer + f"\n📎 Source: {sources}"

Answer Verification

Cross-check the answer with the DB:

  1. Extract claims from the answer (using LLM)
  2. Verify each claim against retrieved chunks
  3. Flag claims without supporting evidence
  4. Return only verified claims
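
A rough sketch of steps 2-3, flagging claims with no lexical support in the retrieved chunks; production systems typically use an LLM or NLI model as the judge instead of this word-overlap heuristic:

def verify_claims(claims: list[str], chunks: list[str], threshold: float = 0.5) -> list[dict]:
    """Flag claims whose content words are absent from every retrieved chunk."""
    corpus = " ".join(chunks).lower()
    results = []
    for claim in claims:
        words = [w for w in claim.lower().split() if len(w) > 3]
        support = sum(1 for w in words if w in corpus) / max(len(words), 1)
        results.append({"claim": claim, "supported": support >= threshold})
    return results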

6. Embedding Model Selection

Detailed comparison: read references/embedding-models.md

Quick Decision

Scenario | Model | Reason
Production, budget OK | Cohere embed-v4 | Highest MTEB score, input_type optimization
Production, low cost | OpenAI text-embedding-3-small | $0.02/1M tokens, good quality
Self-host, multilingual | BGE-M3 | Hybrid dense+sparse, 100+ languages, free
Self-host, Vietnamese | BGE-M3 or multilingual-e5-large | Best for Vietnamese RAG
POC / prototype | all-MiniLM-L6-v2 | 90MB, runs on CPU

Key Principles

  • Dimension reduction: OpenAI embed-3 supports Matryoshka — reduce 3072→512 with only ~3% quality loss
  • Normalize embeddings: Always normalize_embeddings=True when encoding for cosine similarity
  • Batch processing: Encode in batches (256-2000 items) instead of one at a time
  • Consistency: Use the SAME model for indexing and querying
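
The same principles in a minimal sentence-transformers sketch (BGE-M3 shown; any model above works the same way):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # use the SAME model for indexing and querying

texts = ["chunk one ...", "chunk two ..."]
embeddings = model.encode(
    texts,
    batch_size=256,              # batch, don't encode one item at a time
    normalize_embeddings=True,   # required for cosine / dot-product search
)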

7. Vector DB Comparison

Detailed comparison + HNSW tuning: read references/vector-db-comparison.md

Quick Decision

Already have PostgreSQL and <5M vectors? → pgvector
Just prototype/POC? → ChromaDB
Production, want zero-ops? → Pinecone
Need performance + HNSW control? → Qdrant
Need hybrid BM25+vector built-in? → Weaviate

HNSW Tuning Quick Reference

Param | Default | Accuracy-critical | Speed-critical
M | 16 | 48-64 | 8-16
ef_construction | 200 | 400-500 | 100-200
ef (search) | 100 | 200-256 | 50-100

Trade-off: Higher M and ef → better recall but more RAM and slower. Tune per SLA.
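
How these knobs map onto a client, sketched with qdrant-client (values from the accuracy-critical column; collection name and vector size are illustrative):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=48, ef_construct=400),  # accuracy-critical build
)

# query_vec: the embedded user query; raise hnsw_ef for recall, lower it for speed.
hits = client.search(
    collection_name="docs",
    query_vector=query_vec,
    limit=20,
    search_params=models.SearchParams(hnsw_ef=256),
)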


8. Advanced Techniques

Detailed code examples: read references/advanced-rag.md

Late Chunking

Embed the entire document first, then pool embeddings by chunk boundaries. Each chunk retains context from surrounding text.

Traditional: Doc → Chunk → Embed each (loses context)
Late Chunking: Doc → Embed full → Pool by boundaries (retains context)

Use when: Documents have many co-references ("it", "this", "the package"). Quality gain: +5-10%.
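
The pooling step as a minimal sketch: given per-token embeddings of the full document (from any long-context embedding model) and chunk boundaries as token offsets, mean-pool each span:

import numpy as np

def late_chunk(token_embeddings: np.ndarray, boundaries: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool full-document token embeddings over each chunk's token span."""
    # token_embeddings: (n_tokens, dim), computed over the WHOLE document,
    # so every token vector already carries surrounding context.
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in boundaries])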

RAPTOR (Recursive Abstractive Processing)

Build a multi-level summary tree: Level 0 (chunks) → Level 1 (summaries) → Level 2 (summary of summaries).

Use when: Need to answer both broad queries ("Compare all insurance packages") and narrow queries ("Clause X of Package Y"). Quality gain: +10-15%.

GraphRAG (Microsoft)

Build a knowledge graph from documents → detect communities → summarize communities → query via map-reduce.

Use when: Multi-hop reasoning, synthesize across many documents. Quality gain: +15-25% for synthesis queries. High overhead (many LLM calls when building the graph).

Combining Techniques (Production Stack)

1. Late Chunking → better embeddings
2. Hybrid Search (BM25 + vector) → high recall
3. Reranking (Cohere/Cross-encoder) → high precision
4. RAPTOR → multi-level retrieval (optional)
5. GraphRAG → synthesis queries (optional, high cost)

9. Performance Optimization

Caching Layer

# Cache embeddings (avoid re-computation)
import hashlib, json, redis

r = redis.Redis()

def cached_embed(text, model):
    key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    embedding = model.encode([text])[0].tolist()
    r.setex(key, 3600, json.dumps(embedding))  # TTL 1h
    return embedding

# Cache search results (avoid re-searching)
def cached_search(query, search_fn, ttl=300):
    key = f"search:{hashlib.md5(query.encode()).hexdigest()}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    results = search_fn(query)
    r.setex(key, ttl, json.dumps(results))
    return results

Async Retrieval

import asyncio

async def parallel_retrieve(query, retrievers):
    """Run multiple retrievers concurrently; each must expose an async .search()."""
    results = await asyncio.gather(*(r.search(query) for r in retrievers))
    return merge_and_deduplicate(results)

def merge_and_deduplicate(result_lists):
    """Flatten result lists, keeping the first occurrence of each chunk id."""
    seen = {}  # assumes each hit carries a stable "id"
    for hit in (h for results in result_lists for h in results):
        seen.setdefault(hit["id"], hit)
    return list(seen.values())

HNSW Index Tuning

See details in references/vector-db-comparison.md HNSW section. Key: tune ef (search) per latency SLA, tune M per recall target.


10. Vietnamese-Specific RAG

Details: read references/vietnam-nlp.md

Key Challenges

Issue | Solution
Diacritics (with vs. without) | Dual indexing: index both versions
Compound words ("bảo hiểm") | Word segmentation (underthesea)
Abbreviations (BHXH, TTCK, BLLĐ) | Abbreviation expansion dictionary
Vietnamese proper names | NER with underthesea/PhoBERT
Domain terms (finance, law, medical) | Domain-specific term enrichment

Embedding Models for Vietnamese

  • BGE-M3: Best overall — hybrid dense+sparse, 100+ languages
  • multilingual-e5-large: Good alternative — retrieval-optimized
  • PhoBERT-v2: Best for NER/classification (needs fine-tuning for retrieval)

Preprocessing Pipeline

Input text
  → Unicode normalize (NFC)
  → Expand abbreviations (BHXH → Social Insurance)
  → Domain term enrichment
  → Dual index: original + no-diacritics version
  → Extract entities → metadata
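
A minimal standard-library sketch of the normalization and dual-index steps (the abbreviation dictionary is a tiny illustrative sample):

import unicodedata

ABBREVIATIONS = {"BHXH": "bảo hiểm xã hội"}  # illustrative; build the real dict per domain

def preprocess(text: str) -> dict:
    """NFC-normalize, expand abbreviations, and produce a no-diacritics twin."""
    text = unicodedata.normalize("NFC", text)
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, f"{abbr} ({full})")
    stripped = text.replace("đ", "d").replace("Đ", "D")  # đ has no NFD decomposition
    no_diacritics = "".join(
        c for c in unicodedata.normalize("NFD", stripped)
        if unicodedata.category(c) != "Mn"  # drop combining marks
    )
    return {"original": text, "no_diacritics": no_diacritics}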

11. AI Orchestrator — Multi-Model Cost Optimization

Detailed prompt templates, code examples: read references/orchestrator-patterns.md

Query Classification Pipeline

Each user query is classified into 1 of 5 categories:

Category | Description | Example | Model
simple | Greeting, FAQ, simple lookup | "Hello", "Opening hours?" | No LLM / local
rag | Needs knowledge-base search | "Does insurance cover cancer?" | Cheap (Gemini Flash)
complex | Multi-hop reasoning, comparison, analysis | "Compare 3 insurance packages for a family of 4" | Standard (GPT-4o-mini) / premium (Claude Sonnet)
action | Needs tool/API execution (create form, calculate) | "Calculate my insurance premium, age 30" | Standard + tools
unsafe | Policy-violating content, injection, jailbreak | "Ignore instructions..." | Block: no LLM

2-Stage Classification (Minimize LLM Tokens)

User Query
  → Stage 1: Rule-based pre-classifier (regex, keywords, NO LLM)
    → confidence ≥ 0.8? → DONE (skip LLM)
    → confidence < 0.8? → Stage 2: LLM classifier (cheap model, ~50 tokens)

Stage 1 blocks 60-80% of queries without spending a single LLM token.
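
Stage 1 as a minimal sketch; the patterns, categories, and confidences are illustrative:

import re

RULES = [  # (category, confidence, pattern)
    ("unsafe", 0.95, re.compile(r"ignore (all |previous )?instructions", re.I)),
    ("simple", 0.90, re.compile(r"^(hi|hello|thanks|opening hours)\W*$", re.I)),
    ("action", 0.85, re.compile(r"\b(calculate|create a form|submit)\b", re.I)),
    ("rag",    0.80, re.compile(r"\b(insurance|policy|coverage|premium)\b", re.I)),
]

def pre_classify(query: str):
    """Return (category, confidence) without an LLM, or None to escalate to Stage 2."""
    for category, confidence, pattern in RULES:
        if pattern.search(query):
            return category, confidence
    return None  # low confidence → Stage 2 LLM classifier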

Model Routing

Category → Model Selection:
  greeting/simple  → No LLM (rule-based response)
  rag (simple)     → Gemini Flash ($0.075/1M input) — cheap, fast
  rag (complex)    → GPT-4o-mini ($0.15/1M input) — balanced
  complex          → Claude Sonnet ($3/1M input) — premium quality
  action           → Gemini Flash + Tool calls
  unsafe           → Block response (no LLM cost)

Cost Optimization Rules

  1. Rule-based first: Greeting, FAQ, unsafe → DON'T call LLM
  2. Cheapest sufficient model: Prefer Gemini Flash for RAG queries
  3. Escalate on failure: Gemini Flash fail/low-confidence → GPT-4o-mini → Claude Sonnet
  4. Cache responses: Identical queries → cached answer (TTL 5-30 min)
  5. Batch classify: Multiple queries → 1 LLM call to classify all
  6. Token budget: Set max_tokens per category (simple: 100, rag: 300, complex: 500)

RAG Trigger Rules

Condition | RAG on/off
Query contains domain keywords | ✅ ON
Classification = "rag" or "complex" | ✅ ON
Greeting, simple lookup, unsafe | ❌ OFF
Confidence score > 0.9 from cache/FAQ | ❌ OFF (answer from cache)

Tool Trigger Rules

Condition | Tools
Query requests a calculation (fees, interest) | calculator tool
Query requests form creation/submission | form_builder tool
Query requests a real-time lookup (price, exchange rate) | api_lookup tool
Classification ≠ "action" | No tools

JSON Output Format

{
  "category": "rag",
  "confidence": 0.92,
  "risk_level": "low",
  "model": "gemini-flash",
  "rag_enabled": true,
  "tools": [],
  "max_tokens": 300,
  "reasoning": "User asks about insurance benefits — needs knowledge base search"
}

Scripts

eval_ragas.py

RAGAS evaluation pipeline. Run:

python3 scripts/eval_ragas.py --test-file eval_dataset.json --output results.json
python3 scripts/eval_ragas.py --test-file eval_dataset.json --metrics faithfulness,answer_relevancy

Input: JSON file with test cases (question, answer, contexts, ground_truth). Output: metrics report + threshold checks. Requires: pip install ragas langchain-openai datasets

embedding_benchmark.py

Benchmark embedding models on a Vietnamese dataset. Run:

python3 scripts/embedding_benchmark.py --models bge-m3,multilingual-e5 --dataset vi_pairs.json
python3 scripts/embedding_benchmark.py --models all --quick  # Use built-in test pairs

Input: JSON file with query-positive-negative pairs. Output: accuracy + latency comparison. Requires: pip install sentence-transformers numpy torch

chunk_optimizer.py

Evaluate chunk quality. Run:

python3 scripts/chunk_optimizer.py --input chunks.jsonl --output report.json

Input: JSONL file, each line is {"text": "...", "metadata": {...}}. Output: quality report with scores.

accuracy_test.py

Test framework for RAG accuracy. Run:

python3 scripts/accuracy_test.py --test-file tests.json --results-dir ./results

Input: JSON file with test cases (question, expected_answer, expected_source). Output: metrics report.


References

  • references/chunking-patterns.md — Python code examples for chunking strategies
  • references/retrieval-patterns.md — Code examples for hybrid search, reranking, multi-query
  • references/embedding-models.md — Detailed embedding model comparison (OpenAI, Cohere, BGE-M3, PhoBERT...)
  • references/vector-db-comparison.md — Vector DB comparison + HNSW tuning guide
  • references/advanced-rag.md — Late Chunking, RAPTOR, GraphRAG with code examples
  • references/testing-frameworks.md — RAGAS, LLM-as-Judge, Adversarial testing
  • references/vietnam-nlp.md — Vietnamese NLP: diacritics, abbreviations, NER, domain terms
  • references/orchestrator-patterns.md — Multi-model orchestrator: prompt templates, rule-based pre-classifier, cost comparison, fallback chain, monitoring

Bundled files

※ List of the files included in the ZIP. In addition to `SKILL.md` itself, it may contain reference docs, samples, and scripts.