
🔢 4-bit quantization with AWQ (3x faster)

awq-quantization

An AWQ Skill that compresses LLMs to 4-bit using activation-aware weight quantization.

📜 Original English description (for reference)

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and places the Skill automatically.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o awq-quantization.zip https://jpskill.com/download/122.zip && unzip -o awq-quantization.zip && rm awq-quantization.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/122.zip -OutFile "$d\awq-quantization.zip"; Expand-Archive "$d\awq-quantization.zip" -DestinationPath $d -Force; ri "$d\awq-quantization.zip"

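When it finishes, restart Claude Code. The Skill then triggers automatically when you ask for a related task in plain language, for example "quantize this model to 4-bit with AWQ".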

💾 Manual download (if the command line is unfamiliar)

1. Download awq-quantization.zip from https://jpskill.com/download/122.zip
2. Double-click the ZIP file to extract it; this creates an awq-quantization folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The Skill's description explains what it does for you. When you ask Claude for something in this area, the Skill is invoked automatically.

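📦 Installation (3 steps)

1. Get the .skill file from this page's download link
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for related requests.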

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 3

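💬 Just ask like this: sample prompts

› Show me minimal sample code using 4-bit quantization with AWQ (3x faster)
› Explain the main usage and caveats of 4-bit quantization with AWQ (3x faster)
› Show me how to integrate 4-bit quantization with AWQ (3x faster) into an existing project

Paste any of these into Claude Code and the Skill triggers automatically.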

📜 Original SKILL.md (the English text Claude reads)

AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving roughly 2.5-3x faster inference than FP16 with minimal accuracy loss.

When to use AWQ

Use AWQ when:

- You need 4-bit quantization with <5% accuracy loss
- You are deploying instruction-tuned or chat models (AWQ generalizes better)
- You want ~2.5-3x inference speedup over FP16
- You use vLLM for production serving
- You have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

Use GPTQ instead when:

- You need maximum ecosystem compatibility (more tools support GPTQ)
- You are working with the ExLlamaV2 backend specifically
- You have older GPUs without Marlin support

Use bitsandbytes instead when:

- You need zero calibration overhead (quantize on-the-fly)
- You want to fine-tune with QLoRA
- You prefer simpler integration

Quick start

Installation

# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
# (the quotes keep shells such as zsh from interpreting the brackets)
pip install "autoawq[kernels]"

# Intel CPU/XPU optimization
pip install "autoawq[cpu]"

Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
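
The requirements above can be checked before installing. The following is a minimal sketch, not part of AutoAWQ: `meets_awq_requirements` is a hypothetical helper that compares (major, minor) version tuples against the minimums listed above. On a real machine the inputs could come from `sys.version_info`, `torch.version.cuda`, and `torch.cuda.get_device_capability()`.

```python
import sys

# Minimums copied from the requirements line above.
MIN_PYTHON = (3, 8)
MIN_CUDA = (11, 8)
MIN_COMPUTE_CAPABILITY = (7, 5)


def meets_awq_requirements(python_version, cuda_version, compute_capability):
    """Return a list of problems; an empty list means every check passes.

    Hypothetical helper for illustration. All arguments are (major, minor) tuples.
    """
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.8+ required")
    if tuple(cuda_version[:2]) < MIN_CUDA:
        problems.append("CUDA 11.8+ required")
    if tuple(compute_capability[:2]) < MIN_COMPUTE_CAPABILITY:
        problems.append("Compute Capability 7.5+ required")
    return problems


# Current interpreter, CUDA 12.1, and an RTX 4090 (compute capability 8.9).
print(meets_awq_requirements(sys.version_info, (12, 1), (8, 9)))

# A V100 (compute capability 7.0) fails only the GPU check.
print(meets_awq_requirements((3, 10), (11, 8), (7, 0)))
```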

Load pre-quantized model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantize your own model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,      # Use zero-point quantization
    "q_group_size": 128,     # Group size (128 recommended)
    "w_bit": 4,              # 4-bit weights
    "version": "GEMM"        # GEMM for batch, GEMV for single-token
}

# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")

Timing: ~10-15 min for 7B, ~1 hour for 70B models.

AWQ vs GPTQ vs bitsandbytes

| Feature | AWQ | GPTQ | bitsandbytes |
| --- | --- | --- | --- |
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |

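Key insight: AWQ starts from the observation that not all weights are equally important. It protects the roughly 1% of salient weight channels, identified from activation magnitudes, by scaling them before quantization. This reduces quantization error without the overhead of keeping those weights in mixed precision.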
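
To make the idea concrete, here is a toy sketch in plain Python, not AutoAWQ's implementation. It quantizes a random weight matrix to 4 bits twice: once with plain round-to-nearest, and once after scaling up the one input channel that carries large activations. The sizes, the salient channel index, and the fixed scale of 2 are all illustrative assumptions; AWQ itself derives the scales from calibration activations.

```python
import random

random.seed(0)
N_OUT, N_IN, N_SAMPLES = 64, 128, 32
SALIENT = 0   # index of the input channel with large activations (assumed)
SCALE = 2.0   # fixed illustrative scale; AWQ searches for this value


def quantize_row(row, n_bits=4):
    """Asymmetric round-to-nearest quantization of one weight group, then dequantize."""
    qmax = 2 ** n_bits - 1
    lo, hi = min(row), max(row)
    step = (hi - lo) / qmax
    zero = round(-lo / step)
    return [(min(max(round(w / step) + zero, 0), qmax) - zero) * step for w in row]


def output_mse(weights, approx, inputs):
    """Mean squared error between x @ W^T and x @ W_approx^T."""
    total = 0.0
    for x in inputs:
        for w_row, a_row in zip(weights, approx):
            diff = sum((w - a) * xj for w, a, xj in zip(w_row, a_row, x))
            total += diff * diff
    return total / (len(inputs) * len(weights))


weights = [[random.gauss(0, 1) for _ in range(N_IN)] for _ in range(N_OUT)]
inputs = [[random.gauss(0, 1) for _ in range(N_IN)] for _ in range(N_SAMPLES)]
for x in inputs:
    x[SALIENT] *= 50.0  # one channel carries much larger activations

# Plain round-to-nearest quantization.
plain = [quantize_row(row) for row in weights]

# Activation-aware variant: scale the salient channel up before quantizing,
# then fold the inverse scale back in (equivalent to dividing the input by it).
scaled = []
for row in weights:
    boosted = list(row)
    boosted[SALIENT] *= SCALE
    deq = quantize_row(boosted)
    deq[SALIENT] /= SCALE
    scaled.append(deq)

err_plain = output_mse(weights, plain, inputs)
err_scaled = output_mse(weights, scaled, inputs)
print(f"plain round-to-nearest error: {err_plain:.4f}")
print(f"scaled salient channel error: {err_scaled:.4f}")
```

Every weight is still stored in 4 bits in both cases. The scaled variant lowers the output error because the rounding error on the salient channel is divided by the scale when it is folded back.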

Kernel backends

GEMM (default, batch inference)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}

GEMV (single-token generation)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV"  # 20% faster for batch_size=1
}

Limitation: GEMV supports batch size 1 only and is a poor fit for large contexts.

Marlin (Ampere+ GPUs)

from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # 2x faster on A100/H100
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)

Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

ExLlamaV2 (AMD compatible)

from transformers import AwqConfig

config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)
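
The backend guidance in this section can be condensed into a small selection function. This is a sketch only: `pick_awq_backend` is a hypothetical helper, and the priority order (AMD first, then Marlin on Compute Capability 8.0+, then GEMV or GEMM by batch size) is an assumption rather than something the libraries prescribe.

```python
def pick_awq_backend(batch_size, compute_capability=None, amd_gpu=False):
    """Hypothetical helper: map the guidance above to a backend name.

    compute_capability is a (major, minor) tuple for NVIDIA GPUs, or None if unknown.
    """
    if amd_gpu:
        return "exllama"  # the ExLlamaV2 kernels are listed as AMD compatible
    if compute_capability is not None and tuple(compute_capability) >= (8, 0):
        return "marlin"   # requires Compute Capability 8.0+
    if batch_size == 1:
        return "GEMV"     # single-token generation
    return "GEMM"         # default, batch inference


print(pick_awq_backend(8, (7, 5)))          # GEMM
print(pick_awq_backend(1, (7, 5)))          # GEMV
print(pick_awq_backend(4, (8, 0)))          # marlin
print(pick_awq_backend(1, amd_gpu=True))    # exllama
```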

HuggingFace Transformers integration

Direct loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")

Fused modules (recommended)

from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)

Note: Fused modules cannot combine with FlashAttention2.

vLLM integration

from vllm import LLM, SamplingParams

# vLLM can detect AWQ from the model config; quantization="awq" makes it explicit
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
print(outputs[0].outputs[0].text)  # generated text for the first prompt

Performance benchmarks

Memory reduction

| Model | FP16 | AWQ 4-bit | Reduction |
| --- | --- | --- | --- |
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
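
The FP16 column follows from two bytes per parameter. The sketch below reproduces that arithmetic and gives a weights-only lower bound for 4-bit storage. The per-group metadata layout (one FP16 scale and one 4-bit zero point per group of 128) is an assumption made for illustration, and both function names are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def awq_bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight including per-group metadata (assumed layout)."""
    return w_bit + (scale_bits + zero_bits) / group_size


fp16 = weight_memory_gb(7e9, 16)
awq = weight_memory_gb(7e9, awq_bits_per_weight())
print(f"7B FP16 weights: {fp16:.1f} GB")  # 14.0 GB
print(f"7B AWQ weights:  {awq:.1f} GB")   # 3.6 GB
```

The measured 5.5 GB for Mistral 7B is higher than this lower bound. The gap is likely due to layers that stay in FP16 (such as embeddings) and runtime buffers, which the estimate ignores.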

Inference speed (RTX 4090)

| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
| --- | --- | --- | --- |
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |
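
The throughput figures translate into a rough per-request latency. This sketch assumes throughput stays constant regardless of context length, which is a simplification; `estimate_latency_s` is a hypothetical helper.

```python
def estimate_latency_s(prompt_tokens, new_tokens, prefill_tok_s, decode_tok_s):
    """Rough end-to-end latency: prefill time plus decode time."""
    return prompt_tokens / prefill_tok_s + new_tokens / decode_tok_s


# Mistral 7B GEMM row: 3,897 tok/s prefill, 114 tok/s decode.
latency = estimate_latency_s(1000, 200, prefill_tok_s=3897, decode_tok_s=114)
print(f"{latency:.2f} s")  # 2.01 s
```

For a 1,000-token prompt and 200 generated tokens, decoding accounts for most of the time, so the decode column matters more for chat-style workloads.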

Accuracy (perplexity)

| Model | FP16 | AWQ 4-bit | Degradation |
| --- | --- | --- | --- |
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |
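
The degradation column is the relative increase in perplexity over the FP16 baseline. A short sketch of that calculation, using the values from the table (`degradation_pct` is a hypothetical helper):

```python
def degradation_pct(fp16_ppl, awq_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return (awq_ppl - fp16_ppl) / fp16_ppl * 100


rows = {
    "Llama 3 8B": (8.20, 8.48),
    "Mistral 7B": (5.25, 5.42),
    "Qwen2 72B": (4.85, 4.95),
}
for name, (fp16_ppl, awq_ppl) in rows.items():
    print(f"{name}: +{degradation_pct(fp16_ppl, awq_ppl):.1f}%")
```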

Custom calibration data

# Use custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",       # Or custom list of strings
    max_calib_samples=256,       # More samples = better accuracy
    max_calib_seq_len=512        # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
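
When the domain text lives in a file, the sample list can be built with a few lines of standard-library Python. This is a sketch under stated assumptions: `load_calib_samples` is a hypothetical helper, the one-sample-per-line format is assumed, and the 32-character minimum is an arbitrary choice.

```python
from pathlib import Path


def load_calib_samples(path, max_samples=256, min_chars=32):
    """Read one sample per line, skipping short lines and exact duplicates."""
    seen = set()
    samples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        samples.append(text)
        if len(samples) >= max_samples:
            break
    return samples
```

The returned list can be passed as `calib_data` in the same way as `calib_samples` above.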

Multi-GPU deployment

from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}  # Per-GPU memory cap, keyed by device index
)
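
For machines with more GPUs, the `max_memory` mapping can be generated rather than typed out. A minimal sketch follows; `build_max_memory` is a hypothetical helper, and reserving some headroom per GPU is an assumption about what a deployment might want.

```python
def build_max_memory(n_gpus, per_gpu_gb, reserve_gb=0):
    """Build per-GPU memory caps in the {index: "NGB"} form used above."""
    usable = per_gpu_gb - reserve_gb
    if usable <= 0:
        raise ValueError("reserve_gb must be smaller than per_gpu_gb")
    return {i: f"{usable}GB" for i in range(n_gpus)}


print(build_max_memory(2, 40))                # {0: '40GB', 1: '40GB'}
print(build_max_memory(4, 24, reserve_gb=2))  # {0: '22GB', 1: '22GB', 2: '22GB', 3: '22GB'}
```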

Supported models

35+ architectures including:

- Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
- Qwen: Qwen, Qwen2, Qwen2.5-VL
- Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
- Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

Common issues

CUDA OOM during quantization:

# Reduce the number of calibration samples
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)

Slow inference:

# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)

AMD GPU support:

# Use ExLlama backend
from transformers import AwqConfig

config = AwqConfig(bits=4, version="exllama")

Deprecation notice

AutoAWQ is officially deprecated. For new projects, consider:

- vLLM llm-compressor: https://github.com/vllm-project/llm-compressor
- MLX-LM: For Mac devices with Apple Silicon

Existing quantized models remain usable.

References

Paper: AWQ: Activation-aware Weight Quantization (arXiv:2306.00978) - MLSys 2024 Best Paper
GitHub: https://github.com/casper-hansen/AutoAWQ
MIT Han Lab: https://github.com/mit-han-lab/llm-awq
Models: https://huggingface.co/models?library=awq

Bundled files

※ The files included in the ZIP. In addition to `SKILL.md` itself, it may contain reference material, samples, and scripts.

📄 SKILL.md (8,383 bytes)
📎 references/advanced-usage.md (7,983 bytes)
📎 references/troubleshooting.md (7,733 bytes)