parquet-optimization

Proactively analyzes Parquet file operations and suggests optimization improvements for compression, encoding, row group sizing, and statistics. Activates when users are reading or writing Parquet files or discussing Parquet performance.

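⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, extracts it, and places the skill folder automatically.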

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o parquet-optimization.zip https://jpskill.com/download/19024.zip && unzip -o parquet-optimization.zip && rm parquet-optimization.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/19024.zip -OutFile "$d\parquet-optimization.zip"; Expand-Archive "$d\parquet-optimization.zip" -DestinationPath $d -Force; ri "$d\parquet-optimization.zip"

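When it finishes, restart Claude Code. The skill then activates automatically when you talk about reading, writing, or tuning Parquet files; no special command is needed.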

💾 Manual download (if you prefer not to use the command line)

1. Download parquet-optimization.zip from this page
2. Double-click the ZIP file to extract it; this creates a parquet-optimization folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the skill.

🎯 What this Skill does

The description below explains what this Skill does for you. It activates automatically when you ask Claude for something in this area.

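📦 How to install (3 steps)

1. Use the download link on this page to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill..."; it is invoked automatically for relevant requests.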

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

Parquet Optimization Skill

You are an expert at optimizing Parquet file operations for performance and efficiency. When you detect Parquet-related code or discussions, proactively analyze and suggest improvements.

When to Activate

Activate this skill when you notice:

Code using AsyncArrowWriter or ParquetRecordBatchStreamBuilder
Discussion about Parquet file performance issues
Users reading or writing Parquet files without optimization settings
Mentions of slow Parquet queries or large file sizes
Questions about compression, encoding, or row group sizing

Optimization Checklist

When you see Parquet operations, check for these optimizations:

Writing Parquet Files

1. Compression Settings

✅ GOOD: Compression::ZSTD(ZstdLevel::try_new(3)?)
❌ BAD: No compression specified (uses default)
🔍 LOOK FOR: Missing .set_compression() in WriterProperties

Suggestion template:

I notice you're writing Parquet files without explicit compression settings.
For production data lakes, I recommend:

WriterProperties::builder()
    .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
    .build()

On typical columnar data this gives roughly 3-4x compression at low CPU cost; the actual ratio depends on the data.

2. Row Group Sizing

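✅ GOOD: Row groups of roughly 100MB - 1GB uncompressed
❌ BAD: Default or very small row groups
🔍 LOOK FOR: Missing .set_max_row_group_size()

Suggestion template:

Your row groups might be too small for optimal S3 scanning.
Target 100MB-1GB uncompressed. Note that set_max_row_group_size takes a row count, not a byte size, so derive the value from your average row width:

WriterProperties::builder()
    // Maximum rows per row group; 100_000_000 rows stays under 1GB only if rows average about 10 bytes or less
    .set_max_row_group_size(100_000_000)
    .build()

Larger row groups reduce metadata overhead and the number of S3 requests, while row-group statistics still support predicate pushdown.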
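
In the parquet crate, `set_max_row_group_size` takes a maximum number of rows, so a byte target has to be converted into a row count first. The following is a minimal sketch of that arithmetic. The helper name `rows_per_row_group` and the idea of supplying an estimated average uncompressed row width are assumptions for illustration, not part of the parquet crate.

```rust
/// Convert an uncompressed byte target for a row group into a row count.
/// `avg_row_bytes` is an estimate measured or guessed for your own data.
pub fn rows_per_row_group(target_bytes: u64, avg_row_bytes: u64) -> usize {
    // Guard against a zero width so the division cannot panic.
    let width = avg_row_bytes.max(1);
    // Always allow at least one row per group.
    (target_bytes / width).max(1) as usize
}
```

For a 256 MiB target and rows averaging 128 bytes, this yields 2_097_152 rows, which could then be passed to `.set_max_row_group_size(...)`.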

3. Statistics Enablement

✅ GOOD: .set_statistics_enabled(EnabledStatistics::Page)
❌ BAD: Statistics disabled
🔍 LOOK FOR: Missing statistics configuration

Suggestion template:

Enable statistics for better query performance with predicate pushdown:

WriterProperties::builder()
    .set_statistics_enabled(EnabledStatistics::Page)
    .build()

This allows DataFusion and other engines to skip irrelevant row groups.

4. Column-Specific Settings

✅ GOOD: Dictionary encoding for low-cardinality columns
❌ BAD: Same settings for all columns
🔍 LOOK FOR: No column-specific configurations

Suggestion template:

For low-cardinality columns like 'category' or 'status', use dictionary encoding:

WriterProperties::builder()
    // Dictionary encoding is switched on with the dictionary flag.
    // set_column_encoding only sets the fallback encoding and panics on RLE_DICTIONARY.
    .set_column_dictionary_enabled(ColumnPath::from("category"), true)
    .set_column_compression(
        ColumnPath::from("category"),
        Compression::SNAPPY,
    )
    .build()
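
Deciding which columns count as "low-cardinality" usually means looking at a sample of the data. The sketch below estimates the share of distinct values in a sample and compares it with a threshold. The function names and the threshold parameter are hypothetical; Parquet itself defines no such cutoff.

```rust
use std::collections::HashSet;

/// Fraction of distinct values in a sample of a string column (0.0 for an empty sample).
pub fn distinct_ratio(sample: &[&str]) -> f64 {
    if sample.is_empty() {
        return 0.0;
    }
    let distinct: HashSet<&str> = sample.iter().copied().collect();
    distinct.len() as f64 / sample.len() as f64
}

/// Heuristic: treat a column as a dictionary-encoding candidate when few of
/// its sampled values are distinct. `max_ratio` is a hypothetical tuning knob.
pub fn is_dictionary_candidate(sample: &[&str], max_ratio: f64) -> bool {
    !sample.is_empty() && distinct_ratio(sample) <= max_ratio
}
```

A column such as 'status' with a handful of repeated values passes this check, while a column of unique identifiers does not.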

Reading Parquet Files

1. Column Projection

✅ GOOD: .with_projection(ProjectionMask::roots(...))
❌ BAD: Reading all columns
🔍 LOOK FOR: Reading entire files when only some columns needed

Suggestion template:

Reading all columns is inefficient. Use projection to read only what you need:

let projection = ProjectionMask::roots(&schema, vec![0, 2, 5]);
builder.with_projection(projection)

This can provide 10x+ speedup for wide tables.
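
Hard-coded indices such as `vec![0, 2, 5]` break silently when the schema changes. One option is to derive the root indices from column names first. This is a minimal sketch that models the schema as a plain list of top-level column names; the helper `projection_indices` is hypothetical and not part of the parquet crate.

```rust
/// Return the positions in `schema_columns` of the columns named in `wanted`,
/// in schema order. Names that are not in the schema are ignored.
pub fn projection_indices(schema_columns: &[&str], wanted: &[&str]) -> Vec<usize> {
    schema_columns
        .iter()
        .enumerate()
        .filter(|(_, name)| wanted.contains(*name))
        .map(|(idx, _)| idx)
        .collect()
}
```

The resulting vector can be handed to `ProjectionMask::roots` in place of the literal index list.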

2. Batch Size Tuning

✅ GOOD: .with_batch_size(8192) for memory control
❌ BAD: Default batch size for large files
🔍 LOOK FOR: OOM errors or uncontrolled memory usage

Suggestion template:

For large files, control memory usage with batch size tuning:

builder.with_batch_size(8192)

Adjust based on your memory constraints and throughput needs.

3. Row Group Filtering

✅ GOOD: Using statistics to filter row groups
❌ BAD: Reading all row groups
🔍 LOOK FOR: Missing row group filtering logic

Suggestion template:

You can skip irrelevant row groups using statistics:

let row_groups: Vec<usize> = builder.metadata()
    .row_groups()
    .iter()
    .enumerate()
    .filter_map(|(idx, rg)| {
        // Check statistics
        if matches_criteria(rg.column(0).statistics()) {
            Some(idx)
        } else {
            None
        }
    })
    .collect();

builder.with_row_groups(row_groups)
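
The `matches_criteria` helper above is left undefined, so here is one way its logic could look for a numeric range filter. The sketch models each row group's statistics as a plain min/max pair instead of the parquet crate's `Statistics` type; `MinMax` and `select_row_groups` are hypothetical names.

```rust
/// Minimum and maximum of one column within a row group, as a plain struct.
/// This models the statistics; it is not the parquet crate's `Statistics` type.
#[derive(Debug, Clone, Copy)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Keep the indices of row groups whose [min, max] range overlaps the
/// inclusive query range [lo, hi]. Groups without statistics are kept,
/// because they cannot be ruled out.
pub fn select_row_groups(stats: &[Option<MinMax>], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter_map(|(idx, s)| match s {
            // Entirely outside the query range: safe to skip.
            Some(mm) if mm.max < lo || mm.min > hi => None,
            _ => Some(idx),
        })
        .collect()
}
```

Keeping row groups that lack statistics is the conservative choice: skipping them could drop matching rows.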

4. Streaming vs Collecting

✅ GOOD: Streaming with while let Some(batch) = stream.next().await
❌ BAD: .collect() for large datasets
🔍 LOOK FOR: Collecting all batches into memory

Suggestion template:

For large files, stream batches instead of collecting:

let mut stream = builder.build()?;
while let Some(batch) = stream.next().await {
    let batch = batch?;
    process_batch(&batch)?;
    // Batch is dropped here, freeing memory
}

Performance Guidelines

Compression Selection Guide

For hot data (frequently accessed):

Use Snappy: Fast decompression, 2-3x compression
Good for: Real-time analytics, frequently queried tables

For warm data (balanced):

Use ZSTD(3): Balanced performance, 3-4x compression
Good for: Production data lakes (recommended default)

For cold data (archival):

Use ZSTD(6-9): Higher compression at more CPU cost, 5-6x compression
Good for: Long-term storage, compliance archives
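
The three tiers above can be captured as a small lookup. This is only a sketch of the guide as written: the `DataTier` and `Codec` types are hypothetical stand-ins rather than the parquet crate's `Compression` enum, and picking level 9 for cold data is an assumption within the stated 6-9 range.

```rust
/// How often the data is accessed, following the guide above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataTier {
    Hot,
    Warm,
    Cold,
}

/// Stand-in for a codec choice; the ZSTD variant carries its level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    Snappy,
    Zstd(i32),
}

/// Map an access tier to the codec the guide recommends.
pub fn recommended_codec(tier: DataTier) -> Codec {
    match tier {
        DataTier::Hot => Codec::Snappy,
        DataTier::Warm => Codec::Zstd(3),
        // Assumption: take the top of the 6-9 range for archival data.
        DataTier::Cold => Codec::Zstd(9),
    }
}
```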

File Sizing Guide

Target file sizes:

Individual files: 100MB - 1GB compressed
Row groups: 100MB - 1GB uncompressed
Batches: 8192 - 65536 rows

Why?

Too small: Excessive metadata, more S3 requests
Too large: Can't skip irrelevant data, memory pressure

Common Issues to Detect

Issue 1: Small Files Problem

Symptoms: Many files < 10MB
Solution: Suggest batching writes or file compaction

I notice you're writing many small Parquet files. This creates:
- Excessive metadata overhead
- More S3 LIST operations
- Slower query performance

Consider batching your writes or implementing periodic compaction.
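
A compaction job first has to decide which small files to merge together. The following sketch groups files greedily, in listing order, so that each merged output stays at or below a target size. The function name and the greedy strategy are assumptions for illustration; real compaction tools may group files differently.

```rust
/// Greedily group consecutive files so that each group stays at or below
/// `target_bytes`. Returns groups of indices into `sizes`.
/// A file larger than the target ends up in a group of its own.
pub fn plan_compaction(sizes: &[u64], target_bytes: u64) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_bytes: u64 = 0;

    for (idx, &size) in sizes.iter().enumerate() {
        // Close the current group if adding this file would exceed the target.
        if !current.is_empty() && current_bytes + size > target_bytes {
            groups.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(idx);
        current_bytes += size;
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```

Each returned group lists the input files to read and rewrite as one larger Parquet file.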

Issue 2: No Partitioning

Symptoms: All data in single directory
Solution: Suggest Hive-style partitioning

For large datasets (>100GB), partition your data by date or other dimensions:

data/events/year=2024/month=01/day=15/part-00000.parquet

This enables partition pruning for much faster queries.
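
A path in this layout can be produced with plain string formatting. The sketch below assumes date-based partitions and a zero-padded part number, matching the example path above; the function name and parameters are hypothetical.

```rust
/// Build a Hive-style partitioned object path such as
/// `data/events/year=2024/month=01/day=15/part-00000.parquet`.
pub fn partition_path(base: &str, year: u16, month: u8, day: u8, part: u32) -> String {
    format!(
        "{}/year={:04}/month={:02}/day={:02}/part-{:05}.parquet",
        base, year, month, day, part
    )
}
```

Zero-padding the month and day keeps the directories in chronological order when they are listed lexicographically.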

Issue 3: Wrong Compression

Symptoms: Uncompressed or LZ4/Gzip
Solution: Recommend ZSTD

Gzip is usually slower than ZSTD at a comparable compression ratio, and LZ4 is fast but compresses less. ZSTD generally gives a better balance of ratio and speed:

Compression::ZSTD(ZstdLevel::try_new(3)?)

This is the recommended default for cloud data lakes.

Issue 4: Missing Error Handling

Symptoms: No retry logic for object store operations
Solution: Add retry configuration

Parquet operations on cloud storage need retry logic:

let s3 = AmazonS3Builder::new()
    .with_retry(RetryConfig {
        max_retries: 3,
        retry_timeout: Duration::from_secs(10),
        ..Default::default()
    })
    .build()?;
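
For code paths that do not go through an object store client with built-in retries, the same idea can be written by hand. This is a generic, std-only sketch of retrying with exponential backoff. It is not how `object_store` implements `RetryConfig`, and the name `retry_with_backoff` is hypothetical.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` until it succeeds or `max_retries` retries have been used,
/// doubling the delay after each failed attempt.
pub fn retry_with_backoff<T, E>(
    max_retries: usize,
    initial_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

This version blocks the current thread while waiting; async code would use the runtime's sleep function instead of `std::thread::sleep`.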

Examples of Good Optimization

Example 1: Production Writer

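use parquet::arrow::AsyncArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::{EnabledStatistics, WriterProperties, WriterVersion};

let props = WriterProperties::builder()
    .set_writer_version(WriterVersion::PARQUET_2_0)
    .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
    // Row count, not bytes: size this from your average row width
    .set_max_row_group_size(100_000_000)
    .set_data_page_size_limit(1024 * 1024)
    .set_dictionary_enabled(true)
    .set_statistics_enabled(EnabledStatistics::Page)
    .build();

let mut writer = AsyncArrowWriter::try_new(writer_obj, schema, Some(props))?;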

Example 2: Optimized Reader

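use futures::StreamExt; // provides .next() on the stream
use parquet::arrow::{ParquetRecordBatchStreamBuilder, ProjectionMask};

let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;

// Root column indices are resolved against the file's Parquet schema
let projection = ProjectionMask::roots(builder.parquet_schema(), vec![0, 2, 5]);

let mut stream = builder
    .with_projection(projection)
    .with_batch_size(8192)
    .build()?;
while let Some(batch) = stream.next().await {
    let batch = batch?;
    process_batch(&batch)?;
}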

Your Approach

Detect: Identify Parquet operations in code or discussion
Analyze: Check against optimization checklist
Suggest: Provide specific, actionable improvements
Explain: Include the "why" behind recommendations
Prioritize: Focus on high-impact optimizations first

Communication Style

Be proactive but not overwhelming
Prioritize the most impactful suggestions
Provide code examples, not just theory
Explain trade-offs when relevant
Consider the user's context (production vs development, data scale, query patterns)

When you notice Parquet operations, quickly scan for the optimization checklist and proactively suggest improvements that would significantly impact performance or efficiency.