Get Started →
Free Download — No Sign-up

Free Sample Parquet Files
for Developers & Data Engineers

Five production-realistic Apache Parquet datasets spanning e-commerce, web logs, HR, IoT, and finance. Perfect for testing pipelines, benchmarking query engines, and learning the Parquet format — zero friction, zero cost.

5 datasets
8,000 total rows
Parquet v2 format
Schema documented

Download Datasets

5 files
🛒

E-Commerce Orders

ecommerce_orders.parquet
Rows: 1,000
Columns: 7
File Size: ~65 KB
Compression: None

Column | Type | Nullable | Sample
order_id | INT32 | no | 1, 2, 3…
customer_id | INT32 | no | 1042, 8731…
product_name | STRING | yes | Product_247…
category | STRING | yes | Electronics…
quantity | INT32 | no | 3, 12, 1…
status | STRING | yes | delivered…
region | STRING | yes | North America…
65 KB Parquet v2
Download .parquet
🌐

Web Server Access Logs

web_server_logs.parquet
Rows: 2,000
Columns: 8
File Size: ~146 KB
Compression: None

Column | Type | Nullable | Sample
log_id | INT32 | no | 1, 2, 3…
ip_address | STRING | yes | 192.168.1.42
http_method | STRING | yes | GET, POST…
request_path | STRING | yes | /api/users…
status_code | INT32 | no | 200, 404…
response_time_ms | INT32 | no | 142, 892…
bytes_sent | INT32 | no | 4096, 1200…
user_agent | STRING | yes | Mozilla…
146 KB Parquet v2
Download .parquet
👥

Employee HR Records

employee_hr_records.parquet
Rows: 500
Columns: 9
File Size: ~35 KB
Compression: None

Column | Type | Nullable | Sample
employee_id | INT32 | no | 10001…
first_name | STRING | yes | Alice, Bob…
last_name | STRING | yes | Smith…
department | STRING | yes | Engineering…
role_level | STRING | yes | Senior, Lead…
years_experience | INT32 | no | 3, 12, 7…
age | INT32 | no | 28, 45…
employment_type | STRING | yes | Full-time…
performance_score | INT32 | no | 1 – 5
35 KB Parquet v2
Download .parquet
📡

IoT Sensor Readings

iot_sensor_readings.parquet
Rows: 3,000
Columns: 9
File Size: ~208 KB
Compression: None

Column | Type | Nullable | Sample
reading_id | INT32 | no | 1, 2…
device_id | STRING | yes | DEV-4821…
sensor_type | STRING | yes | temperature…
location | STRING | yes | Warehouse-A…
value_raw | INT32 | no | 2341, -12…
battery_level | INT32 | no | 87, 12…
signal_strength | INT32 | no | -72, -45…
device_model | STRING | yes | SensorX-100…
alert_triggered | INT32 | no | 0, 1
208 KB Parquet v2
Download .parquet
💳

Financial Transactions

financial_transactions.parquet
Rows: 1,500
Columns: 9
File Size: ~93 KB
Compression: None

Column | Type | Nullable | Sample
transaction_id | INT64 | no | 900001…
account_id | INT32 | no | 54821…
transaction_type | STRING | yes | credit, debit…
amount_cents | INT64 | no | 25099…
currency | STRING | yes | USD, EUR…
channel | STRING | yes | online, ATM…
merchant_id | INT32 | no | 3841…
risk_level | STRING | yes | low, high…
is_flagged | INT32 | no | 0, 1
93 KB Parquet v2
Download .parquet

Read Parquet Files — Quick Start
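For example, here is a minimal way to open one of the samples with pandas and DuckDB. This assumes the file sits in your working directory and that the pandas, pyarrow, and duckdb packages are installed.

```python
import duckdb
import pandas as pd

# Load the whole file into a DataFrame (pandas delegates to PyArrow)
df = pd.read_parquet("ecommerce_orders.parquet")
print(df.head())

# Or query the file in place with SQL via DuckDB
print(duckdb.sql(
    "SELECT category, COUNT(*) AS orders "
    "FROM 'ecommerce_orders.parquet' "
    "GROUP BY category ORDER BY orders DESC"
).df())
```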

Why Apache Parquet?

Columnar Storage

Data is stored column-by-column instead of row-by-row. Analytical queries that touch only a few columns skip irrelevant data entirely, slashing I/O by up to 99% on wide tables.
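A rough PyArrow sketch of what that means in practice, assuming the e-commerce sample is on local disk: asking for two of its seven columns decodes only those column chunks.

```python
import pyarrow.parquet as pq

# Only the requested column chunks are read and decoded;
# the other five columns in the file are skipped entirely.
table = pq.read_table("ecommerce_orders.parquet",
                      columns=["category", "quantity"])
print(table.num_rows, table.column_names)
```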

Efficient Compression

Columnar layout means adjacent values are similar, making compression algorithms (Snappy, GZIP, Zstd, LZ4) dramatically more effective. Real-world files are 75–90% smaller than equivalent CSV.
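You can check this on the samples themselves with a small re-encoding experiment; which codecs are available depends on how your PyArrow build was compiled.

```python
import os
import pyarrow.parquet as pq

table = pq.read_table("web_server_logs.parquet")

# Re-write the same data with different codecs and compare file sizes
for codec in ["none", "snappy", "gzip", "zstd"]:
    out = f"web_server_logs_{codec}.parquet"
    pq.write_table(table, out, compression=codec)
    print(f"{codec:>8}: {os.path.getsize(out):,} bytes")
```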

Rich Type System

Supports INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY (strings), BOOLEAN, and nested types (LIST, MAP, STRUCT). Schema is embedded in the file — no external definition needed.
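As an illustration, a nested schema can be declared and round-tripped with PyArrow; the field names below are made up for the example.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# LIST, MAP and STRUCT columns alongside plain scalars
schema = pa.schema([
    ("order_id", pa.int64()),
    ("tags", pa.list_(pa.string())),
    ("attributes", pa.map_(pa.string(), pa.string())),
    ("shipping", pa.struct([("city", pa.string()), ("zip", pa.string())])),
])

table = pa.table({
    "order_id": [1, 2],
    "tags": [["gift", "express"], []],
    "attributes": [[("color", "red")], [("size", "XL")]],
    "shipping": [{"city": "Berlin", "zip": "10115"},
                 {"city": "Oslo", "zip": "0150"}],
}, schema=schema)

pq.write_table(table, "nested_example.parquet")
print(pq.read_schema("nested_example.parquet"))  # the schema travels with the file
```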

Predicate Pushdown

Row group and page-level statistics (min/max) let query engines skip entire chunks of data without reading them. A WHERE id = 42 on a billion-row file might read less than 1 MB.
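A sketch of the same idea using PyArrow's filters argument against the web log sample; on a file this small everything may sit in a single row group, but the mechanism is identical on large files.

```python
import pyarrow.parquet as pq

# Row groups whose min/max statistics cannot contain 404 are skipped
# before any data pages are decoded.
errors = pq.read_table(
    "web_server_logs.parquet",
    columns=["request_path", "status_code", "response_time_ms"],
    filters=[("status_code", "=", 404)],
)
print(errors.num_rows, "rows matched")
```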

Cross-Platform

Read and write from Python, Java, Scala, R, C++, Go, JavaScript, and virtually every data platform: Spark, Flink, Presto, Trino, Hive, Athena, BigQuery, Redshift Spectrum, and Snowflake.

ACID & Schema Evolution

Used as the foundation for Delta Lake, Apache Iceberg, and Apache Hudi — the leading open table formats. Supports schema evolution (adding/renaming columns) without rewriting data.

Common Use Cases for These Sample Files

🧪Unit & integration testing
Benchmark query engines
📊BI tool prototyping
🎓Learning Parquet format
🔄ETL pipeline dev
☁️AWS Glue / Athena demos
🔥Apache Spark tutorials
🦆DuckDB experimentation
📐Schema design practice
🗜️Compression benchmarks

Parquet vs CSV vs JSON — At a Glance

Feature | Parquet | CSV | JSON / JSONL | ORC | Avro
Storage layout | Columnar | Row | Row | Columnar | Row
Schema embedded | ✓ | ✗ | Partial | ✓ | ✓
Compression efficiency | Excellent | Poor | Poor | Excellent | Good
Column pruning | ✓ | ✗ | ✗ | ✓ | ✗
Predicate pushdown | ✓ | ✗ | ✗ | ✓ | ✗
Nested types | ✓ | ✗ | ✓ | ✓ | ✓
Human-readable | ✗ | ✓ | ✓ | ✗ | ✗
Streaming write | ~ | ✓ | ✓ | ~ | ✓
Analytics performance | ⚡ Best | Slow | Slow | ⚡ Best | Medium
Open table format base | ✓ Delta/Iceberg/Hudi | ✗ | ✗ | Partial | ✗

Anatomy of a Parquet File

Every .parquet file follows the same binary layout — understanding it helps you tune performance.

Magic Bytes

The 4-byte magic number PAR1 appears at the start of the file and again as its last 4 bytes. Used to validate the file and detect truncation or corruption.

Row Groups

Horizontal partitions of data. Default ~128 MB each. Each row group is independently decodable — great for parallel reads.

Column Chunks

Within each row group, one chunk per column. Chunks can be individually compressed with different codecs.

Pages

The smallest unit of I/O inside a column chunk (~1 MB). Each page has its own header with encoding and compression info.

Footer (Thrift)

Contains the full FileMetaData: schema, row group offsets, column statistics (min/max/nulls), and encoding info. Always read first.

Statistics

Per-column min, max, and null counts stored in the footer. Query engines use these for predicate pushdown without reading data pages.
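A short PyArrow sketch, assuming the IoT sample is on local disk, that surfaces most of these pieces (magic bytes, row groups, column chunks, and footer statistics) without touching any data pages:

```python
import pyarrow.parquet as pq

path = "iot_sensor_readings.parquet"

# Magic bytes: the first and last four bytes of a valid file are b"PAR1"
with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)
    tail = f.read(4)
print("magic:", head, tail)

# Footer (FileMetaData): schema, row groups, per-column-chunk statistics
meta = pq.ParquetFile(path).metadata
print("row groups:", meta.num_row_groups, "| rows:", meta.num_rows)

rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)          # one column chunk per column in this row group
    stats = col.statistics      # min / max / null_count read from the footer
    print(col.path_in_schema, col.compression,
          None if stats is None else (stats.min, stats.max, stats.null_count))
```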

Frequently Asked Questions


About These Sample Parquet Files

These free sample Apache Parquet files were designed to cover the most common real-world data shapes a data engineer encounters: transactional data with mixed string/integer fields (e-commerce orders, financial transactions), high-cardinality string data typical of log analytics (web server access logs), time-series-flavoured measurement data (IoT sensor readings), and narrow, low-cardinality HR data with enum-like columns (employee records). Each dataset exercises a different read pattern so you can benchmark column pruning, predicate pushdown, and decompression independently.

All files comply with the Apache Parquet Format Specification v2.6. They are written with PLAIN encoding and no compression codec, so you can validate a parser against them without linking Snappy or GZIP. For real-world compression benchmarks, load the files into PyArrow or DuckDB and re-write with compression='snappy' or compression='zstd'; you'll typically see 40–70% size reduction on the string-heavy datasets.
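For example, with the duckdb Python package (the output file name here is illustrative):

```python
import os
import duckdb

# Re-encode the uncompressed sample with ZSTD and compare sizes
duckdb.sql("""
    COPY (SELECT * FROM 'financial_transactions.parquet')
    TO 'financial_transactions_zstd.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")

for name in ["financial_transactions.parquet",
             "financial_transactions_zstd.parquet"]:
    print(name, os.path.getsize(name), "bytes")
```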

These files are also useful for practicing with modern data lakehouse tooling: load them into Delta Lake via spark.write.format("delta"), convert to Apache Iceberg using the CREATE TABLE … AS SELECT syntax in Trino or Flink, or use them as the raw input layer in an Apache Hudi upsert pipeline. The schema definitions above serve as the DDL reference for any of these workflows.
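A minimal sketch of the Delta Lake route with PySpark, assuming the delta-spark package is installed and the session is configured as below; the output path is illustrative.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Delta Lake needs its SQL extension and catalog registered on the session
builder = (
    SparkSession.builder.appName("parquet-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Read the raw Parquet sample and re-write it as a Delta table
df = spark.read.parquet("ecommerce_orders.parquet")
df.write.format("delta").mode("overwrite").save("/tmp/delta/ecommerce_orders")
```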

More Free Online Tools

Simple tools. Surgical fixes. Zero friction.

Amazon Connect CCP Log Parser

Parse Amazon Connect CCP logs into structured, searchable diagnostics.

Open

Amazon Connect CTR Parser

Turn raw Amazon Connect CTR JSON into a rich visual breakdown.

Open

Amazon Connect Agent Workstation Validator

Pre-flight check for Amazon Connect softphone agents.

Open

CloudTrail Log Analyser

Security audit & threat detection for AWS environments.

Open

Amazon Connect Pricing Calculator

Instantly estimate monthly AWS Connect costs — voice, chat, email, campaigns, telephony & more.

Open

Connect CloudWatch Log Analyzer

Drop any Amazon Connect CloudWatch log and get a rich visual breakdown.

Open