Free Sample Parquet Files for Developers & Data Engineers
Five production-realistic Apache Parquet datasets spanning e-commerce, web logs, HR, IoT, and finance. Perfect for testing pipelines, benchmarking query engines, and learning the Parquet format — zero friction, zero cost.
Download Datasets
E-Commerce Orders
ecommerce_orders.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| order_id | INT32 | — | 1, 2, 3… |
| customer_id | INT32 | — | 1042, 8731… |
| product_name | STRING | null | Product_247… |
| category | STRING | null | Electronics… |
| quantity | INT32 | — | 3, 12, 1… |
| status | STRING | null | delivered… |
| region | STRING | null | North America… |
Web Server Access Logs
web_server_logs.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| log_id | INT32 | — | 1, 2, 3… |
| ip_address | STRING | null | 192.168.1.42 |
| http_method | STRING | null | GET, POST… |
| request_path | STRING | null | /api/users… |
| status_code | INT32 | — | 200, 404… |
| response_time_ms | INT32 | — | 142, 892… |
| bytes_sent | INT32 | — | 4096, 1200… |
| user_agent | STRING | null | Mozilla… |
Employee HR Records
employee_hr_records.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| employee_id | INT32 | — | 10001… |
| first_name | STRING | null | Alice, Bob… |
| last_name | STRING | null | Smith… |
| department | STRING | null | Engineering… |
| role_level | STRING | null | Senior, Lead… |
| years_experience | INT32 | — | 3, 12, 7… |
| age | INT32 | — | 28, 45… |
| employment_type | STRING | null | Full-time… |
| performance_score | INT32 | — | 1–5 |
IoT Sensor Readings
iot_sensor_readings.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| reading_id | INT32 | — | 1, 2… |
| device_id | STRING | null | DEV-4821… |
| sensor_type | STRING | null | temperature… |
| location | STRING | null | Warehouse-A… |
| value_raw | INT32 | — | 2341, -12… |
| battery_level | INT32 | — | 87, 12… |
| signal_strength | INT32 | — | -72, -45… |
| device_model | STRING | null | SensorX-100… |
| alert_triggered | INT32 | — | 0, 1 |
Financial Transactions
financial_transactions.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| transaction_id | INT64 | — | 900001… |
| account_id | INT32 | — | 54821… |
| transaction_type | STRING | null | credit, debit… |
| amount_cents | INT64 | — | 25099… |
| currency | STRING | null | USD, EUR… |
| channel | STRING | null | online, ATM… |
| merchant_id | INT32 | — | 3841… |
| risk_level | STRING | null | low, high… |
| is_flagged | INT32 | — | 0, 1 |
Read Parquet Files — Quick Start
PyArrow is the fastest path. Install it and read any file in two lines:

```python
# pip install pyarrow pandas
import pyarrow.parquet as pq
import pandas as pd

# Read the full file
table = pq.read_table('ecommerce_orders.parquet')
df = table.to_pandas()
print(df.head())

# Read only specific columns (columnar advantage!)
df_slim = pq.read_table(
    'iot_sensor_readings.parquet',
    columns=['device_id', 'sensor_type', 'value_raw'],
).to_pandas()

# Inspect schema without loading data
schema = pq.read_schema('financial_transactions.parquet')
print(schema)
```
DuckDB lets you run full SQL directly on Parquet files — no database required. It's blazing fast for analytics:

```python
# pip install duckdb
import duckdb

# SQL on a parquet file directly
duckdb.sql("""
    SELECT status, COUNT(*) AS cnt, SUM(quantity) AS total_qty
    FROM 'ecommerce_orders.parquet'
    GROUP BY status
    ORDER BY cnt DESC
""").show()

# Join two parquet files
duckdb.sql("""
    SELECT o.region, l.status_code, COUNT(*) AS hits
    FROM 'ecommerce_orders.parquet' o
    JOIN 'web_server_logs.parquet' l ON o.order_id = l.log_id
    GROUP BY 1, 2
""").show()
```
Parquet is Spark's native format — no conversion needed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetDemo").getOrCreate()

df = spark.read.parquet("iot_sensor_readings.parquet")
df.printSchema()

df.createOrReplaceTempView("sensors")
spark.sql("""
    SELECT sensor_type,
           AVG(value_raw) AS avg_value,
           COUNT(*) AS readings,
           SUM(alert_triggered) AS total_alerts
    FROM sensors
    GROUP BY sensor_type
""").show()
```
The arrow package brings full Parquet support to R with a tidy interface:

```r
install.packages("arrow")
library(arrow)
library(dplyr)

# Read and convert to tibble
orders <- read_parquet("ecommerce_orders.parquet")
glimpse(orders)

# Use dplyr verbs directly on the Arrow table (lazy)
orders |>
  group_by(status) |>
  summarise(n = n(), total_qty = sum(quantity)) |>
  arrange(desc(n)) |>
  collect()  # pulls into an R data frame
```
Upload the files to an S3 bucket, then create an external table in Athena — pay only for data scanned:

```sql
-- Upload first: aws s3 cp *.parquet s3://my-bucket/data/

-- Create external table in Athena
CREATE EXTERNAL TABLE IF NOT EXISTS iot_readings (
  reading_id      INT,
  device_id       STRING,
  sensor_type     STRING,
  location        STRING,
  value_raw       INT,
  battery_level   INT,
  signal_strength INT,
  device_model    STRING,
  alert_triggered INT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/data/iot_sensor_readings/';

-- Query it
SELECT sensor_type, AVG(value_raw)
FROM iot_readings
GROUP BY 1;
```
Why Apache Parquet?
Columnar Storage
Data is stored column-by-column instead of row-by-row. Analytical queries that touch only a few columns skip irrelevant data entirely, slashing I/O by up to 99% on wide tables.
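You can see this layout directly in the file footer: each column of a row group is stored as its own independently sized chunk. A minimal sketch with PyArrow, assuming the sample files sit in the working directory:

```python
import pyarrow.parquet as pq

# Footer metadata only -- no data pages are read
meta = pq.read_metadata('web_server_logs.parquet')
rg = meta.row_group(0)

# One independently stored chunk per column; a query engine
# fetches only the chunks for the columns it actually needs
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.total_compressed_size, 'bytes')
```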
Efficient Compression
Columnar layout means adjacent values are similar, making compression algorithms (Snappy, GZIP, Zstd, LZ4) dramatically more effective. Real-world files are 75–90% smaller than equivalent CSV.
Rich Type System
Supports INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY (strings), BOOLEAN, and nested types (LIST, MAP, STRUCT). Schema is embedded in the file — no external definition needed.
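As a sketch of the nested types, here is how a LIST and a STRUCT column can be declared and round-tripped with PyArrow (the field names are hypothetical, not from the sample files):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Flat, LIST, and STRUCT columns in one schema
schema = pa.schema([
    ('order_id', pa.int32()),
    ('tags', pa.list_(pa.string())),
    ('shipping', pa.struct([('city', pa.string()),
                            ('zip', pa.string())])),
])

table = pa.table({
    'order_id': [1, 2],
    'tags': [['gift', 'rush'], []],
    'shipping': [{'city': 'Austin', 'zip': '78701'},
                 {'city': 'Berlin', 'zip': '10115'}],
}, schema=schema)

pq.write_table(table, 'nested_demo.parquet')
print(pq.read_schema('nested_demo.parquet'))  # schema travels with the file
```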
Predicate Pushdown
Row group and page-level statistics (min/max) let query engines skip entire chunks of data without reading them. A query filtering on WHERE id = 42 against a billion-row file might read less than 1 MB.
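PyArrow exposes pushdown through the filters argument of read_table — row groups whose statistics rule out a match are skipped before any data pages are decoded. A minimal sketch on the web-log sample:

```python
import pyarrow.parquet as pq

# Row groups whose status_code min/max excludes 404 are never decoded
errors = pq.read_table(
    'web_server_logs.parquet',
    columns=['request_path', 'status_code'],
    filters=[('status_code', '=', 404)],
).to_pandas()
print(len(errors))
```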
Cross-Platform
Read and write from Python, Java, Scala, R, C++, Go, JavaScript, and virtually every data platform: Spark, Flink, Presto, Trino, Hive, Athena, BigQuery, Redshift Spectrum, and Snowflake.
ACID & Schema Evolution
Used as the foundation for Delta Lake, Apache Iceberg, and Apache Hudi — the leading open table formats. Supports schema evolution (adding/renaming columns) without rewriting data.
Common Use Cases for These Sample Files
Parquet vs CSV vs JSON — At a Glance
| Feature | Parquet | CSV | JSON / JSONL | ORC | Avro |
|---|---|---|---|---|---|
| Storage layout | Columnar | Row | Row | Columnar | Row |
| Schema embedded | ✓ | ✗ | Partial | ✓ | ✓ |
| Compression efficiency | Excellent | Poor | Poor | Excellent | Good |
| Column pruning | ✓ | ✗ | ✗ | ✓ | ✗ |
| Predicate pushdown | ✓ | ✗ | ✗ | ✓ | ✗ |
| Nested types | ✓ | ✗ | ✓ | ✓ | ✓ |
| Human-readable | ✗ | ✓ | ✓ | ✗ | ✗ |
| Streaming write | ~ | ✓ | ✓ | ~ | ✓ |
| Analytics performance | ⚡ Best | Slow | Slow | ⚡ Best | Medium |
| Open table format base | ✓ Delta/Iceberg/Hudi | ✗ | ✗ | Partial | ✗ |
Anatomy of a Parquet File
Every .parquet file follows the same binary layout — understanding it helps you tune performance.
Magic Bytes
4-byte header PAR1 and identical 4-byte footer. Used to validate the file and detect corruption.
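This makes a cheap integrity check possible before handing a file to a full reader — a minimal sketch:

```python
def looks_like_parquet(path: str) -> bool:
    """Check the 4-byte PAR1 magic at both ends of the file."""
    with open(path, 'rb') as f:
        header = f.read(4)
        f.seek(-4, 2)            # 4 bytes from the end of the file
        footer = f.read(4)
    return header == b'PAR1' and footer == b'PAR1'

print(looks_like_parquet('ecommerce_orders.parquet'))  # True for a valid file
```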
Row Groups
Horizontal partitions of data. Default ~128 MB each. Each row group is independently decodable — great for parallel reads.
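The footer records every row-group boundary, and PyArrow can read one group at a time — a minimal sketch:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('financial_transactions.parquet')
print(pf.metadata.num_row_groups)

# Read a single row group -- independent of the rest of the file
first_group = pf.read_row_group(0)
print(first_group.num_rows)
```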
Column Chunks
Within each row group, one chunk per column. Chunks can be individually compressed with different codecs.
Pages
The smallest unit of I/O inside a column chunk (~1 MB). Each page has its own header with encoding and compression info.
Footer (Thrift)
Contains the full FileMetaData: schema, row group offsets, column statistics (min/max/nulls), and encoding info. Always read first.
Statistics
Per-column min, max, and null counts stored in the footer. Query engines use these for predicate pushdown without reading data pages.
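These statistics can be inspected straight from the footer — a minimal sketch that prints min/max and null counts for the first row group:

```python
import pyarrow.parquet as pq

meta = pq.read_metadata('iot_sensor_readings.parquet')
rg = meta.row_group(0)

for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    # Statistics may be absent for some columns; guard before reading
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, stats.min, stats.max, stats.null_count)
```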
Frequently Asked Questions
What Parquet format version do these files use?

All five files use Parquet format version 2 (the spec version stored in the FileMetaData.version field). They use PLAIN encoding for all columns and no compression, which makes them ideal for validating parsers since there is no codec dependency. They are compatible with all major readers: PyArrow ≥ 0.17, Spark ≥ 2.x, Hive ≥ 1.2, Presto, Trino, DuckDB, and Pandas.
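You can verify these details yourself from the footer metadata, without reading any data pages — a minimal sketch with PyArrow:

```python
import pyarrow.parquet as pq

meta = pq.read_metadata('ecommerce_orders.parquet')
print(meta.format_version)            # spec version, e.g. '2.6'
print(meta.num_rows, meta.num_row_groups)

# Encoding and codec of the first column chunk
col = meta.row_group(0).column(0)
print(col.encodings)                  # expect PLAIN (plus RLE for levels)
print(col.compression)                # expect UNCOMPRESSED
```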
Can I use these datasets for free, including commercially?

Yes — completely free for any purpose. All datasets are synthetically generated; they contain no real personal, financial, or proprietary information. You may use them in commercial software, open-source projects, tutorials, courses, test suites, and benchmarks without attribution.
How can I view a Parquet file without writing code?

Several GUI tools can open Parquet files directly:

- DBeaver — free, cross-platform, supports Parquet natively
- Apache Parquet Viewer — lightweight desktop app
- VS Code — install the Parquet Viewer extension
- Tad — fast tabular data viewer with Parquet support
- AWS Glue DataBrew — cloud-based, no install
- Command-line: parquet-tools schema file.parquet
What is the difference between Parquet v1 and v2?

Parquet v2 (also called "format version 2") introduced several improvements over v1: the Data Page V2 format (separate repetition/definition levels that can be compressed independently), better statistics coverage, improved encoding for nested types, and support for newer encodings like DELTA_BINARY_PACKED and BYTE_STREAM_SPLIT. Most modern tools default to v2. The version field in the footer is 2 for v2 files. The physical layout is largely compatible: v1 readers can read v2 files as long as they don't use v2-only page headers.
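When writing your own files you can pick the format version explicitly — a minimal PyArrow sketch ('2.6' enables the newest encodings, '1.0' maximizes compatibility with old readers):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'id': [1, 2, 3], 'label': ['a', 'b', 'c']})

# The version argument controls which format features the writer may use
pq.write_table(table, 'v1_style.parquet', version='1.0')
pq.write_table(table, 'v2_style.parquet', version='2.6')

print(pq.read_metadata('v2_style.parquet').format_version)  # '2.6'
```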
How do I convert a Parquet file to CSV?

With Pandas, it's one line:

```python
import pandas as pd

df = pd.read_parquet('ecommerce_orders.parquet')
df.to_csv('ecommerce_orders.csv', index=False)
```

Or with DuckDB, which handles very large files efficiently:

```python
import duckdb

duckdb.sql("""
    COPY (SELECT * FROM 'ecommerce_orders.parquet')
    TO 'ecommerce_orders.csv' (HEADER, DELIMITER ',')
""")
```
How are values physically stored in PLAIN encoding?

In PLAIN encoding: INT32 values are stored as 4 consecutive little-endian bytes; INT64 as 8 bytes LE; FLOAT as IEEE 754 single precision (4 bytes); DOUBLE as IEEE 754 double precision (8 bytes); BYTE_ARRAY (strings) as a 4-byte LE length prefix followed by raw UTF-8 bytes; and BOOLEAN as 1 bit per value, packed LSB-first. Definition and repetition levels use the RLE/bit-packed hybrid encoding regardless of the value encoding.
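To make that layout concrete, here is a small sketch that packs and unpacks PLAIN-style values with Python's struct module (it mimics the value layout only — real Parquet pages add page headers and level data):

```python
import struct

# INT32: 4 consecutive little-endian bytes per value
buf = struct.pack('<3i', 1, 2, 3)
print(buf.hex())                  # 010000000200000003000000
print(struct.unpack('<3i', buf))  # (1, 2, 3)

# BYTE_ARRAY: 4-byte LE length prefix, then raw UTF-8 bytes
s = 'delivered'.encode('utf-8')
encoded = struct.pack('<I', len(s)) + s
length = struct.unpack('<I', encoded[:4])[0]
print(encoded[4:4 + length].decode('utf-8'))  # delivered
```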
How do I load these files into BigQuery or Snowflake?

BigQuery: Upload via the console (Create Table → Upload → File format: Parquet) or use bq load --source_format=PARQUET dataset.table file.parquet. The schema is auto-detected from the embedded Parquet schema.

Snowflake: Stage the files in an S3 or GCS bucket, then run COPY INTO my_table FROM @my_stage/file.parquet FILE_FORMAT = (TYPE = 'PARQUET');. Use the $1:column_name syntax to reference columns during the COPY.
How do I create my own Parquet file?

The quickest way is with Python:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a DataFrame
df = pd.DataFrame({
    'id': range(1, 1001),
    'name': [f'User_{i}' for i in range(1, 1001)],
    'score': [i * 1.5 for i in range(1, 1001)],
    'active': [i % 2 == 0 for i in range(1, 1001)],
})

# Write with compression
table = pa.Table.from_pandas(df)
pq.write_table(table, 'my_sample.parquet',
               compression='snappy', row_group_size=512)
```
About These Sample Parquet Files
These free sample Apache Parquet files were designed to cover the most common real-world data shapes a data engineer encounters: transactional data with mixed string/integer fields (e-commerce orders, financial transactions), high-cardinality string data typical of log analytics (web server access logs), time-series-flavoured measurement data (IoT sensor readings), and narrow, low-cardinality HR data with enum-like columns (employee records). Each dataset exercises a different read pattern so you can benchmark column pruning, predicate pushdown, and decompression independently.
All files comply with the Apache Parquet Format Specification v2.6. They use uncompressed PLAIN encoding to eliminate external codec dependencies — you can validate a parser against them without linking Snappy or GZIP. For real-world compression benchmarks, load the files into PyArrow or DuckDB and re-write with compression='snappy' or compression='zstd'; you'll typically see 40–70% size reduction on the string-heavy datasets.
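For instance, a quick size comparison along those lines (a minimal sketch; the exact ratios depend on the dataset):

```python
import os
import pyarrow.parquet as pq

src = 'web_server_logs.parquet'
table = pq.read_table(src)

# Re-write the uncompressed sample with three common codecs
for codec in ('snappy', 'zstd', 'gzip'):
    out = f'web_server_logs_{codec}.parquet'
    pq.write_table(table, out, compression=codec)
    ratio = os.path.getsize(out) / os.path.getsize(src)
    print(f'{codec}: {ratio:.0%} of the uncompressed file size')
```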
These files are also useful for practicing with modern data lakehouse tooling: load them into Delta Lake via df.write.format("delta"), convert them to Apache Iceberg using the CREATE TABLE … AS SELECT syntax in Trino or Flink, or use them as the raw input layer in an Apache Hudi upsert pipeline. The schema definitions above serve as the DDL reference for any of these workflows.