Free Sample Parquet Files for Developers & Data Engineers
Five production-realistic Apache Parquet datasets spanning e-commerce, web logs, HR, IoT, and finance. Perfect for testing pipelines, benchmarking query engines, and learning the Parquet format — zero friction, zero cost.
Download Datasets
E-Commerce Orders
ecommerce_orders.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| order_id | INT32 | — | 1, 2, 3… |
| customer_id | INT32 | — | 1042, 8731… |
| product_name | STRING | null | Product_247… |
| category | STRING | null | Electronics… |
| quantity | INT32 | — | 3, 12, 1… |
| status | STRING | null | delivered… |
| region | STRING | null | North America… |
Web Server Access Logs
web_server_logs.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| log_id | INT32 | — | 1, 2, 3… |
| ip_address | STRING | null | 192.168.1.42 |
| http_method | STRING | null | GET, POST… |
| request_path | STRING | null | /api/users… |
| status_code | INT32 | — | 200, 404… |
| response_time_ms | INT32 | — | 142, 892… |
| bytes_sent | INT32 | — | 4096, 1200… |
| user_agent | STRING | null | Mozilla… |
Employee HR Records
employee_hr_records.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| employee_id | INT32 | — | 10001… |
| first_name | STRING | null | Alice, Bob… |
| last_name | STRING | null | Smith… |
| department | STRING | null | Engineering… |
| role_level | STRING | null | Senior, Lead… |
| years_experience | INT32 | — | 3, 12, 7… |
| age | INT32 | — | 28, 45… |
| employment_type | STRING | null | Full-time… |
| performance_score | INT32 | — | 1–5 |
IoT Sensor Readings
iot_sensor_readings.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| reading_id | INT32 | — | 1, 2… |
| device_id | STRING | null | DEV-4821… |
| sensor_type | STRING | null | temperature… |
| location | STRING | null | Warehouse-A… |
| value_raw | INT32 | — | 2341, -12… |
| battery_level | INT32 | — | 87, 12… |
| signal_strength | INT32 | — | -72, -45… |
| device_model | STRING | null | SensorX-100… |
| alert_triggered | INT32 | — | 0, 1 |
Financial Transactions
financial_transactions.parquet

| Column | Type | Nullable | Sample |
|---|---|---|---|
| transaction_id | INT64 | — | 900001… |
| account_id | INT32 | — | 54821… |
| transaction_type | STRING | null | credit, debit… |
| amount_cents | INT64 | — | 25099… |
| currency | STRING | null | USD, EUR… |
| channel | STRING | null | online, ATM… |
| merchant_id | INT32 | — | 3841… |
| risk_level | STRING | null | low, high… |
| is_flagged | INT32 | — | 0, 1 |
Read Parquet Files — Quick Start
PyArrow is the fastest path. Install it and read any file in two lines:

```python
# pip install pyarrow pandas
import pyarrow.parquet as pq
import pandas as pd

# Read the full file
table = pq.read_table('ecommerce_orders.parquet')
df = table.to_pandas()
print(df.head())

# Read only specific columns (columnar advantage!)
df_slim = pq.read_table(
    'iot_sensor_readings.parquet',
    columns=['device_id', 'sensor_type', 'value_raw'],
).to_pandas()

# Inspect schema without loading data
schema = pq.read_schema('financial_transactions.parquet')
print(schema)
```
DuckDB lets you run full SQL directly on Parquet files — no database required. It's blazing fast for analytics:

```python
# pip install duckdb
import duckdb

# SQL on a parquet file directly
duckdb.sql("""
    SELECT status, COUNT(*) AS cnt, SUM(quantity) AS total_qty
    FROM 'ecommerce_orders.parquet'
    GROUP BY status
    ORDER BY cnt DESC
""").show()

# Join two parquet files
duckdb.sql("""
    SELECT o.region, l.status_code, COUNT(*) AS hits
    FROM 'ecommerce_orders.parquet' o
    JOIN 'web_server_logs.parquet' l ON o.order_id = l.log_id
    GROUP BY 1, 2
""").show()
```
Parquet is Spark's native format — no conversion needed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetDemo").getOrCreate()

df = spark.read.parquet("iot_sensor_readings.parquet")
df.printSchema()

df.createOrReplaceTempView("sensors")
spark.sql("""
    SELECT sensor_type,
           AVG(value_raw) AS avg_value,
           COUNT(*) AS readings,
           SUM(alert_triggered) AS total_alerts
    FROM sensors
    GROUP BY sensor_type
""").show()
```
The arrow package brings full Parquet support to R with a tidy interface:

```r
install.packages("arrow")
library(arrow)
library(dplyr)

# Read and convert to tibble
orders <- read_parquet("ecommerce_orders.parquet")
glimpse(orders)

# Use dplyr verbs directly on the Arrow table (lazy)
orders |>
  group_by(status) |>
  summarise(n = n(), total_qty = sum(quantity)) |>
  arrange(desc(n)) |>
  collect()  # pulls into an R data frame
```
Upload the files to an S3 bucket, then create an external table in Athena — pay only for data scanned:

```sql
-- Upload first: aws s3 cp *.parquet s3://my-bucket/data/

-- Create external table in Athena
CREATE EXTERNAL TABLE IF NOT EXISTS iot_readings (
  reading_id      INT,
  device_id       STRING,
  sensor_type     STRING,
  location        STRING,
  value_raw       INT,
  battery_level   INT,
  signal_strength INT,
  device_model    STRING,
  alert_triggered INT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/data/iot_sensor_readings/';

-- Query it
SELECT sensor_type, AVG(value_raw)
FROM iot_readings
GROUP BY 1;
```
Why Apache Parquet?
Columnar Storage
Data is stored column-by-column instead of row-by-row. Analytical queries that touch only a few columns skip irrelevant data entirely, slashing I/O by up to 99% on wide tables.
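You can see this layout directly in the file footer: each column of a row group is stored as its own independently sized chunk. A minimal sketch with PyArrow, assuming the sample files sit in the working directory:

```python
import pyarrow.parquet as pq

# Footer metadata only -- no data pages are read
meta = pq.read_metadata('web_server_logs.parquet')
rg = meta.row_group(0)

# One independently stored chunk per column; a query engine
# fetches only the chunks for the columns it actually needs
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.total_compressed_size, 'bytes')
```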
Efficient Compression
Columnar layout means adjacent values are similar, making compression algorithms (Snappy, GZIP, Zstd, LZ4) dramatically more effective. Real-world files are 75–90% smaller than equivalent CSV.
Rich Type System
Supports INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY (strings), BOOLEAN, and nested types (LIST, MAP, STRUCT). Schema is embedded in the file — no external definition needed.
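As a sketch of the nested types, here is how a LIST and a STRUCT column can be declared and round-tripped with PyArrow (the field names are hypothetical, not from the sample files):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Flat, LIST, and STRUCT columns in one schema
schema = pa.schema([
    ('order_id', pa.int32()),
    ('tags', pa.list_(pa.string())),
    ('shipping', pa.struct([('city', pa.string()),
                            ('zip', pa.string())])),
])

table = pa.table({
    'order_id': [1, 2],
    'tags': [['gift', 'rush'], []],
    'shipping': [{'city': 'Austin', 'zip': '78701'},
                 {'city': 'Berlin', 'zip': '10115'}],
}, schema=schema)

pq.write_table(table, 'nested_demo.parquet')
print(pq.read_schema('nested_demo.parquet'))  # schema travels with the file
```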
Predicate Pushdown
Row group and page-level statistics (min/max) let query engines skip entire chunks of data without reading them. A query filtering on WHERE id = 42 against a billion-row file might read less than 1 MB.
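PyArrow exposes pushdown through the filters argument of read_table — row groups whose statistics rule out a match are skipped before any data pages are decoded. A minimal sketch on the web-log sample:

```python
import pyarrow.parquet as pq

# Row groups whose status_code min/max excludes 404 are never decoded
errors = pq.read_table(
    'web_server_logs.parquet',
    columns=['request_path', 'status_code'],
    filters=[('status_code', '=', 404)],
).to_pandas()
print(len(errors))
```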
Cross-Platform
Read and write from Python, Java, Scala, R, C++, Go, JavaScript, and virtually every data platform: Spark, Flink, Presto, Trino, Hive, Athena, BigQuery, Redshift Spectrum, and Snowflake.
ACID & Schema Evolution
Used as the foundation for Delta Lake, Apache Iceberg, and Apache Hudi — the leading open table formats. Supports schema evolution (adding/renaming columns) without rewriting data.
Common Use Cases for These Sample Files
Parquet vs CSV vs JSON — At a Glance
| Feature | Parquet | CSV | JSON / JSONL | ORC | Avro |
|---|---|---|---|---|---|
| Storage layout | Columnar | Row | Row | Columnar | Row |
| Schema embedded | ✓ | ✗ | Partial | ✓ | ✓ |
| Compression efficiency | Excellent | Poor | Poor | Excellent | Good |
| Column pruning | ✓ | ✗ | ✗ | ✓ | ✗ |
| Predicate pushdown | ✓ | ✗ | ✗ | ✓ | ✗ |
| Nested types | ✓ | ✗ | ✓ | ✓ | ✓ |
| Human-readable | ✗ | ✓ | ✓ | ✗ | ✗ |
| Streaming write | ~ | ✓ | ✓ | ~ | ✓ |
| Analytics performance | ⚡ Best | Slow | Slow | ⚡ Best | Medium |
| Open table format base | ✓ Delta/Iceberg/Hudi | ✗ | ✗ | Partial | ✗ |
Anatomy of a Parquet File
Every .parquet file follows the same binary layout — understanding it helps you tune performance.
Magic Bytes
4-byte header PAR1 and identical 4-byte footer. Used to validate the file and detect corruption.
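This makes a cheap integrity check possible before handing a file to a full reader — a minimal sketch:

```python
def looks_like_parquet(path: str) -> bool:
    """Check the 4-byte PAR1 magic at both ends of the file."""
    with open(path, 'rb') as f:
        header = f.read(4)
        f.seek(-4, 2)            # 4 bytes from the end of the file
        footer = f.read(4)
    return header == b'PAR1' and footer == b'PAR1'

print(looks_like_parquet('ecommerce_orders.parquet'))  # True for a valid file
```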
Row Groups
Horizontal partitions of data. Default ~128 MB each. Each row group is independently decodable — great for parallel reads.
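The footer records every row-group boundary, and PyArrow can read one group at a time — a minimal sketch:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('financial_transactions.parquet')
print(pf.metadata.num_row_groups)

# Read a single row group -- independent of the rest of the file
first_group = pf.read_row_group(0)
print(first_group.num_rows)
```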
Column Chunks
Within each row group, one chunk per column. Chunks can be individually compressed with different codecs.
Pages
The smallest unit of I/O inside a column chunk (~1 MB). Each page has its own header with encoding and compression info.
Footer (Thrift)
Contains the full FileMetaData: schema, row group offsets, column statistics (min/max/nulls), and encoding info. Always read first.
Statistics
Per-column min, max, and null counts stored in the footer. Query engines use these for predicate pushdown without reading data pages.
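These statistics can be inspected straight from the footer — a minimal sketch that prints min/max and null counts for the first row group:

```python
import pyarrow.parquet as pq

meta = pq.read_metadata('iot_sensor_readings.parquet')
rg = meta.row_group(0)

for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    # Statistics may be absent for some columns; guard before reading
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, stats.min, stats.max, stats.null_count)
```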
Frequently Asked Questions
What Parquet format version do these files use?

All five files use Parquet format version 2 (the spec version stored in the FileMetaData.version field). They use PLAIN encoding for all columns and no compression, which makes them ideal for validating parsers since there is no codec dependency. They are compatible with all major readers: PyArrow ≥ 0.17, Spark ≥ 2.x, Hive ≥ 1.2, Presto, Trino, DuckDB, and Pandas.
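You can verify these details yourself from the footer metadata, without reading any data pages — a minimal sketch with PyArrow:

```python
import pyarrow.parquet as pq

meta = pq.read_metadata('ecommerce_orders.parquet')
print(meta.format_version)            # spec version, e.g. '2.6'
print(meta.num_rows, meta.num_row_groups)

# Encoding and codec of the first column chunk
col = meta.row_group(0).column(0)
print(col.encodings)                  # expect PLAIN (plus RLE for levels)
print(col.compression)                # expect UNCOMPRESSED
```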
Can I use these datasets for free, including commercially?

Yes — completely free for any purpose. All datasets are synthetically generated; they contain no real personal, financial, or proprietary information. You may use them in commercial software, open-source projects, tutorials, courses, test suites, and benchmarks without attribution.
How can I view a Parquet file without writing code?

Several GUI tools can open Parquet files directly:

- DBeaver — free, cross-platform, supports Parquet natively
- Apache Parquet Viewer — lightweight desktop app
- VS Code — install the Parquet Viewer extension
- Tad — fast tabular data viewer with Parquet support
- AWS Glue DataBrew — cloud-based, no install
- Command-line: parquet-tools schema file.parquet
What is the difference between Parquet v1 and v2?

Parquet v2 (also called "format version 2") introduced several improvements over v1: the Data Page V2 format (separate repetition/definition levels that can be compressed independently), better statistics coverage, improved encoding for nested types, and support for newer encodings like DELTA_BINARY_PACKED and BYTE_STREAM_SPLIT. Most modern tools default to v2. The version field in the footer is 2 for v2 files. The physical layout is largely compatible: v1 readers can read v2 files as long as they don't use v2-only page headers.
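When writing your own files you can pick the format version explicitly — a minimal PyArrow sketch ('2.6' enables the newest encodings, '1.0' maximizes compatibility with old readers):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'id': [1, 2, 3], 'label': ['a', 'b', 'c']})

# The version argument controls which format features the writer may use
pq.write_table(table, 'v1_style.parquet', version='1.0')
pq.write_table(table, 'v2_style.parquet', version='2.6')

print(pq.read_metadata('v2_style.parquet').format_version)  # '2.6'
```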
How do I convert a Parquet file to CSV?

With Pandas, it's one line:

```python
import pandas as pd

df = pd.read_parquet('ecommerce_orders.parquet')
df.to_csv('ecommerce_orders.csv', index=False)
```

Or with DuckDB, which handles very large files efficiently:

```python
import duckdb

duckdb.sql("""
    COPY (SELECT * FROM 'ecommerce_orders.parquet')
    TO 'ecommerce_orders.csv' (HEADER, DELIMITER ',')
""")
```
How are values physically stored in PLAIN encoding?

In PLAIN encoding: INT32 values are stored as 4 consecutive little-endian bytes; INT64 as 8 bytes LE; FLOAT as IEEE 754 single precision (4 bytes); DOUBLE as IEEE 754 double precision (8 bytes); BYTE_ARRAY (strings) as a 4-byte LE length prefix followed by raw UTF-8 bytes; and BOOLEAN as 1 bit per value, packed LSB-first. Definition and repetition levels use the RLE/bit-packed hybrid encoding regardless of the value encoding.
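To make that layout concrete, here is a small sketch that packs and unpacks PLAIN-style values with Python's struct module (it mimics the value layout only — real Parquet pages add page headers and level data):

```python
import struct

# INT32: 4 consecutive little-endian bytes per value
buf = struct.pack('<3i', 1, 2, 3)
print(buf.hex())                  # 010000000200000003000000
print(struct.unpack('<3i', buf))  # (1, 2, 3)

# BYTE_ARRAY: 4-byte LE length prefix, then raw UTF-8 bytes
s = 'delivered'.encode('utf-8')
encoded = struct.pack('<I', len(s)) + s
length = struct.unpack('<I', encoded[:4])[0]
print(encoded[4:4 + length].decode('utf-8'))  # delivered
```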
How do I load these files into BigQuery or Snowflake?

BigQuery: Upload via the console (Create Table → Upload → File format: Parquet) or use bq load --source_format=PARQUET dataset.table file.parquet. The schema is auto-detected from the embedded Parquet schema.

Snowflake: Stage the files in an S3 or GCS bucket, then run COPY INTO my_table FROM @my_stage/file.parquet FILE_FORMAT = (TYPE = 'PARQUET');. Use the $1:column_name syntax to reference columns during the COPY.
How do I create my own Parquet file?

The quickest way is with Python:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a DataFrame
df = pd.DataFrame({
    'id': range(1, 1001),
    'name': [f'User_{i}' for i in range(1, 1001)],
    'score': [i * 1.5 for i in range(1, 1001)],
    'active': [i % 2 == 0 for i in range(1, 1001)],
})

# Write with compression
table = pa.Table.from_pandas(df)
pq.write_table(table, 'my_sample.parquet',
               compression='snappy', row_group_size=512)
```
About These Sample Parquet Files
These free sample Apache Parquet files were designed to cover the most common real-world data shapes a data engineer encounters: transactional data with mixed string/integer fields (e-commerce orders, financial transactions), high-cardinality string data typical of log analytics (web server access logs), time-series-flavoured measurement data (IoT sensor readings), and narrow, low-cardinality HR data with enum-like columns (employee records). Each dataset exercises a different read pattern so you can benchmark column pruning, predicate pushdown, and decompression independently.
All files comply with the Apache Parquet Format Specification v2.6. They use uncompressed PLAIN encoding to eliminate external codec dependencies — you can validate a parser against them without linking Snappy or GZIP. For real-world compression benchmarks, load the files into PyArrow or DuckDB and re-write with compression='snappy' or compression='zstd'; you'll typically see 40–70% size reduction on the string-heavy datasets.
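For instance, a quick size comparison along those lines (a minimal sketch; the exact ratios depend on the dataset):

```python
import os
import pyarrow.parquet as pq

src = 'web_server_logs.parquet'
table = pq.read_table(src)

# Re-write the uncompressed sample with three common codecs
for codec in ('snappy', 'zstd', 'gzip'):
    out = f'web_server_logs_{codec}.parquet'
    pq.write_table(table, out, compression=codec)
    ratio = os.path.getsize(out) / os.path.getsize(src)
    print(f'{codec}: {ratio:.0%} of the uncompressed file size')
```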
These files are also useful for practicing with modern data lakehouse tooling: load them into Delta Lake via df.write.format("delta"), convert them to Apache Iceberg using the CREATE TABLE … AS SELECT syntax in Trino or Flink, or use them as the raw input layer in an Apache Hudi upsert pipeline. The schema definitions above serve as the DDL reference for any of these workflows.