When you look at a chart of Bitcoin’s price from 2010 to today, it tells a story of volatility, resilience, and long-term gains. But what about the thousands of coins that launched, pumped, and then disappeared along the way?

Most commonly used crypto datasets, especially those tied to current exchange listings or public dashboards, tend to highlight tokens that are still actively trading. This introduces a significant survivorship bias, as failed, rugged, or delisted tokens are often excluded or missing entirely. The absence of transparent sourcing and comprehensive historical data creates blind spots in research, particularly for historical analyses and backtesting.

To address this, we’re sharing the code used to build the dataset behind our recent paper, Catching Crypto Trends: A Tactical Approach for Bitcoin and Altcoins. Our goal is to help bring more transparency and reproducibility to crypto market research.

In traditional finance, survivorship bias has long been a well-known trap. In crypto, it’s even worse due to the sheer number of tokens launched and abandoned. If your dataset only includes coins that still exist, your strategies may look great on paper… but completely fall apart in the real world.

Key Concept
Survivorship bias — when you only analyze the winners and ignore everything that didn’t make it.

In this article, we’re going to build a custom crypto dataset that avoids survivorship bias — by including both active and defunct projects using CoinMarketCap’s historical data API.

Here’s what we’ll cover:

  • How the CoinMarketCap API works: UCIDs, API keys, and the endpoints we rely on
  • Downloading daily OHLCV history for both active and defunct coins
  • Pulling metadata (categories and tags) and turning a curated set of tags into boolean columns
  • Merging price history and metadata into one survivorship-bias-free dataset
  • Scaling from a 10-coin sample to the full CoinMarketCap universe of roughly 40,000 IDs

And the best part?
You can run the entire pipeline yourself — directly in Google Colab.

10 Cryptocurrency Sample Dataset

Perfect for learning the data pipeline

This notebook builds a focused dataset with 10 cryptocurrencies, including both successful projects and failed tokens. Ideal for learning how to avoid survivorship bias without long processing times.

Open in Google Colab

No installation required. Click the link above to view and run the code.

Complete Crypto Market Dataset

Comprehensive market coverage

This notebook builds a comprehensive dataset covering the entire CoinMarketCap market, including thousands of active and defunct crypto projects.

Warning: Full execution takes over 20 hours to complete due to rate limits.
Open in Google Colab

No installation required. Click the link above to view and run the code.

The CoinMarketCap API: What You Need to Know

Before we dive into code, let’s break down how the CoinMarketCap (CMC) API works, and what you’ll need to use it effectively.

IDs, Not Just Symbols

CoinMarketCap identifies every cryptocurrency with a permanent numeric ID (called UCID). This is not the same as a ticker symbol like BTC or ETH.

These IDs are unique, fixed, and immune to changes in branding or delistings — making them perfect for long-term research.

Coin | Symbol | CoinMarketCap ID
Bitcoin | BTC | 1
Ethereum | ETH | 1027
Binance Coin | BNB | 1839
Solana | SOL | 5426
XRP | XRP | 52
Dogecoin | DOGE | 74
Shiba Inu | SHIB | 5994
Cardano | ADA | 2010
FTX Token (dead) | FTT | 4195
SafeMoon (dead) | SAFEMOON | 8757

You can find these IDs directly on each coin’s page at coinmarketcap.com. Look in the sidebar, where it’s listed as UCID.
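
Since the sample notebook only needs ten coins, it can start from a small, hard-coded mapping of names to UCIDs. Here is a minimal sketch based on the table above (the SAMPLE_IDS name is just for illustration):

Python
# UCIDs for the 10 sample coins, taken from the table above.
# The keys are only human-readable labels; the API works purely with the IDs.
SAMPLE_IDS = {
    "Bitcoin": 1,
    "Ethereum": 1027,
    "Binance Coin": 1839,
    "Solana": 5426,
    "XRP": 52,
    "Dogecoin": 74,
    "Shiba Inu": 5994,
    "Cardano": 2010,
    "FTX Token": 4195,   # defunct
    "SafeMoon": 8757,    # defunct
}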

API Key & Access

To use the API, you’ll need a developer account at coinmarketcap.com/api. Once registered, you’ll receive a personal API key.
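
If you want to verify your key right away, the API reads it from the X-CMC_PRO_API_KEY request header. A minimal sketch (the API_KEY placeholder is yours to fill in):

Python
import requests

API_KEY = "YOUR_API_KEY_HERE"  # personal key from your CoinMarketCap developer account

# Reusable session that sends the key with every request
session = requests.Session()
session.headers.update({
    "X-CMC_PRO_API_KEY": API_KEY,
    "Accept": "application/json",
})

# Quick sanity check: request metadata for Bitcoin (UCID 1)
resp = session.get(
    "https://pro-api.coinmarketcap.com/v2/cryptocurrency/info",
    params={"id": 1},
)
print(resp.status_code)  # 200 means the key is accepted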

🔔 As of April 16, 2025:

CoinMarketCap is offering a promotion for new users:

  • Use coupon hobbyist_1st_month_free for a free month of the Hobbyist plan
  • Or startup_1st_month_free for a free month of the Startup plan

Both of these plans include access to daily historical OHLCV data, which is essential for building a full market dataset.

We are not affiliated with CoinMarketCap or compensated in any way for this tutorial — we’re simply using their API because it’s one of the most comprehensive crypto data sources available.

Endpoints Used in This Project

We use two main endpoints from the CoinMarketCap API to construct the dataset:

1. /v2/cryptocurrency/ohlcv/historical: the full daily OHLCV (open, high, low, close, volume) history for a given coin ID (a download sketch follows below)
2. /v2/cryptocurrency/info: metadata for each coin ID, including its name, symbol, category, and tags
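
The historical endpoint does the heavy lifting. Below is a hedged sketch of a per-coin download helper; the name get_daily_OHLCV matches the function used in the batch loop later on, but the parameters and response parsing here are assumptions based on the public API docs and may differ from the notebook’s version. It reuses the authenticated session from the previous snippet.

Python
import pandas as pd

def get_daily_OHLCV(coin_id, convert="USD"):
    """Download the full daily OHLCV history for one CoinMarketCap ID (sketch)."""
    resp = session.get(
        "https://pro-api.coinmarketcap.com/v2/cryptocurrency/ohlcv/historical",
        params={
            "id": coin_id,
            "convert": convert,
            "interval": "daily",
            "time_period": "daily",
            "count": 10_000,  # effectively "all available history" at daily resolution
        },
    )
    data = resp.json().get("data", {})

    # Depending on the response shape, quotes may sit directly under "data"
    # or under data[str(coin_id)]; handle both defensively.
    quotes = data.get("quotes") or data.get(str(coin_id), {}).get("quotes", [])

    rows = []
    for q in quotes:
        bar = q["quote"][convert]
        rows.append({
            "id": coin_id,
            "ts": pd.to_datetime(bar["timestamp"]),
            "open": bar["open"],
            "high": bar["high"],
            "low": bar["low"],
            "close": bar["close"],
            "volume": bar["volume"],
            "market_cap": bar.get("market_cap"),
        })
    return pd.DataFrame(rows)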

Understanding Metadata: Tags, Categories, and Their Role

Price data tells us how a coin moves, but metadata helps us understand what kind of coin it is. CoinMarketCap includes rich metadata for every coin — from basic fields like name and category to a deeper set of descriptive tags.

Once we’ve downloaded OHLCV data for our sample coins, we’ll query the /v2/cryptocurrency/info endpoint to pull in this extra context.

What Metadata Do We Get?

Here are the key fields we care about:

Field | Description
id | CoinMarketCap ID
name | Full project name (e.g. “Bitcoin”)
symbol | Ticker symbol (e.g. “BTC”)
category | Layer 1, DAO, Token, etc.
tags | A list of thematic labels (e.g. “stablecoin”, “meme”, “mineable”)

These tags give us a powerful way to filter, group, or visualize the dataset, especially when working with thousands of coins.
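
Below is a hedged sketch of pulling those fields for a set of IDs. The /v2/cryptocurrency/info endpoint accepts a comma-separated id list; the helper name and the SAMPLE_IDS mapping are carried over from the earlier sketches and are illustrative, not the notebook’s exact code.

Python
import pandas as pd

def get_metadata(coin_ids):
    """Fetch id, name, symbol, category and tags for a list of CoinMarketCap IDs (sketch)."""
    resp = session.get(
        "https://pro-api.coinmarketcap.com/v2/cryptocurrency/info",
        params={"id": ",".join(str(i) for i in coin_ids)},
    )
    data = resp.json().get("data", {})

    rows = []
    for coin_id, info in data.items():
        if isinstance(info, list):  # some response shapes wrap the object in a list
            info = info[0]
        rows.append({
            "id": int(coin_id),
            "name": info.get("name"),
            "symbol": info.get("symbol"),
            "category": info.get("category"),
            "tags": info.get("tags") or [],
        })
    return pd.DataFrame(rows)

metadata_df = get_metadata(SAMPLE_IDS.values())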

Example: Tags in Action

Each crypto project in our dataset is tagged with relevant themes like mineable, memes, store-of-value, or layer-1. These tags help us quickly classify assets by purpose or technology.

For example:

  • Bitcoin is tagged mineable, pow, sha-256, store-of-value, and layer-1.
  • Dogecoin carries memes, mineable, and scrypt.
  • Shiba Inu is simply tagged memes.

These simple tags power smarter filtering, visualization, and analysis across thousands of coins.

We Focus on a Curated Set of Tags

To keep things consistent and relevant, we extract just the tags we care about — specifically:

Tag | Meaning
wrapped | Represents a tokenized version of another asset, often to enable cross-chain use (e.g., WBTC).
stablecoin | Pegged to a stable asset like USD (e.g., USDC, DAI).
collectibles-nfts | Related to NFTs or digital collectibles ecosystems.
memes | Community-driven, meme-inspired tokens (e.g., SHIB, DOGE).
iot | Internet-of-Things focused projects (e.g., IOTA, Helium).
dao | Governed by decentralized autonomous organizations.
governance | Includes voting features or token-based protocol control.
mineable | Uses mining to create new coins (typically via PoW).
pow | Proof-of-Work consensus (e.g., Bitcoin).
pos | Proof-of-Stake consensus (e.g., Cardano, ETH 2.0).
sha-256 | Uses the SHA-256 hashing algorithm (common with Bitcoin forks).
store-of-value | Intended to hold long-term value like gold (e.g., Bitcoin).
medium-of-exchange | Designed for everyday transactions (e.g., Litecoin).
scrypt | Uses Scrypt hashing (used in Litecoin and Dogecoin).
layer-1 | Base blockchain protocols (e.g., Bitcoin, Ethereum).
layer-2 | Scaling solutions that operate on top of Layer 1 chains (e.g., Polygon, Optimism).

These are turned into boolean columns — so for each coin, you’ll know instantly whether it’s mineable, a meme token, a DAO, etc.
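
A minimal sketch of that step, assuming metadata_df has the tags column from the sketch above (the TAGS_OF_INTEREST name is illustrative):

Python
# Curated tags that become boolean columns
TAGS_OF_INTEREST = [
    "wrapped", "stablecoin", "collectibles-nfts", "memes", "iot",
    "dao", "governance", "mineable", "pow", "pos", "sha-256",
    "store-of-value", "medium-of-exchange", "scrypt", "layer-1", "layer-2",
]

for tag in TAGS_OF_INTEREST:
    # True if the coin's tag list contains this tag, False otherwise
    metadata_df[tag] = metadata_df["tags"].apply(lambda tags, t=tag: t in tags)

# The raw tag list is no longer needed once the boolean columns exist
metadata_df = metadata_df.drop(columns=["tags"])

With these columns in place, filtering is a one-liner: for example, metadata_df[metadata_df["memes"]] selects every meme token in the dataset.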

Merging Metadata with OHLCV

Once both datasets are downloaded (price data + metadata), we merge them on the id column:

Python
df_final = pd.merge(ohlcv_df, metadata_df.drop(columns=['symbol', 'name']), on='id', how='left')

This gives us one unified DataFrame — time series + context — ready for exploration, plotting, or saving.
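
A quick way to sanity-check and persist the merged result, assuming the ts and close column names used in the sketches above (the output file name is arbitrary):

Python
# Inspect the merged frame and save it for later analysis
print(df_final.shape)
print(df_final[["id", "ts", "close", "memes", "mineable"]].head())

df_final.to_parquet("crypto_sample_dataset.par")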

Visualizing Crypto Lifespans with Price Charts

Once we’ve collected our sample dataset — complete with price history and metadata — it’s time to see what that data actually looks like. Plotting historical price data reveals how long each token was active, how it performed, and often, how it ended.

This is where survivorship bias becomes plainly visible.

Bitcoin (BTC)

FTX Token (FTT)

SafeMoon (SAFEMOON)
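
The charts come from the notebook, but a minimal matplotlib sketch reproduces the same kind of view, assuming the df_final frame and SAMPLE_IDS mapping from the sketches above:

Python
import matplotlib.pyplot as plt

# One panel per coin; a log scale makes multi-year rallies and terminal
# collapses easier to compare on the same footing
for name, coin_id in SAMPLE_IDS.items():
    coin = df_final[df_final["id"] == coin_id].sort_values("ts")
    if coin.empty:
        continue
    plt.figure(figsize=(8, 3))
    plt.plot(coin["ts"], coin["close"])
    plt.yscale("log")
    plt.title(f"{name} (UCID {coin_id}) daily close")
    plt.tight_layout()
    plt.show()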

Scaling to the Full Market: Processing All 40,000 CoinMarketCap IDs

While our sample dataset of 10 cryptocurrencies is perfect for learning the process, serious researchers and traders will want comprehensive market coverage. Let’s look at how we scale this approach to include the entire CoinMarketCap universe.

The Systematic ID Approach

CoinMarketCap assigns IDs sequentially, with Bitcoin as ID 1. New projects receive higher numbers as they’re added to the platform. As of early 2025, these IDs range from 1 to approximately 40,000.

In our full market approach, we systematically process these IDs in batches:

Python
# Process all IDs from 1 to 40,000 in batches of 1,000
import pandas as pd
from tqdm import tqdm

M = 1_000  # Batch size

for N in range(1, 40 + 1):  # 40 batches of 1,000 IDs each
    # IDs covered by this batch: 1-1000 for N=1, 1001-2000 for N=2, ...
    ids = range(M * (N - 1) + 1, M * N + 1)

    # Download daily OHLCV history for each ID in the batch
    res = []
    for the_id in tqdm(ids):  # Progress bar for each ID
        res.append(get_daily_OHLCV(the_id))

    # Combine the batch into a single DataFrame and save it to disk
    res = pd.concat(res)
    res.to_parquet(f"crypto/OHLCV_{N}.par")

This approach offers several advantages:

  1. Resilience: By saving data in separate batch files, you don’t lose everything if the process is interrupted (a resumable version of the loop is sketched after this list)
  2. Memory efficiency: Processing 40,000 cryptocurrencies at once would overwhelm most systems
  3. API-friendly: Breaking the work into smaller chunks respects API rate limits
  4. Progress tracking: You can monitor completion by counting the saved parquet files
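
Because each batch lands in its own file, the loop can be made resumable with a simple existence check. A sketch, reusing M, get_daily_OHLCV, pd, and tqdm from the batch loop above:

Python
import os

for N in range(1, 40 + 1):
    out_path = f"crypto/OHLCV_{N}.par"
    if os.path.exists(out_path):
        # Batch already downloaded in a previous session, skip it
        continue

    ids = range(M * (N - 1) + 1, M * N + 1)
    res = pd.concat(get_daily_OHLCV(the_id) for the_id in tqdm(ids))
    res.to_parquet(out_path)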

Processing Time and Resources

Be prepared for significant processing time. Real-world measurements show each API call takes approximately 2 seconds, so collecting all 40,000 IDs requires roughly 22 hours of continuous runtime (40,000 calls × 2 seconds ≈ 80,000 seconds).

Key considerations:

  • API rate limits depend on your plan, so pace requests accordingly
  • A full run takes more than 20 hours, so your environment (e.g., a Colab session) has to stay alive for the whole collection
  • Each batch is saved to its own parquet file, so an interrupted run can pick up from the last completed batch

After Collection: Combining the Data

Once all batches are processed, they’re combined for analysis:

Python
# Load all OHLCV batch files and stack them into one DataFrame
from glob import glob
import pandas as pd

files = glob("crypto/OHLCV_*.par")
dfs = []
for file in files:
    df = pd.read_parquet(file)
    dfs.append(df)
df = pd.concat(dfs)

# Add per-coin lifecycle statistics (ts is the daily timestamp column)
df['first_date'] = df.groupby(['id'])['ts'].transform('min')
df['last_date'] = df.groupby(['id'])['ts'].transform('max')
df['days_active'] = df.groupby(['id'])['ts'].transform('size')

print("{0:0,.0f} unique cryptocurrencies".format(df['id'].nunique()))
print("{0:0,.0f} total observations".format(len(df)))

The final dataset contains ~28.6 million daily observations across ~23,286 cryptocurrencies – both active and defunct – giving you a truly survivorship-bias-free view of the entire crypto market history.

Conclusion: Beyond Survivorship Bias

This comprehensive dataset overcomes one of crypto’s biggest analytical blind spots by including both winners and losers across market history. With ~23,286 cryptocurrencies and ~28.6 million daily observations, you now have the foundation for more realistic backtesting, risk assessment, and market research.

The difference between successful and unsuccessful crypto investing often comes down to seeing the complete picture – not just the survivors. This dataset gives you that edge.

Get started with either version:

Complete Crypto Market Dataset – Build the full ~23K cryptocurrency database

10 Cryptocurrency Sample Dataset – Learn the process quickly with a focused sample