Building a Survivorship Bias-Free Crypto Dataset with CoinMarketCap API
When you look at a chart of Bitcoin’s price from 2010 to today, it tells a story of volatility, resilience, and long-term gains. But what about the thousands of coins that launched, pumped, and then disappeared along the way?
Most commonly used crypto datasets, especially those tied to current exchange listings or public dashboards, tend to highlight tokens that are still actively trading. This introduces a significant survivorship bias, as failed, rugged, or delisted tokens are often excluded or missing entirely. The absence of transparent sourcing and comprehensive historical data creates blind spots in research, particularly for historical analyses and backtesting.
In traditional finance, survivorship bias has long been a well-known trap. In crypto, it’s even worse due to the sheer number of tokens launched and abandoned. If your dataset only includes coins that still exist, your strategies may look great on paper… but completely fall apart in the real world.
Key Concept
Survivorship bias — when you only analyze the winners and ignore everything that didn’t make it.
In this article, we’re going to build a custom crypto dataset that avoids survivorship bias — by including both active and defunct projects using CoinMarketCap’s historical data API.
Here’s what we’ll cover:
Start with a focused sample of 10 cryptocurrencies, including both well-known projects and failed tokens
Scale up to include the entire CoinMarketCap market, covering thousands of assets, dead and alive
Download daily OHLCV data (open, high, low, close, volume) for each token
Fetch rich metadata such as project categories, tags, and names
Use official CoinMarketCap IDs, not just common symbols like “BTC” or “ETH”
Show real examples of collapsed tokens like FTT and SafeMoon alongside long-term survivors like Bitcoin
And the best part? You can run the entire pipeline yourself — directly in Google Colab.
10 Cryptocurrency Sample Dataset
Perfect for learning the data pipeline
This notebook builds a focused dataset with 10 cryptocurrencies, including both successful projects and failed tokens. Ideal for learning how to avoid survivorship bias without long processing times.
No installation required. Click the link above to view and run the code.
The CoinMarketCap API: What You Need to Know
Before we dive into code, let’s break down how the CoinMarketCap (CMC) API works, and what you’ll need to use it effectively.
IDs, Not Just Symbols
CoinMarketCap identifies every cryptocurrency with a permanent numeric ID (called a UCID). This is not the same as a ticker symbol like BTC or ETH.
These IDs are unique, fixed, and immune to changes in branding or delistings — making them perfect for long-term research.
| Coin | Symbol | CoinMarketCap ID |
| --- | --- | --- |
| Bitcoin | BTC | 1 |
| Ethereum | ETH | 1027 |
| Binance Coin | BNB | 1839 |
| Solana | SOL | 5426 |
| XRP | XRP | 52 |
| Dogecoin | DOGE | 74 |
| Shiba Inu | SHIB | 5994 |
| Cardano | ADA | 2010 |
| FTX Token (dead) | FTT | 4195 |
| SafeMoon (dead) | SAFEMOON | 8757 |
You can find these IDs directly on each coin’s page at coinmarketcap.com. Look in the sidebar, where the ID is listed as UCID.
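As a small illustration of keying data by ID rather than by symbol, the sample coins from the table above can be held in a lookup. The names `SAMPLE_COINS` and `ids_for_symbol` are our own and not part of the CoinMarketCap API; this is a minimal sketch only.

```python
# IDs copied from the table above; SAMPLE_COINS is a hypothetical name.
SAMPLE_COINS = {
    1: ("Bitcoin", "BTC"),
    1027: ("Ethereum", "ETH"),
    1839: ("Binance Coin", "BNB"),
    5426: ("Solana", "SOL"),
    52: ("XRP", "XRP"),
    74: ("Dogecoin", "DOGE"),
    5994: ("Shiba Inu", "SHIB"),
    2010: ("Cardano", "ADA"),
    4195: ("FTX Token", "FTT"),
    8757: ("SafeMoon", "SAFEMOON"),
}


def ids_for_symbol(symbol, coins=SAMPLE_COINS):
    """Return every ID whose ticker matches (zero, one, or several)."""
    wanted = symbol.upper()
    return sorted(cid for cid, (_, sym) in coins.items() if sym == wanted)


print(ids_for_symbol("ftt"))  # [4195]
```

The helper returns a list rather than a single value because nothing guarantees that a ticker maps to exactly one ID, whereas each ID maps to exactly one project.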
API Key & Access
To use the API, you’ll need a developer account at coinmarketcap.com/api. Once registered, you’ll receive a personal API key.
🔔 As of April 16, 2025:
CoinMarketCap is offering a promotion for new users:
Use coupon hobbyist_1st_month_free for a free month of the Hobbyist plan
Or startup_1st_month_free for a free month of the Startup plan
Both of these plans include access to daily historical OHLCV data, which is essential for building a full market dataset.
We are not affiliated with CoinMarketCap or compensated in any way for this tutorial — we’re simply using their API because it’s one of the most comprehensive crypto data sources available.
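Once you have a key, each request carries it in a header. The sketch below assumes the key is stored in an environment variable named `CMC_API_KEY` (our choice of name) and that the API is reached at the `pro-api.coinmarketcap.com` host with the `X-CMC_PRO_API_KEY` header; please confirm both against CoinMarketCap's current documentation.

```python
import os

# Assumed base URL for the paid API; verify against the official docs.
BASE_URL = "https://pro-api.coinmarketcap.com"


def cmc_headers(api_key=None):
    """Build request headers, reading CMC_API_KEY from the environment
    when no key is passed explicitly (the variable name is our own)."""
    key = api_key or os.environ.get("CMC_API_KEY")
    if not key:
        raise ValueError("Set CMC_API_KEY or pass api_key explicitly")
    return {"X-CMC_PRO_API_KEY": key, "Accept": "application/json"}
```

These headers can then be passed to any HTTP client, for example `requests.get(BASE_URL + path, headers=cmc_headers(), params=...)`. Keeping the key out of the notebook itself avoids leaking it when the notebook is shared.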
Endpoints Used in This Project
We use two main endpoints from the CoinMarketCap API to construct the dataset:
1. /v2/cryptocurrency/ohlcv/historical
Returns daily historical data: open, high, low, close, volume, and market cap
Requires a specific CoinMarketCap ID (e.g., 1 for Bitcoin)
Supports custom time ranges (e.g., from 2010-01-01 to 2025-03-19)
Used to build the time series for each coin
2. /v2/cryptocurrency/info
Returns project metadata: name, symbol, category, and a list of tags
Tags include useful labels like: stablecoin, dao, mineable, layer-1, memes, etc.
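For the first endpoint, a request and its parsing might look roughly like the sketch below. The parameter names (`time_start`, `time_end`, `time_period`, `interval`, `convert`) and the response shape (a `quotes` list whose items hold `time_close` and a per-currency `quote` mapping) reflect our reading of the API and should be treated as assumptions. The function names and the choice of `time_close` as the `ts` column are also ours, not necessarily what the notebook does.

```python
import pandas as pd

COLUMNS = ["id", "ts", "open", "high", "low", "close", "volume", "market_cap"]


def ohlcv_params(cmc_id, start="2010-01-01", end="2025-03-19"):
    """Query parameters for /v2/cryptocurrency/ohlcv/historical (assumed names)."""
    return {
        "id": cmc_id,
        "time_start": start,
        "time_end": end,
        "time_period": "daily",
        "interval": "daily",
        "convert": "USD",
    }


def quotes_to_frame(cmc_id, quotes, currency="USD"):
    """Flatten a list of quote records into one row per day."""
    rows = []
    for q in quotes:
        bar = q["quote"][currency]
        rows.append({
            "id": cmc_id,
            "ts": pd.to_datetime(q["time_close"]),  # assumed timestamp field
            "open": bar["open"],
            "high": bar["high"],
            "low": bar["low"],
            "close": bar["close"],
            "volume": bar["volume"],
            "market_cap": bar["market_cap"],
        })
    # Passing columns explicitly keeps the schema stable for empty results.
    return pd.DataFrame(rows, columns=COLUMNS)
```

Separating parameter building from parsing means the parsing step can be checked on a saved response without spending API credits.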
Understanding Metadata: Tags, Categories, and Their Role
Price data tells us how a coin moves, but metadata helps us understand what kind of coin it is. CoinMarketCap includes rich metadata for every coin — from basic fields like name and category to a deeper set of descriptive tags.
Once we’ve downloaded OHLCV data for our sample coins, we’ll query the /v2/cryptocurrency/info endpoint to pull in this extra context.
What Metadata Do We Get?
Here are the key fields we care about:
| Field | Description |
| --- | --- |
| id | CoinMarketCap ID |
| name | Full project name (e.g. “Bitcoin”) |
| symbol | Ticker symbol (e.g. “BTC”) |
| category | Layer 1, DAO, Token, etc. |
| tags | A list of thematic labels (e.g. “stablecoin”, “meme”, “mineable”) |
These tags give us a powerful way to filter, group, or visualize the dataset, especially when working with thousands of coins.
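A minimal sketch of turning an info response into a table with these fields follows. It assumes the response's `data` object is a mapping keyed by ID, with each entry carrying `id`, `name`, `symbol`, `category`, and a `tags` list of strings; that shape is our assumption and `info_to_frame` is a hypothetical helper name.

```python
import pandas as pd

META_COLUMNS = ["id", "name", "symbol", "category", "tags"]


def info_to_frame(data):
    """Turn the `data` mapping of an info response into one row per coin."""
    rows = []
    for entry in data.values():
        rows.append({
            "id": entry["id"],
            "name": entry["name"],
            "symbol": entry["symbol"],
            "category": entry.get("category"),
            # Tags may be missing or null for obscure projects.
            "tags": list(entry.get("tags") or []),
        })
    return pd.DataFrame(rows, columns=META_COLUMNS)
```

Normalizing a missing `tags` value to an empty list keeps later membership checks such as `"stablecoin" in tags` from failing on sparse entries.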
Example: Tags in Action
Each crypto project in our dataset is tagged with relevant themes like mineable, memes, store-of-value, or layer-1. These tags help us quickly classify assets by purpose or technology.
For example:
Bitcoin (BTC) is tagged as mineable, pow, sha-256, and store-of-value.
Dogecoin (DOGE) shows up as a memes token and is also mineable using scrypt.
Shiba Inu (SHIB) is tagged purely as a memes token — no mining or technical layers.
These simple tags power smarter filtering, visualization, and analysis across thousands of coins.
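To make the filtering idea concrete, here is a small sketch that inverts a coin-to-tags mapping into a tag-to-coins index. The tag lists simply restate the three examples above and are not a full API response; `COIN_TAGS` and `coins_by_tag` are illustrative names.

```python
from collections import defaultdict

# Tags as described in the examples above (illustrative, not exhaustive).
COIN_TAGS = {
    "BTC": ["mineable", "pow", "sha-256", "store-of-value"],
    "DOGE": ["memes", "mineable", "scrypt"],
    "SHIB": ["memes"],
}


def coins_by_tag(coin_tags):
    """Invert {symbol: [tags]} into {tag: [symbols]}."""
    index = defaultdict(list)
    for symbol, tags in coin_tags.items():
        for tag in tags:
            index[tag].append(symbol)
    return dict(index)


index = coins_by_tag(COIN_TAGS)
print(index["memes"])     # ['DOGE', 'SHIB']
print(index["mineable"])  # ['BTC', 'DOGE']
```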
We Focus on a Curated Set of Tags
To keep things consistent and relevant, we extract just the tags we care about — specifically:
| Tag | Meaning |
| --- | --- |
| wrapped | Represents a tokenized version of another asset, often to enable cross-chain use (e.g., WBTC). |
| stablecoin | Pegged to a stable asset like USD (e.g., USDC, DAI). |
| collectibles-nfts | Related to NFTs or digital collectibles ecosystems. |
Joining these tag flags and the other metadata onto the OHLCV rows gives us one unified DataFrame (time series plus context), ready for exploration, plotting, or saving.
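One way to build that combined table is sketched below: each curated tag becomes a boolean column, and the metadata is left-joined onto the price rows by `id`. The function names are hypothetical, and the sketch assumes a metadata frame with an `id` column and a `tags` column holding lists of strings.

```python
import pandas as pd

CURATED_TAGS = ["wrapped", "stablecoin", "collectibles-nfts"]


def add_tag_flags(meta, tags=CURATED_TAGS):
    """Add one boolean column per curated tag, based on each coin's tag list."""
    out = meta.copy()
    for tag in tags:
        # Bind the current tag as a default so each column tests its own tag.
        out[tag] = out["tags"].apply(lambda ts, t=tag: t in ts)
    return out


def attach_metadata(ohlcv, meta):
    """Left-join metadata onto price rows so every daily bar keeps its context."""
    flagged = add_tag_flags(meta).drop(columns=["tags"])
    return ohlcv.merge(flagged, on="id", how="left")
```

A left join keeps every price row even when a coin has no metadata, which matters for defunct tokens whose records may be sparse.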
Visualizing Crypto Lifespans with Price Charts
Once we’ve collected our sample dataset — complete with price history and metadata — it’s time to see what that data actually looks like. Plotting historical price data reveals how long each token was active, how it performed, and often, how it ended.
This is where survivorship bias becomes visibly obvious.
[Figures: price charts for Bitcoin (BTC), FTX Token (FTT), and SafeMoon (SAFEMOON)]
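Charts like these can be produced with a few lines of matplotlib. The sketch below is not the notebook's plotting code; it assumes a per-coin DataFrame with `ts` and `close` columns, and the log scale is our own choice so that a collapse to near zero stays visible next to earlier highs.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_close(df, title):
    """Plot daily closing prices for one coin on a log-scaled y-axis."""
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(df["ts"], df["close"])
    ax.set_yscale("log")
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel("Close (USD, log scale)")
    fig.tight_layout()
    return ax
```

Calling `plot_close(df[df["id"] == 4195], "FTX Token (FTT)")` on the combined dataset would draw the FTT series, assuming the `id` column holds CoinMarketCap IDs.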
Scaling to the Full Market: Processing All 40,000 CoinMarketCap IDs
While our sample dataset of 10 cryptocurrencies is perfect for learning the process, serious researchers and traders will want comprehensive market coverage. Let’s look at how we scale this approach to include the entire CoinMarketCap universe.
The Systematic ID Approach
CoinMarketCap assigns IDs sequentially, with Bitcoin as ID 1. New projects receive higher numbers as they’re added to the platform. As of early 2025, these IDs range from 1 to approximately 40,000.
In our full market approach, we systematically process these IDs in batches:
Python
# Process all IDs from 1 to 40,000 in batches of 1,000
import pandas as pd
from tqdm import tqdm

# get_daily_OHLCV(the_id) is assumed to be defined earlier in the notebook
M = 1_000  # Batch size
for N in range(1, 40 + 1):  # 40 batches of 1,000 IDs each
    # Calculate the range for this batch
    ids = range(M * (N - 1) + 1, M * N + 1)
    # Process each ID in the batch
    res = []
    for the_id in tqdm(ids):  # Progress bar for each ID
        res.append(get_daily_OHLCV(the_id))
    # Combine and save this batch
    res = pd.concat(res)
    res.to_parquet(f"crypto/OHLCV_{N}.par")
This approach offers several advantages:
Resilience: By saving data in separate batch files, you don’t lose everything if the process is interrupted
Memory efficiency: Processing 40,000 cryptocurrencies at once would overwhelm most systems
API-friendly: Breaking the work into smaller chunks makes it easier to pause between batches and stay within API rate limits
Progress tracking: You can monitor completion by counting the saved parquet files
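The resilience and progress-tracking points can be made concrete with two small helpers: one that reproduces the batch-to-ID arithmetic, and one that lists the batches with no saved file yet, so an interrupted run can resume where it stopped. Both function names are hypothetical; the file naming follows the `OHLCV_{N}.par` pattern used above.

```python
from pathlib import Path


def batch_ids(n, size=1_000):
    """IDs covered by batch n (1-based): batch 1 is 1..1000, batch 2 is 1001..2000."""
    return range(size * (n - 1) + 1, size * n + 1)


def pending_batches(out_dir, n_batches=40):
    """Batch numbers that do not yet have a saved file in out_dir."""
    out = Path(out_dir)
    return [
        n for n in range(1, n_batches + 1)
        if not (out / f"OHLCV_{n}.par").exists()
    ]
```

Looping over `pending_batches("crypto")` instead of `range(1, 41)` would skip completed batches on a restart. This assumes a batch file is only written once the whole batch has finished, as in the loop above.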
Processing Time and Resources
Be prepared for significant processing time. In our measurements, each API call takes approximately 2 seconds, so collecting all 40,000 IDs takes roughly 40,000 × 2 s = 80,000 s, or about 22 hours of continuous runtime.
Key considerations:
Many IDs won’t return data (reserved or unused)
The actual dataset contains ~23,286 unique cryptocurrencies as of early 2025
Complete dataset comprises ~28.6 million daily observations
Storage requirements: ~600MB
Batch saving protects against potential interruptions
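Because many IDs return nothing, a batch can contain gaps or even be entirely empty, and `pd.concat` raises a `ValueError` when given no frames. A defensive wrapper such as the one below can help. It assumes the per-ID download helper returns either `None` or an empty DataFrame for unused IDs, which may differ from what the notebook does, and `concat_nonempty` is our own name.

```python
import pandas as pd


def concat_nonempty(frames):
    """Concatenate only the frames that hold data; return an empty frame if none do."""
    kept = [f for f in frames if f is not None and not f.empty]
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)
```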
After Collection: Combining the Data
Once all batches are processed, they’re combined for analysis:
Python
# Load all OHLCV batch files
from glob import glob
import pandas as pd

files = glob("crypto/OHLCV_*.par")
dfs = []
for file in files:
    df = pd.read_parquet(file)
    dfs.append(df)
df = pd.concat(dfs)

# Add statistics for each cryptocurrency
df['first_date'] = df.groupby(['id'])['ts'].transform('min')
df['last_date'] = df.groupby(['id'])['ts'].transform('max')
df['days_active'] = df.groupby(['id'])['ts'].transform('size')

print("{0:0,.0f} unique cryptocurrencies".format(df['id'].nunique()))
print("{0:0,.0f} total observations".format(len(df)))
The final dataset contains ~28.6 million daily observations across ~23,286 cryptocurrencies – both active and defunct – giving you a truly survivorship-bias-free view of the entire crypto market history.
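With the combined data in hand, one simple way to separate defunct tokens from active ones is to compare each coin's last observation with the latest date in the whole dataset. The 30-day threshold and the name `flag_inactive` below are our own illustrative choices, not a CoinMarketCap definition of a dead coin.

```python
import pandas as pd


def flag_inactive(df, max_gap_days=30):
    """Mark coins whose last observation is more than max_gap_days
    before the latest date in the dataset."""
    last_seen = df.groupby("id")["ts"].max()
    cutoff = df["ts"].max() - pd.Timedelta(days=max_gap_days)
    summary = last_seen.to_frame("last_date")
    summary["inactive"] = summary["last_date"] < cutoff
    return summary
```

The result is indexed by `id`, so `flag_inactive(df)["inactive"].sum()` counts how many coins stopped reporting data before the cutoff.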
Conclusion: Beyond Survivorship Bias
This comprehensive dataset overcomes one of crypto’s biggest analytical blind spots by including both winners and losers across market history. With ~23,286 cryptocurrencies and ~28.6 million daily observations, you now have the foundation for more realistic backtesting, risk assessment, and market research.
The difference between successful and unsuccessful crypto investing often comes down to seeing the complete picture – not just the survivors. This dataset gives you that edge.