When you look at a chart of Bitcoin’s price from 2010 to today, it tells a story of volatility, resilience, and long-term gains. But what about the thousands of coins that launched, pumped, and then disappeared along the way?

Most commonly used crypto datasets, especially those tied to current exchange listings or public dashboards, tend to highlight tokens that are still actively trading. This introduces a significant survivorship bias, as failed, rugged, or delisted tokens are often excluded or missing entirely. The absence of transparent sourcing and comprehensive historical data creates blind spots in research, particularly for historical analyses and backtesting.

To address this, we’re sharing the code used to build the dataset behind our recent paper, Catching Crypto Trends: A Tactical Approach for Bitcoin and Altcoins. Our goal is to help bring more transparency and reproducibility to crypto market research.

In traditional finance, survivorship bias has long been a well-known trap. In crypto, it’s even worse due to the sheer number of tokens launched and abandoned. If your dataset only includes coins that still exist, your strategies may look great on paper… but completely fall apart in the real world.

Key Concept
Survivorship bias — when you only analyze the winners and ignore everything that didn’t make it.

In this article, we’re going to build a custom crypto dataset that avoids survivorship bias — by including both active and defunct projects using CoinMarketCap’s historical data API.

Here’s what we’ll cover:

  • How the CoinMarketCap API works: UCIDs, API keys, and the endpoints we rely on
  • Downloading daily OHLCV history for both active and defunct coins
  • Pulling metadata (categories and tags) and turning a curated set of tags into boolean columns
  • Merging price history and metadata into one survivorship-bias-free dataset
  • Scaling from a 10-coin sample to the full CoinMarketCap universe of roughly 40,000 IDs

And the best part?
You can run the entire pipeline yourself — directly in Google Colab.

10 Cryptocurrency Sample Dataset

Perfect for learning the data pipeline

This notebook builds a focused dataset with 10 cryptocurrencies, including both successful projects and failed tokens. Ideal for learning how to avoid survivorship bias without long processing times.

Open in Google Colab

No installation required. Click the link above to view and run the code.

Complete Crypto Market Dataset

Comprehensive market coverage

This notebook builds a comprehensive dataset covering the entire CoinMarketCap market, including thousands of active and defunct crypto projects.

Warning: Full execution takes over 20 hours to complete due to rate limits.
Open in Google Colab

No installation required. Click the link above to view and run the code.

The CoinMarketCap API: What You Need to Know

Before we dive into code, let’s break down how the CoinMarketCap (CMC) API works, and what you’ll need to use it effectively.

IDs, Not Just Symbols

CoinMarketCap identifies every cryptocurrency with a permanent numeric ID (called UCID). This is not the same as a ticker symbol like BTC or ETH.

These IDs are unique, fixed, and immune to changes in branding or delistings — making them perfect for long-term research.

Coin | Symbol | CoinMarketCap ID
Bitcoin | BTC | 1
Ethereum | ETH | 1027
Binance Coin | BNB | 1839
Solana | SOL | 5426
XRP | XRP | 52
Dogecoin | DOGE | 74
Shiba Inu | SHIB | 5994
Cardano | ADA | 2010
FTX Token (dead) | FTT | 4195
SafeMoon (dead) | SAFEMOON | 8757

You can find these IDs directly on each coin’s page at coinmarketcap.com. Look in the sidebar, where it’s listed as UCID.
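
Since the sample notebook only needs ten coins, it can start from a small, hard-coded mapping of names to UCIDs. Here is a minimal sketch based on the table above (the SAMPLE_IDS name is just for illustration):

Python
# UCIDs for the 10 sample coins, taken from the table above.
# The keys are only human-readable labels; the API works purely with the IDs.
SAMPLE_IDS = {
    "Bitcoin": 1,
    "Ethereum": 1027,
    "Binance Coin": 1839,
    "Solana": 5426,
    "XRP": 52,
    "Dogecoin": 74,
    "Shiba Inu": 5994,
    "Cardano": 2010,
    "FTX Token": 4195,   # defunct
    "SafeMoon": 8757,    # defunct
}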

API Key & Access

To use the API, you’ll need a developer account at coinmarketcap.com/api. Once registered, you’ll receive a personal API key.
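
If you want to verify your key right away, the API reads it from the X-CMC_PRO_API_KEY request header. A minimal sketch (the API_KEY placeholder is yours to fill in):

Python
import requests

API_KEY = "YOUR_API_KEY_HERE"  # personal key from your CoinMarketCap developer account

# Reusable session that sends the key with every request
session = requests.Session()
session.headers.update({
    "X-CMC_PRO_API_KEY": API_KEY,
    "Accept": "application/json",
})

# Quick sanity check: request metadata for Bitcoin (UCID 1)
resp = session.get(
    "https://pro-api.coinmarketcap.com/v2/cryptocurrency/info",
    params={"id": 1},
)
print(resp.status_code)  # 200 means the key is accepted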

🔔 As of April 16, 2025:

CoinMarketCap is offering a promotion for new users:

  • Use coupon hobbyist_1st_month_free for a free month of the Hobbyist plan
  • Or startup_1st_month_free for a free month of the Startup plan

Both of these plans include access to daily historical OHLCV data, which is essential for building a full market dataset.

We are not affiliated with CoinMarketCap or compensated in any way for this tutorial — we’re simply using their API because it’s one of the most comprehensive crypto data sources available.

Endpoints Used in This Project

We use two main endpoints from the CoinMarketCap API to construct the dataset:

1. /v2/cryptocurrency/ohlcv/historical: the full daily OHLCV (open, high, low, close, volume) history for a given coin ID (a download sketch follows below)
2. /v2/cryptocurrency/info: metadata for each coin ID, including its name, symbol, category, and tags
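
The historical endpoint does the heavy lifting. Below is a hedged sketch of a per-coin download helper; the name get_daily_OHLCV matches the function used in the batch loop later on, but the parameters and response parsing here are assumptions based on the public API docs and may differ from the notebook’s version. It reuses the authenticated session from the previous snippet.

Python
import pandas as pd

def get_daily_OHLCV(coin_id, convert="USD"):
    """Download the full daily OHLCV history for one CoinMarketCap ID (sketch)."""
    resp = session.get(
        "https://pro-api.coinmarketcap.com/v2/cryptocurrency/ohlcv/historical",
        params={
            "id": coin_id,
            "convert": convert,
            "interval": "daily",
            "time_period": "daily",
            "count": 10_000,  # effectively "all available history" at daily resolution
        },
    )
    data = resp.json().get("data", {})

    # Depending on the response shape, quotes may sit directly under "data"
    # or under data[str(coin_id)]; handle both defensively.
    quotes = data.get("quotes") or data.get(str(coin_id), {}).get("quotes", [])

    rows = []
    for q in quotes:
        bar = q["quote"][convert]
        rows.append({
            "id": coin_id,
            "ts": pd.to_datetime(bar["timestamp"]),
            "open": bar["open"],
            "high": bar["high"],
            "low": bar["low"],
            "close": bar["close"],
            "volume": bar["volume"],
            "market_cap": bar.get("market_cap"),
        })
    return pd.DataFrame(rows)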

Understanding Metadata: Tags, Categories, and Their Role

Price data tells us how a coin moves, but metadata helps us understand what kind of coin it is. CoinMarketCap includes rich metadata for every coin — from basic fields like name and category to a deeper set of descriptive tags.

Once we’ve downloaded OHLCV data for our sample coins, we’ll query the /v2/cryptocurrency/info endpoint to pull in this extra context.

What Metadata Do We Get?

Here are the key fields we care about:

Field | Description
id | CoinMarketCap ID
name | Full project name (e.g. “Bitcoin”)
symbol | Ticker symbol (e.g. “BTC”)
category | Layer 1, DAO, Token, etc.
tags | A list of thematic labels (e.g. “stablecoin”, “meme”, “mineable”)

These tags give us a powerful way to filter, group, or visualize the dataset, especially when working with thousands of coins.
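
Below is a hedged sketch of pulling those fields for a set of IDs. The /v2/cryptocurrency/info endpoint accepts a comma-separated id list; the helper name and the SAMPLE_IDS mapping are carried over from the earlier sketches and are illustrative, not the notebook’s exact code.

Python
import pandas as pd

def get_metadata(coin_ids):
    """Fetch id, name, symbol, category and tags for a list of CoinMarketCap IDs (sketch)."""
    resp = session.get(
        "https://pro-api.coinmarketcap.com/v2/cryptocurrency/info",
        params={"id": ",".join(str(i) for i in coin_ids)},
    )
    data = resp.json().get("data", {})

    rows = []
    for coin_id, info in data.items():
        if isinstance(info, list):  # some response shapes wrap the object in a list
            info = info[0]
        rows.append({
            "id": int(coin_id),
            "name": info.get("name"),
            "symbol": info.get("symbol"),
            "category": info.get("category"),
            "tags": info.get("tags") or [],
        })
    return pd.DataFrame(rows)

metadata_df = get_metadata(SAMPLE_IDS.values())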

Example: Tags in Action

Each crypto project in our dataset is tagged with relevant themes like mineable, memes, store-of-value, or layer-1. These tags help us quickly classify assets by purpose or technology.

For example:

  • Bitcoin is tagged mineable, pow, sha-256, store-of-value, and layer-1.
  • Dogecoin carries memes, mineable, and scrypt.
  • Shiba Inu is simply tagged memes.

These simple tags power smarter filtering, visualization, and analysis across thousands of coins.

We Focus on a Curated Set of Tags

To keep things consistent and relevant, we extract just the tags we care about — specifically:

Tag | Meaning
wrapped | Represents a tokenized version of another asset, often to enable cross-chain use (e.g., WBTC).
stablecoin | Pegged to a stable asset like USD (e.g., USDC, DAI).
collectibles-nfts | Related to NFTs or digital collectibles ecosystems.
memes | Community-driven, meme-inspired tokens (e.g., SHIB, DOGE).
iot | Internet-of-Things focused projects (e.g., IOTA, Helium).
dao | Governed by decentralized autonomous organizations.
governance | Includes voting features or token-based protocol control.
mineable | Uses mining to create new coins (typically via PoW).
pow | Proof-of-Work consensus (e.g., Bitcoin).
pos | Proof-of-Stake consensus (e.g., Cardano, ETH 2.0).
sha-256 | Uses the SHA-256 hashing algorithm (common with Bitcoin forks).
store-of-value | Intended to hold long-term value like gold (e.g., Bitcoin).
medium-of-exchange | Designed for everyday transactions (e.g., Litecoin).
scrypt | Uses Scrypt hashing (used in Litecoin and Dogecoin).
layer-1 | Base blockchain protocols (e.g., Bitcoin, Ethereum).
layer-2 | Scaling solutions that operate on top of Layer 1 chains (e.g., Polygon, Optimism).

These are turned into boolean columns — so for each coin, you’ll know instantly whether it’s mineable, a meme token, a DAO, etc.
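
A minimal sketch of that step, assuming metadata_df has the tags column from the sketch above (the TAGS_OF_INTEREST name is illustrative):

Python
# Curated tags that become boolean columns
TAGS_OF_INTEREST = [
    "wrapped", "stablecoin", "collectibles-nfts", "memes", "iot",
    "dao", "governance", "mineable", "pow", "pos", "sha-256",
    "store-of-value", "medium-of-exchange", "scrypt", "layer-1", "layer-2",
]

for tag in TAGS_OF_INTEREST:
    # True if the coin's tag list contains this tag, False otherwise
    metadata_df[tag] = metadata_df["tags"].apply(lambda tags, t=tag: t in tags)

# The raw tag list is no longer needed once the boolean columns exist
metadata_df = metadata_df.drop(columns=["tags"])

With these columns in place, filtering is a one-liner: for example, metadata_df[metadata_df["memes"]] selects every meme token in the dataset.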

Merging Metadata with OHLCV

Once both datasets are downloaded (price data + metadata), we merge them on the id column:

Python
df_final = pd.merge(ohlcv_df, metadata_df.drop(columns=['symbol', 'name']), on='id', how='left')

This gives us one unified DataFrame — time series + context — ready for exploration, plotting, or saving.
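
A quick way to sanity-check and persist the merged result, assuming the ts and close column names used in the sketches above (the output file name is arbitrary):

Python
# Inspect the merged frame and save it for later analysis
print(df_final.shape)
print(df_final[["id", "ts", "close", "memes", "mineable"]].head())

df_final.to_parquet("crypto_sample_dataset.par")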

Visualizing Crypto Lifespans with Price Charts

Once we’ve collected our sample dataset — complete with price history and metadata — it’s time to see what that data actually looks like. Plotting historical price data reveals how long each token was active, how it performed, and often, how it ended.

This is where survivorship bias becomes plainly visible.

Bitcoin (BTC)

FTX Token (FTT)

SafeMoon (SAFEMOON)
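
The charts come from the notebook, but a minimal matplotlib sketch reproduces the same kind of view, assuming the df_final frame and SAMPLE_IDS mapping from the sketches above:

Python
import matplotlib.pyplot as plt

# One panel per coin; a log scale makes multi-year rallies and terminal
# collapses easier to compare on the same footing
for name, coin_id in SAMPLE_IDS.items():
    coin = df_final[df_final["id"] == coin_id].sort_values("ts")
    if coin.empty:
        continue
    plt.figure(figsize=(8, 3))
    plt.plot(coin["ts"], coin["close"])
    plt.yscale("log")
    plt.title(f"{name} (UCID {coin_id}) daily close")
    plt.tight_layout()
    plt.show()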

Scaling to the Full Market: Processing All 40,000 CoinMarketCap IDs

While our sample dataset of 10 cryptocurrencies is perfect for learning the process, serious researchers and traders will want comprehensive market coverage. Let’s look at how we scale this approach to include the entire CoinMarketCap universe.

The Systematic ID Approach

CoinMarketCap assigns IDs sequentially, with Bitcoin as ID 1. New projects receive higher numbers as they’re added to the platform. As of early 2025, these IDs range from 1 to approximately 40,000.

In our full market approach, we systematically process these IDs in batches:

Python
# Process all IDs from 1 to 40,000 in batches of 1,000
import pandas as pd
from tqdm import tqdm

M = 1_000  # Batch size

for N in range(1, 40 + 1):  # 40 batches of 1,000 IDs each
    # IDs covered by this batch: 1-1000 for N=1, 1001-2000 for N=2, ...
    ids = range(M * (N - 1) + 1, M * N + 1)

    # Download daily OHLCV history for each ID in the batch
    res = []
    for the_id in tqdm(ids):  # Progress bar for each ID
        res.append(get_daily_OHLCV(the_id))

    # Combine the batch into a single DataFrame and save it to disk
    res = pd.concat(res)
    res.to_parquet(f"crypto/OHLCV_{N}.par")

This approach offers several advantages:

  1. Resilience: By saving data in separate batch files, you don’t lose everything if the process is interrupted (a resumable version of the loop is sketched after this list)
  2. Memory efficiency: Processing 40,000 cryptocurrencies at once would overwhelm most systems
  3. API-friendly: Breaking the work into smaller chunks respects API rate limits
  4. Progress tracking: You can monitor completion by counting the saved parquet files
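
Because each batch lands in its own file, the loop can be made resumable with a simple existence check. A sketch, reusing M, get_daily_OHLCV, pd, and tqdm from the batch loop above:

Python
import os

for N in range(1, 40 + 1):
    out_path = f"crypto/OHLCV_{N}.par"
    if os.path.exists(out_path):
        # Batch already downloaded in a previous session, skip it
        continue

    ids = range(M * (N - 1) + 1, M * N + 1)
    res = pd.concat(get_daily_OHLCV(the_id) for the_id in tqdm(ids))
    res.to_parquet(out_path)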

Processing Time and Resources

Be prepared for significant processing time. Real-world measurements show each API call takes approximately 2 seconds, so collecting all 40,000 IDs requires roughly 22 hours of continuous runtime (40,000 calls × 2 seconds ≈ 80,000 seconds).

Key considerations:

  • API rate limits depend on your plan, so pace requests accordingly
  • A full run takes more than 20 hours, so your environment (e.g., a Colab session) has to stay alive for the whole collection
  • Each batch is saved to its own parquet file, so an interrupted run can pick up from the last completed batch

After Collection: Combining the Data

Once all batches are processed, they’re combined for analysis:

Python
# Load all OHLCV batch files and stack them into one DataFrame
from glob import glob
import pandas as pd

files = glob("crypto/OHLCV_*.par")
dfs = []
for file in files:
    df = pd.read_parquet(file)
    dfs.append(df)
df = pd.concat(dfs)

# Add per-coin lifecycle statistics (ts is the daily timestamp column)
df['first_date'] = df.groupby(['id'])['ts'].transform('min')
df['last_date'] = df.groupby(['id'])['ts'].transform('max')
df['days_active'] = df.groupby(['id'])['ts'].transform('size')

print("{0:0,.0f} unique cryptocurrencies".format(df['id'].nunique()))
print("{0:0,.0f} total observations".format(len(df)))

The final dataset contains ~28.6 million daily observations across ~23,286 cryptocurrencies – both active and defunct – giving you a truly survivorship-bias-free view of the entire crypto market history.

Conclusion: Beyond Survivorship Bias

This comprehensive dataset overcomes one of crypto’s biggest analytical blind spots by including both winners and losers across market history. With ~23,286 cryptocurrencies and ~28.6 million daily observations, you now have the foundation for more realistic backtesting, risk assessment, and market research.

The difference between successful and unsuccessful crypto investing often comes down to seeing the complete picture – not just the survivors. This dataset gives you that edge.

Get started with either version:

Complete Crypto Market Dataset – Build the full ~23K cryptocurrency database

10 Cryptocurrency Sample Dataset – Learn the process quickly with a focused sample