Inside the Data: 2,609 Files, Millions of Compounds

What “the data” actually is

It’s easy to throw around big numbers — millions of compounds, thousands of procedures — without explaining what they refer to. This post unpacks the dataset so you can decide whether it fits your work.

The shape of the database

The core probability database lives in a directory called bluray_data. It contains 2,609 data files spanning four broad categories:

  1. Compound combinations — millions of element pairings and groupings, indexed for fast lookup. You can search by element symbols (for example, ['H', 'O']) and get ranked candidate compounds back.
  2. Probability matrices — statistical data on how likely each combination is to form, behave a certain way, or persist. This is the layer that lets you rank rather than just retrieve.
  3. Transformation database — a “how to” library of structured procedures for converting one compound into another. Thousands of synthesis methods, organized so that a query returns a usable path rather than a paper to read.
  4. Discovery reports — detailed compound analyses, the equivalent of having a reference write-up already done for many of the entries you’ll touch.
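
To make the first two layers concrete, here is a minimal Python sketch of an element-keyed lookup that returns candidates ranked by probability. Everything in it is an assumption for illustration: the Candidate record, the build_index and search helpers, and the placeholder formulas and scores are hypothetical and do not reflect the actual file schema or values in bluray_data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record: one compound plus a formation probability."""
    formula: str
    elements: frozenset
    probability: float


def build_index(candidates):
    """Group candidates by element set so a lookup avoids a full scan."""
    index = defaultdict(list)
    for cand in candidates:
        index[cand.elements].append(cand)
    return index


def search(index, symbols):
    """Return candidates for exactly these symbols, most probable first."""
    matches = index.get(frozenset(symbols), [])
    return sorted(matches, key=lambda c: c.probability, reverse=True)


# Placeholder rows: the scores are made up, not taken from the dataset.
rows = [
    Candidate("H2O2", frozenset({"H", "O"}), 0.40),
    Candidate("H2O", frozenset({"H", "O"}), 0.90),
    Candidate("NaCl", frozenset({"Na", "Cl"}), 0.80),
]
index = build_index(rows)
ranked = search(index, ["H", "O"])
print([c.formula for c in ranked])
```

Keying the index on a frozenset means the order of the symbols in the query does not matter, so ['H', 'O'] and ['O', 'H'] hit the same bucket.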

Formats and access

The dataset ships in multiple formats so it can drop into whatever stack you’re already using:

  • CSV for spreadsheet and pandas workflows.
  • JSON for programmatic access and web tooling.
  • Compressed formats for moving the full corpus efficiently.
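
As a rough sketch of what format-agnostic loading could look like, the helper below picks a parser from the file extension using only the Python standard library. It rests on assumptions the post does not confirm: that the compressed files are gzip, that JSON files hold a list of records, and that files follow a name.csv, name.json, or name.csv.gz pattern. The load_records name is hypothetical.

```python
import csv
import gzip
import json
from pathlib import Path


def load_records(path):
    """Load records from a .csv or .json file, optionally gzip-compressed."""
    path = Path(path)
    suffixes = [s.lower() for s in path.suffixes]  # e.g. ['.csv', '.gz']
    compressed = suffixes[-1:] == [".gz"]
    if compressed:
        suffixes = suffixes[:-1]
    kind = suffixes[-1] if suffixes else ""
    if kind not in (".csv", ".json"):
        raise ValueError(f"unsupported format: {path.name}")

    # gzip.open in text mode behaves like open() for the parsers below.
    opener = gzip.open if compressed else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        if kind == ".csv":
            # DictReader yields every value as a string; cast as needed.
            return list(csv.DictReader(handle))
        return json.load(handle)
```

A call such as load_records("bluray_data/example.csv.gz") would then return a list of dictionaries, where example.csv.gz stands in for whichever file you actually want to read.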

Alongside the raw files, there is a processed-output directory (alchemy_data_v2_output) containing pre-computed views, plus a core_systems directory with the components that wire everything together.
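
Before building on the package, it can help to check what actually landed on disk. The sketch below counts files per extension under each of the three directories named above. It assumes they sit side by side in the current working directory, which the post does not state, and the inventory helper is a hypothetical name.

```python
from collections import Counter
from pathlib import Path


def inventory(root):
    """Count regular files under root, grouped by lowercased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Assumed layout: all three directories in the current working directory.
    for name in ("bluray_data", "alchemy_data_v2_output", "core_systems"):
        if Path(name).is_dir():
            counts = inventory(name)
            print(name, sum(counts.values()), dict(counts))
        else:
            print(name, "not found in the current directory")
```

The total printed for bluray_data is a quick sanity check against the 2,609 files quoted earlier.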

Documentation, not just data

One of the things that distinguishes this package from “a folder of CSVs” is the documentation layer — 61 files covering quick starts, architecture notes, generation guides, and validation results:

  • QUICK_START_USING_DATA.md — the fast on-ramp.
  • USE_BLURAY_DATA.md — full usage reference.
  • HOW_TO_RUN_BLURAY_GENERATOR.md — for teams that want to extend the dataset.
  • METHOD_DATA_SYSTEM_README.md — system architecture.
  • TEST_ANALYSIS.md — validation results.

Why this organization matters

Datasets become useful when the structure makes the next question cheap to ask. Probability matrices sit next to the combinations they rank. Transformation guides are linked to the compounds they produce. Discovery reports are reachable from the compound that prompted them. The intent is that a researcher can move from a question to an answer without rebuilding the index every time.
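
One way to picture that cross-linking is a single record per compound that carries its probability, the transformations that produce it, and a pointer to its report. The sketch below is illustrative only: the CompoundEntry and Transformation classes, the link helper, the report path, and all values are hypothetical placeholders, not the package's real schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transformation:
    """Hypothetical procedure record: converts `source` into `target`."""
    source: str
    target: str
    steps: list


@dataclass
class CompoundEntry:
    """Hypothetical record tying one compound to the other layers."""
    formula: str
    probability: float                                # probability layer
    produced_by: list = field(default_factory=list)   # transformation layer
    report: Optional[str] = None                      # discovery report path


def link(entries, transformations):
    """Attach each transformation to the entry for the compound it produces."""
    by_formula = {entry.formula: entry for entry in entries}
    for step in transformations:
        if step.target in by_formula:
            by_formula[step.target].produced_by.append(step)
    return by_formula


# Placeholder values, made up for the example.
catalog = link(
    [
        CompoundEntry("H2O", 0.90, report="reports/H2O.md"),
        CompoundEntry("H2O2", 0.40),
    ],
    [Transformation("H2O2", "H2O", ["placeholder step"])],
)
entry = catalog["H2O"]
print(entry.probability, [t.source for t in entry.produced_by], entry.report)
```

Because the links are resolved once in link, each later question about a compound is a single dictionary lookup rather than a fresh scan of three separate files.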

The next post turns from “what is in here” to “what does it do for you” — a tour of the use cases that drove the design.
