data_base

Efficient, reproducible, and flexible database with a dictionary-like API.

This package provides efficient and scalable methods to store and access simulation results at a terabyte scale. A wide variety of input data and output file formats are supported (see data_base.IO.LoaderDumper), including:

  • 1D and ND numpy arrays

  • pandas and dask dataframes

  • Cell objects

A database saves each key as a folder containing at least three files:

  • Loader: JSON file containing information on how to load the data.

  • metadata: JSON file containing metadata.

  • Data file(s): The actual data, in a format specified by the Loader file. Some file formats, such as parquet and msgpack, split the data across multiple files.
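The layout above can be emulated with a minimal sketch. Note that the exact file names ("Loader", "metadata") and the parquet part name are assumptions for illustration, not guaranteed to match what data_base writes:

```python
import json
import os
import tempfile

# Sketch: emulate the on-disk layout of a single database key folder.
key_dir = tempfile.mkdtemp()

# Loader file: tells the database how to read the data back.
with open(os.path.join(key_dir, "Loader"), "w") as f:
    json.dump({"Loader": "data_base.IO.LoaderDumper.dask_to_parquet"}, f)

# metadata file: provenance information about how the data was created.
with open(os.path.join(key_dir, "metadata"), "w") as f:
    json.dump({"dumper": "dask_to_parquet", "dirty": False}, f)

# Data file(s): placeholder standing in for the actual payload.
with open(os.path.join(key_dir, "part.0.parquet"), "wb") as f:
    f.write(b"")

print(sorted(os.listdir(key_dir)))
```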

Simulation results from single_cell_parser and simrun can be imported and converted to a high-performance binary format using the data_base.db_initializers subpackage.

Example

The Loader file specifies which module to use for loading the data (the module is assumed to contain a Loader class):

{"Loader": "data_base.IO.LoaderDumper.dask_to_parquet"}

metadata contains the time, commit hash, module versions, creation date, file format, and whether the data was saved with uncommitted code (dirty). If the data was created within a Jupyter session, it also contains the code history that was used to produce this data:

{
    "dumper": "dask_to_parquet",
    "time": [2025, 2, 21, 15, 51, 23, 4, 52, -1],
    "module_list": "...",
    "module_versions": {
        "re": "2.2.1",
        ...
        "pygments": "2.18.0",
        "bluepyopt": "1.9.126"
        },
    "history": "import Interface as I ...",
    "hostname": "localhost",
    "metadata_creation_time": "together_with_new_key",
    "version": "heads/master",
    "full-revisionid": "9fd2c2a94cdc36ee806d4625e353cd289cd7ce16",
    "dirty": false,
    "error": null
}
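A consumer of this metadata might parse it and flag data produced from uncommitted code, along the lines of this sketch (the fields used here are taken from the example above):

```python
import json

# Parse a metadata file and inspect its provenance fields.
metadata = json.loads("""{
    "dumper": "dask_to_parquet",
    "version": "heads/master",
    "dirty": false,
    "error": null
}""")

if metadata["dirty"]:
    # Data written from a dirty working tree may not be reproducible
    # from the recorded commit hash alone.
    print("Warning: data was saved with uncommitted code")

print(metadata["dumper"])
```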

Saving and loading data is easily achieved:

import pandas as pd

from data_base import DataBase

db = DataBase('/path/to/database')
obj = pd.DataFrame(...)  # some pandas dataframe for example
db['my_key'] = obj  # saves the object to the database with the default format
loaded_obj = db['my_key']  # loads the object from the database
db.set('my_other_key', obj, dumper='pandas_to_msgpack')  # saves the object with a specific format

When you don’t specify a dumper, the default dumper from the configuration file is used. The default dumper is deliberately chosen to prioritize flexibility (i.e. save anything), not performance (i.e. save something specific very efficiently). Performant data formats must be specified explicitly, as they often depend on the object being saved and the intended use case. You can (but shouldn’t) reconfigure the default dumper in config/db_settings.json.
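The dictionary-like access pattern described above can be illustrated with a toy mapping backed by a folder. This is not the real DataBase implementation (which dispatches to format-specific dumpers and writes Loader and metadata files); it only shows the `db[key] = obj` / `db[key]` semantics:

```python
import json
import os
import tempfile

class ToyDB:
    """Toy stand-in for the dictionary-like DataBase API (illustration only)."""

    def __init__(self, basedir):
        self.basedir = basedir
        os.makedirs(basedir, exist_ok=True)

    def __setitem__(self, key, obj):
        # The real database picks a dumper per object type; here we just
        # serialize to JSON for simplicity.
        with open(os.path.join(self.basedir, key), "w") as f:
            json.dump(obj, f)

    def __getitem__(self, key):
        with open(os.path.join(self.basedir, key)) as f:
            return json.load(f)

db = ToyDB(tempfile.mkdtemp())
db["my_key"] = {"a": 1}
print(db["my_key"])
```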

Functions

is_data_base(path)

Checks if a given path contains a DataBase.

is_sub_data_base(parent_db, key)

Check if a given key is a sub-database of the parent database.

get_db_by_unique_id(unique_id)

Get a DataBase by its unique ID, as registered in the data base register.
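The real validity checks behind these helpers are internal to data_base. As a rough illustration grounded in the key layout described earlier (a key folder holds a Loader file, a metadata file, and data file(s)), a folder check might look like this; the actual is_data_base criteria may differ:

```python
import os
import tempfile

def looks_like_key_folder(path):
    """Rough stand-in check: Loader + metadata + at least one data file.

    Illustration only; the real data_base checks may use different criteria.
    """
    present = set(os.listdir(path))
    return {"Loader", "metadata"}.issubset(present) and len(present) >= 3

# Build a folder that matches the documented layout and test the check.
d = tempfile.mkdtemp()
for name in ("Loader", "metadata", "data.parquet"):
    open(os.path.join(d, name), "w").close()

print(looks_like_key_folder(d))
```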

Attributes

DataBase

-

Modules

IO

Read and write data.

analyze

Analyze simrun-initialized databases.

db_initializers

Initialize a database from raw simulation data.

data_base_register

Registry of databases.

dbopen

Open files directly in a database.

distributed_lock

Configuration for locking servers.

exceptions

data_base specific exceptions.

isf_data_base

The ISFDataBase class for robust and efficient data storage.

utils

Database utility and convenience functions.