PyHBR Design

This section describes the design of PyHBR and how the code is structured.

Data Sources and Analysis

The package contains routines for performing data analysis and fitting models. The source data for this analysis are tables stored in Microsoft SQL Server.

In order to make the models reusable, the analysis/model code expects the tables in a particular format, which is documented for each analysis/model script. The code for analysis is in the pyhbr.analysis module.

The database query and data fetch is performed by separate code, which is expected to be modified to port this package to a new data source. These data collection scripts are stored in the pyhbr.data_source module.

A middle preprocessing layer, pyhbr.middle, is used to convert raw data from the data sources into the form expected by the analysis. This helps keep the data-source layer simple (there is no need for extensive transformations in the SQL layer).

SQL Queries

The approach taken to prepare SQL statements is to use SQLAlchemy to prepare a query, and then pass it to Pandas read_sql for execution. The advantage of using SQLAlchemy statements instead of raw strings in read_sql is the ability to construct statements using a declarative syntax (including binding parameters), and increased opportunity for error checking (which may be useful for porting the scripts to new databases).

An example of how SQL statements are built is shown below:

from sqlalchemy import select
from pyhbr.common import make_engine, CheckedTable
import pandas as pd
import datetime as dt

# All interactions with the database (including building queries,
# which involves querying the server to check columns) need an
# SQLAlchemy engine
engine = make_engine()

# The CheckedTable is a simple wrapper around sqlalchemy.Table,
# for the purpose of checking for missing columns. It replaces
# sqlalchemy syntax table.c.column_name with table.col("column_name")
table = CheckedTable("cv1_episodes", engine)

# The SQL statement is built up using the sqlalchemy select function.
# The declarative syntax reduces the chance of errors typing a raw
# SQL string. This line will throw an error if any of the columns are
# wrong.
stmt = select(
    table.col("subject").label("patient_id"),
    table.col("episode_identifier").label("episode_id"),
    table.col("spell_identifier").label("spell_id"),
    table.col("episode_start_time").label("episode_start"),
    table.col("episode_end_time").label("episode_end"),
).where(
    table.col("episode_start_time") >= dt.date(2000, 1, 1),
    table.col("episode_end_time") <= dt.date(2005, 1, 1)
)

# The SQL query can be printed for inspection if required,
# or for using directly in a SQL script
print(stmt)

# Otherwise, execute the query using pandas to get a dataframe
df = pd.read_sql(stmt, engine)

See the pyhbr.data_source module for more examples of functions that return the stmt variable for different tables.

The following are some tips for building statements using the CheckedTable object:

  • CheckedTable contains the SQLAlchemy Table as the table member. This means you can use select(table.table) to initially fetch all the columns (useful for seeing what the table contains)
  • If you need to rename a column (using AS in SQL), use label; e.g. select(table.col("old_name").label("new_name")).
  • Sometimes (particularly with ID columns which are typed incorrectly), it is useful to be able to cast to a different type. You can do this using select(table.col("col_to_cast").cast(String)). The list of generic types is provided in the SQLAlchemy documentation; import the one you need using a line like from sqlalchemy import String.
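
A short sketch of these patterns is shown below, reusing the cv1_episodes table from the example above:

from sqlalchemy import select, String
from pyhbr.common import make_engine, CheckedTable

engine = make_engine()
table = CheckedTable("cv1_episodes", engine)

# Select every column in the table, to inspect what it contains
all_columns = select(table.table)

# Rename a column (SQL AS) and cast an ID column to a string type
stmt = select(
    table.col("subject").cast(String).label("patient_id"),
    table.col("episode_identifier").label("episode_id"),
)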

Middle Layer

To account for differences between the data sources and the analysis, the module pyhbr.middle contains modules such as from_hic, which contain functions that return transformed versions of the data sources in a form more suitable for analysis.

The outputs from this layer are documented here so that it is possible to take a new data source and write a new module in pyhbr.middle which exposes the new data source for analysis. These tables are grouped together into classes (where the table name is used as the attribute name), which are passed as arguments to analysis functions. Analysis functions may not use all the columns of each table, but when a column is present it should have the name and meaning given below.

All tables are Pandas DataFrames.

Episodes

Episodes correspond to individual consultant interactions within a hospital visit (called a spell). Episode information is stored in a table called episodes, which has the following columns:

  • episode_id (str, Pandas index): uniquely identifies the episode.
  • patient_id (str): the unique identifier for the patient.
  • spell_id (str): identifies the spell containing the episode.
  • episode_start (datetime.date): the episode start time.
  • age (float64): The patient age at the start of the episode.
  • gender (category): One of "male", "female", or "unknown".

Even though age and gender are demographic properties, it is convenient to keep them in the episodes table because they are eventually stored with index events, which come directly from episodes.
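
A minimal sketch of an episodes table in this format is shown below; the values are invented purely for illustration.

import pandas as pd
import datetime as dt

episodes = pd.DataFrame(
    {
        "episode_id": ["e1", "e2"],
        "patient_id": ["p1", "p1"],
        "spell_id": ["s1", "s1"],
        "episode_start": [dt.date(2023, 1, 1), dt.date(2023, 1, 3)],
        "age": [65.0, 65.0],
        "gender": pd.Categorical(["male", "male"]),
    }
).set_index("episode_id")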

Note

Consider filtering the episodes table based on a date range of interest when it is fetched from the data source. This will speed up subsequent processing.

Code Groups

A table of code groups is required as input to functions to identify patients based on aspects of their coded history. This table is called code_groups, and has the following structure:

  • code (str): The ICD-10 or OPCS-4 code in normalised form (e.g. "i200")
  • codes (str): The description of the code
  • group (str): The name of the code group containing the code
  • type (category): One of "diagnosis" (for ICD-10 codes) or "procedure" (for OPCS-4 codes)

Only codes which are in a code group are included in the table. Codes may be present in multiple rows if they occur in multiple groups.
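
A minimal sketch of a code_groups table is shown below; the group names and descriptions are only illustrative.

import pandas as pd

code_groups = pd.DataFrame(
    {
        "code": ["i200", "i200", "i211"],
        "codes": ["Unstable angina", "Unstable angina", "Acute myocardial infarction of inferior wall"],
        "group": ["acs", "chest_pain", "acs"],
        "type": pd.Categorical(["diagnosis", "diagnosis", "diagnosis"]),
    }
)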

Codes

Episodes contain clinical code data, which lists the diagnoses made and the procedures performed in an episode. This is stored in a table called codes, with the following columns:

  • episode_id (str): which episode contains this clinical code.
  • code (str): the clinical code, all lowercase with no whitespace or dot, e.g. i212
  • position (int): the position of the code in the episode. 1 means the primary position (e.g. for a primary diagnosis), and >1 means a secondary code. Often episodes contain 5-10 clinical codes, and the maximum number depends on the data source.
  • type (category): either "diagnosis" (for ICD-10 diagnosis codes) or "procedure" (for OPCS-4 codes)
  • group (str): which group contains this clinical code.

The Pandas index is a unique integer (note that episode_id is not unique, since a single episode can contain many codes).
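
A minimal sketch of a codes table is shown below; the group names are illustrative, and the default integer index plays the role of the unique row index.

import pandas as pd

codes = pd.DataFrame(
    {
        "episode_id": ["e1", "e1", "e2"],
        "code": ["i212", "n183", "i212"],
        "position": [1, 2, 1],
        "type": pd.Categorical(["diagnosis", "diagnosis", "diagnosis"]),
        "group": ["acs", "ckd", "acs"],
    }
)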

Note

This table only contains codes that are in a code group (i.e. the function that creates codes should filter out codes that are not in any group, so the group column is never NaN). If all codes are required, make a code group "all" which contains every code. Note that codes occupy multiple rows in the codes table if they are in more than one group (take care when counting rows). In these cases, a duplicate code is identified by having the same code, position and type values, but a different group.

Demographics

Demographic information is stored in a table called demographics, which has the following columns:

  • patient_id (str, Pandas index): the unique patient identifier
  • gender (category): One of "male", "female", or "unknown".

Laboratory Tests

Write me please

Prescriptions

Write me please

Collections of DataFrames

Once the data source has been converted into the standard form described above, the tables are collected together into a dictionary mapping strings to Pandas DataFrames. The keys match the table names used in the sections above.
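
For example, using the tables sketched in the sections above (the exact set of keys depends on the data source; demographics is sketched inline here):

import pandas as pd

# episodes, code_groups and codes are the tables constructed above
demographics = pd.DataFrame(
    {"patient_id": ["p1"], "gender": pd.Categorical(["male"])}
).set_index("patient_id")

data = {
    "episodes": episodes,
    "code_groups": code_groups,
    "codes": codes,
    "demographics": demographics,
}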

Data/Model/Analysis Save Points

To support saving intermediate results of calculations, pyhbr.common includes two functions save_item and load_item, which save a Python object to a directory (by default save_data/ in your working directory).

The scripts in hbr_uhbw use these functions to create these checkpoints:

  • Data: After fetching data from databases or data sources and converting it into the raw format suitable for modelling or analysis. These files have _data in the file name. This data is then loaded again for modelling or analysis.
  • Model: After training models using the data stored in the _data files. These files have _model in the file name. This data is loaded for analysis.
  • Analysis: After performing analysis using the _data or _model files. These files have _analysis in the file name. This data can be loaded and used to generate reports/outputs.

Splitting up the scripts in this way makes them easier to develop, because each of the three parts above can take quite a long time to run.

Multiple objects can be saved under one file by including them in a dictionary. It is up to the script to determine the format of the items being saved and loaded.
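
A minimal sketch of this checkpoint pattern is shown below, assuming a dictionary of tables like the one above; "example_data" is an arbitrary item name chosen for illustration.

from pyhbr.common import save_item, load_item

# Data checkpoint: save the dictionary of raw tables after fetching.
# The hbr_uhbw scripts use names ending in _data, _model or _analysis.
save_item(data, "example_data")

# Later, e.g. in the modelling script, reload the most recent version
data, data_path = load_item("example_data")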

Warning

By default, save_item puts the saved files into a directory called save_data/ relative to your current working directory. Ensure that this is added to the .gitignore if the files contain sensitive data, to avoid committing them to your repository.

Saving Data

In addition to saving the item, pyhbr.common.save_item also includes in the file name:

  • The commit hash of the git repository at the time save_item is called. This is intended to make it easier to reproduce the state of the repository that generated the file. By default, save_item requires you to commit any changes before saving a file. (The cleanest/most reproducible thing to do is commit changes, and then run a script non-interactively from the top.) If you are not using a git repository, then "nogit" is used in place of the commit hash.
  • The timestamp when save_item was called, which is more granular than the commit hash (or useful in case you do not have a git repository).

Note

You can save multiple items with the same name, because the file names will use different timestamps. By default, load_item will load the most recently saved file with a given name.

The save_item function is shown below. The simplest way to call it is save_item(df, "my_df"), which will save the DataFrame df to the directory save_data/ using the name "my_df".

Use this function to save a Python object (e.g. a DataFrame)
save_item(item, name, save_dir='save_data/', enforce_clean_branch=True, prompt_commit=False)

Save an item to a pickle file

Saves a Python object (e.g. a pandas DataFrame) in the save_dir folder, using a filename that includes the current timestamp and the current commit hash. Use load_item to retrieve the file.

Important

Ensure that save_data/ (or your chosen save_dir) is added to the .gitignore of your repository to ensure sensitive data is not committed.

By storing the commit hash and timestamp, it is possible to identify when items were created and what code created them. To make most effective use of the commit hash, ensure that you commit, and do not make any further code edits, before running a script that calls save_item (otherwise the commit hash will not quite reflect the state of the running code).

Parameters:

  • item (Any, required): The Python object to save (e.g. a pandas DataFrame).
  • name (str, required): The name of the item. The filename will be created by adding a suffix for the current commit and the timestamp to show when the data was saved (format: name_commit_timestamp.pkl).
  • save_dir (str, default 'save_data/'): Where to save the data, relative to the current working directory. The directory will be created if it does not exist.
  • enforce_clean_branch (default True): If True, the function will raise an exception if an attempt is made to save an item when the repository has uncommitted changes.
  • prompt_commit (default False): If enforce_clean_branch is True, choose whether to prompt the user to commit on an unclean branch. This can help avoid losing the results of a long-running script. Prefer False if the script is cheap to run.

Source code in src\pyhbr\common.py
def save_item(
    item: Any,
    name: str,
    save_dir: str = "save_data/",
    enforce_clean_branch=True,
    prompt_commit=False,
) -> None:
    """Save an item to a pickle file

    Saves a Python object (e.g. a pandas DataFrame) in the save_dir
    folder, using a filename that includes the current timestamp and the current
    commit hash. Use load_item to retrieve the file.

    !!! important
        Ensure that `save_data/` (or your chosen `save_dir`) is added to the
        .gitignore of your repository to ensure sensitive data is not committed.

    By storing the commit hash and timestamp, it is possible to identify when items
    were created and what code created them. To make most effective use of the
    commit hash, ensure that you commit, and do not make any further code edits,
    before running a script that calls save_item (otherwise the commit hash will
    not quite reflect the state of the running code).

    Args:
        item: The python object to save (e.g. pandas DataFrame)
        name: The name of the item. The filename will be created by adding
            a suffix for the current commit and the timestamp to show when the
            data was saved (format: `name_commit_timestamp.pkl`)
        save_dir: Where to save the data, relative to the current working directory.
            The directory will be created if it does not exist.
        enforce_clean_branch: If True, the function will raise an exception if an attempt
            is made to save an item when the repository has uncommitted changes.
        prompt_commit: If enforce_clean_branch is True, choose whether to prompt the
            user to commit on an unclean branch. This can help avoid losing
            the results of a long-running script. Prefer False if the script
            is cheap to run.
    """

    if enforce_clean_branch:

        abort_msg = "Aborting save_item() because branch is not clean. Commit your changes before saving item to increase the chance of reproducing the item based on the filename commit hash."

        if prompt_commit:
            # If the branch is not clean, prompt the user to commit to avoid losing
            # long-running model results. Take care to only commit if the state of
            # the repository truly reflects what was run (i.e. if no changes were made
            # while the script was running).
            while requires_commit():
                print(abort_msg)
                print(
                    "You can commit now and then retry the save after committing."
                )
                retry_save = query_yes_no(
                    "Do you want to retry the save? Commit, then select yes, or choose no to abort the save."
                )

                if not retry_save:
                    print(f"Aborting save of {name}")
                    return

            # If we get out of the loop without returning, then the branch
            # is now clean and the save can proceed.
            print("Branch now clean, proceeding to save")

        else:

            if requires_commit():
                # In this case, unconditionally throw an error
                raise RuntimeError(abort_msg)

    if not Path(save_dir).exists():
        print(f"Creating missing folder '{save_dir}' for storing item")
        Path(save_dir).mkdir(parents=True, exist_ok=True)

    path = make_new_save_item_path(name, save_dir, "pkl")
    with open(path, "wb") as file:
        print(f"Saving {str(path)}")
        pickle.dump(item, file)

Loading Data

To load a previously saved item, use pyhbr.common.load_item. It can be called most simply as load_item("my_df"), assuming you previously saved an object in the default directory (save_data) with the name "my_df". By default, the most recent item is loaded, but using load_item("my_df", True) will let you pick which file you want to load.
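
For example (load_item returns both the loaded object and the path of the file it was read from):

from pyhbr.common import load_item

# Non-interactively load the most recent item saved with the name "my_df"
my_df, my_df_path = load_item("my_df")

# Interactively pick which saved version of "my_df" to load
my_df, my_df_path = load_item("my_df", interactive=True)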

The function load_item is shown below:

Use this function to load a previously saved Python object
load_item(name, interactive=False, save_dir='save_data')

Load a previously saved item (pickle) from file

Use this function to load a file that was previously saved using save_item(). By default, the latest version of the item will be returned (the one with the most recent timestamp).

None is returned if an interactive load is cancelled by the user.

To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.

Parameters:

  • name (str, required): The name of the item to load.
  • interactive (bool, default False): If True, let the user pick which item version to load interactively. If False, non-interactively load the most recent item (i.e. with the most recent timestamp). The commit hash is not considered when loading the item.
  • save_dir (str, default 'save_data'): Which folder to load the item from.

Returns:

  • (Any, Path): A tuple, with the Python object loaded from file as the first element and the Path to the item as the second element, or None if the user cancelled an interactive load.

Source code in src\pyhbr\common.py
def load_item(
    name: str, interactive: bool = False, save_dir: str = "save_data"
) -> (Any, Path):
    """Load a previously saved item (pickle) from file

    Use this function to load a file that was previously saved using
    save_item(). By default, the latest version of the item will be returned
    (the one with the most recent timestamp).

    None is returned if an interactive load is cancelled by the user.

    To load an item that is an object from a library (e.g. a pandas DataFrame),
    the library must be installed (otherwise you will get a ModuleNotFound
    exception). However, you do not have to import the library before calling this
    function.

    Args:
        name: The name of the item to load
        interactive: If True, let the user pick which item version to load interactively.
            If False, non-interactively load the most recent item (i.e. with the most
            recent timestamp). The commit hash is not considered when loading the item.
        save_dir: Which folder to load the item from.

    Returns:
        A tuple, with the python object loaded from file as first element and the
            Path to the item as the second element, or None if the user cancelled
            an interactive load.

    """
    if interactive:
        item_path = pick_saved_file_interactive(name, save_dir, "pkl")
    else:
        item_path = pick_most_recent_saved_file(name, save_dir, "pkl")

    if item_path is None:
        print("Aborted (interactive) load item")
        return None, None

    print(f"Loading {item_path}")

    # Load a generic pickle. Note that if this is a pandas dataframe,
    # pandas must be installed (otherwise you will get module not found).
    # The same goes for a pickle storing an object from any other library.
    with open(item_path, "rb") as file:
        return pickle.load(file), item_path

Clinical Codes

PyHBR has functions for creating and using lists of ICD-10 and OPCS-4 codes. A prototype version of the graphical program to create the code lists was previously written in Tauri. However, it is simpler and more portable to have the codes editor bundled in this Python package, and written in Python.

Users should be able to do the following things with the codes editor GUI:

  • Open the GUI program, and select a blank ICD-10 or OPCS-4 codes tree to begin creating code groups.
  • Create new code groups starting from the blank template.
  • Search for strings within the ICD-10/OPCS-4 descriptions to make creation of groups easier.
  • Save the resulting codes file to a working directory.
  • Open and edit a previously saved codes file from a working directory.

Once the groups have been defined, the user should be able to perform the following actions with the code groups files:

  • Import codes files from the package (i.e. predefined code groups).
  • Import codes files (containing custom groups) from a working directory.
  • Extract the code groups, and show which codes are in which groups.
  • Use the code groups in analysis (i.e. get a Pandas DataFrame showing which codes are in which groups)

Multiple code groups are stored in a single file, which means that only two codes files are necessary: icd10.yaml and opcs4.yaml. There is no limit to the number of code groups.

Previously implemented functionality to check whether a clinical code is valid will not be implemented here, because sufficiently performant code cannot be written in pure python (and this package is intended to contain only pure Python to maximise portability).

Instead, all codes are converted to a standard "normal form", where upper-case letters are replaced with lower-case, and dots and whitespace are removed. Codes can then be compared, and most codes will match under this condition. (Codes that will not match include those with suffixes, such as dagger or asterisk, or codes that contain further qualifying suffixes that are not present in the codes tree.)
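
The sketch below illustrates the normalisation rule; normalise_code is a hypothetical helper used for illustration, not necessarily the function name used in the package.

def normalise_code(code: str) -> str:
    """Convert a code to normal form: lowercase, with dots and whitespace removed."""
    return code.lower().replace(".", "").replace(" ", "")

normalise_code("I21.2")    # "i212"
normalise_code(" N18.3 ")  # "n183"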

Counting Codes

Diagnosis and procedure codes can be grouped together and used as features for building models. One way to do this is to count the codes in a particular time window (for example, one year before an index event), and use that as a predictor for subsequent outcomes.

This section describes how raw episode data is converted into this counted form in PyHBR.

Getting Clinical Code Data

Hospital episodes contain multiple diagnosis and procedure codes. The starting point for counting codes is the pyhbr.middle.*.get_clinical_codes function, which returns a data frame with the following columns:

  • episode_id: Which episode the code was in
  • code: The name of the clinical code in normal form (lowercase, no whitespace/dots), e.g. "n183"
  • group: The group containing the code. The table only contains codes that are defined in a code group, which is based on the codes files from the previous section
  • position: The priority of the clinical code, where 1 means the primary diagnosis/procedure, and > 1 means a secondary code.
  • type: Either "diagnosis" or "procedure" depending on the type of code.

This table does not use episode_id as the index because a single episode ID often has many rows.

An example of this function in pyhbr.middle.from_hic is:

Example function which fetches clinical codes
get_clinical_codes(engine, diagnoses_file, procedures_file)

Main diagnoses/procedures fetch for the HIC data

This function wraps the diagnoses/procedures queries and a filtering operation to reduce the tables to only those rows which contain a code in a group. One table is returned which contains both the diagnoses and procedures in long format, along with the associated episode ID and the primary/secondary position of the code in the episode.

Parameters:

  • engine (Engine, required): The connection to the database.
  • diagnoses_file (str, required): The diagnoses codes file name (loaded from the package).
  • procedures_file (str, required): The procedures codes file name (loaded from the package).

Returns:

  • DataFrame: A table containing diagnoses/procedures, normalised codes, code groups, diagnosis positions, and associated episode ID.

Source code in src\pyhbr\middle\from_hic.py
def get_clinical_codes(
    engine: Engine, diagnoses_file: str, procedures_file: str
) -> pd.DataFrame:
    """Main diagnoses/procedures fetch for the HIC data

    This function wraps the diagnoses/procedures queries and a filtering
    operation to reduce the tables to only those rows which contain a code
    in a group. One table is returned which contains both the diagnoses and
    procedures in long format, along with the associated episode ID and the
    primary/secondary position of the code in the episode.

    Args:
        engine: The connection to the database
        diagnoses_file: The diagnoses codes file name (loaded from the package)
        procedures_file: The procedures codes file name (loaded from the package)

    Returns:
        A table containing diagnoses/procedures, normalised codes, code groups,
            diagnosis positions, and associated episode ID.
    """

    diagnosis_codes = clinical_codes.load_from_package(diagnoses_file)
    procedures_codes = clinical_codes.load_from_package(procedures_file)

    # Fetch the data from the server
    diagnoses = get_data(engine, hic.diagnoses_query)
    procedures = get_data(engine, hic.procedures_query)

    # Reduce data to only code groups, and combine diagnoses/procedures
    filtered_diagnoses = clinical_codes.filter_to_groups(diagnoses, diagnosis_codes)
    filtered_procedures = clinical_codes.filter_to_groups(procedures, procedures_codes)

    # Tag the diagnoses/procedures, and combine the tables
    filtered_diagnoses["type"] = "diagnosis"
    filtered_procedures["type"] = "procedure"

    codes = pd.concat([filtered_diagnoses, filtered_procedures])
    codes["type"] = codes["type"].astype("category")

    return codes
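
A sketch of calling this function is shown below; the codes file names are assumptions for illustration, and should be replaced with the names of the codes files actually bundled with the package.

from pyhbr.common import make_engine
from pyhbr.middle.from_hic import get_clinical_codes

engine = make_engine()

# File names are illustrative
codes = get_clinical_codes(engine, "icd10.yaml", "opcs4.yaml")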

Codes in Other Episodes Relative to a Base Episode

To count up codes that occur in a time window before or after a particular base episode, it is necessary to join together each base episode with all the other episodes for the same patient.

To do this, three tables are needed:

  • base_episodes: A table of the base episodes of interest, containing episode_id as an index.
  • episodes: A table of episode information (all episodes), which is indexed by episode_id and contains patient_id and episode_start as columns.
  • codes: The table of diagnosis/procedure codes from the previous section, containing a column episode_id and other code data columns.

A function which combines these into a table containing all codes for other episodes relative to a base episode is pyhbr.clinical_codes.counting.get_all_other_codes:

Function that gets data from other episodes relative to a base episode
get_all_other_codes(index_spells, episodes, codes)

For each patient, get clinical codes in other episodes before/after the index

This makes a table of index episodes (which is the first episode of the index spell) along with all other episodes for a patient. Two columns index_episode_id and other_episode_id identify the two episodes for each row (they may be equal), and other information is stored such as the time of the base episode, the time to the other episode, and clinical code information for the other episode.

This table is used as the basis for all processing involving counting codes before and after an episode.

Note

Episodes will not be included in the result if they do not have any clinical codes that are in any code group.

Parameters:

  • index_spells (DataFrame, required): Contains episode_id as an index.
  • episodes (DataFrame, required): Contains episode_id as an index, and patient_id and episode_start as columns.
  • codes (DataFrame, required): Contains episode_id and other code data as columns.

Returns:

  • DataFrame: A table containing columns index_episode_id, other_episode_id, index_episode_start, time_to_other_episode, and code data columns for the other episode. Note that the base episode itself is included as an other episode.

Source code in src\pyhbr\clinical_codes\counting.py
def get_all_other_codes(
    index_spells: DataFrame, episodes: DataFrame, codes: DataFrame
) -> DataFrame:
    """For each patient, get clinical codes in other episodes before/after the index

    This makes a table of index episodes (which is the first episode of the index spell)
    along with all other episodes for a patient. Two columns `index_episode_id` and
    `other_episode_id` identify the two episodes for each row (they may be equal), and
    other information is stored such as the time of the base episode, the time to the
    other episode, and clinical code information for the other episode.

    This table is used as the basis for all processing involving counting codes before
    and after an episode.

    !!! note
        Episodes will not be included in the result if they do not have any clinical
            codes that are in any code group.

    Args:
        index_spells: Contains `episode_id` as an index.
        episodes: Contains `episode_id` as an index, and `patient_id` and `episode_start` as columns
        codes: Contains `episode_id` and other code data as columns

    Returns:
        A table containing columns `index_episode_id`, `other_episode_id`,
            `index_episode_start`, `time_to_other_episode`, and code data columns
            for the other episode. Note that the base episode itself is included
            as an other episode.
    """

    # Remove everything but the index episode_id (in case base_episodes
    # already has the columns)
    df = index_spells.reset_index(names="spell_id").set_index("episode_id")[
        ["spell_id"]
    ]

    index_episode_info = df.merge(
        episodes[["patient_id", "episode_start"]], how="left", on="episode_id"
    ).rename(
        columns={"episode_start": "index_episode_start", "spell_id": "index_spell_id"}
    )

    other_episodes = (
        index_episode_info.reset_index(names="index_episode_id")
        .merge(
            episodes[["episode_start", "patient_id", "spell_id"]].reset_index(
                names="other_episode_id"
            ),
            how="left",
            on="patient_id",
        )
        .rename(columns={"spell_id": "other_spell_id"})
    )

    other_episodes["time_to_other_episode"] = (
        other_episodes["episode_start"] - other_episodes["index_episode_start"]
    )

    # Use an inner join to filter out other episodes that have no associated codes
    # in any group.
    with_codes = other_episodes.merge(
        codes, how="inner", left_on="other_episode_id", right_on="episode_id"
    ).drop(columns=["patient_id", "episode_start", "episode_id"])

    return with_codes
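
A sketch of using this function, and then restricting to a one-year window before the index episode, is shown below. The window convention (e.g. whether the boundaries are inclusive, and whether the index episode itself is excluded) is an assumption for illustration.

import pandas as pd
from pyhbr.clinical_codes.counting import get_all_other_codes

# index_spells, episodes and codes are assumed to be in the formats
# described above
all_other_codes = get_all_other_codes(index_spells, episodes, codes)

# time_to_other_episode is negative for episodes before the index, so a
# one-year window before the index can be selected like this:
one_year = pd.Timedelta(days=365)
before_index = (
    (all_other_codes["time_to_other_episode"] < pd.Timedelta(0))
    & (all_other_codes["time_to_other_episode"] > -one_year)
)
previous_year_codes = all_other_codes[before_index]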

Counting Codes Group Occurrences

In any table that has ``