PyHBR Function Reference

This page contains the documentation for all objects in PyHBR.

Data Sources

analysis

Routines for performing statistics, analysis, or fitting models

acs

filter_by_code_groups(episode_codes, code_group, max_position, exclude_index_spell)

Filter based on matching code conditions occurring in other episodes

From any table derived from get_all_other_episodes (e.g. the output of get_time_window), identify clinical codes (and therefore episodes) which correspond to an outcome of interest.

The input table has one row per clinical code, which is grouped into episodes and spells by other columns. The output only contains codes that define an episode or spell as an outcome. The result from this function can be used to analyse the make-up of outcomes.

Parameters:

Name Type Description Default
episode_codes DataFrame

Table of other episodes to filter. This can be narrowed to either the previous or subsequent year, or a different time frame. (In particular, exclude the index event if required.) The table must contain these columns:

  • other_episode_id: The ID of the other episode containing the code (relative to the index episode).
  • other_spell_id: The spell containing the other episode.
  • group: The name of the code group.
  • type: The code type, "diagnosis" or "procedure".
  • position: The position of the code (1 for primary, > 1 for secondary).
  • time_to_other_episode: The time elapsed between the index episode start and the other episode start.
required
code_group str

The code group name used to identify outcomes

required
max_position int

The maximum clinical code position that will be allowed to define an outcome. Pass 1 to allow primary diagnosis only, 2 to allow primary diagnosis and the first secondary diagnosis, etc.

required
exclude_index_spell bool

Do not allow any code present in the index spell to define an outcome.

required

Returns:

Type Description
DataFrame

A table of the clinical codes (one row per code) that meet the code group, position, and index-spell inclusion criteria.

Source code in src\pyhbr\analysis\acs.py
def filter_by_code_groups(
    episode_codes: DataFrame,
    code_group: str,
    max_position: int,
    exclude_index_spell: bool,
) -> DataFrame:
    """Filter based on matching code conditions occurring in other episodes

    From any table derived from get_all_other_episodes (e.g. the
    output of get_time_window), identify clinical codes (and
    therefore episodes) which correspond to an outcome of interest.

    The input table has one row per clinical code, which is grouped
    into episodes and spells by other columns. The output only
    contains codes that define an episode or spell as an outcome.
    The result from this function can be used to analyse the make-up
    of outcomes.

    Args:
        episode_codes: Table of other episodes to filter.
            This can be narrowed to either the previous or subsequent
            year, or a different time frame. (In particular, exclude the
            index event if required.) The table must contain these
            columns:

            * `other_episode_id`: The ID of the other episode
                containing the code (relative to the index episode).
            * `other_spell_id`: The spell containing the other episode.
            * `group`: The name of the code group.
            * `type`: The code type, "diagnosis" or "procedure".
            * `position`: The position of the code (1 for primary, > 1
                for secondary).
            * `time_to_other_episode`: The time elapsed between the index
                episode start and the other episode start.

        code_group: The code group name used to identify outcomes
        max_position: The maximum clinical code position that will be allowed
            to define an outcome. Pass 1 to allow primary diagnosis only,
            2 to allow primary diagnosis and the first secondary diagnosis,
            etc.
        exclude_index_spell: Do not allow any code present in the index
            spell to define an outcome.

    Returns:
        A table of the clinical codes (one row per code) that meet the
            code group, position, and index-spell inclusion criteria.
    """

    # Reduce to only the code groups of interest
    df = episode_codes[episode_codes["group"] == code_group]

    # Keep only necessary columns
    df = df[
        [
            "index_spell_id",
            "other_spell_id",
            "code",
            "docs",
            "position",
            "time_to_other_episode",
        ]
    ]

    # Optionally remove rows corresponding to the index spell
    if exclude_index_spell:
        df = df[~(df["other_spell_id"] == df["index_spell_id"])]

    # Only keep codes that match the position-based inclusion criterion
    df = df[df["position"] <= max_position]

    return df
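
To make the filtering steps concrete, here is a small self-contained pandas sketch of the same logic applied to a toy table (the group names, codes and docs values are made up; in practice the input comes from counting.get_time_window):

import pandas as pd

episode_codes = pd.DataFrame({
    "index_spell_id": ["s1", "s1", "s1"],
    "other_spell_id": ["s1", "s2", "s2"],
    "group": ["bleeding", "bleeding", "acs"],
    "code": ["i61", "k226", "i210"],
    "docs": ["Intracerebral haemorrhage", "Gastric ulcer", "STEMI"],
    "position": [1, 1, 2],
    "time_to_other_episode": pd.to_timedelta([0, 30, 40], unit="D"),
})

# Same steps as filter_by_code_groups(episode_codes, "bleeding", 1, True):
# keep the group of interest, drop codes from the index spell itself,
# then keep primary codes only
df = episode_codes[episode_codes["group"] == "bleeding"]
df = df[df["other_spell_id"] != df["index_spell_id"]]
df = df[df["position"] <= 1]
print(df)  # one remaining bleeding code, in spell s2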

get_code_features(index_spells, all_other_codes)

Get counts of previous clinical codes in code groups before the index.

Predictors derived from clinical code groups use clinical coding data from 365 days before the index to 30 days before the index (this excludes episodes where no coding data would be available, because the coding process itself takes approximately one month).

All groups included anywhere in the group column of all_other_codes are included, and each one becomes a new column with "_before" appended.

Parameters:

Name Type Description Default
index_spells DataFrame

A table containing spell_id as Pandas index and a column episode_id for the first episode in the index spell.

required
all_other_codes DataFrame

A table of other episodes (and their clinical codes) relative to the index spell, output from counting.get_all_other_codes.

required

Returns:

Type Description
DataFrame

A table with one column per code group, counting the number of codes in that group that appeared in the year before the index.

Source code in src\pyhbr\analysis\acs.py
def get_code_features(index_spells: DataFrame, all_other_codes: DataFrame) -> DataFrame:
    """Get counts of previous clinical codes in code groups before the index.

    Predictors derived from clinical code groups use clinical coding data from 365
    days before the index to 30 days before the index (this excludes episodes where
    no coding data would be available, because the coding process itself takes
    approximately one month).

    All groups included anywhere in the `group` column of all_other_codes are
    included, and each one becomes a new column with "_before" appended.

    Args:
        index_spells: A table containing `spell_id` as Pandas index and a
            column `episode_id` for the first episode in the index spell.
        all_other_codes: A table of other episodes (and their clinical codes)
            relative to the index spell, output from counting.get_all_other_codes.

    Returns:
        A table with one column per code group, counting the number of codes
            in that group that appeared in the year before the index.
    """
    code_groups = all_other_codes["group"].unique()
    max_position = 999  # Allow any primary/secondary position
    exclude_index_spell = False
    max_before = dt.timedelta(days=365)
    min_before = dt.timedelta(days=30)

    # Get the episodes that occurred in the previous year (for clinical code features)
    previous_year = counting.get_time_window(all_other_codes, -max_before, -min_before)

    code_features = {}
    for group in code_groups:
        group_episodes = filter_by_code_groups(
            previous_year,
            group,
            max_position,
            exclude_index_spell,
        )
        code_features[group + "_before"] = counting.count_code_groups(
            index_spells, group_episodes
        )

    return DataFrame(code_features)
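
A rough standalone sketch of the counting step, assuming a toy table of prior-year codes that has already been narrowed to the 365-to-30-day window (the real path goes through counting.get_time_window, filter_by_code_groups and counting.count_code_groups; this only illustrates the shape of the result):

import pandas as pd

prior_year_codes = pd.DataFrame({
    "index_spell_id": ["s1", "s1", "s2"],
    "group": ["bleeding", "acs", "acs"],
})
index_spells = pd.DataFrame(index=pd.Index(["s1", "s2", "s3"], name="spell_id"))

# One column per code group, counting prior-year occurrences per index spell
counts = (
    prior_year_codes.groupby(["index_spell_id", "group"])
    .size()
    .unstack(fill_value=0)
    .add_suffix("_before")
)
code_features = index_spells.join(counts).fillna(0).astype(int)
print(code_features)  # columns acs_before and bleeding_before, zeros for s3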

get_index_attributes(swd_index_spells, primary_care_attributes)

Link the primary care patient data to the index spells

Parameters:

Name Type Description Default
swd_index_spells DataFrame

Index spells linked to a recent, valid patient attributes row. Contains the columns patient_id and date for linking, and has Pandas index spell_id.

required
primary_care_attributes DataFrame

The full attributes table.

required

Returns:

Type Description
DataFrame

The table of index-spell patient attributes, indexed by spell_id.

Source code in src\pyhbr\analysis\acs.py
def get_index_attributes(
    swd_index_spells: DataFrame, primary_care_attributes: DataFrame
) -> DataFrame:
    """Link the primary care patient data to the index spells

    Args:
        swd_index_spells: Index spells linked to a recent, valid
            patient attributes row. Contains the columns `patient_id` and
            `date` for linking, and has Pandas index `spell_id`.
        primary_care_attributes: The full attributes table.

    Returns:
        The table of index-spell patient attributes, indexed by `spell_id`.
    """

    return (
        (
            swd_index_spells[["patient_id", "date"]]
            .reset_index()
            .merge(
                primary_care_attributes,
                how="left",
                on=["patient_id", "date"],
            )
        )
        .set_index("spell_id")
        .drop(columns=["patient_id", "date"])
    )
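
A minimal sketch of the linking step on made-up data, showing how the left merge on patient_id and date attaches one attributes row to each index spell (the smoking column is an invented example attribute):

import pandas as pd

swd_index_spells = pd.DataFrame(
    {"patient_id": ["p1", "p2"], "date": pd.to_datetime(["2024-01-01", "2024-02-01"])},
    index=pd.Index(["s1", "s2"], name="spell_id"),
)
primary_care_attributes = pd.DataFrame({
    "patient_id": ["p1", "p2"],
    "date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "smoking": ["yes", "no"],
})

index_attributes = (
    swd_index_spells[["patient_id", "date"]]
    .reset_index()
    .merge(primary_care_attributes, how="left", on=["patient_id", "date"])
    .set_index("spell_id")
    .drop(columns=["patient_id", "date"])
)
print(index_attributes)  # one row per spell_id with the linked attribute columns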

get_index_spells(episodes, codes, acs_group, pci_group, stemi_group, nstemi_group, complex_pci_group)

Get the index spells for ACS/PCI patients

Index spells are defined by the contents of the first episode of the spell (i.e. the cause of admission to hospital). Spells are considered an index event if either of the following hold:

  • The primary diagnosis of the first episode contains an ACS ICD-10 code. This is to ensure that only episodes where the main diagnosis of the episode is ACS are considered, and not cases where a secondary ACS is present that could refer to a historical event.
  • There is a PCI procedure in any primary or secondary position in the first episode of the spell. It is assumed that a procedure is only coded in secondary positions if it did occur in that episode.

A prerequisite for a spell to be an index spell is that it contains episodes present in both the episodes and codes tables. The episodes table contains start-time/spell information, and the codes table contains information about what diagnoses/procedures occurred in each episode.

The table returned contains one row per index spell (and is indexed by spell id). It also contains other information about the index spell, which is derived from the first episode of the spell.

Parameters:

Name Type Description Default
episodes DataFrame

All patient episodes. Must contain episode_id, spell_id, episode_start, age and gender.

required
codes DataFrame

All diagnosis/procedure codes by episode. Must contain episode_id, position (indexed from 1 which is the primary code, >1 are secondary codes), and group (expected to contain the value of the acs_group and pci_group arguments).

required
acs_group str

The name of the ICD-10 code group used to define ACS.

required
pci_group str | None

The name of the OPCS-4 code group used to define PCI. Pass None to not use PCI as an inclusion criterion for index events. In this case, the pci_index column is omitted, and only ACS primary diagnoses are allowed.

required
stemi_group str

The name of the ICD-10 code group used to identify STEMI MI

required
nstemi_group str

The name of the ICD-10 code group used to identify NSTEMI MI

required
complex_pci_group str | None

The name of the OPCS-4 code group used to define complex PCI (in any primary/secondary position)

required

Returns:

Type Description
DataFrame

A table of index spells and associated information about the first episode of the spell.

Source code in src\pyhbr\analysis\acs.py
def get_index_spells(
    episodes: DataFrame,
    codes: DataFrame,
    acs_group: str,
    pci_group: str | None,
    stemi_group: str,
    nstemi_group: str,
    complex_pci_group: str | None,
) -> DataFrame:
    """Get the index spells for ACS/PCI patients

    Index spells are defined by the contents of the first episode of
    the spell (i.e. the cause of admission to hospital). Spells are
    considered an index event if either of the following hold:

    * The primary diagnosis of the first episode contains an
      ACS ICD-10 code. This is to ensure that only episodes where the
      main diagnosis of the episode is ACS are considered, and not
      cases where a secondary ACS is present that could refer to a
      historical event.
    * There is a PCI procedure in any primary or secondary position
      in the first episode of the spell. It is assumed that a procedure
      is only coded in secondary positions if it did occur in that
      episode.

    A prerequisite for a spell to be an index spell is that it contains
    episodes present in both the episodes and codes tables. The episodes table
    contains start-time/spell information, and the codes table contains
    information about what diagnoses/procedures occurred in each episode.

    The table returned contains one row per index spell (and is indexed by
    spell id). It also contains other information about the index spell,
    which is derived from the first episode of the spell.

    Args:
        episodes: All patient episodes. Must contain `episode_id`, `spell_id`,
            `episode_start`, `age` and `gender`.
        codes: All diagnosis/procedure codes by episode. Must contain
            `episode_id`, `position` (indexed from 1 which is the primary
            code, >1 are secondary codes), and `group` (expected to contain
            the value of the acs_group and pci_group arguments).
        acs_group: The name of the ICD-10 code group used to define ACS.
        pci_group: The name of the OPCS-4 code group used to define PCI. Pass None
            to not use PCI as an inclusion criterion for index events. In this
            case, the pci_index column is omitted, and only ACS primary diagnoses
            are allowed.
        stemi_group: The name of the ICD-10 code group used to identify STEMI MI
        nstemi_group: The name of the ICD-10 code group used to identify NSTEMI MI
        complex_pci_group: The name of the OPCS-4 code group used to define complex
            PCI (in any primary/secondary position)

    Returns:
        A table of index spells and associated information about the
            first episode of the spell.
    """

    # Index spells are defined by the contents of the first episode in the
    # spell (to capture the cause of admission to hospital).
    first_episodes = episodes.sort_values("episode_start").groupby("spell_id").head(1)

    # In the codes dataframe, if one code is in multiple groups, it gets multiple
    # (one per code group). Concatenate the code groups to reduce to one row per
    # code, then use str.contains() later to identify code groups
    reduced_codes = codes.copy()
    non_group_cols = [c for c in codes.columns if c != "group"]
    reduced_codes["group"] = codes.groupby(non_group_cols)["group"].transform(
        lambda x: ",".join(x)
    )
    reduced_codes = reduced_codes.drop_duplicates()

    # Join the diagnosis/procedure codes. The inner join reduces to episodes which
    # have codes in any group, which is a superset of the index episodes -- if an
    # episode has no codes in any code group, it cannot be an index event.
    first_episodes_with_codes = first_episodes.merge(
        reduced_codes, how="inner", on="episode_id"
    )

    # ACS matches based on a primary diagnosis of ACS (this is to rule out
    # cases where patient history may contain ACS recorded as a secondary
    # diagnosis).
    acs_match = (first_episodes_with_codes["group"].str.contains(acs_group)) & (
        first_episodes_with_codes["position"] == 1
    )

    # A PCI match is allowed anywhere in the procedures list, but must still
    # be present in the first episode of the index spell.
    if pci_group is not None:
        pci_match = first_episodes_with_codes["group"].str.contains(pci_group)
    else:
        pci_match = False

    # Get all the episodes matching the ACS or PCI condition (multiple rows
    # per episode)
    matching_episodes = first_episodes_with_codes[acs_match | pci_match]
    matching_episodes.set_index("episode_id", drop=True, inplace=True)

    index_spells = DataFrame()

    # Reduce to one row per episode, and store a flag for whether the ACS
    # or PCI condition was present. If PCI is none, there is no need for these
    # columns because all rows are ACS index events
    if pci_group is not None:
        index_spells["pci_index"] = (
            matching_episodes["group"].str.contains(pci_group).groupby("episode_id").any()
        )
        index_spells["acs_index"] = (
            matching_episodes["group"].str.contains(acs_group).groupby("episode_id").any()
        )

    # The stemi/nstemi columns are always needed to distinguish the type of ACS. If 
    # both are false, the result is unstable angina
    index_spells["stemi_index"] = (
        matching_episodes["group"].str.contains(stemi_group).groupby("episode_id").any()
    )   
    index_spells["nstemi_index"] = (
        matching_episodes["group"]
        .str.contains(nstemi_group)
        .groupby("episode_id")
        .any()
    )

    # Check if the PCI is complex
    if complex_pci_group is not None:
        index_spells["complex_pci_index"] = (
            matching_episodes["group"]
            .str.contains(complex_pci_group)
            .groupby("episode_id")
            .any()
        )    

    # Join some useful information about the episode
    index_spells = (
        index_spells.merge(
            episodes[["patient_id", "episode_start", "spell_id", "age", "gender"]],
            how="left",
            on="episode_id",
        )
        .rename(columns={"episode_start": "spell_start"})
        .reset_index("episode_id")
        .set_index("spell_id")
    )

    # Convert the age column to a float. This should probably
    # be done upstream
    index_spells["age"] = index_spells["age"].astype(float)

    return index_spells
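
To illustrate the inclusion rule, a standalone sketch of the ACS/PCI test applied to a toy table of first-episode codes (the group names "acs_bezin" and "pci" are assumptions for the example):

import pandas as pd

first_episode_codes = pd.DataFrame({
    "episode_id": ["e1", "e1", "e2"],
    "group": ["acs_bezin", "pci", "bleeding"],
    "position": [1, 3, 1],
})

# Primary ACS diagnosis, or PCI in any primary/secondary position
acs_match = (first_episode_codes["group"].str.contains("acs_bezin")) & (
    first_episode_codes["position"] == 1
)
pci_match = first_episode_codes["group"].str.contains("pci")
index_episode_ids = first_episode_codes[acs_match | pci_match]["episode_id"].unique()
print(index_episode_ids)  # ['e1']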

get_management(index_spells, all_other_codes, min_after, max_after, pci_group, cabg_group)

Get the management type for each index event

The result is a category series containing "PCI" if a PCI was performed, "CABG" if CABG was performed, or "Conservative" if neither were performed.

Parameters:

Name Type Description Default
index_spells DataFrame

A table containing spell_id as Pandas index and a column episode_id for the first episode in the index spell.

required
all_other_codes DataFrame

A table of other episodes (and their clinical codes) relative to the index spell, output from counting.get_all_other_codes.

required
min_after timedelta

The start of the window after the index to look for management

required
max_after timedelta

The end of the window after the index which defines management

required
pci_group str

The name of the code group defining PCI management

required
cabg_group str

The name of the code group defining CABG management

required

Returns:

Type Description
Series

A category series containing "PCI", "CABG", or "Conservative"

Source code in src\pyhbr\analysis\acs.py
def get_management(
    index_spells: DataFrame,
    all_other_codes: DataFrame,
    min_after: dt.timedelta,
    max_after: dt.timedelta,
    pci_group: str,
    cabg_group: str,
) -> Series:
    """Get the management type for each index event

    The result is a category series containing "PCI" if a PCI was performed, "CABG"
    if CABG was performed, or "Conservative" if neither were performed.

    Args:
        index_spells: A table containing `spell_id` as Pandas index and a
            column `episode_id` for the first episode in the index spell.
        all_other_codes: A table of other episodes (and their clinical codes)
            relative to the index spell, output from counting.get_all_other_codes.
        min_after: The start of the window after the index to look for management
        max_after: The end of the window after the index which defines management
        pci_group: The name of the code group defining PCI management
        cabg_group: The name of the code group defining CABG management

    Returns:
        A category series containing "PCI", "CABG", or "Conservative"
    """

    management_window = counting.get_time_window(all_other_codes, min_after, max_after)

    # Ensure that rows are only kept if they are from the same spell (management
    # must occur before a hospital discharge and readmission)
    same_spell_management_window = management_window[
        management_window["index_spell_id"].eq(management_window["other_spell_id"])
    ]

    def check_management_type(g):
        if g.eq(cabg_group).any():
            return "CABG"
        elif g.eq(pci_group).any():
            return "PCI"
        else:
            return "Conservative"

    return (
        same_spell_management_window.groupby("index_spell_id")[["group"]]
        .agg(check_management_type)
        .astype("category")
    )
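
A standalone sketch of the management classification on a toy table of codes from within the index spell (the group names are assumptions for the example):

import pandas as pd

same_spell_codes = pd.DataFrame({
    "index_spell_id": ["s1", "s1", "s2", "s3"],
    "group": ["pci", "acs_bezin", "cabg", "acs_bezin"],
})

def check_management_type(g):
    if g.eq("cabg").any():
        return "CABG"
    elif g.eq("pci").any():
        return "PCI"
    else:
        return "Conservative"

management = (
    same_spell_codes.groupby("index_spell_id")[["group"]]
    .agg(check_management_type)
    .astype("category")
)
print(management)  # s1: PCI, s2: CABG, s3: Conservative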

get_outcomes(index_spells, all_other_codes, date_of_death, cause_of_death, non_fatal_group, fatal_group)

Get non-fatal and fatal outcomes defined by code groups

Parameters:

Name Type Description Default
index_spells DataFrame

A table containing spell_id as Pandas index and a column episode_id for the first episode in the index spell.

required
all_other_codes DataFrame

A table of other episodes (and their clinical codes) relative to the index spell, output from counting.get_all_other_codes.

required
date_of_death DataFrame

Contains a column date_of_death, with Pandas index patient_id

required
cause_of_death DataFrame

Contains columns patient_id, code (ICD-10) for cause of death, position of the code, and group.

required
non_fatal_group str

The name of the ICD-10 group defining the non-fatal outcome (the primary diagnosis of subsequent episodes are checked for codes in this group)

required
fatal_group str

The name of the ICD-10 group defining the fatal outcome (the primary diagnosis in the cause-of-death is checked for codes in this group).

required

Returns:

Type Description
DataFrame

A dataframe, indexed by spell_id (i.e. the index spell), with columns all (which counts the total fatal and non-fatal outcomes), and fatal (which just contains the fatal outcome)

Source code in src\pyhbr\analysis\acs.py
def get_outcomes(
    index_spells: DataFrame,
    all_other_codes: DataFrame,
    date_of_death: DataFrame,
    cause_of_death: DataFrame,
    non_fatal_group: str,
    fatal_group: str,
) -> DataFrame:
    """Get non-fatal and fatal outcomes defined by code groups

    Args:
        index_spells: A table containing `spell_id` as Pandas index and a
            column `episode_id` for the first episode in the index spell.
        all_other_codes: A table of other episodes (and their clinical codes)
            relative to the index spell, output from counting.get_all_other_codes.
        date_of_death: Contains a column date_of_death, with Pandas index
            `patient_id`
        cause_of_death: Contains columns `patient_id`, `code` (ICD-10) for
            cause of death, `position` of the code, and `group`.
        non_fatal_group: The name of the ICD-10 group defining the non-fatal
            outcome (the primary diagnosis of subsequent episodes are checked
            for codes in this group)
        fatal_group: The name of the ICD-10 group defining the fatal outcome
            (the primary diagnosis in the cause-of-death is checked for codes
            in this group).

    Returns:
        A dataframe, indexed by `spell_id` (i.e. the index spell), with columns
            `all` (which counts the total fatal and non-fatal outcomes),
            and `fatal` (which just contains the fatal outcome)
    """

    # Follow-up time for fatal and non-fatal events
    max_after = dt.timedelta(days=365)

    # Properties of non-fatal events
    max_position = 1  # Primary diagnosis only defines a non-fatal outcome
    exclude_index_spell = False
    min_after = dt.timedelta(hours=48)

    # Work out fatal outcome
    fatal = get_fatal_outcome(
        index_spells, date_of_death, cause_of_death, fatal_group, max_after
    )

    # Get the episodes (and all their codes) in the follow-up window
    following_year = counting.get_time_window(all_other_codes, min_after, max_after)

    # Get non-fatal outcome
    outcome_episodes = filter_by_code_groups(
        following_year,
        non_fatal_group,
        max_position,
        exclude_index_spell,
    )
    non_fatal = counting.count_code_groups(index_spells, outcome_episodes)

    return DataFrame({"all": non_fatal + fatal, "fatal": fatal})
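
As a minimal sketch of how the two pieces combine, with toy per-spell counts (indexed by spell_id):

import pandas as pd

spells = pd.Index(["s1", "s2", "s3"], name="spell_id")
fatal = pd.Series([0, 1, 0], index=spells)      # count of fatal outcomes
non_fatal = pd.Series([2, 0, 0], index=spells)  # count of non-fatal outcomes

outcomes = pd.DataFrame({"all": non_fatal + fatal, "fatal": fatal})
print(outcomes)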

get_secondary_care_prescriptions_features(prescriptions, index_spells, episodes)

Get dummy feature columns for OAC and NSAID medications on admission

Parameters:

Name Type Description Default
prescriptions DataFrame

The table of secondary care prescriptions, containing a group column and spell_id.

required
index_spells DataFrame

The index spells, which must be indexed by spell_id

required
episodes DataFrame

The episodes table containing admission and discharge, for linking prescriptions to spells.

required
Source code in src\pyhbr\analysis\acs.py
def get_secondary_care_prescriptions_features(
    prescriptions: DataFrame, index_spells: DataFrame, episodes: DataFrame
) -> DataFrame:
    """Get dummy feature columns for OAC and NSAID medications on admission

    Args:
        prescriptions: The table of secondary care prescriptions, containing
            a `group` column and `spell_id`.
        index_spells: The index spells, which must be indexed by `spell_id`
        episodes: The episodes table containing `admission` and `discharge`,
            for linking prescriptions to spells.
    """

    # Get all the data required
    df = (
        index_spells.reset_index("spell_id")
        .merge(prescriptions, on="patient_id", how="left")
        .merge(episodes[["admission", "discharge"]], on="episode_id", how="left")
    )

    # Keep only prescriptions ordered between admission and discharge
    # marked as present on admission
    within_spell = (df["order_date"] >= df["admission"]) & (
        df["order_date"] <= df["discharge"]
    )

    # Filter and create dummy variables for on-admission medication
    dummies = (
        pd.get_dummies(
            df[within_spell & df["on_admission"]].set_index("spell_id")["group"]
        )
        .groupby("spell_id")
        .max()
        .astype(int)
    )

    # Join back onto index events and set missing entries to zero
    return index_spells[[]].merge(dummies, how="left", on="spell_id").fillna(0)
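
A standalone sketch of the dummy-variable step, assuming a toy table of prescriptions that has already been filtered to those ordered within the index spell and marked as present on admission (the group values are assumptions):

import pandas as pd

on_admission = pd.DataFrame({
    "spell_id": ["s1", "s1", "s2"],
    "group": ["oac", "nsaid", "oac"],
})
index_spells = pd.DataFrame(index=pd.Index(["s1", "s2", "s3"], name="spell_id"))

# One indicator column per prescription group, zero where nothing was found
dummies = (
    pd.get_dummies(on_admission.set_index("spell_id")["group"])
    .groupby("spell_id")
    .max()
    .astype(int)
)
features = index_spells.join(dummies, how="left").fillna(0).astype(int)
print(features)  # columns nsaid and oac, all zeros for s3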

get_survival_data(index_spells, fatal, non_fatal, max_after)

Get survival data from fatal and non-fatal outcomes

Parameters:

Name Type Description Default
index_spells DataFrame

The index spells, indexed by spell_id

required
fatal DataFrame

The table of fatal outcomes, containing a survival_time column

required
non_fatal DataFrame

The table of non-fatal outcomes, containing a time_to_other_episode column

required
max_after timedelta

The right censor time. This is the maximum time for data contained in the fatal and non_fatal tables; any index spells with no events in either table will be right-censored with this time.

required

Returns:

Type Description
DataFrame

The survival data containing both fatal and non-fatal events. The survival time is the time_to_event column, the fatal column contains a flag indicating whether the event was fatal, and the right_censor column indicates whether the survival time is censored. The code and docs columns provide information about the type of event for non-censored data (NA otherwise).

Source code in src\pyhbr\analysis\acs.py
def get_survival_data(
    index_spells: DataFrame,
    fatal: DataFrame,
    non_fatal: DataFrame,
    max_after: dt.timedelta,
) -> DataFrame:
    """Get survival data from fatal and non-fatal outcomes

    Args:
        index_spells: The index spells, indexed by `spell_id`
        fatal: The table of fatal outcomes, containing a `survival_time` column
        non_fatal: The table of non-fatal outcomes, containing a `time_to_other_episode` column
        max_after: The right censor time. This is the maximum time for data contained in the
            fatal and non_fatal tables; any index spells with no events in either table
            will be right-censored with this time.

    Returns:
        The survival data containing both fatal and non-fatal events. The survival time is the
            `time_to_event` column, the `fatal` column contains a flag indicating whether the
            event was fatal, and the `right_censor` column indicates whether the survival time
            is censored. The `code` and `docs` column provide information about the type of
            event for non-censored data (NA otherwise).
    """
    # Get bleeding survival analysis data (for both fatal
    # and non-fatal bleeding). First, combine the fatal
    # and non-fatal data
    cols_to_keep = ["index_spell_id", "code", "docs", "time_to_event"]
    non_fatal_survival = non_fatal.rename(
        columns={"time_to_other_episode": "time_to_event"}
    )[cols_to_keep]
    non_fatal_survival["fatal"] = False
    fatal_survival = fatal.rename(columns={"survival_time": "time_to_event"})[
        cols_to_keep
    ]
    fatal_survival["fatal"] = True
    survival = pd.concat([fatal_survival, non_fatal_survival])

    # Take only the first event for each index spell
    first_event = (
        survival.sort_values("time_to_event")
        .groupby("index_spell_id")
        .head(1)
        .set_index("index_spell_id")
    )
    first_event["right_censor"] = False
    with pd.option_context("future.no_silent_downcasting", True):
        with_censor = (
            index_spells[[]]
            .merge(first_event, left_index=True, right_index=True, how="left")
            .fillna({"fatal": False, "time_to_event": max_after, "right_censor": True})
            .infer_objects(copy=False)
        )
    return with_censor
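
A standalone sketch of the first-event selection and right-censoring steps on toy data (only the columns needed for the illustration are included):

import pandas as pd

index_spells = pd.DataFrame(index=pd.Index(["s1", "s2"], name="index_spell_id"))
events = pd.DataFrame({
    "index_spell_id": ["s1", "s1"],
    "time_to_event": pd.to_timedelta([100, 10], unit="D"),
    "fatal": [True, False],
})

# Keep only the earliest event per index spell
first_event = (
    events.sort_values("time_to_event")
    .groupby("index_spell_id")
    .head(1)
    .set_index("index_spell_id")
)
first_event["right_censor"] = False

# Index spells with no event are right-censored at the follow-up limit
survival = index_spells.merge(
    first_event, left_index=True, right_index=True, how="left"
).fillna({"fatal": False, "time_to_event": pd.Timedelta(days=365), "right_censor": True})
print(survival)  # s1: non-fatal event at 10 days; s2: censored at 365 days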

get_therapy(index_spells, primary_care_prescriptions)

Get therapy (DAPT, etc.) recorded in primary care prescriptions in the 60 days after the index

Parameters:

Name Type Description Default
index_spells DataFrame

Index spells, containing spell_id

required
primary_care_prescriptions DataFrame

Contains a column name with the prescription and date when the prescription was recorded.

required

Returns:

Type Description
DataFrame

DataFrame with a column therapy indexed by spell_id

Source code in src\pyhbr\analysis\acs.py
def get_therapy(index_spells: DataFrame, primary_care_prescriptions: DataFrame) -> DataFrame:
    """Get therapy (DAPT, etc.) recorded in primary care prescriptions in 60 days after index

    Args:
        index_spells: Index spells, containing `spell_id`
        primary_care_prescriptions: Contains a column `name` with the prescription
            and `date` when the prescription was recorded.

    Returns:
        DataFrame with a column `therapy` indexed by `spell_id`
    """

    # Fetch a particular table or item from raw_data
    df = primary_care_prescriptions.copy()


    def map_medicine(x):
        if x is None:
            return np.nan
        medicines = ["warfarin", "ticagrelor", "prasugrel", "clopidogrel", "aspirin"]
        for m in medicines:
            if m in x.lower():
                return m
        return np.nan


    df["medicine"] = df["name"].apply(map_medicine)

    # Join primary care prescriptions onto index spells
    df = index_spells.reset_index().merge(
        df, on="patient_id", how="left"
    )

    # Filter to only prescriptions seen in the following 60 days
    df = df[
        (df["spell_start"] - df["date"] < dt.timedelta(days=0))
        & (df["date"] - df["spell_start"] < dt.timedelta(days=60))
        & ~df["medicine"].isna()
    ]

    def map_therapy(x):

        aspirin = x["medicine"].eq("aspirin").any()
        oac = x["medicine"].eq("warfarin").any()
        p2y12 = x["medicine"].isin(["ticagrelor", "prasugrel", "clopidogrel"]).any()

        if aspirin & p2y12 & oac:
            return "Triple"
        elif aspirin & x["medicine"].eq("ticagrelor").any():
            return "DAPT-AT"
        elif aspirin & x["medicine"].eq("prasugrel").any():
            return "DAPT-AP"
        elif aspirin & x["medicine"].eq("clopidogrel").any():
            return "DAPT-AC"
        elif aspirin:
            return "Single"
        else:
            return np.nan

    # Get the type of therapy seen after the index spell
    therapy = df.groupby("spell_id")[["medicine"]].apply(map_therapy).rename("therapy")

    # Join back onto the index spells to include cases where no
    # therapy was seen
    return index_spells[[]].merge(therapy, on="spell_id", how="left")

identify_fatal_outcome(index_spells, date_of_death, cause_of_death, outcome_group, max_position, max_after)

Get fatal outcomes defined by a diagnosis code in a code group

Parameters:

Name Type Description Default
index_spells DataFrame

A table containing spell_id as Pandas index and a column episode_id for the first episode in the index spell.

required
date_of_death DataFrame

Contains a column date_of_death, with Pandas index patient_id

required
cause_of_death DataFrame

Contains columns patient_id, code (ICD-10) for cause of death, position of the code, and group.

required
outcome_group str

The name of the ICD-10 code group which defines the fatal outcome.

required
max_position int

The maximum primary/secondary cause of death that will be checked for the code group.

required
max_after timedelta

The maximum follow-up period after the index for valid outcomes.

required

Returns:

Type Description
Series

A table of the fatal outcome codes that meet the group, position, and follow-up-time inclusion criteria, with columns index_spell_id, survival_time, code, position, docs and group.

Source code in src\pyhbr\analysis\acs.py
def identify_fatal_outcome(
    index_spells: DataFrame,
    date_of_death: DataFrame,
    cause_of_death: DataFrame,
    outcome_group: str,
    max_position: int,
    max_after: dt.timedelta,
) -> Series:
    """Get fatal outcomes defined by a diagnosis code in a code group

    Args:
        index_spells: A table containing `spell_id` as Pandas index and a
            column `episode_id` for the first episode in the index spell.
        date_of_death: Contains a column date_of_death, with Pandas index
            `patient_id`
        cause_of_death: Contains columns `patient_id`, `code` (ICD-10) for
            cause of death, `position` of the code, and `group`.
        outcome_group: The name of the ICD-10 code group which defines the fatal
            outcome.
        max_position: The maximum primary/secondary cause of death that will be
            checked for the code group.
        max_after: The maximum follow-up period after the index for valid outcomes.

    Returns:
        A table of the fatal outcome codes that meet the group, position, and
            follow-up-time inclusion criteria.
    """

    # Inner join to get a table of index patients with death records
    mortality_after_index = (
        index_spells.reset_index()
        .merge(date_of_death, on="patient_id", how="inner")
        .merge(cause_of_death, on="patient_id", how="inner")
    )
    mortality_after_index["survival_time"] = (
        mortality_after_index["date_of_death"] - mortality_after_index["spell_start"]
    )

    # Reduce to only the fatal outcomes that meet the time window and
    # code inclusion criteria
    df = mortality_after_index[
        (mortality_after_index["survival_time"] < max_after)
        & (mortality_after_index["position"] <= max_position)
        & (mortality_after_index["group"] == outcome_group)
    ]

    # Rename the id columns to be compatible with counting.count_code_groups
    # and select columns of interest
    return df.rename(columns={"spell_id": "index_spell_id"})[
        ["index_spell_id", "survival_time", "code", "position", "docs", "group"]
    ]

link_attribute_period_to_index(index_spells, primary_care_attributes)

Link primary care attributes to index spells by attribute date

The date column of an attributes row indicates that the attribute was valid at the end of the interval (date, date + 1month). It is important that no attribute is used in modelling that could have occurred after the index event, meaning that date + 1month < spell_start must hold for any attribute used as a predictor. On the other hand, data substantially before the index event should not be used. The valid window is controlled by imposing:

spell_start - attribute_valid_window < date

Parameters:

Name Type Description Default
index_spells DataFrame

The index spell table, containing a spell_start column and patient_id

required
primary_care_attributes DataFrame

The patient attributes table, containing date and patient_id

required

Returns:

Type Description
DataFrame

The index_spells table with a date column added to link the attributes (along with patient_id). This may be NaT if there is no valid attribute for this index event.

Source code in src\pyhbr\analysis\acs.py
def link_attribute_period_to_index(
    index_spells: DataFrame, primary_care_attributes: DataFrame
) -> DataFrame:
    """Link primary care attributes to index spells by attribute date

    The date column of an attributes row indicates that
    the attribute was valid at the end of the interval
    (date, date + 1month). It is important
    that no attribute is used in modelling that could have occurred
    after the index event, meaning that date + 1month < spell_start
    must hold for any attribute used as a predictor. On the other hand,
    data substantially before the index event should not be used. The
    valid window is controlled by imposing:

        spell_start - attribute_valid_window < date

    Args:
        index_spells: The index spell table, containing a `spell_start`
            column and `patient_id`
        primary_care_attributes: The patient attributes table, containing
            `date` and `patient_id`

    Returns:
        The index_spells table with a `date` column added to link the
            attributes (along with `patient_id`). This may be NaT if 
            there is no valid attribute for this index event.
    """

    # Define a window before the index event where SWD attributes will be considered valid.
    # 60 days is used to ensure that a full month is definitely captured. This
    # ensures that attribute data that is fairly recent is used as predictors.
    attribute_valid_window = dt.timedelta(days=60)

    # Add all the patient's attributes onto each index spell
    df = index_spells.reset_index().merge(
        primary_care_attributes[["patient_id", "date"]],
        how="left",
        on="patient_id",
    )

    # Only keep attributes that are from strictly before the index spell
    # (note date represents the start of the month that attributes
    # apply to)
    attr_before_index = df[(df["date"] + dt.timedelta(days=31)) < df["spell_start"]]

    # Keep only the most recent attribute before the index spell
    most_recent = attr_before_index.sort_values("date").groupby("spell_id").tail(1)

    # Exclude attributes that occurred outside the attribute_value_window before the index
    swd_index_spells = most_recent[
        most_recent["date"] > (most_recent["spell_start"] - attribute_valid_window)
    ]

    return index_spells.merge(
        swd_index_spells[["spell_id", "date"]].set_index("spell_id"),
        how="left",
        on="spell_id",
    )
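
A standalone sketch of the validity-window logic on toy dates, keeping the most recent attribute month that finished before the index and is no more than 60 days old:

import datetime as dt
import pandas as pd

df = pd.DataFrame({
    "spell_id": ["s1", "s1", "s1"],
    "spell_start": pd.to_datetime(["2024-03-15"] * 3),
    "date": pd.to_datetime(["2023-11-01", "2024-01-01", "2024-02-01"]),
})
attribute_valid_window = dt.timedelta(days=60)

# The attribute month must have finished strictly before the index event...
before_index = df[(df["date"] + dt.timedelta(days=31)) < df["spell_start"]]

# ...and the most recent such month must not be too old
most_recent = before_index.sort_values("date").groupby("spell_id").tail(1)
valid = most_recent[
    most_recent["date"] > (most_recent["spell_start"] - attribute_valid_window)
]
print(valid)  # keeps the 2024-02-01 attribute month for spell s1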

prescriptions_before_index(swd_index_spells, primary_care_prescriptions)

Get the number of primary care prescriptions before each index spell

Parameters:

Name Type Description Default
swd_index_spells DataFrame

Must have Pandas index spell_id

required
primary_care_prescriptions DataFrame

Must contain a name column that contains a string containing the medicine name somewhere (any case), a date column with the prescription date, and a patient_id column.

required

Returns:

Type Description
DataFrame

A table indexed by spell_id that contains one column for each prescription type, prefixed with "prior_"

Source code in src\pyhbr\analysis\acs.py
def prescriptions_before_index(
    swd_index_spells: DataFrame, primary_care_prescriptions: DataFrame
) -> DataFrame:
    """Get the number of primary care prescriptions before each index spell

    Args:
        swd_index_spells: Must have Pandas index `spell_id`
        primary_care_prescriptions: Must contain a `name` column
            that contains a string containing the medicine name
            somewhere (any case), a `date` column with the
            prescription date, and a `patient_id` column.

    Returns:
        A table indexed by `spell_id` that contains one column
            for each prescription type, prefixed with "prior_"
    """

    df = primary_care_prescriptions

    # Filter for relevant prescriptions
    df = from_hic.filter_by_medicine(df)

    # Drop rows where the prescription date is not known
    df = df[~df["date"].isna()]

    # Join the prescriptions to the index spells
    df = (
        swd_index_spells[["spell_start", "patient_id"]]
        .reset_index()
        .merge(df, how="left", on="patient_id")
    )
    df["time_to_index_spell"] = df["spell_start"] - df["date"]

    # Only keep prescriptions occurring in the year before the index event
    min_before = dt.timedelta(days=0)
    max_before = dt.timedelta(days=365)
    events_before_index = counting.get_time_window(
        df, -max_before, -min_before, "time_to_index_spell"
    )

    # Pivot each row (each prescription) to one column per
    # prescription group.
    all_counts = counting.count_events(
        swd_index_spells, events_before_index, "group"
    ).add_prefix("prior_")

    return all_counts

remove_features(index_attributes, max_missingness, const_threshold)

Reduce to just the columns meeting minimum missingness and variability criteria.

Parameters:

Name Type Description Default
index_attributes DataFrame

The table of primary care attributes for the index spells

required
max_missingness

The maximum allowed missingness in a column before a column is removed as a feature.

required
const_threshold

The maximum allowed constant-value proportion (NA + most common non-NA value) before a column is removed as a feature

required

Returns:

Type Description
DataFrame

A table containing the features that remain, which contain sufficient non-missing values and sufficient variance.

Source code in src\pyhbr\analysis\acs.py
def remove_features(
    index_attributes: DataFrame, max_missingness, const_threshold
) -> DataFrame:
    """Reduce to just the columns meeting minimum missingness and variability criteria.

    Args:
        index_attributes: The table of primary care attributes for the index spells
        max_missingness: The maximum allowed missingness in a column before a column
            is removed as a feature.
        const_threshold: The maximum allowed constant-value proportion (NA + most
            common non-NA value) before a column is removed as a feature

    Returns:
        A table containing the features that remain, which contain sufficient
            non-missing values and sufficient variance.
    """
    missingness = describe.proportion_missingness(index_attributes)
    nearly_constant = describe.nearly_constant(index_attributes, const_threshold)
    to_keep = (missingness < max_missingness) & ~nearly_constant
    return index_attributes.loc[:, to_keep]
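
The describe helpers are not shown on this page; the following is a standalone pandas sketch of an equivalent filter under the semantics described above (thresholds and column values are made up):

import numpy as np
import pandas as pd

index_attributes = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "nearly_constant": [0, 0, 0, 0],
    "useful": [1, 2, 3, 4],
})
max_missingness = 0.5
const_threshold = 0.9

missingness = index_attributes.isna().mean()

def const_proportion(col):
    # Proportion that is NA plus the most common non-NA value
    counts = col.value_counts()
    top = counts.iloc[0] if len(counts) > 0 else 0
    return (col.isna().sum() + top) / len(col)

nearly_constant = index_attributes.apply(const_proportion) >= const_threshold
to_keep = (missingness < max_missingness) & ~nearly_constant
print(index_attributes.loc[:, to_keep])  # only the "useful" column survives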

arc_hbr

Calculation of the ARC HBR score

all_index_spell_episodes(index_episodes, episodes)

Get all the other episodes in the index spell

This is a dataframe of index spells (defined as the spell containing an episode in index_episodes), along with all the episodes in that spell (including the index episode itself). This is useful for performing operations at index-spell granularity

Parameters:

Name Type Description Default
index_episodes DataFrame

Must contain Pandas index episode_id

required
episodes DataFrame

Must contain Pandas index episode_id and have a column spell_id.

required

Returns:

Type Description
DataFrame

A dataframe with a column spell_id for index spells, and episode_id for all episodes in that spell. A column index_episode shows which of the episodes is the first episode in the spell.

Source code in src\pyhbr\analysis\arc_hbr.py
def all_index_spell_episodes(
    index_episodes: DataFrame, episodes: DataFrame
) -> DataFrame:
    """Get all the other episodes in the index spell

    This is a dataframe of index spells (defined as the spell containing
    an episode in index_episodes), along with all the episodes in that
    spell (including the index episode itself). This is useful for
    performing operations at index-spell granularity

    Args:
        index_episodes: Must contain Pandas index `episode_id`
        episodes: Must contain Pandas index `episode_id` and have a column
            `spell_id`.

    Returns:
        A dataframe with a column `spell_id` for index spells, and `episode_id`
            for all episodes in that spell. A column `index_episode` shows which
            of the episodes is the first episode in the spell.
    """
    index_spells = (
        index_episodes[[]]
        .merge(episodes["spell_id"], how="left", on="episode_id")
        .set_index("spell_id")
    )
    return index_spells.merge(episodes.reset_index(), how="left", on="spell_id")[
        ["episode_id", "spell_id"]
    ]

arc_hbr_age(has_age)

Calculate the age ARC-HBR criterion

Calculate the age ARC HBR criterion (0.5 points if > 75 at index, 0 otherwise).

Parameters:

Name Type Description Default
has_age DataFrame

Dataframe which has a column age

required

Returns:

Type Description
Series

A series of values 0.5 (if age > 75 at index) or 0 otherwise, indexed by input dataframe index.

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_age(has_age: DataFrame) -> Series:
    """Calculate the age ARC-HBR criterion

    Calculate the age ARC HBR criterion (0.5 points if > 75 at index, 0 otherwise).

    Args:
        has_age: Dataframe which has a column `age`

    Returns:
        A series of values 0.5 (if age > 75 at index) or 0 otherwise, indexed
            by input dataframe index.
    """
    return Series(np.where(has_age["age"] > 75, 0.5, 0), index=has_age.index)

arc_hbr_anaemia(has_index_hb_and_gender)

Calculate the ARC HBR anaemia (low Hb) criterion

Calculates anaemia based on the worst (lowest) index Hb measurement and gender currently. Should be modified to take most recent Hb value or clinical code.

Parameters:

Name Type Description Default
has_index_hb_and_gender DataFrame

Dataframe having the column index_hb containing the Hb measurement (g/dL) at the index event, or NaN if no Hb measurement was made. Also contains gender (categorical with categories "male", "female", and "unknown").

required

Returns:

Type Description
Series

A series containing the HBR score for the index episode.

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_anaemia(has_index_hb_and_gender: DataFrame) -> Series:
    """Calculate the ARC HBR anaemia (low Hb) criterion

    Calculates anaemia based on the worst (lowest) index Hb measurement
    and gender currently. Should be modified to take most recent Hb value
    or clinical code.

    Args:
        has_index_hb_and_gender: Dataframe having the column `index_hb` containing the
            Hb measurement (g/dL) at the index event, or NaN if no Hb measurement
            was made. Also contains `gender` (categorical with categories "male",
            "female", and "unknown").

    Returns:
        A series containing the HBR score for the index episode.
    """

    df = has_index_hb_and_gender

    # Evaluated in order
    arc_score_conditions = [
        df["hb"] < 11.0,  # Major for any gender
        df["hb"] < 11.9,  # Minor for any gender
        (df["hb"] < 12.9) & (df["gender"] == "male"),  # Minor for male
        df["hb"] >= 12.9,  # None for any gender
    ]
    arc_scores = [1.0, 0.5, 0.5, 0.0]

    # Default is used to fill missing Hb score with 0.0 for now. TODO: replace with
    # fall-back to recent Hb, or codes.
    return Series(
        np.select(arc_score_conditions, arc_scores, default=0.0),
        index=df.index,
    )

arc_hbr_cancer(has_prior_cancer)

Calculate the cancer ARC HBR criterion

This function takes a dataframe with a column prior_cancer with a count of the cancer diagnoses in the previous year.

Parameters:

Name Type Description Default
has_prior_cancer DataFrame

Has a column prior_cancer with a count of the number of cancer diagnoses occurring in the year before the index event.

required

Returns:

Type Description
Series

The ARC HBR cancer criterion (0.0, 1.0)

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_cancer(has_prior_cancer: DataFrame) -> Series:
    """Calculate the cancer ARC HBR criterion

    This function takes a dataframe with a column prior_cancer
    with a count of the cancer diagnoses in the previous year.

    Args:
        has_prior_cancer: Has a column `prior_cancer` with a count
            of the number of cancer diagnoses occurring in the
            year before the index event.

    Returns:
        The ARC HBR cancer criterion (0.0, 1.0)
    """
    return Series(
        np.where(has_prior_cancer["cancer_before"] > 0, 1.0, 0),
        index=has_prior_cancer.index,
    )

arc_hbr_cirrhosis_ptl_hyp(has_prior_cirrhosis)

Calculate the liver cirrhosis with portal hypertension ARC HBR criterion

This function takes a dataframe with two columns prior_cirrhosis and prior_portal_hyp, which count the number of diagnoses of liver cirrhosis and portal hypertension seen in the previous year.

Parameters:

Name Type Description Default
has_prior_cirrhosis DataFrame

Has columns prior_cirrhosis and prior_portal_hyp.

required

Returns:

Type Description
Series

The ARC HBR criterion (0.0, 1.0)

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_cirrhosis_ptl_hyp(has_prior_cirrhosis: DataFrame) -> Series:
    """Calculate the liver cirrhosis with portal hypertension ARC HBR criterion

    This function takes a dataframe with two columns prior_cirrhosis
    and prior_portal_hyp, which count the number of diagnoses of
    liver cirrhosis and portal hypertension seen in the previous
    year.

    Args:
        has_prior_cirrhosis: Has columns `prior_cirrhosis` and
            `prior_portal_hyp`.

    Returns:
        The ARC HBR criterion (0.0, 1.0)
    """
    cirrhosis = has_prior_cirrhosis["liver_cirrhosis_before"] > 0
    portal_hyp = has_prior_cirrhosis["portal_hypertension_before"] > 0

    return Series(
        np.where(cirrhosis & portal_hyp, 1.0, 0),
        index=has_prior_cirrhosis.index,
    )

arc_hbr_ckd(has_index_egfr)

Calculate the ARC HBR chronic kidney disease (CKD) criterion

The ARC HBR CKD criterion is calculated based on the eGFR as follows:

eGFR Score
eGFR < 30 mL/min 1.0
30 mL/min <= eGFR < 60 mL/min 0.5
eGFR >= 60 mL/min 0.0

If the eGFR is NaN, set score to zero (TODO: fall back to ICD-10 codes in this case)

Parameters:

Name Type Description Default
has_index_egfr DataFrame

Dataframe having the column index_egfr (in units of mL/min) with the eGFR measurement at index, or NaN which means no eGFR measurement was found at the index.

required

Returns:

Type Description
Series

A series containing the CKD ARC criterion, based on the eGFR at index.

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_ckd(has_index_egfr: DataFrame) -> Series:
    """Calculate the ARC HBR chronic kidney disease (CKD) criterion

    The ARC HBR CKD criterion is calculated based on the eGFR as
    follows:

    | eGFR                           | Score |
    |--------------------------------|-------|
    | eGFR < 30 mL/min               | 1.0   |
    | 30 mL/min \<= eGFR < 60 mL/min | 0.5   |
    | eGFR >= 60 mL/min              | 0.0   |

    If the eGFR is NaN, set score to zero (TODO: fall back to ICD-10
    codes in this case)

    Args:
        has_index_egfr: Dataframe having the column `index_egfr` (in units of mL/min)
            with the eGFR measurement at index, or NaN which means no eGFR
            measurement was found at the index.

    Returns:
        A series containing the CKD ARC criterion, based on the eGFR at
            index.
    """

    # Replace NaN values for now with 90 (meaning score 0.0)
    df = has_index_egfr["egfr"].fillna(90)

    # Using a high upper limit to catch any high eGFR values. In practice,
    # the highest value is 90 (which comes from the string ">90" in the database).
    return cut(df, [0, 30, 60, 10000], right=False, labels=[1.0, 0.5, 0.0])
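
A standalone sketch of the eGFR binning with pandas.cut (toy values; NaN is filled with 90, which falls in the zero-score bin):

import numpy as np
import pandas as pd

egfr = pd.Series([25.0, 45.0, np.nan, 75.0]).fillna(90)
ckd_score = pd.cut(egfr, [0, 30, 60, 10000], right=False, labels=[1.0, 0.5, 0.0])
print(ckd_score)  # 1.0, 0.5, 0.0, 0.0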

arc_hbr_ischaemic_stroke_ich(has_prior_ischaemic_stroke)

Calculate the ischaemic stroke/intracranial haemorrhage ARC HBR criterion

This function takes a dataframe with columns counting the number of prior ischaemic stroke diagnoses and the number of prior brain arteriovenous malformation (bAVM) or intracranial haemorrhage (ICH) diagnoses seen in the previous year.

If bAVM/ICH is present, 1.0 is added to the score. Else, if ischaemic stroke is present, add 0.5. Otherwise add 0.

Parameters:

Name Type Description Default
has_prior_ischaemic_stroke DataFrame

Has a column prior_ischaemic_stroke containing the number of any-severity ischaemic strokes in the previous year, and a column prior_bavm_ich containing a count of any diagnosis of brain arteriovenous malformation or intracranial haemorrhage.

required

Returns:

Type Description
Series

The ARC HBR criterion (0.0, 1.0)

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_ischaemic_stroke_ich(has_prior_ischaemic_stroke: DataFrame) -> Series:
    """Calculate the ischaemic stroke/intracranial haemorrhage ARC HBR criterion

    This function takes a dataframe with columns counting the number of
    prior ischaemic stroke diagnoses and the number of prior brain
    arteriovenous malformation (bAVM) or intracranial haemorrhage (ICH)
    diagnoses seen in the previous year.

    If bAVM/ICH is present, 1.0 is added to the score. Else, if
    ischaemic stroke is present, add 0.5. Otherwise add 0.

    Args:
        has_prior_ischaemic_stroke: Has a column `prior_ischaemic_stroke` containing
            the number of any-severity ischaemic strokes in the previous
            year, and a column `prior_bavm_ich` containing a count of
            any diagnosis of brain arteriovenous malformation or
            intracranial haemorrhage.


    Returns:
        The ARC HBR criterion (0.0, 1.0)
    """
    ischaemic_stroke = has_prior_ischaemic_stroke["ischaemic_stroke_before"] > 0
    bavm_ich = (has_prior_ischaemic_stroke["bavm_before"] + has_prior_ischaemic_stroke["ich_before"]) > 0

    score_one = np.where(bavm_ich, 1.0, 0.0)
    score_half = np.where(ischaemic_stroke, 0.5, 0.0)

    return Series(
        np.maximum(score_one, score_half),
        index=has_prior_ischaemic_stroke.index,
    )

arc_hbr_medicine(index_spells, episodes, prescriptions, medicine_group, arc_score)

Calculate the oral-anticoagulant/NSAID ARC HBR criterion

Pass the medicine group which qualifies for the OAC ARC criterion, along with the ARC score; or pass the same data for the NSAID criterion.

The score is added if a prescription of the medicine is seen at any time during the patient spell.

Notes on the OAC and NSAID criteria:

1.0 point is added if one of the OACs "warfarin", "apixaban", "rivaroxaban", "edoxaban", "dabigatran" is present in the index spell (meaning the index episode, or any other episode in the spell).

1.0 point is added if one of the following NSAIDs is present on admission:

  • Ibuprofen
  • Naproxen
  • Diclofenac
  • Celecoxib
  • Mefenamic acid
  • Etoricoxib
  • Indomethacin

Note

The on admission flag could be used to imply expected chronic/extended use, but this is not included as it filters out all OAC prescriptions in the HIC data.

Parameters:

Name Type Description Default
index_spells DataFrame

Index spell_id is used to narrow prescriptions.

required
episodes DataFrame

Contains admission and discharge, for linking prescriptions to spells.

required
prescriptions DataFrame

Contains name (of medicine) and group.

required
medicine_group str

The name of the prescription group which qualifies for the criterion.

required
arc_score float

The ARC HBR score for this criterion.

required

Returns:

Type Description
Series

The ARC score for each index spell

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_medicine(
    index_spells: DataFrame,
    episodes: DataFrame,
    prescriptions: DataFrame,
    medicine_group: str,
    arc_score: float,
) -> Series:
    """Calculate the oral-anticoagulant/NSAID ARC HBR criterion

    Pass the medicine group which qualifies for the OAC
    ARC criterion, along with the ARC score; or pass the same
    data for the NSAID criterion.

    The score is added if a prescription of the medicine is seen
    at any time during the patient spell.

    Notes on the OAC and NSAID criteria:

    1.0 point is added if one of the OACs "warfarin", "apixaban",
    "rivaroxaban", "edoxaban", "dabigatran", is present
    in the index spell (meaning the index episode, or any
    other episode in the spell).

    1.0 point is added if one of the following NSAIDs is present
    on admission:

    * Ibuprofen
    * Naproxen
    * Diclofenac
    * Celecoxib
    * Mefenamic acid
    * Etoricoxib
    * Indomethacin

    !!! note
        The on admission flag could be used to imply expected
        chronic/extended use, but this is not included as it filters
        out all OAC prescriptions in the HIC data.

    Args:
        index_spells: Index `spell_id` is used to narrow prescriptions.
        episodes: Contains `admission` and `discharge`, used to restrict
            prescriptions to the index spell.
        prescriptions: Contains `name` (of medicine), `group`, and
            `order_date`.
        medicine_group: The prescription group name used to identify
            qualifying medicines.
        arc_score: The score to add when the criterion is met.

    Returns:
        The ARC score for each index spell
    """

    # Get all the data required
    df = (
        index_spells.reset_index("spell_id")
        .merge(prescriptions, on="patient_id", how="left")
        .merge(episodes[["admission", "discharge"]], on="episode_id", how="left")
    )

    # Filter by prescription name and only keep only prescriptions ordered between
    # admission and discharge
    correct_prescription = df["group"] == medicine_group
    within_spell = (df["order_date"] >= df["admission"]) & (
        df["order_date"] <= df["discharge"]
    )

    # Populate the rows of df with the score (the ARC score passed
    # in as an argument, e.g. 1.0)
    df["arc_score"] = 0.0
    df.loc[correct_prescription & within_spell, "arc_score"] = arc_score

    # Group by the index spell id and get the max score
    return df.groupby("spell_id")["arc_score"].max()
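
A minimal usage sketch (table layouts inferred from the function body; the group label "oac" and all values are illustrative):

import pandas as pd
from pyhbr.analysis.arc_hbr import arc_hbr_medicine

index_spells = pd.DataFrame(
    {"patient_id": [1], "episode_id": [100]},
    index=pd.Index([5000], name="spell_id"),
)
episodes = pd.DataFrame(
    {
        "admission": [pd.Timestamp("2023-01-01")],
        "discharge": [pd.Timestamp("2023-01-05")],
    },
    index=pd.Index([100], name="episode_id"),
)
prescriptions = pd.DataFrame(
    {
        "patient_id": [1],
        "name": ["warfarin"],
        "group": ["oac"],
        "order_date": [pd.Timestamp("2023-01-02")],
    }
)

# 1.0 for the index spell, because a prescription in the "oac" group
# was ordered between admission and discharge
oac_score = arc_hbr_medicine(index_spells, episodes, prescriptions, "oac", 1.0)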

arc_hbr_nsaid(index_episodes, prescriptions)

Calculate the non-steroidal anti-inflammatory drug (NSAID) ARC HBR criterion

1.0 point is added if one of the following NSAIDs is present on admission:

  • Ibuprofen
  • Naproxen
  • Diclofenac
  • Celecoxib
  • Mefenamic acid
  • Etoricoxib
  • Indomethacin

Parameters:

Name Type Description Default
index_episodes DataFrame

Index episode_id is used to narrow prescriptions.

required
prescriptions DataFrame

Contains name (of medicine).

required

Returns:

Type Description
Series

The OAC ARC score for each index event.

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_nsaid(index_episodes: DataFrame, prescriptions: DataFrame) -> Series:
    """Calculate the non-steroidal anti-inflamatory drug (NSAID) ARC HBR criterion

    1.0 point is added if an one of the following NSAIDs is present
    on admission:

    * Ibuprofen
    * Naproxen
    * Diclofenac
    * Celecoxib
    * Mefenamic acid
    * Etoricoxib
    * Indomethacin

    Args:
        index_episodes: Index `episode_id` is used to narrow prescriptions.
        prescriptions: Contains `name` (of medicine).

    Returns:
        The OAC ARC score for each index event.
    """
    df = index_episodes.merge(prescriptions, how="left", on="episode_id")
    nsaid_criterion = ((df["group"] == "nsaid") & (df["on_admission"] == True)).astype(
        "float"
    )
    return nsaid_criterion.set_axis(index_episodes.index)

arc_hbr_prior_bleeding(has_prior_bleeding)

Calculate the prior bleeding/transfusion ARC HBR criterion

This function takes a dataframe with a column bleeding_adaptt_before with a count of the prior bleeding events in the previous year.

TODO: Input needs a separate column for bleeding in 6 months and bleeding in a year, so distinguish 0.5 from 1. Also need to add transfusion.

Parameters:

Name Type Description Default
has_prior_bleeding DataFrame

Has a column bleeding_adaptt_before with a count of the number of bleeds occurring one year before the index. Has episode_id as the index.

required

Returns:

Type Description
Series

The ARC HBR bleeding/transfusion criterion (0.0, 0.5, or 1.0)

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_prior_bleeding(has_prior_bleeding: DataFrame) -> Series:
    """Calculate the prior bleeding/transfusion ARC HBR criterion

    This function takes a dataframe with a column bleeding_adaptt_before
    with a count of the prior bleeding events in the previous year.

    TODO: Input needs a separate column for bleeding in 6 months and
    bleeding in a year, so distinguish 0.5 from 1. Also need to add
    transfusion.

    Args:
        has_prior_bleeding: Has a column `bleeding_adaptt_before` with a count
            of the number of bleeds occurring one year before the index.
            Has `episode_id` as the index.

    Returns:
        The ARC HBR bleeding/transfusion criterion (0.0, 0.5, or 1.0)
    """
    return Series(
        np.where(has_prior_bleeding["bleeding_adaptt_before"] > 0, 0.5, 0),
        index=has_prior_bleeding.index,
    )

arc_hbr_tcp(has_index_platelets)

Calculate the ARC HBR thrombocytopenia (low platelet count) criterion

The score is 1.0 if platelet count < 100e9/L, otherwise it is 0.0.

Parameters:

Name Type Description Default
has_index_platelets DataFrame

Has a column platelets, which is the platelet count measurement in the index.

required

Returns:

Type Description
Series

Series containing the ARC score

Source code in src\pyhbr\analysis\arc_hbr.py
def arc_hbr_tcp(has_index_platelets: DataFrame) -> Series:
    """Calculate the ARC HBR thrombocytopenia (low platelet count) criterion

    The score is 1.0 if platelet count < 100e9/L, otherwise it is 0.0.

    Args:
        has_index_platelets: Has a column `platelets`, which is the
            platelet count measurement in the index.

    Returns:
        Series containing the ARC score
    """
    return Series(
        np.where(has_index_platelets["platelets"] < 100, 1.0, 0),
        index=has_index_platelets.index,
    )

first_index_lab_result(index_spells, lab_results, episodes)

Get the (first) lab result associated to each index spell

Get a table of the first lab result seen in the index admission (between the admission date and discharge date), with one column for each value of the test_name column in lab_results.

The resulting table has all-NA rows for index spells where no lab results were seen, and cells contain NA if that lab result was missing from the index spell.

Parameters:

Name Type Description Default
index_spells DataFrame

Has a spell_id index and a patient_id column.

required
lab_results DataFrame

Has a test_name and a result column for the numerical test result, and a sample_date for when the sample for the test was collected.

required
episodes DataFrame

Indexed by episode_id, and contains admission and discharge columns.

required

Returns:

Type Description
DataFrame

A table indexed by spell_id containing one column per unique test in test_name (the column name is the same as the value in the test_name column).

Source code in src\pyhbr\analysis\arc_hbr.py
def first_index_lab_result(
    index_spells: DataFrame,
    lab_results: DataFrame,
    episodes: DataFrame,
) -> DataFrame:
    """Get the (first) lab result associated to each index spell

    Get a table of the first lab result seen in the index admission (between
    the admission date and discharge date), with one column for each
    value of the `test_name` column in lab_results.

    The resulting table has all-NA rows for index spells where no lab results
    were seen, and cells contain NA if that lab result was missing from the
    index spell.

    Args:
        index_spells: Has a `spell_id` index and a `patient_id` column.
        lab_results: Has a `test_name` and a `result` column for the
            numerical test result, and a `sample_date` for when the sample
            for the test was collected.
        episodes: Indexed by `episode_id`, and contains `admission`
            and `discharge` columns.

    Returns:
        A table indexed by `spell_id` containing one column per unique
            test in `test_name` (the column name is the same as the value
            in the `test_name` column).
    """

    # Get the admission and discharge time for each index spell
    admission_discharge = (
        index_spells[["patient_id", "episode_id"]]
        .reset_index("spell_id")
        .merge(episodes[["admission", "discharge"]], on="episode_id", how="left")
    )

    # For every index spell, join all the lab results for that patient
    # and reduce to only those occurring within the admission/discharge
    # time window
    df = admission_discharge.merge(lab_results, on="patient_id", how="left")
    within_spell = (df["sample_date"] > df["admission"]) & (
        df["sample_date"] < df["discharge"]
    )
    index_spell_labs = df[within_spell]
    first_lab = (
        index_spell_labs.sort_values("sample_date")
        .groupby(["spell_id", "test_name"])
        .head(1)
    )

    # Convert to a wide format with one column per test, then right-join the
    # index spells to catch cases where no lab results were present
    wide = first_lab.pivot(index="spell_id", columns="test_name", values="result").merge(
        index_spells[[]], on="spell_id", how="right"
    )

    return wide
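
A minimal usage sketch (table layouts inferred from the function body; the test names and values are illustrative):

import pandas as pd
from pyhbr.analysis.arc_hbr import first_index_lab_result

index_spells = pd.DataFrame(
    {"patient_id": [1], "episode_id": [100]},
    index=pd.Index([5000], name="spell_id"),
)
episodes = pd.DataFrame(
    {
        "admission": [pd.Timestamp("2023-01-01")],
        "discharge": [pd.Timestamp("2023-01-05")],
    },
    index=pd.Index([100], name="episode_id"),
)
lab_results = pd.DataFrame(
    {
        "patient_id": [1, 1],
        "test_name": ["hb", "platelets"],
        "result": [11.2, 150.0],
        "sample_date": [pd.Timestamp("2023-01-02"), pd.Timestamp("2023-01-03")],
    }
)

# One row per index spell (spell_id 5000), with columns "hb" and
# "platelets" holding the first result of each test in the spell
first_labs = first_index_lab_result(index_spells, lab_results, episodes)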

plot_index_measurement_distribution(features)

Plot a histogram of measurement results at the index

Parameters:

Name Type Description Default
features DataFrame

Must contain index_hb, index_egfr, and index_platelets. The index_hb column is multiplied by 10 to convert from g/dL to g/L.

required
required
Source code in src\pyhbr\analysis\arc_hbr.py
def plot_index_measurement_distribution(features: DataFrame):
    """Plot a histogram of measurement results at the index

    Args:
        features: Must contain `index_hb`, `index_egfr`,
            and `index_platelets`. The `index_hb` column is multiplied
            by 10 to convert from g/dL to g/L.
    """

    # Make a plot showing the three lab results as histograms
    df = features.copy()
    df["index_hb"] = 10 * df["index_hb"]  # Convert from g/dL to g/L
    df = (
        df.filter(regex="^index_(egfr|hb|platelets)")
        .rename(
            columns={
                "index_egfr": "eGFR (mL/min)",
                "index_hb": "Hb (g/L)",
                "index_platelets": "Plt (x10^9/L)",
            }
        )
        .melt(value_name="Test result at index episode", var_name="Test")
    )
    g = sns.displot(
        df,
        x="Test result at index episode",
        hue="Test",
    )
    g.figure.subplots_adjust(top=0.95)
    g.ax.set_title("Distribution of Laboratory Test Results in ACS/PCI index events")

calibration

Calibration plots

A calibration plot compares the proportion p of events that occur among the patients given a predicted probability p'. Ideally, p = p', meaning that of the cases predicted to occur with probability p', a proportion p of them do occur. Calibration is presented as a plot of p against p'.

The stability of the calibration can be investigated by plotting p against p' for multiple bootstrapped models (see stability.py).
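
As a rough sketch of the idea (not part of the library; the simulated data below is illustrative and perfectly calibrated by construction), p can be estimated in bins of predicted probability and compared with the mean p' in each bin:

import numpy as np

rng = np.random.default_rng(0)

# Simulated predicted probabilities p' and outcomes drawn with exactly
# those probabilities
p_pred = rng.uniform(0.01, 0.2, size=10_000)
events = rng.uniform(size=10_000) < p_pred

# Bin by predicted probability, then compare the mean prediction p'
# with the observed event proportion p in each bin
bin_edges = np.linspace(0, 0.2, 6)
bin_index = np.digitize(p_pred, bin_edges) - 1
for b in range(5):
    in_bin = bin_index == b
    print(f"p' = {p_pred[in_bin].mean():.3f}, p = {events[in_bin].mean():.3f}")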

draw_calibration_confidence(ax, calibration)

Draw a single model's calibration curve with confidence intervals

Parameters:

Name Type Description Default
ax Axes

The axes on which to draw the plot

required
calibration DataFrame

The model's calibration data

required
Source code in src\pyhbr\analysis\calibration.py
def draw_calibration_confidence(ax: Axes, calibration: DataFrame):
    """Draw a single model's calibration curve with confidence intervals

    Args:
        ax: The axes on which to draw the plot
        calibration: The model's calibration data
    """
    c = calibration

    make_error_boxes(ax, c)

    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())
    ax.set_ylabel("Estimated Prevalence")
    ax.set_xlabel("Model-Estimated Risks")
    ax.set_title("Accuracy of Risk Estimates")

    # Get the minimum and maximum for the x range
    min_x = 100 * (c["bin_center"]).min()
    max_x = 100 * (c["bin_center"]).max()

    # Generate a dense straight line (smooth curve on log scale)
    coords = np.linspace(min_x, max_x, num=50)

    ax.plot(coords, coords, c="k")

get_average_calibration_error(probs, y_test, n_bins)

This is the weighted average discrepancy between the predicted risk and the observed proportions on the calibration curve.

See "https://towardsdatascience.com/expected-calibration-error-ece-a-step- by-step-visual-explanation-with-python-code-c3e9aa12937d" for a good explanation.

The formula for estimated calibration error (ece) is:

ece = Sum over bins [samples_in_bin / N] * | P_observed - P_pred |,

where P_observed is the empirical proportion of positive samples in the bin, and P_pred is the predicted probability for that bin. The results are weighted by the number of samples in the bin (because some probabilities are predicted more frequently than others).

The result is interpreted as an absolute error: i.e. a value of 0.1 means that the calibration is out on average by 10%. It may be better to modify the formula to compute an average relative error.

Testing: not yet tested.
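
As a small worked example of the formula (the bin counts and proportions below are illustrative):

import numpy as np

# Three bins: sample counts, observed proportions and mean predicted risks
samples_in_bin = np.array([500, 300, 200])
p_observed = np.array([0.05, 0.12, 0.30])
p_pred = np.array([0.04, 0.15, 0.25])

N = samples_in_bin.sum()  # 1000 samples in total
ece = np.sum((samples_in_bin / N) * np.abs(p_observed - p_pred))
# = 0.5*0.01 + 0.3*0.03 + 0.2*0.05 = 0.024, i.e. the calibration is
# out by about 2.4% on average (in absolute terms)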

Source code in src\pyhbr\analysis\calibration.py
def get_average_calibration_error(probs, y_test, n_bins):
    """
    This is the weighted average discrepancy between the predicted risk and the
    observed proportions on the calibration curve.

    See "https://towardsdatascience.com/expected-calibration-error-ece-a-step-
    by-step-visual-explanation-with-python-code-c3e9aa12937d" for a good
    explanation.

    The formula for estimated calibration error (ece) is:

       ece = Sum over bins [samples_in_bin / N] * | P_observed - P_pred |,

    where P_observed is the empirical proportion of positive samples in the
    bin, and P_pred is the predicted probability for that bin. The results are
    weighted by the number of samples in the bin (because some probabilities are
    predicted more frequently than others).

    The result is interpreted as an absolute error: i.e. a value of 0.1 means
    that the calibration is out on average by 10%. It may be better to modify the
    formula to compute an average relative error.

    Testing: not yet tested.
    """

    # There is one estimated calibration error for each model (the model under
    # test and all the bootstrap models). These will be averaged at the end
    estimated_calibration_errors = []

    # The total number of samples is the number of rows in the probs array. This
    # is used with the number of samples in the bins to weight the probability
    # error
    N = probs.shape[0]

    bin_edges = np.linspace(0, 1, n_bins + 1)
    for n in range(probs.shape[1]):

        prob_true, prob_pred = calibration_curve(y_test, probs[:, n], n_bins=n_bins)

        # For each prob_pred, need to count the number of samples in that lie in
        # the bin centered at prob_pred.
        bin_width = 1 / n_bins
        count_in_bins = []
        for prob in prob_pred:
            bin_start = prob - bin_width / 2
            bin_end = prob + bin_width / 2
            count = ((bin_start <= probs[:, n]) & (probs[:, n] < bin_end)).sum()
            count_in_bins.append(count)
        count_in_bins = np.array(count_in_bins)

        error = np.sum(count_in_bins * np.abs(prob_true - prob_pred)) / N
        estimated_calibration_errors.append(error)

    return np.mean(estimated_calibration_errors)

get_calibration(probs, y_test, n_bins)

Calculate the calibration of the fitted models

Warning

This function is deprecated. Use the variable bin width calibration function instead.

Get the calibration curves for all models (whose probability predictions for the positive class are columns of probs) based on the outcomes in y_test. Rows of y_test correspond to rows of probs. The result is a list of pairs, one for each model (column of probs). Each pair contains the vector of x- and y-coordinates of the calibration curve.

Parameters:

Name Type Description Default
probs DataFrame

The dataframe of probabilities predicted by the model. The first column is the model-under-test (fitted on the training data) and the other columns are from the fits on the training data resamples.

required
y_test Series

The outcomes corresponding to the predicted probabilities.

required
n_bins int

The number of bins to group probability predictions into, for the purpose of averaging the observed frequency of outcome in the test set.

required

Returns:

Type Description
list[DataFrame]

A list of DataFrames containing the calibration curves. Each DataFrame contains the columns predicted and observed.

Source code in src\pyhbr\analysis\calibration.py
def get_calibration(probs: DataFrame, y_test: Series, n_bins: int) -> list[DataFrame]:
    """Calculate the calibration of the fitted models

    !!! warning

        This function is deprecated. Use the variable bin width calibration
        function instead.

    Get the calibration curves for all models (whose probability
    predictions for the positive class are columns of probs) based
    on the outcomes in y_test. Rows of y_test correspond to rows of
    probs. The result is a list of pairs, one for each model (column
    of probs). Each pair contains the vector of x- and y-coordinates
    of the calibration curve.

    Args:
        probs: The dataframe of probabilities predicted by the model.
            The first column is the model-under-test (fitted on the training
            data) and the other columns are from the fits on the training
            data resamples.
        y_test: The outcomes corresponding to the predicted probabilities.
        n_bins: The number of bins to group probability predictions into, for
            the purpose of averaging the observed frequency of outcome in the
            test set.

    Returns:
        A list of DataFrames containing the calibration curves. Each DataFrame
            contains the columns `predicted` and `observed`.

    """
    curves = []
    for column in probs.columns:
        prob_true, prob_pred = calibration_curve(y_test, probs[column], n_bins=n_bins)
        df = DataFrame({"predicted": prob_pred, "observed": prob_true})
        curves.append(df)
    return curves

get_prevalence(y_test)

Estimate the prevalence in a set of outcomes

To calculate model calibration, patients are grouped together into similar-risk groups. The prevalence of the outcome in each group is then compared to the predicted risk.

The true risk of the outcome within each group is not known, but it is known what outcome occurred.

One possible assumption is that the patients in each group all have the same risk, p. In this case, the outcomes from the group follow a Bernoulli distribution. The population parameter p (where the population is all patients receiving risk predictions in this group) can be estimated simply using \(\hat{p} = N_\text{outcome}/N_\text{group_size}\). Using a simple approach to calculate the confidence interval on this estimate, assuming a sample size large enough for a normally distributed estimate of the mean, gives a CI of:

\[ \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{N_\text{group_size}}} \]

(See this answer for details.)

However, the assumption of uniform risk within the model's groups of equal predicted risk may not be valid, because it assumes that the model is predicting reasonably accurate risks, and the model is the item under test.

One argument is that, if the estimated prevalence matches the risk of the group closely, then this may give evidence that the model's predicted risks are accurate -- the alternative would be that the real risks follow a different distribution, whose mean happens (coincidentally) to coincide with the predicted risk. Such a conclusion may be possible if the confidence interval for the estimated prevalence is narrow, and agrees with the predicted risk closely.

Without further assumptions, there is nothing further that can be said about the distribution of patient risks within each group. As a result, good calibration is a necessary, but not sufficient, condition for accurate risk predictions from the model.

Parameters:

Name Type Description Default
y_test Series

The (binary) outcomes in a single risk group. The values are True/False (boolean)

required

Returns:

Type Description

A map containing the key "prevalence", for the estimated mean of the Bernoulli distribution, and "lower" and "upper" for the estimated confidence interval, assuming all patients in the risk group are drawn from a single Bernoulli distribution. The "variance" is the estimate of the sample variance of the estimated prevalence, and can be used to form an average of the accuracy uncertainties in each bin.

Note that the assumption of a Bernoulli distribution is not necessarily accurate.

Source code in src\pyhbr\analysis\calibration.py
def get_prevalence(y_test: Series):
    """Estimate the prevalence in a set of outcomes

    To calculate model calibration, patients are grouped
    together into similar-risk groups. The prevalence of
    the outcome in each group is then compared to the
    predicted risk.

    The true risk of the outcome within each group is not
    known, but it is known what outcome occurred.

    One possible assumption is that the patients in each
    group all have the same risk, p. In this case, the
    outcomes from the group follow a Bernoulli
    distribution. The population parameter p (where the
    population is all patients receiving risk predictions
    in this group) can be estimated simply using
    $\hat{p} = N_\\text{outcome}/N_\\text{group_size}$.
    Using a simple approach to calculate the confidence
    interval on this estimate, assuming a sample size large
    enough for a normally distributed estimate of the
    mean, gives a CI of:

    $$
    \hat{p} \pm 1.96\sqrt{\\frac{\hat{p}(1-\hat{p})}{N_\\text{group_size}}}
    $$

    (See [this answer](https://stats.stackexchange.com/a/156807)
    for details.)

    However, the assumption of uniform risk within the
    model's groups of equal predicted risk may not be valid,
    because it assumes that the model is predicting
    reasonably accurate risks, and the model is the item
    under test.

    One argument is that, if the estimated prevalence matches
    the risk of the group closely, then this may give evidence
    that the model's predicted risks are accurate -- the alternative
    would be that the real risks follow a different distribution, whose
    mean happens (coincidentally) to coincide with the predicted
    risk. Such a conclusion may be possible if the confidence
    interval for the estimated prevalence is narrow, and agrees
    with the predicted risk closely.

    Without further assumptions, there is nothing further that
    can be said about the distribution of patient risks within
    each group. As a result, good calibration is a necessary,
    but not sufficient, condition for accurate risk
    predictions from the model.

    Args:
        y_test: The (binary) outcomes in a single risk group.
            The values are True/False (boolean)

    Returns:
        A map containing the key "prevalence", for the estimated
            mean of the Bernoulli distribution, and "lower"
            and "upper" for the estimated confidence interval,
            assuming all patients in the risk group are drawn
            from a single Bernoulli distribution. The "variance"
            is the estimate of the sample variance of the estimated
            prevalence, and can be used to form an average of
            the accuracy uncertainties in each bin.

            Note that the assumption of a Bernoulli distribution
            is not necessarily accurate.
    """
    n_group_size = len(y_test)
    p_hat = np.mean(y_test)
    variance = (p_hat * (1 - p_hat)) / n_group_size # square of standard error of Bernoulli
    half_width = 1.96 * np.sqrt(variance) # Estimate of 95% confidence interval
    return {
        "prevalence": p_hat,
        "lower": p_hat - half_width,
        "upper": p_hat + half_width,
        "variance": variance
    }
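
A small worked example (illustrative outcome data):

import pandas as pd
from pyhbr.analysis.calibration import get_prevalence

# 100 patients in one risk group, 10 of whom had the outcome
y_group = pd.Series([True] * 10 + [False] * 90)

ci = get_prevalence(y_group)
# prevalence = 0.1, variance = 0.1 * 0.9 / 100 = 0.0009, and the
# half-width is 1.96 * sqrt(0.0009), roughly 0.059, so the CI is
# approximately (0.041, 0.159)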

get_variable_width_calibration(probs, y_test, n_bins)

Get variable-bin-width calibration curves

Model predictions are arranged in ascending order, and then risk ranges are selected so that an equal number of predictions falls in each group. This means bin widths will be more granular at points where many patients are predicted the same risk. The risk bins are shown on the x-axis of calibration plots.

In each bin, the proportion of patients with an event is calculated. This value, which is a function of each bin, is plotted on the y-axis of the calibration plot, and is a measure of the prevalence of the outcome in each bin. In a well calibrated model, this prevalence should match the mean risk prediction in the bin (the bin center).

Note that a well-calibrated model is not a sufficient condition for correctness of risk predictions. One way that the prevalence of the bin can match the bin risk is for all true risks to roughly match the bin risk P. However, other ways are possible: for example, a proportion P of patients in the bin could have 100% risk, and the others have zero risk.

Parameters:

Name Type Description Default
probs DataFrame

Each column is the predictions from one of the resampled models. The first column corresponds to the model-under-test.

required
y_test Series

Contains the observed outcomes.

required
n_bins int

The number of (variable-width) bins to include.

required

Returns:

Type Description
list[DataFrame]

A list of dataframes, one for each calibration curve. The "bin_center" column contains the bin center; the "bin_half_width" column contains the half-width of each equal-risk group. The "est_prev" column contains the estimated prevalence of events in that bin; and "est_prev_err" contains the half-width of the 95% confidence interval (symmetrical above and below est_prev).

Source code in src\pyhbr\analysis\calibration.py
def get_variable_width_calibration(
    probs: DataFrame, y_test: Series, n_bins: int
) -> list[DataFrame]:
    """Get variable-bin-width calibration curves

    Model predictions are arranged in ascending order, and then risk ranges
    are selected so that an equal number of predictions falls in each group.
    This means bin widths will be more granular at points where many patients
    are predicted the same risk. The risk bins are shown on the x-axis of
    calibration plots.

    In each bin, the proportion of patients with an event is calculated. This
    value, which is a function of each bin, is plotted on the y-axis of the
    calibration plot, and is a measure of the prevalence of the outcome in
    each bin. In a well calibrated model, this prevalence should match the
    mean risk prediction in the bin (the bin center).

    Note that a well-calibrated model is not a sufficient condition for
    correctness of risk predictions. One way that the prevalence of the
    bin can match the bin risk is for all true risks to roughly match
    the bin risk P. However, other ways are possible, for example, a
    proportion P of patients in the bin could have 100% risk, and the
    others have zero risk.


    Args:
        probs: Each column is the predictions from one of the resampled
            models. The first column corresponds to the model-under-test.
        y_test: Contains the observed outcomes.
        n_bins: The number of (variable-width) bins to include.

    Returns:
        A list of dataframes, one for each calibration curve. The
            "bin_center" column contains the bin center;
            the "bin_half_width" column contains the half-width
            of each equal-risk group. The "est_prev" column contains
            the estimated prevalence of events in that bin;
            and the "est_prev_err" contains the half-width of the 95%
            confidence interval (symmetrical above and below est_prev).
    """

    # Make the list that will contain the output calibration information
    calibration_dfs = []

    n_cols = probs.shape[1]
    for n in range(n_cols):

        # Get the probabilities predicted by one of the resampled
        # models (stored as a column in probs)
        col = probs.iloc[:, n].sort_values()

        # Bin the predictions into variable-width risk
        # ranges with equal numbers in each bin (n_bins groups)
        samples_per_bin = int(np.ceil(len(col) / n_bins))
        bins = []
        for start in range(0, len(col), samples_per_bin):
            end = start + samples_per_bin
            bins.append(col[start:end])

        # Get the bin centres and bin widths
        bin_center = []
        bin_half_width = []
        for b in bins:
            upper = b.max()
            lower = b.min()
            bin_center.append((upper + lower) / 2)
            bin_half_width.append((upper - lower) / 2)

        # Get the event prevalence in the bin
        # Get the confidence intervals for each bin
        est_prev = []
        est_prev_err = []
        est_prev_variance = []
        actual_samples_per_bin = []
        num_events = []
        for b in bins:

            # Get the outcomes corresponding to the current
            # bin (group of equal predicted risk)
            equal_risk_group = y_test.loc[b.index]

            actual_samples_per_bin.append(len(b))
            num_events.append(equal_risk_group.sum())

            prevalence_ci = get_prevalence(equal_risk_group)
            est_prev_err.append((prevalence_ci["upper"] - prevalence_ci["lower"]) / 2)
            est_prev.append(prevalence_ci["prevalence"])
            est_prev_variance.append(prevalence_ci["variance"])

        # Add the data to the calibration list
        df = DataFrame(
            {
                "bin_center": bin_center,
                "bin_half_width": bin_half_width,
                "est_prev": est_prev,
                "est_prev_err": est_prev_err,
                "est_prev_variance": est_prev_variance,
                "samples_per_bin": actual_samples_per_bin,
                "num_events": num_events,
            }
        )
        calibration_dfs.append(df)

    return calibration_dfs
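
A minimal usage sketch (randomly generated predictions and outcomes, for illustration only):

import numpy as np
import pandas as pd
from pyhbr.analysis.calibration import get_variable_width_calibration

rng = np.random.default_rng(1)

# Columns are the model-under-test followed by two bootstrap resamples
probs = pd.DataFrame(
    {
        "model_under_test": rng.uniform(0.01, 0.2, 1000),
        "resample_1": rng.uniform(0.01, 0.2, 1000),
        "resample_2": rng.uniform(0.01, 0.2, 1000),
    }
)
y_test = pd.Series(rng.uniform(size=1000) < 0.1)

# One calibration DataFrame per column of probs, each with n_bins rows
calibrations = get_variable_width_calibration(probs, y_test, n_bins=5)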

make_error_boxes(ax, calibration)

Plot error boxes and error bars around points

Parameters:

Name Type Description Default
ax Axes

The axis on which to plot the error boxes.

required
calibration DataFrame

Dataframe containing one row per bin, showing how the predicted risk compares to the estimated prevalence.

required
Source code in src\pyhbr\analysis\calibration.py
def make_error_boxes(ax: Axes, calibration: DataFrame):
    """Plot error boxes and error bars around points

    Args:
        ax: The axis on which to plot the error boxes.
        calibration: Dataframe containing one row per
            bin, showing how the predicted risk compares
            to the estimated prevalence.
    """

    alpha = 0.3

    c = calibration
    for n in range(len(c)):
        num_events = c.loc[n, "num_events"]
        samples_in_bin = c.loc[n, "samples_per_bin"]

        est_prev = 100 * c.loc[n, "est_prev"]
        est_prev_err = 100 * c.loc[n, "est_prev_err"]
        risk = 100 * c.loc[n, "bin_center"]
        bin_half_width = 100 * c.loc[n, "bin_half_width"]

        margin = 1.0
        x = risk - margin * bin_half_width
        y = est_prev - margin * est_prev_err
        width = 2 * margin * bin_half_width
        height = 2 * margin * est_prev_err

        rect = Rectangle(
            (x, y), width, height,
            label=f"Risk {risk:.2f}%, {num_events}/{samples_in_bin} events",
            alpha=alpha,
            facecolor=cm.jet(n/len(c))
        )
        ax.add_patch(rect)

    ax.errorbar(
        x=100 * c["bin_center"],
        y=100 * c["est_prev"],
        xerr=100 * c["bin_half_width"],
        yerr=100 * c["est_prev_err"],
        fmt="None",
    )

    ax.legend()

plot_calibration_curves(ax, curves, title='Stability of Calibration')

Plot calibration curves for the model under test and resampled models

Parameters:

Name Type Description Default
ax Axes

The axes on which to plot the calibration curves

required
curves list[DataFrame]

A list of DataFrames containing the calibration curve data

required
title

Title to add to the plot.

'Stability of Calibration'
Source code in src\pyhbr\analysis\calibration.py
def plot_calibration_curves(
    ax: Axes,
    curves: list[DataFrame],
    title="Stability of Calibration",
):
    """Plot calibration curves for the model under test and resampled models

    Args:
        ax: The axes on which to plot the calibration curves
        curves: A list of DataFrames containing the calibration curve data
        title: Title to add to the plot.
    """
    mut_curve = curves[0]  # model-under-test
    ax.plot(
        100 * mut_curve["bin_center"],
        100 * mut_curve["est_prev"],
        label="Model-under-test",
        c="r",
    )
    for curve in curves[1:]:
        ax.plot(
            100*curve["bin_center"],
            100*curve["est_prev"],
            label="Resample",
            c="b",
            linewidth=0.3,
            alpha=0.4,
        )

    # Get the minimum and maximum for the x range
    min_x = 100 * (curves[0]["bin_center"]).min()
    max_x = 100 * (curves[0]["bin_center"]).max()

    # Generate a dense straight line (smooth curve on log scale)
    coords = np.linspace(min_x, max_x, num=50)
    ax.plot(coords, coords, c="k")

    ax.legend(["Model-under-test", "Bootstrapped models"])

    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())
    ax.set_ylabel("Estimated Prevalence")
    ax.set_xlabel("Model-Estimated Risks")
    ax.set_title(title)

plot_prediction_distribution(ax, probs, n_bins)

Plot the distribution of predicted probabilities over the models as a bar chart, with error bars showing the standard deviation of each bar height across the models. All model predictions (columns of probs) are given equal weight in the average; column 0 (the model under test) is not singled out in any way.

The function plots vertical error bars that are one standard deviation up and down (so 2*sd in total)

Source code in src\pyhbr\analysis\calibration.py
def plot_prediction_distribution(ax, probs, n_bins):
    """
    Plot the distribution of predicted probabilities over the models as
    a bar chart, with error bars showing the standard deviation of each
    bar height across the models. All model predictions (columns of probs) are given equal
    weight in the average; column 0 (the model under test) is not singled
    out in any way.

    The function plots vertical error bars that are one standard deviation
    up and down (so 2*sd in total)
    """
    bin_edges = np.linspace(0, 1, n_bins + 1)
    freqs = []
    for j in range(probs.shape[1]):
        f, _ = np.histogram(probs[:, j], bins=bin_edges)
        freqs.append(f)
    means = np.mean(freqs, axis=0)
    sds = np.std(freqs, axis=0)

    bin_centers = (bin_edges[1:] + bin_edges[:-1]) / 2

    # Compute the bin width to leave a gap between bars
    # of 20%
    bin_width = 0.80 / n_bins

    ax.bar(bin_centers, height=means, width=bin_width, yerr=2 * sds)
    # ax.set_title("Distribution of predicted probabilities")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Count")

describe

column_prop(bool_col)

Return a string with the number of non-zero items in the columns and a percentage

Source code in src\pyhbr\analysis\describe.py
def column_prop(bool_col):
    """Return a string with the number of non-zero items in the columns
    and a percentage
    """
    count = bool_col.sum()
    percent = 100 * count / len(bool_col)
    return f"{count} ({percent:.2f}%)"

get_column_rates(data)

Get the proportion of rows in each column that are non-zero

Either pass the full table, or subset it based on a condition to get the rates for that subset.

Parameters:

Name Type Description Default
data DataFrame

A table containing columns where the proportion of non-zero rows should be calculated.

required

Returns:

Type Description
Series

A Series (single column) with one row per column in the original data, containing the rate of non-zero items in each column. The Series is indexed by the names of the columns, with "_rate" appended.

Source code in src\pyhbr\analysis\describe.py
def get_column_rates(data: DataFrame) -> Series:
    """Get the proportion of rows in each column that are non-zero

    Either pass the full table, or subset it based
    on a condition to get the rates for that subset.

    Args:
        data: A table containing columns where the proportion
            of non-zero rows should be calculated.

    Returns:
        A Series (single column) with one row per column in the
            original data, containing the rate of non-zero items
            in each column. The Series is indexed by the names of
            the columns, with "_rate" appended.
    """
    return Series(
        {name + "_rate": proportion_nonzero(col) for name, col in data.items()}
    ).sort_values()

get_outcome_prevalence(outcomes)

Get the prevalence of each outcome as a percentage.

This function takes the outcomes dataframe used to define the y vector of the training/testing set and calculates the prevalence of each outcome in a form suitable for inclusion in a report.

Parameters:

Name Type Description Default
outcomes DataFrame

A dataframe with the columns "fatal_{outcome}", "non_fatal_{outcome}", and "{outcome}" (for the total), where {outcome} is "bleeding" or "ischaemia". Each row is an index spell, and the elements in the table are boolean (whether or not the outcome occurred).

required

Returns:

Type Description
DataFrame

A table with the prevalence of each outcome, and a multi-index containing the "Outcome" ("Bleeding" or "Ischaemia"), and the outcome "Type" (fatal, total, etc.)

Source code in src\pyhbr\analysis\describe.py
def get_outcome_prevalence(outcomes: DataFrame) -> DataFrame:
    """Get the prevalence of each outcome as a percentage.

    This function takes the outcomes dataframe used to define
    the y vector of the training/testing set and calculates the
    prevalence of each outcome in a form suitable for inclusion
    in a report.

    Args:
        outcomes: A dataframe with the columns "fatal_{outcome}",
            "non_fatal_{outcome}", and "{outcome}" (for the total),
            where {outcome} is "bleeding" or "ischaemia". Each row
            is an index spell, and the elements in the table are
            boolean (whether or not the outcome occurred).

    Returns:
        A table with the prevalence of each outcome, and a multi-index
            containing the "Outcome" ("Bleeding" or "Ischaemia"), and
            the outcome "Type" (fatal, total, etc.)
    """
    df = (
        100
        * outcomes.rename(
            columns={
                "bleeding": "Bleeding.Total",
                "non_fatal_bleeding": "Bleeding.Non-Fatal (BARC 2-4)",
                "fatal_bleeding": "Bleeding.Fatal (BARC 5)",
                "ischaemia": "Ischaemia.Total",
                "non_fatal_ischaemia": "Ischaemia.Non-Fatal (MI/Stroke)",
                "fatal_ischaemia": "Ischaemia.Fatal (CV Death)",
            }
        )
        .melt(value_name="Prevalence (%)")
        .groupby("variable")
        .sum()
        / len(outcomes)
    )
    df = df.reset_index()
    df[["Outcome", "Type"]] = df["variable"].str.split(".", expand=True)
    return df.set_index(["Outcome", "Type"])[["Prevalence (%)"]].apply(
        lambda x: round(x, 2)
    )
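
A small usage sketch (illustrative outcomes for four index spells):

import pandas as pd
from pyhbr.analysis.describe import get_outcome_prevalence

outcomes = pd.DataFrame(
    {
        "fatal_bleeding": [False, False, False, False],
        "non_fatal_bleeding": [True, False, False, False],
        "bleeding": [True, False, False, False],
        "fatal_ischaemia": [False, False, True, False],
        "non_fatal_ischaemia": [False, True, False, False],
        "ischaemia": [False, True, True, False],
    }
)

# Indexed by ("Outcome", "Type"); e.g. the total bleeding prevalence
# here is 25.0%
prevalence = get_outcome_prevalence(outcomes)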

get_summary_table(models, high_risk_thresholds, config)

Get a table of model metric comparison across different models

Parameters:

Name Type Description Default
models dict[str, Any]

A map from model names to model data (containing the key "fit_results")

required
high_risk_thresholds dict[str, float]

A dictionary containing the keys "bleeding" and "ischaemia" mapped to the thresholds used to determine whether a patient is at high risk from the models.

required
config dict[str, Any]

The config file used as input to the results and report generator scripts. It must contain the keys "outcomes" and "models", which are dictionaries containing the outcome or model name and a sub-key "abbr" which contains a short name of the outcome/model.

required
Source code in src\pyhbr\analysis\describe.py
def get_summary_table(
    models: dict[str, Any],
    high_risk_thresholds: dict[str, float],
    config: dict[str, Any],
):
    """Get a table of model metric comparison across different models

    Args:
        models: A map from model names to model data (containing the
            key "fit_results")
        high_risk_thresholds: A dictionary containing the keys
            "bleeding" and "ischaemia" mapped to the thresholds
            used to determine whether a patient is at high risk
            from the models.
        config: The config file used as input to the results and
            report generator scripts. It must contain the keys
            "outcomes" and "models", which are dictionaries
            containing the outcome or model name and a sub-key
            "abbr" which contains a short name of the outcome/model.
    """
    model_names = []
    instabilities = []
    aucs = []
    risk_accuracy = []
    low_risk_reclass = []
    high_risk_reclass = []
    model_key = []  # For identifying the model later
    outcome_key = []  # For identifying the outcome later
    median_auc = []  # Numerical AUC for finding the best model

    for model, model_data in models.items():
        for outcome in ["bleeding", "ischaemia"]:

            model_key.append(model)
            outcome_key.append(outcome)

            fit_results = model_data["fit_results"]

            # Abbreviated model name
            model_abbr = config["models"][model]["abbr"]
            outcome_abbr = config["outcomes"][outcome]["abbr"]
            model_names.append(f"{model_abbr}-{outcome_abbr}")

            probs = fit_results["probs"]

            # Get the summary instabilities
            instability = stability.average_absolute_instability(probs[outcome])
            instabilities.append(common.median_to_string(instability))

            # Get the summary calibration accuracies
            calibrations = fit_results["calibrations"][outcome]

            # Join together all the calibration data for the primary model
            # and all the bootstrap models, to compare the bin center positions
            # with the estimated prevalence for all bins.
            all_calibrations = pd.concat(calibrations)

            # Average relative error where prevalence is non-zero
            accuracy_mean = 0
            accuracy_variance = 0
            count = 0
            for n in range(len(all_calibrations)):
                if all_calibrations["est_prev"].iloc[n] > 0:

                    # This assumes that all risk predictions in the bin are at the bin center, with no
                    # distribution (i.e. the result is normal with a distribution based on the sample
                    # mean of the prevalence. For more accuracy, consider using the empirical distribution
                    # of the risk predictions in the bin as the basis for this calculation.
                    accuracy_mean += np.abs(
                        all_calibrations["bin_center"].iloc[n]
                        - all_calibrations["est_prev"].iloc[n]
                    )

                    # When adding normal distributions together, the variances sum.
                    accuracy_variance += all_calibrations["est_prev_variance"].iloc[n]

                    count += 1
            accuracy_mean /= count
            accuracy_variance /= count

            # Calculate a 95% confidence interval for the resulting mean of the accuracies,
            # assuming all the distributions are normal.
            ci_upper = accuracy_mean + 1.96 * np.sqrt(accuracy_variance)
            ci_lower = accuracy_mean - 1.96 * np.sqrt(accuracy_variance)
            risk_accuracy.append(
                f"{100*accuracy_mean:.2f}%, CI [{100*ci_lower:.2f}%, {100*ci_upper:.2f}%]"
            )

            threshold = high_risk_thresholds[outcome]
            y_test = model_data["y_test"][outcome]
            df = stability.get_reclass_probabilities(probs[outcome], y_test, threshold)
            high_risk = (df["original_risk"] >= threshold).sum()
            high_risk_and_unstable = (
                (df["original_risk"] >= threshold) & (df["unstable_prob"] >= 0.5)
            ).sum()
            high_risk_reclass.append(f"{100 * high_risk_and_unstable / high_risk:.2f}%")
            low_risk = (df["original_risk"] < threshold).sum()
            low_risk_and_unstable = (
                (df["original_risk"] < threshold) & (df["unstable_prob"] >= 0.5)
            ).sum()
            low_risk_reclass.append(f"{100 * low_risk_and_unstable / low_risk:.2f}%")

            # Get the summary ROC AUCs
            auc_data = fit_results["roc_aucs"][outcome]
            auc_spread = Series(
                auc_data.resample_auc + [auc_data.model_under_test_auc]
            ).quantile([0.025, 0.5, 0.975])
            aucs.append(common.median_to_string(auc_spread, unit=""))
            median_auc.append(auc_spread[0.5])

    return DataFrame(
        {
            "Model": model_names,
            "Spread of Instability": instabilities,
            "H→L": high_risk_reclass,
            "L→H": low_risk_reclass,
            "Estimated Risk Uncertainty": risk_accuracy,
            "ROC AUC": aucs,
            "model_key": model_key,
            "outcome_key": outcome_key,
            "median_auc": median_auc,
        }
    ).set_index("Model", drop=True)

nearly_constant(data, threshold)

Check which columns of the input table have low variation

A column is considered low variance if the proportion of rows containing NA or the most common non-NA value exceeds threshold. For example, if NA and one other value together comprise 99% of the column, then it is considered to be low variance based on a threshold of 0.9.

Parameters:

Name Type Description Default
data DataFrame

The table to check for zero variance

required
threshold float

The proportion of the column made up of NA and the most common value, above which the column is considered low variance.

required

Returns:

Type Description
Series

A Series containing bool, indexed by the column name in the original data, containing whether the column has low variance.

Source code in src\pyhbr\analysis\describe.py
def nearly_constant(data: DataFrame, threshold: float) -> Series:
    """Check which columns of the input table have low variation

    A column is considered low variance if the proportion of rows
    containing NA or the most common non-NA value exceeds threshold.
    For example, if NA and one other value together comprise 99% of
    the column, then it is considered to be low variance based on
    a threshold of 0.9.

    Args:
        data: The table to check for zero variance
        threshold: The proportion of the column made up of NA and
            the most common value, above which the column is considered
            low variance.

    Returns:
        A Series containing bool, indexed by the column name
            in the original data, containing whether the column
            has low variance.
    """

    def low_variance(column: Series) -> bool:

        if len(column) == 0:
            # If the column has length zero, consider
            # it low variance
            return True

        if len(column.dropna()) == 0:
            # If the column is all-NA, it is low variance
            # independently of the threshold
            return True

        # Else, if the proportion of NA and the most common
        # non-NA value is higher than threshold, the column
        # is low variance
        na_count = column.isna().sum()
        counts = column.value_counts()
        most_common_value_count = counts.iloc[0]
        if (na_count + most_common_value_count) / len(column) > threshold:
            return True

        return False

    return data.apply(low_variance).rename("nearly_constant")
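
A small usage sketch (illustrative data):

import pandas as pd
from pyhbr.analysis.describe import nearly_constant

df = pd.DataFrame(
    {
        "mostly_zero": [0] * 98 + [1, 2],            # 98% a single value
        "varied": list(range(100)),                  # all values distinct
        "mostly_na": [None] * 95 + [1, 2, 3, 4, 5],  # 95% missing
    }
)

# Boolean Series named "nearly_constant", indexed by column name:
# mostly_zero -> True, varied -> False, mostly_na -> True
flags = nearly_constant(df, threshold=0.9)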

plot_arc_hbr_survival(ax, data)

Plot survival curves for bleeding by ARC HBR score.

Parameters:

Name Type Description Default
ax

List of two axes objects

required
data

A loaded data file

required
Source code in src\pyhbr\analysis\describe.py
def plot_arc_hbr_survival(ax, data):
    """Plot survival curves for bleeding by ARC HBR score.

    Args:
        ax: List of two axes objects
        data: A loaded data file
    """

    # Get bleeding survival data
    survival = data["bleeding_survival"]
    features_index = data["features_index"]
    arc_hbr_score = data["arc_hbr_score"]
    arc_hbr_score["score"] = pd.cut(
        data["arc_hbr_score"]["total_score"],
        [0, 1, 2, 100],
        labels=["Score = 0", "0 < Score <= 1", "Score > 1"],
        right=False,
    )

    def masked_survival(survival, mask):
        masked_survival = survival[mask]
        status = ~masked_survival["right_censor"]
        survival_in_days = masked_survival["time_to_event"].dt.days
        return kaplan_meier_estimator(status, survival_in_days, conf_type="log-log")

    def add_arc_survival(ax, arc_mask, label, color):
        time, survival_prob, conf_int = masked_survival(survival, arc_mask)

        ax[0].step(time, survival_prob, where="post", color=color)
        ax[0].fill_between(
            time,
            conf_int[0],
            conf_int[1],
            alpha=0.25,
            step="post",
            label=label,
            color=color,
        )

    df = (
        features_index[["therapy"]]
        .fillna("Missing")
        .merge(arc_hbr_score[["score"]], how="left", on="spell_id")
        .groupby(["therapy", "score"], as_index=False)
        .size()
    )
    df["score_sum"] = df.groupby("score")["size"].transform(sum)
    df["percent"] = 100 * df["size"] / df["score_sum"]
    print(df.sort_values(["score", "therapy"]))
    df = df.rename(
        columns={"therapy": "Therapy", "percent": "Percent", "score": "Score"}
    ).drop(columns=["size", "score_sum"])
    print(df)

    # Set custom order of therapy (least to most aggressive)
    df["Therapy"] = pd.Categorical(df["Therapy"], ["Single", "DAPT-AC", "DAPT-AP", "DAPT-AT", "Triple", "Missing"])
    df = df.sort_values("Therapy")

    # Plot the distribution of therapies
    sns.barplot(
        data=df,
        x="Therapy",
        y="Percent",
        hue="Score",
        ax=ax[1],
        palette={
            "Score = 0": "tab:green",
            "0 < Score <= 1": "tab:orange",
            "Score > 1": "tab:red",
        },
    )

    # Plot survival curves by ARC score
    arc_mask = arc_hbr_score["score"] == "Score = 0"
    add_arc_survival(ax, arc_mask, "Score = 0", "tab:green")
    arc_mask = arc_hbr_score["score"] == "0 < Score <= 1"
    add_arc_survival(ax, arc_mask, "0 < Score <= 1", "tab:orange")
    arc_mask = arc_hbr_score["score"] == "Score > 1"
    add_arc_survival(ax, arc_mask, "Score > 1", "tab:red")

    ax[0].set_ylim(0.90, 1.00)
    ax[0].set_ylabel(r"Est. probability of no adverse event")
    ax[0].set_xlabel("Time (days since index ACS admission)")
    ax[0].set_title("Bleeding outcome survival curves by ARC HBR score")
    ax[0].legend()

plot_clinical_code_distribution(ax, data, config)

Plot histograms of the distribution of bleeding/ischaemia codes

Parameters:

Name Type Description Default
ax

A list of two axes objects

required
data

A loaded data file

required
config

The analysis config (from yaml)

required
Source code in src\pyhbr\analysis\describe.py
def plot_clinical_code_distribution(ax, data, config):
    """Plot histograms of the distribution of bleeding/ischaemia codes

    Args:
        ax: A list of two axes objects
        data: A loaded data file
        config: The analysis config (from yaml)
    """

    bleeding_group = config["outcomes"]["bleeding"]["non_fatal"]["group"]
    ischaemia_group = config["outcomes"]["ischaemia"]["non_fatal"]["group"]

    # Set the quantile level to find a cut-off that includes most codes
    level = 0.95

    codes = data["codes"]
    bleeding_codes = codes[codes["group"].eq(bleeding_group)]["position"]
    bleeding_codes.hist(ax=ax[0], rwidth=0.9)
    ax[0].set_title("Bleeding Codes")
    ax[0].set_xlabel("Code position (1 is primary, > 1 is secondary)")
    ax[0].set_ylabel("Total Code Count")

    q = bleeding_codes.quantile(level)
    ax[0].axvline(q)
    ax[0].text(
        q + 0.5,
        0.5,
        f"{100*level:.0f}% quantile",
        rotation=90,
        transform=ax[0].get_xaxis_transform(),
    )

    ischaemia_codes = codes[codes["group"].eq(ischaemia_group)]["position"]
    ischaemia_codes.hist(ax=ax[1], rwidth=0.9)
    ax[1].set_title("Ischaemia Codes")
    ax[1].set_xlabel("Code position")
    ax[1].set_ylabel("Total Code Count")

    q = ischaemia_codes.quantile(level)
    ax[1].axvline(q)
    ax[1].text(
        q + 0.5,
        0.5,
        f"{100*level:.0f}% quantile",
        rotation=90,
        transform=ax[1].get_xaxis_transform(),
    )

    plt.suptitle("Distribution of Bleeding/Ischaemia ICD-10 Primary/Secondary Codes")
    plt.tight_layout()

plot_survival_curves(ax, data, config)

Plot survival curves for bleeding/ischaemia broken down by age

Parameters:

Name Type Description Default
ax

A list of two axes objects

required
data

A loaded data file

required
config

The analysis config (from yaml)

required
Source code in src\pyhbr\analysis\describe.py
def plot_survival_curves(ax, data, config):
    """Plot survival curves for bleeding/ischaemia broken down by age

    Args:
        ax: A list of two axes objects
        data: A loaded data file
        config: The analysis config (from yaml)
    """

    # Mask the dataset by age to get different survival plots
    features_index = data["features_index"]
    print(features_index)
    age_over_75 = features_index["age"] > 75

    # Get bleeding survival data
    survival = data["bleeding_survival"].merge(
        features_index, on="spell_id", how="left"
    )

    # Calculate survival curves for bleeding (over 75)
    masked = survival[survival["age"] >= 75]
    status = ~masked["right_censor"]
    survival_in_days = masked["time_to_event"].dt.days
    time, survival_prob, conf_int = kaplan_meier_estimator(
        status, survival_in_days, conf_type="log-log"
    )
    ax[0].step(time, survival_prob, where="post", label="Age >= 75")
    ax[0].fill_between(time, conf_int[0], conf_int[1], alpha=0.25, step="post")

    # Now for under 75
    masked = survival[survival["age"] < 75]
    status = ~masked["right_censor"]
    survival_in_days = masked["time_to_event"].dt.days
    time, survival_prob, conf_int = kaplan_meier_estimator(
        status, survival_in_days, conf_type="log-log"
    )
    ax[0].step(time, survival_prob, where="post", label="Age < 75")
    ax[0].fill_between(time, conf_int[0], conf_int[1], alpha=0.25, step="post")

    ax[0].set_ylim(0.75, 1.00)
    ax[0].set_ylabel(r"Est. probability of no adverse event")
    ax[0].set_xlabel("Time (days)")
    ax[0].set_title("Bleeding Outcome")
    ax[0].legend()

    # Get ischaemia survival data
    survival = data["ischaemia_survival"].merge(
        features_index, on="spell_id", how="left"
    )

    # Calculate survival curves for ischaemia (over 75)
    masked = survival[survival["age"] >= 75]
    status = ~masked["right_censor"]
    survival_in_days = masked["time_to_event"].dt.days
    time, survival_prob, conf_int = kaplan_meier_estimator(
        status, survival_in_days, conf_type="log-log"
    )
    ax[1].step(time, survival_prob, where="post", label="Age >= 75")
    ax[1].fill_between(time, conf_int[0], conf_int[1], alpha=0.25, step="post")

    # Now for under 75
    masked = survival[survival["age"] < 75]
    status = ~masked["right_censor"]
    survival_in_days = masked["time_to_event"].dt.days
    time, survival_prob, conf_int = kaplan_meier_estimator(
        status, survival_in_days, conf_type="log-log"
    )
    ax[1].step(time, survival_prob, where="post", label="Age < 75")
    ax[1].fill_between(time, conf_int[0], conf_int[1], alpha=0.25, step="post")

    ax[1].set_ylim(0.75, 1.00)
    ax[1].set_ylabel(r"Est. probability of no adverse event")
    ax[1].set_xlabel("Time (days)")
    ax[1].set_title("Ischaemia Outcome")
    ax[1].legend()

    plt.tight_layout()

proportion_missingness(data)

Get the proportion of missing values in each column

Parameters:

Name Type Description Default
data DataFrame

A table where missingness should be calculated for each column

required

Returns:

Type Description
Series

The proportion of missing values in each column, indexed by the original table column name. The values are sorted in order of increasing missingness

Source code in src\pyhbr\analysis\describe.py
def proportion_missingness(data: DataFrame) -> Series:
    """Get the proportion of missing values in each column

    Args:
        data: A table where missingness should be calculated
            for each column

    Returns:
        The proportion of missing values in each column, indexed
            by the original table column name. The values are sorted
            in order of increasing missingness
    """
    return (data.isna().sum() / len(data)).sort_values().rename("missingness")
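
A small usage example, with the import path assumed from the source location above; the printed values apply only to this toy frame:

import numpy as np
from pandas import DataFrame

from pyhbr.analysis.describe import proportion_missingness

df = DataFrame(
    {
        "age": [71.0, np.nan, 64.0, np.nan],  # 2/4 missing
        "hb": [12.1, 13.5, np.nan, 14.0],     # 1/4 missing
    }
)
print(proportion_missingness(df))
# hb     0.25
# age    0.50
# Name: missingness, dtype: float64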

proportion_nonzero(column)

Get the proportion of non-zero values in a column

Source code in src\pyhbr\analysis\describe.py
def proportion_nonzero(column: Series) -> float:
    """Get the proportion of non-zero values in a column"""
    return (column > 0).sum() / len(column)

pvalue_chi2_high_risk_vs_outcome(probs, y_test, high_risk_threshold)

Perform a Chi-2 hypothesis test on the contingency between estimated high risk and outcome

Get the p-value from the hypothesis test that there is no association between the estimated high-risk category and the outcome. The p-value is interpreted as the probability of obtaining the outcomes corresponding to the model's estimated high-risk category under the assumption that there is no association between the two.

Parameters:

Name Type Description Default
probs DataFrame

The model-estimated probabilities (first column is used)

required
y_test Series

Whether the outcome occurred

required
high_risk_threshold float

The cut-off risk (probability) defining an estimate to be high risk.

required

Returns:

Type Description
float

The p-value for the hypothesis test.

Source code in src\pyhbr\analysis\describe.py
def pvalue_chi2_high_risk_vs_outcome(
    probs: DataFrame, y_test: Series, high_risk_threshold: float
) -> float:
    """Perform a Chi-2 hypothesis test on the contingency between estimated high risk and outcome

    Get the p-value from the hypothesis test that there is no association
    between the estimated high-risk category and the outcome. The p-value
    is interpreted as the probability of obtaining the outcomes
    corresponding to the model's estimated high-risk category under the
    assumption that there is no association between the two.

    Args:
        probs: The model-estimated probabilities (first column is used)
        y_test: Whether the outcome occurred
        high_risk_threshold: The cut-off risk (probability) defining an
            estimate to be high risk.

    Returns:
        The p-value for the hypothesis test.
    """

    # Get the cases (True) where the model estimated a risk
    # that puts the patient in the high risk category
    estimated_high_risk = (probs.iloc[:, 0] > high_risk_threshold).rename(
        "estimated_high_risk"
    )

    # Get the instances (True) in the test set where the outcome
    # occurred
    outcome_occurred = y_test.rename("outcome_occurred")

    # Create a contingency table of the estimated high risk
    # vs. whether the outcome occurred.
    table = pd.crosstab(estimated_high_risk, outcome_occurred)

    # Hypothesis test whether the estimated high risk category
    # is related to the outcome (null hypothesis is that there
    # is no relation).
    return scipy.stats.chi2_contingency(table.to_numpy()).pvalue
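
A self-contained sketch of calling the test on synthetic risk estimates and outcomes. The import path is assumed from the source location above, and the data are random, so the particular p-value carries no meaning:

import numpy as np
from pandas import DataFrame, Series

from pyhbr.analysis.describe import pvalue_chi2_high_risk_vs_outcome

rng = np.random.default_rng(0)

# Synthetic risk estimates for 1000 patients, with outcomes generated so that
# a higher estimated risk makes the outcome more likely
probs = DataFrame({"prob_M0": rng.uniform(0.0, 0.2, size=1000)})
y_test = Series(rng.uniform(size=1000) < probs["prob_M0"])

# A small p-value suggests the >5% high-risk group is associated with the outcome
p_value = pvalue_chi2_high_risk_vs_outcome(probs, y_test, high_risk_threshold=0.05)
print(p_value)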

dim_reduce

Functions for dimension-reduction of clinical codes

Dataset dataclass

Stores either the train or test set

Source code in src\pyhbr\analysis\dim_reduce.py
@dataclass
class Dataset:
    """Stores either the train or test set"""

    y: DataFrame
    X_manual: DataFrame
    X_reduce: DataFrame

make_full_pipeline(model, reducer=None)

Make a model pipeline from the model part and dimension reduction

This pipeline has one or two steps:

  • If no reduction is performed, the only step is "model"
  • If dimension reduction is performed, the steps are "reducer", "model"

This function can be used to make the pipeline with no dimension reduction (pass None to reducer). Otherwise, pass the reducer which will reduce a subset of the columns before fitting the model (use make_column_transformer to create this argument).

Parameters:

Name Type Description Default
model Pipeline

A list of model fitting steps that should be applied after the (optional) dimension reduction.

required
reducer Pipeline

If non-None, this reduction pipeline is applied before the model to reduce a subset of the columns.

None

Returns:

Type Description
Pipeline

A scikit-learn pipeline that can be fitted to training data.

Source code in src\pyhbr\analysis\dim_reduce.py
def make_full_pipeline(model: Pipeline, reducer: Pipeline = None) -> Pipeline:
    """Make a model pipeline from the model part and dimension reduction

    This pipeline has one or two steps:

    * If no reduction is performed, the only step is "model"
    * If dimension reduction is performed, the steps are "reducer", "model"

    This function can be used to make the pipeline with no dimension
    reduction (pass None to reducer). Otherwise, pass the reducer which
    will reduce a subset of the columns before fitting the model (use
    make_column_transformer to create this argument).

    Args:
        model: A list of model fitting steps that should be applied
            after the (optional) dimension reduction.
        reducer: If non-None, this reduction pipeline is applied before
            the model to reduce a subset of the columns.

    Returns:
        A scikit-learn pipeline that can be fitted to training data.
    """
    if reducer is not None:
        return Pipeline([("reducer", reducer), ("model", model)])
    else:
        return Pipeline([("model", model)])

make_grad_boost(random_state)

Make a new gradient boosting classifier

Returns:

Type Description
Pipeline

The unfitted pipeline for the gradient boosting classifier

Source code in src\pyhbr\analysis\dim_reduce.py
def make_grad_boost(random_state: RandomState) -> Pipeline:
    """Make a new gradient boosting classifier

    Returns:
        The unfitted pipeline for the gradient boosting classifier
    """
    grad_boost = GradientBoostingClassifier(
        n_estimators=100, max_depth=10, random_state=random_state
    )
    return Pipeline([("model", grad_boost)])

make_logistic_regression(random_state)

Make a new logistic regression model

The model involves scaling all predictors and then applying a logistic regression model.

Returns:

Type Description
Pipeline

The unfitted pipeline for the logistic regression model

Source code in src\pyhbr\analysis\dim_reduce.py
def make_logistic_regression(random_state: RandomState) -> Pipeline:
    """Make a new logistic regression model

    The model involves scaling all predictors and then
    applying a logistic regression model.

    Returns:
        The unfitted pipeline for the logistic regression model
    """

    scaler = StandardScaler()
    logreg = LogisticRegression(random_state=random_state)
    return Pipeline([("scaler", scaler), ("model", logreg)])

make_random_forest(random_state)

Make a new random forest model

Returns:

Type Description
Pipeline

The unfitted pipeline for the random forest model

Source code in src\pyhbr\analysis\dim_reduce.py
def make_random_forest(random_state: RandomState) -> Pipeline:
    """Make a new random forest model

    Returns:
        The unfitted pipeline for the random forest model
    """
    random_forest = RandomForestClassifier(
        n_estimators=100, max_depth=10, random_state=random_state
    )
    return Pipeline([("model", random_forest)])

make_reducer_pipeline(reducer, cols_to_reduce)

Make a wrapper that applies dimension reduction to a subset of columns.

A column transformer is necessary if only some of the columns should be dimension-reduced, and others should be preserved. The resulting pipeline is intended for use in a scikit-learn pipeline taking a pandas DataFrame as input (where a subset of the columns are cols_to_reduce).

Parameters:

Name Type Description Default
reducer

The dimension reduction model to use for reduction

required
cols_to_reduce list[str]

The list of column names to reduce

required

Returns:

Type Description
Pipeline

A pipeline which contains the column_transformer that applies the reducer to cols_to_reduce. This can be included as a step in a larger pipeline.

Source code in src\pyhbr\analysis\dim_reduce.py
def make_reducer_pipeline(reducer, cols_to_reduce: list[str]) -> Pipeline:
    """Make a wrapper that applies dimension reduction to a subset of columns.

    A column transformer is necessary if only some of the columns should be
    dimension-reduced, and others should be preserved. The resulting pipeline
    is intended for use in a scikit-learn pipeline taking a pandas DataFrame as
    input (where a subset of the columns are cols_to_reduce).

    Args:
        reducer: The dimension reduction model to use for reduction
        cols_to_reduce: The list of column names to reduce

    Returns:
        A pipeline which contains the column_transformer that applies the
            reducer to cols_to_reduce. This can be included as a step in a
            larger pipeline.
    """
    column_transformer = ColumnTransformer(
        [("reducer", reducer, cols_to_reduce)],
        remainder="passthrough",
        verbose_feature_names_out=True,
    )
    return Pipeline([("column_transformer", column_transformer)])

prepare_train_test(data_manual, data_reduce, random_state)

Make the test/train datasets for manually-chosen groups and high-dimensional data

Parameters:

Name Type Description Default
data_manual DataFrame

The dataset with manually-chosen code groups

required
data_reduce DataFrame

The high-dimensional dataset

required
random_state RandomState

The random state to pick the test/train split

required

Returns:

Type Description
tuple[Dataset, Dataset]

A tuple (train, test) containing the datasets to be used for training and testing the models. Both contain the outcome y along with the features for both the manually-chosen code groups and the data for dimension reduction.

Source code in src\pyhbr\analysis\dim_reduce.py
def prepare_train_test(
    data_manual: DataFrame, data_reduce: DataFrame, random_state: RandomState
) -> tuple[Dataset, Dataset]:
    """Make the test/train datasets for manually-chosen groups and high-dimensional data

    Args:
        data_manual: The dataset with manually-chosen code groups
        data_reduce: The high-dimensional dataset
        random_state: The random state to pick the test/train split

    Returns:
        A tuple (train, test) containing the datasets to be used for training and
            testing the models. Both contain the outcome y along with the features
            for both the manually-chosen code groups and the data for dimension
            reduction.
    """

    # Check number of rows match
    if data_manual.shape[0] != data_reduce.shape[0]:
        raise RuntimeError(
            "The number of rows in data_manual and data_reduce do not match."
        )

    test_set_proportion = 0.25

    # First, get the outcomes (y) from the dataframe. This is the
    # source of test/train outcome data, and is used for both the
    # manual and UMAP models. Just interested in whether bleeding
    # occurred (not number of occurrences) for this experiment
    outcome_name = "bleeding_al_ani_outcome"
    y = data_manual[outcome_name]

    # Get the set of manual code predictors (X0) to use for the
    # first logistic regression model (all the columns with names
    # ending in "_before").
    X_manual = data_manual.drop(columns=[outcome_name])

    # Make a random test/train split.
    X_train_manual, X_test_manual, y_train, y_test = train_test_split(
        X_manual, y, test_size=test_set_proportion, random_state=random_state
    )

    # Extract the test/train sets from the UMAP data based on
    # the index of the training set for the manual codes
    X_reduce = data_reduce.drop(columns=[outcome_name])
    X_train_reduce = X_reduce.loc[X_train_manual.index]
    X_test_reduce = X_reduce.loc[X_test_manual.index]

    # Store the test/train data together
    train = Dataset(y_train, X_train_manual, X_train_reduce)
    test = Dataset(y_test, X_test_manual, X_test_reduce)

    return train, test
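
A minimal sketch with synthetic inputs. Both frames must contain the bleeding_al_ani_outcome column and the same number of rows; the other column names and the import path are illustrative assumptions:

from numpy.random import RandomState
from pandas import DataFrame

from pyhbr.analysis.dim_reduce import prepare_train_test

random_state = RandomState(0)
n = 100

# Manually-chosen code-group features plus the outcome column
data_manual = DataFrame(
    {
        "bleeding_al_ani_outcome": random_state.binomial(1, 0.1, n),
        "bleeding_before": random_state.binomial(1, 0.2, n),
    }
)

# High-dimensional features, with the same outcome column and row order
data_reduce = DataFrame(
    random_state.normal(size=(n, 20)),
    columns=[f"code_{i}" for i in range(20)],
)
data_reduce["bleeding_al_ani_outcome"] = data_manual["bleeding_al_ani_outcome"]

train, test = prepare_train_test(data_manual, data_reduce, random_state)
print(train.X_manual.shape, test.X_manual.shape)  # 75/25 split of the rows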

fit

fit_model(pipe, X_train, y_train, X_test, y_test, num_bootstraps, num_bins, random_state)

Fit the model and bootstrap models, and calculate model performance

Parameters:

Name Type Description Default
pipe Pipeline

The model pipeline to fit

required
X_train DataFrame

Training features

required
y_train DataFrame

Training outcomes (containing "bleeding"/"ischaemia" columns)

required
X_test DataFrame

Test features

required
y_test DataFrame

Test outcomes

required
num_bootstraps int

The number of resamples of the training set to use to fit bootstrap models.

required
num_bins int

The number of equal-size bins to split risk estimates into to calculate calibration curves.

required
random_state RandomState

The source of randomness for the resampling and fitting process.

required

Returns:

Type Description
dict[str, DataFrame | Pipeline]

Dictionary with keys "probs", "calibrations", "roc_curves", "roc_aucs".

Source code in src\pyhbr\analysis\fit.py
def fit_model(
    pipe: Pipeline,
    X_train: DataFrame,
    y_train: DataFrame,
    X_test: DataFrame,
    y_test: DataFrame,
    num_bootstraps: int,
    num_bins: int,
    random_state: RandomState,
) -> dict[str, DataFrame | Pipeline]:
    """Fit the model and bootstrap models, and calculate model performance

    Args:
        pipe: The model pipeline to fit
        X_train: Training features
        y_train: Training outcomes (containing "bleeding"/"ischaemia" columns)
        X_test: Test features
        y_test: Test outcomes
        num_bootstraps: The number of resamples of the training set to use to
            fit bootstrap models.
        num_bins: The number of equal-size bins to split risk estimates into
            to calculate calibration curves.
        random_state: The source of randomness for the resampling and fitting
            process.

    Returns:
        Dictionary with keys "probs", "calibrations", "roc_curves", "roc_aucs".
    """

    # Calculate the results of the model
    probs = {}
    calibrations = {}
    roc_curves = {}
    roc_aucs = {}
    fitted_models = {}
    feature_importances = {}
    for outcome in ["bleeding", "ischaemia"]:

        log.info(f"Fitting {outcome} model")

        # Fit the bleeding and ischaemia models on the training set
        # and bootstrap resamples of the training set (to assess stability)
        fitted_models[outcome] = stability.fit_model(
            pipe, X_train, y_train.loc[:, outcome], num_bootstraps, random_state
        )

        log.info(f"Running permutation feature importance on {outcome} model M0")
        M0 = fitted_models[outcome].M0
        r = permutation_importance(
            M0,
            X_test,
            y_test.loc[:, outcome],
            n_repeats=20,
            random_state=random_state,
            scoring="roc_auc",
        )
        feature_importances[outcome] = {
            "names": X_train.columns,
            "result": r,
        }

        # Get the predicted probabilities associated with all the resamples of
        # the bleeding and ischaemia models
        probs[outcome] = stability.predict_probabilities(fitted_models[outcome], X_test)

        # Get the calibration of the models
        calibrations[outcome] = calibration.get_variable_width_calibration(
            probs[outcome], y_test.loc[:, outcome], num_bins
        )

        # Calculate the ROC curves for the models
        roc_curves[outcome] = roc.get_roc_curves(probs[outcome], y_test.loc[:, outcome])
        roc_aucs[outcome] = roc.get_auc(probs[outcome], y_test.loc[:, outcome])

    return {
        "probs": probs,
        "calibrations": calibrations,
        "roc_aucs": roc_aucs,
        "roc_curves": roc_curves,
        "fitted_models": fitted_models,
        "feature_importances": feature_importances,
    }

model

DenseTransformer

Bases: TransformerMixin

Useful when the model requires a dense matrix but the preprocessing steps produce a sparse output

Source code in src\pyhbr\analysis\model.py
class DenseTransformer(TransformerMixin):
    """Useful when the model requires a dense matrix
    but the preprocessing steps produce a sparse output
    """

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        if hasattr(X, "todense"):
            return np.asarray(X.todense())
        else:
            return X

Preprocessor dataclass

Preprocessing steps for a subset of columns

This holds the set of preprocessing steps that should be applied to a subset of the (named) columns in the input training dataframe.

Multiple instances of this class (for different subsets of columns) are grouped together to create a ColumnTransformer, which preprocesses all columns in the training dataframe.

Parameters:

Name Type Description Default
name str

The name of the preprocessor (which will become the name of the transformer in the ColumnTransformer)

required
pipe Pipeline

The sklearn Pipeline that should be applied to the set of columns

required
columns list[str]

The set of columns that should have pipe applied to them.

required
Source code in src\pyhbr\analysis\model.py
@dataclass
class Preprocessor:
    """Preprocessing steps for a subset of columns

    This holds the set of preprocessing steps that should
    be applied to a subset of the (named) columns in the
    input training dataframe.

    Multiple instances of this class (for different subsets
    of columns) are grouped together to create a ColumnTransformer,
    which preprocesses all columns in the training dataframe.

    Args:
        name: The name of the preprocessor (which will become
            the name of the transformer in the ColumnTransformer)
        pipe: The sklearn Pipeline that should be applied to
            the set of columns
        columns: The set of columns that should have pipe
            applied to them.
    """

    name: str
    pipe: Pipeline
    columns: list[str]

TradeOffModel

Bases: ClassifierMixin, BaseEstimator

Source code in src\pyhbr\analysis\model.py
class TradeOffModel(ClassifierMixin, BaseEstimator):

    def fit(self, X, y):
        """Use the name of the Y variable to choose between
        bleeding and ischaemia
        """

        # Get the outcome name to decide between bleeding and
        # ischaemia model function
        self.outcome = y.name

        self.classes_ = unique_labels(y)

        self.X_ = X
        self.y_ = y

        # Return the classifier

        return self

    def decision_function(self, X: DataFrame) -> DataFrame:
        return self.predict_proba(X)[:, 1]

    def predict_proba(self, X: DataFrame) -> DataFrame:
        if self.outcome == "bleeding":
            risk = trade_off_model_bleeding_risk(X)
        else:
            risk = trade_off_model_ischaemia_risk(X)
        return np.column_stack((1-risk, risk))
fit(X, y)

Use the name of the Y variable to choose between bleeding and ischaemia

Source code in src\pyhbr\analysis\model.py
def fit(self, X, y):
    """Use the name of the Y variable to choose between
    bleeding and ischaemia
    """

    # Get the outcome name to decide between bleeding and
    # ischaemia model function
    self.outcome = y.name

    self.classes_ = unique_labels(y)

    self.X_ = X
    self.y_ = y

    # Return the classifier

    return self

get_feature_importances(fit)

Get a table of the features used in the model along with feature importances

Parameters:

Name Type Description Default
fit Pipeline

The fitted Pipeline

required

Returns:

Type Description
DataFrame

Contains a column of feature names, a column giving the preprocessor group, and a feature importance column.

Source code in src\pyhbr\analysis\model.py
def get_feature_importances(fit: Pipeline) -> DataFrame:
    """Get a table of the features used in the model along with feature importances

    Args:
        fit: The fitted Pipeline

    Returns:
        Contains a column of feature names, a column giving the preprocessor group, and a feature importance column.
    """

    df = get_feature_names(fit)

    model = fit["model"]

    # Check if the Pipe is a raw model, or a CV search (either
    # grid or randomised)
    if hasattr(model, "best_estimator_"):
        # CV model
        importances = model.best_estimator_.feature_importances_
    else:
        importances = model.feature_importances_

    df["feature_importances"] = importances
    return df.sort_values("feature_importances", ascending=False)

get_feature_names(fit)

Get a table of feature names

The feature names are the names of the columns in the output from the preprocessing step in the fitted pipeline

Parameters:

Name Type Description Default
fit Pipeline

A fitted sklearn pipeline, containing a "preprocess" step.

required

Raises:

Type Description
RuntimeError

Raised if the length of a preprocessor's output column slice does not match the number of column names produced by that preprocessor.

Returns:

Type Description
DataFrame

A table with two columns: column, giving the feature name in the preprocessed output, and preprocessor, giving the name of the preprocessor that produced it.

Source code in src\pyhbr\analysis\model.py
def get_feature_names(fit: Pipeline) -> DataFrame:
    """Get a table of feature names

    The feature names are the names of the columns in the output
    from the preprocessing step in the fitted pipeline

    Args:
        fit: A fitted sklearn pipeline, containing a "preprocess"
            step.

    Raises:
        RuntimeError: If the length of a preprocessor's output column
            slice does not match the number of column names produced
            by that preprocessor.

    Returns:
        A table with two columns: `column`, giving the feature name in
            the preprocessed output, and `preprocessor`, giving the name
            of the preprocessor that produced it.
    """

    # Get the fitted ColumnTransformer from the fitted pipeline
    preprocess = fit["preprocess"]

    # Map from preprocess name to the relevant step that changes
    # column names. This must be kept consistent with the
    # make_*_preprocessor functions
    relevant_step = {
        "category": "one_hot_encoder",
        "float": "low_variance",
        "flag": "one_hot_encode",
    }

    # Get the map showing which column transformers (preprocessors)
    # are responsible for which slices of columns in the output
    # training dataframe
    column_slices = preprocess.output_indices_

    # Make an empty list of the right length to store all the columns
    column_names = get_num_feature_columns(fit) * [None]

    # Make an empty list for the preprocessor groups
    prep_names = get_num_feature_columns(fit) * [None]

    for name, pipe, columns in preprocess.transformers_:

        # Ignore the remainder step
        if name == "remainder":
            continue

        step_name = relevant_step[name]

        # Get the step which transforms column names
        step = pipe[step_name]

        # A special case is required for the low_variance columns
        # which need original list of columns passing in
        if name == "float":
            columns = step.get_feature_names_out(columns)
        else:
            columns = step.get_feature_names_out()

        # Get the properties of the slice where this set of
        # columns sits
        start = column_slices[name].start
        stop = column_slices[name].stop
        length = stop - start

        # Check the length of the slice matches the output
        # columns length
        if len(columns) != length:
            raise RuntimeError(
                "Length of output columns slice did not match the length of the column names list"
            )

        # Get the current slice corresponding to this preprocess
        s = column_slices[name]

        # Insert the list of column names by slice
        column_names[s] = columns

        # Store the preprocessor name for the columns
        prep_names[s] = (s.stop - s.start) * [name]

    return DataFrame({"column": column_names, "preprocessor": prep_names})

get_features(fit, X)

Get the features after preprocessing the input X dataset

The features are generated by the "preprocess" step in the fitted pipe. This step is a column transformer that one-hot-encodes discrete data, and imputes, centers, and scales numerical data.

Note that the result may be a dense or sparse Pandas dataframe, depending on whether the preprocessing steps produce a sparse numpy array or not.

Parameters:

Name Type Description Default
fit Pipeline

Fitted pipeline with "preprocess" step.

required
X DataFrame

An input dataset (either training or test) containing the input columns to be preprocessed.

required

Returns:

Type Description
DataFrame

The resulting feature columns generated by the preprocessing step.

Source code in src\pyhbr\analysis\model.py
def get_features(fit: Pipeline, X: DataFrame) -> DataFrame:
    """Get the features after preprocessing the input X dataset

    The features are generated by the "preprocess" step in the fitted
    pipe. This step is a column transformer that one-hot-encodes
    discrete data, and imputes, centers, and scales numerical data.

    Note that the result may be a dense or sparse Pandas dataframe,
    depending on whether the preprocessing steps produce a sparse
    numpy array or not.

    Args:
        fit: Fitted pipeline with "preprocess" step.
        X: An input dataset (either training or test) containing
            the input columns to be preprocessed.

    Returns:
        The resulting feature columns generated by the preprocessing
            step.
    """

    # Get the preprocessing step and new feature column names
    preprocess = fit["preprocess"]
    prep_columns = get_feature_names(fit)
    X_numpy = preprocess.transform(X)

    # Convert the numpy array or sparse array to a dataframe
    if scipy.sparse.issparse(X_numpy):
        return DataFrame.sparse.from_spmatrix(
            X_numpy,
            columns=prep_columns["column"],
            index=X.index,
        )
    else:
        return DataFrame(
            X_numpy,
            columns=prep_columns["column"],
            index=X.index,
        )

get_num_feature_columns(fit)

Get the total number of feature columns

Parameters:

Name Type Description Default
fit Pipeline

The fitted pipeline, containing a "preprocess" step.

required

Returns:

Type Description
int

The total number of columns in the features, after preprocessing.

Source code in src\pyhbr\analysis\model.py
def get_num_feature_columns(fit: Pipeline) -> int:
    """Get the total number of feature columns
    Args:
        fit: The fitted pipeline, containing a "preprocess"
            step.

    Returns:
        The total number of columns in the features, after
            preprocessing.
    """

    # Get the map from column transformers to the slices
    # that they occupy in the training data
    preprocess = fit["preprocess"]
    column_slices = preprocess.output_indices_

    total = 0
    for s in column_slices.values():
        total += s.stop - s.start

    return total

make_abc(random_state, X_train, config)

Make the AdaBoost classifier pipeline

Source code in src\pyhbr\analysis\model.py
def make_abc(random_state: RandomState, X_train: DataFrame, config: dict[str, Any]) -> Pipeline:
    """Make the AdaBoost classifier pipeline
    """

    preprocessors = [
        make_category_preprocessor(X_train),
        make_flag_preprocessor(X_train),
        make_float_preprocessor(X_train),
    ]
    preprocess = make_columns_transformer(preprocessors)
    mod = AdaBoostClassifier(**config, random_state=random_state)
    return Pipeline([("preprocess", preprocess), ("model", mod)])

make_category_preprocessor(X_train, drop=None)

Create a preprocessor for string/category columns

Columns in the training features that are discrete, represented using strings ("object") or "category" dtypes, should be one-hot encoded. This generates one new column for each possible value in the original columns.

The ColumnTransformer transformer created from this preprocessor will be called "category".

Parameters:

Name Type Description Default
X_train DataFrame

The training features

required
drop

The drop argument to be passed to OneHotEncoder. Default None means no features will be dropped. Using "first" drops the first item in the category, which is useful to avoid collinearity in linear models.

None

Returns:

Type Description
Preprocessor | None

A preprocessor for processing the discrete columns. None is returned if the training features do not contain any string/category columns

Source code in src\pyhbr\analysis\model.py
def make_category_preprocessor(X_train: DataFrame, drop=None) -> Preprocessor | None:
    """Create a preprocessor for string/category columns

    Columns in the training features that are discrete, represented
    using strings ("object") or "category" dtypes, should be one-hot
    encoded. This generates one new column for each possible value
    in the original columns.

    The ColumnTransformer transformer created from this preprocessor
    will be called "category".

    Args:
        X_train: The training features
        drop: The drop argument to be passed to OneHotEncoder. Default
            None means no features will be dropped. Using "first" drops
            the first item in the category, which is useful to avoid
            collinearity in linear models.

    Returns:
        A preprocessor for processing the discrete columns. None is
            returned if the training features do not contain any
            string/category columns
    """

    # Category columns should be one-hot encoded (in all these one-hot encoders,
    # consider the effect of linear dependence among the columns due to the extra
    # variable compared to dummy encoding -- the relevant parameter is called
    # 'drop').
    columns = X_train.columns[
        (X_train.dtypes == "object") | (X_train.dtypes == "category")
    ]

    # Return None if there are no discrete columns.
    if len(columns) == 0:
        return None

    pipe = Pipeline(
        [
            (
                "one_hot_encoder",
                OneHotEncoder(
                    handle_unknown="infrequent_if_exist", min_frequency=0.002, drop=drop
                ),
            ),
        ]
    )

    return Preprocessor("category", pipe, columns)

make_flag_preprocessor(X_train, drop=None)

Create a preprocessor for flag columns

Columns in the training features that are flags (bool + NaN) are represented using Int8 (because bool does not allow NaN). These columns are also one-hot encoded.

The ColumnTransformer transformer created from this preprocessor will be called "flag".

Parameters:

Name Type Description Default
X_train DataFrame

The training features.

required
drop

The drop argument to be passed to OneHotEncoder. Default None means no features will be dropped. Using "first" drops the first item in the category, which is useful to avoid collinearity in linear models.

None

Returns:

Type Description
Preprocessor | None

A preprocessor for processing the flag columns. None is returned if the training features do not contain any Int8 columns.

Source code in src\pyhbr\analysis\model.py
def make_flag_preprocessor(X_train: DataFrame, drop=None) -> Preprocessor | None:
    """Create a preprocessor for flag columns

    Columns in the training features that are flags (bool + NaN) are
    represented using Int8 (because bool does not allow NaN). These
    columns are also one-hot encoded.

    The ColumnTransformer transformer created from this preprocessor
    will be called "flag".

    Args:
        X_train: The training features.
        drop: The drop argument to be passed to OneHotEncoder. Default
            None means no features will be dropped. Using "first" drops
            the first item in the category, which is useful to avoid
            collinearity in linear models.

    Returns:
        A preprocessor for processing the flag columns. None is
            returned if the training features do not contain any
            Int8 columns.
    """

    # Flag columns (encoded using Int8, which supports NaN), should be one-hot
    # encoded (considered separately from category in case we want to do something
    # different with these).
    columns = X_train.columns[(X_train.dtypes == "Int8")]

    # Return None if there are no discrete columns.
    if len(columns) == 0:
        return None

    pipe = Pipeline(
        [
            (
                "one_hot_encode",
                OneHotEncoder(handle_unknown="infrequent_if_exist", drop=drop),
            ),
        ]
    )

    return Preprocessor("flag", pipe, columns)

make_float_preprocessor(X_train)

Create a preprocessor for float (numerical) columns

Columns in the training features that are numerical are encoded using float (to distinguish them from Int8, which is used for flags).

Missing values in these columns are imputed using the mean, then low variance columns are removed. The remaining columns are centered and scaled.

The ColumnTransformer transformer created from this preprocessor will be called "float".

Parameters:

Name Type Description Default
X_train DataFrame

The training features

required

Returns:

Type Description
Preprocessor | None

A preprocessor for processing the float columns. None is returned if the training features do not contain any float columns.

Source code in src\pyhbr\analysis\model.py
def make_float_preprocessor(X_train: DataFrame) -> Preprocessor | None:
    """Create a preprocessor for float (numerical) columns

    Columns in the training features that are numerical are encoded
    using float (to distinguish them from Int8, which is used for
    flags).

    Missing values in these columns are imputed using the mean, then
    low variance columns are removed. The remaining columns are
    centered and scaled.

    The ColumnTransformer transformer created from this preprocessor
    will be called "float".

    Args:
        X_train: The training features

    Returns:
        A preprocessor for processing the float columns. None is
            returned if the training features do not contain any
            float columns.
    """

    # Numerical columns -- impute missing values, remove low variance
    # columns, and then centre and scale the rest.
    columns = X_train.columns[(X_train.dtypes == "float")]

    # Return None if there are no float columns.
    if len(columns) == 0:
        return None

    pipe = Pipeline(
        [
            ("impute", SimpleImputer(missing_values=np.nan, strategy="mean")),
            ("low_variance", VarianceThreshold()),
            ("scaler", StandardScaler()),
        ]
    )

    return Preprocessor("float", pipe, columns)

make_nearest_neighbours_cv(random_state, X_train, config)

Nearest neighbours classifier trained using cross validation

Parameters:

Name Type Description Default
random_state RandomState

Source of randomness for creating the model

required
X_train DataFrame

The training dataset containing all features for modelling

required
config dict[str, Any]

The dictionary of keyword arguments to configure the CV search.

required

Returns:

Type Description
Pipeline

The preprocessing and fitting pipeline.

Source code in src\pyhbr\analysis\model.py
def make_nearest_neighbours_cv(random_state: RandomState, X_train: DataFrame, config: dict[str, Any]) -> Pipeline:
    """Nearest neighbours classifier trained using cross validation

    Args:
        random_state: Source of randomness for creating the model
        X_train: The training dataset containing all features for modelling
        config: The dictionary of keyword arguments to configure the CV search.

    Returns:
        The preprocessing and fitting pipeline.
    """
    preprocessors = [
        make_category_preprocessor(X_train),
        make_flag_preprocessor(X_train),
        make_float_preprocessor(X_train),
    ]
    preprocess = make_columns_transformer(preprocessors)

    mod = RandomizedSearchCV(
        KNeighborsClassifier(),
        param_distributions=config,
        random_state=random_state,
        scoring="roc_auc",
        cv=5,
        verbose=3
    )
    return Pipeline([("preprocess", preprocess), ("model", mod)])

make_random_forest(random_state, X_train)

Make the random forest model

Parameters:

Name Type Description Default
random_state RandomState

Source of randomness for creating the model

required
X_train DataFrame

The training dataset containing all features for modelling

required

Returns:

Type Description
Pipeline

The preprocessing and fitting pipeline.

Source code in src\pyhbr\analysis\model.py
def make_random_forest(random_state: RandomState, X_train: DataFrame) -> Pipeline:
    """Make the random forest model

    Args:
        random_state: Source of randomness for creating the model
        X_train: The training dataset containing all features for modelling

    Returns:
        The preprocessing and fitting pipeline.
    """

    preprocessors = [
        make_category_preprocessor(X_train),
        make_flag_preprocessor(X_train),
        make_float_preprocessor(X_train),
    ]
    preprocess = make_columns_transformer(preprocessors)
    mod = RandomForestClassifier(
        n_estimators=100, max_depth=10, random_state=random_state
    )
    return Pipeline([("preprocess", preprocess), ("model", mod)])

make_random_forest_cv(random_state, X_train, config)

Random forest model trained using cross validation

Parameters:

Name Type Description Default
random_state RandomState

Source of randomness for creating the model

required
X_train DataFrame

The training dataset containing all features for modelling

required
config dict[str, Any]

The dictionary of keyword arguments to configure the CV search.

required

Returns:

Type Description
Pipeline

The preprocessing and fitting pipeline.

Source code in src\pyhbr\analysis\model.py
def make_random_forest_cv(random_state: RandomState, X_train: DataFrame, config: dict[str, Any]) -> Pipeline:
    """Random forest model trained using cross validation

    Args:
        random_state: Source of randomness for creating the model
        X_train: The training dataset containing all features for modelling
        config: The dictionary of keyword arguments to configure the CV search.

    Returns:
        The preprocessing and fitting pipeline.
    """
    preprocessors = [
        make_category_preprocessor(X_train),
        make_flag_preprocessor(X_train),
        make_float_preprocessor(X_train),
    ]
    preprocess = make_columns_transformer(preprocessors)

    mod = RandomizedSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_distributions=config,
        random_state=random_state,
        scoring="roc_auc",
        cv=5,
        verbose=3
    )
    return Pipeline([("preprocess", preprocess), ("model", mod)])

make_trade_off(random_state, X_train, config)

Make the ARC HBR bleeding/ischaemia trade-off model

Parameters:

Name Type Description Default
random_state RandomState

Source of randomness for creating the model

required
X_train DataFrame

The training dataset containing all features for modelling

required

Returns:

Type Description
Pipeline

The preprocessing and fitting pipeline.

Source code in src\pyhbr\analysis\model.py
def make_trade_off(random_state: RandomState, X_train: DataFrame, config: dict[str, Any]) -> Pipeline:
    """Make the ARC HBR bleeding/ischaemia trade-off model

    Args:
        random_state: Source of randomness for creating the model
        X_train: The training dataset containing all features for modelling

    Returns:
        The preprocessing and fitting pipeline.
    """

    #preprocess = make_columns_transformer(preprocessors)
    mod = TradeOffModel()
    return Pipeline([("model", mod)])

make_xgboost_cv(random_state, X_train, config)

XGBoost model trained using cross validation

Parameters:

Name Type Description Default
random_state RandomState

Source of randomness for creating the model

required
X_train DataFrame

The training dataset containing all features for modelling

required
config dict[str, Any]

The dictionary of keyword arguments to configure the CV search.

required

Returns:

Type Description
Pipeline

The preprocessing and fitting pipeline.

Source code in src\pyhbr\analysis\model.py
def make_xgboost_cv(random_state: RandomState, X_train: DataFrame, config: dict[str, Any]) -> Pipeline:
    """XGBoost model trained using cross validation

    Args:
        random_state: Source of randomness for creating the model
        X_train: The training dataset containing all features for modelling
        config: The dictionary of keyword arguments to configure the CV search.

    Returns:
        The preprocessing and fitting pipeline.
    """
    preprocessors = [
        make_category_preprocessor(X_train),
        make_flag_preprocessor(X_train),
        make_float_preprocessor(X_train),
    ]
    preprocess = make_columns_transformer(preprocessors)

    mod = RandomizedSearchCV(
        XGBClassifier(random_state=random_state),
        param_distributions=config,
        random_state=random_state,
        scoring="roc_auc",
        cv=5,
        verbose=3
    )
    return Pipeline([("preprocess", preprocess), ("model", mod)])

trade_off_model_bleeding_risk(features)

ARC-HBR bleeding part of the trade-off model

This function implements the bleeding model contained here https://pubmed.ncbi.nlm.nih.gov/33404627/. The numbers used below come from correspondence with the authors.

Parameters:

Name Type Description Default
features DataFrame

must contain age, smoking, copd, hb, egfr_x, oac.

required

Returns:

Type Description
Series

The bleeding risks as a Series.

Source code in src\pyhbr\analysis\model.py
def trade_off_model_bleeding_risk(features: DataFrame) -> Series:
    """ARC-HBR bleeding part of the trade-off model

    This function implements the bleeding model contained here
    https://pubmed.ncbi.nlm.nih.gov/33404627/. The numbers used
    below come from correspondence with the authors.


    Args:
        features: must contain age, smoking, copd, hb, egfr_x, oac.

    Returns:
        The bleeding risks as a Series.
    """

    # Age component, right=False for >= 65, setting upper limit == 1000 to catch all
    age = pd.cut(features["age"], [0, 65, 1000], labels=[1, 1.5], right=False).astype("float")

    smoking = np.where(features["smoking"] == "yes", 1.47, 1)
    copd = np.where(features["copd"].fillna(0) == 1, 1.39, 1)

    # Fill NA with a high Hb value (50) to treat missing as low risk
    hb = pd.cut(
        10 * features["hb"].fillna(50),
        [0, 110, 130, 1000],
        labels=[3.99, 1.69, 1],
        right=False,
    ).astype("float")

    # Fill NA with a high eGFR value (500) to treat missing as low risk
    egfr = pd.cut(
        features["egfr_x"].fillna(500),
        [0, 30, 60, 1000],
        labels=[1.43, 0.99, 1],
        right=False,
    ).astype("float")

    # Complex PCI and liver/cancer composite
    complex_score = np.where(features["complex_pci_index"], 1.32, 1.0)
    liver_cancer_surgery = np.where((features["cancer_before"] + features["ckd_before"]) > 0, 1.63, 1.0)

    oac = np.where(features["oac"] == 1, 2.0, 1.0)

    # Calculate bleeding risk
    xb = age*smoking*copd*liver_cancer_surgery*hb*egfr*complex_score*oac
    risk = 1 - 0.986**xb

    return risk
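
To make the arithmetic concrete: for a patient aged 65 or over (factor 1.5) who smokes (factor 1.47), with every other factor at its reference value of 1, xb = 1.5 × 1.47 ≈ 2.21 and risk = 1 − 0.986^2.21 ≈ 0.031, i.e. roughly a 3% predicted bleeding risk. This is purely an illustration of the formula in the code above, not a validated clinical estimate.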

trade_off_model_ischaemia_risk(features)

ARC-HBR ischaemia part of the trade-off model

This function implements the ischaemia model contained here https://pubmed.ncbi.nlm.nih.gov/33404627/. The numbers used below come from correspondence with the authors.

Parameters:

Name Type Description Default
features DataFrame

must contain diabetes_before, smoking, mi_schnier_before, stemi_index, hb, egfr_x, and complex_pci_index.

required

Returns:

Type Description
Series

The ischaemia risks as a Series.

Source code in src\pyhbr\analysis\model.py
def trade_off_model_ischaemia_risk(features: DataFrame) -> Series:
    """ARC-HBR ischaemia part of the trade-off model

    This function implements the ischaemia model contained here
    https://pubmed.ncbi.nlm.nih.gov/33404627/. The numbers used
    below come from correspondence with the authors.

    Args:
        features: must contain diabetes_before, smoking, mi_schnier_before,
            stemi_index, hb, egfr_x, and complex_pci_index.

    Returns:
        The ischaemia risks as a Series.
    """

    diabetes = np.where(features["diabetes_before"] > 0, 1.56, 1)
    smoking = np.where(features["smoking"] == "yes", 1.47, 1)
    prior_mi = np.where(features["mi_schnier_before"] > 0, 1.89, 1)

    # Interpreting "presentation" as stemi vs. nstemi
    presentation = np.where(features["stemi_index"], 1.82, 1)

    # Fill NA with a high Hb value (50) to treat missing as low risk
    hb = pd.cut(
        10 * features["hb"].fillna(50),
        [0, 110, 130, 1000],
        labels=[1.5, 1.27, 1],
        right=False,
    ).astype("float")

    # Fill NA with a high eGFR value (500) to treat missing as low risk
    egfr = pd.cut(
        features["egfr_x"].fillna(500),
        [0, 30, 60, 1000],
        labels=[1.69, 1.3, 1],
        right=False,
    ).astype("float")

    # TODO bare metal stent (missing from data)
    complex_score = np.where(features["complex_pci_index"], 1.5, 1.0)
    bms = 1.0

    # Calculate ischaemia risk
    xb = diabetes*smoking*prior_mi*presentation*hb*egfr*complex_score*bms
    risk = 1 - 0.986**xb

    return risk

patient_viewer

get_patient_history(patient_id, hic_data)

Get a list of all this patient's episode data

Parameters:

Name Type Description Default
patient_id str

Which patient to fetch

required
hic_data HicData

Contains episodes and codes tables

required

Returns:

Type Description

A table indexed by spell_id, episode_id, type (of clinical code) and clinical code position.

Source code in src\pyhbr\analysis\patient_viewer.py
def get_patient_history(patient_id: str, hic_data: HicData):
    """Get a list of all this patient's episode data

    Args:
        patient_id: Which patient to fetch
        hic_data: Contains `episodes` and `codes` tables

    Returns:
        A table indexed by spell_id, episode_id, type (of clinical code)
            and clinical code position.
    """
    df = hic_data.codes.merge(
        hic_data.episodes[["patient_id", "spell_id", "episode_start"]],
        how="left",
        on="episode_id",
    )
    this_patient = (
        df[df["patient_id"] == patient_id]
        .sort_values(["episode_start", "type","position"])
        .drop(columns="group")
        .set_index(["spell_id", "episode_id", "type", "position"])
    ).drop_duplicates()
    return this_patient

roc

ROC Curves

This module calculates the ROC curves of the bootstrapped models (for assessing ROC curve stability; see stability.py).

AucData dataclass

Source code in src\pyhbr\analysis\roc.py
@dataclass
class AucData:
    model_under_test_auc: float
    resample_auc: list[float]

    def mean_resample_auc(self) -> float:
        """Get the mean of the resampled AUCs
        """
        return np.mean(self.resample_auc)

    def std_dev_resample_auc(self) -> float:
        """Get the standard deviation of the resampled AUCs
        """
        return np.std(self.resample_auc)

    def roc_auc_spread(self) -> Series:
        """Get the 25%, 50% and 75% quantiles of all the AUCs (resamples plus the model under test)"""
        return Series(self.resample_auc + [self.model_under_test_auc]).quantile([0.25, 0.5, 0.75])
mean_resample_auc()

Get the mean of the resampled AUCs

Source code in src\pyhbr\analysis\roc.py
def mean_resample_auc(self) -> float:
    """Get the mean of the resampled AUCs
    """
    return np.mean(self.resample_auc)
std_dev_resample_auc()

Get the standard deviation of the resampled AUCs

Source code in src\pyhbr\analysis\roc.py
def std_dev_resample_auc(self) -> float:
    """Get the standard deviation of the resampled AUCs
    """
    return np.std(self.resample_auc)

get_auc(probs, y_test)

Get the area under the ROC curves for the fitted models

Compute area under the ROC curve (AUC) for the model-under-test (the first column of probs), and the other bootstrapped models (other columns of probs).

Source code in src\pyhbr\analysis\roc.py
def get_auc(probs: DataFrame, y_test: Series) -> AucData:
    """Get the area under the ROC curves for the fitted models

    Compute area under the ROC curve (AUC) for the model-under-test
    (the first column of probs), and the other bootstrapped models
    (other columns of probs).

    """
    model_under_test_auc = roc_auc_score(y_test, probs.iloc[:,0]) # Model-under test
    resample_auc = []
    for column in probs:
        resample_auc.append(roc_auc_score(y_test, probs[column]))
    return AucData(model_under_test_auc, resample_auc)

get_roc_curves(probs, y_test)

Get the ROC curves for the fitted models

Get the ROC curves for all models (whose probability predictions for the positive class are columns of probs) based on the outcomes in y_test. Rows of y_test correspond to rows of probs. The result is a list of pairs, one for each model (column of probs). Each pair contains the vector of x- and y-coordinates of the ROC curve.

Parameters:

Name Type Description Default
probs DataFrame

The probabilities predicted by all the fitted models. The first column is the model-under-test (the training set), and the other columns are resamples of the training set.

required
y_test Series

The outcome data corresponding to each row of probs.

required

Returns:

Type Description
list[DataFrame]

A list of DataFrames, each of which contains one ROC curve, corresponding to the columns in probs. The columns of the DataFrames are fpr (false positive rate) and tpr (true positive rate)

Source code in src\pyhbr\analysis\roc.py
def get_roc_curves(probs: DataFrame, y_test: Series) -> list[DataFrame]:
    """Get the ROC curves for the fitted models

    Get the ROC curves for all models (whose probability
    predictions for the positive class are columns of probs) based
    on the outcomes in y_test. Rows of y_test correspond to rows of
    probs. The result is a list of pairs, one for each model (column
    of probs). Each pair contains the vector of x- and y-coordinates
    of the ROC curve.

    Args:
        probs: The probabilities predicted by all the fitted models.
            The first column is the model-under-test (the training set),
            and the other columns are resamples of the training set.
        y_test: The outcome data corresponding to each row of probs.

    Returns:
        A list of DataFrames, each of which contains one ROC curve,
            corresponding to the columns in probs. The columns of the
            DataFrames are `fpr` (false positive rate) and `tpr` (true
            positive rate)
    """
    curves = []
    for n in range(probs.shape[1]):
        fpr, tpr, _ = roc_curve(y_test, probs.iloc[:, n])
        curves.append(DataFrame({"fpr": fpr, "tpr": tpr}))
    return curves

plot_roc_curves(ax, curves, auc, title='ROC-stability Curves')

Plot ROC curves of the model-under-test and resampled models

Plot the set of bootstrapped ROC curves (an instability plot), using the data in curves (a list of curves to plot). Assume that the first curve is the model-under-test (which is coloured differently).

The auc argument is the AucData returned by get_auc, containing the AUC of the model under test and the AUCs of the bootstrapped models (whose mean and standard deviation measure stability).

Source code in src\pyhbr\analysis\roc.py
def plot_roc_curves(ax, curves, auc, title = "ROC-stability Curves"):
    """Plot ROC curves of the model-under-test and resampled models

    Plot the set of bootstrapped ROC curves (an instability plot),
    using the data in curves (a list of curves to plot). Assume that the
    first curve is the model-under-test (which is coloured differently).

    The auc argument is the AucData returned by get_auc, containing the
    AUC of the model under test and the AUCs of the bootstrapped models
    (whose mean and standard deviation measure stability).
    """
    mut_curve = curves[0]  # model-under-test
    ax.plot(mut_curve["fpr"], mut_curve["tpr"], color="r")
    for curve in curves[1:]:
        ax.plot(curve["fpr"], curve["tpr"], color="b", linewidth=0.3, alpha=0.4)
    ax.axline([0, 0], [1, 1], color="k", linestyle="--")
    ax.legend(
        [
            f"Model (AUC = {auc.model_under_test_auc:.2f})",
            f"Bootstrapped models",
        ]
    )
    ax.set_title(title)
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")

stability

Assessing model stability

Model stability of an internally-validated model refers to how well models developed on a similar internal population agree with each other. The methodology for assessing model stability follows Riley and Collins, 2022 (https://arxiv.org/abs/2211.01061)

Assessing model stability is an end-to-end test of the entire model development process. Riley and Collins do not refer to a test/train split, but their method will be interpreted as applying to the training set (with instability measures assessed by applying models to the test set). As a result, the first step in the process is to split the internal dataset into a training set P0 and a test set T.

Assuming that a training set P0 is used to develop a model M0 using a model development process D (involving steps such as cross-validation and hyperparameter tuning in the training set, and validation of accuracy of model prediction in the test set), the following steps are required to assess the stability of M0:

  1. Bootstrap resample P0 with replacement M >= 200 times, creating M new datasets Pm that are all the same size as P0
  2. Apply D to each Pm, to obtain M new models Mn which are all comparable with M0.
  3. Collect together the predictions from all Mn and compare them to the predictions from M0 for each sample in the test set T.
  4. From the data in 3, plot instability plots such as a scatter plot of M0 predictions on the x-axis and all the Mn predictions on the y-axis, for each sample of T. In addition, plot graphs of how all the model validation metrics vary as a function of the bootstrapped models Mn.

Implementation

A function is required that takes the original training set P0 and generates N bootstrapped resamples Pn that are the same size as P0.

A function is required that wraps the entire model into one call, taking as input the bootstrapped resample Pn and providing as output the bootstrapped model Mn. This function can then be called M times to generate the bootstrapped models. This function is not defined in this file (see the fit.py file)

An aggregating function will then take all the models Mn, the model-under-test M0, and the test set T, and make predictions using all the models for each sample in the test set. It should return all these predictions (probabilities) in a 2D array, where each row corresponds to a test-set sample, column 0 is the probability from M0, and columns 1 through M are the probabilities from each Mn.

This 2D array may be used as the basis of instability plots. Paired with information about the true outcomes y_test, this can also be used to plot ROC-curve variability (i.e. plotting the ROC curve for all model M0 and Mn on one graph). Any other accuracy metric of interest can be calculated from this information (i.e. for step 4 above).
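
A condensed sketch of the procedure above. Here make_model is a placeholder for the development process D (in this project that role is played by the model pipelines and fit.py), and the resampling is the with-replacement draw from step 1:

import numpy as np
from numpy.random import RandomState
from pandas import DataFrame, Series


def bootstrap_probabilities(
    make_model, X0: DataFrame, y0: Series, X_test: DataFrame, M: int, random_state: RandomState
) -> np.ndarray:
    """Fit M0 on P0 and M bootstrap models on resamples of P0, then predict on the test set T.

    Returns a 2D array with one row per test-set sample; column 0 holds the
    probabilities from M0 and columns 1 to M hold those from the bootstrap models.
    """
    probs = [make_model().fit(X0, y0).predict_proba(X_test)[:, 1]]
    for _ in range(M):
        # Step 1: resample P0 with replacement to get Pm (same size as P0)
        rows = random_state.randint(0, len(X0), size=len(X0))
        Xm, ym = X0.iloc[rows], y0.iloc[rows]
        # Step 2: apply the development process to Pm, giving a bootstrap model
        probs.append(make_model().fit(Xm, ym).predict_proba(X_test)[:, 1])
    # Steps 3-4: this array is the basis of the instability and ROC plots
    return np.column_stack(probs)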

FittedModel dataclass

Stores a model fitted to a training set and resamples of the training set.

Source code in src\pyhbr\analysis\stability.py
@dataclass
class FittedModel:
    """Stores a model fitted to a training set and resamples of the training set."""

    M0: Pipeline
    Mm: list[Pipeline]

    def flatten(self) -> list[Pipeline]:
        """Get a flat list of all the models

        Returns:
            The list of fitted models, with M0 at the front
        """
        return [self.M0] + self.Mm
flatten()

Get a flat list of all the models

Returns:

Type Description
list[Pipeline]

The list of fitted models, with M0 at the front

Source code in src\pyhbr\analysis\stability.py
def flatten(self) -> list[Pipeline]:
    """Get a flat list of all the models

    Returns:
        The list of fitted models, with M0 at the front
    """
    return [self.M0] + self.Mm

Resamples dataclass

Store a training set along with M resamples of it

Parameters:

Name Type Description Default
X0 DataFrame

The matrix of predictors

required
Y0 DataFrame

The matrix of outcomes (one column per outcome)

required
Xm list[DataFrame]

A list of resamples of the predictors

required
Ym list[DataFrame]

A list of resamples of the outcomes

required
Source code in src\pyhbr\analysis\stability.py
@dataclass
class Resamples:
    """Store a training set along with M resamples of it

    Args:
        X0: The matrix of predictors
        Y0: The matrix of outcomes (one column per outcome)
        Xm: A list of resamples of the predictors
        Ym: A list of resamples of the outcomes
    """

    X0: DataFrame
    Y0: DataFrame
    Xm: list[DataFrame]
    Ym: list[DataFrame]

absolute_instability(probs)

Get a list of the absolute percentage-point differences

Compare the primary model to the bootstrap models by flattening all the bootstrap model estimates and calculating the absolute difference between the primary model estimate and the bootstraps. Results are expressed in percentage points.

Parameters:

Name Type Description Default
probs DataFrame

First column is primary model risk estimates, other columns are bootstrap model estimates.

required

Returns:

Type Description
Series

A Series of absolute percentage-point discrepancies between the primary model predictions and the bootstrap estimates.

Source code in src\pyhbr\analysis\stability.py
def absolute_instability(probs: DataFrame) -> Series:
    """Get a list of the absolute percentage-point differences

    Compare the primary model to the bootstrap models by flattening
    all the bootstrap model estimates and calculating the absolute
    difference between the primary model estimate and the bootstraps.
    Results are expressed in percentage points.

    Args:
        probs: First column is primary model risk estimates, other
            columns are bootstrap model estimates.

    Returns:
        A Series of absolute percentage-point discrepancies between
            the primary model predictions and the bootstrap 
            estimates.
    """

    # Make a table containing the initial risk (from the
    # model under test) and a column for all other risks
    prob_compare = 100 * probs.melt(
        id_vars="prob_M0", value_name="bootstrap_risk", var_name="initial_risk"
    )

    # Round the resulting risk error to 2 decimal places (i.e. to 0.01%). This truncates very small values
    # to zero, which means the resulting log y scale is not artificially extended downwards.
    return (
        (prob_compare["bootstrap_risk"] - prob_compare["prob_M0"])
        .abs()
        .round(decimals=2)
    )

average_absolute_instability(probs)

Get the average absolute error between primary model and bootstrap estimates.

This function computes the average of the absolute difference between the risks estimated by the primary model, and the risks estimated by the bootstrap models. For example, if the primary model estimates 1%, and two bootstrap models estimate 2% and 3%, the result is an error of 1.5 percentage points.

Expressed differently, the function calculates the average percentage-point difference between the model under test and bootstrap models.

Using the absolute error instead of the relative error is more useful in practice, because it does not inflate errors between very small risks. Since most risks are below about 20%, and risk thresholds of interest are around 5%, an absolute risk difference is easier to interpret.

Further granularity in the variability of risk estimates as a function of risk is obtained by looking at the instability box plot.

Parameters:

Name Type Description Default
probs DataFrame

The table of risks estimated by the models. The first column is the model under test, and the other columns are bootstrap models.

required

Returns:

Type Description
dict[str, float]

The median and a 95% confidence interval (the 2.5% and 97.5% quantiles) of the absolute error, in percent.

Source code in src\pyhbr\analysis\stability.py
def average_absolute_instability(probs: DataFrame) -> dict[str, float]:
    """Get the average absolute error between primary model and bootstrap estimates.

    This function computes the average of the absolute difference between the risks
    estimated by the primary model, and the risks estimated by the bootstrap models.
    For example, if the primary model estimates 1%, and two bootstrap models estimate
    2% and 3%, the result is an error of 1.5 percentage points.

    Expressed differently, the function calculates the average percentage-point
    difference between the model under test and bootstrap models.

    Using the absolute error instead of the relative error is more useful in
    practice, because it does not inflate errors between very small risks. Since
    most risks are below about 20%, and risk thresholds of interest are around 5%,
    an absolute risk difference is easier to interpret.

    Further granularity in the variability of risk estimates as a function of
    risk is obtained by looking at the instability box plot.

    Args:
        probs: The table of risks estimated by the models. The first column is
            the model under test, and the other columns are bootstrap models.

    Returns:
        The median and a 95% confidence interval (the 2.5% and 97.5% quantiles)
            of the absolute error, in percent.
    """

    absolute_errors = absolute_instability(probs)
    return absolute_errors.quantile([0.025, 0.5, 0.975])
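
A minimal usage sketch (the toy numbers are made up; the column names follow the prob_M0, prob_M1, ... convention produced by predict_probabilities):

import pandas as pd
from pyhbr.analysis import stability

probs = pd.DataFrame(
    {
        "prob_M0": [0.01, 0.10, 0.50],  # model-under-test
        "prob_M1": [0.02, 0.12, 0.45],  # bootstrap model 1
        "prob_M2": [0.03, 0.09, 0.55],  # bootstrap model 2
    }
)

# Absolute percentage-point differences between M0 and the bootstrap models
print(stability.absolute_instability(probs))

# Median and 2.5%/97.5% quantiles of those differences (in percent)
print(stability.average_absolute_instability(probs))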

fit_model(model, X0, y0, M, random_state)

Fit a model to a training set and resamples of the training set.

Use the unfitted model pipeline to:

  • Fit a model to the training set (X0, y0)
  • Fit a model to M resamples (Xm, ym) of the training set

The model is an unfitted scikit-learn Pipeline. Note that if RandomState is used when specifying the model, then the models used to fit the resamples here will be statistical clones (i.e. they might not necessarily produce the same result on the same data). clone() is called on model before fitting, so each fit gets a new clean object.

Parameters:

Name Type Description Default
model Pipeline

An unfitted scikit-learn pipeline, which is used as the basis for all the fits. Each fit calls clone() on this object before fitting, to get a new model with clean parameters. The cloned fitted models are then stored in the returned fitted model.

required
X0 DataFrame

The training set features

required
y0 Series

The training set outcome

required
M int

How many resamples to take from the training set (ideally >= 200)

required
random_state RandomState

The source of randomness for model fitting

required

Returns:

Type Description
FittedModel

An object containing the model fitted on (X0,y0) and all (Xm,ym)

Source code in src\pyhbr\analysis\stability.py
def fit_model(
    model: Pipeline, X0: DataFrame, y0: Series, M: int, random_state: RandomState
) -> FittedModel:
    """Fit a model to a training set and resamples of the training set.

    Use the unfitted model pipeline to:

    * Fit a model to the training set (X0, y0)
    * Fit a model to M resamples (Xm, ym) of the training set

    The model is an unfitted scikit-learn Pipeline. Note that if RandomState is used
    when specifying the model, then the models used to fit the resamples here will
    be _statistical clones_ (i.e. they might not necessarily produce the same result
    on the same data). clone() is called on model before fitting, so each fit gets a
    new clean object.

    Args:
        model: An unfitted scikit-learn pipeline, which is used as the basis for
            all the fits. Each fit calls clone() on this object before fitting, to
            get a new model with clean parameters. The cloned fitted models are then
            stored in the returned fitted model.
        X0: The training set features
        y0: The training set outcome
        M (int): How many resamples to take from the training set (ideally >= 200)
        random_state: The source of randomness for model fitting

    Returns:
        An object containing the model fitted on (X0,y0) and all (Xm,ym)
    """

    # Develop a single model from the training set (X0_train, y0_train),
    # using any method (e.g. including cross validation and hyperparameter
    # tuning) using training set data. This is referred to as D in
    # stability.py.
    log.info("Fitting model-under-test")
    pipe = clone(model)
    M0 = pipe.fit(X0, y0)

    # Resample the training set to obtain the new datasets (Xm, ym)
    log.info(f"Creating {M} bootstrap resamples of training set")
    resamples = make_bootstrapped_resamples(X0, y0, M, random_state)

    # Develop all the bootstrap models to compare with the model-under-test M0
    log.info("Fitting bootstrapped models")
    Mm = []
    for m in range(M):
        pipe = clone(model)
        ym = resamples.Ym[m]
        Xm = resamples.Xm[m]
        Mm.append(pipe.fit(Xm, ym))

    return FittedModel(M0, Mm)
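
A minimal sketch on synthetic data (the pipeline, feature names and value of M below are purely illustrative; M should be at least 200 in practice):

import pandas as pd
from numpy.random import RandomState
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyhbr.analysis import stability

rng = RandomState(0)
X0 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y0 = pd.Series(rng.binomial(1, 0.3, size=100))

model = Pipeline([("scale", StandardScaler()), ("logreg", LogisticRegression())])

# A small M is used only to keep the sketch quick (a warning is raised)
fitted = stability.fit_model(model, X0, y0, M=10, random_state=rng)
print(len(fitted.flatten()))  # 11 models: M0 plus 10 bootstrap models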

get_average_instability(probs)

Instability is the extent to which the bootstrapped models give a different prediction from the model under test. The average instability is an average of the SMAPE between the prediction of the model-under-test and the predictions of each of the other bootstrap models (i.e. pairing the model-under-test with a single bootstrapped model gives one SMAPE value, and these are averaged over all the bootstrap models).

SMAPE is preferable to the mean relative error, because the latter diverges when the prediction from the model-under-test is very small. It may be better still to use the log of the accuracy ratio (see https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error), since the probabilities are all positive; or there may be a better metric for comparing probabilities specifically.

Testing: not yet tested

Source code in src\pyhbr\analysis\stability.py
def get_average_instability(probs):
    """
    Instability is the extent to which the bootstrapped models
    give a different prediction from the model under test. The
    average instability is an average of the SMAPE between
    the prediction of the model-under-test and the predictions of
    each of the other bootstrap models (i.e. pairing the model-under-test
    with a single bootstrapped model gives one SMAPE value, and
    these are averaged over all the bootstrap models).

    SMAPE is preferable to the mean relative error, because the latter
    diverges when the prediction from the model-under-test is very small.
    It may be better still to use the log of the accuracy ratio (see
    https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error),
    since the probabilities are all positive; or there may be a better
    metric for comparing probabilities specifically.

    Testing: not yet tested
    """
    num_rows = probs.shape[0]
    num_cols = probs.shape[1]

    smape_over_bootstraps = []

    # Loop over each bootstrap model
    for j in range(1, num_cols):

        # Calculate SMAPE between bootstrap model j and
        # the model-under-test
        smape_over_bootstraps.append(smape(probs[:, 0], probs[:, j]))

    return np.mean(smape_over_bootstraps)

get_reclass_probabilities(probs, y_test, threshold)

Get the probability of risk reclassification for each patient

Parameters:

Name Type Description Default
probs DataFrame

The matrix of probabilities from the model-under-test (first column) and the bootstrapped models (subsequent models).

required
y_test Series

The true outcome corresponding to each row of the probs matrix. This is used to colour the points based on whether the outcome occurred or not.

required
threshold float

The risk level at which a patient is considered high risk

required

Returns:

Type Description
DataFrame

A table containing columns "original_risk", "unstable_prob", and "outcome".

Source code in src\pyhbr\analysis\stability.py
def get_reclass_probabilities(probs: DataFrame, y_test: Series, threshold: float) -> DataFrame:
    """Get the probability of risk reclassification for each patient

    Args:
        probs: The matrix of probabilities from the model-under-test
            (first column) and the bootstrapped models (subsequent
            models).
        y_test: The true outcome corresponding to each row of the
            probs matrix. This is used to colour the points based on
            whether the outcome occurred or not.
        threshold: The risk level at which a patient is considered high risk

    Returns:
        A table containing columns "original_risk", "unstable_prob", and
            "outcome".
    """

    # For the predictions of each model, categorise patients as
    # high risk or not based on the threshold.
    high_risk = probs > threshold

    # Find the subsets of patients who were flagged as high risk
    # by the original model.
    originally_low_risk = high_risk[~high_risk.iloc[:, 0]]
    originally_high_risk = high_risk[high_risk.iloc[:, 0]]

    # Count how many of the patients remained high risk or
    # low risk in the bootstrapped models.
    stayed_high_risk = originally_high_risk.iloc[:, 1:].sum(axis=1)
    stayed_low_risk = (~originally_low_risk.iloc[:, 1:]).sum(axis=1)

    # Calculate the number of patients who changed category (category
    # unstable)
    num_resamples = probs.shape[1]
    stable_count = pd.concat([stayed_low_risk, stayed_high_risk])
    unstable_prob = (
        ((num_resamples - stable_count) / num_resamples)
        .rename("unstable_prob")
        .to_frame()
    )

    # Merge the original risk with the unstable count
    original_risk = probs.iloc[:, 0].rename("original_risk")
    return (
        original_risk.to_frame()
        .merge(unstable_prob, on="spell_id", how="left")
        .merge(y_test.rename("outcome"), on="spell_id", how="left")
    )

make_bootstrapped_resamples(X0, y0, M, random_state)

Make M resamples of the training data

Makes M bootstrapped resamples of a training set (X0, y0). M should be at least 200 (as recommended by Riley and Collins, 2022).

Parameters:

Name Type Description Default
X0 DataFrame

The features in the training set to be resampled

required
y0 DataFrame

The outcome in the training set to be resampled. Can have multiple columns (corresponding to different outcomes).

required
M int

How many resamples to take

required
random_state RandomState

Source of randomness for resampling

required

Raises:

Type Description
ValueError

If the number of rows in X0 and y0 do not match

Returns:

Type Description
Resamples

An object containing the original training set and the resamples.

Source code in src\pyhbr\analysis\stability.py
def make_bootstrapped_resamples(
    X0: DataFrame, y0: DataFrame, M: int, random_state: RandomState
) -> Resamples:
    """Make M resamples of the training data

    Makes M bootstrapped resamples of a training set (X0,y0).
    M should be at least 200 (as recommended by Riley and Collins, 2022).

    Args:
        X0: The features in the training set to be resampled
        y0: The outcome in the training set to be resampled. Can have multiple
            columns (corresponding to different outcomes).
        M: How many resamples to take
        random_state: Source of randomness for resampling

    Raises:
        ValueError: If the number of rows in X0 and y0 do not match

    Returns:
        An object containing the original training set and the resamples.
    """

    if len(X0) != len(y0):
        raise ValueError("Number of rows in X0_train and y0_train must match")
    if M < 200:
        warnings.warn("M should be at least 200; see Riley and Collins, 2022")

    Xm = []
    ym = []
    for _ in range(M):
        X, y = resample(X0, y0, random_state=random_state)
        Xm.append(X)
        ym.append(y)

    return Resamples(X0, y0, Xm, ym)
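
A minimal usage sketch with a toy training set (the column names are illustrative; M is deliberately small here, so the function will warn about M < 200):

import pandas as pd
from numpy.random import RandomState
from pyhbr.analysis import stability

rng = RandomState(0)
X0 = pd.DataFrame({"age": [60, 70, 80, 65], "prior_bleed": [0, 1, 0, 1]})
y0 = pd.DataFrame({"bleeding": [0, 1, 0, 1]})

resamples = stability.make_bootstrapped_resamples(X0, y0, M=5, random_state=rng)
print(len(resamples.Xm))      # 5 resamples
print(resamples.Xm[0].shape)  # each resample is the same size as X0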

plot_instability(ax, probs, y_test, title='Probability stability')

Plot the instability of risk predictions

This function plots a scatter graph with one point per sample in the test set (row of probs), where the x-axis is the prediction of the model under test (the first column of probs), and the y-axis is each probability predicted by the bootstrapped models Mn (the other columns of probs). The predictions from the model-under-test correspond to the straight line at 45 degrees through the origin.

For a stable model M0, the scattered points should be close to the M0 line, indicating that the bootstrapped models Mn broadly agree with the predictions made by M0.

Parameters:

Name Type Description Default
ax Axes

The axes on which to plot the risks

required
probs DataFrame

The matrix of probabilities from the model-under-test (first column) and the bootstrapped models (subsequent models).

required
y_test Series

The true outcome corresponding to each row of the probs matrix. This is used to colour the points based on whether the outcome occurred or not.

required
title

The title to place on the axes.

'Probability stability'
Source code in src\pyhbr\analysis\stability.py
def plot_instability(
    ax: Axes, probs: DataFrame, y_test: Series, title="Probability stability"
):
    """Plot the instability of risk predictions

    This function plots a scatter graph with one point
    per sample in the test set (row of probs), where the
    x-axis is the prediction of the model under test (the
    first column of probs), and the y-axis is each
    probability predicted by the bootstrapped models Mn
    (the other columns of probs). The predictions from
    the model-under-test correspond to the straight line
    at 45 degrees through the origin.

    For a stable model M0, the scattered points should be
    close to the M0 line, indicating that the bootstrapped
    models Mn broadly agree with the predictions made by M0.

    Args:
        ax: The axes on which to plot the risks
        probs: The matrix of probabilities from the model-under-test
            (first column) and the bootstrapped models (subsequent
            models).
        y_test: The true outcome corresponding to each row of the
            probs matrix. This is used to colour the points based on
            whether the outcome occurred or not.
        title: The title to place on the axes.
    """

    num_rows = probs.shape[0]
    num_cols = probs.shape[1]
    x = []
    y = []
    c = []
    # Keep track of an example point to plot
    example_risk = 1
    example_second_risk = 1
    for i in range(num_rows):
        for j in range(1, num_cols):

            # Get the pair of risks
            risk = 100 * probs.iloc[i, 0]
            second_risk = 100 * probs.iloc[i, j]

            # Keep track of the worst discrepancy
            # in the upper left quadrant
            if (
                (1.0 < risk < 10.0)
                and (second_risk > risk)
                and (second_risk / risk) > (example_second_risk / example_risk)
            ):
                example_risk = risk
                example_second_risk = second_risk

            x.append(risk)  # Model-under-test
            y.append(second_risk)  # Other bootstrapped models
            c.append(y_test.iloc[i])  # What was the actual outcome

    colour_map = {0: "b", 1: "r"}

    text = f"Model risk {example_risk:.1f}%, bootstrap risk {example_second_risk:.1f}%"
    ax.annotate(
        text,
        xy=(example_risk, example_second_risk),
        xycoords="data",
        xytext=(example_risk, 95),
        fontsize=9,
        verticalalignment="top",
        horizontalalignment="center",
        textcoords="data",
        arrowprops={"arrowstyle": "->"},
        backgroundcolor="w",
    )

    for outcome_to_plot, colour in colour_map.items():
        x_to_plot = [x for x, outcome in zip(x, c) if outcome == outcome_to_plot]
        y_to_plot = [y for y, outcome in zip(y, c) if outcome == outcome_to_plot]
        ax.scatter(x_to_plot, y_to_plot, c=colour, s=1, marker=".")

    ax.axline([0, 0], [1, 1])

    ax.set_xlim(0.01, 100)
    ax.set_ylim(0.01, 100)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())

    ax.legend(
        [
            "Did not occur",
            "Event occurred",
        ],
        markerscale=10,
        loc="lower right",
    )
    ax.set_title(title)
    ax.set_xlabel("Risk estimate from model")
    ax.set_ylabel("Risk estimates from equivalent models")

plot_reclass_instability(ax, probs, y_test, threshold, title='Stability of Risk Class')

Plot the probability of reclassification by predicted risk

Parameters:

Name Type Description Default
ax Axes

The axes on which to draw the plot

required
probs DataFrame

The matrix of probabilities from the model-under-test (first column) and the bootstrapped models (subsequent models).

required
y_test Series

The true outcome corresponding to each row of the probs matrix. This is used to colour the points based on whether the outcome occurred or not.

required
threshold float

The risk level at which a patient is considered high risk

required
title str

The plot title.

'Stability of Risk Class'
Source code in src\pyhbr\analysis\stability.py
def plot_reclass_instability(
    ax: Axes,
    probs: DataFrame,
    y_test: Series,
    threshold: float,
    title: str = "Stability of Risk Class",
):
    """Plot the probability of reclassification by predicted risk

    Args:
        ax: The axes on which to draw the plot
        probs: The matrix of probabilities from the model-under-test
            (first column) and the bootstrapped models (subsequent
            models).
        y_test: The true outcome corresponding to each row of the
            probs matrix. This is used to colour the points based on
            whether the outcome occurred or not.
        threshold: The risk level at which a patient is considered high risk
        title: The plot title.
    """

    df = get_reclass_probabilities(probs, y_test, threshold)

    x = 100*df["original_risk"]
    y = 100*df["unstable_prob"]
    c = df["outcome"]
    colour_map = {False: "b", True: "r"}

    # TODO: Plot is all black now, this can go
    for outcome_to_plot, colour in colour_map.items():
        x_to_plot = [x for x, outcome in zip(x, c) if outcome == outcome_to_plot]
        y_to_plot = [y for y, outcome in zip(y, c) if outcome == outcome_to_plot]
        ax.scatter(x_to_plot, y_to_plot, c="k", s=1, marker=".")

    # ax.legend(
    #     [
    #         "Did not occur",
    #         "Event occurred",
    #     ],
    #     markerscale=15
    # )

    # Plot the risk category threshold and label it
    ax.axline(
        [100 * threshold, 0],
        [100 * threshold, 1],
        c="r",
    )

    # Plot the 50% line for more-likely-than-not reclassification
    ax.axline([0, 50], [100, 50], c="r")

    # Get the lower axis limits
    min_risk = 100 * df["original_risk"].min()
    min_unstable_prob = 100 * df["unstable_prob"].min()

    # Plot boxes to show high and low risk groups
    # low_risk_rect = Rectangle((min_risk, min_unstable_prob), 100*threshold, 100, facecolor="g", alpha=0.3)
    # ax[1].add_patch(low_risk_rect)
    # high_risk_rect = Rectangle((100*threshold, min_unstable_prob), 100*(1 - threshold), 100, facecolor="r", alpha=0.3)
    # ax[1].add_patch(high_risk_rect)

    text_str = f"High-risk threshold ({100*threshold:.2f}%)"
    ax.text(
        100 * threshold,
        min_unstable_prob * 1.1,
        text_str,
        fontsize=9,
        rotation="vertical",
        color="r",
        horizontalalignment="center",
        verticalalignment="bottom",
        backgroundcolor="w",
    )

    text_str = f"Prob. of reclassification = 50%"
    ax.text(
        0.011,
        50,
        text_str,
        fontsize=9,
        # rotation="vertical",
        color="r",
        # horizontalalignment="center",
        verticalalignment="center",
        backgroundcolor="w",
    )

    # Calculate the number of patients who fall in each stability group.
    # Unstable means the probability of reclassification is at least 50%.
    num_high_risk = (df["original_risk"] >= threshold).sum()
    num_low_risk = (df["original_risk"] < threshold).sum()

    num_stable = (df["unstable_prob"] < 0.5).sum()
    num_unstable = (df["unstable_prob"] >= 0.5).sum()

    high_risk_and_unstable = (
        (df["original_risk"] >= threshold) & (df["unstable_prob"] >= 0.5)
    ).sum()

    high_risk_and_stable = (
        (df["original_risk"] >= threshold) & (df["unstable_prob"] < 0.5)
    ).sum()

    low_risk_and_unstable = (
        (df["original_risk"] < threshold) & (df["unstable_prob"] >= 0.5)
    ).sum()

    low_risk_and_stable = (
        (df["original_risk"] < threshold) & (df["unstable_prob"] < 0.5)
    ).sum()

    # Count the number of events in each risk group
    num_events_in_low_risk_group = df[df["original_risk"] < threshold]["outcome"].sum()
    num_events_in_high_risk_group = df[df["original_risk"] >= threshold][
        "outcome"
    ].sum()

    ax.set_xlim(0.009, 110)
    ax.set_ylim(0.9 * min_unstable_prob, 110)

    text_str = f"Unstable\nN = {low_risk_and_unstable}"
    ax.text(
        0.011,
        90,
        text_str,
        fontsize=9,
        verticalalignment="top",
        backgroundcolor="w",
    )

    text_str = f"Unstable\nN = {high_risk_and_unstable}"
    ax.text(
        90,
        90,
        text_str,
        fontsize=9,
        verticalalignment="top",
        horizontalalignment="right",
        backgroundcolor="w",
    )

    text_str = f"Stable\nN = {low_risk_and_stable}"
    ax.text(
        0.011,
        40,
        text_str,
        fontsize=9,
        verticalalignment="top",
        horizontalalignment="left",
        backgroundcolor="w",
    )

    text_str = f"Stable\nN = {high_risk_and_stable}"
    ax.text(
        90,
        40,
        text_str,
        fontsize=9,
        verticalalignment="top",
        horizontalalignment="right",
        backgroundcolor="w",
    )

    # Set axis properties
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())

    ax.set_title(title)
    ax.set_xlabel("Risk estimate from model")
    ax.set_ylabel("Probability of risk reclassification by equivalent model")

plot_stability_analysis(ax, outcome_name, probs, y_test, high_risk_thresholds)

Plot the two stability plots

Parameters:

Name Type Description Default
ax Axes

The axes on which to plot the graphs (must have two elements)

required
outcome_name str

One of "bleeding" or "ischaemia"

required
probs DataFrame

The model predictions. The first column is the model-under-test, and the other columns are the bootstrap model predictions.

required
y_test DataFrame

The outcomes table, with columns for "bleeding" and "ischaemia".

required
high_risk_thresholds dict[str, float]

Map containing the vertical risk prediction threshold for "bleeding" and "ischaemia".

required
Source code in src\pyhbr\analysis\stability.py
def plot_stability_analysis(
    ax: Axes,
    outcome_name: str,
    probs: DataFrame,
    y_test: DataFrame,
    high_risk_thresholds: dict[str, float],
):
    """Plot the two stability plots

    Args:
        ax: The axes on which to plot the graphs (must have two elements)
        outcome_name: One of "bleeding" or "ischaemia"
        probs: The model predictions. The first column is
            the model-under-test, and the other columns are
            the bootstrap model predictions.
        y_test: The outcomes table, with columns for "bleeding"
            and "ischaemia".
        high_risk_thresholds: Map containing the vertical risk
            prediction threshold for "bleeding" and "ischaemia".
    """
    plot_instability_boxes(
        ax[0],
        probs[outcome_name],
    )
    plot_reclass_instability(
        ax[1],
        probs[outcome_name],
        y_test.loc[:, outcome_name],
        high_risk_thresholds[outcome_name],
    )

predict_probabilities(fitted_model, X_test)

Predict outcome probabilities using the fitted models on the test set

Aggregating function which finds the predicted probability from the model-under-test M0 and all the bootstrapped models Mn for each sample of the test set features X_test. The result is a table where each row corresponds to a test-set sample, the first column is the predicted probability from M0, and the following columns are the predictions from each of the other Mn.

Note: the numbers in the matrix are the probabilities of 1 in the test set y_test.

Parameters:

Name Type Description Default
fitted_model FittedModel

The model fitted on the training set and resamples

required

Returns:

Type Description
DataFrame

A table of probabilities of the positive class, where each column comes from a different model. Column zero corresponds to the model fitted on the full training set, and the other columns are from the models fitted on the resamples. The index of the DataFrame is the same as X_test.

Source code in src\pyhbr\analysis\stability.py
def predict_probabilities(fitted_model: FittedModel, X_test: DataFrame) -> DataFrame:
    """Predict outcome probabilities using the fitted models on the test set

    Aggregating function which finds the predicted probability
    from the model-under-test M0 and all the bootstrapped models
    Mn for each sample of the test set features X_test. The
    result is a table where each row corresponds to
    a test-set sample, the first column is the predicted probability
    from M0, and the following columns are the predictions from each
    of the other Mn.

    Note: the numbers in the matrix are the probabilities of 1 in the
    test set y_test.

    Args:
        fitted_model: The model fitted on the training set and resamples

    Returns:
        A table of probabilities of the positive class, where each column
            comes from a different model. Column zero corresponds to the model
            fitted on the full training set, and the other columns are from the
            models fitted on the resamples. The index of the DataFrame is the
            same as X_test.
    """
    columns = []
    for m, M in enumerate(fitted_model.flatten()):
        log.info(f"Predicting test-set probabilities {m}")
        columns.append(M.predict_proba(X_test)[:, 1])

    raw_probs = np.column_stack(columns)

    df = DataFrame(raw_probs)
    df.columns = [f"prob_M{m}" for m in range(len(fitted_model.Mm) + 1)]
    df.index = X_test.index
    return df

clinical_codes

Contains utilities for clinical code groups

Category dataclass

Code/categories struct

Attributes:

Name Type Description
name str

The name of the category (e.g. I20) or clinical code (I20.1)

docs str

The description of the category or code

index str | tuple[str, str]

Used to sort a list of Categories

categories list[Category] | None

For a category, the list of sub-categories contained. None for a code.

exclude set[str] | None

Contains code groups which do not contain any members from this category or any of its sub-categories.

Source code in src\pyhbr\clinical_codes\__init__.py
@dataclass
class Category:
    """Code/categories struct

    Attributes:
        name: The name of the category (e.g. I20) or clinical code (I20.1)
        docs: The description of the category or code
        index: Used to sort a list of Categories
        categories: For a category, the list of sub-categories contained.
            None for a code.
        exclude: Contains code groups which do not contain any members
            from this category or any of its sub-categories.

    """

    name: str
    docs: str
    index: str | tuple[str, str]
    categories: list[Category] | None
    exclude: set[str] | None

    def is_leaf(self):
        """Check if the categories is a leaf node

        Returns:
            True if leaf node (i.e. clinical code), false otherwise
        """
        return self.categories is None

    def excludes(self, group: str) -> bool:
        """Check if this category excludes a code group

        Args:
            group: The string name of the group to check

        Returns:
            True if the group is excluded; False otherwise
        """
        if self.exclude is not None:
            return group in self.exclude
        else:
            return False

excludes(group)

Check if this category excludes a code group

Parameters:

Name Type Description Default
group str

The string name of the group to check

required

Returns:

Type Description
bool

True if the group is excluded; False otherwise

Source code in src\pyhbr\clinical_codes\__init__.py
def excludes(self, group: str) -> bool:
    """Check if this category excludes a code group

    Args:
        group: The string name of the group to check

    Returns:
        True if the group is excluded; False otherwise
    """
    if self.exclude is not None:
        return group in self.exclude
    else:
        return False

is_leaf()

Check if the category is a leaf node

Returns:

Type Description

True if leaf node (i.e. clinical code), false otherwise

Source code in src\pyhbr\clinical_codes\__init__.py
def is_leaf(self):
    """Check if the categories is a leaf node

    Returns:
        True if leaf node (i.e. clinical code), false otherwise
    """
    return self.categories is None

ClinicalCode dataclass

Store a clinical code together with its description.

Attributes:

Name Type Description
name str

The code itself, e.g. "I21.0"

docs str

The code description, e.g. "Acute transmural myocardial infarction of anterior wall"

Source code in src\pyhbr\clinical_codes\__init__.py
@dataclass
class ClinicalCode:
    """Store a clinical code together with its description.

    Attributes:
        name: The code itself, e.g. "I21.0"
        docs: The code description, e.g. "Acute
            transmural myocardial infarction of anterior wall"
    """

    name: str
    docs: str

    def normalise(self):
        """Return the name without whitespace/dots, as lowercase

        See the documentation for [normalise_code()][pyhbr.clinical_codes.normalise_code].

        Returns:
            The normalized form of this clinical code
        """
        return normalise_code(self.name)

normalise()

Return the name without whitespace/dots, as lowercase

See the documentation for normalise_code().

Returns:

Type Description

The normalized form of this clinical code

Source code in src\pyhbr\clinical_codes\__init__.py
def normalise(self):
    """Return the name without whitespace/dots, as lowercase

    See the documentation for [normalise_code()][pyhbr.clinical_codes.normalise_code].

    Returns:
        The normalized form of this clinical code
    """
    return normalise_code(self.name)

ClinicalCodeTree dataclass

Code definition file structure

Source code in src\pyhbr\clinical_codes\__init__.py
@serde
@dataclass
class ClinicalCodeTree:
    """Code definition file structure"""

    categories: list[Category]
    groups: set[str]

    def codes_in_group(self, group: str) -> list[ClinicalCode]:
        """Get the clinical codes in a code group

        Args:
            group: The group to fetch

        Raises:
            ValueError: Raised if the requested group does not exist

        Returns:
            The list of clinical codes in the group
        """
        if not group in self.groups:
            raise ValueError(f"'{group}' is not a valid code group ({self.groups})")

        return get_codes_in_group(group, self.categories)

codes_in_group(group)

Get the clinical codes in a code group

Parameters:

Name Type Description Default
group str

The group to fetch

required

Raises:

Type Description
ValueError

Raised if the requested group does not exist

Returns:

Type Description
list[ClinicalCode]

The list of clinical codes in the group

Source code in src\pyhbr\clinical_codes\__init__.py
def codes_in_group(self, group: str) -> list[ClinicalCode]:
    """Get the clinical codes in a code group

    Args:
        group: The group to fetch

    Raises:
        ValueError: Raised if the requested group does not exist

    Returns:
        The list of clinical codes in the group
    """
    if not group in self.groups:
        raise ValueError(f"'{group}' is not a valid code group ({self.groups})")

    return get_codes_in_group(group, self.categories)

codes_in_any_group(codes)

Get a DataFrame of all the codes in any group in a codes file

Returns a table with the normalised code (lowercase/no whitespace/no dots) in column code, and the group containing the code in the column group.

All codes which are in any group will be included.

Codes will be duplicated if they appear in more than one group.

Parameters:

Name Type Description Default
codes ClinicalCodeTree

The tree of clinical codes (e.g. ICD-10 or OPCS-4, loaded from a file) to search for codes

required

Returns:

Type Description
DataFrame

pd.DataFrame: All codes in any group in the codes file

Source code in src\pyhbr\clinical_codes\__init__.py
def codes_in_any_group(codes: ClinicalCodeTree) -> pd.DataFrame:
    """Get a DataFrame of all the codes in any group in a codes file

    Returns a table with the normalised code (lowercase/no whitespace/no
    dots) in column `code`, and the group containing the code in the
    column `group`.

    All codes which are in any group will be included.

    Codes will be duplicated if they appear in more than one group.

    Args:
        codes: The tree of clinical codes (e.g. ICD-10 or OPCS-4, loaded
            from a file) to search for codes

    Returns:
        pd.DataFrame: All codes in any group in the codes file
    """
    dfs = []
    for g in codes.groups:
        clinical_codes = codes.codes_in_group(g)
        normalised_codes = [c.normalise() for c in clinical_codes]
        docs = [c.docs for c in clinical_codes]
        df = pd.DataFrame({"code": normalised_codes, "docs": docs, "group": g})
        dfs.append(df)

    return pd.concat(dfs).reset_index(drop=True)
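
A minimal usage sketch, reusing the icd10_test.yaml file shipped with the package (see load_from_package below):

from pyhbr import clinical_codes

tree = clinical_codes.load_from_package("icd10_test.yaml")
df = clinical_codes.codes_in_any_group(tree)
print(df.columns.tolist())  # ['code', 'docs', 'group']
print(df.head())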

filter_to_groups(codes_table, codes)

Filter a table of raw clinical codes to only keep codes in groups

Use this function to drop clinical codes which are not of interest, and convert all codes to normalised form (lowercase, no whitespace, no dot).

This function is tested on the HIC dataset, but should be modifiable for use with any data source returning diagnoses and procedures as separate tables in long format. Consider modifying the columns of codes_table that are contained in the output.

Parameters:

Name Type Description Default
codes_table DataFrame

Either a diagnoses or procedures table. For this function to work, it needs:

  • A code column containing the clinical code.
  • An episode_id identifying which episode contains the code.
  • A position identifying the primary/secondary position of the code in the episode.
required
codes ClinicalCodeTree

The clinical codes object (previously loaded from a file) containing code groups to use.

required

Returns:

Type Description
DataFrame

A table containing the episode ID, the clinical code (normalised), the group containing the code, and the code position.

Source code in src\pyhbr\clinical_codes\__init__.py
def filter_to_groups(
    codes_table: pd.DataFrame, codes: ClinicalCodeTree
) -> pd.DataFrame:
    """Filter a table of raw clinical codes to only keep codes in groups

    Use this function to drop clinical codes which are not of interest,
    and convert all codes to normalised form (lowercase, no whitespace, no dot).

    This function is tested on the HIC dataset, but should be modifiable
    for use with any data source returning diagnoses and procedures as
    separate tables in long format. Consider modifying the columns of
    codes_table that are contained in the output.

    Args:
        codes_table: Either a diagnoses or procedures table. For this
            function to work, it needs:

            * A `code` column containing the clinical code.
            * An `episode_id` identifying which episode contains the code.
            * A `position` identifying the primary/secondary position of the
                code in the episode.

        codes: The clinical codes object (previously loaded from a file)
            containing code groups to use.

    Returns:
        A table containing the episode ID, the clinical code (normalised),
            the group containing the code, and the code position.

    """
    codes_with_groups = codes_in_any_group(codes)
    codes_table["code"] = codes_table["code"].apply(normalise_code)
    codes_table = pd.merge(codes_table, codes_with_groups, on="code", how="inner")
    codes_table = codes_table[["episode_id", "code", "docs", "group", "position"]]

    return codes_table
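
A minimal sketch using a toy long-format diagnosis table (the episode IDs are made up, and the codes are chosen from the icd10_test.yaml example file; only codes that fall in some group of the loaded tree survive the inner join):

import pandas as pd
from pyhbr import clinical_codes

tree = clinical_codes.load_from_package("icd10_test.yaml")

codes_table = pd.DataFrame(
    {
        "episode_id": [1, 1, 2],
        "code": ["I20.0", " I20.1 ", "Z99.9"],
        "position": [1, 2, 1],
    }
)

filtered = clinical_codes.filter_to_groups(codes_table, tree)
print(filtered)  # columns: episode_id, code (normalised), docs, group, position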

get_code_groups(diagnosis_codes, procedure_codes)

Get a table of any diagnosis/procedure code which is in a code group

This function converts the code tree formats into a simple table containing normalised codes (lowercase, no dot), the documentation string for the code, what group the code is in, and whether it is a diagnosis or procedure code

Parameters:

Name Type Description Default
diagnosis_codes ClinicalCodeTree

The tree of diagnosis codes

required
procedure_codes ClinicalCodeTree

The tree of procedure codes

required

Returns:

Type Description
DataFrame

A table with columns code, docs, group and type.

Source code in src\pyhbr\clinical_codes\__init__.py
def get_code_groups(diagnosis_codes: ClinicalCodeTree, procedure_codes: ClinicalCodeTree) -> DataFrame:
    """Get a table of any diagnosis/procedure code which is in a code group

    This function converts the code tree formats into a simple table containing 
    normalised codes (lowercase, no dot), the documentation string for the code,
    what group the code is in, and whether it is a diagnosis or procedure code

    Args:
        diagnosis_codes: The tree of diagnosis codes
        procedure_codes: The tree of procedure codes

    Returns:
        A table with columns `code`, `docs`, `group` and `type`.
    """

    diagnosis_groups = codes_in_any_group(diagnosis_codes)
    procedure_groups = codes_in_any_group(procedure_codes)
    diagnosis_groups["type"] = "diagnosis"
    procedure_groups["type"] = "procedure"
    code_groups = pd.concat([diagnosis_groups, procedure_groups]).reset_index(drop=True)
    code_groups["type"] = code_groups["type"].astype("category")
    return code_groups
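
A minimal sketch of the output format. For simplicity the same test ICD-10 tree is passed for both arguments; in a real analysis the second argument would be an OPCS-4 procedure tree loaded from its own codes file:

from pyhbr import clinical_codes

icd10 = clinical_codes.load_from_package("icd10_test.yaml")

# Passing the same tree twice is only to illustrate the output columns
code_groups = clinical_codes.get_code_groups(icd10, icd10)
print(code_groups.columns.tolist())   # ['code', 'docs', 'group', 'type']
print(code_groups["type"].unique())   # includes "diagnosis" and "procedure"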

get_codes_in_group(group, categories)

Helper function to get clinical codes in a group

Parameters:

Name Type Description Default
group str

The group to fetch

required
categories list[Category]

The list of categories to search for codes

required

Returns:

Type Description
list[ClinicalCode]

A list of clinical codes in the group

Source code in src\pyhbr\clinical_codes\__init__.py
def get_codes_in_group(group: str, categories: list[Category]) -> list[ClinicalCode]:
    """Helper function to get clinical codes in a group

    Args:
        group: The group to fetch
        categories: The list of categories to search for codes

    Returns:
        A list of clinical codes in the group
    """

    # Filter out the categories that exclude the group
    categories_left = [c for c in categories if not c.excludes(group)]

    codes_in_group = []

    # Loop over the remaining categories. For all the leaf
    # categories, if there is no exclude for this group,
    # include it in the results. For non-leaf categories,
    # call this function again and append the resulting codes
    for category in categories_left:
        if category.is_leaf() and not category.excludes(group):
            code = ClinicalCode(name=category.name, docs=category.docs)
            codes_in_group.append(code)
        else:
            sub_categories = category.categories
            # Check it is non-empty (or refactor logic)
            new_codes = get_codes_in_group(group, sub_categories)
            codes_in_group.extend(new_codes)

    return codes_in_group

load_from_file(path)

Load a clinical codes file relative to the working directory

Parameters:

Name Type Description Default
path str

The path to the codes file relative to the current working directory.

required

Returns:

Type Description
ClinicalCodeTree

The contents of the file

Source code in src\pyhbr\clinical_codes\__init__.py
def load_from_file(path: str) -> ClinicalCodeTree:
    """Load a clinical codes file relative to the working directory

    Args:
        path: The path to the codes file relative to the current
            working directory.

    Returns:
        The contents of the file
    """
    with open(path, "r") as file:
        contents = file.read()
        return from_yaml(ClinicalCodeTree, contents)

load_from_package(name)

Load a clinical codes file from the pyhbr package.

The clinical codes are stored in yaml format, and this function returns a ClinicalCodeTree corresponding to the structure of the yaml file.

Examples:

>>> import pyhbr.clinical_codes as codes
>>> tree = codes.load_from_package("icd10_test.yaml")
>>> group = tree.codes_in_group("group_1")
>>> [code.name for code in group]
['I20.0', 'I20.1', 'I20.8', 'I20.9']

Parameters:

Name Type Description Default
name str

The file name of the codes file to load

required

Returns:

Type Description
ClinicalCodeTree

The contents of the file.

Source code in src\pyhbr\clinical_codes\__init__.py
def load_from_package(name: str) -> ClinicalCodeTree:
    """Load a clinical codes file from the pyhbr package.

    The clinical codes are stored in yaml format, and this
    function returns a ClinicalCodeTree corresponding to the structure
    of the yaml file.

    Examples:
        >>> import pyhbr.clinical_codes as codes
        >>> tree = codes.load_from_package("icd10_test.yaml")
        >>> group = tree.codes_in_group("group_1")
        >>> [code.name for code in group]
        ['I20.0', 'I20.1', 'I20.8', 'I20.9']

    Args:
        name: The file name of the codes file to load

    Returns:
        The contents of the file.
    """
    contents = res_files("pyhbr.clinical_codes.files").joinpath(name).read_text()
    return from_yaml(ClinicalCodeTree, contents)

normalise_code(code)

Remove whitespace/dots, and convert to lower-case

The format of clinical codes can vary across different data sources. A simple way to compare codes is to convert them into a common format and compare them as strings. The purpose of this function is to define the common format, which uses all lower-case letters, does not contain any dots, and does not include any leading/trailing whitespace.

Comparing codes for equality does not immediately allow checking whether one code is a sub-category of another. It also ignores clinical code annotations such as dagger/asterisk.

Examples:

>>> normalise_code("  I21.0 ")
'i210'

Parameters:

Name Type Description Default
code str

The raw code, e.g. "I21.0"

required

Returns:

Type Description
str

The normalised form of the clinical code

Source code in src\pyhbr\clinical_codes\__init__.py
def normalise_code(code: str) -> str:
    """Remove whitespace/dots, and convert to lower-case

    The format of clinical codes can vary across different data
    sources. A simple way to compare codes is to convert them into
    a common format and compare them as strings. The purpose of
    this function is to define the common format, which uses all
    lower-case letters, does not contain any dots, and does not
    include any leading/trailing whitespace.

    Comparing codes for equality does not immediately allow checking
    whether one code is a sub-category of another. It also ignores
    clinical code annotations such as dagger/asterisk.

    Examples:
        >>> normalise_code("  I21.0 ")
        'i210'

    Args:
        code: The raw code, e.g. "I21.0"

    Returns:
        The normalised form of the clinical code
    """
    return code.lower().strip().replace(".", "")

codes_editor

Edit groups of ICD-10 and OPCS-4 codes

codes_editor

run_app()

Run the main codes editor application

Source code in src\pyhbr\clinical_codes\codes_editor\codes_editor.py
def run_app() -> None:
    """Run the main codes editor application
    """

    # You need one (and only one) QApplication instance per application.
    # Pass in sys.argv to allow command line arguments for your app.
    # If you know you won't use command line arguments QApplication([]) works too.
    app = QApplication(sys.argv)

    # Create a Qt widget, which will be our window.
    window = MainWindow()
    window.show()

    # Start the event loop.
    app.exec()

counting

Utilities for counting clinical codes satisfying conditions

count_code_groups(index_spells, filtered_episodes)

Count the number of matching codes relative to index episodes

This function counts the rows for each index spell ID in the output of filter_by_code_groups, and adds 0 for any index spell ID without any matching rows in filtered_episodes.

The intent is to count the number of codes (one per row) that matched filter conditions in other episodes with respect to the index spell.

Parameters:

Name Type Description Default
index_spells DataFrame

The index spells, which provide the list of spell IDs of interest. The output count will be zero for any spell ID that does not have any matching rows in filtered_episodes.

required
filtered_episodes DataFrame

The output from filter_by_code_groups, which produces a table where each row represents a matching code.

required

Returns:

Type Description
Series

How many codes (rows) occurred for each index spell

Source code in src\pyhbr\clinical_codes\counting.py
def count_code_groups(index_spells: DataFrame, filtered_episodes: DataFrame) -> Series:
    """Count the number of matching codes relative to index episodes

    This function counts the rows for each index spell ID in the output of
    filter_by_code_groups, and adds 0 for any index spell ID without
    any matching rows in filtered_episodes.

    The intent is to count the number of codes (one per row) that matched
    filter conditions in other episodes with respect to the index spell.

    Args:
        index_spells: The index spells, which provide the list of
            spell IDs of interest. The output count will be zero for any spell
            ID that does not have any matching rows in filtered_episodes.
        filtered_episodes: The output from filter_by_code_groups,
            which produces a table where each row represents a matching
            code.

    Returns:
        How many codes (rows) occurred for each index spell
    """
    df = (
        filtered_episodes.groupby("index_spell_id")
        .size()
        .rename("count")
        .to_frame()
        .reset_index(names="spell_id")
        .set_index("spell_id")
    )
    return index_spells[[]].merge(df, how="left", on="spell_id").fillna(0)["count"]
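
A minimal sketch with toy tables (in practice index_spells comes from the index-event selection and filtered_episodes from filter_by_code_groups; the IDs below are made up):

import pandas as pd
from pyhbr.clinical_codes import counting

# Index spells are identified only by their Pandas index spell_id
index_spells = pd.DataFrame(index=pd.Index(["s1", "s2", "s3"], name="spell_id"))

# One row per matching code, tagged with the index spell it relates to
filtered_episodes = pd.DataFrame(
    {
        "index_spell_id": ["s1", "s1", "s3"],
        "group": ["bleeding", "bleeding", "bleeding"],
    }
)

counts = counting.count_code_groups(index_spells, filtered_episodes)
print(counts)  # s1 -> 2, s2 -> 0, s3 -> 1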

count_events(index_spells, events, event_name)

Count the occurrences (rows) of an event given in long format.

The input table (events) contains instances of events, one per row, where event_name gives the name of a string column labelling the events. The table also contains a spell_id column, which may be associated with multiple rows.

The function pivots the events so that there is one row per spell, each event has its own column, and the table contains the total number of each event associated with the spell.

The index_spells table is required because some index spells may have no events. These index spells will have a row of zeros in the output.

Parameters:

Name Type Description Default
index_spells DataFrame

Must have Pandas index spell_id

required
events DataFrame

Contains a spell_id column and an event_name column.

required

Returns:

Type Description
DataFrame

A table of the counts for each event (one event per column), with Pandas index spell_id.

Source code in src\pyhbr\clinical_codes\counting.py
def count_events(index_spells: DataFrame, events: DataFrame, event_name: str) -> DataFrame:
    """Count the occurrences (rows) of an event given in long format.

    The input table (events) contains instances of events, one per row,
    where event_name gives the name of a string column labelling the
    events. The table also contains a `spell_id` column, which may be 
    associated with multiple rows.

    The function pivots the events so that there is one row per spell,
    each event has its own column, and the table contains the total number
    of each event associated with the spell.

    The index_spells table is required because some index spells may have
    no events. These index spells will have a row of zeros in the output.

    Args:
        index_spells: Must have Pandas index `spell_id`
        events: Contains a `spell_id` column and an event_name
            column.

    Returns:
        A table of the counts for each event (one event per column), with
            Pandas index `spell_id`.
    """

    # Pivot the events into one column per event type, counting
    # how many times each event occurred for each spell.
    nonzero_counts = (
        events.groupby("spell_id")[event_name]
        .value_counts()
        .unstack(fill_value=0)
    )
    all_counts = (
        index_spells[[]].merge(nonzero_counts, how="left", on="spell_id").fillna(0)
    )
    return all_counts
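
A minimal sketch with a toy long-format events table (the column and event names are illustrative):

import pandas as pd
from pyhbr.clinical_codes import counting

index_spells = pd.DataFrame(index=pd.Index(["s1", "s2"], name="spell_id"))

events = pd.DataFrame(
    {
        "spell_id": ["s1", "s1", "s1"],
        "medicine": ["oac", "oac", "nsaid"],
    }
)

counts = counting.count_events(index_spells, events, "medicine")
print(counts)  # one column per medicine; s2 has a row of zeros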

get_all_other_codes(index_spells, episodes, codes)

For each patient, get clinical codes in other episodes before/after the index

This makes a table of index episodes (which is the first episode of the index spell) along with all other episodes for a patient. Two columns index_episode_id and other_episode_id identify the two episodes for each row (they may be equal), and other information is stored such as the time of the base episode, the time to the other episode, and clinical code information for the other episode.

This table is used as the basis for all processing involving counting codes before and after an episode.

Note

Episodes will not be included in the result if they do not have any clinical codes that are in any code group.

Parameters:

Name Type Description Default
index_spells DataFrame

Contains spell_id as the Pandas index and episode_id as a column.

required
episodes DataFrame

Contains episode_id as an index, and patient_id and episode_start as columns

required
codes DataFrame

Contains episode_id and other code data as columns

required

Returns:

Type Description
DataFrame

A table containing columns index_episode_id, other_episode_id, index_episode_start, time_to_other_episode, and code data columns for the other episode. Note that the base episode itself is included as an other episode.

Source code in src\pyhbr\clinical_codes\counting.py
def get_all_other_codes(
    index_spells: DataFrame, episodes: DataFrame, codes: DataFrame
) -> DataFrame:
    """For each patient, get clinical codes in other episodes before/after the index

    This makes a table of index episodes (which is the first episode of the index spell)
    along with all other episodes for a patient. Two columns `index_episode_id` and
    `other_episode_id` identify the two episodes for each row (they may be equal), and
    other information is stored such as the time of the base episode, the time to the
    other episode, and clinical code information for the other episode.

    This table is used as the basis for all processing involving counting codes before
    and after an episode.

    !!! note
        Episodes will not be included in the result if they do not have any clinical
            codes that are in any code group.

    Args:
        index_spells: Contains `spell_id` as the Pandas index and `episode_id` as a column.
        episodes: Contains `episode_id` as an index, and `patient_id` and `episode_start` as columns
        codes: Contains `episode_id` and other code data as columns

    Returns:
        A table containing columns `index_episode_id`, `other_episode_id`,
            `index_episode_start`, `time_to_other_episode`, and code data columns
            for the other episode. Note that the base episode itself is included
            as an other episode.
    """

    # Remove everything but the index episode_id (in case base_episodes
    # already has the columns)
    df = index_spells.reset_index(names="spell_id").set_index("episode_id")[
        ["spell_id"]
    ]

    index_episode_info = df.merge(
        episodes[["patient_id", "episode_start"]], how="left", on="episode_id"
    ).rename(
        columns={"episode_start": "index_episode_start", "spell_id": "index_spell_id"}
    )

    other_episodes = (
        index_episode_info.reset_index(names="index_episode_id")
        .merge(
            episodes[["episode_start", "patient_id", "spell_id"]].reset_index(
                names="other_episode_id"
            ),
            how="left",
            on="patient_id",
        )
        .rename(columns={"spell_id": "other_spell_id"})
    )

    other_episodes["time_to_other_episode"] = (
        other_episodes["episode_start"] - other_episodes["index_episode_start"]
    )

    # Use an inner join to filter out other episodes that have no associated codes
    # in any group.
    with_codes = other_episodes.merge(
        codes, how="inner", left_on="other_episode_id", right_on="episode_id"
    ).drop(columns=["patient_id", "episode_start", "episode_id"])

    return with_codes

get_time_window(time_diff_table, window_start, window_end, time_diff_column='time_to_other_episode')

Get events that occurred in a time window with respect to a base event

Use the time_diff_column column to filter the time_diff_table to just those that occurred between window_start and window_end with respect to the base. For example, rows can represent an index episode paired with other episodes, with the time_diff_column representing the time to the other episode.

The arguments window_start and window_end control the minimum and maximum values for the time difference. Use positive values for a window after the base event, and use negative values for a window before the base event.

Events on the boundary of the window are included.

Note that the base event itself will be included as a row if window_start is negative and window_end is positive.

Parameters:

Name Type Description Default
time_diff_table DataFrame

Table containing at least the time_diff_column

required
window_start timedelta

The smallest value of time_diff_column that will be included in the returned table. Can be negative, meaning events before the base event will be included.

required
window_end timedelta

The largest value of time_diff_column that will be included in the returned table. Can be negative, meaning only events before the base event will be included.

required
time_diff_column str

The name of the column containing the time difference, which is positive for an event occurring after the base event.

'time_to_other_episode'

Returns:

Type Description
DataFrame

The rows within the specific time window

Source code in src\pyhbr\clinical_codes\counting.py
def get_time_window(
    time_diff_table: DataFrame,
    window_start: timedelta,
    window_end: timedelta,
    time_diff_column: str = "time_to_other_episode",
) -> DataFrame:
    """Get events that occurred in a time window with respect to a base event

    Use the time_diff_column column to filter the time_diff_table to just those
    that occurred between window_start and window_end with respect to the base. 
    For example, rows can represent an index episode paired with other episodes,
    with the time_diff_column representing the time to the other episode.

    The arguments window_start and window_end control the minimum and maximum 
    values for the time difference. Use positive values for a window after the 
    base event, and use negative values for a window before the base event.

    Events on the boundary of the window are included.

    Note that the base event itself will be included as a row if window_start
    is negative and window_end is positive.

    Args:
        time_diff_table: Table containing at least the `time_diff_column`
        window_start: The smallest value of `time_diff_column` that will be included
            in the returned table. Can be negative, meaning events before the base
            event will be included.
        window_end: The largest value of `time_diff_column` that will be included in
            the returned table. Can be negative, meaning only events before the base
            event will be included.
        time_diff_column: The name of the column containing the time difference,
            which is positive for an event occurring after the base event.

    Returns:
        The rows within the specific time window
    """
    df = time_diff_table
    return df[
        (df[time_diff_column] <= window_end) & (df[time_diff_column] >= window_start)
    ]
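
For example, a minimal sketch (assuming index_spells, episodes and codes have been prepared as described above, and that the module import path follows the source layout shown on this page):

from datetime import timedelta
from pyhbr.clinical_codes.counting import get_all_other_codes, get_time_window

all_other_codes = get_all_other_codes(index_spells, episodes, codes)

# Codes recorded up to one year after each index episode start. The index
# episode itself is included because the window boundaries are inclusive.
following_year = get_time_window(all_other_codes, timedelta(days=0), timedelta(days=365))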

common

Common utilities for other modules.

A collection of routines used by the data source or analysis functions.

CheckedTable

Wrapper for sqlalchemy table with checks for table/columns

Source code in src\pyhbr\common.py
class CheckedTable:
    """Wrapper for sqlalchemy table with checks for table/columns"""

    def __init__(self, table_name: str, engine: Engine, schema="dbo") -> None:
        """Get a CheckedTable by reading from the remote server

        This is a wrapper around the sqlalchemy Table for
        catching errors when accessing columns through the
        c attribute.

        Args:
            table_name: The name of the table whose metadata should be retrieved
            engine: The database connection

        Returns:
            The table data for use in SQL queries
        """
        self.name = table_name
        metadata_obj = MetaData(schema=schema)
        try:
            self.table = Table(self.name, metadata_obj, autoload_with=engine)
        except NoSuchTableError as e:
            raise RuntimeError(
                f"Could not find table '{e}' in database connection '{engine.url}'"
            ) from e

    def col(self, column_name: str) -> Column:
        """Get a column

        Args:
            column_name: The name of the column to fetch.

        Raises:
            RuntimeError: Thrown if the column does not exist
        """
        try:
            return self.table.c[column_name]
        except AttributeError as e:
            raise RuntimeError(
                f"Could not find column name '{column_name}' in table '{self.name}'"
            ) from e

__init__(table_name, engine, schema='dbo')

Get a CheckedTable by reading from the remote server

This is a wrapper around the sqlalchemy Table for catching errors when accessing columns through the c attribute.

Parameters:

Name Type Description Default
table_name str

The name of the table whose metadata should be retrieved

required
engine Engine

The database connection

required

Returns:

Type Description
None

The table data for use in SQL queries

Source code in src\pyhbr\common.py
def __init__(self, table_name: str, engine: Engine, schema="dbo") -> None:
    """Get a CheckedTable by reading from the remote server

    This is a wrapper around the sqlalchemy Table for
    catching errors when accessing columns through the
    c attribute.

    Args:
        table_name: The name of the table whose metadata should be retrieved
        engine: The database connection

    Returns:
        The table data for use in SQL queries
    """
    self.name = table_name
    metadata_obj = MetaData(schema=schema)
    try:
        self.table = Table(self.name, metadata_obj, autoload_with=engine)
    except NoSuchTableError as e:
        raise RuntimeError(
            f"Could not find table '{e}' in database connection '{engine.url}'"
        ) from e

col(column_name)

Get a column

Parameters:

Name Type Description Default
column_name str

The name of the column to fetch.

required

Raises:

Type Description
RuntimeError

Thrown if the column does not exist

Source code in src\pyhbr\common.py
def col(self, column_name: str) -> Column:
    """Get a column

    Args:
        column_name: The name of the column to fetch.

    Raises:
        RuntimeError: Thrown if the column does not exist
    """
    try:
        return self.table.c[column_name]
    except AttributeError as e:
        raise RuntimeError(
            f"Could not find column name '{column_name}' in table '{self.name}'"
        ) from e
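
A brief usage sketch (the table and column names are taken from the HIC queries later on this page):

from sqlalchemy import select
from pyhbr.common import CheckedTable, make_engine

engine = make_engine(database="hic_cv_test")
table = CheckedTable("cv1_demographics", engine)

# col() raises a RuntimeError naming the table if the column does not exist
stmt = select(table.col("subject"), table.col("year_of_birth"))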

chunks(patient_ids, n)

Divide a list of patient ids into n-sized chunks

The last chunk may be shorter.

Parameters:

Name Type Description Default
patient_ids list[str]

The List of IDs to chunk

required
n int

The chunk size.

required

Returns:

Type Description
list[list[str]]

A list containing chunks (list) of patient IDs

Source code in src\pyhbr\common.py
def chunks(patient_ids: list[str], n: int) -> list[list[str]]:
    """Divide a list of patient ids into n-sized chunks

    The last chunk may be shorter.

    Args:
        patient_ids: The List of IDs to chunk
        n: The chunk size.

    Returns:
        A list containing chunks (list) of patient IDs
    """
    return [patient_ids[i : i + n] for i in range(0, len(patient_ids), n)]
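
For example:

chunks(["p1", "p2", "p3", "p4", "p5"], 2)
# -> [["p1", "p2"], ["p3", "p4"], ["p5"]]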

current_commit()

Get current commit.

Returns:

Type Description
str

The first 11 characters of the current commit hash, taken from the first repository found above the current working directory. If the working directory is not in a git repository, "nogit" is returned.

Source code in src\pyhbr\common.py
def current_commit() -> str:
    """Get current commit.

    Returns:
        The first 11 characters of the current commit hash,
            taken from the first repository found above the current
            working directory. If the working directory is not
            in a git repository, "nogit" is returned.
    """
    try:
        repo = Repo(search_parent_directories=True)
        sha = repo.head.object.hexsha[0:11]
        return sha
    except InvalidGitRepositoryError:
        return "nogit"

current_timestamp()

Get the current timestamp.

Returns:

Type Description
int

The current timestamp (seconds since the epoch), truncated to a whole second.

Source code in src\pyhbr\common.py
def current_timestamp() -> int:
    """Get the current timestamp.

    Returns:
        The current timestamp (seconds since the epoch),
            truncated to a whole second.
    """
    return int(time())

get_data(engine, query, *args)

Convenience function to make a query and fetch data.

Wraps a function like hic.demographics_query with a call to pandas.read_sql.

Parameters:

Name Type Description Default
engine Engine

The database connection

required
query Callable[[Engine, ...], Select]

A function returning a sqlalchemy Select statement

required
*args ...

Positional arguments to be passed to query in addition to engine (which is passed first). Make sure they are passed in the same order expected by the query function.

()

Returns:

Type Description
DataFrame

The pandas dataframe containing the SQL data

Source code in src\pyhbr\common.py
def get_data(
    engine: Engine, query: Callable[[Engine, ...], Select], *args: ...
) -> DataFrame:
    """Convenience function to make a query and fetch data.

    Wraps a function like hic.demographics_query with a
    call to pandas.read_sql.

    Args:
        engine: The database connection
        query: A function returning a sqlalchemy Select statement
        *args: Positional arguments to be passed to query in addition
            to engine (which is passed first). Make sure they are passed
            in the same order expected by the query function.

    Returns:
        The pandas dataframe containing the SQL data
    """
    stmt = query(engine, *args)
    df = read_sql(stmt, engine)

    # Convert the column names to regular strings instead
    # of sqlalchemy.sql.elements.quoted_name. This avoids
    # an error down the line in sklearn, which cannot
    # process sqlalchemy column title tuples.
    df.columns = [str(col) for col in df.columns]

    return df
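
A short usage sketch (assuming package-relative import paths matching the source locations shown on this page):

from pyhbr.common import make_engine, get_data
from pyhbr.data_source import hic

engine = make_engine()
demographics = get_data(engine, hic.demographics_query)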

get_data_by_patient(engine, query, patient_ids, *args)

Fetch data using a query restricted by patient ID

The patient_id list is chunked into 2000 long batches to fit within an SQL IN clause, and each chunk is run as a separate query. The results are assembled into a single DataFrame.

Parameters:

Name Type Description Default
engine Engine

The database connection

required
query Callable[[Engine, ...], Select]

A function returning a sqlalchemy Select statement. Must take a list[str] as an argument after engine.

required
patient_ids list[str]

A list of patient IDs to restrict the query.

required
*args ...

Further positional arguments that will be passed to the query function after the patient_ids positional argument.

()

Returns:

Type Description
list[DataFrame]

A list of dataframes, one corresponding to each chunk.

Source code in src\pyhbr\common.py
def get_data_by_patient(
    engine: Engine,
    query: Callable[[Engine, ...], Select],
    patient_ids: list[str],
    *args: ...,
) -> list[DataFrame]:
    """Fetch data using a query restricted by patient ID

    The patient_id list is chunked into 2000 long batches to fit
    within an SQL IN clause, and each chunk is run as a separate
    query. The results are assembled into a single DataFrame.

    Args:
        engine: The database connection
        query: A function returning a sqlalchemy Select statement. Must
            take a list[str] as an argument after engine.
        patient_ids: A list of patient IDs to restrict the query.
        *args: Further positional arguments that will be passed to the
            query function after the patient_ids positional argument.

    Returns:
        A list of dataframes, one corresponding to each chunk.
    """
    dataframes = []
    patient_id_chunks = chunks(patient_ids, 2000)
    num_chunks = len(patient_id_chunks)
    chunk_count = 1
    for chunk in patient_id_chunks:
        print(f"Fetching chunk {chunk_count}/{num_chunks}")
        dataframes.append(get_data(engine, query, chunk, *args))
        chunk_count += 1
    return dataframes
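
A usage sketch; restricted_query here is hypothetical and stands for any query function whose first argument after engine is a list of patient IDs (used in an SQL IN clause):

import pandas as pd

frames = get_data_by_patient(engine, restricted_query, patient_ids)
all_rows = pd.concat(frames)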

get_saved_files_by_name(name, save_dir, extension)

Get all saved data files matching name

Get the list of files in the save_dir folder matching name. Return the result as a table of file path, commit hash, and saved date. The table is sorted by timestamp, with the most recent file first.

Raises:

Type Description
RuntimeError

If save_dir does not exist, or there are files in save_dir with invalid file names (not in the format name_commit_timestamp.extension).

Parameters:

Name Type Description Default
name str

The name of the saved file to load. This matches name in the filename name_commit_timestamp.pkl.

required
save_dir str

The directory to search for files.

required
extension str

What file extension to look for. Do not include the dot.

required

Returns:

Type Description
DataFrame

A dataframe with columns path, commit and created_date.

Source code in src\pyhbr\common.py
def get_saved_files_by_name(name: str, save_dir: str, extension: str) -> DataFrame:
    """Get all saved data files matching name

    Get the list of files in the save_dir folder matching
    name. Return the result as a table of file path, commit
    hash, and saved date. The table is sorted by timestamp,
    with the most recent file first.

    Raises:
        RuntimeError: If save_dir does not exist, or there are files
            in save_dir with invalid file names (not in the format
            name_commit_timestamp.extension).

    Args:
        name: The name of the saved file to load. This matches name in
            the filename name_commit_timestamp.pkl.
        save_dir: The directory to search for files.
        extension: What file extension to look for. Do not include the dot.

    Returns:
        A dataframe with columns `path`, `commit` and `created_date`.
    """

    # Check for missing datasets directory
    if not os.path.isdir(save_dir):
        raise RuntimeError(
            f"Missing folder '{save_dir}'. Check your working directory."
        )

    # Read all the .pkl files in the directory
    files = DataFrame({"path": os.listdir(save_dir)})

    # Identify the file name part. The horrible regex matches the
    # expression _[commit_hash]_[timestamp].pkl. It is important to
    # match this part, because "anything" can happen in the name part
    # (including underscores and letters and numbers), so splitting on
    # _ would not work. The name can then be removed.
    files["name"] = files["path"].str.replace(
        rf"_([0-9]|[a-zA-Z])*_\d*\.{extension}", "", regex=True
    )

    # Remove all the files whose name does not match, and drop
    # the name from the path
    files = files[files["name"] == name]
    if files.shape[0] == 0:
        raise ValueError(
            f"There is no file with the name '{name}' in the datasets directory"
        )
    files["commit_and_timestamp"] = files["path"].str.replace(name + "_", "")

    # Split the commit and timestamp up (note also the extension)
    try:
        files[["commit", "timestamp", "extension"]] = files[
            "commit_and_timestamp"
        ].str.split(r"_|\.", expand=True)
    except Exception as exc:
        raise RuntimeError(
            "Failed to parse files in the datasets folder. "
            "Ensure that all files have the correct format "
            "name_commit_timestamp.extension, and "
            "remove any files not matching this "
            "pattern. TODO handle this error properly, "
            "see save_datasets.py."
        ) from exc

    files["created_date"] = to_datetime(files["timestamp"].astype(int), unit="s")
    recent_first = files.sort_values(by="timestamp", ascending=False).reset_index()[
        ["path", "commit", "created_date"]
    ]
    return recent_first

load_exact_item(name, save_dir='save_data')

Load a previously saved item (pickle) from file by exact filename

This is similar to load_item, but loads the exact filename given by name instead of looking for the most recent file. name must contain the commit, timestamp, and file extension.

A RuntimeError is raised if the file does not exist.

To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.

Parameters:

Name Type Description Default
name str

The name of the item to load

required
save_dir str

Which folder to load the item from.

'save_data'

Returns:

Type Description
Any

The data item loaded.

Source code in src\pyhbr\common.py
def load_exact_item(
    name: str, save_dir: str = "save_data"
) -> Any:
    """Load a previously saved item (pickle) from file by exact filename

    This is similar to load_item, but loads the exact filename given by name
    instead of looking for the most recent file. name must contain the
    commit, timestamp, and file extension.

    A RuntimeError is raised if the file does not exist.

    To load an item that is an object from a library (e.g. a pandas DataFrame),
    the library must be installed (otherwise you will get a ModuleNotFound
    exception). However, you do not have to import the library before calling this
    function.

    Args:
        name: The name of the item to load
        save_dir: Which folder to load the item from.

    Returns:
        The data item loaded. 

    """

    # Make the path to the file
    file_path = Path(save_dir) / Path(name)

    # If the file does not exist, raise an error
    if not file_path.exists():
        raise RuntimeError(f"The file {name} does not exist in the directory {save_dir}")

    # Load a generic pickle. Note that if this is a pandas dataframe,
    # pandas must be installed (otherwise you will get module not found).
    # The same goes for a pickle storing an object from any other library.
    with open(file_path, "rb") as file:
        return pickle.load(file)

load_item(name, interactive=False, save_dir='save_data')

Load a previously saved item (pickle) from file

Use this function to load a file that was previously saved using save_item(). By default, the latest version of the item will be returned (the one with the most recent timestamp).

None is returned if an interactive load is cancelled by the user.

To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.

Parameters:

Name Type Description Default
name str

The name of the item to load

required
interactive bool

If True, let the user pick which item version to load interactively. If False, non-interactively load the most recent item (i.e. with the most recent timestamp). The commit hash is not considered when loading the item.

False
save_dir str

Which folder to load the item from.

'save_data'

Returns:

Type Description
(Any, Path)

A tuple, with the python object loaded from file as the first element and the Path to the item as the second element, or (None, None) if the user cancelled an interactive load.

Source code in src\pyhbr\common.py
def load_item(
    name: str, interactive: bool = False, save_dir: str = "save_data"
) -> (Any, Path):
    """Load a previously saved item (pickle) from file

    Use this function to load a file that was previously saved using
    save_item(). By default, the latest version of the item will be returned
    (the one with the most recent timestamp).

    None is returned if an interactive load is cancelled by the user.

    To load an item that is an object from a library (e.g. a pandas DataFrame),
    the library must be installed (otherwise you will get a ModuleNotFound
    exception). However, you do not have to import the library before calling this
    function.

    Args:
        name: The name of the item to load
        interactive: If True, let the user pick which item version to load interactively.
            If False, non-interactively load the most recent item (i.e. with the most
            recent timestamp). The commit hash is not considered when loading the item.
        save_dir: Which folder to load the item from.

    Returns:
        A tuple, with the python object loaded from file as the first element and the
            Path to the item as the second element, or (None, None) if the user
            cancelled an interactive load.

    """
    if interactive:
        item_path = pick_saved_file_interactive(name, save_dir, "pkl")
    else:
        item_path = pick_most_recent_saved_file(name, save_dir, "pkl")

    if item_path is None:
        print("Aborted (interactive) load item")
        return None, None

    print(f"Loading {item_path}")

    # Load a generic pickle. Note that if this is a pandas dataframe,
    # pandas must be installed (otherwise you will get module not found).
    # The same goes for a pickle storing an object from any other library.
    with open(item_path, "rb") as file:
        return pickle.load(file), item_path
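
For example (the item name "example_item" is illustrative):

# Non-interactively load the most recently saved version
item, item_path = load_item("example_item")

# Or pick a specific saved version interactively
item, item_path = load_item("example_item", interactive=True)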

load_most_recent_data_files(analysis_name, save_dir)

Load the most recent timestamp data file matching the analysis name

The data file is a pickle of a dictionary, containing pandas DataFrames and other metadata. It is expected to contain a "raw_file" key, which contains the path to the associated raw data file.

Both files are loaded, and a tuple of all the data is returned

Parameters:

Name Type Description Default
analysis_name str

The "analysis_name" key from the config file, which is the filename prefix

required
save_dir str

The folder to load the data from

required

Returns:

Type Description
(dict[str, Any], dict[str, Any], str)

(data, raw_data, data_path). data and raw_data are dictionaries containing (mainly) Pandas DataFrames, and data_path is the path to the data file (this can be stored in any output products to record which data file was used to generate them).

Source code in src\pyhbr\common.py
def load_most_recent_data_files(analysis_name: str, save_dir: str) -> (dict[str, Any], dict[str, Any], str):
    """Load the most recent timestamp data file matching the analysis name

    The data file is a pickle of a dictionary, containing pandas DataFrames and
    other metadata. It is expected to contain a "raw_file" key, which contains
    the path to the associated raw data file.

    Both files are loaded, and a tuple of all the data is returned

    Args:
        analysis_name: The "analysis_name" key from the config file, which is the filename prefix
        save_dir: The folder to load the data from

    Returns:
        (data, raw_data, data_path). data and raw_data are dictionaries containing
            (mainly) Pandas DataFrames, and data_path is the path to the data
            file (this can be stored in any output products to record which
            data file was used to generate them).
    """

    item_name = f"{analysis_name}_data"
    log.info(f"Loading most recent data file '{item_name}'")
    data, data_path = load_item(item_name, save_dir=save_dir)

    raw_file = data["raw_file"]
    log.info(f"Loading the underlying raw data file '{raw_file}'")
    raw_data = load_exact_item(raw_file, save_dir=save_dir)

    log.info(f"Items in the data file {data.keys()}")
    log.info(f"Items in the raw data file: {raw_data.keys()}")

    return data, raw_data, data_path
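
A usage sketch (the analysis name "example_analysis" is illustrative; it would normally come from the config file):

data, raw_data, data_path = load_most_recent_data_files("example_analysis", "save_data")
print(data.keys(), raw_data.keys())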

make_engine(con_string='mssql+pyodbc://dsn', database='hic_cv_test')

Make a sqlalchemy engine

This function is intended for use with Microsoft SQL Server. The preferred method to connect to the server on Windows is to use a Data Source Name (DSN). To use the default connection string argument, set up a data source name called "dsn" using the program "ODBC Data Sources".

If you need to access multiple different databases on the same server, you will need different engines. Specify the database name while creating the engine (this will override a default database in the DSN, if there is one).

Parameters:

Name Type Description Default
con_string str

The sqlalchemy connection string.

'mssql+pyodbc://dsn'
database str

The database name to connect to.

'hic_cv_test'

Returns:

Type Description
Engine

The sqlalchemy engine

Source code in src\pyhbr\common.py
def make_engine(
    con_string: str = "mssql+pyodbc://dsn", database: str = "hic_cv_test"
) -> Engine:
    """Make a sqlalchemy engine

    This function is intended for use with Microsoft SQL
    Server. The preferred method to connect to the server
    on Windows is to use a Data Source Name (DSN). To use the
    default connection string argument, set up a data source
    name called "dsn" using the program "ODBC Data Sources".

    If you need to access multiple different databases on the
    same server, you will need different engines. Specify the
    database name while creating the engine (this will override
    a default database in the DSN, if there is one).

    Args:
        con_string: The sqlalchemy connection string.
        database: The database name to connect to.

    Returns:
        The sqlalchemy engine
    """
    connect_args = {"database": database}
    return create_engine(con_string, connect_args=connect_args)
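
For example, with a DSN named "dsn" set up in ODBC Data Sources (the alternative database name below is illustrative):

engine = make_engine()  # default DSN and database

# Connect to a different database on the same server
other_engine = make_engine(database="other_database")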

make_new_save_item_path(name, save_dir, extension)

Make the path to save a new item to the save_dir

The name will have the format name_{current_commit}_{timestamp}.{extension}.

Parameters:

Name Type Description Default
name str

The base name for the new filename

required
save_dir str

The folder in which to place the item

required
extension str

The file extension (omit the dot)

required

Returns:

Type Description
Path

The relative path to the new object to be saved

Source code in src\pyhbr\common.py
def make_new_save_item_path(name: str, save_dir: str, extension: str) -> Path:
    """Make the path to save a new item to the save_dir

    The name will have the format name_{current_commit}_{timestamp}.{extension}.

    Args:
        name: The base name for the new filename
        save_dir: The folder in which to place the item
        extension: The file extension (omit the dot)

    Returns:
        The relative path to the new object to be saved
    """

    # Make the file suffix out of the current git
    # commit hash and the current time
    filename = f"{name}_{current_commit()}_{current_timestamp()}.{extension}"
    return Path(save_dir) / Path(filename)
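
For example (the commit hash and timestamp in the comment are illustrative):

path = make_new_save_item_path("example_item", "save_data", "pkl")
# e.g. Path("save_data/example_item_1a2b3c4d5e6_1700000000.pkl")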

mean_confidence_interval(data, confidence=0.95)

Compute the confidence interval around the mean

Parameters:

Name Type Description Default
data Series

A series of numerical values to compute the confidence interval.

required
confidence float

The confidence interval to compute.

0.95

Returns:

Type Description
dict[str, float]

A map containing the keys "mean", "confidence", "lower", and "upper". The "lower" and "upper" keys contain the confidence interval limits.

Source code in src\pyhbr\common.py
def mean_confidence_interval(
    data: Series, confidence: float = 0.95
) -> dict[str, float]:
    """Compute the confidence interval around the mean

    Args:
        data: A series of numerical values to compute the confidence interval.
        confidence: The confidence interval to compute.

    Returns:
        A map containing the keys "mean", "confidence", "lower", and "upper".
            The "lower" and "upper" keys contain the confidence interval limits.
    """
    a = 1.0 * np.array(data)
    n = len(a)
    mean = np.mean(a)
    standard_error = scipy.stats.sem(a)

    # Check this
    half_width = standard_error * scipy.stats.t.ppf((1 + confidence) / 2.0, n - 1)
    return {
        "mean": mean,
        "confidence": confidence,
        "lower": mean - half_width,
        "upper": mean + half_width,
    }
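
For example:

from pandas import Series

ci = mean_confidence_interval(Series([1.2, 0.9, 1.1, 1.4, 0.8]), confidence=0.95)
# ci["mean"] is the sample mean; ci["lower"] and ci["upper"] bound the
# 95% confidence interval for the mean (based on the t-distribution)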

median_to_string(instability, unit='%')

Convert the median-quartile DataFrame to a String

Parameters:

Name Type Description Default
instability DataFrame

Table containing three rows, indexed by 0.5 (median), 0.025 (lower 2.5% quantile) and 0.975 (upper 97.5% quantile).

required
unit

What units to add to the values in the string.

'%'

Returns:

Type Description
str

A string containing the median, and the lower and upper quartiles.

Source code in src\pyhbr\common.py
def median_to_string(instability: DataFrame, unit="%") -> str:
    """Convert the median-quartile DataFrame to a String

    Args:
        instability: Table containing three rows, indexed by
            0.5 (median), 0.025 (lower 2.5% quantile) and 0.975
            (upper 97.5% quantile).
        unit: What units to add to the values in the string.

    Returns:
        A string containing the median, and the lower and upper
            quartiles.
    """
    return f"{instability.loc[0.5]:.2f}{unit} Q [{instability.loc[0.025]:.2f}{unit}, {instability.loc[0.975]:.2f}{unit}]"

pick_most_recent_saved_file(name, save_dir, extension='pkl')

Get the path to the most recent file matching name.

Like pick_saved_file_interactive, but automatically selects the most recent file in save_dir.

Parameters:

Name Type Description Default
name str

The name of the saved file to list

required
save_dir str

The directory to search for files

required
extension str

What file extension to look for. Do not include the dot.

'pkl'

Returns:

Type Description
Path

The relative path to the most recent matching file.

Source code in src\pyhbr\common.py
def pick_most_recent_saved_file(
    name: str, save_dir: str, extension: str = "pkl"
) -> Path:
    """Get the path to the most recent file matching name.

    Like pick_saved_file_interactive, but automatically selects the most
    recent file in save_dir.

    Args:
        name: The name of the saved file to list
        save_dir: The directory to search for files
        extension: What file extension to look for. Do not include the dot.

    Returns:
        The relative path to the most recent matching file.
    """
    recent_first = get_saved_files_by_name(name, save_dir, extension)
    return Path(save_dir) / Path(recent_first.loc[0, "path"])

pick_saved_file_interactive(name, save_dir, extension='pkl')

Select a file matching name interactively

Print a list of the saved items in the save_dir folder, along with the date and time it was generated, and the commit hash, and let the user pick which item should be loaded interactively. The full filename of the resulting file is returned, which can then be read by the user.

Parameters:

Name Type Description Default
name str

The name of the saved file to list

required
save_dir str

The directory to search for files

required
extension str

What file extension to look for. Do not include the dot.

'pkl'

Returns:

Type Description
str | None

The absolute path to the interactively selected file, or None if the interactive load was aborted.

Source code in src\pyhbr\common.py
def pick_saved_file_interactive(
    name: str, save_dir: str, extension: str = "pkl"
) -> str | None:
    """Select a file matching name interactively

    Print a list of the saved items in the save_dir folder, along
    with the date and time it was generated, and the commit hash,
    and let the user pick which item should be loaded interactively.
    The full filename of the resulting file is returned, which can
    then be read by the user.

    Args:
        name: The name of the saved file to list
        save_dir: The directory to search for files
        extension: What file extension to look for. Do not include the dot.

    Returns:
        The absolute path to the interactively selected file, or None
            if the interactive load was aborted.
    """

    recent_first = get_saved_files_by_name(name, save_dir, extension)
    print(recent_first)

    num_datasets = recent_first.shape[0]
    while True:
        try:
            raw_choice = input(
                f"Pick a dataset to load: [{0} - {num_datasets-1}] (type q[uit]/exit, then Enter, to quit): "
            )
            if "exit" in raw_choice or "q" in raw_choice:
                return None
            choice = int(raw_choice)
        except Exception:
            print(f"{raw_choice} is not valid; try again.")
            continue
        if choice < 0 or choice >= num_datasets:
            print(f"{choice} is not in range; try again.")
            continue
        break

    full_path = os.path.join(save_dir, recent_first.loc[choice, "path"])
    return full_path

query_yes_no(question, default='yes')

Ask a yes/no question via raw_input() and return their answer.

From https://stackoverflow.com/a/3041990.

"question" is a string that is presented to the user. "default" is the presumed answer if the user just hits . It must be "yes" (the default), "no" or None (meaning an answer is required of the user).

The "answer" return value is True for "yes" or False for "no".

Source code in src\pyhbr\common.py
def query_yes_no(question, default="yes"):
    """Ask a yes/no question via raw_input() and return their answer.

    From https://stackoverflow.com/a/3041990.

    "question" is a string that is presented to the user.
    "default" is the presumed answer if the user just hits <Enter>.
            It must be "yes" (the default), "no" or None (meaning
            an answer is required of the user).

    The "answer" return value is True for "yes" or False for "no".
    """
    valid = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
    if default is None:
        prompt = " [y/n] "
    elif default == "yes":
        prompt = " [Y/n] "
    elif default == "no":
        prompt = " [y/N] "
    else:
        raise ValueError("invalid default answer: '%s'" % default)

    while True:
        sys.stdout.write(question + prompt)
        choice = input().lower()
        if default is not None and choice == "":
            return valid[default]
        elif choice in valid:
            return valid[choice]
        else:
            sys.stdout.write("Please respond with 'yes' or 'no' " "(or 'y' or 'n').\n")

read_config_file(yaml_path)

Read the configuration file from yaml_path

Parameters:

Name Type Description Default
yaml_path str

The path to the experiment config file

required
Source code in src\pyhbr\common.py
def read_config_file(yaml_path: str):
    """Read the configuration file from

    Args:
        yaml_path: The path to the experiment config file
    """
    # Read the configuration file
    with open(yaml_path) as stream:
        try:
            return yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            print(f"Failed to load config file: {exc}")
            exit(1)
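
A usage sketch (the file name is illustrative; "analysis_name" is the config key referred to elsewhere on this page):

config = read_config_file("example_config.yaml")
analysis_name = config["analysis_name"]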

requires_commit()

Check whether changes need committing

To make most effective use of the commit hash stored with a save_item call, the current branch should be clean (all changes committed). Call this function to check.

Returns False if there is no git repository.

Returns:

Type Description
bool

True if the working directory is in a git repository that requires a commit; False otherwise.

Source code in src\pyhbr\common.py
def requires_commit() -> bool:
    """Check whether changes need committing

    To make most effective use of the commit hash stored with a
    save_item call, the current branch should be clean (all changes
    committed). Call this function to check.

    Returns False if there is no git repository.

    Returns:
        True if the working directory is in a git repository that requires
            a commit; False otherwise.
    """
    try:
        repo = Repo(search_parent_directories=True)
        return repo.is_dirty(untracked_files=True)
    except InvalidGitRepositoryError:
        # No need to commit if not repository
        return False

save_item(item, name, save_dir='save_data/', enforce_clean_branch=True, prompt_commit=False)

Save an item to a pickle file

Saves a python object (e.g. a pandas DataFrame) dataframe in the save_dir folder, using a filename that includes the current timestamp and the current commit hash. Use load_item to retrieve the file.

Important

Ensure that save_data/ (or your chosen save_dir) is added to the .gitignore of your repository to ensure sensitive data is not committed.

By storing the commit hash and timestamp, it is possible to identify when items were created and what code created them. To make most effective use of the commit hash, ensure that you commit, and do not make any further code edits, before running a script that calls save_item (otherwise the commit hash will not quite reflect the state of the running code).

Parameters:

Name Type Description Default
item Any

The python object to save (e.g. pandas DataFrame)

required
name str

The name of the item. The filename will be created by adding a suffix for the current commit and the timestamp to show when the data was saved (format: name_commit_timestamp.pkl)

required
save_dir str

Where to save the data, relative to the current working directory. The directory will be created if it does not exist.

'save_data/'
enforce_clean_branch

If True, the function will raise an exception if an attempt is made to save an item when the repository has uncommitted changes.

True
prompt_commit

If enforce_clean_branch is True, choose whether to prompt the user to commit on an unclean branch. This can help avoid losing the results of a long-running script. Prefer False if the script is cheap to run.

False
Source code in src\pyhbr\common.py
def save_item(
    item: Any,
    name: str,
    save_dir: str = "save_data/",
    enforce_clean_branch=True,
    prompt_commit=False,
) -> None:
    """Save an item to a pickle file

    Saves a python object (e.g. a pandas DataFrame) dataframe in the save_dir
    folder, using a filename that includes the current timestamp and the current
    commit hash. Use load_item to retrieve the file.

    !!! important
        Ensure that `save_data/` (or your chosen `save_dir`) is added to the
        .gitignore of your repository to ensure sensitive data is not committed.

    By storing the commit hash and timestamp, it is possible to identify when items
    were created and what code created them. To make most effective use of the
    commit hash, ensure that you commit, and do not make any further code edits,
    before running a script that calls save_item (otherwise the commit hash will
    not quite reflect the state of the running code).

    Args:
        item: The python object to save (e.g. pandas DataFrame)
        name: The name of the item. The filename will be created by adding
            a suffix for the current commit and the timestamp to show when the
            data was saved (format: `name_commit_timestamp.pkl`)
        save_dir: Where to save the data, relative to the current working directory.
            The directory will be created if it does not exist.
        enforce_clean_branch: If True, the function will raise an exception if an attempt
            is made to save an item when the repository has uncommitted changes.
        prompt_commit: If enforce_clean_branch is True, choose whether to prompt the
            user to commit on an unclean branch. This can help avoid losing
            the results of a long-running script. Prefer False if the script
            is cheap to run.
    """

    if enforce_clean_branch:

        abort_msg = "Aborting save_item() because branch is not clean. Commit your changes before saving item to increase the chance of reproducing the item based on the filename commit hash."

        if prompt_commit:
            # If the branch is not clean, prompt the user to commit to avoid losing
            # long-running model results. Take care to only commit if the state of
            # the repository truly reflects what was run (i.e. if no changes were made
            # while the script was running).
            while requires_commit():
                print(abort_msg)
                print(
                    "You can commit now and then retry the save after committing."
                )
                retry_save = query_yes_no(
                    "Do you want to retry the save? Commit, then select yes, or choose no to abort the save."
                )

                if not retry_save:
                    print(f"Aborting save of {name}")
                    return

            # If we get out of the loop without returning, then the branch
            # is now clean and the save can proceed.
            print("Branch now clean, proceeding to save")

        else:

            if requires_commit():
                # In this case, unconditionally throw an error
                raise RuntimeError(abort_msg)

    if not Path(save_dir).exists():
        print(f"Creating missing folder '{save_dir}' for storing item")
        Path(save_dir).mkdir(parents=True, exist_ok=True)

    path = make_new_save_item_path(name, save_dir, "pkl")
    with open(path, "wb") as file:
        print(f"Saving {str(path)}")
        pickle.dump(item, file)
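
A usage sketch (the item name is illustrative; item can be any picklable object, such as a pandas DataFrame):

save_item(item, "example_item")

# Later, reload the most recent saved version
item, item_path = load_item("example_item")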

data_source

Routines for fetching data from sources.

This module is intended to interface to the data source, and should be modified to port this package to new SQL databases.

hic

SQL queries and functions for HIC (v3, UHBW) data.

Most data available in the HIC tables is fetched in the queries below, apart from columns which are all-NULL, provide keys/IDs that will not be used, or provide duplicate information (e.g. duplicated in two tables).

demographics_query(engine)

Get demographic information from HIC data

The date/time at which the data was obtained is not stored in the table, but patient age can be computed from the date of the episode under consideration and the year_of_birth in this table.

The underlying table does have a cause_of_death column, but it is all null, so not included.

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required

Returns:

Type Description
Select

SQL query to retrieve demographics table

Source code in src\pyhbr\data_source\hic.py
def demographics_query(engine: Engine) -> Select:
    """Get demographic information from HIC data

    The date/time at which the data was obtained is
    not stored in the table, but patient age can be
    computed from the date of the episode under consideration
    and the year_of_birth in this table.

    The underlying table does have a cause_of_death column,
    but it is all null, so not included.

    Args:
        engine: the connection to the database

    Returns:
        SQL query to retrieve demographics table
    """
    table = CheckedTable("cv1_demographics", engine)
    return select(
        table.col("subject").cast(String).label("patient_id"),
        table.col("gender"),
        table.col("year_of_birth"),
        table.col("death_date"),
    )

diagnoses_query(engine)

Get the diagnoses corresponding to episodes

This should be linked to the episodes table to obtain information about the diagnoses in the episode.

Diagnoses are encoded using ICD-10 codes, and the position column contains the order of diagnoses in the episode (1-indexed).

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required

Returns:

Type Description
Select

SQL query to retrieve diagnoses table

Source code in src\pyhbr\data_source\hic.py
def diagnoses_query(engine: Engine) -> Select:
    """Get the diagnoses corresponding to episodes

    This should be linked to the episodes table to
    obtain information about the diagnoses in the episode.

    Diagnoses are encoded using ICD-10 codes, and the
    position column contains the order of diagnoses in
    the episode (1-indexed).

    Args:
        engine: the connection to the database

    Returns:
        SQL query to retrieve diagnoses table
    """
    table = CheckedTable("cv1_episodes_diagnosis", engine)
    return select(
        table.col("episode_identifier").cast(String).label("episode_id"),
        table.col("diagnosis_date_time").label("time"),
        table.col("diagnosis_position").label("position"),
        table.col("diagnosis_code_icd").label("code"),
    )

episodes_query(engine, start_date, end_date)

Get the episodes list in the HIC data

This table does not contain any episode information, just a patient and an episode id for linking to diagnosis and procedure information in other tables.

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required
start_date date

first valid consultant-episode start date

required
end_date date

last valid consultant-episode start date

required

Returns:

Type Description
Select

SQL query to retrieve episodes table

Source code in src\pyhbr\data_source\hic.py
def episodes_query(engine: Engine, start_date: date, end_date: date) -> Select:
    """Get the episodes list in the HIC data

    This table does not contain any episode information,
    just a patient and an episode id for linking to diagnosis
    and procedure information in other tables.

    Args:
        engine: the connection to the database
        start_date: first valid consultant-episode start date
        end_date: last valid consultant-episode start date

    Returns:
        SQL query to retrieve episodes table
    """
    table = CheckedTable("cv1_episodes", engine)
    return select(
        table.col("subject").cast(String).label("patient_id"),
        table.col("episode_identifier").cast(String).label("episode_id"),
        table.col("spell_identifier").cast(String).label("spell_id"),
        table.col("episode_start_time").label("episode_start"),
        table.col("episode_end_time").label("episode_end"),
        table.col("admission_date_time").label("admission"),
        table.col("discharge_date_time").label("discharge"),
    ).where(
        table.col("episode_start_time") >= start_date,
        table.col("episode_end_time") <= end_date,
    )
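
For example, combined with get_data to fetch episodes in a date range (the dates are illustrative):

from datetime import date

episodes = get_data(engine, episodes_query, date(2020, 1, 1), date(2023, 1, 1))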

pathology_blood_query(engine, investigations)

Get the table of blood test results in the HIC data

Since blood tests in this table are not associated with an episode directly by key, it is necessary to link them based on the patient identifier and date. This operation can be quite slow if the blood tests table is large. One way to reduce the size is to filter by investigation using the investigations parameter. The investigation codes in the HIC data are shown below:

investigation Description
OBR_BLS_UL LFT
OBR_BLS_UE UREA,CREAT + ELECTROLYTES
OBR_BLS_FB FULL BLOOD COUNT
OBR_BLS_UT THYROID FUNCTION TEST
OBR_BLS_TP TOTAL PROTEIN
OBR_BLS_CR C-REACTIVE PROTEIN
OBR_BLS_CS CLOTTING SCREEN
OBR_BLS_FI FIB-4
OBR_BLS_AS AST
OBR_BLS_CA CALCIUM GROUP
OBR_BLS_TS TSH AND FT4
OBR_BLS_FO SERUM FOLATE
OBR_BLS_PO PHOSPHATE
OBR_BLS_LI LIPID PROFILE
OBR_POC_VG POCT BLOOD GAS VENOUS SAMPLE
OBR_BLS_HD HDL CHOLESTEROL
OBR_BLS_FT FREE T4
OBR_BLS_FE SERUM FERRITIN
OBR_BLS_GP ELECTROLYTES NO POTASSIUM
OBR_BLS_CH CHOLESTEROL
OBR_BLS_MG MAGNESIUM
OBR_BLS_CO CORTISOL

Each test is similarly encoded. The valid test codes in the full blood count and U+E investigations are shown below:

investigation test Description
OBR_BLS_FB OBX_BLS_NE Neutrophils
OBR_BLS_FB OBX_BLS_PL Platelets
OBR_BLS_FB OBX_BLS_WB White Cell Count
OBR_BLS_FB OBX_BLS_LY Lymphocytes
OBR_BLS_FB OBX_BLS_MC MCV
OBR_BLS_FB OBX_BLS_HB Haemoglobin
OBR_BLS_FB OBX_BLS_HC Haematocrit
OBR_BLS_UE OBX_BLS_NA Sodium
OBR_BLS_UE OBX_BLS_UR Urea
OBR_BLS_UE OBX_BLS_K Potassium
OBR_BLS_UE OBX_BLS_CR Creatinine
OBR_BLS_UE OBX_BLS_EP eGFR/1.73m2 (CKD-EPI)

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required
investigations list[str]

Which types of laboratory test to include in the query. Fetching fewer types of test makes the query faster.

required

Returns:

Type Description
Select

SQL query to retrieve blood tests table

Source code in src\pyhbr\data_source\hic.py
def pathology_blood_query(engine: Engine, investigations: list[str]) -> Select:
    """Get the table of blood test results in the HIC data

    Since blood tests in this table are not associated with an episode
    directly by key, it is necessary to link them based on the patient
    identifier and date. This operation can be quite slow if the blood
    tests table is large. One way to reduce the size is to filter by
    investigation using the investigations parameter. The investigation
    codes in the HIC data are shown below:

    | `investigation` | Description                 |
    |-----------------|-----------------------------|
    | OBR_BLS_UL      |                          LFT|
    | OBR_BLS_UE      |    UREA,CREAT + ELECTROLYTES|
    | OBR_BLS_FB      |             FULL BLOOD COUNT|
    | OBR_BLS_UT      |        THYROID FUNCTION TEST|
    | OBR_BLS_TP      |                TOTAL PROTEIN|
    | OBR_BLS_CR      |           C-REACTIVE PROTEIN|
    | OBR_BLS_CS      |              CLOTTING SCREEN|
    | OBR_BLS_FI      |                        FIB-4|
    | OBR_BLS_AS      |                          AST|
    | OBR_BLS_CA      |                CALCIUM GROUP|
    | OBR_BLS_TS      |                  TSH AND FT4|
    | OBR_BLS_FO      |                SERUM FOLATE|
    | OBR_BLS_PO      |                    PHOSPHATE|
    | OBR_BLS_LI      |                LIPID PROFILE|
    | OBR_POC_VG      | POCT BLOOD GAS VENOUS SAMPLE|
    | OBR_BLS_HD      |              HDL CHOLESTEROL|
    | OBR_BLS_FT      |                      FREE T4|
    | OBR_BLS_FE      |               SERUM FERRITIN|
    | OBR_BLS_GP      |    ELECTROLYTES NO POTASSIUM|
    | OBR_BLS_CH      |                  CHOLESTEROL|
    | OBR_BLS_MG      |                    MAGNESIUM|
    | OBR_BLS_CO      |                     CORTISOL|

    Each test is similarly encoded. The valid test codes in the full
    blood count and U+E investigations are shown below:

    | `investigation` | `test`     | Description          |
    |-----------------|------------|----------------------|
    | OBR_BLS_FB      | OBX_BLS_NE |           Neutrophils|
    | OBR_BLS_FB      | OBX_BLS_PL |             Platelets|
    | OBR_BLS_FB      | OBX_BLS_WB |      White Cell Count|
    | OBR_BLS_FB      | OBX_BLS_LY |           Lymphocytes|
    | OBR_BLS_FB      | OBX_BLS_MC |                   MCV|
    | OBR_BLS_FB      | OBX_BLS_HB |           Haemoglobin|
    | OBR_BLS_FB      | OBX_BLS_HC |           Haematocrit|
    | OBR_BLS_UE      | OBX_BLS_NA |                Sodium|
    | OBR_BLS_UE      | OBX_BLS_UR |                  Urea|
    | OBR_BLS_UE      | OBX_BLS_K  |             Potassium|
    | OBR_BLS_UE      | OBX_BLS_CR |            Creatinine|
    | OBR_BLS_UE      | OBX_BLS_EP | eGFR/1.73m2 (CKD-EPI)|

    Args:
        engine: the connection to the database
        investigations: Which types of laboratory
            test to include in the query. Fetching fewer types of
            test makes the query faster.

    Returns:
        SQL query to retrieve blood tests table
    """

    table = CheckedTable("cv1_pathology_blood", engine)
    return select(
        table.col("subject").cast(String).label("patient_id"),
        table.col("investigation_code").label("investigation"),
        table.col("test_code").label("test"),
        table.col("test_result").label("result"),
        table.col("test_result_unit").label("unit"),
        table.col("sample_collected_date_time").label("sample_date"),
        table.col("result_available_date_time").label("result_date"),
        table.col("result_flag"),
        table.col("result_lower_range"),
        table.col("result_upper_range"),
    ).where(table.col("investigation_code").in_(investigations))
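
For example, to fetch only the full blood count and urea/electrolytes investigations:

blood_tests = get_data(engine, pathology_blood_query, ["OBR_BLS_FB", "OBR_BLS_UE"])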

pharmacy_prescribing_query(engine, table_name='cv1_pharmacy_prescribing')

Get medicines prescribed to patients over time

This table contains information about medicines prescribed to patients, identified by patient and time (i.e. it is not associated to an episode). The information includes the medicine name, dose (includes unit), frequency, form (e.g. tablets), route (e.g. oral), and whether the medicine was present on admission.

The most commonly occurring formats for various relevant medicines are shown in the table below:

name dose frequency drug_form route
aspirin 75 mg in the MORNING NaN Oral
aspirin 75 mg in the MORNING dispersible tablet Oral
clopidogrel 75 mg in the MORNING film coated tablets Oral
ticagrelor 90 mg TWICE a day tablets Oral
warfarin 3 mg ONCE a day at 18:00 NaN Oral
warfarin 5 mg ONCE a day at 18:00 tablets Oral
apixaban 5 mg TWICE a day tablets Oral
dabigatran etexilate 110 mg TWICE a day capsules Oral
edoxaban 60 mg in the MORNING tablets Oral
rivaroxaban 20 mg in the MORNING film coated tablets Oral

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required
table_name str

This defaults to "cv1_pharmacy_prescribing" for UHBW, but can be overwritten with "HIC_Pharmacy" for ICB.

'cv1_pharmacy_prescribing'

Returns:

Type Description
Select

SQL query to retrieve prescribing table

Source code in src\pyhbr\data_source\hic.py
def pharmacy_prescribing_query(engine: Engine, table_name: str = "cv1_pharmacy_prescribing") -> Select:
    """Get medicines prescribed to patients over time

    This table contains information about medicines 
    prescribed to patients, identified by patient and time
    (i.e. it is not associated to an episode). The information
    includes the medicine name, dose (includes unit), frequency, 
    form (e.g. tablets), route (e.g. oral), and whether the
    medicine was present on admission.

    The most commonly occurring formats for various relevant
    medicines are shown in the table below:

    | `name`       | `dose`  | `frequency`    | `drug_form`         | `route` |
    |--------------|---------|----------------|---------------------|---------|
    | aspirin      | 75 mg   | in the MORNING | NaN                 | Oral    |
    | aspirin      | 75 mg   | in the MORNING | dispersible tablet  | Oral    |
    | clopidogrel  | 75 mg   | in the MORNING | film coated tablets | Oral    |
    | ticagrelor   | 90 mg   | TWICE a day    | tablets             | Oral    |
    | warfarin     | 3 mg    | ONCE a day  at 18:00 | NaN           | Oral    |
    | warfarin     | 5 mg    | ONCE a day  at 18:00 | tablets       | Oral    |       
    | apixaban     | 5 mg    | TWICE a day          | tablets       | Oral    |
    | dabigatran etexilate | 110 mg | TWICE a day   | capsules      | Oral    |
    | edoxaban     | 60 mg   | in the MORNING       | tablets       | Oral    |
    | rivaroxaban  | 20 mg   | in the MORNING | film coated tablets | Oral    |

    Args:
        engine: the connection to the database
        table_name: This defaults to "cv1_pharmacy_prescribing" for UHBW,
            but can be overwritten with "HIC_Pharmacy" for ICB.

    Returns:
        SQL query to retrieve prescribing table
    """

    # This field name depends on UHBW vs. ICB.
    if table_name == "cv1_pharmacy_prescribing":
        patient_id_field = "subject"
    else:
        patient_id_field = "nhs_number"

    table = CheckedTable(table_name, engine)
    return select(
        table.col(patient_id_field).cast(String).label("patient_id"),
        table.col("order_date_time").label("order_date"),
        table.col("medication_name").label("name"),
        table.col("ordered_dose").label("dose"),
        table.col("ordered_frequency").label("frequency"),
        table.col("ordered_drug_form").label("drug_form"),
        table.col("ordered_route").label("route"),
        table.col("admission_medicine_y_n").label("on_admission"),
    )
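
For example, to use the ICB table name instead of the UHBW default:

prescriptions = get_data(engine, pharmacy_prescribing_query, "HIC_Pharmacy")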

procedures_query(engine)

Get the procedures corresponding to episodes

This should be linked to the episodes table to obtain information about the procedures in the episode.

Procedures are encoded using OPCS-4 codes, and the position column contains the order of procedures in the episode (1-indexed).

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required

Returns:

Type Description
Select

SQL query to retrieve procedures table

Source code in src\pyhbr\data_source\hic.py
def procedures_query(engine: Engine) -> Select:
    """Get the procedures corresponding to episodes

    This should be linked to the episodes table to
    obtain information about the procedures in the episode.

    Procedures are encoded using OPCS-4 codes, and the
    position column contains the order of procedures in
    the episode (1-indexed).

    Args:
        engine: the connection to the database

    Returns:
        SQL query to retrieve procedures table
    """
    table = CheckedTable("cv1_episodes_procedures", engine)
    return select(
        table.col("episode_identifier").cast(String).label("episode_id"),
        table.col("procedure_date_time").label("time"),
        table.col("procedure_position").label("position"),
        table.col("procedure_code_opcs").label("code"),
    )

hic_covid

SQL queries and functions for HIC (COVID-19, UHBW) data.

episodes_query(engine)

Get the episodes list in the HIC data

This table does not contain any episode information, just a patient and an episode id for linking to diagnosis and procedure information in other tables.

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required

Returns:

Type Description
Select

SQL query to retrieve episodes table

Source code in src\pyhbr\data_source\hic_covid.py
def episodes_query(engine: Engine) -> Select:
    """Get the episodes list in the HIC data

    This table does not contain any episode information,
    just a patient and an episode id for linking to diagnosis
    and procedure information in other tables.

    Args:
        engine: the connection to the database

    Returns:
        SQL query to retrieve episodes table
    """
    table = CheckedTable("cv_covid_episodes", engine)
    return select(
        table.col("NHS_NUMBER").cast(String).label("nhs_number"),
        table.col("Other Number").cast(String).label("t_number"),
        table.col("episode_identifier").cast(String).label("episode_id"),
    )

hic_icb

SQL queries and functions for HIC (ICB version)

Most data available in the HIC tables is fetched in the queries below, apart from columns which are all NULL, which provide keys/IDs that will not be used, or which duplicate information available elsewhere (e.g. the same column appearing in two tables).

Note that the lab results/pharmacy queries are in the hic.py module, because there are no changes to the query apart from the table name.

episode_id_query(engine)

Get the episodes list in the HIC data

This table is just a list of IDs to identify the data in other ICB tables.

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required

Returns:

Type Description
Select

SQL query to retrieve episodes table

Source code in src\pyhbr\data_source\hic_icb.py
def episode_id_query(engine: Engine) -> Select:
    """Get the episodes list in the HIC data

    This table is just a list of IDs to identify the data in other ICB tables.

    Args:
        engine: the connection to the database

    Returns:
        SQL query to retrieve episodes table
    """
    table = CheckedTable("hic_episodes", engine)
    return select(
        table.col("nhs_number").cast(String).label("patient_id"),
        table.col("episode_identified").cast(String).label("episode_id"),
    )

pathology_blood_query(engine, test_names)

Get the table of blood test results in the HIC data

Since blood tests in this table are not associated with an episode directly by key, it is necessary to link them based on the patient identifier and date. This operation can be quite slow if the blood tests table is large. One way to reduce the size is to restrict which tests are fetched (in this ICB version, via the test_names parameter). For reference, the investigation codes in the HIC data are shown below:

investigation Description
OBR_BLS_UL LFT
OBR_BLS_UE UREA,CREAT + ELECTROLYTES
OBR_BLS_FB FULL BLOOD COUNT
OBR_BLS_UT THYROID FUNCTION TEST
OBR_BLS_TP TOTAL PROTEIN
OBR_BLS_CR C-REACTIVE PROTEIN
OBR_BLS_CS CLOTTING SCREEN
OBR_BLS_FI FIB-4
OBR_BLS_AS AST
OBR_BLS_CA CALCIUM GROUP
OBR_BLS_TS TSH AND FT4
OBR_BLS_FO SERUM FOLATE
OBR_BLS_PO PHOSPHATE
OBR_BLS_LI LIPID PROFILE
OBR_POC_VG POCT BLOOD GAS VENOUS SAMPLE
OBR_BLS_HD HDL CHOLESTEROL
OBR_BLS_FT FREE T4
OBR_BLS_FE SERUM FERRITIN
OBR_BLS_GP ELECTROLYTES NO POTASSIUM
OBR_BLS_CH CHOLESTEROL
OBR_BLS_MG MAGNESIUM
OBR_BLS_CO CORTISOL

Each test is similarly encoded. The valid test codes in the full blood count and U+E investigations are shown below:

investigation test Description
OBR_BLS_FB OBX_BLS_NE Neutrophils
OBR_BLS_FB OBX_BLS_PL Platelets
OBR_BLS_FB OBX_BLS_WB White Cell Count
OBR_BLS_FB OBX_BLS_LY Lymphocytes
OBR_BLS_FB OBX_BLS_MC MCV
OBR_BLS_FB OBX_BLS_HB Haemoglobin
OBR_BLS_FB OBX_BLS_HC Haematocrit
OBR_BLS_UE OBX_BLS_NA Sodium
OBR_BLS_UE OBX_BLS_UR Urea
OBR_BLS_UE OBX_BLS_K Potassium
OBR_BLS_UE OBX_BLS_CR Creatinine
OBR_BLS_UE OBX_BLS_EP eGFR/1.73m2 (CKD-EPI)

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required
test_names list[str]

Unlike the UHBW version of this table, there are no investigation names here. Instead, restrict directly using the test_name field.

required

Returns:

Type Description
Engine

SQL query to retrieve blood tests table

Source code in src\pyhbr\data_source\hic_icb.py
def pathology_blood_query(engine: Engine, test_names: list[str]) -> Select:
    """Get the table of blood test results in the HIC data

    Since blood tests in this table are not associated with an episode
    directly by key, it is necessary to link them based on the patient
    identifier and date. This operation can be quite slow if the blood
    tests table is large. One way to reduce the size is to restrict which
    tests are fetched (in this ICB version, via the test_names parameter).
    For reference, the investigation codes in the HIC data are shown below:

    | `investigation` | Description                 |
    |-----------------|-----------------------------|
    | OBR_BLS_UL      |                          LFT|
    | OBR_BLS_UE      |    UREA,CREAT + ELECTROLYTES|
    | OBR_BLS_FB      |             FULL BLOOD COUNT|
    | OBR_BLS_UT      |        THYROID FUNCTION TEST|
    | OBR_BLS_TP      |                TOTAL PROTEIN|
    | OBR_BLS_CR      |           C-REACTIVE PROTEIN|
    | OBR_BLS_CS      |              CLOTTING SCREEN|
    | OBR_BLS_FI      |                        FIB-4|
    | OBR_BLS_AS      |                          AST|
    | OBR_BLS_CA      |                CALCIUM GROUP|
    | OBR_BLS_TS      |                  TSH AND FT4|
    | OBR_BLS_FO      |                SERUM FOLATE|
    | OBR_BLS_PO      |                    PHOSPHATE|
    | OBR_BLS_LI      |                LIPID PROFILE|
    | OBR_POC_VG      | POCT BLOOD GAS VENOUS SAMPLE|
    | OBR_BLS_HD      |              HDL CHOLESTEROL|
    | OBR_BLS_FT      |                      FREE T4|
    | OBR_BLS_FE      |               SERUM FERRITIN|
    | OBR_BLS_GP      |    ELECTROLYTES NO POTASSIUM|
    | OBR_BLS_CH      |                  CHOLESTEROL|
    | OBR_BLS_MG      |                    MAGNESIUM|
    | OBR_BLS_CO      |                     CORTISOL|

    Each test is similarly encoded. The valid test codes in the full
    blood count and U+E investigations are shown below:

    | `investigation` | `test`     | Description          |
    |-----------------|------------|----------------------|
    | OBR_BLS_FB      | OBX_BLS_NE |           Neutrophils|
    | OBR_BLS_FB      | OBX_BLS_PL |             Platelets|
    | OBR_BLS_FB      | OBX_BLS_WB |      White Cell Count|
    | OBR_BLS_FB      | OBX_BLS_LY |           Lymphocytes|
    | OBR_BLS_FB      | OBX_BLS_MC |                   MCV|
    | OBR_BLS_FB      | OBX_BLS_HB |           Haemoglobin|
    | OBR_BLS_FB      | OBX_BLS_HC |           Haematocrit|
    | OBR_BLS_UE      | OBX_BLS_NA |                Sodium|
    | OBR_BLS_UE      | OBX_BLS_UR |                  Urea|
    | OBR_BLS_UE      | OBX_BLS_K  |             Potassium|
    | OBR_BLS_UE      | OBX_BLS_CR |            Creatinine|
    | OBR_BLS_UE      | OBX_BLS_EP | eGFR/1.73m2 (CKD-EPI)|

    Args:
        engine: the connection to the database
        test_names: Unlike the UHBW version of this table, there are no
            investigation names here. Instead, restrict directly using
            the test_name field.

    Returns:
        SQL query to retrieve blood tests table
    """

    table = CheckedTable("HIC_BLoods", engine)
    return select(
        table.col("nhs_number").cast(String).label("patient_id"),
        table.col("test_name"),
        table.col("test_result").label("result"),
        table.col("test_result_unit").label("unit"),
        table.col("sample_collected_date_time").label("sample_date"),
        table.col("result_available_date_time").label("result_date"),
        table.col("result_lower_range"),
        table.col("result_upper_range"),
    ).where(table.col("test_name").in_(test_names))

icb

Data sources available from the BNSSG ICB. This file contains queries that fetch the raw data from the BNSSG ICB, which includes hospital episode statistics (HES) and primary care data.

This file does not include the HIC data transferred to the ICB.

clinical_code_column_name(kind, position)

Make the primary/secondary diagnosis/procedure column names

Parameters:

Name Type Description Default
kind str

Either "diagnosis" or "procedure".

required
position int

0 for primary, 1 and higher for secondaries.

required

Returns:

Type Description
str

The column name for the clinical code compatible with the ICB HES tables.

Source code in src\pyhbr\data_source\icb.py
def clinical_code_column_name(kind: str, position: int) -> str:
    """Make the primary/secondary diagnosis/procedure column names

    Args:
        kind: Either "diagnosis" or "procedure".
        position: 0 for primary, 1 and higher for secondaries.

    Returns:
        The column name for the clinical code compatible with
            the ICB HES tables.
    """

    if kind == "diagnosis":
        if position == 0:
            return "DiagnosisPrimary_ICD"

        return f"Diagnosis{ordinal(position)}Secondary_ICD"
    else:
        if position == 0:
            # reversed compared to diagnoses
            return "PrimaryProcedure_OPCS"

        # Secondaries are offset by one compared to diagnoses
        return f"Procedure{ordinal(position+1)}_OPCS"
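
For illustration, the column names produced for the first few positions are shown below (expected values given as comments):

from pyhbr.data_source.icb import clinical_code_column_name

clinical_code_column_name("diagnosis", 0)  # "DiagnosisPrimary_ICD"
clinical_code_column_name("diagnosis", 1)  # "Diagnosis1stSecondary_ICD"
clinical_code_column_name("procedure", 0)  # "PrimaryProcedure_OPCS"
clinical_code_column_name("procedure", 1)  # "Procedure2nd_OPCS"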

mortality_query(engine, start_date, end_date)

Get the mortality query, including cause of death

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
start_date date

First date of death that will be included

required
end_date date

Last date of death that will be included

required

Returns:

Type Description
Select

SQL query to retrieve the mortality table

Source code in src\pyhbr\data_source\icb.py
def mortality_query(engine: Engine, start_date: date, end_date: date) -> Select:
    """Get the mortality query, including cause of death

    Args:
        engine: The connection to the database
        start_date: First date of death that will be included
        end_date: Last date of death that will be included

    Returns:
        SQL query to retrieve the mortality table
    """

    table = CheckedTable("mortality", engine, schema="civil_registration")

    # Secondary cause of death columns
    cause_of_death_columns = [
        table.col(f"S_COD_CODE_{n}").label(f"cause_of_death_{n+1}")
        for n in range(1, 16)
    ]

    return select(
        table.col("Derived_Pseudo_NHS").cast(String).label("patient_id"),
        table.col("REG_DATE_OF_DEATH").cast(DateTime).label("date_of_death"),
        table.col("S_UNDERLYING_COD_ICD10").label("cause_of_death_1"),
        *cause_of_death_columns,
    ).where(
        table.col("REG_DATE_OF_DEATH") >= start_date,
        table.col("REG_DATE_OF_DEATH") <= end_date,
        table.col("Derived_Pseudo_NHS").is_not(None),
        table.col("Derived_Pseudo_NHS") != 9000219621,  # Invalid-patient marker    
    )

ordinal(n)

Make an ordinal like "2nd" from a number n

See https://stackoverflow.com/a/20007730.

Parameters:

Name Type Description Default
n int

The integer to convert to an ordinal string.

required

Returns:

Type Description
str

For an integer (e.g. 5), the ordinal string (e.g. "5th")

Source code in src\pyhbr\data_source\icb.py
def ordinal(n: int) -> str:
    """Make an ordinal like "2nd" from a number n

    See https://stackoverflow.com/a/20007730.

    Args:
        n: The integer to convert to an ordinal string.

    Returns:
        For an integer (e.g. 5), the ordinal string (e.g. "5th")
    """
    if 11 <= (n % 100) <= 13:
        suffix = "th"
    else:
        suffix = ["th", "st", "nd", "rd", "th"][min(n % 10, 4)]
    return str(n) + suffix
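
A short illustration of the expected output, including the 11-13 special case:

from pyhbr.data_source.icb import ordinal

[ordinal(n) for n in (1, 2, 3, 4, 11, 12, 13, 21)]
# ["1st", "2nd", "3rd", "4th", "11th", "12th", "13th", "21st"]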

primary_care_attributes_query(engine, patient_ids, gp_opt_outs)

Get primary care patient information

This is translated into an IN clause, which has an item limit. If patient_ids is longer than 2000, an error is raised. If more patient IDs are needed, split patient_ids and call this function multiple times.

The values in patient_ids must be valid (they should come from a query such as sus_query).

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
patient_ids list[str]

The list of patient identifiers to filter the nhs_number column.

required
gp_opt_outs list[str]

List of practice codes that are excluded from the data fetch (corresponds to the "practice_code" column in the table).

required

Returns:

Type Description
Select

SQL query to retrieve the primary care attributes table

Source code in src\pyhbr\data_source\icb.py
def primary_care_attributes_query(engine: Engine, patient_ids: list[str], gp_opt_outs: list[str]) -> Select:
    """Get primary care patient information

    This is translated into an IN clause, which has an item limit. 
    If patient_ids is longer than 2000, an error is raised. If 
    more patient IDs are needed, split patient_ids and call this
    function multiple times.

    The values in patient_ids must be valid (they should come from
    a query such as sus_query).

    Args:
        engine: The connection to the database
        patient_ids: The list of patient identifiers to filter
            the nhs_number column.
        gp_opt_outs: List of practice codes that are excluded
            from the data fetch (corresponds to the "practice_code"
            column in the table).

    Returns:
        SQL query to retrieve the primary care attributes table
    """
    if len(patient_ids) > 2000:
        raise ValueError("The list patient_ids must contain at most 2000 items.")

    table = CheckedTable("primary_care_attributes", engine)

    return select(
        table.col("nhs_number").cast(String).label("patient_id"),
        table.col("attribute_period").cast(DateTime).label("date"),
        table.col("homeless"),

        # No need for these, available in episodes data
        #table.col("age"),
        #table.col("sex"),

        table.col("abortion"),
        table.col("adhd"),
        table.col("af"),
        table.col("alcohol_cscore"),
        table.col("alcohol_units"),
        table.col("amputations"),
        table.col("anaemia_iron"),
        table.col("anaemia_other"),
        table.col("angio_anaph"),
        table.col("arrhythmia_other"),
        table.col("asthma"),
        table.col("autism"),
        table.col("back_pain"),
        table.col("bmi"),
        table.col("bp_date"),
        table.col("bp_reading"),
        table.col("cancer_bladder_year"),
        table.col("cancer_bladder"),
        table.col("cancer_bowel_year"),
        table.col("cancer_bowel"),
        table.col("cancer_breast_year"),
        table.col("cancer_breast"),
        table.col("cancer_cervical_year"),
        table.col("cancer_cervical"),
        table.col("cancer_giliver_year"),
        table.col("cancer_giliver"),
        table.col("cancer_headneck_year"),
        table.col("cancer_headneck"),
        table.col("cancer_kidney_year"),
        table.col("cancer_kidney"),
        table.col("cancer_leuklymph_year"),
        table.col("cancer_leuklymph"),
        table.col("cancer_lung_year"),
        table.col("cancer_lung"),
        table.col("cancer_melanoma_year"),
        table.col("cancer_melanoma"),
        table.col("cancer_metase_year"),
        table.col("cancer_metase"),
        table.col("cancer_nonmaligskin_year"),
        table.col("cancer_nonmaligskin"),
        table.col("cancer_other_year"),
        table.col("cancer_other"),
        table.col("cancer_ovarian_year"),
        table.col("cancer_ovarian"),
        table.col("cancer_prostate_year"),
        table.col("cancer_prostate"),
        table.col("cardio_other"),
        table.col("cataracts"),
        table.col("ckd"),
        table.col("coag"),
        table.col("coeliac"),
        table.col("contraception"),
        table.col("copd"),
        table.col("cystic_fibrosis"),
        table.col("dementia"),
        table.col("dep_alcohol"),
        table.col("dep_benzo"),
        table.col("dep_cannabis"),
        table.col("dep_cocaine"),
        table.col("dep_opioid"),
        table.col("dep_other"),
        table.col("depression"),
        table.col("diabetes_1"),
        table.col("diabetes_2"),
        table.col("diabetes_gest"),
        table.col("diabetes_retina"),
        table.col("disorder_eating"),
        table.col("disorder_pers"),
        table.col("dna_cpr"),
        table.col("eczema"),
        table.col("efi_category"),
        table.col("egfr"),
        table.col("endocrine_other"),
        table.col("endometriosis"),
        table.col("eol_plan"),
        table.col("epaccs"),
        table.col("epilepsy"),
        table.col("ethnicity"),
        table.col("fatigue"),
        table.col("fev1"),
        table.col("fragility"),
        table.col("gender_identity"),
        table.col("gout"),
        table.col("gppaq"),
        table.col("has_carer"),
        table.col("health_check"),
        table.col("hearing_impair"),
        table.col("hep_b"),
        table.col("hep_c"),
        table.col("hf"),
        table.col("hiv"),
        table.col("housebound"),
        table.col("ht"),
        table.col("ibd"),
        table.col("ibs"),
        table.col("ihd_mi"),
        table.col("ihd_nonmi"),
        table.col("incont_urinary"),
        table.col("infant_feeding"),
        table.col("inflam_arthritic"),
        table.col("is_carer"),
        table.col("learning_diff"),
        table.col("learning_dis"),
        table.col("live_birth"),
        table.col("liver_alcohol"),
        table.col("liver_nafl"),
        table.col("liver_other"),
        table.col("lsoa"),
        table.col("lung_restrict"),
        table.col("macular_degen"),
        table.col("marital"),
        table.col("measles_mumps"),
        table.col("migraine"),
        table.col("miscarriage"),
        table.col("mmr1"),
        table.col("mmr2"),
        table.col("mnd"),
        table.col("mrc_dyspnoea"),
        table.col("ms"),
        table.col("neuro_pain"),
        table.col("neuro_various"),
        table.col("newborn_check"),
        table.col("newborn_weight"),
        table.col("nh_rh"),
        table.col("nose"),
        table.col("obesity"),
        table.col("organ_transplant"),
        table.col("osteoarthritis"),
        table.col("osteoporosis"),
        table.col("parkinsons"),
        table.col("pelvic"),
        table.col("phys_disability"),
        table.col("poly_ovary"),
        table.col("polypharmacy_acute"),
        table.col("polypharmacy_repeat"),
        table.col("pre_diabetes"),
        table.col("pref_death"),
        table.col("pregnancy"),
        table.col("prim_language"),
        table.col("psoriasis"),
        table.col("ptsd"),
        table.col("qof_af"),
        table.col("qof_asthma"),
        table.col("qof_cancer"),
        table.col("qof_chd"),
        table.col("qof_ckd"),
        table.col("qof_copd"),
        table.col("qof_dementia"),
        table.col("qof_depression"),
        table.col("qof_diabetes"),
        table.col("qof_epilepsy"),
        table.col("qof_hf"),
        table.col("qof_ht"),
        table.col("qof_learndis"),
        table.col("qof_mental"),
        table.col("qof_obesity"),
        table.col("qof_osteoporosis"),
        table.col("qof_pad"),
        table.col("qof_pall"),
        table.col("qof_rheumarth"),
        table.col("qof_stroke"),

        # Excluding a cardiovascular risk score as not wanting to use
        # a feature that may require hidden variables to calculate.
        #table.col("qrisk2_3"),

        table.col("religion"),
        table.col("ricketts"),
        table.col("sad"),
        table.col("screen_aaa"),
        table.col("screen_bowel"),
        table.col("screen_breast"),
        table.col("screen_cervical"),
        table.col("screen_eye"),
        table.col("self_harm"),
        table.col("sexual_orient"),
        table.col("sickle"),
        table.col("smi"),
        table.col("smoking"),
        table.col("stomach"),
        table.col("stroke"),
        table.col("tb"),
        table.col("thyroid"),
        table.col("uterine"),
        table.col("vasc_dis"),
        table.col("veteran"),
        table.col("visual_impair"),
    ).where(
        table.col("nhs_number").in_(patient_ids),
        table.col("practice_code").not_in(gp_opt_outs),
    )
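
Because of the 2000-item limit on the IN clause, larger cohorts have to be fetched in chunks and concatenated, as the description above suggests. The sketch below is not part of PyHBR (the helper name is hypothetical) and assumes a SQLAlchemy engine is available.

import pandas as pd
from sqlalchemy.engine import Engine

from pyhbr.data_source.icb import primary_care_attributes_query

def fetch_attributes_in_batches(
    engine: Engine,
    patient_ids: list[str],
    gp_opt_outs: list[str],
    batch_size: int = 2000,
) -> pd.DataFrame:
    """Hypothetical helper: query the attributes in chunks of at most batch_size."""
    frames = []
    for start in range(0, len(patient_ids), batch_size):
        batch = patient_ids[start : start + batch_size]
        query = primary_care_attributes_query(engine, batch, gp_opt_outs)
        frames.append(pd.read_sql(query, engine))
    return pd.concat(frames, ignore_index=True)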

primary_care_measurements_query(engine, patient_ids, gp_opt_outs)

Get physiological measurements performed in primary care

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required
patient_ids list[str]

The list of patient identifiers to filter the nhs_number column.

required
gp_opt_outs list[str]

List of practice codes that are excluded from the data fetch (corresponds to the "practice_code" column in the table).

required

Returns:

Type Description
Select

SQL query to retrieve the primary care measurements table

Source code in src\pyhbr\data_source\icb.py
def primary_care_measurements_query(
    engine: Engine, patient_ids: list[str], gp_opt_outs: list[str]
) -> Select:
    """Get physiological measurements performed in primary care

    Args:
        engine: the connection to the database
        patient_ids: The list of patient identifiers to filter
            the nhs_number column.
        gp_opt_outs: List of practice codes that are excluded
            from the data fetch (corresponds to the "practice_code"
            column in the table).

    Returns:
        SQL query to retrieve the primary care measurements table
    """
    table = CheckedTable("measurement", engine, schema="swd")

    return select(
        table.col("nhs_number").cast(String).label("patient_id"),
        table.col("measurement_date").label("date"),
        table.col("measurement_name").label("name"),
        table.col("measurement_value").label("result"),
        table.col("measurement_group").label("group"),
    ).where(
        table.col("nhs_number").in_(patient_ids),
        table.col("practice_code").not_in(gp_opt_outs),
    )

primary_care_prescriptions_query(engine, patient_ids, gp_opt_outs)

Get medications dispensed in primary care

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required
patient_ids list[str]

The list of patient identifiers to filter the nhs_number column.

required
gp_opt_outs list[str]

List of practice codes that are excluded from the data fetch (corresponds to the "practice_code" column in the table).

required

Returns:

Type Description
Select

SQL query to retrieve the primary care prescriptions table

Source code in src\pyhbr\data_source\icb.py
def primary_care_prescriptions_query(
    engine: Engine, patient_ids: list[str], gp_opt_outs: list[str]
) -> Select:
    """Get medications dispensed in primary care

    Args:
        engine: the connection to the database
        patient_ids: The list of patient identifiers to filter
            the nhs_number column.
        gp_opt_outs: List of practice codes that are excluded
            from the data fetch (corresponds to the "practice_code"
            column in the table).

    Returns:
        SQL query to retrieve the primary care prescriptions table
    """
    table = CheckedTable("prescription", engine, schema="swd")

    return select(
        table.col("nhs_number").cast(String).label("patient_id"),
        table.col("prescription_date").cast(DateTime).label("date"),
        table.col("prescription_name").label("name"),
        table.col("prescription_quantity").label("quantity"),
        table.col("prescription_type").label("acute_or_repeat"),
    ).where(
        table.col("nhs_number").in_(patient_ids),
        table.col("practice_code").not_in(gp_opt_outs),
    )

score_seg_query(engine, patient_ids)

Get score segment information from SWD (Charlson/Cambridge score, etc.)

This is translated into an IN clause, which has an item limit. If patient_ids is longer than 2000, an error is raised. If more patient IDs are needed, split patient_ids and call this function multiple times.

The values in patient_ids must be valid (they should come from a query such as sus_query).

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
patient_ids list[str]

The list of patient identifiers to filter the nhs_number column.

required

Returns:

Type Description
Select

SQL query to retrieve the score segment table

Source code in src\pyhbr\data_source\icb.py
def score_seg_query(engine: Engine, patient_ids: list[str]) -> Select:
    """Get score segment information from SWD (Charlson/Cambridge score, etc.)

    This is translated into an IN clause, which has an item limit. 
    If patient_ids is longer than 2000, an error is raised. If 
    more patient IDs are needed, split patient_ids and call this
    function multiple times.

    The values in patient_ids must be valid (they should come from
    a query such as sus_query).

    Args:
        engine: The connection to the database
        patient_ids: The list of patient identifiers to filter
            the nhs_number column.

    Returns:
        SQL query to retrieve the score segment table
    """
    if len(patient_ids) > 2000:
        raise ValueError("The list patient_ids must contain at most 2000 items.")

    table = CheckedTable("score_seg", engine, schema="swd")

    return select(
        table.col("nhs_number").cast(String).label("patient_id"),
        table.col("attribute_period").cast(DateTime).label("date"),
        table.col("cambridge_score"),
        table.col("charlson_score"),
    ).where(
        table.col("nhs_number").in_(patient_ids),
    )

sus_query(engine, start_date, end_date)

Get the episodes list in the HES data

This table contains one episode per row. Diagnosis/procedure clinical codes are represented in wide format (one clinical code position per column), and patient demographic information is also included.

Parameters:

Name Type Description Default
engine Engine

the connection to the database

required
start_date date

first valid consultant-episode start date

required
end_date date

last valid consultant-episode start date

required

Returns:

Type Description
Select

SQL query to retrieve episodes table

Source code in src\pyhbr\data_source\icb.py
def sus_query(engine: Engine, start_date: date, end_date: date) -> Select:
    """Get the episodes list in the HES data

    This table contains one episode per row. Diagnosis/procedure clinical
    codes are represented in wide format (one clinical code position per
    column), and patient demographic information is also included.

    Args:
        engine: the connection to the database
        start_date: first valid consultant-episode start date
        end_date: last valid consultant-episode start date

    Returns:
        SQL query to retrieve episodes table
    """
    table = CheckedTable("vw_apc_sem_001", engine)

    # Standard columns containing IDs, dates, patient demographics, etc
    columns = [
        table.col("AIMTC_Pseudo_NHS").cast(String).label("patient_id"),
        table.col("AIMTC_Age").cast(String).label("age"),
        table.col("Sex").cast(String).label("gender"),
        table.col("PBRspellID").cast(String).label("spell_id"),
        table.col("StartDate_ConsultantEpisode").label("episode_start"),
        table.col("EndDate_ConsultantEpisode").label("episode_end"),
        # Using the start and the end of the spells as admission/discharge
        # times for the purposes of identifying lab results and prescriptions
        # within the spell.
        table.col("StartDate_HospitalProviderSpell").label("admission"),
        table.col("DischargeDate_FromHospitalProviderSpell").label("discharge"),
    ]

    # Diagnosis and procedure columns are renamed to (diagnosis|procedure)_n,
    # where n begins from 1 (for the primary code; secondaries are represented
    # using n > 1)
    clinical_code_column_names = {
        clinical_code_column_name(kind, n): f"{kind}_{n+1}"
        for kind, n in product(["diagnosis", "procedure"], range(24))
    }

    clinical_code_columns = [
        table.col(real_name).cast(String).label(new_name)
        for real_name, new_name in clinical_code_column_names.items()
    ]

    # Append the clinical code columns to the other data columns
    columns += clinical_code_columns

    # Valid rows must have one of the following commissioner codes
    #
    # These commissioner codes are used to restrict the system-wide dataset
    # to just in-area patients (those registered with BNSSG GP practices).
    # When linking to primary care data, only these in-area patients are relevant.
    valid_list = ["5M8", "11T", "5QJ", "11H", "5A3", "12A", "15C", "14F", "Q65"]

    return select(*columns).where(
        table.col("StartDate_ConsultantEpisode") >= start_date,
        table.col("EndDate_ConsultantEpisode") <= end_date,
        table.col("AIMTC_Pseudo_NHS").is_not(None),
        table.col("AIMTC_Pseudo_NHS") != 9000219621,  # Invalid-patient marker
        table.col("AIMTC_OrganisationCode_Codeofcommissioner").in_(valid_list),
    )
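
As a usage sketch (the helper name is hypothetical, and a SQLAlchemy engine connected to the ICB database is assumed), the query can be executed for a date range with pandas:

from datetime import date

import pandas as pd
from sqlalchemy.engine import Engine

from pyhbr.data_source.icb import sus_query

def fetch_sus_episodes(engine: Engine, start: date, end: date) -> pd.DataFrame:
    """Hypothetical helper: fetch one row per consultant episode."""
    query = sus_query(engine, start, end)
    return pd.read_sql(query, engine)

# e.g. fetch_sus_episodes(engine, date(2022, 1, 1), date(2023, 1, 1))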

middle

Routines for interfacing between the data sources and analysis functions

from_hic

Convert HIC tables into the formats required for analysis

calculate_age(episodes, demographics)

Calculate the patient age at each episode

The HIC data contains only year_of_birth, which is used here. In order to make an unbiased estimate of the age, the birthday is assumed to be 2nd July (halfway through the year).

Parameters:

Name Type Description Default
episodes DataFrame

Contains episode_start date and column patient_id, indexed by episode_id.

required
demographics DataFrame

Contains year_of_birth date and index patient_id.

required

Returns:

Type Description
Series

A series containing age, indexed by episode_id.

Source code in src\pyhbr\middle\from_hic.py
def calculate_age(episodes: DataFrame, demographics: DataFrame) -> Series:
    """Calculate the patient age at each episode

    The HIC data contains only year_of_birth, which is used here. In order
    to make an unbiased estimate of the age, birthday is assumed to be
    2nd July (halfway through the year).

    Args:
        episodes: Contains `episode_start` date and column `patient_id`,
            indexed by `episode_id`.
        demographics: Contains `year_of_birth` date and index `patient_id`.

    Returns:
        A series containing age, indexed by `episode_id`.
    """
    df = episodes.merge(demographics, how="left", on="patient_id")
    # Subtract one year if the episode occurs before the assumed birthday (2nd July)
    age_offset = np.where(
        (df["episode_start"].dt.month < 7)
        | ((df["episode_start"].dt.month == 7) & (df["episode_start"].dt.day < 2)),
        1, 0,
    )
    age = df["episode_start"].dt.year - df["year_of_birth"] - age_offset
    age.index = episodes.index
    return age

check_const_column(df, col_name, expect)

Raise an error if a column is not constant

Parameters:

Name Type Description Default
df DataFrame

The table to check

required
col_name str

The name of the column which should be constant

required
expect str

The expected constant value of the column

required

Raises:

Type Description
RuntimeError

Raised if the column is not constant with the expected value.

Source code in src\pyhbr\middle\from_hic.py
def check_const_column(df: pd.DataFrame, col_name: str, expect: str):
    """Raise an error if a column is not constant

    Args:
        df: The table to check
        col_name: The name of the column which should be constant
        expect: The expected constant value of the column

    Raises:
        RuntimeError: Raised if the column is not constant with
            the expected value.
    """
    if not all(df[col_name] == expect):
        raise RuntimeError(
            f"Found unexpected value in '{col_name}' column. "
            f"Expected constant '{expect}', but got: "
            f"{df[col_name].unique()}"
        )
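
A small sketch of the intended behaviour (the data below is made up for illustration):

import pandas as pd

from pyhbr.middle.from_hic import check_const_column

units = pd.DataFrame({"unit": ["g/L", "g/L"]})
check_const_column(units, "unit", "g/L")  # passes silently

mixed = pd.DataFrame({"unit": ["g/L", "mL/min"]})
check_const_column(mixed, "unit", "g/L")  # raises RuntimeError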

filter_by_medicine(df)

Filter a dataframe by medicine name

Parameters:

Name Type Description Default
df DataFrame

Contains a column name containing the medicine name

required

Returns:

Type Description
DataFrame

The dataframe, filtered to the set of medicines of interest, with a new column group containing just the medicine type (e.g. "oac", "nsaid").

Source code in src\pyhbr\middle\from_hic.py
def filter_by_medicine(df: DataFrame) -> DataFrame:
    """Filter a dataframe by medicine name

    Args:
        df: Contains a column `name` containing the medicine
            name

    Returns:
        The dataframe, filtered to the set of medicines of interest,
            with a new column `group` containing just the medicine
            type (e.g. "oac", "nsaid").
    """

    prescriptions_of_interest = {
        "warfarin": "oac",
        "apixaban": "oac",
        "dabigatran etexilate": "oac",
        "edoxaban": "oac",
        "rivaroxaban": "oac",
        "ibuprofen": "nsaid",
        "naproxen": "nsaid",
        "diclofenac": "nsaid",
        "diclofenac sodium": "nsaid",
        "celecoxib": "nsaid",  # Not present in HIC data
        "mefenamic acid": "nsaid",  # Not present in HIC data
        "etoricoxib": "nsaid",
        "indometacin": "nsaid",  # This spelling is used in HIC data
        "indomethacin": "nsaid",  # Alternative spelling
        # "aspirin": "nsaid" -- not accounting for high dose
    }

    # Remove rows with missing medicine name
    df = df[~df["name"].isna()]

    # This line is really slow (30s for 3.5m rows)
    df = df[
        df["name"].str.contains(
            "|".join(prescriptions_of_interest.keys()), case=False, regex=True
        )
    ]

    # Add the type of prescription to the table
    df["group"] = df["name"]
    for prescription, group in prescriptions_of_interest.items():
        df["group"] = df["group"].str.replace(
            ".*" + prescription + ".*", group, case=False, regex=True
        )

    return df
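
For illustration, a toy input (not real HIC data) showing which rows survive the filter and how the group column is derived:

import pandas as pd

from pyhbr.middle.from_hic import filter_by_medicine

prescriptions = pd.DataFrame(
    {
        "name": [
            "Warfarin 3mg tablets",
            "Ibuprofen 400mg capsules",
            "Paracetamol 500mg",
            None,
        ]
    }
)

filter_by_medicine(prescriptions)
# Keeps the warfarin row (group "oac") and the ibuprofen row (group "nsaid");
# the paracetamol row and the row with a missing name are dropped.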

get_clinical_codes(engine, diagnoses_file, procedures_file)

Main diagnoses/procedures fetch for the HIC data

This function wraps the diagnoses/procedures queries and a filtering operation to reduce the tables to only those rows which contain a code in a group. One table is returned which contains both the diagnoses and procedures in long format, along with the associated episode ID and the primary/secondary position of the code in the episode.

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
diagnoses_file str

The diagnoses codes file name (loaded from the package)

required
procedures_file str

The procedures codes file name (loaded from the package)

required

Returns:

Type Description
DataFrame

A table containing diagnoses/procedures, normalised codes, code groups, diagnosis positions, and associated episode ID.

Source code in src\pyhbr\middle\from_hic.py
def get_clinical_codes(
    engine: Engine, diagnoses_file: str, procedures_file: str
) -> pd.DataFrame:
    """Main diagnoses/procedures fetch for the HIC data

    This function wraps the diagnoses/procedures queries and a filtering
    operation to reduce the tables to only those rows which contain a code
    in a group. One table is returned which contains both the diagnoses and
    procedures in long format, along with the associated episode ID and the
    primary/secondary position of the code in the episode.

    Args:
        engine: The connection to the database
        diagnoses_file: The diagnoses codes file name (loaded from the package)
        procedures_file: The procedures codes file name (loaded from the package)

    Returns:
        A table containing diagnoses/procedures, normalised codes, code groups,
            diagnosis positions, and associated episode ID.
    """

    diagnosis_codes = clinical_codes.load_from_package(diagnoses_file)
    procedures_codes = clinical_codes.load_from_package(procedures_file)

    # Fetch the data from the server
    diagnoses = get_data(engine, hic.diagnoses_query)
    procedures = get_data(engine, hic.procedures_query)

    # Reduce data to only code groups, and combine diagnoses/procedures
    filtered_diagnoses = clinical_codes.filter_to_groups(diagnoses, diagnosis_codes)
    filtered_procedures = clinical_codes.filter_to_groups(procedures, procedures_codes)

    # Tag the diagnoses/procedures, and combine the tables
    filtered_diagnoses["type"] = "diagnosis"
    filtered_procedures["type"] = "procedure"

    codes = pd.concat([filtered_diagnoses, filtered_procedures])
    codes["type"] = codes["type"].astype("category")

    return codes

get_demographics(engine)

Get patient demographic information

Gender is encoded using the NHS data dictionary values, which are mapped to a category column in the table. (Note that the initial values are strings, not integers.)

  • "0": Not known. Mapped to "unknown"
  • "1": Male. Mapped to "male"
  • "2": Female. Mapped to "female"
  • "9": Not specified. Mapped to "unknown".

Not mapping 0/9 to NA in case either is related to non-binary genders (i.e. it contains information, rather than being a NULL field).

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required

Returns:

Type Description
DataFrame

A table indexed by patient_id, containing gender, birth year, and death_date (if applicable).

Source code in src\pyhbr\middle\from_hic.py
def get_demographics(engine: Engine) -> pd.DataFrame:
    """Get patient demographic information

    Gender is encoded using the NHS data dictionary values, which
    is mapped to a category column in the table. (Note that initial
    values are strings, not integers.)

    * "0": Not known. Mapped to "unknown"
    * "1": Male. Mapped to "male"
    * "2": Female. Mapped to "female"
    * "9": Not specified. Mapped to "unknown".

    Not mapping 0/9 to NA in case either is related to non-binary
    genders (i.e. it contains information, rather than being a NULL field).

    Args:
        engine: The connection to the database

    Returns:
        A table indexed by patient_id, containing gender, birth
            year, and death_date (if applicable).

    """
    df = get_data(engine, hic.demographics_query)
    df.set_index("patient_id", drop=True, inplace=True)

    # Convert gender to categories
    df["gender"] = df["gender"].replace("9", "0")
    df["gender"] = df["gender"].astype("category")
    df["gender"] = df["gender"].cat.rename_categories(
        {"0": "unknown", "1": "male", "2": "female"}
    )

    return df

get_episodes(engine, start_date, end_date)

Get the table of episodes

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
start_date date

The start date (inclusive) for returned episodes

required
end_date date

The end date (inclusive) for returned episodes

required

Returns:

Type Description
DataFrame

The episode data, indexed by episode_id. This contains the columns patient_id, spell_id, episode_start, admission, discharge, age, and gender

Source code in src\pyhbr\middle\from_hic.py
def get_episodes(engine: Engine, start_date: date, end_date: date) -> pd.DataFrame:
    """Get the table of episodes

    Args:
        engine: The connection to the database
        start_date: The start date (inclusive) for returned episodes
        end_date:  The end date (inclusive) for returned episodes

    Returns:
        The episode data, indexed by episode_id. This contains
            the columns `patient_id`, `spell_id`, `episode_start`,
            `admission`, `discharge`, `age`, and `gender`

    """
    episodes = get_data(engine, hic.episodes_query, start_date, end_date)
    episodes = episodes.set_index("episode_id", drop=True)
    demographics = get_demographics(engine)
    episodes["age"] = calculate_age(episodes, demographics)
    episodes["gender"] = get_gender(episodes, demographics) 

    return episodes

get_gender(episodes, demographics)

Get gender from the demographics table for each index event

Parameters:

Name Type Description Default
episodes DataFrame

Indexed by episode_id and having column patient_id

required
demographics DataFrame

Having columns patient_id and gender.

required

Returns:

Type Description
Series

A series containing gender indexed by episode_id

Source code in src\pyhbr\middle\from_hic.py
def get_gender(episodes: DataFrame, demographics: DataFrame) -> Series:
    """Get gender from the demographics table for each index event

    Args:
        episodes: Indexed by `episode_id` and having column `patient_id`
        demographics: Having columns `patient_id` and `gender`.

    Returns:
        A series containing gender indexed by `episode_id`
    """
    gender = episodes.merge(demographics, how="left", on="patient_id")["gender"]
    gender.index = episodes.index
    return gender

get_lab_results(engine, episodes)

Get relevant laboratory results from the HIC data, linked to episode

For information about the contents of the table, refer to the documentation for get_unlinked_lab_results().

This function links each laboratory test to the first episode containing the sample collected date in its date range. For more about this, see link_to_episodes().

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
episodes DataFrame

The episodes table, used for linking. Must contain patient_id, episode_id, episode_start and episode_end.

required

Returns:

Type Description
DataFrame

Table of laboratory results, including Hb (haemoglobin), platelet count, and eGFR (kidney function). The columns are sample_date, test_name, episode_id.

Source code in src\pyhbr\middle\from_hic.py
def get_lab_results(engine: Engine, episodes: pd.DataFrame) -> pd.DataFrame:
    """Get relevant laboratory results from the HIC data, linked to episode

    For information about the contents of the table, refer to the
    documentation for [get_unlinked_lab_results()][pyhbr.middle.from_hic.get_unlinked_lab_results].

    This function links each laboratory test to the first episode containing
    the sample collected date in its date range. For more about this, see
    [link_to_episodes()][pyhbr.middle.from_hic.link_to_episodes].

    Args:
        engine: The connection to the database
        episodes: The episodes table, used for linking. Must contain
            `patient_id`, `episode_id`, `episode_start` and `episode_end`.

    Returns:
        Table of laboratory results, including Hb (haemoglobin),
            platelet count, and eGFR (kidney function). The columns are
            `sample_date`, `test_name`, `episode_id`.
    """

    # Do not link to episodes
    return get_unlinked_lab_results(engine)

get_prescriptions(engine, episodes)

Get relevant prescriptions from the HIC data, linked to episode

For information about the contents of the table, refer to the documentation for get_unlinked_prescriptions().

This function links each prescription to the first episode containing the prescription order date in its date range. For more about this, see link_to_episodes().

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
episodes DataFrame

The episodes table, used for linking. Must contain patient_id, episode_id, episode_start and episode_end.

required

Returns:

Type Description
DataFrame

The table of prescriptions, including the prescription name, prescription group (oac or nsaid), frequency (in doses per day), and link to the associated episode.

Source code in src\pyhbr\middle\from_hic.py
def get_prescriptions(engine: Engine, episodes: pd.DataFrame) -> pd.DataFrame:
    """Get relevant prescriptions from the HIC data, linked to episode

    For information about the contents of the table, refer to the
    documentation for [get_unlinked_prescriptions()][pyhbr.middle.from_hic.get_unlinked_prescriptions].

    This function links each prescription to the first episode containing
    the prescription order date in its date range. For more about this, see
    [link_to_episodes()][pyhbr.middle.from_hic.link_to_episodes].

    Args:
        engine: The connection to the database
        episodes: The episodes table, used for linking. Must contain
            `patient_id`, `episode_id`, `episode_start` and `episode_end`.

    Returns:
        The table of prescriptions, including the prescription name,
            prescription group (oac or nsaid), frequency (in doses per day),
            and link to the associated episode.
    """

    # Do not link the prescriptions to episode
    return get_unlinked_prescriptions(engine)

get_unlinked_lab_results(engine, table_name='cv1_pathology_blood')

Get laboratory results from the HIC database (unlinked to episode)

This function returns data for the following three tests, identified by one of these values in the test_name column:

  • hb: haemoglobin (unit: g/dL)
  • egfr: eGFR (unit: mL/min)
  • platelets: platelet count (unit: 10^9/L)

The test result is associated with a patient_id, and the time when the sample for the test was collected is stored in the sample_date column.

Some values in the underlying table contain inequalities in the results column, which have been removed (so egfr >90 becomes 90).

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
table_name str

This defaults to "cv1_pathology_blood" for UHBW, but can be overwritten with "HIC_Bloods" for ICB.

'cv1_pathology_blood'

Returns:

Type Description
DataFrame

Table of laboratory results, including Hb (haemoglobin), platelet count, and eGFR (kidney function). The columns are patient_id, test_name, and sample_date.

Source code in src\pyhbr\middle\from_hic.py
def get_unlinked_lab_results(engine: Engine, table_name: str = "cv1_pathology_blood") -> pd.DataFrame:
    """Get laboratory results from the HIC database (unlinked to episode)

    This function returns data for the following three
    tests, identified by one of these values in the
    `test_name` column:

    * `hb`: haemoglobin (unit: g/dL)
    * `egfr`: eGFR (unit: mL/min)
    * `platelets`: platelet count (unit: 10^9/L)

    The test result is associated to a `patient_id`,
    and the time when the sample for the test was collected
    is stored in the `sample_date` column.

    Some values in the underlying table contain inequalities
    in the results column, which have been removed (so
    egfr >90 becomes 90).

    Args:
        engine: The connection to the database
        table_name: This defaults to "cv1_pathology_blood" for UHBW, but
            can be overwritten with "HIC_Bloods" for ICB.

    Returns:
        Table of laboratory results, including Hb (haemoglobin),
            platelet count, and eGFR (kidney function). The columns are
            `patient_id`, `test_name`, and `sample_date`.

    """
    df = get_data(engine, hic.pathology_blood_query, ["OBR_BLS_UE", "OBR_BLS_FB"])

    df["test_name"] = df["investigation"] + "_" + df["test"]

    test_of_interest = {
        "OBR_BLS_FB_OBX_BLS_HB": "hb",
        "OBR_BLS_UE_OBX_BLS_EP": "egfr",
        "OBR_BLS_FB_OBX_BLS_PL": "platelets",
    }

    # Only keep tests of interest: platelets, egfr, and hb
    df = df[df["test_name"].isin(test_of_interest.keys())]

    # Rename the items
    df["test_name"] = df["test_name"].map(test_of_interest)

    # Check egfr unit
    rows = df[df["test_name"] == "egfr"]
    check_const_column(rows, "unit", "mL/min")

    # Check hb unit
    rows = df[df["test_name"] == "hb"]
    check_const_column(rows, "unit", "g/L")

    # Check platelets unit (note 10*9/L is not a typo)
    rows = df[df["test_name"] == "platelets"]
    check_const_column(rows, "unit", "10*9/L")

    # Some values include an inequality; e.g.:
    # - egfr: >90
    # - platelets: <3
    #
    # Remove instances of < or > to enable conversion
    # to float.
    df["result"] = df["result"].str.replace("<|>", "", regex=True)

    # Convert results column to float
    df["result"] = df["result"].astype(float)

    # Convert hb units to g/dL (to match ARC HBR definition)
    df.loc[df["test_name"] == "hb", "result"] /= 10.0

    return df[["patient_id", "sample_date", "test_name", "result"]]

get_unlinked_prescriptions(engine, table_name='cv1_pharmacy_prescribing')

Get relevant prescriptions from the HIC data (unlinked to episode)

This function is tailored towards the calculation of the ARC HBR score, so it focusses on prescriptions of oral anticoagulants (e.g. warfarin) and non-steroidal anti-inflammatory drugs (NSAIDs, e.g. ibuprofen).

The frequency column reflects the maximum allowable doses per day. For the purposes of ARC HBR, where NSAIDs must be prescribed > 4 days/week, all prescriptions in the HIC data indicate a frequency of at least one dose per day, and therefore qualify for ARC HBR purposes.

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
table_name str

Defaults to "cv1_pharmacy_prescribing" for UHBW, but can be overwritten by "HIC_Pharmacy" for ICB.

'cv1_pharmacy_prescribing'

Returns:

Type Description
DataFrame

The table of prescriptions, including the patient_id, order_date (to link to an episode), prescription name, prescription group (oac or nsaid), and frequency (in doses per day).

Source code in src\pyhbr\middle\from_hic.py
def get_unlinked_prescriptions(engine: Engine, table_name: str = "cv1_pharmacy_prescribing") -> pd.DataFrame:
    """Get relevant prescriptions from the HIC data (unlinked to episode)

    This function is tailored towards the calculation of the
    ARC HBR score, so it focusses on prescriptions on oral
    anticoagulants (e.g. warfarin) and non-steroidal
    anti-inflammatory drugs (NSAIDs, e.g. ibuprofen).

    The frequency column reflects the maximum allowable
    doses per day. For the purposes of ARC HBR, where NSAIDs
    must be prescribed > 4 days/week, all prescriptions in
    the HIC data indicate frequency > 1 (i.e. at least one
    per day), and therefore qualify for ARC HBR purposes.

    Args:
        engine: The connection to the database
        table_name: Defaults to "cv1_pharmacy_prescribing" for UHBW, but can
            be overwritten by "HIC_Pharmacy" for ICB.

    Returns:
        The table of prescriptions, including the patient_id,
            order_date (to link to an episode), prescription name,
            prescription group (oac or nsaid), and frequency (in
            doses per day).
    """

    df = get_data(engine, hic.pharmacy_prescribing_query, table_name)

    # Create a new `group` column containing the medicine type
    df = filter_by_medicine(df)

    # Replace alternative spellings
    df["name"] = df["name"].str.replace("indomethacin", "indometacin")

    # Replace admission medicine column with bool
    on_admission_map = {"y": True, "n": False}
    df["on_admission"] = df["on_admission"].map(on_admission_map)

    # Extra spaces are not typos.
    per_day = {
        "TWICE a day": 2,
        "in the MORNING": 1,
        "THREE times a day": 3,
        "TWICE a day at 08:00 and 22:00": 2,
        "ONCE a day  at 18:00": 1,
        "up to every SIX hours": 4,
        "up to every EIGHT hours": 3,
        "TWICE a day at 08:00 and 20:00": 2,
        "up to every 24 hours": 1,
        "THREE times a day at 08:00 15:00 and 22:00": 3,
        "TWICE a day at 08:00 and 19:00": 2,
        "ONCE a day  at 20:00": 1,
        "ONCE a day  at 08:00": 1,
        "up to every 12 hours": 2,
        "ONCE a day  at 19:00": 1,
        "THREE times a day at 08:00 15:00 and 20:00": 3,
        "THREE times a day at 08:00 14:00 and 22:00": 3,
        "ONCE a day  at 22:00": 1,
        "every EIGHT hours": 3,  # three doses per day
        "ONCE a day  at 09:00": 1,
        "up to every FOUR hours": 6,
        "TWICE a day at 06:00 and 18:00": 2,
        "at NIGHT": 1,
        "ONCE a day  at 14:00": 1,
        "ONCE a day  at 12:00": 1,
        "THREE times a day at 08:00 14:00 and 20:00": 3,
        "THREE times a day at 00:00 08:00 and 16:00": 3,
    }

    # Replace frequencies strings with doses per day
    df["frequency"] = df["frequency"].map(per_day)

    return df[
        ["patient_id", "order_date", "name", "group", "frequency", "on_admission"]
    ].reset_index(drop=True)

link_to_episodes(items, episodes, date_col_name)

Link HIC laboratory test/prescriptions to episode by date

Use this function to add an episode_id to the laboratory tests table or the prescriptions table. Tests/prescriptions are generically referred to as items below.

This function associates each item with the first episode containing the item date in its [episode_start, episode_end) range. The column containing the item date is given by date_col_name.

For prescriptions, use the prescription order date for linking. For laboratory tests, use the sample collected date.

This function assumes that the episode_id in the episodes table is unique (i.e. no patients share an episode ID).

For higher performance, reduce the item table to items of interest before calling this function.

Since episodes may slightly overlap, an item may be associated with more than one episode. In this case, the function will associate the item with the earliest episode (the returned table will not contain duplicate items).

The final table does not use episode_id as an index, because an episode may contain multiple items.

Parameters:

Name Type Description Default
items DataFrame

The prescriptions or laboratory tests table. Must contain a date_col_name column, which is used to compare with episode start/end dates, and the patient_id.

required
episodes DataFrame

The episodes table. Must contain patient_id, episode_id, episode_start and episode_end.

required

Returns:

Type Description
DataFrame

The items table with additional episode_id and spell_id columns.

Source code in src\pyhbr\middle\from_hic.py
def link_to_episodes(
    items: pd.DataFrame, episodes: pd.DataFrame, date_col_name: str
) -> pd.DataFrame:
    """Link HIC laboratory test/prescriptions to episode by date

    Use this function to add an episode_id to the laboratory tests
    table or the prescriptions table. Tests/prescriptions are generically
    referred to as items below.

    This function associates each item with the first episode containing
    the item date in its [episode_start, episode_end) range. The column
    containing the item date is given by `date_col_name`.

    For prescriptions, use the prescription order date for linking. For
    laboratory tests, use the sample collected date.

    This function assumes that the episode_id in the episodes table is
    unique (i.e. no patients share an episode ID).

    For higher performance, reduce the item table to items of interest
    before calling this function.

    Since episodes may slightly overlap, an item may be associated
    with more than one episode. In this case, the function will associate
    the item with the earliest episode (the returned table will
    not contain duplicate items).

    The final table does not use episode_id as an index, because an episode
    may contain multiple items.

    Args:
        items: The prescriptions or laboratory tests table. Must contain a
            `date_col_name` column, which is used to compare with episode
            start/end dates, and the `patient_id`.

        episodes: The episodes table. Must contain `patient_id`, `episode_id`,
            `episode_start` and `episode_end`.

    Returns:
        The items table with additional `episode_id` and `spell_id` columns.
    """

    # Before linking to episodes, add an item ID. This is to
    # remove duplicated items in the last step of linking,
    # due to overlapping episode time windows.
    items["item_id"] = range(items.shape[0])

    # Join together all items and episode information by patient. Use
    # a left join on items (assuming items is narrowed to the item types
    # of interest) to keep the result smaller. Reset the index to move
    # episode_id to a column.
    with_episodes = pd.merge(items, episodes.reset_index(), how="left", on="patient_id")

    # Thinking of each row as both an episode and a item, drop any
    # rows where the item date does not fall within the start
    # and end of the episode (start date inclusive).
    consistent_dates = (
        with_episodes[date_col_name] >= with_episodes["episode_start"]
    ) & (with_episodes[date_col_name] < with_episodes["episode_end"])
    overlapping_episodes = with_episodes[consistent_dates]

    # Since some episodes overlap in time, some items will end up
    # being associated with more than one episode. Remove any
    # duplicates by associating only with the earliest episode.
    deduplicated = (
        overlapping_episodes.sort_values("episode_start").groupby("item_id").head(1)
    )

    # Keep episode_id, drop other episodes/unnecessary columns.
    return deduplicated.drop(columns=["item_id"]).drop(columns=episodes.columns)
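
A minimal sketch of linking prescriptions to episodes (the helper name is hypothetical; an episodes table of the form described above and a SQLAlchemy engine are assumed). Prescriptions link on their order date; lab results would instead use the sample date.

import pandas as pd
from sqlalchemy.engine import Engine

from pyhbr.middle.from_hic import get_unlinked_prescriptions, link_to_episodes

def linked_prescriptions(engine: Engine, episodes: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fetch prescriptions and attach episode IDs by date.

    episodes must be indexed by episode_id and contain patient_id,
    episode_start and episode_end (see the Args above).
    """
    prescriptions = get_unlinked_prescriptions(engine)
    # Link each prescription to an episode using its order date
    return link_to_episodes(prescriptions, episodes, "order_date")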

from_icb

blood_pressure(swd_index_spells, primary_care_measurements)

Get recent blood pressure readings

Parameters:

Name Type Description Default
primary_care_measurements DataFrame

Contains a name column containing the measurement name (expected to contain "blood_pressure"), a result column with the format "systolic/diastolic" for the blood pressure rows, a date, and a patient_id.

required
swd_index_spells DataFrame

Has Pandas index spell_id, and columns patient_id and spell_start.

required

Returns:

Type Description
DataFrame

A dataframe indexed by spell_id containing bp_systolic and bp_diastolic columns.

Source code in src\pyhbr\middle\from_icb.py
def blood_pressure(
    swd_index_spells: DataFrame, primary_care_measurements: DataFrame
) -> DataFrame:
    """Get recent blood pressure readings

    Args:
        primary_care_measurements: Contains a `name` column containing
            the measurement name (expected to contain "blood_pressure"),
            a `result` column with the format "systolic/diastolic" for
            the blood pressure rows, a `date`, and a `patient_id`.
        swd_index_spells: Has Pandas index `spell_id`, and columns
            `patient_id` and `spell_start`.

    Returns:
        A dataframe indexed by `spell_id` containing `bp_systolic`
            and `bp_diastolic` columns.
    """

    df = primary_care_measurements

    # Drop rows where the measurement is not known
    df = df[~df["name"].isna()]

    # Drop rows where the prescription date is not known
    df = df[~df["date"].isna()]

    blood_pressure = df[df.name.str.contains("blood_pressure")][
        ["patient_id", "date", "result"]
    ].copy()
    blood_pressure[["bp_systolic", "bp_diastolic"]] = (
        df["result"].str.split("/", expand=True).apply(pd.to_numeric, errors="coerce")
    )

    # Join the prescriptions to the index spells
    df = (
        swd_index_spells[["spell_start", "patient_id"]]
        .reset_index()
        .merge(blood_pressure, how="left", on="patient_id")
    )
    df["time_to_index_spell"] = df["spell_start"] - df["date"]

    # Only keep measurements occurring in the two months before the index spell
    min_before = dt.timedelta(days=0)
    max_before = dt.timedelta(days=60)
    bp_before_index = counting.get_time_window(
        df, -max_before, -min_before, "time_to_index_spell"
    )

    most_recent_bp = bp_before_index.sort_values("date").groupby("spell_id").tail(1)
    prior_bp = swd_index_spells.merge(
        most_recent_bp, how="left", on="spell_id"
    ).set_index("spell_id")[["bp_systolic", "bp_diastolic"]]

    return prior_bp
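
A usage sketch (note that swd_index_spells is the first positional argument; both tables are assumed to have been prepared earlier):

from pyhbr.middle import from_icb

# swd_index_spells is indexed by spell_id; primary_care_measurements has
# name, result, date and patient_id columns.
prior_bp = from_icb.blood_pressure(swd_index_spells, primary_care_measurements)

# prior_bp is indexed by spell_id, with bp_systolic and bp_diastolic columns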

get_clinical_codes(raw_sus_data, code_groups)

Get clinical codes in long format and normalised form.

Each row is a code that is contained in some group. Codes in an episode are dropped if they are not in any group, meaning episodes will be dropped if no code in that episode is in any group.

Parameters:

Name Type Description Default
raw_sus_data DataFrame

Must contain one row per episode, and contains clinical codes in wide format, with columns diagnosis_n and procedure_n, for n > 0. The value n == 1 is the primary diagnosis or procedure, and n > 1 is for secondary codes.

required
code_groups DataFrame

A table of all the codes in any group, at least containing columns code, group and type.

required

Returns:

Type Description
DataFrame

A table containing diagnoses/procedures, normalised codes, code groups, diagnosis positions, and associated episode ID.

Source code in src\pyhbr\middle\from_icb.py
def get_clinical_codes(
    raw_sus_data: DataFrame, code_groups: DataFrame
) -> DataFrame:
    """Get clinical codes in long format and normalised form.

    Each row is a code that is contained in some group. Codes in
    an episode are dropped if they are not in any group, meaning
    episodes will be dropped if no code in that episode is in any
    group. 

    Args:
        raw_sus_data: Must contain one row per episode, and
            contains clinical codes in wide format, with
            columns `diagnosis_n` and `procedure_n`, for
            n > 0. The value n == 1 is the primary diagnosis
            or procedure, and n > 1 is for secondary codes.
        code_groups: A table of all the codes in any group, at least containing
            columns `code`, `group` and `type`.

    Returns:
        A table containing diagnoses/procedures, normalised codes, code groups,
            diagnosis positions, and associated episode ID.
    """

    # Get all the clinical codes for all episodes in long format
    long_codes = get_long_clinical_codes(raw_sus_data)

    # Join all the code groups, and drop any codes that are not in any
    # group (inner join in order to keep only the codes in long_codes
    # that have an entry in code_groups)
    return long_codes.merge(code_groups, on=["code", "type"], how="inner")
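
A usage sketch, assuming raw_sus_data has been fetched with get_raw_sus_data and code_groups is a table of code group definitions (columns code, group and type):

from pyhbr.middle import from_icb

# Keep only codes that fall in at least one code group; episodes with no
# codes in any group will not appear in the result.
codes = from_icb.get_clinical_codes(raw_sus_data, code_groups)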

get_episodes(raw_sus_data)

Get the episodes table

Age and gender are also included in each row.

Gender is encoded using the NHS data dictionary values, and is mapped to a category column in the table. (Note that initial values are strings, not integers.)

  • "0": Not known. Mapped to "unknown"
  • "1": Male: Mapped to "male"
  • "2": Female. Mapped to "female"
  • "9": Not specified. Mapped to "unknown".

Not mapping 0/9 to NA in case either is related to non-binary genders (i.e. it contains information, rather than being a NULL field).

Parameters:

Name Type Description Default
raw_sus_data DataFrame

Data returned by sus_query() query.

required

Returns:

Type Description
DataFrame

A dataframe indexed by episode_id, with columns episode_start, spell_id and patient_id.

Source code in src\pyhbr\middle\from_icb.py
def get_episodes(raw_sus_data: DataFrame) -> DataFrame:
    """Get the episodes table

    Age and gender are also included in each row.

    Gender is encoded using the NHS data dictionary values, and
    is mapped to a category column in the table. (Note that initial
    values are strings, not integers.)

    * "0": Not known. Mapped to "unknown"
    * "1": Male: Mapped to "male"
    * "2": Female. Mapped to "female"
    * "9": Not specified. Mapped to "unknown".

    Not mapping 0/9 to NA in case either is related to non-binary
    genders (i.e. it contains information, rather than being a NULL field).

    Args:
        raw_sus_data: Data returned by sus_query() query.

    Returns:
        A dataframe indexed by `episode_id`, with columns
            `episode_start`, `spell_id` and `patient_id`.
    """
    df = (
        raw_sus_data[["spell_id", "patient_id", "episode_start", "admission", "discharge", "age", "gender"]]
        .reset_index(names="episode_id")
        .set_index("episode_id")
    )

    # Convert gender to categories
    df["gender"] = df["gender"].replace("9", "0")
    valid_values = ["0", "1", "2"]
    df.loc[~df["gender"].isin(valid_values), "gender"] = "0"
    df["gender"] = df["gender"].astype("category")
    df["gender"] = df["gender"].cat.rename_categories(
        {"0": "unknown", "1": "male", "2": "female"}
    )

    # Convert age to numerical
    df["age"] = df["age"].astype(float)

    return df
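
A usage sketch, assuming raw_sus_data was fetched by get_raw_sus_data:

from pyhbr.middle import from_icb

episodes = from_icb.get_episodes(raw_sus_data)

# gender is a category column with values "unknown", "male" and "female"
print(episodes["gender"].value_counts())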

get_episodes_and_codes(raw_sus_data, code_groups)

Get episode and clinical code data

This batch of data must be fetched first to find index events, which establishes the patient group of interest. This can then be used to narrow subsequent queries to the database, to speed them up.

Parameters:

Name Type Description Default
raw_sus_data DataFrame

The raw HES data returned by get_raw_sus_data()

required
code_groups DataFrame

A table of all the codes in any group, at least containing columns code, group and type.

required

Returns:

Type Description
(DataFrame, DataFrame)

A tuple containing the episodes table (also contains age and gender) and the codes table containing the clinical code data in long format for any code that is in a diagnosis or procedure code group.

Source code in src\pyhbr\middle\from_icb.py
def get_episodes_and_codes(raw_sus_data: DataFrame, code_groups: DataFrame) -> (DataFrame, DataFrame):
    """Get episode and clinical code data

    This batch of data must be fetched first to find index events,
    which establishes the patient group of interest. This can then
    be used to narrow subsequent queries to the database, to speed
    them up.

    Args:
        raw_sus_data: The raw HES data returned by get_raw_sus_data()
        code_groups: A table of all the codes in any group, at least containing
            columns `code`, `group` and `type`.

    Returns:
        A tuple containing the episodes table (also contains age and
            gender) and the codes table containing the clinical code data
            in long format for any code that is in a diagnosis or 
            procedure code group.
    """

    # Compared to the data fetch, this part is relatively fast, but still very
    # slow (approximately 10% of the total runtime).
    episodes = get_episodes(raw_sus_data)
    codes = get_clinical_codes(raw_sus_data, code_groups)

    return episodes, codes
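
The two outputs are typically unpacked together, for example (raw_sus_data and code_groups assumed to have been prepared as described above):

from pyhbr.middle import from_icb

episodes, codes = from_icb.get_episodes_and_codes(raw_sus_data, code_groups)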

get_long_cause_of_death(mortality)

Get cause-of-death diagnosis codes in normalised long format

Parameters:

Name Type Description Default
mortality DataFrame

A table containing patient_id, and columns with names cause_of_death_n, where n is an integer 1, 2, ...

required

Returns:

Type Description
DataFrame

A table containing the columns patient_id, code (for ICD-10 cause of death diagnosis), and position (for primary/secondary position of the code)

Source code in src\pyhbr\middle\from_icb.py
def get_long_cause_of_death(mortality: DataFrame) -> DataFrame:
    """Get cause-of-death diagnosis codes in normalised long format

    Args:
        mortality: A table containing `patient_id`, and columns 
            with names `cause_of_death_n`, where n is an integer 1, 2, ...

    Returns:
        A table containing the columns `patient_id`, `code` (for ICD-10
            cause of death diagnosis), and `position` (for primary/secondary
            position of the code)
    """
    df = mortality.filter(regex="(id|cause)").melt(id_vars="patient_id")
    df["position"] = df["variable"].str.split("_", expand=True).iloc[:, -1].astype(int)
    df = df[~df["value"].isna()]
    df["code"] = df["value"].apply(clinical_codes.normalise_code)
    return df[["patient_id", "code", "position"]]

get_long_clinical_codes(raw_sus_data)

Get a table of the clinical codes in normalised long format

This is modelled on the format of the HIC data, which works well, and makes it possible to re-use the code for processing that table.

Parameters:

Name Type Description Default
raw_sus_data DataFrame

Must contain one row per episode, and contains clinical codes in wide format, with columns diagnosis_n and procedure_n, for n > 0. The value n == 1 is the primary diagnosis or procedure, and n > 1 is for secondary codes.

required

Returns:

Type Description
DataFrame

A table containing episode_id, code, type, and position.

Source code in src\pyhbr\middle\from_icb.py
def get_long_clinical_codes(raw_sus_data: DataFrame) -> DataFrame:
    """Get a table of the clinical codes in normalised long format

    This is modelled on the format of the HIC data, which works
    well, and makes it possible to re-use the code for processing
    that table.

    Args:
        raw_sus_data: Must contain one row per episode, and
            contains clinical codes in wide format, with
            columns `diagnosis_n` and `procedure_n`, for
            n > 0. The value n == 1 is the primary diagnosis
            or procedure, and n > 1 is for secondary codes.

    Returns:
        A table containing `episode_id`, `code`, `type`, and
            `position`.
    """

    # Pivot the wide format to long based on the episode_id
    df = (
        raw_sus_data.reset_index(names="episode_id")
        .filter(regex="(diagnosis|procedure|episode_id)")
        .melt(id_vars="episode_id", value_name="code")
    )

    # Drop any codes that are empty or whitespace
    long_codes = df[~df["code"].str.isspace() & (df["code"] != "")].copy()

    # Convert the diagnosis/procedure and value of n into separate columns
    long_codes[["type", "position"]] = long_codes["variable"].str.split(
        "_", expand=True
    )

    long_codes["position"] = long_codes["position"].astype(int)
    long_codes["code"] = long_codes["code"].apply(clinical_codes.normalise_code)

    # Collect columns of interest and sort for ease of viewing
    return (
        long_codes[["episode_id", "code", "type", "position"]]
        .sort_values(["episode_id", "type", "position"])
        .reset_index(drop=True)
    )
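
A minimal, self-contained sketch with made-up codes (real input comes from get_raw_sus_data; output codes are normalised by clinical_codes.normalise_code):

import pandas as pd
from pyhbr.middle import from_icb

# Two episodes in the wide diagnosis_n/procedure_n format
raw_sus_data = pd.DataFrame(
    {
        "diagnosis_1": ["I21.4", "D64.9"],
        "diagnosis_2": ["", "I48.0"],
        "procedure_1": ["K75.1", ""],
    }
)

# One row per non-empty code, with episode_id, code, type and position columns
long_codes = from_icb.get_long_clinical_codes(raw_sus_data)
print(long_codes)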

get_mortality(engine, start_date, end_date, code_groups)

Get date of death and cause of death

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
start_date date

First date of death that will be included

required
end_date date

Last date of death that will be included

required
code_groups DataFrame

A table of all the codes in any group, at least containing columns code, group and type.

required

Returns:

Type Description
tuple[DataFrame, DataFrame]

A tuple containing a date of death table, which is indexed by patient_id and has the single column date_of_death, and a cause of death table with columns patient_id, code for the cause of death diagnosis code (ICD-10), and position indicating the primary/secondary position of the code (1 is primary, >1 is secondary).

Source code in src\pyhbr\middle\from_icb.py
def get_mortality(engine: Engine, start_date: date, end_date: date, code_groups: DataFrame) -> tuple[DataFrame, DataFrame]:
    """Get date of death and cause of death

    Args:
        engine: The connection to the database
        start_date: First date of death that will be included
        end_date: Last date of death that will be included
        code_groups: A table of all the codes in any group, at least containing
            columns `code`, `group` and `type`.

    Returns:
        A tuple containing a date of death table, which is indexed by `patient_id`
            and has the single column `date_of_death`, and a cause of death table
            with columns `patient_id`, `code` for the cause of death
            diagnosis code (ICD-10), and `position` indicating the primary/secondary
            position of the code (1 is primary, >1 is secondary).
    """

    # Fetch the mortality data limited by the date range
    raw_mortality_data = common.get_data(engine, icb.mortality_query, start_date, end_date)

    # Some patient IDs have multiple inconsistent death records. For these cases,
    # pick the most recent record. This will ensure that no patients recorded in the
    # mortality tables are dropped, at the expense of some potential inaccuracies in
    # the date of death.
    mortality = raw_mortality_data.sort_values("date_of_death").groupby("patient_id").tail(1)

    # Get the date of death.
    date_of_death = mortality.set_index("patient_id")[["date_of_death"]]

    # Convert the cause of death to a long format, normalise the codes,
    # and keep only the code and position for each patient.
    long_cause_of_death = get_long_cause_of_death(mortality)

    # Join the code groups to the codes (does not filter -- leaves
    # NA group for a code not in any group).
    diagnosis_code_groups = code_groups[code_groups["type"] == "diagnosis"]
    cause_of_death = long_cause_of_death.merge(
        diagnosis_code_groups, on="code", how="inner"
    ).sort_values(["patient_id", "position"]).reset_index(drop=True)

    return date_of_death, cause_of_death
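
A usage sketch (the engine and code_groups table are assumed to have been created earlier; the dates are placeholders):

from datetime import date
from pyhbr.middle import from_icb

date_of_death, cause_of_death = from_icb.get_mortality(
    engine, date(2019, 1, 1), date(2023, 1, 1), code_groups
)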

get_raw_sus_data(engine, start_date, end_date)

Get the raw SUS (secondary uses services hospital episode statistics)

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required
start_date date

The start date (inclusive) for returned episodes

required
end_date date

The end date (inclusive) for returned episodes

required

Returns:

Type Description
DataFrame

A dataframe with one row per episode, containing clinical code data and patient demographics at that episode.

Source code in src\pyhbr\middle\from_icb.py
def get_raw_sus_data(engine: Engine, start_date: date, end_date: date) -> DataFrame:
    """Get the raw SUS (secondary uses services hospital episode statistics)

    Args:
        engine: The connection to the database
        start_date: The start date (inclusive) for returned episodes
        end_date:  The end date (inclusive) for returned episodes

    Returns:
        A dataframe with one row per episode, containing clinical code
            data and patient demographics at that episode.
    """

    # The fetch is very slow (and varies depending on the internet connection).
    # Fetching 5 years of data takes approximately 20 minutes (about 2m episodes).
    print("Starting SUS data fetch...")
    raw_sus_data = common.get_data(engine, icb.sus_query, start_date, end_date)
    print("SUS data fetch finished.")

    return raw_sus_data
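
A usage sketch (the connection URL is a placeholder; substitute the real database connection):

from datetime import date
from sqlalchemy import create_engine
from pyhbr.middle import from_icb

engine = create_engine("mssql+pyodbc://<placeholder-dsn>")
raw_sus_data = from_icb.get_raw_sus_data(engine, date(2019, 1, 1), date(2023, 1, 1))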

get_unlinked_lab_results(engine)

Get laboratory results from the HIC database (unlinked to episode)

This function returns data for the following three tests, identified by one of these values in the test_name column:

  • hb: haemoglobin (unit: g/dL)
  • egfr: eGFR (unit: mL/min)
  • platelets: platelet count (unit: 10^9/L)

The test result is associated to a patient_id, and the time when the sample for the test was collected is stored in the sample_date column.

Some values in the underlying table contain inequalities in the results column, which have been removed (so egfr >90 becomes 90).

Parameters:

Name Type Description Default
engine Engine

The connection to the database

required

Returns:

Type Description
DataFrame

Table of laboratory results, including Hb (haemoglobin), platelet count, and eGFR (kidney function). The columns are patient_id, sample_date, test_name, and result.

Source code in src\pyhbr\middle\from_icb.py
def get_unlinked_lab_results(engine: Engine) -> pd.DataFrame:
    """Get laboratory results from the HIC database (unlinked to episode)

    This function returns data for the following three
    tests, identified by one of these values in the
    `test_name` column:

    * `hb`: haemoglobin (unit: g/dL)
    * `egfr`: eGFR (unit: mL/min)
    * `platelets`: platelet count (unit: 10^9/L)

    The test result is associated to a `patient_id`,
    and the time when the sample for the test was collected
    is stored in the `sample_date` column.

    Some values in the underlying table contain inequalities
    in the results column, which have been removed (so
    egfr >90 becomes 90).

    Args:
        engine: The connection to the database

    Returns:
        Table of laboratory results, including Hb (haemoglobin),
            platelet count, and eGFR (kidney function). The columns are
            `patient_id`, `sample_date`, `test_name`, and `result`.

    """

    test_of_interest = {
        "Haemoglobin": "hb",
        "eGFR/1.73m2 (CKD-EPI)": "egfr",
        "Platelets": "platelets",
    }

    df = common.get_data(engine, hic_icb.pathology_blood_query, test_of_interest.keys())

    # Only keep tests of interest: platelets, egfr, and hb
    df = df[df["test_name"].isin(test_of_interest.keys())]

    # Rename the items
    df["test_name"] = df["test_name"].map(test_of_interest)

    # Check egfr unit
    rows = df[df["test_name"] == "egfr"]
    check_const_column(rows, "unit", "mL/min")

    # Check hb unit
    rows = df[df["test_name"] == "hb"]
    check_const_column(rows, "unit", "g/L")

    # Check platelets unit (note 10*9/L is not a typo)
    rows = df[df["test_name"] == "platelets"]
    check_const_column(rows, "unit", "10*9/L")

    # Some values include an inequality; e.g.:
    # - egfr: >90
    # - platelets: <3
    #
    # Remove instances of < or > to enable conversion
    # to float.
    df["result"] = df["result"].str.replace("<|>", "", regex=True)

    # Convert results column to float
    df["result"] = df["result"].astype(float)

    # Convert hb units to g/dL (to match ARC HBR definition)
    df.loc[df["test_name"] == "hb", "result"] /= 10.0

    return df[["patient_id", "sample_date", "test_name", "result"]]

hba1c(swd_index_spells, primary_care_measurements)

Get recent HbA1c from the primary care measurements

Parameters:

Name Type Description Default
primary_care_measurements DataFrame

Contains a name column containing the measurement name (expected to contain "hba1c"), a result column containing the numeric HbA1c value, a date, and a patient_id.

required
swd_index_spells DataFrame

Has Pandas index spell_id, and columns patient_id and spell_start.

required

Returns:

Type Description
DataFrame

A dataframe indexed by spell_id containing recent (within 2 months) HbA1c values.

Source code in src\pyhbr\middle\from_icb.py
def hba1c(
    swd_index_spells: DataFrame, primary_care_measurements: DataFrame
) -> DataFrame:
    """Get recent HbA1c from the primary care measurements

    Args:
        primary_care_measurements: Contains a `name` column containing
            the measurement name (expected to contain "hba1c"),
            a `result` column containing the numeric HbA1c value,
            a `date`, and a `patient_id`.
        swd_index_spells: Has Pandas index `spell_id`, and columns
            `patient_id` and `spell_start`.

    Returns:
        A dataframe indexed by `spell_id` containing recent (within 2 months)
            HbA1c values.
    """

    df = primary_care_measurements

    # Drop rows where the measurement is not known
    df = df[~df["name"].isna()]

    # Drop rows where the prescription date is not known
    df = df[~df["date"].isna()]

    hba1c = df[df.name.str.contains("hba1c")][["patient_id", "date", "result"]].copy()
    hba1c["hba1c"] = pd.to_numeric(hba1c["result"], errors="coerce")

    # Join the prescriptions to the index spells
    df = (
        swd_index_spells[["spell_start", "patient_id"]]
        .reset_index()
        .merge(hba1c, how="left", on="patient_id")
    )
    df["time_to_index_spell"] = df["spell_start"] - df["date"]

    # Only keep measurements occurring in the two months before the index spell
    min_before = dt.timedelta(days=0)
    max_before = dt.timedelta(days=60)
    hba1c_before_index = counting.get_time_window(
        df, -max_before, -min_before, "time_to_index_spell"
    )

    most_recent_hba1c = (
        hba1c_before_index.sort_values("date").groupby("spell_id").tail(1)
    )
    prior_hba1c = swd_index_spells.merge(
        most_recent_hba1c, how="left", on="spell_id"
    ).set_index("spell_id")[["hba1c"]]

    return prior_hba1c
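
Usage mirrors blood_pressure above (both input tables assumed prepared earlier):

from pyhbr.middle import from_icb

prior_hba1c = from_icb.hba1c(swd_index_spells, primary_care_measurements)

# prior_hba1c is indexed by spell_id with a single hba1c column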

preprocess_ethnicity(column)

Map the ethnicity column to standard ethnicities.

Ethnicities were obtained from www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups, from the 2021 census:

  • asian_or_asian_british
  • black_black_british_caribbean_or_african
  • mixed_or_multiple_ethnic_groups
  • white
  • other_ethnic_group

Parameters:

Name Type Description Default
column Series

A column of object ("string") containing ethnicities from the primary care attributes table.

required

Returns:

Type Description
Series

A column of type category containing the standard ethnicities (and NaN).

Source code in src\pyhbr\middle\from_icb.py
def preprocess_ethnicity(column: Series) -> Series:
    """Map the ethnicity column to standard ethnicities.

    Ethnicities were obtained from www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups,
    from the 2021 census:

    * asian_or_asian_british
    * black_black_british_caribbean_or_african
    * mixed_or_multiple_ethnic_groups
    * white
    * other_ethnic_group

    Args:
        column: A column of object ("string") containing
            ethnicities from the primary care attributes table.

    Returns:
        A column of type category containing the standard
            ethnicities (and NaN).
    """

    column = column.str.replace(" - ethnic category 2001 census", "")
    column = column.str.replace(" - England and Wales ethnic category 2011 census", "")
    column = column.str.replace(" - 2011 census England and Wales", "")
    column = column.str.replace(" - Scotland ethnic category 2011 census", "")
    column = column.str.replace(" - 2001 census", "")
    column = column.str.lower()
    column = column.str.replace("(\(|\)|:| )+", "_", regex=True)

    ethnicity_map = {
        "white_british": "white",
        "british_or_mixed_british": "white",
        "white_english_or_welsh_or_scottish_or_northern_irish_or_british": "white",
        "english": "white",
        "other_white_background": "white",
        "white": "white",
        "ethnic_category_not_stated": np.nan,
        "pakistani_or_british_pakistani": "asian_or_asian_british",
        "refusal_by_patient_to_provide_information_about_ethnic_group": np.nan,
        "ethnic_category": np.nan,
        "indian_or_british_indian": "asian_or_asian_british",
        "caribbean": "black_black_british_caribbean_or_african",
        "other_asian_background": "asian_or_asian_british",
        "african": "black_black_british_caribbean_or_african",
        "white_any_other_white_background": "white",
        "bangladeshi_or_british_bangladeshi": "asian_or_asian_british",
        "irish": "white",
        "white_irish": "white",
        "white_-_ethnic_group": "white",
        "chinese": "asian_or_asian_british",
        "polish": "white",
        "black_british": "black_black_british_caribbean_or_african",
        "white_and_black_caribbean": "mixed_or_multiple_ethnic_groups",
        "pakistani": "asian_or_asian_british",
        "other": "other_ethnic_group",
        "black_african": "black_black_british_caribbean_or_african",
        "asian_or_asian_british_indian": "asian_or_asian_british",
        "black_caribbean": "black_black_british_caribbean_or_african",
        "indian": "asian_or_asian_british",
        "asian_or_asian_british_pakistani": "asian_or_asian_british",
        "other_white_european_or_european_unspecified_or_mixed_european": "white",
        "somali": "black_black_british_caribbean_or_african",
        "ethnic_group_not_recorded": np.nan,
        "asian_or_asian_british_any_other_asian_background": "asian_or_asian_british",
        "white_and_asian": "mixed_or_multiple_ethnic_groups",
        "white_and_black_african": "mixed_or_multiple_ethnic_groups",
        "other_black_background": "black_black_british_caribbean_or_african",
        "italian": "white",
        "scottish": "white",
        "other_white_or_white_unspecified": "white",
        "other_ethnic_group_any_other_ethnic_group": "other_ethnic_group",
        "other_mixed_background": "mixed_or_multiple_ethnic_groups",
        "other_european_nmo_": "white",
        "welsh": "white",
        "greek": "white",
        "patient_ethnicity_unknown": np.nan,
        "mixed_multiple_ethnic_groups_any_other_mixed_or_multiple_ethnic_background": "mixed_or_multiple_ethnic_groups",
        "black_or_african_or_caribbean_or_black_british_caribbean": "black_black_british_caribbean_or_african",
        "filipino": "asian_or_asian_british",
        "ethnic_group": np.nan,
        "other_mixed_white": "white",  # Unclear
        "british_asian": "asian_or_asian_british",
        "iranian": "other_ethnic_group",
        "other_asian_ethnic_group": "asian_or_asian_british",
        "kurdish": "other_ethnic_group",
        "black_or_african_or_caribbean_or_black_british_african": "black_black_british_caribbean_or_african",
        "other_asian_nmo_": "asian_or_asian_british",
        "moroccan": "other_ethnic_group",
        "other_white_british_ethnic_group": "white",
        "mixed_multiple_ethnic_groups_white_and_black_caribbean": "mixed_or_multiple_ethnic_groups",
        "black_and_white": "mixed_or_multiple_ethnic_groups",
        "asian_or_asian_british_bangladeshi": "asian_or_asian_british",
        "mixed_multiple_ethnic_groups_white_and_black_african": "mixed_or_multiple_ethnic_groups",
        "white_polish": "white",
        "asian_and_chinese": "asian_or_asian_british",
        "black_or_african_or_caribbean_or_black_british_other_black_or_african_or_caribbean_background": "black_black_british_caribbean_or_african",
        "black_and_asian": "black_black_british_caribbean_or_african",
        "white_scottish": "white",
        "any_other_group": "other_ethnic_group",
        "other_ethnic_non-mixed_nmo_": "other_ethnic_group",
        "ethnicity_and_other_related_nationality_data": np.nan,
        "caucasian_race": "white",
        "multi-ethnic_islands_mauritian_or_seychellois_or_maldivian_or_st_helena": "other_ethnic_group",
        "punjabi": "asian_or_asian_british",
        "albanian": "white",
        "turkish/turkish_cypriot_nmo_": "other_ethnic_group",
        "black_-_other_african_country": "black_black_british_caribbean_or_african",
        "other_black_or_black_unspecified": "black_black_british_caribbean_or_african",
        "sri_lankan": "asian_or_asian_british",
        "mixed_asian": "asian_or_asian_british",
        "other_black_ethnic_group": "black_black_british_caribbean_or_african",
        "bulgarian": "white",
        "sikh": "asian_or_asian_british",
        "other_ethnic_mixed_origin": "other_ethnic_group",
        "n_african_arab/iranian_nmo_": "other_ethnic_group",
        "south_and_central_american": "other_ethnic_group",
        "asian_or_asian_british_chinese": "asian_or_asian_british",
        "ethnic_groups_census_nos": np.nan,
        "arab": "other_ethnic_group",
        "ethnic_group_finding": np.nan,
        "white_any_other_white_ethnic_group": "white",
        "greek_cypriot": "white",
        "latin_american": "other_ethnic_group",
        "other_asian_or_asian_unspecified": "asian_or_asian_british",
        "cypriot_part_not_stated_": "other_ethnic_group",
        "east_african_asian": "other_ethnic_group",
        "mixed_multiple_ethnic_groups_white_and_asian": "mixed_or_multiple_ethnic_groups",
        "other_ethnic_group_arab_arab_scottish_or_arab_british": "other_ethnic_group",
        "other_ethnic_group_arab": "other_ethnic_group",
        "turkish": "other_ethnic_group",
        "north_african": "black_black_british_caribbean_or_african",
        "greek_nmo_": "white",
        "bangladeshi": "asian_or_asian_british",
        "chinese_and_white": "mixed_or_multiple_ethnic_groups",
        "white_gypsy_or_irish_traveller": "white",
        "vietnamese": "asian_or_asian_british",
        "romanian": "white",
        "serbian": "white",
    }

    return column.map(ethnicity_map).astype("category")

preprocess_smoking(column)

Convert the smoking column from string to category

The values in the column are "unknown", "ex", "Unknown", "current", "Smoker", "Ex", and "Never".

Based on the distribution of values in the column, it is likely that "Unknown/unknown" mostly means "no". This makes the percentage of smokers about 15%, which is roughly in line with the average. Without performing this mapping, smokers outnumber non-smokers ("Never") approx. 20 to 1.

Note that the column does also include NA values, which will be left as NA.

Parameters:

Name Type Description Default
column Series

The smoking column from the primary care attributes

required

Returns:

Type Description
Series

A category column containing "yes", "no", and "ex".

Source code in src\pyhbr\middle\from_icb.py
def preprocess_smoking(column: Series) -> Series:
    """Convert the smoking column from string to category

    The values in the column are "unknown", "ex", "Unknown",
    "current", "Smoker", "Ex", and "Never".

    Based on the distribution of values in the column, it is
    likely that "Unknown/unknown" mostly means "no". This
    makes the percentage of smokers about 15%, which is
    roughly in line with the average. Without performing this
    mapping, smokers outnumber non-smokers ("Never") approx.
    20 to 1.

    Note that the column does also include NA values, which
    will be left as NA.

    Args:
        column: The smoking column from the primary
            care attributes

    Returns:
        A category column containing "yes", "no", and "ex".
    """

    value_map = {
        "unknown": "no",
        "Unknown": "no",
        "current": "yes",
        "Smoker": "yes",
        "ex": "ex",
        "Ex": "ex",
        "Never": "no",
    }

    return column.map(value_map).astype("category")
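
A minimal, self-contained sketch (the values here are made up):

import pandas as pd
from pyhbr.middle import from_icb

column = pd.Series(["Unknown", "Smoker", "Ex", "Never", None])
print(from_icb.preprocess_smoking(column))
# -> no, yes, ex, no, NaN (as a category column)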

process_flag_columns(primary_care_attributes)

Replace NaN with false and convert to Int8 for a selection of columns

Many columns in the primary care attributes encode a flag using 1 for true and NA/NULL for false. These must be replaced with a boolean type so that NA can distinguish missing data. Instead of using a bool, use Int8 so that NaNs can be stored. (This is important later on for index spells with missing attributes, which need to store NaN in these flag columns.)

Parameters:

Name Type Description Default
primary_care_attributes DataFrame

Original table containing 1/NA flag columns

required

Returns:

Type Description
DataFrame

The primary care attributes with flag columns encoded as Int8.

Source code in src\pyhbr\middle\from_icb.py
def process_flag_columns(primary_care_attributes: DataFrame) -> DataFrame:
    """Replace NaN with false and convert to bool for a selection of rows

    Many columns in the primary care attributes encode a flag
    using 1 for true and NA/NULL for false. These must be replaced
    with a boolean type so that NA can distinguish missing data. 
    Instead of using a `bool`, use Int8 so that NaNs can be stored.
    (This is important later on for index spells with missing attributes,
    which need to store NaN in these flag columns.)

    Args:
        primary_care_attributes: Original table containing
            1/NA flag columns

    Returns:
        The primary care attributes with flag columns encoded
            as Int8.

    """

    # Columns interpreted as flags have been taken from
    # the SWD guide, where the data format column says
    # 1/Null. SWD documentation has been taken as a proxy
    # for the primary care attributes table (which does
    # not have column documentation).
    flag_columns = [
        "abortion",
        "adhd",
        "af",
        "amputations",
        "anaemia_iron",
        "anaemia_other",
        "angio_anaph",
        "arrhythmia_other",
        "asthma",
        "autism",
        "back_pain",
        "cancer_bladder",
        # Not sure what *_year means as a flag
        "cancer_bladder_year",
        "cancer_bowel",
        "cancer_bowel_year",
        "cancer_breast",
        "cancer_breast_year",
        "cancer_cervical",
        "cancer_cervical_year",
        "cancer_giliver",
        "cancer_giliver_year",
        "cancer_headneck",
        "cancer_headneck_year",
        "cancer_kidney",
        "cancer_kidney_year",
        "cancer_leuklymph",
        "cancer_leuklymph_year",
        "cancer_lung",
        "cancer_lung_year",
        "cancer_melanoma",
        "cancer_melanoma_year",
        "cancer_metase",
        "cancer_metase_year",
        "cancer_other",
        "cancer_other_year",
        "cancer_ovarian",
        "cancer_ovarian_year",
        "cancer_prostate",
        "cancer_prostate_year",
        "cardio_other",
        "cataracts",        
        "ckd",
        "coag",
        "coeliac",
        "contraception",
        "copd",
        "cystic_fibrosis",
        "dementia",
        "dep_alcohol",
        "dep_benzo",
        "dep_cannabis",
        "dep_cocaine",
        "dep_opioid",
        "dep_other",
        "depression",
        "diabetes_1",
        "diabetes_2",
        "diabetes_gest",
        "diabetes_retina",
        "disorder_eating",
        "disorder_pers",
        "dna_cpr",
        "eczema",
        "endocrine_other",
        "endometriosis",
        "eol_plan",
        "epaccs",
        "epilepsy",
        "fatigue",
        "fragility",
        "gout",
        "has_carer",
        "health_check",
        "hearing_impair",
        "hep_b",
        "hep_c",
        "hf",
        "hiv",
        "homeless",
        "housebound",
        "ht",
        "ibd",
        "ibs",
        "ihd_mi",
        "ihd_nonmi",
        "incont_urinary",
        "inflam_arthritic",
        "is_carer",
        "learning_diff",
        "learning_dis",
        "live_birth",
        "liver_alcohol",
        "liver_nafl",
        "liver_other",
        "lung_restrict",
        "macular_degen",
        "measles_mumps",
        "migraine",
        "miscarriage",
        "mmr1",
        "mmr2",
        "mnd",
        "ms",
        "neuro_pain",
        "neuro_various",
        "newborn_check",
        "nh_rh",
        "nose",
        "obesity",
        "organ_transplant",
        "osteoarthritis",
        "osteoporosis",
        "parkinsons",
        "pelvic",
        "phys_disability",
        "poly_ovary",
        "pre_diabetes",
        "pregnancy",
        "psoriasis",
        "ptsd",
        "qof_af",
        "qof_asthma",
        "qof_chd",
        "qof_ckd",
        "qof_copd",
        "qof_dementia",
        "qof_depression",
        "qof_diabetes",
        "qof_epilepsy",
        "qof_hf",
        "qof_ht",
        "qof_learndis",
        "qof_mental",
        "qof_obesity",
        "qof_osteoporosis",
        "qof_pad",
        "qof_pall",
        "qof_rheumarth",
        "qof_stroke",
        "sad",
        "screen_aaa",
        "screen_bowel",
        "screen_breast",
        "screen_cervical",
        "screen_eye",
        "self_harm",
        "sickle",
        "smi",
        "stomach",
        "stroke",
        "tb",
        "thyroid",
        "uterine",
        "vasc_dis",
        "veteran",
        "visual_impair",
    ]

    df = primary_care_attributes.copy()
    df[flag_columns] = (
        df[flag_columns].astype("float").fillna(0).astype("Int8")
    )
    return df
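
A usage sketch (primary_care_attributes is assumed to be the full attributes table, containing all of the flag columns listed above):

from pyhbr.middle import from_icb

attributes = from_icb.process_flag_columns(primary_care_attributes)

# The flag columns are now Int8 (0/1), so NaN can later be stored for
# index spells that have no attributes.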

tools

fetch_data

Fetch raw data from the database and save it to a file

generate_report

Generate the report folder from a config file and model data

plot_describe

plot_or_save(plot, name, save_dir)

Plot the graph interactively or save the figure

Parameters:

Name Type Description Default
plot bool

If true, plot interactively and don't save. Otherwise, save

required
name str

The filename (without the .png) to save the figure as

required
save_dir str

The directory in which to save the figure

required
Source code in src\pyhbr\tools\plot_describe.py
def plot_or_save(plot: bool, name: str, save_dir: str):
    """Plot the graph interactively or save the figure

    Args:
        plot: If true, plot interactively and don't save. Otherwise, save
        name: The filename (without the .png) to save the figure as
        save_dir: The directory in which to save the figure
    """
    if plot:
        log.info(f"Plotting {name}, not saving")
        plt.show()
    else:
        log.info(f"Saving figure {name} in {save_dir}")
        plt.savefig(common.make_new_save_item_path(name, save_dir, "png"))

run_model

fit_and_save(model_name, config, pipe, X_train, y_train, X_test, y_test, data_file, random_state)

Fit the model and save the results

Parameters:

Name Type Description Default
model_name str

The name of the model, a key under the "models" top-level key in the config file

required
config dict[str, Any]

The config file as a dictionary

required
X_train DataFrame

The features training dataframe

required
y_train DataFrame

The outcomes training dataframe

required
X_test DataFrame

The features testing dataframe

required
y_test DataFrame

The outcomes testing dataframe

required
data_file str

The name of the raw data file used for the modelling

required
random_state RandomState

The source of randomness used by the model

required
Source code in src\pyhbr\tools\run_model.py
def fit_and_save(
    model_name: str,
    config: dict[str, Any],
    pipe: Pipeline,
    X_train: DataFrame,
    y_train: DataFrame,
    X_test: DataFrame,
    y_test: DataFrame,
    data_file: str,
    random_state: RandomState,
) -> None:
    """Fit the model and save the results

    Args:
        model_name: The name of the model, a key under the "models" top-level
            key in the config file
        config: The config file as a dictionary
        pipe: The preprocessing and fitting pipeline to fit to the training data
        X_train: The features training dataframe
        y_train: The outcomes training dataframe
        X_test: The features testing dataframe
        y_test: The outcomes testing dataframe
        data_file: The name of the raw data file used for the modelling
        random_state: The source of randomness used by the model
    """

    print("Starting fit")

    # Using a larger number of bootstrap resamples will make
    # the stability analysis better, but will take longer to fit.
    num_bootstraps = config["num_bootstraps"]

    # Choose the number of bins for the calibration calculation.
    # Using more bins will resolve the risk estimates more
    # precisely, but will reduce the sample size in each bin for
    # estimating the prevalence.
    num_bins = config["num_bins"]

    # Fit the model, and also fit bootstrapped models (using resamples
    # of the training set) to assess stability.
    fit_results = fit.fit_model(
        pipe, X_train, y_train, X_test, y_test, num_bootstraps, num_bins, random_state
    )

    # Save the fitted models
    model_data = {
        "name": model_name,
        "config": config,
        "fit_results": fit_results,
        "X_train": X_train,
        "X_test": X_test,
        "y_train": y_train,
        "y_test": y_test,
        "data_file": data_file,
    }

    analysis_name = config["analysis_name"]

    # If the branch is not clean, prompt the user to commit to avoid losing
    # long-running model results. Take care to only commit if the state of
    # the repository truly reflects what was run (i.e. if no changes were made
    # while the script was running).
    retry_save = True
    while retry_save:
        try:
            common.save_item(
                model_data, f"{analysis_name}_{model_name}", save_dir=config["save_dir"]
            )
            # Getting here successfully means that the save worked; exit the loop
            log.info("Saved model")
            break
        except RuntimeError as e:
            print(e)
            print("You can commit now and then retry the save after committing.")
            retry_save = common.query_yes_no(
                "Do you want to retry the save? Commit, then select yes, or choose no to exit the script."
            )

get_pipe_fn(model_config)

Get the pipe function based on the name in the config file

Parameters:

Name Type Description Default
model_config dict[str, str]

The dictionary in models.{model_name} in the config file

required
Source code in src\pyhbr\tools\run_model.py
def get_pipe_fn(model_config: dict[str, str]) -> Callable:
    """Get the pipe function based on the name in the config file

    Args:
        model_config: The dictionary in models.{model_name} in
            the config file
    """

    # Make the preprocessing/fitting pipeline
    pipe_fn_path = model_config["pipe_fn"]
    module_name, pipe_fn_name = pipe_fn_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, pipe_fn_name)

Analysis

Routines for performing statistics, analysis, or fitting models

Common Utilities

Common utilities for other modules.

A collection of routines used by the data source or analysis functions.

CheckedTable

Wrapper for sqlalchemy table with checks for table/columns

Source code in src\pyhbr\common.py
class CheckedTable:
    """Wrapper for sqlalchemy table with checks for table/columns"""

    def __init__(self, table_name: str, engine: Engine, schema="dbo") -> None:
        """Get a CheckedTable by reading from the remote server

        This is a wrapper around the sqlalchemy Table for
        catching errors when accessing columns through the
        c attribute.

        Args:
            table_name: The name of the table whose metadata should be retrieved
            engine: The database connection

        Returns:
            The table data for use in SQL queries
        """
        self.name = table_name
        metadata_obj = MetaData(schema=schema)
        try:
            self.table = Table(self.name, metadata_obj, autoload_with=engine)
        except NoSuchTableError as e:
            raise RuntimeError(
                f"Could not find table '{e}' in database connection '{engine.url}'"
            ) from e

    def col(self, column_name: str) -> Column:
        """Get a column

        Args:
            column_name: The name of the column to fetch.

        Raises:
            RuntimeError: Thrown if the column does not exist
        """
        try:
            return self.table.c[column_name]
        except AttributeError as e:
            raise RuntimeError(
                f"Could not find column name '{column_name}' in table '{self.name}'"
            ) from e

__init__(table_name, engine, schema='dbo')

Get a CheckedTable by reading from the remote server

This is a wrapper around the sqlalchemy Table for catching errors when accessing columns through the c attribute.

Parameters:

Name Type Description Default
table_name str

The name of the table whose metadata should be retrieved

required
engine Engine

The database connection

required

Returns:

Type Description
None

The table data for use in SQL queries

Source code in src\pyhbr\common.py
def __init__(self, table_name: str, engine: Engine, schema="dbo") -> None:
    """Get a CheckedTable by reading from the remote server

    This is a wrapper around the sqlalchemy Table for
    catching errors when accessing columns through the
    c attribute.

    Args:
        table_name: The name of the table whose metadata should be retrieved
        engine: The database connection

    Returns:
        The table data for use in SQL queries
    """
    self.name = table_name
    metadata_obj = MetaData(schema=schema)
    try:
        self.table = Table(self.name, metadata_obj, autoload_with=engine)
    except NoSuchTableError as e:
        raise RuntimeError(
            f"Could not find table '{e}' in database connection '{engine.url}'"
        ) from e

col(column_name)

Get a column

Parameters:

Name Type Description Default
column_name str

The name of the column to fetch.

required

Raises:

Type Description
RuntimeError

Thrown if the column does not exist

Source code in src\pyhbr\common.py
def col(self, column_name: str) -> Column:
    """Get a column

    Args:
        column_name: The name of the column to fetch.

    Raises:
        RuntimeError: Thrown if the column does not exist
    """
    try:
        return self.table.c[column_name]
    except AttributeError as e:
        raise RuntimeError(
            f"Could not find column name '{column_name}' in table '{self.name}'"
        ) from e
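
A usage sketch (the table and column names are placeholders; a RuntimeError is raised if either does not exist):

from sqlalchemy import select
from pyhbr.common import CheckedTable

# engine assumed created earlier with sqlalchemy.create_engine
table = CheckedTable("example_table", engine)
stmt = select(table.col("patient_id"))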

chunks(patient_ids, n)

Divide a list of patient ids into n-sized chunks

The last chunk may be shorter.

Parameters:

Name Type Description Default
patient_ids list[str]

The List of IDs to chunk

required
n int

The chunk size.

required

Returns:

Type Description
list[list[str]]

A list containing chunks (list) of patient IDs

Source code in src\pyhbr\common.py
def chunks(patient_ids: list[str], n: int) -> list[list[str]]:
    """Divide a list of patient ids into n-sized chunks

    The last chunk may be shorter.

    Args:
        patient_ids: The List of IDs to chunk
        n: The chunk size.

    Returns:
        A list containing chunks (list) of patient IDs
    """
    return [patient_ids[i : i + n] for i in range(0, len(patient_ids), n)]
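
A minimal, self-contained sketch:

from pyhbr.common import chunks

ids = ["p1", "p2", "p3", "p4", "p5"]
print(chunks(ids, 2))
# -> [['p1', 'p2'], ['p3', 'p4'], ['p5']]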

current_commit()

Get current commit.

Returns:

Type Description
str

Get the first 11 characters of the current commit, using the first repository found above the current working directory. If the working directory is not in a git repository, return "nogit".

Source code in src\pyhbr\common.py
def current_commit() -> str:
    """Get current commit.

    Returns:
        Get the first 11 characters of the current commit,
            using the first repository found above the current
            working directory. If the working directory is not
            in a git repository, return "nogit".
    """
    try:
        repo = Repo(search_parent_directories=True)
        sha = repo.head.object.hexsha[0:11]
        return sha
    except InvalidGitRepositoryError:
        return "nogit"

current_timestamp()

Get the current timestamp.

Returns:

Type Description
int

The current timestamp (since epoch) rounded to the nearest second.

Source code in src\pyhbr\common.py
def current_timestamp() -> int:
    """Get the current timestamp.

    Returns:
        The current timestamp (since epoch) rounded
            to the nearest second.
    """
    return int(time())

get_data(engine, query, *args)

Convenience function to make a query and fetch data.

Wraps a function like hic.demographics_query with a call to pd.read_sql.

Parameters:

Name Type Description Default
engine Engine

The database connection

required
query Callable[[Engine, ...], Select]

A function returning a sqlalchemy Select statement

required
*args ...

Positional arguments to be passed to query in addition to engine (which is passed first). Make sure they are passed in the same order expected by the query function.

()

Returns:

Type Description
DataFrame

The pandas dataframe containing the SQL data

Source code in src\pyhbr\common.py
def get_data(
    engine: Engine, query: Callable[[Engine, ...], Select], *args: ...
) -> DataFrame:
    """Convenience function to make a query and fetch data.

    Wraps a function like hic.demographics_query with a
    call to pd.read_sql.

    Args:
        engine: The database connection
        query: A function returning a sqlalchemy Select statement
        *args: Positional arguments to be passed to query in addition
            to engine (which is passed first). Make sure they are passed
            in the same order expected by the query function.

    Returns:
        The pandas dataframe containing the SQL data
    """
    stmt = query(engine, *args)
    df = read_sql(stmt, engine)

    # Convert the column names to regular strings instead
    # of sqlalchemy.sql.elements.quoted_name. This avoids
    # an error down the line in sklearn, which cannot
    # process sqlalchemy column title tuples.
    df.columns = [str(col) for col in df.columns]

    return df
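
A usage sketch (engine assumed created earlier; extra positional arguments are forwarded to the query function after the engine):

from datetime import date
from pyhbr import common

# icb (the data source module providing sus_query) is assumed to be imported
# already, as in get_raw_sus_data above
raw_sus_data = common.get_data(engine, icb.sus_query, date(2019, 1, 1), date(2023, 1, 1))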

get_data_by_patient(engine, query, patient_ids, *args)

Fetch data using a query restricted by patient ID

The patient_id list is chunked into batches of 2000 to fit within an SQL IN clause, and each chunk is run as a separate query. The results are returned as a list of DataFrames, one per chunk.

Parameters:

Name Type Description Default
engine Engine

The database connection

required
query Callable[[Engine, ...], Select]

A function returning a sqlalchemy Select statement. Must take a list[str] as an argument after engine.

required
patient_ids list[str]

A list of patient IDs to restrict the query.

required
*args ...

Further positional arguments that will be passed to the query function after the patient_ids positional argument.

()

Returns:

Type Description
list[DataFrame]

A list of dataframes, one corresponding to each chunk.

Source code in src\pyhbr\common.py
def get_data_by_patient(
    engine: Engine,
    query: Callable[[Engine, ...], Select],
    patient_ids: list[str],
    *args: ...,
) -> list[DataFrame]:
    """Fetch data using a query restricted by patient ID

    The patient_id list is chunked into batches of 2000 to fit
    within an SQL IN clause, and each chunk is run as a separate
    query. The results are returned as a list of DataFrames, one per chunk.

    Args:
        engine: The database connection
        query: A function returning a sqlalchemy Select statement. Must
            take a list[str] as an argument after engine.
        patient_ids: A list of patient IDs to restrict the query.
        *args: Further positional arguments that will be passed to the
            query function after the patient_ids positional argument.

    Returns:
        A list of dataframes, one corresponding to each chunk.
    """
    dataframes = []
    patient_id_chunks = chunks(patient_ids, 2000)
    num_chunks = len(patient_id_chunks)
    chunk_count = 1
    for chunk in patient_id_chunks:
        print(f"Fetching chunk {chunk_count}/{num_chunks}")
        dataframes.append(get_data(engine, query, chunk, *args))
        chunk_count += 1
    return dataframes
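
A usage sketch (some_query is a placeholder for a query function that takes the engine followed by a list of patient IDs):

from pyhbr import common

# patient_ids taken from a previously fetched episodes table
patient_ids = episodes["patient_id"].unique().tolist()

dataframes = common.get_data_by_patient(engine, some_query, patient_ids)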

get_saved_files_by_name(name, save_dir, extension)

Get all saved data files matching name

Get the list of files in the save_dir folder matching name. Return the result as a table of file path, commit hash, and saved date. The table is sorted by timestamp, with the most recent file first.

Raises:

Type Description
RuntimeError

If save_dir does not exist, or there are files in save_dir with invalid file names (not in the format name_commit_timestamp.pkl).

Parameters:

Name Type Description Default
name str

The name of the saved file to load. This matches name in the filename name_commit_timestamp.pkl.

required
save_dir str

The directory to search for files.

required
extension str

What file extension to look for. Do not include the dot.

required

Returns:

Type Description
DataFrame

A dataframe with columns path, commit and created_date.

Source code in src\pyhbr\common.py
def get_saved_files_by_name(name: str, save_dir: str, extension: str) -> DataFrame:
    """Get all saved data files matching name

    Get the list of files in the save_dir folder matching
    name. Return the result as a table of file path, commit
    hash, and saved date. The table is sorted by timestamp,
    with the most recent file first.

    Raises:
        RuntimeError: If save_dir does not exist, or there are files
            in save_dir with invalid file names (not in the format
            name_commit_timestamp.pkl).

    Args:
        name: The name of the saved file to load. This matches name in
            the filename name_commit_timestamp.pkl.
        save_dir: The directory to search for files.
        extension: What file extension to look for. Do not include the dot.

    Returns:
        A dataframe with columns `path`, `commit` and `created_date`.
    """

    # Check for missing datasets directory
    if not os.path.isdir(save_dir):
        raise RuntimeError(
            f"Missing folder '{save_dir}'. Check your working directory."
        )

    # Read all the .pkl files in the directory
    files = DataFrame({"path": os.listdir(save_dir)})

    # Identify the file name part. The horrible regex matches the
    # expression _[commit_hash]_[timestamp].pkl. It is important to
    # match this part, because "anything" can happen in the name part
    # (including underscores and letters and numbers), so splitting on
    # _ would not work. The name can then be removed.
    files["name"] = files["path"].str.replace(
        rf"_([0-9]|[a-zA-Z])*_\d*\.{extension}", "", regex=True
    )

    # Remove all the files whose name does not match, and drop
    # the name from the path
    files = files[files["name"] == name]
    if files.shape[0] == 0:
        raise ValueError(
            f"There is no file with the name '{name}' in the datasets directory"
        )
    files["commit_and_timestamp"] = files["path"].str.replace(name + "_", "")

    # Split the commit and timestamp up (note also the extension)
    try:
        files[["commit", "timestamp", "extension"]] = files[
            "commit_and_timestamp"
        ].str.split(r"_|\.", expand=True)
    except Exception as exc:
        raise RuntimeError(
            "Failed to parse files in the datasets folder. "
            "Ensure that all files have the correct format "
            "name_commit_timestamp.extension, and "
            "remove any files not matching this "
            "pattern. TODO handle this error properly, "
            "see save_datasets.py."
        ) from exc

    files["created_date"] = to_datetime(files["timestamp"].astype(int), unit="s")
    recent_first = files.sort_values(by="timestamp", ascending=False).reset_index()[
        ["path", "commit", "created_date"]
    ]
    return recent_first
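
A usage sketch (the item name is a placeholder; save_data is the default save directory used elsewhere in this module):

from pyhbr import common

# Most recent file first; columns are path, commit and created_date
files = common.get_saved_files_by_name("icb_basic_data", "save_data", "pkl")
print(files.head(1))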

load_exact_item(name, save_dir='save_data')

Load a previously saved item (pickle) from file by exact filename

This is similar to load_item, but loads the exact filename given by name instead of looking for the most recent file. name must contain the commit, timestamp, and file extension.

A RuntimeError is raised if the file does not exist.

To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.

Parameters:

Name Type Description Default
name str

The name of the item to load

required
save_dir str

Which folder to load the item from.

'save_data'

Returns:

Type Description
Any

The data item loaded.

Source code in src\pyhbr\common.py
def load_exact_item(
    name: str, save_dir: str = "save_data"
) -> Any:
    """Load a previously saved item (pickle) from file by exact filename

    This is similar to load_item, but loads the exact filename given by name
    instead of looking for the most recent file. name must contain the
    commit, timestamp, and file extension.

    A RuntimeError is raised if the file does not exist.

    To load an item that is an object from a library (e.g. a pandas DataFrame),
    the library must be installed (otherwise you will get a ModuleNotFound
    exception). However, you do not have to import the library before calling this
    function.

    Args:
        name: The name of the item to load
        save_dir: Which folder to load the item from.

    Returns:
        The data item loaded. 

    """

    # Make the path to the file
    file_path = Path(save_dir) / Path(name)

    # If the file does not exist, raise an error
    if not file_path.exists():
        raise RuntimeError(f"The file {name} does not exist in the directory {save_dir}")

    # Load a generic pickle. Note that if this is a pandas dataframe,
    # pandas must be installed (otherwise you will get module not found).
    # The same goes for a pickle storing an object from any other library.
    with open(file_path, "rb") as file:
        return pickle.load(file)
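
A usage sketch (the filename below is hypothetical; pass the exact name of a file in your save_data folder, including the commit and timestamp suffix):

from pyhbr.common import load_exact_item

# Load a specific saved version by its full filename
item = load_exact_item("features_1a2b3c4_1700000000.pkl", save_dir="save_data")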

load_item(name, interactive=False, save_dir='save_data')

Load a previously saved item (pickle) from file

Use this function to load a file that was previously saved using save_item(). By default, the latest version of the item will be returned (the one with the most recent timestamp).

(None, None) is returned if an interactive load is cancelled by the user.

To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.

Parameters:

Name Type Description Default
name str

The name of the item to load

required
interactive bool

If True, let the user pick which item version to load interactively. If False, non-interactively load the most recent item (i.e. with the most recent timestamp). The commit hash is not considered when loading the item.

False
save_dir str

Which folder to load the item from.

'save_data'

Returns:

Type Description
tuple[Any, Path]

A tuple, with the python object loaded from file as first element and the Path to the item as the second element, or (None, None) if the user cancelled an interactive load.

Source code in src\pyhbr\common.py
def load_item(
    name: str, interactive: bool = False, save_dir: str = "save_data"
) -> tuple[Any, Path]:
    """Load a previously saved item (pickle) from file

    Use this function to load a file that was previously saved using
    save_item(). By default, the latest version of the item will be returned
    (the one with the most recent timestamp).

    (None, None) is returned if an interactive load is cancelled by the user.

    To load an item that is an object from a library (e.g. a pandas DataFrame),
    the library must be installed (otherwise you will get a ModuleNotFound
    exception). However, you do not have to import the library before calling this
    function.

    Args:
        name: The name of the item to load
        interactive: If True, let the user pick which item version to load interactively.
            If False, non-interactively load the most recent item (i.e. with the most
            recent timestamp). The commit hash is not considered when loading the item.
        save_dir: Which folder to load the item from.

    Returns:
        A tuple, with the python object loaded from file as first element and the
            Path to the item as the second element, or (None, None) if the user
            cancelled an interactive load.

    """
    if interactive:
        item_path = pick_saved_file_interactive(name, save_dir, "pkl")
    else:
        item_path = pick_most_recent_saved_file(name, save_dir, "pkl")

    if item_path is None:
        print("Aborted (interactive) load item")
        return None, None

    print(f"Loading {item_path}")

    # Load a generic pickle. Note that if this is a pandas dataframe,
    # pandas must be installed (otherwise you will get module not found).
    # The same goes for a pickle storing an object from any other library.
    with open(item_path, "rb") as file:
        return pickle.load(file), item_path
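
A minimal usage sketch (the item name "features" is an example of a name previously passed to save_item):

from pyhbr.common import load_item

# Non-interactively load the most recent "features" item
item, item_path = load_item("features", save_dir="save_data")

# Or pick a version interactively; both elements are None if the load is cancelled
item, item_path = load_item("features", interactive=True, save_dir="save_data")
if item is None:
    print("Interactive load cancelled")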

load_most_recent_data_files(analysis_name, save_dir)

Load the most recent timestamp data file matching the analysis name

The data file is a pickle of a dictionary, containing pandas DataFrames and other metadata. It is expected to contain a "raw_file" key, which contains the path to the associated raw data file.

Both files are loaded, and a tuple of all the data is returned.

Parameters:

Name Type Description Default
analysis_name str

The "analysis_name" key from the config file, which is the filename prefix

required
save_dir str

The folder to load the data from

required

Returns:

Type Description
tuple[dict[str, Any], dict[str, Any], str]

(data, raw_data, data_path). data and raw_data are dictionaries containing (mainly) Pandas DataFrames, and data_path is the path to the data file (this can be stored in any output products from this script to record which data file was used to generate the data).

Source code in src\pyhbr\common.py
def load_most_recent_data_files(analysis_name: str, save_dir: str) -> tuple[dict[str, Any], dict[str, Any], str]:
    """Load the most recent timestamp data file matching the analysis name

    The data file is a pickle of a dictionary, containing pandas DataFrames and
    other metadata. It is expected to contain a "raw_file" key, which contains
    the path to the associated raw data file.

    Both files are loaded, and a tuple of all the data is returned.

    Args:
        analysis_name: The "analysis_name" key from the config file, which is the filename prefix
        save_dir: The folder to load the data from

    Returns:
        (data, raw_data, data_path). data and raw_data are dictionaries containing
            (mainly) Pandas DataFrames, and data_path is the path to the data
            file (this can be stored in any output products from this script to
            record which data file was used to generate the data).
    """

    item_name = f"{analysis_name}_data"
    log.info(f"Loading most recent data file '{item_name}'")
    data, data_path = load_item(item_name, save_dir=save_dir)

    raw_file = data["raw_file"]
    log.info(f"Loading the underlying raw data file '{raw_file}'")
    raw_data = load_exact_item(raw_file, save_dir=save_dir)

    log.info(f"Items in the data file {data.keys()}")
    log.info(f"Items in the raw data file: {raw_data.keys()}")

    return data, raw_data, data_path
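
For illustration (the analysis name "example_analysis" is a placeholder for the analysis_name key in your config file):

from pyhbr.common import load_most_recent_data_files

data, raw_data, data_path = load_most_recent_data_files("example_analysis", "save_data")
print(data.keys())      # items in the processed data file
print(raw_data.keys())  # items in the underlying raw data file
print(data_path)        # record this alongside any outputs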

make_engine(con_string='mssql+pyodbc://dsn', database='hic_cv_test')

Make a sqlalchemy engine

This function is intended for use with Microsoft SQL Server. The preferred method to connect to the server on Windows is to use a Data Source Name (DSN). To use the default connection string argument, set up a data source name called "dsn" using the program "ODBC Data Sources".

If you need to access multiple different databases on the same server, you will need different engines. Specify the database name while creating the engine (this will override a default database in the DSN, if there is one).

Parameters:

Name Type Description Default
con_string str

The sqlalchemy connection string.

'mssql+pyodbc://dsn'
database str

The database name to connect to.

'hic_cv_test'

Returns:

Type Description
Engine

The sqlalchemy engine

Source code in src\pyhbr\common.py
def make_engine(
    con_string: str = "mssql+pyodbc://dsn", database: str = "hic_cv_test"
) -> Engine:
    """Make a sqlalchemy engine

    This function is intended for use with Microsoft SQL
    Server. The preferred method to connect to the server
    on Windows is to use a Data Source Name (DSN). To use the
    default connection string argument, set up a data source
    name called "dsn" using the program "ODBC Data Sources".

    If you need to access multiple different databases on the
    same server, you will need different engines. Specify the
    database name while creating the engine (this will override
    a default database in the DSN, if there is one).

    Args:
        con_string: The sqlalchemy connection string.
        database: The database name to connect to.

    Returns:
        The sqlalchemy engine
    """
    connect_args = {"database": database}
    return create_engine(con_string, connect_args=connect_args)
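
A usage sketch, assuming a DSN called "dsn" has been set up in ODBC Data Sources (the database name and the queried table are illustrative):

import pandas as pd
from sqlalchemy import text
from pyhbr.common import make_engine

engine = make_engine(con_string="mssql+pyodbc://dsn", database="hic_cv_test")

# Run an example query against the engine
with engine.connect() as connection:
    df = pd.read_sql(text("SELECT TOP 10 * FROM example_table"), connection)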

make_new_save_item_path(name, save_dir, extension)

Make the path to save a new item to the save_dir

The name will have the format name_{current_commit}_{timestamp}.{extension}.

Parameters:

Name Type Description Default
name str

The base name for the new filename

required
save_dir str

The folder in which to place the item

required
extension str

The file extension (omit the dot)

required

Returns:

Type Description
Path

The relative path to the new object to be saved

Source code in src\pyhbr\common.py
def make_new_save_item_path(name: str, save_dir: str, extension: str) -> Path:
    """Make the path to save a new item to the save_dir

    The name will have the format name_{current_commit}_{timestamp}.{extension}.

    Args:
        name: The base name for the new filename
        save_dir: The folder in which to place the item
        extension: The file extension (omit the dot)

    Returns:
        The relative path to the new object to be saved
    """

    # Make the file suffix out of the current git
    # commit hash and the current time
    filename = f"{name}_{current_commit()}_{current_timestamp()}.{extension}"
    return Path(save_dir) / Path(filename)
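
For illustration (requires running inside a git repository, since the commit hash forms part of the filename; "features" is an example name):

from pyhbr.common import make_new_save_item_path

# e.g. save_data/features_<commit>_<timestamp>.pkl
path = make_new_save_item_path("features", "save_data", "pkl")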

mean_confidence_interval(data, confidence=0.95)

Compute the confidence interval around the mean

Parameters:

Name Type Description Default
data Series

A series of numerical values to compute the confidence interval.

required
confidence float

The confidence interval to compute.

0.95

Returns:

Type Description
dict[str, float]

A map containing the keys "mean", "confidence", "lower", and "upper". The "lower" and "upper" keys contain the confidence interval limits.

Source code in src\pyhbr\common.py
def mean_confidence_interval(
    data: Series, confidence: float = 0.95
) -> dict[str, float]:
    """Compute the confidence interval around the mean

    Args:
        data: A series of numerical values to compute the confidence interval.
        confidence: The confidence interval to compute.

    Returns:
        A map containing the keys "mean", "confidence", "lower", and "upper".
            The "lower" and "upper" keys contain the confidence interval limits.
    """
    a = 1.0 * np.array(data)
    n = len(a)
    mean = np.mean(a)
    standard_error = scipy.stats.sem(a)

    # Half-width of the confidence interval, from the t-distribution quantile
    half_width = standard_error * scipy.stats.t.ppf((1 + confidence) / 2.0, n - 1)
    return {
        "mean": mean,
        "confidence": confidence,
        "lower": mean - half_width,
        "upper": mean + half_width,
    }
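
A small worked example (the values are illustrative):

from pandas import Series
from pyhbr.common import mean_confidence_interval

data = Series([0.1, 0.5, 0.2, 0.4, 0.3])
ci = mean_confidence_interval(data, confidence=0.95)
print(f"{ci['mean']:.2f} ({ci['lower']:.2f}, {ci['upper']:.2f})")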

median_to_string(instability, unit='%')

Convert the median-quartile DataFrame to a String

Parameters:

Name Type Description Default
instability DataFrame

Table containing three rows, indexed by 0.5 (median), 0.025 (lower quantile) and 0.975 (upper quantile).

required
unit

What units to add to the values in the string.

'%'

Returns:

Type Description
str

A string containing the median, and the lower and upper quantiles.

Source code in src\pyhbr\common.py
def median_to_string(instability: DataFrame, unit="%") -> str:
    """Convert the median-quartile DataFrame to a String

    Args:
        instability: Table containing three rows, indexed by
            0.5 (median), 0.025 (lower quantile) and 0.975
            (upper quantile).
        unit: What units to add to the values in the string.

    Returns:
        A string containing the median, and the lower and upper
            quantiles.
    """
    return f"{instability.loc[0.5]:.2f}{unit} Q [{instability.loc[0.025]:.2f}{unit}, {instability.loc[0.975]:.2f}{unit}]"

pick_most_recent_saved_file(name, save_dir, extension='pkl')

Get the path to the most recent file matching name.

Like pick_saved_file_interactive, but automatically selects the most recent file in save_dir.

Parameters:

Name Type Description Default
name str

The name of the saved file to list

required
save_dir str

The directory to search for files

required
extension str

What file extension to look for. Do not include the dot.

'pkl'

Returns:

Type Description
Path

The relative path to the most recent matching file.

Source code in src\pyhbr\common.py
def pick_most_recent_saved_file(
    name: str, save_dir: str, extension: str = "pkl"
) -> Path:
    """Get the path to the most recent file matching name.

    Like pick_saved_file_interactive, but automatically selects the most
    recent file in save_dir.

    Args:
        name: The name of the saved file to list
        save_dir: The directory to search for files
        extension: What file extension to look for. Do not include the dot.

    Returns:
        The relative path to the most recent matching file.
    """
    recent_first = get_saved_files_by_name(name, save_dir, extension)
    return Path(save_dir) / Path(recent_first.loc[0, "path"])

pick_saved_file_interactive(name, save_dir, extension='pkl')

Select a file matching name interactively

Print a list of the saved items in the save_dir folder, along with the date and time it was generated, and the commit hash, and let the user pick which item should be loaded interactively. The full filename of the resulting file is returned, which can then be read by the user.

Parameters:

Name Type Description Default
name str

The name of the saved file to list

required
save_dir str

The directory to search for files

required
extension str

What file extension to look for. Do not include the dot.

'pkl'

Returns:

Type Description
str | None

The full path to the interactively selected file, or None if the interactive load was aborted.

Source code in src\pyhbr\common.py
def pick_saved_file_interactive(
    name: str, save_dir: str, extension: str = "pkl"
) -> str | None:
    """Select a file matching name interactively

    Print a list of the saved items in the save_dir folder, along
    with the date and time it was generated, and the commit hash,
    and let the user pick which item should be loaded interactively.
    The full filename of the resulting file is returned, which can
    then be read by the user.

    Args:
        name: The name of the saved file to list
        save_dir: The directory to search for files
        extension: What file extension to look for. Do not include the dot.

    Returns:
        The full path to the interactively selected file, or None
            if the interactive load was aborted.
    """

    recent_first = get_saved_files_by_name(name, save_dir, extension)
    print(recent_first)

    num_datasets = recent_first.shape[0]
    while True:
        try:
            raw_choice = input(
                f"Pick a dataset to load: [{0} - {num_datasets-1}] (type q[uit]/exit, then Enter, to quit): "
            )
            if "exit" in raw_choice or "q" in raw_choice:
                return None
            choice = int(raw_choice)
        except Exception:
            print(f"{raw_choice} is not valid; try again.")
            continue
        if choice < 0 or choice >= num_datasets:
            print(f"{choice} is not in range; try again.")
            continue
        break

    full_path = os.path.join(save_dir, recent_first.loc[choice, "path"])
    return full_path

query_yes_no(question, default='yes')

Ask a yes/no question via raw_input() and return their answer.

From https://stackoverflow.com/a/3041990.

"question" is a string that is presented to the user. "default" is the presumed answer if the user just hits . It must be "yes" (the default), "no" or None (meaning an answer is required of the user).

The "answer" return value is True for "yes" or False for "no".

Source code in src\pyhbr\common.py
def query_yes_no(question, default="yes"):
    """Ask a yes/no question via raw_input() and return their answer.

    From https://stackoverflow.com/a/3041990.

    "question" is a string that is presented to the user.
    "default" is the presumed answer if the user just hits <Enter>.
            It must be "yes" (the default), "no" or None (meaning
            an answer is required of the user).

    The "answer" return value is True for "yes" or False for "no".
    """
    valid = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
    if default is None:
        prompt = " [y/n] "
    elif default == "yes":
        prompt = " [Y/n] "
    elif default == "no":
        prompt = " [y/N] "
    else:
        raise ValueError("invalid default answer: '%s'" % default)

    while True:
        sys.stdout.write(question + prompt)
        choice = input().lower()
        if default is not None and choice == "":
            return valid[default]
        elif choice in valid:
            return valid[choice]
        else:
            sys.stdout.write("Please respond with 'yes' or 'no' " "(or 'y' or 'n').\n")

read_config_file(yaml_path)

Read the configuration file from the given path

Parameters:

Name Type Description Default
yaml_path str

The path to the experiment config file

required
Source code in src\pyhbr\common.py
def read_config_file(yaml_path: str):
    """Read the configuration file from

    Args:
        yaml_path: The path to the experiment config file
    """
    # Read the configuration file
    with open(yaml_path) as stream:
        try:
            return yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            print(f"Failed to load config file: {exc}")
            exit(1)
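
A usage sketch (the path "config.yaml" is an example; the analysis_name key is the one referenced elsewhere in this module):

from pyhbr.common import read_config_file

config = read_config_file("config.yaml")
analysis_name = config["analysis_name"]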

requires_commit()

Check whether changes need committing

To make most effective use of the commit hash stored with a save_item call, the current branch should be clean (all changes committed). Call this function to check.

Returns False if there is no git repository.

Returns:

Type Description
bool

True if the working directory is in a git repository that requires a commit; False otherwise.

Source code in src\pyhbr\common.py
def requires_commit() -> bool:
    """Check whether changes need committing

    To make most effective use of the commit hash stored with a
    save_item call, the current branch should be clean (all changes
    committed). Call this function to check.

    Returns False if there is no git repository.

    Returns:
        True if the working directory is in a git repository that requires
            a commit; False otherwise.
    """
    try:
        repo = Repo(search_parent_directories=True)
        return repo.is_dirty(untracked_files=True)
    except InvalidGitRepositoryError:
        # No need to commit if not repository
        return False
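
For illustration, a guard before a long-running script:

from pyhbr.common import requires_commit

if requires_commit():
    print("Uncommitted changes: commit first so the saved commit hash reflects the code being run")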

save_item(item, name, save_dir='save_data/', enforce_clean_branch=True, prompt_commit=False)

Save an item to a pickle file

Saves a python object (e.g. a pandas DataFrame) in the save_dir folder, using a filename that includes the current timestamp and the current commit hash. Use load_item to retrieve the file.

Important

Ensure that save_data/ (or your chosen save_dir) is added to the .gitignore of your repository to ensure sensitive data is not committed.

By storing the commit hash and timestamp, it is possible to identify when items were created and what code created them. To make most effective use of the commit hash, ensure that you commit, and do not make any further code edits, before running a script that calls save_item (otherwise the commit hash will not quite reflect the state of the running code).

Parameters:

Name Type Description Default
item Any

The python object to save (e.g. pandas DataFrame)

required
name str

The name of the item. The filename will be created by adding a suffix for the current commit and the timestamp to show when the data was saved (format: name_commit_timestamp.pkl)

required
save_dir str

Where to save the data, relative to the current working directory. The directory will be created if it does not exist.

'save_data/'
enforce_clean_branch

If True, the function will raise an exception if an attempt is made to save an item when the repository has uncommitted changes.

True
prompt_commit

If enforce_clean_branch is True, choose whether to prompt the user to commit on an unclean branch. This can help avoid losing the results of a long-running script. Prefer False if the script is cheap to run.

False
Source code in src\pyhbr\common.py
def save_item(
    item: Any,
    name: str,
    save_dir: str = "save_data/",
    enforce_clean_branch=True,
    prompt_commit=False,
) -> None:
    """Save an item to a pickle file

    Saves a python object (e.g. a pandas DataFrame) in the save_dir
    folder, using a filename that includes the current timestamp and the current
    commit hash. Use load_item to retrieve the file.

    !!! important
        Ensure that `save_data/` (or your chosen `save_dir`) is added to the
        .gitignore of your repository to ensure sensitive data is not committed.

    By storing the commit hash and timestamp, it is possible to identify when items
    were created and what code created them. To make most effective use of the
    commit hash, ensure that you commit, and do not make any further code edits,
    before running a script that calls save_item (otherwise the commit hash will
    not quite reflect the state of the running code).

    Args:
        item: The python object to save (e.g. pandas DataFrame)
        name: The name of the item. The filename will be created by adding
            a suffix for the current commit and the timestamp to show when the
            data was saved (format: `name_commit_timestamp.pkl`)
        save_dir: Where to save the data, relative to the current working directory.
            The directory will be created if it does not exist.
        enforce_clean_branch: If True, the function will raise an exception if an attempt
            is made to save an item when the repository has uncommitted changes.
        prompt_commit: If enforce_clean_branch is True, choose whether to prompt the
            user to commit on an unclean branch. This can help avoid losing
            the results of a long-running script. Prefer False if the script
            is cheap to run.
    """

    if enforce_clean_branch:

        abort_msg = "Aborting save_item() because branch is not clean. Commit your changes before saving item to increase the chance of reproducing the item based on the filename commit hash."

        if prompt_commit:
            # If the branch is not clean, prompt the user to commit to avoid losing
            # long-running model results. Take care to only commit if the state of
            # the repository truly reflects what was run (i.e. if no changes were made
            # while the script was running).
            while requires_commit():
                print(abort_msg)
                print(
                    "You can commit now and then retry the save after committing."
                )
                retry_save = query_yes_no(
                    "Do you want to retry the save? Commit, then select yes, or choose no to abort the save."
                )

                if not retry_save:
                    print(f"Aborting save of {name}")
                    return

            # If we get out of the loop without returning, then the branch
            # is now clean and the save can proceed.
            print("Branch now clean, proceeding to save")

        else:

            if requires_commit():
                # In this case, unconditionally throw an error
                raise RuntimeError(abort_msg)

    if not Path(save_dir).exists():
        print(f"Creating missing folder '{save_dir}' for storing item")
        Path(save_dir).mkdir(parents=True, exist_ok=True)

    path = make_new_save_item_path(name, save_dir, "pkl")
    with open(path, "wb") as file:
        print(f"Saving {str(path)}")
        pickle.dump(item, file)
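
A round-trip usage sketch (requires a clean git branch unless enforce_clean_branch=False; the item name is an example):

from pandas import DataFrame
from pyhbr.common import save_item, load_item

table = DataFrame({"a": [1, 2, 3]})
save_item(table, "example_table", save_dir="save_data/")
reloaded, path = load_item("example_table", save_dir="save_data")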