PyHBR Function Reference
This page contains the documentation for all objects in PyHBR.
analysis
Routines for performing statistics, analysis, or fitting models
acs
filter_by_code_groups(episode_codes, code_group, max_position, exclude_index_spell)
Filter based on matching code conditions occurring in other episodes
From any table derived from get_all_other_episodes (e.g. the output of get_time_window), identify clinical codes (and therefore episodes) which correspond to an outcome of interest.
The input table has one row per clinical code, which is grouped into episodes and spells by other columns. The output only contains the codes that define an episode or spell as an outcome. The result from this function can be used to analyse the make-up of outcomes.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
episode_codes | DataFrame | Table of other episodes to filter. This can be narrowed to either the previous or subsequent year, or a different time frame. (In particular, exclude the index event if required.) The table must contain these columns: | required |
code_group | str | The code group name used to identify outcomes | required |
max_position | int | The maximum clinical code position that will be allowed to define an outcome. Pass 1 to allow primary diagnosis only, 2 to allow primary diagnosis and the first secondary diagnosis, etc. | required |
exclude_index_spell | bool | Do not allow any code present in the index spell to define an outcome. | required |

Returns:

Type | Description |
---|---|
DataFrame | A series containing the number of code group occurrences in the other_episodes table. |
Source code in src\pyhbr\analysis\acs.py
get_code_features(index_spells, all_other_codes)
Get counts of previous clinical codes in code groups before the index.
Predictors derived from clinical code groups use clinical coding data from 365 days before the index to 30 days before the index (this excludes episodes where no coding data would be available, because the coding process itself takes approximately one month).
All groups included anywhere in the group column of all_other_codes are included, and each one becomes a new column with "_before" appended.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | A table containing | required |
all_other_codes | DataFrame | A table of other episodes (and their clinical codes) relative to the index spell, output from counting.get_all_other_codes. | required |

Returns:

Type | Description |
---|---|
DataFrame | A table with one column per code group, counting the number of codes in that group that appeared in the year before the index. |
Source code in src\pyhbr\analysis\acs.py
get_index_attributes(swd_index_spells, primary_care_attributes)
Link the primary care patient data to the index spells
Parameters:

Name | Type | Description | Default |
---|---|---|---|
swd_index_spells | DataFrame | Index_spells linked to a recent, valid patient attributes row. Contains the columns | required |
primary_care_attributes | DataFrame | The full attributes table. | required |

Returns:

Type | Description |
---|---|
DataFrame | The table of index-spell patient attributes, indexed by |
Source code in src\pyhbr\analysis\acs.py
get_index_spells(episodes, codes, acs_group, pci_group, stemi_group, nstemi_group, complex_pci_group)
Get the index spells for ACS/PCI patients
Index spells are defined by the contents of the first episode of the spell (i.e. the cause of admission to hospital). Spells are considered an index event if either of the following hold:
- The primary diagnosis of the first episode contains an ACS ICD-10 code. This is to ensure that only episodes where the main diagnosis of the episode is ACS are considered, and not cases where a secondary ACS is present that could refer to a historical event.
- There is a PCI procedure in any primary or secondary position in the first episode of the spell. It is assumed that a procedure is only coded in secondary positions if it did occur in that episode.
A prerequisite for a spell to be an index spell is that it contains episodes present in both the episodes and codes tables. The episodes table contains start-time/spell information, and the codes table contains information about what diagnoses/procedures occurred in each episode.
The table returned contains one row per index spell (and is indexed by spell id). It also contains other information about the index spell, which is derived from the first episode of the spell.
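As a minimal sketch (not the PyHBR implementation), the inclusion logic above could look like the following in pandas. The column names (spell_id, type, position, group) and the group labels "acs" and "pci" are assumptions for illustration:

```python
import pandas as pd

def sketch_index_spells(first_episode_codes: pd.DataFrame) -> pd.Series:
    """Sketch of the inclusion criteria; expects one row per clinical code
    from the first episode of each spell. All column names are assumed."""
    df = first_episode_codes
    # Criterion 1: an ACS ICD-10 code in the primary diagnosis position
    acs_primary = (
        (df["type"] == "diagnosis") & (df["position"] == 1) & (df["group"] == "acs")
    )
    # Criterion 2: a PCI OPCS-4 code in any primary/secondary position
    pci_any = (df["type"] == "procedure") & (df["group"] == "pci")
    # A spell is an index spell if either criterion holds for any of its codes
    return (acs_primary | pci_any).groupby(df["spell_id"]).any()
```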
Parameters:

Name | Type | Description | Default |
---|---|---|---|
episodes | DataFrame | All patient episodes. Must contain | required |
codes | DataFrame | All diagnosis/procedure codes by episode. Must contain | required |
acs_group | str | The name of the ICD-10 code group used to define ACS. | required |
pci_group | str \| None | The name of the OPCS-4 code group used to define PCI. Pass None to not use PCI as an inclusion criterion for index events. In this case, the pci_index column is omitted, and only ACS primary diagnoses are allowed. | required |
stemi_group | str | The name of the ICD-10 code group used to identify STEMI MI | required |
nstemi_group | str | The name of the ICD-10 code group used to identify NSTEMI MI | required |
complex_pci_group | str \| None | The name of the OPCS-4 code group used to define complex PCI (in any primary/secondary position) | required |

Returns:

Type | Description |
---|---|
DataFrame | A table of index spells and associated information about the first episode of the spell. |
Source code in src\pyhbr\analysis\acs.py
get_management(index_spells, all_other_codes, min_after, max_after, pci_group, cabg_group)
Get the management type for each index event
The result is a category series containing "PCI" if a PCI was performed, "CABG" if CABG was performed, or "Conservative" if neither were performed.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | | required |
all_other_codes | DataFrame | A table of other episodes (and their clinical codes) relative to the index spell, output from counting.get_all_other_codes. | required |
min_after | timedelta | The start of the window after the index to look for management | required |
max_after | timedelta | The end of the window after the index which defines management | required |
pci_group | str | The name of the code group defining PCI management | required |
cabg_group | str | The name of the code group defining CABG management | required |

Returns:

Type | Description |
---|---|
Series | A category series containing "PCI", "CABG", or "Conservative" |
Source code in src\pyhbr\analysis\acs.py
get_outcomes(index_spells, all_other_codes, date_of_death, cause_of_death, non_fatal_group, fatal_group)
Get non-fatal and fatal outcomes defined by code groups
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | A table containing | required |
all_other_codes | DataFrame | A table of other episodes (and their clinical codes) relative to the index spell, output from counting.get_all_other_codes. | required |
date_of_death | DataFrame | Contains a column date_of_death, with Pandas index | required |
cause_of_death | DataFrame | Contains columns | required |
non_fatal_group | str | The name of the ICD-10 group defining the non-fatal outcome (the primary diagnosis of subsequent episodes is checked for codes in this group) | required |
fatal_group | str | The name of the ICD-10 group defining the fatal outcome (the primary diagnosis in the cause-of-death is checked for codes in this group). | required |

Returns:

Type | Description |
---|---|
DataFrame | A dataframe, indexed by |
Source code in src\pyhbr\analysis\acs.py
get_secondary_care_prescriptions_features(prescriptions, index_spells, episodes)
Get dummy feature columns for OAC and NSAID medications on admission
Parameters:

Name | Type | Description | Default |
---|---|---|---|
prescriptions | DataFrame | The table of secondary care prescriptions, containing a | required |
index_spells | DataFrame | The index spells, which must be indexed by | required |
episodes | DataFrame | The episodes table containing | required |
Source code in src\pyhbr\analysis\acs.py
get_survival_data(index_spells, fatal, non_fatal, max_after)
Get survival data from fatal and non-fatal outcomes
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | The index spells, indexed by | required |
fatal | DataFrame | The table of fatal outcomes, containing a | required |
non_fatal | DataFrame | The table of non-fatal outcomes, containing a | required |
max_after | timedelta | The right-censor time. This is the maximum time for data contained in the fatal and non_fatal tables; any index spells with no events in either table will be right-censored with this time. | required |

Returns:

Type | Description |
---|---|
DataFrame | The survival data containing both fatal and non-fatal events. The survival time is the |
Source code in src\pyhbr\analysis\acs.py
get_therapy(index_spells, primary_care_prescriptions)
Get therapy (DAPT, etc.) recorded in primary care prescriptions in 60 days after index
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | Index spells, containing | required |
primary_care_prescriptions | DataFrame | Contains a column | required |

Returns:

Type | Description |
---|---|
DataFrame | DataFrame with a column |
Source code in src\pyhbr\analysis\acs.py
identify_fatal_outcome(index_spells, date_of_death, cause_of_death, outcome_group, max_position, max_after)
Get fatal outcomes defined by a diagnosis code in a code group
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | A table containing | required |
date_of_death | DataFrame | Contains a column date_of_death, with Pandas index | required |
cause_of_death | DataFrame | Contains columns | required |
outcome_group | str | The name of the ICD-10 code group which defines the fatal outcome. | required |
max_position | int | The maximum primary/secondary cause of death that will be checked for the code group. | required |
max_after | timedelta | The maximum follow-up period after the index for valid outcomes. | required |

Returns:

Type | Description |
---|---|
Series | A boolean series indicating whether a fatal outcome occurred in the follow-up period. |
Source code in src\pyhbr\analysis\acs.py
link_attribute_period_to_index(index_spells, primary_care_attributes)
Link primary care attributes to index spells by attribute date
The date column of an attributes row indicates that the attribute was valid at the end of the interval (date, date + 1month). It is important that no attribute is used in modelling that could have occurred after the index event, meaning that date + 1month < spell_start must hold for any attribute used as a predictor. On the other hand, data substantially before the index event should not be used. The valid window is controlled by imposing:
date < spell_start - attribute_valid_window
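As a sketch of this condition (assuming hypothetical column names "date" and "spell_start", both datetimes; not the PyHBR implementation):

```python
import pandas as pd
from datetime import timedelta

def sketch_attribute_window(
    linked: pd.DataFrame, attribute_valid_window: timedelta
) -> pd.DataFrame:
    """Sketch of the stated window condition; column names are assumptions."""
    # date < spell_start - attribute_valid_window guarantees the attribute
    # interval (date, date + 1 month) closed before the index spell started,
    # provided attribute_valid_window is at least one month long.
    valid = linked["date"] < linked["spell_start"] - attribute_valid_window
    return linked[valid]
```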
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | The index spell table, containing a | required |
primary_care_attributes | DataFrame | The patient attributes table, containing | required |

Returns:

Type | Description |
---|---|
DataFrame | The index_spells table with a |
Source code in src\pyhbr\analysis\acs.py
prescriptions_before_index(swd_index_spells, primary_care_prescriptions)
Get the number of primary care prescriptions before each index spell
Parameters:

Name | Type | Description | Default |
---|---|---|---|
swd_index_spells | | Must have Pandas index | required |
primary_care_prescriptions | DataFrame | Must contain a | required |

Returns:

Type | Description |
---|---|
DataFrame | A table indexed by |
Source code in src\pyhbr\analysis\acs.py
remove_features(index_attributes, max_missingness, const_threshold)
Reduce to just the columns meeting minimum missingness and variability criteria.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_attributes | DataFrame | The table of primary care attributes for the index spells | required |
max_missingness | | The maximum allowed missingness in a column before a column is removed as a feature. | required |
const_threshold | | The maximum allowed constant-value proportion (NA + most common non-NA value) before a column is removed as a feature | required |

Returns:

Type | Description |
---|---|
DataFrame | A table containing the features that remain, which contain sufficient non-missing values and sufficient variance. |
Source code in src\pyhbr\analysis\acs.py
arc_hbr
Calculation of the ARC HBR score
all_index_spell_episodes(index_episodes, episodes)
Get all the other episodes in the index spell
This is a dataframe of index spells (defined as the spell containing an episode in index_episodes), along with all the episodes in that spell (including the index episode itself). This is useful for performing operations at index-spell granularity.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_episodes | DataFrame | Must contain Pandas index | required |
episodes | DataFrame | Must contain Pandas index | required |

Returns:

Type | Description |
---|---|
DataFrame | A dataframe with a column |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_age(has_age)
Calculate the age ARC-HBR criterion
Calculate the age ARC HBR criterion (0.5 points if age > 75 at index, 0 otherwise).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_age | DataFrame | Dataframe which has a column | required |

Returns:

Type | Description |
---|---|
Series | A series of values 0.5 (if age > 75 at index) or 0 otherwise, indexed by the input dataframe index. |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_anaemia(has_index_hb_and_gender)
Calculate the ARC HBR anaemia (low Hb) criterion
Currently calculates anaemia based on the worst (lowest) index Hb measurement and gender. Should be modified to use the most recent Hb value or a clinical code.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_index_hb_and_gender | DataFrame | Dataframe having the column | required |

Returns:

Type | Description |
---|---|
Series | A series containing the HBR score for the index episode. |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_cancer(has_prior_cancer)
Calculate the cancer ARC HBR criterion
This function takes a dataframe with a column prior_cancer with a count of the cancer diagnoses in the previous year.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_prior_cancer | DataFrame | Has a column | required |

Returns:

Type | Description |
---|---|
Series | The ARC HBR cancer criterion (0.0, 1.0) |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_cirrhosis_ptl_hyp(has_prior_cirrhosis)
Calculate the liver cirrhosis with portal hypertension ARC HBR criterion
This function takes a dataframe with two columns prior_cirrhosis and prior_portal_hyp, which count the number of diagnoses of liver cirrhosis and portal hypertension seen in the previous year.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_prior_cirrhosis | DataFrame | Has columns | required |

Returns:

Type | Description |
---|---|
Series | The ARC HBR criterion (0.0, 1.0) |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_ckd(has_index_egfr)
Calculate the ARC HBR chronic kidney disease (CKD) criterion
The ARC HBR CKD criterion is calculated based on the eGFR as follows:
eGFR | Score |
---|---|
eGFR < 30 mL/min | 1.0 |
30 mL/min <= eGFR < 60 mL/min | 0.5 |
eGFR >= 60 mL/min | 0.0 |
If the eGFR is NaN, set score to zero (TODO: fall back to ICD-10 codes in this case)
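The mapping in the table above is simple to sketch with numpy; the column name "egfr" is an assumption for illustration, not the PyHBR schema:

```python
import numpy as np
import pandas as pd

def sketch_arc_hbr_ckd(has_index_egfr: pd.DataFrame) -> pd.Series:
    """Sketch of the eGFR-to-score mapping in the table above."""
    egfr = has_index_egfr["egfr"]
    conditions = [egfr < 30, (egfr >= 30) & (egfr < 60)]
    scores = [1.0, 0.5]
    # A NaN eGFR fails both conditions, so it falls through to the
    # default score of 0.0, matching the behaviour described above.
    return pd.Series(
        np.select(conditions, scores, default=0.0), index=has_index_egfr.index
    )
```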
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_index_egfr | DataFrame | Dataframe having the column | required |

Returns:

Type | Description |
---|---|
Series | A series containing the CKD ARC criterion, based on the eGFR at index. |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_ischaemic_stroke_ich(has_prior_ischaemic_stroke)
Calculate the ischaemic stroke/intracranial haemorrhage ARC HBR criterion
This function takes a dataframe with columns counting the number of bAVM/intracranial haemorrhage (prior_bavm_ich) and ischaemic stroke diagnoses seen in the previous year.
If bAVM/ICH is present, 1.0 is added to the score. Else, if ischaemic stroke is present, add 0.5. Otherwise add 0.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_prior_ischaemic_stroke | DataFrame | Has a column | required |

Returns:

Type | Description |
---|---|
Series | The ARC HBR criterion (0.0, 0.5, or 1.0) |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_medicine(index_spells, episodes, prescriptions, medicine_group, arc_score)
Calculate the oral-anticoagulant/NSAID ARC HBR criterion
Pass the list of medicines which qualifies for the OAC ARC criterion, along with the ARC score; or pass the same data for the NSAID criterion.
The score is added if a prescription of the medicine is seen at any time during the patient spell.
Notes on the OAC and NSAID criteria:
1.0 point is added if one of the OACs warfarin, apixaban, rivaroxaban, edoxaban, or dabigatran is present in the index spell (meaning the index episode, or any other episode in the spell).
1.0 point is added if one of the following NSAIDs is present on admission:
- Ibuprofen
- Naproxen
- Diclofenac
- Celecoxib
- Mefenamic acid
- Etoricoxib
- Indomethacin
Note
The on-admission flag could be used to imply expected chronic/extended use, but this is not included, as it filters out all OAC prescriptions in the HIC data.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | Index | required |
prescriptions | DataFrame | Contains | required |

Returns:

Type | Description |
---|---|
Series | The ARC score for each index spell |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_nsaid(index_episodes, prescriptions)
Calculate the non-steroidal anti-inflammatory drug (NSAID) ARC HBR criterion
1.0 point is added if one of the following NSAIDs is present on admission:
- Ibuprofen
- Naproxen
- Diclofenac
- Celecoxib
- Mefenamic acid
- Etoricoxib
- Indomethacin
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_episodes | DataFrame | Index | required |
prescriptions | DataFrame | Contains | required |

Returns:

Type | Description |
---|---|
Series | The NSAID ARC score for each index event. |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_prior_bleeding(has_prior_bleeding)
Calculate the prior bleeding/transfusion ARC HBR criterion
This function takes a dataframe with a column prior_bleeding_12 with a count of the prior bleeding events in the previous year.
TODO: The input needs a separate column for bleeding in the previous 6 months and bleeding in the previous year, to distinguish a score of 0.5 from 1.0. Transfusion also needs to be added.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_prior_bleeding | DataFrame | Has a column | required |

Returns:

Type | Description |
---|---|
Series | The ARC HBR bleeding/transfusion criterion (0.0, 0.5, or 1.0) |
Source code in src\pyhbr\analysis\arc_hbr.py
arc_hbr_tcp(has_index_platelets)
Calculate the ARC HBR thrombocytopenia (low platelet count) criterion
The score is 1.0 if platelet count < 100e9/L, otherwise it is 0.0.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
has_index_platelets | DataFrame | Has column | required |

Returns:

Type | Description |
---|---|
Series | Series containing the ARC score |
Source code in src\pyhbr\analysis\arc_hbr.py
first_index_lab_result(index_spells, lab_results, episodes)
Get the (first) lab result associated to each index spell
Get a table of the first lab result seen in the index admission (between the admission date and discharge date), with one column for each value of the test_name column in lab_results.
The resulting table has all-NA rows for index spells where no lab results were seen, and cells contain NA if that lab result was missing from the index spell.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index_spells | DataFrame | Has an | required |
lab_results | DataFrame | Has a | required |
episodes | DataFrame | Indexed by | required |

Returns:

Type | Description |
---|---|
DataFrame | A table indexed by |
Source code in src\pyhbr\analysis\arc_hbr.py
plot_index_measurement_distribution(features)
Plot a histogram of measurement results at the index
Parameters:

Name | Type | Description | Default |
---|---|---|---|
features | | Must contain | required |
Source code in src\pyhbr\analysis\arc_hbr.py
calibration
Calibration plots
A calibration plot is a comparison of the proportion p of events that occur in the subset of those with predicted probability p'. Ideally, p = p', meaning that of the cases predicted to occur with probability p', a proportion p of them do occur. Calibration is presented as a plot of p against p'.
The stability of the calibration can be investigated by plotting p against p' for multiple bootstrapped models (see stability.py).
draw_calibration_confidence(ax, calibration)
Draw a single model's calibration curve with confidence intervals
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axes on which to draw the plot | required |
calibration | DataFrame | The model's calibration data | required |
Source code in src\pyhbr\analysis\calibration.py
get_average_calibration_error(probs, y_test, n_bins)
This is the weighted average discrepancy between the predicted risk and the observed proportions on the calibration curve.
See "https://towardsdatascience.com/expected-calibration-error-ece-a-step- by-step-visual-explanation-with-python-code-c3e9aa12937d" for a good explanation.
The formula for estimated calibration error (ece) is:
ece = Sum over bins [samples_in_bin / N] * | P_observed - P_pred |,
where P_observed is the empirical proportion of positive samples in the bin, and P_pred is the predicted probability for that bin. The results are weighted by the number of samples in the bin (because some probabilities are predicted more frequently than others).
The result is interpreted as an absolute error: i.e. a value of 0.1 means that the calibration is out on average by 10%. It may be better to modify the formula to compute an average relative error.
Testing: not yet tested.
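A standalone sketch of this formula (not the PyHBR source), using fixed-width bins over predicted risks probs and boolean outcomes y_test:

```python
import numpy as np
import pandas as pd

def sketch_average_calibration_error(
    probs: pd.Series, y_test: pd.Series, n_bins: int = 10
) -> float:
    """Weighted average of |P_observed - P_pred| over equal-width bins."""
    df = pd.DataFrame({"prob": probs, "event": y_test.astype(float)})
    df["bin"] = pd.cut(df["prob"], np.linspace(0, 1, n_bins + 1), include_lowest=True)
    grouped = df.groupby("bin", observed=True)
    weights = grouped.size() / len(df)  # samples_in_bin / N
    gaps = (grouped["event"].mean() - grouped["prob"].mean()).abs()
    return float((weights * gaps).sum())
```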
Source code in src\pyhbr\analysis\calibration.py
get_calibration(probs, y_test, n_bins)
Calculate the calibration of the fitted models
Warning
This function is deprecated. Use the variable bin width calibration function instead.
Get the calibration curves for all models (whose probability predictions for the positive class are columns of probs) based on the outcomes in y_test. Rows of y_test correspond to rows of probs. The result is a list of pairs, one for each model (column of probs). Each pair contains the vector of x- and y-coordinates of the calibration curve.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
probs | DataFrame | The dataframe of probabilities predicted by the model. The first column is the model-under-test (fitted on the training data) and the other columns are from the fits on the training data resamples. | required |
y_test | Series | The outcomes corresponding to the predicted probabilities. | required |
n_bins | int | The number of bins to group probability predictions into, for the purpose of averaging the observed frequency of outcome in the test set. | required |

Returns:

Type | Description |
---|---|
list[DataFrame] | A list of DataFrames containing the calibration curves. Each DataFrame contains the columns |
Source code in src\pyhbr\analysis\calibration.py
get_prevalence(y_test)
Estimate the prevalence in a set of outcomes
To calculate model calibration, patients are grouped together into similar-risk groups. The prevalence of the outcome in each group is then compared to the predicted risk.
The true risk of the outcome within each group is not known, but it is known what outcome occurred.
One possible assumption is that the patients in each group all have the same risk, p. In this case, the outcomes from the group follow a Bernoulli distribution. The population parameter p (where the population is all patients receiving risk predictions in this group) can be estimated simply using \(\hat{p} = N_\text{outcome}/N_\text{group_size}\). Using a simple approach to calculate the confidence interval on this estimate, assuming a large enough sample size for a normally distributed estimate of the mean, gives a CI of:
\[ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{N_\text{group_size}}}. \]
However, the assumption of uniform risk within the model's groups of equal risk prediction may not be valid, because it assumes that the model is predicting reasonably accurate risks, and the model is the item under test.
One argument is that, if the estimated prevalence matches the risk of the group closely, then this may give evidence that the model's predicted risks are accurate -- the alternative would be that the real risks follow a different distribution whose mean happens (coincidentally) to coincide with the predicted risk. Such a conclusion may be possible if the confidence interval for the estimated prevalence is narrow and agrees with the predicted risk closely.
Without further assumptions, there is nothing more that can be said about the distribution of patient risks within each group. As a result, good calibration is a necessary, but not sufficient, condition for accurate risk predictions in the model.
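A sketch of the estimate and its normal-approximation interval, returning the keys described below (the exact PyHBR computation may differ):

```python
import numpy as np
import pandas as pd

def sketch_get_prevalence(y_test: pd.Series) -> dict[str, float]:
    """Estimate the prevalence and a 95% normal-approximation CI
    for one risk group of boolean outcomes."""
    n = len(y_test)
    p_hat = float(y_test.sum()) / n        # N_outcome / N_group_size
    variance = p_hat * (1 - p_hat) / n     # sample variance of the estimate
    half_width = 1.96 * np.sqrt(variance)  # 95% CI half-width
    return {
        "prevalence": p_hat,
        "lower": p_hat - half_width,
        "upper": p_hat + half_width,
        "variance": variance,
    }
```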
Parameters:

Name | Type | Description | Default |
---|---|---|---|
y_test | Series | The (binary) outcomes in a single risk group. The values are True/False (boolean) | required |

Returns:

Type | Description |
---|---|
| A map containing the key "prevalence", for the estimated mean of the Bernoulli distribution, and "lower" and "upper" for the estimated confidence interval, assuming all patients in the risk group are drawn from a single Bernoulli distribution. The "variance" is the estimate of the sample variance of the estimated prevalence, and can be used to form an average of the accuracy uncertainties in each bin. Note that the assumption of a Bernoulli distribution is not necessarily accurate. |
get_variable_width_calibration(probs, y_test, n_bins)
Get variable-bin-width calibration curves
Model predictions are arranged in ascending order, and then risk ranges are selected so that an equal number of predictions falls in each group. This means bin widths will be more granular at points where many patients are predicted the same risk. The risk bins are shown on the x-axis of calibration plots.
In each bin, the proportion of patients with an event is calculated. This value, which is a function of each bin, is plotted on the y-axis of the calibration plot, and is a measure of the prevalence of the outcome in each bin. In a well-calibrated model, this prevalence should match the mean risk prediction in the bin (the bin center).
Note that a well-calibrated model is not a sufficient condition for correctness of risk predictions. One way that the prevalence of the bin can match the bin risk is for all true risks to roughly match the bin risk P. However, other ways are possible; for example, a proportion P of patients in the bin could have 100% risk, and the others have zero risk.
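The equal-count binning can be sketched with pandas.qcut, which chooses bin edges from the quantiles of the predicted risks. This simplified illustration omits the confidence-interval columns:

```python
import pandas as pd

def sketch_variable_width_bins(
    probs: pd.Series, y_test: pd.Series, n_bins: int
) -> pd.DataFrame:
    """Group predictions into equal-count risk bins and compute the
    observed event prevalence in each bin."""
    df = pd.DataFrame({"prob": probs, "event": y_test.astype(float)})
    # qcut picks edges so each bin holds roughly the same number of rows
    df["bin"] = pd.qcut(df["prob"], n_bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)
    return pd.DataFrame({
        "bin_center": grouped["prob"].mean(),  # mean predicted risk per bin
        "est_prev": grouped["event"].mean(),   # observed event proportion
    }).reset_index(drop=True)
```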
Parameters:

Name | Type | Description | Default |
---|---|---|---|
probs | DataFrame | Each column is the predictions from one of the resampled models. The first column corresponds to the model-under-test. | required |
y_test | | Contains the observed outcomes. | required |
n_bins | int | The number of (variable-width) bins to include. | required |

Returns:

Type | Description |
---|---|
list[DataFrame] | A list of dataframes, one for each calibration curve. The "bin_center" column contains the center of each bin; the "bin_half_width" column contains the half-width of each equal-risk group. The "est_prev" column contains the proportion of events in that bin; and "est_prev_err" contains the half-width of the 95% confidence interval (symmetrical above and below est_prev). |
Source code in src\pyhbr\analysis\calibration.py
make_error_boxes(ax, calibration)
Plot error boxes and error bars around points
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axis on which to plot the error boxes. | required |
calibration | DataFrame | Dataframe containing one row per bin, showing how the predicted risk compares to the estimated prevalence. | required |
Source code in src\pyhbr\analysis\calibration.py
plot_calibration_curves(ax, curves, title='Stability of Calibration')
Plot calibration curves for the model under test and resampled models
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axes on which to plot the calibration curves | required |
curves | list[DataFrame] | A list of DataFrames containing the calibration curve data | required |
title | | Title to add to the plot. | 'Stability of Calibration' |
plot_prediction_distribution(ax, probs, n_bins)
Plot the distribution of predicted probabilities over the models as a bar chart, with error bars showing the standard deviation of each bar height. All model predictions (columns of probs) are given equal weight in the average; column 0 (the model under test) is not singled out in any way.
The function plots vertical error bars that are one standard deviation up and down (so 2*sd in total).
Source code in src\pyhbr\analysis\calibration.py
describe
column_prop(bool_col)
Return a string with the number of non-zero items in the column and a percentage
get_column_rates(data)
Get the proportion of rows in each column that are non-zero
Either pass the full table, or subset it based on a condition to get the rates for that subset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | A table containing columns where the proportion of non-zero rows should be calculated. | required |

Returns:

Type | Description |
---|---|
Series | A Series (single column) with one row per column in the original data, containing the rate of non-zero items in each column. The Series is indexed by the names of the columns, with "_rate" appended. |
Source code in src\pyhbr\analysis\describe.py
get_outcome_prevalence(outcomes)
Get the prevalence of each outcome as a percentage.
This function takes the outcomes dataframe used to define the y vector of the training/testing set and calculates the prevalence of each outcome in a form suitable for inclusion in a report.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
outcomes | DataFrame | A dataframe with the columns "fatal_{outcome}", "non_fatal_{outcome}", and "{outcome}" (for the total), where {outcome} is "bleeding" or "ischaemia". Each row is an index spell, and the elements in the table are boolean (whether or not the outcome occurred). | required |

Returns:

Type | Description |
---|---|
DataFrame | A table with the prevalence of each outcome, and a multi-index containing the "Outcome" ("Bleeding" or "Ischaemia") and the outcome "Type" (fatal, total, etc.) |
Source code in src\pyhbr\analysis\describe.py
get_summary_table(models, high_risk_thresholds, config)
Get a table of model metric comparison across different models
Parameters:

Name | Type | Description | Default |
---|---|---|---|
models | dict[str, Any] | A map from model names to model data (containing the key "fit_results") | required |
high_risk_thresholds | dict[str, float] | A dictionary containing the keys "bleeding" and "ischaemia" mapped to the thresholds used to determine whether a patient is at high risk from the models. | required |
config | dict[str, Any] | The config file used as input to the results and report generator scripts. It must contain the keys "outcomes" and "models", which are dictionaries containing the outcome or model name and a sub-key "abbr" which contains a short name of the outcome/model. | required |
nearly_constant(data, threshold)
Check which columns of the input table have low variation
A column is considered low variance if the proportion of rows containing NA or the most common non-NA value exceeds threshold. For example, if NA and one other value together comprise 99% of the column, then it is considered to be low variance based on a threshold of 0.9.
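A sketch of this check in pandas (the PyHBR implementation may differ in detail):

```python
import pandas as pd

def sketch_nearly_constant(data: pd.DataFrame, threshold: float) -> pd.Series:
    """For each column, test whether NA plus the most common non-NA
    value exceeds the threshold proportion of rows."""
    def low_variation(col: pd.Series) -> bool:
        na_count = col.isna().sum()
        # value_counts() excludes NA, so iloc[0] is the modal non-NA count
        top_count = col.value_counts().iloc[0] if col.notna().any() else 0
        return (na_count + top_count) / len(col) > threshold
    return data.apply(low_variation)
```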
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | The table to check for low variance | required |
threshold | float | The proportion of the column that must be NA or the most common value above which the column is considered low variance. | required |

Returns:

Type | Description |
---|---|
Series | A Series containing bool, indexed by the column name in the original data, containing whether the column has low variance. |
Source code in src\pyhbr\analysis\describe.py
plot_arc_hbr_survival(ax, data)
Plot survival curves for bleeding by ARC HBR score.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ax | | List of two axes objects | required |
data | | A loaded data file | required |
config | | The analysis config (from yaml) | required |
Source code in src\pyhbr\analysis\describe.py
plot_clinical_code_distribution(ax, data, config)
Plot histograms of the distribution of bleeding/ischaemia codes
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ax | | A list of two axes objects | required |
data | | A loaded data file | required |
config | | The analysis config (from yaml) | required |
Source code in src\pyhbr\analysis\describe.py
plot_survival_curves(ax, data, config)
Plot survival curves for bleeding/ischaemia broken down by age
Parameters:

Name | Type | Description | Default |
---|---|---|---|
ax | | A list of two axes objects | required |
data | | A loaded data file | required |
config | | The analysis config (from yaml) | required |
Source code in src\pyhbr\analysis\describe.py
proportion_missingness(data)
Get the proportion of missing values in each column
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | A table where missingness should be calculated for each column | required |

Returns:

Type | Description |
---|---|
Series | The proportion of missing values in each column, indexed by the original table column name. The values are sorted in order of increasing missingness. |
Source code in src\pyhbr\analysis\describe.py
proportion_nonzero(column)
pvalue_chi2_high_risk_vs_outcome(probs, y_test, high_risk_threshold)
Perform a Chi-2 hypothesis test on the contingency between estimated high risk and outcome
Get the p-value from the hypothesis test that there is no association between the estimated high-risk category and the outcome. The p-value is interpreted as the probability of obtaining the outcomes corresponding to the model's estimated high-risk category under the assumption that there is no association between the two.
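A sketch of this test using scipy.stats.chi2_contingency on the 2x2 table of the high-risk flag against the outcome (not the PyHBR source):

```python
import pandas as pd
from scipy.stats import chi2_contingency

def sketch_chi2_pvalue(
    probs: pd.DataFrame, y_test: pd.Series, high_risk_threshold: float
) -> float:
    """p-value for no association between estimated high risk and outcome."""
    # Only the first column (the model under test) is used
    high_risk = probs.iloc[:, 0] > high_risk_threshold
    contingency = pd.crosstab(high_risk, y_test)  # 2x2 table of counts
    _, p_value, _, _ = chi2_contingency(contingency)
    return float(p_value)
```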
Parameters:

Name | Type | Description | Default |
---|---|---|---|
probs | DataFrame | The model-estimated probabilities (first column is used) | required |
y_test | Series | Whether the outcome occurred | required |
high_risk_threshold | float | The cut-off risk (probability) defining an estimate to be high risk. | required |

Returns:

Type | Description |
---|---|
float | The p-value for the hypothesis test. |
Source code in src\pyhbr\analysis\describe.py
dim_reduce
Functions for dimension-reduction of clinical codes
Dataset
dataclass
make_full_pipeline(model, reducer=None)
Make a model pipeline from the model part and dimension reduction
This pipeline has one or two steps:
- If no reduction is performed, the only step is "model"
- If dimension reduction is performed, the steps are "reducer", "model"
This function can be used to make the pipeline with no dimension reduction (pass None as the reducer). Otherwise, pass the reducer, which will reduce a subset of the columns before fitting the model (use make_column_transformer to create this argument).
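The one-or-two-step structure described above amounts to the following sketch (the step names follow the text; this is an illustration, not the PyHBR source):

```python
from sklearn.pipeline import Pipeline

def sketch_make_full_pipeline(model, reducer=None) -> Pipeline:
    """Compose an optional "reducer" step with a "model" step."""
    steps = []
    if reducer is not None:
        steps.append(("reducer", reducer))  # dimension reduction first
    steps.append(("model", model))
    return Pipeline(steps)
```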
Parameters:

Name | Type | Description | Default |
---|---|---|---|
model | Pipeline | A list of model fitting steps that should be applied after the (optional) dimension reduction. | required |
reducer | Pipeline | If non-None, this reduction pipeline is applied before the model to reduce a subset of the columns. | None |

Returns:

Type | Description |
---|---|
Pipeline | A scikit-learn pipeline that can be fitted to training data. |
Source code in src\pyhbr\analysis\dim_reduce.py
make_grad_boost(random_state)
Make a new gradient boosting classifier
Returns:

Type | Description |
---|---|
Pipeline | The unfitted pipeline for the gradient boosting classifier |
Source code in src\pyhbr\analysis\dim_reduce.py
make_logistic_regression(random_state)
Make a new logistic regression model
The model involves scaling all predictors and then applying a logistic regression model.
Returns:

Type | Description |
---|---|
Pipeline | The unfitted pipeline for the logistic regression model |
Source code in src\pyhbr\analysis\dim_reduce.py
make_random_forest(random_state)
Make a new random forest model
Returns:

Type | Description |
---|---|
Pipeline | The unfitted pipeline for the random forest model |
Source code in src\pyhbr\analysis\dim_reduce.py
make_reducer_pipeline(reducer, cols_to_reduce)
Make a wrapper that applies dimension reduction to a subset of columns.
A column transformer is necessary if only some of the columns should be dimension-reduced, and others should be preserved. The resulting pipeline is intended for use in a scikit-learn pipeline taking a pandas DataFrame as input (where a subset of the columns are cols_to_reduce).
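A sketch of the column-transformer wrapper using the standard scikit-learn API (the actual PyHBR step names may differ):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

def sketch_make_reducer_pipeline(reducer, cols_to_reduce: list[str]) -> Pipeline:
    """Apply the reducer to cols_to_reduce, passing other columns through."""
    column_transformer = ColumnTransformer(
        [("reducer", reducer, cols_to_reduce)],
        remainder="passthrough",  # preserve the columns not being reduced
    )
    return Pipeline([("column_transformer", column_transformer)])
```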
Parameters:

Name | Type | Description | Default |
---|---|---|---|
reducer | | The dimension reduction model to use for reduction | required |
cols_to_reduce | list[str] | The list of column names to reduce | required |

Returns:

Type | Description |
---|---|
Pipeline | A pipeline which contains the column_transformer that applies the reducer to cols_to_reduce. This can be included as a step in a larger pipeline. |
Source code in src\pyhbr\analysis\dim_reduce.py
prepare_train_test(data_manual, data_reduce, random_state)
Make the test/train datasets for manually-chosen groups and high-dimensional data
Parameters:

Name | Type | Description | Default |
---|---|---|---|
data_manual | DataFrame | The dataset with manually-chosen code groups | required |
data_reduce | DataFrame | The high-dimensional dataset | required |
random_state | RandomState | The random state to pick the test/train split | required |

Returns:

Type | Description |
---|---|
(Dataset, Dataset) | A tuple (train, test) containing the datasets to be used for training and testing the models. Both contain the outcome y along with the features for both the manually-chosen code groups and the data for dimension reduction. |
Source code in src\pyhbr\analysis\dim_reduce.py
fit
fit_model(pipe, X_train, y_train, X_test, y_test, num_bootstraps, num_bins, random_state)
Fit the model and bootstrap models, and calculate model performance
Parameters:

Name | Type | Description | Default |
---|---|---|---|
pipe | Pipeline | The model pipeline to fit | required |
X_train | DataFrame | Training features | required |
y_train | DataFrame | Training outcomes (containing "bleeding"/"ischaemia" columns) | required |
X_test | DataFrame | Test features | required |
y_test | DataFrame | Test outcomes | required |
num_bootstraps | int | The number of resamples of the training set to use to fit bootstrap models. | required |
num_bins | int | The number of equal-size bins to split risk estimates into to calculate calibration curves. | required |
random_state | RandomState | The source of randomness for the resampling and fitting process. | required |

Returns:

Type | Description |
---|---|
dict[str, DataFrame \| Pipeline] | Dictionary with keys "probs", "calibrations", "roc_curves", "roc_aucs". |
Source code in src\pyhbr\analysis\fit.py
model
DenseTransformer
Bases: TransformerMixin
Useful when the model requires a dense matrix but the preprocessing steps produce a sparse output
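A common sketch of this pattern (the PyHBR class may differ in detail):

```python
import numpy as np
from sklearn.base import TransformerMixin

class DenseTransformerSketch(TransformerMixin):
    """Convert a sparse matrix produced by earlier pipeline steps into a
    dense array for models that cannot accept sparse input."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # scipy sparse matrices expose toarray(); dense input passes through
        return X.toarray() if hasattr(X, "toarray") else np.asarray(X)
```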
Source code in src\pyhbr\analysis\model.py
Preprocessor
dataclass
Preprocessing steps for a subset of columns
This holds the set of preprocessing steps that should be applied to a subset of the (named) columns in the input training dataframe.
Multiple instances of this class (for different subsets of columns) are grouped together to create a ColumnTransformer, which preprocesses all columns in the training dataframe.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the preprocessor (which will become the name of the transformer in the ColumnTransformer) | required |
pipe | Pipeline | The sklearn Pipeline that should be applied to the set of columns | required |
columns | list[str] | The set of columns that should have pipe applied to them. | required |
Source code in src\pyhbr\analysis\model.py
TradeOffModel
Bases: ClassifierMixin, BaseEstimator
Source code in src\pyhbr\analysis\model.py
fit(X, y)
Use the name of the Y variable to choose between bleeding and ischaemia
Source code in src\pyhbr\analysis\model.py
get_feature_importances(fit)
Get a table of the features used in the model along with feature importances
Parameters:

Name | Type | Description | Default |
---|---|---|---|
fit | Pipeline | The fitted Pipeline | required |

Returns:

Type | Description |
---|---|
DataFrame | Contains a column for feature names, a column for type, and a feature importance column. |
Source code in src\pyhbr\analysis\model.py
get_feature_names(fit)
Get a table of feature names
The feature names are the names of the columns in the output from the preprocessing step in the fitted pipeline
Parameters:

Name | Type | Description | Default |
---|---|---|---|
fit | Pipeline | A fitted sklearn pipeline, containing a "preprocess" step. | required |

Raises:

Type | Description |
---|---|
RuntimeError | |

Returns:

Type | Description |
---|---|
DataFrame | A table of the feature names output by the preprocessing step. |
Source code in src\pyhbr\analysis\model.py
get_features(fit, X)
Get the features after preprocessing the input X dataset
The features are generated by the "preprocess" step in the fitted pipe. This step is a column transformer that one-hot-encodes discrete data, and imputes, centers, and scales numerical data.
Note that the result may be a dense or sparse Pandas dataframe, depending on whether the preprocessing steps produce a sparse numpy array or not.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
fit | Pipeline | Fitted pipeline with "preprocess" step. | required |
X | DataFrame | An input dataset (either training or test) containing the input columns to be preprocessed. | required |

Returns:

Type | Description |
---|---|
DataFrame | The resulting feature columns generated by the preprocessing step. |
Source code in src\pyhbr\analysis\model.py
get_num_feature_columns(fit)
Get the total number of feature columns

Parameters:

Name | Type | Description | Default |
---|---|---|---|
fit | | The fitted pipeline, containing a "preprocess" step. | required |

Returns:

Type | Description |
---|---|
int | The total number of columns in the features, after preprocessing. |
Source code in src\pyhbr\analysis\model.py
make_abc(random_state, X_train, config)
Make the AdaBoost classifier pipeline
Source code in src\pyhbr\analysis\model.py
make_category_preprocessor(X_train, drop=None)
Create a preprocessor for string/category columns
Columns in the training features that are discrete, represented using string ("object") or "category" dtypes, should be one-hot encoded. This generates one new column for each possible value in the original columns.
The ColumnTransformer transformer created from this preprocessor will be called "category".
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X_train | DataFrame | The training features | required |
drop | | The drop argument to be passed to OneHotEncoder. Default None means no features will be dropped. Using "first" drops the first item in the category, which is useful to avoid collinearity in linear models. | None |

Returns:

Type | Description |
---|---|
Preprocessor \| None | A preprocessor for processing the discrete columns. None is returned if the training features do not contain any string/category columns |
Source code in src\pyhbr\analysis\model.py
make_flag_preprocessor(X_train, drop=None)
Create a preprocessor for flag columns
Columns in the training features that are flags (bool + NaN) are represented using Int8 (because bool does not allow NaN). These columns are also one-hot encoded.
The ColumnTransformer transformer created from this preprocessor will be called "flag".
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X_train | DataFrame | The training features. | required |
drop | | The drop argument to be passed to OneHotEncoder. Default None means no features will be dropped. Using "first" drops the first item in the category, which is useful to avoid collinearity in linear models. | None |

Returns:

Type | Description |
---|---|
Preprocessor \| None | A preprocessor for processing the flag columns. None is returned if the training features do not contain any Int8 columns. |
Source code in src\pyhbr\analysis\model.py
make_float_preprocessor(X_train)
Create a preprocessor for float (numerical) columns
Columns in the training features that are numerical are encoded using float (to distinguish them from Int8, which is used for flags).
Missing values in these columns are imputed using the mean, then low variance columns are removed. The remaining columns are centered and scaled.
The ColumnTransformer transformer created from this preprocessor will be called "float".
Parameters:

Name | Type | Description | Default |
---|---|---|---|
X_train | DataFrame | The training features | required |

Returns:

Type | Description |
---|---|
Preprocessor \| None | A preprocessor for processing the float columns. None is returned if the training features do not contain any float columns. |
Source code in src\pyhbr\analysis\model.py
make_nearest_neighbours_cv(random_state, X_train, config)
Nearest neighbours classifier trained using cross validation
Parameters:

Name | Type | Description | Default |
---|---|---|---|
random_state | RandomState | Source of randomness for creating the model | required |
X_train | DataFrame | The training dataset containing all features for modelling | required |
config | dict[str, Any] | The dictionary of keyword arguments to configure the CV search. | required |

Returns:

Type | Description |
---|---|
Pipeline | The preprocessing and fitting pipeline. |
Source code in src\pyhbr\analysis\model.py
make_random_forest(random_state, X_train)
Make the random forest model
Parameters:

Name | Type | Description | Default |
---|---|---|---|
random_state | RandomState | Source of randomness for creating the model | required |
X_train | DataFrame | The training dataset containing all features for modelling | required |

Returns:

Type | Description |
---|---|
Pipeline | The preprocessing and fitting pipeline. |
Source code in src\pyhbr\analysis\model.py
make_random_forest_cv(random_state, X_train, config)
Random forest model trained using cross validation
Parameters:

Name | Type | Description | Default |
---|---|---|---|
random_state | RandomState | Source of randomness for creating the model | required |
X_train | DataFrame | The training dataset containing all features for modelling | required |
config | dict[str, Any] | The dictionary of keyword arguments to configure the CV search. | required |

Returns:

Type | Description |
---|---|
Pipeline | The preprocessing and fitting pipeline. |
Source code in src\pyhbr\analysis\model.py
make_trade_off(random_state, X_train, config)
Make the ARC HBR bleeding/ischaemia trade-off model
Parameters:

Name | Type | Description | Default |
---|---|---|---|
random_state | RandomState | Source of randomness for creating the model | required |
X_train | DataFrame | The training dataset containing all features for modelling | required |

Returns:

Type | Description |
---|---|
Pipeline | The preprocessing and fitting pipeline. |
Source code in src\pyhbr\analysis\model.py
make_xgboost_cv(random_state, X_train, config)
XGBoost model trained using cross validation
Parameters:

Name | Type | Description | Default |
---|---|---|---|
random_state | RandomState | Source of randomness for creating the model | required |
X_train | DataFrame | The training dataset containing all features for modelling | required |
config | dict[str, Any] | The dictionary of keyword arguments to configure the CV search. | required |

Returns:

Type | Description |
---|---|
Pipeline | The preprocessing and fitting pipeline. |
Source code in src\pyhbr\analysis\model.py
trade_off_model_bleeding_risk(features)
ARC-HBR bleeding part of the trade-off model
This function implements the bleeding model contained here https://pubmed.ncbi.nlm.nih.gov/33404627/. The numbers used below come from correspondence with the authors.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
features | DataFrame | Must contain age, smoking, copd, hb, egfr_x, oac. | required |

Returns:

Type | Description |
---|---|
Series | The bleeding risks as a Series. |
Source code in src\pyhbr\analysis\model.py
trade_off_model_ischaemia_risk(features)
ARC-HBR ischaemia part of the trade-off model
This function implements the ischaemia model contained here https://pubmed.ncbi.nlm.nih.gov/33404627/. The numbers used below come from correspondence with the authors.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
features | DataFrame | Must contain diabetes_before, smoking, | required |

Returns:

Type | Description |
---|---|
Series | The ischaemia risks as a Series. |
Source code in src\pyhbr\analysis\model.py
patient_viewer
get_patient_history(patient_id, hic_data)
Get a list of all this patient's episode data
Parameters:

Name | Type | Description | Default |
---|---|---|---|
patient_id | str | Which patient to fetch | required |
hic_data | HicData | Contains | required |

Returns:

Type | Description |
---|---|
| A table indexed by spell_id, episode_id, type (of clinical code) and clinical code position. |
Source code in src\pyhbr\analysis\patient_viewer.py
roc
ROC Curves
This file calculates the ROC curves of the bootstrapped models (for assessing ROC curve stability; see stability.py).
AucData
dataclass
Source code in src\pyhbr\analysis\roc.py
mean_resample_auc()
get_auc(probs, y_test)
Get the area under the ROC curves for the fitted models
Compute area under the ROC curve (AUC) for the model-under-test (the first column of probs), and the other bootstrapped models (other columns of probs).
Source code in src\pyhbr\analysis\roc.py
get_roc_curves(probs, y_test)
Get the ROC curves for the fitted models
Get the ROC curves for all models (whose probability predictions for the positive class are columns of probs) based on the outcomes in y_test. Rows of y_test correspond to rows of probs. The result is a list of pairs, one for each model (column of probs). Each pair contains the vector of x- and y-coordinates of the ROC curve.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
probs | DataFrame | The probabilities predicted by all the fitted models. The first column is the model-under-test (the training set), and the other columns are resamples of the training set. | required |
y_test | Series | The outcome data corresponding to each row of probs. | required |

Returns:

Type | Description |
---|---|
list[DataFrame] | A list of DataFrames, each of which contains one ROC curve, corresponding to the columns in probs. The columns of the DataFrames are |
Source code in src\pyhbr\analysis\roc.py
plot_roc_curves(ax, curves, auc, title='ROC-stability Curves')
Plot ROC curves of the model-under-test and resampled models
Plot the set of bootstrapped ROC curves (an instability plot), using the data in curves (a list of curves to plot). Assume that the first curve is the model-under-test (which is coloured differently).
The auc argument is an array where the first element is the AUC of the model under test, and the second element is the mean AUC of the bootstrapped models, and the third element is the standard deviation of the AUC of the bootstrapped models (these latter two measure stability). This argument is the output from get_bootstrapped_auc.
Source code in src\pyhbr\analysis\roc.py
stability
Assessing model stability
Model stability of an internally-validated model refers to how well models developed on a similar internal population agree with each other. The methodology for assessing model stability follows Riley and Collins, 2022 (https://arxiv.org/abs/2211.01061)
Assessing model stability is an end-to-end test of the entire model development process. Riley and Collins do not refer to a test/train split, but their method will be interpreted as applying to the training set (with instability measures assessed by applying models to the test set). As a result, the first step in the process is to split the internal dataset into a training set P0 and a test set T.
Assuming that a training set P0 is used to develop a model M0 using a model development process D (involving steps such as cross-validation and hyperparameter tuning in the training set, and validation of the accuracy of model predictions in the test set), the following steps are required to assess the stability of M0:
- Bootstrap resample P0 with replacement M >= 200 times, creating M new datasets Pm that are all the same size as P0
- Apply D to each Pm, to obtain M new models Mn which are all comparable with M0.
- Collect together the predictions from all Mn and compare them to the predictions from M0 for each sample in the test set T.
- From the data in 3, plot instability plots such as a scatter plot of M0 predictions on the x-axis and all the Mn predictions on the y-axis, for each sample of T. In addition, plot graphs of how all the model validation metrics vary as a function of the bootstrapped models Mn.
Implementation
A function is required that takes the original training set P0 and generates N bootstrapped resamples Pn that are the same size as P0.
A function is required that wraps the entire model into one call, taking as input the bootstrapped resample Pn and providing as output the bootstrapped model Mn. This function can then be called M times to generate the bootstrapped models. This function is not defined in this file (see the fit.py file)
An aggregating function will then take all the models Mn, the model-under-test M0, and the test set T, and make predictions using all the models for each sample in the test set. It should return all these predictions (probabilities) in a 2D array, where each row corresponds to a test-set sample, column 0 is the probability from M0, and columns 1 through M are the probabilities from each Mn.
This 2D array may be used as the basis of instability plots. Paired with information about the true outcomes y_test, this can also be used to plot ROC-curve variability (i.e. plotting the ROC curve for all model M0 and Mn on one graph). Any other accuracy metric of interest can be calculated from this information (i.e. for step 4 above).
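As a rough end-to-end sketch of this process on synthetic data (the names X0, X_test, y0 and y_test follow the documentation below; the pipeline is illustrative, and M=10 is only to keep the example fast, since M should be at least 200 in practice):
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> from pyhbr.analysis import stability
>>> rng = np.random.RandomState(0)
>>> X = pd.DataFrame(rng.rand(500, 3), columns=["a", "b", "c"])
>>> y = pd.Series(rng.randint(0, 2, 500))
>>> X0, X_test, y0, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
>>> model = Pipeline([("logreg", LogisticRegression())])
>>> fitted = stability.fit_model(model, X0, y0, M=10, random_state=rng)  # use M >= 200 in practice
>>> probs = stability.predict_probabilities(fitted, X_test)  # column 0 is M0, the rest are the Mn
>>> fig, ax = plt.subplots()
>>> stability.plot_instability(ax, probs, y_test)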
FittedModel
dataclass
Stores a model fitted to a training set and resamples of the training set.
Source code in src\pyhbr\analysis\stability.py
flatten()
Get a flat list of all the models
Returns:
Type | Description |
---|---|
list[Pipeline]
|
The list of fitted models, with M0 at the front |
Resamples
dataclass
Store a training set along with M resamples of it
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X0 |
DataFrame
|
The matrix of predictors |
required |
Y0 |
DataFrame
|
The matrix of outcomes (one column per outcome) |
required |
Xm |
list[DataFrame]
|
A list of resamples of the predictors |
required |
Ym |
list[DataFrame]
|
A list of resamples of the outcomes |
required |
Source code in src\pyhbr\analysis\stability.py
absolute_instability(probs)
Get a list of the absolute percentage-point differences
Compare the primary model to the bootstrap models by flattening all the bootstrap model estimates and calculating the absolute difference between the primary model estimate and the bootstraps. Results are expressed in percentage points.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
probs |
DataFrame
|
First column is primary model risk estimates, other columns are bootstrap model estimates. |
required |
Returns:
Type | Description |
---|---|
Series
|
A Series of absolute percentage-point discrepancies between the primary model predictions and the bootstrap estimates. |
Source code in src\pyhbr\analysis\stability.py
average_absolute_instability(probs)
Get the average absolute error between primary model and bootstrap estimates.
This function computes the average of the absolute differences between the risks estimated by the primary model and the risks estimated by the bootstrap models. For example, if the primary model estimates 1% and two bootstrap models estimate 2% and 3%, the result is an error of 1.5%.
Expressed differently, the function calculates the average percentage-point difference between the model under test and bootstrap models.
Using the absolute error instead of the relative error is more useful in practice, because it does not inflate errors between very small risks. Since most risks are below about 20%, and risk thresholds are of the order of 5%, an absolute risk difference is easier to interpret.
Further granularity in the variability of risk estimates as a function of risk is obtained by looking at the instability box plot.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
probs |
DataFrame
|
The table of risks estimated by the models. The first column is the model under test, and the other columns are bootstrap models. |
required |
Returns:
Type | Description |
---|---|
dict[str, float]
|
A mean and confidence interval for the estimate. The units are percent. |
Source code in src\pyhbr\analysis\stability.py
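A minimal sketch of the worked example above (a single row of risk estimates; the column names are illustrative only):
>>> import pandas as pd
>>> from pyhbr.analysis import stability
>>> # Primary model estimates 1%; two bootstrap models estimate 2% and 3%
>>> probs = pd.DataFrame({"m0": [0.01], "m1": [0.02], "m2": [0.03]})
>>> result = stability.average_absolute_instability(probs)
>>> # The mean error should be 1.5 percentage points, as described above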
fit_model(model, X0, y0, M, random_state)
Fit a model to a training set and resamples of the training set.
Use the unfitted model pipeline to:
- Fit a model to the training set (X0, y0)
- Fit a model to M resamples (Xm, ym) of the training set
The model is an unfitted scikit-learn Pipeline. Note that if a RandomState is used when specifying the model, then the models used to fit the resamples here will be statistical clones (i.e. they might not necessarily produce the same result on the same data). clone() is called on model before fitting, so each fit gets a new clean object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model |
Pipeline
|
An unfitted scikit-learn pipeline, which is used as the basis for all the fits. Each fit calls clone() on this object before fitting, to get a new model with clean parameters. The cloned fitted models are then stored in the returned fitted model. |
required |
X0 |
DataFrame
|
The training set features |
required |
y0 |
Series
|
The training set outcome |
required |
M |
int
|
How many resamples to take from the training set (ideally >= 200) |
required |
random_state |
RandomState
|
The source of randomness for model fitting |
required |
Returns:
Type | Description |
---|---|
FittedModel
|
An object containing the model fitted on (X0,y0) and all (Xm,ym) |
Source code in src\pyhbr\analysis\stability.py
get_average_instability(probs)
Instability is the extent to which the bootstrapped models give a different prediction from the model under test. The average instability is an average of the SMAPE between the prediction of the model-under-test and the predictions of each of the other bootstrap models (i.e. pairing the model-under-test with a single bootstrapped model gives one SMAPE value, and these values are averaged over all the bootstrap models).
SMAPE is preferable to the mean relative error, because the latter diverges when the prediction from the model-under-test is very small. It may be better still to use the log of the accuracy ratio (see https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error), since the probabilities are all positive, or some other metric designed specifically for comparing probabilities.
Testing: not yet tested
Source code in src\pyhbr\analysis\stability.py
get_reclass_probabilities(probs, y_test, threshold)
Get the probability of risk reclassification for each patient
Parameters:
Name | Type | Description | Default |
---|---|---|---|
probs |
DataFrame
|
The matrix of probabilities from the model-under-test (first column) and the bootstrapped models (subsequent models). |
required |
y_test |
Series
|
The true outcome corresponding to each row of the probs matrix. This is used to colour the points based on whether the outcome occurred or not. |
required |
threshold |
float
|
The risk level at which a patient is considered high risk |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A table containing columns "original_risk", "unstable_prob", and "outcome". |
Source code in src\pyhbr\analysis\stability.py
make_bootstrapped_resamples(X0, y0, M, random_state)
Make M resamples of the training data
Makes M bootstrapped resamples of a training set (X0,y0). M should be at least 200 (as per recommendation).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X0 |
DataFrame
|
The features in the training set to be resampled |
required |
y0 |
DataFrame
|
The outcome in the training set to be resampled. Can have multiple columns (corresponding to different outcomes). |
required |
M |
int
|
How many resamples to take |
required |
random_state |
RandomState
|
Source of randomness for resampling |
required |
Raises:
Type | Description |
---|---|
ValueError
|
If the number of rows in X0 and y0 do not match |
Returns:
Type | Description |
---|---|
Resamples
|
An object containing the original training set and the resamples. |
Source code in src\pyhbr\analysis\stability.py
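A minimal usage sketch (the column names are illustrative; M=3 is only for brevity, since at least 200 resamples are recommended):
>>> import numpy as np
>>> import pandas as pd
>>> from pyhbr.analysis import stability
>>> X0 = pd.DataFrame({"f1": range(10), "f2": range(10)})
>>> y0 = pd.DataFrame({"bleeding": [0, 1] * 5})
>>> rs = stability.make_bootstrapped_resamples(X0, y0, M=3, random_state=np.random.RandomState(1))
>>> len(rs.Xm)  # three resamples, each the same size as X0
3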
plot_instability(ax, probs, y_test, title='Probability stability')
Plot the instability of risk predictions
This function plots a scatter graph with one point per value in the test set (row of probs), where the x-axis is the value from the model under test (the first column of probs), and the y-axis is every other probability predicted by the bootstrapped models Mn (the other columns of probs). The predictions from the model-under-test correspond to the straight line at 45 degrees through the origin.
For a stable model M0, the scattered points should be close to the M0 line, indicating that the bootstrapped models Mn broadly agree with the predictions made by M0.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax |
Axes
|
The axes on which to plot the risks |
required |
probs |
DataFrame
|
The matrix of probabilities from the model-under-test (first column) and the bootstrapped models (subsequent models). |
required |
y_test |
Series
|
The true outcome corresponding to each row of the probs matrix. This is used to colour the points based on whether the outcome occurred or not. |
required |
title |
The title to place on the axes. |
'Probability stability'
|
Source code in src\pyhbr\analysis\stability.py
plot_reclass_instability(ax, probs, y_test, threshold, title='Stability of Risk Class')
Plot the probability of reclassification by predicted risk
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax |
Axes
|
The axes on which to draw the plot |
required |
probs |
DataFrame
|
The matrix of probabilities from the model-under-test (first column) and the bootstrapped models (subsequent models). |
required |
y_test |
Series
|
The true outcome corresponding to each row of the probs matrix. This is used to colour the points based on whether the outcome occurred or not. |
required |
threshold |
float
|
The risk level at which a patient is considered high risk |
required |
title |
str
|
The plot title. |
'Stability of Risk Class'
|
Source code in src\pyhbr\analysis\stability.py
plot_stability_analysis(ax, outcome_name, probs, y_test, high_risk_thresholds)
Plot the two stability plots
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax |
Axes
|
The axes on which to plot the graphs (must contain two axes, one for each plot). |
required |
outcome_name |
str
|
One of "bleeding" or "ischaemia" |
required |
probs |
DataFrame
|
The model predictions. The first column is the model-under-test, and the other columns are the bootstrap model predictions. |
required |
y_test |
DataFrame
|
The outcomes table, with columns for "bleeding" and "ischaemia". |
required |
high_risk_thresholds |
dict[str, float]
|
Map containing the vertical risk prediction threshold for "bleeding" and "ischaemia". |
required |
Source code in src\pyhbr\analysis\stability.py
predict_probabilities(fitted_model, X_test)
Predict outcome probabilities using the fitted models on the test set
Aggregating function which finds the predicted probability from the model-under-test M0 and all the bootstrapped models Mn on each sample of the test set features X_test. The result is a 2D numpy array, where each row corresponds to a test-set sample, the first column is the predicted probabilities from M0, and the following M columns are the predictions from all the other Mn.
Note: the numbers in the matrix are the probabilities of 1 in the test set y_test.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fitted_model |
FittedModel
|
The model fitted on the training set and resamples |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A table of probabilities of the positive outcome, where each column comes from a different model. Column zero corresponds to the model fitted on the training set, and the other columns are from the resamples. The index for the DataFrame is the same as X_test |
Source code in src\pyhbr\analysis\stability.py
clinical_codes
Contains utilities for clinical code groups
Category
dataclass
Code/categories struct
Attributes:
Name | Type | Description |
---|---|---|
name |
str
|
The name of the category (e.g. I20) or clinical code (I20.1) |
docs |
str
|
The description of the category or code |
index |
str | tuple[str, str]
|
Used to sort a list of Categories |
categories |
list[Category] | None
|
For a category, the list of sub-categories contained. None for a code. |
exclude |
set[str] | None
|
Contains code groups which do not contain any members from this category or any of its sub-categories. |
Source code in src\pyhbr\clinical_codes\__init__.py
excludes(group)
Check if this category excludes a code group
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group |
str
|
The string name of the group to check |
required |
Returns:
Type | Description |
---|---|
bool
|
True if the group is excluded; False otherwise |
Source code in src\pyhbr\clinical_codes\__init__.py
is_leaf()
Check if the category is a leaf node
Returns:
Type | Description |
---|---|
True if leaf node (i.e. clinical code), false otherwise |
ClinicalCode
dataclass
Store a clinical code together with its description.
Attributes:
Name | Type | Description |
---|---|---|
name |
str
|
The code itself, e.g. "I21.0" |
docs |
str
|
The code description, e.g. "Acute transmural myocardial infarction of anterior wall" |
Source code in src\pyhbr\clinical_codes\__init__.py
normalise()
Return the name without whitespace/dots, as lowercase
See the documentation for normalize_code().
Returns:
Type | Description |
---|---|
The normalized form of this clinical code |
Source code in src\pyhbr\clinical_codes\__init__.py
ClinicalCodeTree
dataclass
Code definition file structure
Source code in src\pyhbr\clinical_codes\__init__.py
codes_in_group(group)
Get the clinical codes in a code group
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group |
str
|
The group to fetch |
required |
Raises:
Type | Description |
---|---|
ValueError
|
Raised if the requested group does not exist |
Returns:
Type | Description |
---|---|
list[ClinicalCode]
|
The list of clinical codes in the group |
Source code in src\pyhbr\clinical_codes\__init__.py
codes_in_any_group(codes)
Get a DataFrame of all the codes in any group in a codes file
Returns a table with the normalised code (lowercase, no whitespace, no dots) in the column code, and the group containing the code in the column group.
All codes which are in any group will be included.
Codes will be duplicated if they appear in more than one group.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
codes |
ClinicalCodeTree
|
The tree clinical codes (e.g. ICD-10 or OPCS-4, loaded from a file) to search for codes |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
All codes in any group in the codes file |
Source code in src\pyhbr\clinical_codes\__init__.py
filter_to_groups(codes_table, codes)
Filter a table of raw clinical codes to only keep codes in groups
Use this function to drop clinical codes which are not of interest, and convert all codes to normalised form (lowercase, no whitespace, no dot).
This function is tested on the HIC dataset, but should be modifiable for use with any data source returning diagnoses and procedures as separate tables in long format. Consider modifying the columns of codes_table that are contained in the output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
codes_table |
DataFrame
|
Either a diagnoses or procedures table. For this function to work, it needs:
|
required |
codes |
ClinicalCodeTree
|
The clinical codes object (previously loaded from a file) containing code groups to use. |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A table containing the episode ID, the clinical code (normalised), the group containing the code, and the code position. |
Source code in src\pyhbr\clinical_codes\__init__.py
get_code_groups(diagnosis_codes, procedure_codes)
Get a table of any diagnosis/procedure code which is in a code group
This function converts the code tree formats into a simple table containing normalised codes (lowercase, no dot), the documentation string for the code, what group the code is in, and whether it is a diagnosis or procedure code
Parameters:
Name | Type | Description | Default |
---|---|---|---|
diagnosis_codes |
ClinicalCodeTree
|
The tree of diagnosis codes |
required |
procedure_codes |
ClinicalCodeTree
|
The tree of procedure codes |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A table with columns |
Source code in src\pyhbr\clinical_codes\__init__.py
get_codes_in_group(group, categories)
Helper function to get clinical codes in a group
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group |
str
|
The group to fetch |
required |
categories |
list[Category]
|
The list of categories to search for codes |
required |
Returns:
Type | Description |
---|---|
list[ClinicalCode]
|
A list of clinical codes in the group |
Source code in src\pyhbr\clinical_codes\__init__.py
load_from_file(path)
Load a clinical codes file relative to the working directory
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path |
str
|
The path to the codes file relative to the current working directory. |
required |
Returns:
Type | Description |
---|---|
ClinicalCodeTree
|
The contents of the file |
Source code in src\pyhbr\clinical_codes\__init__.py
load_from_package(name)
Load a clinical codes file from the pyhbr package.
The clinical codes are stored in yaml format, and this function returns a dictionary corresponding to the structure of the yaml file.
Examples:
>>> import pyhbr.clinical_codes as codes
>>> tree = codes.load_from_package("icd10_test.yaml")
>>> group = tree.codes_in_group("group_1")
>>> [code.name for code in group]
['I20.0', 'I20.1', 'I20.8', 'I20.9']
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
The file name of the codes file to load |
required |
Returns:
Type | Description |
---|---|
ClinicalCodeTree
|
The contents of the file. |
Source code in src\pyhbr\clinical_codes\__init__.py
normalise_code(code)
Remove whitespace/dots, and convert to lower-case
The format of clinical codes can vary across different data sources. A simple way to compare codes is to convert them into a common format and compare them as strings. The purpose of this function is to define the common format, which uses all lower-case letters, does not contain any dots, and does not include any leading/trailing whitespace.
Comparing codes for equality does not immediately allow checking whether one code is a sub-category of another. It also ignores clinical code annotations such as dagger/asterisk.
Examples:
A sketch of the expected behaviour, based on the description above:
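>>> from pyhbr.clinical_codes import normalise_code
>>> normalise_code(" I20.1 ")
'i201'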
Parameters:
Name | Type | Description | Default |
---|---|---|---|
code |
str
|
The raw code, e.g. |
required |
Returns:
Type | Description |
---|---|
str
|
The normalised form of the clinical code |
Source code in src\pyhbr\clinical_codes\__init__.py
codes_editor
Edit groups of ICD-10 and OPCS-4 codes
codes_editor
run_app()
Run the main codes editor application
Source code in src\pyhbr\clinical_codes\codes_editor\codes_editor.py
counting
Utilities for counting clinical codes satisfying conditions
count_code_groups(index_spells, filtered_episodes)
Count the number of matching codes relative to index episodes
This function counts the rows for each index spell ID in the output of filter_by_code_groups, and adds 0 for any index spell ID without any matching rows in filtered_episodes.
The intent is to count the number of codes (one per row) that matched filter conditions in other episodes with respect to the index spell.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index_spells |
DataFrame
|
The index spells, which provides the list of spell IDs of interest. The output will be NA for any spell ID that does not have any matching rows in filtered_episodes. |
required |
filtered_episodes |
DataFrame
|
The output from filter_by_code_groups, which produces a table where each row represents a matching code. |
required |
Returns:
Type | Description |
---|---|
Series
|
How many codes (rows) occurred for each index spell |
Source code in src\pyhbr\clinical_codes\counting.py
count_events(index_spells, events, event_name)
Count the occurrences (rows) of an event given in long format.
The input table (events) contains instances of events, one per row, where event_name is the name of a string column labelling the events. The table also contains a spell_id column, which may be associated with multiple rows.
The function pivots the events so that there is one row per spell, each event has its own column, and the table contains the total number of each event associated with the spell.
The index_spells table is required because some index spells may have no events. These index spells will have a row of zeros in the output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index_spells |
DataFrame
|
Must have Pandas index |
required |
events |
DataFrame
|
Contains a |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A table of the counts for each event (one event per column), with
Pandas index |
Source code in src\pyhbr\clinical_codes\counting.py
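A minimal sketch on synthetic data (the spell_id index name and the "event" column name are illustrative assumptions, not part of the API):
>>> import pandas as pd
>>> from pyhbr.clinical_codes import counting
>>> index_spells = pd.DataFrame(index=pd.Index(["s1", "s2"], name="spell_id"))
>>> events = pd.DataFrame({"spell_id": ["s1", "s1", "s2"], "event": ["bleeding", "bleeding", "ischaemia"]})
>>> counts = counting.count_events(index_spells, events, "event")
>>> # One row per spell; columns "bleeding" and "ischaemia" hold the totals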
get_all_other_codes(index_spells, episodes, codes)
For each patient, get clinical codes in other episodes before/after the index
This makes a table of index episodes (which is the first episode of the index spell) along with all other episodes for a patient. Two columns, index_episode_id and other_episode_id, identify the two episodes for each row (they may be equal), and other information is stored such as the time of the base episode, the time to the other episode, and clinical code information for the other episode.
This table is used as the basis for all processing involving counting codes before and after an episode.
Note
Episodes will not be included in the result if they do not have any clinical codes that are in any code group.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index_spells |
DataFrame
|
Contains |
required |
episodes |
DataFrame
|
Contains |
required |
codes |
DataFrame
|
Contains |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A table containing columns |
Source code in src\pyhbr\clinical_codes\counting.py
get_time_window(time_diff_table, window_start, window_end, time_diff_column='time_to_other_episode')
Get events that occurred in a time window with respect to a base event
Use the time_diff_column column to filter the time_diff_table to just those that occurred between window_start and window_end with respect to the base. For example, rows can represent an index episode paired with other episodes, with the time_diff_column representing the time to the other episode.
The arguments window_start and window_end control the minimum and maximum values for the time difference. Use positive values for a window after the base event, and use negative values for a window before the base event.
Events on the boundary of the window are included.
Note that the base event itself will be included as a row if window_start is negative and window_end is positive.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
time_diff_table |
DataFrame
|
Table containing at least the |
required |
window_start |
timedelta
|
The smallest value of |
required |
window_end |
timedelta
|
The largest value of |
required |
time_diff_column |
str
|
The name of the column containing the time difference, which is positive for an event occurring after the base event. |
'time_to_other_episode'
|
Returns:
Type | Description |
---|---|
DataFrame
|
The rows within the specific time window |
Source code in src\pyhbr\clinical_codes\counting.py
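For example, to keep rows between one year before and 30 days before the base event (a minimal sketch with a synthetic table; in practice the input would come from get_all_other_codes):
>>> import pandas as pd
>>> from datetime import timedelta
>>> from pyhbr.clinical_codes import counting
>>> table = pd.DataFrame({"time_to_other_episode": [timedelta(days=-400), timedelta(days=-100), timedelta(days=10)]})
>>> counting.get_time_window(table, timedelta(days=-365), timedelta(days=-30))  # keeps only the -100 day row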
common
Common utilities for other modules.
A collection of routines used by the data source or analysis functions.
CheckedTable
Wrapper for sqlalchemy table with checks for table/columns
Source code in src\pyhbr\common.py
__init__(table_name, engine, schema='dbo')
Get a CheckedTable by reading from the remote server
This is a wrapper around the sqlalchemy Table for catching errors when accessing columns through the c attribute.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name |
str
|
The name of the table whose metadata should be retrieved |
required |
engine |
Engine
|
The database connection |
required |
Returns:
Type | Description |
---|---|
None
|
The table data for use in SQL queries |
Source code in src\pyhbr\common.py
col(column_name)
Get a column
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_name |
str
|
The name of the column to fetch. |
required |
Raises:
Type | Description |
---|---|
RuntimeError
|
Thrown if the column does not exist |
Source code in src\pyhbr\common.py
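A usage sketch (the table and column names here are hypothetical, and a live database connection is required):
>>> from pyhbr.common import CheckedTable, make_engine
>>> engine = make_engine(database="hic_cv_test")
>>> table = CheckedTable("cv1_episodes", engine)  # hypothetical table name
>>> column = table.col("episode_id")  # raises RuntimeError if the column does not exist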
chunks(patient_ids, n)
Divide a list of patient ids into n-sized chunks
The last chunk may be shorter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
patient_ids |
list[str]
|
The List of IDs to chunk |
required |
n |
int
|
The chunk size. |
required |
Returns:
Type | Description |
---|---|
list[list[str]]
|
A list containing chunks (list) of patient IDs |
Source code in src\pyhbr\common.py
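For example (the IDs are illustrative):
>>> from pyhbr.common import chunks
>>> chunks(["p1", "p2", "p3", "p4", "p5"], 2)
[['p1', 'p2'], ['p3', 'p4'], ['p5']]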
current_commit()
Get current commit.
Returns:
Type | Description |
---|---|
str
|
Get the first 12 characters of the current commit, using the first repository found above the current working directory. If the working directory is not in a git repository, return "nogit". |
Source code in src\pyhbr\common.py
current_timestamp()
Get the current timestamp.
Returns:
Type | Description |
---|---|
int
|
The current timestamp (since epoch) rounded to the nearest second. |
get_data(engine, query, *args)
Convenience function to make a query and fetch data.
Wraps a function like hic.demographics_query with a call to pd.read_sql.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
The database connection |
required |
query |
Callable[[Engine, ...], Select]
|
A function returning a sqlalchemy Select statement |
required |
*args |
...
|
Positional arguments to be passed to query in addition to engine (which is passed first). Make sure they are passed in the same order expected by the query function. |
()
|
Returns:
Type | Description |
---|---|
DataFrame
|
The pandas dataframe containing the SQL data |
Source code in src\pyhbr\common.py
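A usage sketch (requires a configured DSN and access to the database; the dates are illustrative):
>>> from datetime import date
>>> from pyhbr import common
>>> from pyhbr.data_source import hic
>>> engine = common.make_engine()
>>> demographics = common.get_data(engine, hic.demographics_query)
>>> episodes = common.get_data(engine, hic.episodes_query, date(2020, 1, 1), date(2021, 1, 1))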
get_data_by_patient(engine, query, patient_ids, *args)
Fetch data using a query restricted by patient ID
The patient_ids list is chunked into batches of 2000 to fit within an SQL IN clause, and each chunk is run as a separate query. The results are assembled into a single DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
The database connection |
required |
query |
Callable[[Engine, ...], Select]
|
A function returning a sqlalchemy Select statement. Must take a list[str] as an argument after engine. |
required |
patient_ids |
list[str]
|
A list of patient IDs to restrict the query. |
required |
*args |
...
|
Further positional arguments that will be passed to the query function after the patient_ids positional argument. |
()
|
Returns:
Type | Description |
---|---|
list[DataFrame]
|
A list of dataframes, one corresponding to each chunk. |
Source code in src\pyhbr\common.py
get_saved_files_by_name(name, save_dir, extension)
Get all saved data files matching name
Get the list of files in the save_dir folder matching name. Return the result as a table of file path, commit hash, and saved date. The table is sorted by timestamp, with the most recent file first.
Raises:
Type | Description |
---|---|
RuntimeError
|
If save_dir does not exist, or there are files in save_dir with invalid file names (not in the format name_commit_timestamp.pkl). |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
The name of the saved file to load. This matches name in the filename name_commit_timestamp.pkl. |
required |
save_dir |
str
|
The directory to search for files. |
required |
extension |
str
|
What file extension to look for. Do not include the dot. |
required |
Returns:
Type | Description |
---|---|
DataFrame
|
A dataframe with columns |
Source code in src\pyhbr\common.py
load_exact_item(name, save_dir='save_data')
Load a previously saved item (pickle) from file by exact filename
This is similar to load_item, but loads the exact filename given by name instead of looking for the most recent file. name must contain the commit, timestamp, and file extension.
A RuntimeError is raised if the file does not exist.
To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
The name of the item to load |
required |
save_dir |
Which folder to load the item from. |
required |
Returns:
Type | Description |
---|---|
Any
|
The data item loaded. |
Source code in src\pyhbr\common.py
load_item(name, interactive=False, save_dir='save_data')
Load a previously saved item (pickle) from file
Use this function to load a file that was previously saved using save_item(). By default, the latest version of the item will be returned (the one with the most recent timestamp).
None is returned if an interactive load is cancelled by the user.
To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
The name of the item to load |
required |
interactive |
bool
|
If True, let the user pick which item version to load interactively. If False, non-interactively load the most recent item (i.e. with the most recent timestamp). The commit hash is not considered when loading the item. |
False
|
save_dir |
Which folder to load the item from. |
required |
Returns:
Type | Description |
---|---|
(Any, Path)
|
A tuple, with the python object loaded from file as first element and the Path to the item as the second element, or None if the user cancelled an interactive load. |
Source code in src\pyhbr\common.py
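A round-trip sketch with save_item (this writes a pickle file under save_data/ in the current working directory):
>>> import pandas as pd
>>> from pyhbr import common
>>> df = pd.DataFrame({"x": [1, 2, 3]})
>>> common.save_item(df, "my_data", enforce_clean_branch=False)
>>> item, path = common.load_item("my_data")  # loads the most recent "my_data" item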
load_most_recent_data_files(analysis_name, save_dir)
Load the most recent timestamp data file matching the analysis name
The data file is a pickle of a dictionary, containing pandas DataFrames and other metadata. It is expected to contain a "raw_file" key, which contains the path to the associated raw data file.
Both files are loaded, and a tuple of all the data is returned
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analysis_name |
str
|
The "analysis_name" key from the config file, which is the filename prefix |
required |
save_dir |
str
|
The folder to load the data from |
required |
Returns:
Type | Description |
---|---|
(dict[str, Any], dict[str, Any], str)
|
(data, raw_data, data_path). data and raw_data are dictionaries containing (mainly) Pandas DataFrames, and data_path is the path to the data file (this can be stored in any output products from this script to record which data file was used to generate the data). |
Source code in src\pyhbr\common.py
make_engine(con_string='mssql+pyodbc://dsn', database='hic_cv_test')
Make a sqlalchemy engine
This function is intended for use with Microsoft SQL Server. The preferred method to connect to the server on Windows is to use a Data Source Name (DSN). To use the default connection string argument, set up a data source name called "dsn" using the program "ODBC Data Sources".
If you need to access multiple different databases on the same server, you will need different engines. Specify the database name while creating the engine (this will override a default database in the DSN, if there is one).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con_string |
str
|
The sqlalchemy connection string. |
'mssql+pyodbc://dsn'
|
database |
str
|
The database name to connect to. |
'hic_cv_test'
|
Returns:
Type | Description |
---|---|
Engine
|
The sqlalchemy engine |
Source code in src\pyhbr\common.py
make_new_save_item_path(name, save_dir, extension)
Make the path to save a new item to the save_dir
The name will have the format name_{current_commit}_{timestamp}.{extension}.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
The base name for the new filename |
required |
save_dir |
str
|
The folder in which to place the item |
required |
extension |
str
|
The file extension (omit the dot) |
required |
Returns:
Type | Description |
---|---|
Path
|
The relative path to the new object to be saved |
Source code in src\pyhbr\common.py
mean_confidence_interval(data, confidence=0.95)
Compute the confidence interval around the mean
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data |
Series
|
A series of numerical values to compute the confidence interval. |
required |
confidence |
float
|
The confidence interval to compute. |
0.95
|
Returns:
Type | Description |
---|---|
dict[str, float]
|
A map containing the keys "mean", "lower", and "upper". The latter keys contain the confidence interval limits. |
Source code in src\pyhbr\common.py
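A minimal sketch:
>>> import pandas as pd
>>> from pyhbr.common import mean_confidence_interval
>>> data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
>>> ci = mean_confidence_interval(data, confidence=0.95)
>>> # ci contains the keys "mean", "lower" and "upper"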
median_to_string(instability, unit='%')
Convert the median-quartile DataFrame to a String
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instability |
DataFrame
|
Table containing three rows, indexed by 0.5 (median), 0.25 (lower quartile) and 0.75 (upper quartile). |
required |
unit |
What units to add to the values in the string. |
'%'
|
Returns:
Type | Description |
---|---|
str
|
A string containing the median, and the lower and upper quartiles. |
Source code in src\pyhbr\common.py
pick_most_recent_saved_file(name, save_dir, extension='pkl')
Get the path to the most recent file matching name.
Like pick_saved_file_interactive, but automatically selects the most recent file in save_dir.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
The name of the saved file to list |
required |
save_dir |
str
|
The directory to search for files |
required |
extension |
str
|
What file extension to look for. Do not include the dot. |
'pkl'
|
Returns:
Type | Description |
---|---|
Path
|
The relative path to the most recent matching file. |
Source code in src\pyhbr\common.py
pick_saved_file_interactive(name, save_dir, extension='pkl')
Select a file matching name interactively
Print a list of the saved items in the save_dir folder, along with the date and time it was generated, and the commit hash, and let the user pick which item should be loaded interactively. The full filename of the resulting file is returned, which can then be read by the user.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name |
str
|
The name of the saved file to list |
required |
save_dir |
str
|
The directory to search for files |
required |
extension |
str
|
What file extension to look for. Do not include the dot. |
'pkl'
|
Returns:
Type | Description |
---|---|
str | None
|
The absolute path to the interactively selected file, or None if the interactive load was aborted. |
Source code in src\pyhbr\common.py
query_yes_no(question, default='yes')
Ask a yes/no question via raw_input() and return their answer.
From https://stackoverflow.com/a/3041990.
"question" is a string that is presented to the user.
"default" is the presumed answer if the user just hits
The "answer" return value is True for "yes" or False for "no".
Source code in src\pyhbr\common.py
read_config_file(yaml_path)
Read the configuration file from the given path
Parameters:
Name | Type | Description | Default |
---|---|---|---|
yaml_path |
str
|
The path to the experiment config file |
required |
Source code in src\pyhbr\common.py
requires_commit()
Check whether changes need committing
To make most effective use of the commit hash stored with a save_item call, the current branch should be clean (all changes committed). Call this function to check.
Returns False if there is no git repository.
Returns:
Type | Description |
---|---|
bool
|
True if the working directory is in a git repository that requires a commit; False otherwise. |
Source code in src\pyhbr\common.py
save_item(item, name, save_dir='save_data/', enforce_clean_branch=True, prompt_commit=False)
Save an item to a pickle file
Saves a python object (e.g. a pandas DataFrame) dataframe in the save_dir folder, using a filename that includes the current timestamp and the current commit hash. Use load_item to retrieve the file.
Important
Ensure that save_data/ (or your chosen save_dir) is added to the .gitignore of your repository so that sensitive data is not committed.
By storing the commit hash and timestamp, it is possible to identify when items were created and what code created them. To make most effective use of the commit hash, ensure that you commit, and do not make any further code edits, before running a script that calls save_item (otherwise the commit hash will not quite reflect the state of the running code).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item |
Any
|
The python object to save (e.g. pandas DataFrame) |
required |
name |
str
|
The name of the item. The filename will be created by adding
a suffix for the current commit and the timestamp to show when the
data was saved (format: |
required |
save_dir |
str
|
Where to save the data, relative to the current working directory. The directory will be created if it does not exist. |
'save_data/'
|
enforce_clean_branch |
If True, the function will raise an exception if an attempt is made to save an item when the repository has uncommitted changes. |
True
|
|
prompt_commit |
If enforce_clean_branch is True, choose whether to prompt the user to commit on an unclean branch. This can help avoid losing the results of a long-running script. Prefer False if the script is cheap to run. |
False
|
Source code in src\pyhbr\common.py
data_source
Routines for fetching data from sources.
This module is intended to interface to the data source, and should be modified to port this package to new SQL databases.
hic
SQL queries and functions for HIC (v3, UHBW) data.
Most data available in the HIC tables is fetched in the queries below, apart from columns which are all-NULL, provide keys/IDs that will not be used, or provide duplicate information (e.g. duplicated in two tables).
demographics_query(engine)
Get demographic information from HIC data
The date/time at which the data was obtained is not stored in the table, but patient age can be computed from the date of the episode under consideration and the year_of_birth in this table.
The underlying table does have a cause_of_death column, but it is all null, so it is not included.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve the demographics table |
Source code in src\pyhbr\data_source\hic.py
diagnoses_query(engine)
Get the diagnoses corresponding to episodes
This should be linked to the episodes table to obtain information about the diagnoses in the episode.
Diagnoses are encoded using ICD-10 codes, and the position column contains the order of diagnoses in the episode (1-indexed).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve diagnoses table |
Source code in src\pyhbr\data_source\hic.py
episodes_query(engine, start_date, end_date)
Get the episodes list in the HIC data
This table does not contain any episode information, just a patient and an episode id for linking to diagnosis and procedure information in other tables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
start_date |
date
|
first valid consultant-episode start date |
required |
end_date |
date
|
last valid consultant-episode start date |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve episodes table |
Source code in src\pyhbr\data_source\hic.py
pathology_blood_query(engine, investigations)
Get the table of blood test results in the HIC data
Since blood tests in this table are not associated with an episode directly by key, it is necessary to link them based on the patient identifier and date. This operation can be quite slow if the blood tests table is large. One way to reduce the size is to filter by investigation using the investigations parameter. The investigation codes in the HIC data are shown below:
investigation |
Description |
---|---|
OBR_BLS_UL | LFT |
OBR_BLS_UE | UREA,CREAT + ELECTROLYTES |
OBR_BLS_FB | FULL BLOOD COUNT |
OBR_BLS_UT | THYROID FUNCTION TEST |
OBR_BLS_TP | TOTAL PROTEIN |
OBR_BLS_CR | C-REACTIVE PROTEIN |
OBR_BLS_CS | CLOTTING SCREEN |
OBR_BLS_FI | FIB-4 |
OBR_BLS_AS | AST |
OBR_BLS_CA | CALCIUM GROUP |
OBR_BLS_TS | TSH AND FT4 |
OBR_BLS_FO | SERUM FOLATE |
OBR_BLS_PO | PHOSPHATE |
OBR_BLS_LI | LIPID PROFILE |
OBR_POC_VG | POCT BLOOD GAS VENOUS SAMPLE |
OBR_BLS_HD | HDL CHOLESTEROL |
OBR_BLS_FT | FREE T4 |
OBR_BLS_FE | SERUM FERRITIN |
OBR_BLS_GP | ELECTROLYTES NO POTASSIUM |
OBR_BLS_CH | CHOLESTEROL |
OBR_BLS_MG | MAGNESIUM |
OBR_BLS_CO | CORTISOL |
Each test is similarly encoded. The valid test codes in the full blood count and U+E investigations are shown below:
investigation |
test |
Description |
---|---|---|
OBR_BLS_FB | OBX_BLS_NE | Neutrophils |
OBR_BLS_FB | OBX_BLS_PL | Platelets |
OBR_BLS_FB | OBX_BLS_WB | White Cell Count |
OBR_BLS_FB | OBX_BLS_LY | Lymphocytes |
OBR_BLS_FB | OBX_BLS_MC | MCV |
OBR_BLS_FB | OBX_BLS_HB | Haemoglobin |
OBR_BLS_FB | OBX_BLS_HC | Haematocrit |
OBR_BLS_UE | OBX_BLS_NA | Sodium |
OBR_BLS_UE | OBX_BLS_UR | Urea |
OBR_BLS_UE | OBX_BLS_K | Potassium |
OBR_BLS_UE | OBX_BLS_CR | Creatinine |
OBR_BLS_UE | OBX_BLS_EP | eGFR/1.73m2 (CKD-EPI) |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
investigations |
list[str]
|
Which types of laboratory test to include in the query. Fetching fewer types of test makes the query faster. |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve blood tests table |
Source code in src\pyhbr\data_source\hic.py
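For example, to fetch only the full blood count and urea/electrolytes investigations (a sketch; requires access to the database):
>>> from pyhbr import common
>>> from pyhbr.data_source import hic
>>> engine = common.make_engine()
>>> blood = common.get_data(engine, hic.pathology_blood_query, ["OBR_BLS_FB", "OBR_BLS_UE"])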
pharmacy_prescribing_query(engine, table_name='cv1_pharmacy_prescribing')
Get medicines prescribed to patients over time
This table contains information about medicines prescribed to patients, identified by patient and time (i.e. it is not associated to an episode). The information includes the medicine name, dose (includes unit), frequency, form (e.g. tablets), route (e.g. oral), and whether the medicine was present on admission.
The most commonly occurring formats for various relevant medicines are shown in the table below:
name |
dose |
frequency |
drug_form |
route |
---|---|---|---|---|
aspirin | 75 mg | in the MORNING | NaN | Oral |
aspirin | 75 mg | in the MORNING | dispersible tablet | Oral |
clopidogrel | 75 mg | in the MORNING | film coated tablets | Oral |
ticagrelor | 90 mg | TWICE a day | tablets | Oral |
warfarin | 3 mg | ONCE a day at 18:00 | NaN | Oral |
warfarin | 5 mg | ONCE a day at 18:00 | tablets | Oral |
apixaban | 5 mg | TWICE a day | tablets | Oral |
dabigatran etexilate | 110 mg | TWICE a day | capsules | Oral |
edoxaban | 60 mg | in the MORNING | tablets | Oral |
rivaroxaban | 20 mg | in the MORNING | film coated tablets | Oral |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
table_name |
str
|
This defaults to "cv1_pharmacy_prescribing" for UHBW, but can be overwritten with "HIC_Pharmacy" for ICB. |
'cv1_pharmacy_prescribing'
|
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve the pharmacy prescribing table |
Source code in src\pyhbr\data_source\hic.py
procedures_query(engine)
Get the procedures corresponding to episodes
This should be linked to the episodes table to obtain information about the procedures in the episode.
Procedures are encoded using OPCS-4 codes, and the position column contains the order of procedures in the episode (1-indexed).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve procedures table |
Source code in src\pyhbr\data_source\hic.py
hic_covid
SQL queries and functions for HIC (COVID-19, UHBW) data.
episodes_query(engine)
Get the episodes list in the HIC data
This table does not contain any episode information, just a patient and an episode id for linking to diagnosis and procedure information in other tables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
start_date |
first valid consultant-episode start date |
required | |
end_date |
last valid consultant-episode start date |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve episodes table |
Source code in src\pyhbr\data_source\hic_covid.py
hic_icb
SQL queries and functions for HIC (ICB version)
Most data available in the HIC tables is fetched in the queries below, apart from columns which are all-NULL, provide keys/IDs that will not be used, or provide duplicate information (e.g. duplicated in two tables).
Note that the lab results/pharmacy queries are in the hic.py module, because there are no changes to the query apart from the table name.
episode_id_query(engine)
Get the episodes list in the HIC data
This table is just a list of IDs to identify the data in other ICB tables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve episodes table |
Source code in src\pyhbr\data_source\hic_icb.py
pathology_blood_query(engine, test_names)
Get the table of blood test results in the HIC data
Since blood tests in this table are not associated with an episode directly by key, it is necessary to link them based on the patient identifier and date. This operation can be quite slow if the blood tests table is large. One way to reduce the size is to filter by test name using the test_names parameter. For reference, the investigation codes in the HIC data are shown below:
investigation |
Description |
---|---|
OBR_BLS_UL | LFT |
OBR_BLS_UE | UREA,CREAT + ELECTROLYTES |
OBR_BLS_FB | FULL BLOOD COUNT |
OBR_BLS_UT | THYROID FUNCTION TEST |
OBR_BLS_TP | TOTAL PROTEIN |
OBR_BLS_CR | C-REACTIVE PROTEIN |
OBR_BLS_CS | CLOTTING SCREEN |
OBR_BLS_FI | FIB-4 |
OBR_BLS_AS | AST |
OBR_BLS_CA | CALCIUM GROUP |
OBR_BLS_TS | TSH AND FT4 |
OBR_BLS_FO | SERUM FOLATE |
OBR_BLS_PO | PHOSPHATE |
OBR_BLS_LI | LIPID PROFILE |
OBR_POC_VG | POCT BLOOD GAS VENOUS SAMPLE |
OBR_BLS_HD | HDL CHOLESTEROL |
OBR_BLS_FT | FREE T4 |
OBR_BLS_FE | SERUM FERRITIN |
OBR_BLS_GP | ELECTROLYTES NO POTASSIUM |
OBR_BLS_CH | CHOLESTEROL |
OBR_BLS_MG | MAGNESIUM |
OBR_BLS_CO | CORTISOL |
Each test is similarly encoded. The valid test codes in the full blood count and U+E investigations are shown below:
investigation |
test |
Description |
---|---|---|
OBR_BLS_FB | OBX_BLS_NE | Neutrophils |
OBR_BLS_FB | OBX_BLS_PL | Platelets |
OBR_BLS_FB | OBX_BLS_WB | White Cell Count |
OBR_BLS_FB | OBX_BLS_LY | Lymphocytes |
OBR_BLS_FB | OBX_BLS_MC | MCV |
OBR_BLS_FB | OBX_BLS_HB | Haemoglobin |
OBR_BLS_FB | OBX_BLS_HC | Haematocrit |
OBR_BLS_UE | OBX_BLS_NA | Sodium |
OBR_BLS_UE | OBX_BLS_UR | Urea |
OBR_BLS_UE | OBX_BLS_K | Potassium |
OBR_BLS_UE | OBX_BLS_CR | Creatinine |
OBR_BLS_UE | OBX_BLS_EP | eGFR/1.73m2 (CKD-EPI) |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
test_names |
list[str]
|
Unlike the UHBW version of this table, there are no investigation names here. Instead, restrict directly using the test_name field. |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve blood tests table |
Source code in src\pyhbr\data_source\hic_icb.py
icb
Data sources available from the BNSSG ICB. This file contains queries that fetch the raw data from the BNSSG ICB, which includes hospital episode statistics (HES) and primary care data.
This file does not include the HIC data transferred to the ICB.
clinical_code_column_name(kind, position)
Make the primary/secondary diagnosis/procedure column names
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kind |
str
|
Either "diagnosis" or "procedure". |
required |
position |
int
|
0 for primary, 1 and higher for secondaries. |
required |
Returns:
Type | Description |
---|---|
str
|
The column name for the clinical code compatible with the ICB HES tables. |
Source code in src\pyhbr\data_source\icb.py
mortality_query(engine, start_date, end_date)
Get the mortality query, including cause of death
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
The connection to the database |
required |
start_date |
date
|
First date of death that will be included |
required |
end_date |
date
|
Last date of death that will be included |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve the mortality table |
Source code in src\pyhbr\data_source\icb.py
ordinal(n)
Make an ordinal like "2nd" from a number n
See https://stackoverflow.com/a/20007730.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n |
int
|
The integer to convert to an ordinal string. |
required |
Returns:
Type | Description |
---|---|
str
|
For an integer (e.g. 5), the ordinal string (e.g. "5th") |
Source code in src\pyhbr\data_source\icb.py
primary_care_attributes_query(engine, patient_ids, gp_opt_outs)
Get primary care patient information
This is translated into an IN clause, which has an item limit. If patient_ids is longer than 2000, an error is raised. If more patient IDs are needed, split patient_ids and call this function multiple times.
The values in patient_ids must be valid (they should come from a query such as sus_query).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
The connection to the database |
required |
patient_ids |
list[str]
|
The list of patient identifiers to filter the nhs_number column. |
required |
gp_opt_outs |
list[str]
|
List of practice codes that are excluded from the data fetch (corresponds to the "practice_code" column in the table). |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve the primary care attributes table |
Source code in src\pyhbr\data_source\icb.py
primary_care_measurements_query(engine, patient_ids, gp_opt_outs)
Get physiological measurements performed in primary care
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
patient_ids |
list[str]
|
The list of patient identifiers to filter the nhs_number column. |
required |
gp_opt_outs |
list[str]
|
List of practice codes that are excluded from the data fetch (corresponds to the "practice_code" column in the table). |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve the primary care measurements table |
Source code in src\pyhbr\data_source\icb.py
primary_care_prescriptions_query(engine, patient_ids, gp_opt_outs)
Get medications dispensed in primary care
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
patient_ids |
list[str]
|
The list of patient identifiers to filter the nhs_number column. |
required |
gp_opt_outs |
list[str]
|
List of practice codes that are excluded from the data fetch (corresponds to the "practice_code" column in the table). |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve the primary care prescriptions table |
Source code in src\pyhbr\data_source\icb.py
score_seg_query(engine, patient_ids)
Get score segment information from SWD (Charlson/Cambridge score, etc.)
This is translated into an IN clause, which has an item limit. If patient_ids is longer than 2000, an error is raised. If more patient IDs are needed, split patient_ids and call this function multiple times.
The values in patient_ids must be valid (they should come from a query such as sus_query).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
The connection to the database |
required |
patient_ids |
list[str]
|
The list of patient identifiers to filter the nhs_number column. |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve the score segment table |
Source code in src\pyhbr\data_source\icb.py
sus_query(engine, start_date, end_date)
Get the episodes list in the HES data
This table contains one episode per row. Diagnosis/procedure clinical codes are represented in wide format (one clinical code position per column), and patient demographic information is also included.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine |
Engine
|
the connection to the database |
required |
start_date |
date
|
first valid consultant-episode start date |
required |
end_date |
date
|
last valid consultant-episode start date |
required |
Returns:
Type | Description |
---|---|
Select
|
SQL query to retrieve episodes table |
Source code in src\pyhbr\data_source\icb.py
middle
Routines for interfacing between the data sources and analysis functions
from_hic
Convert HIC tables into the formats required for analysis
calculate_age(episodes, demographics)
Calculate the patient age at each episode
The HIC data contains only year_of_birth, which is used here. In order to make an unbiased estimate of the age, the birthday is assumed to be 2nd July (halfway through the year).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
episodes |
DataFrame
|
Contains |
required |
demographics |
DataFrame
|
Contains |
required |
Returns:
Type | Description |
---|---|
Series
|
A series containing age, indexed by |
Source code in src\pyhbr\middle\from_hic.py
check_const_column(df, col_name, expect)
Raise an error if a column is not constant
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The table to check | required |
col_name | str | The name of the column which should be constant | required |
expect | str | The expected constant value of the column | required |
Raises:
Type | Description |
---|---|
RuntimeError | Raised if the column is not constant with the expected value. |
Source code in src\pyhbr\middle\from_hic.py
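A sketch of what such a check might look like (not necessarily the exact implementation):

```python
import pandas as pd

def check_const_column(df: pd.DataFrame, col_name: str, expect: str) -> None:
    # Raise if any value in the column differs from the expected constant.
    if not (df[col_name] == expect).all():
        raise RuntimeError(f"Expected column '{col_name}' to be constant '{expect}'")
```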
filter_by_medicine(df)
Filter a dataframe by medicine name
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Contains a column … | required |
Returns:
Type | Description |
---|---|
DataFrame | The dataframe, filtered to the set of medicines of interest, with a new column … |
Source code in src\pyhbr\middle\from_hic.py
get_clinical_codes(engine, diagnoses_file, procedures_file)
Main diagnoses/procedures fetch for the HIC data
This function wraps the diagnoses/procedures queries and a filtering operation to reduce the tables to only those rows which contain a code in a group. One table is returned which contains both the diagnoses and procedures in long format, along with the associated episode ID and the primary/secondary position of the code in the episode.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
diagnoses_file | str | The diagnoses codes file name (loaded from the package) | required |
procedures_file | str | The procedures codes file name (loaded from the package) | required |
Returns:
Type | Description |
---|---|
DataFrame | A table containing diagnoses/procedures, normalised codes, code groups, diagnosis positions, and associated episode ID. |
Source code in src\pyhbr\middle\from_hic.py
get_demographics(engine)
Get patient demographic information
Gender is encoded using the NHS data dictionary values, which is mapped to a category column in the table. (Note that the initial values are strings, not integers.)
- "0": Not known. Mapped to "unknown".
- "1": Male. Mapped to "male".
- "2": Female. Mapped to "female".
- "9": Not specified. Mapped to "unknown".
0/9 are not mapped to NA in case either is related to non-binary genders (i.e. it contains information, rather than being a NULL field).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
Returns:
Type | Description |
---|---|
DataFrame | A table indexed by patient_id, containing gender, birth year, and death_date (if applicable). |
Source code in src\pyhbr\middle\from_hic.py
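The mapping itself is simple; a sketch using a hypothetical demographics frame:

```python
import pandas as pd

# Hypothetical raw values; the real ones come from the HIC database.
demographics = pd.DataFrame({"gender": ["0", "1", "2", "9"]})

# Codes are strings, not integers; both "0" and "9" map to "unknown".
gender_map = {"0": "unknown", "1": "male", "2": "female", "9": "unknown"}
demographics["gender"] = demographics["gender"].map(gender_map).astype("category")
```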
get_episodes(engine, start_date, end_date)
Get the table of episodes
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
start_date | date | The start date (inclusive) for returned episodes | required |
end_date | date | The end date (inclusive) for returned episodes | required |
Returns:
Type | Description |
---|---|
DataFrame | The episode data, indexed by episode_id. This contains the columns … |
Source code in src\pyhbr\middle\from_hic.py
get_gender(episodes, demographics)
Get gender from the demographics table for each index event
Parameters:
Name | Type | Description | Default |
---|---|---|---|
episodes | DataFrame | Indexed by … | required |
demographics | DataFrame | Having columns … | required |
Returns:
Type | Description |
---|---|
Series | A series containing gender, indexed by … |
Source code in src\pyhbr\middle\from_hic.py
get_lab_results(engine, episodes)
Get relevant laboratory results from the HIC data, linked to episode
For information about the contents of the table, refer to the documentation for get_unlinked_lab_results().
This function links each laboratory test to the first episode containing the sample collected date in its date range. For more about this, see link_to_episodes().
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
episodes | DataFrame | The episodes table, used for linking. Must contain … | required |
Returns:
Type | Description |
---|---|
DataFrame | Table of laboratory results, including Hb (haemoglobin), platelet count, and eGFR (kidney function). The columns are … |
Source code in src\pyhbr\middle\from_hic.py
get_prescriptions(engine, episodes)
Get relevant prescriptions from the HIC data, linked to episode
For information about the contents of the table, refer to the documentation for get_unlinked_prescriptions().
This function links each prescription to the first episode containing the prescription order date in its date range. For more about this, see link_to_episodes().
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
episodes | DataFrame | The episodes table, used for linking. Must contain … | required |
Returns:
Type | Description |
---|---|
DataFrame | The table of prescriptions, including the prescription name, prescription group (oac or nsaid), frequency (in doses per day), and link to the associated episode. |
Source code in src\pyhbr\middle\from_hic.py
get_unlinked_lab_results(engine, table_name='cv1_pathology_blood')
Get laboratory results from the HIC database (unlinked to episode)
This function returns data for the following three tests, identified by one of these values in the test_name column:
- hb: haemoglobin (unit: g/dL)
- egfr: eGFR (unit: mL/min)
- platelets: platelet count (unit: 10^9/L)
The test result is associated with a patient_id, and the time when the sample for the test was collected is stored in the sample_date column.
Some values in the underlying table contain inequalities in the results column, which have been removed (so egfr >90 becomes 90).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
table_name | str | This defaults to "cv1_pathology_blood" for UHBW, but can be overwritten with "HIC_Bloods" for ICB. | 'cv1_pathology_blood' |
Returns:
Type | Description |
---|---|
DataFrame | Table of laboratory results, including Hb (haemoglobin), platelet count, and eGFR (kidney function). The columns are … |
Source code in src\pyhbr\middle\from_hic.py
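The inequality clean-up described above might look like this sketch (the raw result values are hypothetical examples):

```python
import pandas as pd

# Hypothetical raw result strings from the pathology table.
results = pd.Series([">90", "85", "<5", "12.3"])

# Remove inequality signs so ">90" becomes 90, then parse as numeric.
numeric = pd.to_numeric(results.str.replace(r"[<>]", "", regex=True))
```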
get_unlinked_prescriptions(engine, table_name='cv1_pharmacy_prescribing')
Get relevant prescriptions from the HIC data (unlinked to episode)
This function is tailored towards the calculation of the ARC HBR score, so it focusses on prescriptions of oral anticoagulants (e.g. warfarin) and non-steroidal anti-inflammatory drugs (NSAIDs, e.g. ibuprofen).
The frequency column reflects the maximum allowable doses per day. For the purposes of ARC HBR, where NSAIDs must be prescribed > 4 days/week, all prescriptions in the HIC data indicate a frequency of at least one dose per day, and therefore qualify for ARC HBR purposes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
table_name | str | Defaults to "cv1_pharmacy_prescribing" for UHBW, but can be overwritten by "HIC_Pharmacy" for ICB. | 'cv1_pharmacy_prescribing' |
Returns:
Type | Description |
---|---|
DataFrame | The table of prescriptions, including the patient_id, order_date (to link to an episode), prescription name, prescription group (oac or nsaid), and frequency (in doses per day). |
Source code in src\pyhbr\middle\from_hic.py
link_to_episodes(items, episodes, date_col_name)
Link HIC laboratory tests/prescriptions to episodes by date
Use this function to add an episode_id to the laboratory tests table or the prescriptions table. Tests/prescriptions are generically referred to as items below.
This function associates each item with the first episode containing the item date in its [episode_start, episode_end) range. The column containing the item date is given by date_col_name.
For prescriptions, use the prescription order date for linking. For laboratory tests, use the sample collected date.
This function assumes that the episode_id in the episodes table is unique (i.e. no patients share an episode ID).
For higher performance, reduce the item table to items of interest before calling this function.
Since episodes may slightly overlap, an item may be associated with more than one episode. In this case, the function will associate the item with the earliest episode (the returned table will not contain duplicate items).
The final table does not use episode_id as an index, because an episode may contain multiple items.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
items | DataFrame | The prescriptions or laboratory tests table. Must contain a … | required |
episodes | DataFrame | The episodes table. Must contain … | required |
Returns:
Type | Description |
---|---|
DataFrame | The items table with additional … |
Source code in src\pyhbr\middle\from_hic.py
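A sketch of the linking logic under the assumptions above (column names patient_id, episode_start, and episode_end are assumed; this is not the library's exact code):

```python
import pandas as pd

def link_items_sketch(items: pd.DataFrame, episodes: pd.DataFrame,
                      date_col_name: str) -> pd.DataFrame:
    # Pair every item with every episode of the same patient
    # (reset_index brings episode_id out of the episodes index).
    df = items.merge(episodes.reset_index(), on="patient_id")
    # Keep pairs where the item date lies in [episode_start, episode_end).
    in_range = (df[date_col_name] >= df["episode_start"]) & (
        df[date_col_name] < df["episode_end"]
    )
    # If an item matches several overlapping episodes, keep the earliest.
    return (
        df[in_range]
        .sort_values("episode_start")
        .drop_duplicates(subset=list(items.columns), keep="first")
    )
```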
from_icb
blood_pressure(swd_index_spells, primary_care_measurements)
Get recent blood pressure readings
Parameters:
Name | Type | Description | Default |
---|---|---|---|
primary_care_measurements | DataFrame | Contains a … | required |
swd_index_spells | DataFrame | Has Pandas index … | required |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe indexed by … |
Source code in src\pyhbr\middle\from_icb.py
get_clinical_codes(raw_sus_data, code_groups)
Get clinical codes in long format and normalised form.
Each row is a code that is contained in some group. Codes in an episode are dropped if they are not in any group, meaning episodes will be dropped if no code in that episode is in any group.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_sus_data | DataFrame | Must contain one row per episode, and contains clinical codes in wide format, with columns … | required |
code_groups | DataFrame | A table of all the codes in any group, at least containing columns … | required |
Returns:
Type | Description |
---|---|
DataFrame | A table containing diagnoses/procedures, normalised codes, code groups, diagnosis positions, and associated episode ID. |
Source code in src\pyhbr\middle\from_icb.py
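A sketch of the wide-to-long reshaping this involves (the diagnosis_N column names and example codes are made up for illustration):

```python
import pandas as pd

raw = pd.DataFrame({
    "episode_id": [1, 2],
    "diagnosis_0": ["I21.0", "N17.9"],   # hypothetical primary diagnoses
    "diagnosis_1": ["D62", None],        # hypothetical secondary diagnosis
})

# One clinical code per row, with the position recovered from the column name.
long = raw.melt(id_vars="episode_id", var_name="column", value_name="code").dropna()
long["position"] = long["column"].str.extract(r"(\d+)$", expand=False).astype(int) + 1
# Normalise codes: lower case, no dots or surrounding whitespace.
long["code"] = long["code"].str.lower().str.replace(".", "", regex=False).str.strip()
```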
get_episodes(raw_sus_data)
Get the episodes table
Age and gender are also included in each row.
Gender is encoded using the NHS data dictionary values, which is mapped to a category column in the table. (Note that the initial values are strings, not integers.)
- "0": Not known. Mapped to "unknown".
- "1": Male. Mapped to "male".
- "2": Female. Mapped to "female".
- "9": Not specified. Mapped to "unknown".
0/9 are not mapped to NA in case either is related to non-binary genders (i.e. it contains information, rather than being a NULL field).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_sus_data | DataFrame | Data returned by the sus_query() query. | required |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe indexed by … |
Source code in src\pyhbr\middle\from_icb.py
get_episodes_and_codes(raw_sus_data, code_groups)
Get episode and clinical code data
This batch of data must be fetched first to find index events, which establishes the patient group of interest. This group can then be used to narrow subsequent queries to the database, to speed them up.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_sus_data | DataFrame | The raw HES data returned by get_raw_sus_data() | required |
code_groups | DataFrame | A table of all the codes in any group, at least containing columns … | required |
Returns:
Type | Description |
---|---|
(DataFrame, DataFrame) | A tuple containing the episodes table (which also contains age and gender) and the codes table containing the clinical code data in long format, for any code that is in a diagnosis or procedure code group. |
Source code in src\pyhbr\middle\from_icb.py
get_long_cause_of_death(mortality)
Get cause-of-death diagnosis codes in normalised long format
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mortality | DataFrame | A table containing … | required |
Returns:
Type | Description |
---|---|
DataFrame | A table containing the columns … |
Source code in src\pyhbr\middle\from_icb.py
get_long_clinical_codes(raw_sus_data)
Get a table of the clinical codes in normalised long format
This is modelled on the format of the HIC data, which works well, and makes it possible to re-use the code for processing that table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raw_sus_data | DataFrame | Must contain one row per episode, and contains clinical codes in wide format, with columns … | required |
Returns:
Type | Description |
---|---|
DataFrame | A table containing … |
Source code in src\pyhbr\middle\from_icb.py
get_mortality(engine, start_date, end_date, code_groups)
Get date of death and cause of death
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
start_date | date | First date of death that will be included | required |
end_date | date | Last date of death that will be included | required |
code_groups | DataFrame | A table of all the codes in any group, at least containing columns … | required |
Returns:
Type | Description |
---|---|
dict[str, DataFrame] | A tuple containing a date of death table, which is indexed by … |
Source code in src\pyhbr\middle\from_icb.py
get_raw_sus_data(engine, start_date, end_date)
Get the raw SUS (Secondary Uses Service) hospital episode statistics
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
start_date | date | The start date (inclusive) for returned episodes | required |
end_date | date | The end date (inclusive) for returned episodes | required |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe with one row per episode, containing clinical code data and patient demographics at that episode. |
Source code in src\pyhbr\middle\from_icb.py
get_unlinked_lab_results(engine)
Get laboratory results from the HIC database (unlinked to episode)
This function returns data for the following three tests, identified by one of these values in the test_name column:
- hb: haemoglobin (unit: g/dL)
- egfr: eGFR (unit: mL/min)
- platelets: platelet count (unit: 10^9/L)
The test result is associated with a patient_id, and the time when the sample for the test was collected is stored in the sample_date column.
Some values in the underlying table contain inequalities in the results column, which have been removed (so egfr >90 becomes 90).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The connection to the database | required |
Returns:
Type | Description |
---|---|
DataFrame | Table of laboratory results, including Hb (haemoglobin), platelet count, and eGFR (kidney function). The columns are … |
Source code in src\pyhbr\middle\from_icb.py
hba1c(swd_index_spells, primary_care_measurements)
Get recent HbA1c from the primary care measurements
Parameters:
Name | Type | Description | Default |
---|---|---|---|
primary_care_measurements | DataFrame | Contains a … | required |
swd_index_spells | DataFrame | Has Pandas index … | required |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe indexed by … |
Source code in src\pyhbr\middle\from_icb.py
preprocess_ethnicity(column)
Map the ethnicity column to standard ethnicities.
Ethnicities were obtained from www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups, from the 2021 census:
- asian_or_asian_british
- black_black_british_caribbean_or_african
- mixed_or_multiple_ethnic_groups
- white
- other_ethnic_group
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | Series | A column of object ("string") type containing ethnicities from the primary care attributes table. | required |
Returns:
Type | Description |
---|---|
Series | A column of type category containing the standard ethnicities (and NaN). |
Source code in src\pyhbr\middle\from_icb.py
preprocess_smoking(column)
Convert the smoking column from string to category
The values in the column are "unknown", "ex", "Unknown", "current", "Smoker", "Ex", and "Never".
Based on the distribution of values in the column, it is likely that "Unknown"/"unknown" mostly means "no". This mapping makes the percentage of smokers about 15%, which is roughly in line with the population average. Without performing this mapping, smokers outnumber non-smokers ("Never") by approximately 20 to 1.
Note that the column also includes NA values, which will be left as NA.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | Series | The smoking column from the primary care attributes | required |
Returns:
Type | Description |
---|---|
Series | A category column containing "yes", "no", and "ex". |
Source code in src\pyhbr\middle\from_icb.py
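A sketch of the mapping this implies (the exact mapping in the package may differ):

```python
import pandas as pd

# Hypothetical example column; the real one comes from the attributes table.
smoking = pd.Series(["unknown", "ex", "Unknown", "current", "Smoker", "Ex", "Never", None])

# Treat Unknown/unknown and Never as "no", normalise case,
# and leave genuine NA values as NA (map() keeps them NaN).
mapping = {
    "unknown": "no", "Unknown": "no", "Never": "no",
    "current": "yes", "Smoker": "yes",
    "ex": "ex", "Ex": "ex",
}
smoking = smoking.map(mapping).astype("category")
```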
process_flag_columns(primary_care_attributes)
Replace NaN with false and convert to bool for a selection of columns
Many columns in the primary care attributes encode a flag using 1 for true and NA/NULL for false. These must be re-encoded with a boolean-like type so that NA can be reserved for genuinely missing data.
Instead of using a bool dtype, Int8 is used so that NaNs can still be stored. (This is important later on for index spells with missing attributes, which need to store NaN in these flag columns.)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
primary_care_attributes | DataFrame | Original table containing 1/NA flag columns | required |
Returns:
Type | Description |
---|---|
DataFrame | The primary care attributes with flag columns encoded as Int8. |
Source code in src\pyhbr\middle\from_icb.py
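A sketch of the conversion for one flag column (the example values are hypothetical):

```python
import pandas as pd

# Hypothetical flag column: 1 encodes true, NA/NULL encodes false.
flags = pd.Series([1.0, None, 1.0, None])

# Treat NA as 0 (false) and store as nullable Int8, so that rows joined in
# later (e.g. index spells with missing attributes) can still hold a real NA.
flags = flags.fillna(0).astype("Int8")
```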
tools
fetch_data
Fetch raw data from the database and save it to a file
generate_report
Generate the report folder from a config file and model data
plot_describe
plot_or_save(plot, name, save_dir)
Plot the graph interactively or save the figure
Parameters:
Name | Type | Description | Default |
---|---|---|---|
plot | bool | If true, plot interactively and don't save. Otherwise, save the figure. | required |
name | str | The filename (without the .png) to save the figure as | required |
save_dir | str | The directory in which to save the figure | required |
Source code in src\pyhbr\tools\plot_describe.py
run_model
fit_and_save(model_name, config, pipe, X_train, y_train, X_test, y_test, data_file, random_state)
Fit the model and save the results
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name | str | The name of the model, a key under the "models" top-level key in the config file | required |
config | dict[str, Any] | The config file as a dictionary | required |
X_train | DataFrame | The features training dataframe | required |
y_train | DataFrame | The outcomes training dataframe | required |
X_test | DataFrame | The features testing dataframe | required |
y_test | DataFrame | The outcomes testing dataframe | required |
data_file | str | The name of the raw data file used for the modelling | required |
random_state | RandomState | The source of randomness used by the model | required |
Source code in src\pyhbr\tools\run_model.py
get_pipe_fn(model_config)
Get the pipe function based on the name in the config file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_config | dict[str, str] | The dictionary in models.{model_name} in the config file | required |
Source code in src\pyhbr\tools\run_model.py
Analysis
Common Utilities
Common utilities for other modules.
A collection of routines used by the data source or analysis functions.
CheckedTable
Wrapper for sqlalchemy table with checks for table/columns
Source code in src\pyhbr\common.py
__init__(table_name, engine, schema='dbo')
Get a CheckedTable by reading from the remote server
This is a wrapper around the sqlalchemy Table for catching errors when accessing columns through the c attribute.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | The name of the table whose metadata should be retrieved | required |
engine | Engine | The database connection | required |
Returns:
Type | Description |
---|---|
None | The table data for use in SQL queries |
Source code in src\pyhbr\common.py
col(column_name)
Get a column
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_name | str | The name of the column to fetch. | required |
Raises:
Type | Description |
---|---|
RuntimeError | Thrown if the column does not exist |
Source code in src\pyhbr\common.py
chunks(patient_ids, n)
Divide a list of patient ids into n-sized chunks
The last chunk may be shorter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
patient_ids | list[str] | The list of IDs to chunk | required |
n | int | The chunk size. | required |
Returns:
Type | Description |
---|---|
list[list[str]] | A list containing chunks (lists) of patient IDs |
Source code in src\pyhbr\common.py
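The behaviour is what you would expect from a simple slicing sketch:

```python
def chunks(patient_ids: list[str], n: int) -> list[list[str]]:
    # n-sized slices; the final slice may be shorter.
    return [patient_ids[i : i + n] for i in range(0, len(patient_ids), n)]

print(chunks(["a", "b", "c", "d", "e"], 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```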
current_commit()
Get current commit.
Returns:
Type | Description |
---|---|
str | The first 12 characters of the current commit hash, using the first repository found above the current working directory. If the working directory is not in a git repository, "nogit" is returned. |
Source code in src\pyhbr\common.py
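A sketch of this behaviour using GitPython (an assumption; the package's actual implementation may differ):

```python
from git import InvalidGitRepositoryError, Repo

def current_commit() -> str:
    try:
        # Search upwards from the working directory for a repository.
        repo = Repo(search_parent_directories=True)
        return repo.head.commit.hexsha[:12]
    except InvalidGitRepositoryError:
        return "nogit"
```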
current_timestamp()
Get the current timestamp.
Returns:
Type | Description |
---|---|
int
|
The current timestamp (since epoch) rounded to the nearest second. |
get_data(engine, query, *args)
Convenience function to make a query and fetch data.
Wraps a function like hic.demographics_query with a call to pd.read_sql.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The database connection | required |
query | Callable[[Engine, ...], Select] | A function returning a sqlalchemy Select statement | required |
*args | ... | Positional arguments to be passed to query in addition to engine (which is passed first). Make sure they are passed in the same order expected by the query function. | () |
Returns:
Type | Description |
---|---|
DataFrame | The pandas dataframe containing the SQL data |
Source code in src\pyhbr\common.py
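Typical usage might look like this (the database name and date range are placeholders):

```python
from datetime import date
from pyhbr.common import get_data, make_engine
from pyhbr.data_source import icb

engine = make_engine(database="hic_cv_test")
# Positional args after engine follow sus_query's signature: start/end date.
raw_sus_data = get_data(engine, icb.sus_query, date(2019, 1, 1), date(2023, 1, 1))
```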
get_data_by_patient(engine, query, patient_ids, *args)
Fetch data using a query restricted by patient ID
The patient_id list is chunked into 2000-long batches to fit within an SQL IN clause, and each chunk is run as a separate query. The results are assembled into a single DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
engine | Engine | The database connection | required |
query | Callable[[Engine, ...], Select] | A function returning a sqlalchemy Select statement. Must take a list[str] as an argument after engine. | required |
patient_ids | list[str] | A list of patient IDs to restrict the query. | required |
*args | ... | Further positional arguments that will be passed to the query function after the patient_ids positional argument. | () |
Returns:
Type | Description |
---|---|
list[DataFrame] | A list of dataframes, one corresponding to each chunk. |
Source code in src\pyhbr\common.py
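For example (the patient IDs are hypothetical, and gp_opt_outs is left empty):

```python
from pyhbr.common import get_data_by_patient, make_engine
from pyhbr.data_source import icb

engine = make_engine(database="hic_cv_test")
patient_ids = ["900000001", "900000002"]  # hypothetical identifiers
gp_opt_outs = []                          # practice codes to exclude, if any
measurements = get_data_by_patient(
    engine, icb.primary_care_measurements_query, patient_ids, gp_opt_outs
)
```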
get_saved_files_by_name(name, save_dir, extension)
Get all saved data files matching name
Get the list of files in the save_dir folder matching name. Return the result as a table of file path, commit hash, and saved date. The table is sorted by timestamp, with the most recent file first.
Raises:
Type | Description |
---|---|
RuntimeError | If save_dir does not exist, or there are files in save_dir with invalid file names (not in the format name_commit_timestamp.pkl). |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the saved file to load. This matches name in the filename name_commit_timestamp.pkl. | required |
save_dir | str | The directory to search for files. | required |
extension | str | What file extension to look for. Do not include the dot. | required |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe with columns … |
Source code in src\pyhbr\common.py
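Since the filename format is name_commit_timestamp.pkl, the pieces can be recovered with a right split (a sketch, not necessarily the library's parsing code):

```python
from pathlib import Path

f = Path("save_data/index_spells_1a2b3c4d5e6f_1700000000.pkl")  # hypothetical
# Split from the right so underscores in the name itself are preserved.
name, commit, timestamp = f.stem.rsplit("_", 2)
print(name, commit, timestamp)  # index_spells 1a2b3c4d5e6f 1700000000
```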
load_exact_item(name, save_dir='save_data')
Load a previously saved item (pickle) from file by exact filename
This is similar to load_item, but loads the exact filename given by name instead of looking for the most recent file. name must contain the commit, timestamp, and file extension.
A RuntimeError is raised if the file does not exist.
To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the item to load | required |
save_dir | str | Which folder to load the item from. | 'save_data' |
Returns:
Type | Description |
---|---|
Any | The data item loaded. |
Source code in src\pyhbr\common.py
load_item(name, interactive=False, save_dir='save_data')
Load a previously saved item (pickle) from file
Use this function to load a file that was previously saved using save_item(). By default, the latest version of the item will be returned (the one with the most recent timestamp).
None is returned if an interactive load is cancelled by the user.
To load an item that is an object from a library (e.g. a pandas DataFrame), the library must be installed (otherwise you will get a ModuleNotFound exception). However, you do not have to import the library before calling this function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the item to load | required |
interactive | bool | If True, let the user pick which item version to load interactively. If False, non-interactively load the most recent item (i.e. with the most recent timestamp). The commit hash is not considered when loading the item. | False |
save_dir | str | Which folder to load the item from. | 'save_data' |
Returns:
Type | Description |
---|---|
(Any, Path) | A tuple, with the python object loaded from file as first element and the Path to the item as the second element, or None if the user cancelled an interactive load. |
Source code in src\pyhbr\common.py
load_most_recent_data_files(analysis_name, save_dir)
Load the most recent timestamped data file matching the analysis name
The data file is a pickle of a dictionary, containing pandas DataFrames and other metadata. It is expected to contain a "raw_file" key, which contains the path to the associated raw data file.
Both files are loaded, and a tuple of all the data is returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analysis_name | str | The "analysis_name" key from the config file, which is the filename prefix | required |
save_dir | str | The folder to load the data from | required |
Returns:
Type | Description |
---|---|
(dict[str, Any], dict[str, Any], str) | (data, raw_data, data_path). data and raw_data are dictionaries containing (mainly) Pandas DataFrames, and data_path is the path to the data file (this can be stored in any output products from this script to record which data file was used to generate them). |
Source code in src\pyhbr\common.py
make_engine(con_string='mssql+pyodbc://dsn', database='hic_cv_test')
Make a sqlalchemy engine
This function is intended for use with Microsoft SQL Server. The preferred method to connect to the server on Windows is to use a Data Source Name (DSN). To use the default connection string argument, set up a data source name called "dsn" using the program "ODBC Data Sources".
If you need to access multiple different databases on the same server, you will need different engines. Specify the database name while creating the engine (this will override a default database in the DSN, if there is one).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con_string | str | The sqlalchemy connection string. | 'mssql+pyodbc://dsn' |
database | str | The database name to connect to. | 'hic_cv_test' |
Returns:
Type | Description |
---|---|
Engine | The sqlalchemy engine |
Source code in src\pyhbr\common.py
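For example, connecting to two databases on the same DSN (the second database name is a hypothetical example):

```python
from pyhbr.common import make_engine

# Assumes a DSN called "dsn" has been configured in ODBC Data Sources.
hic_engine = make_engine(database="hic_cv_test")
swd_engine = make_engine(database="modelling_sql_area")  # hypothetical name
```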
make_new_save_item_path(name, save_dir, extension)
Make the path to save a new item to the save_dir
The name will have the format name_{current_commit}_{timestamp}.{extension}.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The base name for the new filename | required |
save_dir | str | The folder in which to place the item | required |
extension | str | The file extension (omit the dot) | required |
Returns:
Type | Description |
---|---|
Path | The relative path to the new object to be saved |
Source code in src\pyhbr\common.py
mean_confidence_interval(data, confidence=0.95)
Compute the confidence interval around the mean
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Series | A series of numerical values from which to compute the confidence interval. | required |
confidence | float | The confidence level to compute. | 0.95 |
Returns:
Type | Description |
---|---|
dict[str, float] | A map containing the keys "mean", "lower", and "upper". The latter two keys contain the confidence interval limits. |
Source code in src\pyhbr\common.py
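A sketch of one way to compute such an interval, using a normal approximation (an assumption; the package may use a different interval):

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(data, confidence: float = 0.95) -> dict[str, float]:
    mean = float(np.mean(data))
    sem = stats.sem(data)  # standard error of the mean
    lower, upper = stats.norm.interval(confidence, loc=mean, scale=sem)
    return {"mean": mean, "lower": lower, "upper": upper}

print(mean_confidence_interval([1.0, 2.0, 3.0, 4.0]))
```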
median_to_string(instability, unit='%')
Convert the median-quartile DataFrame to a string
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instability | DataFrame | Table containing three rows, indexed by 0.5 (median), 0.25 (lower quartile) and 0.75 (upper quartile). | required |
unit | str | What units to add to the values in the string. | '%' |
Returns:
Type | Description |
---|---|
str | A string containing the median, and the lower and upper quartiles. |
Source code in src\pyhbr\common.py
pick_most_recent_saved_file(name, save_dir, extension='pkl')
Get the path to the most recent file matching name.
Like pick_saved_file_interactive, but automatically selects the most recent file in save_data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the saved file to list | required |
save_dir | str | The directory to search for files | required |
extension | str | What file extension to look for. Do not include the dot. | 'pkl' |
Returns:
Type | Description |
---|---|
Path | The relative path to the most recent matching file. |
Source code in src\pyhbr\common.py
pick_saved_file_interactive(name, save_dir, extension='pkl')
Select a file matching name interactively
Print a list of the saved items in the save_dir folder, along with the date and time each was generated and its commit hash, and let the user pick which item should be loaded interactively. The full filename of the resulting file is returned, which can then be read by the user.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the saved file to list | required |
save_dir | str | The directory to search for files | required |
extension | str | What file extension to look for. Do not include the dot. | 'pkl' |
Returns:
Type | Description |
---|---|
str | None | The absolute path to the interactively selected file, or None if the interactive load was aborted. |
Source code in src\pyhbr\common.py
query_yes_no(question, default='yes')
Ask a yes/no question via raw_input() and return their answer.
From https://stackoverflow.com/a/3041990.
"question" is a string that is presented to the user.
"default" is the presumed answer if the user just hits Enter.
The "answer" return value is True for "yes" or False for "no".
Source code in src\pyhbr\common.py
read_config_file(yaml_path)
Read the configuration file from the given path
Parameters:
Name | Type | Description | Default |
---|---|---|---|
yaml_path | str | The path to the experiment config file | required |
Source code in src\pyhbr\common.py
requires_commit()
Check whether changes need committing
To make most effective use of the commit hash stored with a save_item call, the current branch should be clean (all changes committed). Call this function to check.
Returns False if there is no git repository.
Returns:
Type | Description |
---|---|
bool | True if the working directory is in a git repository that requires a commit; False otherwise. |
Source code in src\pyhbr\common.py
save_item(item, name, save_dir='save_data/', enforce_clean_branch=True, prompt_commit=False)
Save an item to a pickle file
Saves a python object (e.g. a pandas DataFrame) in the save_dir folder, using a filename that includes the current timestamp and the current commit hash. Use load_item to retrieve the file.
Important: ensure that save_data/ (or your chosen save_dir) is added to the .gitignore of your repository, so that sensitive data is not committed.
By storing the commit hash and timestamp, it is possible to identify when items were created and what code created them. To make most effective use of the commit hash, ensure that you commit, and do not make any further code edits, before running a script that calls save_item (otherwise the commit hash will not quite reflect the state of the running code).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item | Any | The python object to save (e.g. pandas DataFrame) | required |
name | str | The name of the item. The filename will be created by adding a suffix for the current commit and the timestamp to show when the data was saved (format: …). | required |
save_dir | str | Where to save the data, relative to the current working directory. The directory will be created if it does not exist. | 'save_data/' |
enforce_clean_branch | bool | If True, the function will raise an exception if an attempt is made to save an item when the repository has uncommitted changes. | True |
prompt_commit | bool | If enforce_clean_branch is true, choose whether to prompt the user to commit on an unclean branch. This can help avoid losing the results of a long-running script. Prefer False if the script is cheap to run. | False |
Source code in src\pyhbr\common.py
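A save/load round trip might look like this:

```python
import pandas as pd
from pyhbr.common import load_item, save_item

item = pd.DataFrame({"episode_id": [1, 2]})  # any picklable object
save_item(item, "example_item")              # -> example_item_<commit>_<timestamp>.pkl
loaded, path = load_item("example_item")     # most recent version by timestamp
```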