Calculations, Normalizations, and Math Protocols

On yesterday’s working group call, there was discussion around how best to report normalization of single measures. This revived previous discussions about the possibility of recording calculation methods (likely separate from protocols) to report with aggregated measures. This was something we discussed previously as being a good idea, but we said we would hold off until version 3. This new conversation, however, sparked renewed interest in this possibility.

An example use case is the 7-day moving average. This is a (somewhat arbitrary) type of mean used across the field, but there are differences in how it's calculated: front-weighting versus end-weighting, the number of observations used, and so on. Having a different aggregation value for each variant seems ungainly and inefficient. Keeping a single 7-day moving average aggregation value lets us group similar measures, while a calculations table would allow more detailed reporting on how reporters arrived at that number.
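To make those differences concrete, here is a minimal pandas sketch (the values, dates, and the min_periods choice are purely illustrative, not part of any ODM proposal) showing three variants that could all plausibly be reported under a single "7-day moving average" aggregation value:

```python
import pandas as pd

# Hypothetical daily N1 concentrations; values and dates are placeholders.
n1 = pd.Series(
    [12.0, 15.0, 11.0, 18.0, 14.0, 16.0, 13.0, 17.0, 19.0, 15.0],
    index=pd.date_range("2024-09-01", periods=10, freq="D"),
)

# End-anchored (trailing) 7-day mean: each point averages the 7 days ending on it.
trailing = n1.rolling(window=7).mean()

# Centred 7-day mean: each point averages 3 days before, the day itself, and 3 days after.
centred = n1.rolling(window=7, center=True).mean()

# Allowing partial windows (here, at least 4 observations) changes the edges again.
trailing_partial = n1.rolling(window=7, min_periods=4).mean()
```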

The main reason for not recording this with lab protocols in the protocols, protocolRelationships, and protocolSteps tables is that each measure can only be linked to a single protocol ID, so a separate calculation protocol would be difficult to record. Furthermore, we have received feedback that there are already a lot of steps in a single protocol ID, making changes, updates, and deviations from the protocol challenging to manage in model administration. If we also added all the calculations to protocols, the result would be efficient when it works, but ungainly when things change.

What I’m proposing is a calculations table, with the following headers:

  • calculationID : the primary key of the table, used to refer to the calculation.
  • name : the name of the calculation or series of calculations.
  • summ : a plain-language summary of the calculation, why it is done, and how.
  • equation : the mathematical equation/expression being used.
  • lastEdited : when the field was last edited or updated.
  • notes : any additional notes on the calculation.

calculationID would also be added as a header in the measures table, where it would be a foreign key.
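As a rough sketch of what a row of this table and its link from measures could look like, here are the proposed headers as Python dataclasses (the types and which fields are optional are my assumptions, not settled parts of the proposal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Calculation:
    """One row of the proposed calculations table (draft header names from above)."""
    calculationID: str                 # primary key of the table
    name: str                          # name of the calculation or series of calculations
    summ: str                          # plain-language summary: what is done, why, and how
    equation: str                      # the mathematical equation/expression being used
    lastEdited: Optional[str] = None   # when the row was last edited or updated
    notes: Optional[str] = None        # any additional notes on the calculation

@dataclass
class Measure:
    """Relevant subset of the measures table, with the proposed foreign key added."""
    measureRepID: str
    protocolID: str
    sampleID: str
    calculationID: Optional[str]       # FK into the calculations table
    measure: str
    value: float
    unit: str
    aggregation: str
    notes: Optional[str] = None
```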

For example:

Suppose we are recording a single flow- and population-normalized measure of the N1 region of the SARS-CoV-2 virus at the National Microbiology Lab (NML).

The calculations table looks like this:

| calculationID | name | summ | equation | lastEdited | notes |
| --- | --- | --- | --- | --- | --- |
| nml_floPopNorm | Flow and population normalization at NML | Measured levels of viral DNA or RNA (in gene copies) are normalized and standardized to the flow volume measured at the wastewater treatment plant that day, as well as to the population served by the plant, per 100,000 people. | (genome copies / day / 100 000 people) = gcmL * 10^9 * flowVol / (population / 100 000) | 24-09-2024 | NA |

and then the measures table would look like this:

| measureRepID | protocolID | sampleID | calculationID | measure | value | unit | aggregation | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| measureA | nmlProtocol | sampleA | nml_floPopNorm | covN1 | 0.345 | gcDay100k | single | lorem ipsum |
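To make the normalization equation above concrete, here is a small sketch of the calculation as a function. The assumption that flowVol is reported in megalitres per day is mine, inferred from the 10^9 (mL per megalitre) factor, and the inputs shown are placeholders rather than real NML data:

```python
def flow_pop_normalize(gc_per_ml: float, flow_vol_megalitres: float, population: float) -> float:
    """gcmL * 10^9 * flowVol / (population / 100 000), per the equation above.

    Returns genome copies / day / 100 000 people, assuming flowVol is the daily
    flow in megalitres (1 megalitre = 10^9 mL).
    """
    gc_per_day = gc_per_ml * 1e9 * flow_vol_megalitres
    return gc_per_day / (population / 100_000)

# Placeholder inputs, for illustration only.
print(flow_pop_normalize(gc_per_ml=0.002, flow_vol_megalitres=150.0, population=1_200_000))
```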

Obviously the naming of parts here is entirely in draft, so I'm happy to hear feedback on that. I'm also very happy to hear feedback on any other part of the structure proposed here.

@Sorin @dmanuel @jeandavidt @NHizon - anxious to hear your thoughts.

Hi Mathew, I think having a way to record calculations is a very exciting development. Here are my thoughts.

The utility of reporting calculations is that it increases transparency and it allows the users of the data to understand how the reported value is related to an observed measurement.

There are different levels of fidelity that can be targeted, and each needs a bit more metadata to be achievable.

  1. Indicating that one or many calculations were applied → can be achieved with a raw/processed tag.
  2. Indicating the nature of those calculations → The calculation table you're showing here does this well. However, it supposes that a transformed value can be attained using a single equation / calculation step. That may not always be the case. Imagine, for instance, that N1 concentrations are:
  • filtered for outliers
  • log-transformed
  • linearly-interpolated
  • log-untransformed
  • smoothed with a 7-day window
    It would be difficult, then, to explain all of this in a single equation field. Moreover, some steps (like outlier detection) may rely more on an algorithm than on a specific mathematical equation.
    The steps above are an example of a pipeline of transformations. Pipelines are typically applied to signals (groups of measurements that represent the evolution of a quantity over time), and each step of the pipeline generates a new version of the time series, with (sometimes) modified values for each point. So, we can imagine:
  3. Reporting each step of the pipeline (with its equation or algorithm), along with the modified value of the data point at each step. Reporting at this level makes the transformation process more transparent (each version of the time series is made available) and composable (maybe I want to use the data after it has been filtered for outliers, but interpolate it differently). It would require a different table structure than what you are proposing, though (see the sketch after this list).
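As a sketch of what this level of reporting could look like in code (this is not metEAUdata's actual API, just an illustration; the outlier rule, step names, and data are made up), each step produces and keeps a new named version of the series:

```python
from typing import Callable
import numpy as np
import pandas as pd

def apply_pipeline(series: pd.Series,
                   steps: list[tuple[str, Callable[[pd.Series], pd.Series]]]) -> dict[str, pd.Series]:
    """Apply each named step in order, keeping every intermediate version.

    The returned dict maps version names ('raw', 'raw+outliers_removed', ...) to the
    series produced at that stage, so any intermediate version can be reused or
    re-processed differently downstream.
    """
    versions = {"raw": series}
    name, current = "raw", series
    for step_name, func in steps:
        current = func(current)
        name = f"{name}+{step_name}"
        versions[name] = current
    return versions

def drop_outliers(s: pd.Series) -> pd.Series:
    """Illustrative rule only: mask points more than 3 standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return s.where(z.abs() < 3)

# A pipeline matching the steps listed above (placeholder implementations).
steps = [
    ("outliers_removed", drop_outliers),
    ("log", np.log),
    ("interpolated", lambda s: s.interpolate(method="linear")),
    ("unlog", np.exp),
    ("smoothed_7d", lambda s: s.rolling(window=7, min_periods=1).mean()),
]

# Hypothetical daily N1 values, for illustration only.
n1 = pd.Series([12.0, 15.0, 300.0, 11.0, 18.0, 14.0, 16.0, 13.0],
               index=pd.date_range("2024-09-01", periods=8, freq="D"))
versions = apply_pipeline(n1, steps)
```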

In my thesis, I worked on a system to report data processing in a way similar to level 3. I called it metEAUdata, and it's currently available open source as a Python package. The package itself is not necessarily relevant for this thread, but the metadata attributes and the structure it uses to organize processing information might inform this conversation :slight_smile: The README is not super clear, but I will attach the chapter if you want to take a look (relevant bits: pages 43–53).
I am also attaching a slide I used in my defence that shows what metEAUdata does, i.e. it defines an entity that is transformed (the signal), and each version of its data is attached to it with information on how it was produced.

Looking forward to discussing this further!
Meteaudata repo:

Chapter:
https://ulavaldti-my.sharepoint.com/:b:/g/personal/jdthe3_ulaval_ca/EQNijmRqYMRHiOzLs52V9hMBDVHxOmoZhRkt5KeYeM5YvA?e=Czj6ey

Slides:
https://ulavaldti-my.sharepoint.com/:p:/g/personal/jdthe3_ulaval_ca/Ee7bXSC0OENAltKkW-gqkRwB4xf9iG8mGNNnzcVxuHlm9Q?e=QS5SZP

Having read through some of JD’s linked thesis chapter above, and talked more with folks, I think we’ve arrived at a more reasonable draft solution.

To JD's point above, there's a lot that can happen to data, and some transparency about what has been done is very useful when comparing across points. But this can't always be recorded neatly as a single equation, and often there are multiple steps that should (ideally) be reported together.

After some discussion, we landed on a new draft table structure that looks like this (header names are not final and only placeholders):

| containerID | name | summary | purpose | provisional PK (CK) | stepID | order | equation | reflink | sourceCode | lastEdited | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nmlNormPmmov | ct log for PCR | generate PCR gc/mL value using ct curve | log-transformed | nmlNormPmmov_ctLog | ctLog | 1 | e = mc^2 | website | hdsgjd | date | lorem ipsum |
| nmlNormPmmov | deduplication of PCR inputs | remove accidental duplicate PCR gc/mL values after ct curve | de-duplicated | nmlNormPmmov_dedup | dedup | 2 | a^2 + b^2 = c^2 | website | hdsgjd | date | lorem ipsum |

Where:

  • containerID - varchar - mandatory - linked to a single measure (may be used multiple times); the shorthand ID for the data transformation pipeline used.
  • name - varchar - mandatory - free text human-readable name of a single calculation/data treatment
  • summary - varchar - mandatory - free text human-readable summary of a single calculation/data treatment. This should also explain terms used in the equation field.
  • purpose - categorical - mandatory - explain the purpose/nature of a single calculation/data treatment.
  • provisional PK (CK) - varchar - mandatory - the PK for this table, possibly a composite key (CK) of containerID and stepID.
  • stepID - varchar - mandatory - an ID for a single calculation/data treatment.
  • order - integer - mandatory - the order a single calculation/data treatment takes within the full container pipeline/workflow.
  • equation - varchar - recommended - the equation used in the calculation/data treatment.
  • reflink - varchar - recommended - a reference link for the calculation/data treatment.
  • sourceCode - varchar - recommended - the source code for the calculation/data treatment, more applicable for algorithms/more complex steps.
  • lastEdited - dateTime - optional - when the row was last updated/changed.
  • notes - varchar - optional - free text - notes on the row and the calculation/data treatment.

This would only be linked to reported measures, which are most likely to be raw values or final values, but may include values from intermediate steps. Measures would also gain a new header where they would be tagged as raw or processed. Some documentation would be required to explain how to decide which category a measure belongs to.
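As a small illustration of how a reported measure could be traced back to its ordered treatment steps (column names follow the draft above; the raw/processed tag name and all values are placeholders of my own):

```python
import pandas as pd

# Draft calculations table: two steps of the same pipeline (placeholder content).
calculations = pd.DataFrame([
    {"containerID": "nmlNormPmmov", "stepID": "ctLog", "order": 1,
     "name": "ct log for PCR"},
    {"containerID": "nmlNormPmmov", "stepID": "dedup", "order": 2,
     "name": "deduplication of PCR inputs"},
])

# A reported measure tagged as processed, pointing at the pipeline that produced it.
# ("processedTag" is a placeholder name for the proposed raw/processed header.)
measure = {"measureRepID": "measureA", "containerID": "nmlNormPmmov", "processedTag": "processed"}

# Recover the ordered list of data treatments applied to this measure.
pipeline = (calculations[calculations["containerID"] == measure["containerID"]]
            .sort_values("order"))
print(pipeline[["order", "stepID", "name"]])
```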

Ideally we'd also want to point users to metEAUdata, as it would be very useful for this feature.

Anything I missed or any further thoughts?

@dmanuel @jeandavidt @Sorin @NHizon

This calculation discussion is thoughtful. It is great to have @jeandavidt, who has lived with this issue and come up with solutions.

Another context from health is clinical drug trials, where data must be submitted to regulatory agencies (e.g. the FDA) with full transparency and clarity. Back in the day, the SAS programming language was used, and it allowed appending code to a variable as metadata.

In general,

  1. Having a new calculation table makes sense. I don’t say this lightly because the number of tables in ODM is daunting and challenging for users.
  2. I'd aim for transparency, but with the expectation that details live in a reference link rather than within the ODM metadata. Calculations will get complex. I make an analogy to sequence data: ODM doesn't need to store sequencing data, just the metadata. Similarly, ODM doesn't need to store calculation code, just the metadata for that code.

So, I think it is reasonable to have one row per calculation with a reference link out to details of the individual steps or calculation processes.

I look forward to more discussions!

@dmanuel and @jeandavidt met yesterday to talk through this table and the idea more thoroughly, and while we may have reasoned in circles a few times, I think we are closer to a solution now.

One consistent theme was whether we want calculations to simply be protocols. While that would be more parsimonious in some ways, it would mean using a protocol ID that records every lab step and every data treatment together, which becomes quite onerous and may be an impediment to reporting. Even if we record calculations in a similar style, we end up with more detailed calculation metadata (a pro), but at the cost of additional tables and complexity (a con). So I believe we landed on trying to contain this in a single table, as proposed here.

Of the last proposed headers, nothing was cut, although some reordering and renaming was proposed. The headers, to repeat them here, were:

  • containerID - varchar - mandatory - linked to a single measure (may be used multiple times); the shorthand ID for the data transformation pipeline used.
  • name - varchar - mandatory - free text human-readable name of a single calculation/data treatment
  • summary - varchar - mandatory - free text human-readable summary of a single calculation/data treatment. This should also explain terms used in the equation field.
  • purpose - categorical - mandatory - explain the purpose/nature of a single calculation/data treatment.
  • provisional PK (CK) - varchar - mandatory - the PK for this table, possibly a composite key (CK) of containerID and stepID.
  • stepID - varchar - mandatory - an ID for a single calculation/data treatment.
  • order - integer - mandatory - the order a single calculation/data treatment takes within the full container pipeline/workflow.
  • equation - varchar - recommended - the equation used in the calculation/data treatment.
  • reflink - varchar - recommended - a reference link for the calculation/data treatment.
  • sourceCode - varchar - recommended - the source code for the calculation/data treatment, more applicable for algorithms/more complex steps.
  • lastEdited - dateTime - optional - when the row was last updated/changed.
  • notes - varchar - optional - free text - notes on the row and the calculation/data treatment.

Parallels with the measures table were drawn, and something analogous to the value field was proposed. I suggested something like a standard field, where one can categorically record the standard to which something is being standardized (e.g. PMMoV, crAssphage, flow) or the way it was smoothed (e.g. Bayesian smoothing, central average smoothing, a 7-day window, time). There was disagreement around whether to bring measurements over from the parts table for this, or to create some duplication with new categorical inputs for this field.

There was also discussion around units: how units are calculated, whether they should be connected here, or connected at all. This discussion was largely unresolved.

With the suggested additions and possible name changes the list of headers (in order) becomes (with some notes):

  • calculationID → Formerly provisional PK (CK)
  • pipelineID → formerly containerID, a data pipeline identifier
  • treatmentID → formerly stepID, a single data treatment identifier
  • name
  • summary
  • calcType → formerly purpose, categorical calculation type. Suggested possible inputs are: normalization, standardization, smoothing, and predictiveModelling.
  • value or standard
  • order
  • equation
  • reflink
  • sourceCode → open-ended; could take the full code as text, or a URL to where the code is stored. Likely a different URL than the reflink field.
  • lastEdited
  • notes

Names are still pretty up in the air. For the table and the IDs, we tossed around function, derivation, calculation, calculation set, data treatment, etc.

Please feel free to add anything I may have missed, or to respond with any further thoughts or reactions you have to this. We’ll try to meet again next week to discuss further and hopefully finalize something.

Another discussion led us to finalizing the above headers for the calculations table. To repeat them here, the final headers are:

  • calculationID: - varchar - mandatory - the PK for this table, a CK of pipelineID and treatmentID. Formerly provisional PK (CK).
  • pipelineID: - varchar - mandatory - linked to a single measure (may be used multiple times); the shorthand ID for the data transformation pipeline used. Formerly containerID, a data pipeline identifier.
  • treatmentID: - varchar - mandatory - an ID for a single calculation/data treatment. Formerly stepID, a single data treatment identifier.
  • name: - varchar - optional - free text human-readable name of a single calculation/data treatment.
  • summary: - varchar - optional - free text human-readable summary of a single calculation/data treatment. This should also explain terms used in the equation field.
  • calcType: - categorical - recommended - explain the purpose/nature of a single calculation/data treatment. Possible inputs are: normalization, standardization, smoothing, and predictiveModelling. Formerly purpose.
  • standard: - varchar - mandatoryIf - a field where one can categorically record the standard to which something is being standardized (e.g. PMMoV, crAssphage, flow) or the way it was smoothed (e.g. Bayesian smoothing, central average smoothing, a 7-day window, time).
  • order: - integer - recommended - the order a single calculation/data treatment takes within the full container pipeline/workflow.
  • equation: - varchar - recommended - the equation used in the calculation/data treatment.
  • reflink: - varchar - recommended - a reference link to the calculation/data treatment.
  • sourceCode: - varchar - recommended - the source code for the calculation/data treatment, more applicable for algorithms/more complex steps. It is possible to record the full code as text, or a URL to where the code is stored. Likely a different URL than the reflink field.
  • lastEdited: - dateTime - optional - when the row was last updated/changed.
  • notes: - varchar - optional - free text - notes on the row and the calculation/data treatment.

There was some discussion around the standard field and its naming, but it was decided that because value has a rather specific meaning in the measures and methods context in ODM, it would be inconsistent to use it here. Given that this field will mostly (we anticipate) be used for reporting standardization details, we opted for the name standard.

Additional ODM infrastructure to be added, beyond this new table and its fields:

  • A category set for the calcType field, with the options of normalization, standardization, smoothing, and predictiveModelling.
  • Conditional sets for the standard field, depending on what the input value for calcType is.
  • A new field in the measures table called valTreat or “value treatment”, which specifies what kind of data treatment the measurement value has received (i.e. the nature of the value).
  • A category set for the valTreat field, with the options of raw, derived, estimate, and predicted.
  • Adding calculationID as a FK to the measures table.
  • Add more generic units to the ODM so that unit/aggregation metadata and standardization metadata can begin to be stored separately, e.g. “PMMoV standardized mean” → “standardized mean” + calcType “standardization”, standard “PMMoV”.

This last bullet factors into the discussion about unit and aggregation metadata, and charts a path toward eventually splitting them. We are cognizant that this may require some changes to the wide-name rules for measures; the calculations table on its own may already require some additional attention on this point.
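To illustrate how the calcType category set, the conditional sets for standard, and the valTreat set could fit together for validation, here is a hedged sketch. The option lists include only the values named in this thread, and the validation rule itself is my assumption rather than decided ODM behaviour:

```python
from typing import Optional

# Category set for calcType.
CALC_TYPES = {"normalization", "standardization", "smoothing", "predictiveModelling"}

# Conditional sets for the standard field, depending on calcType
# (only the examples mentioned in this thread are listed).
STANDARD_OPTIONS = {
    "standardization": {"PMMoV", "crAssphage", "flow"},
    "smoothing": {"bayesianSmoothing", "centralAverageSmoothing", "7-day", "time"},
}

# Category set for the new valTreat ("value treatment") field in the measures table.
VAL_TREAT = {"raw", "derived", "estimate", "predicted"}

def validate_calc_row(calc_type: str, standard: Optional[str]) -> bool:
    """Sketch of checking a calculations-table row against the draft category sets."""
    if calc_type not in CALC_TYPES:
        return False
    allowed = STANDARD_OPTIONS.get(calc_type)
    if allowed is None:         # no conditional set defined for this calcType
        return True
    return standard in allowed  # standard is mandatoryIf the calcType has a conditional set

print(validate_calc_row("standardization", "PMMoV"))  # True
print(validate_calc_row("smoothing", "PMMoV"))        # False
```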
