Calculations, Normalizations, and Math Protocols

On yesterday’s working group call, there was discussion around how best to report normalization of single measures. This revived previous discussions about the possibility of recording calculation methods (likely separate from protocols) to report with aggregated measures. This was something we discussed previously as being a good idea, but we said we would hold off until version 3. This new conversation, however, sparked renewed interest in this possibility.

An example use case is the 7-day moving average. This is a (somewhat arbitrary) type of mean used across the field, but there are differences in how it's calculated: front-weighting versus end-weighting, the number of observations used, and so on. Having a different aggregation value for each variant seems ungainly and inefficient. Keeping a single 7-day moving average aggregation value lets us group similar measures, while a calculations table would allow more detailed reporting on how reporters arrived at that number.
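To make those differences concrete, here is a minimal pandas sketch (the values, dates, and the min_periods choice are purely illustrative, not part of any ODM proposal) showing three variants that could all plausibly be reported under a single "7-day moving average" aggregation value:

```python
import pandas as pd

# Hypothetical daily N1 concentrations; values and dates are placeholders.
n1 = pd.Series(
    [12.0, 15.0, 11.0, 18.0, 14.0, 16.0, 13.0, 17.0, 19.0, 15.0],
    index=pd.date_range("2024-09-01", periods=10, freq="D"),
)

# End-anchored (trailing) 7-day mean: each point averages the 7 days ending on it.
trailing = n1.rolling(window=7).mean()

# Centred 7-day mean: each point averages 3 days before, the day itself, and 3 days after.
centred = n1.rolling(window=7, center=True).mean()

# Allowing partial windows (here, at least 4 observations) changes the edges again.
trailing_partial = n1.rolling(window=7, min_periods=4).mean()
```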

The main reason for not recording this with lab protocols in the protocols, protocolRelationships, and protocolSteps tables is that each measure can only be linked to a single protocol ID, so a separate calculation protocol would be difficult to record. Furthermore, we have received feedback that there are already a lot of steps in a single protocol ID, making changes, updates, and deviations from the protocol challenging to manage in model administration. If we also added all the calculations to protocols, the result would be efficient when it works, but ungainly when things change.

What I’m proposing is a calculations table, with the following headers:

  • calculationID : the primary key of the table, used to refer to the calculation.
  • name : the name of the calculation or series of calculations.
  • summ : a plain-language summary of the calculation, why it is done, and how.
  • equation : the mathematical equation/expression being used.
  • lastEdited : when the field was last edited or updated.
  • notes : any additional notes on the calculation.

calculationID would also be added as a header in the measures table, where it would be a foreign key.
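As a rough sketch of what a row of this table and its link from measures could look like, here are the proposed headers as Python dataclasses (the types and which fields are optional are my assumptions, not settled parts of the proposal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Calculation:
    """One row of the proposed calculations table (draft header names from above)."""
    calculationID: str                 # primary key of the table
    name: str                          # name of the calculation or series of calculations
    summ: str                          # plain-language summary: what is done, why, and how
    equation: str                      # the mathematical equation/expression being used
    lastEdited: Optional[str] = None   # when the row was last edited or updated
    notes: Optional[str] = None        # any additional notes on the calculation

@dataclass
class Measure:
    """Relevant subset of the measures table, with the proposed foreign key added."""
    measureRepID: str
    protocolID: str
    sampleID: str
    calculationID: Optional[str]       # FK into the calculations table
    measure: str
    value: float
    unit: str
    aggregation: str
    notes: Optional[str] = None
```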

For example:

Suppose we are recording a single flow- and population-normalized measure of the N1 region of the SARS-CoV-2 virus at the National Microbiology Lab (NML).

The calculations table looks like this:

| calculationID | name | summ | equation | lastEdited | notes |
| --- | --- | --- | --- | --- | --- |
| nml_floPopNorm | Flow and population normalization at NML | Measured levels of viral DNA or RNA (in gene copies) are normalized and standardized to the flow volume measured at the wastewater treatment plant that day, as well as to the population served by the plant, per 100,000 people. | (genome copies / day / 100 000 people) = gcmL * 10^9 * flowVol / (population / 100 000) | 24-09-2024 | NA |

and then the measures table would look like this:

| measureRepID | protocolID | sampleID | calculationID | measure | value | unit | aggregation | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| measureA | nmlProtocol | sampleA | nml_floPopNorm | covN1 | 0.345 | gcDay100k | single | lorem ipsum |
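To make the normalization equation above concrete, here is a small sketch of the calculation as a function. The assumption that flowVol is reported in megalitres per day is mine, inferred from the 10^9 (mL per megalitre) factor, and the inputs shown are placeholders rather than real NML data:

```python
def flow_pop_normalize(gc_per_ml: float, flow_vol_megalitres: float, population: float) -> float:
    """gcmL * 10^9 * flowVol / (population / 100 000), per the equation above.

    Returns genome copies / day / 100 000 people, assuming flowVol is the daily
    flow in megalitres (1 megalitre = 10^9 mL).
    """
    gc_per_day = gc_per_ml * 1e9 * flow_vol_megalitres
    return gc_per_day / (population / 100_000)

# Placeholder inputs, for illustration only.
print(flow_pop_normalize(gc_per_ml=0.002, flow_vol_megalitres=150.0, population=1_200_000))
```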

Obviously the naming of parts here is entirely in draft, so I'm happy to hear feedback on that. I'm also very happy to hear feedback on any other part of the structure proposed here.

@Sorin @dmanuel @jeandavidt @NHizon - anxious to hear your thoughts.

Hi Mathew, I think having a way to record calculations is a very exciting development. Here are my thoughts.

The utility of reporting calculations is that it increases transparency and it allows the users of the data to understand how the reported value is related to an observed measurement.

There are different levels of fidelity that can be targeted, and each needs a bit more metadata to be achievable.

  1. Indicating that one or many calculations were applied → can be achieved with a raw/processed tag.
  2. Indicating the nature of those calculations → The calculation table you're showing here does this well. However, it supposes that a transformed value can be attained using a single equation / calculation step. That may not always be the case. Imagine, for instance, that N1 concentrations are:
  • filtered for outliers
  • log-transformed
  • linearly-interpolated
  • log-untransformed
  • smoothed with a 7-day window
    It would be difficult, then, to explain all of this in a single equation field. Moreover, some steps (like outlier detection) may rely more on an algorithm than on a specific mathematical equation.
    The steps above are an example of a pipeline of transformations. Pipelines are typically applied to signals (groups of measurements that represent the evolution of a quantity over time), and each step of the pipeline generates a new version of the time series, with (sometimes) modified values for each point. So, we can imagine:
  3. Reporting each step of the pipeline (with its equation or algorithm), along with the modified value of the data point at each step. Reporting at this level makes the transformation process more transparent (each version of the time series is made available) and composable (maybe I want to use the data after it has been filtered for outliers, but interpolate it differently). It would require a different table structure than what you are proposing, though (see the sketch after this list).
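As a sketch of what this level of reporting could look like in code (this is not metEAUdata's actual API, just an illustration; the outlier rule, step names, and data are made up), each step produces and keeps a new named version of the series:

```python
from typing import Callable
import numpy as np
import pandas as pd

def apply_pipeline(series: pd.Series,
                   steps: list[tuple[str, Callable[[pd.Series], pd.Series]]]) -> dict[str, pd.Series]:
    """Apply each named step in order, keeping every intermediate version.

    The returned dict maps version names ('raw', 'raw+outliers_removed', ...) to the
    series produced at that stage, so any intermediate version can be reused or
    re-processed differently downstream.
    """
    versions = {"raw": series}
    name, current = "raw", series
    for step_name, func in steps:
        current = func(current)
        name = f"{name}+{step_name}"
        versions[name] = current
    return versions

def drop_outliers(s: pd.Series) -> pd.Series:
    """Illustrative rule only: mask points more than 3 standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return s.where(z.abs() < 3)

# A pipeline matching the steps listed above (placeholder implementations).
steps = [
    ("outliers_removed", drop_outliers),
    ("log", np.log),
    ("interpolated", lambda s: s.interpolate(method="linear")),
    ("unlog", np.exp),
    ("smoothed_7d", lambda s: s.rolling(window=7, min_periods=1).mean()),
]

# Hypothetical daily N1 values, for illustration only.
n1 = pd.Series([12.0, 15.0, 300.0, 11.0, 18.0, 14.0, 16.0, 13.0],
               index=pd.date_range("2024-09-01", periods=8, freq="D"))
versions = apply_pipeline(n1, steps)
```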

In my thesis, I worked on a system to report data processing in a way similar to level 3. I called it metEAUdata, and it's currently available open source as a Python package. The package itself is not necessarily relevant for this thread, but the metadata attributes and the structure it uses to organize processing information might inform this conversation :slight_smile: The README is not super clear, but I will attach the chapter if you want to take a look (relevant bits: pages 43–53).
I am also attaching a slide I used in my defence that shows what metEAUdata does, i.e. it defines an entity that is transformed (the signal), and each version of its data is attached to it with information on how it was produced.

Looking forward to discussing this further!
Meteaudata repo:

Chapter:
https://ulavaldti-my.sharepoint.com/:b:/g/personal/jdthe3_ulaval_ca/EQNijmRqYMRHiOzLs52V9hMBDVHxOmoZhRkt5KeYeM5YvA?e=Czj6ey

Slides:
https://ulavaldti-my.sharepoint.com/:p:/g/personal/jdthe3_ulaval_ca/Ee7bXSC0OENAltKkW-gqkRwB4xf9iG8mGNNnzcVxuHlm9Q?e=QS5SZP

Having read through some of JD’s linked thesis chapter above, and talked more with folks, I think we’ve arrived at a more reasonable draft solution.

To JD's point above, there's a lot that can happen to data, and some transparency about what has been done is very useful when comparing across points. But this can't always be recorded neatly as a single equation, and often there are multiple steps that should (ideally) be reported together.

After some discussion, we landed on a new draft table structure that looks like this (header names are not final and only placeholders):

| containerID | name | summary | purpose | provisional PK (CK) | stepID | order | equation | reflink | sourceCode | lastEdited | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nmlNormPmmov | ct log for PCR | generate PCR gc/mL value using ct curve | log-transformed | nmlNormPmmov_ctLog | ctLog | 1 | e = mc^2 | website | hdsgjd | date | lorem ipsum |
| nmlNormPmmov | deduplication of PCR inputs | remove accidental duplicate PCR gc/mL values after ct curve | de-duplicated | nmlNormPmmov_dedup | dedup | 2 | a^2 + b^2 = c^2 | website | hdsgjd | date | lorem ipsum |

Where:

  • containerID - varchar - mandatory - linked to a single measure (may be used multiple times); the shorthand ID for the data transformation pipeline used.
  • name - varchar - mandatory - free text human-readable name of a single calculation/data treatment
  • summary - varchar - mandatory - free text human-readable summary of a single calculation/data treatment. This should also explain terms used in the equation field.
  • purpose - categorical - mandatory - explain the purpose/nature of a single calculation/data treatment.
  • provisional PK (CK) - varchar - mandatory - the PK for this table, possibly a composite key (CK) of containerID and stepID.
  • stepID - varchar - mandatory - an ID for a single calculation/data treatment.
  • order - integer - mandatory - the order a single calculation/data treatment takes within the full container pipeline/workflow.
  • equation - varchar - recommended - the equation used in the calculation/data treatment.
  • reflink - varchar - recommended - a reference link for the calculation/data treatment.
  • sourceCode - varchar - recommended - the source code for the calculation/data treatment, more applicable for algorithms/more complex steps.
  • lastEdited - dateTime - optional - when the row was last updated/changed.
  • notes - varchar - optional - free text - notes on the row and the calculation/data treatment.

This would only be linked to reported measures, which are most likely to be raw values or final values, but may include values from intermediate steps. Measures would also gain a new header where they would be tagged as raw or processed. Some documentation would be required to explain how to decide which category a measure belongs to.
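As a small illustration of how a reported measure could be traced back to its ordered treatment steps (column names follow the draft above; the raw/processed tag name and all values are placeholders of my own):

```python
import pandas as pd

# Draft calculations table: two steps of the same pipeline (placeholder content).
calculations = pd.DataFrame([
    {"containerID": "nmlNormPmmov", "stepID": "ctLog", "order": 1,
     "name": "ct log for PCR"},
    {"containerID": "nmlNormPmmov", "stepID": "dedup", "order": 2,
     "name": "deduplication of PCR inputs"},
])

# A reported measure tagged as processed, pointing at the pipeline that produced it.
# ("processedTag" is a placeholder name for the proposed raw/processed header.)
measure = {"measureRepID": "measureA", "containerID": "nmlNormPmmov", "processedTag": "processed"}

# Recover the ordered list of data treatments applied to this measure.
pipeline = (calculations[calculations["containerID"] == measure["containerID"]]
            .sort_values("order"))
print(pipeline[["order", "stepID", "name"]])
```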

Ideally we'd also want to point users to metEAUdata, as it would be very useful for this feature.

Anything I missed or any further thoughts?

@dmanuel @jeandavidt @Sorin @NHizon

This calculation discussion is thoughtful. It is great to have @jeandavidt, who has lived with this issue and come up with solutions.

Another context from health is clinical drug trials, where data must be submitted to regulatory agencies (e.g. the FDA) with full transparency and clarity. Back in the day, the SAS programming language was used, and it allowed appending code to a variable as metadata.

In general,

  1. Having a new calculation table makes sense. I don’t say this lightly because the number of tables in ODM is daunting and challenging for users.
  2. I'd aim for transparency, but with the expectation that details live in a reference link rather than within the ODM metadata. Calculations will get complex. I make an analogy to sequence data: ODM doesn't need to store sequencing data, just the metadata. Similarly, ODM doesn't need to store calculation code, just the metadata for that code.

So, I think it is reasonable to have one row per calculation with a reference link out to details of the individual steps or calculation processes.

I look forward to more discussions!

@dmanuel and @jeandavidt met yesterday to talk through this table and the idea more thoroughly, and while we may have reasoned in circles a few times, I think we are closer to a solution now.

One consistent theme was whether we want calculations to simply be protocols. While that would be more parsimonious in some ways, it would mean using a protocol ID that records every lab step and every data treatment together, which becomes quite onerous and may be an impediment to reporting. Even if we record calculations in a similar style, we end up with more detailed calculation metadata (a pro), but at the cost of additional tables and complexity (a con). So I believe we landed on trying to contain this in a single table, as proposed here.

Of the last proposed headers, nothing was cut, although some reordering and renaming was proposed. The headers, to repeat them here, were:

  • containerID - varchar - mandatory - linked to a single measure (may be used multiple times); the shorthand ID for the data transformation pipeline used.
  • name - varchar - mandatory - free text human-readable name of a single calculation/data treatment
  • summary - varchar - mandatory - free text human-readable summary of a single calculation/data treatment. This should also explain terms used in the equation field.
  • purpose - categorical - mandatory - explain the purpose/nature of a single calculation/data treatment.
  • provisional PK (CK) - varchar - mandatory - the PK for this table, possibly a composite key (CK) of containerID and stepID.
  • stepID - varchar - mandatory - an ID for a single calculation/data treatment.
  • order - integer - mandatory - the order a single calculation/data treatment takes within the full container pipeline/workflow.
  • equation - varchar - recommended - the equation used in the calculation/data treatment.
  • reflink - varchar - recommended - a reference link for the calculation/data treatment.
  • sourceCode - varchar - recommended - the source code for the calculation/data treatment, more applicable for algorithms/more complex steps.
  • lastEdited - dateTime - optional - when the row was last updated/changed.
  • notes - varchar - optional - free text - notes on the row and the calculation/data treatment.

Parallels with the measures table were drawn, and something analogous to the value field was proposed. I suggested something like a standard field, where one can categorically record the standard to which something is being standardized (e.g. PMMoV, crAssphage, flow) or the way it was smoothed (e.g. Bayesian smoothing, central average smoothing, a 7-day window, time). There was disagreement around whether to bring measurements over from the parts table for this, or to create some duplication with new categorical inputs for this field.

There was also discussion around units: how units are calculated, whether they should be connected here, or connected at all. This discussion was largely unresolved.

With the suggested additions and possible name changes the list of headers (in order) becomes (with some notes):

  • calculationID → Formerly provisional PK (CK)
  • pipelineID → formerly containerID, a data pipeline identifier
  • treatmentID → formerly stepID, a single data treatment identifier
  • name
  • summary
  • calcType → formerly purpose, categorical calculation type. Suggested possible inputs are: normalization, standardization, smoothing, and predictiveModelling.
  • value or standard
  • order
  • equation
  • reflink
  • sourceCode → open-ended; could take the full code as text, or a URL to where the code is stored. Likely a different URL than the reflink field.
  • lastEdited
  • notes

Names are still pretty up in the air. For the table and the IDs, we tossed around function, derivation, calculation, calculation set, data treatment, etc.

Please feel free to add anything I may have missed, or to respond with any further thoughts or reactions you have to this. We’ll try to meet again next week to discuss further and hopefully finalize something.

Another discussion led us to finalizing the above headers for the calculations table. To repeat them here, the final headers are:

  • calculationID: - varchar - mandatory - the PK for this table, a CK of pipelineID and treatmentID. Formerly provisional PK (CK).
  • pipelineID: - varchar - mandatory - linked to a single measure (may be used multiple times); the shorthand ID for the data transformation pipeline used. Formerly containerID, a data pipeline identifier.
  • treatmentID: - varchar - mandatory - an ID for a single calculation/data treatment. Formerly stepID, a single data treatment identifier.
  • name: - varchar - optional - free text human-readable name of a single calculation/data treatment.
  • summary: - varchar - optional - free text human-readable summary of a single calculation/data treatment. This should also explain terms used in the equation field.
  • calcType: - categorical - recommended - explain the purpose/nature of a single calculation/data treatment. Possible inputs are: normalization, standardization, smoothing, and predictiveModelling. Formerly purpose.
  • standard: - varchar - mandatoryIf - a field where one can categorically record the standard to which something is being standardized (e.g. PMMoV, crAssphage, flow) or the way it was smoothed (e.g. Bayesian smoothing, central average smoothing, a 7-day window, time).
  • order: - integer - recommended - the order a single calculation/data treatment takes within the full container pipeline/workflow.
  • equation: - varchar - recommended - the equation used in the calculation/data treatment.
  • reflink: - varchar - recommended - a reference link to the calculation/data treatment.
  • sourceCode: - varchar - recommended - the source code for the calculation/data treatment, more applicable for algorithms/more complex steps. It is possible to record the full code as text, or a URL to where the code is stored. Likely a different URL than the reflink field.
  • lastEdited: - dateTime - optional - when the row was last updated/changed.
  • notes: - varchar - optional - free text - notes on the row and the calculation/data treatment.

There was some discussion around the standard field and its naming, but it was decided that because value has a rather specific meaning in the measures and methods context in ODM, it would be inconsistent to use it here. Given that this field will mostly (we anticipate) be used for reporting standardization details, we opted for the name standard.

Additional ODM infrastructure to be added, beyond this new table and its fields:

  • A category set for the calcType field, with the options of normalization, standardization, smoothing, and predictiveModelling.
  • Conditional sets for the standard field, depending on what the input value for calcType is.
  • A new field in the measures table called valTreat or “value treatment”, which specifies what kind of data treatment the measurement value has received (i.e. the nature of the value).
  • A category set for the valTreat field, with the options of raw, derived, estimate, and predicted.
  • Adding calculationID as a FK to the measures table.
  • Add more generic units to the ODM so that unit/aggregation metadata and standardization metadata can begin to be stored separately, e.g. “PMMoV standardized mean” → “standardized mean” + calcType “standardization”, standard “PMMoV”.

This last bullet factors into the discussion about unit and aggregation metadata, and charts a path toward eventually splitting them. We are cognizant that this may require some changes to the wide-name rules for measures; the calculations table on its own may already require some additional attention on this point.
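To illustrate how the calcType category set, the conditional sets for standard, and the valTreat set could fit together for validation, here is a hedged sketch. The option lists include only the values named in this thread, and the validation rule itself is my assumption rather than decided ODM behaviour:

```python
from typing import Optional

# Category set for calcType.
CALC_TYPES = {"normalization", "standardization", "smoothing", "predictiveModelling"}

# Conditional sets for the standard field, depending on calcType
# (only the examples mentioned in this thread are listed).
STANDARD_OPTIONS = {
    "standardization": {"PMMoV", "crAssphage", "flow"},
    "smoothing": {"bayesianSmoothing", "centralAverageSmoothing", "7-day", "time"},
}

# Category set for the new valTreat ("value treatment") field in the measures table.
VAL_TREAT = {"raw", "derived", "estimate", "predicted"}

def validate_calc_row(calc_type: str, standard: Optional[str]) -> bool:
    """Sketch of checking a calculations-table row against the draft category sets."""
    if calc_type not in CALC_TYPES:
        return False
    allowed = STANDARD_OPTIONS.get(calc_type)
    if allowed is None:         # no conditional set defined for this calcType
        return True
    return standard in allowed  # standard is mandatoryIf the calcType has a conditional set

print(validate_calc_row("standardization", "PMMoV"))  # True
print(validate_calc_row("smoothing", "PMMoV"))        # False
```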
