Additional fields? Or how best to capture diverse European data

The data structure that @Sorin has created for European Data based off the ODM is great and maps with relative ease. There are (of course) a few sticky areas.

Wide format problems:

There are five fields which Sorin has added that present some challenge for mapping:

  • measure_estimated: a binary flag to indicate whether a measure is truly an empirical measure, or an estimated measure generated by predictive algorithms.
  • measure_value_alpha: There is a measure-value field as well in the structure, but that field is solely for numerical/integer values. This second field is for categorical values.
  • measure_value_normalization_method: Specifies the normalization for a measure (e.g., PMMoV, crassphage, etc.)
  • measures_name: a name for a measure.
  • site_code: Not a siteID, not a name, but an additional code used for referring to a site.

Wide format proposed solutions:

  • measure_estimated: add an optional header in the measures table? A flag similar to reportable. These could also be bundled together in a measure set, as a use case.
  • measure_value_alpha: I think we realistically map value and value_alpha to the same field. Only one will ever be populated at a time, so there shouldn’t be any collisions.
  • measure_value_normalization_method: this might be an easy link to the calculations table? Perhaps even an optional header in the calculations table?
  • measures_name: add an optional header in the measures table?
  • site_code: I’m tempted to say “dump it in notes” but I don’t always want to abuse the notes field like that. I’m a bit hard-pressed to come up with an alternate solution though.
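
As a rough sketch of the value / value_alpha merge suggested above (Python; the function name and the choice to raise on a collision are assumptions for illustration, not part of the ODM or the EU structure):

```python
from typing import Optional, Union


def merge_value(
    measure_value: Optional[float],
    measure_value_alpha: Optional[str],
) -> Union[float, str, None]:
    """Collapse the numeric and categorical value columns into one field."""
    has_numeric = measure_value is not None
    has_alpha = measure_value_alpha not in (None, "")

    # The proposal relies on only one column being populated per record,
    # so a row with both is flagged instead of silently picking one.
    if has_numeric and has_alpha:
        raise ValueError("both measure_value and measure_value_alpha are populated")

    if has_numeric:
        return measure_value
    if has_alpha:
        return measure_value_alpha
    return None
```

Raising on a collision keeps the "only one will ever be populated" assumption visible in the data rather than hidden in the mapping.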

Sequencing and variant challenges:

These come from the data coming from RIVM in the Netherlands:

  • Variant_code: takes the Pangolin nomenclature name for a variant, or simply “recombinants”.
  • Variant_name: The WHO nomenclature name for a variant (or group), or simply “recombinants”.
  • ECDC_category: How ECDC has categorized the threat level presented by the given variant, possible values being blank, VOC (variants of concern, usually associated with evidence of diminished effectiveness of treatments, increased transmissibility, immune escape, and/or diagnostic escape), VOI (variants of interest, present [genetic changes] that are predicted or known to cause the same effect as a VOC but have a limited prevalence), VUM (variants under monitoring, bear [genetic markers] suspected to impact the epidemic dynamics but circulate at a very low level), or DEV (de-escalated variants, when a variant is no longer circulating or if there is solid evidence that it does not affect the overall epidemiological situation).
  • WHO_category: How the WHO has categorized the threat level presented by the given variant, possible values being blank, VOC (variants of concern, usually associated with evidence of diminished effectiveness of treatments, increased transmissibility, immune escape, and/or diagnostic escape), VOI (variants of interest, present [genetic changes] that are predicted or known to cause the same effect as a VOC but have a limited prevalence), VUM (variants under monitoring, bear [genetic markers] suspected to impact the epidemic dynamics but circulate at a very low level), or DEV (de-escalated variants, when a variant is no longer circulating or if there is solid evidence that it does not affect the overall epidemiological situation).
  • Is_subvariant_of: the Pangolin nomenclature name for a variant that another variant is descended from, or simply “recombinants”.
  • Sample_size: (need clarification)
  • Variant_cases: (need clarification)

Possible RIVM Variant reporting solutions:

  • Variant_code: this (to me) is just a measure, but with specified nomenclature, and “Variant_name” is the same measure with different nomenclature. I think this may mean figuring out a wide-name scheme that includes nomenclature specs? But it would have to somehow also specify that it’s in a measure set with “Variant_name”. What do others think?
  • Variant_name: see “Variant_code” above.
  • ECDC_category: This I’m unsure about… ECDC would be an orgID probably. But then capturing how they’ve assigned a threat level, maybe as a measure…? That links via measure set to the rest of it?
  • WHO_category: see “ECDC_category” above.
  • Is_subvariant_of: I believe we already have the structure from this with a “sub variant” measure that gets put in a measure set with the other variant. But it seems that figuring out measure sets might be the next horizon for wide names.
  • Sample_size: (need clarification still)
  • Variant_cases: (need clarification still)

A little more lost with these than I usually am, so keen to have any input from @dmanuel and @jeandavidt when you’re back from holidays!

There are enough issues that merit a meeting with @Sorin, @matthomson, @jeandavidt and @dmanuel to ensure we maximize interoperability. It is also worth reviewing what @Sorin added to the measures because some may be covered already.

Here are a few comments to keep the discussion going.

  • measure_estimated: I like the idea of a new optional header, but this topic is worth discussing within the broader discussion of how best to record calculation methods. Generally, we want to encourage sharing original “raw” data and support derived or calculated data. Adding measure_estimated is a smart way to signal to people that there is derived or estimated data.

However, we need to discuss:

  • what is an estimated measure? Normalized? Moving average?

  • How do we best point people to what the estimated measure means, or where are the details about the estimate?

  • measure_value_normalization_method: Another good discussion is needed, which seems related to measure_estimated. We currently embed the normalization method into units, so you may argue we have normalization covered, and could map to that. On the one hand, we follow the most common best practice of describing the normalization in the unit. On the other hand, we are combining a method in a unit - which may merit splitting.

However, if we create a measure_value_normalization_method what happens to units? Do we create a generic unit “gene copies per normalize copy” or “gene copies (normalized)”?

It seems to me that we want to have a logical flow:

  • Is the measure “raw” or “estimated, derived, normalized, predicted, etc.”?
  • How was the measure estimated?
    • What are the important properties of estimation?
  • How do we record the measurement estimation method?
  • How are estimated measures best reported? What are the units?

So now that calculations is a bit more finalized, we have some better solutions for some of these:

  • measure_estimated: I would suggest maybe building this in to measure wide names - but @dmanuel and @jeandavidt and I will talk about that more. mr_valTreat with the possible input values being predicted (I think) or derived (since I don’t imagine anyone is reporting raw data).
  • measure_value_alpha: I think this is a non-issue - we’ll map value and value_alpha to the same field.
  • measure_value_normalization_method: Goes in the calculations table, cl_standard (potentially cl_2_AND_calcType__standardization__standard ?).
  • measures_name: add an optional header in the measures table.

site_code and the variant issues are still under discussion.

An update:

  • measure_estimated: After discussion, we think that we should treat valTreat like the attributes qualityFlag, value, and purpose, and tag it onto the measure wide names that way. Still with the possible input values being predicted (I think) or derived (since I don’t imagine anyone is reporting raw data). See this explained further on the documentation site (Long-format, wide-format, and wide-names – Public Health Environmental Surveillance Open Data Model (PHES-ODM) Documentation), but the wide-name formula for a measure or measure-value is compartment_specimen_fraction_measure_unit_aggregation_index_attribute, where the attribute at the end is always either value, purpose, qualityFlag, or (now) valTreat.

  • measure_value_alpha: Resolved above (map value and value_alpha to the same field)

  • measure_value_normalization_method: Resolved above (wide name: cl_2_AND_calcType__standardization__standard).

  • measures_name: After conferring with @Sorin, this is really just measure in the measures table. It may be a little challenging to map free-text to categories, but we’ll cross that bridge when we come to it.

  • site_code : Defined in the notes of the EU4S-ODM as “Standardized site code (preferably the site code used for reporting under the EU Urban Waste Water Treatment Directive / IATA code for the airports-originated samples)”. This differs from siteID (The unique ID of the site where the samples were collected) and site name (Name of the site where the samples were collected). I am tempted to call site_code equivalent to a parent siteID - does anyone have any thoughts on that?
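
To make the wide-name formula in the measure_estimated point concrete, here is a minimal Python sketch. The function and constant names are hypothetical, and the part codes (wat, sa, NR, hfMe, unitless, me, NA) are simply borrowed from examples elsewhere in this thread; they are not checked against the ODM parts list.

```python
# Attributes allowed at the end of a measure wide name, per the post above.
ALLOWED_ATTRIBUTES = ("value", "purpose", "qualityFlag", "valTreat")


def measure_wide_name(compartment, specimen, fraction, measure,
                      unit, aggregation, index, attribute):
    """Join the eight parts in the order given by the wide-name formula."""
    if attribute not in ALLOWED_ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute!r}")
    parts = [compartment, specimen, fraction, measure,
             unit, aggregation, index, attribute]
    return "_".join(parts)


# Illustrative part codes only (copied from this thread, not validated).
base = ("wat", "sa", "NR", "hfMe", "unitless", "me", "NA")
value_col = measure_wide_name(*base, "value")
val_treat_col = measure_wide_name(*base, "valTreat")
```

With this shape, the valTreat column sits next to the value column for the same measure and differs only in the last part.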

Now with regard to the sequencing/variant reporting:

  • Variant_code: As we said above, this is the pangolin variant name, and maps easiest to measure as far as a drop down approach would go.
  • Variant_name: This, compared to variant_code above, is the WHO variant name, but more holistically refers to a variant family. This probably maps most neatly on to the group field in the measures table.
  • ECDC_category: This was somewhat controversial in my discussions with @jeandavidt and @dmanuel, but what I have suggested is (as already outlined above) recording variant_code as mr_measure; variant_name as mr_group, and then adding ECDC and WHO categorizations as a new measure, so it would have a wide name something like this: wat_sa_NR_hfMe_unitless_me_NA_ecdcVar_value and wat_sa_NR_hfMe_unitless_me_NA_whoVar_value, respectively. Other folks don’t necessarily agree that this should be a measure, but my logic is that it is something that changes over time, so it is recording how the variant is being designated at the time of testing - which is kind of a measurement.
  • WHO_category: see “ECDC_category” above.
  • Is_subvariant_of: This is also a rather controversial field. Some have argued that this kind of structure of describing properties of a static object within the ODM would need another table, but that this is also an edge case and maybe not worth making a new table for. Others have said it could go in notes. Others have said it could be recorded as a measure, but even then, it’s a property that is really about defining a relationship of an object that is defined and recorded outside the ODM, and so maybe it isn’t something we record. We’re going to reflect on it more, but any input or additional comments would be welcome.
  • Sample_size: An easy add, a measure of sample size (i.e., wat_sa_NR_hfMe_unitless_me_NA_sampleSize_value)
  • Variant_cases: An easy add, a measure of positive samples (i.e., wat_sa_NR_hfMe_unitless_me_NA_posSmple_value)

There are two other sheets reporting variant data from Europe (the above are from the Netherlands, but Slovenia and Austria are also reporting). The Slovenian headers are just (roughly translated):

  • date
  • WWTP
  • Census Region
  • Virus Variant Name
  • Variant proportion

Which in the ODM would be equivalent to:

  • aDateEnd
  • si_siteID
  • si_stateProvReg
  • mr_measure
  • wat_sa_NR_hfMe_unitless_me_NA_percPos_value

And Austria’s are (again, roughly translated):

  • Variant name
  • Date
  • Percent positive for variant
  • Variant Group

Which, again, in the ODM would be equivalent to:

  • mr_measure
  • aDateEnd
  • wat_sa_NR_hfMe_unitless_me_NA_percPos_value
  • mr_group
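
The two header mappings above could be written down as lookup tables, roughly like this Python sketch. The source header strings are the rough translations listed above (the real sheets will differ), the ODM names are copied as written in this post, and the helper name and example row are placeholders, not real data.

```python
# Wide name for the percent-positive column, copied as written in this post.
PERC_POS = "wat_sa_NR_hfMe_unitless_me_NA_percPos_value"

SLOVENIA_TO_ODM = {
    "date": "aDateEnd",
    "WWTP": "si_siteID",
    "Census Region": "si_stateProvReg",
    "Virus Variant Name": "mr_measure",
    "Variant proportion": PERC_POS,
}

AUSTRIA_TO_ODM = {
    "Variant name": "mr_measure",
    "Date": "aDateEnd",
    "Percent positive for variant": PERC_POS,
    "Variant Group": "mr_group",
}


def rename_headers(row, mapping):
    """Return a copy of row with source headers replaced by ODM headers."""
    unknown = set(row) - set(mapping)
    if unknown:
        # Fail loudly so that new columns get a deliberate mapping decision.
        raise KeyError(f"unmapped headers: {sorted(unknown)}")
    return {mapping[header]: value for header, value in row.items()}


# Placeholder row for illustration only, not real surveillance data.
example = {
    "Variant name": "variant-x",
    "Date": "2023-01-01",
    "Percent positive for variant": 12.5,
    "Variant Group": "group-y",
}
odm_row = rename_headers(example, AUSTRIA_TO_ODM)
```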

So some outstanding issues to resolve, but feeling much closer to a resolution.

My 2 cents in red below, wherever I thought I can contribute.

Sorin

Adding Sorin’s email comments below:

Actually both Sweden and Slovenia, if I am not mistaken, report raw concentration values per volume – apart from normalized ones – so I believe we need to allow for raw as well, also considering that most physical records like temperature, pH, conductivity or whatever, are raw measures.

Added _alpha just from a database perspective, as I prefer to store strictly numerical and alphanumerical values in dedicated fields. This brings the advantage of uniformly being able to represent the numerical values on graphs or calculating various aggregations, not suitable for alphanumeric values.

We use it internally as a filter for our current pan-European dashboard, where we show only “concentration” values, although we also record trends, inflows and whatnot.

Not sure about qualifying it as a parent, as it’s just another attribute of the site which in some geographies, like the EU, might be relevant and scales well with a horizontal schema (as a new column where you store this code) compared with a vertical schema (where you would have to add the parent as a new row with a set of attributes, which in our case are nonexistent; it’s just a code).

Sorry, didn’t have much time to dig into this part (we’re reporting only concentrations at the moment) so I’ll let the discussion flow (as much as I can, hehe)

I think this perfectly qualifies for a parenting relationship but I’m not sure where to place it. Ideas?

And responding to @sorin’s comments:

True! Sorry - the possible values for this field will be raw, derived, estimate, and predicted. But for the EU4S I was thinking it might be more constrained rather than allowing for all four options.

I think that makes sense - I think that it basically is a measure then, or label for a measure. Unless you think it’s slightly different?

Let’s explore adding it as an additional header then.

I think you’re absolutely right, I mean, it doesn’t get more “parent-child relationship” than this. And you’re also right that it’s not clear where that relationship would be defined. We don’t have this kind of relationship building/definition in measures, and even if we were to add it, it becomes a little clunky because it needs to be stored in measure sets with values that are measures, rather than categorical… it’s a bit of a mess, really.

I also want to add - for @dmanuel and @jeandavidt - that upon further reflection, the suggested wide names I proposed in our chat and wrote in the post above are not actually a valid structure: wat_sa_NR_hfMe_unitless_me_NA_sampleSize_value and wat_sa_NR_hfMe_unitless_me_NA_posSmple_value, for example. Each adds a second measure near the end, and that’s not allowed in the current structural definition.

I think, however, that we could tweak the structure for measures in the same set? Though when I think about it more, sample size is really a protocol item, and percent positive is actually in the units, I think…

so wat_sa_NR_hfMe_unitless_me_NA_percPos_value → wat_sa_NR_hfMe_percPos_me_NA_value

and I think wat_sa_NR_hfMe_unitless_me_NA_sampleSize_value → ps_mes_sampleSize_unitless_sin_value

If that makes more sense?
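
One way to see why the earlier names are invalid is to split them against the eight-part measure formula (compartment_specimen_fraction_measure_unit_aggregation_index_attribute). This is a rough Python sketch with a hypothetical function name; it assumes no individual part code contains an underscore, and it applies only to measure wide names, not to names such as ps_mes_sampleSize_unitless_sin_value or the cl_ calculation names.

```python
FIELDS = (
    "compartment", "specimen", "fraction", "measure",
    "unit", "aggregation", "index", "attribute",
)


def parse_measure_wide_name(name):
    """Split a measure wide name into its eight named parts."""
    parts = name.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} parts, got {len(parts)}: {name}"
        )
    return dict(zip(FIELDS, parts))


# The revised name fits the formula, with percPos in the unit position.
revised = parse_measure_wide_name("wat_sa_NR_hfMe_percPos_me_NA_value")

# The earlier name has nine parts (a second measure before the attribute),
# so it is rejected.
try:
    parse_measure_wide_name("wat_sa_NR_hfMe_unitless_me_NA_sampleSize_value")
    earlier_is_valid = True
except ValueError:
    earlier_is_valid = False
```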

Could the variants conundrum be solved if we add a “parMeasure” column (attribute) to the measures table? Then if we store “variant” in the class column, we could store the parent variant in the parMeasure column. Not sure how deep we can go with this though…
Anyhow, if relevant, from what I’ve seen from EU data we get only one level deep and the value is usually a percentage of the variant in the total viral load.
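
A minimal sketch of what the parMeasure idea might look like, assuming a single level of nesting as described above. This is Python with placeholder variant names; parMeasure is the column proposed in this post, not an existing ODM header, and the rows are invented for illustration.

```python
from typing import Optional

# Hypothetical rows from a measures table with the proposed parMeasure column.
MEASURES = [
    {"measure": "variantA", "class": "variant", "parMeasure": None},
    {"measure": "variantA1", "class": "variant", "parMeasure": "variantA"},
]


def parent_variant(measure: str, rows=MEASURES) -> Optional[str]:
    """Return the parent of a variant measure, or None if it has no parent."""
    for row in rows:
        if row["measure"] == measure and row["class"] == "variant":
            return row["parMeasure"]
    raise KeyError(measure)
```

Because the lookup is a plain column on the same row, it stays one level deep; deeper lineages would need repeated lookups or a different structure.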

To close this off:

  • measure_estimated: maps to the new valTreat header in measures, and will be treated in wide names like the attributes qualityFlag, value, and purpose. The wide-name formula for a measure or measure-value is compartment_specimen_fraction_measure_unit_aggregation_index_attribute, where the attribute at the end is always either value, purpose, qualityFlag, or (now) valTreat.
  • measure_value_alpha: mapping value and value_alpha to the same field.
  • measure_value_normalization_method: Using the new “pre-populated and multiple attribute” wide name structure, creates the wide name: cl_2_AND_calcType__standardization__standard.
  • measures_name: maps to measure in the measures table.
  • site_code : resolved to map site_code to a parent siteID, but to not create a new sites table entry for the parent (@Sorin - let me know if you hate this idea).
  • Variant_code: maps to measure in the measures table (or “measure_name” in the current EUxODM structure)
  • Variant_name: This maps to group in the measures table.
  • ECDC_category: see this topic for the continuation of this discussion: New Table Proposal: Public Health Actions Table
  • WHO_category: see “ECDC_category” above.
  • Is_subvariant_of: To be continued in a separate discussion, if needed, but currently the suggestion is to drop this field and not record it in ODM.
  • Sample_size: a unit of “total # of samples”, i.e., wat_sa_NR_hfMe_unitless_me_NA_sampleNum_value.
  • Variant_cases: a unit of “# of positive samples”, i.e., wat_sa_NR_hfMe_unitless_me_NA_posSmple_value.

This covers all points raised. I’ll leave the topic open a little longer in case folks have other comments or additions. To @Sorin’s point about parent-child relationships for measures, I agree that it makes sense and could work elegantly. But because we already have measure sets, the additional relationship options would create some confusion. Something like sub variants, too, is really not like most measures because it doesn’t change. So it’s a piece of metadata about a part (a measurement), and should (if logic is to be consistent) be stored in the parts table. Short of that, it makes more sense to point to where it is recorded elsewhere, so maybe the lineage being stored in the sequencing info linked in the accessions table might be enough?