New Wide Rules - AND vs OR, Tagging Related Attributes

@martinwellman has been working hard on building out the tool to automate transformation from wide to long data in ODM. For this process we’ve been using example data from the EU4S program, which use a ODM wide format.

The three problems we’ve come up against so far are:

  1. What to do with identical/repeated headers;
  2. Creating a standard rule for generating new rows with “AND” headers, and whether the same should be implemented for “OR” headers; and
  3. How to link related headers together when it’s ambiguous, or when there is a repeated header has a different meaning.

For 1, the example is the measure_value and measure_value_alpha headers in the EU4S data, which are both the wide name wat_sa_NR_hMr_hUn_hAg_NR_value. The existence on the EU4S side, is that measure_value is for numeric inputs, while measure_value_alpha is for string inputs. The result is that there should never be a value in both columns at the same time. So the rule is to take the value in whatever column has the value, and to generate an error if there’s values in both.

For 2, the “AND” and “OR” headers are for reporting two (or more) values at once in exception headers. As such, “AND” and “OR” headers will always require generating additional columns in the long data format. For the “AND” headers in the EU4S data, they also require additional rows, because the first. parts (in_2_AND_name) repeat. So despite what I thought earlier, @martinwellman , I think the rule for new rows might need to be based on this kind of repetition, and not just based on the “AND”.

For 3, the issue is for items like qr_qualityFlag, if there are multiple measures in the wide table, how do we know what the quality flag is for? Or other situations where wide columns are linked, but there is ambiguity about that linkage. For this, @martinwellman had been using a tagging system, and we think we will adopt the tagging system more broadly now. One issue was that in R, the only symbol that’s not “_” that can also be used in a name is “.”, so if we want to differentiate the tags with a different symbol we have to use “.”. The other issue is that when importing data in Python with pandas, rows with identical names will be appended with a “.#” where “#” is a whole number to indicate the repetition and avoid duplication. To avoid our tags getting confused with this kind of repetition, I suggested adding the tags as “.o#” or “.odm#” with an integer, so that the tags are consistent, useable in R, and don’t create confusion in Python.

Curious to hear the thoughts of others on this, @dmanuel - and Martin, let me know if I missed anything or got anything wrong.

I think in the sentence “As such, “AND” and “OR” headers will always require generating additional columns in the long data format” you meant “additional rows” instead of “additional columns”.

One advantage of having a required leading letter, such as the “o” in .o#, is that we can define additional flags in the future. For example, not only can we have groups (defined by flags that start with “o”), we can also add special linking tags. So “qr_qualityFlag.o1.lsampleID” and “qr_notes.o1” will go in the same row in the qualityReports table because they have the same group flag (o1), and the “lsampleID” could tell us to link the quality flag with the sampleID that is a member of the group o1 (as opposed to linking it with a measureRepID or a measureSetRepID). For readability, we may want to add an underscore, like “l_sampleID” instead of “lsampleID”. This is just an example, the main idea is that it gives us room to expand the flags to have other meanings in the future, if we decide we want to.

Martin

I did actually mean additional columns - in the sense that for attributes at least, the the wide attribute is just one column, but for “AND” and “OR” headers, there are additional columns to consider/add when building out the long-format data. For example, if the EU4S template only had a “in_2_AND_name_insType__sampler_” column, we wouldn’t need an additional row in the long data. But because we have “in_2_AND_name_insType__sampler_” and “in_2_AND_name_insType__flowmeter_” as columns, we do need an additional row. If that makes sense?

And thank you for adding on about the expandability of the tags! I had forgotten.

This generally looks good.

  1. Agree. For EU this approach works. I expect there will be situations where there are more collision challenges and we may need to keep both. In that case add an additional index at the very end of the wide name? long_name_1, long_name_2`.

  2. I am having a hard time following – I’d need an example – but is sounds like you’ve figured it out? :man_shrugging:

  3. I am nervous about the approach using . because this is method chaing in javascript, SAS, etc. and it is a use that R and Python guides have been depreciating. For both R and Python, there is a convesion to use . for managing variable name duplcation, but there are coding considerations in python.

For qualityFlag, and similar (common) issues where there is a lack of specificity in a wide table. I wonder about either putting the tag on all measures - potentially as a new non-specific quality flag that is designed for this purpose that says conveys something like 'no quality issues for the collection of measures and samples attribute that are broadly related to this measure".

If we are using tags, I suggest something like _o#. Is this possible.