@martinwellman has been working hard on building out the tool to automate transformation from wide to long data in ODM. For this process we’ve been using example data from the EU4S program, which use a ODM wide format.
The three problems we’ve come up against so far are:
- What to do with identical/repeated headers;
- Creating a standard rule for generating new rows with “AND” headers, and whether the same should be implemented for “OR” headers; and
- How to link related headers together when it’s ambiguous, or when there is a repeated header has a different meaning.
For 1, the example is the measure_value and measure_value_alpha headers in the EU4S data, which are both the wide name wat_sa_NR_hMr_hUn_hAg_NR_value. The existence on the EU4S side, is that measure_value is for numeric inputs, while measure_value_alpha is for string inputs. The result is that there should never be a value in both columns at the same time. So the rule is to take the value in whatever column has the value, and to generate an error if there’s values in both.
For 2, the “AND” and “OR” headers are for reporting two (or more) values at once in exception headers. As such, “AND” and “OR” headers will always require generating additional columns in the long data format. For the “AND” headers in the EU4S data, they also require additional rows, because the first. parts (in_2_AND_name) repeat. So despite what I thought earlier, @martinwellman , I think the rule for new rows might need to be based on this kind of repetition, and not just based on the “AND”.
For 3, the issue is for items like qr_qualityFlag, if there are multiple measures in the wide table, how do we know what the quality flag is for? Or other situations where wide columns are linked, but there is ambiguity about that linkage. For this, @martinwellman had been using a tagging system, and we think we will adopt the tagging system more broadly now. One issue was that in R, the only symbol that’s not “_” that can also be used in a name is “.”, so if we want to differentiate the tags with a different symbol we have to use “.”. The other issue is that when importing data in Python with pandas, rows with identical names will be appended with a “.#” where “#” is a whole number to indicate the repetition and avoid duplication. To avoid our tags getting confused with this kind of repetition, I suggested adding the tags as “.o#” or “.odm#” with an integer, so that the tags are consistent, useable in R, and don’t create confusion in Python.
Curious to hear the thoughts of others on this, @dmanuel - and Martin, let me know if I missed anything or got anything wrong.