When mapping from a source dataset to ODM, the source dataset often has a multivalued slot that we map to a single-valued ODM slot. For example, when mapping from PHA4GE to ODM v2/v3, we map “environmental_site” to “sampleShed”. “environmental_site” is multivalued, but “sampleShed” is single-valued.
An example case where someone might enter multiple values in “environmental_site” would be to specify a drainage pipe that is part of a hospital. A user would enter both “Hospital” and “Drainage Pipe” in the same cell in the data (e.g. separated by a comma). In this particular case we would probably map “Hospital” to the sampleShed and “Drainage Pipe” to the siteType, but there are probably cases where both values would be mapped to the same slot in ODM. This is just one example; there are probably other cases where a user might enter multiple values.
When both values are mapped to the same slot, we would get “hospital” and “drainage pipe” as an array in that ODM slot. The question is what we should do with these arrays.
I’ve written code that can deal with it in multiple ways, but I’m not sure which one is best. Here are the options:
- Select the first value in the array of values and drop the others (so we’ll lose some data)
- Select the last value in the array of values and drop the others (so we’ll lose some data)
- Expand the array into multiple output rows. Each row will have one of the values in the array, but all the other columns will be identical in the duplicated rows. The primary key for the duplicated rows would also have to be unique.
With the third option (expanding the array into multiple rows), a side effect is that if other output tables in ODM use the primary key of one of the duplicated rows as a foreign key, we would always use the primary key of the first row in the set; the second and later rows would never be referenced as a foreign key.
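To make the three options concrete, here is a minimal Python sketch. It is only an illustration, not the actual mapping code: the function name `resolve_multivalued`, the strategy names, the `siteID` key column, and the `_2`, `_3` key suffixes are all assumptions made for the example. In this sketch the first expanded row keeps the original primary key, which is why any foreign key in another table would only ever point at that first row.

```python
from typing import Any


def resolve_multivalued(
    row: dict[str, Any], slot: str, pk: str, strategy: str
) -> list[dict[str, Any]]:
    """Reduce a multivalued `slot` to a single value per output row.

    Hypothetical helper: returns a list of rows so that all three
    strategies share one return type.
    """
    values = row[slot]
    if not isinstance(values, list):
        # Already single-valued: nothing to resolve.
        return [dict(row)]
    if not values:
        # Empty array: emit one row with no value in the slot.
        return [{**row, slot: None}]

    if strategy == "first":
        # Option 1: keep the first value, drop the rest (data loss).
        return [{**row, slot: values[0]}]
    if strategy == "last":
        # Option 2: keep the last value, drop the rest (data loss).
        return [{**row, slot: values[-1]}]
    if strategy == "expand":
        # Option 3: one output row per value, other columns copied.
        expanded = []
        for i, value in enumerate(values):
            # Assumed key scheme: the first row keeps the original key,
            # later rows get a numeric suffix so keys stay unique.
            key = row[pk] if i == 0 else f"{row[pk]}_{i + 1}"
            expanded.append({**row, pk: key, slot: value})
        return expanded
    raise ValueError(f"unknown strategy: {strategy}")


row = {"siteID": "S1", "sampleShed": ["hospital", "drainage pipe"]}
for strategy in ("first", "last", "expand"):
    print(strategy, resolve_multivalued(row, "sampleShed", "siteID", strategy))
```

Whichever option we pick, the suffix scheme for the expanded keys would need to be agreed on so that it cannot collide with keys that already exist in the source data.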
Let me know how you think we should deal with this.
Martin