Workshopping Technical Solutions to Data Governance Problems
This blog post documents one of OEDP’s two Environmental Data Labs, held on September 13, 2023. OEDP hosted these labs as part of our broader data stewardship work and with the intention of refining our emerging model for Community Data Hubs (CDH). Whereas the other lab centered on legal challenges surrounding environmental data stewardship, this lab homed in on technical aspects of data governance, highlighting common problems and identifying potential solutions.
Participants were split into two groups, each of which considered a specific data scenario and its associated technical challenges. These scenarios were based loosely on real community data projects, but were amended by OEDP staff to draw out conversation regarding technical components of data stewardship. One group examined the case of a community science project tracking water quality around the Salton Sea; the other considered an effort to document stormwater flooding in suburban Chicago (click here for more information on each case study). Lab participants identified several technical resources that might aid in each scenario, while also outlining a range of social practices or non-technical governance procedures that might help. Below, this post recounts each conversation before considering the lessons learned as we design our CDH model.
Several key themes emerged across both conversations:
- Software can be a powerful resource for anonymizing or abstracting data so that it is safe to share broadly without losing the fundamental relationships or trends data stewards hope to communicate. Data stewards should carefully assess what level of specificity they need to discover or document a given trend, weighing “openness” and comprehensiveness against privacy concerns. Technical resources can be an asset in negotiating between data sharing and data privacy, whether through software that automatically anonymizes data entries or through more sophisticated systems like the “synthetic scenario map” that lab members propose below.
- Technical considerations are just one component of the total, deeply social, data stewardship process. Though lab members highlighted a variety of technical solutions to the problems data stewards encountered in each scenario, they also devised several social interventions (e.g., data sharing agreements) that could complement, or in some cases serve as a necessary prerequisite for, a given technical solution. Data stewards should not seek to neatly separate the social and technical problems they encounter in their work (nor to silo “technical people” and “community members”). Instead, lab members recommended a holistic approach that explores the full range of technical tools while carefully building an integrated social infrastructure and attending to questions of power.
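As a rough illustration of the first theme, the sketch below shows one simple way software might abstract an entry before it is shared: dropping identifying fields and snapping coordinates to a coarse grid. It is a minimal sketch rather than a method proposed in the lab; the field names, the default cell size, and the sample values are all assumptions made for the example.

```python
import math

def coarsen_entry(entry, cell_deg=0.01, drop_fields=("name", "address", "photo")):
    """Return a copy of `entry` that is safer to share.

    All field names here are hypothetical. `cell_deg` is the grid cell size in
    degrees; 0.01 degrees of latitude is roughly one kilometre.
    """
    # Drop fields that could directly identify a person or property.
    shared = {k: v for k, v in entry.items() if k not in drop_fields}
    # Snap coordinates to the south-west corner of their grid cell.
    shared["lat"] = round(math.floor(entry["lat"] / cell_deg) * cell_deg, 6)
    shared["lon"] = round(math.floor(entry["lon"] / cell_deg) * cell_deg, 6)
    return shared

# Invented sample entry, for illustration only.
sample = {"lat": 33.3127, "lon": -115.8349, "name": "A. Resident", "nitrate_mg_l": 4.2}
shared = coarsen_entry(sample)
```

How coarse the grid should be is exactly the kind of specificity-versus-privacy question that data stewards would need to settle together; the code only applies whatever level they choose.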
Water Quality in the Salton Sea
Our first data scenario centered on the Salton Sea, where community scientists partnered with a university to collect, monitor, and store water quality data sampled from the inland sea. The large body of water has been severely polluted by agricultural runoff and has lost about a third of its volume. The project hoped to build scientific evidence to substantiate locals’ experiences with the sea, in order to persuade policymakers to act and to build up data which could be shared with communities experiencing similar issues. However, the project risks several harms if the data is completely open. Real estate companies may devalue homes near polluted areas. Moreover, the agricultural sector, the economic engine of the region, may face new challenges or be forced to shutter its operations entirely if new regulations are passed.
Lab members considered the precise technical needs of this particular data project, weighing them against various known tools, the demands of data governance, and the complexities of community-based work. Their conversation quickly identified an intractable problem: on the one hand, community science initiatives like that in the Salton Sea often have fairly idiosyncratic technical needs in order to collect, store, and present their data, given the diversity of their data and their heterogeneous goals. On the other hand, custom tools are labor intensive to make and maintain, and community science projects are often strapped for time and resources. One participant who works on open software noted that community groups need not reinvent the wheel and can instead turn to existing open resources like the data management tool RAMADDA or OpenStreetMap.
However, that same lab member warned that while outsourcing software development can lower the demands on community scientists, doing so also opens their projects to some vulnerabilities. Much open source software is run by just a few volunteers, and can collapse if a large enough community of users does not exist to shepherd it if and when key volunteers need to step back. Moreover, while exploring a range of complex (real and hypothetical) software, other lab members also noted that, at times, simple technical resources can suffice. For instance, there are virtues to the shared spreadsheet, which allows people to make comments, add columns, or offer qualitative notes as the conditions and goals of data collection change.
Several lab members emphasized the importance of starting data projects on the right foot, and considered the role software can play in that. Good data requires standardized collection procedures, and simple software can guide community scientists through the process of collecting and cleaning data.
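To make this concrete, here is a minimal sketch of what such "simple software" could look like: a check that flags problems in a water quality sample at the moment it is entered. The field names and acceptable ranges are hypothetical placeholders, not values from the Salton Sea project; a real project would agree on its own.

```python
from datetime import datetime

# Hypothetical fields and ranges, chosen only for illustration.
RULES = {
    "ph": (0.0, 14.0),
    "temperature_c": (-5.0, 60.0),
    "salinity_ppt": (0.0, 400.0),
}

def validate_sample(sample):
    """Return a list of readable problems; an empty list means the sample passes."""
    problems = []
    for field, (low, high) in RULES.items():
        value = sample.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: not a number")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")
    # Require a machine-readable collection time so samples can be compared.
    try:
        datetime.fromisoformat(sample.get("collected_at", ""))
    except (TypeError, ValueError):
        problems.append("collected_at: expected an ISO date such as 2023-09-13T10:30")
    return problems

good = validate_sample({"ph": 7.9, "temperature_c": 24.5,
                        "salinity_ppt": 60.0, "collected_at": "2023-09-13T10:30"})
bad = validate_sample({"ph": 17, "temperature_c": 24.5, "collected_at": "yesterday"})
```

Returning plain-language messages, rather than rejecting an entry outright, keeps the tool in a guiding role: the person collecting the sample can correct it on the spot or add a note explaining an unusual reading.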
Another lab member invited us to reflect on and criticize the dichotomy between people with technical expertise and community members, a framing often deployed in these kinds of conversations about community science. Specifically, this lab member noted that breaking down this division can help to 1) foreground the inherent entanglement of social and technical processes within data stewardship and 2) legitimize community members’ local knowledge of place as itself a valuable form of technical expertise.
Indeed, throughout the conversation several lab members indicated that technical resources are most powerful when paired with certain data governance procedures or social practices. Lab members noted several times that technical choices were downstream from some process of discovery or consensus-making among community members and affiliated scientists. More often than not, technical issues arise because of shallow relationships among participants in a community science program, rather than bad choices about technology.
Stormwater Flooding in Chicago
The second scenario explored an effort by residents of Chicago’s South Side to document local stormwater flooding. Residents took photos of flooding, which were then stored on an app with a variety of associated metadata (e.g., locations and how standing water impacts people and infrastructure). The project sought to create a record of local flooding for the benefit of community members themselves, city planners working on stormwater, and other communities in suburban Chicago suffering from flooding. As in the Salton Sea scenario, however, residents worried about data sharing. Community members were collecting data by taking photos of flooding events, images that often include both the public right of way and private property. Local homeowners worried the data would affect their home values.
Lab members gravitated towards one creative technical solution: a “synthetic scenario” map. Rather than geolocating the flooding data onto a real map of the South Side, lab members suggested that stewards of this data project could abstract some key details from each entry and place them on an invented map. Using metadata for each entry (such as local levels of imperviousness, elevation, and proximity to other entries), residents could build a map that communicates real information regarding flooding incidents and the relationships between them, without revealing actual latitude and longitude coordinates. While lab members were not aware of examples where people had done this before, one member noted that this kind of data presentation bears considerable resemblance to various environmental models, including models used for planning purposes within marine conservation.
One lab member with experience working on community projects raised concerns about the labor and resources required to maintain this kind of synthetic map. Another suggested, though, that much of this work would come upfront, as data stewards decided what kind of abstracted or anonymized metadata to include in the synthetic map. Once clear abstraction procedures were set, a simple program could scrape the relevant data from residents’ entries, or even from their photos, and plug it into the synthetic scenario. This process could even be “gamified” to motivate people to submit data regularly.
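The lab did not specify how a synthetic scenario map would be built, so the following is only one possible reading, sketched under stated assumptions. It converts each entry’s real coordinates into distances from the group’s center, then applies a secret random rotation and mirroring, so that entries keep their distances from one another while latitude and longitude are never published. The field names and sample values are invented.

```python
import math
import random

def to_synthetic_map(entries, seed=None):
    """Place entries on an invented plane, preserving distances between them.

    Assumes hypothetical fields: lat, lon, elevation_m, imperviousness.
    """
    rng = random.Random(seed)
    lat0 = sum(e["lat"] for e in entries) / len(entries)
    lon0 = sum(e["lon"] for e in entries) / len(entries)
    m_per_deg = 111_320.0  # approximate metres per degree of latitude
    lon_scale = m_per_deg * math.cos(math.radians(lat0))
    angle = rng.uniform(0.0, 2.0 * math.pi)  # secret rotation
    mirror = rng.choice((1.0, -1.0))         # secret left-right flip
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    synthetic = []
    for e in entries:
        # Local offsets in metres from the centroid of all entries.
        x = (e["lon"] - lon0) * lon_scale * mirror
        y = (e["lat"] - lat0) * m_per_deg
        synthetic.append({
            "x_m": round(x * cos_a - y * sin_a, 1),
            "y_m": round(x * sin_a + y * cos_a, 1),
            "elevation_m": e["elevation_m"],
            "imperviousness": e["imperviousness"],
        })
    return synthetic

# Invented sample entries, for illustration only.
entries = [
    {"lat": 41.75, "lon": -87.60, "elevation_m": 181.0, "imperviousness": 0.72},
    {"lat": 41.76, "lon": -87.60, "elevation_m": 183.5, "imperviousness": 0.55},
]
synthetic = to_synthetic_map(entries, seed=7)
```

A rotation and flip alone would likely not be enough protection in practice, since someone familiar with the neighborhood might match the pattern of points back to real streets. Stewards would probably want to combine this with coarser positions or fewer published attributes, which is the upfront abstraction work described above.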
One lab member noted that if residents were to use this medium, they could also avoid sharing photos altogether, completely removing one of the major risk factors. Instead, data stewards could simply articulate an agreed-upon scale for ranking the severity of a particular flood based on an image. The synthetic scenario map could store this rating, which could then be freely shared, without the photos needing to circulate at all.
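A severity scale of this kind is straightforward to represent in software. The sketch below assumes a hypothetical four-step scale and hypothetical field names; the actual categories would be whatever residents agree on. The shareable record is built only from explicitly listed fields, so the photo and location cannot leak into it by accident.

```python
from enum import IntEnum

class FloodSeverity(IntEnum):
    # Hypothetical scale; residents would define their own steps and labels.
    NUISANCE = 1
    MINOR = 2
    MODERATE = 3
    SEVERE = 4

def shareable_record(entry, severity):
    """Build the record that circulates: an id, a date, and a rating only."""
    rating = FloodSeverity(severity)  # raises ValueError for ratings off the scale
    return {
        "entry_id": entry["entry_id"],
        "observed_on": entry["observed_on"],
        "severity": int(rating),
        "severity_label": rating.name.lower(),
    }

# Invented sample entry, for illustration only.
private_entry = {
    "entry_id": "e-001",
    "observed_on": "2023-07-02",
    "photo_path": "photos/e-001.jpg",
    "lat": 41.75,
    "lon": -87.60,
}
record = shareable_record(private_entry, 3)
```

Listing the shared fields one by one, instead of copying the entry and deleting the sensitive ones, means that any field added to the app later stays private until stewards decide otherwise.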
Lab members were keen to note, though, that any technical solution required clear data governance procedures. A data sharing agreement would be an important prerequisite for building any synthetic map, with residents determining precisely what information should be included on the eventual map and with whom it could be shared. Indeed, some lab members indicated that, depending on community goals, a good data sharing agreement might make a synthetic map unnecessary.
These labs inform OEDP’s broader data stewardship work, including our ongoing Community Data Hub model design process. As that work proceeds, various themes from this lab are worth revisiting:
- Good data governance, including sound technical choices, depends upon early and iterative conversations between the various parties involved in a given project. Many of the potential risks data projects face can be mitigated by articulating clear data sharing agreements prior to data collection. How can we incentivize these kinds of foundation-building conversations, even in cases where urgency may be a primary concern?
- Knowing that data stewardship is both a technical and social process, our CDH model should embrace a holistic approach, which explores the full range of technical tools available and builds an integrated social infrastructure, all while considering questions of power. Data governance procedures should create space to recognize the unique and worthy knowledge of every party involved in data stewardship, be they community members or visiting researchers.