Governments around the world are making data more accessible and more usable. By doing so, they are helping to make government more open, transparent and accountable. With open data, community members, local organizations, businesses and journalists can leverage public information to support new businesses, inform public policy, make fact-based decisions, and encourage greater civic engagement.
In California, a number of state agencies are beginning to implement open data programs. Releasing data isn’t as simple as just flipping a switch, however. These organizations need to plan for privacy, data quality, governance structures and a host of other factors. The California Health and Human Services Agency and the State Controller’s Office have completed large-scale open data projects that have considered these issues. This publication reports on their successes, challenges and outcomes. Using shared lessons from these projects and others, the paper also summarizes current state policy on open data, including procurement policies; highlights various privacy laws and guidelines; and provides sample organizational structures for open data teams.
Open data requires more than just transparency. The general consensus is that—at minimum—it must be published without constraints on further public use. This means that open data is:
Agencies can do more, however, to make their data as inclusive as possible. Open data is often published under the assumption that the users are mostly outside experts, such as private businesses, researchers, or members of the civic tech community. But this can make it difficult for non-technical users to leverage the benefits of open data. Agencies can make their data more accessible and inclusive by:
Open data does not have a one-size-fits-all definition. As agencies consider publishing open data, it is important to adopt a definition that meets their needs and those of their audiences.
The California Health and Human Services Agency oversees 12 departments and three offices, employs nearly 34,000 staff and administers a budget of $140 billion. In 2014, when the Agency embarked on a unified process to publicly release its non-confidential data, it had a complex task before it. Successfully launching a portal for its public data required executive leadership, a shared vision across diverse programs, commitment by personnel and public-private partnerships. Almost two years later, the project has succeeded, allowing health advocates, civic technologists, journalists and anyone else to access and use data more easily than ever before. As a result of its success, the Agency plans to continue the open data program as part of a larger effort to better use information and data inside and outside of government.
This story begins inside and outside of government. Inside the California Health and Human Services Agency, both people and processes converged:
People: Two departments, the California Department of Public Health and the Office of Statewide Health Planning and Development, are data-rich organizations that already published data in a number of different formats. Key staff in those and other departments, along with the Agency’s Chief Information Officer, had an interest in data sharing across departments and making data more available and transparent.
Process: In 2010 the Agency received a federal Health Information Exchange grant as part of the American Recovery and Reinvestment Act. While the grant mainly focused on data sharing among healthcare providers, it also allowed the Agency to focus on internal data sharing to improve services its departments provide to the public. Its success in creating a health information exchange plan supported a second award in 2012, which it used to examine opportunities for health and human services data sharing and interoperability. This new grant resulted in two outcomes:
Outside of government, one foundation played a critical role in bringing this project to fruition:
As Agency and Foundation staff began to discuss a partnership, they considered a number of possibilities: publishing existing data in more accessible forms, publishing new non-confidential data, securely sharing confidential data across departments and ensuring data from different departments was interoperable. They chose to focus on one cornerstone of this larger effort: an initiative to make existing public data more available – to make it open. They believed starting here would create examples underscoring how open data could inform decision making and drive better health outcomes in local communities. They also believed an agency-wide open data project would be an ideal opportunity to establish data governance, and that the emerging governance structure would in turn bolster the project’s implementation. While other government open data projects have focused on reducing requests for public records, interviews with senior leaders at the Health and Human Services Agency make clear that they saw making more information available to the public as part of “good government” and as a catalyst for more strategic use of data internally, helping departments make better decisions and better serve the public. While open data may ultimately decrease costs through fewer Public Records Act requests, Agency leaders view that as a welcome by-product rather than the reason for doing it.
As in any organization, senior leadership needed to sign off before an Agency-wide initiative could begin. For the Health and Human Services Agency, that meant the Undersecretary for Budget and Administration needed to agree that this project was worth the staff time and that it would produce useful results. To prepare for the project – and the pitch – the Agency’s Chief Information Officer convened an informal group of data management leaders from several departments. This group strategized about how to gain executive buy-in for the project. Concurrently, the Foundation also met with senior leadership in the Agency. Despite this planning, the Undersecretary was not initially convinced of the project’s merits. There was a language gap between the project team and the Undersecretary. The departments and the Foundation talked about “open data” and “hacking” – both terms that sound risky to an Agency safeguarding the private health information of millions of people. After some months of discussion, the Undersecretary and the department staff found common language in a framework originally developed by New York State’s Health Department. In New York, the Department had classified data into three tiers: Tier 1 – clearly public data, often already published; Tier 2 – data that might be able to be made public with some review or cleaning; and Tier 3 – private and confidential data that the Department will never make public. The project team made it clear they were only concerned with Tier 1 data in the beginning and Tier 2 data in later stages. The Agency would never publicly release Tier 3 data. The adoption of this rubric answered the Undersecretary’s major reservations, transforming his skepticism into active support.
With executive sponsorship in place, the project began in earnest. The California HealthCare Foundation provided a $250,000 grant to cover subscription costs for a service to host the data and provide data visualizations, charts, graphs and other tools that the public could use to understand the data. This allowed the Agency team, composed of staff from different departments and the Office of the Secretary, to focus on the process rather than the technology. The team could have launched the project by simply moving datasets that were already available in Excel or PDF into the portal. It did not, because the ultimate goal was to put in place a structure accommodating both open data publishing and secure data sharing between departments. To accomplish this, the team instead transformed the data into machine-readable formats and created a robust process that included data review by the governance team.
The leadership team did not have to start from scratch. Open data initiatives are expanding across the country. In New York, the state’s health agency created a handbook for its initiative. The California team built on this handbook, customizing it for its own needs. Such cross-collaboration is common throughout the open data movement. An organization starting a new initiative can likely borrow from and build on what another city or state has already done.
Once procedures were in place, the team was ready to begin publishing datasets. One department needed to pilot the project; its datasets would test the process and the governance structure. The California Department of Public Health, whose staff had served on the leadership team and had invested significant time in drafting the handbook, volunteered to pilot the project. This made sense – the department was data-rich, which had fostered a department culture that valued the project. The department also had the knowledge and abilities to react nimbly when inevitable issues arose during implementation. The Department of Public Health is large – approximately 3,500 staff – which allowed it to respond with the skilled resources required to clean and reformat the data. Finally, choosing Public Health was an important way to acknowledge the resources the department had already invested in the project.
It took the Department of Public Health four months to identify, organize and clean the initial datasets. The department published the data on a commercial platform paid for through the grant from the California HealthCare Foundation. The reaction from the public was overwhelmingly positive, and journalists and civic groups alike began to use the data.
After the successful pilot, the Undersecretary chose the Office of Statewide Health Planning and Development to head up the next phase. Like Public Health, this office is a data-rich organization with significant capacity to use data. Also like Public Health, their team had been involved in the project from its inception. At that point, the Agency decided that it did not make sense for each department to publish its own portal. Members of the public would have difficulty finding data if it were dispersed across 12 distinct data portals. It made more sense to create one Agency-wide site that would host data from all departments. The Office of Statewide Health Planning and Development then became the first department on the California Health and Human Services Agency’s open data site.
Once Public Health and the Office of Statewide Health Planning and Development published their initial data, the Agency needed to find a way to bring departments with less experience publishing data into the project. The solution was to leverage the Agency governance structure to organize data managers from each department and designate one department, Statewide Health Planning and Development, to serve the Agency with project management, technical administration and training. As other departments came on board, the project management team sequenced the order of publications, provided training to each department’s project team and offered technical assistance whenever a department needed additional help. They also created a working group, under the Agency governance structure, that each department joined as it came on board. This peer-learning group facilitated sharing among departments as they moved through the phases of data publication: identifying data, making data ready for publication, publishing data and responding to feedback from stakeholders.
However, even with the lessons from the pilot, the departmental implementation still faced challenges. For example, the Agency had to find a way to streamline the review process. The initial process was intensive: the project’s Executive Sponsor, the Undersecretary, signed off on all data elements. Once the Agency was confident that the process worked and appropriately reduced risk, the Secretary’s office began to step back from its day-to-day role and delegate. The Undersecretary later delegated final data sign-off to the Agency data governance committee, under the Agency CIO.
Once a department published data, public information officers heard from members of the public and stakeholder groups, particularly journalists, early and often. That left a question – how could the open data project connect with communities that knew nothing about the Agency or the data? How could it more formally connect with developer communities that might want to use the data? The California HealthCare Foundation had an idea to help bridge the availability of open data and the use of that data. It created a Health Ambassadors program, initially enlisting the help of local civic technologists in three cities: Fresno, Los Angeles and Sacramento. In partnership with community stakeholders, these ambassadors have been creating online tools using data that the California Health and Human Services Agency (CHHS) has made available through the open data project. The Ambassadors have also worked with local Code for America brigades to coordinate health data “code-a-thons” in several California communities. The code-a-thons are events in which application developers are challenged to make use of CHHS data in new and innovative ways. These code-a-thons build relationships between CHHS and local stakeholders and have resulted in the development of several applications using CHHS data.
Additionally, working with the California HealthCare Foundation and other partners, the Agency now conducts a yearly “Open Data Fest”. This convening of stakeholders, practitioners, thought leaders and government staff serves to further the dialogue about the value and benefit of open data publishing. The event is strategic rather than nuts-and-bolts, providing important space for interdisciplinary, public-private-nonprofit collaboration around the future of health data and open engagement.
During the last two years, the California Health and Human Services Agency has reimagined how it makes data publicly available. Accomplishing this task required executive support, strong governance, and – of course – staff time. Fortunately, they were able to take advantage of a well-timed outside grant, an emerging governance structure, and a strong level of trust across departments, which allowed them to overcome any obstacles in their path.
Executive Support: Staff across the Agency indicated that executive support was critical to making open data an important project for all departments. Executive backing enabled the project leadership team to bring all decision makers to the table and empowered staff to find solutions to obstacles.
Governance Structure: Before the Agency identified a single dataset to publish on its portal, it created a governance structure for the project. Subject-matter experts, attorneys and public information officers sat at the same table from the beginning. The leadership team developed policies to identify, review and publish the data, giving comfort at all levels of the organization that it could safely publish this information in accessible formats. While all Agency departments are dedicated to providing access to essential health and human services, they each have different missions, operations, resource allocations and stakeholders. Building the project team with a cross-section of department leadership resulted in program policies reflecting Agency-wide opportunities and constraints, rather than just those of a single department.
Staff Time: While the data that the departments published were already public in some form, each department now had to organize the information to meet common standards. Making data machine-readable often requires reorganization and changes to existing data management processes. For example, if a department had published a single Excel file with six different tabs, the department now needed to reorganize each of those tabs into separate files. Data that had been published in a PDF format needed to be converted into a machine-readable format such as Comma Separated Values (CSV). In other cases, the way a department had organized its data made it unreadable by machine. Importantly, since the data was now widely available on a single website, it was critical that the departments provide metadata – documentation describing how to use each dataset and noting any limitations of the data. Staff from different departments that participated in the leadership team invested significant time creating the structure for this program to work. Without their commitment to implementing this project – work that was often in addition to their regular job duties – the Agency would not have an open data program today.
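To make this preparation work concrete, the sketch below shows one way a department might split a multi-tab Excel workbook into separate machine-readable CSV files and record basic metadata alongside each one. It is illustrative only: the file names, sheet names and metadata fields are hypothetical placeholders rather than Agency conventions, and it assumes the Python pandas library is available.

```python
# Illustrative sketch: split a multi-tab Excel workbook into separate CSV files
# and write a simple metadata record for each one. File names, sheet names and
# metadata fields are hypothetical placeholders, not Agency conventions.
import json
import pandas as pd

WORKBOOK = "department_report.xlsx"   # hypothetical source workbook

# sheet_name=None loads every tab as its own DataFrame
sheets = pd.read_excel(WORKBOOK, sheet_name=None)

for sheet_name, df in sheets.items():
    csv_name = f"{sheet_name.lower().replace(' ', '_')}.csv"
    df.to_csv(csv_name, index=False)          # one machine-readable file per tab

    metadata = {
        "title": sheet_name,
        "source_file": WORKBOOK,
        "columns": list(df.columns),
        "row_count": int(len(df)),
        "limitations": "TODO: describe granularity, coverage and known gaps",
    }
    with open(csv_name.replace(".csv", "_metadata.json"), "w") as fh:
        json.dump(metadata, fh, indent=2)      # documentation published alongside the data
```

Even a lightweight routine like this illustrates why staff time matters: each tab becomes a standalone dataset, and each dataset needs its own documentation before it can be published.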
California HealthCare Foundation Grant: Many local, state and federal government open data projects have succeeded without outside grant funding. The Agency could have launched an open data portal using internal resources; however, the grant was still invaluable, particularly as it allowed them to move quickly. Rather than carving out money from an existing and constrained budget, this grant allowed the Agency to put their staff time towards publishing rather than procurement, which in turn facilitated a focus on process rather than technology. Knowing that the grant funding was not sustainable, the Secretary’s Office also made the project sign-off contingent on departments agreeing to use existing funding to support the project at the end of the pilot phase.
Existing Cross-Department Dialogue and Governance Process: As a result of the 2010 and 2012 federal grants, the Agency identified how data was used internally and began having conversations about how it could be used more effectively. This opened a dialogue across departments for the open data activities that began in 2013. These earlier grant-funded projects informed the Agency’s vision for a cross-department governance structure supporting the open data project.
Lessons and Products from Existing Programs: The California Health and Human Services Agency drew upon the experiences from other cities, counties and states. In particular, it built on the Open Data Handbook authored by New York’s Department of Health. This made a heavy lift more manageable and also provided a model the Agency could cite to show that the process worked.
Trust: Mutual trust among partners and reassurance from executive levels accelerated the project’s timeline. From previous joint projects, the Agency and Foundation already knew they could work together. Department staff also knew that the Agency would help them meet their goals and that they all shared accountability for the project’s success.
The California Health and Human Services Agency’s project officially began in April of 2014. By March 2016 all 12 departments had published data to the portal. Both the leadership team and department data stewards had to overcome obstacles in order to meet this compact timeline. Staff and management identified three main challenges: organizational structure, staff engagement and data limitations.
Organizational Structure: While the 2010 and 2012 federal grants laid the groundwork for an agency-wide data project, open data was still new to many departments. Historically, each department had acted fairly independently of the others. This project drove a cultural change requiring departments to adopt new norms for cross-department collaboration. Given the project’s large scale and the Agency’s concerns about protecting sensitive data, it is unsurprising that during the pilot phase the Agency adopted an intensive review process that ended with the Undersecretary approving individual data records and data dictionaries. However, once the Secretary’s office was comfortable with how the process was working, it delegated its authority to the Agency CIO and governance committee, shortening the publication workflow.
Consolidating department data on a single Agency portal rather than maintaining 12 separate department portals was a deliberate decision with practical benefits. A unified portal required fewer resources – resources some departments did not have – and helped streamline what could have been a more complicated publishing process. To plan, implement and maintain a unified portal, the Agency created a governance model that brought together departments and staff that had otherwise worked separately. This process has served as a catalyst toward a more collaborative Agency culture.
Staff Engagement: The project team knew that even with sponsorship and support from the Secretary’s Office, each department needed to believe in the project’s goals and merits. To promote this buy-in, the project management team in the Office of Statewide Health Planning and Development held an initial meeting with each department’s leadership team. During this meeting, the team discussed the project’s history and why data – and open data in particular – is an important function of the Agency. This helped inform each department’s thinking about why open data is important, even if the department had typically published data in other, more restricted ways. The project management team also provided significant assistance and support to staff in departments across the Agency. Whether answering technical questions, providing training, or giving people needed encouragement to finish the project, the project management team was critical to the Agency-wide deployment of open data.
Data Limitations: Each dataset has its limitations. For example, a dataset’s granularity might make it useful for analysis at the county level but not at the city level. Similarly, some existing datasets are easier to convert into a machine-readable format than others. A department that regularly publishes data in CSV format (which looks similar to Excel except without tabs and formatting) and has already written a robust description of the dataset may be ready to publish this data with little additional work. However, if the data are organized in a way a computer cannot read, then the department still needs to reformat the information.
The Health and Human Services Agency expected and received significant feedback from stakeholders. Some wanted more recent or granular data than previously published. Others found errors and inconsistencies in the data. The Agency found that the most effective way to address these issues was to implement a process by which it could take and act on feedback from the public. It also started small, publishing a few datasets from each department rather than all datasets from all departments. In effect, each department piloted its own datasets, received feedback, and, if required, modified how it published either its data or the supporting information.
While the pilot program has only focused on publishing data that was already publicly available, stakeholders continue to request that the Agency publish additional data. Where the information is clearly public, the Agency will have an easy decision and will only need to prioritize which data to publish first. Other datasets require more analysis. Could someone re-identify personal information? Is there a sensitivity that does not exist in clearly innocuous datasets like “Most Popular Baby Names”? To answer these questions, an Agency team is currently developing a common set of data de-identification procedures for all departments to use.
As of April 2016, over 80,000 users had visited the portal, which has averaged over 7,000 unique visitors per month since May 2015. The open data project is also popular among journalists and civic technologists. During the 2014 measles outbreak, for example, journalists at the New York Times and Los Angeles Times used immunization data from the portal to report stories. Civic technologists, often volunteers who contribute their time and skills to making data more available to the public, have created several tools with the data (see WICit and AsthmaStoryCA.org, for example).
The Agency team did not launch the portal believing it would decrease Public Records Act requests, but some staff indicated that they have seen a drop in requests. Others said that now it was simply easier to respond to requests because some of the needed data was on the portal. In other cases, staff said they did not see a direct connection because many Public Record Act requests are for specialized datasets, including confidential information, that are unavailable on the portal.
Internally, the project has helped foster a culture of data sharing and interoperability across departments. Sharing data among departments can be difficult because the same entity or action may be coded differently by each department. Three departments collect data on hospitals, for instance, but each refers to the same facilities differently in its data, making comparisons across datasets difficult. In response, a crosswalk was created so users – both inside and outside government – could analyze hospital facility data across different datasets. By understanding their data assets, departments are better supporting their missions. The open data project has created opportunities for departments to share data with each other and the public. This has resulted in internal and external collaboration and innovation. While the program is still young, initial results suggest that the effort by Agency staff to make their data more accessible has benefited internal and external users, and most importantly, the people of California who are served by the California Health and Human Services Agency.
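The crosswalk approach can be illustrated with a small, hypothetical example. In the sketch below, two departments code the same hospitals differently, and a shared crosswalk table maps those codes to a single facility identifier so the datasets can be joined. The dataset and column names are invented for illustration and do not reflect actual CHHS data structures; the example assumes the Python pandas library.

```python
# Illustrative sketch of a facility crosswalk: each department codes the same
# hospital differently, and the crosswalk maps those codes to one shared ID.
# All dataset and column names here are hypothetical, not actual CHHS fields.
import pandas as pd

# Crosswalk maintained once, usable by everyone
crosswalk = pd.DataFrame({
    "facility_id":        ["FAC-001", "FAC-002"],
    "public_health_code": ["PH-17", "PH-23"],
    "oshpd_code":         ["106010735", "106010739"],
})

# Two departments' datasets, each using its own identifier
infections = pd.DataFrame({
    "public_health_code": ["PH-17", "PH-23"],
    "infection_rate": [1.2, 0.8],
})
utilization = pd.DataFrame({
    "oshpd_code": ["106010735", "106010739"],
    "licensed_beds": [350, 120],
})

# Join both datasets through the crosswalk to compare facilities side by side
combined = (
    infections.merge(crosswalk, on="public_health_code")
              .merge(utilization, on="oshpd_code")
              [["facility_id", "infection_rate", "licensed_beds"]]
)
print(combined)
```

Publishing the crosswalk itself as an open dataset lets outside users perform the same join without needing to know each department's internal coding scheme.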
In 2010, LA Times investigative reporters revealed that city officials in Bell, California were paying themselves exorbitant salaries compared with officials in other cities. The “Bell Scandal,” as it came to be known, raised immediate questions about whether that city was unique or whether the public just did not know about corruption taking place in other cities. The State Controller’s Office, which maintains fiscal control over $100 billion a year in receipts and disbursements of public funds, believed it could help answer that question by providing more transparency.
By the time the story broke, the Office had already initiated projects aimed at transparency in government. In 2002, it provided the Sacramento Bee, as well as other news agencies, with state employee salary information that the newspaper later made public. In 2009, the Controller’s Office made information available online about registered warrants, also known as IOUs, which the state had used to pay vendors during the recent budget shortfall. Institutional knowledge gained from these previous projects, combined with an executive-level commitment to transparency and a dedicated staff, enabled the Controller’s Office to launch this time-sensitive project using existing resources.
The weekend after the LA Times story ran, the executive team came together for a half-day conference call to determine the path forward. The Controller laid out a goal to make city and county salary, wage and benefit data public, and the team quickly began discussing how they would undertake this task. They reassigned a project manager from the information technology department to the project and created a 90-day project plan. The project required support from nearly every department, and the governance team reflected this. The team included the Chief of Staff to the Controller, Director of Communications, Chief Counsel, Chief of Administration, Chief Operating Officer, Chief Information Officer, Chief of Audits and data stewards from the Accounting and Reporting Units. The team was passionate about the project and the potential result – making government more transparent for all Californians. The project plan laid out clear roles and responsibilities for each member of the team, and each member also knew that this project was a critical priority for the Controller. This meant that they had the support they needed to realign internal priorities so that the Public Pay project, as it had been named, had the resources it needed to meet its goals.
Ninety days later the team launched the first version of publicpay.ca.gov. Over the following months and years, the team updated that first version with more data and more tools. Today, the website provides a comprehensive open dataset that is used by journalists, local governments, auditors, unions and the public. The Controller’s Office was able to turn this idea into a public tool in only 90 days because it had:
Executive leadership from the Controller;
A governance structure already in place as a result of previous transparency projects;
A commitment to using internal resources; and
An agile development and publication approach that focused on quick sprints for the initial launch and for each subsequent update over the next several years.
The Controller’s Office made a strategic choice to launch the first version of the site in only three months. They could have waited six months or a year, which would have resulted in a site with more features. Instead, they focused on an iterative approach, publishing a dataset that was downloadable and that users could manipulate. The first version was a basic, tabular form that users could search. It did not have APIs or all the visualizations that publicpay.ca.gov has now, over five years later, but the data was open to the public, and that was the critical need the Controller’s Office wanted to meet initially.
The benefit of this approach was immediately apparent. Journalists, the League of Cities, and the California Association of Counties provided timely feedback to the press office and the team was then able to integrate that into later updates. The public and other stakeholders also provided feedback via a button on the website. Over time, the Controller added employee salaries from special districts, California courts, California Community Colleges, California State Universities and the University of California. In response to feedback, technical staff also edited documentation to include less technical jargon.
As the team added more tools to the site, they designed them for use by the general public. Power users, they knew, could download the entire dataset and use their own tools. However, these stakeholders made up only a small part of the potential audience for the data. To meet the needs of other users, the team designed easy point-and-click analysis and visualization tools that provided more opportunities for everyone to use the data.
When the Controller’s Office launched the first Public Pay site, they intentionally designed a basic system. As they and stakeholders had new ideas to improve the project, they would regularly add functionality. The team assessed each request, and if they could easily integrate the change, they did. Larger changes, however, were folded into the set of design requirements the Controller’s Office was developing for a full redesign, which launched two years later.
In undertaking the redesign, the team wanted to integrate advanced analytic capability into the project. However, they also knew they did not have the capacity to do this in-house. To meet this need, they brought in a vendor using a competitive bid process. The team was able to find the right contractor for their needs because they had well-defined requirements resulting from the knowledge they gained designing the first version of the site and the benefit of two years of user feedback.
Later open data projects, like Track Prop 30 and By The Numbers, differed from Public Pay because the team had more time to plan how to implement the Controller’s vision. Having more time to plan resulted in different development scenarios: the team continued to develop in-house, but also hired vendors for services where it made sense based on their business needs.
The most significant hurdle the project team overcame was building the dataset. At the time, cities and counties did not submit employee salary information to the Controller’s Office. Within a very short period, the governance team had to determine which data to request and ensure that data they received would be consistent across municipal boundaries. Job titles, for example, often differ from one city to another. They had to account for people who might hold more than one position, a potential problem exposed by the Bell Scandal. The team also had to ensure they were not releasing personally identifiable information that would compromise the safety and welfare of law enforcement personnel and victims of crime. These challenges were exacerbated by a compressed timeframe that left little room for error.
The Controller’s Office had to be ambitious to meet its deadline for the Public Pay project. It was successful, in part, because the leadership team had already worked together on the project to document the state’s IOUs the year before. The experience the team gained through Public Pay and the IOU project helped them work more quickly and efficiently during subsequent open data projects like By The Numbers. Even though they had to build a new dataset for Public Pay, the team limited their risk by publishing information in phases. They first published data and provided basic tools, and then they enhanced their tools and brought in special district information, including school districts. Executive leadership from the Controller as well as his Chief of Staff was also a critical element in the project’s success. Realigning or finding resources is an easier task when the head of an organization believes in the project. Similarly, the staff were committed to the project. They were a group of “can do” people eager to launch successful projects that would provide financial transparency for the public and other stakeholders.
The team used an agile, collaborative approach to meet these challenges. They met every day at 4 p.m. and adhered to an aggressive project management schedule. They built out pieces of their project iteratively – from policy to database structure – and made sure that the entire governance team had an opportunity to quickly view and respond to each piece. Very few of the challenges had easy solutions, but the team had previously built trust by working together. This helped them solve problems collaboratively and quickly. For example, the Controller’s Office had to determine how it would collect the salary and benefit data from cities and counties. It believed it had the authority to compel this information, but many cities were initially resistant. They were concerned about employee privacy and the legitimacy of the Controller’s authority. The Controller’s Office was able to broker an acceptable solution for many local governments by requesting that cities and counties submit information from employees’ W-2 forms. To protect the privacy of law enforcement and victims of crime, the Controller requested position numbers without names. At the same time, municipalities that did not submit data in a timely fashion found themselves on a public list of non-compliant entities and facing potential fines from the Controller. Though many cities did not initially see the benefit of the data, they later found that public salary data was beneficial for them. For instance, instead of conducting costly salary studies before pay negotiations, city managers found they could use the data from the Public Pay site. This also created the opportunity for municipalities and unions to start discussions with the same neutral data in front of them.
The team used agile methodology to launch the Public Pay project, and this approach came to characterize the organization’s entire open data portfolio. The team learned from each project, using that information to inform later projects. The IOU project helped shape Public Pay, and Public Pay helped to create the protocols for the Track Prop 30 and By the Numbers projects. Each experience built on those before it, and a core team learned together what worked and what did not.
A new Controller was elected to the office in 2015, and while there has been turnover among team members, several of the original team remain. That team is now determining what next steps to take with the existing open data sites. It is also formalizing structures by writing governance documents. In government, it is standard for a new administration to review a project, its outcomes and the resources needed to continue it. Within California state government, the open data projects at the Controller’s Office are the first to span two different elected leaders. As state agencies and other constitutional offices look to implement open data projects, knowing that executive leadership may or will change in the coming years, they can also learn from the Controller’s Office staff who bridged two administrations.
Staff at other agencies may also look to the staff that left the Controller’s Office, many of whom moved with the former Controller once he was elected to the post of State Treasurer. At the Treasurer’s Office, a team composed of former Controller staff and existing Treasurer staff are now developing new open data projects. They are bringing the lessons learned from the previous projects into a new organization and using them to more quickly launch new open data programs, such as DebtWatch.
Many departments across state government already publish data. However, truly “open” data is different, both in how it is published and in the expectations that the public places on it. To understand these differences, agencies beginning open data projects can benefit by learning from the experiences of the agencies and departments that have gone before. Drawing on the experiences of two such early adopters – the California Health and Human Services Agency and the State Controller’s Office – provides a few key lessons:
To open your data, you have to know your data
Data can be messy
Strong, involved, executive sponsorship is critical
An inclusive governance team can develop comprehensive policies
Start from a strategic strong point
Create opportunities for learning, training and asking questions
Use events to build momentum
Learn from other organizations
To open your data, you have to know your data. A successful open data project is about publishing data the public can use. It is necessary to know what data you have, what data the public likely will find most useful, and how to publish that data so that privacy is protected while still maximizing its value to the public. For example, the Controller’s Public Pay project was responding to an immediate need, so the required data was easier to identify. The Health and Human Services’ Data Portal project used an internal data inventory along with the knowledge of departmental data stewards to identify high-value datasets. Both agencies also categorized their datasets into three levels of confidentiality – low-risk/public data, medium-risk/possibly publishable data, and high-risk/not publishable data.
Data can be messy. In long-running programs, data needs may have changed over time. New fields might have been added or existing fields may be coded differently for different years. Information stored in older database programs may be difficult to export. To prepare datasets for publication, the Health and Human Services Agency and the Controller’s Office addressed many such issues and documented details describing data collection, history and limitations. After release, they planned for and solicited public feedback to help them continue improving data quality and usefulness.
Strong, involved, executive sponsorship is critical. Neither of the programs profiled in this paper would have happened without clear, committed support by executive leaders. In many cases, implementing a robust open data program means altering existing business processes (e.g. how information is collected, catalogued and published) and shifting workloads in the short term. Making these changes without executive support is extremely difficult, especially if any skeptics of the program are in decision-making roles. The result of executive vision is twofold. First, it convincingly demonstrates to staff that this program is possible and important. Second, it results in clear processes that provide direction and support to staff implementing the program. Executive support requires the head of an organization to provide clear expectations for the quality of the data, and it also requires that executive to take responsibility for the data once it is published.
An inclusive governance team can develop comprehensive policies. Open data draws on many different areas of expertise. It is best to involve both those with decision authority as well as those with hands-on experience in managing and publishing data. The team should include the executive sponsor, the project manager, the chief information officer, legal affairs, department or unit heads, public affairs, subject matter experts, and data stewards. In addition to project oversight, the governance team will need to establish procedures for publishing data – including how to review data prior to publication. Because of the need for detailed review, the open data governance team will typically engage in more hands-on decision making at the beginning of the project. At the start, sign-off authority for dataset publication should be at the highest level needed to make the organizations comfortable. Once evaluation processes are operational, the governance team can transition to a more traditional role of setting project goals and providing oversight.
Start from a strategic strong point. Nothing builds momentum better than an early success. At the start, efforts should focus on a test case that maximizes the chances for a successful release. This means building a team that supports the goals of the open data project, and – as much as possible – has the necessary skills and experience already in place. It also means selecting the right pilot data to begin publishing. For example, Health and Human Services identified data across all 12 departments that had already been approved for public release. This allowed them to focus on the tasks of prepping data for an open environment and communicating how this data could be used. The Controller’s Office chose a different challenge – creating a new data set, built from data that the Controller had not collected up to that point. However, this risk was manageable because the Bell Scandal provided clear evidence of the need for such data, and provided the additional pressure the team needed to convince reluctant local governments to report the data.
Create opportunities for learning, training and asking questions. As the project rolls out, staff and stakeholders will have questions and concerns. Providing space for staff and stakeholders to learn, adjust to changes in regular business processes, receive training and have their questions taken seriously is important for positive cultural change. The Controller’s Office and Health and Human Services adopted multiple strategies to meet this need. Both organizations emphasized an agile approach. The Controller’s Office, for example, concentrated on a single new dataset that its experienced project team could work through in detail. As they published the data they “learned by doing,” iteratively improving their process and product. Because of its large size, Health and Human Services had to emphasize more formal training. As each department moved to the front of the queue, the project management team provided ongoing training and mentorship.
Use events to build momentum. Events play an important role in creating enthusiasm among staff and opening communication with internal and external stakeholders. Conferences, brown bags and road shows provide stakeholders across departments and agencies opportunities to meet and share their opinions about open data. They also act as mechanisms to communicate agency goals to members of the community, receive feedback about what works and what doesn’t, and build external partnerships.
Learn from other organizations. Other organizations inside and outside of state government have developed open data projects. For the most part, they are eager to share their successes and their challenges. They often have, and will share, governance documents, code, contract pricing and partnership arrangements. This is a movement built on collaboration and transparency. Organizations new to open data can leverage that culture to jump-start a new program.
When embarking on an open data project for the first time, agencies and departments often look to already successful efforts as planning guides. There is no “one size fits all” governance body because organization size and structure shape open data governance. For instance, California’s Health and Human Services Agency is a multi-tiered agency that created an agency-level executive governance team and also created department-level teams. While the agency team identified legal considerations and data to collect as well as established policy, standards and order of release, the department teams identified, collected, cleaned and otherwise prepared their data and documentation for agency approval. Later, the agency tasked the Office of Statewide Health Planning and Development with helping new departments release their data. Smaller agencies, such as the State Controller’s Office, often have a flatter structure and simplify governance into a single team responsible for open data planning and implementation.
Both Health and Human Services and the Controller’s Office included the roles below at each governance level, with some people acting in multiple roles. A governance team in a different agency or department might vary somewhat depending on mission and size.
Executive Sponsor: A new and public initiative like open data needs executive support. This means an executive sponsor who will align the project with agency objectives, assist in overcoming internal concerns and guide organizational culture change. An executive sponsor takes ultimate responsibility for the project, sets a vision for the team and shepherds the project to completion. Agency executives may eventually delegate final sign-off authority to a department director or designee.
Project Manager: A project manager develops a plan, monitors progress and follows through with goals by acting as the liaison between agency and department-level teams. If an organization does not have a project management office, this person could be the data publisher or someone with authority over the current data publication process. This might also be a subject matter expert with experience running projects.
Chief Information Officer: This executive-level technology expert advises about internal information operations and interoperability, technical capabilities and security.
Legal Counsel: A legal adviser helps open data by focusing on privacy, intellectual property rights and Public Records Act compliance and responsibilities.
Relevant Department, Program or Unit Directors: Unit and program directors may be more or less involved in a governance structure depending on agency or department size, staff expertise and when they are actively publishing data.
Public Affairs: Public information experts plan how best to reach the public, media and other stakeholders who will not only consume data but also provide feedback. They identify and create new audiences, track responses and establish new partnerships and collaborations.
Subject Matter Experts: These staff can answer specific questions about a dataset topic or policy issue. In smaller departments, the same people might attend meetings more regularly. But in larger departments, these specialists will likely cycle in and out as each program or unit with data moves forward.
Data Steward/Publisher: This expert writes the draft metadata and documentation and reconciles small cell sizes or any other dataset peculiarities. This person is usually easy to identify: they serve as the institutional memory of how and when the data have changed. In very limited cases this team member may also be the data publisher – the person responsible for signing off on a steward’s work.
No existing state law specifically defines “open data”; however, current California law does reference it in Government Code § 6253.10 (2016), which defines the published data that local agencies can call “open.” This law specifies that open data must be published, among other things, in a manner that is platform independent and machine readable:
Government Code § 6253.10.
If a local agency, except a school district, maintains an Internet Resource, including, but not limited to, an Internet Web site, Internet Web page, or Internet Web portal, which the local agency describes or titles as “open data,” and the local agency voluntarily posts a public record on that Internet Resource, the local agency shall post the public record in an open format that meets all of the following requirements:
(a) Retrievable, downloadable, indexable, and electronically searchable by commonly used Internet search applications.
(b) Platform independent and machine readable.
(c) Available to the public free of charge and without any restriction that would impede the reuse or redistribution of the public record.
(d) Retains the data definitions and structure present when the data was compiled, if applicable.
A number of existing policies and laws written for general publication of data by state agencies also apply to open data. Data privacy, for example, is governed by the California Privacy Act, Information Practices Act and a host of Federal laws and regulations. See the privacy section of this publication for more information. In addition, agencies and departments will need to consider how they build the technology for their open data projects.
From building a platform in-house to contracting with a vendor, California agencies and departments have used a variety of procurement strategies. Below are the three most common methods:
Use existing staff, hardware and software.
Features:
Contract with a small business reseller to buy, build and/or host a custom site. Agencies using this method typically purchase or subscribe to a ready-made solution with customization options.
Features:
One way to contract using this method is through the state’s Software Licensing Program. The Program allows Agencies to purchase software through existing contracts that have already been negotiated with authorized participating resellers.
Issue a competitive Request for Proposal (RFP):
Features:
Open data creates the opportunity to build relationships with innovators outside of government. For example, developers can create new applications during hackathons, intensive events spread over a few days or weeks where coders and designers develop projects leveraging publicly available data. If a state agency finds that a tool created during a hackathon might be something it wants to build on, it needs to determine what options it has for doing so. In many cases, the relationship has remained voluntary: members of a community group, such as a local Code for America brigade, continue to iterate on the initial application after the hackathon ends. A number of government agencies are developing streamlined ways to fully develop these applications and incorporate them internally.
As agencies evaluate their data assets for inclusion in open data projects, they must separate confidential data from data that can be shared. When it tackled this problem, the California Health and Human Services Agency adapted a three-tiered taxonomy used by New York State. Under this system, data falls into three broad classes: (I) data that qualifies as public in its current form; (II) data that could be made public, but would first need to be altered to protect privacy; and (III) data that is sensitive and confidential, and must remain so. Although agencies will spend most of their privacy-related effort on data falling into tier II, categorizing data assets into all three tiers requires balancing two core values: transparency and privacy. Doing so requires knowledge of the legal and policy requirements regarding privacy and confidentiality, as well as the current science of disclosure limitation.
Much of an agency’s data is public. The California Public Records Act (CPRA) establishes that the public has a fundamental right to access information concerning the conduct of public business. This has been described as a presumption of openness unless specifically exempted. Although not comprehensive, common Public Records Act exemptions include information that would–if published by an agency–constitute an invasion of personal privacy, impair contractual obligations or collective bargaining negotiations, harm law enforcement or judicial proceedings, expose trade secrets, or endanger public safety.
In particular, state agencies have a constitutional responsibility to protect privacy. Article 1, Section 1 of the California Constitution includes an inalienable right to privacy. There are also legislative, regulatory, and contractual mandates that require state government to maintain confidentiality and protect the privacy of Californians. Depending on the type of data held by an agency, the data may also be subject to Federal confidentiality rules.
The California Information Practices Act of 1977 (CIPA) expands the constitutional protection of privacy and places additional limits on what personal information state agencies may collect, how they manage it, and what they publish. This includes civil liability for an agency that fails to comply with the Act, and may include civil and criminal liability for state employees who disclose data they should have reasonably known was obtained from personal information.
The Act has limitations: it does not specifically address the difference between information that directly identifies an individual (e.g., name or social security number) and information that could be used to identify an individual when paired with other information (e.g., date and place of birth paired with public birth records). Likewise, it does not provide any specific guidance on processes to follow when clearing data for publication. More recent legislation typically includes such “Safe Harbor” provisions, so that agencies have a better understanding of how to protect themselves against liability from unintentional disclosure.
In addition to state requirements, federal law also governs state data disclosure. Two well-known examples, the Health Insurance Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA), outline both the types of data that are confidential and the de-identification methods that agencies can use to publish data without compromising the privacy of individuals. For instance, HIPAA provides two distinct ways to de-identify data prior to publication: “Expert Determination” (Section 164.514[b][1]) and the “Safe Harbor” method (Section 164.514[b][2]). Under expert determination, disclosure limitation specialists must determine that there is no more than a “very small” risk that someone could use the information to identify specific individuals. Under Safe Harbor, the agency removes 18 specific types of personal identifiers and certifies that the published information could not – alone or when combined with other data – be used to identify individuals.
Legislation can also include specific carve-outs for public disclosure; for example, FERPA allows the disclosure of certain personally identifiable information when published in a directory. Yearbooks and student directories are two common cases covered under this exception. Finally, FERPA and other privacy laws often place fewer restrictions on data provided voluntarily, such as in optional surveys.
Using methods such as Expert Determination and Safe Harbor, and by anonymizing, masking or otherwise de-identifying data, agencies can make more information publicly available without compromising the privacy rights of Californians. These techniques involve simple deletion (dropping fields with confidential information), masking (scrambling or removing part of the field to make the information no longer sensitive), or noise infusion (adding random variance to numeric data), for example.
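The sketch below illustrates these three techniques – deletion, masking and noise infusion – on a small, invented table. It is a simplified illustration, not an approved de-identification procedure; real publication decisions should follow an agency’s adopted standards and legal review. The example assumes the Python pandas and NumPy libraries, and the fields and values shown are hypothetical.

```python
# Simplified illustration of three disclosure-limitation techniques:
# deletion, masking and noise infusion. The table and parameters are invented;
# real de-identification must follow an agency's approved procedures.
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "name":       ["A. Lee", "B. Cruz", "C. Park"],
    "zip_code":   ["95814", "90012", "93701"],
    "birth_year": [1961, 1987, 1975],
    "charges":    [12450.0, 8300.0, 15675.0],
})

deidentified = records.copy()

# 1. Deletion: drop a field that directly identifies individuals
deidentified = deidentified.drop(columns=["name"])

# 2. Masking: keep only the first three ZIP digits, removing fine-grained geography
deidentified["zip_code"] = deidentified["zip_code"].str[:3] + "XX"

# 3. Noise infusion: add small random variance to numeric values
rng = np.random.default_rng(seed=0)
noise = rng.normal(loc=0, scale=250, size=len(deidentified))
deidentified["charges"] = (deidentified["charges"] + noise).round(2)

print(deidentified)
```

Each step trades some analytic precision for lower disclosure risk, which is the balance the following paragraphs describe.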
It is still possible to re-identify masked data by comparing published fields with other public datasets that include the same identifiers. For example, California voter data is considered public data, and includes names, addresses, and birth dates. Telephone directories, yearbooks and court records also contain information that could be used for re-identification. In addition, agencies that wish to minimize the risk of data re-identification must also consider newer sources of public information such as social media profiles and other digital records.
An entity attempting to re-identify data may work broadly, trying to re-identify substantial portions of a complete dataset, or may target a specific individual believed to be in the dataset. In the latter case, because a single individual is targeted, it is easier to gather additional reference data that can be used to re-identify the record. The targets of such attacks are often noteworthy or influential individuals, as when researchers re-identified Massachusetts Governor William Weld’s medical data in an insurance dataset using his gender, birthdate and zip code.
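The short sketch below shows, in schematic form, how such a linkage attack works: a table stripped of direct identifiers is joined to a public reference file on the quasi-identifiers the two share. The code and every name, date and column label in it are hypothetical.

```python
# Hypothetical sketch of a linkage (re-identification) attack.
# Both tables and all values are invented for illustration.
import pandas as pd

# A published table with direct identifiers removed, but with
# quasi-identifiers (gender, birth date, ZIP code) left in place.
medical = pd.DataFrame({
    "gender":     ["M", "F"],
    "birth_date": ["1950-06-15", "1962-03-14"],
    "zip_code":   ["95811", "95814"],
    "diagnosis":  ["hypertension", "asthma"],
})

# A public reference dataset (e.g., a voter file) that carries names
# alongside the same quasi-identifiers.
voters = pd.DataFrame({
    "name":       ["Resident A", "Resident B"],
    "gender":     ["M", "F"],
    "birth_date": ["1950-06-15", "1962-03-14"],
    "zip_code":   ["95811", "95814"],
})

# Joining on the shared fields re-attaches names to sensitive records.
reidentified = medical.merge(voters, on=["gender", "birth_date", "zip_code"])
print(reidentified[["name", "diagnosis"]])
```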
In all cases, however, disclosure limitation involves trade-offs. Aggregating observations and/or masking fields lowers the risk of disclosure, but both actions reduce the value of the published data for those analyzing it. Agencies must balance the public’s right to be informed about the conduct of public business against an individual’s right to privacy, and each department must determine the level of de-identification and granularity appropriate for its community.
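The sketch below illustrates this trade-off with invented case counts published at two levels of geographic detail, suppressing any cell that falls below an arbitrary threshold; the data, threshold and column names are assumptions used only to make the point concrete.

```python
# Hypothetical sketch of the granularity trade-off: the same counts
# released by ZIP code (more detail, higher risk) and by county
# (less detail, lower risk), with small cells suppressed.
import pandas as pd

cases = pd.DataFrame({
    "county":   ["Sacramento"] * 7,
    "zip_code": ["95811", "95811", "95811", "95811",
                 "95814", "95814", "95814"],
    "outcome":  ["recovered", "recovered", "recovered", "readmitted",
                 "recovered", "recovered", "readmitted"],
})

# Fine-grained release: counts by ZIP code and outcome.
by_zip = (cases.groupby(["zip_code", "outcome"])
               .size().reset_index(name="count"))

# Coarser release: counts by county and outcome.
by_county = (cases.groupby(["county", "outcome"])
                  .size().reset_index(name="count"))

# Small-cell suppression: withhold counts below the threshold, since very
# small cells are the easiest to match back to specific individuals.
THRESHOLD = 2
by_zip["count"] = by_zip["count"].mask(by_zip["count"] < THRESHOLD)

print(by_zip)      # more detail, but some cells must be suppressed
print(by_county)   # coarser, but no cells fall below the threshold here
```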
If an agency cannot make data publicly available, it may, as agencies do today, enter into data-sharing agreements with other government agencies or researchers to facilitate analysis.
Whatever methods of disclosure limitation an organization chooses, it is important to keep in mind that their effectiveness depends on the state of technology, social expectations about privacy, and the public datasets available for matching. These are all subject to change, meaning that any certification that a dataset is suitably de-identified is time-limited. It is, therefore, important to reassess the privacy of already-published datasets at regular intervals.
The field of disclosure limitation is likewise evolving. One area of rapid change in California and across the nation is the emergence of photography and video created by new equipment such as traffic cameras, drones and body cameras. All three have been put into use by various state agencies as they adopt new technology to fulfill their mission. The use of inexpensive but high quality photographic and video recording equipment by governments has resulted in a dramatic increase in the volume of such data held by agencies. Determining how to hold, manage and publish such data is an ongoing discussion within government and it is likely to take time before a settled solution emerges.
If agencies store sensitive data off-site with third-party vendors, it is important to consider the terms under which the data is shared.
In general, it is more secure to publish only de-identified data to third-party platforms; however, any time outside vendors have access to potentially confidential data, it is necessary to explicitly identify the privacy concerns and remedies and to regularly assess privacy risks. When negotiating contracts, provisions can be changed to account for the specific privacy needs reflected in the data. Many services do not readily allow for such negotiation, however, relying instead on standardized “click to agree” contracts. These are especially common with free or low-cost cloud services – often because the provider intends to monetize the data and/or traffic in some other way – and agencies should take care when using a service under such terms. The open data governance team should read and agree to such contracts; if it cannot sign off because of specific clauses, it can contact the company and negotiate the terms of service individually.
Determining whether the information in a dataset constitutes confidential material is not always obvious; it may require a balancing test that weighs the right to privacy against the public’s interest in access to the information. This makes it critical that the governance team include individuals with the expertise necessary to make such determinations, such as data scientists with knowledge of disclosure limitation, agency legal counsel, and/or Public Records Act (PRA) officers.
When releasing data to the public, many agencies decide to apply formal terms of service and licensing rules to their portals. In these cases, portals either require users to explicitly agree to the terms and rules or they rely on implied consent. The California Health and Human Services portal relies on implied consent. Anyone may view data on the site, but the terms of service require that any “reuse, publication, and/or distribution” of data include an attribution back to the department that published the data. The terms of service also outline the Agency’s acceptable use policy and include clauses on limitation of liability and indemnification. The State Controller’s Office, which often posts data from local governments, includes statements that the information is “posted as submitted by the reporting entity” and notes that it does not take responsibility for the accuracy of data provided by other entities. The Controller’s Office lists this information at the bottom of its open data site pages and does not ask for users’ consent prior to accessing data. Other agencies, such as the Board of Equalization and the Franchise Tax Board, require users to explicitly agree to their terms of service before viewing data on the portal. Each agency publishing data will need to decide how it will approach this issue. The Open Data Commons provides several licensing options that agencies may consider as a starting place.
Organization | Portal | Launched |
---|---|---|
California Health and Human Services | https://chhs.data.ca.gov | 8/2014 |
State Controller’s Office | | |
• Public Pay | http://www.publicpay.ca.gov | 10/2010 |
• Track Prop 30 | http://trackprop30.ca.gov | 4/2014 |
• By the Numbers | https://bythenumbers.sco.ca.gov | 9/2014 |
Visit http://opendata.ca.gov/SODP/index.html for a current list of State of California open data projects.
For more information about the State of California’s Open Data efforts, visit http://opendata.ca.gov/.
We were fortunate enough to turn this project into reality when staff at the following organizations allowed us to interview them and tell their stories. We are grateful for both the time and insights they shared with us.
California Health and Human Services Agency and Departments (Scott Christman, Victoria Daher, Ashley Defranco, Scott Fujimoto, Jim Greene, Christin Hemann, Merry Holliday-Hanson, Jon Kirkham, Eric Reslock, Linette Scott, Mike Valle and Mike Wilkening); State Controller’s Office (Todd Boltjes, Jeff Christ, Matt Harada, John Hill, Bill Helms, Hitomi Sekine, and Mary Margaret Wilson); State Treasurer’s Office (Laurye Brownfield, Wendy Craig, Fred Kessler, Jason Montiel, Jan Ross, Tim Schaefer, Kumar Sah, and Collin Wong-Martinusen); California Health Care Foundation (Andy Krackov and Brian Purchia); Code for Sacramento (Joel Riphagen and Ash Roughani).
We thank Scott Christman, Mike Valle, Stuart Drown, Angelica Quirarte and Scott Gregory for their invaluable support of this project.
The California Research Bureau authored this report as part of a California State Library project: Anne Neville, California Research Bureau Director; John Cornelison, Senior Research Librarian; Tonya D. Lindsey, Senior Researcher; and Patrick Rogers, Senior Researcher.
Like this design? Feel free to use it. Patrick Rogers and Anne Neville forked (i.e. copied) the original at https://github.com/CAHealthData/toolkit and made some minor design changes. (Isn’t open source great?)