To make policies that work for everyone, it is necessary to know something about everyone. The United States is a country of great size and diversity. Throughout its history, characteristics including sex, race, ethnicity, age, and geography have influenced the economic opportunities that are available to people.
One of the most salient measures of economic opportunity to individuals is income. How income is distributed, grows, and persists provides important insights into who has access to economic opportunity, where opportunities are flourishing, and where they are not.
The variations in how income is distributed and the disparities that at times result are as much a tapestry as the population itself. The experience of one group is not the experience of all groups. Yet narratives about inequality and income distributions in America are often too simplistic: Wealthy or poor. Have or have not.
Existing statistics on income distributions and inequalities may reinforce those narratives because they often group people into large categories, offering, say, the average income for all U.S. women compared with all U.S. men. Such aggregate measures have their use, but they don’t speak to how the experience of Black women compares with that of Hispanic women, or how economic outcomes in Idaho compare with those of Kentucky.
That deeper, richer understanding of income and earnings across America is the motivation behind a new dataset, which provides detailed income and earnings information based on the universe of individual tax forms filed between 1998 and 2019. This effort, the Income Distributions and Dynamics in America (IDDA) project, is the result of a research partnership between the Opportunity & Inclusive Growth Institute at the Minneapolis Fed and the U.S. Census Bureau.
The Institute is committed to making this data available for public download to catalyze new research and analysis on the economics of disparities across and within demographic and geographic groups that Americans identify with.1
“Economics perhaps underinvests in the collection and production of large datasets and making them broadly available,” said Illenin Kondo, senior research economist at the Institute and one of the principal investigators on the project. “To process all this data, to do so systematically for many groups, does not fit neatly into the traditional research contribution—it’s really an intermediate input. That makes it a particularly good Institute project because we have reduced the barriers to entry by investing the resources to produce a public good.”
The project website contains links to download the data, technical documentation, income charts and maps, and articles that cut into the data to analyze the many dimensions of income distributions and dynamics in America.
A universal, granular, accessible dataset
The statistics in IDDA are universal because they were constructed starting with all U.S. individual tax forms, rather than with a limited sample. They are granular because they incorporate extensive demographic information from Census Bureau records. And they are accessible because they can be downloaded by anyone from the Minneapolis Fed’s public website. These three characteristics (universality, granularity, and accessibility) distinguish the IDDA dataset from existing resources.
The earnings and income statistics in the IDDA dataset come from Form W-2s and Form 1040s filed with the Internal Revenue Service between 1998 and 2019. Importantly, the project began with the universe of W-2 and 1040 forms. During the data collation that followed, some records were dropped if they were missing certain information or if information appeared to be incorrect.2 What remains is the largest possible dataset containing both earnings and income data plus demographic data. This universality is particularly important when looking at statistics for groups that make up small parts of the population. Surveys of individuals or samples of income data often include too few respondents from such groups to accurately report the income distribution of the group.
For instance, one group that has not been well captured is the country’s highest earners. High earners tend to underreport their income on surveys—perhaps to seem more “average,” perhaps because they don’t recall all their income sources when they respond. In addition, many surveys provide only “top coded” data to the public, meaning that respondents who report income above a certain value will all be assigned that top value. By providing a more complete picture of these very high earners than was previously available, the IDDA data can be used to identify important patterns in inequality that require data from the very top earners to see.
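To make the effect of top coding concrete, here is a minimal sketch on made-up numbers. The incomes and the cap are hypothetical and are not drawn from IDDA or any survey; the point is only that replacing high values with a ceiling pulls the measured average down and hides how far the top extends.

```python
from statistics import mean

def top_code(incomes, cap):
    """Replace every income above `cap` with `cap`, as a top-coded public file would."""
    return [min(x, cap) for x in incomes]

# Hypothetical incomes, for illustration only (not IDDA values).
incomes = [30_000, 45_000, 60_000, 90_000, 150_000, 2_000_000]

# Hypothetical cap; real surveys set their own thresholds.
capped = top_code(incomes, cap=150_000)

true_mean = mean(incomes)      # uses the actual top income
capped_mean = mean(capped)     # what an analyst sees after top coding

print(f"mean with full data:  {true_mean:,.0f}")
print(f"mean after top code:  {capped_mean:,.0f}")
```

In this toy case the top-coded file keeps the same number of people but loses most of the income held by the highest earner, which is why statistics about the very top need data that are not capped.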
The next key characteristic of the IDDA dataset is granularity. Where data on top earners does exist, for instance, it does not include the rich demographic detail that is necessary to study characteristics of these top earners. Tax forms by themselves contain little information about the demographic characteristics of the filer. What’s more, workers may have more than one W-2, and 1040s are often filed for “tax units” (a head of household plus dependents, for instance). Thus, the first step in building the IDDA dataset was to parse IRS forms to obtain the incomes of individuals and households.
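As a rough illustration of why this parsing step is needed, the sketch below adds up wages for a worker who holds more than one W-2 in a year. The record layout, the person identifiers, and the function name are all hypothetical; the project's actual procedure is described in its technical documentation and is considerably more involved.

```python
from collections import defaultdict

def wages_by_person(w2_records):
    """Sum wages across all W-2 records that belong to the same person."""
    totals = defaultdict(int)
    for person_id, wages in w2_records:
        totals[person_id] += wages
    return dict(totals)

# Hypothetical records as (person_id, wages). Person "p1" held two jobs.
w2_records = [
    ("p1", 22_000),
    ("p1", 9_500),
    ("p2", 48_000),
]

person_wages = wages_by_person(w2_records)
print(person_wages)
```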
Next, the researchers combined income data with data on individuals’ demographic characteristics from the Census Bureau, including sex, date of birth, date of death, race and ethnicity, and whether the individual was born in or outside the United States. The Census Bureau also maintains a master address file for the purpose of conducting the decennial census and various surveys, which allowed the researchers to understand in detail where people live.
Bringing together this amount of demographic detail leads to the dataset’s granularity. With the IDDA data, it’s possible to look at the earnings distribution for American Indian or Alaska Native women in Montana, for instance, or the share of top earners in New Jersey who are Asian women.
“The IDDA data provide a high-resolution picture of income throughout the distribution, within and across groups, that you just can’t get without the combination of universal coverage and granular demographic and income detail that we have here,” said Kevin Rinz, an economist with the Census Bureau and a principal investigator on the project. “It really makes IDDA a unique source of income statistics.”
Of course, using IRS and Census Bureau records is not new, but accessing this data is (appropriately) difficult. Because data collected by these agencies contain confidential information, their records can only be accessed by researchers who have undergone extensive background checks, which can take many months, and who have completed training on data security. Even after the background check, researchers never see raw files containing information such as names, and data must generally be accessed at secure research data centers that do not allow phones or internet access. Researchers must be collaborating with a Census Bureau employee in order to access the data in this way, and there is also a fee. And once a researcher does gain access, the sheer size of the IRS data (billions of data points) means it can take weeks for advanced computers to run certain analyses. By contrast, the IDDA dataset makes some 6 million statistics available for download by anyone, anywhere.
To protect confidentiality, the IDDA dataset does not contain individual or household income and earnings data. Rather, the dataset reports average earnings and income statistics that are based on groups of a minimum size of similarly situated people (see box at end for more on preserving confidentiality).
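One common way to implement a minimum-size rule is to publish a group's average only when the group has enough members. The sketch below shows that idea on made-up data. The threshold, the group labels, and the function name are assumptions for illustration; the rule IDDA actually applies is described in the box and the technical documentation and may differ.

```python
from collections import defaultdict
from statistics import mean

MIN_GROUP_SIZE = 3  # hypothetical threshold, chosen only for this example

def publishable_group_means(records, min_size=MIN_GROUP_SIZE):
    """Return {group: mean income} only for groups with at least `min_size` members."""
    groups = defaultdict(list)
    for group, income in records:
        groups[group].append(income)
    return {g: mean(v) for g, v in groups.items() if len(v) >= min_size}

# Hypothetical (group, income) records. Group "B" is too small to release.
records = [
    ("A", 30_000), ("A", 40_000), ("A", 50_000),
    ("B", 70_000), ("B", 90_000),
]

released = publishable_group_means(records)
print(released)
```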
How income is distributed: The who, what, when, and where
So just what is in the Income Distributions and Dynamics in America dataset? The data can be described in terms of what measures of income and earnings are included, whose income is measured, where those people live, and which years the data cover.
What income
The IDDA dataset contains statistics based on several definitions of income and earnings, including adjusted gross income; wages and salaries; nonwage income; and total compensation. The goal was not to create new concepts of “income,” but rather to capture variation across sources of income that reflect distinct concepts. “Wages and salaries” is what an individual earns via employment and so speaks to what employment opportunities are available to different people in different places. “Nonwage income” is a diverse category that captures income from work where someone is not a formal employee, including gig work and self-employment. In addition, it includes taxable benefits such as Social Security as well as capital gains, interest, gifts, and prizes. “Total compensation” adds together all income from employment: wage and salary earnings plus other payments, like bonuses.
Whose income
The IDDA demographic data include sex (men and women); race and ethnicity (Hispanic, non-Hispanic American Indian or Alaska Native, non-Hispanic Asian, non-Hispanic Black, non-Hispanic Native Hawaiian or Pacific Islander, non-Hispanic other or multiple races, and non-Hispanic White); age bracket; and place of birth (in the U.S. or outside the U.S.).
Community identities are intentionally centered in the data to acknowledge that different groups have had, and continue to have, different economic experiences and outcomes. Of course, even these groups are aggregates, and there are important group identities that are not captured in this dataset.
Income where
A third dimension of the IDDA data is geography. Individuals and households in the IDDA data can be sorted according to their state of residence. The data also include Washington, D.C., as well as an aggregation of Native areas defined by the Census Bureau.
Place matters, especially when intersected with sex, race, and ethnicity. In the U.S., states have wide latitude to set policies with enormous impact on economic opportunity and outcomes. Eligibility for benefit programs, minimum wages, occupational licensing, housing regulations, sales taxes, and many public education programs are set at the state or local level, to name just a few. This policy variation means economic outcomes are unlikely to be constant across the entire country. Such variation also provides an opportunity to investigate the impact of different policies on different groups.
Income when
Finally, the IDDA dataset can be used to study income at particular points in time as well as how incomes evolve over time. Incomes are generally not static: They fluctuate over time for individuals, and they may fluctuate differently for different groups of people. Groups may also experience macroeconomic conditions in disparate ways. The time period covered by the data, 1998 to 2019, includes two recessions (one shallow, one deep), a slow economic recovery, a period of low unemployment, and periods of lower and higher interest rates. Researchers can thus analyze the data to understand how and why income inequality has evolved over time, insights they can then use to try to understand where inequality is going and the impact different policies might have.
In addition, IDDA data include income mobility statistics for many demographic groups, such as the likelihood that a person in the bottom quartile of the income distribution will move to a higher quartile in the next year and in the next five years. These probabilities vary across time, across groups, and across place. The promise of upward mobility has long been woven into American ideology. The IDDA income dynamics data deepen our understanding of who, when, and where this promise is most likely to be realized.
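A mobility statistic of this kind can be read as a simple share: of the people who start in a given quartile, what fraction end up in a higher one? The sketch below computes that share from made-up quartile pairs. The data and the function name are hypothetical, and IDDA publishes these statistics already computed, so this is only meant to show what the number represents.

```python
def upward_mobility_rate(transitions, start_quartile=1):
    """Share of people starting in `start_quartile` who are in a higher quartile later.

    `transitions` holds (quartile_now, quartile_later) pairs, with 1 = bottom quartile.
    Returns None if nobody starts in `start_quartile`.
    """
    starters = [(q0, q1) for q0, q1 in transitions if q0 == start_quartile]
    if not starters:
        return None
    movers = sum(1 for q0, q1 in starters if q1 > q0)
    return movers / len(starters)

# Hypothetical one-year transitions, for illustration only.
transitions = [(1, 1), (1, 2), (1, 3), (1, 1), (2, 2), (4, 3)]

rate = upward_mobility_rate(transitions)
print(f"Share of bottom-quartile starters who moved up: {rate:.0%}")
```

The same function applied to five-year pairs, or to pairs for a single demographic group or state, would give the corresponding statistic for that horizon or group.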
Visualizing inequality
Currently, the IDDA data contain approximately 6 million summary statistics. These statistics offer new ways to study income growth, risk, mobility, and inequality, deepening our understanding of income patterns and processes in America. (Technical documentation explains each variable in detail.)
The data can be used to visualize inequality in a number of ways. First, the data contain demographically disaggregated income values for the 10th, 25th, 50th, 75th, 90th, 95th, 98th, 99th, 99.9th, 99.99th, and 99.999th percentiles of various income measures for every year from 1998 to 2019. These estimates are available for intersections of sex, race and ethnicity, age, geography, and foreign-born status. For instance, in 2019, Asian men in California at the 10th percentile of their group’s income distribution earned $8,391. Comparing incomes not just at the mean or median but across the income distribution—low, middle, high—shows where inequality is growing or narrowing, who is flourishing and who has been left behind.
Second, the data can be used to identify proportions—the demographic characteristics of a specific section of the income distribution. For instance, in 2019, 21 percent of workers with incomes at or above the 99th percentile were women.
Third, the data can answer what share of income specific demographic or earnings groups hold. For example, in 2019, workers at or above the 98th percentile of the income distribution earned 18 percent of total wage income in the economy.
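The three views just described (a point in the distribution, the makeup of a slice of the distribution, and the share of income that slice holds) can be sketched on a toy example. Everything below is hypothetical: the ten workers, their wages, the 80th-percentile cutoff, and the nearest-rank percentile rule are chosen for readability and are not the definitions or values used in IDDA.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical workers as (sex, wage income). Not IDDA values.
workers = [
    ("F", 20_000), ("M", 25_000), ("F", 30_000), ("M", 35_000), ("F", 40_000),
    ("M", 50_000), ("F", 60_000), ("M", 80_000), ("F", 120_000), ("M", 540_000),
]
wages = [w for _, w in workers]

# 1. Points in the distribution.
p10 = percentile(wages, 10)
median = percentile(wages, 50)

# 2. Proportions: who makes up the group at or above a cutoff?
cutoff = percentile(wages, 80)
top = [(s, w) for s, w in workers if w >= cutoff]
women_share_of_top = sum(1 for s, _ in top if s == "F") / len(top)

# 3. Income shares: how much of total wage income does that group hold?
top_income_share = sum(w for _, w in top) / sum(wages)

print(p10, median, cutoff)
print(f"women among top group: {women_share_of_top:.0%}")
print(f"income share of top group: {top_income_share:.0%}")
```

Read together, the three numbers tell different stories about the same data: where the cutoffs fall, who sits above them, and how much income is concentrated there.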
The customizable IDDA chart and map toolkit offers an initial window into visualizing how income and earnings inequality vary over these dimensions of group identities. In addition, we provide instructions for how to create charts for each state that look at income growth by race and ethnicity over time.
Why it matters
It is well known that some earnings gaps in the U.S. have been stubbornly persistent—and yet, the sources of these gaps are still poorly understood. The Income Distributions and Dynamics in America dataset will help researchers uncover many more facts about inequality than were previously known. That more complete picture, in turn, can be used to better test theories about the sources of the gaps.
Furthermore, policy has consequences for inequality, and inequality has consequences for policy. One of the objectives of economic research is to study the allocation of resources and whether changing the allocation would increase productivity, growth, or welfare. Understanding who is at the bottom and top of the income distribution, how entrenched or precarious their positions are, and what factors led them there is important for designing effective policies that reach those who need them.
“Without accurate and easy-to-use data for U.S. states and communities, it is too easy for policymakers to make assumptions about how different groups are faring,” said Institute Director Abigail Wozniak. “This can result in policies that are too narrow—missing subgroups that are low-earning but overshadowed in an average. But it can also lead to policies that are too broad, for example, assuming that all communities of color face challenges to earnings growth.”
A dataset is a way in which people are counted, even seen. This is particularly important for demographic groups that are typically too small to be incorporated in other datasets, leaving their economic circumstances poorly understood. By including more demographic groups than other datasets do, this data sees Americans—in their diversity, their heterogeneity, their inequality. And by making the data available for all, it also allows Americans to see themselves.
The project team
Principal investigators
- Abigail Wozniak, Federal Reserve Bank of Minneapolis
- Illenin Kondo, Federal Reserve Bank of Minneapolis
- Kevin Rinz, U.S. Census Bureau
- John Voorheis, U.S. Census Bureau
Researchers
- Andrew Goodman-Bacon, Federal Reserve Bank of Minneapolis
- Natalie Gubbay, Federal Reserve Bank of Minneapolis
- Brandon Hawkins, Federal Reserve Bank of Minneapolis
- Zach Swaziek, Federal Reserve Bank of Minneapolis
Endnotes
1 In the Income Distributions and Dynamics in America dataset, the race and ethnicity groups are Hispanic, non-Hispanic American Indian or Alaska Native, non-Hispanic Asian, non-Hispanic Black, non-Hispanic Native Hawaiian or Pacific Islander, and non-Hispanic White. We occasionally omit “non-Hispanic” in the text for brevity.
2 The process of putting together income measures, demographic characteristics, and geographic location is complex. The goal is to keep as many records as possible while screening out records that are incomplete or appear erroneous. For this project, records were dropped if sex, race or ethnicity, address, birth year, or certain income measures were missing. Individuals who were younger than 16, older than 100, or not a primary or secondary filer on a 1040 on which they appear were also omitted, as were records from any address associated with an unusual number of records.
Lisa Camner McKay is a senior writer with the Opportunity & Inclusive Growth Institute at the Minneapolis Fed. In this role, she creates content for diverse audiences in support of the Institute’s policy and research work.