IRS Compliance Data Warehouse
In 1996, the Internal Revenue Service (IRS) initiated a project to upload a single year of tax return data for analysis. That project grew into the Compliance Data Warehouse (CDW), which now contains more than 1 petabyte of information. Most of the legacy data is structured, but new data from electronically filed tax returns, international tax treaty partners, and third parties arrive in XML or other semi-structured or unstructured formats. The IRS research group runs analytics on the data for jobs ranging from estimating the U.S. tax gap to predicting identity theft, measuring taxpayer burden, and simulating the effects of policy changes on tax behavior.