NIST seeks to bring rigor to data science

NIST plans to develop a framework to help different industries better understand and use big data

Pak Chung Wong, chief scientist at the Pacific Northwest national Laboratory, explains the challenges of working with enormous data sets, as the NIST data science symposium

Pak Chung Wong, chief scientist at the Pacific Northwest national Laboratory, explains the challenges of working with enormous data sets, as the NIST data science symposium

The U.S. National Institute of Standards and Technology (NIST) wants to bring some metrics and rigor to the nascent but rapidly growing field of data science.

The government agency is embarking on a project to develop by 2016 a framework that can be used by all industries to understand how to use, and measure the results from data science, and big data projects.

NIST, an agency of the U.S. Department of Commerce, is holding a symposium on Tuesday and Wednesday at its Gaithersburg, Maryland, headquarters with big data specialists and data scientists to better understand the challenges around the emerging discipline.

"Data science is pretty convoluted because it involves multiple data types, structured and unstructured," said event speaker Ashit Talukder, NIST chief for its information access division. "So metrics to measure the performance of data science solutions is going to be pretty complex."

Starting with this symposium, the organization plans to seek feedback from industry about the challenges and successes of data science and big data projects. It then hopes to build a common taxonomy with the community that can be used across different domains of expertise, allowing best practices to be shared among multiple industries, Talukder said.

While computer-based data analysis is nothing new, many of the speakers at the event talked about a fundamental industry shift now going on underway with data analysis.

Doug Cutting, who originally created the Hadoop data processing platform noted that what made Hadoop unique is that it took a different approach to working with data. Instead of moving the data to a place where it can be analyzed -- an approach used with data warehouses -- the analysis takes place where the data is stored itself.

"You can't move [large] data sets without major performance penalties," Cutting said. Since its creation in 2005, Apache Hadoop has set the stage for storing and analyzing data sets so large that they can not fit into a standard relational database, hence the term "big data."

As these data sets grow larger, the tools for working with them are changing as well, noted Volker Markl, a professor and chair of the database systems and information management group at the Technische Universität Berlin.

"Data analysis is becoming more complex," Markl said. As a discipline, data science is challenging in that it requires both understanding the technologies to handle the data, such as Hadoop and R, as well as the statistics and other forms of mathematics needed to harvest useful information from the data, Markl said.

"A lot of companies are finding that they thought they were getting data science when they purchased Hadoop, but then they have to hire a data scientist to really do something useful with it," said symposium attendee Brand Niemann, director and senior enterprise architect at the data management consulting firm Semantic Community.

Another emerging problem with data science is that it is very difficult to maintain a data analysis system over time, given its complexity. As the people who developed the algorithms to analyze data move on to other jobs or retire, an organization may have difficulty finding other people to understand how the code works, Markl said.

Another challenge will be visualization, said Pak Chung Wong, chief scientist at the Department of Energy's Pacific Northwest National Laboratory. Visualization has long been a proven technique to help humans pinpoint trends and unusual events buried in large amounts of data, such as log files.

Standard visualization techniques may not work well with petabyte and exabyte-sized datasets, Wong warned. Such datasets may be arranged in hierarchies that can go 60 levels deep. "How can you represent that?" he asked.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com

Join the newsletter!

Error: Please check your email address.
Rocket to Success - Your 10 Tips for Smarter ERP System Selection

Tags applicationsNational Institute for Standards and Technologydata miningsoftwareData managementdata warehousing

Keep up with the latest tech news, reviews and previews by subscribing to the Good Gear Guide newsletter.

Joab Jackson

IDG News Service
Show Comments

Most Popular Reviews

Latest Articles

Resources

PCW Evaluation Team

Ben Ramsden

Sharp PN-40TC1 Huddle Board

Brainstorming, innovation, problem solving, and negotiation have all become much more productive and valuable if people can easily collaborate in real time with minimal friction.

Sarah Ieroianni

Brother QL-820NWB Professional Label Printer

The print quality also does not disappoint, it’s clear, bold, doesn’t smudge and the text is perfectly sized.

Ratchada Dunn

Sharp PN-40TC1 Huddle Board

The Huddle Board’s built in program; Sharp Touch Viewing software allows us to easily manipulate and edit our documents (jpegs and PDFs) all at the same time on the dashboard.

George Khoury

Sharp PN-40TC1 Huddle Board

The biggest perks for me would be that it comes with easy to use and comprehensive programs that make the collaboration process a whole lot more intuitive and organic

David Coyle

Brother PocketJet PJ-773 A4 Portable Thermal Printer

I rate the printer as a 5 out of 5 stars as it has been able to fit seamlessly into my busy and mobile lifestyle.

Kurt Hegetschweiler

Brother PocketJet PJ-773 A4 Portable Thermal Printer

It’s perfect for mobile workers. Just take it out — it’s small enough to sit anywhere — turn it on, load a sheet of paper, and start printing.

Featured Content

Latest Jobs

Don’t have an account? Sign up here

Don't have an account? Sign up now

Forgot password?