Structure: A Better Way of Thinking about Data

There are many different ways to classify data, but one classification that I hear frequently is “quantitative” versus “qualitative”. This can be a useful classification, at least in the context of determining what statistical analyses are appropriate for specific variables in the data. However, those classifications are being applied more and more broadly, as shorthand for other attributes of datasets. Quantitative is often used to mean “data collected by computers”, and is assumed to be consistent, objective, and reductive; qualitative is often used to mean “data collected by humans” and is assumed to be inconsistent, subjective, and rich.

This shorthand is sloppy at best; at worst, it is misleading and inaccurate, and it obscures actual information about the data that would help a listener understand which analyses are appropriate for the dataset.

Fortunately, there is an alternative. Classifying data by its structure avoids potentially false implications about the data and gives a listener good information about which analysis methods may be appropriate for it.

What is structure?

Structure is a consistent underlying organization. This consistent organization is the quality that makes it easier to search, transform, and analyze structured data. Unstructured data has no consistent underlying organization, which makes it more difficult to search, transform, and analyze.

Unstructured data is like a pile of silverware. Accessing particular kinds of silverware in the pile requires inspecting all the silverware.

Unstructured data is like a pile of silverware at a flea market. If I asked you to pull all of the salad forks out of the pile of silverware, it would take a while. You would need to pick up each piece of silverware, determine whether or not it was a salad fork, and place the salad forks in a separate pile.

Structured data is like silverware in an organizer. Accessing a particular kind of silverware is as straightforward as reaching into a cubby.

Structured data is like silverware in a silverware organizer. If I asked you to pull all the salad forks out of a silverware organizer, you would only need to reach into the salad fork cubby and pull out the stack of salad forks. If I asked you to pull out a particular teaspoon, you would only need to search through the teaspoon cubby, which contains only a small percentage of the silverware items in the drawer. With the unstructured pile of silverware, finding that same teaspoon would require inspecting all of the silverware in the pile individually.
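For readers who think in code, the same contrast can be sketched in Python. This is only an illustration: the item names and the `pile` and `organizer` variables are made up for the example, with a list standing in for the pile and a dictionary keyed by kind standing in for the organizer.

```python
from collections import defaultdict

# Hypothetical silverware: each item is a (kind, description) pair.
pile = [
    ("teaspoon", "plain"),
    ("salad fork", "engraved"),
    ("knife", "serrated"),
    ("salad fork", "plain"),
    ("teaspoon", "monogrammed"),
]


def salad_forks_from_pile(items):
    """Unstructured: inspect every item to find the salad forks."""
    return [item for item in items if item[0] == "salad fork"]


def build_organizer(items):
    """Structure the pile once by sorting items into cubbies by kind."""
    cubbies = defaultdict(list)
    for kind, description in items:
        cubbies[kind].append(description)
    return dict(cubbies)


organizer = build_organizer(pile)

# Structured: reach straight into one cubby.
salad_forks = organizer["salad fork"]

# Finding a particular teaspoon only searches the teaspoon cubby.
has_monogrammed = "monogrammed" in organizer["teaspoon"]
```

Building the organizer still means touching every piece once, but after that each request goes to a single cubby instead of back through the whole pile.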

Most of the data we interact with on the Internet exists somewhere in between “highly structured” and “totally unstructured”. Images, status updates, books, web pages, videos, and countless other types of data have some consistent underlying organization. A file format, like JPEG or bitmap, is a consistent organization that a computer uses to recognize and display the data present in a file. Books have titles, authors, and pages of content; that’s all structure. However, the bulk of the data in those items is unstructured.

Generally speaking, computational analysis methods require structured data. It’s easy for your computer to order your images by creation date, because that information is included in an image’s structured data. It’s very difficult for your computer to identify all the images containing birds, because that part of the image information is not structured.
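Here is a small Python sketch of that difference, using made-up image records. The `ImageRecord` class, file names, and dates are all hypothetical. The creation date is a structured field that can be sorted directly, while the pixel data is an opaque blob that a simple query cannot look inside.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ImageRecord:
    filename: str
    created: date   # structured: a field with a known meaning and type
    pixels: bytes   # unstructured as far as a simple query is concerned


# Placeholder records; the bytes are stand-ins, not real image data.
images = [
    ImageRecord("beach.jpg", date(2014, 7, 4), b"\x00\x01\x02"),
    ImageRecord("heron.jpg", date(2013, 5, 20), b"\x03\x04\x05"),
    ImageRecord("party.jpg", date(2015, 1, 1), b"\x06\x07\x08"),
]

# Ordering by creation date is a one-liner because the date is structured.
by_date = sorted(images, key=lambda image: image.created)
ordered_names = [image.filename for image in by_date]

# There is no comparable one-liner for "contains a bird": nothing in the
# record says what the pixels depict.
```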

Unstructured data can be transformed into structured data. The process can be labor intensive and often requires human intervention, but depending on your analysis needs, it may be worthwhile. Image tagging is a good example of adding useful structure to unstructured data. For instance, if you went through your image collection and tagged each image with what it contains, it would then be straightforward for your computer to identify all the images containing birds. When structuring data, it is useful to retain the original, unstructured version. This is important not only because data may be lost in transformation, but also because the structure that is appropriate for one type of analysis may not be appropriate for another.
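Continuing with image tagging, here is a rough Python sketch of adding structure while keeping the originals untouched. The file names and tags are invented for illustration, and the tags stand in for labels a person would add by hand.

```python
# Originals are kept as-is; the bytes here are placeholders, not real images.
originals = {
    "heron.jpg": b"<raw image bytes>",
    "beach.jpg": b"<raw image bytes>",
    "feeder.jpg": b"<raw image bytes>",
}

# Hand-added tags live alongside the originals instead of replacing them.
tags = {
    "heron.jpg": {"bird", "water"},
    "beach.jpg": {"sand", "water"},
    "feeder.jpg": {"bird", "garden"},
}


def images_tagged(tag, tag_index):
    """Return the filenames whose tag set contains the given tag, sorted by name."""
    return sorted(name for name, labels in tag_index.items() if tag in labels)


bird_images = images_tagged("bird", tags)
```

Because the tags sit in their own mapping, a different analysis could add a second mapping (colors, locations, people) over the same originals without disturbing the first.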

Ready to start classifying data by its structure? Here’s a quick reference for your future data-discussing pleasure!

[Image: quick reference for classifying data by structure]

Shoutout to my former classmate Jason Foss, who graciously provided the silverware organizer metaphor for structured data. Thanks, Jason!