Data preparation consumes much of a methodologist's time (Glaser, 2009). In fact, the activity can sometimes detract from the research itself: time meant for analysis and testing is encroached on by the work of ensuring information is accurate and accessible. In some cases, data preparation even pulls members of the research team away from their actual jobs.
While data is indeed one of the most important parts of the research process (arguably the most important), there is no need to dwell on data consolidation more than necessary. Handling research data well is thus imperative to the continuity and veracity of research work.
To that end, research data management (RDM) is a discipline concerned with making the data generated in the course of research as easy as possible for peers, contributors, and readers to access. This article outlines what RDM is, what it can do, and how to make an effective RDM plan.
In the scope of this article, we will refer to research data simply as “data,” which, unless otherwise specified, means digital forms of data. But what is data, really?
In general, data is information that is collected and recorded for later reference or analysis. Note that you can generate data at any point in your academic research, but if you fail to document it properly, it will become useless. Spichtinger and Siren (2018) define research data, more specifically, as "recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications."
They further subdivide research data into four types, as you can see in the graphic below.
Data can be extracted in a variety of ways. Most readers would automatically assume that data is natively digital, but in fact, digital is only one of the latest entrants to the data sphere. Scientists and researchers prior to the computer age recorded their research data in other formats, such as:
In addition, remember that "data" means different things across disciplines. The fact remains, however, that recording data efficiently (meaning you devote less time to preparing it) and making it accessible to peers and readers are the best ways to draw value from research. Research data management, or RDM, is the body of knowledge that seeks to do exactly that.
Research data management describes how to organize and store the data a research project accumulates in the most efficient way possible. It manages data gathered across the entire lifetime of the project by establishing consistent conventions. It is also responsible for the sharing, access, preservation, and secure disposal of data, making it an integral component of resource management more broadly. The practice is likewise intertwined with the tools that allow scientists to do all this, primarily scientific data management system solutions.
There are several reasons RDM is important, apart from the obvious, which is to make data collection easy and efficient. Here are some:
Like anything else, managing research data has several challenges. The following are the biggest:
In general, there are five main sources of data belonging to the primary and secondary research methodologies, as explained below. Note that the source of research data influences how any team manages that data. For example, observational data should be recorded right away to prevent data loss, while reference data isn't as time-sensitive.
Whatever the source, researchers collect data using one of two types of research methods: qualitative and quantitative. As the name suggests, qualitative research is descriptive, which is useful for things that can be observed but not measured. Quantitative research, meanwhile, deals with numbers.
That said, the source or type of data often means one method is much better suited than the other. Language data for use in natural language processing, for example, cannot be measured directly, so it is more appropriate for a qualitative collection method (which brings us full circle, as NLP can also automate qualitative data analysis, as Crowston [2011] pointed out).
The two methods are explained below.
Qualitative research examines the associations between individuals and their experiences against a greater context, such as social realities or the world at large. It is more concerned with observing people and groups and how they live their lives in a particular setting. Therefore, qualitative research collects data that is descriptive rather than numerical.
Denzin and Lincoln (1994) describe many ways of collecting qualitative data empirically, such as interviews, observations, analyses, visual materials, and personal experience. Qualitative data need not be constrained to text, either; photographs, video, and audio recordings can also be considered qualitative data. An anthropologist collecting and recording oral history, for example, is performing qualitative data collection.
What you gain from qualitative research answers how people experience their world and how they act in their social sphere. Note, however, that the person producing the data (e.g., researcher, participant, annotator) is a critical part of that data, as it may change depending on who that person is. Therefore, most qualitative data, if not all of it, is subjective and exists only in relation to the observer (McLeod, 2019).
This, however, is one of the strengths of qualitative research: the researcher gets a closer look at the subject matter that is otherwise afforded only to an insider. This gives them access to nuance and other subtle cues that quantitative researchers will often miss. It also gives the research team a rare view of the contradictions and ambiguities in data, which often reflect real life (Denscombe, 2010).
On the other hand, quantitative research is a more objective method because it uses a conventional standard of reliability and validity: numbers. While certainly not all data can be measured this way, quantitative data has the advantage of being easy to categorize and/or rank for a variety of purposes, such as graphs, charts, or tables. This ability to be visualized helps show the reader how questions are answered, not merely display the data (Cleveland, 1993).
The main order of business for quantitative researchers is to establish a general framework that holds across different settings and purposes, usually through experimentation. To limit extraneous variables, these experiments are often done in a controlled environment, such as a laboratory. However, this approach often ties the resulting data to its context, such as the assumptions, limitations, and expertise of the investigator (Black, 1999; Jansson-Boyd, 2018).
The main strength of quantitative data collection, however, is that the results can be verified and interpreted with mathematical analysis. This, together with the investigator's detachment from the subject, makes it more scientific and objective (Carr, 1994; Denscombe, 2010). In addition, numerical data is much easier to replicate than qualitative data, and while analyzing large datasets was once a monumental task, software can now "crunch" the numbers faster than ever (Antonius, 2003).
Research data management offers a lot of benefits to researchers. Some of these are discussed below.
The most important benefit of RDM is that you can secure your data. An effective research data management plan minimizes data loss and unauthorized access through adherence to data storage and organization standards. You also reduce the risk of compromising the integrity of your data through accident or negligence.
The most common place to store your research data is your institution's repository, such as its servers (for digital data). Your institution or organization may have advice on where to store your data. Note that many funders dislike having research data they funded stored in personal repositories or elsewhere, especially without authorization.
The second most important benefit of RDM is collaboration, especially in an age where research is more complex, with more moving parts. This is an advantage: studies with more authors tend to fare better than those with only one (Lamberts, 2013). Making data accessible to everyone in the group, and even to those outside the team but within the same discipline, can open up massive opportunities to further your own research.
Good RDM routines also improve the efficiency of data access. An organized data directory structure, for example, makes contributing data or building on the existing dataset much easier. Efficient data organization also makes keeping tabs on the project's progress more seamless and puts accountability front and center.
Should another team, using the data you generated, try to replicate your research, they should arrive at the same result. Good RDM practices improve your research integrity by allowing third parties to validate your processes and findings. Markowetz (2015) also cites five "selfish reasons" that make reproducibility important, among them avoiding disaster and facilitating peer review.
In addition, putting your research up for review increases its visibility, which, in turn, grows your number of citations. As Piwowar and Vision (2013) pointed out, open data improves the value and impact of research even after the project is completed. That said, proper attribution is key to tracing the reuse of your research, which is why data citation standards are being developed. One initiative is the use of digital object identifiers (DOIs) to make data easily traceable across the internet.
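Because DOIs resolve through doi.org, data citations can even be fetched programmatically. Below is a minimal sketch in Python of retrieving citation metadata for a dataset's DOI via doi.org's content negotiation, which returns CSL JSON for registered identifiers; the DOI string used here is a placeholder, not a real identifier.

import requests

DOI = "10.1234/example-dataset"  # hypothetical DOI, for illustration only

# doi.org supports content negotiation: asking for CSL JSON returns
# citation metadata for the registered object.
response = requests.get(
    f"https://doi.org/{DOI}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=10,
)
response.raise_for_status()

record = response.json()
print(record.get("title"))      # dataset title as registered
print(record.get("publisher"))  # hosting repository or publisher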
Knowing why you should manage your research data is all well and good, but the question remains: how should you do it? The answer is to start with a data management plan, or DMP, which covers how your files and datasets are stored, organized, and arranged in a database. There are several database formats you can use for huge volumes of data, but if you only need to arrange your files in a way that makes sense on a computer, you can find a few tips below.
Before you begin, you need to make a number of decisions about how to manage your data. Many funders now require an outline of your data management plans even before you begin your research, along with details such as how regularly you will furnish them with data, what hardware and other equipment you need, and other issues. This makes a DMP an ideal starting point, a map to all your planned research data, whether your funder requires one or not.
Additionally, you'll have to contend with other considerations. Some of them include:
We look at these considerations further below. This is especially useful for researchers who want a more specific approach to organizing and simplifying their research data management.
Consistency and logic are the two guiding principles for organizing your data: they allow any member of the team to find and use files easily. You need not create a highly detailed flowchart for this; it may simply entail settling on a file naming convention and deciding how to nest files in your directories for easy access. The ideal time to do this is before the project or the research begins.
Naming conventions also reduce the chance of overwriting files. File names may contain dates and other identifiers to help you track which files are yours and when they were modified. Metadata, however, is much more accurate for this task, as we'll cover below.
For reference, the Library of Congress has recommended formats for data and databases on this page: https://www.loc.gov/preservation/resources/rfs/data.html.
As mentioned, structuring your datasets in files and folders is an easy way to start your data management plan. Here are some ideas to get you started:
As for files, agree with your team on a consistent labeling scheme so you don't confuse one another. It is a good idea to opt for a version control naming scheme, for example, a "v01" or "v02" appended to the file name, as in the sketch below. In many cases, the final version of the file can be marked "final"; the person who does this is usually the supervisor, the principal investigator, or the approver of the research.
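To make that convention concrete, here is a minimal sketch in Python of a helper that picks the next available "v01"-style file name; next_version() is a hypothetical function written for this article, not a standard utility.

import re
from pathlib import Path

def next_version(directory: Path, stem: str, suffix: str) -> Path:
    """Return the next free versioned name, e.g. survey_v03.csv."""
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = [
        int(match.group(1))
        for path in directory.glob(f"{stem}_v*{suffix}")
        if (match := pattern.match(path.name))
    ]
    return directory / f"{stem}_v{max(versions, default=0) + 1:02d}{suffix}"

# If survey_v01.csv and survey_v02.csv already exist in data/,
# this prints data/survey_v03.csv.
print(next_version(Path("data"), "survey", ".csv"))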
Metadata means data about data: information that tells you about the data contained in a file, which helps you (and others) find the exact file you're looking for. At present, metadata not only describes data but also helps bridge connections among tools and software, much like an API does (Sen, 2004).
Metadata contains information that is necessary to find, interpret, and use your file, folder, or data. Like your file naming and folder structure conventions, deciding on metadata should be done at the start of the project.
There are generally two ways to attach metadata to your files: embedded metadata and supporting metadata.
Embedded Metadata
This means embedding information into the file itself by various means, which is the easiest approach both for the creator of the file and for those trying to find it. Some embed metadata directly as XML text, such as this:
<data camera="b" date="14-Jun-01" direction="left" filename="021b001.dv" session="021" start_frame="335" start_time=" 0:00:13.10" stop_frame="4914" stop_time=" 0:03:16.14" subject_id="001" xmlcreatedby="xmlwrite.py; Time Code for segments added" xmlcreatedon="Tue Mar 26 15:32:05 2002">
<data camera="b" date="14-Jun-01" direction="left" filename="021b001.dv" session="021" start_frame="335" start_time=" 0:00:13.10" stop_frame="4914" stop_time=" 0:03:16.14" subject_id="001" xmlcreatedby="xmlwrite.py; Time Code for segments added" xmlcreatedon="Tue Mar 26 15:32:05 2002">
<comments> No comment </comments>
<segments automatic="no" checked="yes">
<fullview> <start frame="51" start_frame="1104"/> <stop frame="2771"/> </fullview>
<postbackground> <start frame="2772" start_frame="4867"/> <stop frame="2822"/> </postbackground>
<prebackground> <start frame="0" start_frame="335"/> <stop frame="50"/> </prebackground> </segments>
</data>
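Once metadata is embedded this way, it can be read back with ordinary tools. The sketch below, using Python's standard library, parses a trimmed-down version of the record above; in practice the XML would be read from the file itself rather than inlined as a string.

import xml.etree.ElementTree as ET

record = """<data camera="b" date="14-Jun-01" filename="021b001.dv"
      session="021" subject_id="001">
  <comments>No comment</comments>
</data>"""

root = ET.fromstring(record)
# Attribute metadata is what lets you (or a script) locate the exact
# file you are looking for.
print(root.get("filename"))               # 021b001.dv
print(root.get("subject_id"))             # 001
print(root.findtext("comments").strip())  # No comment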
Some operating systems also support embedded metadata, such as Microsoft's Document Properties.
Other ways to embed metadata include descriptions in the code or labels within the file itself. Some users also embed metadata in headers or summaries.
Supporting Metadata
This metadata is separate from the main dataset and is often used to accompany it. Supporting metadata consists of documents that explain or give context to the data they support (hence the name), much like an operating manual.
The main disadvantage of supporting metadata is that it runs the risk of becoming as voluminous as the dataset it describes. In this case, the best practices in structuring and naming explained above also apply.
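As a simple illustration, supporting metadata can be generated as a sidecar file next to the dataset it describes. The sketch below writes one in JSON; the field names and values are illustrative only, not a formal metadata standard (projects often adopt an established schema such as Dublin Core instead).

import json
from pathlib import Path

dataset = Path("data/survey_v03.csv")        # hypothetical dataset
sidecar = dataset.with_suffix(".meta.json")  # data/survey_v03.meta.json

sidecar.parent.mkdir(parents=True, exist_ok=True)
# All values below are made-up examples of what a team might record.
sidecar.write_text(json.dumps({
    "title": "Household survey responses, wave 3",
    "creator": "Example research team",
    "created": "2024-06-14",
    "variables": {"age": "years", "income": "USD per month"},
    "methodology": "Structured interviews, n=412",
}, indent=2))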
Data will outlive the project, so you should plan for ways to share and preserve your data for posterity. Data preservation is part of the research data lifecycle. Though there are slightly varying models of data lifecycles (Ball, 2012), the research data lifecycle involves the movement of data from creation to preservation and reuse, ad infinitum.
The basic processes in a typical data lifecycle.
Digital data has an advantage in that it can be maintained far longer than other types. The main drawback, however, is that as technology progresses, the tools meant to access this data may change. Good RDM practices thus plan for this inevitability by ensuring all data can be understood and used even years down the line.
Preserving data, however, does not mean merely saving backups. As mentioned before, you should future-proof your data using these practices.
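One widely used example of such a practice is recording fixity information: checksums that let future users verify files have not silently changed or corrupted. Here is a minimal sketch in Python, assuming the datasets live in a data/ directory.

import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Compute a file's SHA-256 checksum without loading it all into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Write a manifest that future users can re-run to detect corruption.
with open("checksums.sha256", "w") as manifest:
    for file in sorted(Path("data").rglob("*")):
        if file.is_file():
            manifest.write(f"{sha256(file)}  {file}\n")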
Data should not be siloed, and research data least of all; there is no sense in hoarding it. Sharing is not only a good source of feedback but also a way to increase funding interest, garner citations, and build your reputation.
Researchers can share data through a variety of means. At its simplest, you can store data on a USB flash drive that colleagues can borrow. Otherwise, you can upload it via FTP to a server, such as your institution's repository (see the sketch below). Another option is cloud sharing, which is explained further down.
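For the FTP route, a minimal sketch using Python's standard library looks like the following; the host name, credentials, and file paths are hypothetical, and FTP_TLS is used so the transfer is encrypted.

from ftplib import FTP_TLS

# Hypothetical institutional server; substitute your repository's details.
with FTP_TLS("ftp.example-university.edu") as ftp:
    ftp.login(user="researcher", passwd="********")
    ftp.prot_p()  # encrypt the data channel as well as the login
    with open("data/survey_v03.csv", "rb") as f:
        ftp.storbinary("STOR survey_v03.csv", f)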
As for licensing, investigators can simply provide a request form for anyone who wants to use their data. If internet publication is preferred, Creative Commons licenses are ideal for research work. Though there are many types of CC licenses, the most appropriate for research data is the “Attribution-NonCommercial” license, which states that anyone can use the data in a researcher’s work as long as they cite their source/s and avoid using it for profit. Some states or territories, however, have conflicting assessments of the NC clause (Hagedorn et al., 2011), so check with the licensing authority first.
Even better, CC licenses require no paperwork; you just need to notify your readers or other interested parties that you're using a particular type of Creative Commons license. However, CC licenses are irrevocable, so use one only when you are certain you will not want to withdraw it in the future.
Data loss is the enemy of nearly every researcher, and indeed of nearly everyone who has stored files on any kind of storage medium. This is why it is crucial to keep backups of your data, and even to back up your backups if necessary.
Some institutions use automatic backups to periodically save research work or any materials stored in their repositories. Ask your computer or network administrator for the details of this procedure: how often it happens, where backups are stored, and how long they are kept. In any case, no matter how exemplary your institution's backup process is, it is still prudent to back your data up on your own, as in the sketch below.
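A personal backup can be as simple as archiving the project's data directory into a dated file on a second drive. Below is a minimal sketch in Python; the source and destination paths are hypothetical.

import shutil
from datetime import date
from pathlib import Path

source = Path("data")                   # directory to protect
backup_dir = Path("/mnt/backup-drive")  # hypothetical second location
backup_dir.mkdir(parents=True, exist_ok=True)

# Produces e.g. /mnt/backup-drive/data-backup-20240614.zip
archive = shutil.make_archive(
    str(backup_dir / f"data-backup-{date.today():%Y%m%d}"),
    "zip",
    root_dir=source,
)
print(f"Backup written to {archive}")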
Cloud storage provides a relatively affordable yet highly reliable means of backing up data, with competitive cost-to-space ratios. Most cloud providers also sync in near real time, so your remote backup is updated as soon as your local copy changes.
Whatever the case, it's a good idea to diversify your backup formats and locations so you can keep data as safe as possible.
It can be said that good data management is not the destination but the journey; it is how researchers reach discovery and innovation (Wilkinson, 2016). Data, freely shared, can lead to further insights long after the original project is done and the research team has moved on.
This is why it is important to have a logical data management system to index and store your research data, not only for your own use but for those who will come after. Citation is an essential part of the research environment, bringing your findings to the experts who can build on your work. Initiatives like the DOI and newer technologies such as cloud storage can bring your research to more minds than ever before.
To do that, however, it is still a good idea to manage your data just as your predecessors did: following conventions, practicing logical data structure, and citing wisely form the framework upon which the future of science is built.