By Ben Earnhart, University of Iowa Department of Sociology, incorporating comments from members of the Survey Research Methods Section of the ASA listserv.
The purpose of this handout is to aid you in the manipulation of data to be used for social science research. A user I spoke with in preparing these notes, who discovered many of these practices on his own, told me that he started following them because he was conscientious and wanted to do a good job, but just as importantly, didn’t want to waste his time. It may seem slower and more complicated to follow good practices at first, but in the long run, it will make your life much, much easier.
Despite the tone taken in some of these pages, the specific implementation of good practices given in this handout is not the last word in how to handle things. Many researchers have different, and better, ways of working with their data. But these different ways arose from deliberate, systematic efforts to implement good practices, and share the common goals about to be outlined. So if you find that there is a better way, please do use it, and let me know how and why you do things differently -- maybe these instructions can be improved through your input.
One goal is replicability. While the process of research is often presented as a linear one, progressing step by step from proposal to finished work, it is actually an inherently iterative process. You will often find yourself re-visiting an earlier stage in the process. If you follow these steps, you will be able to trace a path back to any given part of a research project much more easily. In addition, if anybody ever questions your work, you can easily verify how you got your numbers, or repeat an analysis in a slightly modified form – it is quite common to have reviewers say “yes, but what would happen if you used this model or that control variable?”
It will help you to improve the efficiency of your work. Often people assume that the first way they learn to do something is the most efficient. It’s also natural to start work on a problem without consulting the manual first. However, good practices such as writing efficient syntax, consulting the manual, and limiting the amount of work done at a given time can massively increase the speed at which you can accomplish a given task. Also, given the road-map above, if you ever need to repeat a task, finding your code from the last time you did something can save you from having to start from scratch.
It will help you to ensure the quality of your work. Much of the material and many of the tips are intended to help you avoid common mistakes in the first place. Mistakes can at best force you to re-examine and re-do much of your work, and at worst can totally derail your project. Realize that no matter how good you are, you will make mistakes, and steps taken to minimize them will greatly speed up your progress.
Finally, it will prepare you to work with collaborators better. You may say to yourself “this is my thesis, I don’t have collaborators, I don’t need to follow guidelines that apply to group projects.” But realize that at some point, all projects are collaborations – whether it’s explaining to your advisor how you arrived at certain conclusions, or seeking assistance when you need help with a particularly difficult part of the project, others will need to understand what you did.
Some software packages have point-and-click interfaces that allow one to carry out many analyses and data manipulations without ever having to work with syntax. However, syntax files are essential to achieving all four goals. My explanation follows; you can click here to see another explanation by an SPSS guru.
· Working with a statistics package without an understanding of syntax is like being a scholar of French history and never learning the French language – you can do some basic work, but are severely restricted, and people will have a hard time taking you seriously.
· Syntax greatly enhances replicability. You have precise documentation of what you did, and can repeat it exactly, simply by executing the syntax file. If you made a mistake or change your mind, it’s easy to make a few modifications and re-run it, whereas a complex series of mouse-clicks can be difficult to repeat exactly. Log files and the SPSS journal file (to be discussed in lesson 2) can help you repeat things in a point-and-click environment, but again, if you change your mind, it can be difficult to sort out what you did and didn’t really mean to do, and which parts of the record were mistakes and which were important.
· Syntax can enhance the efficiency of your work. Besides being able to repeat things more easily, syntax files can increase efficiency in other ways. For example, it takes eight mouse-clicks in SPSS to change a variable in running frequencies, and if you have a large dataset, it also can take a lot of scrolling. With syntax, you merely need to type in the variable name. When you’re working with numerous variables, it can get much, much more efficient, as you can copy and paste lists of variables. Perhaps most importantly, it opens up more powerful features of the programs and enhances your abilities as a researcher – if you learn to do scripting, macros, loops, and other advanced syntax, you can massively improve your productivity.
· Syntax can enhance the quality of your work. When you type something, you know exactly what is happening. With pointing and clicking, one slip of the mouse and you have done something totally unintended.
· Syntax is essential when communicating with others. A conversation with your advisor that goes: “Well, I clicked on this, then I clicked on that, then I clicked on this, then I clicked on that… anyway, here’s the output” is not going to be a very pleasant one. And if you ever come to somebody for help, or post to a listserv looking for help, syntax is the only way to communicate.
That said, the GUI can be great for preparing syntax. In SPSS, you can click “paste” instead of “OK.” This was perhaps the #1 suggestion that was repeated by numerous people when I was asking for suggestions as to what should be in these lessons. In SAS, when you are using their wizards, such as the data import wizard, there is a similar option to save a syntax file rather than immediately executing the commands generated by the wizard.
One good practice is to have a comment section at the very beginning of each syntax file, and at each section of the syntax where you begin a new task. The comment at the top should at the very least note:
· The date the file was created and when it was last modified.
· The author (yourself)
· The project
· The basic purpose of the file.
It should also probably include:
· A mention of any related files, both data and syntax. For some files, it’s obvious (you read in the data from a single file), but for others, like when you are merging data or are carrying out a complex series of manipulations, this can be very important.
· Acknowledgement of other sources if you’re using other people’s syntax. This is the right thing to do from a moral point of view, but can also give you a direction to look if something goes wrong.
· You might also use this comment section as a supplement to your diary (diary to be explained later).
When making these comments, it’s often good to refer to specific pages of the codebook, syntax file, email from your advisor, or whatever other documentation can help you make sense of what you are doing and why you are doing it. A good rule of thumb for how verbose your comments should be is to ask yourself "What if I were to set this down right now and not think about it for six months? What would I need to say here, so I can quickly figure out what I was doing?”
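As an illustration of the checklist above, here is a minimal sketch in Python that assembles such a header so every file starts with the same fields. The function name, the field layout, and the "* " comment prefix are all my own assumptions for the example; check how your package marks (and ends) comments before pasting the result into a syntax file.

```python
from datetime import date


def header_comment(author, project, purpose, created, modified,
                   related_files=(), sources=(), prefix="* "):
    """Return a comment block for the top of a syntax file.

    `prefix` is an assumption: change it to whatever marks a comment
    line in the package you use.
    """
    lines = [
        f"Created:  {created.isoformat()}",
        f"Modified: {modified.isoformat()}",
        f"Author:   {author}",
        f"Project:  {project}",
        f"Purpose:  {purpose}",
    ]
    # Optional items: related data/syntax files and borrowed syntax.
    lines += [f"Related:  {name}" for name in related_files]
    lines += [f"Source:   {name}" for name in sources]
    return "\n".join(prefix + line for line in lines)


# Hypothetical example values, for illustration only.
print(header_comment(
    author="A. Student",
    project="GSS employment paper",
    purpose="Recode employment variables",
    created=date(2024, 1, 15),
    modified=date(2024, 2, 1),
    related_files=["GssReadASCII1.sps"],
))
```

The point is less the script than the habit: the required fields are always present, and the optional ones appear only when you supply them.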
Filenames like “syntax.sps” are totally useless and will actively confuse you if you have many of them. Names like GssReadASCII1.sps and GssRecodeEmployment1.sps take a bit more imagination to name in the first place, but much less imagination to figure out what they do later. In addition, data file names should be meaningful, and generally should be related to the names of the syntax related to them unless you have a very good reason to have the two differ. A good habit is to save the syntax and data occasionally with an incremental filename (myfile1, myfile2, myfile3, etc). That way if you make a major mistake, you can always go back to an earlier version.
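The incremental-filename habit is easy to automate. The following is only a sketch, in Python, of one way to work out the next name in a series; the function name is hypothetical, and it assumes the version number is the run of digits at the very end of the name, just before the extension.

```python
import re
from pathlib import Path


def next_version(filename):
    """Return the next incremental filename, e.g. myfile1.sps -> myfile2.sps."""
    path = Path(filename)
    # Split the stem into a base and a trailing run of digits, if any.
    match = re.fullmatch(r"(.*?)(\d+)", path.stem)
    if match:
        base, number = match.group(1), int(match.group(2)) + 1
    else:
        # No trailing number yet: start the series at 1.
        base, number = path.stem, 1
    return str(path.with_name(f"{base}{number}{path.suffix}"))


print(next_version("GssRecodeEmployment1.sps"))
print(next_version("syntax.sps"))
```

Because the number sits at the end of the name, related versions sort next to each other and the name never begins with a digit.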
Similarly, variable names like V1, V2, V3 are not immediately intuitive. Names like caseID, income, and gender are much easier to understand, so when you create a new variable, think carefully about what it should be named – like with filenames, a little thought now can save you and others a lot of confusion later on. When creating new variables, you should make an addendum to your codebook listing your new variables, their meanings, valid values, etc. Despite the desirability of meaningful variable names, do not just go re-naming pre-existing variables willy-nilly – realize that you need to be able to relate variables to the codebook. So when a dataset comes to you with meaningless variable names such as V1, V2, V3, you probably will want to just leave the variable names alone. If you do re-name existing variables, make a note of it in the codebook, and print out a list of your variable re-namings so you and others can see the changes at a glance. However, variable names can only carry so much information (see the section on lengths below).
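If you do re-name variables, the list of re-namings can be produced at the same moment the re-naming is done, so the two cannot drift apart. Here is a minimal sketch in Python; the function name and the "old -> new" layout are my own assumptions, and in practice you would do the equivalent in your statistics package's syntax.

```python
def rename_variables(names, renames):
    """Apply `renames` (old name -> new name) to a list of variable names.

    Returns the new list of names plus a printable crosswalk that can be
    pasted into the codebook addendum.
    """
    unknown = [old for old in renames if old not in names]
    if unknown:
        raise KeyError(f"not in dataset: {unknown}")
    new_names = [renames.get(name, name) for name in names]
    if len(set(new_names)) != len(new_names):
        raise ValueError("renaming would create duplicate variable names")
    crosswalk = [f"{old} -> {new}" for old, new in renames.items()]
    return new_names, crosswalk


new_names, crosswalk = rename_variables(
    ["V1", "V2", "V3"], {"V1": "caseID", "V3": "income"}
)
print(new_names)
print("\n".join(crosswalk))
```

The two checks catch the most common slips: a typo in an old name, and a new name that collides with one already in the dataset.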
Variable labels can make up for the limits on the information carried by the variable name. Attached to the variable, they allow you to get verbose. Realize that people looking at output (the main place these labels show up) don’t necessarily see all the related coding, so when creating labels, think to yourself “What would I need to say to make sense to other people outside of the immediate context?”
Value labels map numeric values to substantive meanings, like 1=male, 2=female, 3=missing. These can be very handy for interpreting variables. When you’re at an in-between stage, creating and re-creating variables and changing things moment to moment, you might not want to bother with them. However, when you reach a point in your project where you will be generating output that others will be looking at, value labels help interpretation immensely – your advisor hopefully will be checking your syntax carefully for errors and consistency, but he or she should not have to flip through pages and pages of syntax to figure out that 1 means employed, 2 means unemployed, and 3 means out of the labor force when looking at output. Value labels can make meetings much more productive and pleasant!
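To make the idea concrete, here is a small sketch in Python of what value labels do to a frequency table. Statistics packages have their own commands for attaching labels; the dictionary and function below are hypothetical stand-ins that only illustrate the mapping from codes to meanings, using the employment codes mentioned above.

```python
from collections import Counter

# Code -> label mapping, as it might appear in a codebook.
EMPLOY_LABELS = {1: "employed", 2: "unemployed", 3: "out of labor force"}


def labeled_counts(values, labels):
    """Count each code and report it under its label instead of its number."""
    counts = Counter(values)
    return {
        # Codes without a label are flagged rather than silently dropped.
        labels.get(code, f"unlabeled ({code})"): n
        for code, n in sorted(counts.items())
    }


print(labeled_counts([1, 1, 2, 3, 3, 3, 9], EMPLOY_LABELS))
```

Flagging unlabeled codes is a deliberate choice here: a stray 9 in the output is often the first sign of a missing-value code that nobody documented.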
As a warning regarding this, some software packages limit the number of characters in filenames, variable names, and labels. Spaces, numbers, and non-alphanumeric characters (#,$,^,&,% etc.) are also problems for some software packages or operating systems, especially since they often carry special meanings, like “$” denoting a string variable, or “%” invoking a macro. I recommend that you look up the restrictions for your particular package, but to be on the safe side, you probably should take the least common denominator approach – restrict yourself to names that are accessible to as many packages as possible. Some general rules of thumb are:
· Be aware of limitations on the length of things. Some length restrictions, such as those on variable names, are very common, and will actively cause errors if you make the names too long. Others, such as those on variable labels, are peculiar to different software packages, and even different versions, and won’t cause immediate problems – but making labels too long can actively cause confusion if they get truncated (where the software just chops them off past a certain length) and will at least make your output look odd. Here are some basics:
o Filenames. Once upon a time in the dark ages of DOS, you were restricted to eight characters, but most operating systems can handle longer ones, so long is OK, but try to keep them as short as reasonably possible, say 15-20 characters or so. If you need to go beyond that, then you’re probably putting in information that should be in the comments, or in your data diary (to be discussed later).
o Variable names. Eight is the magic number here. Most software packages still restrict you to eight, and even if your particular software will allow you to go beyond eight, this will cause problems if you ever need to transfer data or learn to use another package. If you’re going beyond eight, then you’re trying to put in information that could and should go into the variable label.
o Variable labels. 40 is good here. Some packages enforce this, others will allow you to go beyond this but will truncate and warn you, still others will truncate without warning, leading to some funny looking output. If you need to go beyond this, you probably need to choose your words more carefully, or are putting in information that should be in your codebook or data diary.
o Value labels. I don’t have a strong recommendation here (would any readers like to suggest one?). Keep in mind that shorter is better, and think about how they will appear in your output. Sometimes, for example with occupational codes, they might need to be long, but in general, keep them as short as is reasonably possible.
· Do not use spaces in filenames or variable names. Some operating systems will let you use spaces in filenames, and some programs will even allow them in variable names, but if you can avoid using spaces, it saves you from errors if you fail to wrap a filename in quotation marks. They are of course acceptable (and obviously quite important) in variable labels and value labels.
· Variable names and filenames should not begin with numbers. A few software packages will allow them, but most don’t. Ending with numbers can be a handy thing in tracking versions of related files or variables, but never begin with them.
· Avoid the use of non-alphanumeric characters. A few of them, such as “,” and “.” are generally OK in variable labels and value labels, and can be very useful to express meaning. However, some, like “%” and “*” are often going to cause problems (note that “pct” is much safer than “%”). Even if your particular software package allows a particular non-alphanumeric character, realize that they can cause problems down the road; for example, even if you normally work in SPSS, you may at some point need to export your data to a specialized package like HLM, Lisrel, or Limdep. So avoid them and you will save yourself from cryptic, difficult-to-diagnose problems later. One exception to this is that many people tend to use “_” when naming files to avoid using a space. As far as I know, “_” is acceptable to all the standard statistics packages.
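The rules of thumb above are mechanical enough to check automatically. What follows is a minimal sketch in Python of a least-common-denominator name checker. The function name is hypothetical, the limit of eight is the conservative variable-name limit suggested above rather than a rule of any particular package, and "alphanumeric" is taken to mean unaccented letters, digits, and the underscore.

```python
import re

MAX_NAME_LENGTH = 8  # conservative variable-name limit suggested above


def name_problems(name, max_length=MAX_NAME_LENGTH):
    """Return a list of rule-of-thumb violations for a variable or file name."""
    problems = []
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    if " " in name:
        problems.append("contains a space")
    if name[:1].isdigit():
        problems.append("begins with a number")
    # Spaces are reported separately above; underscores are allowed.
    if re.search(r"[^A-Za-z0-9_ ]", name):
        problems.append("contains a non-alphanumeric character")
    return problems


for candidate in ["income", "pct_emp", "2ndjob", "% employed"]:
    print(candidate, name_problems(candidate))
```

For filenames, you might call it with a looser limit, for example `name_problems("GssReadASCII1", max_length=20)`, checking the name without its extension since the period would otherwise be flagged.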
Click on the links below for software-specific examples, along with notes on how comments are handled. For the syntax files, right-click and choose "save as" to store them on your hard drive. Before running them, you will need to unzip the Zip file with the data in it. To get the examples to run, you will also need to modify the lines in each file that tell it where to read data from and where to write data to.