How can I merge multiple files in Stata? | Stata FAQ
Note: the syntax of the merge command changed in Stata 11. However, the old syntax displayed on this page will still work in newer versions of Stata. We are grateful to the original authors for their permission to reproduce this FAQ here.
It is not uncommon for data, especially survey data, to come in multiple datasets (there are practical reasons for distributing datasets this way). When data are distributed in multiple files, the variables you want to use will often be scattered across several datasets.
In order to work with information contained in two or more data files it is necessary to merge the segments into a new file that contains all of the variables you intend to work with.
In addition to finding the variables you want for your analysis, you need to know the name of the id variable. An id variable is a variable that is unique to a case (observation) in the dataset. For a given individual, the id should be the same across all datasets. This will allow you to match the data from different datasets to the right person.
For cross-sectional data, this will typically be a single variable. In other cases, two or more variables are needed; this is commonly seen in panel data, where subject id and date or wave are often needed to uniquely identify an observation. In order for Stata to merge the datasets, the id variable, or variables, must have the same name across all files.
Additionally, if the variable is a string in one dataset, it must also be a string in all other datasets, and the same is true of numeric variables (the specific storage type is not important, as long as they are numeric). Once you have identified all the variables you need, and know what the id variable(s) are, you can begin to merge the datasets.
A good first step is to describe our data. We can do this without actually opening the file (this can be handy if the files are very large); all we have to do is open Stata and issue the describe command with the using keyword.
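As a minimal sketch of this step, the command might look like the following. The file name data1 is an assumption made for illustration (a file data1.dta in the current working directory); substitute the path or URL of your own file.

```stata
* Describe a dataset on disk without loading it into memory.
* Assumption: data1.dta is a hypothetical file in the working directory.
describe using data1
```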
The describe command gives us a lot of useful information. For our purposes, the most important things it shows are that the variable id is numeric and that the data are unsorted (the data must be sorted by the id variable or variables in order to merge).
We also note that the variables we want from this dataset are in fact in the dataset. Let's assume that the datasets are all unsorted and that the id variable has the same name (id) in all three datasets. Although we can use the data from a website easily within Stata, we cannot save it there.
The syntax below opens each dataset, sorts it by id, and then saves it in a new location with a new name. If the dataset were already on our computer, we could save it in the same location, possibly even under the same name (replacing the old dataset); this is the user's choice.
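The open, sort, and save steps could be sketched as follows. The file names (data1, data2, data3 and the _sorted copies) are assumptions for illustration, not the paths used in the original FAQ. data3 is processed last so that it remains in memory as the master dataset.

```stata
* Assumption: data1.dta, data2.dta and data3.dta are hypothetical files
* in the current working directory; adjust the names or paths as needed.
use data1, clear
sort id
save data1_sorted, replace

use data2, clear
sort id
save data2_sorted, replace

* data3 is handled last, so the sorted copy stays in memory afterwards.
use data3, clear
sort id
save data3_sorted, replace
```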
Next, we actually merge the datasets. The merge command merges corresponding observations from the dataset currently in memory (called the master dataset) with those from a different Stata-format dataset (called the using dataset) into single observations.
Assuming that we have data3 open from running the above syntax, that will be our master dataset. The first line of syntax below merges the data. Directly after the merge command is the name of the variable (or variables) that serve as id variables, in this case id. Next is the argument using; this tells Stata that we are done listing the id variables, and that what follows are the dataset(s) to be merged. The names are listed with only spaces (no commas, etc.) between them.
Note: if the names or paths of your datasets include spaces, be sure to enclose them in quotation marks. The next line of syntax saves our new merged dataset.
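A sketch of the merge and save steps, using the old (pre-Stata 11) syntax this page describes, might look like this. It assumes the sorted data3 is already in memory as the master dataset, and that data1_sorted and data2_sorted are sorted copies of the other two files; all file names here are hypothetical.

```stata
* Old-style merge: id variable(s), then "using", then the using datasets
* separated by spaces.
* Assumption: data1_sorted.dta and data2_sorted.dta are hypothetical files,
* already sorted by id; the master dataset in memory is also sorted by id.
merge id using data1_sorted data2_sorted

* Save the merged result under a new (hypothetical) name.
save data_all, replace
```

Newer versions of Stata accept this form but may print a note that the old merge syntax is being used.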
Note that merge does not produce output, so we describe the merged dataset to check the result. In the describe output we see the number of cases, which is correct. This is important since problems with the merge process often result in too few, or more often too many, cases in the merged dataset. We also see a list of the variables, which includes all the variables we want.
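One way to run this check is sketched below. The name data_all is an assumption standing in for whatever name the merged file was saved under.

```stata
* Assumption: data_all.dta is a hypothetical name for the saved merged file.
use data_all, clear

* Lists the number of observations and every variable in the dataset.
describe

* Prints the number of observations on its own.
count
```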
The merged dataset contains three extra variables. These variables tell us where each observation in the dataset came from; this is useful as a check that your data merged properly. Sometimes an observation will not be present in a given dataset. This does not necessarily mean that something went wrong in the merge process, but it is another place where one can often get clues about what might have gone wrong.
We will discuss these variables in greater detail below, when we deal with datasets where not all cases are present in all datasets.
It is not uncommon to find that a large dataset contains many variables you are not going to use in your analysis.
You can just leave those variables in your datasets when you merge them together; however, there are several reasons you might not want to do this. First, there is a limit on the number of variables Stata can handle, and the limit depends on the type of Stata you are running.
These limits may seem high, but if you merge multiple datasets, each with a large number of variables, you may exceed the limit for your type of Stata. The second reason you might not want to leave unneeded variables in your dataset is that each variable in memory uses additional system resources.
Below we show several methods of eliminating extra variables. There is at least one additional option: you can open the datasets placing only those variables you need in memory. If I have a dataset containing a number of variables, but the only variables I need from it are id and read, I can add variable names to my use command, as is shown in the first line of syntax below.
This is particularly useful with very large files which require a lot of memory to open. Once you have opened the desired subset of variables, all you have to do is save the subset of data under a new name.
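A minimal sketch of loading only a subset of variables follows. The file names dataset2 and dataset2_sub are assumptions for illustration.

```stata
* Load only id and read from the file, leaving all other variables on disk.
* Assumption: dataset2.dta is a hypothetical file containing id and read.
use id read using dataset2, clear

* Save the subset under a new (hypothetical) name.
save dataset2_sub, replace
```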
In the above example, dataset2 contained a number of variables. Assume that my analysis only requires the variables read and write; the only variables from that dataset that are needed are those two and the variable id, which is used to merge the data with another dataset. Below are examples of the same sort of data preparation done above, using each of the techniques described.
These techniques are equivalent, in that they produce the same end result. The efficiency of each technique varies depending on the situation.
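The three techniques could be sketched as follows. The file names are assumptions, and the variables math and science in the drop example are hypothetical stand-ins for whatever unwanted variables your dataset contains.

```stata
* Technique 1: keep only the variables you need.
use dataset2, clear
keep id read write
sort id
save dataset2_keep, replace

* Technique 2: drop the variables you do not need.
* Assumption: math and science are hypothetical unwanted variables.
use dataset2, clear
drop math science
sort id
save dataset2_drop, replace

* Technique 3: load only the needed variables in the first place.
use id read write using dataset2, clear
sort id
save dataset2_use, replace
```

Listing the variables to keep is usually shorter when only a few are needed; drop is shorter when only a few are unwanted.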
As discussed above, the extra variables created by merge tell us which dataset(s) each case came from. This is important because a large number of cases that came from only one dataset may suggest a problem in the merge process.
However, it is not uncommon for some cases to be in one dataset, but not another. In panel data this can occur when a given respondent did not participate in all the waves of the study. It can also occur for a number of other reasons.
In the example above, where the same cases appeared in three datasets, I would expect every case to have come from all three of the datasets. If far fewer cases match than expected, the id values may not agree across datasets; this is particularly common when the id variable is a string. Below we examine a dataset before merging to see whether everything is as expected. We start by running describe on a dataset, data1m.
Finally we sort the data and save it under a new name. Assume that when we run describe on data2m and data3m we discover that they are also missing cases.
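The pre-merge check for data1m might be sketched like this. The output file name data1m_sorted is an assumption, and the isid line is an optional extra check that is not part of the original walkthrough.

```stata
* Inspect the file on disk first: number of observations, variables, sort order.
describe using data1m

* Open, sort by the id variable, and save under a new (hypothetical) name.
use data1m, clear
* Optional extra check: isid stops with an error if id does not
* uniquely identify observations.
isid id
sort id
save data1m_sorted, replace
```

The same steps would then be repeated for data2m and data3m.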
Datasets data2m and data3m each contain fewer observations than the full sample. It is possible that some of these cases are missing from all three datasets. We will find out once we merge the data. Once we have examined and sorted the datasets, we can merge them. The syntax below does this; note that the command is the same as in the first example.
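A sketch of this merge is shown here, again in the old syntax. It assumes sorted copies of the three files were saved as data1m_sorted, data2m_sorted, and data3m_sorted (hypothetical names), with data3m serving as the master dataset.

```stata
* Assumption: the *_sorted files are hypothetical sorted copies of
* data1m, data2m and data3m.
use data3m_sorted, clear

* Same command form as the first example; only the file names differ.
merge id using data1m_sorted data2m_sorted

* Hypothetical name for the merged result.
save data_all_m, replace
```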
By default, Stata will allow cases to come from any of the three datasets. These variables are equal to 1 if the observation was present in the dataset associated with that variable, and zero otherwise. The results show the total number of observations we ended up with, which is what we expected.
This includes both the cases that occur in all three datasets and those that occur in only two out of the three. Finally, there is one case that is only present in one of the using datasets; that is, one case exists in data1m or data2m that does not exist in either of the other two datasets. When there is more than one dataset in the using statement, there will be one of these variables for each dataset in the using statement.
The variables are assigned numbers based on the order of the datasets in the using statement.
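One way to inspect these variables after the merge is sketched below. It assumes the variables are named _merge, _merge1, and _merge2, which is how the old merge syntax names them as far as we recall when two using datasets are listed; run describe first to confirm the names in your own merged dataset.

```stata
* Assumption: the merged dataset is in memory and contains _merge,
* _merge1 and _merge2 (names to be confirmed with describe).

* Show the names and variable labels of the indicator variables.
describe _merge*

* Overall summary of where each observation came from.
tabulate _merge

* Cross-tabulate the per-dataset indicators (1 = present, 0 = absent).
tabulate _merge1 _merge2
```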
The variable label also indicates which dataset it is for.
This page is archived and no longer maintained.