Nicole Mirea

[nɪ'kʰoʊl 'miɾe̯a]

Intro to Git, for the Social Scientist

This tutorial is aimed at social scientists, particularly those who want to code or have coded their own scripts for analysis and experimental design. Maybe you’ve downloaded something from Github and been curious about how to share your own code there. Or maybe a collaborator has suggested using git for a project and you don’t know where to start. Either way, I hope this guide is helpful as a gentle introduction: an installation guide with a couple of usage examples to get you started, as well as links to places where you can learn more. It’s definitely not meant to be a be-all and end-all, but I hope you get some use out of it.

If you’ve never used the command line, fear not. The commands you’ll need to get started are very simple.

  1. What is git?
    1. Do I even need to learn git?
    2. Big Caveat: Git Will Not Solve All Your Problems
  2. Getting Started: A Tutorial
    1. Installation
    2. Configuring Options
    3. Your First Repo: The Basics
      1. Initializing
      2. Staging Changes
      3. Committing
    4. Branching Out
    5. Time Travel
    6. Working Well with Others
      1. Setup
      2. Pushing
      3. Pulling
      4. Bonus: Squashing
  3. Further Reading
  4. Footnotes

What is git?

Git is a version control system, or a program designed to help you keep track of versions of a project. If you’ve ever restored an old version of a document using Dropbox or Google Drive, you’ve used version control.

Out of the different version control systems out there, git is the most popular for people who work with code. Part of this is sheer network effect: a lot of projects are hosted on Github, which makes them easy to find and facilitates collaboration, and all projects on Github use git. But its more important advantage over other version control systems is its well-suitedness to development across multiple machines: cases where you’re developing on your local machine and periodically syncing with other people, who might be working on a different part of the project entirely. It doesn’t require you to be constantly connected to the Internet, and it minimizes the risk of accidentally overwriting your work.

Another reason for git’s success is its extendability: it’s easy to build on1, so there’s a ton of software built on it. You can visualize how many lines of code each collaborator has written. You can use it to identify when a project changed the most (or when a critical bug was introduced). You can even use it to update a website, like I’m doing right now.

Do I even need to learn git?

That being said, not every social scientist needs to use git for every project, and not every social scientist needs to know how to use git at all (although learning new things is generally never a bad thing). There is, after all, a learning curve to consider, and some of git’s behavior can be… unintuitive at best. Fundamentally, the decision to use git for a project will depend on three things:

  1. Are you writing code? Git is really good at tracking changes in plaintext files; not so much for binary files like Word documents or images. If most of your work deals with manipulating those types of files, then versioned file syncing software (like NextCloud, Google Drive, or Dropbox) might be the way to go. That’s not to say that these files can’t be part of a git project; it’s just that git won’t be able to tell you which part of the file has changed2, and so there are fewer benefits to using git over another form of version control if you don’t have any code to keep track of. That being said, more and more social scientists do need to write code these days in order to carry out their research, particularly if they do any sort of quantitative data analysis.
  2. Are you planning to share/publish? If you’re a social scientist, you’re probably planning to publish your work at one point or another. Making your research as transparent and accessible as possible is essential for trustworthiness and reproducibility, and git can be a step towards that. Git timestamps your changes every time you save a version of the project. This can go a long way towards accountability, especially if you post those timestamps somewhere where they can be seen, like the Open Science Framework. In addition, it can be useful to identify who is responsible for what, when submitting work to conferences or journals, to facilitate fielding questions or critiques.
  3. Are your collaborators using git? If you’re just coming onto a project where people have been using git for years, then this would be a great opportunity to immerse yourself. If, on the other hand, none of your 5 other teammates have heard of it, your time may be better spent following the path of least resistance (although you should at least advocate for some version control system—few things are worse than overwriting your hard work, or being unable to reproduce it when you need it!).

If the answer to all of these is no (unlikely), it might still be useful to go through this tutorial if your future collaborators are using git. If you’re considering going into tech, or even just working with computer scientists on academic projects, it is a very good idea to be familiar with the dominant version control system being used in the field. Regardless of what they’re making or the programming languages they’re using, most people who write code professionally are familiar with git, and they’ll be relieved that you know the basics.

Big Caveat: Git Will Not Solve All Your Problems

I think a lot of people have the misconception that, if they just use git, their collaboration is guaranteed to be frictionless. I certainly had this idea when I started out, and boy was I quickly disabused.

The only thing that git does is keep track of versions. It won’t automatically organize your codebase. It won’t guarantee that someone on your team won’t merge bad code into your project, or merge their code in a way that makes it difficult to revert. You still need to write documentation. You still need to decide on a workflow ahead of time. You still need to communicate with each other. When git “fails,” the culprit is usually process.3

Getting Started: A Tutorial

Meta note: Throughout this tutorial, unless otherwise noted, I’m going to be using angle brackets <> to mean “replace the angle brackets and everything inside of them with whatever it says inside the angle brackets”. So, if I was following this tutorial, I’d replace <your name> with Nicole Mirea in all the commands.

Installation

Chances are, you already have git installed on your system. To check, simply run git version in a terminal. If you get something that looks like a version number, great! If you get an error message instead, go ahead and install git for your operating system.

Configuring Options

Once you’ve installed git on your computer, you’ll need to tell it who you are by running the following in a terminal:

git config --global user.name "<your name>"

This sets “<your name>” as your name in git’s global configuration settings, for every project on your computer. If you want to set it on a per-project basis, leave out the --global flag and run the command inside of the folder where your project is stored. More on that later.

You’ll also need to set your email, especially if you’re planning to upload to Github. If so, you’ll want this email to match the one you signed up to Github with.

git config --global user.email <your email address>

If you ever forget which email you’ve chosen for this (or, hey, which name you put down), you can list all the options you’ve set:

git config --list
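If you only want to check one setting rather than scroll through the whole list, you can ask for it by name. Here is a minimal sketch, assuming a throwaway repo in a temporary directory; the name and email are made-up placeholders, and the settings are per-project (no --global) so your real configuration is left alone.

```shell
# Scratch repo in a temporary directory (hypothetical; your real config is untouched)
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# Per-project settings: no --global flag, run inside the project folder
git config user.name "Ada Example"
git config user.email "ada@example.com"

# Ask for one setting at a time by passing just its key
configured_name=$(git config user.name)
configured_email=$(git config user.email)
echo "$configured_name <$configured_email>"
```

A per-project value takes precedence over the global one, which is handy if you use a different email for one collaboration.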

Your First Repo4: The Basics

Screenshot from the 1984 movie *Repo Man*: Otto Maddox walking up to his first repo.

Initializing

In git, a project is called a repository (repo, for short). To begin, navigate to the directory where your files are stored (or where you’re planning on storing your files), using cd:

cd <path to your project, with forward slashes>

Now that you’re here, run the following command to initialize your repo:

git init

This command creates a hidden folder inside your project directory called .git. This hidden folder keeps track of all the git-related info. It’s extremely useful to have this inside the project directory itself, because it means you can move the project directory around (on your computer and even onto different machines) without fear of losing any of your git history.

The most useful git command, hands down, is git status; it shows you what’s going on, at any given time. Run that now to see the state of things.

git status

If there are no files in the project directory, you’ll want to create or copy some over before proceeding. If there are, you’ll see them listed as “Untracked files”.
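To see what that looks like end to end, here is a small sketch in a throwaway directory. The file name analysis.R and its contents are placeholders I made up for illustration.

```shell
# Scratch project in a temporary directory (hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# A placeholder script so there is something for git to notice
echo 'print("hello")' > analysis.R

# --short prints one line per file; "??" marks a file git is not tracking yet
status_before=$(git status --short)
echo "$status_before"

# The hidden .git folder now lives inside the project directory
ls -a
```

The plain git status output is wordier but says the same thing; the short form is just easier to scan once you know the codes.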

Staging Changes

Git works by staging changes before saving a version. This might seem kind of redundant sometimes (especially when you’re only working on a couple files at a time), but it’s basically an are-you-sure check. To start tracking all your files and stage your changes to them, run this:

git add -A

Sometimes you won’t want to track each and every file or directory: for example, if your data is sensitive or personally identifiable, if you’ve stated in your IRB protocol that you won’t share it with anybody, or if a file contains passwords that you’d rather not disclose. In that case, you’ll create a file called .gitignore in your project directory, and add paths to the sensitive files/directories, relative to the main project directory. In my example, all my sensitive data is stored in a subdirectory called sensitive_data, so my .gitignore file looks like this, just a single line:

sensitive_data

Now, when I run git add -A, it’ll act like sensitive_data isn’t even there. If you need to have the directory there, but want to ignore the files inside of it, the solution is to create another .gitignore inside of sensitive_data, like this.
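Both approaches can be tried out safely in a scratch directory. The sketch below is one common way of doing it, not the only one; the file names are placeholders, and the nested .gitignore uses the pattern * (ignore everything) followed by !.gitignore (except this file).

```shell
# Scratch project with some pretend sensitive data (all names hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
mkdir sensitive_data
echo "participant,score" > sensitive_data/raw.csv
echo 'print("analysis")' > analysis.R

# Option 1: ignore the whole directory from the top-level .gitignore
echo "sensitive_data" > .gitignore
git add -A
tracked_option1=$(git ls-files | tr '\n' ' ')
echo "Tracked with option 1: $tracked_option1"

# Option 2: keep the directory in the repo, but ignore the files inside it
: > .gitignore                                        # empty the top-level file
printf '*\n!.gitignore\n' > sensitive_data/.gitignore
git add -A
tracked_option2=$(git ls-files | tr '\n' ' ')
echo "Tracked with option 2: $tracked_option2"

# check-ignore exits with status 0 when a path is being ignored
git check-ignore -q sensitive_data/raw.csv && echo "raw.csv is ignored"
```

Either way, raw.csv never makes it into the staging area, which is the whole point.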

It’s also a good idea to ignore files and directories that are generated by other programs, for space. If you (or a collaborator) can simply regenerate these files from existing code as needed, there’s usually no need to add them in the repo5.

Run git status again, and you should see your staged files under “Changes to be committed”. If you created a .gitignore file, run git add .gitignore to add it to the repo.

Committing

In git, committing your changes means adding them to the history, by making a commit—a snapshot of the staged changes that you’ve made. When you’re ready to save that snapshot, run the following:

git commit

This will pop up a text editor and prompt you for a commit message, which is a descriptive message to indicate what the commit does. The first line here is the “subject line”, and everything after a blank line is the “body” of the message, kind of like an email. On most systems the text editor defaults to vim, but you can change it in the config options6:

git config --global core.editor <your editor>

There are a few degrees of stylistic freedom when writing commit messages, so work with your team ahead of time to develop your own explicit norms:

  1. How often should I commit? Personally, I err on the side of committing more often than not. Basically, every time I make a change that can be summarized in a few words and crossed off a to-do list (ex. “make stimuli bigger”, “generate regression graphs”). However, although committing often might make you feel secure in your ability to revert changes at any point, it might be more helpful for your collaborators if you squash down your commits before uploading them.
  2. What should I include in my commit message? This is more of an art than a science, so you can find a lot of advice on this out there. In general, keep in mind that your team might be poring over this later, when they’re trying to figure out what went wrong. My biggest tip here is to summarize the effect of the change in the subject line (in under 50 characters), and then use the body for bullet-pointed details (wrapped at 72 characters per line). For the longest time I just did not do this, and only used the subject line, which you can do in a pinch with git commit -m "Subject line of commit message" (but really, don’t do this unless you’re sure that your team doesn’t need the details; for example, when fixing a typo in a manuscript).

Once you’ve saved your commit message and closed your editor, congratulations! You’ve made your first commit! (Hopefully of many.)
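If you’d rather skip the editor but still want a subject and a body, one option is to pass -m more than once: the first becomes the subject line, and each later one becomes a paragraph of the body. A rough sketch, where the file, the identity, and the details in the message are all invented for illustration:

```shell
# Scratch repo with a placeholder identity (hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Ada Example"
git config user.email "ada@example.com"

echo 'stim_size <- 200' > experiment.R
git add -A

# First -m: subject line (under 50 characters). Second -m: bullet-pointed body.
git commit -q -m "Make stimuli bigger" \
  -m "- Increase stim_size from 100 to 200
- Old size was hard to see on small screens"

# Read the two parts back out of the history
subject=$(git log -1 --format=%s)
body=$(git log -1 --format=%b)
echo "Subject: $subject"
echo "$body"
```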

From here, a lot of tutorials jump straight to accessing history, or how to communicate with a remote server. Instead, I’m going to go through branching first (creating alternate versions of a project), because starting by working with branches will give you a feel for how git “thinks” about files and history.

Branching Out

You can use git to cleanly separate out new features that you want to add to your project, or just alternative ways of doing something. For example: in a meeting with your advisor, they suggest visualizing your data using a violin plot instead of a box-and-whisker. But you’re not sure whether you want to include this visualization as part of your final codebase, because it might be too messy to read, or it might just not reveal anything new. On the other hand, it could be really fruitful!

So, instead of creating a duplicate of your analysis file like analysis-new-visualizations.R (which could get messy after only a couple iterations of this process) or adding on some functions to your analysis file that you have to comment out or suppress later (because you’re afraid to delete anything since you “might need it later”), git will let you create a new version of your analysis file that you can merge into the project if it’s helpful, or leave out (and keep your codebase clean) if not.

Before starting that visualization, create a new branch of the project called violin-viz—a descriptive name that captures what feature this branch is meant to introduce.

git branch violin-viz

You can see which branch you’re currently on by running

git branch

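This will list all your branches, with a star next to the branch that you’re currently in. By default, you start out on the master branch of all projects (newer setups, including repos created on Github, may call it main instead; if so, substitute main wherever you see master). You can switch to your new branch with the checkout command7: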

git checkout violin-viz

It may take a few commits to implement the feature; just continue making them on this new branch. If you decide the plot is helpful and you’d like to keep it in the project for the long term, you can switch back to the master branch and merge it in, like so:

git checkout master
git merge violin-viz

To visualize which commits belong(ed) to which branches after they’ve been merged:

git log --graph --oneline

(The --oneline argument prints just the abbreviated hash and subject line of each commit, on a single line.)
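Here is the whole branch-and-merge cycle as a sketch you can run in a scratch directory. The file contents are placeholders, the symbolic-ref line is only there to make sure the first branch is called master regardless of your settings, and --no-edit just accepts git’s default merge message instead of opening an editor.

```shell
# Scratch repo with a placeholder identity (hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Ada Example"
git config user.email "ada@example.com"

echo 'boxplot(scores)' > analysis.R
git add -A
git commit -q -m "Add box-and-whisker plot"

# Work on the new feature in its own branch
git branch violin-viz
git checkout -q violin-viz
echo 'vioplot(scores)' >> analysis.R
git add -A
git commit -q -m "Add violin plot"

# Meanwhile, something unrelated changes on master
git checkout -q master
echo 'Analysis scripts for the study' > README.txt
git add -A
git commit -q -m "Add README"

# Merge the feature in and look at the shape of the history
git merge -q --no-edit violin-viz
git log --graph --oneline
```

Because both branches gained commits, the graph shows two lines splitting apart and rejoining at the merge commit.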

Now, why is having a separate branch useful? Let’s say you get interrupted while you’re working on the visualization; you need to refactor a significant part of your code. The visualization is too immature to be refactored, but you don’t want to lose the progress that you’ve made on it already. A good strategy here would be to create a new branch off master that is just for refactoring code:

git checkout master
git branch refactor
git checkout refactor

You can work on both the visualization and the refactoring simultaneously, and then, when you’re done, you can merge them back into master.

When you do that, you’ll probably get messages about merge conflicts. Something like:

CONFLICT (content): Merge conflict in analysis.R
Automatic merge failed; fix conflicts and then commit the result.

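Don’t panic! It just means you modified the same files in incompatible ways on both branches. You’ll have to manually go into the conflicting file(s), which will have both versions of the incompatible code: the current branch’s version after a line of <<<<<<<, then seven equals signs =======, then the incoming branch’s version, closed off by a line of >>>>>>>. Delete the code that you don’t want (or create new code that makes it compatible), along with the marker lines themselves. Then stage the file(s) with git add and make a new commit to complete the merge.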
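If you’d like to see a conflict somewhere safe before meeting one in the wild, this sketch manufactures one on purpose in a scratch repo. The file contents are placeholders, and the resolution here (keeping the violin version) is an arbitrary choice for illustration.

```shell
# Scratch repo with a placeholder identity (hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Ada Example"
git config user.email "ada@example.com"

echo 'plot_type <- "box"' > analysis.R
git add -A
git commit -q -m "Add plot type setting"

# Change the same line differently on two branches
git checkout -q -b violin-viz
echo 'plot_type <- "violin"' > analysis.R
git add -A
git commit -q -m "Switch to violin plot"

git checkout -q master
echo 'plot_type <- "boxplot"' > analysis.R
git add -A
git commit -q -m "Rename box plot type"

# The merge stops partway and exits with a non-zero status
git merge violin-viz || echo "Merge stopped: conflict to resolve"

# The file now contains both versions, wrapped in marker lines
cat analysis.R
conflict_markers=$(grep -c '^=======$' analysis.R)

# Resolve by writing the version to keep, then stage and commit
echo 'plot_type <- "violin"' > analysis.R
git add analysis.R
git commit -q -m "Merge violin-viz into master"
```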

One last note: when you merge, the branch that you’re merging in doesn’t get deleted. So if you decide that the violin plots were a welcome addition (so you merge them into master), but that you don’t need to develop them further, delete that branch after you’ve merged it to keep things nice and tidy:

git branch -d <branch name>

There are many, many workflows for branching, but the most important thing is that your whole team is on the same page. This is a rather good one to start with—here, you only merge into master for big releases, and keep a separate development branch for day-to-day merges.

Time Travel

Screenshot from the 1984 movie *Repo Man*: Miller and Otto burning trash under the 6th Street bridge.

The point of commits is to be able to go back to a certain point in history. How do you do that? First, you’ll need to decide where you want to go. To browse through all the commits in the project, run this:

git log

This will print all the commits in the current timeline—i.e., all the commits that have been merged into the current branch—in reverse chronological order. Press the down arrow to scroll earlier, and press “q” to quit and return to your command prompt.

Each commit will have a unique ID, called a hash: a long string of letters and numbers. Once you’ve decided which commit you want to go back to based on the descriptive commit messages, copy the hash.

Now there are a few things you can do with that hash, depending on what you need (this is essentially a re-organized paraphrase of Daniel’s StackOverflow post):

  1. Just look around. If you don’t anticipate yourself making any changes from that old state, you can just check out the commit without attaching yourself to any branch.

    git checkout <hash>
    

    This will take you to the project’s state just after the specified commit happened. It’ll give you a “HEAD detached” message, which sounds scarier than it is; it’s not an error, it just means that the place where you are in git’s history isn’t part of any branch. When you’re ready to go back to the branch you were on, run

    git checkout <name of previously checked-out branch>
    
  2. Branch out. If you plan on making new commits on top of the old one, create a new branch while you’re checking out your old one.

    git checkout -b <name of branch to create> <hash>
    

    Just remember that you’ll have to merge any new commits back into your main branch eventually, and resolve any conflicts that arise as a result.

    If you’re in a detached-head state, you can leave out <hash> in the command above, and it’ll just start making commits from where you are.

  3. Undo. To err is human. Fortunately, git has two ways of “undoing” commits, depending on whether you’ve published them or not.

    1. Revert adds new commits to the repo that undo the old commits. Essentially, these new commits are the “opposite” of the old ones. This is what you should do if you have already pushed the commits that you wish to undo.

      git revert <hash>
      

      This command will add a single new commit to the repo, that is the opposite of whatever the <hash> commit did. Only that one commit’s changes get undone; any commits made after it stay in place.

    2. Hard reset will literally rewrite history, and can be used to undo commits that you haven’t published yet. Let me reiterate: you should only use this for commits you haven’t published yet, lest you incur the ire of your teammates, who have to figure out what the hell you did (or overwrite the undo).

      git reset --hard <hash>
      

      If you have uncommitted work that you’d like to keep, run git stash before doing the hard reset to “stash” your changes temporarily, and then run git stash pop to apply them after the reset.
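To get a feel for the difference between looking around, reverting, and hard resetting, here is a sketch that tries all three on a scratch repo with three small commits. The file names are placeholders, and --no-edit tells git revert to accept its default commit message rather than opening an editor.

```shell
# Scratch repo with a placeholder identity (hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Ada Example"
git config user.email "ada@example.com"

# Three commits, each adding one file
for name in design stimuli analysis; do
  echo "notes on $name" > "$name.txt"
  git add -A
  git commit -q -m "Add $name notes"
done
design_hash=$(git rev-parse HEAD~2)    # oldest commit
stimuli_hash=$(git rev-parse HEAD~1)   # middle commit

# 1. Just look around: analysis.txt does not exist yet at this point in history
git checkout -q "$stimuli_hash"
ls
git checkout -q master

# 3.1 Revert: adds a fourth commit that removes stimuli.txt and nothing else
git revert --no-edit "$stimuli_hash"
files_after_revert=$(ls | tr '\n' ' ')
commits_after_revert=$(git rev-list --count HEAD)

# 3.2 Hard reset: history is rewritten so only the first commit remains
git reset -q --hard "$design_hash"
files_after_reset=$(ls | tr '\n' ' ')
commits_after_reset=$(git rev-list --count HEAD)

echo "After revert: $files_after_revert($commits_after_revert commits)"
echo "After reset:  $files_after_reset($commits_after_reset commits)"
```

Notice that the revert made the history longer while the reset made it shorter, which is exactly why only the revert is safe for commits other people already have.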

Working Well with Others

Screenshot from the 1984 movie *Repo Man*: Shot of all the repo men (except Otto) in matching sheriff's hats, driving.

You’ve got all the tools you need to start using git to keep track of changes and versions on your local machine! But usually, you’ll want to upload (push) and download (pull) changes from another server (“remote”), like Github.

Setup

The first thing that you’ll need to do is create a bare repo (i.e., a repo that just contains the git files) on your remote server. Github has a nice step-by-step form that will take you through this. If you’re using your own server, follow the first part of the instructions here.

In either case, you’ll want to link up your local repo to the repo on the server. This is done by adding the repo on the server as a “remote”, in your working directory:

git remote add <alias> <path to bare repo>

If you’re using Github, the path to the bare repo will be something like git@github.com:username/repo_name.git, and the default scripts will have you put “origin” as your alias. In reality, you can set the alias to any word you like, and being descriptive here can help if you’re working with multiple different servers.
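If you don’t have a server handy, you can practice with a bare repo on your own disk standing in for the remote. In this sketch the folder names server.git and project are placeholders, and the “path to bare repo” is just a local path instead of a Github address.

```shell
# A bare repo in a temporary directory plays the part of the server (hypothetical)
scratch=$(mktemp -d)
git init -q --bare "$scratch/server.git"

# Your local project
git init -q "$scratch/project"
cd "$scratch/project"

# Link the two, using "origin" as the alias
git remote add origin "$scratch/server.git"

# List the remotes this repo knows about, with their paths
git remote -v
remote_url=$(git remote get-url origin)
```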

Pushing

The first time you push to a remote, you can set the branch you’re currently on (in this example, master) to track a branch on the server:

git push -u <remote alias> master

Now, when you run git pull or git push without any arguments, this branch will pull/push from the branch on the remote called master.

Otherwise, if you don’t add the -u flag, you’ll need to specify the branch:

git push <remote alias> <branch name>

Pulling

The syntax to download commits from a branch on a remote is really similar:

git pull <remote alias> <branch name>

This will merge commits from a remote into your local repo. Just like with a branch merge, you can get merge conflicts here. Tread carefully. It’s good to talk to your team in situations like this to be clear what should and what shouldn’t be kept. Here are some great tips for resolving merge conflicts, especially those resulting from pulls.
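A full push-and-pull round trip can be rehearsed on one machine. The sketch below assumes a local bare repo as the stand-in server and two working copies, mine and theirs, playing you and a collaborator; all names and file contents are made up.

```shell
# Stand-in server: a bare repo in a temporary directory (hypothetical)
scratch=$(mktemp -d)
git init -q --bare "$scratch/server.git"
git --git-dir="$scratch/server.git" symbolic-ref HEAD refs/heads/master

# Your copy: make a commit and push it, setting up tracking with -u
git init -q "$scratch/mine"
cd "$scratch/mine"
git symbolic-ref HEAD refs/heads/master
git config user.name "Ada Example"
git config user.email "ada@example.com"
echo 'x <- 1' > analysis.R
git add -A
git commit -q -m "Add analysis script"
git remote add origin "$scratch/server.git"
git push -q -u origin master

# Your collaborator clones the server repo, commits, and pushes
git clone -q "$scratch/server.git" "$scratch/theirs"
cd "$scratch/theirs"
git config user.name "Grace Example"
git config user.email "grace@example.com"
echo 'y <- 2' >> analysis.R
git add -A
git commit -q -m "Add second variable"
git push -q origin master

# Back in your copy: thanks to -u, a bare "git pull" knows where to look
cd "$scratch/mine"
git pull -q
git log --oneline
```

No conflict comes up here because only one side changed between the push and the pull.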

Bonus: Squashing

Sometimes your team doesn’t need to see all of your commits, and they can make the log tedious to wade through. In these situations, you can squash down your long list of commits into a single one before pushing. Here, ## is the number of commits to squash:

git reset --hard HEAD~##
git merge --squash HEAD@{1}
git commit

What just happened? First, you time-traveled to the point just before you wanted to squash, ## commits ago (you can also replace HEAD~## with a hash). Then you merged in the state you were at just before the reset (HEAD@{1}), with all of its changes squashed together and staged, ready to be saved as one commit!

Like all hard resets, only do this if you haven’t pushed yet.
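Here is the squash recipe tried out on a scratch repo with three work-in-progress commits, so ## is 3. The file and messages are placeholders; HEAD@{1} is git’s shorthand for “where HEAD was one move ago”, which after the reset is the old tip of the branch.

```shell
# Scratch repo with a placeholder identity (hypothetical)
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Ada Example"
git config user.email "ada@example.com"

echo "step 0" > notes.txt
git add -A
git commit -q -m "Start notes"

# Three small commits that the team does not need to see individually
for i in 1 2 3; do
  echo "step $i" >> notes.txt
  git add -A
  git commit -q -m "WIP $i"
done
commits_before=$(git rev-list --count HEAD)

# Squash the last 3 commits into one
git reset -q --hard HEAD~3
git merge -q --squash 'HEAD@{1}'
git commit -q -m "Add steps 1 to 3 to notes"

commits_after=$(git rev-list --count HEAD)
echo "Commits before: $commits_before, after: $commits_after"
git log --oneline
```

The contents of notes.txt end up the same as before; only the history is shorter.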

Further Reading

This was a bit long, but it should tell you almost all you need to know in order to start using git with a team and on your own. I drew upon many sources to compile this; here are a few more to consult if you’re curious:

In a future post, I’ll cover contributing to open source projects and submitting pull requests! Stay tuned.

Screenshot from the 1984 movie *Repo Man*: The last scene of the movie, in which the car rises into the sky.

Footnotes

  1. Due in no small part to the fact that git is open-source. Lesson: if you want people to build on your code, make it open-source. Check out GitHub’s guide to starting an open source project to learn more. 

  2. Because of the way that git keeps track of changes (storing the changes on top of the original file, instead of overwriting the original file), using it for binary files is also less efficient than using it for plaintext files in terms of space. 

  3. See Alex Feinman’s excellent essay for an explanation, and links on how to develop a workflow with your team. 

  4. All images in this post are from the 1984 cult classic Repo Man, one of my favorite movies. Repossession has absolutely nothing to do with repositories (unless you’ve seriously miscalculated), but I’ve added some relevant stills from it to break up this post and make it more visually interesting. 

  5. There may be cases where you do need to add generated files, because you have scripts that do something with those generated files after you push them somewhere else (post-receive hooks). 

  6. I really like Atom; again, its extendability and the community behind it are the biggest pluses. To use it as the default git editor, replace <your editor> here with "atom --wait", which will wait to make the commit until after you’ve closed the Atom window. 

  7. You can create a new branch and switch to it all in the same command, by running git checkout -b <new branch name> 
