Git Internals Explained

Explore Git's internal architecture: objects, refs, and the .git directory. Learn how Git stores data and tracks changes under the hood.

🗃️ Objects

Git stores your project in three main types of objects (a fourth type, the annotated tag, is not covered here):

  • Commits
  • Trees
  • Blobs

[Figure: Git internals overview. Red: commit object, Blue: tree object, Grey: blob object]

📄 Blobs (Binary Large Objects)

What are blobs?

  • Store the actual contents of files
  • Hold the full content of each file version, not a diff against the previous version
  • Identified by the SHA-1 hash of their content (20 bytes, written as 40 hexadecimal characters)

Key characteristics:

  • Content only: Unlike regular files, which carry metadata (name, creation date, permissions), blobs store only raw file content. The file name and mode are recorded in the tree that references the blob
  • Immutable: Once created, a blob’s contents cannot be changed. Any modification creates a new blob with a different hash
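
As a minimal sketch of how a blob gets its identifier: Git hashes a short header (`blob <size>` followed by a NUL byte) plus the raw content. The function name `blob_hash` is made up for this example and is not part of Git.

```python
import hashlib

def blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # The header is "blob <size in bytes>" followed by a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

original = blob_hash(b"hello world\n")
modified = blob_hash(b"hello world!\n")

print(original)              # 40 hexadecimal characters
print(modified)              # any change in content gives a different ID
print(original != modified)
```

Because the file name is not part of the input, two files with the same content share one blob.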

🌳 Trees

What are trees?

  • Represent filesystem structure or directory listings
  • Reference other trees (subdirectories) or blobs (files) by their hashes
  • Each tree has its own unique SHA-1 hash

How they work:

  • Trees can contain other trees, representing nested directories
  • They maintain the structure and organization of your project
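
The tree format can be sketched as follows: each entry is `<mode> <name>`, a NUL byte, and the 20 raw bytes of the referenced object's hash. The helper names and the file names below are illustrative assumptions. The sort is simplified to plain byte order, whereas Git compares directory names as if they ended in `/`.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> str:
    """Hash an object body with Git's "<type> <size>\\0" header."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_id) tuples into a tree object body."""
    body = b""
    # Simplified ordering: plain byte order of the entry names.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    return body

readme = object_id(b"blob", b"hello world\n")

# A subdirectory is just another tree, referenced by its hash with mode 40000.
src_tree = object_id(b"tree", tree_body([("100644", "main.py", readme)]))
root_body = tree_body([
    ("100644", "README.md", readme),
    ("40000", "src", src_tree),
])
root_tree = object_id(b"tree", root_body)
print(root_tree)
```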

[Figure: Tree structure]

📝 Commits

What are commits?

  • Represent a complete snapshot of your repository at a specific point in time
  • Combine metadata with a pointer to the root tree

Commit contents:

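  • Author and committer: Name and email of the person who wrote the change and of the person who created the commit (usually the same person)
  • Timestamps: When the change was authored and when the commit was created
  • Commit message: Description of changes
  • Parent pointers: Hashes of the previous commits (the first commit has none; merge commits have two or more)
  • Tree reference: Hash of the root tree object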
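
A commit object is plain text, so its layout can be sketched in a few lines. The names, email addresses, and timestamps below are invented for the example, and `commit_body` and `commit_id` are hypothetical helpers rather than Git functions.

```python
import hashlib

def commit_body(tree, parents, author, committer, message) -> bytes:
    """Build the text of a commit object: header lines, blank line, message."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]
    lines.append("author " + author)
    lines.append("committer " + committer)
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")

def commit_id(body: bytes) -> str:
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # hash of a tree with no entries
alice = "Alice Example <alice@example.com> 1700000000 +0000"
bob = "Bob Example <bob@example.com> 1700000500 +0000"

first = commit_id(commit_body(EMPTY_TREE, [], alice, alice, "Initial commit"))
second = commit_id(commit_body(EMPTY_TREE, [first], alice, alice, "Second commit"))
# Same tree and message, different author and timestamp: a different commit ID.
by_bob = commit_id(commit_body(EMPTY_TREE, [], bob, bob, "Initial commit"))

print(first, second, by_bob, sep="\n")
```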

[Figure: Commit structure]

Important notes:

  • Commits reference entire snapshots, not just diffs from previous commits
  • Each commit is identified by a SHA-1 hash (the same one shown in git log)

How changes propagate:

  • Updating a file creates a new blob with a different hash
  • That changes the hash of every tree that contains the file, up to the root tree
  • Which in turn changes the hash of the commit that references the root tree
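
The chain reaction can be observed with a toy snapshot of a single file. This is a sketch under simplifying assumptions: one file called `notes.txt`, a fixed made-up author, and a hypothetical `snapshot` helper.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> str:
    header = kind + b" " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def snapshot(file_content: bytes):
    """Return (blob, tree, commit) IDs for a one-file project."""
    blob = object_id(b"blob", file_content)
    tree = object_id(b"tree", b"100644 notes.txt\x00" + bytes.fromhex(blob))
    who = "A U Thor <author@example.com> 1700000000 +0000"
    commit_text = "tree {}\nauthor {}\ncommitter {}\n\nsnapshot\n".format(tree, who, who)
    commit = object_id(b"commit", commit_text.encode("utf-8"))
    return blob, tree, commit

before = snapshot(b"version 1\n")
after = snapshot(b"version 2\n")

# Only the file content differs, yet all three IDs change.
for kind, old, new in zip(("blob", "tree", "commit"), before, after):
    print(kind, old[:8], "->", new[:8])
```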

[Figure: Hash chain effect]

Efficient storage:

  • Only modified files get new blobs
  • Unchanged files are referenced, not duplicated
  • New commits reference their parent commits
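
A rough model of why unchanged files cost nothing extra: if objects live in a store keyed by their hash, writing the same content twice lands on the same key. The in-memory `store` dictionary and the helper names are assumptions made for this sketch, and only flat directories are handled.

```python
import hashlib

store = {}  # object ID -> (type, body); stands in for .git/objects

def put(kind: bytes, body: bytes) -> str:
    header = kind + b" " + str(len(body)).encode("ascii") + b"\x00"
    oid = hashlib.sha1(header + body).hexdigest()
    store[oid] = (kind, body)  # identical content maps to the same key
    return oid

def write_snapshot(files: dict) -> str:
    """Store every file as a blob and return the ID of a flat tree listing them."""
    body = b""
    for name in sorted(files):
        blob = put(b"blob", files[name])
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob)
    return put(b"tree", body)

first = write_snapshot({"a.txt": b"alpha\n", "b.txt": b"beta\n"})
count_after_first = len(store)    # 2 blobs + 1 tree

second = write_snapshot({"a.txt": b"alpha\n", "b.txt": b"beta v2\n"})
count_after_second = len(store)   # only 1 new blob + 1 new tree were added

print(count_after_first, count_after_second)
```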

[Figure: Efficient storage]

Hash uniqueness: Two different people who create identical files get the same blob hashes, and the same tree hashes if the file names and modes also match. Their commit hashes still differ because the author information and timestamps differ.


🌿 Branches

What are branches?

  • Named references to specific commits
  • Lightweight pointers that move as you create new commits

[Figure: Branch structure]

How branches work:

  • HEAD records your currently active branch (or, in a detached HEAD state, a commit directly)
  • git checkout <branch> points HEAD at that branch
  • Creating a commit moves the pointer of the branch HEAD refers to, whether that is master or any other branch
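
The lookup from HEAD to a commit can be sketched by reading two small text files. This assumes the branch is stored as a loose file under refs/heads/ (real repositories may also keep refs in a packed-refs file, which this sketch ignores). `resolve_head` and the placeholder hash are invented for the example.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir: Path):
    """Return (branch name or None, commit hash or None) for a .git directory."""
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD: the file holds a commit hash directly
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    commit = ref_file.read_text().strip() if ref_file.exists() else None
    prefix = "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    return branch, commit

PLACEHOLDER_COMMIT = "0123456789abcdef0123456789abcdef01234567"  # not a real commit

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    (git_dir / "refs" / "heads" / "master").write_text(PLACEHOLDER_COMMIT + "\n")
    on_branch = resolve_head(git_dir)

    (git_dir / "HEAD").write_text(PLACEHOLDER_COMMIT + "\n")
    detached = resolve_head(git_dir)

print(on_branch)
print(detached)
```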

[Figure: Branch commits]

🔄 Changes and Workflow

Repository structure:

  • Repository: The .git folder, which holds all commits and other objects
  • Working directory: The checked-out project files you edit, next to the .git folder
  • Staging area (index): Where changes are prepared before committing

[Figure: Git workflow]

File states:

  • Tracked: Files present in the previous commit or added to the staging area
  • Untracked: New files Git doesn’t know about yet

Changes are registered in the index (staging area) using git add.
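
The index is a binary file, and its first 12 bytes are a header: the signature `DIRC`, a version number, and the number of entries, each stored big-endian. The sketch below parses only that header. `parse_index_header` is a hypothetical helper, and the sample bytes are built by hand rather than read from a real repository.

```python
import struct

def parse_index_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count

# Hand-built header: version 2, three staged entries.
sample = struct.pack(">4sII", b"DIRC", 2, 3)
version, entry_count = parse_index_header(sample)
print(version, entry_count)
```

Inside a repository, the same function could be fed `open(".git/index", "rb").read()`.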

📁 .git Directory Structure

The .git directory contains everything Git needs:

.git/
├── HEAD (file)
├── index (file)
├── objects/
│   ├── 11/
│   │   └── 8f108d76b16a058db9fcb385a5fd640b54e47a
│   └── [other hash folders...]
└── refs/
    └── heads/
        └── master (file)

Directory components:

  • objects/: Stores all Git objects (blobs, trees, commits)
    • Subdivided by the first two characters of the hash, which keeps each directory small
  • refs/: Directory for references
    • heads/: Contains one file per branch, holding the hash of the commit the branch points to
      • master: File containing the hash of the latest commit on the master branch
  • HEAD: Points to the currently active branch
    • Contains content like ref: refs/heads/master
  • index: Binary file that represents the staging area
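
The two-character split can be written as a one-line path rule. `loose_object_path` is a made-up helper name, and the hash is the one from the directory listing above.

```python
from pathlib import PurePosixPath

def loose_object_path(object_id: str) -> PurePosixPath:
    """Map a 40-character object ID to its location under .git/objects."""
    return PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]

path = loose_object_path("118f108d76b16a058db9fcb385a5fd640b54e47a")
print(path)  # .git/objects/11/8f108d76b16a058db9fcb385a5fd640b54e47a
```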

🛠️ Git Commands

Basic Object Inspection

# Get the type of an object from its hash
git cat-file -t <hash>

# Print the content of an object from its hash
git cat-file -p <hash>

Working with Hashes

Generate and store hashes:

# Get hash of string
echo "git is awesome" | git hash-object --stdin

# Get hash and store as object in Git database
echo "git is awesome" | git hash-object --stdin -w

This creates a blob object stored as:

objects/
└── 11/
    └── 8f108d76b16a058db9fcb385a5fd640b54e47a
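
What the `-w` flag does can be approximated in Python: hash the header plus content, compress the same bytes with zlib, and write them under a path derived from the hash. This is a sketch that writes into a temporary directory, not a real repository, and `write_blob` is a hypothetical name.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def write_blob(objects_dir: Path, content: bytes) -> str:
    """Store content as a loose blob object and return its ID."""
    payload = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
    object_id = hashlib.sha1(payload).hexdigest()
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(payload))  # header and content are compressed together
    return object_id

with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    object_id = write_blob(objects_dir, b"hello world\n")
    stored = sorted(
        p.relative_to(objects_dir).as_posix()
        for p in objects_dir.rglob("*") if p.is_file()
    )

print(object_id)
print(stored)
```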

Retrieve object information:

# Get the object type for a hash
git cat-file -t <hash>

# Print the content for a hash
git cat-file -p <hash>

# Save the content to a file
git cat-file -p <hash> > hello.txt
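
Reading a loose object back is the reverse: decompress the file, then split the header from the content at the first NUL byte. The sketch below writes one blob by hand into a temporary directory so there is something to read. `read_object` is a hypothetical helper that returns raw bytes, whereas `git cat-file -p` pretty-prints trees.

```python
import tempfile
import zlib
from pathlib import Path

def read_object(objects_dir: Path, object_id: str):
    """Return (type, content) of a loose object, roughly cat-file -t and -p."""
    path = objects_dir / object_id[:2] / object_id[2:]
    raw = zlib.decompress(path.read_bytes())
    header, _, body = raw.partition(b"\x00")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), body

with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    object_id = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # blob ID of b"hello world\n"
    target = objects_dir / object_id[:2] / object_id[2:]
    target.parent.mkdir(parents=True)
    target.write_bytes(zlib.compress(b"blob 12\x00hello world\n"))

    kind, content = read_object(objects_dir, object_id)

print(kind)
print(content)
```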

Note: A new blob is created whenever you add a file to the staging area with git add.

Staging Operations

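# Manually add a blob to the staging area (100644 is the mode of a regular, non-executable file)
git update-index --add --cacheinfo 100644 <hash> <filename>

This creates the index file if it does not exist yet and records the entry in it.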

Committing Process

# Create a tree object from the current index (staging area)
git write-tree

This returns the hash of the root tree, which is stored in the objects folder.

Inspect the tree:

# Check the tree's type and content
git cat-file -t <tree-hash>
git cat-file -p <tree-hash>

Create commit:

# Commit the tree (omit -p for the first commit, which has no parent)
git commit-tree <tree-hash> -m "commit message" -p <parent-hash>

Managing HEAD and Branches

Update branch pointer:

# Point master to the latest commit
echo "<commit-hash>" > .git/refs/heads/master

Branch operations:

  • Create branch: Add a file in .git/refs/heads/ containing the commit hash
  • Switch branch: Change the HEAD file content to ref: refs/heads/<branch-name>
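
Both operations are small file writes, as this sketch shows on a throwaway directory. The function names, the branch name `feature`, and the placeholder hash are assumptions. It only moves pointers, while a real branch switch also updates the index and the working directory.

```python
import tempfile
from pathlib import Path

def create_branch(git_dir: Path, name: str, commit_hash: str) -> None:
    """Create a branch by writing the commit hash into refs/heads/<name>."""
    (git_dir / "refs" / "heads" / name).write_text(commit_hash + "\n")

def switch_branch(git_dir: Path, name: str) -> None:
    """Point HEAD at the given branch."""
    (git_dir / "HEAD").write_text("ref: refs/heads/" + name + "\n")

PLACEHOLDER_COMMIT = "0123456789abcdef0123456789abcdef01234567"  # not a real commit

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)

    create_branch(git_dir, "feature", PLACEHOLDER_COMMIT)
    switch_branch(git_dir, "feature")

    head_content = (git_dir / "HEAD").read_text()
    branch_content = (git_dir / "refs" / "heads" / "feature").read_text()

print(head_content, end="")
print(branch_content, end="")
```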

🗜️ Compression

Git compresses objects with zlib before writing them to disk:

  • zlib implements DEFLATE, which combines LZ77 and Huffman coding
  • Compression is lossless, so the original content is restored exactly
  • Repetitive content such as source code shrinks considerably
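
A quick way to see the effect is to compress an object payload with Python's zlib module and restore it. The repeated sample text is made up, and real savings depend on the content.

```python
import zlib

content = b"git is awesome\n" * 200
payload = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content

compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)

print(len(payload), "->", len(compressed), "bytes")
print(restored == payload)  # lossless: the round trip gives back the exact bytes
```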

📚 References

Understanding Git’s internal architecture helps you work more effectively with version control and troubleshoot issues when they arise.

This post is licensed under CC BY 4.0 by the author.