Lin Clark

Writing scripts to clone contrib projects from git.drupal.org

Submitted by Lin on

As part of my master's thesis, I’m looking into innovation diffusion within Drupal (I think y'all know which one ;).

In order to ‘grep’ against the codebase, I needed a copy of the most recent versions of ALL of Drupal 7 contrib, which was easy to do back when we used CVS (even a little too easy when you realized that you accidentally started downloading 7,000 projects). However, this is not so easy with git.

Before starting, I asked in #drupal-contribute whether pulling all of contrib for research purposes would be frowned upon, since it can really soak up a lot of server resources. Folks said it was ok and encouraged me to share whatever script I came up with... so here it is.

Note: There are lots of different ways to do this and this one may not be the optimal way. Any thoughts are welcome.

Getting a list of projects

First I wanted to get my list of projects. I only wanted projects that have 7.x branches, which are easy to find.

In order to pull out the parts I need, the project names, I wrote a scraper with ScraperWiki.

The neat thing about ScraperWiki is that it enables groups of people to collaborate on scraper code. When the scraper runs, the results can be saved to an SQLite database and then can be queried by constructing a URL that contains an SQL query.

While this is overkill for just pulling out the project names from the page (I could just have used a simple script like rfay demonstrates), the ScraperWiki approach can be extended and worked on collaboratively. For example, if we want to have additional information about a project, such as its dependencies, we can extend the scraper to access drupalcode.org and get that information. Then people could easily search contrib to see which modules depend on their own modules.

So once I ran the scraper, my list of projects was available. Next I just needed to:

  1. clone them all to my computer
  2. make sure to check out the newest 7.x branch

Since I spend most of my day in PHP, I wrote a PHP script for this. I run this script from the command line.

Cloning the projects

First I use curl to run my query against ScraperWiki to get my list.

Getting the list of projects

// Initialize session and set URL.
$ch = curl_init();
$url = "http://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsondict&name=drupal_7x_full_projects&query=select%20*%20from%20swdata%20where%20project_name%20LIKE%20'a%25%25'";
curl_setopt($ch, CURLOPT_URL, $url);

// Set so curl_exec returns the result instead of outputting it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Get the response and close the channel.
$response = curl_exec($ch);
curl_close($ch);

$projects = json_decode($response);

The only part that you might want to change from the above is the URL. It runs the SQL query against the data in ScraperWiki. You can write your SQL query, test it, and get the correct URI with the External API tool.

The URL I used runs a query that just retrieves any project that starts with the letter 'a' (you can click it in the code above to see). I did this because I wanted to be able to download the projects in chunks. I just changed the letter 'a' to 'b' and ran the script again to get the next chunk.
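The letter-by-letter chunking can also be scripted rather than edited by hand. What follows is a minimal sketch, assuming the same ScraperWiki endpoint, scraper name, and table as in the URL above. The function name scraperwiki_query_url is hypothetical, and the sketch uses a single % wildcard in the LIKE clause and lets http_build_query handle the URL encoding.

```php
<?php
// Hypothetical helper: builds the ScraperWiki query URL for all projects
// whose name starts with $prefix. The endpoint, scraper name, and table
// are taken from the URL used in the post.
function scraperwiki_query_url($prefix) {
  $params = array(
    'format' => 'jsondict',
    'name' => 'drupal_7x_full_projects',
    'query' => "select * from swdata where project_name LIKE '" . $prefix . "%'",
  );
  // PHP_QUERY_RFC3986 encodes spaces as %20 rather than +.
  return 'http://api.scraperwiki.com/api/1.0/datastore/sqlite?'
    . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Print one query URL per starting letter.
foreach (range('a', 'z') as $letter) {
  print(scraperwiki_query_url($letter) . "\n");
}
```

Each printed URL could then be passed to the curl code above in place of the hard-coded $url.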

Next I iterated through the project list and cloned each one.

Cloning the projects

foreach ($projects as $project) {
  $git_url = $project->git_url;
  // Escape the URL so that unexpected characters can't break the command.
  shell_exec("git clone --no-checkout " . escapeshellarg($git_url));
}

Checkout the most recent branch

I know that each of these projects has a 7.x branch, but I’m not sure what the version number of the most recent one is. To ensure that I’m on the most recent version, I just try everything from 7.x-9.x down to 7.x-1.x (I believe the highest I found was 4.x). Once one of those works and I’m able to switch to the branch, I stop trying. If none of them work, I switch to master.

I did this as part of the previous for loop, but it can also be its own for loop. One thing to note is that I have to include '2>&1' in my checkout command in order to capture the status message: git writes it to stderr, and shell_exec only returns stdout. Otherwise, $checkout would be null.

Checking out the newest 7.x branch

foreach ($projects as $project) {
  // Skip projects whose clone failed, so git never runs in the wrong directory.
  if (!is_dir($project->project_name)) {
    continue;
  }
  chdir($project->project_name);

  $i = 9;
  $branch_set = FALSE;

  // Try 7.x-9.x down to 7.x-1.x and stop at the first branch that exists.
  while ($i > 0 && !$branch_set) {
    $checkout = shell_exec("git checkout 7.x-" . $i . ".x 2>&1");
    print($checkout);
    if (preg_match('/Switched to branch|set up to track remote branch|Already on/', $checkout)) {
      $branch_set = TRUE;
      break;
    }
    $i--;
  }

  // Fall back to master if no 7.x branch could be checked out.
  if (!$branch_set) {
    $checkout = shell_exec("git checkout master");
  }

  chdir("../");
}
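An alternative to matching git's status messages is to look at the exit status of the command, since git checkout exits with a non-zero status when the branch does not exist. The following is only a sketch of that idea, not the script I ran: the function name checkout_first_available and the $run callback are hypothetical, and the callback exists so the logic can be tried without touching a real repository.

```php
<?php
// Hypothetical helper: returns the first branch in $candidates that $run
// reports as checked out, or NULL if none succeeds. $run takes a shell
// command and returns its exit status (0 means success).
function checkout_first_available(array $candidates, callable $run) {
  foreach ($candidates as $branch) {
    if ($run('git checkout ' . escapeshellarg($branch) . ' 2>&1') === 0) {
      return $branch;
    }
  }
  return NULL;
}

// Runner that executes the command and hands back its exit status.
$run_shell = function ($command) {
  exec($command, $output, $status);
  return $status;
};

// Candidate branches from 7.x-9.x down to 7.x-1.x, then master as fallback.
$candidates = array();
for ($i = 9; $i > 0; $i--) {
  $candidates[] = '7.x-' . $i . '.x';
}
$candidates[] = 'master';

// Usage, from inside a cloned project directory:
// $branch = checkout_first_available($candidates, $run_shell);
```

Putting master at the end of the candidate list folds the fallback into the same loop.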

Any thoughts?

I think the ScraperWiki script could be extended to provide some neat filtering options. For instance, getting a list of modules that depend on my modules, or downloading every module that defines a Views style plugin. But there may be a better way to do this. Any ideas?
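Until the scraper can do that, the dependency question can be answered against the local clones, since Drupal 7 modules declare their dependencies in .info files with lines such as dependencies[] = entity. Here is a rough sketch under some assumptions: the function name find_dependents is hypothetical, the directory layout is the one the clone script produces (one directory per project), and paths use forward slashes.

```php
<?php
// Hypothetical helper: returns the names of projects under $contrib_dir
// that have a .info file declaring a dependency on $module.
function find_dependents($contrib_dir, $module) {
  $contrib_dir = rtrim($contrib_dir, '/');
  $dependents = array();
  // Matches lines like: dependencies[] = entity  or  dependencies[] = entity (>=1.0)
  // The \b keeps "entity" from matching "entity_token" or "entityreference".
  $pattern = '/^\s*dependencies\[\]\s*=\s*"?' . preg_quote($module, '/') . '\b/m';
  $files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($contrib_dir, FilesystemIterator::SKIP_DOTS)
  );
  foreach ($files as $file) {
    if ($file->getExtension() !== 'info') {
      continue;
    }
    if (preg_match($pattern, file_get_contents($file->getPathname()))) {
      // The first path component below $contrib_dir is the project name.
      $relative = substr($file->getPathname(), strlen($contrib_dir) + 1);
      $parts = explode('/', $relative);
      $dependents[$parts[0]] = TRUE;
    }
  }
  $names = array_keys($dependents);
  sort($names);
  return $names;
}

// Usage, with a hypothetical path to the directory holding the clones:
// print_r(find_dependents('/path/to/contrib', 'entity'));
```

This walks every file in every clone, so it is slow on all of contrib, but it only needs to run once per question.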

Downloads

  • Script file
  • Where's the best place to upload a 1.12GB download to share with everyone?

Comments

So since they are all

So since they are all separate repositories there really is no "master command" for downloading them? That is a bit disappointing, but Git is such a huge improvement in other ways that it doesn't matter much.

git branch

Maybe it would be more general to use git branch to find the latest branch, instead of trial and error with a hard-coded 7.x-9.x maximum.

One problem with that is that

One problem with that is that the latest branch could be a feature branch and not the main 7.x development branch. For example, Media module has a bunch of feature branches that are synced with the main repo.

Exclude feature branches from consideration

Exclude feature branches from consideration:

$major = 7;
$needle = "origin/" . $major . ".x-";
$module_releases = array();

$branches = explode("\n", shell_exec("git branch -r"));
foreach ($branches as $branch) {
  // exclude feature branches not starting with 7.x
  $pos = strpos($branch, $needle);
  if ($pos !== FALSE) {
    $module_release = substr($branch, $pos + strlen($needle));
    // exclude other feature branches: keep only names that end in ".x"
    if (strlen($module_release) == strpos($module_release, ".x") + 2) {
      $module_releases[] = $module_release;
    }
  }
}
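To go from that list to a single branch name, the release numbers can be compared numerically. The following is a sketch that swaps the strpos arithmetic for a regular expression; the function name latest_core_branch and the sample branch names are made up for illustration.

```php
<?php
// Hypothetical helper: given the raw output of `git branch -r`, returns the
// highest "<major>.x-N.x" branch name, or NULL if there is none.
function latest_core_branch($branch_output, $major = 7) {
  $latest = NULL;
  foreach (explode("\n", $branch_output) as $line) {
    // Match only lines that are exactly origin/7.x-N.x, which skips feature
    // branches and the "origin/HEAD -> ..." pointer line.
    if (preg_match('#^\s*origin/' . $major . '\.x-(\d+)\.x\s*$#', $line, $m)) {
      if ($latest === NULL || (int) $m[1] > $latest) {
        $latest = (int) $m[1];
      }
    }
  }
  return $latest === NULL ? NULL : $major . '.x-' . $latest . '.x';
}

// Made-up sample of `git branch -r` output.
$sample = "  origin/HEAD -> origin/master\n  origin/7.x-1.x\n  origin/7.x-2.x\n"
  . "  origin/7.x-2.x-feature\n  origin/6.x-3.x\n  origin/master\n";
print(latest_core_branch($sample) . "\n"); // prints 7.x-2.x
```

Comparing the captured numbers as integers means 7.x-10.x would sort above 7.x-9.x, which a plain string comparison would get wrong.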

Another consideration is that

Another consideration is that git branch doesn't seem to show all branches. This might be because I used the --no-checkout flag. When I tested with Media module on my local copy, it only showed master and 7.x-2.x, so the output wouldn't include every branch. When I checked out 7.x-1.x, that branch was then listed above the more recent 7.x-2.x.

git find modules

 

Hello, Lin! Thank you very much for your blog and screencasts. My question is a little bit off-topic, but I think you may know the answer. Is it possible to search through the contrib code in Drupal git? I want to do this to find modules with a particular dependency, for example searching for the string dependencies[] = entity to find all modules that really use Entity API. Is the only way to copy all the modules as you showed in this post? Thanks.