Developer Interface

This part of the documentation covers all the interfaces of Okra.

Parallel Computing

Joins turned out to be less than ideal in Spark.

Instead of using a central database, we solve the problem in a distributed way: each function returns an immutable type, and all of the work runs on a single server with roughly 32 cores. We also replace the parquet files with the SQLite databases.

Reference:

tbonza/EDS19/project/PredictRepoHealth.ipynb
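
A minimal sketch of that pattern using only the standard library; the job function and work items below are toy stand-ins, not okra's API:

    from multiprocessing import Pool

    def job(work_item: str) -> str:
        # A pure ("immutable") function: its result depends only on its
        # argument, so it can safely run on any core.
        return work_item.upper()

    if __name__ == "__main__":
        work = ["repo_a", "repo_b", "repo_c"]
        with Pool(processes=32) as pool:   # one server, many cores
            results = pool.map(job, work)  # the 'map' half of map/reduce
        print(results)                     # consolidation is the 'reduce' half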

okra.distributed.consolidate_features_target(cache: str, repo_id: str, report=100)[source]

Consolidates distributed computations in cache

We will have n feature tables and n target tables, which need to be consolidated into one feature table and one target table. This is the ‘reduce’ part of a ‘map/reduce’ pattern. To avoid confusion with Hadoop, let’s call this a ‘hack/reduce’ pattern.

Parameters

cache – str, full path to cache

Returns

X_features, X_target written to disk

Return type

None

okra.distributed.create_features_target(df, k=6)[source]

Create dataset from train/test/val and Y

okra.distributed.create_working_table(dal: okra.models.DataAccessLayer)[source]

Create a base table to derive all other features and aggregations

Parameters

dal – okra.models.DataAccessLayer, session must be instantiated

Returns

base table for analysis

Return type

pd.DataFrame
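
A hypothetical usage sketch; the DataAccessLayer constructor and connect method shown here are assumptions, since only the requirement that a session exists is documented:

    from okra.distributed import create_working_table
    from okra.models import DataAccessLayer

    dal = DataAccessLayer("sqlite:///repo.db")  # assumed: takes a database url
    dal.connect()                               # assumed: instantiates dal.session
    base = create_working_table(dal)            # pd.DataFrame base table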

okra.distributed.getwork_dbinfo(cache: str)[source]

Defines work for immutable function, write_features_target

Parameters

cache – str, full path to cache

Returns

dbinfo, (dburl, repo_name, cache)

Return type

list of tuples

okra.distributed.num_authors(df, period: str)[source]

Number of authors in a given time period.

Parameters
  • df – pd.DataFrame, base table

  • period – str, ‘Y’, year

Returns

df aggregated by period

Return type

pd.DataFrame
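
A rough pandas equivalent of the aggregation described above; the 'authored' datetime column and 'author_name' column are assumed names, not documented fields:

    import pandas as pd

    def num_authors_sketch(df: pd.DataFrame, period: str = "Y") -> pd.DataFrame:
        # Count distinct authors within each calendar period.
        grouped = df.groupby(df["authored"].dt.to_period(period))
        return grouped["author_name"].nunique().reset_index(name="num_authors")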

okra.distributed.num_mentors(df, period: str, subperiod: str, k: int)[source]

Number of authors in a larger time period who have also made commits in k number of smaller time periods.

For example, number of authors in a year who have also committed changes in each month of that year.

Parameters
  • df – pd.DataFrame, base table

  • period – ‘Y’, year

  • subperiod – ‘M’, month

  • k – int, authors exist at least k subperiods

Returns

df aggregated by period

Return type

pd.DataFrame
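
A sketch of this computation in pandas, under the same assumed column names as above:

    import pandas as pd

    def num_mentors_sketch(df: pd.DataFrame, period: str = "Y",
                           subperiod: str = "M", k: int = 6) -> pd.DataFrame:
        # For each (period, author) pair, count the distinct subperiods
        # in which that author committed.
        active = (df.groupby([df["authored"].dt.to_period(period), "author_name"])
                    ["authored"]
                    .apply(lambda s: s.dt.to_period(subperiod).nunique()))
        # Keep authors active in at least k subperiods; count them per period.
        mentors = active[active >= k].groupby(level=0).size()
        return mentors.reset_index(name="num_mentors")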

okra.distributed.num_orgs(df, period)[source]

Number of organizations committing to a repo in a time period

Organizations are found by extracting the domain of email addresses used by authors.

Parameters
  • df – pd.DataFrame, base table

  • period – str, ‘Y’ year time period

Returns

df aggregated by period

Return type

pd.DataFrame
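
The domain extraction might look like this in pandas; the 'author_email' column name is assumed:

    import pandas as pd

    def num_orgs_sketch(df: pd.DataFrame, period: str = "Y") -> pd.DataFrame:
        # Treat the domain of each author email as the organization.
        domains = df["author_email"].str.split("@").str[-1]
        grouped = domains.groupby(df["authored"].dt.to_period(period))
        return grouped.nunique().reset_index(name="num_orgs")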

okra.distributed.run_distributed_pool(n_cores: int, func, work: list)[source]

Run distributed pool across n_cores

Parameters
  • n_cores – number of cores available on the server

  • func – immutable function to execute work

  • work – input for immutable function

Returns

outputs of func applied to each item of work

Return type

list

okra.distributed.write_features_target(dbinfo: tuple, k=6) → bool[source]

Write out features and target dataframes to parquet.

Parameters

dbinfo – tuple, (dburl, repo_name, cache); see getwork_dbinfo

Returns

feature and target dataframes written to disk as parquet

Return type

bool
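
Putting the distributed pieces together, a hypothetical end-to-end run of the ‘hack/reduce’ pipeline; the core count, cache path, and repo_id value are illustrative assumptions:

    from okra.distributed import (consolidate_features_target, getwork_dbinfo,
                                  run_distributed_pool, write_features_target)

    cache = "/data/okra_cache/"
    work = getwork_dbinfo(cache)  # [(dburl, repo_name, cache), ...]
    run_distributed_pool(32, write_features_target, work)  # the 'hack' (map) step
    # The 'reduce' step: consolidate per-repo tables into one feature/target pair.
    consolidate_features_target(cache, "tensorflow/tensorflow")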

GitHub Repository Management

GitHub Repo Management

Related to downloading and updating GitHub repos. See the ‘assn1’ script in bin/assn1, which handles the get and update features.

okra.repo_mgmt.clone_repo(repo_name: str, dirpath: str, ssh=False) → bool[source]

Clone GitHub repo.

okra.repo_mgmt.compress_repo(repo_name: str, cache: str, repo_comp: str) → bool[source]

Compress repo for upload.

Parameters
  • repo_name – git repo name with owner included; tensorflow/tensorflow

  • cache – directory path containing the git repo

  • repo_comp – name of the compressed output file

Returns

creates a compressed file of the GitHub repo

Return type

bool, True if the git repo was successfully compressed

okra.repo_mgmt.create_parent_dir(repo_name: str, dirpath: str) → bool[source]

Create parent directory before cloning repo.

https://github.com/tbonza/EDS19/issues/8
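
A sketch of a typical clone-then-update cycle; the repo name and paths are illustrative:

    from okra.repo_mgmt import clone_repo, create_parent_dir, update_repo

    repo_name, dirpath = "tensorflow/tensorflow", "/data/repos/"
    create_parent_dir(repo_name, dirpath)    # make the 'tensorflow/' parent dir
    if clone_repo(repo_name, dirpath, ssh=False):
        update_repo(repo_name, dirpath)      # later: pull in new commits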

okra.repo_mgmt.decompress_repo(repo_comp: str, cache) → bool[source]

Decompress repo to a directory.

Parameters
  • repo_comp – path to the compressed repo file

  • cache – directory in which to place the uncompressed repo

Returns

Uncompresses the file and writes ‘git_owner_name/git_repo_name’ to the specified directory.

Return type

bool

Raises

okra.error_handling.DirectoryNotCreatedError
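
A hypothetical round trip; the compressed file name format is an assumption:

    from okra.repo_mgmt import compress_repo, decompress_repo

    # Compress a previously cloned repo out of the cache, then restore it.
    compress_repo("tensorflow/tensorflow", "/data/cache/",
                  "tensorflow__tensorflow.tar.gz")  # assumed naming scheme
    decompress_repo("tensorflow__tensorflow.tar.gz", "/data/cache/")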

okra.repo_mgmt.gcloud_clone_or_fetch_repo(repo_name: str, ssh=False) → bool[source]

Clone or fetch updates from git repo

GCloud operations only work on one repository at a time, so we don’t have to use a parent directory.

Parameters
  • repo_name – ‘<repo owner>/<repo name>’

  • ssh – bool, clone using SSH (default False)

Returns

clones the repo, or fetches updates if it already exists

Return type

bool

okra.repo_mgmt.read_repos(fpath: str) → list[source]

Read list of repos from disk

okra.repo_mgmt.update_repo(repo_name: str, dirpath: str) → bool[source]

Update repo with new code.

Git Log Management

Handle the data requirements for Assignment 1

References:

Assignment 1: http://janvitek.org/events/NEU/6050/a1.html
Git log formatting: https://git-scm.com/docs/pretty-formats

okra.gitlogs.extract_data_main(fpath: str, dirpath: str)[source]

Extract data as requested in Assignment 1.

okra.gitlogs.parse_commits(rpath: str, chash='')[source]

Yields a protocol buffer of git commit information.

commits.csv collects basic information about commits.

Parameters
  • rpath – path to git repository

  • chash – optional param, retrieve all commits since commit hash

Returns

okra.protobuf.assn1_pb2.Commit

Return type

generator, protocol buffer
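
For example, iterating the generator; the repository path is illustrative:

    from okra.gitlogs import parse_commits

    # Each item is an okra.protobuf.assn1_pb2.Commit protocol buffer.
    for commit in parse_commits("/data/repos/tensorflow/tensorflow"):
        print(commit)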

okra.gitlogs.parse_committed_files(rpath: str, chash='')[source]

Parse file format from git log tool.

Parameters
  • rpath – path to git repository

  • chash – optional param, retrieve all commits since commit hash

Returns

okra.protobuf.assn1_pb2.Message

Return type

generator, protocol buffer

okra.gitlogs.parse_inventory(rpath: str, repo_name: str)[source]

Return inventory information for a git repo.

Parameters
  • rpath – repository path

  • repo_name – git repo name with owner included

Returns

inventory message object

Return type

okra.protobuf.assn1_pb2.Inventory

okra.gitlogs.parse_messages(rpath: str, chash='')[source]

Yields a protocol buffer of a git commit message.

messages.csv collects commit messages and their subject.

Parameters
  • rpath – path to git repository

  • chash – optional param, retrieve all commits since commit hash

Returns

okra.protobuf.assn1_pb2.Message

Return type

generator, protocol buffer

okra.gitlogs.write_line_commits(parsed_commits)[source]

Generate a line for each git commit.

okra.gitlogs.write_line_files(parsed_files)[source]

Generate a line for each git filepath message.

okra.gitlogs.write_line_messages(parsed_messages)[source]

Generate a line for each git commit message.
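
A sketch of composing the parsers with the line writers; it assumes the write_line_* functions are generators yielding one CSV-ready line per message:

    from okra.gitlogs import parse_messages, write_line_messages

    rpath = "/data/repos/tensorflow/tensorflow"
    for line in write_line_messages(parse_messages(rpath)):
        print(line)  # one line per commit message, ready for messages.csv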

Database Models

SQL Database Models

This is the database schema used for accessing the SQL database.

class okra.models.Author(**kwargs)[source]

The author email column is nullable; not all authors have a GitHub account, so an email is not required.

class okra.models.CommitFile(**kwargs)[source]
class okra.models.Contrib(**kwargs)[source]
class okra.models.Info(**kwargs)[source]
class okra.models.Inventory(**kwargs)[source]
class okra.models.Meta(**kwargs)[source]

Okra Exceptions

Custom errors we can expect.

References:

https://docs.python.org/3/tutorial/errors.html

exception okra.error_handling.DirectoryNotCreatedError(expression, message)[source]

Exception raised when directory unable to be created.

Parameters
  • expression – input expression in which error occurred

  • message – explanation of error

exception okra.error_handling.Error[source]

Base class for exceptions in okra

exception okra.error_handling.MissingEnvironmentVariableError(expression, message)[source]

Exception raised when a mandatory environment variable is missing.

Parameters
  • expression – input expression in which error occurred

  • message – explanation of error

exception okra.error_handling.NetworkError(expression, message)[source]

Exception raised for errors related to network requests

Parameters
  • expression – input expression in which the error occurred

  • message – explanation of error
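
For example, raising and handling one of these exceptions; the attribute names mirror the constructor parameters, which is an assumption:

    from okra.error_handling import DirectoryNotCreatedError

    try:
        raise DirectoryNotCreatedError("mkdir /data/repos/tensorflow",
                                       "unable to create parent directory")
    except DirectoryNotCreatedError as err:
        print(err.expression, err.message)  # assumed attribute names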

Git Playbooks

Playbooks for running full analyses

okra.playbooks.local_persistance(repo_name: str, parent_dir: str, buffer_size=4096)[source]

Collect relevant data for locally cloned repos.

Parameters
  • repo_name – name of git repository, <linux>

  • parent_dir – parent directory path, </home/user/code/>

  • buffer_size – number of records processed before db commit

Returns

populate sqlite database for a repo

Return type

None
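
Usage follows directly from the parameters; the paths are illustrative:

    from okra.playbooks import local_persistance

    # Parse the 'linux' repo cloned under /home/user/code/ into its SQLite db.
    local_persistance("linux", "/home/user/code/", buffer_size=4096)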

okra.playbooks.simple_version_truck_factor(repos: list, dirpath: str, dburl: str, b: int)[source]

Simple version of the truck factor analysis.

This is a basic, proof-of-concept version of the truck factor analysis which does not attempt to run at scale. It writes out a csv file with the truck factor of each repository, which you can use for further analysis in R.

Parameters
  • repos – repo queue with value format ‘<repo owner>/<repo name>’

  • dirpath – path to working directory to store git repos

  • dburl – database url (ex. ‘sqlite:///:memory:’)

  • b – batch size (ex. 1024)

Returns

outputs truck factor analysis as csv file

Return type

None, writes out csv file
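
A hypothetical invocation, using the example values from the parameter list:

    from okra.playbooks import simple_version_truck_factor

    repos = ["tensorflow/tensorflow", "torvalds/linux"]
    simple_version_truck_factor(repos, "/data/repos/", "sqlite:///:memory:", 1024)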

Polite GitHub Repo Retrieval

A polite approach to downloading GitHub repos.

Previous approaches to downloading GitHub repositories ran in parallel on Kubernetes using Redis, which both GitHub and Google considered too aggressive. This approach is meant to be polite.

okra.be_nice.okay_benice(qpath: str, ssh=True)[source]

Polite GitHub repo retrieval

Includes an option to clone repos using SSH, which is recommended if you’re requesting a large number of repositories (> 1000). Creates and populates an SQLite database with pre-specified git log info. qpath points to a text file of repository names from GitTorrent, queried using Google BigQuery.

Parameters
  • qpath – path to queue of repository names.

  • ssh – bool, default is True

Returns

writes SQLite database with parsed git log info to disk

Return type

None
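
A minimal invocation sketch; the queue file path is illustrative:

    from okra.be_nice import okay_benice

    # Politely clone and parse every repo listed in the queue file.
    okay_benice("/data/queues/repos.txt", ssh=True)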