Developer Interface¶
This part of the documentation covers all the interfaces of Okra.
Parallel Computing¶
Joins turned out to be less than ideal in Spark, so instead of using a central database we solve the problem in a distributed way: each function returns an immutable type, and the work runs on a single multi-core server (around 32 cores). The parquet files are dropped in favor of SQLite databases.
- Reference:
tbonza/EDS19/project/PredictRepoHealth.ipynb
okra.distributed.consolidate_features_target(cache: str, repo_id: str, report=100)[source]¶
Consolidates distributed computations in cache.
We have n feature tables and n target tables, which need to be consolidated into one feature table and one target table. This is the ‘reduce’ step of a ‘map/reduce’ pattern; to avoid confusion with Hadoop, let’s call it a ‘hack/reduce’ pattern.
- Parameters
cache – str, full path to cache
- Returns
X_features, X_target written to disk
- Return type
None
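The consolidation is a straightforward concatenation. A minimal sketch of the idea in pandas, assuming (hypothetically) that the cache directory holds one ‘<repo>_features.csv’ and one ‘<repo>_target.csv’ per repository; the real file layout may differ:

    import glob
    import os

    import pandas as pd

    def consolidate(cache: str) -> None:
        """Stack per-repo feature/target tables into one table of each."""
        # Hypothetical naming scheme for the distributed outputs.
        feature_parts = [pd.read_csv(p)
                         for p in glob.glob(os.path.join(cache, "*_features.csv"))]
        target_parts = [pd.read_csv(p)
                        for p in glob.glob(os.path.join(cache, "*_target.csv"))]
        # The 'reduce' step of the hack/reduce pattern: n tables become one.
        pd.concat(feature_parts, ignore_index=True).to_csv(
            os.path.join(cache, "X_features.csv"), index=False)
        pd.concat(target_parts, ignore_index=True).to_csv(
            os.path.join(cache, "X_target.csv"), index=False)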
okra.distributed.create_working_table(dal: okra.models.DataAccessLayer)[source]¶
Create a base table to derive all other features and aggregations.
- Parameters
dal – okra.models.DataAccessLayer, session must be instantiated
- Returns
base table for analysis
- Return type
pd.DataFrame
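A usage sketch; the DataAccessLayer setup shown here is an assumption (the docs only require that its session is instantiated before the call):

    from okra.models import DataAccessLayer
    from okra.distributed import create_working_table

    # Assumed initialization; the real constructor/session API may differ.
    dal = DataAccessLayer("sqlite:///repo.db")
    dal.connect()  # instantiate the session (assumed method name)

    base = create_working_table(dal)  # -> pd.DataFrame
    print(base.head())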
okra.distributed.getwork_dbinfo(cache: str)[source]¶
Defines work for immutable function, write_features_target.
- Parameters
cache – str, full path to cache
- Returns
dbinfo, (dburl, repo_name, cache)
- Return type
list of tuples
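A sketch of how such a work list could be built, assuming (hypothetically) one SQLite ‘.db’ file per repository in the cache:

    import glob
    import os

    def getwork_dbinfo(cache: str) -> list:
        """Build (dburl, repo_name, cache) work tuples from cached SQLite files."""
        dbinfo = []
        for path in glob.glob(os.path.join(cache, "*.db")):
            repo_name = os.path.splitext(os.path.basename(path))[0]
            dburl = "sqlite:///" + path  # SQLAlchemy-style database URL
            dbinfo.append((dburl, repo_name, cache))
        return dbinfo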
okra.distributed.num_authors(df, period: str)[source]¶
Number of authors in a given time period.
- Parameters
df – pd.DataFrame, base table
period – str, ‘Y’, year
- Returns
df aggregated by period
- Return type
pd.DataFrame
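A pandas sketch of the aggregation (the function name above is reconstructed from context; ‘authored_date’ and ‘author_email’ are assumed column names in the base table):

    import pandas as pd

    def num_authors(df: pd.DataFrame, period: str = "Y") -> pd.DataFrame:
        """Count unique authors per time period."""
        # Assumes 'authored_date' is a datetime column.
        return (df.groupby(pd.Grouper(key="authored_date", freq=period))["author_email"]
                  .nunique()
                  .reset_index(name="num_authors"))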
okra.distributed.num_mentors(df, period: str, subperiod: str, k: int)[source]¶
Number of authors in a larger time period who have also made commits in at least k of the smaller time periods.
For example, number of authors in a year who have also committed changes in each month of that year.
- Parameters
df – pd.DataFrame, base table
period – ‘Y’, year
subperiod – ‘M’, month
k – int, authors exist at least k subperiods
- Returns
df aggregated by period
- Return type
pd.DataFrame
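A pandas sketch of the idea, using the same assumed ‘authored_date’/‘author_email’ columns:

    import pandas as pd

    def num_mentors(df: pd.DataFrame, period: str = "Y",
                    subperiod: str = "M", k: int = 6) -> pd.DataFrame:
        """Count authors per period who committed in at least k subperiods."""
        d = df.copy()
        d["period"] = d["authored_date"].dt.to_period(period)
        d["subperiod"] = d["authored_date"].dt.to_period(subperiod)
        # Distinct subperiods in which each author committed, within each period.
        active = d.groupby(["period", "author_email"])["subperiod"].nunique()
        mentors = active[active >= k]
        return mentors.groupby(level="period").size().reset_index(name="num_mentors")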
okra.distributed.num_orgs(df, period)[source]¶
Number of organizations committing to a repo in a time period.
Organizations are found by extracting the domain of email addresses used by authors.
- Parameters
df – pd.DataFrame, base table
period – str, ‘Y’ year time period
- Returns
df aggregated by period
- Return type
pd.DataFrame
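A sketch of the domain-extraction approach, with the same assumed columns:

    import pandas as pd

    def num_orgs(df: pd.DataFrame, period: str = "Y") -> pd.DataFrame:
        """Count unique email domains (organizations) per time period."""
        d = df.copy()
        # An organization is approximated by the domain of the author's email.
        d["org"] = d["author_email"].str.split("@").str[-1]
        return (d.groupby(pd.Grouper(key="authored_date", freq=period))["org"]
                 .nunique()
                 .reset_index(name="num_orgs"))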
GitHub Repository Management¶
GitHub Repo Management
Related to downloading and updating GitHub repos. See the ‘assn1’ script in bin/assn1, which handles the get and update features.
okra.repo_mgmt.clone_repo(repo_name: str, dirpath: str, ssh=False) → bool[source]¶
Clone GitHub repo.
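A sketch of what such a clone helper typically does (not necessarily the library's exact implementation):

    import subprocess

    def clone_repo(repo_name: str, dirpath: str, ssh: bool = False) -> bool:
        """Clone a GitHub repo into dirpath; return True on success."""
        url = (f"git@github.com:{repo_name}.git" if ssh
               else f"https://github.com/{repo_name}.git")
        result = subprocess.run(["git", "clone", url], cwd=dirpath)
        return result.returncode == 0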
okra.repo_mgmt.compress_repo(repo_name: str, cache: str, repo_comp: str) → bool[source]¶
Compress repo for upload.
- Parameters
repo_name – git repo name with owner included; tensorflow/tensorflow
cache – str, full path to the cache directory containing the cloned repo
repo_comp – path of the compressed file to create
- Returns
creates a compressed file of github repo
- Return type
True if git repo successfully compressed
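A minimal sketch using tarfile, assuming the repo was cloned under the cache directory; the actual archive format is an assumption:

    import os
    import tarfile

    def compress_repo(repo_name: str, cache: str, repo_comp: str) -> bool:
        """Create a gzipped tarball of a cloned repo; return True on success."""
        repo_dir = os.path.join(cache, repo_name)  # e.g. cache/tensorflow/tensorflow
        try:
            with tarfile.open(repo_comp, "w:gz") as tar:
                tar.add(repo_dir, arcname=repo_name)  # keep 'owner/repo' layout
            return True
        except (OSError, tarfile.TarError):
            return False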
okra.repo_mgmt.create_parent_dir(repo_name: str, dirpath: str) → bool[source]¶
Create parent directory before cloning repo.
okra.repo_mgmt.decompress_repo(repo_comp: str, cache) → bool[source]¶
Decompress repo to a directory.
- Parameters
repo_comp – path to the compressed file to be decompressed
cache – directory path in which to place the uncompressed repo, with repo owner
- Returns
Uncompresses file and writes ‘git_owner_name/git_repo_name’ to the specified directory.
- Return type
boolean
- Raises
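A tarfile-based sketch of the inverse operation, assuming the same gzipped-tarball format as the compression sketch above:

    import tarfile

    def decompress_repo(repo_comp: str, cache: str) -> bool:
        """Extract 'git_owner_name/git_repo_name' into the cache directory."""
        try:
            with tarfile.open(repo_comp, "r:gz") as tar:
                tar.extractall(path=cache)
            return True
        except (OSError, tarfile.TarError):
            return False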
okra.repo_mgmt.gcloud_clone_or_fetch_repo(repo_name: str, ssh=False) → bool[source]¶
Clone or fetch updates from a git repo.
GCloud operations only work on one repository at a time, so we don’t have to use a parent directory.
- Parameters
repo_name – ‘<repo owner>/<repo name>’
ssh – bool, default is False
- Returns
current git repo
- Return type
None, file written to disk
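A sketch of the clone-or-fetch pattern this describes (illustrative, not the exact implementation):

    import os
    import subprocess

    def clone_or_fetch(repo_name: str, ssh: bool = False) -> bool:
        """Clone the repo if absent, otherwise fetch updates in place."""
        repo_dir = repo_name.split("/")[-1]  # one repo per workspace, no parent dir
        url = (f"git@github.com:{repo_name}.git" if ssh
               else f"https://github.com/{repo_name}.git")
        if os.path.isdir(repo_dir):
            result = subprocess.run(["git", "fetch"], cwd=repo_dir)
        else:
            result = subprocess.run(["git", "clone", url])
        return result.returncode == 0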
Git Log Management¶
Handle the data requirements for Assignment 1
- References:
Assignment 1: http://janvitek.org/events/NEU/6050/a1.html
Git log formatting: https://git-scm.com/docs/pretty-formats
okra.gitlogs.extract_data_main(fpath: str, dirpath: str)[source]¶
Extract data as requested in Assignment 1.
okra.gitlogs.parse_commits(rpath: str, chash='')[source]¶
Yields a protocol buffer of git commit information.
commits.csv collects basic information about commits.
- Parameters
rpath – path to git repository
chash – optional param, retrieve all commits since commit hash
- Returns
okra.protobuf.assn1_pb2.Commit
- Return type
generator, protocol buffer
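A sketch of how such a parser typically drives git log with a custom pretty format (see the pretty-formats reference above). For brevity it yields plain dicts rather than Commit protocol buffers, and the field layout is an assumption:

    import subprocess

    # Hypothetical field selection; the real assn1_pb2.Commit schema may differ.
    FMT = "%H|%an|%ae|%ad|%s"  # hash, author name, author email, date, subject

    def parse_commits(rpath: str, chash: str = ""):
        """Yield one record per commit, optionally only commits since chash."""
        cmd = ["git", "log", f"--pretty=format:{FMT}"]
        if chash:
            cmd.append(f"{chash}..HEAD")
        out = subprocess.run(cmd, cwd=rpath, capture_output=True, text=True)
        for line in out.stdout.splitlines():
            commit_hash, name, email, date, subject = line.split("|", 4)
            yield {"hash": commit_hash, "author": name, "email": email,
                   "date": date, "subject": subject}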
okra.gitlogs.parse_committed_files(rpath: str, chash='')[source]¶
Parse the file listing format from the git log tool.
- Parameters
rpath – path to git repository
chash – optional param, retrieve all commits since commit hash
- Returns
okra.protobuf.assn1_pb2.Message
- Return type
generator, protocol buffer
okra.gitlogs.parse_inventory(rpath: str, repo_name: str)[source]¶
Return inventory information for a git repo.
- Parameters
rpath – repository path
- Returns
inventory message object
- Return type
okra.protobuf.assn1_pb2.Inventory
okra.gitlogs.parse_messages(rpath: str, chash='')[source]¶
Yields a protocol buffer of a git commit message.
messages.csv collects commit messages and their subject.
- Parameters
rpath – path to git repository
chash – optional param, retrieve all commits since commit hash
- Returns
okra.protobuf.assn1_pb2.Message
- Return type
generator, protocol buffer
Database Models¶
SQL Database Models
This is the database schema used for accessing the SQL database.
Okra Exceptions¶
Custom errors we can expect.
exception okra.error_handling.DirectoryNotCreatedError(expression, message)[source]¶
Exception raised when a directory cannot be created.
- Parameters
expression – input expression in which error occurred
message – explanation of error
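A sketch of how this exception is meant to be used, following the (expression, message) signature above:

    import os

    from okra.error_handling import DirectoryNotCreatedError

    def make_parent_dir(path: str) -> None:
        try:
            os.makedirs(path)
        except OSError:
            # expression: the input that failed; message: explanation of the error.
            raise DirectoryNotCreatedError(path, "unable to create directory")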
Git Playbooks¶
Playbooks for running full analyses
okra.playbooks.local_persistance(repo_name: str, parent_dir: str, buffer_size=4096)[source]¶
Collect relevant data for locally cloned repos.
- Parameters
repo_name – name of git repository, <linux>
parent_dir – parent directory path, </home/user/code/>
buffer_size – number of records processed before db commit
- Returns
populate sqlite database for a repo
- Return type
None
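A usage sketch matching the parameter examples above; the repo name and parent directory are illustrative:

    from okra.playbooks import local_persistance

    # Parse the git log of a locally cloned repo and populate its SQLite
    # database, committing every 4096 records.
    local_persistance("linux", "/home/user/code/", buffer_size=4096)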
okra.playbooks.simple_version_truck_factor(repos: list, dirpath: str, dburl: str, b: int)[source]¶
Simple version of the truck factor analysis.
This is a basic version of the truck factor analysis that does not attempt to run at scale; it is a proof of concept. It writes a csv file with the truck factor of each repository, which you can use for further analysis in R.
- Parameters
repos – repo queue with value format ‘<repo owner>/<repo name>’
dirpath – path to working directory to store git repos
dburl – database url (ex. ‘sqlite:///:memory:’)
b – batch size (ex. 1024)
- Returns
outputs truck factor analysis as csv file
- Return type
None, writes out csv file
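A usage sketch built from the parameter examples above (the repo list and working directory are illustrative):

    from okra.playbooks import simple_version_truck_factor

    repos = ["tensorflow/tensorflow", "torvalds/linux"]
    simple_version_truck_factor(
        repos,
        dirpath="/tmp/okra-repos/",   # working directory for cloned repos
        dburl="sqlite:///:memory:",   # SQLAlchemy in-memory database URL
        b=1024,                       # batch size
    )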
Polite GitHub Repo Retrieval¶
A polite approach to downloading GitHub repos.
Previous approaches downloaded repositories in parallel on Kubernetes using Redis, which both GitHub and Google considered too aggressive. This approach is meant to be nice.
okra.be_nice.okay_benice(qpath: str, ssh=True)[source]¶
Polite GitHub repo retrieval.
Includes an option to clone repos using SSH, which is recommended if you’re requesting a large number of repositories (> 1000). Creates and populates a SQLite database with pre-specified git log info. qpath points to a text file of repository names from GHTorrent, queried using Google BigQuery.
- Parameters
qpath – path to queue of repository names.
ssh – bool, default is True
- Returns
writes SQLite database with parsed git log info to disk
- Return type
None
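A usage sketch; the queue file name is illustrative:

    from okra.be_nice import okay_benice

    # 'repo_queue.txt' is a hypothetical text file of '<owner>/<repo>' names,
    # e.g. exported from a BigQuery query against GHTorrent.
    okay_benice("repo_queue.txt", ssh=True)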