Developer Interface¶
This part of the documentation covers all the interfaces of Okra.
Parallel Computing¶
Joins turned out to be less than ideal in Spark, so instead of using a central database we solve the problem in a distributed way: each function returns an immutable type, and the work runs on a single multi-core server (around 32 cores). The parquet files are dropped in favor of SQLite databases.
- Reference:
tbonza/EDS19/project/PredictRepoHealth.ipynb
okra.distributed.consolidate_features_target(cache: str, repo_id: str, report=100)[source]¶
Consolidates distributed computations in cache.
We have n feature tables and n target tables, which need to be consolidated into one feature table and one target table. This is the ‘reduce’ step of a ‘map/reduce’ pattern; to avoid confusion with Hadoop, let’s call it a ‘hack/reduce’ pattern.
- Parameters
cache – str, full path to cache
- Returns
X_features, X_target written to disk
- Return type
None
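The consolidation is a straightforward concatenation. A minimal sketch of the idea in pandas, assuming (hypothetically) that the cache directory holds one ‘<repo>_features.csv’ and one ‘<repo>_target.csv’ per repository; the real file layout may differ:

    import glob
    import os

    import pandas as pd

    def consolidate(cache: str) -> None:
        """Stack per-repo feature/target tables into one table of each."""
        # Hypothetical naming scheme for the distributed outputs.
        feature_parts = [pd.read_csv(p)
                         for p in glob.glob(os.path.join(cache, "*_features.csv"))]
        target_parts = [pd.read_csv(p)
                        for p in glob.glob(os.path.join(cache, "*_target.csv"))]
        # The 'reduce' step of the hack/reduce pattern: n tables become one.
        pd.concat(feature_parts, ignore_index=True).to_csv(
            os.path.join(cache, "X_features.csv"), index=False)
        pd.concat(target_parts, ignore_index=True).to_csv(
            os.path.join(cache, "X_target.csv"), index=False)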
okra.distributed.create_working_table(dal: okra.models.DataAccessLayer)[source]¶
Create a base table to derive all other features and aggregations.
- Parameters
dal – okra.models.DataAccessLayer, session must be instantiated
- Returns
base table for analysis
- Return type
pd.DataFrame
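A usage sketch; the DataAccessLayer setup shown here is an assumption (the docs only require that its session is instantiated before the call):

    from okra.models import DataAccessLayer
    from okra.distributed import create_working_table

    # Assumed initialization; the real constructor/session API may differ.
    dal = DataAccessLayer("sqlite:///repo.db")
    dal.connect()  # instantiate the session (assumed method name)

    base = create_working_table(dal)  # -> pd.DataFrame
    print(base.head())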
okra.distributed.getwork_dbinfo(cache: str)[source]¶
Defines work for immutable function, write_features_target.
- Parameters
cache – str, full path to cache
- Returns
dbinfo, (dburl, repo_name, cache)
- Return type
list of tuples
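A sketch of how such a work list could be built, assuming (hypothetically) one SQLite ‘.db’ file per repository in the cache:

    import glob
    import os

    def getwork_dbinfo(cache: str) -> list:
        """Build (dburl, repo_name, cache) work tuples from cached SQLite files."""
        dbinfo = []
        for path in glob.glob(os.path.join(cache, "*.db")):
            repo_name = os.path.splitext(os.path.basename(path))[0]
            dburl = "sqlite:///" + path  # SQLAlchemy-style database URL
            dbinfo.append((dburl, repo_name, cache))
        return dbinfo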
okra.distributed.num_authors(df, period: str)[source]¶
Number of authors in a given time period.
- Parameters
df – pd.DataFrame, base table
period – str, ‘Y’, year
- Returns
df aggregated by period
- Return type
pd.DataFrame
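A pandas sketch of the aggregation (the function name above is reconstructed from context; ‘authored_date’ and ‘author_email’ are assumed column names in the base table):

    import pandas as pd

    def num_authors(df: pd.DataFrame, period: str = "Y") -> pd.DataFrame:
        """Count unique authors per time period."""
        # Assumes 'authored_date' is a datetime column.
        return (df.groupby(pd.Grouper(key="authored_date", freq=period))["author_email"]
                  .nunique()
                  .reset_index(name="num_authors"))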
okra.distributed.num_mentors(df, period: str, subperiod: str, k: int)[source]¶
Number of authors in a larger time period who have also made commits in at least k of the smaller time periods.
For example, number of authors in a year who have also committed changes in each month of that year.
- Parameters
df – pd.DataFrame, base table
period – ‘Y’, year
subperiod – ‘M’, month
k – int, authors exist at least k subperiods
- Returns
df aggregated by period
- Return type
pd.DataFrame
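A pandas sketch of the idea, using the same assumed ‘authored_date’/‘author_email’ columns:

    import pandas as pd

    def num_mentors(df: pd.DataFrame, period: str = "Y",
                    subperiod: str = "M", k: int = 6) -> pd.DataFrame:
        """Count authors per period who committed in at least k subperiods."""
        d = df.copy()
        d["period"] = d["authored_date"].dt.to_period(period)
        d["subperiod"] = d["authored_date"].dt.to_period(subperiod)
        # Distinct subperiods in which each author committed, within each period.
        active = d.groupby(["period", "author_email"])["subperiod"].nunique()
        mentors = active[active >= k]
        return mentors.groupby(level="period").size().reset_index(name="num_mentors")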
okra.distributed.num_orgs(df, period)[source]¶
Number of organizations committing to a repo in a time period.
Organizations are found by extracting the domain of email addresses used by authors.
- Parameters
df – pd.DataFrame, base table
period – str, ‘Y’ year time period
- Returns
df aggregated by period
- Return type
pd.DataFrame
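A sketch of the domain-extraction approach, with the same assumed columns:

    import pandas as pd

    def num_orgs(df: pd.DataFrame, period: str = "Y") -> pd.DataFrame:
        """Count unique email domains (organizations) per time period."""
        d = df.copy()
        # An organization is approximated by the domain of the author's email.
        d["org"] = d["author_email"].str.split("@").str[-1]
        return (d.groupby(pd.Grouper(key="authored_date", freq=period))["org"]
                 .nunique()
                 .reset_index(name="num_orgs"))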
GitHub Repository Management¶
GitHub Repo Management
Related to downloading and updating GitHub repos. See the ‘assn1’ script in bin/assn1, which handles the get and update features.
okra.repo_mgmt.clone_repo(repo_name: str, dirpath: str, ssh=False) → bool[source]¶
Clone GitHub repo.
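A sketch of what such a clone helper typically does (not necessarily the library's exact implementation):

    import subprocess

    def clone_repo(repo_name: str, dirpath: str, ssh: bool = False) -> bool:
        """Clone a GitHub repo into dirpath; return True on success."""
        url = (f"git@github.com:{repo_name}.git" if ssh
               else f"https://github.com/{repo_name}.git")
        result = subprocess.run(["git", "clone", url], cwd=dirpath)
        return result.returncode == 0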
okra.repo_mgmt.compress_repo(repo_name: str, cache: str, repo_comp: str) → bool[source]¶
Compress repo for upload.
- Parameters
repo_name – git repo name with owner included; tensorflow/tensorflow
cache – str, full path to the cache directory containing the cloned repo
repo_comp – path of the compressed file to create
- Returns
creates a compressed file of github repo
- Return type
True if git repo successfully compressed
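A minimal sketch using tarfile, assuming the repo was cloned under the cache directory; the actual archive format is an assumption:

    import os
    import tarfile

    def compress_repo(repo_name: str, cache: str, repo_comp: str) -> bool:
        """Create a gzipped tarball of a cloned repo; return True on success."""
        repo_dir = os.path.join(cache, repo_name)  # e.g. cache/tensorflow/tensorflow
        try:
            with tarfile.open(repo_comp, "w:gz") as tar:
                tar.add(repo_dir, arcname=repo_name)  # keep 'owner/repo' layout
            return True
        except (OSError, tarfile.TarError):
            return False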
okra.repo_mgmt.create_parent_dir(repo_name: str, dirpath: str) → bool[source]¶
Create parent directory before cloning repo.
okra.repo_mgmt.decompress_repo(repo_comp: str, cache) → bool[source]¶
Decompress repo to a directory.
- Parameters
repo_comp – path to the compressed file to be decompressed
cache – directory path in which to place the uncompressed repo, with repo owner
- Returns
Uncompresses file and writes ‘git_owner_name/git_repo_name’ to the specified directory.
- Return type
boolean
- Raises
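A tarfile-based sketch of the inverse operation, assuming the same gzipped-tarball format as the compression sketch above:

    import tarfile

    def decompress_repo(repo_comp: str, cache: str) -> bool:
        """Extract 'git_owner_name/git_repo_name' into the cache directory."""
        try:
            with tarfile.open(repo_comp, "r:gz") as tar:
                tar.extractall(path=cache)
            return True
        except (OSError, tarfile.TarError):
            return False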
okra.repo_mgmt.gcloud_clone_or_fetch_repo(repo_name: str, ssh=False) → bool[source]¶
Clone or fetch updates from a git repo.
GCloud operations only work on one repository at a time, so we don’t have to use a parent directory.
- Parameters
repo_name – ‘<repo owner>/<repo name>’
ssh – bool, default is False
- Returns
current git repo
- Return type
None, file written to disk
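A sketch of the clone-or-fetch pattern this describes (illustrative, not the exact implementation):

    import os
    import subprocess

    def clone_or_fetch(repo_name: str, ssh: bool = False) -> bool:
        """Clone the repo if absent, otherwise fetch updates in place."""
        repo_dir = repo_name.split("/")[-1]  # one repo per workspace, no parent dir
        url = (f"git@github.com:{repo_name}.git" if ssh
               else f"https://github.com/{repo_name}.git")
        if os.path.isdir(repo_dir):
            result = subprocess.run(["git", "fetch"], cwd=repo_dir)
        else:
            result = subprocess.run(["git", "clone", url])
        return result.returncode == 0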
Git Log Management¶
Handle the data requirements for Assignment 1
- References:
Assignment 1: http://janvitek.org/events/NEU/6050/a1.html
Git log formatting: https://git-scm.com/docs/pretty-formats
okra.gitlogs.extract_data_main(fpath: str, dirpath: str)[source]¶
Extract data as requested in Assignment 1.
okra.gitlogs.parse_commits(rpath: str, chash='')[source]¶
Yields a protocol buffer of git commit information.
commits.csv collects basic information about commits.
- Parameters
rpath – path to git repository
chash – optional param, retrieve all commits since commit hash
- Returns
okra.protobuf.assn1_pb2.Commit
- Return type
generator, protocol buffer
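A sketch of how such a parser typically drives git log with a custom pretty format (see the pretty-formats reference above). For brevity it yields plain dicts rather than Commit protocol buffers, and the field layout is an assumption:

    import subprocess

    # Hypothetical field selection; the real assn1_pb2.Commit schema may differ.
    FMT = "%H|%an|%ae|%ad|%s"  # hash, author name, author email, date, subject

    def parse_commits(rpath: str, chash: str = ""):
        """Yield one record per commit, optionally only commits since chash."""
        cmd = ["git", "log", f"--pretty=format:{FMT}"]
        if chash:
            cmd.append(f"{chash}..HEAD")
        out = subprocess.run(cmd, cwd=rpath, capture_output=True, text=True)
        for line in out.stdout.splitlines():
            commit_hash, name, email, date, subject = line.split("|", 4)
            yield {"hash": commit_hash, "author": name, "email": email,
                   "date": date, "subject": subject}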
okra.gitlogs.parse_committed_files(rpath: str, chash='')[source]¶
Parse the file listing format from the git log tool.
- Parameters
rpath – path to git repository
chash – optional param, retrieve all commits since commit hash
- Returns
okra.protobuf.assn1_pb2.Message
- Return type
generator, protocol buffer
okra.gitlogs.parse_inventory(rpath: str, repo_name: str)[source]¶
Return inventory information for a git repo.
- Parameters
rpath – repository path
- Returns
inventory message object
- Return type
okra.protobuf.assn1_pb2.Inventory
okra.gitlogs.parse_messages(rpath: str, chash='')[source]¶
Yields a protocol buffer of a git commit message.
messages.csv collects commit messages and their subject.
- Parameters
rpath – path to git repository
chash – optional param, retrieve all commits since commit hash
- Returns
okra.protobuf.assn1_pb2.Message
- Return type
generator, protocol buffer
Database Models¶
SQL Database Models
This is the database schema used for accessing the SQL database.
Okra Exceptions¶
Custom errors we can expect.
exception okra.error_handling.DirectoryNotCreatedError(expression, message)[source]¶
Exception raised when a directory cannot be created.
- Parameters
expression – input expression in which error occurred
message – explanation of error
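A sketch of how this exception is meant to be used, following the (expression, message) signature above:

    import os

    from okra.error_handling import DirectoryNotCreatedError

    def make_parent_dir(path: str) -> None:
        try:
            os.makedirs(path)
        except OSError:
            # expression: the input that failed; message: explanation of the error.
            raise DirectoryNotCreatedError(path, "unable to create directory")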
Git Playbooks¶
Playbooks for running full analyses
okra.playbooks.local_persistance(repo_name: str, parent_dir: str, buffer_size=4096)[source]¶
Collect relevant data for locally cloned repos.
- Parameters
repo_name – name of git repository, <linux>
parent_dir – parent directory path, </home/user/code/>
buffer_size – number of records processed before db commit
- Returns
populate sqlite database for a repo
- Return type
None
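A usage sketch matching the parameter examples above; the repo name and parent directory are illustrative:

    from okra.playbooks import local_persistance

    # Parse the git log of a locally cloned repo and populate its SQLite
    # database, committing every 4096 records.
    local_persistance("linux", "/home/user/code/", buffer_size=4096)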
okra.playbooks.simple_version_truck_factor(repos: list, dirpath: str, dburl: str, b: int)[source]¶
Simple version of the truck factor analysis.
This is a basic version of the truck factor analysis that does not attempt to run at scale; it is a proof of concept. It writes a csv file with the truck factor of each repository, which you can use for further analysis in R.
- Parameters
repos – repo queue with value format ‘<repo owner>/<repo name>’
dirpath – path to working directory to store git repos
dburl – database url (ex. ‘sqlite:///:memory:’)
b – batch size (ex. 1024)
- Returns
outputs truck factor analysis as csv file
- Return type
None, writes out csv file
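A usage sketch built from the parameter examples above (the repo list and working directory are illustrative):

    from okra.playbooks import simple_version_truck_factor

    repos = ["tensorflow/tensorflow", "torvalds/linux"]
    simple_version_truck_factor(
        repos,
        dirpath="/tmp/okra-repos/",   # working directory for cloned repos
        dburl="sqlite:///:memory:",   # SQLAlchemy in-memory database URL
        b=1024,                       # batch size
    )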
Polite GitHub Repo Retrieval¶
A polite approach to downloading GitHub repos.
Previous approaches downloaded repositories in parallel on Kubernetes using Redis, which both GitHub and Google considered too aggressive. This approach is meant to be nice.
okra.be_nice.okay_benice(qpath: str, ssh=True)[source]¶
Polite GitHub repo retrieval.
Includes an option to clone repos using SSH, which is recommended if you’re requesting a large number of repositories (> 1000). Creates and populates a SQLite database with pre-specified git log info. qpath points to a text file of repository names from GHTorrent, queried using Google BigQuery.
- Parameters
qpath – path to queue of repository names.
ssh – bool, default is True
- Returns
writes SQLite database with parsed git log info to disk
- Return type
None
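A usage sketch; the queue file name is illustrative:

    from okra.be_nice import okay_benice

    # 'repo_queue.txt' is a hypothetical text file of '<owner>/<repo>' names,
    # e.g. exported from a BigQuery query against GHTorrent.
    okay_benice("repo_queue.txt", ssh=True)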