crabnet.utils package

Submodules

crabnet.utils.composition module

exception crabnet.utils.composition.CompositionError[source]

Bases: Exception

Exception class for composition errors

crabnet.utils.composition.generate_features(df, elem_prop='oliynyk', drop_duplicates=False, extend_features=False, sum_feat=False, mini=False)[source]
Parameters
  • df (Pandas.DataFrame()) – X column dataframe of the form: df.columns.values = array(['formula', 'target', 'extended1', 'extended2', …], dtype=object)

  • elem_prop (str) – valid element properties: 'oliynyk', 'jarvis', 'atom2vec', 'magpie', 'mat2vec', 'onehot'

  • drop_duplicates (boolean) – Whether to drop duplicate compositions.

  • extend_features (boolean) – Whether to use columns other than ["formula", "target"] as additional features.

Returns

  • X (pd.DataFrame()) – Feature matrix with NaN values filled using the median feature value for the dataset

  • y (pd.Series()) – Target values

  • formulae (pd.Series()) – Formulae associated with X and y
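
A minimal usage sketch (not taken from the CrabNet source; it assumes a DataFrame with "formula" and "target" columns and the three return values listed above):

>>> import pandas as pd
>>> from crabnet.utils.composition import generate_features
>>> df = pd.DataFrame({"formula": ["Fe2O3", "SiO2"], "target": [1.0, 2.0]})
>>> X, y, formulae = generate_features(df, elem_prop="oliynyk")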

crabnet.utils.composition.get_sym_dict(f, factor)[source]
crabnet.utils.composition.parse_formula(formula)[source]
Parameters

formula (str) – A string formula, e.g. Fe2O3, Li3Fe2(PO4)3.

Returns

sym_dict – A dictionary recording the composition of that formula.

Return type

dict

Notes

In the case of a metallofullerene formula (e.g. Y3N@C80), the @ mark will be dropped before the formula is passed to the parser.
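
For illustration, a sketch of the expected behavior (the exact numeric types in the returned dict are an assumption):

>>> from crabnet.utils.composition import parse_formula
>>> parse_formula("Li3Fe2(PO4)3")  # e.g. {'Li': 3.0, 'Fe': 2.0, 'P': 3.0, 'O': 12.0}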

crabnet.utils.data module

crabnet.utils.data.get_data(module, fname='train.csv', mapper=None, groupby=True, dummy=False, split=True, val_size=0.2, test_size=0.0, random_state=42)[source]

Grab data from within the subdirectories (modules) of CrabNet.

Parameters
  • module (Module) – The module within CrabNet that contains e.g. “train.csv”. For example, from crabnet.data.materials_data import elasticity

  • fname (str, optional) – Filename of text file to open.

  • mapper (dict, optional) – Column renamer for the pandas DataFrame (i.e. used in df.rename(columns=mapper)). By default, None.

  • dummy (bool, optional) – Whether to pare down the data to a small test set, by default False

  • groupby (bool, optional) – Whether to use groupby_formula to filter identical compositions

  • split (bool, optional) – Whether to split the data into train, val, and (optionally) test sets, by default True

  • val_size (float, optional) – Validation dataset fraction, by default 0.2

  • test_size (float, optional) – Test dataset fraction, by default 0.0

  • random_state (int, optional) – seed to use for the train/val/test split, by default 42

Returns

  • DataFrame – If split==False, then the full DataFrame is returned directly

  • DataFrame, DataFrame – If test_size == 0 and split==True, then training and validation DataFrames are returned.

  • DataFrame, DataFrame, DataFrame – If test_size > 0 and split==True, then training, validation, and test DataFrames are returned.
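
A minimal usage sketch based on the signature and returns above (the elasticity dataset mirrors the example given for the module parameter):

>>> from crabnet.data.materials_data import elasticity
>>> from crabnet.utils.data import get_data
>>> train_df, val_df = get_data(elasticity, fname="train.csv", val_size=0.2, test_size=0.0)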

crabnet.utils.data.groupby_formula(df, how='max', mapper=None)[source]

Group identical compositions together and preserve original indices.

See https://stackoverflow.com/a/49216427/13697228

Parameters
  • df (DataFrame) – At minimum should contain “formula” and “target” columns.

  • how (str, optional) – How to perform the “groupby”, either “mean” or “max”, by default “max”

Returns

The grouped DataFrame such that the original indices are preserved.

Return type

DataFrame
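
For illustration (a sketch; df is assumed to contain duplicate formulas with differing targets):

>>> from crabnet.utils.data import groupby_formula
>>> grouped_df = groupby_formula(df, how="mean")  # one row per unique formula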

crabnet.utils.estimatorselectionhelper module

class crabnet.utils.estimatorselectionhelper.EstimatorSelectionHelper(models, params, random_seed=42)[source]

Bases: object

__init__(models, params, random_seed=42)[source]
fit(X, y, cv=3, n_jobs=1, verbose=1, scoring=None, refit=False, random_seed=42)[source]
plot_gridsearch(model_name, elem_prop, mat_prop, fig_dir, gs)[source]
score_summary(ep, mp, fig_dir, sort_by='mean_test_r2')[source]
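
No usage is documented for this helper; the sketch below assumes the common pattern of a dict of named estimators paired with a dict of parameter grids (the scikit-learn model and the X, y arrays are illustrative, not taken from the CrabNet source):

>>> from sklearn.ensemble import RandomForestRegressor
>>> from crabnet.utils.estimatorselectionhelper import EstimatorSelectionHelper
>>> models = {"rf": RandomForestRegressor()}
>>> params = {"rf": {"n_estimators": [50, 100]}}
>>> helper = EstimatorSelectionHelper(models, params, random_seed=42)
>>> helper.fit(X, y, cv=3, n_jobs=1)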

crabnet.utils.figures module

crabnet.utils.figures.act_pred(y_act, y_pred, name='example', x_hist=True, y_hist=True, reg_line=True, save_dir=None)[source]
crabnet.utils.figures.element_prevalence(formulae, name='example', save_dir=None, log_scale=False, ptable_fig=True)[source]
crabnet.utils.figures.loss_curve(x_data, train_err, val_err, name='example', save_dir=None)[source]
crabnet.utils.figures.residual(y_act, y_pred, name='example', save_dir=None)[source]
crabnet.utils.figures.residual_hist(y_act, y_pred, name='example', save_dir=None)[source]

crabnet.utils.get_compute_device module

crabnet.utils.get_compute_device.get_compute_device(force_cpu=False, prefer_last=True)[source]

crabnet.utils.get_core_count module

crabnet.utils.get_core_count.get_core_count()[source]

Get the number of available virtual or physical CPUs on this system

crabnet.utils.modelselectionhelper module

crabnet.utils.modelselectionhelper.modelselectionhelper(models, params, elem_props, mat_props_dir, mat_props, metrics_dir, fig_dir, scoring=None, n_jobs=1, cv=3, refit='neg_MAE', verbose=False, random_seed=42)[source]

crabnet.utils.optim module

class crabnet.utils.optim.SWA(*args: Any, **kwargs: Any)[source]

Bases: torch.optim.Optimizer

__init__(optimizer, swa_start=None, swa_freq=None, swa_lr=None)[source]

Implements Stochastic Weight Averaging (SWA). Stochastic Weight Averaging was proposed in Averaging Weights Leads to Wider Optima and Better Generalization by Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov and Andrew Gordon Wilson (UAI 2018).

SWA is implemented as a wrapper class that takes an optimizer instance as input and applies SWA on top of that optimizer. SWA can be used in two modes: automatic and manual. In automatic mode, the SWA running averages are updated every swa_freq steps after swa_start steps of optimization. If swa_lr is provided, the learning rate of the optimizer is reset to swa_lr at every step starting from swa_start. To use SWA in automatic mode, provide values for both the swa_start and swa_freq arguments. Alternatively, in manual mode, use the update_swa() or update_swa_group() methods to update the SWA running averages. At the end of training, call swap_swa_sgd() to set the optimized variables to the computed averages.

Parameters
  • optimizer (torch.optim.Optimizer) – optimizer to use with SWA

  • swa_start (int) – number of steps before starting to apply SWA in automatic mode; if None, manual mode is selected (default: None)

  • swa_freq (int) – number of steps between subsequent updates of SWA running averages in automatic mode; if None, manual mode is selected (default: None)

  • swa_lr (float) – learning rate to use starting from step swa_start in automatic mode; if None, learning rate is not changed (default: None)

Examples

>>> # automatic mode
>>> base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
>>> opt = torchcontrib.optim.SWA(
>>>                 base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
>>> for _ in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>> opt.swap_swa_sgd()
>>> # manual mode
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         opt.update_swa()
>>> opt.swap_swa_sgd()

Note

SWA does not support parameter-specific values of swa_start, swa_freq or swa_lr. In automatic mode SWA uses the same swa_start, swa_freq and swa_lr for all parameter groups. If needed, use manual mode with update_swa_group() to use different update schedules for different parameter groups.

Note

Call swap_swa_sgd() at the end of training to use the computed running averages.

Note

If you are using SWA to optimize the parameters of a Neural Network containing Batch Normalization layers, you need to update the running_mean and running_var statistics of the Batch Normalization module. You can do so by using torchcontrib.optim.swa.bn_update utility.

Note

See the blogpost https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/ for an extended description of this SWA implementation.

Note

The repo https://github.com/izmailovpavel/contrib_swa_examples contains examples of using this SWA implementation.

add_param_group(param_group)[source]

Add a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters

param_group (dict) – Specifies what Tensors should be optimized along with group-specific optimization options.

static bn_update(loader, model, device=None)[source]

Updates the BatchNorm running_mean and running_var buffers in the model. It performs one pass over the data in loader to estimate the activation statistics for the BatchNorm layers in the model.

Parameters
  • loader (torch.utils.data.DataLoader) – dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.

  • model (torch.nn.Module) – model for which we seek to update BatchNorm statistics.

  • device (torch.device, optional) – If set, data will be transferred to device before being passed into model.
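
A usage sketch (opt is the SWA wrapper from the examples above; train_loader and model are assumed to be a torch DataLoader and a network containing BatchNorm layers):

>>> opt.swap_swa_sgd()
>>> opt.bn_update(train_loader, model, device=None)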

load_state_dict(state_dict)[source]

Loads the optimizer state.

Parameters

state_dict (dict) – SWA optimizer state. Should be an object returned from a call to state_dict.

state_dict()[source]

Returns the state of SWA as a dict. It contains three entries:

  • opt_state - a dict holding the current optimization state of the base optimizer. Its content differs between optimizer classes.

  • swa_state - a dict containing the current state of SWA. For each optimized variable it contains a swa_buffer keeping the running average of the variable.

  • param_groups - a dict containing all parameter groups.

step(closure=None)[source]

Performs a single optimization step. In automatic mode also updates SWA running averages.

swap_swa_sgd()[source]

Swaps the values of the optimized variables and the SWA buffers. It is meant to be called at the end of training to use the collected SWA running averages. It can also be used to evaluate the running averages during training; to continue training afterwards, swap_swa_sgd should be called again.

update_swa(mae)[source]

Updates the SWA running averages of all optimized parameters.

update_swa_group(group, reset=False, mae=None)[source]

Updates the SWA running averages for the given parameter group.

Parameters

group (dict) – Specifies for what parameter group SWA running averages should be updated.

Examples

>>> # manual mode
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>             {'params': [y], 'lr': 1e-3}], lr=1e-2, momentum=0.9)
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()

crabnet.utils.utils module

crabnet.utils.utils.BCEWithLogitsLoss(output, log_std, target)[source]
class crabnet.utils.utils.CONSTANTS[source]

Bases: object

__init__()[source]
class crabnet.utils.utils.DummyScaler(data)[source]

Bases: object

__init__(data)[source]
load_state_dict(state_dict)[source]
scale(data)[source]
state_dict()[source]
unscale(data_scaled)[source]
class crabnet.utils.utils.EDMDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.Dataset

Get X and y from EDM dataset.

__init__(dataset, n_comp, extra_features=None)[source]
class crabnet.utils.utils.EDM_CsvLoader(data, extra_features=None, batch_size=64, groupby=False, random_state=0, shuffle=True, pin_memory=True, n_elements=6, inference=False, verbose=True, elem_prop='mat2vec')[source]

Bases: object

__init__(data, extra_features=None, batch_size=64, groupby=False, random_state=0, shuffle=True, pin_memory=True, n_elements=6, inference=False, verbose=True, elem_prop='mat2vec')[source]
Parameters
  • data (str or DataFrame) – name of a csv file containing formulas and properties, or a DataFrame

  • extra_features (str or None) – names of extended features

  • batch_size (int, optional (default=64)) – batch size for the data loaders

  • groupby (bool, optional) – Whether to reduce repeat formulas to a unique set, by default False.

  • random_state (int, optional (default=0)) – Random seed used when sampling/shuffling the dataset.

  • shuffle (bool (default=True)) – Whether to shuffle the datasets or not

get_data_loaders(inference=False)[source]

Input the dataset and get the train/test split.
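
A minimal sketch, assuming df is a DataFrame with “formula” and “target” columns (whether a single loader or a train/val pair is returned should be checked against the source):

>>> from crabnet.utils.utils import EDM_CsvLoader
>>> loader = EDM_CsvLoader(data=df, batch_size=64, elem_prop="mat2vec")
>>> data_loaders = loader.get_data_loaders(inference=False)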

class crabnet.utils.utils.Lamb(*args: Any, **kwargs: Any)[source]

Bases: torch.optim.optimizer.Optimizer

Implements the Lamb algorithm. It has been proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962).

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • adam (bool, optional) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes.

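Example (a usage sketch following the standard torch.optim pattern; model, input, target, and loss_fn are assumed to be defined as in the SWA examples above):

>>> from crabnet.utils.utils import Lamb
>>> optimizer = Lamb(model.parameters(), lr=1e-3, weight_decay=0.01)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()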

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, adam=False, min_trust=None)[source]
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class crabnet.utils.utils.Lookahead(*args: Any, **kwargs: Any)[source]

Bases: torch.optim.optimizer.Optimizer

__init__(base_optimizer, alpha=0.5, k=6)[source]
load_state_dict(state_dict)[source]
state_dict()[source]
step(closure=None)[source]
sync_lookahead()[source]
update_slow(group)[source]
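
No usage is documented; the sketch below follows the __init__ signature above and wraps a base optimizer (model is assumed to be defined):

>>> import torch
>>> from crabnet.utils.utils import Lookahead
>>> base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
>>> opt = Lookahead(base_opt, alpha=0.5, k=6)
>>> # opt is then used like any other optimizer (zero_grad / step)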
class crabnet.utils.utils.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

default(obj)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
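
A usage sketch (it is assumed, from the class name, that the encoder converts NumPy arrays and scalars to JSON-serializable Python types):

>>> import json
>>> import numpy as np
>>> from crabnet.utils.utils import NumpyEncoder
>>> json.dumps({"weights": np.arange(3)}, cls=NumpyEncoder)  # arrays become JSON lists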
crabnet.utils.utils.RobustL1(output, log_std, target)[source]

Robust L1 loss using a Lorentzian prior. Allows for estimation of an aleatoric uncertainty.

crabnet.utils.utils.RobustL2(output, log_std, target)[source]

Robust L2 loss using a Gaussian prior. Allows for estimation of an aleatoric uncertainty.
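
Both losses pair each prediction with a learned log-uncertainty log_std. A minimal sketch of one common formulation of such robust losses (the exact constants are an assumption, not taken from the CrabNet source; output, log_std, and target are same-shaped tensors):

import torch

def robust_l1(output, log_std, target):
    # Absolute error scaled by the predicted uncertainty (Lorentzian prior),
    # plus a log_std penalty that discourages inflating the uncertainty.
    loss = 2 ** 0.5 * torch.abs(output - target) * torch.exp(-log_std) + log_std
    return torch.mean(loss)

def robust_l2(output, log_std, target):
    # Squared error scaled by the predicted variance (Gaussian prior),
    # plus a log_std penalty.
    loss = 0.5 * (output - target) ** 2 * torch.exp(-2.0 * log_std) + log_std
    return torch.mean(loss)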

class crabnet.utils.utils.Scaler(data)[source]

Bases: object

__init__(data)[source]
load_state_dict(state_dict)[source]
scale(data)[source]
state_dict()[source]
unscale(data_scaled)[source]
crabnet.utils.utils.count_gs_param_combinations(d)[source]
crabnet.utils.utils.count_parameters(model)[source]
crabnet.utils.utils.get_cbfv(data, elem_prop='oliynyk', scale=False, extend_features=False)[source]

Loads the compound csv file and featurizes it, then scales the features using StandardScaler.

Parameters
  • data (str or DataFrame) – Filepath to the compound csv file, or a DataFrame of formulas and targets.

  • elem_prop (str, optional) – Element property set used for featurization. The default is ‘oliynyk’.

Returns

  • X_scaled – Scaled feature matrix.

  • y – Target values.

  • formula – Formulae associated with X_scaled and y.
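
A sketch based on the description above (the number and order of return values are assumptions to be checked against the source):

>>> from crabnet.utils.utils import get_cbfv
>>> X_scaled, y, formula = get_cbfv("train.csv", elem_prop="oliynyk", scale=True)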

crabnet.utils.utils.get_edm(data, n_elements='infer', inference=False, verbose=True, groupby=False)[source]

Build an element descriptor matrix.

Parameters

data (str or DataFrame) – Filepath to data or DataFrame.

Returns

  • X_scaled (TYPE) – DESCRIPTION.

  • y (TYPE) – DESCRIPTION.

  • formula (TYPE) – DESCRIPTION.

Module contents