crabnet.utils package
Submodules
crabnet.utils.composition module
- exception crabnet.utils.composition.CompositionError[source]
Bases: Exception
Exception class for composition errors.
- crabnet.utils.composition.generate_features(df, elem_prop='oliynyk', drop_duplicates=False, extend_features=False, sum_feat=False, mini=False)[source]
- Parameters
df (pandas.DataFrame) – X column DataFrame of the form:
df.columns.values = array(['formula', 'target', 'extended1', 'extended2', ...], dtype=object)
elem_prop (str) – valid element properties: 'oliynyk', 'jarvis', 'atom2vec', 'magpie', 'mat2vec', 'onehot'
drop_duplicates (boolean) – Whether to drop duplicate compositions.
extend_features (boolean) – Whether to use columns other than "formula" and "target" as additional features.
- Returns
X (pd.DataFrame) – Feature matrix with NaN values filled using the median feature value for the dataset.
y (pd.Series) – Target values.
formulae (pd.Series) – Formulae associated with X and y.
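A minimal usage sketch with toy data, assuming the three documented return values above:

import pandas as pd
from crabnet.utils.composition import generate_features

# toy input with the required "formula" and "target" columns
df = pd.DataFrame({"formula": ["Fe2O3", "SiO2"], "target": [1.0, 2.0]})

X, y, formulae = generate_features(df, elem_prop="oliynyk")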
- crabnet.utils.composition.parse_formula(formula)[source]
- Parameters
formula (str) – A string formula, e.g. Fe2O3, Li3Fe2(PO4)3.
- Returns
sym_dict – A dictionary recording the composition of that formula.
- Return type
dict
Notes
In the case of a metallofullerene formula (e.g. Y3N@C80), the @ mark will be dropped before the formula is passed to the parser.
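For example (a sketch; exact numeric types in the returned dictionary may differ):

from crabnet.utils.composition import parse_formula

parse_formula("Fe2O3")         # expected: {'Fe': 2.0, 'O': 3.0}
parse_formula("Li3Fe2(PO4)3")  # parenthesized groups are expanded, e.g. 12 O per formula unit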
crabnet.utils.data module
- crabnet.utils.data.get_data(module, fname='train.csv', mapper=None, groupby=True, dummy=False, split=True, val_size=0.2, test_size=0.0, random_state=42)[source]
Grab data from within the subdirectories (modules) of CrabNet.
- Parameters
module (Module) – The module within CrabNet that contains e.g. "train.csv"; for example, from crabnet.data.materials_data import elasticity.
fname (str, optional) – Filename of text file to open.
mapper (dict, optional) – Column renamer for the pandas DataFrame (i.e. used as df.rename(columns=mapper)). By default, None.
dummy (bool, optional) – Whether to pare down the data to a small test set, by default False.
groupby (bool, optional) – Whether to use groupby_formula to filter identical compositions, by default True.
split (bool, optional) – Whether to split the data into train, val, and (optionally) test sets, by default True
val_size (float, optional) – Validation dataset fraction, by default 0.2
test_size (float, optional) – Test dataset fraction, by default 0.0
random_state (int, optional) – seed to use for the train/val/test split, by default 42
- Returns
DataFrame – If split==False, then the full DataFrame is returned directly.
DataFrame, DataFrame – If test_size == 0 and split==True, then training and validation DataFrames are returned.
DataFrame, DataFrame, DataFrame – If test_size > 0 and split==True, then training, validation, and test DataFrames are returned.
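A minimal usage sketch (assuming the elasticity data module mentioned above ships with the package):

from crabnet.utils.data import get_data
from crabnet.data.materials_data import elasticity

# default: split=True, test_size=0.0 -> train and validation DataFrames
train_df, val_df = get_data(elasticity, fname="train.csv")

# with a held-out test set
train_df, val_df, test_df = get_data(elasticity, fname="train.csv", test_size=0.1)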
- crabnet.utils.data.groupby_formula(df, how='max', mapper=None)[source]
Group identical compositions together and preserve original indices.
See https://stackoverflow.com/a/49216427/13697228
- Parameters
df (DataFrame) – At minimum should contain “formula” and “target” columns.
how (str, optional) – How to perform the “groupby”, either “mean” or “max”, by default “max”
- Returns
The grouped DataFrame such that the original indices are preserved.
- Return type
DataFrame
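For example (a sketch with toy data):

import pandas as pd
from crabnet.utils.data import groupby_formula

df = pd.DataFrame({"formula": ["Al2O3", "Al2O3", "SiC"], "target": [1.0, 3.0, 2.0]})

# keep the max target per repeated formula; how="mean" would average instead
grouped = groupby_formula(df, how="max")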
crabnet.utils.estimatorselectionhelper module
crabnet.utils.figures module
- crabnet.utils.figures.act_pred(y_act, y_pred, name='example', x_hist=True, y_hist=True, reg_line=True, save_dir=None)[source]
- crabnet.utils.figures.element_prevalence(formulae, name='example', save_dir=None, log_scale=False, ptable_fig=True)[source]
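A hedged usage sketch for both plotting helpers (toy data; argument meanings are inferred from the signatures):

import numpy as np
import pandas as pd
from crabnet.utils.figures import act_pred, element_prevalence

y_act = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
formulae = pd.Series(["Fe2O3", "SiO2", "Al2O3"])

# parity plot of actual vs. predicted values, saved under save_dir
act_pred(y_act, y_pred, name="val_parity", save_dir="figures")

# element-occurrence plot for a series of formulae
element_prevalence(formulae, name="val_prevalence", save_dir="figures", log_scale=True)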
crabnet.utils.get_compute_device module
crabnet.utils.get_core_count module
crabnet.utils.modelselectionhelper module
crabnet.utils.optim module
- class crabnet.utils.optim.SWA(*args: Any, **kwargs: Any)[source]
Bases: torch.optim.Optimizer
- __init__(optimizer, swa_start=None, swa_freq=None, swa_lr=None)[source]
Implements Stochastic Weight Averaging (SWA). Stochastic Weight Averaging was proposed in Averaging Weights Leads to Wider Optima and Better Generalization by Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov and Andrew Gordon Wilson (UAI 2018). SWA is implemented as a wrapper class that takes an optimizer instance as input and applies SWA on top of that optimizer. SWA can be used in two modes: automatic and manual. In the automatic mode, SWA running averages are automatically updated every swa_freq steps after swa_start steps of optimization. If swa_lr is provided, the learning rate of the optimizer is reset to swa_lr at every step starting from swa_start. To use SWA in automatic mode, provide values for both the swa_start and swa_freq arguments. Alternatively, in the manual mode, use the update_swa() or update_swa_group() methods to update the SWA running averages. At the end of training, use the swap_swa_sgd method to set the optimized variables to the computed averages.
- Parameters
optimizer (torch.optim.Optimizer) – optimizer to use with SWA
swa_start (int) – number of steps before starting to apply SWA in automatic mode; if None, manual mode is selected (default: None)
swa_freq (int) – number of steps between subsequent updates of SWA running averages in automatic mode; if None, manual mode is selected (default: None)
swa_lr (float) – learning rate to use starting from step swa_start in automatic mode; if None, learning rate is not changed (default: None)
Examples
>>> # automatic mode
>>> base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
>>> opt = torchcontrib.optim.SWA(
>>>     base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)
>>> for _ in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>> opt.swap_swa_sgd()
>>> # manual mode
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         opt.update_swa()
>>> opt.swap_swa_sgd()
Note
SWA does not support parameter-specific values of swa_start, swa_freq or swa_lr. In automatic mode SWA uses the same swa_start, swa_freq and swa_lr for all parameter groups. If needed, use manual mode with update_swa_group() to apply different update schedules to different parameter groups.
Note
Call swap_swa_sgd() at the end of training to use the computed running averages.
Note
If you are using SWA to optimize the parameters of a neural network containing Batch Normalization layers, you need to update the running_mean and running_var statistics of the Batch Normalization module. You can do so by using the torchcontrib.optim.swa.bn_update utility.
Note
See the blog post https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/ for an extended description of this SWA implementation.
Note
The repo https://github.com/izmailovpavel/contrib_swa_examples contains examples of using this SWA implementation.
- add_param_group(param_group)[source]
Add a param group to the Optimizer's param_groups. This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
- Parameters
param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
- static bn_update(loader, model, device=None)[source]
Updates the BatchNorm running_mean and running_var buffers in the model. It performs one pass over the data in loader to estimate the activation statistics for the BatchNorm layers in the model.
- Parameters
loader (torch.utils.data.DataLoader) – dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.
model (torch.nn.Module) – model for which we seek to update BatchNorm statistics.
device (torch.device, optional) – If set, data will be transferred to device before being passed into model.
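A hedged sketch of the intended call pattern (the toy model here has no BatchNorm layers; with BatchNorm present, bn_update refreshes the running statistics after the SWA weights are swapped in):

import torch
from crabnet.utils.optim import SWA

model = torch.nn.Linear(10, 1)
dataset = torch.utils.data.TensorDataset(torch.randn(32, 10))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
opt = SWA(base_opt, swa_start=10, swa_freq=5, swa_lr=0.05)

# ... training loop calling opt.step() ...

opt.swap_swa_sgd()            # set model weights to the SWA averages
SWA.bn_update(loader, model)  # then refresh BatchNorm statistics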
- load_state_dict(state_dict)[source]
Loads the optimizer state.
- Parameters
state_dict (dict) – SWA optimizer state. Should be an object returned from a call to state_dict().
- state_dict()[source]
Returns the state of SWA as a dict. It contains three entries:
- opt_state – a dict holding the current optimization state of the base optimizer. Its content differs between optimizer classes.
- swa_state – a dict containing the current state of SWA. For each optimized variable it contains a swa_buffer keeping the running average of the variable.
- param_groups – a dict containing all parameter groups.
- step(closure=None)[source]
Performs a single optimization step. In automatic mode also updates SWA running averages.
- swap_swa_sgd()[source]
Swaps the values of the optimized variables and the SWA buffers. It is meant to be called at the end of training to use the collected SWA running averages. It can also be used to evaluate the running averages during training; to continue training, swap_swa_sgd should be called again.
- update_swa_group(group, reset=False, mae=None)[source]
Updates the SWA running averages for the given parameter group.
- Parameters
group (dict) – Specifies the parameter group for which the SWA running averages should be updated.
Examples
>>> # manual mode
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>                             {'params': [y], 'lr': 1e-3}],
>>>                            lr=1e-2, momentum=0.9)
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()
crabnet.utils.utils module
- class crabnet.utils.utils.EDMDataset(*args: Any, **kwargs: Any)[source]
Bases: torch.utils.data.Dataset
Get X and y from EDM dataset.
- class crabnet.utils.utils.EDM_CsvLoader(data, extra_features=None, batch_size=64, groupby=False, random_state=0, shuffle=True, pin_memory=True, n_elements=6, inference=False, verbose=True, elem_prop='mat2vec')[source]
Bases: object
- __init__(data, extra_features=None, batch_size=64, groupby=False, random_state=0, shuffle=True, pin_memory=True, n_elements=6, inference=False, verbose=True, elem_prop='mat2vec')[source]
- Parameters
data (str or DataFrame) – Filepath of a csv file containing the formulae and properties, or a DataFrame.
extra_features (str or None) – Names of extended features.
batch_size (int, optional (default=64)) – Batch size for the data loader.
groupby (bool, optional) – Whether to reduce repeat formulae to a unique set, by default False.
random_state (int, optional (default=0)) – Random seed for sampling the dataset.
shuffle (bool (default=True)) – Whether to shuffle the datasets.
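A hedged construction sketch; get_data_loaders is assumed here as the accessor that yields the PyTorch DataLoader (verify the method name against the source):

import pandas as pd
from crabnet.utils.utils import EDM_CsvLoader

df = pd.DataFrame({"formula": ["Fe2O3", "SiO2"], "target": [1.0, 2.0]})

loader_maker = EDM_CsvLoader(data=df, batch_size=64, elem_prop="mat2vec")
data_loader = loader_maker.get_data_loaders(inference=False)  # assumed accessor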
- class crabnet.utils.utils.Lamb(*args: Any, **kwargs: Any)[source]
Bases: torch.optim.optimizer.Optimizer
Implements the Lamb algorithm, proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes (https://arxiv.org/abs/1904.00962).
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
adam (bool, optional) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes.
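A drop-in usage sketch based on the parameters documented above:

import torch
from crabnet.utils.utils import Lamb

model = torch.nn.Linear(10, 1)

# used like any torch optimizer; adam=True would reduce it to Adam
optimizer = Lamb(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.01)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()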
- class crabnet.utils.utils.Lookahead(*args: Any, **kwargs: Any)[source]
Bases: torch.optim.optimizer.Optimizer
- class crabnet.utils.utils.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]
Bases: json.encoder.JSONEncoder
- default(obj)[source]
Implement this method in a subclass such that it returns a serializable object for obj, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
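A hedged usage sketch (assuming the subclass converts NumPy arrays and scalars to plain Python types, which is the usual purpose of such an encoder):

import json
import numpy as np
from crabnet.utils.utils import NumpyEncoder

payload = {"weights": np.arange(3), "score": np.float64(0.95)}
print(json.dumps(payload, cls=NumpyEncoder))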
- crabnet.utils.utils.RobustL1(output, log_std, target)[source]
Robust L1 loss using a Lorentzian prior. Allows for estimation of an aleatoric uncertainty.
- crabnet.utils.utils.RobustL2(output, log_std, target)[source]
Robust L2 loss using a Gaussian prior. Allows for estimation of an aleatoric uncertainty.
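The exact expressions live in the source; a common formulation of these robust losses (e.g. as used in Roost-style models) is sketched below as an assumption, not a verbatim copy:

import torch

def robust_l1_sketch(output, log_std, target):
    # L1 residual under a Lorentzian/Laplace prior; log_std is the
    # predicted log aleatoric uncertainty
    loss = 2 ** 0.5 * (output - target).abs() * torch.exp(-log_std) + log_std
    return torch.mean(loss)

def robust_l2_sketch(output, log_std, target):
    # squared residual under a Gaussian prior
    loss = 0.5 * (output - target) ** 2 * torch.exp(-2 * log_std) + log_std
    return torch.mean(loss)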
- crabnet.utils.utils.get_cbfv(data, elem_prop='oliynyk', scale=False, extend_features=False)[source]
Load the compound csv file (or DataFrame) and featurize it, optionally scaling the features with StandardScaler.
- Parameters
data (str or DataFrame) – Filepath to the compound csv file, or a DataFrame.
elem_prop (str, optional) – Element property feature set used for featurization. The default is 'oliynyk'.
scale (bool, optional) – Whether to scale the features using StandardScaler, by default False.
extend_features (bool, optional) – Whether to use extra columns as additional features, by default False.
- Returns
X_scaled (pd.DataFrame) – Featurized (and optionally scaled) feature matrix.
y (pd.Series) – Target values.
formula (pd.Series) – Formulae associated with X_scaled and y.
- crabnet.utils.utils.get_edm(data, n_elements='infer', inference=False, verbose=True, groupby=False)[source]
Build an element descriptor matrix (EDM).
- Parameters
data (str or DataFrame) – Filepath to the data, or a DataFrame.
n_elements (str or int, optional) – Number of elements per formula to allow for, or 'infer' to determine it from the data, by default 'infer'.
- Returns
X_scaled – Element descriptor matrix.
y – Target values.
formula – Formulae associated with X_scaled and y.
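A minimal call sketch following the documented signature; the three-value unpacking mirrors the Returns list above and should be verified against the source:

import pandas as pd
from crabnet.utils.utils import get_edm

df = pd.DataFrame({"formula": ["Fe2O3", "SiO2"], "target": [1.0, 2.0]})
X, y, formula = get_edm(df, n_elements="infer")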