# Introduction

- Automated Theorem Proving (ATP) - Attempting to prove mathematical theorems automatically.
- Bottlenecks in ATP:
**Autoformalization**- Semantic or formal parsing of informal proofs.**Automated Reasoning**- Reasoning about already formalised proofs.- Paper evaluates the effectiveness of neural sequence models for premise selection (related to automated reasoning) without using hand engineered features.
- Link to the paper

# Premise Selection

- Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A

# Dataset

- Mizar Mathematical Library (MML) used as the formal corpus.
- The premise length varies from 5 to 84299 characters and over 60% if the words occur fewer than 10 times (rare word problem).

# Approach

- The model predicts the probability that a given axiom is useful for proving a given conjecture.
- Conjecture and axiom sequences are separately embedded into fixed length real vectors, then concatenated and passed to a third network with few fully connected layers and logistic loss.
- The two embedding networks and the joint predictor path are trained jointly.

**Stage 1: Character-level Models**

- Treat premises on character level using an 80-dimensional one hot encoding vector.
- Architectures for embedding:
- pure recurrent LSTM and GRU Network
- CNN (with max pooling)
- Recurrent-convolutional network that shortens the sequence using convolutional layer before feeding it to LSTM.

**Stage 2: Word-level Models**

- MML dataset contains both implicit and explicit definitions.
- To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded and the definition embeddings are used as word level embeddings.
- This is better than recursively expanding and embedding the word definition as the definition chains can be very deep.
- Once word level embeddings are obtained, the architecture from Character-level models can be reused.

# Experiments

**Metrics**

- For each conjecture, the model ranks the possible premises.
- Primary metric is the number of conjectures proved from top-k premises.
- Average Max Relative Rank (AMMR) is more sophisticated measure based on the motivation that conjectures are easier to prove if all their dependencies occur earlier in ranking.
- Since it is very costly to rank all axioms for a conjecture, an approximation is made and a fixed number of random false dependencies are used for evaluating AMMR.

**Network Training**

- Asynchronous distributed stochastic gradient descent using Adam optimizer.
- Clipped vs Non-clipped Gradients.
- Max Sequence length of 2048 for character-level sequences and 500 for word-level sequences.

**Results**

- Best Selection Pipeline - Stage 1 character-level CNN which produces word-level embeddings for the next stage.
- Best models use simple CNNs followed by max pooling and two-stage definition-based def-CNN outperforms naive word-CNN (where word embeddings are learnt in a single pass).