## What is Data Science?

Data science is an interdisciplinary field that leverages statistics, computer science, and machine learning to extract information from both structured and unstructured data.

- Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting, and presenting empirical data.
- Computer Science is the study of computation, automation, and information. Computer science spans from theoretical disciplines, such as algorithms, theory of computation, and information theory, to practical disciplines.
- Machine Learning is the study of algorithms that automatically identify patterns in data. These algorithms are capable of learning and improving over time as additional data becomes available.

#### Blog: How Machine Learning is Changing the Data Organization Game

The addition of data science into traditional reliability methods allows models to evolve continually and learn, preventing results from becoming stagnant. To learn about more applications of machine learning in reliability, read our blog in Inspectioneering.

## Why is Data Science Valuable in Reliability?

The addition of data science into traditional reliability methods allows models to continually evolve and learn, preventing results from becoming stagnant. Data science and machine learning eliminate the need for humans to hand-code rules that tend to be brittle and hard to maintain. The goal of data science is to set up algorithms that learn and improve automatically without a significant amount of human intervention, ultimately creating a better solution than what could have been coded by hand. Having a Data-Driven Reliability framework can elevate your program to improve performance and allows you to connect data to business decisions.

## Quantitative Reliability Optimization (QRO) Example Application: CML Optimization

## Machine Learning Models Leveraged in the Data-Driven Reliability Framework

**Linear Regression**

Linear Regression is a set of statistical processes used to estimate the relationships between a dependent variable (occasionally multiple) and one or more independent variables.
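
As a minimal sketch, an ordinary least squares fit can be computed directly with NumPy. The wall-loss measurements below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical data: wall loss (mils) measured over years in service.
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
wall_loss = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit wall_loss ~ slope * years + intercept by ordinary least squares.
A = np.column_stack([years, np.ones_like(years)])
(slope, intercept), *_ = np.linalg.lstsq(A, wall_loss, rcond=None)

print(f"estimated corrosion rate: {slope:.2f} mils/year")  # ≈ 1.96
```

The fitted slope is the estimated corrosion rate, which can then feed remaining-life calculations.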

**Non-Linear Regression**

Non-Linear Regression is a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables.
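
For example, a nonlinear growth model can be fit with SciPy's `curve_fit`, which iteratively estimates the parameters. The exponential damage model and data below are hypothetical:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical nonlinear model: damage depth growing exponentially with time.
def growth(t, a, b):
    return a * np.exp(b * t)

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
depth = 2.0 * np.exp(0.5 * t)  # synthetic, noise-free data for illustration

# Fit the parameters (a, b) starting from an initial guess.
(a_hat, b_hat), _ = curve_fit(growth, t, depth, p0=(1.0, 0.1))
print(a_hat, b_hat)  # recovers a ≈ 2.0, b ≈ 0.5
```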

**Classification**

Classification is the process of predicting a categorical (class) outcome from a set of independent variables.
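
A toy illustration: a nearest-centroid classifier labels a new observation with the class whose average feature vector is closest. The risk labels and feature values below are invented for the example:

```python
import numpy as np

# Hypothetical features per location: (corrosion rate, years since inspection).
X_train = np.array([[0.2, 1.0], [0.3, 2.0], [0.1, 1.5],   # low-risk (class 0)
                    [2.5, 8.0], [3.0, 9.0], [2.8, 7.5]])  # high-risk (class 1)
y_train = np.array([0, 0, 0, 1, 1, 1])

# Nearest-centroid classifier: compute each class's mean feature vector,
# then predict the class whose centroid is closest to a new observation.
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

print(predict(np.array([0.25, 1.2])), predict(np.array([2.7, 8.2])))  # 0 1
```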

**Support Vector Machines**

Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
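
As a sketch of the idea, a linear SVM can be trained by sub-gradient descent on the hinge loss; real projects would use a mature library, and the two clusters below are synthetic:

```python
import numpy as np

# Two small synthetic clusters, labeled -1 and +1.
X = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 2.5],
              [5.0, 6.0], [6.0, 5.0], [5.5, 6.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

w, b = np.zeros(2), 0.0
lam, lr = 0.01, 0.01  # regularization strength, learning rate
for _ in range(1000):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) < 1:        # margin violated: hinge sub-gradient
            w -= lr * (lam * w - yi * xi)
            b += lr * yi
        else:                            # regularization shrinkage only
            w -= lr * lam * w

def predict(x):
    return 1 if w @ x + b > 0 else -1

print([predict(xi) for xi in X])
```

The learned weights define a maximum-margin separating line between the two clusters.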

**Pattern Recognition**

Pattern Recognition is a data analysis method that uses machine learning algorithms to automatically recognize patterns and regularities in data.

**Bayesian Modeling**

Bayesian models are statistical models that make probability-based inferences from data. Unlike other statistical models, however, Bayesian models enable one to specify a prior distribution on the model parameters of interest. This prior distribution represents any beliefs that we have about how the model should behave. As an example, a prior distribution can encode the beliefs of a subject matter expert. This prior distribution serves to regularize the statistical model and is especially useful when working with a relatively small amount of data.
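
A small worked example of the prior-to-posterior update, using hypothetical numbers: suppose an SME believes a weld defect rate is around 10%, encoded as a Beta(2, 18) prior. Conjugate Beta-Binomial updating then blends that belief with inspection results:

```python
# SME prior belief: defect rate ~ Beta(2, 18), with mean 2 / (2 + 18) = 0.10.
prior_a, prior_b = 2.0, 18.0

# Hypothetical inspection results: 3 defects found in 50 welds.
defects, inspected = 3, 50

# Conjugate update: add observed defects to a, non-defects to b.
post_a = prior_a + defects
post_b = prior_b + (inspected - defects)
posterior_mean = post_a / (post_a + post_b)

print(round(posterior_mean, 4))  # ≈ 0.0714
```

The posterior mean sits between the prior belief (0.10) and the raw observed rate (0.06), with the data pulling harder as more welds are inspected.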

**Graph Theory**

Graph Theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). Graph theory is used, for example, to find the shortest path in a road or computer network.
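
For instance, the classic shortest-path computation (Dijkstra's algorithm) needs only the standard library; the small network below is hypothetical:

```python
import heapq

# Hypothetical network: node -> list of (neighbor, distance) edges.
graph = {
    "A": [("B", 4), ("C", 2)],
    "B": [("D", 5)],
    "C": [("B", 1), ("D", 8)],
    "D": [],
}

def shortest_distance(graph, start, goal):
    """Dijkstra's algorithm: expand the closest unvisited node first."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[node]:
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return float("inf")

print(shortest_distance(graph, "A", "D"))  # 8, via A -> C -> B -> D
```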

**Resampling**

Resampling refers to a variety of methods, such as estimating the precision of sample statistics using subsets of available data, drawing randomly with replacement from a set of data points, running permutation tests, and validating models using random subsets. These methods are useful when data is limited or contains gaps.

- Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.
- Jackknifing is used in statistical inference to estimate the bias and standard error (variance) of a statistic, when a random sample of observations is used to calculate it.
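
The bootstrap idea can be sketched in a few lines. The corrosion-rate sample below is made up for illustration, and the interval bounds depend on the random seed:

```python
import numpy as np

# Hypothetical small sample of measured corrosion rates (mils/year).
rng = np.random.default_rng(42)
sample = np.array([1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4, 1.0])

# Resample with replacement many times, recording the mean of each resample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

# The spread of the bootstrap means gives a 90% confidence interval.
lo, hi = np.percentile(boot_means, [5, 95])
print(f"90% CI for the mean rate: ({lo:.3f}, {hi:.3f})")
```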

## Applications of Data Science in Reliability

**Data-Driven Reliability Models**

An example of this type of model is Lifetime Variability Curves (LVCs). LVCs create a prediction of when assets are likely to fail based on a facility’s data. The data obtained is usually provided in a wide range of formats and requires the use of Bayesian techniques to help Subject Matter Experts (SMEs) steer the models to something fit for use.

## Big Data in Mechanical Integrity: The Next Generation of Corrosion Models

**Condition Monitoring Location (CML) Prioritization**

Unlike traditional CML Optimization methods, CML Prioritization takes degradation rates, historical inspection data, maintenance histories, and the uncertainty of the facility’s data into account. As a part of this method, a LVC is created for every CML to predict a range of dates in which the CML is likely to fail. From this information, CML Prioritization recommends which CMLs require inspection activities in the near-term as well as those that can be explored at some future date.

## CML Optimization Pilot Project Helps Refinery Reduce Risk and Identify Minimum Reduced Inspection Spend of $384K

**Task Optimization**

A facility must complete a variety of tasks such as inspections, repairs, and maintenance activities to keep its assets functioning properly. Task Optimization (TO) finds the “optimal” set of tasks that should be done and tells a facility when to do them to reduce downtime and costs and improve overall facility availability and throughput. TO is a deliverable of Quantitative Reliability Optimization (QRO) and leverages multiple data science techniques such as Bayesian Modeling and LVCs to create a regression model.

**Throughput Modeling**

This deliverable of QRO utilizes graph theory, network analysis, and Bayesian modeling to analyze a facility’s assets, how they are connected, and how much product they can produce. Additionally, throughput modeling predicts when assets are likely to fail and need to be repaired, and forecasts how much total product the facility can expect to produce over time. This can ultimately help facilities look for bottlenecks that are hurting their production capacity and make improvements as needed.

**Natural Language Processing**

Natural Language Processing (NLP) and classification can be leveraged for a broad range of applications such as:

- Inspection Grading
- Quality Checks
- U-1 Form Mining
- Incident Log Analysis

NLP can improve inspection grading by eliminating the need for people to read reports and make subjective judgments about inspection quality. More broadly, NLP applies machine learning to all kinds of document-related tasks, like mining U-1 forms. In addition to being tiresome for humans, these tasks can also be prone to error. Leveraging data science techniques to have a computer do them automatically is a major win for both cost and accuracy.
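
As a deliberately simplified toy, keyword scoring hints at how text classification can grade reports; production NLP would use tokenization, feature extraction such as TF-IDF, and a trained model, and the terms and report snippets below are invented:

```python
# Toy sketch: score an inspection report against two hypothetical keyword
# sets to decide whether it needs follow-up.
PASS_TERMS = {"acceptable", "no", "nominal", "within"}
FLAG_TERMS = {"pitting", "crack", "loss", "exceeds"}

def grade(report):
    words = report.lower().split()
    pass_score = sum(w in PASS_TERMS for w in words)
    flag_score = sum(w in FLAG_TERMS for w in words)
    return "follow-up" if flag_score > pass_score else "pass"

print(grade("Wall thickness within nominal range, no defects found."))  # pass
print(grade("Severe pitting and wall loss exceeds limits."))  # follow-up
```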

## North American Refiner Leverages Data Science to Improve Quality Control in PMI Program

**Corrosion Modeling**

Data science can aid in the creation of unit studies, where data is used to estimate how quickly an asset is expected to degrade as a function of inputs like temperature, pressure, metallurgy, and stream information. It can also automatically assign damage mechanisms and show how subjective these assignments currently are when left up to the discretion of humans. These tools are leveraged to support Subject Matter Expert (SME) knowledge and create consistency in the industry.

**Anomaly Detection**

Imagine there is a rotating asset (like a pump), and it has a variety of sensors on it (like pressure, temperature, flow, etc.). When things are working normally, the sensors report certain streams of information. Data science can be used to figure out the baseline performance of an asset and can then sound the alarm when anomalies occur. Additionally, anomaly detection gives facilities the ability to identify what type of problem is occurring (like is the impellor failing, are the bearings shot, etc.) and then predict failure dates and recommend maintenance activities based on the specific problem.
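
A minimal baseline-and-threshold sketch, using made-up pump sensor readings: learn normal behavior from history, then flag readings that fall far outside it:

```python
import numpy as np

# Hypothetical historical sensor readings during normal pump operation.
baseline = np.array([50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 50.1, 49.7])
mu, sigma = baseline.mean(), baseline.std()

def is_anomaly(reading, k=3.0):
    """Flag readings more than k standard deviations from the baseline mean."""
    return abs(reading - mu) > k * sigma

print(is_anomaly(50.2), is_anomaly(57.5))  # False True
```

Real systems would combine many sensors and more sophisticated models, but the principle is the same: characterize normal, then alert on deviation.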

**Computer Vision**

Computer vision is concerned with extracting information from video and image data. Related problems include image segmentation, object detection, and structure from motion. In reliability, computer vision is used to quantify the level of damage and degradation on assets.

## Image Analytics POC Results in Inspection Planning Efficiencies

**Deep Learning**

Deep learning models (also called deep neural networks) are a family of machine learning models whose architectures are loosely inspired by the human brain. These models can be extremely complicated and often require a tremendous amount of data to train effectively. However, when sufficient training data is available, deep learning models can substantially outperform traditional machine learning models. Deep learning models have gained prominence in areas as diverse as image recognition, machine translation, and text generation. Google Translate and ChatGPT are both well-known examples of the power of deep learning.
