**Abstract**:

The paper discusses using machine learning methods to predict oil production, specifically implementing regression algorithms with polynomial features. Regression algorithms are effective in predicting oil production based on data-driven approaches. A synthetic dataset was created using the Buckley-Leverett mathematical model, which determines the saturation distribution in oil production problems. Input parameters such as porosity, viscosity, and permeability were used to predict the oil recovery factor. Testing on over 400,000 synthetic data points showed that linear regression underfits the data, while polynomial regression models accurately predicted the oil recovery factor. To prevent overfitting, L1 regularization was applied. The quadratic polynomial regression model achieved a coefficient of determination of 0.98, indicating strong predictive accuracy. The study concludes that machine learning methods can be useful for predicting the oil recovery factor in practical oil field scenarios.

**Keywords**:

machine learning, oil production, regression algorithms, lasso regularization

**DOI 10.24412/2712-8849-2024-473-596-614**

1. Introduction.

Oil is one of the most vital commodities, and its price and volatility significantly affect quality of life worldwide. Oil production forecasting is a critical aspect of the energy industry, as it enables companies to make informed decisions about their operations and investments. The impact of the oil price on daily life is apparent in a variety of areas, including everyday consumer goods such as tires, shampoo, and paint. Oil is the world's primary source of energy and heat, making it difficult to replace with alternative resources, and it has a significant impact on economic growth, from the production of everyday goods to the military and energy industries. Unexpected changes in the price of oil can affect the economies of both suppliers and consumers, although oil-importing nations are more at risk. Several researchers have concluded that higher volatility and lower predictability in the oil price are harmful, since they can have a detrimental impact on numerous economic indicators. Linear regression algorithms are among the most widely used techniques for forecasting oil production, as they are simple to implement and can provide accurate results when applied correctly. In recent years, there has been significant research and development in the use of linear regression algorithms for oil production forecasting, with a focus on improving the accuracy and reliability of the predictions.

2. Literature Review.

Over the past ten years, several efforts have been undertaken to apply machine learning techniques to model oil and gas output from unconventional resources. Most notably, artificial neural networks (ANNs) have demonstrated the ability to forecast production effectively using a number of techniques, and the effectiveness of principal component analysis (PCA) both as a method for extracting time-series features and as a forecasting tool has been established.
The application of machine learning and data-driven analytics to address issues in the market for unconventional gas and oil has grown in popularity [1]. According to the authors of [2], using machine learning (ML) algorithms may yield better results than performing conventional computations on a regular grid. A strategy for building a proxy model based on machine learning techniques, specifically the random forest method, was developed by the researchers in [3]. To collect oil production data for comparison against historical records and for forecasting future reservoir performance, they considered two fictitious cases that used reservoir simulation models to imitate real reservoirs. Several approaches have been proposed for forecasting oil production using linear regression algorithms. A popular method is to build a model that forecasts future production levels by applying linear regression to historical production data as input features. This method has been applied in several studies, including [4], which trained a linear regression model to predict oil production in shale reservoirs using historical data on oil production, wellbore pressure, and other pertinent variables. That research showed that linear regression methods are a useful tool for forecasting production levels in intricate geological formations. To increase the precision of linear regression models, researchers have also explored sophisticated data preprocessing methods in addition to historical production data. For instance, [5] presented a method for feature engineering and selection in the context of predicting oil production.
Their research showed that the accuracy of linear regression models can be greatly increased by carefully choosing and transforming input features, resulting in more accurate predictions of oil output. Extensive research on the application of linear regression algorithms to oil production forecasting has provided important insights into the potential advantages and drawbacks of this methodology. Numerous studies have emphasized the importance of integrating domain knowledge and experience during model development to guarantee the precision and relevance of the resulting forecasts. For instance, [6] stressed the importance of taking engineering and geological factors into account when using linear regression algorithms to predict oil output, since these can have a large impact on the reservoir's behavior and the accuracy of the projections. Researchers have also investigated how well various kinds of linear regression algorithms perform for projecting oil output. A study by [7] examined the accuracy and robustness of several linear regression variants, including ridge regression, lasso regression, and ordinary least squares. The results revealed that ridge and lasso regression are popular alternatives to ordinary least squares and can perform better in some situations, especially when handling multicollinearity and overfitting in the input data. While there has been significant progress in the application of linear regression algorithms for oil production forecasting, there are still opportunities for further research and innovation in this area. One potential direction for future work is the integration of machine learning techniques with linear regression models to enhance their predictive capabilities.
For instance, the use of ensemble methods such as random forests or gradient boosting, as suggested by [8], could enable more accurate and robust forecasting of oil production by leveraging the strengths of both traditional regression and modern machine learning approaches. Additionally, with the increasing availability of real-time production data and advances in sensor technology, there is potential for dynamic and adaptive forecasting models that continuously update their predictions based on new information. This could be particularly valuable in the context of rapidly changing market conditions and external factors that can impact oil production, such as geopolitical events and environmental regulations. As part of many of these initiatives, time-series problems have been solved using artificial neural networks (ANNs), specifically long short-term memory (LSTM) networks for forecasting month-by-month output rates. In [9], the use of machine learning for enhanced oil recovery (EOR) screening was discussed. To predict the appropriate group of candidates for enhanced oil recovery technologies, the authors considered a variety of machine learning techniques and deep artificial neural networks, with deep ANN and random forest models performing best, at an average accuracy of 90 percent. The researchers concluded that while machine learning provides strong indicators for the initial screening of EOR, it should not be the only prediction technique used. Their work demonstrates that screening of enhanced oil recovery methods using machine learning suffers from several issues, including insufficient input features, imbalanced noisy data, and a lack of sufficient data for generalized learning. The productivity of oil wells was calculated by the authors of [10] using an artificial neural network.
In that study, shape-related pseudo-skin in horizontal wells was calculated using ANNs and least squares support vector machines (LSSVM). Among the artificial neural network techniques, the developed LSSVM methodology provided the closest match to real data. The authors emphasized the efficiency of horizontal wells from a technical and financial standpoint, but found that the general studies aimed at precisely measuring this parameter had certain inherent limits, and instead focused on numerical smart models. According to [11], standard decline curve analysis (DCA) and reservoir simulation approaches are unable to predict unusual production variations, but LSTM and Prophet models can; for forecasting oil output over a shorter period, such as the upcoming year, ARIMA is a better option. A similar strategy, using raw production data from coastal Gulf of Mexico oil wells to forecast oil production over time, was demonstrated in [12]. To forecast the oil flow rate in oilfields in Iran's northern Persian Gulf, [13] presented a neural network. Most recently, an ANN was used to forecast the cumulative oil output over the next six and eighteen months for wells in North America's Bakken Shale, using engineering completion data and the well's location as predictor variables [14]. ANNs have also shown promise in regression-based production forecasting: instead of forecasting a time series month by month, the ANN predicts a cumulative production amount or production rate over a given future period. This strategy was illustrated by [15], which used ANN techniques to assess the intricate features of shale oil reservoirs in order to improve their production performance and to create new procedures and guidelines for the exploration and exploitation of shale oil resources.

3. Methods.
In this work, regression algorithms are applied to predict oil production in the field, which contributes to advance planning of oil production, reduces investment costs for field development, ensures stability, and increases oil production and the economic benefits of field development. The goal of this study is to create a regression algorithm that can accurately forecast the oil recovery factor from synthetic data, enabling the use of regression techniques to anticipate and enhance the effectiveness of oil recovery.

To build an algorithm that estimates the oil recovery factor with high accuracy, the following goals must be achieved:

- create synthetic data using an ensemble-of-scenarios approach based on the numerical 2D Buckley-Leverett model;
- apply machine learning techniques, such as linear regression and polynomial regression, to predict the oil recovery factor;
- evaluate the effectiveness of the machine learning algorithms to choose the best regression model;
- apply data augmentation to improve the results.

In this study, a training sample and a test sample were created from the synthetic data produced by the mathematical model. The machine learning model used four factors as input, and the oil recovery factor was the output parameter. The procedure for creating the machine learning model is shown in Fig. 1.

Fig. 1. Procedure for creating the machine learning model.

A synthetic data set was obtained from the mathematical model: absolute permeability $k$, porosity $\phi$, viscosity $\mu$, time iteration $t$, and the oil recovery factor. The oil recovery factor is represented as the objective function $y$, and the other four quantities are the features $x$:

$$D = \{(x^{(i)}, y^{(i)})\}, \quad i = 1, \dots, m, \quad (1)$$

where $x^{(i)}$ is the feature vector of the $i$-th training example, with

$$x^{(i)} = (k^{(i)}, \phi^{(i)}, \mu^{(i)}, t^{(i)}), \quad i = 1, \dots, m, \quad (2)$$

where $k^{(i)}$, $\phi^{(i)}$, $\mu^{(i)}$, $t^{(i)}$ are the absolute permeability, porosity, viscosity, and time iteration of the $i$-th example, and $m$ is the number of training examples ($m$ = 403,440).
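As a concrete illustration of assembling such a dataset and the 80/20 split used later in the study, here is a minimal NumPy sketch. The permeability bounds, the normalized time range, the sample size, and the placeholder target formula are all illustrative assumptions, not the paper's actual simulator output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the Buckley-Leverett ensemble output:
# each row holds (permeability k, porosity phi, viscosity mu, time t),
# and the target y is the oil recovery factor.
n = 1000
X = np.column_stack([
    rng.uniform(50, 500, n),    # absolute permeability k (range assumed)
    rng.uniform(0.1, 0.3, n),   # porosity, range from the paper
    rng.uniform(0.1, 0.5, n),   # oil viscosity, range from the paper
    rng.uniform(0.0, 1.0, n),   # normalized time iteration (assumed)
])
# A synthetic target in place of the simulator output (illustrative only).
y = 0.3 * X[:, 1] + 0.1 * X[:, 3] - 0.2 * X[:, 2] + rng.normal(0, 0.01, n)

# 80/20 train/test split, as in the study.
idx = rng.permutation(n)
cut = int(0.8 * n)
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]
```

In the actual study the rows would come from the 403,440 simulator outputs rather than from a random generator.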
Thus, $X$ is an $m \times (n+1)$ matrix, and the objective function $y$ is an $m \times 1$ vector. The regression model can be written as:

$$y^{(i)} = f(x^{(i)}) + \varepsilon^{(i)}, \quad i = 1, \dots, m, \quad (3)$$

where the model $f$ describes the pattern between $x$ and $y$, and $\varepsilon$ is the model error, which measures the discrepancies. Consider each of the methods in turn.

Linear regression. Linear regression is a supervised learning algorithm used to predict a continuous output variable based on one or more input variables. The algorithm finds the best-fitting linear relationship between the input variables and the output variable. The formula for simple linear regression is:

$$y = \beta_0 + \beta_1 x + \varepsilon, \quad (4)$$

where $y$ is the predicted output variable, $x$ is the input variable, $\beta_0$ is the y-intercept, $\beta_1$ is the slope of the line, and $\varepsilon$ is the error term. The coefficients $\beta_0$ and $\beta_1$ are estimated using the least squares method, minimizing the sum of squared differences between the observed values and the values predicted by the regression equation. For multiple linear regression with $n$ input variables, the formula is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon, \quad (5)$$

where $x_1, x_2, \dots, x_n$ are the input variables and $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients for each input variable. The algorithm aims to find the coefficient values that best fit the data and minimize the error term $\varepsilon$.

To evaluate regression models, a quadratic loss function is often chosen. The coefficient of determination $R^2$ was also used to evaluate the results; it measures how well the observed results are reproduced by the model, based on the proportion of the total variation in the results explained by the model. The mean squared error (MSE) is often used as an estimate of the loss between the target and the predicted function:

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2. \quad (6)$$

Using linear regression, the model was trained with the four input parameters and the oil recovery factor. The trained model then predicts the value of the oil recovery factor on test data. Although multiple linear regression is very simple, the model has several advantages.
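The least-squares fit of eq. (5) and the metrics of eq. (6) and $R^2$ can be sketched with NumPy alone. The feature values, true coefficients, and noise level below are illustrative assumptions, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (500, 4))            # four input features (illustrative)
y = X @ np.array([0.4, -0.3, 0.2, 0.1]) + 0.05 + rng.normal(0, 0.01, 500)

# Fit y = b0 + b1*x1 + ... + b4*x4 by least squares (eq. 5).
A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

y_pred = A @ coef
mse = np.mean((y - y_pred) ** 2)                                  # eq. (6)
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)  # R^2
```

Since the toy target is genuinely linear, both metrics come out strong here; on the paper's data the same code would expose the underfitting discussed in the results.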
The linear regression model frees the engineer from needing deep physics knowledge in this study. The model trains well and is highly interpretable, since all independent variables of the multiple regression directly affect the target function; consequently, the influence of the input parameters is easily detected and visualized.

Polynomial regression. Polynomial regression is a type of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n$-th degree polynomial. For degree 2, the polynomial regression model takes the form:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2, \quad (7)$$

where $y$ is the predicted value, $x$ is the independent variable, and $\beta_0, \beta_1, \beta_2$ are the coefficients of the model. The goal of polynomial regression is to find the best-fitting curve that minimizes the sum of the squared differences between the actual and predicted values. This allows nonlinear relationships between variables to be captured, yielding more accurate predictions than linear regression.

Lasso regression. Lasso regression is a type of linear regression that includes a regularization term to prevent overfitting by penalizing the absolute values of the regression coefficients. The Lasso problem can be written as:

$$\text{minimize: } \mathrm{RSS} + \lambda \sum_i |\beta_i|, \quad (8)$$

where RSS is the residual sum of squares, $\lambda$ is the regularization parameter that controls the strength of the penalty on the coefficients, and $\beta_i$ are the regression coefficients to be estimated. Lasso regression can be applied to various degrees of polynomial regression: linear, quadratic, cubic, etc.

Linear regression: in this case, Lasso minimizes the residual sum of squares (RSS) plus the penalty term $\lambda$ multiplied by the sum of the absolute values of the coefficients.
The objective function can be formulated as $\text{minimize: } \mathrm{RSS} + \lambda \sum_i |\beta_i|$, where $\beta_i$ are the coefficients, the sum is taken over all predictors, and $\lambda$ is the regularization parameter.

Polynomial regression: in polynomial regression the model involves higher-degree polynomial terms. For example, a quadratic regression with one independent variable $x$ has the form:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon. \quad (9)$$

The Lasso objective for polynomial regression is the same as in the linear case, $\mathrm{RSS} + \lambda \sum_i |\beta_i|$, but includes the higher-order terms of the predictors.

Cubic regression: similarly, in cubic regression the model involves polynomial terms up to degree 3, and the Lasso objective is as described above, with the additional cubic terms included in the model.

The Lasso technique performs variable selection and regularization, potentially reducing overfitting and leading to a more interpretable model. Note that while Lasso can effectively perform feature selection, it tends to zero out some coefficients completely, leading to a simpler model with fewer features.

4. Results and discussion.

The dataset was generated synthetically using an ensemble of scenarios based on the 2D Buckley-Leverett model. As input parameters, various combinations of the parameters of the oil production problem (porosity, viscosity of the oil phase, absolute rock permeability, and time iteration) were taken (Table 1), and the value of the oil recovery factor was chosen as the output parameter. Thus, in this work, the number of sample pairs is 41 × 41 × 6 = 10,086. Using the Buckley-Leverett model, 6 synthetic data packets were generated for different permeability values. Each data packet contains the values of viscosity, porosity, and oil recovery factor (counting the data for each time layer, the total amount of data is 403,440).
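Before turning to the detailed results, the degree-2 polynomial fit of eqs. (7) and (9) can be sketched as ordinary least squares on an expanded design matrix. The quadratic ground-truth curve and noise level here are illustrative assumptions standing in for the simulator data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 300)
# Illustrative ground truth (not from the paper): a quadratic curve
# standing in for the recovery-factor-vs-feature relationship.
y = 0.5 + 1.2 * x - 0.8 * x**2 + rng.normal(0.0, 0.02, 300)

# Degree-2 design matrix [1, x, x^2] and least-squares fit (eqs. 7/9).
A = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

y_pred = A @ coef
r2 = 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

The same expansion generalizes to cubic fits by appending an `x**3` column, which is where the L1 penalty of eq. (8) becomes useful for shrinking superfluous terms.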
Oil viscosity varies in the range 0.1–0.5, porosity in the range 0.1–0.3, with several permeability options.

Table 1. Input parameters.

The data were divided into a training and a test set. For training, 8,069 sets (80%) of the total data were used, and for testing the remaining 2,017 pairs (20%). Python was chosen as the runtime environment for machine learning. As mentioned earlier, the total number of sample pairs is 10,086; each sample pair consists of 40 oil recovery factor values. The outcomes of multiple linear regression and polynomial regression predictions are presented below, based on a sample size exceeding 10,000 pairs; however, outcomes are shown for only a few selected sample pairs.

Fig. 2. Regression algorithms to model the oil recovery: (a) linear regression prediction; (b) polynomial regression prediction.

Fig. 3. Regression algorithms to model the oil recovery: (a) quadratic regression prediction; (b) cubic regression prediction with Lasso regularization.

Fig. 4. Regression algorithms to model the oil recovery: (a) quadratic regression prediction; (b) cubic regression prediction with Lasso regularization, for other test pairs.

Fig. 5. Regression algorithms to model the oil recovery: (a) quadratic regression prediction; (b) cubic regression prediction with Lasso regularization (after augmentation).

The outcomes for one test sample pair under the linear and quadratic polynomial regression approaches are displayed in Fig. 2. Polynomial regression (PR) adds to the model's complexity: the degree of the polynomial, that is, the desired model, must be selected before training with polynomial features. Increasing the model's degree to cubic polynomial regression with L1 regularization yields the outcomes shown in Fig. 3. An L1-type regularization with the optimal value of $\lambda$ was used to enhance the cubic model.
For the remaining test data pairs, Fig. 4 displays quadratic and cubic polynomial regression with Lasso regularization. The final outcome for quadratic and cubic regression with Lasso regularization following data augmentation is displayed in Fig. 5.

The test results show that the model captures more of the data and outperforms linear regression as the polynomial degree grows. Because overfitting causes the coefficient of determination to decline on the majority of the test data, L1-type regularization is added to the quadratic and cubic regression. In certain situations where the oil recovery factor is small, the model matches the test data quite closely. We also see that the additional regularization and data augmentation contribute to lower MSE and higher coefficients of determination.

Table 2. Average MSE scores on the test set. Table 3. Average $R^2$ scores on the training and test sets.

The mean squared error (MSE) and the coefficient of determination $R^2$ were used to evaluate the machine learning regression methods. Table 2 shows the average MSE score over the entire test set (20% of the data), and Table 3 shows the average $R^2$ score for the training (80%) and test (20%) sets. We can see from Fig. 2 that not all patterns in the data are captured by the predicted linear regression function; consequently, the linear regression model underfits. It is also evident that a polynomial model fits the data more effectively than a linear model. The MSE estimates, which show a decrease in the polynomial regression's error, support this, and the determination coefficient $R^2$ has grown relative to the linear model. When combined with Lasso regularization, a cubic model forecasts the data more accurately than a quadratic model (Fig. 3), and its $R^2$ determination coefficient rose relative to the quadratic model. For this test pair, the cubic model is therefore the most suitable.
However, this pattern holds only for this pair; the outcomes may differ for the remaining pairs in the full test sample. This is demonstrated, for instance, in Fig. 4, where the quadratic model with Lasso regularization trains far more effectively and makes predictions extremely close to the test data in both scenarios. This is because, in our situation, the high variance of the cubic model with Lasso regularization causes overfitting. Compared to the plain cubic model with Lasso regularization, the quadratic model with Lasso regularization predicts the function for all test data fairly well once data augmentation is applied, as Table 3 makes clear. According to the results on our synthetic data, one of the best models for predicting the oil recovery factor across every test pair was the quadratic polynomial regression. The $R^2$ score was increased by implementing cubic and quadratic polynomial regression models with tuned L1-type regularization. We also found that the cubic Lasso regression model functions well and provides a decent compromise between prediction performance and complexity.

A feature of the considered methods is the use of regression algorithms on generated data, obtained from runs of the Buckley-Leverett mathematical model using an ensemble of scenarios. This paper discusses a data-driven approach that can predict the output parameter very accurately using big data; a slight disadvantage of this method, however, is the difficulty of interpretation. Therefore, as a development of this study, there is motivation to consider scientific machine learning based on physical modeling, which takes the physics into account. In future work, it is planned to conduct research in the direction of physics-informed neural networks (PINN) for solving problems of fluid flow in porous media.
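The Lasso-regularized polynomial regression compared throughout this section can be sketched end to end with a small NumPy-only solver. The iterative soft-thresholding routine below is a generic stand-in for whatever Lasso solver the study used, and the cubic toy data (a quadratic true signal plus an extra cubic feature for L1 to shrink) are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty used in Lasso."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=3000):
    """Minimize (1/2m)||y - Xb||^2 + alpha*||b||_1 by iterative
    soft-thresholding (ISTA), a simple stand-in for a library solver."""
    m, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / m   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / m
        b = soft_threshold(b - grad / L, alpha / L)
    return b

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 400)
# Cubic feature expansion [x, x^2, x^3]; the true signal (illustrative,
# not the paper's data) is quadratic, so L1 should shrink the cubic term.
F = np.column_stack([x, x**2, x**3])
F = (F - F.mean(0)) / F.std(0)          # standardize features
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.05, 400)
yc = y - y.mean()                       # center target; intercept = y.mean()

b = lasso_ista(F, yc, alpha=0.02)
r2 = 1.0 - np.sum((yc - F @ b) ** 2) / np.sum(yc ** 2)
```

Sweeping `alpha` here plays the role of tuning $\lambda$ in eq. (8): larger values push the superfluous cubic coefficient toward zero, mirroring the variance reduction the study attributes to L1 regularization.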
5. Conclusion.

Oil and other fossil fuels are currently the most significant energy sources. They are widely used in a number of commercial and industrial areas; however, the production and planning involved are unique and difficult. In this paper, the dataset was generated synthetically using the scenario-ensemble method from the Buckley-Leverett mathematical model, where for different oil input parameters different output values of the oil recovery factor were obtained. More than 400,000 synthetic data points were used to train the regression methods.

The oil recovery factor was predicted using regression algorithms. To enhance the linear model, quadratic polynomial regression was implemented and tested. To evaluate the quality of the considered regression methods, the mean squared error and the coefficient of determination $R^2$ were used. It was found that for some test pairs where the recovery factor is small, the polynomial model predicts the test data accurately, with $R^2$ = 0.98. For our synthetic data, quadratic polynomial regression with Lasso regularization (after data augmentation) proved to be the best model for predicting the oil recovery factor across all test pairs. In future work, it is planned to add noise in the form of practical data from real oil fields.


**For citation**:

* Kassenova A. Oil production forecasting using regression algorithms // Вестник науки. 2024. No. 4 (73), vol. 3. pp. 596–614. ISSN 2712-8849 // https://www.вестник-науки.рф/article/14001 (accessed: 05.11.2024) *


Вестник науки СМИ ЭЛ № ФС 77 - 84401 © 2024. 16+
