Automated Prediction and Management of Logical Volumes

Topics & Technologies

  • Linux
  • LVM (Logical Volume Management)
  • Filesystems
  • Python
  • PostgreSQL
  • Ansible
  • Data science
  • LSTM Recurrent Networks
  • Deep Learning
  • Server Automation

Project description

Logical volumes represent distinct partitions within a storage system, providing a flexible way to manage and allocate storage space. Unlike traditional partitions tied directly to physical disks, logical volumes can span multiple disks and offer features such as resizing and snapshotting. They serve as the building blocks for organizing data efficiently within a storage environment, enabling administrators to allocate resources dynamically and optimize storage usage according to changing needs.

The essence of a robust system infrastructure lies in efficient resource management. In today's dynamic data storage and processing environment, the ability to adaptively adjust logical volume sizes is crucial for ensuring seamless operations and avoiding performance bottlenecks. APMLV (Automated Prediction and Management of Logical Volumes) redefines resource optimization by employing intelligent resizing strategies.

The resource management process executes at a regular interval of one hour, while data on the used space of each logical volume is extracted every 10 minutes. This frequent collection ensures up-to-date information is available for analysis.

Additionally, predefined thresholds are established to trigger resizing actions. Specifically, if the used space of any logical volume within a volume group surpasses 80%, the system initiates resizing procedures for that volume group. This proactive approach helps prevent potential performance issues and ensures optimal resource utilization across the volume group (a minimal sketch of this check follows the module list). This project consists of four main modules:

  • Data Extraction Module: Collects logical volume usage data from various sources and stores them in a PostgreSQL database
  • Data Analysis and Preprocessing Module: Cleans and transforms the collected data using Python, preparing it for modeling
  • Deep Learning Module: Trains an LSTM model for time series forecasting and makes predictions on future usage of logical volumes
  • Automation Module: Uses Ansible to adjust the logical volume configuration based on the predictions and improve the performance and efficiency of the system
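
As referenced above, the 80% threshold check itself is straightforward; here is a minimal sketch in Python, where the (used, size) pairs are an assumed input format:

    THRESHOLD = 0.80

    def volume_group_needs_resize(volumes):
        # volumes: iterable of (used_bytes, size_bytes) pairs for one volume group
        return any(used / size > THRESHOLD for used, size in volumes)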

In our system architecture, APMLV operates independently on a separate machine dedicated to performing all necessary calculations and executing actions on other machines or servers. This separation of functionality enhances system efficiency and allows for centralized management of logical volume resources.

Data Extraction Module

The data extraction module operates on a regular interval of every 10 minutes, ensuring timely updates of logical volume usage data. This frequency enables the module to capture changes in usage patterns effectively, providing up-to-date information for analysis and decision-making.

Furthermore, the module utilizes various sources to collect comprehensive data on logical volume usage. This includes direct querying of system utilities such as lvs and lsblk.
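
For illustration, a minimal sketch of this collection step in Python, assuming the JSON report output of lvs and lsblk and using statvfs for used-space figures (the helper names are hypothetical):

    import json
    import os
    import subprocess

    def logical_volumes():
        # List logical volumes with sizes in bytes from LVM's JSON report
        out = subprocess.run(
            ["lvs", "--reportformat", "json", "--units", "b", "--nosuffix",
             "-o", "lv_name,vg_name,lv_size"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(out)["report"][0]["lv"]

    def mountpoints():
        # Map device names to mountpoints from lsblk's JSON output
        out = subprocess.run(
            ["lsblk", "--json", "-o", "NAME,MOUNTPOINT"],
            capture_output=True, text=True, check=True,
        ).stdout
        found = {}

        def walk(nodes):
            for node in nodes:
                if node.get("mountpoint"):
                    found[node["name"]] = node["mountpoint"]
                walk(node.get("children") or [])

        walk(json.loads(out)["blockdevices"])
        return found

    def used_bytes(mountpoint):
        # Used space of the filesystem mounted at the given path
        st = os.statvfs(mountpoint)
        return (st.f_blocks - st.f_bfree) * st.f_frsize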

Once collected, the extracted data is stored in a PostgreSQL database, which serves as a centralized repository for storage usage information. Storing the data in a structured database format enables efficient data management, querying, and analysis, facilitating the extraction of insights and the generation of reports. A sketch of the database schema follows:
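
Table and column names here are illustrative assumptions rather than the project's exact schema; the essential structure is one table of volume metadata and one table of time-stamped usage samples:

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS logical_volumes (
        lv_id    SERIAL PRIMARY KEY,
        host     TEXT NOT NULL,
        vg_name  TEXT NOT NULL,
        lv_name  TEXT NOT NULL,
        priority INTEGER NOT NULL DEFAULT 1,
        UNIQUE (host, vg_name, lv_name)
    );

    CREATE TABLE IF NOT EXISTS usage_samples (
        lv_id      INTEGER NOT NULL REFERENCES logical_volumes (lv_id),
        sampled_at TIMESTAMPTZ NOT NULL,
        used_bytes BIGINT NOT NULL,
        size_bytes BIGINT NOT NULL,
        PRIMARY KEY (lv_id, sampled_at)
    );
    """

    # Both statements run in a single transaction (connection details assumed)
    conn = psycopg2.connect(dbname="apmlv", user="apmlv")
    with conn, conn.cursor() as cur:
        cur.execute(DDL)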

Data Analysis and Preprocessing Module

This module is responsible for preparing the collected data for further analysis and modeling. It performs a series of tasks that clean and transform the raw data into a format suitable for time series analysis and deep learning.

Time series analysis requires the data to be structured sequentially, with each data point associated with a specific time index. To capture temporal dependencies and patterns, the module incorporates a lookback window, which defines the historical context considered for each observation: how far back in time the model looks when making a prediction.

In our case the lookback window is 6: data is collected at 10-minute intervals and the model predicts one hour ahead, so six past observations cover exactly one hour of history.
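
A minimal sketch of the sliding-window construction, assuming a one-dimensional usage series per logical volume (the helper name is illustrative):

    import numpy as np

    def make_windows(series, lookback=6):
        # X[i] holds 6 consecutive past samples; y[i] is the sample that follows
        X, y = [], []
        for i in range(len(series) - lookback):
            X.append(series[i:i + lookback])
            y.append(series[i + lookback])
        X = np.array(X)[..., np.newaxis]  # shape (samples, lookback, 1) for the LSTM
        return X, np.array(y)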

Deep Learning Module

The deep learning module leverages LSTM (Long Short-Term Memory) networks to forecast the future usage of logical volumes. LSTM networks are a type of recurrent neural network (RNN) that excel at capturing long-term dependencies in sequential data. Here's a simplified breakdown of how the LSTM model works:

  • Long-term Memory Cells: LSTMs have memory cells that can maintain information over long sequences. This ability helps capture trends and patterns in storage usage data that traditional methods might overlook.
  • Gates: LSTMs have mechanisms called gates (input, forget, and output gates) that regulate the flow of information into and out of the memory cells. This gating mechanism allows LSTMs to decide what information to keep or discard, making them adept at handling sequences with long-range dependencies.
  • Training: The LSTM network is trained using historical storage usage data. During training, the network learns the underlying patterns and correlations in the data, enabling it to make accurate predictions.

Each logical volume is associated with its own trained LSTM model, enabling tailored predictions and proactive storage management. This individualized approach allows APMLV to account for unique usage patterns, trends, and requirements specific to each logical volume.
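
The exact architecture and hyperparameters are not detailed here; a minimal per-volume model sketch using Keras, with illustrative layer sizes, might look like this:

    from tensorflow import keras

    def build_model(lookback=6):
        # One LSTM layer plus a dense head predicting the next usage value
        model = keras.Sequential([
            keras.Input(shape=(lookback, 1)),
            keras.layers.LSTM(32),
            keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        return model

    # One model per logical volume, trained on that volume's own history:
    # X has shape (samples, lookback, 1), y has shape (samples,)
    # model = build_model(); model.fit(X, y, epochs=50, batch_size=32)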

Here's a plot of a portion of the training and testing data for one of the logical volumes (a 1 GB volume):

  • Training data: The blue line represents the actual usage data, while the orange line represents the predicted usage data.
  • Testing data: The green line represents the actual usage data, while the red line represents the predicted usage data.

As shown in the plot, the LSTM model captures the underlying patterns and trends in the data, enabling accurate predictions of future usage.

Decision-making Process

The allocation of logical volumes within the system follows a multi-step process that ensures efficient utilization of resources while maintaining balance and fairness across the volume group. Here's a breakdown of each step:

  1. Predicting Future Usage: Utilizing LSTM Recurrent Networks, the system forecasts the future usage of each logical volume. This prediction serves as a baseline for understanding the expected growth or reduction in storage requirements over time.
  2. Historical Proportion Analysis: To maintain balance within the volume group, the historical proportion of each logical volume's usage relative to its neighbors is calculated. This analysis ensures that resources are distributed fairly among logical volumes, preventing any single volume from disproportionately consuming resources and causing bottlenecks.

    The MP (i.e., Mean Proportion) for a logical volume over a period of time T is determined by the formula:

    MP = \frac{1}{T} \sum_{t=1}^{T} \frac{U_t}{\sum_{i=1}^{n} U_{i,t}}

    Where:

    • n: The number of logical volumes in the volume group.
    • U_t: The usage of the logical volume under consideration at time t.
    • U_{i,t}: The usage of the i-th logical volume in the volume group at time t.
    • T: The period of time over which usage is collected and predictions are made (in our case 6, since data is collected every 10 minutes and predictions are made one hour ahead).
  3. Priority Factor Integration: The priority factor, representing the importance or urgency of each logical volume, is integrated into the allocation factor calculation. Logical volumes with higher priority factors receive preferential treatment in the allocation process, ensuring that critical volumes receive adequate resources.

    The Mean Proportion-to-Priority Factor P_i of logical volume i is computed from its Mean Proportion and its priority:

    P_i = \frac{MP_i}{Priority_i \times Count(Priority_i)}

    Where:

    • Priority_i: Numerical value representing the priority level of logical volume i.
    • Count(Priority_i): The number of logical volumes sharing the same priority level.

    To scale the Mean Proportion-to-Priority Factors P_i so that they sum to 1, the scaled factor b_i is computed as:

    b_i = \frac{p_i}{\sum_{j=1}^{n} p_j}

    Where:

    • n: The number of logical volumes in the volume group.
    • L: The list of Mean Proportion-to-Priority Factors [p_1, p_2, ..., p_n].
    • S: The scaled Mean Proportion-to-Priority Factors [b_1, b_2, ..., b_n].
  4. Demand-to-Space Ratio: The allocation factor is determined by scaling the cumulative demands of all logical volumes relative to the available free space in the volume group. This ensures that resources are allocated efficiently, taking into account both the demand for storage and the available capacity. To determine the free allocation space in the volume group, we first calculate the allocation/reclaim size for each logical volume:

    AR_i = (d_i + p_i) \times m_i

    Where:

    • p_i: 20% of the model's prediction, added as a free-space buffer.
    • d_i: The difference between the prediction and the current filesystem size of the logical volume.
    • m_i: The scaled Mean Proportion-to-Priority Factor of the logical volume.

    Then we sum the results to obtain TAR (i.e., Total Allocations/Reclaims):

    TAR = \sum_{i=1}^{n} AR_i

    Finally, the allocation factor is calculated as the ratio of the volume group's free space to the total demand:

    AF = \frac{FreeSpace_{VG}}{TAR}

The final allocation/reclaim of a logical volume is determined by comparing the allocation/reclaim size with the allocation/reclaim size scaled by the allocation factor, which are calculated as below:

  • Allocation/Reclaim Size Calculation: The allocation/reclaim size for each logical volume is calculated from the difference between the predicted usage and the current filesystem size. This calculation takes into account the scaled Mean Proportion-to-Priority Factor of each volume and ensures that allocation/reclaim decisions reflect both volume importance and historical usage patterns.

    • For a negative allocation/reclaim size (space is reclaimed):

      Final = \max(AR, ARSAF)

    • For a positive allocation/reclaim size (space is allocated):

      Final = \min(AR, ARSAF)

    Where:

    • AR: The allocation/reclaim size.
    • ARSAF: The allocation/reclaim size scaled by the allocation factor (ARSAF = AR \times AF).

    In both cases the candidate with the smaller magnitude is chosen, so a volume never allocates or reclaims more than its scaled share when free space is constrained.
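
Putting the steps above together, here is a compact Python sketch of the decision computation (a reconstruction that follows the formula forms given above, not the project's actual code; function names are illustrative):

    import numpy as np

    def mean_proportions(usage):
        # usage: shape (T, n) -- usage of the n volumes at each of T time steps
        totals = usage.sum(axis=1, keepdims=True)
        return (usage / totals).mean(axis=0)            # MP per volume

    def scaled_priority_factors(mp, priorities):
        # P_i = MP_i / (priority * count(priority)), then scaled to sum to 1
        counts = {p: priorities.count(p) for p in set(priorities)}
        p = np.array([m / (pr * counts[pr]) for m, pr in zip(mp, priorities)])
        return p / p.sum()                              # b_i

    def allocation_sizes(pred, fs_size, m):
        d = pred - fs_size                              # d_i
        buf = 0.20 * pred                               # p_i: 20% buffer
        return (d + buf) * m                            # AR_i

    def final_adjustments(ar, vg_free):
        tar = ar.sum()                                  # TAR
        af = vg_free / tar if tar != 0 else 1.0         # allocation factor
        arsaf = ar * af
        # reclaim (negative AR): take the max; allocation (positive AR): the min
        return np.where(ar < 0, np.maximum(ar, arsaf), np.minimum(ar, arsaf))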

Upon calculating the necessary adjustments for each logical volume, these changes are stored in the database and communicated to the automation module for execution.

The automation module, powered by Ansible, applies the resizing actions for every host using SSH connections, ensuring that the changes are implemented across the system in a secure and efficient manner.
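
For illustration, the module could hand the computed plan to a playbook via extra variables; the playbook and inventory names here are hypothetical, and such a playbook could apply each change with the community.general.lvol module (which supports resizing the filesystem along with the volume):

    import json
    import subprocess

    # Hypothetical resize plan passed to an Ansible playbook as extra vars
    plan = {"lv_resizes": [
        {"host": "db01", "vg": "vg0", "lv": "lv_data", "size": "12g"},
    ]}
    subprocess.run(
        ["ansible-playbook", "-i", "inventory.ini", "resize.yml",
         "--extra-vars", json.dumps(plan)],
        check=True,
    )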

Supported Filesystems

The APMLV (Automated Prediction and Management of Logical Volumes) system supports a variety of filesystems for logical volumes, providing flexibility and compatibility with different storage requirements. The supported filesystems include:

  • ext2
  • ext3
  • ext4
  • vFAT
  • F2FS
  • XFS (experimental)
  • Btrfs (experimental)
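
Resize tooling and constraints differ per filesystem, which the automation layer has to account for; the mapping below is an illustrative summary rather than the project's actual code:

    # Per-filesystem resize tooling; capabilities noted as comments
    RESIZE_TOOLS = {
        "ext2":  "resize2fs",                 # grows and shrinks (shrink offline)
        "ext3":  "resize2fs",
        "ext4":  "resize2fs",                 # online grow, offline shrink
        "vfat":  "fatresize",                 # from the fatresize package
        "f2fs":  "resize.f2fs",               # grow only
        "xfs":   "xfs_growfs",                # grow only: XFS cannot be shrunk
        "btrfs": "btrfs filesystem resize",   # online grow and shrink
    }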

Future improvements

The APMLV system is designed to evolve and adapt to changing storage environments and requirements. Future improvements and enhancements might include:

  • Multi-threaded Processing:

    • Description: Implementing threads to assign each volume group to a separate thread can significantly speed up processing by allowing parallel execution of tasks (see the sketch after this list).
    • Benefits: Improved efficiency and reduced processing time.
  • Volume group size adjustments:

    • Description: Expanding the model to incorporate adjustments for volume group sizes, enabling dynamic resizing based on workload requirements and resource availability.
    • Benefits: Greater flexibility and adaptability to changing system demands.
  • Incremental Learning for LSTM Networks:

    • Description: Implementing incremental learning techniques for LSTM (Long Short-Term Memory) networks, allowing the model to continuously learn from new data without retraining the entire model.
    • Benefits: Improved model accuracy over time, adaptability to evolving usage patterns, and reduced computational overhead for frequent updates.
  • User Interface for Monitoring & Visualization:

    • Description: Developing a user-friendly interface that provides real-time monitoring and visualizations of logical volumes, allocation/reclaim metrics, and historical trends.
    • Benefits: Enhanced user experience, easier data interpretation, and quicker decision-making capabilities for administrators and users.
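
One possible shape for the multi-threaded improvement referenced above, where process_volume_group is a hypothetical function covering extraction, prediction, and decision-making for a single volume group:

    from concurrent.futures import ThreadPoolExecutor

    def process_volume_group(vg):
        # Hypothetical placeholder: extract, predict, and resize one volume group
        ...

    def process_all(volume_groups, max_workers=4):
        # Each volume group is handled by its own worker thread
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(process_volume_group, volume_groups))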
© 2024 Hamza HADJ AISSA