As a qualified Linux operations and maintenance engineer, you need a clear methodology for handling faults, so that when a problem occurs you can quickly locate and resolve it. The general approach is as follows:
Pay attention to the error message: every error produces a message, and in most cases that message alone is enough to locate the problem, so read it carefully. If you ignore these messages, the problem will never be solved.
Check the log files: sometimes the error message is only a surface symptom. To understand the problem more deeply you must check the corresponding logs, which fall into two groups: system logs (under /var/log) and application logs. Combining the two is usually enough to locate the problem (see the short example after this list).
Analyze and locate the problem: this step is the most involved. Based on the error message and the log files, and taking other related factors into account, work out the root cause.
Solve the problem: once the cause has been found, fixing it is usually straightforward.
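For the log-checking step, a minimal sketch (assuming a Red Hat style system where the main system log is /var/log/messages; on Debian/Ubuntu the equivalent is /var/log/syslog) looks like this:
# tail -n 100 /var/log/messages
# grep -i error /var/log/messages | tail -n 20
Application logs live wherever the application writes them, for example /usr/local/apache2/logs/error_log for an Apache built from source, so check those alongside the system log.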
As this process shows, solving a problem is really the process of analyzing and finding it; once the cause is determined, the fault is as good as fixed.
With this troubleshooting approach in mind, here are six typical Linux operations and maintenance problems and how they were analyzed and solved.
Problem 1: File system corruption caused the system to fail to boot
Checking root filesystem
/dev/sda6 contains a file system with errors, check forced
An error occurred during the file system check
This error shows that the file system on the /dev/sda6 partition has a problem. Faults like this are quite common and are usually caused by a sudden power loss that leaves the file system structure inconsistent. In general, the fix is to force a repair with the fsck command:
# umount /dev/sda6
# fsck.ext3 -y /dev/sda6
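As an optional follow-up (a sketch, assuming /dev/sda6 has an entry in /etc/fstab), you can confirm the repair before putting the partition back into service:
# tune2fs -l /dev/sda6 | grep -i state
# mount /dev/sda6
# df -h
The file system state should now report "clean"; if fsck keeps finding new errors on every run, suspect failing hardware rather than a one-off power loss.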
Problem 2: "Argument list too long" error and solution
# crontab -e
After editing, saving, and exiting, it reports the error: no space left on device
From this error we learn that the disk is full, so first check the disk space:
# df -h
The output shows that the /var partition has reached 100% usage, so the problem is located: /var is full. crontab writes the file to a directory under /var when saving, and since that partition has no space left, the error is reported.
Next, run du -sh * inside the /var directory to check the size of every file and directory there; it turns out that /var/spool/clientmqueue occupies about 90% of the /var partition. So what generates the files in /var/spool/clientmqueue, and can they be deleted? They are essentially undelivered mail messages and can be deleted:
# rm *
/bin/rm: Argument list too long
The "argument list too long" error occurs when too many arguments are passed to a single command in Linux; this is a long-standing limitation of the system. You can check the limit with the getconf ARG_MAX command:
# getconf ARG_MAX
To check the distribution version:
# more /etc/issue
Solutions:
1. Delete in batches by splitting the wildcard into ranges:
# rm [a-n]* -rf
# rm [o-z]* -rf
2. Use the find command to delete:
# find /var/spool/clientmqueue -type f -print -exec rm -f {} \;
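An equivalent variant not listed in the original steps is to pipe find into xargs, which batches the file names instead of spawning one rm per file; the -print0 and -0 options keep file names containing spaces safe:
# find /var/spool/clientmqueue -type f -print0 | xargs -0 rm -f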
3. Use a shell script:
#!/bin/bash
RM_DIR='/var/spool/clientmqueue'
cd $RM_DIR
for i in `ls`
do
  rm -f $i
done
4. Recompile the kernel
You need to manually increase the number of memory pages the kernel allocates to command-line arguments. Open the include/linux/binfmts.h file in the kernel source and find the following line:
#define MAX_ARG_PAGES 32
Change 32 to a larger value, such as 64 or 128, and recompile the kernel
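For reference, with the default MAX_ARG_PAGES of 32 and a 4 KB page size, the limit works out to 32 × 4096 = 131072 bytes, which is what getconf ARG_MAX typically reports on kernels built this way; raising the constant to 64 doubles it to 262144 bytes. Note that kernels from 2.6.23 onward dropped this fixed limit (argument space is instead bounded by a fraction of the stack size), so recompiling is only relevant on older systems.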
Problem 3: Inode exhaustion causes application failure
After an Oracle database server was shut down and restarted, the Oracle listener could not be started and reported the error: Linux error: No space left on device
The output indicates that the listener cannot start because disk space is exhausted: Oracle needs to create the listener log file when starting the listener, so the first step is to check disk space usage:
# df -h
The df output shows that every partition still has plenty of free space, and the path where the Oracle listener writes its log is under the /var partition, which also has ample space.
Troubleshooting:
Since the error message points to disk space, look more closely at what disk space means. In a Linux system it covers three things: physical disk space, the space occupied by inode entries, and the space Linux uses for semaphores; what we normally deal with is physical disk space. Since physical space is not the problem here, check whether the inodes are exhausted by running "df -i". The output makes it obvious that the inodes are used up, so no new file can be created.
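For illustration, the inode check in this case looks like the following sketch (the paths are the ones from this scenario):
# df -i /var
# ls /var/spool/clientmqueue | wc -l
df -i reports inode usage per file system; an IUse% of 100% means no new file can be created even though df -h still shows free space, and the second command counts the huge number of small mail files that consumed the inodes.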
The total number of inodes on a disk partition can be viewed with the following command:
# dumpe2fs -h /dev/sda3 |grep 'Inode count'
Each inode has a number; the operating system uses inode numbers to distinguish files, and the "ls -i" command shows the inode number corresponding to a file name.
To see more detailed inode information for a file, use the stat command:
# stat install.log
Solve the problem
# find /var/spool/clientmqueue/ -name "*" -exec rm -rf {} \;
Problem 4: The file has been deleted, but the space is not released
The operations monitoring system sent an alert that a server's disk was full; logging in confirmed that the root partition was indeed at 100%. A word about this server's deletion policy first: because Linux has no recycle bin, files to be deleted on the production servers are first moved to the system /tmp directory, and the data under /tmp is purged periodically. The policy itself is fine, but this particular server was not given a separate /tmp partition, so everything under /tmp was actually consuming root-partition space. Having found the problem, the next step is to delete the large data files under /tmp:
# du -sh /tmp/* | sort -rh | head -3
The command reveals a 66 GB file named access_log in the /tmp directory. Judging by its name and size, it is an Apache access log that has not been cleaned up for a very long time, and it is almost certainly what filled the root partition. After confirming that the file can be deleted, run the following:
# rm /tmp/access_log
# df -h
The output shows that the root partition space has still not been released. What is going on?
Normally, space is released as soon as a file is deleted, but there are exceptions, for example when the file is held open by a process that is still writing data to it. To understand this, you need to know how files are stored and structured in Linux.
A file is stored in the file system in two parts: the data blocks and the pointer entry, which lives in the file system's metadata. When a file is deleted, the pointer is cleared from the metadata while the data blocks remain on disk, and once the pointer is gone those blocks can be overwritten with new content. In this case the space was not released after access_log was deleted because the httpd process still had the file open and kept writing to it: although the file was deleted, the process held it, so its pointer was not cleared from the metadata. Since the pointer still exists, the kernel considers the file not yet removed, and that is why df does not show the space as released.
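To see this behavior for yourself, here is a small reproduction sketch (it uses a hypothetical /tmp/bigfile; any process that keeps the file open, such as tail -f here, will do):
# dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024
# tail -f /tmp/bigfile > /dev/null &
# rm /tmp/bigfile
# df -h /tmp
# lsof | grep deleted
# kill %1
# df -h /tmp
After the rm, df still reports the 1 GB as used and lsof lists bigfile as (deleted); only when the background tail is killed is the space actually returned.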
Troubleshooting:
With the mechanism clear, check whether a process is still writing data to the access_log file. This is what the lsof command is for: it can list deleted files that are still held open by applications.
# lsof | grep delete
The output shows that /tmp/access_log is held open by the httpd process, which keeps writing log data to it. The "deleted" state in the last column means the log file has been deleted, but because the process is still writing to it, the space has not been released.
Solve the problem:
At this point the problem is essentially understood. There are several ways to finish the job: the simplest is to stop or restart the httpd process, or even reboot the operating system, but none of these is ideal. To let the process keep writing its log while releasing the disk space the file occupies, the best approach is to truncate the file in place, which can be done with the following command:
# echo "">/tmp/access_log
This releases the disk space immediately while allowing the process to keep writing logs to the file. The technique is often used to clean up log files generated by web services such as Apache, Tomcat, and Nginx.
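Two equivalent ways to truncate a busy log file in place (the second requires the truncate utility from GNU coreutils):
# cat /dev/null > /tmp/access_log
# truncate -s 0 /tmp/access_log
Both leave the file at zero bytes while keeping the same inode, so the writing process is unaffected; the echo variant above technically leaves a single newline behind, which rarely matters.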
Problem 5: "too many open files" error and solution
Problem symptom: this is a Java-based web application. When adding data through the back end, the operation fails; logging in to the server and checking the tomcat log shows the following exception: java.io.IOException: Too many open files
This message suggests that the system has run out of usable file descriptors. Since the tomcat service is started by the system user www, log in as the www user and check the maximum number of file descriptors the system allows to be opened with the ulimit -n command. The output is as follows:
$ ulimit -n
65535
The maximum number of open file descriptors on this server is already 65535, which should be more than enough, so why the error?
Approach: this case hinges on how the ulimit command is used.
ulimit can be applied in several ways (sketched after the list below):
1. Add it to the user's environment variable files.
If the user's shell is bash, adding "ulimit -u 128" to the .bashrc or .bash_profile file in the user's home directory limits that user to at most 128 processes.
2. Add it to the application's startup script.
If the application is tomcat, adding "ulimit -n 65535" to tomcat's startup script startup.sh lets the user running it open up to 65535 file descriptors.
3. Execute the ulimit command directly on the shell command terminal.
A limit set this way applies only to the terminal in which the command is run; it is lost once that terminal is exited or closed and does not affect other shell terminals.
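The three usages above, written out as a short sketch with the values used in this article:
(in the user's ~/.bashrc or ~/.bash_profile)        ulimit -u 128
(near the top of tomcat's startup.sh)               ulimit -n 65535
(typed at the current shell prompt, this shell only) ulimit -n 65535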
Solve the problem:
With this ulimit background, back to the case: since the value reported by ulimit looks fine, the setting must not be taking effect for tomcat. First check whether the environment files of the www user that starts tomcat add a ulimit setting; after checking, they do not. Then check whether tomcat's startup script startup.sh adds one; it does not either. Finally, check whether the limit is configured in the limits.conf file:
# cat /etc/security/limits.conf | grep www
www   soft   nofile   65535
www   hard   nofile   65535
The output shows that the limit is indeed configured in limits.conf. Since the configuration is there and looks correct, why the error? After some thought, only one possibility remains: tomcat was started before the ulimit resource limit was added. So check when tomcat was started:
# uptime
Up 283 days
# pgrep -f tomcat
4667
# ps -eo pid,lstart,etime|grep 4667
4667 Sat Jul 6 09:33:39 2013 77-05:26:02
The output shows that this server has been up for 283 days without a reboot and that tomcat was started at 09:33 on July 6, 2013, about 77 days ago. Next, check when the limits.conf file was last modified:
# stat /etc/security/limits.conf
The stat command shows that limits.conf was last modified on July 12, 2013, later than the time tomcat was started. Once this is clear, the solution is simple: restart tomcat.
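After restarting tomcat, you can confirm that the running process really picked up the new limit, rather than re-checking ulimit -n in a login shell (a sketch, assuming a single tomcat PID and a kernel new enough to expose /proc/<pid>/limits):
# cat /proc/$(pgrep -f tomcat)/limits | grep -i 'open files'
The "Max open files" line shows the soft and hard limits that actually apply to that process.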
Problem 6: Read-only file system error and solution
Analysis: this problem can have many causes. It may be triggered by inconsistent file system data blocks or by a disk hardware failure. The mainstream ext3/ext4 file systems have fairly strong self-repair mechanisms and can generally fix simple errors by themselves; when a fatal error that cannot be repaired occurs, the file system temporarily blocks write operations to protect data consistency and safety, and the file system becomes read-only, which is the "read-only file system" phenomenon.
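Before repairing anything, it helps to confirm the read-only state and see why the kernel imposed it (a sketch, assuming the affected partition is /dev/sdb1 as in this case):
# grep sdb1 /proc/mounts
# dmesg | grep -iE 'ext3|i/o error|read-only'
/proc/mounts reflects the kernel's view, so the mount options will show ro if the file system has been remounted read-only, and dmesg usually contains the original I/O or journal error that triggered it.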
File system errors can be repaired manually with the fsck command; before repairing, it is best to unmount the disk partition that holds the file system:
# umount /dev/sdb1
umount: /dev/sdb1: device is busy
The unmount fails, which usually means some process is still using files on this partition. Check as follows:
# fuser -m /dev/sdb1
/dev/sdb1: 8800
Then check which process corresponds to PID 8800:
# ps -ef |grep 8800
The check shows that apache has not been stopped, so stop it:
# /usr/local/apache2/bin/apachectl stop
# umount /dev/sdb1
# fsck -V -a /dev/sdb1
# mount /dev/sdb1
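A final hedged verification (assuming /dev/sdb1 has an /etc/fstab entry for its usual mount point, and that smartmontools is installed for the optional health check):
# grep sdb1 /proc/mounts
# smartctl -H /dev/sdb
The mount options should now show rw; if the partition drops back to read-only again soon afterwards, the underlying disk is probably failing and its SMART health status is worth checking.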