Spotting the Digraph
Many of you have probably run into a ^M character at the end of every line in some files on Linux machines. Most likely you did not create those files yourself but received them from another source. Sometimes you only notice the issue after pushing a patch.
Let me help you understand the reason behind this problem.
The first part of understanding a problem is to understand how to recreate the problem. Here is a list of situations where you might face these issues:
Copying text files from Windows to Linux or Unix systems.
Working with shell scripts or configuration files written on Windows and executed on Linux.
Transferring data or logs from Windows machines to Linux servers, causing data processing or parsing issues.
Committing code files from a Windows development environment to a Git repository that is also used on Linux.
Using shared folders or file transfer mechanisms that do not handle line-ending conversions automatically.
Cracking the problem
The root cause is a difference in the end-of-line (EOL) character between operating systems.
In Linux and other Unix-based systems, the newline sequence is represented by a single character: the line feed (LF), denoted as \n.
In classic Mac OS (before Mac OS X), the newline sequence is represented by a single character: the carriage return (CR), denoted as \r. Modern macOS uses LF, like other Unix-based systems.
In Windows, the newline sequence is represented by two characters: a carriage return (CR) followed by a line feed (LF), denoted as \r\n. This combination dates back to the typewriter era and is known as CRLF.
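As a small illustration of the three conventions, the Python sketch below encodes the same two lines with each terminator and prints the raw bytes. The sample text and the names LINES, ENDINGS, and encode are made up for this example.

```python
# The same two-line text under each newline convention.
LINES = ["hello", "world"]

# Hypothetical labels mapped to the end-of-line sequence each system uses.
ENDINGS = {
    "unix (LF)": "\n",
    "classic mac (CR)": "\r",
    "windows (CRLF)": "\r\n",
}


def encode(lines, eol):
    """Terminate every line with the given end-of-line sequence and return bytes."""
    return "".join(line + eol for line in lines).encode("ascii")


for name, eol in ENDINGS.items():
    # repr() of a bytes object makes the otherwise invisible \r and \n visible.
    print(f"{name:18} {encode(LINES, eol)!r}")
```

The Windows variant is two bytes longer here, one extra CR per line, and that extra byte is what later shows up as ^M.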
Behind the scenes
When you copy a text file from Windows to Linux, the file's content remains the same, including the Windows-style CRLF line endings.
However, Linux expects the lines to be terminated with only the LF character.
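To make the leftover CR concrete, here is a minimal Python sketch that treats Windows-style content the way an LF-only reader would. The sample data and variable names are invented for illustration.

```python
# Content as it would arrive from Windows: every line ends in CR LF.
windows_data = b"name=demo\r\nmode=fast\r\n"

# An LF-only reader splits on LF, so the CR stays attached to each line.
# The final element after the last LF is empty, so it is dropped.
lines = windows_data.split(b"\n")[:-1]

for line in lines:
    print(line)

# The stray CR makes a comparison fail even though the text looks right.
value = lines[1].split(b"=")[1]
print(value == b"fast")
print(value.rstrip(b"\r") == b"fast")
```

The first comparison fails because the value still carries a trailing CR. This is the kind of parsing surprise that copied configuration files and logs can cause.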
When you commit a file with Windows-style line endings to a version control system (VCS) like Git on a Linux machine, the VCS stores the bytes as they are unless line-ending conversion is configured.
The CR characters therefore become part of the committed content, and they show up when the file or its diff is viewed on a system that expects LF only.
To indicate the presence of the carriage return character (CR) in the file, some text editors and terminal emulators on Unix-based systems display the ^M character at the end of each line.
The caret ^ denotes a control character, and M identifies the carriage return: CR is ASCII code 13, which is what Ctrl+M produces, M being the 13th letter of the alphabet.
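The mapping from a control character to its caret form can be sketched in a few lines of Python. The function name caret_notation is hypothetical. It relies on the usual convention that the displayed letter is the control code with bit 6 flipped.

```python
def caret_notation(byte):
    """Return the caret notation for an ASCII control character (0-31 or 127)."""
    if not (0 <= byte < 32 or byte == 127):
        raise ValueError("not an ASCII control character")
    # Flipping bit 6 (0x40) maps 0-31 onto '@'..'_' and 127 (DEL) onto '?'.
    return "^" + chr(byte ^ 0x40)


print(caret_notation(ord("\r")))  # carriage return, ASCII 13
print(caret_notation(ord("\n")))  # line feed, ASCII 10
```

Carriage return is code 13, and 13 + 64 = 77 is the code for M, so a CR is rendered as ^M. By the same rule a line feed would be ^J.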
The Escape Plan
On Unix-based systems, the file utility can display what kind of line endings are present in a file. For example, file *.c will report which line terminators (CRLF or CR) are present in each .c file; plain LF files are typically reported without any note about line terminators. The dos2unix utility can convert from DOS or Mac format to Unix, and the unix2dos utility can convert from Unix to DOS format, optionally while preserving file timestamps.
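If the file utility is not at hand, a rough equivalent of the line-ending check can be sketched in Python. This is not how file itself works; count_line_endings is a hypothetical helper that simply counts terminators in raw bytes.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CRs, and lone LFs in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the terminators that stand alone.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}


# In-memory sample with mixed endings; for a real file, read it in binary
# mode first, e.g. open(path, "rb").read(), so Python does not translate
# the newlines before they are counted.
sample = b"first\r\nsecond\nthird\rfourth\r\n"
print(count_line_endings(sample))
```

A file that reports more than one nonzero count has mixed line endings, which is often a sign that it was edited on several systems.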
Back to the roots
When Microsoft developed DOS in the early 1980s, many early printers and teletype machines used the carriage return (CR) character to move the print head back to the beginning of a line, and the line feed (LF) character to advance to the next line.
To maintain compatibility with these older devices, DOS adopted the CRLF convention.
The original Macintosh operating system, introduced by Apple in 1984, was designed to be user-friendly and targeted at non-technical users. It used a single CR as its line ending.
Unix adopted a single character, LF, as the line ending and left it to the terminal driver to translate LF into the CR and LF sequence that teletype-style devices needed.
Not enough?
What about writing your own dos2unix or unix2dos?
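As a starting point, here is a minimal Python sketch of both conversions. It is a simplification under stated assumptions, not a reimplementation of the real tools: it works on whole files in memory, does not detect binary files, and does not handle the Mac (CR-only) format. The helper convert_file is a hypothetical name.

```python
def dos2unix(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving lone CR bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def unix2dos(data: bytes) -> bytes:
    """Convert LF line endings to CRLF without doubling existing CRs."""
    # Normalize to LF first so an already-CRLF line does not become CR CR LF.
    return dos2unix(data).replace(b"\n", b"\r\n")


def convert_file(path: str, converter) -> None:
    """Rewrite the file at `path` in place using the given converter."""
    # Binary mode keeps Python from translating newlines on its own.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(converter(data))


print(dos2unix(b"one\r\ntwo\r\n"))
print(unix2dos(b"one\ntwo\r\n"))
```

Calling convert_file("script.sh", dos2unix) would then strip the carriage returns from a script copied over from Windows, where script.sh stands in for whatever file you need to fix.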
Conclusion
When faced with the unknown, fear may cloud your path, leaving you confused and hesitant. But when you courageously delve deeper, gaining understanding and insight, fear dissipates, and clarity emerges as your guide.
Please share a real case where you faced this issue in the comments to this article. I am curious to hear about it :)