Need explanations about compiling

foremanguy@lemmy.ml · 8 months ago

Need explanations about compiling

d3Xt3r@lemmy.nz · edit-2 8 months ago

Others here have already given you some good overviews, so instead I’ll expand a bit more on the compilation part of your question.

As you know, computers are digital devices - that means they work on a binary system, using 1s and 0s. But what does this actually mean?

Logically, a 0 represents “off” and 1 means “on”. At the electronics level, 0s may be represented by a low voltage signal (typically between 0-0.5V) and 1s are represented by a high voltage signal (typically between 2.7-5V). Note that the actual voltage levels, or what is used to representation a bit, may vary depending on the system. For instance, traditional hard drives use magnetic regions on the surface of a platter to represent these 1s and 0s - if the region is magnetized with the north pole facing up, it represents a 1. If the south pole is facing up, it represents a 0. SSDs, which employ flash memory, uses cells which can trap electrons, where a charged state represents a 0 and discharged state represents a 1.

Why is all this relevant you ask?

Because at the heart of a computer, or any “digital” device - and what sets apart a digital device from any random electrical equipment - is transistors. They are tiny semiconductor components, that can amplify a signal, or act as a switch.

A voltage or current applied to one pair of the transistor’s terminals controls the current through another pair of terminals. This resultant output represents a binary bit: it’s a “1” if current passes through, or a “0” if current doesn’t pass through. By connecting a few transistors together, you can form logic gates that can perform simple math like addition and multiplication. Connect a bunch of those and you can perform more/complex math. Connect thousands or more of those and you get a CPU. The first Intel CPU, the Intel 4004, consisted of 2,300 transistors. A modern CPU that you may find in your PC consists of hundreds of billions of transistors. Special CPUs used for machine learning etc may even contain trillions of transistors!

Now to pass on information and commands to these digital systems, we need to convert our human numbers and language to binary (1s and 0s), because deep down that’s the language they understand. For instance, in the word “Hi”, “H”, in binary, using the ASCII system, is converted to 01001000 and the letter “i” would be 01101001. For programmers, working on binary would be quite tedious to work with, so we came up with a shortform - the hexadecimal system - to represent these binary bytes. So in hex, “Hi” would be represented as 48 69, and “Hi World” would be 48 69 20 57 6F 72 6C 64. This makes it a lot easier to work with, when we are debugging programs using a hex editor.

Now suppose we have a program that prints “Hi World” to the screen, in the compiled machine language format, it may look like this (in a hex editor):

As you can see, the middle column contains a bunch of hex numbers, which is basically a mix of instructions (“hey CPU, print this message”) and data (“Hi World”).

Now although the hex code is easier for us humans to work with compared to binary, it’s still quite tedious - which is why we have programming languages, which allows us to write programs which we humans can easily understand.

If we were to use Assembly language as an example - a language which is close to machine language - it would look like this:

     SECTION .data
msg: db "Hi World",10
len: equ $-msg

     SECTION .text
     
     global main   
main:
     mov  edx,len
     mov  ecx,msg
     mov  ebx,1
     mov  eax,4

     int  0x80
     mov  ebx,0
     mov  eax,1
     int  0x80

As you can see, the above code is still pretty hard to understand and tedious to work with. Which is why we’ve invented high-level programming languages, such as C, C++ etc.

So if we rewrite this code in the C language, it would look like this:

#include <stdio.h>
int main() {
  printf ("Hi World\n");
  return 0;
}

As you can see, that’s much more easier to understand than assembly, and takes less work to type! But now we have a problem - that is, our CPU cannot understand this code. So we’ll need to convert it into machine language - and this is what we call compiling.

Using the previous assembly language example, we can compile our assembly code (in the file hello.asm), using the following (simplified) commands:

$ nasm -f elf hello.asm
$ gcc -o hello hello.o

Compilation is actually is a multi-step process, and may involve multiple tools, depending on the language/compilers we use. In our example, we’re using the nasm assembler, which first parses and converts assembly instructions (in hello.asm) into machine code, handling symbolic names and generating an object file (hello.o) with binary code, memory addresses and other instructions. The linker (gcc) then merges the object files (if there are multiple files), resolves symbol references, and arranges the data and instructions, according to the Linux ELF format. This results in a single binary executable (hello) that contains all necessary binary code and metadata for execution on Linux.

If you understand assembly language, you can see how our instructions get converted, using a hex viewer:

So when you run this executable using ./hello, the instructions and data, in the form of machine code, will be passed on to the CPU by the operating system, which will then execute it and eventually print Hi World to the screen.

Now naturally, users don’t want to do this tedious compilation process themselves, also, some programmers/companies may not want to reveal their code - so most users never look at the code, and just use the binary programs directly.

In the Linux/opensource world, we have the concept of FOSS (free software), which encourages sharing of source code, so that programmers all around the world can benefit from each other, build upon, and improve the code - which is how Linux grew to where it is today, thanks to the sharing and collaboration of code by thousands of developers across the world. Which is why most programs for Linux are available to download in both binary as well as source code formats (with the source code typically available on a git repository like github, or as a single compressed archive (.tar.gz)).

But when a particular program isn’t available in a binary format, you’ll need to compile it from the source code. Doing this is a pretty common practice for projects that are still in-development - say you want to run the latest Mesa graphics driver, which may contain bug fixes or some performance improvements that you’re interested in - you would then download the source code and compile it yourself.

Another scenario is maybe you might want a program to be optimised specifically for your CPU for the best performance - in which case, you would compile the code yourself, instead of using a generic binary provided by the programmer. And some Linux distributions, such as CachyOS, provide multiple versions of such pre-optimized binaries, so that you don’t need to compile it yourself. So if you’re interested in performance, look into the topic of CPU microarchitectures and CFLAGS.

Sources for examples above: http://timelessname.com/elfbin/

foremanguy@lemmy.ml · 8 months ago

Thank so much!! 👌👍

Antithetical@lemmy.deedium.nl · 8 months ago

Great explanation, thank you for the well written post.

Mazoku@lemmy.ml · 8 months ago

Wonderful write up, thank you!

mudle@lemmy.ml · edit-2 8 months ago

Incredible!! I don’t think I have ever heard this explained in such simplicity. Great write up.

Trent@lemmy.ml · 8 months ago

Compiling code converts it from human readable source code into optimized machine code which the processor understands how to execute. For a lot of software you can just unpack the source code, run ./configure, run ‘make’, and then ‘make install’. This can vary a lot and is a simplified explanation, but it’s a start…

foremanguy@lemmy.ml · 8 months ago

Ok thank you!

jbrains@sh.itjust.works · 8 months ago

Computers don’t directly understand the code that humans write. Humans find it extremely difficult to directly write the code that computers understand.

Compiling is how we convert the code that humans write into the code that computers can run. (It’s more complicated than that, but that explanation is probably enough for now.)

Different computers understand different flavors of computer code. Each kind of computer can compile the same human code, but they produce the flavor of computer code specific to that kind of computer. That’s why you sometimes need to compile the human code on your computer: it’s easier for your computer to know how to compile human code than for a human to know how to compile human code for every kind of computer that exists now and might exist in the future. There are some common kinds of computer and many projects pre-compile human code so that you don’t have to, but that’s not always easy. Also, some people insist on compiling the code themself, rather than trust someone else to correctly compile the code for their computer.

As for how to compile, that can be complicated. When you find the human code (“source code”) for a software project, the README often gives you instructions for how to compile that project’s code. Many of the instructions look familiar, because they are similar between projects, but the detail can vary a lot from project to project. Moreover, different human programming languages have very different instructions for how to compile their flavor of human code into computer code.

foremanguy@lemmy.ml · 8 months ago

ooh ok, so I’ve always aked myself “what is for the source code?”. If I’ve understand, it’s all the code writes in C, C++, Rust, etc. And then if you want to use the programm you just have to compile the source code. It’s useful for the developer to do not have to compile for every OS. Is that right?

EinfachUnersetzlich@lemm.ee · 8 months ago

Yes, but for developers it’s good to not have to program for each CPU architecture/OS.

I can write some C, C++ or Rust code and compile it for loads of platforms and have it do the same thing (simplified).

jbrains@sh.itjust.works · 8 months ago

Yes.

TimeSquirrel@kbin.social · edit-2 8 months ago

Human readable code:

// do stuff 100 times
for (int x = 0; x < 100; x++) {
do.stuff();
}

After compiling might look like:

1011 1100 1011 1011 and so on and so forth, which corresponds directly to cells in a memory chip being switched on or off, which are the instructions that get fed to your CPU to do everything your program does in the machine.

foremanguy@lemmy.ml · 8 months ago

f00f/eris@startrek.website · edit-2 8 months ago

Open-source software is distributed primarily as source code in a human-readable programming language. Computers can’t actually read these programming languages directly; they need to be translated into the machine language of their CPU (such as x86_64). For some languages, like Python, code can be “interpreted” on the fly; for others, like C, programs must be “compiled” into a separate file format. Additionally, most programs consist of multiple files that need to be compiled and linked together, and installed in certain folders on your system, so the compiler and additional tools work to automate that process.

Most users of Linux rarely if ever have to compile anything, because the developers of Linux distros, and some third parties like Flathub, curate collections of (mostly) open-source software that has already been compiled and packaged into formats that are easy to install and uninstall. As part of this process, they usually add some metadata and/or scripts that can automate compiling and packaging, so it only requires a single command (makepkg on Arch, dpkg-buildpackage on Debian.) However, some newer or more obscure software may not be packaged by your distribution or any third-party repo.

How to compile depends on the program, its programming language and what tools the developers prefer to use to compile it. Usually the README file included with source code explains how to compile the software. The most common process uses the commands ./configure; make; sudo make install after installing all of the program’s dependencies and cd-ing to the source code directory. Other programs might include the metadata needed for something like makepkg to work, be written in an interpreted language and thus require no compilation, or use a different toolchain, like CMake.

foremanguy@lemmy.ml · 8 months ago

thank you so much for taking time to answer

velox_vulnus@lemmy.ml · edit-2 2 months ago

deleted by creator

foremanguy@lemmy.ml · 8 months ago

humm… ok, and for example when you have the binaries of a file you have to compile it a last time, no? That’s my experience with aur, when you get the bin, you have to makepkg a other time

f00f/eris@startrek.website · 8 months ago

In that case makepkg isn’t compiling anything, it’s just packaging the existing binaries so that they can be more easily installed and recognized by your package manager.

foremanguy@lemmy.ml · 8 months ago

like linking all the files and make a clean package? So makepkg does everything from the start to end of the compiling process

f00f/eris@startrek.website · 8 months ago

Yeah, basically. makepkg automates the process of creating an Arch package, and while usually that involves compiling source code, sometimes it just means converting proprietary software that has already been compiled into a different format.

foremanguy@lemmy.ml · 8 months ago

thx 🙏

lemmyreader@lemmy.ml · 8 months ago

The Arch Linux makepkg is a bash script with description

make packages compatible for use with pacman

Some packages of AUR are not about compiling but fetching the binary (sometimes converting it from deb) and then prepare it for you so you can install it. So when you use AUR to install a binary package instead of compiling there is really no compiling involved afair.

foremanguy@lemmy.ml · 8 months ago

ohhh okay, thank you for your explanation!

lemmyreader@lemmy.ml · 8 months ago

Some software is brand new and not yet packaged as software packages. And for already existing software Linux distributions will have to make choices as not everything can be included and maintained. Now if a developer creates new software, and things are not packaged yet (With Debian for Debian stable this can take a really long time) it can be comfortable for the developer to provide just instructions about how to compile the software so users can run the software, while the maintainer does not have to bother about packaging.

foremanguy@lemmy.ml · 8 months ago

So it’s a bit like taking all the useful packages and mix it up in a clean package? And are .deb .rpm… packages made like that?

lemmyreader@lemmy.ml · 8 months ago

Yes, indeed. If you would want to you can re-compile Debian deb packages from Debian sources. To give you an idea : https://wiki.debian.org/apt-src There’s also Gentoo Linux which has a history of compiling software. Years ago that was interesting because of flags for compiling, make the resulting software optimized for certain CPU models.

EinfachUnersetzlich@lemm.ee · 8 months ago

That’s still true, dozens of us still use it!

lemmyreader@lemmy.ml · 8 months ago

I see. What are you optimizing for ? A smaller and faster kernel, or optimizing for certain CPU or certain other hardware parts ?

EinfachUnersetzlich@lemm.ee · 8 months ago

I don’t use it for the optimisations, I just prefer its package manager and ecosystem.

Possibly linux@lemmy.zip · 8 months ago

You can use GCC to compile C code. Once you compile it you can run your program. Once its compiled you can pull it up in gdb and look at the final assembly code that runs on the CPU.