|
In this readme we want to provide some more technical information about RPM and PyRPM itself and how everything works. Some of the collected information here can be found in other documents spread out over the Internet in more detail. See the related projects section at the end of this file.
Use pychecker or pylint to improve code quality, 4 space indents and no tabs. Most scripts support giving --hotshot as first option to run them with the hotshot python profiler. oldpyrpm.py has some TODO items listed in the source.
Python is a very good scripting language. Very nice for fast prototyping and implementing new features. We currently see limits if too much data needs to be processed. This is happening with parsing xml data as well as with the huge amount of dependency data in rpm packages (where all files in rpm packages can also be used for dependencies, those are about 350000 files in FC-development).
These python modules could see some improvements:
python-elementtree is about 3 times faster in reading in the comps.xml file compared to libxml2. elementtree still does some internal automatic detection and conversion between data encodings that looks like it might add overhead. Migrating to python-elementtree should either be done for the complete code or not at all. Since libxml2 is the default xml library, we'll stick with it for now. XML processing is one of the biggest CPU consumers right now.
gzip decompression should offer a filetype object with .read(). The current hacks in the class PyGZIP should go away.
bsddb access seems very slow (reading rpmdb)
bzip2 decompression does not have streaming support
At first we want to provide some high level information about how rpm works without going into too much techincal detail. This should help understand how rpm works and what problems it solves (and which it doesn't and why yum is needed). Following that will be a section that describes then what yum does and what problems it solves (and again, which it can't solve and which we try to solve in our yum derivate). Afterwards we will go more into the rpm binary format and the format of yum repositories as well as several interessting points about the way rpm works and the problems that appear with doing to.
Rpms whole concept is based on socalled dependencies which define the relation between packages. Those are simply expressed using Requires and Provides. In mathematical terms this can be viewed as a directed graph where the nodes are packages and the edges are requirements from packages that require something to packages that provide that requirement. To make this a little more visible here a small ascii art:
A ----> B ----> C
In this example:
A requires B
and
B requires C
So in order to be able to install A we need to install B and C as well, as B needs C.
Requires and provides can be versioned and have additional flags like <, <=, ==, >=, > which need to be handled properly.
The rpmvercmp algorithm compares two labels (like the Version or the Release tag) to see which is newer.
In this algorithm, "digits" and "letters" are defined as ASCII digits (0-9) and ASCII letters (a-z and A-Z). Other Unicode digits and letters (like accented Latin letters) are not considered letters. ASCII letters and digits are called "alphanumeric" characters.
Please note that the algorithm's actions is undefined in some cases, in a ways may make the resulting comparisons stop working sanely (see [WWW] https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178798 for an example where the order of the comparison is more important than the operands). To avoid these, make sure that all your labels start and end with alphanumeric characters. So while things like "1.", "+a", or "_" are allowed as labels, the result of such comparisons are undefined. For the exact (non-symmetric) algorithm, see lib/vercmp.c in the RPM source code. The following algorithm is a simplification based on the version available in FC4, and is considered to be stable, as the last time it changed in any way was in January 2003.
Each label is separated into a list of maximal alphabetic or numeric sections, with separators (non-alphanumeric characters) ignored. If there is any extra non-alphanumeric character at the end, that. So, 2.0.1 becomes (2, 0, 1), while (2xFg33.+f.5) becomes (2, xFg, 33, f, 5).
All numbers are converted to their numeric value. So 10 becomes 10, 000230 becomes 230, and 00000 becomes 0.
The elements in the list are compared one by one using the following algorithm. If two elements are decided to be different, the label with the newer element wins as the newer label. If the elements are decided to be equal, the next elements are compared until we either reach different elements or one of the lists runs out. In case one of the lists run out, the other label wins as the newer label. So, for example, (1, 2) is newer than (1, 1), and (1, 2, 0) is newer than (1, 2).
The algorithm for comparing list elements is as follows:
If one of the elements is a number, while the other is alphabetic, the numeric elements is considered newer. So 10 is newer than abc, and 0 is newer than Z.
If both the elements are numbers, the larger number is considered newer. So 5 is newer than 4 and 10 is newer than 2. If the numbers are equal, the elements are decided equal.
If both the elements are alphabetic, they are compared using the Unix strcmp function, with the greater string resulting in a newer element. So b is newer than a, add is newer than ZULU (because lowercase characters win in strcmp comparisons), and aba is newer than ab. If the strings are identical, the elements are decided equal.
Examples
Some random examples, to make sure you understand the rpmvercmp algorithm:
1.0010 is newer than 1.9 because 10 is more than 9.
1.05 is equal to 1.5, because both 05 and 5 are treated as the number 5.
1.0 is newer than 1, because it has one more element in the list, while previous elements are equal.
2.50 is newer than 2.5, because 50 is more than 5.
fc4 is equal to fc.4, because the alphabetic and numeric sections will always get separated into different elements anyway.
FC5 is older than fc4, because it uses uppercase letters.
2a is older than 2.0, because numbers are considered newer than letters.
1.0 is newer than 1.fc4 because numbers are considered newer than letters.
3.0.0_fc is the same as 3.0.0.fc, because the separators themselves are not important.
Sometimes it is necessary to replace an installed package with a new package that provides the same functionality as an installed one. By obsoleting that installed package in the new package the installed package will be removed during an update and replaced with the new package. Obsoletes can be versioned and have flags just like requires and provides.
An example was when ucd-snmp got replaced by net-snmp. There the new net-snmp package contained a "Obsoletes: ucd-snmp" and, as it provided the functionality of the old ucd-snmp a "Provides: ucd-snmp".
In rpm you can specify that your package conflicts with something another package provides. This can be used so that e.g. packages that are know not to work with one another can refuse to be installed if the other is already there. Conflicts are tested against the provides, just like requires. But instead of fullfilling a requirement they produce an error if there is a match. Conflicts can be versioned and have flags just like requires and provides.
If you look at a set of rpm packages, then a fileconflict is any duplicate filename where two files differ in their filemode (file permissions as well as file type like directory, regular file, etc), their filemd5sum (which is only set for regular files), their fileusername or their filegroupname. To support multilib/compatarch installations also all fileconflicts where one file is a ELF32 and the other is a ELF64 binary are ignored.
Since fileconflicts do occur pretty frequently, the default setting is to ignore them. yum within Fedora Core has fileconflict detection enabled.
Dependency checking is fairly simple: For a given set of packages take all requirements and check for every requirement if any package provides this requirement, possibly restricted by it's version and flags.
Given a set of packages to determine the order in which the packages need to be installed (or removed) a dependency graph is constructed. All packages with no requirements can be installed first. Afterwards all requires of the remaining packages that were provided by the installed packges get removed. This normally results in a new set of packages without any requirements. This process is then repeated until either no more packages need to be installed or we have requirement loops between packages. These loops then need to be broken up by intelligently selecting one dependency and removing that until we have packages without requirements again. If there are several packages in one round without requirements the order of those are independant and can be installed in any order.
A special case for the loop breaking are PreReqs. Those requirements are special in that sense that they should never ever be broken up if possible as they specify that a certain package absolutely has to be installed before the other one can be safely and correctly installed. This happens often if a package needs some specific binaries in one of it pre or post scripts.
Now that we know how dependency checks and ordering works the next step is dependency resolving. Here we need to leave the space of a single set of rpms and include some kind of repository from which the depresolver can pull packages. A typical scenario is where you have an installed system with a set packages B, a set of packages to be installed or updated I and a set of packages in one or more repositories called R.
The depresolver now takes all requirements from the packages in I and checks which are still not resolved by B. For those unfulfilled requirements it then looks at the packages in R and tries to find packages that fulfill those requirments and add the best matching package to I. This process is repeated until all requirments are fullfilled or aborted if they can't be resolved using R.
An important fact is that dependencies are "arch-less", meaning if you have a requirement on multilib systems this can lead to 64bit packages being pulled it from 32bit packages and vice versa. For libraries this can and usally will automatically be prevented as 64bit libraries on 64bit systems have a (64bit) at the end of their provides, so they specifically and correcltly will be required by binaries linked against them.
More problematic on the other hand are development packages where there a package foo-devel typically only has a "Requires: foo" as a requirement which would pull in a 64bit foo package for a 32bit foo-devel package, which is most of the time not what you want.
Multilib systems (mixed 32 and 64bit systems like AMD64 or S390x) have some special rules that apply to them as on those architectures both 32bit and 64bit packages can be installed at the same time. Each file in a rpm has a color. The color is 0 if the file is arch independant (text files, etc), 1 if it's a 32bit binary and 2 if it is a 64bit binary. If the color is 0, normal file conflict handling is done. If the color larger than 0 and both colors are equal, again, normal file conflicts and handling is done. If both are larger than 0 and differ then the "higher" color wins, meaning the 64bit binary. No file conflicts will be done for that case either.
There are 2 basic forms for updates: Specified or complete updates. For the former this is easily done by looking for packages that match the ones given on the command line. Those can either be names (with version, release and arch information, of course), regular expressions or binary rpm packages. For the later a special pre updates pass has to be done in order to correctly obsolete old packages with newer ones (like the changes from xorg-x11 to modular-x from FC3 to FC4). Other than that it's identical to the specified case where the names to be updates are simply the names of all packages installed on the system.
For a single arch system it is fairly easy to find the best update package: We just look for the package with the "highest" arch (except if exactarch is specified in which case the arch of the update package needs to exactly match the one of the installed package) and then the highest version.
For multilib system things get a lot trickier. If the given update package is either marked as "install only" (like usually kernel packages) or if no package with that name has already been installed we use the simple algorithm as before, selecting the highest arch/version match for the update. If exactarch has been set then we just need to go through all installed archs of each name and find the highest versions for each arch.
If exactarch isn't set we now need to determine if more than one major arch for a name is installed. Major arch in this case means either 32bit or 64bit, e.g i386, i486, i586, i686, athlon for 32bit and x86_64 for 64bit. If there is only one we can and do allow even major arch changes (from 32bit to 64bit and vice versa). If packages for more than one major arch are already installed we need to find updates for each of the installed major archs and do the final selection as in the previous cases.
RPM provides the ability to have packages execute a set of scripts before and after installation and uninstallation. This often is needed to either prepare the system for the actual installation, do some post processing after a package is installed, some pre processing before a package is uninstalled or cleanup after a packages has been uninstalled.
Although the basic idea is fairly simple it still can cause some confusion when scripts are executed for example during updates of packages. This is the order how an update of a package is done in relation to the scripts:
Run %pre of new package
Install new files
Run %post of new package
Run %preun of old package
Delete any old files not overwritten by newer ones
Run %postun of old package
So during an update the new package gets installed first with all it's scripts and then the old package with all it's scripts get uninstalled.
Sometimes it is necessary for packages to run scripts when other packages are installed or uninstalled. This can be done using trigger scripts. Triggers are therefore very similar to requirements and are written the same way in the spec files. The order of how and when trigger scripts are called looks like this:
Run %pre for new version of package being installed
Install new files
Run %post for new version of package being installed
Run %triggerin from other packages set off by new install
Run %triggerin of new package
Run %triggerun of old package
Run %triggerun from other packages set off by old uninstall
Run %preun for old version of package being removed
Delete any old files not overwritten by newer ones
Run %postun for old version of package being removed
Run %triggerpostun of old package
Run %triggerpostun from other packages set off by old uninstall
Looks a little more complicated, but allows to perform postprocessing by other packages after some package has been installed or uninstalled. Take notice that there isn't a %triggerpostin and that %triggerin basically behaves like a %triggerpostin.
For RPM there are nowadays several "formats" in which you can find information about rpm packages. The most typical one is of course the binary rpm header which is part of every binart rpm package. A typical binary rpm package looks like this:
+------+-----------+--------+-------------+ | Lead | Signature | Header | Gziped CPIO | +------+-----------+--------+-------------+ |
The lead has a fixed size of 96 bytes and contains some very basic information about the binary rpm. It can also generally be used to determine if a file is a binary rpm or not (using file e.g.) as it contains some very specific to easily identify them.
The signature and the header are stored as rpm header structures. Rpm header structures look like this:
+-------+---------+-----------+-----------+-----------+ | Magic | IndexNr | StoreSize | Indexdata | Storedata | +-------+---------+-----------+-----------+-----------+ |
The Magic is a hardcoded value, IndexNr the number of index entries and StoreSize the size in bytes of the store data.
Indexdata consists of IndexNr index entries each of which is 16 bytes. Each index entry looks like this:
+-----+------+--------+-------+ | Tag | Type | Offset | Count | +-----+------+--------+-------+ |
Tag specifies which tag this entry is about. Type specifies the type of the tage. Offset specifies at which offset in the Storedata the data begins for this tag. Count has various size meanings depending on the type.
Storedata finally contains the real tag information. As mentioned in the previous paragraph by using an index entry from the Indexdata you can find and parse all data relevant to a specifc tag. The format depends of course on the type of the tag.
More detailed information about the binary rpm format can be found here: http://www.rpm.org/max-rpm/s1-rpm-file-format-rpm-file-format.html
The rpm binary format can be partially found in the rpmdb as well. The file /var/lib/rpm/Packages contains the complete headers of the orignal binary rpms in a rpm header structure format without the 8 byte magic and with some additional installation revelvant indexes appended.
Another nowadays common format for reduced rpm header data is the repo metadata format used by yum. It is a split up and reduced version of the orignal rpm header information using XML. It is mainly useful to determine and resolve dependencies of rpm packages. More information about the metadata can be found here:
http://linux.duke.edu/projects/metadata/
Other less common storage formats include databases like SQLite or MySQL which e.g yum uses to convert the repodata format to a more usable form locally.
Apart from that rpm itself extracts quite a bit of the information from rpm binary headers and writes them in various db4 files in /var/lib/rpm.
This section describes the structure from the various files in /var/lib/rpm. All files are db4 files, either hash or btree based. With the exception of Packages all files have the corresponding rpmtag based value as key. The data consists of integer pairs which contain the package id and the index at which this entry can be found in the rpm header of that tag. The values are 4 byte integers in host byte order. For some tags the index doesn't make any sense. In those cases the index value will always be set to 0.
key: Basename (string)
values: list of 2-tuples: installid (4 byte int), basenameindex (4 byte int)
key: Conflictname (string)
values: list of 2-tuples: installid (4 byte int), conflictindex (4 byte int)
key: Dirname (string)
values: list of 2-tuples: installid (4 byte int), dirindex (4 byte int)
key: md5sum (4 * 4 byte int, no hex string!)
values: list of 2-tuples: installid (4 byte int), filemd5sindex (4 byte int)
Only stored if file md5sum exists and if the file is a regular file (usually equivalent)
key: Groupname (string)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Installtime of transaction (4 byte int, time() value)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Packagename (string)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Installid (4 byte int)
values: Complete binary rpm header with some additional information from signature without lead.
key: Providename (string)
values: list of 2-tuples: installid (4 byte int), providenameindex (4 byte int)
key: Provideversion (string)
values: list of 2-tuples: installid (4 byte int), provideversionindex (4 byte int)
key: unknown yet
values: unknown yet
key: Requirename (string)
values: list of 2-tuples: installid (4 byte int), requirenameindex (4 byte int)
Only contains the requirenames of not install prereqs
key: Requireversion (string)
values: list of 2-tuples: installid (4 byte int), requireversionindex (4 byte int)
key: Sha1header (string) (just as the value from the header)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: md5sum from header (4 * 4 byte int)
values: list of 2-tuples: installid (4 byte int), index (4 byte int) (always 0)
key: Triggername (string)
values: list of 2-tuples: installid (4 byte int), triggerindex (4 byte int)
Only contains the first entry for each name from a package
Now an example of the connection between the package headers which are stored in Packages and the rest of the files.
The connection between /var/lib/rpm/Packages and the other files looks like this:
Package id | Requirename | Index |
---|---|---|
5 | a | 0 |
b | 1 | |
8 | c | 0 |
a | 1 | |
b | 2 |
Requirename | Package Id | Index |
---|---|---|
a | 5 | 0 |
8 | 1 | |
b | 5 | 1 |
8 | 2 | |
c | 8 | 0 |
That means the complete /var/lib/rpm files can be cross checked with /var/lib/rpm/Packages and can be regenerated from that file as well.
An exception is Installtid. This db file contains as keys the TID which is a unique time in seconds since 1970 that reflects a complete transaction. Every header in Packages contains that TID as "installtid" tag. The values of the Installtid db file are again pairs of integers with a package id as first value and the second value always 0. Here a small example:
Package id | Install Tid |
---|---|
5 | 1000000 |
8 | 1000000 |
6 | 1234567 |
9 | 1234567 |
7 | 2345678 |
Install Tid | Package ID | Index |
---|---|---|
1000000 | 5 | 0 |
8 | 0 | |
1234567 | 6 | 0 |
9 | 0 | |
2345678 | 7 | 0 |
As you can see it can happen that package ID's get reused, in our example 6. This can happen if a package gets deleted and the ID "dropped". So there is unfortunately no autoincrementing ID for the packages.
The data eating up RAM in rpm headers are descriptions, changelogs and filelists.
The dependency data we operate with is extremely huge. In addition to the Provides: data which contains shared libs, rpm versions and explicitely listed ones in .spec files, dependency data can also use any filerequires like e.g. Requires: /usr/bin/foo to reference any file in any other rpm package. That means we potentially have to look at a filelist of all rpm packages. That data is extremely huge as the current Fedora Core development tree contains more than 350000 files.
As the dependency data is worked with on each client to update the machine, it must be a goal to reduce this data to a smaller subset.
The current repo metadata has a fixed file regex of (.*bin/.*|/etc/.*|/usr/lib/sendmail)$ and a directory regex of (.*bin/.*|/etc/.*)$. That regex specifies the data given in the repodata/primary.xml.gz file and you have to fallback to the complete filelists available in repodata/filelists.xml.gz if any dependency request is done outside of that data. (The regex gives a deterministic way to know when to load the full filelist.) The regex used to be pretty complete for Fedora Core in the past, but additional filerequires are present in newer Fedora Core and Fedora Extra rpm packages which require a reload of the complete lists.
In addition to the completeness problems above, it was also noted that the regex lists contain 100 times more data than actually being used in current repositories. Conary is thus maintaining explicit lists of possible file requires. Maybe new ways to add autogenerated, small filelists can be worked out that would work for most comon usage cases, also with the fallback to the complete lists like yum / createrepo implement right now.
It would also be possible to store dependency graphs that contain data for the resolver to select the right rpm packages plus the orderer to specify the right sequence to install them. But many machines do have further packages installed outside of that package set, so this would then mostly be used for new installs. Optimizing the general update path for running machines should be more important than improving the install path for new installs, so this is currently no goal, but would very well be possible todo.
The following things should be noted about the repo metadata. yum is using the repodata only within the resolver part to determine a set of rpms that should be updated and/or installed. Then the complete rpm headers are downloaded and another dependency check from librpm is run in addition to determining the ordering of rpm packages.
Here a few limitations you should be aware of if you want to work with the repodata for more than the resolver or understand the limits of the resolver:
Repodata has evolved over time. Until now no version information has been added to the created data, this might make sense for future changes.
Even if no epoch is specified in the rpm header, the metadata will specify this as "0". That's the correct way for version and dependency checks.
Dependency information is often specified like bash >= 3.0 and consists of a (name, flag, version) triple. The flag part is specified as integer within the rpm header and is only partially copied over into the repodata. Installation ordering of rpm packages is not possible with the current available data (or only based on reduced data). Future repodata could make the data more complete or just copy the integer into the output to provide it as exact copy. (Repo data adds a "pre" flag if the RPMSENSE_PREREQ flag is set. That information is actually not complete to identify install prereq versus an erase prereq.)
The primary.xml.gz file contains a subset of the included files. Because of thise some operations cannot be equivalently with binary rpms or repodata headers.
Last updated 20-Mar-2007 14:48:02 CEST