If you are interested in mirroring this dataset for archival or LLM training purposes, please contact us.
Z-Library [zlib/zlibzh]
👩‍💻 Anna’s Archive and Z-Library collaboratively manage a collection of Z-Library metadata and Z-Library files

Z-Library has its roots in the Library Genesis community and was originally bootstrapped with their data. Since then, it has professionalized considerably and has a much more modern interface. As a result, they are able to attract many more donations, both monetary ones that fund improvements to the website and donations of new books. They have amassed a large collection in addition to Library Genesis.

The collection consists of three parts. The original description pages for the first two parts are preserved below. You need all three parts to get all data (except superseded torrents, which are crossed out on the torrents page).

The “Chinese” collection in Z-Library appears to be the same as our DuXiu collection, but with different MD5s. We exclude these files from torrents to avoid duplication, but still show them in our search index.


Zlib releases (original description pages)

Release 1 (2022-07-01)

The initial mirror was painstakingly obtained over the course of 2021 and 2022. At this point it is slightly outdated: it reflects the state of the collection in June 2021. We will update this in the future. Right now we are focused on getting this first release out.

Since Library Genesis is already preserved with public torrents, and is included in Z-Library, we did a basic deduplication against Library Genesis in June 2022, using MD5 hashes. There is likely a lot more duplicate content in the library, such as the same book in multiple file formats. This is hard to detect accurately, so we don't attempt it. After the deduplication we are left with over 2 million files, totalling just under 7TB.
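As an illustration of hash-based deduplication, here is a minimal Python sketch. It is not the script that was actually used; the function names and the idea of holding the known Library Genesis hashes in an in-memory set are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_not_in_known_set(paths, known_md5s):
    """Yield (path, md5) for every file whose MD5 is absent from known_md5s.

    known_md5s is a hypothetical iterable of hex digests, for example
    loaded from a Library Genesis metadata dump.
    """
    known = {value.lower() for value in known_md5s}
    for path in paths:
        file_md5 = md5_of_file(path)
        if file_md5 not in known:
            yield Path(path), file_md5
```

This only catches byte-identical files. Two editions or formats of the same book have different hashes and pass through untouched, which is the limitation described above.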

The collection consists of two parts: a MySQL “.sql.gz” dump of the metadata, and 72 torrent files of around 50-100GB each. The metadata contains the data as reported by the Z-Library website (title, author, description, filetype), as well as the actual filesize and md5sum that we observed, since sometimes these do not agree. There seem to be ranges of files for which Z-Library itself has incorrect metadata. We might also have incorrectly downloaded files in some isolated cases, which we will try to detect and fix in the future.

The large torrent files contain the actual book data, with the Z-Library ID as the filename. The file extensions can be reconstructed using the metadata dump.
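A rough sketch of how the extensions could be restored, assuming the relevant metadata has already been loaded into a dictionary mapping each Z-Library ID (as a string) to its file extension. The dictionary and the function name are hypothetical; the actual column names in the dump are not assumed here.

```python
def plan_renames(filenames, extension_by_id):
    """Map each ID-only filename to the same name with its extension appended.

    extension_by_id is a hypothetical {id: extension} dictionary built from
    the metadata dump. Files with no known extension are left out of the plan.
    """
    plan = {}
    for name in filenames:
        extension = extension_by_id.get(name)
        if extension:
            # Normalize values such as ".PDF" or "pdf" to "pdf".
            plan[name] = f"{name}.{extension.lstrip('.').lower()}"
    return plan
```

The function only returns a rename plan and does not touch the disk, so the plan can be reviewed before anything is renamed.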

The collection is a mix of non-fiction and fiction content (not separated out as in Library Genesis). The quality also varies widely.

This first release is now fully available. Note that the torrent files are only available through our Tor mirror.

Release 2 (2022-09-25)

We have gotten all books that were added to the Z-Library between our last mirror and August 2022. We have also gone back and scraped some books that we missed the first time around. All in all, this new collection is about 24TB. Again, this collection is deduplicated against Library Genesis, since there are already torrents available for that collection.

The data is organized similarly to the first release. There is a MySQL “.sql.gz” dump of the metadata, which also includes all the metadata from the first release, thereby superseding it. We also added some new columns:

We mentioned this last time, but just to clarify: “filesize” and “md5” are the actual properties of the file, whereas “filesize_reported” and “md5_reported” are what we scraped from Z-Library. Sometimes these two don't agree with each other, so we included both.
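To find records where the observed and scraped values disagree, something along these lines could work. It assumes the metadata rows have been loaded as dictionaries keyed by column name, and it uses only the “md5” and “md5_reported” columns named above; the loading step is left out.

```python
def find_md5_mismatches(rows):
    """Return the rows whose observed md5 differs from the reported one.

    Rows where either value is missing are skipped, since there is
    nothing to compare in that case.
    """
    mismatches = []
    for row in rows:
        actual = (row.get("md5") or "").strip().lower()
        reported = (row.get("md5_reported") or "").strip().lower()
        if actual and reported and actual != reported:
            mismatches.append(row)
    return mismatches
```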

For this release, we changed the collation to “utf8mb4_unicode_ci”, which should be compatible with older versions of MySQL.

The data files are similar to last time, though they are much bigger. We simply couldn't be bothered creating tons of smaller torrent files. “pilimi-zlib2-0-14679999-extra.torrent” contains all the files that we missed in the last release, while the other torrents are all new ID ranges. Update 2022-09-29: We made most of our torrents too big, causing torrent clients to struggle. We have removed them and released new torrents. Update 2022-10-10: There were still too many files, so we wrapped them in tar files and released new torrents again.

Release 2 addendum (2022-11-22)

This is a single extra torrent file. It does not contain any new information, but it has some data in it that can take a while to compute. That makes it convenient to have, since downloading this torrent is often faster than computing it from scratch. In particular, it contains SQLite indexes for the tar files, for use with ratarmount.
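To give a sense of what such an index does and why it takes time to build, here is a simplified Python sketch that scans an uncompressed tar file once and stores each member's data offset and size in SQLite. The table layout and function names are made up for this example and are not ratarmount's actual index format.

```python
import sqlite3
import tarfile


def build_tar_index(tar_path, db_path):
    """Scan an uncompressed tar once and record where each member's data lives."""
    connection = sqlite3.connect(db_path)
    with connection:  # commits the transaction on success
        connection.execute(
            "CREATE TABLE IF NOT EXISTS members ("
            "name TEXT PRIMARY KEY, offset_data INTEGER NOT NULL, size INTEGER NOT NULL)"
        )
        with tarfile.open(tar_path, "r:") as archive:
            rows = (
                (member.name, member.offset_data, member.size)
                for member in archive
                if member.isfile()
            )
            connection.executemany(
                "INSERT OR REPLACE INTO members VALUES (?, ?, ?)", rows
            )
    connection.close()


def read_member(tar_path, db_path, name):
    """Read one member's bytes by seeking directly to its recorded offset."""
    connection = sqlite3.connect(db_path)
    row = connection.execute(
        "SELECT offset_data, size FROM members WHERE name = ?", (name,)
    ).fetchone()
    connection.close()
    if row is None:
        raise KeyError(name)
    offset, size = row
    with open(tar_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)
```

Building the index requires reading through the whole archive, which is slow for very large tar files. Once the index exists, a single file can be fetched with one lookup and one seek.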