Diffstat (limited to '3gpp-spec/README.md')
-rw-r--r-- | 3gpp-spec/README.md | 143 |
1 files changed, 143 insertions, 0 deletions
diff --git a/3gpp-spec/README.md b/3gpp-spec/README.md
new file mode 100644
index 0000000..7a7d31f
--- /dev/null
+++ b/3gpp-spec/README.md
@@ -0,0 +1,143 @@
+The purpose of this project is to put the [3GPP](http://www.3gpp.org/) specifications in a version control system.
+This is accomplished using the `Rakefile` script.
+
+requirements
+============
+
+To perform all operations, the `Rakefile` script requires the following components:
+
+- `rake`: to run the `Rakefile` (the Ruby counterpart of a `Makefile`)
+- `wget`: to download the 3GPP specifications and `tika`
+- `rubyzip`: to extract the .doc specification documents from the downloaded .zip archives
+- `java`: to convert the .doc specification documents to machine-readable .xhtml documents using `tika`
+- `nokogiri`: to parse the .xhtml specification documents and output .txt documents
+- `git`: to put the resulting .txt specification documents under revision control
+
+To install all requirements on a Debian-based distribution:
+
+    sudo apt install rake wget ruby-zip default-jre ruby-nokogiri git
+
+steps
+=====
+
+download
+--------
+
+The 3GPP specifications are available [online](http://www.3gpp.org/ftp/Specs/archive/).
+There also is an [FTP mirror](ftp://ftp.3gpp.org/Specs/archive/).
+While it is faster to download them over HTTP the first time, updating them afterwards is significantly faster over FTP (~15 minutes over FTP vs >90 minutes over HTTP when no new file is present).
+
+`rake download` will download all publicly available specifications, or get only the new ones if some have already been downloaded.
+The first time `rake` is run, this is done automatically; afterwards the task should be run by hand periodically to get new specifications.
+
+The specifications come in .zip archives containing the .doc specification documents.
+
+extract
+-------
+
+The .doc specification documents are extracted from the .zip archives.
+The file name of the .zip archive is used to get the specification number and version.
+These are then used to extract the corresponding .doc specification document.
+
+`rake extract` will extract all .doc specification documents from all .zip archives.
+Otherwise, the corresponding .doc specification documents are extracted from the .zip archives when needed to generate the .xhtml documents for a particular specification.
+
+convert
+-------
+
+The .doc specification documents are converted to machine-readable .xhtml documents using *tika*.
+
+`rake convert` will convert all .doc specification documents to .xhtml documents.
+Otherwise, the corresponding .doc specification document is converted to an .xhtml document when needed to generate the final text document for a particular specification.
+
+NOTE: This is the longest step.
+
+text
+----
+
+The .xhtml specification documents are converted to cleaner .txt documents.
+All embedded graphics, diagrams, and page information are removed since we are only interested in the text changes.
+
+`rake text` will convert all .xhtml specification documents to .txt documents.
+Otherwise, the corresponding .xhtml specification document is converted to a .txt document when needed to generate the final text document for a particular specification.
+
+versioning
+----------
+
+The generated .txt specification documents each correspond to a single version of a specification.
+These become the final specification xx.yyy.txt text file.
+The individual versions are committed into the git repository.
+
+`rake ftp.3gpp.org/Specs/archive/xx_series/xx.yyy/xx.yyy.txt` will generate and commit a single specification.
+`rake` will generate and commit all specifications.
+
+exceptions
+==========
+
+The 3GPP specifications are created and maintained by humans, and only meant for humans.
+Trying to find a scheme able to handle all documents is pretty hard.
+The script allows adding exceptions for specifications that don't match the general schemes.
+
+The script will first collect a list of all specifications.
+Wrongly formatted specification archive names are displayed.
+Add these to the **BAD_ARCHIVE** list to ignore them.
+
+extract
+-------
+
+Sometimes the .doc specification document file name does not correspond to the specification number and version extracted from the .zip archive file name.
+In this case a warning message will be displayed and an empty .doc file is created, to avoid trying to extract the document every time the final text specification needs to be generated.
+
+You can add the actual .doc specification document file name corresponding to the expected file name in the **DOC_FILENAME** list.
+Also delete the existing empty .doc file.
+The specified .doc will automatically be extracted and used to generate the final text specification next time the specification needs to be generated, or `rake` is run to generate all text specifications.
+
+Sometimes the .zip archive does not contain a .doc(x) specification document, but a .txt text file, or another .zip archive containing the final .doc documents.
+This is not handled by the script.
+Extract the .doc file by hand to the expected .doc file name; it will then be used to generate the text specification.
+
+convert
+-------
+
+Sometimes the .doc specification document uses the old Microsoft Word format not supported by *tika*, or is malformed.
+In this case a warning message will be displayed and an empty .xhtml file is created, to avoid trying to convert the document every time the final text specification needs to be generated.
+To prevent this warning from being displayed, add the file path to the **DOC_NOCONVERT** list.
+
+You can also try to manually fix the malformed .doc file into a regular .doc file (readable by *tika*).
+The new .doc file will automatically be converted to an .xhtml document and used to generate the final text specification next time the specification needs to be generated, or `rake` is run to generate all text specifications.
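The empty-file convention described in the extract and convert exceptions can be sketched in Ruby roughly as follows. This is only an illustration, not the actual `Rakefile` code: `convert_or_placeholder`, the `noconvert:` parameter (standing in for the **DOC_NOCONVERT** list), and the block that stands in for the *tika* call are all hypothetical names.

```ruby
# Minimal sketch (assumption: not the real Rakefile implementation).
# Convert `doc` into `xhtml` using the given block, which stands in for
# the tika invocation. If the block raises, leave an empty placeholder
# file so the conversion is not retried on every run, and print a
# warning unless the path is in the (hypothetical) noconvert list.
def convert_or_placeholder(doc, xhtml, noconvert: [])
  File.write(xhtml, yield(doc))
  true
rescue StandardError => e
  File.write(xhtml, "") # empty placeholder marks the document as bogus
  warn "could not convert #{doc}: #{e.message}" unless noconvert.include?(doc)
  false
end

# An empty file is how a bogus document is recognised later on,
# e.g. by a cleanup task such as `rake clean_bogus`.
def bogus?(path)
  File.zero?(path)
end
```

Fixing the source .doc and deleting the placeholder is then enough for the next run to attempt the conversion again.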
+
+versioning
+----------
+
+Extracting the specification version from the document is a hard task due to the varied formatting of the documents.
+It also happens that the actual version in the document is wrong.
+Use the **FILENAME_VERSION** list to enforce a specification number.
+If **USE_FILENAME_VERSION** is set, the script will use the version information extracted from the file name in case it does not find the version information in the document or this is erroneous.
+
+cleaning
+--------
+
+`rake clean_bogus` will remove empty .txt, .xhtml, and .doc specification version files.
+These files are generated during the above steps when the documents are malformed, to prevent re-performing the steps every time.
+
+`rake clobber` will remove the git repository and .txt specifications.
+
+`rake clean` will remove all .txt, .xhtml, and .doc specification version files.
+
+tips & tricks
+=============
+
+Because of the large number of .doc, .xhtml, and .txt intermediate files, it is recommended to use a file system offering transparent compression.
+For example, BTRFS with zstd compression uses 136 GB of disk space for 228 GB of total file size (use `btrfs filesystem df /mountpoint` to get the used disk space).
+
+`rake stats` will show the number and total size of the different specification files.
+
+future
+======
+
+- set commit date to specification release (or file) date
+- set author to working group
+- add a task to get all specifications up to a version (should create a new branch and pick the commits of the specification up to a version using the commit tags)
+- improve specification version extraction
+- improve exception and version list
+- extract release notes from document and put in commit message
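As an illustration of how a specification number and version might be derived from an archive file name, here is a small Ruby sketch. It assumes the common 3GPP naming pattern `SSNNN-xyz.zip` (series, number, then three version characters in base 36, so `f` means 15); the README does not spell this pattern out, and `parse_archive_name` is a hypothetical helper, not a function from the `Rakefile`. Names that do not match return `nil`, which is where something like the **BAD_ARCHIVE** list would come in.

```ruby
# Minimal sketch (assumption: archive names look like "23401-f40.zip").
# Returns a hash with the dotted specification number and version,
# or nil when the name does not follow the assumed pattern.
def parse_archive_name(name)
  m = /\A(\d{2})(\d{2,3})-([0-9a-z]{3})\.zip\z/i.match(name)
  return nil unless m

  # Each version character is one base-36 digit: "f40" -> 15, 4, 0.
  version = m[3].downcase.chars.map { |c| c.to_i(36) }.join(".")
  { spec: "#{m[1]}.#{m[2]}", version: version }
end
```

With this sketch, `parse_archive_name("23401-f40.zip")` yields specification `23.401` at version `15.4.0`. Multi-part specifications and longer version encodings are deliberately left out and would need their own handling.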