[Buildroot] [PATCH 4/4] support/scripts/pkg-stats: reimplement CPE parsing in pkg-stats
Yann E. MORIN
yann.morin.1998 at free.fr
Sat Apr 2 17:20:12 UTC 2022
Thomas, All,
On 2022-04-02 16:15 +0200, Thomas Petazzoni via buildroot spake thusly:
> pkg-stats currently uses the services from support/scripts/cpedb.py to
> match the CPE identifiers of packages with the official CPE database.
>
> Unfortunately, the cpedb.py code uses regular ElementTree parsing,
> which involves loading the full XML tree into memory. This causes the
> pkg-stats process to consume a huge amount of memory:
>
> thomas 1310458 85.2 21.4 3708952 3450164 pts/5 R+ 16:04 0:33 | | \_ python3 ./support/scripts/pkg-stats
>
> So, 3.7 GB of VSZ and 3.4 GB of RSS are used by the pkg-stats
> process. This is causing the OOM killer to kick in on machines with
> relatively low memory.
>
> This commit reimplements the XML parsing needed to do the CPE matching
> directly in pkg-stats, using the XmlParser functionality of
> ElementTree, also called "streaming parsing". Thanks to this, we never
> load the entire XML tree in RAM, but only stream it through the
> parser, and construct a very simple list of all CPE identifiers. The
> max memory consumption of pkg-stats is now:
>
> thomas 1317511 74.2 0.9 381104 152224 pts/5 R+ 16:08 0:17 | | \_ python3 ./support/scripts/pkg-stats
>
> So, 381 MB of VSZ and 152 MB of RSS, which is obviously much better.
>
> Now, one will probably wonder why this isn't directly changed in
> cpedb.py. The reason is simple: cpedb.py is also used by
> support/scripts/missing-cpe, which (for now) heavily relies on having
> in memory the ElementTree objects, to re-generate a snippet of XML
> that allows us to submit to NIST new CPE entries.
>
> So, future work could include one of those two options:
>
> (1) Re-integrate cpedb.py into missing-cpe directly, and live with
> two different ways of processing the CPE database.
>
> (2) Rewrite the missing-cpe logic to also be compatible with a
> streaming parsing, which would allow this logic to be again
> shared between pkg-stats and missing-cpe.
>
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni at bootlin.com>
> ---
> support/scripts/pkg-stats | 39 +++++++++++++++++++++++++++++++++++----
> 1 file changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
> index ae1a9aa5e4..cc163ebb1a 100755
> --- a/support/scripts/pkg-stats
> +++ b/support/scripts/pkg-stats
> @@ -27,12 +27,14 @@ import re
> import subprocess
> import json
> import sys
> +import time
> +import gzip
> +import xml.etree.ElementTree
You forgot to import requests, which is used later on.
I also fixed a bunch of flake8 issues:
support/scripts/pkg-stats:49:1: E302 expected 2 blank lines, found 1
support/scripts/pkg-stats:632:9: E306 expected 1 blank line before a nested definition, found 0
support/scripts/pkg-stats:635:9: E306 expected 1 blank line before a nested definition, found 0
support/scripts/pkg-stats:639:5: E303 too many blank lines (2)
1 E302 expected 2 blank lines, found 1
1 E303 too many blank lines (2)
2 E306 expected 1 blank line before a nested definition, found 0
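
For illustration only, here is a minimal standalone sketch of the same
streaming approach, laid out with the blank lines flake8 asks for. The
helper name parse_cpe_names, the CPE23_ITEM_TAG constant and the two-entry
sample dictionary are made up for this example; the per-instance list is
also a small variation on the patch, which uses a class attribute. This is
not the code that was committed.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Fully qualified tag name, as matched in the patch.
CPE23_ITEM_TAG = "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item"


def parse_cpe_names(fileobj, chunk_size=1024 * 1024):
    """Stream a CPE dictionary and return the list of cpe23-item names.

    Hypothetical helper: only the parser target mirrors the patch.
    """

    class CpeXmlParser:
        def __init__(self):
            # Per-instance list (the patch uses a class attribute).
            self.cpes = []

        def start(self, tag, attrib):
            if tag == CPE23_ITEM_TAG:
                self.cpes.append(attrib['name'])

        def close(self):
            return self.cpes

    parser = ET.XMLParser(target=CpeXmlParser())
    while True:
        # Feed the decompressed data chunk by chunk, so that the full
        # XML tree is never built in memory.
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    # XMLParser.close() returns whatever the target's close() returns.
    return parser.close()


# Made-up sample, gzipped in memory to stand in for the NIST file.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<cpe-list xmlns="http://cpe.mitre.org/dictionary/2.0"
          xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3">
  <cpe-item name="cpe:/a:example:foo:1.0">
    <cpe-23:cpe23-item name="cpe:2.3:a:example:foo:1.0:*:*:*:*:*:*:*"/>
  </cpe-item>
  <cpe-item name="cpe:/a:example:bar:2.1">
    <cpe-23:cpe23-item name="cpe:2.3:a:example:bar:2.1:*:*:*:*:*:*:*"/>
  </cpe-item>
</cpe-list>
"""

with gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(SAMPLE_XML))) as f:
    cpes = parse_cpe_names(f)
```

With this layout, each nested definition is preceded by one blank line
(E306) and each top-level definition by two (E302).
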
> brpath = os.path.normpath(os.path.join(os.path.dirname(__file__), "..", ".."))
>
> sys.path.append(os.path.join(brpath, "utils"))
> from getdeveloperlib import parse_developers # noqa: E402
> -from cpedb import CPEDB # noqa: E402
>
> INFRA_RE = re.compile(r"\$\(eval \$\(([a-z-]*)-package\)\)")
> URL_RE = re.compile(r"\s*https?://\S*\s*$")
> @@ -42,6 +44,7 @@ RM_API_STATUS_FOUND_BY_DISTRO = 2
> RM_API_STATUS_FOUND_BY_PATTERN = 3
> RM_API_STATUS_NOT_FOUND = 4
>
> +CPEDB_URL = "https://static.nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.gz"
Instead of duplicating it here, I changed that to import it from cpedb.
Applied to master with all the above fixed, thanks.
Regards,
Yann E. MORIN.
> class Defconfig:
>      def __init__(self, name, path):
> @@ -624,12 +627,40 @@ def check_package_cves(nvd_path, packages):
>
>
> def check_package_cpes(nvd_path, packages):
> -    cpedb = CPEDB(nvd_path)
> -    cpedb.get_xml_dict()
> +    class CpeXmlParser:
> +        cpes = []
> +        def start(self, tag, attrib):
> +            if tag == "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item":
> +                self.cpes.append(attrib['name'])
> +        def close(self):
> +            return self.cpes
> +
> +
> +    print("CPE: Setting up NIST dictionary")
> +    if not os.path.exists(os.path.join(nvd_path, "cpe")):
> +        os.makedirs(os.path.join(nvd_path, "cpe"))
> +
> +    cpe_dict_local = os.path.join(nvd_path, "cpe", os.path.basename(CPEDB_URL))
> +    if not os.path.exists(cpe_dict_local) or os.stat(cpe_dict_local).st_mtime < time.time() - 86400:
> +        print("CPE: Fetching xml manifest from [" + CPEDB_URL + "]")
> +        cpe_dict = requests.get(CPEDB_URL)
> +        open(cpe_dict_local, "wb").write(cpe_dict.content)
> +
> +    print("CPE: Unzipping xml manifest...")
> +    nist_cpe_file = gzip.GzipFile(fileobj=open(cpe_dict_local, 'rb'))
> +
> +    parser = xml.etree.ElementTree.XMLParser(target=CpeXmlParser())
> +    while True:
> +        c = nist_cpe_file.read(1024*1024)
> +        if not c:
> +            break
> +        parser.feed(c)
> +    cpes = parser.close()
> +
>      for p in packages:
>          if not p.cpeid:
>              continue
> -        if cpedb.find(p.cpeid):
> +        if p.cpeid in cpes:
>              p.status['cpe'] = ("ok", "verified CPE identifier")
>          else:
>              p.status['cpe'] = ("error", "CPE version unknown in CPE database")
> --
> 2.35.1
>
> _______________________________________________
> buildroot mailing list
> buildroot at buildroot.org
> https://lists.buildroot.org/mailman/listinfo/buildroot
--
.-----------------.--------------------.------------------.--------------------.
| Yann E. MORIN | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software Designer | \ / CAMPAIGN | ___ |
| +33 561 099 427 `------------.-------: X AGAINST | \e/ There is no |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL | v conspiracy. |
'------------------------------^-------^------------------^--------------------'