

Assorted Jupyterhub notes



The data scientists on both Party A's and Party B's sides need a variety of interactive front ends to do their work, so we moved from Zeppelin to JupyterLab, and then from Lab to JupyterHub.

As for the programming level of Party A's data scientists, it is hard to say anything flattering, yet some credit is due: it is not often in this lifetime that you see Python written as if it were Pig. Their mental model is thoroughly Pig Latin; anyone who has written Pig will know what I mean.

In any case, I just want to record the work process here, so I can reuse it when the next Party A needs the same thing.

Environment: Anaconda3 + Jupyterhub + Spark 2.1 + CDH 5.14 + Kerberos

1. Integration of Hub and Spark

I won't cover how to install jupyterhub under Anaconda or how to generate its configuration file; there are plenty of guides online.

Since the data scientists only use Python, I won't record the Toree-based kernels for other languages here. With only Spark to support, it's pretty simple: just create one file.

Create the path /usr/share/jupyter/kernels/pyspark2/ if it does not exist, then create kernel.json in it with vi. The contents are as follows:

{"argv": ["python3.6", "- m", "ipykernel_launcher", "- f", "{connection_file}"], "display_name": "Python3.6+PySpark2.1", "language": "python", "env": {"PYSPARK_PYTHON": "/ opt/anaconda3/bin/python" "SPARK_HOME": "/ opt/spark-2.1.3-bin-hadoop2.6", "HADOOP_CONF_DIR": "/ etc/hadoop/conf", "HADOOP_CLIENT_OPTS": "Xmx2147483648-XX:MaxPermSize=512M-Djava.net.preferIPv4Stack=true" "PYTHONPATH": "/ opt/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip:/opt/spark-2.1.3-bin-hadoop2.6/python/", "PYTHONSTARTUP": "/ opt/spark-2.1.3-bin-hadoop2.6/python/pyspark/shell.py" "PYSPARK_SUBMIT_ARGS": "--master yarn-- deploy-mode client-- name JuPysparkHub Test pyspark-shell"}

And that's all there is to it.
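To sanity-check the kernel, here is a minimal sketch (my own addition, assuming the kernel above shows up in Jupyter as "Python3.6+PySpark2.1" and that pyspark/shell.py, run via PYTHONSTARTUP, has created the usual spark and sc objects):

# Run inside a notebook started with the Python3.6+PySpark2.1 kernel.
# `spark` and `sc` are created by pyspark/shell.py through PYTHONSTARTUP, not imported here.
print(spark.version)        # should print 2.1.x
df = spark.range(1000)      # tiny DataFrame; count() forces a YARN application to start
print(df.count())           # expect 1000 if executors come up and authentication works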

2. Integrating Jupyterhub with a standalone Kerberos, without LDAP

Since the cluster runs its own Kerberos and is not integrated with the system's PAM or LDAP, the Jupyter code has to be modified here.

If your cluster combines KRB5 with LDAP, you can skip the code changes: Jupyterhub officially provides an LDAP authentication plug-in.
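For reference, a rough sketch of that official plug-in (the jupyterhub-ldapauthenticator package; the server address and DN template below are placeholders, not values from this cluster) would look something like this in jupyterhub_config.py:

# jupyterhub_config.py -- sketch of LDAP authentication, not used in this setup
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = 'ldap.example.com'   # placeholder LDAP server
c.LDAPAuthenticator.bind_dn_template = [
    'uid={username},ou=people,dc=example,dc=com',          # placeholder DN template
]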

Based on my reading of the Jupyterhub source code, the Hub itself is a multi-user authentication front end: after each user authenticates, it uses that user's Linux account to launch a notebook server for them.

That is the architecture flow as I summarized it from reading the Hub source code; it is not necessarily precise, but it conveys the idea. NB here means NoteBook, not "awesome".
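As a minimal sketch of that flow (my own illustration, not the production configuration; the IP, port, and environment values are placeholders), the Hub side boils down to a PAM authenticator plus a per-user local process spawner in jupyterhub_config.py:

# jupyterhub_config.py -- illustrative sketch only; values are placeholders
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 10000
# PAMAuthenticator checks the Linux account; LocalProcessSpawner then starts
# jupyter-notebook as that same Linux user, inheriting that user's identity.
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'
c.JupyterHub.spawner_class = 'jupyterhub.spawner.LocalProcessSpawner'
# Extra environment handed to every spawned notebook, if needed.
c.Spawner.environment = {'HADOOP_CONF_DIR': '/etc/hadoop/conf'}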

So in the end the Hub starts a new jupyter-notebook process for each login, and that process runs with the logged-in user's environment. That leaves two ways to initialize the Kerberos principal.

Idea 1:

Do the kinit through the system user's own environment: put the kinit command in the user's .bash_profile or .bashrc, so that whenever the user's environment is loaded the kinit runs automatically, with no code changes at all. The downside is that once the ticket expires you have to reopen the notebook, with the risk of losing unsaved edits.

Idea 2:

Modify the hub/notebook code so the kinit happens inside the NB itself, independent of the system environment variables. In particular, the notebook has an autosave mechanism: the front end periodically sends save requests to the back-end API. If you hook that API, the kinit is refreshed regularly, so as long as the notebook stays open the Kerberos ticket never expires.

Considering that Party A's data scientists are a group of Down's patients with no self-care or learning ability beyond writing PySpark SQL, I decided to adopt the second scheme: it saves trouble for them and for me, once and for all.

According to the Hub's process architecture and Idea 2, I only really need to add the kinit to the notebook itself; but to be safe, I also changed the Hub's login authentication.

Since both the hub and the notebook are written with Tornado, they are easy to change.

The first step is to write a simple, rough piece of Python code that performs the kinit.

def kinit(self, username):
    """ add by xianglei """
    import os
    from subprocess import Popen, PIPE
    # Look up the Linux uid/gid of the logged-in user.
    uid_args = ['id', '-u', username]
    uid = Popen(uid_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    uid = uid.communicate()[0].decode().strip()
    gid_args = ['id', '-g', username]
    gid = Popen(gid_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    gid = gid.communicate()[0].decode().strip()
    self.log.info('UID: ' + uid + ' GID: ' + gid)
    self.log.info('Authenticating: ' + username)
    realm = 'XX.COM'
    kinit = '/usr/bin/kinit'
    krb5cc = '/tmp/krb5cc_%s' % (uid,)
    keytab = '/home/%s/%s.wb1.keytab' % (username, username)
    principal = '%s@%s' % (username, realm,)
    # kinit with the user's keytab into a per-user credential cache.
    kinit_args = [kinit, '-kt', keytab, '-c', krb5cc, principal]
    self.log.info('Running: ' + ' '.join(kinit_args))
    kinit = Popen(kinit_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    self.log.info(kinit.communicate())
    ans = None
    if os.path.isfile(krb5cc):
        # Make sure the cache belongs to, and is only readable by, the user.
        os.chmod(krb5cc, 0o600)
        os.chown(krb5cc, int(uid), int(gid))
        ans = username
    return ans

It simply shells out to operating system commands, which you could fairly call crude.
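A small refinement, not in the original code and sketched here only as a suggestion, would be to skip re-running kinit while the cached ticket is still valid, using klist -s (which exits non-zero if the cache is missing or expired):

# Hypothetical helper: return True if the credential cache is still usable.
# It could be called at the top of kinit() to avoid re-authenticating on every save.
from subprocess import call

def ticket_still_valid(krb5cc):
    # `klist -s -c <cache>` prints nothing and exits 0 only for a valid, unexpired cache.
    return call(['klist', '-s', '-c', krb5cc]) == 0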

Then find auth.py in the installed jupyterhub package. Don't ask me where it is; if you can't even locate where a Python third-party module is installed, how are you any different from the data scientists?

Find the authenticate method and change it to the following.

@run_on_executor
def authenticate(self, handler, data):
    """Authenticate with PAM, and return the username if login is successful.
    Return None otherwise.
    """
    username = data['username']
    try:
        pamela.authenticate(username, data['password'], service=self.service, encoding=self.encoding)
        username = self.kinit(username)
    except pamela.PAMError as e:
        if handler is not None:
            self.log.warning("PAM Authentication failed (%s@%s): %s",
                             username, handler.request.remote_ip, e)
        else:
            self.log.warning("PAM Authentication failed: %s", e)
    else:
        if not self.check_account:
            return username
        try:
            pamela.check_account(username, service=self.service, encoding=self.encoding)
            username = self.kinit(username)
        except pamela.PAMError as e:
            if handler is not None:
                self.log.warning("PAM Account Check failed (%s@%s): %s",
                                 username, handler.request.remote_ip, e)
            else:
                self.log.warning("PAM Account Check failed: %s", e)
        else:
            return username

This way the hub already does a kinit at login time. It isn't strictly necessary, but when the Down's patients outnumber the cluster machines, I need some psychological comfort.

Then find notebook's notebook/handlers.py file and modify it as follows (the kinit helper above also has to be available on these handler classes, since the code below calls self.kinit).

class NotebookHandler(IPythonHandler):

    @web.authenticated
    def get(self, path):
        """get renders the notebook template if a name is given, or
        redirects to the '/files/' handler if the name is not given."""
        path = path.strip('/')
        cm = self.contents_manager

        # will raise 404 on not found
        try:
            model = cm.get(path, content=False)
        except web.HTTPError as e:
            if e.status_code == 404 and 'files' in path.split('/'):
                # 404, but '/files/' in URL, let FilesRedirect take care of it
                return FilesRedirectHandler.redirect_to_files(self, path)
            else:
                raise
        if model['type'] != 'notebook':
            # not a notebook, redirect to files
            return FilesRedirectHandler.redirect_to_files(self, path)
        name = path.rsplit('/', 1)[-1]
        username = self.current_user['name']
        self.kinit(username)
        self.write(self.render_template('notebook.html',
            notebook_path=path,
            notebook_name=name,
            kill_kernel=False,
            mathjax_url=self.mathjax_url,
            mathjax_config=self.mathjax_config,
            get_custom_frontend_exporters=get_custom_frontend_exporters
            )
        )

The code above does a kinit whenever a notebook is opened.

Then open notebook's services/contents/handlers.py:

@gen.coroutine
def _save(self, model, path):
    """Save an existing file."""
    chunk = model.get("chunk", None)
    if not chunk or chunk == -1:  # Avoid tedious log information
        self.log.info(u"Saving file at %s", path)
    if 'name' in self.current_user:
        if isinstance(self.current_user['name'], str):
            self.kinit(self.current_user['name'])
    model = yield gen.maybe_future(self.contents_manager.save(model, path))
    validate_model(model, expect_content=False)
    self._finish_model(model)

The purpose of the above code is to trigger a kinit whenever a notebook is saved, whether automatically or manually.

With all these safety measures in place, the Down's patients report that life is good again, wearing a smile not seen for a long time.

3. Hub behind an nginx reverse proxy with multiple domain names and SSL

As a normal person I can never guess what they are thinking. They always find a way to keep Party B working endlessly, so that they themselves can stop working under the banner of a system upgrade. So, wouldn't plain IP access do? Why do we have to have domain names?

Moreover, our cluster environment is split into a business ** and a management **, each bound to a different domain name, and both have to be reached over SSL.

The trouble here is that Jupyter rejects cross-origin access. SSL plus the reverse-proxy configuration is not hard; the hard part is the cross-origin access. Actually the cross-origin part isn't hard either; the hard part is restraining the urge to strangle these data scientists.

upstream hub10000 {
    server 172.16.191.110:10000;
}

server {
    listen 3000;
    server_name mgmthub.xxx.cn buhub.xxx.cn;
    ssl on;
    ssl_certificate_key cert/xxx.cn.key;
    ssl_certificate cert/xxx.cn.crt;
    ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:!RC4:!aNULL:!eNULL:!LOW:!3DES:!MD5:!EXP:!CBC:!EDH:!kEDH:!PSK:!SRP;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_session_cache shared:SSL:10m;
    ssl_prefer_server_ciphers on;
    ssl_session_timeout 1d;
    ssl_stapling on;
    ssl_stapling_verify on;
    add_header Strict-Transport-Security "max-age=31536000; includeSubdomains;";
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-Content-Type-Options nosniff;
    add_header X-XSS-Protection "1; mode=block";
    add_header Content-Security-Policy "default-src 'self';style-src 'self' 'unsafe-inline';script-src 'self' 'unsafe-inline' 'unsafe-eval';font-src 'self' data:;connect-src 'self' wss:;img-src 'self' data:;";

    location / {
        proxy_pass http://hub10000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Origin "";
        include proxy.conf;
    }
}

proxy.conf:

proxy_redirect off;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Nginx-Proxy true;
proxy_connect_timeout 604800;
proxy_send_timeout 604800;
proxy_read_timeout 604800;
proxy_buffer_size 64k;
proxy_buffers 64 32k;
proxy_busy_buffers_size 128k;
proxy_temp_file_write_size 64k;
proxy_next_upstream error timeout invalid_header http_500 http_503 http_404;
proxy_max_temp_file_size 128m;
client_body_temp_path client_body 1 2;
proxy_temp_path proxy_temp 1 2;

172.16.191.110:10000 is the jupyterhub listening address, which lives on another machine.

And then add

proxy_set_header Origin "";

With that, Tornado's cross-origin check is satisfied, without modifying the bind_url setting in the jupyterhub configuration.
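An alternative I did not use here, sketched only as a suggestion, is to leave the Origin header alone and instead tell the spawned notebooks which origins to accept, via the notebook's allow_origin_pat option passed through the spawner:

# jupyterhub_config.py -- hypothetical alternative to stripping the Origin header in nginx.
# Passes an extra flag to every spawned jupyter-notebook so it accepts both proxied domains.
c.Spawner.args = [
    r"--NotebookApp.allow_origin_pat=https://(mgmthub|buhub)\.xxx\.cn(:3000)?",
]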

I apologize to the Down's patients mentioned in this article. I know that comparing Party A's data scientists to them is disrespectful to Down's patients, but my limited talent and learning offer no better word for Party A's data scientists. I don't discriminate against Down's patients; I discriminate against data scientists.
