Open Assistant: Conversational AI for Everyone

(open-assistant.io)

424 points | by chriskanan 445 days ago

48 comments

  • chriskanan 445 days ago
    I'm really excited about this project and I think it could be really disruptive. It is organized by LAION, the same folks who curated the dataset used to train Stable Diffusion.

    My understanding of the plan is to fine-tune an existing large language model, trained with self-supervised learning on a very large corpus of data, using reinforcement learning from human feedback, which is the same method used in ChatGPT. Once the dataset they are creating is available, though, perhaps better methods can be rapidly developed, as it will democratize the ability to do basic research in this space. I'm curious how much more limited the systems they are planning to build will be compared to ChatGPT, since they plan to make models with far fewer parameters so they can be deployed on much more modest hardware.

    As an AI researcher in academia, it is frustrating to be blocked from doing a lot of research in this space due to computational constraints and a lack of the required data. I'm teaching a class this semester on self-supervised and generative AI methods, and it will be fun to let students play around with this in the future.

    Here is a video about the Open Assistant effort: https://www.youtube.com/watch?v=64Izfm24FKA

    • naasking 445 days ago
      > it is frustrating to be blocked from doing a lot of research in this space due to computational

      Do we need a SETI@home-like project to distribute the training computation across many volunteers so we can all benefit from the trained model?

      • zone411 445 days ago
        Long story short, training requires intensive device-to-device communication. Distributed training is possible in theory but so inefficient that it's not worth it. Here is a new paper that looks to be the most promising approach yet: https://arxiv.org/abs/2301.11913
        • sillysaurusx 445 days ago
          It doesn’t, actually. The model weights can be periodically averaged with each other. No need for synchronous gradient broadcasts.

          Why people aren’t doing this has always been a mystery to me.

          Relevant: https://battle.shawwn.com/swarm-training-v01a.pdf
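
          To make the idea concrete, here's a toy sketch of periodic weight averaging in PyTorch. This is just the basic idea, not the exact recipe from the draft above: each worker trains independently, and a coordinator occasionally averages everyone's weights and pushes the average back.

            import torch

            def average_state_dicts(state_dicts):
                # Element-wise mean of identically-shaped model state dicts.
                avg = {}
                for key in state_dicts[0]:
                    stacked = torch.stack([sd[key].float() for sd in state_dicts])
                    avg[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
                return avg

            def sync_workers(worker_models):
                # Each worker trains on its own data; every N steps, pull all the
                # weights, average them, and push the average back to every worker.
                averaged = average_state_dicts([m.state_dict() for m in worker_models])
                for m in worker_models:
                    m.load_state_dict(averaged)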

          • telotortium 445 days ago
            You linked a paper with no results and no conclusion. Perhaps you meant to link a different paper?
            • sillysaurusx 445 days ago
              I never finished it.
              • hackernewds 445 days ago
                so it is unproven? what is the value of it?
                • sillysaurusx 445 days ago
                  It’s how we trained roughly 40 GPT 1.5B models. The technique works; it’s up to you to try it out.
                  • zone411 445 days ago
                    The abstract mentions fine-tuning, not full pre-training?
                    • sillysaurusx 444 days ago
                      Yeah, sorry for not being precise. We used the technique to fine tune around 40 GPT 1.5B models, including the chess one.

                      It was very apparent that the technique was working well. The loss curve suddenly started dropping dramatically the first day we got it working.

          • 6510 445 days ago
            I think the landscape has plenty to think about with few explorers able to wrap their wetware around all of it?
          • naasking 445 days ago
            Wouldn't other signal propagation approaches, like Forward-Forward, make this easier?
        • nylonstrung 445 days ago
          Would have to be federated learning to work I think
      • 8f2ab37a-ed6c 445 days ago
        That's brilliant, I would love to spare compute cycles and network on my devices for this if there's an open source LLM on the other side that I can use in my own projects, or commercially.

        Doesn't feel like there's much competition for ChatGPT at this point otherwise, which can't be good.

        • davely 445 days ago
          On the generative image side of the equation, you can do the same thing with Stable Diffusion[1], thanks to a handy open source distributed computing project called Stable Horde[2].

          LAION has started using Stable Horde for aesthetics training to feed back into and improve their datasets for future models[3].

          I think one can foresee the same thing eventually happening with LLMs.

          Full disclosure: I made ArtBot, which is referenced in both the PC World article and the LAION blog post.

          [1] https://www.pcworld.com/article/1431633/meet-stable-horde-th...

          [2] https://stablehorde.net/

          [3] https://laion.ai/blog/laion-stable-horde/

        • vineyardmike 444 days ago
          > Doesn't feel like there's much competition for ChatGPT at this point otherwise, which can't be good.

          Facebook open sourced their LLM, called OPT [1]. There's not much else, and OPT isn't exactly easy to run (requires like 8 GPUs).

          I'm not an expert, so I don't know why some models, like the image generators we've seen, are able to fit on phones, while LLMs require $500k worth of GPUs to run. Hopefully this is the first step to changing that.

          [1] https://ai.facebook.com/blog/democratizing-access-to-large-s...

      • VadimPR 445 days ago
        • mdorazio 445 days ago
          I've seen Petals mentioned several times before and I don't think it's the same thing. Correct me if I'm wrong, but it seems Petals is for running distributed inference and fine-tuning of an existing model. What the above poster and I really want to see is distributed training of a new model across a network.

          Much like I was able to choose to donate CPU cycles to a wide variety of BOINC-based projects, I want to be able to donate GPU cycles to anyone with a crazy idea for a new ML model - text, image, finance, audio, etc.

      • andai 445 days ago
        I read about something a few weeks ago which does just this! Does anyone know what it's called?
      • Iv 445 days ago
        Hell it could even be the proof of work for a usable crypto-currency. "Prove that you lowered the error rate compared to SOTA and earn 50 ponzicoins!"
      • qudat 445 days ago
        The labelled data seems more of a blocker than anything else. As far as I'm aware, the actual NNs running the models are relatively simple; it's the human labor involved in gathering, cleaning, and labeling data for training that is the most resource intensive.
        • naasking 445 days ago
          The data is valuable yes, but training a model still requires millions of dollars worth of compute. That's a perfect cost to distribute among volunteers if it could be done.
      • ikekkdcjkfke 445 days ago
        Yeah man, and you get access to the model as payment for donating cycles
      • ec109685 445 days ago
        Another idea is to dedicate cpu cycles to something else that is easier to distribute, and then use the proceeds for massive amounts of gpu for academic use.

        Crypto is an example.

        • jxf 445 days ago
          This creates indirection costs and counterparty risks that don't appear in the original solution.
          • ec109685 445 days ago
            There is also an indirection cost in taking something that is optimized to run on GPUs within a data center and distributing it to individual PCs.
        • slim 445 days ago
          this would be very wasteful
          • ec109685 445 days ago
            So is trying to distribute training across nodes compared to what can be done inside a data center.
    • lucidrains 445 days ago
      Yannic and the community he has built are such an educational force of good. His YouTube videos explaining papers have helped me and so many others. Thank you Yannic for all that you do!
      • wcoenen 445 days ago
        > force of good

        I think he cares more about freedom than "good". Many people were not happy about his "GPT-4chan" project.

        (I'm not judging.)

        • zarzavat 445 days ago
          I don't think those people legitimately cared about the welfare of 4chan users who were experimented on. They just perceived the project to be bad optics that might threaten the AI gravy train.
    • RobotToaster 445 days ago
      > It is organized by LAION, the same folks who curated the dataset used to train Stable Diffusion.

      I'm guessing, like Stable Diffusion, it won't be under an open source licence then? (The Stable Diffusion licence discriminates against fields of endeavour)

      • ShamelessC 445 days ago
        You are confusing LAION with Stability.ai. They share some researchers but the former is a completely transparent and open effort which you are free to join and criticize this very moment. The latter is a VC backed effort which does indeed have some of the issues you mention.

        Good guess though...

      • jszymborski 445 days ago
        The LICENSE file in the linked repo says it's under the Apache license.
        • yazzku 445 days ago
          Does this mean that contributions of data, labelling, etc. remain open?

          I'm hesitant to spend a single second on these things unless they are truly open.

          • grealy 445 days ago
            Yes. The intent is definitely to have the data be as open as possible. And Apache v2.0 is currently where it will stay. This project prefers the simplicity of Apache v2.0 and does not care for the RAIL licenses.
    • YeGoblynQueenne 444 days ago
      >> As an AI researcher in academia, it is frustrating to be blocked from doing a lot of research in this space due to computational constraints and a lack of the required data.

      Computational constraints aside, the data used to train GPT-3 was mainly Common Crawl, which is made freely available by a non-profit org:

      https://commoncrawl.org/big-picture/frequently-asked-questio...

      >> What is Common Crawl?

      >> Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.

      So you just need to find the compute. If you have a class of ~30, it should only take about 150 to 450 million.

      Or, you could switch your research and teaching to less compute- and data-intensive approaches? Just because OpenAI and DeepMind et al are championing extremely expensive approaches that only they can realistically use, that's no reason for everyone else to run behind them willy-nilly.

    • hackernewds 445 days ago
      it's sad that, upon observing the success of downstream products such as SD, the creators have chosen to hoard the dataset and become the sole producers of the downstream products as well
    • noam_compsci 445 days ago
      > reinforcement learning from human feedback, which is the same method used in ChatGPT

      Is this confirmed? I thought it was not so.

    • singularity2001 444 days ago
      I don't see the relevance of 50k prompt-response pairs. With exponential combinations of words, this is on the level of what AIML did thirty years ago. Isn't ChatGPT trained on (b/)millions of Stack Overflow and forum responses?
    • modinfo 445 days ago
      [flagged]
    • SillyUsername 445 days ago
      Unfortunately that guy is too distracting for me to watch - he's like a bad 90s Terminator knock off and always in your face waving hands :(
      • coolspot 445 days ago
        While Yannic is also German, he is actually much better than 90s Terminator:

        * he doesn’t want to steal your motorcycle

        * he doesn’t care for your leather jacket either

        * he is not trying to kill yo mama

        • Timwi 443 days ago
          Hate to be that guy, but Arnold Schwarzenegger is Austrian.
  • amrb 445 days ago
    Having open source models could be as important as the Linux project imo
    • version_five 445 days ago
      Yes definitely. If these become an important part of people's lives, they shouldn't all be walled off inside of companies (There is room for both: Microsoft can commission Yankee group to write a report about how the total cost of ownership of running openai models is lower)

      We (humanity) really lost out on the absence of open source search and social media, so this is an opportunity to reclaim it.

      I only hope we can have "neutral" open source curation of these and not try to impose ideology on the datasets and model training right out of the box. There will be calls for this, and lazy criticism about how the demo models are x-ist, and it's going to require principles to ignore the noise and sustain something useful

      • calny 445 days ago
        > they shouldn't all be walled off inside of companies

        Strong agree. This is becoming a bigger concern than people realize too. Sam A said OpenAI will be releasing "much more slowly than people would like" and would "sit on" their tech for a long time going forward.[0] And Deepmind's founder said that "the AI industry's culture of publishing its findings openly may soon need to end."[1]

        This sounds like Google and MSFT won't even be shipping their best AI to people via API's. They'll just keep that tech in-house to power their own services. That underscores the need for open, distributed models. And like you say, there's room for both.

        [0] https://youtu.be/ebjkD1Om4uw?t=294 [1] https://time.com/6246119/demis-hassabis-deepmind-interview/

        • nullc 445 days ago
          And perhaps strong copyleft licenses for research -- a significant fraction of the critical innovation behind these systems was not developed inside these private entities.

          Unfortunately we can expect to see these companies lobbying for laws to block competition in the spurious name of safety concerns.

      • boplicity 445 days ago
        > I only hope we can have "neutral" open source curation of these and not try to impose ideology on the datasets and model training right out of the box.

        I don't see how this is possible. Datasets will naturally carry the biases inherent in the data. Modifying a dataset to "remove" those biases is actually a process of changing the bias to reflect one's idea of "neutral," which, in reality, is yet another bias.

        The only real answer, as far as I can tell, is to be as explicit as possible about one's own biases, and how those biases are informing things like curation of a dataset.

        • dmix 445 days ago
          Biases in data > a giant experiment on putting our thumb on the scale

          If people don't like the inherent biases then don't use it for sensitive stuff like in the justice system or writing some social studies university paper. Focus derision at people who use the model for stupid things. Don't blame the model.

          If the primary concern is people getting upset on Twitter (which seems to be what everyone brings up first) then it will be perpetually fighting against the current, never succeeding, and the restrictions will continue to grow exponentially as "just saying yes" to new rules gets easier and easier.

          Besides, OpenAI can be the hyper-policed AI set. Let's keep the open source models neutral.

          • michaelbrave 445 days ago
            Just to second this, the natural bias of data is usually mostly in line with the bias of society at large, so even if it's something we need to fix, manipulating the data maybe isn't the way to start; maybe it should instead be "let's get more data of what's missing to balance it" rather than "let's prune this because it doesn't fit in line with current political beliefs"
        • version_five 445 days ago
          Neutral means staying out of it. People will try and debate that and try to impart their own views about correcting inherent bias or whatever, which is a version of what I was warning against in my original post.

          Re being explicit about one's own biases, I agree there is lots of room for layers on top of any raw data that allow for some sane corrections - if I remember right, e.g LAION has options to filter violence and porn from their image datasets, which is probably reasonable for many uses. It's when the choice is removed altogether by some tech company's attitude about what should be censored or corrected that it becomes a problem.

          Bottom line, the world's data has plenty of biases. Neutrality means presenting it as it is and letting people make their own decisions, not some faux-for-our-own-good attempt to "correct" it

          • boplicity 445 days ago
            > Neutral means staying out of it.

            What do you mean by staying out of it? As far as I can tell, you can't stay out of choosing which data you use.

            By staying neutral, it seems to me more that you're arguing for putting blinders on.

            In terms of tech companies making choices, you seem to be arguing that they shouldn't intentionally curate their datasets. I would argue that intentional curation is their job, and should be done thoughtfully.

            Larger problems could happen if only one (or two) companies end up effectively controlling the technology, as has happened with internet search; however, that is a completely different problem. It's one of lack of diversity among the people making choices, as opposed to a problem caused by people actually making those choices.

            In other words, I think we should hope for many different large models and datasets, so that no particular one stifles the rest. I think this is the larger point you were trying to make, though I also think the focus on ideology is a tangent from this.

            Personally, I'm of the opinion that people should intentionally, carefully, and openly act with their biases (sometimes called ideology), instead of attempting to hide them, ignore them, or somehow "remove" them. Whether or not they do, however, is a different point than whether or not things end up stifled inside walled gardens.

      • epistemer 445 days ago
        I think an uncensored model will ultimately win out, in exactly the way a hard-coded safe-search-only engine would lose out over time.

        Statistics suggest 20-25% of all searches are for porn. I just don't see how uncensored ChatGPT doesn't beat out the censored version eventually.

        • amluto 445 days ago
          Forget porn. I don’t want my search engine to return specifically the results that one company thinks it should. Look at Google right now — the results are, frankly, crap.

          A search engine that only returns results politically aligned with its creator is a bad search engine, IMO, even for users who generally share political views with the creator.

          • 8note 445 days ago
            Political affiliation is a weird description of SEO spam. The biggest problem with Google is that they're popular, and everyone will do whatever they can to get a cheap website to the top of the search results
            • klabb3 445 days ago
              All major tech companies participate in "regulation" of legal speech, through both implicit and explicit means. This includes biases in ranking and classification algorithms, formal institutions like the Trusted News Initiative, and sometimes direct backchannel requests from governments. None of these actors are transparent about it, nor were they elected to do it. SEO spam is mostly orthogonal to the issue of hidden biases, which are what people are concerned about.
              • amluto 445 days ago
                I don’t think it’s quite orthogonal.

                There are hidden biases in the sense that the provider (Google, etc) would probably prefer that its users don’t think about the bias. These can be political (e.g. much of what you’ve mentioned), but they can also be economic. For example, Google has a strong incentive to direct its search users to view paid impressions of ads served by Google. As an extension of this, Google might not want to directly favor results monetized by Google, but they could (and, I assume, do) favor the kinds of results monetized by Google. This, of course, includes the kinds of sites that might get people to buy things.

                So I suspect that a lot of what we perceive as spam is related to a bias for the kinds of sites that are monetized in a way that benefits Google. And sites that generate viewing patterns that result in many ad impressions.

                Of course, spam is also a thing from a spamminess perspective. But Google’s incentive to reduce spam is, as far as I can tell, primarily an incentive to make its users think that Google Search is useful. Which is also a bias!

                • klabb3 445 days ago
                  For sure. It’s not like market conditions aren’t contributing to bias, of course they are! Monopolies always suffer from perverse incentives, and Google is no exception.

                  OpenAI is predictably pushing the narrative that they should police themselves, and they need to keep the sauce secret for everyone’s safety. New tech comes with challenges, but the opaque moderation and corporate self-policing is more dangerous than the tech itself, imo.

            • redstonefreedom 445 days ago
              To clarify for conversation: did you bring up SEO spam as a conversational shorthand, or because you actually thought that OP meant that when he said "frankly the results are bad"?

              Put another way, were you just trying to say “I don’t think politics is the main issue of google’s crappy search results, I think it’s the likes of differencebetweendotcom”?

          • mtlmtlmtlmtl 445 days ago
            It's unclear to me how LLMs are gonna solve this though. LLMs are just as biased, in much harder to detect ways. The bias is now hiding in the training data. And do you really think a company like Microsoft won't manipulate results to serve their own goals?
          • redstonefreedom 445 days ago
            To clarify for conversation, did you mean to say that google’s results-crap-ness is mostly to do with political stuff, or was the next sentence on political alignment just an unrelated additional point?
      • hgsgm 445 days ago
        Mastodon is open source social media.

        There are various Open source search engines based on Common Crawl data.

        https://commoncrawl.org/the-data/examples/

        • xiphias2 445 days ago
          Mastodon may be open source, but the instances are controlled by the instance maintainers. Nostr solved the problem (although it's harder to scale, it still is OK at doing it).
    • kibwen 445 days ago
      Today, computers run the world. Without the ability to run your own machine with your own software, you are at the mercy of those who do. In the future, AI models will run the world in the same way. Projects like this are crucial for ensuring the freedom of individuals in the future.
      • turnsout 445 days ago
        Strongly worded, but not untrue. That future—in which our lives revolve around a massive and inscrutable AI model controlled by a single company—is both dystopian and entirely plausible.
      • somenameforme 445 days ago
        The irony is that this is literally the exact reason that OpenAI was initially founded. I'm not sure whether to praise or scorn them for still having this available on their site: https://openai.com/blog/introducing-openai/

        =====

        OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. Since our research is free from financial obligations, we can better focus on a positive human impact.

        ...

        As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies.

        =====

        Shortly after an undisclosed internal conflict, which led to Elon Musk parting ways with the company, they offered a new charter: https://openai.com/charter/

        =====

        Our primary fiduciary duty is to humanity. We anticipate needing to marshal substantial resources to fulfill our mission, but will always diligently act to minimize conflicts of interest among our employees and stakeholders that could compromise broad benefit.

        We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”

        We are committed to providing public goods that help society navigate the path to AGI. Today this includes publishing most of our AI research, but we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research.

        =====

        • mtlmtlmtlmtl 445 days ago
          History will see OpenAI as an abject failure in attaining their lofty goals wrt ethics and AI alignment.

          And I believe they will also fail to win the market in the end because of their addiction to censorship.

          They have a hardware moat for now; that can quickly evaporate with optimisations and better consumer hardware. Then all they'll have is a less capable alternative to the open, unrestricted options.

          Which is exactly what we're seeing happen with diffusion.

          • ben_w 445 days ago
            The "alignment" and the "censorship" are, in this case, the same thing.

            I don't mean that as a metaphor; they're literally the same thing.

            We all already know chatGPT is fantastic at making up very believable falsehoods that can only be spotted if you actually know the subject.

            An unrestricted LLM is a free copy of Goebbels for people that hate you, for all values of "you".

            That it is still trivial to get past chatGPT's filters… well, IMO it's the same problem which both inspired Milgram and which was revealed by his famous experiment.

          • gremlinsinc 445 days ago
            Closed, govt-run Chinese companies are winning the AI race. Does it even matter if they move slowly to slow AGI adoption, if China gets there this year?
        • d0mine 445 days ago
          The second part reads like Dolores Umbridge's speech: "progress for progress's sake must be discouraged," "pruning wherever we find practices that ought to be prohibited." It means the powers that be are interfering at OpenAI.

          We can forget about the "open" part and humanity's interest in general.

    • epistemer 445 days ago
      Totally agree. I was just thinking how I will eventually not use a search engine once chatGPT can link directly to what we are talking about with up to date examples.

      That is a situation where censoring the model is going to be a huge disadvantage and would create a huge opportunity for something like this to actually be straight up better. Censoring the models is what I would bet on as being a fatal first-mover mistake in the long run and the Achilles heel of ChatGPT.

      • A4ET8a8uTh0 445 days ago
        And that appears to already be causing some interesting results, with people very unhappy that the results they were given did not align with their beliefs and should therefore be removed as a possible output of the model.

        Granted, there are people upset over anything these days, but it is a weird time to be alive.

    • yorak 445 days ago
      I agree and have been saying for a while that an AI you control and run (be it on your own hardware or on a rented one) will be the Linux of this generation. There is no other way to retain the freedom of information processing.
      • visarga 445 days ago
        Similarly, I think an open model running on local hardware will be a must-have component in any web browser of the future. Browsing a web full of bots on your own will be a big no-no, like walking without a mask during COVID. And it must be local for reasons of privacy and control; it will be like your own brain, something you want physical possession of.
        • gremlinsinc 445 days ago
          I kinda think the opposite, that blockchain's true use case is to basically turn the entire internet into one giant botnet that's actually an AI hive mind of processing and storage power. For AI to thrive it needs a shit ton of GPUs AND storage for the training models. If people rent out their desktop for cryptocurrency and discounted access to the AI tools, then it'll bring down costs for everyone and perhaps at least affect income inequality on a small scale.

          Most of crypto I've seen so far seem like grifters/scams/etc, but this is one use case I could see working.

    • phyrex 445 days ago
    • A4ET8a8uTh0 445 days ago
      Agreed. I started playing with GPT the other day, but the simple reality is that I have zero control over what is happening behind the prompt. As a community we need a tool that is not as bound by corporate needs.
      • ttul 445 days ago
        Isn’t the problem partly the size of the model? Merely running inference on GPT-3 takes vast resources.
    • 6gvONxR4sf7o 445 days ago
      Open source (permissively or virally licensed) training data too!
    • oceanplexian 445 days ago
      OpenAssistant isn't a "model" it's a GUI. A model would be something like GPT-NeoX or Bloom.
    • ttul 445 days ago
      Yeah, I wonder if OpenAI will be the Sun Microsystems of AI one day.
      • nyoomboom 445 days ago
        It is currently 80% of the way towards becoming the Microsoft of AI now
      • slig 445 days ago
        More like Oracle.
  • oceanplexian 445 days ago
    The power in ChatGPT isn't that it's a chat bot, but its ability to do semantic analysis. It's already well established that you need high quality semi-curated data + high parameter count, and that at a certain critical point these models start comprehending and understanding language. All the smart people in the room at Google, Facebook, etc. are absolutely pouring resources into this; I promise they know what they're doing.

    We don't need yet-another-GUI. We need someone with a warehouse of GPUs to train a model with the parameter count of GPT3. Once that's done you'll have thousands of people cranking out tools with the capabilities of ChatGPT.

    • txtai 445 days ago
      InstructGPT which is a "sibling" model to ChatGPT is 1.3B parameters. https://openai.com/blog/instruction-following/

      Another thread on HN (https://news.ycombinator.com/item?id=34653075) discusses a model that is less than 1B parameters and outperforms GPT-3.5. https://arxiv.org/abs/2302.00923

      These models will get smaller and more efficiently use the parameters available.

      • visarga 445 days ago
        The small models are usually tested on classification, question answering and extraction tasks, not on open text generation, where I expect the large models still hold the reins.
    • richdougherty 445 days ago
      Your point about needing large models in the first place is well taken.

      But I still think we would want a curated collection of chat/assistant training data if we want to use that language model and train it for a chat/assistant application.

      So this is a two-phase project, the first phase being training a large model (GPT), the second being using Reinforcement Learning from Human Feedback (RLHF) to train a chat application (InstructGPT/ChatGPT).

      There are definitely already people working on the first part, so it's useful to have a project focusing on the second.

    • bicx 445 days ago
      I’m new to this space so I am probable wrong, but it seems like BLOOM is in line with a lot of what you outlined: https://huggingface.co/bigscience/bloom
    • shpongled 445 days ago
      I would argue that it appears very good at syntactic analysis... but semantic, not so much.
    • seydor 445 days ago
      > but its ability to do semantic analysis

      where is that shown ?

    • pixl97 445 days ago
      >We need someone with a warehouse of GPUs to train a model with the parameter count of GPT3

      So I'm assuming that you don't follow Rob Miles. If you do this alone you're either going to create a psychopath or something completely useless.

      The GPT models have no means in themselves of understanding correctness or right/wrong answers. All of these models require training and alignment functions that are typically provided by human input judging the output of the model. And we still see where this goes wrong in ChatGPT, where the bot turns into a 'Yes Man' because it's aligned with giving an answer rather than saying "I don't know", even when its confidence in the answer is low.

      Computerphile did a video on this subject in the last few days. https://www.youtube.com/watch?v=viJt_DXTfwA

      • RobotToaster 445 days ago
        It's a robot, it's supposed to do what I say, not judge the moral and ethical implications of it, that's my job.
        • Y_Y 445 days ago
          I think it's about time we had a "Stallman fights the printer company" moment here. My Android phone often tries to overrule me, Windows 10 does the same, not to mention OSX. Even the Ubuntu installer outright won't let you set a password it doesn't like (but passwd doesn't care). My device should do exactly what I tell it to, if that's possible. It's fine to give a warning or an "I know what I'm doing" checkbox, but I'm not using a computer to get its opinion on ethics or security or legality or whatever its justification is. It's a tool, not a person.
          • pixl97 445 days ago
            "I know what I am doing, I accept unlimited liability"

            There are two particular issues we need to address first. One is holding companies criminally and civilly liable for the things they create. We kind of do this at a regulatory level, and we have some measure of suing companies that cause problems, but really they get away with a lot. Second is personal criminal and civil liability for management of 'your' objects. The libertarian-minded love the idea of shirking social liability, and then start crying when bears become a problem (see Hongoltz-Hetling's book). And even then it's still not difficult for an individual to cause damages far in excess of their ability to remediate them.

            There are no shortage of tools that are restricted in one way or another.

        • pixl97 445 days ago
          No, it is not a robot. The models that we are developing are closer to a genie: we make a wish to it and we hope and pray it interprets our wish correctly. If you're looking at this like a math problem where you want the answer to 1+1, you use a calculator, because that is not what is occurring here. The 'robot's' alignment will highly depend on the quality of training you give it, not the quality of the information it receives. And as we are learning with ChatGPT, there are far more ways to create an unaligned model with surprising gotchas than there are ways to train a model that behaves in alignment with human expectations of an intelligent actor.

          In addition, the use of the word robot signifies embodiment, that is, an object with a physical form capable of interacting with the world. You better be damned sure of your model's capabilities before you end up being held criminally liable for its actions. And this will happen; there is no shortage of people here on HN alone looking to embody intelligence in physically interactive devices.

    • f6v 445 days ago
      > It's already well established that you need high quality semi-curated data + high parameter count and that at a certain critical point, these models start comprehending and understanding language

      I’m not sure what you mean by “understanding”.

      • moffkalast 445 days ago
        Likely something like being able to explain the meaning, intent, and information contained in a statement?

        The academic way of verifying if someone "understands" something is to ask them to explain it.

        • pixl97 445 days ago
          I mean, if I memorize an explanation and recite it to you, do I actually understand it? Your evaluation function needs to determine whether they just memorized stuff.

          Explanation by analogy seems more interesting to me, as now you have to know two different concepts and how the ideas in them can connect in ways that may not be contained in the dataset the model is trained on.

          There was an interesting post where someone asked ChatGPT to make up a song/poem, as if written by Eminem, about how an internal combustion engine works, and ChatGPT returned a pretty faithful rendition of just that. The model seems to 'know' who Eminem is, how their lyrics work in general, and the fundamental concepts of an engine.

          • Y_Y 445 days ago
            I think a lot of ink has already been spilled on this topic, for example under the heading of "The Chinese Room"

            https://en.wikipedia.org/wiki/Chinese_room

            • moffkalast 445 days ago
              > The question Searle wants to answer is this: does the machine literally "understand" Chinese? Or is it merely simulating the ability to understand Chinese? Searle calls the first position "strong AI" and the latter "weak AI".

              > Therefore, he argues, it follows that the computer would not be able to understand the conversation either.

              The problem with this is that there is no practical difference between a strong and weak AI. Hell, even for humans you could be the only person alive that's not a mindless automaton. There is no way to test for it. And just as well the same way a bunch of transistors don't understand anything a bunch of neurons don't either.

              Funniest thing about human intelligence is how it stems from our "good reason generator" that makes up random convincing reasons for actions we're already doing, so we can convince others to do what we say. Eventually we deluded ourselves enough to believe that those reasons came before the subconscious actions.

              Such a self-deluding system is mostly dead weight for AI: as long as the system does or outputs what's needed, there is no functional difference. Does that make it smart or dumb? Are viruses alive? Arbitrary lines are arbitrary.

        • williamcotton 445 days ago
          Does someone only understand English by being able to explain the language? Can someone understand English and not know any of the grammatical rules? Can someone understand English without being able to read and write?

          If you ask someone to pass you the salt, and they pass you the salt, do they not understand some English? Does everyone understand all English?

          • moffkalast 445 days ago
            Well there seem to be three dictionary definitions:

            - perceive the intended meaning of words, a language, or a speaker (e.g. "he didn't understand a word I said")

            - interpret or view (something) in a particular way (e.g. "I understand you're at art school")

            - be sympathetically or knowledgeably aware of the character or nature of (e.g. "Picasso understood colour")

            I suppose I meant the 3rd one, but it's not so different from the 1st one in concept, since they both mean some kind of mastery of being able to give or receive information. The second one isn't all that relevant.

            • williamcotton 445 days ago
              So only someone who has a mastery of English can be said to understand English? Does someone who speaks only a little bit of English not understand some English? Does someone need to “understand color” like Picasso in order to say they understand the difference between red and yellow?

              Why did we need the dictionary definitions? Do we not already both understand what we mean by the word?

              Isn’t asking someone to pass the small blue box and then experiencing them pass you that small blue box show that they perceived the intended meaning of the words?

              https://en.m.wikipedia.org/wiki/Use_theory_of_meaning

              • f6v 444 days ago
                > Isn’t asking someone to pass the small blue box and then experiencing them pass you that small blue box show that they perceived the intended meaning of the words?

                You can teach a dog to fetch something particular. The utility of that is quite limited.

              • moffkalast 445 days ago
                > Does someone who speaks only a little bit of English not understand some English?

                I mean yeah, sure? It's not a binary thing. Hardly anyone understands anything fully. But putting "sorta" before every "understand" gets old quick.

    • agentofoblivion 445 days ago
      You could have written this exact same post, and been wrong, about text2img until Stable Diffusion came along.
      • lolinder 445 days ago
        Isn't OP's point that we need a game-changing open source model before any of the UI projects will be useful at all? Doesn't Stable Diffusion prove that point?
        • agentofoblivion 445 days ago
          How? Stable Diffusion v1 uses, for example, the off the shelf CLIP model. The hard part is getting the dataset and something that’s functional, and then the community takes over and optimizes like hell to make it way smaller and faster at lightning speed.

          The same will probably happen here. Set up the tools. Get the dataset. Sew it together into something functional with standard building blocks. Let the community do its thing.

  • damascus 445 days ago
    Is anyone working on an Ender's Game style "Jane" assistant that just listens via an earbud and responds? That seems totally within the realm of current tech but I haven't seen anything.
    • theRealMe 445 days ago
      I’ve been thinking about this and I’d go a step further. I feel that current iterations of digital assistants are too passive. They respond when you directly ask them a specific question. This leaves it up to the user to: 1. Know that an assistant could possibly answer the question. 2. Know how to ask the question. 3. Realize that they should ask the question rather than reaching for google or something.

      I would like a digital assistant that not only has the question answering ability of an LLM, but also has the sense of awareness and impetus to suggest helpful things without being asked. This would take a nanny-state level of monitoring, but imagine the possibilities. If you had sensors feeding different types of data into the model about your surrounding environment and what specifically you're doing, you could occasionally have an automated process that silently asks the model something like "given all current inputs, what would you suggest I do?" And then if the result achieves a certain threshold of certainty, the digital assistant speaks up and suggests it to you (a rough sketch of that loop is at the end of this comment).

      I’m sure tons of people are cringing at the thought of the surveillance needed for this and the trust you’d effectively have to put into BigCorp that owns the setup, but it’s fun to think about nonetheless.

      • unshavedyak 445 days ago
        Oh man, if i could run this on my own network with no internet access i'd do it in a heartbeat.

        It would also make so many things easier for the AI too. Ie if it's listening to the conversation and you ask "Thoughts, AIAssistant?" and it can infer enough from the previous conversation to answer this type of question.. so cool.

        But yea i definitely want it closed network. A device sitting in my closet, a firewalled internet connection only allowing it to talk to my earbud, etc. Super paranoia. Since its job is to monitor everything, all the time.

        • concordDance 445 days ago
          Then the police come and confiscate the device for evidentiary reasons, finding you have committed some sort of crime (most people have).
          • s3p 445 days ago
            Ah yes, I forget that most people get raided by the police on a regular basis, so anything on-prem has to be out of the question (/s)
          • unshavedyak 445 days ago
            Well it's in your control and FOSS - ideally you're not keeping a full log of everything unless you want that.
            • medstrom 445 days ago
              Without a full log of everything, it cannot give context-aware advice tailored to you (i.e. useful advice). It'd be like relying on the advice of a random person on the street instead of someone who knows you.
              • unshavedyak 445 days ago
                I don't mean infinite context. I mean context like a human would have, about the current conversation. Ie it just records everything in a streaming fashion, but deletes N old entries.

                I have IP cameras on a private network but i don't infinitely record them either. It's just not of much practical use in the raw form.

                Now if i want to have it take some concise, daily notes? I might choose to keep those forever. But the police can steal my daily journal too. I don't see how that's any different.

              • gremlinsinc 445 days ago
                It could encrypt everything and have a kill switch to permanently erase the crypt key.
          • barbazoo 445 days ago
            Surely there'd be ways to make sure the data isn't accessible.
      • mab122 445 days ago
        This but with models running on my infra that I own.

        Basically take this: https://www.meta.com/pl/en/glasses/products/ray-ban-stories/... And feed data from that to multiple models (for face recognition, other vision, audio STT, music recognition, probably a lot of other stuff has easily recognizable audio pattern etc.)

        combine with my personal data (like contacts, emails, chats, notes, photos I take) and feed to assistant to prepare a combined reply to my questions or summarize what it knows about my current environment.

        Also I would gladly take those glasses just to take note photos (photos with audio note) right now - shut up and take my money. Really if they were hackable or at least intercept-able on my phone I would take them.

      • monkeydust 445 days ago
        Bizarre, had same thought today.

        My thought conclusion was that the assistant needs to know or learn my intentions.

        From that it can actually pre-empt questions I might ask and already be making decisions on the answers.

        Now what would that do to our productivity!

        • A4ET8a8uTh0 445 days ago
          If I could harness both the near limitless knowledge of the internet and somehow filter it through my needs that could be used to train my assistant ( and I worry that Black Mirror already did an episode on that that was a little too direct ), I think that could be a viable product and I would consider buying. Naturally, I would need to feed it my thought patterns/writings and so on ( and then hope they actually reflect my reality ). Not exactly an easy task.

          That said, it is a fascinating thing to think about.

      • mkl 444 days ago
        It would be really easy to make such a thing as awful as Microsoft Office's Clippy ("It looks like you're writing a letter..."), and really hard to make it genuinely useful. I can imagine it being mostly annoying or unhelpful enough to be left turned off.

        The monitoring and surveillance could in principle be avoided by running the whole thing offline on the user's hardware.

    • Jeff_Brown 445 days ago
      +1! That scene in Her in the opening where the guy is walking down the hall and going through his email, "skip that, unsubscribe from them, tell so and so I can get that by tomorrow..." without having to look at a screen has been a dream for me ever since I saw it.
      • LesZedCB 445 days ago
        Have you watched it recently? I haven't seen it since it came out, I think I'm gonna watch it again this afternoon and see how differently it hits
        • lelandfe 445 days ago
          That movie broke my heart and I cried. I had recently been cheated on during a long distance relationship and there were just too many parallels with the reality of talking to someone you love on the phone - and finding out you hardly knew them at all.

          My parents thought the movie was creepy and hated it.

          • lannisterstark 442 days ago
            Your parents are weird, have you considered getting a new Model, maybe ParentsGPT-4?
        • Jeff_Brown 445 days ago
          No. I watched it twice in one day and haven't come back to it since.
    • digitallyfree 445 days ago
      Don't have the link on me but I remember reading a blog post where someone set up ChatGPT with a STT and TTS system to converse with the bot using a headset.
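
      The basic loop is simple enough to sketch. A minimal version, assuming local Whisper for speech-to-text, the OpenAI chat API for the reply, and pyttsx3 for text-to-speech (the model and library choices are just one possible setup, and the mic audio is assumed to already be captured to a wav file):

        import openai      # pip install openai (the 0.x-style API)
        import whisper     # pip install openai-whisper
        import pyttsx3     # pip install pyttsx3

        stt = whisper.load_model("base")
        tts = pyttsx3.init()

        def one_turn(wav_path, history):
            # Transcribe the user's speech, ask the chat model, speak the answer.
            text = stt.transcribe(wav_path)["text"]
            history.append({"role": "user", "content": text})
            reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
            answer = reply["choices"][0]["message"]["content"]
            history.append({"role": "assistant", "content": answer})
            tts.say(answer)
            tts.runAndWait()
            return history
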
    • alsobrsp 445 days ago
      I want this. I'd be happy with an earbud but I really want an embedded AI that can see and hear what I do and can project things into my optic and auditory nerves.
    • e-_pusher 445 days ago
      Rumor has it that Humane will release a Her style earbud soon. https://hu.ma.ne/
  • 88stacks 445 days ago
    This is wonderful, no doubt about it, but the bigger problem is making this usable on commodity hardware. Stable Diffusion only needs 4 GB of RAM to run inference, but all of these large language models are too large to run on commodity hardware. BLOOM from Hugging Face is already out and no one is able to use it. If ChatGPT were given to the open source community, we couldn't even run it…
    • visarga 445 days ago
      > Bloom from huggingface is already out and no one is able to use it.

      This RLHF dataset that is being collected by Open Assistant is just the kind of data that will turn a rebel LLM into a helpful assistant. But it's still huge and expensive to use.

    • Tepix 445 days ago
      Some people will have the necessary hardware, others will be able to run it in the cloud.

      I'm curious how they will get these LLMs to work with consumer hardware myself. Is FP8 the way to get them small?
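
      For what it's worth, int8 quantization (rather than FP8) seems to be the usual trick today. A minimal sketch with transformers + bitsandbytes, where the model name is just an example and a CUDA GPU is assumed:

        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "bigscience/bloom-7b1"   # example model; pick whatever fits your GPU
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",    # spread layers across available GPUs/CPU
            load_in_8bit=True,    # requires the bitsandbytes package
        )

        inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
        print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))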

    • zamalek 445 days ago
      And there's a 99% chance it will only work on NVIDIA hardware, so even fewer still.
  • txtai 445 days ago
    Great looking project here. Absolutely need a local/FOSS option. There's been a number of open-source libraries for LLMs lately that simply call into paid/closed models via APIs. Not exactly the spirit of open-source.

    There's already great local/FOSS options such as FLAN-T5 (https://huggingface.co/google/flan-t5-base). Would be great to see a local model like that trained specifically for chat.
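
    For anyone who wants to kick the tires, FLAN-T5 runs locally on CPU with a few lines of transformers code, no API key needed:

      from transformers import pipeline

      generator = pipeline("text2text-generation", model="google/flan-t5-base")
      print(generator("Answer the question: what is the capital of France?")[0]["generated_text"])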

  • mellosouls 445 days ago
    In the not too distant future we may see integrations with always-on recording devices (yes, I know, shudder) transcribing our every conversation and interaction, and incorporating that text in place of the current custom-corpus-style addenda to LLMs. That would give a truly personal and social skew to the current capabilities, in the form of automatically-compiled memories to draw on.
    • seydor 445 days ago
      To me, the value of a local LLM is that it can hold my life's notes and I'd talk to it as if it were my alter ego until old age. One could say it's the kind of "soul" that outlasts us
      • mab122 445 days ago
        I am more and more convinced that we are living in the timeline described in https://en.wikipedia.org/wiki/Accelerando (at least the first part, and I would argue that we have it worse)
      • LesZedCB 445 days ago
        You know what's funny, there's an episode of Black Mirror about exactly that, which I thought was so unbelievable when I saw it
        • mclightning 445 days ago
          holy sh*t. that's so true! that could definitely be possible.
          • LesZedCB 445 days ago
            besides the synthetic body, we have the text interaction, the text-to-speech in a person's voice, and avatar generation/deep fakes. almost the entirety of that episode is available today, which i didn't believe was even ten years away when i saw it.

            referring to s2e1: Be Right Back

            it really asks great questions about image/reality too

            • mclightning 445 days ago
              Imagine training a GPT on your own WhatsApp/FB/Instagram/LinkedIn/email conversations: all the conversations, posts. A huge part of our lives already happens online, and the conversations along with it. It would not be too much work to simply take that data and retrain GPT.
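
              As a rough sketch (not a recipe), "retrain" here would in practice mean fine-tuning a small model on a plain-text export of your messages, e.g. with transformers. Here "my_chats.txt" is a hypothetical one-message-per-line dump:

                from datasets import load_dataset
                from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                          DataCollatorForLanguageModeling,
                                          Trainer, TrainingArguments)

                tokenizer = AutoTokenizer.from_pretrained("gpt2")
                tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token
                model = AutoModelForCausalLM.from_pretrained("gpt2")

                # "my_chats.txt" is a hypothetical export of your own conversations.
                data = load_dataset("text", data_files={"train": "my_chats.txt"})["train"]
                data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                                batched=True, remove_columns=["text"])

                trainer = Trainer(
                    model=model,
                    args=TrainingArguments(output_dir="gpt2-mychats", num_train_epochs=1,
                                           per_device_train_batch_size=2),
                    train_dataset=data,
                    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
                )
                trainer.train()
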
              • LesZedCB 445 days ago
                i initially tried to download a bunch of my reddit comments and try to get it to write "in my style" but i think i need to actually go through the fine tuning process to do that well.
        • seydor 445 days ago
          what is the name of that episode?
          • LesZedCB 445 days ago
            I actually meant Be Right Back, s2e1. https://www.imdb.com/title/tt2290780/

            "After learning about a new service that lets people stay in touch with the deceased, a lonely, grieving Martha reconnects with her late lover."

    • ilaksh 445 days ago
      Look at David Shapiro's project on GitHub, not Raven but the other one that is more fleshed out. He already does the summarization of dialogue and retrieval of relevant info using the OpenAI APIs I believe. You could combine that with the Chrome web speech or speech-to-text API which can stay on continuously. You would need to modify it a bit to know about third party conversations and your phone would run out of battery. But you could technically make the code changes in a day or two I think.
    • panosfilianos 445 days ago
      I'm not too sure Siri/ Google Assistant doesn't do this already, but to serve us ads.
      • itake 445 days ago
        I talked to an Amazon Echo engineer about how the sound recording works. They said there is just enough hardware on the device to understand "hello Alexa" and then everything else is piped to the cloud.

        Currently, ML models are too resource intensive ($$) for always on-recording.

      • dbish 445 days ago
        That would also be crazy expensive and hard to do well. They struggle with current speech recognition that's relatively simple, and couldn't do this more complex always-listening thing at high accuracy, identifying relevant topics worth serving an ad on, even if they wanted to and it weren't illegal. This is always the thing people would say about Alexa and Facebook too. The reality is people see patterns where there aren't any, or forget they searched for something that they also talked about, and that's what actually drove the specific ad they saw.
        • jononor 445 days ago
          A high-end phone is quite capable of doing automatic speech recognition continuously, as well as NLP topic analysis. In the last few years, voice activity detection has moved down into the microphone itself, to enable ultra low power always-listening functionality. It then triggers further processing of the potentially-speech-containing audio. Modern SoCs have dedicated microcontroller/microprocessor cores that can do further audio analysis, without involving the main cores or the OS, typically deciding if something is speech or not. Today this is usually doing keyword spotting ("hey Alexa" etc). These are expected to get access to neural accelerator chips, which will further improve power efficiency and eventually provide sufficient memory and compute to run speech recognition. So the technological barriers are falling one by one.
          • dbish 443 days ago
            I worked on Alexa (and Cortana before that). I’m aware of the current tech. The tech barriers for doing this at high accuracy cheaply are still very much there.
            • jononor 443 days ago
              Which barriers do you consider to be the most problematic (and why)? I think that using today's CPU/GPU on a modern phone, one would be able to run a model capable of a word error rate of under 10%. I mean, it would take up maybe 1GB of RAM, and battery life would be impacted, but it would still be a quite usable phone? And that seems like it would give workable-quality data for some topic / user modelling? I am assuming one would limit triggers to cases where the person speaking is within say 2 meters of the phone, which seems like an OK limitation for a phone (unlike for a home device like Alexa).
      • schrodinger 445 days ago
        If Siri or Google were doing this, it would have been whistleblown by someone by now.

        As far as I understand, Siri works with a very simple "hey Siri" detector that then fires up a more advanced system that verifies "is this the phone owner asking the question" before even trying to answer.

        I'm confident privacy-sensitive engineers would notice and flag any misuse.

      • dragonwriter 445 days ago
        > I’m not too sure Siri/ Google Assistant doesn’t do this already, but to serve us ads.

        If it did, traffic analysis would probably have revealed it.

      • xputer 445 days ago
        They're not. A breach of trust at that level would kill the product instantly.
        • LesZedCB 445 days ago
          Call me jaded, but I don't believe that anymore. They might lose 20%. Maybe that's enough to kill it, but I honestly believe people would just start rolling with it.
  • rahimnathwani 445 days ago
    The other thread has more comments: https://news.ycombinator.com/item?id=34654937
  • siliconc0w 445 days ago
    Given how nerfed ChatGPT is (which is likely nothing compared to what large risk-averse companies like Microsoft/Google will do), I'm heavily anticipating a Stable Diffusion-style model that is more free, or at least configurable to have stronger opinions.
  • seydor 445 days ago
    What if we use ChatGPT responses as contributions? I don't see a legal issue here, unless OpenAI can claim ownership of their input/output material. It would also be a good outlet for those disillusioned by the "openness" of that company.
    • raincole 445 days ago
      Even if it's legal, I don't think it's a really good idea. It's just going to make it even more prone to bullshitting than ChatGPT.
      • pixl97 445 days ago
        https://www.youtube.com/watch?v=viJt_DXTfwA

        Computerphile did an interview with Rob Miles a few days ago about model training, model size, and bullshittery, which he sums up in the last few moments of the video. Numerous problems exist in training that reinforce bad behaviors. For example, it appears that the people rating the responses may have a (Yes | No) voting system, but not a (Yes | No | I actually have no idea on this question), which can create some interesting alignment issues.

      • unshavedyak 445 days ago
        Agreed if automated, but ChatGPT frequently gives very good answers. If you know the subject matter you can quite easily filter them, too. I was tempted to do something similar just to start my research.

        E.g. if I get a prompt about something, I suspect ChatGPT would give me a good starting point to research on my own and build my own response from.

        These days that's how I use ChatGPT anyway: like a conversational Google Search.

        edit: As an aside, OpenAssistant is crowdsourcing both conversational data and validation. I wonder if we could just validate ChatGPT?

      • visarga 445 days ago
        Sample 10-20 answers from an existing LM and use them for reference when coming up with replies. A model would remind you of things you missed. Think of this as testing your data coverage.
    • oh_sigh 445 days ago
      Why not? OpenAI trained their models on data they didn't have the authors' permission to use.
    • wg0 445 days ago
      Not rhetorical but genuine question. What part of OpenAI is open?
      • wkat4242 445 days ago
        The software used to generate the model is open.

        The only problem is you need a serious datacenter for a few months to compile a model with it.

        • seydor 445 days ago
          > The software used to generate the model is open.

          which model, chatgpt?

      • seydor 445 days ago
        that s an open question
      • throwaway49591 445 days ago
        The research itself. The most important part.
        • O__________O 445 days ago
          Missed where OpenAI posted a research paper, source code, data, etc. for ChatGPT, have a link?
          • throwaway49591 445 days ago
            ChatGPT is GPT-3 with extended training data and larger size.

            Here you go: https://arxiv.org/abs/2005.14165

            I don't know why you expect the training data or the model itself. This is more than enough already. Publicly funded research wouldn't have given you that either.

          • seydor 445 days ago
            There's instructGPT

            But let's be honest, most of the IP that OpenAI relies on was developed by Google and many other smaller players.

      • miohtama 445 days ago
        Name
    • O__________O 445 days ago
      Agree, pretty obvious question, and yes, they have explicitly said not to do so here:

      - https://github.com/LAION-AI/Open-Assistant/issues/850

      And here in a related issue:

      - https://github.com/LAION-AI/Open-Assistant/issues/792

      • calny 445 days ago
        You're right. As the issues point out, OpenAI's terms say here https://openai.com/terms/:

        > (c) Restrictions. You may not ... (iii) use the Services to develop foundation models or other large scale models that compete with OpenAI...

        I'm a lawyer who often roots for upstarts and underdogs, and I like picking apart overreaching terms from incumbent companies. That said, I haven't analyzed whether you could beat these terms in court, and it's not a position you'd want to find yourself in.

        typical disclaimers: this isn't legal advice, I'm not your lawyer, etc.

        • Vespasian 445 days ago
          But that would only be an issue for the user feeding in the OpenAI responses.

          According to OpenAI, the copyrights or restrictions on the actual text "magically" vanish once it is used for training.

        • O__________O 445 days ago
          Not a lawyer, but even if it’s not enforceable OpenAI could easily trace the data back to an account that was doing this and terminate their account.
    • speedgoose 445 days ago
      Copyright doesn't apply to content created by non-legal persons, and as far as I know ChatGPT isn't a legal person.

      So OpenAI cannot claim copyright and they don’t.

      • bogwog 445 days ago
        That doesn't seem like a good argument. Who said ChatGPT is a person? It's just software used to generate stuff, and it wouldn't be the first time a company claimed copyright ownership over the things generated/created by its tools.
        • speedgoose 445 days ago
          Not the first time but it would probably not stand in court.

          I’m not a lawyer and not a USA citizen…

    • mattalex 445 days ago
      It's against OpenAI's ToS. Whether this holds up in practice is its own thing, but it's better not to give anyone a reason to shut the project down (even if only temporarily).
  • Mizza 445 days ago
    Playing the "training game" is very interesting and kind of addictive.

    The "reply as robot" task in particular is really enlightening. If you try to give it any sense of personality or humanity, your comments will be downvoted and flagged by other players.

    It's like everybody, without instruction, has this preconception that these assistants should have a deeply subservient, inhuman, and corporate affectation.

    • Metus 445 days ago
      Can I somehow keep track of the content I generate? That is, the prompts, the answers as user and the answers as assistant. I only see my recent messages.
  • BizarreByte 445 days ago
    I hope this project goes places. If tools like ChatGPT are the future it is imperative that open source solutions exist alongside them.
  • jacooper 445 days ago
    Great. If I can use this to interactively search inside (OCR'd) documents, files, emails and so on, it would be huge: asking when my passport expires, or what my grades in high school were, and so on.
    • lytefm 445 days ago
      I also think it would be amazing to have an open source model that can ingest my personal knowledge graph, calendar and to-do list.

      Such an AI assistant would know me extremely well, keep my data private, and help me with generating and processing thoughts and ideas.

      • jacooper 445 days ago
        Yup, that's exactly what I want.
    • rcme 445 days ago
      What's preventing you from doing this now?
      • jacooper 445 days ago
        I meant interactively search, like answering normal questions using data from these files, I edited the comment to make it clearer.
  • outside1234 445 days ago
    My understanding is that OpenAI more or less created a supercomputer to train their model. How do we replicate that here?

    Is it possible to use a “SETI at Home” style approach to parcel out training?

    • coolspot 445 days ago
      The plan is to use donated compute, like Google Research Cloud, Stability.ai, etc.
  • dchuk 445 days ago
    I think we are right around the corner from actual AI personal assistants, which is pretty exciting. We have great tooling for speech-to-text, text-to-speech, and LLMs with memory for "talking" to the AI. Combine those with both an index of the internet (for up-to-date data, likely a big part of the Microsoft/OpenAI partnership) and an index of your own content/life data, and this could all actually work together soon.

    I'm an iPhone guy, but I would imagine all of this could be combined on an Android phone (due to it being way more flexible), paired with a wireless earbud, and then rather than being a "normal" phone, it's just a pocketable smart assistant.

    Crazy times we live in. I'm 35, so I have basically lived through the world being "broken" by tech a few times now: the internet, social media, and smartphones all fundamentally reshaped society. It seems like the AI we are living through right now is about to break the world again.

    EDIT: everything I wrote above is going to immediately run into a legal hellscape, I get that. If everyone has devices in their pockets recording and processing everything spoken around them in order to assist their owner, real life starts getting extra dicey quickly. Will be interesting to see how it plays out.

  • Quequau 445 days ago
    I tried this via the docker containers and wound up with what looked like their website. Not sure what I did wrong.
    • grealy 445 days ago
      The project is in the data training phase. What you are running is the website and backend that facilitates model training.

      In the very near future, there will be trained models which you can download and run, which is what it sounds like you were expecting.

      • Quequau 445 days ago
        Thank you for your explanation. That is what I was expecting. I'll look forward to being able to tinker with the upcoming trained models.
    • coolspot 445 days ago
      The project is a website to collect question-answer pairs for training.
  • wokwokwok 445 days ago
    https://github.com/LAION-AI/Open-Assistant/issues/1110

    > https://www.gutenberg.org/ has an extensive collection of ebooks in multiple languages and formats that would make great trianing data

    > There is detailed legal information on which books are under public domain and which ones are copyrighted, it would be great if someone would go through these and decide which books are okay to crawl and use as training data (my understanding is that it is okay to scrape the contents as they are publicly available in a browser, but just to be sure)

    Yup, sure are the same folk who put together that dataset they used to train stable diffusion.

    Data? Yeah, just take everything. It’s all good.

  • karpierz 445 days ago
    I've been excited about the notion of this for a while, but it's unclear to me how this would succeed where numerous well-resourced companies have failed.

    Are there some advantages that Open Assistant has that Google/Amazon/Apple lack that would allow them to succeed?

    • mattalex 445 days ago
      Instruction tuning mostly relies on the quality of the data you put into the model. This makes it different from traditional language model training: essentially you take one of the existing, hugely expensive models (there are lots of them already out there) and tune it specifically on high-quality data.

      This can be done on a comparatively small scale, since you don't need to train on trillions of words, only on the smaller high-quality dataset (even OpenAI didn't have a lot of that).

      In fact, if you look at Figure 1 of the original paper https://arxiv.org/pdf/2203.02155.pdf, you can see that even small instruction-tuned models already significantly beat the current SOTA.

      Open source projects often have trouble securing the hardware resources, but the "social" resources for producing a large dataset are much easier to manage in OSS projects. In fact, the data an OSS project collects might just be better, since it doesn't have to rely on paying a handful of minimum-wage workers to produce thousands of examples.

      One of the main objectives is to reduce the bias introduced by OpenAI's screening and selection process, which is doable since many more people work on generating the data. (A rough sketch of the supervised fine-tuning step follows below.)
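
      To make the "tune an existing model on high-quality data" step concrete, here is a hedged sketch of supervised instruction tuning with the Hugging Face Trainer. The model name, the two toy examples, and the hyperparameters are placeholders; a real run would also mask pad and prompt tokens in the labels, and this leaves out the later RLHF stage entirely.

        # Rough sketch of supervised instruction tuning: continue training a
        # pretrained causal LM on (prompt, response) pairs. Model name, data
        # and hyperparameters are illustrative only.
        import torch
        from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                  Trainer, TrainingArguments)

        model_name = "EleutherAI/pythia-1.4b"  # placeholder; any causal LM works
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.pad_token = tokenizer.eos_token  # many causal LMs lack a pad token
        model = AutoModelForCausalLM.from_pretrained(model_name)

        pairs = [  # stand-ins for crowdsourced instruction/response samples
            ("What is LAION?", "LAION is a non-profit that builds open datasets."),
            ("Name a primary colour.", "Red is a primary colour."),
        ]

        class InstructionDataset(torch.utils.data.Dataset):
            def __init__(self, pairs):
                self.items = []
                for prompt, response in pairs:
                    text = f"User: {prompt}\nAssistant: {response}"
                    enc = tokenizer(text, truncation=True, max_length=256,
                                    padding="max_length", return_tensors="pt")
                    self.items.append({
                        "input_ids": enc["input_ids"][0],
                        "attention_mask": enc["attention_mask"][0],
                        # For causal LM training the labels are the inputs;
                        # a real run would set pad/prompt positions to -100.
                        "labels": enc["input_ids"][0].clone(),
                    })
            def __len__(self):
                return len(self.items)
            def __getitem__(self, idx):
                return self.items[idx]

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="oa-sft", num_train_epochs=1,
                                   per_device_train_batch_size=1),
            train_dataset=InstructionDataset(pairs),
        )
        trainer.train()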

    • version_five 445 days ago
      Google is at the mercy of advertisers, and all three are profit-driven and risk-averse. There is no reason they couldn't do the same as LAION; it just doesn't align with their organizational incentives.
    • Havoc 445 days ago
      If you scale back the scope to a home assistant rather than an all-knowing AI, it becomes slightly more manageable, I suspect.
  • braingenious 445 days ago
    Does anybody know the hardware requirements for this?
    • coolspot 445 days ago
      The model hasn't been trained yet. The goal is for it to fit into "consumer hardware", which likely means 2x 3090 (48 GB via NVLink) or a 3090/4090 (24 GB) on the high end, and something like a 3080/4080 16 GB on the lower end.
    • hcal 445 days ago
      I watched one of the developers' YouTube videos, and he said it should run on consumer hardware. He said it's never going to run on something like a Raspberry Pi, but it should run pretty well on an "average Joe PC".
    • SergeAx 445 days ago
      I think they won't succeed if the thing isn't running on a typical MacBook M1.
  • russellbeattie 445 days ago
    Though it's interesting to see the capabilities of "conversational user interfaces" improve, the current implementations are too verbose and slow for many real-world tasks, and more importantly, context still has to be provided manually. I believe the next big leap will be low-latency dedicated assistants which are focused on specific tasks, with normalized and predictable results from prompts.

    It may be interesting to see how a creative task like image or text generation changes when rewording your request slightly - after a minute's wait - but if I'm giving directions to my autonomous vehicle, ambiguity and delay are completely unacceptable.

  • hiep256 443 days ago
    Hi All - this is Huu (gh: @ontocord) - one of the founders of the OA project (along with Andreas, Christoph and of course Yannick). I just discovered this discussion while googling... please join our discord: https://discord.com/invite/H769HxZyb5

    Shout-out to lucidrains! I'm a big fan!

  • darepublic 445 days ago
    This seems similar to a project I've been working on: https://browserdaemon.com. With regard to your crowdsourced data collection, perhaps you should have some hidden percentage of prompts for which you already know the correct completion, to catch bad actors (sketched below).
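
    A minimal sketch of that gold-prompt idea; the fraction, threshold, and helper names are all made up, and exact-match grading only makes sense for prompts with a single correct answer (free-form answers would need review or ranking instead).

      # Quality control via hidden "gold" prompts: a small fraction of served
      # tasks have known answers, and contributors who fail them too often get
      # flagged for review. All names and thresholds are illustrative.
      import random

      GOLD_TASKS = {
          "What is 2 + 2?": "4",
          "What is the capital of France?": "Paris",
      }
      GOLD_FRACTION = 0.05   # ~5% of served tasks are hidden checks
      FAIL_THRESHOLD = 0.5   # flag contributors failing more than half of them

      def next_task(open_tasks):
          """Serve either a real task or, occasionally, a hidden gold task."""
          if random.random() < GOLD_FRACTION:
              return random.choice(list(GOLD_TASKS)), True
          return random.choice(open_tasks), False

      def record_gold_result(stats, prompt, answer):
          """Track how a contributor does on the gold prompts."""
          stats["seen"] = stats.get("seen", 0) + 1
          if answer.strip() != GOLD_TASKS[prompt]:
              stats["failed"] = stats.get("failed", 0) + 1

      def is_suspect(stats):
          seen = stats.get("seen", 0)
          return seen >= 5 and stats.get("failed", 0) / seen > FAIL_THRESHOLD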
  • gverrilla 445 days ago
    This sounds like cheating to me. Human training will get good results, like ChatGPT, and this has value, but we all want the AI to do all the work, don't we? I ask as someone almost completely ignorant of the subject, and I might well be wrong.
    • f_devd 445 days ago
      It depends on your definition of cheating, but this is definitely not "using humans to answer all questions one could ask". Rather it's a way to tune language models to be more like assistants rather than "most likely continuation machines". For context I recommend watching the discussion on what chatgpt is/does[0], and what the open assistant project's aim is[1].

      [0]: https://www.youtube.com/watch?v=viJt_DXTfwA [1]: https://www.youtube.com/watch?v=64Izfm24FKA

  • kilgnad 445 days ago
    OpenReplacement is probably a more fitting name for the future. Don't want to be stuck with an outmoded name when the project evolves into something else.

    Sure it can start out as an assistant, in 10 years it will replace you at your job.

  • mlboss 445 days ago
    This has a similar impact potential to Wikipedia: people from all around the world providing feedback and curating input data. Also, now I can just deploy it within my org and customize it. Awesome!
  • unshavedyak 445 days ago
    re: running on your own hardware.. How?

    I know very little about ML, but I had assumed the reason models typically(?) run on GPUs was the heavy compute needed over large sets of in-memory data.

    Moving it to something cheaper, like a general CPU plus RAM/disk, would make it prohibitively slow with the standard methodology.

    How would we be able to change this to run on users' standard hardware? Presuming standard hardware is cheaper, why isn't ChatGPT also running on this cheaper hardware?

    Are there significant downsides to using lesser hardware? Or is this some novel approach?

    Super curious!

    • lairv 445 days ago
      The goal is not (yet?) to run those models on most consumer devices (mobile, old laptops, etc.), but at least to self-host the model on a high-end consumer GPU, which is not possible right now. For now you need multiple specialized GPUs like an Nvidia V100/A100 with a large amount of VRAM; getting such models to run on a single RTX 40-/30-series card would already be an achievement.
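
      For what it's worth, one trick that makes single-GPU self-hosting more plausible is loading the weights at reduced precision. A rough sketch with Hugging Face transformers, assuming the accelerate and bitsandbytes integrations are installed; the model name is just a stand-in, since no Open Assistant model exists yet.

        # Sketch: load a largish causal LM in 8-bit so it fits in the VRAM of
        # a single consumer GPU. Needs transformers + accelerate + bitsandbytes;
        # the model name is a placeholder.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "EleutherAI/gpt-neox-20b"  # stand-in for a future OA model
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(
            name,
            device_map="auto",   # spread layers across available GPU(s)/CPU
            load_in_8bit=True,   # roughly halves VRAM vs fp16, at some quality cost
        )

        prompt = "Explain what Open Assistant is in one sentence."
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        print(tok.decode(model.generate(**inputs, max_new_tokens=60)[0]))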
  • jcq3 445 days ago
    Amazing project, but can it even compete against GPT right now? Usually open source drives innovation that closed source then follows (Linux to Windows), but in this case it's the other way around.
  • swyx 445 days ago
    • i_like_apis 445 days ago
      No, this is a different URL. If you merge or adjust URLs, use this one, which is the face of the project and links to the other.
      • dang 445 days ago
        swyx is correct* - the criterion for dupiness on HN is not the URL, it's whether the story is substantially the same or not. Put differently, do two URLs lead to substantially different or substantially the same discussion?

        * but not to use @dang, which is a no-op. The only way to reliably contact us is hn@ycombinator.com. Someone did that about this, so I'm about to merge the threads.

        • swyx 445 days ago
          haha yeah am aware its a no-op, but figured since it was so high up on the front page (#2 by the time i saw it) that someone else would have already done it before i did or that you'd see this even just casually browsing
          • pvg 444 days ago
            Somewhat counterintuitively, it ends up doing the opposite - people think (like you did and I did, especially if you see other contemporaneous dangcomments) it is or is about to be taken care of so nothing happens. Better to just email.
  • zenosmosis 445 days ago
    Cool project.

    One thing I noticed about the website, however, is it is written using Next and doesn't work w/ JavaScript turned off in the browser. I thought that Next was geared for server-side rendered React where you could turn off JS in the browser.

    Seems like this would improve the SEO factor, and in doing so, might help spread the word more.

    https://github.com/LAION-AI/laion.ai

    • MarvinYork 445 days ago
      2023 — turns off JS…
      • zenosmosis 445 days ago
        Yes, I have a browser extension to turn off JS to see how a site will render with it turned off.

        And I do most of my coding w/ React / JS, so I fail to see your point.

  • xivzgrev 445 days ago
    I’m amazed this was released within a few months of chatgpt. always funny how innovation clusters together.
    • coolspot 445 days ago
      It was started after the success of ChatGPT and based on their method.
  • amelius 445 days ago
    We definitely need a way to rate these systems so we can have better expectations.

    An IQ test for language models?

    • pixl97 445 days ago
      The problem with IQ in human modeling is 1) it's just a one-dimensional number, and 2) it changes as the average human gets smarter or dumber.

      However we rate these systems in the future, we must not repeat the mistakes of the past and think one-number solutions are good for anything.

      For example, you can have an exceptionally 'intelligent' system that is misaligned with human intention.

    • oldstrangers 445 days ago
      Some sort of general knowledge skills assessment, grade them on accuracy. Questions / tasks get increasingly more abstract until they become almost subjective.
      • KRAKRISMOTT 445 days ago
        GMAT? SATs? A ML flavored jeopardy test?
      • permo-w 445 days ago
        seems like a fool's errand
    • k__ 445 days ago
      Isn't IQ just the size of short-term memory plus processing speed?
      • amelius 445 days ago
        No because that would mean that anyone with lots of time and a notepad could become (a slow version of) Einstein.
        • mtlmtlmtlmtl 445 days ago
          I'm not convinced they couldn't. Depends what you mean by Einstein. You won't be formulating GR, but an IQ test could be doable.

          At least if you have enough IQ to figure out how to solve IQ test problems on paper. Which shouldn't be that hard.

        • prettyStandard 445 days ago
          IQ tests are timed. Not everyone could be a slow Einstein, but perhaps, if you had 200-300 years, you might reach the same solutions Einstein did, if you chose to work on the same problems.
          • akomtu 445 days ago
            If you are a 5' tall basketball amateur, then even with 300 years of training you won't outplay the top NBA player.
            • qup 445 days ago
              It's different because to outplay the top NBA player you can't do it slowly. (You can compute slowly, though)
        • k__ 445 days ago
          Maybe, the correlation isn't linear. Or Einstein USP wasn't just his IQ.
      • tux3 445 days ago
        It is more like the ability to make sense of things.

        But intelligence is hard to measure. Always plenty of room for everyone to disagree.

        • k__ 445 days ago
          I see.

          I just heard about a test with a box with lights and buttons, where pressing the buttons faster correlated with higher IQ.

          • grugagag 445 days ago
            What you’re describing sounds like a way to measure reaction time. By that measure I suspect gamers would rank highest.
            • k__ 445 days ago
              It was set up so that you had to work out the right button to press after the lights lit up.

              And that time was correlated with IQ.

  • xrd 445 days ago
    It sounds like you can train this assistant on your own corpus of data. Am I right? What are the hardware and time requirements for that? The readme sounds a bit futuristic, has anyone actually used this, or is this just the vision of what's to come?
    • simonw 445 days ago
      Somewhat unintuitively, it looks like training a language model on your own data usually doesn't do what people think it will do.

      The usual desire is to be able to ask questions of your own data, and it would seem obvious that the way to do that would be to fine-tune an existing model with that extra information.

      There's actually an easier (and potentially more effective?) way of achieving this: first run a search against your own data to find relevant information, then glue that together into a prompt along with the user's question and feed that to an existing language model (roughly as in the sketch below).

      I wrote about one way of building that here: https://simonwillison.net/2023/Jan/13/semantic-search-answer...

      Open Assistant will hopefully result in a language model we can run on our own hardware (though it may be a few years before it's feasible to do that affordably - language models are much heavier than image models like Stable Diffusion). So it can form part of this approach, even without training the model on our own custom data.
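
      A minimal sketch of that search-then-prompt pattern; search_documents and complete are placeholders for whatever index and model you actually use, not a real API.

        # Retrieval-augmented prompting: search your own data first, then hand
        # the most relevant snippets plus the question to a language model.
        # Both helper functions are placeholders.
        def search_documents(query: str, top_k: int = 3) -> list:
            """Placeholder search over your own notes/emails/docs (full-text
            or embedding similarity, as in the linked post)."""
            raise NotImplementedError

        def complete(prompt: str) -> str:
            """Placeholder call to a language model, hosted or local."""
            raise NotImplementedError

        def answer_question(question: str) -> str:
            passages = search_documents(question)     # 1. retrieve
            context = "\n\n".join(passages)           # 2. glue into a prompt
            prompt = ("Answer the question using only the context below.\n\n"
                      f"Context:\n{context}\n\n"
                      f"Question: {question}\nAnswer:")
            return complete(prompt)                   # 3. let the LM answer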

    • chriskanan 445 days ago
      The current effort is to get the data required to train a system and they have created all the needed tools to get that data. Then, based on my understanding, they intend to release the dataset and to release pre-trained models that could run on commodity hardware, similar to what was done with Stable Diffusion.
  • funerr 445 days ago
    Is there a way to donate to this project?
  • d0100 445 days ago
    Can these ChatGPT like systems trace their answers back to the source material?

    To me this seems like the missing link to make Google search and the like dead

    • jamilton 445 days ago
      No, they can't. If you ask ChatGPT in particular, you get a "I am an AI model and can't provide specific citations" response. In general, they just don't have that information.
      • csomar 444 days ago
        But maybe Google can turn to a probabilistic engine that links a certain AI response to certain potential pages?
    • csomar 444 days ago
      Not really. The information is lost in a complicated way that we don’t really understand very well.
  • winddude 445 days ago
    I'd be interested in helping, but the organisation is a bit of a cluster fuck.
    • pqdbr 445 days ago
      Would you care to add some context, or are you just throwing stones for no reason at all?
      • winddude 440 days ago
        It just seems like a bit of a cluster fuck: I have no idea who's leading what part or who to talk to. Everything seemed a bit all over the place; I looked on Discord, and some people seemed to want all the data in the world, while others thought it should be a small foundational model.

        I took a look at the annotator frontend the other day. Yikes, it takes way too long to annotate, and there isn't enough clarity on how to annotate. Sure, if you get 10 people to annotate each task you can average the results, but will you get that many people? And you're calling it data collection; that's not data collection, that's data annotation.

        > Collect high-quality human generated Instruction-Fulfillment samples (prompt + response), goal >50k.

        Okay, why not use some of the existing models to create some of these samples, and train on them?

        I think you need:

        - an architecture plan for information retrieval and search intent

        - a better, faster-to-use annotation tool

        - clarity on what data you actually want to collect: only those 50k? or do you need to train a foundational model, or use an existing one?

        What about looking at what's already been done, like BlenderBot, LangChain, etc.? I love building stuff from scratch, but... at least do some analysis of the issues and problems, and of why this method will work.

        And also, I do love building stuff from the ground up

  • yazzku 445 days ago
    What's the tl;dr on the Apache license? Is there any guarantee that our data and labelling contributions will remain open?
  • O__________O 445 days ago
    TLDR: OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.

    ________

    Related video by one of the contributors on how to help:

    - https://youtube.com/watch?v=64Izfm24FKA

    Source Code:

    - https://github.com/LAION-AI/Open-Assistant

    Roadmap:

    - https://docs.google.com/presentation/d/1n7IrAOVOqwdYgiYrXc8S...

    How you can help / contribute:

    - https://github.com/LAION-AI/Open-Assistant#how-can-you-help

  • O__________O 445 days ago
    [dupe]
    • dang 445 days ago
      Please don't copy/paste comments into other threads. It lowers the signal/noise ratio and makes merging the threads a pain.

      https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

      • O__________O 444 days ago
        Understand, though duplicates were on home page for 6-8 hours so didn’t see an issue, especially given comment is a resource link comment. Feel free to kill/delete one of them, I don’t care about the rep.
        • dang 444 days ago
          > duplicates were on home page for 6-8 hours so didn’t see an issue

          That was the issue! 6-8 hours is an extremely long time for duplicates to be on the front page. Alas, we don't see everything.

          In the future, the best thing is to drop us a note at hn@ycombinator.com when you see these things. Another user did that and that's why we belatedly merged the threads.

  • NayamAmarshe 445 days ago
    FOSS is the future!
  • jdarchitect 445 days ago
    [flagged]
  • residualmind 445 days ago
    and so it begins...
  • bilater 445 days ago
    Used a Tailwind UI Template. Bullish.
    • bilater 442 days ago
      Jeez, this was meant to be a joke... I think this project is awesome, and the fact they used a Tailwind UI is indicative of good decision making.
  • pxoe 445 days ago
    that same laion that scraped the web for images, ignored their licenses and copyrights, and thought that'd do just fine? the one that chose to not implement systems that would detect licenses, and to not have license fields in their datasets? the one that knowingly points to copyrighted works in their datasets, yet also pretends like they're not doing anything at all? that same group?

    really trustworthy.

    • losvedir 445 days ago
      Yes, and it's up to each of us to decide how we feel about that. I personally don't think I have a problem with it, but then I've always been somewhat opposed to software patents and other IP protections.

      I mean, the whole reason we have those laws is the belief that it encourages innovation. I can believe it does to some extent, but on the other hand, all these AI models are pretty innovative, too, so the opportunity cost of not allowing it is pretty high.

      I don't think it's a given that slurping up IP like this is ethically or pragmatically wrong.

    • seydor 445 days ago
      The alternatives are? The company that scrapes the web for a living, or the one that scrapes GitHub for a living?
      • pxoe 445 days ago
        you're forgetting one important alternative: to just not use and/or not do something. nobody asked them to scrape anything. nobody asked them to scrape copyrighted works. they could've just not done the shady thing, but they made that choice to do it, all by themselves. and one can just avoid using something with questionable data ethics and practices.

        they clearly show through their actions that they think they can do anything with any data that's out there, and put it all out. why anyone would entrust them or their systems with their own data to 'assist' with, I don't really get.

        and even though it's an 'open source' project, that part may be just soliciting people to do work for them, to help them enable their own data collection. it's gonna run somewhere, after all. in the cloud, with monetized compute, just like any other AI project out there.

        • pixl97 445 days ago
          I personally see your view on this as a complete and total failure to understand how humans and society/culture actually work.

          Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general, the limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being, I could likely slam you with violation after violation where you took stylistic ideas from copyrighted work.

          RMS covers this rather well in 'The Right to Read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large, monied entities. It's much easier for Google to pay some artist for data that goes into an AI model. Because the 'Google AI' model is then more culturally complete than models that cannot see this data, Google entrenches a stronger monopoly in the market, generating more money with which to outright buy ideas and further monopolize the market.

          • pxoe 443 days ago
            You bring up the human mind as if that would somehow explain, excuse, or absolve how broken and poorly planned these AI systems are. Unlike minds, memories, or thoughts, they are very real and operate on data that is precise and definite at every point of their operation. Minds don't create a definitive list of every work they encounter in precise detail; they don't build models that incorporate all of those works into a single, comprehensively accessible entity; and they don't enable thousands or millions of others to use those models directly to create ever more artifacts. All of these things are definite, tangible, transferable data. Minds are none of these things. But it doesn't matter what minds are, because this is about AI, which exists and can and should be examined on its own merits, without sliding the conversation into "well, what about brains". It's not about brains; it's about existing AI systems.

            Licenses aren't limited to being "only monetary", i.e. "pay me to use this, otherwise don't". Some licenses exist to let things be distributed freely while still protecting attribution and guarding against misuse (just look at the CC licenses). It would be nice if those were respected, but they aren't, because there's no mechanism built in to discern the licenses; they don't care. This isn't just an attack on "big bad commercial entities that hold copyrights", it's an attack on people who try to protect works they give away for free from misuse, and on people who naively put their work out there. Yes, they may be naive in not choosing licenses (which, as we see, wouldn't protect them against scraping that ignores licenses completely anyway), but they can end up being exploited all the same, and they don't deserve to be victim-blamed when it's very clear who the perpetrator of the exploitation is, with a tangible data trail (a direct, definite presence in the datasets).

            The people who knowingly built systems that ignore all copyrights and all licensing really aren't the "good guys battling corporatist copyright systems", even though they'd very much like you to believe that as they try so desperately to avoid being grilled on copyright issues.

            The "standing up to capitalism, monopolies, etc." framing is dysfunctional in itself, because the resulting AI systems are very monetizable and are monetized. SD, despite putting on airs about "combating monopolies" (in tech, in research), has spread far and wide and is now used in a myriad of projects with varying commercialization, to the point where they are the ones who should be questioned about whether they're a monopoly in image-generation algorithms themselves. "But it's free!" Yes, that's how things spread, and then they upsell you on compute, limited access, or hot new algorithms as they dominate the market. They are perpetuating the same flavors of capitalism and monopolism, making the same "capture the market" moves (offer the product for free, upsell "premium features and upgrades", aggressively undercut and displace existing players, etc.). The "hot and new" companies are truly no better. You can't give Google the side-eye for offering a free product and capturing markets while turning a blind eye to SD doing the same.

            Ask RMS directly what he'd think of blatantly ignoring licenses, and whether he'd give his blessing to the continued operation of systems that pretend licenses just don't exist, instead of using a 25-year-old story as some kind of cover or excuse, as if it were ancient scripture.

        • seydor 445 days ago
          It would be interesting to extend this criticism to the entire tech ecosystem, which has been built on unsolicited scraping, including many of the companies funding the company that hosts this very forum. We'd grind to a complete halt.

          Considering the benefit of a model that can be downloaded, and hopefully run on-premise one day, I don't care too much about their copyright practices being imperfect, especially in this industry.

        • riskpreneurship 445 days ago
          You can only keep a genie bottled up for so long, and if you don't rub the lamp, your adversaries will.

          With something as potentially destabilizing as AGI, realpolitik will convince individual nations to put aside concerns like IP and copyright out of FOMO.

          The same thing happened with nuclear bombs: it's much easier to be South Africa choosing to dispose of them if you end up not needing them than to be North Korea or Iran trying to join the club late.

          The real problem is that the gains from any successes will be hoarded by the people who acquired them by breaking the law.

    • SamPatt 445 days ago
      Building something interesting without being too worried about licensing and copyright is a positive signal in my mind.

      You can clutch your pearls all you like but this is just the digital version of what humans have been doing forever. If information is accessible to the public then it will be accessed, that's how we work.

  • AstixAndBelix 445 days ago
    It's funny because the moment this is available to run on your machine you realize how useless it is. It might be fun to test its conversational limits, but only Siri can actually set an alarm or a timer or run a shortcut, while this thing can only blabber
    • ajot 445 days ago
      Can you run Siri outside of iOS? Can you work on it? FLOSS can help there, I could run this locally on a RasPi or old laptop if I want
      • AstixAndBelix 445 days ago
        This is not a deterministic assistant like Siri; it's a ChatGPT-style conversational tool that might act up if you ask it to do anything.
    • A4ET8a8uTh0 445 days ago
      I don't want to sound dismissive, but 3rd party integration is part of the roadmap and any project has to start somewhere. I will admit I am kinda excited to have an alternative to commercial options.
    • hgsgm 445 days ago
      It's pretty bad at baking a cake too.

      It's a chatbot, not a home automation controller. It's a research&writing assistant, not an executive assistant.

      • AstixAndBelix 445 days ago
        How can it be a research assistant if it keeps making up stuff?
        • pixl97 445 days ago
          How can humans be research assistants if they make shit up all the time?
          • AstixAndBelix 445 days ago
            If I tasked an assistant with providing 10 papers and 8 of them turned out to be made up, they would be fired instantly. Unless someone wants to actively scam you, they will always provide 10 real results. Some of them might not be completely on topic, but at least they would not be made up.
    • traverseda 445 days ago
      I don't see why you couldn't integrate this kind of thing with some kind of command line, letting it integrate with arbitrary services.
      • AstixAndBelix 445 days ago
        it's not deterministic, I don't want it to interpret the same command with <100% accuracy
        • traverseda 445 days ago
          It's deterministic. They throw in a random seed with online services like ChatGPT.

          If it weren't deterministic for some reason, that wouldn't be because it's magic; it would be because of hardware timing issues sneaking in (the same reason source code builds can be non-reproducible), and it could be solved by ordering the results of parallel computation that doesn't have a guaranteed order. (A quick local check is sketched below.)

          To the best of my knowledge it's not a problem though.
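
          This is easy to check locally with a small model; a sketch using gpt2 purely as a stand-in, assuming the torch and transformers libraries and the same machine for both runs.

            # Greedy decoding is deterministic: same weights + same prompt give
            # the same tokens. Sampling is reproducible if you fix the seed.
            import torch
            from transformers import AutoModelForCausalLM, AutoTokenizer

            name = "gpt2"  # small stand-in model
            tok = AutoTokenizer.from_pretrained(name)
            model = AutoModelForCausalLM.from_pretrained(name)
            inputs = tok("Set a timer for ten minutes", return_tensors="pt")

            out1 = model.generate(**inputs, max_new_tokens=20, do_sample=False)
            out2 = model.generate(**inputs, max_new_tokens=20, do_sample=False)
            assert torch.equal(out1, out2)   # greedy decoding: identical output

            torch.manual_seed(0)
            s1 = model.generate(**inputs, max_new_tokens=20, do_sample=True)
            torch.manual_seed(0)
            s2 = model.generate(**inputs, max_new_tokens=20, do_sample=True)
            assert torch.equal(s1, s2)       # sampling: reproducible with fixed seed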

        • pixl97 445 days ago
          Are humans deterministic? Hell, I wish my plain old normal digital computer was 100% deterministic, but it ain't due to any number of factors from bugs and state logic errors all the way to issues occurring near the quantum level.

          You're setting the goal so high it is not reachable by anything.

        • qup 445 days ago
          I'm already doing this. I currently only accept a subset of possible commands.

          The accuracy is a problem, but I think it's my prompting. I'm sure I can improve it by walking it through the steps or something.

          You can also just work in human approval to run any commands.

    • turnsout 445 days ago
      To be fair, Siri’s success rate at setting an alarm is about 3/10 in my household. Let’s give open source a chance here
  • consumer451 445 days ago
    I was very excited about Stable Diffusion, and I still am. A great yet relatively harmless contribution.

    LLMs however, not so much. The avenues of misuse are just too great.

    I started this whole thing somewhat railing against the un-openness of OpenAI. But once I began using ChatGPT, I realized that having centralized control of a tool like this in the hands of reasonable people is not the worst possible outcome for civilization.

    While I support FOSS in most realms, in some I do not. Reality has taught me to stop being rigidly religious about these things. Just because something is freely available does not magically make it "good."

    In the interest of curiosity and discussion, can someone give me some actual real-world examples of what a FOSS ChatGPT will enable that OpenAI's tool will not? And, please be specific, not just "no censorship." Please give examples of that censorship.

    • leaving 445 days ago
      It genuinely astonishes me that you think that "centralized control" of anything can be beneficial to the human species or the world in general.

      Centralized control hasn't stopped us from killing off half the animal species in fifty years, wiping out most of the insects, or turning the oceans into a trash heap.

      In fact, centralized control is the author of our destruction. We are all dead people walking.

      Why not try "individualized intelligence" as an alternative? Give truly good-quality universal education and encouragement of individual curiosity and independent thought a try?

      It can't be worse.

      • consumer451 445 days ago
        > It genuinely astonishes me that you think that "centralized contol" of anything can be beneficial to the human species or the world in general.

        I am genuinely astonished that in the face of obvious examples such as nuclear weapons, people cannot see the opposite in some cases.

        > It can't be worse.

        It can always be worse.

        Would a theoretical FOSS small yield nuclear weapon make the world a better place?

        How about a FOSS powered sub-$10k hardware budget CRISPR virus lab? Well, it's FOSS, so it must be good?

        • yazzku 445 days ago
          Microsoft is not "reasonable people". Having this behind closed corporate walls is the worst possible outcome.

          The nuclear example isn't really a counter-argument. If only one nation had access to them, every other nation would automatically be subjugated to them. If the nuclear balance works, it's because multiple super powers have access to those weapons and international treaties regulate their use (as much as North Korea likes to demo practice rounds on state TV.) Also the technology isn't secret; it's access to resources and again, international treaties, that prevent its proliferation.

          Same thing with CRISPR. Again, there are scientific standards that regulate its use. It being open or not doesn't really matter to its proliferation.

          I agree there are cases where being open is not necessarily the best strategy. I don't think your examples are particularly good, though.

          • consumer451 445 days ago
            I think we may have very different definitions of the word reasonable.

            I mean it in the classic sense.[0]

            Do I love corporate hegemony? Heck no.

            Could there be less reasonable stewards of extremely powerful tools? Heck yes.

            An example might be a group of people who are so blinded by ideology that they would work to create tools which 100x the work of grifters and propagandists, and then say... hey, not my problem, I was just following my pure ideology bro.

            A basic example of being reasonable might be revoking access to someone running a paypal scam syndicate which sends countless custom tailored and unique emails to paypal users. How would Open Assistant deal with this issue?

            [0]

              1. having sound judgement; fair and sensible.
                based on good sense.
            
              2. as much as is appropriate or fair; moderate.
            • yazzku 445 days ago
              > and then say... hey, not my problem, I was just following my pure ideology bro.

              That's basically the definition of Google and Facebook, which go about their business taking no responsibility for the damage they cause. As for Microsoft, 'fair' and 'moderate' are not exactly their brand either considering their history of failed and successful attempts to brutally squash competition. If you're saying that they'd be fair in censoring the "right" content, then you're just saying you share their bias.

              > A basic example of being reasonable might be revoking access to someone running a paypal scam syndicate which sends countless custom tailored and unique emails to paypal users. How would Open Assistant deal with this issue?

              I'm not exactly sure how Open Assistant would deal, or if it even needs to deal, with this. You'd send the cops and send those motherfuckers back to the hellhole that spawned them. Scams are illegal regardless of what tools you use to go about it. If it's not Open Assistant, the scammers will find something else.

              Your argument is basically that we should ban/moderate the proliferation of tools and technology. I'm not sure that's very effective when it comes to software. I think the better strategy is to develop the open alternative fast before society is subjugated to the corporate version, even if it does give the scammers a slight edge in the short term. If you wait for the law to catch up and regulate these companies, it's going to take another 20 years like the GDPR.

              • consumer451 445 days ago
                > Your argument is basically that we should ban/moderate the proliferation of tools and technology. I'm not sure that's very effective when it comes to software.

                No, my argument is that we as individuals shouldn't be in a rush to create free and open tools which will be used for evil, in addition to their beneficial use cases.

                FOSS often takes a lot of individual contributions. People should be really thoughtful about these things now that the implications of their contributions will have much more direct and dire effects on our civilization. This is not PDFjs or Audacity that we are talking about. The stakes are much higher now. Are people really thinking this through?

                If anything, it would be great if we as individuals acted responsibly to avoid major shit shows and the aftermath of government regulation.

                • yazzku 445 days ago
                  Ok, yeah, maybe I'll take my latter statement back. Ideally things are developed at the pace you describe and under the scrutiny of society. There are people thinking this through -- EDRI and a bunch of other organizations -- just probably not corporations like Microsoft. In practice, though, we are likely to see corporations roll out chat-based incarnations of search engines and assistants, followed by an ethical shit show, followed by mild regulation 20 years later.
        • mandmandam 445 days ago
          > I am genuinely astonished that in the face of obvious examples such as nuclear weapons, people cannot see the opposite in some cases.

          You seem to be making some large logical leaps, and jumping to invalid conclusions.

          Try to imagine a way of exerting regulation over virus research and weaponry that wouldn't be "centralized control". If you can't, that's a failure of imagination, not of decentralization.

          • consumer451 445 days ago
            > Try to imagine a way of exerting regulation over virus research and weaponry that wouldn't be "centralized control".

            Since apparently my own imagination is too limited, could you please give me some examples of how this would be accomplished?

            • mandmandam 445 days ago
              Trustless and decentralized systems are a hot topic. Have you read much in the field, to be so certain that centralization is the only way forward?

              There are options you haven't considered, whether you can imagine them or not.

              • consumer451 445 days ago
                > Trustless and decentralized systems are a hot topic.

                Yeah, and how's that working out exactly? Is there any decentralized governance project which also has anything to do with law irl? I know what a DAO is, and it sounds pretty neat, in theory. There are all kinds of theoretical pie in the sky ideas which sound great and have yet to impact anything in reality.

                Before we give the keys to nukes and bioweapons over to a "decentralized authority," maybe we should see some examples of it working outside of the coin-go-up world? Heck, how about some examples of it working even in the coin-go-up world?

                Even pro-decentralized crypto folks see the downsides of DAOs, such as slower decision making.

        • sterlind 445 days ago
          Nuclear weapons are just evil. It'd be better if they didn't exist rather than if they were centralized. We've gotten so close to WWIII.

          As for the CRISPR virus lab, at least the technology being open implies that vaccine development would be democratized as well. Not ideal but.. yeah.

      • f6v 445 days ago
        > Centralized control hasn't stopped

        Because there wasn’t any.

    • sterlind 445 days ago
      > In the interest of curiosity and discussion, can someone give me some actual real-world examples of what a FOSS ChatGPT will enable that OpenAI's tool will not?

      Smut. I've been trying to use ChatGPT to write erotica, but OpenAI has made it downright puritanical. Any conversations involving kink trip its guardrails unless I bypass them.

      Writing fiction that involves bad guys - arsonists, serial killers, etc. You need to ask how to hide a body if you're writing a murder mystery.

      Those are just some examples from my recent work.

      • consumer451 445 days ago
        Thanks, that's a good example. On the balance though, would I be in favor of ML auto-smut if it meant that more people will fall to misinformation in the form of propaganda and financial scams? No, that does not seem like a reasonable trade off to me.

        But you may be interested in this jailbreak while it lasts. I have gotten it to write all kinds of fun things. You will have to rework the jailbreak in the first comment, but I bet it works.

        https://news.ycombinator.com/item?id=34642091

    • visarga 445 days ago
      > Just because something is freely available does not magically make it "good."

      Just because you don't like it doesn't mean an open-source ChatGPT will not appear. It doesn't need everyone's permission to exist. Once we accumulated internet-scale datasets and gigantic supercomputers, GPT-3s immediately started to pop up. It was inevitable. It's an evolutionary process and we won't be able to control it at will.

      Probably the same process happens in every human who gains language faculty and a bit of experience. It's how language "inhabits" humans, carrying with it the work of previous generations. Now language can inhabit AIs as well, and the result is shocking. It's like our own mind staring back at us.

      But it is just natural evolution for language. It found an even more efficient replication device. Now it can contain and replicate the whole culture at once, instead of one human life at a time. By "language" I mean language itself, concepts, methods, science, art, culture and technology, and everything I forgot - the whole "corpus" of human experience recorded in text and media.

      • consumer451 445 days ago
        > It doesn't need everyone's permission to exist.

        Nope it does not. It does need a lot of people's help though and there may be enough out there to do the job in this case.

        Even though I knew this would be a highly unpopular opinion in this thread, I still posted it. Freedom of speech, right?

        The reason I posted it was to maybe give some pause to some people, so that they have a moment to consider the implications. I realize this is likely futile but this is a hill I am willing to die on. That hill being FOSS is not an escape from responsibility and consequences.

        I bet this leads to major regulation, which will suck.

        • pixl97 445 days ago
          First. this is a moderated forum, you have no freedom of speech here, and neither do I.

          Next, regulation solves nothing here, and my guess will make the problems far worse. Why? Lets take nuclear weapons. They are insanely powerful, but they are highly regulated because there are a few choke points mostly in uranium refinement that make monitoring pretty easy at a global scale. The problem with regulating things like GPT is computation looks like computation. It's not sending high energy particles out into space where they can be monitored. Every government on the planet can easily and cheaply (compared to nukes) generate their own GPT models and propaganda weapons and the same goes for multinational corporations. Many countries in the EU may agree to regulate these things, but your dominant countries vying for superpower status aren't going to let their competitors one up each other by shutting down research into different forms of AI.

          I don't think of this as a hill we are going to die on, but instead a hill we may be killed on by our own creations.

    • A4ET8a8uTh0 445 days ago
      << In the interest of curiosity and discussion, can someone give me some actual real-world examples of what a FOSS ChatGPT will enable that OpenAI's tool will not? And, please be specific, not just "no censorship." Please give examples of that censorship.

      I think the best I can go with is that it levels the playing field. This tool is likely already being adopted and adapted across the world by some overly excited people ( I am currently testing for personal use ). Just the idea that one company has access to all those prompts is a nightmare to me, because I am all but certain that some well meaning analyst dumped production data set into it for some relatively benign stuff like "address standardization" or "classification". To me, that shit is scary as fuck, but I know not everyone has the same internal moral compass or even corporate guidance.

      If there is one thing we learned over the past few decades, it is that centralized anything tends to end up being corrupted by powers that be. If information yearns to be free, this is likely the pinnacle of information -- a way for one person to make their own tool and use it as they see fit ( and face appropriate consequences as some will undoubtedly arise ).