R. Szopa: Collecting data matters more than AI skills!
2/6/2019 8:59:17 AM

Your AI skills are worth less than you think


My son and me, image processed using Artistic Style Transfer. This technique sparked my fascination with Deep Learning.

We are in the middle of an AI boom. Machine Learning experts command extraordinary salaries, and investors are happy to open their hearts and checkbooks when meeting AI startups. And rightly so: this is one of those transformational technologies that occur once per generation. The tech is here to stay, and it will change our lives.
That doesn’t mean that making your AI startup succeed is easy. I think there are some important pitfalls ahead of anyone trying to build their business around AI.

The value of your AI skills is declining

In 2015 I was still at Google and started playing with DistBelief (the internal predecessor of what would later become TensorFlow). It sucked. It was painfully awkward to write, and the main abstractions didn’t quite match what you expected. The idea of making it work outside of Google’s build system was a pipe dream.

In late 2016 I was working on a proof of concept to detect breast cancer in histopathological images. I wanted to use transfer learning: take Inception, Google’s best image classification architecture at the time, and retrain it on my cancer data. I would use the weights from a pretrained Inception as provided by Google, just changing the top layers to match what I was doing. After a long time of trial and error in TensorFlow, I finally figured out how to manipulate the different layers, and got it mostly working. It took a lot of perseverance and reading TensorFlow’s sources. At least I didn’t have to worry too much about dependencies, as the TensorFlow people mercifully prepared a Docker image.

In early 2018 the task from above wasn’t suitable for an intern’s first project, due to lack of complexity. Thanks to Keras (a framework on top of TensorFlow) you could do it in just a few lines of Python code, and it required no deep understanding of what you were doing. What was still a bit of a pain was hyperparameter tuning. If you have a Deep Learning model, you can manipulate multiple knobs like the number and size of layers, etc. How to get to the optimal configuration is not trivial, and some intuitive algorithms (like grid search) don’t perform well. You ended up running a lot of experiments, and it felt more like an art than a science.
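To make the tuning problem concrete, here is a minimal sketch of grid search over two knobs. It is not taken from any real project: `validation_score` is a made-up stand-in for "train a model and measure validation accuracy", and the knob names and values are assumptions chosen for illustration.

```python
import itertools


def validation_score(num_layers, layer_size):
    """Hypothetical stand-in for a full training run.

    Peaks at 3 layers of 100 units; a real score would come from
    training and evaluating an actual model.
    """
    return 1.0 - 0.02 * (num_layers - 3) ** 2 - 0.00001 * (layer_size - 100) ** 2


def grid_search(score_fn, grid):
    """Try every combination in the grid.

    Returns (best_params, best_score, number_of_runs).
    """
    names = list(grid)
    best_params, best_score, runs = None, float("-inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        runs += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, runs


# Hypothetical search space: 5 x 4 = 20 combinations.
grid = {"num_layers": [1, 2, 3, 4, 5], "layer_size": [32, 64, 128, 256]}
best_params, best_score, runs = grid_search(validation_score, grid)
print(best_params, round(best_score, 5), runs)
```

Two things stand out even in this toy. The number of runs is the product of the choices per knob, so every extra knob multiplies the cost, and each run is a full training job in practice. And the grid can only find what it contains: the toy function peaks at a layer size of 100, which the grid never tries.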

As I am writing these words (beginning of 2019), Google and Amazon offer services for automatic model tuning (Cloud AutoML, SageMaker), and Microsoft is planning to do so. I predict that manual tuning is going the way of the dodo, and good riddance.

I hope that you see the pattern here. What was hard becomes easy, and you can achieve more while understanding less. Great engineering feats of the past start sounding rather lame, and we shouldn’t expect our present feats to fare better in the future. This is a good thing and a sign of amazing progress. We owe this progress to companies like Google, who are investing heavily in the tools, and then giving them away for free. The reason why they do that is twofold.

Your office after you’ve been commoditized.

First, this is an attempt to commoditize the complement to their actual product, which is cloud infrastructure. In economics, two goods are complementary if you tend to buy them together. Some examples: cars and gasoline, milk and cereal, bacon and eggs. If the price of one of the complements goes down, the demand for the other will go up. The complement to the cloud is the software that runs on top of it, and AI stuff also has the nice property that it requires a lot of computational resources. So, it makes a lot of sense to make its development as cheap as possible.

The second reason why Google in particular is so gung ho on AI is that they have a clear comparative advantage with respect to Amazon and Microsoft. They started earlier, and it was them who popularized the concept of Deep Learning after all, so they managed to snatch a lot of the talent. They have more experience in developing AI products, and this gives them an edge when it comes to developing tools and services necessary for them.

As exciting as the progress is, it’s bad news for both companies and individuals who have invested heavily in AI skills. Today, those skills give you a solid competitive advantage, as training a competent ML engineer requires plenty of time spent reading papers, and a solid math background to start with. However, as the tools get better, this won’t be the case anymore. It’ll become more about reading tutorials than scientific papers. If you don’t realize your advantage soon, a band of interns with a library may eat your lunch. Especially if the interns have better data, which brings us to my next point…

Data is more important than fancy AI architectures

Let’s say that you have two AI startup founders, Alice and Bob. Their companies raised around the same amount of money, and are fiercely competing over the same market. Alice invests in the best engineers, PhDs with a good track record in AI research. Bob hires mediocre but competent engineers, and invests her (“Bob” is short for Roberta!) money in securing better data. On which company would you bet your money?

My money would be squarely on Bob. Why? At its essence, machine learning works by extracting information from a dataset and transferring it to the model weights. A better model is more efficient at this process (in terms of time and/or overall quality), but assuming some baseline of adequacy (that is, the model is actually learning something), better data will trump a better architecture.

To illustrate this point, let’s have a quick and dirty test. I created two simple convolutional networks, a “better” one, and a “worse” one. The final dense layer of the better model had 128 neurons, while the worse one had to make do with only 64. I trained them on subsets of the MNIST dataset of increasing size, and plotted the models’ accuracy on the test set vs the number of samples they were trained on.


Blue is the “better” model, green the “worse” model.

The positive effect of training dataset size is obvious (at least until the models start to overfit and accuracy plateaus). My “better” model, blue line, clearly outperforms the “worse” model, green line. However, what I want to point out is that the accuracy of the “worse” model trained on 40 thousand samples is better than that of the “better” model at 30 thousand samples!
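The protocol behind this kind of plot (train each model on growing subsets, score every run on the same test set) can be sketched without any deep learning library. What follows is not the code behind the MNIST experiment: the synthetic 2-D dataset, the 1-nearest-neighbour stand-in models, and all function names are assumptions made so the sketch runs anywhere. Here the “worse” model is simply one that sees fewer input features.

```python
import random


def make_dataset(n, rng):
    """Synthetic stand-in for MNIST: 2-D points, label 1 when x + y > 1."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [((x, y), int(x + y > 1.0)) for x, y in points]


def nearest_neighbor_model(features):
    """Build a fit function for a 1-nearest-neighbour classifier.

    The classifier only looks at the given feature indices, so passing
    fewer indices gives a deliberately weaker model.
    """
    def fit(train):
        def predict(point):
            def squared_distance(example):
                return sum((example[0][i] - point[i]) ** 2 for i in features)
            return min(train, key=squared_distance)[1]
        return predict
    return fit


def learning_curve(fit, train, test, sizes):
    """Train on the first n examples for each n; return test accuracy per n."""
    curve = []
    for n in sizes:
        predict = fit(train[:n])
        correct = sum(predict(point) == label for point, label in test)
        curve.append(correct / len(test))
    return curve


rng = random.Random(0)  # fixed seed so repeated runs match
train, test = make_dataset(400, rng), make_dataset(200, rng)
sizes = [25, 50, 100, 200, 400]
models = {
    "better": nearest_neighbor_model((0, 1)),  # sees both features
    "worse": nearest_neighbor_model((0,)),     # sees only the first feature
}
curves = {name: learning_curve(fit, train, test, sizes)
          for name, fit in models.items()}
for name, curve in curves.items():
    print(name, [round(accuracy, 3) for accuracy in curve])
```

The part worth keeping is `learning_curve`: any function that takes training examples and returns a predictor can be plugged in, so two real networks could be compared the same way by wrapping their training code in a `fit` function.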

In my toy example we are dealing with a relatively simple problem, and we have a comprehensive dataset. In real life, we usually don’t have such a luxury. In many cases, you never escape the part of the graph in which increasing the dataset has such a dramatic effect.

What is more, Alice’s engineers in fact aren’t competing just with Bob’s people. Due to the open culture of the AI community and its emphasis on knowledge sharing, they are also competing with researchers at Google, Facebook, Microsoft, and thousands of universities around the world. Taking the best performing architecture currently described in the literature and retraining it on your own data is a battle-tested strategy if your goal is to solve a problem (as opposed to making an original contribution to science). If there’s nothing really good available right now, it’s often a matter of waiting a quarter or two until someone comes up with a solution. This is especially true since you can do things like host a Kaggle competition to incentivize researchers to look into your particular problem.

Good engineering is always important, but if you are doing AI, the data is what creates the competitive advantage. The billion dollar question, however, is whether you are going to be able to maintain your advantage.

In AI, maintaining your competitive advantage is hard

With her superior dataset Bob successfully competes with Alice, and she’s doing great. She launches her product and is steadily gaining market share. She can even start hiring better engineers, as the word on the street is that her company is the place to be.


Photo by Alex Holyoake on Unsplash

Chuck, a competitor who entered the market later, has some catching up to do, but he has a lot more money than Bob. This matters when it comes to building the dataset. It’s very hard to accelerate an engineering project by throwing money at it. In fact, assigning too many new people can hinder development. Creating a dataset, however, is a different sort of problem. Usually, it requires a lot of manual human labor, and you can easily scale it by hiring more people. Or it could be that someone has the data, and then all you have to do is pay for a license. In any case, the money makes it go a lot faster.

Why was Chuck able to raise more money than Bob?

When a founder raises a round, they are trying to balance two objectives potentially at odds with each other. They need to raise enough money to be able to win. But they cannot raise too much money, because that will lead to excessive dilution. Taking an external investor means selling part of the company. The founding team must maintain a high enough stake in the startup, lest they lose their motivation (running a startup is a hard job!).

The investors, on the other hand, want to invest in ideas that have a huge potential upside, but they must control for risk. As the perceived risk increases, they will ask for a bigger chunk of the company for every dollar they pay.
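The trade-off between money raised and dilution comes down to simple arithmetic. The sketch below uses the standard pre-money/post-money convention, which the text above does not spell out, and all the numbers are hypothetical. Higher perceived risk is modeled as a lower pre-money valuation for the same check.

```python
def founder_stake_after_round(stake_before, investment, pre_money_valuation):
    """Fraction of the company the founders hold after a funding round.

    Investors receive investment / post_money of the company, and every
    existing shareholder is diluted by that same fraction.
    """
    post_money = pre_money_valuation + investment
    investor_share = investment / post_money
    return stake_before * (1.0 - investor_share)


# Hypothetical: founders hold 80% and raise $2M in both scenarios.
low_risk = founder_stake_after_round(0.80, 2_000_000, 8_000_000)
high_risk = founder_stake_after_round(0.80, 2_000_000, 3_000_000)
print(f"low perceived risk:  founders keep {low_risk:.0%}")
print(f"high perceived risk: founders keep {high_risk:.0%}")
```

With these made-up numbers, the same $2M costs the founders 16 percentage points of the company at the higher valuation and 32 at the lower one. That is the sense in which a founder raising against high perceived risk pays more for every dollar.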

When Bob was raising money, it was a leap of faith that AI could actually help with her product. Regardless of her qualities as a founder, or how good her team was, it wasn’t out of the question that the problem she had been attacking was simply intractable. Chuck’s situation is very different. He knows the problem is tractable: Bob’s product is the living proof!

One of Bob’s potential responses to this challenge is raising another round. She should be in a good position for that, as (for the time being) she is still leading in the race. However, the situation may be more complicated. What if Chuck can secure access to the data through a strategic relationship? For example, imagine we are talking about a cancer diagnosis startup. Chuck could use his insider position at an important medical institution and secure a sweetheart deal with said institution. It could well be impossible for Bob to match that.


Your product should be defensible, ideally by having a deep moat.

So, how would you go about building a maintainable competitive advantage for an AI product? Some time ago I had the pleasure of talking to Antonio Criminisi from Microsoft Research. His idea is that the project’s secret sauce shouldn’t consist only of AI. For example, his InnerEye project uses AI and classical (not ML based) computer vision to analyze radiological images. To some extent, this may be at odds with why you are doing an AI startup in the first place. The ability to just throw data at a model and see it work is incredibly attractive. However, a traditional software component, the sort that requires programmers to think about algorithms and utilize some hard-won domain knowledge, is much more difficult to reproduce.

AI is best used like a lever

One way of categorizing something in a business is whether it adds value directly, or provides leverage to some other value source. Let’s take an ecommerce company as an example. If you created a new product line, you added value directly. There was nothing, now there are widgets, and the customers can pay for them. Establishing a new distribution channel, on the other hand, is a lever. By starting to sell your widgets on Amazon, you can double your sales volume. Cutting costs is also leverage. If you negotiate a better deal with the Chinese widget supplier, you can double your gross margin.

Levers have the potential to move the needle further than direct force application. However, a lever only works when it’s coupled with a direct value source. A minuscule number doesn’t stop being small if you double or triple it. If you have no widgets to sell, getting a new distribution channel is a waste of time.

How should we look at AI in this context? There are plenty of companies that try to make AI their direct product (APIs for image recognition and the like). This can be very tempting if you are an AI expert. However, this is a singularly bad idea. First, you are competing with companies like Google and Amazon. Second, making a generic AI product that is genuinely useful is crazy hard. For example, I have always wanted to use Google’s Vision API. Unfortunately, we never ran into a customer whose needs would be adequately matched by the offering. It was always too much, or not enough, and custom development was preferable to the exercise of fitting a square peg in a round hole.

A much better option is to treat AI as a lever. You can take an existing, working business model and supercharge it with AI. For example, if you have a process which depends on human cognitive labor, automating it away will do wonders for your gross margins. Some examples I can think of are ECG analysis, industrial inspection, satellite image analysis. What is also exciting here is that, because AI stays in the backend, you have some non-AI options to build and maintain your competitive advantage.


AI is a truly transformational technology. However, basing your startup on it is a tricky business. You shouldn’t rely solely on your AI skills, as they are depreciating due to larger market trends. Building AI models may be very interesting, but what really matters is having better data than the competition. Maintaining a competitive advantage is hard, especially if you encounter a competitor who is richer than you, which is very likely to happen if your AI idea takes off. You should aim to create a scalable data collection process that is hard for your competition to reproduce. AI is well suited to disrupt industries which rely on the cognitive work of low-skilled workers, as it allows this work to be automated.


Copyright © 2019 ScholarsUpdate.com. All Rights Reserved.